# Word Count Algorithm

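Word count is the classic MapReduce example. This notebook implements it in PySpark, first with the RDD API and then with the DataFrame API.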

## Prepare Input Data

First, let's create some sample text data:

In [None]:
# Create sample data
data = [
    "crazy crazy fox jumped",
    "crazy fox jumped", 
    "fox is fast",
    "fox is smart",
    "dog is smart"
]

# Write to file
with open("data.txt", "w") as f:
    for line in data:
        f.write(line + "\n")

print("Sample data created:")
with open("data.txt", "r") as f:
    print(f.read())

## Word Count Implementation

Now let's implement the classic word count algorithm:

In [None]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

# Read text file
text_file = spark.sparkContext.textFile("data.txt")

# Word count implementation
counts = text_file \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Collect and display results
results = counts.collect()
print("Word Count Results:")
for word, count in sorted(results):
    print(f"{word}: {count}")
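
One caveat about `line.split(" ")`: it splits on every single space, so doubled or trailing spaces produce empty-string tokens that would be counted as a "word". The sample data has no such lines, but real text often does. The plain-Python sketch below uses a made-up input line (not part of the sample data) to show the difference; swapping in `line.split()` inside `flatMap` is one possible fix, assuming whitespace-delimited words are what you want.

```python
# Hypothetical input with a doubled space and a trailing space
line = "crazy  fox jumped "

# Splitting on a single space keeps empty strings between adjacent spaces
tokens_single_space = line.split(" ")

# Splitting with no argument collapses runs of whitespace and trims the ends
tokens_whitespace = line.split()

print(tokens_single_space)  # ['crazy', '', 'fox', 'jumped', '']
print(tokens_whitespace)    # ['crazy', 'fox', 'jumped']
```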

## Alternative Implementation

The same count using the DataFrame API, for comparison:

In [None]:
from pyspark.sql.functions import split, explode, col

# Read as DataFrame
df = spark.read.text("data.txt")

# Word count using DataFrame API
word_counts_df = df \
    .select(split(col("value"), " ").alias("words")) \
    .select(explode(col("words")).alias("word")) \
    .groupBy("word") \
    .count() \
    .orderBy("word")

print("DataFrame Word Count Results:")
word_counts_df.show()
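
Both implementations should agree on the counts. As a quick sanity check that does not need Spark, the sketch below recomputes the expected result for the same five sample lines with `collections.Counter`. The variable name `expected` is only illustrative.

```python
from collections import Counter

# Same five lines as the sample data written to data.txt
data = [
    "crazy crazy fox jumped",
    "crazy fox jumped",
    "fox is fast",
    "fox is smart",
    "dog is smart",
]

# Count every space-separated token across all lines
expected = Counter(word for line in data for word in line.split(" "))

for word, count in sorted(expected.items()):
    print(f"{word}: {count}")
# crazy: 3
# dog: 1
# fast: 1
# fox: 4
# is: 3
# jumped: 2
# smart: 2
```

If the RDD or DataFrame output differs from this, the usual suspects are stray whitespace in the input or a different split pattern.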

## Key Concepts

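- **flatMap**: Split each line into words and flatten the per-line lists into a single RDD of words
- **map**: Transform each word into a `(word, 1)` pair
- **reduceByKey**: Sum the values of all pairs that share the same word
- **collect**: Return all results to the driver as a Python list (on large datasets this can exhaust driver memory, so prefer `take(n)` or writing the results to storage)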
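
To see what each transformation produces, the steps above can be mimicked on a local list in plain Python. This is only an illustrative sketch: `reduce_by_key` is a hypothetical helper written for this example, not a Spark API, and it runs on a single machine with none of Spark's partitioning or shuffling.

```python
from functools import reduce
from itertools import groupby

lines = ["fox is fast", "fox is smart"]  # two lines from the sample data

# flatMap: each line yields several words, flattened into one list
words = [word for line in lines for word in line.split(" ")]

# map: each word becomes a (word, 1) pair
pairs = [(word, 1) for word in words]


def reduce_by_key(func, key_value_pairs):
    """Hypothetical local stand-in for RDD.reduceByKey."""
    # Sort so that equal keys are adjacent, as groupby requires
    ordered = sorted(key_value_pairs, key=lambda kv: kv[0])
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


# reduceByKey: fold the values of each key with a + b
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(words)   # ['fox', 'is', 'fast', 'fox', 'is', 'smart']
print(counts)  # [('fast', 1), ('fox', 2), ('is', 2), ('smart', 1)]
```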