# 📝 Word Count Algorithm

Classic MapReduce word count implementation in PySpark - the "Hello World" of big data processing.

## 🎯 Overview

Word count is the quintessential big data algorithm that demonstrates:

- ✅ **Distributed processing** across multiple nodes
- ✅ **MapReduce paradigm** - Map → Shuffle → Reduce
- ✅ **Key-value pair transformations**
- ✅ **Scalable aggregation** patterns

**Why Word Count Matters:** It forms the foundation for text analytics, search indexing, and many NLP applications.
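
The Map → Shuffle → Reduce flow listed above can be sketched in plain Python, with no Spark involved. This is only an illustration of the three phases on a two-line input chosen for the example; the names `mapped`, `shuffled`, and `counts` are made up here, and a real framework distributes each phase across workers.

```python
from collections import defaultdict

lines = ["crazy fox jumped", "fox is fast"]

# Map: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key, as a framework would when
# routing all pairs that share a key to the same reducer.
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)

# Reduce: collapse each key's list of values into a single count.
counts = {word: sum(ones) for word, ones in shuffled.items()}

print(counts)
# {'crazy': 1, 'fox': 2, 'jumped': 1, 'is': 1, 'fast': 1}
```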

---

## ⚙️ PySpark Setup

Initialize a local Spark session for the word count operations. The `local[*]` master runs Spark on this machine using all available cores.

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder \
    .appName("WordCount_Algorithm") \
    .master("local[*]") \
    .getOrCreate()

sc = spark.sparkContext

print(f"Spark Version: {spark.version}")
print("Ready for word count operations!")

## 📝 Prepare Input Data

Create sample text data for word count demonstration.

In [None]:
# Create sample text data
sample_text = [
    "crazy crazy fox jumped",
    "crazy fox jumped",
    "fox is fast",
    "fox is smart",
    "dog is smart",
    "big data is powerful",
    "spark processes big data",
    "machine learning with spark"
]

# Write to file
with open("wordcount_data.txt", "w") as f:
    for line in sample_text:
        f.write(line + "\n")

print("Sample data created:")
with open("wordcount_data.txt", "r") as f:
    content = f.read()
    print(content)
    print(f"\nTotal lines: {len(content.splitlines())}")

## 🔄 Classic RDD Word Count Implementation

The traditional MapReduce word count using RDD transformations.

In [None]:
# Load text file as RDD
text_rdd = sc.textFile("wordcount_data.txt")

print(f"Loaded {text_rdd.count()} lines from file")
print("Sample lines:")
for line in text_rdd.take(3):
    print(f"  '{line}'")

In [None]:
# Step 1: Split lines into words (FLATMAP)
words_rdd = text_rdd.flatMap(lambda line: line.split())
print(f"After flatMap (splitting): {words_rdd.count()} words")
print("Sample words:")
for word in words_rdd.take(10):
    print(f"  '{word}'")
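
To see why Step 1 uses `flatMap` rather than `map`, here is a plain-Python analogy (not Spark code). It assumes a two-line input picked for the example: a map-style pass gives one list per line, while a flatMap-style pass flattens those lists into a single sequence of words.

```python
from itertools import chain

lines = ["crazy crazy fox jumped", "fox is fast"]

# map-style: one output element per input element (1:1),
# so the result is a list of lists.
mapped = [line.split() for line in lines]

# flatMap-style: each input may yield many outputs (1:N),
# and the per-line lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(len(mapped))       # 2 elements, one per line
print(len(flat_mapped))  # 7 elements, one per word
```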

In [None]:
# Step 2: Create key-value pairs (MAP)
word_pairs_rdd = words_rdd.map(lambda word: (word, 1))
print("\nAfter map (key-value pairs):")
for pair in word_pairs_rdd.take(10):
    print(f"  {pair}")

In [None]:
# Step 3: Aggregate counts (REDUCE BY KEY)
word_counts_rdd = word_pairs_rdd.reduceByKey(lambda a, b: a + b)
print("\nAfter reduceByKey (aggregated counts):")
for pair in word_counts_rdd.take(10):
    print(f"  {pair}")
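
`reduceByKey` combines values for each key within a partition before shuffling, so fewer records cross the network. The sketch below imitates that idea in plain Python. The two-partition layout and the helper `combine_locally` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical split of the (word, 1) pairs across two partitions.
partitions = [
    [("fox", 1), ("crazy", 1), ("fox", 1)],
    [("fox", 1), ("is", 1), ("is", 1)],
]

def combine_locally(pairs):
    """Sum values per key within one partition, before any shuffle."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return local

partials = [combine_locally(p) for p in partitions]

# Only the partial sums would be shuffled: 4 records instead of 6.
shuffled_records = sum(len(p) for p in partials)

# Merge the partial sums per key to get the final counts.
totals = Counter()
for partial in partials:
    totals.update(partial)

print(shuffled_records, dict(totals))
# 4 {'fox': 3, 'crazy': 1, 'is': 2}
```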

In [None]:
# Step 4: Collect and display final results
final_results = word_counts_rdd.collect()

print("\n🎯 FINAL WORD COUNT RESULTS (RDD Approach):")
print("=" * 50)
for word, count in sorted(final_results):
    print(f"{word:15}: {count}")
print("=" * 50)
print(f"Total unique words: {len(final_results)}")

## 📊 DataFrame API Word Count

The same word count using the PySpark DataFrame API, for comparison with the RDD version.

In [None]:
from pyspark.sql.functions import split, explode, col

# Read as DataFrame
df = spark.read.text("wordcount_data.txt")

print("DataFrame loaded:")
df.show(truncate=False)

In [None]:
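# DataFrame word count pipeline
# Note: split(..., " ") splits on single spaces, so repeated spaces produce
# empty tokens (unlike str.split() in the RDD version); the filter removes them.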
word_counts_df = df \
    .select(split(col("value"), " ").alias("words")) \
    .select(explode(col("words")).alias("word")) \
    .filter(col("word") != "") \
    .groupBy("word") \
    .count() \
    .orderBy(col("count").desc(), col("word"))

print("\n🎯 FINAL WORD COUNT RESULTS (DataFrame Approach):")
print("=" * 50)
word_counts_df.show(20, truncate=False)
print("=" * 50)
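
The `filter(col("word") != "")` step in the pipeline above guards against empty tokens, which a delimiter-based split can produce around repeated or trailing spaces. Here is a plain-Python analogy using `re.split`. Spark's `split` follows Java regex semantics, so edge cases may differ, and the input string is made up for the example.

```python
import re

line = "fox  is fast "  # double space and a trailing space

# Splitting on a single-space delimiter keeps empty tokens.
by_single_space = re.split(" ", line)

# Python's argument-free str.split() collapses whitespace runs
# and never yields empty tokens.
by_whitespace = line.split()

# Dropping empty strings makes the two results agree.
filtered = [tok for tok in by_single_space if tok != ""]

print(by_single_space)  # ['fox', '', 'is', 'fast', '']
print(by_whitespace)    # ['fox', 'is', 'fast']
```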

## ⚡ Performance Comparison

Compare RDD vs DataFrame approaches for word count.

In [None]:
import time

# Create a larger dataset for the comparison (still tiny by Spark standards)
large_text = sample_text * 100  # Repeat 100 times -> 800 lines
with open("large_wordcount_data.txt", "w") as f:
    for line in large_text:
        f.write(line + "\n")

print(f"Created larger dataset: {len(large_text)} lines")


In [None]:
# Method 1: RDD Approach
print("=== RDD Word Count ===")
start_time = time.time()

rdd_result = sc.textFile("large_wordcount_data.txt") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .count()  # Just count, don't collect all

rdd_time = time.time() - start_time
print(f"RDD approach time: {rdd_time:.3f} seconds")
print(f"Unique words found: {rdd_result}")


In [None]:
# Method 2: DataFrame Approach
print("\n=== DataFrame Word Count ===")
start_time = time.time()

df_result = spark.read.text("large_wordcount_data.txt") \
    .select(split(col("value"), " ").alias("words")) \
    .select(explode(col("words")).alias("word")) \
    .filter(col("word") != "") \
    .groupBy("word") \
    .count()

df_count = df_result.count()
df_time = time.time() - start_time
print(f"DataFrame approach time: {df_time:.3f} seconds")
print(f"Unique words found: {df_count}")

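# Caveat: these are single-run timings on a small local file, so the result
# can be dominated by noise and startup overhead rather than real differences.
print("\nPerformance comparison:")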
if rdd_time < df_time:
    print(f"RDD was {df_time/rdd_time:.2f}x faster")
else:
    print(f"DataFrame was {rdd_time/df_time:.2f}x faster")
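
A single timed run like the ones above can include one-off startup costs, so a slightly more careful comparison discards a warm-up run and takes the median of several repeats. The helper below is a minimal sketch of that idea. `time_action` and its parameters are hypothetical names, and the workload is a stand-in for a callable that triggers a Spark action such as `.count()`.

```python
import statistics
import time

def time_action(action, repeats=5, warmup=1):
    """Return the median wall-clock seconds of action() over `repeats` runs."""
    for _ in range(warmup):
        action()  # discard warm-up runs (startup costs, caches)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload; in the notebook this would wrap a Spark action.
elapsed = time_action(lambda: sum(range(100_000)))
print(f"median: {elapsed:.6f} s")
```

The same helper could be called once per approach, passing a zero-argument function that builds and runs each pipeline.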


## 🔧 Advanced Word Count Techniques

Handle edge cases and improve word count quality.

In [None]:
# Advanced word count with text cleaning
import re

def clean_word(word):
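    """Lowercase a word and strip every character that is not an ASCII letter."""
    # Convert to lowercase
    word = word.lower()
    # Remove punctuation, digits, and non-ASCII characters (e.g. "don't" -> "dont")
    word = re.sub(r'[^a-zA-Z]', '', word)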
    return word

# Advanced word count with cleaning
cleaned_word_counts = sc.textFile("wordcount_data.txt") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (clean_word(word), 1)) \
    .filter(lambda pair: len(pair[0]) > 0) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda pair: pair[1], ascending=False)

print("\n🧹 CLEANED WORD COUNT RESULTS:")
print("=" * 50)
for word, count in cleaned_word_counts.collect():
    print(f"{word:15}: {count}")
print("=" * 50)
print("Applied: lowercase conversion, non-letter removal, empty word filtering, sorting by frequency")
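
It helps to check what the cleaning regex actually does on awkward inputs. The snippet below repeats the `clean_word` logic from the cell above in standalone form and runs it on a few example strings chosen for illustration. Note that digits and accented letters are dropped along with punctuation.

```python
import re

def clean_word(word):
    """Same logic as the cell above: lowercase, keep only ASCII letters."""
    return re.sub(r"[^a-zA-Z]", "", word.lower())

examples = ["Spark!", "don't", "2024", "café", "big-data"]
cleaned = {w: clean_word(w) for w in examples}

for original, result in cleaned.items():
    print(f"{original!r:12} -> {result!r}")
# 'Spark!' -> 'spark', "don't" -> 'dont', '2024' -> '',
# 'café' -> 'caf', 'big-data' -> 'bigdata'
```

Whether `'caf'` and `'bigdata'` are acceptable depends on the corpus. Text with numbers or non-English words would likely need a different pattern.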


In [None]:
# Stop words filtering
stop_words = {"is", "the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"}

filtered_word_counts = sc.textFile("wordcount_data.txt") \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: clean_word(word)) \
    .filter(lambda word: len(word) > 0 and word not in stop_words) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda pair: pair[1], ascending=False)

print("\n🚫 FILTERED WORD COUNT (No Stop Words):")
print("=" * 50)
for word, count in filtered_word_counts.take(10):  # Top 10
    print(f"{word:15}: {count}")
print("=" * 50)
print(f"Stop-word list size: {len(stop_words)} (words excluded if present in the text)")
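
To report how many tokens the stop-word filter actually drops, as opposed to how long the stop-word list is, the occurrences can be counted directly. This plain-Python sketch reuses the sample sentences and stop-word set from this notebook. It skips `clean_word` because the sample text is already lowercase with no punctuation.

```python
from collections import Counter

sample_text = [
    "crazy crazy fox jumped",
    "crazy fox jumped",
    "fox is fast",
    "fox is smart",
    "dog is smart",
    "big data is powerful",
    "spark processes big data",
    "machine learning with spark",
]
stop_words = {"is", "the", "a", "an", "and", "or", "but", "in",
              "on", "at", "to", "for", "of", "with", "by"}

tokens = [w for line in sample_text for w in line.split()]
removed = Counter(w for w in tokens if w in stop_words)

print(f"Tokens before filtering: {len(tokens)}")          # 28
print(f"Tokens removed: {sum(removed.values())}")         # 5
print(f"Distinct stop words hit: {sorted(removed)}")      # ['is', 'with']
```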


## 🎯 Interview Questions & Key Takeaways

### Common Interview Questions:
1. **Explain the word count algorithm in MapReduce terms**
2. **What's the difference between RDD and DataFrame word count?**
3. **How would you handle very large text files?**
4. **What are the performance implications of `collect()`?**

### Key Takeaways:
- ✅ **Word count demonstrates core MapReduce principles**
- ✅ **flatMap()** splits lines into words (1:N transformation)
- ✅ **map()** creates key-value pairs (1:1 transformation)
- ✅ **reduceByKey()** aggregates by key (N:1 transformation)
- ✅ **DataFrames provide a higher-level API** that Spark's Catalyst optimizer can plan and optimize
- ✅ **Always consider data size** before using `collect()`
- ✅ **Text preprocessing** (cleaning, stop words) improves quality
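
The `collect()` takeaway comes down to not pulling the whole result onto one machine when only a few rows are needed. The plain-Python sketch below shows the bounded top-N idea with `heapq.nlargest`. The generator `word_count_stream` and its counts are hypothetical stand-ins for a large result set. In PySpark, `take()` (used earlier) and `takeOrdered()` play a similar role.

```python
import heapq

def word_count_stream():
    """Hypothetical stand-in for a result set too large to hold at once."""
    yield from [("fox", 5), ("is", 4), ("crazy", 3), ("spark", 2), ("dog", 1)]

# nlargest keeps at most n candidates in memory while scanning,
# instead of materializing and sorting the whole result.
top3 = heapq.nlargest(3, word_count_stream(), key=lambda pair: pair[1])

print(top3)
# [('fox', 5), ('is', 4), ('crazy', 3)]
```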

### Real-World Applications:
- **Search engine indexing**
- **Document classification**
- **Sentiment analysis**
- **Topic modeling**
- **Spam detection**

---

**🚀 Word count is just the beginning! These patterns apply to countless big data analytics problems.**