# 🔑 combineByKey Algorithm

Advanced aggregation operations in PySpark RDDs using `combineByKey()`.

## 🎯 Overview

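`combineByKey()` is a general-purpose transformation for custom aggregations on key-value RDDs. You supply three functions:

- ✅ **`createCombiner`**: creates the initial accumulator the first time a key is seen in a partition
- ✅ **`mergeValue`**: merges a value into an existing accumulator within a partition
- ✅ **`mergeCombiners`**: combines accumulators for the same key from different partitions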
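
The three steps above can be sketched without Spark. The following is a minimal pure-Python simulation, not Spark's actual implementation: `simulate_combine_by_key` and the two-partition layout are hypothetical names and data chosen for illustration.

```python
def simulate_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Pure-Python stand-in for combineByKey (illustrative only).

    `partitions` is a list of lists of (key, value) pairs.
    """
    # Phase 1: build one accumulator per key inside each partition.
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    # Phase 2: merge accumulators for the same key across partitions.
    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


# Hypothetical layout: six records spread over two partitions.
partitions = [
    [("spark", 1), ("hadoop", 1), ("spark", 1)],
    [("kafka", 1), ("spark", 1), ("hadoop", 1)],
]

counts = simulate_combine_by_key(
    partitions,
    lambda v: v,             # createCombiner
    lambda acc, v: acc + v,  # mergeValue
    lambda a, b: a + b,      # mergeCombiners
)
print(sorted(counts.items()))  # [('hadoop', 2), ('kafka', 1), ('spark', 3)]
```

Note that `create_combiner` runs once per key per partition, so `"spark"` gets two separate accumulators here before they are merged in phase 2.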

---

## ⚙️ PySpark Setup

Initialize Spark for the combineByKey operations.

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder \
    .appName("combineByKey_Demo") \
    .master("local[*]") \
    .getOrCreate()

sc = spark.sparkContext

print(f"Spark Version: {spark.version}")
print("Ready for combineByKey operations!")

## 📊 Basic Example: Word Count

Classic word count implementation using `combineByKey()`.

In [None]:
# Sample word count data
word_data = [
    ("spark", 1),
    ("hadoop", 1),
    ("spark", 1),
    ("kafka", 1),
    ("spark", 1),
    ("hadoop", 1)
]

# Create RDD
word_rdd = sc.parallelize(word_data)

print("Input data:")
for word, count in word_data:
    print(f"  {word}: {count}")

In [None]:
# combineByKey for word count
word_count_result = word_rdd.combineByKey(
    lambda v: v,                    # createCombiner: initial value
    lambda acc, v: acc + v,         # mergeValue: add to accumulator
    lambda acc1, acc2: acc1 + acc2  # mergeCombiners: combine accumulators
)

print("\nWord Count Results using combineByKey():")
for word, count in sorted(word_count_result.collect()):
    print(f"  {word}: {count}")

## 📈 Advanced Example: Statistics per Key

Calculate min, max, and count for each key using a single `combineByKey()` operation.

In [None]:
# Data for statistical calculations
stats_data = [
    ("A", 10), ("A", 20), ("A", 15),
    ("B", 5), ("B", 25), ("B", 30),
    ("C", 8), ("C", 12)
]

stats_rdd = sc.parallelize(stats_data)

print("Input data for statistics:")
for key, value in stats_data:
    print(f"  {key}: {value}")

In [None]:
# Calculate min, max, count for each key using combineByKey
stats_result = stats_rdd.combineByKey(
    lambda v: (v, v, 1),                    # createCombiner: (min, max, count)
    lambda acc, v: (min(acc[0], v), max(acc[1], v), acc[2] + 1),  # mergeValue
    lambda acc1, acc2: (min(acc1[0], acc2[0]), max(acc1[1], acc2[1]), acc1[2] + acc2[2])  # mergeCombiners
)

print("\nStatistics per Key:")
for key, (min_val, max_val, count) in sorted(stats_result.collect()):
    print(f"  {key}: min={min_val}, max={max_val}, count={count}")
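
The same pattern extends to a per-key mean, where the accumulator `(sum, count)` has a different type from the input values. Below is a small sketch of the three functions as plain Python, checked locally on the values for key `"A"`. The function names, the `fold` helper, and the split into two partitions are assumptions made for illustration.

```python
from functools import reduce


def create_combiner(value):
    # First value seen for a key in a partition: start (sum, count).
    return (value, 1)


def merge_value(acc, value):
    # Fold one more value into the (sum, count) accumulator.
    return (acc[0] + value, acc[1] + 1)


def merge_combiners(acc1, acc2):
    # Combine two (sum, count) accumulators from different partitions.
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


def fold(values):
    # Hypothetical helper: apply createCombiner, then mergeValue, to one partition.
    first, *rest = values
    return reduce(merge_value, rest, create_combiner(first))


# Values for key "A", split across two assumed partitions.
part1 = [10, 20]
part2 = [15]

total, count = merge_combiners(fold(part1), fold(part2))
mean_a = total / count
print(mean_a)  # 15.0
```

With a running Spark session, the same three functions could be passed as `stats_rdd.combineByKey(create_combiner, merge_value, merge_combiners).mapValues(lambda acc: acc[0] / acc[1])`.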

## ⚡ Performance Comparison

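`combineByKey()` is usually more efficient than `groupByKey()` for aggregations: it merges values within each partition before the shuffle, so only one accumulator per key per partition is sent across the network, whereas `groupByKey()` ships every value.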

In [None]:
# Large dataset for performance comparison
large_data = [(f"key_{i % 100}", i) for i in range(10000)]
large_rdd = sc.parallelize(large_data)

print(f"Dataset size: {large_rdd.count()} elements")
print(f"Unique keys: {large_rdd.map(lambda x: x[0]).distinct().count()}")


In [None]:
import time

# Method 1: Using groupByKey (less efficient)
print("=== Method 1: groupByKey + mapValues ===")
start_time = time.time()

groupby_result = large_rdd.groupByKey() \
    .mapValues(lambda values: sum(values)) \
    .collect()

groupby_time = time.time() - start_time
print(f"groupByKey time: {groupby_time:.3f} seconds")
print(f"Results count: {len(groupby_result)}")


In [None]:
# Method 2: Using combineByKey (more efficient)
print("\n=== Method 2: combineByKey ===")
start_time = time.time()

combinebykey_result = large_rdd.combineByKey(
    lambda v: v,                    # createCombiner
    lambda acc, v: acc + v,         # mergeValue
    lambda acc1, acc2: acc1 + acc2  # mergeCombiners
).collect()

combinebykey_time = time.time() - start_time
print(f"combineByKey time: {combinebykey_time:.3f} seconds")
print(f"Results count: {len(combinebykey_result)}")

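# A ratio above 1 means combineByKey was faster in this run; small local timings are noisy
print(f"\nSpeed ratio (groupByKey / combineByKey): {groupby_time/combinebykey_time:.2f}x")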
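
Timings on a small local dataset can be noisy, so it may help to reason about shuffle volume instead. The sketch below counts, in plain Python, how many items would cross the shuffle with and without per-partition combining. The partition count of 4 and the contiguous slicing in `split_into_partitions` are assumptions; Spark's actual layout depends on configuration.

```python
def split_into_partitions(records, num_partitions):
    # Contiguous slices; an assumed layout chosen for illustration.
    size = (len(records) + num_partitions - 1) // num_partitions
    return [records[i:i + size] for i in range(0, len(records), size)]


large_data = [(f"key_{i % 100}", i) for i in range(10000)]
num_partitions = 4  # assumed partition count
partitions = split_into_partitions(large_data, num_partitions)

# groupByKey-style: every value has to reach the partition that owns its key.
shuffled_without_combine = sum(len(p) for p in partitions)

# combineByKey-style: values are summed per key inside each partition first,
# so only one partial sum per distinct key per partition is shuffled.
shuffled_with_combine = sum(len({key for key, _ in p}) for p in partitions)

print(shuffled_without_combine)  # 10000
print(shuffled_with_combine)     # 400
```

Under these assumptions the combined approach moves 400 partial sums instead of 10,000 values. The gap grows with the number of values per key and shrinks as the number of distinct keys approaches the number of records.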


## 🎯 Interview Questions & Key Takeaways

### Common Interview Questions:
1. **What is the difference between `reduceByKey()` and `combineByKey()`?**
2. **When would you use `combineByKey()` instead of `groupByKey()`?**
3. **How does `combineByKey()` handle combiner functions?**
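
As a starting point for the first question: `reduceByKey(func)` behaves like `combineByKey()` with an identity `createCombiner` and `func` used for both merge steps, so its result must have the same type as the values. The pure-Python sketch below illustrates that relationship for a single key. `combine_values` and `reduce_values` are hypothetical helpers, not Spark APIs, and each partition is assumed to be non-empty.

```python
def combine_values(values_by_partition, create_combiner, merge_value, merge_combiners):
    """Fold one key's values per partition, then across partitions (illustrative)."""
    partials = []
    for values in values_by_partition:  # each partition assumed non-empty
        acc = create_combiner(values[0])
        for v in values[1:]:
            acc = merge_value(acc, v)
        partials.append(acc)

    result = partials[0]
    for partial in partials[1:]:
        result = merge_combiners(result, partial)
    return result


def reduce_values(values_by_partition, func):
    # reduceByKey-style: identity combiner, one function for both merge roles.
    return combine_values(values_by_partition, lambda v: v, func, func)


# Values for key "B" across two assumed partitions.
values_for_b = [[5, 25], [30]]

total = reduce_values(values_for_b, lambda a, b: a + b)
print(total)  # 60

# combineByKey-style: the accumulator (a list) differs in type from the values (ints).
as_list = combine_values(
    values_for_b,
    lambda v: [v],
    lambda acc, v: acc + [v],
    lambda a, b: a + b,
)
print(as_list)  # [5, 25, 30]
```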

### Key Takeaways:
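- ✅ `combineByKey()` provides fine-grained control over aggregation: the accumulator type can differ from the value type
- ✅ **Usually more efficient** than `groupByKey()` for aggregations, because values are combined within each partition before the shuffle
- ✅ **Well suited** to custom aggregations such as per-key min, max, and count in a single pass
- ✅ **Aggregates incrementally**: values are folded into an accumulator one at a time, so the full list of values for a key never has to be held in memory
- ✅ **Useful** in production jobs where shuffle volume dominates the cost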

---

**🚀 Ready to master advanced PySpark aggregations? `combineByKey()` is your gateway to efficient big data processing!**