# combineByKey Algorithm

Advanced aggregation operations in PySpark RDDs.

## Overview

`combineByKey` is a general transformation for custom aggregations on key-value RDDs, where the accumulator type may differ from the value type. It allows you to:
- Create initial accumulator values
- Merge values into accumulators
- Combine accumulators from different partitions
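
The three steps above can be modeled without Spark. The following is a minimal pure-Python sketch, not Spark's actual implementation: `simulate_combine_by_key` and the two hand-written partitions are hypothetical names and data chosen for illustration.

```python
def simulate_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Pure-Python model of combineByKey; no Spark required."""
    # Phase 1: within each partition, build one accumulator per key.
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key not in combiners:
                # First value seen for this key in this partition.
                combiners[key] = create_combiner(value)
            else:
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    # Phase 2: merge accumulators for the same key across partitions.
    merged = {}
    for combiners in per_partition:
        for key, acc in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], acc)
            else:
                merged[key] = acc
    return merged


# Two hypothetical partitions of (word, 1) pairs.
partitions = [
    [("spark", 1), ("hadoop", 1), ("spark", 1)],
    [("kafka", 1), ("spark", 1), ("hadoop", 1)],
]

counts = simulate_combine_by_key(
    partitions,
    lambda v: v,             # createCombiner
    lambda acc, v: acc + v,  # mergeValue
    lambda a, b: a + b,      # mergeCombiners
)
print(sorted(counts.items()))
```

Keeping the two phases separate shows why three functions are needed: values are folded into accumulators inside a partition, and only accumulators are merged across partitions.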

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder \
    .appName("combineByKey") \
    .getOrCreate()

sc = spark.sparkContext

## Basic Example: Word Count

Let's start with a simple word count using combineByKey:

In [None]:
# Sample data
data = [
    ("spark", 1),
    ("hadoop", 1),
    ("spark", 1),
    ("kafka", 1),
    ("spark", 1),
    ("hadoop", 1)
]

# Create RDD
rdd = sc.parallelize(data)

# combineByKey for word count
result = rdd.combineByKey(
    lambda v: v,                    # createCombiner: first value for a key becomes the accumulator
    lambda acc, v: acc + v,         # mergeValue: add a value to the partition-local accumulator
    lambda acc1, acc2: acc1 + acc2  # mergeCombiners: add accumulators from different partitions
)

print("Word Count Results:")
for word, count in sorted(result.collect()):
    print(f"{word}: {count}")

## Advanced Example: Statistics

Calculate min, max, and count for each key:

In [None]:
# Data: (key, value)
stats_data = [
    ("A", 10), ("A", 20), ("A", 15),
    ("B", 5), ("B", 25), ("B", 30),
    ("C", 8), ("C", 12)
]

stats_rdd = sc.parallelize(stats_data)

# Calculate min, max, count for each key
stats_result = stats_rdd.combineByKey(
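    lambda v: (v, v, 1),  # createCombiner: first value -> (min, max, count)
    lambda acc, v: (min(acc[0], v), max(acc[1], v), acc[2] + 1),  # mergeValue: update with one value
    lambda acc1, acc2: (  # mergeCombiners: merge two partial (min, max, count) tuples
        min(acc1[0], acc2[0]),
        max(acc1[1], acc2[1]),
        acc1[2] + acc2[2]
    )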
)

print("Statistics per Key:")
for key, (min_val, max_val, count) in sorted(stats_result.collect()):
    print(f"{key}: min={min_val}, max={max_val}, count={count}")

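## Complex Example: Per-Key Average

Collect the values for each key together with their sum, then compute the average over all of them. This yields a single average per key, not a sliding-window moving average: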

In [None]:
# Time series data: (key, (timestamp, value))
time_data = [
    ("stock_A", (1, 100)),
    ("stock_A", (2, 105)),
    ("stock_A", (3, 102)),
    ("stock_B", (1, 50)),
    ("stock_B", (2, 52)),
    ("stock_B", (3, 48))
]

time_rdd = sc.parallelize(time_data)

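# Collect each key's values and their sum
moving_avg = time_rdd.combineByKey(
    lambda v: ([v[1]], v[1]),                         # createCombiner: (values_list, current_sum)
    lambda acc, v: (acc[0] + [v[1]], acc[1] + v[1]),  # mergeValue: append the value, add it to the sum
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])  # mergeCombiners: concatenate lists, add sums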
)

print("Averages per Key:")
for key, (values, total) in sorted(moving_avg.collect()):
    avg = total / len(values)
    print(f"{key}: values={values}, average={avg:.2f}")
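
The example above produces one average per key over every collected value. A sliding-window average additionally needs the values in timestamp order, and the order in which values are merged is not guaranteed to follow the timestamps. The sketch below assumes the accumulator keeps whole `(timestamp, value)` pairs; `moving_average` and `window` are hypothetical names, and the function is plain Python that could be applied to each key's collected list, for instance through `mapValues`.

```python
from collections import deque


def moving_average(points, window):
    """Return (timestamp, mean of the last `window` values) in timestamp order."""
    recent = deque(maxlen=window)  # the oldest value is dropped automatically
    averages = []
    for timestamp, value in sorted(points):  # sort by timestamp first
        recent.append(value)
        averages.append((timestamp, sum(recent) / len(recent)))
    return averages


# Values for one key, deliberately out of order.
stock_a = [(2, 105), (1, 100), (3, 102)]
result = moving_average(stock_a, window=2)
print(result)
```

The first window is shorter than `window` because only one value is available at that point; whether to emit or skip such partial windows is a design choice.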

## Key Parameters

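- **createCombiner**: Called with the first value seen for a key in a partition; returns the initial accumulator
- **mergeValue**: Called for each further value of that key in the same partition; folds the value into the existing accumulator
- **mergeCombiners**: Called to merge two accumulators for the same key that were built in different partitions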
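
To make the roles concrete, here is a small sketch that writes the three functions as named, typed Python functions for a per-key mean and calls them by hand. The `Acc` alias and the split of values across two partitions are assumptions made for illustration.

```python
from typing import Tuple

Acc = Tuple[float, int]  # accumulator: (running sum, count)


def create_combiner(value: float) -> Acc:
    # value -> accumulator
    return (value, 1)


def merge_value(acc: Acc, value: float) -> Acc:
    # (accumulator, value) -> accumulator
    return (acc[0] + value, acc[1] + 1)


def merge_combiners(left: Acc, right: Acc) -> Acc:
    # (accumulator, accumulator) -> accumulator
    return (left[0] + right[0], left[1] + right[1])


# Suppose partition 1 sees 10 and 20 for a key, and partition 2 sees 15.
part1 = merge_value(create_combiner(10), 20)
part2 = create_combiner(15)
total, count = merge_combiners(part1, part2)
mean = total / count
print(part1, part2, (total, count), mean)
```

The same three functions could be passed unchanged to `combineByKey`; the value type (a number) and the accumulator type (a tuple) differ, which is the case this transformation is designed for.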

## Performance Notes

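- Typically more efficient than `groupByKey` followed by a custom aggregation, because values are combined within each partition before the shuffle
- Updates accumulators incrementally, so the full list of values for a key need not be held in memory unless the accumulator itself stores them
- Well suited to aggregations whose accumulator type differs from the value type, such as the (min, max, count) tuples above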
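
The efficiency point can be illustrated with a rough count of how many records would cross the shuffle. This is a simplified model under the assumption that accumulators are combined within each partition before the shuffle; it counts records rather than bytes, and the partition layout is hypothetical.

```python
# Hypothetical layout of (word, 1) records across two partitions.
records_by_partition = [
    [("spark", 1), ("hadoop", 1), ("spark", 1)],
    [("kafka", 1), ("spark", 1), ("hadoop", 1)],
]

# Grouping first: every input record is shuffled as-is.
shuffled_without_combine = sum(len(part) for part in records_by_partition)

# Combining first: each partition sends one accumulator per distinct key.
shuffled_with_combine = sum(
    len({key for key, _ in part}) for part in records_by_partition
)

print(shuffled_without_combine, shuffled_with_combine)
```

The gap widens as keys repeat more often within a partition; with mostly unique keys per partition the two counts are nearly equal.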