# PySpark RDD Operations

## Overview
This notebook covers Resilient Distributed Datasets (RDDs), the fundamental data structure in Spark, along with the transformations and actions that operate on them.

## Learning Objectives
- Understand RDD concepts and when to use them
- Create RDDs from various sources
- Apply transformations (map, filter, flatMap, etc.)
- Execute actions (collect, reduce, count, etc.)
- Work with pair RDDs
- Understand partitioning and persistence

---

## 1. RDD Basics

### What is an RDD?

An **RDD (Resilient Distributed Dataset)** is:
- An immutable distributed collection of objects
- Partitioned across the cluster
- Fault-tolerant (lost partitions can be recomputed from the RDD's lineage)
- Spark's low-level API (DataFrames are built on top of RDDs)

### When to Use RDDs vs DataFrames?

**Use RDDs when:**
- Need fine-grained control over data
- Working with unstructured data
- Need to manipulate data at low level
- Legacy code requires RDDs

**Use DataFrames when:**
- Working with structured/semi-structured data
- Want automatic query optimization (Catalyst optimizer)
- Need SQL-like operations
- Want better performance in most cases (the recommended default)

## 2. Creating RDDs

In [None]:
from pyspark import SparkContext

# In Databricks, sc (SparkContext) is already available
# For local: sc = SparkContext("local", "RDD Operations")

# Method 1: From Python collection
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(data)

print(f"RDD created with {rdd.count()} elements")
print(f"First 5 elements: {rdd.take(5)}")

In [None]:
# Method 2: From text file
# text_rdd = sc.textFile("/path/to/file.txt")

# Method 3: From DataFrame (spark, the SparkSession, is predefined in Databricks)
df = spark.range(1, 11)
rdd_from_df = df.rdd  # an RDD of Row objects, e.g. Row(id=1)

print(f"RDD from DataFrame: {rdd_from_df.take(3)}")

In [None]:
# Create RDD with specific number of partitions
rdd_4_partitions = sc.parallelize(range(1, 101), 4)

print(f"Number of partitions: {rdd_4_partitions.getNumPartitions()}")
print(f"Elements per partition: {rdd_4_partitions.glom().map(len).collect()}")
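
To build intuition for the partition sizes printed above, here is a plain-Python sketch (not Spark code) of splitting a collection into contiguous, near-equal slices. The helper name `slice_into_partitions` is hypothetical, and Spark's actual slicing logic may differ in detail.

```python
def slice_into_partitions(items, num_partitions):
    """Split a list into contiguous, near-equal slices (illustrative helper)."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

# 100 elements over 4 partitions, mirroring the cell above
partitions_local = slice_into_partitions(list(range(1, 101)), 4)
sizes_local = [len(p) for p in partitions_local]
print(sizes_local)

# An uneven case: 10 elements over 3 partitions
uneven_sizes = [len(p) for p in slice_into_partitions(list(range(10)), 3)]
print(uneven_sizes)
```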

## 3. Transformations (Lazy Operations)

### map() - Transform each element

In [None]:
# Square each number
numbers = sc.parallelize([1, 2, 3, 4, 5])
squared = numbers.map(lambda x: x ** 2)

print(f"Original: {numbers.collect()}")
print(f"Squared: {squared.collect()}")
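
The section heading calls transformations lazy: `map()` only records what to do, and the work happens when an action such as `collect()` runs. As a rough plain-Python analogy (an assumption for illustration, not how Spark is implemented), a generator expression behaves similarly. The names `call_log` and `square` are hypothetical.

```python
call_log = []

def square(x):
    call_log.append(x)  # record when the function actually runs
    return x ** 2

numbers_local = [1, 2, 3, 4, 5]

# Building the generator plays the role of a transformation: nothing runs yet
lazy_squared = (square(x) for x in numbers_local)
calls_before = len(call_log)

# Consuming it plays the role of an action: now every element is processed
result_local = list(lazy_squared)
calls_after = len(call_log)

print(calls_before, calls_after, result_local)
```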

### filter() - Keep elements matching condition

In [None]:
# Keep only even numbers
evens = numbers.filter(lambda x: x % 2 == 0)

print(f"Original: {numbers.collect()}")
print(f"Evens: {evens.collect()}")

### flatMap() - Transform and flatten

In [None]:
# Split sentences into words
sentences = sc.parallelize([
    "Hello world",
    "PySpark tutorial",
    "Databricks course"
])

# map produces one output per input, so the result is an RDD of lists
words_map = sentences.map(lambda x: x.split())
print(f"Using map: {words_map.collect()}")

# flatMap flattens the result
words_flatmap = sentences.flatMap(lambda x: x.split())
print(f"Using flatMap: {words_flatmap.collect()}")
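
The difference between `map()` and `flatMap()` can also be seen without a cluster. This plain-Python sketch uses a list comprehension as a stand-in for `map()` and `itertools.chain.from_iterable` as a stand-in for `flatMap()`; the variable names are illustrative only.

```python
from itertools import chain

sentences_local = ["Hello world", "PySpark tutorial", "Databricks course"]

# map-like: one output (a list of words) per input sentence
mapped = [s.split() for s in sentences_local]

# flatMap-like: the per-sentence lists are flattened into a single sequence
flat_mapped = list(chain.from_iterable(s.split() for s in sentences_local))

print(mapped)
print(flat_mapped)
```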

### distinct() - Remove duplicates

In [None]:
# Remove duplicates
duplicates = sc.parallelize([1, 2, 2, 3, 3, 3, 4, 5, 5])
unique = duplicates.distinct()

print(f"With duplicates: {duplicates.collect()}")
print(f"Distinct: {unique.collect()}")

### union() - Combine RDDs

In [None]:
# Union of two RDDs
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
combined = rdd1.union(rdd2)

print(f"RDD1: {rdd1.collect()}")
print(f"RDD2: {rdd2.collect()}")
print(f"Union: {combined.collect()}")

### intersection() and subtract()

In [None]:
# Set operations
rdd_a = sc.parallelize([1, 2, 3, 4, 5])
rdd_b = sc.parallelize([3, 4, 5, 6, 7])

# Intersection
common = rdd_a.intersection(rdd_b)
print(f"Intersection: {common.collect()}")

# Subtract (elements in A but not in B)
diff = rdd_a.subtract(rdd_b)
print(f"A - B: {diff.collect()}")

### cartesian() - Cartesian product

In [None]:
# Cartesian product
rdd_x = sc.parallelize([1, 2])
rdd_y = sc.parallelize(['a', 'b'])
product = rdd_x.cartesian(rdd_y)

print(f"Cartesian product: {product.collect()}")

## 4. Actions (Trigger Execution)

### collect() - Retrieve all elements

In [None]:
# Collect all elements to driver (careful with large datasets!)
data = sc.parallelize(range(1, 11))
all_elements = data.collect()

print(f"All elements: {all_elements}")

### count(), first(), take()

In [None]:
# Count elements
print(f"Count: {data.count()}")

# Get first element
print(f"First: {data.first()}")

# Take n elements
print(f"First 3: {data.take(3)}")

# Largest n elements, returned in descending order
print(f"Top 3: {data.top(3)}")
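
`take(n)` and `top(n)` are easy to confuse. In plain-Python terms (an analogy, not Spark code), `take(3)` resembles slicing the first three elements, while `top(3)` resembles `heapq.nlargest`.

```python
import heapq

data_local = list(range(1, 11))

first_three = data_local[:3]                # like take(3): first elements encountered
top_three = heapq.nlargest(3, data_local)   # like top(3): largest, in descending order

print(first_three, top_three)
```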

### reduce() - Aggregate elements

In [None]:
# Sum all numbers
numbers = sc.parallelize([1, 2, 3, 4, 5])
sum_all = numbers.reduce(lambda a, b: a + b)

print(f"Sum: {sum_all}")

# Find maximum
max_val = numbers.reduce(lambda a, b: a if a > b else b)
print(f"Max: {max_val}")

### fold() - Like reduce with initial value

In [None]:
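# Fold with a zero value. The zero value is applied once in every partition and
# once more when partition results are merged, so it must be neutral for the
# operation (0 for addition, 1 for multiplication).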
result = numbers.fold(0, lambda a, b: a + b)
print(f"Fold result: {result}")
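
Unlike a sequential fold, `fold()` applies its zero value once per partition and once more when merging the partition results. The plain-Python model below sketches that behavior; `fold_model` and the two-partition layout are assumptions made for illustration.

```python
from functools import reduce

def fold_model(partitions, zero, op):
    """Model of a partitioned fold: fold each partition, then fold the partials."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

def add(a, b):
    return a + b

parts = [[1, 2], [3, 4, 5]]  # assumed layout: two partitions

neutral = fold_model(parts, 0, add)   # zero value is neutral for addition
skewed = fold_model(parts, 10, add)   # 10 is added three times: 2 partitions + 1 merge

print(neutral, skewed)
```

With a non-neutral zero value the result depends on the number of partitions, which is why 0 is the safe choice for addition.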

### foreach() - Apply function to each element

In [None]:
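# Process each element (side effects)
# Note: the function runs on the executors, so on a cluster the printed output
# lands in the executor logs, not in the notebook. Use foreach for side effects such as saving.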
numbers.foreach(lambda x: print(f"Processing: {x}"))

print("foreach completed (check executor logs for output)")

## 5. Pair RDD Operations

### Creating Pair RDDs

In [None]:
# Create pair RDD (key-value pairs)
pairs = sc.parallelize([
    ("apple", 5),
    ("banana", 3),
    ("apple", 2),
    ("orange", 4),
    ("banana", 6)
])

print(f"Pair RDD: {pairs.collect()}")

### keys() and values()

In [None]:
# Get keys and values
keys = pairs.keys()
values = pairs.values()

print(f"Keys: {keys.collect()}")
print(f"Values: {values.collect()}")
print(f"Distinct keys: {keys.distinct().collect()}")

### reduceByKey() - Aggregate by key

In [None]:
# Sum values by key
totals = pairs.reduceByKey(lambda a, b: a + b)

print(f"Totals by key: {totals.collect()}")

### groupByKey() - Group values by key

In [None]:
# Group all values for each key (every value is shuffled; for aggregations, prefer reduceByKey)
grouped = pairs.groupByKey()

# Need to convert iterables to lists
grouped_list = grouped.mapValues(list)

print(f"Grouped by key: {grouped_list.collect()}")
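
`reduceByKey()` and `groupByKey()` can give the same totals, but `reduceByKey()` combines values inside each partition before the shuffle. The plain-Python model below sketches that idea; the helper names and the two-partition layout are hypothetical, and real shuffle behavior involves more than a record count.

```python
def combine_locally(partition, op):
    """Merge values that share a key within one partition (map-side combine)."""
    combined = {}
    for key, value in partition:
        combined[key] = op(combined[key], value) if key in combined else value
    return combined

def reduce_by_key_model(partitions, op):
    """Combine per partition, then merge the per-partition results by key."""
    local_results = [combine_locally(p, op) for p in partitions]
    records_shuffled = sum(len(r) for r in local_results)
    merged = {}
    for result in local_results:
        for key, value in result.items():
            merged[key] = op(merged[key], value) if key in merged else value
    return merged, records_shuffled

# Assumed layout: the pairs from above split across two partitions
pair_partitions = [
    [("apple", 5), ("banana", 3), ("apple", 2)],
    [("orange", 4), ("banana", 6)],
]

totals_local, shuffled_count = reduce_by_key_model(pair_partitions, lambda a, b: a + b)
total_records = sum(len(p) for p in pair_partitions)  # what a plain group-by would move

print(totals_local, shuffled_count, total_records)
```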

### mapValues() - Transform values only

In [None]:
# Double all values
doubled = pairs.mapValues(lambda x: x * 2)

print(f"Original: {pairs.collect()}")
print(f"Doubled values: {doubled.collect()}")

### sortByKey() - Sort by key

In [None]:
# Sort by key
sorted_asc = pairs.sortByKey(ascending=True)
sorted_desc = pairs.sortByKey(ascending=False)

print(f"Sorted ascending: {sorted_asc.collect()}")
print(f"Sorted descending: {sorted_desc.collect()}")

### join() operations on Pair RDDs

In [None]:
# Join pair RDDs
prices = sc.parallelize([
    ("apple", 0.5),
    ("banana", 0.3),
    ("orange", 0.6)
])

quantities = sc.parallelize([
    ("apple", 10),
    ("banana", 20),
    ("grape", 15)
])

# Inner join
inner_join = prices.join(quantities)
print(f"Inner join: {inner_join.collect()}")

# Left outer join
left_join = prices.leftOuterJoin(quantities)
print(f"Left join: {left_join.collect()}")

# Right outer join
right_join = prices.rightOuterJoin(quantities)
print(f"Right join: {right_join.collect()}")
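
The three joins differ only in which keys survive and where `None` fills a missing side. This plain-Python sketch models that on the same data; `join_model` is a hypothetical helper that assumes unique keys on each side, and the order of results from Spark's `collect()` is not guaranteed to match.

```python
def join_model(left, right, how="inner"):
    """Model key-based joins on lists of (key, value) pairs with unique keys."""
    left_map, right_map = dict(left), dict(right)
    if how == "inner":
        keys = [k for k in left_map if k in right_map]
    elif how == "left":
        keys = list(left_map)
    elif how == "right":
        keys = list(right_map)
    else:
        raise ValueError(f"unsupported join type: {how}")
    # A key missing on one side yields None for that side
    return [(k, (left_map.get(k), right_map.get(k))) for k in keys]

prices_local = [("apple", 0.5), ("banana", 0.3), ("orange", 0.6)]
quantities_local = [("apple", 10), ("banana", 20), ("grape", 15)]

inner = join_model(prices_local, quantities_local, "inner")
left = join_model(prices_local, quantities_local, "left")
right = join_model(prices_local, quantities_local, "right")

print(inner)
print(left)
print(right)
```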

### countByKey()

In [None]:
# Count occurrences of each key (an action: the result is a dictionary returned to the driver)
counts = pairs.countByKey()

print(f"Counts by key: {counts}")

## 6. Partitioning

In [None]:
# Check partitions
data = sc.parallelize(range(1, 101), 4)

print(f"Number of partitions: {data.getNumPartitions()}")

# View partition distribution
partition_sizes = data.glom().map(len).collect()
print(f"Elements per partition: {partition_sizes}")

# Repartition (full shuffle; can increase or decrease the number of partitions)
repartitioned = data.repartition(8)
print(f"After repartition: {repartitioned.getNumPartitions()} partitions")

# Coalesce (reduce partitions, avoiding a full shuffle by default)
coalesced = data.coalesce(2)
print(f"After coalesce: {coalesced.getNumPartitions()} partitions")

### Custom Partitioner for Pair RDDs

In [None]:
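# Partition by key
pair_rdd = sc.parallelize([(i, i*2) for i in range(1, 21)])

# Default partitioner: hash of the key, spread over 4 partitions
partitioned_pairs = pair_rdd.partitionBy(4)
print(f"Partitioned by key: {partitioned_pairs.getNumPartitions()} partitions")

# Custom partitioner: pass a function that maps each key to a partition index
custom_pairs = pair_rdd.partitionBy(2, lambda key: 0 if key <= 10 else 1)
print(f"Custom partition sizes: {custom_pairs.glom().map(len).collect()}")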
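
`partitionBy()` sends each pair to the partition given by a partition function applied to the key, modulo the number of partitions. The plain-Python model below sketches this; it uses the built-in `hash` as a stand-in for PySpark's default `portable_hash`, and `partition_by_model` is a hypothetical helper.

```python
def partition_by_model(pairs, num_partitions, partition_func=hash):
    """Assign each (key, value) pair to bucket partition_func(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

pair_list = [(i, i * 2) for i in range(1, 21)]

# Default-style hash partitioning over 4 partitions
hashed = partition_by_model(pair_list, 4)
hashed_sizes = [len(b) for b in hashed]

# A custom partition function: keys up to 10 in partition 0, the rest in partition 1
ranged = partition_by_model(pair_list, 2, lambda key: 0 if key <= 10 else 1)
ranged_sizes = [len(b) for b in ranged]

print(hashed_sizes, ranged_sizes)
```

Because all pairs with the same key land in the same partition, later key-based operations on the partitioned data can avoid moving those pairs again.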

## 7. Persistence and Caching

In [None]:
from pyspark import StorageLevel

# Create RDD and cache it
data = sc.parallelize(range(1, 1000))

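# Cache (shorthand for persist() at the default MEMORY_ONLY level)
cached_rdd = data.cache()

# Or persist with a specific storage level. An RDD's storage level cannot be
# changed once assigned, so this uses a separate RDD rather than `data`.
persisted_rdd = sc.parallelize(range(1, 1000)).persist(StorageLevel.MEMORY_AND_DISK)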

# Use cached RDD multiple times
count1 = cached_rdd.count()
count2 = cached_rdd.filter(lambda x: x > 500).count()

print(f"Total: {count1}, Greater than 500: {count2}")

# Unpersist when done
cached_rdd.unpersist()
persisted_rdd.unpersist()
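
Caching matters because, without it, each action recomputes the RDD from its lineage. The plain-Python sketch below models that cost with a call counter; `load_source` is a hypothetical stand-in for an expensive computation, not a Spark API.

```python
calls = {"source": 0}

def load_source():
    """Stand-in for recomputing a dataset from its lineage."""
    calls["source"] += 1
    return list(range(1, 1000))

# Without caching: each "action" rebuilds the data from scratch
total = len(load_source())
above_500 = len([x for x in load_source() if x > 500])
uncached_calls = calls["source"]

# With caching: build once, then reuse the stored result for both "actions"
calls["source"] = 0
cached_local = load_source()
total_cached = len(cached_local)
above_500_cached = len([x for x in cached_local if x > 500])
cached_calls = calls["source"]

print(uncached_calls, cached_calls)
```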

## 8. Word Count Example (Classic RDD Example)

In [None]:
# Classic word count
text = [
    "Apache Spark is fast",
    "Spark is easy to use",
    "Spark runs everywhere"
]

text_rdd = sc.parallelize(text)

# Word count pipeline
word_counts = text_rdd \
    .flatMap(lambda line: line.lower().split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], ascending=False)

print("Word counts:")
for word, count in word_counts.collect():
    print(f"{word}: {count}")
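
To sanity-check the pipeline's output on small inputs, the same counts can be computed locally. This sketch uses `collections.Counter` from the standard library; it is a reference check, not a replacement for the distributed version, and the order of tied words may differ from Spark's.

```python
from collections import Counter

lines_local = [
    "Apache Spark is fast",
    "Spark is easy to use",
    "Spark runs everywhere",
]

# Lowercase, split on whitespace, and count, mirroring flatMap + map + reduceByKey
counts_local = Counter(word for line in lines_local for word in line.lower().split())

# most_common() sorts by count, highest first, mirroring the sortBy step
ranked = counts_local.most_common()

for word, count in ranked:
    print(f"{word}: {count}")
```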

## Practice Exercises

### Exercise 1: Filter and Aggregate
Given an RDD of numbers, find the sum of all even numbers.

In [None]:
# Your solution here
numbers = sc.parallelize(range(1, 101))
# TODO: Filter evens and sum them

### Exercise 2: Pair RDD Operations
Given sales data as (product, amount) pairs, find total sales per product.

In [None]:
# Your solution here
sales = sc.parallelize([
    ("A", 100), ("B", 200), ("A", 150), ("C", 300), ("B", 100)
])
# TODO: Calculate total per product

## Summary

In this notebook, you learned:

- ✅ RDD concepts and when to use them
- ✅ Creating RDDs from various sources
- ✅ Transformations (map, filter, flatMap, etc.)
- ✅ Actions (collect, reduce, count, etc.)
- ✅ Pair RDD operations (reduceByKey, join, etc.)
- ✅ Partitioning strategies
- ✅ Persistence and caching
- ✅ Classic RDD patterns (word count)

## Next Steps

1. Practice with more complex RDD operations
2. Compare RDD vs DataFrame performance
3. Learn when to use each API
4. Study advanced partitioning strategies

## Additional Resources

- [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [PySpark RDD API](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html)
- [Spark By Examples - RDD](https://sparkbyexamples.com/pyspark-rdd/)