
# Learning Spark: Broadcast, Persist, and Data Distribution

## Introduction
This notebook takes a more advanced look at broadcast variables, persisting data, and data distribution in Spark, with detailed comparisons to partitioning. It complements a previous notebook that covers the fundamentals of broadcast in Spark.

---

By Meital Abadi

# **PART 1:**
## 1. Understanding Broadcast Variables in Spark

### What are Broadcast Variables?
- **Read-Only Sharing**: Broadcast variables are read-only once distributed.
- **Efficient Distribution**: Data is sent to each node only once.
- **Local Caching**: Each node caches the broadcast data for faster access.

### Why use Broadcast?
- **Avoids redundant data transmission**: Instead of sending the same data with every task, it is broadcast once to each executor.
- **Optimizes performance**: Each worker node reads the broadcast data from its local cache, which keeps network traffic low.

#### 🔹 Example: Using Broadcast for Efficient Joins


In [None]:
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("BroadcastComparison").getOrCreate()

large_df = spark.range(1, 1000001).toDF("id")
lookup_data = [(i, f"Category {i % 10}") for i in range(1, 51)]
small_df = spark.createDataFrame(lookup_data, ["id", "category"])

start_time = time.time()
regular_join_df = large_df.join(small_df, "id", "left_outer")
regular_join_df.count()
regular_time = time.time() - start_time

start_time = time.time()
broadcast_join_df = large_df.join(broadcast(small_df), "id", "left_outer")
broadcast_join_df.count()
broadcast_time = time.time() - start_time

print(f"Regular Join Time: {regular_time:.4f} sec")
print(f"Broadcast Join Time: {broadcast_time:.4f} sec")

total_rows = broadcast_join_df.count()
matched_rows = broadcast_join_df.filter(col("category").isNotNull()).count()
unmatched_rows = total_rows - matched_rows

print(f"Total rows: {total_rows}, Matched: {matched_rows}, Unmatched: {unmatched_rows}")

broadcast_join_df.orderBy("id").show(20, False)


----


### Understanding the Output

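In the recorded run, the broadcast join finished about **25% faster** than the regular join, because broadcasting the small table avoids shuffling the large one.  
Treat single-run timings with care: the first timed action also absorbs one-time startup costs, and Spark may choose a broadcast join on its own when it estimates that one side is smaller than `spark.sql.autoBroadcastJoinThreshold` (10 MB by default).  
Only **50 rows matched**, while **999,950 had no match** because their IDs were not in the lookup table.  
This method works best when joining a **large dataset with a small lookup table**, avoiding unnecessary data transfers.  

We ran both approaches to compare performance and see when broadcasting is useful.  
Broadcasting speeds up joins by keeping a copy of the small table in memory on every executor, so the large table does not have to be shuffled.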




In [None]:
import random

sc = spark.sparkContext

big_lookup = {str(i): f"value_{i}" for i in range(1000)}
broadcast_big = sc.broadcast(big_lookup)

test_keys = [str(random.randint(0, 999)) for _ in range(10000)]
test_rdd = sc.parallelize(test_keys)

def lookup_no_broadcast(key):
    return big_lookup.get(key, "not_found")

def lookup_with_broadcast(key):
    return broadcast_big.value.get(key, "not_found")

print("Running performance tests...")

start = time.time()
test_rdd.map(lookup_no_broadcast).count()
time_without = time.time() - start

start = time.time()
test_rdd.map(lookup_with_broadcast).count()
time_with = time.time() - start

print(f"Time without broadcast: {time_without:.2f} seconds")
print(f"Time with broadcast: {time_with:.2f} seconds")
print(f"Performance improvement: {(time_without/time_with):.2f}x faster")




### Performance Test Results  

The test compares **direct lookups** vs. **broadcast lookups** in an RDD.  
- **Without broadcast**, `big_lookup` is captured in the function's closure, so Spark serializes it and ships a copy along with the tasks.  
- **With broadcast**, the dictionary is sent to each executor once and cached there, so tasks read it locally.  
- The final `print` reports the ratio between the two timings. With a dictionary of only 1,000 entries the gap may be small; the benefit grows with the size of the shared data and the number of tasks.  

This illustrates that **broadcasting is useful when multiple tasks need to access the same static data**, minimizing network overhead and improving performance.  



## 2. Persisting Data in Spark

### 🔹 What is `persist()`?
Persisting (or caching) lets Spark store a dataset in memory (or on disk) so that repeated computations can reuse it instead of rebuilding it from scratch. It is useful in iterative algorithms or when multiple actions use the same dataset, and its purpose is to improve computation efficiency.

### 🔹 When to use `persist()`?
- When the same dataset is used multiple times within a job.
- To **reduce recomputation overhead** in iterative algorithms (e.g., Machine Learning, Graph Processing).

#### ✨ Example: Comparing Performance with and without `persist()`


In [None]:
import time

# Reuses broadcast_join_df, the joined DataFrame built in Section 1

# Measure execution time without persist
start_time = time.time()
broadcast_join_df.count()
print("Without persist:", time.time() - start_time, "seconds")

# Mark the DataFrame for caching; persist() is lazy, so nothing is stored yet
broadcast_join_df.persist()

# Measure execution time with persist (this first action also fills the cache)
start_time = time.time()
broadcast_join_df.count()
print("With persist:", time.time() - start_time, "seconds")

broadcast_join_df.show()


### Understanding the Results  

The output shows the execution time without `persist()` (0.168s) and with `persist()` (0.151s). The difference is small for two reasons. First, Spark is running on a single machine, where the data is already local and cheap to recompute. Second, `persist()` is lazy: the `count()` that follows it still computes the DataFrame and fills the cache, so only later actions read from the cache. In a distributed environment with an expensive lineage, `persist()` can significantly reduce recomputation time by keeping the DataFrame in memory (spilling to disk if needed) and avoiding repeated data transfers between nodes.


## 3. RDD vs. DataFrame: When to Use RDDs?
RDDs (Resilient Distributed Datasets) provide **low-level control** over distributed data, whereas DataFrames are optimized for SQL-like operations.

| Feature  | RDD | DataFrame |
|----------|----|-----------|
| Flexibility | High (custom logic) | Moderate (SQL-based) |
| Performance | Lower (no optimizations) | High (query optimization) |
| Schema | Not enforced | Enforced |
| API Usability | Verbose | Concise |

#### 🔹 Example: When RDDs are Necessary

Unlike DataFrames, RDDs allow full control over data partitioning, transformations, and distributed execution. They are useful when applying complex transformations, handling unstructured data, or working with custom partitioning logic.


In [None]:
rdd = spark.sparkContext.parallelize([("apple", 10), ("banana", 20), ("orange", 30)])

# Applying custom transformations using RDD API
rdd_filtered = rdd.filter(lambda x: x[1] > 15)
print(rdd_filtered.collect())



### Why Use RDD Instead of DataFrame?
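RDDs allow custom transformations that are awkward or impossible to express with DataFrame operations, such as arbitrary Python logic over unstructured records, custom partitioning functions, and full control over how data is distributed. A simple filter like the one above works equally well in the DataFrame API and benefits from query optimization there, so RDDs are worth the extra verbosity only when that level of control is needed.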

----

# **PART 2:** 

### Partitioning in Spark
Partitioning in Spark determines how data is distributed across nodes in a cluster, impacting performance and efficiency. Proper partitioning reduces data shuffling and improves parallelism.

🔹 Why is partitioning important?

- Ensures balanced workload distribution across cluster nodes.
- Reduces data movement (shuffling) in operations like join and groupBy().
- Helps optimize parallel processing by utilizing all available resources efficiently.
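
The idea behind key-based partitioning can be modeled in plain Python: each key is hashed and assigned to partition `hash(key) % num_partitions`, so records with equal keys always land in the same partition. This is a simplified model and not Spark's implementation. `zlib.crc32` stands in for Spark's hash function, and `assign_partition` and `partition_records` are hypothetical helpers.

```python
import zlib
from collections import defaultdict


def assign_partition(key, num_partitions):
    """Map a key to a partition index using a deterministic hash (crc32 stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    """Group (key, value) records by the partition their key hashes to."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[assign_partition(key, num_partitions)].append((key, value))
    return dict(partitions)


records = [("apple", 1), ("banana", 2), ("apple", 3), ("orange", 4), ("banana", 5)]
layout = partition_records(records, num_partitions=3)

for index in sorted(layout):
    print(f"partition {index}: {layout[index]}")
```

Because equal keys share a partition, a later `groupBy()` or `join` on that key can work partition by partition without moving those records again.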






### 📌 Basic Partitioning Example in Spark:


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()
sc = spark.sparkContext

# Creating an RDD with 6 partitions
rdd = sc.parallelize(range(1, 101), numSlices=6)

# Checking the number of partitions
print(f"Number of partitions: {rdd.getNumPartitions()}")

# Repartitioning to increase partitions (repartition() performs a full shuffle)
repartitioned_rdd = rdd.repartition(12)
print(f"After repartitioning: {repartitioned_rdd.getNumPartitions()}")

# Coalescing to decrease partitions (coalesce() merges partitions without a full shuffle)
coalesced_rdd = repartitioned_rdd.coalesce(4)
print(f"After coalescing: {coalesced_rdd.getNumPartitions()}")

### 📌 When to Use Partitioning vs. Broadcast  
Partitioning is useful when **handling large datasets** that need efficient processing across a distributed system. However, **Broadcasting** is an alternative that works better when **joining a large dataset with a small lookup table** to minimize shuffling.  

#### 💡 Partitioning vs. Broadcast - Key Differences  

| **Feature**       | **Partitioning**                         | **Broadcast**                       |
|-------------------|-----------------------------------------|-------------------------------------|
| **Use Case**      | Large datasets, distributed processing  | Small lookup tables, joins         |
| **Performance**   | Optimized for large-scale computation  | Reduces network transfer in joins  |
| **Shuffling**     | Some shuffling when data is moved      | Avoids shuffling the large table   |
| **Scalability**   | Works for massive data                 | Best when lookup tables fit in memory |
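
The shuffling row of the table can be made concrete with rough arithmetic. The following is a back-of-the-envelope sketch in plain Python. It assumes the worst case, where a shuffle join redistributes every row of both tables once and a broadcast join copies the small table to each executor once. The function names and the executor count of 8 are illustrative assumptions, and real costs depend on row sizes, data locality, and the plan Spark chooses.

```python
def shuffle_join_rows_moved(large_rows, small_rows):
    """Worst case: every row of both tables is redistributed by join key."""
    return large_rows + small_rows


def broadcast_join_rows_moved(small_rows, num_executors):
    """The small table is copied once to each executor; the large table stays put."""
    return small_rows * num_executors


# Table sizes match the demo in this notebook; the executor count is assumed.
large_rows, small_rows, num_executors = 1_000_000, 100, 8

shuffled = shuffle_join_rows_moved(large_rows, small_rows)
broadcasted = broadcast_join_rows_moved(small_rows, num_executors)

print(f"Shuffle join:   up to {shuffled:,} rows moved")
print(f"Broadcast join: about {broadcasted:,} rows moved")
```

The same arithmetic shows where broadcasting stops paying off: as the small table grows, the copy sent to every executor becomes the dominant cost and must still fit in memory.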

---

### 📌 Demonstrating Broadcast vs. Partitioning  

In [None]:
from pyspark.sql.functions import broadcast

# Creating a large DataFrame
large_df = spark.range(1, 1000001).toDF("id")

# Small lookup table
lookup_data = [(i, f"Category {i % 10}") for i in range(1, 101)]
small_df = spark.createDataFrame(lookup_data, ["id", "category"])

# Regular Join (Without Broadcast)
regular_join = large_df.join(small_df, "id", "left_outer")
print(f"Regular Join partitions: {regular_join.rdd.getNumPartitions()}")

# Broadcast Join (Optimized for Small Tables)
broadcast_join = large_df.join(broadcast(small_df), "id", "left_outer")
print(f"Broadcast Join partitions: {broadcast_join.rdd.getNumPartitions()}")

### 📌 Explaining the Results  

I tested **regular join vs. broadcast join** to see how partitioning behaves in each case.  

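- **Regular Join: 5 partitions** → Spark shuffled both tables by join key. The partition count after a shuffle is set by the shuffle itself (`spark.sql.shuffle.partitions`, possibly reduced when adaptive query execution merges small partitions), which is most likely where the 5 comes from.  
- **Broadcast Join: 8 partitions** → The lookup table was sent to all nodes, so the large table was joined in place without a shuffle and most likely kept its original 8 partitions.  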

### 🔹 Key Differences  
- **Regular join** moves data across partitions, causing overhead.  
- **Broadcast join** avoids shuffling the large table, making it faster for small lookup tables.  
- **The number of output partitions** depends on the join strategy Spark uses.  

Broadcasting works best for **small tables**, while partitioning is needed for **large datasets**. Next, I’ll explore how both methods can be combined for better optimization. 🚀  


-----


# End of Notebook

-------