### A step-by-step example of how `reduceByKey` works

### **Scenario**
You have sales data for different regions, and you want to calculate the **total sales for each region**.

---

### **Input Data**
Here’s the input dataset (key-value pairs):
```python
data = [
    ("North", 100),
    ("South", 200),
    ("North", 150),
    ("East", 50),
    ("South", 300),
    ("East", 100),
    ("North", 50)
]
```

---

### **Step-by-Step Execution of `reduceByKey`**

#### 1. **Create an RDD**
First, we create an RDD from the input data:
```python
rdd = sc.parallelize(data)
```

At this stage, the RDD contains:
```
[("North", 100), ("South", 200), ("North", 150), ("East", 50), ("South", 300), ("East", 100), ("North", 50)]
```

---

#### 2. **Group Data by Key**
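Conceptually, `reduceByKey` brings together all the values that share a key, even when those values start out in **different partitions**. The snippet below illustrates that logical grouping; it is not code you run yourself: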

```python
grouped_data = {
    "North": [100, 150, 50],
    "South": [200, 300],
    "East": [50, 100]
}
```

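In practice, Spark does not build these full lists. It first combines the values for each key within each partition, then shuffles only the partial results between partitions and merges them.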
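
The two phases (combine inside each partition, then merge the partial results) can be imitated in plain Python without Spark. This is a minimal sketch, not Spark's implementation: the two-partition split and the helper names `combine_locally` and `merge_partials` are assumptions made for illustration, and Spark chooses the real partition layout itself.

```python
# Hypothetical split of the input into two partitions (illustration only).
partitions = [
    [("North", 100), ("South", 200), ("North", 150), ("East", 50)],
    [("South", 300), ("East", 100), ("North", 50)],
]


def combine_locally(partition, func):
    """Reduce the values that share a key inside a single partition."""
    partial = {}
    for key, value in partition:
        partial[key] = func(partial[key], value) if key in partial else value
    return partial


def merge_partials(partials, func):
    """Merge the per-partition results into one value per key."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


add = lambda x, y: x + y

# Phase 1: each partition is reduced on its own.
partials = [combine_locally(p, add) for p in partitions]
# Phase 2: only the small partial results are merged.
totals = merge_partials(partials, add)

print(partials)
print(totals)
```

Only one partial sum per key and partition has to move in the second phase, which is why this approach tends to transfer less data than collecting every value per key first.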

---

#### 3. **Apply the Reduction Function**
Next, the **reduction function** is applied to each group of values. In this example, the function is:
```python
lambda x, y: x + y
```

This function takes two values at a time and combines them. Here’s how it works for each key:

- **For "North"**:
  ```
  Step 1: 100 + 150 = 250
  Step 2: 250 + 50 = 300
  Result: ("North", 300)
  ```

- **For "South"**:
  ```
  Step 1: 200 + 300 = 500
  Result: ("South", 500)
  ```

- **For "East"**:
  ```
  Step 1: 50 + 100 = 150
  Result: ("East", 150)
  ```
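
The pairwise steps listed above can be reproduced with Python's standard library. This is a small sketch outside Spark: `functools.reduce` folds the values two at a time, and `itertools.accumulate` exposes the intermediate results. The `grouped` dictionary is only a stand-in for the values belonging to each key.

```python
from functools import reduce
from itertools import accumulate

# Stand-in for the values that belong to each key (illustration only).
grouped = {
    "North": [100, 150, 50],
    "South": [200, 300],
    "East": [50, 100],
}

add = lambda x, y: x + y

# Running results after each pairwise application of the function.
running = {region: list(accumulate(values, add)) for region, values in grouped.items()}
# Final value per key after all pairwise applications.
totals = {region: reduce(add, values) for region, values in grouped.items()}

for region in grouped:
    print(region, running[region], totals[region])
```

The last element of each running list equals the total for that key, matching the results in the walkthrough.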

---

#### 4. **Create a New RDD**
After applying the reduction function, a new RDD is created with the results:
```
[("North", 300), ("South", 500), ("East", 150)]
```

---

### **Complete Code Example**
Here’s the complete Spark code to compute total sales by region:
```python
from pyspark import SparkConf, SparkContext

# Spark Configuration
conf = SparkConf().setMaster("local").setAppName("ReduceByKeyExample")
sc = SparkContext(conf=conf)

# Input data
data = [
    ("North", 100),
    ("South", 200),
    ("North", 150),
    ("East", 50),
    ("South", 300),
    ("East", 100),
    ("North", 50)
]

# Step 1: Create an RDD
rdd = sc.parallelize(data)

# Step 2: Apply reduceByKey to calculate total sales per region
result_rdd = rdd.reduceByKey(lambda x, y: x + y)

# Step 3: Collect the results to view them
results = result_rdd.collect()

# Step 4: Print the results
for region, total_sales in results:
    print(f"{region}: {total_sales}")

# Release the resources held by the local Spark context
sc.stop()
```

---

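### **Output**
The totals are always the same, but the order of the lines may differ from what is shown here, because `collect()` does not sort the results by key: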
```
North: 300
South: 500
East: 150
```

---

### **Visualization of Steps**

1. **Input Data**:
   ```
   ("North", 100), ("South", 200), ("North", 150), ("East", 50), ("South", 300), ("East", 100), ("North", 50)
   ```

2. **Grouped by Key**:
   ```
   "North": [100, 150, 50]
   "South": [200, 300]
   "East": [50, 100]
   ```

3. **Reduced by Key**:
   ```
   "North": 300
   "South": 500
   "East": 150
   ```

---

### **Key Points**
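1. **Values Are Combined per Key**: All values for the same key end up in a single result. Spark pre-combines them within each partition before the shuffle instead of collecting full lists first.
2. **Reduction Happens in Steps**: The function combines two values at a time until one value per key remains, so it should be associative and commutative.
3. **Distributed Processing**: The per-partition combining and the final merge run in parallel across Spark partitions, which is what lets the operation scale.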
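
Because values are combined partition by partition and the partial results are then merged, the function passed to `reduceByKey` should be associative and commutative. The sketch below, in plain Python with no Spark, shows why. The helper `reduce_in_partitions` and both partition layouts are hypothetical; they only imitate reducing each partition first and merging afterwards.

```python
from functools import reduce


def reduce_in_partitions(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, p) for p in partitions if p]
    return reduce(func, partials)


# The same "North" values, split across partitions in two different ways.
layout_a = [[100, 150], [50]]
layout_b = [[100], [150, 50]]

add = lambda x, y: x + y  # associative and commutative
sub = lambda x, y: x - y  # neither associative nor commutative

add_a = reduce_in_partitions(layout_a, add)
add_b = reduce_in_partitions(layout_b, add)
sub_a = reduce_in_partitions(layout_a, sub)
sub_b = reduce_in_partitions(layout_b, sub)

print(add_a, add_b)  # identical for both layouts
print(sub_a, sub_b)  # depends on how the data was split
```

Addition gives the same total for both layouts, while subtraction gives a result that depends on the partitioning, so it is not a safe choice for a reduction function.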
