# Understanding Bucketing in PySpark

## Learning Objectives

By the end of this notebook, you will understand:

1. **What bucketing is** and how it differs from partitioning
2. **Why bucketing is important** for optimizing joins and queries
3. **How to create bucketed tables** in Spark
4. **When to use bucketing** vs when to use partitioning
5. **Best practices** for bucketing in production
6. **Common mistakes** and how to avoid them

## Prerequisites

- Understanding of Spark architecture (executors, cores, tasks) - see `08_a_Spark_Architecture.ipynb`
- Understanding of partitions - see `08_b_Partitions_Concepts.ipynb`
- Understanding of joins - see `06_joins.ipynb`
- Basic familiarity with Spark DataFrame operations

---

> **Note:** This notebook builds on the concepts from `08_b_Partitions_Concepts.ipynb` and `06_joins.ipynb`. Make sure you understand partitions and joins before proceeding.


## Introduction: What is Bucketing?

### The Real-World Analogy

**Think of bucketing like organizing books in a library:**

**Partitioning (by topic):**
- Books are organized by topic (Science, History, Fiction)
- Each topic has its own section
- You know which section to go to

**Bucketing (by author within topic):**
- Within each topic, books are organized by author
- Authors are distributed into buckets (A-F, G-M, N-Z)
- When you join two tables, matching buckets are processed together

### Technical Definition

**Bucketing:**
> A technique that divides data into a fixed number of buckets based on the hash value of one or more columns. Data with the same hash value goes into the same bucket.

**Key Characteristics:**
- Fixed number of buckets (e.g., 32, 64, 128)
- Based on hash function (deterministic)
- Same values always go to the same bucket
- Optimizes joins and aggregations

### Why Does This Matter?

**Without bucketing:**
- Joins between large tables typically shuffle both sides by the join key
- Expensive network transfers
- Slow query performance

**With bucketing:**
- Joins only need to match corresponding buckets
- Little or no shuffling for the join itself
- Often much faster queries on large, frequently joined tables


## Bucketing vs Partitioning: Understanding the Difference

### Partitioning

**Partitioning divides data by column values:**
```python
# Partitioned by date
df.write.partitionBy("date").parquet("path/")
# Creates: path/date=2024-01-01/, date=2024-01-02/, etc.
```

**Characteristics:**
- Creates separate directories/folders
- Based on actual column values
- Number of partitions = number of distinct values
- Used for filtering and pruning

### Bucketing

**Bucketing divides data by hash values:**
```python
# Bucketed by customer_id into 32 buckets
df.write.bucketBy(32, "customer_id").saveAsTable("bucketed_table")
# Rows are assigned to 32 buckets by hash; each write task emits one file per bucket it holds
```

**Characteristics:**
- Fixed number of buckets (you specify)
- Based on hash function (not actual values)
- Number of buckets = fixed (e.g., 32)
- Used for optimizing joins

### Visual Comparison

**Partitioning (by region):**
```
data/
  ├── region=North/
  ├── region=South/
  ├── region=East/
  └── region=West/
```
- 4 partitions (one per region)
- Each partition is a separate directory

**Bucketing (by customer_id, 4 buckets):**
```
data/
  ├── part-00000.parquet  (bucket 0: hash % 4 == 0)
  ├── part-00001.parquet  (bucket 1: hash % 4 == 1)
  ├── part-00002.parquet  (bucket 2: hash % 4 == 2)
  └── part-00003.parquet  (bucket 3: hash % 4 == 3)
```
- 4 buckets (fixed number)
- All buckets in same directory
- Data distributed by hash
- File names are simplified here; real bucket files have longer names that include the bucket ID

### Key Differences

| Aspect | Partitioning | Bucketing |
|--------|-------------|-----------|
| **Purpose** | Filtering, pruning | Join optimization |
| **Number** | Variable (based on values) | Fixed (you specify) |
| **Based on** | Actual column values | Hash of column values |
| **Storage** | Separate directories | Same directory, different files |
| **Use case** | Time-series, categorical data | Join keys, frequently joined columns |
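
To make the contrast in the table concrete, here is a minimal pure-Python sketch (no Spark involved). It groups the same rows once by the actual value of a column and once by a hash modulo a fixed bucket count. The sample rows, the helper names, and the use of `zlib.crc32` as a stand-in hash are all assumptions for illustration; Spark uses its own Murmur3-based hash.

```python
import zlib
from collections import defaultdict

# Hypothetical sample rows: (customer_id, region)
rows = [(cid, f"Region_{cid % 5}") for cid in range(1000)]

def partition_by_value(rows, key_index):
    """Group rows by the actual column value (one group per distinct value)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)

def bucket_by_hash(rows, key_index, num_buckets):
    """Group rows by a stand-in hash of the column value, modulo a fixed bucket count."""
    buckets = defaultdict(list)
    for row in rows:
        digest = zlib.crc32(str(row[key_index]).encode("utf-8"))
        buckets[digest % num_buckets].append(row)
    return dict(buckets)

by_region = partition_by_value(rows, key_index=1)
by_customer_bucket = bucket_by_hash(rows, key_index=0, num_buckets=4)

print(len(by_region))            # one group per distinct region
print(len(by_customer_bucket))   # never more than 4, however many customers exist
```

The number of value-based groups grows with the number of distinct values, while the number of hash-based groups is capped by the bucket count chosen up front.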


## Why Bucketing Matters: The Join Problem

### The Problem: Expensive Joins

**Scenario: Joining two large tables**

```python
# Large table 1: Sales data (100 GB)
sales_df = spark.read.parquet("sales.parquet")

# Large table 2: Customer data (10 GB)
customers_df = spark.read.parquet("customers.parquet")

# Join on customer_id
result = sales_df.join(customers_df, on="customer_id")
```

**What happens without bucketing:**
1. All data from both tables is shuffled
2. Data is redistributed across the cluster
3. Network I/O is massive (100+ GB)
4. Join is slow and expensive

### The Solution: Bucketing

**With bucketing:**
```python
# Both tables bucketed by customer_id (same number of buckets)
sales_df.write.bucketBy(32, "customer_id").saveAsTable("sales_bucketed")
customers_df.write.bucketBy(32, "customer_id").saveAsTable("customers_bucketed")

# Join only matches corresponding buckets
sales_bucketed = spark.table("sales_bucketed")
customers_bucketed = spark.table("customers_bucketed")
result = sales_bucketed.join(customers_bucketed, on="customer_id")
```

**What happens with bucketing:**
1. Only matching buckets are joined (bucket 0 with bucket 0, etc.)
2. No shuffle is needed for the join (rows are already grouped by the hash of the join key)
3. Minimal network I/O
4. Join is fast and efficient

### Visual Representation

**Without Bucketing:**
```
Sales Table          Customers Table
[All data]    →     [All data]
     ↓                   ↓
   Shuffle ←────────→ Shuffle
     ↓                   ↓
   Join (expensive!)
```

**With Bucketing:**
```
Sales Table          Customers Table
Bucket 0 ──────────→ Bucket 0  (join locally)
Bucket 1 ──────────→ Bucket 1  (join locally)
Bucket 2 ──────────→ Bucket 2  (join locally)
...
No shuffle needed!
```

### Key Insight

> **Bucketing pre-organizes data so that rows with the same join key values are in the same bucket. When you join two bucketed tables on the same key, Spark only needs to join corresponding buckets, avoiding expensive shuffles.**


## How Bucketing Works

### The Hash Function

**Bucketing uses a hash function:**
```python
# Conceptual formula: Spark uses a Murmur3 hash and a non-negative modulo
bucket_number = hash(column_value) % number_of_buckets
```

**Example:**
```python
# 4 buckets, customer_id = 123
bucket = hash(123) % 4
# Result: one fixed bucket in 0..3; customer_id = 123 maps to that same bucket every time
```

**Key Properties:**
- **Deterministic:** Same value always goes to same bucket
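- **Distributed:** Values spread roughly evenly across buckets when the key has many distinct values; all rows for one heavily repeated key still land in a single bucket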
- **Fast:** Hash computation is very fast
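
The formula above can be tried out in plain Python. This is a sketch under assumptions: `bucket_for` is a hypothetical helper, and `zlib.crc32` stands in for Spark's Murmur3 hash, so the bucket numbers it produces will not match the ones Spark assigns. It only illustrates that the mapping is deterministic, always falls in `0 .. num_buckets - 1`, and cannot split up a single hot key.

```python
import zlib
from collections import Counter

def bucket_for(value, num_buckets):
    """Map a value to a bucket ID using a stand-in hash (not Spark's Murmur3)."""
    digest = zlib.crc32(str(value).encode("utf-8"))
    # Python's % returns a non-negative result for a positive modulus,
    # which mirrors the non-negative modulo that bucketing relies on.
    return digest % num_buckets

# Deterministic: the same key maps to the same bucket on every call
first = bucket_for(123, 4)
second = bucket_for(123, 4)
print(first == second)  # True

# Skew: 9,000 rows share one key, so they all end up in that key's bucket
skewed_ids = [42] * 9_000 + list(range(1_000))
counts = Counter(bucket_for(customer_id, 4) for customer_id in skewed_ids)
print(max(counts.values()) >= 9_000)  # True
```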

### Step-by-Step: Creating Bucketed Data

**Step 1: Write data with bucketing**
```python
df.write.bucketBy(32, "customer_id").saveAsTable("bucketed_table")
```

**Step 2: What Spark does:**
1. Reads each row
2. Computes hash of `customer_id`
3. Determines bucket: `hash(customer_id) % 32`
4. Writes row to corresponding bucket file

**Step 3: Result:**
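- Up to 32 files per write task (one per bucket that the task holds rows for); repartitioning by `customer_id` before the write keeps this close to one file per bucket
- Each file contains only rows whose hash maps to the same bucket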
- Data is pre-organized for joins
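
How many files a bucketed write produces depends on how rows are spread across write tasks, because each task writes its own file for every bucket it holds rows for. The pure-Python simulation below is a sketch of that bookkeeping, not Spark code: the task count, the `files_written` helper, and the `crc32` stand-in hash are assumptions for illustration.

```python
import zlib

NUM_BUCKETS = 32

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Stand-in hash bucket (Spark uses Murmur3, so real IDs differ)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def files_written(tasks):
    """Each task emits one file per distinct bucket found in its rows."""
    return sum(len({bucket_for(cid) for cid in task_rows}) for task_rows in tasks)

customer_ids = list(range(10_000))

# Case 1: rows spread round-robin over 11 tasks, so every task sees many buckets
round_robin_tasks = [customer_ids[i::11] for i in range(11)]

# Case 2: rows regrouped first so that each task holds exactly one bucket
by_bucket = {}
for cid in customer_ids:
    by_bucket.setdefault(bucket_for(cid), []).append(cid)
regrouped_tasks = list(by_bucket.values())

print(files_written(round_robin_tasks))  # up to 11 * 32 files
print(files_written(regrouped_tasks))    # one file per non-empty bucket
```

In Spark, the regrouping in case 2 corresponds to calling `df.repartition(32, "customer_id")` before `bucketBy(32, "customer_id")`. Treat that as a commonly used pattern to verify on your own Spark version rather than a guarantee.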

### Visual Example

**Input Data:**
```
customer_id | name    | amount
------------|---------|--------
100         | Alice   | 1000
101         | Bob     | 2000
102         | Charlie | 1500
103         | David   | 3000
```

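**After Bucketing (4 buckets, illustrative: the hash is shown as the ID itself, whereas Spark's real hash assigns IDs differently):**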
```
Bucket 0 (hash % 4 == 0): customer_id 100, 104, 108, ...
Bucket 1 (hash % 4 == 1): customer_id 101, 105, 109, ...
Bucket 2 (hash % 4 == 2): customer_id 102, 106, 110, ...
Bucket 3 (hash % 4 == 3): customer_id 103, 107, 111, ...
```

**When joining:**
- Bucket 0 from table 1 joins with Bucket 0 from table 2
- Bucket 1 from table 1 joins with Bucket 1 from table 2
- No cross-bucket joins needed!
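
The bucket-to-bucket matching described above can be checked with a small pure-Python simulation. It is a sketch only: the tables are hypothetical in-memory lists, `crc32` stands in for Spark's hash, and the join is a simple nested loop. The point is that joining bucket *i* of one table with bucket *i* of the other yields the same result as joining the full tables.

```python
import zlib

NUM_BUCKETS = 4

def bucket_for(key):
    """Stand-in hash bucket; both tables must use the same function and count."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_BUCKETS

def to_buckets(rows):
    """Split rows (key first) into NUM_BUCKETS lists by the hash of the key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_for(row[0])].append(row)
    return buckets

def join(left, right):
    """Inner join on the first field of each row."""
    return [
        (left_row[0], left_row[1], right_row[1])
        for left_row in left
        for right_row in right
        if left_row[0] == right_row[0]
    ]

# Hypothetical data: (customer_id, amount) and (customer_id, name)
sales = [(i % 50, 100.0 + i) for i in range(500)]
customers = [(i, f"Customer_{i}") for i in range(50)]

full_join = join(sales, customers)

# Join only bucket i with bucket i; no row ever needs to cross buckets
bucketed_join = []
for sales_bucket, customer_bucket in zip(to_buckets(sales), to_buckets(customers)):
    bucketed_join.extend(join(sales_bucket, customer_bucket))

print(sorted(full_join) == sorted(bucketed_join))  # True
```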


## Practical Example: Demonstrating Bucketing

Let's see bucketing in action with a practical example.


In [1]:
# Initialize Spark Session
from pyspark.sql import SparkSession
import time

# Create Spark session
spark = SparkSession.builder \
    .appName("BucketingDemo") \
    .master("local[*]") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

print("=" * 70)
print("SPARK SESSION INITIALIZED")
print("=" * 70)
print(f"Spark Version: {spark.version}")
print(f"Default Parallelism: {spark.sparkContext.defaultParallelism}")
print("=" * 70)




SPARK SESSION INITIALIZED
Spark Version: 3.5.1
Default Parallelism: 11


### Step 1: Create Sample Data


In [2]:
# Create sample data for demonstration
print("=" * 70)
print("CREATING SAMPLE DATA")
print("=" * 70)

# Sales data
sales_data = [(i, f"Product_{i % 100}", 100.0 + i, i % 1000) 
              for i in range(10000)]
sales_df = spark.createDataFrame(sales_data, ["sale_id", "product", "amount", "customer_id"])

# Customer data
customer_data = [(i, f"Customer_{i}", f"Region_{i % 5}") 
                 for i in range(1000)]
customers_df = spark.createDataFrame(customer_data, ["customer_id", "customer_name", "region"])

print(f"\nSales DataFrame: {sales_df.count():,} rows")
print(f"Customers DataFrame: {customers_df.count():,} rows")
print(f"\nBoth tables have 'customer_id' column for joining")
print("=" * 70)


CREATING SAMPLE DATA




Sales DataFrame: 10,000 rows
Customers DataFrame: 1,000 rows

Both tables have 'customer_id' column for joining


                                                                                

### Step 2: Join Without Bucketing (The Problem)


In [3]:
# Join without bucketing - requires shuffle
print("=" * 70)
print("JOIN WITHOUT BUCKETING")
print("=" * 70)

print("\nPerforming join on non-bucketed tables...")
print("⚠️  This will trigger a shuffle operation!")

start = time.time()
joined_df = sales_df.join(customers_df, on="customer_id", how="inner")
result_count = joined_df.count()
join_time = time.time() - start

print(f"\n✅ Join completed!")
print(f"   • Result: {result_count:,} rows")
print(f"   • Time: {join_time:.3f} seconds")
print(f"   • What happened:")
print(f"     - All data from both tables was shuffled")
print(f"     - Data was redistributed across the cluster")
print(f"     - Expensive network I/O occurred")
print("=" * 70)


JOIN WITHOUT BUCKETING

Performing join on non-bucketed tables...
⚠️  This will trigger a shuffle operation!

✅ Join completed!
   • Result: 10,000 rows
   • Time: 1.037 seconds
   • What happened:
     - All data from both tables was shuffled
     - Data was redistributed across the cluster
     - Expensive network I/O occurred


### Step 3: Create Bucketed Tables


In [4]:
# Create bucketed tables
print("=" * 70)
print("CREATING BUCKETED TABLES")
print("=" * 70)

# Number of buckets (powers of 2 such as 32, 64, 128 are a common convention, not a Spark requirement)
num_buckets = 32

print(f"\nCreating bucketed tables with {num_buckets} buckets...")
print("Both tables will be bucketed by 'customer_id'")

# Write sales data as bucketed table
print("\n1️⃣  Creating bucketed sales table...")
sales_df.write \
    .mode("overwrite") \
    .bucketBy(num_buckets, "customer_id") \
    .sortBy("customer_id") \
    .saveAsTable("sales_bucketed")

print("   ✅ Sales table bucketed by customer_id")

# Write customers data as bucketed table
print("\n2️⃣  Creating bucketed customers table...")
customers_df.write \
    .mode("overwrite") \
    .bucketBy(num_buckets, "customer_id") \
    .sortBy("customer_id") \
    .saveAsTable("customers_bucketed")

print("   ✅ Customers table bucketed by customer_id")

print(f"\n💡 Key Points:")
print(f"   • Both tables have {num_buckets} buckets")
print(f"   • Both are bucketed by the same column (customer_id)")
print(f"   • Same customer_id values will be in the same bucket number")
print("=" * 70)


CREATING BUCKETED TABLES

Creating bucketed tables with 32 buckets...
Both tables will be bucketed by 'customer_id'

1️⃣  Creating bucketed sales table...



   ✅ Sales table bucketed by customer_id

2️⃣  Creating bucketed customers table...

   ✅ Customers table bucketed by customer_id

💡 Key Points:
   • Both tables have 32 buckets
   • Both are bucketed by the same column (customer_id)
   • Same customer_id values will be in the same bucket number


### Step 4: Join With Bucketing (The Solution)


In [5]:
# Read bucketed tables and join
print("=" * 70)
print("JOIN WITH BUCKETING")
print("=" * 70)

# Read the bucketed tables
sales_bucketed = spark.table("sales_bucketed")
customers_bucketed = spark.table("customers_bucketed")

print("\nReading bucketed tables...")
print("Performing join on bucketed tables...")
print("✅ This should avoid shuffle!")

start = time.time()
joined_bucketed = sales_bucketed.join(customers_bucketed, on="customer_id", how="inner")
result_count_bucketed = joined_bucketed.count()
join_time_bucketed = time.time() - start

print(f"\n✅ Join completed!")
print(f"   • Result: {result_count_bucketed:,} rows")
print(f"   • Time: {join_time_bucketed:.3f} seconds")
print(f"   • What happened:")
print(f"     - Only corresponding buckets were joined (bucket 0 with bucket 0, etc.)")
print(f"     - No shuffle was needed (data already co-located)")
print(f"     - Minimal network I/O")

# Compare performance (guard on the divisor to avoid a ZeroDivisionError)
if join_time_bucketed > 0:
    speedup = join_time / join_time_bucketed
    print(f"\n🚀 Performance Comparison:")
    print(f"   • Without bucketing: {join_time:.3f}s")
    print(f"   • With bucketing: {join_time_bucketed:.3f}s")
    if speedup > 1:
        print(f"   • Bucketing is {speedup:.2f}× faster!")
    else:
        print(f"   • Note: For small datasets, the difference may not be significant")
        print(f"     Bucketing shows more benefit with larger datasets and clusters")
print("=" * 70)


JOIN WITH BUCKETING

Reading bucketed tables...
Performing join on bucketed tables...
✅ This should avoid shuffle!

✅ Join completed!
   • Result: 10,000 rows
   • Time: 0.782 seconds
   • What happened:
     - Only corresponding buckets were joined (bucket 0 with bucket 0, etc.)
     - No shuffle was needed (data already co-located)
     - Minimal network I/O

🚀 Performance Comparison:
   • Without bucketing: 1.037s
   • With bucketing: 0.782s
   • Bucketing is 1.33× faster!

> **Note:** Treat this timing as illustrative only. It is a single run on 10,000 rows, and the first measurement also includes materializing the in-memory Python data. The bucketed tables are also small enough that Spark may pick a broadcast join regardless of bucketing. To confirm what actually ran, call `joined_bucketed.explain()` and look for `Exchange` (shuffle) operators in the plan.


## Understanding Bucket Requirements

### Critical Requirements for Bucketing to Work

**For bucketing to optimize joins, you MUST:**

1. **Same number of buckets**
   - Both tables should have the same number of buckets
   - Example: Both tables have 32 buckets
   - With different counts, Spark by default shuffles at least one side

2. **Same bucketing column**
   - Both tables must be bucketed by the same column(s)
   - Example: Both bucketed by `customer_id`

3. **Join on bucketing column**
   - The join must be on the bucketing column
   - Example: `df1.join(df2, on="customer_id")` where both are bucketed by `customer_id`

### What Happens If Requirements Aren't Met?

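**If requirements aren't met:**
- Spark will still perform the join
- But it falls back to shuffling one or both sides
- Little or no benefit from bucketing
- No warning is raised; the extra `Exchange` (shuffle) steps only show up in the query plan (`explain()` or the SQL tab of the Spark UI)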
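
One way to check whether a join actually avoided the shuffle is to look for `Exchange` operators in the physical plan that `DataFrame.explain()` prints. The helper below is a rough sketch: `captured_plan` and `count_shuffles` are hypothetical names, and matching on the text of the plan is a heuristic that may need adjusting across Spark versions. It assumes PySpark prints the plan to standard output; if your version behaves differently, paste the plan text into `count_shuffles` directly.

```python
import contextlib
import io

def captured_plan(df):
    """Return the text that df.explain() prints (df is any object with explain())."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()

def count_shuffles(plan_text):
    """Count plan lines that mention a shuffle Exchange (broadcast exchanges excluded)."""
    count = 0
    for line in plan_text.splitlines():
        if "Exchange" in line and "BroadcastExchange" not in line:
            count += 1
    return count
```

With the tables from this notebook, `count_shuffles(captured_plan(joined_bucketed))` should return `0` when no shuffle is planned for the join, and a positive number when Spark falls back to a shuffle-based join.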

### Example: Requirements Met ✅

```python
# Both tables: 32 buckets, bucketed by customer_id
sales_df.write.bucketBy(32, "customer_id").saveAsTable("sales")
customers_df.write.bucketBy(32, "customer_id").saveAsTable("customers")

# Join on customer_id
result = spark.table("sales").join(spark.table("customers"), on="customer_id")
# ✅ Bucket join - no shuffle!
```

### Example: Requirements NOT Met ❌

```python
# Different number of buckets
sales_df.write.bucketBy(32, "customer_id").saveAsTable("sales")
customers_df.write.bucketBy(64, "customer_id").saveAsTable("customers")

# Join on customer_id
result = spark.table("sales").join(spark.table("customers"), on="customer_id")
# ❌ Bucket counts differ - by default Spark shuffles at least one side
```
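
Bucket counts where one is a multiple of the other are not entirely incompatible. Because of how the modulo works, a row's bucket under the smaller count can be derived from its bucket under the larger count. Spark's optional bucket coalescing (`spark.sql.bucketing.coalesceBucketsInJoin.enabled`, available from Spark 3.1 and off by default) relies on this property. The pure-Python check below sketches that arithmetic; `crc32` is a stand-in for Spark's hash and the helper name is hypothetical.

```python
import zlib

def bucket_for(key, num_buckets):
    """Stand-in hash bucket (Spark uses Murmur3, so real IDs differ)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

keys = range(10_000)

# 64 is a multiple of 32: folding the 64-bucket ID with % 32 gives the 32-bucket ID
multiples_align = all(bucket_for(k, 64) % 32 == bucket_for(k, 32) for k in keys)
print(multiples_align)  # True

# 48 is not a multiple of 32: the same folding no longer lines up for every key
non_multiples_align = all(bucket_for(k, 48) % 32 == bucket_for(k, 32) for k in keys)
print(non_multiples_align)  # False
```

This is one practical reason to pick bucket counts from a single family such as powers of 2, even though matching counts exactly remains the simplest option.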


## Choosing the Number of Buckets

### How Many Buckets Should You Use?

**General Guidelines:**

1. **Power of 2** (common convention)
   - 16, 32, 64, 128, 256, 512, 1024
   - Not required by Spark, but counts that divide one another evenly are easier to keep compatible across tables

2. **Based on data size**
   - Small data (< 1 GB): 16-32 buckets
   - Medium data (1-100 GB): 32-128 buckets
   - Large data (> 100 GB): 128-512 buckets

3. **Based on cluster size**
   - Ideally a multiple of the total number of cores, so all cores stay busy
   - Example: 16 cores → 32, 64, or 128 buckets

4. **Avoid too many buckets**
   - Too many small files (overhead)
   - Generally avoid > 1000 buckets

### Common Choices

| Data Size | Recommended Buckets | Reason |
|-----------|-------------------|--------|
| < 1 GB | 16-32 | Small data, fewer buckets sufficient |
| 1-10 GB | 32-64 | Balanced performance |
| 10-100 GB | 64-128 | More buckets for better distribution |
| > 100 GB | 128-512 | Large data needs more buckets |
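
The table above can be turned into a small lookup helper. This is just a sketch that encodes the ranges listed in this notebook; `suggest_bucket_range` is a hypothetical name, the handling of the exact boundary values is an assumption, and the numbers are rules of thumb rather than anything Spark enforces.

```python
def suggest_bucket_range(size_gb):
    """Return the (low, high) bucket-count range from the table for a data size in GB."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if size_gb < 1:
        return (16, 32)
    if size_gb < 10:
        return (32, 64)
    if size_gb <= 100:
        return (64, 128)
    return (128, 512)

for size in (0.5, 5, 50, 500):
    low, high = suggest_bucket_range(size)
    print(f"{size} GB -> {low}-{high} buckets")
```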

### Rule of Thumb

> **Start with 32 or 64 buckets. Adjust based on your data size and performance requirements. Use powers of 2 for best results.**


## Bucketing with Sorting

### Adding Sort Within Buckets

**You can also sort data within each bucket:**

```python
df.write \
    .bucketBy(32, "customer_id") \
    .sortBy("customer_id", "order_date") \
    .saveAsTable("bucketed_sorted")
```

**Benefits:**
- Data within each bucket is sorted
- Can optimize range queries
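- Can reduce sorting work in sort-merge joins (whether Spark actually skips the sort depends on the version, configuration, and file layout)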
- Better compression

**Trade-offs:**
- Takes longer to write (sorting overhead)
- More CPU during write
- Use when you frequently query by sorted columns

### When to Use Sorting

**Use sorting when:**
- You frequently filter by sorted columns
- You do range queries
- You want better compression
- Write time is not critical

**Skip sorting when:**
- You only do equality joins
- Write time is critical
- Data is already well-distributed


## When to Use Bucketing

### ✅ Good Use Cases for Bucketing

**1. Frequent Joins on Same Key**
```python
# You frequently join on customer_id
sales.join(customers, on="customer_id")
orders.join(customers, on="customer_id")
# Bucket all tables by customer_id
```

**2. Large Tables**
```python
# Large tables that are joined frequently
large_table1.write.bucketBy(128, "join_key").saveAsTable("table1")
large_table2.write.bucketBy(128, "join_key").saveAsTable("table2")
```

**3. Multiple Joins on Same Column**
```python
# Multiple tables joined on same key
df1.join(df2, on="key").join(df3, on="key")
# Bucket all by "key"
```

**4. Aggregations After Joins**
```python
# Join then aggregate
joined = df1.join(df2, on="key")
result = joined.groupBy("key").agg(...)
# Bucketing helps both join and aggregation
```

### ❌ When NOT to Use Bucketing

**1. Small Tables**
```python
# Small tables don't benefit much (Spark can usually broadcast them instead)
small_df.write.bucketBy(32, "key").saveAsTable("small_table")  # ❌ Overhead not worth it
```

**2. Frequently Changing Data**
```python
# If data changes frequently, bucketing overhead is high
# Each write requires rebucketing
```

**3. Different Join Keys**
```python
# If you join on different keys, bucketing doesn't help
df1.join(df2, on="key1")
df1.join(df3, on="key2")  # Different key!
```

**4. One-Time Queries**
```python
# If you only query once, bucketing setup cost isn't worth it
```

### Decision Tree

```
Will you join these tables frequently?
│
├─ NO → Don't bucket ❌
│
└─ YES
   │
   ├─ Are tables large (> 1 GB)?
   │  │
   │  ├─ NO → Don't bucket ❌ (overhead not worth it)
   │  │
   │  └─ YES
   │     │
   │     └─ Will you join on the same key?
   │        │
   │        ├─ NO → Don't bucket ❌ (won't help)
   │        │
   │        └─ YES → Bucket ✅
```


## Bucketing vs Partitioning: When to Use Each

### Use Partitioning When:

**1. Time-Series Data**
```python
# Partition by date for time-based queries
df.write.partitionBy("date").parquet("path/")
```

**2. Categorical Filtering**
```python
# Partition by region for region-based queries
df.write.partitionBy("region").parquet("path/")
```

**3. Data Pruning**
```python
# Partitioning allows Spark to skip entire partitions
from pyspark.sql.functions import col

spark.read.parquet("path/").filter(col("date") == "2024-01-01")
# Only reads date=2024-01-01/ partition
```

### Use Bucketing When:

**1. Join Optimization**
```python
# Bucket for join performance
df1.write.bucketBy(32, "join_key").saveAsTable("table1")
df2.write.bucketBy(32, "join_key").saveAsTable("table2")
```

**2. High Cardinality Columns**
```python
# Customer IDs, order IDs, etc. (too many distinct values for partitioning)
df.write.bucketBy(64, "customer_id").saveAsTable("customers")
```

**3. Multiple Joins on Same Key**
```python
# Multiple tables joined on same key
# Bucket all by that key
```

### Use Both Together:

**You can combine partitioning and bucketing:**

```python
# Partition by date, bucket by customer_id
df.write \
    .partitionBy("date") \
    .bucketBy(32, "customer_id") \
    .saveAsTable("sales")
```

**Benefits:**
- Partitioning: Prunes by date (skips irrelevant dates)
- Bucketing: Optimizes joins on customer_id

**Use when:**
- You filter by partition column (date)
- You join on bucket column (customer_id)


## Best Practices

### ✅ DO

1. **Use same number of buckets for joined tables**
   ```python
   # Both tables must have same bucket count
   df1.write.bucketBy(32, "key").saveAsTable("table1")
   df2.write.bucketBy(32, "key").saveAsTable("table2")  # Same: 32
   ```

2. **Use powers of 2 for bucket count**
   ```python
   # Good: 16, 32, 64, 128, 256
   df.write.bucketBy(32, "key").saveAsTable("table")
   ```

3. **Bucket by join keys**
   ```python
   # Bucket by columns you frequently join on
   df.write.bucketBy(32, "customer_id").saveAsTable("sales")
   ```

4. **Use appropriate bucket count**
   ```python
   # Based on data size: 32-128 for most cases
   df.write.bucketBy(64, "key").saveAsTable("table")
   ```

5. **Combine with sorting when beneficial**
   ```python
   # Sort within buckets for range queries
   df.write.bucketBy(32, "key").sortBy("key", "date").saveAsTable("table")
   ```

### ❌ DON'T

1. **Don't use different bucket counts**
   ```python
   # ❌ BAD: Different bucket counts
   df1.write.bucketBy(32, "key").saveAsTable("table1")
   df2.write.bucketBy(64, "key").saveAsTable("table2")  # Mismatch: the join will shuffle
   ```

2. **Don't bucket small tables**
   ```python
   # ❌ BAD: Small table, overhead not worth it
   small_df.write.bucketBy(32, "key").saveAsTable("small_table")  # Overhead > benefit
   ```

3. **Don't use too many buckets**
   ```python
   # ❌ BAD: Too many small files
   df.write.bucketBy(10000, "key").saveAsTable("table")  # Creates thousands of tiny files!
   ```

4. **Don't bucket by wrong column**
   ```python
   # ❌ BAD: Bucketing by column you don't join on
   df.write.bucketBy(32, "product_name").saveAsTable("sales")  # But you join on customer_id!
   ```

5. **Don't forget to bucket both tables**
   ```python
   # ❌ BAD: Only one table bucketed
   df1.write.bucketBy(32, "key").saveAsTable("table1")
   df2.write.saveAsTable("table2")  # Not bucketed - won't help!
   ```


## Common Mistakes and How to Avoid Them

### Mistake 1: Different Bucket Counts

**Wrong:**
```python
df1.write.bucketBy(32, "customer_id").saveAsTable("sales")
df2.write.bucketBy(64, "customer_id").saveAsTable("customers")
# ❌ Different bucket counts - the join still needs a shuffle by default
```

**Correct:**
```python
df1.write.bucketBy(32, "customer_id").saveAsTable("sales")
df2.write.bucketBy(32, "customer_id").saveAsTable("customers")
# ✅ Same bucket count - bucketing will work!
```

### Mistake 2: Bucketing Only One Table

**Wrong:**
```python
df1.write.bucketBy(32, "customer_id").saveAsTable("sales")
df2.write.saveAsTable("customers")  # ❌ Not bucketed!
# Join won't benefit from bucketing
```

**Correct:**
```python
df1.write.bucketBy(32, "customer_id").saveAsTable("sales")
df2.write.bucketBy(32, "customer_id").saveAsTable("customers")
# ✅ Both bucketed - join will be optimized!
```

### Mistake 3: Joining on Different Column

**Wrong:**
```python
# Bucketed by customer_id
df1.write.bucketBy(32, "customer_id").saveAsTable("sales")
df2.write.bucketBy(32, "customer_id").saveAsTable("customers")

# But joining on different column
result = spark.table("sales").join(spark.table("customers"), on="order_id")
# ❌ Join on order_id, but bucketed by customer_id - no benefit!
```

**Correct:**
```python
# Bucket by the join key
df1.write.bucketBy(32, "order_id").saveAsTable("sales")
df2.write.bucketBy(32, "order_id").saveAsTable("customers")

# Join on same column
result = spark.table("sales").join(spark.table("customers"), on="order_id")
# ✅ Join on order_id, bucketed by order_id - works!
```

### Mistake 4: Too Many Buckets

**Wrong:**
```python
# ❌ BAD: Too many buckets for small data
small_df.write.bucketBy(1000, "key").saveAsTable("table")
# Creates 1000 tiny files - overhead!
```

**Correct:**
```python
# ✅ GOOD: Appropriate number of buckets
small_df.write.bucketBy(32, "key").saveAsTable("table")
# Creates 32 reasonably-sized files
```

### Mistake 5: Not Using saveAsTable

**Wrong:**
```python
# ❌ BAD: Writing bucketed output straight to a path
df.write.bucketBy(32, "key").parquet("path/")
# Raises AnalysisException: path-based writes do not support bucketBy
```

**Correct:**
```python
# ✅ GOOD: Using saveAsTable (records the bucketing spec)
df.write.bucketBy(32, "key").saveAsTable("table_name")
# Bucketing information is stored with the table in the metastore catalog
```


## Key Takeaways

### The Core Concept

**Bucketing:**
- ✅ Divides data into fixed number of buckets based on hash
- ✅ Pre-organizes data for efficient joins
- ✅ Avoids expensive shuffles during joins
- ✅ Requires same bucket count and column for joined tables

**Partitioning:**
- ✅ Divides data by actual column values
- ✅ Creates separate directories
- ✅ Used for filtering and data pruning
- ✅ Variable number of partitions

### When to Use Bucketing

**Use bucketing when:**
- You frequently join large tables
- Join is on the same column(s)
- Tables are large (> 1 GB)
- You can control how data is written

**Don't use bucketing when:**
- Tables are small
- You join on different columns
- Data changes frequently
- One-time queries

### Requirements for Bucket Joins

1. **Same number of buckets** in both tables
2. **Same bucketing column(s)** in both tables
3. **Join on bucketing column(s)**
4. **Tables must be saved as catalog tables** (`saveAsTable`), so the bucketing metadata is recorded

### The Golden Rules

1. **Bucket count must match** for joined tables
2. **Bucket by join keys** for maximum benefit
3. **Use powers of 2** for bucket count (16, 32, 64, 128)
4. **Choose bucket count** based on data size
5. **Use saveAsTable** to preserve bucketing metadata

### Remember

1. **Bucketing = Hash-based organization for joins**
2. **Partitioning = Value-based organization for filtering**
3. **Both tables must have same bucket count and column**
4. **Bucketing avoids shuffles, partitioning avoids reads**
5. **You can combine partitioning and bucketing**

### Next Steps

- Practice creating bucketed tables
- Monitor Spark UI to see bucket joins in action
- Experiment with different bucket counts
- Review `06_joins.ipynb` to understand join optimization
- Review `08_b_Partitions_Concepts.ipynb` to understand partitioning


## Summary

### What We Learned

1. **What bucketing is**
   - Divides data into fixed number of buckets based on hash
   - Pre-organizes data for efficient joins
   - Different from partitioning (which uses actual values)

2. **Why bucketing matters**
   - Optimizes joins by avoiding shuffles
   - Pre-organizes data so matching rows are co-located
   - Can significantly improve join performance on large tables

3. **How to create bucketed tables**
   - Use `bucketBy(num_buckets, "column")`
   - Must use `saveAsTable()` to preserve metadata
   - Can combine with `sortBy()` for additional optimization

4. **Requirements for bucket joins**
   - Same number of buckets in both tables
   - Same bucketing column(s)
   - Join on bucketing column(s)
   - Tables saved as catalog tables with `saveAsTable()`

5. **When to use bucketing**
   - Frequent joins on same key
   - Large tables (> 1 GB)
   - Multiple joins on same column
   - When you can control data writing

6. **Best practices**
   - Use powers of 2 for bucket count
   - Choose bucket count based on data size
   - Bucket by join keys
   - Ensure both tables have same bucket configuration

### The Bottom Line

> **Bucketing is a powerful optimization technique that pre-organizes data for efficient joins. By ensuring rows with the same join key values are in the same bucket, Spark can join corresponding buckets without expensive shuffles. Use bucketing when you frequently join large tables on the same key, but remember that both tables must have the same bucket count and bucketing column for it to work.**

---

**Related Notebooks:**
- `08_b_Partitions_Concepts.ipynb` - Understanding partitioning
- `06_joins.ipynb` - Understanding joins and join optimization
- `08_a_Spark_Architecture.ipynb` - Understanding executors, cores, and tasks
- `08_performance_optimization.ipynb` - Comprehensive performance optimization guide


In [6]:
# Clean up - drop tables
spark.sql("DROP TABLE IF EXISTS sales_bucketed")
spark.sql("DROP TABLE IF EXISTS customers_bucketed")

# Stop Spark session
spark.stop()
print("Spark session stopped.")


Spark session stopped.
