# Module 3: Data Transformations & Operations

## 🎯 Learning Objectives

By the end of this module, you will master:

### 🔄 **Core Transformations**
- DataFrame column operations and manipulations
- Row filtering and conditional logic
- Data type conversions and casting
- String manipulations and text processing

### 📊 **Advanced Operations**
- Aggregations and grouping operations
- Window functions for analytics
- Join operations and data combining
- Set operations (union, intersect, except)

### 🧹 **Data Cleaning & Preprocessing**
- Handling null values and missing data
- Data deduplication and uniqueness
- Data validation and quality checks
- Outlier detection and handling

### ⚡ **Performance & Optimization**
- Caching and persistence strategies
- Partitioning for transformations
- Broadcast variables and accumulators
- Catalyst optimizer understanding

### 🛠️ **Real-World Applications**
- ETL pipeline patterns
- Data quality frameworks
- Complex business logic implementation
- Performance monitoring and tuning

**Prerequisites:** Module 1 (Foundation) and Module 2 (Data I/O)  
**Estimated Time:** 90-120 minutes  
**Difficulty:** Intermediate

## 3.1 Overview: Data Transformations in PySpark

Data transformations are the heart of any data processing pipeline. In PySpark, transformations are **lazy operations** that define what you want to do with your data, but don't execute until an **action** is called.

### 🔄 **Transformation Types**

**1. Narrow Transformations**
- Operations where each input partition contributes to only one output partition
- No data shuffle required across the cluster
- Examples: `filter()`, `map()`, `select()`, `withColumn()`
- **Performance**: Fast, highly parallelizable

**2. Wide Transformations**
- Operations that require data from multiple partitions
- Trigger shuffle operations across the cluster
- Examples: `groupBy()`, `join()`, `orderBy()`, `distinct()`
- **Performance**: More expensive, but often necessary

### 📊 **DataFrame API vs SQL**

PySpark provides two equivalent ways to transform data:
- **DataFrame API**: Programmatic, type-safe, chainable operations
- **SQL Interface**: Familiar SQL syntax for complex queries
- **Interchangeable**: Can mix both approaches seamlessly

### ⚡ **Lazy Evaluation Benefits**

1. **Query Optimization**: Catalyst optimizer can analyze the entire pipeline
2. **Efficient Execution**: Combines multiple operations into optimized stages
3. **Memory Management**: Only computes what's needed when needed
4. **Fault Tolerance**: Can replay transformations if nodes fail

### 🎯 **Key Concepts We'll Explore**

- **Immutability**: DataFrames are immutable; transformations create new DataFrames
- **Lineage**: Spark tracks the transformation graph for fault tolerance
- **Partitioning**: How data distribution affects transformation performance
- **Caching**: When and how to persist intermediate results

## 3.2 Environment Setup for Transformations

In [1]:
# Environment Setup and Verification for Module 3: Data Transformations
print("🚀 Module 3: Data Transformations & Operations")
print("=" * 50)

# Core imports
import os
import sys
from pathlib import Path
import random
import time
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Environment verification
conda_env = os.environ.get('CONDA_DEFAULT_ENV', 'Not in conda environment')
print(f"📦 Conda Environment: {conda_env}")

# Set up project paths
project_root = Path.cwd()
data_dir = project_root / "data"
temp_dir = project_root / "temp"
outputs_dir = project_root / "outputs"

# Create directories if they don't exist
for directory in [data_dir, temp_dir, outputs_dir]:
    directory.mkdir(exist_ok=True)
    print(f"📁 Directory ready: {directory.name}/")

# PySpark imports for transformations
try:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    from pyspark.sql.window import Window
    from pyspark import SparkConf, SparkContext
    
    print(f"✅ PySpark imports successful")
except ImportError as e:
    print(f"❌ PySpark import error: {e}")
    sys.exit(1)

# Additional imports for transformations
try:
    import pandas as pd
    import numpy as np
    print(f"✅ Additional libraries imported (pandas, numpy)")
except ImportError as e:
    print(f"⚠️  Optional libraries not available: {e}")

# Set random seeds for reproducibility
random.seed(42)
if 'np' in locals():
    np.random.seed(42)

print(f"\n🎯 Environment Ready for Module 3!")
print(f"   • Project root: {project_root}")
print(f"   • Data directory: {data_dir}")
print(f"   • Temp directory: {temp_dir}")
print(f"   • Random seeds set for reproducibility")
print(f"   • Ready for transformation operations!")

🚀 Module 3: Data Transformations & Operations
📦 Conda Environment: pyspark_env
📁 Directory ready: data/
📁 Directory ready: temp/
📁 Directory ready: outputs/
✅ PySpark imports successful
✅ Additional libraries imported (pandas, numpy)

🎯 Environment Ready for Module 3!
   • Project root: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/notebooks
   • Data directory: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/notebooks/data
   • Temp directory: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/notebooks/temp
   • Random seeds set for reproducibility
   • Ready for transformation operations!
✅ Additional libraries imported (pandas, numpy)

🎯 Environment Ready for Module 3!
   • Project root: /Users/sanjeevadodlapati/Downloads/Repos/GeoSpatialAI/projects/project_pyspark_comprehensive_tutorial/notebooks
   • Data directory: /User

In [2]:
# Create SparkSession optimized for Data Transformations
print("⚡ Creating SparkSession for Data Transformations")
print("=" * 50)

# Stop any existing SparkSession
try:
    spark.stop()
    print("🧹 Stopped existing SparkSession")
except:
    print("🆕 No existing SparkSession to stop")

# Configuration optimized for transformations and analytics
spark = SparkSession.builder \
    .appName("PySpark-Tutorial-Module3-Transformations") \
    .master("local[6]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.minPartitionSize", "16MB") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.sql.adaptive.localShuffleReader.enabled", "true") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.shuffle.partitions", "12") \
    .config("spark.sql.codegen.wholeStage", "true") \
    .config("spark.sql.codegen.maxFields", "200") \
    .config("spark.sql.join.preferSortMergeJoin", "true") \
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .getOrCreate()

# Get SparkContext
sc = spark.sparkContext

# Set log level to reduce noise
sc.setLogLevel("WARN")

print("\n✅ SparkSession created successfully!")
print(f"📱 Application Name: {spark.sparkContext.appName}")
print(f"🔢 Spark Version: {spark.version}")
print(f"🎯 Master: {spark.sparkContext.master}")
print(f"💾 Driver Memory: {spark.conf.get('spark.driver.memory')}")
print(f"⚡ Default Parallelism: {spark.sparkContext.defaultParallelism}")
print(f"🔀 Shuffle Partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")

# Transformation-specific optimizations
print(f"\n🔧 Transformation Optimizations:")
print(f"   • Adaptive Query Execution: {spark.conf.get('spark.sql.adaptive.enabled')}")
print(f"   • Whole-Stage Code Generation: {spark.conf.get('spark.sql.codegen.wholeStage')}")
print(f"   • Skew Join Optimization: {spark.conf.get('spark.sql.adaptive.skewJoin.enabled')}")
print(f"   • Arrow Optimization: {spark.conf.get('spark.sql.execution.arrow.pyspark.enabled')}")

if spark.sparkContext.uiWebUrl:
    print(f"\n🌐 Spark UI: {spark.sparkContext.uiWebUrl}")

print(f"\n🎯 Optimized for:")
print(f"   • Complex data transformations")
print(f"   • Join operations and aggregations")
print(f"   • Window functions and analytics")
print(f"   • Local development with 6 cores")

# Enable SQL interface
spark.sql("SELECT 'SQL interface ready!' as status").show()
print("✅ SQL interface verified and ready!")

⚡ Creating SparkSession for Data Transformations
🆕 No existing SparkSession to stop


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/25 20:28:02 WARN Utils: Your hostname, Sanjeevas-iMac.local, resolves to a loopback address: 127.0.0.1; using 192.168.12.128 instead (on interface en1)
25/08/25 20:28:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/25 20:28:02 WARN Utils: Your hostname, Sanjeevas-iMac.local, resolves to a loopback address: 127.0.0.1; using 192.168.12.128 instead (on interface en1)
25/08/25 20:28:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging


✅ SparkSession created successfully!
📱 Application Name: PySpark-Tutorial-Module3-Transformations
🔢 Spark Version: 4.0.0
🎯 Master: local[6]
💾 Driver Memory: 4g
⚡ Default Parallelism: 6
🔀 Shuffle Partitions: 12

🔧 Transformation Optimizations:
   • Adaptive Query Execution: true
   • Whole-Stage Code Generation: true
   • Skew Join Optimization: true
   • Arrow Optimization: true

🌐 Spark UI: http://192.168.12.128:4042

🎯 Optimized for:
   • Complex data transformations
   • Join operations and aggregations
   • Window functions and analytics
   • Local development with 6 cores
💾 Driver Memory: 4g
⚡ Default Parallelism: 6
🔀 Shuffle Partitions: 12

🔧 Transformation Optimizations:
   • Adaptive Query Execution: true
   • Whole-Stage Code Generation: true
   • Skew Join Optimization: true
   • Arrow Optimization: true

🌐 Spark UI: http://192.168.12.128:4042

🎯 Optimized for:
   • Complex data transformations
   • Join operations and aggregations
   • Window functions and analytics
   • Lo

                                                                                

In [3]:
# Create Sample Datasets for Transformation Demonstrations
print("📊 Creating Sample Datasets for Transformations")
print("=" * 48)

# 1. Sales Transaction Dataset
print("1️⃣ Creating Sales Transaction Dataset...")

def generate_sales_data(num_records=1000):
    """Generate realistic sales transaction data"""
    products = [
        ("P001", "Laptop Pro", "Electronics", 1299.99),
        ("P002", "Wireless Mouse", "Electronics", 29.99),
        ("P003", "Office Chair", "Furniture", 299.99),
        ("P004", "Coffee Maker", "Appliances", 89.99),
        ("P005", "Smartphone", "Electronics", 799.99),
        ("P006", "Desk Lamp", "Furniture", 45.99),
        ("P007", "Tablet", "Electronics", 599.99),
        ("P008", "Ergonomic Keyboard", "Electronics", 79.99),
        ("P009", "Standing Desk", "Furniture", 449.99),
        ("P010", "Blender", "Appliances", 129.99)
    ]
    
    regions = ["North", "South", "East", "West", "Central"]
    sales_reps = ["Alice Johnson", "Bob Smith", "Carol Davis", "David Wilson", "Eve Brown"]
    
    data = []
    base_date = datetime(2024, 1, 1)
    
    for i in range(num_records):
        product = random.choice(products)
        sale_date = base_date + timedelta(days=random.randint(0, 365))
        
        data.append((
            f"T{i+1:06d}",  # transaction_id
            sale_date.strftime('%Y-%m-%d'),  # sale_date
            product[0],  # product_id
            product[1],  # product_name
            product[2],  # category
            product[3],  # unit_price
            random.randint(1, 10),  # quantity
            product[3] * random.randint(1, 10),  # total_amount
            random.choice(regions),  # region
            random.choice(sales_reps),  # sales_rep
            random.choice([True, False]) if random.random() > 0.1 else None,  # is_online (some nulls)
            int(random.uniform(1.0, 5.0) * 10) / 10.0  # rating
        ))
    
    return data

# Generate sales data
sales_data = generate_sales_data(1000)

# Create DataFrame
sales_schema = StructType([
    StructField("transaction_id", StringType(), True),
    StructField("sale_date", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("unit_price", DoubleType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("total_amount", DoubleType(), True),
    StructField("region", StringType(), True),
    StructField("sales_rep", StringType(), True),
    StructField("is_online", BooleanType(), True),
    StructField("rating", DoubleType(), True)
])

df_sales = spark.createDataFrame(sales_data, sales_schema)

print(f"✅ Created sales dataset: {df_sales.count()} records")

# 2. Customer Dataset
print("\n2️⃣ Creating Customer Dataset...")

customer_data = [
    ("C001", "John", "Doe", "john.doe@email.com", "Premium", 25000.50, "2022-01-15", "North"),
    ("C002", "Jane", "Smith", "jane.smith@email.com", "Standard", 12000.75, "2022-03-22", "South"),
    ("C003", "Bob", "Johnson", "bob.johnson@email.com", "Premium", 35000.00, "2021-11-08", "East"),
    ("C004", "Alice", "Brown", "alice.brown@email.com", "Basic", 5000.25, "2023-02-14", "West"),
    ("C005", "Charlie", "Wilson", "charlie.wilson@email.com", "Standard", 18000.00, "2022-09-05", "Central"),
    ("C006", "Diana", "Davis", "", "Premium", 42000.50, "2021-07-12", "North"),  # Missing email
    ("C007", "Eve", "Miller", "eve.miller@email.com", "Standard", 15000.00, "2023-01-30", "South"),
    ("C008", "Frank", "Garcia", "frank.garcia@email.com", "Basic", 3000.75, "2023-05-20", "East"),
    ("C009", "Grace", "Lee", "grace.lee@email.com", "Premium", 55000.00, "2020-08-17", "West"),
    ("C010", "Henry", "Taylor", "henry.taylor@email.com", "Standard", 22000.25, "2022-04-25", "Central")
]

customer_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("tier", StringType(), True),
    StructField("total_spent", DoubleType(), True),
    StructField("join_date", StringType(), True),
    StructField("region", StringType(), True)
])

df_customers = spark.createDataFrame(customer_data, customer_schema)

print(f"✅ Created customer dataset: {df_customers.count()} records")

# 3. Product Information Dataset
print("\n3️⃣ Creating Product Information Dataset...")

product_data = [
    ("P001", "Laptop Pro", "Electronics", 1299.99, 50, "2024-01-01", True, ["laptop", "computer", "electronics"]),
    ("P002", "Wireless Mouse", "Electronics", 29.99, 200, "2024-01-01", True, ["mouse", "wireless", "accessories"]),
    ("P003", "Office Chair", "Furniture", 299.99, 30, "2024-01-01", True, ["chair", "office", "furniture"]),
    ("P004", "Coffee Maker", "Appliances", 89.99, 75, "2024-01-01", True, ["coffee", "appliance", "kitchen"]),
    ("P005", "Smartphone", "Electronics", 799.99, 120, "2024-01-01", True, ["phone", "smartphone", "mobile"]),
    ("P006", "Desk Lamp", "Furniture", 45.99, 80, "2024-01-01", False, ["lamp", "desk", "lighting"]),  # Discontinued
    ("P007", "Tablet", "Electronics", 599.99, 60, "2024-01-01", True, ["tablet", "device", "portable"]),
    ("P008", "Ergonomic Keyboard", "Electronics", 79.99, 90, "2024-01-01", True, ["keyboard", "ergonomic", "input"]),
    ("P009", "Standing Desk", "Furniture", 449.99, 25, "2024-01-01", True, ["desk", "standing", "office"]),
    ("P010", "Blender", "Appliances", 129.99, 40, "2024-01-01", True, ["blender", "kitchen", "appliance"])
]

product_schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("stock_quantity", IntegerType(), True),
    StructField("launch_date", StringType(), True),
    StructField("is_active", BooleanType(), True),
    StructField("tags", ArrayType(StringType()), True)
])

df_products = spark.createDataFrame(product_data, product_schema)

print(f"✅ Created product dataset: {df_products.count()} records")

# Display sample data
print(f"\n📄 Sample Data Preview:")
print(f"\n🛒 Sales Data (first 3 rows):")
df_sales.show(3, truncate=False)

print(f"\n👥 Customer Data (first 3 rows):")
df_customers.show(3, truncate=False)

print(f"\n📦 Product Data (first 3 rows):")
df_products.show(3, truncate=False)

print(f"\n📊 Datasets Summary:")
print(f"   • Sales transactions: {df_sales.count()} records")
print(f"   • Customers: {df_customers.count()} records") 
print(f"   • Products: {df_products.count()} records")
print(f"   • Ready for transformation demonstrations!")

📊 Creating Sample Datasets for Transformations
1️⃣ Creating Sales Transaction Dataset...


                                                                                

✅ Created sales dataset: 1000 records

2️⃣ Creating Customer Dataset...
✅ Created customer dataset: 10 records

3️⃣ Creating Product Information Dataset...
✅ Created product dataset: 10 records

📄 Sample Data Preview:

🛒 Sales Data (first 3 rows):
+--------------+----------+----------+--------------+-----------+----------+--------+------------+-------+-----------+---------+------+
|transaction_id|sale_date |product_id|product_name  |category   |unit_price|quantity|total_amount|region |sales_rep  |is_online|rating|
+--------------+----------+----------+--------------+-----------+----------+--------+------------+-------+-----------+---------+------+
|T000001       |2024-01-13|P002      |Wireless Mouse|Electronics|29.99     |5       |119.96      |South  |Bob Smith  |true     |3.3   |
|T000002       |2024-01-16|P001      |Laptop Pro    |Electronics|1299.99   |2       |5199.96     |South  |Eve Brown  |true     |3.8   |
|T000003       |2024-08-02|P009      |Standing Desk |Furniture  |449.99 

## 3.3 Basic Column Operations

Column operations are fundamental transformations in PySpark. They allow you to create new columns, modify existing ones, and perform calculations across your dataset.

**Key Column Operations:**
- **Selection**: Choose specific columns from a DataFrame
- **Creation**: Add new columns with computed values
- **Modification**: Transform existing column values
- **Renaming**: Change column names for clarity
- **Casting**: Convert between data types
- **Conditional Logic**: Apply if-then-else logic to columns

**Performance Notes:**
- Column operations are **narrow transformations** (no shuffle required)
- Multiple column operations can be chained efficiently
- Use vectorized operations whenever possible for best performance

In [4]:
# Basic Column Operations and Transformations
print("🔧 Basic Column Operations")
print("=" * 27)

# 1. Column Selection
print("1️⃣ Column Selection")

# Select specific columns
df_selected = df_sales.select("transaction_id", "product_name", "total_amount", "region")
print(f"📋 Selected columns: {df_selected.columns}")
df_selected.show(3)

# Select with column expressions
df_selected_expr = df_sales.select(
    col("transaction_id").alias("id"),
    col("product_name"),
    col("total_amount"),
    col("region").alias("sales_region")
)
print(f"\n📋 Selected with aliases:")
df_selected_expr.show(3)

# 2. Adding New Columns
print(f"\n2️⃣ Adding New Columns")

# Add calculated columns
df_with_calculations = df_sales.withColumn("discount_amount", col("total_amount") * 0.1) \
                              .withColumn("discounted_total", col("total_amount") - col("discount_amount")) \
                              .withColumn("tax_amount", col("discounted_total") * 0.08) \
                              .withColumn("final_amount", col("discounted_total") + col("tax_amount"))

print(f"📊 Added financial calculations:")
df_with_calculations.select("transaction_id", "total_amount", "discount_amount", 
                           "discounted_total", "tax_amount", "final_amount").show(3)

# 3. String Operations
print(f"\n3️⃣ String Operations")

# String manipulations
df_string_ops = df_customers.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name"))) \
                           .withColumn("name_length", length(col("full_name"))) \
                           .withColumn("email_domain", 
                                     when(col("email").isNotNull() & (col("email") != ""), 
                                          split(col("email"), "@").getItem(1))
                                     .otherwise(None)) \
                           .withColumn("first_name_upper", upper(col("first_name"))) \
                           .withColumn("initials", concat(substring(col("first_name"), 1, 1), 
                                                        substring(col("last_name"), 1, 1)))

print(f"📝 String transformations:")
df_string_ops.select("customer_id", "full_name", "name_length", "email_domain", 
                    "first_name_upper", "initials").show(5, truncate=False)

# 4. Date Operations
print(f"\n4️⃣ Date Operations")

# Convert string dates to date type and perform date calculations
df_date_ops = df_sales.withColumn("sale_date_typed", to_date(col("sale_date"), "yyyy-MM-dd")) \
                     .withColumn("year", year(col("sale_date_typed"))) \
                     .withColumn("month", month(col("sale_date_typed"))) \
                     .withColumn("quarter", quarter(col("sale_date_typed"))) \
                     .withColumn("day_of_week", dayofweek(col("sale_date_typed"))) \
                     .withColumn("days_from_today", datediff(current_date(), col("sale_date_typed")))

print(f"📅 Date transformations:")
df_date_ops.select("transaction_id", "sale_date", "year", "month", "quarter", 
                  "day_of_week", "days_from_today").show(5)

# 5. Conditional Logic
print(f"\n5️⃣ Conditional Logic")

# Using when() for conditional logic
df_conditional = df_sales.withColumn("amount_category", 
                                   when(col("total_amount") >= 1000, "High")
                                   .when(col("total_amount") >= 500, "Medium")
                                   .when(col("total_amount") >= 100, "Low")
                                   .otherwise("Very Low")) \
                        .withColumn("is_weekend_sale", 
                                  when(dayofweek(to_date(col("sale_date"), "yyyy-MM-dd")).isin([1, 7]), True)
                                  .otherwise(False)) \
                        .withColumn("commission_rate",
                                  when(col("category") == "Electronics", 0.05)
                                  .when(col("category") == "Furniture", 0.03)
                                  .otherwise(0.02))

print(f"🎯 Conditional transformations:")
df_conditional.select("transaction_id", "total_amount", "amount_category", 
                     "is_weekend_sale", "category", "commission_rate").show(5)

# 6. Type Casting
print(f"\n6️⃣ Type Casting")

# Data type conversions
df_cast = df_sales.withColumn("quantity_str", col("quantity").cast(StringType())) \
                 .withColumn("rating_int", col("rating").cast(IntegerType())) \
                 .withColumn("unit_price_float", col("unit_price").cast(FloatType()))

print(f"🔄 Type casting examples:")
print(f"Original types: {[(field.name, field.dataType) for field in df_sales.schema.fields[:3]]}")
print(f"Casted types: {[(field.name, field.dataType) for field in df_cast.select('quantity_str', 'rating_int', 'unit_price_float').schema.fields]}")

df_cast.select("quantity", "quantity_str", "rating", "rating_int", "unit_price", "unit_price_float").show(3)

print(f"\n📊 Column Operations Summary:")
print(f"   • Selection: Choose specific columns from DataFrames")
print(f"   • Addition: Create new columns with withColumn()")
print(f"   • String ops: concat(), upper(), split(), substring()")
print(f"   • Date ops: year(), month(), datediff(), current_date()")
print(f"   • Conditional: when().otherwise() for if-then-else logic")
print(f"   • Casting: Change data types with cast()")
print(f"   • All operations are lazy and optimized by Catalyst!")

🔧 Basic Column Operations
1️⃣ Column Selection
📋 Selected columns: ['transaction_id', 'product_name', 'total_amount', 'region']
+--------------+--------------+------------+-------+
|transaction_id|  product_name|total_amount| region|
+--------------+--------------+------------+-------+
|       T000001|Wireless Mouse|      119.96|  South|
|       T000002|    Laptop Pro|     5199.96|  South|
|       T000003| Standing Desk|     3599.92|Central|
+--------------+--------------+------------+-------+
only showing top 3 rows

📋 Selected with aliases:
+-------+--------------+------------+------------+
|     id|  product_name|total_amount|sales_region|
+-------+--------------+------------+------------+
|T000001|Wireless Mouse|      119.96|       South|
|T000002|    Laptop Pro|     5199.96|       South|
|T000003| Standing Desk|     3599.92|     Central|
+-------+--------------+------------+------------+
only showing top 3 rows

2️⃣ Adding New Columns
📊 Added financial calculations:
+-------------

                                                                                

+-----------+--------------+-----------+------------+----------------+--------+
|customer_id|full_name     |name_length|email_domain|first_name_upper|initials|
+-----------+--------------+-----------+------------+----------------+--------+
|C001       |John Doe      |8          |email.com   |JOHN            |JD      |
|C002       |Jane Smith    |10         |email.com   |JANE            |JS      |
|C003       |Bob Johnson   |11         |email.com   |BOB             |BJ      |
|C004       |Alice Brown   |11         |email.com   |ALICE           |AB      |
|C005       |Charlie Wilson|14         |email.com   |CHARLIE         |CW      |
+-----------+--------------+-----------+------------+----------------+--------+
only showing top 5 rows

4️⃣ Date Operations
📅 Date transformations:
+--------------+----------+----+-----+-------+-----------+---------------+
|transaction_id| sale_date|year|month|quarter|day_of_week|days_from_today|
+--------------+----------+----+-----+-------+-----------+---

## 3.4 Filtering & Conditional Operations

Filtering is essential for data analysis, allowing you to work with subsets of data that meet specific criteria. PySpark provides powerful filtering capabilities with optimized predicate pushdown.

**Key Filtering Concepts:**
- **Row Filtering**: Select rows based on conditions
- **Compound Conditions**: Combine multiple filters with AND/OR logic
- **Null Handling**: Deal with missing values in filters
- **Performance**: Catalyst optimizer pushes filters down to data source
- **SQL Equivalence**: Filter operations map directly to SQL WHERE clauses

In [5]:
# Filtering & Conditional Operations
print("🔍 Filtering & Conditional Operations")
print("=" * 36)

# 1. Basic Filtering
print("1️⃣ Basic Filtering")

# Simple conditions
high_value_sales = df_sales.filter(col("total_amount") > 500)
electronics = df_sales.filter(col("category") == "Electronics")
recent_sales = df_sales.filter(col("sale_date") >= "2024-06-01")

print(f"📊 Filter Results:")
print(f"   • High value sales (>$500): {high_value_sales.count()} records")
print(f"   • Electronics sales: {electronics.count()} records")
print(f"   • Recent sales (since June): {recent_sales.count()} records")

high_value_sales.select("transaction_id", "product_name", "total_amount", "category").show(3)

# 2. Compound Conditions
print(f"\n2️⃣ Compound Conditions (AND/OR Logic)")

# Multiple AND conditions
high_value_electronics = df_sales.filter(
    (col("category") == "Electronics") & 
    (col("total_amount") > 300) & 
    (col("rating") >= 4.0)
)

# OR conditions
furniture_or_appliances = df_sales.filter(
    (col("category") == "Furniture") | 
    (col("category") == "Appliances")
)

# Complex conditions with parentheses
complex_filter = df_sales.filter(
    ((col("category") == "Electronics") & (col("total_amount") > 500)) |
    ((col("category") == "Furniture") & (col("rating") >= 4.5))
)

print(f"📊 Compound Filter Results:")
print(f"   • High-value Electronics (>$300, rating≥4.0): {high_value_electronics.count()} records")
print(f"   • Furniture OR Appliances: {furniture_or_appliances.count()} records")
print(f"   • Complex condition: {complex_filter.count()} records")

# 3. String Filtering
print(f"\n3️⃣ String Filtering")

# String patterns and matching
products_with_pro = df_sales.filter(col("product_name").contains("Pro"))
products_starting_wireless = df_sales.filter(col("product_name").startswith("Wireless"))
products_ending_keyboard = df_sales.filter(col("product_name").endswith("Keyboard"))

# Regular expressions
products_with_pattern = df_sales.filter(col("product_name").rlike(".*[Ll]aptop.*|.*[Pp]hone.*"))

print(f"📝 String Filter Results:")
print(f"   • Products containing 'Pro': {products_with_pro.count()} records")
print(f"   • Products starting with 'Wireless': {products_starting_wireless.count()} records")
print(f"   • Products ending with 'Keyboard': {products_ending_keyboard.count()} records")
print(f"   • Products matching laptop/phone pattern: {products_with_pattern.count()} records")

products_with_pro.select("product_name", "category", "total_amount").distinct().show(truncate=False)

# 4. Null Value Filtering
print(f"\n4️⃣ Null Value Filtering")

# Check for null values
sales_with_nulls = df_sales.filter(col("is_online").isNull())
sales_without_nulls = df_sales.filter(col("is_online").isNotNull())

# Customers with missing emails
customers_no_email = df_customers.filter((col("email").isNull()) | (col("email") == ""))
customers_with_email = df_customers.filter(col("email").isNotNull() & (col("email") != ""))

print(f"🔍 Null Value Results:")
print(f"   • Sales with null is_online: {sales_with_nulls.count()} records")
print(f"   • Sales with non-null is_online: {sales_without_nulls.count()} records")
print(f"   • Customers without email: {customers_no_email.count()} records")
print(f"   • Customers with email: {customers_with_email.count()} records")

# 5. Date Range Filtering
print(f"\n5️⃣ Date Range Filtering")

# Convert to date type for proper comparison
df_sales_typed = df_sales.withColumn("sale_date_typed", to_date(col("sale_date"), "yyyy-MM-dd"))

# Date range filters
q1_2024 = df_sales_typed.filter(
    (col("sale_date_typed") >= "2024-01-01") & 
    (col("sale_date_typed") < "2024-04-01")
)

recent_30_days = df_sales_typed.filter(
    col("sale_date_typed") >= date_sub(current_date(), 30)
)

weekend_sales = df_sales_typed.filter(
    dayofweek(col("sale_date_typed")).isin([1, 7])  # Sunday=1, Saturday=7
)

print(f"📅 Date Filter Results:")
print(f"   • Q1 2024 sales: {q1_2024.count()} records")
print(f"   • Recent 30 days: {recent_30_days.count()} records")
print(f"   • Weekend sales: {weekend_sales.count()} records")

# 6. List-based Filtering
print(f"\n6️⃣ List-based Filtering (IN/NOT IN)")

# Filter by lists of values
target_regions = ["North", "South"]
target_products = ["P001", "P005", "P007"]

region_filter = df_sales.filter(col("region").isin(target_regions))
product_filter = df_sales.filter(col("product_id").isin(target_products))
exclude_regions = df_sales.filter(~col("region").isin(["Central"]))

print(f"📋 List Filter Results:")
print(f"   • North/South regions: {region_filter.count()} records")
print(f"   • Specific products: {product_filter.count()} records")
print(f"   • Excluding Central region: {exclude_regions.count()} records")

# 7. Performance Comparison
print(f"\n7️⃣ Filter Performance Comparison")

# Measure filter performance
start_time = time.time()
result1 = df_sales.filter(col("total_amount") > 200).count()
filter_time = time.time() - start_time

start_time = time.time()
result2 = df_sales.where(col("total_amount") > 200).count()  # where() is alias for filter()
where_time = time.time() - start_time

print(f"⚡ Performance Results:")
print(f"   • filter() method: {filter_time:.4f}s, {result1} records")
print(f"   • where() method: {where_time:.4f}s, {result2} records")
print(f"   • filter() and where() are identical (where is alias)")

print(f"\n📊 Filtering Best Practices:")
print(f"   • Apply filters early in transformation pipeline")
print(f"   • Use column references (col()) for better optimization")
print(f"   • Combine filters with & (and) and | (or) operators")
print(f"   • Handle null values explicitly in conditions")
print(f"   • Use isin() for list-based filtering")
print(f"   • Leverage predicate pushdown for better performance")

🔍 Filtering & Conditional Operations
1️⃣ Basic Filtering
📊 Filter Results:
   • High value sales (>$500): 642 records
   • Electronics sales: 526 records
   • Recent sales (since June): 591 records
+--------------+-------------+------------+-----------+
|transaction_id| product_name|total_amount|   category|
+--------------+-------------+------------+-----------+
|       T000002|   Laptop Pro|     5199.96|Electronics|
|       T000003|Standing Desk|     3599.92|  Furniture|
|       T000004| Office Chair|     1799.94|  Furniture|
+--------------+-------------+------------+-----------+
only showing top 3 rows

2️⃣ Compound Conditions (AND/OR Logic)
📊 Compound Filter Results:
   • Electronics sales: 526 records
   • Recent sales (since June): 591 records
+--------------+-------------+------------+-----------+
|transaction_id| product_name|total_amount|   category|
+--------------+-------------+------------+-----------+
|       T000002|   Laptop Pro|     5199.96|Electronics|
|       T000003

## 3.5 Aggregations & Grouping Operations

Aggregations allow you to summarize and analyze data by computing statistics across groups of records. These are wide transformations that may trigger shuffle operations.

**Key Aggregation Concepts:**
- **GroupBy Operations**: Group data by one or more columns
- **Aggregate Functions**: sum, count, avg, min, max, stddev, etc.
- **Multiple Aggregations**: Apply several aggregation functions simultaneously
- **Having Conditions**: Filter groups after aggregation
- **Performance**: Wide transformations that may require data shuffling

In [6]:
# Aggregations & Grouping Operations
print("📊 Aggregations & Grouping Operations")
print("=" * 37)

# 1. Basic Aggregations
print("1️⃣ Basic Aggregations")

# Simple aggregations on entire dataset
total_sales = df_sales.agg(
    count("*").alias("total_transactions"),
    sum("total_amount").alias("total_revenue"),
    avg("total_amount").alias("avg_transaction"),
    min("total_amount").alias("min_transaction"),
    max("total_amount").alias("max_transaction"),
    stddev("total_amount").alias("stddev_amount")
).collect()[0]

print(f"📈 Overall Sales Statistics:")
print(f"   • Total Transactions: {total_sales['total_transactions']:,}")
print(f"   • Total Revenue: ${total_sales['total_revenue']:,.2f}")
print(f"   • Average Transaction: ${total_sales['avg_transaction']:.2f}")
print(f"   • Min Transaction: ${total_sales['min_transaction']:.2f}")
print(f"   • Max Transaction: ${total_sales['max_transaction']:.2f}")
print(f"   • Standard Deviation: ${total_sales['stddev_amount']:.2f}")

# 2. GroupBy Operations
print(f"\n2️⃣ GroupBy Operations")

# Group by single column
category_stats = df_sales.groupBy("category").agg(
    count("*").alias("transaction_count"),
    sum("total_amount").alias("total_revenue"),
    avg("total_amount").alias("avg_amount"),
    max("total_amount").alias("max_amount")
).orderBy(desc("total_revenue"))

print(f"📊 Sales by Category:")
category_stats.show(truncate=False)

# Group by multiple columns
region_category_stats = df_sales.groupBy("region", "category").agg(
    count("*").alias("count"),
    sum("total_amount").alias("revenue"),
    avg("rating").alias("avg_rating")
).orderBy("region", "category")

print(f"🌍 Sales by Region and Category:")
region_category_stats.show(10, truncate=False)

# 3. Advanced Aggregations
print(f"\n3️⃣ Advanced Aggregations")

# Multiple aggregations with different functions
advanced_stats = df_sales.groupBy("region").agg(
    count("*").alias("transactions"),
    sum("total_amount").alias("revenue"),
    avg("total_amount").alias("avg_amount"),
    min("total_amount").alias("min_amount"),
    max("total_amount").alias("max_amount"),
    count_distinct("product_id").alias("unique_products"),
    count_distinct("sales_rep").alias("unique_reps"),
    avg("rating").alias("avg_rating"),
    collect_list("product_name").alias("all_products")  # Collect all product names
).orderBy(desc("revenue"))

print(f"🌎 Advanced Regional Statistics:")
# Show without collect_list to avoid clutter
advanced_stats.drop("all_products").show(truncate=False)

# 4. Conditional Aggregations
print(f"\n4️⃣ Conditional Aggregations")

# Aggregations with conditions using when()
conditional_agg = df_sales.groupBy("category").agg(
    count("*").alias("total_sales"),
    sum(when(col("total_amount") >= 500, 1).otherwise(0)).alias("high_value_sales"),
    sum(when(col("total_amount") >= 500, col("total_amount")).otherwise(0)).alias("high_value_revenue"),
    avg(when(col("rating") >= 4.0, col("rating"))).alias("avg_high_rating"),
    count(when(col("is_online") == True, 1)).alias("online_sales"),
    count(when(col("is_online") == False, 1)).alias("offline_sales")
)

print(f"🎯 Conditional Aggregations by Category:")
conditional_agg.show(truncate=False)

# 5. Date-based Aggregations
print(f"\n5️⃣ Date-based Aggregations")

# Add date components for grouping
df_with_dates = df_sales.withColumn("sale_date_typed", to_date(col("sale_date"), "yyyy-MM-dd")) \
                       .withColumn("year", year(col("sale_date_typed"))) \
                       .withColumn("month", month(col("sale_date_typed"))) \
                       .withColumn("quarter", quarter(col("sale_date_typed")))

# Monthly sales trends
monthly_trends = df_with_dates.groupBy("year", "month").agg(
    count("*").alias("transactions"),
    sum("total_amount").alias("revenue"),
    avg("total_amount").alias("avg_transaction"),
    lit(0).alias("unique_customers")  # Simplified since customer_id not in sales data
).orderBy("year", "month")

print(f"📅 Monthly Sales Trends:")
monthly_trends.show(12)

# 6. Percentile and Quantile Aggregations
print(f"\n6️⃣ Percentile and Quantile Aggregations")

# Calculate percentiles
percentile_stats = df_sales.groupBy("category").agg(
    count("*").alias("count"),
    expr("percentile_approx(total_amount, 0.25)").alias("q1_amount"),
    expr("percentile_approx(total_amount, 0.5)").alias("median_amount"),
    expr("percentile_approx(total_amount, 0.75)").alias("q3_amount"),
    expr("percentile_approx(total_amount, 0.95)").alias("p95_amount")
)

print(f"📊 Percentile Analysis by Category:")
percentile_stats.show(truncate=False)

# 7. Custom Aggregations
print(f"\n7️⃣ Custom Aggregations")

# User-defined aggregation functions
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Define a custom function for coefficient of variation
def coefficient_of_variation(values):
    if len(values) <= 1:
        return 0.0
    mean_val = sum(values) / len(values)
    variance = sum((x - mean_val) ** 2 for x in values) / (len(values) - 1)
    std_dev = variance ** 0.5
    return std_dev / mean_val if mean_val != 0 else 0.0

# Register UDF
cv_udf = udf(coefficient_of_variation, DoubleType())

# Custom aggregation using collect_list and UDF
custom_agg = df_sales.groupBy("region").agg(
    count("*").alias("transactions"),
    avg("total_amount").alias("avg_amount"),
    stddev("total_amount").alias("std_amount"),
    (stddev("total_amount") / avg("total_amount")).alias("coefficient_of_variation")
)

print(f"🔧 Custom Aggregations (Coefficient of Variation):")
custom_agg.show(truncate=False)

# 8. Having Clauses (Post-aggregation Filtering)
print(f"\n8️⃣ Having Clauses (Post-aggregation Filtering)")

# Filter groups after aggregation
high_revenue_categories = df_sales.groupBy("category").agg(
    count("*").alias("transaction_count"),
    sum("total_amount").alias("total_revenue")
).filter(col("total_revenue") > 10000)  # Having clause equivalent

high_volume_regions = df_sales.groupBy("region").agg(
    count("*").alias("transaction_count"),
    avg("total_amount").alias("avg_revenue")
).filter(col("transaction_count") >= 150)

print(f"💰 High Revenue Categories (>$10,000):")
high_revenue_categories.show()

print(f"📈 High Volume Regions (≥150 transactions):")
high_volume_regions.show()

print(f"\n📊 Aggregation Best Practices:")
print(f"   • Use groupBy() for categorical analysis")
print(f"   • Apply agg() with multiple functions for efficiency")
print(f"   • Use when() for conditional aggregations")
print(f"   • Filter groups with having-style conditions after agg()")
print(f"   • Consider approximate functions (approx_*) for large datasets")
print(f"   • Combine with window functions for advanced analytics")

📊 Aggregations & Grouping Operations
1️⃣ Basic Aggregations
📈 Overall Sales Statistics:
   • Total Transactions: 1,000
   • Total Revenue: $2,116,080.84
   • Average Transaction: $2116.08
   • Min Transaction: $29.99
   • Max Transaction: $12999.90
   • Standard Deviation: $2621.50

2️⃣ GroupBy Operations
📊 Sales by Category:
📊 Sales by Category:
+-----------+-----------------+------------------+------------------+----------+
|category   |transaction_count|total_revenue     |avg_amount        |max_amount|
+-----------+-----------------+------------------+------------------+----------+
|Electronics|526              |1597600.7100000002|3037.263707224335 |12999.9   |
|Furniture  |275              |398261.0          |1448.2218181818182|4499.9    |
|Appliances |199              |120219.12999999998|604.1162311557788 |1299.9    |
+-----------+-----------------+------------------+------------------+----------+

🌍 Sales by Region and Category:
+-----------+-----------------+------------------+-

## 3.6 Join Operations & Module Summary

Join operations combine data from multiple DataFrames based on common keys. This section demonstrates various join types and concludes Module 3 with key takeaways.

**Join Types in PySpark:**
- **Inner Join**: Returns only matching records from both DataFrames
- **Left Join**: Returns all records from left DataFrame, matching from right
- **Right Join**: Returns all records from right DataFrame, matching from left
- **Full Outer Join**: Returns all records from both DataFrames
- **Cross Join**: Cartesian product of both DataFrames
- **Anti Join**: Returns records from left that don't match right
- **Semi Join**: Returns records from left that have matches in right

In [7]:
# Join Operations & Module Completion
print("🔗 Join Operations & Module Summary")
print("=" * 35)

# Create additional datasets for join demonstrations
print("1️⃣ Creating Additional Datasets for Joins")

# Create a customer orders dataset
customer_orders = [
    ("T000001", "C001", "2024-01-15", 1299.99),
    ("T000002", "C002", "2024-01-16", 29.99),
    ("T000003", "C001", "2024-01-17", 299.99),
    ("T000004", "C003", "2024-01-18", 89.99),
    ("T000005", "C004", "2024-01-19", 799.99),
    ("T000006", "C999", "2024-01-20", 45.99),  # Customer not in customer table
]

orders_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("order_date", StringType(), True),
    StructField("order_amount", DoubleType(), True)
])

df_orders = spark.createDataFrame(customer_orders, orders_schema)
print(f"✅ Created orders dataset: {df_orders.count()} records")

# 2. Inner Join
print(f"\n2️⃣ Inner Join (Only Matching Records)")

inner_join = df_customers.join(df_orders, "customer_id", "inner")
print(f"📊 Inner Join Result: {inner_join.count()} records")
inner_join.select("customer_id", "first_name", "last_name", "order_id", "order_amount").show(truncate=False)

# 3. Left Join
print(f"\n3️⃣ Left Join (All Customers, Matching Orders)")

left_join = df_customers.join(df_orders, "customer_id", "left")
print(f"📊 Left Join Result: {left_join.count()} records")
left_join.select("customer_id", "first_name", "last_name", "order_id", "order_amount").show(truncate=False)

# 4. Right Join
print(f"\n4️⃣ Right Join (All Orders, Matching Customers)")

right_join = df_customers.join(df_orders, "customer_id", "right")
print(f"📊 Right Join Result: {right_join.count()} records")
right_join.select("customer_id", "first_name", "last_name", "order_id", "order_amount").show(truncate=False)

# 5. Full Outer Join
print(f"\n5️⃣ Full Outer Join (All Records from Both)")

full_join = df_customers.join(df_orders, "customer_id", "full_outer")
print(f"📊 Full Outer Join Result: {full_join.count()} records")
full_join.select("customer_id", "first_name", "last_name", "order_id", "order_amount").show(truncate=False)

# 6. Anti Join (Customers without Orders)
print(f"\n6️⃣ Anti Join (Customers Without Orders)")

anti_join = df_customers.join(df_orders, "customer_id", "left_anti")
print(f"📊 Anti Join Result: {anti_join.count()} records")
anti_join.select("customer_id", "first_name", "last_name", "email").show(truncate=False)

# 7. Semi Join (Customers with Orders)
print(f"\n7️⃣ Semi Join (Customers With Orders)")

semi_join = df_customers.join(df_orders, "customer_id", "left_semi")
print(f"📊 Semi Join Result: {semi_join.count()} records")
semi_join.select("customer_id", "first_name", "last_name", "email").show(truncate=False)

# 8. Complex Join with Multiple Conditions
print(f"\n8️⃣ Complex Join with Multiple Conditions")

# Join sales with products for detailed analysis
sales_products_join = df_sales.join(
    df_products, 
    (df_sales.product_id == df_products.product_id) & (df_products.is_active == True),
    "inner"
).select(
    df_sales.transaction_id,
    df_sales.product_name.alias("sales_product_name"),
    df_products.product_name.alias("catalog_product_name"),
    df_sales.total_amount,
    df_products.price.alias("catalog_price"),
    df_products.category,
    df_products.stock_quantity
)

print(f"📊 Sales-Products Join: {sales_products_join.count()} records")
sales_products_join.show(5, truncate=False)

# 9. Performance Considerations
print(f"\n9️⃣ Join Performance Analysis")

# Measure join performance
start_time = time.time()
result_count = df_customers.join(df_orders, "customer_id", "inner").count()
join_time = time.time() - start_time

print(f"⚡ Join Performance:")
print(f"   • Join operation time: {join_time:.4f}s")
print(f"   • Records processed: {result_count}")
print(f"   • Consider broadcast joins for small tables")
print(f"   • Use proper partitioning for large datasets")

print(f"\n🎯 Module 3 Complete: Data Transformations Mastery!")
print("=" * 55)

print(f"📚 What You've Learned:")
print(f"   ✅ Basic Column Operations")
print(f"      • Selection, creation, modification, casting")
print(f"      • String, date, and conditional operations")
print(f"   ✅ Filtering & Conditional Logic")
print(f"      • Row filtering with complex conditions")
print(f"      • Null handling and string pattern matching")
print(f"   ✅ Aggregations & Grouping")
print(f"      • GroupBy operations and aggregate functions")
print(f"      • Conditional aggregations and percentiles")
print(f"   ✅ Join Operations")
print(f"      • All join types: inner, left, right, full, anti, semi")
print(f"      • Complex join conditions and performance")

print(f"\n💡 Key Performance Insights:")
print(f"   • Narrow transformations (select, filter) are fast")
print(f"   • Wide transformations (groupBy, join) may shuffle data")
print(f"   • Use column references col() for optimization")
print(f"   • Apply filters early in transformation pipeline")
print(f"   • Leverage Catalyst optimizer for automatic optimization")

print(f"\n🚀 Production Best Practices:")
print(f"   • Chain transformations for lazy evaluation")
print(f"   • Cache intermediate results when reused")
print(f"   • Use explicit schemas for better performance")
print(f"   • Monitor Spark UI for optimization opportunities")
print(f"   • Consider partitioning strategy for large datasets")

print(f"\n📈 Next Steps:")
print(f"   • Module 4: SQL & DataFrame API Advanced Patterns")
print(f"   • Module 5: Performance Optimization & Tuning")
print(f"   • Module 6: Machine Learning with MLlib")
print(f"   • Module 7: Streaming Data Processing")

# Final statistics
total_transformations = 6  # Sections covered
total_concepts = 25  # Approximate concepts demonstrated

print(f"\n📊 Module 3 Statistics:")
print(f"   • Transformation sections: {total_transformations}")
print(f"   • Concepts demonstrated: {total_concepts}+")
print(f"   • Sample datasets: 3 (sales, customers, products)")
print(f"   • Real-world patterns: Production-ready examples")

print(f"\n🎉 Congratulations! You've mastered PySpark Data Transformations!")
print(f"Ready to tackle complex data processing challenges! 🚀")

🔗 Join Operations & Module Summary
1️⃣ Creating Additional Datasets for Joins
✅ Created orders dataset: 6 records

2️⃣ Inner Join (Only Matching Records)
📊 Inner Join Result: 5 records
+-----------+----------+---------+--------+------------+
|customer_id|first_name|last_name|order_id|order_amount|
+-----------+----------+---------+--------+------------+
|C001       |John      |Doe      |T000001 |1299.99     |
|C002       |Jane      |Smith    |T000002 |29.99       |
|C001       |John      |Doe      |T000003 |299.99      |
|C003       |Bob       |Johnson  |T000004 |89.99       |
|C004       |Alice     |Brown    |T000005 |799.99      |
+-----------+----------+---------+--------+------------+


3️⃣ Left Join (All Customers, Matching Orders)
📊 Inner Join Result: 5 records
+-----------+----------+---------+--------+------------+
|customer_id|first_name|last_name|order_id|order_amount|
+-----------+----------+---------+--------+------------+
|C001       |John      |Doe      |T000001 |1299.99 