# 📊 DataFrame Aggregations: GroupBy & Statistics

**Time to complete:** 35 minutes  
**Difficulty:** Intermediate  
**Prerequisites:** DataFrame basics, column expressions

---

## 🎯 Learning Objectives

By the end of this notebook, you will master:
- ✅ **`groupBy()`** - Group data by columns
- ✅ **`agg()`** - Apply aggregation functions
- ✅ **Statistical functions** - sum, avg, min, max, count
- ✅ **Multiple aggregations** - Combine multiple stats
- ✅ **Custom aggregations** - User-defined aggregation logic
- ✅ **Performance optimization** - Efficient grouping strategies

**Aggregations are the heart of data analysis in Spark!**

---

## 🔍 Understanding DataFrame Aggregations

**Aggregations** combine multiple rows into summary statistics. They're essential for:

- **Business Intelligence**: Sales by region, revenue by product
- **Data Analysis**: Average age by department, count by category
- **Reporting**: Statistical summaries and KPIs
- **Data Quality**: Checking distributions and outliers

### Aggregation Flow:
```
Raw Data → Group By → Aggregate Functions → Summary Results
```

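**GroupBy operations are wide transformations**: rows with the same key must end up in the same partition, so Spark shuffles data across the cluster. For built-in aggregates, Spark computes partial results within each partition before the shuffle, which reduces the amount of data moved.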
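
To make the shuffle concrete, here is a plain-Python model (not Spark code) of how a grouped sum can be computed: each partition first builds partial sums per key, and only those partial results are merged per key afterwards. The two-way partition split and the helper names `partial_sums` and `merge_partials` are assumptions made for illustration; the rows are the region and amount columns of the sales data used in this notebook.

```python
from collections import defaultdict

# (region, amount) pairs from the notebook's sales data,
# split into two hypothetical partitions
partitions = [
    [("North", 1200), ("South", 800), ("North", 150), ("East", 950), ("South", 200)],
    [("North", 1300), ("East", 300), ("South", 1100), ("North", 75), ("East", 125)],
]

def partial_sums(rows):
    """Aggregate inside one partition, before any data moves."""
    sums = defaultdict(int)
    for key, amount in rows:
        sums[key] += amount
    return dict(sums)

def merge_partials(partials):
    """Model the shuffle: bring partial sums for the same key together."""
    totals = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)

partials = [partial_sums(p) for p in partitions]
region_totals = merge_partials(partials)
print(region_totals)
```

In this model only six partial sums cross the partition boundary instead of ten rows. On real data the saving depends on how many distinct keys each partition holds.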

In [None]:
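from pyspark.sql import SparkSession
# Note: importing sum, min and max by name shadows Python's built-in functions
# of the same name for the rest of the notebook
from pyspark.sql.functions import col, sum, avg, min, max, count, stddev, variance
from pyspark.sql.functions import countDistinct, approx_count_distinct, first, last
import pyspark.sql.functions as F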

spark = SparkSession.builder \
    .appName("DataFrame_Aggregations") \
    .master("local[*]") \
    .getOrCreate()

print(f"✅ Spark ready - Version: {spark.version}")

# Create comprehensive sales data
sales_data = [
    ("Alice", "North", "Electronics", 1200, "2023-01"),
    ("Bob", "South", "Electronics", 800, "2023-01"),
    ("Charlie", "North", "Clothing", 150, "2023-01"),
    ("Diana", "East", "Electronics", 950, "2023-01"),
    ("Eve", "South", "Clothing", 200, "2023-01"),
    ("Frank", "North", "Electronics", 1300, "2023-02"),
    ("Grace", "East", "Clothing", 300, "2023-02"),
    ("Henry", "South", "Electronics", 1100, "2023-02"),
    ("Ivy", "North", "Books", 75, "2023-02"),
    ("Jack", "East", "Books", 125, "2023-02")
]

sales_df = spark.createDataFrame(
    sales_data, 
    ["salesperson", "region", "category", "amount", "month"]
)

print("📊 Sales Dataset:")
sales_df.show()
print(f"Total records: {sales_df.count()}")

## 🎯 Basic GroupBy Operations

### Single Column Grouping

In [None]:
# Basic aggregations
print("🎯 BASIC GROUPBY OPERATIONS")
print("=" * 50)

# Group by region and calculate total sales
region_sales = sales_df.groupBy("region").sum("amount")
print("Total sales by region:")
region_sales.show()

# Group by category and count transactions
category_count = sales_df.groupBy("category").count()
print("\nTransaction count by category:")
category_count.show()

# Group by salesperson and get average sale
salesperson_avg = sales_df.groupBy("salesperson").avg("amount")
print("\nAverage sale amount by salesperson:")
salesperson_avg.show()

### Multiple Column Grouping
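Grouping by several columns means the grouping key is the combination of their values. As a plain-Python illustration (not Spark code), the sketch below sums the notebook's sales amounts per `(region, category)` pair with an ordinary dictionary; the variable names are assumptions, and the totals it prints can be used to cross-check the grouped sums produced in this section.

```python
from collections import defaultdict

# (region, category, amount) taken from the notebook's sales data
rows = [
    ("North", "Electronics", 1200), ("South", "Electronics", 800),
    ("North", "Clothing", 150), ("East", "Electronics", 950),
    ("South", "Clothing", 200), ("North", "Electronics", 1300),
    ("East", "Clothing", 300), ("South", "Electronics", 1100),
    ("North", "Books", 75), ("East", "Books", 125),
]

totals = defaultdict(int)
for region, category, amount in rows:
    totals[(region, category)] += amount  # the grouping key is the pair

for key in sorted(totals):
    print(key, totals[key])
```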

In [None]:
# Group by multiple columns
print("🔀 MULTIPLE COLUMN GROUPING")
print("=" * 50)

# Group by region and category
region_category_sales = sales_df.groupBy(["region", "category"]).sum("amount")
print("Sales by region and category:")
region_category_sales.show()

# Group by month and region
monthly_region_sales = sales_df.groupBy(["month", "region"]).agg(
    sum("amount").alias("total_sales"),
    count("*").alias("transaction_count"),
    avg("amount").alias("avg_transaction")
)
print("\nMonthly sales by region:")
monthly_region_sales.show()

# Sort results for better readability
monthly_region_sales.orderBy(["month", "region"]).show()

## 🧮 Advanced Aggregation Functions

### Using the agg() Method

In [None]:
# Advanced aggregations with agg()
print("🧮 ADVANCED AGGREGATIONS")
print("=" * 50)

# Comprehensive sales analysis by region
comprehensive_analysis = sales_df.groupBy("region").agg(
    count("*").alias("total_transactions"),
    sum("amount").alias("total_sales"),
    avg("amount").alias("avg_transaction"),
    min("amount").alias("smallest_sale"),
    max("amount").alias("largest_sale"),
    stddev("amount").alias("sales_stddev"),
    countDistinct("salesperson").alias("unique_salespeople"),
    countDistinct("category").alias("unique_categories")
)

print("Comprehensive regional analysis:")
comprehensive_analysis.show()

# Category performance analysis
category_analysis = sales_df.groupBy("category").agg(
    sum("amount").alias("total_revenue"),
    count("*").alias("transaction_count"),
    (sum("amount") / count("*")).alias("avg_transaction_size"),
    approx_count_distinct("salesperson").alias("unique_sellers"),
    # first()/last() depend on row order, which is not guaranteed after a shuffle
    first("salesperson").alias("first_seller"),
    last("salesperson").alias("last_seller")
)

print("\nCategory performance analysis:")
category_analysis.show()

### Statistical Aggregations
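The `stddev()` and `variance()` functions used in this section are aliases for the sample versions (`stddev_samp` and `var_samp`), which divide by n - 1 rather than n. The plain-Python sketch below uses the standard `statistics` module to show the difference on the notebook's per-category amounts and applies the same variability thresholds as the cell that follows. The `variability_level` helper and the `summary` dictionary are names introduced here for illustration.

```python
import statistics

# Amounts per category, from the notebook's sales data
amounts = {
    "Electronics": [1200, 800, 950, 1300, 1100],
    "Clothing": [150, 200, 300],
    "Books": [75, 125],
}

def variability_level(cv):
    """Same rule-of-thumb thresholds as the notebook cell."""
    if cv < 0.15:
        return "Low"
    if cv < 0.35:
        return "Moderate"
    return "High"

summary = {}
for category, values in amounts.items():
    mean = statistics.mean(values)
    sample_std = statistics.stdev(values)       # divides by n - 1
    population_std = statistics.pstdev(values)  # divides by n
    cv = sample_std / mean
    summary[category] = (sample_std, population_std, cv, variability_level(cv))
    print(f"{category}: sample_std={sample_std:.2f} "
          f"population_std={population_std:.2f} cv={cv:.3f}")
```

On this data Electronics comes out as Moderate while Clothing and Books come out as High. With only two or three rows per category, these labels are illustrative at best.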

In [None]:
# Statistical aggregations
print("📈 STATISTICAL AGGREGATIONS")
print("=" * 50)

# Statistical summary of sales amounts
stats_summary = sales_df.select(
    sum("amount").alias("total_sales"),
    avg("amount").alias("mean_sale"),
    stddev("amount").alias("std_deviation"),
    variance("amount").alias("variance"),
    min("amount").alias("min_sale"),
    max("amount").alias("max_sale"),
    (F.percentile_approx("amount", 0.5)).alias("median_sale"),
    (F.percentile_approx("amount", 0.25)).alias("q1"),
    (F.percentile_approx("amount", 0.75)).alias("q3")
)

print("Overall sales statistics:")
stats_summary.show()

# Statistical analysis by category
category_stats = sales_df.groupBy("category").agg(
    count("*").alias("count"),
    avg("amount").alias("mean"),
    stddev("amount").alias("std"),
    min("amount").alias("min"),
    max("amount").alias("max"),
    (max("amount") - min("amount")).alias("range"),
    (stddev("amount") / avg("amount")).alias("coefficient_of_variation")
)

print("\nStatistical analysis by category:")
category_stats.show()

# Classify the coefficient of variation (CV = stddev / mean)
# using rule-of-thumb thresholds:
# CV < 0.15: Low variability
# CV 0.15-0.35: Moderate variability
# CV >= 0.35: High variability
cv_analysis = category_stats.withColumn(
    "variability_level",
    F.when(F.col("coefficient_of_variation") < 0.15, "Low")
    .when(F.col("coefficient_of_variation") < 0.35, "Moderate")
    .otherwise("High")
)

print("\nVariability analysis:")
cv_analysis.select("category", "coefficient_of_variation", "variability_level").show()

## 🎨 Custom Aggregation Functions

### Using aggregateByKey for Complex Aggregations
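The cell below accumulates a running sum of squares, which is simple but can lose precision when values are large relative to their spread. As an alternative sketch (a swapped-in technique, not what the notebook cell does), the accumulator can hold `(count, mean, M2)` and be updated with Welford's method, with partial accumulators merged using the pairwise formula commonly attributed to Chan et al. The helper names `add_value`, `merge`, and `sample_variance` are hypothetical. The sketch is plain Python, but the first two functions have the shape that `aggregateByKey` expects for its sequence and combine functions.

```python
import statistics

ZERO = (0, 0.0, 0.0)  # (count, mean, M2), where M2 = sum of squared deviations from the mean

def add_value(acc, value):
    """Fold one value into an accumulator (Welford's update)."""
    n, mean, m2 = acc
    n += 1
    delta = value - mean
    mean += delta / n
    m2 += delta * (value - mean)
    return (n, mean, m2)

def merge(acc1, acc2):
    """Combine accumulators built on two different partitions."""
    n1, mean1, m2_1 = acc1
    n2, mean2, m2_2 = acc2
    if n1 == 0:
        return acc2
    if n2 == 0:
        return acc1
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return (n, mean, m2)

def sample_variance(acc):
    """Sample variance (n - 1 denominator); undefined for fewer than two values."""
    n, _, m2 = acc
    return m2 / (n - 1) if n > 1 else None

# North-region amounts from the sales data, split across two hypothetical partitions
part_a, part_b = [1200, 150], [1300, 75]

acc_a = ZERO
for v in part_a:
    acc_a = add_value(acc_a, v)
acc_b = ZERO
for v in part_b:
    acc_b = add_value(acc_b, v)

merged = merge(acc_a, acc_b)
print(merged[0], merged[1], sample_variance(merged))
```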

In [None]:
# Custom aggregations
print("🎨 CUSTOM AGGREGATIONS")
print("=" * 50)

# min and max were imported from pyspark.sql.functions above, which shadows
# Python's built-ins; plain-Python code has to reach the originals via builtins
import builtins

# Define custom aggregation functions
def combine_stats(acc, value):
    """Combine statistics incrementally"""
    count, total, min_val, max_val, sum_squares = acc
    return (
        count + 1,
        total + value,
        builtins.min(min_val, value),
        builtins.max(max_val, value),
        sum_squares + (value ** 2)
    )

def merge_stats(acc1, acc2):
    """Merge statistics from different partitions"""
    c1, t1, min1, max1, ss1 = acc1
    c2, t2, min2, max2, ss2 = acc2
    return (
        c1 + c2,
        t1 + t2,
        builtins.min(min1, min2),
        builtins.max(max1, max2),
        ss1 + ss2
    )

# Calculate comprehensive statistics including standard deviation
initial_acc = (0, 0, float('inf'), float('-inf'), 0)  # count, sum, min, max, sum_squares

custom_stats = sales_df.rdd.map(lambda row: (row.region, row.amount)) \
    .aggregateByKey(initial_acc, combine_stats, merge_stats)

def finalize_stats(record):
    """Turn (region, accumulator) into a flat row of final statistics."""
    region, (n, total, min_val, max_val, sum_squares) = record
    mean = total / n
    # Sample variance (divide by n - 1) so the result matches Spark's stddev()
    variance = (sum_squares - n * mean ** 2) / (n - 1) if n > 1 else None
    # Guard against a tiny negative value caused by floating-point rounding
    std = (variance if variance > 0 else 0.0) ** 0.5 if variance is not None else None
    return (region, n, total, mean, min_val, max_val, std)

# Convert back to a DataFrame with one column per statistic
final_stats = custom_stats.map(finalize_stats).toDF([
    "region", "transaction_count", "total_sales",
    "avg_sale", "min_sale", "max_sale", "std_dev"
])

print("Custom aggregation results:")
final_stats.show()

# Verify calculations
verification = sales_df.groupBy("region").agg(
    count("*").alias("count"),
    sum("amount").alias("sum"),
    avg("amount").alias("avg"),
    min("amount").alias("min"),
    max("amount").alias("max"),
    stddev("amount").alias("std")
)

print("\nVerification (built-in functions):")
verification.show()

print("\n✅ Custom aggregation matches built-in functions!")

## ⚡ Performance Optimization

### Efficient Aggregation Patterns

In [None]:
# Performance optimization patterns
print("⚡ PERFORMANCE OPTIMIZATION")
print("=" * 50)

# Create larger dataset for performance testing
large_sales_data = [
    (f"salesperson_{i%50}", f"region_{(i%5)+1}", f"category_{(i%4)+1}", 
     100 + (i % 900), f"2023-{(i%12)+1:02d}")
    for i in range(50000)
]

large_df = spark.createDataFrame(large_sales_data, 
    ["salesperson", "region", "category", "amount", "month"])

print(f"Large dataset: {large_df.count():,} records")

# Pattern 1: Pre-filter before aggregation
print("\nPattern 1: Filter before aggregation")
high_value_sales = large_df.filter(col("amount") > 500)
filtered_result = high_value_sales.groupBy("region").sum("amount")
print(f"High-value sales by region: {filtered_result.count()} regions")

# Pattern 2: Use approximate functions for large datasets
print("\nPattern 2: Approximate aggregations")
approx_stats = large_df.select(
    F.approx_count_distinct("salesperson").alias("unique_sellers"),
    F.percentile_approx("amount", 0.5).alias("median_amount"),
    F.percentile_approx("amount", [0.25, 0.75]).alias("quartiles")
)
approx_stats.show()

# Pattern 3: Cache intermediate results for multiple aggregations
print("\nPattern 3: Cache for multiple operations")
cached_df = large_df.filter(col("amount") > 200).cache()

result1 = cached_df.groupBy("region").sum("amount")
result2 = cached_df.groupBy("category").avg("amount")
result3 = cached_df.agg(count("*"), sum("amount"))

print("Multiple aggregations on cached data:")
print(f"- Regional totals: {result1.count()} regions")
print(f"- Category averages: {result2.count()} categories")
print("- Overall stats:")
result3.show()  # transformations are lazy; an action is needed to compute result3

# Clean up cache
cached_df.unpersist()
print("\nCache cleared")

### Aggregation Pipeline Optimization

In [None]:
# Optimized aggregation pipeline
print("🔧 AGGREGATION PIPELINE OPTIMIZATION")
print("=" * 50)

# Complete business analysis pipeline
business_analysis = sales_df \
    .filter(col("amount") > 0) \
    .withColumn("revenue_category",
                F.when(col("amount") >= 1000, "High")
                .when(col("amount") >= 500, "Medium")
                .otherwise("Low")) \
    .groupBy(["region", "revenue_category"]) \
    .agg(
        count("*").alias("transaction_count"),
        sum("amount").alias("total_revenue"),
        avg("amount").alias("avg_transaction"),
        stddev("amount").alias("revenue_stddev"),
        F.collect_list("salesperson").alias("top_sellers")  # collect_list is not imported by name, so use F.
    ) \
    .withColumn("revenue_per_transaction", col("total_revenue") / col("transaction_count")) \
    .withColumn("efficiency_score", 
                F.when(col("transaction_count") > 2, "High")
                .when(col("transaction_count") > 1, "Medium")
                .otherwise("Low")) \
    .orderBy(["region", "revenue_category"])

print("Complete business analysis pipeline:")
business_analysis.show(truncate=False)

# Executive summary
executive_summary = sales_df.agg(
    countDistinct("salesperson").alias("total_salespeople"),
    countDistinct("region").alias("regions_covered"),
    countDistinct("category").alias("product_categories"),
    sum("amount").alias("total_revenue"),
    avg("amount").alias("avg_transaction"),
    max("amount").alias("highest_sale"),
    min("amount").alias("lowest_sale")
)

print("\n📊 EXECUTIVE SUMMARY")
executive_summary.show()

# Monthly trends
monthly_trends = sales_df.groupBy("month").agg(
    count("*").alias("transactions"),
    sum("amount").alias("revenue"),
    avg("amount").alias("avg_sale"),
    countDistinct("salesperson").alias("active_sellers")
).orderBy("month")

print("\n📈 MONTHLY TRENDS")
monthly_trends.show()

# Performance analysis
performance_analysis = sales_df.groupBy("salesperson").agg(
    count("*").alias("total_sales"),
    sum("amount").alias("total_revenue"),
    avg("amount").alias("avg_sale_size"),
    max("amount").alias("best_sale"),
    countDistinct("category").alias("categories_sold"),
    countDistinct("region").alias("regions_covered")
).withColumn(
    "productivity_score", 
    (col("total_sales") * 0.3 + col("total_revenue") / 1000 * 0.7)
).orderBy(col("productivity_score").desc())

print("\n🏆 SALESPERSON PERFORMANCE RANKING")
performance_analysis.select(
    "salesperson", "total_sales", "total_revenue", 
    "avg_sale_size", "productivity_score"
).show()

## 🚨 Common Mistakes and Debugging

In [None]:
# Common mistakes and solutions
print("🚨 COMMON MISTAKES")
print("=" * 50)

# Mistake 1: Expecting results from a transformation (lazy evaluation)
print("❌ Mistake: Treating a transformation as if it had already run")
grouped_data = sales_df.groupBy("region").sum("amount")
print(f"Type: {type(grouped_data)} (a DataFrame describing the query, nothing computed yet)")

print("\n✅ Solution: Call an action such as show() or collect() to run the query")
results = grouped_data.collect()
print(f"Collected {len(results)} results")

# Mistake 2: Using wrong column references
print("\n❌ Mistake: Wrong column references in agg")
try:
    bad_agg = sales_df.groupBy("region").agg(sum("nonexistent_column"))
    bad_agg.show()
except Exception as e:
    print(f"Error: {str(e)[:100]}...")

print("\n✅ Solution: Use correct column names")
correct_agg = sales_df.groupBy("region").agg(sum("amount"))
correct_agg.show()

# Mistake 3: Performance - aggregating without filtering
print("\n❌ Mistake: Aggregating massive datasets without filtering")
print("Imagine aggregating 1TB of data without any filters...")

print("\n✅ Solution: Filter before aggregating")
optimized_agg = sales_df \
    .filter(col("amount") > 100) \
    .filter(col("region") == "North") \
    .groupBy("category") \
    .sum("amount")

print("Filtered and aggregated efficiently:")
optimized_agg.show()

# Mistake 4: Using collect() on large result sets
print("\n❌ Mistake: collect() on large result sets")
print("Can cause OutOfMemoryError on driver")

print("\n✅ Solutions:")
print("- Use show(n) for preview")
print("- Use take(n) to limit results")
print("- Write to file instead of collecting")
print("- Use limit() before collect()")

# Safe result collection
safe_results = sales_df.groupBy("region").sum("amount").limit(10).collect()
print(f"\nSafely collected {len(safe_results)} results")

## 🎯 Key Takeaways

### What You Learned:
- ✅ **`groupBy()`** - Group data by one or more columns
- ✅ **`agg()`** - Apply aggregation functions to grouped data
- ✅ **Statistical functions** - sum, avg, min, max, stddev, variance
- ✅ **Multiple aggregations** - Combine multiple stats in one operation
- ✅ **Custom aggregations** - Using aggregateByKey for complex logic
- ✅ **Performance optimization** - Filter, cache, and pipeline efficiently

### Aggregation Types:
- 🔸 **Simple aggregations**: `sum()`, `avg()`, `count()`, `min()`, `max()`
- 🔸 **Statistical aggregations**: `stddev()`, `variance()`, `percentile_approx()`
- 🔸 **Distinct operations**: `countDistinct()`, `approx_count_distinct()`
- 🔸 **Custom aggregations**: `aggregateByKey()` with user-defined functions
- 🔸 **Collection aggregations**: `collect_list()`, `collect_set()`

### Performance Best Practices:
- 🔸 **Filter before aggregating** to reduce data volume
- 🔸 **Use approximate functions** for large datasets
- 🔸 **Cache intermediate results** for multiple operations
- 🔸 **Choose appropriate aggregation functions** for your use case
- 🔸 **Monitor shuffle operations** in Spark UI

### Common Patterns:
- 🔸 `df.groupBy(cols).agg(funcs)` - Basic aggregation pattern
- 🔸 `df.groupBy(cols).sum/avg/count()` - Simple single-function aggregations
- 🔸 `F.percentile_approx(col, percentile)` - Approximate percentiles
- 🔸 `F.approx_count_distinct(col)` - Approximate distinct counts
- 🔸 `aggregateByKey()` - Custom aggregation logic

---

## 🚀 Next Steps

Now that you have mastered DataFrame aggregations, you're ready for:

1. **Window Functions** - Advanced analytical operations
2. **Joins** - Combining multiple DataFrames
3. **Complex Data Types** - Arrays, maps, and structs
4. **Advanced Analytics** - Time series and trend analysis

**Aggregations are the foundation of data analysis in Spark!**

---

**🎉 Congratulations! You now have the power to analyze and summarize data at scale with Spark aggregations!**