# 🔄 Pandas Tutorial 2: GroupBy and Aggregation

Welcome to the second notebook in our Pandas series! This notebook focuses on one of the most powerful features in pandas: grouping data and performing aggregations.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- Use the `groupby()` method to group data by categories
- Apply various aggregation functions (sum, mean, count, etc.)
- Create pivot tables for data summarization
- Perform multi-level grouping and complex aggregations
- Use the `.agg()` method for custom aggregations

## 📊 What is GroupBy?

GroupBy operations involve:
1. **Splitting** the data into groups based on some criteria
2. **Applying** a function to each group independently  
3. **Combining** the results into a single Series or DataFrame

This is often called the "split-apply-combine" strategy and is fundamental to data analysis.
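
To make the three steps concrete, here is a minimal sketch that performs them by hand and then with `groupby()`. The `toy` DataFrame and its values are made up for illustration; only the column names mirror the sales data loaded below.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial's CSV files)
toy = pd.DataFrame({
    "Product": ["A", "B", "A", "C", "B", "A"],
    "Revenue": [100, 250, 150, 80, 50, 25],
})

# 1. Split: one sub-DataFrame per distinct Product value
groups = {key: part for key, part in toy.groupby("Product")}

# 2. Apply: reduce each group to a single number
totals = {key: part["Revenue"].sum() for key, part in groups.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(totals, name="Revenue").rename_axis("Product")

# groupby() expresses the same three steps in a single line
shortcut = toy.groupby("Product")["Revenue"].sum()

print(manual)
print(shortcut)
```

Both printed Series should hold the same per-product totals; the one-line version is what the rest of this notebook uses.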

In [None]:
# Import libraries and load data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load our sample datasets
sales_df = pd.read_csv('../data/sales_data.csv')
employees_df = pd.read_csv('../data/employees.csv')
weather_df = pd.read_csv('../data/weather_data.csv')

# Convert date columns
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
employees_df['Join_Date'] = pd.to_datetime(employees_df['Join_Date'])
weather_df['Date'] = pd.to_datetime(weather_df['Date'])

print("✅ Data loaded and preprocessed successfully!")
print(f"📊 Sales data shape: {sales_df.shape}")
print(f"👥 Employee data shape: {employees_df.shape}")
print(f"🌤️ Weather data shape: {weather_df.shape}")

## 📊 Section 1: Basic GroupBy Operations

Let's start with simple grouping operations to understand the split-apply-combine strategy.

In [None]:
# Basic GroupBy Examples
print("🔄 BASIC GROUPBY OPERATIONS")
print("=" * 50)

# 1. Group sales by product and sum revenue
print("1️⃣ Total Revenue by Product:")
product_revenue = sales_df.groupby('Product')['Revenue'].sum()
print(product_revenue)
print(f"   📈 Top product by revenue: {product_revenue.idxmax()} (${product_revenue.max():,.2f})")

print("\n" + "="*50)

# 2. Group by category and calculate multiple statistics
print("2️⃣ Sales Statistics by Category:")
category_stats = sales_df.groupby('Category').agg({
    'Revenue': ['sum', 'mean', 'count'],
    'Quantity': ['sum', 'mean']
})
display(category_stats)

print("\n" + "="*50)

# 3. Group employees by department
print("3️⃣ Department Analysis:")
dept_analysis = employees_df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max', 'count'],
    'Age': 'mean',
    'Performance_Score': 'mean'
}).round(2)
display(dept_analysis)

## 🎲 Section 2: Advanced Aggregation Functions

Let's explore more sophisticated aggregation methods and custom functions.
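
A custom function that needs more than one column (such as a ratio of two sums) is usually passed to `.apply()`. The sketch below, on made-up toy data, compares that approach with an equivalent built only from built-in aggregations: sum both columns per group, then divide. The `toy` values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial's CSV files)
toy = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Revenue": [100.0, 300.0, 90.0, 30.0],
    "Quantity": [2, 6, 3, 1],
})

def revenue_per_item(group):
    """Total revenue divided by total quantity within one group."""
    return group["Revenue"].sum() / group["Quantity"].sum()

# Option 1: apply the custom function to each group's selected columns
via_apply = toy.groupby("Product")[["Revenue", "Quantity"]].apply(revenue_per_item)

# Option 2: aggregate with built-in sums, then divide the two result columns
sums = toy.groupby("Product")[["Revenue", "Quantity"]].sum()
via_sums = sums["Revenue"] / sums["Quantity"]

print(via_apply)
print(via_sums)
```

The second form avoids calling a Python function once per group, which tends to matter as the number of groups grows.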

In [None]:
# Advanced aggregation with custom functions
print("⚡ ADVANCED AGGREGATION FUNCTIONS")
print("=" * 50)

# Custom aggregation function
def revenue_per_item(group):
    """Calculate average revenue per item sold"""
    return group['Revenue'].sum() / group['Quantity'].sum()

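# Apply custom function
# Selecting only the columns the function needs also avoids the pandas 2.2+
# deprecation warning about apply() operating on the grouping column
print("1️⃣ Revenue per Item by Product:")
revenue_per_item_product = sales_df.groupby('Product')[['Revenue', 'Quantity']].apply(revenue_per_item)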
print(revenue_per_item_product.round(2))

print("\n" + "="*50)

# Multiple aggregations with custom names
print("2️⃣ Comprehensive Sales Analysis:")
sales_analysis = sales_df.groupby('Region').agg(
    total_revenue=('Revenue', 'sum'),
    avg_revenue=('Revenue', 'mean'),
    total_quantity=('Quantity', 'sum'),
    num_transactions=('Product', 'count'),
    unique_products=('Product', 'nunique')
).round(2)
display(sales_analysis)

print("\n" + "="*50)

# Percentile calculations
print("3️⃣ Salary Percentiles by Department:")
salary_percentiles = employees_df.groupby('Department')['Salary'].agg([
    ('25th_percentile', lambda x: x.quantile(0.25)),
    ('median', 'median'),
    ('75th_percentile', lambda x: x.quantile(0.75)),
    ('salary_range', lambda x: x.max() - x.min())
]).round(0)
display(salary_percentiles)

## 🔄 Section 3: Multi-Level GroupBy

Sometimes we need to group by multiple columns simultaneously to get deeper insights.
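
Grouping by two columns returns a result indexed by a two-level `MultiIndex`; `unstack()` then pivots the inner level into columns. Here is a small sketch of that shape change, followed by row percentages, using made-up toy data.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial's CSV files)
toy = pd.DataFrame({
    "Category": ["Tech", "Tech", "Home", "Home", "Tech"],
    "Region": ["North", "South", "North", "North", "North"],
    "Revenue": [100, 300, 50, 150, 200],
})

# Long form: one row per (Category, Region) pair that occurs in the data
by_pair = toy.groupby(["Category", "Region"])["Revenue"].sum()

# Wide form: Region becomes the columns; missing pairs are filled with 0
wide = by_pair.unstack(fill_value=0)

# Share of each category's revenue coming from each region (rows sum to 100)
row_share = wide.div(wide.sum(axis=1), axis=0) * 100

print(by_pair)
print(wide)
print(row_share)
```

Note that the (Home, South) pair never occurs in the toy data, so it is absent from the long form and only appears (as 0) after `unstack(fill_value=0)`.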

In [None]:
# Multi-level grouping examples
print("🎯 MULTI-LEVEL GROUPBY OPERATIONS")
print("=" * 50)

# Group by category and region
print("1️⃣ Sales by Category and Region:")
category_region = sales_df.groupby(['Category', 'Region'])['Revenue'].sum().unstack(fill_value=0)
display(category_region)

# Calculate percentages
print("\n📊 Revenue Distribution (%):")
category_region_pct = category_region.div(category_region.sum(axis=1), axis=0) * 100
display(category_region_pct.round(1))

print("\n" + "="*50)

# Group by multiple columns with multiple aggregations
print("2️⃣ Detailed Analysis by Category and Product:")
detailed_analysis = sales_df.groupby(['Category', 'Product']).agg({
    'Revenue': ['sum', 'mean', 'count'],
    'Quantity': ['sum', 'mean']
}).round(2)

# Flatten column names for better readability
detailed_analysis.columns = [f"{col[0]}_{col[1]}" for col in detailed_analysis.columns]
display(detailed_analysis.head(10))

print("\n" + "="*50)

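# Time-based grouping (add month to sales data first)
# Note: .dt.month is the month number only, so if the data spans several years,
# the same month from different years is pooled into one group
sales_df['Month'] = sales_df['Date'].dt.month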
print("3️⃣ Monthly Sales by Category:")
monthly_sales = sales_df.groupby(['Month', 'Category'])['Revenue'].sum().unstack(fill_value=0)
display(monthly_sales)

## 📋 Section 4: Pivot Tables

Pivot tables provide another powerful way to summarize and reshape your data.
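
It can help to see a pivot table as a groupby on two keys followed by `unstack()`. The sketch below builds the same summary both ways on made-up toy data, so you can compare the two tables side by side.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial's CSV files)
toy = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "A"],
    "Region": ["East", "West", "East", "East", "East"],
    "Revenue": [10, 20, 30, 40, 50],
})

# Route 1: pivot_table with an explicit aggregation function
pivoted = pd.pivot_table(
    toy,
    values="Revenue",
    index="Product",
    columns="Region",
    aggfunc="sum",
    fill_value=0,
)

# Route 2: group by both keys, then move Region into the columns
grouped = toy.groupby(["Product", "Region"])["Revenue"].sum().unstack(fill_value=0)

print(pivoted)
print(grouped)
```

Which one to use is mostly a matter of readability: `pivot_table` states the row and column roles up front, while the groupby route composes more easily with further aggregation steps.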

In [None]:
# Pivot table examples
print("📊 PIVOT TABLE OPERATIONS")
print("=" * 50)

# Basic pivot table
print("1️⃣ Basic Pivot Table - Revenue by Product and Region:")
pivot_basic = pd.pivot_table(
    sales_df, 
    values='Revenue', 
    index='Product', 
    columns='Region', 
    aggfunc='sum',
    fill_value=0
)
display(pivot_basic)

print("\n" + "="*50)

# Multiple value columns
print("2️⃣ Multiple Metrics Pivot Table:")
pivot_multi = pd.pivot_table(
    sales_df,
    values=['Revenue', 'Quantity'],
    index='Category',
    columns='Region',
    aggfunc={'Revenue': 'sum', 'Quantity': 'sum'},
    fill_value=0
)
display(pivot_multi)

print("\n" + "="*50)

# Pivot with multiple aggregation functions
print("3️⃣ Advanced Pivot with Multiple Aggregations:")
pivot_advanced = pd.pivot_table(
    sales_df,
    values='Revenue',
    index='Product',
    columns='Region',
    aggfunc=['sum', 'mean', 'count'],
    fill_value=0
)
display(pivot_advanced.round(2))

print("\n" + "="*50)

# Employee pivot table
print("4️⃣ Employee Analysis Pivot Table:")
employee_pivot = pd.pivot_table(
    employees_df,
    values=['Salary', 'Performance_Score'],
    index='Department',
    aggfunc={
        'Salary': ['mean', 'min', 'max'],
        'Performance_Score': 'mean'
    }
).round(2)
display(employee_pivot)

## 📈 Section 5: GroupBy Visualizations

Let's create compelling visualizations of our grouped data to better understand patterns and trends.

In [None]:
# Create visualizations for grouped data
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Revenue by Product (Bar Chart)
product_revenue = sales_df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)
product_revenue.plot(kind='bar', ax=axes[0,0], color='lightblue', alpha=0.8)
axes[0,0].set_title('📊 Total Revenue by Product', fontsize=14, fontweight='bold')
axes[0,0].set_ylabel('Revenue ($)')
axes[0,0].tick_params(axis='x', rotation=45)

# 2. Average Salary by Department (Horizontal Bar)
dept_salary = employees_df.groupby('Department')['Salary'].mean().sort_values()
dept_salary.plot(kind='barh', ax=axes[0,1], color='lightgreen', alpha=0.8)
axes[0,1].set_title('💰 Average Salary by Department', fontsize=14, fontweight='bold')
axes[0,1].set_xlabel('Average Salary ($)')

# 3. Revenue by Category and Region (Stacked Bar)
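# fill_value=0 keeps missing Category/Region pairs from showing up as NaN in the stacked bars
category_region_revenue = sales_df.groupby(['Category', 'Region'])['Revenue'].sum().unstack(fill_value=0)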
category_region_revenue.plot(kind='bar', stacked=True, ax=axes[1,0], 
                            colormap='viridis', alpha=0.8)
axes[1,0].set_title('🏢 Revenue by Category and Region', fontsize=14, fontweight='bold')
axes[1,0].set_ylabel('Revenue ($)')
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')

# 4. Performance vs Salary by Department (Scatter with groupby)
dept_perf_salary = employees_df.groupby('Department').agg({
    'Salary': 'mean',
    'Performance_Score': 'mean',
    'Employee_ID': 'count'  # Count as size
}).reset_index()

scatter = axes[1,1].scatter(dept_perf_salary['Performance_Score'], 
                           dept_perf_salary['Salary'],
                           s=dept_perf_salary['Employee_ID']*50,  # Size by count
                           alpha=0.6, c=range(len(dept_perf_salary)), 
                           cmap='coolwarm')
axes[1,1].set_title('⭐ Avg Performance vs Salary by Department', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Average Performance Score')
axes[1,1].set_ylabel('Average Salary ($)')

# Add department labels
for i, dept in enumerate(dept_perf_salary['Department']):
    axes[1,1].annotate(dept, 
                      (dept_perf_salary['Performance_Score'].iloc[i], 
                       dept_perf_salary['Salary'].iloc[i]),
                      xytext=(5, 5), textcoords='offset points', fontsize=10)

plt.tight_layout()
plt.show()

# Summary statistics table
print("\n📋 SUMMARY STATISTICS BY GROUP")
print("=" * 50)
summary_stats = sales_df.groupby('Category').agg({
    'Revenue': ['sum', 'mean', 'std', 'count'],
    'Quantity': ['sum', 'mean']
}).round(2)
summary_stats.columns = [f"{col[0]}_{col[1]}" for col in summary_stats.columns]
display(summary_stats)

## 🎓 Section 6: Key Takeaways and Best Practices

### What We've Mastered:

1. **🔄 Basic GroupBy**: Split-apply-combine strategy with simple aggregations
2. **⚡ Advanced Aggregations**: Custom functions and multiple statistics
3. **🎯 Multi-Level Grouping**: Grouping by multiple columns simultaneously
4. **📋 Pivot Tables**: Reshaping data for better analysis and presentation
5. **📈 Visualization**: Creating compelling charts from grouped data

### 💡 Best Practices:

- **Use meaningful column names** when creating aggregations
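- **Round numerical results** for display, but keep full precision in any values you reuse in later calculations
- **Handle missing values** before grouping: by default, `groupby()` silently drops rows whose group key is missing (pass `dropna=False` to keep them)
- **Consider performance** with large datasets: converting repetitive string keys to the `category` dtype often reduces memory use and speeds up grouping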
- **Combine groupby with visualization** for better insights
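
As a quick illustration of the missing-values point, here is a sketch on made-up toy data showing how a missing group key affects the totals, and two ways to keep those rows. The `"Unknown"` label is an arbitrary choice for this example.

```python
import pandas as pd

# Hypothetical toy data with one missing Region
toy = pd.DataFrame({
    "Region": ["North", None, "South", "North"],
    "Revenue": [100, 40, 60, 25],
})

# Default behaviour: the row with a missing key is left out of every group
default = toy.groupby("Region")["Revenue"].sum()

# Keep the missing key as its own group
with_missing = toy.groupby("Region", dropna=False)["Revenue"].sum()

# Or label missing keys explicitly before grouping
filled = (
    toy.assign(Region=toy["Region"].fillna("Unknown"))
       .groupby("Region")["Revenue"]
       .sum()
)

print(default.sum(), with_missing.sum(), filled.sum())
```

The first total is smaller than the other two because the revenue of the row without a region is not counted anywhere.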

### 🔧 Common Patterns:

```python
# Basic pattern (swap sum() for mean(), count(), max(), ...)
df.groupby('column')['target'].sum()

# Multiple aggregations
df.groupby('column').agg({'col1': 'sum', 'col2': 'mean'})

# Custom aggregations with names
df.groupby('column').agg(
    total=('revenue', 'sum'),
    average=('revenue', 'mean')
)
```

### Next Up:

In **Notebook 3**, we'll explore:
- 🔎 **Advanced Filtering** - Boolean indexing and query methods
- 🧹 **Data Cleaning** - Handling missing values and outliers  
- 📅 **Time Series** - Working with datetime data effectively
- 🔗 **Data Merging** - Combining multiple datasets

Ready to level up your pandas skills? Let's continue! 🚀

In [None]:
# Let's create some practical examples
print("🚀 PRACTICAL GROUPBY EXAMPLES")
print("=" * 50)

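# Example 1: Monthly sales analysis
# Use a separate 'YearMonth' column so the numeric 'Month' column created earlier is not overwritten
sales_df['YearMonth'] = sales_df['Date'].dt.strftime('%Y-%m')
monthly_summary = sales_df.groupby('YearMonth').agg({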
    'Revenue': ['sum', 'mean', 'count'],
    'Quantity': 'sum',
    'Product': 'nunique'
}).round(2)

print("📊 Monthly Sales Summary:")
display(monthly_summary)

# Example 2: Top performing products by region
print("\n🏆 Top Products by Region (Revenue):")
top_products = sales_df.groupby(['Region', 'Product'])['Revenue'].sum().unstack(fill_value=0)
display(top_products)

# Example 3: Employee performance by department
print("\n👥 Department Performance Analysis:")
dept_performance = employees_df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Performance_Score': ['mean', 'std'],
    'Experience_Years': 'mean',
    'Age': 'mean'
}).round(2)

print("Department summary statistics:")
display(dept_performance)

## 📊 Section 7: Advanced Visualization of GroupBy Results

Visualizing grouped data helps identify patterns and insights more effectively.

In [None]:
# Create comprehensive visualizations
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Revenue by Category
category_revenue = sales_df.groupby('Category')['Revenue'].sum()
category_revenue.plot(kind='bar', ax=axes[0,0], color=['skyblue', 'lightcoral'])
axes[0,0].set_title('💰 Total Revenue by Category', fontsize=14, fontweight='bold')
axes[0,0].set_ylabel('Revenue ($)')
axes[0,0].tick_params(axis='x', rotation=45)

# 2. Average Salary by Department
dept_salary = employees_df.groupby('Department')['Salary'].mean()
dept_salary.plot(kind='barh', ax=axes[0,1], color='lightgreen')
axes[0,1].set_title('💼 Average Salary by Department', fontsize=14, fontweight='bold')
axes[0,1].set_xlabel('Average Salary ($)')

# 3. Sales Trend by Region
region_daily = sales_df.groupby(['Date', 'Region'])['Revenue'].sum().unstack()
for region in region_daily.columns:
    axes[1,0].plot(region_daily.index, region_daily[region], marker='o', label=region)
axes[1,0].set_title('📈 Daily Revenue Trends by Region', fontsize=14, fontweight='bold')
axes[1,0].set_xlabel('Date')
axes[1,0].set_ylabel('Revenue ($)')
axes[1,0].legend()
axes[1,0].tick_params(axis='x', rotation=45)

# 4. Performance vs Experience
perf_exp = employees_df.groupby('Experience_Years')['Performance_Score'].mean()
axes[1,1].scatter(perf_exp.index, perf_exp.values, s=100, alpha=0.7, color='purple')
axes[1,1].plot(perf_exp.index, perf_exp.values, '--', alpha=0.5, color='purple')
axes[1,1].set_title('⭐ Performance vs Experience', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Years of Experience')
axes[1,1].set_ylabel('Average Performance Score')

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📋 QUICK INSIGHTS:")
print("=" * 40)
print(f"🏆 Highest revenue category: {category_revenue.idxmax()} (${category_revenue.max():,.0f})")
print(f"💼 Highest paying department: {dept_salary.idxmax()} (${dept_salary.max():,.0f})")
print(f"⭐ Best performing experience level: {perf_exp.idxmax()} years ({perf_exp.max():.1f} score)")