# Grouping and Aggregation in Pandas

This notebook covers grouping data and performing aggregation operations using Pandas' `groupby()` functionality.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Sample Data

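Let's create a small sample DataFrame of eight employees. It has categorical columns to group by (`Department`, `City`, `Performance`) and numeric columns to aggregate (`Salary`, `Experience`).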

In [None]:
# Create the sample DataFrame (all values are hard-coded, so no random seed is needed)

data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney', 'New York', 'London', 'Paris'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
    'Experience': [2, 5, 8, 3, 6, 10, 4, 3],
    'Performance': ['Good', 'Excellent', 'Good', 'Excellent', 'Good', 'Excellent', 'Good', 'Good']
}

df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)

## Basic Grouping

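The `groupby()` method splits a DataFrame into groups based on the values of one or more columns. You then apply an operation to each group, and pandas combines the per-group results into a single Series or DataFrame indexed by the group keys.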

In [None]:
# Group by Department and get group sizes
print("Group sizes by Department:")
print(df.groupby('Department').size())

# Group by Department and calculate mean salary
print("\nAverage salary by Department:")
print(df.groupby('Department')['Salary'].mean())

# Group by Department and get multiple aggregations
print("\nMultiple aggregations by Department:")
print(df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count']))

# Group by multiple columns
print("\nAverage salary by Department and City:")
print(df.groupby(['Department', 'City'])['Salary'].mean())
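
To make the split, apply, and combine steps concrete, the sketch below loops over the groups by hand and compares the result with the built-in aggregation. It rebuilds a trimmed copy of the sample data so that it runs on its own; the names `demo`, `manual_means`, and `builtin_means` are illustrative and not part of the notebook's own cells.

```python
import pandas as pd

# Trimmed copy of the sample data (Department and Salary only)
demo = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
})

# Split: iterating over a GroupBy yields (group key, sub-DataFrame) pairs
manual_means = {}
for dept, group in demo.groupby('Department'):
    # Apply: reduce each group's salaries to a single number
    manual_means[dept] = group['Salary'].mean()

# Combine: collect the per-group results into one Series
manual_means = pd.Series(manual_means, name='Salary')

# The built-in aggregation performs the same three steps in one call
builtin_means = demo.groupby('Department')['Salary'].mean()

print(manual_means)
print(builtin_means)
```

In practice the built-in form is preferable, since it is shorter and avoids a Python-level loop. The manual version is only meant to show what `groupby()` does conceptually.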

## Custom Aggregation Functions

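`agg()` accepts your own functions as well as built-in function names. A custom aggregation function receives each group's values as a Series and should return a single value. You can also pass a dict to apply different aggregations to different columns.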

In [None]:
# Custom aggregation function
def salary_range(group):
    return group.max() - group.min()

print("Salary range by Department:")
print(df.groupby('Department')['Salary'].agg(salary_range))

# Multiple custom aggregations
print("\nMultiple custom aggregations:")
result = df.groupby('Department')['Salary'].agg([
    'mean',
    'std',
    lambda x: x.max() - x.min(),  # salary range
    lambda x: x.quantile(0.75) - x.quantile(0.25)  # IQR
])
result.columns = ['mean', 'std', 'range', 'iqr']
print(result)

# Apply different functions to different columns
print("\nDifferent aggregations for different columns:")
print(df.groupby('Department').agg({
    'Salary': ['mean', 'max'],
    'Experience': 'mean',
    'Employee': 'count'
}))
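
Passing a dict with lists to `agg()` produces hierarchical (MultiIndex) column labels, and renaming lambda columns by position is easy to get wrong. If flat, self-describing names are preferred, named aggregation (available in pandas 0.25 and later) is one option: each keyword becomes an output column, paired with a `(column, function)` tuple. A minimal sketch on a trimmed copy of the sample data follows; the output names such as `avg_salary` and `headcount` are illustrative.

```python
import pandas as pd

# Trimmed copy of the sample data
demo = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
    'Experience': [2, 5, 8, 3, 6, 10, 4, 3],
})

# Named aggregation: output_name=(input_column, function)
summary = demo.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    salary_range=('Salary', lambda s: s.max() - s.min()),  # custom function
    avg_experience=('Experience', 'mean'),
    headcount=('Salary', 'count'),  # counts non-null salaries per group
)

print(summary)
```

The result has a single level of column labels, so columns can be selected directly, for example `summary['salary_range']`.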

## Transformation and Filtering with Groups

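`transform()` computes a value for every row using its group's data and returns a result with the same index as the input, so it can be assigned back as a new column. `filter()` evaluates a condition once per group and keeps or drops all rows of that group.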

In [None]:
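# Transform: standardize within groups (z-score relative to each department)
# Note: Series.std() is the sample standard deviation (ddof=1), so a one-row group would give NaN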
print("Salary standardized within Department:")
df['Salary_Std'] = df.groupby('Department')['Salary'].transform(lambda x: (x - x.mean()) / x.std())
print(df[['Employee', 'Department', 'Salary', 'Salary_Std']])

# Filter: keep only groups with more than 1 member
# (every department in this sample has at least 2 employees, so no rows are dropped)
print("\nDepartments with more than 1 employee:")
filtered_groups = df.groupby('Department').filter(lambda x: len(x) > 1)
print(filtered_groups[['Employee', 'Department']])

# Filter: keep only groups where average salary > 55000
# (all three departments average above 55000 in this sample, so every row is kept)
print("\nDepartments with average salary > 55000:")
high_salary_depts = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 55000)
print(high_salary_depts[['Employee', 'Department', 'Salary']])
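
The difference between aggregation, transformation, and filtering shows up in the shape of the result: an aggregation returns one row per group, `transform()` returns one value per original row, and `filter()` returns a subset of the original rows. The sketch below checks this on a trimmed copy of the sample data. The 57000 threshold and the `Diff_From_Dept_Mean` column name are assumptions chosen for illustration; the threshold is set so that one department (HR, whose mean is about 55667) is actually dropped.

```python
import pandas as pd

# Trimmed copy of the sample data
demo = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
})

grouped = demo.groupby('Department')['Salary']

# Aggregation: one value per group, indexed by Department
dept_mean = grouped.mean()

# Transformation: the group mean broadcast back to every original row
row_mean = grouped.transform('mean')
demo['Diff_From_Dept_Mean'] = demo['Salary'] - row_mean

# Filtering: whole groups are kept or dropped based on a per-group condition
above = demo.groupby('Department').filter(lambda g: g['Salary'].mean() > 57000)

print(dept_mean.shape, row_mean.shape, above.shape)
print(above[['Employee', 'Department', 'Salary']])
```

Because `transform()` preserves the original index, its output lines up with `demo` row by row, which is what makes the subtraction above safe.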

## Summary

You have learned grouping and aggregation techniques in Pandas:

- **Basic Grouping**: Using `groupby()` with single and multiple columns
- **Aggregation Functions**: Built-in functions like `size()`, `mean()`, `min()`, `max()`, and `count()`
- **Custom Aggregations**: Using `agg()` with custom functions and lambda expressions
- **Transformation**: Using `transform()` to compute group-wise values aligned to the original rows
- **Filtering**: Using `filter()` to select groups based on conditions

These operations are essential for data analysis and summarization tasks.