# Grouping and Aggregation in Pandas

This notebook shows how to group the rows of a DataFrame with Pandas' `groupby()` and then aggregate, transform, and filter the resulting groups.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Pandas version: 2.2.3
NumPy version: 2.2.4


## Sample Data

Let's create a sample DataFrame of eight employees, with categorical columns to group by and numeric columns to aggregate.

In [2]:
# Create the sample DataFrame (all values are hard-coded, so no random seed is needed)

data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney', 'New York', 'London', 'Paris'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
    'Experience': [2, 5, 8, 3, 6, 10, 4, 3],
    'Performance': ['Good', 'Excellent', 'Good', 'Excellent', 'Good', 'Excellent', 'Good', 'Good']
}

df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)

Sample DataFrame:
  Employee Department      City  Salary  Experience Performance
0    Alice         HR  New York   50000           2        Good
1      Bob         IT    London   60000           5   Excellent
2  Charlie    Finance     Paris   70000           8        Good
3    Diana         IT     Tokyo   55000           3   Excellent
4      Eve         HR    Sydney   65000           6        Good
5    Frank    Finance  New York   75000          10   Excellent
6    Grace         IT    London   58000           4        Good
7    Henry         HR     Paris   52000           3        Good


## Basic Grouping

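The `groupby()` method splits a DataFrame into groups of rows that share the same value in one or more columns. An operation is then applied to each group, and the per-group results are combined into a single Series or DataFrame.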

In [3]:
# Group by Department and get group sizes
print("Group sizes by Department:")
print(df.groupby('Department').size())

# Group by Department and calculate mean salary
print("\nAverage salary by Department:")
print(df.groupby('Department')['Salary'].mean())

# Group by Department and get multiple aggregations
print("\nMultiple aggregations by Department:")
print(df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count']))

# Group by multiple columns
print("\nAverage salary by Department and City:")
print(df.groupby(['Department', 'City'])['Salary'].mean())

Group sizes by Department:
Department
Finance    2
HR         3
IT         3
dtype: int64

Average salary by Department:
Department
Finance    72500.000000
HR         55666.666667
IT         57666.666667
Name: Salary, dtype: float64

Multiple aggregations by Department:
                    mean    min    max  count
Department                                   
Finance     72500.000000  70000  75000      2
HR          55666.666667  50000  65000      3
IT          57666.666667  55000  60000      3

Average salary by Department and City:
Department  City    
Finance     New York    75000.0
            Paris       70000.0
HR          New York    50000.0
            Paris       52000.0
            Sydney      65000.0
IT          London      59000.0
            Tokyo       55000.0
Name: Salary, dtype: float64
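
To make the split-apply-combine idea concrete, the sketch below iterates over the groups by hand and rebuilds the per-department mean. It uses a trimmed copy of the sample data; the names `small`, `manual_means`, and `builtin_means` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Trimmed copy of the sample data (illustrative; only two columns are needed)
small = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
})

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs
manual_means = {}
for dept, group in small.groupby('Department'):
    # Apply: compute one value for this group
    manual_means[dept] = group['Salary'].mean()

# Combine: the built-in aggregation does all three steps in one call
builtin_means = small.groupby('Department')['Salary'].mean()

print(manual_means)
print(builtin_means.to_dict())
```

Both printouts should show the same three department means. In practice the built-in form is preferable, since it avoids the Python-level loop.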


## Custom Aggregation Functions

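Besides the built-in aggregation names, `agg()` accepts any function that reduces a group to a single value, including named functions and lambdas. It also accepts a dictionary that maps each column to its own aggregation or list of aggregations.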

In [4]:
# Custom aggregation function
def salary_range(group):
    return group.max() - group.min()

print("Salary range by Department:")
print(df.groupby('Department')['Salary'].agg(salary_range))

# Multiple custom aggregations
print("\nMultiple custom aggregations:")
result = df.groupby('Department')['Salary'].agg([
    'mean',
    'std',
    lambda x: x.max() - x.min(),  # salary range
    lambda x: x.quantile(0.75) - x.quantile(0.25)  # IQR
])
# The lambdas get auto-generated names (<lambda_0>, <lambda_1>), so rename all columns
result.columns = ['mean', 'std', 'range', 'iqr']
print(result)

# Apply different functions to different columns
print("\nDifferent aggregations for different columns:")
print(df.groupby('Department').agg({
    'Salary': ['mean', 'max'],
    'Experience': 'mean',
    'Employee': 'count'
}))

Salary range by Department:
Department
Finance     5000
HR         15000
IT          5000
Name: Salary, dtype: int64

Multiple custom aggregations:
                    mean          std  range     iqr
Department                                          
Finance     72500.000000  3535.533906   5000  2500.0
HR          55666.666667  8144.527815  15000  7500.0
IT          57666.666667  2516.611478   5000  2500.0

Different aggregations for different columns:
                  Salary        Experience Employee
                    mean    max       mean    count
Department                                         
Finance     72500.000000  75000   9.000000        2
HR          55666.666667  65000   3.666667        3
IT          57666.666667  60000   4.000000        3
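
The dictionary form above produces two-level column labels, and the lambda version needs a manual rename. As a possible alternative, the sketch below uses named aggregation, where each keyword argument to `agg()` is an output column name paired with a `(column, function)` tuple. The frame `df_demo` and the output column names are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Trimmed copy of the sample data (illustrative)
df_demo = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
    'Experience': [2, 5, 8, 3, 6, 10, 4, 3],
})

def salary_range(s):
    """Difference between the largest and smallest value in a group."""
    return s.max() - s.min()

# Named aggregation: output_name=(input_column, function_or_name)
summary = df_demo.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    salary_range=('Salary', salary_range),
    mean_experience=('Experience', 'mean'),
)

print(summary)
```

The result has flat, self-describing column names, which tends to be easier to work with downstream than a MultiIndex or a positional rename.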


## Transformation and Filtering with Groups

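`transform()` returns a result aligned with the original rows, so it can be assigned back to the DataFrame as a new column. `filter()` keeps or drops entire groups, based on a function that returns a single boolean per group. In this sample every department has more than one employee and an average salary above 55000, so both filters below return all eight rows.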
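
As a rough sketch of how `transform()` differs from `agg()`, the example below applies the same `'mean'` operation both ways and compares the shapes of the results. The frame `demo` and the column name `Diff_From_Dept_Mean` are illustrative.

```python
import pandas as pd

# Trimmed copy of the sample data (illustrative)
demo = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
})

grouped = demo.groupby('Department')['Salary']

# agg() collapses each group to one value: one row per department
per_dept = grouped.agg('mean')

# transform() broadcasts the group result back to every original row
per_row = grouped.transform('mean')

# per_row shares demo's index, so it can be combined with existing columns
demo['Diff_From_Dept_Mean'] = demo['Salary'] - per_row

print(len(per_dept), len(per_row))
print(demo)
```

Here `per_dept` has three entries (one per department) while `per_row` has eight (one per employee), which is why only the `transform()` result can be assigned directly as a new column.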

In [5]:
# Transform: standardize within groups
print("Salary standardized within Department:")
df['Salary_Std'] = df.groupby('Department')['Salary'].transform(lambda x: (x - x.mean()) / x.std())
print(df[['Employee', 'Department', 'Salary', 'Salary_Std']])

# Filter: keep only groups with more than 1 member
print("\nDepartments with more than 1 employee:")
filtered_groups = df.groupby('Department').filter(lambda x: len(x) > 1)
print(filtered_groups[['Employee', 'Department']])

# Filter: keep only groups where average salary > 55000
print("\nDepartments with average salary > 55000:")
high_salary_depts = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 55000)
print(high_salary_depts[['Employee', 'Department', 'Salary']])

Salary standardized within Department:
  Employee Department  Salary  Salary_Std
0    Alice         HR   50000   -0.695764
1      Bob         IT   60000    0.927173
2  Charlie    Finance   70000   -0.707107
3    Diana         IT   55000   -1.059626
4      Eve         HR   65000    1.145964
5    Frank    Finance   75000    0.707107
6    Grace         IT   58000    0.132453
7    Henry         HR   52000   -0.450200

Departments with more than 1 employee:
  Employee Department
0    Alice         HR
1      Bob         IT
2  Charlie    Finance
3    Diana         IT
4      Eve         HR
5    Frank    Finance
6    Grace         IT
7    Henry         HR

Departments with average salary > 55000:
  Employee Department  Salary
0    Alice         HR   50000
1      Bob         IT   60000
2  Charlie    Finance   70000
3    Diana         IT   55000
4      Eve         HR   65000
5    Frank    Finance   75000
6    Grace         IT   58000
7    Henry         HR   52000
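
Because both conditions above are satisfied by every department, the effect of `filter()` is easier to see with a stricter condition. The sketch below uses a hypothetical threshold of 57000, chosen only so that one department (HR, with a mean of about 55667) fails the test. It also shows a boolean mask built with `transform()` that should select the same rows.

```python
import pandas as pd

# Trimmed copy of the sample data (illustrative)
demo = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 75000, 58000, 52000],
})

# Hypothetical threshold, picked so that HR is excluded
threshold = 57000

# filter() evaluates the function once per group and drops failing groups entirely
kept = demo.groupby('Department').filter(lambda g: g['Salary'].mean() > threshold)

# Same selection expressed as a row-level boolean mask
mask = demo.groupby('Department')['Salary'].transform('mean') > threshold
kept_via_mask = demo[mask]

print(kept)
```

All HR rows are removed, including Eve, whose individual salary is above the threshold: the condition is evaluated on the group, not on each row.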


## Summary

You have learned grouping and aggregation techniques in Pandas:

- **Basic Grouping**: Using `groupby()` with single and multiple columns
- **Aggregation Functions**: Built-in functions like `size()`, `mean()`, `min()`, `max()`, `count()`
- **Custom Aggregations**: Using `agg()` with custom functions and lambda expressions
- **Transformation**: Using `transform()` to compute group-wise values aligned with the original rows
- **Filtering**: Using `filter()` to select groups based on conditions

These operations are essential for data analysis and summarization tasks.