# Grouping & Aggregation

Grouping and aggregation are fundamental operations for summarizing data. While R's `dplyr::group_by()` and `summarize()` provide an intuitive interface, pandas offers powerful grouping capabilities through its `groupby()` method. This chapter will show you how to achieve tidyverse-style grouped operations in pandas.

## Best Practices Summary

Quick reference for groupby patterns:

| Task | R (dplyr) | Pandas |
|------|-----------|--------|
| Simple aggregate | `group_by(df, col) %>% summarize(mean = mean(x))` | `df.groupby('col')['x'].mean()` |
| Multiple aggregates | `summarize(mean = mean(x), sum = sum(x))` | `df.groupby('col')['x'].agg(['mean', 'sum'])` |
| Named aggregates | `summarize(avg_x = mean(x))` | `df.groupby('col').agg(avg_x=('x', 'mean'))` |
| Transform | `group_by(df, col) %>% mutate(pct = x/sum(x))` | `df.groupby('col')['x'].transform(lambda x: x/x.sum())` |
| Filter groups | `group_by(df, col) %>% filter(mean(x) > 10)` | `df.groupby('col').filter(lambda x: x['x'].mean() > 10)` |
| Top n per group | `group_by(df, col) %>% slice_max(x, n=3)` | `df.sort_values('x', ascending=False).groupby('col').head(3)` |
| Multiple grouping | `group_by(df, col1, col2)` | `df.groupby(['col1', 'col2'])` |

## Tips for Tidyverse Users

1. **Think aggregate vs transform**: `agg()` reduces groups to one row (like `summarize()`), while `transform()` maintains all rows (like `mutate()` after `group_by()`).

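2. **Use named aggregations**: The syntax `agg(new_name=('column', 'function'))` names each output column directly, much like `summarize()`, and avoids the MultiIndex columns that a dict of function lists produces.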

3. **Chain operations**: Groupby works well in method chains:
   ```python
   (df
    .query('value > 0')
    .groupby('category')
    .agg(mean_value=('value', 'mean'))
    .sort_values('mean_value'))
   ```

4. **Remember reset_index()**: After groupby operations, use `.reset_index()` if you want the grouping columns as regular columns.

5. **Leverage pivot_table**: For reshaping grouped data, `pivot_table()` can be more intuitive than manual groupby + reshape.
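
Tip 4 can be made concrete with a small sketch. The `toy` frame and its column names are hypothetical and are not part of the chapter's data; the example only illustrates that `.reset_index()` and `as_index=False` lead to the same flat layout.

```python
import pandas as pd

# Hypothetical toy data, not taken from the chapter's examples
toy = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b'],
    'value': [1, 3, 10, 20],
})

# Default: the grouping column becomes the index of the result
indexed = toy.groupby('category').agg(mean_value=('value', 'mean'))

# Option 1: move the index back into a regular column afterwards
flat_a = indexed.reset_index()

# Option 2: ask groupby not to build the index in the first place
flat_b = toy.groupby('category', as_index=False).agg(mean_value=('value', 'mean'))

print(list(indexed.columns))
print(list(flat_a.columns))
print(list(flat_b.columns))
```

Either form gives a frame that behaves like the tibble returned by `summarize()`, where the grouping variable is an ordinary column.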

Grouping and aggregation in pandas are powerful. The syntax differs from dplyr, but the concepts translate well, and pandas often offers more flexibility for complex aggregations.

## Basic Grouping and Aggregation

The fundamental groupby operations in pandas:

In [2]:
import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'HR', 'Sales', 'IT', 'HR'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Iris'],
    'salary': [70000, 85000, 65000, 72000, 90000, 68000, 75000, 88000, 71000],
    'years_exp': [5, 8, 3, 6, 10, 4, 7, 9, 5],
    'performance': [4.2, 4.5, 3.8, 4.0, 4.7, 3.9, 4.3, 4.6, 4.1]
})

# Simple aggregation
# R: df %>% group_by(department) %>% summarize(avg_salary = mean(salary))
df.groupby('department')['salary'].mean()

department
HR       68000.000000
IT       87666.666667
Sales    72333.333333
Name: salary, dtype: float64

In [3]:
# Multiple aggregations
# R: df %>% 
#     group_by(department) %>% 
#     summarize(avg_salary = mean(salary),
#               total_salary = sum(salary),
#               count = n())
df.groupby('department')['salary'].agg(['mean', 'sum', 'count'])

| department | mean | sum | count |
|------|------|------|------|
| HR | 68000.0 | 204000 | 3 |
| IT | 87666.666667 | 263000 | 3 |
| Sales | 72333.333333 | 217000 | 3 |


In [4]:
# Custom named aggregations
# R: df %>%
#     group_by(department) %>%
#     summarize(avg_salary = mean(salary),
#               max_performance = max(performance),
#               total_years = sum(years_exp))
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    max_performance=('performance', 'max'),
    total_years=('years_exp', 'sum')
).round(2)

| department | avg_salary | max_performance | total_years |
|------|------|------|------|
| HR | 68000.0 | 4.1 | 12 |
| IT | 87666.67 | 4.7 | 27 |
| Sales | 72333.33 | 4.3 | 18 |


## Multiple Aggregation Functions

Applying different functions to different columns:

In [5]:
# Different functions for different columns
# R: df %>%
#     group_by(department) %>%
#     summarize(across(salary, list(mean = mean, sd = sd)),
#               across(performance, list(min = min, max = max)))
agg_dict = {
    'salary': ['mean', 'std', 'min', 'max'],
    'performance': ['mean', 'min', 'max'],
    'years_exp': ['sum', 'mean']
}

df.groupby('department').agg(agg_dict).round(2)

| department | salary / mean | salary / std | salary / min | salary / max | performance / mean | performance / min | performance / max | years_exp / sum | years_exp / mean |
|------|------|------|------|------|------|------|------|------|------|
| HR | 68000.0 | 3000.0 | 65000 | 71000 | 3.93 | 3.8 | 4.1 | 12 | 4.0 |
| IT | 87666.67 | 2516.61 | 85000 | 90000 | 4.6 | 4.5 | 4.7 | 27 | 9.0 |
| Sales | 72333.33 | 2516.61 | 70000 | 75000 | 4.17 | 4.0 | 4.3 | 18 | 6.0 |


In [6]:
# Flattening multi-level column names
result = df.groupby('department').agg(agg_dict)
result.columns = ['_'.join(col).strip() for col in result.columns.values]
result.round(2)

| department | salary_mean | salary_std | salary_min | salary_max | performance_mean | performance_min | performance_max | years_exp_sum | years_exp_mean |
|------|------|------|------|------|------|------|------|------|------|
| HR | 68000.0 | 3000.0 | 65000 | 71000 | 3.93 | 3.8 | 4.1 | 12 | 4.0 |
| IT | 87666.67 | 2516.61 | 85000 | 90000 | 4.6 | 4.5 | 4.7 | 27 | 9.0 |
| Sales | 72333.33 | 2516.61 | 70000 | 75000 | 4.17 | 4.0 | 4.3 | 18 | 6.0 |


## Custom Aggregation Functions

Using custom functions in aggregations:

In [7]:
# Define custom aggregation functions
def salary_range(x):
    """Calculate salary range (max - min)"""
    return x.max() - x.min()

def high_performers(x):
    """Count high performers (performance >= 4.5)"""
    return (x >= 4.5).sum()

# Apply custom functions
# R: df %>%
#     group_by(department) %>%
#     summarize(salary_range = max(salary) - min(salary),
#               high_performers = sum(performance >= 4.5))
df.groupby('department').agg({
    'salary': salary_range,
    'performance': high_performers
}).rename(columns={'salary': 'salary_range', 'performance': 'high_performers'})

| department | salary_range | high_performers |
|------|------|------|
| HR | 6000 | 0 |
| IT | 5000 | 3 |
| Sales | 5000 | 0 |


In [8]:
# Multiple custom functions with lambda
df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    salary_range=('salary', lambda x: x.max() - x.min()),
    high_perf_count=('performance', lambda x: (x >= 4.5).sum()),
    exp_weighted_salary=('salary', lambda x: np.average(x, weights=df.loc[x.index, 'years_exp']))
).round(2)

| department | avg_salary | salary_range | high_perf_count | exp_weighted_salary |
|------|------|------|------|------|
| HR | 68000.0 | 6000 | 0 | 68500.0 |
| IT | 87666.67 | 5000 | 3 | 87851.85 |
| Sales | 72333.33 | 5000 | 0 | 72611.11 |
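
The `exp_weighted_salary` lambda above reaches back into the outer `df` through `x.index`, which only works while the grouped frame keeps its original index. A possible alternative, sketched here on a small hypothetical frame `emp` that reuses the chapter's column names, precomputes the weighted term so that only built-in sums are needed.

```python
import pandas as pd

# Small hypothetical frame with the same column names as the chapter's df
emp = pd.DataFrame({
    'department': ['HR', 'HR', 'IT', 'IT'],
    'salary': [60000, 80000, 90000, 100000],
    'years_exp': [1, 3, 2, 2],
})

# Precompute salary * weight, then aggregate with built-in sums only
sums = (emp
    .assign(weighted=lambda d: d['salary'] * d['years_exp'])
    .groupby('department')
    .agg(weighted_sum=('weighted', 'sum'), weight_sum=('years_exp', 'sum'))
)

# Weighted mean = sum(salary * weight) / sum(weight), computed per group
exp_weighted_salary = sums['weighted_sum'] / sums['weight_sum']
print(exp_weighted_salary)
```

This mirrors the dplyr habit of writing `sum(salary * years_exp) / sum(years_exp)` inside `summarize()`, and it does not depend on index alignment with an outer variable.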


## Transform vs Aggregate

Understanding the difference between transform and aggregate:

In [9]:
# Transform: returns same-sized result (like mutate after group_by)
# R: df %>% group_by(department) %>% mutate(dept_avg_salary = mean(salary))
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
df['salary_vs_dept_avg'] = df['salary'] - df['dept_avg_salary']
df

|   | department | employee | salary | years_exp | performance | dept_avg_salary | salary_vs_dept_avg |
|------|------|------|------|------|------|------|------|
| 0 | Sales | Alice | 70000 | 5 | 4.2 | 72333.333333 | -2333.333333 |
| 1 | IT | Bob | 85000 | 8 | 4.5 | 87666.666667 | -2666.666667 |
| 2 | HR | Charlie | 65000 | 3 | 3.8 | 68000.0 | -3000.0 |
| 3 | Sales | David | 72000 | 6 | 4.0 | 72333.333333 | -333.333333 |
| 4 | IT | Eve | 90000 | 10 | 4.7 | 87666.666667 | 2333.333333 |
| 5 | HR | Frank | 68000 | 4 | 3.9 | 68000.0 | 0.0 |
| 6 | Sales | Grace | 75000 | 7 | 4.3 | 72333.333333 | 2666.666667 |
| 7 | IT | Henry | 88000 | 9 | 4.6 | 87666.666667 | 333.333333 |
| 8 | HR | Iris | 71000 | 5 | 4.1 | 68000.0 | 3000.0 |


In [10]:
# Multiple transforms
# R: df %>% 
#     group_by(department) %>%
#     mutate(dept_rank = rank(-salary),
#            pct_of_dept_total = salary / sum(salary) * 100)
df_transformed = df.assign(
    dept_rank = lambda x: x.groupby('department')['salary'].rank(ascending=False),
    pct_of_dept_total = lambda x: x.groupby('department')['salary'].transform(lambda s: s / s.sum() * 100),
    z_score = lambda x: x.groupby('department')['salary'].transform(lambda s: (s - s.mean()) / s.std())
)
df_transformed.round(2)

|   | department | employee | salary | years_exp | performance | dept_avg_salary | salary_vs_dept_avg | dept_rank | pct_of_dept_total | z_score |
|------|------|------|------|------|------|------|------|------|------|------|
| 0 | Sales | Alice | 70000 | 5 | 4.2 | 72333.33 | -2333.33 | 3.0 | 32.26 | -0.93 |
| 1 | IT | Bob | 85000 | 8 | 4.5 | 87666.67 | -2666.67 | 3.0 | 32.32 | -1.06 |
| 2 | HR | Charlie | 65000 | 3 | 3.8 | 68000.0 | -3000.0 | 3.0 | 31.86 | -1.0 |
| 3 | Sales | David | 72000 | 6 | 4.0 | 72333.33 | -333.33 | 2.0 | 33.18 | -0.13 |
| 4 | IT | Eve | 90000 | 10 | 4.7 | 87666.67 | 2333.33 | 1.0 | 34.22 | 0.93 |
| 5 | HR | Frank | 68000 | 4 | 3.9 | 68000.0 | 0.0 | 2.0 | 33.33 | 0.0 |
| 6 | Sales | Grace | 75000 | 7 | 4.3 | 72333.33 | 2666.67 | 1.0 | 34.56 | 1.06 |
| 7 | IT | Henry | 88000 | 9 | 4.6 | 87666.67 | 333.33 | 2.0 | 33.46 | 0.13 |
| 8 | HR | Iris | 71000 | 5 | 4.1 | 68000.0 | 3000.0 | 1.0 | 34.8 | 1.0 |
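
Lambda-based transforms like the ones above are flexible, but many of them can also be written as ordinary column arithmetic on top of a built-in transform. The following is a minimal sketch on a hypothetical `toy` frame; it only illustrates that both spellings give the same percentages.

```python
import pandas as pd

# Hypothetical data; column names follow the chapter's df
toy = pd.DataFrame({
    'department': ['HR', 'HR', 'IT', 'IT'],
    'salary': [60000, 80000, 90000, 110000],
})

g = toy.groupby('department')['salary']

# Lambda version, in the style of the cell above
pct_lambda = g.transform(lambda s: s / s.sum() * 100)

# Same result: broadcast the group total with a built-in transform,
# then do the division once on whole columns
pct_builtin = toy['salary'] / g.transform('sum') * 100

print(pct_builtin.round(2))
```

The second form avoids calling a Python function once per group, which tends to matter when there are many groups.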


## Multiple Grouping Variables

Grouping by multiple columns:

In [11]:
# Create DataFrame with multiple grouping variables
df_multi = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East', 'West'] * 3,
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'HR'] * 3,
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q1', 'Q1',
                'Q2', 'Q2', 'Q2', 'Q2', 'Q2', 'Q2',
                'Q3', 'Q3', 'Q3', 'Q3', 'Q3', 'Q3'],
    'revenue': np.random.randint(50000, 150000, 18),
    'costs': np.random.randint(30000, 80000, 18)
})

# Group by multiple columns
# R: df %>% 
#     group_by(region, department) %>%
#     summarize(total_revenue = sum(revenue),
#               total_costs = sum(costs),
#               profit = sum(revenue - costs))
df_multi.groupby(['region', 'department']).agg(
    total_revenue=('revenue', 'sum'),
    total_costs=('costs', 'sum')
).assign(profit=lambda x: x['total_revenue'] - x['total_costs'])

| region | department | total_revenue | total_costs | profit |
|------|------|------|------|------|
| East | HR | 301580 | 162119 | 139461 |
| East | IT | 283541 | 183463 | 100078 |
| East | Sales | 376540 | 161563 | 214977 |
| West | HR | 256471 | 161226 | 95245 |
| West | IT | 334453 | 188775 | 145678 |
| West | Sales | 316252 | 147385 | 168867 |


In [12]:
# Hierarchical grouping with subtotals
# First level: by region
region_summary = df_multi.groupby('region').agg({
    'revenue': 'sum',
    'costs': 'sum'
}).assign(level='Region Total')

# Second level: by region and department
region_dept_summary = df_multi.groupby(['region', 'department']).agg({
    'revenue': 'sum',
    'costs': 'sum'
})

region_dept_summary

| region | department | revenue | costs |
|------|------|------|------|
| East | HR | 301580 | 162119 |
| East | IT | 283541 | 183463 |
| East | Sales | 376540 | 161563 |
| West | HR | 256471 | 161226 |
| West | IT | 334453 | 188775 |
| West | Sales | 316252 | 147385 |
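
The cell above builds `region_summary` and `region_dept_summary` separately but displays only the detail level. One way to combine the two levels into a single table with subtotal rows is sketched below. The `toy` frame is hypothetical, and placing the `'Region Total'` label in the `department` column is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as df_multi
toy = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'department': ['IT', 'Sales', 'IT', 'Sales'],
    'revenue': [100, 200, 300, 400],
    'costs': [60, 90, 150, 250],
})

# Detail level: one row per (region, department)
detail = (toy
    .groupby(['region', 'department'], as_index=False)[['revenue', 'costs']]
    .sum())

# Region totals, labelled so they can share the 'department' column
totals = (toy
    .groupby('region', as_index=False)[['revenue', 'costs']]
    .sum()
    .assign(department='Region Total'))

# Stack both levels; a stable sort keeps each total after its detail rows
with_subtotals = (pd.concat([detail, totals], ignore_index=True)
    .sort_values('region', kind='stable')
    .reset_index(drop=True))

print(with_subtotals)
```

The stable sort matters here: the totals are appended after all detail rows, so sorting by `region` alone leaves each total as the last row of its region.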


## Filter and Slice Groups

Working with grouped data using filter and head/tail:

In [13]:
# Filter groups based on group statistics
# R: df %>% 
#     group_by(department) %>%
#     filter(mean(salary) > 75000)
df_filtered = df.groupby('department').filter(lambda x: x['salary'].mean() > 75000)
df_filtered

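|   | department | employee | salary | years_exp | performance | dept_avg_salary | salary_vs_dept_avg |
|------|------|------|------|------|------|------|------|
| 1 | IT | Bob | 85000 | 8 | 4.5 | 87666.666667 | -2666.666667 |
| 4 | IT | Eve | 90000 | 10 | 4.7 | 87666.666667 | 2333.333333 |
| 7 | IT | Henry | 88000 | 9 | 4.6 | 87666.666667 | 333.333333 |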


In [14]:
# Keep only top performers in each department
# R: df %>% 
#     group_by(department) %>%
#     slice_max(performance, n = 2)
df.sort_values('performance', ascending=False).groupby('department').head(2)

|   | department | employee | salary | years_exp | performance | dept_avg_salary | salary_vs_dept_avg |
|------|------|------|------|------|------|------|------|
| 4 | IT | Eve | 90000 | 10 | 4.7 | 87666.666667 | 2333.333333 |
| 7 | IT | Henry | 88000 | 9 | 4.6 | 87666.666667 | 333.333333 |
| 6 | Sales | Grace | 75000 | 7 | 4.3 | 72333.333333 | 2666.666667 |
| 0 | Sales | Alice | 70000 | 5 | 4.2 | 72333.333333 | -2333.333333 |
| 8 | HR | Iris | 71000 | 5 | 4.1 | 68000.0 | 3000.0 |
| 5 | HR | Frank | 68000 | 4 | 3.9 | 68000.0 | 0.0 |
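
One difference from dplyr is worth noting: `slice_max()` keeps ties by default (`with_ties = TRUE`), while `sort_values(...).groupby(...).head(n)` always returns at most `n` rows per group. The sketch below, on a hypothetical `toy` frame, shows a rank-based variant that keeps tied rows.

```python
import pandas as pd

# Hypothetical data: employees A and B are tied for the top score in HR
toy = pd.DataFrame({
    'department': ['HR', 'HR', 'HR', 'IT', 'IT'],
    'employee': ['A', 'B', 'C', 'D', 'E'],
    'performance': [4.0, 4.0, 3.5, 4.7, 4.6],
})

# sort + head(1): exactly one row per group, ties broken by sort order
top_head = (toy
    .sort_values('performance', ascending=False)
    .groupby('department')
    .head(1))

# rank with method='min': tied values share the same rank, so ties survive
rank_in_dept = (toy
    .groupby('department')['performance']
    .rank(method='min', ascending=False))
top_ties = toy[rank_in_dept <= 1]

print(len(top_head), len(top_ties))
```

Which behaviour is preferable depends on the analysis; the rank-based filter is closer to the dplyr default.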


In [15]:
# More complex filtering
# Keep departments where all employees have 5+ years experience
# R: df %>% 
#     group_by(department) %>%
#     filter(all(years_exp >= 5))
df.groupby('department').filter(lambda x: (x['years_exp'] >= 5).all())

|   | department | employee | salary | years_exp | performance | dept_avg_salary | salary_vs_dept_avg |
|------|------|------|------|------|------|------|------|
| 0 | Sales | Alice | 70000 | 5 | 4.2 | 72333.333333 | -2333.333333 |
| 1 | IT | Bob | 85000 | 8 | 4.5 | 87666.666667 | -2666.666667 |
| 3 | Sales | David | 72000 | 6 | 4.0 | 72333.333333 | -333.333333 |
| 4 | IT | Eve | 90000 | 10 | 4.7 | 87666.666667 | 2333.333333 |
| 6 | Sales | Grace | 75000 | 7 | 4.3 | 72333.333333 | 2666.666667 |
| 7 | IT | Henry | 88000 | 9 | 4.6 | 87666.666667 | 333.333333 |
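
Group-level filters such as the one above can often be rewritten as a boolean mask built with `transform()`. The following sketch uses a hypothetical `toy` frame and relies on the observation that "all values are at least 5" is equivalent to "the group minimum is at least 5".

```python
import pandas as pd

# Hypothetical data; HR contains one employee below the threshold
toy = pd.DataFrame({
    'department': ['HR', 'HR', 'IT', 'IT'],
    'years_exp': [3, 6, 8, 9],
})

# Lambda-based group filter, in the style of the cell above
via_filter = toy.groupby('department').filter(lambda g: (g['years_exp'] >= 5).all())

# Equivalent mask: broadcast each group's minimum back to its rows
mask = toy.groupby('department')['years_exp'].transform('min') >= 5
via_mask = toy[mask]

print(via_mask)
```

The mask form keeps the original row order and index, and it combines easily with other row-level conditions using `&` and `|`.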


## Window Functions with Groups

Calculating running totals and other window functions within groups:

In [16]:
# Create time-series grouped data
df_sales = pd.DataFrame({
    # 'M' (month end) is deprecated since pandas 2.2; use freq='ME' there
    'date': pd.date_range('2024-01-01', periods=12, freq='M'),
    'region': ['North', 'South'] * 6,
    'sales': np.random.randint(10000, 50000, 12)
})

# Cumulative calculations within groups
# R: df %>% 
#     group_by(region) %>%
#     arrange(date) %>%
#     mutate(cumsum_sales = cumsum(sales),
#            rolling_avg = rollmean(sales, k = 3, fill = NA, align = "right"))
df_sales_calc = (df_sales
    .sort_values(['region', 'date'])
    .assign(
        cumsum_sales = lambda x: x.groupby('region')['sales'].cumsum(),
        rolling_avg_3m = lambda x: x.groupby('region')['sales'].transform(lambda s: s.rolling(3, min_periods=1).mean()),
        pct_of_region_total = lambda x: x.groupby('region')['sales'].transform(lambda s: s / s.sum() * 100)
    )
)
df_sales_calc.round(2)


|   | date | region | sales | cumsum_sales | rolling_avg_3m | pct_of_region_total |
|------|------|------|------|------|------|------|
| 0 | 2024-01-31 | North | 44298 | 44298 | 44298.0 | 17.64 |
| 2 | 2024-03-31 | North | 45257 | 89555 | 44777.5 | 18.02 |
| 4 | 2024-05-31 | North | 43361 | 132916 | 44305.33 | 17.27 |
| 6 | 2024-07-31 | North | 25793 | 158709 | 38137.0 | 10.27 |
| 8 | 2024-09-30 | North | 48385 | 207094 | 39179.67 | 19.27 |
| 10 | 2024-11-30 | North | 44037 | 251131 | 39405.0 | 17.54 |
| 1 | 2024-02-29 | South | 36389 | 36389 | 36389.0 | 22.01 |
| 3 | 2024-04-30 | South | 24205 | 60594 | 30297.0 | 14.64 |
| 5 | 2024-06-30 | South | 11925 | 72519 | 24173.0 | 7.21 |
| 7 | 2024-08-31 | South | 43933 | 116452 | 26687.67 | 26.58 |
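
The rolling average above uses `transform()` with `min_periods=1`, so the first rows of each group are averages over partial windows. A possible alternative is the grouped `.rolling()` method, sketched below on a hypothetical `toy` frame. With the default `min_periods`, incomplete windows stay `NaN`, which is closer to the `fill = NA` behaviour in the R comment.

```python
import pandas as pd

# Hypothetical data, already in time order within each region
toy = pd.DataFrame({
    'region': ['North', 'South'] * 3,
    'sales': [10, 100, 20, 200, 30, 300],
})

# groupby(...).rolling(...) returns a (region, original index) MultiIndex;
# dropping the region level realigns the result with the original rows
rolling_avg = (toy
    .groupby('region')['sales']
    .rolling(2)
    .mean()
    .reset_index(level=0, drop=True))

# Assignment aligns on the index, so row order does not matter here
toy['rolling_avg_2'] = rolling_avg
print(toy)
```

The window of 2 is chosen only to keep the toy example short; the same pattern applies to a 3-period window.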


## Pivot Tables as Grouped Aggregations

Using pivot_table for grouped summaries:

In [20]:
df_multi.head(5)

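|   | region | department | quarter | revenue | costs |
|------|------|------|------|------|------|
| 0 | East | Sales | Q1 | 142313 | 43082 |
| 1 | East | IT | Q1 | 143488 | 51335 |
| 2 | West | Sales | Q1 | 149890 | 44367 |
| 3 | West | IT | Q1 | 103865 | 78044 |
| 4 | East | HR | Q1 | 85971 | 56692 |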


In [17]:
# Pivot table for cross-tabulation
# R: df %>% 
#     group_by(region, quarter) %>%
#     summarize(total_revenue = sum(revenue)) %>%
#     pivot_wider(names_from = quarter, values_from = total_revenue)
pivot_result = df_multi.pivot_table(
    values='revenue',
    index='region',
    columns='quarter',
    aggfunc='sum'
)
pivot_result

| region | Q1 | Q2 | Q3 |
|------|------|------|------|
| East | 371772 | 287065 | 302824 |
| West | 360321 | 250209 | 296646 |


In [18]:
# Multiple aggregations in pivot table
pivot_multi = df_multi.pivot_table(
    values=['revenue', 'costs'],
    index='region',
    columns='department',
    aggfunc={'revenue': 'sum', 'costs': 'mean'},
    fill_value=0
)
pivot_multi.round(0)

| region | costs / HR | costs / IT | costs / Sales | revenue / HR | revenue / IT | revenue / Sales |
|------|------|------|------|------|------|------|
| East | 54040.0 | 61154.0 | 53854.0 | 301580 | 283541 | 376540 |
| West | 53742.0 | 62925.0 | 49128.0 | 256471 | 334453 | 316252 |
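
It can help to see that a single-function `pivot_table()` is a grouped aggregation followed by a reshape. The sketch below, on a hypothetical `toy` frame, builds the same cross-tabulation both ways; it mirrors the `group_by() %>% summarize() %>% pivot_wider()` pipeline in the R comment.

```python
import pandas as pd

# Hypothetical data with the same column names as df_multi
toy = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1'],
    'revenue': [10, 20, 30, 40, 5],
})

# One call: aggregate and reshape together
via_pivot = toy.pivot_table(values='revenue', index='region',
                            columns='quarter', aggfunc='sum')

# Two steps: aggregate by both keys, then move 'quarter' into the columns
via_groupby = (toy
    .groupby(['region', 'quarter'])['revenue']
    .sum()
    .unstack('quarter'))

print(via_groupby)
```

The two-step form is useful when the aggregated long table is needed as well, for example for plotting.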


## Advanced Groupby Operations

Complex groupby patterns:

In [21]:
# Create complex dataset
np.random.seed(42)
df_complex = pd.DataFrame({
    'store_id': np.repeat(['S01', 'S02', 'S03'], 12),
    # 'M' (month end) is deprecated since pandas 2.2; use freq='ME' there
    'month': pd.date_range('2024-01-01', periods=12, freq='M').tolist() * 3,
    'product': np.tile(['A', 'B', 'C', 'D'] * 3, 3),
    'units_sold': np.random.randint(50, 200, 36),
    'price': np.random.uniform(10, 50, 36)
})
df_complex['revenue'] = df_complex['units_sold'] * df_complex['price']

# Multiple level aggregation with custom functions
result = (df_complex
    .groupby(['store_id', 'product'])
    .agg(
        total_units=('units_sold', 'sum'),
        avg_price=('price', 'mean'),
        total_revenue=('revenue', 'sum'),
        months_active=('month', 'nunique'),
        best_month=('revenue', lambda x: df_complex.loc[x.idxmax(), 'month'].strftime('%Y-%m'))
    )
    .round(2)
)
result


| store_id | product | total_units | avg_price | total_revenue | months_active | best_month |
|------|------|------|------|------|------|------|
| S01 | A | 397 | 16.56 | 6445.15 | 3 | 2024-05 |
| S01 | B | 349 | 30.56 | 11866.72 | 3 | 2024-02 |
| S01 | C | 382 | 32.46 | 10680.49 | 3 | 2024-07 |
| S01 | D | 476 | 38.77 | 18234.59 | 3 | 2024-12 |
| S02 | A | 274 | 33.27 | 8069.62 | 3 | 2024-09 |
| S02 | B | 424 | 33.23 | 13853.02 | 3 | 2024-02 |
| S02 | C | 357 | 35.06 | 11123.41 | 3 | 2024-03 |
| S02 | D | 419 | 39.56 | 16950.57 | 3 | 2024-08 |
| S03 | A | 355 | 20.11 | 7461.52 | 3 | 2024-05 |
| S03 | B | 396 | 23.47 | 9929.7 | 3 | 2024-10 |


## Named Aggregations Pattern

Using the modern named aggregation syntax:

In [22]:
# Clean named aggregations (pandas >= 0.25)
# R: df %>%
#     group_by(department) %>%
#     summarize(
#         n_employees = n(),
#         avg_salary = mean(salary),
#         sd_salary = sd(salary),
#         median_performance = median(performance),
#         salary_per_year_exp = sum(salary) / sum(years_exp)
#     )
summary = df.groupby('department').agg(
    n_employees=pd.NamedAgg(column='employee', aggfunc='count'),
    avg_salary=pd.NamedAgg(column='salary', aggfunc='mean'),
    sd_salary=pd.NamedAgg(column='salary', aggfunc='std'),
    median_performance=pd.NamedAgg(column='performance', aggfunc='median'),
    total_salary=pd.NamedAgg(column='salary', aggfunc='sum'),
    total_years=pd.NamedAgg(column='years_exp', aggfunc='sum')
).assign(
    salary_per_year_exp=lambda x: x['total_salary'] / x['total_years']
).drop(columns=['total_salary', 'total_years']).round(2)

summary

| department | n_employees | avg_salary | sd_salary | median_performance | salary_per_year_exp |
|------|------|------|------|------|------|
| HR | 3 | 68000.0 | 3000.0 | 3.9 | 17000.0 |
| IT | 3 | 87666.67 | 2516.61 | 4.6 | 9740.74 |
| Sales | 3 | 72333.33 | 2516.61 | 4.2 | 12055.56 |


## Grouped Operations in Method Chains

Integrating groupby into larger data pipelines:

In [23]:
# Complex chain with groupby
# R: df %>%
#     filter(years_exp >= 3) %>%
#     group_by(department) %>%
#     summarize(avg_salary = mean(salary),
#               avg_performance = mean(performance),
#               count = n()) %>%
#     arrange(desc(avg_salary))
result_chain = (df
    .query('years_exp >= 3')
    .groupby('department')
    .agg(
        avg_salary=('salary', 'mean'),
        avg_performance=('performance', 'mean'),
        count=('employee', 'count')
    )
    .round(2)
    .sort_values('avg_salary', ascending=False)
)
result_chain

| department | avg_salary | avg_performance | count |
|------|------|------|------|
| IT | 87666.67 | 4.6 | 3 |
| Sales | 72333.33 | 4.17 | 3 |
| HR | 68000.0 | 3.93 | 3 |


## Groupby with Time Series

Special considerations for time-based grouping:

In [24]:
# Create time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df_ts = pd.DataFrame({
    'date': dates,
    'store': np.random.choice(['A', 'B', 'C'], 365),
    'sales': np.random.randint(1000, 5000, 365) + np.random.randn(365) * 500
})

# Group by multiple time periods
# R: df %>%
#     mutate(month = floor_date(date, "month"),
#            week = floor_date(date, "week")) %>%
#     group_by(store, month) %>%
#     summarize(monthly_sales = sum(sales))
df_ts_summary = (df_ts
    .assign(
        month=lambda x: x['date'].dt.to_period('M'),
        week=lambda x: x['date'].dt.to_period('W'),
        quarter=lambda x: x['date'].dt.to_period('Q')
    )
    .groupby(['store', 'month'])
    .agg(
        monthly_sales=('sales', 'sum'),
        days_active=('date', 'count'),
        best_day_sales=('sales', 'max')
    )
    .round(0)
)
df_ts_summary.head(10)

| store | month | monthly_sales | days_active | best_day_sales |
|------|------|------|------|------|
| A | 2024-01 | 36599.0 | 12 | 5046.0 |
| A | 2024-02 | 27849.0 | 9 | 4961.0 |
| A | 2024-03 | 35669.0 | 12 | 5332.0 |
| A | 2024-04 | 32326.0 | 11 | 5189.0 |
| A | 2024-05 | 27099.0 | 9 | 4541.0 |
| A | 2024-06 | 32734.0 | 10 | 6203.0 |
| A | 2024-07 | 36259.0 | 12 | 4469.0 |
| A | 2024-08 | 33317.0 | 11 | 6041.0 |
| A | 2024-09 | 28631.0 | 11 | 4714.0 |
| A | 2024-10 | 22272.0 | 9 | 3926.0 |
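
Instead of adding a period column with `assign()` first, the time bucket can also be declared inside the `groupby()` call with `pd.Grouper`. The sketch below uses a hypothetical `toy` frame. The `'MS'` (month start) alias is chosen here as an assumption for portability, so the month labels are timestamps rather than the `Period` values shown above.

```python
import pandas as pd

# Hypothetical daily sales for two stores
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03', '2024-02-10']),
    'store': ['A', 'A', 'A', 'B'],
    'sales': [100.0, 50.0, 70.0, 30.0],
})

# Group by store and by calendar month of 'date' in one step;
# 'MS' labels each monthly bin with the first day of the month
monthly = (toy
    .groupby(['store', pd.Grouper(key='date', freq='MS')])
    .agg(monthly_sales=('sales', 'sum')))

print(monthly)
```

This plays a similar role to `floor_date(date, "month")` in the R comment, without materializing a helper column.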


## Creating Tidyverse-Style Helper Functions

Make groupby operations more dplyr-like:

In [25]:
def group_by_summarize(df, groupby_cols, **agg_funcs):
    """Mimics dplyr's group_by %>% summarize"""
    return df.groupby(groupby_cols).agg(**agg_funcs)

def group_by_mutate(df, groupby_cols, **transform_funcs):
    """Mimics dplyr's group_by %>% mutate"""
    df_copy = df.copy()
    for name, (column, func) in transform_funcs.items():
        df_copy[name] = df.groupby(groupby_cols)[column].transform(func)
    return df_copy

# Usage examples
# R: df %>% group_by(department) %>% summarize(avg_salary = mean(salary))
group_by_summarize(df, 'department', avg_salary=('salary', 'mean'))

| department | avg_salary |
|------|------|
| HR | 68000.0 |
| IT | 87666.666667 |
| Sales | 72333.333333 |


In [26]:
# R: df %>% group_by(department) %>% mutate(salary_pct = salary / sum(salary) * 100)
group_by_mutate(df, 'department', 
                salary_pct=('salary', lambda x: x / x.sum() * 100)).round(2)

|   | department | employee | salary | years_exp | performance | dept_avg_salary | salary_vs_dept_avg | salary_pct |
|------|------|------|------|------|------|------|------|------|
| 0 | Sales | Alice | 70000 | 5 | 4.2 | 72333.33 | -2333.33 | 32.26 |
| 1 | IT | Bob | 85000 | 8 | 4.5 | 87666.67 | -2666.67 | 32.32 |
| 2 | HR | Charlie | 65000 | 3 | 3.8 | 68000.0 | -3000.0 | 31.86 |
| 3 | Sales | David | 72000 | 6 | 4.0 | 72333.33 | -333.33 | 33.18 |
| 4 | IT | Eve | 90000 | 10 | 4.7 | 87666.67 | 2333.33 | 34.22 |
| 5 | HR | Frank | 68000 | 4 | 3.9 | 68000.0 | 0.0 | 33.33 |
| 6 | Sales | Grace | 75000 | 7 | 4.3 | 72333.33 | 2666.67 | 34.56 |
| 7 | IT | Henry | 88000 | 9 | 4.6 | 87666.67 | 333.33 | 33.46 |
| 8 | HR | Iris | 71000 | 5 | 4.1 | 68000.0 | 3000.0 | 34.8 |


## Performance Tips

Efficient groupby operations:

In [27]:
# Create large DataFrame
large_df = pd.DataFrame({
    'group': np.random.choice(['A', 'B', 'C', 'D'], 100000),
    'value': np.random.randn(100000)
})

import time

# time.perf_counter() is a high-resolution clock intended for timing code

# Method 1: Single pass aggregation (fastest)
start = time.perf_counter()
result1 = large_df.groupby('group')['value'].agg(['mean', 'sum', 'std'])
print(f"Single pass: {time.perf_counter() - start:.4f} seconds")

# Method 2: Multiple separate aggregations (slower)
start = time.perf_counter()
mean_result = large_df.groupby('group')['value'].mean()
sum_result = large_df.groupby('group')['value'].sum()
std_result = large_df.groupby('group')['value'].std()
print(f"Multiple passes: {time.perf_counter() - start:.4f} seconds")

Single pass: 0.0040 seconds
Multiple passes: 0.0098 seconds
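
A single timed run like the one above is sensitive to noise such as caching and background load. A somewhat more robust comparison, sketched here with the standard-library `timeit` module, repeats each variant and keeps the best time. The names `bench_df`, `single_pass`, and `multiple_passes` are hypothetical, and the actual timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the data is reproducible
bench_df = pd.DataFrame({
    'group': rng.choice(['A', 'B', 'C', 'D'], 100_000),
    'value': rng.standard_normal(100_000),
})

def single_pass():
    # One groupby, several aggregations at once
    return bench_df.groupby('group')['value'].agg(['mean', 'sum', 'std'])

def multiple_passes():
    # A fresh groupby for every statistic
    return pd.DataFrame({
        'mean': bench_df.groupby('group')['value'].mean(),
        'sum': bench_df.groupby('group')['value'].sum(),
        'std': bench_df.groupby('group')['value'].std(),
    })

# Best of several repeats is less sensitive to background noise
t_single = min(timeit.repeat(single_pass, number=5, repeat=3))
t_multi = min(timeit.repeat(multiple_passes, number=5, repeat=3))
print(f"Single pass: {t_single:.4f}s, multiple passes: {t_multi:.4f}s")
```

Both functions return the same numbers, so any difference in the printed times reflects only how the work is organized.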
