# pandas lesson 3 (Group By / Aggregate)

This lesson shows how to group and aggregate data in a pandas DataFrame.

**Grouping** is the process of splitting data into groups based on some criteria (e.g., grouping sales data by region).

**Aggregation** is the process of computing summary statistics for each group (e.g., total sales, average sales, count of transactions).

The typical pattern is **split-apply-combine**:

1. **Split**: Divide the DataFrame into groups based on column values
2. **Apply**: Perform calculations on each group independently
3. **Combine**: Merge the results back into a single DataFrame or Series
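
To make the three steps concrete, the sketch below performs them by hand and then compares the result with a single `groupby()` call. The `sales` DataFrame and its `region` and `amount` columns are hypothetical, made up for this illustration only.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'East'],
    'amount': [100, 200, 150, 50, 300],
})

# 1. Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
pieces = {region: group for region, group in sales.groupby('region')}

# 2. Apply: compute a summary for each piece independently
totals = {region: group['amount'].sum() for region, group in pieces.items()}

# 3. Combine: assemble the per-group results into a single Series
manual_result = pd.Series(totals, name='amount').rename_axis('region')

# groupby() performs all three steps in one expression
groupby_result = sales.groupby('region')['amount'].sum()

print(manual_result)
print(groupby_result)
```

In practice you would rarely write the manual version; it is shown here only to expose what `groupby()` does behind the scenes.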

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # pandas uses matplotlib for plotting

## Sample Dataset

In this lesson we will use a small DataFrame containing football team statistics. This dataset includes teams from different cities and whether they qualified for the Champions League.

In [None]:
fb_dict = {
    'id': ['MCY', 'LIV', 'TOT', 'CHE', 'ARL'],
    'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
    'team': ['Manchester City', 'Liverpool', 'Tottenham Hotspur', 'Chelsea', 'Arsenal'],
    'champions_league': ['Yes', 'Yes', 'No', 'No', 'Yes'],
    'won': [5, 6, 6, 5, 5],
    'drawn': [4, 1, 0, 2, 0],
    'lost': [0, 0, 2, 0, 2],
    'form': ['DWWWW', 'WWWWD', 'LLWWW', 'WWWDD', 'WWWWW']
}

fb = pd.DataFrame(fb_dict)
fb

In [None]:
# set the index to the unique values of the 'id' column - more useful than 0,1,2...
fb = fb.set_index('id')
fb

## Basic Grouping

Group by city to analyze team performance by city. The `groupby()` method creates a GroupBy object which can then be aggregated.

In [None]:
# Create a GroupBy object by grouping on the 'city' column
# as_index=False keeps 'city' as a regular column in aggregated results
# instead of making it the index
fb_by_city = fb.groupby('city', as_index=False)

# The .groups attribute shows which rows belong to each group
fb_by_city.groups

## Common Aggregation Functions

Get the totals of all numeric columns per city using the `sum()` aggregation function.

In [None]:
# Sum all numeric columns (won, drawn, lost) for each city
# This shows total wins, draws, and losses across all teams in each city
fb_by_city.sum(numeric_only=True)

### Selecting Specific Columns to Aggregate

You can select specific columns before aggregating to focus on particular metrics.

In [None]:
# Only sum the 'lost' and 'won' columns for each city
# Notice the column order in the result matches the order we specify
fb_by_city[['lost', 'won']].sum()

Exercise: group the teams by whether they are in the Champions League (`champions_league` is 'Yes' or 'No'), then sum the `won`, `drawn` and `lost` columns.

In [None]:
# Write your code here
# fb.groupby('champions_league', as_index=False)[['won', 'drawn', 'lost']].sum()

### Other Aggregation Functions

Besides `sum()`, pandas GroupBy objects support many other aggregation functions:
- `count()` - count non-null values
- `mean()` - calculate the average
- `median()` - calculate the median
- `min()` - find minimum value
- `max()` - find maximum value
- `std()` - calculate standard deviation
- `var()` - calculate variance

In [None]:
# Count how many teams are in each city
fb_by_city['team'].count()

In [None]:
# Calculate the average (mean) number of wins per city
fb_by_city['won'].mean()

In [None]:
# Find the maximum and minimum wins per city
print("Maximum wins per city:")
print(fb_by_city['won'].max())
print("\nMinimum wins per city:")
print(fb_by_city['won'].min())
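
The cells above cover `count()`, `mean()`, `max()` and `min()`. As a small sketch of two of the remaining functions, the block below applies `median()` and `std()` to a cut-down copy of the football data (the name `fb_demo` is introduced here for illustration). One thing to watch for: `std()` uses the sample standard deviation by default, so a group with a single row has no defined spread and comes back as `NaN`.

```python
import pandas as pd

# Cut-down copy of the lesson's data, so this block runs on its own
fb_demo = pd.DataFrame({
    'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
    'won': [5, 6, 6, 5, 5],
})

won_by_city = fb_demo.groupby('city')['won']

# Median wins per city
median_won = won_by_city.median()

# Sample standard deviation of wins per city;
# cities with only one team produce NaN
std_won = won_by_city.std()

print(median_won)
print(std_won)
```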

## Using .agg() for Multiple Aggregations

The `.agg()` method applies several aggregation functions in a single call, and it can apply different functions to different columns.

In [None]:
# Apply multiple aggregation functions to the 'won' column
# This shows sum, mean, min, and max all at once
fb_by_city['won'].agg(['sum', 'mean', 'min', 'max'])

In [None]:
# Apply different aggregation functions to different columns
# won: sum and mean, lost: sum and max, drawn: mean
fb_by_city.agg({
    'won': ['sum', 'mean'],
    'lost': ['sum', 'max'],
    'drawn': 'mean'
})
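
Passing lists inside the dictionary produces two-level column labels such as `('won', 'sum')`. If flat, descriptive column names are preferred, pandas also supports named aggregation, where each keyword argument is an output column name mapped to a `(column, function)` pair. The sketch below uses a cut-down copy of the data; the output names `total_won`, `mean_won` and `max_lost` are arbitrary choices for this example.

```python
import pandas as pd

# Cut-down copy of the lesson's data, so this block runs on its own
fb_demo = pd.DataFrame({
    'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
    'won': [5, 6, 6, 5, 5],
    'lost': [0, 0, 2, 0, 2],
})

# Named aggregation: output_name=(input_column, function)
summary = fb_demo.groupby('city', as_index=False).agg(
    total_won=('won', 'sum'),
    mean_won=('won', 'mean'),
    max_lost=('lost', 'max'),
)

print(summary)
```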

## Grouping by Multiple Columns

You can group by more than one column to create hierarchical groupings. This creates subgroups within groups.

In [None]:
# Group by both city AND champions_league status
# This creates groups like: (Liverpool, Yes), (London, No), (London, Yes), etc.
fb_multi_group = fb.groupby(['city', 'champions_league'], as_index=False)
fb_multi_group['won'].sum()
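
Without `as_index=False`, grouping by two columns returns a result with a two-level (hierarchical) index. One possible way to read such a result is to pivot the inner level into columns with `unstack()`. This is a sketch on a cut-down copy of the data; `fill_value=0` is a choice made here so that combinations with no teams show 0 rather than `NaN`.

```python
import pandas as pd

# Cut-down copy of the lesson's data, so this block runs on its own
fb_demo = pd.DataFrame({
    'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
    'champions_league': ['Yes', 'Yes', 'No', 'No', 'Yes'],
    'won': [5, 6, 6, 5, 5],
})

# Series with a MultiIndex of (city, champions_league)
won_by_city_cl = fb_demo.groupby(['city', 'champions_league'])['won'].sum()

# Move the inner index level into columns: one row per city,
# one column per champions_league value
wide = won_by_city_cl.unstack(fill_value=0)

print(won_by_city_cl)
print(wide)
```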

## Understanding size() vs count()

Both `.size()` and `.count()` give you the number of items in each group, but they behave differently:
- **size()**: Returns the number of **rows** in each group (including rows with missing values)
- **count()**: Returns the number of **non-null values** in each column for each group

In [None]:
# size() returns the total number of rows in each group
print("Size (number of teams per city):")
print(fb_by_city.size())

print("\nCount (non-null values per column per city):")
print(fb_by_city.count())
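
The football data has no missing values, so `size()` and `count()` report the same numbers there. The sketch below uses a small hypothetical DataFrame (`demo`) with one missing value added on purpose, which makes the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing 'won' value, for illustration only
demo = pd.DataFrame({
    'city': ['London', 'London', 'Liverpool'],
    'won': [6, np.nan, 6],
})

by_city = demo.groupby('city')

# size() counts rows, whether or not values are missing
sizes = by_city.size()

# count() counts only the non-null values in the selected column
counts = by_city['won'].count()

print(sizes)   # London has 2 rows
print(counts)  # London has only 1 non-null 'won' value
```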

## Custom Aggregation Functions

You can create your own custom aggregation functions and use them with `.agg()` or `.apply()`. This is useful when the built-in functions don't meet your needs.

In [None]:
# Define a custom function to calculate the range (max - min)
def calc_range(x):
    return x.max() - x.min()

# Use the custom function with agg()
fb_by_city['won'].agg(['mean', calc_range])
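
The same custom function can also be passed to `.apply()`. As a sketch, the block below does that and then uses `.apply()` for a calculation that needs two columns at once, which a per-column aggregation cannot express directly. The `total_points` helper and its scoring rule (3 points per win, 1 per draw) are assumptions introduced for this example and are not part of the lesson's data.

```python
import pandas as pd

# Cut-down copy of the lesson's data, so this block runs on its own
fb_demo = pd.DataFrame({
    'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
    'won': [5, 6, 6, 5, 5],
    'drawn': [4, 1, 0, 2, 0],
})

def calc_range(x):
    # Range of a Series: max - min
    return x.max() - x.min()

def total_points(group):
    # Assumed scoring rule: 3 points per win, 1 point per draw
    return (3 * group['won'] + group['drawn']).sum()

# apply() on a single column: the function receives one Series per group
range_by_city = fb_demo.groupby('city')['won'].apply(calc_range)

# apply() on several columns: the function receives one DataFrame per group
points_by_city = fb_demo.groupby('city')[['won', 'drawn']].apply(total_points)

print(range_by_city)
print(points_by_city)
```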

## Transform vs Aggregate

**Key Difference:**
- **Aggregate**: Reduces each group to a single value (the result has one row per group)
- **Transform**: Returns a value for every row of the original DataFrame (the result has the same length as the input)

Transform is useful when you want to broadcast group statistics back to every row in the group.

In [None]:
# AGGREGATE: Returns one value per group (3 cities → 3 values)
print("Aggregate - mean wins per city:")
print(fb.groupby('city')['won'].mean())
print("\nShape:", fb.groupby('city')['won'].mean().shape)

In [None]:
# TRANSFORM: Returns one value per row in original DataFrame (5 teams → 5 values)
# Each team gets its city's mean wins
print("Transform - add city mean to each team:")
fb['city_mean_wins'] = fb.groupby('city')['won'].transform('mean')
print(fb[['team', 'city', 'won', 'city_mean_wins']])
print("\nShape:", fb['city_mean_wins'].shape)
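
Because the transformed result lines up with the original rows, it can be used directly in row-wise arithmetic. A minimal sketch, on a cut-down copy of the data, that measures how far each team's wins are from its city's mean (the column name `won_vs_city_mean` is chosen here for illustration):

```python
import pandas as pd

# Cut-down copy of the lesson's data, indexed by team id
fb_demo = pd.DataFrame(
    {
        'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
        'won': [5, 6, 6, 5, 5],
    },
    index=['MCY', 'LIV', 'TOT', 'CHE', 'ARL'],
)

# One value per original row, aligned on the original index
city_mean = fb_demo.groupby('city')['won'].transform('mean')

# Positive values: the team won more than its city's average
fb_demo['won_vs_city_mean'] = fb_demo['won'] - city_mean

print(fb_demo)
```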

## Filtering Groups with .filter()

The `.filter()` method allows you to filter entire groups based on a condition. If a group doesn't meet the condition, all rows in that group are removed.

In [None]:
# Filter to keep only cities where the total wins across all teams > 10
# This will keep London teams (total: 16) and remove Liverpool (6) and Manchester (5)
filtered_teams = fb.groupby('city').filter(lambda x: x['won'].sum() > 10)
print("Teams from cities with total wins > 10:")
print(filtered_teams[['team', 'city', 'won']])
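
The same rows can also be selected with `transform()` and a boolean mask, which avoids a Python lambda and may be faster on large data. This is offered as an alternative sketch on a cut-down copy of the data, not as a replacement for `.filter()`.

```python
import pandas as pd

# Cut-down copy of the lesson's data, indexed by team id
fb_demo = pd.DataFrame(
    {
        'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
        'won': [5, 6, 6, 5, 5],
    },
    index=['MCY', 'LIV', 'TOT', 'CHE', 'ARL'],
)

# Broadcast each city's total wins to every row of that city
city_total = fb_demo.groupby('city')['won'].transform('sum')

# Keep rows whose city total exceeds 10
kept = fb_demo[city_total > 10]

# The filter() version from the lesson, for comparison
via_filter = fb_demo.groupby('city').filter(lambda g: g['won'].sum() > 10)

print(kept)
print(via_filter)
```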

## Summary and Best Practices

### Quick Reference

**Common GroupBy Patterns:**
```python
# Basic grouping and aggregation
df.groupby('column').sum(numeric_only=True)
df.groupby('column')['specific_col'].mean()

# Multiple aggregations (select a numeric column first; 'mean' fails on text columns)
df.groupby('column')['value'].agg(['sum', 'mean', 'count'])

# Different aggregations per column
df.groupby('column').agg({'col1': 'sum', 'col2': 'mean'})

# Group by multiple columns
df.groupby(['col1', 'col2']).sum(numeric_only=True)

# Transform to broadcast group stats
df['group_mean'] = df.groupby('column')['value'].transform('mean')
```

### When to Use What

- **`.agg()`**: When you need summary statistics per group
- **`.transform()`**: When you need to add group statistics back to each row
- **`.filter()`**: When you need to keep/remove entire groups based on conditions
- **`.apply()`**: When you need maximum flexibility (can return aggregated or transformed data)

### Common Use Cases

1. **Sales Analysis**: Group by region/product to calculate total sales, average order value
2. **Time Series**: Group by date/month to find trends, seasonality
3. **Customer Segmentation**: Group by customer type to analyze behavior patterns
4. **Data Quality**: Group by category to count nulls, find outliers
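
As a sketch of the time series use case, the block below groups hypothetical order data by calendar month. The `orders` DataFrame, its column names and its values are made up for illustration; converting the dates with `.dt.to_period('M')` is one of several ways to derive a monthly grouping key.

```python
import pandas as pd

# Hypothetical order data, for illustration only
orders = pd.DataFrame({
    'date': pd.to_datetime([
        '2024-01-05', '2024-01-20', '2024-02-03', '2024-02-28', '2024-03-15',
    ]),
    'amount': [120.0, 80.0, 200.0, 50.0, 90.0],
})

# Derive a monthly key from the dates and group by it
month = orders['date'].dt.to_period('M')
monthly = orders.groupby(month)['amount'].agg(['sum', 'mean'])

print(monthly)
```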

## Practice Exercise

Using the football DataFrame (fb), complete the following tasks:

1. Group by `champions_league` and calculate the mean of `won`, `drawn`, and `lost`
2. Create a new column called `win_rate` that shows each team's wins relative to their city's average wins (team wins divided by the city mean)
3. Find cities where at least one team has more than 5 wins

Try these on your own before checking the solutions below!

In [None]:
# Solutions (uncomment to run):

# 1. Group by champions_league and calculate mean
# fb.groupby('champions_league')[['won', 'drawn', 'lost']].mean()

# 2. Create win_rate column (team wins / city average wins)
# city_avg = fb.groupby('city')['won'].transform('mean')
# fb['win_rate'] = fb['won'] / city_avg
# fb[['team', 'city', 'won', 'win_rate']]

# 3. Find cities where at least one team has more than 5 wins
# fb.groupby('city').filter(lambda x: x['won'].max() > 5)['city'].unique()