# ⚙️ Part 5: Advanced GroupBy, Aggregation, and Transformation

**Goal:** Move beyond basic aggregation to multi-column grouping, custom aggregations, and the critical difference between **Aggregate** and **Transform** operations.

---
### Key Learning Objectives
1.  Understand the structure of a Pandas **GroupBy Object**.
2.  Perform multi-level grouping and reshape output using `unstack()`.
3.  Use the `.agg()` method to compute multiple statistics across multiple columns.
4.  Implement custom logic using `.apply()` and `.transform()`.
5.  Filter groups based on aggregate conditions using `.filter()`.

In [1]:
import pandas as pd
import numpy as np
import os

# Load the cleaned, feature-engineered dataset saved by the previous notebook
try:
    titanic_df = pd.read_csv('data-visualization/data/titanic_with_features.csv')
    print("✅ Loaded titanic_with_features.csv")
except FileNotFoundError:
    url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
    titanic_df = pd.read_csv(url)
    print("⚠️ Featured data not found. Loaded the original data instead; cells that use engineered columns such as AgeGroup or Title_simple will fail.")

print(f"Dataset shape: {titanic_df.shape}")
print(f"Columns: {list(titanic_df.columns)}")

✅ Loaded titanic_with_features.csv
Dataset shape: (891, 15)
Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'AgeGroup', 'FamilySize', 'IsAlone', 'Title_simple', 'Deck']


## 1. The GroupBy Object

The `groupby()` method doesn't immediately perform calculations; it creates a special **GroupBy Object**. This object holds the instructions on *how* to split the DataFrame.

Use the `.ngroups` attribute and the `.size()` method to inspect the groups before calculating anything.

In [2]:
# SECTION 1: UNDERSTANDING GROUPBY (as in original script)

grouped = titanic_df.groupby('Sex')
print(f"Type of groupby object: {type(grouped)}")
print(f"Number of groups: {grouped.ngroups}")
print("Group sizes:")
print(grouped.size())

Type of groupby object: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Number of groups: 2
Group sizes:
Sex
female    314
male      577
dtype: int64
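
Beyond `.ngroups` and `.size()`, a GroupBy object can also be inspected group by group. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the Titanic file) so it runs on its own; it shows `.groups`, iteration, and `.get_group()`.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
})

grouped = toy.groupby("Sex")

# .groups maps each group key to the row labels that belong to it.
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}
print(group_labels)

# Iterating yields (key, sub-DataFrame) pairs, one per group.
for key, frame in grouped:
    print(key, frame.shape)

# .get_group() pulls out a single group as an ordinary DataFrame.
females = grouped.get_group("female")
print(females)
```

Nothing is aggregated in any of these steps; they only reveal how the rows were split.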


## 2. Basic GroupBy Aggregation

Aggregation collapses the rows of each group into a single value (e.g., the group mean).

* **Single Stat:** Applying a single method (like `.mean()` or `.count()`) after grouping.
* **`.describe()`:** Automatically calculates a full statistical summary for numeric columns within each group.

In [3]:
# SECTION 2: BASIC GROUPBY OPERATIONS (as in original script)

print("Survival rate by gender:")
gender_survival = titanic_df.groupby('Sex')['Survived'].mean()
print(gender_survival)
print(f"Type: {type(gender_survival)}")

print("\nMultiple statistics at once using describe():")
gender_stats = titanic_df.groupby('Sex')['Age'].describe()
print(gender_stats)

Survival rate by gender:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
Type: <class 'pandas.core.series.Series'>

Multiple statistics at once using describe():
        count       mean        std   min   25%   50%   75%   max
Sex                                                              
female  314.0  27.261146  13.111827  0.75  21.0  24.0  35.0  63.0
male    577.0  30.119879  13.311447  0.42  23.0  27.0  37.0  80.0
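
By default the group keys become the index of the result. When a flat table is easier to work with, the keys can be kept as a regular column. This is a minimal sketch on made-up data (`toy` is hypothetical), showing two ways to get that layout.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1],
})

# Default: group keys become the index and the result is a Series.
as_series = toy.groupby("Sex")["Survived"].mean()

# as_index=False keeps the keys as a regular column and returns a DataFrame.
as_frame = toy.groupby("Sex", as_index=False)["Survived"].mean()

# reset_index() on the Series gives the same flat layout.
also_frame = as_series.reset_index()

print(as_frame)
```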


## 3. Multi-Column Grouping and Reshaping

Grouping by two or more columns creates a **MultiIndex** (hierarchical index).

* **Hierarchical Index:** The output Series has two or more index levels (e.g., `Sex` and `Pclass`).
* **`.unstack()`:** By default, pivots the innermost index level (e.g., `Pclass`) into columns, making the data easier to read and visualize.

In [4]:
# SECTION 3: MULTIPLE COLUMN GROUPING (as in original script)

print("Survival rate by Sex AND Passenger Class:")
multi_group = titanic_df.groupby(['Sex', 'Pclass'])['Survived'].mean()
print(multi_group)

print("\nThis creates a hierarchical index (MultiIndex)")
print(f"Index levels: {multi_group.index.names}")

print("\nUnstacked view (like a pivot table):")
pivot_view = multi_group.unstack()
print(pivot_view)

Survival rate by Sex AND Passenger Class:
Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64

This creates a hierarchical index (MultiIndex)
Index levels: ['Sex', 'Pclass']

Unstacked view (like a pivot table):
Pclass         1         2         3
Sex                                 
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
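
`.unstack()` also accepts a `level` argument to choose which index level moves to the columns, and `pivot_table()` can produce the same table in one call. The sketch below uses made-up data (`toy` is hypothetical) to compare the three.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 1, 0, 0, 1, 1],
})

rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

# Default: the innermost level (Pclass) moves to the columns.
by_class = rates.unstack()

# Naming a level moves that one instead, here Sex.
by_sex = rates.unstack(level="Sex")

# pivot_table performs the grouping and the reshaping in one call.
pivoted = toy.pivot_table(values="Survived", index="Sex",
                          columns="Pclass", aggfunc="mean")

print(by_class)
print(by_sex)
```

Which form to use is mostly a matter of taste: `groupby` plus `unstack` keeps the intermediate Series available, while `pivot_table` is more compact.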


## 4. Advanced Aggregation (`.agg()`)

The `.agg()` method provides fine-grained control over aggregation:

* **Multiple Stats on One Column:** Pass a list of function names (e.g., `['mean', 'median', 'std']`).
* **Different Stats on Different Columns:** Pass a dictionary mapping each column name to a function name or a list of function names.
* **Named Aggregations:** Pass keyword arguments of the form `new_name=(column, function)` to name the output columns directly.

In [5]:
# SECTION 4: ADVANCED AGGREGATION (as in original script)

print("Different functions for different columns:")
complex_agg = titanic_df.groupby('Sex').agg({
    'Age': ['mean', 'median'],
    'Fare': ['mean', 'max'],
    'Survived': 'sum'
})
print(complex_agg)

print("\nNamed aggregations (cleaner output):")
named_agg = titanic_df.groupby('Sex').agg(
    avg_age=('Age', 'mean'),
    median_age=('Age', 'median'),
    total_survivors=('Survived', 'sum'),
    passenger_count=('PassengerId', 'count')
)
print(named_agg)

Different functions for different columns:
              Age              Fare           Survived
             mean median       mean       max      sum
Sex                                                   
female  27.261146   24.0  44.479818  512.3292      233
male    30.119879   27.0  25.523893  512.3292      109

Named aggregations (cleaner output):
          avg_age  median_age  total_survivors  passenger_count
Sex                                                            
female  27.261146        24.0              233              314
male    30.119879        27.0              109              577
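
Named aggregations are not limited to built-in function names; any callable that reduces a Series to a single value can be used. This is a small sketch on made-up data (`toy` and `value_range` are hypothetical names introduced here for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Sex":  ["female", "female", "male", "male", "male"],
    "Age":  [38.0, 26.0, 22.0, 35.0, 54.0],
    "Fare": [71.28, 7.93, 7.25, 53.10, 51.86],
})

def value_range(series):
    """Spread between the largest and smallest value in a group."""
    return series.max() - series.min()

summary = toy.groupby("Sex").agg(
    avg_age=("Age", "mean"),
    age_range=("Age", value_range),        # custom named function
    max_fare=("Fare", lambda s: s.max()),  # lambdas also work
)
print(summary)
```

Built-in names such as `'mean'` are usually the better choice when one exists, since custom Python functions are called once per group.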


## 5. Transform vs. Aggregate

This is a critical distinction in Pandas:

* **Aggregate (`.agg()`):** Returns **one row per group**, so the result has fewer rows than the original DataFrame (e.g., mean age by class).
* **Transform (`.transform()`):** Returns a result with the **same number of rows** (and the same index) as the original, so it can be assigned back to the DataFrame as a new column (e.g., each passenger's age deviation *within* their class).

In [9]:
# SECTION 5: TRANSFORM VS AGGREGATE (as in original script)

print("Aggregate example - mean age by class:")
agg_result = titanic_df.groupby('Pclass')['Age'].mean()
print(f"Shape: {agg_result.shape}")

print("\nTransform example - subtract group mean from each value:")
titanic_df['Age_minus_class_mean'] = titanic_df.groupby('Pclass')['Age'].transform(lambda x: x - x.mean())
print("First 10 rows showing original Age and Age minus class mean:")
print(titanic_df[['PassengerId', 'Pclass', 'Age', 'Age_minus_class_mean']].head(10))

print(f"\nOriginal DataFrame shape: {titanic_df.shape}")
print("Transform kept the same number of rows!")

Aggregate example - mean age by class:
Shape: (3,)

Transform example - subtract group mean from each value:
First 10 rows showing original Age and Age minus class mean:
   PassengerId  Pclass   Age  Age_minus_class_mean
0            1       3  22.0             -2.802281
1            2       1  38.0             -0.270463
2            3       3  26.0              1.197719
3            4       1  35.0             -3.270463
4            5       3  35.0             10.197719
5            6       3  25.0              0.197719
6            7       1  54.0             15.729537
7            8       3   2.0            -22.802281
8            9       3  27.0              2.197719
9           10       2  14.0            -15.863207

Original DataFrame shape: (891, 16)
Transform kept the same number of rows!
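
To make the shape difference concrete, the sketch below runs `.agg()`, `.transform()`, and `.apply()` on the same grouped column. It uses made-up data (`toy` is hypothetical), and the z-score and "two oldest" calculations are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Age":    [22.0, 38.0, 26.0, 35.0, 30.0, 54.0],
})
ages = toy.groupby("Pclass")["Age"]

# agg: one value per group, indexed by the group key.
class_mean = ages.agg("mean")

# transform: one value per original row, aligned to the original index.
# A built-in name such as "mean" broadcasts the group statistic to each row.
row_mean = ages.transform("mean")
z_score = (toy["Age"] - row_mean) / ages.transform("std")

# apply: the function may return any shape; here the two oldest per class.
oldest_two = ages.apply(lambda s: s.nlargest(2))

print(class_mean.shape, row_mean.shape, oldest_two.shape)
```

Passing a string such as `"mean"` to `.transform()` gives the same broadcast as a lambda that returns the group mean, without calling a Python function for every group.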


## 6. Filtering Groups

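The `.filter()` method keeps or discards entire groups. The function you pass receives each group as a DataFrame and must return a single `True` or `False`; all rows of the groups that return `True` are kept (e.g., keep groups with more than 100 rows, or with a mean survival rate above 50%). Note that every passenger class in this dataset has more than 100 passengers, so the first filter below keeps all 891 rows, while the second one drops the entire male group.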

In [8]:
# SECTION 6: FILTERING GROUPS (as in original script)

print("Only show passenger classes with more than 100 passengers:")
large_classes = titanic_df.groupby('Pclass').filter(lambda x: len(x) > 100)
print(f"Original shape: {titanic_df.shape}")
print(f"Filtered shape: {large_classes.shape}")
print("Passenger counts by class in filtered data:")
print(large_classes['Pclass'].value_counts().sort_index())

print("\nOnly show groups where survival rate > 50%:")
high_survival = titanic_df.groupby('Sex').filter(lambda x: x['Survived'].mean() > 0.5)
print("Gender distribution in high-survival groups:")
print(high_survival['Sex'].value_counts())

Only show passenger classes with more than 100 passengers:
Original shape: (891, 16)
Filtered shape: (891, 16)
Passenger counts by class in filtered data:
Pclass
1    216
2    184
3    491
Name: count, dtype: int64

Only show groups where survival rate > 50%:
Gender distribution in high-survival groups:
Sex
female    314
Name: count, dtype: int64
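
Since the 100-passenger threshold above happens to keep every class, here is a sketch on made-up data (`toy` is hypothetical) where `.filter()` does drop a group. It also shows an equivalent boolean mask built with `.transform()`, a common alternative when the condition depends only on a built-in statistic.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 3, 1],
    "Fare":   [71.3, 7.9, 8.1, 13.0, 7.3, 53.1],
})

# Keep only classes with at least two passengers; class 2 has a single row.
kept = toy.groupby("Pclass").filter(lambda g: len(g) >= 2)

# Same rows via a boolean mask: broadcast each group's size to its rows.
group_size = toy.groupby("Pclass")["Fare"].transform("size")
kept_by_mask = toy[group_size >= 2]

print(kept)
```

Both results keep the original row order and index labels, so they can be compared or joined back to the source DataFrame directly.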


## 7. Real-World Analysis with GroupBy

With the techniques above, we can answer multi-part questions in a few lines. This section also uses the engineered features (such as `Title_simple`) created in the previous notebook.

In [10]:
# SECTION 7: REAL-WORLD ANALYSIS (as in original script)

print("1. Survival rate by passenger class:")
class_survival = titanic_df.groupby('Pclass')['Survived'].agg(['count', 'sum', 'mean'])
class_survival.columns = ['Total_Passengers', 'Survivors', 'Survival_Rate']
class_survival = class_survival.round(3)
print(class_survival)

print("\n2. Average age of survivors vs non-survivors by gender:")
age_by_survival = titanic_df.groupby(['Sex', 'Survived'])['Age'].mean().unstack()
age_by_survival.columns = ['Non_Survivors', 'Survivors']
print(age_by_survival.round(1))

print("\n3. Survival rate by title (using engineered feature):")
title_survival = titanic_df.groupby('Title_simple')['Survived'].agg(['count', 'mean'])
title_survival.columns = ['Count', 'Survival_Rate']
title_survival = title_survival.sort_values('Survival_Rate', ascending=False)
print(title_survival.head(5).round(3))

1. Survival rate by passenger class:
        Total_Passengers  Survivors  Survival_Rate
Pclass                                            
1                    216        136          0.630
2                    184         87          0.473
3                    491        119          0.242

2. Average age of survivors vs non-survivors by gender:
        Non_Survivors  Survivors
Sex                             
female           24.3       28.3
male             30.7       27.7

3. Survival rate by title (using engineered feature):
              Count  Survival_Rate
Title_simple                      
Mrs             125          0.792
Miss            182          0.698
Master           40          0.575
Other            27          0.444
Mr              517          0.157
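
The cell above renames columns by assigning a list, which depends on the columns arriving in the expected order. One alternative, sketched here on made-up data (`toy` is hypothetical), is to attach the labels explicitly with named aggregation and a `rename` mapping.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 0, 0, 1, 0, 1],
    "Age":      [38.0, 26.0, 22.0, 35.0, 54.0, 30.0],
})

# Named aggregation attaches each label to its statistic explicitly.
by_sex = toy.groupby("Sex").agg(
    Total_Passengers=("Survived", "count"),
    Survivors=("Survived", "sum"),
    Survival_Rate=("Survived", "mean"),
)

# A rename mapping ties each label to the value it describes (0 or 1),
# so the labels stay correct even if the column order changes.
age_by_outcome = (
    toy.groupby(["Sex", "Survived"])["Age"].mean()
       .unstack()
       .rename(columns={0: "Non_Survivors", 1: "Survivors"})
)

print(by_sex)
print(age_by_outcome)
```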


## 8. Practice and Conclusion

This final example combines multi-level grouping with a mixed aggregation.

* **Complexity:** Grouping by three variables and applying different functions to different columns.
* **Performance:** Computing all statistics in a single `.agg()` call instead of repeating the `groupby()` for each one.
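
The sketch below illustrates both points on made-up data (`toy` is hypothetical): one `.agg()` call with flattened column labels, next to the same numbers assembled from three separate `groupby()` calls. It only checks that the results match; no timing is measured here, and any speed difference will depend on the data size.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male"],
    "Pclass":   [1, 1, 3, 3],
    "Survived": [1, 1, 0, 1],
    "Fare":     [80.0, 60.0, 8.0, 10.0],
})
keys = ["Sex", "Pclass"]

# One .agg() call: the grouping is set up once for all statistics.
combined = toy.groupby(keys).agg({"Survived": ["sum", "mean"], "Fare": ["mean"]})

# Flatten the two-level column labels: ("Survived", "sum") -> "Survived_sum".
combined.columns = ["_".join(col) for col in combined.columns]

# The same numbers from three separate groupby calls, stitched together.
separate = pd.concat(
    {
        "Survived_sum": toy.groupby(keys)["Survived"].sum(),
        "Survived_mean": toy.groupby(keys)["Survived"].mean(),
        "Fare_mean": toy.groupby(keys)["Fare"].mean(),
    },
    axis=1,
)

print(combined)
```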

In [11]:
# SECTION 8: ADVANCED TECHNIQUES (as in original script)

print("Complex analysis: Age groups by gender and class (using featured data)")
age_group_analysis = titanic_df.groupby(['Sex', 'Pclass', 'AgeGroup']).agg({
    'PassengerId': 'count',
    'Survived': ['sum', 'mean'],
    'Fare': 'mean'
}).round(2)

age_group_analysis.columns = ['_'.join(col).strip() for col in age_group_analysis.columns]
print(age_group_analysis.head(10))


# Final Summary
print("\n--- KEY LEARNING CONCLUSION ---")
print("Cleaned data enables detailed grouping, feature-based analysis, and answering key business questions.")
print("We mastered grouping, aggregation, transformation, filtering, and performance tips with Pandas!")

Complex analysis: Age groups by gender and class (using featured data)
                        PassengerId_count  Survived_sum  Survived_mean  \
Sex    Pclass AgeGroup                                                   
female 1      Adult                    93            91           0.98   
              Child                     1             0           0.00   
       2      Adult                    68            62           0.91   
              Child                     8             8           1.00   
       3      Adult                   121            61           0.50   
              Child                    23            11           0.48   
male   1      Adult                   113            41           0.36   
              Child                     3             3           1.00   
              Elder                     6             1           0.17   
       2      Adult                    97             8           0.08   

                        Fare_mean  
Sex 