# GroupBy Operations in Pandas

### What is `.groupby()` in Pandas?

The `.groupby()` method in Pandas is a **foundational tool** that groups rows by the values of one or more columns so that you can **apply aggregation functions** such as `.mean()`, `.sum()`, and `.count()` to each group. It follows the classic **"Split-Apply-Combine" strategy**:

1. **Split** the data into groups based on a column.
2. **Apply** an operation (like mean or sum) to each group.
3. **Combine** the results into a new structure.
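
The three steps can be spelled out by hand to show what the one-liner does. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the Titanic data) and compares a manual split, apply, and combine with the equivalent `groupby` call.

```python
import pandas as pd

# Toy rows standing in for Titanic-style data (values are made up).
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 12.0],
})

# 1. Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs.
# 2. Apply: compute the mean fare of each sub-DataFrame.
per_group = {key: group["Fare"].mean() for key, group in df.groupby("Pclass")}

# 3. Combine: collect the per-group results into a single Series.
manual = pd.Series(per_group, name="Fare").rename_axis("Pclass")

# The one-liner performs all three steps at once.
grouped = df.groupby("Pclass")["Fare"].mean()

print(manual)
print(grouped)
```

Both printed Series should hold the same per-class means, which is the point: `groupby` is a compact way of writing the loop.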

For example, in the Titanic dataset, instead of analyzing passengers one by one, we could ask:

- “What’s the average fare paid by each class?”
- “What’s the survival rate per gender?”
- “How does age vary between survivors and non-survivors?”

Each of these questions can be answered with `.groupby()`.

### Why GroupBy Matters in AI/ML

GroupBy operations are **essential** in AI and machine learning for:

- **Feature Engineering:** Add new columns like average fare by class.
- **Fairness Audits:** Check model fairness by inspecting metrics across subgroups.
- **Bias Detection:** Uneven distributions in gender/class can bias training.
- **Data Exploration:** Understand patterns within subgroups.
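
As a minimal sketch of the feature-engineering case, `transform` broadcasts a per-group statistic back onto the original rows, so it can be stored as a new column. The data and the column names `AvgFareByClass` and `FareVsClassAvg` are illustrative assumptions, not part of the Titanic dataset.

```python
import pandas as pd

# Toy rows (made-up values).
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 12.0],
})

# transform returns a result aligned with the original rows,
# so each passenger receives the mean fare of their own class.
df["AvgFareByClass"] = df.groupby("Pclass")["Fare"].transform("mean")

# A relative feature: how a fare compares with the class average.
df["FareVsClassAvg"] = df["Fare"] / df["AvgFareByClass"]

print(df)
```

Unlike an aggregation, which returns one row per group, `transform` keeps the original shape, which is what a feature column needs.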

In classification tasks, like predicting Titanic survival, grouped insights can reveal trends that are hard to see row by row and that may help improve model accuracy. For example, survival rates by gender or class can inspire features like `Is_Child`, `Is_Rich`, or `Is_Female`.

If we train without knowing how groups differ, we risk drawing misleading conclusions from the model. ML models reflect the data they see, and `groupby()` helps us **understand those reflections**.
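
One way to inspect group-wise differences is to look at group size and outcome rate side by side. Because a 0/1 column such as `Survived` averages to the share of ones, its mean per group is the survival rate. The sketch below uses made-up rows and named aggregation (the output column names are arbitrary choices).

```python
import pandas as pd

# Toy rows (made-up values); Survived is coded 0/1.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

summary = df.groupby("Sex")["Survived"].agg(
    passengers="count",    # number of non-missing rows in each group
    survivors="sum",       # number of 1s
    survival_rate="mean",  # share of 1s
)
print(summary)
```

Seeing the counts next to the rates helps distinguish a real difference between groups from one that rests on only a handful of rows.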

### Syntax and Examples

Basic GroupBy syntax:

```python
df.groupby('column_name')['target_column'].mean()
```

Examples with the Titanic dataset:

```python
import pandas as pd
df = pd.read_csv("data/train.csv")

# 1. Mean fare by class
print(df.groupby('Pclass')['Fare'].mean())

# 2. Survival count by gender
print(df.groupby('Sex')['Survived'].sum())

# 3. Age stats by survival
print(df.groupby('Survived')['Age'].agg(['mean', 'min', 'max']))

# 4. Multi-level groupby
print(df.groupby(['Sex', 'Pclass'])['Fare'].mean())
```

```
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
Sex
female    233
male      109
Name: Survived, dtype: int64
               mean   min   max
Survived                       
0         30.626179  1.00  74.0
1         28.343690  0.42  80.0
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64
```


### Advanced Use Cases

```python
# Custom aggregation
print(df.groupby('Embarked')['Fare'].agg(lambda x: x.max() - x.min()))

# Multiple columns
print(df.groupby('Pclass')[['Fare', 'Age']].agg(['mean', 'median']))
```

```
Embarked
C    508.3167
Q     83.2500
S    263.0000
Name: Fare, dtype: float64
             Fare                 Age       
             mean   median       mean median
Pclass                                      
1       84.154687  60.2875  38.233441   37.0
2       20.662183  14.2500  29.877630   29.0
3       13.675550   8.0500  25.140620   24.0
```
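
When a custom function and built-in statistics are mixed, named aggregation gives each result column an explicit name instead of `<lambda>` or a two-level header. This is a sketch on made-up rows. The helper `value_range` and the output names `fare_range` and `median_age` are illustrative choices.

```python
import pandas as pd

# Toy rows (made-up values).
df = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S"],
    "Fare": [100.0, 20.0, 8.0, 50.0, 10.0],
    "Age": [40.0, 30.0, 22.0, 35.0, 25.0],
})

def value_range(s):
    """Spread between the largest and smallest value in a group."""
    return s.max() - s.min()

# Each keyword is an output column: name=(input column, function).
summary = df.groupby("Embarked").agg(
    fare_range=("Fare", value_range),
    median_age=("Age", "median"),
)
print(summary)
```

The result has flat column names, which is usually easier to merge back into a feature table than a MultiIndex header.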


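`groupby()` supports many options:

- `as_index=False` to return the group keys as regular columns instead of the index
- `sort=False` to skip sorting the group keys
- `dropna=False` to keep rows whose group key is missing
- Grouping on multiple columns
- Aggregating multiple stats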
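
A few `groupby()` parameters change the shape of the result rather than the computation. The sketch below contrasts the defaults with `as_index=False`, `sort=False`, and `dropna=False` on made-up rows. It assumes a pandas version that supports `dropna` (1.1 or later).

```python
import pandas as pd

# Toy rows (made-up values); one passenger has no recorded port.
df = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "C"],
    "Fare": [10.0, 100.0, 80.0, 50.0, 20.0],
})

# Default: keys become the index, are sorted, and missing keys are dropped.
default = df.groupby("Embarked")["Fare"].mean()

# as_index=False keeps the key as a regular column (a flat DataFrame).
flat = df.groupby("Embarked", as_index=False)["Fare"].mean()

# sort=False keeps groups in order of first appearance.
unsorted = df.groupby("Embarked", sort=False)["Fare"].mean()

# dropna=False keeps rows with a missing key as their own group.
with_missing = df.groupby("Embarked", dropna=False)["Fare"].mean()

for label, result in [("default", default), ("flat", flat),
                      ("unsorted", unsorted), ("with_missing", with_missing)]:
    print(label)
    print(result)
```

The `dropna=False` variant matters for audits: by default, rows with a missing key silently disappear from every group total.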

### Exercises

Q1. Average age per class

```python
print(df.groupby('Pclass')['Age'].mean())
```

```
Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64
```


Q2. Count survivors per port

```python
print(df.groupby('Embarked')['Survived'].sum())
```

```
Embarked
C     93
Q     30
S    217
Name: Survived, dtype: int64
```


Q3. Fare summary by class

```python
print(df.groupby('Pclass')['Fare'].agg(['min', 'mean', 'max']))
```

Q4. Passengers count by class and gender

```python
print(df.groupby(['Sex', 'Pclass'])['PassengerId'].count())
```

```
Sex     Pclass
female  1          94
        2          76
        3         144
male    1         122
        2         108
        3         347
Name: PassengerId, dtype: int64
```


### Summary

The `.groupby()` method in Pandas is **one of the most powerful tools** for structured data analysis. It allows us to break down our data by categories and apply aggregations to each group. This is especially useful for:

- EDA (Exploratory Data Analysis)
- Creating features for ML
- Auditing for bias and imbalance

With a single line of code, we can produce category-wise summaries that would be tedious to compute manually. In machine learning workflows, `groupby()` is useful during exploration and preprocessing before training, and again after modeling to interpret predictions by subgroup.
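
For the post-modeling case, a per-group accuracy check might look like the sketch below. The `results` frame and its `Predicted` column are hypothetical stand-ins for a real model's output, and the values are made up.

```python
import pandas as pd

# Hypothetical evaluation frame: "Predicted" stands in for model output.
results = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
    "Predicted": [1, 1, 1, 0, 0, 0, 0],
})

# Mark each row as correct or not, then average within each group:
# the mean of a boolean column is the fraction of True values.
results["Correct"] = results["Survived"] == results["Predicted"]
accuracy_by_sex = results.groupby("Sex")["Correct"].mean()

print(accuracy_by_sex)
```

A large gap between groups does not prove the model is unfair, but it is a cue to look more closely at how those groups are represented in the training data.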

By mastering `groupby()`, we gain the ability to **ask deeper questions** of our data and **engineer better ML solutions**.