# Pivot Tables and Cross-Tabulations in Pandas

### What Is a Pivot Table?

A **pivot table** is a high-level data summarization tool that allows us to **rearrange, group, and compute aggregations** over a DataFrame. Think of it as an Excel-style table that we can “pivot” around different columns to understand data patterns better.

In Pandas, the `DataFrame.pivot_table()` method lets us:

- Choose a **row index** (e.g., class)
- Choose a **column index** (e.g., gender)
- Choose **values to aggregate** (e.g., fare or survival)
- Choose **how to aggregate** (mean, sum, count, etc.)

### Basic syntax

```python
df.pivot_table(values='Fare', index='Pclass', columns='Sex', aggfunc='mean')
```

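This shows the **average fare** paid by each gender in each class in a single call. A `groupby()` on the same two columns computes the same means but returns them in long form; the pivot table reshapes them into a grid, so any class/gender combination with no rows appears as `NaN` (or as the value passed to `fill_value`). Passing lists to `index` or `columns` produces a **MultiIndex** for deeper breakdowns.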
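To make the comparison with `groupby()` concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical, not the Titanic file). It assumes one class/gender combination is absent so that the `NaN` handling is visible.

```python
import pandas as pd

# Hypothetical toy data; there is deliberately no (Pclass=2, Sex='female') row.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Sex": ["female", "male", "male", "female", "male"],
    "Fare": [80.0, 60.0, 20.0, 10.0, 8.0],
})

# Pivot table: one call returns the reshaped grid.
pivot = toy.pivot_table(values="Fare", index="Pclass", columns="Sex", aggfunc="mean")

# groupby equivalent: aggregate first, then unstack the inner level into columns.
grouped = toy.groupby(["Pclass", "Sex"])["Fare"].mean().unstack("Sex")

# fill_value replaces the NaN for the combination that has no data.
filled = toy.pivot_table(values="Fare", index="Pclass", columns="Sex",
                         aggfunc="mean", fill_value=0)

print(pivot)
print(grouped)
print(filled)
```

Both `pivot` and `grouped` hold the same means; the pivot table simply saves the explicit `unstack()` step.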

Why does this matter in AI/ML? Before training any model, we want to know:

- Which groups paid more or had better survival odds?
- Are there biases in our features?
- Are some combinations rare or dominant?

These insights help us choose better features, balance our datasets, and avoid misleading trends. Pivot tables are fast, flexible, and perfect for summarizing complex multi-dimensional data.
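One way to check whether some combinations are rare is to request the group size alongside the mean. The sketch below passes a list to `aggfunc`; the `toy` DataFrame is hypothetical and only stands in for a real dataset.

```python
import pandas as pd

# Hypothetical toy data with uneven group sizes.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 3, 3],
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Fare": [90.0, 70.0, 60.0, 20.0, 10.0, 8.0],
})

# A list of aggregations yields one block of columns per function.
summary = toy.pivot_table(values="Fare", index="Pclass", columns="Sex",
                          aggfunc=["mean", "count"])

print(summary)

# The "count" block shows how many rows back each mean;
# combinations with no rows show up as NaN in both blocks.
print(summary["count"])
```

A mean computed from one or two rows deserves less trust than one computed from hundreds, so reading the two blocks side by side is a quick sanity check.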

### Example

```python
import pandas as pd

df = pd.read_csv("data/train.csv")

# Pivot: mean fare by Pclass and Sex
pivot = df.pivot_table(values='Fare', index='Pclass', columns='Sex', aggfunc='mean')
print(pivot)
```

```
Sex         female       male
Pclass                       
1       106.125798  67.226127
2        21.970121  19.741782
3        16.118810  12.661633
```


### What Is a Cross Tabulation?

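A **cross-tabulation (crosstab)** is similar to a pivot table but is specifically designed for **counting how often combinations** of two or more categorical variables occur. Pandas provides the top-level function `pd.crosstab()` to generate this quickly; unlike `pivot_table()`, it is not a DataFrame method and takes the columns themselves as arguments.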

### Example

```python
print(pd.crosstab(df['Sex'], df['Survived']))
```

```
Survived    0    1
Sex               
female     81  233
male      468  109
```


This outputs a table where:

- Rows = Sex (`female`, `male`)
- Columns = Survived (0 or 1)
- Values = Count of occurrences

It's one of the best tools for checking **class balance**, **relationships between categories**, and **potential feature-target dependencies**.

Why does this matter in AI/ML?

- We can use `pd.crosstab()` to **check class imbalance** in our target.
- We can spot **dominant groups** (e.g., are most survivors female?).
- We can explore relationships without visualizations.

### Syntax

```python
pd.crosstab(index=df['Pclass'], columns=df['Survived'], margins=True)
```

The `margins=True` argument adds an `All` row and an `All` column holding the totals.
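As a small illustration of where those totals end up, the sketch below builds a crosstab on a hypothetical `toy` DataFrame and reads the totals back by label.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

table = pd.crosstab(index=toy["Pclass"], columns=toy["Survived"], margins=True)
print(table)

# Totals are addressed with the label "All" on either axis.
third_class_total = table.loc[3, "All"]   # passengers in class 3
survivors_total = table.loc["All", 1]     # survivors across all classes
grand_total = table.loc["All", "All"]     # every row in the data
```

The label used for the totals can be changed with the `margins_name` argument if `All` clashes with a real category.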
We can even normalize:

```python
# Normalize by row to see % survival by class
print(pd.crosstab(df['Pclass'], df['Survived'], normalize='index'))
```

```
Survived         0         1
Pclass                      
1         0.370370  0.629630
2         0.527174  0.472826
3         0.757637  0.242363
```


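This gives survival **ratios** instead of raw counts, which is helpful when engineering features like "Class_Survival_Rate". Because such a feature is derived from the target, it should be computed from the training data only to avoid leakage.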
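A feature like "Class_Survival_Rate" could be built by mapping the row-normalized crosstab back onto each passenger. This is only a sketch on a hypothetical `toy` DataFrame; the column name `Class_Survival_Rate` follows the text, and the data values are invented.

```python
import pandas as pd

# Hypothetical toy data: class and survival flag per passenger.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Row-normalized crosstab: each row sums to 1.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

# Column 1 holds the survival rate per class; map it onto every row by Pclass.
toy["Class_Survival_Rate"] = toy["Pclass"].map(rates[1])

print(rates)
print(toy)
```

Besides `normalize='index'`, `pd.crosstab()` also accepts `normalize='columns'` (each column sums to 1) and `normalize='all'` (the whole table sums to 1).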

Crosstabs also support multiple columns:

```python
print(pd.crosstab([df['Sex'], df['Embarked']], df['Survived']))
```

```
Survived           0    1
Sex    Embarked          
female C           9   64
       Q           9   27
       S          63  140
male   C          66   29
       Q          38    3
       S         364   77
```


This multi-index lets us answer specific questions like:

> “How did female passengers embarking from Cherbourg fare?”

Reading the `female` / `C` row above gives the answer: 64 of 73 survived.

That’s the beauty of crosstabs — they bring **hidden relationships** to the surface.
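Such a question can also be answered in code by selecting from the MultiIndex. The sketch below uses a hypothetical `toy` DataFrame with the same column names as above; `.loc` with a tuple picks one group, and `.xs()` takes a cross-section on one level.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the Titanic example.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male"],
    "Embarked": ["C", "C", "C", "S", "C", "S", "S"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

table = pd.crosstab([toy["Sex"], toy["Embarked"]], toy["Survived"])

# Select a single (Sex, Embarked) group with a tuple key.
female_c = table.loc[("female", "C")]
survival_rate = female_c.loc[1] / female_c.sum()

# Cross-section: everyone who embarked at "C", broken down by Sex.
cherbourg = table.xs("C", level="Embarked")

print(female_c)
print(survival_rate)
print(cherbourg)
```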

### Exercises

Q1. Create a pivot table showing average Age by Pclass and Sex.

```python
print(df.pivot_table(values='Age', index='Pclass', columns='Sex', aggfunc='mean'))
```

```
Sex        female       male
Pclass                      
1       34.611765  41.281386
2       28.722973  30.740707
3       21.750000  26.507589
```


Q2. Show count of passengers by Sex and Survived using crosstab.

```python
print(pd.crosstab(df['Sex'], df['Survived']))
```

```
Survived    0    1
Sex               
female     81  233
male      468  109
```


Q3. What percentage of passengers survived by class?

```python
print(pd.crosstab(df['Pclass'], df['Survived'], normalize='index'))
```

```
Survived         0         1
Pclass                      
1         0.370370  0.629630
2         0.527174  0.472826
3         0.757637  0.242363
```


Q4. Use pivot_table to show median Fare by Embarked and Sex.

```python
print(df.pivot_table(values='Fare', index='Embarked', columns='Sex', aggfunc='median'))
```

```
Sex         female   male
Embarked                 
C         56.92920  24.00
Q          7.76875   7.75
S         24.15000  10.50
```


### Summary

**Pivot tables** and **crosstabs** are two of the most valuable tools in any data scientist’s arsenal for **summarizing structured data**. While they may seem similar, each has its specific strength:

- Use **pivot tables** when we want to **aggregate** (sum, mean, count, median) over multiple dimensions.
- Use **crosstabs** when we want to **count relationships** between categorical features.
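The two tools overlap more than the list suggests: a pivot table can count, and a crosstab can aggregate when given `values` and `aggfunc`. The sketch below shows both directions on a hypothetical `toy` DataFrame in which every class/gender combination is present.

```python
import pandas as pd

# Hypothetical toy data; every (Pclass, Sex) combination occurs at least once.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male", "female", "male", "male"],
    "Fare": [80.0, 100.0, 60.0, 25.0, 20.0, 10.0, 8.0, 12.0],
})

# Counting: crosstab versus a pivot table that counts non-null fares.
counts_ct = pd.crosstab(toy["Pclass"], toy["Sex"])
counts_pt = toy.pivot_table(values="Fare", index="Pclass", columns="Sex",
                            aggfunc="count", fill_value=0)

# Aggregating: pivot table versus a crosstab given values and aggfunc.
means_pt = toy.pivot_table(values="Fare", index="Pclass", columns="Sex", aggfunc="mean")
means_ct = pd.crosstab(toy["Pclass"], toy["Sex"], values=toy["Fare"], aggfunc="mean")

print(counts_ct)
print(means_ct)
```

In practice the choice is mostly about convenience: `pd.crosstab()` is shorter for counts and offers `normalize`, while `pivot_table()` is the natural fit when the data already lives in one DataFrame and a numeric column is being aggregated.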

Both allow us to uncover **non-obvious group patterns**, engineer useful features, and validate dataset assumptions before building any models.

In the context of AI/ML:

- These tools help identify **group-level trends**, **imbalances**, and **biases** that could affect model fairness or performance.
- We can extract domain knowledge that improves feature selection and model logic.
- They’re vital for preprocessing, quality checks, and post-model analysis.

For example, if a pivot table shows that 1st-class female passengers have a 90% survival rate, that insight might influence our feature engineering. Or, if a crosstab shows that nearly all 3rd-class males died, it might highlight a need to balance training data for better generalization.
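A survival rate per group, like the ones in this example, can be read from a pivot table because the mean of a 0/1 column is the share of ones. A minimal sketch, assuming a hypothetical `toy` DataFrame with invented values:

```python
import pandas as pd

# Hypothetical toy data: Survived is a 0/1 flag.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a binary column is the fraction of rows equal to 1.
survival_rate = toy.pivot_table(values="Survived", index="Pclass",
                                columns="Sex", aggfunc="mean")

print(survival_rate)
```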

In conclusion, pivot tables and crosstabs are not just EDA tricks; they’re **AI/ML decision-making tools** that every serious data scientist should master.