# MultiIndex and Hierarchical Data in Pandas

### What Is MultiIndex and Hierarchical Data?

In many real-world datasets, especially time series, grouped analyses, and pivot tables, each observation is identified by more than one key. Pandas represents such keys with a **MultiIndex**: an index with **multiple levels**, which lets it store and manipulate **hierarchical data** efficiently.

Instead of a single column as the row index, a **MultiIndex** can have two or more levels — for example, grouping data by both `Pclass` and `Sex`. This structure enables more complex, grouped operations, elegant reshaping, and more readable outputs when working with grouped data summaries.

### Example Setup: Group Titanic Passengers by Class and Sex

We'll use Titanic dataset columns like `Pclass`, `Sex`, and `Survived` to demonstrate a MultiIndex.

In [1]:
import pandas as pd

df = pd.read_csv("data/train.csv")
grouped = df.groupby(['Pclass', 'Sex'])['Survived'].agg(['count', 'sum', 'mean'])
print(grouped)

               count  sum      mean
Pclass Sex                         
1      female     94   91  0.968085
       male      122   45  0.368852
2      female     76   70  0.921053
       male      108   17  0.157407
3      female    144   72  0.500000
       male      347   47  0.135447


Here, `Pclass` and `Sex` together form the row index, contributing one **level** each, which makes it a **MultiIndex**. We can think of this as a tree-like structure: `Pclass` is the outer level, and `Sex` is nested inside each class.
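
To make the tree-like structure concrete, the sketch below inspects the index object itself. It uses a small made-up table (`toy`) in place of the Titanic file, so the values are illustrative only; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns (not real records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

idx = rates.index
print(type(idx).__name__)   # the index class produced by a two-key groupby
print(idx.nlevels)          # number of levels in the hierarchy
print(list(idx.names))      # level names, outermost first
print(idx.levels)           # unique labels stored per level
print(idx.tolist())         # each row label is a (Pclass, Sex) tuple
```

Each row label is a tuple, which is why tuple-based selection works on such an index.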

### Why Use MultiIndex?

MultiIndexes are useful when:

- We want to analyze data across **two or more dimensions**.
- We want to create **pivot-like structures** but still maintain a DataFrame.
- We want to **reshape** data between long and wide forms.
- We’re dealing with **hierarchical categories** (e.g., Year → Month → Day).
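
As a rough illustration of the hierarchical-category case, a MultiIndex can also be built directly rather than through `.groupby()`. The sketch below assumes a hypothetical Year → Quarter hierarchy with made-up numbers; `pd.MultiIndex.from_product` creates every combination of the labels it is given.

```python
import pandas as pd

# Hypothetical Year -> Quarter hierarchy; the sales figures are invented
index = pd.MultiIndex.from_product(
    [[2023, 2024], ["Q1", "Q2"]],
    names=["Year", "Quarter"],
)
sales = pd.Series([10, 12, 11, 15], index=index, name="sales")

print(sales)

# Selecting a label on the outer level returns everything nested beneath it
print(sales.loc[2024])

# A full tuple picks out a single value
print(sales.loc[(2023, "Q2")])
```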

### Key Operations with MultiIndex

1. Creating a MultiIndex with `.groupby()`
    
As seen above, using `.groupby(['col1', 'col2'])` creates a MultiIndex in the result:

In [2]:
grouped = df.groupby(['Pclass', 'Sex'])['Survived'].mean()
print(grouped)

Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64


2. Accessing Values in a MultiIndex
    
We can access a single value by passing `.loc` a **tuple** with one label per index level:

In [3]:
print(grouped.loc[(1, 'female')])  # Mean survival of 1st class females

0.9680851063829787
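
Besides a full tuple, a MultiIndex also supports partial selection. The sketch below, again on a small made-up table rather than the real dataset, selects by the outer level with `.loc` and by the inner level with `.xs()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns (not real records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Outer-level label only: returns the sub-Series for class 1, indexed by Sex
first_class = rates.loc[1]
print(first_class)

# Inner-level label: .xs() takes a cross-section, leaving Pclass as the index
females = rates.xs("female", level="Sex")
print(females)
```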


3. Resetting the Index

Convert the MultiIndex levels back into ordinary columns of a flat DataFrame with `.reset_index()`:

In [4]:
flat_df = grouped.reset_index()
print(flat_df.head())

   Pclass     Sex  Survived
0       1  female  0.968085
1       1    male  0.368852
2       2  female  0.921053
3       2    male  0.157407
4       3  female  0.500000
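
`.reset_index()` can also flatten just one level, and `.set_index()` goes the other way. The following is a minimal sketch on made-up data; the variable names (`partial`, `rebuilt`) are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns (not real records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Flatten every level into columns
flat = rates.reset_index()
print(flat.columns.tolist())

# Flatten only the 'Sex' level; 'Pclass' stays as the row index
partial = rates.reset_index(level="Sex")
print(partial)

# Rebuild the MultiIndex from the flat columns
rebuilt = flat.set_index(["Pclass", "Sex"])["Survived"]
print(rebuilt.equals(rates))
```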


4. Unstacking and Pivoting Levels

We can move one index level into the columns with `.unstack()`. By default it moves the innermost level, which here is `Sex`:

In [5]:
unstacked = grouped.unstack()  # Moves 'Sex' from index to columns
print(unstacked)

Sex       female      male
Pclass                    
1       0.968085  0.368852
2       0.921053  0.157407
3       0.500000  0.135447


With one column per sex, the survival rates within each class can be compared side by side.
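
The level to move can be chosen explicitly. The sketch below, on made-up data, contrasts the default with `level="Pclass"`, which turns the classes into columns instead.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns (not real records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Default: the innermost level ('Sex') becomes the columns
by_sex = rates.unstack()
print(by_sex)

# Explicit level: 'Pclass' becomes the columns, 'Sex' stays in the rows
by_class = rates.unstack(level="Pclass")
print(by_class)
```

Any index combination that is absent from the data shows up as `NaN` in the unstacked table.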

5. Stack Back to MultiIndex

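Reverse `.unstack()` with `.stack()`, which moves the column labels back into the innermost index level. Note that the restacked Series no longer carries the `Survived` name: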

In [6]:
restacked = unstacked.stack()
print(restacked.head())

Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
dtype: float64


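6. Swapping Index Levels

`.swaplevel()` exchanges the order of the index levels. It does not re-sort the rows, so the values keep their original order:
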
In [7]:
swapped = grouped.swaplevel()
print(swapped.head())

Sex     Pclass
female  1         0.968085
male    1         0.368852
female  2         0.921053
male    2         0.157407
female  3         0.500000
Name: Survived, dtype: float64
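
Because the rows stay in their original order after the swap, a sort is usually the next step. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns (not real records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Levels are exchanged, but the row order is unchanged
swapped = rates.swaplevel()
print(swapped.index.tolist())

# Sorting regroups the rows by the new outer level ('Sex'), then 'Pclass'
tidy = swapped.sort_index()
print(tidy)
```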


7. Sorting a MultiIndex

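Sort a MultiIndex with `.sort_index()` before slicing it. Label-based slices rely on a lexicographically sorted index, and pandas may raise an error or warn about degraded performance when the index is unsorted: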

In [8]:
sorted_group = grouped.sort_index()
print(sorted_group.head())

Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
Name: Survived, dtype: float64
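
The effect of sorting is easier to see on an index that starts out unsorted. The sketch below deliberately scrambles a made-up result and then slices it; the `try` block is only there to show that the unsorted slice is typically rejected.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns (not real records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Scramble the row order on purpose
shuffled = rates.iloc[[3, 0, 5, 1, 4, 2]]
print(shuffled.index.is_monotonic_increasing)

# Slicing an unsorted MultiIndex is typically rejected
# (pandas' UnsortedIndexError is a subclass of KeyError)
try:
    shuffled.loc[1:2]
except KeyError as err:
    print("slice failed:", err)

# After sorting, label-based slices on the outer level work
ordered = shuffled.sort_index()
print(ordered.index.is_monotonic_increasing)
print(ordered.loc[1:2])   # classes 1 and 2, both endpoints included
```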


### AI/ML Use Case: Hierarchical Feature Extraction

Hierarchical indexing can help:

- Analyze **nested categories** (e.g., Class → Gender → Survival).
- Build **complex aggregation pipelines** for feature engineering.
- Structure **multi-level time series** (e.g., user → session → timestamp).
- Represent **panel data** or **multi-series forecasting datasets**.

For example, in a Titanic ML pipeline:

- Use group-level survival rates (`Pclass`, `Sex`) as new features.
- Pivot MultiIndex into columns to prepare wide-format datasets.
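
The two ideas above might look roughly like this in code. This is a sketch on made-up data, not a full pipeline; the column names `group_rate` and `group_rate_joined` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns (not real records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 0],
})

# Option 1: broadcast each group's mean back onto its rows
toy["group_rate"] = toy.groupby(["Pclass", "Sex"])["Survived"].transform("mean")

# Option 2: compute the MultiIndexed rates once, then join them on the key columns
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean().rename("group_rate_joined")
joined = toy.join(rates, on=["Pclass", "Sex"])
print(joined)

# Wide format: one row per class, one column per sex
wide = rates.unstack()
print(wide)
```

In a real pipeline, such rates would normally be computed on the training split only and then joined onto the validation and test rows, since they are derived from the target.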

### Exercises

Q1. Group by `Pclass` and `Embarked`, get survival rate

In [9]:
# Mean of the 0/1 'Survived' column is the survival rate per (Pclass, Embarked) group
grouped_embarked = df.groupby(['Pclass', 'Embarked'])['Survived'].mean()
print(grouped_embarked)


Q2. Convert that MultiIndex result into a flat DataFrame

In [10]:
# Recompute the Pclass/Embarked result so this cell runs on its own
grouped_embarked = df.groupby(['Pclass', 'Embarked'])['Survived'].mean()
flat_embarked = grouped_embarked.reset_index()
print(flat_embarked.head())


Q3. Unstack the `Embarked` level

In [11]:
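# Recompute the Pclass/Embarked result so this cell runs on its own
grouped_embarked = df.groupby(['Pclass', 'Embarked'])['Survived'].mean()
unstacked_embarked = grouped_embarked.unstack(level='Embarked')  # Moves 'Embarked' from index to columns
print(unstacked_embarked)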


Q4. Access survival rate for 3rd class passengers who embarked from "S"

In [12]:
# Group by Pclass and Embarked to get the mean survival rate
grouped_embarked = df.groupby(['Pclass', 'Embarked'])['Survived'].mean()
print(grouped_embarked.loc[(3, 'S')])

0.18980169971671387


Q5. Sort the MultiIndex by both levels

In [13]:
sorted_group = grouped.sort_index()
print(sorted_group)

Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64


### Summary

MultiIndexing in Pandas allows data scientists to represent and work with **hierarchically structured data** across multiple dimensions — such as `Pclass` and `Sex`, or `Date` and `City`. While flat tables are easy to view, real-world data often contains nested relationships. MultiIndex provides a **clean and powerful way to analyze such datasets** without needing complex joins or data duplication.

By using methods like `.groupby()`, `.unstack()`, `.stack()`, `.swaplevel()`, and `.reset_index()`, we can easily **navigate between long and wide formats**. This flexibility is especially important in **machine learning workflows**, where feature engineering may involve aggregate statistics at group levels. MultiIndex also simplifies **multi-time-series**, **panel data**, and **deeply nested categories** in industries like finance, healthcare, and customer analytics.

In summary, mastering MultiIndex allows us to explore our data across layers, perform advanced grouped operations, and prepare our datasets efficiently for modeling and visualization. It bridges the gap between structured tables and real-world hierarchical relationships.