# Advanced Pandas (Introduction)

_This notebook introduces advanced Pandas techniques for complex data manipulation and analysis, building on the fundamentals you learnt in Week 04._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Sonnet 4.5)*, including updated documentation and git commit messages.

## Learning objectives

By the end of this week, you will be able to:

- Understand and apply tidy data principles
- Reshape data between wide and long formats using `melt()`, `pivot()`, and `pivot_table()`
- Work with hierarchical data using MultiIndex
- Apply advanced data cleaning and transformation techniques
- Create efficient data processing pipelines using method chaining
- Analyse time series data with appropriate Pandas tools

## Building on previous sessions

In Week 04, you learnt the fundamental Pandas operations:

- Creating and manipulating DataFrames
- Filtering and selecting data
- Basic aggregation with `groupby()`
- Sorting and ranking data
- Handling missing values

This week extends these foundations with advanced techniques for reshaping and analysing complex datasets. You'll learn how to transform data structures to suit different analytical needs and handle multi-dimensional data efficiently.

## Key concepts

### Tidy data principles

The concept of "tidy data" was introduced by statistician Hadley Wickham. Tidy data follows three principles:

1. **Each variable forms a column** – Every column represents a single attribute
2. **Each observation forms a row** – Every row represents a single measurement or record
3. **Each type of observational unit forms a table** – Each kind of entity (for example, products or customers) is stored in its own DataFrame

Consider this example:

In [None]:
import pandas as pd

# Messy data: months as columns
messy_sales = pd.DataFrame({
    "product": ["Widget A", "Widget B"],
    "Jan": [100, 150],
    "Feb": [120, 140],
    "Mar": [110, 160]
})

print("Messy (wide) format:")
print(messy_sales)

# Tidy data: one row per observation
tidy_sales = pd.DataFrame({
    "product": ["Widget A", "Widget A", "Widget A", "Widget B", "Widget B", "Widget B"],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "sales": [100, 120, 110, 150, 140, 160]
})

print("\nTidy (long) format:")
print(tidy_sales)

The tidy format makes it easier to:
- Calculate statistics by product or month
- Create visualisations
- Integrate with other datasets
- Apply standard analytical techniques
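As a minimal sketch of the first point, the cell below computes statistics by product and by month with `groupby()`. It rebuilds the `tidy_sales` table from the example above so that it runs on its own; the names `sales_by_product` and `mean_by_month` are illustrative choices, not part of the course materials.

```python
import pandas as pd

# Same tidy table as in the example above, rebuilt so this cell is self-contained
tidy_sales = pd.DataFrame({
    "product": ["Widget A"] * 3 + ["Widget B"] * 3,
    "month": ["Jan", "Feb", "Mar"] * 2,
    "sales": [100, 120, 110, 150, 140, 160],
})

# Total sales per product: one groupby on the "product" column
sales_by_product = tidy_sales.groupby("product")["sales"].sum()
print(sales_by_product)

# Mean sales per month; sort=False keeps months in order of first appearance
# rather than sorting them alphabetically
mean_by_month = tidy_sales.groupby("month", sort=False)["sales"].mean()
print(mean_by_month)
```

Doing the same with the wide `messy_sales` table would require naming every month column explicitly, which is one reason the long layout tends to be easier to analyse.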

### Data reshaping: wide vs long format

**Wide format** has:
- Multiple columns representing related values
- Fewer rows
- Easier for people to read at a glance
- Common in spreadsheets and reports

**Long format** has:
- Single column for values
- More rows
- Better for statistical analysis
- Preferred for plotting and modelling

Pandas provides several methods to convert between formats:

- `melt()` – Converts wide to long
- `pivot()` – Converts long to wide (no aggregation; each index/column combination must be unique)
- `pivot_table()` – Converts long to wide (aggregates duplicate combinations, using the mean by default)
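The three methods can be sketched on the widget sales data from earlier. This is an illustrative round trip rather than a complete treatment: the extra "Widget A / Jan" row and the variable names (`long_sales`, `wide_again`, `with_duplicate`, `summed`) are assumptions made up for the example.

```python
import pandas as pd

# Wide table from the tidy data example, rebuilt so this cell is self-contained
messy_sales = pd.DataFrame({
    "product": ["Widget A", "Widget B"],
    "Jan": [100, 150],
    "Feb": [120, 140],
    "Mar": [110, 160],
})

# Wide -> long: month columns become values in a single "month" column
long_sales = messy_sales.melt(
    id_vars="product", var_name="month", value_name="sales"
)
print(long_sales)

# Long -> wide: works because each (product, month) pair appears exactly once
wide_again = long_sales.pivot(index="product", columns="month", values="sales")
print(wide_again)

# Add a hypothetical second January record for Widget A
extra_row = pd.DataFrame({"product": ["Widget A"], "month": ["Jan"], "sales": [50]})
with_duplicate = pd.concat([long_sales, extra_row], ignore_index=True)

# pivot() would raise a ValueError on the duplicate pair;
# pivot_table() combines the duplicates with the chosen aggregation instead
summed = with_duplicate.pivot_table(
    index="product", columns="month", values="sales", aggfunc="sum"
)
print(summed)
```

Note that both `pivot()` and `pivot_table()` sort the new column labels, so the months come back in alphabetical order rather than calendar order.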

### MultiIndex (hierarchical indexing)

A MultiIndex gives a DataFrame multiple levels of row or column labels, which lets you represent higher-dimensional data in a two-dimensional table:

In [None]:
# Creating a MultiIndex DataFrame
arrays = [
    ["London", "London", "Manchester", "Manchester"],
    ["Q1", "Q2", "Q1", "Q2"]
]
index = pd.MultiIndex.from_arrays(arrays, names=["city", "quarter"])

df = pd.DataFrame({"sales": [100, 120, 80, 90]}, index=index)
print(df)

MultiIndex is particularly useful for:
- Representing multi-dimensional data
- Organising data with natural hierarchies (e.g., country → region → city)
- Efficient slicing and dicing of complex datasets
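The cell below sketches a few common ways to select and summarise data in the city/quarter DataFrame from the example above. It rebuilds that DataFrame so it runs independently; the result names (`london`, `london_q2`, `q1`, `city_totals`, `wide`) are illustrative.

```python
import pandas as pd

# Same MultiIndex DataFrame as above, rebuilt so this cell is self-contained
index = pd.MultiIndex.from_arrays(
    [["London", "London", "Manchester", "Manchester"], ["Q1", "Q2", "Q1", "Q2"]],
    names=["city", "quarter"],
)
df = pd.DataFrame({"sales": [100, 120, 80, 90]}, index=index)

# Select every row under one outer-level label (result is indexed by quarter)
london = df.loc["London"]
print(london)

# Select a single value by passing a tuple with one label per level
london_q2 = df.loc[("London", "Q2"), "sales"]
print(london_q2)

# Cross-section on the inner level: Q1 figures for every city
q1 = df.xs("Q1", level="quarter")
print(q1)

# Aggregate over one level of the index
city_totals = df.groupby(level="city")["sales"].sum()
print(city_totals)

# Move the inner level into columns to get a city-by-quarter table
wide = df["sales"].unstack("quarter")
print(wide)
```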

### Method chaining

Method chaining allows you to perform multiple operations in a single, readable statement:

In [None]:
# Example of method chaining
result = (
    tidy_sales
    .query("sales > 100")  # Filter
    .assign(sales_band="high")  # Add column
    .sort_values("sales", ascending=False)  # Sort
    .reset_index(drop=True)  # Reset index
)
print(result)

Benefits of method chaining:
- More readable code (follows data flow)
- Fewer intermediate variables
- Easier to debug (comment out individual steps)
- Professional coding style
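To make the "fewer intermediate variables" point concrete, here is a rough comparison of the chain above with an equivalent step-by-step version. The intermediate names (`filtered`, `labelled`, `ordered`, `stepwise`) are invented for this sketch, and `tidy_sales` is rebuilt so the cell runs on its own.

```python
import pandas as pd

# Tidy table from the earlier example, rebuilt so this cell is self-contained
tidy_sales = pd.DataFrame({
    "product": ["Widget A"] * 3 + ["Widget B"] * 3,
    "month": ["Jan", "Feb", "Mar"] * 2,
    "sales": [100, 120, 110, 150, 140, 160],
})

# Step-by-step version: each operation needs its own variable name
filtered = tidy_sales[tidy_sales["sales"] > 100]
labelled = filtered.assign(sales_band="high")
ordered = labelled.sort_values("sales", ascending=False)
stepwise = ordered.reset_index(drop=True)

# Chained version: the same four operations, read top to bottom
chained = (
    tidy_sales
    .query("sales > 100")
    .assign(sales_band="high")
    .sort_values("sales", ascending=False)
    .reset_index(drop=True)
)

# Both approaches should produce identical DataFrames
print(chained.equals(stepwise))
```

Neither version modifies `tidy_sales`, because each method returns a new DataFrame; the chain simply avoids naming the in-between results.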

## Prerequisites

Before starting this week's materials, you should be comfortable with:

- **Week 04 Pandas fundamentals**:
  - Creating DataFrames from various sources
  - Selecting and filtering data
  - Using `groupby()` for aggregation
  - Handling missing values
  - Merging and concatenating DataFrames

- **Python fundamentals (Weeks 01-02)**:
  - Lists, dictionaries, and functions
  - Control flow (if/else, loops)
  - List comprehensions

If you need to review any of these topics, please revisit the appropriate week's materials before proceeding.

## Further resources

### Academic readings

- **Wickham, H.** (2014). "Tidy Data". *Journal of Statistical Software*, 59(10), 1-23. [https://doi.org/10.18637/jss.v059.i10](https://doi.org/10.18637/jss.v059.i10)  
  The foundational paper on tidy data principles

- **McKinney, W.** (2022). *Python for Data Analysis* (3rd ed.). O'Reilly Media.  
  Written by the creator of Pandas; Chapters 7-8 cover reshaping and data wrangling

### Documentation

- [Pandas User Guide: Reshaping and Pivot Tables](https://pandas.pydata.org/docs/user_guide/reshaping.html)  
  Official documentation on melt, pivot, and pivot_table

- [Pandas User Guide: MultiIndex / Advanced Indexing](https://pandas.pydata.org/docs/user_guide/advanced.html)  
  Comprehensive guide to hierarchical indexing

- [Pandas User Guide: Time Series / Date Functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html)  
  Guide to working with dates and times

### Video tutorials

- [Corey Schafer: Pandas reshaping with melt, pivot, and pivot_table](https://www.youtube.com/watch?v=F8yRVmSPZqc) (18 mins)  
  Clear explanation of reshaping operations with examples

- [Data School: Working with hierarchical data using MultiIndex](https://www.youtube.com/watch?v=tcRGa2soc-c) (23 mins)  
  Practical examples of MultiIndex operations

### Interactive resources

- [Kaggle Learn: Pandas Course](https://www.kaggle.com/learn/pandas)  
  Hands-on exercises including data reshaping

- [Real Python: Pandas pivot_table() Tutorial](https://realpython.com/pandas-pivot-table/)  
  In-depth guide with business examples

### Additional materials

- [Towards Data Science: Tidy Data in Python](https://towardsdatascience.com/tidy-data-in-python-461e7bc5f67c)  
  Practical guide to applying tidy data principles

- [Stack Overflow: Pandas pivot, pivot_table, melt questions](https://stackoverflow.com/questions/tagged/pandas+pivot)  
  Real-world problems and solutions

## Next steps

Continue to the [Demonstration](Demonstration.ipynb) notebook to see these concepts in action with practical business examples. You'll work through detailed examples of data reshaping, MultiIndex operations, and advanced data transformations.

After completing the Demonstration, practise your skills with the [Exercises](Exercises.ipynb), which are designed to take approximately 90 minutes and focus on real-world business scenarios.