# Advanced Pandas (Solutions)

_This notebook provides solutions to the Week 05 exercises. Each solution includes explanations and, where appropriate, alternative approaches._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Sonnet 4.5)*, including updated documentation and git commit messages.

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better output
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)

## Part 1: Data reshaping (20 minutes)

In this section, you'll practice converting data between wide and long formats using `melt()` and `pivot()`.

### Exercise 1: Sales data reshaping

You have quarterly sales data in wide format. Convert it to long format for analysis.

1. Use `melt()` to convert the quarterly columns to long format
2. Clean the `quarter` column to remove the `sales_` prefix
3. Display the first 10 rows of the melted data

In [None]:
# Sales data in wide format
sales_wide = pd.DataFrame({
    "product": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "region": ["North", "North", "South", "South"],
    "sales_Q1": [15000, 3000, 12000, 18000],
    "sales_Q2": [18000, 3500, 14000, 20000],
    "sales_Q3": [16000, 3200, 13000, 19000],
    "sales_Q4": [20000, 4000, 15000, 22000]
})

print("Wide format:")
print(sales_wide)

In [None]:
# Solution
# Step 1: Melt the quarterly columns
sales_long = sales_wide.melt(
    id_vars=["product", "region"],
    value_vars=["sales_Q1", "sales_Q2", "sales_Q3", "sales_Q4"],
    var_name="quarter",
    value_name="sales"
)

# Step 2: Clean the quarter column
sales_long["quarter"] = sales_long["quarter"].str.replace("sales_", "", regex=False)

# Step 3: Display first 10 rows
print("Long format:")
print(sales_long.head(10))

**Explanation**: The `melt()` function converts wide data to long format by:
- Keeping `id_vars` columns as-is (product, region)
- Taking the column names from `value_vars` and putting them in the `quarter` column
- Taking the values from those columns and putting them in the `sales` column

We then use `.str.replace()` to clean up the quarter names by removing the "sales_" prefix.
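
The same steps can be sketched on a smaller frame. This is a minimal illustration, assuming a hypothetical two-product table `demo_wide`; it also shows that `value_vars` is optional, because `melt()` uses every column not listed in `id_vars` when it is omitted.

```python
import pandas as pd

# Hypothetical two-row table in the same shape as sales_wide
demo_wide = pd.DataFrame({
    "product": ["Laptop", "Mouse"],
    "sales_Q1": [15000, 3000],
    "sales_Q2": [18000, 3500],
})

# Omitting value_vars melts every column that is not in id_vars
demo_long = demo_wide.melt(id_vars="product", var_name="quarter", value_name="sales")

# regex=False treats "sales_" as a literal string rather than a pattern
demo_long["quarter"] = demo_long["quarter"].str.replace("sales_", "", regex=False)

print(demo_long)
```

Two products and two quarterly columns give four rows in long format, one per product-quarter pair.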

### Exercise 2: Employee performance data

Convert long-format employee performance data back to wide format.

1. Use `pivot()` to create a wide format with employees as rows and quarters as columns
2. Reset the index to make `employee` a regular column
3. Calculate the average performance for each employee across all quarters

In [None]:
# Performance data in long format
perf_long = pd.DataFrame({
    "employee": ["Alice", "Alice", "Alice", "Alice", 
                 "Bob", "Bob", "Bob", "Bob",
                 "Carol", "Carol", "Carol", "Carol"],
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "score": [4.2, 4.5, 4.3, 4.6,
              3.8, 4.0, 4.1, 3.9,
              4.7, 4.8, 4.6, 4.9]
})

print("Long format:")
print(perf_long)

In [None]:
# Solution
# Step 1: Pivot to wide format
perf_wide = perf_long.pivot(
    index="employee",
    columns="quarter",
    values="score"
)

# Step 2: Reset index
perf_wide = perf_wide.reset_index()

# Step 3: Calculate average performance
perf_wide["average"] = perf_wide[["Q1", "Q2", "Q3", "Q4"]].mean(axis=1)

print("Wide format with averages:")
print(perf_wide)

**Explanation**: The `pivot()` function converts long data to wide format by:
- Using `employee` as the row index
- Using `quarter` values as column names
- Filling the table with `score` values

After pivoting, `employee` becomes the index, so we use `.reset_index()` to convert it back to a regular column.

To calculate the average, we use `.mean(axis=1)` which calculates the mean across columns (axis=1) for each row.
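
One detail that is easy to miss: after `pivot()` and `.reset_index()`, the columns axis still carries the name `quarter`, which shows up as a stray label in the printed header. The sketch below, on a hypothetical four-row frame `demo_long`, removes it with `rename_axis(columns=None)` before computing the row-wise mean.

```python
import pandas as pd

# Hypothetical long-format scores for two employees
demo_long = pd.DataFrame({
    "employee": ["Alice", "Alice", "Bob", "Bob"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "score": [4.0, 5.0, 3.0, 4.0],
})

demo_wide = demo_long.pivot(
    index="employee", columns="quarter", values="score"
).reset_index()

# The columns axis keeps the name of the pivoted column
print(demo_wide.columns.name)

# Remove the leftover axis name so the header prints cleanly
demo_wide = demo_wide.rename_axis(columns=None)

# axis=1 averages across the quarter columns within each row
demo_wide["average"] = demo_wide[["Q1", "Q2"]].mean(axis=1)
print(demo_wide)
```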

### Exercise 3: Revenue by product and region

Practice the melt-pivot workflow:

1. Melt the revenue data to long format
2. Pivot it back to show products as rows and months as columns
3. Add a column showing total revenue for each product

In [None]:
# Revenue data
revenue_data = pd.DataFrame({
    "product": ["Widget A", "Widget B", "Widget C"],
    "region": ["East", "East", "West"],
    "Jan": [5000, 7000, 6000],
    "Feb": [5500, 7500, 6500],
    "Mar": [6000, 8000, 7000]
})

print("Original data:")
print(revenue_data)

In [None]:
# Solution
# Step 1: Melt to long format
revenue_long = revenue_data.melt(
    id_vars=["product", "region"],
    value_vars=["Jan", "Feb", "Mar"],
    var_name="month",
    value_name="revenue"
)

print("Long format:")
print(revenue_long.head())

# Step 2: Pivot back to wide format with products as rows
revenue_wide = revenue_long.pivot(
    index="product",
    columns="month",
    values="revenue"
).reset_index()

# Step 3: Add total column
revenue_wide["Total"] = revenue_wide[["Jan", "Feb", "Mar"]].sum(axis=1)

print("\nWide format with totals:")
print(revenue_wide)

**Explanation**: This exercise demonstrates the common workflow of:
1. Melt → long format (good for analysis)
2. Pivot → wide format (good for reporting)

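Note that pivoting drops the `region` column: `pivot()` keeps only the columns named in `index`, `columns`, and `values`, and `region` is none of them. To preserve it, include it in the index: `index=["product", "region"]`. Also note that `pivot()` sorts the new column labels, so the months come out alphabetically (`Feb`, `Jan`, `Mar`) rather than chronologically.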

The `.sum(axis=1)` calculates row-wise totals across the month columns.
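
As a sketch of the region-preserving variant, the example below pivots a hypothetical long-format frame `demo` with a two-column index. The `reindex(columns=...)` step is an addition for illustration: it restores chronological month order, since the pivoted column labels come back sorted alphabetically.

```python
import pandas as pd

# Hypothetical long-format revenue data (two products, two months)
demo = pd.DataFrame({
    "product": ["Widget A", "Widget A", "Widget C", "Widget C"],
    "region": ["East", "East", "West", "West"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [5000, 5500, 6000, 6500],
})

month_cols = ["Jan", "Feb"]  # explicit chronological order

wide = (
    demo
    .pivot(index=["product", "region"], columns="month", values="revenue")
    .reindex(columns=month_cols)  # pivot returns the labels sorted: Feb, Jan
    .reset_index()                # product and region become regular columns
)

wide["Total"] = wide[month_cols].sum(axis=1)
print(wide)
```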

## Part 2: Pivot tables and aggregation (20 minutes)

Practice creating pivot tables and performing advanced aggregations.

### Exercise 4: Sales analysis by region and product

Create a pivot table to analyse sales performance:

1. Create a pivot table with `region` as rows and `product` as columns
2. Show total sales for each region-product combination
3. Add margins to show row and column totals

In [None]:
# Sales transaction data (seeded so the random amounts are reproducible)
np.random.seed(42)
sales_data = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=20),
    "region": ["North", "South", "East", "West"] * 5,
    "product": ["Laptop", "Mouse", "Keyboard", "Monitor", "Laptop"] * 4,
    "sales_amount": np.random.randint(1000, 10000, 20)
})

print("Sales data:")
print(sales_data.head(10))

In [None]:
# Solution
sales_pivot = pd.pivot_table(
    sales_data,
    values="sales_amount",
    index="region",
    columns="product",
    aggfunc="sum",
    margins=True,
    margins_name="Total"
)

print("Sales by region and product:")
print(sales_pivot)

**Explanation**: `pivot_table()` is more powerful than `pivot()` because:
- It can handle duplicate index-column combinations by aggregating them
- It supports aggregation functions like `sum`, `mean`, `count`, etc.
- It can add margins (totals) automatically

The `margins=True` parameter adds:
- A "Total" row showing column sums
- A "Total" column showing row sums
- A grand total in the bottom-right corner
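
The difference between the two functions is easiest to see with duplicate rows. In this minimal sketch, the hypothetical frame `demo` has two "North"/"Laptop" rows: `pivot()` refuses to reshape it, while `pivot_table()` sums the duplicates and adds the margins.

```python
import pandas as pd

# Hypothetical transactions; North/Laptop appears twice
demo = pd.DataFrame({
    "region": ["North", "North", "South"],
    "product": ["Laptop", "Laptop", "Mouse"],
    "sales_amount": [1000, 2000, 500],
})

# pivot() cannot place two values in one cell and raises ValueError
try:
    demo.pivot(index="region", columns="product", values="sales_amount")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot() failed: {err}")

# pivot_table() aggregates the duplicates instead
table = pd.pivot_table(
    demo,
    values="sales_amount",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,          # show 0 for combinations with no sales
    margins=True,
    margins_name="Total",
)
print(table)
```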

### Exercise 5: Multi-metric department analysis

Create a comprehensive departmental summary:

1. Use `pivot_table()` to analyse salary, performance, and project counts by department
2. Apply different aggregation functions to each metric:
   - Salary: mean
   - Performance: mean, min, max
   - Projects: sum
3. Round the results to 2 decimal places

In [None]:
# Employee data
employee_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank", "Grace", "Henry"],
    "department": ["Sales", "IT", "HR", "Sales", "IT", "HR", "Sales", "IT"],
    "salary": [50000, 65000, 55000, 52000, 68000, 57000, 54000, 70000],
    "performance": [4.2, 3.8, 4.5, 4.0, 4.1, 4.3, 3.9, 4.4],
    "projects_completed": [5, 8, 6, 7, 9, 5, 6, 10]
})

print("Employee data:")
print(employee_data)

In [None]:
# Solution
dept_summary = pd.pivot_table(
    employee_data,
    values=["salary", "performance", "projects_completed"],
    index="department",
    aggfunc={
        "salary": "mean",
        "performance": ["mean", "min", "max"],
        "projects_completed": "sum"
    }
).round(2)

print("Department summary:")
print(dept_summary)

**Explanation**: This demonstrates the power of `pivot_table()` for multi-metric analysis. The `aggfunc` parameter accepts:
- A single function (e.g., `"mean"`)
- A list of functions for all columns (e.g., `["mean", "sum"]`)
- A dictionary mapping specific columns to specific functions

Using a dictionary allows us to apply different aggregations to different columns:
- Salary: average makes sense (typical salary)
- Performance: mean, min, max show the range within each department
- Projects: sum shows total department output

The `.round(2)` method rounds all numeric values to 2 decimal places for better readability.
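
When `aggfunc` contains a list, the result has two-level column labels such as `("performance", "max")`. If flat names are easier to work with, one option is to join the two levels. This is a sketch on a hypothetical four-row frame `demo`, and the `metric_function` naming scheme is just one possible convention.

```python
import pandas as pd

# Hypothetical employee data for two departments
demo = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT"],
    "salary": [50000, 52000, 65000, 68000],
    "performance": [4.2, 4.0, 3.8, 4.1],
})

summary = pd.pivot_table(
    demo,
    values=["salary", "performance"],
    index="department",
    aggfunc={"salary": "mean", "performance": ["min", "max"]},
)

# Columns are (metric, function) pairs
print(summary.columns.tolist())

# Join each pair into a single flat name, e.g. "salary_mean"
summary.columns = ["_".join(col) for col in summary.columns]
print(summary)
```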

### Exercise 6: Time-based sales analysis

Analyse sales trends over time:

1. Create a pivot table showing average sales by product and month
2. Identify which product has the highest average monthly sales
3. Calculate the month-over-month growth rate for each product

In [None]:
# Monthly sales data (seeded so the random amounts are reproducible)
np.random.seed(42)
monthly_sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "product": ["Laptop", "Mouse", "Keyboard"] * 20,
    "sales": np.random.randint(5000, 20000, 60)
})

# Extract month from date
monthly_sales["month"] = monthly_sales["date"].dt.to_period("M")

print("Daily sales data:")
print(monthly_sales.head())

In [None]:
# Solution
# Step 1: Create pivot table of average sales
sales_by_month = pd.pivot_table(
    monthly_sales,
    values="sales",
    index="product",
    columns="month",
    aggfunc="mean"
).round(0)

print("Average sales by product and month:")
print(sales_by_month)

# Step 2: Identify highest average monthly sales
overall_avg = sales_by_month.mean(axis=1)
print("\nOverall average sales by product:")
print(overall_avg)
print(f"\nHighest average: {overall_avg.idxmax()} (£{overall_avg.max():.0f})")

# Step 3: Calculate month-over-month growth rate
growth_rate = sales_by_month.pct_change(axis=1) * 100
print("\nMonth-over-month growth rate (%):")
print(growth_rate.round(1))

**Explanation**: This exercise shows time-based analysis patterns:

1. **Creating the pivot table**: We use `aggfunc="mean"` to average daily sales within each month

2. **Finding the highest average**: 
   - `.mean(axis=1)` calculates the average across all months for each product
   - `.idxmax()` returns the index label (product name) of the maximum value

3. **Calculating growth rates**:
   - `.pct_change(axis=1)` calculates percentage change across columns (months)
   - `axis=1` is essential here: it compares each month to the previous month (left to right); the default `axis=0` would instead compare each product with the row above it
   - Multiply by 100 to convert decimals to percentages

The first month will have NaN for growth rate because there's no previous month to compare to.
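
Because the exercise data is random, the arithmetic is easier to check on fixed numbers. The sketch below uses a hypothetical two-product, two-month table `demo` whose growth rates can be verified by hand.

```python
import pandas as pd

# Hypothetical monthly averages, products as rows and months as columns
demo = pd.DataFrame(
    {"2024-01": [100.0, 200.0], "2024-02": [110.0, 150.0]},
    index=["Laptop", "Mouse"],
)

# Product with the highest average across months
overall_avg = demo.mean(axis=1)
best_product = overall_avg.idxmax()

# Percentage change from each month to the next, within each row
growth = demo.pct_change(axis=1) * 100

print(best_product)
print(growth.round(1))
```

Laptop rises from 100 to 110 (+10%), Mouse falls from 200 to 150 (-25%), and the first month is NaN for both.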

## Part 3: MultiIndex operations (15 minutes)

Work with hierarchical indices to organise and analyse multi-dimensional data.

### Exercise 7: Creating and navigating MultiIndex

Practice working with hierarchical indices:

1. Create a MultiIndex DataFrame with `region` and `city` as index levels
2. Sort the DataFrame by the index
3. Select all data for the 'North' region using cross-section (`.xs()`)
4. Calculate the total sales for each region

In [None]:
# Regional sales data
regional_data = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East", "East"],
    "city": ["London", "Manchester", "Brighton", "Southampton", "Norwich", "Cambridge"],
    "sales_Q1": [15000, 12000, 18000, 14000, 16000, 13000],
    "sales_Q2": [16000, 13000, 19000, 15000, 17000, 14000]
})

print("Regional sales data:")
print(regional_data)

In [None]:
# Solution
# Step 1: Create MultiIndex
regional_multi = regional_data.set_index(["region", "city"])

# Step 2: Sort by index
regional_multi = regional_multi.sort_index()

print("MultiIndex DataFrame:")
print(regional_multi)

# Step 3: Select all data for North region
print("\nNorth region data:")
north_data = regional_multi.xs("North", level="region")
print(north_data)

# Step 4: Calculate total sales for each region
print("\nTotal sales by region:")
region_totals = regional_multi.groupby(level="region").sum()
print(region_totals)

**Explanation**: MultiIndex provides hierarchical organization:

1. **Creating MultiIndex**: `.set_index(["region", "city"])` creates two index levels

2. **Sorting**: `.sort_index()` sorts by the index hierarchy (region first, then city within each region)

3. **Cross-section selection**: `.xs()` selects data at a specific level:
   - `xs("North", level="region")` selects all rows where region is "North"
   - This returns a DataFrame with only the remaining index level (city)

4. **Grouping by level**: When you have a MultiIndex, you can group by specific index levels:
   - `groupby(level="region")` groups by the region index level
   - This sums all cities within each region
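
Two further uses of `.xs()` are worth knowing: `drop_level=False` keeps the selected level in the result, and the `level` argument also works on the inner level. A minimal sketch on a hypothetical three-city subset `demo` of the exercise data:

```python
import pandas as pd

# Hypothetical subset using the same region/city pairings as the exercise
demo = pd.DataFrame({
    "region": ["North", "North", "South"],
    "city": ["London", "Manchester", "Brighton"],
    "sales_Q1": [15000, 12000, 18000],
}).set_index(["region", "city"]).sort_index()

# Default: the selected level is dropped, leaving only city
north = demo.xs("North", level="region")

# drop_level=False keeps both index levels
north_full = demo.xs("North", level="region", drop_level=False)

# Selecting on the inner level leaves only region
brighton = demo.xs("Brighton", level="city")

# Sum over cities within each region
totals = demo.groupby(level="region").sum()

print(north, north_full, brighton, totals, sep="\n\n")
```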

### Exercise 8: Stack and unstack operations

Practice reshaping with `stack()` and `unstack()`:

1. Set `product` and `quarter` as a MultiIndex
2. Use `unstack()` to move `quarter` to columns
3. Calculate the total sales for each product
4. Use `stack()` to convert back to long format

In [None]:
# Quarterly product sales
product_sales = pd.DataFrame({
    "product": ["Widget A", "Widget A", "Widget A", "Widget A",
                "Widget B", "Widget B", "Widget B", "Widget B"],
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "sales": [10000, 12000, 11000, 13000,
              8000, 9000, 8500, 9500]
})

print("Product sales data:")
print(product_sales)

In [None]:
# Solution
# Step 1: Create MultiIndex
sales_multi = product_sales.set_index(["product", "quarter"])
print("MultiIndex format:")
print(sales_multi)

# Step 2: Unstack quarter to columns
sales_unstacked = sales_multi.unstack()
print("\nUnstacked (wide format):")
print(sales_unstacked)

# Step 3: Calculate total sales for each product
sales_unstacked["Total"] = sales_unstacked.sum(axis=1)
print("\nWith totals:")
print(sales_unstacked)

# Step 4: Stack back to long format
sales_stacked = sales_unstacked.stack()
print("\nStacked back to long format:")
print(sales_stacked)

**Explanation**: `stack()` and `unstack()` are powerful reshaping operations:

1. **Unstack**: Moves the innermost index level to columns
   - Creates a wider DataFrame
   - Useful for comparison and calculation across categories
   - Note the hierarchical column names: (`sales`, `Q1`), (`sales`, `Q2`), etc.

2. **Adding totals**: Once unstacked, we can easily calculate row-wise totals

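3. **Stack**: Moves column level back to index
   - Creates a taller, narrower DataFrame
   - The inverse of `unstack()`
   - Returns a Series when the columns have a single level; here the columns have two levels (`sales` and `quarter`), so the result is a DataFrame

**Note**: The "Total" column added before stacking becomes part of the stacked data: it appears as its own column under an empty `quarter` label and leaves NaN gaps in the other rows. In practice, remove it first, for example with `.iloc[:, :-1]`.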
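
Whether `stack()` returns a Series or a DataFrame depends on how many column levels the input has. The sketch below, on a hypothetical two-product frame `demo`, contrasts the two cases and keeps the totals in a separate Series so they never enter the stacked data. Selecting the `sales` column before unstacking is a variation on the solution above, not the solution itself.

```python
import pandas as pd

# Hypothetical quarterly sales with a (product, quarter) MultiIndex
demo = pd.DataFrame({
    "product": ["Widget A", "Widget A", "Widget B", "Widget B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [10000, 12000, 8000, 9000],
}).set_index(["product", "quarter"])

# Unstacking the Series gives single-level columns: Q1, Q2
wide = demo["sales"].unstack()
totals = wide.sum(axis=1)      # kept separate from the wide table
long_again = wide.stack()      # single column level -> Series

# Unstacking the DataFrame gives two-level columns: ("sales", "Q1"), ...
two_level = demo.unstack()
stacked_frame = two_level.stack()  # two column levels -> DataFrame

print(type(long_again).__name__, type(stacked_frame).__name__)
```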

## Part 4: Window functions and advanced operations (20 minutes)

Apply window functions and advanced transformations to analyse trends.

### Exercise 9: Rolling averages for trend analysis

Calculate rolling statistics to identify trends:

1. Calculate a 7-day rolling average of daily sales
2. Calculate a 7-day rolling standard deviation
3. Identify days where sales are more than 2 standard deviations above the rolling mean

In [None]:
# Daily sales data
np.random.seed(42)
daily_sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30),
    "sales": np.random.randint(5000, 15000, 30)
})

print("Daily sales:")
print(daily_sales.head(10))

In [None]:
# Solution
# Step 1: Calculate 7-day rolling average
daily_sales["rolling_mean"] = daily_sales["sales"].rolling(window=7).mean()

# Step 2: Calculate 7-day rolling standard deviation
daily_sales["rolling_std"] = daily_sales["sales"].rolling(window=7).std()

# Step 3: Identify outliers (> 2 std above mean)
daily_sales["upper_bound"] = daily_sales["rolling_mean"] + 2 * daily_sales["rolling_std"]
daily_sales["is_outlier"] = daily_sales["sales"] > daily_sales["upper_bound"]

print("Sales with rolling statistics:")
print(daily_sales)

print("\nOutlier days:")
outliers = daily_sales[daily_sales["is_outlier"]]
print(outliers[["date", "sales", "rolling_mean", "upper_bound"]])

**Explanation**: Rolling window calculations are essential for trend analysis:

1. **Rolling mean**: `.rolling(window=7).mean()` calculates the average of the current row and the previous 6 rows (7 days total)
   - First 6 rows will be NaN because there aren't enough previous values
   - Smooths out daily fluctuations to show the trend

2. **Rolling standard deviation**: `.rolling(window=7).std()` measures variability over the window
   - Higher values indicate more volatile sales
   - Lower values indicate stable sales

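3. **Outlier detection**: The "2 standard deviations" rule:
   - In a normal distribution, ~95% of values fall within 2 std of the mean, so only about 2.5% lie above the upper bound
   - Values beyond the bound are treated as outliers; this is a rule of thumb, not a formal test
   - The code checks the upper bound only, so it flags spikes; detecting drops needs a matching lower bound (`rolling_mean - 2 * rolling_std`)
   - The window includes the current day, so a spike raises its own mean and std and becomes harder to flag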

**Business application**: This helps identify exceptional sales days that might warrant investigation (successful promotion, seasonal effect, data error, etc.)
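
A rolling window that includes the current day lets a spike inflate the very statistics it is compared against. One possible variation, sketched here on a hypothetical eight-day series and not part of the original solution, is to shift the rolling statistics by one day so that each value is compared with the preceding seven days only.

```python
import pandas as pd

# Hypothetical series: seven ordinary days followed by a spike of 150
sales = pd.Series([100, 120, 80, 110, 90, 100, 100, 150], dtype="float64")

# Variant 1: window includes the current day (as in the solution above)
mean_incl = sales.rolling(window=7).mean()
std_incl = sales.rolling(window=7).std()
flag_incl = sales > mean_incl + 2 * std_incl

# Variant 2: shift(1) moves the statistics down one row, so each day
# is compared with the 7 days before it
mean_prev = sales.rolling(window=7).mean().shift(1)
std_prev = sales.rolling(window=7).std().shift(1)
flag_prev = sales > mean_prev + 2 * std_prev

print(pd.DataFrame({"sales": sales, "incl": flag_incl, "prev": flag_prev}))
```

In this series the spike is missed by the first variant and flagged by the second. Rows without a full window compare against NaN and are therefore never flagged.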

### Exercise 10: Group-wise transformations

Apply transformations within groups:

1. Calculate each employee's salary as a percentage of their department's total salary
2. Calculate each employee's ranking within their department based on performance
3. Create a column showing the department average performance

In [None]:
# Employee performance data
employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank"],
    "department": ["Sales", "Sales", "IT", "IT", "HR", "HR"],
    "salary": [50000, 55000, 65000, 70000, 48000, 52000],
    "performance": [4.2, 4.5, 3.8, 4.1, 4.6, 4.3]
})

print("Employee data:")
print(employees)

In [None]:
# Solution
# Step 1: Salary as percentage of department total
employees["dept_total_salary"] = employees.groupby("department")["salary"].transform("sum")
employees["salary_pct"] = (employees["salary"] / employees["dept_total_salary"] * 100).round(1)

# Step 2: Ranking within department by performance
employees["dept_rank"] = employees.groupby("department")["performance"].rank(ascending=False, method="min")

# Step 3: Department average performance
employees["dept_avg_performance"] = employees.groupby("department")["performance"].transform("mean").round(2)

print("Employee data with transformations:")
print(employees)

**Explanation**: The `.transform()` method is powerful for group-wise operations:

1. **Salary percentage**:
   - `.transform("sum")` calculates the sum for each group and broadcasts it back to all rows
   - Each employee gets their department's total salary
   - We then calculate the individual percentage: `(individual / total) * 100`

2. **Department ranking**:
   - `.rank(ascending=False)` ranks highest performance as 1
   - `method="min"` handles ties by giving them the minimum rank
   - Ranking is done within each department group

3. **Department average**:
   - `.transform("mean")` calculates the mean for each department
   - Broadcasts it back so each employee has their department's average
   - Useful for comparing individual vs. department performance

**Why use `.transform()` vs `.agg()`?**
- `.agg()` returns one row per group (reduces rows)
- `.transform()` returns the same number of rows as the original DataFrame (broadcasts values)
- Use `.transform()` when you need to keep the original row structure
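
The shape difference can be checked directly. A minimal sketch on a hypothetical three-employee frame `demo`:

```python
import pandas as pd

# Hypothetical salaries: two people in Sales, one in IT
demo = pd.DataFrame({
    "department": ["Sales", "Sales", "IT"],
    "salary": [50000, 55000, 65000],
})

# agg: one value per department (2 rows)
per_group = demo.groupby("department")["salary"].agg("sum")

# transform: one value per original row (3 rows), aligned to demo's index
per_row = demo.groupby("department")["salary"].transform("sum")

# Because per_row lines up with demo, it can be used in row-wise arithmetic
demo["salary_pct"] = (demo["salary"] / per_row * 100).round(1)

print(per_group)
print(demo)
```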

## Part 5: Data cleaning and transformation (15 minutes)

Practice advanced data cleaning and feature engineering techniques.

### Exercise 11: Binning and categorization

Create meaningful categories from continuous data:

1. Use `pd.cut()` to create age groups: 'Young' (< 30), 'Mid-career' (30-45), 'Senior' (> 45)
2. Use `pd.qcut()` to create salary quartiles
3. Create a pivot table showing the count of employees in each age group by salary quartile

In [None]:
# Employee demographics
demographics = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank", "Grace", "Henry",
             "Ivy", "Jack", "Kelly", "Liam"],
    "age": [25, 28, 35, 42, 31, 48, 26, 39, 44, 52, 29, 36],
    "salary": [45000, 52000, 68000, 85000, 58000, 92000, 47000, 72000,
               88000, 95000, 51000, 70000]
})

print("Employee demographics:")
print(demographics)

In [None]:
# Solution
# Step 1: Create age groups using pd.cut()
demographics["age_group"] = pd.cut(
    demographics["age"],
    bins=[0, 30, 45, 100],
    labels=["Young", "Mid-career", "Senior"]
)

# Step 2: Create salary quartiles using pd.qcut()
demographics["salary_quartile"] = pd.qcut(
    demographics["salary"],
    q=4,
    labels=["Q1", "Q2", "Q3", "Q4"]
)

print("With categories:")
print(demographics)

# Step 3: Create pivot table
age_salary_table = pd.pivot_table(
    demographics,
    values="name",
    index="age_group",
    columns="salary_quartile",
    aggfunc="count",
    fill_value=0
)

print("\nEmployee count by age group and salary quartile:")
print(age_salary_table)

**Explanation**: Binning converts continuous data into categorical groups:

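1. **`pd.cut()` - Bins with fixed edges**:
   - You specify the bin edges: `[0, 30, 45, 100]`
   - Creates right-closed intervals by default: (0, 30], (30, 45], (45, 100]
   - An age of exactly 30 therefore lands in "Young", not "Mid-career"; the sample data contains no one aged exactly 30, so the result still matches the task
   - Use when you have natural cut points (e.g., age groups, score ranges)
   - Good for business rules or policy-based groupings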

2. **`pd.qcut()` - Quantile-based bins**:
   - You specify the number of bins: `q=4`
   - Pandas automatically creates equal-sized groups (quartiles)
   - Each quartile contains approximately the same number of observations
   - Use for statistical analysis or when you want evenly distributed groups

3. **The pivot table**:
   - Shows the distribution of employees across age groups and salary levels
   - `fill_value=0` replaces NaN with 0 for empty combinations
   - Helps identify patterns (e.g., are senior employees in higher quartiles?)

**Business application**: This type of analysis helps with:
- Compensation planning
- Identifying pay equity issues
- Understanding career progression patterns
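
Boundary values are where binning most often goes wrong, so it helps to test them explicitly. The sketch below uses hypothetical boundary ages (29, 30, 45, 46) that do not appear in the exercise data. Moving the first edge to 29 is one way to match the "< 30" rule, and it assumes ages are whole numbers.

```python
import pandas as pd

labels = ["Young", "Mid-career", "Senior"]
ages = pd.Series([29, 30, 45, 46])  # hypothetical boundary cases

# Default right-closed bins: (0, 30], (30, 45], (45, 100]
default_edges = pd.cut(ages, bins=[0, 30, 45, 100], labels=labels)

# For whole-number ages: (0, 29], (29, 45], (45, 100]
adjusted = pd.cut(ages, bins=[0, 29, 45, 100], labels=labels)

print(pd.DataFrame({"age": ages, "default": default_edges, "adjusted": adjusted}))

# qcut: eight distinct salaries split into four groups of two
salaries = pd.Series([45000, 52000, 68000, 85000, 58000, 92000, 47000, 72000])
quartiles = pd.qcut(salaries, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartiles.value_counts().sort_index())
```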

### Exercise 12: Method chaining for data pipelines

Create an efficient data processing pipeline using method chaining:

1. Filter to include only sales above 5000
2. Create a new column `revenue_category` based on sales:
   - 'Low': < 7500
   - 'Medium': 7500-12500
   - 'High': > 12500
3. Group by `product` and `revenue_category` and count transactions
4. Sort by count in descending order

Try to complete this in a single chained expression.

In [None]:
# Transaction data
transactions = pd.DataFrame({
    "product": ["Laptop", "Mouse", "Keyboard", "Monitor", "Laptop",
                "Mouse", "Keyboard", "Monitor", "Laptop", "Mouse"] * 3,
    "sales": [15000, 3000, 8000, 18000, 12000,
              4000, 6000, 20000, 16000, 3500] * 3
})

print("Transaction data:")
print(transactions.head(10))

In [None]:
# Solution (method chaining)
result = (
    transactions
    .query("sales > 5000")  # Step 1: Filter
    .assign(  # Step 2: Create revenue_category
        revenue_category=lambda x: pd.cut(
            x["sales"],
            bins=[0, 7500, 12500, 100000],
            labels=["Low", "Medium", "High"]
        )
    )
    .groupby(["product", "revenue_category"])  # Step 3: Group and count
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)  # Step 4: Sort
)

print("Product-category analysis:")
print(result)

**Explanation**: Method chaining creates readable, efficient pipelines:

1. **`.query()`**: Filters using a string expression
   - More readable than boolean indexing for simple conditions
   - Works well in chains

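2. **`.assign()`**: Creates new columns
   - Can use lambda functions to reference the DataFrame as it stands at that point in the chain (here, the filtered rows)
   - `lambda x: pd.cut(x["sales"], ...)` creates categories based on sales column
   - `pd.cut()` closes each bin on the right by default, so a sale of exactly 7500 would be labelled "Low"; no such value occurs in this data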

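3. **`.groupby().size()`**: Counts rows in each group
   - `.size()` counts all rows (including NaN)
   - `.reset_index(name="count")` converts the result to a DataFrame with a "count" column
   - Because `revenue_category` is categorical, the result may also list product-category combinations with a count of 0, depending on the default of `groupby`'s `observed` parameter in your pandas version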

4. **`.sort_values()`**: Sorts by count descending

**Benefits of method chaining**:
- Reads top-to-bottom like a recipe
- No intermediate variables cluttering the namespace
- Easy to add/remove steps
- Parentheses allow multi-line formatting for readability

**Note**: The parentheses around the entire expression allow Python to treat multiple lines as one statement.
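
Two properties of a chain are easy to verify on a small example: each step returns a new object, so the original DataFrame is left unchanged, and the lambda inside `.assign()` receives the intermediate (already filtered) frame. This is a sketch with a hypothetical frame `demo` and a hypothetical derived column `sales_k`.

```python
import pandas as pd

# Hypothetical transactions
demo = pd.DataFrame({
    "product": ["Laptop", "Mouse", "Laptop", "Keyboard"],
    "sales": [15000, 3000, 12000, 8000],
})

summary = (
    demo
    .query("sales > 5000")                          # drops the Mouse row
    .assign(sales_k=lambda x: x["sales"] / 1000)    # x is the filtered frame
    .groupby("product")["sales_k"]
    .sum()
    .sort_values(ascending=False)
)

print(summary)
print(demo.columns.tolist())  # demo has no sales_k column
```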

In [None]:
# Alternative solution (step-by-step)
# This achieves the same result but is less concise

# Step 1: Filter (.copy() avoids a SettingWithCopyWarning when adding a column below)
filtered = transactions[transactions["sales"] > 5000].copy()

# Step 2: Create category
filtered["revenue_category"] = pd.cut(
    filtered["sales"],
    bins=[0, 7500, 12500, 100000],
    labels=["Low", "Medium", "High"]
)

# Step 3: Group and count
grouped = filtered.groupby(["product", "revenue_category"]).size().reset_index(name="count")

# Step 4: Sort
result_alt = grouped.sort_values("count", ascending=False)

print("Alternative solution (same result):")
print(result_alt)

## Key takeaways

### Data reshaping

- **`melt()`** converts wide → long format (analysis-friendly)
- **`pivot()`** converts long → wide format (reporting-friendly)
- **Remember**: Wide format is good for humans, long format is good for analysis

### Pivot tables and aggregation

- **`pivot_table()`** is more powerful than `pivot()`: it aggregates duplicates
- Use **dictionaries in `aggfunc`** to apply different functions to different columns
- **`margins=True`** adds row and column totals automatically
- **`.pct_change()`** calculates growth rates (use `axis=1` when the periods are in columns, so each value is compared with the column to its left)

### MultiIndex operations

- **MultiIndex** organises hierarchical data (Country → Region → City)
- **`.xs()`** selects data at a specific index level (cross-section)
- **`unstack()`** moves index → columns (wider)
- **`stack()`** moves columns → index (taller)
- **`groupby(level=)`** groups by specific index levels

### Window functions and transformations

- **`.rolling(window=N)`** calculates statistics over N rows
- Common rolling operations: `.mean()`, `.std()`, `.sum()`
- **`.transform()`** applies functions within groups, returns same-shaped data
- **`.agg()`** reduces groups to summary statistics
- A threshold of **2 standard deviations** from the rolling mean is a common rule of thumb for flagging outliers

### Data cleaning and transformation

- **`pd.cut()`** creates bins from edges you specify (right-closed by default)
- **`pd.qcut()`** creates bins with approximately equal counts (quantile-based)
- **Method chaining** creates readable pipelines:
  - `.query()` for filtering
  - `.assign()` for creating columns
  - `.groupby()` for aggregation
  - `.sort_values()` for sorting

### Common patterns

| Task | Method | Example |
|------|--------|--------|
| Wide → Long | `melt()` | `df.melt(id_vars=["id"], value_vars=["Q1", "Q2"])` |
| Long → Wide | `pivot()` | `df.pivot(index="id", columns="quarter", values="sales")` |
| Aggregate table | `pivot_table()` | `pd.pivot_table(df, values="sales", index="region", aggfunc="sum")` |
| Rolling average | `.rolling()` | `df["sales"].rolling(window=7).mean()` |
| Group transform | `.transform()` | `df.groupby("dept")["salary"].transform("mean")` |
| Create bins | `pd.cut()` | `pd.cut(df["age"], bins=[0, 30, 50, 100])` |
| Quantile bins | `pd.qcut()` | `pd.qcut(df["salary"], q=4)` |

### When to use each technique

**Use `melt()` when**:
- Creating visualizations
- Performing statistical analysis
- Preparing data for machine learning

**Use `pivot()` or `pivot_table()` when**:
- Creating summary tables for reports
- Comparing values across categories
- Calculating period-over-period changes

**Use MultiIndex when**:
- Data has natural hierarchies
- Need multi-dimensional analysis
- Working with time series at multiple frequencies

**Use rolling functions when**:
- Analysing trends over time
- Smoothing noisy data
- Detecting outliers

**Use `.transform()` when**:
- Need group statistics alongside individual rows
- Calculating percentages of group totals
- Comparing individuals to group averages