# Advanced Pandas (Demonstration)

_This notebook builds on the [Pandas](https://pandas.pydata.org/) fundamentals you learnt in Week 04, demonstrating advanced techniques for data reshaping and analysis. We'll focus on the concept of "tidy" data and powerful methods for transforming your datasets._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Sonnet 4.5)*, including updated documentation and git commit messages.

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Set display options
pd.set_option("display.max_columns", 50)
plt.style.use("ggplot")

---

## 🎯 CORE CONTENT (Essential for Exercises)

**Estimated time**: 50-60 minutes

The sections below cover essential advanced Pandas techniques you'll need for the exercises:
- Understanding tidy data principles
- Reshaping data with `melt()` and `pivot()`
- Creating pivot tables with `pivot_table()`
- Basic data aggregation

Work through these sections carefully. The exercises will require you to apply these reshaping techniques.

---

## The concept of "tidy" data

Before we explore advanced data manipulation and analysis techniques, let's cover a fundamental concept: "Tidy" data _(vs. "messy" data)_. This concept, introduced by Hadley Wickham, has become a cornerstone of modern data analysis.

### What is tidy data?

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

### Principles of tidy data

1. Column headers are variable names
   - Not values
   - Not complex descriptions
   - Should be clear and concise
2. Variables are organised in columns
   - Each column contains one and only one variable
   - All entries in a column should be of the same type
   - No mixing of different units or types of information
3. Observations are organised in rows
   - Each row represents one and only one observation
   - All values in a row should correspond to the same observational unit
4. Each type of observational unit forms a table
   - Different types of units should be stored in separate tables
   - Tables can be linked through shared identifiers

### Common "messy" data problems

Below are several examples comparing "tidy" vs. "messy" data to illustrate some of the issues related to "messy" data.

#### "Messy" problem #1

Column headers are values, not variable names.

In [None]:
# Messy
pd.DataFrame({
    "year": [2020, 2021],
    "q1": [100, 110],
    "q2": [120, 130],
    "q3": [140, 150],
    "q4": [160, 170]
})

In [None]:
# Tidy
pd.DataFrame({
    "year": [2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "value": [100, 120, 140, 160, 110, 130, 150, 170]
})
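As a preview of `melt()`, which is covered in detail later in this notebook, the step from the messy to the tidy version above can be sketched as follows. The variable names `messy` and `tidy` are illustrative assumptions and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Messy: the quarter labels are stored as column headers
messy = pd.DataFrame({
    "year": [2020, 2021],
    "q1": [100, 110],
    "q2": [120, 130],
    "q3": [140, 150],
    "q4": [160, 170],
})

# Unpivot the four quarter columns into a single "quarter" column
tidy = messy.melt(id_vars="year", var_name="quarter", value_name="value")

# Use the same labels as the tidy table (Q1 ... Q4) and order rows by year
tidy["quarter"] = tidy["quarter"].str.upper()
tidy = tidy.sort_values(["year", "quarter"]).reset_index(drop=True)

print(tidy)
```

Each row of `tidy` now holds one year-quarter observation, matching the hand-written tidy version.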

#### "Messy" problem #2

Multiple variables are stored in one column.

In [None]:
# Messy
pd.DataFrame({
    "id": [1, 2],
    "age_gender": ["25_M", "30_F"]
})

In [None]:
# Tidy
pd.DataFrame({
    "id": [1, 2],
    "age": [25, 30],
    "gender": ["M", "F"]
})
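One possible way to get from the messy to the tidy version is to split the combined column at the underscore. This is a minimal sketch that assumes every entry follows the `<age>_<gender>` pattern; the names `messy`, `parts`, and `tidy` are illustrative.

```python
import pandas as pd

messy = pd.DataFrame({
    "id": [1, 2],
    "age_gender": ["25_M", "30_F"],
})

# Split "25_M" into two separate columns at the underscore
parts = messy["age_gender"].str.split("_", expand=True)

tidy = pd.DataFrame({
    "id": messy["id"],
    "age": parts[0].astype(int),  # convert the text "25" to the number 25
    "gender": parts[1],
})

print(tidy)
```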

#### "Messy" problem #3

Variables are stored in both rows and columns.

Messy data _(temperature readings)_:

| location | measurement        | value |
|----------|-------------------|-------|
| LA       | temperature_max   | 85    |
| LA       | temperature_min   | 65    |
| NYC      | temperature_max   | 72    |
| NYC      | temperature_min   | 55    |

Tidy data _(temperature readings)_:

| location | max_temp | min_temp |
|----------|----------|----------|
| LA       | 85       | 65       |
| NYC      | 72       | 55       |
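The two tables above can be related with `pivot()`, which is introduced later in this notebook. The sketch below rebuilds the messy table by hand; the variable names are illustrative assumptions.

```python
import pandas as pd

messy = pd.DataFrame({
    "location": ["LA", "LA", "NYC", "NYC"],
    "measurement": ["temperature_max", "temperature_min"] * 2,
    "value": [85, 65, 72, 55],
})

# Spread the two measurement types into their own columns
tidy = messy.pivot(index="location", columns="measurement", values="value")

# Rename to the column names used in the tidy table
tidy = tidy.rename(columns={"temperature_max": "max_temp",
                            "temperature_min": "min_temp"})

# Restore "location" as a regular column and drop the leftover "measurement" label
tidy = tidy.reset_index()
tidy.columns.name = None

print(tidy)
```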

#### A (slightly) more detailed example

Let's look at another example. While the "messy" version below is easily readable, it is considered "messy" because it violates the first principle of "tidy" data: "Each variable forms a column."

| Year | Country | Men | Women | Non-binary | Other |
|------|---------|-----|-------|------------|-------|
| 2020 | USA     | 10  | 12    | 3          | 2     |
| 2020 | Canada  | 8   | 9     | 2          | 1     |
| 2021 | USA     | 11  | 13    | 4          | 3     |
| 2021 | Canada  | 9   | 10    | 3          | 2     |

Let's analyse what the variables really are in this dataset:

1. `Year` _(currently a column - good!)_
2. `Country` _(currently a column - good!)_
3. `Gender` _(currently split across four columns - thus violating the first principle of "tidy" data!)_
4. `Count` / `Value` _(the numbers in the cells of the "messy" version)_

The issue is that `Gender` is not being treated as a single variable; instead, it's spread across four columns (`Men`, `Women`, `Non-binary`, `Other`). In "tidy" data, `Gender` should be a single variable in one column, with the possible values being `Men`, `Women`, `Non-binary`, and `Other`.

Think about it this way:

1. What if we needed to add another gender category? _We'd have to add a new column!_
2. What if we wanted to calculate the percentage by gender? _We'd have to reference multiple columns!_
3. What if we wanted to plot gender distribution? _We'd need to reshape the data first!_

Now compare the above with this example of "tidy" data:

| Year | Country | Gender     | Value |
|------|---------|------------|-------|
| 2020 | USA     | Men        | 10    |
| 2020 | USA     | Women      | 12    |
| 2020 | USA     | Non-binary | 3     |
| 2020 | USA     | Other      | 2     |
| 2020 | Canada  | Men        | 8     |
| 2020 | Canada  | Women      | 9     |
| 2020 | Canada  | Non-binary | 2     |
| 2020 | Canada  | Other      | 1     |
| 2021 | USA     | Men        | 11    |
| 2021 | USA     | Women      | 13    |
| 2021 | USA     | Non-binary | 4     |
| 2021 | USA     | Other      | 3     |
| 2021 | Canada  | Men        | 9     |
| 2021 | Canada  | Women      | 10    |
| 2021 | Canada  | Non-binary | 3     |
| 2021 | Canada  | Other      | 2     |

In this format:

1. Each variable _(`Year`, `Country`, `Gender`, `Value`)_ is a single column
2. Each observation is a single row
3. Each cell contains a single value
4. Analysis is more straightforward _(e.g., `groupby("Gender")["Value"].sum()`)_
5. It's easier to add new `Gender` categories _(just add new rows)_
6. It's more compatible with visualisation libraries

This is similar to the way we'd store the data in a database; you wouldn't typically have separate columns for each possible value of a categorical variable. After a brief summary below, you will see how to convert between these formats using Pandas in the next block. This includes using `melt()` to go from the "messy" to "tidy" format, and `pivot()` to go back if needed.
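To make the comparison concrete, here is a minimal sketch that rebuilds the "messy" table from this example, converts it to the tidy layout, and totals the values per gender. The variable names `wide`, `long`, and `totals` are assumptions made for illustration.

```python
import pandas as pd

# The "messy" (wide) table from the example above
wide = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021],
    "Country": ["USA", "Canada", "USA", "Canada"],
    "Men": [10, 8, 11, 9],
    "Women": [12, 9, 13, 10],
    "Non-binary": [3, 2, 4, 3],
    "Other": [2, 1, 3, 2],
})

# Collect the four gender columns into one "Gender" column
long = wide.melt(id_vars=["Year", "Country"], var_name="Gender", value_name="Value")

# With tidy data, a total per gender is a single groupby
totals = long.groupby("Gender")["Value"].sum()
print(totals)
```

Adding a further gender category would only add rows to `long`; the `groupby` line would not need to change.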

### Summary: Why is "tidy" data important?

#### Consistency

- Standardized way to structure data
- Makes data easier to understand
- Facilitates collaboration
- Reduces errors in analysis

#### Ease of manipulation

- Many Pandas operations _(e.g., `groupby()`, filtering, plotting)_ work most naturally with tidy data
- Simplified data transformation
- More intuitive filtering and grouping
- Easier to reshape when needed

#### Analysis ready

- Compatible with most statistical models
- Ready for machine learning algorithms
- Easier to identify patterns and trends
- Facilitates feature engineering

#### Visualisation friendly

- Works well with plotting libraries
- Easier to create meaningful visualisations
- Better control over aesthetic mappings

### Best practices for "tidying" data

#### Document your tidying process

- Keep track of transformations
- Note any assumptions made
- Document reasons for choices

#### Preserve raw data

- Keep original data unchanged
- Create copies for tidying
- Maintain data lineage

#### Validate after tidying

- Check for missing values
- Verify data types
- Confirm unique identifiers
- Test relationships between variables
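The first three checks in this list could look roughly like the sketch below. The small `tidy` frame, including its deliberately missing value, is made up for illustration.

```python
import pandas as pd

# Hypothetical tidy data with one missing value
tidy = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "value": [100, 120, 110, None],
})

# Check for missing values: count them per column
missing_per_column = tidy.isna().sum()
print(missing_per_column)

# Verify data types
column_types = tidy.dtypes
print(column_types)

# Confirm unique identifiers: each year/quarter pair should appear only once
duplicate_keys = tidy.duplicated(subset=["year", "quarter"]).sum()
print("Duplicate year/quarter pairs:", duplicate_keys)
```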

#### Consider the end use

- Think about analysis needs
- Plan for visualisation requirements
- Account for model requirements

_(**Remember:** While tidy data is ideal for many analyses, sometimes other formats might be more appropriate for specific tasks. The key is to understand when and why to use each format.)_

## Tidying data with Pandas

Pandas provides several functions to help tidy your data _(which we will be introducing/using throughout this module)_:

1. `melt()`: Unpivot a `DataFrame` from wide to long format.
2. `pivot()`: Reshape data from long to wide format.
3. `stack()` and `unstack()`: Move a level of the _(possibly hierarchical)_ column labels into the row index _(`stack()`)_, or a level of the row index into the columns _(`unstack()`)_.

Remember, the goal is to structure your data so that it's easy to work with for your specific analysis needs. Sometimes, the most convenient structure for your analysis might not be the tidiest format, so use these principles as guidelines rather than strict rules.

### Creating a data frame

With this out of the way, let's start by creating a sample `DataFrame` to work with throughout the demonstration.

In [None]:
# More complex sample data
data = {
    "name": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank", "Grace", "Henry", 
             "Ivy", "Jack", "Kelly", "Liam", "Maya", "Noah", "Olivia"],
    
    "age": [25, 30, 35, 40, 45, 28, 33, 38, 42, 47, 29, 34, 39, 44, 31],
    
    "city": ["New York", "London", "Paris", "Tokyo", "Sydney", "Berlin", "Toronto",
             "Madrid", "Rome", "Amsterdam", "Singapore", "Dubai", "Moscow", 
             "Stockholm", "Vancouver"],
    
    "department": ["Sales", "IT", "HR", "Sales", "IT", "HR", "Sales", "IT", "HR",
                  "Sales", "IT", "HR", "Sales", "IT", "HR"],
    
    "salary_2022": [50000, 60000, 70000, 80000, 90000, 55000, 65000, 75000,
                    85000, 95000, 52000, 62000, 72000, 82000, 92000],
    
    "salary_2023": [52000, 63000, 73000, 84000, 94000, 57000, 68000, 78000,
                    89000, 99000, 54000, 65000, 75000, 86000, 96000],
    
    "performance_Q1": [4.5, 3.8, 4.2, 3.9, 4.7, 4.1, 3.7, 4.3, 4.0, 4.6,
                      3.9, 4.4, 4.2, 3.8, 4.5],
    
    "performance_Q2": [4.3, 4.0, 4.1, 4.2, 4.5, 3.9, 3.8, 4.4, 4.1, 4.7,
                      4.0, 4.3, 4.1, 3.9, 4.4],
    
    "projects_completed": [5, 7, 4, 6, 8, 5, 6, 7, 4, 8, 5, 6, 7, 5, 6],
    
    "training_hours": [20, 35, 25, 30, 40, 22, 28, 35, 26, 38, 24, 32, 27, 36, 29]
}

# Create DataFrame
df = pd.DataFrame(data)

# Add some missing values (to make the DataFrame more realistic)
df.loc[2, "performance_Q2"] = np.nan
df.loc[5, "training_hours"] = np.nan
df.loc[8, "projects_completed"] = np.nan
df.loc[11, "performance_Q1"] = np.nan

In [None]:
print("First few rows of the original DataFrame:")
print(df.head(3))

### Tools for tidying data in Pandas: `melt()`

We can use the `melt()` function to convert "wide" data to "long" format _(as in the example above)_.

#### Purpose of `melt()`

- Transforms "wide" format data into "long" format
- Converts columns into rows
- Useful for time series analysis and visualization

#### Key parameters used

- `id_vars`: Columns to keep as identifier variables _(`name` in our case)_
- `value_vars`: Columns to unpivot _(`salary_2022`, `salary_2023`)_
- `var_name`: Name for the new column containing former column names _(`year`)_
- `value_name`: Name for the new column containing the values _(`salary`)_

#### The transformation

- Before: Each row has one person with multiple columns for years
- After: Each row has one person-year combination
- Two rows per person _(one for each year)_

#### Benefits

- Easier to analyse trends over time
- Better format for many visualisation libraries
- More compatible with statistical analyses
- Easier to add new years without adding columns

#### Example

In [None]:
# Converting salary columns to long format
df_melted_salary = pd.melt(
    df,
    id_vars=["name"],
    value_vars=["salary_2022", "salary_2023"],
    var_name="year",
    value_name="salary"
)

In [None]:
print("\nMelted salary data (first 3 rows):")
print(df_melted_salary.head(3))

In [None]:
# Clean up the year column to remove "salary_" prefix
df_melted_salary["year"] = df_melted_salary["year"].str.replace("salary_", "")

In [None]:
print("\nCleaned melted data (first 3 rows):")
print(df_melted_salary.head(3))

### Tools for tidying data in Pandas: `pivot()`

The `pivot()` function is essentially the inverse operation of `melt()`. While `melt()` transforms wide data into long format, `pivot()` reshapes long data back into wide format. 

#### Basic pivoting

- Takes long-format data and converts it to wide format
- Creates a new column for each unique value in the `columns` parameter
- Values are reorganized based on the index and new columns

#### Advanced features

- Can handle multiple index levels (`index=["name", "department"]`)
- Can pivot multiple value columns simultaneously
- Raises a `ValueError` if any index/column combination appears more than once _(use `pivot_table()` when duplicates need aggregating)_

#### Common use cases

- Converting time series data from long to wide format
- Creating cross-tabulations
- Preparing data for visualization
- Building summary tables

#### Best practices

- Always check for duplicate combinations of index and columns
- Consider using `pivot_table()` for more complex pivoting operations
- Use `reset_index()` after pivoting if you need the index as a regular column
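The first and last of these practices can be sketched as follows. The small `long` frame is a made-up stand-in for long-format salary data.

```python
import pandas as pd

long = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob", "Bob"],
    "year": ["2022", "2023", "2022", "2023"],
    "salary": [50000, 52000, 60000, 63000],
})

# pivot() needs each (index, columns) pair to appear at most once
n_duplicates = long.duplicated(subset=["name", "year"]).sum()
print("Duplicate name/year pairs:", n_duplicates)

wide = long.pivot(index="name", columns="year", values="salary")

# Turn the "name" index back into a regular column
wide = wide.reset_index()
wide.columns.name = None  # drop the leftover "year" label on the columns
print(wide)
```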

#### Example

A common workflow is to (a) melt data to long format for certain operations, and then (b) pivot it back to wide format for other operations. The next code block pivots the melted salary data back to wide format, which is the shape needed to calculate the change in salary between two years.

In [None]:
# Demonstrate pivot() to get back to wide format
df_pivoted = df_melted_salary.pivot(
    index="name",     # Rows - what will identify each record
    columns="year",   # Columns - what will become the new columns
    values="salary"   # Values - what will fill the cells
)

Here's what each parameter does:

1. **index:** Specifies which column(s) will identify each row
2. **columns:** Specifies which column contains the values that will become new column headers
3. **values:** Specifies which column contains the values that will fill the table

In [None]:
print("Pivoted (wide) format:")
print(df_pivoted.head())

### Summary: `melt()` and `pivot()`

This code example combining the `melt()` and `pivot()` functions demonstrates a common data analysis pattern:

1. Start with wide format: Each `person` has one row with multiple `year` columns
2. Melt to long format: Each `person`-`year` combination gets its own row
3. Pivot back to wide format: Restructure back to one row per `person`
4. Perform calculations: Calculate differences across `years`

The key difference between the original wide format and the final wide format is that:

1. We've cleaned up the data _(removed `salary_` prefix from `year` columns)_
2. We can now easily add the calculation for `salary` increase

Why do this?

Sometimes you need:

1. Long format: for plotting time series, statistical analysis, or certain types of aggregations
2. Wide format: for calculating differences between columns or creating summary statistics

This pattern of melting and then pivoting is particularly useful when you need to:

1. Clean or transform your data _(easier in long format)_
2. Then perform calculations across `years` / `categories` _(easier in wide format)_
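The final "perform calculations" step could look like the sketch below. It uses a two-person excerpt of the salary figures from the sample data; the column names `increase` and `increase_pct` are assumptions chosen for illustration.

```python
import pandas as pd

# Long-format salaries for two employees (values taken from the sample data)
long = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Bob"],
    "year": ["2022", "2022", "2023", "2023"],
    "salary": [50000, 60000, 52000, 63000],
})

# Back to wide format: one row per person, one column per year
wide = long.pivot(index="name", columns="year", values="salary")

# Differences across years are now simple column arithmetic
wide["increase"] = wide["2023"] - wide["2022"]
wide["increase_pct"] = (wide["increase"] / wide["2022"] * 100).round(2)

print(wide)
```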

### Tools for tidying data in Pandas: `pivot_table()`

Above, we've used `pivot()` to convert data from "long" into "wide" format. There's also `pivot_table()` for more advanced operations. Differences between `pivot()` and `pivot_table()`:

1. `pivot()`: Simpler function for basic reshaping; doesn't handle duplicate values
2. `pivot_table()`: More powerful function that can:

   - Handle duplicate values through aggregation
   - Pivot multiple value columns
   - Create multi-level indexes and columns
   - Apply different aggregation functions
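The difference in how the two functions treat duplicates can be seen in a small sketch. The `sales` frame is invented for illustration and contains two rows for the same region and product.

```python
import pandas as pd

# "North" / "A" appears twice, so the reshape is ambiguous
sales = pd.DataFrame({
    "region": ["North", "North", "South"],
    "product": ["A", "A", "A"],
    "sales": [100, 150, 200],
})

# pivot() refuses to guess how to combine the duplicates
try:
    sales.pivot(index="region", columns="product", values="sales")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)

# pivot_table() combines them with the aggregation function we choose
summed = sales.pivot_table(
    index="region", columns="product", values="sales", aggfunc="sum"
)
print(summed)
```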

Pivot tables are used to summarise and aggregate data inside `DataFrames`. They're especially useful for data with multiple dimensions. Pivot tables are useful data analysis tools that (re)structure and summarise data by:

1. Reorganising data from a "long" format to a "wide" format _(as you've seen above)_
2. Performing calculations across different dimensions of your data
3. Creating cross-tabulations of your data _("crosstabs")_

In the example below, we are essentially reversing the conversion of "wide" _(which, as we said above, is often considered "messy")_ data into "long" _(typically "tidy")_ data. Sometimes, that's exactly what we need.

#### Example

The `pivot_table()` function is particularly useful for HR analytics and business reporting, which is what we'll be looking at next.

In [None]:
# Example data in long format:
sales_data = pd.DataFrame({
    "date": pd.date_range(start="2024-10-25", periods=12),
    "product": ["A", "B"] * 6,
    "sales": np.random.randint(100, 1000, 12)
})

print("Data in long format:")
print(sales_data)

In [None]:
# Pivot table in wide format:
pivot_table = pd.pivot_table(
    sales_data, 
    values="sales",      # What we're measuring
    index="date",        # Rows
    columns="product",   # Columns
    aggfunc="sum")       # How to aggregate

print("Pivot table in wide format:")
print(pivot_table)

### Summary: `pivot_table()`

#### Key components

1. `values`: The data you want to analyze _(like sales numbers)_
2. `index`: The categories you want as rows _(like dates)_
3. `columns`: The categories you want as columns _(like products)_
4. `aggfunc`: How to combine the data _(`sum`, `mean`, `count`, etc.)_

#### Practical use cases

1. Sales analysis by product and time period
2. Customer behavior analysis across different segments
3. Performance metrics across departments and regions

#### Additions

1. Using multiple index/column levels
2. Applying different aggregate functions

#### Example

In [None]:
# Basic pivot table: Average salary by department and year
print("Average salary by department and year:")
basic_pivot = pd.pivot_table(
    df_melted_salary.merge(df[["name", "department"]], on="name"),
    values="salary",
    index="department",
    columns="year",
    aggfunc="mean"
).round(2)

In [None]:
print(basic_pivot)

In [None]:

# More complex pivot table: Multiple metrics by department
print("Department performance metrics:")
dept_pivot = pd.pivot_table(
    df,
    values=["salary_2022", "performance_Q1", "training_hours", "projects_completed"],
    index="department",
    aggfunc={
        "salary_2022": "mean",
        "performance_Q1": ["mean", "min", "max"],
        "training_hours": "sum",
        "projects_completed": "sum"
    }
).round(2)

In [None]:
print(dept_pivot)

In [None]:
# Advanced pivot table: Performance metrics by city and department
print("City and department performance analysis:")
location_pivot = pd.pivot_table(
    df,
    values=["performance_Q1", "performance_Q2"],
    index=["city"],
    columns=["department"],
    aggfunc="mean",
    fill_value=0
).round(2)

In [None]:
print(location_pivot)

In [None]:
# Pivot table with margins (subtotals)
print("Salary analysis with subtotals:")
salary_pivot = pd.pivot_table(
    df,
    values=["salary_2022", "salary_2023"],
    index="department",
    aggfunc="mean",
    margins=True,
    margins_name="Overall Average"
).round(2)

In [None]:
print(salary_pivot)

---

## 📚 SUPPLEMENTARY CONTENT (Advanced Techniques)

**Estimated time**: 15-25 minutes

The sections below cover more advanced Pandas techniques:
- `stack()` and `unstack()` for hierarchical data
- Window functions and rolling calculations
- Working with MultiIndex
- Time series resampling
- Advanced aggregations

These topics are **not required for the exercises** but provide deeper insights for complex analyses. Focus on `melt()`, `pivot()`, and `pivot_table()` first.

---

### Tools for tidying data in Pandas: `stack()` / `unstack()`

These functions are particularly useful when dealing with multi-dimensional data. While `melt()` and `pivot()` work with single-level column structures, `stack()` and `unstack()` are designed to work with hierarchical indexes and columns _(MultiIndex)_.

#### `stack()`

- Rotates _(or pivots)_ the innermost column level to become the innermost row index
- Moves data from columns to row index
- Can specify which level to stack using the level parameter
- Returns a `Series` if the columns have only a single level, a `DataFrame` otherwise

#### `unstack()`

- Inverse operation of `stack()`
- Rotates _(or pivots)_ the innermost row index level to become the rightmost column level
- Moves data from row index to columns
- Can specify which level to unstack using the level parameter
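A minimal sketch of both operations, using a made-up two-person table of quarterly scores (the names `scores`, `stacked`, `unstacked`, and `by_quarter` are illustrative):

```python
import pandas as pd

# Wide table: one row per person, one column per quarter
scores = pd.DataFrame(
    {"Q1": [4.5, 3.8], "Q2": [4.3, 4.0]},
    index=pd.Index(["Alice", "Bob"], name="name"),
)
scores.columns.name = "quarter"

# stack(): the column labels become the innermost level of the row index.
# With single-level columns, the result is a Series.
stacked = scores.stack()
print(stacked)

# unstack(): the innermost row index level moves back to the columns
unstacked = stacked.unstack()
print(unstacked)

# The level parameter selects a different level to move, here "name"
by_quarter = stacked.unstack(level="name")
print(by_quarter)
```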

#### Main differences from `melt()`/`pivot()`

1. Data structure requirements
   - `melt()`/`pivot()`: Work with single-level columns
   - `stack()`/`unstack()`: Work with hierarchical indexes (MultiIndex)

2. Operation style
   - `melt()`/`pivot()`: Reshape between "wide" and "long" formats
   - `stack()`/`unstack()`: Rotate levels between row index and columns

3. Use cases
   - `melt()`/`pivot()`: Better for single-level reshaping and general data restructuring
   - `stack()`/`unstack()`: Better for working with hierarchical data and multi-dimensional analysis

#### Common use cases

1. Financial analysis
   - Working with time series data with multiple categories
   - Analyzing metrics across different dimensions _(time, region, product)_
   - Creating financial reports with hierarchical structure

2. Multi-dimensional data
   - Survey responses with multiple categories
   - Sales data across regions, products, and time periods
   - Performance metrics with nested categories

3. Data cleaning
   - Reorganizing complex datasets
   - Preparing data for specific analysis requirements
   - Converting between different hierarchical structures

#### Best practices

1. Level management
   - Be explicit about which levels you're stacking/unstacking
   - Use level names or positions consistently
   - Keep track of your index hierarchy

2. Data integrity
   - Check for missing values before and after operations
   - Verify that the resulting structure matches your expectations
   - Decide how duplicate index entries should be aggregated before unstacking _(`unstack()` raises an error on duplicates)_

3. Performance
   - Stack/unstack operations can be memory-intensive
   - Consider working with subsets of large datasets
   - Use appropriate data types to minimize memory usage

## Advanced DataFrame operations

Following from the above _(i.e., reshaping and tidying data)_, let's have a look at more advanced `DataFrame` operations that are commonly used in data analysis. These operations enable complex calculations, create derived metrics, and analyse data across multiple dimensions.

Advanced operations typically combine multiple basic Pandas functions to achieve more sophisticated analyses, including:

- Window functions and rolling calculations for trend analysis
- Advanced data cleaning and feature engineering for derived insights
- Multi-level indexing for hierarchical data organisation
- Time series analysis for temporal patterns
- Complex aggregations for summary statistics

These techniques are particularly useful when working with real-world datasets that require more nuanced analysis than simple grouping or filtering operations.

### Window functions and rolling calculations

Window functions allow you to perform calculations across a set of rows that are related to the current row. Rolling calculations are a type of window function that operate over a sliding window of data.

#### Key concepts

- Group-based window calculations
- Rolling statistics
- Transformation of grouped data

#### Example

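We calculate a rolling mean of salaries within each department. Because the rows are sorted by salary rather than by date, each value is the average of an employee's salary and the next-lower salary in the same department. This smooths the salary ladder within each organisational unit rather than tracking change over time.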

In [None]:
# Calculate rolling mean of salary_2022 by department
df = df.sort_values("salary_2022")
df["salary_rolling_mean"] = df.groupby("department")["salary_2022"].transform(
    lambda x: x.rolling(window=2, min_periods=1).mean()
)

In [None]:
print("DataFrame with rolling mean salary by department:")
print(df[["name", "department", "salary_2022", "salary_rolling_mean"]].head(6))
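To see what `rolling(window=2, min_periods=1)` does inside one group, here is a small sketch using the five Sales salaries from the sample data, already sorted. The standalone `salaries` Series is an illustrative assumption.

```python
import pandas as pd

# 2022 salaries of the Sales department, sorted ascending
salaries = pd.Series([50000, 65000, 72000, 80000, 95000])

# Each value is averaged with the one before it.
# min_periods=1 lets the first row, which has no predecessor, keep its own value.
rolling_mean = salaries.rolling(window=2, min_periods=1).mean()

print(rolling_mean.tolist())
# [50000.0, 57500.0, 68500.0, 76000.0, 87500.0]
```

Without `min_periods=1`, the first value would be `NaN` because a full window of two rows is not yet available.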

### Advanced data cleaning

This section demonstrates some advanced data cleaning and feature engineering techniques, including combining columns into derived metrics and binning continuous data into categories.

#### Example

We create derived metrics that provide additional insights:

- Combined performance score from quarterly ratings
- Performance categories using binning
- Salary increase percentages

These transformations help in creating meaningful aggregations and visualisations.

In [None]:
# Create combined performance score and categorize it
df["avg_performance"] = df[["performance_Q1", "performance_Q2"]].mean(axis=1)
df["performance_category"] = pd.cut(
    df["avg_performance"],
    bins=[0, 3.5, 4.0, 4.5, 5.0],
    labels=[
        "Needs Improvement",
        "Meets Expectations",
        "Exceeds Expectations",
        "Outstanding"]
)

In [None]:
df["avg_performance"].head(6)

In [None]:
# Calculate salary increase percentage
df["salary_increase_pct"] = (
    (df["salary_2023"] - df["salary_2022"]) / df["salary_2022"] * 100
).round(2)

In [None]:
df["salary_increase_pct"].head(6)

In [None]:
print("Advanced data cleaning results:")
print(df[["name", "avg_performance", "performance_category", "salary_increase_pct"]].head(6))

### Working with MultiIndex

This section shows how to build a MultiIndex `DataFrame` and select cross-sections from it.

The multi-level indexing allows us to:

- Organise data hierarchically _(`department` → `city` → `employee`)_
- Perform cross-sectional analysis
- Access data at different levels of granularity

#### Example

In [None]:
# Create a meaningful multi-level index
df_multi = df.set_index(["department", "city", "name"])

In [None]:
print("Multi-level indexed DataFrame:")
print(df_multi.head(6))

In [None]:
# Demonstrate cross-sectional selection
print("Cross-section for IT department:")
print(df_multi.xs("IT", level="department")[["salary_2022", "salary_2023", "avg_performance"]])
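Beyond `xs()`, a sorted MultiIndex can also be queried with `.loc`. The sketch below uses a four-row excerpt of the sample data; the names `staff` and `staff_multi` are illustrative assumptions.

```python
import pandas as pd

staff = pd.DataFrame({
    "department": ["IT", "Sales", "IT", "HR"],
    "city": ["London", "New York", "Sydney", "Paris"],
    "name": ["Bob", "Alice", "Eve", "Carol"],
    "salary_2022": [60000, 50000, 90000, 70000],
})

# Sorting the index keeps lookups predictable and avoids performance warnings
staff_multi = staff.set_index(["department", "city", "name"]).sort_index()

# Select by the outermost level: all IT rows
it_rows = staff_multi.loc["IT"]
print(it_rows)

# Select a single cell by giving the full key as a tuple
bob = staff_multi.loc[("IT", "London", "Bob"), "salary_2022"]
print("Bob's 2022 salary:", bob)

# xs() selects on an inner level without naming the outer ones
london = staff_multi.xs("London", level="city")
print(london)
```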

### Time series resampling

Resampling allows you to change the frequency of time series data. This is useful for aggregating high-frequency data to a lower frequency, or for converting between different time frequencies.
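As a minimal sketch of resampling itself, the code below aggregates two weeks of made-up daily sales figures to weekly totals. The `daily` frame is purely illustrative and not part of the sample data.

```python
import pandas as pd

# Fourteen days of hypothetical daily sales, starting on Monday 2024-01-01
daily = pd.DataFrame(
    {"sales": range(1, 15)},
    index=pd.date_range(start="2024-01-01", periods=14, freq="D"),
)

# Downsample from daily to weekly frequency ("W" = weeks ending on Sunday)
weekly_total = daily["sales"].resample("W").sum()

print(weekly_total)
```

`resample()` requires a datetime-like index (or a datetime column passed via `on=`), which is why `daily` is indexed by date.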

#### Example

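The sample data has no datetime index, so instead of calling `resample()` we reshape the quarterly performance data with `melt()` and group by the quarter label. This lets us: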

- Track changes over time
- Compare departments
- Identify trends in employee performance

In [None]:
# Create a time-based analysis of performance scores
performance_data = df.melt(
    id_vars=["name", "department"],
    value_vars=["performance_Q1", "performance_Q2"],
    var_name="quarter",
    value_name="performance"
)

In [None]:
# Clean up quarter names
performance_data["quarter"] = performance_data["quarter"].str.replace("performance_", "")

In [None]:
# Calculate department averages over time
dept_performance = performance_data.groupby(
    ["department", "quarter"])["performance"].mean().round(2)

In [None]:
print("Department performance over time:")
print(dept_performance)

### Advanced aggregation

Aggregations are operations that summarise multiple rows of data into a single result. You can think of them as calculating metrics like averages, sums, or counts, but with the ability to perform multiple calculations simultaneously across different groups of data.

#### Example

This example shows how to:

- Perform multiple aggregations simultaneously
- Create summary statistics by department
- Combine different metrics into a comprehensive view

In [None]:
# Complex aggregation by department
dept_summary = df.groupby("department").agg({
    "salary_2022": ["mean", "min", "max"],
    "avg_performance": "mean",
    "training_hours": "sum",
    "projects_completed": "sum"
}).round(2)

In [None]:
print("Department summary with multiple aggregations:")
print(dept_summary)
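The dictionary form above produces two-level column labels such as `("salary_2022", "mean")`. If flat column names are preferred, named aggregation is one alternative. The sketch below uses a six-row excerpt of the sample data; the output column names are assumptions chosen for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    "department": ["Sales", "IT", "HR", "Sales", "IT", "HR"],
    "salary_2022": [50000, 60000, 70000, 80000, 90000, 55000],
    "training_hours": [20, 35, 25, 30, 40, 22],
})

# Named aggregation: new_column_name=(source_column, aggregation_function)
summary = staff.groupby("department").agg(
    mean_salary=("salary_2022", "mean"),
    max_salary=("salary_2022", "max"),
    total_training=("training_hours", "sum"),
)

print(summary)
```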

### Advanced visualisation

The last section shows how to plot directly from a `DataFrame` using Pandas and [Matplotlib](https://matplotlib.org/). Here, we create a simple scatter plot of average performance against 2023 salary.

In [None]:
df.plot(x="avg_performance", y="salary_2023", kind="scatter")
plt.title("Performance vs Salary")
plt.show()