# Introduction to Pandas (Introduction)

_This notebook introduces Week 4's learning objectives and key concepts, building on the Python skills from Weeks 1 and 2._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Opus 4.1)*, including updated documentation and git commit messages.

## Week 4 overview

Welcome to Week 4! This week marks a significant milestone as we move from pure Python programming to data analysis with [Pandas](https://pandas.pydata.org/). Pandas is the cornerstone library for data manipulation in Python, and mastering it is essential for your assessment tasks.

## Learning objectives

By the end of this week, you will be able to:

1. **Understand Pandas data structures** - work with Series and DataFrames confidently
2. **Load and save data** - import data from, and export data to, CSV, Excel, and JSON files
3. **Manipulate DataFrames** - filter, sort, and transform data effectively
4. **Handle missing data** - identify and address data quality issues
5. **Perform aggregations** - use groupby operations to summarise data
6. **Merge datasets** - combine data from multiple sources using joins

## Prerequisites

Before starting this week's materials, ensure you're comfortable with:
- Python data types (strings, numbers, lists, dictionaries)
- Writing functions
- File handling basics (from Week 1)
- List comprehensions (from Week 2) (optional, but recommended!)

## Additional learning resources

### Corey Schafer's Python Pandas Tutorial Series

For additional support and different perspectives on the concepts we're covering, I can't overstate how excellent Corey Schafer's YouTube series is:

1. [Part 1: Installation and Loading Data](https://www.youtube.com/watch?v=ZyhVh-qRZPA) - Getting started with Pandas
2. [Part 2: Series](https://www.youtube.com/watch?v=zmdjNSmRXF4) - Understanding Series objects
3. [Part 3: DataFrames](https://www.youtube.com/watch?v=zmdjNSmRXF4) - Working with DataFrames
4. [Part 4: Filtering - Using Conditionals](https://www.youtube.com/watch?v=Lw2rlcxScZY) - Boolean indexing and filtering
5. [Part 5: Updating Rows and Columns](https://www.youtube.com/watch?v=DCDe29sIKcE) - Modifying DataFrame data
6. [Part 6: Add/Remove Rows and Columns](https://www.youtube.com/watch?v=HQ6XO9eT-fc) - DataFrame structure manipulation
7. [Part 7: Sorting Data](https://www.youtube.com/watch?v=T11QYVfZoD0) - Organising your data
8. [Part 8: Grouping and Aggregating](https://www.youtube.com/watch?v=txMdrV1Ut64) - Analysing by groups
9. [Part 9: Cleaning Data](https://www.youtube.com/watch?v=KdmPHEnPJPs) - Handling missing values and duplicates
10. [Part 10: Working with Dates and Time Series](https://www.youtube.com/watch?v=UFuo7EHI8zc) - Date/time functionality
11. [Part 11: Reading/Writing Data](https://www.youtube.com/watch?v=N6hyN6BW6ao) - File I/O operations

### Official documentation

- [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html) - Comprehensive official documentation
- [10 Minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html) - Quick overview of key functionality
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) - Quick reference guide

## Why Pandas?

### The challenge with pure Python

Imagine analysing sales data for 10,000 transactions using only Python lists and dictionaries. You'd need to write an explicit loop for every operation, even one as simple as a total:

In [1]:
# First, create some demo data

sales_data = [
    {"date": "2023-01-01",
     "product": "Laptop",
     "amount": 1200},
    {"date": "2023-01-02",
     "product": "Mouse",
     "amount": 25},
    # ... thousands more records
]

In [2]:
# Print result
sales_data

[{'date': '2023-01-01', 'product': 'Laptop', 'amount': 1200},
 {'date': '2023-01-02', 'product': 'Mouse', 'amount': 25}]

In [3]:
# Calculate total sales
# Without Pandas, this requires a manual loop

total = 0
for sale in sales_data:
    total += sale["amount"]
print(f"Total sales: £{total}")

Total sales: £1225


### The Pandas advantage

With Pandas, the same calculation becomes a single method call on a column:

In [5]:
import pandas as pd

df = pd.DataFrame(sales_data)
total = df["amount"].sum()
print(f"Total sales: £{total}")

Total sales: £1225
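
As a small sketch of the same idea, the snippet below reuses the two demo records from above and asks a few more questions of the `amount` column without writing a loop. The variable names `average_sale`, `largest_sale`, and `number_of_sales` are illustrative only.

```python
import pandas as pd

# The same two demo records as above; a real dataset would have thousands of rows
sales_data = [
    {"date": "2023-01-01", "product": "Laptop", "amount": 1200},
    {"date": "2023-01-02", "product": "Mouse", "amount": 25},
]
df = pd.DataFrame(sales_data)

# Each summary is one method call on the "amount" column
average_sale = df["amount"].mean()
largest_sale = df["amount"].max()
number_of_sales = len(df)

print(f"Average sale: £{average_sale:.2f}")
print(f"Largest sale: £{largest_sale}")
print(f"Number of sales: {number_of_sales}")
```

With pure Python, each of these would need its own loop or a call such as `max()` with a key function.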


## This week's structure

### Introduction (this notebook)
- Overview of Pandas capabilities
- Understanding when and why to use Pandas
- Preview of key concepts

### Demonstration
- Comprehensive walkthrough of Pandas features
- Series and DataFrame creation
- Data loading and saving
- Essential operations and transformations
- Grouping and aggregation
- Merging and joining data

### Exercises (90 minutes)
- Progressive exercises building Pandas skills
- Real-world data scenarios
- Focus on practical applications

### Solutions
- Complete solutions with explanations
- Alternative approaches discussed
- Best practices highlighted

## Key concepts preview

### 1. Series - One-dimensional data

A Pandas `Series` is a one-dimensional labeled array that can hold any data type (integers, strings, floating point numbers, Python objects, etc.). Think of it as a **cross between a Python list and a dictionary** - it has the ordered nature of a list but with labels (called an index) for each element. `Series` are the building blocks of `DataFrame`s; each column in a `DataFrame` is essentially a `Series`.

**Learn more**: [Corey Schafer - Python Pandas Tutorial (Part 2): Series](https://www.youtube.com/watch?v=zmdjNSmRXF4)

In [7]:
import pandas as pd

prices = pd.Series(
    [10.99, 25.50, 5.00, 15.75],
    index=["Book", "Shirt", "Coffee", "Lunch"])

print(prices)

Book      10.99
Shirt     25.50
Coffee     5.00
Lunch     15.75
dtype: float64


In [8]:
# Round to two decimal places so floating-point noise never shows in the output
print(f"\nAverage price: £{prices.mean():.2f}")


Average price: £14.31
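
To make the "cross between a list and a dictionary" idea concrete, here is a minimal sketch using the same `prices` Series. It assumes nothing beyond what is defined above; the variable names on the left are illustrative.

```python
import pandas as pd

prices = pd.Series(
    [10.99, 25.50, 5.00, 15.75],
    index=["Book", "Shirt", "Coffee", "Lunch"])

# Dictionary-like: look up a value by its label
coffee_price = prices["Coffee"]
coffee_price_loc = prices.loc["Coffee"]   # the explicit, label-based form

# List-like: look up a value by its position (counting from 0)
first_price = prices.iloc[0]

# Dictionary-like: test whether a label exists
has_lunch = "Lunch" in prices

# List-like: the order of the labels is preserved
labels = list(prices.index)

print(coffee_price, first_price, has_lunch, labels)
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup clearly separated.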


### 2. DataFrame - Two-dimensional data

A `DataFrame` is Pandas' **primary data structure**: a two-dimensional labeled table whose columns can each hold a different data type. You can think of it as a spreadsheet, or as a dictionary of `Series` objects that share one index. Each column is a `Series`, and operations can be performed on entire columns, rows, or individual cells.

**Learn more**: [Corey Schafer - Python Pandas Tutorial (Part 3): DataFrames](https://www.youtube.com/watch?v=zmdjNSmRXF4)

In [9]:
# A DataFrame is like a spreadsheet
data = {
    "Product": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "Price": [800, 25, 50, 200],
    "Stock": [10, 50, 30, 15]
}

In [10]:
inventory = pd.DataFrame(data)
print(inventory)

    Product  Price  Stock
0    Laptop    800     10
1     Mouse     25     50
2  Keyboard     50     30
3   Monitor    200     15


In [11]:
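# Compute the total first: reusing double quotes inside an f-string
# is a syntax error in Python versions before 3.12
total_value = (inventory["Price"] * inventory["Stock"]).sum()
print(f"\nTotal inventory value: £{total_value}")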


Total inventory value: £13750
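
The statement that each column is a `Series` can be checked directly. The sketch below rebuilds the `inventory` table and adds a column named `Value`, which is a hypothetical name chosen for this illustration.

```python
import pandas as pd

inventory = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "Price": [800, 25, 50, 200],
    "Stock": [10, 50, 30, 15],
})

# Selecting one column returns a Series
price_column = inventory["Price"]
print(type(price_column).__name__)   # Series

# Column arithmetic works row by row and produces a new Series,
# which can be stored as a new column ("Value" is a hypothetical name)
inventory["Value"] = inventory["Price"] * inventory["Stock"]

# A single cell is addressed by row label and column name
mouse_stock = inventory.loc[1, "Stock"]

print(inventory)
```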


### 3. Data filtering and selection

Filtering and selection are fundamental operations in data analysis that allow you to focus on specific subsets of your data. In Pandas, you can select data using column names, row indices, or boolean conditions. Boolean indexing (shown below) is particularly powerful - you create a condition that returns `True` or `False` for each row, and Pandas returns only the rows where the condition is `True`.

This allows you to answer questions like *"Which products cost more than £50?"* or *"Which customers made purchases last month?"*

**Learn more**: [Corey Schafer - Python Pandas Tutorial (Part 4): Filtering](https://www.youtube.com/watch?v=Lw2rlcxScZY)

In [15]:
# Find expensive items (price > £50)
expensive_items = inventory[inventory["Price"] > 50]
print("Expensive items:")
print(expensive_items)

Expensive items:
   Product  Price  Stock
0   Laptop    800     10
3  Monitor    200     15
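
It can help to look at the condition on its own before using it as a filter. The following sketch, using the same `inventory` data, prints the boolean mask and then combines two conditions; `mask` and `expensive_and_scarce` are illustrative names.

```python
import pandas as pd

inventory = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "Price": [800, 25, 50, 200],
    "Stock": [10, 50, 30, 15],
})

# The condition alone is a Series of True/False values, one per row
mask = inventory["Price"] > 50
print(mask.tolist())   # [True, False, False, True]

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
expensive_and_scarce = inventory[
    (inventory["Price"] > 50) & (inventory["Stock"] < 15)
]
print(expensive_and_scarce["Product"].tolist())   # ['Laptop']
```

Note that Python's `and` and `or` keywords do not work here, because the comparison is made element by element.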


### 4. Grouping and aggregation

`GroupBy` operations follow a "split-apply-combine" pattern: split your data into groups based on some criteria, apply a function to each group independently, and combine the results back into a data structure. This is essential for answering questions like *"What's the average sale per region?"* or *"How many products were sold by category?"* The `groupby()` method is one of the most powerful features in Pandas, enabling you to perform complex aggregations with simple, readable code.

**Learn more**: [Corey Schafer - Python Pandas Tutorial (Part 8): Grouping and Aggregating](https://www.youtube.com/watch?v=txMdrV1Ut64)

In [18]:
# Example: Sales by category
sales = pd.DataFrame({
    "Category": ["Electronics", "Electronics", "Office", "Office", "Office"],
    "Product": ["Laptop", "Mouse", "Desk", "Chair", "Lamp"],
    "Revenue": [2400, 150, 500, 800, 120]
})

In [19]:
category_totals = sales.groupby("Category")["Revenue"].sum()

print("Revenue by category:")
print(category_totals)

Revenue by category:
Category
Electronics    2550
Office         1420
Name: Revenue, dtype: int64
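
To see the three steps of split-apply-combine separately, the sketch below performs them by hand on the same `sales` data and then lets `groupby()` do the work. The names `manual_totals`, `manual_result`, and `summary` are illustrative, and the manual loop is only for explanation; in practice you would use the one-line version.

```python
import pandas as pd

sales = pd.DataFrame({
    "Category": ["Electronics", "Electronics", "Office", "Office", "Office"],
    "Product": ["Laptop", "Mouse", "Desk", "Chair", "Lamp"],
    "Revenue": [2400, 150, 500, 800, 120],
})

# The three steps written out by hand
manual_totals = {}
for category, group in sales.groupby("Category"):      # split
    manual_totals[category] = group["Revenue"].sum()   # apply
manual_result = pd.Series(manual_totals, name="Revenue")  # combine

# The same pattern can apply several functions at once
summary = sales.groupby("Category")["Revenue"].agg(["sum", "mean", "count"])

print(manual_result)
print(summary)
```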


## Real-world applications

### What you'll be able to do

After this week, you'll be able to extend your group project with the following:

1. **Load and clean the Olympics dataset** for your group project
2. **Calculate statistics** like average medals per country
3. **Filter data** to focus on specific sports or years
4. **Merge datasets** to combine athlete and event information
5. **Handle missing values** in real-world data
6. **Create summary tables** for your analysis
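
As a rough preview of what some of these tasks might look like, here is a sketch on a tiny, made-up table. The column names (`Athlete`, `Country`, `Age`, `Medals`) and all values are hypothetical; the real Olympics dataset will differ.

```python
import pandas as pd

# Hypothetical miniature table, not the real Olympics data
athletes = pd.DataFrame({
    "Athlete": ["A", "B", "C", "D"],
    "Country": ["GBR", "GBR", "USA", "USA"],
    "Age": [24, None, 31, 27],
    "Medals": [2, 0, 1, 3],
})

# Handle missing values: count them, then fill with the median age
missing_ages = athletes["Age"].isna().sum()
athletes["Age"] = athletes["Age"].fillna(athletes["Age"].median())

# Filter data: athletes with at least one medal
medal_winners = athletes[athletes["Medals"] > 0]

# Create a summary table: total and average medals per country
medals_by_country = athletes.groupby("Country")["Medals"].agg(["sum", "mean"])

print(f"Missing ages found: {missing_ages}")
print(medals_by_country)
```

Filling with the median is only one possible choice; whether to fill or drop missing values depends on the analysis.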

## Common challenges

Based on experience, students often find these aspects challenging:

1. **Index confusion** - Understanding the difference between label-based and position-based indexing
2. **Selection syntax** - Knowing when to use `loc`, `iloc`, or bracket notation, and why chained indexing causes problems
3. **GroupBy logic** - Understanding split-apply-combine operations
4. **Merge types** - Choosing between inner, outer, left, and right joins

**We'll address each of these thoroughly in the Demonstration notebook.**
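
As a brief preview of two of these challenges, the sketch below contrasts label-based and position-based indexing, then compares join types on two small tables. All data and names here (`stock`, `products`, `orders`) are hypothetical and chosen so that the differences are visible.

```python
import pandas as pd

# A Series whose integer labels are deliberately out of order,
# so that label and position give different answers
stock = pd.Series([10, 50, 30], index=[2, 0, 1])

by_label = stock.loc[0]      # the element labelled 0
by_position = stock.iloc[0]  # the first element

print(by_label, by_position)   # 50 10

# Two small tables that only partly overlap on "Product"
products = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Desk"],
    "Price": [800, 25, 150],
})
orders = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Chair"],
    "Quantity": [1, 2, 4],
})

# inner: only products present in both tables
inner = products.merge(orders, on="Product", how="inner")
# left: every row of products; missing order data becomes NaN
left = products.merge(orders, on="Product", how="left")
# outer: every product from either table
outer = products.merge(orders, on="Product", how="outer")

print(len(inner), len(left), len(outer))   # 2 3 4
```

A `how="right"` join mirrors the left join, keeping every row of `orders` instead.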

## Connection to assessment

### Group project (Week 3 assessment)

Pandas is essential for:
- Loading the Olympics CSV file
- Cleaning missing values
- Creating calculated columns (e.g., Age_Group)
- Aggregating medal counts by country
- Filtering specific Olympic years or sports

### Individual report

You'll use Pandas to:
- Import your chosen dataset
- Perform data quality checks
- Transform data for analysis
- Generate summary statistics
- Prepare data for visualisation

## Recommended approach

1. **Review this Introduction** (10 minutes)
   - Understand the week's objectives
   - Run the preview examples

2. **Work through the Demonstration** (90 minutes)
   - Run every code cell
   - Experiment with modifications
   - Take notes on new concepts

3. **Complete Exercises** (90 minutes)
   - Start with Series exercises
   - Progress to DataFrame operations
   - Don't skip the file I/O exercises

4. **Review Solutions** (30 minutes)
   - Compare your approaches
   - Note alternative methods
   - Identify areas for practice

## Tips for success

### Do:
- ✅ Experiment with small datasets first
- ✅ Use `.head()` to preview data frequently
- ✅ Check data types with `.dtypes`
- ✅ Read error messages carefully
- ✅ Keep the Pandas documentation handy

### Don't:
- ❌ Modify original DataFrames without keeping a copy
- ❌ Ignore warnings about chained assignment
- ❌ Assume data is clean without checking
- ❌ Use loops when Pandas has built-in methods

*(In this case, `don't` really means `don't`.)*
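
Several of these tips can be illustrated together. The following is a minimal sketch on the `inventory` data from earlier; `working` and `Value` are hypothetical names, and setting the stock to zero is an arbitrary change made only to show the assignment syntax.

```python
import pandas as pd

inventory = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "Price": [800, 25, 50, 200],
    "Stock": [10, 50, 30, 15],
})

# Keep the original intact by working on an explicit copy
working = inventory.copy()

# Use a built-in column operation instead of a loop over rows
working["Value"] = working["Price"] * working["Stock"]

# Assign to a filtered subset with one .loc call, rather than the chained
# form working[working["Price"] > 50]["Stock"] = 0, which triggers the
# chained-assignment warning and may not change `working` at all
working.loc[working["Price"] > 50, "Stock"] = 0

# Preview the data and check the column types
print(working.head())
print(working.dtypes)
```

Because of the `.copy()`, `inventory` still holds the original stock figures after these changes.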

## Quick reference

Essential Pandas operations you'll use frequently (we'll cover these in the Demonstration):

In [None]:
import pandas as pd

# Loading data
df = pd.read_csv("assets/data/people_with_salary.csv")

# Viewing data
df.head()        # First 5 rows
df.info()        # Data types and missing values
df.describe()    # Statistical summary

# Selecting data
value = 30000
df["Name"]                # Single column
df[["Name", "Age"]]       # Multiple columns
df[df["Salary"] > value]  # Filter rows

# Modifying data
values = list(range(len(df)))  # A new column needs one value per row
df["new_column"] = values      # Add column
df.drop("Name", axis=1)   # Returns a copy without the column; df is unchanged
df.dropna()               # Returns a copy without rows that have missing values
df.fillna(value)          # Returns a copy with missing values replaced

# Aggregating
df.groupby("City").mean(numeric_only=True)  # Group by a category and aggregate
# df.groupby("City")[["Salary"]].mean()  # Alternatively, for "Salary" only

# Create pivot table
df.pivot_table(
    values="Salary",
    index="City",
    columns="Country",
    aggfunc="mean")

# Showing data
print(df)

# Saving data
df.to_csv("assets/example_data/example_output.csv", index=False)
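
The cell above depends on a CSV file being present. If you want to try loading and saving without any files, the sketch below writes a small, made-up table to an in-memory text buffer and reads it back. The names and values are hypothetical, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical data using some of the column names from the quick reference
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [34, 28, 45],
    "Salary": [42000, 28000, 51000],
})

# Save to an in-memory buffer instead of a file path
buffer = io.StringIO()
people.to_csv(buffer, index=False)

# Rewind the buffer and load it back, as read_csv would from disk
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded data supports the same operations
high_earners = reloaded[reloaded["Salary"] > 30000]

print(reloaded)
print(high_earners["Name"].tolist())   # ['Ana', 'Cy']
```

Passing `index=False` prevents the row index from being written as an extra, unnamed column.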

## Ready to begin?

If you're comfortable with:
- Python basics from Week 1
- Advanced Python from Week 2
- The assessment requirements from Week 3

Then you're ready to dive into Pandas!

Remember:
- **Pandas is a tool** - focus on what it can do, not memorising syntax
- **Practice is essential** - work through examples actively
- **Errors are normal** - they help you learn
- **Documentation is your friend** - use it frequently

Proceed to the Demonstration notebook when ready. Good luck!