# Introduction to Pandas (Demonstration)

_This notebook introduces Pandas and is designed for beginners who have never used Pandas before._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Opus 4.1)*, including updated documentation and git commit messages.

Overview:

1. What is Pandas and why it's useful
2. Creating and manipulating Pandas `Series`
3. Working with Pandas `DataFrames`
4. Reading and writing data from various sources
5. Basic data manipulation and analysis techniques
6. Data visualisation with Pandas

## Why Pandas is useful

[Pandas](https://pandas.pydata.org/) is an open-source Python library that provides easy-to-use data structures and data analysis tools. It is built on top of [NumPy](https://numpy.org/) and is an essential tool for data science, data analysis, and machine learning. This notebook covers neither NumPy itself nor machine learning.

Key features of Pandas:

- Fast and efficient `DataFrame` object for data manipulation
- Ability to handle large datasets
- Tools for reading and writing data between in-memory data structures and different file formats
- Intelligent data alignment and integrated handling of missing data
- Powerful `group by` functionality for performing split-apply-combine operations on datasets
- Easy data merging and joining
- Time series functionality

## Resources

1. McKinney, Wes (2013). *Python for Data Analysis.* Sebastopol, CA: O'Reilly. Available at: https://wesmckinney.com/book/.
2. Schaefer, Corey (2020). *Python Pandas Tutorial.* YouTube. Available at: https://youtu.be/ZyhVh-qRZPA.

## Setup

In [None]:
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

---

## 🎯 CORE CONTENT (Essential for Exercises)

**Estimated time**: 50-60 minutes

The sections below cover essential Pandas techniques you'll need for the exercises:
- Creating and working with Series and DataFrames
- Reading and writing CSV files
- Basic data manipulation (filtering, sorting, grouping)
- Handling missing data
- Merging DataFrames

Work through these sections carefully and experiment with the code.

---

## Creating and manipulating Pandas Series

A Pandas Series is a one-dimensional labeled array that can hold data of any type (integers, floats, strings, etc.). It's similar to a column in a spreadsheet or a single column of a DataFrame.

Create a Series from a list:

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

Or, equivalently, from a list stored in a variable:

In [None]:
my_list = [1, 3, 5, np.nan, 6, 8]
s = pd.Series(my_list)
print(s)

You can also create a Series with custom index labels:

In [None]:
s = pd.Series([1, 3, 5, 7], index=["a", "b", "c", "d"])
print(s)

You can access elements of a Series using index labels or integer location:

In [None]:
print(s["b"])  # Access by label
print(s.iloc[0])    # Access by integer position (s[0] is deprecated for a labelled index)
print(s.iloc[1:2])  # Slice by position (the end position is excluded)
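
The difference between label-based and position-based access matters most when the index labels are themselves integers. The following is a minimal sketch using a hypothetical Series `s_int` (not part of the original notebook) whose labels are deliberately out of order; `.loc` looks up labels and `.iloc` looks up positions:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match their positions
s_int = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s_int.loc[0]      # the element whose label is 0
by_position = s_int.iloc[0]  # the first element, whatever its label is

print(by_label)     # 20
print(by_position)  # 10
```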

Series support various operations and methods. Here are a few examples:

In [None]:
print(s * 2)  # Multiply all elements by 2
print(s.mean())  # Calculate mean
print(s.max())   # Find maximum value
print(s.idxmax())  # Find index of maximum value
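
Operations between two Series are aligned on index labels rather than on position. As a small sketch (the Series `a` and `b` are made up for illustration), adding two Series with partly different labels gives `NaN` wherever a label is missing from one side, and `mean()` skips those missing values by default:

```python
import pandas as pd

# Two hypothetical Series that share only the labels "b" and "c"
a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = a + b                 # aligned by label, not by position
n_missing = total.isna().sum()
mean_value = total.mean()     # NaN values are skipped by default

print(total)
print(n_missing)   # 2
print(mean_value)  # 17.5
```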

## Working with Pandas DataFrames

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table. It's the most commonly used Pandas object.

### Create a DataFrame

Creating a DataFrame from a dictionary:

In [None]:
data = {
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [28, 34, 29, 32],
    "City": ["New York", "Paris", "Berlin", "London"]
}

df = pd.DataFrame(data)
print(df)

You can view basic information about your DataFrame using various methods:

In [None]:
print(df.head())      # View the first 5 rows
df.info()             # Summary of the DataFrame (info() prints directly and returns None)
print(df.describe())  # View a statistical summary of the numerical columns

### Accessing data in a DataFrame

In [None]:
print(df["Name"])  # Access a column
print(df.loc[1])   # Access a row by label
print(df.iloc[1])  # Access a row by integer location
print(df.loc[1, "Name"])  # Access a specific value
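
The same accessors also select several columns or rows at once. The sketch below rebuilds the example data under the hypothetical name `people` so that it runs on its own. Note that a label slice with `.loc` includes its end label, whereas a position slice with `.iloc` excludes its end position:

```python
import pandas as pd

# Hypothetical copy of the example data so the block runs on its own
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [28, 34, 29, 32],
    "City": ["New York", "Paris", "Berlin", "London"],
})

name_city = people[["Name", "City"]]         # select two columns
by_label = people.loc[1:2, ["Name", "Age"]]  # rows labelled 1 and 2 (end included)
by_position = people.iloc[1:2]               # row at position 1 only (end excluded)

print(name_city)
print(by_label)
print(by_position)
```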

### Manipulating a DataFrame

#### Adding a new column to the DataFrame

In [None]:
df["Country"] = ["USA", "France", "Germany", "UK"]
print(df)

#### Filtering data in a DataFrame

Filter rows where `Age` is greater than `30`:

In [None]:
print(df[df["Age"] > 30])

Filter rows where `City` is `Paris` or `London`:

In [None]:
print(df[df["City"].isin(["Paris", "London"])])
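
Conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch, using a hypothetical copy of the example data named `people`:

```python
import pandas as pd

# Hypothetical copy of the example data so the block runs on its own
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [28, 34, 29, 32],
    "City": ["New York", "Paris", "Berlin", "London"],
})

# Older than 28 AND not living in Paris
both = people[(people["Age"] > 28) & (people["City"] != "Paris")]

# Younger than 29 OR living in London
either = people[(people["Age"] < 29) | (people["City"] == "London")]

print(both)
print(either)
```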

#### Sorting a DataFrame

Sort by `Age` in descending order:

In [None]:
print(df.sort_values("Age", ascending=False))
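
`sort_values` also accepts a list of columns, with one sort direction per column. The sketch below uses a made-up DataFrame `staff` in which cities repeat, so that the second sort key has a visible effect:

```python
import pandas as pd

# Hypothetical data with repeated cities
staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "City": ["Paris", "London", "Paris", "London"],
    "Age": [28, 34, 29, 32],
})

# Sort by City (A to Z), then by Age (oldest first) within each city
ordered = staff.sort_values(["City", "Age"], ascending=[True, False])

print(ordered)
```

The original row labels are kept after sorting; `ordered.reset_index(drop=True)` renumbers them from 0 if needed.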

#### Handling duplicate data

Check for duplicate rows:

In [None]:
print(df.duplicated().sum())

Remove duplicate rows:

In [None]:
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
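
The example DataFrame above contains no duplicates, so both calls leave it unchanged. To see the methods in action, here is a sketch with a made-up DataFrame `visits` that repeats one row; the `subset` and `keep` parameters control which columns are compared and which occurrence is kept:

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly
visits = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "City": ["Paris", "London", "Paris", "Berlin"],
})

n_duplicates = visits.duplicated().sum()   # rows identical to an earlier row
deduplicated = visits.drop_duplicates()    # compare all columns
one_per_name = visits.drop_duplicates(subset="Name", keep="first")  # compare Name only

print(n_duplicates)  # 1
print(deduplicated)
print(one_per_name)
```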

### Functions and conditions

#### Applying functions to DataFrame columns

Apply a function to a column (here: `len`):

In [None]:
df["Name_Length"] = df["Name"].apply(len)
print(df)

#### Creating new columns based on conditions

In [None]:
df["Age_Group"] = pd.cut(df["Age"], bins=[0, 30, 40, 100], labels=["Young", "Middle", "Senior"])
print(df)
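
By default, `pd.cut` treats each bin as open on the left and closed on the right, so an age of exactly 30 falls into the first bin `(0, 30]`. The sketch below checks the edge cases with a hypothetical Series `ages`:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([28, 30, 31, 40, 41])

groups = pd.cut(
    ages,
    bins=[0, 30, 40, 100],
    labels=["Young", "Middle", "Senior"],
)

print(groups.tolist())  # ['Young', 'Young', 'Middle', 'Middle', 'Senior']
```

Passing `right=False` would close the bins on the left instead, which moves 30 into the `Middle` group.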

Getting help for `pd.cut`:

In [None]:
# ?pd.cut

### Basic statistical operations

In [None]:
print(df["Age"].describe())

## Reading and writing data

Pandas provides powerful tools for reading and writing data in various formats.

Below are some of the most common methods for reading and writing data with Pandas (except SQL, which we won't cover). Have a look at the official [Pandas](https://pandas.pydata.org/) documentation for the most up-to-date information on data input and output operations.

### Generating example files

Creating `people.csv`.

In [None]:
import os
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "San Francisco", "London", "Sydney"],
    "Country": ["USA", "USA", "UK", "Australia"]
}

df = pd.DataFrame(data)

output_dir = "assets/example_data"
os.makedirs(output_dir, exist_ok=True)

df.to_csv(f"{output_dir}/people.csv", index=False)

print(f"{output_dir}/people.csv has been created.")

Creating `data.csv`.

In [None]:
import os
import pandas as pd
import numpy as np

np.random.seed(42)  # for reproducibility

dates = pd.date_range(start="2023-01-01", periods=100, freq="D")

df = pd.DataFrame({
    "date": dates,
    "value1": np.random.randn(100),
    "value2": np.random.randn(100),
    "category": np.random.choice(["A", "B", "C"], 100)
})

output_dir = "assets/example_data"
os.makedirs(output_dir, exist_ok=True)

df.to_csv(f"{output_dir}/data.csv", index=False)

print(f"{output_dir}/data.csv has been created.")

Creating `data.json`.

In [None]:
import os
import pandas as pd
import numpy as np

np.random.seed(42)  # for reproducibility

data = {
    "name": ["Alice", "Bob", "Carol", "Dan"],
    "age": np.random.randint(20, 60, 4),
    "city": ["New York", "London", "Paris", "Tokyo"],
    "salary": np.random.randint(30000, 100000, 4)
}

df = pd.DataFrame(data)

output_dir = "assets/example_data"
os.makedirs(output_dir, exist_ok=True)

df.to_json(f"{output_dir}/data.json", orient="records")

print(f"{output_dir}/data.json has been created.")

Creating `data.xlsx`. Writing `.xlsx` files requires an Excel engine such as `openpyxl` to be installed.

In [None]:
import os
import pandas as pd
import numpy as np

np.random.seed(42)  # for reproducibility

# Create sample data
data = {
    "Date": pd.date_range(start="2023-01-01", periods=100),
    "Sales": np.random.randint(100, 1000, 100),
    "Expenses": np.random.randint(50, 500, 100),
    "Category": np.random.choice(["A", "B", "C"], 100)
}

df = pd.DataFrame(data)

output_dir = "assets/example_data"
os.makedirs(output_dir, exist_ok=True)

# Create Excel writer object
with pd.ExcelWriter(f"{output_dir}/data.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)

print(f"{output_dir}/data.xlsx has been created.")

### Reading CSV files

Pandas makes it easy to read CSV (Comma Separated Values) files:

In [None]:
# Read the example file created above (the path is relative to your working directory)
df = pd.read_csv("assets/example_data/data.csv")
print(df.head())
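
CSV files store dates as plain text, so a date column is not read back as a datetime column unless you ask for it. The sketch below reads a small in-memory CSV (the text in `csv_text` is made up for illustration) and uses the `parse_dates` parameter of `pd.read_csv` to convert the `date` column:

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory instead of a file
csv_text = "date,value1,category\n2023-01-01,0.5,A\n2023-01-02,-1.2,B\n"

plain = pd.read_csv(io.StringIO(csv_text))                         # date stays text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])  # date becomes datetime

print(plain["date"].dtype)
print(parsed["date"].dtype)
print(parsed["date"].dt.day.tolist())  # [1, 2]
```

The same parameter should work for `assets/example_data/data.csv`, whose `date` column was written from a `pd.date_range`.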

### Writing to CSV files

You can also easily write DataFrames to CSV files:

In [None]:
df.to_csv("assets/example_data/data.csv", index=False)
print("assets/example_data/data.csv has been created.")

---

## 📚 SUPPLEMENTARY CONTENT (Additional File Formats)

**Estimated time**: 15-25 minutes

The sections below cover additional file formats and visualisation techniques:
- Reading and writing Excel files
- Reading and writing JSON files
- Reading HTML tables
- Basic data visualisation

These topics are useful but **not required for the exercises**. Work through CSV files first, then explore these formats when needed for your projects.

---

### Reading Excel files

Pandas can also read Excel files. Note that you might need to install the `openpyxl` library for this functionality.

In [None]:
# Read the Excel file created above (the path is relative to your working directory)
df_excel = pd.read_excel("assets/example_data/data.xlsx", sheet_name="Sheet1")
print(df_excel.head())

### Writing to Excel files

In [None]:
output_dir = "assets/example_data"
os.makedirs(output_dir, exist_ok=True)
df.to_excel(f"{output_dir}/data.xlsx", sheet_name="Sheet1", index=False)
print(f"Data written to '{output_dir}/data.xlsx'")

### Reading JSON files

JSON is a popular format for storing and exchanging data. Pandas can easily read JSON files:

In [None]:
# Read the JSON file created above (the path is relative to your working directory)
df_json = pd.read_json("assets/example_data/data.json")
print(df_json.head())

### Writing to JSON files

In [None]:
output_dir = "assets/example_data"
os.makedirs(output_dir, exist_ok=True)
df.to_json(f"{output_dir}/data.json", orient="records")
print(f"Data written to '{output_dir}/data.json'")
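
With `orient="records"`, each row is written as one JSON object inside a list. The sketch below shows this without touching the file system: when no path is given, `to_json` returns the JSON text as a string. The DataFrame `records` is made up for illustration.

```python
import io
import json

import pandas as pd

# Hypothetical two-row DataFrame
records = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

json_text = records.to_json(orient="records")    # no path given, so a string is returned
as_python = json.loads(json_text)                # a list with one dict per row
restored = pd.read_json(io.StringIO(json_text))  # back to a DataFrame

print(as_python)
print(restored)
```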

### Reading HTML tables

Pandas can read HTML tables from web pages. This is useful for web scraping tasks. Note that `pd.read_html` needs an HTML parser library such as `lxml` to be installed.

In [None]:
# url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
# tables = pd.read_html(url)
# df_html = tables[0]  # Select the first table from the page
# print(df_html.head())

## Basic data manipulation and analysis techniques

Now that we know how to read and write data, let's look at some basic data manipulation and analysis techniques afforded by Pandas.

### Handling missing data

Create a DataFrame with missing values:

In [None]:
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12]
})

print("Original DataFrame:")
print(df)

Drop rows with any missing values:

In [None]:
print("\nAfter dropping rows with missing values:")
print(df.dropna())

Fill missing values with a specific value:

In [None]:
print("\nAfter filling missing values with 0:")
print(df.fillna(0))

Fill missing values with the mean of the column:

In [None]:
print("\nAfter filling missing values with column mean:")
print(df.fillna(df.mean()))
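
Before dropping or filling values, it often helps to count how many are missing in each column. A minimal sketch, rebuilding the DataFrame above under the hypothetical name `gaps` so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the DataFrame with missing values
gaps = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

missing_per_column = gaps.isna().sum()  # number of NaN values in each column
filled = gaps.fillna(gaps.mean())       # each column is filled with its own mean

print(missing_per_column)
print(filled)
```

Column `B` has the mean 6.5 (the average of 5 and 8), so both of its gaps are filled with 6.5.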

### Grouping and aggregating data

Useful resource(s):

- https://www.machinelearningplus.com/pandas/pandas-groupby-examples/
- McKinney, Wes (2013). *Python for Data Analysis.* Sebastopol, CA: O'Reilly. *(pp. 249-258)*. Available at: https://wesmckinney.com/book/.

Create a sample DataFrame:

In [None]:
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "B"],
    "Value": [10, 20, 30, 40, 50, 60]
})

print("Original DataFrame:")
print(df)

Group by `Category` and calculate `mean`:

In [None]:
print("\nMean Value by Category:")
print(df.groupby("Category")["Value"].mean())

Group by `Category` and calculate multiple aggregations:

- `mean`
- `sum`
- `count`

In [None]:
print("\nMultiple aggregations by Category:")
print(df.groupby("Category").agg({"Value": ["mean", "sum", "count"]}))
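
The dictionary form of `agg` produces two-level column names. If flat, descriptive column names are preferred, named aggregation is an alternative. The sketch below assumes a copy of the sample data called `sales`; the output column names are invented for illustration:

```python
import pandas as pd

# Hypothetical copy of the sample data
sales = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "B"],
    "Value": [10, 20, 30, 40, 50, 60],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = sales.groupby("Category").agg(
    mean_value=("Value", "mean"),
    total_value=("Value", "sum"),
    n_rows=("Value", "count"),
)

print(summary)
```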

### Merging and joining DataFrames

Create two sample DataFrames:

In [None]:
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5, 6, 7, 8]})

print("DataFrame 1:")
print(df1)

print("\nDataFrame 2:")
print(df2)

Join operations are fundamental for combining data from different DataFrames, allowing you to merge data based on common columns or indices.

![](figure/pandas-joins-diagram.svg)

Inner join: Returns only the rows that have matching values in both DataFrames.

- `pd.merge(df1, df2, on='key_column', how='inner')`
- `df1.merge(df2, on='key_column', how='inner')`

In [None]:
print("\nInner Join:")
print(pd.merge(df1, df2, on="key", how="inner"))

Outer join: Returns all rows from both DataFrames, matching where possible and filling with NaN where there is no match.

- `pd.merge(df1, df2, on='key_column', how='outer')`
- `df1.merge(df2, on='key_column', how='outer')`

In [None]:
print("\nOuter Join:")
print(pd.merge(df1, df2, on="key", how="outer"))

Left join: Returns all rows from the left DataFrame together with any matching rows from the right DataFrame.
Where a left row has no match, the columns coming from the right DataFrame are filled with `NaN`.

- `pd.merge(df1, df2, on='key_column', how='left')`
- `df1.merge(df2, on='key_column', how='left')`

In [None]:
print("\nLeft Join:")
print(pd.merge(df1, df2, on="key", how="left"))
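
Because both DataFrames have a column called `value`, Pandas renames them in the merged result (by default to `value_x` and `value_y`). The sketch below uses copies named `left` and `right` and shows two optional parameters of `pd.merge`: `suffixes` sets clearer names, and `indicator=True` adds a `_merge` column that records where each row came from:

```python
import pandas as pd

# Hypothetical copies of the two sample DataFrames
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5, 6, 7, 8]})

merged = pd.merge(
    left,
    right,
    on="key",
    how="outer",
    suffixes=("_left", "_right"),  # replaces the default _x / _y
    indicator=True,                # adds the _merge column
)

print(merged)
```

In the `_merge` column, keys `B` and `D` are marked `both`, `A` and `C` are `left_only`, and `E` and `F` are `right_only`.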

## Data visualisation

Pandas provides built-in plotting functionality based on [Matplotlib](https://matplotlib.org/). This allows for quick and easy visualisation of your data directly from DataFrames or Series.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Set the style for better-looking plots
plt.style.use("seaborn-v0_8-notebook")

# Create a sample DataFrame
dates = pd.date_range("20230101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)

### Line Plot

In [None]:
df.plot(figsize=(10, 6))
plt.title("Line Plot of DataFrame")
plt.xlabel("Date")
plt.ylabel("Value")
plt.show()
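
`DataFrame.plot` returns a Matplotlib `Axes` object, so titles and labels can also be set on that object instead of through `plt`. This is a sketch of that alternative style, with a made-up DataFrame `demo` and a seeded random generator so the values are reproducible:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: 6 days, 2 columns of random values
rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", periods=6)
demo = pd.DataFrame(rng.standard_normal((6, 2)), index=dates, columns=["A", "B"])

ax = demo.plot(figsize=(10, 6))  # returns the Axes the lines were drawn on
ax.set_title("Line Plot of DataFrame")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
```

In a notebook the figure is displayed at the end of the cell; in a script, call `plt.show()` afterwards.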

### Bar Plot

In [None]:
df.iloc[0].plot(kind="bar", figsize=(10, 6))
plt.title("Bar Plot of First Row")
plt.xlabel("Column")
plt.ylabel("Value")
plt.show()

### Histogram

In [None]:
df["A"].hist(bins=20, figsize=(10, 6))
plt.title("Histogram of Column A")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

### Scatter Plot

In [None]:
df.plot.scatter(x="A", y="B", figsize=(10, 6))
plt.title("Scatter Plot: A vs B")
plt.show()

### Box Plot

In [None]:
df.boxplot(figsize=(10, 6))
plt.title("Box Plot of DataFrame")
plt.ylabel("Value")
plt.show()