# Pandas Data Preprocessing

## Loading Data from Multiple Sources

Pandas provides **flexible data ingestion** capabilities that allow you to read data from various sources. The `read_csv()` function is particularly versatile, accepting both **local file paths** and **public URLs**. This makes it easy to work with data stored on your computer or hosted online without changing your code structure.


In [None]:
import pandas as pd

# Read from a URL
url = "https://raw.githubusercontent.com/alanjones2/uk-historical-weather/refs/heads/main/data/Cardiff_Bute_Park.csv"
df = pd.read_csv(url)

# Or read from a local file
# df = pd.read_csv('data/weather.csv')

print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
df.head(3)
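
As a minimal sketch of the same idea, `read_csv()` also accepts a file-like object, so the call looks the same whether the text comes from disk, a URL, or memory. The CSV content and column names below are made up for illustration and are not taken from the Cardiff dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file or a URL response
csv_text = """Date,Tmax,Tmin
2020-01-01,9.1,3.2
2020-02-01,10.4,4.0
2020-03-01,11.8,3.9
"""

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(f"Loaded {sample.shape[0]} rows and {sample.shape[1]} columns")
print(sample.head())
```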

## Initial Data Exploration

Before manipulating data, always **understand its structure**. Check the **shape**, **column names**, **data types**, and **missing values**. The `.info()` method provides a comprehensive overview in one call.

In [None]:
# Basic exploration
print(f"Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print("\n" + "="*50)

# Comprehensive info
df.info()
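
The individual checks that `.info()` bundles together are also available separately, which helps when you want to use the results in code rather than just read them. A small sketch on a made-up frame (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a text date column and one missing value
sample = pd.DataFrame({
    "Date": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "Tmax": [9.1, 10.4, 11.8],
    "Sun": [45.0, np.nan, 110.2],
})

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # data type of each column; Date is still text here
print(sample.isna().sum())  # number of missing values per column
```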

## Statistical Summaries with `.describe()`

The `.describe()` method computes **summary statistics** (count, mean, std, min, max, quartiles) for numerical columns by default. The output is itself a DataFrame, allowing further operations.

In [None]:
# Get summary statistics
stats = df.describe()
print(stats)

# Access specific statistics
print(f"\nMean temperature: {stats.loc['mean', 'Tmean']:.2f}°C")
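
Because the result of `.describe()` is a DataFrame, the usual selection and arithmetic tools apply to it. The sketch below uses made-up temperatures to show selecting a few statistics and deriving a new one; the `range` column is an illustrative addition, not something `.describe()` produces.

```python
import pandas as pd

# Hypothetical monthly temperatures
sample = pd.DataFrame({
    "Tmax": [8.0, 10.0, 12.0, 14.0],
    "Tmin": [2.0, 3.0, 4.0, 7.0],
})

stats = sample.describe()

# Select a subset of the statistics with .loc
print(stats.loc[["mean", "min", "max"]])

# Transpose so each original column becomes a row, then add a derived column
summary = stats.T
summary["range"] = summary["max"] - summary["min"]
print(summary[["mean", "range"]])
```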

## Converting to Datetime Types

Date columns are often read as **text** by default. Converting them to **datetime objects** using `pd.to_datetime()` unlocks temporal operations like filtering by date ranges, extracting components, and creating time-aware visualizations.

In [None]:
# Convert text to datetime
df['Date'] = pd.to_datetime(df['Date'])

print(f"Date type: {df['Date'].dtype}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")

# Now we can filter by dates
recent_data = df[df['Date'] > '2020-01-01']
print(f"\nRows after 2020-01-01: {len(recent_data)}")
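
Real date columns sometimes contain values that cannot be parsed. A minimal sketch, assuming ISO-style strings, of how `errors="coerce"` turns those into `NaT` and how `.between()` selects a closed date range (the sample strings are invented):

```python
import pandas as pd

# Hypothetical date strings; the last one is not a valid date
raw = pd.Series(["2019-11-01", "2020-06-01", "2021-03-01", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising
dates = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(dates)

# Keep dates inside a closed range (both endpoints included)
in_range = dates[dates.between("2020-01-01", "2020-12-31")]
print(in_range)
```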

## Setting a Meaningful Index

For time series data, using the **date as the index** simplifies plotting and slicing. Use `set_index()` to assign a column as the index and `.drop(..., axis=1)` to remove columns you do not need.

In [None]:
# Set date as index
df = df.set_index('Date')

# Drop unnecessary columns
df = df.drop('Unnamed: 0', axis=1)

print("New structure:")
print(df.head(3))
print(f"\nIndex name: {df.index.name}")
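
One concrete benefit of a date index is label-based slicing with date strings. The sketch below uses a small invented series to show selecting a whole year and a date window with `.loc`:

```python
import pandas as pd

# Hypothetical monthly values indexed by date
sample = pd.DataFrame(
    {"Tmax": [9.0, 10.0, 12.0, 8.5, 9.5, 11.0]},
    index=pd.to_datetime([
        "2019-01-01", "2019-02-01", "2019-03-01",
        "2020-01-01", "2020-02-01", "2020-03-01",
    ]),
)
sample.index.name = "Date"

# Select a whole year with a partial date string
year_2020 = sample.loc["2020"]

# Slice between two dates (both endpoints included)
window = sample.loc["2019-03-01":"2020-01-31"]

print(year_2020)
print(window)
```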

## Detecting Missing Values

Real-world data often has **missing values** (`NaN`). Use `.isnull()` to identify them and `.sum()` to count missing values per column. Understanding **where and why** data is missing guides your handling strategy.

In [None]:
# Count missing values
missing = df.isnull().sum()
print("Missing values per column:")
print(missing)

# Find rows with missing values in a specific column
missing_sun = df[df['Sun'].isnull()]
print(f"\nRows with missing Sun data: {len(missing_sun)}")
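
Counts are easier to judge as proportions. As a sketch on invented data, taking the mean of the boolean mask gives the fraction missing per column, and `.any(axis=1)` finds rows with at least one gap:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns
sample = pd.DataFrame({
    "Rain": [80.0, np.nan, 120.0, 95.0],
    "Sun": [np.nan, np.nan, 150.0, np.nan],
})

# The mean of a boolean mask is the fraction of True values
missing_pct = sample.isna().mean() * 100
print(missing_pct)

# Rows where at least one column is missing
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(f"Rows with any missing value: {len(rows_with_gaps)}")
```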

## Strategies for Handling Missing Data

There are three main approaches: **drop** rows or columns (when they are mostly empty), **fill** gaps with a value such as the mean or median (simple imputation), or **interpolate** from surrounding values (suited to time series). Choose based on your **data's nature** and **missingness patterns**.

In [None]:
# Strategy 1: Drop columns with mostly missing data
df = df.drop('status', axis=1)

# Strategy 2: Fill with a value
df['Rain'] = df['Rain'].fillna(0)

# Strategy 3: Interpolate
df['Sun'] = df['Sun'].interpolate(method='linear')

print(f"Remaining missing values: {df.isnull().sum().sum()}")
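
The drop and fill strategies can also be driven by the data rather than by hard-coded column names. A minimal sketch, assuming a threshold of three non-missing values is a sensible cut-off for this invented frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "status" is almost entirely empty
sample = pd.DataFrame({
    "Tmax": [9.0, 10.0, np.nan, 12.0],
    "Rain": [80.0, np.nan, 120.0, 100.0],
    "status": [np.nan, np.nan, np.nan, "Provisional"],
})

# Drop columns that have fewer than 3 non-missing values
trimmed = sample.dropna(axis=1, thresh=3)

# Fill the remaining gaps with each column's median
filled = trimmed.fillna(trimmed.median())

print(filled)
print(f"Remaining missing values: {filled.isna().sum().sum()}")
```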

## Interpolation for Time Series

**Interpolation** estimates missing values from neighboring points. **Linear interpolation** draws straight lines between known values. More sophisticated methods (polynomial, spline) exist but require more computation and careful application.

In [None]:
import numpy as np

# Create sample data with gaps
data = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0])
print("Before interpolation:")
print(data)

# Linear interpolation
data_interp = data.interpolate(method='linear')
print("\nAfter interpolation:")
print(data_interp)
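
Linear interpolation treats observations as equally spaced. When the index holds dates with uneven gaps, `method='time'` weights the estimate by the elapsed time instead. The sketch below contrasts the two on an invented three-point series:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with uneven gaps between dates
s = pd.Series(
    [0.0, np.nan, 3.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]),
)

# "linear" treats the points as equally spaced and ignores the index
by_position = s.interpolate(method="linear")

# "time" weights the estimate by the actual time between observations
by_time = s.interpolate(method="time")

print(by_position)
print(by_time)
```

Here the missing day sits one third of the way between the known dates, so the time-based estimate is closer to the first value than the position-based one.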

## Extracting Date Components

Datetime objects have useful **attributes** like `.year`, `.month`, `.day`, `.dayofweek`, etc. Extract **month names** with `.month_name()` or quarters with `.quarter` for seasonal analysis and grouping.

In [None]:
# Extract date components from the index
df['Year'] = df.index.year
df['Month'] = df.index.month
df['Month_Name'] = df.index.month_name()
df['Quarter'] = df.index.quarter

print(df[['Year', 'Month', 'Month_Name', 'Quarter']].head(6))
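
When the dates live in a regular column rather than the index, the same attributes are reached through the `.dt` accessor. A short sketch with invented dates:

```python
import pandas as pd

# Hypothetical frame where the dates are a column, not the index
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-15", "2020-04-15", "2020-12-15"]),
})

# On a column, datetime attributes are reached through .dt
sample["Year"] = sample["Date"].dt.year
sample["Month_Name"] = sample["Date"].dt.month_name()
sample["Quarter"] = sample["Date"].dt.quarter
sample["Day_Name"] = sample["Date"].dt.day_name()

print(sample)
```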

## Functional Programming with `.apply()`

Use `.apply()` to apply **custom functions** to DataFrame columns. Functions are **first-class objects** that can be passed as arguments, enabling you to create derived features with custom logic.

In [None]:
# Define a custom function
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

# Apply it to create a new column
df['Season'] = df['Month'].apply(get_season)
print(df[['Month', 'Month_Name', 'Season']].head(12))
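
When the custom logic is a plain lookup, `Series.map()` with a dictionary is an alternative to writing a function for `.apply()`. The sketch below expresses the same month-to-season rule as a lookup table (the variable names are illustrative):

```python
import pandas as pd

# Lookup table from month number to season
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

months = pd.Series([1, 4, 7, 10, 12], name="Month")

# map() replaces each value with its dictionary entry
seasons = months.map(season_by_month)
print(seasons)
```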

## Creating Derived Features

Create new columns through **arithmetic operations** or **boolean conditions**. Derived features encode domain knowledge into your data structure, like temperature ranges or threshold-based flags.

In [None]:
# Arithmetic operation
df['Temp_Range'] = df['Tmax'] - df['Tmin']

# Boolean conditions
df['Rainy_Month'] = df['Rain'] > 100
df['Sunny_Month'] = df['Sun'] > 150

print(df[['Tmax', 'Tmin', 'Temp_Range', 'Rain', 'Rainy_Month']].head())
print(f"\nRainy months: {df['Rainy_Month'].sum()} ({df['Rainy_Month'].mean()*100:.1f}%)")
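
A boolean flag splits values into two groups. For more than two levels, `pd.cut()` bins a numeric column into labeled categories. A minimal sketch, where the thresholds and labels are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly rainfall in mm
rain = pd.Series([0.0, 50.0, 75.0, 120.0], name="Rain")

# Each bin includes its right edge; include_lowest also keeps the value 0
rain_level = pd.cut(
    rain,
    bins=[0, 50, 100, np.inf],
    labels=["Dry", "Moderate", "Wet"],
    include_lowest=True,
)

print(pd.DataFrame({"Rain": rain, "Rain_Level": rain_level}))
```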

## Grouping and Aggregation

The `.groupby()` method **aggregates data** by categories. Chain three operations: group by a column, select columns to aggregate, and apply an aggregation function (mean, sum, count, etc.).

In [None]:
# Group by season and calculate mean temperature
seasonal_temp = df.groupby('Season')['Tmean'].mean()
print("Average temperature by season:")
print(seasonal_temp.round(2))

# Percentage of rainy months by month name
rainy_pct = df.groupby('Month_Name')['Rainy_Month'].mean() * 100
print("\nRainiest months:")
print(rainy_pct.sort_values(ascending=False).head())
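
Several aggregations can be computed in one pass with `.agg()`. The sketch below uses named aggregation on an invented frame; the output column names (`mean_temp`, `max_temp`, `total_rain`) are chosen here for illustration.

```python
import pandas as pd

# Hypothetical seasonal observations
sample = pd.DataFrame({
    "Season": ["Winter", "Winter", "Summer", "Summer"],
    "Tmean": [5.0, 7.0, 17.0, 19.0],
    "Rain": [120.0, 90.0, 60.0, 40.0],
})

# Named aggregation: new_column=(source_column, function)
summary = sample.groupby("Season").agg(
    mean_temp=("Tmean", "mean"),
    max_temp=("Tmean", "max"),
    total_rain=("Rain", "sum"),
)

print(summary)
```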

## Exporting Cleaned Data

After preprocessing, **save your cleaned data** for future use. Pandas supports **CSV** (`.to_csv()`), **Excel** (`.to_excel()`), **JSON**, **SQL**, and more. Some formats require additional libraries.

In [None]:
# Select columns to export
df_export = df[['Year', 'Month', 'Tmax', 'Tmin', 'Rain', 'Sun']]

# Export to CSV
df_export.to_csv('weather_cleaned.csv')
print("Exported to CSV")

# Export to Excel (requires openpyxl)
# df_export.to_excel('weather_cleaned.xlsx')

# Export to JSON
# df_export.to_json('weather_cleaned.json', orient='records', indent=2)
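
A CSV file stores dates as text, so a date index has to be restored when the file is read back. The following sketch writes to an in-memory buffer instead of a file (an assumption made to keep it self-contained) and reloads the invented data with `index_col` and `parse_dates`:

```python
import io
import pandas as pd

# Hypothetical cleaned data with a date index
cleaned = pd.DataFrame(
    {"Tmax": [9.0, 10.5], "Rain": [80.0, 95.0]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01"]),
)
cleaned.index.name = "Date"

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer)

# Read it back, restoring the date index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Date", parse_dates=True)

print(restored)
print(type(restored.index).__name__)
```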

## Hands on!

- You will find a worked-through example in the **lecture** notebook.
- Two exercises using real-world datasets will let you practice the concepts covered today.