# Data Preprocessing

The preprocessing phase transformed raw Transport for London (TfL) passenger data and London weather data into a clean, feature-rich dataset suitable for modeling.

The process began with loading TfL journey data from separate CSV files spanning 2019-2023, combining tube and bus journey counts into a total passenger count metric. Simultaneously, London weather data from 1979-2023 was imported to provide environmental context for passenger behavior analysis.

Data quality was addressed through several cleaning operations. Dates were converted to datetime format and rows with invalid dates were removed. Missing passenger counts were imputed with the mean for the same calendar month to preserve seasonal patterns. Extreme outliers were screened with a z-score filter (rows with |z| ≥ 3 are dropped) to prevent them from skewing the model; in this dataset no rows exceeded the threshold. For weather data, rows carrying quality flags 1 or 9 were removed as misleading (14,112 of 16,436 rows remained), and missing values were filled using monthly averages to preserve seasonal patterns. Snow depth and global radiation variables were excluded due to incomplete data coverage.
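The two passenger-count cleaning steps can be illustrated on a toy series. This is a minimal sketch, not the notebook's code: the series `counts` and all of its values are hypothetical, chosen so that one value is missing and one is an obvious spike.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: constant counts with one gap and one spike.
dates = pd.date_range("2019-01-01", periods=60, freq="D")
counts = pd.Series(1000.0, index=dates)
counts.iloc[10] = np.nan     # missing value in January
counts.iloc[40] = 50000.0    # extreme value in February

# Impute each missing value with the mean of its calendar month.
monthly_mean = counts.groupby(counts.index.month).transform("mean")
imputed = counts.fillna(monthly_mean)

# Keep only observations whose absolute z-score is below 3.
z = (imputed - imputed.mean()).abs() / imputed.std()
filtered = imputed[z < 3]

print(len(imputed), len(filtered))
```

Here the gap is filled with the January mean and only the spike is dropped. On data without such extremes the filter removes nothing.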

To enable meaningful analysis, the TfL data was aggregated to daily totals and merged with weather data based on date alignment. The combined dataset was then split into training (2019-2022) and testing (2023) sets to facilitate model evaluation on unseen data.
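Because the merge keeps only dates present in both tables, it can shrink the dataset when weather rows have been filtered out. The sketch below uses two small hypothetical frames (`journeys` and `weather`, with made-up values) to show one way to audit which dates would be lost before applying the year-based split. The `indicator=True` option of `merge` adds a `_merge` column describing where each row came from.

```python
import pandas as pd

# Hypothetical inputs: the weather table lacks 2022-12-31.
journeys = pd.DataFrame({
    "date": pd.to_datetime(["2022-12-30", "2022-12-31", "2023-01-01", "2023-01-02"]),
    "passenger_count": [100, 120, 90, 110],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2022-12-30", "2023-01-01", "2023-01-02"]),
    "max_temp": [8.0, 9.5, 7.0],
})

# Audit: a left merge with an indicator flags journey dates without weather.
audit = journeys.merge(weather, on="date", how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "date"]

# Inner merge, then split by calendar year; .copy() gives independent frames.
merged = journeys.merge(weather, on="date", how="inner")
train = merged[merged["date"].dt.year <= 2022].copy()
test = merged[merged["date"].dt.year == 2023].copy()

print(unmatched.dt.date.tolist(), len(train), len(test))
```

Splitting by year rather than at random keeps the test period strictly after the training period, which matches the goal of evaluating on unseen data.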

The dataset was enriched through feature engineering: day-of-week and month indicators add temporal context, and a binary feature marks rainy days (precipitation > 0). Any missing values still present after these steps were backward-filled to maintain data continuity.
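As a small illustration of these derived features, the following sketch builds them on a hypothetical three-day sample. Only the column names follow the dataset; the values are invented. Using `assign` returns a new frame, so the input is not modified.

```python
import pandas as pd

# Hypothetical sample; 2019-01-07 is a Monday.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-07", "2019-01-08", "2019-01-09"]),
    "precipitation": [0.0, 2.4, 0.0],
})

features = sample.assign(
    day_of_week=sample["date"].dt.dayofweek,               # Monday = 0 ... Sunday = 6
    month=sample["date"].dt.month,                         # 1 to 12
    is_raining=(sample["precipitation"] > 0).astype(int),  # 1 if any precipitation
)

print(features)
```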

To limit multicollinearity, weather features with an absolute pairwise correlation above 0.8 were identified on the training set, and one feature from each correlated pair was removed: min_temp, mean_temp, and sunshine. Finally, numerical features were standardized using StandardScaler, fitted on the training set and applied unchanged to the test set, to ensure a consistent scale across passenger_count, max_temp, precipitation, pressure, humidity, and cloud_cover.
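One common way to express a correlation filter like this is to look only at the upper triangle of the absolute correlation matrix, so each pair is considered once. The sketch below is an alternative formulation under stated assumptions, not the notebook's exact logic: the frame `toy` is synthetic, and the rule "drop the later column of each correlated pair" is a choice made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: mean_temp nearly duplicates max_temp, pressure is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "max_temp": base,
    "mean_temp": base + rng.normal(scale=0.1, size=200),
    "pressure": rng.normal(size=200),
})

corr = toy.corr().abs()

# Mask everything except the strict upper triangle (k=1 excludes the diagonal).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates above 0.8 with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop, list(reduced.columns))
```

Each column appears at most once in `to_drop`, so the resulting list needs no deduplication.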

The final processed datasets contain the following features:

| Feature | Description | Type |
|---------|-------------|------|
| date | Calendar date | datetime |
| passenger_count | Total daily TfL passengers (normalized) | float |
| max_temp | Maximum daily temperature (normalized) | float |
| precipitation | Daily precipitation amount (normalized) | float |
| pressure | Atmospheric pressure (normalized) | float |
| humidity | Relative humidity (normalized) | float |
| cloud_cover | Cloud coverage (normalized) | float |
| day_of_week | Day of week (0-6) | int |
| month | Month of year (1-12) | int |
| is_raining | Binary indicator for precipitation > 0 | int |

The preprocessing resulted in a training dataset of 1,298 rows (2019-2022) and a test dataset of 338 rows (2023).


In [19]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import os

In [20]:
# Set random seed for reproducibility
np.random.seed(42)

# Project root; despite the name, this path is resolved relative to the current working directory
ABS_PATH = 'Bayesian-analysis-of-public-transport-passengers'

In [21]:
# 1. Load Data
# TfL data: separate files for each year (2019–2023)
tfl_files = {
    2019: "raw_data/Journeys_2019.csv",
    2020: "raw_data/Journeys_2020.csv",
    2021: "raw_data/Journeys_2021.csv",
    2022: "raw_data/Journeys_2022.csv",
    2023: "raw_data/Journeys_2023.csv"
}

# Initialize empty list to store TfL data
tfl_dfs = []

# Load and concatenate TfL files
for year, file in tfl_files.items():
    file_path = os.path.join(ABS_PATH, file)
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        # Sum TubeJourneyCount and BusJourneyCount into total passenger_count
        df['passenger_count'] = df['TubeJourneyCount'] + df['BusJourneyCount']
        # Keep only TravelDate and passenger_count
        df = df[['TravelDate', 'passenger_count']]
        tfl_dfs.append(df)
    else:
        print(f"Warning: File {file} not found. Skipping...")

# Combine TfL data
if tfl_dfs:
    tfl_data = pd.concat(tfl_dfs, ignore_index=True)
else:
    raise FileNotFoundError("No TfL data files were found.")

# Load weather data
weather_data = pd.read_csv(os.path.join(ABS_PATH, "raw_data/london_weather_data_1979_to_2023.csv"))

In [22]:
# 2. Data Cleaning (TfL Data)
# Convert date to datetime
tfl_data['TravelDate'] = pd.to_datetime(tfl_data['TravelDate'], format='%Y%m%d', errors='coerce')

# Check for invalid dates
print(f"TfL data: {len(tfl_data)} rows before dropping NaT")
tfl_data = tfl_data.dropna(subset=['TravelDate'])
print(f"TfL data: {len(tfl_data)} rows after dropping NaT")

# Handle missing passenger counts: fill with the mean of the same calendar month
monthly_mean = tfl_data.groupby(tfl_data['TravelDate'].dt.month)['passenger_count'].transform('mean')
tfl_data['passenger_count'] = tfl_data['passenger_count'].fillna(monthly_mean)

# Remove outliers (z-score > 3)
z_scores = np.abs((tfl_data['passenger_count'] - tfl_data['passenger_count'].mean()) / tfl_data['passenger_count'].std())
tfl_data = tfl_data[z_scores < 3]
print(f"TfL data after outlier removal: {len(tfl_data)} rows")


TfL data: 1826 rows before dropping NaT
TfL data: 1826 rows after dropping NaT
TfL data after outlier removal: 1826 rows


In [23]:
# 3. Data Cleaning (Weather Data)

# Delete wrong or misleading rows (quality flags 1 and 9), then drop the Q_ columns
q_columns = [col for col in weather_data.columns if col.startswith('Q_')]
print(f"Weather data: {len(weather_data)} rows before deleting misleading rows")
weather_data = weather_data[~weather_data[q_columns].isin([1, 9]).any(axis=1)]
print(f"Weather data: {len(weather_data)} rows after deleting misleading rows")

weather_data = weather_data.drop(columns=q_columns)

# Rename columns to match naming convention
weather_data = weather_data.rename(columns={
    'DATE': 'date',
    'TX': 'max_temp',
    'TN': 'min_temp',
    'TG': 'mean_temp',
    'SS': 'sunshine',
    'SD': 'snow_depth',
    'RR': 'precipitation',
    'QQ': 'global_radiation',
    'PP': 'pressure',
    'HU': 'humidity',
    'CC': 'cloud_cover'
})

# Convert date to datetime
weather_data['date'] = pd.to_datetime(weather_data['date'], format='%Y%m%d', errors='coerce')

# Check for invalid dates
print(f"Weather data: {len(weather_data)} rows before dropping NaT")
weather_data = weather_data.dropna(subset=['date'])
print(f"Weather data: {len(weather_data)} rows after dropping NaT")

# Handle missing weather values: fill with the mean of the same calendar month
weather_columns = ['max_temp', 'min_temp', 'mean_temp', 'sunshine', 'precipitation',
                   'pressure', 'humidity', 'cloud_cover']
for col in weather_columns:
    weather_data[col] = weather_data[col].fillna(weather_data.groupby(weather_data['date'].dt.month)[col].transform('mean'))
weather_data = weather_data.drop(columns=['snow_depth', 'global_radiation'])  # Remove columns not used in analysis

# Unit conversion is disabled; values stay in the units of the source file
# weather_data['max_temp'] = weather_data['max_temp'] / 10
# weather_data['min_temp'] = weather_data['min_temp'] / 10
# weather_data['mean_temp'] = weather_data['mean_temp'] / 10
# weather_data['sunshine'] = weather_data['sunshine'] / 10
# weather_data['precipitation'] = weather_data['precipitation'] / 10
# weather_data['pressure'] = weather_data['pressure'] / 10

# Remove invalid entries
weather_data = weather_data[weather_data['precipitation'] >= 0]
print(f"Weather data after cleaning: {len(weather_data)} rows")

Weather data: 16436 rows before deleting misleading rows
Weather data: 14112 rows after deleting misleading rows
Weather data: 14112 rows before dropping NaT
Weather data: 14112 rows after dropping NaT
Weather data after cleaning: 14112 rows


In [24]:
# 4. Temporal Alignment
# Aggregate TfL data to daily total
tfl_data = tfl_data.rename(columns={'TravelDate': 'date'})
daily_tfl = tfl_data.groupby('date')['passenger_count'].sum().reset_index()

# Merge with weather data
merged_data = pd.merge(daily_tfl, weather_data, on='date', how='inner')
print(f"Merged data: {len(merged_data)} rows")

# Filter for 2019–2022 (training) and 2023 (testing)
train_data = merged_data[merged_data['date'].dt.year.isin([2019, 2020, 2021, 2022])]
test_data = merged_data[merged_data['date'].dt.year == 2023]
print(f"Train data (2019–2022): {len(train_data)} rows")
print(f"Test data (2023): {len(test_data)} rows")

# Check if train_data is empty
if len(train_data) == 0:
    raise ValueError("Train data is empty. Check date ranges in TfL and weather data for overlap in 2019–2022.")

Merged data: 1636 rows
Train data (2019–2022): 1298 rows
Test data (2023): 338 rows


In [25]:
# 5. Feature Engineering
# Work on explicit copies so the column assignments below do not trigger SettingWithCopyWarning
train_data = train_data.copy()
test_data = test_data.copy()

# Ensure date is datetime
train_data['date'] = pd.to_datetime(train_data['date'])
test_data['date'] = pd.to_datetime(test_data['date'])

# Add temporal features for train_data
train_data['day_of_week'] = train_data['date'].dt.dayofweek
train_data['month'] = train_data['date'].dt.month
train_data['is_raining'] = (train_data['precipitation'] > 0).astype(int)

# Add temporal features for test_data (skip if empty)
if len(test_data) > 0:
    test_data['day_of_week'] = test_data['date'].dt.dayofweek
    test_data['month'] = test_data['date'].dt.month
    test_data['is_raining'] = (test_data['precipitation'] > 0).astype(int)

# Backward-fill any remaining missing values
train_data = train_data.bfill()
if len(test_data) > 0:
    test_data = test_data.bfill()

In [26]:
# 6. Handle Redundancies
# Drop highly correlated weather features (absolute correlation > 0.8)
corr_matrix = train_data[weather_columns].corr()
# The alphabetical check col1 < col2 visits each unordered pair exactly once
high_corr = [(col1, col2) for col1 in corr_matrix.columns for col2 in corr_matrix.index
             if col1 < col2 and abs(corr_matrix.loc[col2, col1]) > 0.8]
if high_corr:
    # Drop the second column of each pair; dict.fromkeys removes duplicates and keeps order
    drop_cols = list(dict.fromkeys(col2 for col1, col2 in high_corr))
    print(f"Dropped columns due to high correlation: {drop_cols}")
    train_data = train_data.drop(columns=drop_cols)
    if len(test_data) > 0:
        test_data = test_data.drop(columns=drop_cols)


Dropped columns due to high correlation: ['min_temp', 'mean_temp', 'sunshine']


In [27]:
# 7. Normalize Numerical Features
# Dynamically select numerical columns that exist
base_numerical_cols = ['passenger_count', 'max_temp', 'min_temp', 'mean_temp', 'precipitation',
                       'sunshine', 'pressure', 'humidity', 'cloud_cover']
numerical_cols = [col for col in base_numerical_cols if col in train_data.columns]
print(f"Numerical columns for normalization: {numerical_cols}")

# Apply normalization only if train_data is not empty
if len(train_data) > 0:
    scaler = StandardScaler()
    train_data[numerical_cols] = scaler.fit_transform(train_data[numerical_cols])
    if len(test_data) > 0:
        test_data[numerical_cols] = scaler.transform(test_data[numerical_cols])
else:
    raise ValueError("Cannot normalize: train_data is empty.")

Numerical columns for normalization: ['passenger_count', 'max_temp', 'precipitation', 'pressure', 'humidity', 'cloud_cover']
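
Since `passenger_count` is scaled along with the weather features, model outputs are in standardized units. The sketch below, using a hypothetical one-column example with made-up values, shows the pattern used in the cell above (fit on training rows, reuse the same statistics for test rows) and how `inverse_transform` maps scaled values back to the original units.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical passenger counts.
train = pd.DataFrame({"passenger_count": [100.0, 200.0, 300.0]})
test = pd.DataFrame({"passenger_count": [400.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from training rows only
test_scaled = scaler.transform(test)        # applies the training statistics to test rows

# Map scaled values back to the original units.
restored = scaler.inverse_transform(test_scaled)

print(test_scaled.ravel(), restored.ravel())
```

Fitting the scaler on training data alone avoids leaking information about the test period into the features.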


In [29]:
# 8. Save Processed Data
train_data.to_csv(os.path.join(ABS_PATH, 'processed_train_data.csv'), index=False)
if len(test_data) > 0:
    test_data.to_csv(os.path.join(ABS_PATH, 'processed_test_data.csv'), index=False)
else:
    print("Warning: test_data is empty, skipping save of test_data CSV.")