# Feature Engineering

**Project:** Road Collision Severity Prediction  
**Section:** F - Feature Engineering for Machine Learning

## What i Will Do in This Notebook:

in this notebook, i will create rich features from the cleaned data that will help our machine learning models predict collision severity better. raw data alone is not enough - i need to engineer meaningful features that capture patterns.

i will create several types of features:
1. **temporal features** - extracting hour, day of week, weekend indicators, rush hour flags, and seasons
2. **weather features** - calculating temperature means and ranges, categorizing rainfall, identifying frost days
3. **geographic features** - using population density and distance to cities
4. **road features** - grouping speed limits into bands, identifying pedestrian crossings
5. **interaction features** - combining multiple dimensions like speed × vehicles, rush hour × rain

these engineered features will give the model much more information to learn from compared to just the raw columns. at the end i will save everything to a new table called feature_engineered_collisions.

In [None]:
# importing all libraries i need for feature engineering
# i need pandas for data manipulation and numpy for numerical operations
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import os
from pathlib import Path
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# connecting to postgresql database
project_directory = Path.cwd()
load_dotenv(project_directory / ".env")

# getting postgresql credentials from environment
postgres_host = os.getenv("POSTGRES_HOST")
postgres_port = os.getenv("POSTGRES_PORT")
postgres_database = os.getenv("POSTGRES_DB")
postgres_user = os.getenv("POSTGRES_USER")
postgres_password = os.getenv("POSTGRES_PASSWORD")

# creating database connection
database_url = f"postgresql+psycopg2://{postgres_user}:{postgres_password}@{postgres_host}:{postgres_port}/{postgres_database}"
engine = create_engine(database_url)

print("libraries loaded and postgresql database connection established")
# above confirms setup is ready for feature engineering

✓ Libraries loaded and PostgreSQL database connection established


## Loading Cleaned Data

now i will load the cleaned collision data that we prepared in notebook 2. this data already has proper date types and geographic enrichment, so i can start engineering features right away.

In [None]:
# loading cleaned collision data from database
# this includes weather and population enrichment from notebook 2
sql_query = "SELECT * FROM clean_collisions"
df_for_features = pd.read_sql(sql_query, engine)

print(f"loaded {len(df_for_features)} collision records for feature engineering")
print(f"number of columns in raw data: {df_for_features.shape[1]}")
print(f"\ncolumn names:")
print(df_for_features.columns.tolist())
# above shows me the starting point before feature engineering

# displaying first few rows to understand the data
df_for_features.head()

Loaded 48471 collision records
Columns: ['collision_index', 'collision_year', 'collision_ref_no', 'location_easting_osgr', 'location_northing_osgr', 'longitude', 'latitude', 'police_force', 'collision_severity', 'number_of_vehicles', 'number_of_casualties', 'date', 'day_of_week', 'time', 'local_authority_district', 'local_authority_ons_district', 'local_authority_highway', 'local_authority_highway_current', 'first_road_class', 'first_road_number', 'road_type', 'speed_limit', 'junction_detail_historic', 'junction_detail', 'junction_control', 'second_road_class', 'second_road_number', 'pedestrian_crossing_human_control_historic', 'pedestrian_crossing_physical_facilities_historic', 'pedestrian_crossing', 'light_conditions', 'weather_conditions', 'road_surface_conditions', 'special_conditions_at_site', 'carriageway_hazards_historic', 'carriageway_hazards', 'urban_or_rural_area', 'did_police_officer_attend_scene_of_accident', 'trunk_road_flag', 'lsoa_of_accident_location', 'enhanced_severit

Unnamed: 0,collision_index,collision_year,collision_ref_no,location_easting_osgr,location_northing_osgr,longitude,latitude,police_force,collision_severity,number_of_vehicles,...,did_police_officer_attend_scene_of_accident,trunk_road_flag,lsoa_of_accident_location,enhanced_severity_collision,collision_injury_based,collision_adjusted_severity_serious,collision_adjusted_severity_slight,nearest_city,city_distance_km,city_population
0,2025010551784,2025,10551784,528234.0,185607.0,-0.15174,51.55478,1,3,3,...,1,-1,E01000886,-1,0,0.0,1.0,Highgate,2.0197,10955.0
1,2025010551786,2025,10551786,529585.0,178570.0,-0.13485,51.49124,1,3,2,...,3,-1,E01004747,-1,0,0.0,1.0,Westminster,0.387295,255324.0
2,2025010551792,2025,10551792,524767.0,187961.0,-0.20089,51.57671,1,3,5,...,1,-1,E01000216,-1,0,0.0,1.0,Hendon,2.807743,35874.0
3,2025010551794,2025,10551794,527549.0,184185.0,-0.16213,51.54216,1,2,1,...,1,-1,E01000967,-1,0,1.0,0.0,Hampstead,1.900392,48858.0
4,2025010551795,2025,10551795,534910.0,183108.0,-0.05647,51.53077,1,2,2,...,1,-1,E01004198,-1,0,1.0,0.0,Stepney,2.070374,16238.0


## Creating Temporal Features

now i will extract time-based features from the date and time columns. these temporal patterns can be very important for predicting collision severity - certain hours or days might be more dangerous.

In [None]:
# first i need to ensure date and time are in proper format
df_for_features['date'] = pd.to_datetime(df_for_features['date'])
if 'time' in df_for_features.columns and df_for_features['time'].dtype == 'object':
    df_for_features['time'] = pd.to_datetime(df_for_features['time'], format='%H:%M', errors='coerce').dt.time
print("converted date and time to proper datetime formats")

# extracting hour of day (0-23)
# this tells us what time the collision happened
df_for_features['hour_of_day'] = pd.to_datetime(df_for_features['time'].astype(str), format='%H:%M:%S', errors='coerce').dt.hour
print("extracted hour_of_day feature (0-23)")

# extracting day of week (1=monday through 7=sunday)
# i am adding 1 so monday=1 instead of 0 for easier interpretation
df_for_features['day_of_week'] = df_for_features['date'].dt.dayofweek + 1
print("extracted day_of_week feature (1=monday, 7=sunday)")

# creating weekend indicator (1 if saturday or sunday, 0 otherwise)
# weekends might have different collision patterns than weekdays
df_for_features['is_weekend'] = (df_for_features['day_of_week'] >= 6).astype(int)
print("created is_weekend feature")

# extracting month (1-12)
df_for_features['month'] = df_for_features['date'].dt.month
print("extracted month feature")

# extracting year
df_for_features['year'] = df_for_features['date'].dt.year
print("extracted year feature")

# creating season feature based on month
# this captures seasonal patterns in collisions
def get_season(month):
    """this function maps month numbers to season names"""
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'autumn'

df_for_features['season'] = df_for_features['month'].apply(get_season)
print("created season feature (winter, spring, summer, autumn)")

# creating rush hour indicator
# morning rush: 7-9 am, evening rush: 5-7 pm
# these times might have different severity patterns due to traffic
df_for_features['is_rush_hour'] = ((df_for_features['hour_of_day'] >= 7) & (df_for_features['hour_of_day'] <= 9) | 
                       (df_for_features['hour_of_day'] >= 17) & (df_for_features['hour_of_day'] <= 19)).astype(int)
print("created is_rush_hour feature")

# displaying sample of temporal features created
print("\ntemporal features created successfully:")
print("  - hour_of_day, day_of_week, is_weekend")
print("  - month, year, season, is_rush_hour")
print(f"\nsample of new temporal features:")
df_for_features[['date', 'time', 'hour_of_day', 'day_of_week', 'is_weekend', 'season', 'is_rush_hour']].head()
# above shows all temporal features are created properly

✓ Temporal features created:
  - hour_of_day, day_of_week, is_weekend, month, year, season, is_rush_hour

Sample:


Unnamed: 0,date,time,hour_of_day,day_of_week,is_weekend,season,is_rush_hour
0,2025-01-01,NaT,,3,0,Winter,0
1,2025-01-01,NaT,,3,0,Winter,0
2,2025-01-01,NaT,,3,0,Winter,0
3,2025-01-01,NaT,,3,0,Winter,0
4,2025-01-01,NaT,,3,0,Winter,0


## Weather Features
Create derived weather features from temperature and precipitation data.

In [22]:
# Weather features (if columns exist)
if 'tmax' in df.columns and 'tmin' in df.columns:
    df['temp_mean'] = (df['tmax'] + df['tmin']) / 2
    df['temp_range'] = df['tmax'] - df['tmin']
else:
    print("⚠ tmax/tmin not found, skipping temp features")

if 'rain' in df.columns:
    df['rain_mm'] = df['rain']
    # Categorize rainfall
    def rain_category(rain):
        if pd.isna(rain) or rain == 0:
            return 'none'
        elif rain < 2.5:
            return 'light'
        elif rain < 10:
            return 'medium'
        else:
            return 'heavy'
    df['rain_category'] = df['rain'].apply(rain_category)
else:
    print("⚠ rain column not found")

if 'af' in df.columns:
    df['is_frost_day'] = (df['af'] > 0).astype(int)
else:
    print("⚠ af (frost days) column not found")

print("✓ Weather features created")
print(df[['temp_mean', 'temp_range', 'rain_mm', 'rain_category', 'is_frost_day']].head() if 'temp_mean' in df.columns else "Limited weather features")

⚠ tmax/tmin not found, skipping temp features
⚠ rain column not found
⚠ af (frost days) column not found
✓ Weather features created
Limited weather features


## Geographic Features
Process population and location-based features.

In [23]:
# Geographic features
if 'nearest_city_population' in df.columns:
    df['log_population'] = np.log1p(df['nearest_city_population'])
    print("✓ log_population created")
else:
    print("⚠ nearest_city_population not found")

# Check if distance_to_city_km exists
if 'distance_to_city_km' in df.columns:
    print(f"✓ distance_to_city_km available (mean: {df['distance_to_city_km'].mean():.2f} km)")
else:
    print("⚠ distance_to_city_km not found")

print("\nGeographic features summary:")
geo_cols = [col for col in ['nearest_city_population', 'log_population', 'distance_to_city_km', 'urban_or_rural_area'] if col in df.columns]
if geo_cols:
    print(df[geo_cols].describe())

⚠ nearest_city_population not found
⚠ distance_to_city_km not found

Geographic features summary:
       urban_or_rural_area
count         48471.000000
mean              1.334117
std               0.471773
min               1.000000
25%               1.000000
50%               1.000000
75%               2.000000
max               3.000000


## Road & Collision Features
Create features from speed limits, vehicles, casualties, and road characteristics.

In [24]:
# Speed limit bands
if 'speed_limit' in df.columns:
    def speed_band(speed):
        if pd.isna(speed):
            return 'unknown'
        elif speed <= 30:
            return '<=30'
        elif speed <= 50:
            return '40-50'
        elif speed <= 70:
            return '60-70'
        else:
            return '>70'
    
    df['speed_limit_band'] = df['speed_limit'].apply(speed_band)
    print("✓ speed_limit_band created")
else:
    print("⚠ speed_limit column not found")

# Pedestrian crossing indicator
ped_cols = [col for col in df.columns if 'pedestrian' in col.lower() and 'crossing' in col.lower()]
if ped_cols:
    df['has_pedestrian_crossing'] = (df[ped_cols].fillna(0).sum(axis=1) > 0).astype(int)
    print(f"✓ has_pedestrian_crossing created from {len(ped_cols)} columns")
else:
    print("⚠ No pedestrian crossing columns found")

print("\nRoad feature columns available:")
road_cols = [col for col in df.columns if col in ['speed_limit', 'speed_limit_band', 'number_of_vehicles', 
                                                     'number_of_casualties', 'road_type', 'junction_detail',
                                                     'light_conditions', 'weather_conditions', 'has_pedestrian_crossing']]
print(road_cols)

✓ speed_limit_band created
✓ has_pedestrian_crossing created from 3 columns

Road feature columns available:
['number_of_vehicles', 'number_of_casualties', 'road_type', 'speed_limit', 'junction_detail', 'light_conditions', 'weather_conditions', 'speed_limit_band', 'has_pedestrian_crossing']


## Interaction Features
Create at least 3 interaction features combining different dimensions.

In [25]:
# Interaction Feature 1: speed_limit * number_of_vehicles
if 'speed_limit' in df.columns and 'number_of_vehicles' in df.columns:
    df['speed_x_vehicles'] = df['speed_limit'].fillna(0) * df['number_of_vehicles'].fillna(0)
    print("✓ Interaction 1: speed_x_vehicles")

# Interaction Feature 2: is_rush_hour * rain_mm
if 'is_rush_hour' in df.columns and 'rain_mm' in df.columns:
    df['rush_x_rain'] = df['is_rush_hour'] * df['rain_mm'].fillna(0)
    print("✓ Interaction 2: rush_x_rain")

# Interaction Feature 3: urban_or_rural_area * speed_limit_band (encoded)
if 'urban_or_rural_area' in df.columns and 'speed_limit_band' in df.columns:
    # Encode urban/rural as numeric (1=Urban, 2=Rural)
    urban_map = {'Urban': 1, 'Rural': 2}
    urban_encoded = df['urban_or_rural_area'].map(urban_map).fillna(0)
    
    # Encode speed band
    speed_band_map = {'<=30': 1, '40-50': 2, '60-70': 3, '>70': 4, 'unknown': 0}
    speed_encoded = df['speed_limit_band'].map(speed_band_map).fillna(0)
    
    df['urban_x_speed_band'] = urban_encoded * speed_encoded
    print("✓ Interaction 3: urban_x_speed_band")

print(f"\n✓ Created {3} interaction features")
interact_cols = [col for col in df.columns if '_x_' in col]
print(f"Interaction columns: {interact_cols}")

✓ Interaction 1: speed_x_vehicles
✓ Interaction 3: urban_x_speed_band

✓ Created 3 interaction features
Interaction columns: ['speed_x_vehicles', 'urban_x_speed_band']


## Final Feature Set & Data Quality Check

In [26]:
# Check for missing values in key columns
print("Missing values in engineered features:")
engineered_cols = ['hour_of_day', 'day_of_week', 'is_weekend', 'month', 'season', 'is_rush_hour',
                   'temp_mean', 'temp_range', 'rain_mm', 'rain_category', 'is_frost_day',
                   'log_population', 'speed_limit_band', 'speed_x_vehicles', 'rush_x_rain', 'urban_x_speed_band']
missing_cols = [col for col in engineered_cols if col in df.columns]
print(df[missing_cols].isnull().sum())

# Fill remaining missing values
for col in missing_cols:
    if df[col].dtype in ['float64', 'int64']:
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna('unknown')

print(f"\n✓ Data quality check complete")
print(f"Total records: {len(df)}")
print(f"Total features: {len(df.columns)}")
print(f"Shape: {df.shape}")

Missing values in engineered features:
hour_of_day           48471
day_of_week               0
is_weekend                0
month                     0
season                    0
is_rush_hour              0
speed_limit_band          0
speed_x_vehicles          0
urban_x_speed_band        0
dtype: int64

✓ Data quality check complete
Total records: 48471
Total features: 57
Shape: (48471, 57)


## Save to Database
Persist the feature-engineered dataset to PostgreSQL for ML pipeline.

In [27]:
# Save to PostgreSQL
df.to_sql('feature_engineered_collisions', engine, if_exists='replace', index=False)

# Verify
verify_query = "SELECT COUNT(*) FROM feature_engineered_collisions"
count = pd.read_sql(verify_query, engine).iloc[0, 0]

print(f"✓ Successfully saved {count} records to 'feature_engineered_collisions' table")
print(f"✓ Feature engineering complete!")

✓ Successfully saved 48471 records to 'feature_engineered_collisions' table
✓ Feature engineering complete!
