# 1. Data Retrieval

## 1.1 Data Source

The dataset used in this project is the **US Accidents** dataset sourced from Kaggle. The file loaded here, `US_Accidents_March23.csv`, contains 7,728,394 records across 46 columns.

- **Source:** [Kaggle - US Accidents Dataset](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents)
- **Format:** CSV
- **Contents:** Accident records with location, time, severity, weather conditions, and nearby road features (e.g., crossings, junctions, traffic signals).

In [16]:
# Import required libraries
import pandas as pd

# Load the accidents data
accidents_df = pd.read_csv('../data/US_Accidents_March23.csv')  # Adjust path if needed

# Show basic information
print(f"Shape of dataset: {accidents_df.shape}")
accidents_df.head()

Shape of dataset: (7728394, 46)


Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,Source2,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,0.01,...,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,Source2,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,0.01,...,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,Source2,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,0.01,...,False,False,False,False,True,False,Night,Night,Day,Day
3,A-4,Source2,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,0.01,...,False,False,False,False,False,False,Night,Day,Day,Day
4,A-5,Source2,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,0.01,...,False,False,False,False,True,False,Day,Day,Day,Day
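
With roughly 7.7 million rows, loading every column at default dtypes can use a lot of memory. The following is a minimal sketch, not the notebook's own method, of restricting `pd.read_csv` to selected columns and compact dtypes. The in-memory CSV, the column subset `wanted_columns`, and the mapping `column_dtypes` are illustrative assumptions; for the real file, the path would be passed instead of the buffer.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV so the sketch runs anywhere.
# For the actual data, pass the file path instead of the StringIO buffer.
sample_csv = io.StringIO(
    "ID,Source,Severity,Start_Time,State\n"
    "A-1,Source2,3,2016-02-08 05:46:00,OH\n"
    "A-2,Source2,2,2016-02-08 06:07:59,OH\n"
)

# Hypothetical column subset and dtypes; adjust to the analysis at hand.
wanted_columns = ["ID", "Severity", "Start_Time", "State"]
column_dtypes = {"Severity": "int8", "State": "category"}

# usecols skips unwanted columns at parse time; dtype avoids wide defaults.
sample_df = pd.read_csv(sample_csv, usecols=wanted_columns, dtype=column_dtypes)

print(sample_df.dtypes)
print(f"Approximate memory: {sample_df.memory_usage(deep=True).sum()} bytes")
```

Whether this is worthwhile depends on which columns the later analysis needs; the full load used above is simpler when memory is not a constraint.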


## 2.1 Convert Time Columns to Datetime Format

In [17]:
# Convert 'Start_Time' and 'End_Time' to datetime objects.
# Note: with errors='coerce', any timestamp that does not match the inferred
# format is silently replaced by NaT, so check the missing-value counts afterwards.
accidents_df['Start_Time'] = pd.to_datetime(accidents_df['Start_Time'], errors='coerce')
accidents_df['End_Time'] = pd.to_datetime(accidents_df['End_Time'], errors='coerce')

# Check the resulting dtypes (this does not reveal values coerced to NaT)
print("Start_Time type after conversion:", accidents_df['Start_Time'].dtype)
print("End_Time type after conversion:", accidents_df['End_Time'].dtype)

Start_Time type after conversion: datetime64[ns]
End_Time type after conversion: datetime64[ns]
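
A matching dtype does not guarantee that every value was parsed. The sketch below, on two made-up timestamps, illustrates how a column mixing timestamps with and without fractional seconds may lose values under format inference, and how `format='mixed'` (available in pandas 2.0 and later) parses each value on its own. The sample strings are assumptions chosen for illustration, not rows verified against the dataset.

```python
import pandas as pd

# Two timestamp layouts of the kind that can coexist in one column:
# one without and one with fractional seconds (illustrative values).
raw_times = pd.Series([
    "2016-02-08 05:46:00",
    "2016-02-08 06:07:59.000000000",
])

# Default behaviour in pandas 2.x: the format is inferred from the first value,
# and values that do not match it may become NaT when errors='coerce' is set.
inferred = pd.to_datetime(raw_times, errors="coerce")

# format='mixed' (pandas >= 2.0) parses each value individually.
mixed = pd.to_datetime(raw_times, format="mixed", errors="coerce")

print("NaT count with inferred format:", inferred.isna().sum())
print("NaT count with format='mixed': ", mixed.isna().sum())
```

Comparing the two NaT counts on the raw string column is a quick way to tell whether coercion is discarding valid timestamps.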


## 2.2 Create New Feature: Extract Hour from Start_Time

In [18]:
# Extract hour of accident from Start_Time
accidents_df['Hour'] = accidents_df['Start_Time'].dt.hour

# View distribution of accidents by hour
accidents_df['Hour'].value_counts().sort_index()

Hour
0.0      98452
1.0      85743
2.0      82394
3.0      74229
4.0     149077
5.0     209579
6.0     375179
7.0     546789
8.0     541643
9.0     334067
10.0    313625
11.0    322215
12.0    316904
13.0    352361
14.0    394697
15.0    463389
16.0    520177
17.0    516626
18.0    390621
19.0    267045
20.0    201883
21.0    169500
22.0    148605
23.0    110428
Name: count, dtype: int64
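
The hours above appear as floats (`0.0`, `1.0`, ...) because rows with a missing `Start_Time` produce NaN, which forces a float dtype. As a small sketch on toy data, the nullable `Int64` dtype keeps integer hours while still marking the gaps; the variable names here are illustrative only.

```python
import pandas as pd

# Toy timestamps with one missing value (NaT).
start_times = pd.Series(pd.to_datetime(
    ["2016-02-08 05:46:00", None, "2016-02-08 17:30:00"]
))

# .dt.hour yields floats when NaT is present, since NaN needs a float dtype.
hour_float = start_times.dt.hour

# The nullable 'Int64' dtype keeps integer hours and marks gaps as <NA>.
hour_int = hour_float.astype("Int64")

# dropna=False makes the number of rows without an hour visible in the counts.
print(hour_int.value_counts(dropna=False))
```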

## 2.3 Handle Missing Values

In [19]:
# Check missing values summary
missing_values = accidents_df.isnull().sum()
missing_percentage = (missing_values / len(accidents_df)) * 100

missing_summary = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percentage
})
missing_summary = missing_summary[missing_summary['Missing Values'] > 0].sort_values(by='Percentage (%)', ascending=False)

print("Summary of Missing Values:")
missing_summary

Summary of Missing Values:


Column,Missing Values,Percentage (%)
End_Lat,3402762,44.029355
End_Lng,3402762,44.029355
Precipitation(in),2203586,28.512858
Wind_Chill(F),1999019,25.865904
End_Time,743166,9.616047
Start_Time,743166,9.616047
Hour,743166,9.616047
Wind_Speed(mph),571233,7.391355
Visibility(mi),177098,2.291524
Wind_Direction,175206,2.267043
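
The summary above only reports the gaps. One possible way to act on such a summary is sketched below on a toy frame: drop columns that are mostly empty, drop rows lacking the key timestamp, and fill remaining numeric gaps with the median. The cut-off `MAX_MISSING_SHARE` and the choice of median imputation are assumptions for illustration, not decisions taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the pattern above: a mostly-empty column, a sparsely
# missing numeric column, and a key timestamp with a gap.
toy_df = pd.DataFrame({
    "Start_Time": pd.to_datetime(
        ["2016-02-08 05:46:00", None, "2016-02-08 07:23:34", "2016-02-08 07:39:07"]
    ),
    "End_Lat": [np.nan, np.nan, np.nan, 39.6],
    "Visibility(mi)": [10.0, 9.0, np.nan, 7.0],
})

MAX_MISSING_SHARE = 0.40  # hypothetical cut-off, not taken from the notebook

# 1. Drop columns whose share of missing values exceeds the cut-off.
missing_share = toy_df.isna().mean()
too_sparse = missing_share[missing_share > MAX_MISSING_SHARE].index
cleaned = toy_df.drop(columns=too_sparse)

# 2. Drop rows lacking the timestamp that later features depend on.
cleaned = cleaned.dropna(subset=["Start_Time"])

# 3. Fill the remaining numeric gaps with the column median.
visibility_median = cleaned["Visibility(mi)"].median()
cleaned["Visibility(mi)"] = cleaned["Visibility(mi)"].fillna(visibility_median)

print(cleaned)
```

Dropping versus imputing changes what downstream statistics mean, so the thresholds should be chosen with the analysis goal in mind.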


## 2.4 List Dataset Columns

In [20]:
print(accidents_df.columns.tolist())

['ID', 'Source', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Hour']


## 2.5 Data Enrichment

In [21]:
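# =====================================
# Feature Engineering (Data Enrichment)
# =====================================
import numpy as np

# Re-apply the datetime conversion with format='mixed' (parses each value individually).
# Note: the columns were already converted in section 2.1, so values coerced to NaT
# there are not recovered here; for full effect, run this on the freshly loaded strings.
accidents_df['Start_Time'] = pd.to_datetime(accidents_df['Start_Time'], format='mixed', errors='coerce')
accidents_df['End_Time'] = pd.to_datetime(accidents_df['End_Time'], format='mixed', errors='coerce')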

# Create Duration feature (in minutes)
accidents_df['Duration_minutes'] = (accidents_df['End_Time'] - accidents_df['Start_Time']).dt.total_seconds() / 60

# Extract Start Hour, Day of Week (0 = Monday ... 6 = Sunday), and Month from Start_Time
# Note: Start_Hour duplicates the 'Hour' column created in section 2.2
accidents_df['Start_Hour'] = accidents_df['Start_Time'].dt.hour
accidents_df['Start_DayOfWeek'] = accidents_df['Start_Time'].dt.dayofweek
accidents_df['Start_Month'] = accidents_df['Start_Time'].dt.month

# Categorize accidents into Day or Night
# Day: hours 6 through 18 inclusive (06:00 to 18:59), otherwise Night.
# Rows with a missing Start_Hour also fall into 'Night' because the comparison is False.
accidents_df['Day_Night'] = np.where((accidents_df['Start_Hour'] >= 6) & (accidents_df['Start_Hour'] <= 18), 'Day', 'Night')

# Create Weekend flag
# 1 if Saturday (5) or Sunday (6), else 0; rows with a missing day of week get 0
accidents_df['Is_Weekend'] = (accidents_df['Start_DayOfWeek'] >= 5).astype(int)

# Create Season from Month (optional enrichment, meteorological seasons)
def map_season(month):
    # Leave rows without a month as missing instead of defaulting to 'Fall'
    if pd.isna(month):
        return np.nan
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

accidents_df['Season'] = accidents_df['Start_Month'].apply(map_season)

# Preview the enriched dataset
print(f"Shape after feature engineering: {accidents_df.shape}")
display(accidents_df.head())

Shape after feature engineering: (7728394, 54)


Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),...,Nautical_Twilight,Astronomical_Twilight,Hour,Duration_minutes,Start_Hour,Start_DayOfWeek,Start_Month,Day_Night,Is_Weekend,Season
0,A-1,Source2,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,0.01,...,Night,Night,5.0,314.0,5.0,0.0,2.0,Night,0,Winter
1,A-2,Source2,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,0.01,...,Night,Day,6.0,30.0,6.0,0.0,2.0,Day,0,Winter
2,A-3,Source2,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,0.01,...,Day,Day,6.0,30.0,6.0,0.0,2.0,Day,0,Winter
3,A-4,Source2,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,0.01,...,Day,Day,7.0,30.0,7.0,0.0,2.0,Day,0,Winter
4,A-5,Source2,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,0.01,...,Day,Day,7.0,30.0,7.0,0.0,2.0,Day,0,Winter
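
Because some rows have no parsed `Start_Time`, derived labels can silently absorb those rows into a default category. The following sketch, on three toy timestamps, shows one way to keep such rows as missing while deriving a day/night label and a season through a vectorized lookup. The helper names (`day_night`, `season_by_month`) are illustrative, and the 6 to 18 hour window simply mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy timestamps: an early-morning winter row, an evening summer row, one gap.
start = pd.Series(pd.to_datetime(
    ["2016-02-08 05:46:00", "2016-07-16 18:30:00", None]
))
hour = start.dt.hour
month = start.dt.month

# Day/Night with the 6-18 hour window, but leaving rows without a timestamp
# as missing rather than labelling them 'Night'.
day_night = pd.Series(np.where(hour.between(6, 18), "Day", "Night"), index=start.index)
day_night = day_night.where(hour.notna())

# Season via a lookup table; Series.map leaves unmatched (NaN) months as NaN.
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
season = month.map(season_by_month)

print(pd.DataFrame({"Start_Time": start, "Day_Night": day_night, "Season": season}))
```

A dictionary lookup with `Series.map` also avoids a Python-level function call per row, which can matter on a dataset of this size.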
