# Predicting Traffic Incident Severity Based on Weather and Time Factors

## 1.0 Description

The goal of this project is to develop a predictive model that can estimate the severity of traffic incidents based on various factors such as weather conditions (rain, snow, fog, etc.) and time-related elements (time of day, day of the week, and holidays). The model will utilize historical traffic incident data, weather patterns, and temporal factors to predict the likelihood of incidents leading to severe outcomes, such as injuries or fatalities, rather than minor disruptions. This will help transportation authorities, emergency services, and city planners make data-driven decisions, optimize response strategies, and improve public safety.

## 1.1 Business Understanding

Traffic incidents contribute significantly to congestion, injuries, fatalities, and economic losses. Understanding the factors that influence their severity can help reduce the overall impact on society. By using data science to predict the severity of traffic incidents from weather- and time-related variables, transportation agencies can:

- ```Improve Safety```: Predicting severe incidents allows for timely interventions, such as dispatching emergency services more effectively.


- ```Optimize Resource Allocation```: Traffic management and emergency responders can allocate resources in advance based on predicted severity, ensuring quicker response times in critical situations.

- ```Enhance Traffic Management```: Better understanding of incident severity can guide traffic signal optimization, road closures, and detour planning to minimize disruptions.

- ```Promote Public Awareness```: Through predictive insights, authorities can inform drivers about weather-related risks and encourage safer driving practices during high-risk periods.

This project aims to create a solution that helps reduce the severity of traffic incidents while improving overall traffic flow and safety.



## 1.2 Objectives

1. Data collection: the data was sourced from https://data.sfgov.org/Public-Safety/Traffic-Crashes-Resulting-in-Injury/ubvf-ztfx/about_data

2. Data cleaning

3. Exploratory data analysis

4. Feature engineering

5. Model development

6. Model evaluation

7. Model deployment

## 1.3 Stakeholders

- ```Transportation Authorities```: Local and regional traffic management departments who would benefit from predictive tools to optimize response times and reduce traffic disruptions.

- ```Emergency Services```: Police, fire departments, and medical teams who could use severity predictions to prepare resources and prioritize high-risk incidents.
- ```City Planners and Government Agencies```: Municipal decision makers focused on infrastructure planning and public safety initiatives could use these insights to improve roadways and safety measures.

- ```Public and Drivers```: The general public will benefit indirectly through increased safety, fewer severe accidents, and enhanced traffic management.

- ```Insurance Companies```: Insurers could use severity predictions to optimize their pricing models, assess risk in real-time, and process claims more efficiently.
- ```Weather Services```: Weather data providers may collaborate for deeper insights and provide better real-time forecasts for integrating into the system.

- ```Technology Providers```: Companies providing machine learning infrastructure, cloud services, and data collection tools will play a role in the development and deployment of the model.

## 2.0 Necessary libraries

In [1]:
import pandas as pd
import numpy as np


In [4]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
data_path = r'..\Data\Traffic_Crashes_Resulting_in_Injury_20250305.csv'
Crashes = pd.read_csv(data_path)
Crashes.head()


Unnamed: 0,unique_id,cnn_intrsctn_fkey,cnn_sgmt_fkey,case_id_pkey,tb_latitude,tb_longitude,geocode_source,geocode_location,collision_datetime,collision_date,...,data_updated_at,data_loaded_at,analysis_neighborhood,supervisor_district,police_district,Current Police Districts,Current Supervisor Districts,Analysis Neighborhoods,Neighborhoods,SF Find Neighborhoods
0,82049,20208000.0,8087000.0,230041955,37.734019,-122.388046,SFPD-INTERIM DB,CITY STREET,01/18/2023 05:53:00 PM,2023 January 18,...,01/21/2025 12:00:00 AM,02/10/2025 01:42:44 PM,Bayview Hunters Point,10.0,BAYVIEW,2.0,9.0,1.0,86.0,86.0
1,82166,25723000.0,805000.0,230111655,37.762886,-122.428578,SFPD-INTERIM DB,CITY STREET,02/15/2023 09:30:00 AM,2023 February 15,...,01/21/2025 12:00:00 AM,02/10/2025 01:42:44 PM,Castro/Upper Market,8.0,MISSION,3.0,5.0,5.0,37.0,37.0
2,41951,32862000.0,7826101.0,3491922,37.768636,-122.454858,SFPD-CROSSROADS,CITY STREET,11/11/2007 03:50:00 PM,2007 November 11,...,04/26/2023 12:00:00 AM,02/10/2025 01:42:44 PM,Golden Gate Park,5.0,PARK,7.0,11.0,12.0,9.0,9.0
3,48546,23904000.0,,190523857,37.780363,-122.39908,SFPD-INTERIM DB,CITY STREET,07/19/2019 01:50:00 PM,2019 July 19,...,04/26/2023 12:00:00 AM,02/10/2025 01:42:44 PM,South of Market,6.0,SOUTHERN,1.0,10.0,34.0,32.0,32.0
4,35692,26705000.0,,170390767,37.804146,-122.42511,SFPD-CROSSROADS,CITY STREET,05/11/2017 07:53:00 AM,2017 May 11,...,04/26/2023 12:00:00 AM,02/10/2025 01:42:44 PM,Russian Hill,2.0,CENTRAL,6.0,6.0,32.0,98.0,98.0


In [5]:
Crashes.shape

(61229, 63)

The dataset has 61,229 rows and 63 columns.

In [6]:
Crashes.columns

Index(['unique_id', 'cnn_intrsctn_fkey', 'cnn_sgmt_fkey', 'case_id_pkey',
       'tb_latitude', 'tb_longitude', 'geocode_source', 'geocode_location',
       'collision_datetime', 'collision_date', 'collision_time',
       'accident_year', 'month', 'day_of_week', 'time_cat', 'juris',
       'officer_id', 'reporting_district', 'beat_number', 'primary_rd',
       'secondary_rd', 'distance', 'direction', 'weather_1', 'weather_2',
       'collision_severity', 'type_of_collision', 'mviw', 'ped_action',
       'road_surface', 'road_cond_1', 'road_cond_2', 'lighting',
       'control_device', 'intersection', 'vz_pcf_code', 'vz_pcf_group',
       'vz_pcf_description', 'vz_pcf_link', 'number_killed', 'number_injured',
       'street_view', 'dph_col_grp', 'dph_col_grp_description',
       'party_at_fault', 'party1_type', 'party1_dir_of_travel',
       'party1_move_pre_acc', 'party2_type', 'party2_dir_of_travel',
       'party2_move_pre_acc', 'point', 'data_as_of', 'data_updated_at',
       'data_

In [7]:
# Keep only the columns used in the analysis.
# .copy() makes an independent frame so later assignments do not trigger SettingWithCopyWarning.
selected_columns = [
    'tb_latitude', 'tb_longitude', 'collision_date', 'collision_time',
    'accident_year', 'month', 'day_of_week', 'primary_rd', 'secondary_rd',
    'distance', 'direction', 'weather_1', 'collision_severity',
    'type_of_collision', 'mviw', 'ped_action', 'road_surface', 'road_cond_1',
    'lighting', 'dph_col_grp_description', 'control_device', 'number_killed',
    'number_injured', 'party_at_fault', 'party1_type', 'party1_dir_of_travel',
    'party1_move_pre_acc', 'party2_type', 'party2_dir_of_travel',
    'party2_move_pre_acc',
]
Crashes_selected = Crashes[selected_columns].copy()
Crashes_selected.head()

Unnamed: 0,tb_latitude,tb_longitude,collision_date,collision_time,accident_year,month,day_of_week,primary_rd,secondary_rd,distance,...,control_device,number_killed,number_injured,party_at_fault,party1_type,party1_dir_of_travel,party1_move_pre_acc,party2_type,party2_dir_of_travel,party2_move_pre_acc
0,37.734019,-122.388046,2023 January 18,17:53:00,2023,January,Wednesday,LANE ST,NEWCOMB AVE,68.0,...,,0.0,1,,Driver,West,Proceeding Straight,,,
1,37.762886,-122.428578,2023 February 15,09:30:00,2023,February,Wednesday,17TH ST,CHURCH ST,20.0,...,,0.0,1,,Bicyclist,East,Proceeding Straight,,,
2,37.768636,-122.454858,2007 November 11,15:50:00,2007,November,Sunday,KEZAR DR,WALLER ST,210.0,...,,0.0,2,,Driver,North,Proceeding Straight,,,
3,37.780363,-122.39908,2019 July 19,13:50:00,2019,July,Friday,PERRY ST,04TH ST,0.0,...,Functioning,0.0,1,,Bicyclist,East,Making Left Turn,Driver,East,Making Left Turn
4,37.804146,-122.42511,2017 May 11,07:53:00,2017,May,Thursday,BAY ST,VAN NESS AVE,0.0,...,Not Stated,0.0,1,,Driver,North,Not Stated,Pedestrian,West,Not Stated
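
The preview above shows `collision_date` and `collision_time` stored as strings (for example `2023 January 18` and `17:53:00`). The time-related factors named in the description (time of day, day of the week) could be derived from them roughly as follows. This is a minimal sketch, not a step from the notebook: the function name, the new column names, and the time-of-day bin edges are all assumptions, and holidays are not covered.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hour, weekend, and time-of-day columns (hypothetical names)."""
    out = df.copy()
    # Combine the two string columns, e.g. '2023 January 18' + ' ' + '17:53:00'
    stamp = pd.to_datetime(
        out['collision_date'] + ' ' + out['collision_time'],
        format='%Y %B %d %H:%M:%S',
        errors='coerce',  # missing or unparseable values become NaT
    )
    out['hour'] = stamp.dt.hour
    out['is_weekend'] = stamp.dt.dayofweek >= 5  # Monday=0 ... Sunday=6
    # Assumed bin edges: [0, 6) night, [6, 12) morning, [12, 18) afternoon, [18, 24) evening
    out['time_of_day'] = pd.cut(
        out['hour'],
        bins=[0, 6, 12, 18, 24],
        labels=['night', 'morning', 'afternoon', 'evening'],
        right=False,
    )
    return out

# Two rows copied from the preview above
sample = pd.DataFrame({
    'collision_date': ['2023 January 18', '2007 November 11'],
    'collision_time': ['17:53:00', '15:50:00'],
})
features = add_time_features(sample)
print(features[['hour', 'is_weekend', 'time_of_day']])
```

Parsing with an explicit `format` keeps the conversion predictable; the month names are assumed to be in English, as in the preview.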


In [8]:
Crashes_selected.isna().sum()

tb_latitude                 167
tb_longitude                167
collision_date                0
collision_time               60
accident_year                 0
month                         0
day_of_week                   9
primary_rd                    0
secondary_rd                142
distance                     79
direction                     1
weather_1                     0
collision_severity            0
type_of_collision             0
mviw                          0
ped_action                    0
road_surface                  0
road_cond_1                   0
lighting                      0
dph_col_grp_description       1
control_device                0
number_killed                 3
number_injured                0
party_at_fault             6066
party1_type                  11
party1_dir_of_travel         11
party1_move_pre_acc          10
party2_type                4386
party2_dir_of_travel       4384
party2_move_pre_acc        4383
dtype: int64
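
Raw counts are easier to judge as a share of the 61,229 rows: `party_at_fault` is missing in roughly 9.9% of rows and `party2_type` in roughly 7.2%, while most other columns are missing in well under 1%. A small helper along these lines could report both figures together. It is a sketch only; the function name and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame with made-up values, reusing a few column names from the dataset
toy = pd.DataFrame({
    'party_at_fault': [1.0, np.nan, np.nan, 2.0],
    'party2_type': ['Driver', None, 'Pedestrian', 'Driver'],
    'weather_1': ['Clear', 'Rain', 'Clear', 'Fog'],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the call would be `missing_summary(Crashes_selected)`.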

In [10]:
Crashes_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61229 entries, 0 to 61228
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tb_latitude              61062 non-null  float64
 1   tb_longitude             61062 non-null  float64
 2   collision_date           61229 non-null  object 
 3   collision_time           61169 non-null  object 
 4   accident_year            61229 non-null  int64  
 5   month                    61229 non-null  object 
 6   day_of_week              61220 non-null  object 
 7   primary_rd               61229 non-null  object 
 8   secondary_rd             61087 non-null  object 
 9   distance                 61150 non-null  float64
 10  direction                61228 non-null  object 
 11  weather_1                61229 non-null  object 
 12  collision_severity       61229 non-null  object 
 13  type_of_collision        61229 non-null  object 
 14  mviw                  

In [18]:

from sklearn.impute import SimpleImputer

# Impute numeric columns with the column median
numeric_cols = ['party_at_fault', 'number_killed', 'distance']
imputer = SimpleImputer(strategy='median')
Crashes_selected[numeric_cols] = imputer.fit_transform(Crashes_selected[numeric_cols])

# Fill missing second-party categorical fields with 'Unknown'
categorical_cols = ['party2_type', 'party2_dir_of_travel', 'party2_move_pre_acc']
Crashes_selected[categorical_cols] = Crashes_selected[categorical_cols].fillna('Unknown')

# day_of_week has only a few missing values, so fill them with the mode
Crashes_selected['day_of_week'] = Crashes_selected['day_of_week'].fillna(Crashes_selected['day_of_week'].mode()[0])

# Drop rows still missing location, road, date/time, first-party, or collision-group fields
Crashes_selected = Crashes_selected.dropna(subset=['secondary_rd', 'tb_latitude', 'tb_longitude', 'collision_date', 'collision_time', 'party1_type', 'party1_dir_of_travel', 'party1_move_pre_acc', 'dph_col_grp_description', 'direction'])
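
Median imputation hides which values were originally missing, which matters most for `party_at_fault`, where 6,066 values are filled. One option, not used in this notebook, is to record a flag before filling. The sketch below uses plain pandas `fillna` rather than `SimpleImputer`, and the helper name and the `_was_missing` suffix are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_median_with_flag(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Fill missing values in col with its median and record which rows were filled."""
    out = df.copy()
    # Record missingness first, because it cannot be recovered after filling
    out[f'{col}_was_missing'] = out[col].isna()
    out[col] = out[col].fillna(out[col].median())
    return out

# Toy data with made-up values
toy = pd.DataFrame({'distance': [0.0, 20.0, np.nan, 210.0]})
result = impute_median_with_flag(toy, 'distance')
print(result)
```

A model can then use the flag as an extra feature, so the imputed rows are not treated as if they were observed values.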



In [19]:
Crashes_selected.isna().sum()

tb_latitude                0
tb_longitude               0
collision_date             0
collision_time             0
accident_year              0
month                      0
day_of_week                0
primary_rd                 0
secondary_rd               0
distance                   0
direction                  0
weather_1                  0
collision_severity         0
type_of_collision          0
mviw                       0
ped_action                 0
road_surface               0
road_cond_1                0
lighting                   0
dph_col_grp_description    0
control_device             0
number_killed              0
number_injured             0
party_at_fault             0
party1_type                0
party1_dir_of_travel       0
party1_move_pre_acc        0
party2_type                0
party2_dir_of_travel       0
party2_move_pre_acc        0
dtype: int64

In [21]:
Crashes_selected.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60948 entries, 0 to 61228
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tb_latitude              60948 non-null  float64
 1   tb_longitude             60948 non-null  float64
 2   collision_date           60948 non-null  object 
 3   collision_time           60948 non-null  object 
 4   accident_year            60948 non-null  int64  
 5   month                    60948 non-null  object 
 6   day_of_week              60948 non-null  object 
 7   primary_rd               60948 non-null  object 
 8   secondary_rd             60948 non-null  object 
 9   distance                 60948 non-null  float64
 10  direction                60948 non-null  object 
 11  weather_1                60948 non-null  object 
 12  collision_severity       60948 non-null  object 
 13  type_of_collision        60948 non-null  object 
 14  mviw                  
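
Comparing the two `info()` outputs, the frame went from 61,229 to 60,948 entries. The arithmetic can be checked with a short helper (the function name is an assumption):

```python
def rows_removed(before, after):
    """Return the number of rows removed and that number as a percentage of the original."""
    removed = before - after
    return removed, round(removed / before * 100, 2)

# Row counts taken from the two info() outputs
removed, pct = rows_removed(61229, 60948)
print(f'{removed} rows removed ({pct}% of the original)')
# prints: 281 rows removed (0.46% of the original)
```

Losing under half a percent of the rows suggests the drop has little effect on the overall sample size.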

In [24]:
# Flag outliers in each numeric column with the 1.5 * IQR rule
outliers_multiple_columns = pd.DataFrame()

for col in Crashes_selected.select_dtypes(include=['float64', 'int64']).columns:  # numeric columns only
    Q1 = Crashes_selected[col].quantile(0.25)
    Q3 = Crashes_selected[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Rows whose value falls outside [lower_bound, upper_bound]
    outliers = Crashes_selected[(Crashes_selected[col] < lower_bound) | (Crashes_selected[col] > upper_bound)]
    outliers_multiple_columns[col] = outliers[col]

# Note: .sum() adds up the flagged values themselves; it does not count them
outliers_multiple_columns.sum()


tb_latitude            0.000000
tb_longitude     -214378.190337
accident_year          0.000000
distance          211870.000000
number_killed         20.000000
number_injured       838.000000
party_at_fault        20.000000
dtype: float64
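
The figures above are sums of the flagged values, not counts, which is why `tb_longitude` is a large negative number (San Francisco longitudes are around -122). Because each column is assigned into an initially empty DataFrame, later columns are also aligned to the index set by an earlier one, so some flagged rows may be left out of the totals. If the goal is the number of outliers per column, a boolean mask avoids both issues. The following is a minimal sketch with a hypothetical function name and made-up toy data.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()  # summing booleans gives a count per column

# Toy data with made-up values
toy = pd.DataFrame({
    'distance': [0.0, 10.0, 20.0, 30.0, 5000.0],
    'number_injured': [1, 1, 1, 1, 1],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data the call would be `iqr_outlier_counts(Crashes_selected)`. For count-like columns such as `number_killed`, where most values are 0, the IQR is 0 and every nonzero value is flagged, so the rule should be read with that in mind.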