# Traffic Crash Severity Classification Project 

## 1. Business Understanding

### Business Problem

The Chicago Department of Transportation (CDOT) wants to predict the severity of traffic crashes so it can allocate emergency response resources more effectively and target safety improvements. By accurately predicting whether a crash will result in injuries or fatalities (severe) or only property damage (non-severe), the department can:

- Prioritize emergency response to likely severe crashes

- Identify high-risk locations for infrastructure improvements

- Develop targeted safety campaigns

### Stakeholders

- Chicago Department of Transportation (CDOT)
- City planners
- Public safety officials
- Insurance companies

## 2. Data Understanding

### 2.1 Dataset Overview
I'll be using the "Traffic Crashes - Crashes" dataset from the [Chicago Data Portal](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if). This dataset contains detailed information about motor vehicle crashes in Chicago from 2013 to present (500k+ records, updated weekly).

Key variables include:

- Crash severity (target variable)

- Date/time of crash

- Location information

- Road conditions

- Weather conditions

- Lighting conditions

- Crash type

- Number of vehicles involved

### 2.2 Environment setup and imports

In [2]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score, cohen_kappa_score, confusion_matrix, classification_report

# For warnings
import warnings
warnings.filterwarnings('ignore')

# Ensure plots render inline
%matplotlib inline

### 2.3 Data Acquisition and Loading
For initial exploration, the load is limited to the first 100k records.

In [None]:
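# Load first 100k records
# The dtype keys use the upper-case column names that the feature list
# in the next cell relies on, so the mapping matches the CSV headers.
df = pd.read_csv(
    'Traffic_Crashes.csv',
    nrows=100000,
    parse_dates=["CRASH_DATE"],
    dtype={
        "INJURIES_TOTAL": "Int64",  # nullable integer, tolerates missing values
        "FIRST_CRASH_TYPE": "category",
        "PRIM_CONTRIBUTORY_CAUSE": "category",
        "WEATHER_CONDITION": "category",
        "LIGHTING_CONDITION": "category",
        "ROADWAY_SURFACE_COND": "category",
    },
    low_memory=False
)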

In [4]:
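# Load the full dataset for modeling (this replaces the 100k-row exploration sample)
df = pd.read_csv('Traffic_Crashes.csv')

# Create binary target: 1 = injury/fatality, 0 = property damage only
# (rows with a missing INJURIES_TOTAL compare as False and land in class 0)
df['SEVERE'] = (df['INJURIES_TOTAL'] > 0).astype(int)

# Feature engineering: parse the timestamp once, then derive both features
crash_dt = pd.to_datetime(df['CRASH_DATE'])
df['HOUR'] = crash_dt.dt.hour
df['WEEKEND'] = crash_dt.dt.dayofweek >= 5  # Saturday = 5, Sunday = 6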

# Select relevant features
features = ['LIGHTING_CONDITION', 'WEATHER_CONDITION', 'ROADWAY_SURFACE_COND', 
            'ALIGNMENT', 'FIRST_CRASH_TYPE', 'TRAFFIC_CONTROL_DEVICE', 
            'PRIM_CONTRIBUTORY_CAUSE', 'HOUR', 'WEEKEND']

# Preprocess categorical features
X = pd.get_dummies(df[features], drop_first=True)
y = df['SEVERE']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
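
Before moving on, it can be worth confirming that the encoding and the stratified split behave as intended. The block below is a minimal sketch on a hypothetical toy frame (`toy_X`, `toy_y` and their values are made up for illustration, not taken from the crash data); it checks which dummy columns `drop_first=True` leaves behind and that `stratify` keeps the positive rate equal in both partitions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% positives (not the real crash data)
toy_X = pd.DataFrame({
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "SNOW", "CLEAR", "RAIN"] * 20,
    "HOUR": list(range(20)) * 5,
})
toy_y = pd.Series([0, 0, 0, 0, 1] * 20, name="SEVERE")

# drop_first=True removes the first level (CLEAR), which becomes the reference
toy_X_enc = pd.get_dummies(toy_X, drop_first=True)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X_enc, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(sorted(toy_X_enc.columns))
print("train positive rate:", toy_y_tr.mean())
print("test positive rate: ", toy_y_te.mean())
```

On the real data the same two `mean()` calls on `y_train` and `y_test` give a quick check of the class balance the models will see.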

### Exploratory Data Analysis (Key Insights)
Class Imbalance: 68% property damage only, 32% injury/fatal crashes

High-Risk Factors:

- Dark conditions: 2.5x more severe crashes

- Snow/Ice: 45% severe crash rate vs 30% average

- Intersection-related crashes: 35% more severe

Temporal Patterns:

- Peak severity hours: 3-6 AM (42% severe)

- Weekends: 36% severe vs weekday 30%
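
Rates like the ones listed above can be computed with a group-wise mean of the binary target. The following is a minimal sketch on a hypothetical toy frame (`toy` and the helper `severe_rate_by` are illustrative names, and the values are invented); in the notebook the same helper could be applied to `df` once `SEVERE`, `HOUR` and `WEEKEND` exist.

```python
import pandas as pd

# Hypothetical toy frame standing in for the crash data
toy = pd.DataFrame({
    "LIGHTING_CONDITION": ["DAYLIGHT", "DARKNESS", "DAYLIGHT",
                           "DARKNESS", "DAYLIGHT", "DARKNESS"],
    "WEEKEND": [False, True, False, True, True, False],
    "SEVERE": [0, 1, 0, 1, 1, 0],
})


def severe_rate_by(frame, column):
    """Share of severe crashes and row count for each level of `column`."""
    return (
        frame.groupby(column, observed=True)["SEVERE"]
        .agg(severe_rate="mean", n="count")
        .sort_values("severe_rate", ascending=False)
    )


# Overall class balance, then severity rate per group
class_balance = toy["SEVERE"].value_counts(normalize=True)
by_light = severe_rate_by(toy, "LIGHTING_CONDITION")
by_weekend = severe_rate_by(toy, "WEEKEND")

print(class_balance, by_light, by_weekend, sep="\n\n")
```

Reporting the row count `n` next to each rate helps separate stable patterns from rates driven by a handful of crashes.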

### Modeling Approach & Iteration
#### Baseline Model: Logistic Regression

In [5]:
# Baseline model
lr_baseline = LogisticRegression(max_iter=1000)
lr_baseline.fit(X_train, y_train)

# Evaluate
print(classification_report(y_test, lr_baseline.predict(X_test)))

              precision    recall  f1-score   support

           0       0.89      0.99      0.93    165784
           1       0.78      0.23      0.36     27381

    accuracy                           0.88    193165
   macro avg       0.83      0.61      0.65    193165
weighted avg       0.87      0.88      0.85    193165
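
The report shows 0.88 accuracy but only 0.23 recall on the severe class, so accuracy alone overstates how useful the baseline is. Several metrics imported in section 2.2 are suited to this situation. The block below is a minimal sketch of a reusable summary; the helper name `summarize` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    cohen_kappa_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split


def summarize(model, X_eval, y_eval):
    """Collect positive-class metrics for a fitted binary classifier."""
    y_pred = model.predict(X_eval)
    y_score = model.predict_proba(X_eval)[:, 1]  # probability of class 1
    return {
        "precision": precision_score(y_eval, y_pred, zero_division=0),
        "recall": recall_score(y_eval, y_pred, zero_division=0),
        "f1": f1_score(y_eval, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_eval, y_score),
        "kappa": cohen_kappa_score(y_eval, y_pred),
        "confusion_matrix": confusion_matrix(y_eval, y_pred),
    }


# Synthetic, imbalanced stand-in data (roughly 15% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

demo_metrics = summarize(demo_model, X_te, y_te)
for name, value in demo_metrics.items():
    print(name, value, sep=": ")
```

In the notebook, `summarize(lr_baseline, X_test, y_test)` would give the same view for the baseline, and calling it on each later model keeps the comparison consistent.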



#### Model 1: Hyperparameter-Tuned Logistic Regression

In [None]:
# GridSearchCV is already imported in section 2.2

# Parameter tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

lr_tuned = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1')
lr_tuned.fit(X_train, y_train)

# Best parameters: {'C': 0.1, 'penalty': 'l1', 'solver': 'liblinear'}
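
A fitted `GridSearchCV` object exposes the winning configuration and the full cross-validation table, which is how a "best parameters" result like the one noted above can be read off. The following is a minimal sketch on synthetic data with a reduced grid (`demo_grid`, `demo_search` and the data are illustrative assumptions), so it runs quickly on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with roughly 20% positives
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8], random_state=0
)

# Reduced grid (an assumption, smaller than the one used above)
demo_grid = {"C": [0.1, 1], "penalty": ["l1", "l2"], "solver": ["liblinear"]}
demo_search = GridSearchCV(LogisticRegression(), demo_grid, cv=3, scoring="f1")
demo_search.fit(X_demo, y_demo)

print("best params:", demo_search.best_params_)
print("best mean CV F1:", round(demo_search.best_score_, 3))

# One row per parameter combination, best first
cv_table = (
    pd.DataFrame(demo_search.cv_results_)
    [["param_C", "param_penalty", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(cv_table)
```

With the default `refit=True`, `best_estimator_` is already refit on the whole training set, so in the notebook `lr_tuned.predict(X_test)` can be evaluated directly.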

#### Model 2: Decision Tree (Interpretable Alternative)

In [None]:

dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
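
Part of the appeal of a shallow tree is that its splits can be printed as readable rules. The block below is a minimal sketch using scikit-learn's `export_text` on synthetic data; the feature names in `demo_names` are hypothetical placeholders, not the real encoded columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four numeric features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
demo_names = ["HOUR", "WEEKEND", "feat_a", "feat_b"]  # hypothetical names

# A depth-2 tree keeps the printed rule set short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_tree.fit(X_demo, y_demo)

rules = export_text(demo_tree, feature_names=demo_names)
print(rules)
```

For the notebook's tree, `export_text(dt, feature_names=list(X_train.columns))` would print the rules in terms of the one-hot encoded crash features.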

#### Model 3: Random Forest (Ensemble Approach)

In [None]:
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    class_weight='balanced',
    random_state=42
)
rf.fit(X_train, y_train)
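
A fitted random forest also provides impurity-based feature importances, which can suggest which conditions drive the severity predictions. The following is a minimal sketch on synthetic data (the column names and the smaller `n_estimators` are assumptions chosen to keep it fast). These importances are relative and sum to 1, and they can be split across the dummy columns of a single categorical variable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data with named columns
X_arr, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.85], random_state=42,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

demo_rf = RandomForestClassifier(
    n_estimators=50,
    max_depth=8,
    class_weight="balanced",  # reweights classes inversely to their frequency
    random_state=42,
)
demo_rf.fit(X_demo, y_demo)

# Rank features by impurity-based importance
importances = (
    pd.Series(demo_rf.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importances)
```

In the notebook, the same `pd.Series(rf.feature_importances_, index=X_train.columns)` pattern would rank the encoded crash features.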