# NYC Traffic Collisions: Predicting Vulnerable Road User Type
**Students:** רועי בנימיני, עוז ניסנבוים

**Goal:** Classify collisions involving vulnerable road users (VRU) as either pedestrian or cyclist incidents.

**Dataset:** [NYC Motor Vehicle Collisions](https://data.gov/) - 2.2M+ collision records

---
## 1. Setup and Data Loading
### 1.1 Import Libraries

In [2]:
# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
import os
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Evaluation
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score, roc_auc_score, roc_curve)

# Clustering
from sklearn.cluster import KMeans

# Display settings
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

print("✅ Libraries loaded successfully")

✅ Libraries loaded successfully


### 1.2 Load Data from GitHub

In [5]:
# Load data from GitHub repository
url = "https://github.com/Roybin12/machine-learning-2-project/raw/main/nyc_collisions_sample.zip"
df = pd.read_csv(url, compression='zip')

print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

Dataset shape: 400,000 rows × 29 columns
Memory usage: 396.2 MB
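
Most of the 396 MB reported above comes from string (object) columns. As a minimal sketch, assuming low-cardinality columns such as BOROUGH and VEHICLE TYPE CODE 1, converting them to the pandas `category` dtype can shrink the footprint. The toy frame below is made up for illustration and is not the project data.

```python
import pandas as pd

# Toy frame standing in for the collisions data (values are made up).
toy = pd.DataFrame({
    "BOROUGH": ["QUEENS", "BROOKLYN", "BRONX", "MANHATTAN"] * 2500,
    "VEHICLE TYPE CODE 1": ["Sedan", "Taxi", "Bike", "Sedan"] * 2500,
})

before_mb = toy.memory_usage(deep=True).sum() / 1024**2

# Low-cardinality string columns store compactly as categoricals.
compact = toy.astype({"BOROUGH": "category", "VEHICLE TYPE CODE 1": "category"})
after_mb = compact.memory_usage(deep=True).sum() / 1024**2

print(f"object dtype: {before_mb:.2f} MB, category dtype: {after_mb:.2f} MB")
```

Whether this is worthwhile for the real data depends on how many distinct values each column has; free-text columns such as street names gain much less.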


### 1.3 Initial Data Exploration

In [6]:
# Dataset overview
print("=== Column Types ===")
print(df.dtypes.value_counts())
print("\n=== Missing Values (% of rows) ===")
missing = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing[missing > 0].round(1))
print("\n=== First 3 Rows ===")
df.head(3)

=== Column Types ===
object     18
int64       7
float64     4
dtype: int64

=== Missing Values (% of rows) ===
VEHICLE TYPE CODE 5              99.6
CONTRIBUTING FACTOR VEHICLE 5    99.5
VEHICLE TYPE CODE 4              98.4
CONTRIBUTING FACTOR VEHICLE 4    98.3
VEHICLE TYPE CODE 3              93.1
CONTRIBUTING FACTOR VEHICLE 3    92.8
OFF STREET NAME                  82.3
CROSS STREET NAME                38.2
ZIP CODE                         30.5
BOROUGH                          30.5
ON STREET NAME                   21.8
VEHICLE TYPE CODE 2              20.1
CONTRIBUTING FACTOR VEHICLE 2    16.1
LOCATION                         10.8
LONGITUDE                        10.8
LATITUDE                         10.8
VEHICLE TYPE CODE 1               0.7
CONTRIBUTING FACTOR VEHICLE 1     0.4
NUMBER OF PERSONS KILLED          0.0
NUMBER OF PERSONS INJURED         0.0
dtype: float64

=== First 3 Rows ===


Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,09/21/2022,9:20,QUEENS,11420.0,40.675106,-73.80979,"(40.675106, -73.80979)",128 STREET,ROCKAWAY BOULEVARD,,2.0,0.0,0,0,0,0,2,0,Failure to Yield Right-of-Way,Unspecified,,,,4566168,Sedan,Station Wagon/Sport Utility Vehicle,,,
1,12/26/2018,12:00,QUEENS,11422.0,40.67452,-73.736084,"(40.67452, -73.736084)",MERRICK BOULEVARD,234 STREET,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4052858,Station Wagon/Sport Utility Vehicle,Box Truck,,,
2,05/12/2020,12:17,STATEN ISLAND,10304.0,40.608982,-74.088135,"(40.608982, -74.088135)",DEKALB STREET,TARGEE STREET,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4313485,Sedan,,,,


## 2. Data Preparation
### 2.1 Filter VRU Collisions & Create Target Variable
We keep only collisions in which a pedestrian or a cyclist (but not both) was injured or killed, and label each one by the type of road user involved.

In [7]:
# Filter: only collisions with pedestrians OR cyclists (not both, for clear classification)
pedestrian_mask = (df['NUMBER OF PEDESTRIANS INJURED'] > 0) | (df['NUMBER OF PEDESTRIANS KILLED'] > 0)
cyclist_mask = (df['NUMBER OF CYCLIST INJURED'] > 0) | (df['NUMBER OF CYCLIST KILLED'] > 0)

# Keep VRU collisions, exclude mixed cases (both pedestrian and cyclist).
# XOR is True when exactly one mask is True, so no second filter with a
# mask indexed on the full df is needed.
vru_df = df[pedestrian_mask ^ cyclist_mask].copy()

# Create target: 0 = Pedestrian, 1 = Cyclist
vru_df['TARGET'] = np.where(
    (vru_df['NUMBER OF PEDESTRIANS INJURED'] > 0) | (vru_df['NUMBER OF PEDESTRIANS KILLED'] > 0), 
    0, 1
)

print(f"Original dataset: {len(df):,} rows")
print(f"VRU collisions: {len(vru_df):,} rows ({len(vru_df)/len(df)*100:.1f}%)")
print(f"\n=== Target Distribution ===")
print(vru_df['TARGET'].value_counts().rename({0: 'Pedestrian', 1: 'Cyclist'}))
print(f"\nClass ratio: {vru_df['TARGET'].value_counts()[0] / vru_df['TARGET'].value_counts()[1]:.1f}:1")

Original dataset: 400,000 rows
VRU collisions: 34,463 rows (8.6%)

=== Target Distribution ===
Pedestrian    23109
Cyclist       11354
Name: TARGET, dtype: int64

Class ratio: 2.0:1
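
To sanity-check the filtering and labelling logic, the sketch below applies the same masks to a five-row toy frame (made-up values) covering each case: pedestrian only, cyclist only, both, neither, and a pedestrian fatality. It is an illustration under those assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy rows: 0 = pedestrian only, 1 = cyclist only, 2 = both,
# 3 = neither, 4 = pedestrian killed (values are made up).
toy = pd.DataFrame({
    "NUMBER OF PEDESTRIANS INJURED": [1, 0, 1, 0, 0],
    "NUMBER OF PEDESTRIANS KILLED":  [0, 0, 0, 0, 1],
    "NUMBER OF CYCLIST INJURED":     [0, 1, 1, 0, 0],
    "NUMBER OF CYCLIST KILLED":      [0, 0, 0, 0, 0],
})

ped = (toy["NUMBER OF PEDESTRIANS INJURED"] > 0) | (toy["NUMBER OF PEDESTRIANS KILLED"] > 0)
cyc = (toy["NUMBER OF CYCLIST INJURED"] > 0) | (toy["NUMBER OF CYCLIST KILLED"] > 0)

# XOR keeps rows where exactly one of the two masks is True.
toy_vru = toy[ped ^ cyc].copy()

# 0 = Pedestrian, 1 = Cyclist (same encoding as TARGET above).
toy_vru["TARGET"] = np.where(ped[toy_vru.index], 0, 1)

print(toy_vru[["TARGET"]])
```

Rows 2 (mixed) and 3 (no VRU) are dropped; the remaining rows get the label of the single road user type involved.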


### 2.2 Feature Engineering - Temporal Features
Extract time-based features: hour, day of week, month, rush hour, and weekend indicators.

In [12]:
# Convert to datetime
vru_df['CRASH_DATETIME'] = pd.to_datetime(vru_df['CRASH DATE'] + ' ' + vru_df['CRASH TIME'])

# Temporal features
vru_df['HOUR'] = vru_df['CRASH_DATETIME'].dt.hour
vru_df['DAY_OF_WEEK'] = vru_df['CRASH_DATETIME'].dt.dayofweek  # 0=Monday, 6=Sunday
vru_df['MONTH'] = vru_df['CRASH_DATETIME'].dt.month
vru_df['YEAR'] = vru_df['CRASH_DATETIME'].dt.year

# Derived features
vru_df['IS_WEEKEND'] = (vru_df['DAY_OF_WEEK'] >= 5).astype(int)
vru_df['IS_RUSH_HOUR'] = vru_df['HOUR'].isin([7, 8, 9, 16, 17, 18]).astype(int)
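# pd.cut bins are right-inclusive: Night = 0-6, Morning = 7-12,
# Afternoon = 13-18, Evening = 19-23
vru_df['TIME_OF_DAY'] = pd.cut(vru_df['HOUR'], bins=[-1, 6, 12, 18, 24], 
                                labels=['Night', 'Morning', 'Afternoon', 'Evening'])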

print("=== Temporal Features Created ===")
print(vru_df[['CRASH_DATETIME', 'HOUR', 'DAY_OF_WEEK', 'MONTH', 'IS_WEEKEND', 'IS_RUSH_HOUR', 'TIME_OF_DAY']].head())

=== Temporal Features Created ===
        CRASH_DATETIME  HOUR  DAY_OF_WEEK  MONTH  IS_WEEKEND  IS_RUSH_HOUR  \
13 2016-07-20 11:42:00    11            2      7           0             0   
22 2014-03-17 21:30:00    21            0      3           0             0   
52 2013-12-29 16:50:00    16            6     12           1             1   
83 2019-12-23 07:04:00     7            0     12           0             1   
93 2017-09-18 10:45:00    10            0      9           0             0   

   TIME_OF_DAY  
13     Morning  
22     Evening  
52   Afternoon  
83     Morning  
93     Morning  
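
Because `pd.cut` uses right-inclusive intervals by default, it can help to confirm which hours land in each TIME_OF_DAY label. The following sketch runs the same bins and labels over all 24 hours and reports the range per label.

```python
import pandas as pd

hours = pd.Series(range(24), name="HOUR")

# Same bins and labels as the TIME_OF_DAY feature above.
time_of_day = pd.cut(hours, bins=[-1, 6, 12, 18, 24],
                     labels=["Night", "Morning", "Afternoon", "Evening"])

# Smallest and largest hour assigned to each label.
summary = hours.groupby(time_of_day, observed=True).agg(["min", "max"])
print(summary)
```

With these bins, hours 0 through 6 are labelled Night while 19 through 23 are labelled Evening, so a 22:00 crash counts as Evening rather than Night.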


### 2.3 Load Data with Geocoded Coordinates
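Merge the raw data with coordinates previously geocoded through the Google Maps API. This cell reloads the raw sample, overwrites LATITUDE and LONGITUDE for the geocoded rows, and then repeats the VRU filter, target creation, and temporal features from sections 2.1 and 2.2 so that the modeling table uses the updated coordinates.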

In [26]:
# ===========================================
# Load raw data + merge geocoded coordinates
# ===========================================

# 1. Load raw data
url_raw = "https://github.com/Roybin12/machine-learning-2-project/raw/main/nyc_collisions_sample.zip"
df = pd.read_csv(url_raw, compression='zip')

# 2. Load geocoded coordinates from previous API run
url_geo = "https://raw.githubusercontent.com/Roybin12/machine-learning-2-project/main/geocoded_locations.csv"
geo_df = pd.read_csv(url_geo)

# 3. Merge geocoded coordinates into raw data
for _, row in geo_df.iterrows():
    idx = int(row['index'])
    if idx in df.index:
        df.loc[idx, 'LATITUDE'] = row['lat']
        df.loc[idx, 'LONGITUDE'] = row['lng']

# 4. Filter VRU only
pedestrian_mask = (df['NUMBER OF PEDESTRIANS INJURED'] > 0) | (df['NUMBER OF PEDESTRIANS KILLED'] > 0)
cyclist_mask = (df['NUMBER OF CYCLIST INJURED'] > 0) | (df['NUMBER OF CYCLIST KILLED'] > 0)
# XOR keeps rows where exactly one mask is True (drops mixed cases)
vru_df = df[pedestrian_mask ^ cyclist_mask].copy()

# 5. Create target variable
vru_df['TARGET'] = np.where(
    (vru_df['NUMBER OF PEDESTRIANS INJURED'] > 0) | (vru_df['NUMBER OF PEDESTRIANS KILLED'] > 0), 
    0, 1
)

# 6. Create temporal features
vru_df['CRASH_DATETIME'] = pd.to_datetime(vru_df['CRASH DATE'] + ' ' + vru_df['CRASH TIME'])
vru_df['HOUR'] = vru_df['CRASH_DATETIME'].dt.hour
vru_df['DAY_OF_WEEK'] = vru_df['CRASH_DATETIME'].dt.dayofweek
vru_df['MONTH'] = vru_df['CRASH_DATETIME'].dt.month
vru_df['IS_WEEKEND'] = (vru_df['DAY_OF_WEEK'] >= 5).astype(int)
vru_df['IS_RUSH_HOUR'] = vru_df['HOUR'].isin([7, 8, 9, 16, 17, 18]).astype(int)

# 7. Select features for modeling
feature_cols = ['BOROUGH', 'LATITUDE', 'LONGITUDE', 'HOUR', 'DAY_OF_WEEK', 'MONTH', 
                'IS_WEEKEND', 'IS_RUSH_HOUR', 'CONTRIBUTING FACTOR VEHICLE 1', 'VEHICLE TYPE CODE 1']
model_df = vru_df[feature_cols + ['TARGET']].copy()

print(f"=== Data Ready ===")
print(f"Raw data: {len(df):,} rows")
print(f"Geocoded locations merged: {len(geo_df):,}")
print(f"VRU records: {len(model_df):,}")
print(f"\nTarget distribution:")
print(model_df['TARGET'].value_counts().rename({0: 'Pedestrian', 1: 'Cyclist'}))

=== Data Ready ===
Raw data: 400,000 rows
Geocoded locations merged: 27,086
VRU records: 34,463

Target distribution:
Pedestrian    23109
Cyclist       11354
Name: TARGET, dtype: int64
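
The row-by-row loop in step 3 works but is slow on large tables. A vectorized alternative is sketched below on toy data, assuming the same column names as `geo_df` (`index`, `lat`, `lng`); the frames `df_toy` and `geo_toy` are hypothetical stand-ins. It also counts how many geocoded entries actually matched a row, which `len(geo_df)` alone does not show.

```python
import pandas as pd

# Toy raw data: two rows are missing coordinates (values are made up).
df_toy = pd.DataFrame(
    {"LATITUDE": [40.70, None, None], "LONGITUDE": [-73.90, None, None]},
    index=[10, 11, 12],
)

# Toy geocoding results; index 99 does not exist in df_toy.
geo_toy = pd.DataFrame({
    "index": [11, 12, 99],
    "lat": [40.71, 40.72, 40.99],
    "lng": [-73.91, -73.92, -73.99],
})

# Keep only entries whose index exists in the raw data; if an index
# repeats, keep the last entry (the loop's effective behaviour).
geo_valid = geo_toy[geo_toy["index"].isin(df_toy.index)].drop_duplicates("index", keep="last")
idx = geo_valid["index"].astype(int).to_numpy()

# Assign all matched coordinates in one step per column.
df_toy.loc[idx, "LATITUDE"] = geo_valid["lat"].to_numpy()
df_toy.loc[idx, "LONGITUDE"] = geo_valid["lng"].to_numpy()

print(f"Rows updated: {len(geo_valid)} of {len(geo_toy)} geocoded entries")
```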


## 3. Data Cleaning
### 3.1 Check Missing Values

In [27]:
# Check missing values in model_df
print("=== Missing Values ===")
for col in model_df.columns:
    missing = model_df[col].isnull().sum()
    pct = missing / len(model_df) * 100
    if missing > 0:
        print(f"{col:35} | {missing:,} ({pct:.1f}%)")

print(f"\nTotal rows: {len(model_df):,}")

=== Missing Values ===
BOROUGH                             | 6,834 (19.8%)
LATITUDE                            | 490 (1.4%)
LONGITUDE                           | 490 (1.4%)
CONTRIBUTING FACTOR VEHICLE 1       | 872 (2.5%)
VEHICLE TYPE CODE 1                 | 1,804 (5.2%)

Total rows: 34,463
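
The same report can be built as a single table instead of a print loop, which makes it easier to sort or filter. This is a sketch on a made-up toy frame; the column names `n_missing` and `pct_missing` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (values are made up).
toy = pd.DataFrame({
    "BOROUGH": ["QUEENS", None, "BRONX", None],
    "LATITUDE": [40.7, np.nan, 40.8, 40.9],
    "HOUR": [9, 12, 17, 23],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Show only columns that have missing values, worst first.
missing_summary = (
    missing_summary[missing_summary["n_missing"] > 0]
    .sort_values("n_missing", ascending=False)
)
print(missing_summary)
```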


### 3.2 Handle Missing Values
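- Coordinates: drop rows with missing latitude or longitude (only 1.4% of rows)
- Categorical features: fill missing values with 'Unknown'
- Vehicle type: keep the 10 most frequent categories and group the rest as 'Other'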

In [28]:
# 1. Drop rows with missing coordinates
df_clean = model_df.dropna(subset=['LATITUDE', 'LONGITUDE']).copy()
print(f"After dropping missing coords: {len(df_clean):,} rows ({len(df_clean)/len(model_df)*100:.1f}%)")

# 2. Fill categorical missing values
df_clean['BOROUGH'] = df_clean['BOROUGH'].fillna('Unknown')
df_clean['CONTRIBUTING FACTOR VEHICLE 1'] = df_clean['CONTRIBUTING FACTOR VEHICLE 1'].fillna('Unknown')
df_clean['VEHICLE TYPE CODE 1'] = df_clean['VEHICLE TYPE CODE 1'].fillna('Unknown')

# 3. Reduce Vehicle Type categories (keep top 10, rest as 'Other')
top_vehicles = df_clean['VEHICLE TYPE CODE 1'].value_counts().nlargest(10).index.tolist()
# Vectorized: keep values that are in the top 10, replace the others
df_clean['VEHICLE TYPE CODE 1'] = df_clean['VEHICLE TYPE CODE 1'].where(
    df_clean['VEHICLE TYPE CODE 1'].isin(top_vehicles), 'Other'
)

print(f"\n=== After Cleaning ===")
print(f"Missing values: {df_clean.isnull().sum().sum()}")
print(f"\nVehicle Type categories: {df_clean['VEHICLE TYPE CODE 1'].nunique()}")
print(df_clean['VEHICLE TYPE CODE 1'].value_counts())

After dropping missing coords: 33,973 rows (98.6%)

=== After Cleaning ===
Missing values: 0

Vehicle Type categories: 11
Sedan                                  8498
Station Wagon/Sport Utility Vehicle    7386
PASSENGER VEHICLE                      4653
Other                                  3681
Bike                                   2488
SPORT UTILITY / STATION WAGON          2176
Unknown                                1791
Taxi                                   1138
UNKNOWN                                 973
TAXI                                    643
Pick-up Truck                           546
Name: VEHICLE TYPE CODE 1, dtype: int64
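
In the output above, case variants such as 'Taxi' and 'TAXI' occupy separate top-10 slots. One possible alternative, which is not what this notebook does, is to normalize case before selecting the top categories. The sketch below uses a hypothetical helper `keep_top_n` and made-up toy values; `str.title()` is only one simple normalization choice and may alter some names (for example hyphenated ones).

```python
import pandas as pd

def keep_top_n(series, n, other_label="Other"):
    """Keep the n most frequent values; replace all others with other_label."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other_label)

# Toy vehicle types with a case variant (values are made up).
vehicles = pd.Series(["Sedan", "Sedan", "Sedan", "Taxi", "TAXI", "Bike", "Bike", "Van"])

# Normalizing case first lets 'Taxi' and 'TAXI' count as one category.
normalized = vehicles.str.title()
reduced = keep_top_n(normalized, n=3)

print(reduced.value_counts())
```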


### 3.3 Standardize Category Names
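Merge categories that differ only in capitalization (e.g., 'TAXI' and 'Taxi') or that use an older name for the same vehicle type (e.g., 'PASSENGER VEHICLE' and 'Sedan').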

In [29]:
# Standardize Vehicle Type names
vehicle_mapping = {
    'PASSENGER VEHICLE': 'Sedan',
    'SPORT UTILITY / STATION WAGON': 'Station Wagon/Sport Utility Vehicle',
    'UNKNOWN': 'Unknown',
    'TAXI': 'Taxi'
}
df_clean['VEHICLE TYPE CODE 1'] = df_clean['VEHICLE TYPE CODE 1'].replace(vehicle_mapping)

print("=== Vehicle Types (Cleaned) ===")
print(df_clean['VEHICLE TYPE CODE 1'].value_counts())
print(f"\nCategories: {df_clean['VEHICLE TYPE CODE 1'].nunique()}")

=== Vehicle Types (Cleaned) ===
Sedan                                  13151
Station Wagon/Sport Utility Vehicle     9562
Other                                   3681
Unknown                                 2764
Bike                                    2488
Taxi                                    1781
Pick-up Truck                            546
Name: VEHICLE TYPE CODE 1, dtype: int64

Categories: 7
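
The mapping above was written by hand. To spot case-only duplicates automatically, one option is to group the distinct values by their lowercase form, as sketched below on made-up toy values. This only finds capitalization variants; renamed categories such as 'PASSENGER VEHICLE' still need a manual mapping.

```python
import pandas as pd

# Toy vehicle types containing case variants (values are made up).
vehicles = pd.Series(["Taxi", "TAXI", "Sedan", "UNKNOWN", "Unknown", "Bike"])

# Group the distinct spellings by their lowercase form.
unique_values = vehicles.drop_duplicates()
variants = unique_values.groupby(unique_values.str.lower()).agg(list)

# Keep only lowercase keys that have more than one spelling.
case_duplicates = variants[variants.map(len) > 1]
print(case_duplicates)
```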


### 3.4 Final Data Overview

In [30]:
# Final data summary
print("=== Final Dataset ===")
print(f"Rows: {len(df_clean):,}")
print(f"Columns: {len(df_clean.columns)}")
print(f"Missing values: {df_clean.isnull().sum().sum()}")

print(f"\n=== Column Summary ===")
for col in df_clean.columns:
    dtype = df_clean[col].dtype
    unique = df_clean[col].nunique()
    print(f"{col:35} | {str(dtype):8} | {unique:,} unique")

print(f"\n=== Target Distribution ===")
print(df_clean['TARGET'].value_counts().rename({0: 'Pedestrian', 1: 'Cyclist'}))
print(f"Ratio: {df_clean['TARGET'].value_counts()[0] / df_clean['TARGET'].value_counts()[1]:.2f}:1")

=== Final Dataset ===
Rows: 33,973
Columns: 11
Missing values: 0

=== Column Summary ===
BOROUGH                             | object   | 6 unique
LATITUDE                            | float64  | 21,600 unique
LONGITUDE                           | float64  | 20,270 unique
HOUR                                | int64    | 24 unique
DAY_OF_WEEK                         | int64    | 7 unique
MONTH                               | int64    | 12 unique
IS_WEEKEND                          | int32    | 2 unique
IS_RUSH_HOUR                        | int32    | 2 unique
CONTRIBUTING FACTOR VEHICLE 1       | object   | 56 unique
VEHICLE TYPE CODE 1                 | object   | 7 unique
TARGET                              | int32    | 2 unique

=== Target Distribution ===
Pedestrian    22833
Cyclist       11140
Name: TARGET, dtype: int64
Ratio: 2.05:1
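
The class counts above also fix a reference point for later evaluation: a model that always predicts the majority class (Pedestrian) already reaches a certain accuracy. The arithmetic, using only the counts printed above, can be sketched as:

```python
# Class counts taken from the target distribution printed above.
counts = {"Pedestrian": 22833, "Cyclist": 11140}

total = sum(counts.values())
ratio = counts["Pedestrian"] / counts["Cyclist"]

# Accuracy of always predicting the most frequent class.
baseline_acc = max(counts.values()) / total

print(f"Total rows: {total:,}")
print(f"Ratio: {ratio:.2f}:1")
print(f"Majority-class baseline accuracy: {baseline_acc:.1%}")
```

Any classifier should be compared against this baseline of roughly 67%, which is why metrics such as F1 and ROC AUC are more informative than accuracy alone for this target.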
