# NYC Motor Vehicle Collisions - Classification Project

## Project Overview
This project analyzes the NYC Motor Vehicle Collisions dataset to build an end-to-end classification pipeline. 
The dataset contains over 2.2 million records of traffic collisions in New York City across 29 columns.

**Dataset Source:** NYC Open Data - Motor Vehicle Collisions  
**Project Goal:** Predict collision severity and patterns using multiple classification models

---

## Part 1: Data Loading and Sampling Strategy

Because the original dataset is large (~2.2M rows, about 2 GB in memory), we create a random sample for our analysis.
The sample should be large enough for reliable model training while keeping computation manageable.

In [1]:
# Data Manipulation and Analysis
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# Machine Learning - Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Note: XGBoost will be installed later if needed
# from xgboost import XGBClassifier

# Machine Learning - Evaluation
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score,
    roc_curve
)

# Unsupervised Learning
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style (try seaborn style, fallback to default)
try:
    plt.style.use('seaborn-v0_8-darkgrid')
except OSError:
    # Matplotlib versions before 3.6 use the unprefixed style name
    plt.style.use('seaborn-darkgrid')

print("✅ All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✅ All libraries imported successfully!
Pandas version: 2.2.3
NumPy version: 2.1.3


## 1.1 Initial Data Exploration

We'll start by loading the full original dataset to understand its structure.

**Note:** This is an exploratory step. The final dataset will be loaded from GitHub in section 1.5.
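
If loading the whole file at once is too heavy for a given machine, one alternative is to read it in chunks and keep a random fraction of each chunk. The sketch below is not the approach used in this notebook; it runs on a small synthetic CSV, and the column subset, `SAMPLE_FRACTION`, and `CHUNK_SIZE` are illustrative assumptions.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the real CSV (hypothetical 1,000-row table).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "COLLISION_ID": np.arange(1000),
    "NUMBER OF PERSONS INJURED": rng.integers(0, 3, size=1000),
})
csv_buffer = io.StringIO(toy.to_csv(index=False))

SAMPLE_FRACTION = 0.2  # assumed fraction, for illustration only
CHUNK_SIZE = 100       # number of rows held in memory at a time

# Read the file chunk by chunk and keep a random fraction of each chunk,
# so the full table never has to be in memory at once.
pieces = []
for i, chunk in enumerate(pd.read_csv(csv_buffer, chunksize=CHUNK_SIZE)):
    pieces.append(chunk.sample(frac=SAMPLE_FRACTION, random_state=42 + i))

df_chunk_sample = pd.concat(pieces, ignore_index=True)
print(f"Sampled {len(df_chunk_sample):,} of {len(toy):,} rows")
```

Sampling a fixed fraction per chunk spreads the sample evenly across the file, which differs slightly from drawing a fixed number of rows from the whole table in one step.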

In [2]:
# Load the dataset
file_path = r"C:\Users\RoyB\Downloads\Motor_Vehicle_Collisions_-_Crashes.csv"

print("Loading dataset...")
df_full = pd.read_csv(file_path)

print(f"\n{'='*60}")
print(f"Dataset loaded successfully!")
print(f"{'='*60}")
print(f"\n📊 Dataset Shape: {df_full.shape[0]:,} rows × {df_full.shape[1]} columns")
print(f"💾 Memory Usage: {df_full.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\n{'='*60}")
print("Column Names and Data Types:")
print(f"{'='*60}")
print(df_full.dtypes)

print(f"\n{'='*60}")
print("First Few Rows:")
print(f"{'='*60}")
print(df_full.head())

print(f"\n{'='*60}")
print("Basic Statistics:")
print(f"{'='*60}")
print(df_full.describe())

print(f"\n{'='*60}")
print("Missing Values Summary:")
print(f"{'='*60}")
missing_summary = pd.DataFrame({
    'Missing_Count': df_full.isnull().sum(),
    'Missing_Percentage': (df_full.isnull().sum() / len(df_full) * 100).round(2)
})
print(missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False))

Loading dataset...

Dataset loaded successfully!

📊 Dataset Shape: 2,223,056 rows × 29 columns
💾 Memory Usage: 2046.54 MB

Column Names and Data Types:
CRASH DATE                        object
CRASH TIME                        object
BOROUGH                           object
ZIP CODE                          object
LATITUDE                         float64
LONGITUDE                        float64
LOCATION                          object
ON STREET NAME                    object
CROSS STREET NAME                 object
OFF STREET NAME                   object
NUMBER OF PERSONS INJURED        float64
NUMBER OF PERSONS KILLED         float64
NUMBER OF PEDESTRIANS INJURED      int64
NUMBER OF PEDESTRIANS KILLED       int64
NUMBER OF CYCLIST INJURED          int64
NUMBER OF CYCLIST KILLED           int64
NUMBER OF MOTORIST INJURED         int64
NUMBER OF MOTORIST KILLED          int64
CONTRIBUTING FACTOR VEHICLE 1     object
CONTRIBUTING FACTOR VEHICLE 2     object
CONTRIBUTING FACTOR V

## 1.2 Sampling Strategy

Given the dataset size (~2.2M rows), we'll create a random sample small enough to store on GitHub.

### GitHub Considerations:
- ⚠️ **GitHub file limit:** 100 MB per file (larger files are rejected on push; a warning appears above 50 MB)
- 🔍 **Recommended limit:** 50-75 MB for smooth operation
- 📊 **Our target:** 400,000 rows (estimated ~60-90 MB as CSV)

### Why this sample size?
- ✅ **Statistical Significance:** Large enough for reliable model training
- ✅ **Computational Efficiency:** Manageable training time and memory usage
- ✅ **GitHub Compatibility:** Well within file size limits
- ✅ **Academic Standard:** Appropriate for 4th-year project demonstration
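
A size estimate like the one above can be sanity-checked before writing anything to disk by serializing a small probe of rows and scaling up. The following is a minimal sketch on synthetic data; `estimate_csv_size_mb` and the toy columns are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd


def estimate_csv_size_mb(df: pd.DataFrame, target_rows: int, probe_rows: int = 1000) -> float:
    """Estimate the CSV size (in MB) of `target_rows` rows from a small random probe."""
    probe = df.sample(n=min(probe_rows, len(df)), random_state=42)
    # Serialize the probe without the header and measure its encoded size.
    body = probe.to_csv(index=False, header=False).encode("utf-8")
    bytes_per_row = len(body) / len(probe)
    return bytes_per_row * target_rows / 1024**2


# Synthetic stand-in data (hypothetical values, not the real collisions table).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "BOROUGH": rng.choice(["BROOKLYN", "QUEENS", "MANHATTAN", "BRONX"], size=5000),
    "LATITUDE": rng.uniform(40.5, 40.9, size=5000).round(6),
    "COLLISION_ID": rng.integers(1, 5_000_000, size=5000),
})

estimate = estimate_csv_size_mb(toy, target_rows=400_000)
print(f"Estimated CSV size for 400,000 toy rows: {estimate:.2f} MB")
```

The real table has 29 columns, so its rows are much wider than these toy rows: the 82.63 MB file reported in section 1.3 works out to roughly 217 bytes per row.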

### Sampling Method:
Simple random sampling without replacement, which approximately preserves the distribution of the full dataset.
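
One way to check that a random sample keeps the distribution is to compare category proportions between the full table and the sample. The sketch below uses a synthetic `BOROUGH` column; the borough proportions and the 18% sampling fraction are assumptions chosen for illustration, not values measured from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic "full" table with made-up borough proportions (assumption).
rng = np.random.default_rng(42)
toy_full = pd.DataFrame({
    "BOROUGH": rng.choice(
        ["BROOKLYN", "QUEENS", "MANHATTAN", "BRONX", "STATEN ISLAND"],
        size=50_000,
        p=[0.30, 0.27, 0.22, 0.15, 0.06],
    )
})

# Simple random sample without replacement.
toy_sample = toy_full.sample(frac=0.18, random_state=42)

# Side-by-side category shares; the two Series align on the borough labels.
comparison = pd.DataFrame({
    "full": toy_full["BOROUGH"].value_counts(normalize=True),
    "sample": toy_sample["BOROUGH"].value_counts(normalize=True),
})
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()
print(comparison.round(4))
```

Small values in `abs_diff` suggest the sample mirrors the full table for that column; the same comparison could be repeated for other categorical columns.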

In [3]:
# Set random seed for reproducibility
np.random.seed(42)

# Define sample size
SAMPLE_SIZE = 400000

print(f"Creating sample of {SAMPLE_SIZE:,} rows...")

# Random sampling from the full dataset
df_sample = df_full.sample(n=SAMPLE_SIZE, random_state=42)

# Reset index
df_sample = df_sample.reset_index(drop=True)

print(f"\n{'='*60}")
print(f"✅ Sample created successfully!")
print(f"{'='*60}")
print(f"Sample Shape: {df_sample.shape[0]:,} rows × {df_sample.shape[1]} columns")
print(f"Sample represents {(SAMPLE_SIZE / len(df_full) * 100):.2f}% of original data")
print(f"Memory Usage: {df_sample.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Verify sample distribution (if applicable)
print(f"\n{'='*60}")
print("Sample Overview:")
print(f"{'='*60}")
df_sample.info()  # info() prints its report directly and returns None

Creating sample of 400,000 rows...

✅ Sample created successfully!
Sample Shape: 400,000 rows × 29 columns
Sample represents 17.99% of original data
Memory Usage: 368.25 MB

Sample Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 29 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH DATE                     400000 non-null  object 
 1   CRASH TIME                     400000 non-null  object 
 2   BOROUGH                        277905 non-null  object 
 3   ZIP CODE                       277867 non-null  object 
 4   LATITUDE                       356834 non-null  float64
 5   LONGITUDE                      356834 non-null  float64
 6   LOCATION                       356834 non-null  object 
 7   ON STREET NAME                 312600 non-null  object 
 8   CROSS STREET NAME              247083 non-null  object 
 9   OFF STREET NAME    

## 1.3 Saving Sample to GitHub Repository

We'll save the sampled dataset in an efficient format for version control and future use.

**File Format:** Plain CSV at this stage; the file is compressed to ZIP in section 1.4.
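
As an aside, pandas can also write a compressed CSV in a single step by passing a `compression` dictionary to `to_csv`. This is an alternative sketch rather than what the cell below does; the toy frame and file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in frame (hypothetical values).
toy = pd.DataFrame({
    "COLLISION_ID": [1, 2, 3],
    "BOROUGH": ["QUEENS", "BROOKLYN", "BRONX"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy_sample.zip")

    # Write the CSV straight into a ZIP archive; archive_name sets the
    # name of the CSV member stored inside the archive.
    toy.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "toy_sample.csv"},
    )

    # Read it back to confirm the round trip.
    restored = pd.read_csv(zip_path, compression="zip")

print(restored)
```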

In [4]:
# Define output path for GitHub repo
output_path = "data/nyc_collisions_sample.csv"

print(f"Saving sample to: {output_path}")

# Create data directory if it doesn't exist
import os
os.makedirs("data", exist_ok=True)

# Save as CSV
df_sample.to_csv(output_path, index=False)

# Check file size
file_size_mb = os.path.getsize(output_path) / 1024**2

print(f"\n{'='*60}")
print(f"✅ Sample saved successfully!")
print(f"{'='*60}")
print(f"File location: {output_path}")
print(f"File size: {file_size_mb:.2f} MB")



Saving sample to: data/nyc_collisions_sample.csv

✅ Sample saved successfully!
File location: data/nyc_collisions_sample.csv
File size: 82.63 MB


---

## 1.4 Compressing Dataset to ZIP Format

To stay under GitHub's 25 MB web upload limit, we'll compress the CSV file to ZIP format.
This substantially reduces the file size while keeping the data easy to access.

### Benefits:
- ✅ **Smaller file size** - roughly 80% smaller for this CSV
- ✅ **Native Python support** - no external dependencies needed
- ✅ **Easy to extract** - works on all platforms
- ✅ **GitHub friendly** - much more likely to be under 25MB

In [5]:
import zipfile

# Define paths
csv_file = "data/nyc_collisions_sample.csv"
zip_file = "data/nyc_collisions_sample.zip"

print("Compressing CSV to ZIP format...")
print(f"{'='*60}")

# Create ZIP file with maximum compression
with zipfile.ZipFile(zip_file, 'w', zipfile.ZIP_DEFLATED, compresslevel=9) as zipf:
    zipf.write(csv_file, arcname='nyc_collisions_sample.csv')

# Get file sizes
csv_size_mb = os.path.getsize(csv_file) / 1024**2
zip_size_mb = os.path.getsize(zip_file) / 1024**2

print(f"\n{'='*60}")
print(f"✅ ZIP file created successfully!")
print(f"{'='*60}")
print(f"\n📊 File Size Comparison:")
print(f"   Original CSV: {csv_size_mb:.2f} MB")
print(f"   Compressed ZIP: {zip_size_mb:.2f} MB")
print(f"   Compression ratio: {csv_size_mb / zip_size_mb:.1f}x smaller")
print(f"   Space saved: {csv_size_mb - zip_size_mb:.2f} MB ({(1 - zip_size_mb/csv_size_mb)*100:.1f}%)")

# Check GitHub compatibility
print(f"\n{'='*60}")
print(f"GitHub Upload Compatibility:")
print(f"{'='*60}")

if zip_size_mb <= 25:
    print(f"✅ PERFECT! File is {zip_size_mb:.2f} MB - under GitHub's 25MB web upload limit")
    print(f"   You can upload this file directly via GitHub web interface!")
elif zip_size_mb <= 100:
    print(f"✅ File is {zip_size_mb:.2f} MB - under GitHub's 100MB command line limit")
    print(f"   Upload via: git add data/nyc_collisions_sample.zip")
else:
    print(f"⚠️  File is {zip_size_mb:.2f} MB - above GitHub limits")
    print(f"   Need to reduce sample size")

print(f"\n📝 File saved at: {zip_file}")

Compressing CSV to ZIP format...

✅ ZIP file created successfully!

📊 File Size Comparison:
   Original CSV: 82.63 MB
   Compressed ZIP: 16.81 MB
   Compression ratio: 4.9x smaller
   Space saved: 65.82 MB (79.7%)

GitHub Upload Compatibility:
✅ PERFECT! File is 16.81 MB - under GitHub's 25MB web upload limit
   You can upload this file directly via GitHub web interface!

📝 File saved at: data/nyc_collisions_sample.zip


---

## 1.5 Loading Data from GitHub Repository

The dataset is stored as a compressed ZIP file (16.81 MB) in the GitHub repository.
The data is loaded directly from the public repository to ensure reproducibility.

**Dataset Details:**
- **Repository:** https://github.com/Roybin12/machine-learning-2-project
- **File:** nyc_collisions_sample.zip
- **Compressed size:** 16.81 MB (ZIP)
- **Rows:** 400,000 (about 18% of the original 2,223,056 rows)
- **Columns:** 29
- **Memory Usage:** ~368 MB when loaded

**Why this sample size?**
- ✅ **Statistical Significance:** Large enough for robust model training
- ✅ **Computational Efficiency:** Manageable for academic project requirements
- ✅ **GitHub Compatible:** Compressed to under 25MB
- ✅ **Representative:** An ~18% random sample approximately preserves the data distribution

This ensures that anyone running this notebook (including instructors) can access the data directly.
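
For repeated runs, it may be convenient to prefer a local copy of the archive and fall back to the repository URL only when no local file exists. The helper below is a hypothetical sketch (`load_sample` is not part of the notebook); the demo uses a temporary toy archive and a placeholder URL that is never fetched.

```python
import os
import tempfile

import pandas as pd


def load_sample(local_path: str, remote_url: str) -> pd.DataFrame:
    """Load the zipped CSV from a local copy if present, otherwise from the URL."""
    source = local_path if os.path.exists(local_path) else remote_url
    return pd.read_csv(source, compression="zip")


# Demo with a temporary local archive so that no network access is needed.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_zip = os.path.join(tmp_dir, "toy_sample.zip")
    pd.DataFrame({"COLLISION_ID": [10, 20]}).to_csv(
        local_zip,
        index=False,
        compression={"method": "zip", "archive_name": "toy_sample.csv"},
    )
    # The placeholder URL is ignored here because the local file exists.
    df_demo = load_sample(local_zip, remote_url="https://example.invalid/unused.zip")

print(df_demo.shape)
```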

In [6]:
# Load data directly from GitHub repository
github_zip_url = "https://github.com/Roybin12/machine-learning-2-project/raw/main/nyc_collisions_sample.zip"

print("Loading dataset from GitHub repository...")
print(f"{'='*60}")
print(f"Source: {github_zip_url}")
print(f"{'='*60}\n")

# Read the CSV directly from the ZIP file on GitHub
df = pd.read_csv(github_zip_url, compression='zip')

print(f"✅ Data loaded successfully from GitHub!")

print(f"\n{'='*60}")
print(f"Dataset Overview:")
print(f"{'='*60}")
print(f"📊 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\n{'='*60}")
print("Column Names and Types:")
print(f"{'='*60}")
print(df.dtypes)

print(f"\n{'='*60}")
print("First 5 Rows:")
print(f"{'='*60}")
display(df.head())

print(f"\n{'='*60}")
print("Basic Statistics:")
print(f"{'='*60}")
display(df.describe())

Loading dataset from GitHub repository...
Source: https://github.com/Roybin12/machine-learning-2-project/raw/main/nyc_collisions_sample.zip

✅ Data loaded successfully from GitHub!

Dataset Overview:
📊 Shape: 400,000 rows × 29 columns
💾 Memory Usage: 368.28 MB

Column Names and Types:
CRASH DATE                        object
CRASH TIME                        object
BOROUGH                           object
ZIP CODE                          object
LATITUDE                         float64
LONGITUDE                        float64
LOCATION                          object
ON STREET NAME                    object
CROSS STREET NAME                 object
OFF STREET NAME                   object
NUMBER OF PERSONS INJURED        float64
NUMBER OF PERSONS KILLED         float64
NUMBER OF PEDESTRIANS INJURED      int64
NUMBER OF PEDESTRIANS KILLED       int64
NUMBER OF CYCLIST INJURED          int64
NUMBER OF CYCLIST KILLED           int64
NUMBER OF MOTORIST INJURED         int64
NUMBER O

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,09/21/2022,9:20,QUEENS,11420.0,40.675106,-73.80979,"(40.675106, -73.80979)",128 STREET,ROCKAWAY BOULEVARD,,2.0,0.0,0,0,0,0,2,0,Failure to Yield Right-of-Way,Unspecified,,,,4566168,Sedan,Station Wagon/Sport Utility Vehicle,,,
1,12/26/2018,12:00,QUEENS,11422.0,40.67452,-73.736084,"(40.67452, -73.736084)",MERRICK BOULEVARD,234 STREET,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,4052858,Station Wagon/Sport Utility Vehicle,Box Truck,,,
2,05/12/2020,12:17,STATEN ISLAND,10304.0,40.608982,-74.088135,"(40.608982, -74.088135)",DEKALB STREET,TARGEE STREET,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4313485,Sedan,,,,
3,10/22/2013,13:57,QUEENS,11101.0,40.746117,-73.944891,"(40.746117, -73.9448914)",JACKSON AVENUE,PEARSON STREET,,0.0,0.0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,243878,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
4,12/19/2016,8:40,,,40.608364,-74.038666,"(40.608364, -74.038666)",VERRAZANO BRIDGE UPPER,,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,3583313,Sedan,Sedan,,,



Basic Statistics:


| | LATITUDE | LONGITUDE | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | COLLISION_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 356834.0 | 356834.0 | 399996.0 | 399996.0 | 400000.0 | 400000.0 | 400000.0 | 400000.0 | 400000.0 | 400000.0 | 400000.0 |
| mean | 40.590051 | -73.680955 | 0.329126 | 0.00167 | 0.059733 | 0.000775 | 0.02882 | 0.000132 | 0.235902 | 0.000732 | 3267802.0 |
| std | 2.33239 | 4.304818 | 0.712749 | 0.042863 | 0.248726 | 0.028096 | 0.169247 | 0.01151 | 0.674636 | 0.029444 | 1509462.0 |
| min | 0.0 | -201.35999 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22.0 |
| 25% | 40.667366 | -73.974457 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3191020.0 |
| 50% | 40.72028 | -73.926546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3748664.0 |
| 75% | 40.76967 | -73.86656 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4303453.0 |
| max | 42.107204 | 0.0 | 22.0 | 3.0 | 6.0 | 2.0 | 4.0 | 1.0 | 22.0 | 3.0 | 4859867.0 |
