# Table of Contents
1. [Introduction](#introduction)
2. [Data Overview](#data-overview)
3. [Data Cleaning](#data-cleaning)
4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
5. [Modeling](#modeling)
6. [Results](#results)
7. [Conclusion](#conclusion)
8. [References](#references)

## Introduction

For this project we used a dataset of recorded vehicle collisions in New York City, spanning from 2012 to the present day. This notebook serves as the primary record of the steps taken to produce the final dataset and the accompanying machine learning models.

## Data Overview

In [5]:
import pandas as pd

This project makes use of two datasets found on NYC Open Data: one that contains a record of every motor vehicle collision in NYC that resulted in injury and/or at least $1000 in damage, and one that contains a record of every vehicle involved in those collisions.

Both datasets required extensive cleaning, as they contained a large number of missing values and manually entered text that often contained abbreviations and misspellings.

### Data Set

This dataset and its corresponding metadata are made publicly available at NYC Open Data (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data).

In [9]:
df = pd.read_csv('Motor_Vehicle_Collisions_cpy2.csv')



### Data Characteristics

In [11]:
df.shape

(2055607, 41)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2055607 entries, 0 to 2055606
Data columns (total 41 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   Unnamed: 0                     int64  
 1   CRASH DATE                     object 
 2   CRASH TIME                     object 
 3   BOROUGH                        object 
 4   ZIP CODE                       object 
 5   LATITUDE                       float64
 6   LONGITUDE                      float64
 7   LOCATION                       object 
 8   ON STREET NAME                 object 
 9   CROSS STREET NAME              object 
 10  OFF STREET NAME                object 
 11  NUMBER OF PERSONS INJURED      float64
 12  NUMBER OF PERSONS KILLED       float64
 13  NUMBER OF PEDESTRIANS INJURED  int64  
 14  NUMBER OF PEDESTRIANS KILLED   int64  
 15  NUMBER OF CYCLIST INJURED      int64  
 16  NUMBER OF CYCLIST KILLED       int64  
 17  NUMBER OF MOTORIST INJURED     int64  
 18  NU

In [13]:
df.columns

Index(['Unnamed: 0', 'CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE',
       'LATITUDE', 'LONGITUDE', 'LOCATION', 'ON STREET NAME',
       'CROSS STREET NAME', 'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
       'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
       'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
       'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
       'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
       'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5',
       'FULL ADDRESS', 'HOUSE NUMBER', 'ROAD', 'NEIGHBOURHOOD', 'SUBURB',
       'POSTCODE', 'day_of_week', 'is_weekend', 'CRASH HOUR', 'is_holiday',
       'holiday_name'],
      dtype='object')

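The original raw dataset contains roughly 2.1 million records and 29 features. The working copy loaded above has 2,055,607 rows and 41 columns: the 29 original features plus a saved index column (Unnamed: 0), six address columns (FULL ADDRESS, HOUSE NUMBER, ROAD, NEIGHBOURHOOD, SUBURB, POSTCODE) and five temporal features (day_of_week, is_weekend, CRASH HOUR, is_holiday, holiday_name).

Of the 29 original features there are 4 floats, 7 integers and 18 objects, or strictly categorical features.

At first glance, 2 of the float columns appear to represent what should be a discrete count as a continuous value: NUMBER OF PERSONS INJURED and NUMBER OF PERSONS KILLED are both of the float64 type.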
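
One possible way to address this, sketched below on a small hypothetical frame, is to cast the two columns to pandas' nullable integer dtype (`Int64`), which keeps missing values as `<NA>` instead of forcing the column to float. The toy values and the name `count_cols` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the collision data (hypothetical values)
toy = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': [0.0, 2.0, None],
    'NUMBER OF PERSONS KILLED': [0.0, None, 1.0],
})

count_cols = ['NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED']

# 'Int64' (capital I) is the nullable integer dtype: NaN becomes <NA>
toy = toy.astype({col: 'Int64' for col in count_cols})

print(toy.dtypes)
```

Once the missing values have been filled, a plain `int64` cast would work as well.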

## Data Cleaning

### Missing Values

In [17]:
df = df.dropna(subset=['LATITUDE', 'LONGITUDE'])

The features in our dataset essentially fall into 5 distinct categories: temporal, spatial, severity, contributing factors and vehicle.

In [19]:
print(df.isnull().sum())

Unnamed: 0                             0
CRASH DATE                             0
CRASH TIME                             0
BOROUGH                           491837
ZIP CODE                          492089
LATITUDE                               0
LONGITUDE                              0
LOCATION                           54768
ON STREET NAME                    404926
CROSS STREET NAME                 711186
OFF STREET NAME                  1573672
NUMBER OF PERSONS INJURED              0
NUMBER OF PERSONS KILLED               0
NUMBER OF PEDESTRIANS INJURED          0
NUMBER OF PEDESTRIANS KILLED           0
NUMBER OF CYCLIST INJURED              0
NUMBER OF CYCLIST KILLED               0
NUMBER OF MOTORIST INJURED             0
NUMBER OF MOTORIST KILLED              0
CONTRIBUTING FACTOR VEHICLE 1       6553
CONTRIBUTING FACTOR VEHICLE 2     305200
CONTRIBUTING FACTOR VEHICLE 3    1781385
CONTRIBUTING FACTOR VEHICLE 4    1885697
CONTRIBUTING FACTOR VEHICLE 5    1908032
COLLISION_ID    

In [20]:
missing_percentage = df.isnull().mean() * 100
print("Percentage of missing values in each column:")
print(missing_percentage)

Percentage of missing values in each column:
Unnamed: 0                        0.000000
CRASH DATE                        0.000000
CRASH TIME                        0.000000
BOROUGH                          25.663025
ZIP CODE                         25.676173
LATITUDE                          0.000000
LONGITUDE                         0.000000
LOCATION                          2.857680
ON STREET NAME                   21.128191
CROSS STREET NAME                37.108196
OFF STREET NAME                  82.110909
NUMBER OF PERSONS INJURED         0.000000
NUMBER OF PERSONS KILLED          0.000000
NUMBER OF PEDESTRIANS INJURED     0.000000
NUMBER OF PEDESTRIANS KILLED      0.000000
NUMBER OF CYCLIST INJURED         0.000000
NUMBER OF CYCLIST KILLED          0.000000
NUMBER OF MOTORIST INJURED        0.000000
NUMBER OF MOTORIST KILLED         0.000000
CONTRIBUTING FACTOR VEHICLE 1     0.341922
CONTRIBUTING FACTOR VEHICLE 2    15.924697
CONTRIBUTING FACTOR VEHICLE 3    92.948939
CONTRIBUT

#### What We Have

Fortunately, the temporal data (date and time) is complete, as those values would be impossible to replace with confidence.

There is also minimal data missing from the severity columns (number of persons killed/injured): out of roughly 2.1 million records, 18 values are missing from NUMBER OF PERSONS INJURED and 31 from NUMBER OF PERSONS KILLED.

#### What We Are Missing

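We are missing a great deal of spatial data (BOROUGH, ZIP CODE, LATITUDE/LONGITUDE, ON/OFF/CROSS STREET NAME). Even after dropping the rows without coordinates, about 26% of BOROUGH and ZIP CODE values, 21% of ON STREET NAME, 37% of CROSS STREET NAME and 82% of OFF STREET NAME values are missing.

We are also missing a great deal of contributing factor data (CONTRIBUTING FACTOR VEHICLE 2 through 5); for example, about 16% of CONTRIBUTING FACTOR VEHICLE 2 and 93% of CONTRIBUTING FACTOR VEHICLE 3 values are missing.

Finally, we are missing a great deal of vehicle data (VEHICLE TYPE CODE 2 through 5).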
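
To make this kind of summary easier to scan, the sparsest columns can be listed on their own. The following is a minimal sketch on toy data; the 50% `threshold` is a hypothetical cut-off chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the collision data (hypothetical values)
toy = pd.DataFrame({
    'CRASH DATE': ['01/01/2018', '01/02/2018', '01/03/2018', '01/04/2018'],
    'BOROUGH': ['BROOKLYN', None, 'QUEENS', 'BRONX'],
    'OFF STREET NAME': [None, None, None, '14 east 125 street'],
})

threshold = 50.0  # hypothetical cut-off, in percent

# Share of missing values per column, then keep only the sparsest columns
missing_pct = toy.isnull().mean() * 100
sparse_cols = missing_pct[missing_pct > threshold].sort_values(ascending=False)
print(sparse_cols)
```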

#### Persons Injured and Persons Killed

In [27]:
print(df['NUMBER OF PERSONS INJURED'].value_counts())
print(df['NUMBER OF PERSONS KILLED'].value_counts())

NUMBER OF PERSONS INJURED
0.0     1469238
1.0      349386
2.0       64166
3.0       20842
4.0        7686
5.0        2893
6.0        1222
7.0         511
8.0         229
9.0         111
10.0         79
11.0         45
12.0         29
13.0         25
15.0         15
14.0          7
17.0          6
16.0          6
18.0          6
22.0          3
19.0          3
24.0          3
20.0          2
27.0          1
32.0          1
43.0          1
21.0          1
23.0          1
34.0          1
25.0          1
Name: count, dtype: int64
NUMBER OF PERSONS KILLED
0.0    1913773
1.0       2662
2.0         67
3.0         12
4.0          4
8.0          1
5.0          1
Name: count, dtype: int64


##### Persons Killed

In [29]:
nan_killed = df[df['NUMBER OF PERSONS KILLED'].isna()]
nan_killed.head()

Unnamed: 0.1,Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,...,HOUSE NUMBER,ROAD,NEIGHBOURHOOD,SUBURB,POSTCODE,day_of_week,is_weekend,CRASH HOUR,is_holiday,holiday_name


In [30]:
kill_cols = ['NUMBER OF PEDESTRIANS KILLED','NUMBER OF CYCLIST KILLED','NUMBER OF MOTORIST KILLED','NUMBER OF PERSONS KILLED']
nan_killed = nan_killed[kill_cols]
nan_killed
    

Unnamed: 0,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST KILLED,NUMBER OF PERSONS KILLED


The NUMBER OF PERSONS KILLED column appears to be the sum of the NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST KILLED and NUMBER OF MOTORIST KILLED columns. Since the supporting columns have no missing values and tally to zero for every affected row, it is relatively safe to assume that the missing values in NUMBER OF PERSONS KILLED can simply be replaced with 0.

In [32]:
# Replace NaN values with 0 in specific columns
columns_to_fill = ['NUMBER OF PERSONS KILLED']
df[columns_to_fill] = df[columns_to_fill].fillna(0)
print("NUMBER OF PERSONS KILLED values that are null:", df['NUMBER OF PERSONS KILLED'].isnull().sum())

NUMBER OF PERSONS KILLED values that are null: 0
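
The assumption that the total is the sum of the three supporting columns could be checked on the rows where the total is present. The sketch below does this on toy data (all values hypothetical); any mismatches on the real data would suggest that the total also counts people outside the three categories.

```python
import pandas as pd

# Toy stand-in for the collision data (hypothetical values)
toy = pd.DataFrame({
    'NUMBER OF PEDESTRIANS KILLED': [0, 1, 0, 0],
    'NUMBER OF CYCLIST KILLED': [0, 0, 0, 1],
    'NUMBER OF MOTORIST KILLED': [0, 0, 2, 0],
    'NUMBER OF PERSONS KILLED': [0.0, 1.0, 2.0, None],
})

parts = ['NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST KILLED']
component_sum = toy[parts].sum(axis=1)

# Compare only rows where the total is known; NaN totals are skipped
known = toy['NUMBER OF PERSONS KILLED'].notna()
mismatches = toy[known & (toy['NUMBER OF PERSONS KILLED'] != component_sum)]
print("Rows where total != sum of components:", len(mismatches))
```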


##### Persons Injured

In [34]:
nan_injured = df[df['NUMBER OF PERSONS INJURED'].isna()]
nan_injured

Unnamed: 0.1,Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,...,HOUSE NUMBER,ROAD,NEIGHBOURHOOD,SUBURB,POSTCODE,day_of_week,is_weekend,CRASH HOUR,is_holiday,holiday_name


In [35]:
injured_cols = ['COLLISION_ID', 'NUMBER OF PEDESTRIANS INJURED','NUMBER OF CYCLIST INJURED','NUMBER OF MOTORIST INJURED','NUMBER OF PERSONS INJURED']
nan_injured = nan_injured[injured_cols] 
nan_injured

Unnamed: 0,COLLISION_ID,NUMBER OF PEDESTRIANS INJURED,NUMBER OF CYCLIST INJURED,NUMBER OF MOTORIST INJURED,NUMBER OF PERSONS INJURED


In contrast to the missing NUMBER OF PERSONS KILLED values, where the tally for every affected row appeared to be zero, we are not as lucky with NUMBER OF PERSONS INJURED. While a few of the missing values should indeed be zero, several rows have nonzero counts in NUMBER OF CYCLIST INJURED or NUMBER OF MOTORIST INJURED, so those values need to be corrected.

To aid in correcting these values we have included the COLLISION_ID column. Since this is a relatively small correction, we make it manually.

In [37]:
# COLLISION_ID as keys and NUMBER OF PERSONS INJURED as values
corrections = {
    4387369: 1,  # Example COLLISION_ID with the new value for NUMBER OF PERSONS INJURED
    4026403: 1,
    4026185: 1,
    4025523: 1,
    4024624: 2,
    4024290: 1
}

# Update the NUMBER OF PERSONS INJURED based on COLLISION_ID
for collision_id, injured_count in corrections.items():
    df.loc[df['COLLISION_ID'] == collision_id, 'NUMBER OF PERSONS INJURED'] = injured_count
    
print("NUMBER OF PERSONS INJURED values that are null:", df['NUMBER OF PERSONS INJURED'].isnull().sum())

NUMBER OF PERSONS INJURED values that are null: 0


In [38]:
# Display the corrected rows next to their supporting tallies
cols = ['COLLISION_ID', 'NUMBER OF PEDESTRIANS INJURED','NUMBER OF CYCLIST INJURED','NUMBER OF MOTORIST INJURED','NUMBER OF PERSONS INJURED']
orig = df[df['COLLISION_ID'].isin(corrections.keys())]
orig[cols]


Unnamed: 0,COLLISION_ID,NUMBER OF PEDESTRIANS INJURED,NUMBER OF CYCLIST INJURED,NUMBER OF MOTORIST INJURED,NUMBER OF PERSONS INJURED
179491,4387369,0,1,0,1.0
561370,4026403,0,0,1,1.0
609987,4026185,0,0,1,1.0
701656,4025523,0,0,1,1.0
820434,4024624,0,0,2,2.0
882407,4024290,0,1,0,1.0


As shown above, the corrected values now match the individual tallies.
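
For a larger number of affected rows, a manual dictionary would not scale. One alternative, sketched here on toy data with hypothetical COLLISION_ID values, is to fill each missing total with the sum of its supporting columns. This assumes the total really is the sum of the three tallies, which is an assumption rather than something the dataset documentation is quoted as confirming here.

```python
import pandas as pd

# Toy stand-in for the collision data (hypothetical IDs and values)
toy = pd.DataFrame({
    'COLLISION_ID': [1, 2, 3],
    'NUMBER OF PEDESTRIANS INJURED': [0, 0, 0],
    'NUMBER OF CYCLIST INJURED': [1, 0, 0],
    'NUMBER OF MOTORIST INJURED': [0, 2, 0],
    'NUMBER OF PERSONS INJURED': [None, None, 3.0],
})

parts = ['NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF MOTORIST INJURED']
component_sum = toy[parts].sum(axis=1)

# fillna with a Series aligns on the index, so only the missing totals change
toy['NUMBER OF PERSONS INJURED'] = toy['NUMBER OF PERSONS INJURED'].fillna(component_sum)
print(toy[['COLLISION_ID', 'NUMBER OF PERSONS INJURED']])
```

Rows that already have a total are left untouched, even if the total disagrees with the tallies, so this only replaces the manual step and does not audit existing values.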

#### Location Values

##### Street Names

In [42]:
df['ON STREET NAME'].value_counts()

ON STREET NAME
broadway                            19128
atlantic avenue                     17180
belt parkway                        17178
3 avenue                            13741
long island expressway              12463
                                    ...  
14 east 125 street                      1
south.portland avenue                   1
marine parkway gil hodges memori        1
37th avenue                             1
beach 144 street                        1
Name: count, Length: 7766, dtype: int64

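The original dataset contains 18,552 unique values for ON STREET NAME, while the output above lists 7,766. On closer inspection, however, it becomes apparent that there are a great number of typos and inconsistencies, such as 'south.portland avenue', the truncated 'marine parkway gil hodges memori', and ordinals written both ways ('3 avenue' vs. '37th avenue').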
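
A few of these inconsistencies can be reduced with simple text normalization. The helper below is a minimal sketch, not the method used to produce the working copy; the function name `normalize_street` and the specific rules (lowercasing, replacing periods, stripping ordinal suffixes) are assumptions chosen for illustration.

```python
import re

def normalize_street(name):
    """Return a lowercased, whitespace-collapsed street name without ordinal suffixes."""
    if not isinstance(name, str):
        return name  # leave NaN / None untouched
    name = name.strip().lower()
    name = name.replace('.', ' ')                         # 'south.portland' -> 'south portland'
    name = re.sub(r'(\d+)(st|nd|rd|th)\b', r'\1', name)   # '37th' -> '37'
    name = re.sub(r'\s+', ' ', name).strip()              # collapse repeated spaces
    return name

examples = ['37th avenue', 'south.portland avenue', '  3   AVENUE ']
print([normalize_street(s) for s in examples])
```

On the DataFrame this could be applied with `df['ON STREET NAME'].map(normalize_street)`. Truncated names such as 'marine parkway gil hodges memori' would still need a separate lookup.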

### Handling Categorical Data
#### Contributing Factor Vehicles
There are many contributing factors as to why a vehicle was involved in a collision. The goal is to standardize or correct the spelling of factors (e.g. 'Illnes'), remap values that were possibly entered by mistake (e.g. '80', '1'), and group together different ways of referring to the same factor (e.g. the several variants that boil down to electronics use).

In [None]:
df = pd.read_csv('Motor_Vehicle_Collisions_cpy2.csv')

#Print all possible contributing factors for vehicles involved in collisions
contributing_factors_cols = ['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5'] 
unique_contributing_factors = pd.concat(df[col] for col in contributing_factors_cols).unique()
print(unique_contributing_factors)

In [None]:
#Standardize spelling, remove incorrectly entered values, and group together similar values.
factors_mapping = {
    #IMPROPER DRIVING TECHNIQUE
    #General poor etiquette on the road--not in violation of any particular rule/law--that causes an accident
    'Following Too Closely': 'Improper Driving Technique',
    'Passing Too Closely': 'Improper Driving Technique',
    'Driver Inexperience': 'Improper Driving Technique',
    'Passing or Lane Usage Improper': 'Improper Driving Technique',
    'Turning Improperly': 'Improper Driving Technique',
    'Unsafe Lane Changing': 'Improper Driving Technique',
    'Backing Unsafely': 'Improper Driving Technique',
    'Aggressive Driving/Road Rage': 'Improper Driving Technique',

    #TRAFFIC RULE VIOLATION
    #Violation of a particular rule/law on the road that leads to an accident
    'Traffic Control Disregarded': 'Traffic Rule Violation',
    'Failure to Yield Right-of-Way': 'Traffic Rule Violation',
    'Failure to Keep Right': 'Traffic Rule Violation',
    'Unsafe Speed': 'Traffic Rule Violation',

    #POOR ROAD CONDITIONS
    #Any road condition that makes driving difficult and causes an accident
    'Pavement Slippery': 'Poor Road Conditions',
    'View Obstructed/Limited': 'Poor Road Conditions',
    'Glare': 'Poor Road Conditions',
    'Obstruction/Debris': 'Poor Road Conditions',
    'Pavement Defective': 'Poor Road Conditions',
    'Lane Marking Improper/Inadequate': 'Poor Road Conditions',
    'Traffic Control Device Improper/Non-Working': 'Poor Road Conditions',
    'Shoulders Defective/Improper': 'Poor Road Conditions',

    #EXTERNAL DISTRACTION/OBSTACLE
    #Any outside car/bike/pedestrian/animal that causes an accident
    'Reaction to Uninvolved Vehicle': 'External Distraction/Obstacle',
    'Reaction to Other Uninvolved Vehicle': 'External Distraction/Obstacle',
    'Other Vehicular': 'External Distraction/Obstacle',
    'Oversized Vehicle': 'External Distraction/Obstacle',
    'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion': 'External Distraction/Obstacle',
    'Animals Action': 'External Distraction/Obstacle',
    'Outside Car Distraction': 'External Distraction/Obstacle',
    'Vehicle Vandalism': 'External Distraction/Obstacle',

    #VEHICLE DEFECT
    #A malfunction/breakdown of a vehicle that causes it to be involved in an accident
    'Steering Failure': 'Vehicle Defect',
    'Brakes Defective': 'Vehicle Defect',
    'Tinted Windows': 'Vehicle Defect',
    'Other Lighting Defects': 'Vehicle Defect',
    'Driverless/Runaway Vehicle': 'Vehicle Defect',
    'Tire Failure/Inadequate': 'Vehicle Defect',
    'Headlights Defective': 'Vehicle Defect',
    'Accelerator Defective': 'Vehicle Defect',
    'Tow Hitch Defective': 'Vehicle Defect',
    'Windshield Inadequate': 'Vehicle Defect',

    #ALCOHOL/DRUG USE
    #Any substance use (legal or illegal) that causes an accident
    'Alcohol Involvement': 'Alcohol/Drug Use',
    'Drugs (illegal)': 'Alcohol/Drug Use',
    'Drugs (Illegal)': 'Alcohol/Drug Use',
    'Prescription Medication': 'Alcohol/Drug Use',

    #ELECTRONICS USE
    #The use of an electronic device (cellphone, GPS, headphones, etc.) that leads to an accident
    'Cell Phone (hands-free)': 'Electronics Use',
    'Cell Phone (hand-Held)': 'Electronics Use',
    'Cell Phone (hand-held)': 'Electronics Use',
    'Using On Board Navigation Device': 'Electronics Use',
    'Other Electronic Device': 'Electronics Use',
    'Listening/Using Headphones': 'Electronics Use',
    'Texting': 'Electronics Use',

    #DRIVER DISTRACTION/IMPAIRMENT
    #The involvement of factors not related to substances/electronics that distract a driver or make them unable to drive
    'Driver Inattention/Distraction': 'Driver Distraction/Impairment',
    'Lost Consciousness': 'Driver Distraction/Impairment',
    'Passenger Distraction': 'Driver Distraction/Impairment',
    'Fell Asleep': 'Driver Distraction/Impairment',
    'Fatigued/Drowsy': 'Driver Distraction/Impairment',
    'Physical Disability': 'Driver Distraction/Impairment',
    'Eating or Drinking': 'Driver Distraction/Impairment',
    'Illnes': 'Driver Distraction/Impairment',
    'Illness': 'Driver Distraction/Impairment',
    
    #It may be that unspecified means a car was involved but they do not know what caused it to be involved--not NaN then?
    'Unspecified': 'Unspecified',
    '80': 'Police Chase',
    '1': 'Unspecified',
}

for col in contributing_factors_cols:
    df[col] = df[col].map(factors_mapping).fillna(df[col])

print(df)

In [None]:
#Print new possible values for contributing factors
unique_contributing_factors = pd.concat(df[col] for col in contributing_factors_cols).unique()
print(unique_contributing_factors)
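
Because the mapping lists some factors once per capitalization (e.g. 'Drugs (illegal)' and 'Drugs (Illegal)'), a case-insensitive lookup may be a simpler alternative, and the same step can report values that the mapping does not cover. The sketch below uses a small excerpt of the mapping and toy data; `mapping_excerpt`, `lower_mapping` and `raw` are hypothetical names.

```python
import pandas as pd

# Small excerpt of a mapping like factors_mapping
mapping_excerpt = {
    'Drugs (illegal)': 'Alcohol/Drug Use',
    'Cell Phone (hand-held)': 'Electronics Use',
    'Illness': 'Driver Distraction/Impairment',
}
# Lowercased keys let one entry cover every capitalization variant
lower_mapping = {key.lower(): value for key, value in mapping_excerpt.items()}

# Toy column of raw contributing factors (hypothetical values)
raw = pd.Series(['Drugs (Illegal)', 'Cell Phone (hand-Held)', 'Illnes', None])

mapped = raw.str.strip().str.lower().map(lower_mapping)

# Values present in the data but absent from the mapping (e.g. typos still to add)
unmapped = raw[mapped.isna() & raw.notna()].unique()
print(unmapped)
```

Misspellings such as 'Illnes' still need their own entry, but the audit makes them easy to spot.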

#### Vehicle Types
As with contributing factors, there are many vehicle types, so they will need to be handled in a similar way. Many are truncated beyond easy recognition, which may make them challenging to handle.

In [None]:
#Print all possible vehicle types for vehicles involved in collisions
#vehicle_type_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'] 
#unique_vehicle_types = pd.concat(df[col] for col in vehicle_type_cols).unique()
#for vehicle_type in unique_vehicle_types:
    #print(vehicle_type)
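
Since many vehicle types are rare or truncated, one possible first pass is to normalize the text and bucket infrequent labels together rather than map each one by hand. This is only a sketch on toy data; the `min_count` threshold and the 'other' label are hypothetical choices.

```python
import pandas as pd

# Toy vehicle type values (hypothetical), including case variants and truncations
types = pd.Series(['Sedan', 'sedan', 'SEDAN ', 'Taxi', 'taxi', 'AMBUL', 'ambu', None])

cleaned = types.str.strip().str.lower()
counts = cleaned.value_counts()

min_count = 2  # hypothetical threshold for a label to be kept as-is
frequent = counts[counts >= min_count].index

# Keep frequent labels, bucket the rest as 'other', and leave missing values missing
grouped = cleaned.where(cleaned.isin(frequent) | cleaned.isna(), 'other')
print(grouped.value_counts(dropna=False))
```

Labels that land in 'other' can then be reviewed by frequency to decide which truncations deserve an explicit mapping.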

### Feature Engineering

#### Temporal Features

The vanilla dataset comes with a number of temporal features included. There are separate CRASH DATE and CRASH TIME fields, making temporal analysis possible. With just those two fields we are able to create several potentially useful features.

By converting the date into datetime format we are able to extract the day of the week, month, week of the year, and even holidays to give the data more dimensionality. We can further break down the hourly data into categories to appropriately handle and recognize traffic patterns such as rush hour.

In [None]:
# Convert 'CRASH DATE' to datetime
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'], errors='coerce')
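
The month and week of the year mentioned above can be extracted with the same `.dt` accessor. The following is a minimal sketch on toy dates; the column names `month` and `week_of_year` and the MM/DD/YYYY input format are assumptions for illustration.

```python
import pandas as pd

# Toy dates (hypothetical); unparseable entries become NaT with errors='coerce'
dates = pd.to_datetime(pd.Series(['01/01/2018', '07/04/2019', 'not a date']),
                       format='%m/%d/%Y', errors='coerce')

toy = pd.DataFrame({'CRASH DATE': dates})
toy['month'] = toy['CRASH DATE'].dt.month
# isocalendar() returns a frame with 'year', 'week' and 'day' columns
toy['week_of_year'] = toy['CRASH DATE'].dt.isocalendar().week
print(toy)
```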

##### Day of the Week

In [None]:
# Add 'day_of_week' feature
df['day_of_week'] = df['CRASH DATE'].dt.day_name()
df['day_of_week'].value_counts()

By adding a feature for the day of the week we are able to capture and represent weekly trends in traffic and collision patterns. Looking at the value counts for the newly created feature, day_of_week, there are noticeably fewer accidents on weekends.

##### Weekday/Weekend

In [None]:
# Add 'is_weekend' feature
df['is_weekend'] = df['CRASH DATE'].dt.weekday >= 5
df['is_weekend'].value_counts()

Since there are significantly fewer accidents on weekends than on weekdays, adding a binary feature to represent whether any given day falls on a weekend makes sense.

##### Hour of the Day

In [None]:
# Add 'CRASH HOUR' feature: the hour of the day (0-23) parsed from 'CRASH TIME'
df['CRASH HOUR'] = pd.to_datetime(df['CRASH TIME']).dt.hour

By adding a feature for the crash hour we are essentially discretizing the time field, which may better represent daily trends in the dataset.

##### Holidays

In [None]:
import holidays

min_date = df['CRASH DATE'].min()
max_date = df['CRASH DATE'].max()

# Initialize US holidays for relevant years
years = range(min_date.year, max_date.year + 1)
us_holidays = holidays.US(years=years, observed=True)

# Add 'is_holiday' feature
df['is_holiday'] = df['CRASH DATE'].isin(us_holidays.keys()).astype(bool)

# Add 'holiday_name' feature
df['holiday_name'] = df['CRASH DATE'].apply(lambda x: us_holidays.get(x, None))


In [None]:
df['holiday_name'].value_counts()

In [None]:
df.head()

In [None]:
df['is_holiday'].value_counts()

It is known that certain holidays correlate with spikes in travel, and one could reasonably hypothesize that this would be reflected in the collision statistics.

Using the holidays library (https://pypi.org/project/holidays/) we can easily match dates, in datetime format, to their respective holidays, which quite effectively labels our data. However, this comes with several new considerations. There are noticeably fewer values counted for several of the holidays.

Juneteenth, for instance, became a federal holiday only in 2021, so it may not be recognized as a holiday for the earlier dates in the dataset, given that there are data points from as early as 2012.

A separate boolean feature, is_holiday, has been added to binarize the holiday feature.
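
Because some holidays are labeled in fewer years than others, raw counts per holiday are not directly comparable. One way to account for this, sketched below on toy data with hypothetical labels, is to divide each holiday's collision count by the number of distinct dates on which it occurs.

```python
import pandas as pd

# Toy labeled collisions (hypothetical dates and holiday labels)
toy = pd.DataFrame({
    'CRASH DATE': pd.to_datetime(['2012-07-04', '2012-07-04', '2013-07-04',
                                  '2022-06-19', '2022-06-19', '2022-06-19']),
    'holiday_name': ['Independence Day', 'Independence Day', 'Independence Day',
                     'Juneteenth', 'Juneteenth', 'Juneteenth'],
})

per_holiday = toy.groupby('holiday_name')['CRASH DATE'].agg(
    collisions='count',      # total collisions labeled with the holiday
    occurrences='nunique',   # number of distinct dates the holiday covers
)
per_holiday['collisions_per_occurrence'] = (
    per_holiday['collisions'] / per_holiday['occurrences']
)
print(per_holiday)
```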

##### Rush Hour

In [None]:
# between() includes both endpoints, so these cover hours 7-9 and 16-18
df['IS_MORNING_RUSH'] = df['CRASH HOUR'].between(7, 9)
df['IS_EVENING_RUSH'] = df['CRASH HOUR'].between(16, 18)
df['IS_RUSH_HOUR'] = df['IS_MORNING_RUSH'] | df['IS_EVENING_RUSH']

There is a clear spike in traffic during rush hour, which spans several hours in the morning and in the late afternoon when people are typically traveling to or from work. We can capture this trend with a boolean feature, IS_RUSH_HOUR, covering crash hours 7 through 9 and 16 through 18, along with separate IS_MORNING_RUSH and IS_EVENING_RUSH flags.
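
The same hour-based idea can be generalized from boolean flags to a single time-of-day category. The sketch below uses `pd.cut` on toy hours; the bin edges and labels are hypothetical, chosen so that the rush-hour bins line up with hours 7 through 9 and 16 through 18.

```python
import pandas as pd

# Toy crash hours (hypothetical values)
hours = pd.Series([0, 6, 8, 12, 17, 23], name='CRASH HOUR')

# Hypothetical bin edges; right=False makes each bin [start, end)
bins = [0, 7, 10, 16, 19, 24]
labels = ['overnight', 'morning_rush', 'midday', 'evening_rush', 'night']

time_of_day = pd.cut(hours, bins=bins, labels=labels, right=False)
print(time_of_day.tolist())
```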


In [None]:
df.head()

#### Geographical Features

##### Weather

Adverse weather conditions pose a serious hazard for drivers. Since our dataset contains fairly detailed information about the time and location of accidents, integrating historical weather data is entirely possible and would likely benefit the model.
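
As a rough illustration of what such an integration could look like, the sketch below left-joins a daily weather table onto collisions by date. The `weather` frame, its `date` and `precip_in` columns, and all values are hypothetical; no weather source has been selected for this project yet.

```python
import pandas as pd

# Toy collisions (hypothetical IDs and dates)
collisions = pd.DataFrame({
    'COLLISION_ID': [1, 2, 3],
    'CRASH DATE': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-02']),
})

# Hypothetical daily weather table: one row per date
weather = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-01-02']),
    'precip_in': [0.0, 0.4],
})

# Left join keeps every collision; validate guards against duplicate weather dates
merged = collisions.merge(weather, how='left', left_on='CRASH DATE',
                          right_on='date', validate='many_to_one')
merged = merged.drop(columns='date')
print(merged)
```

Hourly or location-specific weather would need a finer join key, for example date plus hour.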

#### Collision Specific Features

There is a great deal of information in the dataset that pertains to how severe an accident is, such as the number of people killed and the number of vehicles involved. This affords us an opportunity to account for more severe incidents, which may impact traffic conditions for longer periods of time than minor fender benders.
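
One simple way to encode severity, sketched here on toy data, is a three-level category derived from the killed and injured counts. The column name `severity` and the three labels are hypothetical and not features of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collision data (hypothetical values)
toy = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': [0, 2, 1, 0],
    'NUMBER OF PERSONS KILLED': [0, 0, 1, 0],
})

# Conditions are checked in order, so a fatal collision wins over an injury
conditions = [
    toy['NUMBER OF PERSONS KILLED'] > 0,
    toy['NUMBER OF PERSONS INJURED'] > 0,
]
choices = ['fatal', 'injury']
toy['severity'] = np.select(conditions, choices, default='property_damage_only')
print(toy)
```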

##### Number of Vehicles Involved

Since there are already features for vehicle type 1 through vehicle type 5, we can essentially binarize each of those features (present or missing) and sum them to get a tally of the total number of vehicles involved. This may make it possible to remove the vehicle type columns from the dataset entirely without losing that information.

As noted in the Missing Values section, there are a large number of missing values for vehicle type 3 through vehicle type 5, which is likely a result of the vast majority of accidents involving 2 or fewer vehicles.

In [None]:
# Add a new column with the number of vehicles involved in each collision,
# counted as the non-NaN values across the vehicle type columns
vehicle_type_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']
df['Number of Vehicles Involved'] = df[vehicle_type_cols].count(axis=1)
print(df[vehicle_type_cols + ['Number of Vehicles Involved']])

## Exploratory Data Analysis (EDA)

In [None]:
df = pd.read_csv('Motor_Vehicle_Collisions_cpy2.csv')

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df['CONTRIBUTING FACTOR VEHICLE 1'].value_counts()

### Folium Maps

In [None]:
import folium
from folium.plugins import MarkerCluster


# Remove rows with NaN values in LATITUDE or LONGITUDE
df = df.dropna(subset=['LATITUDE', 'LONGITUDE'])

# Define the date range for filtering
start_date = '2018-01-01'
end_date = '2018-01-31'

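# 'CRASH DATE' is read from the CSV as text, so convert it to datetime
# before comparing against the date range and calling strftime below
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'], errors='coerce')

# Filter DataFrame based on date range
mask = (df['CRASH DATE'] >= start_date) & (df['CRASH DATE'] <= end_date)
filtered_df = df.loc[mask]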

# Check if there are still NaNs after filtering
assert filtered_df[['LATITUDE', 'LONGITUDE']].notna().all().all(), "There are still NaN values in LATITUDE or LONGITUDE."

# Create a base map
center_lat = filtered_df['LATITUDE'].mean()
center_lon = filtered_df['LONGITUDE'].mean()
m = folium.Map(location=[center_lat, center_lon], zoom_start=12)

# Add MarkerCluster to the map
marker_cluster = MarkerCluster().add_to(m)

# Add accident markers to the map
# May be worth modifying later to distinguish between vehicle type, fatality and or pedestrian hit
for index, row in filtered_df.iterrows():
    folium.Marker(
        location=(row['LATITUDE'], row['LONGITUDE']),
        popup=folium.Popup(f"Date: {row['CRASH DATE'].strftime('%Y-%m-%d')}", parse_html=True)
    ).add_to(marker_cluster)

# Uncomment to save the map to an HTML file (named after the January 2018 range above)
# m.save('2018_01_accident_map.html')
m


## Modeling

In [None]:
import matplotlib.pyplot as plt

# Histogram for a numerical column
df['LATITUDE'].hist(bins=30, edgecolor='black')
plt.title('Distribution of LATITUDE')
plt.xlabel('Latitude')
plt.ylabel('Frequency')
plt.show()

# Boxplot for a numerical column
df.boxplot(column='LATITUDE')
plt.title('Boxplot of LATITUDE')
plt.show()

## Results

In Progress!


## Conclusion

In Progress!

## References

In Progress!