### Columns to drop:
1. BOROUGH, ZIP CODE, LOCATION, ON STREET NAME, CROSS STREET NAME, OFF STREET NAME: redundant, since LATITUDE and LONGITUDE already give the geographic coordinates
2. NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF MOTORIST KILLED: redundant, since NUMBER OF PERSONS INJURED and NUMBER OF PERSONS KILLED already give the totals
3. CONTRIBUTING FACTOR VEHICLE 2 to 5: the majority of values are unspecified
4. COLLISION_ID: no useful information

### Columns to Alter:
1. CONTRIBUTING FACTOR VEHICLE 1-5: Reduce class size to top 10 factors?
<br>
ISSUES:    Unspecified and Driver Distraction account for about 42% and 24% of values, respectively; every other value accounts for about 7% or less.
<br>
SOLUTION:  Drop columns.
<br>
2. VEHICLE TYPE CODE 1-5: Reduce class size to top 10 factors? Reduce columns into car count?
<br>
ISSUES:    Sedan and Station Wagon together make up over 60% of values in VEHICLE TYPE CODE 1, where every other type is under 10%, and about 70% of values in VEHICLE TYPE CODE 2.
<br>
SOLUTION:  Drop the columns and create a new column with the total number of cars involved in the accident.
<br>
3. Correct missing values in # of persons injured/killed
4. Drop 0 values in Total Cars column
5. Finally, drop NaN values
6. Verify column dtype
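
The "Total Cars" column mentioned in the plan above is not built in the cells that follow. As a rough sketch only, one way it could be derived is by counting the non-missing VEHICLE TYPE CODE entries per row. The column name `TOTAL CARS` and the toy data are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; only the five type columns matter here.
toy = pd.DataFrame({
    "VEHICLE TYPE CODE 1": ["Sedan", "Taxi", np.nan],
    "VEHICLE TYPE CODE 2": ["Sedan", np.nan, np.nan],
    "VEHICLE TYPE CODE 3": [np.nan, np.nan, np.nan],
    "VEHICLE TYPE CODE 4": [np.nan, np.nan, np.nan],
    "VEHICLE TYPE CODE 5": [np.nan, np.nan, np.nan],
})

type_columns = [f"VEHICLE TYPE CODE {i}" for i in range(1, 6)]

# Count the non-missing vehicle type entries in each row.
# "TOTAL CARS" is a hypothetical column name.
toy["TOTAL CARS"] = toy[type_columns].notna().sum(axis=1)

# Drop rows with no recorded vehicle (step 4 of the plan).
toy = toy[toy["TOTAL CARS"] > 0].reset_index(drop=True)

print(toy["TOTAL CARS"].tolist())
```

The count would have to be taken before any missing type codes are filled with a placeholder such as 'No vehicle', otherwise every row would count five vehicles.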

### Column to Add:
1. Create target column with 5 classes

In [1]:
#!pip install geopy

In [2]:
from pathlib import Path
import pandas as pd
import numpy as np
#from geopy.geocoders import Nominatim
import datetime

In [3]:
# Define the path to the folder
folder_path = Path("C:/Users/crazy/OneDrive - The City College of New York/DSE I2100 - Applied Machine Learning and Data Mining/Project")
# Take the first CSV file found in the folder
csv_file = next(folder_path.glob("*.csv"))

# Load CSV file into DataFrame
df = pd.read_csv(csv_file)


In [4]:
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2070069 entries, 0 to 2070068
Data columns (total 29 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   CRASH DATE                     2070069 non-null  object 
 1   CRASH TIME                     2070069 non-null  object 
 2   BOROUGH                        1426009 non-null  object 
 3   ZIP CODE                       1425759 non-null  object 
 4   LATITUDE                       1836747 non-null  float64
 5   LONGITUDE                      1836747 non-null  float64
 6   LOCATION                       1836747 non-null  object 
 7   ON STREET NAME                 1631008 non-null  object 
 8   CROSS STREET NAME              1288388 non-null  object 
 9   OFF STREET NAME                346688 non-null   object 
 10  NUMBER OF PERSONS INJURED      2070051 non-null  float64
 11  NUMBER OF PERSONS KILLED       2070038 non-null  float64
 12  NUMBER OF PEDE

### Correcting values in # of persons injured/killed:

In [5]:
# Sum the injury-related columns
injury_columns = ['NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF MOTORIST INJURED']
df['NUMBER OF PERSONS INJURED'] = df[injury_columns].sum(axis=1)

# Sum the killed-related columns
killed_columns = ['NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST KILLED']
df['NUMBER OF PERSONS KILLED'] = df[killed_columns].sum(axis=1)
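
To illustrate what this recomputation does, here is a small sketch on made-up rows. It assumes the default `skipna=True` behaviour of `DataFrame.sum`, under which missing component values are treated as 0; the toy values are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NUMBER OF PEDESTRIANS INJURED": [1, 0, np.nan],
    "NUMBER OF CYCLIST INJURED": [0, 2, np.nan],
    "NUMBER OF MOTORIST INJURED": [1, 0, np.nan],
    # Row 0: total missing, row 1: total correct, row 2: total with no components recorded.
    "NUMBER OF PERSONS INJURED": [np.nan, 2.0, 5.0],
})

injury_columns = [
    "NUMBER OF PEDESTRIANS INJURED",
    "NUMBER OF CYCLIST INJURED",
    "NUMBER OF MOTORIST INJURED",
]

# Overwrite the total with the row-wise sum of its components.
toy["NUMBER OF PERSONS INJURED"] = toy[injury_columns].sum(axis=1)

print(toy["NUMBER OF PERSONS INJURED"].tolist())
```

Note the side effect in the last row: a recorded total is replaced by 0 when none of its components are present, so the recomputed column trusts the component columns over the original total.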

### Replacing coordinates outside NYC with NaN in the 'LATITUDE' & 'LONGITUDE' columns:

In [6]:
# Approximate coordinates for New York City:
# Maximum Latitude: 40.9176 (Northernmost point of the Bronx)
# Minimum Latitude: 40.4774 (Southernmost point of Staten Island)
# Maximum Longitude: -73.7004 (Easternmost point of Queens)
# Minimum Longitude: -74.2591 (Westernmost point of Staten Island)

# Define the maximum and minimum values for latitude and longitude
max_latitude = 40.9176
min_latitude = 40.4774
max_longitude = -73.7004
min_longitude = -74.2591

# Build boolean masks for coordinates that fall outside the NYC bounds
invalid_latitudes = (df['LATITUDE'] > max_latitude) | (df['LATITUDE'] < min_latitude)
invalid_longitudes = (df['LONGITUDE'] > max_longitude) | (df['LONGITUDE'] < min_longitude)

# Replace the out-of-range values with NaN
df.loc[invalid_latitudes, 'LATITUDE'] = np.nan
df.loc[invalid_longitudes, 'LONGITUDE'] = np.nan
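
The cell above checks latitude and longitude independently. A possible variant, sketched here on toy data, treats each point as a whole and blanks both coordinates when either one falls outside the box. This uses `Series.between` and is an alternative shown for comparison, not the notebook's method; the constant names and sample coordinates are assumptions.

```python
import numpy as np
import pandas as pd

# Same approximate NYC bounds as in the cell above.
MIN_LAT, MAX_LAT = 40.4774, 40.9176
MIN_LON, MAX_LON = -74.2591, -73.7004

toy = pd.DataFrame({
    "LATITUDE": [40.7128, 0.0, 40.7500, np.nan],
    "LONGITUDE": [-74.0060, 0.0, -201.2371, -73.9000],
})

# True only when both coordinates are present and inside the bounding box.
in_box = (
    toy["LATITUDE"].between(MIN_LAT, MAX_LAT)
    & toy["LONGITUDE"].between(MIN_LON, MAX_LON)
)

# Blank both coordinates whenever the point as a whole is not inside the box.
toy.loc[~in_box, ["LATITUDE", "LONGITUDE"]] = np.nan

print(toy)
```

Because rows with any NaN are dropped later in the notebook, both approaches should remove the same rows in the end; the difference is only in which cells hold NaN in between.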

### Creating target column "CLASS TYPE":

In [7]:
def determine_class_type(row):
    if row['NUMBER OF PERSONS INJURED'] == 0 and row['NUMBER OF PERSONS KILLED'] == 0:
        return 'Class 0'
    elif 1 <= row['NUMBER OF PERSONS INJURED'] <= 5 and row['NUMBER OF PERSONS KILLED'] == 0:
        return 'Class 1'
    elif row['NUMBER OF PERSONS INJURED'] > 5 and row['NUMBER OF PERSONS KILLED'] == 0:
        return 'Class 2'
    elif 1 <= row['NUMBER OF PERSONS KILLED'] <= 3:
        return 'Class 3'
    elif row['NUMBER OF PERSONS KILLED'] > 3:
        return 'Class 4'

df['CLASS TYPE'] = df.apply(determine_class_type, axis=1)
df['CLASS TYPE'] = df['CLASS TYPE'].astype('category')
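
Row-wise `apply` can be slow on a frame with around two million rows. As a hedged sketch, the same five rules could be expressed with `numpy.select`, which evaluates the conditions column-wise and takes the first match per row. The toy data and the `"Unclassified"` placeholder are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NUMBER OF PERSONS INJURED": [0, 3, 7, 2, 0],
    "NUMBER OF PERSONS KILLED": [0, 0, 0, 1, 4],
})

injured = toy["NUMBER OF PERSONS INJURED"]
killed = toy["NUMBER OF PERSONS KILLED"]

# Conditions mirror the if/elif chain above, in the same order.
conditions = [
    (injured == 0) & (killed == 0),
    injured.between(1, 5) & (killed == 0),
    (injured > 5) & (killed == 0),
    killed.between(1, 3),
    killed > 3,
]
labels = ["Class 0", "Class 1", "Class 2", "Class 3", "Class 4"]

# Rows matching no condition get a placeholder that is not a valid category,
# so they end up as NaN, much like the implicit None returned by the function.
raw = np.select(conditions, labels, default="Unclassified")
toy["CLASS TYPE"] = pd.Categorical(raw, categories=labels)

print(toy["CLASS TYPE"].tolist())
```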

### Dropping redundant/unnecessary columns:

In [8]:
# `columns=` already identifies the axis, so `axis=1` is not needed
df.drop(columns=['BOROUGH', 'ZIP CODE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME',
                 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
                 'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED',
                 'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED', 'COLLISION_ID'], inplace=True)

### Removing/Replacing NaN values for Contributing Factor & Vehicle Type columns:

In [9]:
# Drop NaN values from subset columns
df = df.dropna(subset=['CONTRIBUTING FACTOR VEHICLE 1', 'VEHICLE TYPE CODE 1'])

# List of column pairs
column_pairs = [
    ('CONTRIBUTING FACTOR VEHICLE 2', 'VEHICLE TYPE CODE 2'),
    ('CONTRIBUTING FACTOR VEHICLE 3', 'VEHICLE TYPE CODE 3'),
    ('CONTRIBUTING FACTOR VEHICLE 4', 'VEHICLE TYPE CODE 4'),
    ('CONTRIBUTING FACTOR VEHICLE 5', 'VEHICLE TYPE CODE 5')
]

# Iterate over each column pair
for factor_column, type_column in column_pairs:
    # Check if factor_column has a value and type_column is NaN
    mask = (pd.notna(df[factor_column])) & (pd.isna(df[type_column]))
    # Check if type_column has a value and factor_column is NaN
    mask |= (pd.notna(df[type_column])) & (pd.isna(df[factor_column]))
    # Drop rows where either condition is met
    df = df[~mask]

# Reset the index after dropping rows
df.reset_index(drop=True, inplace=True)

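# Fill remaining NaN values: 'No factor' in the contributing factor columns,
# 'No vehicle' in the vehicle type columns. Assigning the result back avoids
# the chained-assignment pitfalls of calling fillna(..., inplace=True) on a column.
for factor_column, type_column in column_pairs:
    df[factor_column] = df[factor_column].fillna('No factor')
    df[type_column] = df[type_column].fillna('No vehicle')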
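
The pairing rule in this cell can be seen on a small example: a row is dropped when exactly one of the factor/type pair is missing, and pairs that are both missing are filled with placeholders. The sketch below uses a single pair and invented rows; expressing the mask with XOR (`^`) is a design choice of the sketch that is equivalent to the two-part mask above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 2": ["Unspecified", np.nan, "Unspecified", np.nan],
    "VEHICLE TYPE CODE 2": ["Sedan", np.nan, np.nan, "Taxi"],
})

factor_missing = toy["CONTRIBUTING FACTOR VEHICLE 2"].isna()
type_missing = toy["VEHICLE TYPE CODE 2"].isna()

# XOR is True when exactly one value of the pair is missing.
inconsistent = factor_missing ^ type_missing
toy = toy[~inconsistent].reset_index(drop=True)

# Pairs that are missing on both sides mean "no such vehicle in this crash".
toy["CONTRIBUTING FACTOR VEHICLE 2"] = toy["CONTRIBUTING FACTOR VEHICLE 2"].fillna("No factor")
toy["VEHICLE TYPE CODE 2"] = toy["VEHICLE TYPE CODE 2"].fillna("No vehicle")

print(toy)
```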

### Dropping rows with NaN values:

In [10]:
df = df.dropna()

### Keeping only top 10 values in Contributing Factor & Vehicle Type columns:

In [11]:
# List of columns to iterate through
columns_to_iterate = [
    'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
    'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3','VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'
]

# Iterate over each column; each column's top 10 is computed on the rows
# that survived the filtering of the previous columns
for column in columns_to_iterate:
    # Get the top 10 most frequent values in the column
    top_10_values = df[column].value_counts().head(10).index
    # Keep only rows whose value is in the top 10
    # (a literal 'Other' value is also excluded, as in the earlier placeholder approach)
    df = df[df[column].isin(top_10_values) & (df[column] != 'Other')]

# Reset the index once after all rows have been dropped
df = df.reset_index(drop=True)
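
As a small illustration of the top-N filtering idea on one column, the sketch below keeps only rows whose value is among the two most frequent. The helper name `keep_top_n` and the toy data are hypothetical and exist only for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "VEHICLE TYPE CODE 1": ["Sedan"] * 4 + ["Taxi"] * 3 + ["Bus"] * 2 + ["Van"],
})


def keep_top_n(frame, column, n):
    """Return only the rows whose value in `column` is among the n most frequent."""
    top_values = frame[column].value_counts().head(n).index
    return frame[frame[column].isin(top_values)].reset_index(drop=True)


filtered = keep_top_n(toy, "VEHICLE TYPE CODE 1", n=2)

print(filtered["VEHICLE TYPE CODE 1"].value_counts())
```

When several columns are filtered one after another, the order matters: each later column's ranking is computed only on rows that passed the earlier filters.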
    

### Verifying column dtype:

In [12]:
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'])
df['CRASH TIME'] = pd.to_datetime(df['CRASH TIME'], format='%H:%M').dt.time
df['LATITUDE'] = df['LATITUDE'].astype('float64')
df['LONGITUDE'] = df['LONGITUDE'].astype('float64')

# Contributing factor and vehicle type columns are all stored as object (string) columns
categorical_columns = (
    [f'CONTRIBUTING FACTOR VEHICLE {i}' for i in range(1, 6)]
    + [f'VEHICLE TYPE CODE {i}' for i in range(1, 6)]
)
for column in categorical_columns:
    df[column] = df[column].astype('object')

In [13]:
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1057417 entries, 0 to 1057416
Data columns (total 15 columns):
 #   Column                         Non-Null Count    Dtype         
---  ------                         --------------    -----         
 0   CRASH DATE                     1057417 non-null  datetime64[ns]
 1   CRASH TIME                     1057417 non-null  object        
 2   LATITUDE                       1057417 non-null  float64       
 3   LONGITUDE                      1057417 non-null  float64       
 4   CONTRIBUTING FACTOR VEHICLE 1  1057417 non-null  object        
 5   CONTRIBUTING FACTOR VEHICLE 2  1057417 non-null  object        
 6   CONTRIBUTING FACTOR VEHICLE 3  1057417 non-null  object        
 7   CONTRIBUTING FACTOR VEHICLE 4  1057417 non-null  object        
 8   CONTRIBUTING FACTOR VEHICLE 5  1057417 non-null  object        
 9   VEHICLE TYPE CODE 1            1057417 non-null  object        
 10  VEHICLE TYPE CODE 2            1057417 non-null  objec

In [14]:
df.head()

Unnamed: 0,CRASH DATE,CRASH TIME,LATITUDE,LONGITUDE,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5,CLASS TYPE
0,2021-09-11,09:35:00,40.667202,-73.8665,Unspecified,No factor,No factor,No factor,No factor,Sedan,No vehicle,No vehicle,No vehicle,No vehicle,Class 0
1,2021-12-14,08:17:00,40.86816,-73.83148,Unspecified,Unspecified,No factor,No factor,No factor,Sedan,Sedan,No vehicle,No vehicle,No vehicle,Class 1
2,2021-12-14,14:58:00,40.75144,-73.97397,Passing Too Closely,Unspecified,No factor,No factor,No factor,Sedan,Station Wagon/Sport Utility Vehicle,No vehicle,No vehicle,No vehicle,Class 0
3,2021-12-14,16:50:00,40.675884,-73.75577,Turning Improperly,Unspecified,No factor,No factor,No factor,Sedan,Station Wagon/Sport Utility Vehicle,No vehicle,No vehicle,No vehicle,Class 0
4,2021-12-11,19:43:00,40.87262,-73.904686,Unspecified,Unspecified,No factor,No factor,No factor,Station Wagon/Sport Utility Vehicle,Sedan,No vehicle,No vehicle,No vehicle,Class 1


### Converting new dataframe to csv file:

In [15]:
# Save the DataFrame to a CSV file
df.to_csv('Processed_Data_v2.csv', index=False)
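
CSV files do not store pandas dtypes, so the datetime and category columns set above come back as plain strings unless they are requested again on load. The following sketch shows one way to restore them with `read_csv`; it round-trips a toy frame through an in-memory buffer rather than the real file, and the sample rows are invented.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "CRASH DATE": pd.to_datetime(["2021-09-11", "2021-12-14"]),
    "CLASS TYPE": pd.Categorical(["Class 0", "Class 1"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv to rebuild the dtypes that CSV cannot carry.
restored = pd.read_csv(
    buffer,
    parse_dates=["CRASH DATE"],
    dtype={"CLASS TYPE": "category"},
)

print(restored.dtypes)
```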