### Columns to Drop:
1. BOROUGH, ZIP CODE, LOCATION, ON STREET NAME, CROSS STREET NAME, OFF STREET NAME: redundant, since the LATITUDE and LONGITUDE columns already give the location
2. NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF MOTORIST KILLED: redundant, since NUMBER OF PERSONS INJURED and NUMBER OF PERSONS KILLED hold the totals
3. CONTRIBUTING FACTOR VEHICLE 2 to 5: the majority of values are unspecified
4. COLLISION_ID: carries no useful information

### Columns to Alter:
1. CONTRIBUTING FACTOR VEHICLE 1-5: Reduce class size to top 10 factors?
<br>
ISSUES:    Unspecified and Driver Distraction account for about 42% and 24% of values respectively; every other value accounts for about 7% or less.
<br>
SOLUTION:  Drop columns.
<br>
2. VEHICLE TYPE CODE 1-5: Reduce class size to top 10 factors? Reduce columns into car count?
<br>
ISSUES:    Sedan & Station Wagon make up over 60% of values for Factor 1, with every other type under 10%, and about 70% of values for Factor 2.
<br>
SOLUTION:  Drop the columns and create a new column with the total number of vehicles associated with each accident.
<br>
3. Correct missing values in the total number of persons injured/killed
4. Drop rows with a value of 0 in the TOTAL VEHICLES column
5. Drop rows with NaN values
6. Verify column dtypes
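
The shares quoted in the ISSUES lines above can be checked with `value_counts(normalize=True)`. The following is a minimal sketch on a small made-up frame; the toy values and the helper name `category_shares` are illustrative assumptions, not taken from the dataset or the original notebook.

```python
import pandas as pd

# Toy stand-in for the collision data; the values are illustrative only.
toy = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": [
        "Unspecified", "Unspecified", "Driver Inattention/Distraction",
        "Failure to Yield Right-of-Way", None,
    ]
})

def category_shares(frame, column, top_n=10):
    """Return the top_n most frequent values of a column as percentages (NaN counted)."""
    shares = frame[column].value_counts(normalize=True, dropna=False) * 100
    return shares.head(top_n).round(1)

shares = category_shares(toy, "CONTRIBUTING FACTOR VEHICLE 1")
print(shares)
```

Passing `dropna=False` keeps missing values in the denominator, so the percentages describe the whole column rather than only the filled-in rows.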

### Column to Add:
1. Create target column with 5 classes

In [2]:
#!pip install geopy

In [3]:
from pathlib import Path
import pandas as pd
import numpy as np
#from geopy.geocoders import Nominatim
import datetime

In [4]:
# Define the path to the folder
folder_path = Path("C:/Users/crazy/OneDrive - The City College of New York/DSE I2100 - Applied Machine Learning and Data Mining/Project")
# Take the first CSV file found in the folder
csv_file = next(folder_path.glob("*.csv"))

# Load CSV file into DataFrame
df = pd.read_csv(csv_file)



In [5]:
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2070069 entries, 0 to 2070068
Data columns (total 29 columns):
 #   Column                         Non-Null Count    Dtype  
---  ------                         --------------    -----  
 0   CRASH DATE                     2070069 non-null  object 
 1   CRASH TIME                     2070069 non-null  object 
 2   BOROUGH                        1426009 non-null  object 
 3   ZIP CODE                       1425759 non-null  object 
 4   LATITUDE                       1836747 non-null  float64
 5   LONGITUDE                      1836747 non-null  float64
 6   LOCATION                       1836747 non-null  object 
 7   ON STREET NAME                 1631008 non-null  object 
 8   CROSS STREET NAME              1288388 non-null  object 
 9   OFF STREET NAME                346688 non-null   object 
 10  NUMBER OF PERSONS INJURED      2070051 non-null  float64
 11  NUMBER OF PERSONS KILLED       2070038 non-null  float64
 12  NUMBER OF PEDE

### Correcting values in # of persons injured/killed:

In [7]:
# Sum the injury-related columns
injury_columns = ['NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF MOTORIST INJURED']
df['NUMBER OF PERSONS INJURED'] = df[injury_columns].sum(axis=1)

# Sum the killed-related columns
killed_columns = ['NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST KILLED']
df['NUMBER OF PERSONS KILLED'] = df[killed_columns].sum(axis=1)

### Replacing coordinates outside NYC with NaN in the 'LATITUDE' & 'LONGITUDE' columns:

In [9]:
# Approximate coordinates for New York City:
# Maximum Latitude: 40.9176 (Northernmost point of the Bronx)
# Minimum Latitude: 40.4774 (Southernmost point of Staten Island)
# Maximum Longitude: -73.7004 (Easternmost point of Queens)
# Minimum Longitude: -74.2591 (Westernmost point of Staten Island)

# Define the maximum and minimum values for latitude and longitude
max_latitude = 40.9176
min_latitude = 40.4774
max_longitude = -73.7004
min_longitude = -74.2591

# Filter the DataFrame based on the conditions for latitude and longitude
invalid_latitudes = (df['LATITUDE'] > max_latitude) | (df['LATITUDE'] < min_latitude)
invalid_longitudes = (df['LONGITUDE'] > max_longitude) | (df['LONGITUDE'] < min_longitude)

# Replace the values with NaN where the conditions are not met
df.loc[invalid_latitudes, 'LATITUDE'] = np.nan
df.loc[invalid_longitudes, 'LONGITUDE'] = np.nan
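
The same bounding-box check can also be written with `Series.between`, which is inclusive on both ends by default. The sketch below uses made-up points and blanks both coordinates whenever either one falls outside the box. That differs from the cell above only in the intermediate state, because any row containing a NaN is dropped later anyway.

```python
import numpy as np
import pandas as pd

# Bounding box copied from the cell above.
MIN_LAT, MAX_LAT = 40.4774, 40.9176
MIN_LON, MAX_LON = -74.2591, -73.7004

# Made-up points: one inside NYC, one at (0, 0), one with only a bad longitude.
pts = pd.DataFrame({
    "LATITUDE": [40.7128, 0.0, 40.7000],
    "LONGITUDE": [-74.0060, 0.0, -80.0000],
})

# between() includes both endpoints; NaN values evaluate to False.
inside = (
    pts["LATITUDE"].between(MIN_LAT, MAX_LAT)
    & pts["LONGITUDE"].between(MIN_LON, MAX_LON)
)

# Blank both coordinates for any row that is not fully inside the box.
pts.loc[~inside, ["LATITUDE", "LONGITUDE"]] = np.nan
print(pts)
```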

### Replacing Vehicle Type columns with Total Vehicles associated with accident:

In [11]:
vehicle_columns = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']

# Count the non-NaN vehicle type codes in each row
df['TOTAL VEHICLES'] = df[vehicle_columns].notna().sum(axis=1)

# Drop rows where 'TOTAL VEHICLES' column is 0
df = df[df['TOTAL VEHICLES'] != 0]

### Dropping redundant/unnecessary columns:

In [13]:
df.drop(columns=['BOROUGH', 'ZIP CODE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME',
                 'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
                 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED',
                 'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
                 'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5', 'COLLISION_ID', 'VEHICLE TYPE CODE 1',
                 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'], inplace=True)

### Dropping rows with NaN values:

In [15]:
df = df.dropna()

### Verifying column dtype:

In [17]:
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'])
df['CRASH TIME'] = pd.to_datetime(df['CRASH TIME'], format='%H:%M').dt.time
df['LATITUDE'] = df['LATITUDE'].astype('float64')
df['LONGITUDE'] = df['LONGITUDE'].astype('float64')
df['NUMBER OF PERSONS INJURED'] = df['NUMBER OF PERSONS INJURED'].astype('int64')
df['NUMBER OF PERSONS KILLED'] = df['NUMBER OF PERSONS KILLED'].astype('int64')
df['TOTAL VEHICLES'] = df['TOTAL VEHICLES'].astype('int64')

### Creating target column "CLASS TYPE":

In [19]:
def determine_class_type(row):
    # Python evaluates `and` before `or`; the parentheses below make that explicit.
    # Branches are checked in order, so a vehicle-count match in an earlier branch
    # takes priority over the injury/fatality tests in later branches.
    injured = row['NUMBER OF PERSONS INJURED']
    killed = row['NUMBER OF PERSONS KILLED']
    vehicles = row['TOTAL VEHICLES']
    if (injured == 0 and killed == 0) or vehicles == 1:
        return 'Class 0'
    elif (1 <= injured <= 5 and killed == 0) or vehicles == 2:
        return 'Class 1'
    elif (injured > 5 and killed == 0) or vehicles == 3:
        return 'Class 2'
    elif 1 <= killed <= 3 or vehicles == 4:
        return 'Class 3'
    elif killed > 3 or vehicles == 5:
        return 'Class 4'

df['CLASS TYPE'] = df.apply(determine_class_type, axis=1)
df['CLASS TYPE'] = df['CLASS TYPE'].astype('category')
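
Row-wise `apply` over roughly 1.8 million rows is slow. The vectorized sketch below, run on made-up rows, uses `np.select` and is meant to mirror the branch order and operator precedence of `determine_class_type` (`and` binds tighter than `or`, and the first matching branch wins). It does not change the class definitions. The helper name `class_type_vectorized` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

LABELS = ["Class 0", "Class 1", "Class 2", "Class 3", "Class 4"]

def class_type_vectorized(frame):
    """Vectorized counterpart of the row-wise rule; the first matching condition wins."""
    inj = frame["NUMBER OF PERSONS INJURED"]
    killed = frame["NUMBER OF PERSONS KILLED"]
    veh = frame["TOTAL VEHICLES"]
    conditions = [
        ((inj == 0) & (killed == 0)) | (veh == 1),
        (inj.between(1, 5) & (killed == 0)) | (veh == 2),
        ((inj > 5) & (killed == 0)) | (veh == 3),
        killed.between(1, 3) | (veh == 4),
        (killed > 3) | (veh == 5),
    ]
    # np.select returns the choice for the first True condition; -1 marks "no match".
    codes = np.select(conditions, [0, 1, 2, 3, 4], default=-1)
    # from_codes treats -1 as a missing category.
    return pd.Series(pd.Categorical.from_codes(codes, categories=LABELS), index=frame.index)

# Made-up rows chosen to exercise each branch, including the vehicle-count overrides.
toy = pd.DataFrame({
    "NUMBER OF PERSONS INJURED": [0, 2, 0, 6, 0, 0, 0],
    "NUMBER OF PERSONS KILLED":  [0, 0, 1, 0, 4, 2, 1],
    "TOTAL VEHICLES":            [2, 2, 1, 4, 5, 3, 4],
})
result = class_type_vectorized(toy)
print(result.tolist())
```

The third toy row (one fatality, one vehicle) lands in `Class 0` because the vehicle-count test in the first branch matches before the fatality branches are reached, which is the same behavior as the row-wise function.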

In [20]:
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1819890 entries, 3 to 2070068
Data columns (total 8 columns):
 #   Column                     Non-Null Count    Dtype         
---  ------                     --------------    -----         
 0   CRASH DATE                 1819890 non-null  datetime64[ns]
 1   CRASH TIME                 1819890 non-null  object        
 2   LATITUDE                   1819890 non-null  float64       
 3   LONGITUDE                  1819890 non-null  float64       
 4   NUMBER OF PERSONS INJURED  1819890 non-null  int64         
 5   NUMBER OF PERSONS KILLED   1819890 non-null  int64         
 6   TOTAL VEHICLES             1819890 non-null  int64         
 7   CLASS TYPE                 1819890 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(2), int64(3), object(1)
memory usage: 112.8+ MB


In [21]:
df.head()

   CRASH DATE  CRASH TIME  LATITUDE   LONGITUDE   NUMBER OF PERSONS INJURED  NUMBER OF PERSONS KILLED  TOTAL VEHICLES  CLASS TYPE
3  2021-09-11  09:35:00    40.667202  -73.8665    0                          0                         1               Class 0
6  2021-12-14  17:05:00    40.709183  -73.956825  0                          0                         2               Class 0
7  2021-12-14  08:17:00    40.86816   -73.83148   2                          0                         2               Class 1
8  2021-12-14  21:10:00    40.67172   -73.8971    0                          0                         1               Class 0
9  2021-12-14  14:58:00    40.75144   -73.97397   0                          0                         2               Class 0


In [22]:
# Get the counts of unique values in the "CLASS TYPE" column
CLASS_TYPE_counts = df['CLASS TYPE'].value_counts()

# Print unique values and their counts
print("Unique values and their counts in the 'CLASS TYPE' column:")
print(CLASS_TYPE_counts)

# Calculate the total sum of the counts
total_sum = CLASS_TYPE_counts.sum()
print("Total sum of counts:", total_sum)

Unique values and their counts in the 'CLASS TYPE' column:
Class 0    1535869
Class 1     282817
Class 2       1064
Class 3        139
Class 4          1
Name: CLASS TYPE, dtype: int64
Total sum of counts: 1819890


### Converting new dataframe to csv file:

In [40]:
# Save the DataFrame to a CSV file
df.to_csv('Processed_Data.csv', index=False)
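
CSV does not store dtypes, so the `datetime64` and `category` columns come back as plain strings unless they are requested again on load. The following is a minimal round-trip sketch on a made-up two-row frame, written to an in-memory buffer rather than to `Processed_Data.csv`.

```python
import io
import pandas as pd

# Made-up frame with the two dtypes that a CSV round trip does not preserve.
sample = pd.DataFrame({
    "CRASH DATE": pd.to_datetime(["2021-09-11", "2021-12-14"]),
    "CLASS TYPE": pd.Categorical(["Class 0", "Class 1"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv to restore the date and category dtypes explicitly.
reloaded = pd.read_csv(
    buffer,
    parse_dates=["CRASH DATE"],
    dtype={"CLASS TYPE": "category"},
)
print(reloaded.dtypes)
```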