In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
data_filepath = "..\\dataset\\new dataset.xlsx"
df = pd.read_excel(data_filepath)
print(df.head())

   Year  Month  Day Weekend?    Hour Collision Type         Injury Type  \
0  2015      1    5  Weekday     0.0          2-Car   No injury/unknown   
1  2015      1    6  Weekday  1500.0          2-Car   No injury/unknown   
2  2015      1    6  Weekend  2300.0          2-Car  Non-incapacitating   
3  2015      1    7  Weekend   900.0          2-Car  Non-incapacitating   
4  2015      1    7  Weekend  1100.0          2-Car   No injury/unknown   

                          Primary Factor      Reported_Location   Latitude  \
0  OTHER (DRIVER) - EXPLAIN IN NARRATIVE             1ST & FESS  39.159207   
1                  FOLLOWING TOO CLOSELY          2ND & COLLEGE  39.161440   
2              DISREGARD SIGNAL/REG SIGN  BASSWOOD & BLOOMFIELD  39.149780   
3          FAILURE TO YIELD RIGHT OF WAY         GATES & JACOBS  39.165655   
4          FAILURE TO YIELD RIGHT OF WAY                  W 3RD  39.164848   

   Longitude  
0 -86.525874  
1 -86.534848  
2 -86.568890  
3 -86.575956  
4 -86.579625  

# Data Cleaning
**Feature Selection**

The provided columns are 'Year', 'Month', 'Day', 'Weekend?', 'Hour', 'Collision Type', 'Injury Type', 'Primary Factor', 'Reported_Location', 'Latitude', and 'Longitude'.

Since we mainly care about when and where accidents happen, we can remove the features that give us no insight into these two factors: 'Collision Type' and 'Primary Factor'. We keep 'Injury Type' because it is encoded below for injury classification. We also remove 'Year', since we want this model to generalize to any year. In future versions of this model, one could use the year, month, and day to look up the weather at the time of the crash and factor it into the model.

**Removing Rows with Empty Values**

We also drop any rows that have empty values in the remaining features.

In [4]:
df = df.drop(columns=['Collision Type', 'Primary Factor', 'Year'])
df = df.dropna()
print(df.head())

   Month  Day Weekend?    Hour         Injury Type      Reported_Location  \
0      1    5  Weekday     0.0   No injury/unknown             1ST & FESS   
1      1    6  Weekday  1500.0   No injury/unknown          2ND & COLLEGE   
2      1    6  Weekend  2300.0  Non-incapacitating  BASSWOOD & BLOOMFIELD   
3      1    7  Weekend   900.0  Non-incapacitating         GATES & JACOBS   
4      1    7  Weekend  1100.0   No injury/unknown                  W 3RD   

    Latitude  Longitude  
0  39.159207 -86.525874  
1  39.161440 -86.534848  
2  39.149780 -86.568890  
3  39.165655 -86.575956  
4  39.164848 -86.579625  
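
Before dropping rows, it can help to see how many values are missing in each column. The following is a minimal sketch on a small made-up frame; the `toy` data is illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the crash data; values are made up.
toy = pd.DataFrame({
    "Hour": [0.0, 1500.0, np.nan, 900.0],
    "Latitude": [39.16, np.nan, 39.15, 39.17],
    "Longitude": [-86.53, -86.53, -86.57, -86.58],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value, as the cell above does.
toy_clean = toy.dropna()
print(f"Dropped {len(toy) - len(toy_clean)} of {len(toy)} rows")
```

If a large share of rows disappears, it may be worth checking which column is responsible before accepting the loss.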


# Normalizing Data
**Normalizing Hours**

Ignoring the decimal point, Hour values are HHMM-style numbers of up to four digits: the two least significant digits are the minutes (always 00 in this dataset) and the remaining digits are the hour. We drop the decimal part and the minutes so that only the hour of day (0 to 23) remains.

In [5]:
# Convert every Hour value to a zero-padded four-character string in HHMM format
df['Hour'] = df['Hour'].astype(int).astype(str).str.zfill(4)

# Keep just the HH part (which of the 24 hourly buckets the value falls into)
df['Hour'] = df['Hour'].str[:2].astype(int)
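
Because the values are HHMM-style numbers, the same hour can also be obtained arithmetically with integer division by 100. This is a small sketch, assuming the column holds non-negative HHMM numbers with no missing values; the `raw_hours` sample is illustrative.

```python
import pandas as pd

# Illustrative sample in the same HHMM-as-float form as the Hour column.
raw_hours = pd.Series([0.0, 900.0, 1500.0, 2300.0])

# String approach used above: zero-pad to four digits and keep the first two.
via_string = raw_hours.astype(int).astype(str).str.zfill(4).str[:2].astype(int)

# Arithmetic alternative: integer-divide the HHMM number by 100.
via_division = raw_hours.astype(int) // 100

print(via_division.tolist())  # [0, 9, 15, 23]
print((via_string == via_division).all())
```

Both approaches give the same result here; the arithmetic version simply avoids the round trip through strings.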

**Encoding Weekend**

Encode 'Weekend' as 1 and 'Weekday' as 0 so we can use this feature in our neural network.

In [6]:
weekend_mapping = {'Weekday':0, 'Weekend':1}
df['Weekend?'] = df['Weekend?'].replace(weekend_mapping)
print(df.head())

   Month  Day  Weekend?  Hour         Injury Type      Reported_Location  \
0      1    5         0     0   No injury/unknown             1ST & FESS   
1      1    6         0    15   No injury/unknown          2ND & COLLEGE   
2      1    6         1    23  Non-incapacitating  BASSWOOD & BLOOMFIELD   
3      1    7         1     9  Non-incapacitating         GATES & JACOBS   
4      1    7         1    11   No injury/unknown                  W 3RD   

    Latitude  Longitude  
0  39.159207 -86.525874  
1  39.161440 -86.534848  
2  39.149780 -86.568890  
3  39.165655 -86.575956  
4  39.164848 -86.579625  


**Encoding Injury**

Encode injury severity as an integer from 0 ('No injury/unknown') to 3 ('Fatal') so we can classify the likely injury outcome.

In [7]:
injury_mapping = {'No injury/unknown' : 0,
                  'Non-incapacitating' : 1,
                  'Incapacitating': 2,
                  'Fatal' : 3}
df['Injury Type'] = df['Injury Type'].replace(injury_mapping)
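
Note that `replace` leaves any label that is missing from the mapping untouched, so an unexpected category would silently remain a string. The sketch below shows one possible way to catch that, using `Series.map`, which yields NaN for unmapped labels. The `sample` data and the 'Unknown label' value are invented for illustration.

```python
import pandas as pd

injury_mapping = {
    "No injury/unknown": 0,
    "Non-incapacitating": 1,
    "Incapacitating": 2,
    "Fatal": 3,
}

# Made-up sample; "Unknown label" is deliberately absent from the mapping.
sample = pd.Series(["Fatal", "No injury/unknown", "Unknown label"])

# map() returns NaN for labels that are not keys of the mapping.
encoded = sample.map(injury_mapping)

# Collect the labels that failed to map so they can be inspected.
unmapped = sample[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every label in the column is covered by the mapping.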

# Save Modified Data

We save the modified data to a CSV file for use by our model.

In [8]:
df.to_csv('.\\modified_data\\cleaned_data.csv', index=False)
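
`to_csv` does not create missing directories, so the call fails if the output folder does not exist yet. The following is a minimal sketch that creates the folder first. It writes to a temporary directory so it is safe to run anywhere, and the frame contents are placeholders rather than the real cleaned data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder frame; in the notebook this would be the cleaned df.
cleaned = pd.DataFrame({"Hour": [0, 15], "Weekend?": [0, 1]})

# A temporary base directory stands in for the project folder.
base_dir = Path(tempfile.mkdtemp())
out_path = base_dir / "modified_data" / "cleaned_data.csv"

# Create the output folder (and any parents) if it is missing.
out_path.parent.mkdir(parents=True, exist_ok=True)
cleaned.to_csv(out_path, index=False)

# Read the file back to confirm the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

Using `pathlib.Path` also avoids hard-coding Windows-style backslashes, which may matter if the notebook is run on another operating system.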