# Extracting Cause and Effect to Predict Automotive Accidents
In this notebook, we build several machine learning models to predict car accident severity, as measured by an accident's impact on traffic.

### The Problem:
Car accidents come at a great cost: to the drivers and passengers involved, to the people who must wait on congested highways, and to the transportation industry at large. These are immense tragedies with costly repercussions. This project seeks to develop a model that tells us where and under what conditions severe car accidents are likely to occur. That information can then inform solutions that minimize the total cost of car accidents and, hopefully, save lives.

The model is intended as a resource for the transportation company Lyft, Inc. to minimize its insurance claims costs, either by warning drivers when they are in high-risk situations or by directing them toward lower-risk routes.

### The Data
Data Source: https://osu.app.box.com/v/us-accidents-june20

Metadata: https://smoosavi.org/datasets/us_accidents

#### Description
This is a countrywide traffic accident dataset, which covers 49 states of the United States. The data was continuously collected from February 2016 through June 2020, using several data providers, including two APIs which provide streaming traffic event data. These APIs broadcast traffic events captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. Currently, there are about 3.5 million accident records in this dataset.

#### Discussion
The data contains 49 columns. Most fall into three overarching categories: (1) Location, (2) Weather, and (3) Time of Day. The remaining columns are target variables and extra information. All are listed below by column position.

I will choose features that make the model as accurate as possible while keeping it robust and focused on factors that can inform solutions to severe car accidents, such as location, weather, and time of day. Because the goal is to discover which features are highly correlated with severe accidents (which suggests, but does not establish, a causal link), I will remove collinear features from the data.
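One common way to screen for collinearity is to inspect pairwise correlations among the numeric features and drop one column from each highly correlated pair. The following is a minimal sketch, not the method this notebook settles on: the helper name `drop_collinear` and the 0.9 threshold are illustrative assumptions, and the small frame is synthetic rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def drop_collinear(df, threshold=0.9):
    """Drop one column from each pair whose |Pearson r| exceeds threshold.

    The helper name and the 0.9 default are illustrative assumptions.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic example: wind chill tracks temperature almost exactly.
toy = pd.DataFrame({
    "Temperature(F)": [30.0, 40.0, 50.0, 60.0, 70.0],
    "Wind_Chill(F)": [25.0, 36.0, 47.0, 59.0, 70.0],
    "Humidity(%)": [90.0, 40.0, 75.0, 55.0, 60.0],
})
reduced, dropped = drop_collinear(toy)
print(dropped)
```

Pairwise correlation only catches linear relationships between two columns at a time, so it is a first pass rather than a complete collinearity check.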

The dataset lets us ask the following questions:
- Where are severe car accidents most likely to occur?
- When are severe car accidents most likely to occur?
- Under what conditions are severe car accidents most likely to occur?

(1) Location:
- 7-10 - "Exact Location"
    - 7 - Start_Lat
    - 8 - Start_Lng
    - 9 - End_Lat
    - 10 - End_Lng
- 13-20 - "Address Data"
    - 13 - Number
    - 14 - Street
    - 15 - Side
    - 16 - City
    - 17 - County
    - 18 - State
    - 19 - Zipcode
    - 20 - Country
- 33-45 - "What exists nearby"
    - 33 - Amenity
    - 34 - Bump
    - 35 - Crossing
    - 36 - Give_Way
    - 37 - Junction
    - 38 - No_Exit
    - 39 - Railway
    - 40 - Roundabout
    - 41 - Station
    - 42 - Stop
    - 43 - Traffic_Calming
    - 44 - Traffic_Signal
    - 45 - Turning_Loop

(2) Weather:
- 23 - Weather_Timestamp
- 24 - Temperature(F)
- 25 - Wind_Chill(F)
- 26 - Humidity(%)
- 27 - Pressure(in)
- 28 - Visibility(mi)
- 29 - Wind_Direction
- 30 - Wind_Speed(mph)
- 31 - Precipitation(in)
- 32 - Weather_Condition

(3) Time of Day:
- 5 - Start_Time
- 6 - End_Time
- 21 - Timezone
- 23 - Weather_Timestamp
- 46 - Sunrise_Sunset
- 47 - Civil_Twilight
- 48 - Nautical_Twilight
- 49 - Astronomical_Twilight

Target Variable(s):
- 4 - Severity
- 11 - Distance(mi)

Extra Information:
- 12 - Description
- 3 - TMC ("Traffic Message Channel" code)
- 2 - Source
- 22 - Airport_Code
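
The groupings above can be captured in code so that each category is selectable by name. This is a minimal sketch, assuming the column names match the CSV header; `FEATURE_GROUPS` and `select_group` are hypothetical names, and only a few columns per group are listed for brevity.

```python
import pandas as pd

# Hypothetical mapping from category name to a few of its columns.
FEATURE_GROUPS = {
    "location": ["Start_Lat", "Start_Lng", "State"],
    "weather": ["Temperature(F)", "Visibility(mi)", "Weather_Condition"],
    "time": ["Start_Time", "Sunrise_Sunset"],
    "target": ["Severity"],
}


def select_group(df, *groups):
    """Return the columns of df that belong to the named groups, in order."""
    wanted = [col for g in groups for col in FEATURE_GROUPS[g]]
    # Skip columns that are absent, e.g. ones dropped during cleaning.
    return df[[col for col in wanted if col in df.columns]]


# Synthetic one-row frame for illustration.
toy = pd.DataFrame({
    "Start_Lat": [39.9],
    "Start_Lng": [-84.1],
    "State": ["OH"],
    "Severity": [3],
    "Start_Time": ["2016-02-08 05:46:00"],
})
subset = select_group(toy, "location", "target")
print(list(subset.columns))
```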

##### Acknowledgements:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.” arXiv preprint arXiv:1906.05409 (2019).

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

### Data Wrangling

In [2]:
import os

# Work from the directory that holds US_Accidents.csv
os.chdir('/Users/petermcmaster/Desktop')
os.getcwd()

'/Users/petermcmaster/Desktop'

In [3]:
import pandas as pd

data = pd.read_csv("US_Accidents.csv")
pd.set_option('display.max_columns', None)
data.head(40)

Unnamed: 0,ID,Source,TMC,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,Number,Street,Side,City,County,State,Zipcode,Country,Timezone,Airport_Code,Weather_Timestamp,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,MapQuest,201.0,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,0.01,Right lane blocked due to accident on I-70 Eas...,,I-70 E,R,Dayton,Montgomery,OH,45424,US,US/Eastern,KFFO,2016-02-08 05:58:00,36.9,,91.0,29.68,10.0,Calm,,0.02,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,MapQuest,201.0,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,0.01,Accident on Brice Rd at Tussing Rd. Expect del...,2584.0,Brice Rd,L,Reynoldsburg,Franklin,OH,43068-3402,US,US/Eastern,KCMH,2016-02-08 05:51:00,37.9,,100.0,29.65,10.0,Calm,,0.0,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,MapQuest,201.0,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,0.01,Accident on OH-32 State Route 32 Westbound at ...,,State Route 32,R,Williamsburg,Clermont,OH,45176,US,US/Eastern,KI69,2016-02-08 06:56:00,36.0,33.3,100.0,29.67,10.0,SW,3.5,,Overcast,False,False,False,False,False,False,False,False,False,False,False,True,False,Night,Night,Day,Day
3,A-4,MapQuest,201.0,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,0.01,Accident on I-75 Southbound at Exits 52 52B US...,,I-75 S,R,Dayton,Montgomery,OH,45417,US,US/Eastern,KDAY,2016-02-08 07:38:00,35.1,31.0,96.0,29.64,9.0,SW,4.6,,Mostly Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Day,Day,Day
4,A-5,MapQuest,201.0,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,0.01,Accident on McEwen Rd at OH-725 Miamisburg Cen...,,Miamisburg Centerville Rd,R,Dayton,Montgomery,OH,45459,US,US/Eastern,KMGY,2016-02-08 07:53:00,36.0,33.3,89.0,29.65,6.0,SW,3.5,,Mostly Cloudy,False,False,False,False,False,False,False,False,False,False,False,True,False,Day,Day,Day,Day
5,A-6,MapQuest,201.0,3,2016-02-08 07:44:26,2016-02-08 08:14:26,40.10059,-82.925194,,,0.01,Accident on I-270 Outerbelt Northbound near Ex...,,Westerville Rd,R,Westerville,Franklin,OH,43081,US,US/Eastern,KCMH,2016-02-08 07:51:00,37.9,35.5,97.0,29.63,7.0,SSW,3.5,0.03,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day
6,A-7,MapQuest,201.0,2,2016-02-08 07:59:35,2016-02-08 08:29:35,39.758274,-84.230507,,,0.0,Accident on Oakridge Dr at Woodward Ave. Expec...,376.0,N Woodward Ave,R,Dayton,Montgomery,OH,45417-2476,US,US/Eastern,KDAY,2016-02-08 07:56:00,34.0,31.0,100.0,29.66,7.0,WSW,3.5,,Overcast,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day
7,A-8,MapQuest,201.0,3,2016-02-08 07:59:58,2016-02-08 08:29:58,39.770382,-84.194901,,,0.01,Accident on I-75 Southbound at Exit 54B Grand ...,,N Main St,R,Dayton,Montgomery,OH,45405,US,US/Eastern,KDAY,2016-02-08 07:56:00,34.0,31.0,100.0,29.66,7.0,WSW,3.5,,Overcast,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day
8,A-9,MapQuest,201.0,2,2016-02-08 08:00:40,2016-02-08 08:30:40,39.778061,-84.172005,,,0.0,Accident on Notre Dame Ave at Warner Ave. Expe...,99.0,Notre Dame Ave,L,Dayton,Montgomery,OH,45404-1923,US,US/Eastern,KFFO,2016-02-08 07:58:00,33.3,,99.0,29.67,5.0,SW,1.2,,Mostly Cloudy,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day
9,A-10,MapQuest,201.0,3,2016-02-08 08:10:04,2016-02-08 08:40:04,40.10059,-82.925194,,,0.01,Right hand shoulder blocked due to accident on...,,Westerville Rd,R,Westerville,Franklin,OH,43081,US,US/Eastern,KCMH,2016-02-08 08:28:00,37.4,33.8,100.0,29.62,3.0,SSW,4.6,0.02,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Day,Day,Day,Day


In [4]:
import numpy as np

# Build a table showing the null count, null percentage, and dtype of each column

def missing_data(data):
    # Count the missing values in each column
    total = data.isnull().sum()

    # Percentage of missing values (the mean of a boolean mask is the null fraction)
    percent = data.isnull().mean() * 100
    temp = pd.concat([total, percent], axis=1, keys=['Total', 'Percent(%)'])

    # Record each column's dtype as a string
    temp['Types'] = data.dtypes.astype(str)

    # Transpose so the dataset's columns run across the table
    return temp.T

missing_data(data)

Unnamed: 0,ID,Source,TMC,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,Number,Street,Side,City,County,State,Zipcode,Country,Timezone,Airport_Code,Weather_Timestamp,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
Total,0,0,1034922,0,0,0,0,0,2478818,2478818,0,1,2262954,0,0,112,0,0,1069,0,3880,6758,43325,65736,1868256,69691,55884,75861,58877,454613,2025881,76143,0,0,0,0,0,0,0,0,0,0,0,0,0,116,116,116,116
Percent(%),0,0,29.4536,0,0,0,0,0,70.5464,70.5464,0,2.84597e-05,64.403,0,0,0.00318749,0,0,0.0304234,0,0.110424,0.192331,1.23302,1.87083,53.17,1.98339,1.59044,2.15898,1.67562,12.9382,57.656,2.16701,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00330133,0.00330133,0.00330133,0.00330133
Types,object,object,float64,int64,object,object,float64,float64,float64,float64,float64,object,float64,object,object,object,object,object,object,object,object,object,object,float64,float64,float64,float64,float64,object,float64,float64,object,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,bool,object,object,object,object
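
The table shows that End_Lat, End_Lng, Number, Wind_Chill(F), and Precipitation(in) are each missing in more than half of the rows. One possible next step is to drop columns above a missingness cutoff. The following is a minimal sketch on a small synthetic frame; the 50% cutoff and the helper name `drop_sparse_columns` are illustrative assumptions, not choices made elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5):
    """Return df without columns whose fraction of nulls exceeds max_missing."""
    missing_fraction = df.isnull().mean()
    sparse = missing_fraction[missing_fraction > max_missing].index.tolist()
    return df.drop(columns=sparse), sparse


# Synthetic frame: End_Lat is 75% missing, TMC is 25% missing.
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 2],
    "End_Lat": [np.nan, np.nan, np.nan, 39.1],
    "TMC": [201.0, np.nan, 201.0, 241.0],
})
kept, removed = drop_sparse_columns(toy)
print(removed)
```

Columns with moderate missingness, such as TMC at roughly 29%, fall below this cutoff and would need a separate decision (imputation, a missing-value flag, or removal).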


In [29]:
# Date and time methods: parse one timestamp to confirm the format

from datetime import datetime

print(data['Start_Time'][0])

my_date = datetime.strptime(data['Start_Time'][0], "%Y-%m-%d %H:%M:%S")
print(my_date)
print(my_date.weekday())  # Monday is 0

2016-02-08 05:46:00
2016-02-08 05:46:00
0
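
Parsing a single value with `datetime.strptime` confirms the timestamp format. To derive time-of-day features for every row, the whole column can be converted at once. The following is a minimal sketch using `pd.to_datetime` and the `.dt` accessor, assuming the format shown above holds for every row; the new column names `Hour`, `Weekday`, and `Duration_min` are hypothetical.

```python
import pandas as pd

# Two timestamps copied from the first rows of the dataset preview.
toy = pd.DataFrame({
    "Start_Time": ["2016-02-08 05:46:00", "2016-02-08 06:07:59"],
    "End_Time": ["2016-02-08 11:00:00", "2016-02-08 06:37:59"],
})

fmt = "%Y-%m-%d %H:%M:%S"
start = pd.to_datetime(toy["Start_Time"], format=fmt)
end = pd.to_datetime(toy["End_Time"], format=fmt)

toy["Hour"] = start.dt.hour
toy["Weekday"] = start.dt.dayofweek  # Monday is 0, matching datetime.weekday()
toy["Duration_min"] = (end - start).dt.total_seconds() / 60
print(toy[["Hour", "Weekday", "Duration_min"]])
```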


In [27]:
# Peep dtypes
data.dtypes

ID                        object
Source                    object
TMC                      float64
Severity                   int64
Start_Time                object
End_Time                  object
Start_Lat                float64
Start_Lng                float64
End_Lat                  float64
End_Lng                  float64
Distance(mi)             float64
Description               object
Number                   float64
Street                    object
Side                      object
City                      object
County                    object
State                     object
Zipcode                   object
Country                   object
Timezone                  object
Airport_Code              object
Weather_Timestamp         object
Temperature(F)           float64
Wind_Chill(F)            float64
Humidity(%)              float64
Pressure(in)             float64
Visibility(mi)           float64
Wind_Direction            object
Wind_Speed(mph)          float64
Precipitat

In [18]:
# Peep Country
data['Country'].value_counts()

US    3513740
Name: Country, dtype: int64

In [17]:
# Peep the distances
data['Distance(mi)'].value_counts()


0.000     2457182
0.010      250988
0.010       13359
0.020        5968
0.001        5528
           ...   
9.356           1
7.967           1
9.269           1
6.688           1
16.911          1
Name: Distance(mi), Length: 13476, dtype: int64

In [15]:
# Peep those descriptions

descriptions = data['Description']
pd.set_option('display.max_colwidth', None)
descriptions.head(40)

0                     Right lane blocked due to accident on I-70 Eastbound at Exit 41 OH-235 State Route 4.
1                                                        Accident on Brice Rd at Tussing Rd. Expect delays.
2                               Accident on OH-32 State Route 32 Westbound at Dela Palma Rd. Expect delays.
3                                         Accident on I-75 Southbound at Exits 52 52B US-35. Expect delays.
4                                 Accident on McEwen Rd at OH-725 Miamisburg Centerville Rd. Expect delays.
5                         Accident on I-270 Outerbelt Northbound near Exit 29 OH-3 State St. Expect delays.
6                                                   Accident on Oakridge Dr at Woodward Ave. Expect delays.
7                                         Accident on I-75 Southbound at Exit 54B Grand Ave. Expect delays.
8                                                  Accident on Notre Dame Ave at Warner Ave. Expect delays.
9        Right hand shoulder

In [16]:
# Peep those TMC reports
TMC = data['TMC'].value_counts()
print(TMC)

201.0    2080341
241.0     249852
245.0      40338
229.0      22932
203.0      17639
222.0      13154
244.0      12185
406.0      11109
246.0       7118
343.0       6930
202.0       6298
247.0       4775
236.0       2121
206.0       1274
248.0       1025
339.0        920
341.0        592
336.0         89
200.0         66
239.0         54
351.0          6
Name: TMC, dtype: int64


In [8]:
# Exploring Weather_Condition

print(len(pd.unique(data['Weather_Condition'])))
pd.unique(data['Weather_Condition'])

128


array(['Light Rain', 'Overcast', 'Mostly Cloudy', 'Rain', 'Light Snow',
       'Haze', 'Scattered Clouds', 'Partly Cloudy', 'Clear', 'Snow',
       'Light Freezing Drizzle', 'Light Drizzle', 'Fog', 'Shallow Fog',
       'Heavy Rain', 'Light Freezing Rain', 'Cloudy', 'Drizzle', nan,
       'Light Rain Showers', 'Mist', 'Smoke', 'Patches of Fog',
       'Light Freezing Fog', 'Light Haze', 'Light Thunderstorms and Rain',
       'Thunderstorms and Rain', 'Fair', 'Volcanic Ash', 'Blowing Sand',
       'Blowing Dust / Windy', 'Widespread Dust', 'Fair / Windy',
       'Rain Showers', 'Mostly Cloudy / Windy', 'Light Rain / Windy',
       'Hail', 'Heavy Drizzle', 'Showers in the Vicinity', 'Thunderstorm',
       'Light Rain Shower', 'Light Rain with Thunder',
       'Partly Cloudy / Windy', 'Thunder in the Vicinity', 'T-Storm',
       'Heavy Thunderstorms and Rain', 'Thunder', 'Heavy T-Storm',
       'Funnel Cloud', 'Heavy T-Storm / Windy', 'Blowing Snow',
       'Light Thunderstorms and Snow',
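
With 128 distinct labels, `Weather_Condition` is very fine-grained for a categorical feature. One option is to collapse the labels into a few coarse groups by keyword. The following is a minimal sketch; the group names, keyword lists, and helper name `simplify_weather` are illustrative assumptions, not a scheme defined by the dataset.

```python
import pandas as pd

# Checked in order; the first group with a matching keyword wins.
WEATHER_GROUPS = [
    ("Thunder", ("thunder", "t-storm")),
    ("Snow/Ice", ("snow", "freezing", "hail", "sleet")),
    ("Rain", ("rain", "drizzle", "shower")),
    ("Low visibility", ("fog", "mist", "haze", "smoke", "dust", "sand", "ash")),
    ("Cloudy", ("cloud", "overcast")),
    ("Clear", ("clear", "fair")),
]


def simplify_weather(label):
    """Map a raw Weather_Condition label to a coarse group."""
    if pd.isna(label):
        return "Unknown"
    text = label.lower()
    for group, keywords in WEATHER_GROUPS:
        if any(keyword in text for keyword in keywords):
            return group
    return "Other"


sample = pd.Series(
    ["Light Rain", "Overcast", "Heavy T-Storm / Windy", "Light Freezing Fog", None]
)
grouped = sample.map(simplify_weather).tolist()
print(grouped)
```

Because the first match wins, a label such as "Light Thunderstorms and Snow" lands in the thunder group. Reviewing the labels that fall into "Other" on the real data would show which keywords are still missing.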

In [14]:
# Exploring occurrences by state
state = data['State'].value_counts()
print(state)

CA    816826
TX    329284
FL    258002
SC    173277
NC    165963
NY    160817
PA    106794
IL     99692
VA     96075
MI     95983
GA     93614
OR     90134
MN     81865
AZ     78586
TN     69895
WA     68545
OH     66140
LA     61515
OK     60003
NJ     59059
MD     53593
UT     51685
CO     49731
AL     44625
MA     39044
IN     33752
MO     33643
CT     25901
NE     23971
KY     22553
WI     20120
RI     11753
IA     11475
NV     10724
NH      7984
KS      7939
MS      6585
DE      5739
NM      5523
DC      4820
WV      2381
ME      2243
ID      2048
AR      2012
VT       702
MT       512
WY       508
SD        61
ND        44
Name: State, dtype: int64
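
Raw counts depend heavily on how many records each state contributes, so they do not on their own show where accidents tend to be most severe. The following is a minimal sketch of two complementary summaries: each state's share of all records, and the fraction of its accidents at Severity 3 or higher. Treating 3 as the cutoff for "severe" is an assumption for illustration, and the frame is synthetic.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "TX", "TX", "OH"],
    "Severity": [2, 2, 3, 4, 2, 3],
})

# Fraction of all records contributed by each state.
share = toy["State"].value_counts(normalize=True)

# Fraction of each state's records with Severity >= 3 (assumed cutoff).
severe_rate = (toy["Severity"] >= 3).groupby(toy["State"]).mean()

summary = pd.DataFrame({"share": share, "severe_rate": severe_rate})
print(summary.sort_values("severe_rate", ascending=False))
```

States with very few records, such as SD and ND above, would produce unstable rates, so a minimum record count is worth considering before comparing states.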


### Exploratory Data Analysis

### Maps

### Feature Set

### Train Test Split

### Normalize the Data

### Plots

### Machine Learning Models

### Results, Evaluations, and Discussion

### Conclusion and Recommendation