# Data Science Capstone Project

## Summary:
* [Business Problem](#business_problem)
* [Data understanding](#data_understanding)
* [Exploratory Data Analysis](#eda)
* [Modeling](#modeling)
* [Evaluation](#evaluation)
* [Conclusion](#conclusion)

## Business problem <a name="business_problem"></a>

Every year, many people die in car accidents, whether from driver inattention, obstacles on the road, poor lighting, or other external factors.

This project aims to predict the severity of an accident from historical data, so that it can help authorities, hospitals and health care systems, and, of course, drivers.

To do this, we will use supervised machine learning algorithms to process the data, predict the severity, and send information about the accident to emergency services.

## Data understanding <a name="data_understanding"></a>

The data we will use come from the SDOT Traffic Management Division, Traffic Records Group, in CSV format. They cover Seattle from 2004 to the present day and are updated weekly, so the model can later absorb more data, both to improve accuracy and to pick up new trends.
The CSV file has columns that represent different types of data, such as the date of the accident, road conditions, number of accidents, number of cars involved, pedestrians, cyclists, etc. Many other columns serve only to identify each accident for the agencies that handle these records. In addition, there are many empty and unknown values that we need to handle.
More information about each column is available in the <b> METADATA.pdf </b> file, which explains what each variable is for.
The column we need to predict is "SEVERITYCODE", which holds the severity of the accident: the higher the value, the greater the severity. In the current data it takes only two values: 1 (property damage only) or 2 (people injured in the accident).

### Importing libraries

In [1]:
# Pandas to handle dataset
import pandas as pd 

# Numpy to handle and operate through arrays
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Removing warnings
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [2]:
# Read the dataset
df_collisions = pd.read_csv("https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv")

df_collisions.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [3]:
df_collisions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT  

## Exploratory data analysis <a name="eda"></a>

In this dataset, many columns are used only for identification by the authorities, so we remove them to make our DataFrame smaller and keep only the columns that can affect our predictions.
For more information about these columns, see <b>Metadata.pdf</b>.

In [4]:
# Remove useless columns and columns that have data inserted by the state, since they are added post-incident.

remove_columns = ['OBJECTID', 'X', 'Y', 'REPORTNO', 'STATUS', 'INCKEY', 'COLDETKEY', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDTTM', 'SDOT_COLCODE', 'SDOT_COLDESC', 'SDOTCOLNUM', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY']

df_collisions.drop(columns=remove_columns, inplace=True)

df_collisions.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,INCDATE,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,HITPARKEDCAR
0,2,Intersection,Angles,2,2013/03/27 00:00:00+00,At Intersection (intersection related),,N,Overcast,Wet,Daylight,,,N
1,1,Block,Sideswipe,2,2006/12/20 00:00:00+00,Mid-Block (not related to intersection),,0,Raining,Wet,Dark - Street Lights On,,,N
2,1,Block,Parked Car,4,2004/11/18 00:00:00+00,Mid-Block (not related to intersection),,0,Overcast,Dry,Daylight,,,N
3,1,Block,Other,3,2013/03/29 00:00:00+00,Mid-Block (not related to intersection),,N,Clear,Dry,Daylight,,,N
4,2,Intersection,Angles,2,2004/01/28 00:00:00+00,At Intersection (intersection related),,0,Raining,Wet,Daylight,,,N


The output above shows the first five rows of the reduced DataFrame. Next, let's check the type of each column.

In [5]:
df_collisions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   SEVERITYCODE    194673 non-null  int64 
 1   ADDRTYPE        192747 non-null  object
 2   COLLISIONTYPE   189769 non-null  object
 3   PERSONCOUNT     194673 non-null  int64 
 4   INCDATE         194673 non-null  object
 5   JUNCTIONTYPE    188344 non-null  object
 6   INATTENTIONIND  29805 non-null   object
 7   UNDERINFL       189789 non-null  object
 8   WEATHER         189592 non-null  object
 9   ROADCOND        189661 non-null  object
 10  LIGHTCOND       189503 non-null  object
 11  PEDROWNOTGRNT   4667 non-null    object
 12  SPEEDING        9333 non-null    object
 13  HITPARKEDCAR    194673 non-null  object
dtypes: int64(2), object(12)
memory usage: 20.8+ MB


### Missing values

Now, let's check our DataFrame for missing values and decide how to handle them.

In [6]:
df_collisions.isnull().sum()

SEVERITYCODE           0
ADDRTYPE            1926
COLLISIONTYPE       4904
PERSONCOUNT            0
INCDATE                0
JUNCTIONTYPE        6329
INATTENTIONIND    164868
UNDERINFL           4884
WEATHER             5081
ROADCOND            5012
LIGHTCOND           5170
PEDROWNOTGRNT     190006
SPEEDING          185340
HITPARKEDCAR           0
dtype: int64

We can use 3 methods to handle missing data:
 - replacing by column average;
 - replacing by column most recurring value; or
 - deleting entire row/column with missing data
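
The three options listed above can be sketched on a tiny frame. The `toy` DataFrame and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the collisions file.
toy = pd.DataFrame({
    "PERSONCOUNT": [2.0, None, 4.0],
    "WEATHER": ["Clear", "Clear", None],
})

# 1) Numeric column: replace missing values with the column average.
mean_filled = toy["PERSONCOUNT"].fillna(toy["PERSONCOUNT"].mean())

# 2) Categorical column: replace missing values with the most frequent value.
mode_filled = toy["WEATHER"].fillna(toy["WEATHER"].mode()[0])

# 3) Drop every row that still has a missing value.
rows_dropped = toy.dropna(axis=0)

print(mean_filled.tolist())
print(mode_filled.tolist())
print(len(rows_dropped))
```

The average only makes sense for numeric columns, which is why the categorical columns in this notebook are filled with the most frequent value instead.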

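We drop the columns where about half or more of the values are missing (INATTENTIONIND, PEDROWNOTGRNT, and SPEEDING). The thresh argument is the minimum number of non-null values a column needs in order to be kept.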

In [7]:
df_collisions.dropna(axis=1, thresh=97336, inplace=True)

df_collisions.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,INCDATE,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,HITPARKEDCAR
0,2,Intersection,Angles,2,2013/03/27 00:00:00+00,At Intersection (intersection related),N,Overcast,Wet,Daylight,N
1,1,Block,Sideswipe,2,2006/12/20 00:00:00+00,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On,N
2,1,Block,Parked Car,4,2004/11/18 00:00:00+00,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight,N
3,1,Block,Other,3,2013/03/29 00:00:00+00,Mid-Block (not related to intersection),N,Clear,Dry,Daylight,N
4,2,Intersection,Angles,2,2004/01/28 00:00:00+00,At Intersection (intersection related),0,Raining,Wet,Daylight,N
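
The hard-coded threshold in the cell above could also be derived from the row count, so it keeps working when the weekly updates add rows. This is a minimal sketch on a made-up `toy` frame; `missing_share` and `min_non_null` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data with one mostly-empty column.
toy = pd.DataFrame({
    "WEATHER": ["Clear", None, "Raining", "Clear"],
    "SPEEDING": [None, None, None, "Y"],
})

# Share of missing values per column, useful to inspect before dropping anything.
missing_share = toy.isnull().mean()

# Keep a column only if at least half of its values are present.
min_non_null = len(toy) // 2
trimmed = toy.dropna(axis=1, thresh=min_non_null)

print(missing_share)
print(list(trimmed.columns))
```

For the 194673 rows reported by `info()`, `len(df) // 2` gives 97336, the same value used in the cell above.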


In [8]:
# Our dataset is imbalanced, so let's balance it by undersampling the majority class
print(df_collisions.SEVERITYCODE.value_counts())

# Shuffle the DataFrame (frac=1 returns all rows in random order)
df_shuffled = df_collisions.sample(frac=1)

# Create 2 DataFrames: one with every row where SEVERITYCODE == 2, and one with the same number of rows sampled from SEVERITYCODE == 1
df_severitycode_2 = df_shuffled.loc[df_shuffled['SEVERITYCODE'] == 2]
df_severitycode_1 = df_shuffled.loc[df_shuffled['SEVERITYCODE'] == 1].sample(n=58188, random_state=42)

# Concatenate the two DataFrames into one
df_collisions = pd.concat([df_severitycode_1, df_severitycode_2])
print(df_collisions.SEVERITYCODE.value_counts())

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
2    58188
1    58188
Name: SEVERITYCODE, dtype: int64
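
As an alternative sketch, the minority class size can be computed instead of typed in, and the sampling done per class with `groupby(...).sample(...)` (available in pandas 1.1 and later). The `toy` frame below is made up; this is not the notebook's own cell, only one way the same undersampling could be written.

```python
import pandas as pd

# Hypothetical imbalanced data: seven rows of class 1, three rows of class 2.
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "PERSONCOUNT": range(10),
})

# Size of the smallest class, computed rather than hard-coded.
minority_size = int(toy["SEVERITYCODE"].value_counts().min())

# Draw the same number of rows from every class.
balanced = toy.groupby("SEVERITYCODE").sample(n=minority_size, random_state=42)

# Shuffle so the classes are not stored in contiguous blocks.
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced["SEVERITYCODE"].value_counts())
```

Fixing `random_state` on every sampling step makes the balanced subset reproducible between runs.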


<p>Since the remaining columns with missing values are categorical, we can fill those missing values with the most frequent value</p>

In [9]:
cols_with_missing_values = ['ADDRTYPE', 'COLLISIONTYPE', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND']

df_collisions[cols_with_missing_values].head()

Unnamed: 0,ADDRTYPE,COLLISIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND
133646,Block,Head On,N,Overcast,Dry,
155396,Intersection,Rear Ended,N,Clear,Dry,Daylight
51369,Block,Sideswipe,0,Clear,Dry,Dark - Street Lights On
164532,Block,Parked Car,N,Unknown,Unknown,Daylight
50520,Block,Sideswipe,0,Clear,Dry,Daylight


We can see that the UNDERINFL column mixes the values 0/1 and N/Y, so we need to handle it individually. We map '0' and 'N' to 0, and '1' and 'Y' to 1. HITPARKEDCAR ('N'/'Y') is encoded the same way.

In [10]:
df_collisions['UNDERINFL'] = df_collisions['UNDERINFL'].map({'0': 0, '1': 1, 'N': 0, 'Y': 1})

df_collisions['HITPARKEDCAR'] = df_collisions['HITPARKEDCAR'].map({'N': 0, 'Y': 1})
df_collisions.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,INCDATE,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,HITPARKEDCAR
133646,1,Block,Head On,2,2014/08/26 00:00:00+00,Mid-Block (not related to intersection),0.0,Overcast,Dry,,0
155396,1,Intersection,Rear Ended,0,2016/05/02 00:00:00+00,At Intersection (but not related to intersection),0.0,Clear,Dry,Daylight,0
51369,1,Block,Sideswipe,2,2007/12/10 00:00:00+00,Mid-Block (not related to intersection),0.0,Clear,Dry,Dark - Street Lights On,0
164532,1,Block,Parked Car,2,2017/04/15 00:00:00+00,Mid-Block (not related to intersection),0.0,Unknown,Unknown,Daylight,0
50520,1,Block,Sideswipe,2,2007/09/26 00:00:00+00,Mid-Block (not related to intersection),0.0,Clear,Dry,Daylight,0
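
The float values (0.0) in the UNDERINFL column above follow from how `Series.map` treats entries that are not in the dictionary: they become NaN, which forces a float dtype. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical raw values mixing both encodings, plus one missing entry.
raw = pd.Series(["N", "0", "Y", "1", None], name="UNDERINFL")

# Same dictionary as in the cell above.
flag_map = {"0": 0, "1": 1, "N": 0, "Y": 1}
encoded = raw.map(flag_map)

print(encoded.tolist())
print(int(encoded.isna().sum()))
```

Any value outside the dictionary would silently become NaN as well, so counting the NaN entries before and after the mapping is a cheap sanity check.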


Now let's fill the missing values with the most frequent value (the mode) of each column

In [11]:
# Fill each column's missing values with its most frequent value (the mode).
# Assigning the result back avoids relying on an in-place fill of a column selection.
for col in ['ADDRTYPE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND']:
    df_collisions[col] = df_collisions[col].fillna(df_collisions[col].mode()[0])

df_collisions.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,INCDATE,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,HITPARKEDCAR
133646,1,Block,Head On,2,2014/08/26 00:00:00+00,Mid-Block (not related to intersection),0.0,Overcast,Dry,Daylight,0
155396,1,Intersection,Rear Ended,0,2016/05/02 00:00:00+00,At Intersection (but not related to intersection),0.0,Clear,Dry,Daylight,0
51369,1,Block,Sideswipe,2,2007/12/10 00:00:00+00,Mid-Block (not related to intersection),0.0,Clear,Dry,Dark - Street Lights On,0
164532,1,Block,Parked Car,2,2017/04/15 00:00:00+00,Mid-Block (not related to intersection),0.0,Unknown,Unknown,Daylight,0
50520,1,Block,Sideswipe,2,2007/09/26 00:00:00+00,Mid-Block (not related to intersection),0.0,Clear,Dry,Daylight,0


@todo
Handle the date and time deltas

In [12]:
df_collisions['MONTH'] = df_collisions['INCDATE'].apply(lambda x: pd.Timestamp(x).month_name())
df_collisions.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,COLLISIONTYPE,PERSONCOUNT,INCDATE,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,HITPARKEDCAR,MONTH
133646,1,Block,Head On,2,2014/08/26 00:00:00+00,Mid-Block (not related to intersection),0.0,Overcast,Dry,Daylight,0,August
155396,1,Intersection,Rear Ended,0,2016/05/02 00:00:00+00,At Intersection (but not related to intersection),0.0,Clear,Dry,Daylight,0,May
51369,1,Block,Sideswipe,2,2007/12/10 00:00:00+00,Mid-Block (not related to intersection),0.0,Clear,Dry,Dark - Street Lights On,0,December
164532,1,Block,Parked Car,2,2017/04/15 00:00:00+00,Mid-Block (not related to intersection),0.0,Unknown,Unknown,Daylight,0,April
50520,1,Block,Sideswipe,2,2007/09/26 00:00:00+00,Mid-Block (not related to intersection),0.0,Clear,Dry,Daylight,0,September
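
A vectorized variant of the month extraction might look like the sketch below. It assumes every INCDATE string starts with a `YYYY/MM/DD` prefix, as in the rows shown above, and ignores the rest of the string; the WEEKDAY column is an extra illustration that the notebook itself does not create.

```python
import pandas as pd

# Two date strings copied from the rows shown earlier in the notebook.
incdate = pd.Series(
    ["2013/03/27 00:00:00+00", "2006/12/20 00:00:00+00"], name="INCDATE"
)

# Parse only the date prefix with an explicit format (assumed layout: YYYY/MM/DD).
parsed = pd.to_datetime(incdate.str.slice(0, 10), format="%Y/%m/%d")

date_features = pd.DataFrame({
    "MONTH": parsed.dt.month_name(),
    "WEEKDAY": parsed.dt.day_name(),
})
print(date_features)
```

Parsing the whole column once and then using the `.dt` accessor avoids building a `pd.Timestamp` per row inside `apply`.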


In [13]:
categorical_columns = ['ADDRTYPE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'MONTH']

df_collisions = pd.get_dummies(df_collisions, columns=categorical_columns, drop_first=True)
df_collisions.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,INCDATE,UNDERINFL,HITPARKEDCAR,ADDRTYPE_Block,ADDRTYPE_Intersection,COLLISIONTYPE_Cycles,COLLISIONTYPE_Head On,COLLISIONTYPE_Left Turn,...,MONTH_December,MONTH_February,MONTH_January,MONTH_July,MONTH_June,MONTH_March,MONTH_May,MONTH_November,MONTH_October,MONTH_September
133646,1,2,2014/08/26 00:00:00+00,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
155396,1,0,2016/05/02 00:00:00+00,0.0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
51369,1,2,2007/12/10 00:00:00+00,0.0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
164532,1,2,2017/04/15 00:00:00+00,0.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50520,1,2,2007/09/26 00:00:00+00,0.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
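
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up column. The first category in sorted order is dropped and becomes the implicit baseline, encoded as all zeros. `dtype=int` is set here only so the output is 0/1 on every pandas version.

```python
import pandas as pd

# Hypothetical toy data with three weather categories.
toy = pd.DataFrame({"WEATHER": ["Clear", "Raining", "Overcast", "Clear"]})

encoded = pd.get_dummies(toy, columns=["WEATHER"], drop_first=True, dtype=int)
print(encoded)
```

Here "Clear" has no column of its own: a row with zeros in both remaining columns means the weather was clear.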


In [14]:
df_collisions.drop(['INCDATE'], axis=1, inplace=True)

Use GridSearchCV and Pipeline to find the best parameters for SelectKBest, KNN, Random Forest, and LinearSVC. At the end, choose whichever obtained the best result.

In [15]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

In [16]:
k_best = SelectKBest(chi2, k='all')
k_best.fit(df_collisions.drop(['SEVERITYCODE'], axis=1), df_collisions['SEVERITYCODE'])

#print(k_best.scores_)
#print(k_best.pvalues_)
#for pval in k_best.pvalues_:
#print(df_collisions.drop(['SEVERITYCODE'], axis=1).columns)

features_index_strong_evidence_corr = []
features_index_moderate_evidence_corr = []
features_index_weak_evidence_corr = []
features_index_no_evidence_corr = []

for i in range(len(k_best.pvalues_)):
    if k_best.pvalues_[i] < 0.001:
        features_index_strong_evidence_corr.append(i)
    elif k_best.pvalues_[i] < 0.05:
        features_index_moderate_evidence_corr.append(i)
    elif k_best.pvalues_[i] < 0.1:
        features_index_weak_evidence_corr.append(i)
    else:
        features_index_no_evidence_corr.append(i)

print("Number of features with strong evidence that the correlation is significant: " + str(len(features_index_strong_evidence_corr)))
print("Number of features with moderate evidence that the correlation is significant: " + str(len(features_index_moderate_evidence_corr)))
print("Number of features with weak evidence that the correlation is significant: " + str(len(features_index_weak_evidence_corr)))
print("Number of features with no evidence that the correlation is significant: " + str(len(features_index_no_evidence_corr)))


Number of features with strong evidence that the correlation is significant: 31
Number of features with moderate evidence that the correlation is significant: 7
Number of features with weak evidence that the correlation is significant: 3
Number of features with no evidence that the correlation is significant: 16
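
The same bucketing can be done by feature name rather than by index, which makes the groups easier to read. The sketch below runs `chi2` on made-up data and reuses the thresholds from the cell above (0.001, 0.05, 0.1); the feature names and the labels `strong`, `moderate`, `weak`, and `none` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical toy data: one feature that tracks the target and one constant feature.
y_toy = np.array([1] * 50 + [2] * 50)
X_toy = pd.DataFrame({
    "informative": (y_toy == 2).astype(int),
    "constant": np.ones(100, dtype=int),
})

scores, p_values = chi2(X_toy, y_toy)
p_by_feature = pd.Series(p_values, index=X_toy.columns)

# Left-closed bins reproduce the "<" comparisons used in the loop above.
evidence = pd.cut(
    p_by_feature,
    bins=[0, 0.001, 0.05, 0.1, np.inf],
    labels=["strong", "moderate", "weak", "none"],
    right=False,
)
print(evidence)
```

Note that `chi2` requires non-negative feature values, which holds for the counts and 0/1 indicator columns used in this notebook.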


In [17]:
# This changes how NumPy prints numbers. With suppress=False (the default), very small or very large values may be printed in scientific notation, which is hard to read and interpret. (https://numpy.org/doc/stable/reference/generated/numpy.set_printoptions.html)
np.set_printoptions(suppress=True)

print(np.sort(k_best.scores_[features_index_strong_evidence_corr]))

[   11.016         19.96339228    20.0075643     23.23395649
    40.96589842    43.68032787    63.61359223    87.64470588
   113.10545455   114.98962963   141.17162501   159.85121714
   167.68476204   178.00680831   218.0147108    314.74176385
   390.62779267  1306.79294742  1722.99526247  1867.39883212
  1907.15362704  2011.22122379  3097.55810932  3264.95939946
  3407.51874247  3844.16558711  3951.04652087  3976.70753065
  4134.00058824  5120.03148089 12493.60922908]


In [18]:
k_best_test = SelectKBest(chi2, k=len(features_index_strong_evidence_corr))
k_best_test.fit(df_collisions.drop(['SEVERITYCODE'], axis=1), df_collisions['SEVERITYCODE'])

SelectKBest(k=31, score_func=<function chi2 at 0x7f6f6ed1fa60>)

In [19]:
# Test whether SelectKBest picks the features with the strongest evidence of correlation, or something else.

print(df_collisions.drop(['SEVERITYCODE'], axis=1).columns[features_index_strong_evidence_corr])
print(df_collisions.drop(['SEVERITYCODE'], axis=1).columns[k_best_test.get_support()])
print(df_collisions.drop(['SEVERITYCODE'], axis=1).columns[features_index_strong_evidence_corr] == df_collisions.drop(['SEVERITYCODE'], axis=1).columns[k_best_test.get_support()])

Index(['PERSONCOUNT', 'UNDERINFL', 'HITPARKEDCAR', 'ADDRTYPE_Block',
       'ADDRTYPE_Intersection', 'COLLISIONTYPE_Cycles',
       'COLLISIONTYPE_Head On', 'COLLISIONTYPE_Left Turn',
       'COLLISIONTYPE_Other', 'COLLISIONTYPE_Parked Car',
       'COLLISIONTYPE_Pedestrian', 'COLLISIONTYPE_Rear Ended',
       'COLLISIONTYPE_Right Turn', 'COLLISIONTYPE_Sideswipe',
       'JUNCTIONTYPE_At Intersection (intersection related)',
       'JUNCTIONTYPE_Mid-Block (but intersection related)',
       'JUNCTIONTYPE_Mid-Block (not related to intersection)', 'WEATHER_Clear',
       'WEATHER_Other', 'WEATHER_Overcast', 'WEATHER_Raining',
       'WEATHER_Snowing', 'WEATHER_Unknown', 'ROADCOND_Ice',
       'ROADCOND_Snow/Slush', 'ROADCOND_Unknown', 'ROADCOND_Wet',
       'LIGHTCOND_Daylight', 'LIGHTCOND_Unknown', 'MONTH_December',
       'MONTH_February'],
      dtype='object')
Index(['PERSONCOUNT', 'UNDERINFL', 'HITPARKEDCAR', 'ADDRTYPE_Block',
       'ADDRTYPE_Intersection', 'COLLISIONTYPE_Cycles',


In [20]:
# Test with 14, 22, and all 33 features with strong evidence that the correlation is significant
# Run a test without the dates (since I don't know how to handle time) and check the correlations.

In [21]:
k_best = SelectKBest(chi2)

knn_pipeline = Pipeline([
    ('k_best', k_best),
    ('knn', KNeighborsClassifier())
])

grid_knn = GridSearchCV(knn_pipeline, {
    'k_best__k': [14, 22, 33],
    'knn__n_neighbors': range(1, 11),
    'knn__weights': ['uniform', 'distance']
}, n_jobs=-1)

grid_knn.fit(df_collisions.drop(['SEVERITYCODE'], axis=1), df_collisions['SEVERITYCODE'])

GridSearchCV(estimator=Pipeline(steps=[('k_best',
                                        SelectKBest(score_func=<function chi2 at 0x7f6f6ed1fa60>)),
                                       ('knn', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'k_best__k': [14, 22, 33],
                         'knn__n_neighbors': range(1, 11),
                         'knn__weights': ['uniform', 'distance']})

In [22]:
k_best = SelectKBest(chi2)

linear_svc_pipeline = Pipeline([
    ('k_best', k_best),
    ('linear_svc', LinearSVC())
])

grid_linear_svc = GridSearchCV(linear_svc_pipeline, {
    'k_best__k': [14, 22, 33],
    'linear_svc__C': [1, 10, 100, 1000],
    'linear_svc__penalty': ['l1', 'l2']
}, n_jobs=-1)

grid_linear_svc.fit(df_collisions.drop(['SEVERITYCODE'], axis=1), df_collisions['SEVERITYCODE'])

GridSearchCV(estimator=Pipeline(steps=[('k_best',
                                        SelectKBest(score_func=<function chi2 at 0x7f6f6ed1fa60>)),
                                       ('linear_svc', LinearSVC())]),
             n_jobs=-1,
             param_grid={'k_best__k': [14, 22, 33],
                         'linear_svc__C': [1, 10, 100, 1000],
                         'linear_svc__penalty': ['l1', 'l2']})

In [23]:
k_best = SelectKBest(chi2)

rand_forest_pipeline = Pipeline([
    ('k_best', k_best),
    ('rfc', RandomForestClassifier())
])

grid_rfc = GridSearchCV(rand_forest_pipeline, {
    'k_best__k': [14, 22, 33],
    'rfc__criterion': ['gini', 'entropy'],
    'rfc__class_weight': ['balanced', 'balanced_subsample']
}, n_jobs=-1)

grid_rfc.fit(df_collisions.drop(['SEVERITYCODE'], axis=1), df_collisions['SEVERITYCODE'])

GridSearchCV(estimator=Pipeline(steps=[('k_best',
                                        SelectKBest(score_func=<function chi2 at 0x7f6f6ed1fa60>)),
                                       ('rfc', RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'k_best__k': [14, 22, 33],
                         'rfc__class_weight': ['balanced',
                                               'balanced_subsample'],
                         'rfc__criterion': ['gini', 'entropy']})

In [24]:
print("KNN GridSearchCV best params: " + str(grid_knn.best_params_))
print("LinearSVC GridSearchCV best params: " + str(grid_linear_svc.best_params_))
print("Random Forest GridSearchCV best params: " + str(grid_rfc.best_params_))

KNN GridSearchCV best params: {'k_best__k': 33, 'knn__n_neighbors': 9, 'knn__weights': 'uniform'}
LinearSVC GridSearchCV best params: {'k_best__k': 33, 'linear_svc__C': 10, 'linear_svc__penalty': 'l2'}
Random Forest GridSearchCV best params: {'k_best__k': 14, 'rfc__class_weight': 'balanced_subsample', 'rfc__criterion': 'entropy'}


In [27]:
X = df_collisions.drop(['SEVERITYCODE'], axis=1)
y = df_collisions['SEVERITYCODE']

print("KNN GridSearchCV score: " + str(grid_knn.score(X, y)))
print("LinearSVC GridSearchCV score: " + str(grid_linear_svc.score(X, y)))
print("Random Forest GridSearchCV score: " + str(grid_rfc.score(X, y)))

KNN GridSearchCV score: 0.6327163676359386
LinearSVC GridSearchCV score: 0.6979188148759194
Random Forest GridSearchCV score: 0.7032893380078367
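
The scores above are computed on the same rows that the grid searches were fitted on, so they are likely optimistic. A common alternative is to hold out a test set with `train_test_split` (imported earlier but not used) and fit the search on the training part only. The sketch below uses made-up binary features in place of the encoded collision data, and the small parameter grid is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the one-hot encoded features (not the real data).
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(400, 8))
# Labels 1/2 depend on the first two columns plus some noise.
noise = rng.integers(0, 2, size=400)
y_demo = ((X_demo[:, 0] + X_demo[:, 1] + noise) >= 2).astype(int) + 1

# Hold out 25% of the rows; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

pipeline = Pipeline([
    ("k_best", SelectKBest(chi2)),
    ("rfc", RandomForestClassifier(n_estimators=25, random_state=42)),
])
grid = GridSearchCV(pipeline, {"k_best__k": [2, 4, 8]}, cv=3)
grid.fit(X_train, y_train)

cv_score = grid.best_score_               # mean cross-validated accuracy on the training part
test_score = grid.score(X_test, y_test)   # accuracy on rows the search never saw
print("Best params:", grid.best_params_)
print("CV score:", cv_score, "Test score:", test_score)
```

Comparing `best_score_` with the held-out score gives a rough idea of how well each model generalizes, which is a fairer basis for choosing between KNN, LinearSVC, and Random Forest.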
