#### Exam description
For this exam, you will predict the target values for the instances in test.csv.

#### Your task:
Find a good machine learning model to predict the target value, then predict the target values of the instances in test.csv.

#### Exam rules
- You can use only the machine learning models discussed in this course. 
    - If the prediction is based on a model that is not discussed in class, one of the models in your submission will randomly be selected for grading. 
- Fifty percent of the grade is based on your Python code submission. The other 50 percent of your grade is based on the evaluation score of the prediction. 
- The exam should be syntax error-free. Run your code before the final submission. 
- Save the final prediction array as ``final_test_prediction``. 
- <font color = 'red'> The final prediction will be evaluated using the **roc_auc_score** function. </font>

#### Deliverable
Submit ONLY the iPython notebook or the .py file of your work. Use the following frame for your submission. Please don't remove the headers in the following structure. 

#### Rubric
| Description | Fair | Good | Excellent |
|:-----------|:------|:------|:----------|
|Preprocessing|Demonstrate limited understanding of preprocessing steps | Demonstrate a moderate ability to find a way to apply the preprocessing step to prepare the dataset for Machine learning models | Demonstrate the ability to choose the appropriate preprocessing model to prepare the dataset |
|Machine learning model | Demonstrate limited understanding of methods used to train machine learning models | Demonstrate the ability to understand techniques used to train machine learning models with some effectiveness. This includes optimization algorithms, initialization, regularization, and hyperparameter search methods | Demonstrate ability to understand and apply various algorithms as well as initialization, regularization, and hyperparameter search methods |
|Final prediction |Demonstrate limited understanding of strategies to structure an end-to-end machine learning project | Demonstrate ability to understand classic ML strategies such as error analysis, data split, data collection and evaluation metric selection with some effectiveness | Demonstrate ability to structure the project and apply methods such as error analysis, data split, data collection, designing a labeling process and selecting proper evaluation metrics to improve performance |


Import the <b> train </b> and the <b> test </b> data

In [1]:
import pandas as pd
df = pd.read_csv('train.csv')
df1 = pd.read_csv('test.csv')

In [2]:
df.columns = df.columns.str.replace("'","")
df1.columns = df1.columns.str.replace("'","")

In [3]:
df.columns

Index(['$1000 Damage to Any One Persons Property', 'Bridge Detail',
       'Construction Zone Flag', 'Construction Zone Workers Present Flag',
       'Crash Time', 'Day of Week', 'Highway System',
       'Intersecting Highway Number', 'Intersecting Street Name',
       'Manner of Collision', 'Median Type', 'Median Width',
       'Number of Entering Roads', 'Number of Lanes', 'Surface Condition',
       'Surface Type', 'Surface Width', 'Weather Condition', 'Crash Severity'],
      dtype='object')

The training data has 2,322 instances and 19 columns: 18 features plus the target, 'Crash Severity'. The task is to predict whether a crash is 'Serious' or 'Not serious', which makes this a binary classification problem. The data is highly imbalanced: more than 98% of the crashes are labelled 'Not serious'.
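The imbalance can be checked directly with `value_counts(normalize=True)`. The snippet below is a small sketch that rebuilds a stand-in series from the class counts reported in this notebook (2,286 and 36) instead of reading `train.csv`; the names `severity` and `class_share` are introduced only for illustration.

```python
import pandas as pd

# Stand-in for df['Crash Severity'], rebuilt from the counts shown in this notebook.
severity = pd.Series(['Not serious'] * 2286 + ['Serious'] * 36, name='Crash Severity')

# normalize=True returns class proportions instead of raw counts.
class_share = severity.value_counts(normalize=True)
print(class_share.round(4))
```

With a minority share below 2%, plain accuracy is a weak guide, which is why the exam scores the prediction with `roc_auc_score`.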

# Preprocessing ``train.csv`` (15 points)

<b> Crash Time </b>

Convert the time to hh:mm and then bucket it into 'Morning' (hours 7 to 18) and 'Night' (all other hours).

In [4]:
# 'Crash Time' appears to be stored as an HHMM number (for example 1345);
# insert a colon before the last two digits (the minutes).
time = df['Crash Time'].astype(str)
time = [f'{t[:-2]}:{t[-2:]}' for t in time]

for i in range(len(time)):
    hour = time[i].split(":")[0]
    if len(time[i]) <= 3:
        # No hour digits (for example ':45'), so the crash happened just after midnight.
        time[i] = 'Night'
    elif int(hour) <= 6 or 18 < int(hour) < 24:
        time[i] = 'Night'
    elif int(hour) <= 18:
        # Hours 7 to 18 count as 'Morning'.
        time[i] = 'Morning'

In [5]:
df['Crash Time'] = time 

In [6]:
df['Crash Time'] = df['Crash Time'].replace({'Morning': 1,'Night' : 0})
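The same bucketing can be sketched without an explicit loop. This version assumes every 'Crash Time' value is a valid HHMM integer (such as 45 or 1345), which is how the code above treats it; the helper name `day_night_flag` and the sample values are made up for the illustration.

```python
import pandas as pd

def day_night_flag(crash_time: pd.Series) -> pd.Series:
    """Return 1 for hours 7-18 ('Morning' in this notebook) and 0 otherwise ('Night')."""
    hour = crash_time.astype(int) // 100   # 1345 -> 13, 45 -> 0
    return hour.between(7, 18).astype(int)

# Hypothetical sample covering the boundaries of both buckets.
sample = pd.Series([5, 45, 630, 700, 1345, 1859, 1900, 2359])
flags = day_night_flag(sample)
print(flags.tolist())
```

Integer division by 100 drops the minutes, so no string slicing is needed, and the boundaries (6 is night, 7 and 18 are morning) match the loop.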

<b>      Crash Severity   </b>

Convert the dependent variable to a binary numeric value 

Map 'Not serious' to 1 and 'Serious' to 0.

In [7]:
df['Crash Severity'].value_counts()

Not serious    2286
Serious          36
Name: Crash Severity, dtype: int64

In [8]:
df['Crash Severity'] = df['Crash Severity'].replace({'Not serious': 1,'Serious' : 0})

<b> $1000 Damage to Any One Persons Property </b>

The column '$1000 Damage to Any One Persons Property' has two values, 'Yes' and 'No'.

Since the majority of the values are 'Yes', let's replace 'Yes' with 1 and 'No' with 0.

In [9]:
df['$1000 Damage to Any One Persons Property'] = df['$1000 Damage to Any One Persons Property'].replace({'Yes': 1,'No' : 0})

<b> Day of Week </b>

Create dummy variables for the 'Day of Week' column, since it is a nominal variable.

In [10]:
cols = pd.get_dummies(df['Day of Week'], prefix= 'Day of Week')
df[cols.columns] = cols
df.drop('Day of Week', axis = 1, inplace = True)

<b> Construction Zone Flag </b>

'Construction Zone Flag' has two values, 'Yes' and 'No'.

Since the majority of the values are 'No', let's replace 'Yes' with 0 and 'No' with 1.

In [11]:
df['Construction Zone Flag'] = df['Construction Zone Flag'].replace({'No': 1,'Yes' : 0})

<b> Construction Zone Workers Present Flag </b>

'Construction Zone Workers Present Flag' has two values, 'Yes' and 'No'.

Since the majority of the values are 'No', let's replace 'Yes' with 0 and 'No' with 1.

In [12]:
df['Construction Zone Workers Present Flag'] = df['Construction Zone Workers Present Flag'].replace({'No': 1,'Yes' : 0})

<b> Surface Type </b>

Surface Type holds the value 'No Data', so it carries no information and the column is dropped.

In [13]:
df = df.drop('Surface Type', axis = 1)

<b> Highway system </b> 

Highway System has only one distinct value, so it carries no information for further processing.

In [14]:
df = df.drop('Highway System', axis = 1)

<b> Bridge detail </b> 

More than 90% of the Bridge Detail values are 'Not Applicable', so creating dummies for this variable would add little information.

In [15]:
df = df.drop('Bridge Detail', axis = 1)

<b>Median Type </b>

More than 90% of the Median Type values are 'No Data', so creating dummies for this variable would add little information.

In [16]:
df = df.drop('Median Type', axis = 1)

<b>Weather Condition </b> 

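The rare categories ('99 - UNKNOWN', '5 - SNOW' and '4 - SLEET/HAIL') are merged into '1 - CLEAR', and dummies are then created for the categorical variable 'Weather Condition'.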

In [17]:
df['Weather Condition'].value_counts()

1 - CLEAR                            1817
2 - CLOUDY                            319
3 - RAIN                              179
98 - OTHER (EXPLAIN IN NARRATIVE)       2
6 - FOG                                 2
5 - SNOW                                1
99 - UNKNOWN                            1
4 - SLEET/HAIL                          1
Name: Weather Condition, dtype: int64

In [18]:
df['Weather Condition'] = df['Weather Condition'].replace({'99 - UNKNOWN': '1 - CLEAR'})
df['Weather Condition'] = df['Weather Condition'].replace({'5 - SNOW': '1 - CLEAR'})
df['Weather Condition'] = df['Weather Condition'].replace({'4 - SLEET/HAIL': '1 - CLEAR'})

In [19]:
df['Weather Condition'].value_counts()

1 - CLEAR                            1820
2 - CLOUDY                            319
3 - RAIN                              179
98 - OTHER (EXPLAIN IN NARRATIVE)       2
6 - FOG                                 2
Name: Weather Condition, dtype: int64

In [20]:
cols = pd.get_dummies(df['Weather Condition'], prefix= 'Weather Condition')
df[cols.columns] = cols
df.drop('Weather Condition', axis = 1, inplace = True)
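The cells above merge three single-row weather categories into '1 - CLEAR' by hand. A different option, sketched here and not what this notebook does, is to lump every category below a count threshold into one explicit label. The helper `lump_rare`, the label 'OTHER' and the toy series are hypothetical.

```python
import pandas as pd

def lump_rare(values: pd.Series, min_count: int, other_label: str) -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps values that are not rare and substitutes other_label elsewhere.
    return values.where(~values.isin(rare), other_label)

# Toy data, not the real column.
weather = pd.Series(['1 - CLEAR'] * 6 + ['2 - CLOUDY'] * 3 + ['5 - SNOW', '6 - FOG'])
lumped = lump_rare(weather, min_count=2, other_label='OTHER')
print(lumped.value_counts())
```

A threshold-based rule has the advantage that it can be applied unchanged to the test file, where a different set of categories may be rare.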

<b>Surface condition</b> 

The rare values '99 - UNKNOWN' and '7 - SAND, MUD, DIRT' are merged into '1 - DRY', and dummies are then created for 'Surface Condition'.


In [21]:
df['Surface Condition'] = df['Surface Condition'].replace({'99 - UNKNOWN': '1 - DRY'})
df['Surface Condition'] = df['Surface Condition'].replace({'7 - SAND, MUD, DIRT': '1 - DRY'})

In [22]:
df['Surface Condition'].value_counts()

1 - DRY                              2055
2 - WET                               246
3 - STANDING WATER                     14
6 - ICE                                 5
98 - OTHER (EXPLAIN IN NARRATIVE)       2
Name: Surface Condition, dtype: int64

In [23]:
cols = pd.get_dummies(df['Surface Condition'], prefix= 'Surface Condition')
df[cols.columns] = cols
df.drop('Surface Condition', axis = 1, inplace = True)

<b> Intersecting Highway Number </b>

In [24]:
df['Intersecting Highway Number'].value_counts()

No Data    1948
35          221
345         146
75            7
Name: Intersecting Highway Number, dtype: int64

The 'Intersecting Highway Number' column has 1,948 'No Data' values, which is more than 80% of the column, so imputing with the median or the mode makes little sense. The column is removed from the dataset because it would not be informative in later analysis.

In [25]:
df = df.drop('Intersecting Highway Number', axis = 1)

<b> Median Width </b>

'Median Width' has around 1,801 values of 40 and 521 values with no data.
The variable is dropped because a column with essentially a single value adds little information in further processing.

In [26]:
df = df.drop('Median Width', axis = 1)

<b>  Manner of Collision    </b>

'Manner of Collision' has many unordered categories, so the detailed values are grouped into four broad categories and dummies are created for the grouped variable.

In [27]:
df['Manner of Collision'].value_counts()

SAME DIRECTION - BOTH GOING STRAIGHT-REAR END        665
SAME DIRECTION - BOTH GOING STRAIGHT-SIDESWIPE       542
SAME DIRECTION - ONE STRAIGHT-ONE STOPPED            404
ONE MOTOR VEHICLE - GOING STRAIGHT                   325
ANGLE - BOTH GOING STRAIGHT                          177
OPPOSITE DIRECTION - ONE STRAIGHT-ONE LEFT TURN       89
SAME DIRECTION - ONE STRAIGHT-ONE LEFT TURN           32
ANGLE - ONE STRAIGHT-ONE RIGHT TURN                   16
SAME DIRECTION - ONE STRAIGHT-ONE RIGHT TURN          12
SAME DIRECTION - BOTH RIGHT TURN                       9
OPPOSITE DIRECTION - ONE STRAIGHT-ONE BACKING          9
ONE MOTOR VEHICLE - TURNING RIGHT                      9
ANGLE - ONE STRAIGHT-ONE LEFT TURN                     8
OPPOSITE DIRECTION - ONE BACKING-ONE STOPPED           7
ONE MOTOR VEHICLE - TURNING LEFT                       7
ONE MOTOR VEHICLE - OTHER                              2
ANGLE - ONE STRAIGHT-ONE STOPPED                       2
OPPOSITE DIRECTION - ONE RIGHT 

In [28]:
# Group the detailed collision descriptions into four broad categories.
manner_groups = {
    'SAME DIRECTION': ['BOTH GOING STRAIGHT-REAR END', 'BOTH GOING STRAIGHT-SIDESWIPE',
                       'ONE STRAIGHT-ONE STOPPED', 'ONE STRAIGHT-ONE LEFT TURN',
                       'ONE STRAIGHT-ONE RIGHT TURN', 'BOTH RIGHT TURN', 'BOTH LEFT TURN'],
    'ONE MOTOR VEHICLE': ['GOING STRAIGHT', 'TURNING RIGHT', 'TURNING LEFT', 'OTHER', 'BACKING'],
    'ANGLE': ['BOTH GOING STRAIGHT', 'ONE STRAIGHT-ONE RIGHT TURN', 'ONE STRAIGHT-ONE LEFT TURN',
              'ONE STRAIGHT-ONE STOPPED', 'ONE RIGHT TURN-ONE STOPPED', 'ONE STRAIGHT-ONE BACKING'],
    'OPPOSITE DIRECTION': ['ONE STRAIGHT-ONE LEFT TURN', 'ONE STRAIGHT-ONE BACKING',
                           'ONE BACKING-ONE STOPPED', 'ONE LEFT TURN-ONE STOPPED',
                           'ONE RIGHT TURN-ONE LEFT TURN', 'BOTH GOING STRAIGHT'],
}
# Build the full 'GROUP - DETAIL' -> 'GROUP' mapping and apply it in one call.
manner_map = {f'{group} - {detail}': group
              for group, details in manner_groups.items() for detail in details}
df['Manner of Collision'] = df['Manner of Collision'].replace(manner_map)
 


In [29]:
df['Manner of Collision'].value_counts()

SAME DIRECTION        1665
ONE MOTOR VEHICLE      344
ANGLE                  205
OPPOSITE DIRECTION     108
Name: Manner of Collision, dtype: int64

In [30]:
cols = pd.get_dummies(df['Manner of Collision'], prefix= 'Manner of Collision')
df[cols.columns] = cols
df.drop('Manner of Collision', axis = 1, inplace = True)
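Every detailed 'Manner of Collision' value in the listings above has the form `GROUP - DETAIL`, so the grouping amounts to keeping the text before the first `' - '`. The sketch below shows that idea on a few values copied from the listings; it assumes the separator is always space, hyphen, space, and it leaves values without a separator (such as 'OTHER' in the test file) unchanged.

```python
import pandas as pd

manner = pd.Series([
    'SAME DIRECTION - BOTH GOING STRAIGHT-REAR END',
    'ONE MOTOR VEHICLE - GOING STRAIGHT',
    'ANGLE - ONE STRAIGHT-ONE STOPPED',
    'OPPOSITE DIRECTION - ONE STRAIGHT-ONE LEFT TURN',
    'OTHER',
])

# Split on the first ' - ' only; hyphens without surrounding spaces are not touched.
group = manner.str.split(' - ', n=1).str[0]
print(group.tolist())
```

Unlike an explicit list of replacements, this also covers detailed values that appear only in the test file, at the cost of silently accepting any new prefix.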

<b> Intersecting Street Name  </b> 

Only 541 of the 2,322 rows have a value for this column, so converting it to dummies makes little sense. It is wiser to remove the column.

In [31]:
df = df.drop('Intersecting Street Name', axis = 1)

<b>    Number of Entering Roads   </b> 

In [32]:
df['Number of Entering Roads'].value_counts()

97 - NOT APPLICABLE                  1980
4 - FOUR ENTERING ROADS               193
2 - THREE ENTERING ROADS - T           68
98 - OTHER (EXPLAIN IN NARRATIVE)      44
3 - THREE ENTERING ROADS - Y           31
8 - CLOVERLEAF                          2
6 - SIX ENTERING ROADS                  2
5 - FIVE ENTERING ROADS                 2
Name: Number of Entering Roads, dtype: int64

In [33]:
df['Number of Entering Roads'] = df['Number of Entering Roads'].replace({'5 - FIVE ENTERING ROADS': '97 - NOT APPLICABLE'})

In [34]:
df['Number of Entering Roads'].value_counts()

97 - NOT APPLICABLE                  1982
4 - FOUR ENTERING ROADS               193
2 - THREE ENTERING ROADS - T           68
98 - OTHER (EXPLAIN IN NARRATIVE)      44
3 - THREE ENTERING ROADS - Y           31
8 - CLOVERLEAF                          2
6 - SIX ENTERING ROADS                  2
Name: Number of Entering Roads, dtype: int64

Creating the dummies for the number of entering roads

In [35]:
cols = pd.get_dummies(df['Number of Entering Roads'], prefix= 'Number of Entering Roads')
df[cols.columns] = cols
df.drop('Number of Entering Roads', axis = 1, inplace = True)

<b> Number of Lanes </b>

In the 'Number of Lanes' column, 'No Data' carries no information. We replace 'No Data' with the most frequent value (8, which is also the highest) and then map 8 to 1 and 6 to 0.

In [36]:
df['Number of Lanes'].value_counts()

8          1257
6           544
No Data     521
Name: Number of Lanes, dtype: int64

In [37]:
df['Number of Lanes'] = df['Number of Lanes'].replace({'No Data': 8})

In [38]:
df['Number of Lanes'] = df['Number of Lanes'].astype(int)

In [39]:
df['Number of Lanes'].value_counts()

8    1778
6     544
Name: Number of Lanes, dtype: int64

In [40]:
df['Number of Lanes']=df['Number of Lanes'].map({8:1,6:0})

<b>   Surface Width   </b>

In [41]:
df['Surface Width'].value_counts()

96         1257
72          544
No Data     521
Name: Surface Width, dtype: int64

In [42]:
df['Surface Width'] = df['Surface Width'].replace({'No Data': 96})

In [43]:
df['Surface Width'] = df['Surface Width'].astype(int)

In [44]:
df["Surface Width"]=df["Surface Width"].map({96:1,72:0})

In [45]:
df['Surface Width'].value_counts()

1    1778
0     544
Name: Surface Width, dtype: int64

# Preprocessing ``test.csv`` (10 points)

#### Used the same steps for the test file

In [46]:
df1['$1000 Damage to Any One Persons Property'] = df1['$1000 Damage to Any One Persons Property'].replace({'Yes': 1,'No' : 0})

In [47]:
cols = pd.get_dummies(df1['Day of Week'], prefix= 'Day of Week')
df1[cols.columns] = cols
df1.drop('Day of Week', axis = 1, inplace = True)

In [48]:
df1['Construction Zone Flag'] = df1['Construction Zone Flag'].replace({'No': 1,'Yes' : 0})

In [49]:
df1['Construction Zone Workers Present Flag'] = df1['Construction Zone Workers Present Flag'].replace({'No': 1,'Yes' : 0})

In [50]:
df1 = df1.drop('Surface Type', axis = 1)

In [51]:
df1 = df1.drop('Highway System', axis = 1)

In [52]:
df1 = df1.drop('Bridge Detail', axis = 1)

In [53]:
df1 = df1.drop('Median Type', axis = 1)

In [54]:
cols = pd.get_dummies(df1['Weather Condition'], prefix= 'Weather Condition')
df1[cols.columns] = cols
df1.drop('Weather Condition', axis = 1, inplace = True)

In [55]:
cols = pd.get_dummies(df1['Surface Condition'], prefix= 'Surface Condition')
df1[cols.columns] = cols
df1.drop('Surface Condition', axis = 1, inplace = True)

In [56]:
df1['Intersecting Highway Number'].value_counts()

No Data    657
35          71
345         41
75           5
Name: Intersecting Highway Number, dtype: int64

In [57]:
df1 = df1.drop('Intersecting Highway Number', axis = 1)

In [58]:
#df1['Median Width'] = df1['Median Width'].replace({'40':1 , 'No Data':0})
df1 = df1.drop('Median Width', axis = 1)

In [59]:
df1['Manner of Collision'].value_counts()

SAME DIRECTION - BOTH GOING STRAIGHT-SIDESWIPE       203
SAME DIRECTION - BOTH GOING STRAIGHT-REAR END        202
SAME DIRECTION - ONE STRAIGHT-ONE STOPPED            142
ONE MOTOR VEHICLE - GOING STRAIGHT                   101
ANGLE - BOTH GOING STRAIGHT                           58
OPPOSITE DIRECTION - ONE STRAIGHT-ONE LEFT TURN       27
ANGLE - ONE STRAIGHT-ONE RIGHT TURN                    9
SAME DIRECTION - ONE STRAIGHT-ONE RIGHT TURN           8
SAME DIRECTION - ONE STRAIGHT-ONE LEFT TURN            5
ONE MOTOR VEHICLE - TURNING LEFT                       4
ONE MOTOR VEHICLE - TURNING RIGHT                      2
ANGLE - ONE STRAIGHT-ONE LEFT TURN                     2
OPPOSITE DIRECTION - BOTH GOING STRAIGHT               2
ANGLE - ONE LEFT TURN-ONE STOPPED                      1
OTHER                                                  1
OPPOSITE DIRECTION - ONE RIGHT TURN-ONE LEFT TURN      1
OPPOSITE DIRECTION - ONE BACKING-ONE STOPPED           1
ONE MOTOR VEHICLE - BACKING    

In [60]:
# Group the detailed collision descriptions into four broad categories.
manner_groups = {
    'SAME DIRECTION': ['BOTH GOING STRAIGHT-REAR END', 'BOTH GOING STRAIGHT-SIDESWIPE',
                       'ONE STRAIGHT-ONE STOPPED', 'ONE STRAIGHT-ONE LEFT TURN',
                       'ONE STRAIGHT-ONE RIGHT TURN', 'BOTH RIGHT TURN', 'BOTH LEFT TURN'],
    'ONE MOTOR VEHICLE': ['GOING STRAIGHT', 'TURNING RIGHT', 'TURNING LEFT', 'OTHER', 'BACKING'],
    'ANGLE': ['BOTH GOING STRAIGHT', 'ONE STRAIGHT-ONE RIGHT TURN', 'ONE STRAIGHT-ONE LEFT TURN',
              'ONE STRAIGHT-ONE STOPPED', 'ONE RIGHT TURN-ONE STOPPED', 'ONE STRAIGHT-ONE BACKING',
              'ONE LEFT TURN-ONE STOPPED'],
    'OPPOSITE DIRECTION': ['ONE STRAIGHT-ONE LEFT TURN', 'ONE STRAIGHT-ONE BACKING',
                           'ONE BACKING-ONE STOPPED', 'ONE LEFT TURN-ONE STOPPED',
                           'ONE RIGHT TURN-ONE LEFT TURN', 'BOTH GOING STRAIGHT'],
}
# Build the full 'GROUP - DETAIL' -> 'GROUP' mapping and apply it in one call.
manner_map = {f'{group} - {detail}': group
              for group, details in manner_groups.items() for detail in details}
df1['Manner of Collision'] = df1['Manner of Collision'].replace(manner_map)
 


In [61]:
df1['Manner of Collision'] = df1['Manner of Collision'].replace({'OTHER' :'SAME DIRECTION' })

In [62]:
df1['Manner of Collision'].value_counts()

SAME DIRECTION        563
ONE MOTOR VEHICLE     109
ANGLE                  71
OPPOSITE DIRECTION     31
Name: Manner of Collision, dtype: int64

In [63]:
cols = pd.get_dummies(df1['Manner of Collision'], prefix= 'Manner of Collision')
df1[cols.columns] = cols
df1.drop('Manner of Collision', axis = 1, inplace = True)

In [64]:
df1 = df1.drop('Intersecting Street Name', axis = 1)

In [65]:
df1['Number of Entering Roads'].value_counts()

97 - NOT APPLICABLE                  670
4 - FOUR ENTERING ROADS               62
2 - THREE ENTERING ROADS - T          19
3 - THREE ENTERING ROADS - Y          14
98 - OTHER (EXPLAIN IN NARRATIVE)      6
7 - TRAFFIC CIRCLE                     1
8 - CLOVERLEAF                         1
6 - SIX ENTERING ROADS                 1
Name: Number of Entering Roads, dtype: int64

In [66]:
df1['Number of Entering Roads'] = df1['Number of Entering Roads'].replace({'7 - TRAFFIC CIRCLE': '97 - NOT APPLICABLE'})

In [67]:
df1['Number of Entering Roads'].value_counts()

97 - NOT APPLICABLE                  671
4 - FOUR ENTERING ROADS               62
2 - THREE ENTERING ROADS - T          19
3 - THREE ENTERING ROADS - Y          14
98 - OTHER (EXPLAIN IN NARRATIVE)      6
8 - CLOVERLEAF                         1
6 - SIX ENTERING ROADS                 1
Name: Number of Entering Roads, dtype: int64

In [68]:
cols = pd.get_dummies(df1['Number of Entering Roads'], prefix= 'Number of Entering Roads')
df1[cols.columns] = cols
df1.drop('Number of Entering Roads', axis = 1, inplace = True)

In [69]:
df1['Number of Lanes'] = df1['Number of Lanes'].replace({'No Data': 8})

In [70]:
df1['Number of Lanes'] = df1['Number of Lanes'].astype(int)

In [71]:
df1['Number of Lanes']=df1['Number of Lanes'].map({8:1,6:0})

In [72]:
df1['Surface Width'] = df1['Surface Width'].replace({'No Data': 96})

In [73]:
df1['Surface Width'] = df1['Surface Width'].astype(int)

In [74]:
df1["Surface Width"]=df1["Surface Width"].map({96:1,72:0})

In [75]:
# 'Crash Time' appears to be stored as an HHMM number (for example 1345);
# insert a colon before the last two digits (the minutes).
time = df1['Crash Time'].astype(str)
time = [f'{t[:-2]}:{t[-2:]}' for t in time]

for i in range(len(time)):
    hour = time[i].split(":")[0]
    if len(time[i]) <= 3:
        # No hour digits (for example ':45'), so the crash happened just after midnight.
        time[i] = 'Night'
    elif int(hour) <= 6 or 18 < int(hour) < 24:
        time[i] = 'Night'
    elif int(hour) <= 18:
        # Hours 7 to 18 count as 'Morning'.
        time[i] = 'Morning'

In [76]:
df1['Crash Time'] = time 

In [77]:
df1['Crash Time'] = df1['Crash Time'].replace({'Morning': 1,'Night' : 0})
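Because dummies are created separately for `train.csv` and `test.csv`, the two frames can end up with different columns whenever a category occurs in only one file (for example, the rare weather categories are merged for the training data but not for the test data). One way to guard against this, shown here as a sketch on toy frames and not as part of the original workflow, is to reindex the test frame to the training columns.

```python
import pandas as pd

# Toy frames: 'RAIN' and 'FOG' occur only in train, 'SNOW' only in test.
train = pd.DataFrame({'Weather Condition': ['1 - CLEAR', '3 - RAIN', '6 - FOG']})
test = pd.DataFrame({'Weather Condition': ['1 - CLEAR', '5 - SNOW']})

train_d = pd.get_dummies(train, columns=['Weather Condition'])
test_d = pd.get_dummies(test, columns=['Weather Condition'])

# Reindex the test frame to the training columns: categories missing from
# test are filled with 0, and categories unseen in training are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_d.columns))
```

In this notebook the equivalent call would reindex `df1` against the columns of the training feature matrix (the training frame without 'Crash Severity'), so that a fitted model sees the features in the order it was trained on.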

# Machine learning models (20 points)

The data is highly imbalanced, so the minority class needs to be upsampled in order to get more reliable predictions.

In [78]:
!pip install imbalanced-learn



In [79]:
y = df['Crash Severity']
X = df.drop('Crash Severity', axis = 1)

Split the data into training and testing sets.

Because the dataset is highly imbalanced, a model trained on the raw data does not predict the minority class well. To get better predictions, the minority class in the training split is upsampled with SMOTE, and stratified cross-validation is used for the grid searches.

In [80]:
from sklearn.model_selection import train_test_split

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

In [82]:
from imblearn.over_sampling import SMOTE 

def upsample_SMOTE(X_train, y_train, ratio=1.0):
    """Upsamples minority class using SMOTE.
    Ratio argument is the percentage of the upsampled minority class in relation
    to the majority class. Default is 1.0
    """
    sm = SMOTE(random_state=23, sampling_strategy=ratio)
    X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
    #print(len(X_train_sm), len(y_train_sm))
    return X_train_sm, y_train_sm

X_train_sm,y_train_sm = upsample_SMOTE(X_train, y_train, ratio=1.0)
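To make clear what the upsampling does, the sketch below reproduces only the core idea of SMOTE in NumPy: pick a minority row, pick one of its nearest minority neighbours, and place a synthetic point somewhere on the segment between them. It is a simplified illustration, not the imbalanced-learn implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic point between a random minority row and one of its k nearest minority neighbours."""
    i = rng.integers(len(minority))
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                         # position along the segment, in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

rng = np.random.default_rng(0)
# Toy minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic.shape)
```

One consequence worth keeping in mind: because the new points are interpolated, 0/1 dummy features such as the ones built above can take fractional values in the upsampled training set.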

In [83]:
from sklearn.model_selection import StratifiedKFold
sk =StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
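As a quick illustration of what stratification guarantees, the sketch below splits a toy label vector with 10 minority cases into two folds and counts the minority cases in each validation fold. It uses `shuffle=True` with a fixed seed, which differs from the `shuffle=False` setting above; the toy arrays are made up.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 90 majority (1) and 10 minority (0) cases.
y = np.array([1] * 90 + [0] * 10)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
# Count minority cases in each validation fold.
fold_minority_counts = [int((y[test_idx] == 0).sum()) for _, test_idx in skf.split(X, y)]
print(fold_minority_counts)
```

Each fold receives the same share of minority cases, so no fold is left without examples of the rare class.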

<b> KNN </b>

In [84]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Grid search 

In [85]:
k_range = list(range(1, 12))
param_grid = dict(n_neighbors=k_range)

In [86]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv= sk, return_train_score=True,scoring= 'roc_auc')

In [87]:
grid_search.fit(X_train_sm, y_train_sm)

GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]},
             return_train_score=True, scoring='roc_auc')

In [88]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'n_neighbors': 8}
Best cross-validation score: 0.98


In [89]:
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
knn = KNeighborsClassifier(n_neighbors=8)

knn.fit(X_train_sm, y_train_sm)
print('Train score: {:.4f}'.format(knn.score(X_train_sm, y_train_sm)))
print('Test score: {:.4f}'.format(knn.score(X_test, y_test)))

Train score: 0.9504
Test score: 0.9088


In [90]:
from sklearn.metrics import accuracy_score, roc_auc_score
y_pred = knn.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, knn.predict_proba(X_test)[:,1]))

roc_auc_score:  0.7097902097902098
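The test accuracy above is high while the ROC AUC is much lower, because the two metrics measure different things: accuracy looks at thresholded labels, whereas AUC looks at how well the predicted probabilities rank class 1 above class 0. The sketch below computes AUC directly from that pairwise definition on made-up scores; `auc_by_pairs` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Share of (positive, negative) pairs where the positive gets the higher score; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up labels and predicted probabilities for class 1.
y_true = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = auc_by_pairs(y_true, scores)
print(auc)
```

With only a handful of 'Serious' cases in the held-out split, a few badly ranked rows are enough to pull this number down even when almost every label is predicted correctly.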


In [91]:
y_pred

array([1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

<b> Logistic Regression </b>

Grid Search 

In [92]:
from sklearn.linear_model import LogisticRegression
c_range=[0.001, 0.01, 0.1, 1, 10, 100, 1000]
grid = {"C": c_range , "penalty":["l1","l2"],"solver":["liblinear"]}
logreg = LogisticRegression(class_weight = 'balanced',max_iter = 10000)
logreg_cv = GridSearchCV(logreg,grid,cv= sk, scoring = 'roc_auc')
logreg_cv.fit(X_train_sm,y_train_sm)

print("tuned Hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("roc_auc :",logreg_cv.best_score_)

tuned Hyperparameters :(best parameters)  {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
roc_auc : 0.9911307660572756


In [93]:
log_l1 = LogisticRegression(penalty = 'l2', C = 10, solver = 'liblinear', max_iter = 500, class_weight = 'balanced')
log_l1.fit(X_train_sm, y_train_sm)
train_score = (log_l1.score(X_train_sm, y_train_sm))
test_score = (log_l1.score(X_test, y_test))
print('Train Score',train_score)
print('Test Score' ,test_score)

Train Score 0.954784130688448
Test Score 0.9380378657487092


In [94]:
y_pred = log_l1.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, log_l1.predict_proba(X_test)[:,1]))

roc_auc_score:  0.6417055167055168


<b> SVC with Poly kernel </b>

Grid search 

In [95]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'degree': [1,2,3,4,5,6,7,8,9,10]}
print("Parameter grid:\n{}".format(param_grid))

Parameter grid:
{'C': [0.001, 0.01, 0.1, 1, 10, 100], 'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}


In [96]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid_search_poly = GridSearchCV(SVC(kernel = 'poly'),param_grid, cv= sk, return_train_score=True)

In [97]:
grid_search_poly.fit(X_train_sm, y_train_sm)

GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
             estimator=SVC(kernel='poly'),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
                         'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
             return_train_score=True)

In [98]:
print("Best parameters: {}".format(grid_search_poly.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search_poly.best_score_))

Best parameters: {'C': 10, 'degree': 3}
Best cross-validation score: 0.97
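
One way to avoid retyping the tuned values by hand is to reuse the refitted model that `GridSearchCV` already stores in `best_estimator_`. The sketch below is an illustration on synthetic data, not the exam data; `X_demo`, `y_demo`, `demo_grid`, `demo_cv` and `demo_search` are hypothetical names, and the grid is deliberately small.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

demo_grid = {"C": [0.1, 1, 10], "degree": [1, 2, 3]}
demo_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# scoring="roc_auc" makes the search optimise the same metric used for grading.
demo_search = GridSearchCV(SVC(kernel="poly"), demo_grid, cv=demo_cv, scoring="roc_auc")
demo_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on all of
# X_tr with the winning parameters, so nothing has to be copied manually.
best_svc = demo_search.best_estimator_
demo_auc = roc_auc_score(y_te, best_svc.decision_function(X_te))

print("best params:", demo_search.best_params_)
print("held-out roc_auc: {:.3f}".format(demo_auc))
```

Using `decision_function` for the AUC avoids the extra cost of `probability=True`, since ROC AUC only needs a ranking of the rows.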


In [99]:
# Note: the grid search above selected degree = 3; this cell fits degree = 5
clf2 = SVC(kernel='poly', C= 10 , degree = 5,probability=True)
clf2.fit(X_train_sm, y_train_sm)
train_score = (clf2.score(X_train_sm, y_train_sm))
test_score = (clf2.score(X_test, y_test))
print('Train Score',train_score)
print('Test Score' ,test_score)

Train Score 0.9760793465577596
Test Score 0.9432013769363167


In [100]:
y_pred = clf2.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, clf2.predict_proba(X_test)[:,1]))

roc_auc_score:  0.6712315462315462


Decision Tree 

Naive Grid Search 

In [101]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

best_score = 0
for max_depth in [1,2,3,4,5,6,7,8,9,10]:
    for min_samples_leaf in [10,25,50,100,500,1000,250,1500,2000,750]:
        for min_samples_split in [10,50,100,150,200,250,300,350,400,450]:
            # NOTE: min_impurity_decrease is looped over and recorded below, but it is
            # not passed to DecisionTreeClassifier, so it does not change the fitted tree
            for min_impurity_decrease in [0.0002,0.0005,0.0007,0.0009,0.001,0.003,0.005,0.007,0.009,0.01]:
                # for each combination of parameters, train a decision tree
                dtree = DecisionTreeClassifier(max_depth = max_depth, min_samples_leaf = min_samples_leaf, min_samples_split = min_samples_split, random_state=0)
                dtree.fit(X_train_sm, y_train_sm)
                # evaluate the decision tree on the test set (accuracy)
                score = dtree.score(X_test, y_test)
                # if we got a better score, store the score and parameters
                if score > best_score:
                    best_score = score
                    best_parameters = {'max_depth': max_depth, 'min_samples': min_samples_leaf,'min_samples_split': min_samples_split, 'min_impurity_decrease':min_impurity_decrease}

print("Best score: {:.2f}".format(best_score))
print("Best parameters: {}".format(best_parameters))

Best score: 0.95
Best parameters: {'max_depth': 9, 'min_samples': 10, 'min_samples_split': 450, 'min_impurity_decrease': 0.0002}
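
A manual search like the one above can also be scored on a separate validation split, so that the test set is not used to pick the parameters. The following is a minimal sketch under stated assumptions: the data is synthetic, the grid is much smaller than the one above, the selection metric is ROC AUC rather than accuracy, and names such as `X_fit`, `X_val` and `best_tree_params` are hypothetical.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data (about 80% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=0
)
# Hold out a validation split that is used only for choosing parameters.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

best_val_auc = -1.0
best_tree_params = None
for max_depth, min_samples_leaf, min_impurity_decrease in product(
    [2, 4, 6], [5, 20], [0.0, 0.001, 0.01]
):
    tree = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        min_impurity_decrease=min_impurity_decrease,  # every looped value is passed on
        random_state=0,
    )
    tree.fit(X_fit, y_fit)
    val_auc = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])
    if val_auc > best_val_auc:
        best_val_auc = val_auc
        best_tree_params = {
            "max_depth": max_depth,
            "min_samples_leaf": min_samples_leaf,
            "min_impurity_decrease": min_impurity_decrease,
        }

print("Best validation roc_auc: {:.3f}".format(best_val_auc))
print("Best parameters:", best_tree_params)
```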


In [102]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth = 9, min_samples_leaf = 10,min_impurity_decrease =0.0002 ,min_samples_split = 300,random_state=0)
dtree.fit(X_train_sm, y_train_sm)
Test_Score = dtree.score(X_test, y_test)
Train_score = dtree.score(X_train_sm, y_train_sm)
print('Test_Score:', Test_Score, 'Train Score',Train_score)

Test_Score: 0.9122203098106713 Train Score 0.8809801633605601


In [103]:
y_pred = dtree.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, dtree.predict_proba(X_test)[:,1]))

roc_auc_score:  0.7898212898212897


Bagging 

Bagging with KNN

In [104]:
from sklearn.ensemble import BaggingClassifier

knn = KNeighborsClassifier(8)
bag_clf = BaggingClassifier(knn ,max_samples=100, bootstrap=True, random_state=0, oob_score = True)

param_grid = {'n_estimators':[100,500,1000]}

bag_grid = GridSearchCV(bag_clf, param_grid = param_grid, cv = sk, n_jobs = -1, scoring = 'roc_auc' )

In [105]:
bag_grid.fit(X_train_sm, y_train_sm)

GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
             estimator=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=8),
                                         max_samples=100, oob_score=True,
                                         random_state=0),
             n_jobs=-1, param_grid={'n_estimators': [100, 500, 1000]},
             scoring='roc_auc')

In [106]:
print("Best parameters: {}".format(bag_grid.best_params_))
print("Best cross-validation score: {:.2f}".format(bag_grid.best_score_))

Best parameters: {'n_estimators': 100}
Best cross-validation score: 0.92


In [107]:
bag_clf = BaggingClassifier(knn,max_samples=100,n_estimators = 1000, bootstrap = True, random_state=0, oob_score = True)

In [108]:
bag_clf.fit(X_train_sm, y_train_sm)

BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=8),
                  max_samples=100, n_estimators=1000, oob_score=True,
                  random_state=0)

In [109]:
print('Train score: %.2f'%bag_clf.score(X_train_sm, y_train_sm))
print('Test score: %.2f'%bag_clf.score(X_test, y_test))

Train score: 0.81
Test score: 0.65


In [110]:
y_pred = bag_clf.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, bag_clf.predict_proba(X_test)[:,1]))

roc_auc_score:  0.7022144522144522


Bagging with random forest

In [111]:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

rnd = RandomForestClassifier( max_depth = 9, n_estimators = 100, bootstrap= True,n_jobs=-1, random_state=0)

bag_clf = BaggingClassifier(rnd,max_samples=100, bootstrap=True, random_state=0, oob_score = True)

param_grid = {'n_estimators':[100,500,1000]}

bag_grid = GridSearchCV(bag_clf, param_grid = param_grid, cv = sk, n_jobs = -1, scoring = 'roc_auc' )

In [112]:
bag_grid.fit(X_train_sm, y_train_sm)

GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
             estimator=BaggingClassifier(base_estimator=RandomForestClassifier(max_depth=9,
                                                                               n_jobs=-1,
                                                                               random_state=0),
                                         max_samples=100, oob_score=True,
                                         random_state=0),
             n_jobs=-1, param_grid={'n_estimators': [100, 500, 1000]},
             scoring='roc_auc')

In [113]:
print("Best parameters: {}".format(bag_grid.best_params_))
print("Best cross-validation score: {:.2f}".format(bag_grid.best_score_))

Best parameters: {'n_estimators': 1000}
Best cross-validation score: 0.97


In [114]:
bag_clf = BaggingClassifier(rnd,max_samples=100,n_estimators = 1000,bootstrap=True, random_state=0, oob_score = True)

In [115]:
bag_clf.fit(X_train_sm, y_train_sm)

BaggingClassifier(base_estimator=RandomForestClassifier(max_depth=9, n_jobs=-1,
                                                        random_state=0),
                  max_samples=100, n_estimators=1000, oob_score=True,
                  random_state=0)

In [116]:
from sklearn.metrics import accuracy_score
print('Train score: %.2f'%bag_clf.score(X_train_sm, y_train_sm))
print('Test score: %.2f'%bag_clf.score(X_test, y_test))
print('Out-of-bag score: %.2f'%bag_clf.oob_score_)

Train score: 0.89
Test score: 0.83
Out-of-bag score: 0.88
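
The out-of-bag score printed above comes from rows that each base estimator did not see during its own bootstrap draw, so it acts as a built-in validation estimate. A small sketch on synthetic data, with a shallow decision tree swapped in as the base estimator to keep it fast; `X_demo`, `y_demo` and `oob_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Each of the 50 trees is trained on 100 rows drawn with replacement;
# the rows a tree never saw are used to score it "out of bag".
oob_demo = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    n_estimators=50,
    max_samples=100,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
oob_demo.fit(X_demo, y_demo)

# oob_score_ is an accuracy estimate, so it is not directly comparable to roc_auc.
print("Out-of-bag accuracy: {:.3f}".format(oob_demo.oob_score_))
```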


In [117]:
from sklearn.metrics import accuracy_score, roc_auc_score
y_pred = bag_clf.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, bag_clf.predict_proba(X_test)[:,1]))

roc_auc_score:  0.6865773115773115


Pasting with KNN

In [118]:
from sklearn.ensemble import BaggingClassifier

knn = KNeighborsClassifier(8)
bag_clf = BaggingClassifier(knn ,max_samples=100, bootstrap= False, random_state=0)

param_grid = {'n_estimators':[100,500,1000]}

bag_grid = GridSearchCV(bag_clf, param_grid = param_grid, cv = sk, n_jobs = -1, scoring = 'roc_auc' )

In [119]:
bag_grid.fit(X_train_sm, y_train_sm)

GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
             estimator=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=8),
                                         bootstrap=False, max_samples=100,
                                         random_state=0),
             n_jobs=-1, param_grid={'n_estimators': [100, 500, 1000]},
             scoring='roc_auc')

In [120]:
print("Best parameters: {}".format(bag_grid.best_params_))
print("Best cross-validation score: {:.2f}".format(bag_grid.best_score_))


Best parameters: {'n_estimators': 1000}
Best cross-validation score: 0.92


In [121]:
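# Note: bootstrap is not set here, so it defaults to True and this model samples
# with replacement (bagging), unlike the bootstrap = False grid search above
bag_clf = BaggingClassifier(knn,max_samples=100,n_estimators = 1000, random_state=0)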

In [122]:
bag_clf.fit(X_train_sm, y_train_sm)

BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=8),
                  max_samples=100, n_estimators=1000, random_state=0)

In [123]:
from sklearn.metrics import accuracy_score
print('Train score: %.2f'%bag_clf.score(X_train_sm, y_train_sm))
print('Test score: %.2f'%bag_clf.score(X_test, y_test))

Train score: 0.81
Test score: 0.65


In [124]:
y_pred = bag_clf.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, bag_clf.predict_proba(X_test)[:,1]))

roc_auc_score:  0.7022144522144522
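
In scikit-learn, the difference between bagging and pasting is the `bootstrap` flag of `BaggingClassifier`: `True` (the default) samples rows with replacement, `False` samples without replacement. A minimal sketch on synthetic data, with hypothetical names `bagging_demo` and `pasting_demo` and far fewer estimators than in the cells above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
knn_demo = KNeighborsClassifier(n_neighbors=8)

# Bagging: bootstrap is left at its default (True), rows are drawn with replacement.
bagging_demo = BaggingClassifier(knn_demo, max_samples=100, n_estimators=10, random_state=0)
# Pasting: bootstrap=False, rows are drawn without replacement.
pasting_demo = BaggingClassifier(
    knn_demo, max_samples=100, n_estimators=10, bootstrap=False, random_state=0
)

bagging_demo.fit(X_demo, y_demo)
pasting_demo.fit(X_demo, y_demo)

print("bagging bootstrap:", bagging_demo.get_params()["bootstrap"])
print("pasting bootstrap:", pasting_demo.get_params()["bootstrap"])

# Row indices drawn for the first pasting estimator: all of them are distinct.
pasting_rows = pasting_demo.estimators_samples_[0]
print("rows drawn:", len(pasting_rows), "unique rows:", len(set(pasting_rows.tolist())))
```

If the final model is meant to be a pasting ensemble, `bootstrap=False` has to be repeated when the estimator is rebuilt after the grid search.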


Random Forest Classifier 

In [125]:
from sklearn.ensemble import RandomForestClassifier
param_grid = {'max_depth': [1,2,3,4,5,6,7,8,9],
              'n_estimators': [100,250,350,400,500,1000,750]}
print("Parameter grid:\n{}".format(param_grid))

Parameter grid:
{'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'n_estimators': [100, 250, 350, 400, 500, 1000, 750]}


In [126]:
from sklearn.model_selection import GridSearchCV
grid_search_poly = GridSearchCV(RandomForestClassifier(bootstrap= True,n_jobs=-1, random_state=0),param_grid, cv=5, return_train_score=True)

In [127]:
grid_search_poly.fit(X_train_sm, y_train_sm)

GridSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1, random_state=0),
             param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'n_estimators': [100, 250, 350, 400, 500, 1000, 750]},
             return_train_score=True)

In [128]:
print("Best parameters: {}".format(grid_search_poly.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search_poly.best_score_))

Best parameters: {'max_depth': 9, 'n_estimators': 100}
Best cross-validation score: 0.93


In [129]:
rnd = RandomForestClassifier( max_depth = 9, n_estimators = 100, bootstrap= True,n_jobs=-1, random_state=0)

In [130]:
rnd.fit(X_train_sm, y_train_sm)

RandomForestClassifier(max_depth=9, n_jobs=-1, random_state=0)

In [131]:
print('Train score: %.2f'%rnd.score(X_train_sm, y_train_sm))
print('Test score: %.2f'%rnd.score(X_test, y_test))

Train score: 0.92
Test score: 0.92


In [132]:
from sklearn.metrics import accuracy_score, roc_auc_score
y_pred = rnd.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, rnd.predict_proba(X_test)[:,1]))

roc_auc_score:  0.6292735042735043


<b> Gradient Boosting </b> 

In [133]:
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {'n_estimators':[100, 500 , 1000], 
              'learning_rate':[0.1, 0.5, 1 ],
            'max_depth' : [1,2,3]}

clf = GradientBoostingClassifier(random_state=0)

In [134]:
gbc_grid = GridSearchCV(clf, param_grid = param_grid, cv = sk, n_jobs = -1, scoring = 'roc_auc' )

In [135]:
gbc_grid.fit(X_train_sm, y_train_sm)

GridSearchCV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
             estimator=GradientBoostingClassifier(random_state=0), n_jobs=-1,
             param_grid={'learning_rate': [0.1, 0.5, 1], 'max_depth': [1, 2, 3],
                         'n_estimators': [100, 500, 1000]},
             scoring='roc_auc')

In [136]:
print("Best parameters: {}".format(gbc_grid.best_params_))
print("Best cross-validation score: {:.2f}".format(gbc_grid.best_score_))

Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
Best cross-validation score: 1.00


In [137]:
gbrt = GradientBoostingClassifier(max_depth = 3,n_estimators = 500, learning_rate=0.1, random_state=0,)
gbrt.fit(X_train_sm, y_train_sm)

GradientBoostingClassifier(n_estimators=500, random_state=0)

In [138]:
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train_sm, y_train_sm)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

Accuracy on training set: 0.975
Accuracy on test set: 0.945


In [139]:
from sklearn.metrics import accuracy_score, roc_auc_score
y_pred = gbrt.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, gbrt.predict_proba(X_test)[:,1]))

roc_auc_score:  0.6648212898212897


# Best model ( 5 points)
Explain which machine learning model is the best model for this dataset and why? 

I ran many models in order to find the one that fits the dataset best.
Of all the models in the previous question, I found the decision tree to be the best. The results of each model are listed below:
    
   <b>  Gradient Boosting </b>
   
Accuracy on training set: 0.975
Accuracy on test set: 0.945

roc_auc_score:  0.6648212898212897
    
<b> RFC </b>

roc_auc_score:  0.6292735042735043

Train score: 0.92
Test score: 0.92

<b> PASTING WITH KNN </b>

roc_auc_score:  0.7022144522144522

Train score: 0.81
Test score: 0.65

<b> BAGGING WITH RANDOM FOREST </b>

roc_auc_score:  0.6865773115773115

Train score: 0.89
Test score: 0.83

<b> BAGGING WITH KNN </b>

roc_auc_score:  0.7022144522144522

Train score: 0.81
Test score: 0.65

<b> DECISION TREE </b>

roc_auc_score:  0.7898212898212897

Test_Score: 0.9122203098106713 Train Score 0.8809801633605601

<b> SVC POLY </b> 

roc_auc_score:  0.6712315462315462

Train Score 0.9760793465577596
Test Score 0.9432013769363167

<b> Logistic </b>

roc_auc_score:  0.6417055167055168

Train Score 0.954784130688448
Test Score 0.9380378657487092

<b> KNN </b>

roc_auc_score:  0.7097902097902098

Train score: 0.9504
Test score: 0.9088


Of all the models I ran, the highest roc_auc_score values came from the Decision Tree (0.79), KNN (0.71) and Bagging with KNN (0.70).

Among these, the Decision Tree performed best, with a roc_auc_score of 0.79 and only a small difference between its train and test scores.

Test_Score: 0.9122203098106713 

Train Score 0.8809801633605601
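
A comparison like the one above can also be produced in a single loop, so that every model is scored the same way and ranked by the grading metric. This is a sketch on synthetic data with three stand-in models and arbitrary settings; `candidates` and `rows` are hypothetical names and the printed numbers are not the exam results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the exam data (about 80% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

candidates = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=8),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    rows.append((name, model.score(X_tr, y_tr), model.score(X_te, y_te), test_auc))

# Rank by the metric used for grading, highest first.
rows.sort(key=lambda row: row[3], reverse=True)

print(f"{'model':<15}{'train acc':>10}{'test acc':>10}{'test AUC':>10}")
for name, train_acc, test_acc, test_auc in rows:
    print(f"{name:<15}{train_acc:>10.3f}{test_acc:>10.3f}{test_auc:>10.3f}")
```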


#### Best model 

In [141]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

best_score = 0
for max_depth in [1,2,3,4,5,6,7,8,9,10]:
    for min_samples_leaf in [10,25,50,100,500,1000,250,1500,2000,750]:
        for min_samples_split in [10,50,100,150,200,250,300,350,400,450]:
            # NOTE: min_impurity_decrease is looped over and recorded below, but it is
            # not passed to DecisionTreeClassifier, so it does not change the fitted tree
            for min_impurity_decrease in [0.0002,0.0005,0.0007,0.0009,0.001,0.003,0.005,0.007,0.009,0.01]:
                # for each combination of parameters, train a decision tree
                dtree = DecisionTreeClassifier(max_depth = max_depth, min_samples_leaf = min_samples_leaf, min_samples_split = min_samples_split, random_state=0)
                dtree.fit(X_train_sm, y_train_sm)
                # evaluate the decision tree on the test set (accuracy)
                score = dtree.score(X_test, y_test)
                # if we got a better score, store the score and parameters
                if score > best_score:
                    best_score = score
                    best_parameters = {'max_depth': max_depth, 'min_samples': min_samples_leaf,'min_samples_split': min_samples_split, 'min_impurity_decrease':min_impurity_decrease}

print("Best score: {:.2f}".format(best_score))
print("Best parameters: {}".format(best_parameters))

Best score: 0.95
Best parameters: {'max_depth': 9, 'min_samples': 10, 'min_samples_split': 450, 'min_impurity_decrease': 0.0002}


In [142]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth = 9,min_samples_leaf = 10,min_impurity_decrease = 0.0002,min_samples_split = 300,random_state=0)
dtree.fit(X_train_sm, y_train_sm)
Test_Score = dtree.score(X_test, y_test)
Train_score = dtree.score(X_train_sm, y_train_sm)
print('Test_Score:', Test_Score, 'Train Score',Train_score)

Test_Score: 0.9122203098106713 Train Score 0.8809801633605601


In [143]:
from sklearn.metrics import accuracy_score, roc_auc_score
y_pred = dtree.predict(X_test)
print('roc_auc_score: ', roc_auc_score(y_test, dtree.predict_proba(X_test)[:,1]))

roc_auc_score:  0.7898212898212897
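
The score above is computed from `predict_proba`, while hard 0/1 labels from `predict` generally give a different ROC AUC because they collapse the ranking into two levels. A tiny worked example with made-up numbers (not model output), assuming a 0.5 threshold for the hard labels; `demo_proba` and `demo_labels` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1, 1])
demo_proba = np.array([0.2, 0.6, 0.55, 0.7, 0.9])

# Hard labels obtained by thresholding the probabilities at 0.5.
demo_labels = (demo_proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, demo_proba)
auc_from_labels = roc_auc_score(y_true, demo_labels)

# 5 of the 6 (negative, positive) pairs are ranked correctly by the probabilities.
print("AUC from probabilities: {:.4f}".format(auc_from_proba))  # 0.8333
# With hard labels, 3 pairs are correct and 3 are ties that count as one half each.
print("AUC from hard labels:   {:.4f}".format(auc_from_labels))  # 0.7500
```

Whether the graded `final_test_prediction` should hold labels or probabilities depends on the exam instructions, so this is only an illustration of how the metric behaves.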


# Grading (50 points)
Your model should predict the outcome for every row in the test.csv. 
You should be able to correctly print the ``final_test_prediction`` executing the following statement: 

In [144]:
final_test_prediction = dtree.predict(df1)
final_test_prediction

array([1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,