# Coursera Capstone

This notebook is dedicated to analysing car crash severity data and building an ML model that can help prevent such accidents by warning users when road conditions are accident-prone.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import jaccard_score
import seaborn as sns
import holidays
from datetime import date

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


### Business Understanding

The idea behind this project is to develop a system that warns road users of the likelihood and severity of an accident. Four factors determine an accident: weather conditions, road conditions, vehicle characteristics and human error. Weather and road conditions clearly have a meaningful impact on both the likelihood and the severity of an accident. Vehicle characteristics matter too: a motorbike, for example, offers less protection than a car, so its rider is more likely to suffer a severe accident. All of these factors influence the final result and must be taken into consideration. Human error is difficult to quantify unless the driver's distraction or mistake has been recorded, and it is also difficult to predict (that would have to be solved with another model and a different set of attributes). We must look for the aforementioned attributes in our dataset and use them to estimate the likelihood and severity of accidents under varying road, weather and vehicle conditions.

### Data Understanding

Extracting the data and displaying it.

In [3]:
df_collisions = pd.read_csv('Data-Collisions.csv', low_memory=False)
df_collisions

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.703140,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.334540,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194668,2,-122.290826,47.565408,219543,309534,310814,E871089,Matched,Block,,...,Dry,Daylight,,,,24,From opposite direction - both moving - head-on,0,0,N
194669,1,-122.344526,47.690924,219544,309085,310365,E876731,Matched,Block,,...,Wet,Daylight,,,,13,From same direction - both going straight - bo...,0,0,N
194670,2,-122.306689,47.683047,219545,311280,312640,3809984,Matched,Intersection,24760.0,...,Dry,Daylight,,,,28,From opposite direction - one left turn - one ...,0,0,N
194671,2,-122.355317,47.678734,219546,309514,310794,3810083,Matched,Intersection,24349.0,...,Dry,Dusk,,,,5,Vehicle Strikes Pedalcyclist,4308,0,N


The data covers all car accidents in Seattle; it is provided by SPD and recorded by Traffic Records.

Only attributes that are not a direct result of the accident should be considered. We can only use attributes that were known before the accident happened; we cannot predict accidents with data that is a direct result of that accident.
As previously mentioned, road, weather and vehicle conditions are determining factors of an accident, as are location and location type. Human error is another, but it is almost impossible to estimate the chances of the driver being distracted with this dataset. It is only once the accident has happened that we may find out whether human error was involved.

Initially, just by looking at the metadata, the attributes I would use are:

ADDRTYPE: Alley, Block, Intersection
LOCATION: General location of collision
JUNCTIONTYPE: Category of junction at which the collision took place
WEATHER: Weather description at the location and during the time of the collision
ROADCOND: Road condition at the location and during the time of the collision
LIGHTCOND: Light conditions at the location and during the time of the collision
INCDTTM: Date and time of the incident. Accidents at night are more likely to be severe, and accidents are more likely to occur during rush hour.

PERSONCOUNT, PEDCOUNT, PEDCYLCOUNT, VEHCOUNT, INJURIES, SERIOUSINJURIES, FATALITIES and HITPARKEDCAR are all attributes that make it easy to deduce accident severity. But because this is a predictive model, and these attributes cannot be known before the accident happens, they are meaningless when trying to predict accident severity and probability. For example, instead of VEHCOUNT, traffic density per location would be an attribute that we could know beforehand and therefore a meaningful attribute for the model. The same applies to the other attributes. The number of people travelling in the car would be useful as well, since you know how many people you are taking in your car before the trip. I will not go through the explanation of every single attribute; the main idea is to select attributes that are known and can be quantified before the accident happens and that directly contribute to the final result (accident or not, and its severity). Data that is recorded as a result of the accident is not useful for predictions.
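
The selection rule above can be sketched as a small helper. This is only an illustration: `PRE_ACCIDENT`, `POST_ACCIDENT` and `usable_features` are hypothetical names, and the two lists simply repeat the attributes named in this section.

```python
# Attributes assumed to be known before a trip (taken from the list above).
PRE_ACCIDENT = ["ADDRTYPE", "LOCATION", "JUNCTIONTYPE", "WEATHER",
                "ROADCOND", "LIGHTCOND", "INCDTTM"]

# Attributes that only exist once a collision has been recorded.
POST_ACCIDENT = ["PERSONCOUNT", "PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT",
                 "INJURIES", "SERIOUSINJURIES", "FATALITIES", "HITPARKEDCAR"]


def usable_features(columns):
    """Split column names into (kept, leaked).

    kept   -> columns that may be used as model inputs
    leaked -> columns recorded as a result of the accident
    """
    kept = [c for c in columns if c in PRE_ACCIDENT]
    leaked = [c for c in columns if c in POST_ACCIDENT]
    return kept, leaked


kept, leaked = usable_features(["WEATHER", "VEHCOUNT", "ROADCOND", "INJURIES"])
print("kept:", kept)
print("leaked:", leaked)
```

Columns that appear in neither list are ignored by this helper and would need to be classified by hand.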

Our label is SEVERITYCODE, which records the severity of each accident (in this dataset, code 1 for property damage and code 2 for injury). This is what we will try to predict.

We must also look at each column (attribute) of the dataset and determine whether data is skewed, biased or missing. We must correct these issues if our model is to be precise. We can also do statistical tests as an initial screening to determine correlations between different attributes and their respective labels.
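
A minimal sketch of such an initial screening follows. It assumes a tiny made-up frame (`toy`) in place of the real data, and it uses a chi-square test of independence from SciPy as one possible statistical test; the notebook itself does not run this test.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for the collisions data (values are illustrative only).
toy = pd.DataFrame({
    "ROADCOND": ["Wet", "Dry", "Dry", None, "Wet", "Dry", "Wet", "Dry"],
    "SEVERITYCODE": [2, 1, 1, 1, 2, 1, 2, 2],
})

# Missing values per column.
missing = toy.isna().sum()

# Class balance of the label, as fractions of all rows.
balance = toy["SEVERITYCODE"].value_counts(normalize=True)

# Chi-square test of independence between one attribute and the label.
# crosstab ignores rows where ROADCOND is missing.
table = pd.crosstab(toy["ROADCOND"], toy["SEVERITYCODE"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(missing)
print(balance)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With so few rows the p-value is not meaningful; on the real data the same three steps would be repeated for each candidate attribute.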

In [4]:
# Keep only the candidate attributes and the label, then inspect them for missing values and class imbalance
df = df_collisions[['ADDRTYPE','LOCATION','JUNCTIONTYPE','WEATHER','ROADCOND','LIGHTCOND','INCDTTM','SEVERITYCODE']]
df

Unnamed: 0,ADDRTYPE,LOCATION,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,INCDTTM,SEVERITYCODE
0,Intersection,5TH AVE NE AND NE 103RD ST,At Intersection (intersection related),Overcast,Wet,Daylight,3/27/2013 2:54:00 PM,2
1,Block,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,12/20/2006 6:55:00 PM,1
2,Block,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,Mid-Block (not related to intersection),Overcast,Dry,Daylight,11/18/2004 10:20:00 AM,1
3,Block,2ND AVE BETWEEN MARION ST AND MADISON ST,Mid-Block (not related to intersection),Clear,Dry,Daylight,3/29/2013 9:26:00 AM,1
4,Intersection,SWIFT AVE S AND SWIFT AV OFF RP,At Intersection (intersection related),Raining,Wet,Daylight,1/28/2004 8:04:00 AM,2
...,...,...,...,...,...,...,...,...
194668,Block,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST,Mid-Block (not related to intersection),Clear,Dry,Daylight,11/12/2018 8:12:00 AM,2
194669,Block,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST,Mid-Block (not related to intersection),Raining,Wet,Daylight,12/18/2018 9:14:00 AM,1
194670,Intersection,20TH AVE NE AND NE 75TH ST,At Intersection (intersection related),Clear,Dry,Daylight,1/19/2019 9:25:00 AM,2
194671,Intersection,GREENWOOD AVE N AND N 68TH ST,At Intersection (intersection related),Clear,Dry,Dusk,1/15/2019 4:48:00 PM,2


In [5]:
df['JUNCTIONTYPE'].value_counts()

Mid-Block (not related to intersection)              89800
At Intersection (intersection related)               62810
Mid-Block (but intersection related)                 22790
Driveway Junction                                    10671
At Intersection (but not related to intersection)     2098
Ramp Junction                                          166
Unknown                                                  9
Name: JUNCTIONTYPE, dtype: int64

In [6]:
# Drop LOCATION; reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning
df = df.drop(columns=['LOCATION'])


In [7]:
df

Unnamed: 0,ADDRTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,INCDTTM,SEVERITYCODE
0,Intersection,At Intersection (intersection related),Overcast,Wet,Daylight,3/27/2013 2:54:00 PM,2
1,Block,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,12/20/2006 6:55:00 PM,1
2,Block,Mid-Block (not related to intersection),Overcast,Dry,Daylight,11/18/2004 10:20:00 AM,1
3,Block,Mid-Block (not related to intersection),Clear,Dry,Daylight,3/29/2013 9:26:00 AM,1
4,Intersection,At Intersection (intersection related),Raining,Wet,Daylight,1/28/2004 8:04:00 AM,2
...,...,...,...,...,...,...,...
194668,Block,Mid-Block (not related to intersection),Clear,Dry,Daylight,11/12/2018 8:12:00 AM,2
194669,Block,Mid-Block (not related to intersection),Raining,Wet,Daylight,12/18/2018 9:14:00 AM,1
194670,Intersection,At Intersection (intersection related),Clear,Dry,Daylight,1/19/2019 9:25:00 AM,2
194671,Intersection,At Intersection (intersection related),Clear,Dry,Dusk,1/15/2019 4:48:00 PM,2


In [8]:
df['WEATHER'].value_counts()

Clear                       111135
Raining                      33145
Overcast                     27714
Unknown                      15091
Snowing                        907
Other                          832
Fog/Smog/Smoke                 569
Sleet/Hail/Freezing Rain       113
Blowing Sand/Dirt               56
Severe Crosswind                25
Partly Cloudy                    5
Name: WEATHER, dtype: int64

In [9]:
df['LIGHTCOND'].value_counts()

Daylight                    116137
Dark - Street Lights On      48507
Unknown                      13473
Dusk                          5902
Dawn                          2502
Dark - No Street Lights       1537
Dark - Street Lights Off      1199
Other                          235
Dark - Unknown Lighting         11
Name: LIGHTCOND, dtype: int64

In [10]:
df['ROADCOND'].value_counts()

Dry               124510
Wet                47474
Unknown            15078
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64

In [11]:
df['ADDRTYPE'].value_counts()

Block           126926
Intersection     65070
Alley              751
Name: ADDRTYPE, dtype: int64

In [12]:
df['SEVERITYCODE'].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
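
The two counts above show that the label is imbalanced. As a quick worked calculation (the counts are copied from the output above; the variable names are mine), this is the share of the majority class, which is also the accuracy a model would reach by always predicting code 1:

```python
# Class counts copied from the value_counts() output above.
counts = {1: 136485, 2: 58188}

total = sum(counts.values())
majority_share = max(counts.values()) / total
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total rows: {total}")
print(f"majority-class share (always-predict-1 accuracy): {majority_share:.3f}")
print(f"imbalance ratio (majority / minority): {imbalance_ratio:.2f}")
```

Any model trained on the unbalanced data should be compared against this baseline rather than against 50%.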

We can see from the date and time of the incidents that some entries contain the time of day whereas others contain only the day it occurred. I will delete the time-of-day data and concentrate on the dates.

In [13]:
# Keep only the date part (the text before the first space); date-only rows are left unchanged.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df = df.assign(INCDTTM=df["INCDTTM"].str.split(" ", n=1).str[0])


In [14]:
df['INCDTTM'].value_counts() ###We can see it worked

11/2/2006     96
10/3/2008     92
5/18/2005     84
11/5/2005     83
1/13/2006     83
              ..
5/18/2020      2
5/20/2020      1
5/19/2020      1
5/17/2020      1
12/25/2015     1
Name: INCDTTM, Length: 5985, dtype: int64

In [15]:
# Keep only rows that have an incident date
df = df[df['INCDTTM'].notna()].copy()

In [16]:
# Ideally the dates would be split into three categories: holidays, work days and weekends.
# Days immediately after a big holiday would also be flagged, especially the evening after
# big national or local events (people usually drive home after the celebration).
#
# The time data could likewise flag whether the accident occurred at night, during rush hour
# or outside both: rush hour means more traffic; night means less visibility and a higher
# chance of user error due to tiredness or alcohol.
# Due to time constraints I only use the dates and do not create these extra categories
# (although that would be best practice).
df['INCDTTM'] = pd.to_datetime(df['INCDTTM'], format='%m/%d/%Y', errors='coerce')
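
The time-of-day categories described in the comments above could be derived roughly as follows. This is a sketch only and is not used in the rest of the notebook: the hour boundaries for 'Night' and 'Rush hour' are assumptions chosen for illustration, `time_bucket` is a hypothetical helper, and `raw` is a hand-made sample in the same format as INCDTTM.

```python
import pandas as pd

# Hand-made sample in the INCDTTM format; the third entry has no time component.
raw = pd.Series([
    "3/27/2013 2:54:00 PM",
    "12/20/2006 6:55:00 PM",
    "11/2/2006",
    "1/5/2010 11:30:00 PM",
])

# Date part: text before the first space, parsed as month/day/year.
dates = pd.to_datetime(raw.str.split(" ", n=1).str[0],
                       format="%m/%d/%Y", errors="coerce")

# Full timestamp: date-only rows do not match the format and become NaT.
stamps = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
hour = stamps.dt.hour  # NaN where the time is unknown


def time_bucket(h):
    """Map an hour of day to a coarse category (boundaries are assumptions)."""
    if pd.isna(h):
        return "Unknown"
    if h >= 22 or h < 5:
        return "Night"
    if 7 <= h < 10 or 16 <= h < 19:
        return "Rush hour"
    return "Other"


buckets = hour.apply(time_bucket)
print(pd.DataFrame({"date": dates, "hour": hour, "bucket": buckets}))
```

Keeping an explicit 'Unknown' bucket means the date-only rows stay in the data instead of being dropped.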

In [17]:
df['Day of Week'] = df['INCDTTM'].dt.dayofweek  # Monday=0 ... Sunday=6

# Note: x > 3 flags Friday (4), Saturday (5) and Sunday (6) as 'Weekend'
df['Weekend'] = df['Day of Week'].apply(lambda x: 1 if x > 3 else 0)
df

Unnamed: 0,ADDRTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,INCDTTM,SEVERITYCODE,Day of Week,Weekend
0,Intersection,At Intersection (intersection related),Overcast,Wet,Daylight,2013-03-27,2,2,0
1,Block,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,2006-12-20,1,2,0
2,Block,Mid-Block (not related to intersection),Overcast,Dry,Daylight,2004-11-18,1,3,0
3,Block,Mid-Block (not related to intersection),Clear,Dry,Daylight,2013-03-29,1,4,1
4,Intersection,At Intersection (intersection related),Raining,Wet,Daylight,2004-01-28,2,2,0
...,...,...,...,...,...,...,...,...,...
194668,Block,Mid-Block (not related to intersection),Clear,Dry,Daylight,2018-11-12,2,0,0
194669,Block,Mid-Block (not related to intersection),Raining,Wet,Daylight,2018-12-18,1,1,0
194670,Intersection,At Intersection (intersection related),Clear,Dry,Daylight,2019-01-19,2,5,1
194671,Intersection,At Intersection (intersection related),Clear,Dry,Dusk,2019-01-15,2,1,0


In [25]:
us_holidays = holidays.UnitedStates()

# New column: 1 if the collision date is a US holiday, 0 otherwise.
# Building the list row by row keeps it aligned with df regardless of the index.
df['Holidays'] = [1 if d in us_holidays else 0 for d in df['INCDTTM']]
df

Unnamed: 0,ADDRTYPE,JUNCTIONTYPE,WEATHER,ROADCOND,LIGHTCOND,INCDTTM,SEVERITYCODE,Day of Week,Weekend,Holidays
0,Intersection,At Intersection (intersection related),Overcast,Wet,Daylight,2013-03-27,2,2,0,0
1,Block,Mid-Block (not related to intersection),Raining,Wet,Dark - Street Lights On,2006-12-20,1,2,0,0
2,Block,Mid-Block (not related to intersection),Overcast,Dry,Daylight,2004-11-18,1,3,0,0
3,Block,Mid-Block (not related to intersection),Clear,Dry,Daylight,2013-03-29,1,4,1,0
4,Intersection,At Intersection (intersection related),Raining,Wet,Daylight,2004-01-28,2,2,0,0
...,...,...,...,...,...,...,...,...,...,...
194668,Block,Mid-Block (not related to intersection),Clear,Dry,Daylight,2018-11-12,2,0,0,1
194669,Block,Mid-Block (not related to intersection),Raining,Wet,Daylight,2018-12-18,1,1,0,0
194670,Intersection,At Intersection (intersection related),Clear,Dry,Daylight,2019-01-19,2,5,1,0
194671,Intersection,At Intersection (intersection related),Clear,Dry,Dusk,2019-01-15,2,1,0,0


In [26]:
# One-hot encode the categorical attributes
df = pd.get_dummies(df, columns=['ADDRTYPE', 'JUNCTIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND'])
# Make the label binary: 0 = property damage (code 1), 1 = injury (code 2)
df['SEVERITYCODE'] = df['SEVERITYCODE'].map({1: 0, 2: 1})

In [28]:
# The raw date is no longer needed; errors='ignore' makes the cell safe to re-run
df = df.drop(columns=['INCDTTM'], errors='ignore')

### Data Preprocessing

In [29]:
# Balance the classes before splitting by downsampling the majority class
df_maj = df[df.SEVERITYCODE == 0]
df_min = df[df.SEVERITYCODE == 1]

df_resampled = resample(df_maj, replace=False, n_samples=len(df_min), random_state=4)
df_final = pd.concat([df_resampled, df_min])
df_final.SEVERITYCODE.value_counts()  # The classes are now balanced; the downside is that we have lost data in the process

1    58188
0    58188
Name: SEVERITYCODE, dtype: int64
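
An alternative to balancing before the split is to split first and downsample only the training portion, so that the test set keeps the original class mix. The sketch below shows that variant on synthetic data; it is not the approach used in this notebook, and the `toy` frame with its 70/30 class mix is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in: 700 rows of class 0, 300 rows of class 1.
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "SEVERITYCODE": np.r_[np.zeros(700, dtype=int), np.ones(300, dtype=int)],
})

# Split first, stratified, so the test set keeps the original class mix.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["SEVERITYCODE"], random_state=4
)

# Downsample the majority class in the training part only.
train_maj = train[train["SEVERITYCODE"] == 0]
train_min = train[train["SEVERITYCODE"] == 1]
train_maj_down = resample(
    train_maj, replace=False, n_samples=len(train_min), random_state=4
)
train_balanced = pd.concat([train_maj_down, train_min])

print(train_balanced["SEVERITYCODE"].value_counts())
print(test["SEVERITYCODE"].value_counts(normalize=True))
```

Scores measured on such a test set reflect the class mix the model would meet in practice, at the cost of no longer being directly comparable with scores measured on a balanced test set.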

In [30]:
df_final

Unnamed: 0,SEVERITYCODE,Day of Week,Weekend,Holidays,ADDRTYPE_Alley,ADDRTYPE_Block,ADDRTYPE_Intersection,JUNCTIONTYPE_At Intersection (but not related to intersection),JUNCTIONTYPE_At Intersection (intersection related),JUNCTIONTYPE_Driveway Junction,...,ROADCOND_Wet,LIGHTCOND_Dark - No Street Lights,LIGHTCOND_Dark - Street Lights Off,LIGHTCOND_Dark - Street Lights On,LIGHTCOND_Dark - Unknown Lighting,LIGHTCOND_Dawn,LIGHTCOND_Daylight,LIGHTCOND_Dusk,LIGHTCOND_Other,LIGHTCOND_Unknown
149536,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
96165,0,2,0,0,0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
86662,0,6,1,0,0,1,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
174854,0,4,1,0,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
98840,0,3,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194663,1,2,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
194666,1,4,1,0,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
194668,1,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
194670,1,5,1,0,0,0,1,0,1,0,...,0,0,0,0,0,0,1,0,0,0


In [31]:
df_final.dtypes

SEVERITYCODE                                                      int64
Day of Week                                                       int64
Weekend                                                           int64
Holidays                                                          int64
ADDRTYPE_Alley                                                    uint8
ADDRTYPE_Block                                                    uint8
ADDRTYPE_Intersection                                             uint8
JUNCTIONTYPE_At Intersection (but not related to intersection)    uint8
JUNCTIONTYPE_At Intersection (intersection related)               uint8
JUNCTIONTYPE_Driveway Junction                                    uint8
JUNCTIONTYPE_Mid-Block (but intersection related)                 uint8
JUNCTIONTYPE_Mid-Block (not related to intersection)              uint8
JUNCTIONTYPE_Ramp Junction                                        uint8
JUNCTIONTYPE_Unknown                                            

In [32]:
# Train/test split
df_final.drop(['Weekend'], axis=1, inplace=True)

# Select all columns except SEVERITYCODE, which is our label
x = df_final.loc[:, df_final.columns != 'SEVERITYCODE']
y = df_final['SEVERITYCODE']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)


In [33]:
df_final

Unnamed: 0,SEVERITYCODE,Day of Week,Holidays,ADDRTYPE_Alley,ADDRTYPE_Block,ADDRTYPE_Intersection,JUNCTIONTYPE_At Intersection (but not related to intersection),JUNCTIONTYPE_At Intersection (intersection related),JUNCTIONTYPE_Driveway Junction,JUNCTIONTYPE_Mid-Block (but intersection related),...,ROADCOND_Wet,LIGHTCOND_Dark - No Street Lights,LIGHTCOND_Dark - Street Lights Off,LIGHTCOND_Dark - Street Lights On,LIGHTCOND_Dark - Unknown Lighting,LIGHTCOND_Dawn,LIGHTCOND_Daylight,LIGHTCOND_Dusk,LIGHTCOND_Other,LIGHTCOND_Unknown
149536,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
96165,0,2,0,0,1,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
86662,0,6,0,0,1,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
174854,0,4,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
98840,0,3,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194663,1,2,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
194666,1,4,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
194668,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
194670,1,5,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0


In [34]:
param_grid = [{'C' : [0.01, 0.1, 1, 10, 100], 'solver' : ['liblinear','newton-cg','saga', 'sag','lbfgs']}]
lr_CV = GridSearchCV(LogisticRegression(), param_grid, cv=5, verbose=True, n_jobs=-1)
lr_CV.fit(x_train,y_train)
lr_CV_pred = lr_CV.predict(x_test)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed:   38.6s finished


In [35]:
print(lr_CV_pred[0:999])
print(lr_CV.best_estimator_)
print(lr_CV.best_score_)
print(lr_CV.best_params_)

[0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0
 1 1 1 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 1 0 0 0
 1 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 1 0 0 1 1 1 0 1 1 0 0 1 1 0 0 0 1 1
 1 0 1 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1
 1 0 0 1 0 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1
 0 1 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0
 1 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1
 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0
 1 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1
 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 1 0
 0 1 0 0 0 1 1 0 0 0 1 0 0 1 1 1 0 1 0 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1
 1 0 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0
 1 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0
 1 0 1 1 0 0 0 1 1 1 0 0 

In [36]:
lr = LogisticRegression(C=0.01, solver='newton-cg')
lr.fit(x_train, y_train)
lr_pred = lr.predict(x_test)
print(lr.score(x_train, y_train))  # accuracy on the training set
jaccard_score(y_test, lr_pred)     # Jaccard index on the test set

0.6135445757250269


0.4498407643312102
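
The two numbers above are the accuracy on the training set and the Jaccard index on the test set. A fuller picture could come from `classification_report` (imported at the top of this notebook but not used) together with a confusion matrix. The sketch below runs on synthetic data, so its numbers say nothing about the collision model; `X`, `y` and the class names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary problem: the label depends on the first feature plus noise.
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

model = LogisticRegression(C=0.01, solver="newton-cg").fit(X_train, y_train)
pred = model.predict(X_test)

# Compare training and test accuracy to spot over- or under-fitting.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred)

# Per-class precision, recall and F1.
report = classification_report(
    y_test, pred, target_names=["property damage", "injury"]
)

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(cm)
print(report)
```

In this notebook the same calls would take `y_test` and the predictions on `x_test` as arguments.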

In [None]:
## NOW WITH SVM
from sklearn import svm

param_grid = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['rbf','linear','poly','sigmoid']}  

svmCV = GridSearchCV(svm.SVC(), param_grid, cv=5, verbose=True, n_jobs=-1)
svmCV.fit(x_train, y_train)
svmCV_pred = svmCV.predict(x_test)

In [None]:
print(svmCV_pred)
print(svmCV.best_estimator_)
print(svmCV.best_score_)
print(svmCV.best_params_)
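
Before launching the SVM search it may help to count how many fits it implies. The sketch below uses scikit-learn's `ParameterGrid` to do that for the grid above, and for a variant that gives the linear kernel its own sub-grid, since `gamma` only applies to the 'rbf', 'poly' and 'sigmoid' kernels. The names `full_grid` and `split_grid` are mine; the variant is a suggestion, not what the cell above runs.

```python
from sklearn.model_selection import ParameterGrid

# The grid used in the SVM cell above.
full_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "linear", "poly", "sigmoid"],
}

# Variant: gamma has no effect for the linear kernel, so that kernel
# gets a sub-grid without it and is not refitted once per gamma value.
split_grid = [
    {"kernel": ["linear"], "C": full_grid["C"]},
    {"kernel": ["rbf", "poly", "sigmoid"],
     "C": full_grid["C"], "gamma": full_grid["gamma"]},
]

cv = 5  # number of cross-validation folds, as in the cell above
n_full = len(ParameterGrid(full_grid)) * cv
n_split = len(ParameterGrid(split_grid)) * cv

print(f"full grid:  {n_full} fits")
print(f"split grid: {n_split} fits")
```

`GridSearchCV` accepts a list of dictionaries like `split_grid` directly as its `param_grid`.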