# Dataset
This data was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks respondents whether they would accept the coupon if they were the driver. For more information about the dataset, please refer to the paper:
Wang, Tong, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. 'A bayesian framework for learning rule sets for interpretable classification.' The Journal of Machine Learning Research 18, no. 1 (2017): 2357-2393.


#### Attribute Information:

* destination: No Urgent Place, Home, Work
* passanger: Alone, Friend(s), Kid(s), Partner (who are the passengers in the car)
* weather: Sunny, Rainy, Snowy
* temperature:55, 80, 30
* time: 2PM, 10AM, 6PM, 7AM, 10PM
* coupon: Restaurant(<$\$$20), Coffee House, Carry out & Take away, Bar, Restaurant($\$$20-$\$$50)
* expiration: 1d, 2h (the coupon expires in 1 day or in 2 hours)
* gender: Female, Male
* age: 21, 46, 26, 31, 41, 50plus, 36, below21
* maritalStatus: Unmarried partner, Single, Married partner, Divorced, Widowed
* has_children: 1, 0
* education: Some college - no degree, Bachelors degree, Associates degree, High School Graduate, Graduate degree (Masters or Doctorate), Some High School
* occupation: Unemployed, Architecture & Engineering, Student, Education&Training&Library, Healthcare Support, Healthcare Practitioners & Technical, Sales & Related, Management, Arts Design Entertainment Sports & Media, Computer & Mathematical, Life Physical Social Science, Personal Care & Service, Community & Social Services, Office & Administrative Support, Construction & Extraction, Legal, Retired, Installation Maintenance & Repair, Transportation & Material Moving, Business & Financial, Protective Service, Food Preparation & Serving Related, Production Occupations, Building & Grounds Cleaning & Maintenance, Farming Fishing & Forestry
* income: $\$$37500 - $\$$49999, $\$$62500 - $\$$74999, $\$$12500 - $\$$24999, $\$$75000 - $\$$87499, $\$$50000 - $\$$62499, $\$$25000 - $\$$37499, $\$$100000 or More, $\$$87500 - $\$$99999, Less than $\$$12500
* Bar: never, less1, 1\~3, gt8, nan, 4\~8 (feature meaning: how many times do you go to a bar every month?)
* CoffeeHouse: never, less1, 4\~8, 1\~3, gt8, nan (feature meaning: how many times do you go to a coffeehouse every month?)
* CarryAway: 4\~8, 1\~3, gt8, less1, never (feature meaning: how many times do you get take-away food every month?)
* RestaurantLessThan20: 4\~8, 1\~3, less1, gt8, never (feature meaning: how many times do you go to a restaurant with an average expense per person of less than $\$$20 every month?)
* Restaurant20To50: 1\~3, less1, never, gt8, 4\~8, nan (feature meaning: how many times do you go to a restaurant with average expense per person of $\$$20 - $\$$50 every month?)
* toCoupon_GEQ15min:0,1 (feature meaning: driving distance to the restaurant/bar for using the coupon is greater than 15 minutes)
* toCoupon_GEQ25min:0, 1 (feature meaning: driving distance to the restaurant/bar for using the coupon is greater than 25 minutes)
* direction_same:0, 1 (feature meaning: whether the restaurant/bar is in the same direction as your current destination)
* direction_opp: 1, 0 (feature meaning: whether the restaurant/bar is in the opposite direction from your current destination)
* Y:1, 0 (whether the coupon is accepted)

# Preprocessing ``train.csv`` 

In [359]:
import pandas as pd
import numpy as np

In [360]:
train = pd.read_csv("train.csv")
train.shape


(2378, 26)

First we find how many null values are in the dataset. 

In [361]:
train.isna().sum()

destination                0
passanger                  0
weather                   47
temperature               34
time                       0
coupon                     0
expiration                 0
gender                    49
age                       52
maritalStatus              0
has_children              46
education                  0
occupation                 0
income                     0
car                     2362
Bar                       20
CoffeeHouse               40
CarryAway                 23
RestaurantLessThan20      26
Restaurant20To50          27
toCoupon_GEQ5min           0
toCoupon_GEQ15min          0
toCoupon_GEQ25min          0
direction_same             0
direction_opp              0
Y                          0
dtype: int64

The variable 'car' is missing in 2362 of 2378 rows (about 99%). Therefore I am dropping the column.

In [362]:
train.drop('car',axis=1,inplace=True)
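
The same decision can be expressed as a rule rather than by reading the counts. The following is a minimal sketch on a toy frame; the data, the name `toy` and the 0.5 threshold are assumptions for illustration, not values from the notebook.

```python
import pandas as pd

# Toy frame standing in for the survey data (hypothetical values).
toy = pd.DataFrame({
    "weather": ["Sunny", None, "Rainy", "Sunny"],
    "car": [None, None, None, "Scooter"],
    "Y": [1, 0, 1, 0],
})

# Fraction of missing values per column (isna() gives booleans, mean() averages them).
missing_frac = toy.isna().mean()

# Drop every column whose missing fraction exceeds the assumed threshold.
threshold = 0.5
to_drop = missing_frac[missing_frac > threshold].index.tolist()
toy_clean = toy.drop(columns=to_drop)

print(missing_frac)
print("dropped:", to_drop)
```

Working with fractions instead of raw counts keeps the rule usable on frames of different sizes, such as the test set.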

In [363]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2378 entries, 0 to 2377
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   destination           2378 non-null   object 
 1   passanger             2378 non-null   object 
 2   weather               2331 non-null   object 
 3   temperature           2344 non-null   float64
 4   time                  2378 non-null   object 
 5   coupon                2378 non-null   object 
 6   expiration            2378 non-null   object 
 7   gender                2329 non-null   object 
 8   age                   2326 non-null   object 
 9   maritalStatus         2378 non-null   object 
 10  has_children          2332 non-null   float64
 11  education             2378 non-null   object 
 12  occupation            2378 non-null   object 
 13  income                2378 non-null   object 
 14  Bar                   2358 non-null   object 
 15  CoffeeHouse          

In [364]:
train.income.value_counts(dropna=False)

$25000 - $37499     379
$100000 or More     344
$12500 - $24999     341
$37500 - $49999     299
$50000 - $62499     287
Less than $12500    221
$62500 - $74999     177
$75000 - $87499     174
$87500 - $99999     156
Name: income, dtype: int64

In [365]:
train.direction_opp.value_counts(dropna=False)

1    1845
0     533
Name: direction_opp, dtype: int64

In [366]:
train.temperature.value_counts(dropna=False)

80.0     1138
55.0      711
30.0      436
999.0      59
NaN        34
Name: temperature, dtype: int64

One of the categories is 999, which makes no practical sense as a temperature. I therefore treat it as a null value; it is later replaced, together with the other nulls, by the most frequently occurring category, 80.0.

In [367]:
train.temperature = train.temperature.replace(999,np.nan)
train.temperature.value_counts(dropna=False)

80.0    1138
55.0     711
30.0     436
NaN       93
Name: temperature, dtype: int64

In [368]:
train.gender.value_counts(dropna=False)

Female    1214
Male      1115
NaN         49
Name: gender, dtype: int64

In [369]:
train.age.value_counts(dropna=False)        # 8 categories

26         486
21         477
31         378
50plus     329
36         222
41         212
46         128
below21     94
NaN         52
Name: age, dtype: int64

In [370]:
train.has_children.value_counts(dropna=False) 

0.0    1353
1.0     979
NaN      46
Name: has_children, dtype: int64

In [371]:
train.Bar.value_counts(dropna=False)      # 5 categories

never    955
less1    645
1~3      478
4~8      217
gt8       63
NaN       20
Name: Bar, dtype: int64

In [372]:
train.CoffeeHouse.value_counts(dropna=False)     # 5 categories

less1    629
1~3      599
never    562
4~8      339
gt8      209
NaN       40
Name: CoffeeHouse, dtype: int64

In [373]:
train.CarryAway.value_counts(dropna=False)     # 5 categories

1~3      875
4~8      797
less1    343
gt8      313
never     27
NaN       23
Name: CarryAway, dtype: int64

In [374]:
train.RestaurantLessThan20.value_counts(dropna=False)     # 5 categories

1~3      1032
4~8       653
less1     382
gt8       248
never      37
NaN        26
Name: RestaurantLessThan20, dtype: int64

In [375]:
train.Restaurant20To50.value_counts(dropna=False)     # 5 categories


less1    1129
1~3       636
never     404
4~8       129
gt8        53
NaN        27
Name: Restaurant20To50, dtype: int64

I will now replace the null values in every affected column with that column's most frequently occurring value, since all of these columns are categorical in nature.

In [376]:
# replace np.nan with the most frequently occurring value of each column

impute_cols = ['weather','temperature', 'gender', 'age', 'has_children', 'Bar', 'CoffeeHouse', 'CarryAway','RestaurantLessThan20','Restaurant20To50']
for col in impute_cols:
    mode = train[col].mode()[0]
    # assign back instead of an in-place replace on a column selection
    train[col] = train[col].fillna(mode)

The column 'has_children' is being converted to an integer type. 

In [377]:
# change the data type of has_children from float to int
train.has_children = train.has_children.astype('int64')
train.has_children.value_counts()

0    1399
1     979
Name: has_children, dtype: int64

## Data Transformation

In [378]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2378 entries, 0 to 2377
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   destination           2378 non-null   object 
 1   passanger             2378 non-null   object 
 2   weather               2378 non-null   object 
 3   temperature           2378 non-null   float64
 4   time                  2378 non-null   object 
 5   coupon                2378 non-null   object 
 6   expiration            2378 non-null   object 
 7   gender                2378 non-null   object 
 8   age                   2378 non-null   object 
 9   maritalStatus         2378 non-null   object 
 10  has_children          2378 non-null   int64  
 11  education             2378 non-null   object 
 12  occupation            2378 non-null   object 
 13  income                2378 non-null   object 
 14  Bar                   2378 non-null   object 
 15  CoffeeHouse          

In [379]:
train.Restaurant20To50.value_counts(dropna=False)

less1    1156
1~3       636
never     404
4~8       129
gt8        53
Name: Restaurant20To50, dtype: int64

In [380]:
train.toCoupon_GEQ5min.value_counts(dropna=False)     # only one category. It can be dropped

1    2378
Name: toCoupon_GEQ5min, dtype: int64

The column 'toCoupon_GEQ5min' has only one category and therefore adds no information to the data. Hence it will be dropped.

In [381]:
print(train.occupation.value_counts(dropna=False))   # too many categories 

Unemployed                                   363
Student                                      297
Computer & Mathematical                      270
Sales & Related                              216
Education&Training&Library                   172
Management                                   161
Arts Design Entertainment Sports & Media     116
Office & Administrative Support              109
Business & Financial                         103
Retired                                      100
Food Preparation & Serving Related            58
Healthcare Support                            44
Healthcare Practitioners & Technical          44
Community & Social Services                   43
Legal                                         38
Transportation & Material Moving              37
Architecture & Engineering                    36
Construction & Extraction                     29
Personal Care & Service                       28
Life Physical Social Science                  27
Protective Service  

The column 'occupation' has 25 categories, which is too many to encode compactly. Hence it will be dropped.

In [382]:
print(train.direction_same.value_counts(dropna=False))
print(train.direction_opp.value_counts(dropna=False)) 

0    1845
1     533
Name: direction_same, dtype: int64
1    1845
0     533
Name: direction_opp, dtype: int64


In [383]:
# The value counts of 'direction_same' and 'direction_opp' mirror each other, so I count the rows in which the two columns are equal.

(train.direction_opp == train.direction_same).sum()

0

The two columns are never equal in any row, so 'direction_opp' is the exact complement of 'direction_same' and carries no additional information. Therefore I am dropping the variable 'direction_opp'.
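
For two binary columns, "never equal" and "always sum to 1" are the same statement. A small sketch on hypothetical toy data shows both checks side by side:

```python
import pandas as pd

# Toy stand-in for the two direction columns (hypothetical values).
toy = pd.DataFrame({
    "direction_same": [0, 1, 0, 0],
    "direction_opp":  [1, 0, 1, 1],
})

# Check 1: no row has equal values in both columns.
never_equal = bool((toy["direction_same"] == toy["direction_opp"]).sum() == 0)

# Check 2: the two 0/1 columns add up to 1 in every row.
is_complement = bool((toy["direction_same"] + toy["direction_opp"] == 1).all())

print(never_equal, is_complement)
```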

In [384]:
# Drop occupation and direction_opp, toCoupon_GEQ5min

train.drop(['occupation','direction_opp', 'toCoupon_GEQ5min'], axis=1,inplace=True)


We perform one-hot encoding for the following multiclass categorical variables, which cannot be ordered in any meaningful way:

In [385]:
#one hot encoding

for col2 in ['destination','passanger','weather', 'coupon', 'maritalStatus']:
    cols = pd.get_dummies(train[col2], prefix= col2)
    train[cols.columns] = cols
    train.drop(col2, axis = 1, inplace = True)

In [386]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2378 entries, 0 to 2377
Data columns (total 37 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   temperature                      2378 non-null   float64
 1   time                             2378 non-null   object 
 2   expiration                       2378 non-null   object 
 3   gender                           2378 non-null   object 
 4   age                              2378 non-null   object 
 5   has_children                     2378 non-null   int64  
 6   education                        2378 non-null   object 
 7   income                           2378 non-null   object 
 8   Bar                              2378 non-null   object 
 9   CoffeeHouse                      2378 non-null   object 
 10  CarryAway                        2378 non-null   object 
 11  RestaurantLessThan20             2378 non-null   object 
 12  Restaurant20To50    

I will now map all the columns with ordinal categories to numerical values and transform the existing columns with the mapping.

The columns 'Bar', 'CoffeeHouse','CarryAway','RestaurantLessThan20' and 'Restaurant20To50' share the same categories and can be transformed using the same mapping

In [387]:
# Ordinal Mapping
Mapper = {"never":0,"less1":1,"1~3":2,"4~8":3,"gt8":4}
train= train.replace({'Bar': Mapper})
train= train.replace({'CoffeeHouse': Mapper})
train= train.replace({'CarryAway': Mapper})
train= train.replace({'RestaurantLessThan20': Mapper})
train= train.replace({'Restaurant20To50': Mapper})
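
The five frequency columns can also be mapped in a loop. The sketch below is one possible variant on toy data, using `Series.map` instead of `replace`; the names `freq_order`, `freq_cols` and `toy` are illustrative assumptions. Because `map` turns any label that is missing from the dictionary into NaN, a misspelled category shows up immediately.

```python
import pandas as pd

# Same ordering as the notebook's mapping, under an illustrative name.
freq_order = {"never": 0, "less1": 1, "1~3": 2, "4~8": 3, "gt8": 4}
freq_cols = ["Bar", "CoffeeHouse"]

# Toy frame with two of the frequency columns (hypothetical values).
toy = pd.DataFrame({
    "Bar": ["never", "1~3", "gt8"],
    "CoffeeHouse": ["less1", "4~8", "never"],
})

for col in freq_cols:
    mapped = toy[col].map(freq_order)
    # Any label not present in freq_order would have become NaN here.
    if mapped.isna().any():
        raise ValueError(f"unmapped label in column {col}")
    toy[col] = mapped.astype("int64")

print(toy)
```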

In [388]:
train.income.value_counts()

$25000 - $37499     379
$100000 or More     344
$12500 - $24999     341
$37500 - $49999     299
$50000 - $62499     287
Less than $12500    221
$62500 - $74999     177
$75000 - $87499     174
$87500 - $99999     156
Name: income, dtype: int64

The temperature column takes only three values, so it can be treated as an ordinal categorical variable. <br>
Time of day has a natural order, so I map time ordinally. <br>
Age can be mapped ordinally in ascending order of the age groups.<br>
Income intervals can also be mapped ordinally in ascending order of the income ranges.

In [389]:
# temperature is float64, so the keys must be floats; Mapper2 (not Mapper) is the mapping to apply
Mapper2= {30.0: 0, 55.0: 1, 80.0: 2}
train= train.replace({'temperature': Mapper2})

Mapper3= {'7AM':0, '10AM': 1, '2PM': 2 , '6PM': 3, '10PM': 4}
train= train.replace({'time': Mapper3})

Mapper4= {'below21':0, '21': 1, '26': 2 , '31': 3, '36': 4, '41': 5, '46': 6, '50plus': 7}
train= train.replace({'age': Mapper4})

Mapper5= {'Less than $12500':0, '$12500 - $24999': 1, '$25000 - $37499': 2 , '$37500 - $49999': 3, '$50000 - $62499': 4, '$62500 - $74999': 5, '$75000 - $87499': 6, '$87500 - $99999': 7, '$100000 or More': 8}
train= train.replace({'income': Mapper5})
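
A mapping only matches values of the same type as its keys, which matters for a float64 column such as temperature. The following sketch on a toy Series (names and values are illustrative assumptions) uses `Series.map` to make a mismatch visible: unmatched values become NaN.

```python
import pandas as pd

# Toy float column standing in for temperature (hypothetical values).
temps = pd.Series([30.0, 55.0, 80.0, 55.0], name="temperature")

string_keys = {"30": 0, "55.0": 1, "80.0": 2}   # keys are strings
float_keys = {30.0: 0, 55.0: 1, 80.0: 2}        # keys are floats

unmatched = temps.map(string_keys)   # no float value equals a string key
mapped = temps.map(float_keys)       # every value finds its key

print(unmatched.isna().all())
print(mapped.tolist())
```

Since 30, 55 and 80 are evenly spaced, the ordinal codes 0, 1, 2 and the raw values lead to the same numbers once a min-max scaler is applied.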


In [390]:
train.expiration.value_counts()

1d    1329
2h    1049
Name: expiration, dtype: int64

'1d' is the more frequent value, so it is mapped to 1 and '2h' to 0.

In [391]:
train['expiration'] = train['expiration'].map({'1d': 1, '2h': 0})

In [392]:
train.gender.value_counts()

Female    1263
Male      1115
Name: gender, dtype: int64

'Female' is the more frequent value, so it is mapped to 1 and 'Male' to 0.

In [393]:
train['gender'] = train['gender'].map({'Female': 1, 'Male': 0})

In [394]:
train.education.value_counts()

Bachelors degree                          805
Some college - no degree                  785
Graduate degree (Masters or Doctorate)    373
Associates degree                         220
High School Graduate                      177
Some High School                           18
Name: education, dtype: int64

For the variable education, I bin the six levels into four ordered groups and map them to integers, which reduces the number of levels the models have to handle.

In [395]:
Mapper6= {'Some High School':0, 'High School Graduate': 0, 'Some college - no degree': 1, 'Associates degree': 2, 'Bachelors degree': 2, 'Graduate degree (Masters or Doctorate)': 3}
train= train.replace({'education': Mapper6})


In [396]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2378 entries, 0 to 2377
Data columns (total 37 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   temperature                      2378 non-null   float64
 1   time                             2378 non-null   int64  
 2   expiration                       2378 non-null   int64  
 3   gender                           2378 non-null   int64  
 4   age                              2378 non-null   int64  
 5   has_children                     2378 non-null   int64  
 6   education                        2378 non-null   int64  
 7   income                           2378 non-null   int64  
 8   Bar                              2378 non-null   int64  
 9   CoffeeHouse                      2378 non-null   int64  
 10  CarryAway                        2378 non-null   int64  
 11  RestaurantLessThan20             2378 non-null   int64  
 12  Restaurant20To50    

There does not seem to be any data leakage, as none of the columns directly encodes the target variable.

# Preprocessing ``test.csv`` 

The final test dataset is imported from test.csv.

In [397]:
test = pd.read_csv("test.csv")
test.shape

(793, 25)

In [398]:
test.isna().sum()

destination               0
passanger                 0
weather                  14
temperature              13
time                      0
coupon                    0
expiration                0
gender                   14
age                      12
maritalStatus             0
has_children             16
education                 0
occupation                0
income                    0
car                     784
Bar                       9
CoffeeHouse              16
CarryAway                13
RestaurantLessThan20      8
Restaurant20To50         13
toCoupon_GEQ5min          0
toCoupon_GEQ15min         0
toCoupon_GEQ25min         0
direction_same            0
direction_opp             0
dtype: int64

The same operations that were applied to the train dataset are applied to the test dataset as well.

In [399]:
test.drop('car',axis=1,inplace=True)

In [400]:
test.temperature.value_counts(dropna=False)
test.temperature = test.temperature.replace(999,np.nan)
test.temperature.value_counts(dropna=False)

80.0    390
55.0    223
30.0    144
NaN      36
Name: temperature, dtype: int64

In [401]:
impute_cols = ['weather','temperature', 'gender', 'age', 'has_children', 'Bar', 'CoffeeHouse', 'CarryAway','RestaurantLessThan20','Restaurant20To50']
for col in impute_cols:
    mode = test[col].mode()[0]
    # assign back instead of an in-place replace on a column selection
    test[col] = test[col].fillna(mode)
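
The cell above fills the test set with the test set's own modes. A common alternative, which is not what this notebook does, is to learn the fill values on the training data and reuse them for the test data, so both sets are imputed identically. A minimal sketch on toy frames (all names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy train and test frames (hypothetical values).
toy_train = pd.DataFrame({"weather": ["Sunny", "Sunny", "Rainy", np.nan]})
toy_test = pd.DataFrame({"weather": ["Rainy", np.nan, "Rainy"]})

impute_cols = ["weather"]

# Learn one fill value per column from the training data only.
train_modes = {col: toy_train[col].mode()[0] for col in impute_cols}

# Apply the same fill values to both frames.
toy_train = toy_train.fillna(train_modes)
toy_test = toy_test.fillna(train_modes)

print(train_modes)
print(toy_test["weather"].tolist())
```

In this toy case the test mode would be 'Rainy', while the training mode 'Sunny' is what gets filled in.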

In [402]:
# change the data type of has_children from float to int
test.has_children = test.has_children.astype('int64')
test.has_children.value_counts()

0    453
1    340
Name: has_children, dtype: int64

In [403]:
# Drop occupation and direction_opp, toCoupon_GEQ5min
test.drop(['occupation','direction_opp', 'toCoupon_GEQ5min'], axis=1,inplace=True)

#one hot encoding

for col2 in ['destination','passanger','weather', 'coupon', 'maritalStatus']:
    cols = pd.get_dummies(test[col2], prefix= col2)
    test[cols.columns] = cols
    test.drop(col2, axis = 1, inplace = True)
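
Encoding train and test separately only yields the same columns when every category occurs in both files. If that cannot be guaranteed, the test frame can be reindexed to the training columns. The sketch below uses toy frames; the names and values are assumptions for illustration.

```python
import pandas as pd

# Toy frames: 'Rainy' and 'Snowy' never occur in the toy test set.
toy_train = pd.DataFrame({"weather": ["Sunny", "Rainy", "Snowy"]})
toy_test = pd.DataFrame({"weather": ["Sunny", "Sunny"]})

train_enc = pd.get_dummies(toy_train, columns=["weather"])
test_enc = pd.get_dummies(toy_test, columns=["weather"])

# Align the test columns to the training columns; categories that are
# absent from the test data become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```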

In [404]:
# Ordinal Mapping
Mapper = {"never":0,"less1":1,"1~3":2,"4~8":3,"gt8":4}
test= test.replace({'Bar': Mapper})
test= test.replace({'CoffeeHouse': Mapper})
test= test.replace({'CarryAway': Mapper})
test= test.replace({'RestaurantLessThan20': Mapper})
test= test.replace({'Restaurant20To50': Mapper})


In [405]:
# temperature is float64, so the keys must be floats; Mapper2 (not Mapper) is the mapping to apply
Mapper2= {30.0: 0, 55.0: 1, 80.0: 2}
test= test.replace({'temperature': Mapper2})

Mapper3= {'7AM':0, '10AM': 1, '2PM': 2 , '6PM': 3, '10PM': 4}
test= test.replace({'time': Mapper3})

Mapper4= {'below21':0, '21': 1, '26': 2 , '31': 3, '36': 4, '41': 5, '46': 6, '50plus': 7}
test= test.replace({'age': Mapper4})

Mapper5= {'Less than $12500':0, '$12500 - $24999': 1, '$25000 - $37499': 2 , '$37500 - $49999': 3, '$50000 - $62499': 4, '$62500 - $74999': 5, '$75000 - $87499': 6, '$87500 - $99999': 7, '$100000 or More': 8}
test= test.replace({'income': Mapper5})

In [406]:
test['expiration'] = test['expiration'].map({'1d': 1, '2h': 0})
test['gender'] = test['gender'].map({'Female': 1, 'Male': 0})

In [407]:
Mapper6= {'Some High School':0, 'High School Graduate': 0, 'Some college - no degree': 1, 'Associates degree': 2, 'Bachelors degree': 2, 'Graduate degree (Masters or Doctorate)': 3}
test= test.replace({'education': Mapper6})

In [408]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 793 entries, 0 to 792
Data columns (total 36 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   temperature                      793 non-null    float64
 1   time                             793 non-null    int64  
 2   expiration                       793 non-null    int64  
 3   gender                           793 non-null    int64  
 4   age                              793 non-null    int64  
 5   has_children                     793 non-null    int64  
 6   education                        793 non-null    int64  
 7   income                           793 non-null    int64  
 8   Bar                              793 non-null    int64  
 9   CoffeeHouse                      793 non-null    int64  
 10  CarryAway                        793 non-null    int64  
 11  RestaurantLessThan20             793 non-null    int64  
 12  Restaurant20To50      

All the null values have now been dealt with, and all columns have been transformed to numerical values.

# Machine learning models 

First the train dataset is split into X_train, X_test, Y_train and Y_test. Then MinMaxScaler is used to scale the features.

## Data Split

In [409]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

We split the preprocessed data into target (Y) and feature (X) datasets.

In [410]:
colIndexes= train.drop(['Y'], axis=1).columns
X= train[colIndexes]
Y= train.Y

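The X and Y datasets are each split into training and test sets. A MinMaxScaler is fitted on the training features only and then applied to both splits, so that all features share the range [0, 1] on the training data and no information from the test split enters the scaling.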

In [411]:
X_train_org, X_test_org, Y_train, Y_test= train_test_split(X,Y,random_state=0)
scaler= MinMaxScaler()
X_train= scaler.fit_transform(X_train_org)
X_test= scaler.transform(X_test_org)
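
To see what fitting on the training split only means in practice, here is a small sketch with a single toy feature (values chosen for illustration). The scaler learns the minimum and maximum from the training values and reuses them for the test values, so a test value outside the training range ends up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy feature; 90 lies outside the training range on purpose.
train_col = np.array([[30.0], [55.0], [80.0]])
test_col = np.array([[55.0], [90.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_col)   # learns min=30 and max=80
test_scaled = scaler.transform(test_col)         # reuses those statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```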

## Classification Models

We now build various classification models, each paired with a grid search with cross-validation to find the optimal hyperparameters.
Since we are more concerned with finding most 'yes' values correctly, even if it means a few 'no's are falsely classified as 'yes', our evaluation criterion is going to be **recall**.
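
Recall is the share of actual 'yes' cases that the model finds, TP / (TP + FN). The toy labels below are made up for illustration and show how recall and precision differ, and why recall is best read next to a second metric: a model that predicts 'yes' for everyone reaches a recall of 1.0.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 4 actual positives, 4 actual negatives (hypothetical).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]   # TP=3, FN=1, FP=2

recall = recall_score(y_true, y_pred)         # 3 / (3 + 1)
precision = precision_score(y_true, y_pred)   # 3 / (3 + 2)

# Predicting the positive class for every row finds all positives.
always_yes_recall = recall_score(y_true, [1] * len(y_true))

print(recall, precision, always_yes_recall)
```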

In [412]:
# KNN CLASSIFIER

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

grid_params= {'n_neighbors': [3,4,5,8,10,15,20],'weights':['distance'],'metric':['euclidean']}

gs_knn= GridSearchCV(KNeighborsClassifier(), grid_params, cv= 5 ,scoring = 'recall',return_train_score=True, verbose=1)
gs_results= gs_knn.fit(X_train, Y_train)
print("Best parameters: {}".format(gs_results.best_params_))
print("Best cross-validation score: {:.2f}".format(gs_results.best_score_))

gs_best_knn = KNeighborsClassifier(n_neighbors = gs_results.best_params_['n_neighbors'],metric=gs_results.best_params_['metric'],weights=gs_results.best_params_['weights'])
gs_best_knn.fit(X_train,Y_train)
print(f'test score : {gs_best_knn.score(X_test, Y_test)}')
knn_Y_predict = gs_best_knn.predict(X_test)
print('Recall :{}'.format(recall_score(Y_test,knn_Y_predict)))


Fitting 5 folds for each of 7 candidates, totalling 35 fits
Best parameters: {'metric': 'euclidean', 'n_neighbors': 20, 'weights': 'distance'}
Best cross-validation score: 0.75
test score : 0.6554621848739496
Recall :0.7631578947368421


The best parameters from the grid search are metric: euclidean, n_neighbors: 20, weights: distance. The cross-validated recall of this model is 0.75. On the held-out split, the model with the best parameters reaches an accuracy of 0.66 (the value returned by `score`) and a recall of 0.76.
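
Rebuilding the classifier from `best_params_` works, but with the default `refit=True` the fitted search object already holds the refitted winner in `best_estimator_`. The sketch below illustrates this on synthetic data from `make_classification`; the data and the small grid are assumptions and are unrelated to the coupon dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the coupon survey.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 10]},
                      cv=5, scoring="recall")
search.fit(X_tr, y_tr)

# Model refitted on all of X_tr with the best parameters.
best_knn = search.best_estimator_

# Rebuilding by hand gives the same predictions for this deterministic model.
manual = KNeighborsClassifier(**search.best_params_).fit(X_tr, y_tr)
same = bool((best_knn.predict(X_te) == manual.predict(X_te)).all())

test_recall = recall_score(y_te, best_knn.predict(X_te))
print(search.best_params_, same, round(test_recall, 2))
```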

In [413]:
# LOGISTIC REGRESSION

from sklearn.linear_model import LogisticRegression
import warnings

grid_params= {'penalty': ['l1', 'l2'], 'C': [0.05, 0.1, 1, 10, 100, 500], 'solver': ['lbfgs','saga','sag'] }

Logreg = GridSearchCV(LogisticRegression(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')
warnings.filterwarnings('ignore')
best_logreg = Logreg.fit(X_train, Y_train)
print("Best parameters: {}".format(Logreg.best_params_))
print("Best cross-validation score: {:.2f}".format(Logreg.best_score_))

logL2= LogisticRegression(penalty= Logreg.best_params_['penalty'], C= Logreg.best_params_['C'], solver= Logreg.best_params_['solver'])
logL2.fit(X_train, Y_train)
print('train_score_l2 : {}'.format(logL2.score(X_train, Y_train)))
print('test_score_l2 : {}'.format(logL2.score(X_test, Y_test)))

Logreg_Y_predict = logL2.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,Logreg_Y_predict)))


Best parameters: {'C': 0.05, 'penalty': 'l1', 'solver': 'saga'}
Best cross-validation score: 0.81
train_score_l2 : 0.6539540100953449
test_score_l2 : 0.6621848739495798
Recall :0.8129


The best parameters for the Logistic Regression model are C= 0.05, penalty= l1, solver= saga. When the model is run with these parameters, we get a train accuracy of 0.65, a test accuracy of 0.66 and a recall of 0.81.

In [414]:
# LINEAR SVC

from sklearn.svm import LinearSVC

grid_params= {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100, 1000], 'random_state': [0] }

clf_lsvc = GridSearchCV(LinearSVC(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')

best_lsvc = clf_lsvc.fit(X_train, Y_train)
print("Best parameters: {}".format(clf_lsvc.best_params_))
print("Best cross-validation score: {:.2f}".format(clf_lsvc.best_score_))

clf_ls= LinearSVC(penalty= clf_lsvc.best_params_['penalty'], C= clf_lsvc.best_params_['C'], random_state= clf_lsvc.best_params_['random_state'])
clf_ls.fit(X_train, Y_train)
print('train_score_l2 : {}'.format(clf_ls.score(X_train, Y_train)))
print('test_score_l2 : {}'.format(clf_ls.score(X_test, Y_test)))

clf_ls_Ypredict = clf_ls.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,clf_ls_Ypredict)))

Best parameters: {'C': 0.01, 'penalty': 'l2', 'random_state': 0}
Best cross-validation score: 0.75
train_score_l2 : 0.6679753224901851
test_score_l2 : 0.66890756302521
Recall :0.7661


In [415]:
# DECISION TREE

from sklearn.tree import DecisionTreeClassifier

grid_params ={'criterion': ['gini', 'entropy'], 'max_depth': [2,5,7], 'min_samples_leaf':[5,10,15],  'random_state': [0]}

clf_tree = GridSearchCV(DecisionTreeClassifier(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')

best_tree = clf_tree.fit(X_train, Y_train)
print("Best parameters: {}".format(clf_tree.best_params_))
print("Best cross-validation score: {:.2f}".format(clf_tree.best_score_))

tree = DecisionTreeClassifier(**clf_tree.best_params_)
tree.fit(X_train, Y_train)

print("Train score: {:.4f}".format(tree.score(X_train, Y_train)))
print("Test score: {:.4f}".format(tree.score(X_test,Y_test)))

tree_Ypredict = tree.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,tree_Ypredict)))

Best parameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 5, 'random_state': 0}
Best cross-validation score: 0.77
Train score: 0.6854
Test score: 0.6437
Recall :0.6930


The best parameters obtained from the grid search are: criterion= 'entropy', max_depth= 5, min_samples_leaf= 5. The cross-validation score is 0.77. The decision tree model run with these parameters gives a train accuracy of 0.69, a test accuracy of 0.64 and a recall of 0.69.

In [416]:
# SVC KERNEL

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import recall_score, precision_score

In [417]:
# kernel= poly

grid_params ={'kernel': ['poly'], 'C': [0.1,0.3], 'degree':[2,3,4],  'gamma': [0.1,0.3]}

clf_poly = GridSearchCV(SVC(random_state=0), param_grid = grid_params, cv = 5, scoring= 'recall',verbose=True)

clf_poly.fit(X_train, Y_train)

print("Best parameters: {}".format(clf_poly.best_params_))
print("Best cross-validation score: {:.2f}".format(clf_poly.best_score_))

clf_bestpoly = SVC(**clf_poly.best_params_)
clf_bestpoly.fit(X_train,Y_train)

print('train_score_l2 : {}'.format(clf_bestpoly.score(X_train, Y_train)))
print('test_score_l2 : {}'.format(clf_bestpoly.score(X_test, Y_test)))

clf_bestpoly_Ypredict = clf_bestpoly.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,clf_bestpoly_Ypredict)))


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'C': 0.1, 'degree': 4, 'gamma': 0.1, 'kernel': 'poly'}
Best cross-validation score: 0.85
train_score_l2 : 0.6870443073471677
test_score_l2 : 0.6487394957983194
Recall :0.8363


SVC with kernel 'poly', degree = 4, C = 0.1 and gamma = 0.1 gives a test accuracy of 0.65 and a recall of 0.84.

In [418]:
#kernel= rbf
grid_params ={'kernel': ['rbf'], 'C': [0.1,0.3], 'degree':[2,3,4],  'gamma': [0.1,0.3]}

clf_rbf = GridSearchCV(SVC(random_state=0), param_grid = grid_params, cv = 5, scoring= 'recall',verbose=True)

clf_rbf.fit(X_train, Y_train)

print("Best parameters: {}".format(clf_rbf.best_params_))
print("Best cross-validation score: {:.2f}".format(clf_rbf.best_score_))

clf_bestrbf = SVC(**clf_rbf.best_params_)
clf_bestrbf.fit(X_train,Y_train)

print('train_score_l2 : {}'.format(clf_bestrbf.score(X_train, Y_train)))
print('test_score_l2 : {}'.format(clf_bestrbf.score(X_test, Y_test)))

clf_bestrbf_Ypredict = clf_bestrbf.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,clf_bestrbf_Ypredict)))


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'C': 0.1, 'degree': 2, 'gamma': 0.3, 'kernel': 'rbf'}
Best cross-validation score: 0.93
train_score_l2 : 0.683679192372406
test_score_l2 : 0.6403361344537815
Recall :0.8772


SVC with kernel 'rbf', C = 0.1 and gamma = 0.3 gives a test accuracy of 0.64 and a recall of 0.88. The grid search also reports degree = 2, but `degree` only affects the 'poly' kernel and is ignored here.
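
That `degree` has no effect on an rbf kernel can be checked directly. The sketch below fits two otherwise identical models on synthetic data (an assumption for illustration, not the coupon data) and compares their predictions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=150, n_features=5, random_state=0)

# Two rbf models that differ only in the degree parameter.
rbf_deg2 = SVC(kernel="rbf", C=0.1, gamma=0.3, degree=2).fit(X_toy, y_toy)
rbf_deg4 = SVC(kernel="rbf", C=0.1, gamma=0.3, degree=4).fit(X_toy, y_toy)

identical = bool((rbf_deg2.predict(X_toy) == rbf_deg4.predict(X_toy)).all())
print(identical)
```

Leaving `degree` out of the rbf grid would therefore cut the number of candidates from 12 to 4 without changing the result.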

In [419]:
#kernel= linear
grid_params ={'kernel': ['linear'], 'C': [0.1,0.3], 'degree':[2,3,4],  'gamma': [0.1,0.3]}

clf_lin = GridSearchCV(SVC(random_state=0), param_grid = grid_params, cv = 5, scoring= 'recall',verbose=True)

clf_lin.fit(X_train, Y_train)

print("Best parameters: {}".format(clf_lin.best_params_))
print("Best cross-validation score: {:.2f}".format(clf_lin.best_score_))

clf_bestlin = SVC(**clf_lin.best_params_)
clf_bestlin.fit(X_train,Y_train)

print('train_score_l2 : {}'.format(clf_bestlin.score(X_train, Y_train)))
print('test_score_l2 : {}'.format(clf_bestlin.score(X_test, Y_test)))

clf_bestlin_Ypredict = clf_bestlin.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,clf_bestlin_Ypredict)))

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'C': 0.1, 'degree': 2, 'gamma': 0.1, 'kernel': 'linear'}
Best cross-validation score: 0.72
train_score_l2 : 0.6741446999439148
test_score_l2 : 0.6789915966386555
Recall :0.7456


SVC with kernel 'linear' and C = 0.1 gives a test score of 0.68 and a recall of 0.75. The `degree` and `gamma` values in the best parameters are ignored by the linear kernel.
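Because `degree` only matters for the polynomial kernel and `gamma` is not used by the linear kernel, several of the 12 candidates in the grids above are equivalent fits. The following is a minimal sketch in plain Python of how kernel-specific grids would shrink the search. `n_candidates` is a hypothetical helper introduced only for this illustration; the list-of-dicts layout is the form `GridSearchCV` accepts for `param_grid`.

```python
from itertools import product

def n_candidates(grid):
    """Count the parameter combinations described by one grid dict."""
    return len(list(product(*grid.values())))

# Grid as used above for the linear kernel: degree and gamma are searched
# even though the linear kernel ignores them.
original_linear = {'kernel': ['linear'], 'C': [0.1, 0.3],
                   'degree': [2, 3, 4], 'gamma': [0.1, 0.3]}

# Kernel-specific grids that only vary parameters the kernel actually uses.
pruned_grids = [
    {'kernel': ['linear'], 'C': [0.1, 0.3]},
    {'kernel': ['rbf'], 'C': [0.1, 0.3], 'gamma': [0.1, 0.3]},
    {'kernel': ['poly'], 'C': [0.1, 0.3], 'degree': [2, 3, 4], 'gamma': [0.1, 0.3]},
]

print(n_candidates(original_linear))            # 12
print([n_candidates(g) for g in pruned_grids])  # [2, 4, 12]
```

With 5-fold cross-validation, the linear search would drop from 60 fits to 10 without changing which models are compared.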

In [420]:
# RANDOM FOREST

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

grid_params ={'n_estimators': [100, 200, 300], 'max_depth': [4,6,7], 'min_samples_leaf':[5,7,10],  'random_state': [0], 'max_features' : [5,7]}

rf = GridSearchCV(RandomForestClassifier(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')

rf.fit(X_train, Y_train)
print("Best parameters: {}".format(rf.best_params_))
print("Best cross-validation score: {:.2f}".format(rf.best_score_))

RForest= RandomForestClassifier(**rf.best_params_)
RForest.fit(X_train, Y_train)

print("Train score: {:.4f}".format(RForest.score(X_train, Y_train)))
print("Test score: {:.4f}".format(RForest.score(X_test, Y_test)))

RForest_Ypredict = RForest.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test, RForest_Ypredict)))

Best parameters: {'max_depth': 4, 'max_features': 5, 'min_samples_leaf': 10, 'n_estimators': 100, 'random_state': 0}
Best cross-validation score: 0.84
Train score: 0.6988
Test score: 0.6975
Recall :0.8801


After grid search with cross-validation, the best parameters are max_depth = 4, max_features = 5, min_samples_leaf = 10 and n_estimators = 100. A Random Forest with these parameters gives a test score of 0.70 and a recall of 0.88.

Next I will be building models with ensemble methods. 

### VOTING CLASSIFIER

I am building voting classifiers from a kernel SVC (poly), a decision tree and a logistic regression. Each estimator is configured with the best parameters found by the grid search for its respective model.

In [421]:
# VOTING CLASSIFIERS 
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [422]:
# Hard Voting

svc= SVC(random_state=0, kernel= 'poly', degree= 4, C= 0.1, gamma= 0.1, probability= True)
svc.fit(X_train, Y_train)

tree_cls= DecisionTreeClassifier(criterion='entropy', max_depth= 5, min_samples_leaf=5,random_state= 0)
tree_cls.fit(X_train, Y_train)

log_cls= LogisticRegression(solver= 'saga', penalty= 'l1', C=0.05, max_iter= 1000, multi_class= 'auto',random_state= 0)
log_cls.fit(X_train, Y_train)

# Voting= hard
hardVoting_cls= VotingClassifier(estimators= [('svc', svc), ('tree', tree_cls), ('log', log_cls)], voting= 'hard')
hardVoting_cls.fit(X_train,Y_train)
hardVoting_cls.score(X_train,Y_train)

y_pred_voting= hardVoting_cls.predict(X_test)
print('Train score: {}'.format(hardVoting_cls.score(X_train,Y_train)))
print('Test score: {}'.format(hardVoting_cls.score(X_test,Y_test)))
print('Recall :{:.4f}'.format(recall_score(Y_test, y_pred_voting)))

Train score: 0.6859226023555804
Test score: 0.6621848739495798
Recall :0.8099


The voting classifier with the voting method set to "hard" gives a recall of 0.81 and a test score of 0.66.

In [423]:
# Soft Voting

voting_cls_soft= VotingClassifier(estimators= [('svc', svc), ('tree', tree_cls), ('log', log_cls)], voting= 'soft')
voting_cls_soft.fit(X_train,Y_train)
voting_cls_soft.score(X_train,Y_train)

y_pred_voting= voting_cls_soft.predict(X_test)
print('Train score: {}'.format(voting_cls_soft.score(X_train,Y_train)))
print('Test score: {}'.format(voting_cls_soft.score(X_test,Y_test)))
print('Recall :{:.4f}'.format(recall_score(Y_test, y_pred_voting)))

Train score: 0.7083567021873247
Test score: 0.6941176470588235
Recall :0.7544


The voting classifier with the voting method set to "soft" gives a recall of 0.75 and a test score of 0.69. 
The soft voting classifier has a better test score than the hard voting classifier (0.69 vs 0.66). However, we care more about recall than accuracy, and the hard voting classifier has the better recall (0.81 vs 0.75). By that criterion, the hard voting classifier does the better job.
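The two voting modes can disagree on the same sample, which is why their recall and accuracy differ. The following is a simplified sketch for the binary case, not scikit-learn's implementation: `hard_vote` and `soft_vote` are hypothetical helpers, and the probabilities are made-up values for the positive class from three classifiers.

```python
def hard_vote(probas, threshold=0.5):
    """Majority vote over each classifier's own thresholded prediction."""
    votes = [1 if p > threshold else 0 for p in probas]
    return 1 if sum(votes) > len(votes) / 2 else 0

def soft_vote(probas, threshold=0.5):
    """Average the predicted probabilities first, then threshold once."""
    return 1 if sum(probas) / len(probas) > threshold else 0

# One very confident "yes" against two mild "no" votes.
confident_minority = [0.9, 0.4, 0.4]
print(hard_vote(confident_minority))  # 0
print(soft_vote(confident_minority))  # 1

# Two mild "yes" votes against one very confident "no".
mild_majority = [0.6, 0.6, 0.1]
print(hard_vote(mild_majority))  # 1
print(soft_vote(mild_majority))  # 0
```

Hard voting counts heads, while soft voting lets a confident classifier outweigh two hesitant ones.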

In [424]:
# BAGGING

# bagging on Decision tree with gridsearch
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from  sklearn.metrics import accuracy_score

tree_bag= DecisionTreeClassifier( criterion= 'entropy', min_samples_leaf= 5, max_depth= 5, random_state= 0)
grid_params_bag= {'max_features': [5, 10, 15], 'n_estimators': [100, 200, 300], 'max_samples': [0.1, 0.5]}

grid_bag= GridSearchCV(
    BaggingClassifier(tree_bag, random_state=0), param_grid= grid_params_bag, cv= 5, scoring= 'recall')
best_tree_bag= grid_bag.fit(X_train, Y_train)

print('Best parameters: {}'.format(grid_bag.best_params_))
print('Best CV score:{}'.format(grid_bag.best_score_))

tree_bagging= BaggingClassifier(tree_bag,**grid_bag.best_params_, random_state=0, bootstrap=True)
tree_bagging.fit(X_train, Y_train)
print('Train_score : {}'.format(tree_bagging.score(X_train, Y_train)))
print('Test_score : {}'.format(tree_bagging.score(X_test, Y_test)))

tree_bag_Ypredict = tree_bagging.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,tree_bag_Ypredict)))

Best parameters: {'max_features': 5, 'max_samples': 0.1, 'n_estimators': 300}
Best CV score:0.9229862072501666
Train_score : 0.641054402692092
Test_score : 0.6235294117647059
Recall :0.9298


Grid Search on Bagging with Decision Tree gives the following best parameters: max_features: 5, max_samples: 0.1, n_estimators: 300. Test score is 0.62 and Recall score is 0.93. This is the best recall so far.

In [425]:
# Bagging on KNN with Grid search
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score

knn = KNeighborsClassifier(metric= 'euclidean', n_neighbors= 20, weights= 'distance')

grid_params_bag ={'max_features': [5, 7, 10], 'n_estimators': [100, 200, 500], 'max_samples': [0.1, 0.5]}

grid_knn = GridSearchCV(BaggingClassifier(knn, random_state=0), param_grid = grid_params_bag, cv = 5, scoring= 'recall')

grid_best_ksvc = grid_knn.fit(X_train, Y_train)
print("Best parameters: {}".format(grid_knn.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_knn.best_score_))

knn_bag= BaggingClassifier(knn, **grid_knn.best_params_, random_state=0)
knn_bag.fit(X_train, Y_train)
print('train_score_l2 : {}'.format(knn_bag.score(X_train, Y_train)))
print('test_score_l2 : {}'.format(knn_bag.score(X_test, Y_test)))

knn_bag_Ypredict = knn_bag.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,knn_bag_Ypredict)))

Best parameters: {'max_features': 5, 'max_samples': 0.1, 'n_estimators': 500}
Best cross-validation score: 0.92
train_score_l2 : 0.6920919798093101
test_score_l2 : 0.6302521008403361
Recall :0.9269


Grid Search on Bagging with KNN Classifier gives the following best parameters: max_features: 5, max_samples: 0.1, n_estimators: 500. Test score is 0.63 and Recall score is 0.927.

In [426]:
# PASTING

# pasting with decision tree, gridsearch

grid_tree_paste= GridSearchCV(
    BaggingClassifier(tree_bag, random_state=0, bootstrap=False), param_grid= grid_params_bag, cv= 5, scoring= 'recall')
grid_tree_paste.fit(X_train, Y_train)

print('Best parameters: {}'.format(grid_tree_paste.best_params_))
print('Best CV score:{}'.format(grid_tree_paste.best_score_))

tree_paste= BaggingClassifier(tree_bag,**grid_tree_paste.best_params_, random_state=0, bootstrap= False)
tree_paste.fit(X_train, Y_train)
print('Train_score : {}'.format(tree_paste.score(X_train, Y_train)))
print('Test_score : {}'.format(tree_paste.score(X_test, Y_test)))

tree_paste_Ypredict = tree_paste.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test,tree_paste_Ypredict)))

Best parameters: {'max_features': 5, 'max_samples': 0.1, 'n_estimators': 500}
Best CV score:0.9280572219658515
Train_score : 0.6421761076836792
Test_score : 0.6134453781512605
Recall :0.9035


Grid Search on Pasting with Decision tree gives the following best parameters: max_features: 5, max_samples: 0.1, n_estimators: 500. Test score is 0.61 and Recall score is 0.90.
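The only difference between the bagging and pasting cells is the `bootstrap` flag, which controls whether each base estimator's training rows are drawn with or without replacement. The following is a minimal sketch using only the standard library; `draw_sample` is a hypothetical helper and the row count of 1000 is an arbitrary illustration, not the size of this dataset.

```python
import random

def draw_sample(n_rows, max_samples, bootstrap, rng):
    """Return row indices for one base estimator's training subset."""
    k = int(max_samples * n_rows)
    if bootstrap:
        # Bagging: sample with replacement, so rows can repeat.
        return rng.choices(range(n_rows), k=k)
    # Pasting: sample without replacement, so every row is distinct.
    return rng.sample(range(n_rows), k)

rng = random.Random(0)
bagging_idx = draw_sample(1000, 0.1, bootstrap=True, rng=rng)
pasting_idx = draw_sample(1000, 0.1, bootstrap=False, rng=rng)

# Both subsets have the same size, but only pasting guarantees unique rows.
print(len(bagging_idx), len(set(bagging_idx)))
print(len(pasting_idx), len(set(pasting_idx)))
```

With max_samples as small as 0.1, repeated rows are rare, which may be part of why bagging and pasting score so similarly here.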

In [427]:
# pasting on KNN, grid search
knn = KNeighborsClassifier(metric= 'euclidean', n_neighbors= 5, weights= 'distance')

grid_params_bag ={'max_features': [5, 7, 10], 'n_estimators': [100, 200, 300], 'max_samples': [0.1, 0.5]}
grid_knn_paste = GridSearchCV(BaggingClassifier(knn, random_state=0, bootstrap=False), param_grid = grid_params_bag, cv = 5, scoring= 'recall')

grid_knn_paste.fit(X_train, Y_train)
print("Best parameters: {}".format(grid_knn_paste.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_knn_paste.best_score_))

knn_paste= BaggingClassifier(knn,**grid_knn_paste.best_params_, random_state=0, bootstrap= False)
knn_paste.fit(X_train, Y_train)
print('Train_score : {}'.format(knn_paste.score(X_train, Y_train)))
print('Test_score : {}'.format(knn_paste.score(X_test, Y_test)))
knn_paste_Ypredict = knn_paste.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test, knn_paste_Ypredict)))


Best parameters: {'max_features': 5, 'max_samples': 0.1, 'n_estimators': 300}
Best cross-validation score: 0.91
Train_score : 0.699383062254627
Test_score : 0.6487394957983194
Recall :0.9035


Grid Search on Pasting with KNN Classifier gives the following best parameters: max_features: 5, max_samples: 0.1, n_estimators: 300. Test score is 0.65 and Recall score is 0.90.

In [428]:
# ADABOOST

# Adaboost Classifier with Decision Tree

from sklearn.ensemble import AdaBoostClassifier
tree_bag= DecisionTreeClassifier( criterion= 'entropy', min_samples_leaf= 10, max_depth= 7, random_state= 0)


grid_params_ada ={'learning_rate': [0.5, 1, 1.5], 'n_estimators': [100, 200, 300], 'algorithm': ['SAMME', 'SAMME.R']}

ada_tree = GridSearchCV(
    AdaBoostClassifier(tree_bag, random_state=0), param_grid= grid_params_ada, cv=5, scoring= 'recall')
ada_tree.fit(X_train, Y_train)

print("Best parameters: {}".format(ada_tree.best_params_))
print("Best cross-validation score: {:.2f}".format(ada_tree.best_score_))

tree_ada= AdaBoostClassifier(tree_bag,**ada_tree.best_params_, random_state=0)
tree_ada.fit(X_train, Y_train)
print('Train_score : {}'.format(tree_ada.score(X_train, Y_train)))
print('Test_score : {}'.format(tree_ada.score(X_test, Y_test)))
tree_ada_Ypredict= tree_ada.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test, tree_ada_Ypredict)))

Best parameters: {'algorithm': 'SAMME', 'learning_rate': 0.5, 'n_estimators': 300}
Best cross-validation score: 0.73
Train_score : 1.0
Test_score : 0.6773109243697479
Recall :0.7398


Grid Search on AdaBoost with Decision Tree gives the following best parameters: algorithm: 'SAMME', learning_rate: 0.5, n_estimators: 300. Test score is 0.68 and Recall score is 0.74. This model is overfitting: the train score is 1.0 while the test score is only 0.68.

In [429]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth= 6, max_features= 2, min_samples_leaf= 10, n_estimators= 10, random_state= 0)

grid_params ={'learning_rate': [0.5, 0.7,0.9], 'n_estimators': [50,100, 150], 'algorithm': ['SAMME']}

grid_ada_rf = GridSearchCV(
    AdaBoostClassifier(rf, random_state=0), param_grid= grid_params, cv=5, scoring= 'recall',return_train_score=True)
grid_ada_rf.fit(X_train, Y_train)

print("Best parameters: {}".format(grid_ada_rf.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_ada_rf.best_score_))

rf_ada= AdaBoostClassifier(rf,**grid_ada_rf.best_params_, random_state=0)
rf_ada.fit(X_train, Y_train)
print('Train_score : {}'.format(rf_ada.score(X_train, Y_train)))
print('Test_score : {}'.format(rf_ada.score(X_test, Y_test)))

rf_ada_Ypredict= rf_ada.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test, rf_ada_Ypredict)))


Best parameters: {'algorithm': 'SAMME', 'learning_rate': 0.5, 'n_estimators': 50}
Best cross-validation score: 0.73
Train_score : 0.9584969153112731
Test_score : 0.6487394957983194
Recall :0.7135


Grid Search on AdaBoost with Random Forest Classifier gives the following best parameters: algorithm: SAMME, learning_rate: 0.5, n_estimators: 50. Test score is 0.65 and Recall score is 0.71. This model is also overfitting, since the train score (0.96) is far above the test score (0.65).

## Gradient Boosting

In [430]:
from  sklearn.ensemble import GradientBoostingClassifier
grid_params= {'max_features': [2, 5, 10, 20], 'n_estimators': [100, 200, 500], 'learning_rate': [0.1, 0.25, 0.5, 0.75, 1]}

grid_boost= GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid= grid_params, cv= 5, scoring='recall' )
grid_boost.fit(X_train, Y_train)

print('Best parameters: {}'.format(grid_boost.best_params_))
print('Best CV score:{}'.format(grid_boost.best_score_))

gboost= GradientBoostingClassifier(**grid_boost.best_params_, random_state=0)
gboost.fit(X_train, Y_train)
print('Train_score : {}'.format(gboost.score(X_train, Y_train)))
print('Test_score : {}'.format(gboost.score(X_test, Y_test)))

gboost_Ypredict= gboost.predict(X_test)
print('Recall :{:.4f}'.format(recall_score(Y_test, gboost_Ypredict)))


Best parameters: {'learning_rate': 0.1, 'max_features': 2, 'n_estimators': 100}
Best CV score:0.7618571501820233
Train_score : 0.7291082445316882
Test_score : 0.7025210084033613
Recall :0.7924


Grid Search on Gradient boosting gives the following best parameters: max_features: 2, learning_rate: 0.1, n_estimators: 100. Test score is 0.70 and Recall score is 0.79. This is a decent model with a good Test score and a good Recall score.

## PCA

I am using Principal Component Analysis for dimensionality reduction. After that, I run all the original models on the reduced dataset. PCA is fitted on X_train, and both X_train and X_test are projected onto the components that retain 95% of the variance in the training data.
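Passing a fraction such as 0.95 as `n_components` asks PCA to keep just enough leading components for their cumulative explained variance to exceed that fraction. The following is a minimal sketch of that selection rule in plain Python; `components_for_variance` is a hypothetical helper and the ratios are made-up values, not the ones from this dataset.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio exceeds target."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total > target:
            return k
    return len(explained_variance_ratio)

# Hypothetical explained-variance ratios, sorted in descending order.
ratios = [0.50, 0.30, 0.10, 0.06, 0.04]
print(components_for_variance(ratios))  # 4
```

On a fitted scikit-learn `PCA` object, the number actually kept is available as `pca.n_components_` and the per-component ratios as `pca.explained_variance_ratio_`.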

In [431]:
# Creating an empty DataFrame for the model results.

pca_results_df = pd.DataFrame(columns=['model','test_score','recall_score'])

In [432]:
from sklearn.decomposition import PCA

pca= PCA(n_components= 0.95)
X_train_reduced= pca.fit_transform(X_train)
X_test_reduced= pca.transform(X_test)

In [433]:
# KNN
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import recall_score

grid_params= {'n_neighbors': [3,4,5,8,10,15,20],'weights':['distance'],'metric':['euclidean']}

pca_gs_knn= GridSearchCV(KNeighborsClassifier(), grid_params, cv= 5 ,scoring = 'recall',return_train_score=True)
pca_gs_results= pca_gs_knn.fit(X_train, Y_train)
print("Best parameters: {}".format(pca_gs_knn.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_gs_knn.best_score_))

pca_gs_best_knn = KNeighborsClassifier(n_neighbors = pca_gs_results.best_params_['n_neighbors'],metric=pca_gs_results.best_params_['metric'],weights=pca_gs_results.best_params_['weights'])
pca_gs_best_knn.fit(X_train_reduced,Y_train)
print(f'test score : {pca_gs_best_knn.score(X_test_reduced, Y_test)}')
knn_Y_predict = pca_gs_best_knn.predict(X_test_reduced)
print('Recall :{}'.format(recall_score(Y_test,knn_Y_predict)))

Best parameters: {'metric': 'euclidean', 'n_neighbors': 20, 'weights': 'distance'}
Best cross-validation score: 0.75
test score : 0.6605042016806723
Recall :0.7573099415204678


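After grid search with cross-validation, the best parameters are metric: euclidean, n_neighbors: 20, weights: distance. Note that the grid search in this cell is fitted on the unreduced X_train, while the final KNN classifier is fitted and evaluated on the PCA-reduced data. With these parameters, the KNN classifier gives a test score of 0.66 and a recall of 0.76.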
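One way to make sure the hyperparameter search and the final model see the same PCA-reduced features is to put both steps into a scikit-learn `Pipeline` and search over that. The sketch below is an illustration on synthetic data from `make_classification`, not a rerun of the notebook; the names `X_demo`, `y_demo`, `pipe` and `search` and the reduced grid are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# PCA is refitted inside every cross-validation fold, then KNN is trained
# on the reduced features of that fold.
pipe = Pipeline([
    ('pca', PCA(n_components=0.95)),
    ('knn', KNeighborsClassifier(weights='distance', metric='euclidean')),
])

# Step parameters are addressed as <step name>__<parameter name>.
grid = {'knn__n_neighbors': [3, 5, 10]}
search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='recall')
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Because `GridSearchCV` refits the best pipeline on the full training data by default, `search.predict` can then be called directly on unreduced test features.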

In [434]:
# Logistic regression
from sklearn.linear_model import LogisticRegression
import warnings

grid_params= {'penalty': ['l1', 'l2'], 'C': [0.05, 0.1, 1, 10, 100, 500], 'solver': ['lbfgs','saga'] }

pca_Logreg = GridSearchCV(LogisticRegression(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')
warnings.filterwarnings('ignore')

pca_best_logreg = pca_Logreg.fit(X_train_reduced, Y_train)
print("Best parameters: {}".format(pca_Logreg.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_Logreg.best_score_))

pca_logL2= LogisticRegression(**pca_Logreg.best_params_)
pca_logL2.fit(X_train_reduced, Y_train)
print('train_score_l2 : {}'.format(pca_logL2.score(X_train_reduced, Y_train)))
print('test_score_l2 : {}'.format(pca_logL2.score(X_test_reduced, Y_test)))

Logreg_Y_predict = pca_logL2.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test,Logreg_Y_predict)))

Best parameters: {'C': 0.05, 'penalty': 'l1', 'solver': 'saga'}
Best cross-validation score: 0.81
train_score_l2 : 0.6511497476163769
test_score_l2 : 0.6252100840336134
Recall :0.7807


After grid search with cross-validation, the best parameters are C: 0.05, penalty: l1, solver: saga. The Logistic Regression model gives a test score of 0.63 and a recall of 0.78.

In [435]:
# Decision tree

from sklearn.tree import DecisionTreeClassifier

grid_params ={'criterion': ['gini', 'entropy'], 'max_depth': [2,5,7], 'min_samples_leaf':[5,10,15],  'random_state': [0]}

pca_clf_tree = GridSearchCV(DecisionTreeClassifier(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')

pca_best_tree = pca_clf_tree.fit(X_train_reduced, Y_train)
print("Best parameters: {}".format(pca_clf_tree.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_clf_tree.best_score_))

pca_tree = DecisionTreeClassifier(**pca_clf_tree.best_params_)
pca_tree.fit(X_train_reduced, Y_train)

print("Train score: {:.4f}".format(pca_tree.score(X_train_reduced, Y_train)))
print("Test score: {:.4f}".format(pca_tree.score(X_test_reduced,Y_test)))

tree_Ypredict = pca_tree.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test,tree_Ypredict)))

Best parameters: {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'random_state': 0}
Best cross-validation score: 0.74
Train score: 0.6125
Test score: 0.6050
Recall :0.8421


After grid search with cross-validation, the best parameters are criterion: gini, max_depth: 2, min_samples_leaf: 5. With these parameters, the decision tree gives a test score of 0.60 and a recall of 0.84.

In [436]:
# LinearSVC

from sklearn.svm import LinearSVC

grid_params= {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100, 1000], 'random_state': [0] }

pca_lsvc = GridSearchCV(LinearSVC(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')

best_lsvc = pca_lsvc.fit(X_train_reduced, Y_train)
print("Best parameters: {}".format(pca_lsvc.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_lsvc.best_score_))

pca_ls= LinearSVC(**pca_lsvc.best_params_)
pca_ls.fit(X_train_reduced, Y_train)
print('train_score : {}'.format(pca_ls.score(X_train_reduced, Y_train)))
print('test_score : {}'.format(pca_ls.score(X_test_reduced, Y_test)))

pca_ls_Ypredict = pca_ls.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test,pca_ls_Ypredict)))

Best parameters: {'C': 0.01, 'penalty': 'l2', 'random_state': 0}
Best cross-validation score: 0.74
train_score : 0.6679753224901851
test_score : 0.6722689075630253
Recall :0.7690


After grid search with cross-validation, the best parameters are C: 0.01, penalty: l2. With these parameters, the Linear SVC model gives a test score of 0.67 and a recall of 0.77.

In [437]:
# KERNEL SVC
# kernel= poly

grid_params_poly ={'kernel': ['poly'], 'C': [0.1,0.3], 'degree':[2,3,4],  'gamma': [0.1,0.3]}

pca_poly = GridSearchCV(SVC(random_state=0), param_grid = grid_params_poly, cv = 5, scoring= 'recall',verbose=True)

pca_poly.fit(X_train_reduced, Y_train)

print("Best parameters: {}".format(pca_poly.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_poly.best_score_))

pca_bestpoly = SVC(**pca_poly.best_params_)
pca_bestpoly.fit(X_train_reduced,Y_train)

print('train_score : {}'.format(pca_bestpoly.score(X_train_reduced, Y_train)))
print('test_score : {}'.format(pca_bestpoly.score(X_test_reduced, Y_test)))

pca_bestpoly_Ypredict = pca_bestpoly.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test,pca_bestpoly_Ypredict)))

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'C': 0.1, 'degree': 2, 'gamma': 0.1, 'kernel': 'poly'}
Best cross-validation score: 1.00
train_score : 0.5535614133482895
test_score : 0.5747899159663865
Recall :1.0000


After grid search with cross-validation, the best parameters are C: 0.1, degree: 2, gamma: 0.1, kernel: poly. With these parameters we get a test score of 0.57 and a recall of 1.00. The model is underfitting significantly, with a train score of only 0.55. A perfect recall combined with such low accuracy suggests that it predicts the positive class for nearly every sample.
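To see how a recall of 1.00 can coexist with poor accuracy, consider a classifier that answers "yes" for every sample. The following is a minimal sketch with made-up labels; `recall` and `accuracy` are hand-written helpers for illustration, not the scikit-learn functions used in the cells above.

```python
def recall(y_true, y_pred):
    """True positives divided by all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def accuracy(y_true, y_pred):
    """Share of samples whose prediction matches the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

y_true = [1, 1, 1, 1, 0, 0, 0]      # 4 positives, 3 negatives (made up)
always_yes = [1] * len(y_true)      # degenerate "always accept" classifier

print(recall(y_true, always_yes))    # 1.0
print(accuracy(y_true, always_yes))  # 4/7, the share of positives
```

Such a classifier never misses a positive, so its recall is perfect, but its accuracy can never exceed the share of positives in the data.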

In [438]:
#kernel= rbf
grid_params ={'kernel': ['rbf'], 'C': [0.1,0.3], 'degree':[2,3,4],  'gamma': [0.1,0.3]}

pca_rbf = GridSearchCV(SVC(random_state=0), param_grid = grid_params, cv = 5, scoring= 'recall',verbose=False)
pca_rbf.fit(X_train_reduced, Y_train)
print("Best parameters: {}".format(pca_rbf.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_rbf.best_score_))

pca_bestrbf = SVC(**pca_rbf.best_params_)
pca_bestrbf.fit(X_train_reduced, Y_train)

print('train_score : {}'.format(pca_bestrbf.score(X_train_reduced, Y_train)))
print('test_score : {}'.format(pca_bestrbf.score(X_test_reduced, Y_test)))

pca_bestrbf_Ypredict = pca_bestrbf.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test,pca_bestrbf_Ypredict)))

Best parameters: {'C': 0.1, 'degree': 2, 'gamma': 0.3, 'kernel': 'rbf'}
Best cross-validation score: 0.89
train_score : 0.6887268648345485
test_score : 0.6470588235294118
Recall :0.8421


After grid search with cross-validation, the best parameters are C: 0.1, gamma: 0.3, kernel: rbf (the reported degree: 2 is not used by the rbf kernel). With these parameters, we get a test score of 0.65 and a recall of 0.84.

In [439]:
#kernel= linear
grid_params ={'kernel': ['linear'], 'C': [0.1,0.3], 'degree':[2,3,4],  'gamma': [0.1,0.3]}

pca_lin = GridSearchCV(SVC(random_state=0), param_grid = grid_params, cv = 5, scoring= 'recall',verbose=True)

pca_lin.fit(X_train_reduced, Y_train)

print("Best parameters: {}".format(pca_lin.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_lin.best_score_))

pca_bestlin = SVC(**pca_lin.best_params_)
pca_bestlin.fit(X_train_reduced,Y_train)

print('train_score_l2 : {}'.format(pca_bestlin.score(X_train_reduced, Y_train)))
print('test_score_l2 : {}'.format(pca_bestlin.score(X_test_reduced, Y_test)))

pca_bestlin_Ypredict = pca_bestlin.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test, pca_bestlin_Ypredict)))

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'C': 0.3, 'degree': 2, 'gamma': 0.1, 'kernel': 'linear'}
Best cross-validation score: 0.72
train_score_l2 : 0.6713404374649468
test_score_l2 : 0.6789915966386555
Recall :0.7573


After grid search with cross-validation, the best parameters are C: 0.3, degree: 2, gamma: 0.1, kernel: linear. With these parameters, we get a test score of 0.68 and a recall of 0.76.

In [440]:
# RANDOM FOREST

grid_params ={'n_estimators': [100, 200, 500], 'max_depth': [2,4,6], 'min_samples_leaf':[5,10],  'random_state': [0], 'max_features' : [2]}

pca_rf = GridSearchCV(RandomForestClassifier(), param_grid = grid_params, cv = 5, n_jobs=-1, scoring= 'recall')

pca_rf.fit(X_train_reduced, Y_train)
print("Best parameters: {}".format(pca_rf.best_params_))
print("Best cross-validation score: {:.2f}".format(pca_rf.best_score_))

pca_RForest= RandomForestClassifier(**pca_rf.best_params_)
pca_RForest.fit(X_train_reduced, Y_train)

print("Train score: {:.4f}".format(pca_RForest.score(X_train_reduced, Y_train)))
print("Test score: {:.4f}".format(pca_RForest.score(X_test_reduced, Y_test)))

pca_RForest_Ypredict = pca_RForest.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test, pca_RForest_Ypredict)))

Best parameters: {'max_depth': 2, 'max_features': 2, 'min_samples_leaf': 5, 'n_estimators': 200, 'random_state': 0}
Best cross-validation score: 0.97
Train score: 0.6029
Test score: 0.5983
Recall :0.9737


After grid search with cross-validation, the best parameters are max_depth: 2, max_features: 2, min_samples_leaf: 5, n_estimators: 200. With these parameters, we get a test score of 0.60 and a recall of 0.97. Apart from the PCA polynomial-kernel SVC, whose recall of 1.00 comes with a test score of only 0.57, this is the best recall of all the models.

## DEEP LEARNING

Lastly, I am building a deep learning model. I use Sequential() with three hidden layers (20, 12 and 5 units) and a sigmoid output layer. I run a grid search on KerasClassifier to find the best batch size and number of epochs.

In [441]:
from tensorflow.keras import Sequential
from keras.layers import Dense
import numpy

In [442]:
from keras.wrappers.scikit_learn import KerasClassifier
def create_model():
    #create model
    model = Sequential()
    model.add(Dense(20, input_dim=36, activation='relu'))
    model.add(Dense(12, activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    #compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['Recall'])
    return model

In [443]:
seed = 10
from tensorflow.random import set_seed
np.random.seed(10)
set_seed(10)

model_grid = KerasClassifier(build_fn = create_model, verbose = 0)

param_grid = {'batch_size':[10,20,30,40] , 'epochs':[10, 50, 100]}
grid_search = GridSearchCV(estimator= model_grid, param_grid = param_grid, cv = 5, scoring='recall')

grid_search_result = grid_search.fit(X_train, Y_train)

print(grid_search.best_params_)


{'batch_size': 40, 'epochs': 10}


In [444]:
#fix random seed for reproducibility
numpy.random.seed(10)

set_seed(10)

# create model
model = Sequential()
model.add(Dense(20, input_dim=36, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['Recall', 'accuracy'])
model.fit(X_train, Y_train, epochs= 10, batch_size=10, verbose=0)
scores = model.evaluate(X_test, Y_test)
dl_y_predict = model.predict(X_test)

print('Recall Score : {}'.format(scores[1]))
print('Accuracy  Score : {}'.format(scores[2]))


Recall Score : 0.7631579041481018
Accuracy  Score : 0.6756302714347839


The best parameters from the grid search are batch_size: 40 and epochs: 10. Note that the cell above trains the model with epochs = 10 but batch_size = 10, so the reported accuracy of 0.68 and recall of 0.76 correspond to that setting rather than to batch_size = 40.

Summarizing the Best 10 models : <br>

SVC Kernel RBF : Recall - 0.88  Accuracy - 0.64 <br>
SVC Kernel Poly : Recall - 0.84 Accuracy - 0.65 <br>
**Random Forest : Recall - 0.88 Accuracy - 0.70**<br>

Voting Classifier (Hard) : Recall - 0.81, Accuracy - 0.66 <br>
**Bagging With Decision Tree : Recall - 0.93 Accuracy - 0.62**<br>
Bagging With KNN : Recall - 0.927 Accuracy - 0.63 <br> 
Pasting With Decision Tree : Recall - 0.90 Accuracy - 0.61 <br>
Pasting With KNN : Recall - 0.90 Accuracy - 0.65 <br>

PCA - Decision Tree : Recall - 0.84 Accuracy - 0.60 <br>
**PCA - Random Forest : Recall - 0.97 Accuracy - 0.60**<br>
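Since recall is the metric of interest, it can help to sort the summary by recall and break ties on accuracy. The following is a small sketch that uses the figures listed above; the `summary` and `ranked` names are introduced only for this illustration.

```python
# (model, recall, accuracy) as listed in the summary above.
summary = [
    ('SVC Kernel RBF', 0.88, 0.64),
    ('SVC Kernel Poly', 0.84, 0.65),
    ('Random Forest', 0.88, 0.70),
    ('Voting Classifier (Hard)', 0.81, 0.66),
    ('Bagging With Decision Tree', 0.93, 0.62),
    ('Bagging With KNN', 0.927, 0.63),
    ('Pasting With Decision Tree', 0.90, 0.61),
    ('Pasting With KNN', 0.90, 0.65),
    ('PCA - Decision Tree', 0.84, 0.60),
    ('PCA - Random Forest', 0.97, 0.60),
]

# Highest recall first; accuracy decides between equal recalls.
ranked = sorted(summary, key=lambda row: (-row[1], -row[2]))

for name, rec, acc in ranked[:3]:
    print(f'{name}: recall={rec}, accuracy={acc}')
```

The top three by this ordering are PCA - Random Forest, Bagging With Decision Tree and Bagging With KNN.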

# Best model 
Explain which machine learning model is the best model for this dataset and why? 

Of all the models above, three stand out. <br> Random Forest has a high recall of 88% and the highest accuracy among the summarized models, 70%. 
Bagging with Decision Tree has a higher recall of 93% with a lower accuracy of 62%. <br>
Random Forest on the PCA-reduced dataset gives the highest recall, 97%, with an accuracy of 60%.<br> 

While Random Forest and Bagging with Decision Tree both offer a good combination of recall and accuracy, ***Random Forest With PCA*** has the best recall, which is the metric of interest. Its accuracy is only slightly better than the naive rule, but the cost of wrongly predicting acceptance for the "no"s is not significant. Hence I am choosing Random Forest with PCA applied.

In [445]:
BestModel_RF_PCA = RandomForestClassifier(max_depth= 2, max_features= 2, min_samples_leaf= 5, n_estimators= 200, random_state=0)
BestModel_RF_PCA.fit(X_train_reduced, Y_train)

print("Train score: {:.4f}".format(BestModel_RF_PCA.score(X_train_reduced, Y_train)))
print("Test score: {:.4f}".format(BestModel_RF_PCA.score(X_test_reduced, Y_test)))
pca_RForest_Ypredict = BestModel_RF_PCA.predict(X_test_reduced)
print('Recall :{:.4f}'.format(recall_score(Y_test, pca_RForest_Ypredict)))

Train score: 0.6029
Test score: 0.5983
Recall :0.9737


I will now train the best model on the entire training dataset (from train.csv) and predict the outcome for the test dataset from test.csv. For this, the MinMaxScaler and the PCA are fitted on the training data only and then applied to both the train and test datasets.
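The fit-on-train, transform-on-test pattern matters because the test data must be scaled with the minimum and maximum learned from the training data. The following is a minimal sketch of min-max scaling for a single column in plain Python; `fit_minmax` and `transform_minmax` are hypothetical helpers, and the unseen values 20 and 90 are made up for the example.

```python
def fit_minmax(train_column):
    """Learn the scaling range from the training data only."""
    return min(train_column), max(train_column)

def transform_minmax(column, lo, hi):
    """Scale values with a previously learned range."""
    return [(x - lo) / (hi - lo) for x in column]

# Temperature takes the values 30, 55 and 80 in this dataset.
train_temp = [30, 55, 80]
lo, hi = fit_minmax(train_temp)

print(transform_minmax(train_temp, lo, hi))  # [0.0, 0.5, 1.0]

# Hypothetical unseen values outside the training range fall outside [0, 1].
print(transform_minmax([20, 90], lo, hi))    # [-0.2, 1.2]
```

The same reasoning applies to PCA: the components are learned from the training data and the test data is only projected onto them.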

In [446]:
#final_test_prediction
from sklearn.decomposition import PCA

scaler_full = MinMaxScaler()
train_full_X = scaler_full.fit_transform(X)
test_full_X =  scaler_full.transform(test)

pca = PCA(n_components=0.95)
train_X_reduced = pca.fit_transform(train_full_X)
test_X_reduced = pca.transform(test_full_X)

In [447]:
BestModel = RandomForestClassifier(max_depth= 2, max_features= 2, min_samples_leaf= 5, n_estimators= 200, random_state=0)
BestModel.fit(train_X_reduced,Y)
trainPredict = BestModel.predict(train_X_reduced)
print("Recall Train score: {:.4f}".format(recall_score(Y,trainPredict)))


Recall Train score: 0.9902


Finally, a prediction is made on the test data set using the best model.

In [448]:
final_test_prediction = BestModel.predict(test_X_reduced)
final_test_prediction

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,