#**Assignment – Model Selection**


Do the following on titanic dataset
1. Load the dataset into python environment
2. Do all the necessary pre-processing steps
3. Create kNN and SVM models
4. Do k-fold and stratified stratified k-fold cross validation techniques and find the
average accuracy score of the models

##**Import Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


##**Load the Dataset**

In [2]:
data=pd.read_csv("/content/titanic_dataset .csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


##**Exploratory Data Analysis**

In [3]:
# Check the info to get basic details about the data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
# Dimension of the dataset

data.shape

(891, 12)

In [5]:
# check for null values

data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# Frequency distribution of values in variables

for i in data.columns:
  print(data[i].value_counts())

1      1
599    1
588    1
589    1
590    1
      ..
301    1
302    1
303    1
304    1
891    1
Name: PassengerId, Length: 891, dtype: int64
0    549
1    342
Name: Survived, dtype: int64
3    491
1    216
2    184
Name: Pclass, dtype: int64
Braund, Mr. Owen Harris                     1
Boulos, Mr. Hanna                           1
Frolicher-Stehli, Mr. Maxmillian            1
Gilinski, Mr. Eliezer                       1
Murdlin, Mr. Joseph                         1
                                           ..
Kelly, Miss. Anna Katherine "Annie Kate"    1
McCoy, Mr. Bernard                          1
Johnson, Mr. William Cahoone Jr             1
Keane, Miss. Nora A                         1
Dooley, Mr. Patrick                         1
Name: Name, Length: 891, dtype: int64
male      577
female    314
Name: Sex, dtype: int64
24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: Age, Length: 88, dtyp

In [7]:
# Drop passengerid and Name these are unique in each row

data.drop(['PassengerId','Name','Ticket'],axis=1,inplace=True)

In [8]:
# Summary of the numeric columns of the dataset

data.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


##**Handling Missing Values**

In [9]:
# More than 60 pecent of the data is missing in 'Cabin'

data.drop('Cabin',axis=1,inplace=True)

In [10]:
#Check the skewness of the variable 'Age'

data['Age'].skew()

0.38910778230082704

In [11]:
# Fill the missing values of the variable 'Age' with median

data['Age']=data['Age'].fillna(data['Age'].median())

In [12]:
# fill the missing vlues of the variable 'Embarked' with mode

data['Embarked']=data['Embarked'].fillna(data['Embarked'].mode()[0])

In [13]:
# Check whether all missing values are handled.

data.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

##**Encoding**

In [14]:
# One hot encoding

data=pd.get_dummies(data,columns=['Sex','Embarked'])
data.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,0,1,0,0,1
1,1,1,38.0,1,0,71.2833,1,0,1,0,0
2,1,3,26.0,0,0,7.925,1,0,0,0,1
3,1,1,35.0,1,0,53.1,1,0,0,0,1
4,0,3,35.0,0,0,8.05,0,1,0,0,1


**Declare feature vector and target variable**

In [15]:
y=data['Survived']
X=data.drop('Survived',axis=1)

**Scaling**

In [16]:
# Standard Scaler

from sklearn.preprocessing import StandardScaler
Scaler=StandardScaler()

Scaled_X=Scaler.fit_transform(X)

**Split data into separate training and test set**

In [17]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(Scaled_X,y,random_state=42,test_size=0.25)

In [18]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape,y_train.shape,y_test.shape

((668, 10), (223, 10), (668,), (223,))

##**KNN**

In [19]:
from sklearn.neighbors import KNeighborsClassifier
knn_model=KNeighborsClassifier()

In [20]:
knn_model.fit(X_train,y_train)
knn_pred=knn_model.predict(X_test)

In [21]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
print('Accuracy:',accuracy_score(y_test,knn_pred))
print('Precision:',precision_score(y_test,knn_pred))
print('recall:',recall_score(y_test,knn_pred))
print('F1:',f1_score(y_test,knn_pred))


Accuracy: 0.8116591928251121
Precision: 0.7764705882352941
recall: 0.7415730337078652
F1: 0.7586206896551726


##**SVM**

In [22]:
from sklearn.svm import SVC
svm_model=SVC()

In [23]:
svm_model.fit(X_train,y_train)
svm_pred=svm_model.predict(X_test)

In [24]:
print('Accuracy:',accuracy_score(y_test,svm_pred))
print('Precision:',precision_score(y_test,svm_pred))
print('recall:',recall_score(y_test,svm_pred))
print('F1:',f1_score(y_test,svm_pred))


Accuracy: 0.8251121076233184
Precision: 0.8205128205128205
recall: 0.7191011235955056
F1: 0.7664670658682635


##**KFold**

In [25]:
from sklearn.model_selection  import KFold
kfold_validator=KFold(n_splits=10,shuffle=True,random_state=42)

In [26]:
for train_index,test_index in kfold_validator.split(Scaled_X,y):
  print('Training Index:',train_index)
  print('Testing Index:',test_index)

Training Index: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  24  25  26  27  28  29  31  32  33  34  35  36  37
  38  40  41  42  43  45  46  47  48  49  50  51  52  53  54  55  56  57
  58  59  60  61  62  64  65  68  69  71  73  74  75  76  77  78  79  80
  81  82  83  84  85  87  88  89  90  91  92  93  94  95  96  97  98  99
 100 101 102 103 104 105 106 107 108 109 111 112 113 114 115 116 117 118
 119 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 138 139
 140 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158
 159 160 161 162 163 164 165 166 167 169 170 171 172 173 175 176 177 178
 179 180 181 182 183 184 185 186 187 188 189 190 191 193 194 195 197 199
 200 201 202 203 205 206 207 209 212 213 214 216 217 218 219 220 221 222
 223 224 225 226 227 228 229 230 231 232 233 234 236 237 238 239 240 241
 242 243 245 246 247 248 249 251 252 253 255 256 257 258 259 260 261 262
 263 264 265 267 268 269 270 271 27

In [27]:
from sklearn.model_selection import cross_val_score

knn_score=cross_val_score(knn_model,Scaled_X,y,cv=kfold_validator)
svm_score=cross_val_score(svm_model,Scaled_X,y,cv=kfold_validator)

In [28]:
knn_score

array([0.81111111, 0.80898876, 0.79775281, 0.76404494, 0.83146067,
       0.83146067, 0.82022472, 0.73033708, 0.71910112, 0.87640449])

In [29]:
svm_score

array([0.83333333, 0.80898876, 0.82022472, 0.78651685, 0.86516854,
       0.88764045, 0.82022472, 0.78651685, 0.75280899, 0.8988764 ])

**Stratified KFold**

In [30]:
from sklearn.model_selection import StratifiedKFold
strat_validator=StratifiedKFold(n_splits=10,shuffle=True,random_state=42)

In [31]:
for train_index,test_index in strat_validator.split(Scaled_X,y):
  print('Training Index:',train_index)
  print('Testing Index:',test_index)

Training Index: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  25  26  27  29  30  32  33  35  36  37  38  39
  40  41  42  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58
  59  60  61  62  63  65  66  67  69  70  71  72  73  74  75  76  77  78
  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
  97  98  99 100 101 102 104 105 106 107 109 110 111 112 113 114 115 116
 117 118 119 120 121 122 123 124 125 126 127 129 130 131 132 133 134 135
 136 137 138 139 140 142 144 145 149 151 152 153 154 155 156 157 158 159
 160 161 162 163 165 166 167 168 169 171 172 175 177 178 179 180 181 182
 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218
 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 236 237
 238 239 240 241 242 243 244 245 246 247 248 250 252 253 254 255 256 257
 258 259 260 261 262 263 264 265 26

In [32]:
from sklearn.model_selection import cross_val_score

strat_knn_score=cross_val_score(knn_model,Scaled_X,y,cv=strat_validator)
strat_svm_score=cross_val_score(knn_model,Scaled_X,y,cv=strat_validator)

In [33]:
strat_knn_score

array([0.82222222, 0.79775281, 0.76404494, 0.79775281, 0.78651685,
       0.78651685, 0.84269663, 0.83146067, 0.83146067, 0.80898876])

In [34]:
strat_svm_score

array([0.82222222, 0.79775281, 0.76404494, 0.79775281, 0.78651685,
       0.78651685, 0.84269663, 0.83146067, 0.83146067, 0.80898876])

**Average Accuracy Scores**

In [35]:
print("Accuracy score for KNN:",accuracy_score(y_test,knn_pred))
print("Accuracy score for SVM:",accuracy_score(y_test,svm_pred),'\n')

print("Accuracy score for KNN(KFold):",knn_score.mean())
print("Accuracy score for SVM (KFold):",svm_score.mean(),'\n')

print("Accuracy score for KNN(Stratified KFold):",strat_knn_score.mean())
print("Accuracy score for SVM (Stratified KFold):",strat_svm_score.mean())

Accuracy score for KNN: 0.8116591928251121
Accuracy score for SVM: 0.8251121076233184 

Accuracy score for KNN(KFold): 0.7990886392009988
Accuracy score for SVM (KFold): 0.8260299625468164 

Accuracy score for KNN(Stratified KFold): 0.8069413233458176
Accuracy score for SVM (Stratified KFold): 0.8069413233458176
