# Classification Algorithms and Model Evaluation

In this notebook, we will cover:

* Applying Logistic Regression Algorithm
* Model Evaluation using Confusion Matrix
* Applying KNN Algorithm
* Applying Decision Tree Algorithm
* Applying Random Forest Algorithm
* Final Model Selection
* Submission on Kaggle

Importing all necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score
from sklearn.metrics import recall_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

Setting a global random seed so that results are reproducible

In [2]:
np.random.seed(300)
import random
random.seed(300)

### Loading Data

In [3]:
data = pd.read_csv('train_clean.csv')
df = data.copy()
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Embarked,Title,Group,GrpSize,FareCat,AgeCat
0,1,0,3,male,1,0,S,Mr,2,couple,0-10,16-32
1,2,1,1,female,1,0,C,Mrs,2,couple,70-100,32-48
2,3,1,3,female,0,0,S,Miss,1,solo,0-10,16-32
3,4,1,1,female,1,0,S,Mrs,2,couple,40-70,32-48
4,138,0,1,male,1,0,S,Mr,2,couple,40-70,32-48


### One-Hot Encoding for Categorical Variables

In [4]:
df_OneHot=pd.get_dummies(df,columns=['Pclass','Sex','Embarked','Title','GrpSize','FareCat','AgeCat'])
df_OneHot.head()

Unnamed: 0,PassengerId,Survived,SibSp,Parch,Group,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,...,FareCat_10-25,FareCat_100+,FareCat_25-40,FareCat_40-70,FareCat_70-100,AgeCat_0-16,AgeCat_16-32,AgeCat_32-48,AgeCat_48-64,AgeCat_64+
0,1,0,1,0,2,0,0,1,0,1,...,0,0,0,0,0,0,1,0,0,0
1,2,1,1,0,2,1,0,0,1,0,...,0,0,0,0,1,0,0,1,0,0
2,3,1,0,0,1,0,0,1,1,0,...,0,0,0,0,0,0,1,0,0,0
3,4,1,1,0,2,1,0,0,1,0,...,0,0,0,1,0,0,0,1,0,0
4,138,0,1,0,2,1,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
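
To make the effect of `pd.get_dummies` easier to see, here is a minimal sketch on a toy frame. The values and the names `toy` and `toy_encoded` are hypothetical and are not taken from `train_clean.csv`.

```python
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the encoding.
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "SibSp": [1, 0, 2]})

# Each listed categorical column becomes one indicator column per category;
# columns that are not listed (here SibSp) pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["Sex"])
print(toy_encoded.columns.tolist())
```

The encoded frame keeps `SibSp` and gains `Sex_female` and `Sex_male`, mirroring the `Sex_female` / `Sex_male` columns in the output above.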


In [5]:
df=df_OneHot.copy()

### Creating Independent and Dependent Variables

In [6]:
X = df.drop(['PassengerId','Survived'], axis=1)
Y = df['Survived']

### Train Test Split

In [7]:
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.22)
print(xtrain.shape, ytrain.shape)
print(xtest.shape, ytest.shape)

(694, 31) (694,)
(197, 31) (197,)
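
The split above relies on the global NumPy seed. As an optional variation (an assumption, not what this notebook does), `train_test_split` also accepts `random_state` to pin the split itself and `stratify` to keep the class ratio similar in both parts. A minimal sketch on hypothetical stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 40% positives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# random_state fixes this particular split; stratify keeps the share of
# positives roughly equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.22, random_state=300, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```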


# 1. Logistic Regression

### Creating Model & Training

In [8]:
clf_lr = LogisticRegression().fit(xtrain, ytrain)

### Evaluation

Predicting class labels (0 or 1) with the binary classifier

In [9]:
pred_lr = clf_lr.predict(xtest)

In [10]:
pred_lr[0:9]

array([0, 1, 1, 1, 0, 1, 0, 1, 0], dtype=int64)

Predicting probability of **0** and **1**
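
Class probabilities come from `predict_proba`. The sketch below uses hypothetical one-feature data (`X_demo`, `y_demo`, `clf_demo` are illustrative names) just to show the shape of the result and how it relates to `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data, used only to show the shape of the output.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf_demo = LogisticRegression().fit(X_demo, y_demo)

# One row per sample; columns follow clf_demo.classes_, i.e. P(0) then P(1).
proba_demo = clf_demo.predict_proba(X_demo)

# predict() corresponds to picking the class with the larger probability.
labels_from_proba = clf_demo.classes_[proba_demo.argmax(axis=1)]
print(proba_demo.round(3))
```

In this notebook the analogous call would be `clf_lr.predict_proba(xtest)`.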

### Comparison of Predicted and Actual

In [11]:
xt = xtest.copy()
xt['pred'] = pred_lr
xt['actual'] = ytest
xt.head(10)

Unnamed: 0,SibSp,Parch,Group,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,...,FareCat_25-40,FareCat_40-70,FareCat_70-100,AgeCat_0-16,AgeCat_16-32,AgeCat_32-48,AgeCat_48-64,AgeCat_64+,pred,actual
673,0,0,1,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
297,0,0,1,0,0,1,1,0,0,1,...,0,0,0,1,0,0,0,0,1,1
889,0,0,1,1,0,0,0,1,1,0,...,1,0,0,0,1,0,0,0,1,1
226,1,1,3,0,1,0,0,1,0,0,...,1,0,0,1,0,0,0,0,1,1
167,0,1,2,1,0,0,0,1,1,0,...,0,1,0,0,1,0,0,0,0,1
820,0,0,1,0,0,1,1,0,1,0,...,0,0,0,1,0,0,0,0,1,1
11,0,4,5,0,0,1,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
716,0,0,1,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,0,1,1
335,0,0,1,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
686,0,1,3,0,1,0,1,0,0,0,...,1,0,0,1,0,0,0,0,1,1


### Accuracy

In [12]:
accuracy_lr = accuracy_score(ytest,pred_lr)
print("Accuracy by built-in function: {}".format(accuracy_lr))

Accuracy by built-in function: 0.8274111675126904
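
For comparison with the built-in function, accuracy can also be computed by hand as the share of predictions that match the actual labels. A minimal sketch with hypothetical labels (`actual_demo` and `pred_demo` are illustrative, not the notebook's data):

```python
import numpy as np

# Hypothetical labels, only for illustration.
actual_demo = np.array([1, 1, 0, 0, 1])
pred_demo = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / number of predictions.
accuracy_manual = (actual_demo == pred_demo).mean()
print(accuracy_manual)
```

Here four of the five predictions are correct, so the result is 0.8.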


### Classification Report

In [13]:
print(classification_report(ytest,pred_lr))

             precision    recall  f1-score   support

          0       0.80      0.94      0.87       117
          1       0.88      0.66      0.76        80

avg / total       0.84      0.83      0.82       197



# 2. K Nearest Neighbors (KNN)

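For KNN, we need to standardize the data first, because the algorithm is distance-based and features on larger scales would otherwise dominate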

In [14]:
from sklearn.preprocessing import StandardScaler 

In [15]:
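scaler = StandardScaler()
# Fit on the training split only, so no test-set statistics leak into the scaler
scaler.fit(xtrain)
X_train_ = scaler.transform(xtrain)
X_test_ = scaler.transform(xtest)
# Wrap the scaled arrays back into DataFrames to keep the column names
X_train = pd.DataFrame(data=X_train_, columns=xtrain.columns)
X_test = pd.DataFrame(data=X_test_, columns=xtest.columns)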
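
What `StandardScaler` does can be checked by hand: it subtracts the training mean and divides by the training standard deviation, column by column. A minimal sketch on hypothetical numbers (`train_demo` and `test_demo` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second column is on a much larger scale.
train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test_demo = np.array([[2.0, 200.0]])

scaler_demo = StandardScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
test_scaled = scaler_demo.transform(test_demo)

# Same result by hand: subtract the training mean, divide by the training std.
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
print(test_scaled, manual)
```

After scaling, each training column has mean 0 and standard deviation 1, so no single feature dominates the distance computation.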

Training KNN

In [16]:
clf_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train,ytrain)

In [17]:
pred_knn=clf_knn.predict(X_test)

In [18]:
accuracy_knn = accuracy_score(ytest,pred_knn)
print("Accuracy : {}".format(accuracy_knn))

Accuracy : 0.7969543147208121


In [19]:
print(classification_report(ytest,pred_knn))

             precision    recall  f1-score   support

          0       0.79      0.90      0.84       117
          1       0.81      0.65      0.72        80

avg / total       0.80      0.80      0.79       197
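
The value `n_neighbors=5` is one choice among many. One common way to compare several values, sketched here under the assumption of synthetic data from `make_classification` (not the Titanic features), is cross-validation over a small list of candidates. The names `cv_scores` and `best_k` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=300)

cv_scores = {}
for k in (3, 5, 7, 9, 11):
    # Scaling sits inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```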



# 3. Decision Tree Classifier

Training Decision Tree model

In [20]:
clf_dt = DecisionTreeClassifier(max_depth=4).fit(xtrain,ytrain)

In [21]:
pred_dt = clf_dt.predict(xtest)

In [22]:
accuracy_dt = accuracy_score(ytest,pred_dt)
print("Accuracy: {}".format(accuracy_dt))

Accuracy: 0.8477157360406091


In [23]:
print(classification_report(ytest,pred_dt))

             precision    recall  f1-score   support

          0       0.81      0.97      0.88       117
          1       0.95      0.66      0.78        80

avg / total       0.86      0.85      0.84       197
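
A shallow tree such as `max_depth=4` is small enough to read. As a hedged sketch on synthetic data (the feature names `f0` to `f3` are hypothetical), `sklearn.tree.export_text` prints the learned split rules as text:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data, only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=300)

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=300).fit(X_demo, y_demo)

# Text rendering of the split rules and the class predicted at each leaf.
rules = export_text(tree_demo, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```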



# 4. Random Forest Classifier

In [24]:
clf_rf = RandomForestClassifier(max_depth=4).fit(xtrain,ytrain)

In [25]:
pred_rf = clf_rf.predict(xtest)

In [26]:
accuracy_rf = accuracy_score(ytest,pred_rf)
print("Accuracy: {}".format(accuracy_rf))

Accuracy: 0.8121827411167513


In [27]:
print(classification_report(ytest,pred_rf))

             precision    recall  f1-score   support

          0       0.79      0.92      0.85       117
          1       0.85      0.65      0.74        80

avg / total       0.82      0.81      0.81       197
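
A fitted random forest also exposes `feature_importances_`, which can help show which columns drive the predictions. A minimal sketch on synthetic data (the column names `f0` to `f5` are hypothetical, not the notebook's features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=300)
feature_names = [f"f{i}" for i in range(6)]

rf_demo = RandomForestClassifier(max_depth=4, random_state=300).fit(X_demo, y_demo)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(rf_demo.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```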



# 5. Model Selection

In [28]:
models=pd.DataFrame({'Algorithm Name':['Logistic Regression','KNN','Decision Tree','Random Forest'],
                     'Accuracy':[accuracy_lr,accuracy_knn,accuracy_dt,accuracy_rf]})
models.sort_values('Accuracy',ascending=False,inplace=True)
models

Unnamed: 0,Algorithm Name,Accuracy
2,Decision Tree,0.847716
0,Logistic Regression,0.827411
3,Random Forest,0.812183
1,KNN,0.796954


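Decision Tree has the highest accuracy on this single split. We are nevertheless selecting Random Forest as our final model, because an ensemble of trees is usually less sensitive to the particular train/test split than a single tree.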
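
Accuracy on a single split can shift by a few points with a different seed, so one common way to double-check a ranking like the one in the table is k-fold cross-validation. A minimal sketch on synthetic data (a `make_classification` stand-in, not the Titanic features; `candidates` and `cv_summary` are illustrative names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=300)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=300),
    "Random Forest": RandomForestClassifier(max_depth=4, random_state=300),
}

# Mean accuracy over 5 folds for each candidate model.
cv_summary = {name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
              for name, model in candidates.items()}

for name, score in sorted(cv_summary.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```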

Analyzing confusion matrix of Random Forest

In [29]:
tn, fp, fn, tp = confusion_matrix(ytest, pred_rf).ravel()
conf_matrix=pd.DataFrame({"pred_Not Survived":[tn,fn],"pred_Survived":[fp,tp]},index=["Not Survived","Survived"])
conf_matrix

Unnamed: 0,pred_Not Survived,pred_Survived
Not Survived,108,9
Survived,28,52
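
The figures in the Random Forest classification report can be recovered from these four counts. A short worked sketch, with the counts copied from the matrix above:

```python
# Counts copied from the confusion matrix shown above.
tn, fp, fn, tp = 108, 9, 28, 52

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 160 / 197
precision = tp / (tp + fp)                   # 52 / 61
recall = tp / (tp + fn)                      # 52 / 80
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.8122 0.85 0.65 0.74
```

These match the accuracy printed earlier and the precision, recall, and f1-score reported for class 1.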


# 6. Submission on Kaggle

Importing test data

In [30]:
test = pd.read_csv('test_clean.csv')
df_test = test.copy()
df_test.head()

Unnamed: 0,PassengerId,Pclass,Sex,SibSp,Parch,Embarked,Title,Group,GrpSize,FareCat,AgeCat
0,892,3,male,0,0,Q,Mr,1,solo,0-10,32-48
1,893,3,female,1,0,S,Mrs,2,couple,0-10,32-48
2,894,2,male,0,0,Q,Mr,1,solo,0-10,48-64
3,895,3,male,0,0,S,Mr,1,solo,0-10,16-32
4,896,3,female,1,1,S,Mrs,3,group,10-25,16-32


One Hot encoding of test data

In [31]:
df_OneHot=pd.get_dummies(df_test,columns=['Pclass','Sex','Embarked','Title','GrpSize','FareCat','AgeCat'])
df_OneHot.head()
df_test=df_OneHot.copy()

Separating Passenger ID for submission

In [32]:
PassengerID=df_test['PassengerId']
df_test.drop('PassengerId',axis=1,inplace=True)
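
Because the test data is one-hot encoded separately, a category that is missing from the test file would produce a different set of columns than the one the model was trained on. A hedged sketch of one way to guard against this, using hypothetical miniature frames (`train_demo`, `test_demo`) and `DataFrame.reindex`:

```python
import pandas as pd

# Hypothetical miniature frames; 'Q' never appears in the test rows.
train_demo = pd.get_dummies(pd.DataFrame({"Embarked": ["S", "C", "Q"]}),
                            columns=["Embarked"])
test_demo = pd.get_dummies(pd.DataFrame({"Embarked": ["S", "C"]}),
                           columns=["Embarked"])

# Reorder to the training columns and fill any missing indicator with 0.
test_aligned = test_demo.reindex(columns=train_demo.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

In this notebook the corresponding call would be `df_test.reindex(columns=X.columns, fill_value=0)` before predicting. If the columns already match, it leaves the frame unchanged.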

Prediction through final model

In [33]:
pred_final=clf_rf.predict(df_test)

Creating file for submission

In [34]:
submission=pd.DataFrame({'PassengerId':PassengerID,'Survived':pred_final})

In [35]:
submission.to_csv('my_submission v1.0.csv',index=False)