## __OVERVIEW:__ 
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

## __GOAL:__ 
In this challenge, we are asked to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).
The job is to predict whether or not a passenger survived the sinking of the Titanic.
For each passenger in the test set, we must predict a 0 or 1 value for the `Survived` variable.

### Importing libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')

### Getting the dataset

In [2]:
# train dataset is stored in 'train.csv'
df=pd.read_csv('train.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
df1=pd.read_csv('test.csv')
df1

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


### Data preprocessing

#### Training data

In [4]:
# getting the description of all numerical attributes of the dataset
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
# getting the information regarding the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### Checking for missing values

In [6]:
# checking the count of null values in every column
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
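
The raw null counts are easier to judge as a share of all rows. A minimal sketch, assuming only the counts printed above and the 891 rows reported by `df.info()`; the variable names are illustrative:

```python
import pandas as pd

# Null counts copied from the output above; 891 is the row count from df.info()
null_counts = pd.Series({'Age': 177, 'Cabin': 687, 'Embarked': 2})
n_rows = 891

# Share of missing values per column, as a percentage rounded to one decimal
missing_pct = (null_counts / n_rows * 100).round(1)
print(missing_pct)
```

On the loaded frame itself, `df.isnull().mean() * 100` should give the same percentages without copying any numbers by hand.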

In [7]:
df_train=df.copy()

In [8]:
# dropping identifier-like columns (PassengerId, Name, Ticket), columns with missing values (Age, Cabin, Embarked) and Fare, which leaves a small feature set
df_train.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked','Age'],axis=1,inplace=True)

In [9]:
df_train

Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch
0,0,3,male,1,0
1,1,1,female,1,0
2,1,3,female,0,0
3,1,1,female,1,0
4,0,3,male,0,0
...,...,...,...,...,...
886,0,2,male,0,0
887,1,1,female,0,0
888,0,3,female,1,2
889,1,1,male,0,0


#### Testing data

In [10]:
# getting the description of all numerical attributes of the dataset
df1.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [11]:
# getting the information regarding the dataset
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


#### Checking for missing values

In [12]:
# checking the count of null values in every column
df1.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [13]:
df1_test=df1.copy()

In [14]:
# dropping the same columns as in the training data so that both frames share the same features
df1_test.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked','Age'],axis=1,inplace=True)

In [15]:
df1_test

Unnamed: 0,Pclass,Sex,SibSp,Parch
0,3,male,0,0
1,3,female,1,0
2,2,male,0,0
3,3,male,0,0
4,3,female,1,1
...,...,...,...,...
413,3,male,0,0
414,1,female,0,0
415,3,male,0,0
416,3,male,0,0


#### Checking class imbalance

In [16]:
df_train['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
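
To put these counts in context, the sketch below turns them into class proportions and into the accuracy that a model would reach by always predicting the majority class. It assumes only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {0: 549, 1: 342}
total = sum(counts.values())

# Share of each class in the training data
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(proportions.values())

print({label: round(p, 3) for label, p in proportions.items()})
print("Majority-class baseline accuracy:", round(majority_baseline, 3))
```

Any classifier trained later is only useful if it clearly beats this baseline.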

### Encoding

In [17]:
from sklearn.preprocessing import LabelEncoder

#### Training data

In [18]:
le=LabelEncoder()
# previewing the LabelEncoder codes for the object-type 'Sex' column (female=0, male=1); the result is not stored here
le.fit_transform(df_train['Sex'])

array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0,

In [19]:
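# re-fitting the encoder on every column: this encodes 'Sex' and also shifts 'Pclass' from 1-3 to 0-2
df_train = df_train.apply(le.fit_transform)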

In [20]:
df_train

Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch
0,0,2,1,1,0
1,1,0,0,1,0
2,1,2,0,0,0
3,1,0,0,1,0
4,0,2,1,0,0
...,...,...,...,...,...
886,0,1,1,0,0
887,1,0,0,0,0
888,0,2,0,1,2
889,1,0,1,0,0


#### Testing data

In [21]:
le1=LabelEncoder()
# previewing the LabelEncoder codes for the object-type 'Sex' column of the test data; the result is not stored here
le1.fit_transform(df1_test['Sex'])

array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,

In [22]:
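# encoding every test column; note that the encoder is re-fit on the test data rather than reusing the training fit
df1_test=df1_test.apply(le1.fit_transform)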

In [23]:
df1_test

Unnamed: 0,Pclass,Sex,SibSp,Parch
0,2,1,0,0
1,2,0,1,0
2,1,1,0,0
3,2,1,0,0
4,2,0,1,1
...,...,...,...,...
413,2,1,0,0
414,0,0,0,0
415,2,1,0,0
416,2,1,0,0
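
Fitting a separate encoder on the test data can assign a different code to the same raw value whenever the two frames contain different sets of values (for example, `Parch` reaches 9 in the test data but only 6 in the training data). The sketch below shows one possible alternative, not the approach used in this notebook: learn the mapping from the training column only and reuse it. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Tiny stand-ins for a training and a test column (illustrative values only)
train_parch = pd.Series([0, 1, 2, 6])
test_parch = pd.Series([0, 2, 9])

# Learn the value -> code mapping from the training data only,
# using sorted order as LabelEncoder does
mapping = {value: code for code, value in enumerate(sorted(train_parch.unique()))}

# Reusing the training mapping: values never seen in training (here 9) become NaN
# instead of silently receiving a code that means something else
test_codes = test_parch.map(mapping)

# Re-fitting on the test column alone gives the raw value 2 a different code
refit = {value: code for code, value in enumerate(sorted(test_parch.unique()))}

print("code for raw value 2 with training mapping:", mapping[2])
print("code for raw value 2 after re-fitting on test:", refit[2])
print(test_codes.tolist())
```

The NaN entries make unseen categories visible, so they can be handled explicitly before prediction.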


### Model building

In [24]:
# dropping the target in place; drop(..., inplace=True) returns None, so its result is not assigned to 'x'
df_train.drop('Survived',axis=1,inplace=True)
# the dependent variable is stored in 'y'
y=df['Survived']

In [25]:
x=df_train.copy()

In [26]:
x

Unnamed: 0,Pclass,Sex,SibSp,Parch
0,2,1,1,0
1,0,0,1,0
2,2,0,0,0
3,0,0,1,0
4,2,1,0,0
...,...,...,...,...
886,1,1,0,0
887,0,0,0,0
888,2,0,1,2
889,0,1,0,0


In [27]:
y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

#### Train test split

In [28]:
from sklearn.model_selection import train_test_split
# using 'stratify' so that the train and test parts keep the same class proportions as the dependent variable
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2,random_state=29,stratify=y)

In [29]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(712, 4) (179, 4) (712,) (179,)
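
The effect of `stratify` can be checked on synthetic labels. This is a minimal sketch, assuming only the class counts of `Survived` (549 zeros, 342 ones) and the same `test_size` and `random_state` as above; `x_demo` and `y_demo` are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as 'Survived'
y_demo = np.array([0] * 549 + [1] * 342)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=29, stratify=y_demo
)

# Both parts keep roughly the original 38% share of survivors
print("train size:", len(y_tr), "share of class 1:", round(y_tr.mean(), 3))
print("test size:", len(y_te), "share of class 1:", round(y_te.mean(), 3))
```

Stratification keeps the split representative; it does not rebalance the classes, so the majority class still dominates both parts.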


In [30]:
x_train

Unnamed: 0,Pclass,Sex,SibSp,Parch
284,0,1,0,0
861,1,1,1,0
190,1,0,0,0
43,1,0,1,2
741,0,1,1,0
...,...,...,...,...
171,2,1,4,1
673,1,1,0,0
629,2,1,0,0
872,0,1,0,0


In [31]:
x_test

Unnamed: 0,Pclass,Sex,SibSp,Parch
701,0,1,0,0
373,0,1,0,0
791,1,1,0,0
78,1,1,0,2
864,1,1,0,0
...,...,...,...,...
825,2,1,0,0
875,2,0,0,0
384,2,1,0,0
250,2,1,0,0


In [32]:
y_train

284    0
861    0
190    1
43     1
741    0
      ..
171    0
673    1
629    0
872    0
637    0
Name: Survived, Length: 712, dtype: int64

In [33]:
y_test

701    1
373    0
791    0
78     1
864    0
      ..
825    0
875    1
384    0
250    0
432    1
Name: Survived, Length: 179, dtype: int64

#### MODEL-1
#### Logistic Regression

In [34]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression

In [35]:
lr=LogisticRegression()
lr.fit(x_train,y_train)
lr_pred_train=lr.predict(x_train)
lr_pred_test=lr.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,lr_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,lr_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,lr_pred_train))
print()
print("Test classification report\n",classification_report(y_test,lr_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,lr_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,lr_pred_test))

Train confusion matrix
 [[388  51]
 [ 91 182]]

Test confusion matrix
 [[97 13]
 [21 48]]

Train classification report
               precision    recall  f1-score   support

           0       0.81      0.88      0.85       439
           1       0.78      0.67      0.72       273

    accuracy                           0.80       712
   macro avg       0.80      0.78      0.78       712
weighted avg       0.80      0.80      0.80       712


Test classification report
               precision    recall  f1-score   support

           0       0.82      0.88      0.85       110
           1       0.79      0.70      0.74        69

    accuracy                           0.81       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179


Train accuracy score 0.800561797752809

Test accuracy score 0.8100558659217877
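
The reported scores follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class 1 precision, recall and F1 from the test confusion matrix above, using scikit-learn's layout (rows are true classes, columns are predicted classes). The variable names are illustrative.

```python
# Test confusion matrix printed above: [[97, 13], [21, 48]]
tn, fp = 97, 13   # true class 0: predicted 0, predicted 1
fn, tp = 21, 48   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(accuracy, 4))
print("precision:", round(precision, 2))
print("recall   :", round(recall, 2))
print("f1-score :", round(f1, 2))
```

The same arithmetic applies to every confusion matrix in the sections that follow.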


#### MODEL-2
#### Decision Tree Classifier

In [36]:
from sklearn.tree import DecisionTreeClassifier

In [37]:
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)
dt_pred_train=dt.predict(x_train)
dt_pred_test=dt.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,dt_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,dt_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,dt_pred_train))
print()
print("Test classification report\n",classification_report(y_test,dt_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,dt_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,dt_pred_test))

Train confusion matrix
 [[407  32]
 [ 97 176]]

Test confusion matrix
 [[95 15]
 [24 45]]

Train classification report
               precision    recall  f1-score   support

           0       0.81      0.93      0.86       439
           1       0.85      0.64      0.73       273

    accuracy                           0.82       712
   macro avg       0.83      0.79      0.80       712
weighted avg       0.82      0.82      0.81       712


Test classification report
               precision    recall  f1-score   support

           0       0.80      0.86      0.83       110
           1       0.75      0.65      0.70        69

    accuracy                           0.78       179
   macro avg       0.77      0.76      0.76       179
weighted avg       0.78      0.78      0.78       179


Train accuracy score 0.8188202247191011

Test accuracy score 0.7821229050279329


#### MODEL-3
#### Random Forest Classifier

In [38]:
from sklearn.ensemble import RandomForestClassifier

In [39]:
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
rf_pred_train=rf.predict(x_train)
rf_pred_test=rf.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,rf_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,rf_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,rf_pred_train))
print()
print("Test classification report\n",classification_report(y_test,rf_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,rf_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,rf_pred_test))

Train confusion matrix
 [[400  39]
 [ 90 183]]

Test confusion matrix
 [[94 16]
 [21 48]]

Train classification report
               precision    recall  f1-score   support

           0       0.82      0.91      0.86       439
           1       0.82      0.67      0.74       273

    accuracy                           0.82       712
   macro avg       0.82      0.79      0.80       712
weighted avg       0.82      0.82      0.81       712


Test classification report
               precision    recall  f1-score   support

           0       0.82      0.85      0.84       110
           1       0.75      0.70      0.72        69

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179


Train accuracy score 0.8188202247191011

Test accuracy score 0.7932960893854749


#### MODEL-4
#### XGBoost Classifier

In [40]:
from xgboost import XGBClassifier

In [41]:
xg=XGBClassifier()
xg.fit(x_train,y_train)
xg_pred_train=xg.predict(x_train)
xg_pred_test=xg.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,xg_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,xg_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,xg_pred_train))
print()
print("Test classification report\n",classification_report(y_test,xg_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,xg_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,xg_pred_test))

Train confusion matrix
 [[404  35]
 [ 95 178]]

Test confusion matrix
 [[94 16]
 [23 46]]

Train classification report
               precision    recall  f1-score   support

           0       0.81      0.92      0.86       439
           1       0.84      0.65      0.73       273

    accuracy                           0.82       712
   macro avg       0.82      0.79      0.80       712
weighted avg       0.82      0.82      0.81       712


Test classification report
               precision    recall  f1-score   support

           0       0.80      0.85      0.83       110
           1       0.74      0.67      0.70        69

    accuracy                           0.78       179
   macro avg       0.77      0.76      0.77       179
weighted avg       0.78      0.78      0.78       179


Train accuracy score 0.8174157303370787

Test accuracy score 0.7821229050279329


#### MODEL-5
#### Support Vector Machine

In [42]:
from sklearn.svm import SVC

In [43]:
svm=SVC()
svm.fit(x_train,y_train)
svm_pred_train=svm.predict(x_train)
svm_pred_test=svm.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,svm_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,svm_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,svm_pred_train))
print()
print("Test classification report\n",classification_report(y_test,svm_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,svm_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,svm_pred_test))

Train confusion matrix
 [[395  44]
 [ 90 183]]

Test confusion matrix
 [[98 12]
 [20 49]]

Train classification report
               precision    recall  f1-score   support

           0       0.81      0.90      0.85       439
           1       0.81      0.67      0.73       273

    accuracy                           0.81       712
   macro avg       0.81      0.79      0.79       712
weighted avg       0.81      0.81      0.81       712


Test classification report
               precision    recall  f1-score   support

           0       0.83      0.89      0.86       110
           1       0.80      0.71      0.75        69

    accuracy                           0.82       179
   macro avg       0.82      0.80      0.81       179
weighted avg       0.82      0.82      0.82       179


Train accuracy score 0.8117977528089888

Test accuracy score 0.8212290502793296


#### MODEL-6
#### K Nearest Neighbors

In [44]:
from sklearn.neighbors import KNeighborsClassifier

In [45]:
knn=KNeighborsClassifier()
knn.fit(x_train,y_train)
knn_pred_train=knn.predict(x_train)
knn_pred_test=knn.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,knn_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,knn_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,knn_pred_train))
print()
print("Test classification report\n",classification_report(y_test,knn_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,knn_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,knn_pred_test))

Train confusion matrix
 [[392  47]
 [ 90 183]]

Test confusion matrix
 [[97 13]
 [21 48]]

Train classification report
               precision    recall  f1-score   support

           0       0.81      0.89      0.85       439
           1       0.80      0.67      0.73       273

    accuracy                           0.81       712
   macro avg       0.80      0.78      0.79       712
weighted avg       0.81      0.81      0.80       712


Test classification report
               precision    recall  f1-score   support

           0       0.82      0.88      0.85       110
           1       0.79      0.70      0.74        69

    accuracy                           0.81       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179


Train accuracy score 0.8075842696629213

Test accuracy score 0.8100558659217877


#### MODEL-7
#### Naive Bayes Classifier

In [46]:
from sklearn.naive_bayes import BernoulliNB

In [47]:
nb=BernoulliNB()
nb.fit(x_train,y_train)
nb_pred_train=nb.predict(x_train)
nb_pred_test=nb.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,nb_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,nb_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,nb_pred_train))
print()
print("Test classification report\n",classification_report(y_test,nb_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,nb_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,nb_pred_test))

Train confusion matrix
 [[371  68]
 [ 84 189]]

Test confusion matrix
 [[92 18]
 [20 49]]

Train classification report
               precision    recall  f1-score   support

           0       0.82      0.85      0.83       439
           1       0.74      0.69      0.71       273

    accuracy                           0.79       712
   macro avg       0.78      0.77      0.77       712
weighted avg       0.78      0.79      0.79       712


Test classification report
               precision    recall  f1-score   support

           0       0.82      0.84      0.83       110
           1       0.73      0.71      0.72        69

    accuracy                           0.79       179
   macro avg       0.78      0.77      0.77       179
weighted avg       0.79      0.79      0.79       179


Train accuracy score 0.7865168539325843

Test accuracy score 0.7877094972067039


#### MODEL-8
#### Stacking method

In [48]:
from sklearn.ensemble import StackingClassifier

In [49]:
estimator_models=[('Logistic Regression',LogisticRegression()),
                  ('Decision Tree',DecisionTreeClassifier()),
                  ('Random Forest',RandomForestClassifier()),
                  ('XG Boost',XGBClassifier()),
                  ('KNN',KNeighborsClassifier()),
                  ('Naive Bayes',BernoulliNB())]

sc=StackingClassifier(estimators=estimator_models,final_estimator=SVC(),cv=10)
sc.fit(x_train,y_train)
sc_pred_train=sc.predict(x_train)
sc_pred_test=sc.predict(x_test)

# Confusion matrix
print("Train confusion matrix\n",confusion_matrix(y_train,sc_pred_train))
print()
print("Test confusion matrix\n",confusion_matrix(y_test,sc_pred_test))
print()

# Classification report
print("Train classification report\n",classification_report(y_train,sc_pred_train))
print()
print("Test classification report\n",classification_report(y_test,sc_pred_test))
print()

# Accuracy score
print("Train accuracy score",accuracy_score(y_train,sc_pred_train))
print()
print("Test accuracy score",accuracy_score(y_test,sc_pred_test))

Train confusion matrix
 [[395  44]
 [ 89 184]]

Test confusion matrix
 [[98 12]
 [20 49]]

Train classification report
               precision    recall  f1-score   support

           0       0.82      0.90      0.86       439
           1       0.81      0.67      0.73       273

    accuracy                           0.81       712
   macro avg       0.81      0.79      0.80       712
weighted avg       0.81      0.81      0.81       712


Test classification report
               precision    recall  f1-score   support

           0       0.83      0.89      0.86       110
           1       0.80      0.71      0.75        69

    accuracy                           0.82       179
   macro avg       0.82      0.80      0.81       179
weighted avg       0.82      0.82      0.82       179


Train accuracy score 0.8132022471910112

Test accuracy score 0.8212290502793296
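
With all eight models evaluated, the accuracy scores can be lined up side by side. The sketch below uses only the train and test accuracies printed in the sections above, rounded to four decimals; the dictionary and variable names are illustrative.

```python
# (train accuracy, test accuracy) copied from the outputs above
scores = {
    'Logistic Regression': (0.8006, 0.8101),
    'Decision Tree': (0.8188, 0.7821),
    'Random Forest': (0.8188, 0.7933),
    'XGBoost': (0.8174, 0.7821),
    'SVM': (0.8118, 0.8212),
    'KNN': (0.8076, 0.8101),
    'Naive Bayes': (0.7865, 0.7877),
    'Stacking': (0.8132, 0.8212),
}

# Sort by test accuracy, highest first, and show the train-test gap
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, test_acc) in ranked:
    gap = train_acc - test_acc
    print(f"{name:<20} train={train_acc:.4f} test={test_acc:.4f} gap={gap:+.4f}")
```

SVM and the stacking ensemble tie on the hold-out set with identical test confusion matrices. Since the hold-out set has 179 rows, one passenger changes accuracy by about 0.0056, so small differences between models should be read with caution.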


### Prediction using testing data

In [50]:
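# predicting survival for the encoded test data with the stacking model
prediction=sc.predict(df1_test)
# pairing each prediction with the PassengerId from the original test frame
output=pd.DataFrame({'PassengerId': df1.PassengerId, 'Survived': prediction})
pd.set_option('display.max_rows', 500)
display(output)
# saving the predictions without the index column
output.to_csv('Titanic survival output.csv', index=False)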

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0
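
Before uploading, the output frame can be sanity-checked. The helper below is a hypothetical sketch, not part of the original notebook: `check_submission` is an illustrative name, and the expected 418 rows come from `df1.info()` above.

```python
import pandas as pd

def check_submission(frame, expected_rows=418):
    """Return a list of problems found in a submission-style frame (empty if none)."""
    if list(frame.columns) != ['PassengerId', 'Survived']:
        # Stop early: the remaining checks need both columns
        return ['unexpected columns']
    problems = []
    if len(frame) != expected_rows:
        problems.append('unexpected row count')
    if not frame['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    if frame['PassengerId'].duplicated().any():
        problems.append('duplicate PassengerId')
    return problems

# Small stand-in built from the first rows shown above
demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
print(check_submission(demo, expected_rows=3))
```

In the notebook, `check_submission(output)` would be the corresponding call on the real predictions.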
