# Titanic Survivor prediction problem 🚣🏻 🚢

## 1. Problem

Predict whether each passenger survived the sinking of the Titanic.

## 2. Data

The data comes from Kaggle's "Titanic: Machine Learning from Disaster" competition.

https://www.kaggle.com/c/titanic/overview/description

## 3. Evaluation

Predictions are evaluated on accuracy: the fraction of passengers whose outcome is predicted correctly.
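
As a minimal sketch of what this metric measures, the snippet below computes accuracy by hand on made-up labels. The helper `accuracy` and the demo lists are hypothetical and only illustrate the definition; the notebook itself uses `sklearn.metrics.accuracy_score` later on.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Hypothetical labels for five passengers (1 = survived, 0 = did not survive)
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)  # 4 of the 5 predictions are correct
```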

## 4. Features

* Survival - 0 if the passenger did not survive, 1 if they survived
* pclass - Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
* Sex
* Age
* sibsp - # of siblings / spouses aboard the Titanic
* parch - # of parents / children aboard the Titanic
* ticket - Ticket number
* fare - Passenger fare
* cabin - Cabin number
* embarked - Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
passengers = pd.read_csv('drive/My Drive/titanic/train.csv')
passengers.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
passengers.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

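We'll drop passenger id, name and ticket, since they are different for nearly every passenger. `Survived` is the target, so it is separated out as `y`. Cabin is kept for now and later converted into an indicator of whether a cabin was recorded.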

In [3]:
x = passengers.drop(['PassengerId','Name','Ticket','Survived'],axis=1)
y = passengers['Survived']

In [None]:
x.head()

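| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |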


In [None]:
x.isna().sum()

Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [4]:
cabin = [0 if str(i)=='nan' else 1 for i in x.Cabin]
x.Cabin=cabin
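
The list comprehension above turns Cabin into a 0/1 flag for "cabin recorded". A possible vectorized equivalent, shown here on a small made-up frame, uses `notna()`; the frame `cabin_demo` and the column name `HasCabin` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the Cabin column
cabin_demo = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'C123']})

# 1 where a cabin is recorded, 0 where it is missing
cabin_demo['HasCabin'] = cabin_demo['Cabin'].notna().astype(int)
print(cabin_demo)
```

Unlike comparing `str(i)` with `'nan'`, this checks for missing values directly rather than relying on their string representation.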

In [None]:
x.isna().sum()

Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin         0
Embarked      2
dtype: int64

Now only Age and Embarked have missing values left.
Looking at the value counts shows how the values are distributed.

If the values are skewed or contain outliers, we'll impute with the median; otherwise the mean works just as well.
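
To illustrate why the median is the safer choice when values are spread unevenly, here is a small sketch on made-up ages (the series `ages_demo` is hypothetical). A single large value pulls the mean up noticeably while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: five close values, one outlier and one missing entry
ages_demo = pd.Series([22, 24, 26, 28, 30, 80, np.nan])

mean_age = ages_demo.mean()      # missing values are skipped by default
median_age = ages_demo.median()

print(mean_age, median_age)      # 35.0 27.0

# Fill the missing entry with the median
filled = ages_demo.fillna(median_age)
```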

In [None]:
x['Age'].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype: int64

In [None]:
x.Age.median()

28.0

In [None]:
x.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [None]:
x.dtypes

Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Cabin         int64
Embarked     object
dtype: object

In [6]:
# Separating data into train and validation sets before preprocessing
from sklearn.model_selection import train_test_split

xtrain,xvalid,ytrain,yvalid = train_test_split(x,y,test_size=0.2)
xtrain.shape, xvalid.shape, ytrain.shape, yvalid.shape

((712, 8), (179, 8), (712,), (179,))
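
The split above is random, so the reported accuracies will change from run to run. One option, sketched here on made-up data, is to pass `random_state` for a repeatable split and `stratify` to keep the class balance the same in both parts. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% positive class
X_demo = pd.DataFrame({'feature': np.arange(100)})
y_demo = pd.Series([0] * 60 + [1] * 40)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_va.shape)
print(y_va.mean())  # 0.4, the same share of positives as in the full data

# Repeating the call with the same random_state gives the same rows
X_tr2, X_va2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
same_split = X_va.index.equals(X_va2.index)
```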

In [None]:
xtrain.head()

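| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 116 | 3 | male | 70.5 | 0 | 0 | 7.75 | 0 | Q |
| 562 | 2 | male | 28.0 | 0 | 0 | 13.5 | 0 | S |
| 333 | 3 | male | 16.0 | 2 | 0 | 18.0 | 0 | S |
| 245 | 1 | male | 44.0 | 2 | 0 | 90.0 | 1 | Q |
| 114 | 3 | female | 17.0 | 0 | 0 | 14.4583 | 0 | C |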


In [None]:
xvalid.head()

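| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 314 | 2 | male | 43.0 | 1 | 1 | 26.25 | 0 | S |
| 53 | 2 | female | 29.0 | 1 | 0 | 26.0 | 0 | S |
| 726 | 2 | female | 30.0 | 3 | 0 | 21.0 | 0 | S |
| 764 | 3 | male | 16.0 | 0 | 0 | 7.775 | 0 | S |
| 463 | 2 | male | 48.0 | 0 | 0 | 13.0 | 0 | S |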


In [5]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [6]:
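def preprocess_data(data,categories,numbers):
  # Note: a new transformer is fitted on every call, so each dataset passed in
  # is imputed, scaled and encoded using its own statistics and categories.

  # Numeric columns: fill missing values with the median, then standardize
  num_impute = Pipeline(steps = [('simpleImpute',SimpleImputer(strategy = 'median')),
                                 ('standardScale',StandardScaler())])

  # Categorical columns: fill missing values with the label 'missing', then one-hot encode
  one_hot=OneHotEncoder()
  cat_impute = Pipeline(steps = [('simpleImputer',SimpleImputer(strategy = 'constant',fill_value='missing')),
                                 ('one_hot',one_hot)])

  # Columns not listed in either group are passed through unchanged
  transformer = ColumnTransformer(transformers = [('num_impute',num_impute,numbers),
                                                  ('cat_impute',cat_impute,categories)],remainder = 'passthrough')

  return transformer.fit_transform(data)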
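
Because `preprocess_data` calls `fit_transform` on whatever it receives, the validation and test sets end up scaled and encoded with their own medians, means and categories rather than those learned from the training data. A common alternative, sketched below under assumptions, is to fit the transformer once on the training rows and only call `transform` on everything else. The helper `build_preprocessor` and the two demo frames are hypothetical; `handle_unknown='ignore'` and `sparse_threshold=0` are choices made for this sketch, not settings taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(categories, numbers):
    """Return an unfitted transformer (hypothetical helper for this sketch)."""
    num_pipe = Pipeline(steps=[('impute', SimpleImputer(strategy='median')),
                               ('scale', StandardScaler())])
    cat_pipe = Pipeline(steps=[('impute', SimpleImputer(strategy='constant',
                                                        fill_value='missing')),
                               # Categories unseen during fit become all-zero rows
                               ('one_hot', OneHotEncoder(handle_unknown='ignore'))])
    # sparse_threshold=0 makes the output a dense NumPy array
    return ColumnTransformer(transformers=[('num', num_pipe, numbers),
                                           ('cat', cat_pipe, categories)],
                             sparse_threshold=0)

# Made-up rows: the training frame has a missing port, the validation frame does not
train_demo = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 54.0],
                           'Sex': ['male', 'female', 'female', 'male'],
                           'Embarked': ['S', 'C', np.nan, 'Q']})
valid_demo = pd.DataFrame({'Age': [np.nan, 40.0],
                           'Sex': ['female', 'male'],
                           'Embarked': ['S', 'S']})

pre = build_preprocessor(categories=['Sex', 'Embarked'], numbers=['Age'])
train_arr = pre.fit_transform(train_demo)  # learn medians, scaling and categories
valid_arr = pre.transform(valid_demo)      # reuse them without refitting

print(train_arr.shape, valid_arr.shape)    # same number of columns in both
```

With this pattern the encoded columns always appear in the same order and count, so no placeholder columns are needed to line up the arrays.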

In [None]:
xtrain = preprocess_data(xtrain,['Sex','Embarked'],['Pclass','Age','SibSp','Parch','Fare','Cabin'])

In [None]:
xvalid = preprocess_data(xvalid,['Sex','Embarked'],['Pclass','Age','SibSp','Parch','Fare','Cabin'])

In [None]:
Lreg = LogisticRegression(max_iter = 1000).fit(xtrain,ytrain)

In [None]:
RForest = RandomForestClassifier(n_estimators = 50).fit(xtrain,ytrain)

In [None]:
KNN = KNeighborsClassifier().fit(xtrain,ytrain)

In [7]:
from sklearn.metrics import accuracy_score

In [None]:
xtrain[0]

array([ 0.84265733,  3.14859858, -0.461893  , -0.45729619, -0.4795063 ,
       -0.54488848,  0.        ,  1.        ,  0.        ,  1.        ,
        0.        ,  0.        ])

In [None]:
xvalid[0]

array([-0.42765617,  1.0939753 ,  0.41738809,  0.86540649, -0.09418593,
       -0.54507013,  0.        ,  1.        ,  0.        ,  0.        ,
        1.        ,  0.        ])

In [None]:
# Logistic Regression

ypredsLR = Lreg.predict(xvalid)
lr_acc = accuracy_score(yvalid,ypredsLR)

In [None]:
# K Neighbours

ypredsKNN = KNN.predict(xvalid)
knn_acc = accuracy_score(yvalid,ypredsKNN)

In [None]:
# Random Forest

ypredsRF = RForest.predict(xvalid)
RF_acc = accuracy_score(yvalid,ypredsRF)

In [None]:
print(f'Logistic Regression: {lr_acc*100:.2f}%,\nRandom Forest Classifier: {RF_acc*100:.2f}%,\nKNeighbors: {knn_acc*100:.2f}%')

Logistic Regression: 78.77%,
Random Forest Classifier: 82.68%,
KNeighbors: 82.12%


## Working on Full data

In [7]:
x.head()

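| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | 0 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | 0 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | 1 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | 0 | S |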


In [8]:
x = preprocess_data(x,['Sex','Embarked'],['Pclass','Age','SibSp','Parch','Fare','Cabin'])

In [9]:
RForest = RandomForestClassifier(n_estimators  = 50).fit(x,y)

In [11]:
LReg = LogisticRegression(max_iter = 1000).fit(x,y)

In [12]:
KNeigh = KNeighborsClassifier().fit(x,y)

In [10]:
rf_preds = RForest.predict(x)
print(f'Accuracy Score: {accuracy_score(y,rf_preds)*100:.2f}%')

Accuracy Score: 98.32%


In [16]:
lr_preds = LReg.predict(x)
print(f'Accuracy Score: {accuracy_score(y,lr_preds)*100:.2f}%')

Accuracy Score: 80.13%


In [17]:
knn_preds = KNeigh.predict(x)
print(f'Accuracy Score: {accuracy_score(y,knn_preds)*100:.2f}%')

Accuracy Score: 85.75%


## Improving accuracy using RandomizedSearchCV

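The Random Forest classifier has the highest accuracy on the full training data (98.32%), although a score measured on the same data the model was fitted on is optimistic. It was also the best of the three models on the validation split (82.68%), so we'll try to improve it.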

In [11]:
from sklearn.model_selection import cross_val_score, RandomizedSearchCV

cscore = cross_val_score(RForest,x,y,cv = 5)
cscore,cscore.mean()

(array([0.77653631, 0.78651685, 0.82022472, 0.78089888, 0.83707865]),
 0.8002510827945516)
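
The gap between the 98.32% measured on the training rows and the roughly 80% cross-validated mean is typical for a random forest, which can fit its own training data almost perfectly. The sketch below reproduces the effect on synthetic data; the dataset from `make_classification` and its 20% label noise are assumptions chosen for illustration and have nothing to do with the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 20% of the labels flipped at random
X_syn, y_syn = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, flip_y=0.2,
                                   random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the same rows the model was fitted on
train_acc = accuracy_score(y_syn, forest.fit(X_syn, y_syn).predict(X_syn))

# Mean accuracy over 5 held-out folds
cv_acc = cross_val_score(forest, X_syn, y_syn, cv=5).mean()

print(f'train: {train_acc:.3f}, cross-validated: {cv_acc:.3f}')
```

The cross-validated figure is the better estimate of how the model will behave on unseen passengers.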

In [12]:
RForest

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [14]:
rs_grid = {
    'bootstrap':[True,False],
    'max_depth':[10,50,80,100,None],
    'max_features':['auto','sqrt'],
    'min_samples_split':[2,5,8],
    'min_samples_leaf':[1,2,4],
    'n_estimators':[100,200,300,500,800,1000,1500,2000]
}
rf_random = RandomizedSearchCV(RForest,param_distributions=rs_grid,
                               n_iter = 100, cv = 5, verbose = 2, random_state=42,
                               n_jobs = -1)
rf_random.fit(x,y)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   36.7s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  8.7min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [15]:
rf_random.best_params_

{'bootstrap': False,
 'max_depth': 100,
 'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 8,
 'n_estimators': 2000}
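
Besides `best_params_`, the fitted search object also exposes `best_score_` (the mean cross-validated score of the winning candidate) and, with the default `refit=True`, a `best_estimator_` that `predict` delegates to. The sketch below shows this on a deliberately small, made-up grid and synthetic data so that it runs in seconds; the grid values are assumptions for illustration only. It uses `'sqrt'` and `'log2'` for `max_features` because `'auto'` was removed in newer scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the Titanic set)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# A small hypothetical grid, kept tiny so the search finishes quickly
small_grid = {
    'max_depth': [3, 5, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4],
    'n_estimators': [10, 20],
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=small_grid,
                            n_iter=4, cv=3, random_state=42)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(f'mean CV accuracy of best candidate: {search.best_score_:.3f}')

# With refit=True (the default), predict() uses best_estimator_
same = (search.predict(X_syn) == search.best_estimator_.predict(X_syn)).all()
```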

## Working on test data

Now that we have the best parameters for the model, we can start working on the test data.

In [18]:
xtest = pd.read_csv('drive/My Drive/titanic/test.csv')

In [None]:
xtest.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [20]:
c = [0 if str(i)=='nan' else 1 for i in xtest.Cabin]
xtest.Cabin = c

In [21]:
Xtest = xtest.drop(['PassengerId','Name','Ticket'],axis=1)

In [22]:
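# The training data had missing Embarked values, which produced an extra one-hot column ('missing').
# The test data has none, so we add a placeholder column of zeros (passed through unchanged by the
# column transformer) to keep the number of features the same as in training.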
Xtest['Emnarked_missing']=0
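
The placeholder column is needed because the encoder is refitted on the test data and therefore never sees a `'missing'` port. As a sketch of an alternative, an encoder that has already been fitted on the training categories keeps its column layout when applied to data where one category is absent. The arrays `train_ports` and `test_ports` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical port values: training data includes the 'missing' label
train_ports = np.array([['S'], ['C'], ['Q'], ['missing']], dtype=object)
test_ports = np.array([['Q'], ['S'], ['C']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_ports)

# Column order is fixed at fit time
print(encoder.categories_[0])   # ['C' 'Q' 'S' 'missing']

# The 'missing' column is still present for the test rows, filled with zeros
test_encoded = encoder.transform(test_ports).toarray()
print(test_encoded.shape)       # (3, 4)
```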

In [None]:
Xtest.head()

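| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Emnarked_missing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | 0 | Q | 0 |
| 1 | 3 | female | 47.0 | 1 | 0 | 7.0 | 0 | S | 0 |
| 2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | 0 | Q | 0 |
| 3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | 0 | S | 0 |
| 4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | 0 | S | 0 |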


In [23]:
Xtest.isna().sum()

Pclass               0
Sex                  0
Age                 86
SibSp                0
Parch                0
Fare                 1
Cabin                0
Embarked             0
Emnarked_missing     0
dtype: int64

In [None]:
Xtest.Fare.median()

14.4542

In [24]:
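# Assign the result back instead of using inplace=True on a column accessor,
# which newer pandas versions may not apply to the original DataFrame
Xtest['Fare'] = Xtest['Fare'].fillna(Xtest['Fare'].median())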

In [None]:
Xtest.Cabin.value_counts()

0    327
1     91
Name: Cabin, dtype: int64

In [25]:
submission = pd.DataFrame(columns = ['PassengerId','Survived'])

In [None]:
Xtest.Sex.value_counts(), Xtest.Embarked.value_counts()

(male      266
 female    152
 Name: Sex, dtype: int64, S    270
 C    102
 Q     46
 Name: Embarked, dtype: int64)

In [26]:
Xtest = preprocess_data(Xtest,['Sex','Embarked'],['Pclass','Age','SibSp','Parch','Fare','Cabin'])

In [27]:
rftestpreds = rf_random.predict(Xtest)

In [None]:
Xtest[0]

array([ 0.87348191,  0.38623105, -0.49947002, -0.4002477 , -0.49741333,
       -0.52752958,  0.        ,  1.        ,  0.        ,  1.        ,
        0.        ,  0.        ])

In [28]:
submission['PassengerId'] = xtest['PassengerId']
submission['Survived'] = rftestpreds

In [29]:
submission.to_csv('drive/My Drive/titanic/titanicsubmission3.csv',index=False)

In [None]:
submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 1 |
| 4 | 896 | 1 |


In [30]:
len(submission)

418
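
The length check above confirms one row per test passenger. A slightly broader sanity check before uploading could look like the sketch below; the helper `check_submission` is hypothetical, and the expected layout (two columns, `PassengerId` and `Survived`, with 0/1 values) is taken from the submission frame built in this notebook.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(df.columns) != ['PassengerId', 'Survived']:
        problems.append('unexpected columns')
    if len(df) != expected_rows:
        problems.append('unexpected row count')
    if df.isna().any().any():
        problems.append('missing values')
    if 'Survived' in df.columns and not df['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    return problems

# Made-up three-row example instead of the real 418-row file
demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 0, 1]})
problems_ok = check_submission(demo, expected_rows=3)
problems_bad = check_submission(demo.assign(Survived=[0, 2, 1]), expected_rows=3)

print(problems_ok)   # []
print(problems_bad)  # ['Survived must be 0 or 1']
```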