<a href="https://www.kaggle.com/code/pankajkumar2002/the-titanic-disaster?scriptVersionId=153320063" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <h1 style='color:red; font-size:38px; border:2px solid red; border-radius:20px;'><center> TITANIC DISASTER </center></h1>

![](https://static.toiimg.com/photo/58787332.cms)

# <h1 style='color:red; font-size:30px; border:2px solid red; border-radius:20px;'><center> Table of Contents </center></h1>

<div style='font-size:16px;'>
    <ul>
        <li> Introduction </li>
        <li> EDA </li>
        <li> Preprocessing </li>
        <li> Model Selection and Hyperparameter Tuning </li>
        <li> Test Predictions</li>
    </ul>
</div>

# <h1 style='color:red; font-size:30px; border:2px solid red; border-radius:20px;'><center> Introduction </center></h1>

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

<div> <p style='font-size:28px;'> About the Dataset </p> <br>
    <table style='font-family: arial, sans-serif; border-collapse: collapse; width: 100%; font-size:20px; align:left;'>
        <tr> 
            <td> Variable </td>
            <td> Definition </td>
            <td> Key </td>
        </tr>
        <tr> 
            <td> survival </td>
            <td> Survival </td>
            <td> 0 = No, 1 = Yes </td>
        </tr>
        <tr> 
            <td> pclass </td>
            <td> Ticket class </td>
            <td> 1 = 1st, 2 = 2nd, 3 = 3rd </td>
        </tr>
        <tr> 
            <td> sex </td>
            <td> Sex </td>
            <td>  </td>
        </tr>
        <tr> 
            <td> age </td>
            <td> Age in years </td>
            <td>  </td>
        </tr>
        <tr> 
            <td> sibsp </td>
            <td> # of siblings / spouses aboard the Titanic </td>
            <td>  </td>
        </tr>
        <tr> 
            <td> parch </td>
            <td> # of parents / children aboard the Titanic </td>
            <td>  </td>
        </tr>
        <tr> 
            <td> ticket </td>
            <td> Ticket number </td>
            <td>  </td>
        </tr>
        <tr> 
            <td> fare </td>
            <td> Passenger fare </td>
            <td>  </td>
        </tr>
        <tr> 
            <td> cabin </td>
            <td> Cabin number </td>
            <td>  </td>
        </tr>
        <tr> 
            <td> embarked </td>
            <td> Port of Embarkation </td>
            <td> C = Cherbourg, Q = Queenstown, S = Southampton  </td>
        </tr>
    </table>
    
</div>

In [1]:
import re
import numpy as np
import pandas as pd
from tqdm import tqdm

import warnings
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, f1_score, recall_score, precision_score, accuracy_score
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier

In [2]:
warnings.filterwarnings('ignore')

pd.options.plotting.backend='plotly'
raw_dataset = pd.read_csv('../input/titanic/train.csv')
raw_dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
raw_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


# <h1 style='color:red; font-size:30px; border:2px solid red; border-radius:20px;'><center> Exploratory Data Analysis  </center></h1>

In [4]:
_template = dict(layout=go.Layout(font=dict(family='Times New Roman', size=13), width=800))

In [5]:
fig = px.pie(raw_dataset, names='Survived', width=800, title='Survived', height=600)
fig.show()

In [6]:
fig = px.histogram(raw_dataset.select_dtypes(include=['int64']), x='Pclass', color='Survived', nbins=10)
fig.update_layout(template=_template)
fig.show()

In [7]:
fig = px.histogram(raw_dataset.select_dtypes(include=['int64']), x='SibSp', color='Survived', nbins=20)
fig.update_layout(template=_template)
fig.show()

In [8]:
fig = px.histogram(raw_dataset.select_dtypes(include=['int64']), x='Parch', color='Survived', nbins=20)
fig.update_layout(template=_template)
fig.show()

In [9]:
fig = px.scatter(raw_dataset, x='Fare', color='Survived')
fig.update_layout(template=_template, width=1000)
fig.show()

In [10]:
fig = px.box(raw_dataset.Fare, points='all')
fig.update_layout(template=_template)
fig.show()

In [11]:
fig = px.histogram(raw_dataset, x='Sex', color='Survived')
fig.update_layout(template=_template, width=1000)
fig.show()

In [12]:
fig = px.histogram(raw_dataset, x='Embarked', color='Survived')
fig.update_layout(template=_template, width=1000)
fig.show()

# <h1 style='color:red; font-size:30px; border:2px solid red; border-radius:20px;'><center> Preprocessing </center></h1>

Preprocessing starts with feature selection (dropping PassengerId, Name, and Ticket) and feature extraction (collapsing each Cabin entry to its deck letters). Pipelines are then defined to impute, scale, and encode the numerical and categorical features.

In [13]:
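# Drop identifier-like columns that are not used as features.
raw_data = raw_dataset.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
numeric_col = list(raw_data.select_dtypes(include=['int64', 'float64']).columns)
categorical_col = list(raw_data.select_dtypes(include=['object']).columns)
numeric_col = numeric_col[1:]  # skip the first numeric column, the target 'Survived'
# Features are every column after 'Survived'; copy so later edits do not touch raw_data.
Train_x = raw_data.iloc[:, 1:].copy()
train_y = raw_data.iloc[:, 0]
print("Numerical Columns : ", numeric_col)
print("Categorical Columns : ", categorical_col)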

Numerical Columns :  ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Categorical Columns :  ['Sex', 'Cabin', 'Embarked']


In [14]:
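def Cabin(x):
    # Replace every cabin token (an uppercase letter followed by non-space
    # characters, e.g. 'C85') with the first character of the string.
    # 'C23 C25 C27' becomes 'C C C'; the string 'nan' is left unchanged.
    x = re.sub(r'[A-Z]\S+', x[0], x)
    return x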

Train_x['Cabin'] = Train_x['Cabin'].astype('str')
Train_x['Cabin'] = Train_x['Cabin'].map(Cabin)
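
To make the effect of the Cabin function concrete, the sketch below applies the same regular expression to a few sample values. The sample strings are illustrative (the first three follow formats that occur in the Titanic data, and 'nan' is what astype('str') produces for a missing value), and the name `cabin_prefix` is hypothetical.

```python
import re

def cabin_prefix(x):
    # Same substitution as Cabin(): every token made of an uppercase letter
    # followed by non-space characters is replaced by the first character of x.
    return re.sub(r'[A-Z]\S+', x[0], x)

# Illustrative inputs: single cabin, several cabins, mixed entry, missing value.
samples = ['C85', 'C23 C25 C27', 'F G73', 'nan']
collapsed = [cabin_prefix(s) for s in samples]
print(collapsed)
```

Multi-cabin entries therefore stay separate categories ('C C C' is not the same as 'C'), and missing cabins become their own 'nan' category, so the most-frequent imputer has nothing left to fill in this column.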

In [15]:
raw_data.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [16]:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numeric_col),
        ('cat', categorical_transformer, categorical_col)
    ]
)

train_x = preprocessor.fit_transform(Train_x)
train_x.shape

(891, 26)
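
As a rough illustration of what the numerical branch does to a single column, the sketch below reproduces mean imputation followed by standardisation with plain NumPy. The toy `age` values are made up for the example. StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

# Toy column with one missing value (illustrative numbers, not from the dataset).
age = np.array([22.0, 38.0, np.nan, 35.0, 25.0])

# Step 1: mean imputation, ignoring NaN when computing the mean.
mean_age = np.nanmean(age)  # (22 + 38 + 35 + 25) / 4 = 30.0
imputed = np.where(np.isnan(age), mean_age, age)

# Step 2: standardise to zero mean and unit variance (population std, ddof=0).
scaled = (imputed - imputed.mean()) / imputed.std()

print(imputed)
print(scaled.round(3))
```

Because the imputed value equals the column mean, it maps to exactly 0 after scaling, which is worth keeping in mind for Age, where 177 of 891 values are imputed.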

# <h1 style='color:red; font-size:30px; border:2px solid red; border-radius:20px;'><center> Model Selection and Hyperparameter tuning </center></h1>

Four models are defined: Gradient Boosting, Random Forest, AdaBoost, and a Support Vector Machine. RandomizedSearchCV with 10-fold stratified cross-validation searches the hyperparameters of each model, and the best cross-validation scores are compared. The search with the highest score is returned so that its best hyperparameters can be used for the final evaluation.

In [17]:
def model_selection():
    grb_model = GradientBoostingClassifier(random_state=42)
    rf_model = RandomForestClassifier(random_state=42)
    ada_model = AdaBoostClassifier(random_state=42)
    svc_model = SVC()
    models = [grb_model, rf_model, ada_model, svc_model]
    models_name = ['Gradient Boost', 'Random Forest', 'Ada Boost', 'SVM']
    
    return models, models_name

In [18]:
def hyper_tuning():
    max_score = 0
    srch = None
    
    models, models_name = model_selection()
    
    params = [
        dict( random_state=[42], n_estimators=[10,50,250,500], max_depth=[1,3,5,7,9] ),
        dict( random_state=[42], n_estimators=[10,50,250,500], max_depth=[1,3,5,7,9] ),
        dict( random_state=[42], n_estimators=[10,50,100,500], base_estimator=[DecisionTreeClassifier(max_depth=100), SVC(kernel='rbf')] ),
        dict( kernel = ['poly', 'rbf', 'sigmoid'], C = [50, 10, 1.0, 0.1, 0.01], gamma = ['scale'])
    ]
    
    for i, model in tqdm(enumerate(models)):
        search = RandomizedSearchCV(model, params[i], cv=StratifiedKFold(n_splits=10))
        search.fit(train_x, train_y)
        print('Score of {} is {}'.format(models_name[i], search.best_score_))

        if search.best_score_ > max_score:
            max_score = search.best_score_
            srch = search
    return srch

In [19]:
search = hyper_tuning()

1it [01:09, 69.28s/it]

Score of Gradient Boost is 0.8339575530586767


2it [01:50, 52.53s/it]

Score of Random Forest is 0.833932584269663


3it [02:10, 37.99s/it]

Score of Ada Boost is 0.8059176029962547


4it [02:14, 33.68s/it]

Score of SVM is 0.82270911360799





In [20]:
search.best_params_

{'random_state': 42, 'n_estimators': 50, 'max_depth': 5}
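
The fitted search object already holds the winning model, refitted on all the data it was given, so it can be reused directly instead of retyping the hyperparameters. A minimal sketch on synthetic data follows; the data from `make_classification`, the small grid, and the names `demo_search` and `best_model` are assumptions for illustration, not the notebook's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the preprocessed training matrix.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=dict(n_estimators=[10, 50], max_depth=[1, 3]),
    n_iter=4,                        # the grid has only 4 combinations
    cv=StratifiedKFold(n_splits=5),
    random_state=42,                 # makes the sampled candidates reproducible
)
demo_search.fit(X_demo, y_demo)

# refit=True (the default) retrains the best candidate on the full data.
best_model = demo_search.best_estimator_
print(type(best_model).__name__, demo_search.best_params_)
```

RandomizedSearchCV samples 10 candidates by default (n_iter=10), and without a random_state on the search itself the sampled candidates can differ from run to run.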

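The dataset is split into training and validation sets to compute classification metrics. The model is then retrained on the whole training dataset before predicting on the unseen test dataset. Note that the cell below fits a Random Forest with n_estimators=500 and max_depth=9, not the best estimator returned by the search (Gradient Boosting with the parameters shown above).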

In [21]:
x, valid_x, y, valid_y = train_test_split(train_x, train_y, test_size=0.3, random_state=42)
final_model = RandomForestClassifier(random_state=42, n_estimators=500, max_depth=9)
final_model.fit(x, y)

RandomForestClassifier(max_depth=9, n_estimators=500, random_state=42)

In [22]:
y_pred = final_model.predict(valid_x)

print('roc_auc : {}'.format(roc_auc_score(valid_y, y_pred)))
print('accuracy score : {}'.format(accuracy_score(valid_y, y_pred)))
print('precision score : {}'.format(precision_score(valid_y, y_pred)))
print('recall score : {}'.format(recall_score(valid_y, y_pred)))
print('f1 score : {}'.format(f1_score(valid_y, y_pred)))

roc_auc : 0.7728237791932059
accuracy score : 0.7910447761194029
precision score : 0.7956989247311828
recall score : 0.6666666666666666
f1 score : 0.7254901960784312
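
Because roc_auc_score is given hard class labels here rather than probabilities, the value it returns is the mean of the true-positive and true-negative rates (balanced accuracy). The worked example below checks this with confusion-matrix counts that are inferred from the scores printed above, assuming a 268-row validation split. The counts are a reconstruction, not an output of the notebook.

```python
# Confusion-matrix counts inferred from the printed scores (an assumption).
tp, fp, fn, tn = 74, 19, 37, 138

precision = tp / (tp + fp)
recall = tp / (tp + fn)                      # true-positive rate
specificity = tn / (tn + fp)                 # true-negative rate
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions, ROC AUC reduces to this average.
balanced_accuracy = (recall + specificity) / 2

print(precision, recall, accuracy, f1, balanced_accuracy)
```

To obtain a threshold-independent ROC AUC, one option is to pass `final_model.predict_proba(valid_x)[:, 1]` as the second argument instead of `y_pred`.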


In [23]:
final_model.fit(train_x, train_y)

RandomForestClassifier(max_depth=9, n_estimators=500, random_state=42)

# <h1 style='font-size:30px; color:red; border:2px solid red; border-radius:20px;'><center> Test Dataset </center></h1>

In [24]:
test_data = pd.read_csv('../input/titanic/test.csv')
id_col = test_data.iloc[:, 0]
test_data = test_data.drop(['PassengerId','Name', 'Ticket'], axis=1)
test_data['Cabin'] = test_data['Cabin'].astype('str')
test_data['Cabin'] = test_data['Cabin'].map(Cabin)
test_data = preprocessor.transform(test_data)  # reuse the preprocessor fitted on the training data
test_data.shape

(418, 26)
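
The preprocessor should learn its statistics and categories from the training data only and then be applied unchanged to the test data. A toy sketch of that pattern follows; the two small frames and the names `toy_train`, `toy_test`, and `toy_pre` are made up for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frames; the test frame has a missing fare and one port only.
toy_train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05], 'Embarked': ['S', 'C', 'Q']})
toy_test = pd.DataFrame({'Fare': [float('nan'), 30.0], 'Embarked': ['S', 'S']})

toy_pre = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())]), ['Fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Embarked']),
])

train_enc = toy_pre.fit_transform(toy_train)  # learns mean/std and categories
test_enc = toy_pre.transform(toy_test)        # reuses them: same column layout

# An unfitted copy fitted on the test frame alone sees only 'S',
# so it produces fewer columns than the model was trained on.
refit_enc = clone(toy_pre).fit_transform(toy_test)
print(train_enc.shape, test_enc.shape, refit_enc.shape)
```

Even when a refit happens to yield the same number of columns, the columns may correspond to different categories and the scaling statistics come from the test set, so the trained model would be reading mismatched features.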

In [25]:
y_pred = final_model.predict(test_data)
output_data = { 'PassengerId':id_col, 'Survived':y_pred}
output = pd.DataFrame(output_data)
output.to_csv('submission.csv', index=False)

# <h1 style=' font-size:30px; color:red; border:2px solid red; border-radius:20px;'><center> Thank You for Reading !  </center></h1>