# **Titanic - Machine Learning from Disaster**

This is building up on version 1

Result: 0.82057

### Dataset description:

- **PassengerId**: An unique index for passenger rows. It starts from 1 for first row and increments by 1 for every new rows.

- **Survived**: Shows if the passenger survived or not. 1 stands for survived and 0 stands for not survived.

- **Pclass**: Ticket class. 1 stands for First class ticket. 2 stands for Second class ticket. 3 stands for Third class ticket.

- **Name**: Passenger's name. Name also contain title. "Mr" for man. "Mrs" for woman. "Miss" for girl. "Master" for boy.

- **Sex**: Passenger's sex. It's either Male or Female.

- **Age**: Passenger's age. "NaN" values in this column indicates that the age of that particular passenger has not been recorded.

- **SibSp**: Number of siblings or spouses travelling with each passenger.

- **Parch**: Number of parents of children travelling with each passenger.

- **Ticket**: Ticket number.

- **Fare**: How much money the passenger has paid for the travel journey.

- **Cabin**: Cabin number of the passenger. "NaN" values in this column indicates that the cabin number of that particular passenger has not been recorded.

- **Embarked**: Port from where the particular passenger was embarked/boarded.

### Load libraries

In [54]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style
import re

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB


from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import StackingClassifier

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

from sklearn.metrics import precision_score, recall_score

from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold


### Load data

In [2]:
# getting data
test_df = pd.read_csv("~/documents/titanic/test.csv")
train_df = pd.read_csv("~/documents/titanic/train.csv")

In [4]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


### Feature Engineering

#### Grouping titles

In [5]:
data = [train_df, test_df]

for dataset in data:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
    # replace titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir',
                          'Jonkheer', 'Don'], 'Mr')
    dataset['Title'] = dataset['Title'].replace(['Dona', 'Countess', 'Lady', 'Mme'], 'Mrs')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    # filling NaN with 0, to get safe
    dataset['Title'] = dataset['Title'].fillna(0)
    

In [6]:
train_df.groupby(['Title'])['Survived'].mean()

Title
Dr        0.428571
Master    0.575000
Miss      0.702703
Mr        0.158192
Mrs       0.796875
Name: Survived, dtype: float64

In [None]:
train_df.head()

#### Imputing median into fare

In [7]:
data = [train_df, test_df]

for dataset in data:

    # Fill Fare
    fares_to_impute = dataset.groupby('Pclass')['Fare'].median()

    pclasses = [1, 2, 3]
    for pclass in pclasses:
        fare_to_impute = fares_to_impute[pclasses.index(pclass) + 1]
        dataset.loc[(dataset['Fare'].isna()) &
                    (dataset['Pclass'] == pclass), 'Fare'] = fare_to_impute

#### Imputing median class

In [8]:
# Fill missing values of 'Age' using 'Title'

data = [train_df, test_df]

for dataset in data:

    ages_to_impute = dataset.groupby(['Title'])['Age'].median()
    titles = ages_to_impute.index.tolist()

    for title in titles:
        age_to_impute = ages_to_impute[titles.index(title)]
        dataset.loc[(dataset['Age'].isna()) &
                    (dataset['Title'] == title), 'Age'] = age_to_impute

In [9]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr


#### New Variable AgePclass

In [10]:
data = [train_df, test_df]


for dataset in data:

    dataset['AgePclass'] = dataset['Age'] * dataset['Pclass']

In [11]:
data = [train_df, test_df]


for dataset in data:

# Also create 'AgeBand' to have a more symmetric 'Age' features

    dataset['AgeBand'] = pd.qcut(dataset['Age'], q=5, labels=False)


Checking average based on sex and age band

In [12]:
train_df.groupby(['Sex', 'AgeBand'])['Survived'].mean()

Sex     AgeBand
female  0          0.688312
        1          0.686047
        2          0.774194
        3          0.850000
        4          0.766667
male    0          0.292453
        1          0.122222
        2          0.146067
        3          0.247059
        4          0.169492
Name: Survived, dtype: float64

In [13]:
for age in range(5, 60, 5):
    survive_rate = train_df[(train_df['Sex'] == 'male') &
                           (train_df['Age'] < age)]['Survived'].mean()
    print(f"Male aged under {age} survived: {round(survive_rate * 100, 2)} %")

Male aged under 5 survived: 62.96 %
Male aged under 10 survived: 58.33 %
Male aged under 15 survived: 53.49 %
Male aged under 20 survived: 30.11 %
Male aged under 25 survived: 21.21 %
Male aged under 30 survived: 21.99 %
Male aged under 35 survived: 18.96 %
Male aged under 40 survived: 19.27 %
Male aged under 45 survived: 19.11 %
Male aged under 50 survived: 19.43 %
Male aged under 55 survived: 19.27 %


#### Creating variable priority

In [14]:
data = [train_df, test_df]


for dataset in data:


    dataset['Priority'] = 1
    dataset.loc[dataset['Title'] == 'Mr', 'Priority'] = 0

In [15]:
train_df.groupby('Priority')['Survived'].mean()

Priority
0    0.158192
1    0.716667
Name: Survived, dtype: float64

In [16]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,AgePclass,AgeBand,Priority
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,66.0,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0,3,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,78.0,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,35.0,3,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,105.0,3,0


#### Creating Ticket ID

In [17]:
data = [train_df, test_df]


for dataset in data:


    def ticketid(ticket):
        try:
            # All digits part of a ticket except last character
            id = re.findall(r"\d+", ticket)[-1]
            if len(id) > 1:
                return id[:-1]
            else:  # For ticket id with only 1 digit.
                return id
        except:
            # For LINE tickets
            return '000'

    dataset['TicketID'] = dataset['Ticket'].apply(ticketid)
    dataset['TicketID'] = dataset['TicketID'].astype(int)
    dataset['TicketID'] += dataset['Pclass'] * 1-000-000 * dataset['Fare']

In [18]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,AgePclass,AgeBand,Priority,TicketID
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,66.0,1,0,2120.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0,3,1,1760.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,78.0,1,1,310131.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,35.0,3,1,11381.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,105.0,3,0,37348.0


#### New Variable Same Priority Group

In [19]:
data = [train_df, test_df]


for dataset in data:

    x = dataset[['Priority', 'TicketID', 'Fare']].to_string(header=False, index=False,
                                                index_names=False).split('\n')
    dataset['SamePrioGroup'] = ['-'.join(i.split()) for i in x]

In [20]:
train_df.groupby('SamePrioGroup')['Survived'].mean().value_counts()

0.000000    358
1.000000    225
0.500000     17
0.333333      3
0.250000      2
0.125000      1
0.400000      1
0.714286      1
0.666667      1
0.200000      1
0.750000      1
Name: Survived, dtype: int64

#### New Variable Same priority companion

In [21]:
data = [train_df, test_df]


for dataset in data:


    count_same_prio_group = dataset.groupby('SamePrioGroup')['SamePrioGroup'].count()

    for group in dataset['SamePrioGroup'].unique():
        dataset.loc[dataset['SamePrioGroup'] == group,
                   'SamePrioCompanion'] = count_same_prio_group[group]

In [22]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,AgePclass,AgeBand,Priority,TicketID,SamePrioGroup,SamePrioCompanion
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,66.0,1,0,2120.0,0-2120.0-7.2500,4.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0,3,1,1760.0,1-1760.0-71.2833,1.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,78.0,1,1,310131.0,1-310131.0-7.9250,3.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,35.0,3,1,11381.0,1-11381.0-53.1000,2.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,105.0,3,0,37348.0,0-37348.0-8.0500,1.0


In [23]:
train_df[(train_df['SamePrioCompanion'] > 1) & (train_df['Priority'] == 1)].groupby('SamePrioGroup')['Survived'].mean().value_counts()

1.00    58
0.00    18
0.50     5
0.75     1
Name: Survived, dtype: int64

#### Sex

In [24]:
# Sex

sex = {"male": 0, "female": 1}
data = [train_df, test_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(sex)


In [25]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,AgePclass,AgeBand,Priority,TicketID,SamePrioGroup,SamePrioCompanion
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Mr,66.0,1,0,2120.0,0-2120.0-7.2500,4.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0,3,1,1760.0,1-1760.0-71.2833,1.0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,78.0,1,1,310131.0,1-310131.0-7.9250,3.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Mrs,35.0,3,1,11381.0,1-11381.0-53.1000,2.0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Mr,105.0,3,0,37348.0,0-37348.0-8.0500,1.0


#### Concating train_df and test_df

In [26]:
X_full = pd.concat([train_df, test_df], axis=0, ignore_index=True)



In [27]:
count_same_prio_group = X_full.groupby('SamePrioGroup')['SamePrioGroup'].count()
count_same_prio_group

SamePrioGroup
0-1051.0-8.0500      1
0-11042.0-79.6500    1
0-11047.0-26.0000    1
0-11047.0-52.0000    2
0-11049.0-26.5500    1
                    ..
1-758.0-10.5167      1
1-758.0-9.8375       1
1-88.0-14.5000       2
1-926.0-7.7500       2
1-957.0-16.7000      3
Name: SamePrioGroup, Length: 818, dtype: int64

#### Creating same priority survived

In [28]:
# Find passengers who had 1+ companions with same priority in his/her group

count_same_prio_group = X_full.groupby('SamePrioGroup')['SamePrioGroup'].count()
survived_same_prio_group = X_full.groupby('SamePrioGroup')['Survived'].mean()


for group in X_full['SamePrioGroup'].unique():
    X_full.loc[X_full['SamePrioGroup'] == group,
               'SamePrioCompanion'] = count_same_prio_group[group]

    X_full.loc[X_full['SamePrioGroup'] == group,
               'SamePrioSurvived'] = survived_same_prio_group[group]
    

# For groups with people inside test set (no Survived information), we 
# set a default rate of group survival rate to the population survival rate
X_full['SamePrioSurvived'].fillna(722 / (722 + 1502), inplace=True, axis=0)


In [29]:
X_full.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,AgePclass,AgeBand,Priority,TicketID,SamePrioGroup,SamePrioCompanion,SamePrioSurvived
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Mr,66.0,1,0,2120.0,0-2120.0-7.2500,5.0,0.0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0,3,1,1760.0,1-1760.0-71.2833,1.0,1.0
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,78.0,1,1,310131.0,1-310131.0-7.9250,3.0,1.0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Mrs,35.0,3,1,11381.0,1-11381.0-53.1000,2.0,1.0
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Mr,105.0,3,0,37348.0,0-37348.0-8.0500,1.0,0.0


#### Creating Far per person

In [30]:
X_full['FarePerPerson'] = X_full['Fare'] / X_full['SamePrioCompanion']

#### Creating Fareband

In [31]:
# Create FareBand

X_full['FareBand'] = pd.qcut(X_full['Fare'], q=5, labels=False)

#### Creating same outcome

In [32]:
# Feature to find passengers with same priority, same group who had same outcomes
def same_outcome(survived_rate):
    if survived_rate == 1 or survive_rate == 0:
        return 1
    else:
        return 0

X_full['SameOutcome'] = X_full['SamePrioSurvived'].apply(same_outcome)

In [33]:
X_full.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,AgePclass,AgeBand,Priority,TicketID,SamePrioGroup,SamePrioCompanion,SamePrioSurvived,FarePerPerson,FareBand,SameOutcome
0,1,0.0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,...,66.0,1,0,2120.0,0-2120.0-7.2500,5.0,0.0,1.45,0,0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,...,38.0,3,1,1760.0,1-1760.0-71.2833,1.0,1.0,71.2833,4,1
2,3,1.0,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,...,78.0,1,1,310131.0,1-310131.0-7.9250,3.0,1.0,2.641667,1,1
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,...,35.0,3,1,11381.0,1-11381.0-53.1000,2.0,1.0,26.55,4,1
4,5,0.0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,...,105.0,3,0,37348.0,0-37348.0-8.0500,1.0,0.0,8.05,1,0


#### Dropping unused fields

In [34]:
X_full.drop(['Survived', 'Name', 'Title', 
             'Ticket', 'SibSp', 'Parch', 'Cabin', 'Embarked'], axis=1, inplace=True)


In [35]:
X_full

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,AgePclass,AgeBand,Priority,TicketID,SamePrioGroup,SamePrioCompanion,SamePrioSurvived,FarePerPerson,FareBand,SameOutcome
0,1,3,0,22.0,7.2500,66.0,1,0,2120.0,0-2120.0-7.2500,5.0,0.00000,1.450000,0,0
1,2,1,1,38.0,71.2833,38.0,3,1,1760.0,1-1760.0-71.2833,1.0,1.00000,71.283300,4,1
2,3,3,1,26.0,7.9250,78.0,1,1,310131.0,1-310131.0-7.9250,3.0,1.00000,2.641667,1,1
3,4,1,1,35.0,53.1000,35.0,3,1,11381.0,1-11381.0-53.1000,2.0,1.00000,26.550000,4,1
4,5,3,0,35.0,8.0500,105.0,3,0,37348.0,0-37348.0-8.0500,1.0,0.00000,8.050000,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1305,3,0,29.0,8.0500,87.0,2,0,326.0,0-326.0-8.0500,2.0,0.00000,4.025000,1,0
1305,1306,1,1,39.0,108.9000,39.0,3,1,1776.0,1-1776.0-108.9000,2.0,1.00000,54.450000,4,1
1306,1307,3,0,38.5,7.2500,115.5,3,0,310129.0,0-310129.0-7.2500,1.0,0.32464,7.250000,0,0
1307,1308,3,0,29.0,8.0500,87.0,2,0,35933.0,0-35933.0-8.0500,2.0,0.32464,4.025000,1,0


#### Standardize numerical features

In [36]:
# Standardize numerical features

for col in ['Age', 'Fare', 'AgePclass', 'FarePerPerson', 'TicketID']:
    mean = X_full[col].mean()
    std = X_full[col].std()
    X_full[col] = (X_full[col] - mean) / std

In [37]:
X_full.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,AgePclass,AgeBand,Priority,TicketID,SamePrioGroup,SamePrioCompanion,SamePrioSurvived,FarePerPerson,FareBand,SameOutcome
0,1,3,0,-0.569679,-0.502981,0.081988,1,0,-0.412168,0-2120.0-7.2500,5.0,0.0,-0.619586,0,0
1,2,1,1,0.645306,0.734529,-0.80247,3,1,-0.417833,1-1760.0-71.2833,1.0,1.0,1.879283,4,1
2,3,3,1,-0.265933,-0.489936,0.461041,1,1,4.435372,1-310131.0-7.9250,3.0,1.0,-0.576945,1,1
3,4,1,1,0.417496,0.383118,-0.897233,3,1,-0.266416,1-11381.0-53.1000,2.0,1.0,0.278576,4,1
4,5,3,0,0.417496,-0.48752,1.313911,3,0,0.142258,0-37348.0-8.0500,1.0,0.0,-0.383416,1,0


### ML Alghorithm

#### Creating train and test data set

In [38]:
X_train = X_full[:891]
y_train = train_df['Survived']
X_test = X_full[891:]
X_test.reset_index(inplace=True, drop=True)



#### Oversample perished people to match population ratio of not survived 


In [39]:

np.random.seed(42)
not_survived_ix = train_df['Survived'][train_df['Survived'] == 0].index.to_numpy()
rnd_chosen = np.random.choice(not_survived_ix, 163, replace=False)
not_survived_ix = np.r_[not_survived_ix, rnd_chosen]

survived_ix = train_df['Survived'][train_df['Survived'] == 1].index.to_numpy()

rnd_ix = np.r_[survived_ix, not_survived_ix]
X_train = X_train.loc[rnd_ix]
y_train = y_train.loc[rnd_ix]


#### Dropping last fields

In [40]:
X_train_all = X_train.drop(['PassengerId', 'SamePrioGroup', 'SameOutcome', 'SamePrioSurvived'], axis=1)
X_test_all = X_test.drop(['PassengerId', 'SamePrioGroup',  'SameOutcome', 'SamePrioSurvived'], axis=1)

In [42]:
X_test_all.head()

Unnamed: 0,Pclass,Sex,Age,Fare,AgePclass,AgeBand,Priority,TicketID,SamePrioCompanion,FarePerPerson,FareBand
0,3,0,0.379528,-0.491787,1.266529,3,0,0.075307,1.0,-0.391317,0
1,3,1,1.328735,-0.507813,2.45107,4,1,0.126236,1.0,-0.420989,0
2,2,0,2.467783,-0.455874,1.914078,4,0,-0.067359,1.0,-0.324821,1
3,3,0,-0.189996,-0.475683,0.555804,2,0,0.050504,2.0,-0.516486,1
4,3,1,-0.569679,-0.405626,0.081988,1,1,4.435388,2.0,-0.451628,2


#### Random Forest

In [50]:
# Random Forest:

random_forest = RandomForestClassifier(n_estimators=100)

random_forest.fit(X_train_all, y_train)

Y_pred_rf = random_forest.predict(X_test_all)

random_forest.score(X_train_all, y_train)
acc_random_forest = round(random_forest.score(X_train_all, y_train) * 100, 2)
print(acc_random_forest)

99.43


### K-Fold Cross Validation

In [43]:
X_train_all

Unnamed: 0,Pclass,Sex,Age,Fare,AgePclass,AgeBand,Priority,TicketID,SamePrioCompanion,FarePerPerson,FareBand
1,1,1,0.645306,0.734529,-0.802470,3,1,-0.417833,1.0,1.879283,4
2,3,1,-0.265933,-0.489936,0.461041,1,1,4.435372,3.0,-0.576945,1
3,1,1,0.417496,0.383118,-0.897233,3,1,-0.266416,2.0,0.278576,4
8,3,1,-0.189996,-0.427932,0.555804,2,1,0.101795,3.0,-0.538677,2
9,2,1,-1.177171,-0.061945,-1.118347,0,1,-0.071357,1.0,0.404562,3
...,...,...,...,...,...,...,...,...,...,...,...
650,3,0,0.037813,-0.490500,0.840094,2,0,0.104124,9.0,-0.640079,1
769,3,0,0.189686,-0.481481,1.029621,3,0,-0.432155,1.0,-0.372234,1
420,3,0,0.037813,-0.490500,0.840094,2,0,0.104171,7.0,-0.631110,1
793,1,0,0.037813,-0.049867,-1.055172,2,0,-0.417818,1.0,0.426926,3


In [44]:
X_test_all

Unnamed: 0,Pclass,Sex,Age,Fare,AgePclass,AgeBand,Priority,TicketID,SamePrioCompanion,FarePerPerson,FareBand
0,3,0,0.379528,-0.491787,1.266529,3,0,0.075307,1.0,-0.391317,0
1,3,1,1.328735,-0.507813,2.451070,4,1,0.126236,1.0,-0.420989,0
2,2,0,2.467783,-0.455874,1.914078,4,0,-0.067359,1.0,-0.324821,1
3,3,0,-0.189996,-0.475683,0.555804,2,0,0.050504,2.0,-0.516486,1
4,3,1,-0.569679,-0.405626,0.081988,1,1,4.435388,2.0,-0.451628,2
...,...,...,...,...,...,...,...,...,...,...,...
413,3,0,-0.038123,-0.487520,0.745331,2,0,-0.440402,2.0,-0.527444,1
414,1,1,0.721242,1.461511,-0.770882,3,1,-0.417582,2.0,1.276931,4
415,3,0,0.683274,-0.502981,1.645582,3,0,4.435341,1.0,-0.412043,0
416,3,0,-0.038123,-0.487520,0.745331,2,0,0.119988,2.0,-0.527444,1


In [51]:
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train_all, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [0.85849057 0.83962264 0.83018868 0.89622642 0.88571429 0.88571429
 0.86666667 0.83809524 0.91428571 0.93333333]
Mean: 0.8748337825696316
Standard Deviation: 0.03254166658895175


Random Forest model has an average accuracy of 81% with a standard deviation of 5%. 

### Hyperparameter Tuning

##### Random Forest

In [55]:

def grid_search(model, param_grid, random=False,
                n_iter=10, scoring='accuracy', verbose=0):
    model_name = model.__class__.__name__
    print(f"Model: {model_name}")
    print("-" * 40)

    cv = StratifiedKFold(n_splits=10, shuffle=True,
                                random_state=42)
    if random:
        model_grid = RandomizedSearchCV(model, param_grid, n_iter=n_iter,
                                        cv = cv, scoring=scoring,
                                        refit=True, return_train_score=True,
                                        random_state=42, n_jobs=-1, verbose=verbose)
    else:
        model_grid = GridSearchCV(model, param_grid, cv=cv, scoring=scoring,
                                  refit=True, return_train_score=True, n_jobs=-1, verbose=verbose)

    model_grid.fit(X_train_all, y_train)

    model_cvres = model_grid.cv_results_
    mean_scores = model_cvres['mean_test_score']
    best_score = mean_scores.max()
    params_space = model_cvres['params']

    for mean_score, params in zip(mean_scores, params_space):
        result = f'{round(mean_score * 100, 2)}% : {params}'
        if mean_score == best_score:
            print('* ', result, ' [BEST]')
        else:
            print(result)

    return model_grid, {model_name: round(best_score * 100, 2)}

In [56]:
xtree_clf = RandomForestClassifier(random_state=42, n_jobs=-1)
xtree_params = [
    { "criterion": ["gini", "entropy"],
      'n_estimators': [75, 100, 150],
     'min_samples_leaf': [1, 2, 5, 10],
     'max_depth': [2, 5],}
]

xtree_grid, xtree_info = grid_search(xtree_clf, xtree_params, verbose=0)

Model: RandomForestClassifier
----------------------------------------
81.97% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 1, 'n_estimators': 75}
82.16% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 1, 'n_estimators': 100}
81.88% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 1, 'n_estimators': 150}
81.97% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 75}
82.26% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 100}
81.88% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 150}
81.97% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 75}
82.26% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 100}
81.88% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 150}
82.07% : {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 10, 'n_estimators': 75}
82.26% : {'criterion':

In [59]:
y_pred_test = xtree_grid.predict(X_test_all)
y_pred_test = pd.Series(y_pred_test, index=X_test_all.index)

y_pred_full_test = [None] * len(X_test)

for i in X_test.index:
    if X_test.loc[i, 'Priority'] == 1 and X_test.loc[i, 'SamePrioCompanion']\
            and X_test.loc[i, 'SamePrioCompanion'] > 1:
        y_pred_full_test[i] = X_test.loc[i, 'SamePrioSurvived'].astype(int)
    else:
        y_pred_full_test[i] = y_pred_test[i]

In [61]:
submission_v1 = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": y_pred_full_test
    })

submission_v1.to_csv('submission_v1.csv', index=False)

### References

More on:

- Hyperparameter Tuning

- Precision Recall Curve

- ROC AUC Curve

- ROC AUC Score

Source: https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8
        
- Model Description

Source: https://www.kaggle.com/chapagain/titanic-solution-a-beginner-s-guide