# Supervised Machine Learning 
This project presents my analysis of the popular Titanic dataset, along with a model that predicts whether a passenger aboard the Titanic survived. <br>


## 1. Import libraries and Datasets
Here I will import the libraries that I will be using in my notebook, especially for EDA. 

In [2]:
import pandas as pd
import numpy as np
from collections import Counter
import plotly.express as px
import plotly.figure_factory as ff
import matplotlib.pyplot as plt

Kaggle provides 3 datasets for this challenge. <br>
```train.csv``` , training dataset which we will be working on predominantly.<br>
```test.csv``` , test dataset on which we'll make our final predictions. <br>
```gender_submission.csv``` , this is the format in which we want to submit our final solution.<br>
<br>
Let's start by reading the first two datasets as a pandas dataframe. 

In [3]:
test = pd.read_csv('/kaggle/input/titanic/test.csv')
train = pd.read_csv('/kaggle/input/titanic/train.csv')

Now let's have a quick look at the datasets.

In [4]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


We can see that the test dataset does not have the ```'Survived'``` column, which is our response variable. 

In [6]:
print("Training set shape: ", train.shape)
print("Test set shape: ", test.shape)

Training set shape:  (891, 12)
Test set shape:  (418, 11)


Let's check the data types of the training and test set.

In [7]:
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass  

We also need to check for any ```Null``` values in both of these datasets. 

In [8]:
train.isnull().sum().sort_values(ascending=False)

Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64

In [9]:
test.isnull().sum().sort_values(ascending=False)

Cabin          327
Age             86
Fare             1
PassengerId      0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Embarked         0
dtype: int64
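Raw counts are easier to compare across datasets of different sizes when expressed as a share of rows (for example, 687 of 891 training rows, about 77%, have no ```Cabin``` value). The sketch below shows one way to get that share; the ```toy``` frame and its values are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train/test; the values are invented.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# isnull() returns booleans, so the column mean is the fraction of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to ```train``` or ```test``` would give the missing share per column.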

```'Cabin'``` and ```'Age'``` are the two columns with the most missing values in both datasets. We will take care of them later during feature engineering. For now, let's start our exploratory data analysis of the training dataset. <br>
<br>
## 2. Exploratory Data Analysis (EDA) 
Exploratory data analysis is the process of visualising and analysing data to extract insights. Let's start by converting ```'Pclass'``` from int to string so that it is treated as a categorical variable.

In [10]:
train['Pclass'] = train['Pclass'].astype(str)

### Categorical variable: Sex

In [11]:
train['Sex'].value_counts(dropna= False)

male      577
female    314
Name: Sex, dtype: int64

We can see that there were more male passengers than female passengers on the Titanic.

In [12]:
sex_train = train[['Sex', 'Survived']].groupby('Sex').mean().sort_values(by='Survived', ascending=False).reset_index()
sex_train

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908
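Because ```Survived``` is coded as 0 or 1, the group mean is simply the fraction of survivors in each group. A minimal sketch on invented rows (the ```toy``` frame is an assumption, not the real data) shows that the mean matches survivors divided by group size:

```python
import pandas as pd

# Invented rows: 2 of 3 women and 1 of 2 men survive.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male"],
    "Survived": [1, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group.
rate_by_mean = toy.groupby("Sex")["Survived"].mean()

# The same quantity computed explicitly as survivors / passengers.
counts = toy.groupby("Sex")["Survived"].agg(["sum", "count"])
rate_by_ratio = counts["sum"] / counts["count"]

print(rate_by_mean)
```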


Female passengers have a much higher survival rate than male passengers (about 74% versus 19%). Let's visualize it in an interactive bar plot using plotly. 

In [13]:
px.bar(
    x = 'Sex', y = 'Survived',
    data_frame=sex_train, color='Sex',
    title="Survival Probability by Gender",
    width=500, height=500,
    color_discrete_map={'female': 'lightpink','male': 'royalblue'}
)


### Categorical variable: Pclass (Passenger Class) <br>
Here, 1 = First class, 2 = Second class, 3 = Third class.

In [14]:
train['Pclass'].value_counts(dropna= False)

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [15]:
pclass_train = train[['Pclass', 'Survived']].groupby('Pclass').mean().sort_values(by='Survived', ascending=False).reset_index()
pclass_train

Unnamed: 0,Pclass,Survived
0,1,0.62963
1,2,0.472826
2,3,0.242363


Survival probability decreases with passenger class: first class passengers are more likely to survive than passengers in the other two classes. 

In [16]:
px.bar(
    x = 'Pclass', y = 'Survived',
    data_frame=pclass_train, color='Pclass',
    title="Survival Probability by Passenger Class",
    width=500, height=500,
    color_discrete_map={'1': 'gold','2': 'lightsteelblue','3':'indianred'}
)


### Categorical variable: Embarked <br>
Point of embarkation, where C = Cherbourg, Q = Queenstown, S = Southampton

In [17]:
train['Embarked'].value_counts(dropna= False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

In [18]:
embark_train = train[['Embarked', 'Survived']].groupby('Embarked').mean().sort_values(by='Survived', ascending=False).reset_index()
embark_train

Unnamed: 0,Embarked,Survived
0,C,0.553571
1,Q,0.38961
2,S,0.336957


Passengers who embarked from Cherbourg have a better chance of survival than those from the other two ports. This is likely because a larger share of the passengers who embarked there travelled first class.  

In [19]:
px.bar(
    x = 'Embarked', y = 'Survived',
    data_frame=embark_train, color='Embarked',
    title="Survival Probability by Embark Location",
    width=500, height=500, 
)

In [20]:
px.histogram(
    x = 'Pclass', 
    data_frame=train.dropna(subset = ['Embarked']), color='Pclass',
    title="Passenger Class with respect to Embark Location",
    width=1000, height=500,
    color_discrete_map={'1': 'gold','2': 'lightsteelblue','3':'indianred'},
    facet_col='Embarked',
    histfunc ='avg',
).update_layout(yaxis_title="Count")

Here, we can see that while Southampton has a higher count of first class passengers, there are also significantly more third class passengers, who mostly did not survive. <br>
<br>
### Detect and remove outliers in numerical variables. 
Next, we will analyze numerical variables, such as ```'Age'```, ```'Fare'```, etc. But first, we need to address outliers in our dataset. Outliers are data points that deviate from the overall trend of the data. Dealing with outliers is crucial because they have the potential to distort our data towards extreme values, leading to inaccurate predictions by our model. Here, I will use a premade function to address this issue. 

In [21]:
def detect_outliers(df, n, features):
    """
    This function will loop through a list of features and detect outliers in each one of those features. In each
    loop, a data point is deemed an outlier if it is less than the first quartile minus the outlier step or exceeds
    third quartile plus the outlier step. The outlier step is defined as 1.5 times the interquartile range. Once the 
    outliers have been determined for one feature, their indices will be stored in a list before proceeding to the next
    feature and the process repeats until the very last feature is completed. Finally, using the list with outlier 
    indices, we will count the frequencies of the index numbers and return them if their frequency exceeds n times.    
    """
    outlier_indices = [] 
    for col in features: 
        Q1 = np.percentile(df[col], 25)
        Q3 = np.percentile(df[col], 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR 
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col) 
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(key for key, value in outlier_indices.items() if value > n) 
    return multiple_outliers

outliers_to_drop = detect_outliers(train, 2, ['Age', 'SibSp', 'Parch', 'Fare'])
print("We will drop these {} indices: ".format(len(outliers_to_drop)), outliers_to_drop)
print("Before: {} rows".format(len(train)))
train = train.drop(outliers_to_drop, axis = 0).reset_index(drop = True)
print("After: {} rows".format(len(train)))

We will drop these 10 indices:  [27, 88, 159, 180, 201, 324, 341, 792, 846, 863]
Before: 891 rows
After: 881 rows
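To make the rule inside ```detect_outliers``` concrete, here is a small worked example on invented numbers. It also illustrates a caveat: ```np.percentile``` returns ```nan``` when the input contains missing values, so a column with gaps (such as ```'Age'``` at this point in the notebook) would have no points flagged. ```np.nanpercentile``` is shown as one possible alternative; the helper name ```iqr_bounds``` is hypothetical.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])

def iqr_bounds(x, percentile=np.percentile):
    """Return (lower, upper) fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1 = percentile(x, 25)
    q3 = percentile(x, 75)
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

# Q1 = 2.75 and Q3 = 6.25, so the fences are -2.5 and 11.5.
lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)

# With a missing value, np.percentile yields nan fences and nothing is flagged.
with_nan = np.append(values, np.nan)
nan_lower, nan_upper = iqr_bounds(with_nan)

# np.nanpercentile ignores the missing value and recovers the same fences.
safe_lower, safe_upper = iqr_bounds(with_nan, percentile=np.nanpercentile)
print(nan_lower, nan_upper, safe_lower, safe_upper)
```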


Let's start by building a heatmap so we can check for correlations between the numerical variables. 

In [22]:
px.imshow(train[['Survived', 'SibSp', 'Parch', 'Age', 'Fare']].corr().round(2), text_auto=True,
          height=500,
          width=500)

### Numerical variable: SibSp <br>
Number of siblings or spouses aboard the Titanic. 

In [23]:
train['SibSp'].value_counts(dropna= False)

0    608
1    209
2     28
4     18
3     13
5      5
Name: SibSp, dtype: int64

In [24]:
sibsp_train = train[['SibSp', 'Survived']].groupby('SibSp').mean().sort_values(by='SibSp').reset_index()
sibsp_train['SibSp'] = sibsp_train['SibSp'].astype(str)
sibsp_train

Unnamed: 0,SibSp,Survived
0,0,0.345395
1,1,0.535885
2,2,0.464286
3,3,0.153846
4,4,0.166667
5,5,0.0


In [25]:
px.bar(
    x = 'SibSp', y = 'Survived',
    data_frame=sibsp_train, color='SibSp',
    title="Survival Probability by number of Siblings/Spouse onboard",
    width=600, height=500, 
)

### Numerical variable: Parch <br>
Number of parents or children aboard the Titanic. 

In [26]:
parch_train = train[['Parch', 'Survived']].groupby('Parch').mean().sort_values(by='Parch').reset_index()
parch_train['Parch'] = parch_train['Parch'].astype(str)
parch_train

Unnamed: 0,Parch,Survived
0,0,0.343658
1,1,0.550847
2,2,0.542857
3,3,0.6
4,4,0.0
5,5,0.2
6,6,0.0


In [27]:
px.bar(
    x = 'Parch', y = 'Survived',
    data_frame=parch_train, color='Parch',
    title="Survival Probability by number of Parents/Children onboard",
    width=600, height=500, 
)

### Numerical variable: Age <br>
Age in years, fractional if less than 1. <br>
To analyze ```'Age'```, we will make a dist plot. 

In [28]:
arr = np.array(train['Age'].dropna())
fig = ff.create_distplot([arr], ['Age'])
fig.update_layout(
    height = 500, width = 1000
)

Let's compare the distribution of Age according to whether the passenger survived or not. 

In [29]:
arr = np.array(
    (train.loc[train['Survived'] == 0])['Age'].dropna()
)
arr2 = np.array(
    (train.loc[train['Survived'] == 1])['Age'].dropna()
)
fig = ff.create_distplot([arr,arr2], ['Age who did not survive','Age who did survive'], colors=['slategray', 'magenta'], bin_size=2.5, show_hist=False)
fig.update_layout(
    height = 500, width = 800
)

We can see that children are more likely to survive. 

### Numerical variable: Fare <br>
Similar to ```'Age'```, we will build a distplot. 

In [30]:
arr = np.array(train['Fare'].dropna())
fig = ff.create_distplot([arr], ['Fare'])
fig.update_layout(
    height = 500, width = 1000
)

The distribution is heavily skewed towards higher values. We will address this in the next step.  

## 3. Data preprocessing <br>
We need to prepare our dataset for the model by filling in missing values, feature engineering, feature encoding, etc. Let's start by dropping the ```'Ticket'``` column in both of the datasets since it is not really significant, and would make our model more prone to overfitting. 

In [31]:
train = train.drop(['Ticket'], axis = 1)
test = test.drop(['Ticket'], axis = 1)

For the missing values in the ```'Embarked'``` column, we will fill them with the mode of the column. 

In [32]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
train.isnull().sum().sort_values(ascending = False)


Cabin          680
Age            170
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Fare             0
Embarked         0
dtype: int64

In [33]:
test.isnull().sum().sort_values(ascending = False)

Cabin          327
Age             86
Fare             1
PassengerId      0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Embarked         0
dtype: int64

We have 1 missing value in the ```'Fare'``` column of the test dataset. Let's fill it with the median of the column. 

In [34]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
test.isnull().sum().sort_values(ascending = False)

Cabin          327
Age             86
PassengerId      0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Fare             0
Embarked         0
dtype: int64

Now there are two columns which have missing values in both of the datasets, ```'Age'``` and ```'Cabin'```. Let's start by joining both of the datasets. 

In [35]:
combine = pd.concat([train, test], axis = 0).reset_index(drop = True)
combine.isnull().sum().sort_values(ascending = False)

Cabin          1007
Survived        418
Age             256
PassengerId       0
Pclass            0
Name              0
Sex               0
SibSp             0
Parch             0
Fare              0
Embarked          0
dtype: int64

In [36]:
combine[combine['Cabin'].notnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,C123,S
6,7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,51.8625,E46,S
10,11,1.0,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,16.7000,G6,S
11,12,1.0,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...
1285,1296,,1,"Frauenthal, Mr. Isaac Gerald",male,43.0,1,0,27.7208,D40,C
1286,1297,,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20.0,0,0,13.8625,D38,C
1288,1299,,1,"Widener, Mr. George Dunton",male,50.0,1,1,211.5000,C80,C
1292,1303,,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,90.0000,C78,Q


For the ```'Cabin'``` column, there are 1007 missing values.  The initial letter of the Cabin values corresponds to the decks where the cabins are situated. These decks were primarily designated for a single passenger class, although certain decks were utilized by multiple passenger classes. We will keep the initial letter of each of the values, and fill in the missing values with ```'M'```. 

In [37]:
combine['Cabin'] = combine['Cabin'].str[0]
combine[combine['Cabin'].notnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,C
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,C,S
6,7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,51.8625,E,S
10,11,1.0,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,16.7000,G,S
11,12,1.0,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,26.5500,C,S
...,...,...,...,...,...,...,...,...,...,...,...
1285,1296,,1,"Frauenthal, Mr. Isaac Gerald",male,43.0,1,0,27.7208,D,C
1286,1297,,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20.0,0,0,13.8625,D,C
1288,1299,,1,"Widener, Mr. George Dunton",male,50.0,1,1,211.5000,C,C
1292,1303,,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,90.0000,C,Q


In [38]:
# 'M' marks passengers whose cabin is missing
combine['Cabin'] = combine['Cabin'].fillna('M')
combine.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,M,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,M,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,C,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,M,S


Interestingly, we can also group most of the cabins together depending on ```'Pclass'```.

In [39]:
combine['Pclass'] = combine['Pclass'].astype(str)

In [40]:
px.histogram(
    x = 'Cabin', 
    data_frame=combine, color='Cabin',
    title="Cabin with respect to Pclass",
    width=1500, height=500,
    facet_col='Pclass',
    histfunc ='count',
).update_layout(yaxis_title="Count")

We can conclude that: 
<li>Cabin A, B, and C are only for 1st class passengers.<br>
<li>Cabin D has 87% 1st class and 13% 2nd class passengers.<br>
<li>Cabin E has 83% 1st class, 10% 2nd class, and 7% 3rd class passengers.<br>
<li>Cabin F has 62% 2nd class and 38% 3rd class passengers.<br>
<li>Cabin G is only occupied by 3rd class passengers.<br>
<li>Cabin T is occupied by a 1st class passenger and is grouped with the A cabin.<br>
<li>Passengers with the label "M" in the Cabin feature have missing cabin information and are treated as a separate cabin.<br>
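Percentages like the ones in the list above are what a row-normalised crosstab produces. The following is a minimal sketch on invented rows; the ```toy``` frame is an assumption and does not reproduce the Titanic figures.

```python
import pandas as pd

# Invented deck/class pairs, with Pclass stored as strings as in the notebook.
toy = pd.DataFrame({
    "Cabin":  ["A", "D", "D", "D", "D", "F", "F", "G"],
    "Pclass": ["1", "1", "1", "1", "2", "2", "3", "3"],
})

# normalize="index" divides each row by its total, giving the class share per deck.
share = pd.crosstab(toy["Cabin"], toy["Pclass"], normalize="index").round(2)
print(share)
```

Calling the same function with ```combine['Cabin']``` and ```combine['Pclass']``` would give the class mix for each deck letter.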

In [41]:
combine['Cabin'] = combine['Cabin'].replace(['A', 'B', 'C'], 'ABC')
combine['Cabin'] = combine['Cabin'].replace(['D', 'E'], 'DE')
combine['Cabin'] = combine['Cabin'].replace(['F', 'G'], 'FG')
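# Deck A was already merged into 'ABC' above, so this assigns a separate 'A' label
# (it later becomes its own Cabin_A dummy column); mapping 'T' to 'ABC' would merge it with deck A.
combine.loc[combine['Cabin'] == 'T', 'Cabin'] = 'A'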
combine.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,M,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,ABC,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,M,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,ABC,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,M,S


In [42]:
combine.isnull().sum().sort_values(ascending = False)

Survived       418
Age            256
PassengerId      0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Fare             0
Cabin            0
Embarked         0
dtype: int64

Now only the ```'Age'``` column has missing values. For each missing age, we locate the rows that have the same ```SibSp```, ```Parch``` and ```Pclass``` values and fill in the median age of those rows. If none of the matching rows has a known age, we fall back to the median of the entire Age column.

In [43]:
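for i in combine[combine['Age'].isnull()].index:
    # Fallback: median of all ages known at this point
    median_age = combine['Age'].median()
    # Rows sharing this passenger's SibSp, Parch and Pclass values
    same_group = ((combine['SibSp'] == combine.loc[i, 'SibSp'])
                  & (combine['Parch'] == combine.loc[i, 'Parch'])
                  & (combine['Pclass'] == combine.loc[i, 'Pclass']))
    predict_age = combine.loc[same_group, 'Age'].median()
    # Write through .loc (the index was reset, so labels equal positions)
    # rather than chained indexing, which may modify a temporary copy
    combine.loc[i, 'Age'] = median_age if np.isnan(predict_age) else predict_age
combine['Age'].isnull().sum()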


Mean of empty slice



0
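The same idea can be written without an explicit loop. The sketch below is an alternative, not the notebook's method: it uses ```groupby(...).transform('median')``` on an invented ```toy``` frame. One difference to keep in mind is that the loop recomputes medians after every fill, while this version uses the medians of the originally known ages, so the two can give slightly different values.

```python
import numpy as np
import pandas as pd

# Invented rows; the last group has no known age at all.
toy = pd.DataFrame({
    "SibSp":  [0, 0, 0, 1, 1, 2],
    "Parch":  [0, 0, 0, 0, 0, 2],
    "Pclass": [3, 3, 3, 1, 1, 2],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

# Median age of each (SibSp, Parch, Pclass) group, aligned to the original rows.
group_median = toy.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")

# Fallback for groups without any known age.
overall_median = toy["Age"].median()

toy["Age"] = toy["Age"].fillna(group_median).fillna(overall_median)
print(toy)
```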

Remember that our passenger fare column exhibits a significantly skewed distribution towards higher values. To tackle this, we will use a log transformation.

In [44]:
combine['Fare'] = combine['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
arr = np.array(combine['Fare'])
fig = ff.create_distplot([arr], ['Fare'])
fig.update_layout(
    height = 500, width = 1000
)
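To see why the log transformation helps, the sketch below compares the skewness of a few invented fare values before and after the transform. The guard for ```x > 0``` mirrors the cell above, since the logarithm of 0 is not finite. ```np.log1p``` is shown only as a possible alternative that handles zeros without a guard; it is not what this notebook uses.

```python
import numpy as np
import pandas as pd

# Invented fares with a long right tail and one zero fare.
fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 53.1, 71.3, 263.0, 512.3])

# Same transform as in the notebook: log for positive fares, 0 otherwise.
log_fare = fare.map(lambda x: np.log(x) if x > 0 else 0)

# Alternative: log(1 + x) is defined at x = 0.
log1p_fare = np.log1p(fare)

# Skewness close to 0 indicates a roughly symmetric distribution.
print(fare.skew(), log_fare.skew(), log1p_fare.skew())
```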

Next, we will extract titles from the ```'Name'``` column, and we will group similar ones together. 

In [45]:
combine['Title'] = [name.split(',')[1].split('.')[0].strip() for name in combine['Name']]
combine['Title'] = combine['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Lady', 'Jonkheer', 'Don', 'Capt', 'the Countess', 'Sir', 'Dona'], 'Others')
combine['Title'] = combine['Title'].replace(['Mlle', 'Ms'], 'Miss')
combine['Title'] = combine['Title'].replace('Mme', 'Mrs')

px.histogram(
    x='Title',
    color = 'Title',
    data_frame=combine,
    width=500, height=500,
).update_layout(xaxis_type="category")
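The title extraction relies on the names following the pattern "Surname, Title. Given names". Here is a short sketch of the two split steps on a few example names, next to a regular-expression variant that is offered only as an alternative. The third name is included to illustrate a multi-word title.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Step by step: take the text after the comma, then the text before the first period.
by_split = [name.split(",")[1].split(".")[0].strip() for name in names]

# Alternative: capture everything between the comma and the first period.
by_regex = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(by_split)
print(by_regex.tolist())
```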

In [46]:
title_train = combine[['Title', 'Survived']].groupby(['Title'], as_index = False).mean().sort_values(by = 'Survived', ascending = False)
px.bar(
    x = 'Title', y = 'Survived',
    data_frame=title_train, color='Title',
    title="Survival Probability by Title",
    width=600, height=500, 
)

And as suspected, females (Mrs. and Miss) are more likely to survive. We can now drop the ```'Name'``` column. 

In [47]:
combine = combine.drop('Name', axis = 1)

In [48]:
combine['Pclass'] = combine['Pclass'].astype(int)
combine.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
0,1,0.0,3,male,22.0,1,0,1.981001,M,S,Mr
1,2,1.0,1,female,38.0,1,0,4.266662,ABC,C,Mrs
2,3,1.0,3,female,26.0,0,0,2.070022,M,S,Miss
3,4,1.0,1,female,35.0,1,0,3.972177,ABC,S,Mrs
4,5,0.0,3,male,35.0,0,0,2.085672,M,S,Mr


Another feature we can create is whether the passenger is travelling alone. First, we will calculate family size from ```SibSp``` and ```Parch```, then use it to check if the passenger is alone or not. 

In [49]:
combine['FamilySize'] = combine['SibSp'] + combine['Parch'] + 1
combine['IsAlone'] = 0
combine.loc[combine['FamilySize'] == 1, 'IsAlone'] = 1
combine[['IsAlone', 'Survived']].groupby('IsAlone', as_index = False).mean().sort_values(by = 'Survived', ascending = False)

Unnamed: 0,IsAlone,Survived
0,0,0.514535
1,1,0.303538


Passengers who are alone are less likely to survive. We can now drop the ```SibSp``` and ```Parch``` columns, along with the intermediate ```FamilySize``` column. 

In [50]:
combine = combine.drop(['SibSp', 'Parch', 'FamilySize'], axis = 1)
combine.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked,Title,IsAlone
0,1,0.0,3,male,22.0,1.981001,M,S,Mr,0
1,2,1.0,1,female,38.0,4.266662,ABC,C,Mrs,0
2,3,1.0,3,female,26.0,2.070022,M,S,Miss,1
3,4,1.0,1,female,35.0,3.972177,ABC,S,Mrs,0
4,5,0.0,3,male,35.0,2.085672,M,S,Mr,1


Next, we will transform ```Age``` into an ordinal variable. An ordinal variable is much like a categorical variable, but its values have an intrinsic order. We will group the ages into 5 age bands, assign a number to each band, and then compute our new ```Age*Class``` feature. 

In [51]:
combine['AgeBand'] = pd.cut(combine['Age'], 5)
combine.loc[combine['Age'] <= 16.136, 'Age'] = 0
combine.loc[(combine['Age'] > 16.136) & (combine['Age'] <= 32.102), 'Age'] = 1
combine.loc[(combine['Age'] > 32.102) & (combine['Age'] <= 48.068), 'Age'] = 2
combine.loc[(combine['Age'] > 48.068) & (combine['Age'] <= 64.034), 'Age'] = 3
combine.loc[combine['Age'] > 64.034 , 'Age'] = 4
combine = combine.drop('AgeBand', axis = 1)
combine['Age'] = combine['Age'].astype('int')
combine['Age*Class'] = combine['Age'] * combine['Pclass']
combine[['Age', 'Pclass', 'Age*Class']].head()

Unnamed: 0,Age,Pclass,Age*Class
0,1,3,3
1,2,1,2
2,1,3,3
3,2,1,2
4,2,3,6
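The band edges used above come from ```pd.cut```, which splits the value range into equal-width intervals. As a sketch of a more compact route, ```pd.cut``` can also return the band number directly through ```labels=False```. The ages below are invented for illustration.

```python
import pandas as pd

ages = pd.Series([0.0, 20.0, 40.0, 60.0, 80.0])

# Interval labels, comparable to the AgeBand column in the notebook.
bands = pd.cut(ages, 5)

# Integer codes 0..4 for the same five equal-width bins.
codes = pd.cut(ages, 5, labels=False)

print(bands.tolist())
print(codes.tolist())
```

With this approach the cut points do not have to be copied by hand into a series of ```.loc``` assignments.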


We will do the same with the ```Fare``` feature. 

In [52]:
combine['FareBand'] = pd.cut(combine['Fare'], 5)
combine[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by = 'FareBand')

Unnamed: 0,FareBand,Survived
0,"(-0.00624, 1.248]",0.066667
1,"(1.248, 2.496]",0.225627
2,"(2.496, 3.743]",0.431884
3,"(3.743, 4.991]",0.669118
4,"(4.991, 6.239]",0.692308


In [53]:
combine.loc[combine['Fare'] <= 1.248, 'Fare'] = 0
combine.loc[(combine['Fare'] > 1.248) & (combine['Fare'] <= 2.496), 'Fare'] = 1
combine.loc[(combine['Fare'] > 2.496) & (combine['Fare'] <= 3.743), 'Fare'] = 2
combine.loc[(combine['Fare'] > 3.743) & (combine['Fare'] <= 4.991), 'Fare'] = 3
combine.loc[combine['Fare'] > 4.991, 'Fare'] = 4

In [54]:
combine['Fare'] = combine['Fare'].astype('int')
combine = combine.drop('FareBand', axis = 1)
combine.head()


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked,Title,IsAlone,Age*Class
0,1,0.0,3,male,1,1,M,S,Mr,0,3
1,2,1.0,1,female,2,3,ABC,C,Mrs,0,2
2,3,1.0,3,female,1,1,M,S,Miss,1,3
3,4,1.0,1,female,2,3,ABC,S,Mrs,0,2
4,5,0.0,3,male,2,1,M,S,Mr,1,6


Next, we need to encode our categorical variables. Most machine learning models, including the ones used below, expect numeric inputs, so we need to encode all of our categorical data before we can fit the models.

In [55]:
combine = pd.get_dummies(combine, columns = ['Title'])
combine = pd.get_dummies(combine, columns = ['Sex'])
combine = pd.get_dummies(combine, columns = ['Embarked'])
combine = pd.get_dummies(combine, columns = ['Cabin'])
combine.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,IsAlone,Age*Class,Title_Master,Title_Miss,Title_Mr,...,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_ABC,Cabin_DE,Cabin_FG,Cabin_M
0,1,0.0,3,1,1,0,3,0,0,1,...,0,1,0,0,1,0,0,0,0,1
1,2,1.0,1,2,3,0,2,0,0,0,...,1,0,1,0,0,0,1,0,0,0
2,3,1.0,3,1,1,1,3,0,1,0,...,1,0,0,0,1,0,0,0,0,1
3,4,1.0,1,2,3,0,2,0,0,0,...,1,0,0,0,1,0,1,0,0,0
4,5,0.0,3,2,1,1,6,0,0,1,...,0,1,0,0,1,0,0,0,0,1
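One-hot encoding replaces a categorical column with one indicator column per category. The sketch below shows this on an invented frame. Note that recent pandas versions return boolean indicator columns by default, so ```dtype=int``` is passed here to get the 0/1 values shown in the output above.

```python
import pandas as pd

# Invented rows with one categorical and one numeric column.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"], "Fare": [1, 3, 1, 2]})

# Each category becomes its own 0/1 column; other columns are left untouched.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Exactly one indicator column is 1 in every row, which is why the original category can always be recovered.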


In [56]:
# .copy() avoids SettingWithCopyWarning when columns are modified later
train = combine[:len(train)].copy()
test = combine[len(train):].copy()

In [57]:
train = train.drop('PassengerId', axis = 1)
train.head()

Unnamed: 0,Survived,Pclass,Age,Fare,IsAlone,Age*Class,Title_Master,Title_Miss,Title_Mr,Title_Mrs,...,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_ABC,Cabin_DE,Cabin_FG,Cabin_M
0,0.0,3,1,1,0,3,0,0,1,0,...,0,1,0,0,1,0,0,0,0,1
1,1.0,1,2,3,0,2,0,0,0,1,...,1,0,1,0,0,0,1,0,0,0
2,1.0,3,1,1,1,3,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
3,1.0,1,2,3,0,2,0,0,0,1,...,1,0,0,0,1,0,1,0,0,0
4,0.0,3,2,1,1,6,0,0,1,0,...,0,1,0,0,1,0,0,0,0,1


In [58]:
train['Survived'] = train['Survived'].astype('int')
train.head()

Unnamed: 0,Survived,Pclass,Age,Fare,IsAlone,Age*Class,Title_Master,Title_Miss,Title_Mr,Title_Mrs,...,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_ABC,Cabin_DE,Cabin_FG,Cabin_M
0,0,3,1,1,0,3,0,0,1,0,...,0,1,0,0,1,0,0,0,0,1
1,1,1,2,3,0,2,0,0,0,1,...,1,0,1,0,0,0,1,0,0,0
2,1,3,1,1,1,3,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
3,1,1,2,3,0,2,0,0,0,1,...,1,0,0,0,1,0,1,0,0,0
4,0,3,2,1,1,6,0,0,1,0,...,0,1,0,0,1,0,0,0,0,1


In [59]:
test = test.drop('Survived', axis = 1)
test.head()

Unnamed: 0,PassengerId,Pclass,Age,Fare,IsAlone,Age*Class,Title_Master,Title_Miss,Title_Mr,Title_Mrs,...,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_ABC,Cabin_DE,Cabin_FG,Cabin_M
881,892,3,2,1,1,6,0,0,1,0,...,0,1,0,1,0,0,0,0,0,1
882,893,3,2,1,0,6,0,0,0,1,...,1,0,0,0,1,0,0,0,0,1
883,894,2,3,1,1,6,0,0,1,0,...,0,1,0,1,0,0,0,0,0,1
884,895,3,1,1,1,3,0,0,1,0,...,0,1,0,0,1,0,0,0,0,1
885,896,3,1,2,0,3,0,0,0,1,...,1,0,0,0,1,0,0,0,0,1


## 4. Modelling <br>
I have gone with the following classifiers for this project: <br>
<li> Logistic regression
<li> Support vector machines
<li> K-nearest neighbours
<li> Decision tree
<li> Random forest
<li> XGBoost
<br>
<br>
Let's start by importing the required libraries. 


In [60]:
# Machine learning models (only the classifiers used below)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Model evaluation
from sklearn.model_selection import cross_val_score

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

Before we dive into building our models, we need to do some prep work. First, we should split our training data into two parts: one for the independent variables ```X``` and another for the dependent variable ```Y```. <br>
In our case, the dependent variable ```Y``` will be the ```Survived``` column in our training set. It's what we want our models to predict. As for the independent variables ```X```, they will consist of all the other columns in the training set, except for the ```Survived``` column. <br>
Once our models are trained, they can then use the information in ```X``` to make predictions on new data ```X_test``` and classify whether a passenger survived or not.

In [61]:
X_train = train.drop('Survived', axis = 1)
Y_train = train['Survived']
X_test = test.drop('PassengerId', axis = 1).copy()
print("X_train shape: ", X_train.shape)
print("Y_train shape: ", Y_train.shape)
print("X_test shape: ", X_test.shape)

X_train shape:  (881, 20)
Y_train shape:  (881,)
X_test shape:  (418, 20)


### Support Vector Machines

In [62]:
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print('Score:',acc_svc)

Score: 81.95


### Random Forest

In [63]:
random_forest = RandomForestClassifier(n_estimators = 100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print('Score:',acc_random_forest)

Score: 88.2


### Decision Trees

In [64]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
print('Score:',acc_decision_tree)

Score: 88.2


### Logistic Regression

In [65]:
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
print('Score:',acc_log)

Score: 80.82


### K-nearest neighbours

In [66]:
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
print('Score:',acc_knn)

Score: 85.47


### XGBoost

In [67]:
xgboost = XGBClassifier(n_estimators=25)
xgboost.fit(X_train, Y_train)
Y_pred = xgboost.predict(X_test)
acc_xgboost = round(xgboost.score(X_train, Y_train) * 100, 2)
print('Score:',acc_xgboost)

Score: 86.27


### K-fold cross validation
The scores above are measured on the training data itself, so they overstate how well each model generalises. To compare the models fairly and select the one with the highest prediction accuracy, we will use K-fold cross-validation. 

In [68]:
classifiers = []
classifiers.append(LogisticRegression(max_iter=200))
classifiers.append(SVC())
classifiers.append(KNeighborsClassifier(n_neighbors = 5))
classifiers.append(DecisionTreeClassifier())
classifiers.append(RandomForestClassifier())
classifiers.append(XGBClassifier(n_estimators=25))
len(classifiers)

6

In [69]:
cv_results = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X_train, Y_train, scoring = 'accuracy', cv = 10))

In [70]:
cv_mean = []
cv_std = []
for cv_result in cv_results:
    cv_mean.append(cv_result.mean())
    cv_std.append(cv_result.std())

In [71]:
cv_res = pd.DataFrame({'Cross Validation Mean': cv_mean, 'Cross Validation Std': cv_std, 'Algorithm': ['Logistic Regression', 'Support Vector Machines', 'KNN', 'Decision Tree', 'Random Forest', 'XGBoost']})
cv_res.sort_values(by = 'Cross Validation Mean', ascending = False, ignore_index = True)

Unnamed: 0,Cross Validation Mean,Cross Validation Std,Algorithm
0,0.821782,0.036357,XGBoost
1,0.811581,0.030551,Random Forest
2,0.810419,0.032679,Logistic Regression
3,0.809308,0.034384,KNN
4,0.809295,0.044291,Decision Tree
5,0.799093,0.045457,Support Vector Machines


As we can see, ```XGBoost``` has the highest cross validation mean and thus we will proceed with this model.
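As a rough illustration of what K-fold cross-validation does, the sketch below splits 881 row indices into 10 folds and uses each fold once for validation while the remaining nine are used for training. It is a simplified, unshuffled version written with NumPy only; the helper name ```k_fold_indices``` is hypothetical. For classifiers, ```cross_val_score``` with an integer ```cv``` uses stratified folds, which additionally keep the class ratio similar in every fold.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # All folds except the i-th form the training part.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=881, k=10))
val_sizes = [len(val) for _, val in splits]
print(val_sizes)
```

Every row is used for validation exactly once, and the reported score is the mean accuracy over the 10 validation folds.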

### Hyperparameter Tuning
Hyperparameter tuning involves adjusting the settings of a model to optimize its performance. In this case, I will be tuning the parameters of our model using a technique called ```RandomizedSearchCV```. This will help us find the best combination of parameters for the model.

In [72]:
gbm_param_grid = {
    'n_estimators': range(8, 20),
    'max_depth': range(6, 10),
    'learning_rate': [.4, .45, .5, .55, .6],
    'colsample_bytree': [.6, .7, .8, .9, 1]
}

# Instantiate the classifier; n_estimators is overridden by the search
gbm = XGBClassifier(n_estimators=25)

# Set up the randomized search: 50 sampled candidates, 10-fold cross-validation
xgb_random = RandomizedSearchCV(param_distributions=gbm_param_grid, 
                                    estimator = gbm, scoring = "accuracy", 
                                    verbose = 1, n_iter = 50, cv = 10)

# Fit the search on the training data
xgb_random.fit(X_train, Y_train)

# Print the best parameters and the best mean cross-validated accuracy
print("Best parameters found: ", xgb_random.best_params_)
print("Best accuracy found: ", xgb_random.best_score_)

Fitting 10 folds for each of 50 candidates, totalling 500 fits
Best parameters found:  {'n_estimators': 11, 'max_depth': 9, 'learning_rate': 0.4, 'colsample_bytree': 1}
Best accuracy found:  0.8297242083758938
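To show what the randomized search does with the grid, the following sketch enumerates every combination and then draws 50 of them at random, using only the standard library. It illustrates the sampling idea and is not scikit-learn's implementation. The fixed seed is an assumption added here to make the draw repeatable; ```RandomizedSearchCV``` accepts a ```random_state``` argument for the same purpose.

```python
import itertools
import random

param_grid = {
    "n_estimators": range(8, 20),
    "max_depth": range(6, 10),
    "learning_rate": [0.4, 0.45, 0.5, 0.55, 0.6],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1],
}

# Every combination in the grid: 12 * 4 * 5 * 5 = 1200 candidates.
all_combos = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]

# A random search evaluates only a random subset of them.
rng = random.Random(0)
sampled = rng.sample(all_combos, 50)

print(len(all_combos), len(sampled))
```

With 50 of 1200 combinations evaluated, different runs can land on different "best" parameters unless the search is seeded.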


In [73]:
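# Note: these values differ from the best parameters printed above (n_estimators=11, learning_rate=0.4).
# The search is not seeded, so each run can select a different combination.
xgboost = XGBClassifier(n_estimators= 13, max_depth= 9, learning_rate= 0.6, colsample_bytree= 1)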
xgboost.fit(X_train, Y_train)
Y_pred = xgboost.predict(X_test)
acc_xgboost = round(xgboost.score(X_train, Y_train) * 100, 2)
print('Score:', acc_xgboost)

Score: 86.95


In [74]:
cross_val_score(xgboost, X_train, Y_train, scoring = 'accuracy', cv = 10).mean()

0.8297497446373852

We can see that the cross-validation mean improved slightly, from about 0.822 to about 0.830. 

## 5. Submission 
```gender_submission.csv``` contains the format in which we need our results. So let's start by importing it. 

In [75]:
gs = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
gs.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [76]:
print(gs.shape)
print(Y_pred.shape)

(418, 2)
(418,)


We have the same number of rows in our predictions dataset. Now we just need to save it as a ```csv``` file and we are done. 

In [77]:
submit = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': Y_pred})
submit.head()

Unnamed: 0,PassengerId,Survived
881,892,0
882,893,0
883,894,0
884,895,0
885,896,0


In [78]:
submit.shape

(418, 2)

In [79]:
submit.to_csv("xgboost_titanic.csv", index = False)

My final Kaggle score was ```0.86```