<a href="https://colab.research.google.com/github/PercyAyimbilaNsolemna/Machine_Learning/blob/main/Titanic_Kaggle_Challenge/Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **KAGGLE TITANIC CHALLENGE SOLUTION**

Implementation of kaggle titanic challenge using Deep Learning (Neural Networks)

## **IMPORTING LIBRARIES**

This section imports all the libraries that will be used in building the Machine Learning workflow

In [19]:
#Imports numpy as np
import numpy as np
#Imports pandas as pd
import pandas as pd
#Imports tensorflow as tf
import tensorflow as tf
#Imports train_test_split from scikit learn model_selection
from sklearn.model_selection import train_test_split
#Imports StandardScaler from scikit learn preprocessing
from sklearn.preprocessing import StandardScaler
#Imports mean squared error from scikit learn metrics
from sklearn.metrics import mean_squared_error
#Imports accuracy score from scikit learn metrics
from sklearn.metrics import accuracy_score
#Imports Sequential model from tensorflow keras model
from tensorflow.keras.models import Sequential
#Imports the Dense layer from tensorflow keras layers
from tensorflow.keras.layers import Dense
#Imports OneHotEncode from sklearn preprocessing
from sklearn.preprocessing import OneHotEncoder
#Imports Logistic Regression from sklear.linear_model
from sklearn.linear_model import LogisticRegression
#Imports the RandomForestClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
#Imports warning and supress all future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#np.random.seed(1) is used to keep all the random function calls consistent.
np.random.seed(1)

## **IMPORTING GOOGLE DRIVE**

This section imports google drive into the notebook so that the dataset for the project can be accessed for training, cross-validation and testing

In [20]:
#Imports drive from google colab
from google.colab import drive

In [21]:
#Connects or mounts google drive to the google colab
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **TITANIC DATASET**

The titanic dataset will be used to build the neural network. Detailed information on the project can be found on [kaggle](https://www.kaggle.com/c/titanic)

The dataset used can be downloaded and explored [here](https://www.kaggle.com/c/titanic/data)

In [22]:
#Loads the training set
X = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/train.csv')

In [23]:
#Loads the test set
X_test = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/test.csv')

In [24]:
#Loads the sample gender submission file, gives a sample look of how the outcomes should be  submitted
gender_submission = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/gender_submission.csv')

## **GENDER SUBMISSION VISUALIZATION**

This section gives a snapshot of the outcome of the model and the look of the file that is supposed to be submitted.

In [25]:
#Outputs the type of data been used
print(f'The type of the gender submission is \n{type(gender_submission)}')

The type of the gender submission is 
<class 'pandas.core.frame.DataFrame'>


In [26]:
#Outputs a detailed information of the gender submission dataset
gender_submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB


In [27]:
#Outputs the shape of the gender submission dataset
print(f'The shape of the gender submission is {gender_submission.shape}')

The shape of the gender submission is (418, 2)


In [28]:
#Outputs the titles of the two columns
print(f'The titles of the gender submission file are: \n{gender_submission.columns}')

The titles of the gender submission file are: 
Index(['PassengerId', 'Survived'], dtype='object')


In [29]:
#Outputs the head of the gender_submission file
gender_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [30]:
#Outputs the tail of the gender submission dataset
gender_submission.tail()

Unnamed: 0,PassengerId,Survived
413,1305,0
414,1306,1
415,1307,0
416,1308,0
417,1309,0


## **INSIGHTS AFTER DATA EXPLORATION**

This section gives a concise information gained from the gender submission Data Ecploration.

The **predicted values (y^)** should be submitted using two columns that is the **passengerID** and the survival state, **whether he/she survived (1) or did not survive (0)**

## **DATA VISUALIZATION (TRAIN DATASET)**

In this section we will focus on visualizing the train dataset. The major areas that will be checked are checking for any row with null cell and deleting the cell(s), checking the data types stored by each feature and performing data type convertion if the need arises, and one hot encoding.

In [31]:
#Outputs the data type of the train dataset
print(f'The data type of the train dataset is: \n{type(X)}')

The data type of the train dataset is: 
<class 'pandas.core.frame.DataFrame'>


In [32]:
#Outputs information about the train dataset
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [33]:
#Outputs the shape of the train dataset
print(f'The shape of the train dataset is: {X.shape}')

The shape of the train dataset is: (891, 12)


In [34]:
#Outputs the features of the train dataset
print(f'The features in the train dataset are: \n{X.columns}')

The features in the train dataset are: 
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


## **FEATURES DEFINITION**

This section throws light on the features of the train dataset

 **VARIABLE**      | **DEFINITION**   | **KEY**
-------------------|------------------|----------------
Survived           | Survival         | 0 = No, 1 = Yes
Pclass             | Ticket class     | 1st, 2nd or 3rd
Name               | Name of Passenger |
PassengerID        | Pasenger ID       |
Sex                | Sex              |
Age                | Age in years     |
SibSp              | # of siblings / spouses aboard the Titanic |
Parch              | # of parents / children aboard the Titanic |
Ticket             | Ticket number    |
Fare               | Passenger fare   |
Cabin              | Cabin number     |
Embarked           | Port of Embarkation | C = Cherbourg, Q = Queenstown, Southampton

In [35]:
#Outputs the head of the train dataset
X.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [36]:
#Outputs the tail of the train dataset
X.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


## **EMPTY CELLS IN TRAIN AND TEST SET**

This section reviews the train and test dataset for NaN cells and fills them up.

The **Cabin** feature will be dropped due to the large number of empty cells

In [37]:
#Iterates through all the features and outputs the length of the features with NaN values in the training set
print('Number of NaN cells in each feature in the training set \n')
for feature in X.columns:
 print(f'The {feature} column has {X[pd.isna(X[feature])].shape[0]} cells with NaN value(s)')


Number of NaN cells in each feature in the training set 

The PassengerId column has 0 cells with NaN value(s)
The Survived column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 177 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 0 cells with NaN value(s)
The Cabin column has 687 cells with NaN value(s)
The Embarked column has 2 cells with NaN value(s)


The **Age feature** has **177 training examples** with NaN values

The **Cabin feature** has **687 training examples** with NaN values

The **Embarked feature** has **2 training examples** with NaN values

In [38]:
#Loops through all the features and outputs the length of the feature with NaN values in the test set
print('The number of NaN cells in each feature in the test set \n')
for feature in X_test.columns:
  print(f'The {feature} column has {X_test[pd.isna(X_test[feature])].shape[0]} cells with NaN value(s)')

The number of NaN cells in each feature in the test set 

The PassengerId column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 86 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 1 cells with NaN value(s)
The Cabin column has 327 cells with NaN value(s)
The Embarked column has 0 cells with NaN value(s)


The **Age feature** has 86 cells with NaN values

The **Cabin feature** has 327 cells with NaN values

The **Fare feature** has 1 cells with NaN values

In [39]:
#Fills all NaN cells in the Age feature with the median age to prevent bias in the train and test set
X['Age'].fillna(X['Age'].median(), inplace=True)
X_test['Age'].fillna(X_test['Age'].median(), inplace=True)

#Fills all NaN cells in the Cabin feature with the mode of the Embarked feature to maintain the distribution of the train
X['Embarked'].fillna(X['Embarked'].mode()[0], inplace=True)

#Fills all NaN cells in the Cabin feature with the mode of the Embarked feature to maintain the distribution of the train and test set
X_test['Fare'].fillna(X_test['Fare'].median(), inplace=True)

In [40]:
#Loops through the training dataset to check is there is an NaN value in any of the cells excluding the Cabin feature
print('Checking for NaN value(s) in each of the features in the training set: \n')

for feature in X.columns:
  print(f'The {feature} column has {X[pd.isna(X[feature])].shape[0]} cells with NaN value(s)')

Checking for NaN value(s) in each of the features in the training set: 

The PassengerId column has 0 cells with NaN value(s)
The Survived column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 0 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 0 cells with NaN value(s)
The Cabin column has 687 cells with NaN value(s)
The Embarked column has 0 cells with NaN value(s)


In [41]:
#Loops through the test set for any cell(s) with an NaN value excluding the Cabin feature
print('Checking for NaN value(s) in each of the features in the test set: \n')

for feature in X_test.columns:
  print(f'The {feature} column has {X_test[pd.isna(X_test[feature])].shape[0]} cells with NaN value(s)')

Checking for NaN value(s) in each of the features in the test set: 

The PassengerId column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 0 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 0 cells with NaN value(s)
The Cabin column has 327 cells with NaN value(s)
The Embarked column has 0 cells with NaN value(s)


There are no empty cells in both the traiuning and testing dataset excluding the cabin feature. The Cabin feature will be dropped in the next cell.

## **DROPPING THE CABIN FEATURE**

The **Cabin feature** is dropped due to the huge number of missing values or null values in both the train and test dataset

In [42]:
#Drops the Cabin column in both the train and test set
try:
  X.drop('Cabin', axis=1, inplace=True)
  X_test.drop('Cabin', axis=1, inplace=True)

except KeyError:
  print('The Cabin feature has already been dropped')

Drops the PassengerId feature since it will not be used in training the model

In [43]:
try:
  X.drop('PassengerId', axis=1, inplace=True)

except KeyError:
  print('The PassengerId has been dropped')

In [44]:
#Outputs the columns in both the train and test data set
print('The columns in the train set are: ')
print(X.columns)

print('\nThe columns in the test set are: ')
print(X_test.columns)

The columns in the train set are: 
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Embarked'],
      dtype='object')

The columns in the test set are: 
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Embarked'],
      dtype='object')


## **FEATURE ENGINEERING**

This section deals with creating new features out of the original features.

The **Sibsp and the Parch features** will be combined to get the **family size**

Aside that the **Age feature** will be divided into four sections that is **Children, Teenage, Adult and Elder.**

The **Fare** will equally be divided into four separate sections that's **Low fare, Median fare, Average fare and high fare.**

The Ticket, Age, Name and Fare feature will be dropped

In [45]:
#Adds the Sibsp and Parch features
#The one added is the person whose parents or siblings onboard the titanic
X['FamilySize'] = X['SibSp'] + X['Parch'] + 1

In [46]:
#Adds the SibSp and the Parch features in the test set
#The one added to the family size represents the person whose family onboard the titanic
X_test['FamilySize'] = X_test['SibSp'] + X_test['Parch'] + 1

In [47]:
#Outputs the least and highest value in the train and test set for splitting the ages
print(f"The least Age in the train  set is {X['Age'].min()} and the highest is {X['Age'].max()}")
print(f"The least Age in the test set is {X_test['Age'].min()} and the highest is {X_test['Age'].max()}")

The least Age in the train  set is 0.42 and the highest is 80.0
The least Age in the test set is 0.17 and the highest is 76.0


In [48]:
#Subdivides the Age feature to Children, Tennagers, Adults and Elders
X['Age_bin'] = pd.cut(X['Age'], bins=[X['Age'].min()-1, 12, 20, 40, X['Age'].max()+1], labels=['Children', 'Teenagers', 'Adults', 'Elders'])
X_test['Age_bin'] = pd.cut(X_test['Age'], bins=[0, 12, 20, 40, 120], labels=['Children', 'Teenagers', 'Adults', 'Elders'])

In [49]:
#Outputs the least and greatest value in the test and train set
print(f"The least fare in the train  set is {X['Fare'].min()} and the highest is {X['Fare'].max()}")
print(f"The least fare in the test set is {X_test['Fare'].min()} and the highest is {X_test['Fare'].max()}")

The least fare in the train  set is 0.0 and the highest is 512.3292
The least fare in the test set is 0.0 and the highest is 512.3292


In [50]:
#Divides the Fare feature into Low Fare, Median Fare, Average Fare and High Fare in the train and test set
X['Fare_bin'] = pd.cut(X['Fare'], bins=[X['Fare'].min()-1, 50, 138, 296, X['Fare'].max()+1], labels=['LowFare', 'MedianFare', 'AverageFare', 'HighFare'])
X_test['Fare_bin'] = pd.cut(X_test['Fare'], bins=[X['Fare'].min()-1, 50, 138, 296, X['Fare'].max()+1], labels=['LowFare', 'MedianFare', 'AverageFare', 'HighFare'])

In [51]:
#Drops the features in the list
drop_features = ['Age', 'Ticket', 'Name', 'Fare']

X.drop(drop_features, axis=1, inplace=True)
X_test.drop(drop_features, axis=1, inplace=True)

## **ONE HOT ENCODING**

In this section, we will dive deep into one hot encoding. One hot encoding is transforming categorical data (text) into numerical data for processing.

In one hot encoding only one of the categorical data in the specified featur is 1 and the rest of the distinct values are set to zero.

For instance the **Age feature** has only two distinct values **Male and Female**. If the person is a Male then the one hot encoded feature will be [1 0]  where the first feature is Male and the second is Female. Also, if a person is Femaele then the one hot encoded feature will be [0 1] where the first column is a Male and the second column is Female.

The features that will be one hot encoded are:
* **Sex**
* **Age_bin**
* **Embarked**
* **Fare_bin**

In [52]:
#Loops through the columns excluding the features in the list and displays the distinct values in each column
for column in X.columns:
    print(f'The distinct values in the \033[1m {column} \033[0;0m column are: ')
    print(f'{X[column].unique()} \n')

The distinct values in the [1m Survived [0;0m column are: 
[0 1] 

The distinct values in the [1m Pclass [0;0m column are: 
[3 1 2] 

The distinct values in the [1m Sex [0;0m column are: 
['male' 'female'] 

The distinct values in the [1m SibSp [0;0m column are: 
[1 0 3 4 2 5 8] 

The distinct values in the [1m Parch [0;0m column are: 
[0 1 2 5 3 4 6] 

The distinct values in the [1m Embarked [0;0m column are: 
['S' 'C' 'Q'] 

The distinct values in the [1m FamilySize [0;0m column are: 
[ 2  1  5  3  7  6  4  8 11] 

The distinct values in the [1m Age_bin [0;0m column are: 
['Adults', 'Elders', 'Children', 'Teenagers']
Categories (4, object): ['Children' < 'Teenagers' < 'Adults' < 'Elders'] 

The distinct values in the [1m Fare_bin [0;0m column are: 
['LowFare', 'MedianFare', 'AverageFare', 'HighFare']
Categories (4, object): ['LowFare' < 'MedianFare' < 'AverageFare' < 'HighFare'] 



In [53]:
#Loops through all the features and outputs the datatypes stored by each feature
print('The datatypes of the features are: \n')
X.dtypes

The datatypes of the features are: 



Survived         int64
Pclass           int64
Sex             object
SibSp            int64
Parch            int64
Embarked        object
FamilySize       int64
Age_bin       category
Fare_bin      category
dtype: object

In [54]:
#Creates an object from the OneHotEncoder class
ohe = OneHotEncoder()

### **Performs one hot encoding on the traning set**

In [55]:
#Uses the fit_transform method in the OneHotEncoder class to convert the categorical data to numerical values
transformed_features_train = ohe.fit_transform(X[X.select_dtypes(include=['object', 'category']).columns]).toarray()
#Converts the transformed features from float to ints
transformed_features_train = transformed_features_train.astype(int)
#Outputs the transformed features
print(transformed_features_train)

[[0 1 0 ... 0 1 0]
 [1 0 1 ... 0 0 1]
 [1 0 0 ... 0 1 0]
 ...
 [1 0 0 ... 0 1 0]
 [0 1 1 ... 0 1 0]
 [0 1 0 ... 0 1 0]]


In [56]:
#Extracts the one hot encoded features
encoded_features_train = ohe.get_feature_names_out()

#Outputs the encoded features
for encoded_feature in encoded_features_train:
  print(encoded_feature)

#Outputs the length of the encoded_features
print(f'\nThe number of features that were encoded are: {len(encoded_features_train)}')

Sex_female
Sex_male
Embarked_C
Embarked_Q
Embarked_S
Age_bin_Adults
Age_bin_Children
Age_bin_Elders
Age_bin_Teenagers
Fare_bin_AverageFare
Fare_bin_HighFare
Fare_bin_LowFare
Fare_bin_MedianFare

The number of features that were encoded are: 13


In [57]:
#Creating a pandas Dadaframe from the one hot encoded features
one_hot_encoded_features_train = pd.DataFrame(transformed_features_train, columns=encoded_features_train)

In [58]:
#Outputs the head of the one_hot_encoded_features
one_hot_encoded_features_train.head()

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Age_bin_Adults,Age_bin_Children,Age_bin_Elders,Age_bin_Teenagers,Fare_bin_AverageFare,Fare_bin_HighFare,Fare_bin_LowFare,Fare_bin_MedianFare
0,0,1,0,0,1,1,0,0,0,0,0,1,0
1,1,0,1,0,0,1,0,0,0,0,0,0,1
2,1,0,0,0,1,1,0,0,0,0,0,1,0
3,1,0,0,0,1,1,0,0,0,0,0,0,1
4,0,1,0,0,1,1,0,0,0,0,0,1,0


In [59]:
#Outputs the tail of the one_hot_encoded_features
one_hot_encoded_features_train.tail()

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Age_bin_Adults,Age_bin_Children,Age_bin_Elders,Age_bin_Teenagers,Fare_bin_AverageFare,Fare_bin_HighFare,Fare_bin_LowFare,Fare_bin_MedianFare
886,0,1,0,0,1,1,0,0,0,0,0,1,0
887,1,0,0,0,1,0,0,0,1,0,0,1,0
888,1,0,0,0,1,1,0,0,0,0,0,1,0
889,0,1,1,0,0,1,0,0,0,0,0,1,0
890,0,1,0,1,0,1,0,0,0,0,0,1,0


In [60]:
#Outputs the shape of the one_hot_encoded_features to be sure the shape tallies with the training data
print(f'The shape of the one hot encoded features is {one_hot_encoded_features_train.shape}')

The shape of the one hot encoded features is (891, 13)


### **Performs one hot encoding on the test set**

In [61]:
#Uses the fit_transform method in the OneHotEncoder class to convert the categorical data to numerical values in the test set
transformed_features_test = ohe.fit_transform(X_test[X_test.select_dtypes(include=['object', 'category']).columns]).toarray()
#Converts the transformed features from float to ints
transformed_features_test = transformed_features_test.astype(int)
#Outputs the transformed features
print(transformed_features_test)

[[0 1 0 ... 0 1 0]
 [1 0 0 ... 0 1 0]
 [0 1 0 ... 0 1 0]
 ...
 [0 1 0 ... 0 1 0]
 [0 1 0 ... 0 1 0]
 [0 1 1 ... 0 1 0]]


In [62]:
#Extracts the one hot encoded features
encoded_features_test = ohe.get_feature_names_out()

In [63]:
#Outputs the names of the encoded features in the test_set
for encoded_feature in encoded_features_test:
  print(encoded_feature)

Sex_female
Sex_male
Embarked_C
Embarked_Q
Embarked_S
Age_bin_Adults
Age_bin_Children
Age_bin_Elders
Age_bin_Teenagers
Fare_bin_AverageFare
Fare_bin_HighFare
Fare_bin_LowFare
Fare_bin_MedianFare


In [64]:
#Convert the test set one hot encoded features to a pandas dataFrame
one_hot_encoded_features_test = pd.DataFrame(transformed_features_test, columns=encoded_features_test)

In [65]:
#Ouputs the head of the one hot encoded features for the test set
one_hot_encoded_features_test.head()

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Age_bin_Adults,Age_bin_Children,Age_bin_Elders,Age_bin_Teenagers,Fare_bin_AverageFare,Fare_bin_HighFare,Fare_bin_LowFare,Fare_bin_MedianFare
0,0,1,0,1,0,1,0,0,0,0,0,1,0
1,1,0,0,0,1,0,0,1,0,0,0,1,0
2,0,1,0,1,0,0,0,1,0,0,0,1,0
3,0,1,0,0,1,1,0,0,0,0,0,1,0
4,1,0,0,0,1,1,0,0,0,0,0,1,0


In [66]:
one_hot_encoded_features_test.tail()

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Age_bin_Adults,Age_bin_Children,Age_bin_Elders,Age_bin_Teenagers,Fare_bin_AverageFare,Fare_bin_HighFare,Fare_bin_LowFare,Fare_bin_MedianFare
413,0,1,0,0,1,1,0,0,0,0,0,1,0
414,1,0,1,0,0,1,0,0,0,0,0,0,1
415,0,1,0,0,1,1,0,0,0,0,0,1,0
416,0,1,0,0,1,1,0,0,0,0,0,1,0
417,0,1,1,0,0,1,0,0,0,0,0,1,0


## **MERGE THE ONE HOT ENCODED FEATURES TO THE DATASETS**

In this section the one hot encoded features will be merged with the respective dataset.

### Reset the indices in the train and one hot encoded features train

In [67]:
#Reset indices to ensure alignemnt before merging to prevent NA in some cells
X.reset_index(drop=True, inplace=True)
one_hot_encoded_features_train.reset_index(drop=True, inplace=True)

#Outputs the shapes to be sure they are still intact
print(f'The shape of X is {X.shape}')
print(f'The shape of one hot encoded features train is {one_hot_encoded_features_train.shape}')

The shape of X is (891, 9)
The shape of one hot encoded features train is (891, 13)


### Concatenates the X and one hot encoding features train

In [68]:
#Concatenates X and one_hot_encoded_features_train dataframes
X_meg = pd.concat([X, one_hot_encoded_features_train], axis=1)

#Outputs the concatenated dataframe
X_meg

Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Embarked,FamilySize,Age_bin,Fare_bin,Sex_female,...,Embarked_Q,Embarked_S,Age_bin_Adults,Age_bin_Children,Age_bin_Elders,Age_bin_Teenagers,Fare_bin_AverageFare,Fare_bin_HighFare,Fare_bin_LowFare,Fare_bin_MedianFare
0,0,3,male,1,0,S,2,Adults,LowFare,0,...,0,1,1,0,0,0,0,0,1,0
1,1,1,female,1,0,C,2,Adults,MedianFare,1,...,0,0,1,0,0,0,0,0,0,1
2,1,3,female,0,0,S,1,Adults,LowFare,1,...,0,1,1,0,0,0,0,0,1,0
3,1,1,female,1,0,S,2,Adults,MedianFare,1,...,0,1,1,0,0,0,0,0,0,1
4,0,3,male,0,0,S,1,Adults,LowFare,0,...,0,1,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,0,0,S,1,Adults,LowFare,0,...,0,1,1,0,0,0,0,0,1,0
887,1,1,female,0,0,S,1,Teenagers,LowFare,1,...,0,1,0,0,0,1,0,0,1,0
888,0,3,female,1,2,S,4,Adults,LowFare,1,...,0,1,1,0,0,0,0,0,1,0
889,1,1,male,0,0,C,1,Adults,LowFare,0,...,0,0,1,0,0,0,0,0,1,0


In [69]:
#Outputs info of the concatenated dataframe
print(X_meg.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Survived              891 non-null    int64   
 1   Pclass                891 non-null    int64   
 2   Sex                   891 non-null    object  
 3   SibSp                 891 non-null    int64   
 4   Parch                 891 non-null    int64   
 5   Embarked              891 non-null    object  
 6   FamilySize            891 non-null    int64   
 7   Age_bin               891 non-null    category
 8   Fare_bin              891 non-null    category
 9   Sex_female            891 non-null    int64   
 10  Sex_male              891 non-null    int64   
 11  Embarked_C            891 non-null    int64   
 12  Embarked_Q            891 non-null    int64   
 13  Embarked_S            891 non-null    int64   
 14  Age_bin_Adults        891 non-null    int64   
 15  Age_bi

### Resets the X_test and the oen hot encoded features

In [70]:
#Resets indices to ensure alignment before merging the dataframes
X_test.reset_index(drop=True, inplace=True)
one_hot_encoded_features_test.reset_index(drop=True, inplace=True)

#Outputs the shape to be sure they are the same
print(f'The shape of the X_test is {X_test.shape}')
print(f'The shape of the one hot encoded features test is {one_hot_encoded_features_test.shape}')

The shape of the X_test is (418, 9)
The shape of the one hot encoded features test is (418, 13)


### Concatenates the X_test and the one hot encoded features test

In [71]:
X_test_meg = pd.concat([X_test, one_hot_encoded_features_test], axis=1)

In [72]:
#Outputs detailed info of the merged dataframe
print(X_test_meg.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   PassengerId           418 non-null    int64   
 1   Pclass                418 non-null    int64   
 2   Sex                   418 non-null    object  
 3   SibSp                 418 non-null    int64   
 4   Parch                 418 non-null    int64   
 5   Embarked              418 non-null    object  
 6   FamilySize            418 non-null    int64   
 7   Age_bin               418 non-null    category
 8   Fare_bin              418 non-null    category
 9   Sex_female            418 non-null    int64   
 10  Sex_male              418 non-null    int64   
 11  Embarked_C            418 non-null    int64   
 12  Embarked_Q            418 non-null    int64   
 13  Embarked_S            418 non-null    int64   
 14  Age_bin_Adults        418 non-null    int64   
 15  Age_bi

## **ONE HOT ENCODED FEATURES EXPLORATION**

This section of the notebook delves into the features that will be deleted, renamed or retained.

**Since the Sex feature is binary which is Sex_female and Sex_male, the two features will be combined with a male being 1 and a female 0.**

The Sex_female and sex feature will be dropped and the Sex_male feature will be renamed as Sex.

The encoded features for **Embarked, Age and Fare** will be retained with no modifications to the feature names

**Features to drop**
- Sex
- Embarked
- Sex_female
- Age_bin
- Fare_bin

**FEatures to rename**
- Sex_male to Sex
- Age_bin_Adults to Age_Adults
- Age_bin_Childrerd to Age_Children
- Age_bin_Elders to Age_Elders
- Age_bin_Teenagers to Age_Teenagers
- Fare_bin_AverageFare to Fare_Average
- Fare_bin_HighFare to Fare_High
- Fare_bin_LowFare to Fare_Low
- Fare_bin_MedianFare to Fare_Median

The same idea will be applied to both one hot encoded features for the train and test set

### DROPS FEATURES IN THE TRAIN AND TEST SET

In [73]:
features_to_drop = ['Sex', 'Embarked', 'Sex_female', 'Age_bin', 'Fare_bin']

In [74]:
#Drops the features in the features to drop from the X_new
try:
  X_meg.drop(features_to_drop, axis=1, inplace=True)

except KeyError:
  print('The features to drop have been dropped already')

In [75]:
#Drops the features in the features to drop from the X_test_meg
try:
  X_test_meg.drop(features_to_drop, axis=1, inplace=True)

except KeyError:
  print('The features to drop have already been dropped')

### RENAME FEATURES IN THE TRAIN SET

In [76]:
#Features to rename
features_to_rename = {'Sex_male': 'Sex', 'Age_bin_Adults': 'Age_Adults', 'Age_bin_Children': 'Age_Children', 'Age_bin_Elders': 'Age_Elders',
                      'Age_bin_Teenagers': 'Age_Teenagers', 'Fare_bin_AverageFare': 'Fare_Average', 'Fare_bin_HighFare': 'Fare_High',
                      'Fare_bin_LowFare': 'Fare_Low', 'Fare_bin_MedianFare': 'Fare_Median'
                      }

# Sex_male to Sex
# Age_bin_Adults to Age_Adults
# Age_bin_Childrerd to Age_Children
# Age_bin_Elders to Age_Elders
# Age_bin_Teenagers to Age_Teenagers
# Fare_bin_AverageFare to Fare_Average
# Fare_bin_HighFare to Fare_High
# Fare_bin_LowFare to Fare_Low
# Fare_bin_MedianFare to Fare_Median

In [77]:
X_meg.rename(columns=features_to_rename, inplace=True)

In [78]:
for column in X_meg.columns:
  print(column)

Survived
Pclass
SibSp
Parch
FamilySize
Sex
Embarked_C
Embarked_Q
Embarked_S
Age_Adults
Age_Children
Age_Elders
Age_Teenagers
Fare_Average
Fare_High
Fare_Low
Fare_Median


### RENAME FEATURES IN THE TEST SET

In [79]:
#Renames the features in the merged test set
X_test_meg.rename(columns=features_to_rename, inplace=True)

In [80]:
for column in X_test_meg.columns:
  print(column)

PassengerId
Pclass
SibSp
Parch
FamilySize
Sex
Embarked_C
Embarked_Q
Embarked_S
Age_Adults
Age_Children
Age_Elders
Age_Teenagers
Fare_Average
Fare_High
Fare_Low
Fare_Median


## **SPLITTING DATA INTO TRAIN AND TARGET VALUES**

This section of the notebook focuses on splitting data in the X_meg to features and target values since this is a supervised learning problem

The survived feature will be our target value since it has the overall outcome whether the passenger survived or not.

All the other features will be features

In [81]:
#Extracts all the features excluding the Survived feature and converts it to a numpy array into the X_train_unscaled
X_train_orig_unscaled = X_meg.drop(['Survived'], axis=1).values
print(f'The shape of the X_train_unscaled is {X_train_orig_unscaled.shape}\n')
print(type(X_train_orig_unscaled))

#Puts the survived feature in y which is the target value
y = X_meg['Survived'].values
print(f'\nThe shape of the target value y is {y.shape}')

The shape of the X_train_unscaled is (891, 16)

<class 'numpy.ndarray'>

The shape of the target value y is (891,)


In [82]:
#Drops the PassengerId and converts the pandas dataframe to a numpy array
X_test_unscaled = X_test_meg.drop(['PassengerId'], axis=1).values
print(f'The shape of the X_test_unscaled is {X_test_unscaled.shape}')

The shape of the X_test_unscaled is (418, 16)


## **SPLITTING THE TRAINING DATA**

The training data will be splitted into train and cross-validation set.

90% of the training data will be used to train the model and 10% of the training data will be used to validate the model to access its performance.

In [83]:
#Splits the X_train_unscaled data into training and cross-validation set
#90% of the X_train_unscaled data will be used for training whiles 10% will be used to assess the performance of the model
X_train_unscaled, X_cv_unscaled, y_train, y_cv = train_test_split(X_train_orig_unscaled, y, train_size=0.90, random_state=42, stratify=y)


print(f'The shape of the training set (input) is {X_train_unscaled.shape}')
print(f'The shape of the training set (target) is {y_train.shape}\n')
print(f'The shape of the cross validation set (input) is {X_cv_unscaled.shape}')
print(f'The shape of the cross validation set (target) is {y_cv.shape}')

The shape of the training set (input) is (801, 16)
The shape of the training set (target) is (801,)

The shape of the cross validation set (input) is (90, 16)
The shape of the cross validation set (target) is (90,)


## **FEATURE SCALING**

Now all the features are numerical ready to be used to train the model.

But there is one thing at hand, our data is not balanced. Using an unscaled data to train a model makes gradient descent converge slowly.

In this section, we will scale our data both the training and test set to ensure that all the features have values between 0 and 1.

This makes gradient descent run faster and prevent overfitting.

In [84]:
#Instantiate the standard scaler from scikit learn
standardScaler = StandardScaler()

#Scales the training set
X_train_scaled = standardScaler.fit_transform(X_train_unscaled)

#Outputs the calculated mean by standard scaler
print(f'The calculated mean for the training data is {standardScaler.mean_}\n')

#Outputs the minimum and maximum value in the X_train_unscaled dataset
print(f'The minimum value in the X_train_unscaled is {X_train_unscaled.min()}')
print(f'The maximum value in the X_train_unscaled is {X_train_unscaled.max()}\n')

#Outputs the minimum and maximum value in the X_train_scaled
print(f'The minimum value in the X_train_scaled is {X_train_scaled.min()}')
print(f'The maximum value in the X_train_scaled is {X_train_scaled.max()}')

The calculated mean for the training data is [2.3133583  0.51935081 0.39450687 1.91385768 0.64918851 0.19101124
 0.082397   0.72659176 0.63171036 0.07990012 0.16729089 0.12109863
 0.03121099 0.00249688 0.82022472 0.14606742]

The minimum value in the X_train_unscaled is 0
The maximum value in the X_train_unscaled is 11

The minimum value in the X_train_scaled is -2.1360009363293746
The maximum value in the X_train_scaled is 19.987496091306557


In [85]:
#Scales the X_cv_unscaled using the mean and standard deviation used to scale the training data
X_cv_scaled = standardScaler.fit_transform(X_cv_unscaled)

#Scales teh X_test_unscaled using the mean and standard deviation used to scale the training data
X_test_scaled = standardScaler.fit_transform(X_test_unscaled)

## **DEVELOPING THE MODEL**

Everything is now set to develop the model. **Traditional ML algorithms** will be utilized in this notebook.

The **Linear Regression and RandomForestRegressor algorithms** will be used to train the model and the one that performs better will be used to make prediction on the test set

### **LOGISTIC REGRESSION MODEL**

In [94]:
#A dictionary to keep track of the train accuracies
train_accuracies_lr = {}

#A dictionary to store the cross validation accuracies for the linear logistic regression algorithm
cv_accuracies_lr = {}

#The number of iterations for the solver to converge#L
number_of_iterations_lr = np.arange(100, 300, 20)

#Outputs the number of iterations
print(number_of_iterations_lr)


[100 120 140 160 180 200 220 240 260 280]


In [95]:
#Loop through the number of iterations for logistic regression
for iteration in number_of_iterations_lr:
  #Instantiate the logistic regression and sets the random state to 16 so the results can be replicated. The max_iteration is also set to the iteration variable
  logisticRegression = LogisticRegression(random_state=16, max_iter=iteration)

  #Fits the logistic regression model on the X_train_scale dataset
  logisticRegression.fit(X_train_scaled, y_train)

  #Computes the accuracy of the train_set and adds it to the train_accuracies_lr
  train_accuracies_lr[iteration] = logisticRegression.score(X_train_scaled, y_train)

  #Computes the accuracy of the cv_set and adds it to the cv_accuracies_lr
  cv_accuracies_lr[iteration] = logisticRegression.score(X_cv_scaled, y_cv)

#Outputs all the max_iterations, the train_accuracies and the cross_validation accuracies
print(f'Iterations:         {number_of_iterations_lr}')
print(f'Train Accuracies:   {train_accuracies_lr}')
print(f'CV Accuracies:      {cv_accuracies_lr}')

Iterations:         [100 120 140 160 180 200 220 240 260 280]
Train Accuracies:   {100: 0.8139825218476904, 120: 0.8139825218476904, 140: 0.8139825218476904, 160: 0.8139825218476904, 180: 0.8139825218476904, 200: 0.8139825218476904, 220: 0.8139825218476904, 240: 0.8139825218476904, 260: 0.8139825218476904, 280: 0.8139825218476904}
CV Accuracies:      {100: 0.7888888888888889, 120: 0.7888888888888889, 140: 0.7888888888888889, 160: 0.7888888888888889, 180: 0.7888888888888889, 200: 0.7888888888888889, 220: 0.7888888888888889, 240: 0.7888888888888889, 260: 0.7888888888888889, 280: 0.7888888888888889}


### **LOGISTIC REGRESSION MODEL ANALYSIS**

The model has a **train_accuracy of 81% and a cross-validation score of 79%**

This is a better results

### **RANDOM FOREST MODEL**

In [97]:
#A dictionary to keep track of the train accuracies
train_accuracies_rf = {}

#A dictionary to store the cross validation accuracies for the linear logistic regression algorithm
cv_accuracies_rf = {}

#The number of iterations for the solver to converge
number_of_iterations_rf = np.arange(100, 400, 20)

#Outputs the number of iterations
print(number_of_iterations_rf)

[100 120 140 160 180 200 220 240 260 280 300 320 340 360 380]


In [98]:
#Loop through the number of iterations for the random forest estimators
for iterator in number_of_iterations_rf:
  #The RandomForestClassifier is used to train the model because the target values are discrete(binary)
  #Instantiate the Random Forest Algorithm and set the random_state to 42 so the rersults can be replicated.
  #Sets the number of tress to the current iterator
  randomForestClassifier = RandomForestClassifier(n_estimators=iterator, random_state=42)

  #Fits the random forest classifier model on the training data
  randomForestClassifier.fit(X_train_scaled, y_train)

  #Calculates the train accuracy score and  adds it to the train accuracies
  train_accuracies_rf[iterator] = randomForestClassifier.score(X_train_scaled, y_train)

  #Calculates the cv accuracy and adds it to the cv accuracies
  cv_accuracies_rf[iterator] = randomForestClassifier.score(X_cv_scaled, y_cv)


#Outputs all the max_iterations, the train_accuracies and the cross_validation accuracies
print(f'Iterations:         {number_of_iterations_rf}')
print(f'Train Accuracies:   {train_accuracies_rf}')
print(f'CV Accuracies:      {cv_accuracies_rf}')

Iterations:         [100 120 140 160 180 200 220 240 260 280 300 320 340 360 380]
Train Accuracies:   {100: 0.8776529338327091, 120: 0.8776529338327091, 140: 0.8776529338327091, 160: 0.8776529338327091, 180: 0.8776529338327091, 200: 0.8776529338327091, 220: 0.8776529338327091, 240: 0.8776529338327091, 260: 0.8776529338327091, 280: 0.8776529338327091, 300: 0.8776529338327091, 320: 0.8776529338327091, 340: 0.8776529338327091, 360: 0.8776529338327091, 380: 0.8776529338327091}
CV Accuracies:      {100: 0.7888888888888889, 120: 0.7888888888888889, 140: 0.7888888888888889, 160: 0.7888888888888889, 180: 0.7777777777777778, 200: 0.7777777777777778, 220: 0.7777777777777778, 240: 0.7777777777777778, 260: 0.7777777777777778, 280: 0.8, 300: 0.8, 320: 0.8, 340: 0.8, 360: 0.8, 380: 0.8}


### **RANDOM FOREST MODEL ANALYSIS**

The model has a **train accuracy of 88% and a cross validation score of 80% when the number of estimators is 280**.

This model performs quite better than the Logistic Regression model.

The preferred model for the titanic will be the random forest model with n_estimators being 280.

## **PREFERRED MODEL FOR THE TITANIC COMPETITION**

The random forest model with 280 number of estimators is the preferred model.

In [99]:
#Sets the number of preferred estimators to 280
preferred_estimators = 280

In [100]:
#Instantiate the random forest classifier model
preferredRandomForestClassifier = RandomForestClassifier(n_estimators=preferred_estimators, random_state=42)

In [102]:
#Calls the fit method from RandomForestClassifier to train the model using the scaled training data
preferredRandomForestClassifier.fit(X_train_scaled, y_train)

In [109]:
#Confirms the accuracy score on the training data
print(f'The training accuracy of the model is {preferredRandomForestClassifier.score(X_train_scaled, y_train):.2f}')

The training accuracy of the model is 0.88


In [110]:
#Confirms the cross validation accuracy
print(f'The cross validation accuracy of the model is {preferredRandomForestClassifier.score(X_cv_scaled, y_cv):.2f}')

The cross validation accuracy of the model is 0.80


## **RUNS THE PREFERRED MODEL ON THE TEST SET**

In this section of the notebook, the selected model will be used to predict whether each of the persons in the test set survived or not.

The fun part, can wait to explore that with you!!

In [111]:
#Runs the preferred model on the test set
predictions = preferredRandomForestClassifier.predict(X_test_scaled)

In [117]:
#Outputs the shape of the predictions
print(f'The shape of the predictiosn is {predictions.shape}')

The shape of the predictiosn is (418,)


In [116]:
#Outputs the predicts
print(predictions)

[0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0
 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 1]


## **CREATES THE SUBMISSION FILE**

In the concluding section, we will **create a submission dataframe** which will comprise of the passenger ID from the X_test data and the predictions whcih will be labelled as Survived

In [119]:
#Creates a submission file dataframe which will be sent for grading
submission_file = pd.DataFrame({'PassengerId': X_test.PassengerId, 'Survived': predictions})

#Converts the submission_file dataframe to a csv file
submission_file.to_csv('/content/drive/MyDrive/Datasets/titanic/submission.csv', index=False)

print('Your submission was sucessfully saved!')

Your submission was sucessfully saved!
