<a href="https://colab.research.google.com/github/PercyAyimbilaNsolemna/Machine_Learning/blob/main/Titanic_Kaggle_Challenge/Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **KAGGLE TITANIC CHALLENGE SOLUTION**

An implementation of the Kaggle Titanic challenge using deep learning (neural networks).

## **IMPORTING LIBRARIES**

This section imports all the libraries that will be used in building the Machine Learning workflow

In [1]:
#Imports numpy as np
import numpy as np
#Imports pandas as pd
import pandas as pd
#Imports tensorflow as tf
import tensorflow as tf
#Imports train_test_split from scikit learn model_selection
from sklearn.model_selection import train_test_split
#Imports StandardScaler from scikit learn preprocessing
from sklearn.preprocessing import StandardScaler
#Imports mean squared error from scikit learn metrics
from sklearn.metrics import mean_squared_error
#Imports accuracy score from scikit learn metrics
from sklearn.metrics import accuracy_score
#Imports Sequential model from tensorflow keras model
from tensorflow.keras.models import Sequential
#Imports the Dense layer from tensorflow keras layers
from tensorflow.keras.layers import Dense
#Imports OneHotEncoder from sklearn preprocessing
from sklearn.preprocessing import OneHotEncoder
#Imports warnings and suppresses all future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## **IMPORTING GOOGLE DRIVE**

This section mounts Google Drive in the notebook so that the dataset for the project can be accessed for training, cross-validation and testing.

In [2]:
#Imports drive from google colab
from google.colab import drive

In [3]:
#Connects or mounts google drive to the google colab
drive.mount('/content/drive')

Mounted at /content/drive


## **TITANIC DATASET**

The Titanic dataset will be used to build the neural network. Detailed information on the project can be found on [Kaggle](https://www.kaggle.com/c/titanic).

The dataset used can be downloaded and explored [here](https://www.kaggle.com/c/titanic/data)

In [7]:
#Loads the training set
X = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/train.csv')

In [8]:
#Loads the test set
X_test = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/test.csv')

In [9]:
#Loads the sample gender submission file, which shows how the outcomes should be submitted
gender_submission = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/gender_submission.csv')

## **GENDER SUBMISSION VISUALIZATION**

This section gives a snapshot of the outcome of the model and the look of the file that is supposed to be submitted.

In [10]:
#Outputs the type of data being used
print(f'The type of the gender submission is \n{type(gender_submission)}')

The type of the gender submission is 
<class 'pandas.core.frame.DataFrame'>


In [12]:
#Outputs a detailed information of the gender submission dataset
gender_submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB


In [13]:
#Outputs the shape of the gender submission dataset
print(f'The shape of the gender submission is {gender_submission.shape}')

The shape of the gender submission is (418, 2)


In [14]:
#Outputs the titles of the two columns
print(f'The titles of the gender submission file are: \n{gender_submission.columns}')

The titles of the gender submission file are: 
Index(['PassengerId', 'Survived'], dtype='object')


In [15]:
#Outputs the head of the gender_submission file
gender_submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [16]:
#Outputs the tail of the gender submission dataset
gender_submission.tail()

| | PassengerId | Survived |
|---|---|---|
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
| 417 | 1309 | 0 |


## **INSIGHTS AFTER DATA EXPLORATION**

This section summarizes what was learned from exploring the gender submission file.

The **predicted values (ŷ)** should be submitted as two columns: **PassengerId** and **Survived**, the survival state of the passenger **(1 = survived, 0 = did not survive)**.
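As a minimal sketch of that two-column format, the snippet below assembles a submission table from made-up values. The passenger IDs, the probabilities, the 0.5 threshold and the `submission` name are all assumptions for illustration, not output of the model built in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger IDs and predicted probabilities (illustrative values only)
passenger_ids = np.array([892, 893, 894])
predicted_probabilities = np.array([0.12, 0.81, 0.40])

# Thresholds the probabilities at 0.5 (an assumed cut-off) to get the 0/1 survival state
predicted_labels = (predicted_probabilities >= 0.5).astype(int)

# The submission needs exactly two columns: PassengerId and Survived
submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': predicted_labels,
})

# index=False keeps the row index out of the CSV text
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of printing the text would write the file to disk.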

## **DATA VISUALIZATION (TRAIN DATASET)**

In this section we focus on exploring the train dataset. The main checks are finding rows with null cells and handling them, checking the data type stored by each feature and converting data types where needed, and preparing the categorical features for one hot encoding.

In [17]:
#Outputs the data type of the train dataset
print(f'The data type of the train dataset is: \n{type(X)}')

The data type of the train dataset is: 
<class 'pandas.core.frame.DataFrame'>


In [18]:
#Outputs information about the train dataset
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [19]:
#Outputs the shape of the train dataset
print(f'The shape of the train dataset is: {X.shape}')

The shape of the train dataset is: (891, 12)


In [20]:
#Outputs the features of the train dataset
print(f'The features in the train dataset are: \n{X.columns}')

The features in the train dataset are: 
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


## **FEATURES DEFINITION**

This section throws light on the features of the train dataset

**VARIABLE**       | **DEFINITION**   | **KEY**
-------------------|------------------|----------------
Survived           | Survival         | 0 = No, 1 = Yes
Pclass             | Ticket class     | 1 = 1st, 2 = 2nd, 3 = 3rd
Name               | Name of passenger |
PassengerId        | Passenger ID     |
Sex                | Sex              |
Age                | Age in years     |
SibSp              | # of siblings / spouses aboard the Titanic |
Parch              | # of parents / children aboard the Titanic |
Ticket             | Ticket number    |
Fare               | Passenger fare   |
Cabin              | Cabin number     |
Embarked           | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

In [21]:
#Outputs the head of the train dataset
X.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [23]:
#Outputs the tail of the train dataset
X.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


## **EMPTY CELLS IN TRAIN AND TEST SET**

This section reviews the train and test dataset for NaN cells and fills them up.

The **Cabin** feature will be dropped due to the large number of empty cells

In [24]:
#Iterates through all the features and outputs the number of NaN cells in each feature of the training set
print('Number of NaN cells in each feature in the training set \n')
for feature in X.columns:
  print(f'The {feature} column has {X[feature].isna().sum()} cells with NaN value(s)')


Number of NaN cells in each feature in the training set 

The PassengerId column has 0 cells with NaN value(s)
The Survived column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 177 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 0 cells with NaN value(s)
The Cabin column has 687 cells with NaN value(s)
The Embarked column has 2 cells with NaN value(s)


The **Age feature** has **177 training examples** with NaN values

The **Cabin feature** has **687 training examples** with NaN values

The **Embarked feature** has **2 training examples** with NaN values

In [25]:
#Loops through all the features and outputs the length of the feature with NaN values in the test set
print('The number of NaN cells in each feature in the test set \n')
for feature in X_test.columns:
  print(f'The {feature} column has {X_test[pd.isna(X_test[feature])].shape[0]} cells with NaN value(s)')

The number of NaN cells in each feature in the test set 

The PassengerId column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 86 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 1 cells with NaN value(s)
The Cabin column has 327 cells with NaN value(s)
The Embarked column has 0 cells with NaN value(s)


The **Age feature** has 86 cells with NaN values

The **Cabin feature** has 327 cells with NaN values

The **Fare feature** has 1 cell with a NaN value
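The per-column counts above can also be obtained without an explicit loop. The sketch below uses a small toy frame (its values and the names `toy`, `nan_counts` and `columns_with_nan` are illustrative assumptions) to show the pattern.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Sex': ['male', 'female', 'female', 'male'],
})

# isna() marks each missing cell and sum() counts the marks per column
nan_counts = toy.isna().sum()
print(nan_counts)

# Keeps only the columns that actually contain missing values
columns_with_nan = nan_counts[nan_counts > 0]
print(columns_with_nan)
```

Applied to `X` or `X_test`, the same two calls would report only the features that still need attention.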

In [26]:
#Fills all NaN cells in the Age feature with the median age of the respective dataset (the median is less sensitive to outliers than the mean)
X['Age'] = X['Age'].fillna(X['Age'].median())
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].median())

#Fills all NaN cells in the Embarked feature of the train set with the mode of the Embarked feature
X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])

#Fills the NaN cell in the Fare feature of the test set with the median fare of the test set
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].median())
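One variation worth considering, sketched here on toy data rather than taken from this notebook, is to compute the fill values on the training set only and reuse them for the test set, so that the test data does not influence preprocessing. The frames and the `fill_values` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy train and test frames with a few missing cells (illustrative values)
train = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Embarked': ['S', None, 'S']})
test = pd.DataFrame({'Age': [np.nan, 30.0], 'Embarked': ['C', 'Q']})

# The fill values are learned from the training set only
fill_values = {
    'Age': train['Age'].median(),
    'Embarked': train['Embarked'].mode()[0],
}

# fillna with a dict fills each named column with its own value
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(train_filled)
print(test_filled)
```

Whether this changes the final score on such a small dataset is not something the sketch can show; it only illustrates the pattern.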

In [27]:
#Loops through the training dataset to check if there is a NaN value in any of the cells excluding the Cabin feature
print('Checking for NaN value(s) in each of the features in the training set: \n')

for feature in X.columns:
  print(f'The {feature} column has {X[pd.isna(X[feature])].shape[0]} cells with NaN value(s)')

Checking for NaN value(s) in each of the features in the training set: 

The PassengerId column has 0 cells with NaN value(s)
The Survived column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 0 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 0 cells with NaN value(s)
The Cabin column has 687 cells with NaN value(s)
The Embarked column has 0 cells with NaN value(s)


In [31]:
#Loops through the test set for any cell(s) with an NaN value excluding the Cabin feature
print('Checking for NaN value(s) in each of the features in the test set: \n')

for feature in X_test.columns:
  print(f'The {feature} column has {X_test[pd.isna(X_test[feature])].shape[0]} cells with NaN value(s)')

Checking for NaN value(s) in each of the features in the test set: 

The PassengerId column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 0 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 0 cells with NaN value(s)
The Embarked column has 0 cells with NaN value(s)


There are no empty cells in either the training or the testing dataset, excluding the Cabin feature. The Cabin feature will be dropped in the next cell.

## **DROPPING THE CABIN FEATURE**

The **Cabin feature** is dropped due to the huge number of missing values or null values in both the train and test dataset
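To put "huge" into numbers, the short calculation below turns the NaN counts reported earlier into fractions. The row totals are assumptions read from the outputs above: 891 rows in the train set, and 418 rows in the test set to match the 418 entries of the sample submission.

```python
# NaN counts for Cabin and row totals as reported earlier in this notebook
train_missing, train_rows = 687, 891
test_missing, test_rows = 327, 418  # 418 assumed from the sample submission length

# Fraction of Cabin cells that are missing in each dataset
train_fraction = train_missing / train_rows
test_fraction = test_missing / test_rows

print(f'Train: {train_fraction:.1%} of Cabin cells are missing')
print(f'Test: {test_fraction:.1%} of Cabin cells are missing')
```

With roughly three quarters of the values missing in both sets, filling the column would mostly be guesswork.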

In [None]:
#Drops the Cabin column in both the train and test set
X.drop('Cabin', axis=1, inplace=True)
X_test.drop('Cabin', axis=1, inplace=True)

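The PassengerId feature is dropped from the train set since it will not be used in training the model. It is kept in the test set. Note that the column listing printed after the drop already includes the features created in the FEATURE ENGINEERING section, because these cells were executed after it.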

In [46]:
X.drop('PassengerId', axis=1, inplace=True)

In [47]:
#Outputs the columns in both the train and test data set
print('The columns in the train set are: ')
print(X.columns)

print('\nThe columns in the test set are: ')
print(X_test.columns)

The columns in the train set are: 
Index(['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'FamilySize',
       'Age_bin', 'Fare_bin'],
      dtype='object')

The columns in the test set are: 
Index(['PassengerId', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked',
       'FamilySize', 'Age_bin', 'Fare_bin'],
      dtype='object')


## **FEATURE ENGINEERING**

This section deals with creating new features out of the original features.

The **SibSp and the Parch features** will be combined to get the **family size**.

In addition, the **Age feature** will be divided into four groups: **Children, Teenagers, Adults and Elders.**

The **Fare feature** will likewise be divided into four groups: **Low fare, Median fare, Average fare and High fare.**

The Ticket, Age, Name and Fare features will then be dropped.
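The steps above can be sketched on a three-row toy frame before applying them to the real data. The toy values and the upper fare edge of 600 are assumptions for illustration; the other bin edges and the labels follow the ones used in this notebook.

```python
import pandas as pd

# Toy passengers (illustrative values)
toy = pd.DataFrame({
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 2],
    'Age': [8.0, 35.0, 62.0],
    'Fare': [7.25, 71.28, 300.0],
})

# Family size = siblings/spouses + parents/children + the passenger
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# pd.cut uses right-closed intervals by default: (0, 12], (12, 20], (20, 40], (40, 120]
age_edges = [0, 12, 20, 40, 120]
age_labels = ['Children', 'Teenagers', 'Adults', 'Elders']
toy['Age_bin'] = pd.cut(toy['Age'], bins=age_edges, labels=age_labels)

# The lowest edge is below zero so that a fare of 0.0 still falls in the first bin
fare_edges = [-1, 50, 138, 296, 600]
fare_labels = ['LowFare', 'MedianFare', 'AverageFare', 'HighFare']
toy['Fare_bin'] = pd.cut(toy['Fare'], bins=fare_edges, labels=fare_labels)

print(toy[['FamilySize', 'Age_bin', 'Fare_bin']])
```

Using one shared list of edges for both the train and test set keeps the bins identical across the two.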

In [35]:
#Adds the SibSp and the Parch features in the train set
#The one added to the family size represents the passenger whose family is onboard the Titanic
X['FamilySize'] = X['SibSp'] + X['Parch'] + 1

In [36]:
#Adds the SibSp and the Parch features in the test set
#The one added to the family size represents the passenger whose family is onboard the Titanic
X_test['FamilySize'] = X_test['SibSp'] + X_test['Parch'] + 1

In [37]:
#Outputs the least and highest value in the train and test set for splitting the ages
print(f"The least Age in the train  set is {X['Age'].min()} and the highest is {X['Age'].max()}")
print(f"The least Age in the test set is {X_test['Age'].min()} and the highest is {X_test['Age'].max()}")

The least Age in the train  set is 0.42 and the highest is 80.0
The least Age in the test set is 0.17 and the highest is 76.0


In [38]:
#Subdivides the Age feature into Children, Teenagers, Adults and Elders using the same bin edges for the train and test set
age_bins = [0, 12, 20, 40, 120]
age_labels = ['Children', 'Teenagers', 'Adults', 'Elders']
X['Age_bin'] = pd.cut(X['Age'], bins=age_bins, labels=age_labels)
X_test['Age_bin'] = pd.cut(X_test['Age'], bins=age_bins, labels=age_labels)

In [39]:
#Outputs the least and greatest value of the Fare feature in the train and test set
print(f"The least fare in the train  set is {X['Fare'].min()} and the highest is {X['Fare'].max()}")
print(f"The least fare in the test set is {X_test['Fare'].min()} and the highest is {X_test['Fare'].max()}")

The least fare in the train  set is 0.0 and the highest is 512.3292
The least fare in the test set is 0.0 and the highest is 512.3292


In [40]:
#Divides the Fare feature into Low Fare, Median Fare, Average Fare and High Fare in the train and test set
X['Fare_bin'] = pd.cut(X['Fare'], bins=[X['Fare'].min()-1, 50, 138, 296, X['Fare'].max()+1], labels=['LowFare', 'MedianFare', 'AverageFare', 'HighFare'])
X_test['Fare_bin'] = pd.cut(X_test['Fare'], bins=[X['Fare'].min()-1, 50, 138, 296, X['Fare'].max()+1], labels=['LowFare', 'MedianFare', 'AverageFare', 'HighFare'])

In [41]:
#Drops the features in the list
drop_features = ['Age', 'Ticket', 'Name', 'Fare']

X.drop(drop_features, axis=1, inplace=True)
X_test.drop(drop_features, axis=1, inplace=True)

## **ONE HOT ENCODING**

In this section, we take a closer look at one hot encoding, which transforms categorical data (text) into numerical data for processing.

In one hot encoding, each distinct value of a feature gets its own column. For a given row, only the column matching that row's value is set to 1 and the rest are set to 0.

For instance, the **Sex feature** has only two distinct values, **male and female**. With the columns ordered as [female, male], which is the alphabetical order the encoder produces later in this section, a male passenger is encoded as [0 1] and a female passenger as [1 0].

The features that will be one hot encoded are:
* **Sex**
* **Age_bin**
* **Embarked**
* **Fare_bin**
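As a quick illustration of the idea on a toy column, the sketch below uses pandas' `get_dummies` rather than the `OneHotEncoder` used in this notebook. The toy frame and the `encoded` name are assumptions for illustration.

```python
import pandas as pd

# Toy Sex column with three passengers (illustrative values)
toy = pd.DataFrame({'Sex': ['male', 'female', 'male']})

# get_dummies creates one 0/1 column per distinct value, ordered alphabetically
encoded = pd.get_dummies(toy, columns=['Sex'], dtype=int)
print(encoded)

# Exactly one column is set to 1 in every row
row_sums = encoded.sum(axis=1)
print(row_sums.tolist())
```

The resulting column names, `Sex_female` and `Sex_male`, follow the same `feature_value` pattern as the names reported by `get_feature_names_out` later in this section.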

In [48]:
#Loops through all the columns and displays the distinct values in each column
for column in X.columns:
    print(f'The distinct values in the \033[1m {column} \033[0;0m column are: ')
    print(f'{X[column].unique()} \n')

The distinct values in the Survived column are: 
[0 1] 

The distinct values in the Pclass column are: 
[3 1 2] 

The distinct values in the Sex column are: 
['male' 'female'] 

The distinct values in the SibSp column are: 
[1 0 3 4 2 5 8] 

The distinct values in the Parch column are: 
[0 1 2 5 3 4 6] 

The distinct values in the Embarked column are: 
['S' 'C' 'Q'] 

The distinct values in the FamilySize column are: 
[ 2  1  5  3  7  6  4  8 11] 

The distinct values in the Age_bin column are: 
['Adults', 'Elders', 'Children', 'Teenagers']
Categories (4, object): ['Children' < 'Teenagers' < 'Adults' < 'Elders'] 

The distinct values in the Fare_bin column are: 
['LowFare', 'MedianFare', 'AverageFare', 'HighFare']
Categories (4, object): ['LowFare' < 'MedianFare' < 'AverageFare' < 'HighFare'] 



In [49]:
#Outputs the datatype stored by each feature
print('The datatypes of the features are: \n')
X.dtypes

The datatypes of the features are: 



Survived         int64
Pclass           int64
Sex             object
SibSp            int64
Parch            int64
Embarked        object
FamilySize       int64
Age_bin       category
Fare_bin      category
dtype: object

In [50]:
#Creates an object from the OneHotEncoder class
ohe = OneHotEncoder()

Performs one hot encoding on the training set

In [60]:
#Uses the fit_transform method in the OneHotEncoder class to convert the categorical data to numerical values
transformed_features = ohe.fit_transform(X[X.select_dtypes(include=['object', 'category']).columns]).toarray()
#Converts the transformed features from float to ints
transformed_features = transformed_features.astype(int)
#Outputs the transformed features
print(transformed_features)

[[0 1 0 ... 0 1 0]
 [1 0 1 ... 0 0 1]
 [1 0 0 ... 0 1 0]
 ...
 [1 0 0 ... 0 1 0]
 [0 1 1 ... 0 1 0]
 [0 1 0 ... 0 1 0]]


In [64]:
#Extracts the one hot encoded features
encoded_features = ohe.get_feature_names_out()

#Outputs the encoded features
for encoded_feature in encoded_features:
  print(encoded_feature)

#Outputs the length of the encoded_features
print(f'\nThe number of features that were encoded are: {len(encoded_features)}')

Sex_female
Sex_male
Embarked_C
Embarked_Q
Embarked_S
Age_bin_Adults
Age_bin_Children
Age_bin_Elders
Age_bin_Teenagers
Fare_bin_AverageFare
Fare_bin_HighFare
Fare_bin_LowFare
Fare_bin_MedianFare

The number of features that were encoded are: 13


In [66]:
#Creates a pandas DataFrame from the one hot encoded features
one_hot_encoded_features = pd.DataFrame(transformed_features, columns=encoded_features)

In [67]:
#Outputs the head of the one_hot_encoded_features
one_hot_encoded_features.head()

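| | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Age_bin_Adults | Age_bin_Children | Age_bin_Elders | Age_bin_Teenagers | Fare_bin_AverageFare | Fare_bin_HighFare | Fare_bin_LowFare | Fare_bin_MedianFare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |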


In [68]:
#Outputs the tail of the one_hot_encoded_features
one_hot_encoded_features.tail()

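| | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Age_bin_Adults | Age_bin_Children | Age_bin_Elders | Age_bin_Teenagers | Fare_bin_AverageFare | Fare_bin_HighFare | Fare_bin_LowFare | Fare_bin_MedianFare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 887 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 888 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 889 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 890 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |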
