<a href="https://colab.research.google.com/github/PercyAyimbilaNsolemna/Machine_Learning/blob/main/Titanic_Kaggle_Challenge/Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **KAGGLE TITANIC CHALLENGE SOLUTION**

Implementation of the Kaggle Titanic challenge using deep learning (neural networks).

## **IMPORTING LIBRARIES**

This section imports all the libraries that will be used to build the neural network.

In [1]:
#Imports numpy as np
import numpy as np
#Imports pandas as pd
import pandas as pd
#Imports tensorflow as tf
import tensorflow as tf
#Imports train_test_split from scikit learn model_selection
from sklearn.model_selection import train_test_split
#Imports StandardScaler from scikit learn preprocessing
from sklearn.preprocessing import StandardScaler
#Imports mean squared error from scikit learn metrics
from sklearn.metrics import mean_squared_error
#Imports accuracy score from scikit learn metrics
from sklearn.metrics import accuracy_score
#Imports Sequential model from tensorflow keras model
from tensorflow.keras.models import Sequential
#Imports the Dense layer from tensorflow keras layers
from tensorflow.keras.layers import Dense
#Imports OneHotEncoder from scikit learn preprocessing
from sklearn.preprocessing import OneHotEncoder
#Imports warnings and suppresses all future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## **IMPORTING GOOGLE DRIVE**

This section mounts Google Drive in the notebook so that the project's dataset can be accessed for training, cross-validation and testing.

In [2]:
#Imports drive from google colab
from google.colab import drive

In [3]:
#Connects or mounts google drive to the google colab
drive.mount('/content/drive')

Mounted at /content/drive
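The mount call above only works inside a Colab runtime. The following is a minimal sketch of how the dataset folder could be chosen so the notebook also runs elsewhere. The helper name `resolve_data_dir` and the local fallback folder `data/titanic` are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path

# Folder used by this notebook once Google Drive is mounted in Colab.
COLAB_DATA_DIR = Path('/content/drive/MyDrive/Datasets/titanic')


def resolve_data_dir(local_dir='data/titanic'):
    """Return the dataset folder (hypothetical helper, not in the original notebook)."""
    try:
        # google.colab can only be imported inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        # Outside Colab: fall back to a local folder (assumed location).
        return Path(local_dir)
    drive.mount('/content/drive')
    return COLAB_DATA_DIR


data_dir = resolve_data_dir()
print(f'Reading the dataset from: {data_dir}')
```

The CSV files could then be loaded with, for example, `pd.read_csv(data_dir / 'train.csv')`.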


## **TITANIC DATASET**

The Titanic dataset will be used to build the neural network. Detailed information on the project can be found on [Kaggle](https://www.kaggle.com/c/titanic).

The dataset can be downloaded and explored [here](https://www.kaggle.com/c/titanic/data).

In [4]:
#Loads the training set
X = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/train.csv')

In [5]:
#Loads the test set
X_test = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/test.csv')

In [6]:
#Loads the sample gender submission file, which shows how the predictions should be submitted
gender_submission = pd.read_csv('/content/drive/MyDrive/Datasets/titanic/gender_submission.csv')

## **GENDER SUBMISSION VISUALIZATION**

This section previews the sample submission file, which shows the format in which the model's predictions are supposed to be submitted.

In [7]:
#Outputs the type of the data being used
print(f'The type of the gender submission is \n{type(gender_submission)}')

The type of the gender submission is 
<class 'pandas.core.frame.DataFrame'>


In [8]:
#Outputs detailed information about the gender submission dataset
gender_submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB


In [9]:
#Outputs the shape of the gender submission dataset
print(f'The shape of the gender submission is {gender_submission.shape}')

The shape of the gender submission is (418, 2)


In [10]:
#Outputs the titles of the two columns
print(f'The titles of the gender submission file are: \n{gender_submission.columns}')

The titles of the gender submission file are: 
Index(['PassengerId', 'Survived'], dtype='object')


In [11]:
#Outputs the head of the gender_submission file
gender_submission.head()

|     | PassengerId | Survived |
|-----|-------------|----------|
| 0   | 892         | 0        |
| 1   | 893         | 1        |
| 2   | 894         | 0        |
| 3   | 895         | 0        |
| 4   | 896         | 1        |


In [12]:
#Outputs the tail of the gender submission dataset
gender_submission.tail()

|     | PassengerId | Survived |
|-----|-------------|----------|
| 413 | 1305        | 0        |
| 414 | 1306        | 1        |
| 415 | 1307        | 0        |
| 416 | 1308        | 0        |
| 417 | 1309        | 0        |


## **INSIGHTS AFTER DATA EXPLORATION**

This section summarizes what was learned from exploring the gender submission data.

The **predicted values (ŷ)** should be submitted in two columns: **PassengerId** and **Survived**, which records **whether the passenger survived (1) or did not survive (0)**.
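As a rough illustration of that two-column format, the sketch below builds a submission table from toy values. The three passenger IDs are taken from the sample file, while the probabilities and the 0.5 cut-off are assumptions made up for the example; they are not outputs of this notebook's model.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): three test passengers and made-up survival probabilities.
passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
predicted_probabilities = np.array([0.12, 0.81, 0.40])

# Threshold at 0.5 (an assumed cut-off) to get 0/1 labels stored as integers.
submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': (predicted_probabilities >= 0.5).astype(int),
})

# index=False keeps the CSV to exactly the two expected columns.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name, as in `submission.to_csv('submission.csv', index=False)`, would write the same content to disk.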

## **DATA VISUALIZATION (TRAIN DATASET)**

In this section we will focus on inspecting the train dataset. The main tasks are checking for rows with null cells and deleting them, checking the data type stored by each feature and converting data types if the need arises, and one-hot encoding.
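The checks listed above could look roughly like the following on a small toy frame. The frame `toy` and its values are assumptions for illustration, and `pd.get_dummies` is used here as a simple stand-in for one-hot encoding; the notebook itself imports scikit-learn's `OneHotEncoder`, which may be what is used later.

```python
import pandas as pd

# Toy frame (an assumption) mimicking a few Titanic columns.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, None, 35.0],
    'Embarked': ['S', 'C', 'S', None],
})

# 1. Count the missing cells in each column.
missing_per_column = toy.isna().sum()

# 2. Drop every row that contains at least one missing cell.
complete_rows = toy.dropna()

# 3. Inspect the data types to spot columns that need conversion.
column_types = toy.dtypes

# 4. One-hot encode the categorical columns.
encoded = pd.get_dummies(complete_rows, columns=['Sex', 'Embarked'])

print(missing_per_column)
print(column_types)
print(encoded.columns.tolist())
```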

In [13]:
#Outputs the data type of the train dataset
print(f'The data type of the train dataset is: \n{type(X)}')

The data type of the train dataset is: 
<class 'pandas.core.frame.DataFrame'>


In [14]:
#Outputs information about the train dataset
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [15]:
#Outputs the shape of the train dataset
print(f'The shape of the train dataset is: {X.shape}')

The shape of the train dataset is: (891, 12)


In [16]:
#Outputs the features of the train dataset
print(f'The features in the train dataset are: \n{X.columns}')

The features in the train dataset are: 
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


## **FEATURES DEFINITION**

This section describes the features of the train dataset.

 **VARIABLE**      | **DEFINITION**   | **KEY**
-------------------|------------------|----------------
PassengerId        | Passenger ID     |
Survived           | Survival         | 0 = No, 1 = Yes
Pclass             | Ticket class     | 1 = 1st, 2 = 2nd, 3 = 3rd
Name               | Name of passenger |
Sex                | Sex              |
Age                | Age in years     |
SibSp              | # of siblings / spouses aboard the Titanic |
Parch              | # of parents / children aboard the Titanic |
Ticket             | Ticket number    |
Fare               | Passenger fare   |
Cabin              | Cabin number     |
Embarked           | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

In [17]:
#Outputs the head of the train dataset
X.head()

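|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |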


In [18]:
#Outputs the tail of the train dataset
X.tail()

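|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |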


In [28]:
#Iterates through all the features and outputs the number of cells with NaN values in each
for feature in X.columns:
    print(f'The {feature} column has {X[feature].isna().sum()} cells with NaN value(s)')


The PassengerId column has 0 cells with NaN value(s)
The Survived column has 0 cells with NaN value(s)
The Pclass column has 0 cells with NaN value(s)
The Name column has 0 cells with NaN value(s)
The Sex column has 0 cells with NaN value(s)
The Age column has 177 cells with NaN value(s)
The SibSp column has 0 cells with NaN value(s)
The Parch column has 0 cells with NaN value(s)
The Ticket column has 0 cells with NaN value(s)
The Fare column has 0 cells with NaN value(s)
The Cabin column has 687 cells with NaN value(s)
The Embarked column has 2 cells with NaN value(s)


The **Age feature** has **177 training examples** with NaN values

The **Cabin feature** has **687 training examples** with NaN values

The **Embarked feature** has **2 training examples** with NaN values
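To put these counts in proportion, the short sketch below converts them into a share of the 891 training rows. The counts are the ones reported above; the variable names are illustrative.

```python
# Counts reported above; 891 is the number of rows in the train dataset.
n_rows = 891
nan_counts = {'Age': 177, 'Cabin': 687, 'Embarked': 2}

# Share of rows with a missing value, as a percentage of all rows.
nan_share = {feature: 100 * count / n_rows for feature, count in nan_counts.items()}

for feature, share in nan_share.items():
    print(f'{feature}: {share:.1f}% of rows have a missing value')
```

This gives roughly 19.9% for Age, 77.1% for Cabin and 0.2% for Embarked, so deleting every row with a missing Cabin value would likely discard most of the training set.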
