<a href="https://colab.research.google.com/github/Mark-Barbaric/Introduction_To_Machine_Learning/blob/master/FinalAssignment/Final_Assignment_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Final Assignment - Titanic Dataset**

This is the final assignment for the course. The goal is to predict whether a passenger survived the sinking of the Titanic from the following input features:

1. Passenger Class
2. Sex
3. Age
4. No. siblings / spouses aboard the Titanic
5. No. parents / children aboard the Titanic
6. Ticket Number
7. Passenger Fare
8. Cabin Number
9. Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


In [0]:
import pandas as pd
import tensorflow as tf
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers
import matplotlib.pyplot as plt

# **Methods**



In [0]:
def normalize_dataset(ds, column_names):
  """Return a copy of ds with the listed columns min-max scaled to [0, 1]."""
  normalized_ds = ds.copy()

  for column in normalized_ds.columns:
    # Only scale the columns that were explicitly requested.
    if column in column_names:
      max_value = ds[column].max()
      min_value = ds[column].min()
      # Min-max scaling: the smallest value maps to 0, the largest to 1.
      normalized_ds[column] = (ds[column] - min_value) / (max_value - min_value)

  return normalized_ds
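
The formula in `normalize_dataset` is ordinary min-max scaling, which is also what scikit-learn's `MinMaxScaler` applies with its default range. As a minimal sketch on a made-up toy frame (the values and the name `columns_to_scale` are illustrative assumptions, not part of the assignment data), the two approaches can be compared like this:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [10.0, 20.0, 40.0],
    "Fare": [0.0, 50.0, 100.0],
    "Pclass": [1, 2, 3],
})
columns_to_scale = ["Age", "Fare"]

# Manual min-max scaling, using the same formula as normalize_dataset.
manual = toy.copy()
for column in columns_to_scale:
    col_min = toy[column].min()
    col_max = toy[column].max()
    manual[column] = (toy[column] - col_min) / (col_max - col_min)

# The same scaling with scikit-learn's MinMaxScaler (default range 0 to 1).
scaled = toy.copy()
scaled[columns_to_scale] = MinMaxScaler().fit_transform(toy[columns_to_scale])

print(manual)
print(scaled)
```

Both frames should hold the same scaled values, with `Pclass` left untouched. Either approach divides by zero if a column is constant, so that case would need separate handling.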

# **Step 1: Loading and Checking Data**

The first step is to load the data and check it for missing entries.

In [3]:
from google.colab import files
upload = files.upload()

Saving train.csv to train (2).csv


In [4]:
train_data = pd.read_csv('train.csv')
train_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,0
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1


In [5]:
# check for and display any missing values
train_data.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Survived         0
dtype: int64

An initial review of the dataset shows several data integrity issues that need to be addressed before the model can be built.

**1. Missing age entries:** 177 of the 891 rows (about 20%) have no age, which is a significant proportion of the input. The most viable option is to fill these values with the mean age of each sex.

**2. Missing cabin entries:** the Cabin column is missing 687 of 891 entries (about 77%), far too many to fill in reliably, so the only viable option is to remove this input.

**3. Inconsistent ticket format:** ticket entries vary between six-digit numbers and combinations of letters and digits, with no immediately obvious pattern. The obvious response is to remove this input, but it may be worth keeping it for the first iteration of the model and then training a second model without it.
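
To judge how serious each gap is, it can help to look at the share of missing values per column as well as the raw count. The following is a small sketch on an invented toy frame (the column values are placeholders, not rows from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() yields a boolean frame: sum() counts the gaps, mean() gives their share.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})
print(summary)
```

Applied to `train_data`, the same two calls would give the counts shown above together with the corresponding proportions.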


# **Step 2: Remove Cabin Data Column**

The Cabin column is removed first: it has far too few entries to contribute anything to the model, and there is no principled way to fill in the missing values.


In [6]:
del train_data['Cabin']
train_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S,0
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S,0
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C,1


In [7]:
train_data.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
Survived         0
dtype: int64

# **Step 3: Replace missing Age data with averages**

The first attempt is to replace the missing Age values with the mean age of each sex. Splitting the average by sex makes sense because the mean ages of male and female passengers were likely to differ noticeably in the early 1900s.

In [30]:
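# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame.
train_data["Age"] = train_data["Age"].fillna(train_data.groupby("Sex")["Age"].transform("mean"))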
train_data["Age"] = train_data["Age"].astype(float).round(1)
train_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S,0
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,27.9,1,2,W./C. 6607,23.4500,S,0
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C,1
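
The group-wise imputation used in this step can be checked on a small example. This is a sketch on an invented toy frame, chosen so that the group means are easy to verify by hand (30 for male, 20 for female):

```python
import numpy as np
import pandas as pd

# Toy frame; the ages are invented so the group means are easy to check.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 40.0, np.nan, 30.0, np.nan, 10.0],
})

# transform("mean") returns one value per row: the mean age of that row's group.
group_mean = toy.groupby("Sex")["Age"].transform("mean")

# Only the missing ages are replaced; existing ages stay as they are.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

Because `transform` keeps the original row index, the group means line up with the rows they belong to and `fillna` can use them directly.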


In [31]:
train_data.isna().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Survived       0
dtype: int64

The missing-value count in Step 2 shows two missing entries in the Embarked column. Printing these two entries shows that they share the same ticket number, which suggests that they embarked at the same port. Before dealing with them, it is worth reviewing all non-unique ticket entries to check that each group is either a family or has the same port of embarkation. (The outputs of cells 31 and 32 show no missing values, most likely because the notebook was re-run after these rows had already been dropped.)

In [32]:
train_data[train_data.isna().any(axis=1)]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived


In [33]:
#train_data = train_data.dropna()
%load_ext google.colab.data_table
train_data[train_data.duplicated(subset = ['Ticket'], keep = False)].sort_values(by = ['Ticket'])



Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived
257,258,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.500,S,1
759,760,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.500,S,1
504,505,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.500,S,1
558,559,1,"Taussig, Mrs. Emil (Tillie Mandelbaum)",female,39.0,1,1,110413,79.650,S,1
585,586,1,"Taussig, Miss. Ruth",female,18.0,0,2,110413,79.650,S,1
...,...,...,...,...,...,...,...,...,...,...,...
436,437,3,"Ford, Miss. Doolina Margaret ""Daisy""",female,21.0,2,2,W./C. 6608,34.375,S,0
736,737,3,"Ford, Mrs. Edward (Margaret Ann Watson)",female,48.0,1,3,W./C. 6608,34.375,S,0
86,87,3,"Ford, Mr. William Neal",male,16.0,1,3,W./C. 6608,34.375,S,0
540,541,1,"Crosby, Miss. Harriet R",female,36.0,0,2,WE/P 5735,71.000,S,1
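
Beyond listing the duplicated tickets, one way to flag ticket groups whose members report more than one port is to count the distinct `Embarked` values per ticket. The sketch below uses an invented toy frame; the ticket codes are placeholders, not real entries:

```python
import pandas as pd

# Toy frame; ticket codes and ports are invented for illustration only.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "B2", "C3"],
    "Embarked": ["S", "S", "S", "C", "Q"],
})

# Number of distinct ports per ticket (nunique ignores missing values).
ports_per_ticket = toy.groupby("Ticket")["Embarked"].nunique()

# Tickets whose holders report more than one port.
mixed_tickets = ports_per_ticket[ports_per_ticket > 1].index

# All rows that belong to such a ticket group.
mixed_rows = toy[toy["Ticket"].isin(mixed_tickets)]
print(mixed_rows)
```

On the toy data only ticket `B2` is flagged, since its two holders list different ports.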


A few groups of passengers who share a ticket number have different ports of embarkation. There are only a handful, so they should not have a significant impact on the model, but it is not entirely clear whether they are erroneous or explainable. Family members may have bought tickets in different countries and joined the Titanic at different ports. There is also some evidence that the data is erroneous, as one or two of these entries also had missing age data and were logged as non-survivors. Given how few of these entries there are, I have decided to keep them in the training data.

The entries with a missing port of embarkation, on the other hand, will be removed.

In [34]:
train_data = train_data.dropna()
train_data.isna().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Survived       0
dtype: int64
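
Calling `dropna()` without arguments removes every row that has a gap in any column. Here that is equivalent to dropping the rows with a missing port, because Age has already been filled. If only one column should decide which rows go, `dropna(subset=...)` makes that explicit. A minimal sketch on an invented toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame; one row lacks a port, another lacks an age.
toy = pd.DataFrame({
    "Embarked": ["S", np.nan, "C"],
    "Age": [22.0, 38.0, np.nan],
})

# Drop only the rows where Embarked is missing.
only_port = toy.dropna(subset=["Embarked"])

# Drop rows with a missing value in any column.
any_column = toy.dropna()

print(len(only_port), len(any_column))
```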

In [35]:
train_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S,0
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,27.9,1,2,W./C. 6607,23.4500,S,0
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C,1


In [36]:
train_data.count(0)

PassengerId    889
Pclass         889
Name           889
Sex            889
Age            889
SibSp          889
Parch          889
Ticket         889
Fare           889
Embarked       889
Survived       889
dtype: int64

# **Step 4: Final Tidy-Up of the Data Frame**

This step prints a statistical summary of the dataset to check that the mean, maximum and minimum values are within reason, and then normalizes the numeric inputs to improve the training results.


In [45]:
train_data.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Survived
count,889.0,889.0,889.0,889.0,889.0,889.0,889.0
mean,446.0,2.311586,29.685827,0.524184,0.382452,32.096681,0.382452
std,256.998173,0.8347,12.9812,1.103705,0.806761,49.697504,0.48626
min,1.0,1.0,0.4,0.0,0.0,0.0,0.0
25%,224.0,2.0,22.0,0.0,0.0,7.8958,0.0
50%,446.0,3.0,30.0,0.0,0.0,14.4542,0.0
75%,668.0,3.0,35.0,1.0,0.0,31.0,1.0
max,891.0,3.0,80.0,8.0,6.0,512.3292,1.0


In [44]:
normalised_train_data = normalize_dataset(train_data, ['Age', 'SibSp', 'Fare'])
normalised_train_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,0.271357,0.125,0,A/5 21171,0.014151,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,0.472362,0.125,0,PC 17599,0.139136,C,1
2,3,3,"Heikkinen, Miss. Laina",female,0.321608,0.000,0,STON/O2. 3101282,0.015469,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,0.434673,0.125,0,113803,0.103644,S,1
4,5,3,"Allen, Mr. William Henry",male,0.434673,0.000,0,373450,0.015713,S,0
...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,0.334171,0.000,0,211536,0.025374,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,0.233668,0.000,0,112053,0.058556,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,0.345477,0.125,2,W./C. 6607,0.045771,S,0
889,890,1,"Behr, Mr. Karl Howell",male,0.321608,0.000,0,111369,0.058556,C,1


In [0]:
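dataset = normalised_train_data.values
# Columns 1-9 (Pclass through Embarked) are the inputs; column 10 (Survived) is the label.
# Name, Sex, Ticket and Embarked are still strings at this point.
x = dataset[:, 1:10]
y = dataset[:, 10]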
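
Positional slices are easy to get wrong by one column, so it can be worth checking which column names the positions `1:10` and `10` cover. The sketch below rebuilds the column layout of the data frame with a single row taken from the table above and lists the feature columns that are not numeric:

```python
import pandas as pd

# Column layout of the data frame after Cabin has been removed.
columns = ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
           "Parch", "Ticket", "Fare", "Embarked", "Survived"]

# One example row (the first passenger shown in the tables above).
toy = pd.DataFrame(
    [[1, 3, "Braund, Mr. Owen Harris", "male", 22.0, 1, 0, "A/5 21171", 7.25, "S", 0]],
    columns=columns,
)

# Names covered by the positional slices used for x and y.
feature_names = list(toy.columns[1:10])
label_name = toy.columns[10]

# Feature columns that are not numeric and would need encoding or removal.
non_numeric = toy[feature_names].select_dtypes(exclude="number").columns.tolist()

print(feature_names)
print(label_name)
print(non_numeric)
```

The non-numeric columns would need to be encoded or dropped before they can be fed to a neural network.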

# **Step 5: Splitting the Test and Training Data**

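The final step before creating the model is to split the dataset into training, validation and test sets. With `test_size = 0.7` in the first call, 70% of the rows are held out, so 266 of the 889 rows are used for training; the held-out rows are then divided into 436 validation rows and 187 test rows.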

In [50]:
x_train, x_val_and_test, y_train, y_val_and_test = train_test_split(x, y, test_size = 0.7)
x_val, x_test, y_val, y_test = train_test_split(x_val_and_test, y_val_and_test, test_size = 0.3)
print(x_train.shape, x_val.shape, x_test.shape, y_train.shape, y_val.shape, y_test.shape)

(266, 9) (436, 9) (187, 9) (266,) (436,) (187,)
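
For comparison, a more common arrangement keeps most of the rows for training. The following is a sketch of a 70/15/15 split on synthetic arrays, not the split used in this notebook; the proportions, `random_state=0` and the use of `stratify` are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, balanced binary labels.
x = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First split: 70% training, 30% held out. stratify keeps the class balance
# and random_state makes the split reproducible.
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y
)

# Second split: divide the held-out rows equally into validation and test.
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

print(x_train.shape, x_val.shape, x_test.shape)
```

Fixing `random_state` means the same rows land in each set on every run, which makes results easier to compare between model iterations.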


# **Step 6: Building and Training The Neural Network**
