# **Deducing Age and Survival Chance Using a Keras Neural Network**
In this notebook I build a model to predict whether a Titanic passenger would survive the disaster. The data contains information about the Titanic's passengers (see the data dictionary below). A central part of this notebook is filling in the passengers' missing ages with two methods. Method 1: filling the missing ages by hand, using the method created by [ALLOHVK](https://www.kaggle.com/allohvk). Method 2: using a deep neural network model.

**Data Dictionary:**

* Variable (Meaning)
* Survival (If passenger survived): 0 = No, 1 = Yes
* Pclass (Ticket class): 1 = 1st, 2 = 2nd, 3 = 3rd
* Sex (Passenger's Sex)
* Age (Passenger's age in years)
* SibSp (Number of passenger's siblings/spouses aboard the Titanic)
* Parch (Number of passenger's parents/children aboard the Titanic)
* Ticket (Ticket number)
* Fare (Passenger fare)
* Cabin (Cabin number)
* Embarked (Port of Embarkation): C = Cherbourg, Q = Queenstown, S = Southampton

In [None]:
# Necessary libraries for the code
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **First glance at the data:**

In [None]:
# To start, let's take a first look at our data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
display(train_data.head(10))

test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
display(test_data.head())

In [None]:
train_data.info()

In [None]:
# Let's check for missing and duplicate data.

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
combined = pd.concat([train_data, test_data])

print("NaN values:")
print(combined.isnull().sum())

print("\n\nDuplicated Values:")
print(combined.duplicated().sum())

As we can see, most of the "Cabin" values are missing. There's no easy way to fill them, and leaving the column as it is would bias the analysis, so I'll just drop it.

The "Fare" and "Embarked" columns have only 3 missing values between them, so they can be filled easily.

In [None]:
# Deleting the "Cabin" column

train_data = train_data.drop(columns="Cabin")

test_data = test_data.drop(columns="Cabin")

In [None]:
# The "Embarked" and "Fare" columns have the following empty values:
    
print("\n\nRows with empty 'Embarked' columns:")
display(train_data.loc[train_data.Embarked.isnull(), ['PassengerId', 'Name', 'Embarked']])

print("\n\nRows with empty 'Fare' columns:")
display(test_data.loc[test_data.Fare.isnull(),['PassengerId', 'Name', 'Fare']])

In [None]:
# Notice that the "Embarked" column has only 3 values: 'S', 'C' and 'Q'.
# Since 'S' is the most common port of embarkation, we'll use it for the missing values

print('\n\nPort of Embarkation:')
display(train_data.groupby('Embarked').agg({'Name' : 'count'}))

print('\n\nNew rows:')
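# Assign the result back instead of using inplace=True on a column accessor,
# which is not guaranteed to modify the original frame in recent pandas versions
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])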
display(train_data.loc[(train_data.PassengerId == 62) | (train_data.PassengerId == 830), ['PassengerId', 'Name', 'Embarked']])

# I'll fill the empty "Fare" value with the median; I don't want to overprice Mr. Thomas's ticket with the mean.
print("\n\nMr. Thomas new Fare:")
test_data.loc[test_data["PassengerId"] == 1044, ['Fare']] = round(test_data.Fare.median())
display(test_data.loc[test_data.PassengerId == 1044, ['PassengerId', 'Name', 'Fare']])
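The two imputation choices above (mode for a categorical column, median for a skewed numeric column) can be sketched on a toy frame. The values below are hypothetical and are not taken from the Titanic files; they only illustrate why the median is less affected by one very expensive ticket than the mean.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic files)
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", None, "Q"],
    "Fare": [7.25, 71.28, None, 8.05, 512.33],
})

# Categorical column: fill with the most frequent value (the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Numeric column: compare median and mean before filling
fare_median = toy["Fare"].median()
fare_mean = toy["Fare"].mean()
toy["Fare"] = toy["Fare"].fillna(fare_median)

print(toy)
print(f"median={fare_median:.2f}, mean={fare_mean:.2f}")
```

In this toy case the single high fare pulls the mean far above the median, which is the same reason the median is used for the missing fare above.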

In [None]:
# Let's do a quick correlation matrix to see which variables are more correlated to survival rate
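# numeric_only=True skips text columns such as Name and Ticket, which would raise in pandas 2.x
sns.heatmap(train_data.corr(numeric_only = True), annot = True)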

# **Filling Age:**
One hard column to fill is "Age". It makes a big difference, mainly because being a child is strongly correlated with survival (as shown later).

I'll be using a method developed by the user [ALLOHVK](https://www.kaggle.com/allohvk) in his [notebook](https://www.kaggle.com/code/allohvk/titanic-missing-age-imputation-tutorial-advanced/notebook). He proposes that a passenger's age can be estimated from their Pclass (ticket class), Parch (number of parents/children aboard) and name (more specifically, the person's title). This process will help fill the ages both by hand and with a neural network.

In [None]:
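# First we create the Title column holding the salutation found in each name
# (raw string so that the backslash in the regex is not treated as an escape sequence)
train_data['Title'], test_data['Title'] = [df.Name.str.extract(r'([A-Za-z]+)\.', expand=False) for df in [train_data, test_data]]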
train_data.head()

# We are also going to reduce the number of salutations; this is a tip from ALLOHVK to get less noise.
# It also gives our NN fewer categories to deal with.
TitleDict = {"Capt": "Officer","Col": "Officer","Major": "Officer","Jonkheer": "Royalty", \
             "Don": "Royalty", "Sir" : "Royalty","Dr": "Royalty","Rev": "Royalty", \
             "Countess":"Royalty", "Mme": "Mrs", "Mlle": "Miss", "Ms": "Mrs","Mr" : "Mr", \
             "Mrs" : "Mrs","Miss" : "Miss","Master" : "Master","Lady" : "Royalty"}

train_data['Title'], test_data['Title'] = [df.Title.map(TitleDict) for df in [train_data, test_data]]

# Checking if there's a null Title
print('\nNull values in train data:')
display(train_data.loc[train_data.Title.isnull(), ['Name', 'Title']])
print('\nNull values in test data:')
display(test_data.loc[test_data.Title.isnull(), ['Name', 'Title']])

# As in ALLOHVK's notebook, there is one null entry. I'll follow his work and set this passenger's Title to Royalty
test_data.loc[test_data.PassengerId==1306, "Title"] = "Royalty"

print('\nNew Title:')
display(test_data.loc[test_data.PassengerId==1306, ['Name', 'Title']])
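The title extraction and the null check above can be illustrated in isolation. This is a minimal sketch on three names written in the format of the "Name" column; `title_map` is a deliberately reduced, hypothetical subset of the dictionary used in the notebook, so that one title is left unmapped.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Oliva y Ocana, Dona. Fermina",
])

# Capture the first run of letters that is immediately followed by a dot
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Reduced mapping (hypothetical subset of the notebook's dictionary)
title_map = {"Mr": "Mr", "Miss": "Miss"}
mapped = titles.map(title_map)

# Titles missing from the mapping become NaN and need a manual decision
unmapped = titles[mapped.isnull()].tolist()

print(titles.tolist())
print(unmapped)
```

Any title that is absent from the mapping turns into NaN after `map`, which is why the null check is needed before moving on.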

In [None]:
# Now we can check the passenger average age by Title and pclass.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
combined = pd.concat([train_data, test_data])
combined.groupby(["Title", "Pclass"])["Age"].agg(["mean"])

There is still one important thing to consider while imputing the age: which female passengers are children and which are not.

For a male passenger, the salutation "Master" tells us whether he is a child, but with female passengers this becomes a little more complicated.

We could use the salutation "Miss" as a starting point, but what else could be used as a signal? A child would not be traveling alone, so if "Parch" is greater than 0 there's a higher chance that the passenger is a child.

In [None]:
# Setting a new title for female children
for df in [train_data, test_data, combined]:
    df.loc[(df['Title'] == 'Miss') & (df['Parch'] > 0), 'Title'] = 'FemaleChild'

display(combined.loc[(combined.Age.isnull()) & (combined.Title=='FemaleChild'), ['Name', 'Age', 'Title']])
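As a small illustration of the rule above, the sketch below applies the same boolean mask to a toy frame. The rows are hypothetical; only the combination of "Miss" and a positive "Parch" is relabeled.

```python
import pandas as pd

# Hypothetical passengers: only the first row is a "Miss" travelling with parents/children
toy = pd.DataFrame({
    "Title": ["Miss", "Miss", "Mrs", "Master"],
    "Parch": [2, 0, 1, 1],
})

# Same condition as in the notebook: title is "Miss" AND Parch > 0
is_female_child = (toy["Title"] == "Miss") & (toy["Parch"] > 0)
toy.loc[is_female_child, "Title"] = "FemaleChild"

print(toy)
```

Note that this is a heuristic: an adult "Miss" travelling with her parents would be relabeled as well.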

In [None]:
display(train_data.head())

Now that the important variables to determine age are set, I'll start to fill the empty values:

# **Method 1 - Determining age by hand**

In the same [notebook](https://www.kaggle.com/code/allohvk/titanic-missing-age-imputation-tutorial-advanced/notebook) from [ALLOHVK](https://www.kaggle.com/allohvk), he proposes a method that fills each missing age with the mean age of the passengers grouped by ticket class, sex and title. I'll follow his method.

In [None]:
# The ages set by hand will be kept in separate train and test data frames
train_data_hand = train_data.copy()
test_data_hand = test_data.copy()

In [None]:
# This grp variable will contain the ages used to fill the empty values
grp = train_data_hand.groupby(['Pclass','Sex','Title'])['Age'].mean().reset_index()[['Sex', 'Pclass', 'Title', 'Age']]

# This function returns the mean age of the passenger's group (pclass, sex and title).
# For example, a male "Mr" in 1st class gets the mean age of that group,
# and a female "Miss" in 2nd class gets the mean age of hers.
def fill_age(x):
    return grp[(grp.Pclass==x.Pclass)&(grp.Sex==x.Sex)&(grp.Title==x.Title)]['Age'].values[0]

train_data_hand['Age'], test_data_hand['Age'] = [df.apply(lambda x: fill_age(x) if np.isnan(x['Age']) else x['Age'], axis=1) for df in [train_data_hand, test_data_hand]]

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
combined = pd.concat([train_data_hand, test_data_hand])
display(combined.groupby(['Pclass','Sex','Title'])['Age'].mean())

In [None]:
# Now we check if there's any other null value.
display(combined.isnull().sum())
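The group-mean idea behind Method 1 can also be written with `groupby(...).transform("mean")`. This is an alternative formulation, not the code used above: it computes the means inside the same frame, whereas the notebook takes them from the training data only. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers with one missing age per group
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male"] * 6,
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Master"],
    "Age": [40.0, 50.0, np.nan, 4.0, 6.0, np.nan],
})

# Mean age of each (Pclass, Sex, Title) group, broadcast back to every row;
# missing ages are ignored when the mean is computed
group_mean = toy.groupby(["Pclass", "Sex", "Title"])["Age"].transform("mean")

# Only the missing ages are replaced; known ages are kept as they are
toy["Age"] = toy["Age"].fillna(group_mean)

print(toy)
```

A group whose ages are all missing would still produce NaN here, so a final null check like the one above remains useful.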

# **Method 2 - Determining age using neural network**

Now that we have the ages filled by hand, we'll use another method and predict the ages with ML.
The first step is pre-processing the data. The "Survived" column could help to predict age, but since we are estimating ages for passengers in both the train and test data, we won't use it, because the test data doesn't have a "Survived" column.

In [None]:
# Before anything else, I'll set the target variable to be predicted by the model.
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
y_ages = pd.concat([train_data[['PassengerId', 'Age']], test_data[['PassengerId', 'Age']]]).dropna()

train_ages = train_data[['PassengerId', 'Age']]
test_ages = test_data[['PassengerId', 'Age']]

# I'll also set the target variable for the ML model that will predict the survival chance.
y_survived = train_data['Survived']

Here I'll create a function that turns the data into numerical data by one-hot encoding some columns and ordinal-encoding others.

In [None]:
oe = preprocessing.OrdinalEncoder()
def process_data(data):
    # The 'Name' column is dropped because it is not used by the models.
    processed_data = data.drop(columns = ['Name', 'Survived'], errors = 'ignore')
    processed_data = pd.get_dummies(processed_data, columns = ['Pclass', 'Embarked'])
    # Note: fit_transform refits the encoder on every call, so the integer codes
    # are only consistent within the frame that is passed in.
    processed_data[['Sex', 'Ticket', 'Title']] = oe.fit_transform(data[['Sex', 'Ticket', 'Title']])
    return processed_data

In [None]:
processed_train_data = process_data(train_data)

processed_test_data = process_data(test_data)
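One thing to keep in mind with `OrdinalEncoder` is that the integer assigned to a category depends on the data the encoder was fitted on. The sketch below is not the notebook's code: it contrasts refitting per frame with fitting once on both frames, using hypothetical toy frames.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train and test frames that both contain "Mrs"
train = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss"]})
test = pd.DataFrame({"Title": ["Mrs", "Master"]})

# Refitting per frame: "Mrs" receives a different code in each frame
code_train = OrdinalEncoder().fit_transform(train)[1, 0]
code_test = OrdinalEncoder().fit_transform(test)[0, 0]

# Fitting once on both frames keeps the code for "Mrs" identical
shared = OrdinalEncoder().fit(pd.concat([train, test]))
shared_train = shared.transform(train)[1, 0]
shared_test = shared.transform(test)[0, 0]

print(code_train, code_test)      # codes differ between the frames
print(shared_train, shared_test)  # codes agree
```

Whether this matters depends on how the encoded columns are used afterwards; it is worth checking whenever a model trained on one frame is applied to another.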

Here I'm creating a function to standardize the data, so that all features are on the same scale.

The data has to be scaled twice: once for the model that predicts age, and once with the ages already filled, for the model that I will create later to predict survival.

In [None]:
scaler = preprocessing.StandardScaler()
def scale_data(data):
    passenger_column = data.pop('PassengerId')
    column_names = data.columns
    scaled_data = scaler.fit_transform(data)   
    scaled_data = pd.DataFrame(scaled_data, columns = column_names)
    return scaled_data.assign(PassengerId = passenger_column)

In [None]:
scaled_train_data = scale_data(processed_train_data.drop(columns = 'Age'))

scaled_test_data = scale_data(processed_test_data.drop(columns = 'Age'))
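Standardization subtracts the mean and divides by the standard deviation learned during `fit`. A common alternative to refitting on every frame, shown here only as a sketch on hypothetical numbers, is to fit the scaler on the training data and reuse the same statistics for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [30.0]])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

# Apply the same statistics to both sets
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

With this approach a value of 10 maps to the same scaled value in both sets, because both are measured against the training mean and standard deviation.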

In [None]:
# Since I'll be training a supervised NN, only the passengers with a known age are used as training data.
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
combined = pd.concat([scaled_train_data, scaled_test_data])

X_ages = combined.loc[combined.PassengerId.isin(y_ages.PassengerId)].drop(['PassengerId'], axis = 1)

In [None]:
#Separating the train data and test data to use in the neural network
X_train, X_test, y_train, y_test = train_test_split(X_ages, y_ages.drop(columns = 'PassengerId'), 
                                                    test_size=0.3, random_state=123)

In [None]:
def plot_history(history):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Abs Error [years]')
    plt.plot(hist['epoch'], hist['mae'], label='Train Error')
    plt.plot(hist['epoch'], hist['val_mae'], label = 'Val Error')
    plt.legend()

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Square Error [$years^2$]')
    plt.plot(hist['epoch'], hist['mse'], label='Train Error')
    plt.plot(hist['epoch'], hist['val_mse'], label = 'Val Error')
    plt.legend()
    plt.show()

In [None]:
#Creating the model
from tensorflow.keras import regularizers
kr = regularizers.l1_l2(l1 = 1e-3, l2 = 1e-3)

model = keras.Sequential([
    layers.Dense(128,activation='relu',input_shape=[len(X_train.keys())], kernel_regularizer = kr),
    layers.Dropout(0.5),
    layers.Dense(64,activation='relu', kernel_regularizer = kr),
    layers.Dropout(0.5),
    layers.Dense(1)
])

model.compile(optimizer = 'RMSprop', loss = 'mse', metrics = ['mae', 'mse'])

In [None]:
history = model.fit(X_train, y_train, batch_size = 64, epochs = 2500,validation_split = 0.3,
                    verbose = 0)
plot_history(history)

In [None]:
#Let's check how this model deals with the test data.
model.evaluate(X_test, y_test)

As the mean absolute error shows, the model is off by roughly 8 years in either direction, which I consider a good result.
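To make the two reported metrics concrete, here is a small worked example with hypothetical ages and predictions (not output from the model above). The MAE is in years, while the MSE is in squared years and penalizes large misses more heavily.

```python
import numpy as np

# Hypothetical true ages and predicted ages
y_true = np.array([22.0, 38.0, 4.0, 60.0])
y_pred = np.array([30.0, 30.0, 10.0, 50.0])

errors = y_pred - y_true          # 8, -8, 6, -10

# Mean absolute error: (8 + 8 + 6 + 10) / 4
mae = np.mean(np.abs(errors))

# Mean squared error: (64 + 64 + 36 + 100) / 4
mse = np.mean(errors ** 2)

print(mae, mse)  # 8.0 66.0
```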

In [None]:
# This function fills the missing ages in the train and test data using the model I created.
def input_age(data):
    nan_age = data.loc[data.Age.isnull()].drop(['Age'], axis = 1)
    missing_age = model.predict(
        nan_age.drop(columns = ['PassengerId', 'Survived'], errors = 'ignore'))
    data.loc[data.Age.isnull(), ['Age']] = missing_age
    return data

In [None]:
train_data_ml = input_age(scaled_train_data.assign(Age = train_ages['Age']))
#display(train_data_ml)

test_data_ml = input_age(scaled_test_data.assign(Age = test_ages['Age']))
#display(test_data_ml)

As shown below, the result is pretty decent: there is little difference from the ages set by hand.

In [None]:
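# pd.concat replaces DataFrame.append (removed in pandas 2.0); ignore_index=True gives
# both frames the same unique row labels so the comparison and assignment align row by row
full_data = pd.concat([train_data, test_data], ignore_index = True)
full_data_ml = pd.concat([train_data_ml, test_data_ml], ignore_index = True)
full_data.loc[full_data.PassengerId == full_data_ml.PassengerId, ['Age']] = full_data_ml.Age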

print('\nAge set by ML:')
display(full_data.groupby(['Pclass','Sex','Title'])['Age'].mean())

print('\n\nAge set by hand:')
display(pd.concat([train_data_hand, test_data_hand]).groupby(['Pclass','Sex','Title'])['Age'].mean())

# **Predicting survival rate**

Now that we have all the data we need, I'm going to create a classification NN to predict whether a passenger will survive. There is still some pre-processing to do before training the model; one important step is binning the ages.

This is the model that will predict whether the passenger survives.

In [None]:
# This is the model created to predict survival; it could have fewer neurons, but this will do.
def create_model(X, y, model_name):
    model = keras.Sequential([
        layers.Dense(32, activation='tanh', input_shape=[len(X.keys())]),
        layers.Dropout(0.5),
        layers.Dense(32, activation='tanh'),
        layers.Dropout(0.5),
        layers.Dense(32, activation='tanh'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid')], name = model_name)
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
    
    history = model.fit(X, y, batch_size = 128, epochs = 600,validation_split = 0.3, verbose = 0)
    
    return model, history

In [None]:
# This is an auxiliary function that plots the training history and prints the best validation measures of the model we'll build
def display_acc_hist(history):
    history_df = pd.DataFrame(history.history)
    # Start the plot at epoch 0
    history_df.loc[0:, ['loss', 'val_loss']].plot()
    history_df.loc[0:, ['binary_accuracy', 'val_binary_accuracy']].plot()

    print(("Best Validation Loss: {:0.4f}" +\
           "\nBest Validation Accuracy: {:0.4f}")\
           .format(history_df['val_loss'].min(),
           history_df['val_binary_accuracy'].max()))
    return

#And this function will give some important measures
def print_results(y_test, y_pred):
    print('\nConfusion Matrix: \n' , confusion_matrix(y_test, y_pred))
    print('\n', classification_report(y_test, y_pred))
    print('\nAccuracy: ' , accuracy_score(y_test, y_pred))
    return accuracy_score(y_test, y_pred)
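The measures reported by `print_results` can be illustrated on a toy case. The sketch assumes sigmoid outputs thresholded at 0.5 to obtain class labels; the probabilities and labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical sigmoid outputs (one column, as returned by a single-unit output layer)
probs = np.array([[0.9], [0.2], [0.6], [0.4]])
y_true = np.array([1, 0, 0, 1])

# Threshold at 0.5 and flatten to a 1-D vector of 0/1 labels
y_pred = (probs > 0.5).astype(int).ravel()

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

Here one survivor and one non-survivor are misclassified, so each cell of the matrix holds one passenger and the accuracy is 0.5.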

We still have some pre-processing to do before training the new model: we need to scale both the data with ages predicted by ML and the data with ages set by hand. Fortunately, we already have the functions to do so.

In [None]:
# This function bins the ages into 10 groups, one-hot encodes the bins, and scales the data
def last_processing(data, ages):
    data.loc[data.PassengerId == ages.PassengerId, ['Age']] = ages.Age
    age_interval = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    data['Age'] = pd.cut(data['Age'].astype(int), bins = 10, labels = age_interval)
    data = pd.get_dummies(data, columns = ['Age'])
    return scale_data(data).drop(columns = 'PassengerId')

In [None]:
X_ml = last_processing(processed_train_data.copy(), train_data_ml)

test_data_ml = last_processing(processed_test_data.copy(), test_data_ml)
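When `pd.cut` is given an integer number of bins, the bin edges are derived from the minimum and maximum of the series being cut. The sketch below, on hypothetical ages, shows how that compares with passing explicit edges; the `edges` list is an assumption chosen for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical ages; the test set happens to have a smaller maximum
train_age = pd.Series([2, 25, 40, 80])
test_age = pd.Series([2, 25, 40])

# bins=3 derives the edges from each series' own min and max
own_train = pd.cut(train_age, bins=3, labels=False)
own_test = pd.cut(test_age, bins=3, labels=False)

# Explicit edges (illustrative values) give the same age the same bin everywhere
edges = [0, 30, 60, 90]
shared_train = pd.cut(train_age, bins=edges, labels=False)
shared_test = pd.cut(test_age, bins=edges, labels=False)

print(own_train.tolist(), own_test.tolist())
print(shared_train.tolist(), shared_test.tolist())
```

With per-series edges, age 25 lands in different bins in the two series, while the explicit edges keep the assignment consistent.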

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_ml, y_survived, test_size=0.3, random_state=123)
model_ml, history = create_model(X_train, y_train, 'ages_filled_by_NN.')
display_acc_hist(history)

y_pred = model_ml.predict(X_test) > 0.5

accuracy_ml = print_results(y_test, y_pred)

In [None]:
X_hand = last_processing(processed_train_data.copy(), train_data_hand)

test_data_hand = last_processing(processed_test_data.copy(), test_data_hand)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_hand, y_survived, test_size=0.3, random_state=123)
model_hand, history = create_model(X_train, y_train, 'ages_filled_by_hand.')
display_acc_hist(history)

y_pred = np.around(model_hand.predict(X_test))

accuracy_hand = print_results(y_test, y_pred)

# **Results:**
In the end, there is no big difference between the data with ages filled by hand and the data with ages filled by the NN, so I'll just use the model with the higher accuracy score.

In [None]:
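# Keep each model paired with the test set that was prepared the same way as its training data
if accuracy_ml > accuracy_hand:
    best_model, best_test_data = model_ml, test_data_ml
else:
    best_model, best_test_data = model_hand, test_data_hand

# Round the sigmoid outputs to 0/1 and flatten to a 1-D vector
survived = np.around(best_model.predict(best_test_data)).ravel()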

output = test_data['PassengerId'].to_frame().assign(Survived = survived)
output['Survived'] = output['Survived'].astype(int)

print('\nThis result was achieved using the model with the ' + best_model.name)
display(output.head())

output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")