### Explore and discover structure in the Titanic passenger dataset from Kaggle
The end goal is a Django web app that loads the .csv data files, determines appropriate model features, sets hyperparameters for a DNN, trains it, makes predictions on the test set, generates a submission file, and submits it to Kaggle for evaluation.

$H_{0}$: There is no method of predicting who will survive and who will perish on the Titanic.

In [1]:
import ipywidgets as widgets

#### Outline the basics:
1) Gather data from Kaggle<br> 
2) Examine the data<br> 
2a) Transform data and remove features<br>
2b) Move the classification result to "Y" as the prediction variable<br>
3) Transform categorical data into "One Hot" encoding<br> 
4) Standardize the data (normalization of numerical features)<br> 
5) Use cross-validation to split the data into train/test folds<br> 
6) Create a simple DNN with Keras/TensorFlow<br> 
7) Make predictions<br> 
8) Output the prediction file for submission<br> 
  

In [2]:
import pandas as pd
import numpy as np

from keras.backend import backend
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from keras.callbacks import Callback

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, Imputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score




Using TensorFlow backend.


In [3]:
# fix random seed for reproducibility
seed = 123
np.random.seed(seed)
sample = np.random.randint(5,10, size=1)
int(sample)

7
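
To illustrate what fixing the seed buys: re-seeding NumPy with the same value replays the same draws. This is only a small sketch. It covers NumPy's generator alone, and the assumption here is that other libraries (for example the TensorFlow backend) keep their own random state and would need to be seeded separately for fully repeatable training.

```python
import numpy as np

seed = 123

np.random.seed(seed)
first = np.random.randint(5, 10, size=3)

np.random.seed(seed)                       # re-seed with the same value
second = np.random.randint(5, 10, size=3)

# Both arrays hold the same three integers, each in the range 5..9
print(first.tolist(), second.tolist())
```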

In [4]:
data = pd.read_csv('train.csv')

In [5]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Transform input data into X (features, independent variables) and Y (the dependent variable)

In [6]:
Y = data[['PassengerId','Survived']]

In [7]:
Y.head()

Unnamed: 0,PassengerId,Survived
0,1,0
1,2,1
2,3,1
3,4,1
4,5,0


In [16]:
X = data.drop(columns=['Survived', 'Name', 'SibSp','Parch','Ticket','Cabin'])

In [17]:
X.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked
0,1,3,male,22.0,7.25,S
1,2,1,female,38.0,71.2833,C
2,3,3,female,26.0,7.925,S
3,4,1,female,35.0,53.1,S
4,5,3,male,35.0,8.05,S


So we have three categorical columns (Pclass, Sex and Embarked) that need encoding, and two numerical columns (Age and Fare) that need normalizing. We will use sklearn.preprocessing and pandas to transform the data. That leaves five features and an ID column. 
To train the model we do not need the ID column, since the DNN only predicts whether or not a passenger survived.  <br>
We also have to handle NaNs in the features. Imputer from sklearn has that capability, which saves the time of writing that code by hand.
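
Before imputing anything it helps to see where the gaps actually are. The sketch below counts missing values per column with pandas. The `toy` frame is a hypothetical stand-in shaped like the feature table above, with gaps added for illustration; it is not the real train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the feature table, with gaps added on purpose
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", None],
})

missing = toy.isna().sum()   # number of missing values in each column
print(missing)
```

Running the same `isna().sum()` on the real `X` would show which columns need imputing.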

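First, encode the sex of each passenger as 0 for female and 1 for male. LabelEncoder produces a single integer column, which for a two-valued feature carries the same information as a one-hot pair.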

In [18]:
enc = LabelEncoder()
enc.fit(X['Sex'])
X['Sex'] = enc.transform(X['Sex'])
X.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked
0,1,3,1,22.0,7.25,S
1,2,1,0,38.0,71.2833,C
2,3,3,0,26.0,7.925,S
3,4,1,0,35.0,53.1,S
4,5,3,1,35.0,8.05,S
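
Why female ends up as 0 and male as 1: LabelEncoder sorts the distinct labels and numbers them in that order. A minimal sketch on a hand-written list (not the real column):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["male", "female", "female", "male"])

print(enc.classes_.tolist())   # ['female', 'male'] -> female is 0, male is 1
print(codes.tolist())          # [1, 0, 0, 1]

# inverse_transform maps the integer codes back to the original labels
print(enc.inverse_transform([0, 1]).tolist())   # ['female', 'male']
```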


Transform "Pclass", a proxy for social status, into one-hot columns with the pandas get_dummies function.

In [19]:
X = pd.get_dummies(X, columns = ['Pclass'])
X.head()

Unnamed: 0,PassengerId,Sex,Age,Fare,Embarked,Pclass_1,Pclass_2,Pclass_3
0,1,1,22.0,7.25,S,0,0,1
1,2,0,38.0,71.2833,C,1,0,0
2,3,0,26.0,7.925,S,0,0,1
3,4,0,35.0,53.1,S,1,0,0
4,5,1,35.0,8.05,S,0,0,1


Transform the embarkation points into one-hot columns too, again with the pandas get_dummies function. Very useful!

In [24]:
X = pd.get_dummies(X, columns=['Embarked'])

In [25]:
X.head(5)

Unnamed: 0,PassengerId,Sex,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
0,1,1,22.0,7.25,0,0,1,0,0,1
1,2,0,38.0,71.2833,1,0,0,1,0,0
2,3,0,26.0,7.925,0,0,1,0,0,1
3,4,0,35.0,53.1,1,0,0,0,0,1
4,5,1,35.0,8.05,0,0,1,0,0,1
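
One thing worth knowing about get_dummies: by default a missing category becomes a row of all zeros rather than an error. The sketch below shows this on a hypothetical four-row column, along with the `dummy_na=True` option that gives missing values their own indicator column. Whether the real Embarked column has gaps is something to check rather than assume.

```python
import pandas as pd

# Hypothetical column with one missing embarkation point
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q"]})

plain = pd.get_dummies(toy, columns=["Embarked"])
with_na = pd.get_dummies(toy, columns=["Embarked"], dummy_na=True)

# Row sums: the row with the missing value has no indicator set
print(plain.sum(axis=1).tolist())   # [1, 1, 0, 1]

# dummy_na=True adds one extra column that flags the missing row
print(with_na.columns.tolist())
```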


Next we fill the NaNs in the Age and Fare columns with the column mean, then check whether any NaNs remain.

In [26]:
imp = Imputer(missing_values ='NaN', strategy='mean', axis=0)
imp.fit(X[['Age','Fare']])
X[['Age','Fare']] = imp.transform(X[['Age','Fare']])

In [27]:
X['Fare'].isna().any()

False
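
A note for newer environments: `Imputer` was removed from sklearn.preprocessing in scikit-learn 0.22, and `SimpleImputer` from sklearn.impute is its replacement. The sketch below does the same mean imputation on a hypothetical three-row frame. It has no `axis` argument because it always works column-wise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values with one gap in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.2833, np.nan],
})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
toy[["Age", "Fare"]] = imp.fit_transform(toy[["Age", "Fare"]])

print(toy)   # the missing Age becomes 24.0, the mean of 22.0 and 26.0
```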

Nope! We're good to go on to normalizing the Age and Fare data.

In [28]:
scaler = StandardScaler()
# fit() must be called first so the scaler learns each column's mean and standard deviation
scaler.fit(X[['Age','Fare']])

StandardScaler(copy=True, with_mean=True, with_std=True)

In [29]:
X[['Age','Fare']] = scaler.transform(X[['Age','Fare']])

In [31]:
X.head(5)

Unnamed: 0,PassengerId,Sex,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
0,1,1,-0.592481,-0.502445,0,0,1,0,0,1
1,2,0,0.638789,0.786845,1,0,0,1,0,0
2,3,0,-0.284663,-0.488854,0,0,1,0,0,1
3,4,0,0.407926,0.42073,1,0,0,0,0,1
4,5,1,0.407926,-0.486337,0,0,1,0,0,1
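
What StandardScaler does to each column is $z = (x - \mu) / \sigma$, with $\sigma$ the population standard deviation (divide by $n$, not $n - 1$). A minimal sketch that checks this by hand, using only the five Age values shown in the first rows above, so the numbers differ from the full-dataset output:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[22.0], [38.0], [26.0], [35.0], [35.0]])

scaled = StandardScaler().fit_transform(ages)

# Same computation by hand; np.std uses ddof=0 (population std) by default
by_hand = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, by_hand))
```

After scaling, the column has mean 0 and standard deviation 1, which puts Age and Fare on a comparable scale to the 0/1 indicator columns.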


The data is looking good to feed into a DNN for predictions!

Convert the data to purely numeric arrays for input into Keras.

In [32]:
X_input = X[['Sex','Age','Fare','Pclass_1','Pclass_2','Pclass_3','Embarked_C','Embarked_Q','Embarked_S']].as_matrix()
Y_input = Y[['Survived']].as_matrix()

In [33]:
X_input[0:5]

array([[ 1.        , -0.5924806 , -0.50244517,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ],
       [ 0.        ,  0.63878901,  0.78684529,  1.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ],
       [ 0.        , -0.2846632 , -0.48885426,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ],
       [ 0.        ,  0.40792596,  0.42073024,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ],
       [ 1.        ,  0.40792596, -0.48633742,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ]])

In [34]:
Y_input[0:5]

array([[0],
       [1],
       [1],
       [1],
       [0]], dtype=int64)
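
For newer pandas versions: `DataFrame.as_matrix()` was removed in pandas 1.0, and `to_numpy()` is the replacement. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical rows, just to show the conversion
toy = pd.DataFrame({"Sex": [1, 0], "Age": [-0.59, 0.64], "Survived": [0, 1]})

X_toy = toy[["Sex", "Age"]].to_numpy()
Y_toy = toy[["Survived"]].to_numpy()   # double brackets keep it 2-D

print(X_toy.shape, Y_toy.shape)   # (2, 2) (2, 1)
```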

# Define the Model

Start with a simple DNN with 9 inputs and one output: three hidden layers of 9, 10 and 10 nodes with ReLU activations, followed by a single sigmoid output node.

In [46]:
model = Sequential()
model.add(Dense(9, input_dim=9, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
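
To get a feel for the size of this network: a fully connected layer with `n_in` inputs and `n_out` nodes has `n_in * n_out` weights plus `n_out` biases. The helper `dense_params` below is a hypothetical name used only for this arithmetic; it is not part of Keras.

```python
def dense_params(n_in, n_out):
    """Trainable parameters in a dense layer: weights plus biases."""
    return n_in * n_out + n_out

# 9 inputs, then layers of 9, 10, 10 and 1 nodes, as in the model above
layer_sizes = [9, 9, 10, 10, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer, sum(per_layer))   # [90, 100, 110, 11] 311
```

This should match the trainable parameter total that `model.summary()` reports.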

In [57]:
model.compile(loss='binary_crossentropy', optimizer='SGD',metrics=['accuracy'])
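
The loss named here, binary cross-entropy, is the mean of $-[y \log p + (1 - y) \log(1 - p)]$ over the samples. Below is a plain NumPy sketch of that formula. The labels are the first five Survived values; the probabilities are made up, and the `eps` clipping is an assumption added to avoid `log(0)`, not a claim about how Keras implements it.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # keep log() finite
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([0, 1, 1, 1, 0])
confident = np.array([0.1, 0.9, 0.8, 0.9, 0.2])   # hypothetical, mostly right
unsure = np.full(5, 0.5)                          # always predicts 50%

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, unsure))        # equals ln(2), about 0.693
```

Confident correct predictions drive the loss toward 0, while a coin-flip predictor stays at ln(2).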

In [58]:
model.fit(X_input,Y_input,epochs=50,batch_size=10)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x29c5a56aa20>

In [59]:
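# Note: this evaluates on the same data the model was trained on,
# so the accuracy is an optimistic estimate
scores = model.evaluate(X_input, Y_input)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))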


acc: 79.80%


In [60]:
predictions = model.predict(X_input)

In [61]:
#rounded = [round(x[0]) for x in predictions]
#print(rounded)
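
The sigmoid output is a probability per passenger, so it has to be turned into a 0/1 label before submission. A minimal sketch with an explicit 0.5 threshold, which makes the tie rule visible instead of relying on rounding. The `probs` array is hypothetical, shaped like the `(n, 1)` output of `model.predict`.

```python
import numpy as np

# Hypothetical predicted survival probabilities, one row per passenger
probs = np.array([[0.12], [0.87], [0.50], [0.64], [0.31]])

# Strictly greater than 0.5 counts as "survived"
labels = (probs[:, 0] > 0.5).astype(int)

print(labels.tolist())   # [0, 1, 0, 1, 0]
```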

In [62]:
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X_input, Y_input):
    # create model
    model = Sequential()
    model.add(Dense(9, input_dim=9, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='SGD', metrics=['accuracy'])
    # Fit the model
    model.fit(X_input[train], Y_input[train], epochs=50, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X_input[test], Y_input[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

acc: 80.00%
acc: 75.56%
acc: 84.27%
acc: 82.02%
acc: 78.65%
acc: 80.90%
acc: 79.78%
acc: 82.02%
acc: 80.90%
acc: 80.68%
80.48% (+/- 2.19%)
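
What the "stratified" part of StratifiedKFold buys: every test fold keeps roughly the same share of survivors as the full dataset, so no fold is accidentally lopsided. A small sketch on hypothetical labels (60 zeros and 40 ones, chosen so the split comes out exact):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 40% positives overall
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))          # features are irrelevant to the split

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
fold_rates = [float(y[test].mean()) for _, test in kfold.split(X, y)]

# Every test fold holds 10 samples, 4 of them positive
print(fold_rates)
```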


In [51]:
class Metrics(Callback):
    """Record F1, precision and recall on the validation data after each epoch."""

    def on_train_begin(self, logs=None):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs=None):
        # In Keras 2.x the validation_data passed to fit() is stored on the callback itself
        val_predict = np.asarray(self.model.predict(self.validation_data[0])).round()
        val_targ = self.validation_data[1]
        _val_f1 = f1_score(val_targ, val_predict)
        _val_recall = recall_score(val_targ, val_predict)
        _val_precision = precision_score(val_targ, val_predict)
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print("- val_f1: %f - val_precision: %f - val_recall: %f" % (_val_f1, _val_precision, _val_recall))

metrics = Metrics()
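
For reference, here is what the three scores in the callback measure, worked on a tiny hand-made example (the labels are hypothetical). Precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]   # 2 true positives, 1 false positive, 1 false negative

precision = precision_score(y_true, y_pred)   # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)         # 2 / (2 + 1)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two

print(precision, recall, f1)
```

With precision and recall both at 2/3, F1 is 2/3 as well.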

In [56]:
model.metrics

['accuracy']