# Titanic Survival Classification - Experimenting with the data (Part 6)

Now that we have an idea of our submission accuracy, it's time to get as much performance as possible out of our model by applying everything learned so far, both from the previous parts and from the research.

In [1]:
#First importing some relevant packages
import numpy as np
import pandas as pd

#Import Tensorflow
import tensorflow as tf

#Import Keras
import keras
from keras import layers
from keras.layers import Input, Dense, Activation, BatchNormalization, Dropout, Reshape, Flatten
from keras.layers.advanced_activations import LeakyReLU, PReLU
from keras.models import Sequential, Model
from keras import regularizers
from keras.optimizers import Adam

#Import mathematical functions
from random import *
import math
import matplotlib
import matplotlib.pyplot as plt

#Get regular expression package
import re

#Import  Scikit learn framework
import sklearn as sk
from sklearn import svm
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)

Using TensorFlow backend.


In [2]:
#Import the functions built in previous parts
from Titanic_Import import *

full_set = pd.read_csv('D:/Datasets/Titanic/train.csv')
sub_set = pd.read_csv('D:/Datasets/Titanic/test.csv')

## Normalizing submission data

The first and most glaring difference between our cross validation and test data is that we normalized each with a separate (albeit identical) procedure, so each set was scaled by its own mean and range. Instead of normalizing the submission data separately from our training and cross validation data, let's normalize everything together and then train our model on data normalized across the same distribution.
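
To make the difference concrete, here is a minimal sketch with made-up fares. The `mean_normalize` helper and the toy values are assumptions for illustration and are not part of the original notebook; the scaling itself is the same mean/range formula used for `Norm_age` and `Norm_fare` in `Cleanse_Data_v3`.

```python
import pandas as pd

def mean_normalize(series):
    # Same scaling as Norm_age / Norm_fare: (x - mean) / (max - min)
    return (series - series.mean()) / (series.max() - series.min())

# Toy fares, invented purely for illustration
train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05]})
test = pd.DataFrame({"Fare": [512.33, 13.0]})

# Normalizing each set on its own uses a different mean and range per set
train_alone = mean_normalize(train["Fare"])
test_alone = mean_normalize(test["Fare"])

# Normalizing the combined data puts every row on one common scale
combined = pd.concat([train, test], ignore_index=True)
combined["Norm_fare"] = mean_normalize(combined["Fare"])

print(train_alone.round(3).tolist())
print(combined["Norm_fare"].round(3).tolist())
```

The same raw fare ends up with a very different normalized value depending on which rows it was scaled with, which is the mismatch we are removing here.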

In [3]:
append_set = full_set

In [4]:
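#DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
#sort=True keeps the alphabetical column order that the column indices used later rely on.
append_set = pd.concat([append_set, sub_set], ignore_index=True, sort=True)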

In [5]:
#Creating our Training Set
def Cleanse_Data_v3(df_in):
    #Copy the dataframe so the caller's original is not modified
    test_set = df_in.copy()
    
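    #Fill missing Age and Fare with the median of the passenger's class
    #(transform keeps the original row index, so the assignment aligns row by row)
    test_set['Age'] = test_set.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
    test_set['Fare'] = test_set.groupby(['Pclass'])['Fare'].transform(lambda x: x.fillna(x.median()))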
    
    #Name Length from Anisotropic
    test_set['Name_length'] = test_set['Name'].apply(len)
    
    test_set['Company'] = test_set['SibSp'] + test_set['Parch']
    
    #Normalize numerical fields
    age_mean = test_set['Age'].mean()
    fare_mean = test_set['Fare'].mean()
    
    age_range = test_set['Age'].max() - test_set['Age'].min()
    fare_range = test_set['Fare'].max() - test_set['Fare'].min()
    
    #Standard deviations to test
    age_std = test_set['Age'].std(skipna=True)
    fare_std = test_set['Fare'].std(skipna=True)
    
    
    test_set['Norm_age'] = (test_set['Age'] - age_mean) / age_range
    test_set['Norm_fare'] = (test_set['Fare'] - fare_mean) / fare_range

    
    
    
    #Getting our Deck
    test_set['canc'] = test_set['Cabin'].str.replace(' ', '')
    test_set['Deckstr'] = test_set['canc'].str[0]
    test_set['Deckstr'] = test_set['Deckstr'].fillna(value = 'X')
    test_set['Deckstr'] = test_set['Deckstr'].map( {'A': 1, 'B': 2, 'C' : 3,'D' : 4, 'E' : 5,'F' : 6,'G' : 7, 'T' : 8 ,'X' : 0} ).astype(int)
    
    #Remap Gender to a numeric flag (female = 1, male = 0)
    test_set['Sex'] = test_set['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
    

    #Applying Title code from Sia
    test_set['Title'] = test_set['Name'].apply(get_title)
    test_set['Title'] = test_set['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    test_set['Title'] = test_set['Title'].replace('Mlle', 'Miss')
    test_set['Title'] = test_set['Title'].replace('Ms', 'Miss')
    test_set['Title'] = test_set['Title'].replace('Mme', 'Mrs')
    
    

    
    
    #Fill the missing Embarked values with 'C' (only 2 rows are missing; the value was looked up based on average fare)
    values = {'Embarked': 'C'}
    test_set = test_set.fillna(value=values)
    test_set['Embarked'] = test_set['Embarked'].map( {'S': 0, 'C': 1, 'Q' : 2} ).astype(int)
    
    
    
    emb_set = pd.get_dummies(test_set.Embarked, prefix='Emb', dummy_na = False)
    title_set = pd.get_dummies(test_set.Title, prefix='ti', dummy_na = True)
    deck_set = pd.get_dummies(test_set.Deckstr, prefix='de', dummy_na = True)

    
    oh_set = pd.concat([test_set,  
                        emb_set, 
                        title_set, 
                        deck_set
                       ], axis=1)
    
    #Create output fully numeric dataframe
    out_set = oh_set.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 
                           'Cabin', 'canc', 'Embarked', 'Title', 'Deckstr', 
                           'Name_length', 'Age', 'Fare'], axis=1)
    return out_set

In [6]:
clean_set = Cleanse_Data_v3(append_set)

In [7]:
clean_set.head(10)

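| | Pclass | Sex | Survived | Company | Norm_age | Norm_fare | Emb_0 | Emb_1 | Emb_2 | ti_Master | ... | de_0.0 | de_1.0 | de_2.0 | de_3.0 | de_4.0 | de_5.0 | de_6.0 | de_7.0 | de_8.0 | de_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0.0 | 1 | -0.090286 | -0.0508 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1.0 | 1 | 0.11014 | 0.074185 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 1 | 1.0 | 0 | -0.04018 | -0.049482 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1.0 | 1 | 0.07256 | 0.038693 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 0 | 0.0 | 0 | 0.07256 | -0.049238 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 3 | 0 | 0.0 | 0 | -0.065233 | -0.048441 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1 | 0 | 0.0 | 0 | 0.310566 | 0.036278 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 3 | 0 | 0.0 | 4 | -0.340818 | -0.023815 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 3 | 1 | 1.0 | 2 | -0.027653 | -0.04322 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2 | 1 | 1.0 | 1 | -0.190499 | -0.006257 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |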


Now that our data is normalized across the entire distribution, let's split out our training, test and cross validation sets.

In [8]:
def dataset_splitter(df_in, cv_size = 100):
    #Work on a copy so the caller's dataframe is left untouched
    test_set = df_in.copy()
    
    #Labelled rows (Survived is known) form the pool for training and CV data
    train_set = test_set[test_set['Survived'].notnull()]
    
    #Randomly sample the CV set (plain random sampling, not stratified by survival)
    cv_set = train_set.sample(cv_size)
    
    #Drop all rows from our training set that are in our CV set
    new_train = train_set.drop(cv_set.index)
    
    #Create numpy arrays out of our Training and CV sets
    Y_Train = new_train['Survived'].values
    Y_CV = cv_set['Survived'].values
    
    new_train = new_train.drop(['Survived'], axis=1)
    cv_set = cv_set.drop(['Survived'], axis=1)
    
    X_Train = new_train.values
    X_CV = cv_set.values
    
    
    #Get Test Data
    sub_set = test_set[test_set['Survived'].isnull()]
    sub_set = sub_set.drop(['Survived'], axis=1)
    X_Test = sub_set.values
    
    return X_Train, Y_Train, X_CV, Y_CV, X_Test
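
The splitter above draws the cross validation rows with a plain random sample. If a split that preserves the survival ratio is wanted, one possible sketch (an assumption on my part, not what the function above does) is to sample within each `Survived` group using pandas' `DataFrameGroupBy.sample`, which is available from pandas 1.1 onward. The toy dataframe and the `stratified_cv_split` name are hypothetical.

```python
import pandas as pd

def stratified_cv_split(labelled, cv_frac, seed=0):
    # Sample the same fraction from the survivors and the non-survivors
    cv_set = labelled.groupby("Survived").sample(frac=cv_frac, random_state=seed)
    # Everything that was not sampled stays in the training set
    train_set = labelled.drop(cv_set.index)
    return train_set, cv_set

# Toy labelled data: 40 survivors and 60 non-survivors
toy = pd.DataFrame({"Survived": [1.0] * 40 + [0.0] * 60,
                    "Pclass": [1, 2, 3, 3] * 25})
toy_train, toy_cv = stratified_cv_split(toy, cv_frac=0.25)

print(len(toy_train), len(toy_cv), toy_cv["Survived"].mean())
```

With a small cross validation set of 100 to 150 rows, keeping the class ratio fixed should make scores from different random splits easier to compare.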

In [13]:
X_Train, Y_Train, X_CV, Y_CV, X_Test = dataset_splitter(clean_set, cv_size = 150)

Now that we have our data, it's time to build a bespoke neural network for it.

In [14]:
def NN_model_v2(input_shape, layers, act_reg, ker_reg):
    #Keep the input shape dynamic in case the engineered features change later
    X_input = Input(input_shape)
    
    #First hidden layer (the Input tensor above already fixes the input size)
    X = Dense(layers[0], activation='relu')(X_input)
    #X = LeakyReLU()(X)
    #X = BatchNormalization()(X)

    #Remaining hidden layers, with optional activity/kernel regularization
    for i in range(len(layers) - 1):
        X = Dense(layers[i + 1], activation='relu', activity_regularizer = act_reg, kernel_regularizer = ker_reg)(X)
        #X = LeakyReLU()(X)
        #X = BatchNormalization()(X)

    
    X = Dense(1, activation='sigmoid')(X)

    # Create model. This creates your Keras model instance, you'll use this instance to train/test the model.
    model = Model(inputs = X_input, outputs = X, name='Simple_model')

    return model

After a LOT of experimentation with the parameters of the above functions, the run below is a representative example of the sort of performance I was able to obtain.

In [26]:
layers = [21, 14, 8, 5, 5]

In [27]:
test_model = NN_model_v2((X_Train.shape[1], ), layers, regularizers.l2(0.01), None)
test_model.compile(optimizer = "Adam", loss = "binary_crossentropy", metrics = ["accuracy"])
test_model.fit(x = X_Train, y = Y_Train, epochs = 256, verbose = 1)

Epoch 1/256
...
Epoch 256/256


<keras.callbacks.History at 0x347cd748>

Having done a lot more testing, I have found that the F1 score is a much better indicator of the probable accuracy on the public test data than the cross validation accuracy. So let's have a look at the confusion matrices once more.

In [28]:
train_pred = test_model.predict(x = X_Train)
cv_pred = test_model.predict(x = X_CV)

In [29]:
train_hat = normalize_predictions(train_pred)
cv_hat = normalize_predictions(cv_pred)


In [30]:
acc1, score1, conf1 = Calc_Accuracy(Y_Train, train_hat)

print("Accuracy = ", acc1)
print("F1 Score = ", score1)
print("")
print("Confusion Matrix")
conf1[["Labels", "Actual True", "Actual False"]]

Accuracy =  87.17948717948718
F1 Score =  0.8217636022514071

Confusion Matrix


| | Labels | Actual True | Actual False |
|---|---|---|---|
| 0 | Pred True | 219.0 | 23.0 |
| 1 | Pred False | 72.0 | 427.0 |


In [31]:
acc2, score2, conf2 = Calc_Accuracy(Y_CV, cv_hat)

print("Accuracy = ", acc2)
print("F1 Score = ", score2)
print("")
print("Confusion Matrix")
conf2[["Labels", "Actual True", "Actual False"]]

Accuracy =  80.0
F1 Score =  0.7058823529411765

Confusion Matrix


| | Labels | Actual True | Actual False |
|---|---|---|---|
| 0 | Pred True | 36.0 | 15.0 |
| 1 | Pred False | 15.0 | 84.0 |
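
As a quick check on how these numbers relate, the sketch below recomputes accuracy and F1 from the counts in the two confusion matrices above. The `scores_from_counts` helper is hypothetical (it is not the `Calc_Accuracy` function from `Titanic_Import`, whose internals are not shown here); it only uses the standard definitions of precision, recall and F1.

```python
def scores_from_counts(tp, fp, fn, tn):
    # Standard definitions; accuracy is returned as a percentage
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = 100 * (tp + tn) / (tp + fp + fn + tn)
    return accuracy, f1

# Counts read off the training and cross validation confusion matrices
train_acc, train_f1 = scores_from_counts(tp=219, fp=23, fn=72, tn=427)
cv_acc, cv_f1 = scores_from_counts(tp=36, fp=15, fn=15, tn=84)

print(train_acc, train_f1)
print(cv_acc, cv_f1)
```

On the cross validation set precision and recall are both 36/51, so the F1 score of about 0.71 sits well below the 80% accuracy; the accuracy is propped up by the larger number of correctly predicted non-survivors.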


The cells below just contain the code to output a prediction set for upload.

In [24]:
test_pred = test_model.predict(x = X_Test)

test_hat = normalize_predictions(test_pred)

In [106]:
sub_df = Create_output_frame(sub_set, test_hat)
sub_df.to_csv("Predictions.csv", index=False, float_format='%1d')

Unfortunately, even with the best performing models I still got approximately the same performance as with the first upload; I could only match the previous best.

After doing a LOT of testing with different features, architectures, regularization settings and numbers of training iterations, I have not yet found a satisfactory neural network model for this data.

One way to improve a model is to get more training data. As it's impossible (and immoral) to go back in time and sink another Titanic, let's instead generate our own training data using a Generative Adversarial Network (GAN).

## Generative Adversarial Network

This is the first GAN I have ever made, and it is based on the template model given at https://deeplearning4j.org/generative-adversarial-network

The idea is to train two networks: a generator that produces training data from Gaussian noise input, and a discriminator that learns to spot fake data, trained on both real data and generated data.

The hope is that our generator gets good enough to fool our discriminator, and can then generate fake training data which we can add to our training data pool.

First, let's transform our data into a format that we can use to train a GAN. Fortunately this is almost entirely done already with the earlier appended dataset; the only thing we realistically need to do is fill in the NaN values in the Survived column (the submission rows, which have no label) with a dummy value.

This isn't ideal, as in theory any fake data will have a skewed survival rate. However, we can apply a simple rounding function to the fake data afterward, and as both real and fake data are assumed to come from the same distribution overall, this should not affect our generated data distribution too negatively.

In [28]:
#Copy so that filling in the dummy value does not overwrite clean_set
gan_train = clean_set.copy()
gan_train['Survived'] = gan_train.Survived.fillna(value=0.5)
actual_gan = gan_train.values

Now that we have our data, let's build our GAN.

In [29]:
class GAN():
    def __init__(self, num_features):
        
        self.num_features = num_features
        optimizer = Adam(0.0002, 0.5)

        # Build and compile the discriminator
        self.discriminator = self.build_discriminator()
        self.discriminator.compile(loss='binary_crossentropy', 
                                   optimizer=optimizer,
                                    metrics=['accuracy'])

        # Build and compile the generator
        self.generator = self.build_generator()
        self.generator.compile(loss='binary_crossentropy', optimizer=optimizer)

        # The generator takes noise as input and generates fake samples
        z = Input(shape=(num_features,))
        fake = self.generator(z)

        # NOTE: the intent is for the combined model to train only the generator,
        # which requires trainable = False here. As written (True), training the
        # combined model also updates the discriminator's weights.
        self.discriminator.trainable = True

        # The discriminator takes generated samples as input and determines validity
        valid = self.discriminator(fake)

        # The combined model (stacked generator and discriminator) takes
        # noise as input => generates samples => determines validity
        self.combined = Model(z, valid)
        self.combined.compile(loss='binary_crossentropy', optimizer=optimizer)
        
    def build_generator(self):

        noise_shape = (self.num_features,)
        
        model = Sequential()
        
        model.add(Dense(5 ,activation='relu' , input_shape=noise_shape))
        model.add(BatchNormalization())
        model.add(Dense(7 ,activation='relu' ))
        model.add(BatchNormalization())
        model.add(Dense(9 ,activation='relu' ))
        model.add(BatchNormalization())
        model.add(Dense(12 ,activation='relu' ))
        model.add(BatchNormalization())
        model.add(Dense(17 ,activation='relu' ))
        model.add(BatchNormalization())
        model.add(Dense(self.num_features))
        model.add(BatchNormalization())
        
        # Sigmoid output: every generated feature is bounded to the range (0, 1)
        model.add(Dense(np.prod(noise_shape), activation='sigmoid'))
        model.add(Reshape(noise_shape))

        model.summary()

        noise = Input(shape=noise_shape)
        fake = model(noise)

        return Model(noise, fake)
    
    def build_discriminator(self):

        fake_shape = (self.num_features,)
        
        model = Sequential()

        model.add(Dense(17,activation='relu', input_shape=fake_shape))
        model.add(BatchNormalization())
        model.add(Dense(12,activation='relu' ))
        model.add(BatchNormalization())
        model.add(Dense(9 ,activation='relu' ))
        model.add(BatchNormalization())
        model.add(Dense(7 ,activation='relu' ))
        model.add(BatchNormalization())
        model.add(Dense(5 ,activation='relu' ))
        model.add(BatchNormalization())
        
        model.add(Dense(1 ,activation='sigmoid' ))
 
        model.summary()

        out = Input(shape=fake_shape)
        validity = model(out)

        return Model(out, validity)
    
    def train(self, X_train, epochs, batch_size=128, interval = 100):

        half_batch = batch_size // 2

        for epoch in range(epochs):

            # ---------------------
            #  Train Discriminator
            # ---------------------

            # Select a random half batch of real samples
            idx = np.random.randint(0, X_train.shape[0], half_batch)
            reals = X_train[idx]

            noise = np.random.normal(0, 1, (half_batch, X_train.shape[1]))

            # Generate a half batch of fake samples
            gen_fakes = self.generator.predict(noise)

            # Train the discriminator
            d_loss_real = self.discriminator.train_on_batch(reals, np.ones((half_batch, 1)))
            d_loss_fake = self.discriminator.train_on_batch(gen_fakes, np.zeros((half_batch, 1)))
            d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)


            # ---------------------
            #  Train Generator
            # ---------------------

            noise = np.random.normal(0, 1, (batch_size, X_train.shape[1]))

            # The generator wants the discriminator to label the generated samples
            # as valid (ones)
            valid_y = np.array([1] * batch_size)

            # Train the generator
            g_loss = self.combined.train_on_batch(noise, valid_y)

            # Print the progress
            if epoch % interval == 0:
                print ("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch, d_loss[0], 100*d_loss[1], g_loss))
    
    #Generate Training Examples      
    def Generate_data(self, num_examples):
        noise = np.random.normal(0, 1, (num_examples, self.num_features))
        gen_fakes = self.generator.predict(noise)
        
        discrim_preds = self.discriminator.predict(gen_fakes)
        return gen_fakes, discrim_preds

Now to compile the GAN

In [30]:
gan = GAN(actual_gan.shape[1])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 17)                442       
_________________________________________________________________
batch_normalization_9 (Batch (None, 17)                68        
_________________________________________________________________
dense_14 (Dense)             (None, 12)                216       
_________________________________________________________________
batch_normalization_10 (Batc (None, 12)                48        
_________________________________________________________________
dense_15 (Dense)             (None, 9)                 117       
_________________________________________________________________
batch_normalization_11 (Batc (None, 9)                 36        
_________________________________________________________________
dense_16 (Dense)             (None, 7)                 70        
__________

Now to train the GAN (note that, depending on the parameters, this may take a LONG time).

In [31]:
gan.train(actual_gan, epochs=5000, batch_size=128)

0 [D loss: 0.724442, acc.: 50.78%] [G loss: 0.748606]
100 [D loss: 0.695665, acc.: 49.22%] [G loss: 0.646381]
200 [D loss: 0.700983, acc.: 49.22%] [G loss: 0.603774]
300 [D loss: 0.703950, acc.: 49.22%] [G loss: 0.564096]
400 [D loss: 0.710166, acc.: 50.00%] [G loss: 0.528174]
500 [D loss: 0.718502, acc.: 50.00%] [G loss: 0.496352]
600 [D loss: 0.726243, acc.: 50.00%] [G loss: 0.468353]
700 [D loss: 0.735141, acc.: 50.00%] [G loss: 0.444457]
800 [D loss: 0.744017, acc.: 50.00%] [G loss: 0.424114]
900 [D loss: 0.751790, acc.: 50.00%] [G loss: 0.406614]
1000 [D loss: 0.760143, acc.: 50.00%] [G loss: 0.391569]
1100 [D loss: 0.766845, acc.: 50.00%] [G loss: 0.378735]
1200 [D loss: 0.773553, acc.: 50.00%] [G loss: 0.367536]
1300 [D loss: 0.779846, acc.: 50.00%] [G loss: 0.357813]
1400 [D loss: 0.785583, acc.: 50.00%] [G loss: 0.349296]
1500 [D loss: 0.790930, acc.: 50.00%] [G loss: 0.341772]
1600 [D loss: 0.795907, acc.: 50.00%] [G loss: 0.335132]
1700 [D loss: 0.800403, acc.: 50.00%] [G lo
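
A useful reference point when reading this log: with the `binary_crossentropy` loss, a discriminator that outputs 0.5 for every sample scores ln 2 ≈ 0.693, which is roughly where the D loss starts. The sketch below is a plain NumPy re-implementation written for illustration (not the Keras internals); it checks that value and converts a logged generator loss back into the discriminator output it corresponds to.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean binary cross-entropy; predictions are clipped to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

# A discriminator that always answers 0.5 gives a loss of ln(2)
chance_loss = binary_crossentropy(np.ones(64), np.full(64, 0.5))

# The generator is trained against labels of 1, so its loss is the mean of
# -log D(G(z)); exp(-loss) is then the geometric mean of D's outputs on fakes.
g_loss_at_1000 = 0.391569  # value copied from the log above
implied_score = float(np.exp(-g_loss_at_1000))

print(chance_loss, implied_score)
```

An implied score of roughly 0.68 suggests that, on that training batch, the discriminator was rating generated rows as more likely real than fake.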

Now let's take a sample of generated data to see how it looks.

In [32]:
fake_data, predicted = gan.Generate_data(1000)

In [33]:
fake_data[1]

array([0.5803122 , 0.6479198 , 0.5143897 , 0.27804846, 0.5138869 ,
       0.36106023, 0.4686081 , 0.71719   , 0.1788259 , 0.8167821 ,
       0.829268  , 0.453935  , 0.51976174, 0.5388331 , 0.5564493 ,
       0.5708552 , 0.37824744, 0.2602682 , 0.54370034, 0.5598483 ,
       0.69468004, 0.32086805, 0.57113135, 0.27646056, 0.7292193 ],
      dtype=float32)

In [34]:
predicted[1]

array([0.7684363], dtype=float32)

So the generated data looks absolutely nothing like our real data: every value sits strictly between 0 and 1 (the generator ends in a sigmoid layer), whereas the real Pclass is 1, 2 or 3 and the normalized age and fare can be negative. This is somewhat disappointing; however, if the categorical fields were rounded to the nearest integer, the data might still be valuable.

So let's clean the data up a bit, train a model on the GAN data and use our real training data as a cross validation set.

Of course, we will also have to extract our fake X and Y sets for training.

In [50]:
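def segment_fake_data(fake_data):
    #Column 2 is Survived: round it to 0/1 and use it as the label
    fake_Y = np.around(fake_data[:, 2])
    #np.delete returns a new array, so fake_data itself is not modified
    fake_X = np.delete(fake_data, [2], axis=1)
    
    #Round the categorical columns (Pclass, Sex, Company) ...
    fake_X[:, (0, 1, 2)] = np.around(fake_X[:, (0, 1, 2)])
    #... and the one-hot columns; columns 3 and 4 (Norm_age, Norm_fare) stay continuous
    fake_X[:, 5:] = np.around(fake_X[:, 5:])
    
    return fake_X, fake_Y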

So let's now segment our fake data and train a model on it.

In [51]:
fake_X, fake_Y = segment_fake_data(fake_data)

In [52]:
fake_X[1]

array([1.        , 1.        , 0.        , 0.5138869 , 0.36106023,
       0.        , 1.        , 0.        , 1.        , 1.        ,
       0.        , 1.        , 1.        , 1.        , 1.        ,
       0.        , 0.        , 1.        , 1.        , 1.        ,
       0.        , 1.        , 0.        , 1.        ], dtype=float32)

By the looks of it there are too many ones in each one-hot group (a valid one-hot vector contains exactly one), so the GAN likely did not learn that the columns within a one-hot group are mutually exclusive.
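
To make the one-hot problem measurable, here is a minimal sketch that counts how many rows break the exactly-one rule per group, and then repairs a raw generated row by keeping only the largest value in each group (an argmax projection, which is my own suggestion rather than something used in this notebook). The column positions in `ONE_HOT_GROUPS` are an assumption inferred from the 24 feature columns of `fake_X` (3 `Emb_`, 6 `ti_`, 10 `de_`), and both helper names are hypothetical.

```python
import numpy as np

# Assumed layout of fake_X: 5 leading columns, then the one-hot groups
ONE_HOT_GROUPS = {"Emb": slice(5, 8), "ti": slice(8, 14), "de": slice(14, 24)}

def one_hot_violations(rows, groups=ONE_HOT_GROUPS):
    # Number of rows per group whose entries do not sum to exactly 1
    rows = np.atleast_2d(rows)
    return {name: int(np.sum(rows[:, cols].sum(axis=1) != 1))
            for name, cols in groups.items()}

def project_to_one_hot(raw_rows, groups=ONE_HOT_GROUPS):
    # Keep only the largest raw value of each group and set it to 1
    out = np.array(np.atleast_2d(raw_rows), dtype=float)
    for cols in groups.values():
        block = out[:, cols]              # basic slice, so this is a view into out
        winners = block.argmax(axis=1)
        block[:] = 0.0
        block[np.arange(len(block)), winners] = 1.0
    return out

# fake_X[1] as printed above (already rounded)
rounded_row = np.array([1, 1, 0, 0.5138869, 0.36106023, 0, 1, 0, 1, 1, 0, 1,
                        1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1], dtype=float)
# fake_data[1] as printed earlier, with the Survived column (index 2) removed
raw_row = np.delete(np.array([
    0.5803122, 0.6479198, 0.5143897, 0.27804846, 0.5138869,
    0.36106023, 0.4686081, 0.71719, 0.1788259, 0.8167821,
    0.829268, 0.453935, 0.51976174, 0.5388331, 0.5564493,
    0.5708552, 0.37824744, 0.2602682, 0.54370034, 0.5598483,
    0.69468004, 0.32086805, 0.57113135, 0.27646056, 0.7292193]), 2)

print(one_hot_violations(rounded_row))
print(one_hot_violations(project_to_one_hot(raw_row)))
```

A projection like this only makes each row structurally valid; it says nothing about whether the chosen category is plausible given the rest of the row.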

Well, let's build a model and test it out.

In [53]:
layers = [17, 12, 9, 7, 5]

In [54]:
test_model = NN_model_v2((fake_X.shape[1], ), layers, regularizers.l2(0.01), None)
test_model.compile(optimizer = "Adam", loss = "binary_crossentropy", metrics = ["accuracy"])
test_model.fit(x = fake_X, y = fake_Y, epochs = 64, verbose = 1)

Epoch 1/64
...
Epoch 64/64


<keras.callbacks.History at 0x4659ab38>

In [55]:
train_pred = test_model.predict(x = X_Train)
cv_pred = test_model.predict(x = X_CV)
train_hat = normalize_predictions(train_pred)
cv_hat = normalize_predictions(cv_pred)

In [56]:
acc1, score1, conf1 = Calc_Accuracy(Y_Train, train_hat)

print("Accuracy = ", acc1)
print("F1 Score = ", score1)
print("")
print("Confusion Matrix")
conf1[["Labels", "Actual True", "Actual False"]]

Accuracy =  51.956815114709855
F1 Score =  0.34074074074074073

Confusion Matrix


| | Labels | Actual True | Actual False |
|---|---|---|---|
| 0 | Pred True | 92.0 | 293.0 |
| 1 | Pred False | 172.0 | 184.0 |


In [57]:
acc2, score2, conf2 = Calc_Accuracy(Y_CV, cv_hat)

print("Accuracy = ", acc2)
print("F1 Score = ", score2)
print("")
print("Confusion Matrix")
conf2[["Labels", "Actual True", "Actual False"]]

Accuracy =  51.33333333333333
F1 Score =  0.34234234234234234

Confusion Matrix


| | Labels | Actual True | Actual False |
|---|---|---|---|
| 0 | Pred True | 19.0 | 58.0 |
| 1 | Pred False | 26.0 | 47.0 |


As expected, our performance was pretty abysmal.

Unfortunately, this worked about as well as random guessing.

However, this experience was valuable in and of itself, as I got to play around with a GAN and understand the principles involved.

## Why did it Fail?

This section is a bit of speculation based on the theory and my intuitions about how GANs work.

We know GANs are very effective at generating image data (see https://arxiv.org/abs/1611.01331 for example), but this one failed spectacularly in this test. I can think of two primary reasons for this:
* Orthogonalization of features
* Convolutional Layers

### Orthogonalization of features

In this type of binary classification, having features that are completely orthogonal provides optimal performance. Image data, however, does not contain orthogonal features; in fact, pixels are generally assumed to have some relationship with nearby pixels, whether to define a line or a tone shift in the image.

Due to the normalizing effects of going deeper through a network, it would be very difficult for a network to learn this absolute orthogonalization, but much easier to learn relationships between nearby pixels, especially with convolutional layers.

### Convolutional Layers

This is a bit speculative, as I have yet to use a convolutional GAN on image data, but I suspect that the filters learned in convolutional layers, when applied to Gaussian noise, effectively imprint features onto the image and build up a clearer picture with more complex features as the image propagates deeper through the network. The Gaussian noise would effectively act as a random selection of which visual features are applied and where. When the generator is trained against another convolutional network, the discriminator would find it harder and harder to distinguish which features come from genuine images and which are imprinted onto Gaussian noise, as they would appear genuine to a human observer.

Hence our data, a series of one-hot vectors and a couple of normalized numeric fields, would not be suited to generation in a GAN, due to the lack of similarity between neighbouring columns and the lack of convolutional layers to build up a more detailed picture the deeper through the network you go.

*Note: all of the above is pure speculation and may be entirely inaccurate.*