# Zero or One? (100 marks)

All you will be given about this problem is a training data set. Your objective is to develop a classifier that will have the highest accuracy in unseen examples.

The following cell loads the training data set.

In [1]:
import numpy as np

training_data = np.loadtxt(open("data/training_data.csv"), delimiter=",")
print("Shape of the training data set:", training_data.shape)
print(training_data)

Shape of the training data set: (5000, 39)
[[0. 1. 1. ... 0. 0. 0.]
 [1. 0. 1. ... 0. 1. 0.]
 [1. 1. 1. ... 1. 0. 0.]
 ...
 [0. 0. 1. ... 0. 1. 0.]
 [1. 0. 1. ... 0. 1. 0.]
 [1. 1. 0. ... 0. 0. 0.]]


The first column is again the response variable. The remaining 38 columns are binary features. You have multiple tasks:

(1) Your first task is to write a function called `train()` that takes `training_data` as input and returns all the fitted parameters of your model. Note that the fitted parameters of your model depend on the model you choose. For example, if you use a naïve Bayes classifier, you could return a list of class priors and conditional likelihoods. (This function will allow us to compute your model on the fly. We should be able to execute it in less than 10 minutes.) 

(2) Your second task is to provide a variable called `fitted_model` which stores the model parameters you found by executing your train() function on the training_data. If your train function takes more than 20 seconds to run, this variable should load precomputed parameter values (possibly from a file) rather than execute the train() function. 
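As an illustration of what "fitted parameters" can look like, here is a minimal sketch of the naïve Bayes option mentioned in task (1), assuming Bernoulli features with Laplace smoothing. The names `train_nb`, `predict_nb`, `fitted_params` and the file `nb_params.npz` are hypothetical and not part of the assignment, and the data is synthetic. The round trip through `np.savez` and `np.load` shows one way the precomputed-parameter route in task (2) could work.

```python
import os
import tempfile

import numpy as np


def train_nb(data, alpha=1.0):
    """Fit Bernoulli naive Bayes; column 0 holds the label, the rest are binary features."""
    y = data[:, 0].astype(int)
    X = data[:, 1:]
    counts = np.array([(y == c).sum() for c in (0, 1)])
    # Smoothed class priors P(y = c).
    priors = (counts + alpha) / (len(y) + 2 * alpha)
    # Smoothed conditional likelihoods P(x_j = 1 | y = c), shape (2, n_features).
    likelihoods = np.array([
        (X[y == c].sum(axis=0) + alpha) / (counts[c] + 2 * alpha) for c in (0, 1)
    ])
    return {"priors": priors, "likelihoods": likelihoods}


def predict_nb(X, params):
    """Return the most probable class (0 or 1) for each row of X."""
    log_p = np.log(params["likelihoods"])
    log_q = np.log(1.0 - params["likelihoods"])
    # Log joint probability of each row under each class, shape (n_samples, 2).
    scores = np.log(params["priors"]) + X @ log_p.T + (1.0 - X) @ log_q.T
    return scores.argmax(axis=1)


# Synthetic stand-in data (an assumption): feature 0 copies the label, the rest is noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
noise = rng.integers(0, 2, size=(200, 3))
demo_data = np.column_stack([labels, labels, noise]).astype(float)
params = train_nb(demo_data)

# Store the parameters once, then reload them instead of retraining.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "nb_params.npz")
    np.savez(path, **params)
    with np.load(path) as stored:
        fitted_params = {name: stored[name] for name in stored.files}

predictions = predict_nb(demo_data[:, 1:], fitted_params)
```

Returning a plain dictionary of arrays keeps the model easy to serialise, which matters if training is too slow to repeat at marking time.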

In [2]:
############################################################# Setup #############################################################

import numpy as np
import matplotlib.pyplot as plt

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, LeakyReLU
from keras.losses import mean_squared_error

######################################################## Neural Network #########################################################

class NNclassifier:

    def __init__(self):
        """
        Class constructor; this initialises the model.
        """
        # Keras network model.
        self.model = self.buildModel()

    def buildModel(self):
        """
        Builds the model for the neural network.  The current network is the best architecture I have found so far.  I use
        successively smaller dense layers, then add a healthy amount of dropout to prevent overfitting.  Leaky ReLU is used to
        smooth training by keeping a small gradient for negative inputs.
        """
        # Initialise the model.
        model = Sequential()
        # NOTE: every Dense layer below already applies ReLU, so the LeakyReLU that follows it only
        # receives non-negative inputs and passes them through unchanged.
        # Input layer.
        model.add(Dense(50, activation = 'relu', input_shape=(38,)))
        model.add(LeakyReLU(alpha=0.3))
        model.add(Dropout(0.4))
        # Hidden layers.
        model.add(Dense(200, activation = 'relu'))
        model.add(LeakyReLU(alpha=0.3))
        model.add(Dropout(0.4))
        model.add(Dense(100, activation = 'relu'))
        model.add(LeakyReLU(alpha=0.3))
        model.add(Dropout(0.4))
        model.add(Dense(50, activation = 'relu'))
        model.add(LeakyReLU(alpha=0.3))
        model.add(Dropout(0.2))
        model.add(Dense(25, activation = 'relu'))
        model.add(LeakyReLU(alpha=0.3))
        model.add(Dropout(0.1))
        # Output layer.
        model.add(Dense(1, activation = 'relu'))
        model.add(LeakyReLU(alpha=0.3))
        # Compile the model.
        model.compile(loss=mean_squared_error, optimizer=keras.optimizers.Adam())
        return model

    def predict(self, testDataX, threshold = 0.5):
        """
        Makes predictions for a batch of data.  Takes an array of dimensions (batchSize, 38) and returns an integer
        array of dimensions (batchSize,) containing 0s and 1s.
        """
        # Ensure the incoming data is in fact a float array.
        testDataX = np.array(testDataX, dtype = float)
        # Raw network outputs, shape (batchSize, 1).
        predDataY = self.model.predict(testDataX)
        # Label each row 1 if its output exceeds the threshold, else 0.
        return (predDataY[:, 0] > threshold).astype(int)

    def accuracy(self, predDataY, testDataY):
        """
        Calculates the accuracy of the prediction.  Takes two arrays, both of dimensions (batchSize,), and returns a float.
        """
        # Cast both arrays to integers so that booleans and floats compare correctly.
        yPred = np.array(predDataY, dtype = int)
        yTrue = np.array(testDataY, dtype = int)
        # Proportion of rows where the prediction equals the true label.
        return float(np.mean(yPred == yTrue))

    def train(self, trainData, testData, epochs, batchSize, verbose = False):
        """
        Trains the network.  Takes two arrays, trainData and testData, both of dimensions (nRows, 39), with the label in
        column 0.  testData and verbose are only used by the learning-curve code, which is currently commented out.
        """
        # Training data.
        trainDataX = np.array(trainData[:,1:], dtype = float)
        trainDataY = np.array(trainData[:,0] , dtype = int)
        # Validation data.
        testDataX  = np.array(testData[:,1:], dtype = float)
        testDataY  = np.array(testData[:,0] , dtype = int  )
        # Learning curve (disabled).
        #learningCurve = []
        for i in range(epochs):
            # Fit one epoch at a time so that accuracy could be recorded after each epoch.
            self.model.fit(x=trainDataX, y=trainDataY, batch_size=batchSize, epochs=1, verbose = 0)
            #predDataY = self.predict(testDataX)
            #loss = self.accuracy(predDataY, testDataY)
            #learningCurve.append(loss)
            #if verbose and (i%50==0):
            #    print("Episode "+str(i)+" : Accuracy "+str(loss))
        #return learningCurve

Using TensorFlow backend.


In [3]:
def train(training_data):
    """
    Train a model on the training_data

    :param training_data: a two-dimensional numpy-array with shape = [5000, 39] 
    
    :return fitted_model: any data structure that captures your model
    """
    agent = NNclassifier()
    agent.train(training_data, training_data, 300, 1000, verbose = True)
    return agent

## Uncomment one of the following two lines depending on whether you want us to compute your model on the 
## fly or load a supplementary file.

fitted_model = train(training_data)

# fitted_model = load(local_file)

(3) Your third task is to provide a function called `test()` that uses your `fitted_model` to classify the observations of `testing_data`. The `testing_data` is hidden and may contain any number of observations (rows). It contains 38 columns that have the same structure as the features of `training_data`. 

In [4]:
def test(testing_data, fitted_model):
    """
    Classify the rows of testing_data using a fitted_model. 

    :param testing_data: a two-dimensional numpy-array with shape = [n_test_samples, 38]
    :param fitted_model: the output of your train function.

    :return class_predictions: a numpy array containing the class predictions for each row
        of testing_data.
    """
    agent = fitted_model
    predictions = agent.predict(testing_data)
    return predictions

In [5]:
# This is a test cell. Do not delete or change. 
# You can use this cell to check whether the returned objects of your function are of the right data type.

# Test data types if input are the first 20 rows of the training_data.
class_predictions = test(training_data[:20, 1:], fitted_model)

# Check data type(s)
assert(isinstance(class_predictions, np.ndarray))

# Check shape of numpy array
assert(class_predictions.shape == (20,))

# Check data type of array elements
assert(np.all(np.logical_or(class_predictions == 0, class_predictions == 1)))


Describe in less than 10 sentences: Explain your classifier. Comment on its performance. What other alternative classifiers did you consider or experiment with? How does the performance of your classifier change as the size of the training set increases? You may want to include figures. 

I have used a basic neural network built from dense layers, typically called a multilayer perceptron.  I used progressively smaller hidden layers after the first, which typically gives a stable network, and I included significant dropout, especially in the earlier layers, to prevent overfitting.  I used leaky ReLU activations: ReLU activations are computationally efficient and stable, tending to learn well and not overfit, and the leaky variant keeps a small gradient for negative inputs so that units do not get stuck outputting zero.  Finally, I use the Adam optimiser, which is efficient and adapts the learning rate for each parameter automatically.
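To make the activation comparison concrete, the following NumPy sketch evaluates ReLU and leaky ReLU on a few sample inputs. The helper names `relu` and `leaky_relu` are illustrative only, and the slope of 0.3 matches the `alpha` passed to `LeakyReLU` in `buildModel`. It also shows that leaky ReLU only has an effect on negative inputs, so it would need to act on a layer's linear output rather than on values that have already passed through ReLU.

```python
import numpy as np


def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.3):
    """Leaky ReLU: negative inputs are scaled by alpha instead of being zeroed."""
    return np.where(x > 0, x, alpha * x)


x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu(x)                  # negatives become 0
leaky_out = leaky_relu(x)           # negatives are multiplied by 0.3
stacked_out = leaky_relu(relu(x))   # leaky ReLU applied after ReLU

print("relu      :", relu_out)
print("leaky relu:", leaky_out)
print("stacked   :", stacked_out)
```

Because ReLU never outputs a negative value, the stacked result is identical to plain ReLU.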

The network performs well, with a peak average accuracy of 0.967 given a 3:1 training-to-testing split of the data.  Although I tried many different variants of network architecture, batch size and activation, this network achieved the highest accuracy.  Interestingly, a mean squared error loss proved more accurate than categorical cross-entropy, an unusual result that is likely due to the higher stability of MSE.
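The 3:1 evaluation described above can be sketched as a repeated random hold-out. This is a minimal illustration rather than the exact procedure behind the reported figure: `holdout_accuracy` is a hypothetical helper, and a majority-class baseline on synthetic data stands in for the network so that the example runs without Keras.

```python
import numpy as np


def holdout_accuracy(data, fit, predict, train_fraction=0.75, repeats=5, seed=0):
    """Average test accuracy over several random train/test splits.

    data has the label in column 0; fit(rows) returns a model and
    predict(X, model) returns one label per row of X.
    """
    rng = np.random.default_rng(seed)
    n_train = int(train_fraction * len(data))
    scores = []
    for _ in range(repeats):
        # Shuffle the rows, then cut them 3:1 into training and testing parts.
        order = rng.permutation(len(data))
        train_rows = data[order[:n_train]]
        test_rows = data[order[n_train:]]
        model = fit(train_rows)
        predictions = predict(test_rows[:, 1:], model)
        scores.append(np.mean(predictions == test_rows[:, 0]))
    return float(np.mean(scores)), float(np.std(scores))


def fit_majority(rows):
    """Stand-in model: remember the most common label."""
    return int(rows[:, 0].mean() >= 0.5)


def predict_majority(X, majority_label):
    """Stand-in prediction: always output the remembered label."""
    return np.full(len(X), majority_label)


# Synthetic data (an assumption): about 70% of labels are 1, features are random bits.
rng = np.random.default_rng(1)
labels = (rng.random(400) < 0.7).astype(float)
features = rng.integers(0, 2, size=(400, 38)).astype(float)
demo_data = np.column_stack([labels, features])

mean_acc, std_acc = holdout_accuracy(demo_data, fit_majority, predict_majority)
print("mean accuracy:", mean_acc, "std:", std_acc)
```

Reporting the spread across splits alongside the mean gives a sense of how much of any difference between two configurations is just split-to-split noise.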

Determining the effect of training on more data was difficult: increasing the training share of the training-testing split let the network learn better, but the ever smaller testing set made the measured accuracy much more susceptible to outliers, leading to noisier learning curves.
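One way to separate the effect of training-set size from test-set noise is to hold the test set fixed and grow only the training subset. The sketch below assumes that setup and is not what was run for this submission: `learning_curve` is a hypothetical helper, and a nearest-centroid classifier on synthetic data stands in for the network.

```python
import numpy as np


def learning_curve(data, fit, predict, sizes, test_size, seed=0):
    """Test accuracy for growing training subsets, measured on one fixed test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    test_rows = data[order[:test_size]]   # never used for training
    pool = data[order[test_size:]]        # training rows are drawn from here
    curve = []
    for n in sizes:
        model = fit(pool[:n])
        predictions = predict(test_rows[:, 1:], model)
        curve.append((n, float(np.mean(predictions == test_rows[:, 0]))))
    return curve


def fit_centroids(rows):
    """Stand-in model: the mean feature vector of each class."""
    y, X = rows[:, 0], rows[:, 1:]
    return np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])


def predict_centroids(X, centroids):
    """Assign each row to the class whose centroid is closest."""
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)


# Synthetic data (an assumption): each feature is 1 with probability 0.65 for
# class 1 and 0.35 for class 0.
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=1000)
probabilities = np.where(labels[:, None] == 1, 0.65, 0.35)
features = (rng.random((1000, 10)) < probabilities).astype(float)
demo_data = np.column_stack([labels.astype(float), features])

curve = learning_curve(demo_data, fit_centroids, predict_centroids,
                       sizes=[50, 200, 750], test_size=250)
for n, acc in curve:
    print(n, "training rows -> accuracy", acc)
```

Since every point on the curve is scored against the same 250 rows, differences between points reflect the amount of training data rather than a shrinking test set.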