# Zero or One? (100 marks)

All you will be given about this problem is a training data set. Your objective is to develop a classifier that achieves the highest accuracy on unseen examples.

The following cell loads the training data set.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

training_data = np.loadtxt(open("data/training_data.csv"), delimiter=",")
print("Shape of the training data set:", training_data.shape)
print(training_data)


import sklearn  # Modules for evaluation metrics and a logistic regression implementation
from sklearn import metrics
from sklearn.linear_model import LogisticRegression


# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

'''Sources

    https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python
        Accessed 19:00, 02/05/2019
        

'''

Shape of the training data set: (5000, 39)
[[0. 1. 1. ... 0. 0. 0.]
 [1. 0. 1. ... 0. 1. 0.]
 [1. 1. 1. ... 1. 0. 0.]
 ...
 [0. 0. 1. ... 0. 1. 0.]
 [1. 0. 1. ... 0. 1. 0.]
 [1. 1. 0. ... 0. 0. 0.]]



The first column is again the response variable. The remaining 38 columns are binary features. You have multiple tasks:

(1) Your first task is to write a function called `train()` that takes `training_data` as input and returns all the fitted parameters of your model. Note that the fitted parameters of your model depend on the model you choose. For example, if you use a naïve Bayes classifier, you could return a list of class priors and conditional likelihoods. (This function will allow us to compute your model on the fly. We should be able to execute it in less than 10 minutes.) 

(2) Your second task is to provide a variable called `fitted_model` which stores the model parameters you found by executing your train() function on the training_data. If your train function takes more than 20 seconds to run, this variable should load precomputed parameter values (possibly from a file) rather than execute the train() function. 
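
If training were too slow to rerun, the parameters could be stored once and reloaded. The following is a minimal sketch, assuming the fitted model is picklable (scikit-learn estimators generally are). The helper names `save_model` and `load_model` and the file name are hypothetical, and a plain dictionary stands in for the real fitted model:

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Write any picklable model object to disk (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Read back a model previously written by save_model (hypothetical helper)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for real fitted parameters (illustration only, not fitted values).
example_model = {"intercept": -1.0, "coefficients": [2.0, -0.5, 1.0]}

# Hypothetical file name in the system's temporary directory.
model_path = os.path.join(tempfile.gettempdir(), "fitted_model.pkl")
save_model(example_model, model_path)
restored_model = load_model(model_path)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.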

In [2]:
def train(training_data):
    """
    Train a model on the training_data

    :param training_data: a two-dimensional numpy-array with shape = [5000, 39] 
    
    :return fitted_model: any data structure that captures your model
    """
    response = training_data[:, 0]   # First column: class label (0 or 1)
    features = training_data[:, 1:]  # Remaining 38 columns: binary features

    # The fitted LogisticRegression object holds the learned coefficients,
    # so it is the fitted model that gets returned.
    fitted_model = LogisticRegression()
    fitted_model.fit(features, response)

    return fitted_model

## Uncomment one of the following two lines depending on whether you want us to compute your model on the 
## fly or load a supplementary file.

fitted_model = train(training_data)
# fitted_model = load(local_file)

(3) Your third task is to provide a function called `test()` that uses your `fitted_model` to classify the observations of `testing_data`. The `testing_data` is hidden and may contain any number of observations (rows). It contains 38 columns that have the same structure as the features of `training_data`. 

In [3]:
def test(testing_data, fitted_model):
    """
    Classify the rows of testing_data using a fitted_model. 

    :param testing_data: a two-dimensional numpy-array with shape = [n_test_samples, 38]
    :param fitted_model: the output of your train function.

    :return class_predictions: a numpy array containing the class predictions for each row
        of testing_data.
    """
    # fitted_model is the LogisticRegression object returned by train();
    # predict() returns a numpy array with one 0/1 label per row.
    class_predictions = fitted_model.predict(testing_data)

    return class_predictions

In [4]:
#response = training_data[:, 0]
#features = training_data[:, 1:]
#
#rows = 250
#class_predictions = test(features[:rows], fitted_model)
#
## sklearn metric functions take (y_true, y_pred) in that order
#print("Accuracy:", metrics.accuracy_score(response[:rows], class_predictions))
#print("Precision:", metrics.precision_score(response[:rows], class_predictions))
#print("Recall:", metrics.recall_score(response[:rows], class_predictions))
#
#responsePrediction_curve = fitted_model.predict_proba(features)[:, 1]  # P(response = 1), used for the ROC curve
#fpr, tpr, _ = metrics.roc_curve(response, responsePrediction_curve)
#auc = metrics.roc_auc_score(response, responsePrediction_curve)
#plt.plot(fpr, tpr, label="data 1, auc=" + str(auc))
#plt.legend(loc=4)
#plt.show()

In [5]:
# This is a test cell. Do not delete or change. 
# You can use this cell to check whether the returned objects of your function are of the right data type.

# Test data types if input are the first 20 rows of the training_data.
class_predictions = test(training_data[:20, 1:], fitted_model)

# Check data type(s)
assert(isinstance(class_predictions, np.ndarray))

# Check shape of numpy array
assert(class_predictions.shape == (20,))

# Check data type of array elements
assert(np.all(np.logical_or(class_predictions == 0, class_predictions == 1)))


Describe in less than 10 sentences: Explain your classifier. Comment on its performance. What other alternative classifiers did you consider or experiment with? How does the performance of your classifier change as the size of the training set increases? You may want to include figures. 
---------
   I used a logistic regression classifier, implemented through the **LogisticRegression** class from the **sklearn** library. The model passes a weighted sum of the features $x_i$, with coefficients $\beta_i$, through the logistic function to estimate the probability that an entry has response 1 rather than 0 (the $\beta_i$ are fitted by the train function).
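
To make the prediction step concrete, here is a small sketch of how a probability follows from the coefficients. The coefficient values, the intercept name `beta_0` and the helper names are made up for illustration; they are not the values fitted in this notebook:

```python
import numpy as np


def logistic(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_probability(x, beta, beta_0):
    """Estimated P(response = 1 | x) for one row of binary features."""
    return logistic(beta_0 + np.dot(beta, x))


# Hypothetical intercept and coefficients for three features (illustration only).
beta_0 = -1.0
beta = np.array([2.0, -0.5, 1.0])
x = np.array([1, 0, 1])

p = predict_probability(x, beta, beta_0)  # logistic(-1 + 2 + 1) = logistic(2)
label = int(p >= 0.5)                     # classify as 1 when p is at least 0.5
```

Here the score is 2, so the estimated probability is about 0.88 and the row is classified as 1.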
   
   To assess the performance of the fitted model I looked at *accuracy*, *precision* and *recall*. I obtained these values from *training_data* by training on 4750 rows and testing on the other 250 rows. Accuracy is the fraction of all predictions that are correct. Precision is the fraction of positive predictions that are actually positive. Recall is the fraction of actual positives that the model identifies: in the spam/ham email example, it is the fraction of spam emails that are correctly flagged. In this test I got Accuracy=93.2%, Precision=96.1%, Recall=91.1%, which is very good.
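
The three metrics can be computed directly from counts of true positives, false positives and false negatives. This is a sketch on a tiny made-up example, not the coursework data; `classification_scores` is a hypothetical helper that does by hand what the `sklearn.metrics` functions compute:

```python
import numpy as np


def classification_scores(y_true, y_pred):
    """Return (accuracy, precision, recall), treating label 1 as positive.

    Assumes at least one predicted positive and one actual positive,
    otherwise the divisions below are undefined.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Made-up labels and predictions for eight rows.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall = classification_scores(y_true, y_pred)
```

With 3 true positives, 2 false positives and 1 false negative, this gives accuracy 0.625, precision 0.6 and recall 0.75.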
   
   I also looked at a ROC (Receiver Operating Characteristic) curve, which shows how well the model separates the two classes and the tradeoff between *sensitivity* and *specificity*. It plots **sensitivity**, the probability that a real positive is predicted positive, against **1-specificity**, the probability that a real negative is predicted positive, as the decision threshold varies. The AUC score for my test was 0.985; an AUC of 1 represents a perfect classifier and 0.5 represents a worthless classifier that does no better than random guessing.
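
The sketch below shows what one point on the ROC curve and the AUC mean, using four made-up scores rather than the coursework data. The helpers `roc_point` and `auc_by_pairs` are hypothetical; in practice `metrics.roc_curve` and `metrics.roc_auc_score` from sklearn compute these quantities:

```python
import numpy as np


def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    y_true = np.asarray(y_true)
    predicted_positive = np.asarray(scores) >= threshold
    tpr = np.mean(predicted_positive[y_true == 1])  # sensitivity
    fpr = np.mean(predicted_positive[y_true == 0])  # 1 - specificity
    return fpr, tpr


def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))


# Made-up labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr_strict, tpr_strict = roc_point(y_true, scores, threshold=0.5)
fpr_loose, tpr_loose = roc_point(y_true, scores, threshold=0.3)
auc = auc_by_pairs(y_true, scores)
```

Lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.5 to 1.0 but also raises the false positive rate from 0.0 to 0.5, which is the tradeoff the curve traces. Three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.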
   
   I considered Naive Bayes, but I did not get it working as well as I could have in coursework 3, so I wanted to see whether a different classifier would do better. I could also have used k-nearest neighbours, but there are 38 features and it may have been difficult to find the optimal k.
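
One way to compare these alternatives, and to choose k for k-nearest neighbours, is cross-validation. The sketch below is an assumption about how such a comparison could be set up, not something that was run for this coursework. It uses synthetic binary data, so its scores say nothing about the real data set, and the candidate values of k are arbitrary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the coursework data: 600 rows, 38 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 38))
weights = rng.normal(size=38)
logits = X @ weights - weights.sum() / 2
y = (rng.random(600) < 1 / (1 + np.exp(-logits))).astype(int)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "Bernoulli naive Bayes": BernoulliNB(),
}
for k in (1, 5, 15):  # arbitrary candidate values of k
    candidates[f"kNN (k={k})"] = KNeighborsClassifier(n_neighbors=k)

# Mean accuracy over 5 cross-validation folds for each candidate.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

Picking the k with the highest cross-validated accuracy avoids tuning it on the same rows used for the final evaluation.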
   
   Even with 5000 rows, it takes less than 1 second for the classifier to train and test.
   <img src="images/my_graphics.png">
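
The effect of training set size could be checked by fitting on growing subsets and scoring each fit on the same held-out rows. This is a sketch on synthetic data with arbitrary subset sizes, so its numbers are not results for the coursework data:

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the coursework data: 5000 rows, 38 binary features.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(5000, 38))
weights = rng.normal(size=38)
logits = X @ weights - weights.sum() / 2
y = (rng.random(5000) < 1 / (1 + np.exp(-logits))).astype(int)

# Fixed held-out rows, never used for fitting.
X_test, y_test = X[-1000:], y[-1000:]

results = []
for n_train in (50, 200, 1000, 4000):  # arbitrary training set sizes
    start = time.perf_counter()
    model = LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
    fit_seconds = time.perf_counter() - start
    accuracy = model.score(X_test, y_test)  # accuracy on the held-out rows
    results.append((n_train, accuracy, fit_seconds))
    print(f"n_train={n_train}: accuracy={accuracy:.3f}, fit time={fit_seconds:.3f}s")
```

Plotting accuracy against `n_train` gives a learning curve, and the recorded fit times show how training cost grows with the number of rows.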