# Spam Classifier
## Assignment Preamble
Please ensure you carefully read all of the details and instructions on the assignment page, this section, and the rest of the notebook. If anything is unclear at any time please post on the forum or ask a tutor well in advance of the assignment deadline.

In addition to all of the instructions in the body of the assignment below, you must also follow the following technical instructions for all assignments in this unit. *Failure to do so may result in a grade of zero.*
* [At the bottom of the page](#Submission-Test) is some code which checks you meet the submission requirements. You **must** ensure that this runs correctly before submission.
* Do not modify or delete any of the cells that are marked as test cells, even if they appear to be empty.
* Do not duplicate any cells in the notebook – this can break the marking script. Instead, insert a new cell (e.g. from the menu) and copy across any contents as necessary.

Remember to save and backup your work regularly, and double-check you are submitting the correct version.

This notebook is the primary reference for your submission. You may write code in separate `.py` files but it must be clearly imported into the notebook so that it runs without needing to reference those files, and you must explain clearly what functionality is contained in those files (through comments, markdown cells, etc).

As always, **the work you submit for this assignment must be entirely your own.** Do not copy or work with other students. Do not copy answers that you find online. These assignments are designed to help improve your understanding first and foremost – the process of doing the assignment is part of *learning*. They are also used to assess your ability, and so you must uphold academic integrity. Submitting plagiarised work risks your entire place on your degree.

**The pass mark for this assignment is 40%.** We expect that students, on average, will be able to produce a submission which gets a mark between 50-70% within the normal workload allocation for the unit, but this will vary depending on individual backgrounds. Please ask for help if you are struggling.

## Getting Started
Spam refers to unwanted email, often in the form of advertisements. In the literature, an email that is **not** spam is called *ham*. Most email providers offer automatic spam filtering, where spam emails will be moved to a separate inbox based on their contents. Of course this requires being able to scan an email and determine whether it is spam or ham, a classification problem. This is the subject of this assignment.

This assignment has two parts. Each part is worth 50% of the overall grade for this assignment.

For part one you will write a supervised learning based classifier to determine whether a given email is spam or ham. You must write and submit the code in this notebook. The training data is provided for you. You may use any classification method. Marks will be awarded primarily based on the accuracy of your classifier on unseen test data, but there are also marks for estimating how accurate you think your classifier will be.

In part two you will produce a short video explaining your implementation, any decisions or extensions you made, and what parameter values you used. This part is explained in more detail on the assignment page. The video file must be submitted with your assignment.

### Choice of Algorithm
While the classification method is a completely free choice, the assignment folder includes [a separate notebook file](data/naivebayes.ipynb) which can help you implement a Naïve Bayes solution. If you do use this notebook, you are still responsible for porting your code into *this* notebook for submission. A good implementation should give a high  enough accuracy to get a good grade on this section (50-70%).

You could also consider a k-nearest neighbour algorithm, but this may be less accurate. Logistic regression is another option that you may wish to consider.

If you are looking to go beyond the scope of the unit, you might be interested in building something more advanced, like an artificial neural network. This is possible just using `numpy`, but will require significant self-directed learning. *Extensions like this are left unguided and are not factored into the unit workload estimates.*

**Note:** you may use helper functions in libraries like `numpy` or `scipy`, but you **must not** import code which builds entire models for you. This includes but is not limited to use of libraries like `scikit-learn`, `tensorflow`, or `pytorch` – there will be plenty of opportunities for these libraries in later units. The point of this assignment is to understand code the actual algorithm yourself. ***If you are in any doubt about any particular library or function please ask a tutor.*** Submissions which ignore this will receive penalties or even zero marks.

If you choose to implement more than one algorithm, please feel free to include your code and talk about it in part two (your video presentation), but only the code in this notebook will be used in the automated testing.

## Training Data
The training data is described below and has 1000 rows. There is also a 500 row set of test data. These are functionally identical to the training data, they are just in a separate csv file to encourage you to split out your training and test data. You should consider how to best make use of all available data without overfitting, and to help produce an unbiased estimate for your classifier's accuracy.

The cell below loads the training data into a variable called `training_spam`.

In [1]:
import numpy as np

training_spam = np.loadtxt(open("data/training_spam.csv"), delimiter=",").astype(np.int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

Shape of the spam training data set: (1000, 55)
[[1 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 1 1 0]
 [1 0 0 ... 1 1 1]]


Your training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and describes whether a message is spam `1` or ham `0`. The remaining 54 columns are _features_ that you will use to build a classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword appears in the message and `0` otherwise.

As mentioned there is also a 500 row set of *test data*. It contains the same 55 columns.

In [2]:
testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(np.int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

Shape of the spam testing data set: (500, 55)
[[1 0 0 ... 1 1 1]
 [1 1 0 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]


## Part One
Write all of the code for your classifier below this cell. There is some very rough skeleton code in the cell directly below. You may insert more cells below this if you wish, but you must not duplicate any cells as this can break the grading script.

### Submission Requirements
Your code must provide a variable with the name `classifier`. This object must have a method called `predict` which takes input data and returns class predictions. The input will be a single $n \times 54$ numpy array, your classifier should return a numpy array of length $n$ with classifications. There is a demo in the cell below, and a test you can run before submitting to check your code is working correctly.

Your code must run on our test machine in under 30 seconds. If you wish to train a more complicated model (e.g. neural network) which will take longer, you are welcome to save the model's weights as a file and then load these in the cell below so we can test it. You must include the code which computes the original weights, but this must not run when we run the notebook – comment out the code which actually executes the routine and make sure it is clear what we need to change to get it to run. Remember that we will be testing your final classifier on additional hidden data.

In [3]:
import matplotlib.pyplot as plt
import pandas as pd
import pickle
from scipy.special import expit

class SpamClassifier:
    def __init__(self, data=None, test_data=None, layers=None, save_params=False, learning_rate=0.0001, epochs=10000, dropout=0.2):
        
        # If data is passed in the initialisation of the class, we are using 'train' mode
        if isinstance(data, np.ndarray):
            self.preprocessed_date = self.preprocess_data(data)
            self.data_X, self.data_Y = self.split_labels(self.preprocessed_date)
            self.params_dict={}
            self.layers = layers

        # Else pre-load network weights as we are using 'predict' mode only
        else:
            self.params_dict, self.layers = self.load_network()

        self.num_layers = len(self.layers)-1            
        self.test_data = test_data
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.dropout = dropout
        self.loss = []
        self.val_acc = []
        self.val_loss = []
        self.save_params=save_params
    
    def preprocess_data(self,data):
        """
        Initiates two preprocessing functions:
         1. Removal of duplicate rows
         2. Removal of rows with duplicate features but where labels are different i.e. two identical emails
         but one is labelled 'spam' and the other 'not spam'.
        """
        # Convert to pandas datframe
        df = pd.DataFrame(data)
        
        # Drop any identical rows
        df.drop_duplicates(inplace=True)
        
        # Create new column with a concatenated string of all the features for each row
        df['featuresconcat'] = df[df.columns[1:]].apply(
            lambda x: ''.join(x.dropna().astype(str)),axis=1)
        
        # Group by this concatenated string to get groups of identical 'emails'
        groups = df.groupby('featuresconcat')

        new_df = None
        
        # Iterate over each group
        for key,item in groups:
            
            group = groups.get_group(key)

            # If only one row in the group, retain it
            if len(group) == 1:
                if not isinstance(new_df, pd.DataFrame):
                    new_df = pd.DataFrame(group.reset_index(drop=True))
                else:
                    new_df = pd.concat([new_df,group])
            
            # If more than one row, only retain if all labels are identical
            else:
                if len(group[0].value_counts()) == 1:
                    new_df = pd.concat([new_df,group])

        new_df = new_df.drop(columns=['featuresconcat'])

        # Convert back to numpy array
        data = new_df.to_numpy()
        
        return data
        
    def split_labels(self, data):
        """
        Splits the data into two parts, the labels and the input features
        
        param data: the entire dataset to be split
        return X, Y: the input features & labels, respectively
        """
        # Get features
        X = data[:,1:]

        # Get labels
        Y = data[:,0]
        Y = np.asarray([[x] for x in Y])
        
        return X,Y

    
    def initialise_weights(self):
        """
        Randomly and iteratively initialise the weights and biases of the layers in the self.params_dict dictionary
        """
        np.random.seed(1)
    
        # Randomly initialise layer weights and biases
        for i in range(self.num_layers):
           # Weights for each layer are represented by a 2D array (current_layer_nodes, next_layer_nodes)
            self.params_dict[f"W{i}"] = np.random.randn(self.layers[i], self.layers[i+1])
            self.params_dict[f"b{i}"] = np.random.randn(self.layers[i+1],)

            
    def relu(self,Z):
        '''
        Performs a threshold calculation to each input, setting any less than zero, to zero.
        
        param Z: the dot product of the weight and output of the previous layer
        
        return : the converted matrix with a minimum value of 0
        '''
        return np.maximum(0,self.clip(Z))


    def dRelu(self, y):
        """
        The derivative of the relu function used to propagate information back through the activation
        function    
        """
        # https://stackoverflow.com/questions/46411180/implement-relu-derivative-in-python-numpy
        return (self.clip(y) > 0) * 1.

    
    def sigmoid(self, Z):
        """
        The final activation function in the network. Converts the final layer neurons output to a floating
        number between 0 and 1 (inclusive)
        """
        # Expit is the scipy implementation of sigmoid. 
        # Equivalent to 1/(1+np.exp(-Z))
        # Source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html
        return expit(Z)
    
    
    def dSigmoid(self, y_estimate):
        """
        The derivative of the sigmoid function, used to backpropagate the loss through the network
        """
        # Source: https://stackoverflow.com/questions/10626134/derivative-of-sigmoid
        return self.clip(y_estimate) * (1-self.clip(y_estimate))
        
        
    def clip(self, y):
        """ 
        Clip values to avoid FloatingPointError and DivideByZero errors
        """
        return np.maximum(y, 0.0000000001)
    
    
    def weighted_binary_crossentropy_loss(self, y, y_estimate):
        """
        Calculates the weighted binary cross entropy loss between the true labels and the predicted labels
        with rebalancing.
        """
        y_estimate_inv = 1.0 - y_estimate
        y_estimate_inv = self.clip(y_estimate_inv)
        
        y_inv = 1.0 - y
        y_inv = self.clip(y_inv)

        # Get weights (proportion of true/false labels)
        true_weights = sum(y)/len(y)
        false_weights = 1-true_weights
        
        # Calculate the weighted binary cross entropy loss between the predictions and true labels
        # Source: https://towardsdatascience.com/nothing-but-numpy-understanding-creating-binary-classification-neural-networks-with-e746423c8d5c
        loss = (1/len(y)) * (np.sum(true_weights*np.multiply(np.log(self.clip(y_estimate)),-y)\
                                    - false_weights*np.multiply(y_inv, np.log(y_estimate_inv))))

        return loss
    
        
    def forward_propagation(self, mode='train'):
        """
        Conducts forward propagation of the inputs through the network to obtain a measure of loss.
        This function is also used by the predict method, but in 'predict' mode the dropout and parameter
        updates are 'turned off'.
        
        """

        if mode =='train':
            X_data = self.data_X
            Y_data = self.data_Y
        elif mode =='predict':
            X_data = self.data_X
        elif mode=='val':
            X_data = self.val_X
            Y_data = self.val_Y
            
        Z=None
        A=None
        
        # For each layer
        for layer_num in range(self.num_layers):
            # If first layer
            if layer_num==0:
                # Z = WX + B. (i.e the linear equation) i.e. we compute the weighted sum of the inputs and the weights
                Z = X_data.dot(self.params_dict[f"W{layer_num}"]) + self.params_dict[f"b{layer_num}"]
                # Relu activation 
                A = self.relu(Z)
                
                # If not train mode, turn off dropout and param dict update
                if mode == 'train':
                    # Dropout randomly applies a mask with probability self.dropout to turn off nodes
                    # in order to force data through other routes through the network
                    D = np.random.binomial(1, self.dropout, size=A.shape)
                    A*=D
                    
                    # update the linear equation (Z=WX+b) and activation parameters
                    self.params_dict[f"Z{layer_num}"] = Z
                    self.params_dict[f"A{layer_num}"] = A
            
            # If last layer
            elif layer_num == self.num_layers-1:
                # Compute weighted sum of previous layer outputs and weights
                Z=A.dot(self.params_dict[f"W{layer_num}"]) + self.params_dict[f"b{layer_num}"]

                # Use the sigmoid function to get a prediction (between 0 and 1)
                y_estimate = self.sigmoid(Z)
                
                if mode == 'train':
                    
                    # Update final Z
                    self.params_dict[f"Z{layer_num}"] = Z  
            
            # If in a middle layer 
            else:
                # Compute weighted sum of previous layer outputs and weights
                Z=A.dot(self.params_dict[f"W{layer_num}"]) + self.params_dict[f"b{layer_num}"]

                # Relu activation
                A = self.relu(Z)
                
                if mode == 'train':
                    # Apply dropout masks
                    D = np.random.binomial(1, self.dropout, size=A.shape)
                    A*=D

                    # Update parameters dictionary
                    self.params_dict[f"Z{layer_num}"] = Z
                    self.params_dict[f"A{layer_num}"] = A

        if mode in ('train','val'):
            # Run predictions and true labels through the weighted binary crossentropy loss function
            loss = self.weighted_binary_crossentropy_loss(Y_data, y_estimate)
            return y_estimate, loss
        else:
            return y_estimate
        
                
    def back_propagation(self, y_est, loss):
        """ 
        Performs the back propagation element of the gradient descent algorithm by propagating the loss 
        backwards through the network using the chain rule of differentiation to calculate the influence that 
        each variable has on the loss.
        
        Updates the weights and biases at each layer by a fraction of that total influence
        
        """

        # Derivative of the loss wrt y_estimate
        # Source https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60
        dL_wrt_y_est = np.divide(1-self.data_Y, self.clip(1-y_est)) - np.divide(self.data_Y,self.clip(y_est)) 

        # Derivate of y_estimate wrt sigmoid
        d_y_est_dsig = self.dSigmoid(y_est)

        # For each layer moving backwards through the network
        for hlayer_num in reversed(range(self.num_layers)):

            # If the final layer
            if hlayer_num == self.num_layers-1:
                # Deriv. of loss wrt last Z
                dL_dz = dL_wrt_y_est * d_y_est_dsig
                
                # Deriv. loss wrt final weights 
                dL_dW = np.dot(self.params_dict[f'A{hlayer_num-1}'].T, dL_dz)
                
                # Deriv. loss wrt final bias
                dL_db = np.sum(dL_dz, axis=0, keepdims=True)
            
            # If at the first layer
            elif hlayer_num == 0:
                # Deriv loss wrt first activation
                dL_dA = np.dot(dL_dz, self.params_dict[f'W{hlayer_num+1}'].T)
                
                # Deriv loss wrt first Z
                dL_dz = dL_dA * self.dRelu(self.params_dict[f'Z{hlayer_num}']) # note similarity to dL_dz_last
                
                # Deriv loss wrt first weights & bias
                dL_dW = np.dot(self.data_X.T, dL_dz)
                dL_db = np.sum(dL_dz, axis=0, keepdims=True)
                
            # If in a middle layer
            else:
                # Deriv. loss wrt activation
                dL_dA = np.dot(dL_dz, self.params_dict[f'W{hlayer_num+1}'].T)
                
                # Deriv. of loss wrt Z 
                dL_dz = dL_dA * self.dRelu(self.params_dict[f'Z{hlayer_num}']) # note similarity to dL_dz_last
                
                # Deriv. loss wrt weights & wrt bias
                dL_dW = np.dot(self.params_dict[f'A{hlayer_num-1}'].T, dL_dz)
                dL_db = np.sum(dL_dz, axis=0, keepdims=True)
            
            # Perform weight update
            self.params_dict[f"W{hlayer_num}"] =self.params_dict[f"W{hlayer_num}"] - (self.learning_rate * dL_dW)
            self.params_dict[f"b{hlayer_num}"] =self.params_dict[f"b{hlayer_num}"] - (self.learning_rate * dL_db)

    def earlystop(self):
        """
        Stops the training process if loss begins to increase. 
        Checks begin after 200 epochs, and will end if the most recent 100 epochs have an average loss
        that is higher than the average loss of the preceding 100 epochs
        """
        # If epochs passed is less than 200, return False to the while loop to keep training
        if len(self.loss) < 200:
            return False

        # Average loss of last 100 epochs
        losscheck = np.mean(self.loss[-100:])
        
        # Average loss of last 200th - 100th epoch
        losscheck_prev = np.mean(self.loss[-200:-100])
        
        # Return true / stop training if loss has increased
        if losscheck > losscheck_prev:          
            print(f"Early stop at epoch {len(self.loss)}")
            return True
        else:
            return False
        
        
    def train(self):
        """
        The core train function for the class which:
        1. Intialises weights
        2. Performs gradient descent algorithm
        3. Calculates validation loss/accuracy (not used during K-Fold CV)
        4. Saves weights/layer structure if required
        """
        # Initialise weights if they don't exist
        if len(self.params_dict.keys())==0:
            self.initialise_weights()
        
        # Perform multiple iterations of the gradient descent algorithm
        for i in range(self.epochs):

            # Loop until earlystopping is triggered
            while not self.earlystop():
                
                # Forward propagation
                y_estimate, loss = self.forward_propagation()

                # Back propagation
                self.back_propagation(y_estimate, loss)
                self.loss.append(loss)

                # Calculate accuracy & loss at each epoch for printing to charts
                if isinstance(self.test_data, np.ndarray):
                    val_preds, val_loss = self.predict(self.test_data, mode='val')
                    self.val_acc.append(self.accuracy(self.val_Y, val_preds))
                    self.val_loss.append(val_loss)
            break
        
        # Save network weights & layer structure
        if self.save_params:
            self.save_network()

                
    def accuracy(self, y, yhat):
        """
        Calculate accuracy of predictions vs true labels
        """
        return np.count_nonzero(yhat == y)/y.shape[0]
        
    
    def predict(self, data, mode='predict'):
        """
        Perform forward propagation once using test/unseen data to obtain predictions
        """
        
        if mode=='val':
            self.val_X, self.val_Y = self.split_labels(data)

            # Get predictions by running one forward pass
            predictions, loss = self.forward_propagation(mode='val')
            
            # Round 0.5 up to 1
            predictions = np.floor(predictions + 0.5)

            return predictions, loss
        
        else:                
            self.data_X = data
            predictions = self.forward_propagation(mode='predict')
            predictions = np.floor(predictions + 0.5)

            # Reshape predictions to fit with test cell
            predictions = predictions.reshape(-1)
            return predictions
    
    def save_network(self):
        """
        Saves the network parameters & layer structure to a pickle file
        """
        with open('model_weights.pkl', 'wb') as f:
            pickle.dump(self.params_dict, f)
            
        with open('layers.pkl', 'wb') as f:
            pickle.dump(self.layers, f)
            
    def load_network(self):
        """
        Load pre-trained parameters & layer structure from pickle file
        """
        with open('model_weights.pkl', 'rb') as f:
            params_dict = pickle.load(f)
            
        with open('layers.pkl', 'rb') as f:
            layers = pickle.load(f)
            
        return params_dict, layers
    
    def plot_loss(self):
        '''
        Plots the loss curve
        '''
        plt.plot(self.loss, label='train_loss')
        plt.plot(self.val_loss, label='val_loss')
        plt.xlabel("Iteration")
        plt.ylabel("Loss")
        plt.title("Loss curve for training & validation data")
        plt.show() 
        
        
    def plot_acc(self):
        '''
        Plots the acc curve
        '''
        plt.plot(self.val_acc)
        plt.xlabel("Iteration")
        plt.ylabel("Accuracy")
        plt.title("Accuracy curve for validation data")
        plt.show() 

    

In [4]:
def kfold_cross_validation():
    """
    Performs kfold cross validation to iterate over multiple folds of the dataset and train multiple models
    in order to obtain a more comprehensive impression of how the model performs on unseen data
    """
    # Get indices of dataset
    fold_indices = np.arange(dataset.shape[0])

    #Split the indices into k-parts (k lists of ndarrays)
    test_indices = np.array_split(fold_indices, 10)

    test_accuracies = []

    for e in test_indices:
        #Define the evaluation set for the current fold
        test_set = dataset[e]
        
        # Mask the evaluation data in the full dataset
        mask_test = np.ones(dataset.shape[0], bool)
        
        #Set indices of the eval set to false
        test_eval[e] = False
        
        #Subset by the bool array:
        train_set = dataset[mask_test]
        
        # Set layer structure
        layers=[54,200,1]
        
        # Initialise and train model
        classifier = SpamClassifier(layers=layers,save_params=False, data=train_set,test_data=test_set, learning_rate=0.0001, epochs=10000, dropout=0.2)
        classifier.train()
         
        # Run predictions using test_set
        test_pred = classifier.predict(test_set[:,1:], mode='predict')
        test_Y = test_set[:,0]
        
        test_accuracies.append(np.count_nonzero(test_pred == test_Y)/test_Y.shape[0])
        accuracy = np.count_nonzero(test_pred == test_Y)/test_Y.shape[0]
        print(f"Accuracy on test data is: {accuracy}")
    print(f"Average accuracy is {np.mean(test_accuracies)} ")
    
# kfold_cross_validation()

In [5]:
# Output from k-fold cross validation run:

# """
# Early stop at epoch 1895
# Accuracy on test data is: 0.9133333333333333
# Early stop at epoch 1708
# Accuracy on test data is: 0.92
# Early stop at epoch 2131
# Accuracy on test data is: 0.9333333333333333
# Early stop at epoch 1921
# Accuracy on test data is: 0.94
# Early stop at epoch 1922
# Accuracy on test data is: 0.9466666666666667
# Early stop at epoch 2057
# Accuracy on test data is: 0.9533333333333334
# Early stop at epoch 2233
# Accuracy on test data is: 0.94
# Early stop at epoch 2177
# Accuracy on test data is: 0.9133333333333333
# Early stop at epoch 2342
# Accuracy on test data is: 0.9
# Early stop at epoch 1837
# Accuracy on test data is: 0.9666666666666667


# Average accuracy is 0.9326666666666666 
# """

In [6]:
def train_model_save_weights(dataset):
    """
    Used to create the final model weights and layer structure pickle files on the full dataset
    """
    layers=[54,200,1]
    classifier = SpamClassifier(layers=layers,save_params=True, data=dataset,test_data=None, learning_rate=0.0001, epochs=10000, dropout=0.2)
    classifier.train()

# dataset = np.concatenate((training_spam, testing_spam))
# train_model_save_weights(dataset)

In [7]:
def create_classifier():
    """
    Loads the pre-trained weights and model structure into the classifier for making predictions
    
    Note : please make sure the model_weights.pkl & layers.pkl files are in the same directory as this notebook
    """
    classifier = SpamClassifier()
    return classifier

In [8]:
classifier = create_classifier()

### Accuracy Estimate
In the cell below there is a function called `my_accuracy_estimate()` which returns `0.5`. Before you submit the assignment, write your best guess for the accuracy of your classifier into this function, as a percentage between `0` and `1`. So if you think you will get 80% of inputs correct, return the value `0.8`. This will form a small part of the marking criteria for the assignment, to encourage you to test your own code.

In [9]:
def my_accuracy_estimate():
    return 0.93

Write all of the code for your classifier above this cell.

### Testing Details
Your classifier will be tested against some hidden data from the same source as the original. The accuracy (percentage of classifications correct) will be calculated, then benchmarked against common methods. At the very high end of the grading scale, your accuracy will also be compared to the best submissions from other students (in your own cohort and others!). Your estimate from the cell above will also factor in, and you will be rewarded for being close to your actual accuracy (overestimates and underestimates will be treated the same).

#### Test Cell
The following code will run your classifier against the provided test data. To enable it, set the constant `SKIP_TESTS` to `False`.

The original skeleton code above classifies every row as ham, but once you have written your own classifier you can run this cell again to test it. So long as your code sets up a variable called `classifier` with a method called `predict`, the test code will be able to run. 

Of course you may wish to test your classifier in additional ways, but you *must* ensure this version still runs before submitting.

**IMPORTANT**: you must set `SKIP_TESTS` back to `True` before submitting this file!

In [13]:
SKIP_TESTS = False

if not SKIP_TESTS:
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(np.int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]

    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
    print(f"Accuracy on test data is: {accuracy}")

Accuracy on test data is: 0.936


In [11]:
import sys
import pathlib

fail = False;

if not SKIP_TESTS:
    fail = True;
    print("You must set the SKIP_TESTS constant to True in the cell above.")
    
p3 = pathlib.Path('./spamclassifier.ipynb')
if not p3.is_file():
    fail = True
    print("This notebook file must be named spamclassifier.ipynb")
    
if "create_classifier" not in dir():
    fail = True;
    print("You must include a function called create_classifier.")

if "my_accuracy_estimate" not in dir():
    fail = True;
    print("You must include a function called my_accuracy_estimate.")
else:
    if my_accuracy_estimate() == 0.5:
        print("Warning:")
        print("You do not seem to have provided an accuracy estimate, it is set to 0.5.")
        print("This is the actually the worst possible accuracy – if your classifier")
        print("got 0.1 then it could invert its results to get 0.9!")
    
print("INFO: Make sure you follow the instructions on the assignment page to submit your video.")
print("Failing to include this could result in an overall grade of zero for both parts.")
print()

if fail:
    sys.stderr.write("Your submission is not ready! Please read and follow the instructions above.")
else:
    print("All checks passed. When you are ready to submit, upload the notebook and readme file to the")
    print("assignment page, without changing any filenames.")
    print()
    print("If you need to submit multiple files, you can archive them in a .zip file. (No other format.)")

INFO: Make sure you follow the instructions on the assignment page to submit your video.
Failing to include this could result in an overall grade of zero for both parts.

All checks passed. When you are ready to submit, upload the notebook and readme file to the
assignment page, without changing any filenames.

If you need to submit multiple files, you can archive them in a .zip file. (No other format.)


In [12]:
# This is a test cell. Please do not modify or delete.