# Spam Classifier
## Assignment Preamble
Please ensure you carefully read all of the details and instructions on the assignment page, this section, and the rest of the notebook. If anything is unclear at any time please post on the forum or ask a tutor well in advance of the assignment deadline.

In addition to all of the instructions in the body of the assignment below, you must also follow these technical instructions for all assignments in this unit. *Failure to do so may result in a grade of zero.*
* [At the bottom of the page](#Submission-Test) is some code which checks you meet the submission requirements. You **must** ensure that this runs correctly before submission.
* Do not modify or delete any of the cells that are marked as test cells, even if they appear to be empty.
* Do not duplicate any cells in the notebook – this can break the marking script. Instead, insert a new cell (e.g. from the menu) and copy across any contents as necessary.
* Do not use global variables or rely on global state at all. It may lose you marks in the marking script.

Remember to save and backup your work regularly, and double-check you are submitting the correct version.

This notebook is the primary reference for your submission. You may write code in separate `.py` files but it must be clearly imported into the notebook so that it runs without needing to reference those files, and you must explain clearly what functionality is contained in those files (through comments, markdown cells, etc).

As always, **the work you submit for this assignment must be entirely your own.** Do not copy or work with other students. Do not copy answers that you find online. These assignments are designed to help improve your understanding first and foremost – the process of doing the assignment is part of *learning*. They are also used to assess your ability, and so you must uphold academic integrity. Submitting plagiarised work risks your entire place on your degree.

**The pass mark for this assignment is 40%.** We expect that students, on average, will be able to produce a submission which gets a mark between 50-70% within the normal workload allocation for the unit, but this will vary depending on individual backgrounds. Please ask for help if you are struggling.

## Getting Started
Spam refers to unwanted email, often in the form of advertisements. In the literature, an email that is **not** spam is called *ham*. Most email providers offer automatic spam filtering, where spam emails will be moved to a separate inbox based on their contents. Of course this requires being able to scan an email and determine whether it is spam or ham, a classification problem. This is the subject of this assignment.

This assignment has one part, worth 100% of the grade for this coursework.

You will write a supervised learning based classifier to determine whether a given email is spam or ham. You must write and submit the code in this notebook. The training data is provided for you. You may use any classification method. Marks will be awarded primarily based on the accuracy of your classifier on unseen test data, but there are also marks for estimating how accurate you think your classifier will be.

### Choice of Algorithm
While the classification method is a completely free choice, the assignment folder includes [a separate notebook file](data/naivebayes.ipynb) which can help you implement a Naïve Bayes solution. If you do use this notebook, you are still responsible for porting your code into *this* notebook for submission. A good implementation should give a high enough accuracy to get a good grade on this section (50-70%).
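
As a rough illustration of the Naïve Bayes idea on binary features, the sketch below fits a Bernoulli model with Laplace smoothing on a tiny synthetic data set. The function names, the smoothing constant `alpha`, and the toy data are all assumptions made for this example; it is not the content of the helper notebook and is only one of several reasonable formulations.

```python
import numpy as np


def fit_bernoulli_nb(features: np.ndarray, labels: np.ndarray, alpha: float = 1.0):
    """Estimate log class priors and per-class log feature likelihoods.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    n_features = features.shape[1]
    log_priors = np.zeros(2)
    log_on = np.zeros((2, n_features))   # log P(feature = 1 | class)
    log_off = np.zeros((2, n_features))  # log P(feature = 0 | class)
    for c in (0, 1):
        rows = features[labels == c]
        log_priors[c] = np.log((rows.shape[0] + alpha) / (features.shape[0] + 2 * alpha))
        p_on = (rows.sum(axis=0) + alpha) / (rows.shape[0] + 2 * alpha)
        log_on[c] = np.log(p_on)
        log_off[c] = np.log(1.0 - p_on)
    return log_priors, log_on, log_off


def predict_bernoulli_nb(features, log_priors, log_on, log_off):
    # Unnormalised log posterior for each class, shape (n, 2).
    scores = log_priors + features @ log_on.T + (1 - features) @ log_off.T
    return np.argmax(scores, axis=1)


# Tiny synthetic example: feature 0 mostly marks spam, feature 1 mostly marks ham.
toy_features = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
toy_labels = np.array([1, 1, 1, 0, 0, 0])
toy_model = fit_bernoulli_nb(toy_features, toy_labels)
toy_predictions = predict_bernoulli_nb(toy_features, *toy_model)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a keyword that never appears in one class from forcing a probability of exactly zero.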

You could also consider a k-nearest neighbour algorithm, but this may be less accurate. Logistic regression is another option that you may wish to consider.

If you are looking to go beyond the scope of the unit, you might be interested in building something more advanced, like an artificial neural network. This is possible just using `numpy`, but will require significant self-directed learning. *Extensions like this are left unguided and are not factored into the unit workload estimates.*

**Note:** you may use helper functions in libraries like `numpy` or `scipy`, but you **must not** import code which builds entire models for you. This includes but is not limited to use of libraries like `scikit-learn`, `tensorflow`, or `pytorch` – there will be plenty of opportunities for these libraries in later units. The point of this assignment is to understand and code the actual algorithm yourself. ***If you are in any doubt about any particular library or function please ask a tutor.*** Submissions which ignore this will receive penalties or even zero marks.

## Training Data
The training data is described below and has 1000 rows. There is also a 500 row set of test data. These are functionally identical to the training data; they are just in a separate csv file to encourage you to split out your training and test data. You should consider how best to make use of all available data without overfitting, and how to produce an unbiased estimate of your classifier's accuracy.
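
One common way to use all of the available rows while still estimating accuracy on data the model has not seen is k-fold cross-validation. The sketch below only generates the index splits; `k_fold_indices` is a hypothetical helper written for this example, and the choice of `k = 5` is an assumption rather than a recommendation from the assignment.

```python
import numpy as np


def k_fold_indices(n_rows: int, k: int, seed: int = 0):
    """Yield (train_idx, validation_idx) pairs for k roughly equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        validation_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, validation_idx


# With 1500 rows (1000 training + 500 test) and k = 5, each fold holds out 300 rows.
fold_sizes = [(len(train), len(val)) for train, val in k_fold_indices(1500, 5)]
```

Training on each `train_idx` and scoring on the matching `validation_idx`, then averaging the scores, gives an accuracy estimate in which every row is held out exactly once.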

The cell below loads the training data into a variable called `training_spam`.

In [1]:
import numpy as np

training_spam = np.loadtxt(open("data/training_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

Shape of the spam training data set: (1000, 55)
[[1 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 1 1 0]
 [1 0 0 ... 1 1 1]]


Your training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and describes whether a message is spam `1` or ham `0`. The remaining 54 columns are _features_ that you will use to build a classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword appears in the message and `0` otherwise.
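
The column layout described above can be separated with plain slicing. The sketch below uses a small random array as a stand-in for the real file, so the variable names and values are illustrative assumptions only.

```python
import numpy as np

# Synthetic stand-in with the same layout as the real data: 1 label column + 54 feature columns.
rng = np.random.default_rng(0)
example_rows = rng.integers(0, 2, size=(8, 55))

example_labels = example_rows[:, 0]     # response: 1 = spam, 0 = ham
example_features = example_rows[:, 1:]  # 54 binary keyword/character indicators

# Fraction of messages labelled spam.
spam_fraction = example_labels.mean()
# Fraction of messages in which each feature is present.
feature_frequency = example_features.mean(axis=0)
```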

As mentioned there is also a 500 row set of *test data*. It contains the same 55 columns.

In [2]:
testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

Shape of the spam testing data set: (500, 55)
[[1 0 0 ... 1 1 1]
 [1 1 0 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]


## Part One
Write all of the code for your classifier below this cell. There is some very rough skeleton code in the cell directly below. You may insert more cells below this if you wish, but you must not duplicate any cells as this can break the grading script.

### Submission Requirements
Your code must provide a variable with the name `classifier`. This object must have a method called `predict` which takes input data and returns class predictions. The input will be a single $n \times 54$ numpy array; your classifier should return a numpy array of length $n$ with classifications. There is a demo in the cell below, and a test you can run before submitting to check your code is working correctly.
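
To make the required interface concrete, here is a minimal sketch of an object that satisfies it. `MajorityClassifier` is a hypothetical baseline invented for this example (it ignores the features entirely), so it only illustrates the input and output shapes, not a sensible model.

```python
import numpy as np


class MajorityClassifier:
    """Hypothetical baseline: always predicts the most common training label."""

    def __init__(self, labels: np.ndarray) -> None:
        # 1 if at least half of the training labels are spam, otherwise 0.
        self._majority = int(np.mean(labels) >= 0.5)

    def predict(self, data: np.ndarray) -> np.ndarray:
        # One prediction per input row, returned as a 1-D integer array of length n.
        return np.full(data.shape[0], self._majority, dtype=int)


demo_classifier = MajorityClassifier(np.array([0, 0, 1]))
demo_predictions = demo_classifier.predict(np.zeros((4, 54)))
```

The important detail is the return shape: a flat array with one entry per row, rather than a column or row matrix.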

Your code must run on our test machine in under 30 seconds. If you wish to train a more complicated model (e.g. neural network) which will take longer, you are welcome to save the model's weights as a file and then load these in the cell below so we can test it. You must include the code which computes the original weights, but this must not run when we run the notebook – comment out the code which actually executes the routine and make sure it is clear what we need to change to get it to run. Remember that we will be testing your final classifier on additional hidden data.
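
One possible way to store and reload trained weights is `np.savez`, which keeps each array under its own key. The sketch below uses random stand-in matrices, a temporary directory, and key names of the form `layer_0`, `layer_1`; all of these are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in weight matrices with different shapes, as in a small two-layer network.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 54)), rng.standard_normal((1, 4))]

# np.savez stores each array under its own key, so layers of different shapes need no pickling.
path = os.path.join(tempfile.mkdtemp(), "weights.npz")
np.savez(path, **{f"layer_{i}": w for i, w in enumerate(weights)})

# Reload the arrays in layer order.
with np.load(path) as stored:
    restored = [stored[f"layer_{i}"] for i in range(len(stored.files))]
```

Loading the file this way takes a fraction of a second, which keeps the notebook well inside the time limit even when training itself is slow.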

In [3]:
from typing import List, Type, Tuple
import numpy as np


# Skalski, P (2018) Let’s code a Neural Network in plain NumPy. Available at: https://towardsdatascience.com/lets-code-a-neural-network-in-plain-numpy-ae7e74410795 (Accessed: 16 April 2022)

class ActivationFunction:
    """
    Abstract class for an activation function.
    """

    @staticmethod
    def apply_function(z_indexes: np.ndarray) -> np.ndarray:
        """
        Apply activation function to activation values.
        :param z_indexes: Activation values.
        :return: Numpy array.
        """
        raise NotImplementedError("Abstract method cannot be called.")

    @staticmethod
    def apply_function_derivative(z_indexes: np.ndarray) -> np.ndarray:
        """
        Apply derivative of activation function to activation values.
        :param z_indexes: Activation values.
        :return: Numpy array.
        """
        raise NotImplementedError("Abstract method cannot be called.")


class Relu(ActivationFunction):
    """
    Relu activation function implementation.
    """

    @staticmethod
    def apply_function(z_indexes: np.ndarray) -> np.ndarray:
        """
        Apply activation function to activation values.
        :param z_indexes: Activation values.
        :return: Numpy array.
        """
        return np.maximum(0, z_indexes)

    @staticmethod
    def apply_function_derivative(z_indexes: np.ndarray) -> np.ndarray:
        """
        Apply derivative of activation function to activation values.
        :param z_indexes: Activation values.
        :return: Numpy array.
        """
        return (z_indexes > 0).astype(int)


class Sigmoid(ActivationFunction):
    """
    Sigmoid activation function implementation.
    """

    @staticmethod
    def apply_function(z_indexes: np.ndarray) -> np.ndarray:
        """
        Apply activation function to activation values.
        :param z_indexes: Activation values.
        :return: Numpy array.
        """
        return np.divide(1, np.add(1, np.exp(-z_indexes)))

    @staticmethod
    def apply_function_derivative(z_indexes: np.ndarray) -> np.ndarray:
        """
        Apply derivative of activation function to activation values.
        :param z_indexes: Activation values.
        :return: Numpy array.
        """
        sig = Sigmoid.apply_function(z_indexes)
        return np.multiply(sig, np.subtract(1, sig))


class Layer:
    """
    Object implementation of a layer within the neural network model.
    """

    def __init__(self, input_dimension: int, output_dimension: int, activation: Type[ActivationFunction]) -> None:
        self._input_dimension = input_dimension
        self._output_dimension = output_dimension
        self._activation = activation

    def get_input_dimension(self) -> int:
        """
        Getter for input_dimension.
        :return: The number of input values or nodes in previous layer.
        """
        return self._input_dimension

    def get_output_dimension(self) -> int:
        """
        Getter for output_dimension.
        :return: The number of output values or nodes in this layer.
        """
        return self._output_dimension

    def get_activation(self) -> Type[ActivationFunction]:
        """
        Getter for activation.
        :return: The activation function used in this layer.
        """
        return self._activation


class Loss:
    """
    Abstract implementation of a loss function.
    """

    @staticmethod
    def get_cost_value(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
        """
        Function to calculate cost by comparing the expected output (y) with the actual output (y_hat).
        :param y: Expected output.
        :param y_hat: Actual output of the neural network.
        :return: Cost calculation.
        """
        raise NotImplementedError("Abstract method cannot be called.")


class BinaryCrossentropy(Loss):
    """
    Binary Crossentropy function implementation.
    """

    @staticmethod
    def get_cost_value(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
        """
        Function to calculate cost by comparing the expected output (y) with the actual output (y_hat).
        :param y: Expected output.
        :param y_hat: Actual output of the neural network.
        :return: Cost calculation.
        """
        with np.errstate(divide='ignore'):
            m = y_hat.shape[1]
            p1 = np.divide(-1, m)
            p2 = np.dot(y, np.log(y_hat).transpose())
            p3 = np.dot(np.subtract(1, y), np.log(np.subtract(1, y_hat)).transpose())
            cost = np.multiply(p1, np.add(p2, p3))
        return np.squeeze(cost)


class Parameters:
    """
    Object representation of the neural network's parameters, those being the weightings and biases for each layer.
    """

    def __init__(self, weightings: List[np.ndarray], biases: List[np.ndarray]) -> None:
        self._weightings = weightings
        self._biases = biases

    def get_weightings(self) -> Tuple[np.ndarray, ...]:
        """
        Getter for deepcopy of weightings.
        :return: Tuple of numpy arrays containing the weightings for each layer in the neural network.
        """
        return tuple(weighting.copy() for weighting in self._weightings)

    def get_biases(self) -> Tuple[np.ndarray, ...]:
        """
        Getter for deepcopy of biases.
        :return: Tuple of numpy arrays containing the biases for each layer in the neural network.
        """
        return tuple(bias.copy() for bias in self._biases)

    def update_weightings(self, alpha: float, gradients_of_weightings: List[np.ndarray]) -> None:
        """
        Update the weighting values after back propagation is completed.
        :param alpha: Float of the learning rate.
        :param gradients_of_weightings: List of the gradients of the weightings for each layer.
        """
        for index in range(len(self._weightings)):
            weighting = self._weightings[index]
            gradient = gradients_of_weightings[index]
            self._weightings[index] = np.subtract(weighting, np.multiply(alpha, gradient))

    def update_biases(self, alpha: float, gradients_of_biases: List[np.ndarray]) -> None:
        """
        Update the biases after back propagation is completed.
        :param alpha: Float of the learning rate.
        :param gradients_of_biases: List of the gradients of the biases for each layer.
        """
        for index in range(len(self._biases)):
            bias = self._biases[index]
            gradient = gradients_of_biases[index]
            self._biases[index] = np.subtract(bias, np.multiply(alpha, gradient))


class Model:
    """
    Class representation of a neural network.
    """

    def __init__(self, parameters: Parameters = None, loss: Type[Loss] = BinaryCrossentropy) -> None:
        self._parameters = parameters
        self._layers: List[Layer] = list()
        self._cost_history = list()
        self._accuracy_history = list()
        self._loss = loss
        self._init_training_attributes()

    def add_layer(self, layer: Layer) -> None:
        """
        Add layer to network, check that inputs to layer match outputs of previous layer.
        :param layer: Layer object.
        """
        if self._layers and self._layers[-1].get_output_dimension() != layer.get_input_dimension():
            raise ValueError("Previous layer output dimension is not equal to new layer input dimension.")
        else:
            self._layers.append(layer)

    def add_loss(self, loss: Type[Loss]) -> None:
        """
        Add loss function to network.
        :param loss: Object representation of loss function.
        """
        self._loss = loss

    def _init_parameters(self, seed: int = 19) -> None:
        """
        Initialise random parameters of weightings and biases for each layer.
        :param seed: Optional seed for reproducibility.
        """
        np.random.seed(seed)

        def initialise_weighting(layer: Layer) -> np.ndarray:
            """
            Create 2 dimensional (output_dimension, input_dimension) Numpy array of random initial weightings.
            :param layer: Corresponding layer of the neural network.
            :return: Two dimensional Numpy array.
            """
            return 0.1 * np.random.randn(layer.get_output_dimension(), layer.get_input_dimension()).astype(
                np.longdouble)

        def initialise_bias(layer: Layer) -> np.ndarray:
            """
            Create 2 dimensional (output_dimension, 1) Numpy array of random initial biases.
            :param layer: Corresponding layer of the neural network.
            :return: Two dimensional Numpy array.
            """
            return 0.1 * np.random.randn(layer.get_output_dimension(), 1).astype(np.longdouble)

        weightings = [initialise_weighting(layer) for layer in self._layers]
        biases = [initialise_bias(layer) for layer in self._layers]
        self._parameters = Parameters(weightings, biases)

    def _single_layer_forward_propagation(self, layer_index: int, prev_activation: np.ndarray,
                                          activator_function: Type[ActivationFunction]) -> np.ndarray:
        """
        Transform the activation values from the previous layer into z_index. Then run the given z_indexes through an
        activator function. Additionally, cache the previous activation values and intermediate z_indexes.
        :param layer_index: Index of the current layer being evaluated.
        :param prev_activation: Activation values from the previous layer.
        :param activator_function: Object representation of the activation function for the current layer.
        :return: Y_hat of the current layer.
        """
        curr_weighting = self._parameters.get_weightings()[layer_index]
        curr_bias = self._parameters.get_biases()[layer_index]
        curr_z_index = np.add(np.dot(curr_weighting, prev_activation), curr_bias)
        self._z_index_cache[layer_index] = curr_z_index
        self._activation_cache[layer_index] = prev_activation
        return activator_function.apply_function(curr_z_index)

    def _forward_propagation(self, inputs: np.ndarray) -> np.ndarray:
        """
        Perform propagation on each layer, passing forward the activation values.
        :param inputs: Initial data.
        :return: Numpy array of the network's outputs.
        """
        curr_activation = inputs
        for index, layer in enumerate(self._layers):
            prev_activation = curr_activation
            curr_activation = self._single_layer_forward_propagation(index, prev_activation, layer.get_activation())
        return curr_activation

    def _single_layer_backward_propagation(self, index: int, curr_d_activation: np.ndarray,
                                           activation_function: Type[ActivationFunction]) -> np.ndarray:
        """
        Store the gradient of the weightings and biases after calculation for a layer of the neural network. Output
        derivative of the previous activation values.
        :param index: Index of the current layer.
        :param curr_d_activation: Derivatives of the current activation values.
        :param activation_function: Activation function used for this layer.
        :return: Array of the derivatives of the previous activation values
        """
        prev_activation = self._activation_cache[index]
        curr_z_index = self._z_index_cache[index]
        curr_weighting = self._parameters.get_weightings()[index]
        m = prev_activation.shape[1]

        curr_d_of_z_index = np.multiply(curr_d_activation, activation_function.apply_function_derivative(curr_z_index))
        curr_d_of_weight = np.divide(np.dot(curr_d_of_z_index, prev_activation.transpose()), m)
        curr_d_of_bias = np.divide(np.sum(curr_d_of_z_index, axis=1, keepdims=True), m)
        prev_d_of_activation = np.dot(curr_weighting.transpose(), curr_d_of_z_index)
        self._gradients_of_weightings[index] = curr_d_of_weight
        self._gradients_of_biases[index] = curr_d_of_bias
        return prev_d_of_activation

    def _backward_propagation(self, y: np.ndarray, y_hat: np.ndarray) -> None:
        """
        Perform backwards propagation on the neural network layer by layer from the end to start.
        :param y: Expected output.
        :param y_hat: Actual output of the neural network.
        """
        y = np.reshape(y, y_hat.shape)

        d1 = np.divide(y, y_hat, out=np.zeros(y.shape), where=(y_hat != 0))
        s1 = np.subtract(1, y)
        s2 = np.subtract(1, y_hat)
        d2 = np.divide(s1, s2, out=np.zeros(s1.shape), where=(s2 != 0))  # Prevents divisions by zero
        prev_d_of_activation = -np.subtract(d1, d2)
        for index, layer in reversed(list(enumerate(self._layers))):
            curr_d_of_activation = prev_d_of_activation
            prev_d_of_activation = self._single_layer_backward_propagation(
                index, curr_d_of_activation, layer.get_activation()
            )

    def _update(self, learning_rate: float) -> None:
        """
        Update the weighting and bias parameters.
        :param learning_rate: Learning rate the neural network is currently being trained at.
        """
        self._parameters.update_weightings(learning_rate, self._gradients_of_weightings)
        self._parameters.update_biases(learning_rate, self._gradients_of_biases)

    def _convert_probabilities_to_ones_and_zeros(self, y_hat: np.ndarray) -> np.ndarray:
        """
        Rounds the value of a probability to the nearest integer so every value is either a 0 or 1.
        :param y_hat: Output of the neural network.
        :return: Array of 0s and 1s
        """
        return np.round(y_hat)

    def _get_accuracy_value(self, y: np.ndarray, y_hat: np.ndarray) -> float:
        """
        Get the percentage of correct outputs.
        :param y: Expected output.
        :param y_hat: Actual output of the neural network.
        :return: Float in the range [0, 1]
        """
        return np.equal(y, self._convert_probabilities_to_ones_and_zeros(y_hat)).mean()

    def _init_training_attributes(self) -> None:
        """
        Initialise arrays for storing numpy arrays while training the neural network.
        """
        layer_count = len(self._layers)
        self._activation_cache = [None] * layer_count
        self._z_index_cache = [None] * layer_count
        self._gradients_of_weightings = [None] * layer_count
        self._gradients_of_biases = [None] * layer_count

    def _prepare_input_data(self, x: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Transforms the arrays into a format for use in the training process.
        :param x: Data for all entries.
        :param y: The expected output for all entries.
        :return: Data and expected output in a more useful arrangement.
        """
        return x.transpose(), np.reshape(y, [1, -1])

    def train(self, x: np.ndarray, y: np.ndarray, epochs: int, learning_rate: float) -> Tuple[
        Parameters, List[float], List[float]]:
        """
        Train the neural network.
        :param x: Data for all entries.
        :param y: The expected output for all entries.
        :param epochs: Number of iterations to train the neural network with.
        :param learning_rate: Learning rate of the neural network.
        :return: Parameter object, array of the costs, array of the network's accuracy.
        """
        x, y = self._prepare_input_data(x, y)

        if self._parameters is None:
            self._init_parameters()

        for i in range(epochs):
            self._init_training_attributes()
            y_hat = self._forward_propagation(x)
            self._cost_history.append(self._loss.get_cost_value(y, y_hat))
            self._accuracy_history.append(self._get_accuracy_value(y, y_hat))
            self._backward_propagation(y, y_hat)
            self._update(learning_rate)

        return self._parameters, self._cost_history, self._accuracy_history

    def predict(self, data: np.ndarray) -> np.ndarray:
        """
        Predict outputs from the given data. Raise error if neural network has no parameters object (has not been
        trained).
        :param data: Data of entries.
        :return: Array outputting 0s and 1s
        """
        if self._parameters is None:
            raise NotImplementedError("Model has not been trained yet.")

        curr_activation = data.transpose()
        for index, layer in enumerate(self._layers):
            prev_activation = curr_activation
            curr_weighting = self._parameters.get_weightings()[index]
            curr_bias = self._parameters.get_biases()[index]
            curr_z_index = np.add(np.dot(curr_weighting, prev_activation), curr_bias)
            curr_activation = layer.get_activation().apply_function(curr_z_index)
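        predictions = self._convert_probabilities_to_ones_and_zeros(curr_activation)
        # The network output has shape (1, n); flatten it to the length-n integer array the assignment expects.
        return predictions.reshape(-1).astype(int)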


def get_data():
    """
    Read training and testing data and place them into a single array.
    :return: Array of data to train on.
    """
    training_spam = np.loadtxt(open("data/training_spam.csv"), delimiter=",").astype(int)
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    return np.concatenate((training_spam, testing_spam))


def build_model(parameters=None) -> Model:
    """
    Build a 5 layer model with Relu on the inner layers, Sigmoid on the final layer and Binary Crossentropy as the loss function.
    :param parameters: Initialise model with weightings and biases to use a pretrained model.
    :return: Model.
    """
    model = Model(parameters)
    layers = [Layer(54, 54, Relu), Layer(54, 54, Relu), Layer(54, 54, Relu), Layer(54, 54, Relu), Layer(54, 1, Sigmoid)]
    for layer in layers:
        model.add_layer(layer)
    model.add_loss(BinaryCrossentropy)
    return model


def train_model(model: Model, data: np.ndarray) -> Parameters:
    """
    Train the model with the author's chosen settings: 750 epochs at a learning rate of 0.08.
    :param model: Untrained model.
    :param data: Data to train on.
    :return: Parameters object of the trained model.
    """
    parameters, _, _ = model.train(data[:, 1:], data[:, 0], 750, 0.08)
    return parameters


def store_parameters(parameters: Parameters) -> None:
    """
    Pickle the values of a Parameters object.
    :param parameters: Parameter object.
    """
    # The per-layer arrays have different shapes, so pack them into explicit object arrays.
    # Letting np.save build a ragged array implicitly is deprecated and fails on recent NumPy versions.
    for path, arrays in (("data/weightings.npy", parameters.get_weightings()),
                         ("data/biases.npy", parameters.get_biases())):
        packed = np.empty(len(arrays), dtype=object)
        for index, array in enumerate(arrays):
            packed[index] = array
        np.save(path, packed, allow_pickle=True)


def retrieve_parameters() -> Parameters:
    """
    Unpickle the values of a Parameters object.
    :return: Parameter object of pretrained values.
    """
    # Each file holds a 1-D object array with one entry per layer; convert back to plain lists.
    weightings = list(np.load("data/weightings.npy", allow_pickle=True))
    biases = list(np.load("data/biases.npy", allow_pickle=True))
    return Parameters(weightings, biases)


def train_model_and_save_it_to_file() -> None:
    """
    Train a model and pickle its parameter values.
    """
    data = get_data()
    model = build_model()
    parameters = train_model(model, data)
    store_parameters(parameters)


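def create_classifier() -> Model:
    """
    Build the model from the pretrained parameters stored on disk, without retraining.
    :return: Model ready to call predict on.
    """
    parameters = retrieve_parameters()
    return build_model(parameters)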


# To create a classifier through training:
# 1. Uncomment the function "create_classifier_from_training".
# 2. Uncomment the line "classifier = create_classifier_from_training(750, 0.08)".
# 3. Comment out "classifier = create_classifier()".
# The values 750 and 0.08 are the number of epochs and the learning rate I used to train the model; both can be
# changed. The learning rate is expected to be in the range [0, 1); behaviour with other values is undefined.
# def create_classifier_from_training(epochs: int, learning_rate: float):
#     data = get_data()
#     model = build_model()
#     model.train(data[:, 1:], data[:, 0], epochs, learning_rate)
#     return model

# classifier = create_classifier_from_training(750, 0.08)
classifier = create_classifier()
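
The backward pass in the cell above combines the derivative of the binary cross-entropy loss with the derivative of the sigmoid output. As a sanity check on that chain rule, the standalone sketch below compares it against central finite differences and against the well-known simplification `y_hat - y`. The helper functions and sample values here are written for this example only and are independent of the classes above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(y, y_hat):
    # Element-wise binary cross-entropy (no averaging).
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


z = np.array([-2.0, -0.5, 0.3, 1.7])
y = np.array([0.0, 1.0, 1.0, 0.0])
y_hat = sigmoid(z)

# Chain rule: dL/dy_hat multiplied by the sigmoid derivative y_hat * (1 - y_hat).
d_loss_d_y_hat = -(y / y_hat - (1 - y) / (1 - y_hat))
chain_rule_gradient = d_loss_d_y_hat * y_hat * (1 - y_hat)

# Central finite differences as an independent numerical check.
eps = 1e-6
numeric_gradient = (bce(y, sigmoid(z + eps)) - bce(y, sigmoid(z - eps))) / (2 * eps)
```

A check like this is a cheap way to catch sign or transposition mistakes before spending time on training.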

### Accuracy Estimate
In the cell below there is a function called `my_accuracy_estimate()` which returns `0.5`. Before you submit the assignment, write your best guess for the accuracy of your classifier into this function, as a proportion between `0.5` and `1`. So if you think you will get 80% of inputs correct, return the value `0.8`. This will form a small part of the marking criteria for the assignment, to encourage you to test your own code without bias – give the *most accurate* estimate you can.

*Note* that there is no sense giving a value lower than `0.5` – if you are getting a score in this region, you can flip all of your predictions to get a better score!
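
The point about flipping predictions can be checked directly: for binary labels, inverting every prediction turns an accuracy of $a$ into $1 - a$. The arrays below are made-up values used only to illustrate this.

```python
import numpy as np

true_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
# A deliberately poor set of predictions: only 2 of 10 are right.
poor_predictions = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0])

accuracy = np.mean(poor_predictions == true_labels)
# Inverting every 0/1 prediction turns each mistake into a correct answer and vice versa.
flipped_accuracy = np.mean((1 - poor_predictions) == true_labels)
```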

In [4]:
def my_accuracy_estimate():
    return 0.92

Write all of the code for your classifier above this cell.

### Testing Details
Your classifier will be tested against some hidden data from the same source as the original. The accuracy (percentage of classifications correct) will be calculated, then benchmarked against common methods. At the very high end of the grading scale, your accuracy will also be compared to the best submissions from other students (in your own cohort and others!). Your estimate from the cell above will also factor in, and you will be rewarded for being close to your actual accuracy (overestimates and underestimates will be treated the same).

#### Test Cell
The following code will run your classifier against the provided test data. To enable it, set the constant `SKIP_TESTS` to `False`.

The original skeleton code above classifies every row as ham, but once you have written your own classifier you can run this cell again to test it. So long as your code sets up a variable called `classifier` with a method called `predict`, the test code will be able to run. 

Of course you may wish to test your classifier in additional ways, but you *must* ensure this version still runs before submitting.

**IMPORTANT**: you must set `SKIP_TESTS` back to `True` before submitting this file!

In [5]:
SKIP_TESTS = True


def tests():
    testing_spam = np.loadtxt(open("data/testing_spam.csv"), delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]

    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels) / test_labels.shape[0]
    print(f"Accuracy on test data is: {accuracy}")


if not SKIP_TESTS:
    tests()

## Submission Test
The following cell tests if your notebook is ready for submission. **You must not skip this step!**

Restart the kernel and run the entire notebook (Kernel → Restart & Run All). Now look at the output of the cell below. 

*If there is no output, then your submission is not ready.* Either your code is still running (did you forget to skip tests?) or it caused an error.

As previously mentioned, failing to follow these instructions can result in a grade of zero.

In [6]:
def submission_tests():
    import sys
    import pathlib

    fail = False

    if not SKIP_TESTS:
        fail = True
        print("You must set the SKIP_TESTS constant to True in the cell above.")

    p3 = pathlib.Path('./spamclassifier.ipynb')
    if not p3.is_file():
        fail = True
        print("This notebook file must be named spamclassifier.ipynb")

    if "create_classifier" not in globals():
        fail = True
        print("You must include a function called create_classifier.")

    if "my_accuracy_estimate" not in globals():
        fail = True;
        print(
            "You must include a function called my_accuracy_estimate which returns a hard-coded value between 0.5 and 1.")
    else:
        if my_accuracy_estimate() == 0.5:
            print("Warning:")
            print("You do not seem to have provided an accuracy estimate, it is set to 0.5.")
            print("This is the actually the worst possible accuracy – if your classifier")
            print("got 0.1 then it could invert its results to get 0.9!")
    print()

    if fail:
        sys.stderr.write("Your submission is not ready! Please read and follow the instructions above.")
    else:
        print("All checks passed. When you are ready to submit, upload the notebook to the")
        print("assignment page, without changing any filenames.")
        print()
        print("If you need to submit multiple files, you can archive them in a .zip file. (No other format.)")


submission_tests()


All checks passed. When you are ready to submit, upload the notebook to the
assignment page, without changing any filenames.

If you need to submit multiple files, you can archive them in a .zip file. (No other format.)


In [7]:
# This is a test cell. Please do not modify or delete.