# Spam Classifier

## Getting Started
Spam refers to unwanted email, often in the form of advertisements. In the literature, an email that is **not** spam is called *ham*. Most email providers offer automatic spam filtering, where spam emails are moved to a separate folder based on their contents. This requires scanning an email and deciding whether it is spam or ham, which is a classification problem.

### Choice of Algorithm
This notebook presents the Naive Bayes approach. A second method, a neural network, was also implemented; it lives in a separate .py file, with all of its weights stored in CSV files. Other algorithms could be used as well, such as k-nearest neighbours (which may be less accurate) or logistic regression.
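
For comparison, the k-nearest-neighbour alternative can be sketched in a few lines of NumPy. This is only an illustration and not part of the notebook's classifier: the function name `knn_predict`, the use of Hamming distance, the tie-breaking rule, and `k=3` are all assumptions made for the example.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Hypothetical k-NN for binary features using Hamming distance."""
    train_x = np.asarray(train_x)
    train_y = np.asarray(train_y)
    test_x = np.asarray(test_x)
    # Hamming distance: number of features on which two rows disagree.
    distances = (test_x[:, None, :] != train_x[None, :, :]).sum(axis=2)
    # Indices of the k closest training rows for every test row.
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    # Majority vote over the neighbours' labels (an even split counts as spam).
    votes = train_y[nearest].mean(axis=1)
    return (votes >= 0.5).astype(int)

# Toy data: 4 training emails with 3 binary features each (1 = spam, 0 = ham).
toy_x = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
toy_y = np.array([1, 1, 0, 0])
toy_predictions = knn_predict(toy_x, toy_y, np.array([[1, 1, 0], [0, 1, 1]]))
```

Because every test row is compared against every training row, the cost of this approach grows with the size of the training set, whereas Naive Bayes only needs the per-class counts.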

## Training Data
The training data is described below and has 1000 rows. There is also a 500-row set of test data. It is functionally identical to the training data; it is only kept in a separate CSV file to encourage you to split out your training and test data. You should consider how best to use all available data without overfitting, and how to produce an unbiased estimate of your classifier's accuracy.
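
One common way to use all available rows while still getting a held-out accuracy estimate is k-fold cross-validation. The sketch below only generates the index splits; the helper name `k_fold_indices`, the choice of `k=5`, and the idea of pooling the 1000 training and 500 test rows into one set of 1500 are assumptions for illustration, not something the notebook prescribes.

```python
import numpy as np

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]
        # Every fold except the i-th is used for training.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Assumption: 1000 training rows + 500 test rows pooled into 1500 rows.
splits = list(k_fold_indices(1500, k=5))
```

Each row appears in exactly one validation fold, so averaging the per-fold accuracy gives an estimate that never scores a row with a model trained on that same row.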

The cell below loads the training data into a variable called `training_spam`.

In [1]:
import numpy as np

training_spam = np.loadtxt("data/training_spam.csv", delimiter=",").astype(int)
print("Shape of the spam training data set:", training_spam.shape)
print(training_spam)

Shape of the spam training data set: (1000, 55)
[[1 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [1 1 1 ... 1 1 0]
 [1 0 0 ... 1 1 1]]


The training set consists of 1000 rows and 55 columns. Each row corresponds to one email message. The first column is the _response_ variable and describes whether a message is spam `1` or ham `0`. The remaining 54 columns are _features_ that you will use to build a classifier. These features correspond to 54 different keywords (such as "money", "free", and "receive") and special characters (such as ":", "!", and "$"). A feature has the value `1` if the keyword appears in the message and `0` otherwise.
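
To make the column layout concrete, here is a small sketch that separates the response column from the feature columns and compares how often each keyword appears in spam versus ham. The array `example` is a made-up stand-in with 3 feature columns rather than 54, and the variable names are chosen for this illustration only.

```python
import numpy as np

# Stand-in for `training_spam`: 4 emails, 1 label column + 3 feature columns.
example = np.array([
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
])

labels = example[:, 0]      # response variable: 1 = spam, 0 = ham
features = example[:, 1:]   # binary keyword indicators

# Fraction of spam and of ham messages that contain each keyword.
spam_rate = features[labels == 1].mean(axis=0)
ham_rate = features[labels == 0].mean(axis=0)
```

A keyword whose rate differs strongly between the two classes (the first feature in this toy example) is the kind of feature that carries the most evidence for a classifier.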

As mentioned, there is also a 500-row set of *test data*. It contains the same 55 columns.

In [2]:
testing_spam = np.loadtxt("data/testing_spam.csv", delimiter=",").astype(int)
print("Shape of the spam testing data set:", testing_spam.shape)
print(testing_spam)

Shape of the spam testing data set: (500, 55)
[[1 0 0 ... 1 1 1]
 [1 1 0 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]


## Part One
Naive-Bayes approach implementation
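
As a reading aid, the quantities computed by the code in this part can be written out as follows. The notation is introduced here for illustration: $x_i \in \{0, 1\}$ is feature $i$ of a message, $N$ is the number of training messages, $N_c$ the number in class $c$ (ham $= 0$, spam $= 1$), $n_{c,i}$ the number of class-$c$ messages with $x_i = 1$, $\alpha$ the smoothing constant, and $k$ the number of distinct values a feature takes ($k = 2$ for a binary feature).

$$
\log P(c) = \log \frac{N_c}{N}, \qquad
\theta_{c,i} = \frac{n_{c,i} + \alpha}{N_c + k\,\alpha}, \qquad
\hat{y} = \arg\max_{c \in \{0, 1\}} \left[ \log P(c) + \sum_{i=1}^{54} x_i \log \theta_{c,i} \right]
$$

The code uses base-10 logarithms, which does not change which class attains the maximum. When the two scores are equal, the code predicts spam.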

### Submission Requirements
The code uses a variable with the name `classifier`. This object has a method called `predict`, which takes input data and returns class predictions. The input will be a single $n \times 54$ numpy array, and the classifier returns a numpy array of length $n$ with classifications.
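
The interface described above can be checked mechanically. The sketch below is a hypothetical helper, not part of the assignment's test harness: `AlwaysHamClassifier` is a trivial stand-in object and `check_predict_contract` is a name invented for this example.

```python
import numpy as np

class AlwaysHamClassifier:
    """Hypothetical stand-in that satisfies the interface by predicting ham for every row."""
    def predict(self, data):
        return np.zeros(data.shape[0], dtype=int)

def check_predict_contract(classifier, n=5, n_features=54):
    """Return True if `predict` maps an (n, 54) array to a length-n array of 0/1 labels."""
    rng = np.random.default_rng(0)
    sample = rng.integers(0, 2, size=(n, n_features))  # random binary features
    predictions = np.asarray(classifier.predict(sample))
    return predictions.shape == (n,) and bool(np.isin(predictions, [0, 1]).all())

contract_ok = check_predict_contract(AlwaysHamClassifier())
```

The same helper could be called with any object that exposes a `predict` method, including a trained classifier.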

In [3]:
import numpy as np
import copy

class SpamClassifier:
    
#   
    def __init__(self, data = None):
  
        if data is not None:
            self.data = data
            
            self.data_outputs = self.data[:, 0]  # 1 = spam, 0 = ham
            self.data_inputs = self.data[:, 1:]  # All 54 features values over 1000 samples for training mode
            self.n_samples, self.n_features = self.data_inputs.shape  # Number of samples (1000), Number of features (54)
            
            self.total_spam = np.count_nonzero(self.data_outputs) # Counting Spam
            self.total_ham = self.n_samples - self.total_spam # Counting Ham
            
        
    def estimate_log_class_priors(self):
    
        ## Probabilities
        prob_spam = self.total_spam / self.n_samples
        prob_ham = self.total_ham / self.n_samples
        probs = np.array([prob_ham, prob_spam], dtype=float)

        ## Logarithm
        log_class_priors = np.log10(probs)

        return log_class_priors
        
#
    def values_ones_spam_ham(self, binary_number):
        
        # binary_number is the label to drop: 0 removes the ham rows (leaving spam),
        # 1 removes the spam rows (leaving ham)
        kept_rows = self.data_inputs[self.data_outputs != binary_number]
        
        # Count the 1s (keyword present) per feature column in the remaining rows
        values_ones_spam_ham = np.count_nonzero(kept_rows, axis=0).astype(float)
            
        return values_ones_spam_ham
    
#    
    def estimate_log_class_conditional_likelihoods(self, alpha=1):

        ## Probabilities
        
        values_ones_spam = self.values_ones_spam_ham(0) # Drop ham rows (label 0): per-feature counts of 1s among spam
        values_ones_ham = self.values_ones_spam_ham(1) # Drop spam rows (label 1): per-feature counts of 1s among ham
            
        ## Calculating probabilities per feature
        prob_ones_spam = np.zeros([1, self.n_features]) #array to save probabilities
        prob_ones_ham = np.zeros([1, self.n_features]) #array to save probabilities

        for i in range(self.n_features):

            # Every feature is binary, so each column of data_inputs should have cardinality 2
            features_values = np.unique(self.data_inputs[:,i]) # Unique values in this column
            cardinality = len(features_values) # Cardinality of this column

            # Smoothed probabilities using Laplace (additive) smoothing
            prob_ones_spam[0, i] = (values_ones_spam[i] + alpha)/(self.total_spam + (cardinality * alpha))
            prob_ones_ham[0, i] = (values_ones_ham[i] + alpha)/(self.total_ham + (cardinality * alpha))

        conditional_likelihoods = copy.deepcopy(prob_ones_ham)
        conditional_likelihoods = np.append(conditional_likelihoods, prob_ones_spam, axis = 0)
        conditional_likelihoods = np.log10(conditional_likelihoods)

        return conditional_likelihoods
    
#
    def train(self):
        
        log_class_priors = self.estimate_log_class_priors()
        
        log_class_conditional_likelihoods = self.estimate_log_class_conditional_likelihoods()
        
        np.savetxt('log_class_priors.csv', log_class_priors, delimiter=",")
        np.savetxt('log_class_conditional_likelihoods.csv', log_class_conditional_likelihoods, delimiter=",")

#    
    def predict(self, test_data):
        
        # Open probabilities from latest training 
        log_class_priors = np.loadtxt(open("log_class_priors.csv"), delimiter=",")
        log_class_conditional_likelihoods = np.loadtxt(open("log_class_conditional_likelihoods.csv"), delimiter=",")
        
        n_test_samples, n_test_features = test_data.shape # Number of samples (n), Number of features (54)
        
        class_predictions = np.array([])

        for samples in range(n_test_samples):

            ham = log_class_priors[0] + (test_data[samples] @ log_class_conditional_likelihoods[0])
            spam = log_class_priors[1] + (test_data[samples] @ log_class_conditional_likelihoods[1])

            if ham > spam:
                class_predictions_values = 0
            else:
                class_predictions_values = 1

            class_predictions = np.append(class_predictions, class_predictions_values)

        return class_predictions
    
### Function ###
def create_classifier(data):
    classifier = SpamClassifier(data) # Initializing the class with the data passed in
    classifier.train() # Training
    return classifier
    
if __name__ == '__main__':
    
    # Importing Training data from /data/...
    training_spam_data = np.loadtxt(open("data/training_spam.csv"), delimiter=",")
    classifier = create_classifier(training_spam_data)
    
    # Insert new data
    #predictions = classifier.predict(test_data)
    
    # Comments
    # The Neural Network file is included in the folder; the link for the video is:
    # https://video-uk.engagelms.com/share/aecytBqLyKvppF1Ah6mXUuXXtSaLEA123qsXBgPp8J7Rykss3Rochwrr
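
The `predict` method above adds a log-likelihood only for keywords that are present in a message. A Bernoulli Naive Bayes model, which is a different variant from the one implemented in this notebook, also adds $\log(1 - \theta)$ for keywords that are absent. A minimal vectorised sketch of that variant is shown below for comparison; the function names, the toy data, and the use of natural logarithms are assumptions for illustration, and no claim is made here about which variant scores better on this data set.

```python
import numpy as np

def fit_bernoulli_nb(features, labels, alpha=1.0):
    """Return (log_priors, theta) with rows ordered [ham, spam]."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    log_priors = np.empty(2)
    theta = np.empty((2, features.shape[1]))
    for c in (0, 1):
        rows = features[labels == c]
        log_priors[c] = np.log(rows.shape[0] / features.shape[0])
        # Laplace smoothing with two possible values per binary feature.
        theta[c] = (rows.sum(axis=0) + alpha) / (rows.shape[0] + 2 * alpha)
    return log_priors, theta

def predict_bernoulli_nb(features, log_priors, theta):
    """Score both classes for every row and return the higher-scoring label."""
    features = np.asarray(features)
    # Present keywords contribute log(theta), absent ones log(1 - theta).
    scores = (features @ np.log(theta).T
              + (1 - features) @ np.log(1 - theta).T
              + log_priors)
    # np.argmax returns index 0 (ham) when both scores are equal.
    return np.argmax(scores, axis=1)

# Toy data: 4 emails, 2 binary features, labels 1 = spam, 0 = ham.
toy_x = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
toy_y = np.array([1, 1, 0, 0])
toy_priors, toy_theta = fit_bernoulli_nb(toy_x, toy_y)
toy_labels = predict_bernoulli_nb(toy_x, toy_priors, toy_theta)
```

Note that the tie-breaking differs from the notebook's `predict`, which returns spam when the two scores are equal.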

### Testing Details
The classifier will be tested against some hidden data from the same source as the original. The accuracy (percentage of classifications correct) will be calculated, then benchmarked against common methods.
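
The accuracy figure is simply the fraction of predictions that match the true labels. A minimal sketch, with a hypothetical helper name `accuracy_score` and made-up labels, looks like this:

```python
import numpy as np

def accuracy_score(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    predictions = np.asarray(predictions)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predictions == true_labels))

# Made-up example: 3 of the 4 predictions agree with the labels.
example_accuracy = accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])
```

Multiplying the result by 100 gives the percentage of correct classifications.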

#### Test Cell
The following code will run the classifier against the provided test data. To enable it, set the constant `SKIP_TESTS` to `False`.

In [5]:
SKIP_TESTS = True

if not SKIP_TESTS:
    testing_spam = np.loadtxt("data/testing_spam.csv", delimiter=",").astype(int)
    test_data = testing_spam[:, 1:]
    test_labels = testing_spam[:, 0]

    predictions = classifier.predict(test_data)
    accuracy = np.count_nonzero(predictions == test_labels)/test_labels.shape[0]
    print(f"Accuracy on test data is: {accuracy}")

In [6]:
import sys
import pathlib

fail = False;

if not SKIP_TESTS:
    fail = True;
    print("You must set the SKIP_TESTS constant to True in the cell above.")
    
p3 = pathlib.Path('./spamclassifier.ipynb')
if not p3.is_file():
    fail = True
    print("This notebook file must be named spamclassifier.ipynb")
    
if "create_classifier" not in dir():
    fail = True;
    print("You must include a function called create_classifier.")

if "my_accuracy_estimate" not in dir():
    fail = True;
    print("You must include a function called my_accuracy_estimate.")
else:
    if my_accuracy_estimate() == 0.5:
        print("Warning:")
        print("You do not seem to have provided an accuracy estimate, it is set to 0.5.")
        print("This is actually the worst possible accuracy – if your classifier")
        print("got 0.1 then it could invert its results to get 0.9!")
    
print("INFO: Make sure you follow the instructions on the assignment page to submit your video.")
print("Failing to include this could result in an overall grade of zero for both parts.")
print()

if fail:
    sys.stderr.write("Your submission is not ready! Please read and follow the instructions above.")
else:
    print("All checks passed. When you are ready to submit, upload the notebook and readme file to the")
    print("assignment page, without changing any filenames.")
    print()
    print("If you need to submit multiple files, you can archive them in a .zip file. (No other format.)")

INFO: Make sure you follow the instructions on the assignment page to submit your video.
Failing to include this could result in an overall grade of zero for both parts.

All checks passed. When you are ready to submit, upload the notebook and readme file to the
assignment page, without changing any filenames.

If you need to submit multiple files, you can archive them in a .zip file. (No other format.)


In [7]:
# This is a test cell. Please do not modify or delete.