# Spam Classifiers
I use this notebook to showcase multiple algorithms for a binary classification task on the Spambase dataset.

The dataset has the structure:
- 4601 examples
- 57 features
- 1 label:
    - 0 - notSpam - 2788 examples
    - 1 - spam - 1813 examples
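
As a quick sanity check on the class balance listed above, the short sketch below computes the class fractions and the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative and are not used elsewhere in the notebook; the counts are the ones given in the list.

```python
# Class counts as listed above for the Spambase dataset
num_not_spam = 2788
num_spam = 1813
total = num_not_spam + num_spam

# Fraction of the dataset belonging to each class
spam_fraction = num_spam / total
not_spam_fraction = num_not_spam / total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(spam_fraction, not_spam_fraction)

print(f"total examples: {total}")
print(f"spam fraction: {spam_fraction:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier trained later in the notebook should beat this baseline (roughly 0.606) to be considered useful.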

We start off by importing the necessary packages. We need to be able to read and write CSV files (csv), perform matrix computations (numpy) and graph our results (matplotlib). TensorFlow provides a streamlined way to implement multiple learning algorithms quickly. 

In [189]:
import csv
import numpy as np
import random
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

Next, we set some global variables for the script: the filename, the hyperparameters (step size, number of epochs, momentum, batch size), the feature dimension (57), the number of output classes (2), and the fraction of the data used for training.

In [190]:
# filename containing the dataset
filename = 'Datasets/Spambase/spambase.data'

# Hyperparameters
numEpochs = 30
stepSize = 0.25e-3
batchSize = 20
momentum = 0.785

# Information about the data
featureDimension = 57
numClasses = 2

# The percentage of data to use for training
trainRatio = 0.8
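
To make the roles of the step size and momentum concrete, here is a minimal NumPy sketch of one classical (non-Nesterov) momentum update on a toy quadratic objective. The function `sgd_momentum_step` is a hypothetical helper written for illustration only; it is not part of the notebook and not the optimizer's actual source code. The default values mirror `stepSize` and `momentum` above.

```python
import numpy as np

def sgd_momentum_step(weights, velocity, grad, step_size=0.25e-3, momentum=0.785):
    """One classical SGD-with-momentum update (illustrative helper)."""
    # The velocity is a decaying sum of past gradient steps
    velocity = momentum * velocity - step_size * grad
    # The weights move along the velocity
    weights = weights + velocity
    return weights, velocity

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(3):
    w, v = sgd_momentum_step(w, v, grad=w)
print(w)
```

With such a small step size each update moves the weights only slightly, which is why the momentum term matters for making steady progress.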

With these defined, we can then define some helper functions that manipulate the data.

In [191]:
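# Load the data from the filename and return the spam and notSpam arrays
def loadData(filename):
    # Read every row as floats; the with-block closes the file afterwards
    with open(filename, newline='') as f:
        data = np.array(list(csv.reader(f, delimiter=',',
                quoting=csv.QUOTE_NONNUMERIC)))
    # Assumes the first 1813 rows are spam and the remaining rows are notSpam
    spam = data[:1813, :]
    notSpam = data[1813:, :]
    return spam, notSpam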

# Shuffle, then split the data according to the
# train-test ratio (e.g. 0.8)
def splitData(spam, notSpam, trainRatio, seed):
    # Shuffle the spam and notSpam
    np.random.seed(seed)
    np.random.shuffle(spam)
    np.random.shuffle(notSpam)
    
    # Split the data according to the ratio
    numSpamTrain = int(trainRatio*spam.shape[0] + 1)
    numNotTrain = int(trainRatio*notSpam.shape[0] + 1)
    
    spamTrain = spam[:numSpamTrain, :]
    spamTest = spam[numSpamTrain:, :]
    
    notTrain = notSpam[:numNotTrain, :]
    notTest = notSpam[numNotTrain:, :]
    
    # Return the arrays still separated by class
    return spamTrain, spamTest, notTrain, notTest

# Keeps only the given percentage of each class's training data and
# returns the concatenated, shuffled array.
# Note: seed is accepted but not used here; the shuffle relies on the
# global NumPy random state.
def takePercentData(spamTrain, notTrain, percentage, seed):
    percentage /= 100.
    
    numSpam = int(percentage*spamTrain.shape[0] + 1)
    numNot = int(percentage*notTrain.shape[0] + 1)
    
    trainData = spamTrain[:numSpam, :]
    trainData = np.append(trainData, notTrain[:numNot, :], axis=0)
    
    np.random.shuffle(trainData)
    
    return trainData
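
The helpers above size every slice with the rule `int(fraction * n + 1)`. The sketch below works through that arithmetic for the class counts of this dataset, an 80% training split, and a 25% subset of the training data. `count_kept` is a hypothetical helper introduced only for this illustration.

```python
def count_kept(num_rows, fraction):
    """Rows kept by the int(fraction * n + 1) rule used in the helpers above."""
    return int(fraction * num_rows + 1)

num_spam, num_not_spam = 1813, 2788
train_ratio = 0.8

# Per-class train/test sizes after the split
spam_train = count_kept(num_spam, train_ratio)
not_train = count_kept(num_not_spam, train_ratio)
spam_test = num_spam - spam_train
not_test = num_not_spam - not_train

# Keep 25% of each class's training rows
percentage = 25 / 100.
subset_size = count_kept(spam_train, percentage) + count_kept(not_train, percentage)
test_size = spam_test + not_test

print(subset_size, test_size)  # prints "921 919"
```

These sizes match the sample counts reported in the training output at the end of the notebook.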

These functions allow us to create and compile the TensorFlow model.

In [192]:
def createModel(activation, numHiddenNeurons, numLayers):
    # Define the input layer
    inputs = keras.Input(shape=(featureDimension,), name='input')

    # Define first hidden layer (L2-regularized weights)
    x = keras.layers.Dense(numHiddenNeurons,
        kernel_regularizer=keras.regularizers.l2(100),
        activation=activation, name='hidden')(inputs)

    # If 2 layers are specified, stack a second hidden layer
    if numLayers == 2:
        x = keras.layers.Dense(numHiddenNeurons,
            kernel_regularizer=keras.regularizers.l2(100),
            activation=activation, name='hidden2')(x)

    # Softmax output layer with one unit per class
    outputs = keras.layers.Dense(numClasses, activation='softmax',
                                 name='output')(x)

    # Put the model together and return it
    model = keras.Model(inputs=inputs, outputs=outputs, name='NN')

    return model

def compileModel(model, optChoice):
    # Use SGD with momentum when requested; otherwise fall back to Adam
    if optChoice == 'sgd':
        opt = tf.keras.optimizers.SGD(learning_rate=stepSize,
                                      momentum=momentum)
    else:
        opt = tf.keras.optimizers.Adam(learning_rate=stepSize)
    
    # Compile the model with the optimizer, target metrics, and loss
    model.compile(
        optimizer = opt,
        loss = keras.losses.SparseCategoricalCrossentropy(),
        metrics = ['accuracy']
    )

    # Save the model diagram
    # saveModelDiagram(model)
    
    return model
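
To get a feel for how small these networks are, the sketch below counts the trainable parameters of a stack of fully connected layers. `count_dense_params` is a hypothetical helper for illustration; the layer widths assume the 57 input features, 2 output classes, and the 10 hidden neurons chosen in the main function.

```python
def count_dense_params(layer_sizes):
    """Trainable parameters in a stack of fully connected layers.

    Each layer contributes fan_in * fan_out weights plus fan_out biases.
    """
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in pairs)

# 57 inputs -> 10 hidden -> 2 outputs
one_hidden = count_dense_params([57, 10, 2])
# 57 inputs -> 10 hidden -> 10 hidden -> 2 outputs
two_hidden = count_dense_params([57, 10, 10, 2])

print(one_hidden, two_hidden)  # prints "602 712"
```

Calling `model.summary()` on the Keras model should report the same totals for these configurations.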

Finally, the main function loads the data, builds the training and test sets, and trains the model.

In [193]:
def main():
    # Load the data
    spamData, notData = loadData(filename)
    curSplit = 1
    spamTrain, spamTest, notTrain, notTest = splitData(spamData, 
                                    notData, trainRatio, curSplit)
    
    # Create the test data
    XTest = np.append(spamTest, notTest, axis=0)
    np.random.shuffle(XTest)
    
    YTest = XTest[:, -1]
    XTest = XTest[:, :-1]
    
    # Take the desired percentage of train data
    percentage = 25
    XTrain = takePercentData(spamTrain, notTrain, percentage, curSplit)
    np.random.shuffle(XTrain)
    YTrain = XTrain[:, -1]
    XTrain = XTrain[:, :-1]
    
    # Create a tf model
    numHiddenNeurons = 10
    activation = 'relu'
    numLayers = 2
    optChoice = 'adam'
    
    model = createModel(activation, numHiddenNeurons, numLayers)
    model = compileModel(model, optChoice)
    
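    # Train with the configured batch size, using the test set as
    # validation data and stopping early if validation loss stops improving
    history = model.fit(XTrain, YTrain, epochs=numEpochs,
                    batch_size=batchSize,
                    validation_data=(XTest, YTest), verbose=1,
                    shuffle=True,
                    callbacks=[keras.callbacks.EarlyStopping()])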
#     print(history.history["val_accuracy"][-1])
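
The `EarlyStopping()` callback above is used with its defaults, which monitor `val_loss` with `patience=0`. The following is a simplified sketch of patience-based early stopping, not the Keras implementation; `epochs_run` and the loss values are hypothetical and serve only to illustrate the idea.

```python
def epochs_run(val_losses, patience=0):
    """Return how many epochs run before patience-based stopping triggers.

    Simplified illustration: stop once the validation loss has failed to
    improve on its best value for at least `patience` epochs in a row
    (and at least once).
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: count it and check the patience budget
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation losses, for illustration only
print(epochs_run([1.0, 0.9, 0.95, 0.8]))                    # stops at epoch 3
print(epochs_run([1.0, 0.9, 0.95, 0.8, 0.85, 0.9, 0.7], 2))  # stops at epoch 6
```

Under this rule, training runs for the full number of epochs only if the validation loss improves at every epoch.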

In [194]:
if __name__ == "__main__":
    main()

Train on 921 samples, validate on 919 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
