# A) Dataset

We chose a dataset of emails, and our task is to detect which of them are spam. After training our chosen machine learning algorithm on a collection of labelled training data (messages labelled as spam or not spam), the pipeline should be able to classify whether an email in the test set is spam. This is therefore a binary classification task.

The dataset is a collection of text files, each containing the subject and message of one email. Whether a file is spam is indicated by its name: files with 'spmsg' in the name are spam messages, and the remaining files (named 'msg') are normal messages.

The first step is to preprocess this labelled data. To do this, we create a function that first reads the files from the chosen directory containing the dataset. The data are then mapped so that any text file whose name contains 'spmsg' is given the label 1.0, and all other (normal) messages are given the label 0.0. This gives the machine learning algorithm an unambiguous class for each message when building the pipeline.

Using this, a dataframe is created with the contents of the text file as the first column and the label given to each text file as the second column.

In [7]:
import re

dataPath = '/data/tempstore/spam/bare/part[1-9]' # directory of data

def create_DF( directory ): # function that creates dataframe of the text files (emails) in the directory
   ft_RDD = sc.wholeTextFiles(directory) # Read text files from the directory
   print('There are {} files read from directory {}'.format(ft_RDD.count(),directory))# Count the number of files in the directory and display them
   spam_text_RDD = ft_RDD.map(lambda ft: (ft[1], 0.0 if re.search('spmsg',ft[0]) is None else 1.0)) # Label spam emails as 1.0 and normal emails as 0.0
   DF = spark.createDataFrame(spam_text_RDD, schema=['text','label']) # create a DataFrame
   return DF

# display our dataframe
data = create_DF(dataPath)
data.show(5) # Display the top 5 rows


print("Done")

There are 2602 files read from directory /data/tempstore/spam/bare/part[1-9]
+--------------------+-----+
|                text|label|
+--------------------+-----+
|Subject: becoming...|  0.0|
|Subject: zero dow...|  1.0|
|Subject: how does...|  1.0|
|Subject: philosop...|  0.0|
|Subject: job - un...|  0.0|
+--------------------+-----+
only showing top 5 rows

Done


-----

As you can see, the first column of the dataframe holds the subject and content of each email, and the second column holds the class label, where 1.0 marks a spam email and 0.0 marks a normal email.
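
The labelling rule can be checked without Spark. The following is a minimal sketch that applies the same regular-expression test as `create_DF` to two file paths; the paths and the helper name `label_from_path` are hypothetical and only serve as an illustration.

```python
import re

def label_from_path(path):
    """Return 1.0 if the path marks a spam message, otherwise 0.0."""
    # Same test as in create_DF: any path containing 'spmsg' is spam.
    return 0.0 if re.search('spmsg', path) is None else 1.0

# Hypothetical file paths, for illustration only.
example_paths = [
    '/data/example/part1/spmsga1.txt',  # spam: contains 'spmsg'
    '/data/example/part1/3-1msg1.txt',  # normal: contains 'msg' but not 'spmsg'
]
labels = [label_from_path(p) for p in example_paths]
print(labels)  # [1.0, 0.0]
```

Searching for 'spmsg' rather than 'msg' matters here, because 'msg' also occurs inside 'spmsg' and would match every file.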


-----

# B) Machine Learning Pipeline

In this section, we will build, train and test our machine learning pipeline. The machine learning algorithm chosen to conduct this classification task is Naive Bayes.

First, we'll import the necessary libraries to configure the machine learning pipeline. We then prepare the data by using a 90% training split and a 10% testing split.

In [8]:
# Import libraries:
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer, IDF
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare training and test data:
train, test = data.randomSplit([0.9, 0.1], seed=1234) # Randomly split the data; the seed fixes the pseudo-random sequence so the split is reproducible
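
To illustrate the role of the seed, here is a rough pure-Python sketch of a seeded random split. It is not Spark's implementation: the function `random_split` and its per-row assignment are assumptions made for illustration. It does show two properties worth keeping in mind: the same seed reproduces the same split, and the 90/10 proportions are only approximate when each row is assigned at random.

```python
import random

def random_split(rows, train_weight=0.9, seed=1234):
    """Assign each row to train or test with a seeded random draw."""
    rng = random.Random(seed)  # the seed fixes the pseudo-random sequence
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(2602))  # stand-in for the 2602 emails
train_a, test_a = random_split(rows)
train_b, test_b = random_split(rows)

print(len(train_a), len(test_a))  # roughly a 90/10 split, not exact
print(train_a == train_b)         # True: same seed, same split
```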


To configure the pipeline, we first use the tokenizer to transform the first column of the dataframe from a block of text into the separate words of each email. We then apply a hashing function that maps each word to an integer index and counts how often each index occurs, writing the resulting term-frequency vector to a column named 'features'. This gives every email a fixed-length numeric representation that the classifier can use. These two steps, together with Naive Bayes as our estimator, make up the three stages of our pipeline.

In [9]:
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb:
tokenizer = Tokenizer(inputCol="text", outputCol="words") # Feature transformer - splits the texts into the words of the text
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") # Feature extractor - hashes the strings into integer buckets
nb = NaiveBayes() # Naive Bayes ML classifier - Our estimator


pipeline = Pipeline(stages=[tokenizer, hashingTF, nb]) # 3 stages to the pipeline, feature transforming, extracting and ML.
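
To make the first two stages more concrete, the sketch below reproduces the idea in plain Python. It is an illustration under assumptions, not Spark's code: `tokenize` and `hashing_tf` are hypothetical helpers, and `zlib.crc32` stands in for the hash function that `HashingTF` uses internally.

```python
import zlib

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(words, num_features):
    """Map each word to a bucket index and count occurrences per bucket."""
    vec = [0] * num_features
    for word in words:
        # Stand-in hash function; Spark uses a different one internally.
        idx = zlib.crc32(word.encode('utf-8')) % num_features
        vec[idx] += 1
    return vec

words = tokenize("Subject: free offer free money")
vec = hashing_tf(words, 50)
print(len(vec), sum(vec))    # 50 5

# With a single bucket every word collides into the same index.
print(hashing_tf(words, 1))  # [5]
```

Fewer buckets mean more distinct words share an index, so the classifier can no longer tell them apart; this is the trade-off that the `numFeatures` parameter controls.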

# C) Evaluating Performance

We chose the initial parameter values for the pipeline as 50 hashing buckets and a Naive Bayes smoothing parameter of 0. We then fit the classifier to the training set using these parameters.

In [16]:
from datetime import datetime
import numpy as np

starttime = datetime.now()

# Implement the ML Pipeline and use a ParamGridBuilder to construct a grid of parameters to test over:
paramGrid = ParamGridBuilder() \
   .addGrid(hashingTF.numFeatures, [50]) \
   .addGrid(nb.smoothing, [0]) \
   .build()

# Binary Classification Evaluator
bc_eval=BinaryClassificationEvaluator() # As this task is a binary classification; the default metric is areaUnderROC

# Set up TrainValidationSplit:
tvs = TrainValidationSplit(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=bc_eval,
                          # 80% of the data will be used for training, 20% for validation.
                          trainRatio=0.8)

# Run TrainValidationSplit:
model = tvs.fit(train)

# Make predictions on training documents:
prediction = model.transform(train)
print(model.bestModel)
print("training: ", bc_eval.evaluate(prediction))


# Make predictions on test documents:
prediction = model.transform(test)
print(model.bestModel)
print("testing: ", bc_eval.evaluate(prediction))


# Show the predictions on the test documents:
model.transform(test)\
   .select("features", "label", "prediction")\
   .show()
   
print("Done")


endtime = datetime.now()
elapsedtime = endtime - starttime
print ('Elapsed time is %s ' %elapsedtime)


PipelineModel_490caf52a5c7c12640b2
training:  0.491709759657338
PipelineModel_490caf52a5c7c12640b2
testing:  0.3628724216959512
+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(50,[0,1,2,3,5,6,...|  0.0|       1.0|
|(50,[0,1,2,3,4,5,...|  1.0|       1.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,4,5,6,9,...|  0.0|       0.0|
|(50,[1,2,3,5,6,7,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  1.0|       0.0|
|(50,[0,1,2,3,4,5,...|  1.0|       1.0|
|(50,[0,1,2,3,4,5,...|  1.0|       1.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  0.0|       1.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[1,2,3,5,6,8,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,2,3,4,5,...|  0.0|       0.0|
|(50,[0,1,2,3,4,6,...|  0.0|    

------







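The evaluation scores of the pipeline model are shown above. Note that `BinaryClassificationEvaluator` reports the area under the ROC curve by default rather than accuracy, so the scores are approximately 0.492 for the training set and 0.363 for the test set. The total time taken to train and test the pipeline is just under a minute and a half.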
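
To help interpret these scores, the sketch below computes the area under the ROC curve through its pairwise-ranking interpretation: the fraction of (spam, normal) pairs in which the spam email receives the higher score, with ties counted as one half. The function `area_under_roc` is a hypothetical helper written for illustration, not the evaluator's implementation, and the scores are made up.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1.0, 1.0, 0.0, 0.0]  # made-up labels: two spam, two normal
print(area_under_roc(labels, [0.9, 0.8, 0.2, 0.1]))  # 1.0 (perfect ranking)
print(area_under_roc(labels, [0.1, 0.2, 0.8, 0.9]))  # 0.0 (fully reversed)
print(area_under_roc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5 (uninformative)
```

Under this reading, a score of 0.5 corresponds to an uninformative ranking, and a score below 0.5 means normal emails tend to receive higher spam scores than spam emails do.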

-----


# D) Grid Search

Once we have tested that the pipeline works with the data, we will then use a parameter grid that will evaluate the pipeline at different parameter configurations and select the values that return the best results.

To implement the grid search, we constructed a grid of parameters to search over: 50, 100, 150 and 200 hashing buckets, and Naive Bayes smoothing values of 0, 1 and 5.

In [14]:
# Implement the ML Pipeline and use a ParamGridBuilder to construct a grid of parameters to search over:
paramGrid = ParamGridBuilder() \
   .addGrid(hashingTF.numFeatures, [50, 100, 150, 200]) \
   .addGrid(nb.smoothing, [0, 1, 5]) \
   .build()
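
As a rough illustration of the size of this search, the grid can be enumerated in plain Python. The dictionary keys are only labels chosen for this sketch, and the ordering produced by `ParamGridBuilder` may differ from the one shown here.

```python
from itertools import product

num_features_values = [50, 100, 150, 200]  # hashing buckets
smoothing_values = [0, 1, 5]               # Naive Bayes smoothing

# Every pairing of a bucket count with a smoothing value.
grid = [
    {'numFeatures': n, 'smoothing': s}
    for n, s in product(num_features_values, smoothing_values)
]

print(len(grid))  # 12
print(grid[0])    # {'numFeatures': 50, 'smoothing': 0}
print(grid[-1])   # {'numFeatures': 200, 'smoothing': 5}
```

With 4 × 3 = 12 combinations, the pipeline has to be fitted and evaluated once per combination, which is why the grid search takes noticeably longer than the single configuration in part C.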

As before, we train the model using TrainValidationSplit (TVS), which in this case picks the best set of parameters for our Naive Bayes classifier.

In [15]:
starttime=datetime.now()
# TrainValidationSplit will try all combinations of values and determine the best model using the binary classification evaluator:
tvs = TrainValidationSplit(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=bc_eval,
                          # 80% of the data will be used for training, 20% for validation.
                          trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters:
model = tvs.fit(train)

# Make predictions on training documents:
prediction = model.transform(train)
print(model.bestModel)
print("training: ", bc_eval.evaluate(prediction))


# Make predictions on test documents:
prediction = model.transform(test)
print(model.bestModel)
print("testing: ", bc_eval.evaluate(prediction))


# Show the predictions of the best model on the test documents:
model.transform(test)\
   .select("features", "label", "prediction")\
   .show()
   
print("Done")

# find optimal parameters from grid search
evalmetric=model.validationMetrics
maxparams=np.argmax(evalmetric) # index of the parameter combination with the highest validation metric
print(paramGrid[maxparams])

print('--------------')
endtime = datetime.now()
elapsedtime = endtime - starttime
print ('Final Elapsed time for grid search %s ' %elapsedtime)

PipelineModel_418590c848f789bed4d7
training:  0.4980520824997364
PipelineModel_418590c848f789bed4d7
testing:  0.3718869365928191
+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(200,[2,5,17,18,2...|  0.0|       1.0|
|(200,[0,1,2,3,8,1...|  1.0|       1.0|
|(200,[0,1,2,4,5,6...|  0.0|       0.0|
|(200,[1,4,9,12,13...|  0.0|       0.0|
|(200,[12,17,22,23...|  0.0|       0.0|
|(200,[0,1,4,5,6,8...|  1.0|       1.0|
|(200,[1,4,5,8,9,1...|  1.0|       1.0|
|(200,[0,1,3,4,5,6...|  1.0|       1.0|
|(200,[1,2,3,4,6,1...|  0.0|       0.0|
|(200,[0,1,2,4,6,7...|  0.0|       1.0|
|(200,[0,1,2,3,4,5...|  0.0|       0.0|
|(200,[2,3,4,9,10,...|  0.0|       0.0|
|(200,[5,9,16,18,1...|  0.0|       0.0|
|(200,[1,2,4,5,7,8...|  0.0|       0.0|
|(200,[0,1,2,3,4,5...|  0.0|       0.0|
|(200,[0,1,2,3,4,5...|  0.0|       0.0|
|(200,[1,5,6,7,8,9...|  0.0|       0.0|
|(200,[0,1,2,4,7,8...|  0.0|       0.0|
|(200,[10,13,15,17...|  0.0|   


-----
The grid search gives a marginal improvement, with a training-set score of 0.498 and a test-set score of 0.372 (area under the ROC curve, the default metric of the evaluator). The optimal parameters from the grid search were 200 hashing buckets and a smoothing factor of 5. The larger number of buckets probably mitigated some of the collisions that occurred with the 50 hashing buckets used in part C. Meanwhile, a Laplace smoothing factor greater than 0 avoids zero probabilities for words that are absent from a class in the training set, which would otherwise skew the Naive Bayes MAP calculation. That the optimal value was 5 suggests this was a slight issue in part C, where the Laplace smoothing parameter was set to 0.
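
The effect of the smoothing parameter can be seen in a small worked example. The sketch below uses the standard additive (Laplace) smoothing formula for a multinomial Naive Bayes word likelihood; the helper `word_likelihood` and all counts are hypothetical, and the vocabulary size is assumed to equal the number of hashing buckets.

```python
def word_likelihood(word_count, total_count, vocab_size, smoothing):
    """Smoothed estimate of P(word | class) for multinomial Naive Bayes."""
    return (word_count + smoothing) / (total_count + smoothing * vocab_size)

# Hypothetical counts: a word never seen in one class during training.
word_count = 0      # occurrences of the word in that class
total_count = 1000  # total word occurrences in that class
vocab_size = 200    # assumed equal to the number of hashing buckets

print(word_likelihood(word_count, total_count, vocab_size, 0))  # 0.0
print(word_likelihood(word_count, total_count, vocab_size, 1))
print(word_likelihood(word_count, total_count, vocab_size, 5))  # 0.0025
```

With a smoothing value of 0, a single unseen word drives the whole product of likelihoods for that class to zero, whatever the other words suggest. Any positive smoothing value keeps the estimate above zero.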

Even though the scores are quite low, Naive Bayes offered a relatively quick evaluation, taking just under 7 minutes to complete the grid search. However, Naive Bayes relies on the assumption that the features (here, the word counts) are conditionally independent given the class, which rarely holds for text. Better results might have been attained with a different classification algorithm, such as logistic regression.

----

The work was done in pairs:

Islam Ibrahim & Saeed Ahmed