# DAT405 Introduction to Data Science and AI 
## 2022-2023, Reading Period 2
## Assignment 4: Spam classification using Naïve Bayes 
There will be an overall grade for this assignment. To get a pass grade (grade 5), you need to pass items 1-3 below. To receive higher grades, finish items 4 and 5 as well. 

The exercise takes place in a notebook environment where you can choose to use Jupyter or Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical. 
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells. Use the former to document and explain your results, and the latter to write code snippets that execute the required tasks.  

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you choose to use Jupyter notebooks, you will have to run the commands in the cell below on your local computer. On Windows you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.



In [21]:
#Download and extract data
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
# !tar -xjf 20021010_easy_ham.tar.bz2
# !tar -xjf 20021010_hard_ham.tar.bz2
# !tar -xjf 20021010_spam.tar.bz2



The data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.

In [22]:
#!ls -la



### 1. Preprocessing: 
1.	Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`).


In [23]:
from os import listdir
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import random

# We extracted the archives beforehand, outside this notebook

# Get all file names in the different directories
easy_ham_files = listdir("./easy_ham")
hard_ham_files = listdir("./hard_ham")
spam_files = listdir("./spam")

easy_ham = []
hard_ham = []
spam = []

# Load the content of all files into lists; `with` closes each file after reading
for file_name in easy_ham_files:
    with open(f"./easy_ham/{file_name}", "r") as fd:
        easy_ham.append(fd.read())

for file_name in hard_ham_files:
    with open(f"./hard_ham/{file_name}", "r") as fd:
        hard_ham.append(fd.read())

for file_name in spam_files:
    with open(f"./spam/{file_name}", "r", encoding="unicode_escape") as fd:
        spam.append(fd.read())
    
vect = CountVectorizer()
    
# Combine the emails, create y array where 1 is ham and 0 is spam and then split the data into sets
emails = vect.fit_transform(easy_ham + hard_ham + spam)
Y = ([1] * len(easy_ham)) + ([1] * len(hard_ham)) + ([0] * len(spam))
X_train, X_test, Y_train, Y_test = train_test_split(emails, Y, test_size=0.2)
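
The cell above fits the vectorizer on all emails and then splits the combined matrix. As a hedged alternative, the sketch below builds the four named sets (`hamtrain`, `spamtrain`, `hamtest`, `spamtest`) explicitly and fits the vocabulary on the training emails only. The toy texts, the `random_state` value and the variable names `ham_texts` and `spam_texts` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for the lists of raw email texts (hypothetical data).
ham_texts = ["meeting at noon", "see you at lunch", "project notes attached",
             "lunch meeting moved", "notes from the call"]
spam_texts = ["win money now", "cheap money offer", "win a free offer",
              "free money now", "claim your free prize"]

# Split each class separately so the four named sets exist explicitly.
hamtrain, hamtest = train_test_split(ham_texts, test_size=0.2, random_state=0)
spamtrain, spamtest = train_test_split(spam_texts, test_size=0.2, random_state=0)

# Fit the vocabulary on training emails only, then reuse it for the test emails.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(hamtrain + spamtrain)
X_test = vectorizer.transform(hamtest + spamtest)

# Same label convention as the notebook: 1 is ham, 0 is spam.
y_train = [1] * len(hamtrain) + [0] * len(spamtrain)
y_test = [1] * len(hamtest) + [0] * len(spamtest)
```

Fixing `random_state` makes the split reproducible, which helps when comparing results between runs.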
    


### 2. Write a Python program that: 
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. from Sklearn) on `hamtrain` and `spamtrain`, classifies the test sets, and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes classifier in Sklearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem:
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers. 





In [24]:
from sklearn.naive_bayes import MultinomialNB 
from sklearn.naive_bayes import BernoulliNB 

# instantiate the models
bnb = BernoulliNB()
mnb = MultinomialNB()

# fit the data to the models
bnb.fit(X_train, Y_train)
mnb.fit(X_train, Y_train)

MultinomialNB()

##### We have also submitted a pdf report with the same content but in a more structured way.

According to the documentation, one explicit difference besides the calculation method is that BernoulliNB penalizes the non-occurrence of a feature.
For example, if the word "investment" often occurs in spam emails and is absent from the email being classified, BernoulliNB considers that email less likely to be spam.
MultinomialNB does not do this; it simply ignores words that do not appear in the email.
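
To illustrate the difference described above, here is a minimal sketch on a made-up two-word vocabulary. The count matrix and the word names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical vocabulary: columns are counts of ["investment", "meeting"].
X = np.array([[3, 0], [2, 0], [0, 1], [0, 0]])
y = np.array([0, 0, 1, 1])  # 0 is spam, 1 is ham, as in the notebook

mnb = MultinomialNB().fit(X, y)
bnb = BernoulliNB().fit(X, y)  # the default binarize=0.0 maps counts to presence/absence

# An email containing neither word is an all-zero row.
empty = np.array([[0, 0]])
# Multinomial: no counts means no likelihood terms, so only the class prior remains.
print(mnb.predict_proba(empty))
# Bernoulli: each absent word contributes a factor (1 - p) per class,
# so the missing "investment" pushes the email towards ham.
print(bnb.predict_proba(empty))

# Repeated words matter to the multinomial model but not to the Bernoulli model.
once, many = np.array([[1, 0]]), np.array([[5, 0]])
print(mnb.predict_proba(once), mnb.predict_proba(many))
print(bnb.predict_proba(once), bnb.predict_proba(many))
```

In this toy setup the multinomial model returns the prior for the empty email, while the Bernoulli model leans towards ham because "investment" is missing.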

### 3. Run your program on 
-	Spam versus easy-ham 
-	Spam versus hard-ham.

In [25]:
#Code to report results here

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# predict using the test set and print the results
Y_pred_mnb = mnb.predict(X_test)
print("Multinomial: ")
display(accuracy_score(Y_test, Y_pred_mnb))
print(classification_report(Y_test, Y_pred_mnb))


Y_pred_bnb = bnb.predict(X_test)
print("Bernoulli: ")
display(accuracy_score(Y_test, Y_pred_bnb))
print(classification_report(Y_test, Y_pred_bnb))

Multinomial: 


0.9878971255673222

              precision    recall  f1-score   support

           0       0.96      0.96      0.96        94
           1       0.99      0.99      0.99       567

    accuracy                           0.99       661
   macro avg       0.98      0.98      0.98       661
weighted avg       0.99      0.99      0.99       661

Bernoulli: 


0.8956127080181543

              precision    recall  f1-score   support

           0       0.93      0.29      0.44        94
           1       0.89      1.00      0.94       567

    accuracy                           0.90       661
   macro avg       0.91      0.64      0.69       661
weighted avg       0.90      0.90      0.87       661
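
The assignment asks for true positive and false negative rates, which the classification report shows only indirectly (the recall of a class is its true positive rate). As a hedged sketch, the helper below derives both rates from a confusion matrix. The function name `rates` and the example labels are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions; 1 is ham, 0 is spam as in the notebook.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

def rates(y_true, y_pred, positive):
    """Return (true positive rate, false negative rate) for the given positive class."""
    negative = 1 - positive
    # Order the labels so that ravel() yields tn, fp, fn, tp for `positive`.
    tn, fp, fn, tp = confusion_matrix(
        y_true, y_pred, labels=[negative, positive]
    ).ravel()
    tpr = tp / (tp + fn)
    return tpr, 1 - tpr

ham_tpr, ham_fnr = rates(y_true, y_pred, positive=1)
spam_tpr, spam_fnr = rates(y_true, y_pred, positive=0)
print(f"ham:  TPR={ham_tpr:.2f} FNR={ham_fnr:.2f}")
print(f"spam: TPR={spam_tpr:.2f} FNR={spam_fnr:.2f}")
```

In the notebook, `y_true` and `y_pred` would be `Y_test` and the predictions of either classifier.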



### 4.	To avoid classification based on common and uninformative words it is common to filter these out. 

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. 

**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report your results.

You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.


In [26]:
# remove uninformative words using count vectorizer
vect2 = CountVectorizer(strip_accents='unicode', stop_words='english')

# Do the same as in part 2/3
emails = vect2.fit_transform(easy_ham + hard_ham + spam)
Y = ([1] * len(easy_ham)) + ([1] * len(hard_ham)) + ([0] * len(spam))
X_train, X_test, Y_train, Y_test = train_test_split(emails, Y, test_size=0.2)

bnb = BernoulliNB()
mnb = MultinomialNB()


bnb.fit(X_train, Y_train)
mnb.fit(X_train, Y_train)

Y_pred_mnb = mnb.predict(X_test)
print("Multinomial: ")
display(accuracy_score(Y_test, Y_pred_mnb))
print(classification_report(Y_test, Y_pred_mnb))


Y_pred_bnb = bnb.predict(X_test)
print("Bernoulli: ")
display(accuracy_score(Y_test, Y_pred_bnb))
print(classification_report(Y_test, Y_pred_bnb))

Multinomial: 


0.9939485627836612

              precision    recall  f1-score   support

           0       0.99      0.97      0.98        96
           1       0.99      1.00      1.00       565

    accuracy                           0.99       661
   macro avg       0.99      0.98      0.99       661
weighted avg       0.99      0.99      0.99       661

Bernoulli: 


0.8835098335854765

              precision    recall  f1-score   support

           0       0.81      0.26      0.39        96
           1       0.89      0.99      0.94       565

    accuracy                           0.88       661
   macro avg       0.85      0.62      0.66       661
weighted avg       0.88      0.88      0.86       661



A common word will show up in both ham and spam and will therefore not be indicative of either class.
Even after filtering out the stop words, some uninformative stragglers will remain that we would want to remove.
There is also a downside to this.
Some words might be very common but occur in only one of the classes; by removing them we lose that information and might get a worse fit.

Words that occur only a few times do not give much indication of whether an email is ham or spam, since there are too few occurrences to estimate from.
For example, a ham email about a very specific topic might be the only email that contains a certain word.
If a spam email later contains that same word, we do not want that single occurrence to drive the classification: the word would look strongly indicative of ham even though the rest of the email is not.

We used both `stop_words` and `strip_accents` when filtering with `CountVectorizer`.
We decided to use the built-in options since they provide a good base set, while creating our own list is very hard.
`stop_words='english'` uses a built-in list of common English words that serve a grammatical function rather than carrying meaning.
Creating such a list would take a lot of manual labour, and the ready-made one from `CountVectorizer` provides an easy-to-use solution.
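
For part (a), one possible way to find words that are too common or too rare is to count document frequencies and then let `CountVectorizer` drop them with `max_df` and `min_df`. The toy corpus and the thresholds 0.7 and 2 are assumptions for illustration; suitable values for the email data would have to be found by experiment.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the email texts.
docs = [
    "the meeting is at noon",
    "the offer is free",
    "the free prize is yours",
    "the notes are attached",
]

# Document frequency: in how many documents does each word occur?
binary_vect = CountVectorizer(binary=True)
doc_freq = np.asarray(binary_vect.fit_transform(docs).sum(axis=0)).ravel()
df_by_word = {word: int(doc_freq[col]) for word, col in binary_vect.vocabulary_.items()}
print(sorted(df_by_word.items(), key=lambda kv: -kv[1])[:3])

# Drop words found in more than 70% of documents (max_df)
# or in fewer than 2 documents (min_df).
filtered = CountVectorizer(max_df=0.7, min_df=2)
filtered.fit(docs)
print(sorted(filtered.vocabulary_))
```

Unlike a fixed stop-word list, these thresholds adapt to the corpus, at the risk of removing a frequent word that occurs in only one class.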


### 5. Eking out further performance
Filter out the headers and footers of the emails before you run on them. The format may vary somewhat between emails, which can make this a bit tricky, so perfect filtering is not required. Run your program again and answer the following questions: 
-	Does the result improve from 3 and 4? 
- The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies? 
- What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages? 

Re-estimate your classifier with the `fit_prior` parameter set to `False`, and answer the following questions:
- What does this parameter mean?
- How does this alter the predictions? Discuss why or why not.

In [27]:
# reset the data in the arrays
easy_ham = []
hard_ham = []
spam = []


# Remove part of the header: keep only the text after the last line starting
# with "From: ", which occurs far down in the header
for file_name in easy_ham_files:
    with open(f"./easy_ham/{file_name}", "r") as fd:
        easy_ham.append(fd.read().split("\nFrom: ")[-1])

for file_name in hard_ham_files:
    with open(f"./hard_ham/{file_name}", "r") as fd:
        hard_ham.append(fd.read().split("\nFrom: ")[-1])

for file_name in spam_files:
    with open(f"./spam/{file_name}", "r", encoding="unicode_escape") as fd:
        spam.append(fd.read().split("\nFrom: ")[-1])
      
# Do the same as in part 2/3/4  
vect3 = CountVectorizer(strip_accents='unicode', stop_words='english')
emails = vect3.fit_transform(easy_ham + hard_ham + spam)

Y = ([1] * len(easy_ham)) + ([1] * len(hard_ham)) + ([0] * len(spam))
X_train, X_test, Y_train, Y_test = train_test_split(emails, Y, test_size=0.2)

# To run with a uniform class prior, pass fit_prior=False to both constructors
bnb = BernoulliNB()    # e.g. BernoulliNB(fit_prior=False)
mnb = MultinomialNB()  # e.g. MultinomialNB(fit_prior=False)

bnb.fit(X_train, Y_train)
mnb.fit(X_train, Y_train)

Y_pred_mnb = mnb.predict(X_test)
print("Multinomial: ")
display(accuracy_score(Y_test, Y_pred_mnb))
print(classification_report(Y_test, Y_pred_mnb))


Y_pred_bnb = bnb.predict(X_test)
print("Bernoulli: ")
display(accuracy_score(Y_test, Y_pred_bnb))
print(classification_report(Y_test, Y_pred_bnb))

Multinomial: 


0.9863842662632375

              precision    recall  f1-score   support

           0       0.94      0.95      0.95        84
           1       0.99      0.99      0.99       577

    accuracy                           0.99       661
   macro avg       0.97      0.97      0.97       661
weighted avg       0.99      0.99      0.99       661

Bernoulli: 


0.903177004538578

              precision    recall  f1-score   support

           0       0.76      0.35      0.48        84
           1       0.91      0.98      0.95       577

    accuracy                           0.90       661
   macro avg       0.84      0.66      0.71       661
weighted avg       0.89      0.90      0.89       661



Multinomial did not improve in the runs shown above: accuracy was 0.988 in part 3, 0.994 in part 4 and 0.986 here, and spam (class 0) precision went from 0.96 to 0.99 to 0.94.
These differences are small, and each run uses a different random split, so they may simply reflect the split.
Filtering the data from step 3 to 4 seems to have a greater impact on accuracy than filtering out the headers, which makes sense if the headers look much the same for most emails, so that the words in them have little impact on the classification.
As for Bernoulli, accuracy rose from 0.896 (part 3) and 0.884 (part 4) to 0.903 and spam recall rose from 0.29 and 0.26 to 0.35, but spam precision fell from 0.93 and 0.81 to 0.76.
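
The header filter in the cell above keeps everything after the last "From: " line, which can leave later header fields in the text. An alternative sketch, assuming the usual email layout in which a blank line separates the headers from the body, is shown below. The function name `strip_headers` and the sample message are hypothetical, and footers are not handled.

```python
def strip_headers(raw_email: str) -> str:
    """Return the text after the first blank line, or the whole text if there is none."""
    # Normalise Windows line endings so the blank-line search is uniform.
    text = raw_email.replace("\r\n", "\n")
    head, separator, body = text.partition("\n\n")
    return body if separator else text

sample = (
    "Return-Path: <someone@example.com>\n"
    "From: Someone <someone@example.com>\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
    "Second line.\n"
)
print(strip_headers(sample))
```

Note that this also drops the Subject line, which may carry useful signal for spam detection; whether to keep it is a design choice.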

Since we split the data at random, some very distinctive spam messages may end up only in the test set or only in the training set.
The results might be skewed if, by chance, many emails of one spam category, for example phishing, land in the test set but not in the training set.
The model will then struggle to recognise those emails as spam from the data it has been fitted on.

We came up with two ways to try to remedy this.
The first is to manually categorize the data beforehand and split each category across the training and test sets in the same ratio.
For example, with the spam categories phishing, malware, unwanted ads and general spam, you would split each category according to your training/test ratio so that you train on all categories evenly.
The drawback is that this takes a lot of manual labour and might not be suitable for large data sets.

Another way, which can be done programmatically, is to count the words before the split and try to make sure that both sets cover as much of the vocabulary as possible.
This aims for the same effect as the first method, on the assumption that, for example, phishing emails generally share the same vocabulary.
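
A commonly used remedy that the two ideas above do not cover is stratified k-fold cross-validation: every email is used for testing exactly once, each fold keeps the spam/ham ratio, and the spread of the scores shows how much a single split can skew the result. The sketch below is not the method used in this notebook; the toy corpus, the number of folds and the `random_state` value are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; in the notebook this would be the raw email texts.
texts = ["win money now", "free money offer", "claim free prize", "cheap offer now",
         "meeting at noon", "lunch at noon", "project notes attached", "notes from meeting"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 is spam, 1 is ham

# The pipeline refits the vectorizer inside each fold, so no test vocabulary leaks in.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Each fold keeps the spam/ham ratio of the full data set.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=folds)
print(scores.mean(), scores.std())
```

Reporting the mean and standard deviation over folds gives a more stable estimate than a single random 80/20 split.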

With `fit_prior=True` (the default) the model learns the class prior probabilities, P(ham) and P(spam), from how often each class occurs in the training set.
It combines these priors with the per-word likelihoods to decide whether an email is spam or ham.
When `fit_prior` is turned off, the model uses a uniform prior instead, so both classes start out equally likely and only the words in the email decide the classification.
The per-word likelihoods are estimated in the same way in both cases.

In our runs this lowered the score for both classifiers, which makes sense considering the model has less information to use.
Ham makes up about 86% of the test emails in part 3 (567 of 661), so the learned prior carries real information, and without it borderline emails are shifted towards spam.
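
The effect of `fit_prior` can be checked directly on a fitted model through its `class_log_prior_` and `feature_log_prob_` attributes. The tiny imbalanced count matrix below is an assumption chosen so that one borderline email changes class.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical imbalanced counts: 1 spam row (label 0) and 3 ham rows (label 1).
X = np.array([[2, 0], [0, 1], [0, 2], [1, 1]])
y = np.array([0, 1, 1, 1])

learned = MultinomialNB(fit_prior=True).fit(X, y)
uniform = MultinomialNB(fit_prior=False).fit(X, y)

# Priors: the learned one follows the class frequencies, the uniform one is 0.5 / 0.5.
print(np.exp(learned.class_log_prior_))
print(np.exp(uniform.class_log_prior_))

# The per-word likelihoods are identical for both models.
print(np.allclose(learned.feature_log_prob_, uniform.feature_log_prob_))

# A borderline email: the ham-heavy learned prior tips it to ham (1),
# while the uniform prior lets the word likelihood decide for spam (0).
borderline = np.array([[1, 0]])
print(learned.predict(borderline), uniform.predict(borderline))
```

Only the prior differs between the two models, so any change in predictions comes from emails whose word evidence is weak enough for the prior to matter.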