# DAT405 Introduction to Data Science and AI
## 2022-2023, Reading Period 1
## Assignment 4: Spam classification using Naïve Bayes 
There will be an overall grade for this assignment. To get a pass grade (grade 5), you need to pass items 1-3 below. To receive higher grades, finish items 4 and 5 as well. 

The exercise takes place in a notebook environment where you can choose to use Jupyter or Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical. 
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells. Use the former to document and explain your results, and the latter to write the code snippets that carry out the required tasks.  

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you choose to use Jupyter notebooks, you will have to run the commands in the cell below on your local computer. On Windows, you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.
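
For a local Jupyter setup without `wget` or `tar`, the same download and extraction can be sketched with Python's standard library. This is only a minimal sketch and not part of the original assignment; the helper names `download` and `extract` are hypothetical, and the download loop is left commented out because it needs network access.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = [
    "20021010_easy_ham.tar.bz2",
    "20021010_hard_ham.tar.bz2",
    "20021010_spam.tar.bz2",
]

def download(name, dest_dir="."):
    """Fetch one archive from BASE_URL unless it is already present."""
    target = Path(dest_dir) / name
    if not target.exists():
        urllib.request.urlretrieve(BASE_URL + name, target)
    return target

def extract(archive_path, dest_dir="."):
    """Unpack a .tar.bz2 archive into dest_dir."""
    with tarfile.open(archive_path, "r:bz2") as archive:
        archive.extractall(dest_dir)

# Usage (hypothetical, requires network access):
# for name in ARCHIVES:
#     extract(download(name))
```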



In [1]:
#Download and extract data
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
!tar -xjf 20021010_easy_ham.tar.bz2
!tar -xjf 20021010_hard_ham.tar.bz2
!tar -xjf 20021010_spam.tar.bz2

--2022-11-29 21:59:50--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: '20021010_easy_ham.tar.bz2.1'


2022-11-29 21:59:50 (10.4 MB/s) - '20021010_easy_ham.tar.bz2.1' saved [1677144/1677144]

--2022-11-29 21:59:50--  https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: '20021010_hard_ham.tar.bz2'


2022-11-29 21:59:50 (9.95 MB/s) - '20021010_hard_ham.tar.bz2' saved [1021126/1021126]

--2022-1

The data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.

In [None]:
!ls -lah
!jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

total 96768
drwxr-xr-x     3 oscarcronvall  staff    96B  4 Feb  2022 $RECYCLE.BIN
drwx------+   46 oscarcronvall  staff   1,4K 29 Nov 21:59 .
drwxr-xr-x+   76 oscarcronvall  staff   2,4K 29 Nov 21:59 ..
-rw-r--r--@    1 oscarcronvall  staff    14K 29 Nov 16:57 .DS_Store
drwxr-xr-x     9 oscarcronvall  staff   288B 29 Sep  2021 .idea
drwxr-xr-x     2 oscarcronvall  staff    64B 29 Nov 21:59 .ipynb_checkpoints
-rw-r--r--     1 oscarcronvall  staff     0B 26 Aug  2021 .localized
-rw-r--r--@    1 oscarcronvall  staff    45K 25 Maj  2022 1Password Emergency Kit A3-TRFSBQ-burtcorp.pdf
-rw-r--r--@    1 oscarcronvall  staff   1,6M 23 Nov 12:50 20021010_easy_ham.tar.bz2
-rw-r--r--     1 oscarcronvall  staff   1,6M 29 Jun  2004 20021010_easy_ham.tar.bz2.1
-rw-r--r--     1 oscarcronvall  staff   997K 16 Dec  2004 20021010_hard_ham.tar.bz2
-rw-r--r--     1 oscarcronvall  staff   1,1M 29 Jun  2004 20021010_spam.tar.bz2
-rw-

### 1. Preprocessing:
1.	Note that the email files contain a lot of extra information besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`).


In [None]:
import os
from sklearn.model_selection import train_test_split

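#Function that reads every mail in a folder and returns them as a list of strings
def save_data(directory):
  _list = []

  with os.scandir(directory) as files:
    #Sort by name so the order (and thus the split below) is reproducible
    for file in sorted(files, key=lambda entry: entry.name):
      if not file.is_file():
        continue
      with open(file, "r", encoding="ISO-8859-1") as file_open:
        _list.append(file_open.read())

  return _list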

#Call the save_data function for the three folders
easy_ham_mails = save_data('easy_ham')
hard_ham_mails = save_data('hard_ham')
spam_mails = save_data('spam')
#Making labels for the three different data sets
#and for the combined ham data set
spam_mails_label = ['spam']*len(spam_mails)
hard_ham_mails_label = ['hardham']*(len(hard_ham_mails))
easy_ham_mails_label = ['easyham']*len(easy_ham_mails)
ham_mails_label = ['ham']*(len(easy_ham_mails + hard_ham_mails))

#Using the train_test_split function from sklearn to
#split the data into train and test. The train sample
#is 80% of the data and the test sample is 20% of the data.
#We pass the labels as well, so that each train and test
#list gets a label list with the same number of items.

spam_train, spam_test, spam_label_train, spam_label_test = train_test_split(spam_mails, spam_mails_label, test_size=0.2, random_state=42) #spam

hard_ham_train, hard_ham_test, hard_ham_label_train, hard_ham_label_test = train_test_split(hard_ham_mails, hard_ham_mails_label, test_size=0.2, random_state=42) # hard ham

easy_ham_train, easy_ham_test, easy_ham_label_train, easy_ham_label_test = train_test_split(easy_ham_mails, easy_ham_mails_label, test_size=0.2, random_state=42) # easy ham

ham_train, ham_test, ham_label_train, ham_label_test = train_test_split(easy_ham_mails + hard_ham_mails, ham_mails_label, test_size=0.2, random_state=42) # combined ham

Your discussion here

### 2. Write a Python program that:
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. Sklearn) on `hamtrain` and `spamtrain`, that classifies the test sets and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes Classifier in SKlearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers. 
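
The true positive and false negative rates requested above can be read off a confusion matrix. The following is a minimal sketch, assuming spam is treated as the positive class; the helper `tp_fn_rates` and the toy labels are hypothetical and only illustrate the arithmetic.

```python
from sklearn.metrics import confusion_matrix

def tp_fn_rates(y_true, y_pred, positive="spam", negative="ham"):
    """Return (true positive rate, false negative rate) for the positive class."""
    # With labels=[negative, positive], ravel() yields tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(
        y_true, y_pred, labels=[negative, positive]
    ).ravel()
    return tp / (tp + fn), fn / (tp + fn)

# Toy data (hypothetical): 4 spam mails, of which 3 are caught.
y_true = ["spam", "spam", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "spam"]
tpr, fnr = tp_fn_rates(y_true, y_pred)
print(f"True positive rate: {tpr:.2f}, false negative rate: {fnr:.2f}")
```

The two rates always sum to 1 for a given class, so reporting one of them fixes the other.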





In [None]:
#Necessary imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
import numpy as np


from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

#Naive_Bayes function that trains Multinomial and Bernoulli Naive Bayes
#classifiers on CountVectorizer features (all from the sklearn package) and
#prints their True Positive and False Negative results and accuracy on the test sets.
def Naive_Bayes(data_to_train_one, data_to_train_two, data_to_test_one, data_to_test_two, label_train_one, label_train_two, label_test_one, label_test_two):  
    #Initialization of CountVectorizer, BernoulliNB and MultinomialNB
    multi_NB = MultinomialNB()
    bernoulli_NB = BernoulliNB()
    count_vec = CountVectorizer()

    #Transform the training data into count vectors with CountVectorizer
    #and fit the Multinomial and Bernoulli Naive Bayes classifiers on them
    train = count_vec.fit_transform(data_to_train_one + data_to_train_two)
    multi_NB.fit(train, label_train_one + label_train_two)
    bernoulli_NB.fit(train, label_train_one + label_train_two)

    #Transform the two test data sets with the fitted CountVectorizer
    #and classify them with the Multinomial and Bernoulli
    #Naive Bayes classifiers
    test_one = count_vec.transform(data_to_test_one)
    multi_NB_test_one = multi_NB.predict(test_one)
    bernoulli_NB_test_one = bernoulli_NB.predict(test_one)

    test_two = count_vec.transform(data_to_test_two)
    multi_NB_test_two = multi_NB.predict(test_two)
    bernoulli_NB_test_two = bernoulli_NB.predict(test_two)
    
    #Calculate the accuracy of the Multinomial and Bernoulli
    #Naive Bayes classifiers on the combined test set
    test_set = count_vec.transform(data_to_test_one + data_to_test_two)
    multi_NB_accuracy = multi_NB.score(test_set, label_test_one + label_test_two)
    bernoulli_NB_accuracy = bernoulli_NB.score(test_set, label_test_one + label_test_two)

    #Printing the results. For each test set, the true positive rate is the share
    #of mails assigned to their own class and the false negative rate is the rest.
    results = (("Multinomial", multi_NB_test_one, multi_NB_test_two, multi_NB_accuracy),
               ("Bernoulli", bernoulli_NB_test_one, bernoulli_NB_test_two, bernoulli_NB_accuracy))
    for name, pred_one, pred_two, accuracy in results:
        for label, pred in ((label_test_one[0], pred_one), (label_test_two[0], pred_two)):
            tp_rate = np.mean(pred == label)
            print(f"{name} {label}: True Positive rate {tp_rate:.3f}, False Negative rate {1 - tp_rate:.3f}")
        print(f"{name} Accuracy: {accuracy:,.2f}\n")

In [None]:
Naive_Bayes(ham_train, spam_train, ham_test, spam_test, ham_label_train, spam_label_train, ham_label_test, spam_label_test)

Discussion:

For this part we used the Multinomial and Bernoulli Naive Bayes classifiers and the `CountVectorizer` from Sklearn. Everything we know about how to work with these three classes comes from the official documentation and from examples published by other users. Bernoulli Naive Bayes is the better fit when the features are boolean or binary values (a word is present or absent), whereas Multinomial Naive Bayes is intended for discrete values such as word counts.

As soon as we had looked at the data, we expected the Multinomial classifier to fit better, and the results and the accuracy confirmed our suspicion. Our reasoning was that Bernoulli is limited to binary values: many spam emails are advertisements in which some elements, such as symbols, are repeated many times, and the Bernoulli classifier cannot take these repetitions into account the way the Multinomial classifier does.
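
The difference between the two feature views can be made concrete on a toy corpus. This is a small sketch with made-up documents: `MultinomialNB` works on the raw counts, while `BernoulliNB` with its default `binarize=0.0` treats every count above zero as 1, which is imitated by hand here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents; "free" is repeated three times in the first one.
docs = ["free free free money", "meeting about money"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# MultinomialNB models these raw term counts directly.
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())

# BernoulliNB only sees presence or absence, so "free" appearing
# three times looks the same as "free" appearing once.
binary = (counts > 0).astype(int)
print(binary.toarray())
```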

### 3. Run your program on
-	Spam versus easy-ham 
-	Spam versus hard-ham.

In [None]:
#Spam vs Easy-ham
print("Spam vs Easy-ham \n--------------------------------")
Naive_Bayes(easy_ham_train, spam_train, easy_ham_test, spam_test, easy_ham_label_train, spam_label_train, easy_ham_label_test, spam_label_test)
#Spam vs Hard-ham
print("Spam vs Hard-ham \n--------------------------------")
Naive_Bayes(hard_ham_train, spam_train, hard_ham_test, spam_test, hard_ham_label_train, spam_label_train, hard_ham_label_test, spam_label_test)

### 4. To avoid classification based on common and uninformative words, it is common to filter these out.

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. 

**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report your results.

You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.
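
Both options map onto `CountVectorizer` parameters. The sketch below uses a made-up four-document corpus: `stop_words="english"` removes scikit-learn's built-in stop-word list, while `max_df` removes terms by document frequency (a `min_df` threshold works the same way for rare terms). The threshold 0.7 is an arbitrary illustrative choice, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "the" occurs in 4 of 4 documents, "is" in 3 of 4.
docs = [
    "the offer is free today",
    "the meeting is today",
    "the report is attached",
    "win the free prize",
]

# Option 1: rely on the built-in English stop-word list.
by_list = CountVectorizer(stop_words="english").fit(docs)

# Option 2: drop every term found in more than 70% of the documents.
by_frequency = CountVectorizer(max_df=0.7).fit(docs)

print(sorted(by_list.vocabulary_))
print(sorted(by_frequency.vocabulary_))
```

On this toy corpus both routes drop "the" and "is"; on real mail they will generally differ, since frequency-based filtering also removes corpus-specific terms such as header names.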


In [None]:
# We have decided to use the stop words provided by sklearn.
# We did this because the list contains about 300 words that are all considered 'uninformative'.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from collections import Counter

coding_chars = ['<','>','=','+','@','{','}','[',']','px'] # used to exclude "words" that would occur in HTML code and email addresses

# Returns the terms of all mails in one list, with the stop words removed
def words_exclude_stopwords(mails):
  terms = []
  for mail in mails:
    for word in mail.split():
      # the stop-word list is lowercase, so compare case-insensitively
      if word.lower() not in ENGLISH_STOP_WORDS:
        terms.append(word)
  return terms

# Used to find the most common words in each data set (easy ham, hard ham, spam)
def get_k_common_words(texts, k):
  # Count the whitespace-separated terms across all texts
  counter = Counter(word for text in texts for word in text.split())
  print("Average word occurrence:", average_occurance(counter))
  return counter.most_common(k)

# Same as above, but skips every text that contains a typical code character
def get_k_common_words_no_code(texts, k):
  counter = Counter(
    word
    for text in texts
    if not any(char in text for char in coding_chars)
    for word in text.split()
  )
  print("Average word occurrence:", average_occurance(counter))
  return counter.most_common(k)

# Average number of occurrences per distinct word
def average_occurance(counter):
  if not counter:
    return 0
  return sum(counter.values()) / len(counter)

In [None]:
#Naive_Bayes_exclude function that works like Naive_Bayes above, but lets
#CountVectorizer filter out the English stop words before training and testing
#the Multinomial and Bernoulli Naive Bayes classifiers.
def Naive_Bayes_exclude(data_to_train_one, data_to_train_two, data_to_test_one, data_to_test_two, label_train_one, label_train_two, label_test_one, label_test_two):  
    #Initialization of CountVectorizer, BernoulliNB and MultinomialNB
    multi_NB = MultinomialNB()
    bernoulli_NB = BernoulliNB()
    #stop_words="english" excludes sklearn's built-in English stop-word list;
    #vocabulary=ENGLISH_STOP_WORDS would instead keep only the stop words
    count_vec = CountVectorizer(stop_words="english")

    #Transform the training data into count vectors with CountVectorizer
    #and fit the Multinomial and Bernoulli Naive Bayes classifiers on them
    train = count_vec.fit_transform(data_to_train_one + data_to_train_two)
    multi_NB.fit(train, label_train_one + label_train_two)
    bernoulli_NB.fit(train, label_train_one + label_train_two)

    #Transform the two test data sets with the fitted CountVectorizer
    #and classify them with the Multinomial and Bernoulli
    #Naive Bayes classifiers
    test_one = count_vec.transform(data_to_test_one)
    multi_NB_test_one = multi_NB.predict(test_one)
    bernoulli_NB_test_one = bernoulli_NB.predict(test_one)

    test_two = count_vec.transform(data_to_test_two)
    multi_NB_test_two = multi_NB.predict(test_two)
    bernoulli_NB_test_two = bernoulli_NB.predict(test_two)
    
    #Calculate the accuracy of the Multinomial and Bernoulli
    #Naive Bayes classifiers on the combined test set
    test_set = count_vec.transform(data_to_test_one + data_to_test_two)
    multi_NB_accuracy = multi_NB.score(test_set, label_test_one + label_test_two)
    bernoulli_NB_accuracy = bernoulli_NB.score(test_set, label_test_one + label_test_two)

    #Printing the results. For each test set, the true positive rate is the share
    #of mails assigned to their own class and the false negative rate is the rest.
    results = (("Multinomial", multi_NB_test_one, multi_NB_test_two, multi_NB_accuracy),
               ("Bernoulli", bernoulli_NB_test_one, bernoulli_NB_test_two, bernoulli_NB_accuracy))
    for name, pred_one, pred_two, accuracy in results:
        for label, pred in ((label_test_one[0], pred_one), (label_test_two[0], pred_two)):
            tp_rate = np.mean(pred == label)
            print(f"{name} {label}: True Positive rate {tp_rate:.3f}, False Negative rate {1 - tp_rate:.3f}")
        print(f"{name} Accuracy: {accuracy:,.2f}\n")

In [None]:
# Recreate the training, test and label variables
spam_train, spam_test, spam_label_train, spam_label_test = train_test_split(spam_mails, spam_mails_label, test_size=0.2, random_state=42) #spam

hard_ham_train, hard_ham_test, hard_ham_label_train, hard_ham_label_test = train_test_split(hard_ham_mails, hard_ham_mails_label, test_size=0.2, random_state=42) # hard ham

easy_ham_train, easy_ham_test, easy_ham_label_train, easy_ham_label_test = train_test_split(easy_ham_mails, easy_ham_mails_label, test_size=0.2, random_state=42) # easy ham

ham_train, ham_test, ham_label_train, ham_label_test = train_test_split(easy_ham_mails + hard_ham_mails, ham_mails_label, test_size=0.2, random_state=42) # combined ham

In [None]:
#Spam vs Easy-ham
print("-------------")
print('easy ham')
print("-------------")
Naive_Bayes_exclude(easy_ham_train, spam_train, easy_ham_test, spam_test, easy_ham_label_train, spam_label_train, easy_ham_label_test, spam_label_test)
print("-------------")
print('hard ham')
print("-------------")
#Spam vs Hard-ham
Naive_Bayes_exclude(hard_ham_train, spam_train, hard_ham_test, spam_test, hard_ham_label_train, spam_label_train, hard_ham_label_test, spam_label_test)

### **Below we are printing the 10 most common "words" in each data set.**

We look at which "words" are most frequent to get an understanding of the type of language and/or markup tags used across all the emails. For example, it is easy to see that many of the emails in the hard ham data set contain a variety of HTML tags.
We also want to know the average word occurrence, to get a grasp of how far the most frequent "words" deviate from the rest.

In [None]:
easy_ham_words = words_exclude_stopwords(easy_ham_mails)
print("Top 10 most frequent words in the easy ham data set: ",get_k_common_words(easy_ham_words, 10), "\n ----")
hard_ham_words = words_exclude_stopwords(hard_ham_mails)
print("Top 10 most frequent words in the hard ham data set: ",get_k_common_words(hard_ham_words, 10), "\n ----")
spam_words = words_exclude_stopwords(spam_mails)
print("Top 10 most frequent words in the spam data set: ",get_k_common_words(spam_words, 10))

If we now instead try to remove all the HTML tags and other common characters in code such as brackets we get the following result.

In [None]:
easy_ham_words = words_exclude_stopwords(easy_ham_mails)
print("Top 10 most frequent words in the easy ham data set: ",get_k_common_words_no_code(easy_ham_words, 10), "\n ----")
hard_ham_words = words_exclude_stopwords(hard_ham_mails)
print("Top 10 most frequent words in the hard ham data set: ",get_k_common_words_no_code(hard_ham_words, 10), "\n ----")
spam_words = words_exclude_stopwords(spam_mails)
print("Top 10 most frequent words in the spam data set: ",get_k_common_words_no_code(spam_words, 10))

Something worth noting about the average word occurrence is that the values for the hard ham emails and the spam emails are very similar to each other compared with the easy ham data set. This proves nothing more than that the hard ham data set shares some aspects with the spam data set.

Another interesting observation is that the average word occurrence in the spam data set increased. This suggests that these emails are filled with words that do not add value to the message.

### 5. Eking out further performance
Filter out the headers and footers of the emails before you run on them. The format may vary somewhat between emails, which can make this a bit tricky, so perfect filtering is not required. Run your program again and answer the following questions: 
-	Does the result improve from 3 and 4? 
- The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies? 
- What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages? 
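
One commonly suggested remedy for a skewed single split is stratified k-fold cross-validation, which evaluates on several splits that each preserve the class proportions. The sketch below is only an illustration under stated assumptions: the eight toy mails are made up, and the choice of four folds is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); replace with the real mails and labels.
mails = [
    "free money now", "win a free prize", "cheap pills online",
    "claim your free prize", "meeting at noon", "see attached report",
    "lunch tomorrow", "project status update",
]
labels = ["spam"] * 4 + ["ham"] * 4

# The pipeline refits the vectorizer inside each fold, so no vocabulary
# from the held-out fold leaks into training.
model = make_pipeline(CountVectorizer(), MultinomialNB())
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, mails, labels, cv=folds)
print(f"accuracy per fold: {scores}, mean: {scores.mean():.2f}")
```

The spread of the per-fold scores gives a rough idea of how much a single 80/20 split could mislead.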

Re-estimate your classifier with the `fit_prior` parameter set to `False`, and answer the following questions:
- What does this parameter mean?
- How does this alter the predictions? Discuss why or why not.
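
The effect of `fit_prior` on the class priors can be inspected directly through the fitted `class_log_prior_` attribute. This is a minimal sketch on a made-up, deliberately imbalanced count matrix, not on the assignment data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (hypothetical): three "ham" rows and one "spam" row.
X = np.array([[2, 0], [3, 1], [1, 0], [0, 2]])
y = ["ham", "ham", "ham", "spam"]

learned = MultinomialNB(fit_prior=True).fit(X, y)
uniform = MultinomialNB(fit_prior=False).fit(X, y)

# class_log_prior_ holds log P(class); exponentiate to read probabilities.
learned_prior = dict(zip(learned.classes_, np.exp(learned.class_log_prior_)))
uniform_prior = dict(zip(uniform.classes_, np.exp(uniform.class_log_prior_)))
print("fit_prior=True: ", learned_prior)
print("fit_prior=False:", uniform_prior)
```

With `fit_prior=True` the priors follow the class frequencies in the training data, while `fit_prior=False` gives every class the same prior, so only the word likelihoods separate the classes.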

In [None]:
# Some emails mark their footer with HTML comments: <!-- footer --> ... <!-- /footer -->
# When the first footer tag is found we stop copying words into the filtered string,
# and when the second footer tag is found we start copying words again.

def remove_footer(mail_list):
  result_list = []
  for mail in mail_list:
    discovered_footer = False  # reset for every mail so the state does not leak between mails
    kept_words = []
    for word in mail.split():
      if "footer" in word:
        discovered_footer = not discovered_footer
        continue  # drop the tag itself
      if not discovered_footer:
        kept_words.append(word)
    result_list.append(" ".join(kept_words))  # keep spaces so the tokens stay separate
  return result_list


In [None]:
# Remove all terms that include the word "subject" (in any capitalisation), since the email format uses Subject as a tag in its header
def remove_header(mail_list):
  result_list = []
  for mail in mail_list:
    kept_words = [word for word in mail.split() if "subject" not in word.lower()]
    result_list.append(" ".join(kept_words))  # keep spaces so the tokens stay separate
  return result_list

# Nesting our two functions remove_footer() & remove_header to remove both footers and headers

easy_ham_filtered = remove_footer(remove_header(easy_ham_mails))
hard_ham_filtered = remove_footer(remove_header(hard_ham_mails))
spam_filtered = remove_footer(remove_header(spam_mails))
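
An alternative way to drop the whole header block, rather than only the terms containing "subject", relies on the email format itself: the headers end at the first blank line. The following is a minimal sketch of that idea; the helper `strip_headers` is hypothetical and is not the function used in the cells above.

```python
def strip_headers(mail):
    """Return the body of a raw email, i.e. the text after the first blank line."""
    # Normalise Windows line endings so the blank-line search is uniform.
    text = mail.replace("\r\n", "\n")
    header, separator, body = text.partition("\n\n")
    # If there is no blank line, keep the mail unchanged rather than drop it.
    return body if separator else text

raw = "From: a@example.com\nSubject: Hello\n\nSee you at noon.\n"
print(strip_headers(raw))
```

Keeping mails without a blank line unchanged is a cautious design choice: imperfect filtering is acceptable here, whereas silently emptying a mail would distort the data set.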

In [None]:
#Assuring ourselves that both footer and subject tags are gone
#(checks every mail, since the lists hold whole mails rather than single words)
for name, mails in (("easy ham", easy_ham_filtered), ("hard ham", hard_ham_filtered), ("spam", spam_filtered)):
  print(name, "contains subject:", any("subject" in mail.lower() for mail in mails))
  print(name, "contains footer:", any("footer" in mail for mail in mails))


In [None]:
#Making new labels for the filtered data sets
spam_mails_label = ['spam']*len(spam_filtered)
hard_ham_mails_label = ['hardham']*(len(hard_ham_filtered))
easy_ham_mails_label = ['easyham']*len(easy_ham_filtered)
ham_mails_label = ['ham']*(len(easy_ham_filtered + hard_ham_filtered))

# Recreate the training, test and label variables from the filtered mails
spam_train, spam_test, spam_label_train, spam_label_test = train_test_split(spam_filtered, spam_mails_label, test_size=0.2, random_state=42) #spam

hard_ham_train, hard_ham_test, hard_ham_label_train, hard_ham_label_test = train_test_split(hard_ham_filtered, hard_ham_mails_label, test_size=0.2, random_state=42) # hard ham

easy_ham_train, easy_ham_test, easy_ham_label_train, easy_ham_label_test = train_test_split(easy_ham_filtered, easy_ham_mails_label, test_size=0.2, random_state=42) # easy ham

ham_train, ham_test, ham_label_train, ham_label_test = train_test_split(easy_ham_filtered + hard_ham_filtered, ham_mails_label, test_size=0.2, random_state=42) # combined ham

In [None]:
#Spam vs Easy-ham
print("-------------")
print('easy ham')
print("-------------")
Naive_Bayes_exclude(easy_ham_train, spam_train, easy_ham_test, spam_test, easy_ham_label_train, spam_label_train, easy_ham_label_test, spam_label_test)
print("-------------")
print('hard ham')
print("-------------")
#Spam vs Hard-ham
Naive_Bayes_exclude(hard_ham_train, spam_train, hard_ham_test, spam_test, hard_ham_label_train, spam_label_train, hard_ham_label_test, spam_label_test)