# DAT405 Introduction to Data Science and AI
## 2022-2023, Reading Period 1
## Assignment 4: Spam classification using Naïve Bayes 
There will be an overall grade for this assignment. To get a pass grade (grade 5), you need to pass items 1-3 below. To receive higher grades, finish items 4 and 5 as well. 

The exercise takes place in a notebook environment, where you can choose between Jupyter and Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical.
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells. Use Markdown cells to document and explain your results, and code cells to write the code snippets that carry out the required tasks.

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you choose to use Jupyter notebooks, you will have to run the commands in the cell below on your local computer. On Windows, you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.
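
For a local Jupyter setup where `wget` and `tar` are not available, the same download and extraction steps can be sketched in plain Python using only the standard library. The helper names `extract_archive` and `download_and_extract` are made up for this illustration; the URLs are the ones used in the download cell.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = [
    "20021010_easy_ham.tar.bz2",
    "20021010_hard_ham.tar.bz2",
    "20021010_spam.tar.bz2",
]


def extract_archive(archive_path, dest="."):
    """Unpack one bzip2-compressed tar archive into the folder dest."""
    with tarfile.open(archive_path, "r:bz2") as archive:
        archive.extractall(dest)


def download_and_extract(dest="."):
    """Hypothetical helper: fetch each archive (if missing) and unpack it."""
    for name in ARCHIVES:
        target = Path(dest) / name
        if not target.exists():
            urllib.request.urlretrieve(BASE_URL + name, target)
        extract_archive(target, dest)
```

Calling `download_and_extract()` once from a code cell should leave the same `easy_ham`, `hard_ham`, and `spam` folders next to the notebook as the shell commands do.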



In [2]:
#Download and extract data
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
!tar -xjf 20021010_easy_ham.tar.bz2
!tar -xjf 20021010_hard_ham.tar.bz2
!tar -xjf 20021010_spam.tar.bz2

--2022-11-26 15:14:50--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: ‘20021010_easy_ham.tar.bz2’


2022-11-26 15:14:51 (27.9 MB/s) - ‘20021010_easy_ham.tar.bz2’ saved [1677144/1677144]

--2022-11-26 15:14:51--  https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: ‘20021010_hard_ham.tar.bz2’


2022-11-26 15:14:52 (19.9 MB/s) - ‘20021010_hard_ham.tar.bz2’ saved

The data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.

In [3]:
!ls -lah
!jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

total 4.0M
drwxr-xr-x 1 root root 4.0K Nov 26 15:14 .
drwxr-xr-x 1 root root 4.0K Nov 26 15:14 ..
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2
-rw-r--r-- 1 root root 1.2M Jun 29  2004 20021010_spam.tar.bz2
drwxr-xr-x 4 root root 4.0K Nov 22 00:13 .config
drwx--x--x 2  500  500 184K Oct 10  2002 easy_ham
drwx--x--x 2 1000 1000  20K Dec 16  2004 hard_ham
drwxr-xr-x 1 root root 4.0K Nov 22 00:14 sample_data
drwxr-xr-x 2  500  500  36K Oct 10  2002 spam
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-notebook", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/jupyter_core/application.py", line 269, in launch_instance
    return super().launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py", line 845, in launch_instance
    app.initialize(argv)
  File "/usr/local/lib/python3.7/dist-pack

### 1. Preprocessing:
1.	Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`).


In [73]:
import os
from sklearn.model_selection import train_test_split

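#Function that reads each mail in the folder into a list of strings
def save_data(directory):
  mails = []

  with os.scandir(directory) as entries:
    for entry in entries:
      #Skip anything that is not a regular file
      if not entry.is_file():
        continue
      #ISO-8859-1 can decode any byte, so no mail fails to load
      with open(entry.path, "r", encoding="ISO-8859-1") as mail_file:
        mails.append(mail_file.read())

  return mails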

#Call the save_data function for the three folders
easy_ham_mails = save_data('easy_ham')
hard_ham_mails = save_data('hard_ham')
spam_mails = save_data('spam')

#Making labels for the three different data sets
#and for the combined ham data set
spam_mails_label = ['spam']*len(spam_mails)
hard_ham_mails_label = ['hardham']*(len(hard_ham_mails))
easy_ham_mails_label = ['easyham']*len(easy_ham_mails)
ham_mails_label = ['ham']*(len(easy_ham_mails + hard_ham_mails))

#Use the train_test_split function from sklearn to
#split the data into train and test sets: 80% of the data
#goes to training and 20% to testing. The labels are passed
#along so that each train and test list gets a label list
#of matching length.
spam_train, spam_test, spam_label_train, spam_label_test = train_test_split(spam_mails, spam_mails_label, test_size=0.2, random_state=42)
hard_ham_train, hard_ham_test, hard_ham_label_train, hard_ham_label_test = train_test_split(hard_ham_mails, hard_ham_mails_label, test_size=0.2, random_state=42)
easy_ham_train, easy_ham_test, easy_ham_label_train, easy_ham_label_test = train_test_split(easy_ham_mails, easy_ham_mails_label, test_size=0.2, random_state=42)
ham_train, ham_test, ham_label_train, ham_label_test = train_test_split(easy_ham_mails + hard_ham_mails, ham_mails_label, test_size=0.2, random_state=42)




Your discussion here

### 2. Write a Python program that:
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. from Sklearn) on `hamtrain` and `spamtrain`, classifies the test sets, and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes classifier in Sklearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem:
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers. 
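
One way to see the difference between the two classifiers is to feed each of them both raw word counts and binary presence/absence features. The sketch below uses a made-up four-mail toy corpus (not the SpamAssassin data). It relies on `BernoulliNB` binarizing its input by default (`binarize=0.0`), so it only looks at whether a word occurs, while `MultinomialNB` uses how often it occurs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus, invented for illustration only
docs = [
    "free free free money",
    "free money now",
    "meeting notes attached",
    "project meeting today",
]
labels = ["spam", "spam", "ham", "ham"]

# Same vocabulary, two encodings: word counts vs. 0/1 word presence
counts = CountVectorizer().fit_transform(docs)
presence = CountVectorizer(binary=True).fit_transform(docs)

# BernoulliNB thresholds its input at 0, so both encodings give the same model
bern_counts = BernoulliNB().fit(counts, labels)
bern_presence = BernoulliNB().fit(presence, labels)
same_bernoulli = np.allclose(
    bern_counts.feature_log_prob_, bern_presence.feature_log_prob_
)

# MultinomialNB uses the counts themselves, so repeated words change the model
multi_counts = MultinomialNB().fit(counts, labels)
multi_presence = MultinomialNB().fit(presence, labels)
same_multinomial = np.allclose(
    multi_counts.feature_log_prob_, multi_presence.feature_log_prob_
)

print(f"Bernoulli models identical:   {same_bernoulli}")
print(f"Multinomial models identical: {same_multinomial}")
```

In this toy setup the repeated word "free" shifts the multinomial estimates but leaves the Bernoulli ones unchanged, which is why plain `CountVectorizer` output is acceptable input for both models.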





In [84]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
import numpy as np

def Naive_Bayes(data_to_train_one, data_to_train_two, data_to_test_one, data_to_test_two, label_train_one, label_train_two, label_test_one, label_test_two):
    multi_NB = MultinomialNB()
    bernoulli_NB = BernoulliNB()
    count_vec = CountVectorizer()

    #Learn the vocabulary from the training mails and build the count matrix
    train = count_vec.fit_transform(data_to_train_one + data_to_train_two)

    #Vectorize the test mails with the vocabulary learned from the training mails
    test_one = count_vec.transform(data_to_test_one)
    test_two = count_vec.transform(data_to_test_two)
    test_set = count_vec.transform(data_to_test_one + data_to_test_two)

    #Train both classifiers on the same training matrix
    multi_NB.fit(train, label_train_one + label_train_two)
    bernoulli_NB.fit(train, label_train_one + label_train_two)

    #Predict the classes of the two test sets
    multi_NB_test_one = multi_NB.predict(test_one)
    bernoulli_NB_test_one = bernoulli_NB.predict(test_one)

    multi_NB_test_two = multi_NB.predict(test_two)
    bernoulli_NB_test_two = bernoulli_NB.predict(test_two)

    #Count the correct predictions per class. Comparing against the label
    #also works when a class is never predicted (the count is then 0).
    print(f"True positives (Multinomial): {np.sum(multi_NB_test_one == label_test_one[0])}")
    print(f"True positives (Bernoulli):   {np.sum(bernoulli_NB_test_one == label_test_one[0])}")

    print()

    print(f"True negatives (Multinomial): {np.sum(multi_NB_test_two == label_test_two[0])}")
    print(f"True negatives (Bernoulli):   {np.sum(bernoulli_NB_test_two == label_test_two[0])}")

    print()

    #Overall accuracy on the combined test set
    print(f"Accuracy for multinomial test: {multi_NB.score(test_set, label_test_one + label_test_two):.2f}")
    print(f"Accuracy for bernoulli test: {bernoulli_NB.score(test_set, label_test_one + label_test_two):.2f}")

    print()

In [85]:
Naive_Bayes(ham_train, spam_train, ham_test, spam_test, ham_label_train, spam_label_train, ham_label_test, spam_label_test)

True positives (Multinomial): 560
True positives (Bernoulli):   560

True negatives (Multinomial): 87
True negatives (Bernoulli):   23

Accuracy for multinomial test: 0.98
Accuracy for bernoulli test: 0.88
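
The task asks for True Positive and False Negative *rates*, while the output above reports counts. A minimal sketch of the conversion follows, assuming one class is chosen as the positive class; the function name `positive_rates` and the toy labels are illustrative and not part of the notebook.

```python
def positive_rates(true_labels, predicted_labels, positive):
    """Return (true positive rate, false negative rate) for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # Positives that were recognised as positive
    tp = sum(1 for truth, pred in pairs if truth == positive and pred == positive)
    # Positives that were classified as something else
    fn = sum(1 for truth, pred in pairs if truth == positive and pred != positive)
    if tp + fn == 0:
        raise ValueError("no examples of the positive class in true_labels")
    return tp / (tp + fn), fn / (tp + fn)


# Toy example: three ham mails, one of them misclassified as spam
truth = ["ham", "ham", "ham", "spam"]
predicted = ["ham", "spam", "ham", "spam"]
tpr, fnr = positive_rates(truth, predicted, positive="ham")
print(f"TPR: {tpr:.2f}, FNR: {fnr:.2f}")
```

The two rates always sum to 1 for a given positive class, so reporting either one together with the size of the test set is enough to recover the counts.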



### 3. Run your program on:
-	Spam versus easy-ham 
-	Spam versus hard-ham.

In [None]:
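#Spam vs Easy-ham
Naive_Bayes(easy_ham_train, spam_train, easy_ham_test, spam_test, easy_ham_label_train, spam_label_train, easy_ham_label_test, spam_label_test)

#Spam vs Hard-ham
Naive_Bayes(hard_ham_train, spam_train, hard_ham_test, spam_test, hard_ham_label_train, spam_label_train, hard_ham_label_test, spam_label_test)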

### 4. To avoid classification based on common and uninformative words, it is common to filter these out.

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. 

**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report your results.

You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.
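
Both options can be sketched on a small made-up corpus. The threshold values `max_df=0.9` and `min_df=2` and the manual stop list below are arbitrary choices for illustration, not recommended settings for the SpamAssassin data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the meeting is moved to friday",
    "the invoice for the project is attached",
    "win the prize money now",
    "claim the free prize now",
]

# Part (a): fraction of documents each word appears in
binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(docs)
doc_freq = np.asarray(presence.sum(axis=0)).ravel() / len(docs)
doc_frequency = {term: doc_freq[idx] for term, idx in binary_vec.vocabulary_.items()}

# Option 1: let CountVectorizer filter by document frequency.
# max_df=0.9 drops words found in more than 90% of the documents,
# min_df=2 drops words found in fewer than 2 documents.
auto_vec = CountVectorizer(max_df=0.9, min_df=2)
auto_vec.fit(docs)
auto_vocab = sorted(auto_vec.vocabulary_)

# Option 2: pass an explicit list of words to ignore
manual_vec = CountVectorizer(stop_words=["the", "is"])
manual_vec.fit(docs)
manual_vocab = sorted(manual_vec.vocabulary_)

print("Kept by frequency thresholds:", auto_vocab)
print("Kept with manual stop list:  ", manual_vocab)
```

The frequency thresholds adapt to whatever corpus they are fitted on, whereas an explicit list gives full control but has to be maintained by hand; `CountVectorizer` also accepts `stop_words="english"` for a built-in English list.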


### 5. Eking out further performance
Filter out the headers and footers of the emails before you run on them. The format may vary somewhat between emails, which can make this a bit tricky, so perfect filtering is not required. Run your program again and answer the following questions: 
-	Does the result improve from 3 and 4? 
- The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies? 
- What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages? 
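
For the header and footer filtering mentioned above, one possible starting point is Python's standard `email` module, which separates the header block from the body. The footer rule below is only a heuristic assumed for this sketch: it cuts at a `-- ` signature delimiter or at a long line of dashes or underscores. The helper names `strip_headers` and `strip_footer` are hypothetical.

```python
import email
import re

# Heuristic footer markers (an assumption, not a standard):
# a "-- " signature line, or a line of 10+ underscores or dashes
FOOTER_PATTERN = re.compile(r"\n(?:-- |_{10,}|-{10,})[ \t]*\n")


def strip_headers(raw_mail):
    """Return only the text parts of a mail, without the header block."""
    message = email.message_from_string(raw_mail)
    bodies = []
    for part in message.walk():
        # Multipart containers hold other parts; keep only text leaves
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                bodies.append(payload)
    return "\n".join(bodies)


def strip_footer(body):
    """Cut the body at the first footer marker, if any is found."""
    return FOOTER_PATTERN.split(body, maxsplit=1)[0].strip()


raw = "From: alice@example.com\nSubject: hi\n\nHello world\n-- \nAlice's signature"
cleaned = strip_footer(strip_headers(raw))
print(cleaned)
```

Bodies that use base64 or quoted-printable transfer encoding are left encoded by this sketch, and mails without a recognisable footer marker pass through unchanged, so the filtering is deliberately imperfect.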

Re-estimate your classifier with the `fit_prior` parameter set to `False`, and answer the following questions:
- What does this parameter mean?
- How does this alter the predictions? Discuss why or why not.
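
The effect of `fit_prior` on the class priors can be inspected directly through the fitted `class_log_prior_` attribute. The sketch below uses a made-up, deliberately imbalanced toy data set (three ham rows, one spam row) rather than the real mails.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 mails, 2 vocabulary words (invented data)
X = np.array([[2, 0], [1, 0], [3, 1], [0, 2]])
y = np.array(["ham", "ham", "ham", "spam"])

# fit_prior=True (the default): priors are estimated from class frequencies
learned = MultinomialNB(fit_prior=True).fit(X, y)
# fit_prior=False: every class gets the same prior
uniform = MultinomialNB(fit_prior=False).fit(X, y)

# Priors are stored as logs, in the order given by classes_
learned_prior = np.exp(learned.class_log_prior_)
uniform_prior = np.exp(uniform.class_log_prior_)

print("classes:      ", learned.classes_)
print("learned prior:", learned_prior)
print("uniform prior:", uniform_prior)
```

With the learned prior, the majority class gets a head start before any word is looked at; with the uniform prior, only the word likelihoods separate the classes.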