# DAT405 Introduction to Data Science and AI 
## Assignment 4: Spam classification using Naïve Bayes 

The exercise takes place in a notebook environment, where you can choose between Jupyter and Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical. 
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells: use the former to document and explain your results, and the latter to write code snippets that execute the required tasks.  

In this assignment you will implement a Naïve Bayes classifier in Python that classifies emails into spam and non-spam (“ham”) classes. Your program should be able to train on given spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
- easy-ham: non-spam messages, typically quite easy to differentiate from spam messages. 
- hard-ham: non-spam messages that are more difficult to differentiate from spam messages. 
- spam: spam messages. 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you choose to use Jupyter notebooks, you will have to run the commands in the cell below on your local computer. On Windows, you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.



In [1]:
#Download and extract data
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
!tar -xjf 20021010_easy_ham.tar.bz2
!tar -xjf 20021010_hard_ham.tar.bz2
!tar -xjf 20021010_spam.tar.bz2

--2021-02-15 16:35:40--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 207.244.88.140, 95.216.26.30
Connecting to spamassassin.apache.org (spamassassin.apache.org)|207.244.88.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1,6M) [application/x-bzip2]
Saving to: '20021010_easy_ham.tar.bz2'


The data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.

In [2]:
!ls -lah

'ls' is not recognized as an internal or external command,
operable program or batch file.
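
The error above suggests a Windows shell, where `wget`, `tar`, and `ls` may not be available. A minimal sketch of the same steps using only the Python standard library follows. The helper names `extract_archive`, `fetch_and_extract`, and `list_folder` are hypothetical and not part of the assignment, and the target directory is an assumption you may need to adapt.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = [
    "20021010_easy_ham.tar.bz2",
    "20021010_hard_ham.tar.bz2",
    "20021010_spam.tar.bz2",
]


def extract_archive(archive_path, dest="."):
    """Unpack a .tar.bz2 archive into dest (roughly `tar -xjf`)."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as archive:
        archive.extractall(dest)
    return dest


def fetch_and_extract(dest="."):
    """Download each corpus archive into dest and unpack it there."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for name in ARCHIVES:
        archive_path = dest / name
        if not archive_path.exists():  # skip archives already downloaded
            urllib.request.urlretrieve(BASE_URL + name, str(archive_path))
        extract_archive(archive_path, dest)


def list_folder(path="."):
    """Return the sorted entry names in path (a rough stand-in for `ls`)."""
    return sorted(p.name for p in Path(path).iterdir())


# Usage (requires network access):
# fetch_and_extract(".")
# print(list_folder("."))
```

The download is left commented out so that the cell has no side effects until you choose to run it.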


### 1. Preprocessing
1. Note that the email files contain a lot of extra information besides the actual message. You can ignore this and run your program on the entire text. 
2. We don’t want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`). **0.5p**
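
One way to approach the split is sketched below using only the standard library. The helper names `load_emails` and `split_train_test`, the 30% test fraction, and the `latin-1` encoding are assumptions made for illustration, not requirements of the assignment.

```python
import random
from pathlib import Path


def load_emails(folder):
    """Read every file in folder and return one text string per email."""
    emails = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # latin-1 maps every byte to a character, so decoding never fails
            emails.append(path.read_text(encoding="latin-1"))
    return emails


def split_train_test(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Usage, assuming the extracted folders are in the working directory:
# hamtrain, hamtest = split_train_test(load_emails("easy_ham"))
# spamtrain, spamtest = split_train_test(load_emails("spam"))
```

`sklearn.model_selection.train_test_split` does the same job and is a reasonable alternative if you prefer to stay within Sklearn.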


In [3]:
# Pre-processing code here

# Read one example email; the context manager closes the file afterwards.
with open("data/easy_ham/0350.dabd503455d16db68a0e4bad24ffc736", "r") as my_file:
    content_list = my_file.readlines()
print(content_list)

['From sentto-2242572-56028-1034088521-zzzz=example.com@returns.groups.yahoo.com  Tue Oct  8 17:02:19 2002\n', 'Return-Path: <sentto-2242572-56028-1034088521-zzzz=example.com@returns.groups.yahoo.com>\n', 'Delivered-To: zzzz@localhost.example.com\n', 'Received: from localhost (jalapeno [127.0.0.1])\n', '\tby example.com (Postfix) with ESMTP id DC7B516F17\n', '\tfor <zzzz@localhost>; Tue,  8 Oct 2002 17:01:48 +0100 (IST)\n', 'Received: from jalapeno [127.0.0.1]\n', '\tby localhost with IMAP (fetchmail-5.9.0)\n', '\tfor zzzz@localhost (single-drop); Tue, 08 Oct 2002 17:01:48 +0100 (IST)\n', 'Received: from n13.grp.scd.yahoo.com (n13.grp.scd.yahoo.com\n', '    [66.218.66.68]) by dogma.slashnull.org (8.11.6/8.11.6) with SMTP id\n', '    g98EuBK20554 for <zzzz@example.com>; Tue, 8 Oct 2002 15:56:14 +0100\n', 'X-Egroups-Return: sentto-2242572-56028-1034088521-zzzz=example.com@returns.groups.yahoo.com\n', 'Received: from [66.218.67.193] by n13.grp.scd.yahoo.com with NNFMP;\n', '    08 Oct 200

### 2. Write a Python program that: 
1. Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`). 
2. Trains a Naïve Bayes classifier (e.g. from Sklearn) on `hamtrain` and `spamtrain`, classifies the test sets, and reports the True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes classifier in Sklearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem:
    - Multinomial Naive Bayes
    - Bernoulli Naive Bayes
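
The steps above could be sketched as follows on toy data. The four short lists are stand-ins for the real datasets, the `evaluate` helper is hypothetical, and treating spam as the positive class is an assumption; state clearly which convention you use in your own report.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy stand-ins for the four datasets (assumption: one string per email).
hamtrain = ["meeting at noon tomorrow", "please review the attached report",
            "lunch with the team today"]
spamtrain = ["win free money now", "free prize click now",
             "cheap money offer click"]
hamtest = ["team meeting tomorrow", "review the report today"]
spamtest = ["free money prize", "click now for cheap offer"]


def evaluate(classifier):
    """Fit on the training sets and return rates measured on the test sets."""
    vectorizer = CountVectorizer()
    # Learn the vocabulary from the training data only, then reuse it.
    X_train = vectorizer.fit_transform(hamtrain + spamtrain)
    y_train = [0] * len(hamtrain) + [1] * len(spamtrain)  # 0 = ham, 1 = spam
    classifier.fit(X_train, y_train)

    ham_pred = classifier.predict(vectorizer.transform(hamtest))
    spam_pred = classifier.predict(vectorizer.transform(spamtest))

    # With spam as the positive class:
    return {
        "tpr": (spam_pred == 1).mean(),         # spam correctly flagged
        "fnr": (spam_pred == 0).mean(),         # spam that was missed
        "ham_correct": (ham_pred == 0).mean(),  # ham correctly kept
    }


for clf in (MultinomialNB(), BernoulliNB()):
    print(type(clf).__name__, evaluate(clf))
```

Note that the vectorizer is fitted on the training texts only, so words that occur solely in the test sets are ignored at prediction time.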






In [4]:
#Code here

a) Explain how the classifiers differ. What different interpretations do they have? **1p** 

Your discussion here

### 3. Run your program on 
- Spam versus easy-ham 
- Spam versus (hard-ham + easy-ham) 

Discuss your results. **2.5p** 
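
A possible way to organise the two runs is to loop over named scenarios, as in the sketch below. The toy dictionaries `easy_ham`, `hard_ham`, and `spam` are illustrative stand-ins for your real train/test splits, and per-class accuracy is only one of several measures you might report.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins; in the assignment these would be lists of email texts.
easy_ham = {"train": ["see you at the meeting", "notes from the lecture"],
            "test": ["lecture notes attached"]}
hard_ham = {"train": ["newsletter: free shipping on your order"],
            "test": ["your order newsletter"]}
spam = {"train": ["free money click now", "win a free prize now"],
        "test": ["click to win money"]}

scenarios = {
    "spam vs easy-ham": (easy_ham["train"], easy_ham["test"]),
    "spam vs easy-ham + hard-ham": (easy_ham["train"] + hard_ham["train"],
                                    easy_ham["test"] + hard_ham["test"]),
}

results = {}
for name, (ham_train, ham_test) in scenarios.items():
    # A fresh pipeline per scenario so no state leaks between runs.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    texts = ham_train + spam["train"]
    labels = ["ham"] * len(ham_train) + ["spam"] * len(spam["train"])
    model.fit(texts, labels)
    # Fraction of held-out messages of each class that is labelled correctly.
    results[name] = {
        "ham_accuracy": model.score(ham_test, ["ham"] * len(ham_test)),
        "spam_accuracy": model.score(spam["test"], ["spam"] * len(spam["test"])),
    }
    print(name, results[name])
```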

In [5]:
#Code to report results here

### 4. To avoid classification based on common and uninformative words, it is common to filter these out. 

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. **1p** 

**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report and discuss your results. You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making. **1p** 
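
The two options could look roughly like this on a toy corpus. The thresholds `max_df=0.7` and `min_df=2` are arbitrary values chosen for illustration; suitable values for the real data are for you to determine and justify.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the offer is free",
    "the meeting is today",
    "the report is attached",
    "free prize for the winner",
]

# Option 1: Sklearn's built-in English stop-word list.
builtin = CountVectorizer(stop_words="english")
builtin.fit(docs)

# Option 2: frequency thresholds. max_df drops words that appear in more
# than 70% of the documents; min_df drops words that appear in fewer than
# 2 documents.
thresholded = CountVectorizer(max_df=0.7, min_df=2)
thresholded.fit(docs)

print(sorted(builtin.vocabulary_))
print(sorted(thresholded.vocabulary_))
```

A custom list of words, such as one found in part (a), can be passed as `stop_words=[...]` instead of `"english"`.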


### 5. Eking out further performance
**a.**  Use a lemmatizer to normalize the text (for example from the `nltk` library). For one implementation look at the documentation ([here](https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes)). Run your program again and answer the following questions: 
  - Why can lemmatization help?
  - Does the result improve compared with 3 and 4? Discuss. **1.5p** 
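
A lemmatizer can be plugged into `CountVectorizer` through its `tokenizer` parameter. The sketch below uses `toy_lemmatize`, a hypothetical stand-in that only strips a plural "s", so that the example runs without extra downloads; it is not a real lemmatizer. The commented lines show how `nltk` could be swapped in, assuming `nltk` and its WordNet data are installed.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def toy_lemmatize(token):
    """Hypothetical stand-in for a real lemmatizer: strips a plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


class LemmaTokenizer:
    """Callable tokenizer that lemmatizes each token before counting."""

    def __init__(self, lemmatize=toy_lemmatize):
        self.lemmatize = lemmatize

    def __call__(self, doc):
        tokens = re.findall(r"[a-z]+", doc.lower())
        return [self.lemmatize(t) for t in tokens]


# To use nltk instead (assumption: nltk and its WordNet data are installed):
#   from nltk.stem import WordNetLemmatizer
#   tokenizer = LemmaTokenizer(WordNetLemmatizer().lemmatize)
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(), token_pattern=None)
counts = vectorizer.fit_transform(["free offers today", "one free offer"])
print(sorted(vectorizer.vocabulary_))
```

Here "offers" and "offer" are counted as the same feature, which is the effect lemmatization is meant to achieve.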







**b.** The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions for remedies? 
What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages? **1p** 
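
One commonly used way to reduce the dependence on a single split is stratified k-fold cross-validation, sketched below on toy data. This is only one possible remedy, and the choice of three folds is an assumption made to fit the tiny example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

ham = ["meeting at noon", "report attached", "lunch today", "see the notes",
       "project update", "call me later"]
spam = ["free money now", "win a prize", "cheap offer click", "click to win",
        "free prize offer", "money back now"]
texts = ham + spam
labels = [0] * len(ham) + [1] * len(spam)  # 0 = ham, 1 = spam

model = make_pipeline(CountVectorizer(), MultinomialNB())
# Each fold keeps the ham/spam ratio of the full data set.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=folds)
print(scores.mean(), scores.std())
```

Reporting the mean and spread over folds gives a sense of how much the result varies with the split.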

**c.** Re-estimate your classifier with the `fit_prior` parameter set to `False`, and answer the following questions:
  - What does this parameter mean?
  - How does this alter the predictions? Discuss why or why not. **0.5p** 
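
To inspect what the parameter changes, you can compare the fitted class priors directly. The sketch below uses a deliberately imbalanced toy training set, which is an assumption for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Imbalanced toy training set: 3 ham messages, 1 spam message.
texts = ["meeting at noon", "report attached", "lunch today", "free money now"]
labels = [0, 0, 0, 1]  # 0 = ham, 1 = spam

X = CountVectorizer().fit_transform(texts)

learned = MultinomialNB(fit_prior=True).fit(X, labels)
uniform = MultinomialNB(fit_prior=False).fit(X, labels)

# class_log_prior_ holds log P(class); exponentiate to read probabilities.
print(np.exp(learned.class_log_prior_))  # class frequencies: 0.75 and 0.25
print(np.exp(uniform.class_log_prior_))  # uniform prior: 0.5 and 0.5
```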

**d.** The Python model includes smoothing (the `alpha` parameter). Explain why this can be important. 
  - What would happen if, in the training data set, the word 'money' only appears in spam examples? What would the model predict about a message containing the word 'money'? Does the prediction depend on the rest of the message, and is that reasonable? Explain your reasoning. **1p** 
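
A small experiment can help you explore this question. In the toy data below, which is invented for illustration, 'money' occurs only in spam while 'meeting' and 'today' occur in both classes; the sketch compares the default `alpha=1.0` with a value close to zero.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["meeting report today", "lunch meeting today",
         "free money today", "money meeting offer"]
labels = [0, 0, 1, 1]  # 0 = ham, 1 = spam; 'money' occurs only in spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
# Mostly ham-like words plus a single occurrence of 'money'.
message = vectorizer.transform(["meeting today: money for the meeting today"])

p_spam = {}
for alpha in (1.0, 1e-9):
    model = MultinomialNB(alpha=alpha).fit(X, labels)
    p_spam[alpha] = model.predict_proba(message)[0, 1]
    print(f"alpha={alpha}: P(spam | message) = {p_spam[alpha]:.3f}")
```

Comparing the two printed probabilities shows how strongly a single word can dominate the prediction when smoothing is almost switched off.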

### What to report and how to hand in.

- Report all results in the notebook in a clear and appropriate way, using either plots or code output (e.g. print statements). 
- The notebook must be reproducible: we must be able to use the `Run all` function from the `Runtime` menu and reproduce all your results. **Please check this before handing in.** 
- Save the notebook and share a link to it (press `Share` in the upper right corner and use the `Get link` option). **Please make sure that anyone with the link can open and edit.**
- Edits made after the submission deadline will be ignored; graders will recover the last version saved before the deadline from the revision history.
- **Please make sure all cells are executed and all the output is clearly readable/visible to anybody opening the notebook.**