# DATA 620 - Assignment 6

Jeremy OBrien, Mael Illien, Vanita Thompson

## Document Classification

* It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data: UCI Machine Learning Repository: Spambase Data Set (http://archive.ics.uci.edu/ml/datasets/Spambase)
* For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or drawn from another source, such as your own spam folder).
* For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


Resources:

- http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
- https://www.cs.bgu.ac.il/~elhadad/nlp16/spam_classifier.html

## Setup

From spambase documentation:

Number of Instances: 4601 (1813 Spam = 39.4%)
    
Number of Attributes: 58 (57 continuous, 1 nominal class label)

In [1]:
import csv
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

## Data Import

In [2]:
# Read csv data
data = []

with open('./spambase/spambase.data') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row)

In [3]:
len(data)

4601

In [4]:
# The last item of each row is the class label; the rest are features.
# np.float was removed in NumPy 1.24, so use the builtin float instead.
X = np.array([x[:-1] for x in data]).astype(float)
y = np.array([x[-1] for x in data]).astype(float)
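
Because every spambase column is numeric, the read-then-convert steps above can probably be collapsed into a single call. The following is a minimal sketch, assuming `np.loadtxt` is acceptable here; the two rows are made up for illustration (real rows have 57 features plus the label), and `X_demo` / `y_demo` are hypothetical names.

```python
import io
import numpy as np

# Two made-up rows in the spambase layout: features first, label last
sample = io.StringIO("0.0,0.64,1\n0.21,0.0,0\n")

# np.loadtxt parses numeric CSV text straight into a float array
table = np.loadtxt(sample, delimiter=",")

X_demo = table[:, :-1]  # every column except the last
y_demo = table[:, -1]   # the last column holds the class label
```

Passing the path `'./spambase/spambase.data'` instead of the in-memory sample should work the same way.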

## Data Import (Alt)

In [44]:
#import os
import re
from os import listdir

In [50]:
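def get_emails(path, label):
    """Read every file in `path` (which must end with '/') as (text, label) pairs."""
    emails = []
    # Skip the 'cmds' index file that ships alongside the messages
    files = [path + f for f in listdir(path) if f != 'cmds']

    for file in files:
        # latin-1 maps every byte to a character, so no message fails to decode
        with open(file, encoding="latin-1") as f:
            emails.append((f.read(), label))
    return emails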

In [59]:
easy_ham = get_emails('./easy_ham/', 'ham')
len(easy_ham)

2501

In [60]:
spam = get_emails('./spam/', 'spam')
len(spam)

500

In [82]:
# Each email is the raw message text, including headers (sender, routing, dates, etc.).
# The body of the email is only a portion of that content.
easy_ham[1][0]

'From rpm-list-admin@freshrpms.net  Mon Sep  9 18:00:21 2002\nReturn-Path: <rpm-zzzlist-admin@freshrpms.net>\nDelivered-To: yyyy@localhost.spamassassin.taint.org\nReceived: from localhost (jalapeno [127.0.0.1])\n\tby jmason.org (Postfix) with ESMTP id 9D98A16EFC\n\tfor <jm@localhost>; Mon,  9 Sep 2002 18:00:20 +0100 (IST)\nReceived: from jalapeno [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor jm@localhost (single-drop); Mon, 09 Sep 2002 18:00:20 +0100 (IST)\nReceived: from auth02.nl.egwn.net (auth02.nl.egwn.net [193.172.5.4]) by\n    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g11MGS812405 for\n    <jm-rpm@jmason.org>; Fri, 1 Feb 2002 22:16:28 GMT\nReceived: from auth02.nl.egwn.net (localhost [127.0.0.1]) by\n    auth02.nl.egwn.net (8.11.6/8.11.6/EGWN) with ESMTP id g11MF0308879;\n    Fri, 1 Feb 2002 23:15:00 +0100\nReceived: from drone4.qsi.net.nz (drone4-svc-skyt.qsi.net.nz\n    [202.89.128.4]) by auth02.nl.egwn.net (8.11.6/8.11.6/EGWN) with SMTP id\n    g11MEh3

In [95]:
def get_email_body(email):
    # Take everything after the last "Date: ..." line,
    # e.g. "Date: Sat, 02 Feb 2002 11:20:17 +1300\n"
    matches = list(re.finditer(r"Date: .*\n", email))
    if not matches:
        # No Date line found: return the text unchanged
        return email
    body_start = matches[-1].end()
    return email[body_start:]
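
Keying on the last `Date:` line can misfire when the body itself contains such a line, for example in a quoted message. A possible alternative, sketched here under the assumption that each file is a standard mail message whose header block ends at the first blank line, is to split there instead. `split_headers_body` is a hypothetical helper and the sample message is made up.

```python
def split_headers_body(raw_email):
    """Split a raw message at the first blank line (hypothetical helper)."""
    headers, sep, body = raw_email.partition("\n\n")
    # If there is no blank line, treat the whole text as the body
    return (headers, body) if sep else ("", raw_email)


# Made-up message whose body contains a misleading "Date:" line
sample = (
    "From: a@example.com\n"
    "Date: Sat, 02 Feb 2002 11:20:17 +1300\n"
    "\n"
    "Hello there.\n"
    "Date: not a header\n"
)
headers, body = split_headers_body(sample)
```

Here `body` keeps the misleading line, whereas a last-`Date:` search would have cut the message after it.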

In [99]:
get_email_body(easy_ham[0][0])

"\n\nIn a message dated 9/24/2002 11:24:58 AM, jamesr@best.com writes:\n\n>This situation wouldn't have happened in the first place if California\n>didn't have economically insane regulations.  They created a regulatory\n>climate that facilitated this.  So yes, it is the product of\n>over-regulation.\n>\n\nWhich is to say, if you reduce the argument to absurdity, that law causes \ncrime. \n\n(Yes, I agree that badly written law can make life so frustrating that people \nhave little choice but to subvery it if they want to get anything done. This \nis also true of corporate policies, and all other attempts to regulate \nconduct by rules. Rules just don't work well when situations are fluid or \nambiguous. But I don't think that the misbehavior of energy companies in \nCalifornia can properly be called well-intentioned lawbreaking by parties who \nwere trying to do the right thing but could do so only by falling afoul of \nsome technicality.)\n\nIf you want to get to root causes, we shou

## Data Transformation

### Train Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
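
The commented-out variant passes `stratify=y`, which asks `train_test_split` to keep the class proportions the same in both splits. A small sketch on synthetic labels (not the spambase data) illustrates the effect; `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the spambase balance (about 40% positive)
y_demo = np.array([1] * 40 + [0] * 60)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=21, stratify=y_demo
)

# Share of positive labels in each split
train_share, test_share = y_tr.mean(), y_te.mean()
```

Without `stratify`, the two shares can drift apart by chance, which matters more as the dataset gets smaller or the classes get more imbalanced.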

### Naive Bayes - Gaussian

In [6]:
# Instantiate and train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict on test set
y_pred = gnb.predict(X_test)

In [7]:
gnb.score(X_train,y_train)

0.818944099378882

In [8]:
gnb.score(X_test, y_test)

0.8276611151339609

In [9]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[628 218]
 [ 20 515]]
              precision    recall  f1-score   support

         0.0       0.97      0.74      0.84       846
         1.0       0.70      0.96      0.81       535

    accuracy                           0.83      1381
   macro avg       0.84      0.85      0.83      1381
weighted avg       0.87      0.83      0.83      1381
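
The figures in the Gaussian report can be recomputed by hand from the confusion matrix printed above, which makes the trade-off easier to see: 218 of the 846 ham messages are flagged as spam, while only 20 of the 535 spam messages slip through. A short sketch, assuming scikit-learn's layout of true classes in rows and predicted classes in columns:

```python
import numpy as np

# Confusion matrix from the Gaussian model: rows = true (ham, spam), columns = predicted
cm = np.array([[628, 218],
               [20, 515]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all messages classified correctly
spam_precision = tp / (tp + fp)   # of messages flagged as spam, share that are spam
spam_recall = tp / (tp + fn)      # of actual spam, share that was caught
ham_recall = tn / (tn + fp)       # of actual ham, share that was kept
```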



### Naive Bayes - Bernoulli

In [10]:
# Instantiate and train Bernoulli Naive Bayes model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
# Predict on test set
y_pred = bnb.predict(X_test)
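
`BernoulliNB` expects binary features and, with its default `binarize=0.0`, thresholds the inputs internally, so the continuous spambase frequencies are effectively reduced to "present" versus "absent". The sketch below uses synthetic data (not spambase) to show that the default behaves like binarizing by hand; all names in it are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
# Synthetic non-negative "frequency" features; about half the entries are exactly zero
X_demo = rng.random((200, 5)) * (rng.random((200, 5)) > 0.5)
y_demo = (X_demo[:, 0] > 0).astype(int)

# Default binarize=0.0: values above zero become 1 inside the model
implicit = BernoulliNB().fit(X_demo, y_demo)

# Same idea done by hand, with the internal threshold disabled
X_binary = (X_demo > 0).astype(float)
explicit = BernoulliNB(binarize=None).fit(X_binary, y_demo)

same = np.array_equal(implicit.predict(X_demo), explicit.predict(X_binary))
```

If the threshold matters for a given dataset, `binarize` can be tuned like any other hyperparameter.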

In [11]:
bnb.score(X_train,y_train)

0.8891304347826087

In [12]:
bnb.score(X_test, y_test)

0.8812454742939899

In [13]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[775  71]
 [ 93 442]]
              precision    recall  f1-score   support

         0.0       0.89      0.92      0.90       846
         1.0       0.86      0.83      0.84       535

    accuracy                           0.88      1381
   macro avg       0.88      0.87      0.87      1381
weighted avg       0.88      0.88      0.88      1381



## Conclusion

## YouTube