# DATA 620 - Assignment 6

Jeremy OBrien, Mael Illien, Vanita Thompson

## Document Classification

* It can be useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data: UCI Machine Learning Repository: Spambase Data Set (http://archive.ics.uci.edu/ml/datasets/Spambase)
* For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
* For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


Resources:

- http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
- https://www.cs.bgu.ac.il/~elhadad/nlp16/spam_classifier.html

## Setup

From spambase documentation:

Number of Instances: 4601 (1813 Spam = 39.4%)
    
Number of Attributes: 58 (57 continuous, 1 nominal class label)

In [5]:
import csv
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

## Data Import

In [6]:
import re
import random
from os import listdir

In [7]:
def get_emails(path):
    emails = []
    files = [path + f for f in listdir(path) if f != 'cmds']

    for file in files:
        with open(file, encoding="latin-1") as f:
            emails.append(f.read()) 
    return emails

In [8]:
def get_email_body(email):
    # Find the last occurrence of a line such as
    # "Date: Sat, 02 Feb 2002 11:20:17 +1300\n" and treat everything after it as the body
    matches = re.finditer(r"Date: .*\n", email)
    indices = [m.span() for m in matches]
    
    # End offset of the last match; raises IndexError if no line matched
    body_start = indices[-1][1]
    return email[body_start:].replace("\n", "")

In [9]:
easy_ham = get_emails('./easy_ham/')
len(easy_ham)

2501

In [10]:
spam = get_emails('./spam/')
len(spam)

500

In [11]:
# each email is the full raw message, including headers (sender, routing, dates, etc.)
# the body of the email is only a portion of that content
easy_ham[0][:1000]

'From fork-admin@xent.com  Tue Sep 24 17:55:30 2002\nReturn-Path: <fork-admin@xent.com>\nDelivered-To: yyyy@localhost.spamassassin.taint.org\nReceived: from localhost (jalapeno [127.0.0.1])\n\tby jmason.org (Postfix) with ESMTP id 070DF16F03\n\tfor <jm@localhost>; Tue, 24 Sep 2002 17:55:30 +0100 (IST)\nReceived: from jalapeno [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor jm@localhost (single-drop); Tue, 24 Sep 2002 17:55:30 +0100 (IST)\nReceived: from xent.com ([64.161.22.236]) by dogma.slashnull.org\n    (8.11.6/8.11.6) with ESMTP id g8OGAEC11404 for <jm@jmason.org>;\n    Tue, 24 Sep 2002 17:10:14 +0100\nReceived: from lair.xent.com (localhost [127.0.0.1]) by xent.com (Postfix)\n    with ESMTP id ACE072940DA; Tue, 24 Sep 2002 09:06:08 -0700 (PDT)\nDelivered-To: fork@spamassassin.taint.org\nReceived: from imo-r09.mx.aol.com (imo-r09.mx.aol.com [152.163.225.105])\n    by xent.com (Postfix) with ESMTP id 522F329409A for <fork@xent.com>;\n    Tue, 24 Sep 2002 09:05:51 -07

In [12]:
get_email_body(easy_ham[0])

"In a message dated 9/24/2002 11:24:58 AM, jamesr@best.com writes:>This situation wouldn't have happened in the first place if California>didn't have economically insane regulations.  They created a regulatory>climate that facilitated this.  So yes, it is the product of>over-regulation.>Which is to say, if you reduce the argument to absurdity, that law causes crime. (Yes, I agree that badly written law can make life so frustrating that people have little choice but to subvery it if they want to get anything done. This is also true of corporate policies, and all other attempts to regulate conduct by rules. Rules just don't work well when situations are fluid or ambiguous. But I don't think that the misbehavior of energy companies in California can properly be called well-intentioned lawbreaking by parties who were trying to do the right thing but could do so only by falling afoul of some technicality.)If you want to get to root causes, we should probably go to the slaying of Abel by Cai

In [13]:
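# Unexplained so far: the list has 2500 entries, but only the first 1817 pass through
# get_email_body without an IndexError. indices[-1] raises IndexError when no line in a
# message matches "Date: .*\n", so the failing messages likely lack a header in that exact form.
#[get_email_body(m) for m in easy_ham[:1817]]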
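One way to sidestep the IndexError noted above is to stop searching for a `Date:` line and split on the first blank line instead, since email headers are separated from the body by an empty line. The sketch below is an alternative, not the approach used in this notebook; the name `split_body` is hypothetical, and it joins lines with a space rather than an empty string so that words on adjacent lines do not run together.

```python
def split_body(raw_email):
    """Return the text after the first blank line, or None if there is none."""
    # Normalize Windows line endings so the separator is always "\n\n"
    text = raw_email.replace("\r\n", "\n")
    _headers, sep, body = text.partition("\n\n")
    if not sep:
        # No blank line found, so headers and body cannot be told apart
        return None
    # Join lines with a space so adjacent lines do not merge into one word
    return " ".join(body.split("\n")).strip()


sample = "From: a@example.com\nSubject: hi\n\nline one\nline two\n"
print(split_body(sample))                # line one line two
print(split_body("no blank line here"))  # None
```

Messages for which `split_body` returns `None` can then be filtered out explicitly, for example with `[b for b in map(split_body, easy_ham) if b is not None]`.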

In [14]:
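# Sample 500 of the ham emails to balance the dataset.
# random.sample draws without replacement, so no ham email appears twice
# (random.choices draws with replacement and can produce duplicates).
labeled_emails = ([(get_email_body(em), 'ham') for em in random.sample(easy_ham, k=500)] + 
                    [(get_email_body(em), 'spam') for em in spam])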

In [15]:
len(labeled_emails)

1000

## Data Transformation

The approach taken is simple: case normalization, stopword removal, and stemming, followed by TF-IDF vectorization. (Apparently tokenizing doesn’t work well with email because of its colloquial language.)
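To make the TF-IDF step concrete, here is a minimal pure-Python sketch of one common variant: term frequency is the count divided by document length, and inverse document frequency is `log(N / df)`. The function name `tf_idf` and the toy token lists are hypothetical, and library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed formula with normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-stemmed token lists standing in for email bodies
docs = [
    ["free", "money", "click"],
    ["meet", "money", "report"],
    ["free", "free", "offer", "click"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Terms that appear in every document get weight zero under this variant, while terms concentrated in a few documents score highest, which is what makes the weights useful as classifier features.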

In [27]:
from nltk import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords

In [18]:
# Tokenize
tokens = word_tokenize(get_email_body(easy_ham[0]))
tokens[:10]


['In',
 'a',
 'message',
 'dated',
 '9/24/2002',
 '11:24:58',
 'AM',
 ',',
 'jamesr',
 '@']

In [24]:
# Normalize
# Note: we might not want to get rid of non-alpha characters; punctuation and HTML tags could be valuable signals.
word_tokens = [w.lower() for w in tokens if w.isalpha()] 
print(len(word_tokens))
word_tokens[:10]

223


['in',
 'a',
 'message',
 'dated',
 'am',
 'jamesr',
 'writes',
 'this',
 'situation',
 'would']
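The note in the cell above suggests that punctuation and HTML tags may carry signal that the `isalpha()` filter throws away. One way to keep some of it is to count a few such markers on the raw body before normalizing. This is only a sketch: the function name `surface_features`, the feature names, and the regular expressions are illustrative assumptions, and the tag pattern is crude (it would also match an address written as `<user@example.com>`).

```python
import re


def surface_features(text):
    """Count a few non-alphabetic signals that isalpha() filtering discards."""
    return {
        "n_exclaim": text.count("!"),
        "n_dollar": text.count("$"),
        # Anything of the form <...> is counted as a tag (a rough heuristic)
        "n_html_tags": len(re.findall(r"<[^<>]+>", text)),
        "n_urls": len(re.findall(r"https?://\S+", text)),
    }


features = surface_features("<b>FREE!!!</b> Win $$$ at http://example.com now!")
print(features)
# {'n_exclaim': 4, 'n_dollar': 3, 'n_html_tags': 2, 'n_urls': 1}
```

Counts like these could be appended to the word-based features as extra columns.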

In [25]:
# Remove stop words (a set makes each membership test constant time)
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in word_tokens if w not in stop_words]
print(len(filtered_words))

107


In [33]:
# Stemming
porter = PorterStemmer()
stemmed_words = [porter.stem(t) for t in filtered_words]
stemmed_words[:20]

['messag',
 'date',
 'jamesr',
 'write',
 'situat',
 'would',
 'happen',
 'first',
 'place',
 'california',
 'econom',
 'insan',
 'regul',
 'creat',
 'regulatori',
 'climat',
 'facilit',
 'ye',
 'product',
 'say']

### Train Test Split

In [None]:
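# X (feature matrix) and y (labels) are not defined in the cells above; they are assumed
# to be built from labeled_emails, e.g. by TF-IDF vectorizing the preprocessed bodies.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)
# Stratified alternative that keeps the ham/spam ratio the same in both splits:
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)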
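The split above needs a feature matrix and labels. A minimal end-to-end sketch, assuming scikit-learn's `TfidfVectorizer` for the features and the `BernoulliNB` classifier imported in Setup, might look like the following. The `toy_emails` list is a hypothetical stand-in for `labeled_emails`, and the sketch splits the raw text before vectorizing so that the vocabulary is learned from training data only. It is not the notebook's final model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical stand-in for labeled_emails: (body, label) pairs
toy_emails = [
    ("win free money now click here", "spam"),
    ("free offer click to claim your prize", "spam"),
    ("cheap loans approved click now", "spam"),
    ("claim your free prize money today", "spam"),
    ("limited offer win cash now", "spam"),
    ("meeting moved to friday afternoon", "ham"),
    ("please review the attached report", "ham"),
    ("lunch on tuesday with the team", "ham"),
    ("notes from the project meeting", "ham"),
    ("can you send the report today", "ham"),
]
texts = [body for body, _ in toy_emails]
y = [label for _, label in toy_emails]

# Split the raw text first; stratify keeps the class ratio in both parts
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, y, test_size=0.3, random_state=21, stratify=y)

# Fit the vectorizer on training text only, then reuse it for the test text
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# BernoulliNB binarizes features at 0.0 by default, so it models term presence
clf = BernoulliNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
print(cm)
```

With the real data, `texts` and `y` would come from `labeled_emails`, and `classification_report(y_test, y_pred)` would give per-class precision and recall alongside the confusion matrix.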

## Conclusion

## YouTube