# DATA 620 - Assignment 6

Jeremy OBrien, Mael Illien, Vanita Thompson

## Document Classification

* It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set (http://archive.ics.uci.edu/ml/datasets/Spambase)
* For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
* For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


Resources:

- http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
- https://www.cs.bgu.ac.il/~elhadad/nlp16/spam_classifier.html

## Setup

In [4]:
import re
import csv
import random
import numpy as np
import pandas as pd
from os import listdir
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from nltk import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

## Data Import & Transformation

In [2]:
# Read individual files from current directory and return the content in a list
def get_emails(path):
    emails = []
    files = [path + f for f in listdir(path) if f != 'cmds']

    for file in files:
        with open(file, encoding="latin-1") as f:
            email = f.read()
            if len(email) != 0:
                emails.append(email) 
    return emails

Notice that the ham and spam corpora are not balanced. We will sample the ham corpus to even out the sizes in the training set.

In [3]:
easy_ham = get_emails('./easy_ham/')
spam = get_emails('./spam/')

print('Number of emails in {} corpus: {}'.format('easy_ham', len(easy_ham)))
print('Number of emails in {} corpus: {}'.format('spam', len(spam)))

Number of emails in easy_ham corpus: 2500
Number of emails in spam corpus: 500


See below for an example of the email format. There are a number of headers followed by the body of the email. Our analysis will be focused on the body content. The function get_email_body will be used to extract this content.

In [4]:
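# Extract only the body of the emails, ignoring all the headers
def get_email_body(email):
    # Look for the last occurrence of a header line such as
    # "Date: Sat, 02 Feb 2002 11:20:17 +1300\n"
    matches = list(re.finditer(r"Date: .*\n", email))
    if matches:
        body_start = matches[-1].end()
    else:
        # Otherwise fall back to the first blank line (the "\n\n" pattern),
        # or keep the whole text if there is none
        blank = email.find("\n\n")
        body_start = blank + 2 if blank != -1 else 0
    # Note: dropping newlines joins the last word of one line to the first word of the next
    body = email[body_start:].replace("\n", "")

    return body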

In [5]:
print(easy_ham[2001])

From rssfeeds@jmason.org  Mon Oct  7 12:05:27 2002
Return-Path: <rssfeeds@spamassassin.taint.org>
Delivered-To: yyyy@localhost.spamassassin.taint.org
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id 4123D16F7C
	for <jm@localhost>; Mon,  7 Oct 2002 12:04:04 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Mon, 07 Oct 2002 12:04:04 +0100 (IST)
Received: from dogma.slashnull.org (localhost [127.0.0.1]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g9780fK23260 for
    <jm@jmason.org>; Mon, 7 Oct 2002 09:00:41 +0100
Message-Id: <200210070800.g9780fK23260@dogma.slashnull.org>
To: yyyy@spamassassin.taint.org
From: gamasutra <rssfeeds@spamassassin.taint.org>
Subject: Postmortem: Ubi Soft China's Music Up -- Summer Rainbow
Date: Mon, 07 Oct 2002 08:00:41 -0000
Content-Type: text/plain; encoding=utf-8

URL: http://www.newsisfree.com/click/-0,8613667,159/
Date: 2002-10-06T18

In [6]:
print(get_email_body(easy_ham[2001]))

Ubi China had always wanted to make a PC game for the local market, but a number to factors kept the idea on hold. In January 2001, the right incentive to motivate Ubi China to try a local project finally arrived: the license for "Music Up", a popular animated property.
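
In the example above, the `URL:` line and the second `Date:` line belong to the message body, yet the regex approach skips them because it cuts at the last `Date:` match. As a possible alternative, here is a minimal sketch that uses the standard library's `email` parser, which separates headers from body at the first blank line. The helper name `get_body_stdlib` and the sample message are illustrative assumptions, not part of the notebook.

```python
from email import message_from_string

def get_body_stdlib(raw_email):
    """Return the text that follows the header block (hypothetical helper)."""
    msg = message_from_string(raw_email)
    if msg.is_multipart():
        # Join the text of every leaf part of a multipart message
        leaves = [p for p in msg.walk() if not p.is_multipart()]
        return "\n".join(str(p.get_payload()) for p in leaves)
    return msg.get_payload()

# Made-up message shaped like the corpus example above
sample = (
    "From: gamasutra <rssfeeds@example.org>\n"
    "Subject: Postmortem\n"
    "Date: Mon, 07 Oct 2002 08:00:41 -0000\n"
    "\n"
    "URL: http://example.org/click/\n"
    "Date: 2002-10-06T18\n"
    "\n"
    "Ubi China had always wanted to make a PC game.\n"
)
body = get_body_stdlib(sample)
print(body)
```

Whether the `URL:` line is useful to the classifier is a separate question; the sketch only shows where the header block ends.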


In [7]:
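# Assemble the corpus by combining the spam emails with 500 emails sampled from the
# ham emails to balance the dataset and assign the known labels ham: 0, spam: 1.
# Note: random.choices draws with replacement, so the ham sample can contain duplicates;
# random.sample(easy_ham, k=500) would draw 500 distinct emails.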
random.seed(620)
labeled_emails = ([(get_email_body(em), '0') for em in random.choices(easy_ham, k=500)] + 
                    [(get_email_body(em), '1') for em in spam])

print('There are {} emails in this corpus.'.format(len(labeled_emails)))

There are 1000 emails in this corpus.
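
The balancing step depends on how the 500 ham emails are drawn. The following sketch contrasts drawing with and without replacement, using a list of integers as a stand-in for the 2,500 ham emails (an assumption made for illustration only).

```python
import random

random.seed(620)
population = list(range(2500))  # stand-in for the 2,500 ham emails

# random.choices draws WITH replacement: the same item can appear more than once
with_replacement = random.choices(population, k=500)

# random.sample draws WITHOUT replacement: every item is distinct
without_replacement = random.sample(population, k=500)

print(len(set(with_replacement)))     # can be below 500 when duplicates occur
print(len(set(without_replacement)))  # always 500
```

Duplicates matter here because the same email could land in both the training and the test split, which would tend to inflate test accuracy.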


## Data Processing

The approach is simple: case normalization, stopword removal, and stemming, followed by TF-IDF vectorization. (Off-the-shelf tokenizers reportedly struggle with email text because of its colloquial language.)

Resources:
- https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

### Manual pre-processing

This is shown for exploration. The functionality is encapsulated in the process_email_body function and the below can be removed later.

In [8]:
get_email_body(easy_ham[0])

"In a message dated 9/24/2002 11:24:58 AM, jamesr@best.com writes:>This situation wouldn't have happened in the first place if California>didn't have economically insane regulations.  They created a regulatory>climate that facilitated this.  So yes, it is the product of>over-regulation.>Which is to say, if you reduce the argument to absurdity, that law causes crime. (Yes, I agree that badly written law can make life so frustrating that people have little choice but to subvery it if they want to get anything done. This is also true of corporate policies, and all other attempts to regulate conduct by rules. Rules just don't work well when situations are fluid or ambiguous. But I don't think that the misbehavior of energy companies in California can properly be called well-intentioned lawbreaking by parties who were trying to do the right thing but could do so only by falling afoul of some technicality.)If you want to get to root causes, we should probably go to the slaying of Abel by Cai

In [9]:
# The sklearn tfidf function does most of the work below. See usage below.

# Tokenize
tokens = word_tokenize(get_email_body(easy_ham[0]))
print(tokens[:10])

['In', 'a', 'message', 'dated', '9/24/2002', '11:24:58', 'AM', ',', 'jamesr', '@']


In [10]:
# Normalize
# ME: Note: We might not want to get rid of non-alpha characters. Potential value in punctuation, HTML tags?
word_tokens = [w.lower() for w in tokens if w.isalpha()] 
print(len(word_tokens))
print(word_tokens[:10])

223
['in', 'a', 'message', 'dated', 'am', 'jamesr', 'writes', 'this', 'situation', 'would']


In [11]:
# Remove stop words
stop_words = stopwords.words('english')
filtered_words = [w for w in word_tokens if w not in stop_words]
print(len(filtered_words))
print(filtered_words[:10])

107
['message', 'dated', 'jamesr', 'writes', 'situation', 'would', 'happened', 'first', 'place', 'california']


In [12]:
# Stemming (Consider Lemmatization instead)
porter = PorterStemmer()
stemmed_words = [porter.stem(t) for t in filtered_words]
print(stemmed_words[:20])

['messag', 'date', 'jamesr', 'write', 'situat', 'would', 'happen', 'first', 'place', 'california', 'econom', 'insan', 'regul', 'creat', 'regulatori', 'climat', 'facilit', 'ye', 'product', 'say']


In [13]:
# Consider Lemmatization instead

### Processing (encapsulated):

In [14]:
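# Process email: tokenize, lowercase and keep only alphabetic tokens, remove stop words,
# then stem or lemmatize, and return a list of tokens
def process_email_body(email, alpha=True, rm_stopwords=True, stem=True, lemma=False):
    tokens = word_tokenize(email)
    if alpha:
        # Lowercasing happens only together with the alphabetic filter
        tokens = [w.lower() for w in tokens if w.isalpha()]
    if rm_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if w not in stop_words]
    if stem:
        porter = PorterStemmer()
        tokens = [porter.stem(t) for t in tokens]
    elif lemma:
        # Requires the NLTK 'wordnet' corpus; applied only when stem=False
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens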

In [15]:
# Example
print(process_email_body(get_email_body(easy_ham[0])))

['messag', 'date', 'jamesr', 'write', 'situat', 'would', 'happen', 'first', 'place', 'california', 'econom', 'insan', 'regul', 'creat', 'regulatori', 'climat', 'facilit', 'ye', 'product', 'say', 'reduc', 'argument', 'absurd', 'law', 'caus', 'crime', 'ye', 'agre', 'badli', 'written', 'law', 'make', 'life', 'frustrat', 'peopl', 'littl', 'choic', 'subveri', 'want', 'get', 'anyth', 'done', 'also', 'true', 'corpor', 'polici', 'attempt', 'regul', 'conduct', 'rule', 'rule', 'work', 'well', 'situat', 'fluid', 'ambigu', 'think', 'misbehavior', 'energi', 'compani', 'california', 'properli', 'call', 'lawbreak', 'parti', 'tri', 'right', 'thing', 'could', 'fall', 'afoul', 'technic', 'want', 'get', 'root', 'caus', 'probabl', 'go', 'slay', 'abel', 'cain', 'perhap', 'figur', 'went', 'wrong', 'roll', 'learn', 'forward', 'histori', 'creat', 'say', 'cast', 'stone', 'hous', 'whether', 'bicamer', 'unicamer', 'built', 'sand', 'rock', 'left', 'right', 'glass', 'brick', 'twig', 'straw', 'tom']


In [16]:
# train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
#     devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
#     test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]

#### Consider additional feature engineering?
- number of tokens, etc.
- presence of word 'unsubscribe'

For feature engineering, what would you say to the following combinations:

- yes case normalization, yes symbol removal, yes stopword removal, then stem
- yes case normalization, yes symbol removal, no stopword removal, then stem
- yes case normalization, no symbol removal, no stopword removal, then stem
- no case normalization, no symbol removal, no stopword removal, then stem
  
  
- yes case normalization, yes symbol removal, yes stopword removal, then lemmatize
- yes case normalization, yes symbol removal, no stopword removal, then lemmatize
- yes case normalization, no symbol removal, no stopword removal, then lemmatize
- no case normalization, no symbol removal, no stopword removal, then lemmatize
  
  
- yes case normalization, yes symbol removal, yes stopword removal, no stem / lemmatize
- yes case normalization, yes symbol removal, no stopword removal, no stem / lemmatize
- yes case normalization, no symbol removal, no stopword removal, no stem / lemmatize
- no case normalization, no symbol removal, no stopword removal, no stem / lemmatize
  
Then, once we've determined the best performer, we can implement some sort of error correction to see if that improves it.

Then we can try versions of the best performer that are noun-only or verb-only (with the settings above, so long as not contradictory)
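
The twelve combinations listed above are four nested cleaning levels crossed with three normalization choices, so they can be enumerated rather than typed out. This is a minimal sketch; the dictionary keys are hypothetical names chosen for illustration, and mapping them onto `process_email_body` would require separating lowercasing from the alphabetic filter, which that function currently ties together.

```python
from itertools import product

# Nested cleaning levels from the list above:
# (case normalization, symbol removal, stopword removal)
cleaning_levels = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, False, False),
]
# Final normalization step: stem, lemmatize, or neither
normalizers = ["stem", "lemmatize", None]

configs = [
    {"lowercase": lc, "alpha_only": alpha, "rm_stopwords": sw, "normalizer": norm}
    for norm, (lc, alpha, sw) in product(normalizers, cleaning_levels)
]

for cfg in configs:
    print(cfg)
```

Each configuration could then be vectorized and scored with the same train/test split so that the results are comparable.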


#### Split labeled corpus into emails and labels:

Apply processing functions to email bodies. 
- Option 1: manual. 
- Option 2: sklearn pre-processor

In [17]:
# Option 1: takes a long time

# emails = [process_email_body(email) for (email, label) in labeled_emails] # X
# y = [label for (email, label) in labeled_emails] # y = labels

# vectorizer = TfidfVectorizer()

In [18]:
# Option 2
# Refer to https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
# play with max_features, etc.

emails = [email for (email, label) in labeled_emails] # X
y = [label for (email, label) in labeled_emails] # y = labels

vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', token_pattern = r'[a-zA-Z]+', max_features=1000)

#### Vectorize:

In [19]:
X = vectorizer.fit_transform(emails)

In [20]:
# On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
print(vectorizer.get_feature_names()[:200])

['aa', 'ab', 'able', 'absolutely', 'ac', 'accept', 'access', 'account', 'act', 'action', 'actually', 'ad', 'adclick', 'add', 'additional', 'address', 'addresses', 'admanmail', 'admin', 'ads', 'adult', 'advertising', 'ae', 'af', 'ag', 'age', 'agent', 'agents', 'ago', 'ahref', 'aid', 'al', 'align', 'allow', 'alsa', 'alt', 'alternative', 'america', 'american', 'amp', 'annuity', 'answer', 'application', 'apply', 'archive', 'area', 'arial', 'article', 'ascii', 'asciicontent', 'ask', 'asp', 'assist', 'assistance', 'atoll', 'au', 'aug', 'available', 'average', 'aw', 'away', 'awr', 'b', 'ba', 'background', 'bad', 'bank', 'banners', 'base', 'based', 'bb', 'bc', 'beenthere', 'begin', 'believe', 'benefit', 'best', 'better', 'bgcolor', 'bidi', 'big', 'bin', 'bindex', 'bit', 'bitx', 'black', 'blank', 'blockquote', 'blue', 'body', 'bonus', 'book', 'border', 'bordercolor', 'boundary', 'box', 'br', 'build', 'bulk', 'bulklist', 'bush', 'business', 'businesses', 'buy', 'c', 'ca', 'called', 'came', 'camp

In [21]:
print(X.shape)

(1000, 1000)
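
To make the numbers inside `X` less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies under its documented defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). The three toy documents are made up, and the sketch skips tokenization and stop-word handling.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (illustrative data only)
docs = [
    ["free", "money", "free"],
    ["meeting", "project"],
    ["free", "project", "update"],
]
n_docs = len(docs)
vocab = sorted({t for doc in docs for t in doc})

# Document frequency: number of documents containing each term
df = Counter(t for doc in docs for t in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents (here `money`) receive a larger idf than terms that occur in many (here `free`), which is what lets rarer, more discriminating words stand out.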


### Train Test Split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

### Naive Bayes - Gaussian

In [23]:
# GaussianNB does not accept sparse input, so convert the matrices to dense arrays
print(type(X))
X_train = X_train.toarray()
X_test = X_test.toarray()

<class 'scipy.sparse.csr.csr_matrix'>


In [24]:
# Instantiate and train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict on test set
y_pred = gnb.predict(X_test)

In [25]:
gnb.score(X_train,y_train)

0.98

In [26]:
gnb.score(X_test, y_test)

0.9266666666666666

In [27]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[134  12]
 [ 10 144]]
              precision    recall  f1-score   support

           0       0.93      0.92      0.92       146
           1       0.92      0.94      0.93       154

    accuracy                           0.93       300
   macro avg       0.93      0.93      0.93       300
weighted avg       0.93      0.93      0.93       300
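
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below does this for the spam class (label 1), assuming scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the Gaussian Naive Bayes run above
# rows: actual ham (0), actual spam (1); columns: predicted ham, predicted spam
cm = [[134, 12],
      [10, 144]]

tn, fp = cm[0]  # ham kept as ham, ham flagged as spam
fn, tp = cm[1]  # spam missed, spam caught

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_spam = tp / (tp + fp)  # of the emails flagged as spam, the share that is spam
recall_spam = tp / (tp + fn)     # of the actual spam, the share that was flagged
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(round(accuracy, 2), round(precision_spam, 2),
      round(recall_spam, 2), round(f1_spam, 2))
```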



### Naive Bayes - Bernoulli

In [28]:
# Instantiate and train Bernoulli Naive Bayes model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
# Predict on test set
y_pred = bnb.predict(X_test)

In [29]:
bnb.score(X_train,y_train)

0.9057142857142857

In [30]:
bnb.score(X_test, y_test)

0.9033333333333333

In [31]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[142   4]
 [ 25 129]]
              precision    recall  f1-score   support

           0       0.85      0.97      0.91       146
           1       0.97      0.84      0.90       154

    accuracy                           0.90       300
   macro avg       0.91      0.91      0.90       300
weighted avg       0.91      0.90      0.90       300



In [5]:
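# Load the Spambase data
# Note: spambase.data has no header row, so read_csv uses the first record as the header
# here; passing header=None would keep that record as data
spambase = pd.read_csv("https://raw.githubusercontent.com/Vthomps000/DATA620/master/spambase.data")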

In [7]:
# column names from spambase.names data
spambase.columns=['word_freq_make','word_freq_address','word_freq_all','word_freq_3d','word_freq_our','word_freq_over',
              'word_freq_remove','word_freq_internet','word_freq_order','word_freq_mail','word_freq_receive',
              'word_freq_will','word_freq_people','word_freq_report','word_freq_addresses','word_freq_free',
              'word_freq_business','word_freq_email','word_freq_you','word_freq_credit','word_freq_your',
              'word_freq_font','word_freq_000','word_freq_money','word_freq_hp','word_freq_hpl','word_freq_george',
              'word_freq_650','word_freq_lab','word_freq_labs','word_freq_telnet','word_freq_857','word_freq_data',
              'word_freq_415','word_freq_85','word_freq_technology','word_freq_1999','word_freq_parts','word_freq_pm',
              'word_freq_direct','word_freq_cs','word_freq_meeting','word_freq_original','word_freq_project',
              'word_freq_re','word_freq_edu','word_freq_table','word_freq_conference','char_freq_;','char_freq_(',
              'char_freq_[','char_freq_!','char_freq_$','char_freq_#','capital_run_length_average','capital_run_length_longest',
              'capital_run_length_total','spamclass']                       

In [8]:
# Count the number of spam vs. not spam
spam_count = len(spambase[spambase.spamclass==1])
ham_count = len(spambase[spambase.spamclass==0])

print("Spam: %d" %spam_count)
print("Ham: %d" %ham_count)

Spam: 1812
Ham: 2788
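
The counts above total 4,600 rows, while the UCI Spambase data set lists 4,601 instances (1,813 spam and 2,788 non-spam). That gap is consistent with the first record having been read as a header row. The sketch below reproduces the effect on a three-row, made-up CSV; the column names `f1` and `f2` are placeholders.

```python
import io
import pandas as pd

# Three data rows and no header line, mimicking the layout of spambase.data
raw = "0.1,0.0,1\n0.0,0.2,0\n0.3,0.0,1\n"

# Default behaviour: the first line is consumed as column names
guessed_header = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names supplies the column labels
no_header = pd.read_csv(io.StringIO(raw), header=None,
                        names=["f1", "f2", "spamclass"])

print(len(guessed_header))  # one row short
print(len(no_header))       # all rows kept
print((no_header.spamclass == 1).sum())
```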


### Decision Tree Classifier

In [9]:
#train 70%, test 30%
X = spambase.values[:, 0:57]
y = spambase.values[:, 57]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, test_size=.3)

In [10]:
from sklearn.tree import DecisionTreeClassifier

# Only non-default settings are passed; the presort argument used originally
# was removed in later scikit-learn releases
dt = DecisionTreeClassifier(min_impurity_decrease=1e-07, random_state=88)
dt.fit(X_train, y_train)
dt.score(X_test, y_test)



0.9130434782608695

In [11]:
# Confusion matrix for Decision Tree
dt_cm = confusion_matrix(y_test, dt.predict(X_test))
pd.DataFrame(data = dt_cm, columns = ['Predicted Ham', 'Predicted Spam'],
            index = ['Actual Ham', 'Actual Spam'])

|             | Predicted Ham | Predicted Spam |
|-------------|---------------|----------------|
| Actual Ham  | 799           | 68             |
| Actual Spam | 52            | 461            |


In [12]:
dt_pred = dt.predict(X_test)
print("Number of mislabeled emails out of a total %d emails in test dataset : %d"
       % (X_test.shape[0],(y_test != dt_pred).sum()))
print("In detail, %d ham emails are mislabeled as spam, %d spam emails are mislabeled as ham."
      % (dt_cm [0,1], dt_cm[1,0]))

Number of mislabeled emails out of a total 1380 emails in test dataset : 120
In detail, 68 ham emails are mislabeled as spam, 52 spam emails are mislabeled as ham.


### Adaptive Boosting

In [13]:
from sklearn.ensemble import AdaBoostClassifier
# algorithm and base_estimator are left at their defaults; both arguments
# were deprecated in later scikit-learn releases
ada = AdaBoostClassifier(learning_rate=1.0, n_estimators=50, random_state=88)
ada.fit(X_train, y_train)
ada.score(X_test, y_test)

0.9398550724637681

In [14]:
ada_cm = confusion_matrix(y_test, ada.predict(X_test))

pd.DataFrame(data = ada_cm, columns = ['Predicted Ham', 'Predicted Spam'],
            index = ['Actual Ham', 'Actual Spam'])

|             | Predicted Ham | Predicted Spam |
|-------------|---------------|----------------|
| Actual Ham  | 831           | 36             |
| Actual Spam | 47            | 466            |


In [15]:
ada_pred = ada.predict(X_test)
print("Number of mislabeled emails out of a total %d emails in test dataset : %d"
       % (X_test.shape[0],(y_test != ada_pred).sum()))
print("In detail, %d ham emails are mislabeled as spam, %d spam emails are mislabeled as ham."
      % (ada_cm [0,1], ada_cm[1,0]))

Number of mislabeled emails out of a total 1380 emails in test dataset : 83
In detail, 36 ham emails are mislabeled as spam, 47 spam emails are mislabeled as ham.


### Random Forest

In [None]:
#reference:https://datawhatnow.com/feature-importance/

In [16]:
from sklearn.ensemble import RandomForestClassifier

# Only non-default settings are passed; max_features='auto' (equivalent to 'sqrt'
# for classifiers) was removed in later scikit-learn releases
rf = RandomForestClassifier(n_estimators=10, min_impurity_decrease=1e-07,
                            random_state=88)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.936231884057971

In [17]:
rf_cm = confusion_matrix(y_test, rf.predict(X_test))
pd.DataFrame(data = rf_cm, columns = ['Predicted Ham', 'Predicted Spam'],
            index = ['Actual Ham', 'Actual Spam'])

|             | Predicted Ham | Predicted Spam |
|-------------|---------------|----------------|
| Actual Ham  | 839           | 28             |
| Actual Spam | 60            | 453            |


In [18]:
rf_pred = rf.predict(X_test)
print("Number of mislabeled emails out of a total %d emails in test dataset : %d"
       % (X_test.shape[0],(y_test != rf_pred).sum()))
print("In detail, %d ham emails are mislabeled as spam, %d spam emails are mislabeled as ham."
      % (rf_cm [0,1], rf_cm[1,0]))

Number of mislabeled emails out of a total 1380 emails in test dataset : 88
In detail, 28 ham emails are mislabeled as spam, 60 spam emails are mislabeled as ham.


## Conclusion

In [None]:
# Can we add the other models to this? 

In [20]:
#reference:https://datawhatnow.com/feature-importance/
Conclusion = {'Decision Tree' : [dt.score(X_test, y_test), (y_test != dt_pred).sum()],
             'Random Forest' : [rf.score(X_test, y_test), (y_test != rf_pred).sum()],
             'AdaBoost' : [ada.score(X_test, y_test), (y_test != ada_pred).sum()],
             }
pd.DataFrame(Conclusion, index=['Accuracy', 'Mislabelled'])

|             | Decision Tree | Random Forest | AdaBoost |
|-------------|---------------|---------------|----------|
| Accuracy    | 0.913043      | 0.936232      | 0.939855 |
| Mislabelled | 120.0         | 88.0          | 83.0     |
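
Accuracy alone hides which kind of error each model makes. As a rough sketch, the confusion matrices reported above can be turned into per-model error rates. The values are copied from those tables; the dictionary keys `ham_flagged_as_spam` and `spam_missed` are names chosen here for illustration.

```python
# Confusion matrices from the three Spambase models above
# rows: actual ham, actual spam; columns: predicted ham, predicted spam
confusion = {
    "Decision Tree": [[799, 68], [52, 461]],
    "Random Forest": [[839, 28], [60, 453]],
    "AdaBoost": [[831, 36], [47, 466]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in confusion.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "accuracy": (tn + tp) / total,
        "ham_flagged_as_spam": fp / (tn + fp),  # share of ham wrongly flagged
        "spam_missed": fn / (fn + tp),          # share of spam let through
    }

for name, stats in summary.items():
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

On this particular split, AdaBoost has the highest accuracy while Random Forest flags the smallest share of ham as spam. Because the split was made without a fixed `random_state`, these rankings may shift between runs.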


## Youtube