# DATA 620 - Assignment 6

Jeremy OBrien, Mael Illien, Vanita Thompson

## Document Classification

* It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set (http://archive.ics.uci.edu/ml/datasets/Spambase)
* For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
* For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


Resources:

- http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
- https://www.cs.bgu.ac.il/~elhadad/nlp16/spam_classifier.html

## Setup

In [2]:
import re
import csv
import random
import numpy as np
import pandas as pd
from os import listdir

from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import svm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV

from nltk import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

## Data Import & Transformation

In [3]:
# Read each file in the directory at `path` and return the non-empty contents as a list
def get_emails(path):
    emails = []
    files = [path + f for f in listdir(path) if f != 'cmds']

    for file in files:
        with open(file, encoding="latin-1") as f:
            email = f.read()
            if len(email) != 0:
                emails.append(email) 
    return emails

Notice that the ham and spam corpora are not balanced. We will sample the ham corpus to even out the sizes in the training set.

In [4]:
easy_ham = get_emails('./easy_ham/')
spam = get_emails('./spam/')

print('Number of emails in {} corpus: {}'.format('easy_ham', len(easy_ham)))
print('Number of emails in {} corpus: {}'.format('spam', len(spam)))

Number of emails in easy_ham corpus: 2500
Number of emails in spam corpus: 500


See below for an example of the email format. There are a number of headers followed by the body of the email. Our analysis will be focused on the body content. The function get_email_body will be used to extract this content.

In [5]:
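# Extract only the body of the emails, ignoring all the headers
def get_email_body(email):
    # Look for the last occurrence of a line such as
    # "Date: Sat, 02 Feb 2002 11:20:17 +1300\n" and treat everything after it as the body
    matches = list(re.finditer(r"Date: .*\n", email))
    if matches:
        body_start = matches[-1].end()
    else:
        # No Date line: fall back to the first blank line (\n\n), which ends the headers
        blank = email.find("\n\n")
        body_start = blank + 2 if blank != -1 else 0

    # Note: newlines are dropped without adding a space, so words on adjacent lines are joined
    body = email[body_start:].replace("\n", "")

    return body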

In [6]:
print(easy_ham[2001])

From rssfeeds@jmason.org  Thu Sep 26 16:43:26 2002
Return-Path: <rssfeeds@spamassassin.taint.org>
Delivered-To: yyyy@localhost.spamassassin.taint.org
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id E080E16F76
	for <jm@localhost>; Thu, 26 Sep 2002 16:42:30 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Thu, 26 Sep 2002 16:42:30 +0100 (IST)
Received: from dogma.slashnull.org (localhost [127.0.0.1]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8QFSsg24511 for
    <jm@jmason.org>; Thu, 26 Sep 2002 16:28:54 +0100
Message-Id: <200209261528.g8QFSsg24511@dogma.slashnull.org>
To: yyyy@spamassassin.taint.org
From: fark <rssfeeds@spamassassin.taint.org>
Subject: Friends don't let friends swim drunk. Paging Dr. Darwin..
Date: Thu, 26 Sep 2002 15:28:54 -0000
Content-Type: text/plain; encoding=utf-8

URL: http://www.newsisfree.com/click/-4,8272607,1717/
Date: 2002-09-26T11:

In [7]:
print(get_email_body(easy_ham[2001]))

(NY Daily News)
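
The example email above contains a `Date:` line inside its body, which is why `get_email_body` returns only `(NY Daily News)` and drops the `URL:` line. As a point of comparison, here is a minimal sketch that splits headers from body with Python's standard-library `email` package instead. It is not the method used in this notebook; `split_headers_and_body` and the sample message are hypothetical, and the sketch assumes a single-part plain-text message.

```python
from email import message_from_string

def split_headers_and_body(raw_email):
    """Return (headers, body) for a single-part plain-text message."""
    msg = message_from_string(raw_email)
    # Repeated headers (e.g. several Received lines) collapse to the last one here
    headers = dict(msg.items())
    # For single-part messages the payload is the body text as a string
    body = msg.get_payload()
    return headers, body

# Hypothetical message shaped like the example above: the body itself
# contains a line that starts with "Date:".
sample = (
    "From: fark <rssfeeds@example.org>\n"
    "Subject: Example subject\n"
    "Date: Thu, 26 Sep 2002 15:28:54 -0000\n"
    "\n"
    "URL: http://example.org/story\n"
    "Date: 2002-09-26T11:00:00\n"
    "(NY Daily News)\n"
)

headers, body = split_headers_and_body(sample)
print(headers["Subject"])
print(body)
```

Because the parser stops reading headers at the first blank line, the `Date:` line inside the body stays part of the body.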


In [8]:
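# Assemble the corpus by combining the spam emails with 500 emails sampled from the
# ham emails to balance the dataset, and assign the known labels ham: '0', spam: '1'.
# Note: random.choices samples with replacement, so some ham emails may appear more than once.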
random.seed(620)
labeled_emails = ([(get_email_body(em), '0') for em in random.choices(easy_ham, k=500)] + 
                    [(get_email_body(em), '1') for em in spam])

print('There are {} emails in this corpus.'.format(len(labeled_emails)))

There are 1000 emails in this corpus.
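
`random.choices` draws with replacement, so the 500 sampled ham emails can include repeats. If repeats are not wanted, `random.sample` draws without replacement. The sketch below illustrates this on placeholder lists (`ham_pool` and `spam_pool` are stand-ins for `easy_ham` and `spam`, not the real corpora).

```python
import random

# Placeholder corpora standing in for easy_ham (2500 emails) and spam (500 emails)
ham_pool = ["ham message {}".format(i) for i in range(2500)]
spam_pool = ["spam message {}".format(i) for i in range(500)]

random.seed(620)
# random.sample draws without replacement, so no ham email is picked twice
ham_sample = random.sample(ham_pool, k=len(spam_pool))

labeled = [(text, '0') for text in ham_sample] + [(text, '1') for text in spam_pool]
print(len(labeled), len(set(ham_sample)))  # prints: 1000 500
```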


## Data Processing

The approach is deliberately simple: case normalization, stopword removal, and stemming, followed by TF-IDF vectorization. (Tokenizing apparently works less well on email because of colloquial speech.)

Resources:
- https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
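
To make the TF-IDF step concrete, here is a small from-scratch sketch on toy documents. It is meant to follow the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents as `TfidfVectorizer` defaults, but it is an illustration only and not a drop-in replacement. The `tfidf` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize each document vector (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy tokenized documents (assumed already lowercased and stop-word filtered)
docs = [
    ["free", "money", "click", "free"],
    ["meeting", "project", "click"],
    ["project", "report", "meeting"],
]
for vec in tfidf(docs):
    print({term: round(w, 3) for term, w in sorted(vec.items())})
```

Terms that are frequent in one document but rare across the corpus (here `free`) receive the largest weights, which is what makes the representation useful for separating spam from ham.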

### Manual pre-processing

This is shown for exploration. The functionality is encapsulated in the process_email_body function and the below can be removed later.

In [9]:
get_email_body(easy_ham[0])

'    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>  | I can\'t reproduce this error.For me it is very repeatable... (like every time, without fail).This is the debug log of the pick happening ...18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury18:19:04 Ftoc_PickMsgs {{1 hit}}18:19:04 Marking 1 hits18:19:04 tkerror: syntax error in expression "int ...Note, if I run the pick command by hand ...delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury1 hitThat\'s where the "1 hit" comes from (obviously).  The version of nmh I\'musing is ...delta$ pick -versionpick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 ICT 2002]And the relevant part of my .mh_profile ...delta$ mhparam

In [10]:
# The sklearn TfidfVectorizer handles most of the steps below; see its usage later in the notebook.

# Tokenize
tokens = word_tokenize(get_email_body(easy_ham[0]))
print(tokens[:10])

['From', ':', 'Chris', 'Garrigues', '<', 'cwg-dated-1030377287.06fa6d', '@', 'DeepEddy.Com', '>', 'Message-ID']


In [11]:
# Normalize
# ME: Note: We might not want to get rid of non-alpha characters. Potential value in punctuation, HTML tags?
word_tokens = [w.lower() for w in tokens if w.isalpha()] 
print(len(word_tokens))
print(word_tokens[:10])

136
['from', 'chris', 'garrigues', 'i', 'ca', 'reproduce', 'this', 'me', 'it', 'is']


In [12]:
# Remove stop words
stop_words = stopwords.words('english')
filtered_words = [w for w in word_tokens if w not in stop_words]
print(len(filtered_words))
print(filtered_words[:10])

84
['chris', 'garrigues', 'ca', 'reproduce', 'repeatable', 'like', 'every', 'time', 'without', 'fail']


In [13]:
# Stemming (Consider Lemmatization instead)
porter = PorterStemmer()
stemmed_words = [porter.stem(t) for t in filtered_words]
print(stemmed_words[:20])

['chri', 'garrigu', 'ca', 'reproduc', 'repeat', 'like', 'everi', 'time', 'without', 'fail', 'debug', 'log', 'pick', 'happen', 'exec', 'pick', 'ftp', 'mercuri', 'exec', 'pick']


In [14]:
# Consider Lemmatization instead

### Processing (encapsulated):

In [15]:
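# Process email: tokenize, lowercase and drop non-alpha tokens, remove stop words,
# then stem or lemmatize, and return a list of tokens
def process_email_body(email, alpha=True, rm_stopwords=True, stem=True, lemma=False):
    tokens = word_tokenize(email)
    if alpha:
        tokens = [w.lower() for w in tokens if w.isalpha()]
    if rm_stopwords:
        stop_words = set(stopwords.words('english'))  # set gives fast membership tests
        tokens = [w for w in tokens if w not in stop_words]
    if stem:
        porter = PorterStemmer()
        tokens = [porter.stem(t) for t in tokens]
    elif lemma:
        # Lemmatize only when stemming is off (requires the NLTK 'wordnet' corpus)
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens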

In [16]:
# Example
print(process_email_body(get_email_body(easy_ham[0])))

['chri', 'garrigu', 'ca', 'reproduc', 'repeat', 'like', 'everi', 'time', 'without', 'fail', 'debug', 'log', 'pick', 'happen', 'exec', 'pick', 'ftp', 'mercuri', 'exec', 'pick', 'ftp', 'hit', 'mark', 'tkerror', 'syntax', 'error', 'express', 'int', 'note', 'run', 'pick', 'command', 'hand', 'delta', 'pick', 'ftp', 'hitthat', 'hit', 'come', 'obvious', 'version', 'nmh', 'delta', 'pick', 'compil', 'sun', 'mar', 'ict', 'relev', 'part', 'delta', 'mhparam', 'sel', 'pick', 'command', 'work', 'sequenc', 'actual', 'theon', 'explicit', 'command', 'line', 'search', 'popup', 'theon', 'come', 'get', 'still', 'use', 'version', 'code', 'form', 'day', 'ago', 'abl', 'reach', 'cv', 'repositori', 'today', 'local', 'rout', 'issu', 'think', 'mail']


In [17]:
# train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
#     devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
#     test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]

#### Consider additional feature engineering?
- number of tokens, etc.
- presence of word 'unsubscribe'

For feature engineering, what would you say to the following combinations:

- yes case normalization, yes symbol removal, yes stopword removal, then stem
- yes case normalization, yes symbol removal, no stopword removal, then stem
- yes case normalization, no symbol removal, no stopword removal, then stem
- no case normalization, no symbol removal, no stopword removal, then stem
  
  
- yes case normalization, yes symbol removal, yes stopword removal, then lemmatize
- yes case normalization, yes symbol removal, no stopword removal, then lemmatize
- yes case normalization, no symbol removal, no stopword removal, then lemmatize
- no case normalization, no symbol removal, no stopword removal, then lemmatize
  
  
- yes case normalization, yes symbol removal, yes stopword removal, no stem / lemmatize
- yes case normalization, yes symbol removal, no stopword removal, no stem / lemmatize
- yes case normalization, no symbol removal, no stopword removal, no stem / lemmatize
- no case normalization, no symbol removal, no stopword removal, no stem / lemmatize
  
Then once we've determined the best performer, we can implement some sort of error correction to see if that improves it.

Then we can try versions of the best performer that are noun-only or verb-only (with the settings above, so long as not contradictory)
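
The twelve combinations listed above can be generated rather than typed out. This is only a sketch of how the experiment grid might be enumerated; the dictionary keys (`lowercase`, `rm_symbols`, `rm_stopwords`, `reducer`) are hypothetical names and are not yet wired into `process_email_body`.

```python
from itertools import product

# (case normalization, symbol removal, stopword removal), in the order listed above
cleaning_options = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, False, False),
]
# Final step applied to each token
reducers = ["stem", "lemmatize", "none"]

configs = [
    {"lowercase": case, "rm_symbols": symbols,
     "rm_stopwords": stops, "reducer": reducer}
    for reducer, (case, symbols, stops) in product(reducers, cleaning_options)
]

for i, cfg in enumerate(configs, start=1):
    print(i, cfg)
```

Each configuration could then be passed to a preprocessing function and scored with the same classifier and split, so that differences in accuracy reflect the preprocessing alone.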


#### Split labeled corpus into emails and labels:

Apply processing functions to email bodies. 
- Option 1: manual. 
- Option 2: sklearn pre-processor

In [18]:
# Option 1: takes a long time

# emails = [process_email_body(email) for (email, label) in labeled_emails] # X
# y = [label for (email, label) in labeled_emails] # y = labels

# vectorizer = TfidfVectorizer()

In [19]:
# Option 2
# Refer to https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
# play with max_features, etc.

emails = [email for (email, label) in labeled_emails] # X
y = [label for (email, label) in labeled_emails] # y = labels

vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', token_pattern = r'[a-zA-Z]+', max_features=1000)

#### Vectorize:

In [20]:
X = vectorizer.fit_transform(emails)

In [21]:
print(vectorizer.get_feature_names()[:200])

['aa', 'ab', 'able', 'absolutely', 'ac', 'accept', 'access', 'account', 'act', 'action', 'actually', 'ad', 'adclick', 'add', 'address', 'addresses', 'admanmail', 'admin', 'administration', 'ads', 'adult', 'advantage', 'advertising', 'ae', 'africa', 'ag', 'age', 'agent', 'agents', 'ago', 'ah', 'ahref', 'aid', 'al', 'alb', 'align', 'allow', 'alsa', 'alt', 'alternative', 'america', 'american', 'amp', 'annuity', 'answer', 'application', 'applications', 'apply', 'apt', 'archive', 'area', 'arial', 'article', 'ascii', 'asciicontent', 'ask', 'asp', 'assist', 'assistance', 'atoll', 'au', 'aug', 'available', 'average', 'aw', 'away', 'awr', 'b', 'background', 'bad', 'bank', 'base', 'based', 'bb', 'bc', 'beenthere', 'begin', 'believe', 'benefit', 'best', 'better', 'bgcolor', 'bidi', 'big', 'billion', 'bin', 'bindex', 'bit', 'bitx', 'black', 'blank', 'blockquote', 'blue', 'body', 'bold', 'bonus', 'book', 'border', 'bordercolor', 'boundary', 'box', 'br', 'bug', 'build', 'bulk', 'bulklist', 'bush', '

In [22]:
print(X.shape)

(1000, 1000)


### Train Test Split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

### Naive Bayes - Gaussian

In [38]:
# GaussianNB does not accept sparse input, so convert the TF-IDF matrices to dense arrays
print(type(X))
X_train = X_train.toarray()
X_test = X_test.toarray()

<class 'scipy.sparse.csr.csr_matrix'>


In [39]:
# Instantiate and train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict on test set
y_pred = gnb.predict(X_test)

In [40]:
gnb.score(X_train,y_train)

0.9914285714285714

In [41]:
gnb.score(X_test, y_test)

0.9366666666666666

In [42]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[131  15]
 [  4 150]]
              precision    recall  f1-score   support

           0       0.97      0.90      0.93       146
           1       0.91      0.97      0.94       154

    accuracy                           0.94       300
   macro avg       0.94      0.94      0.94       300
weighted avg       0.94      0.94      0.94       300
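
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them in plain Python from the matrix printed above (rows are actual classes, columns are predicted classes); `per_class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
# Confusion matrix reported above: rows are actual, columns are predicted
cm = [[131, 15],
      [4, 150]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total
    actual_k = sum(cm[k])                    # row total
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in range(2):
    p, r, f = per_class_metrics(cm, k)
    print("class {}: precision={:.2f} recall={:.2f} f1={:.2f}".format(k, p, r, f))

accuracy = sum(cm[k][k] for k in range(2)) / sum(map(sum, cm))
print("accuracy={:.4f}".format(accuracy))
```

For class 0 (ham), for example, precision is 131 / (131 + 4) and recall is 131 / (131 + 15), which round to the 0.97 and 0.90 shown in the report.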



### Naive Bayes - Bernoulli

In [43]:
# Instantiate and train Bernoulli Naive Bayes model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
# Predict on test set
y_pred = bnb.predict(X_test)

In [44]:
bnb.score(X_train,y_train)

0.8985714285714286

In [45]:
bnb.score(X_test, y_test)

0.9166666666666666

In [46]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[142   4]
 [ 21 133]]
              precision    recall  f1-score   support

           0       0.87      0.97      0.92       146
           1       0.97      0.86      0.91       154

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300



In [47]:
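# Load the Spambase data
# Note: spambase.data has no header row, so read_csv as called here consumes the first
# record as column names (the counts below total 4600); passing header=None would keep it.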
spambase = pd.read_csv("https://raw.githubusercontent.com/Vthomps000/DATA620/master/spambase.data")

In [48]:
# column names from spambase.names data
spambase.columns=['word_freq_make','word_freq_address','word_freq_all','word_freq_3d','word_freq_our','word_freq_over',
              'word_freq_remove','word_freq_internet','word_freq_order','word_freq_mail','word_freq_receive',
              'word_freq_will','word_freq_people','word_freq_report','word_freq_addresses','word_freq_free',
              'word_freq_business','word_freq_email','word_freq_you','word_freq_credit','word_freq_your',
              'word_freq_font','word_freq_000','word_freq_money','word_freq_hp','word_freq_hpl','word_freq_george',
              'word_freq_650','word_freq_lab','word_freq_labs','word_freq_telnet','word_freq_857','word_freq_data',
              'word_freq_415','word_freq_85','word_freq_technology','word_freq_1999','word_freq_parts','word_freq_pm',
              'word_freq_direct','word_freq_cs','word_freq_meeting','word_freq_original','word_freq_project',
              'word_freq_re','word_freq_edu','word_freq_table','word_freq_conference','char_freq_;','char_freq_(',
              'char_freq_[','char_freq_!','char_freq_$','char_freq_#','capital_run_length_average','capital_run_length_longest',
              'capital_run_length_total','spamclass']                       

In [49]:
# Count the number of spam vs. not spam
spam_count = len(spambase[spambase.spamclass==1])
ham_count = len(spambase[spambase.spamclass==0])

print("Spam: %d" %spam_count)
print("Ham: %d" %ham_count)

Spam: 1812
Ham: 2788


### Decision Tree Classifier

In [50]:
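# Train 70%, test 30%
X = spambase.values[:, 0:57]
y = spambase.values[:, 57]

# Fix random_state so the split, and therefore the scores below, can be reproduced
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, test_size=.3, random_state=88)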

In [51]:
from sklearn.tree import DecisionTreeClassifier

# All other parameters are left at their scikit-learn defaults
# (presort is omitted because recent scikit-learn releases no longer accept it)
dt = DecisionTreeClassifier(min_impurity_decrease=1e-07, random_state=88)
dt.fit(X_train, y_train)
dt.score(X_test, y_test)



0.9152173913043479

In [52]:
# Confusion matrix for Decision Tree
dt_cm = confusion_matrix(y_test, dt.predict(X_test))
pd.DataFrame(data = dt_cm, columns = ['Predicted Ham', 'Predicted Spam'],
            index = ['Actual Ham', 'Actual Spam'])

| | Predicted Ham | Predicted Spam |
|---|---|---|
| Actual Ham | 797 | 56 |
| Actual Spam | 61 | 466 |


In [53]:
dt_pred = dt.predict(X_test)
print("Number of mislabeled emails out of a total %d emails in test dataset : %d"
       % (X_test.shape[0],(y_test != dt_pred).sum()))
print("In detail, %d ham emails are mislabeled as spam, %d spam emails are mislabeled as ham."
      % (dt_cm [0,1], dt_cm[1,0]))

Number of mislabeled emails out of a total 1380 emails in test dataset : 117
In detail, 56 ham emails are mislabeled as spam, 61 spam emails are mislabeled as ham.


### Support Vector Machines

In [78]:
# Configure grid search for hyperparameter tuning at exponential increments
def svm_tune_grid(X, y, kernel, nfolds):
    C = [.0001, .001, .01, .1, 1, 10]
    gamma = [.0001, .001, .01, .1, 1, 10]

    # The linear kernel only uses C; the RBF kernel uses both C and gamma
    if kernel == 'linear':
        param_grid = {'C': C}
    elif kernel == 'rbf':
        param_grid = {'C': C, 'gamma': gamma}
    else:
        raise ValueError('Kernel not recognized or supported: {}'.format(kernel))

    grid_search = GridSearchCV(svm.SVC(kernel=kernel), param_grid, cv=nfolds)
    grid_search.fit(X, y)

    return grid_search.best_params_

A Support Vector Machine (SVM) is a classifier that separates classes with the hyperplane that maximizes the margin between them. The 'kernel trick' lets it do this efficiently in a transformed feature space without computing the transformation explicitly.

We employ two common kernels - linear and radial basis function (RBF) - to evaluate their respective performance.

The classifier takes several parameters. The C parameter is a regularization term that penalizes misclassification: a lower value imposes a softer class boundary that tolerates more training errors, and a higher value a harder one. It is used for both linear and RBF kernels. The RBF kernel also takes a gamma parameter, which controls the distance over which a given training example influences the boundary: low values reach far, high values only reach nearby points.
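
To illustrate the role of gamma, here is a minimal sketch of the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), evaluated for one pair of points at several gamma values from our grid. The `rbf_kernel` helper and the two points are hypothetical and serve only as an illustration.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
z = [1.0, 1.0]  # squared distance from x is 2

# Similarity between the same two points shrinks as gamma grows
for gamma in [0.0001, 0.01, 1, 10]:
    print(gamma, rbf_kernel(x, z, gamma))
```

With a small gamma the two points still look almost identical to the kernel, so each training example influences a wide region; with a large gamma the similarity drops toward zero and only very close points affect each other.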

We perform a grid search to identify good candidates for the C (linear and RBF) and gamma (RBF only) parameters. To limit the computational load, we use five-fold cross-validation; ten folds would be preferable.
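
The computational load grows with both the grid size and the number of folds. As a rough sketch, the snippet below counts the cross-validation fits implied by the C and gamma grids used in `svm_tune_grid`; `n_fits` is a hypothetical helper, and the count excludes the one final refit on the full training set that `GridSearchCV` performs by default.

```python
from itertools import product

C = [.0001, .001, .01, .1, 1, 10]
gamma = [.0001, .001, .01, .1, 1, 10]

# Candidate settings for each kernel
linear_grid = [{'C': c} for c in C]
rbf_grid = [{'C': c, 'gamma': g} for c, g in product(C, gamma)]

def n_fits(grid, nfolds):
    """Number of cross-validation fits: one per candidate per fold."""
    return len(grid) * nfolds

for nfolds in (5, 10):
    print(nfolds, n_fits(linear_grid, nfolds), n_fits(rbf_grid, nfolds))
```

With five folds this gives 30 fits for the linear kernel and 180 for RBF; ten folds would double both.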

In [79]:
# Grid search for optimal C in linear kernel
svm_tune_grid(X_train, y_train, 'linear', 5)

{'C': 10}

In [80]:
# Grid search for optimal C and gamma in radial basis function kernel
svm_tune_grid(X_train, y_train, 'rbf', 5)

{'C': 10, 'gamma': 1}

We fit an SVM classifier with the linear kernel and C parameter value of 10.

In [82]:
# Fit SVM classifier with linear kernel on training set
svm_lin = svm.SVC(C=10, 
               kernel='linear')
svm_lin.fit(X_train, y_train)

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [83]:
svm_lin.score(X_train, y_train)
svm_lin.score(X_test, y_test)
svm_lin_y_pred = svm_lin.predict(X_test)
print(confusion_matrix(y_test, svm_lin_y_pred))
print(classification_report(y_test, svm_lin_y_pred))

[[141   5]
 [  5 149]]
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       146
           1       0.97      0.97      0.97       154

    accuracy                           0.97       300
   macro avg       0.97      0.97      0.97       300
weighted avg       0.97      0.97      0.97       300



We fit an SVM classifier with the radial basis function kernel, C parameter value of 10, and gamma of 1.

In [93]:
# Fit SVM classifier with RBF kernel on training set
svm_rbf = svm.SVC(C=10,
                  kernel='rbf',
                  gamma=1)
svm_rbf.fit(X_train,y_train)

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

In [92]:
svm_rbf.score(X_train, y_train)
svm_rbf.score(X_test, y_test)
svm_rbf_y_pred = svm_rbf.predict(X_test)
print(confusion_matrix(y_test, svm_rbf_y_pred))
print(classification_report(y_test, svm_rbf_y_pred))

[[143   3]
 [  4 150]]
              precision    recall  f1-score   support

           0       0.97      0.98      0.98       146
           1       0.98      0.97      0.98       154

    accuracy                           0.98       300
   macro avg       0.98      0.98      0.98       300
weighted avg       0.98      0.98      0.98       300



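SVM findings: on the 300-email test set, the linear kernel misclassifies 10 emails (accuracy 0.97) and the RBF kernel misclassifies 7 (accuracy 0.98), so both score higher than the two Naive Bayes models above. Note that the selected C of 10 is the largest value in the grid, so a wider grid might find a better setting.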

### Adaptive Boosting

In [13]:
# Default base estimator; explicit arguments that newer scikit-learn releases reject are omitted
ada = AdaBoostClassifier(learning_rate=1.0, n_estimators=50, random_state=88)
ada.fit(X_train, y_train)
ada.score(X_test, y_test)

0.9398550724637681

In [14]:
ada_cm = confusion_matrix(y_test, ada.predict(X_test))

pd.DataFrame(data = ada_cm, columns = ['Predicted Ham', 'Predicted Spam'],
            index = ['Actual Ham', 'Actual Spam'])

| | Predicted Ham | Predicted Spam |
|---|---|---|
| Actual Ham | 831 | 36 |
| Actual Spam | 47 | 466 |


In [15]:
ada_pred = ada.predict(X_test)
print("Number of mislabeled emails out of a total %d emails in test dataset : %d"
       % (X_test.shape[0],(y_test != ada_pred).sum()))
print("In detail, %d ham emails are mislabeled as spam, %d spam emails are mislabeled as ham."
      % (ada_cm [0,1], ada_cm[1,0]))

Number of mislabeled emails out of a total 1380 emails in test dataset : 83
In detail, 36 ham emails are mislabeled as spam, 47 spam emails are mislabeled as ham.


### Random Forest

In [None]:
#reference:https://datawhatnow.com/feature-importance/

In [16]:
# RandomForestClassifier is already imported in Setup.
# All other parameters are left at their scikit-learn defaults.
rf = RandomForestClassifier(n_estimators=10, min_impurity_decrease=1e-07, random_state=88)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.936231884057971

In [17]:
rf_cm = confusion_matrix(y_test, rf.predict(X_test))
pd.DataFrame(data = rf_cm, columns = ['Predicted Ham', 'Predicted Spam'],
            index = ['Actual Ham', 'Actual Spam'])

| | Predicted Ham | Predicted Spam |
|---|---|---|
| Actual Ham | 839 | 28 |
| Actual Spam | 60 | 453 |


In [18]:
rf_pred = rf.predict(X_test)
print("Number of mislabeled emails out of a total %d emails in test dataset : %d"
       % (X_test.shape[0],(y_test != rf_pred).sum()))
print("In detail, %d ham emails are mislabeled as spam, %d spam emails are mislabeled as ham."
      % (rf_cm [0,1], rf_cm[1,0]))

Number of mislabeled emails out of a total 1380 emails in test dataset : 88
In detail, 28 ham emails are mislabeled as spam, 60 spam emails are mislabeled as ham.


## Conclusion

In [None]:
# Can we add the other models to this? 

In [20]:
#reference:https://datawhatnow.com/feature-importance/
Conclusion = {'Decision Tree' : [dt.score(X_test, y_test), (y_test != dt_pred).sum()],
             'Random Forest' : [rf.score(X_test, y_test), (y_test != rf_pred).sum()],
             'AdaBoost' : [ada.score(X_test, y_test), (y_test != ada_pred).sum()],
             }
pd.DataFrame(Conclusion, index=['Accuracy', 'Mislabelled'])

| | Decision Tree | Random Forest | AdaBoost |
|---|---|---|---|
| Accuracy | 0.913043 | 0.936232 | 0.939855 |
| Mislabelled | 120.0 | 88.0 | 83.0 |
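
One way to bring the other models into a summary like this is to derive the same two rows from their reported confusion matrices. The sketch below does so for the Naive Bayes and SVM models, using the matrices printed in their sections. Keep in mind that those four models were scored on the 300-email TF-IDF test set, while the table above uses the 1380-row Spambase test set, so the two groups are not directly comparable.

```python
# Test-set confusion matrices copied from the sections above
# (rows: actual ham/spam, columns: predicted ham/spam)
confusion = {
    'Gaussian NB': [[131, 15], [4, 150]],
    'Bernoulli NB': [[142, 4], [21, 133]],
    'SVM (linear)': [[141, 5], [5, 149]],
    'SVM (RBF)': [[143, 3], [4, 150]],
}

summary = {}
for name, cm in confusion.items():
    total = sum(map(sum, cm))
    correct = cm[0][0] + cm[1][1]  # diagonal entries are correct predictions
    summary[name] = {'Accuracy': round(correct / total, 6),
                     'Mislabelled': total - correct}

for name, row in summary.items():
    print(name, row)
```

Passing `summary` to `pd.DataFrame` would give a table in the same layout as the one above.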


## Youtube