Makaylah Cowan

Spring 2020

CS 251: Data Analysis and Visualization

Supervised learning

In [6]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Task 3: Naive Bayes Classifier

After finishing your email preprocessing pipeline, implement the other supervised learning algorithm we will use to classify email: **Naive Bayes**.

### 3a) Implement Naive Bayes

In `naive_bayes.py`, implement the following methods:
- Constructor
- `train(data, y)`: Train the Naive Bayes classifier so that it records the "statistics" of the training set: the class priors (i.e. how likely is an email in the training set to be spam or ham?) and the class likelihoods (the probability of a word appearing in each class, spam or ham).
- `predict(data)`: Combine the class likelihoods and priors to compute the posterior distribution. The predicted class for a test sample is the class that yields the highest posterior probability.
- `accuracy(y, y_pred)`: The usual definition :)
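
For reference, the "usual definition" of accuracy is the fraction of predictions that match the true labels. A minimal sketch, assuming `y` and `y_pred` are equal-length sequences of integer class labels (the standalone function below is illustrative, not the required method signature):

```python
import numpy as np


def accuracy(y, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    y = np.asarray(y)
    y_pred = np.asarray(y_pred)
    return np.mean(y == y_pred)


# Toy labels (made up for illustration): 3 of the 4 predictions are correct.
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(acc)  # 0.75
```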


#### Bayes rule ingredients: Priors and likelihood (`train`)

To compute class predictions (the probability that a test example belongs to the spam or ham class), we need to evaluate **Bayes Rule**. This means computing the priors and likelihoods based on the training data.

**Prior:** $$P_c = \frac{N_c}{N}$$ where $P_c$ is the prior for class $c$ (spam or ham), $N_c$ is the number of training samples that belong to class $c$ and $N$ is the total number of training samples.

**Likelihood:** $$L_{c,w} = \frac{N_{c,w} + 1}{N_{c} + M}$$ where
- $L_{c,w}$ is the likelihood that word $w$ belongs to class $c$ (*i.e. what we are solving for*)
- $N_{c,w}$ is the total count of **word $w$** in emails that are only in class $c$ (*either spam or ham*)
- $N_{c}$ is the total number of **all words** that appear in emails of the class $c$ (*total number of words in all spam emails or total number of words in all ham emails*). Note that this is a word count, unlike the $N_c$ in the prior, which counts emails.
- $M$ is the number of features (*number of top words*).
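
The prior and likelihood formulas above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: `data` is an `(N, M)` array of word counts, `y` holds integer class labels, and the function name `fit_naive_bayes` and the toy arrays are hypothetical rather than part of the assignment's API.

```python
import numpy as np


def fit_naive_bayes(data, y, num_classes):
    """Return (priors, likelihoods) following the formulas above.

    data: (N, M) array of word counts; y: (N,) integer class labels.
    """
    num_samps, num_features = data.shape
    priors = np.zeros(num_classes)
    likelihoods = np.zeros((num_classes, num_features))
    for c in range(num_classes):
        class_data = data[y == c]
        # Prior: number of emails in class c over the total number of emails.
        priors[c] = class_data.shape[0] / num_samps
        # N_{c,w}: total count of each word across the emails of class c.
        word_counts = class_data.sum(axis=0)
        # Add 1 to each word count and M to the total word count of class c.
        likelihoods[c] = (word_counts + 1) / (word_counts.sum() + num_features)
    return priors, likelihoods


# Toy data (made up): 4 emails, 3 words, 2 classes.
toy_counts = np.array([[2, 0, 1],
                       [1, 1, 0],
                       [0, 3, 1],
                       [0, 2, 2]])
toy_labels = np.array([0, 0, 1, 1])
priors, likelihoods = fit_naive_bayes(toy_counts, toy_labels, num_classes=2)
print(priors)
print(likelihoods)
```

Because of the `+ 1` in the numerator and `+ M` in the denominator, each row of `likelihoods` sums to 1 and no word receives a likelihood of exactly zero, which keeps the logarithms used at prediction time finite.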

#### Bayes rule ingredients: Posterior (`predict`)

To make predictions, we now combine the prior and likelihood to get the posterior:

**Posterior:** $$\text{Post}_{i, c} = \log(P_c) + \sum_{j \in J_i}\log(L_{c,j})$$ where
- $\text{Post}_{i, c}$ is the posterior for class $c$ for test sample $i$ (*i.e. evidence that email $i$ is spam or ham*). This is what we are solving for.
- $\log(P_c)$ is the logarithm of the prior $P_c$ for class $c$.
- $j \in J_i$ (under the sum) indexes the set of words in the current test sample that have nonzero counts (*i.e. which words show up in the current test set email $i$? $j$ is the index of each of these words.*)
- $\sum_{j \in J_i}\log(L_{c,j})$: we sum over the log-likelihoods ONLY PERTAINING TO CLASS $c$ at the word indices that appear in the current test email $i$ (i.e. indices at which the counts are > 0).
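
One possible vectorized reading of the posterior formula is sketched below. It assumes `priors` has shape `(C,)`, `likelihoods` has shape `(C, M)`, and `data` is an `(N, M)` array of word counts; the function name and the toy values are hypothetical. A 0/1 "word is present" mask multiplied by the log-likelihoods performs the sum over $j \in J_i$.

```python
import numpy as np


def predict_log_posterior(data, priors, likelihoods):
    """Return (predicted classes, log-posterior matrix of shape (N, C))."""
    log_priors = np.log(priors)          # shape (C,)
    log_likes = np.log(likelihoods)      # shape (C, M)
    # 1.0 where a word has a nonzero count in the email, else 0.0.
    present = (data > 0).astype(float)   # shape (N, M)
    # For each email i and class c: log prior + sum of log-likelihoods
    # over only the words that appear in email i.
    log_post = log_priors + present @ log_likes.T
    return np.argmax(log_post, axis=1), log_post


# Toy model (made up): 2 classes, 3 words.
toy_priors = np.array([0.5, 0.5])
toy_likelihoods = np.array([[0.5, 0.25, 0.25],
                            [1 / 11, 6 / 11, 4 / 11]])
toy_emails = np.array([[3, 0, 0],
                       [0, 2, 1]])
y_pred, log_post = predict_log_posterior(toy_emails, toy_priors, toy_likelihoods)
print(y_pred)  # [0 1]
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when an email contains many words.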

In [40]:
from naive_bayes_multinomial import NaiveBayes

#### Test `train`

In [215]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.random(size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))
nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

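print(f'Your class priors are: {nbc.class_priors}\nand should be          [0.24 0.26 0.25 0.25].')
print(f'Your class likelihoods shape is {nbc.class_likelihoods.shape} and should be (4, 6).')
print(f'Your likelihoods are:\n{nbc.class_likelihoods}\nand should be')
print('''[[0.15116 0.18497 0.17571 0.1463  0.16813 0.17374]
 [0.16695 0.17437 0.15742 0.16887 0.15677 0.17562]
 [0.14116 0.1562  0.19651 0.17046 0.17951 0.15617]
 [0.18677 0.18231 0.15884 0.12265 0.16755 0.18187]]''')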

Your class priors are: [0.24 0.26 0.25 0.25]
and should be          [0.24 0.26 0.25 0.25].
Your class likelihoods shape is (4, 6) and should be (4, 6).
Your likelihoods are:
[[0.15116 0.18497 0.17571 0.1463  0.16813 0.17374]
 [0.16695 0.17437 0.15742 0.16887 0.15677 0.17562]
 [0.14116 0.1562  0.19651 0.17046 0.17951 0.15617]
 [0.18677 0.18231 0.15884 0.12265 0.16755 0.18187]]
and should be
[[0.15116 0.18497 0.17571 0.1463  0.16813 0.17374]
 [0.16695 0.17437 0.15742 0.16887 0.15677 0.17562]
 [0.14116 0.1562  0.19651 0.17046 0.17951 0.15617]
 [0.18677 0.18231 0.15884 0.12265 0.16755 0.18187]]


#### Test `predict`

In [217]:
num_test_classes = 4
np.random.seed(0)
data_train = np.random.random(size=(100, 10))
data_test = np.random.random(size=(4, 10))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_train, y_test)
test_y_pred = nbc.predict(data_test)

print(f'Your predicted classes are {test_y_pred} and should be [2 2 2 2].')

Your predicted classes are [2 2 2 2] and should be [2 2 2 2].


### 3c) Spam filtering

Let's start classifying spam email using the Naive Bayes classifier.

- Use `np.load` to load in the train/test split that you created last week.
- Use your Naive Bayes classifier on the Enron email dataset!

**Question 9:** What accuracy do you get on the test set with Naive Bayes?

**Answer 9:** About 89% accuracy (0.8902).

In [245]:
import retrieve_emails as ep

In [226]:
# Load your training and test data into numpy ndarrays using np.load()
# (the files you created at the end of the previous notebook)
classes = np.load("email_data/email_classes.npy")
features = np.load("email_data/email_features.npy")
test_x = np.load("email_data/email_test_x.npy")
test_y = np.load("email_data/email_test_y.npy")
train_x = np.load("email_data/email_train_x.npy")
train_y = np.load("email_data/email_train_y.npy")
test_ind = np.load("email_data/email_test_inds.npy")
train_ind = np.load("email_data/email_train_inds.npy")

In [301]:
# Construct your classifier
NB1 = NaiveBayes(len(np.unique(classes)))

In [315]:
# Train and test your classifier
NB1.train(train_x, train_y)
pred = NB1.predict(test_x)
accNB = NB1.accuracy(test_y, pred)
print("Naive Bayes Accuracy:",round(accNB,4))

Naive Bayes Accuracy: 0.8902


### 3d) Confusion matrix

To get a better sense of the errors that the Naive Bayes classifier makes, you will create a confusion matrix.

- Implement `confusion_matrix` in `naive_bayes.py`.
- Print out a confusion matrix of the spam classification results.
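
A confusion matrix counts how often each actual class is predicted as each class. A minimal sketch, assuming the convention that rows are actual classes and columns are predicted classes (this convention, the standalone function signature, and the toy labels are assumptions, not taken from `naive_bayes.py`):

```python
import numpy as np


def confusion_matrix(y, y_pred, num_classes):
    """Entry [a, p] counts samples of actual class a predicted as class p."""
    y = np.asarray(y, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    conf = np.zeros((num_classes, num_classes))
    # np.add.at accumulates correctly when an (actual, predicted) pair repeats.
    np.add.at(conf, (y, y_pred), 1)
    return conf


# Toy labels (made up for illustration).
toy_conf = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], num_classes=2)
print(toy_conf)
```

The diagonal holds the correct predictions, so the sum of the diagonal divided by the sum of all entries equals the accuracy.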

In [229]:
print(NB1.confusion_matrix(test_y, pred))

[[3237.  175.]
 [ 565. 2763.]]


**Question 10:** Interpret the confusion matrix, using the convention that positive detection means spam (*e.g. a false positive means classifying a ham email as spam*). What types of errors are made more frequently by the classifier? What does this mean (*i.e. X (spam/ham) is more likely to be classified as Y (spam/ham) than the other way around*)?


**Reminder: Look back at your preprocessing code: which class indices correspond to spam/ham?**

**Answer 10:** The classifier correctly identified 3237 spam emails and 2763 ham emails, but it incorrectly labeled 565 ham emails as spam (false positives) and 175 spam emails as ham (false negatives). The more frequent error was the false positive: ham is more likely to be classified as spam than spam is to be classified as ham. This means the classifier recognizes spam emails more reliably than ham emails.
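
The comparison in Answer 10 can be checked numerically from the printed confusion matrix. The sketch below assumes rows are actual classes, columns are predicted classes, and the coding spam = 0, ham = 1; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above; rows = actual class, columns = predicted (assumed).
conf = np.array([[3237., 175.],
                 [565., 2763.]])
SPAM, HAM = 0, 1  # assumed class coding

# Overall accuracy: correct predictions (the diagonal) over all predictions.
overall_accuracy = np.trace(conf) / conf.sum()
# Per-class recall: fraction of each actual class that was labeled correctly.
per_class_recall = np.diag(conf) / conf.sum(axis=1)
false_positives = conf[HAM, SPAM]   # ham labeled as spam
false_negatives = conf[SPAM, HAM]   # spam labeled as ham

print(f'accuracy: {overall_accuracy:.4f}')
print(f'spam recall: {per_class_recall[SPAM]:.4f}, ham recall: {per_class_recall[HAM]:.4f}')
print(f'false positives: {false_positives:.0f}, false negatives: {false_negatives:.0f}')
```

The accuracy recovered this way agrees with the 0.8902 reported earlier, and the recall for spam is higher than the recall for ham.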

### 3e) Investigate the misclassification errors

Numbers are nice, but they may not be the best for developing your intuition. Sometimes, you want to see what a misclassification *actually* looks like to build your understanding as you look to improve your algorithm. Here, you will take a false positive and a false negative misclassification and retrieve the actual text of each email to see which emails produced the errors.

- Determine the index of the **FIRST** false positive and false negative misclassification — i.e. 2 indices in total. Remember to use your inds array to figure out the index of the emails BEFORE shuffling happened.
- **Section B:** Implement the function `retrieve_emails` in `email_preprocessor.py` to return the string of the raw email at the error indices. (**Sections A/C** have been supplied with this function on Classroom.)
- Call your function to print out the two emails that produced misclassifications.
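
One way to locate the first false positive and false negative, and to map them back to pre-shuffle positions, is sketched below. It assumes the coding spam = 0, ham = 1 and that `inds[i]` stores the original (pre-shuffle) index of test sample `i`; the function `first_fp_fn` and the toy arrays are hypothetical.

```python
import numpy as np

SPAM, HAM = 0, 1  # assumed class coding


def first_fp_fn(y, y_pred, inds):
    """Return the original indices of the first false positive and false negative.

    Positive detection means spam, so a false positive is ham predicted as spam.
    Raises IndexError if either kind of error does not occur.
    """
    fp_mask = (y == HAM) & (y_pred == SPAM)
    fn_mask = (y == SPAM) & (y_pred == HAM)
    first_fp = np.flatnonzero(fp_mask)[0]
    first_fn = np.flatnonzero(fn_mask)[0]
    # Translate positions in the shuffled test set to pre-shuffle indices.
    return inds[first_fp], inds[first_fn]


# Toy example (made up): 5 test emails with their original indices.
toy_y = np.array([0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 0, 1, 0])
toy_inds = np.array([42, 7, 19, 3, 88])
fp_ind, fn_ind = first_fp_fn(toy_y, toy_pred, toy_inds)
print(fp_ind, fn_ind)  # 19 3
```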

**Question 11:** What do you think it is about each email that resulted in it being misclassified?

**Answer 11:** The false positive email is very long, with numbers scattered throughout. The false negative email is written in full sentences with sound punctuation.

In [321]:
# Determine the indices of the 1st FP and FN.
# Note: spam = 0, ham = 1
fp, fn = NB1.fp_fn(test_y, pred)


In [322]:
# Use retrieve_emails() to display the first FP and FN.
inds = np.array([fp, fn])
emails = ep.retrieve_emails(inds)

print()
print('The 1st email that is a false positive (classified as spam, but really not) is:')
print('------------------------------------------------------------------------------------------')
print(emails[0])
print('------------------------------------------------------------------------------------------')
print('The 1st email that is a false negative (classified as ham, but really spam) is:')
print('------------------------------------------------------------------------------------------')
print(emails[1])
print('------------------------------------------------------------------------------------------')

Discovered class names: ['ham', 'spam']
Processing data/enron/ham...
Processing data/enron/spam...

The 1st email that is a false positive (classified as spam, but really not) is:
------------------------------------------------------------------------------------------
Subject: california power 2 / 8
please contact kristin walsh ( x 39510 ) or robert johnston ( x 39934 ) for further clarification .
executive summary :
utility bankruptcy appears increasingly likely next week unless the state can clear three hurdles - agreement on payback for the bailout , rate increases , and further short - term funding for dwr purchases of power .
disagreement persists between gov . davis and democrats in the legislature on how the state should be paid back for its bailout of the utilities . the split is over a stock warrant plan versus state ownership of utility transmission assets .
the economics of the long - term contracts appear to show that rate hikes are unavoidable because of the need to amor

## Task 4: Comparison with KNN


- Run a similar analysis to what you did with Naive Bayes above. When computing accuracy on the test set, you may want to reduce the size of the test set (e.g. to the first 500 emails in the test set).
- Copy-paste your `confusion_matrix` method into `knn.py` so that you can run the same analysis on a KNN classifier.

In [193]:
from knn import KNN

In [200]:
# Construct and train your KNN classifier
nc = len(np.unique(classes))
knn = KNN(num_classes=nc)
knn.train(train_x, train_y)

In [196]:
# Evaluate the accuracy of the KNN classifier
y_pred = knn.predict(test_x[:500, :], 10)
accknn = knn.accuracy(test_y[:500], y_pred)
print("Accuracy:", accknn)

Accuracy: 0.906


In [202]:
print(knn.confusion_matrix(test_y[:500], y_pred))

[[251.  13.]
 [ 34. 202.]]


**Question 12:** What accuracy did you get on the test set (potentially reduced in size)?

**Answer 12:** With the test set reduced to its first 500 emails, KNN had an accuracy of about 91% (0.906).

**Question 13:** How does the confusion matrix compare to that obtained by Naive Bayes?

**Answer 13:** Proportionally, slightly fewer emails were misclassified by the KNN classifier. Both classifiers identified spam emails better than ham emails, and for both the more frequent error was labeling ham as spam.

**Question 14:** Briefly describe at least one pro/con of KNN compared to Naive Bayes on this dataset.

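**Answer 14:** Con: KNN takes much longer to make predictions on this dataset because it computes the distance from every test email to every training email.
Pro: KNN is slightly more accurate here (0.906 on the first 500 test emails vs. 0.8902 for Naive Bayes on the full test set).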
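
To make the cost argument concrete, here is a minimal KNN prediction sketch. It assumes Euclidean distance and a majority vote among the `k` nearest training samples; the function `knn_predict` and the toy data are hypothetical and not taken from `knn.py`.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k, num_classes):
    """Predict each test sample's class by majority vote of its k nearest neighbors."""
    preds = np.zeros(len(test_x), dtype=int)
    for i, sample in enumerate(test_x):
        # One distance per training sample: this is the expensive step.
        dists = np.sqrt(np.sum((train_x - sample) ** 2, axis=1))
        nearest = np.argsort(dists)[:k]
        votes = np.bincount(train_y[nearest], minlength=num_classes)
        preds[i] = np.argmax(votes)
    return preds


# Toy data (made up): two well-separated clusters.
toy_train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
toy_train_y = np.array([0, 0, 0, 1, 1, 1])
toy_test_x = np.array([[0.2, 0.1], [5.5, 5.2]])
toy_knn_pred = knn_predict(toy_train_x, toy_train_y, toy_test_x, k=3, num_classes=2)
print(toy_knn_pred)  # [0 1]
```

Each prediction touches every training sample and every feature, whereas Naive Bayes only needs one pass over the stored log-likelihoods per class.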

**Question 15:** When potentially reducing the size of the test set here, why is it important that we shuffled our train and test set?


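**Answer 15:** Shuffling eliminates potential biases in how the emails were compiled. Without it, the first 500 test emails could be dominated by a single class (for example, if the emails were stored ham first and then spam), and accuracy on that subset would not be representative of the whole test set.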
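
The effect of shuffling on a reduced subset can be illustrated with synthetic labels. The sketch below assumes the unshuffled data is ordered by class; the sizes, the seed, and the variable names are made up for illustration.

```python
import numpy as np

# Synthetic labels ordered by class: 1000 of class 0 followed by 1000 of class 1.
num_per_class = 1000
labels_sorted = np.concatenate([np.zeros(num_per_class, dtype=int),
                                np.ones(num_per_class, dtype=int)])

# Shuffle with a permutation and keep it, so original indices can be recovered.
rng = np.random.default_rng(0)
shuffle_inds = rng.permutation(labels_sorted.size)
labels_shuffled = labels_sorted[shuffle_inds]

subset_sorted = labels_sorted[:500]
subset_shuffled = labels_shuffled[:500]

print('classes in first 500 (unshuffled):', np.unique(subset_sorted))
print('class counts in first 500 (shuffled):', np.bincount(subset_shuffled, minlength=2))
```

Without shuffling, the first 500 samples contain a single class, so accuracy on that subset says nothing about the other class. After shuffling, both classes are represented.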