**Maya Purohit**

CS 251: Data Analysis and Visualization

Fall 2023

Project 6: Supervised learning

In [23]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Task 3: Preprocess full spam email dataset 

Before you build a Naive Bayes spam email classifier, run the full spam email dataset through your preprocessing code.

Download and extract the full **Enron** emails (*zip file should be ~29MB large*). You should see a base `enron` folder, with `spam` and `ham` subfolders when you extract the zip file (these are the 2 classes).

Run the test code below to check everything over.

### 3a. Preprocess dataset

In [102]:
import email_preprocessor as epp

#### Test `count_words` and `find_top_words`

In [103]:
word_freq, num_emails = epp.count_words()

In [104]:
print(f'You found {num_emails} emails in the dataset. You should have found 32625.')

You found 32625 emails in the dataset. You should have found 32625.


In [105]:
top_words, top_counts = epp.find_top_words(word_freq)
print(f"Your top 10 words are\n{top_words[:10]}\nand they should be\n['the', 'to', 'and', 'of', 'a', 'in', 'for', 'you', 'is', 'enron']")
print(f"The associated counts are\n{top_counts[:10]}\nand they should be\n[277459, 203659, 148873, 139578, 111796, 100961, 80765, 77592, 68097, 60852]")

Your top 10 words are
['the', 'to', 'and', 'of', 'a', 'in', 'for', 'you', 'is', 'enron']
and they should be
['the', 'to', 'and', 'of', 'a', 'in', 'for', 'you', 'is', 'enron']
The associated counts are
[277459, 203659, 148873, 139578, 111796, 100961, 80765, 77592, 68097, 60852]
and they should be
[277459, 203659, 148873, 139578, 111796, 100961, 80765, 77592, 68097, 60852]


### 3b. Make feature and class vectors

In [108]:
features, y = epp.make_feature_vectors(top_words, num_emails)
print(features)

[[ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [21. 13. 11. ...  0.  1.  0.]
 ...
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  1.  1. ...  0.  0.  0.]
 [ 0.  1.  0. ...  0.  0.  0.]]


#### Verify class label coding

There are
- 16544 `ham` emails
- 16081 `spam` emails.

In the cell below, print out the number of emails that have class label `0` and the number that have class label `1`.
- The count for class label `0` should be 16544
- The count for class label `1` should be 16081.

If the counts across the labels are reversed, recode class label `0` as `1` and class label `1` as `0` in the label vector `y`. If you do this, print out the counts for each label again and verify you get the above counts.

In [109]:
unique, counts = np.unique(y, return_counts=True)
print(f'{unique[0]}: {counts[0]}')
print(f'{unique[1]}: {counts[1]}')

0.0: 16544
1.0: 16081


### 3b. Make train and test splits of the dataset

Here we divide the email features into an 80/20 train/test split (80% of the data is used to train the supervised learning model; the remaining 20% is withheld and used for testing / prediction).

In [110]:
np.random.seed(0)
x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(features, y)

print('Shapes for train/test splits:')
print(f'Train {x_train.shape}, classes {y_train.shape}')
print(f'Test {x_test.shape}, classes {y_test.shape}')
print('\nThey should be:\nTrain (26100, 200), classes (26100,)\nTest (6525, 200), classes (6525,)')

Shapes for train/test splits:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)

They should be:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)
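
A shuffle-then-split step like the one above can be sketched as follows. This is only an illustration, not the code in `email_preprocessor.py`: the function name `split_train_test`, its `test_prop` and `seed` parameters, and the toy arrays are all hypothetical.

```python
import numpy as np

def split_train_test(features, y, test_prop=0.2, seed=0):
    """Shuffle the samples, then split them into train and test sets (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    inds = rng.permutation(features.shape[0])    # shuffled original row indices
    n_test = int(round(test_prop * len(inds)))   # 20% of the samples by default
    n_train = len(inds) - n_test
    inds_train, inds_test = inds[:n_train], inds[n_train:]
    return (features[inds_train], y[inds_train], inds_train,
            features[inds_test], y[inds_test], inds_test)

# Toy data: 10 samples with 3 features each
toy_x = np.arange(30).reshape(10, 3)
toy_y = np.array([0, 1] * 5)
x_tr, y_tr, i_tr, x_te, y_te, i_te = split_train_test(toy_x, toy_y)
print(x_tr.shape, x_te.shape)
```

Returning the shuffled indices alongside the data makes it possible to trace any test sample back to its position in the original dataset.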


### 3c. Save data in binary format

It adds a lot of overhead to have to run through your raw email -> train/test feature split every time you want to work on your project! In this step, you will export the data in memory to disk in a binary format. That way, you can quickly load all the data back into memory (directly in ndarray format) whenever you want to work with it again. No need to parse from text files!

Running the following cell uses numpy's `save` function to make six files in `.npy` format (e.g. `email_train_x.npy`, `email_train_y.npy`, `email_train_inds.npy`, `email_test_x.npy`, `email_test_y.npy`, `email_test_inds.npy`).

In [None]:
np.save('data/email_train_x.npy', x_train)
np.save('data/email_train_y.npy', y_train)
np.save('data/email_train_inds.npy', inds_train)
np.save('data/email_test_x.npy', x_test)
np.save('data/email_test_y.npy', y_test)
np.save('data/email_test_inds.npy', inds_test)
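
As a quick sanity check of the save/load round trip, here is a minimal sketch on a toy array. The file name `toy_features.npy` and the temporary directory are assumptions made for illustration; the project itself writes to the `data/` folder.

```python
import os
import tempfile
import numpy as np

arr = np.array([[1.0, 0.0, 3.0],
                [0.0, 2.0, 0.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_features.npy')  # hypothetical file name
    np.save(path, arr)          # write the ndarray in NumPy's binary .npy format
    restored = np.load(path)    # read it back directly as an ndarray

print(np.array_equal(arr, restored), restored.dtype)
```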

## Task 4: Naive Bayes Classifier

After finishing your email preprocessing pipeline, implement the other supervised learning algorithm we will use to classify email, **Naive Bayes**.

### 4a. Implement Naive Bayes

In `naive_bayes.py`, implement the following methods:
- Constructor
- get methods
- `train(data, y)`: Train the Naive Bayes classifier so that it records the "statistics" of the training set: class priors (i.e. how likely an email is in the training set to be spam or ham?) and the class likelihoods (the probability of a word appearing in each class — spam or ham).
- `predict(data)`: Combine the class likelihoods and priors to compute the posterior distribution. The predicted class for a test sample is the class that yields the highest posterior probability.
- `accuracy(y, y_pred)`: The usual definition :)


#### Bayes rule ingredients: Priors and likelihood (`train`)

To compute class predictions (the probability that a test example belongs to the spam or ham class), we need to evaluate **Bayes Rule**. This means computing the priors and likelihoods based on the training data.

**Prior:** $$P_c = \frac{N_c}{N}$$ where $P_c$ is the prior for class $c$ (spam or ham), $N_c$ is the number of training samples that belong to class $c$ and $N$ is the total number of training samples.

**Likelihood:** $$L_{c,w} = \frac{T_{c,w} + 1}{T_{c} + M}$$ where
- $L_{c,w}$ is the likelihood that word $w$ belongs to class $c$ (*i.e. what we are solving for*)
- $T_{c,w}$ is the total count of **word $w$** in emails that are only in class $c$ (*either spam or ham*)
- $T_{c}$ is the total count of **all words** that appear in emails of the class $c$ (*total number of words in all spam emails or total number of words in all ham emails*)
- $M$ is the number of features (*number of top words*).
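
The prior and likelihood formulas above can be sketched in NumPy as follows. This is a standalone illustration under stated assumptions, not the `train` method in `naive_bayes.py`: the function name `fit_naive_bayes` and the toy data are hypothetical.

```python
import numpy as np

def fit_naive_bayes(data, y, num_classes):
    """Return (priors, likelihoods) for count features `data` and integer labels `y`."""
    num_samps, num_feats = data.shape
    priors = np.zeros(num_classes)
    likelihoods = np.zeros((num_classes, num_feats))
    for c in range(num_classes):
        class_data = data[y == c]                        # samples that belong to class c
        priors[c] = class_data.shape[0] / num_samps      # P_c = N_c / N
        word_counts = class_data.sum(axis=0)             # T_{c,w} for every word w
        # L_{c,w} = (T_{c,w} + 1) / (T_c + M), with T_c the total word count in class c
        likelihoods[c] = (word_counts + 1) / (word_counts.sum() + num_feats)
    return priors, likelihoods

# Toy data: 3 emails, 2 words; emails 0 and 2 are class 0, email 1 is class 1
toy_data = np.array([[2, 1],
                     [0, 3],
                     [1, 1]])
toy_y = np.array([0, 1, 0])
priors, likelihoods = fit_naive_bayes(toy_data, toy_y, num_classes=2)
print(priors)
print(likelihoods)
```

For class 0 in the toy data, the word counts are $[3, 2]$, so the likelihoods are $[4/7, 3/7]$. The $+1$ in the numerator keeps a word that never appears in a class from getting a likelihood of zero.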

#### Bayes rule ingredients: Posterior (`predict`)

To make predictions, we now combine the prior and likelihood to get the posterior:

**Log Posterior:** $$Log(\text{Post}_{i, c}) = Log(P_c) + \sum_{j \in J_i}x_{i,j}Log(L_{c,j})$$

 where
- $\text{Post}_{i,c}$ is the posterior for class $c$ for test sample $i$ (*i.e. evidence that email $i$ is spam or ham*). We solve for its logarithm.
- $Log(P_c)$ is the logarithm of the prior for class $c$.
- $x_{i,j}$ is the number of times the jth word appears in the ith email.
- $Log(L_{c,j})$ is the log-likelihood of the jth word in class $c$.
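
The log posterior can be computed for all test samples and classes at once with a single matrix product. The sketch below is an illustration, not the `predict` method in `naive_bayes.py`: the function name `predict_naive_bayes` and the toy priors and likelihoods are assumptions. Summing over every word is equivalent to summing over $J_i$ because words that do not appear in email $i$ have $x_{i,j} = 0$.

```python
import numpy as np

def predict_naive_bayes(data, priors, likelihoods):
    """Return the class with the highest log posterior for each row of `data`."""
    # (num_samps, num_feats) @ (num_feats, num_classes) -> (num_samps, num_classes)
    log_post = np.log(priors) + data @ np.log(likelihoods).T
    return np.argmax(log_post, axis=1)

# Toy statistics for 2 classes and 2 words (assumed values for illustration)
priors = np.array([2/3, 1/3])
likelihoods = np.array([[4/7, 3/7],
                        [1/5, 4/5]])

# Email 0 uses only word 0; email 1 uses only word 1
toy_test = np.array([[3, 0],
                     [0, 4]])
toy_pred = predict_naive_bayes(toy_test, priors, likelihoods)
print(toy_pred)  # [0 1]
```

Working with logs turns a product of many small probabilities into a sum, which avoids numerical underflow on long emails.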

In [111]:
from naive_bayes import NaiveBayes

#### Test `train`

###### Class priors and likelihoods

The following test should be used only if storing the class priors and likelihoods directly.

In [112]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.randint(low=0, high=20, size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

print(f'Your class priors are: {nbc.get_priors()}\nand should be          [0.28 0.22 0.32 0.18].')
print(f'Your class likelihoods shape is {nbc.get_likelihoods().shape} and should be (4, 6).')
print(f'Your likelihoods are:\n{nbc.get_likelihoods()}')

print(f'and should be')
print('''[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]''')

Your class priors are: [0.28 0.22 0.32 0.18]
and should be          [0.28 0.22 0.32 0.18].
Your class likelihoods shape is (4, 6) and should be (4, 6).
Your likelihoods are:
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]
and should be
[[0.15997 0.15091 0.2079  0.19106 0.14184 0.14832]
 [0.11859 0.16821 0.17914 0.16905 0.18082 0.18419]
 [0.16884 0.17318 0.14495 0.14332 0.18784 0.18187]
 [0.16126 0.17011 0.15831 0.13963 0.18977 0.18092]]


###### Log of class priors and likelihoods

This test should be used only if storing the log of the class priors and likelihoods.

In [None]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.randint(low=0, high=20, size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

print(f'Your log class priors are: {nbc.get_priors()}\nand should be              [-1.27297 -1.51413 -1.13943 -1.7148 ].')
print(f'Your log class likelihoods shape is {nbc.get_likelihoods().shape} and should be (4, 6).')
print(f'Your log likelihoods are:\n{nbc.get_likelihoods()}')


print(f'and should be')
print('''[[-1.83274 -1.89109 -1.57069 -1.65516 -1.95306 -1.90841]
 [-2.13211 -1.78255 -1.71958 -1.77756 -1.71023 -1.6918 ]
 [-1.77881 -1.75342 -1.93136 -1.94266 -1.67217 -1.70448]
 [-1.82475 -1.77132 -1.84321 -1.96879 -1.66192 -1.70968]]''')

#### Test `predict`

In [113]:
num_test_classes = 4
np.random.seed(0)
data_train = np.random.randint(low=0, high=15, size=(100, 10))
data_test = np.random.randint(low=0, high=15, size=(15, 10))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_train, y_test)
test_y_pred = nbc.predict(data_test)

print(f'Your predicted classes are\n{test_y_pred}\nand should be\n[2 0 0 3 0 3 2 1 2 3 1 0 0 1 0]')

Your predicted classes are
[2 0 0 3 0 3 2 1 2 3 1 0 0 1 0]
and should be
[2 0 0 3 0 3 2 1 2 3 1 0 0 1 0]


### 4b. Spam filtering

Use your Naive Bayes classifier to predict whether emails in the Enron email dataset are spam! Start by running the following code that uses `np.load` to load in the train/test split that you created last week.


In [114]:
import email_preprocessor as ep

In [115]:
x_train = np.load('data/email_train_x.npy')
y_train = np.load('data/email_train_y.npy')
inds_train = np.load('data/email_train_inds.npy')
x_test = np.load('data/email_test_x.npy')
y_test = np.load('data/email_test_y.npy')
inds_test = np.load('data/email_test_inds.npy')

In [116]:
bayes = NaiveBayes(num_classes = 2)
bayes.train(x_train, y_train)


In [117]:
predictions = bayes.predict(x_test)
print(predictions)
acc = bayes.accuracy(y_test, predictions)
print("Accuracy: ", acc)


[0 0 0 ... 0 0 0]
Accuracy:  0.8962452107279694


### 4c. Questions

**Question 7:** What accuracy do you get on the test set with Naive Bayes? It should be roughly 89%.

**Answer 7:** I got an accuracy of about 89.6%. This is a relatively high accuracy, meaning that the Naive Bayes classifier is good at determining whether an email is spam or ham.

### 4d. Confusion matrix

To get a better sense of the errors that the Naive Bayes classifier makes, create a confusion matrix. 

- Implement `confusion_matrix` in `naive_bayes.py`.
- Print out a confusion matrix of the spam classification results. Assign the confusion matrix below to the variable `conf_matrix_nb`.
- Run the cells below to help test your confusion matrix.
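
One way to build such a matrix is sketched below, using the convention that rows are actual classes and columns are predicted classes. This is a standalone illustration rather than the method in `naive_bayes.py`; the `num_classes` parameter and the toy labels are assumptions.

```python
import numpy as np

def confusion_matrix(y, y_pred, num_classes):
    """Entry [i, j] counts the samples of actual class i that were predicted as class j."""
    conf = np.zeros((num_classes, num_classes))
    for actual, predicted in zip(y.astype(int), y_pred.astype(int)):
        conf[actual, predicted] += 1
    return conf

toy_y = np.array([0, 0, 1, 1, 1])
toy_pred = np.array([0, 1, 1, 1, 0])
toy_conf = confusion_matrix(toy_y, toy_pred, num_classes=2)
print(toy_conf)
```

With this layout, each row sums to the number of samples that truly belong to that class, and the main diagonal holds the correct predictions.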

In [118]:
conf_matrix_nb = bayes.confusion_matrix(y_test, predictions)
print(conf_matrix_nb)

[[2790.  491.]
 [ 186. 3058.]]


#### Test confusion matrix

In [119]:
print(f'The total number of entries in your confusion matrix is {int(conf_matrix_nb.sum())} and should be {len(y_test)}.')
print(f'The total number of ham entries in your confusion matrix is {int(conf_matrix_nb[0].sum())} and should be {int(np.sum(y_test == 0))}.')
print(f'The total number of spam entries in your confusion matrix is {int(conf_matrix_nb[1].sum())} and should be {int(np.sum(y_test == 1))}.')

The total number of entries in your confusion matrix is 6525 and should be 6525.
The total number of ham entries in your confusion matrix is 3281 and should be 3281.
The total number of spam entries in your confusion matrix is 3244 and should be 3244.


### 4e. Questions

**Question 8:** Interpret the confusion matrix, using the convention that positive detection means spam (*e.g. a false positive means classifying a ham email as spam*). What types of errors are made more frequently by the classifier? What does this mean (*i.e. X (spam/ham) is more likely to be classified as Y (spam/ham) than the other way around*)?

**Answer 8:** 
The confusion matrix shows that false positives (type I errors) are much more common than false negatives (type II errors): ham emails are classified as spam more often than spam emails are classified as ham. About 15% of ham emails (491 of 3281) were misclassified as spam, while 5.7% of spam emails (186 of 3244) were misclassified as ham. The entries on the main diagonal are the largest, indicating that most emails in the test set were classified correctly: 5848 of 6525 (89.6%).
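
The rates quoted in this answer can be recomputed directly from the Naive Bayes confusion matrix printed earlier. The variable names below are only illustrative.

```python
import numpy as np

# Naive Bayes confusion matrix from the output above (rows: actual, columns: predicted)
conf_nb = np.array([[2790., 491.],
                    [186., 3058.]])

tn, fp = conf_nb[0]   # actual ham: predicted ham, predicted spam
fn, tp = conf_nb[1]   # actual spam: predicted ham, predicted spam

fp_rate = fp / (tn + fp)               # share of ham emails flagged as spam
fn_rate = fn / (fn + tp)               # share of spam emails labeled as ham
accuracy = (tn + tp) / conf_nb.sum()   # share of all emails classified correctly

print(f'FP rate: {fp_rate:.3f}, FN rate: {fn_rate:.3f}, accuracy: {accuracy:.3f}')
# FP rate: 0.150, FN rate: 0.057, accuracy: 0.896
```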

## Task 5: Comparison with KNN

In [120]:
from knn import KNN

### 5a. KNN spam email classification accuracy
Run a similar analysis to what you did with Naive Bayes above. When computing accuracy on the test set, you may want to reduce the size of the test set (e.g. to the first 500 emails in the test set).

In [121]:
knn = KNN(2)
knn.train(x_train, y_train)
predictionsKNN = knn.predict(x_test, 2)
print(predictionsKNN)
acc = knn.accuracy(predictionsKNN, y_test)
print("Accuracy: ", acc)



[0. 1. 0. ... 0. 0. 0.]
Accuracy:  0.9173946360153257


### 5b. KNN spam email confusion matrix
Copy-paste your `confusion_matrix` method into `knn.py` so that you can run the same analysis on a KNN classifier.

In [124]:
confusion_matrix_knn = knn.confusion_matrix(y_test, predictionsKNN)
print(confusion_matrix_knn)

print(f'The total number of entries in your confusion matrix is {int(confusion_matrix_knn.sum())} and should be {len(y_test)}.')
print(f'The total number of ham entries in your confusion matrix is {int(confusion_matrix_knn[0].sum())} and should be {int(np.sum(y_test == 0))}.')
print(f'The total number of spam entries in your confusion matrix is {int(confusion_matrix_knn[1].sum())} and should be {int(np.sum(y_test == 1))}.')

[[3135.  146.]
 [ 393. 2851.]]
The total number of entries in your confusion matrix is 6525 and should be 6525.
The total number of ham entries in your confusion matrix is 3281 and should be 3281.
The total number of spam entries in your confusion matrix is 3244 and should be 3244.


### 5c. Questions

**Question 9:** What accuracy did you get on the test set (potentially reduced in size)?

**Question 10:** How does the confusion matrix compare to that obtained by Naive Bayes (*If you reduced the test set size, keep that in mind*)?

**Question 11:** Briefly describe at least one pro/con of KNN compared to Naive Bayes on this dataset.

**Question 12:** When potentially reducing the size of the test set here, why is it important that we shuffled our train and test set?

**Answer 9:** I did not reduce the size of the test set when computing the accuracy. I got an accuracy of 91.7% for the KNN classifier when the number of neighbors was set to 2.

**Answer 10:** The KNN classifier is more accurate overall: its accuracy is 91.7%, compared with 89.6% for the Naive Bayes classifier. Unlike Naive Bayes, KNN makes type II errors more often than type I errors, meaning that a spam email is more likely to be misclassified as ham than the other way around. 12.1% of spam emails (393 of 3244) were false negatives, while 4.4% of ham emails (146 of 3281) were false positives.

**Answer 11:** One positive aspect of KNN is that training is very fast because we simply memorize the training data: the exemplars and their classes are identical to the training set data and classes. Naive Bayes has to pass over the training data to compute the priors and the likelihood of each word in each class, so its training step does more work than KNN's. A con of KNN is that we have to compute the distance from every test sample to every training sample, which makes prediction expensive for large datasets like the email dataset. Getting predictions from the KNN classifier took significantly longer than getting predictions from the Naive Bayes classifier, which only needs a few matrix computations to make its predictions.

**Answer 12:** When reducing the training set or test set, we want the samples to be shuffled so that both types of email are represented. If the training set were organized with all of the ham emails first and all of the spam emails after them, reducing the training set could remove all of the spam emails. The classifier would then be trained without any spam emails, making it ineffective at classifying emails as spam; a classifier trained on only one type of email would be very biased. Similarly, if the test set were not shuffled and all of one type of email came first, reducing its size could leave only one type of email in the test set. The accuracy we measured would then be misleading because only one type of email was used to test the classifier. Therefore, it is important to shuffle the dataset to ensure that emails of both classes are included in training and testing.
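
The effect described in Answer 12 can be seen on a toy label vector. This is a minimal sketch with made-up class counts (500 ham followed by 500 spam), not the Enron data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unshuffled labels: every ham email (0) comes before every spam email (1)
labels_sorted = np.array([0] * 500 + [1] * 500)

# Taking the first 100 samples without shuffling keeps a single class
first_100_sorted = labels_sorted[:100]

# Shuffling first gives a subset that contains both classes
labels_shuffled = rng.permutation(labels_sorted)
first_100_shuffled = labels_shuffled[:100]

print(np.unique(first_100_sorted))
print(np.unique(first_100_shuffled, return_counts=True))
```

An accuracy measured on the unshuffled subset would only say how well the classifier recognizes ham, since no spam email is ever tested.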

## Sources Cited

I received help from Reva to write my predict method in the KNN class. I also received help from Professor Layton to write my plot_predictions method in the KNN class. 

## AI Disclosure

I did not use AI.

## Extensions

### 0. Classify your own datasets

- Find datasets that you find interesting and run classification on them using your KNN algorithm (and if applicable, Naive Bayes). Analyze the performance of your classifier.

### 1. Better text preprocessing

- If you look at the top words extracted from the email dataset, many of them are common "stop words" (e.g. a, the, to, etc.) that do not carry much meaning when it comes to differentiating between spam vs. non-spam email. Improve your preprocessing pipeline by building your top words without stop words. Analyze performance differences.

In my email preprocessor file, I made a list of stop words that should not be included in the feature vectors because they contribute little to the meaning of an email. When looping through the words to build the word count dictionary, I skipped any word in that list so that it would never reach the top words or the feature vectors.
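
The idea can be sketched as follows. This is not the code in `email_preprocessor.py`: the `STOP_WORDS` set, the function name `count_words_in_texts`, the regular-expression tokenizer, and the toy emails are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop word list; the list used in the project may differ
STOP_WORDS = {'the', 'to', 'and', 'of', 'a', 'in', 'for', 'is'}

def count_words_in_texts(texts, remove_stop_words=False):
    """Count word frequencies across `texts`, optionally skipping stop words."""
    word_freq = Counter()
    for text in texts:
        # Assumed tokenization: lowercase runs of letters
        for word in re.findall(r'[a-z]+', text.lower()):
            if remove_stop_words and word in STOP_WORDS:
                continue  # stop words never enter the word count dictionary
            word_freq[word] += 1
    return word_freq

toy_emails = ['The meeting is moved to Friday',
              'Win a free prize in the lottery']
freq_all = count_words_in_texts(toy_emails)
freq_filtered = count_words_in_texts(toy_emails, remove_stop_words=True)
print(freq_all['the'], 'the' in freq_filtered)
```

Filtering while counting, rather than after selecting the top words, frees those slots in the top-word list for more informative words.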

In [75]:
np.random.seed(0)
word_freq1, num_emails1 = epp.count_words(remove = True)
top_words1, top_counts1 = epp.find_top_words(word_freq1)
features1, y1 = epp.make_feature_vectors(top_words1, num_emails1)
print(features1)

x_train1, y_train1, inds_train1, x_test1, y_test1, inds_test1 = epp.make_train_test_sets(features1, y1)


[[ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [11.  6.  9. ...  0.  0.  1.]
 ...
 [ 0.  0.  1. ...  0.  0.  0.]
 [ 1.  1.  1. ...  0.  0.  0.]
 [ 0.  1.  0. ...  0.  0.  0.]]


In [77]:
np.save('data/email_train_xEXTENSION.npy', x_train1)
np.save('data/email_train_yEXTENSION.npy', y_train1)
np.save('data/email_train_indsEXTENSION.npy', inds_train1)
np.save('data/email_test_xEXTENSION.npy', x_test1)
np.save('data/email_test_yEXTENSION.npy', y_test1)
np.save('data/email_test_indsEXTENSION.npy', inds_test1)


In [76]:
x_train1 = np.load('data/email_train_xEXTENSION.npy')
y_train1 = np.load('data/email_train_yEXTENSION.npy')
inds_train1 = np.load('data/email_train_indsEXTENSION.npy')
x_test1 = np.load('data/email_test_xEXTENSION.npy')
y_test1 = np.load('data/email_test_yEXTENSION.npy')
inds_test1 = np.load('data/email_test_indsEXTENSION.npy')

In [78]:
bayes2 = NaiveBayes(num_classes = 2)
bayes2.train(x_train1, y_train1)
predictions2 = bayes2.predict(x_test1)
print(predictions2)
acc2 = bayes2.accuracy(y_test1, predictions2)
print("Accuracy: ", acc2)


[0 0 0 ... 0 0 1]
Accuracy:  0.8940996168582376


In [123]:
knn2 = KNN(2)
knn2.train(x_train1, y_train1)
predictionsKNN2 = knn2.predict(x_test1, 2)
print(predictionsKNN2)
acc2 = knn2.accuracy(predictionsKNN2, y_test1)
print("Accuracy: ", acc2)

[0. 1. 0. ... 0. 0. 0.]
Accuracy:  0.9236781609195402


I hypothesized that removing common stop words from the top_words and the feature vectors would significantly increase the accuracy of the classifiers because the stop words do not have much of a correlation with the classification of an email. Words such as "the" and "it" will show up in both spam and ham emails because they are very common words. They are not good indicators of whether an email is spam or ham because they are universal words. I believed that removing these words would make more room for words that would help a classifier differentiate between a spam and a ham email in the top_words list. When there are more specific words that the classifier can look at to make its predictions, the accuracy will be higher because the classifier has a better idea of what a spam and ham email is.

Comparing the accuracy of each classifier with and without stop words shows that there is not much change. The KNN classifier accuracy increased from 91.7% to 92.4% when we removed the stop words, and the Naive Bayes classifier accuracy decreased from 89.6% to 89.4%. This contradicts my hypothesis because the accuracy did not change significantly when we removed common stop words, and the accuracy of the Naive Bayes classifier actually decreased. This may be because I only removed a small number of stop words, so the impact was not large enough to change the accuracy drastically; I may need to remove more stop words to see a difference. Additionally, the accuracy may have gone down for the Naive Bayes classifier because the stop words are more indicative of an email being spam or ham than I had predicted. Spam emails may contain more instances of a given stop word than ham emails, which the classifier can use to make an accurate prediction.

### 2. Feature size

- Explore how the number of selected features for the email dataset influences accuracy and runtime performance.

### 3. Distance metrics
- Compare KNN performance with the $L^2$ and $L^1$ distance metrics

In [69]:
knn2 = KNN(2)
knn2.train(x_train, y_train)
predictionsKNN2 = knn2.predict(x_test, 2, euclidean=False)
print(predictionsKNN2)
acc2 = knn2.accuracy(predictionsKNN2, y_test)
print("Accuracy: ", acc2)

[0. 0. 0. ... 0. 0. 0.]
Accuracy:  0.5028352490421456


For this extension, I added a parameter to the predict method of the KNN class that lets the programmer specify which distance to use: Euclidean distance or Manhattan distance.

Switching to the Manhattan distance reduced the accuracy of the classifier by about 41.5 percentage points (from 91.7% to 50.3%), so in my implementation it is not a good alternative to the Euclidean distance. Although it decreased the time taken to make predictions for each individual email, it classified only about half of the emails correctly when looking at 2 neighbors. The accuracy of 0.5028 is exactly the fraction of ham emails in the test set (3281 of 6525), which is the score a classifier would get by labeling every email as ham, so the drop may come from how the Manhattan distance is computed in my predict method rather than from the metric itself. Another possible factor is that the set of points within a given Manhattan distance of a test sample forms a diamond rather than a sphere, so the neighbors selected can differ from those chosen with the Euclidean distance.
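
For reference, a KNN prediction step that supports both metrics can be sketched as below. This is a standalone illustration and not the `predict` method in `knn.py`; the function name `knn_predict` and the toy data are assumptions, and only the `euclidean` flag mirrors the parameter described above.

```python
import numpy as np

def knn_predict(x_train, y_train, x_test, k, euclidean=True):
    """Predict the majority class among the k nearest training samples."""
    preds = np.zeros(x_test.shape[0], dtype=int)
    for i, sample in enumerate(x_test):
        diff = x_train - sample
        if euclidean:
            dists = np.sqrt(np.sum(diff ** 2, axis=1))   # L2 distance
        else:
            dists = np.sum(np.abs(diff), axis=1)         # L1 (Manhattan) distance
        nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
        # Labels are cast to int so that bincount can tally the votes
        preds[i] = np.bincount(y_train[nearest].astype(int)).argmax()
    return preds

# Two well-separated toy clusters
toy_train = np.array([[0., 0.], [1., 0.], [0., 1.],
                      [5., 5.], [6., 5.], [5., 6.]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_test = np.array([[0.5, 0.5], [5.5, 5.5]])

pred_l2 = knn_predict(toy_train, toy_labels, toy_test, k=3)
pred_l1 = knn_predict(toy_train, toy_labels, toy_test, k=3, euclidean=False)
print(pred_l2, pred_l1)
```

On well-separated data like this, both metrics should give the same predictions, which makes a small example of this kind a useful check on a distance implementation.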

### 4. K-Fold Cross-Validation

- Research this technique and apply it to data and your KNN and/or Naive Bayes classifiers.

### 5. Email error analysis

- Dive deeper into the properties of the emails that were misclassified (FP and/or FN) by Naive Bayes or KNN. What is their word composition? How many words were skipped because they were not in the training set? What could plausibly account for the misclassifications?

In [101]:
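# -1 marks a false positive (ham predicted as spam); +1 marks a false negative (spam predicted as ham)
subtractBayes = y_test - predictions

# Print the top words present in each false positive among the first 150 test emails
for i in range(150):
    if subtractBayes[i] == -1.0:
        wordString = ""
        for j in range(x_test.shape[1]):
            if x_test[i, j] != 0:
                wordString += " " + top_words[j]
        print("False Positive: ", wordString)

print("\n")
# Repeat for the false negatives among the first 150 test emails
for i in range(150):
    if subtractBayes[i] == 1.0:
        wordString = ""
        for j in range(x_test.shape[1]):
            if x_test[i, j] != 0:
                wordString += " " + top_words[j]
        print("False Negative: ", wordString)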





False Positive:   to and of you is subject your if com please me more my about get here email know thanks o free contact kaminski
False Positive:   for this on subject have not if please e been message d report c date
False Positive:   the to of you subject your by please me
False Positive:   a for i subject from they day l
False Positive:   subject no o date
False Positive:   the to and of a in for is this s subject with be your from are or by not if com any e but t may information which gas energy http message price also only m mail some market www inc report make use go u most questions
False Positive:   the to of a for you this subject your do make when attached
False Positive:   the to and of a in for you i s subject with be we have it will our all has more but t do said he also should than make want them don r
False Positive:   subject no o date
False Positive:   to and for s subject be as from will are at our these d into through th
False Positive:   the to and of a in for you e

Above, I have displayed the words present in a small number of emails that were misclassified as spam when they were ham (false positive) or misclassified as ham when they were spam (false negative). To do this, I took the feature vector of each email and checked which of the top_words were present in it. I then printed these words to see whether there was a general trend in the emails misclassified by the Naive Bayes classifier. Some of the emails that were deemed spam when they were ham appear to ask the reader to visit a website to perform some action (indicated by words such as "http" and "www"), which is generally a suspicious activity, so it makes sense that these emails were classified as spam. Emails that were spam but were classified as ham do not contain many of the top words and are typically shorter than the emails that were misclassified as spam. This suggests that emails with few of the top words tend to be classified as ham even when they are spam. Most of the emails in the false negative category do not contain any suspicious language, making it reasonable for them to be classified as ham.

### 6. Investigate the misclassification errors

Numbers are nice, but they may not be the best for developing your intuition. Sometimes, you want to see what a misclassification *actually looks like* to help you improve your algorithm. Retrieve the actual text of some example emails of false positive and false negative misclassifications to see if it helps you understand why the misclassification occurred. Here is an example workflow:

- Decide on how many FP and FN emails you would like to retrieve. Find the indices of this many false positive and false negative misclassifications. Remember to use your `test_inds` array to look up the index of the emails BEFORE shuffling happened.
- Implement the function `retrieve_emails` in `email_preprocessor.py` to return the string of the raw email at the error indices.
- Call your function to print out the emails that produced misclassifications.
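
The index lookup in the first step of this workflow can be sketched as follows. The arrays are toy values invented for illustration, and the sketch stops before `retrieve_emails`, whose implementation is left to `email_preprocessor.py`.

```python
import numpy as np

# Toy test-set labels, predictions, and original (pre-shuffle) email indices
toy_y = np.array([0, 0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 1, 0, 1, 1])
toy_inds = np.array([42, 7, 19, 3, 88, 61])

# Positions within the test set of each error type
fp_pos = np.where((toy_y == 0) & (toy_pred == 1))[0]   # ham predicted as spam
fn_pos = np.where((toy_y == 1) & (toy_pred == 0))[0]   # spam predicted as ham

# Map test-set positions back to indices in the original, unshuffled dataset
fp_orig = toy_inds[fp_pos]
fn_orig = toy_inds[fn_pos]
print(fp_orig, fn_orig)  # [ 7 88] [3]
```

The original indices, not the test-set positions, are what identify the raw email files on disk.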

Do the FP and FN emails make sense? Why? Do the emails have properties in common? Can you quantify and interpret them?

### 7. KNN for regression

KNN can also be used to perform regression between one or more independent variables and a dependent variable. The potential advantage of this approach is that the regression performed by KNN does not assume any specific form of the regression curve (e.g. line, polynomial, etc.) — the regression is entirely training data-dependent.

KNN for regression is largely the same as for classification except for the following change during prediction:
- For each test sample (validation data), the predicted "y value" is the average "y value" of the K nearest training samples.
- You can use MSE to evaluate how well the regression fits the test samples that you plug in.
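
The two changes listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `knn_regress` and `mse`, the Euclidean distance, and the toy data (samples of the line $y = 2x$) are all hypothetical choices.

```python
import numpy as np

def knn_regress(x_train, y_train, x_test, k):
    """Predict each test sample's y value as the mean y of its k nearest training samples."""
    preds = np.zeros(x_test.shape[0])
    for i, sample in enumerate(x_test):
        dists = np.sqrt(np.sum((x_train - sample) ** 2, axis=1))  # Euclidean distance
        nearest = np.argsort(dists)[:k]
        preds[i] = y_train[nearest].mean()
    return preds

def mse(y, y_pred):
    """Mean squared error between true and predicted values."""
    return np.mean((y - y_pred) ** 2)

# Toy training data sampled from y = 2x
toy_x = np.array([[0.], [1.], [2.], [3.], [4.]])
toy_y = 2 * toy_x[:, 0]

toy_test_x = np.array([[1.1], [2.9]])
toy_test_y = 2 * toy_test_x[:, 0]

toy_pred = knn_regress(toy_x, toy_y, toy_test_x, k=2)
toy_mse = mse(toy_test_y, toy_pred)
print(toy_pred, toy_mse)
```

Because each prediction is an average of nearby training values, the fitted curve is piecewise constant and can never extend beyond the range of `y` seen in training.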