# Extension 2 - Better text preprocessing

- If you look at the top words extracted from the email dataset, many of them are common "stop words" (e.g. a, the, to, etc.) that do not carry much meaning when it comes to differentiating between spam vs. non-spam email. Improve your preprocessing pipeline by building your top words without stop words. Analyze performance differences.

In [10]:
# Importing all the required stuff

import os
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
%pip install nltk


[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [12]:
import email_preprocessor as epp

In [13]:
word_freq, num_emails = epp.count_words(remove_stop_words=True)

In [14]:
print(f'You found {num_emails} emails in the dataset. You should have found 32625.')

You found 32625 emails in the dataset. You should have found 32625.


In [15]:
top_words, top_counts = epp.find_top_words(word_freq)

# Printing the top 10 words
print('Top 10 words:')
for i in range(10):
    print(f'{i + 1}. {top_words[i]}: {top_counts[i]}')

# Check that the top words do not contain any common stop words
common_stop_words = [
    'the', 'to', 'and', 'of', 'a', 'in', 'for', 'is', 'on', 'that',
    'this', 'it', 'you', 'not', 'are', 'be', 'have', 'as', 'with', 'will',
    'at', 'by', 'from', 'or', 'an', 'was', 'if', 'they', 'but', 'your',
    'we', 'all', 'can', 'more',
]
for stop_word in common_stop_words:
    assert stop_word not in top_words, f'Stop word "{stop_word}" found in top words'

Top 10 words:
1. enron: 60852
2. subject: 46443
3. ect: 35346
4. com: 22742
5. company: 21296
6. please: 19490
7. hou: 17264
8. would: 15166
9. e: 14756
10. new: 14729


# Report + Results

For this extension, I aimed to improve the text preprocessing pipeline by removing stop words from the dataset. I first imported the required libraries and installed the nltk library to help with the stop word removal. Next, I called the count_words function from email_preprocessor with the remove_stop_words parameter set to True to count the frequency of words in the dataset. It found 32,625 emails, which matches the expected count.

After counting the words, I used the find_top_words function to find the top 10 most frequently used words in the dataset. These words were:

1. enron: 60852
2. subject: 46443
3. ect: 35346
4. com: 22742
5. company: 21296
6. please: 19490
7. hou: 17264
8. would: 15166
9. e: 14756
10. new: 14729

I then verified that the top words did not contain any stop words, as per the requirements. I asserted that the words "the", "to", "and", "of", "a", "in", "for", "is", "on", "that", "this", "it", "you", "not", "are", "be", "have", "as", "with", "will", "at", "by", "from", "or", "an", "was", "if", "they", "but", "your", "we", "all", "can", and "more" were not present in the top words.
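
The ranking and the stop-word check can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in email_preprocessor: `find_top_words_sketch` is a hypothetical stand-in for find_top_words, the `num_features` parameter is assumed, and the small `word_freq` dictionary is made-up toy data rather than counts from the email dataset.

```python
def find_top_words_sketch(word_freq, num_features=200):
    """Return the most frequent words and their counts, highest count first.

    Hypothetical stand-in for find_top_words; the real function may differ.
    """
    ranked = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:num_features]
    words = [word for word, _ in top]
    counts = [count for _, count in top]
    return words, counts


# Toy frequency table (made-up values, not taken from the dataset)
word_freq = {'enron': 5, 'subject': 4, 'please': 2, 'meeting': 1}
stop_words = {'the', 'to', 'and', 'of', 'a'}

top_words, top_counts = find_top_words_sketch(word_freq, num_features=3)

# A single set intersection replaces one assert per stop word
leaked = stop_words.intersection(top_words)
assert not leaked, f'Stop words found in top words: {sorted(leaked)}'

print(top_words, top_counts)
```

Checking the intersection against a set reports every offending word at once, which is easier to debug than a chain of individual asserts that stops at the first failure.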

To remove the stop words, I modified the tokenize_words function so that it filters them out when the remove_stop_words parameter is set to True. Calling count_words with remove_stop_words set to True then counts only the non-stop words in the dataset.
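
As a rough sketch of that kind of change, the function below lowercases the text, splits it into word tokens with a regular expression, and optionally drops stop words. The name `tokenize_words_sketch`, the regular expression, and the small hard-coded `STOP_WORDS` set are assumptions made for illustration. The actual tokenize_words in email_preprocessor is not shown in this notebook, and a fuller list such as NLTK's `stopwords.words('english')` would normally take the place of the hard-coded set.

```python
import re

# Assumed, abbreviated stop word list for illustration only
STOP_WORDS = frozenset({'the', 'to', 'and', 'of', 'a', 'in', 'for', 'is', 'on', 'that'})

# Assumed tokenization rule: runs of letters, digits, or underscores
WORD_PATTERN = re.compile(r'\w+')


def tokenize_words_sketch(text, remove_stop_words=False):
    """Split text into lowercase word tokens, optionally dropping stop words."""
    words = WORD_PATTERN.findall(text.lower())
    if remove_stop_words:
        words = [word for word in words if word not in STOP_WORDS]
    return words


sample = 'Subject: Please send the report to the Houston office'
all_tokens = tokenize_words_sketch(sample)
filtered_tokens = tokenize_words_sketch(sample, remove_stop_words=True)
print(all_tokens)
print(filtered_tokens)
```

Keeping the stop words in a set makes each membership test constant time, which matters when the filter runs over every token in tens of thousands of emails.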

Here are the original results:

Your top 5 words are
['the', 'to', 'and', 'of', 'a']
with counts of
[277459, 203659, 148873, 139578, 111841].

These results are consistent with what we would expect from a large corpus of English text, where function words such as "the" and "to" dominate raw word counts.

However, when we remove stop words, the top 10 words are different. Here are the new results:

Top 10 words:

1. enron: 60852
2. subject: 46443
3. ect: 35346
4. com: 22742
5. company: 21296
6. please: 19490
7. hou: 17264
8. would: 15166
9. e: 14756
10. new: 14729

As we can see, the most frequent words now include terms that are specific to this corpus, such as "enron", "ect", and "hou". This suggests that removing stop words can help us identify the most meaningful words in a given text.

Removing stop words may also have other benefits, such as reducing the size of the vocabulary and potentially improving model performance.