Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you and should run without any errors. You should recognize Pandas from task 1.

In [6]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:
                # The file is not valid text in the default encoding, so skip it.
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

# Convert ham and spam to DataFrames and concatenate them
df_ham = pd.DataFrame.from_records(ham)
df_spam = pd.DataFrame.from_records(spam)

# Use pd.concat instead of append
df = pd.concat([df_ham, df_spam], ignore_index=True)

# Now df contains both ham and spam emails
print(df.head())

skipped 2649.2004-10-27.GP.spam.txt
skipped 0754.2004-04-01.GP.spam.txt
skipped 2042.2004-08-30.GP.spam.txt
skipped 3304.2004-12-26.GP.spam.txt
skipped 4142.2005-03-31.GP.spam.txt
skipped 3364.2005-01-01.GP.spam.txt
skipped 4201.2005-04-05.GP.spam.txt
skipped 2140.2004-09-13.GP.spam.txt
skipped 2248.2004-09-23.GP.spam.txt
skipped 4350.2005-04-23.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 1414.2004-06-24.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 5105.2005-08-31.GP.spam.txt
                             name  \
0  1061.2000-05-10.farmer.ham.txt   
1  0446.2000-02-18.farmer.ham.txt   
2  0067.1999-12-27.farmer.ham.txt   
3  1553.2000-06-29.farmer.ham.txt   
4  1790.2000-07-28.farmer.ham.txt   

                                             content category  
0  Subject: ena sales on hpl\njust to update you ...      ham  
1  Subject: 98 - 6736 & 98 - 9638 for 1997 ( ua 4...      ham  
2  Subject: hpl nominations for december 28 ,
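
The `skipped ...` lines above most likely come from emails containing bytes that cannot be decoded with the default text encoding, which makes `fp.read()` raise an error. A minimal sketch, assuming you would rather keep such emails than drop them, opens the file with an explicit encoding and `errors='replace'`. The helper name `read_text_lenient` and the throwaway sample file are hypothetical and not part of the task.

```python
import os
import tempfile

def read_text_lenient(path):
    # Hypothetical helper: decode as UTF-8 and substitute U+FFFD for any
    # byte sequence that cannot be decoded, instead of raising an error.
    with open(path, 'r', encoding='utf-8', errors='replace') as fp:
        return fp.read()

# Build a throwaway file containing one byte (0xff) that is invalid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    with open(sample_path, 'wb') as fp:
        fp.write(b'Subject: caf\xff special offer')
    lenient_text = read_text_lenient(sample_path)

# ascii() escapes the replacement character so it prints on any terminal.
print(ascii(lenient_text))
```

Because the `preprocessor` in Step 2 turns every non-letter into a space, the replacement character would disappear before the model sees it. Note that the results shown in this notebook were produced with the skipping version, so keeping these emails would change the dataset size and the reported numbers.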

Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [7]:
import re

def preprocessor(text):
    # Replace all non-alphabet characters with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Convert the text to lowercase
    return text.lower()

# Example usage
input_text = "Hello, World! 1234"
processed_text = preprocessor(input_text)
print(processed_text)  # Output: "hello  world      " (each digit and punctuation mark becomes a space)


hello  world      
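
The trailing spaces in the output above appear because every removed character (the comma, the exclamation mark, and each digit of `1234`) is replaced by its own space. This is harmless for the task, since runs of spaces do not produce extra tokens. If a tidier string is wanted, one possible variant collapses each run of non-letters into a single space. The name `preprocessor_compact` is hypothetical and is not required by the task.

```python
import re

def preprocessor(text):
    # Same behaviour as the function above: one space per non-letter.
    return re.sub(r'[^a-zA-Z]', ' ', text).lower()

def preprocessor_compact(text):
    # Hypothetical variant: replace each run of non-letters with a single
    # space and trim leading and trailing whitespace.
    return re.sub(r'[^a-zA-Z]+', ' ', text).strip().lower()

original = preprocessor("Hello, World! 1234")
compact = preprocessor_compact("Hello, World! 1234")

# repr() makes the spaces visible.
print(repr(original))
print(repr(compact))
```

Both versions yield the same words once the text is split on whitespace, so either one normalizes 'Hello' and 'hello' to the same token.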


Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions. It will be handy to keep that tab open.
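
The central idea in this step is the bag-of-words representation: every distinct word in the training text becomes a column, and each email becomes a row of word counts. A minimal pure-Python sketch of that idea follows. It is not scikit-learn's implementation, and the toy emails and the names `build_vocabulary` and `to_count_vector` are made up for illustration.

```python
from collections import Counter

# Made-up example emails, already lowercased and free of punctuation.
toy_emails = [
    "cheap prices click here",
    "meeting notes attached thanks",
    "click here for cheap cheap prices",
]

def build_vocabulary(texts):
    # Sorted list of every distinct word: one column per word.
    return sorted({word for text in texts for word in text.split()})

def to_count_vector(text, vocabulary):
    # Count how often each vocabulary word occurs in this text.
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vocabulary = build_vocabulary(toy_emails)
vectors = [to_count_vector(text, vocabulary) for text in toy_emails]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Among other differences, `CountVectorizer` stores the counts in a sparse matrix rather than in plain lists, which matters when the vocabulary holds tens of thousands of words and most counts are zero.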

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# The CountVectorizer converts a text sample into a vector (think of it as an array of numbers).
# Each entry in the vector corresponds to a single word, and its value is the number of times that word appeared.
# Instantiate a CountVectorizer. Make sure to include the preprocessor you previously wrote in the constructor.
# TODO
vectorizer = CountVectorizer(preprocessor=preprocessor)


# Use train_test_split to split the dataset into a train dataset and a test dataset.
# The machine learning model learns from the train dataset.
# Then the trained model is tested on the test dataset to see if it actually learned anything.
# If it had merely memorized the training examples, it would show a high accuracy on the train dataset but a low accuracy on the test dataset.
# TODO
# Split the dataset into training and testing groups.
# X will be the 'content' (emails), and y will be the 'category' (ham or spam).
X = df['content']
y = df['category']

# Split into training and testing data (80% train, 20% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



# Use the vectorizer to transform the dataset into a form the model can learn from.
# Remember that simple machine learning models operate on numbers, and the CountVectorizer conveniently produces them for us.
# TODO
# Learn the vocabulary from the training emails only, then reuse it to transform the test emails.
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)



# Use the LogisticRegression model to fit to the train dataset.
# You may remember y = mx + b and Linear Regression from high school. There, we fitted a line to a scatter plot.
# Logistic Regression is another form of regression.
# However, Logistic Regression helps us determine whether a point belongs in category A or B, which makes it a perfect fit for this task.
# TODO
# Use the Logistic Regression model to learn from the training data.
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)



# Validate that the model has learned something.
# Recall that the model operates on vectors, so the test set must first be transformed using the vectorizer.
# Then generate the predictions.
# TODO
# The test set was already vectorized above (X_test_vectorized), so we only need to predict here.
y_pred = model.predict(X_test_vectorized)



# We now want to see how we have done. We will be using three functions.
# `accuracy_score` tells us how well we have done.
# 90% means that 9 out of every 10 entries from the test dataset were predicted correctly.
# The `confusion_matrix` is a 2x2 matrix that gives us more insight.
# The top left shows us how many ham emails were predicted to be ham (that's good!).
# The bottom right shows us how many spam emails were predicted to be spam (that's good!).
# The other two quadrants count the misclassifications: top right is ham predicted as spam, bottom left is spam predicted as ham.
# Finally, the `classification_report` gives us detailed statistics (precision, recall, and F1 score), which you may have seen in a statistics class.
# TODO
# Check how well the model did using accuracy score, confusion matrix, and classification report.
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)




Accuracy: 0.9757751937984496
Confusion Matrix:
[[693  14]
 [ 11 314]]
Classification Report:
              precision    recall  f1-score   support

         ham       0.98      0.98      0.98       707
        spam       0.96      0.97      0.96       325

    accuracy                           0.98      1032
   macro avg       0.97      0.97      0.97      1032
weighted avg       0.98      0.98      0.98      1032
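
The figures in the report can be re-derived by hand from the confusion matrix printed above. The sketch below uses only the four printed counts and assumes that rows are true labels and columns are predicted labels, in the order ham, spam.

```python
# Counts copied from the confusion matrix above.
# Rows: true label (ham, spam); columns: predicted label (ham, spam).
ham_as_ham, ham_as_spam = 693, 14
spam_as_ham, spam_as_spam = 11, 314

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Accuracy: share of all test emails that were classified correctly.
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of the emails predicted to be spam, how many really were spam?
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)

# Recall: of the emails that really were spam, how many did the model catch?
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

# F1 score: harmonic mean of precision and recall.
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"total test emails: {total}")
print(f"accuracy:          {accuracy:.2f}")
print(f"spam precision:    {spam_precision:.2f}")
print(f"spam recall:       {spam_recall:.2f}")
print(f"spam f1-score:     {spam_f1:.2f}")
```

Rounded to two decimals, these values agree with the `spam` row and the `accuracy` row of the classification report.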



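Step 4. Inspect what the model has learned: the features the vectorizer created and the coefficients the model assigned to them.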

In [12]:
# Let's see which features (aka columns) the vectorizer created. 
# They should be all the words that were contained in the training dataset.
# TODO
# Step 4.1: Display the features created by the vectorizer
features = vectorizer.get_feature_names_out()

# Print the first 10 features (words) as an example
print(features[:10])



# You may be wondering what a machine learning model tangibly is. It is just a collection of numbers.
# You can access these numbers, known as "coefficients", from the coef_ property of the model.
# We will be looking at coef_[0], which represents the importance of each feature.
# What does importance mean in this context?
# Some words are more important than others for the model.
# It's nothing personal, just that spam emails tend to contain some words more frequently.
# This indicates to the model that having that word would make a new email more likely to be spam.
# TODO
# Step 4.2: Get the coefficients of the Logistic Regression model
coefficients = model.coef_[0]

# Print the first 10 coefficients as an example
print(coefficients[:10])



# Iterate over importance and find the top 10 positive features with the largest magnitude.
# Similarly, find the top 10 negative features with the largest magnitude.
# Positive features correspond to spam. Negative features correspond to ham.
# You will see that `http` is the strongest feature that corresponds to spam emails. 
# It makes sense. Spam emails often want you to click on a link.
# TODO
# Step 4.3: Find the top 10 positive and negative features

# Sort feature indices by coefficient value.
# argsort sorts in ascending order: the most negative values (ham) come first, the most positive values (spam) come last.
sorted_indices = coefficients.argsort()

# Get the top 10 negative features (ham)
top_negative_indices = sorted_indices[:10]
top_negative_words = features[top_negative_indices]
top_negative_values = coefficients[top_negative_indices]

# Get the top 10 positive features (spam); the order is ascending, so the strongest one is last
top_positive_indices = sorted_indices[-10:]
top_positive_words = features[top_positive_indices]
top_positive_values = coefficients[top_positive_indices]

# Print the top negative features (ham)
print("Top 10 ham-related words:")
for word, coef in zip(top_negative_words, top_negative_values):
    print(f"{word}: {coef}")

# Print the top positive features (spam)
print("\nTop 10 spam-related words:")
for word, coef in zip(top_positive_words, top_positive_values):
    print(f"{word}: {coef}")




['aa' 'aaa' 'aaas' 'aac' 'aachecar' 'aafco' 'aaiabe' 'aaigrcrb' 'aaldano'
 'aalland']
[-2.67465092e-01  3.92935771e-04  9.51511524e-05  3.12636720e-03
 -5.48330163e-06  3.48775896e-04  6.62444805e-02  1.47006991e-02
 -5.48330163e-06  2.21182314e-06]
Top 10 ham-related words:
attached: -1.6913882270809077
enron: -1.5553855617069685
daren: -1.4290074098140515
thanks: -1.3322514313102198
doc: -1.2859118693371674
deal: -1.2375469841647124
meter: -1.115141031407056
hpl: -1.1083592522339116
neon: -1.000763122999497
xls: -0.9510368518094542

Top 10 spam-related words:
more: 0.6353542111656005
pain: 0.6476422961981965
only: 0.6595368291593605
here: 0.7069486566875188
money: 0.745124634174134
no: 0.757636214558395
prices: 0.7730663634846965
removed: 0.7804981144150746
hello: 0.8142493283384381
http: 0.8944041770930773
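
To see how such coefficients turn into a prediction: logistic regression multiplies each word's count by its coefficient, sums the results together with an intercept, and passes the total through the sigmoid function to obtain a probability of spam. The rough sketch below uses four coefficients, rounded, from the output above. The intercept is not printed in this notebook, so it is assumed to be 0 here, and all other words are ignored. The resulting probabilities are therefore illustrative only and are not the trained model's actual predictions.

```python
import math

# Rounded coefficients copied from the output above.
coefficients = {
    "http": 0.894,
    "money": 0.745,
    "attached": -1.691,
    "thanks": -1.332,
}

# Assumption: the real value lives in model.intercept_[0] and is not shown here.
ASSUMED_INTERCEPT = 0.0

def spam_probability(word_counts):
    # Weighted sum of word counts, then the sigmoid squashes it into (0, 1).
    score = ASSUMED_INTERCEPT + sum(
        coefficients.get(word, 0.0) * count
        for word, count in word_counts.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical word counts for two made-up emails.
spam_like = spam_probability({"http": 2, "money": 1})
ham_like = spam_probability({"attached": 1, "thanks": 1})

print(f"spam-like email: {spam_like:.3f}")
print(f"ham-like email:  {ham_like:.3f}")
```

Words with positive coefficients push the probability toward 1 (spam), and words with negative coefficients push it toward 0 (ham), which is why the sign of a coefficient tells you which category a word supports.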


Submission
1. Upload the Jupyter notebook to Forage.

All Done!