Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you and should run without any errors. You should recognize Pandas from task 1.

In [7]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except Exception:  # e.g. a file that cannot be decoded as text
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

df_ham = pd.DataFrame.from_records(ham)
df_spam = pd.DataFrame.from_records(spam)

df = pd.concat([df_ham, df_spam], ignore_index=True)

skipped 2248.2004-09-23.GP.spam.txt
skipped 4201.2005-04-05.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 4142.2005-03-31.GP.spam.txt
skipped 2140.2004-09-13.GP.spam.txt
skipped 4350.2005-04-23.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
skipped 5105.2005-08-31.GP.spam.txt
skipped 2042.2004-08-30.GP.spam.txt
skipped 1414.2004-06-24.GP.spam.txt
skipped 2649.2004-10-27.GP.spam.txt
skipped 0754.2004-04-01.GP.spam.txt
skipped 3364.2005-01-01.GP.spam.txt
skipped 3304.2004-12-26.GP.spam.txt
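
The output above does not say why these 15 files were skipped. A likely cause, though nothing in the notebook confirms it, is that they contain bytes that are not valid in the default text encoding. Under that assumption, the sketch below shows one way to keep such files by replacing undecodable bytes rather than dropping the whole email. `read_text_lenient` and the demo file are hypothetical names used only for illustration; they are not part of the exercise.

```python
import os
import tempfile

def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, replacing undecodable bytes instead of raising."""
    # errors="replace" substitutes U+FFFD for any byte that cannot be decoded.
    with open(path, "r", encoding=encoding, errors="replace") as fp:
        return fp.read()

# Demo (assumption: the skipped emails resemble this file): the byte 0xE9
# on its own is not valid UTF-8, so a strict read would raise an error.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.spam.txt")
    with open(demo_path, "wb") as fp:
        fp.write(b"Subject: caf\xe9 offer\n")
    demo_text = read_text_lenient(demo_path)

print(repr(demo_text))
```

Whether recovering these emails is worthwhile depends on the task; 15 files is a small fraction of the dataset, so skipping them is also a reasonable choice.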


In [8]:
df.head()

                             name                                            content  category
0  0020.1999-12-15.farmer.ham.txt  Subject: meter 1431 - nov 1999\ndaren -\ncould...       ham
1  4496.2001-05-07.farmer.ham.txt  Subject: tenaska\ndarren ,\nattached is the la...       ham
2  3948.2001-03-22.farmer.ham.txt  Subject: calpine daily gas nomination\nstill u...       ham
3  2288.2000-09-19.farmer.ham.txt  Subject: cornhusker\ndarren ,\nhow are things ...       ham
4  1105.2000-05-22.farmer.ham.txt  Subject: re : pg & e texas contract 5098 - 695...       ham


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [9]:
import re

def preprocessor(e):
    cleaned_text = re.sub(r'[^a-zA-Z]', ' ', e)
    cleaned_text = cleaned_text.lower()
    return cleaned_text
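
As a quick sanity check, the function can be tried on a made-up subject line. The sketch below repeats the same logic so that it runs on its own; the sample string is invented for illustration and does not come from the dataset.

```python
import re

def preprocessor(e):
    # Same logic as the cell above: non-letters become spaces, then lowercase.
    cleaned_text = re.sub(r'[^a-zA-Z]', ' ', e)
    return cleaned_text.lower()

# Hypothetical sample text, not taken from the Enron data.
sample = "Subject: RE: Meter #1431 - Nov 1999!"
cleaned = preprocessor(sample)

print(repr(cleaned))    # same length as the input, since each character maps to one character
print(cleaned.split())  # the words that survive cleaning
```

Note that digits are removed along with punctuation, so "1431" and "1999" leave no trace in the cleaned text. Because each removed character is replaced by a space rather than deleted, neighbouring words are never glued together.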

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions, so it will be handy to keep that tab open.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = pd.DataFrame({
    'content': df['content'],
    'category': df['category']
})

vectorizer = CountVectorizer(preprocessor=preprocessor)

X = data['content']
y = data['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_pred = model.predict(X_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 0.98
Confusion Matrix:
[[692  15]
 [  8 317]]
Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       707
        spam       0.95      0.98      0.96       325

    accuracy                           0.98      1032
   macro avg       0.97      0.98      0.97      1032
weighted avg       0.98      0.98      0.98      1032
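
To see where the numbers in the report come from, the metrics can be recomputed by hand from the confusion matrix. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, in sorted label order (here ham, then spam). The variable names below are illustrative only.

```python
# Confusion matrix copied from the output above: rows are true labels,
# columns are predicted labels, both in the order (ham, spam).
conf = [[692, 15],
        [8, 317]]

(ham_as_ham, ham_as_spam), (spam_as_ham, spam_as_spam) = conf
total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

accuracy = (ham_as_ham + spam_as_spam) / total
# Precision: of everything predicted as a class, how much was correct.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
ham_precision = ham_as_ham / (ham_as_ham + spam_as_ham)
# Recall: of everything truly in a class, how much was found.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
ham_recall = ham_as_ham / (ham_as_ham + ham_as_spam)

print(f"accuracy       {accuracy:.2f}")
print(f"ham  precision {ham_precision:.2f}  recall {ham_recall:.2f}")
print(f"spam precision {spam_precision:.2f}  recall {spam_recall:.2f}")
```

Read this way, the matrix says that 15 legitimate emails were flagged as spam and 8 spam emails slipped through as ham, out of 1032 test emails.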



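Step 4. Inspect the model's coefficients to find the ten words that most strongly indicate spam (positive coefficients) and the ten that most strongly indicate ham (negative coefficients).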

In [13]:
feature_names = vectorizer.get_feature_names_out()

coefficients = model.coef_[0]

feature_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
})

feature_importance['abs_coefficient'] = feature_importance['coefficient'].abs()
sorted_features = feature_importance.sort_values(by='abs_coefficient', ascending=False)

top_positive_features = sorted_features[sorted_features['coefficient'] > 0].head(10)
print("Top 10 positive features (likely to be spam):")
print(top_positive_features[['feature', 'coefficient']])

top_negative_features = sorted_features[sorted_features['coefficient'] < 0].head(10)
print("\nTop 10 negative features (likely to be ham):")
print(top_negative_features[['feature', 'coefficient']])

Top 10 positive features (likely to be spam):
        feature  coefficient
17751      http     1.032538
28480    prices     0.889743
25152        no     0.851404
26597  paliourg     0.738187
30309   removed     0.727645
17060      here     0.709494
24035     money     0.668877
16994     hello     0.667583
18737      info     0.632774
23387   message     0.625233

Top 10 negative features (likely to be ham):
        feature  coefficient
2543   attached    -1.750258
12615     enron    -1.538438
9456      daren    -1.448526
35559    thanks    -1.431525
11031       doc    -1.365414
27538  pictures    -1.201775
24855      neon    -1.166349
9650       deal    -1.117037
17672       hpl    -1.081072
32945    sitara    -1.025823
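
One way to read these coefficients: in logistic regression each coefficient is the change in the log-odds of the positive class per unit increase in the feature, with all other features held fixed. Here the positive class is spam (scikit-learn sorts the class labels, so `model.coef_[0]` refers to the second label) and each feature is a word count. Exponentiating a coefficient therefore gives the factor by which the odds of spam change for each additional occurrence of the word. The sketch below applies this to two coefficients copied from the output above; the dictionary names are illustrative.

```python
import math

# Coefficients copied from the printed output above.
coefficients = {"http": 1.032538, "attached": -1.750258}

# exp(coefficient) is the multiplicative change in the odds of spam
# for one more occurrence of the word, other counts held fixed.
odds_multipliers = {word: math.exp(c) for word, c in coefficients.items()}

for word, mult in odds_multipliers.items():
    print(f"{word:>8}: odds of spam multiplied by {mult:.2f} per occurrence")
```

Under this reading, each extra "http" roughly triples the odds that an email is spam, while each extra "attached" cuts them to under a fifth. Several of the ham words (daren, enron, hpl, sitara) are specific to this mailbox, so a model trained on this data may not transfer well to other users' email.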


Submission
1. Upload the Jupyter notebook to Forage.

All Done!