## Ham or Spam? Email Classifier Using Logistic Regression

Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a Pandas dataframe. This should run without any errors.

In [None]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError: # Skip emails whose bytes cannot be decoded with the default encoding
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

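# Combine ham and spam into a single DataFrame with a fresh 0..n-1 index.
# pd.concat is the public API; DataFrame._append is a private method and should not be relied on.
df = pd.concat([pd.DataFrame.from_records(ham), pd.DataFrame.from_records(spam)], ignore_index=True)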

In [None]:
df # Print the DataFrame

In [None]:
df = df.drop(['name'], axis=1) # Drop the 'name' column from the DataFrame because it will not be used in model training

df['spam'] = df['category'].map({'ham': 0, 'spam': 1}) # Map the 'category' column to numeric labels: 'ham' becomes 0 and 'spam' becomes 1
df # Print the updated DataFrame

Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. The following function `preprocessor` takes in a string, replaces every character that is neither a letter nor whitespace with a space, and then lowercases the result.
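
To make the effect of normalization concrete, here is a minimal sketch that counts tokens before and after cleaning. The sample sentence is made up for illustration and is not taken from the Enron dataset.

```python
import re
from collections import Counter

# Hypothetical sample text (an assumption for illustration only).
sample = "Hello, hello! HELLO... Win $1000 now. Now? NOW!"

# Without normalization, every casing/punctuation variant is its own token.
raw_counts = Counter(sample.split())

# With normalization: non-letters become spaces, then everything is lowercased.
normalized = re.sub(r'[^a-zA-Z\s]', ' ', sample).lower()
clean_counts = Counter(normalized.split())

print("Distinct raw tokens:  ", len(raw_counts))
print("Distinct clean tokens:", len(clean_counts))
print(clean_counts)
```

After cleaning, the three spellings of "hello" collapse into one token, which is what lets a word-count model treat them as the same feature.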

In [None]:
import re

def preprocessor(e):
    regex = r'[^a-zA-Z\s]' # Match any character that is neither a letter nor whitespace
    modified_string = re.sub(regex, ' ', e) # Replace every matched character in the string 'e' with a space
    return modified_string.lower() # Convert the modified string to lowercase and return it


Step 3. We will now train the machine learning model. All the necessary functions have been imported. The comments in the cell outline the process and suggest which functions to use. Refer to the [scikit-learn documentation](https://scikit-learn.org/stable/api/index.html) to see how to invoke these functions properly; consider keeping that tab open.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# The CountVectorizer converts a text sample into a vector (think of it as an array of integer counts).
# Each entry in the vector corresponds to a single word and the value is the number of times the word appeared.
# Instantiate a CountVectorizer. Make sure to include the preprocessor you previously wrote in the constructor.

vectorizer = CountVectorizer(preprocessor=preprocessor)


# Use train_test_split to split the dataset into a train dataset and a test dataset.
# The machine learning model learns from the train dataset.
# Then the trained model is tested on the test dataset to see if it actually learned anything.
# If it merely memorized the training examples, it would have a high accuracy on the train dataset but a low accuracy on the test dataset.

X = df['content']
y = df['spam']
X_train, X_test, y_train, y_test = train_test_split(X, y)


# Use the vectorizer to transform the existing dataset into a form the model can learn from.
# Remember that simple machine learning models operate on numbers, and the CountVectorizer conveniently turns text into numbers for us.

X_train_count = vectorizer.fit_transform(X_train)


# Use the LogisticRegression model to fit to the train dataset.
# You may remember y = mx + b and Linear Regression from high school. There, we fitted a line to a scatter plot.
# Logistic Regression is another form of regression.
# However, Logistic Regression helps us determine whether a point belongs in category A or B, which is a perfect fit for telling spam from ham.

model = LogisticRegression()
model.fit(X_train_count,y_train)


# Validate that the model has learned something.
# Recall the model operates on vectors. First transform the test set using the vectorizer. 
# Then generate the predictions.

X_test_count = vectorizer.transform(X_test)
predictions = model.predict(X_test_count) # calculating model predictions on test data
print('Predictions On Test Data: ',predictions)
print()



# We now want to see how we have done. We will be using three functions.
# `accuracy_score` tells us the fraction of test emails that were classified correctly.
# 90% means that 9 out of every 10 entries from the test dataset were predicted accurately.
# The `confusion_matrix` is a 2x2 matrix that gives us more insight.
# The top left shows us how many ham emails were predicted to be ham (that's good!).
# The bottom right shows us how many spam emails were predicted to be spam (that's good!).
# The other two quadrants tell us the misclassifications.
# Finally, the `classification_report` gives us detailed statistics which you may have seen in a statistics class.

model_accuracy_score = accuracy_score(y_true=y_test, y_pred=predictions)
print('Accuracy Score Of Model: ', model_accuracy_score)
print()

model_confusion_matrix = confusion_matrix(y_true=y_test, y_pred=predictions)
print('Confusion Matrix Of The Model: \n', model_confusion_matrix)
print()

model_classification_report = classification_report(y_true=y_test, y_pred=predictions)
print('Model Classification Report:\n', model_classification_report)
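
To read the confusion matrix with confidence, it can help to try it on labels small enough to check by hand. The sketch below uses made-up true and predicted labels (0 = ham, 1 = spam); the numbers are assumptions for illustration, not results from the Enron model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for 10 emails: 0 = ham, 1 = spam.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true=y_true_toy, y_pred=y_pred_toy)
acc = accuracy_score(y_true=y_true_toy, y_pred=y_pred_toy)

# Rows are the true class, columns are the predicted class.
# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("ham predicted as ham:  ", tn)
print("ham predicted as spam: ", fp)
print("spam predicted as ham: ", fn)
print("spam predicted as spam:", tp)
print("accuracy:", acc)
```

The accuracy is simply the diagonal (correct predictions) divided by the total number of emails.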

Step 4. Inspect the features the vectorizer created and the coefficients the model learned.
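
As a warm-up, here is a minimal sketch of ranking features by coefficient, using made-up words and values (assumptions for illustration, not the trained model's output). Sorting in descending order puts the most spam-like features first and the most ham-like features last.

```python
# Hypothetical feature names and coefficients, invented for this example.
toy_features = ["http", "meeting", "free", "thanks", "offer", "attached"]
toy_coefs = [2.1, -1.4, 1.2, -0.9, 1.7, -1.8]

# Pair each word with its coefficient and sort from largest to smallest.
ranked = sorted(zip(toy_features, toy_coefs), key=lambda pair: pair[1], reverse=True)

k = 2
top_positive = ranked[:k]         # the k largest coefficients (spam-like)
top_negative = ranked[-k:][::-1]  # the k smallest coefficients, most negative first (ham-like)

print("top positive:", top_positive)
print("top negative:", top_negative)
```

Note the slices: `[:k]` takes the first k items and `[-k:]` takes the last k, whereas `[:-k]` would return everything except the last k.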

In [None]:
# Let's see which features (aka columns) the vectorizer created. 
# They should be all the words that were contained in the training dataset.

features = vectorizer.get_feature_names_out()
print("Features (aka columns) the vectorizer created: ", features)
print()


# You may be wondering what a machine learning model tangibly is. It is just a collection of numbers.
# You can access these numbers, known as "coefficients", from the coef_ property of the model.
# We will be looking at coef_[0], which holds one coefficient per feature and can be read as that feature's importance.
# What does importance mean in this context?
# Some words are more important than others for the model.
# It's nothing personal, just that spam emails tend to contain some words more frequently.
# A positive coefficient tells the model that an email containing that word is more likely to be spam; a negative one points toward ham.

importance = model.coef_[0]
print('Importance of each feature: ',importance)
print()


# Iterate over importance and find the top 10 positive features with the largest magnitude.
# Similarly, find the top 10 negative features with the largest magnitude.
# Positive features correspond to spam. Negative features correspond to ham.
# You will likely see that `http` is among the strongest features for spam emails (the exact ranking depends on the random train/test split).
# It makes sense. Spam emails often want you to click on a link.

features_importance = dict(zip(features, importance)) # Combine features with their importance values into a dictionary
print("Dictionary of features and their importance:\n",features_importance)
print()

sorted_features_importance = sorted(features_importance.items(), key=lambda x:x[1], reverse=True) # Sort the features importance dictionary by the importance values (descending order)
top_positive_features = sorted_features_importance[:10] # Select the 10 features with the largest (most positive) importance values
top_negative_features = sorted_features_importance[-10:][::-1] # Select the 10 features with the smallest (most negative) importance values, most negative first

print("top 10 positive features: \n", top_positive_features)
print()
print("top 10 negative features: \n", top_negative_features)

#### ALL DONE