# IST664 - Final Project - Email Spam Classification

Originality assertion: All of the text and comments in this file are my original work (except for template items written by the instructor). All of the code in this file is my work, except where I give credit to another source. By adding my name below, I affirm this originality assertion.

*** My name: _Darrell_Collison_ ***


**Task 1: Preprocess the data**

For your choice of dataset, you will first process the text, tokenize it and choose whether to do further pre-processing or filtering. If you do some pre-processing or filtering, then using the text with and without it can be one of your experiments.

For each dataset, there is a program template that reads the data. You should run the program on your data of choice and investigate some of the data to choose pre-processing or filtering. If you choose your own dataset, then you should write a similar program to read the text data and the labels.

In [1]:
'''
  This program shell reads email data for the spam classification problem.
  The input to the program is the path to the Email directory "corpus" and a limit number.
  The program reads the first limit number of ham emails and the first limit number of spam.
  It creates an "emaildocs" variable with a list of emails consisting of a pair
    with the list of tokenized words from the email and the label either spam or ham.
  It prints a few example emails.
  Your task is to generate features sets and train and test a classifier.

  Usage:  python classifySPAM.py  <corpus directory path> <limit number>
'''
# import the Python and NLTK packages needed for processing
import os
import sys
import random
import nltk
import collections
import pandas as pd
from nltk.corpus import stopwords

In [2]:
# function to read the spam and ham files and return shuffled (tokens, label) pairs
def processspamham_data_only(dirPath, limitStr):
    # convert the limit argument from a string to an int
    limit = int(limitStr)
    hamtexts = []
    spamtexts = []
    # os.chdir(dirPath) 
    spam_dir = os.path.join(dirPath, 'spam')
    ham_dir = os.path.join(dirPath, 'ham')

    # process all files in each directory that end in .txt, up to the limit,
    #    assuming that the emails are sufficiently randomized
    for file in os.listdir(spam_dir):
        if file.endswith(".txt") and len(spamtexts) < limit:
            # read the entire file into a string; the with-block closes the file
            with open(os.path.join(spam_dir, file), 'r', encoding="latin-1") as f:
                spamtexts.append(f.read())
    for file in os.listdir(ham_dir):
        if file.endswith(".txt") and len(hamtexts) < limit:
            # read the entire file into a string; the with-block closes the file
            with open(os.path.join(ham_dir, file), 'r', encoding="latin-1") as f:
                hamtexts.append(f.read())
    # print the number of emails read
    print(f"Number of spam files read: {len(spamtexts)}")
    print(f"Number of ham files read: {len(hamtexts)}")
    
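    # tokenize each email with NLTK and pair the token list with its label
    emaildocs = []
    for spam in spamtexts:
        tokens = nltk.word_tokenize(spam)
        emaildocs.append((tokens, 'spam'))
    for ham in hamtexts:
        tokens = nltk.word_tokenize(ham)
        emaildocs.append((tokens, 'ham'))

    # shuffle so spam and ham are interleaved (unseeded, so the order differs between runs)
    random.shuffle(emaildocs)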
    
    print(f"\nTotal documents prepared: {len(emaildocs)}")
    return emaildocs

data_directory = 'c:\\Users\\Black Knight\\Documents\\11.Data Science\\1. SYR\\IST 664\\Project' 
load_limit = '1500' # Load up to 1500 spam and 1500 ham files



In [3]:
email_documents = processspamham_data_only(data_directory, load_limit)


Number of spam files read: 1500
Number of ham files read: 1500

Total documents prepared: 3000
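
Before choosing filters, it can help to look at what the tokens actually contain. The following is a minimal sketch, assuming documents shaped like `email_documents` (a list of `(tokens, label)` pairs). The `summarize_docs` helper and the toy `sample_docs` list are hypothetical and only stand in for the real corpus.

```python
import string
from collections import Counter

PUNCT_CHARS = set(string.punctuation)

def summarize_docs(docs, top_n=5):
    """Return label counts, the most common lowercased tokens, and the
    share of token occurrences made up only of punctuation characters."""
    label_counts = Counter(label for _, label in docs)
    token_counts = Counter(tok.lower() for tokens, _ in docs for tok in tokens)
    total = sum(token_counts.values())
    punct = sum(n for tok, n in token_counts.items()
                if all(ch in PUNCT_CHARS for ch in tok))
    return {
        "label_counts": dict(label_counts),
        "top_tokens": token_counts.most_common(top_n),
        "punct_share": round(punct / total, 4) if total else 0.0,
    }

# Toy stand-in for email_documents (hypothetical data, not from the corpus)
sample_docs = [
    (["Subject", ":", "win", "cash", "now", "!", "!"], "spam"),
    (["Subject", ":", "meeting", "at", "noon", "."], "ham"),
]
summary = summarize_docs(sample_docs)
print(summary["label_counts"])
print(summary["top_tokens"])
print(summary["punct_share"])
```

A high punctuation share or a top-token list dominated by function words would be one argument for trying the punctuation and stop-word filters as an experiment.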


**Task 2: Produce Features**

The second step is to produce the features in the notation of the NLTK. For this you should write feature functions in Python. You should start with the “bag-of-words” features where you collect all the words in the corpus and select some number of most frequent words to be the word features.

Now use the NLTK Naïve Bayes classifier to train and test a classifier on your feature sets. You should use cross-validation to obtain precision, recall and F-measure scores. Or you can choose to produce the features as a csv file and use Weka or Sci-Kit Learn to train and test a classifier, using cross-validation scores.
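
For reference, the "notation of the NLTK" mentioned above is a list of `(feature_dict, label)` pairs. The sketch below shows one way bag-of-words features could be built in that shape using only the standard library. The helper names `build_word_features` and `document_features`, the `V_` key prefix, and the toy documents are assumptions for illustration; the notebook itself takes the scikit-learn route instead.

```python
from collections import Counter

def build_word_features(docs, vocab_size):
    """Pick the vocab_size most frequent lowercased tokens in the corpus."""
    counts = Counter(tok.lower() for tokens, _ in docs for tok in tokens)
    return [word for word, _ in counts.most_common(vocab_size)]

def document_features(tokens, word_features):
    """NLTK-style feature dict: one boolean 'V_<word>' entry per vocabulary word."""
    present = {tok.lower() for tok in tokens}
    return {f"V_{word}": (word in present) for word in word_features}

# Toy documents (hypothetical), shaped like the (tokens, label) pairs in emaildocs
toy_docs = [
    (["free", "cash", "free", "offer"], "spam"),
    (["meeting", "notes", "free", "lunch"], "ham"),
]
word_features = build_word_features(toy_docs, vocab_size=3)
featuresets = [(document_features(tokens, word_features), label)
               for tokens, label in toy_docs]
print(word_features[0])
print(featuresets[0])
```

A list like `featuresets` is the input shape that `nltk.NaiveBayesClassifier.train` accepts, so the same helpers could feed the NLTK classifier if that option were chosen.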

Define a function to score the classifier

In [4]:
# Imports and helper functions

import os
import random
import nltk
import collections
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate, KFold
from sklearn.metrics import precision_score, recall_score, f1_score, make_scorer

# Helper function to run cross-validation with standardized metrics
def run_cross_validation(X, y, model, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scoring = {
        'precision_spam': make_scorer(precision_score, pos_label='spam'),
        'recall_spam': make_scorer(recall_score, pos_label='spam'),
        'f1_spam': make_scorer(f1_score, pos_label='spam'),
        'accuracy': 'accuracy'
    }
    cv_results = cross_validate(model, X, y, cv=kf, scoring=scoring)
    mean_results = {metric: round(cv_results[f'test_{metric}'].mean(), 4) 
                    for metric in scoring.keys()}
    return mean_results
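
To make the four scores concrete, here is a minimal standard-library sketch of how precision, recall, F1, and accuracy are defined when `'spam'` is the positive label. The `spam_metrics` helper and the toy label lists are hypothetical; the notebook relies on the scikit-learn scorers above rather than on this function.

```python
def spam_metrics(y_true, y_pred, pos_label="spam"):
    """Compute precision, recall, F1 (for pos_label) and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels (hypothetical): 2 true positives, 1 false positive, 2 false negatives
y_true = ["spam", "spam", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "spam", "ham"]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

Because `run_cross_validation` averages each metric over the folds, the mean F1 it reports need not equal the harmonic mean of the mean precision and mean recall.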


In [5]:
# shared settings: vocabulary size, the Naive Bayes model, and a dict that collects scores
VOCAB_SIZE = 2500
model_nb = MultinomialNB()
results = {}

def run_experiment_on_data(doc_list, name, vocab_size, model, grams):
    """
    Vectorize a list of (tokens, label) pairs and return mean cross-validation scores.

    `name` is a human-readable description of the experiment; it is not used in the
    computation. `grams` is the (min_n, max_n) n-gram range passed to CountVectorizer.
    """

    # Convert tokenized lists back into single strings for CountVectorizer
    texts = [" ".join(tokens) for tokens, label in doc_list]
    labels = [label for tokens, label in doc_list]

    # Generate the feature matrix, keeping the vocab_size most frequent n-grams.
    # Note: the vocabulary is chosen from all documents before the folds are split.
    vectorizer = CountVectorizer(max_features=vocab_size, ngram_range=grams)
    X = vectorizer.fit_transform(texts)

    scores = run_cross_validation(X, labels, model, n_splits=5)

    return scores

results['Baseline_Raw'] = run_experiment_on_data(
    doc_list=email_documents, 
    name="Baseline (Raw text, no filtering)", 
    vocab_size=VOCAB_SIZE, 
    model=model_nb,
    grams= (1, 1)
)

Print the result scores

In [6]:
results_df = pd.DataFrame(results).T
print(results_df)

              precision_spam  recall_spam  f1_spam  accuracy
Baseline_Raw          0.9526       0.9689   0.9606      0.96
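
One caveat when reading the baseline: to my understanding of the scikit-learn documentation, `CountVectorizer` lowercases text by default and tokenizes with the pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. If so, even the "raw" run discards punctuation and one-character tokens before counting. The sketch below emulates that default with the standard `re` module; `default_like_tokens` is a hypothetical helper, not part of scikit-learn, and the token list is made up.

```python
import re

# Assumed to match CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical NLTK-style tokens, joined with spaces as run_experiment_on_data does
raw_tokens = ["Subject", ":", "You", "'ve", "won", "a", "$", "1,000", "prize", "!", "!"]
joined = " ".join(raw_tokens)
print(default_like_tokens(joined))
# prints ['subject', 'you', 've', 'won', '000', 'prize']
```

This may be part of why filtering punctuation changes the scores so little: the vectorizer would already be removing most of it in both runs.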


**Task 3: Experiments**

For a base level completion of experiments, carry out at least several experiments where you use two different sets of features and compare the results. For example, you may take the unigram word features as a baseline and see if the features you designed improve the accuracy of the classification.

Experiment #1: Same classifier on the filtered word set

In [7]:
import string
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))
PUNCTUATION = set(string.punctuation)

def filter_tokens(tokens, remove_stopwords=False, remove_punctuation=False):
    """Lowercase a list of tokens and drop the ones excluded by the given flags."""
    filtered = []
    for token in tokens:
        token = token.lower()
        # drop tokens that are a single punctuation character
        if remove_punctuation and token in PUNCTUATION:
            continue
        if remove_stopwords and token in STOP_WORDS:
            continue
        # regardless of the flags, keep a one-character token only if it is a letter
        if len(token) > 1 or token.isalpha():
            filtered.append(token)
    return filtered

filtered_docs = [(filter_tokens(tokens, remove_stopwords=True, remove_punctuation=True), label) 
                 for tokens, label in email_documents]

random.shuffle(filtered_docs)

print(f"Filtered documents list size: {len(filtered_docs)}")


Filtered documents list size: 3000


In [8]:
results['Filtered_SW_Punct'] = run_experiment_on_data(
    doc_list=filtered_docs,
    name="Filtered (Stop words/Punctuation removed)",
    vocab_size=VOCAB_SIZE,
    model=model_nb,
    grams=(1, 1)
)

print("--- Comparison of Results ---")
results_df = pd.DataFrame(results).T
print(results_df)


--- Comparison of Results ---
                   precision_spam  recall_spam  f1_spam  accuracy
Baseline_Raw               0.9526       0.9689   0.9606    0.9600
Filtered_SW_Punct          0.9558       0.9661   0.9608    0.9607


Experiment #2: Unigram classifier on raw vs. filtered word sets, smaller vocabulary

In [9]:
results['Raw_Unigram_1500'] = run_experiment_on_data(
    doc_list=email_documents,
    name="Raw unigrams, 1500-word vocabulary",
    vocab_size=1500,
    model=model_nb,
    grams=(1, 1)
)
results['Filtered_Unigram_1500'] = run_experiment_on_data(
    doc_list=filtered_docs,
    name="Filtered unigrams (Stop words/Punctuation removed), 1500-word vocabulary",
    vocab_size=1500,
    model=model_nb,
    grams=(1, 1)
)
print("--- Comparison of Results ---")
results_df = pd.DataFrame(results).T
print(results_df)


--- Comparison of Results ---
                       precision_spam  recall_spam  f1_spam  accuracy
Baseline_Raw                   0.9526       0.9689   0.9606    0.9600
Filtered_SW_Punct              0.9558       0.9661   0.9608    0.9607
Raw_Unigram_1500               0.9385       0.9617   0.9498    0.9490
Filtered_Unigram_1500          0.9486       0.9567   0.9525    0.9523


Experiment #3: Bigram classifier on raw vs. filtered word sets

In [10]:
results['Raw_Bigram'] = run_experiment_on_data(
    doc_list=email_documents,
    name="Raw bigrams",
    vocab_size=VOCAB_SIZE,
    model=model_nb,
    grams=(2, 2)
)
results['Filtered_Bigram'] = run_experiment_on_data(
    doc_list=filtered_docs,
    name="Filtered bigrams (Stop words/Punctuation removed)",
    vocab_size=VOCAB_SIZE,
    model=model_nb,
    grams=(2, 2)
)
print("--- Comparison of Results ---")
results_df = pd.DataFrame(results).T
print(results_df)


--- Comparison of Results ---
                       precision_spam  recall_spam  f1_spam  accuracy
Baseline_Raw                   0.9526       0.9689   0.9606    0.9600
Filtered_SW_Punct              0.9558       0.9661   0.9608    0.9607
Raw_Unigram_1500               0.9385       0.9617   0.9498    0.9490
Filtered_Unigram_1500          0.9486       0.9567   0.9525    0.9523
Raw_Bigram                     0.9371       0.8971   0.9152    0.9173
Filtered_Bigram                0.9598       0.7955   0.8626    0.8783
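
One possible reading of the recall drop with bigrams is sparsity: a fixed-size vocabulary of word pairs tends to cover less of each email than the same number of single words, and removing stop words changes which words end up adjacent. That is a hypothesis, not something the scores above establish. The sketch below shows how such coverage could be measured; `ngrams`, `coverage`, and the toy documents are hypothetical and do not reproduce how `CountVectorizer` builds its vocabulary in every detail.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(docs, n, vocab_size):
    """Share of all n-gram occurrences that fall inside the
    vocab_size most frequent n-grams of the corpus."""
    counts = Counter(g for tokens, _ in docs for g in ngrams(tokens, n))
    total = sum(counts.values())
    kept = sum(c for _, c in counts.most_common(vocab_size))
    return kept / total if total else 0.0

# Toy documents (hypothetical), shaped like the (tokens, label) pairs
toy_docs = [
    (["click", "here", "for", "free", "cash"], "spam"),
    (["free", "cash", "click", "here"], "spam"),
    (["see", "notes", "here"], "ham"),
]
uni = coverage(toy_docs, n=1, vocab_size=4)
bi = coverage(toy_docs, n=2, vocab_size=4)
print(f"unigram coverage: {uni:.4f}")
print(f"bigram coverage:  {bi:.4f}")
```

Running the same measurement on `email_documents` and `filtered_docs` with `vocab_size=2500` would show whether the filtered bigram vocabulary really covers less of the corpus.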


Experiment #4: N-gram (1- to 3-gram) classifier on raw vs. filtered word sets, larger vocabulary

In [11]:
results['Raw_Ngram_3500'] = run_experiment_on_data(
    doc_list=email_documents,
    name="Raw 1- to 3-grams, 3500-feature vocabulary",
    vocab_size=3500,
    model=model_nb,
    grams=(1, 3)
)
results['Filtered_Ngram_3500'] = run_experiment_on_data(
    doc_list=filtered_docs,
    name="Filtered 1- to 3-grams (Stop words/Punctuation removed), 3500-feature vocabulary",
    vocab_size=3500,
    model=model_nb,
    grams=(1, 3)
)
print("--- Comparison of Results ---")
results_df = pd.DataFrame(results).T
print(results_df)


--- Comparison of Results ---
                       precision_spam  recall_spam  f1_spam  accuracy
Baseline_Raw                   0.9526       0.9689   0.9606    0.9600
Filtered_SW_Punct              0.9558       0.9661   0.9608    0.9607
Raw_Unigram_1500               0.9385       0.9617   0.9498    0.9490
Filtered_Unigram_1500          0.9486       0.9567   0.9525    0.9523
Raw_Bigram                     0.9371       0.8971   0.9152    0.9173
Filtered_Bigram                0.9598       0.7955   0.8626    0.8783
Raw_Ngram_3500                 0.9360       0.9604   0.9479    0.9470
Filtered_Ngram_3500            0.9533       0.9661   0.9596    0.9593
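
The final table is easier to compare when it is ranked and the raw and filtered runs are paired. The sketch below does that with the standard library, using the scores copied from the table above. The names `reported`, `ranked`, and `f1_delta` are introduced here for illustration only.

```python
# Scores copied from the final comparison table:
# (precision_spam, recall_spam, f1_spam, accuracy)
reported = {
    "Baseline_Raw":          (0.9526, 0.9689, 0.9606, 0.9600),
    "Filtered_SW_Punct":     (0.9558, 0.9661, 0.9608, 0.9607),
    "Raw_Unigram_1500":      (0.9385, 0.9617, 0.9498, 0.9490),
    "Filtered_Unigram_1500": (0.9486, 0.9567, 0.9525, 0.9523),
    "Raw_Bigram":            (0.9371, 0.8971, 0.9152, 0.9173),
    "Filtered_Bigram":       (0.9598, 0.7955, 0.8626, 0.8783),
    "Raw_Ngram_3500":        (0.9360, 0.9604, 0.9479, 0.9470),
    "Filtered_Ngram_3500":   (0.9533, 0.9661, 0.9596, 0.9593),
}
F1 = 2  # index of f1_spam in each tuple

# Rank the experiments from best to worst spam F1
ranked = sorted(reported, key=lambda name: reported[name][F1], reverse=True)

# Change in spam F1 when moving from the raw to the filtered token lists
pairs = [
    ("Baseline_Raw", "Filtered_SW_Punct"),
    ("Raw_Unigram_1500", "Filtered_Unigram_1500"),
    ("Raw_Bigram", "Filtered_Bigram"),
    ("Raw_Ngram_3500", "Filtered_Ngram_3500"),
]
f1_delta = {filt: round(reported[filt][F1] - reported[raw][F1], 4)
            for raw, filt in pairs}

for name in ranked:
    print(f"{name:<22} f1_spam={reported[name][F1]:.4f}")
print(f1_delta)
```

On these numbers, filtering changes the unigram F1 by less than 0.003, raises the 1- to 3-gram F1 by about 0.012, and lowers the bigram-only F1 by about 0.053. The table does not report fold-to-fold variation, so the smallest of these differences may not be meaningful.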
