In [1]:
#### CA02 Building a Spam Detector using Naive Bayes Algorithm

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


# Word Collection

Initialization:
The function initializes an empty list all_words which will eventually contain all the words found in the emails.


Reading Emails:
The function reads all the files in the directory specified by root_dir, which are individual emails.


Word Collection:
For each file, it opens the file and reads each line. It splits the lines into individual words and extends the all_words list with these words.


Creating a Word Frequency Counter:
The function then creates a Counter object (from the collections module) over all tokens that are purely alphabetic and longer than one character. This is an important preprocessing step: it filters out non-word tokens (numbers, punctuation, and any token that contains a digit or punctuation mark) and single-character words (like 'a' or 'I'), which are typically not useful in distinguishing spam from non-spam.


Selecting Top Features:
The function returns the 3000 most common words from this collection, along with their frequencies. This subset of words will serve as the feature set for the Naive Bayes classifier. The assumption is that these common words will be the most informative in distinguishing between spam and non-spam emails.
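
The filtering and ranking steps described above can be tried on a couple of in-memory lines before running them on the real mail folders. This is a minimal sketch: `sample_lines` is made-up text, not data from the training set, and only two words are kept instead of 3000.

```python
from collections import Counter

# Hypothetical toy lines standing in for email text (not from the dataset)
sample_lines = [
    "free money now free",
    "call 555 now , a free offer !",
]

# Split every line on whitespace, as make_Dictionary does
all_words = []
for line in sample_lines:
    all_words.extend(line.split())

# Keep only purely alphabetic tokens longer than one character
counter = Counter(w for w in all_words if w.isalpha() and len(w) > 1)

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = counter.most_common(2)
print(top_two)  # [('free', 3), ('now', 2)]
```

The tokens `555`, `,`, `a`, and `!` never reach the counter, which is the effect the isalpha() and length checks are meant to have.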

In [2]:

def make_Dictionary(root_dir):
    all_words = []
    # Gather all words from the files
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                all_words.extend(line.split())

    # Create a Counter object for all words that are alphabetic and longer than one character
    dictionary_counter = Counter(word for word in all_words if word.isalpha() and len(word) > 1)

    # Return the 3000 most common words as a list of tuples (word, frequency)
    return dictionary_counter.most_common(3000)



# Feature Matrix

Conversion of Dictionary:
The input dictionary is a list of tuples, where each tuple contains a word and its frequency. The function creates a word_index dictionary that maps each word to a unique index. This will be used to build a feature vector for each email, where the index corresponds to a word's position in the feature vector.


Initialization of Feature Matrix and Labels:
The features_matrix is initialized as a 2D NumPy array with zeros. Its dimensions are determined by the number of files (emails) and the length of the dictionary (number of features).

train_labels is a 1D NumPy array initialized to store the labels (0 or 1) indicating whether each email is non-spam or spam, respectively.


Processing Each Email:
The function iterates over each file in the mail_dir directory.
For each file, it reads the contents and splits the body of the email (assumed to be on the third line, lines[2]) into individual words. Files with fewer than three lines are left as all-zero rows.


Feature Extraction:
For each word in the email, it looks up the word's index in the word_index dictionary.
If the word is in the dictionary, it increments the corresponding element in the features_matrix for that email (docID) by 1. This process effectively counts the occurrences of each dictionary word in the email.


Label Assignment:
The function checks if the filename contains the substring 'spmsg'. If it does, the corresponding entry in train_labels is set to 1, indicating spam. Otherwise, it's set to 0, indicating non-spam.


Output:
The function returns the features_matrix and train_labels. The feature matrix is the input to the Naive Bayes classifier; the labels are used to train the classifier (for the training set) or to score its predictions (for the test set).
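
The word-to-column mapping and the counting loop can be illustrated without touching the file system. The sketch below assumes a three-word dictionary and two made-up email bodies (`toy_dictionary` and `toy_bodies` are hypothetical names and values); it mirrors the counting logic but skips file reading and labeling.

```python
import numpy as np

# Hypothetical dictionary in the (word, frequency) format returned by make_Dictionary
toy_dictionary = [("free", 3), ("now", 2), ("money", 1)]
word_index = {word: idx for idx, (word, _freq) in enumerate(toy_dictionary)}

# Hypothetical email bodies, one string per email
toy_bodies = ["free free money", "see you now"]

# One row per email, one column per dictionary word
toy_features = np.zeros((len(toy_bodies), len(toy_dictionary)), dtype=np.int_)
for doc_id, body in enumerate(toy_bodies):
    for word in body.split():
        word_id = word_index.get(word, -1)
        if word_id >= 0:  # words outside the dictionary are ignored
            toy_features[doc_id, word_id] += 1

print(toy_features)
```

The first row comes out as counts of 2, 0, 1 for "free", "now", "money"; the second row counts only "now", because "see" and "you" are not in the dictionary.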

In [3]:
def extract_features(mail_dir, dictionary):
    # Assuming the dictionary passed in is a list of tuples (word, frequency)
    # Convert it into a dictionary of word:index
    word_index = {word[0]: idx for idx, word in enumerate(dictionary)}

    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), len(dictionary)), dtype=np.int_)
    train_labels = np.zeros(len(files), dtype=np.int_)

    for docID, file in enumerate(files):
        with open(file, 'r') as fi:
            lines = fi.readlines()
            if len(lines) > 2:
                words = lines[2].split()
                for word in words:
                    wordID = word_index.get(word, -1)
                    if wordID >= 0:
                        features_matrix[docID, wordID] += 1
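        # Check the filename only, so a directory name containing 'spmsg' cannot mislabel every file
        train_labels[docID] = 1 if 'spmsg' in os.path.basename(file) else 0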

    return features_matrix, train_labels


In [4]:
# Pathnames for testing and training data
# If the train and test mail folders are not in the same folder as this notebook, enter their full pathnames below.

TRAIN_DIR = "train-mails"
TEST_DIR = "test-mails"


# Label Extraction

The function make_Dictionary is called with TRAIN_DIR as an argument to create a dictionary of the most common words found in the training dataset.


Feature Extraction for Training Data:
extract_features is called with TRAIN_DIR and the previously created dictionary_list to extract the feature matrix and labels from the training data. The feature matrix contains the count of each dictionary word in each email, and the labels indicate whether each email is spam or not.


Feature Extraction for Test Data:
Similarly, extract_features is called with TEST_DIR and the same dictionary_list, so the test feature matrix has the same columns as the training matrix.

In [5]:
# Create a dictionary from the training data
dictionary_list = make_Dictionary(TRAIN_DIR)


# Extract features and labels from the training data
features_matrix, labels = extract_features(TRAIN_DIR, dictionary_list)

# Extract features and labels from the test data
test_features_matrix, test_labels = extract_features(TEST_DIR, dictionary_list)


# Results

Model Training:
A Gaussian Naive Bayes model is instantiated with GaussianNB().
The model is then trained using the .fit() method with the training feature matrix and labels. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes each feature follows a normal distribution within each class. Word counts are discrete, non-negative, and mostly zero, so this assumption is only an approximation for this data; Multinomial Naive Bayes is the variant usually suggested for count features.


Model Prediction:
After training, the model predicts the labels for the test data using the .predict() method and the test feature matrix. This step classifies each email in the test set as either spam or not spam.


Accuracy Calculation:
The accuracy of the model is calculated by comparing the predicted labels with the actual labels from the test set using the accuracy_score function. This score represents the proportion of test emails that were correctly classified by the model.


Printing Results:
The code prints statements to the console to inform the user of the different stages of execution, such as reading data, training the model, predicting labels, and the final accuracy score.
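
The fit, predict, and score sequence described above can be checked on a toy count matrix before running it on the full data. This is a sketch under stated assumptions: the arrays below are invented for illustration (column 0 plays the role of a word common in spam, column 1 a word common in non-spam), and all `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hypothetical word counts: rows are emails, columns are two dictionary words
toy_train = np.array([[3, 0], [2, 0], [4, 1], [0, 3], [0, 2], [1, 4]])
toy_train_labels = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = non-spam

toy_test = np.array([[3, 0], [0, 3]])
toy_test_labels = np.array([1, 0])

# Fit a per-class Gaussian to each column, then classify the test rows
toy_model = GaussianNB()
toy_model.fit(toy_train, toy_train_labels)
toy_predictions = toy_model.predict(toy_test)

# accuracy_score is the fraction of predictions that match the true labels
toy_accuracy = accuracy_score(toy_test_labels, toy_predictions)
print(toy_predictions, toy_accuracy)
```

Because the two classes are well separated in this toy data, both test rows are classified correctly; real email counts are far noisier.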



In [6]:
print("reading and processing emails from TRAIN and TEST folders")


# Training the Naive Bayes model
print("Training Model using Gaussian Naive Bayes algorithm .....")
model = GaussianNB()
model.fit(features_matrix, labels)
print("Training completed")
    
# Predicting the labels of the test data
print("testing trained model to predict Test Data labels")
predicted_labels = model.predict(test_features_matrix)
print("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:") 
    
# Calculating and printing the accuracy
accuracy = accuracy_score(test_labels, predicted_labels)
print(accuracy)




reading and processing emails from TRAIN and TEST folders
Training Model using Gaussian Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
0.9653846153846154


# Further Testing

Here, we test how the number of dictionary words affects the model. We loop over word counts from 3000 down to 1000 in steps of 500, and for each one print the accuracy, a confusion matrix, and a classification report showing precision and recall.
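
Restricting the model to fewer words does not require rebuilding the feature matrix. Since most_common returns words in descending order of frequency, the first N columns of the matrix correspond to the N most common words, and a column slice is enough. A minimal sketch, with a made-up four-word dictionary and count matrix (all values hypothetical):

```python
import numpy as np

# Hypothetical dictionary, already sorted by descending frequency as most_common returns it
toy_dictionary = [("free", 9), ("now", 5), ("money", 2), ("offer", 1)]

# Hypothetical count matrix: one row per email, one column per dictionary word
toy_matrix = np.array([
    [2, 0, 1, 0],
    [0, 1, 0, 3],
])

count = 2
reduced = toy_matrix[:, :count]  # keep the columns for the 2 most common words
kept_words = [word for word, _freq in toy_dictionary[:count]]

print(kept_words)
print(reduced)
```

The same slice must be applied to both the training and the test matrix so that their columns stay aligned.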
 
Across the word counts tested, the 2000-word model had the highest accuracy.
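
The accuracy, precision, and recall in the classification report all follow from the confusion matrix. As a worked check, the sketch below recomputes them from the matrix printed for the 3000-word run in the output, using scikit-learn's layout (rows are true labels, columns are predicted labels). The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix printed for the 3000-word model
# rows: true label (0 = non-spam, 1 = spam), columns: predicted label
cm = np.array([[129, 1],
               [8, 122]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = float((tp + tn) / cm.sum())   # 251 correct out of 260
spam_precision = float(tp / (tp + fp))   # of emails predicted spam, how many were spam
spam_recall = float(tp / (tp + fn))      # of actual spam emails, how many were caught

print(round(accuracy, 4), round(spam_precision, 2), round(spam_recall, 2))
```

These values agree with the reported accuracy of 0.9653846153846154 and with the precision (0.99) and recall (0.94) listed for class 1: the model misses 8 of 130 spam emails and flags only 1 legitimate email as spam.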


-Credit to Andrew Morris

In [7]:
# Labels do not depend on the number of features, so reuse them
new_tr_labels = labels
new_test_labels = test_labels


# List of word counts to test
word_counts = [3000, 2500, 2000, 1500, 1000]

# Loop over the different word counts
for count in word_counts:
    # Select the first 'count' columns for both training and test sets; the dictionary
    # is sorted by descending frequency, so these are the 'count' most common words
    new_train_set = features_matrix[:, :count]
    new_test_set = test_features_matrix[:, :count]
    
    # Train the Naive Bayes model
    print(f"\nTraining model using the {count} most common words...")
    model = GaussianNB()
    model.fit(new_train_set, new_tr_labels)
    print("Training completed")

    # Predict the labels of the test data
    print(f"Testing trained model to predict Test Data labels using {count} most common words")
    predicted_labels = model.predict(new_test_set)

    # Calculate and print the accuracy
    accuracy = accuracy_score(test_labels, predicted_labels)
    print(f"\nAccuracy Score for {count} Most Popular Words: {accuracy}")

    # Print confusion matrix
    print("Confusion Matrix:")
    print(confusion_matrix(test_labels, predicted_labels))

    # Print classification report
    print("Classification Report:")
    print(classification_report(test_labels, predicted_labels))
    
    



Training model using the 3000 most common words...
Training completed
Testing trained model to predict Test Data labels using 3000 most common words

Accuracy Score for 3000 Most Popular Words: 0.9653846153846154
Confusion Matrix:
[[129   1]
 [  8 122]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.99      0.97       130
           1       0.99      0.94      0.96       130

    accuracy                           0.97       260
   macro avg       0.97      0.97      0.97       260
weighted avg       0.97      0.97      0.97       260


Training model using the 2500 most common words...
Training completed
Testing trained model to predict Test Data labels using 2500 most common words

Accuracy Score for 2500 Most Popular Words: 0.9615384615384616
Confusion Matrix:
[[129   1]
 [  9 121]]
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.99      0.96       130
        