# CA02 

This is an email spam classifier that uses the Naive Bayes supervised machine learning algorithm.

Jerry and Nicholson

# Assignment Overview:

This assignment is to build a machine learning model that can classify emails as spam or not spam using the Naive Bayes algorithm.

---

Goal:

Build a model that predicts whether an email is Spam (1) or Not Spam (0).

---

Key challenge: 

This dataset is not a CSV table. Each email is stored as a separate `.txt` file inside one of two folders:

- `train-mails/` (used to learn patterns)
- `test-mails/` (used to evaluate performance)

---

Notebook workflow:

1. Read the training and test emails and build a dictionary of the most common words.
2. Convert each email into a numeric feature vector (word counts).
3. Assign labels using the filename rule (`spmsg*` = spam).
4. Train the Naive Bayes model.
5. Predict test labels and compute accuracy.
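
The filename rule in the workflow can be illustrated without the dataset. This is a minimal sketch: the helper name `label_from_filename` and the example paths are hypothetical, and only the `spmsg` prefix comes from this notebook.

```python
import os

def label_from_filename(path):
    """Return 1 (spam) if the file's base name starts with 'spmsg', else 0."""
    return 1 if os.path.basename(path).startswith("spmsg") else 0

# Hypothetical paths, used only for illustration
example_paths = [
    "train-mails/spmsga1.txt",
    "train-mails/3-1msg1.txt",
]
labels = [label_from_filename(p) for p in example_paths]
print(labels)  # [1, 0]
```

Because the label is derived from the base name only, the rule works the same way for both folders.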



# Import Libraries

In [17]:
import os
from collections import Counter # count word frequency efficiently
import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score # evaluate prediction accuracy

# Part 1: Build the Dictionary

Purpose:

Decide which words become our features by counting words across ALL emails (train + test) and keeping the top 3000.

Breakdown of the code logic:

1. Collect all email file paths from both folders (train + test).
2. Open each email and read the full text.
3. Split the text into words and collect all words into one big list.
4. Use `Counter()` to count how often each word appears.
5. Remove noisy tokens:
   - non-alphabetic tokens (anything containing digits, symbols, or punctuation)
   - one-letter words
6. Keep only the top `top_k` most frequent words as the final vocabulary.
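
The counting and filtering steps above can be tried on a few in-memory strings before running them on the email folders. This is a minimal sketch, not the notebook's function: `build_vocab` and the toy emails are hypothetical, and it reads strings rather than files.

```python
from collections import Counter

def build_vocab(texts, top_k):
    """Count whitespace-separated tokens, drop noisy ones, keep the top_k."""
    freq = Counter()
    for text in texts:
        freq.update(text.split())
    # keep alphabetic tokens longer than one character
    cleaned = Counter({t: c for t, c in freq.items() if t.isalpha() and len(t) > 1})
    return cleaned.most_common(top_k)

# Toy emails, made up for illustration
toy_emails = ["free money now 100 !", "meeting now a b", "free free offer"]
vocab = build_vocab(toy_emails, top_k=2)
print(vocab)  # [('free', 3), ('now', 2)]
```

Note that `100`, `!`, `a`, and `b` are dropped by the filter, so they can never enter the vocabulary regardless of how often they occur.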


In [18]:
def dictionary(train_dir, test_dir, top_k=3000):
    all_tokens = []

    # collect all files from both folders
    all_files = []
    for folder in [train_dir, test_dir]:
        for fname in os.listdir(folder):
            fpath = os.path.join(folder, fname)
            if os.path.isfile(fpath):
                all_files.append(fpath)

    # read each email and collect tokens
    for fpath in all_files:
        with open(fpath, "r", encoding="latin1", errors="ignore") as f:
            text = f.read()
            words = text.split()
            all_tokens.extend(words)

    # count frequency of each token
    freq = Counter(all_tokens)

    # remove noisy tokens: non-alpha and one-letter
    for token in list(freq.keys()):
        if (not token.isalpha()) or (len(token) == 1):
            del freq[token]

    # keep top K frequent words
    vocab = freq.most_common(top_k)

    return vocab


# Part 2: Convert Emails into Features and Labels

Purpose:

This part transforms each email from a text file into a numeric feature vector based on word frequencies, and assigns a spam or not-spam label using the file naming rule, so that the data becomes usable for Naive Bayes classification.

The goal is to create two outputs: a feature matrix and a label array. These two outputs are the inputs required to train and test the Naive Bayes model.

Breakdown of the code logic:

1. List all email files in the given folder.
2. Create an empty feature matrix `X`:
   - rows = number of emails
   - columns = `top_k` vocabulary words (3000 here)
3. Create an empty label array `y`.
4. Create a fast lookup map: word → column index.
5. For each email:
   - read full email text
   - split into words
   - count words using `Counter`
   - fill the correct columns in the feature matrix
6. Create label from filename:
   - starts with `spmsg` → spam (1)
   - otherwise → not spam (0)
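
The word-to-column lookup and the filling of the feature matrix can be seen on a tiny example. This is a minimal sketch under assumptions: `vectorize` is a hypothetical helper, the vocabulary is passed as a plain list of words (the notebook's vocabulary holds `(word, count)` pairs), and the inputs are strings rather than files.

```python
from collections import Counter
import numpy as np

def vectorize(texts, vocab_words):
    """Return a (len(texts), len(vocab_words)) matrix of word counts."""
    word_to_col = {w: i for i, w in enumerate(vocab_words)}
    X = np.zeros((len(texts), len(vocab_words)))
    for row, text in enumerate(texts):
        for word, cnt in Counter(text.split()).items():
            col = word_to_col.get(word)
            if col is not None:  # words outside the vocabulary are ignored
                X[row, col] = cnt
    return X

vocab_words = ["free", "now", "meeting"]
X = vectorize(["free free now", "meeting now please"], vocab_words)
print(X)
# [[2. 1. 0.]
#  [0. 1. 1.]]
```

The word `please` is not in the toy vocabulary, so it contributes nothing to the second row.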


In [19]:
# Step 2: Convert emails into feature matrix + labels
# -------------------------
def featurize_folder(folder_dir, vocab, top_k=3000):
    files = [
        os.path.join(folder_dir, f)
        for f in os.listdir(folder_dir)
        if os.path.isfile(os.path.join(folder_dir, f))
    ]

    X = np.zeros((len(files), top_k))
    y = np.zeros(len(files))

    # fast lookup: word -> column index
    word_to_col = {w: i for i, (w, _) in enumerate(vocab)}

    for row_id, fpath in enumerate(files):
        # read full email text
        with open(fpath, "r", encoding="latin1", errors="ignore") as f:
            text = f.read()
            words = text.split()

        # count words once per email
        counts = Counter(words)

        # fill vector
        for word, cnt in counts.items():
            if word in word_to_col:
                X[row_id, word_to_col[word]] = cnt

        # label from filename
        filename = os.path.basename(fpath)
        y[row_id] = 1 if filename.startswith("spmsg") else 0

    return X, y

# Part 3: Data Paths

In [20]:
# Enter the paths of the training and testing folders
# Update these if your data is stored elsewhere
TRAIN_FOLDER = "./train-mails"
TEST_FOLDER  = "./test-mails"

MAX_FEATURES = 3000

# Part 4: Run Feature Pipeline

Purpose:

Build the dictionary first, then convert training and testing emails into numeric features.

Breakdown of the code logic:

1. Build vocabulary using train + test emails (top 3000 words).
2. Convert training emails into `(X_train, y_train)`.
3. Convert testing emails into `(X_test, y_test)`.
4. After this step, the emails are represented as numeric arrays that are ready for model training.


In [21]:
# Step 3: Run pipeline
# -------------------------
print("Building dictionary using TRAIN + TEST emails ...")
vocabulary = dictionary(TRAIN_FOLDER, TEST_FOLDER, top_k=MAX_FEATURES)

print("Featurizing training emails ...")
X_train, y_train = featurize_folder(TRAIN_FOLDER, vocabulary, top_k=MAX_FEATURES)

print("Featurizing testing emails ...")
X_test, y_test = featurize_folder(TEST_FOLDER, vocabulary, top_k=MAX_FEATURES)


Building dictionary using TRAIN + TEST emails ...
Featurizing training emails ...
Featurizing testing emails ...


# Part 5: Train Naive Bayes and Evaluate Accuracy

Purpose:

Train the Naive Bayes model using training features, then test it on test features and report accuracy.

Breakdown of the code logic:

1. Create a `MultinomialNB` model (good for word counts).
2. Train the model using `X_train` and `y_train`.
3. Predict labels for test data `X_test`.
4. Compare predictions with `y_test`.
5. Print the accuracy score (the fraction of test emails classified correctly).
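
The fit and predict steps can be checked on a toy count matrix before running them on the real features. This is a minimal sketch: the two columns stand for the counts of two hypothetical words, and the data is made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# columns: counts of the hypothetical words ["free", "meeting"]
X_toy = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
y_toy = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

model = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace (add-one) smoothing
model.fit(X_toy, y_toy)

# two unseen "emails": one dominated by "free", one by "meeting"
new_emails = np.array([[4, 0], [0, 1]])
print(model.predict(new_emails))  # [1 0]
```

With smoothing, a word that never appeared in one class still gets a small non-zero probability there, so a single unseen word does not force the class probability to zero.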


In [22]:
# Step 4: Train + Test Naive Bayes
# -------------------------
print("Training Model using Multinomial Naive Bayes algorithm .....")
nb_model = MultinomialNB(alpha=1.0)  # smoothing helps unseen word cases
nb_model.fit(X_train, y_train)

print("Training completed")
print("testing trained model to predict Test Data labels")

y_pred = nb_model.predict(X_test)

print("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:")
print(accuracy_score(y_test, y_pred))


Training Model using Multinomial Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
0.9615384615384616
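
The printed score is the fraction of test emails whose predicted label matches the true label. A minimal sketch with made-up labels shows that `accuracy_score` agrees with computing that fraction by hand:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 4 of 5 predictions are correct
y_true = np.array([1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 0, 0])

manual = np.mean(y_true == y_hat)  # fraction of matching positions
print(manual)                         # 0.8
print(accuracy_score(y_true, y_hat))  # 0.8
```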


======================= END OF PROGRAM =========================