# Artificial Intelligence
## Assignment 5 – Machine Learning

### Personal details

* **Name:** Ahmed Jabir Zuhayr

In this assignment, you will implement a **Naive Bayes** classifier for email spam detection. We will go through a typical machine learning pipeline, which generally involves the following steps:

1. **Data Preparation**: Load and preprocess the dataset.
2. **Feature Extraction**: Convert complex data into numerical features for the model.
3. **Model Training**: Train a model to predict the output label of a given input.
4. **Model Inference**: Use the trained model to make predictions on new data.
5. **Model Optimization**: Fine-tune the model's hyperparameters or compare different models to maximize performance.
6. **Model Evaluation**: Use different metrics to evaluate the model's real-world performance.

However, some steps may be simplified or skipped entirely depending on the specifics of the problem. As we have chosen to implement a simple Naive Bayes classifier, we will be skipping step 5. Naive Bayes is also a special type of classifier where the feature extraction and model training steps can be thought of as a single combined step, so we will also be skipping step 3.

In [1]:
# Dependencies:
# pip install numpy
# pip install pandas
# pip install scikit-learn

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

### 5.1 – Data Preparation

We will be using the [Enron-Spam](https://github.com/MWiechmann/enron_spam_data) dataset, which contains a collection of emails labeled as either spam or ham (not spam). The dataset is provided as a CSV file that we will load into a Pandas DataFrame. We encourage you to take a look at the CSV file to understand its structure and the features available for analysis.

[Pandas](https://pandas.pydata.org/docs/user_guide/index.html) is a powerful library for data manipulation and analysis in Python. It provides data structures that make it easy to work with structured data. Columns in Pandas are called `Series` objects, and a collection of these is called a `DataFrame`. You can think of a DataFrame as a table where each column can have a different data type (e.g., integers, floats, strings).
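As a small illustration of these two structures, the following sketch builds a toy `DataFrame` and selects one of its columns. The column names and values are made up for this example and are not taken from the Enron data.

```python
import pandas as pd

# Toy table; values are invented for illustration only
toy = pd.DataFrame({
    "Subject": ["hello", "win cash"],
    "Length": [5, 8],
})

subject_column = toy["Subject"]         # selecting a single column yields a Series
print(type(subject_column).__name__)    # Series
print(toy.shape)                        # (2, 2): two rows, two columns
```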

**Task 1: Load the dataset (0.1 pt)**

Your first task is to load the dataset in `enron_spam_data.csv` and assign it to the variable `data`. 

(Hint: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

In [2]:
# ---------- YOUR CODE HERE ----------- #

data = pd.read_csv("enron_spam_data.csv")

# ---------- YOUR CODE HERE ----------- #

data.dropna(inplace=True) # drop any rows with missing values
print(data.columns) # inspect the columns in the DataFrame

Index(['Message ID', 'Subject', 'Message', 'Spam/Ham', 'Date'], dtype='str')


In this assignment we are only concerned with the `Subject`, `Message` and `Spam/Ham` columns. These are the inputs and output of our model.

In [3]:
data = data.drop(columns=['Message ID', 'Date']) # drop unnecessary columns
data.rename(columns={'Spam/Ham': 'Label'}, inplace=True) # rename the "Spam/Ham" column to "Label"
display(data.sample(10, random_state=42)) # display 10 random rows from the DataFrame

Unnamed: 0,Subject,Message,Label
22808,re : move,"sally ,\nthe may 18 th date is locked in stone...",ham
22814,steering committee meeting - 07 may 2000,"gbn houston steering committee meeting , 07 ma...",ham
19492,goode,"gallo , % , online doctors ! herbs , minerals ...",spam
6017,summer intern,we can hire the person as a summer intern\ndir...,ham
1259,eastrans - lst of month nomination - eff 8 / 1...,"this si to nominate 32 , 800 mmbtu into eastra...",ham
15793,happy new year,"attn : sir ,\ni got to know you through our fo...",spam
13783,master firm agreements,stacey and ellen :\nin light of current circum...,ham
11838,change of company number,please note that the company number for enron ...,ham
21136,learn fast - earn fast ! we make it easy .,!\n?\n.\n- . / . .\n. 848\nn . rainbow blvd . ...,spam
20066,buy oem software at massive discounts !,table align = center width = 100 % trtd align ...,spam


**Task 2: Elementary preprocessing (0.1 pt)**

It is often desirable to treat textual output labels as numerical values. Your second task is to convert the `Label` column into numerical format, where `ham` is represented as `0` and `spam` as `1`. You may consult the Pandas [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html) on how to replace values (see "Parameters" -> "to_replace" -> "dict").

(Hint: note that the `replace` method operates on a `Series` object, not the entire `DataFrame`!)

In [4]:
# ---------- YOUR CODE HERE ----------- #

data["Label"] = data['Label'].replace({"ham": 0, "spam": 1})

# ---------- YOUR CODE HERE ----------- #

data["Label"] = data["Label"].astype(int)  # you can probably ignore any potential warnings due to this line
data["Text"] = data["Subject"] + " " + data["Message"]
display(data.sample(10, random_state=42))

Unnamed: 0,Subject,Message,Label,Text
22808,re : move,"sally ,\nthe may 18 th date is locked in stone...",0,"re : move sally ,\nthe may 18 th date is locke..."
22814,steering committee meeting - 07 may 2000,"gbn houston steering committee meeting , 07 ma...",0,steering committee meeting - 07 may 2000 gbn h...
19492,goode,"gallo , % , online doctors ! herbs , minerals ...",1,"goode gallo , % , online doctors ! herbs , min..."
6017,summer intern,we can hire the person as a summer intern\ndir...,0,summer intern we can hire the person as a summ...
1259,eastrans - lst of month nomination - eff 8 / 1...,"this si to nominate 32 , 800 mmbtu into eastra...",0,eastrans - lst of month nomination - eff 8 / 1...
15793,happy new year,"attn : sir ,\ni got to know you through our fo...",1,"happy new year attn : sir ,\ni got to know you..."
13783,master firm agreements,stacey and ellen :\nin light of current circum...,0,master firm agreements stacey and ellen :\nin ...
11838,change of company number,please note that the company number for enron ...,0,change of company number please note that the ...
21136,learn fast - earn fast ! we make it easy .,!\n?\n.\n- . / . .\n. 848\nn . rainbow blvd . ...,1,learn fast - earn fast ! we make it easy . !\n...
20066,buy oem software at massive discounts !,table align = center width = 100 % trtd align ...,1,buy oem software at massive discounts ! table ...


We also combined the `Subject` and `Message` columns into a single column called `Text`. This will be the input to our model.

Next we need to split our data into training and test sets. The training set will be used to train the model, while the test set will be used to evaluate the model's performance. There are many ways to split the data, but one typical approach is an 80/20 split, meaning 80% of the data will be used for training and 20% for testing.

In [5]:
data_train = data.sample(frac=0.8, random_state=42) # randomly sample 80% of the data for training
data_test = data.drop(data_train.index) # drop the training set from the original data to get the test set

print(f"Training set size: {data_train.shape[0]}, test set size: {data_test.shape[0]}")

Training set size: 26486, test set size: 6621
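The `sample`/`drop` idiom used above can be checked on a toy table. The sketch below uses made-up rows and a hypothetical `toy` frame to confirm that the two sets are disjoint and together cover all rows. It assumes the index labels are unique, which holds for a frame loaded with `read_csv` and filtered with `dropna`.

```python
import pandas as pd

toy = pd.DataFrame({"Text": [f"mail {i}" for i in range(10)]})

toy_train = toy.sample(frac=0.8, random_state=0)  # 8 of the 10 rows
toy_test = toy.drop(toy_train.index)              # the remaining 2 rows

# No row index appears in both sets, and together they cover the whole table
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```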


### 5.2 – Feature Extraction

If we were dealing with, say, predicting the value of a used car based on simple input data such as year of manufacture and mileage, we could use these values directly as the features for our model. However, often in machine learning we need to *engineer our own features* from raw data, especially when dealing with textual or visual data.

We will use the **Bag of Words (BoW)** model to convert our text data into numerical features. The BoW model represents text data as a collection of words, disregarding grammar and word order, but keeping track of the frequency of each word in the document.

The basic idea is as follows:

1. Create a vocabulary of all unique words in the dataset.
2. For each document (subject + message), count the occurrence of each word in the vocabulary.
3. Represent each document as a vector of word counts.

We can then use these vectors as input for our model. This process can be needlessly arduous as well as computationally expensive if done manually, so we will use a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to handle things for us.

In [6]:
# Fit vectorizer on all training data to get a shared vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data_train["Text"])

# Get indices for spam and ham
spam_idx = (data_train["Label"] == 1).values
ham_idx = (data_train["Label"] == 0).values

# Get word counts for spam and ham using the same vocabulary
spam_counts = X[spam_idx].sum(axis=0).A1  # .A1 flattens to 1D array
ham_counts = X[ham_idx].sum(axis=0).A1

# Create dictionaries for words and their counts
words = vectorizer.get_feature_names_out()
spam_words = dict(zip(words, spam_counts))
ham_words = dict(zip(words, ham_counts))

# Drop words that never occur in the respective class (count of zero)
spam_words = {word: count for word, count in spam_words.items() if count > 0}
ham_words = {word: count for word, count in ham_words.items() if count > 0}

# Some example words and their frequencies
print(f"Frequency of 'free' in data: {spam_words.get('free', 0)} (spam) / {ham_words.get('free', 0)} (ham)")
print(f"Frequency of 'schedule' in data: {spam_words.get('schedule', 0)} (spam) / {ham_words.get('schedule', 0)} (ham)")

Frequency of 'free' in data: 3745 (spam) / 1173 (ham)
Frequency of 'schedule' in data: 50 (spam) / 1954 (ham)


Interpreting feature vectors of this size can get difficult, so you should take a look at the examples in the documentation to get an idea of what the output looks like.

(Don't worry if you don't fully understand the details here; feature engineering for text data will be covered in more detail in the Natural Language Processing course.)

### 5.3 – Model Inference

Now that we have our data prepared and features extracted, we can use a Naive Bayes classifier to predict whether an email is spam or ham based on the subject and message.

The general form of the Naive Bayes classifier is as follows:

$$P(y \mid w_1, \dots, w_n) = \alpha P(y) \prod_{i=1}^{n} P(w_i \mid y)$$

where
- $P(y)$ is the prior probability of the class $y$ (spam or ham)
- $P(w_i \mid y)$ is the conditional probability of the word $w_i$ at position $i$ given the class $y$, with $n$ the number of words in the email
- $\alpha$ is a normalization constant to ensure that the probabilities sum to 1.

Note that most of the probabilities $P(w_i | y)$ will be very small, so repeatedly multiplying them together can lead to numerical underflow. We can avoid this by performing the calculations in logarithmic space.
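The underflow problem is easy to reproduce. The sketch below uses 1000 made-up word probabilities of $10^{-5}$ each. Their true product, $10^{-5000}$, is far below the smallest positive `float64`, while the sum of their logarithms is an ordinary number.

```python
import numpy as np

# 1000 made-up word probabilities, each equal to 1e-5
word_probs = np.full(1000, 1e-5)

direct_product = np.prod(word_probs)   # true value 1e-5000 is below the float64 range
log_sum = np.sum(np.log(word_probs))   # the same quantity in log-space

print(direct_product)  # 0.0
print(log_sum)         # about -11512.9
```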

**Task 3: Naive Bayes classifier (0.5 pt)**

Complete the `naive_bayes` function. You will need to calculate the conditional probability of each word given the class label (= how often this word appears in emails of this class) and use these to update the prior probabilities. You also need to handle words that never occur in the training data for a class: without smoothing their estimated probability would be zero, and taking its logarithm would give negative infinity. Use Laplace smoothing (a.k.a. add-one smoothing) to ensure that every word has a non-zero probability of occurring.
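As a minimal sketch of add-one smoothing, the helper below (a hypothetical `smoothed_prob`, not part of the assignment template) adds 1 to the word count and the vocabulary size to the class total. The counts are invented for illustration.

```python
def smoothed_prob(word_count, total_words, vocab_size):
    """Add-one (Laplace) estimate of P(word | class)."""
    return (word_count + 1) / (total_words + vocab_size)

# Hypothetical class with 10 word tokens in total and a 5-word vocabulary
p_seen = smoothed_prob(3, total_words=10, vocab_size=5)    # (3 + 1) / 15
p_unseen = smoothed_prob(0, total_words=10, vocab_size=5)  # (0 + 1) / 15

print(p_seen, p_unseen)
```

An unseen word now gets a small positive probability instead of zero, so its logarithm stays finite.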

In [7]:
def normalize(spam_logprob, ham_logprob):
    """
    Normalize the log-probabilities using the log-sum-exp trick.
    """
    max_logprob = max(spam_logprob, ham_logprob)
    spam_prob = np.exp(spam_logprob - max_logprob)
    ham_prob = np.exp(ham_logprob - max_logprob)
    norm = spam_prob + ham_prob
    spam_prob /= norm
    ham_prob /= norm
    return spam_prob, ham_prob

vocab_size = len(words) # number of unique words in the training data
num_of_spam_words = sum(spam_counts)  # total number of words in spam emails
num_of_ham_words = sum(ham_counts)    # total number of words in ham emails
analyzer = vectorizer.build_analyzer() # used for tokenizing the input text

def naive_bayes(message):
    # Get the individual words from the input text
    input_words = analyzer(message)

    # Initialize prior log-probabilities for spam and ham
    spam_logprob = np.log(0.5)
    ham_logprob = np.log(0.5)

    # ---------- YOUR CODE HERE ----------- #
    # 1. Loop through the words in the input message
    for word in input_words:

        # 2. Get the frequencies of the word in spam and ham
        # (Hint: check the previous cell for appropriate variables and examples)
        spam_word_count = spam_words.get(word, 0)
        ham_word_count  = ham_words.get(word, 0)
        
        # 3. Get the probabilities of the word and apply Laplace smoothing
        p_word_spam = (spam_word_count + 1) / (num_of_spam_words + vocab_size)
        p_word_ham  = (ham_word_count  + 1) / (num_of_ham_words  + vocab_size)

        # 4. Update the prior log-probabilities
        # (Hint: remember to stay in log-space and replace products with sums)
        spam_logprob += np.log(p_word_spam)
        ham_logprob  += np.log(p_word_ham)
    # ---------- YOUR CODE HERE ----------- #

    # Normalize the log-probabilities
    spam_prob, ham_prob = normalize(spam_logprob, ham_logprob)

    # Return the normalized probabilities
    return {"P_spam": spam_prob, "P_ham": ham_prob}
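To see why `normalize` subtracts the maximum before exponentiating, the standalone sketch below compares two variants on made-up log-probabilities. `normalize_stable` mirrors the shift used in `normalize` above, while `normalize_naive` is a hypothetical version without it.

```python
import numpy as np

def normalize_naive(spam_logprob, ham_logprob):
    # Exponentiate directly: both terms underflow to 0.0 for long messages
    spam_prob, ham_prob = np.exp(spam_logprob), np.exp(ham_logprob)
    with np.errstate(invalid="ignore"):  # silence the 0/0 warning
        return spam_prob / (spam_prob + ham_prob)

def normalize_stable(spam_logprob, ham_logprob):
    # Shift by the maximum so the larger term becomes exp(0) = 1
    max_logprob = max(spam_logprob, ham_logprob)
    spam_prob = np.exp(spam_logprob - max_logprob)
    ham_prob = np.exp(ham_logprob - max_logprob)
    return spam_prob / (spam_prob + ham_prob)

naive = normalize_naive(-1000.0, -1002.0)    # nan, since 0.0 / 0.0
stable = normalize_stable(-1000.0, -1002.0)  # about 0.881
print(naive, stable)
```

Only the difference between the two log-probabilities matters for the result, which is why shifting both by the same amount is safe.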

Some anecdotal testing can be done by simply coming up with a few test messages and checking the output of the model.

In [8]:
spam_message = "You are the lucky winner of a million dollars! Claim your prize by visiting our website: www.spammywebsite.com"

ambiguous_message = """Weekly update: Our Outlook subscription has been upgraded. New features include: AI-generated meeting summaries, 
agentic scheduling, and more. Check the full list by clicking here. In other news, we now have a company credit card for all employees
to use for business expenses. Click here to learn more."""

ham_message = "Hey Tom, I hope you are doing well. I have some updates on the project. Let's catch up over coffee tomorrow? -Clara"

print("Evaluating spam message:")
result1 = naive_bayes(spam_message)
print(f"P(spam): {result1['P_spam']:.3f}, P(ham): {result1['P_ham']:.3f}\n")
print("Evaluating ambiguous message:")
result2 = naive_bayes(ambiguous_message)
print(f"P(spam): {result2['P_spam']:.3f}, P(ham): {result2['P_ham']:.3f}\n")
print("Evaluating ham message:")
result3 = naive_bayes(ham_message)
print(f"P(spam): {result3['P_spam']:.3f}, P(ham): {result3['P_ham']:.3f}\n")

Evaluating spam message:
P(spam): 1.000, P(ham): 0.000

Evaluating ambiguous message:
P(spam): 0.117, P(ham): 0.883

Evaluating ham message:
P(spam): 0.000, P(ham): 1.000



### 5.4 – Model Evaluation

More rigorous evaluation of the model can be done using various metrics. Some of the most common metrics for classification tasks are:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP + TN + FP + FN}}$$
$$ \text{Precision} = \frac{\text{TP}}{\text{TP + FP}} $$
$$ \text{Recall} = \frac{\text{TP}}{\text{TP + FN}} $$

where:
- TP: True Positives (correctly predicted spam)
- TN: True Negatives (correctly predicted ham)
- FP: False Positives (ham predicted as spam)
- FN: False Negatives (spam predicted as ham)

Accuracy can be thought of as the overall correctness of the model irrespective of what types of errors it makes. Precision tells us how many of the predicted spam emails were actually spam, while recall tells us how many of the actual spam emails were correctly caught by the model.
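The formulas above can be worked through on a small made-up example with eight emails. The labels and predictions below are invented, and the variable names are chosen for this illustration only.

```python
import numpy as np

# Made-up labels and predictions (1 = spam, 0 = ham)
labels      = np.array([1, 1, 1, 0, 0, 0, 0, 0])
predictions = np.array([1, 1, 0, 1, 0, 0, 0, 0])

TP = int(np.sum((predictions == 1) & (labels == 1)))  # 2
TN = int(np.sum((predictions == 0) & (labels == 0)))  # 4
FP = int(np.sum((predictions == 1) & (labels == 0)))  # 1
FN = int(np.sum((predictions == 0) & (labels == 1)))  # 1

toy_accuracy = (TP + TN) / (TP + TN + FP + FN)   # 6 / 8 = 0.75
toy_precision = TP / (TP + FP)                   # 2 / 3
toy_recall = TP / (TP + FN)                      # 2 / 3
print(toy_accuracy, toy_precision, toy_recall)
```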

**Task 4: Evaluation metrics (0.3 pt)**

Implement the `precision` and `recall` functions. The inputs to these functions will be the predicted and true labels of the test set and will be provided based on your model's predictions.

In [9]:
def accuracy(predictions, labels):
    """
    Calculate the accuracy of predictions.
    """
    return sum(predictions == labels) / len(labels)

def precision(predictions, labels):
    TP = sum((predictions == labels) & (predictions == 1))
    # ---------- YOUR CODE HERE ----------- #
    FP = sum((predictions != labels) & (predictions == 1))  # predicted spam, actually ham
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0
    # ---------- YOUR CODE HERE ----------- #

def recall(predictions, labels):
    # ---------- YOUR CODE HERE ----------- #

    TP = sum((predictions == labels) & (predictions == 1))
    FN = sum((predictions != labels) & (predictions == 0))  # predicted ham, actually spam
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0

    # ---------- YOUR CODE HERE ----------- #

def classify(message):
    """
    Classify a message as spam (1) or ham (0) using the naive_bayes function.
    """
    result = naive_bayes(message)
    return 1 if result['P_spam'] > result['P_ham'] else 0 # return the label with the higher probability

def evaluate(predictions, labels):
    """
    Compute and print accuracy, precision, and recall.
    """
    acc = accuracy(predictions, labels)
    prec = precision(predictions, labels)
    rec = recall(predictions, labels)
    print(f"Accuracy: {acc:.3f}, precision: {prec:.3f}, recall: {rec:.3f}")

Running the `evaluate` function will print the accuracy, precision, and recall of the model.

In [10]:
predictions = data_test["Text"].apply(classify)
evaluate(predictions, data_test["Label"])

Accuracy: 0.987, precision: 0.981, recall: 0.993


As you will see in other courses, most of the operations we performed in this exercise can be done much more efficiently using libraries such as [scikit-learn](https://scikit-learn.org/stable/getting_started.html), which provide optimized implementations of various machine learning models as well as operations such as data splitting, feature extraction, and evaluation metrics.
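For comparison, a rough scikit-learn equivalent might look like the sketch below. It runs on a tiny made-up corpus rather than the Enron data, and the variable names are chosen for this illustration. Note one difference: `MultinomialNB` estimates the class priors from the training labels by default, whereas the `naive_bayes` function above uses a fixed prior of 0.5.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (1 = spam, 0 = ham); not the Enron data
train_texts = [
    "win free prize now",
    "free money claim prize",
    "meeting schedule tomorrow",
    "project schedule update meeting",
]
train_labels = [1, 1, 0, 0]
test_texts = ["claim free money", "schedule project meeting"]
test_labels = [1, 0]

# Bag-of-words counts followed by multinomial Naive Bayes; alpha=1.0 is add-one smoothing
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)
sk_predictions = model.predict(test_texts)

print(accuracy_score(test_labels, sk_predictions),
      precision_score(test_labels, sk_predictions),
      recall_score(test_labels, sk_predictions))
```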

### EXTRA: Discussion

**What would the precision and recall metrics look like if we had a very strict spam filter that classified most emails as spam? If they are in conflict, which metric do you think we should generally optimize for in spam detection? Why?**

Precision would be low: a trigger-happy filter flags many legitimate (ham) emails as spam, which drives up FP, so a large share of its "spam" predictions are wrong.

Recall would be high: since almost everything is labeled spam, nearly all actual spam emails are caught, so FN is tiny.

For spam filters, we should generally optimize for precision. The cost of a false positive (a legitimate email silently buried in the spam folder) is typically higher than the cost of a false negative (a spam email slipping through to the inbox). Missing a legitimate email is more dangerous than letting a spam email through.
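The trade-off can be illustrated by moving the decision threshold on the predicted spam probability. The sketch below uses made-up probabilities and labels, and `precision_recall` is a hypothetical helper for this example. A low threshold plays the role of the strict filter that flags almost everything.

```python
import numpy as np

# Made-up spam probabilities and true labels (1 = spam, 0 = ham)
p_spam = np.array([0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05])
labels = np.array([1,    1,    0,    1,    0,    0,    0,    0])

def precision_recall(threshold):
    # Flag an email as spam when its probability reaches the threshold
    predictions = (p_spam >= threshold).astype(int)
    TP = np.sum((predictions == 1) & (labels == 1))
    FP = np.sum((predictions == 1) & (labels == 0))
    FN = np.sum((predictions == 0) & (labels == 1))
    return TP / (TP + FP), TP / (TP + FN)

strict = precision_recall(0.1)    # flags almost everything as spam
lenient = precision_recall(0.7)   # flags only high-confidence spam
print(strict, lenient)
```

With these numbers the strict filter reaches full recall at a precision of 3/7, while the lenient one reaches full precision at a recall of 2/3.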

## Aftermath

Please provide short answers to the following questions:

**1. Did you experience any issues or find anything particularly confusing?**

No

**2. Is there anything you would like to see improved in the assignment?**

No

### Submission

1. Make sure you have completed all tasks and filled in your personal details at the top of this notebook.
2. Ensure all the code runs without errors: restart the kernel and run all cells in order.
3. Submit *only* this notebook (`ex5.ipynb`) on Moodle.