# "Naive Bayes Algorithm for Detecting Spam Messages"

> "I code a multinomial naive bayes algorithm, step-by-step, in order to identify messages as spam or non-spam."

- author: Migs Germar
- toc: true
- branch: master
- badges: true
- comments: true
- categories: [python, pandas, numpy, matplotlib, seaborn, altair]
- hide: true
- search_exclude: true
- image: images/spam-unsplash-hannes_johnson.jpg

<center><img src = "https://miguelahg.github.io/mahg-data-science/images/spam-unsplash-hannes_johnson.jpg" alt = ""></center>

<center><a href = "https://unsplash.com/photos/mRgffV3Hc6c">Unsplash | Hannes Johnson</a></center>

# Introduction

The Multinomial Naive Bayes algorithm is a machine learning algorithm based on Bayes' Theorem. The theorem gives the probability that an event $B$ occurred given that an event $A$ occurred. Because $B$ can be a class label and $A$ the observed data, the algorithm is commonly used in classification problems (Vadapalli, 2021).

In this project, we will use the algorithm to determine the probability that a message is spam given its contents. We will then use this probability to decide whether to treat new messages as spam or not. For example, if the probability of being spam is over 50%, then we may treat the message as spam.

Identifying spam is important in the Philippines because phishing campaigns went up by 200% after the pandemic began (Devanesan, 2020), and a telecommunications provider recently had to block around 71 million spam messages (Yap, 2021). Such messages may attempt to steal personal information, steal money from an account, or install malware (FTC, 2020). Thus, machine learning can be a very helpful tool in preventing such harm from occurring.

Though the algorithm can be easily implemented using existing functions such as those in the [`scikit-learn` package](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes), I will manually code the algorithm step-by-step in order to explain the mathematical intuition behind it.

> Note: I wrote this notebook by following a guided project on the [Dataquest](https://www.dataquest.io/) platform, specifically the [Guided Project: Building a Spam Filter with Naive Bayes](https://app.dataquest.io/c/74/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/1/exploring-the-dataset). The general project flow came from Dataquest. The mathematical explanations are also based on what I learned from Dataquest.

# Preparations

Below are the packages necessary for this project.

In [180]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# The Dataset

We will use the SMS Spam Collection Dataset by Almeida and Hidalgo in 2012. It can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).



In [181]:
#collapse-hide
sms_df = pd.read_csv(
    "./private/Naive-Bayes-Files/SMSSpamCollection",
    # Tab-separated
    sep = "\t",
    header = None,
    names = ["label", "sms"]
)

sms_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   5572 non-null   object
 1   sms     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


The dataset has 2 columns and 5572 rows.

- The `label` column contains "ham" if the message is legitimate, or "spam" if it is spam.
- The `sms` column contains individual SMS messages.

For example, below are the first 5 rows of the dataset.

In [182]:
#collapse-hide
sms_df.head()

| | label | sms |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


# Training and Testing Sets

The messages will be split into two sets. The training set, comprising 80% of the total data, will be used to train the Naive Bayes Algorithm. The testing set, with 20% of the total data, will be used to test the model's accuracy.

First, however, let us calculate what percentage of the messages in the dataset are spam.

In [183]:
#collapse-hide
spam_perc = sms_df["label"].eq("spam").sum() / sms_df.shape[0] * 100
print(f"Percentage of spam messages: {spam_perc:.2f}%")

Percentage of spam messages: 13.41%


Only 13% of the messages are spam. Therefore, spam and non-spam messages are not equally represented in this dataset, and this may be problematic. However, this is all the data we have, so the best we can do is to ensure that both the training and testing sets have around 13% of their messages as spam.

This is an example of *proportional stratified sampling*. We first separate the data into two strata (spam and non-spam). We then take 80% of the messages from each stratum as the training set. The remaining 20% of each stratum is set aside for the testing set.

This has been done with the code below.

In [184]:
#collapse-hide
# Note: I could have used `train_test_split` from sklearn, but I coded this manually for the sake of grasping the logic.
split_lists = {
    "training": [],
    "testing": [],
}

# Stratify the dataset
for label in "spam", "ham":
    stratum = sms_df.loc[sms_df["label"] == label]

    train_part = stratum.sample(
        # Sample 80% of the data points
        frac = 0.8,
        random_state = 1,
    )

    # The other 20% that were not sampled go to the testing set.
    test_part = stratum.loc[~stratum.index.isin(train_part.index)]

    split_lists["training"].append(train_part)
    split_lists["testing"].append(test_part)

split_dfs = pd.Series(dtype = "object")
for key in split_lists:
    # Concatenate spam and non-spam parts into one DataFrame.
    set_df = pd.concat(split_lists[key]).reset_index()
    split_dfs[key] = set_df

    perc_spam = set_df.label.eq('spam').sum() / set_df.shape[0] * 100

    print(f"Number of rows in {key} set: {set_df.shape[0]}")
    print(f"Percentage of {key} messages that are spam: {perc_spam:.2f}%")

Number of rows in training set: 4458
Percentage of training messages that are spam: 13.41%
Number of rows in testing set: 1114
Percentage of testing messages that are spam: 13.38%


We can see that the percentage of spam messages is roughly the same across the two sets. This keeps the testing set representative of the training set, so the accuracy measured later on is more meaningful.
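The same proportional stratified split can also be sketched without pandas. The snippet below is only an illustration on made-up labels. `stratified_split` is a hypothetical helper that is not part of this notebook, and it rounds the 80% cut within each stratum.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=1):
    """Return (train_indices, test_indices), sampling train_frac of each label."""
    rng = random.Random(seed)

    # Group the row positions by label, so that each label forms one stratum.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train, test = [], []
    for indices in by_label.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_frac)
        train.extend(shuffled[:cut])   # sampled part of the stratum
        test.extend(shuffled[cut:])    # the remainder of the stratum
    return sorted(train), sorted(test)

# Toy data: 10 spam and 90 ham labels (made up for illustration).
toy_labels = ["spam"] * 10 + ["ham"] * 90
train_idx, test_idx = stratified_split(toy_labels)

train_spam_share = sum(toy_labels[i] == "spam" for i in train_idx) / len(train_idx)
test_spam_share = sum(toy_labels[i] == "spam" for i in test_idx) / len(test_idx)
print(train_spam_share, test_spam_share)
```

Both shares come out equal here because each stratum is cut separately, which is the point of stratifying.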

Now, the two sets will be further split into `X` and `y`. `y` refers to the **target**, or the variable that we are trying to predict. In this case, we are trying to predict whether a message is spam or non-spam, so the "label" column is the target:

In [185]:
#collapse-hide
sms_df.label.head()

0     ham
1     ham
2    spam
3     ham
4     ham
Name: label, dtype: object

On the other hand, `X` refers to the **features**, which are information used to predict the target. We only have one feature column as of now, which is the "sms" column.

In [186]:
#collapse-hide
sms_df.sms.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: sms, dtype: object


Thus, we end up with four final objects:

- `X_train`: The messages in the training data.
- `X_test`: The messages in the testing data.
- `y_train`: The labels in the training data. These correspond to `X_train`.
- `y_test`: The labels in the testing data. These correspond to `X_test`.

In [187]:
#collapse-show
# The four objects listed above.
X_train = split_dfs.training[["sms"]].copy()
X_test = split_dfs.testing[["sms"]].copy()
y_train = split_dfs.training["label"].copy()
y_test = split_dfs.testing["label"].copy()

# The Algorithm

Now, let's discuss the Multinomial Naive Bayes algorithm. Conditional probability is necessary in order to understand it. For our use case, let $Spam$ be the event that a message is spam, and $Ham$ be the event for non-spam.

> Note: The mathematical explanations below are not my own ideas. I learned these from the Dataquest course on Naive Bayes.

## Main Formulas

We want to compare the probability that a given message is spam to the probability that it is ham. Thus, we use the following formulas:

$P(Spam|w_1, w_2, \dots , w_n) \propto P(Spam) \cdot \Pi_{i=1}^n P(w_i|Spam)$

$P(Ham|w_1, w_2, \dots , w_n) \propto P(Ham) \cdot \Pi_{i=1}^n P(w_i|Ham)$

> Note: These formulas are not the same as the Bayes Theorem. To understand how these were derived from the Bayes Theorem, see the Appendix of this post.

These two formulas are identical except for the $Spam$ or $Ham$ event. Let us just look at the first equation to unpack it.

The probability of event $B$ given that event $A$ has happened can be represented as $P(B|A)$ ("probability of B given A"). Thus, the left side of the formula, $P(Spam|w_1, w_2, \dots , w_n)$, represents the probability of spam given the contents of a message. Each variable $w_i$ represents one word in the message. For example, $w_1$ is the first word in the message, and so on.

In the middle, the "directly proportional to" sign ($\propto$) is used instead of the equals sign. The left and right sides are not equal, but one increases as the other increases.

At the right side, $P(Spam)$ simply refers to the probability that any message is spam. It can be calculated as the number of spam messages in the dataset over the total number of messages.

Finally, the formula ends with $\Pi_{i=1}^n P(w_i|Spam)$. The $P(w_i|Spam)$ part refers to the probability of a certain word occurring given that the message is known to be spam. We must calculate this probability for each word in the message. Then, because the uppercase pi ($\Pi$) refers to a product, we must multiply the word probabilities together.
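To make the main formula concrete, here is a minimal sketch that evaluates the right-hand side for a two-word message. All of the probabilities are made up for illustration and do not come from the dataset. The names `p_class` and `p_word_given_class` are also just placeholders.

```python
import math

# All probabilities below are made up purely for illustration.
p_class = {"spam": 0.25, "ham": 0.75}
p_word_given_class = {
    "spam": {"free": 0.2, "prize": 0.1},
    "ham": {"free": 0.02, "prize": 0.01},
}

message = ["free", "prize"]

scores = {}
for label in p_class:
    # P(label) times the product of P(w_i | label) over the words in the message.
    word_probs = [p_word_given_class[label][w] for w in message]
    scores[label] = p_class[label] * math.prod(word_probs)

# The predicted label is the one with the larger score.
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Even though $P(Ham)$ is larger than $P(Spam)$ in this toy example, the word probabilities pull the spam score higher, so the message would be labeled as spam.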

## Additive Smoothing and Vocabulary

In order to calculate $P(w_i|Spam)$, we need to use the following formula:

$P(w_i | Spam) = \frac{N_{w_i | Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$

We use an almost identical equation for $P(w_i|Ham)$ as well:

$P(w_i | Ham) = \frac{N_{w_i | Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$

Again, let us just unpack the first formula. $N_{w_i|Spam}$ refers to the number of times that the word appears in the dataset's spam messages.

$\alpha$ is the **additive smoothing parameter**. We will use $\alpha = 1$. This is added to the numerator to prevent it from becoming zero. If it does become zero, the entire product in the main formula will become zero.

$N_{Spam}$ refers to the total number of words in all of the spam messages. Duplicate words are not removed when this is calculated.

Lastly, $N_{Vocabulary}$ refers to the number of words in the **vocabulary**. This is the set of all *unique* words found in any of the messages, whether spam or non-spam. Duplicates are removed.
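The smoothing formula can be sketched as a small function. The counts below are toy numbers, and `smoothed_probability` is a hypothetical helper written only for this illustration.

```python
def smoothed_probability(n_word_given_label, n_label, n_vocabulary, alpha=1):
    """Additive smoothing: (N_w|label + alpha) / (N_label + alpha * N_vocabulary)."""
    return (n_word_given_label + alpha) / (n_label + alpha * n_vocabulary)

# Toy counts, made up for illustration.
n_spam = 10        # total number of words across spam messages
n_vocabulary = 5   # number of unique words across all messages

# A word that appears 3 times in spam messages.
p_seen = smoothed_probability(3, n_spam, n_vocabulary)
# A word that never appears in spam messages.
p_unseen = smoothed_probability(0, n_spam, n_vocabulary)
# The same unseen word without smoothing (alpha = 0).
p_unsmoothed = smoothed_probability(0, n_spam, n_vocabulary, alpha=0)

print(p_seen, p_unseen, p_unsmoothed)
```

Without smoothing, the unseen word gets a probability of exactly zero, which would wipe out the whole product in the main formula. With $\alpha = 1$, it gets a small but nonzero probability instead. The $\alpha \cdot N_{Vocabulary}$ term in the denominator keeps the smoothed probabilities summing to 1 across the vocabulary.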

# Implementation

Based on the theory behind the algorithm, I have written a set of steps to implement it. I will use these steps as the pseudocode for this project.

1. Determine the model **parameters**. These are the variables in the formulas shown earlier. Only the training data will be used for this.
    - Find $P(Spam), P(Ham)$.
        - Divide the number of spam messages by the total number of messages.
        - Do the same for ham messages.
    - Preprocess the messages to focus on individual words.
        - Make all words lowercase.
        - Remove punctuation marks.
    - Form a vocabulary. 
        - Make a set of all the words in the messages, without duplicates.
        - $N_{Vocabulary}$ is the number of words in this set.
    - Find $N_{Spam}, N_{Ham}$.
        - Count the number of times each word appears in each message.
        - Count the total number of words in spam messages. Do the same for ham messages.
    - Find $N_{w_i|Spam}, N_{w_i|Ham}$ for each word in the vocabulary.
        - Sum up the word counts in spam messages to get $N_{w_i|Spam}$.
        - Do the same for ham messages to get $N_{w_i|Ham}$.
1. Write a **predictive function**. This takes a new message and predicts whether it is spam or not.
    - Plug the values that we calculated previously into the equation.
    - Return $P(Spam|w_1, w_2, \dots , w_n)$, $P(Ham|w_1, w_2, \dots , w_n)$, and the prediction ("spam" or "ham").
1. **Evaluate** the model using the testing data.
    - Make predictions for all messages in the testing set.
    - Divide the number of correct predictions by the total number of predictions. This will result in the accuracy of the model.
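The steps above can be sketched end-to-end on a tiny corpus using only the standard library. This is a simplified illustration, not the implementation used in the rest of this post. The messages are made up, and `tokenize` and `classify` are hypothetical helper names.

```python
import re
from collections import Counter

# Toy training data, made up for illustration.
train_set = [
    ("spam", "WIN a free prize now!"),
    ("spam", "Free entry, claim your prize"),
    ("ham", "Are we still meeting now?"),
    ("ham", "Call me when you are free"),
]
labels = ["spam", "ham"]

def tokenize(text):
    """Lowercase, drop characters other than a-z, 0-9 and spaces, then split."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

# Step 1: parameters, using only the training data.
p_label = {
    label: sum(lab == label for lab, _ in train_set) / len(train_set)
    for label in labels
}
word_counts = {label: Counter() for label in labels}
for lab, text in train_set:
    word_counts[lab].update(tokenize(text))
vocab = set().union(*word_counts.values())
n_label = {label: sum(word_counts[label].values()) for label in labels}

# Step 2: predictive function.
def classify(text, alpha=1):
    scores = {}
    for label in labels:
        score = p_label[label]
        for w in tokenize(text):
            if w in vocab:  # words outside the vocabulary are ignored
                score *= (word_counts[label][w] + alpha) / (
                    n_label[label] + alpha * len(vocab)
                )
        scores[label] = score
    return max(scores, key=scores.get), scores

# Step 3: evaluate on (made-up) testing data.
test_set = [("spam", "claim your free prize"), ("ham", "are you still free")]
correct = sum(classify(text)[0] == lab for lab, text in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

A `Counter` per label holds $N_{w_i|Spam}$ and $N_{w_i|Ham}$ directly, and summing its values gives $N_{Spam}$ and $N_{Ham}$.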

# Model Parameters

### $P(Spam), P(Ham)$

The probability of spam is equal to the number of spam messages over the total number of messages. The same goes for ham messages.

In [188]:
#collapse-hide
p_label = {}
p_label["spam"] = y_train.eq("spam").sum() / y_train.shape[0]
p_label["ham"] = 1 - p_label["spam"]

print(f"P(Spam) = {p_label['spam'] * 100:.2f}%")
print(f"P(Ham) = {p_label['ham'] * 100:.2f}%")

P(Spam) = 13.41%
P(Ham) = 86.59%


### Message Preprocessing

Below are the messages:

In [189]:
#collapse-hide
X_train.head()

| | sms |
|---|---|
| 0 | Marvel Mobile Play the official Ultimate Spide... |
| 1 | Thank you, winner notified by sms. Good Luck! ... |
| 2 | Free msg. Sorry, a service you ordered from 81... |
| 3 | Thanks for your ringtone order, ref number R83... |
| 4 | PRIVATE! Your 2003 Account Statement for shows... |


In order to get individual words, we make all words lowercase and remove punctuation marks and other non-word characters.

In [190]:
#collapse-hide
def preprocess_messages(series):
    result = (
        series
        .str.lower()
        # Delete all non-word characters.
        .str.replace(r"[^a-z0-9 ]", "", regex = True)
        .str.strip()
        .str.split()
    )

    return result

X_train = pd.DataFrame(preprocess_messages(X_train.sms))

X_train.head()

| | sms |
|---|---|
| 0 | [marvel, mobile, play, the, official, ultimate... |
| 1 | [thank, you, winner, notified, by, sms, good, ... |
| 2 | [free, msg, sorry, a, service, you, ordered, f... |
| 3 | [thanks, for, your, ringtone, order, ref, numb... |
| 4 | [private, your, 2003, account, statement, for,... |


### Vocabulary

Using the preprocessed messages, we can form a set of all of the unique words that they contain.

In [191]:
#collapse-hide
vocab = set()
for lst in X_train.sms:
    vocab.update(lst)

# Use a Series to delete items that are blank or only contain whitespace.
vocab_series = pd.Series(list(vocab))
vocab_series = vocab_series.loc[~vocab_series.str.match(r"^\s*$")]
vocab = set(vocab_series)

n_vocab = len(vocab)

print(f"Number of words in the vocabulary: {n_vocab}\nFirst few items:")
list(vocab)[:10]

Number of words in the vocabulary: 8385
First few items:


['ptbo',
 'unsecured',
 '0808',
 'brothers',
 'invnted',
 '08702490080',
 'funny',
 'shining',
 'jason',
 'kills']

Above are the first 10 items in the vocabulary. In total, $N_{Vocabulary} = 8385$.

### $N_{Spam}, N_{Ham}$

Using the vocabulary, we can transform the messages to show the number of times that each word appears in each message.

In [192]:
#collapse-hide
vocab_lst = sorted(vocab)

word_counts = pd.DataFrame({
    w: [0] * X_train.sms.shape[0]
    for w in vocab_lst
})

# `Series.iteritems()` was removed in pandas 2.0; `items()` is the equivalent.
for index, word_lst in X_train.sms.items():
    for w in word_lst:
        if w in vocab:
            word_counts.loc[index, w] += 1

word_counts.head()

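| | 0 | 008704050406 | 0089my | 0121 | 01223585236 | 01223585334 | 0125698789 | 020603 | 0207 | 02070836089 | ... | zebra | zed | zeros | zhong | zindgi | zoe | zogtorius | zoom | zouk | zyada |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |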


Above, I have shown what the `word_counts` dataframe looks like. It is a transformed version of `X_train`: each row represents a message, each column represents a unique word in the vocabulary, and the cells show the number of times that each word appeared in each message.

Now, we can calculate $N_{Spam}, N_{Ham}$:

In [193]:
#collapse-hide
def count_n(label, word_counts = word_counts):
    n_label = (
        word_counts
        .loc[y_train == label, :]
        # Sum all of the numbers in the df.
        .sum()
        .sum()
    )
    return n_label

n_label = {}

for label in ["spam", "ham"]:
    n_label[label] = count_n(label)

print(f"Number of words in spam messages: {n_label['spam']}")
print(f"Number of words in ham messages: {n_label['ham']}")

Number of words in spam messages: 14037
Number of words in ham messages: 53977


The result is that $N_{Spam} = 14037$ and $N_{Ham} = 53977$.

### $N_{w_i|Spam}, N_{w_i|Ham}$

Finally, we can use the word counts to determine these two parameters for each word.

In [194]:
#collapse-hide
full_train = pd.concat(
    [y_train, word_counts],
    axis = 1,
)

n_word_given_label = full_train.pivot_table(
    values = vocab_lst,
    index = "label",
    aggfunc = "sum",
)

n_word_given_label

| label | 0 | 008704050406 | 0089my | 0121 | 01223585236 | 01223585334 | 0125698789 | 020603 | 0207 | 02070836089 | ... | zebra | zed | zeros | zhong | zindgi | zoe | zogtorius | zoom | zouk | zyada |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ham | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| spam | 3 | 1 | 1 | 1 | 1 | 1 | 0 | 4 | 2 | 1 | ... | 1 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |


The dataframe above is named `n_word_given_label`. For example, if we want to access $N_{w_i | Spam}$ for the word "hello", we can use `n_word_given_label.at["spam", "hello"]`. This will give us the value where the "spam" row and "hello" column intersect.

## Predictive Function

Now that all of the parameters have been found, we can write a function that will take a new message and classify it as spam or non-spam.

In [195]:
def predict(word_lst, out = "both", alpha = 1, vocab = vocab, p_label = p_label, n_label = n_label, n_word_given_label = n_word_given_label):
    """Given the list of words in a message, predict whether it is spam or ham.

    out: "both" to output both probabilities and prediction. "pred" to output only the prediction.
    """

    # Set up a Series to store results
    results = pd.Series(dtype = np.float64)

    for label in ["spam", "ham"]:
        # Use P(Spam) or P(Ham)
        final = p_label[label]

        # Iterate through words in the message.
        for w in word_lst:
            # Only include a word if it is already in the vocabulary.
            if w in vocab:
                # Calculate P(w_i | label) using the additive smoothing formula.
                # The size of the `vocab` argument is used as N_Vocabulary.
                p_word_given_label = (
                    (n_word_given_label.at[label, w] + alpha)
                    / (n_label[label] + alpha * len(vocab))
                )

                # Multiply the result into the final value.
                final *= p_word_given_label

        results[label] = final
    
    # The prediction is the label with the higher probability in the Series.
    # If the probabilities are equal, the prediction is "uncertain"
    if results["spam"] == results["ham"]:
        prediction = "uncertain"
    else:
        prediction = results.idxmax()

    if out == "both":
        return results, prediction
    elif out == "pred":
        return prediction
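
One caveat about multiplying many small probabilities: for a long enough message, the product can underflow to zero for both labels, and the function would then return "uncertain". SMS messages are short, so this is unlikely to matter here. Still, a common workaround, which is not part of this notebook, is to add logarithms instead of multiplying. Below is a minimal sketch with made-up probabilities; `log_score` is a hypothetical helper.

```python
import math

def log_score(prior, word_probs):
    """log P(label) + sum of log P(w_i | label).

    Because log is increasing, comparing log scores gives the same
    ordering as comparing the original products.
    """
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Made-up example: 400 words, each with a small conditional probability.
spam_probs = [1e-3] * 400
ham_probs = [1e-4] * 400

# The plain products underflow to 0.0 for both labels, so they cannot be compared.
product_spam = 0.5 * math.prod(spam_probs)
product_ham = 0.5 * math.prod(ham_probs)

# The log scores remain finite and comparable.
log_spam = log_score(0.5, spam_probs)
log_ham = log_score(0.5, ham_probs)

print(product_spam, product_ham)
print(log_spam > log_ham)
```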

Let us try using this function to predict whether a message is spam or ham. We will use this example: "you won a prize claim it now by sending credit card details".

In [196]:
#collapse-hide
results, prediction = predict("you won a prize claim it now by sending credit card details".split())

print("Results:")
for label, value in results.items():
    print(f"P({label} | message) is proportional to {value}")
print(f"This message is predicted to be {prediction}.")

Results:
P(spam | message) is proportional to 2.3208952599406518e-35
P(ham | message) is proportional to 1.8781562825001382e-41
This message is predicted to be spam.


The algorithm determined that $P(Spam|w_1, w_2, \dots , w_n) \propto 2.32 \cdot 10^{-35}$, whereas $P(Ham|w_1, w_2, \dots , w_n) \propto 1.88 \cdot 10^{-41}$. Since the probability for spam was higher, it predicted that the message was spam.
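The two values above are only proportional to the probabilities, so they do not add up to 1. As a small sketch, dividing each value by their sum restores the denominator that the proportional form leaves out, and gives the probabilities under the model's assumptions. The numbers are copied from the output above; the variable names are just for illustration.

```python
# Unnormalized scores copied from the output above.
score_spam = 2.3208952599406518e-35
score_ham = 1.8781562825001382e-41

# Dividing by the sum restores the shared denominator of the Bayes Theorem.
total = score_spam + score_ham
posterior_spam = score_spam / total
posterior_ham = score_ham / total

print(f"P(spam | message) = {posterior_spam:.6f}")
print(f"P(ham | message)  = {posterior_ham:.2e}")
```

This is the quantity that a rule like "treat the message as spam if the probability is over 50%" would be applied to. Comparing the two unnormalized values directly leads to the same decision.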

## Model Evaluation

The final step is to evaluate the predictive function. We will use it to predict labels for the messages in the testing set. Then, we will show the predicted labels side-by-side with the real labels.

In [200]:
#collapse-hide
# Preprocess testing messages
X_test_preprocessed = preprocess_messages(X_test.sms)

# Make predictions
y_pred = X_test_preprocessed.apply(predict, out = "pred")
y_pred.name = "prediction"

# Concatenate
full_test = pd.concat(
    [y_test, y_pred, X_test],
    axis = 1
)


full_test.head()

| | label | prediction | sms |
|---|---|---|---|
| 0 | spam | spam | England v Macedonia - dont miss the goals/team... |
| 1 | spam | spam | SMS. ac Sptv: The New Jersey Devils and the De... |
| 2 | spam | spam | Please call our customer service representativ... |
| 3 | spam | spam | URGENT! Your Mobile No. was awarded £2000 Bonu... |
| 4 | spam | spam | Sunshine Quiz Wkly Q! Win a top Sony DVD playe... |


The table above shows the first 5 rows of the testing set. We can see that the algorithm correctly predicted that the first 5 rows were spam.

We will now calculate the overall accuracy of the model by dividing the number of correct predictions by the total number of predictions.

In [203]:
#collapse-hide
acc = y_test.eq(y_pred).sum() / y_pred.shape[0] * 100

print(f"Accuracy: {acc:.2f}%")

Accuracy: 98.74%


The model turned out to have a very high accuracy of 98.74%. This suggests that it is effective at separating spam from non-spam.

However, since spam and non-spam were not equally represented in the data, with only 13% of all messages being spam, the accuracy may be misleading. For comparison, a model that labels every message as ham would already be about 87% accurate. Thus, other evaluation metrics such as precision, recall, and F1 should also be considered.
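As a minimal sketch of how these metrics could be computed, the function below counts true positives, false positives, and false negatives, treating "spam" as the positive class. The labels are made up for illustration and are not results from this model; in the notebook, `y_test` and `y_pred` would be passed in instead. `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Guard against division by zero when a count is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Made-up labels for illustration only.
toy_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
toy_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)
```

Precision is the share of predicted spam that really is spam, while recall is the share of actual spam that was caught. Note that in this sketch, any prediction other than "spam" (including "uncertain") counts as not spam.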

# Appendix

Here, I explain how the Multinomial Naive Bayes algorithm was derived from the Bayes Theorem. Given two events $A$ and $B$, we can use the theorem to determine the probability that $B$ happened given that $A$ happened. This probability is written as $P(B|A)$.

$P(B|A) = \frac{P(B) \cdot P(A|B)}{\Sigma_{i = 1}^k (P(B_i) \cdot P(A|B_i))}$

Here, $B_1, B_2, \dots , B_k$ are mutually exclusive events that cover all possibilities. In this case, $k = 2$: $B_1$ is the event that the message is non-spam, and $B_2$ is the event that it is spam. $B$ can refer to either $B_1$ or $B_2$, depending on which probability we want to calculate. Also, $A$ refers to the specific contents of one message.

In order to make things clearer, let us say that $Spam$ is the event that the message is spam, and $Ham$ is the event that the message is non-spam.

Then, let us expand event $A$ (the message itself) in order to consider the individual words inside it. For example, the first word in a message can be labeled $w_1$. If we have a total of $n$ words, then the words can be labeled as $w_1, w_2, \dots , w_n$.

Thus, we can rewrite the equation. Here is the probability of a given message being spam:

$P(Spam|w_1, w_2, \dots , w_n) = \frac{P(Spam) \cdot P(w_1, w_2, \dots , w_n|Spam)}{\Sigma_{i = 1}^2 (P(B_i) \cdot P(w_1, w_2, \dots , w_n|B_i))}$

Here is the probability of a given message being non-spam:

$P(Ham|w_1, w_2, \dots , w_n) = \frac{P(Ham) \cdot P(w_1, w_2, \dots , w_n|Ham)}{\Sigma_{i = 1}^2 (P(B_i) \cdot P(w_1, w_2, \dots , w_n|B_i))}$

Notice that the denominators are the same. Since we only want to compare these two probabilities, we can skip calculating the denominator and just calculate the numerators. We can thus rewrite the equation as follows. Note that the $\propto$ symbol is used instead of $=$ because the two quantities are not equal but directly proportional.

$P(Spam|w_1, w_2, \dots , w_n) \propto P(Spam) \cdot P(w_1, w_2, \dots , w_n|Spam)$

The first factor, $P(Spam)$, is easy to find, as it is simply the number of spam messages divided by the total number of messages. However, $P(w_1, w_2, \dots , w_n|Spam)$ needs to be further expanded.

If we make the assumption that, given the label ($Spam$ or $Ham$), the occurrence of each word is independent of the occurrence of the other words, we can use the multiplication rule. This assumption of conditional independence is what makes the algorithm "naive," as it usually doesn't hold true in reality. However, the algorithm is still useful for predictions despite this.

$P(Spam) \cdot P(w_1, w_2, \dots , w_n|Spam)$

$= P(Spam) \cdot P(w_1 \cap w_2 \cap \dots \cap w_n | Spam)$

$= P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot \dots \cdot P(w_n|Spam)$

Note that we still have to find the probability of each word given $Spam$ because we assume that the presence of each word is dependent on $Spam$.

Thus, the final formula is:

$P(Spam|w_1, w_2, \dots , w_n) \propto P(Spam) \cdot \Pi_{i=1}^n P(w_i|Spam)$

Likewise, the formula for $Ham$ is:

$P(Ham|w_1, w_2, \dots , w_n) \propto P(Ham) \cdot \Pi_{i=1}^n P(w_i|Ham)$