# Detecting SPAM emails with a Naive Bayes Classifier

The Naive Bayes Classifier is based on one of the most important results in Statistics: Bayes' Theorem. We will see how this theorem can be employed to determine whether an email is SPAM or not.

First, we need to load some important libraries.

In [291]:
# Install missing packages (run once in the notebook)
%pip install pandas

## General Libraries
import pandas as pd
import numpy as np
import re

Note: you may need to restart the kernel to use updated packages.


Now we need to load the data we will work with. This data can be downloaded from `https://www.kaggle.com/uciml/sms-spam-collection-dataset`.

In [292]:
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


## Cleaning the data

Before going any further, it is clear that our data needs some cleaning. For instance, the **unnamed columns** can be removed. Speaking of columns, some "renaming" would be desirable for the sake of clarity. Also, we would like to use a "binary variable" for categorizing the emails: 0 for **not spam** and 1 for **spam**.

In [293]:
data_clean = data.copy()  # work on a copy so the raw data stays untouched
data_clean['spam'] = data_clean['v1'].map({'ham' : 0, 'spam' : 1})
data_clean = data_clean.drop(columns=['v1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
data_clean = data_clean.rename(columns={'v2' : 'email'})
# Count duplicates based on the 'email' column
total_duplicates = data_clean.duplicated(subset='email').sum()
unique_dup_emails = data_clean[data_clean.duplicated(subset='email', keep=False)].copy()

# Keep only the first occurrence of each email
data_clean = data_clean.drop_duplicates(subset='email', keep='first').reset_index(drop=True)

print(f"Duplicates found: {total_duplicates}")
print(f"Rows after removing duplicates: {data_clean.shape[0]}")
data_clean.head()

Duplicates found: 403
Rows after removing duplicates: 5169


Unnamed: 0,email,spam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


It looks nicer, doesn't it? But this is just the beginning. At this point we need to process the emails and turn them into something that our model will "digest" much more easily. In order to do this we need some **Natural Language Processing** (NLP): "NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data," according to Wikipedia.

## Text Processing

Text preprocessing is crucial before building a proper NLP model. Here are the important steps we are going to carry out:

1. Converting words to lower case.
2. Removing special characters.
3. Removing stopwords.
4. Stemming and lemmatization.

More on steps three and four later. For now let us proceed with step number one.

### Lower case and special characters

In [294]:
data_clean['email'] = data_clean['email'].apply(lambda x : x.lower())
data_clean

Unnamed: 0,email,spam
0,"go until jurong point, crazy.. available only ...",0
1,ok lar... joking wif u oni...,0
2,free entry in 2 a wkly comp to win fa cup fina...,1
3,u dun say so early hor... u c already then say...,0
4,"nah i don't think he goes to usf, he lives aro...",0
...,...,...
5164,this is the 2nd time we have tried 2 contact u...,1
5165,will ì_ b going to esplanade fr home?,0
5166,"pity, * was in mood for that. so...any other s...",0
5167,the guy did some bitching but i acted like i'd...,0


Let us do step number two:

In [295]:
data_clean['email'] = data_clean['email'].apply(lambda x : re.sub('[^a-z0-9 ]+', ' ', x))
data_clean

Unnamed: 0,email,spam
0,go until jurong point crazy available only i...,0
1,ok lar joking wif u oni,0
2,free entry in 2 a wkly comp to win fa cup fina...,1
3,u dun say so early hor u c already then say,0
4,nah i don t think he goes to usf he lives aro...,0
...,...,...
5164,this is the 2nd time we have tried 2 contact u...,1
5165,will b going to esplanade fr home,0
5166,pity was in mood for that so any other sug...,0
5167,the guy did some bitching but i acted like i d...,0


Notice that we have assumed that it is "safe" to turn the characters of the emails into lower case letters and that special characters do not possess relevant information. This may be okay for this type of application, but for, say, sentiment analysis, we might need to reconsider this since special characters like exclamation points are used to convey certain emotions.
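To make the effect of these two steps concrete, here is a minimal sketch that applies the same lowercasing and the same regular expression to a single made-up message. The helper name `normalize` and the sample text are illustrative assumptions; they are not part of the dataset or of the notebook's pipeline.

```python
import re

def normalize(text):
    """Lowercase the text, then replace every run of characters other than
    a-z, 0-9 or space with a single space (same pattern as used above)."""
    return re.sub('[^a-z0-9 ]+', ' ', text.lower())

# Made-up message, used only to illustrate the transformation
sample = "WINNER!! Don't miss out: call 0800-123 NOW"
cleaned = normalize(sample)

print(cleaned)
print(cleaned.split())
```

Note that contractions such as "don't" are split into two pieces and that the replacement can leave consecutive spaces. The extra spaces are harmless here because the text is tokenized in a later step.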

### Stop words

At this point you might be wondering "what are stop words?" Well, these are words that are encountered very frequently in a given language but do not carry useful information, so it is good practice to remove them. Before doing this, let us take a look at the stop words of the English language:

In [296]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ricardob./nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now onto removing stop words.

In [297]:
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ricardob./nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [298]:
def remove_stop_words(message):

    words = word_tokenize(message)
    words = [word for word in words if word not in stop_words]

    return words

In [299]:
# Ensure the punkt tokenizer is available (download if necessary)
try:
	nltk.data.find('tokenizers/punkt_tab')
except LookupError:
	nltk.download('punkt_tab')

data_clean['email'] = data_clean['email'].apply(remove_stop_words)
data_clean

Unnamed: 0,email,spam
0,"[go, jurong, point, crazy, available, bugis, n...",0
1,"[ok, lar, joking, wif, u, oni]",0
2,"[free, entry, 2, wkly, comp, win, fa, cup, fin...",1
3,"[u, dun, say, early, hor, u, c, already, say]",0
4,"[nah, think, goes, usf, lives, around, though]",0
...,...,...
5164,"[2nd, time, tried, 2, contact, u, u, 750, poun...",1
5165,"[b, going, esplanade, fr, home]",0
5166,"[pity, mood, suggestions]",0
5167,"[guy, bitching, acted, like, interested, buyin...",0


Notice that apart from removing stop words we did something else; that "something else" is called **tokenization**. Tokenization is defined as splitting a text into small units known as **tokens**. We might think that this is as simple as taking a text and splitting it each time we find a space between words, but the process is more involved than that. The function `word_tokenize` is clever enough to do things such as this:

In [300]:
word_tokenize("There's something I'd like to know, dude.")

['There', "'s", 'something', 'I', "'d", 'like', 'to', 'know', ',', 'dude', '.']

### Stemming and lemmatization

It is natural that in any language we will use variations of the same word, e.g., "run", "ran", and "running". These variations are called **inflections**. Moreover, there are families of derivationally related words such as "democracy", "democratic", and "democratization". The goal of both stemming and lemmatization is to turn either inflections or derivationally related forms of a word into a common base form. For instance:

*Lemmatization:* am, are, is $\Rightarrow$ be.

*Stemming:* car, cars $\Rightarrow$ car.

Stemming is considered a crude heuristic process that chops off parts of a word by taking into account common prefixes and suffixes. On the other hand, lemmatization takes into consideration the grammar of the word and attempts to find the root word.

In [301]:
## modules for
## stemming and lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')

Porter = PorterStemmer()
Lemma = WordNetLemmatizer()

print(Porter.stem("car"))
print(Porter.stem("cars"))

print(Lemma.lemmatize("am", wordnet.VERB))
print(Lemma.lemmatize("are", wordnet.VERB))
print(Lemma.lemmatize("is", wordnet.VERB))

car
car
be
be
be


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ricardob./nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In the meantime, for this application, we will stick to *stemming*.

In [302]:
data_clean['email'] = data_clean['email'].apply(lambda x : [Porter.stem(word) for word in x])
data_clean

Unnamed: 0,email,spam
0,"[go, jurong, point, crazi, avail, bugi, n, gre...",0
1,"[ok, lar, joke, wif, u, oni]",0
2,"[free, entri, 2, wkli, comp, win, fa, cup, fin...",1
3,"[u, dun, say, earli, hor, u, c, alreadi, say]",0
4,"[nah, think, goe, usf, live, around, though]",0
...,...,...
5164,"[2nd, time, tri, 2, contact, u, u, 750, pound,...",1
5165,"[b, go, esplanad, fr, home]",0
5166,"[piti, mood, suggest]",0
5167,"[guy, bitch, act, like, interest, buy, someth,...",0


## Training and testing sets

When we are developing a model we do not use all of our data for training. Instead, we divide the data we possess into two sets: the training set and the testing set. A general rule of thumb is to use 80% of the data for training and 20% for testing our model. There are variations of this depending on the circumstances, but, in general, this is a good starting point. By the way, all the examples of our training data should be picked randomly to avoid any bias; it is not a good practice to pick these examples in a deterministic fashion.

In [303]:
train_set = data_clean.sample(frac=0.8, random_state=1337)
test_set = data_clean.drop(train_set.index)
print(train_set.shape)
print(test_set.shape)

(4135, 2)
(1034, 2)


## Bayes' Theorem

You probably remember something called "conditional probability." Let us assume we have two events A and B that might be related. Also, suppose that we know that event B has occurred; then we might ask what is the probability that event A occurs given that event B already happened. This is written in mathematical terms as $P(A|B)$. This quantity is equal to

$$
\begin{align}
P(A|B)=\frac{P(A\cap B)}{P(B)}.
\end{align}
$$

By the way, when events A and B are independent, we have that $P(A|B)=P(A)$; this means that the occurrence of B does not influence the probability of A in any way. This, in turn, implies that $P(A\cap B)=P(A)P(B)$.

On the other hand, we could also ask what is the probability that B occurs given that A happened:

$$
\begin{align}
P(B|A)=\frac{P(A\cap B)}{P(A)}.
\end{align}
$$

Then, we have that $P(A|B)P(B)=P(B|A)P(A)=P(A\cap B)$. Therefore,

$$
\begin{align}
P(A|B)=\frac{P(B|A)P(A)}{P(B)}.
\end{align}
$$

This last expression is known as **Bayes' Theorem**.

The term $P(A|B)$ is known as the *posterior probability*, the term $P(A)$ is the *prior probability*, $P(B)$ is a *marginal probability*, and $P(B|A)$ is a conditional probability that can be understood as the likelihood of A given a fixed B: $L(A|B)=P(B|A)$.

Let us see Bayes' Theorem in action. Say there is a rare disease that only one in a thousand people has. Also, assume there is a test for this disease that correctly identifies 99% of the people who have the disease and that gives a false positive for 1% of the people who do not have it. Then, if a person tests positive, what is the probability that this person has the disease?

Let us define two events: D is the event of a person having the disease, T is the event that a test gives a positive result. Then, to answer the question we just asked, we need to compute $P(D|T)$:

$$
\begin{align}
P(D|T)=\frac{P(T|D)P(D)}{P(T)}.
\end{align}
$$

To begin with, we have that $P(D)=0.001$ and $P(T|D)=0.99$. As for $P(T)$, this can be calculated as follows:

$$
\begin{align}
P(T)&=P(T|D)P(D)+P(T|\bar{D})P(\bar{D})\\
\\
&=(0.99)(0.001)+(0.01)(0.999)\\
\\
&=0.01098.
\end{align}
$$

Therefore,

$$
\begin{align}
P(D|T)=\frac{(0.99)(0.001)}{0.01098}=0.09016...
\end{align}
$$

It is worth considering the following situation: our hypothetical person realizes that the probability of having the disease is not that high, so he/she goes to another lab and takes the test again. If the result is, once again, positive, what is the probability that the person has the disease?

In this case, our prior probability $P(D)$ is no longer 0.001 but 0.09016. Thus, we have to update both the posterior probability $P(D|T)$ and the marginal probability $P(T)$:

$$
\begin{align}
P(D|T)&=\frac{P(T|D)P(D)}{P(T)}\\
\\
&=\frac{P(T|D)P(D)}{P(T|D)P(D)+P(T|\bar{D})P(\bar{D})}\\
\\
&=\frac{(0.99)(0.09016)}{(0.99)(0.09016)+(0.01)(0.90984)}\\
\\
&=0.9075.
\end{align}
$$

As we can see, our hypothetical character should be worried now.

By the way, the latter example was taken from https://www.youtube.com/watch?v=R13BD8qKeTg.
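The two updates above can be reproduced with a few lines of code. This is a minimal sketch under the numbers used in the example (prevalence 0.001, $P(T|D)=0.99$, $P(T|\bar{D})=0.01$); the function name `bayes_posterior` and its parameter names are illustrative assumptions.

```python
def bayes_posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return P(D|T) from the prior P(D) and the two conditional
    probabilities of a positive test, using Bayes' Theorem."""
    # Marginal probability of a positive test, P(T)
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / evidence

# First positive test: the prior is the prevalence of the disease
first = bayes_posterior(0.001, 0.99, 0.01)

# Second positive test: the previous posterior becomes the new prior
second = bayes_posterior(first, 0.99, 0.01)

print(f"After one positive test:  {first:.5f}")
print(f"After two positive tests: {second:.4f}")
```

Feeding the first posterior back in as the prior is what "updating" means here. It treats the two test results as independent given the person's true condition.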

## Naive Bayes Classifier

Let us talk about emails now. Let $W$ be the set of all English words and let an email $m$ be a set of words that belong to $W$: $m=\{w_1,w_2,\dots,w_n\}$. If we want to know what is the probability that said email $m$ is spam we can use, as expected, Bayes' Theorem:

$$
\begin{align}
P(spam|m)&=\frac{P(m|spam)P(spam)}{P(m)}\\
\\
&=\frac{P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)}{P(w_1\cap w_2\cap\cdots\cap w_n)}\\
\\
&=\frac{P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)}{P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)+P(w_1\cap w_2\cap\cdots\cap w_n|not~spam)P(not~spam)}.
\end{align}
$$

At this point it is a good idea to focus our attention on the numerator of the last expression. Notice that we have $P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)$, which is equal to the joint probability $P(w_1\cap w_2\cap\cdots\cap w_n\cap spam)$. By the multiplication rule, this expression can be rewritten as follows:

$$
\begin{align}
P(w_1\cap w_2\cap\cdots\cap w_n\cap spam) = P(spam)P(w_1|spam)P(w_2|w_1\cap spam)\cdots P(w_n|\cap_{i=1}^{n-1}w_i\cap spam).
\end{align}
$$

And here comes the "naive assumption": we assume that all features of the model, in this case the words of the email, are **mutually conditionally independent** given the spam category:

$$
\begin{align}
P(w_i|w_{1}\cap\cdots\cap w_{i-1}\cap spam) = P(w_i|spam).
\end{align}
$$

What this expression is telling us is that the probability of having word $w_i$ in a spam message is not affected by the presence of the set of words $\{w_{1},\dots,w_{i-1}\}$ in said message; all we need to consider is that the email is spam. Consider the sentence "we need your info" and assume that we know we are dealing with an email that is spam. Then, if the naive assumption is true, the following would hold:

$$
\begin{align}
P(\text{need}|\text{we}\cap\text{your}\cap\text{info}\cap spam) = P(\text{need}|spam).
\end{align}
$$

However, this is not usually true; what we have, in general, is this:

$$
\begin{align}
P(\text{need}|\text{we}\cap\text{your}\cap\text{info}\cap spam) \neq P(\text{need}|spam).
\end{align}
$$

For this reason we say that this assumption is naive. Nevertheless, in practice, this classifier works very well in many situations.

Let us go back to the numerator. Taking into account our naive premise, the joint probability distribution can be expressed as

$$
\begin{align}
P(w_1\cap w_2\cap\cdots\cap w_n\cap spam) = P(spam)P(w_1|spam)P(w_2|spam)\cdots P(w_n|spam).
\end{align}
$$

Therefore, the probability that a given message $m=\{w_1,w_2,\dots,w_n\}$ is spam can be computed with this expression:

$$
\begin{align}
P(spam|w_1\cap w_2\cap\cdots\cap w_n) = \frac{P(w_1|spam)P(w_2|spam)\cdots P(w_n|spam)P(spam)}{P(w_1\cap w_2\cap\cdots\cap w_n)}.
\end{align}
$$

You might be asking, well, how can we classify an email as spam with all this? There are two options: the **Probabilistic Model** and the **Maximum A Posteriori Model (MAP)**.

#### Probabilistic Model

Given a threshold $p$, we classify an email as spam if this condition holds:

$$
\begin{align}
P(spam|w_1\cap w_2\cap\cdots\cap w_n) > p.
\end{align}
$$

#### Maximum A Posteriori Model (MAP)

An email is categorized as spam if

$$
\begin{align}
P(spam|w_1\cap w_2\cap\cdots\cap w_n) > P(not~spam|w_1\cap w_2\cap\cdots\cap w_n),
\end{align}
$$

which is equivalent to

$$
\begin{align}
P(w_1|spam)P(w_2|spam)\cdots P(w_n|spam)P(spam) > P(w_1|not~spam)P(w_2|not~spam)\cdots P(w_n|not~spam)P(not~spam).
\end{align}
$$

Notice that it is not necessary to calculate $P(w_1\cap w_2\cap\cdots\cap w_n)$. For classifying emails we will employ this method.
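Both decision rules can be sketched on a toy example. The word probabilities and class priors below are made-up numbers chosen only for illustration, and the function names are hypothetical. The scores are accumulated as sums of logarithms, which is equivalent to comparing the products in the inequality above.

```python
import math

# Hypothetical conditional probabilities P(w|class); values are made up
p_word_given_spam = {"free": 0.05, "win": 0.04, "meeting": 0.001}
p_word_given_ham = {"free": 0.005, "win": 0.002, "meeting": 0.03}
p_spam, p_ham = 0.13, 0.87  # hypothetical class priors

def log_scores(words):
    """Return log(P(spam) * prod P(w|spam)) and the same for not spam."""
    spam = math.log(p_spam) + sum(math.log(p_word_given_spam[w]) for w in words)
    ham = math.log(p_ham) + sum(math.log(p_word_given_ham[w]) for w in words)
    return spam, ham

def classify_map(words):
    """MAP rule: spam if the spam score exceeds the not-spam score."""
    spam, ham = log_scores(words)
    return spam > ham

def classify_threshold(words, p=0.5):
    """Probabilistic rule: spam if P(spam|words) > p."""
    spam, ham = log_scores(words)
    posterior = 1 / (1 + math.exp(ham - spam))  # normalizes the two scores
    return posterior > p

print(classify_map(["free", "win"]))            # MAP decision
print(classify_map(["meeting"]))
print(classify_threshold(["free", "win"], 0.99))  # stricter threshold
```

With two classes, the probabilistic rule with $p=0.5$ makes the same decision as the MAP rule. Raising $p$ makes the classifier more conservative about labeling an email as spam.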


## Training the Model

Let $W_{\text{t}}$ be the set of emails in the training set. As expected, $W_{\text{t}}=W_{\text{t-}\bar{\text{s}}}\cup W_{\text{t-s}}$ and $W_{\text{t-}\bar{\text{s}}}\cap W_{\text{t-s}}=\emptyset$, where $W_{\text{t-}\bar{\text{s}}}$ and $W_{\text{t-s}}$ are the subsets of the training set that contain the non-spam and spam emails, respectively. In the training phase we need to compute the following probabilities for the training set:

$$
\begin{align}
P(w_i|spam), & ~\forall w_i\text{ appearing in the emails of }W_{\text{t-s}}\\
\\
P(w_i|not~spam), & ~\forall w_i\text{ appearing in the emails of }W_{\text{t-}\bar{\text{s}}}.
\end{align}
$$

Notice that

$$
\begin{align}
P(w_i|spam)=\frac{\text{number of occurrences of $w_i$ in spam emails}}{\text{total number of words of spam emails}}.
\end{align}
$$

Similarly,

$$
\begin{align}
P(w_i|not~spam)=\frac{\text{number of occurrences of $w_i$ in non-spam emails}}{\text{total number of words of non-spam emails}}.
\end{align}
$$

Also, we need to calculate $P(spam)$ and $P(not~spam)$ as the fractions of training emails that belong to each class:

$$
\begin{align}
P(spam)&=\frac{|W_{\text{t-s}}|}{|W_{\text{t}}|}\\
\\
P(not~spam)&=\frac{|W_{\text{t-}\bar{\text{s}}}|}{|W_{\text{t}}|}.
\end{align}
$$

By the way, this way of computing the probabilities is based on the **Bag of Words** model, in which we are interested in the frequencies of each of the words of a corpus without taking grammar or word order into consideration.

This is not the only model at our disposal; another popular option is the **Term Frequency-Inverse Document Frequency (TF-IDF)** model, which is based on information theory. For now, we will focus on the bag-of-words approach, but if you want to know more, this is a good starting point: https://en.wikipedia.org/wiki/Tf–idf.
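The counting formulas above can be sketched on a toy corpus. The tiny corpus and the helper name `word_probabilities` are made up for illustration. The `alpha` term is additive (Laplace) smoothing, which the notebook's `probability_words` function also applies so that a word never seen in a class does not get probability zero.

```python
from collections import Counter

# Toy tokenized emails (made up); each inner list is one processed email
spam_docs = [["free", "prize", "free"], ["win", "prize"]]
ham_docs = [["meet", "lunch"], ["lunch", "free"]]

# Vocabulary: every distinct word seen in either class
vocab = {word for doc in spam_docs + ham_docs for word in doc}

def word_probabilities(docs, vocab, alpha=1):
    """Estimate P(w|class) as
    (count(w) + alpha) / (total_words + alpha * len(vocab))."""
    counts = Counter(word for doc in docs for word in doc)
    total_words = sum(counts.values())
    denominator = total_words + alpha * len(vocab)
    return {word: (counts[word] + alpha) / denominator for word in vocab}

p_w_spam = word_probabilities(spam_docs, vocab)
p_w_ham = word_probabilities(ham_docs, vocab)
p_w_spam_raw = word_probabilities(spam_docs, vocab, alpha=0)

print(p_w_spam["free"], p_w_spam["meet"])  # smoothed estimates
print(p_w_spam_raw["meet"])                # unsmoothed: unseen word gets 0
```

Setting `alpha=0` recovers the plain frequency ratio from the formulas above. With `alpha=1` the estimates still sum to one over the vocabulary.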

In [304]:
p_spam = train_set[train_set['spam'] == 1].shape[0] / train_set.shape[0]
p_spam

0.12841596130592503

In [305]:
p_not_spam = train_set[train_set['spam'] == 0].shape[0] / train_set.shape[0]
p_not_spam

0.871584038694075

In [306]:
def bag_of_words(corpus):

    """
    This function receives a corpus, i.e., the set of processed emails, and
    returns a dictionary in which each item is a unique word and each word
    has its corresponding number of occurrences in the corpus.
    """
    bag_of_words = {}

    for message in corpus:
        for word in message:
            if word in bag_of_words:
                bag_of_words[word] += 1
            else:
                bag_of_words[word] = 1

    return bag_of_words

In [307]:
def probability_words(df, vocab_size=None, alpha=1):

    """
    This function receives a dataframe of either spam emails or non-spam emails
    that has been processed as shown above. Using the dictionary that is returned
    by the previous function and the data contained in df, this function computes
    the probability of each word in bag_of_words. 
    
    WITH LAPLACE SMOOTHING: Adds alpha (default=1) to each count to avoid zero 
    probabilities for unseen words.
    Formula: P(w|class) = (count(w) + alpha) / (total_words + alpha * vocab_size)
    
    Parameters:
    -----------
    df : DataFrame
        Contains 'email' column with processed emails
    vocab_size : int
        Total unique words in vocabulary (if None, uses only words in this set)
    alpha : float (default=1)
        Laplace smoothing parameter
    """

    probability_words = {}
    
    # Get the bag of words from the email column
    bow = bag_of_words(df['email'])
    
    # Calculate total number of words
    total_words = sum(bow.values())
    
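    # If vocab_size is provided, apply Laplace Smoothing over the full vocabulary.
    # Note: this branch reads the global `vocab_all` (built in the next cell),
    # so it must be defined before calling this function with vocab_size.
    if vocab_size is not None:
        for word in vocab_all:
            count = bow.get(word, 0)
            probability_words[word] = (count + alpha) / (total_words + alpha * vocab_size)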
    else:
        # Original behavior: only words in this class
        for word, count in bow.items():
            probability_words[word] = count / total_words

    return probability_words

In [308]:
import math

# Build the complete vocabulary from both spam and non-spam emails
spam_emails = train_set[train_set['spam'] == 1]
non_spam_emails = train_set[train_set['spam'] == 0]

spam_bow = bag_of_words(spam_emails['email'])
non_spam_bow = bag_of_words(non_spam_emails['email'])

# Combine vocabularies
vocab_all = set(spam_bow.keys()) | set(non_spam_bow.keys())
vocab_size = len(vocab_all)

print(f"Vocabulary size: {vocab_size} unique words")
print(f"Words in spam: {len(spam_bow)}")
print(f"Words in non-spam: {len(non_spam_bow)}")

# Calculate probabilities WITH Laplace Smoothing
probability_spam_words = probability_words(spam_emails, vocab_size=vocab_size, alpha=1)
probability_non_spam_words = probability_words(non_spam_emails, vocab_size=vocab_size, alpha=1)

print("\nProbabilities computed with Laplace Smoothing ✓")

Vocabulary size: 6324 unique words
Words in spam: 2284
Words in non-spam: 4868

Probabilities computed with Laplace Smoothing ✓


In [309]:
def classify_email(email, threshold=0.5):

    """
    Improved classifier using:
    1. Laplace Smoothing (no zero probabilities)
    2. Log-probabilities (avoids numerical underflow)
    3. Adjustable threshold for better spam detection
    
    Parameters:
    -----------
    email : list
        List of processed words from an email
    threshold : float (default=0.5)
        Decision threshold. Email is spam if P(spam|email) > threshold
    
    Returns:
    --------
    int : 1 if spam, 0 if not spam
    """
    
    # Handle empty emails
    if not email or len(email) == 0:
        return 0

    # Calculate log probability of spam
    log_prob_spam = math.log(p_spam)
    for word in email:
        if word in probability_spam_words:
            log_prob_spam += math.log(probability_spam_words[word])
    
    # Calculate log probability of non-spam
    log_prob_non_spam = math.log(p_not_spam)
    for word in email:
        if word in probability_non_spam_words:
            log_prob_non_spam += math.log(probability_non_spam_words[word])
    
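    # Calculate posterior probability P(spam|email) using Bayes' theorem:
    # P(spam|email) = 1 / (1 + exp(log_prob_non_spam - log_prob_spam))
    log_odds = log_prob_spam - log_prob_non_spam
    # Evaluate the sigmoid in the form that keeps the exponent non-positive,
    # so math.exp cannot raise OverflowError when the log-odds are very large
    if log_odds >= 0:
        prob_spam_posterior = 1 / (1 + math.exp(-log_odds))
    else:
        exp_odds = math.exp(log_odds)
        prob_spam_posterior = exp_odds / (1 + exp_odds)

    return 1 if prob_spam_posterior > threshold else 0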
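The classifier above adds logarithms instead of multiplying probabilities. Here is a minimal sketch of why, using made-up numbers: one hundred word probabilities of $10^{-4}$ each, which is an illustrative assumption rather than a value taken from the data.

```python
import math

# One hundred hypothetical word probabilities, each equal to 1e-4
word_probs = [1e-4] * 100

# Direct product: the true value (1e-400) is below the smallest positive
# float, so the running product collapses to exactly 0.0
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same quantity on a log scale, well within float range
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Once both class scores underflow to zero they can no longer be compared, whereas the two log scores remain distinct and can be subtracted to obtain the log-odds.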

In [310]:
test_set_hat = test_set.copy()

# Test with different thresholds to find the best one
thresholds_to_test = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
results_by_threshold = {}

print("Experimenting with different thresholds:\n")
print(f"{'Threshold':<12} {'Accuracy':<12} {'Precision':<12} {'Recall':<12} {'F1 Score':<12}")
print("-" * 60)

# Note: performance_metrics is defined in the "Evaluating the model" section
# below; run that cell before this one.
for threshold in thresholds_to_test:
    try:
        test_set_temp = test_set.copy()
        test_set_temp['prediction'] = test_set['email'].apply(
            lambda x: classify_email(x, threshold=threshold)
        )

        _, metrics_temp = performance_metrics(test_set_temp)
        results_by_threshold[threshold] = metrics_temp

        acc = metrics_temp.loc['Accuracy', 'Metrics']
        prec = metrics_temp.loc['Precission', 'Metrics']
        rec = metrics_temp.loc['Recall', 'Metrics']
        f1 = metrics_temp.loc['F1 Score', 'Metrics']

        print(f"{threshold:<12.1f} {acc:<12.4f} {prec:<12.4f} {rec:<12.4f} {f1:<12.4f}")
    except Exception as e:
        print(f"{threshold:<12.1f} ERROR: {str(e)}")

# Find best threshold based on F1 Score
if results_by_threshold:
    best_threshold = max(results_by_threshold.keys(),
                          key=lambda t: results_by_threshold[t].loc['F1 Score', 'Metrics'])

    print(f"\n{'='*60}")
    print(f"BEST THRESHOLD: {best_threshold}")
    print(f"{'='*60}\n")

    # Apply best threshold
    test_set_hat['prediction'] = test_set['email'].apply(
        lambda x: classify_email(x, threshold=best_threshold)
    )
else:
    print("Error: the thresholds could not be evaluated")
    best_threshold = 0.5
    test_set_hat['prediction'] = test_set['email'].apply(
        lambda x: classify_email(x, threshold=0.5)
    )

print("Results with the improved model:")

Experimenting with different thresholds:

Threshold    Accuracy     Precision    Recall       F1 Score    
------------------------------------------------------------
0.1          0.9458       0.6964       0.9590       0.8069      
0.2          0.9700       0.8227       0.9508       0.8821      
0.3          0.9749       0.8529       0.9508       0.8992      
0.4          0.9807       0.8923       0.9508       0.9206      
0.5          0.9855       0.9280       0.9508       0.9393      
0.6          0.9874       0.9580       0.9344       0.9461      
0.7          0.9884       0.9741       0.9262       0.9496      
0.8          0.9884       0.9741       0.9262       0.9496      
0.9          0.9874       0.9823       0.9098       0.9447      

BEST THRESHOLD: 0.7

Results with the improved model:


## Evaluating the model

So we have built the Naive Bayes Classifier and we have trained it, but is it good? To know how good our model is we need **evaluation metrics**. There are tons of metrics, and the ideal metric, or metrics, will have to be chosen depending on what is important for your particular application. For now, we will mention a few of the most common ones. Before going any further, however, we need to say a few things about the **confusion matrix**.

#### Confusion Matrix

A confusion matrix is a table that allows us to visualize the performance of a classification algorithm.

<img src="confusion.png" alt="Drawing" style="width: 700px;"/>

This type of table receives this name because it lets us observe whether an algorithm is mislabeling two classes (Image taken from https://en.wikipedia.org/wiki/Precision_and_recall).

#### Accuracy

Accuracy is defined as follows:

$$
\text{Accuracy}=\frac{\text{true positives} + \text{true negatives}}{\text{true positives} + \text{false positives} +  \text{true negatives} + \text{false negatives}}.
$$

This metric is useful when both classes are equally important and when we have a balanced set, which is not quite the case in this application.

#### Precision

The ratio of positive cases that were correctly labeled over all the examples that were classified as positive is called **precision**:

$$
\text{Precision}=\frac{\text{true positives}}{\text{true positives} + \text{false positives}}.
$$

When we are interested in reducing the number of false positives and we have imbalanced sets, precision is a good choice as an evaluation metric. In fact, for this application, this metric is appropriate since we are interested in detecting spam emails: spam is the positive category, so if a regular email is classified as spam (false positive), we are sending emails that are important to us to the spam folder. If, however, a spam email is labeled as not-spam, said email will end up in our inbox, which is not as serious as missing an email that we are expecting. Also, keep in mind that our sets are imbalanced: the majority of the emails in our data are not spam.

#### Recall

Recall is the ratio of the examples that were correctly identified as positive cases over all the actual positive examples in our data. This metric can be understood as the sensitivity of our model:

$$
\text{Recall}=\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}.
$$

If we want to pay special attention to the false negatives that our model produces, and if our sets are imbalanced, then this can be one of our performance metrics. Say we want to build a model that detects a dangerous disease. In this case, we want to avoid telling a person that he/she does not have the disease when that is not the case (false negative).

#### F1 Score

The F1 score is equal to the harmonic mean of precision and recall. It is useful when we want to have a balance between precision and recall and when we do not have balanced sets (large number of actual negatives). It is defined as

$$
\text{F1 Score}=2\frac{\text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}.
$$

In [311]:
def performance_metrics(results):

    positives = results[['spam', 'prediction']][results['spam'] == 1]
    negatives = results[['spam', 'prediction']][results['spam'] == 0]

    true_negatives = negatives[negatives['spam'] == negatives['prediction']].shape[0]
    false_positives = negatives[negatives['spam'] != negatives['prediction']].shape[0]
    true_positives = positives[positives['spam'] == positives['prediction']].shape[0]
    false_negatives = positives[positives['spam'] != positives['prediction']].shape[0]

    confusion_matrix = {'actual positives' : [true_positives, false_negatives],
                        'actual negatives' : [false_positives, true_negatives]}

    confusion_matrix_df = pd.DataFrame.from_dict(confusion_matrix, orient='index',
                                                 columns=['predicted positives', 'predicted negatives'])

    accuracy = (true_positives + true_negatives) / (true_positives + false_positives +  true_negatives + false_negatives)
    precission = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precission * recall) / (precission + recall)

    metrics = {'Accuracy' : accuracy, 'Precission' : precission, 'Recall' : recall, 'F1 Score' : f1_score}

    metrics_df = pd.DataFrame.from_dict(metrics, orient='index', columns=['Metrics'])

    return confusion_matrix_df, metrics_df

In [312]:
confusion_matrix, metrics = performance_metrics(test_set_hat)
print("\nConfusion Matrix (Improved Model):")
print(confusion_matrix)
print("\nMetrics (Improved Model):")
print(metrics)
print(f"\nLast threshold used: {best_threshold}")


Confusion Matrix (Improved Model):
                  predicted positives  predicted negatives
actual positives                  113                    9
actual negatives                    3                  909

Metrics (Improved Model):
            Metrics
Accuracy     0.9884
Precission   0.9741
Recall       0.9262
F1 Score     0.9496

Last threshold used: 0.7


As we can see, our model has good precision (0.97), but its recall is lower (0.93): 9 of the 122 spam emails in the test set were labeled as not spam. Although this is not a serious issue for this type of application, it suggests that, to increase the sensitivity of our model, we should collect more examples of spam emails, try different strategies such as n-grams or TF-IDF, or both.
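As a quick sanity check, the reported metrics can be recomputed by hand from the confusion matrix printed above. This is only an illustrative sketch: the short variable names (`tp`, `fn`, `fp`, `tn`) are introduced here, and the counts are copied from that output.

```python
# Counts read off the confusion matrix printed above
tp, fn = 113, 9    # actual spam: predicted spam / predicted not spam
fp, tn = 3, 909    # actual not spam: predicted spam / predicted not spam

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9884 precision=0.9741 recall=0.9262 f1=0.9496
```

The recall of 113/122 makes the gap explicit: 9 spam messages slipped through, while only 3 legitimate messages were flagged.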

## Generating new messages

It turns out that we can use the conditional distributions learned in the training phase to generate either spam or non-spam messages. To create a spam email, we can sample words from this distribution:

$$P(w|\text{spam}).$$

Notice that said distribution is stored in `probability_spam_words`.

In the next cell, use the `np.random.choice` function and the `join` method to create a spam message.



In [313]:
# Generate a spam email with a length of 20 words
spam_words = list(probability_spam_words.keys())
spam_probabilities = list(probability_spam_words.values())
spam_email = ' '.join(np.random.choice(spam_words, size=20, p=spam_probabilities))
print("Generated Spam Email:")
print(spam_email)

Generated Spam Email:
hdd get enufcredeit prize payoh doesnt back sari mobil u abl claim purpos txt ymca 2 08718730555 350 termsappli box434sk38wp150ppm18


In [314]:
# Generate a non spam email composed of 20 random words
non_spam_words = list(probability_non_spam_words.keys())
non_spam_probabilities = list(probability_non_spam_words.values())
non_spam_email = ' '.join(np.random.choice(non_spam_words, size=20, p=non_spam_probabilities))
print("Generated Non-Spam Email:")
print(non_spam_email)

Generated Non-Spam Email:
teas ibiza apo today decid 2mrw oh thk fr keypad question use happenin gt along merememberin wait soft happi doinat


The messages that you get should not make much sense, since we followed the "naive" approach (each word is drawn independently of the others) and stemmed the words. Nevertheless, they should give you an idea of the type of words you can find in these two types of emails.
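The sampling step used in the two cells above can be sketched on a toy distribution. The words and weights below are made up for illustration and are not taken from `probability_spam_words`. The renormalization line is a precaution, on the assumption that smoothed or rounded probabilities may not sum exactly to 1.

```python
import numpy as np

# Hypothetical toy distribution P(w | spam); words and weights are made up
toy_probabilities = {"free": 0.4, "prize": 0.3, "call": 0.2, "txt": 0.1}

words = list(toy_probabilities.keys())
weights = np.array(list(toy_probabilities.values()), dtype=float)

# np.random.choice raises a ValueError when p does not sum to 1,
# so renormalize in case smoothing or rounding left a small gap
weights = weights / weights.sum()

# Draw 5 words independently (with replacement) and join them into a message
message = " ".join(np.random.choice(words, size=5, p=weights))
print(message)
```

Because every word is drawn independently, frequent words such as "free" tend to repeat, which is the same effect seen in the generated messages.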

## N-grams Strategy: Implementing Bi-grams

To improve the recall of our model, we can use **n-grams** instead of just individual words (unigrams). An **n-gram** is a sequence of n consecutive words. For example:

- **Unigrams** (1-gram): "free", "money", "click"
- **Bi-grams** (2-gram): "free money", "click here", "limited time"
- **Trigrams** (3-gram): "free money now", "click here today"

Bi-grams encode the order of adjacent words, so they can pick up common spam patterns such as "limited time", "free money", or "click here" that individual words miss. This contextual information can help improve both precision and recall.

We will now implement bi-grams alongside the existing unigrams to create a more powerful feature set for our classifier.
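The definition above can be sketched for any n. The helper name `ngrams` is hypothetical and only illustrates the idea. Unlike the notebook's own function, it joins tokens with a space and returns only the n-grams of the requested size.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["free", "money", "now"]
print(ngrams(tokens, 1))  # ['free', 'money', 'now']
print(ngrams(tokens, 2))  # ['free money', 'money now']
print(ngrams(tokens, 3))  # ['free money now']
```

A message with fewer than n tokens simply produces an empty list.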

In [315]:
def generate_bigrams(words):
    """
    Generate bi-grams from a list of words.
    
    Parameters:
    -----------
    words : list
        List of stemmed words from an email
    
    Returns:
    --------
    list : Combined list of unigrams and bi-grams
           e.g., ["free", "money", "time", "free_money", "money_time"]
    """
    bigrams = []
    for i in range(len(words) - 1):
        bigrams.append(f"{words[i]}_{words[i+1]}")
    
    # Return unigrams + bigrams combined
    return words + bigrams


def filter_bigrams_by_frequency(data_with_bigrams, min_frequency=3):
    """
    Filter bi-grams that appear with low frequency from a dataset.
    
    Parameters:
    -----------
    data_with_bigrams : DataFrame
        DataFrame with 'email' column containing unigrams + bigrams
    min_frequency : int (default=3)
        Minimum frequency threshold for keeping a bi-gram
    
    Returns:
    --------
    tuple : (filtered_bigrams_set, original_count, filtered_count, removed_bigrams)
    """
    from collections import Counter
    
    # Count all bi-grams in the dataset. A feature is treated as a bi-gram when it
    # contains '_', which assumes that no unigram contains an underscore.
    all_bigrams = [word for email in data_with_bigrams['email'] for word in email if '_' in word]
    bigram_counter = Counter(all_bigrams)
    
    # Keep the bi-grams that reach the minimum frequency
    filtered_bigrams = {bigram for bigram, count in bigram_counter.items() if count >= min_frequency}
    
    original_count = len(bigram_counter)
    filtered_count = len(filtered_bigrams)
    # Bi-grams below the threshold, with their counts, kept for inspection
    removed_bigrams = {bigram: count for bigram, count in bigram_counter.items() if count < min_frequency}
    
    return filtered_bigrams, original_count, filtered_count, removed_bigrams


# Apply bi-gram transformation to all emails.
# Run this cell only once: re-running it would build bi-grams on top of the existing bi-grams.
data_clean['email'] = data_clean['email'].apply(generate_bigrams)
print("Bi-grams generated successfully!")
print("\nExample email with unigrams + bigrams:")
print(data_clean['email'].iloc[0][:15])  # First 15 features; unigrams come first, bi-grams are appended after them

Bi-grams generated successfully!

Example email with unigrams + bigrams:
['go', 'jurong', 'point', 'crazi', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amor']
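To see what a minimum-frequency filter of this kind does, here is a minimal sketch on a toy corpus. The emails and the helper name `frequent_bigrams` are made up for illustration. It mirrors the idea behind `filter_bigrams_by_frequency` rather than reproducing it.

```python
from collections import Counter

# Hypothetical toy corpus: each email is a list of unigrams followed by its bi-grams
toy_emails = [
    ["free", "prize", "free_prize"],
    ["free", "prize", "now", "free_prize", "prize_now"],
    ["call", "now", "call_now"],
]

def frequent_bigrams(emails, min_frequency=2):
    """Return the set of bi-grams ('_' features) seen at least min_frequency times."""
    counts = Counter(f for email in emails for f in email if "_" in f)
    return {bigram for bigram, count in counts.items() if count >= min_frequency}

print(frequent_bigrams(toy_emails))  # {'free_prize'}
```

Only `free_prize` appears twice, so the two bi-grams seen once are dropped. Rare bi-grams like these are the ones most likely to reflect noise rather than a reusable pattern.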


In [316]:
## Re-split the data with bi-grams
train_set_bigrams = data_clean.sample(frac=0.8, random_state=1337)
test_set_bigrams = data_clean.drop(train_set_bigrams.index)

print("Training set shape (with bi-grams):", train_set_bigrams.shape)
print("Test set shape (with bi-grams):", test_set_bigrams.shape)

Training set shape (with bi-grams): (4135, 2)
Test set shape (with bi-grams): (1034, 2)


## Training the Model with Bi-grams

Now we'll train a new Naive Bayes classifier using the combined unigrams and bi-grams features. The process is the same as before, but now each email will have more features (both individual words and word pairs).
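The next cell calls `probability_words` with `alpha=1`, that is, with Laplace smoothing. As a reminder of what an estimate of that kind looks like, here is a minimal sketch of the usual add-alpha formula, P(w | class) = (count(w, class) + alpha) / (N_class + alpha * |V|). The helper `smoothed_probability` and the toy counts are hypothetical; this is not the notebook's implementation of `probability_words`.

```python
from collections import Counter

def smoothed_probability(feature, class_counts, vocab_size, alpha=1):
    """Add-alpha estimate of P(feature | class); unseen features get a small nonzero value."""
    total = sum(class_counts.values())
    return (class_counts.get(feature, 0) + alpha) / (total + alpha * vocab_size)

# Hypothetical feature counts for the spam class (unigrams and one bi-gram)
spam_counts = Counter({"free": 3, "prize": 2, "free_prize": 1})
vocab_size = 5  # assumed size of the combined spam + non-spam vocabulary

print(smoothed_probability("free", spam_counts, vocab_size))   # (3 + 1) / (6 + 5)
print(smoothed_probability("hello", spam_counts, vocab_size))  # (0 + 1) / (6 + 5)
```

Note that the vocabulary size appears in the denominator, which is why shrinking the vocabulary by filtering rare bi-grams also changes every smoothed probability.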

In [317]:
# Build the complete vocabulary from both spam and non-spam emails (with bi-grams)
spam_emails_bigrams = train_set_bigrams[train_set_bigrams['spam'] == 1]
non_spam_emails_bigrams = train_set_bigrams[train_set_bigrams['spam'] == 0]

spam_bow_bigrams = bag_of_words(spam_emails_bigrams['email'])
non_spam_bow_bigrams = bag_of_words(non_spam_emails_bigrams['email'])

# Apply frequency-based filtering to bi-grams
print("FILTERING BI-GRAMS BY MINIMUM FREQUENCY")
print("="*70)

# Compare different minimum frequency thresholds
min_freq_thresholds = [2, 3, 5, 10]
best_min_freq = 3

for min_freq in min_freq_thresholds:
    filtered_bigrams_spam, orig_spam, filt_spam, _ = filter_bigrams_by_frequency(
        spam_emails_bigrams, min_frequency=min_freq
    )
    filtered_bigrams_non_spam, orig_non_spam, filt_non_spam, _ = filter_bigrams_by_frequency(
        non_spam_emails_bigrams, min_frequency=min_freq
    )

    combined_filtered = filtered_bigrams_spam | filtered_bigrams_non_spam
    total_removed = (orig_spam - filt_spam) + (orig_non_spam - filt_non_spam)

    print(f"\nMin Frequency = {min_freq}:")
    print(f"  Spam: {orig_spam} → {filt_spam} bi-grams")
    print(f"  Non-spam: {orig_non_spam} → {filt_non_spam} bi-grams")
    print(f"  Removed (spam + non-spam): {total_removed}")
    print(f"  Total after filtering: {len(combined_filtered)} unique bi-grams")

print("\n" + "="*70)

# Use min_frequency = 3 as default (good balance)
filtered_bigrams_set, orig_all, filt_all, removed_all = filter_bigrams_by_frequency(
    train_set_bigrams, min_frequency=best_min_freq
)

print(f"\nUSING MIN_FREQUENCY = {best_min_freq}:")
print(f"  Original bi-grams: {orig_all}")
print(f"  Bi-grams after filtering: {filt_all}")
print(f"  Reduction: {orig_all - filt_all} ({100*(orig_all - filt_all)/orig_all:.1f}%)")

# Combine vocabularies (unigrams + bi-grams)
vocab_all_bigrams = set(spam_bow_bigrams.keys()) | set(non_spam_bow_bigrams.keys())

# Keep every unigram, plus only the bi-grams that meet the frequency threshold
vocab_all_bigrams = {word for word in vocab_all_bigrams if '_' not in word} | filtered_bigrams_set

vocab_size_bigrams = len(vocab_all_bigrams)

print(f"\nVocabulary size (with filtered bi-grams): {vocab_size_bigrams} unique features")
print(f"Features in spam: {len(spam_bow_bigrams)}")
print(f"Features in non-spam: {len(non_spam_bow_bigrams)}")
print("\nComparison:")
print(f"  Previous vocabulary (unigrams only): {vocab_size}")
print(f"  Vocabulary with bi-grams (unfiltered): {33554}")  # Hard-coded from a previous run
print(f"  Vocabulary with bi-grams (FILTERED): {vocab_size_bigrams}")
print(f"  Reduction due to filtering: {33554 - vocab_size_bigrams} features")

# Calculate probabilities WITH Laplace Smoothing (using filtered bi-grams)
probability_spam_words_bigrams = probability_words(spam_emails_bigrams, vocab_size=vocab_size_bigrams, alpha=1)
probability_non_spam_words_bigrams = probability_words(non_spam_emails_bigrams, vocab_size=vocab_size_bigrams, alpha=1)

# Drop the probabilities of bi-grams that were filtered out: keep every unigram,
# but only the bi-grams that passed the frequency filter
probability_spam_words_bigrams = {word: prob for word, prob in probability_spam_words_bigrams.items()
                                  if '_' not in word or word in filtered_bigrams_set}
probability_non_spam_words_bigrams = {word: prob for word, prob in probability_non_spam_words_bigrams.items()
                                      if '_' not in word or word in filtered_bigrams_set}

print("\nProbabilities computed with Laplace Smoothing (filtered bi-grams) ✓")

FILTERING BI-GRAMS BY MINIMUM FREQUENCY

Min Frequency = 2:
  Spam: 5681 → 1456 bi-grams
  Non-spam: 21107 → 2124 bi-grams
  Removed (spam + non-spam): 23208
  Total after filtering: 3543 unique bi-grams

Min Frequency = 3:
  Spam: 5681 → 628 bi-grams
  Non-spam: 21107 → 681 bi-grams
  Removed (spam + non-spam): 25479
  Total after filtering: 1292 unique bi-grams

Min Frequency = 5:
  Spam: 5681 → 240 bi-grams
  Non-spam: 21107 → 208 bi-grams
  Removed (spam + non-spam): 26340
  Total after filtering: 442 unique bi-grams

Min Frequency = 10:
  Spam: 5681 → 51 bi-grams
  Non-spam: 21107 → 52 bi-grams
  Removed (spam + non-spam): 26685
  Total after filtering: 103 unique bi-grams


USING MIN_FREQUENCY = 3:
  Original bi-grams: 26585
  Bi-grams after filtering: 1335
  Reduction: 25250 (95.0%)

Vocabulary size (with filtered bi-grams): 7659 unique features
Features in spam: 7965
Features in non-spam: 25975

Comparison:
  Previous vocabulary (unigrams only): 6324
  Vocabulary with bi-g

In [318]:
import math  # needed for math.log and math.exp below

def classify_email_bigrams(email, threshold=0.5):
    """
    Classifier using bi-grams with:
    1. Laplace Smoothing (no zero probabilities)
    2. Log-probabilities (avoids numerical underflow)
    3. Adjustable threshold for better spam detection
    
    Parameters:
    -----------
    email : list
        List of features (unigrams and bi-grams) from an email
    threshold : float (default=0.5)
        Decision threshold. Email is spam if P(spam|email) > threshold
    
    Returns:
    --------
    int : 1 if spam, 0 if not spam
    """
    
    # Handle empty emails
    if not email or len(email) == 0:
        return 0

    # Calculate log probability of spam
    log_prob_spam = math.log(p_spam)
    for feature in email:
        if feature in probability_spam_words_bigrams:
            log_prob_spam += math.log(probability_spam_words_bigrams[feature])
    
    # Calculate log probability of non-spam
    log_prob_non_spam = math.log(p_not_spam)
    for feature in email:
        if feature in probability_non_spam_words_bigrams:
            log_prob_non_spam += math.log(probability_non_spam_words_bigrams[feature])
    
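    # Calculate posterior probability P(spam|email) using Bayes' theorem.
    # The sigmoid is evaluated in a numerically stable way: math.exp(-log_odds)
    # alone would raise OverflowError for very negative log-odds.
    log_odds = log_prob_spam - log_prob_non_spam
    if log_odds >= 0:
        prob_spam_posterior = 1 / (1 + math.exp(-log_odds))
    else:
        exp_log_odds = math.exp(log_odds)
        prob_spam_posterior = exp_log_odds / (1 + exp_log_odds)
    
    return 1 if prob_spam_posterior > threshold else 0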
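
The posterior used by the classifier is the logistic function of the log-odds, so comparing it with a threshold is equivalent to comparing the log-odds with the logit of that threshold. The sketch below illustrates this with made-up log-odds values. The helper names `posterior_from_log_odds` and `is_spam` are hypothetical and not part of the notebook.

```python
import math

def posterior_from_log_odds(log_odds):
    """Map the spam vs. non-spam log-odds to P(spam|email) without overflow."""
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    exp_log_odds = math.exp(log_odds)
    return exp_log_odds / (1 + exp_log_odds)

def is_spam(log_odds, threshold=0.5):
    """Same decision, taken directly in log-odds space."""
    return log_odds > math.log(threshold / (1 - threshold))

# Made-up log-odds, including an extreme value that a naive sigmoid could not handle
for log_odds in (-800.0, -1.0, 0.5, 3.0):
    p = posterior_from_log_odds(log_odds)
    assert (p > 0.7) == is_spam(log_odds, threshold=0.7)
    print(f"log-odds={log_odds:7.1f}  P(spam|email)={p:.4f}  spam at 0.7? {p > 0.7}")
```

Raising the threshold from 0.5 to 0.7 therefore just shifts the log-odds cut-off from 0 to log(0.7 / 0.3), which is roughly 0.85.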

In [319]:
test_set_bigrams_hat = test_set_bigrams.copy()

# Test with different thresholds to find the best one (using bi-grams)
thresholds_to_test = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
results_by_threshold_bigrams = {}

print("Trying different thresholds (BI-GRAMS):\n")
print(f"{'Threshold':<12} {'Accuracy':<12} {'Precision':<12} {'Recall':<12} {'F1 Score':<12}")
print("-" * 60)

for threshold in thresholds_to_test:
    try:
        test_set_temp = test_set_bigrams.copy()
        test_set_temp['prediction'] = test_set_bigrams['email'].apply(
            lambda x: classify_email_bigrams(x, threshold=threshold)
        )
        
        _, metrics_temp = performance_metrics(test_set_temp)
        results_by_threshold_bigrams[threshold] = metrics_temp
        
        acc = metrics_temp.loc['Accuracy', 'Metrics']
        prec = metrics_temp.loc['Precission', 'Metrics']
        rec = metrics_temp.loc['Recall', 'Metrics']
        f1 = metrics_temp.loc['F1 Score', 'Metrics']
        
        print(f"{threshold:<12.1f} {acc:<12.4f} {prec:<12.4f} {rec:<12.4f} {f1:<12.4f}")
    except Exception as e:
        print(f"{threshold:<12.1f} ERROR: {str(e)}")

# Find best threshold based on F1 Score
if results_by_threshold_bigrams:
    best_threshold_bigrams = max(results_by_threshold_bigrams.keys(), 
                                  key=lambda t: results_by_threshold_bigrams[t].loc['F1 Score', 'Metrics'])
    
    print(f"\n{'='*60}")
    print(f"BEST THRESHOLD (BI-GRAMS): {best_threshold_bigrams}")
    print(f"{'='*60}\n")
    
    # Apply best threshold
    test_set_bigrams_hat['prediction'] = test_set_bigrams['email'].apply(
        lambda x: classify_email_bigrams(x, threshold=best_threshold_bigrams)
    )
else:
    # Fall back to the default threshold if no configuration could be evaluated
    print("Error: the thresholds could not be evaluated")
    best_threshold_bigrams = 0.5
    test_set_bigrams_hat['prediction'] = test_set_bigrams['email'].apply(
        lambda x: classify_email_bigrams(x, threshold=0.5)
    )

Trying different thresholds (BI-GRAMS):

Threshold    Accuracy     Precision    Recall       F1 Score    
------------------------------------------------------------
0.1          0.9410       0.6763       0.9590       0.7932      
0.2          0.9671       0.8014       0.9590       0.8731      
0.3          0.9720       0.8345       0.9508       0.8889      
0.4          0.9778       0.8722       0.9508       0.9098      
0.5          0.9826       0.9062       0.9508       0.9280      
0.6          0.9874       0.9431       0.9508       0.9469      
0.7          0.9894       0.9664       0.9426       0.9544      
0.8          0.9884       0.9741       0.9262       0.9496      
0.9          0.9894       0.9826       0.9262       0.9536      

BEST THRESHOLD (BI-GRAMS): 0.7


In [320]:
## Evaluate the Bi-gram Model

confusion_matrix_bigrams, metrics_bigrams = performance_metrics(test_set_bigrams_hat)
print("Confusion Matrix (BI-GRAM Model):")
print(confusion_matrix_bigrams)
print("\nMetrics (BI-GRAM Model):")
print(metrics_bigrams)
print(f"\nLast threshold used: {best_threshold_bigrams}")

Confusion Matrix (BI-GRAM Model):
                  predicted positives  predicted negatives
actual positives                  115                    7
actual negatives                    4                  908

Metrics (BI-GRAM Model):
            Metrics
Accuracy     0.9894
Precission   0.9664
Recall       0.9426
F1 Score     0.9544

Last threshold used: 0.7


In [321]:
## Comparison: Unigrams vs Unigrams + Bigrams

print("COMPARISON: ORIGINAL MODEL (UNIGRAMS) vs IMPROVED MODEL (UNIGRAMS + BI-GRAMS)")
print("="*80)

# Get metrics from both models
original_metrics = metrics
bigram_metrics = metrics_bigrams

comparison_data = {
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Original Model (Unigrams)': [
        original_metrics.loc['Accuracy', 'Metrics'],
        original_metrics.loc['Precission', 'Metrics'],
        original_metrics.loc['Recall', 'Metrics'],
        original_metrics.loc['F1 Score', 'Metrics']
    ],
    'Improved Model (Unigrams + Bi-grams)': [
        bigram_metrics.loc['Accuracy', 'Metrics'],
        bigram_metrics.loc['Precission', 'Metrics'],
        bigram_metrics.loc['Recall', 'Metrics'],
        bigram_metrics.loc['F1 Score', 'Metrics']
    ]
}

comparison_df = pd.DataFrame(comparison_data)
# Absolute and relative difference between the two models (negative values mean the metric got worse)
comparison_df['Improvement'] = comparison_df['Improved Model (Unigrams + Bi-grams)'] - comparison_df['Original Model (Unigrams)']
comparison_df['% Change'] = (comparison_df['Improvement'] / comparison_df['Original Model (Unigrams)'] * 100).round(2)

print(comparison_df.to_string(index=False))

print("\n" + "="*80)
print("Analysis:")
print(f"  - Vocabulary size: {vocab_size} → {vocab_size_bigrams} (+{vocab_size_bigrams - vocab_size} features)")
print(f"  - Original threshold: {best_threshold} → Bi-gram threshold: {best_threshold_bigrams}")
print("="*80)

COMPARISON: ORIGINAL MODEL (UNIGRAMS) vs IMPROVED MODEL (UNIGRAMS + BI-GRAMS)
   Metric  Original Model (Unigrams)  Improved Model (Unigrams + Bi-grams)  Improvement  % Change
 Accuracy                     0.9884                                0.9894       0.0010    0.1000
Precision                     0.9741                                0.9664      -0.0078   -0.8000
   Recall                     0.9262                                0.9426       0.0164    1.7700
 F1 Score                     0.9496                                0.9544       0.0048    0.5000

Analysis:
  - Vocabulary size: 6324 → 7659 (+1335 features)
  - Original threshold: 0.7 → Bi-gram threshold: 0.7


In [322]:
## Bi-gram Analysis: Most Important Features (with frequency filtering)

from collections import Counter

# Find the most common bi-grams in spam and non-spam emails (only those that passed the filter)
spam_bigrams_filtered = [word for email in spam_emails_bigrams['email'] for word in email 
                         if '_' in word and word in filtered_bigrams_set]
non_spam_bigrams_filtered = [word for email in non_spam_emails_bigrams['email'] for word in email 
                             if '_' in word and word in filtered_bigrams_set]

# Count frequencies
spam_bigram_freq_filtered = Counter(spam_bigrams_filtered)
non_spam_bigram_freq_filtered = Counter(non_spam_bigrams_filtered)

print("="*70)
print("BI-GRAM ANALYSIS AFTER FILTERING (min_freq = 3)")
print("="*70)
print()

print("TOP 10 MOST COMMON BI-GRAMS IN SPAM (after filtering):")
print("-" * 70)
for bigram, count in spam_bigram_freq_filtered.most_common(10):
    print(f"  {bigram}: {count} occurrences")

print("\n\nTOP 10 MOST COMMON BI-GRAMS IN NON-SPAM (after filtering):")
print("-" * 70)
for bigram, count in non_spam_bigram_freq_filtered.most_common(10):
    print(f"  {bigram}: {count} occurrences")

print("\n\nBi-grams with the HIGHEST PROBABILITY IN SPAM (filtered):")
print("-" * 70)
spam_bigram_probs_filtered = {word: prob for word, prob in probability_spam_words_bigrams.items() 
                               if '_' in word and word in filtered_bigrams_set}
top_spam_bigrams_filtered = sorted(spam_bigram_probs_filtered.items(), key=lambda x: x[1], reverse=True)[:10]
for bigram, prob in top_spam_bigrams_filtered:
    print(f"  {bigram}: P(bi-gram|spam) = {prob:.6f}")

print("\n\nBi-grams with the HIGHEST PROBABILITY IN NON-SPAM (filtered):")
print("-" * 70)
non_spam_bigram_probs_filtered = {word: prob for word, prob in probability_non_spam_words_bigrams.items() 
                                   if '_' in word and word in filtered_bigrams_set}
top_non_spam_bigrams_filtered = sorted(non_spam_bigram_probs_filtered.items(), key=lambda x: x[1], reverse=True)[:10]
for bigram, prob in top_non_spam_bigrams_filtered:
    print(f"  {bigram}: P(bi-gram|non-spam) = {prob:.6f}")

print("\n" + "="*70)
print("IMPACT OF FILTERING")
print("="*70)

# Count only bi-grams ('_' features): the bag-of-words dictionaries also contain unigrams
spam_bigrams_before = sum(1 for word in spam_bow_bigrams.keys() if '_' in word)
non_spam_bigrams_before = sum(1 for word in non_spam_bow_bigrams.keys() if '_' in word)
spam_bigrams_after = len(spam_bigram_freq_filtered)
non_spam_bigrams_after = len(non_spam_bigram_freq_filtered)

print(f"Total bi-grams in SPAM (before): {spam_bigrams_before}")
print(f"Bi-grams in SPAM after filtering: {spam_bigrams_after}")
print(f"Reduction: {spam_bigrams_before - spam_bigrams_after} ({100*(spam_bigrams_before - spam_bigrams_after)/spam_bigrams_before:.1f}%)")
print()
print(f"Total bi-grams in NON-SPAM (before): {non_spam_bigrams_before}")
print(f"Bi-grams in NON-SPAM after filtering: {non_spam_bigrams_after}")
print(f"Reduction: {non_spam_bigrams_before - non_spam_bigrams_after} ({100*(non_spam_bigrams_before - non_spam_bigrams_after)/non_spam_bigrams_before:.1f}%)")

BI-GRAM ANALYSIS AFTER FILTERING (min_freq = 3)

TOP 10 MOST COMMON BI-GRAMS IN SPAM (after filtering):
----------------------------------------------------------------------
  co_uk: 35 occurrences
  pleas_call: 34 occurrences
  contact_u: 24 occurrences
  1_50: 22 occurrences
  tri_contact: 20 occurrences
  po_box: 19 occurrences
  custom_servic: 16 occurrences
  await_collect: 16 occurrences
  prize_guarante: 16 occurrences
  guarante_call: 16 occurrences


TOP 10 MOST COMMON BI-GRAMS IN NON-SPAM (after filtering):
----------------------------------------------------------------------
  lt_gt: 184 occurrences
  gon_na: 41 occurrences
  take_care: 30 occurrences
  r_u: 29 occurrences
  let_know: 28 occurrences
  wan_na: 24 occurrences
  wan_2: 24 occurrences
  u_r: 24 occurrences
  good_morn: 23 occurrences
  k_k: 22 occurrences


Bi-grams with the HIGHEST PROBABILITY IN SPAM (filtered):
----------------------------------------------------------------------


Bi-gram