# 42186 Model-based machine learning
- Matteo Piccagnoni s232713
- Gabriel Lanaro s233541
- Manuel Lovo s...

# Topic-aware SPAM message classification in Bayesian setup

## 1. Introduction


Spam messages are annoying, and sometimes dangerous. Classifying them correctly is important, but we often ignore one key aspect: how confident are we in the predictions? 

In this project, we take a more thoughtful approach to spam detection by combining two powerful tools: **topic modeling** and **Bayesian inference**. 

First, we use **Latent Dirichlet Allocation (LDA)** to discover the hidden topics inside SMS messages, which gives us a better understanding of what the messages are about. 
Then, instead of using a standard classifier, we go full **Bayesian** with a **logistic regression model** that doesn’t just make a prediction but also tells us how uncertain that prediction is. 

Everything is built using Pyro, a probabilistic programming library, which makes it easy to define the model and run inference using both SVI and MCMC. 

This notebook walks through the whole process step by step: from data cleaning to topic discovery to classification and uncertainty analysis. By the end, we’ll have not just a working spam filter, but one that knows when it’s unsure.

In [None]:
import pandas as pd
import numpy as np
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import torch
import pickle
import pyro

## 2. Dataset and Preprocessing

### 2.1 Dataset

In this project, the **SMS Spam Collection Dataset** has been used. It is a publicly available corpus hosted by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection) which contains 5,574 English SMS messages, each labeled as either "ham" (legitimate) or "spam" (unwanted/unsolicited).

The messages were collected from a variety of sources:
- Legitimate (ham) messages were gathered from public forums, SMS chat services, and volunteer contributors.
- Spam messages were obtained from known spam databases and online archives of promotional SMS campaigns.

Each row in the dataset consists of two fields:
- `label`: A string indicating whether the message is "ham" or "spam".
- `message`: The actual content of the SMS text, written in natural language (English).

A few example rows:

| label | message |
|-------|---------|
| ham   | Are you coming to the party later? |
| spam  | You’ve won a £1000 cash prize! Text WIN to 80086 to claim now. |


The dataset is realistic and includes a broad range of message types, from casual conversations full of slang and abbreviations to marketing promos and scams that mimic legitimate offers. This makes it ideal for studying both semantic patterns (via topic modeling) and predictive classification (spam vs. ham). The class distribution is slightly imbalanced, with around 13% spam and 87% ham, which reflects real-world conditions.

Overall, this dataset offers a compact but rich playground for experimenting with natural language processing, especially when modeling uncertainty and interpretability, as we do in this Bayesian setup.
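The class imbalance mentioned above is straightforward to check once the labels are available. The helper below is a minimal sketch on a hypothetical toy label list; `class_proportions` and `toy_labels` are illustrative names that do not appear in the original notebook.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label as a dict {label: fraction}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy labels, used only to illustrate the computation
toy_labels = ["ham"] * 7 + ["spam"] * 1
proportions = class_proportions(toy_labels)
print(proportions)
```

Applied to the `label` column of the real dataset, the same function should report roughly the 87% / 13% split described above.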

In [18]:
df_sms = pd.read_csv(
    "SMSSpamCollection", sep="\t", header=None, names=["label", "message"]
)
df_sms.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [19]:
df = df_sms.copy()

**TODO:** add a few visualizations from `preprocessing.ipynb` to give a better overview of the dataset.

### 2.2 Preprocessing

Before we can use the text messages in our models, we need to clean and prepare them. Raw SMS messages are often messy: they contain typos, slang, special characters, and filler words that can confuse a model. 
In this section, we’ll go through standard preprocessing steps like **lowercasing**, **removing stopwords**, **tokenizing**, and **stemming**. These transformations help reduce noise and bring the text into a more consistent format, which is especially important for tasks like topic modeling and classification. 

We’ll explain each step as we apply it in the code.

The following setup downloads the list of common English stopwords (like "the", "is", "and") from the NLTK library and initializes the stopword set and a Porter stemmer. These components will be used later to remove common words that carry little semantic meaning and to reduce words to their base form, respectively.

In [20]:
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/matteopiccagnoni/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The following function performs standard text preprocessing to prepare SMS messages for further analysis. It includes lowercasing, removal of URLs, numbers, and punctuation, followed by tokenization, stopword removal, and stemming. 
The result is a list of cleaned and normalized tokens for each message.

In [21]:
# Text Cleaning Function
def clean_message(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)  # Remove URLs
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = text.translate(
        str.maketrans("", "", string.punctuation)
    )  # Remove punctuation
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [stemmer.stem(word) for word in tokens]
    return tokens
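
To see what the regex-based steps do in isolation, the sketch below mirrors the lowercasing, URL, digit, and punctuation steps on a made-up message. It deliberately leaves out stopword removal and stemming so that it runs without any NLTK downloads, and it escapes the dot in `www\.` so that only a literal dot is matched. `basic_clean` and `sample` are hypothetical names introduced for illustration.

```python
import re
import string


def basic_clean(text):
    """Lowercase, strip URLs, digits and punctuation, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"\d+", "", text)  # drop digit sequences
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


# Made-up spam-like message, not taken from the dataset
sample = "WIN a FREE prize!! Call 08001234567 or visit http://example.com/win now."
tokens = basic_clean(sample)
print(tokens)
```

The phone number and the URL disappear entirely, which is worth keeping in mind: the presence of numbers or links can itself be a spam signal, and this cleaning step discards it.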

The cleaned tokenization function is applied to each message in the dataset using the apply method. A new column, "tokens", is created to store the resulting list of preprocessed tokens for each SMS message.

In [22]:
df["tokens"] = df["message"].apply(clean_message)
df[["message", "tokens"]].head()

|   | message | tokens |
|---|---------|--------|
| 0 | Go until jurong point, crazy.. Available only ... | [go, jurong, point, crazi, avail, bugi, n, gre... |
| 1 | Ok lar... Joking wif u oni... | [ok, lar, joke, wif, u, oni] |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entri, wkli, comp, win, fa, cup, final,... |
| 3 | U dun say so early hor... U c already then say... | [u, dun, say, earli, hor, u, c, alreadi, say] |
| 4 | Nah I don't think he goes to usf, he lives aro... | [nah, dont, think, goe, usf, live, around, tho... |


The output above gives a quick preview of the structure and content of the DataFrame. At this stage, it includes the raw message text and the newly added `tokens` column, which contains the list of cleaned and preprocessed words extracted from each message. Going from the raw message to the tokenized version, URLs, numbers, punctuation, and common stopwords were removed, and the remaining words were reduced to their stemmed forms.

>Please notice that in the tokenized output, some words appear in their stemmed form (for example, "crazy" becomes "crazi"). This is a result of the Porter stemming algorithm, which reduces words to their morphological root to group similar terms together. While the resulting stems are not always real words, they help the model treat related terms (e.g., "crazy", "craziness") as the same feature.
It is also worth noting that abbreviations and slang (e.g., "u", "ur", "msg") were intentionally left unchanged. Although these do not follow standard grammar, they often carry important contextual signals in SMS communication. Removing or expanding them could obscure patterns that distinguish spam from ham in this domain.
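
The effect of stemming can be checked directly on a handful of words. This is a small sketch that uses only NLTK's `PorterStemmer` (no corpus download is needed for the stemmer itself); the word list is our own choice for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Related words should collapse to the same stem; very short slang is left as-is
words = ["crazy", "craziness", "u"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```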

In this step, the list of tokens for each message is joined back into a single string to prepare the input for CountVectorizer, which requires text input in string format. The CountVectorizer transforms the preprocessed messages into a bag-of-words (BoW) matrix, where each row represents a message and each column corresponds to a word in the vocabulary, with entries indicating the word count per message.
This BoW representation is a standard format for text modeling and is especially useful for Latent Dirichlet Allocation (LDA), which operates on document-word frequency data. The resulting matrix is converted into a PyTorch tensor to be compatible with Pyro, which is used for building and training the probabilistic topic model.

In [None]:
# Join tokens back into strings (CountVectorizer expects text input)
texts_str = df["tokens"].apply(lambda tokens: " ".join(tokens))

# Create bag-of-words matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts_str)

# Convert to a dense torch tensor
X_tensor = torch.tensor(X.toarray(), dtype=torch.float)

print("Input for LDA:", X_tensor.shape)  # should be (num_docs, vocab_size)

Input for LDA: torch.Size([5572, 7099])


The final print statement above shows the shape of the tensor, confirming that the data is now structured as (number of documents, vocabulary size), the input format expected by the LDA model we define in Pyro in Section 3.

The example below illustrates the structure of the calculated data, providing a clearer understanding and overview of its format and content.

In [25]:
# Print first cleaned message as tokens
print("\nSample tokens:", df["tokens"].iloc[0])

# Print the BoW vector for the first message (as vector index → count)
print("Sample BoW row:", X_tensor[0].nonzero(as_tuple=True)[0].tolist())
print("Word counts:", X_tensor[0][X_tensor[0] > 0].tolist())

# Print the actual words from the vectorizer
vocab = vectorizer.get_feature_names_out()
print("Words in BoW row:", [vocab[i] for i in X_tensor[0].nonzero(as_tuple=True)[0]])


Sample tokens: ['go', 'jurong', 'point', 'crazi', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amor', 'wat']
Sample BoW row: [212, 404, 790, 792, 1063, 1280, 2363, 2416, 2453, 3126, 3265, 4621, 6689, 6899]
Word counts: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Words in BoW row: ['amor', 'avail', 'buffet', 'bugi', 'cine', 'crazi', 'go', 'got', 'great', 'jurong', 'la', 'point', 'wat', 'world']


This code saves the DataFrame with tokens, the fitted vectorizer, and the bag-of-words matrix. These components will be reused in the LDA model to ensure consistency and reproducibility in the topic modeling process.

In [None]:
# save the DataFrame with tokens
df.to_csv("lda_df.csv", index=False)

# save the vectorizer
with open("lda_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# save bow matrix as numpy array
X_array = X.toarray().astype(np.float32)
np.savez_compressed("BoW_X_Array.npz", X_array)

print("Files saved:")
print("- lda_df.csv")
print("- lda_vectorizer.pkl")
print("- BoW_X_Array.npz")
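
One detail of the saving step is worth illustrating: an array passed positionally to `np.savez_compressed` is stored under the default key `arr_0`, which is the key needed to read it back. The round trip below is a minimal sketch on a made-up matrix, written to a temporary directory so that no project files are touched.

```python
import os
import tempfile

import numpy as np

# Made-up stand-in for the bag-of-words matrix
toy_matrix = np.array([[1, 0, 2], [0, 3, 0]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_bow.npz")
    np.savez_compressed(path, toy_matrix)  # positional array -> key "arr_0"
    with np.load(path) as archive:
        keys = list(archive.keys())
        restored = archive["arr_0"]

print(keys)
print(np.array_equal(toy_matrix, restored))
```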

## 3. LDA

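Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of documents. It assumes that each document is a mixture of latent topics and that each topic is a probability distribution over the vocabulary. Fitting the model to the bag-of-words matrix from Section 2 yields a vector of topic proportions for every SMS message, which can then serve as input to the classifier.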
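
For reference, the generative process assumed by LDA can be written compactly. The notation is ours rather than the notebook's: $K$ topics, $D$ documents, $\beta_k$ the word distribution of topic $k$, $\theta_d$ the topic proportions of document $d$, and $N_d$ the number of tokens in document $d$.

$$
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{aligned}
$$

Summing out the topic assignments $z_{d,n}$ gives $p(w_{d,n} = v \mid \theta_d, \beta) = \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,v}$, so the count vector of a document follows $x_d \sim \mathrm{Multinomial}(N_d,\ \theta_d^\top \beta)$. The Pyro models in this section appear to use this collapsed form, with symmetric priors $\alpha = \eta = 1$ (the `torch.ones(...)` concentration vectors) and the product `doc_topics @ topic_words` playing the role of $\theta_d^\top \beta$.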

In [None]:
import torch
import numpy as np
import pandas as pd
import pickle
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDelta

### 3.1 Topic Number Selection (K)

Since the number of topics *K* in **Latent Dirichlet Allocation (LDA)** must be specified in advance, we ran the Pyro-based LDA model across multiple candidate values (e.g. K = 5, 10, 15, ..., 90) and monitored the **Evidence Lower Bound (ELBO)** loss during training. 

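For each value of *K*, we trained the model for a smaller, fixed number of steps (500) and recorded the final loss. The model with the lowest ELBO loss was selected as the best configuration. Note that with an `AutoDelta` guide the optimization yields a MAP point estimate, so the reported loss is the negative log joint density at that estimate rather than a bound computed from a full variational posterior. We therefore treat this comparison as a practical first pass for balancing model complexity and fit without manual inspection of every candidate.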

This code was run on Kaggle Notebooks to take advantage of the GPU and speed up training.

In [None]:
# Load Data
# X_array = np.load("/kaggle/input/bow-xarray/BoW_X_Array.npz")["arr_0"] # Uncomment this line if using Kaggle
X_array = np.load("BoW_X_Array.npz")["arr_0"]
X_tensor = torch.tensor(X_array, dtype=torch.float)
print("XArray loaded")

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

X_tensor = X_tensor.to(device)

num_docs, vocab_size = X_tensor.shape
K_values = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
results = []

for K in K_values:
    print(f"\n--- Training LDA with K = {K} ---")

    def lda_model(data):
        with pyro.plate("topics", K):
            topic_words = pyro.sample(
                "topic_words", dist.Dirichlet(torch.ones(vocab_size).to(device))
            )
        with pyro.plate("documents", num_docs):
            doc_topics = pyro.sample(
                "doc_topics", dist.Dirichlet(torch.ones(K).to(device))
            )
            # Word distribution of each document: mixture of the topic-word distributions
            logits = torch.matmul(doc_topics, topic_words).log()
            pyro.sample(
                "doc_words", dist.Multinomial(total_count=100, logits=logits), obs=data
            )

    pyro.clear_param_store()
    guide = AutoDelta(lda_model)
    svi = SVI(lda_model, guide, pyro.optim.Adam({"lr": 0.01}), loss=Trace_ELBO())

    for step in range(500):
        loss = svi.step(X_tensor)

    posterior = guide()
    doc_topics = posterior["doc_topics"]
    topic_usage = doc_topics.sum(dim=0).detach().cpu().numpy()

    # Extra statistics
    loss_per_doc = loss / num_docs
    entropy = -(doc_topics * doc_topics.log()).sum(dim=1).mean().item()
    avg_active_per_doc = (doc_topics > 0.05).sum(dim=1).float().mean().item()
    num_active_topics = (topic_usage > 5.0).sum()

    results.append(
        {
            "K": K,
            "Final Loss": float(loss),
            "Loss per Doc": float(loss_per_doc),
            "Entropy": float(entropy),
            "Avg Active Topics/Doc": float(avg_active_per_doc),
            "Active Topics (global)": int(num_active_topics),
        }
    )

results_df = pd.DataFrame(results)
results_df.to_csv("results_k_selection.csv", index=False)

print("\n📊 K comparison results:")
print(results_df)

![image.png](attachment:image.png)
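
The extra statistics recorded in the loop (entropy and the number of active topics per document) are easier to interpret on a toy example. The sketch below uses a made-up document-topic matrix with two documents and four topics; the 0.05 threshold is the one used in the loop above, everything else is illustrative.

```python
import numpy as np

# Hypothetical document-topic proportions for 2 documents and 4 topics
toy_doc_topics = np.array(
    [
        [0.25, 0.25, 0.25, 0.25],  # spread evenly over all topics
        [0.97, 0.01, 0.01, 0.01],  # dominated by a single topic
    ]
)

# Shannon entropy per document (natural log): high when topics are evenly mixed
entropy_per_doc = -(toy_doc_topics * np.log(toy_doc_topics)).sum(axis=1)

# Number of topics with weight above 0.05 in each document
active_per_doc = (toy_doc_topics > 0.05).sum(axis=1)

print(entropy_per_doc)
print(active_per_doc)
```

A document spread uniformly over K topics reaches the maximum entropy log(K), so an average entropy close to log(K) suggests that the model has not yet learned sharp topic assignments.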

Why did we choose 60? ...

To cross-check the choice of K, we also train Gensim LDA models over the same range of topic counts and compute their `c_v` topic coherence.

In [None]:
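from gensim import corpora
from gensim.models import CoherenceModel, LdaModel
import matplotlib.pyplot as plt

# Build the Gensim inputs from the token lists created in Section 2.2
# (assumes df["tokens"] holds lists of tokens, as in the preprocessing cell)
texts = df["tokens"].tolist()
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]


def compute_coherence_values(dictionary, corpus, texts, start=5, limit=90, step=1):
    """Train one Gensim LDA model per topic count and record its c_v coherence."""
    coherence_values = []
    models_list = []
    for num_topics in range(start, limit, step):
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            alpha="auto",
        )
        models_list.append(model)
        coherencemodel = CoherenceModel(
            model=model, texts=texts, dictionary=dictionary, coherence="c_v"
        )
        coherence_values.append(coherencemodel.get_coherence())
    return models_list, coherence_values


# Compute
model_list, coherence_values = compute_coherence_values(
    dictionary, corpus, texts, start=5, limit=90, step=1
)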

# Plot
x = range(5, 90)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.title("LDA Topic Coherence")
plt.grid()
plt.show()

### 3.2 Training the LDA Model with the selected K

We now train the model with the selected K = 60 for 1000 steps and save the resulting per-document topic proportions to `doc_topics_K60.csv`.

In [None]:
# Data loading
# X_array = np.load("/kaggle/input/bow-xarray/BoW_X_Array.npz")["arr_0"] # Uncomment this line if using Kaggle
X_array = np.load("BoW_X_Array.npz")["arr_0"]
X_tensor = torch.tensor(X_array, dtype=torch.float)

device = "cuda" if torch.cuda.is_available() else "cpu"
X_tensor = X_tensor.to(device)

num_docs, vocab_size = X_tensor.shape
K = 60

print(f"\n🚀 Training final LDA with K = {K} and 1000 steps...")


def lda_model(data):
    with pyro.plate("topics", K):
        topic_words = pyro.sample(
            "topic_words", dist.Dirichlet(torch.ones(vocab_size).to(device))
        )
    with pyro.plate("documents", num_docs):
        doc_topics = pyro.sample("doc_topics", dist.Dirichlet(torch.ones(K).to(device)))
        logits = torch.matmul(doc_topics, topic_words).log()
        pyro.sample(
            "doc_words", dist.Multinomial(total_count=100, logits=logits), obs=data
        )


pyro.clear_param_store()
guide = AutoDelta(lda_model)
svi = SVI(lda_model, guide, pyro.optim.Adam({"lr": 0.01}), loss=Trace_ELBO())

# Training
num_steps = 1000
for step in range(num_steps):
    loss = svi.step(X_tensor)
    if step % 100 == 0:
        print(f"[step {step}] loss = {loss:.2f}")

# Extract posterior distributions
posterior = guide()
doc_topics = posterior["doc_topics"].detach().cpu().numpy()

# Save topic proportions for each document
doc_topics_df = pd.DataFrame(doc_topics, columns=[f"topic_{i}" for i in range(K)])
doc_topics_df.to_csv("doc_topics_K60.csv", index=False)
print("✅ doc_topics saved in 'doc_topics_K60.csv'")

### 3.3 Analysis of results

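A brief analysis of the learned document-topic proportions: the average number of strong topics per document and how heavily each topic is used across the corpus.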

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the topic proportions matrix
doc_topics = pd.read_csv("doc_topics_K60.csv")  # or full path if needed
topic_matrix = doc_topics.values  # shape: [n_documents, K]

# 1. Average number of strong topics per document (threshold > 0.05)
strong_topic_counts = (topic_matrix > 0.05).sum(axis=1)
avg_strong_topics = np.mean(strong_topic_counts)
print(f"✅ Average strong topics per doc (>0.05): {avg_strong_topics:.2f}")

# 2. Total topic usage across all documents
topic_usage = topic_matrix.sum(axis=0)
most_used_topics = topic_usage.argsort()[::-1]
sorted_usage = topic_usage[most_used_topics]

# 3. Bar plot: Total usage per topic
plt.figure(figsize=(10, 5))
plt.bar(range(len(sorted_usage)), sorted_usage)
plt.xlabel("Topic Index (sorted)")
plt.ylabel("Total Usage Across Documents")
plt.title("Topic Usage Distribution (LDA K=60)")
plt.grid(True)
plt.tight_layout()
plt.show()

# 4. Summary printout
print(f"Most used topic index: {most_used_topics[0]} (weight = {sorted_usage[0]:.2f})")
print(
    f"Least used topic index: {most_used_topics[-1]} (weight = {sorted_usage[-1]:.2f})"
)
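
The analysis above looks only at document-topic proportions. To interpret what a topic is about, a common complement is to list its highest-probability words. The sketch below does this for a made-up topic-word matrix and vocabulary; `top_words_per_topic`, `toy_vocab`, and `toy_topic_words` are hypothetical names. In this notebook the inputs would presumably come from `posterior["topic_words"]` and `vectorizer.get_feature_names_out()`.

```python
import numpy as np


def top_words_per_topic(topic_word_matrix, vocabulary, n_top=3):
    """Return the n_top highest-probability words for each topic (row)."""
    # Sort each row in descending order of probability and keep the first n_top indices
    top_indices = np.argsort(topic_word_matrix, axis=1)[:, ::-1][:, :n_top]
    return [[vocabulary[i] for i in row] for row in top_indices]


# Made-up vocabulary and topic-word probabilities (each row sums to 1)
toy_vocab = ["call", "free", "love", "prize", "sleep", "win"]
toy_topic_words = np.array(
    [
        [0.05, 0.30, 0.02, 0.25, 0.03, 0.35],  # a promotion-like topic
        [0.20, 0.05, 0.40, 0.02, 0.30, 0.03],  # a personal-chat-like topic
    ]
)

top_words = top_words_per_topic(toy_topic_words, toy_vocab, n_top=3)
for topic_id, words in enumerate(top_words):
    print(f"topic_{topic_id}: {words}")
```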

## 4. Bayesian Logistic Regression Classifier

## 5. Conclusions 