# Data Acquisition & Exploration

First, I imported Polars as a replacement for pandas, purely as an experiment. I then read the CSV file, removed the non-essential columns, and converted `ham` and `spam` to `0` and `1`, casting the result to `Int8`.
Finally, I renamed the columns for better readability.

## Note:

The questions asked in Notion are answered at the end of this notebook; the bonus-point question is covered just before the serialization of the model.

### Imports
Importing the Polars library as a faster alternative to Pandas for data manipulation.


In [11]:
import polars as pl


### Loading Data
Reading the raw spam dataset from a CSV file. We use `latin-1` encoding to handle special characters common in SMS datasets.

In [12]:
df_spam = pl.read_csv(
    "C:/Users/hanna/Documents/AI projects/SpamDetection_NLP/Data/spam.csv",
    encoding="latin-1",
)


### Column Selection
Keeping only the relevant columns: the original file contains several empty trailing columns, so the last three are dropped.

In [13]:
df_spam = df_spam.select(df_spam.columns[:-3])


### Label Encoding
Mapping the target labels 'ham' and 'spam' to numeric values (0 and 1) and casting them to an efficient 8-bit integer type.

In [14]:
df_spam = df_spam.with_columns(
    pl.col("v1")
      .replace({"ham": 0, "spam": 1})
      .cast(pl.Int8)
      .alias("label")
)


### Cleaning Columns
Removing the original label column after encoding it into a new 'label' column.

In [15]:
df_spam = df_spam.drop("v1")


### Renaming Columns
Renaming the text column to 'text' for clarity and consistency across the pipeline.

In [16]:
df_spam = df_spam.rename({
    "v2": "text"
})


### Data Inspection
Displaying the processed dataframe to verify the structure and contents before moving to the next stage.

In [17]:
df_spam

text,label
str,i8
"""Go until jurong point, crazy..…",0
"""Ok lar... Joking wif u oni...""",0
"""Free entry in 2 a wkly comp to…",1
"""U dun say so early hor... U c …",0
"""Nah I don't think he goes to u…",0
…,…
"""This is the 2nd time we have t…",1
"""Will Ì_ b going to esplanade f…",0
"""Pity, * was in mood for that. …",0
"""The guy did some bitching but …",0


# Pre‑processing Pipeline

### Preprocessing Logic
Setting up a text preprocessing pipeline: converts text to lowercase, removes punctuation, tokenizes, removes stop words, and applies lemmatization to reduce words to their base form.

In [18]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    token = word_tokenize(text)
    token = [t for t in token if t not in stop_words]
    token = [lemmatizer.lemmatize(t) for t in token]
    return " ".join(token) 


df_spam = df_spam.with_columns(
    pl.col('text').map_elements(preprocess, return_dtype=pl.Utf8).alias('text')
)
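
One side effect of the regular expression in `preprocess` is worth seeing in isolation: `re.sub(r'[^\w\s]', '', text)` deletes punctuation without leaving a space, so words joined only by punctuation get fused (this is where tokens such as `soany` and `questionstd` in the outputs below come from). The sketch below uses a made-up sample string and compares that rule with a hypothetical variant, `space_punct`, that substitutes a space instead. It is only an illustration, not a change to the pipeline used in this notebook.

```python
import re

# Made-up example message, not taken from the dataset
sample = "Pity, I was in the mood. So...any other suggestions?"

def strip_punct(text):
    """Same rule as in preprocess(): delete every non-word, non-space character."""
    return re.sub(r"[^\w\s]", "", text.lower())

def space_punct(text):
    """Hypothetical variant: replace each punctuation character with a space."""
    return re.sub(r"[^\w\s]", " ", text.lower())

fused_tokens = strip_punct(sample).split()   # "so...any" collapses into one token
split_tokens = space_punct(sample).split()   # "so" and "any" stay separate

print(fused_tokens)
print(split_tokens)
```

The space-substituting variant has its own trade-off: a contraction such as "don't" would become the two tokens "don" and "t" instead of "dont".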



### Preprocessing Execution
After applying the preprocessing function to the entire dataset, displaying the dataframe again to check that the cleaning worked.

In [19]:

df_spam

text,label
str,i8
"""go jurong point crazy availabl…",0
"""ok lar joking wif u oni""",0
"""free entry 2 wkly comp win fa …",1
"""u dun say early hor u c alread…",0
"""nah dont think go usf life aro…",0
…,…
"""2nd time tried 2 contact u u å…",1
"""ì_ b going esplanade fr home""",0
"""pity mood soany suggestion""",0
"""guy bitching acted like id int…",0


### Detailed Inspection
Printing the first 10 rows of the cleaned text to get a better sense of the results of the preprocessing.

In [20]:
# Show first 10 rows fully
for row in df_spam.head(10).select("text").to_series():
    print(row)


go jurong point crazy available bugis n great world la e buffet cine got amore wat
ok lar joking wif u oni
free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s
u dun say early hor u c already say
nah dont think go usf life around though
freemsg hey darling 3 week word back id like fun still tb ok xxx std chgs send å150 rcv
even brother like speak treat like aid patent
per request melle melle oru minnaminunginte nurungu vettam set callertune caller press 9 copy friend callertune
winner valued network customer selected receivea å900 prize reward claim call 09061701461 claim code kl341 valid 12 hour
mobile 11 month u r entitled update latest colour mobile camera free call mobile update co free 08002986030


### Split Data Preparation
Importing the necessary utilities for splitting the dataset into training, validation, and testing sets.

In [21]:
from sklearn.model_selection import train_test_split

### Feature Selection
Defining the feature set (X) and the target labels (y) from the dataframe.

In [22]:
X = df_spam["text"]
y = df_spam["label"]

### Train-Val-Test Split
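Splitting the data into training (about 70%), validation (about 10%), and testing (about 20%) sets using stratified sampling to maintain class balance. The first split holds out 30% of the data, and the second assigns 66% of that held-out part to the test set.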

In [23]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.66, random_state=42, stratify=y_temp
)
print(f"\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")


Train: 3900, Val: 568, Test: 1104
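
The printed sizes can be checked with a bit of arithmetic. The sketch below assumes that the held-out (test) part of each split is rounded up, which matches the sizes printed above; the variable names are mine and are not used elsewhere in the notebook.

```python
import math

n_total = 3900 + 568 + 1104          # 5572 messages, from the sizes printed above

# First split: test_size=0.3 holds out 30% (rounded up) as the temporary set
n_temp = math.ceil(0.3 * n_total)
n_train = n_total - n_temp

# Second split: test_size=0.66 of the temporary set goes to test, the rest to validation
n_test = math.ceil(0.66 * n_temp)
n_val = n_temp - n_test

shares = {
    "train": round(n_train / n_total, 3),
    "val": round(n_val / n_total, 3),
    "test": round(n_test / n_total, 3),
}
print(n_train, n_val, n_test, shares)
```

So the 70/10/20 description is a rounding of the actual proportions, since 0.66 of 30% is 19.8% rather than exactly 20%.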


### Vectorization Imports
Importing vectorization techniques to convert raw text into numerical features.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


### Bag-of-Words (BoW)
Implementing a Bag-of-Words vectorizer over unigrams and bigrams (`ngram_range=(1,2)`) with a minimum document frequency of 2 (`min_df=2`) to capture word frequency and simple local context.

In [25]:
bow_vectorizer = CountVectorizer(ngram_range=(1,2), min_df=2)
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_val_bow = bow_vectorizer.transform(X_val)
X_test_bow = bow_vectorizer.transform(X_test)
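
To make `ngram_range=(1,2)` and `min_df=2` concrete, here is a minimal standard-library sketch of the same idea on three toy documents. It is a simplification and not what `CountVectorizer` runs internally: it splits on whitespace, whereas the default scikit-learn tokenizer also drops single-character tokens. The helper `ngrams` and the toy documents are my own.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams of length n_min..n_max, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

docs = [  # toy documents, not taken from the dataset
    "free entry win prize",
    "win prize call now",
    "ok see you later",
]

# Document frequency: in how many documents does each n-gram occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(ngrams(doc.split())))

# min_df=2 keeps only n-grams that occur in at least two documents
vocabulary = sorted(term for term, df in doc_freq.items() if df >= 2)

# Count vector of the first document over that vocabulary
counts = Counter(ngrams(docs[0].split()))
row = [counts[term] for term in vocabulary]

print(vocabulary, row)
```

Only "win", "prize" and the bigram "win prize" survive the threshold here, which shows how `min_df` prunes one-off n-grams and keeps the feature matrix small.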

### TF-IDF Vectorization
Applying TF-IDF (Term Frequency-Inverse Document Frequency) to weigh words based on their importance relative to the entire corpus.

In [26]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=2)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
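
As a rough illustration of the weighting, the sketch below evaluates the smoothed inverse document frequency that scikit-learn uses by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1. The document frequencies are hypothetical, picked only to contrast a common term with a rare one. The real vectorizer additionally multiplies by the term count and L2-normalizes each row.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, scikit-learn's default smoothing."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3900                          # size of the training split in this notebook
idf_common = smooth_idf(n_docs, 400)   # hypothetical term found in 400 messages
idf_rare = smooth_idf(n_docs, 2)       # hypothetical term found in only 2 messages

print(round(idf_common, 3), round(idf_rare, 3))
```

A term that appears in every document gets the minimum weight of 1, and the weight grows as the term becomes rarer.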

### Word Embeddings
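Loading pre-trained FastText vectors (`fasttext-wiki-news-subwords-300`, via gensim) to capture semantic similarity that sparse count vectors cannot represent. Note that words missing from the embedding vocabulary are simply skipped by the averaging function defined in the next cell.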

In [27]:
import gensim.downloader as api

word2vec_model = api.load("fasttext-wiki-news-subwords-300")

### Embedding Logic
Defining a function to aggregate word-level embeddings into a single document-level vector by calculating the mean of the embeddings.

In [28]:
import numpy as np

def document_vector(doc):
    words = doc.split()
    vecs = [word2vec_model[w] for w in words if w in word2vec_model]
    if len(vecs) == 0:  # fallback for empty doc
        return np.zeros(word2vec_model.vector_size)
    return np.mean(vecs, axis=0)
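
The same mean pooling can be followed by hand on toy data. The sketch below uses hypothetical 3-dimensional vectors and plain Python lists instead of the real 300-dimensional FastText model; `toy_vectors` and `mean_pool` are illustrative names only.

```python
# Hypothetical 3-d embeddings, for illustration only
toy_vectors = {
    "free": [1.0, 0.0, 0.0],
    "call": [0.0, 1.0, 0.0],
    "home": [0.0, 0.0, 1.0],
}
DIM = 3

def mean_pool(doc, vectors=toy_vectors, dim=DIM):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in doc.split() if w in vectors]
    if not known:                      # no known word at all: fall back to zeros
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

short_doc = mean_pool("free")                  # the single word dominates
longer_doc = mean_pool("free call home home")  # "free" now contributes only 1/4
unknown_doc = mean_pool("xyzzy plugh")         # nothing known, zero vector

print(short_doc, longer_doc, unknown_doc)
```

The second example shows the dilution effect: the more other words a message contains, the less a single distinctive token moves the averaged vector.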

### Embedding Generation
Converting the training, validation, and test sets into dense vectors using the pre-trained embeddings.

In [29]:

X_train_dense = np.array([document_vector(d) for d in X_train])
X_val_dense = np.array([document_vector(d) for d in X_val])
X_test_dense = np.array([document_vector(d) for d in X_test])

### Evaluation Framework
Defining a reusable evaluation function to train models and print performance metrics like accuracy, precision, recall, and F1-score.

In [30]:
from sklearn.metrics import accuracy_score, classification_report

def evaluate_model(model, X_tr, y_tr, X_te, y_te, description="Model"):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    print(f"\n=== {description} ===")
    print("Accuracy:", accuracy_score(y_te, y_pred))
    print(classification_report(y_te, y_pred, zero_division=0))
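
The metrics printed by `classification_report` can be reproduced by hand from confusion-matrix counts. The counts below are an assumption: I chose them so that they are consistent with the Naive Bayes (BoW) validation report shown later in this notebook (support 492 ham / 76 spam); they were not read from an actual confusion matrix.

```python
# Assumed counts for the spam class (label 1), consistent with the NB (BoW) report
tp, fp, fn, tn = 65, 3, 11, 489

precision = tp / (tp + fp)   # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), accuracy)
```

Because ham dominates the dataset, accuracy stays high even when a noticeable share of spam is missed, which is why spam recall and F1 are the more informative numbers here.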

### Naive Bayes Training
Training and evaluating a Multinomial Naive Bayes model on both BoW and TF-IDF features to compare their effectiveness.

In [31]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
evaluate_model(nb_model, X_train_bow, y_train, X_val_bow, y_val, "Naive Bayes (BoW)")
evaluate_model(nb_model, X_train_tfidf, y_train, X_val_tfidf, y_val, "Naive Bayes (TF-IDF)")


=== Naive Bayes (BoW) ===
Accuracy: 0.9753521126760564
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       492
           1       0.96      0.86      0.90        76

    accuracy                           0.98       568
   macro avg       0.97      0.92      0.94       568
weighted avg       0.98      0.98      0.97       568


=== Naive Bayes (TF-IDF) ===
Accuracy: 0.9559859154929577
              precision    recall  f1-score   support

           0       0.95      1.00      0.98       492
           1       1.00      0.67      0.80        76

    accuracy                           0.96       568
   macro avg       0.98      0.84      0.89       568
weighted avg       0.96      0.96      0.95       568



### Logistic Regression Training
Training and evaluating a Logistic Regression model on the sparse BoW and TF-IDF features.

In [32]:
from sklearn.linear_model import LogisticRegression

# Logistic Regression (sparse)
lr_model = LogisticRegression(max_iter=500)
evaluate_model(lr_model, X_train_bow, y_train, X_val_bow, y_val, "Logistic Regression (BoW)")
evaluate_model(lr_model, X_train_tfidf, y_train, X_val_tfidf, y_val, "Logistic Regression (TF-IDF)")


=== Logistic Regression (BoW) ===
Accuracy: 0.971830985915493
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       492
           1       1.00      0.79      0.88        76

    accuracy                           0.97       568
   macro avg       0.98      0.89      0.93       568
weighted avg       0.97      0.97      0.97       568


=== Logistic Regression (TF-IDF) ===
Accuracy: 0.9665492957746479
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       492
           1       0.98      0.76      0.86        76

    accuracy                           0.97       568
   macro avg       0.97      0.88      0.92       568
weighted avg       0.97      0.97      0.96       568



### Logistic Regression (Dense)
Evaluating Logistic Regression on the dense, averaged FastText embeddings (labelled "Word2Vec" in the output).

In [33]:
lr_dense_model = LogisticRegression(max_iter=500)
evaluate_model(lr_dense_model, X_train_dense, y_train, X_val_dense, y_val, "Logistic Regression (Word2Vec)")


=== Logistic Regression (Word2Vec) ===
Accuracy: 0.9295774647887324
              precision    recall  f1-score   support

           0       0.95      0.97      0.96       492
           1       0.79      0.64      0.71        76

    accuracy                           0.93       568
   macro avg       0.87      0.81      0.84       568
weighted avg       0.93      0.93      0.93       568



### Generative Demo
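Building a character-level Markov chain on the training data to demonstrate a simple generative approach to text: each 20-character context (`n=20`) is mapped to the characters that followed it, and new text is sampled one character at a time. With such a long context, the samples tend to reproduce stretches of the training messages.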

In [34]:
import random
from collections import defaultdict

def build_markov_chain(texts, n=3):
    chain = defaultdict(list)
    corpus = " ".join(texts)
    for i in range(len(corpus) - n):
        ngram = corpus[i:i+n]
        next_char = corpus[i+n]
        chain[ngram].append(next_char)
    return chain

def generate_text(chain, length=100):
    ngram = random.choice(list(chain.keys()))
    result = ngram
    for _ in range(length):
        if ngram in chain:
            next_char = random.choice(chain[ngram])
            result += next_char
            ngram = result[-len(ngram):]
        else:
            break
    return result

# Build a character-level chain on the training data (20-character context)
markov_chain = build_markov_chain(X_train, n=20)

# Generate 5 sample texts
print("\n=== Character-level Markov Chain (n=20) Generated Text ===")
for i in range(5):
    print(generate_text(markov_chain, length=80))


=== Character-level Markov Chain (n=20) Generated Text ===
ceive å500000 easter prize drawplease telephone 09041940223 claim 290305 prize transferred someone e
leep wish great day full feeling better opportunity last thought babe love kiss urgent ur å500 guara
a thanks talk saturday dear cherish brother role model k im leaving soon little 9 ok anyway need cha
e india onionrs ltgt petrolrs ltgt beerrs ltgt shesil ltgt hello yeah ive got bath need hair ill com
axx match startedindia ltgt 2 jokin oni lar ìï busy wun disturb ì_ guy go see movie side ok come n p


### Final Evaluation
Performing a final evaluation of the best-performing model (Naive Bayes with BoW) on the unseen test set.

In [35]:
# Refit on BoW features: nb_model was last fitted on TF-IDF in the comparison above
nb_bow_model = nb_model.fit(X_train_bow, y_train)
y_pred = nb_bow_model.predict(X_test_bow)
print("\n=== Naive Bayes Model with BoW ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))


=== Naive Bayes Model with BoW ===
Accuracy: 0.9764492753623188
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       956
           1       0.94      0.88      0.91       148

    accuracy                           0.98      1104
   macro avg       0.96      0.94      0.95      1104
weighted avg       0.98      0.98      0.98      1104



### Model Serialization
Saving the trained Naive Bayes model using Pickle for future deployment.

In [36]:
import pickle

with open("spam_bow_nb.pkl", "wb") as f:
    pickle.dump(nb_bow_model, f)

### Vectorizer Serialization
Saving the BoW vectorizer to ensure consistent preprocessing during inference.

In [37]:
with open("bow_vectorizer.pkl", "wb") as f:
    pickle.dump(bow_vectorizer, f)

# Questions Asked in Notion

### Overall understanding of the data:

Overall, this is an imbalanced dataset, which is natural: ham messages usually far outnumber spam (76 of the 568 validation messages are spam). The situation is similar to a fraud detection dataset.

Bag of Words works best on this dataset: word-level counts preserve as much signal as possible from the relatively few spam messages. Both NB and LR work well with BoW, but NB has the better spam recall (0.86 vs. 0.79 on the validation set), and I think recall is the key distinguishing factor in this evaluation.

TF-IDF hurts spam recall for both NB (0.67) and LR (0.76), meaning that a larger share of the actual spam messages went undetected. I think this happens because most spam messages repeat the same spam words. TF-IDF gives those frequent words less weight because it favours rare words, so detection quality suffers.

In my case the averaged word embeddings (labelled Word2Vec) did not give good results; they were the worst of all (spam F1 0.71, recall 0.64). This could be due to two things:
1. The dataset is small.
2. Deep learning techniques may be needed to capture context well enough for better overall performance.

### Final Words
1. For our spam detection problem, Naive Bayes + BoW is the best choice.

2. It is simple, fast, interpretable, and robust.

3. TF-IDF and Word2Vec only make sense if you plan to scale up the dataset or move to neural/transformer-based models.

# Compare generative vs. discriminative performance:

- The generative model (NB) is better at catching the minority class (spam), even with a small dataset (validation recall 0.86 vs. 0.79 with BoW).

- The discriminative model (LR) is more conservative: it avoids false positives (spam precision 1.00 with BoW) but misses more spam.

- On a small, sparse dataset like this one, generative NB with BoW slightly outperforms discriminative LR in spam F1 (0.90 vs. 0.88) and recall.

# Discuss how N‑gram size and embedding choice affected results:
- Unigrams (1‑grams) capture most spam tokens, e.g., “free”, “win”, “prize”.

- Bigrams add phrase-level spam patterns (e.g., “call now”). Every run here used `ngram_range=(1,2)`, so the separate contribution of bigrams was not measured.

# Effect of Embedding Choice

| Representation | Observed effect in my results                                                                        |
| -------------- | ---------------------------------------------------------------------------------------------------- |
| BoW            | Strongest recall; simple; counts spam tokens directly; robust on small datasets                      |
| TF-IDF         | Hurt spam recall; downweights frequent spam words → overly conservative                              |
| Word2Vec       | Averaging embeddings **dilutes rare spam tokens**, so LR cannot separate them → lowest F1 and recall |


# Speed, Memory, and Explainability

| Model       | Speed                             | Memory                                    | Explainability                                                       |
| ----------- | --------------------------------- | ----------------------------------------- | -------------------------------------------------------------------- |
| NB BoW      | ✅ Very fast to train & predict    | ✅ Low memory (sparse counts)              | ✅ Highly interpretable (word probabilities for spam)                 |
| NB TF-IDF   | ✅ Fast                            | ✅ Slightly higher (sparse TF-IDF vectors) | ✅ Probabilities still interpretable                                  |
| LR BoW      | ✅ Fast                            | ✅ Moderate                                | ✅ Coefficients interpretable (weight per token)                      |
| LR TF-IDF   | ✅ Fast                            | ✅ Moderate                                | ✅ Coefficients interpretable, harder to interpret downweighted words |
| LR Word2Vec | ⚠ Slower (embeddings + averaging) | ⚠ Dense vectors → higher memory           | ⚠ Low explainability; difficult to trace which words triggered spam  |
