# SMS Spam Detection NLP Pipeline: Generative vs Discriminative + Sparse vs Dense

**Stakeholder & scenario:**

A telecom company wants to automatically detect spam SMS messages to protect customers from fraud and reduce support complaints. The model will classify incoming messages as spam or ham (legitimate), enabling warning or blocking systems.

In [11]:
!pip -q install pandas numpy scikit-learn nltk gensim matplotlib seaborn


In [12]:
import numpy as np
import pandas as pd
import re
import random

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

import nltk
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from gensim.models import Word2Vec

SEED = 42
np.random.seed(SEED)
random.seed(SEED)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [13]:
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_table(url, header=None, names=["label", "text"])
df.head()


| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [14]:
print("Shape:", df.shape)
print(df["label"].value_counts())
print("\nClass ratio (%):")
print(df["label"].value_counts(normalize=True) * 100)


Shape: (5572, 2)
label
ham     4825
spam     747
Name: count, dtype: int64

Class ratio (%):
label
ham     86.593683
spam    13.406317
Name: proportion, dtype: float64
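
With roughly 87% of messages being ham, accuracy alone is a weak yardstick. As a minimal sketch using only the class counts printed above, this is what a hypothetical classifier that always answers "ham" would score (no such model is trained in this notebook):

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Hypothetical baseline: label every message as ham.
baseline_accuracy = counts["ham"] / total   # share of messages it gets right
baseline_spam_recall = 0 / counts["spam"]   # it never flags any spam

print(f"Always-ham accuracy:    {baseline_accuracy:.4f}")
print(f"Always-ham spam recall: {baseline_spam_recall:.4f}")
```

This is why the later cells report precision, recall and F1 for the spam class next to accuracy.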


In [15]:
for cls in ["ham", "spam"]:
    print(f"\n--- {cls.upper()} examples ---")
    for t in df[df["label"] == cls]["text"].head(5).tolist():
        print("-", t)



--- HAM examples ---
- Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
- Ok lar... Joking wif u oni...
- U dun say so early hor... U c already then say...
- Nah I don't think he goes to usf, he lives around here though
- Even my brother is not like to speak with me. They treat me like aids patent.

--- SPAM examples ---
- Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
- FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
- WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
- Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 

In [16]:
df["char_len"] = df["text"].str.len()
df["word_len"] = df["text"].str.split().apply(len)

df.groupby("label")[["char_len","word_len"]].mean()


| label | char_len | word_len |
|---|---|---|
| ham | 71.482487 | 14.310259 |
| spam | 138.670683 | 23.911647 |


# Train/Val/Test Split

In [17]:
# Encode labels and split

df["y"] = df["label"].map({"ham": 0, "spam": 1})

X = df["text"].values
y = df["y"].values

# First split: train (70%) vs temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=SEED, stratify=y
)

# Second split: val (10%) vs test (20%) from temp (30%)
# test share of temp = 20/30 = 2/3, leaving 10/30 = 1/3 for validation
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=2/3, random_state=SEED, stratify=y_temp
)

print("Train:", len(X_train), "Val:", len(X_val), "Test:", len(X_test))


Train: 3900 Val: 557 Test: 1115
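
The printed sizes can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the test part up and gives the remainder to the other part. That assumption reproduces the numbers above.

```python
import math

n_total = 5572  # number of rows, from df.shape above

# First split: 30% held out (assumed to be rounded up), the rest is train.
n_temp = math.ceil(0.30 * n_total)
n_train = n_total - n_temp

# Second split: 2/3 of the held-out part becomes test, the rest validation.
n_test = math.ceil((2 / 3) * n_temp)
n_val = n_temp - n_test

print("Train:", n_train, "Val:", n_val, "Test:", n_test)
```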


# Preprocessing Pipeline



In [18]:
# Building Preprocessor

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

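def preprocess_tokens(text: str):
    """Lowercase, keep letters only, tokenize, drop stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drops digits, punctuation, symbols
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]  # noun lemmas by default
    return tokens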


In [19]:
sample = X_train[0]
print("Original:\n", sample)
print("\nTokens:\n", preprocess_tokens(sample))


Original:
 Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.

Tokens:
 ['goal', 'arsenal', 'henry', 'v', 'liverpool', 'henry', 'score', 'simple', 'shot', 'yard', 'pas', 'bergkamp', 'give', 'arsenal', 'goal', 'margin', 'min']
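
The regex in `preprocess_tokens` keeps only the letters a-z, so digits, currency symbols and punctuation never reach the vectorizer. The sketch below reproduces just the lowercasing and regex steps, leaving out tokenization, stop-word removal and lemmatization so that it runs without NLTK. It is applied to a fragment of one of the spam examples shown earlier. `clean_text` is a hypothetical helper name, not part of the notebook.

```python
import re

def clean_text(text: str) -> str:
    """Mirror only the lowercasing and regex steps of preprocess_tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # anything that is not a-z or whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

spam_fragment = "To claim call 09061701461. Claim code KL341. Valid 12 hours only."
cleaned = clean_text(spam_fragment)
print(cleaned)
```

The phone number and the numeric part of the claim code disappear. Whether keeping them, for instance as placeholder tokens, would help the classifiers is not tested in this notebook.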


# Feature Engineering and Model Training

In [20]:
# BoW + Naive Bayes (Generative Model)

bow_nb = Pipeline([
    ("vectorizer", CountVectorizer(
        tokenizer=preprocess_tokens,
        ngram_range=(1,1)
    )),
    ("classifier", MultinomialNB())
])

bow_nb.fit(X_train, y_train)

y_val_pred = bow_nb.predict(X_val)

print("BoW + Naive Bayes (Validation)")
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred, target_names=["ham", "spam"]))




BoW + Naive Bayes (Validation)
Accuracy: 0.9820466786355476
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       482
        spam       0.99      0.88      0.93        75

    accuracy                           0.98       557
   macro avg       0.98      0.94      0.96       557
weighted avg       0.98      0.98      0.98       557



Multinomial Naive Bayes is a generative classifier that models word distributions within each class. It performs well on count-based sparse features such as Bag-of-Words.
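
To make "models word distributions within each class" concrete, here is a from-scratch sketch of multinomial Naive Bayes with add-one smoothing on a made-up toy corpus. It illustrates the idea only and is not scikit-learn's implementation. The corpus, `log_likelihood` and `predict` are all hypothetical.

```python
import math
from collections import Counter

# Toy tokenized corpus (made up for illustration, not from the dataset).
train = [
    (["free", "entry", "win"], "spam"),
    (["win", "prize", "call"], "spam"),
    (["call", "me", "later"], "ham"),
    (["see", "you", "later"], "ham"),
    (["lunch", "later"], "ham"),
]

vocab = sorted({w for tokens, _ in train for w in tokens})
classes = sorted({label for _, label in train})

# Documents per class (for the prior) and word counts per class.
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in classes}
for tokens, label in train:
    word_counts[label].update(tokens)

def log_likelihood(word, c, alpha=1.0):
    """log P(word | c) with add-alpha (Laplace) smoothing."""
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))

def predict(tokens):
    """Return the class maximizing log P(c) + sum of log P(word | c)."""
    scores = {}
    for c in classes:
        score = math.log(doc_counts[c] / len(train))
        # Words outside the training vocabulary are skipped.
        score += sum(log_likelihood(w, c) for w in tokens if w in vocab)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict(["win", "free", "prize"]))
print(predict(["see", "you", "later"]))
```

The per-class word probabilities are the whole model, which is also why Naive Bayes is easy to inspect.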

In [21]:
# BoW + Logistic Regression (Discriminative)

bow_lr = Pipeline([
    ("vectorizer", CountVectorizer(
        tokenizer=preprocess_tokens,
        ngram_range=(1,1)
    )),
    ("classifier", LogisticRegression(
        max_iter=2000,
        random_state=SEED
    ))
])

bow_lr.fit(X_train, y_train)

y_val_pred = bow_lr.predict(X_val)

print("BoW + Logistic Regression (Validation)")
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred, target_names=["ham", "spam"]))




BoW + Logistic Regression (Validation)
Accuracy: 0.9748653500897666
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99       482
        spam       1.00      0.81      0.90        75

    accuracy                           0.97       557
   macro avg       0.99      0.91      0.94       557
weighted avg       0.98      0.97      0.97       557



In [22]:
# TF-IDF + Naive Bayes

tfidf_nb = Pipeline([
    ("vectorizer", TfidfVectorizer(
        tokenizer=preprocess_tokens,
        ngram_range=(1,2)
    )),
    ("classifier", MultinomialNB())
])

tfidf_nb.fit(X_train, y_train)

y_val_pred = tfidf_nb.predict(X_val)

print("TF-IDF (1,2) + Naive Bayes (Validation)")
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred, target_names=["ham", "spam"]))




TF-IDF (1,2) + Naive Bayes (Validation)
Accuracy: 0.9533213644524237
              precision    recall  f1-score   support

         ham       0.95      1.00      0.97       482
        spam       1.00      0.65      0.79        75

    accuracy                           0.95       557
   macro avg       0.97      0.83      0.88       557
weighted avg       0.96      0.95      0.95       557



In [23]:
# TF-IDF + Linear SVM (Strong baseline)

tfidf_svm = Pipeline([
    ("vectorizer", TfidfVectorizer(
        tokenizer=preprocess_tokens,
        ngram_range=(1,2)
    )),
    ("classifier", LinearSVC(random_state=SEED))
])

tfidf_svm.fit(X_train, y_train)

y_val_pred = tfidf_svm.predict(X_val)

print("TF-IDF (1,2) + Linear SVM (Validation)")
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred, target_names=["ham", "spam"]))




TF-IDF (1,2) + Linear SVM (Validation)
Accuracy: 0.9856373429084381
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       482
        spam       1.00      0.89      0.94        75

    accuracy                           0.99       557
   macro avg       0.99      0.95      0.97       557
weighted avg       0.99      0.99      0.99       557



In [24]:
# Prepare tokenized data

train_tokens = [preprocess_tokens(t) for t in X_train]
val_tokens   = [preprocess_tokens(t) for t in X_val]
test_tokens  = [preprocess_tokens(t) for t in X_test]


In [25]:
# Train Word2Vec (Skip-gram)

w2v_model = Word2Vec(
    sentences=train_tokens,
    vector_size=100,
    window=5,
    min_count=2,
    workers=2,
    sg=1,     # Skip-gram
    seed=SEED
)


In [26]:
# Convert messages → vectors

def document_vector(tokens, model):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

X_train_w2v = np.vstack([document_vector(t, w2v_model) for t in train_tokens])
X_val_w2v   = np.vstack([document_vector(t, w2v_model) for t in val_tokens])
X_test_w2v  = np.vstack([document_vector(t, w2v_model) for t in test_tokens])


In [27]:
# Word2Vec + Logistic Regression

w2v_lr = LogisticRegression(max_iter=2000, random_state=SEED)
w2v_lr.fit(X_train_w2v, y_train)

y_val_pred = w2v_lr.predict(X_val_w2v)

print("Word2Vec + Logistic Regression (Validation)")
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred, target_names=["ham", "spam"]))


Word2Vec + Logistic Regression (Validation)
Accuracy: 0.9587073608617595
              precision    recall  f1-score   support

         ham       0.96      0.99      0.98       482
        spam       0.92      0.76      0.83        75

    accuracy                           0.96       557
   macro avg       0.94      0.87      0.90       557
weighted avg       0.96      0.96      0.96       557



In [28]:
# Word2Vec + Linear SVM

w2v_svm = LinearSVC(random_state=SEED)
w2v_svm.fit(X_train_w2v, y_train)

y_val_pred = w2v_svm.predict(X_val_w2v)

print("Word2Vec + Linear SVM (Validation)")
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred, target_names=["ham", "spam"]))


Word2Vec + Linear SVM (Validation)
Accuracy: 0.9658886894075404
              precision    recall  f1-score   support

         ham       0.97      0.99      0.98       482
        spam       0.94      0.80      0.86        75

    accuracy                           0.97       557
   macro avg       0.95      0.90      0.92       557
weighted avg       0.97      0.97      0.96       557



# Final Evaluation on TEST

In [29]:
# Helper functions

# accuracy_score, classification_report and pandas were already imported above
from sklearn.metrics import precision_recall_fscore_support

def get_metrics(y_true, y_pred):
    """Return accuracy plus precision, recall and F1 for the spam class (label 1)."""
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return acc, p, r, f1


In [30]:
results = []

# ---- Sparse pipelines ----
sparse_models = {
    "BoW + Naive Bayes": bow_nb,
    "BoW + Logistic Regression": bow_lr,
    "TF-IDF(1,2) + Naive Bayes": tfidf_nb,
    "TF-IDF(1,2) + Linear SVM": tfidf_svm,
}

for name, model in sparse_models.items():
    y_pred = model.predict(X_test)
    acc, p, r, f1 = get_metrics(y_test, y_pred)
    results.append([name, acc, p, r, f1])

# ---- Dense models (Word2Vec vectors) ----
y_pred = w2v_lr.predict(X_test_w2v)
results.append(["Word2Vec(avg) + Logistic Regression", *get_metrics(y_test, y_pred)])

y_pred = w2v_svm.predict(X_test_w2v)
results.append(["Word2Vec(avg) + Linear SVM", *get_metrics(y_test, y_pred)])

results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1"])
results_df = results_df.sort_values("F1", ascending=False).reset_index(drop=True)

results_df


| | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | TF-IDF(1,2) + Linear SVM | 0.983857 | 0.958042 | 0.919463 | 0.938356 |
| 1 | BoW + Logistic Regression | 0.983857 | 0.992481 | 0.885906 | 0.93617 |
| 2 | BoW + Naive Bayes | 0.98296 | 0.939189 | 0.932886 | 0.936027 |
| 3 | TF-IDF(1,2) + Naive Bayes | 0.967713 | 1.0 | 0.758389 | 0.862595 |
| 4 | Word2Vec(avg) + Linear SVM | 0.950673 | 0.879032 | 0.731544 | 0.798535 |
| 5 | Word2Vec(avg) + Logistic Regression | 0.93991 | 0.866071 | 0.651007 | 0.743295 |
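
As a worked check of how these metrics relate, the sketch below recomputes the top row from spam-class confusion counts. The notebook does not print these counts. They are an assumption, back-calculated from the reported precision and recall together with the 149 spam messages among the 1115 test messages.

```python
# Spam-class counts for TF-IDF(1,2) + Linear SVM on the test set.
# Assumption: inferred from the reported precision/recall, not printed by the notebook.
tp, fp, fn = 137, 6, 12
n_test = 1115
tn = n_test - tp - fp - fn

precision = tp / (tp + fp)                           # flagged messages that are spam
recall = tp / (tp + fn)                              # spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / n_test

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} accuracy={accuracy:.6f}")
```

Under these counts the model misses 12 of 149 spam messages and wrongly flags 6 of 966 ham messages.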


In [31]:
best_model_name = results_df.loc[0, "Model"]
print("Best model based on F1:", best_model_name)

# Re-create y_pred for best model
if best_model_name in sparse_models:
    best_model = sparse_models[best_model_name]
    y_pred_best = best_model.predict(X_test)
else:
    # Word2Vec cases
    if "Logistic" in best_model_name:
        y_pred_best = w2v_lr.predict(X_test_w2v)
    else:
        y_pred_best = w2v_svm.predict(X_test_w2v)

print("\nDetailed classification report (TEST):")
print(classification_report(y_test, y_pred_best, target_names=["ham", "spam"]))


Best model based on F1: TF-IDF(1,2) + Linear SVM

Detailed classification report (TEST):
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.96      0.92      0.94       149

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115



# Analysis & Discussion

## Generative vs Discriminative Models

In this project, both generative and discriminative classifiers were evaluated for SMS spam detection.

Multinomial Naive Bayes is a generative model, as it learns the probability distribution of words within each class (spam or ham) and applies Bayes’ theorem to make predictions. This approach is computationally efficient and performs well with count-based features.

Linear SVM and Logistic Regression, on the other hand, are discriminative models. They directly learn a decision boundary that best separates spam and ham messages in the feature space without modeling how the text itself is generated.
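
A minimal sketch of the discriminative idea: learn weights `w` and a bias `b` directly from labeled vectors by taking subgradient steps on the hinge loss. The data is made up, and this unregularized loop only illustrates the principle. scikit-learn's `LinearSVC` solves a regularized variant with its own solver.

```python
# Toy count vectors over the vocabulary [free, win, meet, lunch]
# (made-up data; +1 = spam, -1 = ham).
X = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]]
y = [1, 1, -1, -1]

w = [0.0] * 4
b = 0.0
lr = 0.1

def score(x):
    """Linear score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

for _ in range(50):
    for xi, yi in zip(X, y):
        # Hinge-loss subgradient step: update only when the margin is below 1.
        if yi * score(xi) < 1:
            w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
            b += lr * yi

predictions = [1 if score(xi) > 0 else -1 for xi in X]
print(predictions)
print([round(wj, 2) for wj in w])
```

No word probabilities are estimated at any point. The model keeps only the boundary, with positive weights on spam-leaning words and negative weights on ham-leaning ones.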

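Observation:
The TF-IDF (1,2) + Linear SVM model achieved the highest F1-score on the test set (0.938), but the margin over BoW + Logistic Regression (0.936) and BoW + Naive Bayes (0.936) is too small to rank the three confidently on 1,115 test messages. They differ mainly in their error profile: Logistic Regression had the highest precision (0.992) and the lowest recall (0.886), Naive Bayes the highest recall (0.933) and the lowest precision (0.939), and Linear SVM sat in between (0.958 precision, 0.919 recall). Because the top model also used different features (TF-IDF with bigrams rather than unigram counts), this ranking does not isolate the effect of the classifier type. On identical TF-IDF (1,2) features, the discriminative Linear SVM (F1 0.938) clearly beat the generative Naive Bayes (F1 0.863, recall 0.758), whereas on BoW features Naive Bayes and Logistic Regression were effectively tied.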

## Sparse vs Dense Representations

Two types of feature representations were compared: sparse vectors (Bag-of-Words and TF-IDF) and dense embeddings (Word2Vec).

Sparse representations treat each word or phrase as an independent feature. TF-IDF improves upon Bag-of-Words by reducing the influence of very frequent but less informative words while emphasizing rare and discriminative terms.
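
The reweighting can be sketched by hand. The example assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the behavior documented for `TfidfVectorizer` with default settings. The toy documents and the helper names `idf` and `tfidf` are made up.

```python
import math
from collections import Counter

# Toy tokenized documents (made up for illustration).
docs = [
    ["free", "call", "now"],
    ["call", "me", "now"],
    ["call", "you", "later"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term count times IDF, then L2-normalized to unit length."""
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {t: v / norm for t, v in weights.items()}

vec = tfidf(docs[0])
for term in docs[0]:
    print(f"{term}: {vec[term]:.3f}")
```

"call" appears in every document and gets the smallest weight, while "free" appears in only one and gets the largest.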

Dense Word2Vec embeddings capture semantic similarity between words by learning distributed representations based on context. Document-level vectors were obtained by averaging word embeddings.
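
"Based on context" means that skip-gram is trained to predict the neighbors of each word. The sketch below only generates the (center, context) training pairs for a fixed window. Real implementations, gensim included, add further steps such as negative sampling that are not shown here. `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within a fixed window on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["free", "entry", "win", "cup"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Words that share many contexts end up with similar vectors, which is the source of the semantic similarity mentioned above.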

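Observation:
The sparse representations outperformed the Word2Vec-based models: the best sparse model reached a test F1 of 0.938, against 0.799 for Word2Vec(avg) + Linear SVM and 0.743 for Word2Vec(avg) + Logistic Regression. This is likely because SMS spam detection relies heavily on specific keywords and short phrases (e.g., “call now”, “free entry”, “win prize”) that sparse features capture directly. Averaging Word2Vec embeddings may dilute such signals, especially in short text messages, and the embeddings here were trained on only the 3,900 training messages.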
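
The dilution effect can be illustrated with hypothetical 3-dimensional embeddings: one word whose vector points in a "spammy" direction, averaged with a growing number of unrelated words. The vectors are invented for the example and do not come from the trained model.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: one strongly "spammy" word and three neutral ones.
spam_word = [1.0, 0.0, 0.0]
neutral_words = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

similarities = []
for k in range(len(neutral_words) + 1):
    doc = mean_vector([spam_word] + neutral_words[:k])
    similarities.append(round(cosine(doc, spam_word), 3))

print(similarities)
```

Each added neutral word pulls the document vector further from the spam direction, whereas a sparse feature for the same keyword would keep its own dimension.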

## Effect of N-grams

The TF-IDF models used unigrams and bigrams (`ngram_range=(1,2)`). Bigrams allow the model to capture short but meaningful phrases commonly found in spam messages, such as “call now” or “limited offer”. The notebook did not train a unigram-only TF-IDF model, so the contribution of bigrams is not isolated here; the unigram BoW models reached almost the same F1 (0.936 vs 0.938).

Adding higher-order n-grams also increases the feature space size, leading to higher memory usage and longer training times. In this case, the bigram model gave the top F1-score, but confirming that the gain justifies the additional complexity would require a unigram TF-IDF baseline.
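
A short sketch of how the vocabulary grows when bigrams are added, using made-up token lists. `ngrams` is a hypothetical helper that mimics what `ngram_range=(1,2)` does conceptually.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams for n in [n_min, n_max], joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

# Toy tokenized messages (made up for illustration).
docs = [
    ["call", "now", "free", "entry"],
    ["free", "entry", "win", "prize"],
    ["call", "me", "now"],
]

unigram_vocab = {g for d in docs for g in ngrams(d, 1, 1)}
uni_bi_vocab = {g for d in docs for g in ngrams(d, 1, 2)}

print(ngrams(docs[0]))
print("unigram features:", len(unigram_vocab))
print("unigram + bigram features:", len(uni_bi_vocab))
```

Even on three short messages the feature count doubles. On a real corpus most bigrams are rare, so the growth is usually much larger.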

## Speed, Memory, and Explainability Trade-offs

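Training time and memory use were not measured in this notebook, so the comparison below rests on the general properties of each approach. Naive Bayes is typically the fastest and most memory-efficient of these models, since training amounts to counting, and it offers high interpretability through class-conditional word probabilities.

TF-IDF + Linear SVM works on a larger feature space (unigrams plus bigrams) and requires iterative optimization, but it delivered the best predictive performance.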

Word2Vec-based models required additional training time and were less interpretable, as individual dimensions of dense embeddings do not correspond to human-readable features.

Overall, TF-IDF combined with Linear SVM provided the best balance between performance and practicality for this real-world SMS spam detection task.