# SMS Spam Classifier – Training Notebook

This notebook:
1. Loads the SMS Spam Collection dataset from Kaggle (`data/spam.csv`)
2. Cleans and preprocesses the text
3. Extracts TF–IDF features
4. Trains a Logistic Regression classifier
5. Evaluates the model and compares it with a Naive Bayes baseline
6. Saves the trained model and vectorizer to `models/`

## 1. Imports

In [1]:
import os
import re
import joblib
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score
)

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download NLTK resources if not already present
try:
    _ = stopwords.words("english")
except LookupError:
    nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arjun\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


## 2. Load dataset

In [2]:
DATA_PATH = os.path.join("..", "data", "spam.csv")

df = pd.read_csv(DATA_PATH, encoding="latin-1")

# The Kaggle CSV often has extra unnamed columns. Keep only relevant ones.
# v1 = label (ham/spam), v2 = message text
df = df[["v1", "v2"]]
df = df.rename(columns={"v1": "label", "v2": "text"})

print(df.head())
print("\nShape:", df.shape)
print("\nLabel distribution:\n", df["label"].value_counts())

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

Shape: (5572, 2)

Label distribution:
 label
ham     4825
spam     747
Name: count, dtype: int64
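The label distribution above is imbalanced, which matters when reading accuracy figures later. As a rough sketch using only the counts printed above, the snippet below computes the accuracy of a trivial classifier that always predicts the majority class; the variable names are illustrative.

```python
# Class counts copied from the label distribution printed above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total
spam_fraction = counts["spam"] / total

print(f"Total messages: {total}")
print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Spam fraction: {spam_fraction:.4f}")
```

Any model trained below should beat this baseline, and per-class precision and recall are more informative than accuracy alone.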


## 3. Basic cleaning (drop missing, strip whitespace)

In [3]:
df = df.dropna(subset=["text", "label"])
df["text"] = df["text"].astype(str).str.strip()
df = df[df["text"] != ""]

df.reset_index(drop=True, inplace=True)
df.head()

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


## 4. Text preprocessing function

- Lowercase
- Replace every character that is not a letter or whitespace (digits, punctuation) with a space
- Tokenize by whitespace
- Remove stopwords
- Apply stemming

In [4]:
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess_text(text: str) -> str:
    # Lowercase
    text = text.lower()
    # Keep only letters and spaces
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenize on whitespace
    tokens = text.split()
    # Remove stopwords and stem
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    # Join back to string
    return " ".join(tokens)

# Test preprocessing on a sample
sample = df["text"].iloc[0]
print("Original:", sample)
print("Processed:", preprocess_text(sample))

Original: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Processed: go jurong point crazi avail bugi n great world la e buffet cine got amor wat


## 5. Apply preprocessing

In [5]:
df["clean_text"] = df["text"].apply(preprocess_text)
df[["text", "clean_text"]].head()

                                                text                                         clean_text
0  Go until jurong point, crazy.. Available only ...  go jurong point crazi avail bugi n great world...
1                      Ok lar... Joking wif u oni...                              ok lar joke wif u oni
2  Free entry in 2 a wkly comp to win FA Cup fina...  free entri wkli comp win fa cup final tkt st m...
3  U dun say so early hor... U c already then say...                u dun say earli hor u c alreadi say
4  Nah I don't think he goes to usf, he lives aro...               nah think goe usf live around though


## 6. Encode labels
We'll map:
- ham  -> 0
- spam -> 1

In [6]:
label_mapping = {"ham": 0, "spam": 1}
df["target"] = df["label"].map(label_mapping)

print(df[["label", "target"]].head())

  label  target
0   ham       0
1   ham       0
2  spam       1
3   ham       0
4   ham       0
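`Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected label would silently become a missing target. A minimal sketch of a sanity check, using hypothetical labels rather than the dataset:

```python
import pandas as pd

label_mapping = {"ham": 0, "spam": 1}

# Hypothetical labels: "Spam" (capitalised) is not a key in the mapping.
labels = pd.Series(["ham", "spam", "Spam"])

target = labels.map(label_mapping)

# Values missing from the mapping come back as NaN; list them explicitly.
unmapped = labels[target.isna()]
print("Mapped targets:", target.tolist())
print("Unmapped labels:", unmapped.tolist())
```

In this notebook the check would be expected to find nothing, since the label distribution shows only `ham` and `spam`.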


## 7. Train–test split

In [7]:
X = df["clean_text"].values
y = df["target"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))

Train size: 4457
Test size: 1115
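The `stratify=y` argument keeps the spam ratio nearly identical in both splits. The sketch below illustrates this with synthetic labels that only reproduce the class counts of the dataset (4825 ham, 747 spam); the `*_demo` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (4825 ham, 747 spam).
y_demo = np.array([0] * 4825 + [1] * 747)
X_demo = np.arange(len(y_demo))  # placeholder features; only the labels matter here

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Overall spam rate:", round(float(y_demo.mean()), 4))
print("Train spam rate:  ", round(float(y_tr.mean()), 4))
print("Test spam rate:   ", round(float(y_te.mean()), 4))
print("Spam messages in test split:", int(y_te.sum()))
```

Without stratification the number of spam messages in the test set would vary with the random seed, which makes per-class metrics on a small minority class noisier.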


## 8. TF–IDF vectorization

In [8]:
tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2)  # unigrams + bigrams
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print("TF–IDF matrix shape (train):", X_train_tfidf.shape)
print("TF–IDF matrix shape (test):", X_test_tfidf.shape)

TF–IDF matrix shape (train): (4457, 5000)
TF–IDF matrix shape (test): (1115, 5000)


## 9. Train Logistic Regression classifier

(Section 11 trains a `MultinomialNB` model on the same features as a Naive Bayes baseline.)


In [9]:
log_reg = LogisticRegression(
    max_iter=1000,
    n_jobs=-1,
    random_state=42
)

log_reg.fit(X_train_tfidf, y_train)

Parameter          Value
penalty            'l2'
dual               False
tol                0.0001
C                  1.0
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       42
solver             'lbfgs'
max_iter           1000


## 10. Evaluate model

In [10]:
y_pred = log_reg.predict(X_test_tfidf)
y_proba = log_reg.predict_proba(X_test_tfidf)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, target_names=["ham", "spam"]))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.9721973094170404

Classification report:
               precision    recall  f1-score   support

         ham       0.97      1.00      0.98       966
        spam       0.98      0.81      0.89       149

    accuracy                           0.97      1115
   macro avg       0.98      0.90      0.93      1115
weighted avg       0.97      0.97      0.97      1115


Confusion matrix:
 [[964   2]
 [ 29 120]]
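The spam-class figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes, in the order ham, spam) and derives the metrics.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class, order = [ham, spam].
cm = np.array([[964, 2],
               [29, 120]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:       {accuracy:.4f}")
print(f"Spam precision: {precision:.2f}")
print(f"Spam recall:    {recall:.2f}")
print(f"Spam F1:        {f1:.2f}")
```

The 29 false negatives are spam messages classified as ham, which is why spam recall (0.81) is noticeably lower than spam precision (0.98).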


## 11. Naive Bayes baseline for comparison

In [11]:
from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB()
nb_clf.fit(X_train_tfidf, y_train)

y_nb_pred = nb_clf.predict(X_test_tfidf)

print("Naive Bayes Accuracy:", accuracy_score(y_test, y_nb_pred))

Naive Bayes Accuracy: 0.9704035874439462


## 12. Save best model and vectorizer
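Logistic Regression scored slightly higher test accuracy than the Naive Bayes baseline (0.9722 vs. 0.9704), so we save it together with the fitted vectorizer:
- models/spam_model.pkl      (Logistic Regression)
- models/vectorizer.pkl      (TF–IDF vectorizer)

The Streamlit app loads both files. It must apply the same `preprocess_text` function to incoming messages before calling the vectorizer.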

In [12]:
os.makedirs(os.path.join("..", "models"), exist_ok=True)

MODEL_PATH = os.path.join("..", "models", "spam_model.pkl")
VECTORIZER_PATH = os.path.join("..", "models", "vectorizer.pkl")

joblib.dump(log_reg, MODEL_PATH)
joblib.dump(tfidf, VECTORIZER_PATH)

print("Saved model to:", MODEL_PATH)
print("Saved vectorizer to:", VECTORIZER_PATH)

Saved model to: ..\models\spam_model.pkl
Saved vectorizer to: ..\models\vectorizer.pkl
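One possible alternative, not used in this notebook, is to bundle the vectorizer and classifier in a scikit-learn `Pipeline` so that a single file holds both. The sketch below uses toy data, a temporary directory, and a hypothetical file name `spam_pipeline.pkl`; text cleaning with `preprocess_text` would still need to happen before the pipeline is called.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy, already-preprocessed messages (hypothetical); 1 = spam, 0 = ham.
docs = ["free prize call claim", "win free ticket call",
        "see later home", "ok thank see tomorrow"]
labels = [1, 1, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
pipe.fit(docs, labels)

# Save and reload the whole pipeline as one artifact.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    loaded_pipe = joblib.load(path)

new_docs = ["free ticket claim", "see tomorrow"]
print("Original pipeline:", pipe.predict(new_docs))
print("Reloaded pipeline:", loaded_pipe.predict(new_docs))
```

A single artifact removes the risk of pairing a model with a vectorizer from a different training run, at the cost of changing how the app loads its files.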


## 13. Quick test of saved artifacts

In [13]:
loaded_model = joblib.load(MODEL_PATH)
loaded_vectorizer = joblib.load(VECTORIZER_PATH)

test_sms = "Congratulations! You have won a free ticket. Call now to claim your prize."
processed_sms = preprocess_text(test_sms)
sms_vec = loaded_vectorizer.transform([processed_sms])
pred_label = loaded_model.predict(sms_vec)[0]
pred_proba = loaded_model.predict_proba(sms_vec)[0, 1]

label_inv_map = {0: "ham", 1: "spam"}

print("SMS:", test_sms)
print("Predicted:", label_inv_map[pred_label])
print("Spam probability:", pred_proba)

SMS: Congratulations! You have won a free ticket. Call now to claim your prize.
Predicted: spam
Spam probability: 0.8203460230544561
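For a binary logistic regression, `predict` corresponds to a 0.5 cut-off on the spam probability returned by `predict_proba`. If catching more spam matters more than avoiding false alarms, the cut-off can be lowered. The following sketch shows the trade-off on hypothetical labels and probabilities, not on the notebook's test set.

```python
import numpy as np

# Hypothetical labels (1 = spam) and spam probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.40, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90])


def precision_recall(threshold):
    """Return (precision, recall) for the spam class at a probability threshold."""
    pred = (proba >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    return tp / (tp + fp), tp / (tp + fn)


for threshold in (0.5, 0.3):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy data, lowering the threshold from 0.5 to 0.3 raises recall while precision drops; the right balance depends on how costly a missed spam message is compared with a blocked legitimate one.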
