Dataset (SMS Spam Collection, UCI Machine Learning Repository): https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

In [0]:
filepath = '/Volumes/sample_catalog/sample_schema/sample_volume/SMSSpamCollection'

In [0]:
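# Read the raw file as a single string column ("value"), one row per line
df = spark.read.text(filepath)
pdf = df.toPandas()
# Each line is "<label>\t<message>"; split on the first tab only so a tab inside a message cannot add columns
pdf[['label', 'message']] = pdf['value'].str.split('\t', n=1, expand=True)
pdf = pdf.drop(columns='value')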


In [0]:
pdf.head()

In [0]:
print(pdf['label'].value_counts())


The dataset is imbalanced: many more "ham" messages than "spam".

Imbalance causes models to favor the majority class, which leads to high accuracy but poor detection of "spam" (low recall).
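
To make the "high accuracy, low recall" point concrete, here is a minimal sketch of a classifier that always answers "ham". The class counts are assumed round numbers chosen for illustration, not values read from the dataset.

```python
# Illustrative class counts (assumed, not read from the dataset)
n_ham, n_spam = 4800, 750

# A "model" that always predicts ham gets every ham right and every spam wrong
correct = n_ham
accuracy = correct / (n_ham + n_spam)
spam_recall = 0 / n_spam  # no spam message is ever detected

print(f"accuracy:    {accuracy:.3f}")
print(f"spam recall: {spam_recall:.3f}")
```

Accuracy looks respectable even though the model never catches a single spam message, which is why the per-class metrics matter here.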

In [0]:
from sklearn.utils import resample
import pandas as pd

# Separate classes
spam_df = pdf[pdf['label'] == 'spam']
ham_df = pdf[pdf['label'] == 'ham']

# Upsample spam to match ham
spam_upsampled = resample(spam_df, 
                          replace=True,     # sample with replacement
                          n_samples=len(ham_df), 
                          random_state=42)

# Combine into balanced dataset
balanced_df = pd.concat([ham_df, spam_upsampled])


In [0]:
print(balanced_df['label'].value_counts())

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    balanced_df['message'], balanced_df['label'], test_size=0.2, random_state=42)
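
One thing to watch with upsampling: if it happens before the split, copies of the same spam message can end up in both the training and the test set. A possible alternative is to split first and upsample only the training portion. The sketch below illustrates that order with the standard library on made-up rows; `split_class` is a hypothetical helper, and in the notebook the same steps would map onto `train_test_split` followed by `resample`.

```python
import random

# Toy stand-in for the labelled messages; counts and texts are made up
ham = [("ham", f"ham message {i}") for i in range(40)]
spam = [("spam", f"spam message {i}") for i in range(10)]

rng = random.Random(42)

def split_class(rows, rng):
    """Hypothetical helper: shuffle one class and hold out 20% of it for testing."""
    rows = rows[:]
    rng.shuffle(rows)
    n_test = len(rows) // 5
    return rows[n_test:], rows[:n_test]

# 1) Split each class first, so the test set keeps the original class ratio
train_ham, test_ham = split_class(ham, rng)
train_spam, test_spam = split_class(spam, rng)
test_rows = test_ham + test_spam

# 2) Upsample spam with replacement, using training rows only
spam_upsampled = rng.choices(train_spam, k=len(train_ham))
balanced_train = train_ham + spam_upsampled

# No message appears in both the balanced training set and the test set
overlap = {msg for _, msg in balanced_train} & {msg for _, msg in test_rows}
print(len(train_ham), len(spam_upsampled), len(overlap))
```

With this order the test set stays untouched by resampling, so its scores reflect messages the model has not seen.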


In [0]:
# from sklearn.feature_extraction.text import TfidfVectorizer

# vectorizer = TfidfVectorizer(stop_words='english')
# X_train_tfidf = vectorizer.fit_transform(X_train)
# X_test_tfidf = vectorizer.transform(X_test)


import re
from sklearn.feature_extraction.text import TfidfVectorizer

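# Custom tokenizer: keep only lowercase words of 3 or more English letters
def clean_tokenizer(text):
    return re.findall(r'\b[a-z]{3,}\b', text.lower())

# Rebuild the vectorizer with the custom tokenizer.
# token_pattern=None silences the warning that the default pattern is unused.
vectorizer = TfidfVectorizer(tokenizer=clean_tokenizer, token_pattern=None, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary and IDF from training data only
X_test_tfidf = vectorizer.transform(X_test)        # reuse the fitted vocabulary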
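
To see what the custom tokenizer keeps from a message, the following sketch runs the same regular expression on one example and then drops a few stop words. `SAMPLE_STOP_WORDS` is a small hand-picked set used only for illustration; scikit-learn's built-in English list is much longer.

```python
import re

# Same regular expression as clean_tokenizer in the cell above
def clean_tokenizer(text):
    return re.findall(r'\b[a-z]{3,}\b', text.lower())

# Hand-picked subset for illustration (assumption, not the full built-in list)
SAMPLE_STOP_WORDS = {"the", "and", "you", "for", "are"}

message = "Congratulations! You won a FREE ticket to the Bahamas in 2024"
tokens = clean_tokenizer(message)
kept = [t for t in tokens if t not in SAMPLE_STOP_WORDS]

print(tokens)  # short words ("a", "to", "in") and digits are already gone
print(kept)    # stop words removed as well
```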



ML models can't work directly with text, so each message is first converted into a numerical vector. The vectors for all messages form the feature matrix.

We use TF-IDF (Term Frequency–Inverse Document Frequency), which scores how important each word is to each message:

TF (Term Frequency): how frequently a word appears in a message.

IDF (Inverse Document Frequency): how rare that word is across all messages.

Together, TF-IDF captures:

"How important is a word in this message, compared to the whole dataset?"

Why stop_words='english'?
This removes common words such as "the", "is", "and", "a", and "you".

These don’t help the model learn much because they appear in both spam and ham.
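
The TF-IDF weighting described above can be worked through by hand on a tiny corpus. This is a minimal sketch, assuming the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation that scikit-learn documents as its defaults; the three messages are made up, and the tokenizer here is a plain whitespace split rather than the notebook's `clean_tokenizer`.

```python
import math
from collections import Counter

# Made-up mini corpus, tokenized by whitespace for simplicity
docs = [
    "free prize call now",
    "call me after lunch",
    "lunch meeting moved",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF scores for one tokenized message."""
    tf = Counter(tokens)
    raw = {
        w: count * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w, count in tf.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

scores = tfidf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:6s} {score:.3f}")
```

In the first message, "call" also occurs in another message, so it scores lower than "free", "prize", and "now", which occur only once in the corpus.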

In [0]:
print(vectorizer.get_feature_names_out())

In [0]:
# Pick one message to explain
i = 0  # Change this to see different messages

row = X_train_tfidf[i].toarray()[0]
feature_names = vectorizer.get_feature_names_out()

# Get non-zero TF-IDF scores
nonzero_words = [(feature_names[j], row[j]) for j in row.nonzero()[0]]

# Sort by score
df_words = pd.DataFrame(nonzero_words, columns=["word", "tfidf_score"]).sort_values(by="tfidf_score", ascending=False)

df_words.head(10)  # Top 10 important words for message i


In [0]:
df_words.head(10).plot.bar(x='word', y='tfidf_score', title='Top Words in Message', legend=False)


TF-IDF gives low weight to words that appear in most messages and high weight to words that are frequent in one message but rare across the dataset. It does not look at the labels: the classifier trained next is what learns that a word like 'free' appears in many spam messages but in few normal ones.

In [0]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)



We use Logistic Regression, a simple and effective classification model.

It learns patterns in word usage that separate "spam" from "ham".

Despite the name, logistic regression is a classification algorithm — not regression!

It works like this:
It finds a boundary line (or surface) that best separates the two classes.

The model learns weights for each word (feature) during training.

It uses the logistic (sigmoid) function to estimate the probability that a message is spam.

Imagine every word in a message votes whether it thinks the message is spam or not:

Some words like “Congratulations”, “Free”, “Win” vote more heavily toward spam.

Words like “Lunch”, “Meeting”, or “Hey” lean toward ham.

The model adds up the weighted votes, passes the total through the sigmoid function to get a probability, and calls the message spam if that probability exceeds a threshold (0.5 by default).
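
The voting picture can be sketched in a few lines. The weights and bias below are invented for illustration and are not taken from the trained `clf`; the feature values stand in for TF-IDF scores.

```python
import math

# Hypothetical per-word weights: positive leans spam, negative leans ham
weights = {
    "free": 2.1, "win": 1.7, "congratulations": 1.5,
    "lunch": -1.4, "meeting": -1.8, "hey": -0.9,
}
bias = -0.3  # hypothetical intercept

def spam_probability(features):
    """Weighted sum of feature values, squashed to (0, 1) by the sigmoid."""
    score = bias + sum(weights.get(w, 0.0) * v for w, v in features.items())
    return 1 / (1 + math.exp(-score))

p_spammy = spam_probability({"free": 0.6, "win": 0.5})
p_hammy = spam_probability({"lunch": 0.7, "meeting": 0.7})

for name, p in [("spammy", p_spammy), ("hammy", p_hammy)]:
    label = "spam" if p > 0.5 else "ham"
    print(f"{name}: p(spam) = {p:.2f} -> {label}")
```

A weighted sum of exactly zero maps to a probability of 0.5, which is why the 0.5 threshold corresponds to the decision boundary.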

In [0]:
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))


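This gives precision, recall, and F1-score for each class.

Pay attention to how well it performs on both "spam" and "ham".

With balanced training data, you’ll likely see higher recall for spam and more balanced F1-scores. Because spam was upsampled before the train/test split, copies of the same spam message can land in both sets, so the spam scores here are likely optimistic.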
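
For reference, the three metrics in the report come from simple counts. The sketch below computes them for the spam class from hypothetical confusion counts; the numbers are invented and are not results from this notebook.

```python
# Hypothetical counts for the "spam" class (not results from this notebook)
tp = 90   # spam correctly flagged as spam
fp = 10   # ham wrongly flagged as spam
fn = 30   # spam missed and labelled ham

precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)      # of all real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
```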

In [0]:
sample = ["Congratulations! You won a free ticket to Bahamas!",
          "click this link",
          "review the contract - thomas@thomas",
          "review the contract and click the link thomas@thomas@thomas"]

sample_tfidf = vectorizer.transform(sample)
print(clf.predict(sample_tfidf))

In [0]:
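pred_proba = clf.predict_proba(sample_tfidf)
# Columns follow clf.classes_, so look up the spam column instead of assuming index 1
spam_idx = list(clf.classes_).index('spam')
for msg, prob in zip(sample, pred_proba):
    print(f"{msg} => Spam probability: {prob[spam_idx]:.2f}")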
