# Spam vs Ham Text Classifier

This project builds a simple text classifier that distinguishes spam from ham (normal) messages. I use the SMS Spam Collection dataset and compare two TF-IDF feature setups (unigrams only, and unigrams plus bigrams) with Logistic Regression.

## Introduction

The goal here is to build a classifier that can automatically detect spam messages. This is a binary classification problem where each message is either "spam" or "ham" (not spam).

I'll start by loading the data, cleaning it up a bit, then extracting features using TF-IDF. After that, I'll train a classifier and see how well it performs.

## Load Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

df = pd.read_csv('spam.csv', encoding='latin-1')
df = df[['v1', 'v2']]
df.columns = ['label', 'text']

print("First few rows:")
print(df.head())
print("\nClass distribution:")
print(df['label'].value_counts())

First few rows:
  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

Class distribution:
label
ham     4825
spam     747
Name: count, dtype: int64


## Preprocessing

Before training, I need to clean up the text. I'll convert everything to lowercase, remove punctuation, and get rid of extra spaces. I'm keeping it simple here - no fancy techniques like stemming.

In [2]:
import string

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = ' '.join(text.split())
    return text

df['text_clean'] = df['text'].apply(preprocess)

print("Example before preprocessing:")
print(df['text'].iloc[0])
print("\nExample after preprocessing:")
print(df['text_clean'].iloc[0])

Example before preprocessing:
Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

Example after preprocessing:
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat


## Feature Extraction

Now I'll use TF-IDF to convert the text into numbers that the model can work with. TF-IDF gives a word a higher weight when it appears often within a message but in few messages overall, so words that show up almost everywhere count for less.

I'll split the data first, then fit the vectorizer only on the training set to avoid data leakage.

In [3]:
X = df['text_clean']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=3000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Training set shape: {X_train_tfidf.shape}")
print(f"Test set shape: {X_test_tfidf.shape}")

Training set shape: (4457, 3000)
Test set shape: (1115, 3000)
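
To make the weighting concrete, here is a minimal sketch on a made-up three-message corpus (the messages are hypothetical and not taken from the dataset). With scikit-learn's default settings, a word that occurs in every message should end up with a lower weight than a word that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
toy_corpus = [
    "call now to win a prize",
    "call me later",
    "call me when you are home",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_corpus)

# Look up the weights of two words in the first message
vocab = toy_vectorizer.vocabulary_
first_row = toy_tfidf[0].toarray().ravel()
call_weight = first_row[vocab["call"]]    # "call" appears in every message
prize_weight = first_row[vocab["prize"]]  # "prize" appears in one message only

print(f"call:  {call_weight:.3f}")
print(f"prize: {prize_weight:.3f}")
```

Note that the default tokenizer also drops single-character tokens such as "a", so they never reach the vocabulary.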


## Train Model

I'll start with Logistic Regression since it usually works well for text classification and is easy to understand.

In [4]:
model = LogisticRegression(random_state=42)
model.fit(X_train_tfidf, y_train)

preds = model.predict(X_test_tfidf)

## Evaluation

Let me check how well the model performs on the test set.

In [5]:
accuracy = accuracy_score(y_test, preds)
precision = precision_score(y_test, preds, pos_label='spam')
recall = recall_score(y_test, preds, pos_label='spam')
f1 = f1_score(y_test, preds, pos_label='spam')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, preds))

Accuracy: 0.9686
Precision: 1.0000
Recall: 0.7667
F1 Score: 0.8679

Confusion Matrix:
[[965   0]
 [ 35 115]]
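
As a sanity check, the metrics can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's default layout (rows are true labels, columns are predicted labels, labels sorted alphabetically, so ham comes before spam). The variable names are my own, and the "always predict ham" baseline is an extra comparison that is not part of the cells above.

```python
# Counts read off the confusion matrix above
tn, fp = 965, 0    # true ham:  predicted ham, predicted spam
fn, tp = 35, 115   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy_check = (tp + tn) / total
precision_check = tp / (tp + fp)
recall_check = tp / (tp + fn)
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

# Accuracy of a trivial model that labels every test message as ham
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy_check:.4f}")
print(f"Precision: {precision_check:.4f}")
print(f"Recall:    {recall_check:.4f}")
print(f"F1 Score:  {f1_check:.4f}")
print(f"Baseline:  {baseline_accuracy:.4f}")
```

The numbers match the output above. The comparison with the baseline also shows why accuracy alone is flattering on imbalanced data: the model never flags ham as spam, but it misses 35 of the 150 spam messages in the test set.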


## Experiments

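I want to see if adding bigrams (pairs of adjacent words) helps improve the results. With `ngram_range=(1, 2)` the vectorizer keeps the single words and adds the bigrams on top, while the vocabulary stays capped at 3000 features. Let me try that and compare.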

In [6]:
vectorizer_bigram = TfidfVectorizer(max_features=3000, ngram_range=(1, 2))
X_train_bigram = vectorizer_bigram.fit_transform(X_train)
X_test_bigram = vectorizer_bigram.transform(X_test)

model_bigram = LogisticRegression(random_state=42)
model_bigram.fit(X_train_bigram, y_train)
preds_bigram = model_bigram.predict(X_test_bigram)

accuracy_bigram = accuracy_score(y_test, preds_bigram)
f1_bigram = f1_score(y_test, preds_bigram, pos_label='spam')

print("Comparison:")
print(f"Unigrams - Accuracy: {accuracy:.4f}, F1: {f1:.4f}")
print(f"Bigrams  - Accuracy: {accuracy_bigram:.4f}, F1: {f1_bigram:.4f}")

Comparison:
Unigrams - Accuracy: 0.9686, F1: 0.8679
Bigrams  - Accuracy: 0.9695, F1: 0.8722
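
To see what `ngram_range=(1, 2)` actually feeds into the model, here is a small sketch that runs the vectorizer's analyzer on a single made-up message (the message is hypothetical). It uses `build_analyzer()`, which returns the function the vectorizer applies to each document before counting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same n-gram setting as in the experiment; no fitting is needed for this
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

# Hypothetical message, only for illustration
features = analyzer("free entry now")
print(features)
```

The result contains both the single words and the adjacent word pairs, so the "Bigrams" row in the comparison is really unigrams plus bigrams competing for the same 3000 feature slots.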


## Conclusion

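This was a good learning project for me. The Logistic Regression model reached 96.9% accuracy on the test set with a precision of 1.0 (no ham message was flagged as spam), but recall was only 76.7%: 35 of the 150 spam messages in the test set were missed.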

What worked well:
- TF-IDF was effective for converting text to features
- Simple preprocessing was enough to get good results
- Logistic Regression was straightforward and performed well

What didn't work as well:
- Adding bigrams barely changed the results (accuracy 0.9686 vs 0.9695, F1 0.8679 vs 0.8722), which was a bit surprising
- The model might struggle with messages that use a lot of slang or abbreviations

What I would improve next:
- Try more advanced preprocessing techniques like handling numbers or special characters differently
- Experiment with other classifiers like Random Forest or SVM
- Deal with the class imbalance (4825 ham vs 747 spam), for example with a stratified split or class weights, and maybe collect more data
- Add cross-validation to get more reliable performance estimates
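
As a starting point for the cross-validation item, here is a minimal sketch, not something I have run on the real data. It wraps the vectorizer and the classifier in a scikit-learn pipeline so that TF-IDF is refit on the training part of every fold, which keeps the same no-leakage rule as the train/test split above. The tiny corpus is a made-up stand-in so the snippet runs on its own; in the notebook, `texts` and `labels` would be `df['text_clean']` and `df['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data so the sketch is self-contained
spam_texts = [f"win a free prize now call {i}" for i in range(10)]
ham_texts = [f"see you at home later tonight {i}" for i in range(10)]
texts = spam_texts + ham_texts
labels = ["spam"] * 10 + ["ham"] * 10

# Vectorizer and classifier are fit together inside each fold
pipeline = make_pipeline(
    TfidfVectorizer(max_features=3000),
    LogisticRegression(random_state=42),
)

# Stratified folds keep the spam/ham ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.4f}")
```

On the real dataset, the spread of the fold scores would give a better idea of how much the single 80/20 split result can be trusted.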