Project 3: Spam Email Classifier

In this project, I built a machine learning model that classifies text messages as spam or not spam (ham).
The goal is to understand how textual data is represented numerically and how linear classifiers can be used for text classification.

This is a supervised binary classification problem.

First, I import the libraries needed for handling tabular text data and numerical processing.
Scikit-learn, imported in later cells, provides the utilities for text vectorization and model training.

In [1]:
import pandas as pd
import numpy as np

Next, I load a dataset of text messages and their labels.
Each message is labeled as either spam or ham (not spam).
This labeled data makes it possible to train a supervised learning model.

In [2]:
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_table(url, header=None, names=["label", "message"])

data.head()

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Before training a model, it is important to understand the dataset.
I checked how many spam and ham messages are present to see if the data is balanced.

In [3]:
data["label"].value_counts()

label,count
ham,4825
spam,747
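
As a rough check on what these counts imply, the sketch below recomputes the class shares from the two numbers shown above. The variable names (`label_counts`, `spam_share`, `majority_baseline`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
label_counts = pd.Series({"ham": 4825, "spam": 747})

# Fraction of all messages that are spam.
spam_share = label_counts["spam"] / label_counts.sum()

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = label_counts.max() / label_counts.sum()

print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Spam makes up about 13.4% of the messages, so a model that always predicts ham would already reach about 86.6% accuracy. Any classifier trained here should be judged against that baseline.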


Machine learning models cannot directly work with raw text.
Text must be converted into numerical form using a process called vectorization.
I start with a simple Bag-of-Words representation, in which each message becomes a vector of word counts over a shared vocabulary.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

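# Build the vocabulary and the count matrix; common English stop words are dropped.
# Note: the vectorizer is fit on all messages, so test-set words enter the vocabulary.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(data["message"])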

The target labels are converted into numerical form.
Spam messages are encoded as 1, and ham messages as 0.

In [5]:
y = data["label"].map({"ham": 0, "spam": 1})

The dataset is split into training and testing sets.
This helps evaluate how well the model generalizes to unseen messages.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
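
Because spam is the minority class, a purely random split can leave the test set with a somewhat different spam share than the training set. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (`y_demo`, roughly 13% positive) and placeholder features (`X_demo`), both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 87 negatives and 13 positives, loosely mirroring the spam share.
y_demo = np.array([0] * 87 + [1] * 13)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder features

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```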

I use Logistic Regression as a linear classifier.
It learns one weight per vocabulary term plus an intercept, which together define a linear decision boundary separating spam from non-spam messages.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
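
To illustrate what "linear" means here, the sketch below fits Logistic Regression on a tiny invented feature matrix and checks that the predicted spam probability is the sigmoid of the linear score `w·x + b`. The data and the names `X_toy`, `y_toy`, and `toy_model` are assumptions for illustration, and default scikit-learn settings for a binary problem are assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features: [count of "free", count of "lunch"] for six messages.
X_toy = np.array([[2, 0], [1, 0], [3, 0], [0, 1], [0, 2], [0, 1]])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = ham

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Linear score for each message: w . x + b
scores = toy_model.decision_function(X_toy)

# Sigmoid of the score should match the model's spam probability.
probs_manual = 1.0 / (1.0 + np.exp(-scores))
probs_sklearn = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(probs_manual, probs_sklearn))
print("weights:", toy_model.coef_[0], "intercept:", toy_model.intercept_[0])
```

A positive weight pushes a message toward spam and a negative weight toward ham. The decision boundary is where the score equals zero, which corresponds to a probability of 0.5.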

Then, I evaluate the model using accuracy, precision, recall, and F1-score.
These metrics give a fuller picture than accuracy alone, especially here, where ham messages (4,825) far outnumber spam messages (747).

In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
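
The relationship between these metrics can be sketched by computing them by hand from a confusion matrix. The labels below (`y_true_toy`, `y_pred_toy`) are invented for illustration and are not the model's actual predictions.

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score
)

# Invented labels: 6 ham (0) and 4 spam (1).
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, f1, accuracy)

# The hand-computed values agree with scikit-learn's helpers.
assert abs(precision - precision_score(y_true_toy, y_pred_toy)) < 1e-12
assert abs(recall - recall_score(y_true_toy, y_pred_toy)) < 1e-12
assert abs(f1 - f1_score(y_true_toy, y_pred_toy)) < 1e-12
```

In this toy case accuracy is 0.7 even though only half of the spam messages are caught (recall 0.5), which is why accuracy alone can be misleading when one class dominates.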

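Next, I replace the raw word counts with TF-IDF features.
TF-IDF weights each word by how often it appears in a message, scaled down by how many messages contain it, so distinctive words count for more and very common words count for less.
This often improves text classification performance.
I retrain the same Logistic Regression model on these features with the same split settings, so the two representations can be compared.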

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: fitting on all messages before the split lets test-set text influence the IDF weights.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(data["message"])

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)

# Refit the same LogisticRegression instance on the TF-IDF features.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
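
In the cells above, the vectorizer is fit on every message before the split, so the test messages contribute to the vocabulary and IDF weights. One common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that vectorization is learned from the training messages only. The sketch below shows that arrangement on a few invented messages. It is not the approach used in this notebook, and `toy_messages`, `toy_labels`, and `spam_pipeline` are assumed names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented messages: 1 = spam, 0 = ham.
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free reward today",
    "urgent prize waiting claim now",
    "are we still on for lunch",
    "see you at the office tomorrow",
    "can you call me tonight",
    "thanks for the lift home",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping class proportions with stratify.
msgs_train, msgs_test, labels_train, labels_test = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training messages only.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
spam_pipeline.fit(msgs_train, labels_train)

# At prediction time the same fitted vectorizer transforms the unseen text.
predictions = spam_pipeline.predict(msgs_test)
print(list(zip(msgs_test, predictions)))
```

With this layout, the pipeline can be passed raw strings directly, and the same object could be handed to cross-validation utilities without refitting the vectorizer by hand.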

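This project demonstrates a basic text classification workflow: convert messages into numerical vectors, train a linear classifier, and evaluate it with metrics suited to imbalanced classes.
Key concepts include Bag-of-Words and TF-IDF vectorization, Logistic Regression, and precision, recall, and F1-score.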
The same pipeline can be extended to sentiment analysis, topic classification, and document filtering.