# Joe Garcia
# 12/16/25
# Data 620 Web Analytics
# Week_10_Assignment_Document_Classification


This project uses the Spambase dataset to build a model that classifies emails as spam or non-spam. The goal is to train a simple but effective classifier on numeric email features and evaluate it fairly by splitting the data into training, dev, and test sets. To keep the workflow consistent, we use a scikit-learn pipeline that standardizes the features and fits a Logistic Regression model, then report performance with a confusion matrix and classification metrics.

# Loading the Spambase Dataset and Splitting Features vs. Labels

The first cell imports the libraries used throughout the notebook. The second cell pulls the Spambase dataset directly from the UCI Machine Learning Repository and loads it into a pandas DataFrame. Since the file doesn’t include column names, it’s read with header=None, so the columns are numbered automatically starting at 0. The dataset is then split into inputs and outputs: X contains the first 57 columns, which are the numeric features that describe each email, and y is the last column, which is the class label (1 means spam and 0 means non-spam, or ham).

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report


In [13]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data" 
df = pd.read_csv(url, header=None) 

X = df.iloc[:, :-1] # 57 numeric features 
y = df.iloc[:, -1] # class label (1=spam, 0=non-spam)
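
To make the effect of header=None and the iloc slicing concrete without downloading anything, here is a minimal sketch on a made-up three-row file. The values in `csv_text` and the names `toy`, `toy_X`, and `toy_y` are illustrative assumptions, not part of the Spambase data.

```python
import io

import pandas as pd

# Made-up stand-in for spambase.data: 3 feature columns + 1 label column, no header row.
csv_text = "0.0,0.64,0.32,1\n0.21,0.0,0.5,1\n0.0,0.0,0.0,0\n"

# header=None tells pandas the first line is data, so columns get integer names 0..3.
toy = pd.read_csv(io.StringIO(csv_text), header=None)

toy_X = toy.iloc[:, :-1]  # every column except the last -> features
toy_y = toy.iloc[:, -1]   # last column only -> class label

print(list(toy.columns))  # [0, 1, 2, 3]
print(toy_X.shape)        # (3, 3)
print(toy_y.tolist())     # [1, 1, 0]
```

On the real file the same slicing should leave 57 feature columns, so checking that `df.shape` is `(4601, 58)` right after loading is a cheap way to catch a truncated or changed download.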


# Splitting the Data into Train, Dev, and Test Sets

This section splits the dataset into three parts so we can train the model, tune it, and then evaluate it fairly. First, train_test_split separates out 60% of the data for training and keeps the remaining 40% in a temporary set. Then that temporary set is split again into 20% dev and 20% test (each half of the 40%). The stratify= option is used in both splits so the spam vs. non-spam proportions stay consistent across all three sets, and random_state=42 ensures the split is reproducible.

In [8]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.40, random_state=42, stratify=y
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)
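
The two-step split can be checked without the real features. The sketch below assumes the class counts given in the UCI description of Spambase (4,601 emails, 1,813 of them spam) and uses a dummy feature column; `X_demo`, `y_demo`, and the other names are hypothetical stand-ins chosen so they don't overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the class counts assumed from the UCI description.
y_demo = np.array([1] * 1813 + [0] * 2788)
# Dummy single-column feature matrix: only the split sizes matter here.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same two calls as in the cell above: 60% train, then the remaining 40% halved.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=42, stratify=y_demo
)
X_dv, X_te, y_dv, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)

# Report the size and spam rate of each part; stratify should keep the rates close.
for name, labels in [("train", y_tr), ("dev", y_dv), ("test", y_te)]:
    print(f"{name:5s} n={len(labels):4d} spam_rate={labels.mean():.3f}")
```

Because 4,601 does not divide evenly, the parts are not exactly 60/20/20: the dev and test sizes should come out to 920 and 921 rows, which is what the support totals in the reports show.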

# Building and Evaluating a Logistic Regression Pipeline on the Dev Set


Here, a scikit-learn Pipeline is created to keep preprocessing and modeling in one clean workflow. The StandardScaler() step rescales each of the 57 numeric features to zero mean and unit variance, which helps the Logistic Regression solver converge and keeps its default L2 regularization from penalizing features unevenly just because of their units. The classifier is a LogisticRegression model with max_iter=2000 to give it enough iterations to converge. After the pipeline is fit on the training data, the model is evaluated on the dev set (the set used during development), and the printed results include a confusion matrix plus a classification report showing precision, recall, and F1-score.
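
As a small illustration of what the scaling step computes, the sketch below compares StandardScaler with the same calculation done by hand. The matrix `X_toy` is made up for the example and has nothing to do with the Spambase features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix with two features on very different scales.
X_toy = np.array([[0.0, 10.0], [1.0, 200.0], [2.0, 3000.0], [3.0, 40.0]])

# fit() learns the per-column mean and standard deviation; transform() applies them.
scaler = StandardScaler().fit(X_toy)
scaled = scaler.transform(X_toy)

# Same computation by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

Inside a Pipeline, the means and standard deviations are learned only from the data passed to fit(), and the same values are reused when predict() is called on the dev or test set, so no statistics from the held-out data leak into training.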

In [4]:
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000))
])

model.fit(X_train, y_train)

# --- Evaluate on DEV set (used during development) ---
dev_pred = model.predict(X_dev)
print("DEV SET RESULTS")
print(confusion_matrix(y_dev, dev_pred))
print(classification_report(y_dev, dev_pred, digits=3))

DEV SET RESULTS
[[535  23]
 [ 45 317]]
              precision    recall  f1-score   support

           0      0.922     0.959     0.940       558
           1      0.932     0.876     0.903       362

    accuracy                          0.926       920
   macro avg      0.927     0.917     0.922       920
weighted avg      0.926     0.926     0.926       920
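
The spam-class row of the report can be reproduced by hand from the confusion matrix, which helps when reading it. This is a small arithmetic sketch using the four dev-set counts printed above; scikit-learn puts true classes on the rows and predicted classes on the columns.

```python
# Counts copied from the dev-set confusion matrix (rows = true class, columns = predicted class).
tn, fp = 535, 23   # true non-spam: kept as non-spam / wrongly flagged as spam
fn, tp = 45, 317   # true spam: missed / correctly flagged

precision = tp / (tp + fp)   # of the emails flagged as spam, how many really were spam
recall = tp / (tp + fn)      # of the real spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.932 recall=0.876 f1=0.903 accuracy=0.926
```

The lower recall comes from the 45 spam emails that slipped through, roughly twice the 23 legitimate emails that were wrongly flagged.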



# Final Evaluation on the Test Set

After finishing development on the dev set, this section evaluates the trained pipeline on the test set, which is meant to be the “final exam” for the model. The code generates predictions for X_test, then prints a confusion matrix to show how many emails were correctly or incorrectly classified as spam vs. non-spam. It also prints a classification report with precision, recall, and F1-score, giving a clearer picture of performance than accuracy alone.

In [5]:
test_pred = model.predict(X_test)
print("\nTEST SET RESULTS")
print(confusion_matrix(y_test, test_pred))
print(classification_report(y_test, test_pred, digits=3))


TEST SET RESULTS
[[528  30]
 [ 46 317]]
              precision    recall  f1-score   support

           0      0.920     0.946     0.933       558
           1      0.914     0.873     0.893       363

    accuracy                          0.917       921
   macro avg      0.917     0.910     0.913       921
weighted avg      0.917     0.917     0.917       921
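
One way to see why accuracy alone can mislead on this data is to compare the model with a trivial baseline. The sketch below assumes a hypothetical classifier that labels every email as non-spam and uses only the counts printed in the test-set report.

```python
# Class supports copied from the test-set report: 558 non-spam and 363 spam emails.
n_ham, n_spam = 558, 363

# Hypothetical baseline that labels every email as non-spam.
baseline_accuracy = n_ham / (n_ham + n_spam)
baseline_spam_recall = 0 / n_spam  # it never catches a single spam email

# Accuracy of the fitted pipeline, from the diagonal of the test-set confusion matrix.
model_accuracy = (528 + 317) / (n_ham + n_spam)

print(f"baseline accuracy={baseline_accuracy:.3f}, spam recall={baseline_spam_recall:.3f}")
print(f"model accuracy={model_accuracy:.3f}")
# baseline accuracy=0.606, spam recall=0.000
# model accuracy=0.917
```

A model that does nothing useful would still be right about 61% of the time here, so the per-class precision and recall are what show that the pipeline is actually finding spam.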



# Predicting Spam Probability for New Emails

This section treats a few rows from the test set as “new” unseen emails and shows how the model behaves on individual examples. It selects the first five test records, then uses predict_proba() to return the model’s estimated probability for each class (non-spam vs. spam) and predict() to return the final predicted label. The results are organized into a small DataFrame so you can quickly compare the predicted class with the probability scores, which is useful when you want more detail than a simple 0/1 prediction.

In [6]:
new_docs = X_test.iloc[:5]                 # these are "new/unseen" to the model
new_probs = model.predict_proba(new_docs)  # probability of each class
new_labels = model.predict(new_docs)

results = pd.DataFrame({
    "pred_label": new_labels.astype(int),
    "prob_nonspam": new_probs[:, 0],
    "prob_spam": new_probs[:, 1],
})
print("\nNEW DOCUMENT PREDICTIONS (sample)")
print(results)



NEW DOCUMENT PREDICTIONS (sample)
   pred_label  prob_nonspam  prob_spam
0           0      0.889214   0.110786
1           1      0.000926   0.999074
2           0      0.999996   0.000004
3           0      0.930433   0.069567
4           1      0.048542   0.951458
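
The probabilities can also be used to route borderline emails to a human instead of trusting the label outright. The sketch below is one possible approach, not part of the original workflow: the 10% to 90% review band (`LOW`, `HIGH`) and the `needs_review` column are hypothetical choices, and the 0.5 cutoff is assumed to mirror what predict() effectively applies for a binary logistic regression.

```python
import pandas as pd

# Spam probabilities copied from the sample output above.
sample = pd.DataFrame({"prob_spam": [0.110786, 0.999074, 0.000004, 0.069567, 0.951458]})

# Hypothetical review band: anything between 10% and 90% spam probability counts as uncertain.
LOW, HIGH = 0.10, 0.90

# Label at the usual 0.5 cutoff, then flag rows whose probability falls inside the band.
sample["pred_label"] = (sample["prob_spam"] >= 0.5).astype(int)
sample["needs_review"] = sample["prob_spam"].between(LOW, HIGH)

print(sample)
```

With this band only the first email (about 11% spam probability) is flagged; the other four are far enough from the cutoff to be accepted as labeled. Widening or narrowing the band trades reviewer workload against the risk of acting on a shaky prediction.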


# Conclusion and Discussion

In this project, we trained a Logistic Regression model to classify emails as spam or non-spam using the Spambase dataset. We split the data into training, dev, and test sets so the dev set could be used for checks during development while the test set was held back for a fair final check. Putting scaling and the classifier into a pipeline also kept the workflow consistent and clean.

Overall, the dev and test results were similar (accuracy 0.926 vs. 0.917, spam F1-score 0.903 vs. 0.893), which suggests the model generalizes well to unseen emails from the same dataset. In both sets the most common error was missed spam (45 and 46 emails) rather than legitimate email flagged as spam (23 and 30). The probability outputs are also useful because they show confidence, not just a yes/no label, which can help flag uncertain emails for review.