<a href="https://colab.research.google.com/github/Lfirenzeg/msds620/blob/main/Week%206/Data_620_LFMG_Document_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data 620
## Assignment: Document Classification

### By Luis Munoz Grass

#### Instructions

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

## Solution

In [1]:
# to install dependencies
!pip install ucimlrepo scikit-learn pandas

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [38]:
# imports
import pandas as pd
import numpy as np
import joblib
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    roc_auc_score
)


## Method

In [27]:
# fetching the dataset
spambase = fetch_ucirepo(id=94)
X = spambase.data.features
y = spambase.data.targets

print(f"Dataset shape: {X.shape}")
print("Class distribution:")
print(y.value_counts())

Dataset shape: (4601, 57)
Class distribution:
Class
0        2788
1        1813
Name: count, dtype: int64


In [15]:
# Variable types
print(spambase.data.features.dtypes)

word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl                 float64
word_freq_ge

It looks like the features in the data set are manually selected lexical and formatting cues. Instead of raw text, each message is represented by the frequencies of "spammy" words (like free, credit), numbers (85, 415, 650), and punctuation characters (!, $, #), plus capitalization patterns (average run length, longest run).
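
As a rough illustration of how these cue families show up in the column names, the sketch below groups names by prefix. The `columns` list is a hand-picked subset used only for illustration, and `feature_family` is a hypothetical helper, not part of the notebook; it assumes the real `X.columns` follows the same `word_freq_`, `char_freq_`, and `capital_run_length_` naming seen in the outputs.

```python
from collections import defaultdict

# Hand-picked subset of Spambase column names, for illustration only.
columns = [
    "word_freq_free", "word_freq_credit", "word_freq_415", "word_freq_85",
    "char_freq_!", "char_freq_$",
    "capital_run_length_average", "capital_run_length_longest",
]

def feature_family(name):
    """Return the prefix that identifies the kind of cue (hypothetical helper)."""
    for prefix in ("word_freq", "char_freq", "capital_run_length"):
        if name.startswith(prefix):
            return prefix
    return "other"

# Collect column names under their family prefix.
families = defaultdict(list)
for col in columns:
    families[feature_family(col)].append(col)

for family, members in families.items():
    print(f"{family}: {len(members)} features")
```

On the full data set the same loop could be run over `X.columns` to see how many features each family contributes.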

In [4]:
# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    stratify=y
)

In [5]:
# building a pipeline that both scales the features and applies logistic regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear', max_iter=1000))
])

In [7]:
# hyperparameter tuning to find best C via 5‑fold CV optimizing F1
param_grid = { 'clf__C': [0.01, 0.1, 1, 10] }
grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)
grid.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


  y = column_or_1d(y, warn=True)


In [10]:
print("Best C:", grid.best_params_['clf__C'])


Best C: 10


## Results

In [8]:
# now we can evaluate on the hold‑out set
y_pred = grid.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['ham','spam']))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.9262

Classification Report:
              precision    recall  f1-score   support

         ham       0.93      0.95      0.94       558
        spam       0.92      0.89      0.90       363

    accuracy                           0.93       921
   macro avg       0.93      0.92      0.92       921
weighted avg       0.93      0.93      0.93       921

Confusion Matrix:
 [[530  28]
 [ 40 323]]


Out of 921 test messages, the model correctly labels 853 (530 ham and 323 spam). That's a strong baseline, although accuracy alone can mask the performance for each class, especially with a roughly 60/40 ham/spam split.

Ham precision is 0.93, so of all messages flagged as ham 93% truly were ham.
Spam precision is 0.92, so of all messages flagged as spam, 92% truly were spam.

F1 scores are 0.94 for ham and 0.90 for spam, balancing precision and recall. The slightly lower spam F1 comes mainly from spam recall (0.89): the model misses 40 of the 363 spam messages.
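
To make the arithmetic behind these numbers explicit, here is a minimal sketch that recomputes them from the confusion matrix printed above. The `majority_baseline` line is an added point of comparison (the accuracy of always predicting ham), not something computed in the notebook.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[530, 28],
               [40, 323]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
majority_baseline = cm[0].sum() / cm.sum()  # accuracy of always predicting ham

# Spam is the positive class (label 1).
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

# Ham metrics treat label 0 as the class of interest.
ham_precision = tn / (tn + fn)
ham_recall = tn / (tn + fp)
ham_f1 = 2 * ham_precision * ham_recall / (ham_precision + ham_recall)

print(f"accuracy          {accuracy:.4f}")
print(f"majority baseline {majority_baseline:.4f}")
print(f"spam  P/R/F1      {spam_precision:.2f} / {spam_recall:.2f} / {spam_f1:.2f}")
print(f"ham   P/R/F1      {ham_precision:.2f} / {ham_recall:.2f} / {ham_f1:.2f}")
```

The gap between the accuracy and the majority baseline is what shows the model is doing real work on both classes.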


In [31]:
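# confusion matrix for the hold-out predictions (rows = true, columns = predicted)
cm = confusion_matrix(y_test, y_pred)
fp = cm[0,1]  # ham misclassified as spam
fn = cm[1,0]  # spam misclassified as ham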
print(f"\nFalse Positives (ham classified as spam): {fp}")
print(f"False Negatives (spam classified as ham): {fn}")


False Positives (ham classified as spam): 28
False Negatives (spam classified as ham): 40


A false negative rate of 40 out of 363 for spam means that roughly 1 in 9 spam emails gets delivered. However, since false positives may be more damaging (one could miss important mail), we could keep the threshold where it is or even raise it.
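
As a sketch of what raising the threshold would look like, the example below fits a logistic regression on synthetic stand-in data (an assumption, generated with `make_classification`, not the Spambase features) and compares the default 0.5 cutoff against a stricter 0.7 cutoff on the predicted spam probability. The value 0.7 is arbitrary and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): roughly 60/40 classes, not Spambase.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.6, 0.4], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
spam_proba = model.predict_proba(X_te)[:, 1]  # estimated P(class 1)

results = {}
for threshold in (0.5, 0.7):
    # Flag as spam only when the probability reaches the threshold.
    pred = (spam_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    results[threshold] = (fp, fn)
    print(f"threshold={threshold:.1f}  false positives={fp}  false negatives={fn}")
```

A higher threshold can only shrink the set of messages flagged as spam, so false positives cannot go up and false negatives cannot go down. With the notebook's fitted search object, the analogous probabilities would presumably come from `grid.predict_proba(X_test)[:, 1]`.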

Next, we can explore which features are the most useful ones.

In [37]:
# extract the trained LR step from the pipeline
clf = grid.best_estimator_.named_steps['clf']
# get feature names (from the original DataFrame X)
feature_names = X.columns
# pull out the 57 coefficients (one per feature)
coefs = clf.coef_.ravel()
# for easy viewing we can build a data frame and sort by absolute weight
fi_coef = (
    pd.DataFrame({'feature': feature_names,
        'coef': coefs,
        'abs_coef': np.abs(coefs)})
    .sort_values('abs_coef', ascending=False)
    .reset_index(drop=True)
)

print("Top 10 features by absolute weight:")
print(fi_coef.head(10))

Top 10 features by absolute weight:
                      feature       coef   abs_coef
0            word_freq_george -10.042307  10.042307
1               word_freq_415  -5.557715   5.557715
2                word_freq_cs  -3.936521   3.936521
3                word_freq_hp  -2.817376   2.817376
4  capital_run_length_average   2.336231   2.336231
5               word_freq_lab  -2.099793   2.099793
6                word_freq_3d   2.048932   2.048932
7                word_freq_85  -2.008946   2.008946
8           word_freq_meeting  -1.749093   1.749093
9               word_freq_edu  -1.705858   1.705858


The following are strong Ham indicators:

- word_freq_george (-10.04), indicating that the word "george" almost never shows up in spam.

- word_freq_415 (-5.56), possibly an area code or number more commonly seen in personal or business email.

The following are strong Spam indicators:

- capital_run_length_average (+2.34), which makes sense since spam messages tend to use ALL CAPS more frequently (like "FREE", "URGENT").

- word_freq_3d (+2.05), interesting, so the term 3D seems to appear more in promotional or marketing emails, which can be classified as spam.
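
One way to read these weights is as odds ratios. Because the pipeline standardizes the features before the logistic regression, `exp(coef)` is the factor by which the odds of spam change for a one-standard-deviation increase in that feature, holding the others fixed. The sketch below copies four coefficients from the table above; the dictionary is only a convenience for illustration.

```python
import math

# Coefficients copied from the table above (fitted on standardized features).
coefs = {
    "word_freq_george": -10.042307,
    "word_freq_415": -5.557715,
    "capital_run_length_average": 2.336231,
    "word_freq_3d": 2.048932,
}

# exp(coef) = multiplicative change in the odds of spam
# per one-standard-deviation increase in the feature.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:28s} odds ratio {ratio:.4g} ({direction} spam odds)")
```

Ratios below 1 correspond to the ham indicators and ratios above 1 to the spam indicators.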

At this point we can throw in the comparison with other models: Naive Bayes and Random Forest.

## Model Comparison

In [41]:
# we'll save our best LR pipeline
lr_model = grid.best_estimator_

# then define NB and RF pipelines
nb_pipeline = Pipeline([
    ('clf', MultinomialNB())
])

rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

In [43]:
# and then evaluate each with 5‑fold CV on the TRAINING data (using y_train.values.ravel() to pass a 1d array)
for name, model in [
    ('LogisticRegression', lr_model),
    ('MultinomialNB',       nb_pipeline),
    ('RandomForest',        rf_pipeline)
]:
    scores = cross_val_score(
        model,
        X_train,
        y_train.values.ravel(),
        cv=5,
        scoring='f1',
        n_jobs=-1
    )
    print(f"{name:20s} CV F1: {scores.mean():.4f} +/- {scores.std():.4f}")

LogisticRegression   CV F1: 0.9006 +/- 0.0117
MultinomialNB        CV F1: 0.7279 +/- 0.0228
RandomForest         CV F1: 0.9413 +/- 0.0174


Not surprisingly, Random Forest performed better than the other two models, probably because of its ability to capture nonlinear interactions among word cues and formatting features.
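
To illustrate what a nonlinear interaction means here, the toy sketch below builds data where the label depends only on the product of two features, so neither feature is informative on its own. This is synthetic data (an assumption for illustration) and does not prove that the same mechanism explains the gap on Spambase.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data (assumption): the class is 1 when both features share the same sign.
rng = np.random.default_rng(42)
X_toy = rng.uniform(-1, 1, size=(600, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

# A linear decision boundary cannot separate this pattern.
lr_acc = cross_val_score(
    LogisticRegression(), X_toy, y_toy, cv=5, scoring="accuracy"
).mean()

# Trees can split on one feature and then on the other within each branch.
rf_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_toy, y_toy, cv=5, scoring="accuracy",
).mean()

print(f"LogisticRegression accuracy: {lr_acc:.3f}")
print(f"RandomForest accuracy:       {rf_acc:.3f}")
```

On this pattern the linear model should hover near chance while the forest recovers most of the structure.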

Finally, we can try to find a pattern in the false negatives and false positives.

In [44]:
# we go back to building a test‐set data frame with true vs predicted
df_test = X_test.copy()
df_test['true'] = y_test.values.ravel()
df_test['pred'] = y_pred

In [45]:
# False negatives
false_negatives = df_test[(df_test['true']==1) & (df_test['pred']==0)]

# False positives
false_positives = df_test[(df_test['true']==0) & (df_test['pred']==1)]

print(f"\nFalse Negatives (spam→ham): {len(false_negatives)}")
print(f"False Positives (ham→spam): {len(false_positives)}")




False Negatives (spam→ham): 40
False Positives (ham→spam): 28


In [46]:
# then we can inspect a few
print("\n A few False Negatives")
print(false_negatives.head())

print("\n A few False Positives")
print(false_positives.head())


 A few False Negatives
      word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
1683            0.14                0.0           0.29           0.0   
1687            0.00                0.0           0.00           0.0   
773             0.00                0.0           0.00           0.0   
1233            0.00                0.0           0.00           0.0   
1560            0.00                0.0           1.20           0.0   

      word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
1683           0.14             0.0               0.0                 0.0   
1687           0.00             0.0               0.0                 0.0   
773            0.52             0.0               0.0                 0.0   
1233           0.00             0.0               0.0                 0.0   
1560           0.00             0.0               0.0                 0.0   

      word_freq_order  word_freq_mail  ...  char_freq_(  char_freq_[  \
1683    

- False Negatives: When the model fails to flag these as spam, it is often because the messages lack the typical promotional keywords associated with spam (free, credit, 000, etc.). In the five examples, every one has zero occurrences of those terms, so the classifier ends up leaning toward ham, even though other features such as the capitalization statistics (high total capital-letter counts) are very spam-like. This implies that more subtly worded or personalized spam, where senders avoid typical marketing keywords, will tend to evade detection.

- False Positives: On the other hand, these messages show high frequencies of "mail" or "internet" tokens and of numbers, and may be newsletters or technical alerts rather than scams. Still, the presence of those tokens, combined with some uppercase runs, pushes the classifier past the spam threshold.
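
Rather than eyeballing five rows, a more systematic check is to compare average feature values across outcome groups. The sketch below uses a tiny made-up frame (an assumption: the values are illustrative and are not Spambase rows) with the same `true` and `pred` columns as `df_test`; `outcome` is a hypothetical helper.

```python
import pandas as pd

# Tiny made-up frame (assumption): illustrative values, not Spambase rows.
df_demo = pd.DataFrame({
    "word_freq_free":           [0.0, 0.0, 1.2, 0.8, 0.0, 0.0],
    "capital_run_length_total": [310, 280, 450, 390, 40, 55],
    "true":                     [1, 1, 1, 1, 0, 0],
    "pred":                     [0, 0, 1, 1, 0, 0],
})

def outcome(row):
    """Label a row as TP, FN, FP, or TN (hypothetical helper)."""
    if row["true"] == 1:
        return "TP" if row["pred"] == 1 else "FN"
    return "FP" if row["pred"] == 1 else "TN"

df_demo["outcome"] = df_demo.apply(outcome, axis=1)

# Mean feature value per outcome; FN vs TP shows what the missed spam lacks.
summary = (
    df_demo.groupby("outcome")[["word_freq_free", "capital_run_length_total"]]
    .mean()
)
print(summary)
```

Applied to `df_test`, the same grouping over all 57 features would show whether the missed spam really has lower keyword frequencies than the spam that was caught.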