# Data 620 Week 10: Document Classification
Matthew Tillmawitz

For this assignment we have been tasked with using the [Spambase](https://archive.ics.uci.edu/dataset/94/spambase) dataset from the University of California Irvine to perform document classification. The dataset consists of 4601 e-mails which have been labeled as either legitimate or spam with 57 pre-computed features. The original documents were not included in the dataset and appear to be unavailable, so additional feature engineering cannot be conducted. This is unfortunate, as it may have been interesting to use models such RoBERTa to generate embeddings from the documents and compare how models trained on these embeddings performed compared to the hand-crafted features. The feature ranges have already been documented and it has been confirmed there are no missing values for features, so exploratory data analysis will be skipped.

The model types tested will be Logistic Regression, Random Forest, Adaboost, and a Neural Network. These models cover a range of model classes and demonstrate how different architectures perform on the problem. As the original study made use of a number of these model types as a baseline, we included Adaboost as a model type that was not already evaluated and will be changing the training metric. During development it was noted many models were achieving an AUC-ROC close to or even exceeding 0.99 indicating saturation of the metric. As a result, model selection will rely on average precision also known as AUC-PR. This metric can be useful when evaluating imbalanced datasets and targeting the minority class. We will compare the model performance we achieve to that of the baseline models to determine the efficacy of this method.


In [134]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, average_precision_score
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

Column names are provided in the documentation for the dataset and are copied over directly.

In [135]:
columns = [
    'word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
    'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet',
    'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will',
    'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free',
    'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit',
    'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money',
    'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650',
    'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857',
    'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology',
    'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct',
    'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project',
    'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference',
    'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!',
    'char_freq_$', 'char_freq_#', 'capital_run_length_average',
    'capital_run_length_longest', 'capital_run_length_total', 'spam'
]
filepath = '/Users/matttillman/School/data_620/data/spambase.data'
random_seed = 8675309

Checking the integrity and validity of the data, the number of samples and features matches the expected values from the dataset source. The numeric classifiers are 1 for spam and 0 for legitimate e-mails.

In [136]:
df = pd.read_csv(filepath, header=None, names=columns)
print(f"Dataset loaded: {df.shape[0]} samples, {df.shape[1]-1} features")
print(f"Class distribution:\n{df['spam'].value_counts()}")

Dataset loaded: 4601 samples, 57 features
Class distribution:
spam
0    2788
1    1813
Name: count, dtype: int64


## Preparing the Splits

The features have already been created as part of the dataset, and without the raw documents no further features can reasonably be generated. We will therefore conduct no additional feature engineering and instead move directly to generating splits. We will use the train/validate/test splitting method as is current best practice, with the test split containing 20% of the data and the validate set consisting of 16% of the data (20% of the non-test data). Stratified sampling is used to ensure the class imbalance is consistent in each of the splits. All features are continuous and have no missing values per the documentation, and we center and scale the features using scikit-learn's `RobustScaler`. This scaler was chosen as it is resilient to outliers due to using median and interquartile range instead of mean and variance in the relevant calculations.

In [137]:
X = df.drop('spam', axis=1).values
y = df['spam'].values

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, random_state=random_seed, stratify=y_temp)

scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

## Logistic Regression
Logistic regression is conducted as a baseline. We evaluate different penalties using Lasso, Ridge, and ElasticNet. The best performing model uses Ridge regression and achieves an average precision of 0.90 on the validation set.

In [138]:
import warnings
warnings.filterwarnings('ignore')

# l1_ratio only used when doing elasticnet, probably fitting way more models than I need, refactor
param_grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'l1_ratio': [0.25, 0.5, 0.75]
}

lr = LogisticRegression(max_iter=1000, random_state=random_seed, solver='saga')

grid_search = GridSearchCV(lr, param_grid, cv=5, scoring='average_precision', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation Average Precision: {grid_search.best_score_:.4f}")

best_model = grid_search.best_estimator_
val_score = average_precision_score(y_val, best_model.predict_proba(X_val)[:, 1])
print(f"Validation Average Precision: {val_score:.4f}\n")

Fitting 5 folds for each of 9 candidates, totalling 45 fits

Best parameters: {'l1_ratio': 0.25, 'penalty': 'l2'}
Best cross-validation Average Precision: 0.9343
Validation Average Precision: 0.8990



## Random Forest

Random forest is a powerful ensemble method that performs well on a number of problems. After hyperparameter tuning, the best model achieves an average precision of 0.98 on the validation set.

In [139]:
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf = RandomForestClassifier(random_state=random_seed, n_jobs=-1)

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='average_precision', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation Average Precision: {grid_search.best_score_:.4f}")

rf_model = grid_search.best_estimator_
val_score = average_precision_score(y_val, rf_model.predict_proba(X_val)[:, 1])
print(f"Validation Average Precision: {val_score:.4f}")

Fitting 5 folds for each of 48 candidates, totalling 240 fits

Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}
Best cross-validation Average Precision: 0.9797
Validation Average Precision: 0.9776


## Adaboost

Adaboost was not evaluated in the original study but is another strong ensemble method for classification in particular. While having long training times due to the sequential nature of the algorithm, it demonstrates strong performance with an average precision of 0.97 on the validation set.

In [140]:
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
    'estimator__max_depth': [1, 2, 3]
}

base_estimator = DecisionTreeClassifier(random_state=random_seed)
ada = AdaBoostClassifier(estimator=base_estimator, random_state=random_seed)

grid_search = GridSearchCV(ada, param_grid, cv=5, scoring='average_precision', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation Average Precision: {grid_search.best_score_:.4f}")

ab_model = grid_search.best_estimator_
val_score = average_precision_score(y_val, ab_model.predict_proba(X_val)[:, 1])
print(f"Validation Average Precision: {val_score:.4f}")

Fitting 5 folds for each of 48 candidates, totalling 240 fits

Best parameters: {'estimator__max_depth': 3, 'learning_rate': 0.5, 'n_estimators': 300}
Best cross-validation Average Precision: 0.9793
Validation Average Precision: 0.9717


## Neural Network
Due to the complexity of the neural network, code is broken into functions. We experiment with a number of hyperparameters and layer sizes, with hidden layers ranging from [64, 32], [128, 64], and [128, 64, 32]. Dropout values were either 0.3 or 0.5 and learning rate was either 0.001 or 0.0001. These values allowed for reasonable ranges of variation to be addressed while still keeping training times managable on the available hardware. Models were trained over a maximum of 100 epochs with early stopping if models had not improved for 10 epochs. All model configurations stopped before reaching the full 100 epochs and thus reached convergence. The best performing model configuration was {'hidden_dims': [128, 64, 32], 'dropout': 0.5, 'lr': 0.001} and achieved an average precision of 0.96.

In [141]:
class SpamClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dims=[64, 32], dropout=0.3):
        super(SpamClassifier, self).__init__()
        
        layers = []
        prev_dim = input_dim
        
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            prev_dim = hidden_dim
        
        layers.append(nn.Linear(prev_dim, 1))
        layers.append(nn.Sigmoid())
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

In [142]:
def train_neural_network(X_train, y_train, X_val, y_val, hidden_dims=[64, 32], dropout=0.3, lr=0.001, epochs=50, batch_size=64):
    # Setting random seeds for reproducibility and to preserve my sanity when writing the report
    np.random.seed(random_seed)
    torch.manual_seed(random_seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Can probably do this conversion once, refactor
    X_train_tensor = torch.FloatTensor(X_train)
    y_train_tensor = torch.FloatTensor(y_train).unsqueeze(1)
    X_val_tensor = torch.FloatTensor(X_val)
    y_val_tensor = torch.FloatTensor(y_val).unsqueeze(1)
    
    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    
    model = SpamClassifier(X_train.shape[1], hidden_dims, dropout)
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    best_val_loss = float('inf')
    patience = 10
    patience_counter = 0
    train_losses = []
    val_losses = []
    
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val_tensor)
            val_loss = criterion(val_outputs, y_val_tensor).item()
            val_losses.append(val_loss)
            
            val_preds = val_outputs.numpy().flatten()
            val_ap = average_precision_score(y_val, val_preds)
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} - Train Loss: {train_loss:.4f}, "
                  f"Val Loss: {val_loss:.4f}, Val Average Precision: {val_ap:.4f}")
        
        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            best_model_state = model.state_dict()
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"\nEarly stopping at epoch {epoch+1}")
                break
    
    model.load_state_dict(best_model_state)
    print(f"\nBest validation loss: {best_val_loss:.4f}\n")
    
    return model, train_losses, val_losses

In [143]:
configs = [
    {'hidden_dims': [64, 32], 'dropout': 0.3, 'lr': 0.001},
    {'hidden_dims': [64, 32], 'dropout': 0.3, 'lr': 0.0001},
    {'hidden_dims': [64, 32], 'dropout': 0.5, 'lr': 0.001},
    {'hidden_dims': [128, 64], 'dropout': 0.3, 'lr': 0.001},
    {'hidden_dims': [128, 64], 'dropout': 0.3, 'lr': 0.0001},
    {'hidden_dims': [128, 64], 'dropout': 0.5, 'lr': 0.001},
    {'hidden_dims': [128, 64, 32], 'dropout': 0.3, 'lr': 0.001},
    {'hidden_dims': [128, 64, 32], 'dropout': 0.3, 'lr': 0.0001},
    {'hidden_dims': [128, 64, 32], 'dropout': 0.5, 'lr': 0.001},
]

best_ap = 0
best_config = None
best_model = None

for i, config in enumerate(configs):
    print(f"\n{'='*70}")
    print(f"\nConfig {i+1}/{len(configs)}: {config}")
    print(f"\n{'='*70}")
    model, _, _ = train_neural_network(
        X_train, y_train, X_val, y_val,
        hidden_dims=config['hidden_dims'],
        dropout=config['dropout'],
        lr=config['lr'],
        epochs=100
    )
    
    model.eval()
    with torch.no_grad():
        X_val_tensor = torch.FloatTensor(X_val)
        val_preds = model(X_val_tensor).numpy().flatten()
        val_ap = average_precision_score(y_val, val_preds)
    
    print(f"Validation Average Precision: {val_ap:.4f}")
    
    if val_ap > best_ap:
        best_ap = val_ap
        best_config = config
        nn_model = model

print(f"\n{'='*70}")
print(f"Best configuration: {best_config}")
print(f"Best validation Average Precision: {best_ap:.4f}")
print(f"{'='*70}\n")



Config 1/9: {'hidden_dims': [64, 32], 'dropout': 0.3, 'lr': 0.001}

Epoch 10/100 - Train Loss: 0.1716, Val Loss: 0.1945, Val Average Precision: 0.9487
Epoch 20/100 - Train Loss: 0.1410, Val Loss: 0.1816, Val Average Precision: 0.9517
Epoch 30/100 - Train Loss: 0.1325, Val Loss: 0.1759, Val Average Precision: 0.9538
Epoch 40/100 - Train Loss: 0.1389, Val Loss: 0.2963, Val Average Precision: 0.9518

Early stopping at epoch 40

Best validation loss: 0.1759

Validation Average Precision: 0.9518


Config 2/9: {'hidden_dims': [64, 32], 'dropout': 0.3, 'lr': 0.0001}

Epoch 10/100 - Train Loss: 0.4668, Val Loss: 0.4544, Val Average Precision: 0.8967
Epoch 20/100 - Train Loss: 0.2744, Val Loss: 0.4115, Val Average Precision: 0.9118

Early stopping at epoch 26

Best validation loss: 0.3320

Validation Average Precision: 0.9189


Config 3/9: {'hidden_dims': [64, 32], 'dropout': 0.5, 'lr': 0.001}

Epoch 10/100 - Train Loss: 0.2043, Val Loss: 0.3282, Val Average Precision: 0.9392
Epoch 20/100 - T

## Evaluation on Test Set

In [None]:
def evaluate_model(model, X_test, y_test, model_name, is_neural_net=False):    
    if is_neural_net:
        model.eval()
        with torch.no_grad():
            X_test_tensor = torch.FloatTensor(X_test)
            y_pred_proba = model(X_test_tensor).numpy().flatten()
            y_pred = (y_pred_proba > 0.5).astype(int)
    else:
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        y_pred = model.predict(X_test)
    
    decimal_places = 3
    metrics = {
        'Model': model_name,
        'Accuracy': round(accuracy_score(y_test, y_pred), decimal_places),
        'Precision': round(precision_score(y_test, y_pred), decimal_places),
        'Recall': round(recall_score(y_test, y_pred), decimal_places),
        'F1': round(f1_score(y_test, y_pred), decimal_places),
        'Avg Precision': round(average_precision_score(y_test, y_pred_proba), decimal_places)
    }
    
    return metrics

The models evaluated had a rather interesting heirarchy that was established, with the same ordering from most to least performant of Random Forest, Adaboost, Neural Network, and Logistic Regression by all metrics. It is unusual for a single model to be the most performant in every metric included, let alone for all models to fall into such a defined ordering. Additionally, both the Random Forest and Neural Network models performed better on the test set than the validation set in terms of average precision, indicating both models would generalize particularly well. When comparing to the values achieved by the baseline models included with the dataset, our Logistic Regression slightly underperforms by both accuracy and precision. Our Neural Network outperformed that of the baseline by both metrics, while the Random Forest model almost perfectly matched that of the baseline. As Adaboost was not included in the original study we will compare it to Xgboost as both models are very similar. In this comparison our Adaboost model matched the performance of the Xgboost model, however just barely falling in the bottom range of values for the model. For our purposes a model matched the performance of the baselines if it fell into the given ranges recorded for each model and over- or underperformed a baseline if it fell outside of the range. Overall, all models exhibited extremely good performance on the problem achieving scores of 0.9 or higher by all metrics, except in the case of Logistic Regression.

In [145]:
metrics_lr = evaluate_model(lr_model, X_test, y_test, 'Logistic Regression')
metrics_rf = evaluate_model(rf_model, X_test, y_test, 'Random Forest')
metrics_ab = evaluate_model(ab_model, X_test, y_test, 'Adaboost')
metrics_nn = evaluate_model(nn_model, X_test, y_test, 'Neural Network', is_neural_net=True)

df_results = pd.DataFrame([metrics_lr, metrics_rf, metrics_ab, metrics_nn])
df_results = df_results.set_index('Model')

print("\n" + "=" * 70)
print("MODEL COMPARISON TABLE")
print("=" * 70)
print(df_results.to_string())
print("=" * 70)

metrics = ['Accuracy', 'Precision', 'Recall', 'F1', 'Avg Precision']
print("BEST MODELS BY METRIC:")
print("=" * 70)
for metric in metrics:
    best_model = df_results[metric].idxmax()
    best_value = df_results[metric].max()
    print(f"{metric:15s}: {best_model:25s} ({best_value:.4f})")
print("=" * 70)


MODEL COMPARISON TABLE
                     Accuracy  Precision  Recall     F1  Avg Precision
Model                                                                 
Logistic Regression     0.901      0.891   0.854  0.872          0.943
Random Forest           0.952      0.954   0.923  0.938          0.981
Adaboost                0.946      0.943   0.917  0.930          0.979
Neural Network          0.940      0.935   0.912  0.923          0.975
BEST MODELS BY METRIC:
Accuracy       : Random Forest             (0.9520)
Precision      : Random Forest             (0.9540)
Recall         : Random Forest             (0.9230)
F1             : Random Forest             (0.9380)
Avg Precision  : Random Forest             (0.9810)
