# Spam Classification Challenge

>If, at any point before the written exam, you (as a group) submit an e-mail spam classifier that beats my precision/recall performance on the e-mail data test set, I will award everyone a 2.5% bonus in the written exam (e.g., if you score 87.5% in the exam, I will bump it up to 90%).


## Rules
- Must be submitted before the day of the written exam.
- Must be implemented in Python.
- Must use a method discussed in the course (variations/modifications are okay).
- Must be a single submission that all of you unanimously agree on.
- You win and earn the 2.5% bonus if the F1-score of your model on the test set, 2*(precision*recall)/(precision+recall), is larger than mine.
- I will upload the training set and my own code on May 12, 2025. You won’t have access to the test set.
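
The F1-score from the rules can be computed directly from predicted and true labels. The following is a minimal sketch in plain Python. It assumes that spam is encoded as label `1` and no-spam as `0`, which the rules do not state, and the function name `f1_score` is chosen here for illustration only.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for binary labels.

    Assumption: `positive` marks the spam class; the label
    encoding used by the course data is not shown in the rules.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * (precision * recall) / (precision + recall)


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = f1_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well by maximizing only one of the two.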

### 1. Load imports and data
The classification data is loaded as a bag-of-words feature matrix.

In [7]:
from models import NaiveBayes, NeuralNetwork
from utils import DataLoader, Evaluator


# Load spam classification data
X, y = DataLoader.load_spam_data('./data_train')

Vocabulary loaded
Loaded 4125 emails (1176 spam, 2949 no-spam)
Feature matrix shape: (4125, 50371)
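
To illustrate what a bag-of-words feature matrix looks like, here is a toy sketch. It is not the implementation of `DataLoader.load_spam_data`, whose tokenization and vocabulary handling are not shown here. The helper names `build_vocabulary` and `bag_of_words` and the whitespace tokenizer are assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (hypothetical helper)."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(documents, vocab):
    """Return one count vector per document; column order follows `vocab`."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # dicts preserve insertion order, so iteration matches the column indices
        rows.append([counts.get(token, 0) for token in vocab])
    return rows


emails = ["win money now", "meeting at noon", "win win prize"]
vocab = build_vocabulary(emails)
bow_matrix = bag_of_words(emails, vocab)
print(f"Feature matrix shape: ({len(bow_matrix)}, {len(vocab)})")
```

Each row is one e-mail and each column one vocabulary word, which is why the real matrix above has 4125 rows and 50371 columns.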


### 2. Add the model you want to evaluate
Add as many models as you want. You can add a model several times with different hyperparameters.

*Recommended:  Add a meaningful description for easier evaluation.*

Use existing models with different hyperparameters, or add your own model by inheriting from the `Model` class.

In [8]:
# Define models to evaluate
models = [
    NaiveBayes(name="Naive Bayes"),
    NeuralNetwork(name="NN (Logistic Loss, Hidden=16, Epochs=5, LR=0.01)",
        hidden_dim=16, epochs=5, lr=0.01, loss_type="logistic2"
    ),
]
print(f"{len(models)} models defined.")

2 models defined.
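
Adding your own model by inheriting from `Model` might look roughly like the sketch below. The real `Model` base class is not shown in this notebook, so the stand-in defined here, including its `name` argument and its `fit`/`predict` methods, is an assumption. Check the actual class in `models` before adapting it. The `MajorityClass` baseline is likewise a hypothetical example.

```python
class Model:
    """Stand-in for the course's `Model` base class (interface is assumed)."""

    def __init__(self, name):
        self.name = name

    def fit(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError


class MajorityClass(Model):
    """Hypothetical baseline: always predicts the most frequent training label."""

    def __init__(self, name="Majority class baseline"):
        super().__init__(name)
        self.label_ = None

    def fit(self, X, y):
        labels = list(y)
        # Pick the label that occurs most often in the training data.
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


baseline = MajorityClass().fit([[0], [1], [2]], [0, 0, 1])
print(baseline.name, baseline.predict([[5], [6]]))
```

A trivial baseline like this is useful as a sanity check: on this data set it would reach roughly 71% accuracy (2949 of 4125 e-mails are no-spam) while its F1-score for spam would be zero.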


### 3. Train all models
The evaluation uses a fixed random state for reproducibility.

All defined models are evaluated using k-fold cross-validation to obtain a robust performance estimate. This helps to identify overfitting and shows how well a model generalizes to unseen data. A higher k results in a longer training time.

In [11]:
print(f"\nEvaluating {len(models)} models...")

# Create evaluator and run k-fold cross-validation
evaluator = Evaluator(models=models, n_splits=3, random_state=42)

results = evaluator.evaluate(X, y, verbose=True)


Evaluating 2 models...

=== K-Fold Evaluation for Model: Naive Bayes ===
Trained Naive Bayes: 757 spam, 1993 no-spam emails
Fold 1/3: Accuracy = 0.8145; F1 Score = 0.5813
Fold 2/3: Accuracy = 0.8567; F1 Score = 0.6562
Fold 3/3: Accuracy = 0.8378; F1 Score = 0.6188

Model: Naive Bayes
Mean Accuracy: 0.8364 ± 0.0173

=== K-Fold Evaluation for Model: NN (Logistic Loss, Hidden=16, Epochs=5, LR=0.01) ===
Epoch 1/5, Loss: 0.692718
Epoch 2/5, Loss: 0.429021
Epoch 3/5, Loss: 0.266898
Epoch 4/5, Loss: 0.165316
Epoch 5/5, Loss: 0.164964
Fold 1/3: Accuracy = 0.8793; F1 Score = 0.7530
Fold 2/3: Accuracy = 0.9629; F1 Score = 0.9354
Fold 3/3: Accuracy = 0.9745; F1 Score = 0.9535

Model: NN (Logistic Loss, Hidden=16, Epochs=5, LR=0.01)
Mean Accuracy: 0.9389 ± 0.0424
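
The splitting step behind k-fold cross-validation can be sketched as follows. This is not the code of `Evaluator`, whose internals (for example whether it shuffles or stratifies) are not shown here. The function name `k_fold_indices` and the use of Python's `random` module are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, n_splits, random_state=None):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle, and hence the folds, reproducible.
    random.Random(random_state).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


for fold, (train_idx, test_idx) in enumerate(
    k_fold_indices(4125, 3, random_state=42), start=1
):
    print(f"Fold {fold}/3: {len(train_idx)} train, {len(test_idx)} test")
```

With 4125 e-mails and `n_splits=3`, each training split contains 2750 e-mails, which is consistent with the 757 spam and 1993 no-spam e-mails reported in the Naive Bayes training log above.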


### 4. Evaluate the trained models

In [12]:
evaluator.print_summary()

best_model_acc = evaluator.best_model("mean_accuracy")
if best_model_acc:
    print(f"\nBest performing model (ACC): {best_model_acc}")
    best_acc = results[best_model_acc]['mean_accuracy']
    best_std = results[best_model_acc]['std_accuracy']
    print(f"Best accuracy: {best_acc:.4f} ± {best_std:.4f}")
    
best_model_f1 = evaluator.best_model("mean_f1")
if best_model_f1:
    print(f"\nBest performing model (F1): {best_model_f1}")
    best_f1 = results[best_model_f1]['mean_f1']
    best_std = results[best_model_f1]['std_f1']
    print(f"Best f1-score: {best_f1:.4f} ± {best_std:.4f}")

print("\nEvaluation completed successfully!")


FINAL K-FOLD COMPARISON RESULTS
Naive Bayes: ACC(0.8364 ± 0.0173); F1(0.6188 0.0306)
NN (Logistic Loss, Hidden=16, Epochs=5, LR=0.01): ACC(0.9389 ± 0.0424); F1(0.8806 0.0906)

Best performing model (ACC): NN (Logistic Loss, Hidden=16, Epochs=5, LR=0.01)
Best accuracy: 0.9389 ± 0.0424

Best performing model (F1): NN (Logistic Loss, Hidden=16, Epochs=5, LR=0.01)
Best f1-score: 0.8806 ± 0.0906

Evaluation completed successfully!
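
The summary values can be related back to the per-fold scores printed during training. The sketch below recomputes mean and standard deviation of the F1-scores from the rounded fold values in the log. Using the population standard deviation (`statistics.pstdev`) is an assumption. It agrees with the reported numbers to within rounding, but how `Evaluator` actually aggregates is not shown here.

```python
from statistics import fmean, pstdev

# Per-fold F1-scores copied from the training log (rounded to 4 decimals there).
fold_f1 = {
    "Naive Bayes": [0.5813, 0.6562, 0.6188],
    "NN (Logistic Loss, Hidden=16, Epochs=5, LR=0.01)": [0.7530, 0.9354, 0.9535],
}

summary = {}
for name, scores in fold_f1.items():
    # Assumption: population standard deviation (divides by k, not k - 1).
    summary[name] = (fmean(scores), pstdev(scores))
    mean_f1, std_f1 = summary[name]
    print(f"{name}: F1({mean_f1:.4f} ± {std_f1:.4f})")
```

The comparatively large spread for the neural network comes from its first fold (F1 = 0.7530), so the mean alone hides a noticeable fold-to-fold variation.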
