# SPAM Dataset visualization

In [None]:
from src.preprocessing import data_spam, load_data, preprocessing, visualize

Here we load the dataframe from the Spambase dataset: about 4,600 emails with 58 columns (57 features plus the class label).

In [None]:
_, _, df, features = load_data(data_spam)
df

We verify this with the .info() method.

We also check that there are no missing values in any of the columns.

In [None]:
df.info()

From the .DOCUMENTATION file, we learn that each feature named "word_freq_*WORD*" represents the percentage of words in the email that **are** the word *WORD* (48 features of this type).

For example:
- **word_freq_credit** gives the percentage of words in the email that match the word "*credit*"
- **word_freq_report** does the same for the word "*report*"

6 features are named "char_freq_*CHAR*", which is the same idea applied to a character *CHAR*.

Examples:
- **char_freq_;**
- **char_freq_$**

The remaining columns are:

**capital_run_length_average** = average length of uninterrupted sequences of capital letters

**capital_run_length_longest** = length of longest uninterrupted sequence of capital letters

**capital_run_length_total** = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail

**is_spam** = whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0). In the dataframe loaded above, this label is stored in the "Class" column.
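
The feature definitions above can be made concrete with a small sketch. It assumes that a word is a run of alphanumeric characters and that matching is case-insensitive; the helper names and the toy email are hypothetical and are not taken from the dataset tooling.

```python
import re


def word_freq(text, word):
    """Percentage of words in `text` equal to `word` (case-insensitive).

    Assumption: a word is a run of alphanumeric characters.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)


def char_freq(text, char):
    """Percentage of characters in `text` equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_runs(text):
    """Lengths of the uninterrupted sequences of capital letters."""
    return [len(run) for run in re.findall(r"[A-Z]+", text)]


# Toy email, made up for illustration.
email = "FREE credit report! Get your credit score NOW"
runs = capital_runs(email)

print(word_freq(email, "credit"))  # 25.0 (2 of the 8 words)
print(char_freq(email, "!"))       # share of "!" among all characters
print(runs)                        # [4, 1, 3]
print(sum(runs) / len(runs))       # capital_run_length_average
print(max(runs))                   # capital_run_length_longest
print(sum(runs))                   # capital_run_length_total
```

On this toy email, the three capital runs are "FREE", "G" and "NOW", so the longest run is 4 and the total is 8.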

In [None]:
df.describe()

We check the number of spam and non-spam emails in the dataset:

In [None]:
df["Class"].value_counts()

For some of the features, we can see how the data is distributed for spam and non-spam emails, as well as the pairwise correlations.
Most of the other features have their density concentrated around 0.

The 2D scatter plots are not very informative here because many points overlap and the data is concentrated in a small region.
A correlation matrix is much more useful for understanding these relationships.
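
As a minimal sketch of what such a correlation matrix contains, the block below computes pairwise Pearson correlations with the standard library only. The column values are made up for illustration, and returning 0.0 for a constant column is a convention chosen here. With pandas, `df.corr()` gives the same kind of matrix directly.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant columns
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict {name_a: {name_b: correlation}} for a dict of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy values, not taken from the dataset.
toy = {
    "word_freq_free": [0.0, 0.5, 1.0, 1.5],
    "char_freq_$": [0.0, 0.1, 0.2, 0.3],
    "is_spam": [0, 0, 1, 1],
}
corr = correlation_matrix(toy)

for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```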

In [None]:
print(features)

visualize(data_spam, features[:5], random=False)

# Objectives

The objective of this notebook is to build and compare several Machine Learning models on the Spam dataset. We will evaluate four different approaches: K-Nearest Neighbors (KNN), a Multilayer Perceptron Neural Network (MLP), a Random Forest classifier, and a Linear Support Vector Classifier (Linear SVC). We will train and evaluate each model, then compare their prediction accuracies. In addition, we will analyze feature importance to better understand which variables contribute the most to the classification performance.

# Functions and Workflow Description

The preprocessing function loads the dataset, normalizes the continuous features (not the binary ones), and returns a train/validation/test split.
For each model, we define the model along with a train and a predict function, then train it on the split produced by preprocessing.
We then call the benchmark function (the same for every model), which uses the predict function and returns an accuracy report, a confusion matrix, and the overall feature importance. This lets us compare the models with each other.
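
The split and normalization steps can be sketched as follows. This is not the code of `preprocessing`: it assumes that `test_size` and `validation_size` are fractions of the whole dataset and that normalization means z-score standardization fitted on the training rows only. All function names are hypothetical.

```python
import random


def split_indices(n, test_size, validation_size, seed=0):
    """Shuffle range(n) and cut it into train / validation / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_size)
    n_val = round(n * validation_size)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def fit_standardizer(rows):
    """Per-column mean and standard deviation, computed on the given rows."""
    n, n_cols = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(var ** 0.5 or 1.0)  # fall back to 1.0 for constant columns
    return means, stds


def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


# Toy data: 20 rows, 2 continuous columns.
X = [[float(i), float(i % 3)] for i in range(20)]
train_idx, val_idx, test_idx = split_indices(len(X), test_size=0.3, validation_size=0.1)

# Statistics come from the training rows only, then are reused on the test rows.
means, stds = fit_standardizer([X[i] for i in train_idx])
X_train = standardize([X[i] for i in train_idx], means, stds)
X_test = standardize([X[i] for i in test_idx], means, stds)

print(len(train_idx), len(val_idx), len(test_idx))
```

Fitting the statistics on the training rows alone keeps information from the validation and test sets out of the training step.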


# K-Nearest Neighbors


In [None]:
from src.knn import KNNModel

samples = preprocessing(data=data_spam, test_size=0.3, validation_size=0.1)
model = KNNModel()
model.train(x=samples.X_train, y=samples.y_train)
model.benchmark(x=samples.X_test, y=samples.y_test)

## Analysis


Since the dataset is somewhat imbalanced, what interests us more than the overall accuracy is the recall, the precision, and the F1-score, which combines those two metrics. Our objective is to find a compromise between detecting as many spam messages as possible (high recall) and avoiding misclassifying legitimate emails as spam (high precision), which is exactly what the F1-score helps us evaluate.
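
For reference, these three metrics can be computed from the predictions as in the sketch below. The labels are toy values, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 4 spam and 6 non-spam emails.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = classification_metrics(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```

In this toy case the overall accuracy is 70%, yet only half of the spam emails are caught, which is why recall is tracked separately.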

As the classification report shows, a fairly simple model such as KNN (K=3) already performs very well. For the spam problem, our goal can be to miss as few spam e-mails as possible, i.e. to maximize the recall for class 1. Here this is the weakest metric (83.5%), lower than the non-spam recall.

Next, we would like to balance these two recalls and maximize them without degrading the overall precision.

Let's do feature selection with the 6 most important features (importance above 0.04%) to see what happens.

In [None]:
samples = preprocessing(data=data_spam, test_size=0.15, validation_size=0.15)

cols = [4, 6, 15, 22, 44, 51]
samples.X_train = samples.X_train[:, cols]
samples.X_test = samples.X_test[:, cols]
samples.X_validation = samples.X_validation[:, cols]

model = KNNModel()
model.train(x=samples.X_train, y=samples.y_train)
model.benchmark(x=samples.X_test, y=samples.y_test)

The model does not improve, but training and prediction are much faster.
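
The speed-up is expected, because the cost of each distance computation grows with the number of columns used. The sketch below is a plain-Python KNN with an optional column subset. It is not the `KNNModel` implementation, and the data points are made up.

```python
from collections import Counter


def knn_predict(X_train, y_train, x, k=3, cols=None):
    """Majority vote among the k nearest training points.

    Uses squared Euclidean distance restricted to `cols` (all columns if None).
    """
    cols = range(len(x)) if cols is None else cols
    distances = []
    for row, label in zip(X_train, y_train):
        d = sum((row[j] - x[j]) ** 2 for j in cols)  # cost grows with len(cols)
        distances.append((d, label))
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: the first two columns separate the classes, the third does not.
X_train = [
    [0.0, 0.1, 5.0], [0.2, 0.0, 1.0], [0.1, 0.2, 9.0],
    [2.0, 1.9, 4.0], [2.1, 2.2, 0.0], [1.9, 2.0, 8.0],
]
y_train = [0, 0, 0, 1, 1, 1]
query = [1.8, 2.1, 5.0]

pred_all = knn_predict(X_train, y_train, query, k=3)
pred_subset = knn_predict(X_train, y_train, query, k=3, cols=[0, 1])
print(pred_all, pred_subset)  # 1 1
```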

# Multilayer Perceptron (MLP)

In [None]:
from src.nn_interface import MLPModel

samples = preprocessing(data=data_spam, test_size=0.15, validation_size=0.15)
model = MLPModel(input_size=57, epochs=30)
model.train(samples)

In [None]:
model.benchmark(samples.X_test, samples.y_test)

## Analysis

Here, we implement a simple Neural Network—an MLP with three hidden layers and 10 units per layer. The model is trained for a total of 30 epochs. We observed that beyond 25 epochs, the validation loss barely improves, which makes this training duration a reasonable choice.

Compared to KNN, the results are significantly better: the precision for the positive (spam) class remains around 87% (–2 pts), but the recall increases to 96% (+13.5 pts). This means that the model successfully identifies 96% of spam emails, which is a very strong result. Even though the classifier may incorrectly label some legitimate emails as spam, this type of error has limited consequences.

A key advantage is that we improved the most critical metric (recall) without severely compromising the others. Additionally, we can tune the decision threshold applied to the MLP’s probabilistic output. Since the model outputs a probability between 0 and 1, we arbitrarily classify values above 0.5 as spam and below 0.5 as non-spam. Adjusting this threshold allows us to control the trade-off between precision and recall depending on the application’s requirements.
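
The effect of moving the threshold can be illustrated with a short sketch. The probabilities and labels below are toy values, not outputs of the MLP, and the helper name is hypothetical.

```python
def precision_recall_at(probabilities, y_true, threshold):
    """Precision and recall for the spam class when `p > threshold` means spam."""
    y_pred = [1 if p > threshold else 0 for p in probabilities]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy model outputs: probability of being spam, with the true labels.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.70, 0.20, 0.10]
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    precision, recall = precision_recall_at(probs, y_true, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
```

On these toy values, lowering the threshold to 0.25 raises recall to 1.0 while precision drops to about 0.67. Raising it to 0.75 gives a precision of 1.0 but a recall of only 0.5.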

We also noticed that the accuracy can vary by up to 2% across training runs, likely due to differences in the model’s random initialization. A potential improvement would be to optimize or stabilize this initialization step in order to achieve more consistent performance.

Let's do feature selection with the 6 most important features (importance above 0.04%) to see what happens.

In [None]:
samples = preprocessing(data=data_spam, test_size=0.15, validation_size=0.15)

cols = [6, 24, 26, 45, 52, 55]
samples.X_train = samples.X_train[:, cols]
samples.X_test = samples.X_test[:, cols]
samples.X_validation = samples.X_validation[:, cols]


model = MLPModel(input_size=6, epochs=100)
model.train(samples)

In [None]:
model.benchmark(samples.X_test, samples.y_test)

Using only 6 features for spam classification reduces the model’s performance.
With such a limited number of features, the model has less information to separate spam from non-spam, which makes learning harder. Consequently, the loss decreases more slowly during training, requiring more epochs, and even then, it does not reach the same minimum loss as when using the full set of features.

We decide not to pursue feature selection further, as the performance is already quite good on this dataset.

# Random Forest

In [None]:
from src.RForest import RForest

samples = preprocessing(data=data_spam, test_size=0.3, validation_size=0.1)
model = RForest()
model.train(x=samples.X_train, y=samples.y_train)

In [None]:
model.benchmark(x=samples.X_test, y=samples.y_test)

## Analysis

Surprisingly, the Random Forest achieves the best overall performance among all the models we tested. However, if we focus specifically on correctly identifying spam messages, it remains slightly less effective than the MLP (–4 points in recall).

Another surprising observation is that the Random Forest relies heavily on the 7th feature (word_freq_remove). While this feature was also important for the other models, it becomes dominant here, accounting for nearly 30% of the total feature importance—far more than in the other classifiers.
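
The notebook does not state how `benchmark` computes feature importance. One model-agnostic option that gives comparable values across KNN, the MLP and tree models is permutation importance: shuffle one column at a time and measure the drop in accuracy. The sketch below illustrates that technique under this assumption. The toy rule stands in for a trained model, and all names are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which `predict(row)` equals the label."""
    return sum(1 for row, label in zip(X, y) if predict(row) == label) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Stand-in for a trained model: spam if feature 0 exceeds 0.5."""
    return 1 if row[0] > 0.5 else 0


# Toy data: feature 0 determines the label, feature 1 is ignored by the rule.
X = [[0.0, 3.0], [0.1, 1.0], [0.2, 2.0], [0.9, 1.0], [1.0, 3.0], [1.2, 2.0]]
y = [0, 0, 0, 1, 1, 1]

importances = permutation_importance(toy_predict, X, y)
print(importances)
```

A feature that the model never uses, like feature 1 here, gets an importance of exactly 0, because shuffling it leaves every prediction unchanged.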

A plausible explanation for the strong performance of the Random Forest is that decision-tree–based models excel at capturing nonlinear interactions and threshold-based patterns in the data. The spam dataset contains many frequency-based features that behave in a piecewise manner (e.g., presence or absence of certain keywords, sudden increases in word frequency). Random Forests are particularly good at exploiting such structures, allowing them to build diverse trees that capture different aspects of the data. Additionally, their ensemble nature reduces overfitting while improving robustness, which likely contributes to their high overall accuracy.
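
To make "threshold-based pattern" concrete, the sketch below searches for the single cut on one feature that best separates the two classes. This is a decision stump, the building block of the trees in a forest. The values are made up, and the search maximizes accuracy for simplicity, whereas tree libraries usually use an impurity criterion.

```python
def best_threshold(values, labels):
    """Return (threshold, accuracy) for the rule `value > threshold` means spam.

    Candidate thresholds are the midpoints between consecutive distinct values.
    """
    candidates = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(candidates, candidates[1:]):
        t = (low + high) / 2
        correct = sum(1 for v, label in zip(values, labels) if (v > t) == (label == 1))
        acc = correct / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best


# Toy feature column and labels, made up for illustration.
word_freq_remove = [0.0, 0.0, 0.0, 0.1, 0.4, 0.6, 0.9, 0.0]
is_spam = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, acc = best_threshold(word_freq_remove, is_spam)
print(threshold, acc)  # 0.25 0.875
```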

# Linear SVC

In [None]:
from src.kernel_methods import LinearSVC_

samples = preprocessing(data=data_spam, test_size=0.3, validation_size=0.1)
model = LinearSVC_()
model.train(x=samples.X_train, y=samples.y_train)

In [None]:
model.benchmark(x=samples.X_test, y=samples.y_test)

# Conclusion

For the spam classification problem, the Random Forest achieves the best overall performance, while the MLP slightly outperforms it in detecting spam emails (class 1). The models generally rely on specific key features, such as word_freq_remove, which strongly influence the predictions. Overall, both the Random Forest and the MLP are suitable choices. The optimal model ultimately depends on the objective: whether we want to catch all spam messages, even if it means misclassifying some legitimate emails (e.g., for children, phishing, etc.), or whether we prefer to maximize spam detection while keeping legitimate emails untouched to avoid missing important messages. Additionally, we can further adjust the decision threshold to fine-tune the trade-off between precision and recall for spam detection.