# Machine Learning Model

To solve the spam classification problem, this notebook uses machine learning algorithms.  

I will first implement a baseline model without tuning any hyperparameters.  
Then I will look for the best-performing ML model for this classification problem. Finally, I will tune its hyperparameters to maximize accuracy.

In [59]:
import numpy as np
import pandas as pd

from utils import Model, preprocess_text

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier


from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


## Import data

In [60]:
df = pd.read_csv("Spam Email raw text for NLP.csv")
df.drop('FILE_NAME', axis=1, inplace=True)
df['CATEGORY'] = df['CATEGORY'].replace({1: 'Spam', 0: 'Non Spam'})
class_labels = ["Non Spam", "Spam"]
print(f"Shape: {df.shape}")
df.head(10)

Shape: (5796, 2)


| | CATEGORY | MESSAGE |
|---|---|---|
| 0 | Spam | `Dear Homeowner,\n\n \n\nInterest Rates are at ...` |
| 1 | Spam | `ATTENTION: This is a MUST for ALL Computer Use...` |
| 2 | Spam | `This is a multi-part message in MIME format.\n...` |
| 3 | Spam | `IMPORTANT INFORMATION:\n\n\n\nThe new domain n...` |
| 4 | Spam | `This is the bottom line. If you can GIVE AWAY...` |
| 5 | Spam | `------=_NextPart_000_00B8_51E06B6A.C8586B31\n\...` |
| 6 | Spam | `<STYLE type="text/css">\n\n<!--\n\nP{\n\n fon...` |
| 7 | Spam | `<HR>\n\n<html>\n\n<head>\n\n <title>Secured I...` |
| 8 | Spam | `<table width="600" border="20" align="center" ...` |
| 9 | Spam | `<html>\n\n\n\n<head>\n\n<meta http-equiv="Cont...` |


In [61]:
# Check the preprocessing on the first 10 rows
# processed = df['MESSAGE'].head(10).apply(preprocess_text)
# processed
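`preprocess_text` comes from the local `utils` module, and its code is not shown in this notebook. As a rough idea of what such a preprocessor often does, here is a minimal sketch; the function name `preprocess_text_sketch`, the cleaning steps, and the sample string are all assumptions made for illustration, not the actual implementation.

```python
import html
import re


def preprocess_text_sketch(text: str) -> str:
    """Hypothetical stand-in for utils.preprocess_text (the real one is not shown)."""
    text = html.unescape(text)               # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)     # drop HTML tags
    text = re.sub(r"[^a-zA-Z]+", " ", text)  # keep letters only
    return " ".join(text.lower().split())    # lowercase and collapse whitespace


# Invented sample, not taken from the dataset
sample = "<p>Click <b>HERE</b> to WIN $1000 &amp; more!</p>"
cleaned = preprocess_text_sketch(sample)
print(cleaned)
```

A function like this can be passed as the `preprocessor` argument of `CountVectorizer` or `TfidfVectorizer`, which is how `preprocess_text` is used in the cells below.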

In [62]:
label_encoder = LabelEncoder()  # Instantiate a label encoder
df['CATEGORY_ENC'] = label_encoder.fit_transform(df['CATEGORY'])  # Fit the encoder on the labels and transform them
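A quick way to check which integer each label receives: `LabelEncoder` stores the classes in sorted order, so `Non Spam` should map to 0 and `Spam` to 1, which matches the order of `class_labels`. The sketch below uses a small invented list of labels rather than the real dataframe.

```python
from sklearn.preprocessing import LabelEncoder

# Invented labels, standing in for df['CATEGORY']
labels = ["Spam", "Non Spam", "Spam", "Non Spam"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so position 0 is "Non Spam" and position 1 is "Spam"
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```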

In [63]:
X = df['MESSAGE']
y = df['CATEGORY_ENC']


### First baseline model

In [64]:
# Define pipeline
baseline_model = Model(X, y, MultinomialNB(), TfidfVectorizer(preprocessor=preprocess_text))
baseline_model.fit()

In [65]:
baseline_model.results_report(class_labels)

              precision    recall  f1-score   support

    Non Spam       0.95      0.99      0.97       762
        Spam       0.99      0.90      0.94       398

    accuracy                           0.96      1160
   macro avg       0.97      0.95      0.96      1160
weighted avg       0.96      0.96      0.96      1160



The results are already very good (0.96 accuracy), even though the vectorizer and the model were chosen without tuning any hyperparameters. The weakest figure is the spam recall (0.90): roughly one spam message in ten is missed.
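To see how the summary rows of the report relate to the per-class rows, the accuracy can be recomputed as the support-weighted average of the recalls. The sketch below uses only the rounded values printed in the baseline report, so the derived numbers are approximate.

```python
# Rounded values copied from the baseline classification report
report = {
    "Non Spam": {"precision": 0.95, "recall": 0.99, "f1": 0.97, "support": 762},
    "Spam": {"precision": 0.99, "recall": 0.90, "f1": 0.94, "support": 398},
}

total = sum(row["support"] for row in report.values())


def weighted(metric: str) -> float:
    """Support-weighted average of a per-class metric."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


# Accuracy equals the support-weighted recall
accuracy = weighted("recall")

# Approximate number of spam messages classified as non-spam
missed_spam = round((1 - report["Spam"]["recall"]) * report["Spam"]["support"])

print(total, round(accuracy, 2), missed_spam)
```

With these rounded inputs, about 40 of the 398 spam messages in the evaluation set are missed, which is the main weakness of the baseline.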

### Improve Baseline Model

Now let's find the best vectorizer/model combination for our classification problem.

In [66]:
# List all the vectorizers and models we want to try
parameters = {
    'Vectorizer': [CountVectorizer(preprocessor=preprocess_text), TfidfVectorizer(preprocessor=preprocess_text)],
    'Model_Architecture': [MultinomialNB(), SVC(random_state=42), RandomForestClassifier(random_state=42), KNeighborsClassifier(),
                            XGBClassifier(random_state=42), LogisticRegression(), GradientBoostingClassifier(random_state=42)]}



# Apply the grid search
baseline_model.fit_gridsearch(parameters)

# Get the best score found by the grid search
accuracy_score = baseline_model.grid_search.best_score_

print("The best parameters are:", baseline_model.grid_search.best_params_)
print("The best accuracy score is:", accuracy_score)


The best parameters are: {'Model_Architecture': LogisticRegression(), 'Vectorizer': CountVectorizer(preprocessor=<function preprocess_text at 0x0000020224E12CB0>)}
The best accuracy score is: 0.9849026336346391
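`utils.Model` is not shown in this notebook. The parameter names used above suggest that `fit_gridsearch` runs a scikit-learn `GridSearchCV` over a `Pipeline` whose steps are named `Vectorizer` and `Model_Architecture`; that is an assumption. Under that assumption, the search could look roughly like the sketch below, which uses a tiny invented corpus and only two candidate models to stay fast.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only (1 = spam, 0 = non-spam)
texts = [
    "win free money now", "free prize claim now",
    "cheap loans win cash", "claim your free cash prize",
    "meeting agenda for monday", "please review the attached report",
    "lunch tomorrow with the team", "notes from the project meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step names are assumed to match the keys of the `parameters` dict
pipeline = Pipeline([
    ("Vectorizer", TfidfVectorizer()),
    ("Model_Architecture", MultinomialNB()),
])

# Using a step name as a key makes the grid search swap the whole step
param_grid = {
    "Vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "Model_Architecture": [MultinomialNB(), LogisticRegression()],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer is a pipeline step, it is refit on the training part of every fold, so the validation folds never influence the vocabulary.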


#### Detailed results for these parameters

In [67]:
baseline_model = Model(X, y, LogisticRegression(), CountVectorizer(preprocessor=preprocess_text))
baseline_model.fit()

In [68]:
baseline_model.results_report(class_labels)

              precision    recall  f1-score   support

    Non Spam       0.99      0.99      0.99       762
        Spam       0.99      0.98      0.99       398

    accuracy                           0.99      1160
   macro avg       0.99      0.99      0.99      1160
weighted avg       0.99      0.99      0.99      1160



With our best vectorizer/model combination, the accuracy in the report rises from 0.96 to 0.99 (the grid search itself scored about 0.985).  
Let's try to improve the results further by tuning the hyperparameters of the model.

##### Now let's tune the hyperparameters

In [69]:
import warnings

# Suppress warnings for the moment
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")

# List the parameters we want to try
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'max_iter': [100, 200, 300]
}

# Split the data into training and test sets before fitting the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Vectorize the training data
X_train_vectorized = CountVectorizer(preprocessor=preprocess_text).fit_transform(X_train)

# Create the model
logreg_model = LogisticRegression()

# Do the grid search with the list of parameters defined above
grid_search = GridSearchCV(logreg_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_vectorized, y_train)

# Get the best parameters
best_params = grid_search.best_params_
accuracy_score = grid_search.best_score_

print("Best Hyperparameters:", best_params)
print("Best accuracy score:", accuracy_score)



Best Hyperparameters: {'C': 10, 'max_iter': 100}
Best accuracy score: 0.9855489528698433
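In the cell above, the vectorizer is fitted once on the whole training set before cross-validation, so each validation fold has already contributed to the vocabulary. One possible alternative is to tune `C` through a `Pipeline`, which refits the vectorizer inside every fold. The sketch below is an illustration on an invented toy corpus; the step names `vectorizer` and `model` and the reduced grid are assumptions. It also fixes `max_iter` at a generous value instead of searching over it, since `max_iter` mainly controls whether the solver converges.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only (1 = spam, 0 = non-spam)
texts = [
    "win free money now", "free prize claim now",
    "cheap loans win cash", "claim your free cash prize",
    "meeting agenda for monday", "please review the attached report",
    "lunch tomorrow with the team", "notes from the project meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" reaches a parameter of a pipeline step
param_grid = {"model__C": [0.01, 1, 100]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```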


#### Detailed results of the model with the best parameters

In [70]:
# Note: the grid search above selected C=10, whereas this cell was run with C=100
baseline_model = Model(X, y, LogisticRegression(C=100, max_iter=100), CountVectorizer(preprocessor=preprocess_text))
baseline_model.fit()

warnings.resetwarnings()

In [71]:
baseline_model.results_report(class_labels)

              precision    recall  f1-score   support

    Non Spam       0.99      0.99      0.99       762
        Spam       0.99      0.98      0.98       398

    accuracy                           0.99      1160
   macro avg       0.99      0.99      0.99      1160
weighted avg       0.99      0.99      0.99      1160



The results obtained with the tuned hyperparameters are not noticeably better than the previous ones: the accuracy was already very high, so there was little room left for improvement.  
The resulting model has very good (almost perfect) results.  
However, to find the best tuned model for this problem, it would have been better to run a single grid search over all hyperparameters of every model. Running that search would probably have taken several hours.
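To give a feel for the size of such a joint search, the sketch below writes it as a list of grids, one per model, and counts the candidates with `ParameterGrid` without fitting anything. The step names `Vectorizer` and `Model_Architecture` are assumed to match the pipeline inside `utils.Model`, and the hyperparameter ranges are illustrative choices, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB

vectorizers = [CountVectorizer(), TfidfVectorizer()]

# One dict per model, so each model is only combined with its own
# hyperparameters. All value ranges are illustrative assumptions.
joint_grid = [
    {
        "Vectorizer": vectorizers,
        "Model_Architecture": [MultinomialNB()],
        "Model_Architecture__alpha": [0.1, 0.5, 1.0],
    },
    {
        "Vectorizer": vectorizers,
        "Model_Architecture": [LogisticRegression(max_iter=1000)],
        "Model_Architecture__C": [0.01, 0.1, 1, 10, 100],
    },
    {
        "Vectorizer": vectorizers,
        "Model_Architecture": [RandomForestClassifier(random_state=42)],
        "Model_Architecture__n_estimators": [100, 300],
        "Model_Architecture__max_depth": [None, 20, 50],
    },
]

n_candidates = len(ParameterGrid(joint_grid))  # 6 + 10 + 12 combinations
cv_folds = 5
n_fits = n_candidates * cv_folds  # model fits needed, excluding the final refit

print(n_candidates, n_fits)
```

Even this reduced grid with three models needs 140 fits at five folds. Adding the four other models from the earlier search, each with its own hyperparameters, multiplies the cost further, and the slower models dominate the running time.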