# Deep Learning Model

The aim here is to explore the performance of a neural network (NN) model on the spam classification problem.

I will build a model using Keras and TensorFlow and take a look at its results.

In [24]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

from tensorflow.keras.preprocessing.sequence import pad_sequences

from utils import preprocess_text, tokenize_data

from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px


## Import Data

In [25]:
df = pd.read_csv("Spam Email raw text for NLP.csv")
df.drop('FILE_NAME', axis=1, inplace=True)
df['CATEGORY'] = df['CATEGORY'].replace({1: 'Spam', 0: 'Non Spam'})
class_labels = ["Non Spam", "Spam"]
print(f"Shape: {df.shape}")
df.head(10)

Shape: (5796, 2)


Unnamed: 0,CATEGORY,MESSAGE
0,Spam,"Dear Homeowner,\n\n \n\nInterest Rates are at ..."
1,Spam,ATTENTION: This is a MUST for ALL Computer Use...
2,Spam,This is a multi-part message in MIME format.\n...
3,Spam,IMPORTANT INFORMATION:\n\n\n\nThe new domain n...
4,Spam,This is the bottom line. If you can GIVE AWAY...
5,Spam,------=_NextPart_000_00B8_51E06B6A.C8586B31\n\...
6,Spam,"<STYLE type=""text/css"">\n\n<!--\n\nP{\n\n fon..."
7,Spam,<HR>\n\n<html>\n\n<head>\n\n <title>Secured I...
8,Spam,"<table width=""600"" border=""20"" align=""center"" ..."
9,Spam,"<html>\n\n\n\n<head>\n\n<meta http-equiv=""Cont..."


## Preprocess data

In [26]:
label_encoder = LabelEncoder()  # Instantiate a label encoder
df['CATEGORY_ENC'] = label_encoder.fit_transform(df['CATEGORY'])  # Fit the encoder on the labels and transform them


X = df['MESSAGE'].apply(preprocess_text)
y = df['CATEGORY_ENC']
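
`LabelEncoder` assigns integer codes following the sorted order of the label values, which is why `class_labels = ["Non Spam", "Spam"]` lines up with the codes 0 and 1. The snippet below is a plain-Python emulation of that sorted-order rule on a few hypothetical labels, not scikit-learn's implementation:

```python
# Hypothetical labels, for illustration only.
labels = ["Spam", "Non Spam", "Spam", "Non Spam"]

# Codes follow the sorted order of the distinct label values.
classes = sorted(set(labels))
code_of = {label: code for code, label in enumerate(classes)}
encoded = [code_of[label] for label in labels]

print(classes)  # ['Non Spam', 'Spam']
print(encoded)  # [1, 0, 1, 0]
```

With this ordering, class 1 is Spam and class 0 is Non Spam.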

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [28]:
# Tokenize the data before feeding it to the model
tokenizer, X_train = tokenize_data(X_train)
# Reuse the tokenizer fitted on the training set and pad to the same length (200)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=200)
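
When only `maxlen` is given, `pad_sequences` pads and truncates at the front of each sequence (its defaults are `padding='pre'` and `truncating='pre'`, with 0 as the fill value). The sketch below reproduces that behaviour in plain Python on toy sequences. The helper name `pad_pre` is hypothetical, and what `tokenize_data` does internally is not shown in this notebook; the assumption is that it pads the training sequences to the same length of 200.

```python
def pad_pre(sequence, maxlen, value=0):
    """Pad or truncate a sequence of token ids at the front, like the Keras defaults."""
    kept = list(sequence)[-maxlen:]  # keep only the last `maxlen` tokens
    return [value] * (maxlen - len(kept)) + kept

# Toy sequences of token ids, with maxlen=5 instead of 200 for readability.
print(pad_pre([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Every row then has the same length, which is what a fixed-length `Embedding` followed by `Flatten` needs.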

## Create the model

Here I will use a simple feed-forward neural network because it is highly customizable.

In [29]:
# Create a sequential model with multiple layers
model = Sequential()
model.add(Embedding(input_dim=2000, output_dim=128, input_length=200))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

I create a Sequential NN model with an embedding layer, a flatten layer, and three hidden layers. The first hidden layer has 128 neurons, which matches the output dimension of the embedding layer. Each hidden layer uses a Rectified Linear Unit (ReLU) activation. The output layer has a single neuron with a sigmoid activation function, which produces a probability between 0 and 1 and is commonly used for binary classification problems.
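
To make the role of the sigmoid concrete, here is a minimal sketch in plain Python. The logit values are hypothetical, and the 0.5 threshold mirrors the one applied to the predictions later in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation values of the output neuron, for illustration only.
logits = [-4.0, -0.5, 0.0, 0.5, 4.0]
probabilities = [sigmoid(z) for z in logits]
# A probability strictly above 0.5 is mapped to class 1, otherwise to class 0.
labels = [int(p > 0.5) for p in probabilities]

for z, p, label in zip(logits, probabilities, labels):
    print(f"logit={z:+.1f}  probability={p:.3f}  predicted class={label}")
```

A logit of exactly 0 gives a probability of 0.5, which the strict `>` comparison assigns to class 0.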

In [30]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Define early stopping and fit the model
early_stopping = EarlyStopping(monitor="loss")
# A maximum of 20 epochs is enough because early stopping ends the training quickly
model.fit(X_train, y_train, epochs=20, batch_size=2, callbacks=[early_stopping])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20


<keras.src.callbacks.History at 0x22c41c24a30>

I use the Adam optimizer and the binary cross-entropy loss to compile the model. This combination is also commonly used in binary classification problems. Accuracy is the metric that we will look at.  
I also define an early stopping criterion on the training loss. Because `patience` is left at its default value of 0, the training stops as soon as the loss fails to improve from one epoch to the next.
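
How the `patience` setting changes the stopping point can be sketched in plain Python. This is a simplified stopping rule written for illustration, not the Keras implementation: the function name `stopping_epoch` and the loss values are hypothetical. Keras's `EarlyStopping` uses `patience=0` unless another value is passed.

```python
def stopping_epoch(losses, patience=0, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
            continue
        wait += 1
        if wait >= patience:
            return epoch
    return None

# Hypothetical per-epoch training losses, for illustration only.
losses = [0.30, 0.10, 0.05, 0.06, 0.04, 0.04, 0.05, 0.05]
print(stopping_epoch(losses, patience=0))  # 4: stops at the first epoch without improvement
print(stopping_epoch(losses, patience=2))  # 7: tolerates one bad epoch, stops after two in a row
```

Monitoring `val_loss` on a held-out validation split instead of the training loss is a common alternative, since the training loss can keep decreasing while the model overfits.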

The results seem to be really good. Let's get more details using a classification report and a confusion matrix:

In [31]:
# Make predictions so that the classification report can be printed
y_pred = model.predict(X_test)
# Threshold the sigmoid probabilities at 0.5 to get class labels
prediction = (y_pred > 0.5).astype(int)

print(classification_report(y_test, prediction, target_names=class_labels))

# Print a confusion matrix with the results of the model
confusion_matrix_plot = confusion_matrix(y_test, prediction)
fig = px.imshow(confusion_matrix_plot, 
    text_auto=True, 
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'
)
fig.show()

              precision    recall  f1-score   support

    Non Spam       0.99      0.99      0.99       762
        Spam       0.98      0.98      0.98       398

    accuracy                           0.99      1160
   macro avg       0.99      0.99      0.99      1160
weighted avg       0.99      0.99      0.99      1160



The model does indeed perform very well, with an accuracy of 0.99. The confusion matrix shows that it has no more difficulty predicting Spam than Non Spam, which is what we could expect with a 0.99 accuracy.
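
For reference, the precision and recall values in the report can be derived from a confusion matrix whose rows are the true classes and whose columns are the predicted classes, which is the layout `confusion_matrix` returns. The sketch below uses hypothetical counts that are NOT the results of this model, and the helper name `per_class_metrics` is made up for the example.

```python
class_labels = ["Non Spam", "Spam"]

# Hypothetical counts, for illustration only (rows: true class, columns: predicted class).
matrix = [
    [90, 10],  # true Non Spam: 90 predicted Non Spam, 10 predicted Spam
    [5, 45],   # true Spam: 5 predicted Non Spam, 45 predicted Spam
]

def per_class_metrics(matrix, index):
    """Return (precision, recall) for the class at the given index."""
    true_positive = matrix[index][index]
    predicted_as_class = sum(row[index] for row in matrix)  # column sum
    actually_class = sum(matrix[index])                     # row sum
    return true_positive / predicted_as_class, true_positive / actually_class

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total

for index, name in enumerate(class_labels):
    precision, recall = per_class_metrics(matrix, index)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In this toy matrix both classes have the same recall (0.90) but different precision, a difference that a single accuracy figure would hide.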

The model may be larger than this problem requires, and we could probably get good results with fewer neurons per layer. That would save some computation time.
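
A rough way to see where the computation goes is to count the trainable parameters by hand. The sketch below does this for the architecture defined above and for a smaller variant. The reduced layer sizes (32, 16, 8) are a hypothetical choice for comparison and have not been trained or evaluated here.

```python
def count_parameters(vocab_size, embedding_dim, sequence_length, dense_units):
    """Count trainable parameters of an Embedding -> Flatten -> Dense stack."""
    total = vocab_size * embedding_dim        # embedding matrix
    inputs = sequence_length * embedding_dim  # vector size after Flatten
    for units in dense_units:
        total += inputs * units + units       # weights plus biases
        inputs = units
    return total

# Architecture used in this notebook.
original = count_parameters(2000, 128, 200, [128, 64, 32, 1])
# Hypothetical smaller variant, for comparison only.
reduced = count_parameters(2000, 128, 200, [32, 16, 8, 1])

print(f"original: {original:,} parameters")  # original: 3,543,297 parameters
print(f"reduced:  {reduced:,} parameters")   # reduced:  1,075,905 parameters
```

Most of the parameters sit in the first dense layer, because it is connected to all 200 × 128 = 25,600 flattened embedding values. Shrinking that layer, or replacing `Flatten` with a pooling layer, would be the most direct way to cut the model size.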