# Spam Email Classification Using Deep Learning

## Introduction:

Spam email classification uses machine learning to distinguish legitimate emails from unwanted ones. In this project, a long short-term memory (LSTM) deep learning model learns patterns in the email content and classifies each mail as spam or not spam.

## Dataset:

The email classification model is trained and evaluated on a dataset of labeled email texts, where each label indicates whether the mail is spam or not.

Link to the dataset: https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset

Below are all the libraries that you will need for this project.

In [21]:
# using Python 3.10.1
import re
import math
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Bidirectional
from keras.optimizers import RMSprop
from keras.callbacks import LearningRateScheduler, EarlyStopping, ModelCheckpoint
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

Let's begin by loading the dataset.

In [27]:
df=pd.read_csv("emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [25]:
df['spam'].value_counts()

spam
0    4360
1    1368
Name: count, dtype: int64
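The counts above show that the dataset is imbalanced. As a rough sanity check, here is a small sketch that uses only the two counts printed above to compute the share of each class and the accuracy of a trivial classifier that always predicts "not spam". The name `majority_baseline` is introduced here purely for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4360, 1: 1368}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Since about three quarters of the emails are not spam, any accuracy reported later is better compared with this baseline than with 50%.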

The code below sets up functions for common text preprocessing tasks, such as removing stopwords and lemmatizing words.

These preprocessing steps are crucial for cleaning and transforming raw text data before using it for NLP tasks like sentiment analysis or text classification.

In [28]:
nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer
print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))

def rem_subject(text):
    # drop the leading "Subject: " prefix (9 characters)
    return text[9:]

def remove_stop_words(sentence):
    # remove English stop words; non-string values become ""
    if isinstance(sentence, str):
        words = sentence.split()
        filtered_words = [word for word in words if word not in stop_words]
        return ' '.join(filtered_words)
    else:
        return ""

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # lemmatize each whitespace-separated token; non-string values become ""
    if isinstance(text, str):
        return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))
    else:
        return ""


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:997)>
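One detail of the stop-word filter is worth seeing on a small example: the comparison is case-sensitive, and the stop-word list is all lowercase. The sketch below uses a tiny hand-written stop list (`DEMO_STOP_WORDS`, an assumption made for this illustration, not the NLTK list) and a hypothetical `lowercase` switch to show the difference.

```python
# Tiny hand-written stop list, used only for this illustration;
# the notebook itself uses NLTK's English stop words.
DEMO_STOP_WORDS = {"the", "is", "a", "to", "you", "your"}

def remove_stop_words_demo(sentence, stop_words=DEMO_STOP_WORDS, lowercase=False):
    """Drop stop words from a whitespace-tokenized sentence."""
    if not isinstance(sentence, str):
        return ""
    if lowercase:
        sentence = sentence.lower()
    return " ".join(w for w in sentence.split() if w not in stop_words)

sample = "The offer is waiting : claim Your prize"
print(remove_stop_words_demo(sample))                  # "The" and "Your" survive
print(remove_stop_words_demo(sample, lowercase=True))  # all stop words removed
```

The sample rows shown earlier appear to be lowercase already, so this probably does not matter for training, but it may matter for mixed-case text typed in by hand.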


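Now we pass the text column through these functions. (Run this cell only once: `rem_subject` removes the first 9 characters on every call, so running it again would cut into the email body.)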

In [29]:
df['text']=df['text'].apply(rem_subject)
df['text']=df['text'].apply(remove_stop_words)
df['text']=df['text'].apply(lemmatize_text)
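If re-running the cell is a concern, one possible alternative is a prefix check that only strips `Subject: ` when it is actually there. This is a sketch, not the function used above; `strip_subject_prefix` is a hypothetical name.

```python
def strip_subject_prefix(text, prefix="Subject: "):
    """Remove the leading prefix only when it is actually present."""
    if isinstance(text, str) and text.startswith(prefix):
        return text[len(prefix):]
    return text

once = strip_subject_prefix("Subject: naturally irresistible")
twice = strip_subject_prefix(once)  # second call leaves the text unchanged
print(once)
print(twice)
```

It could be applied the same way as `rem_subject`, for example with `df['text'].apply(strip_subject_prefix)`, and would then be safe to run more than once.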

In [30]:
df.head()

Unnamed: 0,text,spam
0,naturally irresistible corporate identity lt r...,1
1,stock trading gunslinger fanny merrill muzo co...,1
2,unbelievable new home made easy im wanting sho...,1
3,4 color printing special request additional in...,1
4,"money , get software cd ! software compatibili...",1


We also calculate the average length of the emails in the dataset, measured in characters. (This will help later when choosing the sequence length!)

In [32]:
def Cavg(text):
    # average number of characters per email
    return np.mean([len(i) for i in text])

avg_len = Cavg(df['text'])
print(avg_len)

1171.6222067039107
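Note that the figure above counts characters, while sequences fed to the model are measured in tokens. The sketch below shows the difference on three stand-in strings (hypothetical data, shortened from the previews above). It splits on whitespace, which is only an approximation of what the Keras tokenizer does.

```python
from statistics import mean

# Hypothetical stand-in texts; in the notebook this would be df['text'].
texts = [
    "naturally irresistible corporate identity",
    "stock trading gunslinger",
    "money , get software cd !",
]

char_lengths = [len(t) for t in texts]           # what Cavg measures
token_lengths = [len(t.split()) for t in texts]  # closer to sequence length

print("average characters:", mean(char_lengths))
print("average tokens:", mean(token_lengths))
```

Running the same two lines on `df['text']` would give a token-based average to compare with `max_sequence_length`.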


The code below sets up a tokenizer, fits it on the text data, and converts the text into sequences of integers. This sequence data is often used as input to train machine learning models, such as neural networks, for tasks like sentiment analysis or text classification. The choice of num_words limits the vocabulary size to the most frequent words, which can help manage computational resources and improve model efficiency.

In [33]:
max_word = 5000  # the maximum number of words to keep, based on word frequency
max_sequence_length = 1000  # sequences are padded/truncated to 1000 tokens (the average text length above was about 1172 characters)
tokenizer = Tokenizer(num_words=max_word, split=' ')
tokenizer.fit_on_texts(df['text'].values)
X = tokenizer.texts_to_sequences(df['text'].values)
X = pad_sequences(X, maxlen=max_sequence_length)
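To make these steps concrete, here is a simplified pure-Python imitation of what the tokenizer and padding do. It is a sketch of my understanding of the default behaviour (frequency-ranked indices starting at 1, words ranked `num_words` or higher dropped, zeros padded on the left), not the Keras implementation, and all function names are made up for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words ranked num_words or higher."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen items."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

texts = ["free money now", "meeting now"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
padded = pad_pre(sequences, maxlen=4)
print(word_index)
print(padded)
```

In this toy run "meeting" is ranked fourth, so with `num_words=4` it is dropped, which mirrors how rare words fall outside the 5000-word vocabulary above.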

After completing the preprocessing of the dataset, the next step is to prepare the data for deep learning modelling: the labels are one-hot encoded and the data is split into 70% training and 30% testing.

In [34]:
y = keras.utils.to_categorical(df['spam'], 2, dtype="float32")
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)
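As a small illustration of the label encoding, the sketch below reproduces the idea of one-hot encoding in plain Python and shows how `argmax` recovers the original label. The helper names `one_hot` and `argmax` are defined here for the example and are not Keras functions.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)] for label in labels]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

labels = [0, 1, 1, 0]
encoded = one_hot(labels, num_classes=2)
decoded = [argmax(row) for row in encoded]
print(encoded)
print(decoded)
```

This is why the evaluation code later calls `argmax(axis=1)` on both the true labels and the predictions.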

Let's quickly look at the shape of the data after splitting it for training and testing.

In [35]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4009, 1000)
(4009, 2)
(1719, 1000)
(1719, 2)


The code given below defines a sequential neural network model for a binary classification task using Keras. It consists of an Embedding layer followed by two Bidirectional LSTM layers with dropout. The final layer is a Dense layer with two outputs and softmax activation. The model is compiled using the Adam optimizer and categorical crossentropy loss, with accuracy as the evaluation metric.

In [37]:
# Model
model = Sequential()
model.add(Embedding(max_word, 1000, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(128, dropout=0.5, return_sequences=True)))
model.add(Bidirectional(LSTM(64, dropout=0.5)))
model.add(Dense(2, activation='softmax'))
# Compile Model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
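To get a feel for the size of this model, the sketch below estimates the number of trainable parameters per layer. It assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector), which is what I understand the Keras `LSTM` layer to use by default. The helper functions are written for this example only, and the totals are worth checking against `model.summary()`.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

max_word, embed_dim = 5000, 1000
layers = {
    "embedding": embedding_params(max_word, embed_dim),
    "bi_lstm_128": 2 * lstm_params(embed_dim, 128),  # forward + backward
    "bi_lstm_64": 2 * lstm_params(2 * 128, 64),      # input is the 256-d concatenation
    "dense": dense_params(2 * 64, 2),
}
for name, n in layers.items():
    print(f"{name:>12}: {n:,}")
print(f"{'total':>12}: {sum(layers.values()):,}")
```

Under these assumptions the embedding layer alone accounts for roughly four fifths of the parameters, so a smaller embedding dimension would be the first thing to try if training is too slow.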

Finally, let's train the model on our training dataset. (I set the epochs to 1 because the model already gave extremely good results. You can change the number of epochs to your liking; just remember that the more epochs you use, the longer training will take.) Note that the test set is also passed as validation data in this call.

In [49]:
history = model.fit(X_train, y_train, epochs=1, validation_data=(X_test, y_test))



After training is complete, we test the model on the test dataset and generate a classification report to analyse its performance.

In [50]:
y_pred = model.predict(X_test)
# argmax turns the one-hot labels and the softmax outputs into class indices
report = classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))
print("\nClassification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1278
           1       0.99      0.94      0.96       441

    accuracy                           0.98      1719
   macro avg       0.98      0.97      0.97      1719
weighted avg       0.98      0.98      0.98      1719
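For readers unfamiliar with the columns in this report, the sketch below computes precision, recall, and F1 for one class from raw labels. The toy labels are hypothetical and are not taken from the model above; `precision_recall_f1` is a helper written for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels: 1 = spam, 0 = not spam.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 1]
print(precision_recall_f1(y_true_demo, y_pred_demo))
```

Read this way, the recall of 0.94 for class 1 in the report means that about 6% of the spam emails in the test set were classified as not spam.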



Finally, let's save the model and the tokenizer so that they can be used later in applications. (I didn't run the cells below, to save space.)

In [None]:
import pickle
# saving deep learning model
model.save("EmailClassifier.h5")
# saving tokenizer
with open('tokenizer_email.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)


After saving the model and tokenizer, we can load them using the following code.

In [None]:
with open('tokenizer_email.pickle', 'rb') as handle:
    loaded_tokenizer = pickle.load(handle)
# the file name must match the one used in model.save()
loaded_model = keras.models.load_model("EmailClassifier.h5")

Let's do some testing on our own input!

In [42]:
max_sequence_length = 1000
label=['Not Spam', 'Spam']

In [51]:
test_txt=["Hi James, Have you claim your complimentary gift yet? I've compiled in here a special astrology gift that predicts everything about you in the future? This is your enabler to take the correct actions now. >> Click here to claim your copy now >> Claim yours now, and thank me later. Love, Heather", "Hey mate! i am mailing you regarding the opening at our company. You applied here for the role of a data scientist an we would like you to share your resume. regards, Manager"]
seq = tokenizer.texts_to_sequences(test_txt)
ans = pad_sequences(seq, maxlen=max_sequence_length)
preds = model.predict(ans)
for i in preds:
    # pick the class with the highest predicted probability
    print(label[i.argmax()])

Spam
Not Spam
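The test texts above go to the tokenizer as typed, without the cleaning steps used on the training data. One possible way to keep training and inference consistent is to run new text through the same list of steps first. The sketch below shows the idea with stand-in steps; `apply_steps` and `demo_steps` are hypothetical names, and in the notebook the list would hold `remove_stop_words` and `lemmatize_text` instead.

```python
def apply_steps(text, steps):
    """Run a text through each cleaning step in order."""
    for step in steps:
        text = step(text)
    return text

# Stand-in steps for illustration: lowercase, then collapse repeated spaces.
demo_steps = [str.lower, lambda t: " ".join(t.split())]

raw = ["Hi  James,   claim your GIFT now"]
cleaned = [apply_steps(t, demo_steps) for t in raw]
print(cleaned)
```

Whether this changes the predictions for the two examples above is something I have not checked; it is only a suggestion for keeping the inputs comparable.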
