# Spam Email Classification Using Deep Learning

## Introduction:

Spam email classification uses machine learning to distinguish legitimate emails from unwanted ones. By analyzing the content of each email, a long short-term memory (LSTM) deep learning model learns the patterns in the text and classifies the email as spam or not spam.

## Dataset:

The email classification model is trained and evaluated on a dataset of labeled email texts, where each label indicates whether the content of the email is spam or not.

Link to the dataset: https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset

Below are all the libraries that you will need for this project.

In [None]:
%pip install -r requirements.txt


In [None]:
%pip install --upgrade pip

In [None]:
#using python 3.10.1
import pandas as pd
import re
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow.keras as keras
import math
import nltk
from nltk.corpus import stopwords
import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import LearningRateScheduler, EarlyStopping, ModelCheckpoint
from sklearn.metrics import classification_report,confusion_matrix
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Bidirectional

Let's begin by loading the dataset.

In [None]:
df=pd.read_csv("emails.csv")
df.head() #display first 5 rows of the dataset

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [15]:
df['spam'].value_counts() #count of spam and ham emails

spam
0    4360
1    1368
Name: count, dtype: int64
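
The counts above show that the classes are imbalanced, so it helps to know what accuracy a trivial model would reach. The small sketch below uses only the two counts printed above; the variable names are my own and are not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4360, 1: 1368}  # 0 = not spam, 1 = spam

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())

print(f"total emails: {total}")
print(f"spam share: {shares[1]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy we report later should be read against this baseline of roughly 76%, which is why per-class precision and recall are worth checking as well.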

The code below sets up functions for common text preprocessing tasks, such as removing stopwords and lemmatizing words.

These preprocessing steps are crucial for cleaning and transforming raw text data before using it for NLP tasks like sentiment analysis or text classification.

In [None]:
nltk.download('stopwords') 
print(stopwords.words('english'))
stop_words = set(stopwords.words('english')) 

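def rem_subject(text): #function to remove the leading 'Subject: ' (9 characters) from email text
    return text[9:] if text.startswith("Subject: ") else text #leave text without the prefix untouched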

def remove_stop_words(sentence): 
  if isinstance(sentence, str): #check if input is string
    words = sentence.split() 
    filtered_words = [word for word in words if word not in stop_words] #remove stop words
    return ' '.join(filtered_words) #return filtered sentence
  else:
    return ""

w_tokenizer = nltk.tokenize.WhitespaceTokenizer() #tokenizer that splits text by whitespace
lemmatizer = nltk.stem.WordNetLemmatizer() #lemmatizer that reduces words to their base form
def lemmatize_text(text):
    if isinstance(text, str):
        st = ""
        for w in w_tokenizer.tokenize(text):
            st = st + lemmatizer.lemmatize(w) + " "
        return st.strip()
    else:
        return ""


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SAHITHI\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
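
One detail worth keeping in mind: the stop-word check is a plain set lookup, so it is case-sensitive, and the NLTK list printed above is all lowercase. The sketch below illustrates this with a tiny hand-written stop list; `demo_stop_words` and `drop_stop_words` are hypothetical names used only here, not the notebook's functions.

```python
# Tiny hand-written stop list for illustration only; the notebook uses
# NLTK's much longer English list, which is also all lowercase.
demo_stop_words = {"the", "is", "to", "a"}

def drop_stop_words(sentence, stop_words, lowercase=False):
    """Remove stop words from a whitespace-separated sentence."""
    if lowercase:
        sentence = sentence.lower()
    return " ".join(w for w in sentence.split() if w not in stop_words)

sample = "The offer is free to claim"
kept_case = drop_stop_words(sample, demo_stop_words)                  # 'The' survives
lowered = drop_stop_words(sample, demo_stop_words, lowercase=True)   # 'the' is removed

print(kept_case)
print(lowered)
```

The preview of this dataset looks lowercase already, so this mostly matters when you feed your own text to the model later.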


Now we pass the text column through these functions. (Run this cell only once!)

In [18]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SAHITHI\AppData\Roaming\nltk_data...


True

In [19]:
df['text']=df['text'].apply(rem_subject)
df['text']=df['text'].apply(remove_stop_words)
df['text']=df['text'].apply(lemmatize_text)

In [20]:
df.head()

Unnamed: 0,text,spam
0,irresistible corporate identity lt really hard...,1
1,ding gunslinger fanny merrill muzo colza attai...,1
2,ble new home made easy im wanting show homeown...,1
3,rinting special request additional information...,1
4,et software cd ! software compatibility . . . ...,1
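
The "run only once" warning is worth a closer look. An unconditional slice such as `text[9:]` removes nine more characters every time it is applied, and rows in the preview above such as `ding gunslinger` (where the raw text read `trading gunslinger`) look consistent with the cell having been applied twice. The sketch below compares an unconditional slice with a guarded one; `strip_always` and `strip_if_prefixed` are hypothetical names for illustration.

```python
PREFIX = "Subject: "  # 9 characters, the part we want to drop

def strip_always(text):
    """Unconditional slice: removes 9 characters on every call."""
    return text[9:]

def strip_if_prefixed(text):
    """Guarded variant: removes the prefix only when it is present."""
    return text[len(PREFIX):] if text.startswith(PREFIX) else text

raw = "Subject: the stock trading gunslinger"

# Applying the unconditional slice twice eats into the email body.
sliced_twice = strip_always(strip_always(raw))

# The guarded variant gives the same result no matter how often it runs.
guarded_twice = strip_if_prefixed(strip_if_prefixed(raw))

print(repr(sliced_twice))
print(repr(guarded_twice))
```

If you suspect the cell ran more than once, reloading `emails.csv` and preprocessing again from scratch is the safest fix.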


We also calculate the average length of the email texts in the dataset, measured in characters. (This will be of help later!)

In [None]:
def Cavg(text):
    arr=[]
    for i in text:
        arr.append(len(i)) #store the character length of each email text "i"
    return np.average(arr) #average length over all emails
avg_len=Cavg(df['text'])
print(avg_len)

1162.4809706703911
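
Note that `len()` on a string counts characters, while the padding length used later is counted in tokens, so the two numbers are not directly comparable. The sketch below shows the difference on a hypothetical mini-corpus (the sample sentences are made up for illustration, and whitespace splitting only approximates what the Keras tokenizer does).

```python
from statistics import mean

# Hypothetical mini-corpus, used only to show the two measurements.
samples = [
    "free money now",
    "meeting moved to friday afternoon",
    "win a prize",
]

avg_chars = mean(len(text) for text in samples)           # what Cavg measures
avg_tokens = mean(len(text.split()) for text in samples)  # roughly what padding counts

print(f"average characters: {avg_chars:.2f}")
print(f"average tokens:     {avg_tokens:.2f}")
```

Running the token version on `df['text']` would give a better guide for choosing the sequence length than the character average does.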


The code below sets up a tokenizer, fits it on the text data, and converts the text into sequences of integers. This sequence data is often used as input to train machine learning models, such as neural networks, for tasks like sentiment analysis or text classification. The choice of num_words limits the vocabulary size to the most frequent words, which can help manage computational resources and improve model efficiency.

In [22]:
max_word=5000 #the maximum number of words to keep, based on word frequency.
max_sequence_length = 1000 #pad/truncate every email to 1000 tokens (the average above, about 1162, is counted in characters, not tokens)
tokenizer = Tokenizer(num_words=max_word, split=' ') 
tokenizer.fit_on_texts(df['text'].values)
X = tokenizer.texts_to_sequences(df['text'].values)
X= pad_sequences(X, maxlen=max_sequence_length)
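
To make the tokenizing and padding steps less of a black box, here is a rough pure-Python sketch of the same idea. It is not the Keras implementation (the real `Tokenizer` also lowercases and strips punctuation by default); it only mimics three behaviours: indices start at 1 with 0 reserved for padding, only words with an index below `num_words` are kept, and padding and truncation happen at the front by default. The function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to integers, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen items of longer sequences."""
    return [[0] * max(0, maxlen - len(seq)) + seq[-maxlen:] for seq in sequences]

texts = ["free money now", "free offer", "meeting now"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
padded = pad_pre(sequences, maxlen=2)

print(word_index)
print(sequences)
print(padded)
```

In this toy run the rare words `offer` and `meeting` are dropped because their indices are not below `num_words`, and the three-word email loses its first token when it is truncated to length 2.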

After completing the preprocessing of the dataset, the next step is to start preparing for deep learning modelling

In [24]:
y = keras.utils.to_categorical(df['spam'], 2) #one-hot encoding of spam labels - 2 categories spam and not spam
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42) #split data into training and testing sets
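
If one-hot encoding is new to you, the sketch below shows what `to_categorical` does to the labels and how `argmax` turns a one-hot row back into a class index. It is a plain-Python illustration with hypothetical helper names, not the Keras code itself.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label's position."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)] for label in labels]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

labels = [1, 0, 0, 1]          # 1 = spam, 0 = not spam
encoded = one_hot(labels, 2)   # e.g. 1 -> [0.0, 1.0]
decoded = [argmax(row) for row in encoded]

print(encoded)
print(decoded)
```

The same `argmax` step is what converts the model's two output probabilities back into a single predicted class later on.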

Let's quickly take a look at the shape of the data after splitting it for training and testing.

In [None]:
print(X_train.shape) #shape = (number of training samples, features)
print(y_train.shape) #shape = (number of training samples, 2)
print(X_test.shape) #shape = (number of testing samples, features)
print(y_test.shape) #shape = (number of testing samples, 2)
# X holds the model inputs, y the output labels

(4009, 1000)
(4009, 2)
(1719, 1000)
(1719, 2)


The code given below defines a sequential neural network model for a binary classification task using Keras. It consists of an Embedding layer followed by two Bidirectional LSTM layers with dropout. The final layer is a Dense layer with softmax activation. The model is compiled using the Adam optimizer and categorical crossentropy loss, with accuracy as the evaluation metric.

In [28]:
# Model
model = Sequential()
model.add(Embedding(max_word, 1000, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(128, dropout=0.5, return_sequences=True)))
model.add(Bidirectional(LSTM(64, dropout=0.5)))
model.add(Dense(2, activation='softmax'))
# Compile Model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
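
To get a feel for the size of this model, we can estimate its trainable parameters by hand. The sketch below uses the standard parameter formulas for embedding, LSTM and dense layers with the sizes from the cell above; the helper names are hypothetical, and `model.summary()` remains the authoritative way to check the numbers.

```python
def embedding_params(vocab_size, dim):
    """One vector of length dim per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

layers = {
    "embedding": embedding_params(5000, 1000),
    "bilstm_128": 2 * lstm_params(1000, 128),    # two directions
    "bilstm_64": 2 * lstm_params(2 * 128, 64),   # input is the 256-wide BiLSTM output
    "dense": dense_params(2 * 64, 2),
}
total_params = sum(layers.values())

for name, n in layers.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total_params:,}")
```

Most of the parameters sit in the embedding layer, so a smaller embedding dimension is probably the first thing to try if training feels too slow.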



Finally, let's train the model on our training dataset. (I kept the number of epochs at 1 because the model already gave very good results. You can change the number of epochs to your liking; just remember that more epochs mean a longer training process.)

In [29]:
history = model.fit(X_train, y_train, epochs=1, validation_data=(X_test, y_test))

126/126 ━━━━━━━━━━━━━━━━━━━━ 2928s 23s/step - accuracy: 0.9369 - loss: 0.1467 - val_accuracy: 0.9738 - val_loss: 0.0923
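
The `126/126` in the training log is the number of batches in one epoch. Since we did not pass `batch_size`, `model.fit` uses its default of 32, and the sketch below reproduces the step count from the training-set size printed earlier. The helper name is hypothetical.

```python
import math

def steps_per_epoch(num_samples, batch_size=32):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

train_steps = steps_per_epoch(4009)  # training rows from X_train.shape
test_steps = steps_per_epoch(1719)   # test rows from X_test.shape

print(train_steps)
print(test_steps)
```

A larger batch size means fewer steps per epoch, though the time per step usually grows with it.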


After training is complete, we test the model on the test set and generate a classification report to analyse its performance.

In [30]:
y_pred=model.predict(X_test)
report = classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1)) #argmax picks the more probable of the two classes
print("\nClassification Report:")
print(report)

54/54 ━━━━━━━━━━━━━━━━━━━━ 75s 1s/step

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1278
           1       0.98      0.92      0.95       441

    accuracy                           0.97      1719
   macro avg       0.98      0.96      0.96      1719
weighted avg       0.97      0.97      0.97      1719
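
As a sanity check, the F1 scores in the report can be recomputed from the precision and recall columns. The sketch below does this with the rounded values printed above, so small differences in the last digit are possible; the dictionary layout and names are my own.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the classification report above.
report_rows = {
    0: {"precision": 0.97, "recall": 0.99, "support": 1278},
    1: {"precision": 0.98, "recall": 0.92, "support": 441},
}

f1_scores = {label: f1(r["precision"], r["recall"]) for label, r in report_rows.items()}

total_support = sum(r["support"] for r in report_rows.values())
weighted_f1 = sum(f1_scores[l] * report_rows[l]["support"] for l in report_rows) / total_support

for label, score in f1_scores.items():
    print(f"class {label}: f1 = {score:.2f}")
print(f"weighted avg f1 = {weighted_f1:.2f}")
```

The lower recall for class 1 tells us that some spam emails still slip through as not spam, which the overall accuracy alone would hide.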





Finally, let's save the model and the tokenizer so that they can be used later in applications. (I didn't run the cells below to save space.)

In [34]:
import pickle
# saving deep learning model
model.save("EmailClassifier.keras")
# saving tokenizer
with open('tokenizer_email.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)


After saving the model and tokenizer, we can load them with the following code:

In [35]:
with open('tokenizer_email.pickle', 'rb') as handle:
    loaded_tokenizer = pickle.load(handle)
loaded_model=keras.models.load_model('EmailClassifier.keras')



Let's test the model on some input of our own!

In [36]:
max_sequence_length = 1000
label=['Not Spam', 'Spam']

In [40]:
test_txt=["hello click on this link to send me infinite money >> link >>"]
seq = tokenizer.texts_to_sequences(test_txt)
ans = pad_sequences(seq, maxlen=max_sequence_length)
preds = model.predict(ans)
for i in preds:
    print(label[np.around(i, decimals=0).argmax()])

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 237ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 237ms/step
Spam
Spam
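
One thing to keep in mind for real use: the training texts went through stop-word removal and lemmatization, while the test sentence above is tokenized as typed. A small helper that applies the same steps before tokenizing keeps the two consistent. The sketch below shows the idea with stand-in steps; `apply_steps` and the `demo_` functions are hypothetical, and in the notebook you would pass `remove_stop_words` and `lemmatize_text` instead.

```python
def apply_steps(text, steps):
    """Run text through each preprocessing function in order."""
    for step in steps:
        text = step(text)
    return text

# Stand-in steps for illustration only.
def demo_lowercase(text):
    return text.lower()

def demo_remove_stop_words(text):
    demo_stop_words = {"on", "this", "to", "me"}
    return " ".join(w for w in text.split() if w not in demo_stop_words)

cleaned = apply_steps(
    "Hello click on this link to send me infinite money",
    [demo_lowercase, demo_remove_stop_words],
)
print(cleaned)
```

The cleaned string is what would then go into `texts_to_sequences` and `pad_sequences`, exactly as in the cell above.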
