## 1. Import libraries and load the data sets

At this stage, I load only the enron1 and enron2 folders because of the RAM limitations of the device I am using.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_files
from tqdm import tqdm

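# Collect the raw email bytes (X) and the class labels (y) from each folder
X, y = [], []
for i in tqdm(range(1, 3)):  # enron1 and enron2
    emails = load_files(f"enron{i}")
    X = np.append(X, emails.data)
    # Appending to an empty list produces a float array, hence labels 0.0 / 1.0
    y = np.append(y, emails.target)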

100%|██████████| 2/2 [00:06<00:00,  3.48s/it]
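`load_files` treats each subfolder as one class and numbers the classes in sorted order of folder name (the names are exposed as `emails.target_names`). The sketch below emulates that numbering with the standard library only. The `ham` and `spam` subfolder names are an assumption about the layout of each enron folder, and `class_indices` is a hypothetical helper, not part of scikit-learn.

```python
import os
import tempfile

def class_indices(container_path):
    """Map each subfolder name to an integer, in sorted order of name."""
    folders = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    return {name: index for index, name in enumerate(folders)}

# Build a throwaway folder that mimics the assumed layout: ham/ and spam/
with tempfile.TemporaryDirectory() as root:
    for name in ("spam", "ham"):
        os.makedirs(os.path.join(root, name))
    # A plain file at the top level is not a class folder and is skipped
    open(os.path.join(root, "Summary.txt"), "w").close()
    mapping = class_indices(root)

print(mapping)  # {'ham': 0, 'spam': 1}
```

Under this assumption, target 0 corresponds to ham and target 1 to spam.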


## 2. Define the data frame as 'df' with two columns, 'text' and 'target'

In [2]:
# Build the data frame directly from the two arrays
df = pd.DataFrame({'text': list(X), 'target': list(y)})

In [3]:
df

| | text | target |
|---|---|---|
| 0 | `b'Subject: nesa / hea \' s 24 th annual meetin...` | 0.0 |
| 1 | `b'Subject: meter 1431 - nov 1999\r\ndaren -\r\...` | 0.0 |
| 2 | `b"Subject: investor here .\r\nfrom : mr . rich...` | 1.0 |
| 3 | `b"Subject: hi paliourg all available meds . av...` | 1.0 |
| 4 | `b'Subject: january nominations at shell deer p...` | 0.0 |
| ... | ... | ... |
| 11024 | `b'Subject: investment / partnership proposal\r...` | 1.0 |
| 11025 | `b'Subject: re : fwd : praca dyplomowa v edycja...` | 0.0 |
| 11026 | `b'Subject: fw : citi , wells , enron , sl and ...` | 0.0 |
| 11027 | `b'Subject: re : subscription renewal\r\nstepha...` | 0.0 |


## 3. Import libraries for data cleansing

In [4]:
import re
import nltk
from nltk.tokenize import word_tokenize

## 4. Define functions for cleansing and tokenizing data

In [5]:
def cleansing(text):
    # Convert to lowercase (on bytes, this affects ASCII letters only)
    text = text.lower()

    # Decode bytes to string
    text = text.decode('latin-1')

    # Remove hashtags (a '#' followed by non-whitespace characters)
    pattern_3 = r'#([^\s]+)'
    text = re.sub(pattern_3, '', text)

    # Replace punctuation, math operators, and similar symbols with a space
    pattern_4 = r'[\,\@\*\_\-\!\:\;\?\'\.\"\)\(\{\}\<\>\+\%\$\^\#\/\`\~\|\&\|]'
    text = re.sub(pattern_4, ' ', text)

    # Remove backslash escape sequences (a backslash followed by 1-5 lowercase letters or digits)
    pattern_6 = r'\\[a-z0-9]{1,5}'
    text = re.sub(pattern_6, '', text)

    # Remove backslashes and square brackets
    pattern_9 = r'[\\\]\[]'
    text = re.sub(pattern_9, '', text)

    # Remove non-ASCII characters
    pattern_10 = r'[^\x00-\x7f]'
    text = re.sub(pattern_10, '', text)

    # Remove literal \uXXXX escape sequences
    pattern_11 = r'(\\u[0-9A-Fa-f]+)'
    text = re.sub(pattern_11, '', text)

    # Collapse each run of whitespace (or a literal \n) into a single space
    pattern_12 = r'(\s+|\\n)'
    text = re.sub(pattern_12, ' ', text)

    # Strip leading and trailing whitespace
    text = text.strip()
    return text

def tokenisasi(text):
    # Split the text into word tokens with NLTK
    tokens = nltk.tokenize.word_tokenize(text)
    return tokens
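To see the effect of this kind of cleaning on a single message, here is a condensed sketch applied to one of the raw emails shown earlier. It is a simplified variant, not the `cleansing` function itself: `clean_email` is a hypothetical name, and it replaces every non-alphanumeric character with a space instead of handling hashtags and escape sequences separately.

```python
import re

def clean_email(raw: bytes) -> str:
    """Decode, lowercase, and reduce an email to space-separated words."""
    text = raw.decode("latin-1").lower()
    text = re.sub(r"[^\x00-\x7f]", "", text)   # drop non-ASCII characters
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation and symbols -> space
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace, trim ends

sample = b"Subject: meter 1431 - nov 1999\r\ndaren -\r\n"
cleaned = clean_email(sample)
print(cleaned)  # subject meter 1431 nov 1999 daren
```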

## 5. Split data frame, 'df_X' contains 'text', and 'df_y' contains 'target'

In [6]:
df_X = df.drop(['target'], axis=1)
df_y = df['target']

In [7]:
df['text'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 11029 entries, 0 to 11028
Series name: text
Non-Null Count  Dtype 
--------------  ----- 
11029 non-null  object
dtypes: object(1)
memory usage: 86.3+ KB


## 6. Apply cleansing function for data cleansing

In [8]:
df_X['text'] = df_X['text'].apply(cleansing)
df_X

| | text |
|---|---|
| 0 | subject nesa hea s 24 th annual meeting saddle... |
| 1 | subject meter 1431 nov 1999 daren could you pl... |
| 2 | subject investor here from mr richard mayer de... |
| 3 | subject hi paliourg all available meds availab... |
| 4 | subject january nominations at shell deer park... |
| ... | ... |
| 11024 | subject investment partnership proposal dear s... |
| 11025 | subject re fwd praca dyplomowa v edycja mba wa... |
| 11026 | subject fw citi wells enron sl and i 2 form a ... |
| 11027 | subject re subscription renewal stephanie than... |


## 7. Modeling with a Long Short-Term Memory (LSTM) neural network from the Keras library

## LSTM

In this stage, I tokenized the text with the Keras library, limiting the vocabulary to 50000 words (MAX_NB_WORDS) and the length of each sequence to 250 tokens (MAX_SEQUENCE_LENGTH). Shorter emails are padded with zeros and longer ones are truncated.

In [9]:
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

sentences = df_X['text'].to_list()
MAX_NB_WORDS = 50000
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

X = tokenizer.texts_to_sequences(sentences)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of Data Tensor : ', X.shape)

Found 73237 unique tokens.
Shape of Data Tensor :  (11029, 250)
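To make the two limits concrete, here is a rough pure-Python sketch of the idea. Words are ranked by frequency (index 1 is the most frequent), words whose index is at or above the vocabulary limit are dropped when converting text to sequences, and each sequence is truncated from the front and left-padded with zeros, which is the default behavior of `pad_sequences`. The helper names are hypothetical, and splitting on whitespace is a simplification of the Keras text filtering.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is below num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad(sequence, maxlen):
    """Truncate from the front, then left-pad with zeros up to maxlen."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["subject meter nov", "subject investor here subject"]
word_index = fit_word_index(texts)

padded = pad(to_sequence(texts[1], word_index, num_words=4), maxlen=5)
clipped = pad(to_sequence(texts[0], word_index, num_words=4), maxlen=2)

print(len(word_index))  # 5 unique words, even though only indices 1-3 are used
print(padded)           # [0, 0, 0, 1, 1]
print(clipped)          # [2, 3]
```

This also explains why the cell reports 73237 unique tokens while the model uses a vocabulary of 50000: the full word index is kept, and the limit is applied only when text is converted to sequences.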


In this stage, I used pd.get_dummies to create one column for each unique value in the label column, filled with 0 or 1. The .values attribute then returns the result as a NumPy array. The final result is a one-hot encoded label tensor with one column per class, which is the format expected by the two-unit softmax output layer and the categorical_crossentropy loss used in the model.

In [10]:
Y = pd.get_dummies(df_y).values
print('Shape of Label Tensor : ', Y.shape)

Shape of Label Tensor :  (11029, 2)
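As a small illustration of what one-hot encoding produces, here is a pure-Python sketch with a hypothetical `one_hot` helper and four made-up labels. Depending on the pandas version, `get_dummies` may return boolean columns rather than 0/1 integers unless `dtype=int` is passed, so the integer output below is an assumption about the desired format.

```python
def one_hot(labels):
    """Return one row per label, with a 1 in the column of its class."""
    classes = sorted(set(labels))  # column order: 0.0 first, then 1.0
    return [[int(label == c) for c in classes] for label in labels]

encoded = one_hot([0.0, 1.0, 1.0, 0.0])
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```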


In this stage, I used the sklearn library to split the data into 'train' and 'test' sets: 80% of the emails for training and 20% for testing.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.20, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

(8823, 250) (8823, 2)
(2206, 250) (2206, 2)
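The row counts above follow from the 80/20 ratio applied to 11029 emails. The sketch below reproduces the sizes with a hypothetical `split_indices` helper, assuming the test part is rounded up. It uses Python's own random generator, so it shows the mechanics of a shuffled split but does not reproduce the exact rows that `train_test_split` selects with `random_state=42`.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # test part is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(11029, test_size=0.20, seed=42)
print(len(train_idx), len(test_idx))  # 8823 2206
```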


In this stage, I build an LSTM classification model and train it on the 'train' data.

In [12]:
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 2
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2)

Epoch 1/2
Epoch 2/2


Here is an explanation of the code above:

* The first and second lines import the classes needed to build and train a neural network model in Keras.
* The third line creates a Sequential object, to which layers are added in order.
* The fourth line adds an Embedding layer, which maps each word index to a dense numeric vector of the same dimension. MAX_NB_WORDS sets the vocabulary size and EMBEDDING_DIM sets the length of each vector.
* The fifth line adds a SpatialDropout1D layer with a rate of 0.2, which reduces overfitting by dropping entire embedding channels during training.
* The sixth line adds an LSTM layer with 100 units, a type of recurrent layer used to process sequential data such as text.
* The seventh line adds a Dense layer with softmax activation, which outputs a probability for each of the two classes (in this case, ham and spam).
* The eighth line compiles the model with the categorical_crossentropy loss function, the adam optimizer, and the accuracy metric.
* The ninth and tenth lines set epochs and batch_size, which control the number of passes over the training data and the number of samples per gradient update.
* The last line trains the model on X_train and Y_train, holds out 20% of the training data for validation (validation_split=0.2), and stores the training results in the history object.
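The size of this model can be worked out by hand from the layer settings. The arithmetic below is a sketch that assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and assumes that Keras rounds the training part of `validation_split` down; `model.summary()` would be the way to confirm the parameter counts.

```python
import math

MAX_NB_WORDS = 50000
EMBEDDING_DIM = 100
LSTM_UNITS = 100
N_CLASSES = 2

# One EMBEDDING_DIM-long vector per vocabulary entry
embedding_params = MAX_NB_WORDS * EMBEDDING_DIM
# Four gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (LSTM_UNITS * (EMBEDDING_DIM + LSTM_UNITS) + LSTM_UNITS)
# One weight per (LSTM unit, class) pair plus one bias per class
dense_params = LSTM_UNITS * N_CLASSES + N_CLASSES
total_params = embedding_params + lstm_params + dense_params

# Rows actually used for gradient updates after validation_split=0.2
n_train = 8823
n_fit = math.floor(n_train * (1 - 0.2))
n_val = n_train - n_fit
steps_per_epoch = math.ceil(n_fit / 64)

print(embedding_params, lstm_params, dense_params, total_params)
# 5000000 80400 202 5080602
print(n_fit, n_val, steps_per_epoch)
# 7058 1765 111
```

Almost all of the parameters sit in the embedding layer, which is why the vocabulary limit has the largest effect on model size. SpatialDropout1D adds no trainable parameters.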

## 8. Making predictions on the 'test' data.

In [13]:
y_pred = model.predict(X_test)
y_pred.shape



(2206, 2)

In [14]:
y_pred_labels = np.argmax(y_pred, axis=1)
print(y_pred_labels)

[0 0 0 ... 0 0 1]


In [15]:
from sklearn.metrics import accuracy_score

y_pred_classes = np.argmax(y_pred, axis=1)
Y_test_classes = np.argmax(Y_test, axis=1)
accuracy = accuracy_score(Y_test_classes, y_pred_classes)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 97.01%
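The two `argmax` calls turn probability rows and one-hot rows back into class indices, and accuracy is the share of positions where they agree. A pure-Python sketch with four made-up predictions (not values from the model) shows the steps:

```python
def argmax(row):
    """Index of the largest value in the row (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up softmax outputs and one-hot ground truth for four emails
probabilities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
one_hot_truth = [[1, 0], [0, 1], [0, 1], [0, 1]]

predicted = [argmax(row) for row in probabilities]
actual = [argmax(row) for row in one_hot_truth]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print(predicted)  # [0, 1, 0, 1]
print(actual)     # [0, 1, 1, 1]
print(accuracy)   # 0.75
```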


In [16]:
from sklearn.metrics import classification_report

y_pred_classes = np.argmax(y_pred, axis=1)
Y_test_classes = np.argmax(Y_test, axis=1)
print(classification_report(Y_test_classes, y_pred_classes))

              precision    recall  f1-score   support

           0       1.00      0.96      0.98      1622
           1       0.91      0.99      0.95       584

    accuracy                           0.97      2206
   macro avg       0.95      0.98      0.96      2206
weighted avg       0.97      0.97      0.97      2206



At this stage, I evaluated the model on the 'test' data and obtained an accuracy of 97%.
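The F1 scores and averages in the report can be checked from its precision, recall, and support columns. The sketch below uses the rounded values printed above, so the results agree with the report only to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the classification report
report = {
    0: {"precision": 1.00, "recall": 0.96, "support": 1622},
    1: {"precision": 0.91, "recall": 0.99, "support": 584},
}

f1_scores = {label: f1(row["precision"], row["recall"])
             for label, row in report.items()}
total = sum(row["support"] for row in report.values())

# Macro average: plain mean over classes; weighted: mean weighted by support
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
weighted_f1 = sum(f1_scores[k] * report[k]["support"] for k in report) / total

print(round(f1_scores[0], 2), round(f1_scores[1], 2))  # 0.98 0.95
print(round(macro_f1, 2), round(weighted_f1, 2))       # 0.96 0.97
```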

## 9. Save Model

In [17]:
model.save("email_spam_classifier.h5")
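The saved file holds the network only. The word-to-index mapping built by the tokenizer lives in a separate Python object, so reusing the model in a later session requires storing that mapping as well. Here is a minimal sketch with the standard library, where a made-up three-word dictionary stands in for `tokenizer.word_index` and the file name is hypothetical:

```python
import json
import os
import tempfile

# Stand-in for tokenizer.word_index (made-up values)
word_index = {"subject": 1, "enron": 2, "meter": 3}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "word_index.json")  # hypothetical file name
    # Write the mapping next to the model file
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(word_index, handle)
    # Read it back, as a later session would
    with open(path, encoding="utf-8") as handle:
        restored = json.load(handle)

print(restored == word_index)  # True
```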

## 10. Testing on different data, namely files from the 'enron4' folder.

In [20]:
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.models import load_model

X, y = [], []
for i in tqdm(range(4,5)):
    emails = load_files(f"enron{i}")
    X = np.append(X, emails.data)
    y = np.append(y, emails.target)

df_new = pd.DataFrame(columns=['text', 'target'])
df_new['text'] = [x for x in X]
df_new['target'] = [t for t in y]

df_X = df_new.drop(['target'], axis=1)
df_y = df_new['target']

df_X['text'] = df_X['text'].apply(cleansing)

sentences = df_X['text'].to_list()

# Preprocess the new data. Note: this fits a new Tokenizer on the enron4 text,
# so its word indices are not the ones the saved model was trained with.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(sentences)
X_new = tokenizer.texts_to_sequences(sentences)
X_new = pad_sequences(X_new, maxlen=MAX_SEQUENCE_LENGTH)

loaded_model = load_model("email_spam_classifier.h5")

# Make predictions on the new data
y_prob = loaded_model.predict(X_new)
y_pred = y_prob.argmax(axis=-1)

# Convert predicted class indices to label names. Note: load_files numbers classes
# in sorted folder order, so this mapping should be checked against emails.target_names.
labels = {0: "spam", 1: "ham"}
df_X['target'] = [labels[pred] for pred in y_pred]

100%|██████████| 1/1 [00:01<00:00,  1.23s/it]
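One thing to watch in this cell: a tokenizer fitted on new text builds its own frequency ranking, so the same word can receive a different index than it had during training, and the embedding layer then looks up vectors learned for other words. The toy sketch below (hypothetical helper, made-up sentences) shows the mismatch and the alternative of reusing the index fitted on the training text. It also builds the label dictionary from sorted class names, on the assumption that the classes are the `ham` and `spam` folders read by `load_files`.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

train_texts = ["meter meter meter nomination", "meter nomination cheap"]
new_texts = ["cheap cheap cheap meds meter"]

train_index = fit_word_index(train_texts)  # fitted once, on training text
refit_index = fit_word_index(new_texts)    # fitted again, on new text

# The same word maps to different indices in the two vocabularies
print(train_index["meter"], refit_index["meter"])  # 1 3

# Reusing the training index keeps indices consistent; unseen words are skipped
reused = [train_index[w] for w in new_texts[0].split() if w in train_index]
print(reused)  # [3, 3, 3, 1]

# Class indices follow sorted folder names (assumed to be ham and spam)
labels = dict(enumerate(sorted(["spam", "ham"])))
print(labels)  # {0: 'ham', 1: 'spam'}
```

In the notebook, the equivalent would be to skip the new `Tokenizer(...)` and `fit_on_texts` lines and call `texts_to_sequences` on the tokenizer fitted in In [9], provided it is still in memory.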




## 11. Viewing the classification results.

In [21]:
df_X

| | text | target |
|---|---|---|
| 0 | subject are you infected with spyware is your ... | spam |
| 1 | subject turn 500 into 1200 day starting today ... | spam |
| 2 | subject mystery shopping extra casual income c... | spam |
| 3 | subject re kathy from epe s answer amy that s ... | spam |
| 4 | subject start date 1 27 02 hourahead hour 5 st... | spam |
| ... | ... | ... |
| 5994 | subject your pharmacy fk do you want a cheap p... | spam |
| 5995 | subject 51 mortgage rates as low as 3 99 g day... | spam |
| 5996 | subject fw upto 80 off on prescrlpt 1 on drogs... | spam |
| 5997 | subject huge oem soft discounts here 75 pyroxe... | ham |


In [22]:
df_X.loc[df_X['target'].str.contains('spam')]

| | text | target |
|---|---|---|
| 0 | subject are you infected with spyware is your ... | spam |
| 1 | subject turn 500 into 1200 day starting today ... | spam |
| 2 | subject mystery shopping extra casual income c... | spam |
| 3 | subject re kathy from epe s answer amy that s ... | spam |
| 4 | subject start date 1 27 02 hourahead hour 5 st... | spam |
| ... | ... | ... |
| 5993 | subject pay les for microsoft office software ... | spam |
| 5994 | subject your pharmacy fk do you want a cheap p... | spam |
| 5995 | subject 51 mortgage rates as low as 3 99 g day... | spam |
| 5996 | subject fw upto 80 off on prescrlpt 1 on drogs... | spam |


In [23]:
df_X.loc[df_X['target'].str.contains('ham')]

| | text | target |
|---|---|---|
| 6 | subject you need this jkoutsi look at this of ... | ham |
| 17 | subject | ham |
| 20 | subject borax craig paliourg valiumxanaxcialis... | ham |
| 21 | subject personals do you know what i want good... | ham |
| 22 | subject paliourg iit demokritos gr culpable re... | ham |
| ... | ... | ... |
| 5975 | subject medications at huge discounts valium a... | ham |
| 5983 | subject azalea goodyear biography faro reject ... | ham |
| 5984 | subject hey check it out hi i just saw somethi... | ham |
| 5987 | subject re average girly ejaculation movies da... | ham |


For the files in the 'enron4' folder, the model categorized 1555 emails as 'ham' (not spam) and 4444 emails as 'spam'.