Details about the dataset:

The csv file contains 5172 rows, each row for each email. There are 3002 columns. The first column indicates Email name. The name has been set with numbers and not recipients' name to protect privacy. The last column has the labels for prediction : 1 for spam, 0 for not spam. The remaining 3000 columns are the 3000 most common words in all the emails, after excluding the non-alphabetical characters/words. For each row, the count of each word(column) in that email(row) is stored in the respective cells. Thus, information regarding all 5172 emails are stored in a compact dataframe rather than as separate text files.

Let's start this by exploring our dataset a little bit. The most important thing to consider is that:
1. the label for each email is stored in the very last column
2. the columns are not of the email itself but rather the count for each of the most common 3000 words in all emails. 
3. A total of 5172 email hence 5172 rows
4. A total of 3000 most common words, with the first column being the email id, and the last being a label for a total of 3002 columns. 

What are things I want to know? 
1. spam to non-spam ratio
2. most commonly appearing word


Lets try building multiple models and see how they do.
1. Linear Regression
2. Feed Forward Neural Network
3. Bayesian Network Model
4. Skip connection
5. gradient clipping

Perhaps in the future we can find a datasset where the actual emails are maintained so we can try building a recurrent neural network. 

In [2]:
# Import related libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from sklearn.metrics import accuracy_score
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

In [3]:
# read the csv and convert it into a dataframe. We than make a copy of the dataframe before we make any changes to keep an original copy
df = pd.read_csv('Datasets/emails.csv')
df_orig = df.copy()

In [4]:
# retrieve a list of the 3000 most common words we chose to use
common_words = df.columns.tolist()[1:-1]

# save the number of training examples and number of words
m_train = df.shape[0]
num_words = len(common_words)

In [5]:
# fill missing values if there are any
print(df.isnull().sum())
df.fillna(0, inplace=True)

Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      0
allowing      0
ff            0
dry           0
Prediction    0
Length: 3002, dtype: int64


In [6]:
# data normalisation
word_columns = df.columns[1:-1]
label_column = df.columns[-1]
df[word_columns] = (df[word_columns] - df[word_columns].min()) / (df[word_columns].max() - df[word_columns].min())

In [7]:
label_column = df.columns[-1]
spam_counts = df[label_column].value_counts()
print(spam_counts)


Prediction
0    3672
1    1500
Name: count, dtype: int64


In [8]:
X_train, X_test, y_train, y_test = train_test_split(df[word_columns], df[label_column], test_size=0.2, random_state=42)

In [38]:
print('The shape of X_train is ' + str(X_train.shape))
print('The shape of y_train is ' + str(y_train.shape))
print('The shape of X_test is ' + str(X_test.shape))
print('The shape of y_test is ' + str(y_test.shape))

The shape of X_train is (4137, 3000)
The shape of y_train is (4137,)
The shape of X_test is (1035, 3000)
The shape of y_test is (1035,)


The shape of X_train is (4137, 3000)
The shape of y_train is (4137,)
The shape of X_test is (1035, 3000)
The shape of y_test is (1035,)

Let's start with training a basic linear regression model

In [48]:
linear_model = Sequential()
linear_model.add(Dense(128, activation = 'relu', input_shape = (3000,)))
linear_model.add(Dense(64, activation='relu'))
linear_model.add(Dense(1, activation = 'sigmoid'))

linear_model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

linear_model.fit(X_train, y_train.T, batch_size=10,
          epochs=10, validation_split=0.1)

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 5ms/step - accuracy: 0.8484 - loss: 0.3255 - val_accuracy: 0.9783 - val_loss: 0.1033
Epoch 2/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9861 - loss: 0.0506 - val_accuracy: 0.9783 - val_loss: 0.0876
Epoch 3/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9942 - loss: 0.0199 - val_accuracy: 0.9807 - val_loss: 0.0476
Epoch 4/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9975 - loss: 0.0099 - val_accuracy: 0.9807 - val_loss: 0.0512
Epoch 5/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9996 - loss: 0.0051 - val_accuracy: 0.9855 - val_loss: 0.0544
Epoch 6/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.9999 - loss: 0.0034 - val_accuracy: 0.9831 - val_loss: 0.0735
Epoch 7/10
[1m373/373[0m 

<keras.src.callbacks.history.History at 0x2097c31f1d0>

accuracy: 0.9991 - loss: 0.0039 - val_accuracy: 0.9758 - val_loss: 0.1202

We seem to be performing relatively well on the training set. Let's check our accuracy on the test set. 

In [49]:
y_pred_prob = linear_model.predict(X_test)
y_pred = (y_pred_prob > 0.5).astype(int)  # Convert probabilities to binary predictions

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model: {accuracy:.4f}")

[1m33/33[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step
Accuracy of the model: 0.9749


Accuracy of the model: 0.9749

Seems like we are performing fine on the test set too. This is a great start! Let's try implementing a different model.

In [9]:
ff_model = Sequential()
ff_model.add(Dense(128, activation = 'relu', input_shape = (3000,)))
ff_model.add(Dense(64, activation='relu'))
ff_model.add(Dense(32, activation='relu'))
ff_model.add(Dense(1, activation = 'sigmoid'))

ff_model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])



  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [10]:
file_name = 'test4'
tensorboard = TensorBoard(log_dir="logs\\{}".format(file_name))

In [11]:
ff_model.fit(X_train, y_train.T, batch_size=10,
          epochs=10, validation_split=0.1, callbacks=[tensorboard])

Epoch 1/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step - accuracy: 0.8414 - loss: 0.3408 - val_accuracy: 0.9758 - val_loss: 0.0723
Epoch 2/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9895 - loss: 0.0332 - val_accuracy: 0.9855 - val_loss: 0.0432
Epoch 3/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - accuracy: 0.9904 - loss: 0.0275 - val_accuracy: 0.9807 - val_loss: 0.0452
Epoch 4/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9987 - loss: 0.0051 - val_accuracy: 0.9831 - val_loss: 0.0539
Epoch 5/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.9995 - loss: 0.0029 - val_accuracy: 0.9589 - val_loss: 0.3399
Epoch 6/10
[1m373/373[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9962 - loss: 0.0184 - val_accuracy: 0.9758 - val_loss: 0.0582
Epoch 7/10
[1m373/373[0m 

<keras.src.callbacks.history.History at 0x200ae7fa300>

In [52]:
y_pred_prob = ff_model.predict(X_test)
y_pred = (y_pred_prob > 0.5).astype(int)  # Convert probabilities to binary predictions

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model: {accuracy:.4f}")

[1m33/33[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step
Accuracy of the model: 0.9787


This seems to perform pretty similarily as a simple regression model. Which makes sense since they do have pretty similar structues. Let's try a skip-connection network.

In [68]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Add, Dropout, BatchNormalization
from tensorflow.keras.models import Model

inputs = Input(shape=(3000,))

# Layer 1
x1 = Dense(256, activation='relu')(inputs)
x1 = BatchNormalization()(x1)
x1 = Dropout(0.5)(x1)

# Layer 2
x2 = Dense(128, activation='relu')(x1)
x2 = BatchNormalization()(x2)
x2 = Dropout(0.5)(x2)

# Skip connection to Layer 3
x3 = Dense(128, activation='relu')(x2)  # Skip connection from layer 1

# Combine Layer 2 and Layer 3 outputs
combined = Add()([x2, x3])

x4 = Dense(64, activation='relu')(combined)
x4 = BatchNormalization()(x4)
x4 = Dropout(0.5)(x4)

x5 = Dense(64, activation='relu')(x4)

combined2 = Add()([x4, x5])

x6 = Dense(32, activation='relu')(combined)
x6 = BatchNormalization()(x6)
x6 = Dropout(0.5)(x6)

x7 = Dense(32, activation='relu')(x6)

combined3 = Add()([x6, x7])
# Output layer
outputs = Dense(1, activation='sigmoid')(combined3)

# Create the model
sc_model = Model(inputs=inputs, outputs=outputs)

# Compile the model
sc_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
sc_model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)

Epoch 1/20
[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 10ms/step - accuracy: 0.6501 - loss: 0.7484 - val_accuracy: 0.7089 - val_loss: 0.5494
Epoch 2/20
[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - accuracy: 0.8804 - loss: 0.2905 - val_accuracy: 0.7053 - val_loss: 0.5481
Epoch 3/20
[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - accuracy: 0.9346 - loss: 0.1731 - val_accuracy: 0.7754 - val_loss: 0.3641
Epoch 4/20
[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - accuracy: 0.9560 - loss: 0.1166 - val_accuracy: 0.9263 - val_loss: 0.1663
Epoch 5/20
[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - accuracy: 0.9741 - loss: 0.0766 - val_accuracy: 0.9758 - val_loss: 0.0708
Epoch 6/20
[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - accuracy: 0.9789 - loss: 0.0741 - val_accuracy: 0.9674 - val_loss: 0.0701
Epoch 7/20
[1m104/104[0m 

<keras.src.callbacks.history.History at 0x2090dec1ee0>

In [72]:
y_pred_prob = sc_model.predict(X_test)
y_pred = (y_pred_prob > 0.8).astype(int)  # Convert probabilities to binary predictions

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model: {accuracy:.4f}")

[1m33/33[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step
Accuracy of the model: 0.9787


Lastly we'll try implementing a naive bayse model

In [84]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Train Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

Accuracy: 0.9391304347826087
              precision    recall  f1-score   support

           0       0.95      0.96      0.96       739
           1       0.91      0.88      0.89       296

    accuracy                           0.94      1035
   macro avg       0.93      0.92      0.92      1035
weighted avg       0.94      0.94      0.94      1035



This is enough for now, we'll move on to constructing a network useful for detecting a full email. 