Data Source:
[1] Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (ACM DOCENG'11), Mountain View, CA, USA, 2011. (Under review)
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ 

The data comes primarily from Singaporean college students, so the model may not generalize as well to other regions or groups. However, the process here applies to any other spam classification dataset.

In [1]:
import pandas as pd
import numpy as np

The first step is to import the data. We also convert the label to a binary integer, where 0 is the negative case (not spam, labelled "ham") and 1 is the positive case (spam).

In [2]:
#Import data
df = pd.read_csv('Data\\SMSSPamCollection',names=['label','text'],sep='\t',encoding='utf-8')

#Format Data
spamdict = {'ham' : 0, 'spam' : 1}
df['label'] = df['label'].apply(lambda x: spamdict[x])
print(df.head()) #Verify

   label                                               text
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...


This dataset does not contain missing values. We should check for any unusual characters and clean them from the set, so that the model is not training on them. I also had some issues with UTF-8 encoding specifically when running from the dockerfile, so we want to ensure that all of the characters are compatible.
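One way to see which characters a cleaning step would touch is to count every character outside the ASCII range first. The following is a minimal sketch on made-up messages; `count_non_ascii` and `sample_texts` are illustrative names and are not part of this notebook.

```python
from collections import Counter

def count_non_ascii(texts):
    """Count each character outside the 7-bit ASCII range across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ord(ch) > 127)
    return counts

# Hypothetical sample messages, not taken from the dataset.
sample_texts = [
    "Win \u00a3100 now!",   # contains a pound sign
    "It\u2018s ok lar",     # contains a left single quotation mark
    "Plain ASCII text",
]

# Print each unusual character with its code point and frequency.
for ch, n in count_non_ascii(sample_texts).most_common():
    print(f"U+{ord(ch):04X} {ch!r}: {n}")
```

Applied to the real column, the same function could be called as `count_non_ascii(df['text'])`, assuming every entry is a string.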

We can keep punctuation in the data, however. Even though it often carries no semantic meaning, punctuation can be an indicator of spam: casual (human) texts often omit it, while spam texts tend to use it liberally, for instance "URGENT!!!" to grab the reader's attention.
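A quick way to test the punctuation claim on real data would be to compare the share of punctuation characters per message between the two classes. This is only a sketch with made-up messages; `punctuation_ratio` and `examples` are hypothetical names, and no result for the actual dataset is implied.

```python
import string

def punctuation_ratio(text):
    """Return the fraction of characters in text that are ASCII punctuation."""
    if not text:
        return 0.0
    return sum(ch in string.punctuation for ch in text) / len(text)

# Hypothetical example messages, not taken from the dataset.
examples = {
    "ham": "ok see you at 5",
    "spam": "URGENT!!! Claim your prize NOW!!!",
}

for label, text in examples.items():
    print(f"{label}: {punctuation_ratio(text):.2f}")
```

With pandas, the same function could be applied per row and averaged per label to see whether the two classes actually differ.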

In [3]:
#Unusual Character Check
df['text'] = df['text'].str.replace('£', '$', regex=False) #Same meaning semantically. '£' may cause compatibility issues.

#Anything other than letters, digits, whitespace, and common ASCII punctuation counts as unusual
pattern_unusual = r'[^a-zA-Z0-9\s!@#$%^&*()_+\-=\[\]{};\':"\\|,.<>\/?]'

df['text'] = df['text'].str.replace('\u2018', '\'', regex=False) #Inconsistent Unicode: left single quote -> ASCII apostrophe

#Remove all other unusual characters
df['text'] = df['text'].str.replace(pattern_unusual, '', regex=True)

ser_unusual = df['text'].str.contains(pattern_unusual)
print(ser_unusual.sum())

0


Next, we identify the vocabulary size. This value helps estimate a reasonable starting point for the max_tokens parameter used later.

In [4]:
words = df['text'].str.split()
words = words.explode()

vocab_size = words.nunique()

print(f'Vocab Size: {vocab_size}')

Vocab Size: 15617


Next we create the model. This task is not particularly demanding in terms of complexity or computing power, so a lightweight framework such as Keras will do the job just fine. We also make use of scikit-learn's train-test split.

In [6]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
from keras import layers
from keras.regularizers import l2
from keras.callbacks import EarlyStopping
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [7]:
seed = 0
tf.random.set_seed(seed)

We need to convert the string data to a numerical format. This must be done before the split so that we capture the entire vocabulary of the dataset. The maximum number of characters in an SMS message is 160, but tokenization is done on a word-by-word basis, so we can truncate to a much smaller sequence length. 50 tokens should be more than enough.
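The choice of 50 tokens can be sanity-checked by measuring how many messages fit within that length. The sketch below uses plain whitespace splitting, which only approximates the layer's own tokenization; `share_within_length` and `sample_texts` are illustrative names, and the sample messages are made up.

```python
def share_within_length(texts, max_len):
    """Fraction of texts whose whitespace-separated token count is <= max_len."""
    if not texts:
        return 0.0
    fits = sum(len(text.split()) <= max_len for text in texts)
    return fits / len(texts)

# Hypothetical messages: two short ones and one with 60 tokens.
sample_texts = ["ok lar", "free entry in a weekly comp", "word " * 60]

print(share_within_length(sample_texts, 50))
```

On the real data this could be called as `share_within_length(list(df['text']), 50)`; a value close to 1.0 would suggest that very little text is lost to truncation.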

The max_tokens parameter is the total number of words encoded into the vocabulary. These are the N most common words in the dataset. It is a hyperparameter that must be tuned, but we can use the vocabulary size to get a rough idea. There are 15617 unique words (see the count above), so a value between 3000 and 10000 will probably capture most of the semantic meaning. The model trains quickly, so it is easy to adjust until the value is optimal. 5000 tokens gave the best test set accuracy, which is approximately a 1:3 token-to-word ratio.
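To get a feel for how much text a given vocabulary cap retains, one can compute the fraction of all word occurrences covered by the N most frequent words. This is a rough sketch that assumes lowercase whitespace tokenization; `top_n_coverage` and `sample_texts` are hypothetical names, and the messages are made up.

```python
from collections import Counter

def top_n_coverage(texts, n):
    """Fraction of all token occurrences covered by the n most frequent tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total

# Hypothetical messages, not taken from the dataset.
sample_texts = ["free prize call now", "call me now", "ok see you now"]

for n in (1, 3, 10):
    print(n, round(top_n_coverage(sample_texts, n), 2))
```

Running this over the real messages for several candidate values of N would show where the coverage curve flattens, which can narrow the range of max_tokens values worth trying.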

In [8]:

X_all = df['text'].values
y_all = df['label'].values

vectorizer = layers.TextVectorization(max_tokens=5000,output_mode='int',output_sequence_length=50,encoding='utf-8')
vectorizer.adapt(X_all) #Adapt the vectorizer before splitting so it learns the vocabulary of all values.

X_train, X_test, y_train, y_test = train_test_split(X_all,y_all,test_size=0.2,random_state=seed, stratify=y_all)


The model architecture is a simple 1D convolutional model with 2 fully connected layers. The dropout rate is high (0.5), as this task is very prone to overfitting.
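For the settings used in this notebook (sequences of 50 tokens, a kernel size of 7, a stride of 3, and 'valid' padding), the number of positions the convolution produces can be worked out with the standard output-length formula. The helper name `conv1d_output_length` is illustrative only.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output steps of a 1D convolution with 'valid' padding and dilation 1."""
    if input_length < kernel_size:
        return 0
    return (input_length - kernel_size) // stride + 1

# Values from the notebook: 50 tokens per message, kernel size 7, stride 3.
steps = conv1d_output_length(50, 7, stride=3)
print(steps)  # 15
```

So each message yields 15 convolution positions with 64 filters each, and the global max pooling layer then keeps one value per filter, giving a 64-dimensional vector for the dense layers.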

We include false positives in the metrics, as it is important that legitimate texts are not marked as spam at a high rate.

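We also use a callback to set an early stopping condition. This model does not need many epochs to train, but we want to keep the weights from the best epoch. EarlyStopping monitors validation loss by default, so with restore_best_weights=True the restored weights are those with the lowest validation loss, which is not necessarily the epoch with the highest validation accuracy.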
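The effect of the patience setting can be illustrated with a simplified, framework-free version of a patience-based stopping rule. This is a sketch of the general idea rather than Keras's actual implementation; `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed, for per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; best_epoch is the epoch whose weights would be kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Ran out of epochs before the patience was exhausted.
    return best_epoch, len(val_losses)

# Hypothetical validation losses, one per epoch (not results from this notebook).
losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.14]
print(early_stopping_epoch(losses, patience=3))
```

With these made-up values, training stops after epoch 6 and the weights from epoch 3 are kept, so the later improvement at epoch 7 is never reached. A larger patience trades extra training time for a lower chance of stopping too early.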

In [9]:
model = keras.Sequential([
    keras.Input(shape=(1,), dtype=tf.string),
    vectorizer,
    layers.Embedding(input_dim=len(vectorizer.get_vocabulary())+1,output_dim=64),
    layers.Dropout(0.5),
    layers.Conv1D(64,7,padding='valid',activation='relu',strides=3),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1,activation='sigmoid',kernel_regularizer=l2(0.01))  
])

stopping = EarlyStopping(patience=3, restore_best_weights=True)

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy',keras.metrics.FalsePositives()])
model.fit(X_train, y_train, epochs=10, batch_size=8, validation_split=0.2,callbacks=[stopping])

Epoch 1/10
446/446 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.8802 - false_positives: 24.0000 - loss: 0.3561 - val_accuracy: 0.9731 - val_false_positives: 6.0000 - val_loss: 0.1035
Epoch 2/10
446/446 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9776 - false_positives: 27.0000 - loss: 0.0991 - val_accuracy: 0.9787 - val_false_positives: 13.0000 - val_loss: 0.0815
Epoch 3/10
446/446 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9913 - false_positives: 9.0000 - loss: 0.0528 - val_accuracy: 0.9865 - val_false_positives: 5.0000 - val_loss: 0.0676
Epoch 4/10
446/446 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9950 - false_positives: 6.0000 - loss: 0.0364 - val_accuracy: 0.9843 - val_false_positives: 6.0000 - val_loss: 0.0701
Epoch 5/10
446/446 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9958 - false_positives: 4

<keras.src.callbacks.history.History at 0x227c86bb380>

Next, we evaluate on the test set.

In [10]:
#Returns [loss, accuracy, false positives] in the order given to model.compile
score = model.evaluate(X_test,y_test)
print(score)

35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9785 - false_positives: 6.0000 - loss: 0.0935
[0.09349261969327927, 0.9784753322601318, 6.0]


The following loop shows how many false positives and false negatives there are at different threshold levels. Raising the threshold decreases the false positives but increases the false negatives. At a threshold of 0.95 it is possible to get the same accuracy with no false positives, and at 0.85 the accuracy is maximized.
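The trade-off can also be sketched without a model, using a fixed list of scores. Here the thresholds are generated from integer percentages so that they print as exact values such as 0.60; `sweep_thresholds`, `scores`, and `labels` are hypothetical, and the numbers are made up rather than results from this notebook.

```python
def sweep_thresholds(scores, labels, start_pct=50, stop_pct=100, step_pct=5):
    """Count false positives and false negatives at each threshold.

    Thresholds are built from integer percentages to avoid the rounding
    drift that comes from repeatedly adding 0.05 to a float.
    """
    results = []
    for pct in range(start_pct, stop_pct, step_pct):
        thresh = pct / 100
        # A message is predicted as spam when its score exceeds the threshold.
        fp = sum(s > thresh and y == 0 for s, y in zip(scores, labels))
        fn = sum(s <= thresh and y == 1 for s, y in zip(scores, labels))
        results.append((thresh, fp, fn))
    return results

# Hypothetical model scores and true labels (1 = spam).
scores = [0.97, 0.88, 0.62, 0.58, 0.10, 0.03]
labels = [1, 1, 1, 0, 0, 0]

for thresh, fp, fn in sweep_thresholds(scores, labels):
    print(f"Threshold: {thresh:.2f} | FP: {fp} | FN: {fn}")
```

With a real model, the scores would come from a single call to `model.predict(X_test)` made before the loop, since the predictions do not change between thresholds.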

In [11]:
thresh = 0.5

#Sweep the decision threshold from 0.5 up to (but not including) 1.0 in steps of 0.05
while thresh < 1.0:
    predictions = model.predict(X_test)
    y_pred = (predictions > thresh).astype(int).flatten()
    fp_count = np.sum(np.logical_and(y_pred==1,y_test==0)) #Ham flagged as spam
    fn_count = np.sum(np.logical_and(y_pred==0,y_test==1)) #Spam that was missed
    print(f'Threshold: {thresh} | FP: {fp_count} | FN: {fn_count}')
    thresh += 0.05

35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Threshold: 0.5 | FP: 6 | FN: 18
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Threshold: 0.55 | FP: 6 | FN: 18
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Threshold: 0.6000000000000001 | FP: 6 | FN: 18
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Threshold: 0.6500000000000001 | FP: 5 | FN: 18
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Threshold: 0.7000000000000002 | FP: 4 | FN: 18
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Threshold: 0.7500000000000002 | FP: 4 | FN: 20
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Threshold: 0.8000000000000003 | FP: 4 | FN: 20
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Threshold: 0.8500000000000003 | FP: 2 | FN: 20
35/35 ━━━━━━━━━━━━━━━━━━━━ 0s

Next we save the model so that it can be used for deployment, which involves making it accessible via an API. This is done in the main.py script, which is launched inside a Docker container for compatibility.

In [12]:
model.save('spamclass.keras')