# **CSCI 4050U: Machine Learning**
### Final Project: Spam Detector
Brendan Szeto, 100702901

This project aims to detect spam messages so that they can be deleted or sorted
automatically.

The following dataset from Kaggle was used:

https://www.kaggle.com/team-ai/spam-text-message-classification

# Import TensorFlow and other Necessary Libraries

In [1]:
import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models
import tensorflow.keras.optimizers as optimizers
import tensorflow.keras.losses as losses
import tensorflow.keras.preprocessing as preprocessing

import numpy as np
import pandas as pd
import io

# Get the dataset

In [2]:
# Upload the .csv file containing the dataset
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [3]:
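dataset = pd.read_csv(io.BytesIO(uploaded['spam.csv']))
# Encode the labels in the Category column only (spam -> 1, ham -> 0),
# so that a message consisting of just "spam" or "ham" is left untouched
dataset['Category'] = dataset['Category'].map({'spam': 1, 'ham': 0})
dataset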

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will ü b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [4]:
# Split the dataset into two different lists for the messages and the labels
messages = dataset['Message'].tolist()
labels = dataset['Category'].tolist()

# Separate the two lists into training and testing sets (first 80% / last 20%)
train_size = int(len(messages) * 0.8)
input_train = messages[0:train_size]
y_train = labels[0:train_size]
input_test = messages[train_size:]
y_test = labels[train_size:]

labels_train = np.array(y_train)
labels_test = np.array(y_test)
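
The split above keeps the rows in file order. If the file were ordered in some systematic way, a shuffled split might give a more representative test set. The following is a minimal sketch of that idea, not part of the original notebook; the helper name `shuffled_split`, its `seed` parameter, and the toy data are assumptions made for illustration.

```python
import random

def shuffled_split(messages, labels, train_fraction=0.8, seed=0):
    """Shuffle message/label pairs together, then split them into train and test."""
    if len(messages) != len(labels):
        raise ValueError("messages and labels must have the same length")
    indices = list(range(len(messages)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [messages[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [messages[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

# Toy data standing in for the real message and label lists (hypothetical).
toy_messages = [f"message {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
x_tr, y_tr, x_te, y_te = shuffled_split(toy_messages, toy_labels)
print(len(x_tr), len(x_te))
```

In the notebook, the same helper could be called with `messages` and `labels` in place of the toy lists.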

# Prepare the data

In [5]:
# Fit the tokenizer on the training messages only, then convert every message
# into a sequence of integer word indices
token = preprocessing.text.Tokenizer()
token.fit_on_texts(input_train)

sequences_train = token.texts_to_sequences(input_train)
sequences_test = token.texts_to_sequences(input_test)
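
To make the tokenization step concrete, here is a simplified pure-Python sketch of the idea: words are ranked by frequency in the training texts, each word gets an integer index starting at 1, and words that were never seen during fitting are dropped. It is an approximation for illustration only, not the Keras implementation; the function names and the regular expression used to split words are assumptions.

```python
import re
from collections import Counter

WORD_PATTERN = r"[a-z0-9']+"  # simplified word splitter (assumption, ASCII only)

def fit_word_index(texts):
    """Assign each word an integer index, most frequent first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(WORD_PATTERN, text.lower()))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; words not in the index are dropped."""
    return [
        [word_index[w] for w in re.findall(WORD_PATTERN, text.lower()) if w in word_index]
        for text in texts
    ]

train_texts = ["free entry to win", "win a free prize", "ok see you"]
word_index = fit_word_index(train_texts)
sequences = texts_to_sequences(["free prize now", "see you"], word_index)
print(word_index)
print(sequences)  # "now" was never seen during fitting, so it is dropped
```

Index 0 is never assigned to a word, which is why the notebook later uses `len(token.word_index) + 1` as the vocabulary size.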

In [6]:
# Next, pad each sequence with trailing zeros (or truncate it) to a fixed length
max_length = 60

padded_train = preprocessing.sequence.pad_sequences(sequences_train, maxlen=max_length, padding='post')
padded_test = preprocessing.sequence.pad_sequences(sequences_test, maxlen=max_length, padding='post')
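
The effect of post-padding can be sketched in plain Python. Note that `pad_sequences` truncates from the front by default (`truncating='pre'`), so with only `padding='post'` set, a message longer than `max_length` keeps its last tokens. The helper `pad_post` below is a hypothetical stand-in written for illustration and assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right to maxlen; longer ones keep their last maxlen items."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop tokens from the front if the sequence is too long
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

result = pad_post([[5, 2], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)  # [[5, 2, 0, 0], [3, 4, 5, 6]]
```

If the start of a message matters more than its end, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.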

# Create Model

In [7]:
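# Create the model
vocab_size = len(token.word_index) + 1  # +1 because index 0 is reserved for padding
embedding_dim = 16

model = models.Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    layers.LSTM(embedding_dim),  # returns only the final output, shape (None, 16)
    layers.Flatten(),            # no-op here: the LSTM output is already 2-D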
    layers.Dense(6, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

In [8]:
# Compile the model
model.compile(
    loss = losses.BinaryCrossentropy(),
    optimizer = optimizers.Adam(),
    metrics = ['accuracy']
)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 60, 16)            128496    
                                                                 
 lstm (LSTM)                 (None, 16)                2112      
                                                                 
 flatten (Flatten)           (None, 16)                0         
                                                                 
 dense (Dense)               (None, 6)                 102       
                                                                 
 dropout (Dropout)           (None, 6)                 0         
                                                                 
 dense_1 (Dense)             (None, 1)                 7         
                                                                 
Total params: 130,717
Trainable params: 130,717
Non-trainable params: 0
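
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding, an LSTM layer with biases, and a dense layer; the helper names are assumptions, and the vocabulary size is inferred from the embedding row of the summary rather than read from the tokenizer.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # weight matrix plus one bias per output
    return inputs * outputs + outputs

vocab_size = 128496 // 16  # inferred from the embedding row of the summary
counts = {
    "embedding": embedding_params(vocab_size, 16),
    "lstm": lstm_params(16, 16),
    "dense": dense_params(16, 6),
    "dense_1": dense_params(6, 1),
}
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the embedding table, so the model size is driven mainly by the vocabulary size.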

In [9]:
# Train the model
model.fit(padded_train, labels_train, epochs=30)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7f156876ce10>

# Deployment

In [10]:
# `inbox` is a list of messages, some spam and some ham
inbox = ['Did you take out the trash yesterday?',
         'Talk to you later honey, love you',
         'As a loyal TD customer, you have been awarded $100. Click on this link to add the funds to your chequeing account...',
         'Congratulations! You are the winner of a brand new iPhone! Reply WIN to claim your prize!',
         'I like that sweater you bought me',
         'You are a winner! Respond now to recieve a free cruise!',]

for message in inbox:
  print(message)

Did you take out the trash yesterday?
Talk to you later honey, love you
As a loyal TD customer, you have been awarded $100. Click on this link to add the funds to your chequeing account...
Congratulations! You are the winner of a brand new iPhone! Reply WIN to claim your prize!
I like that sweater you bought me
You are a winner! Respond now to recieve a free cruise!


In [11]:
# Tokenize and pad the inbox messages the same way as the training data
predict_sequences = token.texts_to_sequences(inbox)
predict_padded = preprocessing.sequence.pad_sequences(predict_sequences, maxlen=max_length, padding='post')

results = model.predict(predict_padded)

# Move every message predicted to be spam (score >= 0.5) from the inbox to the spam folder
spam_folder = []
kept = []
for message, score in zip(inbox, results.flatten()):
  if score >= 0.5:
    spam_folder.append(message)
  else:
    kept.append(message)
inbox = kept

In [12]:
# Print the filtered inbox, then the spam folder
print('Inbox')
for message in inbox:
  print(message)
  print('\n')

print('Spam Folder')
for message in spam_folder:
  print(message)
  print('\n')

Inbox
Did you take out the trash yesterday?


Talk to you later honey, love you


I like that sweater you bought me


Spam Folder
As a loyal TD customer, you have been awarded $100. Click on this link to add the funds to your chequeing account...


Congratulations! You are the winner of a brand new iPhone! Reply WIN to claim your prize!


You are a winner! Respond now to recieve a free cruise!
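
The six messages above are only a small manual check. The held-out split prepared earlier (`padded_test`, `labels_test`) is not scored anywhere in this notebook, so a natural next step would be to measure the model on it. Below is a minimal sketch of accuracy, precision, and recall for the spam class; the function name `binary_metrics` and the toy numbers are assumptions made up to exercise the code and are not results from this model.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall for the positive (spam = 1) class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham kept
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels and probabilities, only to exercise the function.
toy_true = [0, 0, 1, 1, 0, 1]
toy_prob = [0.1, 0.7, 0.9, 0.4, 0.2, 0.8]
metrics = binary_metrics(toy_true, toy_prob)
print(metrics)
```

In the notebook, the inputs could be `labels_test` and `model.predict(padded_test).flatten()`. Alternatively, `model.evaluate(padded_test, labels_test)` would report the loss and accuracy directly. Precision and recall are worth checking alongside accuracy when one class is much rarer than the other.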


