# Spam Detection Model Preparation

This section covers the first step in preparing a spam detection model: importing the required libraries, defining a function that extracts and parses the dataset, and reading the data from a local zip file.




In [53]:
# Import necessary libraries for data manipulation, machine learning, and file handling
import argparse
import gensim.downloader as api
import numpy as np
import os
import shutil
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix
from zipfile import ZipFile

# Define a function to read data from a local zip file
def read_data(local_zip_path):
    # Determine the path for the extracted file
    extracted_folder_path = os.path.splitext(local_zip_path)[0]
    sms_collection_path = os.path.join(extracted_folder_path, 'SMSSpamCollection')

    # Unzip the file if the directory does not exist
    if not os.path.isdir(extracted_folder_path):
        with ZipFile(local_zip_path, 'r') as zip_ref:
            zip_ref.extractall(extracted_folder_path)

    # Read and split the data into labels and texts from the extracted file
    labels, texts = [], []
    with open(sms_collection_path, 'r', encoding='utf8') as file:
        for line in file:
            label, text = line.strip().split('\t', 1)  # Split on the first tab only, in case the message itself contains a tab
            labels.append(1 if label == 'spam' else 0)  # Convert labels to binary (spam: 1, ham: 0)
            texts.append(text)  # Append text data to the texts list
    return texts, labels

# Specify the path to your local zip file containing the dataset
local_zip_path = 'datasets/smsspamcollection.zip'
texts, labels = read_data(local_zip_path)  # Read the dataset into texts and labels variables
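
The parsing step inside `read_data` can be tried without the zip file. The following is a minimal sketch on a few made-up lines in the same `label<TAB>text` layout; the sample messages and the helper name `parse_sms_lines` are illustrative assumptions, not part of the dataset or of the notebook.

```python
from collections import Counter

def parse_sms_lines(lines):
    """Split 'label<TAB>text' lines into parallel lists of texts and 0/1 labels."""
    texts, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines
        label, text = line.split('\t', 1)  # Split on the first tab only
        labels.append(1 if label == 'spam' else 0)  # spam: 1, ham: 0
        texts.append(text)
    return texts, labels

# Made-up sample lines (not taken from the real dataset)
sample_lines = [
    "ham\tSee you at lunch?",
    "spam\tWIN a free prize now",
    "ham\tRunning late, sorry",
    "",
]
sample_texts, sample_labels = parse_sms_lines(sample_lines)
label_counts = Counter(sample_labels)  # Quick look at the class balance
print(sample_texts)
print(label_counts)
```

Counting the labels right after loading is a cheap way to see how imbalanced the two classes are before any modeling decisions are made.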


# Text Tokenization and Sequence Padding

With the dataset loaded, the raw text must be converted into numeric input for a neural network. Each message is tokenized into a sequence of integer word indices, and the sequences are then padded to a uniform length.



In [54]:
# Initialize the tokenizer which will be used to vectorize text data
tokenizer = tf.keras.preprocessing.text.Tokenizer()

# Fit the tokenizer on the texts - this will create a mapping between words and integers
tokenizer.fit_on_texts(texts)

# Convert the list of texts to sequences of integers
text_sequences = tokenizer.texts_to_sequences(texts)

# Pad sequences to ensure that all sequences in a list have the same length
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences)

# Calculate the number of records and the maximum sequence length
num_record = len(text_sequences)
max_seqlen = len(text_sequences[0])
print("{:d} sentences, max length: {:d}".format(num_record, max_seqlen))  # Print out the dataset stats

# Define the number of classes for classification (spam or not spam)
NUM_CLASSES = 2

# Convert labels to one-hot encoding format required for training the neural network
cat_labels = tf.keras.utils.to_categorical(labels, num_classes=NUM_CLASSES)

# Create mappings from words to integers and vice versa, including padding token
word2idx = tokenizer.word_index  # Dictionary mapping words to their integer representation
idx2word = {v:k for k, v in word2idx.items()}  # Reverse mapping from integers to words
word2idx["PAD"] = 0  # Add padding token to the word index dictionary
idx2word[0] = "PAD"  # Corresponding reverse mapping for padding

# Determine the vocabulary size for further use in the model (including the padding token)
vocab_size = len(word2idx)
print('Vocabulary size including padding token: {:d}'.format(vocab_size))


5574 sentences, max length: 189
Vocabulary size including padding token: 9010
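
To make the two transformations concrete, here is a simplified pure-Python stand-in for what `Tokenizer` and `pad_sequences` do with their default settings: words are indexed by descending frequency starting at 1, and shorter sequences are left-padded with 0. The helper names and sample sentences are hypothetical, and the real Keras classes also strip punctuation and offer many more options.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1 (0 is kept for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by frequency; words with equal counts keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_padded_sequences(texts, word_index):
    """Convert texts to index sequences and left-pad them with 0 to the longest length."""
    sequences = [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]
    max_len = max(len(seq) for seq in sequences)
    return [[0] * (max_len - len(seq)) + seq for seq in sequences]

toy_texts = ["free prize call now", "call me", "free free prize"]
toy_index = build_word_index(toy_texts)
toy_padded = to_padded_sequences(toy_texts, toy_index)
print(toy_index)
for row in toy_padded:
    print(row)
```

Because index 0 never corresponds to a real word, it can safely serve as the padding value, which is why the notebook adds a `"PAD"` entry at index 0 to its lookup tables.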


# Dataset Splitting and Batching

Once the text is tokenized and padded, it is wrapped in a `tf.data.Dataset` so that it can be fed to the network in batches. The dataset is shuffled, split into test, validation, and training sets, and each split is batched.



In [56]:
# Create a TensorFlow Dataset object which is suitable for feeding data into the model
dataset = tf.data.Dataset.from_tensor_slices((text_sequences, cat_labels))

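# Shuffle once and keep that order fixed. By default tf.data reshuffles on every
# iteration, which would let the take/skip splits below overlap between epochs.
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False)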

# Define the sizes for the test and validation sets
test_size = 512  # Size of the test set
val_size = 10    # Size of the validation set
print(f'Test set size: {test_size}, Validation set size: {val_size}') 

# Split the dataset into test, validation, and training sets
test_dataset = dataset.take(test_size)  # Take 'test_size' records for the test set
val_dataset = dataset.skip(test_size).take(val_size)  # Skip 'test_size' records and take the next 'val_size' for validation
train_dataset = dataset.skip(test_size + val_size)  # Use the remaining data for training

# Batch size is the number of samples that will be propagated through the network in one pass
BATCH_SIZE = 128

# Batch the test, validation, and training sets with the defined batch size
test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)  # 512 records give 4 full batches
val_dataset = val_dataset.batch(BATCH_SIZE)  # Keep the partial batch: val_size < BATCH_SIZE, so dropping the remainder would leave no validation data
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)  # Drop the last batch if it has fewer than BATCH_SIZE elements


Test set size: 512, Validation set size: 10
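
The split sizes and batch counts follow from simple arithmetic on the 5574 records reported earlier. The sketch below, built around a hypothetical helper `num_batches`, shows how many batches each split yields with and without `drop_remainder`; it illustrates the arithmetic only and is not part of the pipeline.

```python
import math

def num_batches(num_records, batch_size, drop_remainder):
    """Number of batches produced when batching `num_records` elements."""
    if drop_remainder:
        return num_records // batch_size         # Partial final batch is discarded
    return math.ceil(num_records / batch_size)   # Partial final batch is kept

total, test_size, val_size, batch_size = 5574, 512, 10, 128
train_size = total - test_size - val_size  # Records left for training

for name, size in [("test", test_size), ("val", val_size), ("train", train_size)]:
    dropped = num_batches(size, batch_size, drop_remainder=True)
    kept = num_batches(size, batch_size, drop_remainder=False)
    print(f"{name}: {size} records, {dropped} batches (remainder dropped), {kept} batches (remainder kept)")
```

A split smaller than the batch size produces no batches at all when the remainder is dropped, so these counts are worth checking before training.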


# Building the Embedding Matrix

Word embeddings are central to many natural language processing tasks. The following code builds an embedding matrix from pre-trained GloVe vectors, with one row per vocabulary index, which can then be used in the neural network.



In [57]:
# List available models in gensim - this line can be used to select an appropriate pre-trained embedding model
api.info()['models'].keys()

# Define a function to build an embedding matrix
def build_embedding_matrix(sequences, word2idx, embedding_dim, embedding_file):
    # Check if the embedding matrix file exists to avoid redundant computations
    if os.path.exists(embedding_file):
        E = np.load(embedding_file)  # Load the embedding matrix if it already exists
    else:
        vocab_size = len(word2idx)  # Get the size of the vocabulary
        E = np.zeros((vocab_size, embedding_dim))  # Initialize the embedding matrix with zeros
        word_vectors = api.load(EMBEDDING_MODEL)  # Load pre-trained word vectors
        for word, idx in word2idx.items():
            try:
                E[idx] = word_vectors.get_vector(word)  # Assign the vector of each word in the vocabulary
            except KeyError:  # If a word is not in the embedding model
                pass  # Skip the word
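        os.makedirs(os.path.dirname(embedding_file) or ".", exist_ok=True)  # Make sure the target directory exists
        np.save(embedding_file, E)  # Save the embedding matrix for future use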
    return E

# Embedding dimensions and file specifications
EMBEDDING_DIM = 300  # Dimension of the GloVe vectors
DATA_DIR = "data"  # Directory to store data files
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, "E.npy")  # Path for the numpy file of the embedding matrix
EMBEDDING_MODEL = "glove-wiki-gigaword-300"  # The pre-trained GloVe model to use

# Build the embedding matrix based on the current dataset
E = build_embedding_matrix(text_sequences, word2idx, EMBEDDING_DIM, EMBEDDING_NUMPY_FILE)
print("Embedding matrix shape:", E.shape)  # Print the shape of the embedding matrix


Embedding matrix shape: (9010, 300)
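
The lookup logic can be seen on a toy vocabulary. In this sketch a plain dictionary stands in for the gensim vectors, and the 3-dimensional values are invented for illustration; words without a pre-trained vector, including the padding token, keep an all-zero row.

```python
import numpy as np

def toy_embedding_matrix(word2idx, vectors, embedding_dim):
    """Build a (vocab_size, embedding_dim) matrix; rows of unknown words stay zero."""
    E = np.zeros((len(word2idx), embedding_dim))
    for word, idx in word2idx.items():
        vector = vectors.get(word)  # None when the word has no pre-trained vector
        if vector is not None:
            E[idx] = vector
    return E

# Hypothetical vocabulary and made-up vectors (a stand-in for the GloVe model)
toy_word2idx = {"PAD": 0, "free": 1, "prize": 2, "xyzzy": 3}
toy_vectors = {"free": [0.1, 0.2, 0.3], "prize": [0.4, 0.5, 0.6]}

toy_E = toy_embedding_matrix(toy_word2idx, toy_vectors, embedding_dim=3)
covered = int(np.count_nonzero(toy_E.any(axis=1)))  # Rows that received a vector
print(toy_E.shape, covered)
```

Counting the non-zero rows gives a rough measure of how much of the vocabulary the pre-trained vectors actually cover.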


# Defining and Training the Spam Classifier Model

This section defines a custom Keras model for spam classification: an embedding layer followed by a 1D convolution, spatial dropout, global max pooling, and a softmax dense layer. The model is then trained and evaluated on the test set.
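
As a rough sanity check on the architecture, the tensor shapes and parameter counts can be worked out by hand from the values used in this notebook (vocabulary 9010, sequence length 189, embedding dimension 300, 256 filters of width 3, 2 classes). The sketch assumes the `Conv1D` defaults of `'valid'` padding and stride 1; the helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size):
    """Output length of a 1D convolution with 'valid' padding and stride 1."""
    return input_length - kernel_size + 1

def parameter_counts(vocab_size, embed_dim, num_filters, kernel_size, num_classes):
    """Trainable parameters per layer (weights plus biases where the layer has them)."""
    return {
        "embedding": vocab_size * embed_dim,
        "conv1d": kernel_size * embed_dim * num_filters + num_filters,
        "dense": num_filters * num_classes + num_classes,
    }

conv_length = conv1d_output_length(input_length=189, kernel_size=3)
counts = parameter_counts(vocab_size=9010, embed_dim=300, num_filters=256,
                          kernel_size=3, num_classes=2)
total_params = sum(counts.values())

print("Sequence length after convolution:", conv_length)
for layer, n in counts.items():
    print(f"{layer}: {n:,}")
print(f"total: {total_params:,}")
```

Dropout and global max pooling add no parameters; pooling collapses the sequence axis so that the dense layer sees one value per filter.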



In [58]:
class SpamClassifierModel(tf.keras.Model):
    def __init__(self, vocab_sz, embed_sz, input_length, num_filters, kernel_sz, output_sz, run_mode, embedding_weights, **kwargs):
        super(SpamClassifierModel, self).__init__(**kwargs)
        # Choose the embedding layer based on the run mode: 'scratch', 'vectorizer', or 'finetuning'
        if run_mode == 'scratch':
            # Create an embedding layer trainable from scratch
            self.embedding = tf.keras.layers.Embedding(input_dim=vocab_sz, output_dim=embed_sz, input_length=input_length, trainable=True)
        elif run_mode == 'vectorizer':
            # Use a pre-trained embedding as a static vectorizer
            self.embedding = tf.keras.layers.Embedding(input_dim=vocab_sz, output_dim=embed_sz, input_length=input_length, weights=[embedding_weights], trainable=False)
        else:  # finetuning
            # Fine-tune the pre-trained embedding
            self.embedding = tf.keras.layers.Embedding(input_dim=vocab_sz, output_dim=embed_sz, input_length=input_length, weights=[embedding_weights], trainable=True)
        
        # Additional layers of the model
        self.conv = tf.keras.layers.Conv1D(filters=num_filters, kernel_size=kernel_sz, activation='relu')  # Convolutional layer
        self.dropout = tf.keras.layers.SpatialDropout1D(0.2)  # Dropout layer to prevent overfitting
        self.pool = tf.keras.layers.GlobalMaxPool1D()  # Global max pooling layer
        self.dense = tf.keras.layers.Dense(output_sz, activation='softmax')  # Dense layer for classification

    def call(self, x):
        # Forward pass through the layers
        x = self.embedding(x)
        x = self.conv(x)
        x = self.dropout(x)
        x = self.pool(x)
        x = self.dense(x)
        return x

# Create an instance of the SpamClassifierModel with specified parameters
conv_num_filters = 256  # Number of convolutional filters
conv_kernel_size = 3   # Size of the convolutional kernel
model = SpamClassifierModel(vocab_size, EMBEDDING_DIM, max_seqlen, conv_num_filters, conv_kernel_size, NUM_CLASSES, 'scratch', E)

# Build and compile the model
model.build(input_shape=(None, max_seqlen))
model.compile(optimizer='adam', loss="categorical_crossentropy", metrics=["accuracy"])

# Train the model on the training dataset
NUM_EPOCH = 3  # Number of training epochs
CLASS_WEIGHTS = {0:1, 1:8}  # Class weights to handle class imbalance
model.fit(train_dataset, epochs=NUM_EPOCH, validation_data=val_dataset, class_weight=CLASS_WEIGHTS)

# Evaluate the model on the test dataset
# (test_labels avoids overwriting the `labels` list loaded from the dataset)
test_labels, predictions = [], []
for Xtest, Ytest in test_dataset:
    Ytest_ = model.predict_on_batch(Xtest)
    ytest = np.argmax(Ytest, axis=1)    # True class indices from the one-hot labels
    ytest_ = np.argmax(Ytest_, axis=1)  # Predicted class indices
    test_labels.extend(ytest.tolist())
    predictions.extend(ytest_.tolist())

# Calculate and display test accuracy and confusion matrix
if len(test_labels) == len(predictions):
    print('Test accuracy: {:3f}'.format(accuracy_score(test_labels, predictions)))
    print("Confusion matrix:")
    print(confusion_matrix(test_labels, predictions))
else:
    print('Error: Mismatch in the sizes of labels and predictions lists')


Epoch 1/3
Epoch 2/3
Epoch 3/3
Test accuracy: 1.000000
Confusion matrix:
[[438   0]
 [  0  74]]
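
With imbalanced classes, accuracy alone can hide errors on the minority class, so per-class precision and recall are a useful complement. The sketch below derives them from a confusion matrix laid out the way scikit-learn's `confusion_matrix` returns it (rows are true classes, columns are predictions). The example matrix is invented for illustration and is not a result from this notebook.

```python
def per_class_metrics(cm):
    """Return {class: (precision, recall)} for a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n = len(cm)
    metrics = {}
    for c in range(n):
        true_positive = cm[c][c]
        predicted_c = sum(cm[r][c] for r in range(n))  # Column sum: everything predicted as c
        actual_c = sum(cm[c])                          # Row sum: everything that really is c
        precision = true_positive / predicted_c if predicted_c else 0.0
        recall = true_positive / actual_c if actual_c else 0.0
        metrics[c] = (precision, recall)
    return metrics

# Hypothetical confusion matrix: class 0 = ham, class 1 = spam
example_cm = [[430, 8],
              [5, 69]]
example_metrics = per_class_metrics(example_cm)
for cls, (precision, recall) in example_metrics.items():
    print(f"class {cls}: precision={precision:.3f}, recall={recall:.3f}")
```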


# Spam Prediction Function

To use the trained model on new messages, the function below tokenizes a message, pads it to the model's input length, and returns the predicted class.



In [59]:
def predict_spam(message, model, tokenizer, max_seqlen):
    # Tokenize and convert the message into a sequence of indices
    sequence = tokenizer.texts_to_sequences([message])
    
    # Pad the sequence to ensure it has the same length as the model's input
    padded_sequence = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=max_seqlen)
    
    # Make a prediction using the model
    prediction = model.predict(padded_sequence)
    
    # Determine the class (spam or not spam) based on the model's prediction
    spam_prediction = prediction[0][1]  # Assuming the classes are [not spam, spam]
    
    return "Spam" if spam_prediction > 0.5 else "Not Spam"

# Example usage of the function
message_to_check = "Hi, this is Cynde from HR. We have a couple question regarding your application. Please call [number] to schedule a interview."
print(predict_spam(message_to_check, model, tokenizer, max_seqlen))


Spam
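
The decision rule at the end of `predict_spam` can be isolated and applied to several probability rows at once. This sketch uses a hypothetical helper `labels_from_probabilities` and made-up probabilities. Raising the threshold above 0.5 is one possible way to trade missed spam for fewer false alarms; it is not something the notebook itself does.

```python
def labels_from_probabilities(probabilities, threshold=0.5):
    """Turn rows of [p_not_spam, p_spam] into text labels using a threshold on p_spam."""
    return ["Spam" if row[1] > threshold else "Not Spam" for row in probabilities]

# Made-up softmax outputs for three messages
toy_probabilities = [[0.9, 0.1], [0.3, 0.7], [0.45, 0.55]]

default_labels = labels_from_probabilities(toy_probabilities)
strict_labels = labels_from_probabilities(toy_probabilities, threshold=0.6)
print(default_labels)
print(strict_labels)
```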
