## Recurrent Neural Network - Spam Classification

In [1]:
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import torch
import pandas as pd

## üç≥ Data Preparation

Let's have a look at the dataset. Our objective is to predict whether a message is ham or spam. An example client that would benefit from this is a communication company that likes to protect its clients of spam messages like `me and your uncle have founds lots of gold and tombs underground and we need...`.

In [2]:
df = pd.read_csv("./sms.csv")
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will √º b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [3]:
x_data, y_data = df['Message'], df['Category']

Let's convert the sequence of sentences `x_data` to a sequence of (sequences of words):

In [4]:
nltk.download('stopwords')                          # so that we can ignore them as they are irrelevant (a, the, in, but, for, etc.)
stop_words = set(stopwords.words('english'))
nltk.download('punkt')


def preprocess_text(text):
    # (1) Splitting texts into words (so later it can be a sequence of vectors)
    tokens = word_tokenize(text)

    # (2) Stop word removal
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return filtered_tokens

# Preprocess the dataset's text data
x_data_processed = [preprocess_text(text) for text in x_data]
print(x_data_processed[0], " is ", y_data[0])

[nltk_data] Downloading package stopwords to /Users/essam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/essam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Go', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'got', 'amore', 'wat', '...']  is  ham


In [5]:
# Let's map from each word to its frequency
all_words = [word for text in x_data_processed for word in text]
word_counts = Counter(all_words)

# Assume all words that occur less than 2 times in the dataset are unimportant and ditch them (to reduce the dimensionality of the input)
min_freq = 2
vocabulary = [word for word, count in word_counts.items() if count >= min_freq]

# Prepend PAD token with index 0
vocabulary.insert(0, 'PAD')

# Create a dictionary for word -> index mapping (improves lookup speed)
word_to_index = {word: i for i, word in enumerate(vocabulary)}


print(f"The vocabulary in our dataset is {list(vocabulary)[:9]}... and has {len(vocabulary)} words")

The vocabulary in our dataset is ['PAD', 'Go', 'point', ',', 'crazy', '..', 'Available', 'bugis', 'n']... and has 5199 words


Encode the text data by mapping each word to its index in the vocabulary

In [6]:
sequences = [
    [word_to_index[word] for word in sublist if word in word_to_index]
    for sublist in x_data_processed
]

Now let's pad or truncate the sequences so they're all of the same length of `max_seq_length` (as the batch must be a tensor which can't store vectors of different sizes). We know that `0` corresponds to the PAD token.

In [7]:
max_seq_length = max(len(seq) for seq in sequences)

padded_sequences = [[0] * max(0, max_seq_length - len(seq)) +seq[:max_seq_length] for seq in sequences]

Setup train loader

In [8]:
from torch.utils.data import DataLoader, TensorDataset, random_split

category_to_label = {category: label for label, category in enumerate(set(y_data))}
print(category_to_label)
encoded_labels = [category_to_label[category] for category in y_data]

# Assuming you have already loaded your data into X_processed and encoded_labels
x_data_tensor = torch.tensor(padded_sequences, dtype=torch.long)
y_data_tensor = torch.tensor(encoded_labels, dtype=torch.long)


dataset = TensorDataset(x_data_tensor, y_data_tensor)
train_dataset, val_dataset = random_split(dataset, [0.8, 0.2])
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

{'ham': 0, 'spam': 1}


## üö¢ Model Building

A recurrent neural network is just like a feedforward neural network with recurrent layers. The recurrent layer:
- Assumes that the input is a sequence of vectors and not a single vector

- Outputs a vector for each of those in the sequence via a linear transformation followed by a non-linearity

- It shares information between these vectors by involving the output of the layer from the previous time step to the compuation.

The following example compares a feedforward neural network with two feedforward layers to a recurrent neural network with a recurrent layer and a feedforward layer.

<img width="1300" src="https://i.imgur.com/DGSeJpG.png">

This is the general case where we may be interested in classifying each token in the sequence. In this notebook, we classify a less general case where we classify the whole sentence into one of two classes (spam or ham). In this case, we only need the final hidden state (which carries information about all previous ones) to make the classification by passing it to the final layer (g).

In [14]:
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(RNNModel, self).__init__()
        # Behaves like a linear layer but with no bias (i.e., z = W @ x)
        # It converts the sequence of integers (representing a sequence of one-hot vectors) corresponding to a sentence x
        # To a sequence of vectors by applying W @ x for each one-hot vector x. More disucssion goes into why bias is dropped.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)          

        # This is the RNN layer
        self.rnn = nn.RNN(embedding_dim, hidden_dim, nonlinearity='tanh', batch_first=True)      
        # (batch_size, seq_len, H_in) -> (batch_size, seq_len, H_out)
        
        # This is a linear layer to classify the final output of the RNN
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # An embedding layer that does Wx for each one-hot vector x represented as an integer
        embedded = self.embedding(x)
                
        # Get RNN output (last hidden state)
        rnn_output, _ = self.rnn(embedded)
        rnn_output = rnn_output[:, -1, :]    # We are only interested in the final hidden state (carries information for all timesteps)
        
        # Get the classification out of the last hidden state
        output = self.fc2(rnn_output)
        return output

# Assuming vocab_size, embedding_dim, hidden_dim, and output_dim are defined
vocab_size = len(vocabulary)
model = RNNModel(vocab_size, embedding_dim=64, hidden_dim=64, output_dim=2)

Other deep sequence models that extend RNNs to solve some of its issues during training are LSTMs and GRUs. Read more about them [here](https://medium.com/towards-data-science/an-intuitive-approach-to-understading-of-lstms-and-grus-c2191611a37d). Try to increase the accuracy using them later.

## üí™ Model Training

In [15]:
from tqdm.notebook import tqdm

criterion = nn.CrossEntropyLoss()           # In the loss function give "spam" examples higher weight
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,)

num_epochs = 5
num_batches = len(train_loader)
with tqdm(range(num_epochs), desc="Epochs") as epoch_bar:
    for epoch in epoch_bar:
        model.train()  # Set the model to training mode
        total_loss, num_correct = 0.0, 0
        for xb, yb in train_loader:
            # 1. Forward pass
            yÃÇb = model(xb)
            
            # 2. Backward pass
            loss = criterion(yÃÇb, yb)
            loss.backward()
            
            # 3. Optimization step
            optimizer.step()
            optimizer.zero_grad()

            # 4. Statistics:
            yÃÇb = torch.argmax(yÃÇb, dim=1)
            num_correct += (yÃÇb == yb).sum().item()

            total_loss += loss.item()

        average_loss = total_loss / len(train_loader)
        accuracy = num_correct / (num_batches * train_loader.batch_size)
        epoch_bar.set_postfix(loss=average_loss, accuracy = accuracy)


Epochs:   0%|          | 0/5 [00:00<?, ?it/s]

## üïµÔ∏è Evaluation

In [16]:
num_batches = len(val_loader)
with torch.no_grad():
    model.eval()  # Set the model to evaluation mode
    num_correct = 0
    for xb, yb in val_loader:
        
        # 1. Forward pass
        yÃÇb = model(xb)
        
        # 2. Statistics 
        yÃÇb = torch.argmax(yÃÇb, dim=1)
        num_correct += (yÃÇb == yb).sum().item()
        
    accuracy = num_correct / (num_batches * val_loader.batch_size)
    print(f"Validation Accuracy: {accuracy:.4f}")

Validation Accuracy: 0.9723


### üíª Deployment

In [17]:
import torch
from torch.nn import functional as F
from nltk.tokenize import word_tokenize

def predict_single_example(example_text, model, category_to_label):
    stop_words = set(stopwords.words('english'))
    
    # 1. Preprocess the example text
    tokens = word_tokenize(example_text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # 2. Convert text to sequence using tokenizer
    sequence = [vocabulary.index(word) for word in filtered_tokens if word in vocabulary]
    padded_sequence = [0] * max(0, max_seq_length - len(sequence)) +sequence[:max_seq_length]
    
    # 3. Convert padded sequence to tensor
    input_tensor = torch.tensor(padded_sequence, dtype=torch.long)
    
    # 4. Add batch dimension
    input_tensor = input_tensor.unsqueeze(0)
    
    # 5. Make prediction
    with torch.no_grad():
        model.eval()
        output = model(input_tensor)
        probabilities = F.softmax(output, dim=1)
        predicted_label = torch.argmax(probabilities, dim=1).item()
    
    # Map predicted label to category
    predicted_category = "Spam" if predicted_label == 1 else "Not Spam"
    
    return predicted_category, probabilities.numpy()[0][predicted_label]

In [19]:
# choose a random sentence form val_dataset
example_text = "URGENT! You have won a 1 week FREE membership in our ¬£100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18"

predicted_category, confidence = predict_single_example(example_text, model, category_to_label)
print(f"Predicted Category: {predicted_category}, Confidence: {confidence:.4f}")

Predicted Category: Spam, Confidence: 0.9985


**Note**

We didn't use [torchtext](https://github.com/pytorch/text) because PyTorch will no longer develop/support this package. It has indeed been overtaken by [HuggingFace](https://github.com/huggingface) in the industry.