# RNNs for Text Classification

We will use an RNN-based model to classify SMS messages as spam or not spam. The notebook splits the process into several parts for your convenience. To appreciate preprocessing in music, it is important to understand preprocessing in other domains. You may need to read up on tokenization, embeddings (GloVe), and DataLoaders, as well as how to build an end-to-end AI pipeline.

**Resources:** \
https://www.geeksforgeeks.org/pre-trained-word-embedding-using-glove-in-nlp-models/

We do not expect you to finish the code entirely. Take help whenever required, but make sure you understand the code you have written; do not blindly copy code. If you need help at any time, feel free to contact the project leads.

| Name | Phone Number |
| :-- | :-- |
| Pranay Mathur | 7032832559|
| Swathi Narashiman | 6379869509 |

In [1]:
# Imports
from IPython.display import clear_output
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import spacy
import re
import string
from collections import Counter
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import tqdm

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Downloading the Spam SMS Dataset
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

!unzip /content/smsspamcollection.zip
!rm /content/readme
!rm /content/smsspamcollection.zip

clear_output()

In [3]:
# Downloading the GloVe embeddings database

!wget https://nlp.stanford.edu/data/glove.6B.zip

!unzip /content/glove.6B.zip

!rm -rf /content/glove.6B.zip
!rm /content/glove.6B.100d.txt
!rm /content/glove.6B.200d.txt
!rm /content/glove.6B.300d.txt

clear_output()


In [4]:


with open("/content/SMSSpamCollection") as f:
    # Create a DataFrame with one row per line of the file
    df = pd.DataFrame(f)
    df.columns = ['Text']
    # Each line is "<label>\t<message>"; split it into the two columns
    df[['Label', 'Text']] = df['Text'].str.split('\t', n=1, expand=True)

# Label spam as 1 and ham as 0
df['SpamYes'] = df['Label'].str.contains('spam', case=True).astype(int)

df

Unnamed: 0,Text,Label,SpamYes
0,"Go until jurong point, crazy.. Available only ...",ham,0
1,Ok lar... Joking wif u oni...\n,ham,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam,1
3,U dun say so early hor... U c already then say...,ham,0
4,"Nah I don't think he goes to usf, he lives aro...",ham,0
...,...,...,...
5569,This is the 2nd time we have tried 2 contact u...,spam,1
5570,Will ü b going to esplanade fr home?\n,ham,0
5571,"Pity, * was in mood for that. So...any other s...",ham,0
5572,The guy did some bitching but I acted like i'd...,ham,0


In [5]:
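spacy_tokenizer = spacy.load('en_core_web_sm')  # spaCy English pipeline, used here only for tokenizing

def tokenize(text):
    """Return the lowercase, ASCII-only alphabetic tokens of `text`."""
    doc = spacy_tokenizer(text)
    tokenized_data = []
    for token in doc:
        # is_alpha already excludes punctuation and digits
        if token.is_alpha and not token.is_punct:
            # Skip tokens containing non-ASCII characters (e.g. "ü")
            if all(ord(char) < 128 for char in token.text):
                tokenized_data.append(token.text.lower())

    return tokenized_data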




In [6]:
df['Tokenized_Text'] = df['Text'].apply(tokenize)
df


Unnamed: 0,Text,Label,SpamYes,Tokenized_Text
0,"Go until jurong point, crazy.. Available only ...",ham,0,"[go, until, jurong, point, crazy, available, o..."
1,Ok lar... Joking wif u oni...\n,ham,0,"[ok, lar, joking, wif, u, oni]"
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam,1,"[free, entry, in, a, wkly, comp, to, win, fa, ..."
3,U dun say so early hor... U c already then say...,ham,0,"[u, dun, say, so, early, hor, u, c, already, t..."
4,"Nah I don't think he goes to usf, he lives aro...",ham,0,"[nah, i, do, think, he, goes, to, usf, he, liv..."
...,...,...,...,...
5569,This is the 2nd time we have tried 2 contact u...,spam,1,"[this, is, the, time, we, have, tried, contact..."
5570,Will ü b going to esplanade fr home?\n,ham,0,"[will, b, going, to, esplanade, fr, home]"
5571,"Pity, * was in mood for that. So...any other s...",ham,0,"[pity, was, in, mood, for, that, so, any, othe..."
5572,The guy did some bitching but I acted like i'd...,ham,0,"[the, guy, did, some, bitching, but, i, acted,..."
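
An `nn.Embedding` layer expects integer indices rather than token strings, so the tokenized messages need a word-to-index mapping at some point. The following is a minimal sketch of one way to build it with `Counter` (already imported above). The toy messages, the `min_freq` parameter, the helper names `build_vocab` and `encode_tokens`, and the `<pad>`/`<unk>` entries are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Toy tokenized messages standing in for df['Tokenized_Text'] (assumed data)
tokenized_messages = [
    ["free", "entry", "to", "win"],
    ["ok", "lar", "joking"],
    ["win", "a", "free", "prize"],
]

def build_vocab(messages, min_freq=1):
    """Map each token seen at least min_freq times to an integer index."""
    counts = Counter(token for message in messages for token in message)
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved indices (a common convention)
    for token, freq in counts.most_common():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab

def encode_tokens(message, vocab):
    """Convert tokens to indices, falling back to <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in message]

toy_vocab = build_vocab(tokenized_messages)
print(encode_tokens(["free", "prize", "now"], toy_vocab))
```

Reserving index 0 for padding keeps it free for use when messages are later padded to a common length.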


In [7]:
def load_GloVe_embeddings(glove_file_):
    """Load GloVe embeddings from the given file."""
    word_embeddings = {}
    with open(glove_file_, 'r', encoding='utf-8',errors='ignore') as file:
        for line in file:
            values = line.split()
            if len(values) < 1:
                continue
            word = values[0]
            try:
                embedding = np.array(values[1:], dtype='float')
                word_embeddings[word] = embedding
            except ValueError:
                continue
    return word_embeddings

glove_file_ = "glove.6B.50d.txt"
glove_embeddings = load_GloVe_embeddings(glove_file_)

# Keep a word -> vector dictionary under the name used in later cells
word_embedding_dict = dict(glove_embeddings)

# Sanity check: look up one word (the keys in GloVe 6B are lowercase)
print("Embedding of 'Success':", word_embedding_dict['success'])


Embedding of 'Success': [-0.09365   0.52359  -0.25771  -0.070594 -0.024704  0.15472  -0.71638
  0.19921   0.37337   1.07      0.18652   0.14885  -0.90904  -0.29839
  0.26002  -0.1573    0.62769  -0.054288 -0.45904  -0.11208  -0.038781
 -0.01349  -0.029441 -0.29288   0.48665  -1.1699   -1.2364   -0.50309
  0.28261   0.36255   3.0535    0.75565   0.058124 -0.6127    0.11773
 -0.24694  -0.22847   0.58849  -0.78022  -0.95572  -0.46784  -0.54124
 -0.67443  -0.50789  -0.2992    0.034926 -0.21086   0.28679  -0.099461
  0.16039 ]
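
GloVe vectors are usually compared with cosine similarity: words used in similar contexts tend to have vectors that point in similar directions. The sketch below uses made-up 3-dimensional vectors rather than real GloVe values, and `cosine_similarity` is a helper written only for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Invented toy vectors, NOT real GloVe embeddings
toy_vectors = {
    "win": [0.9, 0.1, 0.0],
    "prize": [0.8, 0.2, 0.1],
    "home": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["win"], toy_vectors["prize"]))
print(cosine_similarity(toy_vectors["win"], toy_vectors["home"]))
```

With the real embeddings, the same function could be called on two entries of `word_embedding_dict`, since NumPy arrays iterate element by element just like lists.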


In [8]:
def embed_text(tokenized_data, word_embeddings,embedding_size=50):
    """
    Given a sequence of tokens, convert them to their word embeddings.
    """
    embedded_text = []
    for token in tokenized_data:
        if token in word_embeddings:
            embedded_text.append(word_embeddings[token])
        else:
            # Handle out-of-vocabulary tokens with an all-zero vector
            embedded_text.append(np.zeros(embedding_size))

    # Convert the list of vectors to a NumPy array
    return np.array(embedded_text)



In [9]:
def apply_embed_text(tokenized_data):
    return embed_text(tokenized_data, glove_embeddings)


In [10]:
df["Embedded_Text"] = df["Tokenized_Text"].apply(apply_embed_text)
df

Unnamed: 0,Text,Label,SpamYes,Tokenized_Text,Embedded_Text
0,"Go until jurong point, crazy.. Available only ...",ham,0,"[go, until, jurong, point, crazy, available, o...","[[0.14828, 0.17761, 0.42346, -0.31489, 0.32273..."
1,Ok lar... Joking wif u oni...\n,ham,0,"[ok, lar, joking, wif, u, oni]","[[-0.53646, -0.072432, 0.24182, 0.099021, 0.18..."
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam,1,"[free, entry, in, a, wkly, comp, to, win, fa, ...","[[-0.41183, 0.4528, 0.02825, -0.28702, 0.03702..."
3,U dun say so early hor... U c already then say...,ham,0,"[u, dun, say, so, early, hor, u, c, already, t...","[[-0.25676, 0.8549, 1.1003, 0.95363, 0.36585, ..."
4,"Nah I don't think he goes to usf, he lives aro...",ham,0,"[nah, i, do, think, he, goes, to, usf, he, liv...","[[0.50959, 1.2707, -0.078318, -1.4834, -0.3478..."
...,...,...,...,...,...
5569,This is the 2nd time we have tried 2 contact u...,spam,1,"[this, is, the, time, we, have, tried, contact...","[[0.53074, 0.40117, -0.40785, 0.15444, 0.47782..."
5570,Will ü b going to esplanade fr home?\n,ham,0,"[will, b, going, to, esplanade, fr, home]","[[0.81544, 0.30171, 0.5472, 0.46581, 0.28531, ..."
5571,"Pity, * was in mood for that. So...any other s...",ham,0,"[pity, was, in, mood, for, that, so, any, othe...","[[-0.052489, 0.30524, -0.33187, -0.43559, 0.53..."
5572,The guy did some bitching but I acted like i'd...,ham,0,"[the, guy, did, some, bitching, but, i, acted,...","[[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -..."
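
Each message has a different number of tokens, so the arrays in `Embedded_Text` have different lengths, and a batch needs every sequence to share one length. A common remedy is to truncate long messages and pad short ones with zero vectors. The sketch below does this on plain Python lists; `max_len` and the helper name `pad_or_truncate` are assumptions made for the example. It pads at the front so that the last time step of each sequence is a real token.

```python
def pad_or_truncate(sequence, max_len, embedding_size):
    """Return exactly max_len vectors: truncate long inputs, zero-pad short ones at the front."""
    kept = [list(vector) for vector in sequence[:max_len]]
    padding = [[0.0] * embedding_size for _ in range(max_len - len(kept))]
    return padding + kept

# Toy 2-dimensional "embeddings" for a 3-token and a 5-token message (assumed data)
short_message = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
long_message = [[float(i), float(i)] for i in range(5)]

padded_short = pad_or_truncate(short_message, max_len=4, embedding_size=2)
padded_long = pad_or_truncate(long_message, max_len=4, embedding_size=2)

print(padded_short)
print(len(padded_long))
```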


In [11]:
from torch.utils.data import Dataset, DataLoader


words = list(word_embedding_dict.keys())
embeddings = list(word_embedding_dict.values())

class WordEmbeddingsDataset(Dataset):
    def __init__(self, words, embeddings):
        self.words = words
        self.embeddings = embeddings

    def __len__(self):
        return len(self.words)

    def __getitem__(self, idx):
        return self.words[idx], self.embeddings[idx]

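# Create a DataLoader over the GloVe vocabulary.
# Note: each item is a (word, vector) pair from GloVe, not an SMS message and its label.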
batch_size = 64
dataset = WordEmbeddingsDataset(words, embeddings)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


dataloader





In [12]:

class RNN(nn.Module):
    def __init__(self, vocab_size, num_layers, embedding_dim, hidden_size):
        super(RNN, self).__init__()
        # Define vocab_size in __init__
        self.vocab_size = vocab_size
        # Embedding layer
        self.embedding = nn.Embedding(self.vocab_size, embedding_dim)
        # LSTM
        self.rnn = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        # Fully connected layer for output
        self.fc = nn.Linear(hidden_size, 1)  # Assuming binary classification
        # Activation function
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        # Embedding
        embedded = self.embedding(x)
        rnn_out, _ = self.rnn(embedded)
        # Fully connected layer
        out = self.fc(rnn_out[:, -1, :])
        # Apply sigmoid activation for binary classification
        out = self.sigmoid(out)
        return out


In [13]:
def train_model(num_epochs, train_loader, model, criterion, optimizer):
    """
    Trainer loop for the model.
    """

    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        correct = 0
        total = 0

        for inputs, labels in train_loader:
            # Zero the parameter gradients
            optimizer.zero_grad()

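            # Forward pass: the model returns sigmoid probabilities of shape (batch, 1)
            outputs = model(inputs).view(-1)
            labels = labels.float().view(-1)

            # Calculate loss (BCELoss expects float targets with the same shape as the outputs)
            loss = criterion(outputs, labels)

            # Backward pass and optimize
            loss.backward()
            optimizer.step()

            # Compute statistics: threshold the probability at 0.5 to get the predicted class
            running_loss += loss.item()
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()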

        epoch_loss = running_loss / len(train_loader)
        epoch_acc = 100 * correct / total
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.2f}%")

    print('Finished Training')
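
The model ends in a sigmoid, so each output can be read as a probability of spam, and binary cross-entropy is the usual loss for that setup: for a label $y$ and probability $p$ it is $-[y \log p + (1 - y) \log(1 - p)]$, averaged over the batch. The sketch below computes it by hand in plain Python to show what the criterion measures. The probabilities and labels are invented, and the `eps` clipping is this sketch's own choice rather than a description of how PyTorch handles it.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def threshold(probs, cutoff=0.5):
    """Turn probabilities into hard 0/1 predictions."""
    return [1 if p > cutoff else 0 for p in probs]

# Invented example: three messages, the second one is spam
example_probs = [0.1, 0.8, 0.3]
example_labels = [0, 1, 0]

print(binary_cross_entropy(example_probs, example_labels))
print(threshold(example_probs))
```

A confident wrong prediction (for instance `p = 0.99` for a ham message) is penalised far more heavily than a hesitant one, which is what pushes the probabilities towards the correct side of the cutoff during training.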


In [14]:

# Evaluate the trained model on held-out data; no weights are updated here
def evaluate_model(model, dataloader, criterion):
    model.eval()
    correct = 0
    total = 0
    running_loss = 0.0

    with torch.no_grad():
        for inputs, labels in dataloader:
            # Flatten sigmoid outputs and labels to shape (batch,) so they line up
            outputs = model(inputs).view(-1)
            labels = labels.float().view(-1)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            # Threshold the spam probability at 0.5 to get the predicted class
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total
    average_loss = running_loss / len(dataloader)

    print(f'Accuracy: {accuracy:.4f}, Average Loss: {average_loss:.4f}')

    return accuracy, average_loss
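
Most messages in this dataset are ham, so accuracy alone can look good even for a model that never predicts spam. Precision and recall on the spam class give a clearer picture. The sketch below computes them in plain Python on invented predictions; `precision_recall_f1` is a helper written for this illustration, not part of the notebook.

```python
def precision_recall_f1(predicted, actual, positive=1):
    """Precision, recall and F1 for the positive class, with 0.0 when undefined."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Invented example: 8 ham and 2 spam messages, and a model that always says "ham"
actual_labels = [0] * 8 + [1] * 2
always_ham = [0] * 10

naive_accuracy = sum(p == a for p, a in zip(always_ham, actual_labels)) / len(actual_labels)
print(naive_accuracy)
print(precision_recall_f1(always_ham, actual_labels))
```

Here the accuracy is high while recall on spam is zero, which is the failure mode these extra metrics are meant to expose.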


In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import torch.optim as optim


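# Create arrays of token indices and labels to be fed to the neural network

label = df['Label'].values
label_encoder = LabelEncoder()
encoded_label = label_encoder.fit_transform(label)  # ham -> 0, spam -> 1
# BCELoss expects float targets with the same shape as the model output, (batch, 1)
y = encoded_label.astype(np.float32).reshape(-1, 1)

# Build the vocabulary from the SMS tokens; index 0 is reserved for padding
vocab = sorted({token for tokens in df['Tokenized_Text'] for token in tokens})
word_to_idx = {word: idx + 1 for idx, word in enumerate(vocab)}

max_len = 50  # assumed maximum number of tokens kept per message

def encode(tokens):
    """Map tokens to indices, truncate to max_len, and left-pad with 0."""
    ids = [word_to_idx[token] for token in tokens][:max_len]
    return [0] * (max_len - len(ids)) + ids

X = np.array(df['Tokenized_Text'].apply(encode).tolist(), dtype=np.int64)


In [20]:
from torch.utils.data import TensorDataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_dataset = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
test_dataset = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))

vocab_size = len(word_to_idx) + 1  # +1 for the padding index 0
num_layers = 1
embedding_dim = 50  # must match the 50-dimensional GloVe vectors loaded above
hidden_size = 128
model = RNN(vocab_size, num_layers, embedding_dim, hidden_size)

# Initialise the embedding layer with GloVe vectors; words missing from GloVe keep a zero vector
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
for word, idx in word_to_idx.items():
    vector = word_embedding_dict.get(word)
    if vector is not None and len(vector) == embedding_dim:
        embedding_matrix[idx] = vector
model.embedding.weight.data.copy_(torch.from_numpy(embedding_matrix))

learning_rate = 0.001
num_epochs = 10
batch_size = 64
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Batch the SMS data (not the GloVe vocabulary) for training and testing
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

train_model(num_epochs, train_loader, model, criterion, optimizer)
test_accuracy, test_loss = evaluate_model(model, test_loader, criterion)
print(f"Test Accuracy: {test_accuracy}")

torch.save(model.state_dict(), 'model.pth')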