# <span style="color:#0b486b">  FIT3181/5215: Deep Learning (2025)</span>
***
*CE/Lecturer (Clayton):*  **Dr Trung Le** | trunglm@monash.edu <br/>
*Lecturer (Clayton):* **A/Prof Zongyuan Ge** | zongyuan.ge@monash.edu <br/>
*Lecturer (Malaysia):*  **Dr Arghya Pal** | arghya.pal@monash.edu <br/>
 <br/>
*Head Tutor 3181:*  **Ms Ruda Nie H** |  \[RudaNie.H@monash.edu \] <br/>
*Head Tutor 5215:*  **Ms Leila Mahmoodi** |  \[leila.mahmoodi@monash.edu \]

<br/> <br/>
Faculty of Information Technology, Monash University, Australia
***

# <span style="color:#0b486b">Tutorial 8b: RNNs with Word2Vec</span> <span style="color:red">*****</span> #

This tutorial shows how to use a pretrained Word2Vec model to initialize the embedding matrix of an RNN for a given task, for example sentence classification or sentiment analysis. Instead of initializing the embedding matrix randomly, we initialize it from the pretrained model and so take advantage of the linguistic and semantic relationships that the model learned from the large text corpus it was trained on (e.g., about 100 billion words from a Google News dataset, with a vocabulary of 3 million words and phrases).

More specifically, we build an RNN for *spam SMS detection* whose embedding matrix is initialized from a pretrained Word2Vec model.
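As a minimal sketch of this idea (not the tutorial's exact code), the snippet below fills an embedding matrix row by row from a lookup of pretrained vectors. The `pretrained` dictionary, its 4-dimensional vectors, and the small `word2idx` vocabulary are made up for illustration; in the tutorial the vectors come from a downloaded model.

```python
import numpy as np

# Hypothetical stand-in for a pretrained model: word -> vector (toy 4-d vectors).
pretrained = {
    "free": np.array([0.9, 0.1, 0.0, 0.3]),
    "call": np.array([0.2, 0.8, 0.1, 0.0]),
    "now":  np.array([0.1, 0.2, 0.7, 0.4]),
}

# Task vocabulary: every token the model can see, mapped to a row index.
word2idx = {"[PAD]": 0, "free": 1, "call": 2, "now": 3, "jurong": 4}

embed_dim = 4
embed_matrix = np.zeros((len(word2idx), embed_dim), dtype=np.float32)

found = 0
for word, idx in word2idx.items():
    vector = pretrained.get(word)
    if vector is not None:          # the word is covered by the pretrained model
        embed_matrix[idx] = vector
        found += 1
    # otherwise the row stays all-zero (out-of-vocabulary token)

coverage = found / len(word2idx)
print(f"{found}/{len(word2idx)} tokens initialized ({coverage:.0%})")
```

Tokens without a pretrained vector keep a zero row here; initializing them randomly instead is another reasonable choice.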

### <span style="color:#0b486b"> II.0 Running on Google Colab</span>
You will need to download relevant files to run this notebook on Google Colab.

In [None]:
!gdown https://drive.google.com/uc?id=1i0EbKnTvpyQoRnt71kCEl9Ppou9Dev25

Downloading...
From (original): https://drive.google.com/uc?id=1i0EbKnTvpyQoRnt71kCEl9Ppou9Dev25
From (redirected): https://drive.google.com/uc?id=1i0EbKnTvpyQoRnt71kCEl9Ppou9Dev25&confirm=t&uuid=eec204ef-256d-4433-8dbc-b6e2540d5206
To: /content/Tut09_data.zip
100% 58.1M/58.1M [00:00<00:00, 217MB/s]


In [None]:
!unzip -q Tut09_data.zip

## <span style="color:#0b486b">I. Introduction of the SMS spam detection dataset</span> ##

We first import some necessary packages and libraries.

In [None]:
import os
import torch
import random
import pandas as pd
import numpy as np
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import BertTokenizer

The dataset we use in this tutorial is the SMS Spam Collection, a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English messages, each tagged as ham (legitimate) or spam. More information on this dataset can be found [here](https://www.kaggle.com/uciml/sms-spam-collection-dataset).
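The loader later in this tutorial reads the tag from a column named `v1` and the message from a column named `v2`, mapping spam to 1 and ham to 0. The sketch below mirrors that mapping on a few invented rows and also computes the accuracy of always predicting the majority class, which is a useful floor when reading training logs on an imbalanced dataset.

```python
import csv
import io

# Toy rows in the same two-column layout (`v1` = tag, `v2` = text).
# The messages are invented for illustration.
sample = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch\n"
    "spam,WIN a free prize now\n"
    "ham,Running late sorry\n"
    "ham,Ok\n"
)

rows = list(csv.DictReader(sample))
texts = [row["v2"] for row in rows]
labels = [1 if row["v1"] == "spam" else 0 for row in rows]

# Accuracy of always predicting the majority class.
spam_rate = sum(labels) / len(labels)
majority_baseline = max(spam_rate, 1 - spam_rate)
print(labels, majority_baseline)
```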

## <span style="color:#0b486b">II. Load and preprocess the dataset</span> ##

We create the class *DataManager* as a hub that helps us to load, preprocess, manipulate, and build up the necessary vocabulary and dictionaries (word2idx or idx2word).
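To make the vocabulary step concrete, here is a small sketch of how text becomes padded index sequences. It uses a toy whitespace tokenizer rather than the WordPiece tokenizer used by the class, and the vocabulary and its indices are made up; only the overall pattern (special tokens, unknown-word fallback, right padding) is the point.

```python
# Toy vocabulary; the indices are made up and differ from the BERT vocabulary.
word2idx = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
            "free": 4, "prize": 5, "call": 6, "now": 7}
idx2word = {i: w for w, i in word2idx.items()}

def encode(text):
    """Lowercase, split on whitespace, and wrap with [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    return [word2idx.get(tok, word2idx["[UNK]"]) for tok in tokens]

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([encode("Free prize call now"), encode("call me")])
for row in batch:
    print(row, [idx2word[i] for i in row])
```

The second message is shorter, so it is filled with the `[PAD]` index up to the batch's maximum length, and the unseen word "me" falls back to `[UNK]`.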

In [None]:
class DataManager:
    def __init__(self, url= None):
        self.url = url
        self.max_seq_len = None       # store the max sequence length
        self.num_sentences = None     # store number of sentences
        self.texts = None             # store all sentences
        self.labels = None            # store all labels
        self.num_seqs = None         # store sequences of indices
        self.vocab_size = None


    def read_data(self, file_path):
        df = pd.read_csv(file_path, encoding = "ISO-8859-1")
        df['label'] = df['v1'].apply(lambda x: 1 if x == 'spam' else 0)
        labels, texts = df['label'].to_numpy(), df['v2'].tolist()
        self.texts= texts
        self.labels = torch.from_numpy(labels)

    def transform_to_numbers(self):
        self.num_seqs = self.tokenizer(self.texts, return_tensors='pt', truncation=True, padding=True)['input_ids']
        self.num_sentences, self.max_seq_len = self.num_seqs.shape

    def build_vocabulary(self):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.word2idx = {w: i for w,i in self.tokenizer.vocab.items()}
        self.idx2word = {i:w for w,i in self.word2idx.items()}
        self.vocab_size = len(self.word2idx)
        self.min_index = min(self.word2idx.values())
        self.max_index = max(self.word2idx.values())

    def process_data(self):
        self.build_vocabulary()
        self.transform_to_numbers()


    def train_valid_test_split(self, train_ratio= 0.8, test_ratio=0.1):
        train_size = int(self.num_sentences*train_ratio) +1
        test_size = int(self.num_sentences*test_ratio) +1
        valid_size = self.num_sentences - (train_size + test_size)
        data_indices = list(range(self.num_sentences))
        random.shuffle(data_indices)
        train_set_data = self.num_seqs[data_indices[:train_size]]
        train_set_labels = self.labels[data_indices[:train_size]]
        train_set = torch.utils.data.TensorDataset(train_set_data, train_set_labels)
        test_set_data = self.num_seqs[data_indices[-test_size:]]
        test_set_labels = self.labels[data_indices[-test_size:]]
        test_set = torch.utils.data.TensorDataset(test_set_data, test_set_labels)
        valid_set_data = self.num_seqs[data_indices[train_size:-test_size]]
        valid_set_labels = self.labels[data_indices[train_size:-test_size]]
        valid_set = torch.utils.data.TensorDataset(valid_set_data, valid_set_labels)
        self.train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
        self.test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
        self.valid_loader = DataLoader(valid_set, batch_size=64, shuffle=False)

    def print_infor(self, num_samples = 5):
        print("Here are some statistics and examples from the dataset")
        if self.num_sentences is not None:
            print("+ Dataset has {} sentences".format(self.num_sentences))
        if self.vocab_size is not None:
            print("+ Vocabulary size is {} with min index= {}, max index= {}".format(self.vocab_size, self.min_index, self.max_index))
        if self.max_seq_len is not None:
            print("+ The max sequence length is {}".format(self.max_seq_len))
        if self.texts is not None:
            print("\nHere are some text samples")
            for i in range(num_samples):
                print("+ Text: {}\n+ Indices: {}\n+ Label: {}\n".format(self.texts[i], self.num_seqs[i, ], self.labels[i]))
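
The split logic in `train_valid_test_split` can be checked in isolation. The sketch below reproduces the same size arithmetic and slicing on plain index lists; the `seed` parameter and the local random generator are additions for illustration, since the class itself shuffles with the global `random` state.

```python
import random

def split_indices(num_items, train_ratio=0.8, test_ratio=0.1, seed=0):
    """Shuffle indices and cut them into train / valid / test slices."""
    train_size = int(num_items * train_ratio) + 1
    test_size = int(num_items * test_ratio) + 1
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)   # local RNG, global state is untouched
    train = indices[:train_size]
    test = indices[-test_size:]
    valid = indices[train_size:-test_size]
    return train, valid, test

train, valid, test = split_indices(5572)
print(len(train), len(valid), len(test))
```

With 5572 sentences this gives 4458 training, 556 validation, and 558 test examples, and the three slices do not overlap.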

In [None]:
dm = DataManager()

In [None]:
dm.read_data("./datasets/spam.csv")

In [None]:
dm.process_data()

In [None]:
dm.train_valid_test_split()

In [None]:
dm.print_infor()

Here are some statistics and examples from the dataset
+ Dataset has 5572 sentences
+ Vocabulary size is 30522 with min index= 0, max index= 30521
+ The max sequence length is 238

Here are some text samples
+ Text: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
+ Indices: tensor([  101,  2175,  2127, 18414, 17583,  2391,  1010,  4689,  1012,  1012,
         2800,  2069,  1999, 11829,  2483,  1050,  2307,  2088,  2474,  1041,
        28305,  1012,  1012,  1012, 25022,  2638,  2045,  2288, 26297, 28194,
         1012,  1012,  1012,   102,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,  

## <span style="color:#0b486b">III. Build the RNN model</span> ##

In [None]:
import gensim.downloader as api

The class *RNN_Spam_Detection* wraps the RNN for SMS spam detection. It has several important attributes (instance variables):
- `run_mode` (`scratch` or `init-fine-tune`) specifies whether we train the embedding matrix from scratch or initialize its weights from the pretrained Word2Vec model and then fine-tune them.
- `embed_model` names the pretrained model used to initialize the embedding matrix. In this case, the embedding size is given by the number at the end of the name (e.g., 300 for `glove-wiki-gigaword-300`).
- `embed_size` specifies the embedding size, which is also the input size of the GRU layer. If the running mode is not *scratch*, the embedding size is set to the one specified by the embedding model.
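Two details matter when a pretrained matrix is handed to PyTorch: `nn.Embedding.from_pretrained` freezes the weights unless `freeze=False` is passed, and a NumPy matrix is `float64` by default while the recurrent layer's weights are `float32`. The sketch below illustrates both with a random stand-in matrix (the values are not real pretrained vectors).

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for a pretrained matrix: 5 tokens, random values for illustration only.
embed_model = "glove-wiki-gigaword-300"
embed_size = int(embed_model.split("-")[-1])        # dimension taken from the model name
matrix = np.random.rand(5, embed_size)              # NumPy defaults to float64

weights = torch.from_numpy(matrix).float()          # cast to float32

frozen = nn.Embedding.from_pretrained(weights)                 # freeze=True is the default
tunable = nn.Embedding.from_pretrained(weights, freeze=False)  # updated during training

print(embed_size, frozen.weight.requires_grad, tunable.weight.requires_grad)
```

Only the `tunable` layer receives gradient updates, which is what an initialize-then-fine-tune mode needs.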

In [None]:
class SpamDetectionModel(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim, embed_matrix=None):
    super(SpamDetectionModel, self).__init__()

    if embed_matrix is not None:
      self.embedding_layer = nn.Embedding.from_pretrained(embed_matrix, freeze=False) # freeze=False so the pretrained weights are fine-tuned
    else: # embed_matrix=None
      self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
    self.rnn_layer = nn.GRU(embedding_dim, hidden_dim, num_layers=1, batch_first=True)
    self.dense_layer = nn.Linear(hidden_dim, 2)

  def forward(self, x): #[batch_size, seq_len]
    e = self.embedding_layer(x)  #[batch_size, seq_len, embed_size]
    h,_ = self.rnn_layer(e) #[batch_size, seq_len, hidden_dim]; _ is the final hidden state [num_layers, batch_size, hidden_dim]
    h = h[:, -1, :] # output at the last time step [batch_size, hidden_dim]
    y = self.dense_layer(h)
    return y
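
The shapes in `forward` can be checked on random input. The sketch below also shows one thing to keep in mind with right-padded batches: `h[:, -1, :]` reads the state at the final position, which for short messages comes after a run of padding tokens. Gathering the state at each sequence's last real token is a common alternative; the `lengths` tensor here is hypothetical and the input is random rather than real embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=8, hidden_size=4, batch_first=True)

x = torch.randn(2, 6, 8)                 # [batch_size, seq_len, embed_size]
lengths = torch.tensor([6, 3])           # hypothetical true lengths before padding
outputs, h_n = gru(x)                    # outputs: [2, 6, 4], h_n: [1, 2, 4]

last_step = outputs[:, -1, :]            # what forward() uses: state at the final position

# State at each sequence's last real token (index = length - 1).
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, outputs.size(2))
last_real = outputs.gather(1, idx).squeeze(1)

print(outputs.shape, h_n.shape)
print(torch.allclose(last_step, h_n[-1]))          # same values for a one-layer, one-direction GRU
print(torch.allclose(last_real[0], last_step[0]))  # identical for the full-length sequence
```

Whether the gathered state helps in practice depends on the data; the model in this tutorial uses the simpler last-position read.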

In [None]:
def train_epoch(model, optimizer, loader, criterion, device):
  model.train()
  losses = 0
  accuracies = 0
  for x, y in loader:
    x, y = x.to(device), y.to(device)
    pred = model(x)
    ypred = pred.argmax(-1)
    loss = criterion(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    acc = (y == ypred).sum(0) / len(y)
    accuracies += acc.item()
    losses += loss.item()
  return losses / len(loader), accuracies / len(loader)

def test_epoch(model, loader, criterion, device):
    model.eval()
    losses = 0
    accuracies = 0
    with torch.no_grad():
      for x,y in loader:
        x,y = x.to(device), y.to(device)
        pred = model(x)
        ypred = pred.argmax(-1)
        loss = criterion(pred, y)
        acc = (y == ypred).sum(0) / len(y)
        accuracies += acc.item()
        losses += loss.item()
    return losses / len(loader), accuracies / len(loader)
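
Both functions report the mean of per-batch accuracies. When the final batch is smaller than the others, this differs slightly from the accuracy over all examples. The numbers below are hypothetical and only illustrate the size of the effect.

```python
# Hypothetical per-batch results: (number correct, batch size).
# Batch size 64 matches the loaders above; the final batch is smaller.
batches = [(60, 64), (62, 64), (10, 10)]

# Mean of per-batch accuracies, as train_epoch/test_epoch compute it.
mean_of_batches = sum(c / n for c, n in batches) / len(batches)

# Sample-weighted accuracy: total correct over total examples.
weighted = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(f"{mean_of_batches:.4f} vs {weighted:.4f}")
```

The gap is small for large datasets, but accumulating correct counts and dividing by the dataset size gives the exact figure.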


In [None]:
class RNN_Spam_Detection:
    def __init__(self, run_mode="scratch", embed_model="glove-wiki-gigaword-300", embed_size=128, hidden_size=128, data_manager=None):
        self.embed_path = "embeddings/E.npy"
        self.embed_model = embed_model
        self.embed_size = embed_size
        self.run_mode = run_mode
        if run_mode != 'scratch':
            self.embed_size = int(self.embed_model.split("-")[-1])
        self.data_manager = data_manager
        self.vocab_size = self.data_manager.vocab_size
        self.word2idx = self.data_manager.word2idx
        self.embed_matrix = np.zeros((self.vocab_size, self.embed_size))
        self.hidden_size = hidden_size
        self.model = None

    def build_embedding_matrix(self):
        if os.path.exists(self.embed_path): # cached matrix from a previous run
            self.embed_matrix = np.load(self.embed_path) # reuse it; delete the file if the vocabulary or embed_model changes
        else: # first-time run
            self.word2vect = api.load(self.embed_model) # download/load the pretrained vectors
            for word, idx in self.word2idx.items():
                try:
                    self.embed_matrix[idx] = self.word2vect.get_vector(word) # copy the pretrained vector into row idx
                except KeyError: # word is not in the pretrained vocabulary; keep the zero row
                    pass
            os.makedirs(os.path.dirname(self.embed_path), exist_ok=True) # make sure the cache folder exists
            np.save(self.embed_path, self.embed_matrix)

    def build(self):

      if self.run_mode == 'scratch':
        embed_matrix = None
      else: # init-fine-tune
        self.build_embedding_matrix()
        embed_matrix = torch.from_numpy(self.embed_matrix).float() # cast float64 -> float32 to match the GRU weights
        embed_matrix.requires_grad = True

      model = SpamDetectionModel(self.vocab_size, self.embed_size, self.hidden_size, embed_matrix)
      self.criterion = nn.CrossEntropyLoss(reduction="mean")
      return model


    def train(self, model, device, num_epochs):
      optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
      for epoch in range(1, num_epochs + 1):
        train_loss, train_acc = train_epoch(model, optimizer, self.data_manager.train_loader,
                                            self.criterion, device)
        val_loss, val_acc = test_epoch(model, self.data_manager.valid_loader,
                                       self.criterion, device)
        msg = f"Epoch: {epoch}/{num_epochs} - train loss = {train_loss:.3f} - train accuracy = {train_acc*100:.3f}%"
        msg =  msg + f"- val loss = {val_loss:.3f} - val accuracy = {val_acc*100:.3f}%"
        print(msg)

    def evaluate(self, model, device):
      loss, acc = test_epoch(model, self.data_manager.test_loader, self.criterion, device)
      print(f"Test loss = {loss:.3f} - Test accuracy = {acc*100:.3f}%")
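
`build_embedding_matrix` loads whatever file exists at `embed_path`, so a matrix cached for a different vocabulary or embedding model would be reused as is. One possible safeguard, sketched below under the assumption that a shape check is sufficient, is to rebuild the cache whenever the stored shape does not match. The helper `load_or_build` is hypothetical and not part of the tutorial code.

```python
import os
import tempfile
import numpy as np

def load_or_build(path, expected_shape, build_fn):
    """Reuse a cached matrix only if its shape matches; otherwise rebuild and save."""
    if os.path.exists(path):
        cached = np.load(path)
        if cached.shape == expected_shape:
            return cached
    matrix = build_fn()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    np.save(path, matrix)
    return matrix

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings", "E.npy")
    first = load_or_build(path, (5, 3), lambda: np.ones((5, 3)))    # built and cached
    second = load_or_build(path, (8, 3), lambda: np.zeros((8, 3)))  # stale cache is replaced
    print(first.shape, second.shape)
```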


### <span style="color:#0b486b">III.1. Run in the running mode of training from scratch</span> ###

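We first train the model from scratch, i.e., with a randomly initialized embedding matrix. For reproducible results, fix the random seeds for Python, NumPy, and PyTorch before building the model.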
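A minimal seeding helper could look like the sketch below. The function name `set_seed` and the value 42 are arbitrary choices, and full determinism on a GPU may need further settings that are not shown here.

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)
print(torch.equal(a, b))
```

Calling the helper before the data split and before building the model makes the shuffling and the weight initialization repeatable across runs.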

In [None]:
rnn1 = RNN_Spam_Detection(data_manager=dm, run_mode="scratch")

In [None]:
model = rnn1.build()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

SpamDetectionModel(
  (embedding_layer): Embedding(30522, 128)
  (rnn_layer): GRU(128, 128, batch_first=True)
  (dense_layer): Linear(in_features=128, out_features=2, bias=True)
)

In [None]:
rnn1.train(model, device, num_epochs=20)

Epoch: 1/20 - train loss = 0.407 - train accuracy = 86.705%- val loss = 0.408 - val accuracy = 85.906%
Epoch: 2/20 - train loss = 0.394 - train accuracy = 86.717%- val loss = 0.407 - val accuracy = 85.906%
Epoch: 3/20 - train loss = 0.396 - train accuracy = 86.705%- val loss = 0.418 - val accuracy = 85.906%
Epoch: 4/20 - train loss = 0.395 - train accuracy = 86.705%- val loss = 0.407 - val accuracy = 85.906%
Epoch: 5/20 - train loss = 0.393 - train accuracy = 86.693%- val loss = 0.410 - val accuracy = 85.906%
Epoch: 6/20 - train loss = 0.393 - train accuracy = 86.693%- val loss = 0.418 - val accuracy = 85.906%
Epoch: 7/20 - train loss = 0.357 - train accuracy = 87.031%- val loss = 0.156 - val accuracy = 96.717%
Epoch: 8/20 - train loss = 0.088 - train accuracy = 98.236%- val loss = 0.092 - val accuracy = 98.090%
Epoch: 9/20 - train loss = 0.058 - train accuracy = 98.627%- val loss = 0.121 - val accuracy = 97.238%
Epoch: 10/20 - train loss = 0.143 - train accuracy = 95.914%- val loss = 

In [None]:
rnn1.evaluate(model, device)

Test loss = 0.051 - Test accuracy = 98.785%


### <span style="color:#0b486b">III.2. Run in the running mode of fine-tuning the embedding matrix</span> ###

In [None]:
rnn2 = RNN_Spam_Detection(data_manager=dm, run_mode="init-fine-tune")

In [None]:
model2 = rnn2.build()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model2.to(device)

SpamDetectionModel(
  (embedding_layer): Embedding(8921, 300)
  (rnn_layer): GRU(300, 128, batch_first=True)
  (dense_layer): Linear(in_features=128, out_features=2, bias=True)
)

In [None]:
rnn2.train(model2, device, num_epochs=20)  # train model2, the model built by rnn2

Epoch: 1/20 - train loss = 0.007 - train accuracy = 99.888%- val loss = 0.076 - val accuracy = 98.611%
Epoch: 2/20 - train loss = 0.004 - train accuracy = 99.955%- val loss = 0.083 - val accuracy = 98.785%
Epoch: 3/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 0.087 - val accuracy = 98.785%
Epoch: 4/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 0.085 - val accuracy = 98.785%
Epoch: 5/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 0.088 - val accuracy = 98.785%
Epoch: 6/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 0.090 - val accuracy = 98.785%
Epoch: 7/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 0.096 - val accuracy = 98.785%
Epoch: 8/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 0.098 - val accuracy = 98.785%
Epoch: 9/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 0.094 - val accuracy = 98.785%
Epoch: 10/20 - train loss = 0.002 - train accuracy = 99.978%- val loss = 

In [None]:
rnn2.evaluate(model2, device)  # evaluate model2, the model built by rnn2

Test loss = 0.064 - Test accuracy = 99.238%


---
### <span style="color:#0b486b"> <div  style="text-align:center">**THE END**</div> </span>