# **Modeling Sequential Data Using Recurrent Neural Networks (Part 2/3)**

## Project one: predicting the sentiment of IMDb movie reviews

- Sentiment analysis is concerned with analyzing the expressed opinion of a sentence or a text document.

- We will implement a `multilayer RNN` for sentiment analysis using a many-to-one architecture.

In [1]:
from IPython.display import Image
%matplotlib inline

### Preparing the movie review data

In [2]:
import torch
import torch.nn as nn

- The IMDb dataset provides a training set and a test set with `25,000` samples each. Each sample consists of two elements: the sentiment label, which is the target we want to predict (0 for negative sentiment, 1 for positive sentiment), and the movie review text (the input features). The text component of these movie reviews is a sequence of words, and the RNN model classifies each sequence as a positive (1) or negative (0) review.


- However, before we can feed the data into an RNN model, we need to apply several preprocessing steps:
  - Split the training dataset into separate training and validation partitions.
  - Identify the unique words in the training dataset.
  - Map each unique word to a unique integer and encode the review text into encoded integers.
  - Divide the dataset into mini-batches as input to the model.

- **Torchtext is deprecated**, so the original `torchtext`-based loading code is kept below, commented out, for reference only.

In [3]:
# from torchtext.datasets import IMDB
# from torch.utils.data.dataset import random_split

# # Step 1: load and create the datasets
# train_dataset = IMDB(split='train')
# test_dataset = IMDB(split='test')

# test_dataset = list(test_dataset)   #datapipe to list

# torch.manual_seed(1)
# train_dataset, valid_dataset = random_split(
#     list(train_dataset), [20000, 5000])

- Modern IMDB Dataset Loading (No Torchtext)

In [4]:
from datasets import load_dataset
from torch.utils.data import random_split

# Step1: load IMDB from HuggingFace datasets
imdb = load_dataset("imdb") # returns the dict-like dataset object

train_dataset = imdb["train"]
test_dataset = imdb["test"]

# convert test dataset to list
test_dataset = list(test_dataset)

# split train into train + validation
torch.manual_seed(1)
train_dataset, valid_dataset = random_split(
    list(train_dataset), [20000, 5000]
)


- `train_dataset` → 20,000 examples
- `valid_dataset` → 5,000 examples
- `test_dataset` → 25,000 examples

- The code for collecting unique tokens is as follows:

In [5]:
# step 2: find unique tokens (words)
import re 
from collections import Counter

token_counts = Counter()

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

# train_dataset is the HF dataset split from earlier
for example in train_dataset:  # example is a dict: {"text": ..., "label": ...}
    tokens = tokenizer(example["text"])
    token_counts.update(tokens)

print("Vocab-size:", len(token_counts))


Vocab-size: 69023
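
To make the tokenizer's behavior easier to see, the following sketch re-declares the same `tokenizer` and applies it to a short, made-up review string (the string is an assumption for illustration, not a sample from the dataset). It shows that HTML tags are removed, the text is lowercased, punctuation is dropped, and emoticons are appended at the end:

```python
import re
from collections import Counter

def tokenizer(text):
    # Same logic as the notebook's tokenizer, repeated so this block runs on its own
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags such as <br />
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

# Hypothetical review text, not taken from the dataset
review = "<br />This :) is a test :-("
tokens = tokenizer(review)
print(tokens)

# Token frequencies accumulate across documents, as in the loop over train_dataset
counts = Counter()
counts.update(tokens)
counts.update(tokenizer("This is fine."))
print(counts.most_common(2))
```

Because the emoticons are concatenated directly onto the cleaned text, an emoticon can end up glued to the last word when the review does not end in punctuation or whitespace; the example string avoids that case.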


- Next, we are going to map each unique word to a unique integer. The `torchtext` package used to provide a `Vocab` class for this purpose, but since `torchtext` is deprecated, we build the mapping manually using a Python dictionary, where the keys are the unique tokens (words) and the value associated with each key is a unique integer. First, we sort the tokens in `token_counts` by their occurrence frequency. Second, we reserve two special tokens at the start of the vocabulary, the padding token (`<pad>`, index 0) and the unknown token (`<unk>`, index 1), and then assign the remaining indices in order of decreasing frequency:

In [6]:
# Step 3: build the vocabulary dictionary manually
sorted_by_freq = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)

# Reserve special tokens
stoi = {"<pad>": 0, "<unk>": 1}

# Add all tokens to vocabulary
for token, _ in sorted_by_freq:
    stoi[token] = len(stoi)

# Create inverse mapping if needed later
itos = {index: token for token, index in stoi.items()}

In [13]:
stoi

{'<pad>': 0,
 '<unk>': 1,
 'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'it': 8,
 'in': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 's': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18,
 'but': 19,
 'film': 20,
 't': 21,
 'on': 22,
 'you': 23,
 'not': 24,
 'he': 25,
 'are': 26,
 'his': 27,
 'have': 28,
 'one': 29,
 'be': 30,
 'all': 31,
 'at': 32,
 'they': 33,
 'by': 34,
 'an': 35,
 'who': 36,
 'so': 37,
 'from': 38,
 'like': 39,
 'there': 40,
 'her': 41,
 'or': 42,
 'just': 43,
 'about': 44,
 'out': 45,
 'has': 46,
 'if': 47,
 'what': 48,
 'some': 49,
 'good': 50,
 'can': 51,
 'she': 52,
 'when': 53,
 'very': 54,
 'more': 55,
 'up': 56,
 'even': 57,
 'time': 58,
 'no': 59,
 'my': 60,
 'would': 61,
 'which': 62,
 'story': 63,
 'only': 64,
 'really': 65,
 'see': 66,
 'had': 67,
 'their': 68,
 'we': 69,
 'were': 70,
 'me': 71,
 'well': 72,
 'than': 73,
 'much': 74,
 'get': 75,
 'been': 76,
 'will': 77,
 'bad': 78,
 'other': 79,
 'people': 80,
 'do': 81,
 'als

In [7]:
# numericalize tokens
def encode(tokens):
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]


sample = tokenizer(train_dataset[0]["text"])
encoded = encode(sample)
print(encoded[:20])

[35, 1739, 7, 449, 721, 6, 301, 4, 787, 9, 4, 18, 44, 2, 1705, 2460, 186, 25, 7, 24]


- Next, we define the `text_pipeline` function to transform each text in the dataset into a list of token indices and the `label_pipeline` function to convert each label to 1 or 0:

In [8]:
print(tokenizer("This movie was great!"))

['this', 'movie', 'was', 'great']


In [None]:
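# Step 3-A: Define transformation functions (vocab → indices)

def text_pipeline(text):
    tokens = tokenizer(text)
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]

def label_pipeline(label):
    # Hugging Face IMDb labels are already integers (0 = negative, 1 = positive)
    return float(label)

In [10]:
# Example usage: each item is a dict with "text" and "label" keys
example = train_dataset[0]
text, label = example["text"], example["label"]

print("Original:", text[:100], "...")
print("Tokens:", tokenizer(text)[:10])
print("Numerical:", text_pipeline(text)[:10])
print("Label:", label_pipeline(label))


Original: An extra is called upon to play a general in a movie about the Russian Revolution. However, he is no ...
Tokens: ['an', 'extra', 'is', 'called', 'upon', 'to', 'play', 'a', 'general', 'in']
Numerical: [35, 1739, 7, 449, 721, 6, 301, 4, 787, 9]
Label: 1.0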


In [15]:
example = train_dataset[0]
print(example)

{'text': 'An extra is called upon to play a general in a movie about the Russian Revolution. However, he is not any ordinary extra. He is Serguis Alexander, former commanding general of the Russia armies who is now being forced to relive the same scene, which he suffered professional and personal tragedy in, to satisfy the director who was once a revolutionist in Russia and was humiliated by Alexander. It can now be the time for this broken man to finally "win" his penultimate battle. This is one powerful movie with meticulous direction by Von Sternberg, providing the greatest irony in Alexander\'s character in every way he can. Jannings deserved his Oscar for the role with a very moving performance playing the general at his peak and at his deepest valley. Powell lends a sinister support as the revenge minded director and Brent is perfect in her role with her face and movements showing so much expression as Jannings\' love. All around brilliance. Rating, 10.', 'label': 1}


- We will wrap the text encoding and label transformation function into the `collate_batch` function:

In [None]:
## Step 3-B: wrap the encode and transformation function
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []

    for sample in batch:
        text = sample["text"]
        label = sample["label"]

        label_list.append(float(label))  
        
        ids = torch.tensor(text_pipeline(text), dtype=torch.long)
        text_list.append(ids)
        lengths.append(len(ids))

    label_list = torch.tensor(label_list, dtype=torch.float32)
    lengths = torch.tensor(lengths, dtype=torch.long)

    padded_text_list = nn.utils.rnn.pad_sequence(
        text_list, 
        batch_first=True, 
        padding_value=stoi["<pad>"]
    )

    return (
        padded_text_list.to(device),
        label_list.to(device),
        lengths.to(device)
    )

- Let's take the first batch and print its individual elements, as well as the dimensions of the resulting mini-batch:

In [19]:
## Take a small batch
from torch.utils.data import DataLoader

dataloader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=False,
    collate_fn=collate_batch
)

text_batch, label_batch, length_batch = next(iter(dataloader))
print(text_batch)
print(label_batch)
print(length_batch)
print(text_batch.shape)

tensor([[   35,  1739,     7,   449,   721,     6,   301,     4,   787,     9,
             4,    18,    44,     2,  1705,  2460,   186,    25,     7,    24,
           100,  1874,  1739,    25,     7, 34415,  3568,  1103,  7517,   787,
             5,     2,  4991, 12401,    36,     7,   148,   111,   939,     6,
         11598,     2,   172,   135,    62,    25,  3199,  1602,     3,   928,
          1500,     9,     6,  4601,     2,   155,    36,    14,   274,     4,
         42945,     9,  4991,     3,    14, 10296,    34,  3568,     8,    51,
           148,    30,     2,    58,    16,    11,  1893,   125,     6,   420,
          1214,    27, 14542,   940,    11,     7,    29,   951,    18,    17,
         15994,   459,    34,  2480, 15211,  3713,     2,   840,  3200,     9,
          3568,    13,   107,     9,   175,    94,    25,    51, 10297,  1796,
            27,   712,    16,     2,   220,    17,     4,    54,   722,   238,
           395,     2,   787,    32,    27,  5236,  

In [27]:
text_batch[3]

tensor([18923,     7,     4,  4753,  1669,    12,  3019,     6,     4, 13906,
          502,    40,    25,    77,  1588,     9,   115,     6, 21713,     2,
           90,   305,   237,     9,   502,    33,    77,   376,     4, 16848,
          847,    62,    77,   131,     9,     2,  1580,   338,     5, 18923,
           32,     2,  1980,    49,   157,   306, 21713,    46,   981,     6,
        10298,     2, 18924,   125,     9,   502,     3,   453,     4,  1852,
          630,   407,  3407,    34,   277,    29,   242,     2, 20200,     5,
        18923,    77,    95,    41,  1833,     6,  2105,    56,     3,   495,
          214,   528,     2,  3479,     2,   112,     7,   181,  1813,     3,
          597,     5,     2,   156,   294,     4,   543,   173,     9,  1562,
          289, 10038,     5,     2,    20,    26,   841,  1392,    62,   130,
          111,    72,   832,    26,   181, 12402,    15,    69,   183,     6,
           66,    55,   936,     5,     2,    63,     8,     7, 

- The number of columns in this first batch equals the length of the longest of its four examples, as reported by `text_batch.shape` and `length_batch`.
- This means that the other three examples in this batch are padded with the `<pad>` index (0) as much as necessary to match this size.
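
As a small illustration of this padding behavior, the sketch below pads three toy sequences with `pad_sequence`, the same function used in `collate_batch`. The token ids are made up for the example:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy "reviews" of different lengths (token ids are made up)
seqs = [torch.tensor([5, 8, 2, 9]),
        torch.tensor([7, 3]),
        torch.tensor([4, 6, 1])]
lengths = torch.tensor([len(s) for s in seqs])

# Shorter sequences are right-padded with 0, the index reserved for <pad>
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)
print(padded.shape, lengths)
```

The width of the padded tensor is the maximum of `lengths`, and the original lengths are kept so that the model can later ignore the padded positions.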

- Let's divide all three datasets into data loaders with a batch size of `32`:

In [20]:
## Step 4: batching the datasets

batch_size = 32  

train_dl = DataLoader(train_dataset, batch_size=batch_size,
                      shuffle=True, collate_fn=collate_batch)
valid_dl = DataLoader(valid_dataset, batch_size=batch_size,
                      shuffle=False, collate_fn=collate_batch)
test_dl = DataLoader(test_dataset, batch_size=batch_size,
                     shuffle=False, collate_fn=collate_batch)

- Now, the data is in a suitable format for an `RNN` model, which we are going to implement.

### **Embedding layers for sentence encoding**

During the data preparation in the previous step, we generated sequences of the same length. The elements of these sequences were integer numbers that corresponded to the indices of unique words. These word indices can be converted into input features in several different ways. One naive way is to apply `one-hot encoding` to convert the indices into vectors of zeros and ones. Then, each word will be mapped to a vector whose size is the number of unique words in the entire dataset. Given that the number of unique words (the size of the vocabulary) can be in the order of $10^{4}$ – $10^{5}$, which will also be the number of our input features, a model trained on such features may suffer from the `curse of dimensionality`. Furthermore, these features are very sparse since all are zero except one.


A more elegant approach is to map each word to a vector of a fixed size with real-valued elements (not necessarily integers). In contrast to the one-hot encoded vectors, we can use finite-sized vectors to represent an infinite number of real numbers. (In theory, we can extract infinite real numbers from a given interval, for example `[–1, 1]`.)


This is the idea behind embedding, which is a feature-learning technique that we can utilize here to automatically learn the `salient` features to represent the words in our dataset. Given the number of unique words, $n_{words}$, we can select the size of the embedding vectors (a.k.a., embedding dimension) to be much smaller than the number of unique words ($\text{embedding\_dim} \ll n_{words}$) to represent the entire vocabulary as input features.


The advantages of embedding over one-hot encoding are as follows:

- A reduction in the dimensionality of the feature space to decrease the effect of the curse of dimensionality.
- The extraction of salient features since the embedding layer in an `NN` can be optimized (or learned).
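
One way to see the relationship between the two encodings is that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix, only without ever materializing the sparse vector. The following is a minimal sketch with a toy vocabulary of 10 words and 3-dimensional embeddings (both sizes are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 10, 3
embedding = nn.Embedding(vocab_size, embed_dim)

idx = torch.tensor([1, 5, 9])

# One-hot route: a (3, 10) matrix of zeros and ones times the (10, 3) weight matrix
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

# Embedding route: a direct row lookup in the same weight matrix
via_lookup = embedding(idx)

print(one_hot.shape, via_lookup.shape)
print(torch.allclose(via_matmul, via_lookup))
```

Each word is described by `vocab_size` mostly-zero features in the one-hot case, but by only `embed_dim` dense, trainable features after the lookup.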


The following schematic representation shows how embedding works by mapping token indices to a trainable embedding matrix:

![A breakdown of how embedding works](./figures/15_10.png)


- `input_dim` (`num_embeddings` in PyTorch): number of words, i.e., maximum integer index + 1.
  
- `output_dim` (`embedding_dim` in PyTorch): size of the embedding vector produced for each word.
  
- `input_length`: the length of the (padded) sequence
  - for example, `'This is an example' -> [0, 0, 0, 0, 0, 0, 3, 1, 8, 9]` => `input_length` is 10

- When called, the embedding layer takes integer values as input and converts each integer into a float vector of size `[output_dim]`
  - If input shape is `[BATCH_SIZE]`, output shape will be `[BATCH_SIZE, output_dim]`
  - If input shape is `[BATCH_SIZE, 10]`, output shape will be `[BATCH_SIZE, 10, output_dim]`

### **Embedding Layers for Sentence Encoding**

`Embedding layers` convert text into dense vector representations for sentence encoding. These layers learn to map words or tokens to vectors where similar meanings are closer in the vector space. This allows models to process and compare sentences for tasks like similarity, information retrieval, and question answering by comparing the distance between their vector representations. Common approaches include using the output of a deep learning model's encoder, where each layer can capture different information, or using specialized sentence embedding models.


##### **1. Purpose**

An embedding layer maps each discrete token (word, subword, or character) to a continuous vector representation. This allows neural networks to work with textual inputs efficiently by representing semantic and syntactic relations in a learned vector space.

Given a vocabulary of size `V` and embedding dimension `d`, the embedding layer is a matrix:

$$E \in \mathbb{R}^{V \times d}$$

A token with integer index `i` is represented as:

$$x = E[i]$$


##### **2. Sentence Representation**

Suppose a sentence has token indices:

$$[w_1, w_2, \ldots, w_T]$$

The embedding lookup gives:

$$X = [E[w_1], E[w_2], \ldots, E[w_T]], \qquad X \in \mathbb{R}^{T \times d}$$

This matrix representation preserves word order and provides a dense representation per position.



##### **3. Encoding Strategies**

Once the tokens are embedded, several strategies can convert the sequence into a **single sentence representation**.

(a) **Mean Pooling**

$$h = \frac{1}{T}\sum_{t=1}^{T} X_t \in \mathbb{R}^{d}$$

Captures coarse semantic information. Works surprisingly well for classification.


(b) **Max Pooling**

$$h_i = \max_{t=1}^{T} X_{t,i}$$

Highlights the most informative features across the sequence.
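
When sentences in a batch are padded to a common length, both pooling formulas should skip the padded positions. The sketch below is one possible way to do this with a boolean mask; the random tensor `X` stands in for real embedded tokens, and the shapes are arbitrary assumptions:

```python
import torch

torch.manual_seed(0)
# Embedded batch: 2 sentences, padded to T=4 positions, d=3 features
X = torch.randn(2, 4, 3)
lengths = torch.tensor([4, 2])  # the second sentence has only 2 real tokens

# mask[b, t] is True for real tokens and False for padding
mask = torch.arange(X.size(1))[None, :] < lengths[:, None]

# Mean pooling: sum over real positions only, then divide by the true length
mean_pooled = (X * mask[:, :, None]).sum(dim=1) / lengths[:, None]

# Max pooling: fill padded positions with -inf so they can never be the maximum
max_pooled = X.masked_fill(~mask[:, :, None], float("-inf")).max(dim=1).values

print(mean_pooled.shape, max_pooled.shape)
```

Both results have shape `(batch, d)`, so the sentence representation has a fixed size regardless of the sentence length $T$.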


(c) **RNN-based Encoding (e.g., LSTM, GRU)**

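Pass the sequence through an RNN:

$$h_t, c_t = \mathrm{LSTM}(X_t, (h_{t-1}, c_{t-1}))$$

Final hidden state, where $d_h$ is the hidden size of the RNN (not necessarily equal to the embedding dimension $d$):

$$h_T \in \mathbb{R}^{d_h}$$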

This captures sequential and contextual dependencies, but with potential long-term dependency challenges.


(d) **CNN-based Encoding**

Apply convolution filters:

$$H = \mathrm{ReLU}(X * W + b)$$

Then reduce via max-pooling over the sequence positions:

$$h_i = \max_{t} H_{t,i}$$

Captures local n-gram features efficiently.


(e) **Transformer-based Encoding**

Add positional encodings and process through self-attention layers:

$$X = \mathrm{TransformerEncoder}(X + P)$$

Sentence representation often taken as mean-pooled output or `[CLS]` token vector.



##### **4. Choice Considerations**

| Method          | Captures Order  | Long-Range Context | Efficient for Long Sequences | Notes                             |
| --------------- | --------------- | ------------------ | ---------------------------- | --------------------------------- |
| Mean / Max Pool | No              | No                 | Yes                          | Fast; baseline                    |
| LSTM / GRU      | Yes             | Limited            | Moderate                     | Good for moderate sentence length |
| CNN             | Partial (local) | No                 | Yes                          | Good for phrase-level patterns    |
| Transformer     | Yes             | Yes                | Expensive for very long seqs | Best general performance          |



##### **5. Embedding Initialization Approaches**

1. **Random Learnable Embeddings**
   Learned from scratch during training.

2. **Pretrained Static Embeddings**
   Examples: GloVe, Word2Vec.
   Embeddings are typically kept fixed (or only lightly fine-tuned); the network learns task-specific layers.

3. **Pretrained Contextual Embeddings**
   Examples: BERT, RoBERTa, GPT embeddings.
   Contextual and dynamic; state-of-the-art for sentence representation.
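
For the second approach, PyTorch offers `nn.Embedding.from_pretrained`, which builds an embedding layer from an existing weight matrix. The sketch below uses random values as a stand-in for vectors that would normally be loaded from a file such as GloVe; the matrix size and the choice of row 0 for padding are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Stand-in for pretrained vectors; random values are used purely for illustration
torch.manual_seed(0)
pretrained = torch.randn(10, 4)
pretrained[0] = 0.0  # row 0 reserved for <pad>

# freeze=True keeps the vectors fixed; freeze=False lets training fine-tune them
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=0)
tunable = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False, padding_idx=0)

print(frozen.weight.requires_grad, tunable.weight.requires_grad)
print(frozen(torch.tensor([3])))
```

In a real setup, the rows of the pretrained matrix would have to be arranged in the same order as the token-to-index mapping (`stoi` in this notebook).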



##### **6. Key Insight**

The embedding layer provides **semantic grounding**; however, the choice of sentence encoding mechanism determines how relationships across tokens are modeled. For structured sentence understanding and contextual meaning extraction, **sequence models (LSTM/Transformer)** or **attention-based pooling** generally outperform simple averaging.

---


- Creating an `embedding layer` can simply be done using `nn.Embedding`.

In [32]:
embedding = nn.Embedding(
    num_embeddings=10,
    embedding_dim=3,
    padding_idx=0
)

# a batch of 2 samples of 4 indices each
text_encoded_input = torch.LongTensor([[1, 2, 3, 4], [4, 3, 2, 0]])
print(text_encoded_input.shape)
print(embedding(text_encoded_input))

torch.Size([2, 4])
tensor([[[ 1.3519, -1.1674, -1.3222],
         [-0.3867,  2.0638,  1.0079],
         [-1.9005, -0.4193,  0.8100],
         [ 0.3361,  0.9807, -1.3204]],

        [[ 0.3361,  0.9807, -1.3204],
         [-1.9005, -0.4193,  0.8100],
         [-0.3867,  2.0638,  1.0079],
         [ 0.0000,  0.0000,  0.0000]]], grad_fn=<EmbeddingBackward0>)


- In our setting, the input to this model (embedding layer) has `rank 2` with the dimensionality `batch_size x input_length`, where `input_length` is the length of sequences (here 4).

- For example, an `input_sequence` in the mini-batch could be `<1, 5, 9, 2>`, where each element of this sequence is the index of the unique words.

- The output will have the dimensionality `batch_size x input_length x embedding_dim`, where `embedding_dim` is the size of the embedding features (here, set to 3).

- `num_embeddings` corresponds to the number of unique integer values that the model will receive as input (for instance, n + 2, set here to 10). Therefore, the embedding matrix in this case has the size `10x3`.

- `padding_idx` indicates the token index for padding (here, 0), which, if specified, will not contribute to the gradient updates during training. 
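
The effect of `padding_idx` on the gradients can be checked directly. The following minimal sketch runs one backward pass through a toy embedding layer and inspects the gradient rows; the batch contents are made up:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)

batch = torch.tensor([[4, 3, 2, 0]])  # the last position is padding
embedding(batch).sum().backward()

grad = embedding.weight.grad
print(grad[0])   # gradient of the padding row
print(grad[4])   # gradient of a row that was actually looked up
```

The padding row is initialized to zeros and receives a zero gradient, so it stays at zero during training, while rows for real tokens are updated as usual.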

### **Building an RNN model**

- An RNN model for this task is a combination of the `embedding layer`, the recurrent layers of the RNN, and the fully connected non-recurrent layers.
- For the recurrent layers, we can use any of the following implementations:

  - `RNN`: a regular RNN layer, that is, a fully connected recurrent layer.
  - `LSTM`: a long short-term memory RNN, which is useful for capturing the long-term dependencies.
  - `GRU`: a recurrent layer with a gated recurrent unit, as an alternative to LSTMs.


- RNN layers:
  - `nn.RNN(input_size, hidden_size, num_layers=1)`
  - `nn.LSTM(..)`
  - `nn.GRU(..)`
  - `nn.RNN(input_size, hidden_size, num_layers=1, bidirectional=True)`

- **RNN model with two recurrent layers followed by a fully connected output layer**

In [33]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size,
                          num_layers=2, batch_first=True)
        # self.rnn = nn.GRU(input_size, hidden_size,
        #                   num_layers=1, batch_first=True)
        # self.rnn = nn.LSTM(input_size, hidden_size,
        #                    num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, out_features=1)
        
    def forward(self, x):
        _, hidden = self.rnn(x)
        out = hidden[-1, :, :]
        # we use the final hidden state from the last hidden layer as the input to the fully connected layer
        out = self.fc(out)
        return out
    
model = RNN(64, 32)
print(model)

RNN(
  (rnn): RNN(64, 32, num_layers=2, batch_first=True)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)


In [45]:
model.rnn.num_layers

2

In [34]:
model(torch.randn(5, 3, 64))

tensor([[-0.1760],
        [-0.0766],
        [-0.0272],
        [ 0.1103],
        [-0.3438]], grad_fn=<AddmmBackward0>)
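
To clarify why `hidden[-1, :, :]` is used in `forward`, the sketch below inspects the two values returned by a two-layer `nn.RNN` with the same sizes as above. The input is random and only serves to show the shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
rnn = nn.RNN(input_size=64, hidden_size=32, num_layers=2, batch_first=True)

x = torch.randn(5, 3, 64)          # (batch, seq_len, input_size)
output, hidden = rnn(x)

print(output.shape)   # per-step hidden states of the top layer
print(hidden.shape)   # final hidden state of every layer

# The last time step of `output` is the final state of the top layer
print(torch.allclose(output[:, -1, :], hidden[-1]))
```

`hidden` has one entry per layer, so indexing with `-1` selects the final state of the last recurrent layer. Note that `nn.LSTM` returns a tuple `(hidden, cell)` in place of `hidden`, so the `forward` method would need to unpack it if the commented-out LSTM line were used.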

---

#### **Context: What an RNN-Based Text Model Does**

When we build a model for sentences:

1. Text is **tokenized into integer IDs**.
2. Each token ID is converted into a **vector** using an **Embedding Layer**.
3. The vectors are fed **one time step at a time** into an **RNN**, such as GRU or LSTM.
4. The final hidden state (or pooled states) is fed into a **Fully Connected Layer** to make predictions.

So the architecture looks like:

```
[Tokens] → [Embedding] → [RNN] → [FC Layer] → Output
```



#### **Key Terms and What They Mean**

Let’s break them down **in the order they affect the architecture**:

| Term                                  | What It Means                                      | Where It Appears                | Typical Range                                  |
| ------------------------------------- | -------------------------------------------------- | ------------------------------- | ---------------------------------------------- |
| **Vocabulary Size (V)**               | Number of unique tokens/words in your dataset      | Embedding layer input dimension | 10k–100k for English text                      |
| **Embedding Dimension (d)**           | Size of the learned vector per token               | Output of embedding layer       | 50, 100, 200, 300 (classic) / 128–768 (modern) |
| **Input Size**                        | Equals the **embedding dimension**, not vocab size | RNN input dimension             | Same as embedding dim                          |
| **Hidden Size (h)**                   | Size of RNN’s internal memory/state                | RNN layer hidden dimension      | 64, 128, 256, 512                              |
| **Number of RNN Layers**              | Depth (stacking) of the recurrent layers           | RNN module parameter            | 1–4                                            |
| **Fully Connected Layer Hidden Size** | Size of the classifier head                        | FC network                      | 32, 64, 128                                    |
| **Output Size**                       | Number of output classes                           | Final FC layer                  | For sentiment: 1 or 2                          |



#### **The Critical Relationships**

This is where it clicks:

**1. Embedding Layer**

```python
nn.Embedding(num_embeddings=V, embedding_dim=d)
```

* `num_embeddings = vocabulary size`
* `embedding_dim = embedding dimension`

**Output shape of Embedding:**

```
(Batch, Sequence Length, d)
```

**2. RNN Layer (LSTM / GRU)**

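```python
nn.LSTM(input_size=d, hidden_size=h, num_layers=L, batch_first=True)
```

* `input_size = embedding dimension`
* `hidden_size = RNN memory size`
* `num_layers = how many LSTMs stacked`
* `batch_first=True` puts the batch dimension first in the input and output

**Output shape:**

```
All hidden states (top layer):  (Batch, Sequence Length, h)
Final states (all layers):      (L, Batch, h)
Final state of the top layer:   (Batch, h)
```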

**3. Fully Connected (Classifier Head)**

```python
nn.Linear(h, num_classes)
```

* Input = final RNN hidden state
* Output = predicted label(s)



#### **Putting it Together: Example Architecture**

Suppose:

* Vocabulary size = **10,000**
* Embedding dim = **128**
* LSTM hidden size = **256**
* Number of layers = **2**
* Binary classification → output size = **1**

**Architecture:**

```python
nn.Embedding(num_embeddings=10000, embedding_dim=128)
nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
nn.Linear(256, 1)
```
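
To get a feeling for where the parameters of such an architecture sit, the following sketch counts them by hand in plain Python. The helper function names are hypothetical, and the LSTM formula assumes PyTorch's convention of four gates with two bias vectors per layer:

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors
    # (PyTorch keeps separate b_ih and b_hh)
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit
    return in_features * out_features + out_features

V, d, h = 10_000, 128, 256
counts = {
    "embedding": embedding_params(V, d),
    "lstm_layer_1": lstm_layer_params(d, h),
    "lstm_layer_2": lstm_layer_params(h, h),  # layer 2 consumes layer 1's h-dim output
    "linear": linear_params(h, 1),
}
for name, n in counts.items():
    print(f"{name:>13}: {n:,}")
print(f"{'total':>13}: {sum(counts.values()):,}")
```

With these sizes the embedding matrix holds more than half of all parameters, which is why the vocabulary size and embedding dimension have such a strong influence on model size.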



#### **Choosing Good Values (Simple Rules)**

| Component           | Smaller Dataset | Medium/Large Dataset |
| ------------------- | --------------- | -------------------- |
| Vocabulary size     | 5k–20k          | 20k–200k             |
| Embedding dimension | 50–200          | 200–768              |
| RNN hidden size     | 64–256          | 256–1024             |
| Number of layers    | 1–2             | 2–4                  |



#### **Why Transformers Replace RNNs**

* RNNs process one word at a time → **sequential dependency**, slow.
* They struggle to retain long-range context.
* Transformers use **self-attention**, which lets every word attend to every other word **in parallel**.

But **the data flow is the same**:

```
Tokens → Embedding → *Model* → Classification Layer
```

The only difference is the **middle block**.

| Before     | Now                        |
| ---------- | -------------------------- |
| LSTM / GRU | Transformer Encoder Layers |

Everything else (vocab size, embedding dim, classifier head) stays conceptually identical.



#### **Mental Model Summary**

If you remember just this:

```
Vocabulary Size → How many tokens the model knows.
Embedding Dimension → How rich each token’s meaning is.
Hidden Size → How much memory the RNN has.
Number of Layers → How deep the reasoning stack is.
FC Layer → Maps meaning → decision.
```

You can design architectures **confidently**.


---

### **Building an RNN model for the sentiment analysis task**

- For long sequences, we are going to use an `LSTM` layer to account for long-range effects.

- Create an `RNN` model for sentiment analysis, starting with an embedding layer producing word embeddings of feature size 20 (embed_dim=20).

- Then, a recurrent layer of type `LSTM` will be added.

- Finally, we will add a `fully connected layer` as a hidden layer and another `fully connected layer` as the output layer, which will return a single class-membership probability value via the logistic sigmoid activation as the prediction.

In [49]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size,
                                      embed_dim,
                                      padding_idx=0
                                      )
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size,
                           batch_first=True)
        self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(
            out, lengths.cpu().numpy(), enforce_sorted=False,
            batch_first=True
        )
        out, (hidden, cell) = self.rnn(out)
        out = hidden[-1, :, :]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
    

vocab_size = len(stoi)
embed_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64

torch.manual_seed(1)
model = RNN(vocab_size, 
            embed_dim, 
            rnn_hidden_size, 
            fc_hidden_size) 
model = model.to(device)

In [50]:
model

RNN(
  (embedding): Embedding(69025, 20, padding_idx=0)
  (rnn): LSTM(20, 64, batch_first=True)
  (fc1): Linear(in_features=64, out_features=64, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=64, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
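
The `forward` method packs the padded batch with `pack_padded_sequence` before passing it to the LSTM. The sketch below illustrates why this matters, under toy assumptions (random inputs, a small LSTM): with packing, the final hidden state of a short sequence matches what the LSTM produces on that sequence alone, whereas without packing the LSTM keeps updating its state on the padded steps.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(1)
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)

# Two embedded sequences padded to length 5; the second has only 2 real steps
x = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 2])
x[1, 2:] = 0.0  # padded positions

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (h_packed, _) = lstm(packed)

# Reference: run the short sequence alone, without any padding
_, (h_alone, _) = lstm(x[1:2, :2])

# Without packing, the padded steps are processed like real inputs
_, (h_padded, _) = lstm(x)

print(torch.allclose(h_packed[-1, 1], h_alone[-1, 0], atol=1e-5))
print(torch.allclose(h_padded[-1, 1], h_alone[-1, 0], atol=1e-5))
```

The `lengths` tensor must be on the CPU for packing, which is why the model calls `lengths.cpu()` before packing.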

- Now, we will develop the `train` function to train the model on the given dataset for one epoch and return the classification accuracy and loss:

In [51]:
def train(dataloader):
    model.train()
    total_acc, total_loss = 0, 0
    for text_batch, label_batch, lengths in dataloader:
        optimizer.zero_grad()
        pred = model(text_batch, lengths)[:, 0]
        loss = loss_fn(pred, label_batch)
        loss.backward()
        optimizer.step()
        total_acc += (
            (pred >= 0.5).float() == label_batch
        ).float().sum().item()
        total_loss += loss.item()*label_batch.size(0)
    return total_acc / len(dataloader.dataset), \
        total_loss / len(dataloader.dataset)

- Similarly, we will develop the evaluate function to measure the model’s performance on a given dataset:

In [52]:
def evaluate(dataloader):
    model.eval()
    total_acc, total_loss = 0, 0
    with torch.no_grad():
        for text_batch, label_batch, lengths in dataloader:
            pred = model(text_batch, lengths)[:, 0]
            loss = loss_fn(pred, label_batch)
            total_acc += ((pred>=0.5).float() == label_batch).float().sum().item()
            total_loss += loss.item()*label_batch.size(0)
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)

- The next step is to create a `loss function` and optimizer `(Adam optimizer)`. For a binary classification with a single class-membership probability output, we use the binary cross-entropy loss `(BCELoss)` as the loss function:

In [53]:
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
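
As a side note, `nn.BCELoss` expects probabilities, which is why the model ends with a sigmoid. An alternative that is not used in this notebook is to drop the final sigmoid and use `nn.BCEWithLogitsLoss`, which applies the sigmoid internally in a numerically more stable way. The sketch below, using made-up scores and labels, shows that both routes give the same loss value:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
logits = torch.randn(6)                       # raw scores before the sigmoid
targets = torch.tensor([1., 0., 1., 1., 0., 0.])

# Route used in this notebook: sigmoid in the model, BCELoss on probabilities
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Alternative: no sigmoid in the model, BCEWithLogitsLoss on raw scores
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_from_probs.item(), loss_from_logits.item())
```

If the alternative were adopted, the accuracy computation would have to threshold the raw scores at 0 and not the probabilities at 0.5.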

- Now we will train the model for 10 epochs and display the training and validation performances:

In [55]:
num_epochs = 10
torch.manual_seed(1)
for epoch in range(num_epochs):
    acc_train, loss_train = train(train_dl)
    acc_valid, loss_valid = evaluate(valid_dl)
    print(f"Epoch {epoch} Accuracy: {acc_train:.4f}"
          f" Val_accuracy: {acc_valid:.4f}")

Epoch 0 Accuracy: 0.9781 Val_accuracy: 0.8598
Epoch 1 Accuracy: 0.9872 Val_accuracy: 0.8694
Epoch 2 Accuracy: 0.9901 Val_accuracy: 0.8640
Epoch 3 Accuracy: 0.9913 Val_accuracy: 0.8518
Epoch 4 Accuracy: 0.9941 Val_accuracy: 0.8576
Epoch 5 Accuracy: 0.9933 Val_accuracy: 0.8528
Epoch 6 Accuracy: 0.9969 Val_accuracy: 0.8416
Epoch 7 Accuracy: 0.9951 Val_accuracy: 0.8564
Epoch 8 Accuracy: 0.9983 Val_accuracy: 0.8576
Epoch 9 Accuracy: 0.9996 Val_accuracy: 0.8588
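
The log shows the validation accuracy peaking early while the training accuracy keeps rising. One common way to act on such a log, which is not part of this notebook, is to keep the model from the best validation epoch or to stop training after a number of epochs without improvement. The following sketch works on the validation accuracies from the log; the helper names and the `patience` parameter are hypothetical:

```python
# Validation accuracies copied from the training log above (epochs 0-9)
val_acc = [0.8598, 0.8694, 0.8640, 0.8518, 0.8576,
           0.8528, 0.8416, 0.8564, 0.8576, 0.8588]

def best_epoch(history):
    """Return (epoch, value) for the highest validation accuracy."""
    epoch = max(range(len(history)), key=history.__getitem__)
    return epoch, history[epoch]

def should_stop(history, patience):
    """True if the best value is at least `patience` epochs old."""
    epoch, _ = best_epoch(history)
    return len(history) - 1 - epoch >= patience

epoch, acc = best_epoch(val_acc)
print(f"best epoch: {epoch}, val_accuracy: {acc:.4f}")
print(should_stop(val_acc[:5], patience=3))
```

In a training loop, `should_stop` would be called after each epoch, and the model weights would be saved whenever a new best validation accuracy is reached.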


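- The training accuracy approaches 100% while the validation accuracy stays at roughly 84-87%, which suggests that the model overfits the training data.

- After training this model for 10 epochs, we will evaluate it on the test data: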

In [56]:
acc_test, _ = evaluate(test_dl)
print(f'test_accuracy: {acc_test:.4f}') 

test_accuracy: 0.8540


In [58]:
# for text_batch, label_batch, lengths in dataloader:
#     print(text_batch, label_batch, lengths)

#### **More on the bidirectional RNN**

In addition, we will set the `bidirectional` configuration of the `LSTM` to True, which will make the recurrent layer pass through the input sequences from both directions, start to end, as well as in the reverse direction:

In [59]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 
                                      embed_dim, 
                                      padding_idx=0) 
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, 
                           batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(rnn_hidden_size*2, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True)
        _, (hidden, cell) = self.rnn(out)
        out = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
    
torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size) 
model = model.to(device)

In [60]:
model

RNN(
  (embedding): Embedding(69025, 20, padding_idx=0)
  (rnn): LSTM(20, 64, batch_first=True, bidirectional=True)
  (fc1): Linear(in_features=128, out_features=64, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=64, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

- The bidirectional RNN layer makes two passes over each input sequence: a forward pass and a reverse or backward pass (note that this is not to be confused with the forward and backward passes in the context of backpropagation). 

- The hidden states from both passes are concatenated at each time step and passed to the next layer. This allows the model to capture information from both past and future contexts, which can be particularly beneficial for tasks like sentiment analysis where understanding the full context of a sentence is important.

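- Other ways to merge the two directions include summation, averaging, and multiplication of the hidden states instead of concatenation. `nn.LSTM` itself always concatenates, so these alternatives would have to be applied manually to its output.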

- We can also try other types of recurrent layers, such as `GRU` or regular `RNN` layers, instead of the `LSTM` layer.

- However, a model built with regular RNN layers may not perform as well as one built with LSTM or GRU layers, especially for longer sequences, due to the vanishing gradient problem.
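
To make the indexing in `torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)` concrete, the sketch below inspects the outputs of a single-layer bidirectional LSTM with the same sizes as the model above. The input is random and the batch and sequence sizes are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
lstm = nn.LSTM(input_size=20, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(4, 7, 20)                 # (batch, seq_len, embed_dim)
output, (hidden, cell) = lstm(x)

print(output.shape)   # forward and backward states concatenated per time step
print(hidden.shape)   # one final state per direction

# Concatenate the final forward and backward states, as in the model's forward method
features = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(features.shape)

# The forward direction ends at the last step, the backward direction at the first step
print(torch.allclose(hidden[-2], output[:, -1, :64]))
print(torch.allclose(hidden[-1], output[:, 0, 64:]))
```

The concatenated feature vector has size `2 * rnn_hidden_size`, which is why `fc1` takes `rnn_hidden_size*2` input features in the bidirectional model.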

In [None]:
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)

num_epochs = 10 

torch.manual_seed(1)
 
for epoch in range(num_epochs):
    acc_train, loss_train = train(train_dl)
    acc_valid, loss_valid = evaluate(valid_dl)
    print(f'Epoch {epoch} accuracy: {acc_train:.4f} val_accuracy: {acc_valid:.4f}')

In [None]:
from datasets import load_dataset
from torch.utils.data import DataLoader

# Reload the test split; load_dataset returns a dict-like object indexed by split name
imdb = load_dataset("imdb")

test_dataset = list(imdb["test"])
test_dl = DataLoader(test_dataset, batch_size=batch_size,
                     shuffle=False, collate_fn=collate_batch)

In [None]:
acc_test, _ = evaluate(test_dl)
print(f'test_accuracy: {acc_test:.4f}') 

---