# LSTM-based Seq2Seq Model for Abstractive Summarization

You can ask your questions on Telegram: @FatemehNikkhoo

Name = "Seyyed Amirmahdi Sadrzadeh"

StudentId = "401102015"

# Import Libraries

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from tqdm import tqdm
import numpy as np
import random



In [2]:
# Set up device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the Dataset

# Extreme Summarization (XSum) Dataset

The **XSUM** dataset is designed for the task of extreme summarization, where the goal is to generate a single-sentence summary for a news article.

### Features:
- **document:** The input news article.
- **summary:** A one-sentence summary of the article.
- **id:** A unique BBC ID for each article.

For more details and to explore the dataset, you can visit the official [Hugging Face XSUM page](https://huggingface.co/datasets/xsum).


In [17]:
# 1. Load the XSUM dataset
print("Loading XSUM dataset...")

# Load each split using slice syntax
raw_datasets = {
    "train": load_dataset("xsum", split="train[:2000]"),
    "validation": load_dataset("xsum", split="validation[:500]"),
    "test": load_dataset("xsum", split="test[:500]")
}

Loading XSUM dataset...


In [4]:
# Data Inspection

# Inspect the dataset size
for split, data in raw_datasets.items():
    print(f"{split} size: {len(data)}")

# Inspect a random sample of the train dataset
train_len = len(raw_datasets['train'])
# Select a random index between 0 and train_len - 1
random_index = random.randint(0, train_len  - 1)
print(f"Sample from random index: {random_index}\n")
for key in raw_datasets['train'][random_index]:
    print(f"{key}: {raw_datasets['train'][random_index][key]}\n")

train size: 500
validation size: 100
test size: 100
Sample from random index: 30

document: The team went into administration in October but, as revealed by BBC Sport, have secured investment from Stephen Fitzpatrick, boss of energy firm Ovo.
Former Sainsbury's boss Justin King has joined as interim chairman.
He said he was confident that Manor had "the right people, the right values and sheer hard work" and would be "competitive at the highest level".
King is not financially involved in the team but will take a leading role on the business side of the operation.
Fitzpatrick's investment is a personal one and the money he has put into the team does not come from Ovo.
He said: "I have a lifelong passion for F1 and can't wait for the season ahead."
Manor Marussia have announced Englishman Will Stevens will be one of their drivers and said a deal to sign the second would be completed soon.
The team's new car, a modified version of the 2014 model, must pass F1's mandatory crash tests befor

# Tokenization

### Question:
- What is the role of a tokenizer in Natural Language Processing (NLP)?
- What does it mean to "tokenize" text, and why is this step necessary?

### Tokenization in NLP


A tokenizer is responsible for breaking down raw text into smaller units called **tokens** — which can be words, subwords, or even characters depending on the tokenizer used. This is one of the very first steps in any NLP pipeline, and it's essential for converting unstructured text into a format that a model can understand and process.


To "tokenize" text means to split it into meaningful elements (tokens). For example, the sentence:
> *"My Deep Learning code."*

Could be tokenized as:
> `["My", "Deep", "Learning", "code", "."]`

Or with subword tokenization:
> `["My", "Deep", "Lear", "ning", "code", "."]`

This step is necessary because **neural networks can't operate directly on raw text**; they require numerical input. Tokenization bridges this gap by turning text into tokens, which can then be mapped to numeric representations (token IDs and, from those, embeddings). Working with token sequences also makes later steps such as building a vocabulary, padding, and truncation well defined.
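
As a rough illustration of the two steps described above, the sketch below splits a sentence into word-level tokens with a regular expression and then maps each token to an integer ID. The `simple_tokenize` helper and the tiny `token_to_id` table are hypothetical stand-ins for illustration only; they are not the `basic_english` tokenizer or the vocabulary used later in this notebook.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("My Deep Learning code.")
print(tokens)  # ['my', 'deep', 'learning', 'code', '.']

# Assign each distinct token an integer ID in order of first appearance.
token_to_id = {}
for tok in tokens:
    token_to_id.setdefault(tok, len(token_to_id))

token_ids = [token_to_id[tok] for tok in tokens]
print(token_ids)  # [0, 1, 2, 3, 4]
```

The ID sequence, not the raw string, is what the model eventually receives.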




In [5]:
# 2. Tokenization

# Apply tokenization on the 'document' (news article) and 'summary' (highlight).
def tokenize_function(example, tokenizer):
    """
    This function takes a single example and tokenizes its document and summary using the provided tokenizer.

    Args:
    example (dict): A dictionary containing text data with keys like "document" and "summary".
    tokenizer: A tokenizer instance (e.g., from `torchtext` or `transformers`).

    Returns:
    dict: A dictionary containing tokenized inputs and target sequences with keys 'input_ids' and 'target_ids'.
    """
    # TODO: Apply tokenization
    inputs = tokenizer(example["document"])  # Tokenize the article (input)
    targets = tokenizer(example["summary"])  # Tokenize the summary (target)
    return {"input_ids": inputs, "target_ids": targets}


# Tokenizer (using basic English tokenizer)
tokenizer = get_tokenizer("basic_english")  # Basic word-level tokenization

# Applying the tokenizer function to the dataset
tokenized_datasets = {
    split: raw_datasets[split].map(lambda example: tokenize_function(example, tokenizer))
    for split in ["train", "validation", "test"]
}

# TODO: Inspect a sample of tokenized_datasets['train'] to better understand the results
# Print the keys and values of the sample at the random_index that was calculated earlier
print(f"\nTokenized sample at index {random_index} from 'train' set:")
for key in tokenized_datasets["train"][random_index]:
    print(f"{key}: {tokenized_datasets['train'][random_index][key]}\n")



Tokenized sample at index 30 from 'train' set:
document: The team went into administration in October but, as revealed by BBC Sport, have secured investment from Stephen Fitzpatrick, boss of energy firm Ovo.
Former Sainsbury's boss Justin King has joined as interim chairman.
He said he was confident that Manor had "the right people, the right values and sheer hard work" and would be "competitive at the highest level".
King is not financially involved in the team but will take a leading role on the business side of the operation.
Fitzpatrick's investment is a personal one and the money he has put into the team does not come from Ovo.
He said: "I have a lifelong passion for F1 and can't wait for the season ahead."
Manor Marussia have announced Englishman Will Stevens will be one of their drivers and said a deal to sign the second would be completed soon.
The team's new car, a modified version of the 2014 model, must pass F1's mandatory crash tests before they can race at the season-open

# Build Vocabulary

In NLP tasks, the vocabulary maps each token (word) to a unique integer ID.

### Question:
- What are the special tokens `"<unk>"` and `"<pad>"` used for in vocabulary generation?
- Why should we build the vocabulary using only the training data?

### Special Tokens in Vocabulary


- `"<unk>"` stands for **unknown token**. It's used to represent any word that is **not present in the vocabulary**. This is important when the model encounters a word it has never seen during training.
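- `"<pad>"` stands for **padding token**. It is appended to shorter sequences so that all sequences in a batch have the same length, which is needed whenever sequences are stacked into a single tensor (for RNNs and transformers alike). Padding positions carry no information, so they should be excluded from the loss (for example with `ignore_index`) or masked out so that they do not affect training.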


We build the vocabulary **only on the training data** to avoid **data leakage**. At inference time the model will meet words it never saw during training, and those are mapped to `<unk>`. Including tokens from the validation or test sets would hide this effect and give the model information it wouldn't have in real-world use, leading to overly optimistic performance results.
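
The behaviour of the special tokens can be sketched without any library. The example below is a minimal, hypothetical illustration (`build_simple_vocab` and `encode` are made-up helpers, not the `torchtext` API): it builds a vocabulary from toy training data only and shows an unseen word falling back to `<unk>`.

```python
from collections import Counter

def build_simple_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map each token to an ID: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi):
    """Look up each token, falling back to <unk> for unseen words."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

# Vocabulary built from (toy) training data only.
train_tokens = [["the", "cat", "sat"], ["the", "dog", "sat"]]
stoi = build_simple_vocab(train_tokens)

# "bird" never appeared in training, so it maps to the <unk> ID (0).
print(encode(["the", "bird", "sat"], stoi))  # [2, 0, 3]
```

If "bird" had been added to the vocabulary from a test document, the evaluation would no longer reflect how the model copes with genuinely unseen words.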


In [15]:
# 3. Build Vocabulary
def build_vocab(texts, tokenizer, specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """
    Builds a vocabulary from the provided raw text data.
    The vocabulary maps each token (word) to a unique integer ID.
    Special tokens (<unk>, <pad>, <sos>, <eos>) are placed first, so they get IDs 0-3.
    """
    return build_vocab_from_iterator(map(tokenizer, texts), specials=list(specials))

# TODO: Build the vocabulary from the training data considering both 'documents' and 'summary'
# Combine both articles and summaries from the training split
train_articles = [example["document"] for example in raw_datasets["train"]]
train_summaries = [example["summary"] for example in raw_datasets["train"]]
combined_texts = train_articles + train_summaries  # Use this as `texts`

# Build vocabulary with special tokens, including <sos> and <eos>
vocab = build_vocab(combined_texts, tokenizer)

# Set default index to <unk> for unknown words
vocab.set_default_index(vocab["<unk>"])

# Inspecting the vocabulary:
vocab_size = len(vocab)
print(f"Vocabulary size: {vocab_size}")

# Print first 10 tokens and their IDs
print("Sample tokens and their corresponding IDs:")
for token in list(vocab.get_itos())[:10]:
    print(token, vocab[token])


Vocabulary size: 17746
Sample tokens and their corresponding IDs:
<unk> 0
<pad> 1
<sos> 2
<eos> 3
the 4
. 5
, 6
to 7
of 8
a 9


## Padding Function

### Question:
- Why is padding important in data preprocessing for NLP tasks, and why should we do it?


In NLP, input sequences (like sentences or documents) often have **different lengths**, but neural networks — especially when using batches — require inputs to have **uniform dimensions**.

**Padding** solves this by adding a special `<pad>` token to shorter sequences so that all sequences in a batch have the same length. This enables efficient parallel processing on GPUs and ensures compatibility with batch-based training.

Padding is especially important when:
- Training models like RNNs or Transformers on batches, where all sequences in a batch must be stacked into one tensor of uniform shape.
- Using masking to ignore padded values during attention or loss computation.

Without padding, we'd either need to process one sequence at a time (which is inefficient) or truncate every sequence in a batch to the length of the shortest one, discarding valuable information.
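
A minimal sketch of batch padding with an accompanying mask, in plain Python. The helper name `pad_batch` and the value `PAD_ID = 1` are assumptions for illustration (the ID mirrors the position of `<pad>` in this notebook's vocabulary); the notebook itself pads to a fixed `MAX_LENGTH` instead of the longest sequence in the batch.

```python
PAD_ID = 1  # assumed ID of <pad>

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one and build a mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # mask[i][j] is 1 for a real token and 0 for padding
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [[4, 9, 7], [8, 5], [6]]
padded, mask = pad_batch(batch)
print(padded)  # [[4, 9, 7], [8, 5, 1], [6, 1, 1]]
print(mask)    # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
```

The mask is what lets a loss function or attention layer skip the padded positions.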



In [19]:
# 4. Padding function (modified to accept token IDs)
# Constants
MAX_LENGTH = 64  # Maximum sequence length
PAD_IDX = vocab["<pad>"]  # Padding token index
UNK_IDX = vocab["<unk>"]  # Unknown token index

def pad_to_max_length(seq, max_length=MAX_LENGTH, pad_idx=PAD_IDX):
    """
    Pads or truncates a sequence of token IDs to a fixed maximum length.
    """
    return seq + [pad_idx] * (max_length - len(seq)) if len(seq) < max_length else seq[:max_length]

# 5. Sequence processing function (ensure tokenization and conversion to token IDs)
def process_data(example, vocab, tokenizer):
    """
    Tokenizes the text, converts tokens to token IDs, and pads or truncates
    the input and target sequences to MAX_LENGTH. Also records the original
    (pre-truncation) lengths, which may exceed MAX_LENGTH.
    """

    # Tokenize input and target
    input_tokens = tokenizer(example["document"])

    # ADD <sos> and <eos> to target tokens
    target_tokens = ["<sos>"] + tokenizer(example["summary"]) + ["<eos>"]

    # Convert tokens to token IDs using the vocabulary
    input_ids  = [vocab[token] if token in vocab else UNK_IDX for token in input_tokens]
    target_ids = [vocab[token] if token in vocab else UNK_IDX for token in target_tokens]

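    # Save the original lengths before padding or truncation
    # (these can exceed MAX_LENGTH for long documents)
    input_len  = len(input_ids)
    target_len = len(target_ids)

    # Pad or truncate (a target longer than MAX_LENGTH loses its trailing <eos>)
    input_ids  = pad_to_max_length(input_ids)
    target_ids = pad_to_max_length(target_ids)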

    # Return as plain lists (we convert to tensors later)
    return {
        "input_ids": input_ids,
        "target_ids": target_ids,
        "input_len": input_len,
        "target_len": target_len,
    }


# Apply processing to the datasets
processed_datasets = {
    split: raw_datasets[split].map(lambda example: process_data(example, vocab, tokenizer))
    for split in ["train", "validation", "test"]
}



# Creating Dataloaders and Custom Dataset Class

In [9]:
# 6. Custom Dataset Class
class Seq2SeqDataset(Dataset):
    """
    A PyTorch-compatible dataset wrapper for processed sequence-to-sequence data.

    This class takes tokenized, padded, and numericalized examples and allows them
    to be used with a DataLoader to enable batching, shuffling, and parallel loading.
    """

    def __init__(self, dataset):
        """
        Initializes the custom dataset.

        Args:
            dataset (datasets.Dataset): A single HuggingFace dataset split where each example is a dict
                                        containing 'input_ids', 'target_ids', 'input_len', 'target_len'.
        """
        self.dataset = dataset

    def __len__(self):
        """
        Returns:
            int: Total number of samples in the dataset.
        """
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Fetches the sample at a specific index.

        Args:
            idx (int): Index of the sample to retrieve.

        Returns:
            dict: A dictionary containing input/target sequences and their lengths.
                  These are returned as PyTorch tensors.
        """
        item = self.dataset[idx]
        return {
            "input_ids": torch.tensor(item["input_ids"], dtype=torch.long),  # Convert to tensor
            "target_ids": torch.tensor(item["target_ids"], dtype=torch.long),  # Convert to tensor
            "input_len": torch.tensor(item["input_len"], dtype=torch.long),  # Convert to tensor
            "target_len": torch.tensor(item["target_len"], dtype=torch.long)  # Convert to tensor
        }

BATCH_SIZE = 8

# Instantiate PyTorch-compatible datasets from the processed HuggingFace-style splits
train_dataset = Seq2SeqDataset(processed_datasets["train"])        # For training
valid_dataset = Seq2SeqDataset(processed_datasets["validation"])   # For validation
test_dataset  = Seq2SeqDataset(processed_datasets["test"])         # For testing

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE)
test_loader  = DataLoader(test_dataset, batch_size=BATCH_SIZE)

# Sanity Check – Inspect One Batch
batch = next(iter(train_loader))
print("Input shape:", batch["input_ids"].shape)
print("Target shape:", batch["target_ids"].shape)
print("Input lengths:", batch["input_len"][:5])
print("Target lengths:", batch["target_len"][:5])

Input shape: torch.Size([8, 128])
Target shape: torch.Size([8, 128])
Input lengths: tensor([990, 424, 501, 505, 147])
Target lengths: tensor([20, 28, 24, 14, 16])


### Seq2Seq Model

The following is a simple implementation of an LSTM-based Seq2Seq model for tasks like text summarization or machine translation.

#### Questions:
- **What is the Embedding Layer and Why is it Used?**  
- **What is Teacher Forcing and Why is it Used?**  

### Seq2Seq Model Concepts


The **embedding layer** is used to map each token ID (an integer) to a dense vector of fixed size. These vectors (embeddings) capture semantic information about words in a way that’s more meaningful than one-hot encoding.

For example, the words "king" and "queen" might have embeddings that are close in vector space, reflecting their semantic similarity. The embedding layer is the first step in most NLP models because it provides a way to represent words in a form that neural networks can understand and learn from.

> In PyTorch, we use `nn.Embedding(vocab_size, embedding_dim)` to define it.
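
Conceptually, an embedding layer is a lookup table with one row per token ID. The plain-Python sketch below illustrates this with a randomly initialised toy table (`embedding_table`, `embed`, and `one_hot_times_table` are hypothetical names; a real layer such as `nn.Embedding` learns the row values during training). It also shows that the lookup gives the same result as multiplying a one-hot vector by the table.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

vocab_size, embedding_dim = 6, 4

# One row of `embedding_dim` floats per token ID.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Replace each token ID with its row of the table."""
    return [embedding_table[idx] for idx in token_ids]

def one_hot_times_table(idx):
    """Same result, computed as one-hot vector times the table."""
    one_hot = [1.0 if i == idx else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * embedding_table[i][j] for i in range(vocab_size))
        for j in range(embedding_dim)
    ]

vectors = embed([2, 0, 3])
print(len(vectors), len(vectors[0]))  # 3 4
```

The direct lookup avoids ever materialising the sparse one-hot vectors.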

---


**Teacher forcing** is a training strategy used in sequence-to-sequence models, especially when generating sequences like translations or summaries.

During training, at each time step, instead of feeding the model's previous prediction as input to the decoder, we feed in the **actual ground truth token** (i.e., the correct word from the target sequence). This speeds up and stabilizes learning, especially early in training, when the model's own predictions are still mostly wrong.

Without teacher forcing, the model may propagate its own early mistakes through the entire sequence, leading to poor training dynamics.

> Teacher forcing is used **only during training**. At inference time no ground-truth target is available, so the decoder has to consume its own predictions.
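
The choice made at each step can be sketched with a toy decoder. In the example below, `toy_decoder_step` is a hypothetical stand-in for a real decoder (it just applies a fixed arithmetic rule), and `decode_inputs` records which token is fed to the decoder at every step for a given teacher forcing ratio.

```python
import random

def toy_decoder_step(prev_token):
    """Hypothetical stand-in for a decoder: 'predicts' the next token ID."""
    return (prev_token * 2) % 7

def decode_inputs(target, teacher_forcing_ratio, rng):
    """Return the sequence of tokens fed to the decoder, one per step."""
    input_token = target[0]  # first input is <sos>
    fed = [input_token]
    for t in range(1, len(target) - 1):
        prediction = toy_decoder_step(input_token)
        use_truth = rng.random() < teacher_forcing_ratio
        # Next input: ground-truth token under teacher forcing, else the model's own guess
        input_token = target[t] if use_truth else prediction
        fed.append(input_token)
    return fed

target = [2, 5, 6, 4, 3]  # toy IDs, e.g. <sos>=2 ... <eos>=3
rng = random.Random(0)
print(decode_inputs(target, 1.0, rng))  # [2, 5, 6, 4]  (always ground truth)
print(decode_inputs(target, 0.0, rng))  # [2, 4, 1, 2]  (always own predictions)
```

With a ratio of 0.0 the first wrong guess (4 instead of 5) changes every later input, which is the error propagation that teacher forcing avoids.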


In [10]:
# Encoder class
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # TODO: Embedding layer to convert token IDs to embeddings
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # TODO: LSTM layer to process sequences and output hidden and cell states
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, input_ids):
        # Convert token IDs to embeddings
        embedded = self.embedding(input_ids)  # Shape: (B, T, E)
        # Process the embeddings with the LSTM
        output, (hidden, cell) = self.lstm(embedded)  # output not used here
        return hidden, cell


# Decoder class
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # TODO: Embedding layer to convert token IDs to embeddings
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # TODO: LSTM layer to process the current token and hidden state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

        # Fully connected layer to predict the next token in the sequence
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_token, hidden, cell):
        # Convert current token to embedding
        embedded = self.embedding(input_token.unsqueeze(1))  # Shape: (B, 1, E)
        # Process the embedded token with the LSTM and pass hidden, cell states
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))  # output: (B, 1, H)
        # Get the logits for the next token prediction
        logits = self.fc(output.squeeze(1))  # Shape: (B, vocab_size)
        return logits, hidden, cell


# Seq2Seq class to combine the encoder and decoder
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        # TODO: Initialize the encoder and decoder here
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        batch_size = src.size(0)  # Number of sequences in the batch
        max_len = tgt.size(1)     # Maximum length of the target sequence
        vocab_size = self.decoder.fc.out_features  # Size of the vocabulary

        # Tensor to hold all predictions (outputs) for each token
        outputs = torch.zeros(batch_size, max_len, vocab_size).to(src.device)

        # TODO: Get initial hidden and cell states from the encoder
        hidden, cell = self.encoder(src)

        # First decoder input is the start-of-sequence token
        input_token = tgt[:, 0]  # Shape: (B,)

        for t in range(1, max_len):
            # Pass the current token and states to the decoder
            output, hidden, cell = self.decoder(input_token, hidden, cell)
            outputs[:, t] = output  # Store the output for the current time step

            # Apply teacher forcing: decide whether to use true target or predicted token
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)  # Get the predicted token (max logit)

            # Use the true token (from the target) if teacher forcing is applied, otherwise use predicted token
            input_token = tgt[:, t] if teacher_force else top1

        return outputs


# Training and Evaluation Function

In [None]:
# Training function
def train(model, train_loader, optimizer, criterion, device, teacher_forcing_ratio=0.5):
    model.train()  # Set model to training mode
    epoch_loss = 0  # Track total loss for the epoch

    for batch_idx, batch in enumerate(train_loader):
        # Move data to the device (GPU or CPU)
        src = batch['input_ids'].to(device)
        tgt = batch['target_ids'].to(device)

        # TODO: Zero the gradients before each backpropagation
        optimizer.zero_grad()

        # TODO: Forward pass through the model
        output = model(src, tgt, teacher_forcing_ratio=teacher_forcing_ratio)

        # Flatten the output and target for loss calculation
        output_dim = output.shape[-1]
        output = output[:, 1:, :].contiguous().view(-1, output_dim)  # Skip <sos> predictions
        tgt = tgt[:, 1:].contiguous().view(-1)  # Skip <sos> targets

        # Calculate the loss
        loss = criterion(output, tgt)
        epoch_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

        if batch_idx % 10 == 0:
            print(f"Batch {batch_idx}/{len(train_loader)} Loss: {loss.item():.4f}")

    return epoch_loss / len(train_loader)


# Evaluation function
def evaluate(model, valid_loader, criterion, device):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for batch in valid_loader:
            src = batch['input_ids'].to(device)
            tgt = batch['target_ids'].to(device)

            # TODO: Forward pass through the model (no teacher forcing)
            output = model(src, tgt, teacher_forcing_ratio=0.0)

            output_dim = output.shape[-1]
            output = output[:, 1:, :].contiguous().view(-1, output_dim)
            tgt = tgt[:, 1:].contiguous().view(-1)

            loss = criterion(output, tgt)
            epoch_loss += loss.item()

    return epoch_loss / len(valid_loader)


# Training loop function
def train_loop(model, train_loader, valid_loader, optimizer, criterion, num_epochs, device):
    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch+1}/{num_epochs}")

        train_loss = train(model, train_loader, optimizer, criterion, device)
        print(f"Training Loss: {train_loss:.4f}")

        valid_loss = evaluate(model, valid_loader, criterion, device)
        print(f"Validation Loss: {valid_loss:.4f}")


# Configurations
vocab_size = len(vocab)
embed_dim = 256   # Dimensionality of word embeddings
hidden_dim = 512  # Hidden state size of the LSTM

# TODO: Initialize Model
encoder = Encoder(vocab_size, embed_dim, hidden_dim)
decoder = Decoder(vocab_size, embed_dim, hidden_dim)
model = Seq2Seq(encoder, decoder).to(device)

# Optimizer and Loss Function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Number of epochs
num_epochs = 5

# Train the model
train_loop(model, train_loader, valid_loader, optimizer, criterion, num_epochs, device)



Epoch 1/5
Batch 0/63 Loss: 9.7864
Batch 10/63 Loss: 9.2296
Batch 20/63 Loss: 7.9409
Batch 30/63 Loss: 7.4734
Batch 40/63 Loss: 7.6693
Batch 50/63 Loss: 7.5875
Batch 60/63 Loss: 7.3606
Training Loss: 7.9273
Validation Loss: 7.3422

Epoch 2/5
Batch 0/63 Loss: 6.6768
Batch 10/63 Loss: 6.6220
Batch 20/63 Loss: 6.6040
Batch 30/63 Loss: 6.4740
Batch 40/63 Loss: 6.3505


# Predictions vs Ground Truth (Qualitative Evaluation)

In [12]:
def generate_prediction(model, src, tgt, device, vocab):
    model.eval()  # Set model to evaluation mode

    # Move the source and target to the correct device (CPU/GPU)
    src = src.to(device)
    tgt = tgt.to(device)

    # TODO: Generate output using the model (disable teacher forcing here)
    with torch.no_grad():
        output = model(src, tgt, teacher_forcing_ratio=0.0)

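    # Get the predicted tokens (taking argmax across vocab size).
    # Note: outputs[:, 0] is never written by the model and stays all zeros,
    # so argmax returns index 0 (<unk>) at the first position.
    predicted_tokens = output.argmax(2)  # (batch_size, seq_len)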

    # Get vocab index-to-token mapping
    itos = vocab.get_itos()

    # TODO: Convert token IDs back to text using the vocab's get_itos() method
    predicted_text = []
    for seq in predicted_tokens:
        tokens = [itos[idx.item()] for idx in seq if idx.item() != PAD_IDX]
        predicted_text.append(" ".join(tokens))

    # TODO: Convert the target tokens to text as well for comparison
    target_text = []
    for seq in tgt:
        tokens = [itos[idx.item()] for idx in seq if idx.item() != PAD_IDX]
        target_text.append(" ".join(tokens))

    return predicted_text, target_text

# Generate a prediction for the first example of the test data
src_sample = test_loader.dataset[0]['input_ids']  # First input example from the test set
tgt_sample = test_loader.dataset[0]['target_ids']  # First target example from the test set

predictions, actuals = generate_prediction(model, src_sample.unsqueeze(0), tgt_sample.unsqueeze(0), device, vocab)

# Now let's print the comparison
print("Predicted Text:", predictions[0])
print("Actual Target Text:", actuals[0])

Predicted Text: <unk> ' s the the the the the the of the the of the in the , the . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Actual Target Text: there is a chronic need for more housing for prison <unk> in wales , according to a charity .


## Bonus: Incorporate Attention to the Model and Evaluate the Results

Incorporating an **attention** mechanism into the Seq2Seq model lets the decoder look at all encoder outputs at every step instead of relying only on the encoder's final hidden and cell states. This helps the model focus on the relevant parts of the input sequence while generating output, and it is particularly useful for longer sequences, where a single fixed-size state struggles to capture long-range dependencies.
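
As a starting point, the core computation of one common variant, dot-product attention, can be sketched in plain Python. This is an illustrative sketch under assumptions, not the required solution: `dot_product_attention` and the toy vectors are hypothetical, and in the model above the `Encoder` would first have to return its per-step `output` tensor, which it currently discards.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_outputs):
    """Weight each encoder output by its similarity to the decoder state."""
    # One score per source position: dot product with the decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: weighted sum of the encoder outputs
    context = [
        sum(w * enc[j] for w, enc in zip(weights, encoder_outputs))
        for j in range(dim)
    ]
    return context, weights

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three source positions
decoder_state = [0.0, 2.0]
context, weights = dot_product_attention(decoder_state, encoder_outputs)
print(weights)
print(context)
```

In a full model the context vector is typically concatenated with the decoder's LSTM output (or its input embedding) before the final linear layer.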

In [None]:
# TODO:
# Place your code here
# Hint: You can modify the main code of LSTM-based Seq2Seq model