[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/sk-classroom/asc-transformers/blob/main/exercise/exercise_01.ipynb)

![](https://cdn.britannica.com/03/134503-050-060DD73F/Bombe-American-version-messages-cipher-machines-Britain.jpg)

In this notebook, we will be creating a seq2seq model for deciphering a simple cipher. 
References: 
- [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)

# Preparation

In [139]:
# If you are using Google Colab or local environments, install the following packages:
#!pip install spacy
#!pip install torchtext
#!pip install pytorch-lightning
#!pip install secretpy

In [140]:
# Let's import the necessary packages
import torch
import numpy as np
from scipy import linalg, sparse
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pytorch_lightning as pyl

# Seq2Seq model

Let us implement a seq2seq model with an attention mechanism. We will first implement its building blocks, namely `Encoder`, `Decoder`, and `Attention`, and then put them together to form a seq2seq model.


## Encoder 

Let's implement `Encoder`. 
While [the original paper uses a four-layer LSTM](https://arxiv.org/abs/1409.3215), we will cut it down to a simpler encoder, namely a two-layer [Gated Recurrent Unit (GRU) by Cho et al.](https://arxiv.org/pdf/1406.1078v3.pdf). The GRU simplifies the LSTM by omitting the cell state, so it maintains only the hidden state. Namely, 

$$
h_{t} = \text{GRU}(x_{t}, h_{t-1})
$$

*Multi-layered* GRU means that GRU units are stacked on top of each other, where the $\ell$-th GRU ($\ell \geq 2$) takes the hidden state of the $(\ell-1)$-th GRU as its input. For example, a two-layer GRU is given by 

$$
h^{(1)}_{t} = \text{GRU}(x_{t}, h^{(1)}_{t-1}) \\
h^{(2)}_{t} = \text{GRU}(h^{(1)}_{t}, h^{(2)}_{t-1})
$$

where $h^{(\ell)}_t$ represents the hidden state of the $\ell$-th layer at time $t$. We will then use all layers' hidden states at the end of the sequence as the inputs to the decoder. 
With PyTorch, we can easily implement the multi-layer GRU. See [the documentation](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html).
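
As a quick shape check for the stacked GRU described above, the following sketch runs a two-layer `torch.nn.GRU` on random data. The sizes (a batch of 4, length 5, input dimension 8, hidden dimension 16) are arbitrary values chosen for illustration, not the settings used later in this notebook.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not the notebook's settings)
batch_size, seq_len, input_dim, hidden_dim = 4, 5, 8, 16

gru = torch.nn.GRU(input_dim, hidden_dim, num_layers=2, batch_first=True)
x = torch.randn(batch_size, seq_len, input_dim)

# outputs: hidden state of the top layer at every time step
# hidden:  hidden state of every layer at the last time step
outputs, hidden = gru(x)

print(outputs.shape)  # torch.Size([4, 5, 16])
print(hidden.shape)   # torch.Size([2, 4, 16])

# For a unidirectional GRU, the last time step of `outputs`
# coincides with the last layer of `hidden`
print(torch.allclose(outputs[:, -1, :], hidden[-1]))
```

Note that `hidden` keeps the layer dimension first even when `batch_first=True`, which is why the decoder can be initialized directly with it.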

Here, let's implement `Encoder` class as follows. 

**Step 1**: `Encoder` will take sequences of integer tokens, represented as a tensor of size <batch_size x max_length>, where `batch_size` is the number of sentences in a batch, and `max_length` is the maximum length of the sentences in the batch. 

**Step 2**: The integer tokens are mapped to the vectors of size `embedding_size` by using `torch.nn.Embedding`, namely
$$
z_t = \text{Embedding}(x_t)
$$
where $z_t$ is the vector representation of the token $x_t$. 

**Step 3**: A dropout is performed on $z_t$:

$$
z_t = \text{Dropout}(z_t )
$$

Dropout is a technique to prevent overfitting by randomly dropping out some neurons during training. For example, if we set the dropout rate to 0.5, then 50% of the neurons will output zeros randomly during the training.  Neural networks with dropout tend to avoid relying on specific neurons, and instead learn robust mapping between the input and output. 

**Step 4**: The embedding $z_t$ is fed into the two-layer GRU:
$$
h^{(1)}_t = \text{GRU}(z_t, h^{(1)}_{t-1}) \\
h^{(2)}_t = \text{GRU}(h^{(1)}_t, h^{(2)}_{t-1})
$$

**Step 5**: Output the hidden states at the last sequence time $T$, namely 

$$
(h^{(1)}_T, h^{(2)}_T)
$$
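
The dropout behavior described in Step 3 can be checked directly. This is a minimal sketch on a tensor of ones standing in for a batch of embeddings; the shape and the rate of 0.5 are chosen only for illustration.

```python
import torch

torch.manual_seed(0)

drop = torch.nn.Dropout(p=0.5)
z = torch.ones(2, 8)  # stand-in for a small batch of embeddings

# Training mode: each entry is zeroed with probability 0.5,
# and the surviving entries are scaled by 1 / (1 - 0.5) = 2
drop.train()
z_train = drop(z)

# Evaluation mode: dropout is the identity
drop.eval()
z_eval = drop(z)

print(z_train)
print(torch.equal(z_eval, z))
```

The rescaling in training mode keeps the expected value of each entry unchanged, so no correction is needed when dropout is switched off with `model.eval()`.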

```mermaid
flowchart LR
    input[/"Input Sequence
    (batch_size × max_length)"/]
    embed["Embedding Layer
    zt = Embedding(xt)"]
    dropout["Dropout Layer"]
    gru["Two-Layer GRU
    h1(t), h2(t) = GRU(zt)"]
    output[/"Final Hidden States
    (h1(T), h2(T)) and Outputs"/]
    input --> embed
    embed --> dropout
    dropout --> gru
    gru --> output

    style input fill:#D4E6F1,stroke:#000,color:#000
    style embed fill:#FAE5D3,stroke:#000,color:#000
    style dropout fill:#D5F5E3,stroke:#000,color:#000
    style gru fill:#E8DAEF,stroke:#000,color:#000
    style output fill:#FADBD8,stroke:#000,color:#000

    linkStyle default stroke:#000,stroke-width:2px
```

In [141]:
import torch

class Encoder(torch.nn.Module):

    def __init__(
        self, input_size, embedding_size, hidden_size, n_layers=2, dropout=0.1, bidirectional=False
    ):
        """Encoder class

        Parameters
        ----------
        input_size: int
            The number of unique tokens in the input sequence
        embedding_size: int
            The dimension of the embedding vectors
        hidden_size: int
            The dimension of the hidden states
        n_layers: int
            The number of layers in the GRU
        dropout: float
            The dropout rate
        """
        super(Encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers

        # TODO:
        self.embedding = torch.nn.Embedding(input_size, embedding_size)
        self.gru = torch.nn.GRU(embedding_size, hidden_size, n_layers, dropout=dropout, batch_first=True, bidirectional=bidirectional)
        self.dropout = torch.nn.Dropout(dropout)

        # Initialize the embedding
        torch.nn.init.xavier_uniform_(self.embedding.weight)

    def forward(self, X):
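        """
        Forward pass of the encoder

        Parameters
        ----------
        X: Tensor of shape <batch_size x max_length>
            The input sequence of integer tokens

        Returns
        -------
        outputs: Tensor of shape <batch_size x max_length x (n_directions * hidden_size)>
            The hidden states of the last layer at every time step
        hidden: Tensor of shape <(n_directions * n_layers) x batch_size x hidden_size>
            The hidden states of all layers at the last time step
        """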
        Z = self.embedding(X)
        Z = self.dropout(Z)
        outputs, hidden = self.gru(Z)
        return outputs, hidden

## Attention 

Let's implement the attention module. This module takes two inputs: 
- the hidden states of the encoder, $h_{1}, \ldots, h_{n}$
- the previous hidden state of the decoder, $s_{t-1}$

These states are concatenated and fed into an MLP to generate the attention scores, $e_{1t}, e_{2t}, \ldots, e_{nt}$, i.e., 

$$
e_{it} = \text{MLP}([h_i, s_{t-1}])
$$

where $i$ is the index of the encoder hidden states.  These scores are then normalized by the softmax function to generate the attention weights, $\alpha_{1t}, \alpha_{2t}, \ldots, \alpha_{nt}$, i.e., 

$$
\alpha_{it} = \frac{\exp(e_{it})}{\sum_{j=1}^{n} \exp(e_{jt})}
$$

Finally, a new context vector is generated by taking the weighted average: 

$$
c_t = \sum_{i=1}^{n} \alpha_{it} h_i
$$
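
The softmax normalization and the weighted average above can be worked through by hand in plain Python. In this sketch the scores $e_{it}$ are given directly rather than produced by an MLP, the encoder states are made-up numbers, and `attention_context` is a hypothetical helper written only for illustration.

```python
import math

def attention_context(scores, encoder_states):
    """Softmax the scores and return (weights, weighted sum of the states)."""
    # Subtract the maximum score for numerical stability
    max_score = max(scores)
    exps = [math.exp(e - max_score) for e in scores]
    total = sum(exps)
    weights = [v / total for v in exps]

    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Three encoder states of dimension 2 (made-up numbers)
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give uniform weights, so the context is the plain mean
weights, context = attention_context([0.0, 0.0, 0.0], H)
print(weights, context)

# A dominant score concentrates almost all weight on one state
weights_peaked, context_peaked = attention_context([10.0, 0.0, 0.0], H)
print(weights_peaked, context_peaked)
```

With equal scores the context vector is the mean of the encoder states; with one dominant score it is close to the corresponding state, which is how the decoder can focus on a single input position.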

In [142]:
class Attention(torch.nn.Module):

    def __init__(self, encoder_decoder_hidden_size, attention_hidden_size, n_layers_hidden, bidirectional = False):
        super(Attention, self).__init__()
        self.encoder_decoder_hidden_size = encoder_decoder_hidden_size
        self.attention_hidden_size = attention_hidden_size
        self.n_layers_hidden = n_layers_hidden
        self.bidirectional = 2 if bidirectional else 1
        self.enc2hidden = torch.nn.Linear(encoder_decoder_hidden_size * self.bidirectional, attention_hidden_size)
        self.dec2hidden = torch.nn.Linear(encoder_decoder_hidden_size * self.bidirectional * self.n_layers_hidden, attention_hidden_size)
        self.activation = torch.nn.Tanh()
        self.hidden2score = torch.nn.Linear(attention_hidden_size, 1)
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, encoder_outputs, decoder_hidden):
        """
        encoder_outputs: Tensor of shape <batch_size x seq_len x output_size>
        decoder_hidden: Tensor of shape <(n_directions * n_layers) x batch_size x hidden_size>
        """
        batch_size = encoder_outputs.size(0)
        seq_len = encoder_outputs.size(1)

        # Project encoder hidden states
        enc_proj = self.enc2hidden(encoder_outputs)  # [batch x seq_len x hidden]

        # Project decoder hidden state
        # Reshape decoder hidden from (x, batch_size, hidden_dim) to (batch_size, x*hidden_dim)
        concat_hidden = decoder_hidden.permute(1, 0, 2)  # (batch_size, x, hidden_dim)
        concat_hidden = concat_hidden.reshape(batch_size, -1)  # (batch_size, x*hidden_dim)
        concat_hidden = concat_hidden.unsqueeze(1)  # (batch_size, 1, x*hidden_dim)
        dec_proj = self.dec2hidden(concat_hidden)   # [batch x 1 x hidden]
        #dec_proj = dec_proj.expand(-1, seq_len, -1)  # Expand to match encoder sequence length

        # Combine and get attention scores
        hidden = enc_proj + dec_proj
        hidden = self.activation(hidden)
        scores = self.hidden2score(hidden)  # [batch x seq_len x 1]
        scores = self.softmax(scores)

        # Get context vector via weighted sum
        context_vector = torch.bmm(scores.transpose(1,2), encoder_outputs)  # [batch x 1 x hidden]
        return context_vector

## Decoder 

Let's implement `Decoder`. Following the `Encoder`, we will simplify the original implementation by using two-layer GRUs. 

The inputs to the decoder are
- The hidden states of the encoder, $h_{1}, \ldots, h_{n}$
- The hidden state of the decoder at the previous time step, $s_{t-1}$
- The previous token, $x_{t-1}$

The decoder first computes the context vector, $c_t$, by using the attention mechanism. 

$$
c_t = \text{Attention}(h_1, \ldots, h_n, s_{t-1})
$$

Apply dropout to $c_t$ and concatenate it with the embedding of the previous token, $x_{t-1}$:

$$
\begin{align}
z_t &= [\text{Dropout}(c_t), \text{Embedding}(x_{t-1})]
\end{align}
$$

where $\text{Embedding}$ is the embedding layer that maps the previous token to its embedding vector. We then feed $z_t$ into the GRU together with the previous hidden state:

$$
s_t = \text{GRU}\left(z_t, s_{t-1}\right)
$$

Finally, the decoder outputs the probability distribution of the next token, $P(x_t \vert x_0, \ldots, x_{t-1})$, from the hidden state $s_t$ using a linear readout followed by a softmax.  

$$
P(x_t \vert x_0, \ldots, x_{t-1}) = \text{Softmax}(\text{Linear}(s_t))
$$

In [143]:
class Decoder(torch.nn.Module):

    def __init__(
        self,
        input_size,
        embedding_size,
        readout_hidden_size,
        encoder_decoder_hidden_size,
        attention_hidden_size,
        output_size,
        n_layers = 2,
        dropout=0.1,
        bidirectional = False
    ):
        super(Decoder, self).__init__()
        self.input_size = input_size
        self.encoder_decoder_hidden_size = encoder_decoder_hidden_size
        self.embedding_size = embedding_size
        self.attention_hidden_size = attention_hidden_size
        self.readout_hidden_size = readout_hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Embedding layer to convert input tokens to vectors
        self.embedding = torch.nn.Embedding(input_size, embedding_size)

        # Dropout for regularization
        self.dropout = torch.nn.Dropout(dropout)

        # GRU that takes concatenated context vector and embedded input
        self.gru = torch.nn.GRU(
            embedding_size + encoder_decoder_hidden_size,  # Input size is embedding + context vector
            encoder_decoder_hidden_size,
            n_layers,
            batch_first=True,
            dropout=dropout,
            bidirectional=bidirectional
        )

        # Attention mechanism
        self.attention = Attention(
            encoder_decoder_hidden_size=encoder_decoder_hidden_size,
            attention_hidden_size=attention_hidden_size,
            n_layers_hidden=n_layers,
            bidirectional=bidirectional
        )

        # Output layer to predict next token
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(encoder_decoder_hidden_size, readout_hidden_size),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(readout_hidden_size, output_size)
        )

        # Softmax for converting outputs to probabilities
        self.softmax = torch.nn.Softmax(dim=2)

        # Initialize embeddings
        torch.nn.init.xavier_uniform_(self.embedding.weight)

    def forward(self, input_tokens, hidden, encoder_outputs):
        """
        Forward pass of the decoder

        Parameters
        ----------
        input_tokens: Tensor of shape <batch_size x 1>
            The input sequence
        hidden: Tensor of shape <num_layers x batch_size x hidden_size>
            The hidden states of the decoder from previous timestep
        encoder_outputs: Tensor of shape <batch_size x seq_len x hidden_size>
            The hidden states from the encoder

        Returns
        -------
        output: Tensor of shape <batch_size x 1 x output_size>
            Probability distribution over next token
        hidden: Tensor of shape <num_layers x batch_size x hidden_size>
            Updated decoder hidden states
        """

        # Get context vector using attention
        context_vector = self.attention(encoder_outputs, hidden)
        context_vector = self.dropout(context_vector)

        # Embed input tokens and ensure it has batch dimension
        input_vector = self.embedding(input_tokens).unsqueeze(1) if input_tokens.dim() == 1 else self.embedding(input_tokens)

        # Concatenate context vector and embedded input
        z = torch.cat([context_vector, input_vector], dim=2)

        # Pass through GRU
        output, hidden = self.gru(z, hidden)

        # Generate output probabilities
        output = self.fc(output)
        output = self.softmax(output)

        return output, hidden

## Seq2Seq 

Now, let's put them together to build a seq2seq model. 

In [144]:
from pytorch_lightning import LightningModule, Trainer

class Seq2Seq(LightningModule):

    def __init__(self, encoder, decoder, sos_token_id, eos_token_id, vocab_size, teacher_forcing_ratio=0.5):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.sos_token_id = sos_token_id
        self.eos_token_id = eos_token_id
        self.loss_fn = torch.nn.CrossEntropyLoss()
        self.teacher_forcing_ratio = teacher_forcing_ratio
        self.vocab_size = vocab_size

    def forward(self, input_tokens, max_output_len, temperature=1.0):
        """
        Forward pass of the seq2seq model

        Parameters
        ----------
        input_tokens: Tensor of shape <batch_size x max_length>
            The input sequence
        max_output_len: int
            The maximum length of the output sequence
        """
        batch_size = input_tokens.size(0)
        vocab_size = self.vocab_size

        # Get encoder outputs and hidden states
        encoder_outputs, encoder_hiddens = self.encoder(input_tokens)
        decoder_hidden = encoder_hiddens

        # First input to decoder is SOS token
        decoder_input = torch.ones((batch_size, 1), dtype=torch.long, device=self.device) * self.sos_token_id

        # Initialize outputs tensor to store decoder outputs
        outputs = torch.zeros(batch_size, max_output_len, device=self.device)

        # Generate sequence
        for t in range(max_output_len):
            output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # argmax returns <batch_size x 1>; drop the last dimension to fit outputs[:, t]
            outputs[:, t] = torch.argmax(output, dim=2).squeeze(1)

            # Sample next token using temperature
            if temperature > 0:
                # The decoder returns probabilities, so rescale their logarithm
                log_probs = torch.log(output.squeeze(1) + 1e-9)
                probs = torch.nn.functional.softmax(log_probs / temperature, dim=-1)
                decoder_input = torch.multinomial(probs, 1)
            else:
                decoder_input = torch.argmax(output, dim=2)

            # Stop if all sequences in batch hit EOS
            if (decoder_input == self.eos_token_id).all():
                break
        return outputs


    def training_step(self, batch, batch_idx):
        src, trg = batch

        # Teacher forcing: use the actual target tokens as input to decoder
        batch_size = src.size(0)
        target_length = trg.size(1)
        vocab_size = self.decoder.fc[-1].out_features

        # Initialize outputs tensor to store decoder outputs
        outputs = torch.zeros(batch_size, target_length, vocab_size)

        # Get initial decoder hidden state from encoder
        encoder_outputs, encoder_hiddens = self.encoder(src)
        decoder_hidden = encoder_hiddens

        # First input to decoder is SOS token
        decoder_input = torch.ones((batch_size, 1), dtype=torch.long, device=self.device) * self.sos_token_id

        # Teacher forcing - feeding the target as the next input
        loss = 0
        for t in range(target_length):
            output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            if np.random.rand() < self.teacher_forcing_ratio:
                decoder_input = trg[:, t].unsqueeze(1)  # Use target token as next input
            else:
                decoder_input = torch.argmax(output, dim=2)

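            # The decoder returns probabilities, while CrossEntropyLoss expects logits;
            # log-probabilities are valid logits because log_softmax(log p) = log p
            loss += self.loss_fn(torch.log(output.squeeze(1) + 1e-9), trg[:, t])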

        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):

        with torch.no_grad():
            src, trg = batch

            # Teacher forcing: use the actual target tokens as input to decoder
            batch_size = src.size(0)
            target_length = trg.size(1)
            vocab_size = self.vocab_size

            # Initialize outputs tensor to store decoder outputs
            outputs = torch.zeros(batch_size, target_length, vocab_size)

            # Get initial decoder hidden state from encoder
            encoder_outputs, encoder_hiddens = self.encoder(src)
            decoder_hidden = encoder_hiddens  # Add layer dimension

            # First input to decoder is SOS token
            decoder_input = torch.ones((batch_size, 1), dtype=torch.long, device=self.device) * self.sos_token_id
            # Teacher forcing - feeding the target as the next input
            loss = 0
            for t in range(target_length):
                output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
                if np.random.rand() < self.teacher_forcing_ratio:
                    decoder_input = trg[:, t].unsqueeze(1)  # Use target token as next input
                else:
                    decoder_input = torch.argmax(output, dim=2)

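                # The decoder returns probabilities, while CrossEntropyLoss expects logits;
                # log-probabilities are valid logits because log_softmax(log p) = log p
                loss += self.loss_fn(torch.log(output.squeeze(1) + 1e-9), trg[:, t])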

            self.log("val_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
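
`Decoder.forward` ends with a softmax, while `torch.nn.CrossEntropyLoss` applies a log-softmax to whatever it receives. The following check is separate from the model above and uses made-up numbers; it compares the loss computed from logits, from log-probabilities, and from raw probabilities.

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]])  # made-up scores for 3 classes
target = torch.tensor([0])
probs = torch.softmax(logits, dim=-1)

ce = torch.nn.CrossEntropyLoss()

loss_from_logits = ce(logits, target)
# log_softmax(log p) equals log p when p already sums to one
loss_from_log_probs = ce(torch.log(probs), target)
# Raw probabilities get softmaxed a second time, which flattens them
loss_from_probs = ce(probs, target)

print(loss_from_logits.item(), loss_from_log_probs.item(), loss_from_probs.item())
```

The first two values agree and the third differs, which suggests taking the logarithm of the decoder output (or returning logits from the decoder) before it reaches the loss.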

In [188]:
from secretpy import Caesar, CaesarProgressive, alphabets as al
import random
import string


def generate_random_sequences(cipher_key, cipher, n_seqs, seq_len):

    sents = []
    ciphered_sents = []
    for _ in range(n_seqs):
        sequence = "".join(random.choices(string.ascii_lowercase, k=seq_len))
        ciphered_sequence = cipher.encrypt(sequence, cipher_key, al.ENGLISH)
        # assert len(sequence) == len(ciphered_sequence)
        assert sequence == cipher.decrypt(ciphered_sequence, cipher_key)
        sents.append(sequence)
        ciphered_sents.append(ciphered_sequence)
    return ciphered_sents, sents


key = 3
cipher = CaesarProgressive()
ciphered_sents, sents = generate_random_sequences(
    cipher_key=key, cipher=cipher, n_seqs=10000, seq_len=10
)

print("Original:", sents[:3])
print("Ciphered:", ciphered_sents[:3])

Original: ['qnfsxpoiiz', 'mhyicduhcj', 'jemdbbsrxc']
Ciphered: ['trkyexxstl', 'pldojldrnv', 'mirjijbbio']
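
The printed pairs are consistent with a shift that starts at `key` and grows by one at each position (for example, `q` to `t` is +3 and `n` to `r` is +4). The sketch below reproduces the first pair in plain Python, assuming a lowercase English alphabet. `progressive_shift` is a hypothetical helper written for illustration; it is not part of `secretpy`, whose implementation may differ in details.

```python
import string

ALPHABET = string.ascii_lowercase

def progressive_shift(text, key, decrypt=False):
    """Shift the i-th letter by (key + i) positions, wrapping around the alphabet."""
    sign = -1 if decrypt else 1
    shifted = []
    for i, letter in enumerate(text):
        pos = ALPHABET.index(letter)
        shifted.append(ALPHABET[(pos + sign * (key + i)) % len(ALPHABET)])
    return "".join(shifted)

ciphered = progressive_shift("qnfsxpoiiz", key=3)
print(ciphered)  # trkyexxstl
print(progressive_shift(ciphered, key=3, decrypt=True))  # qnfsxpoiiz
```

Because the shift depends on the position, the model cannot learn a single letter-to-letter mapping; it has to keep track of where it is in the sequence.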


In [189]:
from collections import defaultdict


def build_tokenizer(sents):
    vocab = sorted(list(set("".join(sents))))

    vocab.append("<sos>")  # <sos> token
    vocab.append("<eos>")  # <eos> token
    vocab.append("<unk>")  # <unk> token used to represent the unknown token

    vocab_stoi = {token: i for i, token in enumerate(vocab)}
    vocab_itos = {i: token for i, token in enumerate(vocab)}

    sos_token_id = vocab_stoi["<sos>"]
    eos_token_id = vocab_stoi["<eos>"]
    unk_token_id = vocab_stoi["<unk>"]

    # If the token is not in the vocabulary, then return the unk_token_id
    # vocab_stoi = defaultdict(lambda: unk_token_id, vocab_stoi)
    # vocab_itos = defaultdict(lambda: unk_token_id, vocab_itos)

    return {
        "stoi": vocab_stoi,
        "itos": vocab_itos,
        "sos_token_id": sos_token_id,
        "eos_token_id": eos_token_id,
        "unk_token_id": unk_token_id,
    }


def tokenize(sents, vocab):
    retval = []
    for sent in sents:
        _retval = [vocab["sos_token_id"]]
        for letter in sent:
            _retval.append(vocab["stoi"][letter])
        _retval.append(vocab["eos_token_id"])
        retval.append(_retval)

    return retval


src_vocab = build_tokenizer(ciphered_sents)
trg_vocab = build_tokenizer(sents)

src_tokenized = tokenize(ciphered_sents, src_vocab)
trg_tokenized = tokenize(sents, trg_vocab)

print(src_tokenized[1])
print(trg_tokenized[3], trg_tokenized[1])

[26, 15, 11, 3, 14, 9, 11, 3, 17, 13, 21, 27]
[26, 7, 3, 5, 9, 15, 2, 25, 8, 4, 23, 27] [26, 12, 7, 24, 8, 2, 3, 20, 7, 2, 9, 27]


In [190]:
# Data pipeline
batch_size = 100
dataset = torch.utils.data.TensorDataset(
    torch.tensor(src_tokenized, dtype=torch.long),
    torch.tensor(trg_tokenized, dtype=torch.long),
)

train_dataset, val_dataset = torch.utils.data.random_split(dataset, [0.8, 0.2])

train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, drop_last=True
)

val_dataloader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, shuffle=True, drop_last=True
)

In [191]:

n_src_vocab = len(src_vocab["stoi"]) + 3
n_trg_vocab = len(trg_vocab["stoi"]) + 3

n_layers = 2 # number of layers in the GRU
bidirectional = False # whether to use bidirectional GRU
embedding_size = 32 # embedding size
hidden_size = 16 # hidden size
dropout = 0.1 # dropout rate
readout_hidden_size = 16 # hidden size of the readout layer
attention_hidden_size = 16 # hidden size of the attention layer

encoder = Encoder(
    input_size=n_src_vocab,
    embedding_size=embedding_size,
    hidden_size=hidden_size,
    n_layers=n_layers,
    dropout=dropout,
    bidirectional=bidirectional
)
decoder = Decoder(
    input_size=n_trg_vocab,
    embedding_size=embedding_size,
    readout_hidden_size=readout_hidden_size,
    n_layers=n_layers,
    output_size=n_trg_vocab,
    dropout=dropout,
    encoder_decoder_hidden_size=hidden_size,
    attention_hidden_size=attention_hidden_size,
    bidirectional=bidirectional
)
sos_token_id = trg_vocab["sos_token_id"]
eos_token_id = trg_vocab["eos_token_id"]
model = Seq2Seq(encoder, decoder, sos_token_id, eos_token_id, vocab_size=n_trg_vocab)

# Training 

Let us implement a trainer for the seq2seq model. 

In [192]:
%load_ext tensorboard
%tensorboard --logdir logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6007 (pid 60521), started 0:18:14 ago. (Use '!kill 60521' to kill it.)

In [193]:
from pytorch_lightning.loggers import TensorBoardLogger
logger = TensorBoardLogger("logs", name="seq2seq")

trainer = Trainer(max_epochs=100, logger=logger)
trainer.fit(model, train_dataloader, val_dataloader)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name    | Type             | Params | Mode 
-----------------------------------------------------
0 | encoder | Encoder          | 5.1 K  | train
1 | decoder | Decoder          | 7.5 K  | train
2 | loss_fn | CrossEntropyLoss | 0      | train
-----------------------------------------------------
12.5 K    Trainable params
0         Non-trainable params
12.5 K    Total params
0.050     Total estimated model params size (MB)
21        Modules in train mode
0         Modules in eval mode


/Users/skojaku-admin/miniforge3/envs/advnetsci/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:476: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/Users/skojaku-admin/miniforge3/envs/advnetsci/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.
/Users/skojaku-admin/miniforge3/envs/advnetsci/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.



# Validation

Let's validate the seq2seq model with a progressive variant of the [Caesar cipher](https://en.wikipedia.org/wiki/Caesar_cipher), in which the shift grows by one at each position. 
We generated ciphered texts to train the seq2seq model; now let's see whether the trained model deciphers them correctly.  

Can you decipher?

In [187]:
model.eval()

# text = "iamastudent"
for i in range(10):
    text = sents[i]
    ciphered_text = cipher.encrypt(text, key)
    ciphered_text_tokenized = torch.tensor(tokenize([ciphered_text], src_vocab))
    seqs = model(
        ciphered_text_tokenized,
        max_output_len=12,
        temperature=1e-3
    )
    seqs = seqs.detach().cpu().numpy().astype(int)
    # Map the generated token ids back to characters
    # (a separate loop variable avoids shadowing the outer index `i`)
    deciphered_text = "".join(trg_vocab["itos"][int(token_id)] for token_id in seqs[0])
    print("Original-Deciphered:", text, " <--> ", deciphered_text)

[[26  0  0  0  8  8  8 17 17 17 27 27]]
Original-Deciphered: esatvlyspx  <-->  <sos>aaaiiirrr<eos><eos>
[[26  0  0  0  8  8  8 17 17 17 27 27]]
Original-Deciphered: fnvjgkfwyi  <-->  <sos>aaaiiirrr<eos><eos>
[[26  0  0  0  8  8 18 17 17  3 27 27]]
Original-Deciphered: kpjyemthcy  <-->  <sos>aaaiisrrd<eos><eos>
[[26  0  0  0  0  0  0  0  0  0  0  0]]
Original-Deciphered: jkzzjqomre  <-->  <sos>aaaaaaaaaaa
[[26  0  0  0  0  0  0  0  0  0  0  0]]
Original-Deciphered: lgrbctcdfs  <-->  <sos>aaaaaaaaaaa
[[26  0  0  0  8  8  8 17 17 17 27 27]]
Original-Deciphered: symvyfciip  <-->  <sos>aaaiiirrr<eos><eos>
[[26  0  0  0  4  8  8 17 17 17 27 27]]
Original-Deciphered: wniaporagv  <-->  <sos>aaaeiirrr<eos><eos>
[[26  0  0  0  0  0  0  0  0  0  0  0]]
Original-Deciphered: tfjwbqperh  <-->  <sos>aaaaaaaaaaa
[[26  0  0  0  0  0  0  0  0  0  0  0]]
Original-Deciphered: fwlzgzzaky  <-->  <sos>aaaaaaaaaaa
[[26  0  0  0  8  8  8 17 17 17 27 27]]
Original-Deciphered: wlvidnmaha  <-->  <sos>aaaiiirrr<eo

In [131]:
def forward(self, input_tokens, max_output_len, temperature=1.0):
    """
    Forward pass of the seq2seq model

    Parameters
    ----------
    input_tokens: Tensor of shape <batch_size x max_length>
        The input sequence
    max_output_len: int
        The maximum length of the output sequence
    """
    batch_size = input_tokens.size(0)
    vocab_size = self.vocab_size

    # Get encoder outputs and hidden states
    print(input_tokens.shape)
    encoder_outputs, encoder_hiddens = self.encoder(input_tokens)
    decoder_hidden = encoder_hiddens

    # First input to decoder is SOS token
    decoder_input = torch.ones((batch_size, 1), dtype=torch.long, device=self.device) * self.sos_token_id

    # Initialize outputs tensor to store decoder outputs
    outputs = torch.zeros(batch_size, max_output_len, device=self.device)

    # Generate sequence
    for t in range(max_output_len):
        output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
        outputs[:, t] = torch.argmax(output, dim=2)

        # Sample next token using temperature
        if temperature > 0:
            probs = torch.nn.functional.softmax(output.squeeze(1) / temperature, dim=-1)
            decoder_input = torch.multinomial(probs, 1)
        else:
            decoder_input = torch.argmax(output, dim=2)

        # Stop if all sequences in batch hit EOS
        if (decoder_input == self.eos_token_id).all():
            break

    return outputs
forward(model, ciphered_text_tokenized, max_output_len=12, temperature=0)

torch.Size([1, 12])


tensor([[26., 19., 19., 19., 19., 19., 19., 19., 19., 19., 27.,  0.]])