# Encoder-decoder Transformer

Copyright 2024, Denis Rothman

Generated by OpenAI GPT-4 through advanced prompt engineering



**September 4, 2024 update** Previously, Google Colab did not require installing `torchtext`. Now it is a prerequisite. However, there is a conflict between the `torch` and `torchtext` versions in Google Colab.

Until the versions of torchtext and torch align on Google Colab, it is recommended to install the necessary packages locally in an environment to run this program.

It is also possible to read the program without running it to understand the concept: AI agents can now write complex code including transformer architectures.

# Library installation

In [None]:
!pip install beautifulsoup4 requests nltk



The code is a command-line instruction typically used in a Jupyter Notebook or other similar environments to install three Python libraries:

1. `beautifulsoup4`: BeautifulSoup is a library used to scrape data from HTML and XML files. It creates a parse tree that can be used to extract data easily.

2. `requests`: This is a popular library for sending HTTP requests in Python. It abstracts the complexities of making requests behind a simple API, allowing you to send HTTP/1.1 requests.

3. `nltk`: NLTK (Natural Language Toolkit) is a library for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more.

The `!` at the beginning of the line tells the Jupyter Notebook to execute the command in the system shell, allowing you to run terminal commands from within the notebook.

So, when this command is executed, the system's package manager for Python (pip) will install these three libraries, making them available for use in your Python code within that environment.
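
As a small aside, one way to check from Python whether these packages are already available before running `pip` is sketched below. The helper name `is_installed` is hypothetical. Note that `beautifulsoup4` is imported under the module name `bs4`, as the later import cell shows.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None

# pip package name -> import name (beautifulsoup4 is imported as bs4)
packages = {"beautifulsoup4": "bs4", "requests": "requests", "nltk": "nltk"}

for pip_name, import_name in packages.items():
    status = "installed" if is_installed(import_name) else "missing"
    print(f"{pip_name}: {status}")
```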

In [None]:
from collections import Counter
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Here's what each part of the code does:

1. `from collections import Counter`: This line imports the `Counter` class from the `collections` module. `Counter` is a container that keeps track of how many times equivalent values are added. It can be used to count the occurrences of items in a list, for example.

2. `import nltk`: This line imports the Natural Language Toolkit (NLTK) module, a widely-used library for working with human language data.

3. `from nltk.tokenize import word_tokenize`: This imports the `word_tokenize` function from the `nltk.tokenize` module. This function is used to split a text into a list of individual words, commonly referred to as tokenization in natural language processing.

4. `nltk.download('punkt')`: This line downloads the Punkt tokenizer models. NLTK uses these models for the `word_tokenize` function to work properly. The Punkt tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. If the Punkt tokenizer models are not already downloaded, you'll need to do so to tokenize text using NLTK's `word_tokenize` method.

In summary, this code sets up the environment for natural language processing tasks by importing necessary libraries and downloading required resources. You would typically see this code at the beginning of a script or notebook where text analysis or other natural language processing is being performed.
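
To illustrate what `Counter` does with a list of tokens, here is a minimal sketch. It splits on whitespace with `str.split` as a stand-in for `word_tokenize`, so that it runs without downloading any NLTK data. The sample sentence is made up.

```python
from collections import Counter

sample = "the war ended and the treaty ended the conflict"
tokens = sample.split()          # stand-in for nltk's word_tokenize

counts = Counter(tokens)
print(counts["the"])             # number of times "the" occurs
print(counts.most_common(2))     # the two most frequent tokens with their counts
```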

In [None]:
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from collections import Counter
from torchtext.vocab import Vocab
from torch.utils.data import Dataset
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import time
import torch.nn.functional as F

This code snippet includes import statements for various libraries that would typically be used for web scraping and natural language processing (NLP), possibly in combination with deep learning. Here's a breakdown:

- `import requests`: Imports the Requests library, used for making HTTP requests in Python.
- `from bs4 import BeautifulSoup`: Imports BeautifulSoup from bs4, used for parsing and extracting data from HTML/XML documents.
- `from nltk.tokenize import word_tokenize`: Imports the word_tokenize function from NLTK, used for splitting text into words.
- `from collections import Counter`: Imports the Counter class from collections, used for counting hashable objects.
- `from torchtext.vocab import Vocab`: Imports the Vocab class from torchtext, used for mapping words to indices.
- `from torch.utils.data import Dataset`: Imports the Dataset class from torch.utils.data, a base class for creating custom datasets in PyTorch.
- `import torch`: Imports the main PyTorch library, used for building and training neural networks.
- `import torch.nn as nn`: Imports the neural network module from PyTorch, providing classes for building networks.
- `from torch.utils.data import DataLoader`: Imports DataLoader from torch.utils.data, used for loading datasets in iterable mini-batches.
- `import time`: Imports the time module, used for time-related tasks, like measuring durations.
- `import torch.nn.functional as F`: Imports the functional module from torch.nn, providing functions like activation and loss functions.

# Training

In [None]:
def create_vocab(text, vocab_size):
    tokenized_text = nltk.word_tokenize(text)
    word_freq = Counter(tokenized_text)
    vocab = {word: i for i, (word, _) in enumerate(word_freq.most_common(vocab_size))}
    return vocab

def scrape_wikipedia(urls):
    text = ""
    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        paragraphs = soup.find_all('p')
        for paragraph in paragraphs:
            text += paragraph.get_text()
    return text

def create_dataset(vocab_size, input_seq_length, text):
    dataset = []
    tokens = word_tokenize(text)
    vocab = {word: i for i, word in enumerate(set(tokens))}
    for i in range(0, len(tokens) - input_seq_length):
        input_sequence = [vocab[word] for word in tokens[i: i + input_seq_length]]
        target = input_sequence[1:] + [vocab_size - 2]
        dataset.append((torch.tensor(input_sequence), torch.tensor(target)))
    return dataset, vocab, tokens  # returning dataset and vocabulary

class TextDataset(Dataset):
    def __init__(self, dataset):
        self.data = dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, h, d_ff, dropout_rate):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.self_attention = nn.MultiheadAttention(d_model, h)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, input):
        x = self.embedding(input)
        x = self.dropout(x)
        attn_output, _ = self.self_attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

class Decoder(nn.Module):
    def __init__(self, vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.self_attention = nn.MultiheadAttention(d_model, h)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout_rate)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, input, encoder_output, lookahead_mask, padding_mask, training):
        x = self.embedding(input)
        x = self.dropout(x)
        attn_output1, _ = self.self_attention(x, x, x, attn_mask=lookahead_mask, key_padding_mask=padding_mask)
        x = self.norm1(x + self.dropout(attn_output1))
        if encoder_output is not None:
            attn_output2, _ = self.self_attention(x, encoder_output, encoder_output)
            x = self.norm2(x + self.dropout(attn_output2))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        x = self.out(x)
        return x

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def main():
    urls = [
    'https://en.wikipedia.org/wiki/American_Revolution',
    'https://en.wikipedia.org/wiki/American_Civil_War',
    'https://en.wikipedia.org/wiki/World_War_I',
    'https://en.wikipedia.org/wiki/World_War_II',
    'https://en.wikipedia.org/wiki/Renaissance',
    'https://en.wikipedia.org/wiki/Industrial_Revolution',
    'https://en.wikipedia.org/wiki/French_Revolution',
    'https://en.wikipedia.org/wiki/Ancient_Greece',
    'https://en.wikipedia.org/wiki/Roman_Empire',
    'https://en.wikipedia.org/wiki/Enlightenment'
    ]
    vocab_size = 30000
    input_seq_length = 512
    h = 8
    d_k = 64
    d_v = 64
    d_model = 512
    d_ff = 2048
    dropout_rate = 0.1
    epochs = 20
    batch_size = 32
    loss_threshold=4
    showlogits=1
    text = scrape_wikipedia(urls)
    raw_dataset, vocab,tokens = create_dataset(vocab_size, input_seq_length, text)
    total_words=len(vocab)
    total_tokens=len(tokens)
    print(f'Total vocab scraped: {total_words:,}')
    print(f'Total tokens scraped: {total_tokens:,}')
    torch.save(raw_dataset, "raw_dataset.pt")
    dataset = TextDataset(raw_dataset)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    num_batches = len(data_loader)
    print(f'Number of batches: {num_batches}') # input and target sentences

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    encoder = Encoder(vocab_size, d_model, h, d_ff, dropout_rate).to(device)
    decoder = Decoder(vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate).to(device)
    optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=0.001)

    num_parameters_encoder = count_parameters(encoder)
    num_parameters_decoder = count_parameters(decoder)

    print(f'The encoder has {num_parameters_encoder:,} trainable parameters.')
    print(f'The decoder has {num_parameters_decoder:,} trainable parameters.')
    total_parameters = num_parameters_encoder + num_parameters_decoder
    print(f'The total model has {total_parameters:,} trainable parameters.')

    # Start time
    start_time = time.time()
    with open("loss.txt", "w") as f:
      for epoch in range(epochs):
        for i, (inputs, targets) in enumerate(data_loader):
            inputs = inputs.to(device).long()  # Move inputs to device
            targets = targets.to(device).long()  # Move targets to device
            encoder_output = encoder(inputs)
            output = decoder(inputs, encoder_output, None, None, training=True)
            output = output.view(-1, output.size(-1))
            targets = targets.view(-1)
            loss = F.cross_entropy(output, targets)
            # Print the batch number and loss every 10 batches
            if i % 10 == 0:
              print(f"Epoch: {epoch},Batch: {i+1}, Loss: {loss.item()}")
            # Write the loss to the file
            f.write(f"Epoch: {epoch}, Batch: {i}, Loss: {loss.item()}\n")
            if loss<loss_threshold :
              # Printing the raw logits
              print('Raw logits:', output)
              break
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print("Epoch: {}, Loss: {}".format(epoch, loss.item()))
    # End time
    end_time = time.time()
    print("Time taken for training: {} seconds".format(end_time - start_time))
    return encoder, decoder
encoder, decoder = main()

torch.save({
    "encoder": encoder.state_dict(),
    "decoder": decoder.state_dict()
}, "model.pt")

Total vocab scraped: 15,935
Total tokens scraped: 152,492
Number of batches: 4750
cuda
The encoder has 18,512,384 trainable parameters.
The decoder has 33,903,408 trainable parameters.
The total model has 52,415,792 trainable parameters.
Epoch: 0,Batch: 1, Loss: 10.477557182312012
Epoch: 0,Batch: 11, Loss: 6.915306091308594
Epoch: 0,Batch: 21, Loss: 6.647111892700195
Epoch: 0,Batch: 31, Loss: 6.263278484344482
Epoch: 0,Batch: 41, Loss: 6.027099609375
Epoch: 0,Batch: 51, Loss: 5.8128132820129395
Epoch: 0,Batch: 61, Loss: 5.63546895980835
Epoch: 0,Batch: 71, Loss: 5.396797180175781
Epoch: 0,Batch: 81, Loss: 5.204771995544434
Epoch: 0,Batch: 91, Loss: 5.071465969085693
Epoch: 0,Batch: 101, Loss: 4.7525315284729
Epoch: 0,Batch: 111, Loss: 4.786503791809082
Epoch: 0,Batch: 121, Loss: 4.616722106933594
Epoch: 0,Batch: 131, Loss: 4.552717685699463
Epoch: 0,Batch: 141, Loss: 4.385110378265381
Epoch: 0,Batch: 151, Loss: 4.366326808929443
Epoch: 0,Batch: 161, Loss: 4.3331074714660645
Epoch: 0,Ba

This code is part of a script that defines functions for scraping text from Wikipedia URLs, creating a dataset, building an Encoder-Decoder architecture using PyTorch, and training the model on the dataset. Here's an overview of each part:

### Function: `create_vocab`

- **Input**: Text and a vocabulary size
- **Output**: A vocabulary that maps the `vocab_size` most common words in the text to unique integers
- **Process**:
  1. Tokenizes the input text.
  2. Counts the frequency of each word.
  3. Creates a vocabulary dictionary from the most common words.
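
The steps above can be sketched on a toy token list. The function name `build_vocab` and the sample tokens are illustrative only; the mapping logic mirrors the dictionary comprehension in `create_vocab`.

```python
from collections import Counter

def build_vocab(tokens, vocab_size):
    """Map the vocab_size most frequent tokens to ids 0..vocab_size-1."""
    word_freq = Counter(tokens)
    return {word: i for i, (word, _) in enumerate(word_freq.most_common(vocab_size))}

tokens = "a b a c a b d".split()
vocab = build_vocab(tokens, vocab_size=2)
print(vocab)  # only the two most frequent tokens are kept
```

Note that `main()` never calls `create_vocab`; `create_dataset` builds its own vocabulary from the set of unique tokens.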

### Function: `scrape_wikipedia`

- **Input**: List of URLs
- **Output**: Concatenated text from all the provided URLs
- **Process**:
  1. Loops through URLs.
  2. Sends a GET request to each URL.
  3. Parses the HTML content using BeautifulSoup.
  4. Concatenates the text from all the `<p>` elements.
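
The paragraph-extraction step can be tried offline. The sketch below swaps BeautifulSoup for the standard library's `html.parser` and uses a made-up HTML string instead of a live Wikipedia request, so it is an illustration of the idea rather than the notebook's implementation. The class name `ParagraphText` is hypothetical.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside a <p> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

html = "<html><body><h1>Title</h1><p>First <b>bold</b> text.</p><p>Second.</p></body></html>"
parser = ParagraphText()
parser.feed(html)
text = "".join(parser.chunks)   # paragraphs are joined without a separator
print(text)
```

Like the notebook's `text += paragraph.get_text()`, the sketch joins paragraphs without inserting a separator, and text outside `<p>` elements (here the heading) is dropped.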

### Function: `create_dataset`

- **Input**: Vocabulary size, input sequence length, and text
- **Output**: Dataset, vocabulary, and tokens
- **Process**:
  1. Tokenizes the text.
  2. Creates a vocabulary from the unique tokens.
  3. Creates input and target sequences using the vocabulary.
  4. Returns the dataset, vocabulary, and tokens.
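
The sliding-window construction can be seen on a handful of ids. This is a minimal sketch with plain lists instead of tensors; `make_pairs`, the toy ids, and `end_id=99` are illustrative stand-ins (the notebook uses `vocab_size - 2` as the appended id).

```python
def make_pairs(token_ids, seq_length, end_id):
    """Build (input, target) pairs where target is input shifted left by one."""
    pairs = []
    for i in range(0, len(token_ids) - seq_length):
        input_sequence = token_ids[i: i + seq_length]
        target = input_sequence[1:] + [end_id]   # last slot gets a fixed id
        pairs.append((input_sequence, target))
    return pairs

token_ids = [10, 11, 12, 13, 14, 15]
pairs = make_pairs(token_ids, seq_length=4, end_id=99)
for inp, tgt in pairs:
    print(inp, "->", tgt)
```

Six tokens with a window of four yield `6 - 4 = 2` pairs. The last target position holds the fixed id rather than the token that actually follows the window in the text.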

### Classes: `TextDataset`, `Encoder`, and `Decoder`

- **TextDataset**: PyTorch dataset class that wraps the dataset.
- **Encoder**: Defines an Encoder module with an embedding layer, multi-head self-attention, feed-forward neural network, layer normalization, and dropout.
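- **Decoder**: Defines a Decoder module similar to the Encoder, with a third layer normalization and a final linear layer that projects to the vocabulary. It attends first over its own input and then over the encoder output; both steps reuse the same `self_attention` module, so the checkpoint holds a single set of attention weights for the decoder.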

### Function: `count_parameters`

- **Input**: A PyTorch model
- **Output**: Total number of trainable parameters in the model
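
The parameter counts printed during training can be reproduced by hand from the layer shapes. The sketch below assumes the hyperparameters set in `main()` and the layer list visible in the class definitions; the helper `linear` is introduced here for illustration only.

```python
vocab_size, d_model, d_ff = 30000, 512, 2048

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding = vocab_size * d_model
# nn.MultiheadAttention: fused q/k/v input projection plus output projection
attention = linear(d_model, 3 * d_model) + linear(d_model, d_model)
feed_forward = linear(d_model, d_ff) + linear(d_ff, d_model)
layer_norm = 2 * d_model  # one scale vector and one shift vector

encoder_params = embedding + attention + feed_forward + 2 * layer_norm
# The decoder adds a third LayerNorm and the output projection to the vocabulary
decoder_params = (embedding + attention + feed_forward + 3 * layer_norm
                  + linear(d_model, vocab_size))

print(f"{encoder_params:,}")
print(f"{decoder_params:,}")
print(f"{encoder_params + decoder_params:,}")
```

These values match the 18,512,384, 33,903,408, and 52,415,792 trainable parameters reported in the training output.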

### Function: `main`

- **Process**:
  1. Defines hyperparameters and URLs to scrape.
  2. Scrapes text from Wikipedia using `scrape_wikipedia`.
  3. Creates a dataset using `create_dataset`.
  4. Initializes the Encoder, Decoder, and Adam optimizer.
  5. Loops through epochs and batches, training the model using the created dataset.
  6. Prints the loss every 10 batches, writes every batch loss to `loss.txt`, and leaves the current epoch early once the loss drops below `loss_threshold`.
  7. Returns the trained Encoder and Decoder.
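
As a rough sanity check on the logged losses, a model that spreads probability evenly over all 30,000 output classes has a cross-entropy of ln(30000), which is close to the first logged loss of about 10.48. The sketch below computes that baseline in plain Python; the function `cross_entropy` is a hand-written illustration, not PyTorch's `F.cross_entropy`.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one prediction: -log softmax(logits)[target_index]."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

vocab_size = 30000
uniform_logits = [0.0] * vocab_size      # every class equally likely
baseline = cross_entropy(uniform_logits, target_index=0)
print(round(baseline, 3))                # equals ln(30000)
```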

### Saving and Executing the Model

- The trained Encoder and Decoder models are saved to a file `model.pt`.
- The main training function is executed by calling `main()`.

Overall, this code builds and trains an Encoder-Decoder model on scraped text data. This architecture is often used for sequence-to-sequence problems like machine translation or text summarization.

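Total sequence pairs = (165,050 - 512)
Number of batches = Total sequence pairs / 32
                  ≈ 5142

The figures above use a token count of 165,050. The run whose output is shown earlier scraped 152,492 tokens, for which the same calculation gives (152,492 - 512) / 32 ≈ 4,749.4, which rounds up to 4,750 batches and matches the printed `Number of batches: 4750`.

`create_dataset` constructs sequences from the tokens, and the `DataLoader` then groups these sequences into batches. Here's a breakdown of what's happening: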

- You have a total of \( 165,050 \) tokens.
- You're using sequences of \( 512 \) tokens to create input sequences.
- For each input sequence, you have an associated target sequence, which is often a shifted version of the input sequence in text generation tasks.
- Since you are using sequences of length \( 512 \), the last \( 512 \) tokens of your text won't have enough subsequent tokens to form a full sequence, so you subtract this value.
- The total number of input-target pairs is \( 165,050 - 512 \), and these pairs are then divided into batches.

The number of batches is then calculated by dividing the total number of input-target pairs by the batch size and rounding up, because the `DataLoader` keeps the final, smaller batch. The calculation only counts the input sequences, but it implicitly counts the corresponding target sequences since there is a one-to-one correspondence between input sequences and target sequences.

In summary, the subtraction of \( 512 \) accounts for the fact that you're creating sequences of that length, and it ensures that you don't attempt to create a sequence that would extend beyond the end of your tokens. This subtraction doesn't specifically relate to the distinction between input and target sequences; instead, it relates to the sequence length you are using to construct both input and target sequences.
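
The same arithmetic can be checked for both token counts that appear in this notebook (165,050 in the explanation and 152,492 in the printed training output). The function name `num_batches` is illustrative; the rounding up reflects the `DataLoader` default of keeping the last partial batch.

```python
import math

def num_batches(total_tokens, seq_length, batch_size):
    """Batches produced by a DataLoader that keeps the last partial batch."""
    num_pairs = total_tokens - seq_length
    return math.ceil(num_pairs / batch_size)

print(num_batches(165_050, 512, 32))
print(num_batches(152_492, 512, 32))
```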

# Dataset


In [None]:
raw_dataset = torch.load("raw_dataset.pt")
dataset = TextDataset(raw_dataset)
# Print the first 5 items
for i, (input, target) in enumerate(raw_dataset):
    print(f"Input: {input}")
    print(f"Target: {target}")
    if i >= 4:  # stop after 5 items
        break

Input: tensor([10012,  4531,  2452, 13624, 15660,  7978, 11718,  7916,  2277,  3490,
         5208,  2424, 12265, 12115,  2424,  4531,  4707, 11032,  3946,  4774,
         2918,  2037,  4774,  2424,  7036, 12910, 12115,  2424,  9363,  8534,
        15089,  2130,    14,  2424, 10693,  6097,  2424,  4531,  4449,  1455,
        15396, 12256,  2185,  3785, 11586, 13793,  2656, 15396,  6806,  2424,
         4360,  1945, 13199,  4200,  7064,  3785,  2424,  2918,  2290, 11718,
         6361, 11166,  2424,  4889,  7648, 10470,  2424,  4208,  8313,  8104,
         7146, 14979,  5208,  4707, 12265, 12115,  2424,  3031, 12115,  2424,
        11714, 15396, 10981, 15396, 11718, 10582,  1838, 15089,  4531, 10756,
         6008, 13793,  5336, 14255, 14180,  2424,  2918,  6761, 15396, 13735,
        12881,  4774, 12256,   781,  6268, 12998,  1703,  5094, 15089,  9769,
        13793,  2424,  4637, 15396,  2918,  6311, 10278,  3354,  2424, 15828,
        13735,  9286,  9308, 10866, 12115,  5037,  4774, 

This code snippet loads the dataset that `main()` saved to `raw_dataset.pt` during training and prints the first five input-target pairs. Here's a breakdown:

1. `raw_dataset = torch.load("raw_dataset.pt")`: This line loads a saved dataset from a PyTorch file called "raw_dataset.pt". The `raw_dataset` variable is a list of tuples, where each tuple consists of an input tensor and a target tensor.

2. `dataset = TextDataset(raw_dataset)`: This line wraps the loaded `raw_dataset` in the `TextDataset` class defined in the training cell, which makes it usable with PyTorch's DataLoader. However, this `dataset` variable is not used in the code snippet.

3. The loop iterates over the `raw_dataset`, printing the input and target tensors for the first five items:
    - `Input: {input}`: This prints the input tensor for each item: a sequence of `input_seq_length` (512) token ids taken from the scraped text.
    - `Target: {target}`: This prints the target tensor for each item: the input sequence shifted left by one position, with the fixed id `vocab_size - 2` appended as the last element.
    - The loop stops after printing the first five items, as controlled by the `if i >= 4:` condition.
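
The printed tensors contain token ids rather than words. To read them, the vocabulary mapping has to be inverted, as in the sketch below. The toy vocabulary, the ids, and the `<fill>` placeholder are hypothetical. The notebook's `main()` does not save its vocabulary, so in practice it would have to be saved alongside `raw_dataset.pt`.

```python
# Hypothetical toy vocabulary; the notebook's real one has thousands of entries
vocab = {"the": 0, "war": 1, "ended": 2, "in": 3, "1945": 4}
id_to_word = {i: word for word, i in vocab.items()}

def decode(ids):
    """Turn a list of token ids back into text; unknown ids become <fill>."""
    return " ".join(id_to_word.get(i, "<fill>") for i in ids)

input_ids = [0, 1, 2, 3]
target_ids = input_ids[1:] + [99]   # shifted by one, fixed id appended

print(decode(input_ids))
print(decode(target_ids))
```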



# Model

In [None]:
# Load the model checkpoint
checkpoint = torch.load("model.pt")

# Print all keys and values in the checkpoint
print("Model checkpoint:")
for key, value in checkpoint.items():
    print(f"Key: {key}")
    if isinstance(value, dict):
        # If the value is a dictionary (as it is for encoder and decoder state_dicts), print its keys
        print(f"Value keys: {value.keys()}")
    else:
        # Otherwise, print the value itself
        print(f"Value: {value}")

Model checkpoint:
Key: encoder
Value keys: odict_keys(['embedding.weight', 'self_attention.in_proj_weight', 'self_attention.in_proj_bias', 'self_attention.out_proj.weight', 'self_attention.out_proj.bias', 'feed_forward.0.weight', 'feed_forward.0.bias', 'feed_forward.2.weight', 'feed_forward.2.bias', 'norm1.weight', 'norm1.bias', 'norm2.weight', 'norm2.bias'])
Key: decoder
Value keys: odict_keys(['embedding.weight', 'self_attention.in_proj_weight', 'self_attention.in_proj_bias', 'self_attention.out_proj.weight', 'self_attention.out_proj.bias', 'feed_forward.0.weight', 'feed_forward.0.bias', 'feed_forward.2.weight', 'feed_forward.2.bias', 'norm1.weight', 'norm1.bias', 'norm2.weight', 'norm2.bias', 'norm3.weight', 'norm3.bias', 'out.weight', 'out.bias'])


This code snippet is loading a model checkpoint from a file called "model.pt" and printing its content. A model checkpoint is a snapshot of the state of a model, typically containing the model's parameters at a particular point during training. Here's what the code is doing:

1. **`checkpoint = torch.load("model.pt")`:** This line loads the model checkpoint from the file "model.pt". The checkpoint is the dictionary saved after training, with the keys `"encoder"` and `"decoder"`, each holding the `state_dict` of that component.

2. **`print("Model checkpoint:")`:** This prints a string to indicate the beginning of the checkpoint content.

3. **Loop through the checkpoint's items:** The loop iterates through each key-value pair in the checkpoint dictionary.

    - **`print(f"Key: {key}")`:** This prints the key for each item in the checkpoint. Here the keys are `encoder` and `decoder`.

    - **`if isinstance(value, dict):`:** If the value associated with the key is a dictionary, it prints the keys of that dictionary using `print(f"Value keys: {value.keys()}")`. This is the case for both entries here: a `state_dict` is an `OrderedDict`, which is a subclass of `dict`, and its keys are the names of the layers' parameters.

    - **`else: print(f"Value: {value}")`:** If the value is not a dictionary, it prints the value itself. This checkpoint contains only the two state dictionaries, so this branch does not run here; it would apply if the checkpoint also stored items such as training hyperparameters.

### What the Outputs Represent:

The output from running this code would be a detailed printout of the contents of the model checkpoint:

- For the encoder and decoder, it prints the parameter names, which mirror the structure of each component (embedding, attention projections, feed-forward layers, layer normalizations, and the decoder's output layer).
- For any other information stored in a checkpoint, it would print the values directly; this checkpoint has none.

This is useful for understanding the contents of the checkpoint, inspecting the saved state of the model, and potentially troubleshooting issues with loading the checkpoint for further training or evaluation.
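
The inspection logic can be tried without a trained model. The sketch below uses a stand-in checkpoint in which plain lists replace the weight tensors; the function `summarize` and the shortened key lists are illustrative only.

```python
from collections import OrderedDict

# Stand-in for the real checkpoint: plain lists replace the weight tensors
checkpoint = {
    "encoder": OrderedDict([("embedding.weight", [0.1, 0.2]), ("norm1.bias", [0.0])]),
    "decoder": OrderedDict([("out.weight", [0.3]), ("out.bias", [0.0])]),
}

def summarize(checkpoint):
    """Return {key: list of nested keys, or None when the value is not a dict}."""
    summary = {}
    for key, value in checkpoint.items():
        summary[key] = list(value.keys()) if isinstance(value, dict) else None
    return summary

summary = summarize(checkpoint)
print(summary)
print(isinstance(checkpoint["encoder"], dict))  # OrderedDict is a dict subclass
```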