# Assignment 2: Bigram Language Model and Generative Pretrained Transformer (GPT)


The objective of this assignment is to train a simplified transformer model. The primary differences between this implementation and the models used in practice are:
* tokenizer (we use a character-level encoder for simplicity and because of compute constraints)
* size (we use a single consumer-grade GPU hosted on Colab and a small dataset; in practice, models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT or Llama.




In [None]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn

In [None]:
import torch.nn as nn
import torch.nn.functional as F

## Part 1: Bigram MLP for TinyShakespeare (35 points)

1a) (1 point). Create a list `chars` that contains all unique characters in `text`

1b) (2 points). Implement `encode(s: str) -> list[int]`

1c) (2 points). Implement `decode(ids: list[int]) -> str`

1d) (5 points). Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs
```
he
el
ll
lo
```
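The pairing above can be sketched in plain Python by zipping a string with itself shifted by one character. The helper name `bigram_pairs` is hypothetical and only used for illustration; the one-hot encoding step is left out.

```python
def bigram_pairs(s: str) -> list[tuple[str, str]]:
    """Return every consecutive (input, output) character pair in s."""
    # zip stops at the shorter sequence, so the last character has no target.
    return list(zip(s, s[1:]))

pairs = bigram_pairs("hello")
for inp, out in pairs:
    print(inp + out)
```

A string of length `n` yields `n - 1` pairs, which is why the loops that build the training tensors stop one character before the end of the text.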

1e) (10 points). Implement BigramOneHotMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`

Note: Use the `torch.nn.functional.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. Activation functions are applied to every layer except the last.
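To see why the last layer should output raw logits, here is a minimal plain-Python sketch of the cross-entropy of one example given its logits and the index of the correct class. It is an illustration of the math (a log-softmax followed by a negative log-likelihood), not PyTorch's implementation, and the function name is hypothetical.

```python
import math

def cross_entropy(logits: list[float], target: int) -> float:
    """Cross-entropy of a single example, computed directly from raw logits."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # -log(softmax(logits)[target]) == log_sum_exp - logits[target]
    return log_sum_exp - logits[target]

uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)    # equals log(3)
confident_loss = cross_entropy([5.0, 0.0, 0.0], target=0)  # much smaller
print(uniform_loss, confident_loss)
```

Because the softmax is already part of the loss, applying another softmax or activation to the final layer would change the value being optimized.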

1f) (5 points). Train the BigramOneHotMLP for 1000 steps.

1g) (5 points). Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.

1h) (5 points). Implement and train BigramEmbeddingMLP, a 2-layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate functions. The output dimension of the first layer should be 8. Use `torch.optim`.
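The relationship between the one-hot model and the embedding model can be illustrated without PyTorch: multiplying a one-hot vector by a weight matrix selects a single row, which is what an embedding lookup does directly. The sketch below assumes a toy vocabulary of 3 tokens and an embedding size of 2; the weights and helper names are made up for illustration, and a bias term (which a linear layer would add) is ignored.

```python
# Toy weight matrix: 3 tokens in the vocabulary, 2 features per token.
weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def one_hot(index: int, size: int) -> list[float]:
    return [1.0 if j == index else 0.0 for j in range(size)]

def vec_times_matrix(vec: list[float], matrix: list[list[float]]) -> list[float]:
    """Compute vec @ matrix for a vector of length n and an n x d matrix."""
    return [sum(v * row[c] for v, row in zip(vec, matrix))
            for c in range(len(matrix[0]))]

token_id = 2
via_one_hot = vec_times_matrix(one_hot(token_id, len(weights)), weights)
via_lookup = weights[token_id]  # what an embedding table does
print(via_one_hot, via_lookup)
```

The lookup avoids building the one-hot vector and doing the multiplication, which matters once the vocabulary is large.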



Note: the output will look like gibberish


In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-10-14 02:57:13--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2024-10-14 02:57:13 (48.3 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [None]:
# For the bigram model, let's use the first 1000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:1000]

In [None]:
# 1a) Create a list of unique characters
chars = sorted(list(set(text)))

# 1b) Implement encode function
def encode(s: str) -> list[int]:
    return [chars.index(c) for c in s]

# 1c) Implement decode function
def decode(ids: list[int]) -> str:
    return ''.join([chars[i] for i in ids])

# 1d) Create one-hot inputs and outputs
def create_one_hot_inputs_and_outputs() -> tuple[torch.Tensor, torch.Tensor]:
    inputs, outputs = [], []
    for i in range(len(text) - 1):
        input_char = text[i]
        output_char = text[i + 1]
        inputs.append(F.one_hot(torch.tensor(chars.index(input_char)), num_classes=len(chars)))
        outputs.append(F.one_hot(torch.tensor(chars.index(output_char)), num_classes=len(chars)))
    return torch.stack(inputs), torch.stack(outputs)

inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs()

# 1e) Implement BigramOneHotMLP
class BigramOneHotMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(len(chars), 8)
        self.fc2 = nn.Linear(8, len(chars))
        self.activation = nn.LeakyReLU()

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.fc2(x)
        return x

    def generate(self, start='a', max_new_tokens=100) -> str:
        out = start
        for _ in range(max_new_tokens):
            x = F.one_hot(torch.tensor(chars.index(out[-1])), num_classes=len(chars)).float()
            logits = self(x)
            probs = F.softmax(logits, dim=-1)
            next_char = chars[torch.multinomial(probs, 1).item()]
            out += next_char
        return out

bigram_one_hot_mlp = BigramOneHotMLP()

# 1f) Training loop
optimizer = torch.optim.Adam(bigram_one_hot_mlp.parameters(), lr=0.01)
for _ in range(1000):
    optimizer.zero_grad()
    logits = bigram_one_hot_mlp(inputs_one_hot.float())
    loss = F.cross_entropy(logits, outputs_one_hot.float())
    loss.backward()
    optimizer.step()

print(bigram_one_hot_mlp.generate())

arsooolithe ocolie the l:
st brs thispre ow Maveve t


s Marth

Alld.


Onthary lll eled Cit, paf ars


In [None]:
# 1g) Create embedding inputs and outputs
def create_embedding_inputs_and_outputs() -> tuple[torch.Tensor, torch.Tensor]:
    input_ids = torch.tensor([chars.index(c) for c in text[:-1]])
    output_ids = torch.tensor([chars.index(c) for c in text[1:]])
    return input_ids, F.one_hot(output_ids, num_classes=len(chars)).float()

input_ids, outputs_one_hot = create_embedding_inputs_and_outputs()

# 1h) Implement BigramEmbeddingMLP
class BigramEmbeddingMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(len(chars), 8)
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, len(chars))
        self.activation = nn.LeakyReLU()

    def forward(self, x):
        x = self.embedding(x)
        x = self.activation(self.fc1(x))
        x = self.fc2(x)
        return x

    def generate(self, start='a', max_new_tokens=100) -> str:
        out = start
        for _ in range(max_new_tokens):
            x = torch.tensor([chars.index(out[-1])])
            logits = self(x)
            probs = F.softmax(logits, dim=-1)
            next_char = chars[torch.multinomial(probs, 1).item()]
            out += next_char
        return out

bigram_embedding_mlp = BigramEmbeddingMLP()

# Training loop
optimizer = torch.optim.Adam(bigram_embedding_mlp.parameters(), lr=0.01)
for _ in range(1000):
    optimizer.zero_grad()
    logits = bigram_embedding_mlp(input_ids)
    loss = F.cross_entropy(logits, outputs_one_hot)
    loss.backward()
    optimizer.step()

print(bigram_embedding_mlp.generate())

ald.
ol: ues forecisun:
Fit pes wend n

Len: es t atir humizele Cis partiss't us ge is:
beroiraoforn:


## Part 2: Generative Pretrained Transformer (65 points)

For this part, it is best to use a GPU. In the Colab menu, go to Runtime -> Change runtime type and select T4 GPU.

In [None]:
# run nvidia-smi to check gpu usage
!nvidia-smi

Mon Oct 14 02:57:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8              11W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string. (1 point)
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids (1 point)
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string (1 point)


In [None]:
# List of unique characters in the text (character set)
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Mapping from character to integer ID and reverse
stoi = { ch:i for i,ch in enumerate(chars) }  # String to ID (char -> id)
itos = { i:ch for i,ch in enumerate(chars) }  # ID to String (id -> char)

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return ''.join([itos[i] for i in ids])

# Checking vocab size and mappings
print(f'Vocab Size: {vocab_size}')
print(f'Sample chars to ids: {[(ch, stoi[ch]) for ch in chars[:10]]}')  # Sample of chars and their ids
print(f'Sample ids to chars: {[(i, itos[i]) for i in range(10)]}')  # Sample of ids and their chars

Vocab Size: 65
Sample chars to ids: [('\n', 0), (' ', 1), ('!', 2), ('$', 3), ('&', 4), ("'", 5), (',', 6), ('-', 7), ('.', 8), ('3', 9)]
Sample ids to chars: [(0, '\n'), (1, ' '), (2, '!'), (3, '$'), (4, '&'), (5, "'"), (6, ','), (7, '-'), (8, '.'), (9, '3')]


In [None]:
data = torch.tensor(encode(text), dtype=torch.long).cuda()

In [None]:
block_size = 16
data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43],
       device='cuda:0')

To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [None]:
x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18], device='cuda:0') the target: 47
when input is tensor([18, 47], device='cuda:0') the target: 56
when input is tensor([18, 47, 56], device='cuda:0') the target: 57
when input is tensor([18, 47, 56, 57], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58], device='cuda:0') the target: 1
when input is tensor([18, 47, 56, 57, 58,  1], device='cuda:0') the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0') the target: 64
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64], device='cuda:0') the target: 43
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43], device='cuda:0') the target: 52
when input is tensor([18, 47,

In [None]:
batch_size = 64
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

print(get_batch())

(tensor([[ 1, 58, 46,  ..., 54, 53, 61],
        [45,  1, 63,  ..., 57, 47, 56],
        [ 1, 39, 52,  ..., 47, 52,  1],
        ...,
        [56, 39, 52,  ..., 58,  1, 51],
        [ 1, 21, 34,  ..., 43,  1, 54],
        [53, 44,  1,  ..., 42, 57,  8]], device='cuda:0'), tensor([[58, 46, 43,  ..., 53, 61, 43],
        [ 1, 63, 53,  ..., 47, 56, 43],
        [39, 52, 42,  ..., 52,  1, 46],
        ...,
        [39, 52, 45,  ...,  1, 51, 53],
        [21, 34, 10,  ...,  1, 54, 39],
        [44,  1, 58,  ..., 57,  8,  1]], device='cuda:0'))


### Single Self Attention Head (5 points)
![](https://i.ibb.co/GWR1XG0/head.png)

In [None]:
class SelfAttentionHead(nn.Module):
    def __init__(self, head_size, embed_size):
        super().__init__()
        self.head_size = head_size

        # Linear transformations for keys, queries, and values (projecting to head_size)
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)

        self.proj = nn.Linear(head_size, head_size)

    def forward(self, x):
        B, T, C = x.size()  # B: batch size, T: sequence length, C: embedding size

        # Calculate keys, queries, and values
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)

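        # Compute attention scores (scaled dot-product)
        attn_scores = (q @ k.transpose(-2, -1)) / (self.head_size ** 0.5)

        # Causal mask: position t may only attend to positions <= t,
        # otherwise the head could read the tokens it is asked to predict
        causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        attn_scores = attn_scores.masked_fill(~causal_mask, float('-inf'))

        # Apply softmax to get attention weights
        attn_weights = F.softmax(attn_scores, dim=-1)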

        # Weighted sum of values
        attention_output = attn_weights @ v

        # Apply output projection
        out = self.proj(attention_output)

        return out

# Testing the SelfAttentionHead
embed_size = 64
head_size = 16
x = torch.randn(8, 32, embed_size).cuda()

# Initialize and test
attention_head = SelfAttentionHead(head_size, embed_size).cuda()
output = attention_head(x)

print(f'Input shape: {x.shape}')
print(f'Output shape: {output.shape}')

Input shape: torch.Size([8, 32, 64])
Output shape: torch.Size([8, 32, 16])


### Multihead Self Attention (5 points)

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, num_embed, dropout=0.1):

        super().__init__()

        # Initialize the attention heads
        self.heads = nn.ModuleList([SelfAttentionHead(head_size, num_embed) for _ in range(num_heads)])

        # Linear layer to project the concatenated heads' output
        self.proj = nn.Linear(head_size * num_heads, num_embed)

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

        # Concatenate the outputs of all heads along the feature dimension
        out = torch.cat([head(x) for head in self.heads], dim=-1)

        # Apply linear projection and dropout
        out = self.dropout(self.proj(out))

        return out

# Testing MultiHeadAttention
num_heads = 4
head_size = 16
num_embed = 64

x = torch.randn(8, 32, num_embed).cuda()

# Initialize and test
multi_head_attention = MultiHeadAttention(num_heads, head_size, num_embed).cuda()
output = multi_head_attention(x)

print(f'Input shape: {x.shape}')
print(f'Output shape: {output.shape}')

Input shape: torch.Size([8, 32, 64])
Output shape: torch.Size([8, 32, 64])


### MLP (2 points)
Implement a 2 layer MLP


![](https://i.ibb.co/C0DtrF5/ff.png)

In [None]:
class MLP(nn.Module):
    def __init__(self, embed_size, hidden_size=256, dropout=0.1):
        super().__init__()
        # First linear layer to project up to hidden size
        self.fc1 = nn.Linear(embed_size, hidden_size)

        # Second linear layer to project back to embed size
        self.fc2 = nn.Linear(hidden_size, embed_size)

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First layer, followed by ReLU
        x = F.relu(self.fc1(x))

        # Second layer and dropout
        x = self.dropout(self.fc2(x))

        return x

# Testing MLP
embed_size = 64  # Example embedding size
x = torch.randn(8, 32, embed_size).cuda()

# Initialize and test
mlp = MLP(embed_size).cuda()
output = mlp(x)

print(f'Input shape: {x.shape}')
print(f'Output shape: {output.shape}')

Input shape: torch.Size([8, 32, 64])
Output shape: torch.Size([8, 32, 64])


### Transformer block (20 points)

Layer normalization helps stabilize training by normalizing the activations of a layer across all features for each individual data point, rather than across a full batch or for a single feature.
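As a minimal sketch of that per-data-point normalization, the plain-Python function below normalizes one feature vector to zero mean and unit variance. It leaves out the learnable scale and shift that `nn.LayerNorm` also has, and the `eps` value is an assumption chosen to match the common default.

```python
import math

def layer_norm(features: list[float], eps: float = 1e-5) -> list[float]:
    """Normalize one data point across its own features."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]

# Two data points whose features differ in scale by a factor of 10.
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
# Each row is normalized independently; no statistics are shared across rows.
normalized = [layer_norm(row) for row in batch]
print(normalized)
```

Both rows end up with nearly identical values, which shows that the result does not depend on the other examples in the batch.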

Dropout is a form of regularization to prevent overfitting.
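A small sketch of how dropout behaves, assuming the "inverted" formulation in which surviving activations are scaled by `1 / (1 - p)` during training so that nothing needs to change at evaluation time. The function name and signature are hypothetical and written in plain Python for illustration.

```python
import random

def dropout(values: list[float], p: float, training: bool = True) -> list[float]:
    """Zero each value with probability p and rescale the survivors."""
    if not training or p == 0.0:
        # At evaluation time the input passes through unchanged.
        return list(values)
    keep = 1.0 - p
    return [v / keep if random.random() < keep else 0.0 for v in values]

random.seed(0)
train_out = dropout([1.0] * 1000, p=0.5)
eval_out = dropout([1.0, 2.0], p=0.5, training=False)
print(sum(train_out) / len(train_out))  # close to 1.0 on average
```

This is why `generate` calls `self.eval()` before sampling: with dropout still active, the outputs would be randomly perturbed at inference time.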

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [None]:
class Block(nn.Module):

    def __init__(self, n_embd, n_head, dropout=0.1):
        super().__init__()
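        # batch_first=True because inputs are shaped (batch, seq_len, n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, n_embd * 4),
            nn.ReLU(),
            nn.Linear(n_embd * 4, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # Causal mask: True marks positions a token is NOT allowed to attend to
        T = x.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        x_attn, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)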
        x = self.ln1(x + x_attn)
        x_ffn = self.ffn(x)
        x = self.ln2(x + x_ffn)
        return x

# Testing the Block
n_embd = 64  # Dimensionality of the embeddings
n_head = 4   # Number of attention heads
x = torch.randn(8, 32, n_embd).cuda()  # Example input: batch_size=8, seq_len=32, n_embd=64

# Initialize and test the Block
transformer_block = Block(n_embd, n_head).cuda()
output = transformer_block(x)

print(f'Input shape: {x.shape}')
print(f'Output shape: {output.shape}')

Input shape: torch.Size([8, 32, 64])
Output shape: torch.Size([8, 32, 64])


### GPT

`constructor` (5 points)

1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of 4 `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`. (5 points)

`forward` takes a batch of context ids of size (B, T) as input and returns the logits and the loss if `targets` is not None. If `targets` is None, it returns the logits and None.
1. get the token embeddings using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model input through the blocks, the layernorm layer, and the final linear layer
5. compute the loss

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str` (5 points)
1. implement top_p, top_k, and temperature for sampling
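The three sampling controls can be sketched on a plain list of logits. This is an illustrative version under stated assumptions, not the required implementation: the helper names are hypothetical, top-k is applied before top-p, and the token that crosses the `top_p` threshold is kept.

```python
import math
import random

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def filter_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Return next-token probabilities after temperature, top-k and top-p."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [l / temperature for l in logits]
    # Token ids ordered from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        # Keep only the k highest-scoring tokens.
        for i in order[top_k:]:
            scaled[i] = float("-inf")
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        probs = softmax(scaled)
        cumulative = 0.0
        for rank, i in enumerate(order):
            if rank > 0 and cumulative >= top_p:
                scaled[i] = float("-inf")
            cumulative += probs[i]
    return softmax(scaled)

logits = [2.0, 1.0, 0.0, -1.0]
nucleus = filter_logits(logits, top_p=0.8)
greedy = filter_logits(logits, top_k=1)
sharp = filter_logits(logits, temperature=0.5)
# Draw one token id from the filtered distribution.
next_token = random.choices(range(len(nucleus)), weights=nucleus)[0]
```

With these logits, `top_p=0.8` keeps the two most likely tokens, and `top_k=1` reduces sampling to always picking the most likely token.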



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [None]:
class GPT(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, block_size, num_blocks, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)

        # Stacking transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, dropout) for _ in range(num_blocks)])

        self.layer_norm = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.size()

        # Token and position embeddings
        token_emb = self.token_embedding(idx)
        position_ids = torch.arange(T, device=idx.device).unsqueeze(0)
        position_emb = self.position_embedding(position_ids)

        # Combined embeddings
        x = token_emb + position_emb

        # Pass through transformer blocks
        x = self.blocks(x)

        # Layer normalization
        x = self.layer_norm(x)

        # Final linear layer
        logits = self.lm_head(x)

        # Compute loss if targets are provided
        loss = None
        if targets is not None:
            # Targets are already shifted by one position relative to idx
            # (see get_batch), so logits[:, t] is scored against targets[:, t].
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

        return logits, loss

    def generate(self, start_char, max_new_tokens, top_p, top_k, temperature):
        self.eval()
        generated = [start_char]
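        # Place the prompt on the same device as the model weights
        device = next(self.parameters()).device
        input_ids = torch.tensor(generated, device=device).unsqueeze(0)
        # The position embedding table only covers this many positions
        max_context = self.position_embedding.num_embeddings

        for _ in range(max_new_tokens):
            # Forward pass on at most the last max_context tokens
            with torch.no_grad():
                logits, _ = self(input_ids[:, -max_context:])
                logits = logits[:, -1, :] / temperature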

            # Top-k and top-p filtering
            if top_k is not None:
                # Clamp k so torch.topk never asks for more entries than the vocabulary has
                top_k_values, top_k_indices = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < top_k_values[:, -1, None]] = float('-inf')

            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
                sorted_indices_to_remove[:, 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = float('-inf')

            # Sample from the filtered distribution
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Append new token while ensuring dimensions are compatible
            input_ids = torch.cat([input_ids, next_token.t()], dim=1)
            generated.append(next_token.item())

        return generated

# Testing the GPT model
vocab_size = 100
n_embd = 64
n_head = 4
block_size = 32
num_blocks = 4

# Create an instance of the GPT model
gpt_model = GPT(vocab_size, n_embd, n_head, block_size, num_blocks).cuda()

# Example input for testing
idx = torch.randint(0, vocab_size, (8, block_size)).cuda()
logits, loss = gpt_model(idx, targets=None)

print(f'Input shape: {idx.shape}')
print(f'Output shape (logits): {logits.shape}')
print(f'Loss: {loss}')

# Testing the generate function
start_char = idx[0, 0].item()
generated_text = gpt_model.generate(start_char, max_new_tokens=10, top_p=0.9, top_k=50, temperature=1.0)

print(f'Generated text tokens: {generated_text}')

Input shape: torch.Size([8, 32])
Output shape (logits): torch.Size([8, 32, 100])
Loss: None
Generated text tokens: [22, 77, 56, 28, 32, 96, 1, 86, 37, 76, 65]


### Training loop (15 points)

Implement the training loop.

In [None]:
import torch.optim as optim

vocab_size = 100
vocab = [f"token_{i}" for i in range(vocab_size)]  # Simple token vocabulary
vocab_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_vocab = {idx: word for idx, word in enumerate(vocab)}

# Model configuration
n_embd = 64
n_head = 4
block_size = 32
num_blocks = 4

model = GPT(vocab_size, n_embd, n_head, block_size, num_blocks).to('cuda')

# Define a loss function and optimizer
optimizer = optim.Adam(model.parameters(), lr=3e-4)

# Hyperparameters
max_iters = 5000
batch_size = 8

# Generate random input and target data for training
data = torch.randint(0, vocab_size, (max_iters, batch_size, block_size)).to('cuda')
targets = torch.randint(0, vocab_size, (max_iters, batch_size, block_size)).to('cuda')

# Training loop
for iter in range(max_iters):
    # Get a batch of data
    idx = data[iter]
    target = targets[iter]

    # Zero the gradients
    optimizer.zero_grad()

    # Forward pass
    logits, loss = model(idx, targets=target)

    # Backward pass
    loss.backward()

    # Update weights
    optimizer.step()

    # Print the loss every 100 iterations
    if iter % 100 == 0:
        print(f"Iteration {iter}, Loss: {loss.item():.4f}")

Iteration 0, Loss: 4.7231
Iteration 100, Loss: 4.6127
Iteration 200, Loss: 4.6170
Iteration 300, Loss: 4.5989
Iteration 400, Loss: 4.6244
Iteration 500, Loss: 4.6057
Iteration 600, Loss: 4.6040
Iteration 700, Loss: 4.6046
Iteration 800, Loss: 4.6020
Iteration 900, Loss: 4.6156
Iteration 1000, Loss: 4.6086
Iteration 1100, Loss: 4.6108
Iteration 1200, Loss: 4.6173
Iteration 1300, Loss: 4.6077
Iteration 1400, Loss: 4.6068
Iteration 1500, Loss: 4.6083
Iteration 1600, Loss: 4.6042
Iteration 1700, Loss: 4.6140
Iteration 1800, Loss: 4.6026
Iteration 1900, Loss: 4.6041
Iteration 2000, Loss: 4.6128
Iteration 2100, Loss: 4.6117
Iteration 2200, Loss: 4.6151
Iteration 2300, Loss: 4.6121
Iteration 2400, Loss: 4.6099
Iteration 2500, Loss: 4.6125
Iteration 2600, Loss: 4.6111
Iteration 2700, Loss: 4.6122
Iteration 2800, Loss: 4.6008
Iteration 2900, Loss: 4.6191
Iteration 3000, Loss: 4.6064
Iteration 3100, Loss: 4.6033
Iteration 3200, Loss: 4.6058
Iteration 3300, Loss: 4.6060
Iteration 3400, Loss: 4.61
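
The plateau near 4.61 in the log above is what one would expect when the targets are drawn at random, independently of the inputs: the best a model can do is predict a uniform distribution over the 100 tokens, whose cross-entropy is ln(100). A quick check of that arithmetic, with the 65-character vocabulary printed earlier for the TinyShakespeare text included for comparison:

```python
import math

# Cross-entropy of a uniform guess over a vocabulary of size V is ln(V).
random_vocab_size = 100  # vocabulary used for the random training data
char_vocab_size = 65     # vocabulary size printed for the TinyShakespeare text

chance_loss_random = math.log(random_vocab_size)
chance_loss_chars = math.log(char_vocab_size)

print(f"uniform guess over {random_vocab_size} tokens: {chance_loss_random:.4f}")
print(f"uniform guess over {char_vocab_size} characters: {chance_loss_chars:.4f}")
```

Training on batches drawn from the encoded text, rather than on random tensors, should let the loss fall below this baseline, because real text has structure the model can learn.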

### Generate text


Print some text that your model generates.

In [None]:
def generate_text(model, start_token, gen_length=50, temperature=1.0):
    model.eval()
    generated = [vocab_to_idx[start_token]]

    # Create the initial input tensor
    input_tensor = torch.tensor(generated).unsqueeze(0).to('cuda')

    with torch.no_grad():
        for _ in range(gen_length):
            logits, _ = model(input_tensor)
            logits = logits[:, -1, :] / temperature

            # Apply softmax to convert logits to probabilities
            probabilities = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probabilities, num_samples=1)

            # Append the generated token to the list
            generated.append(next_token.item())

            # Prepare the next input tensor
            input_tensor = torch.cat((input_tensor, next_token.view(1, 1)), dim=1)

            # Check if input_tensor exceeds block size
            if input_tensor.size(1) > block_size:
                input_tensor = input_tensor[:, -block_size:]

    # Convert token indices to text using the vocabulary mapping
    generated_text = [idx_to_vocab[token] for token in generated]

    return generated_text

# Generate text starting with a specific token from the vocabulary
start_token = "token_0"
generated_sequence = generate_text(model, start_token, gen_length=50, temperature=0.8)

# Print the generated text
print("Generated Text:", ' '.join(generated_sequence))

Generated Text: token_0 token_13 token_15 token_1 token_47 token_74 token_73 token_70 token_1 token_73 token_51 token_91 token_49 token_83 token_77 token_99 token_0 token_32 token_31 token_88 token_74 token_55 token_4 token_59 token_5 token_60 token_1 token_63 token_53 token_33 token_74 token_25 token_13 token_37 token_70 token_97 token_24 token_15 token_42 token_9 token_60 token_97 token_95 token_57 token_40 token_85 token_3 token_96 token_33 token_63 token_25
