<a href="https://colab.research.google.com/github/KempnerInstitute/transformer-workshop/blob/main/transformer_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [1]:
import torch
from torch import nn
import torch.nn.functional as F
from dataclasses import dataclass

# Tokenization

## Exercise: Implementing Character-based tokenization

1. Get a sorted list of every unique character in your training data.
2. Create a dictionary that converts tokens to IDs (`str_to_int`) and one that converts IDs to tokens (`int_to_str`).
3. Implement the functions `encode` and `decode`.
`encode` should take in a string and output a list of token IDs.
`decode` should take in a list of token IDs and output a string.
4. Test encoding and then decoding “My dog Leo is extremely cute.” Do you recover the correct string?


In [2]:
# Load in all training data
with open('tiny_wikipedia.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
# Step 1) Get a sorted list of all unique characters that occur in this text
# Hint: set is useful for getting unique elements in a sequence
... # your code here

# Step 2) Create the dictionaries str_to_int and int_to_str
... # your code here

# Step 3) Define encode and decode functions
# def encode(...):
#     ...

# def decode(...):
#     ...

# Testing your implementation 
input_text = "My dog Leo is extremely cute."
ids = encode(input_text, str_to_int)
decoded_text = decode(ids, int_to_str)
assert input_text == decoded_text, "Decoded text does not match input"


**Hint for step 1**

Python's built-in `set` is useful for getting the unique elements of a sequence.

In [3]:
# Solution

# Step 1) Get a sorted list of all unique characters that occur in this text
# Hint: set is useful for getting unique elements in a sequence
chars = sorted(set(text))  # sorted() already returns a list

# Step 2) Create the dictionaries str_to_int and int_to_str
str_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_str = {i: ch for i, ch in enumerate(chars)}

# Step 3) Define encode and decode functions
def encode(text, str_to_int):
    ids = [str_to_int[c] for c in text]
    return ids

def decode(ids, int_to_str):
    text_list = [int_to_str[i] for i in ids]  # `i` avoids shadowing the built-in id()
    return ''.join(text_list)

# Testing your implementation 
input_text = "My dog Leo is extremely cute."
ids = encode(input_text, str_to_int)
decoded_text = decode(ids, int_to_str)
assert input_text == decoded_text, "Decoded text does not match input"
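
One caveat about the solution above: it looks characters up directly in the dictionary, so `encode` raises a `KeyError` for any character that never appears in the training text. The sketch below reproduces the situation on a toy corpus and shows one possible workaround. The toy corpus, the `<unk>` placeholder, and the helpers `encode_safe` and `decode_toy` are illustrative assumptions, not part of the workshop code.

```python
# Toy corpus standing in for tiny_wikipedia.txt (assumption for illustration)
toy_text = "hello world"

toy_chars = sorted(set(toy_text))
toy_str_to_int = {ch: i for i, ch in enumerate(toy_chars)}
toy_int_to_str = {i: ch for i, ch in enumerate(toy_chars)}

# Hypothetical extension: reserve one extra ID for unseen characters
UNK_ID = len(toy_chars)
toy_int_to_str[UNK_ID] = "<unk>"

def encode_safe(text, str_to_int, unk_id):
    # dict.get falls back to unk_id instead of raising KeyError
    return [str_to_int.get(c, unk_id) for c in text]

def decode_toy(ids, int_to_str):
    return "".join(int_to_str[i] for i in ids)

ids_known = encode_safe("hello", toy_str_to_int, UNK_ID)
ids_unknown = encode_safe("help", toy_str_to_int, UNK_ID)  # 'p' is not in the toy corpus

print(decode_toy(ids_known, toy_int_to_str))
print(decode_toy(ids_unknown, toy_int_to_str))
```

Note that the round trip is only lossless for strings made entirely of known characters; anything mapped to the placeholder cannot be recovered.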


## Tokenize input data and create splits

In [4]:
# Train and test splits
data = torch.tensor(encode(text, str_to_int), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split, ctx_len, batch_size, device='cpu'):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - ctx_len, (batch_size,))
    x = torch.stack([data[i:i+ctx_len] for i in ix])
    y = torch.stack([data[i+1:i+ctx_len+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
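
To make the relationship between `x` and `y` in `get_batch` concrete, here is a minimal sketch using a plain Python list instead of a tensor and a fixed start index instead of a random one (both are simplifications for illustration). The targets are the inputs shifted one position to the right, so each position in the window supplies one next-token training example.

```python
# Toy token stream; the values are just positions so the shift is easy to see
toy_data = list(range(10))
ctx_len = 4
i = 2  # a fixed start index instead of a random one

x = toy_data[i:i + ctx_len]          # inputs
y = toy_data[i + 1:i + ctx_len + 1]  # targets: the same window shifted right by one

for t in range(ctx_len):
    # at position t the model sees x[:t + 1] and should predict y[t]
    print(f"context {x[:t + 1]} -> target {y[t]}")
```

This is also why `get_batch` draws start indices below `len(data) - ctx_len`: the target window reaches one token further than the input window and must still fit inside the data.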

# Define our transformer parameters with a config

In [5]:
@dataclass
class Config:
    d_model: int = 256 # the model/hidden/embedding dim
    n_heads: int = 4 # number of attention heads (width)
    ctx_len: int = 64 # context length
    batch_size: int = 8 # batch size
    n_layers: int = 12 # number of layers (depth)
    vocab_size: int = -1 # vocab size, to be determined once we have created a tokenizer
    device: str = 'cpu'

    def set_vocab_size(self, vocab_size):
        self.vocab_size = vocab_size

In [6]:
config = Config()
config.set_vocab_size(vocab_size=len(chars)) # set our vocabulary size (equal to the number of chars)

# Embeddings

## Exercise: Implement token embeddings

We want to implement a class that takes in a batch of token IDs (batch size by context length) and outputs the token embeddings (batch size by context length by embedding dimension). Find the `nn.Embedding` docs [here](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).

In [None]:
class TokenEmbeddingLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        # TODO: fill out nn.Embedding arguments for token embedding
        self.wte = nn.Embedding(...)

    def forward(self, x):
        B, T = x.shape
        
        # TODO: Get forward pass of token embedding layer
        x_tok = ...

        return x_tok
    

# Testing your implementation
xb, yb = get_batch('train', config.ctx_len, config.batch_size, config.device)

token_embedding = TokenEmbeddingLayer(config)
x_tok = token_embedding(xb)

assert x_tok.shape == (config.batch_size, config.ctx_len, config.d_model), "Embedding dimensions are incorrect"

In [7]:
# Solution
class TokenEmbeddingLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        self.wte = nn.Embedding(config.vocab_size, config.d_model)

    def forward(self, x):
        B, T = x.shape
        
        x_tok = self.wte(x)

        return x_tok
    

# Testing your implementation
xb, yb = get_batch('train', config.ctx_len, config.batch_size, config.device)

token_embedding = TokenEmbeddingLayer(config)
x_tok = token_embedding(xb)

assert x_tok.shape == (config.batch_size, config.ctx_len, config.d_model), "Embedding dimensions are incorrect"

## Exercise: Implement full embedding layer

Now we'll write the full embedding layer including both token and position embeddings. You can use your solutions from the previous part for the token embeddings. How can you implement the position embeddings? This is a little tricky so feel free to click on the hints below the exercise for help.

In [None]:
class EmbeddingLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        self.device = config.device

        # TODO: fill out nn.Embedding arguments for token embedding (same as before)
        self.wte = nn.Embedding(...)

        # TODO: implement position embedding
        self.wpe = ...
        

    def forward(self, x):
        B, T = x.shape
        
        # TODO: Get forward pass of token and position embedding 
        x_tok = ...
        x_pos = ...
        x_embeddings = x_tok + x_pos 

        return x_embeddings
    

# Testing your implementation
xb, yb = get_batch('train', config.ctx_len, config.batch_size, config.device)

embedding = EmbeddingLayer(config)
x_embedding = embedding(xb)

assert x_embedding.shape == (config.batch_size, config.ctx_len, config.d_model), "Embedding dimensions are incorrect"

**Hint #1**

For position embeddings, you can also use `nn.Embedding`. Instead of the first dimension being equal to the vocab size, it should be equal to the context length (so you learn an embedding for each position in a sequence).


**Hint #2**

The output of the token embeddings forward pass is batch size x context length x model dimension.

For the forward pass of the position embeddings, you only need to create a matrix that is context length by model dimension as nothing depends on the actual data in each batch. Broadcasting will ensure you can still add this matrix to the token embeddings.
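
The broadcasting behaviour described in this hint can be checked in isolation. The sketch below uses random tensors as stand-ins for the two embedding outputs, with small made-up sizes; it is only meant to show the shape rule, not the embedding layer itself.

```python
import torch

B, T, C = 2, 5, 8  # small illustrative sizes (assumptions)
x_tok = torch.randn(B, T, C)   # stands in for the token embeddings
x_pos = torch.randn(T, C)      # stands in for the position embeddings

x_sum = x_tok + x_pos          # (B, T, C) + (T, C) broadcasts over the batch dimension

# Every sequence in the batch receives the same position matrix
same_for_all = all(torch.allclose(x_sum[b], x_tok[b] + x_pos) for b in range(B))
print(x_sum.shape, same_for_all)
```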

In [8]:
# Solution
class EmbeddingLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        self.wte = nn.Embedding(config.vocab_size, config.d_model)
        self.wpe = nn.Embedding(config.ctx_len, config.d_model)
        self.device = config.device

    def forward(self, x):
        B, T = x.shape
        
        x_tok = self.wte(x)
        # print(x_tok.shape) uncomment this if you want to see the shape of above tensor
        x_pos = self.wpe(torch.arange(T, device=x.device))  # positions 0..T-1, created on the same device as the input
        # print(x_pos.shape)
        x_embeddings = x_tok + x_pos 

        return x_embeddings
    

# Testing your implementation
xb, yb = get_batch('train', config.ctx_len, config.batch_size, config.device)

embedding = EmbeddingLayer(config)
x_embedding = embedding(xb)

assert x_embedding.shape == (config.batch_size, config.ctx_len, config.d_model), "Embedding dimensions are incorrect"

# Attention

## Exercise: Implementing single headed causal self attention

Self-attention is a core mechanism in transformers that lets each position in a sequence draw on information from other positions. The "causal" part ensures each position can only attend to itself and earlier positions, never to future ones. This is crucial for language modeling, where the model must predict the next token without seeing it.

The task below is to fill out the `SingleHeadCausalAttention` module. The `__init__` method should define the key, query, and value projections. Note that the causal mask has already been defined for you: it is a lower triangular matrix of 1's, which you can refer to as `self.cmask`.

The `forward(self, x)` function takes an input `x` of shape `(B, T, C)`, corresponding to batch size, sequence length, and hidden dimension, and outputs the result of applying the attention formula. To do this,

1. Create the K, Q, V matrices by applying the `self.key`, `self.query`, and `self.values` projections to `x`.
2. Compute and return attention using the formula:

$$\textrm{attention}(K, V, Q) = \textrm{softmax}\left( c \odot \frac{Q K^\top}{\sqrt{d_k}} \right) V $$

where $c \odot \dots$ denotes the application of the causal mask. You can use `torch.masked_fill(...)` here to apply the mask. It takes three arguments: the input matrix you want to mask, a boolean condition marking where to mask it, and the value you want to mask with. To figure out what value you want to mask with, it may be helpful to recall the softmax formula; the $i$-th component of a vector $x$ after a softmax is: $$ \textrm{softmax}(x)_i =  \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

Hints:
1. Keep track of the matrix dimensions after each step!
2. Note that you can transpose a matrix in PyTorch by calling `A.transpose(dim_1, dim_2)` where `dim_1`, `dim_2` refer to the dimensions you want to swap.
3. You may use PyTorch's built-in softmax function `F.softmax(...)`.

In [None]:
class SingleHeadCausalAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        # Calculate the dimension for each attention head
        self.head_dim = config.d_model // config.n_heads
        
        # TODO: Initialize the Key, Query, and Value projections
        # Each should be a linear layer that projects from d_model to head_dim
        # Hint: Use nn.Linear(..., bias=False) as is standard in attention
        self.key = ... # Your code here
        self.query = ... # Your code here
        self.values = ... # Your code here
        
        # Create causal mask (lower triangular matrix), you can refer to it by `self.cmask`
        self.register_buffer("cmask", torch.tril(torch.ones([config.ctx_len, config.ctx_len])))
    
    def forward(self, x):
        B, T, C = x.shape
        
        # TODO Step 1: Project input to get Key, Query, Value matrices
        K = ... # Your code here
        Q = ... # Your code here
        V = ... # Your code here
        
        # TODO Step 2: Compute attention scores and apply mask
        # Remember: 
        # - Scale by sqrt(head_dim)
        # - Use the causal mask (self.cmask) to prevent attention to future tokens (you can use `torch.masked_fill(...)` here)
        # - Apply softmax to get attention weights
        # - Multiply with values
        
        ... # Your implementation here...
        
        return # Final output
    

# Test your implementation
config = Config(d_model=256, n_heads=8, ctx_len=16)
attention = SingleHeadCausalAttention(config)
x = torch.randn(2, 10, 256)  # (batch_size, seq_len, d_model)
output = attention(x)
assert output.shape == (2, 10, 32)  # head_dim = 256/8 = 32

In [9]:
# Solution
class SingleHeadCausalAttention(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.head_dim = config.d_model // config.n_heads
        self.key = nn.Linear(config.d_model, self.head_dim, bias=False)
        self.query = nn.Linear(config.d_model, self.head_dim, bias=False)
        self.values = nn.Linear(config.d_model, self.head_dim, bias=False)

        self.register_buffer("cmask", torch.tril(torch.ones([config.ctx_len, config.ctx_len])))

    
    def forward(self, x):

        B, T, C = x.shape
        
        K = self.key(x) # (B, T, C) @ (_, C, H) -> (B, T, H)
        Q = self.query(x)
        V = self.values(x)

        y = Q @ K.transpose(-2, -1) * self.head_dim**-0.5 # (B, T, H) @ (B, H, T) -> (B, T, T)
        y = torch.masked_fill(y, self.cmask[:T, :T]==0, float('-inf'))
        y = F.softmax(y, dim=-1) @ V
        return y
    

# Test your implementation
config = Config(d_model=256, n_heads=8, ctx_len=16)
attention = SingleHeadCausalAttention(config)
x = torch.randn(2, 10, 256)  # (batch_size, seq_len, d_model)
output = attention(x)
assert output.shape == (2, 10, 32)  # head_dim = 256/8 = 32
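
To see why the solution fills masked positions with `float('-inf')` rather than `0`, it helps to run the softmax formula by hand on a single row of scores. The sketch below uses plain Python and made-up scores; the `softmax` helper is written out here for illustration and is not the PyTorch function.

```python
import math

def softmax(scores):
    # subtract the max for numerical stability; -inf entries become exp(-inf) = 0.0
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]          # raw attention scores for one query position
allowed = [True, True, False, False]   # causal mask row: positions 2 and 3 are in the future

masked_neg_inf = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
masked_zero = [s if ok else 0.0 for s, ok in zip(scores, allowed)]

w_neg_inf = softmax(masked_neg_inf)
w_zero = softmax(masked_zero)
print(w_neg_inf)  # future positions receive exactly zero weight
print(w_zero)     # future positions still receive some weight
```

Filling with `0` only sets the score to zero, and `exp(0) = 1` still contributes to the sum, so information from future tokens would leak through.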

# Multi-head self attention

## Exercise: implementing multi-head attention

The task is to write the multi-headed self attention module.  You should not need to write more than a few lines of code here.

1. Define `self.heads` as the list of attention heads that will act in parallel on the input.  You may use `nn.ModuleList(...)` to do this.
2. Define `self.linear`, a linear projection.
3. Define the forward function, which takes in the input `x` (which is (B, T, C)-dimensional), passes it through each head, and concatenates the outputs.  To perform the concatenation you can use `torch.cat(...)`.
4. After going through the attention heads, the concatenated output should go through the linear projection and then be returned.

In [None]:
class MultiHeadCausalAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.heads = ... # your code here, you can define the heads using `nn.ModuleList(...)`
        self.linear = ... # your code here, with `d_model` in-features and `d_model` out-features
        

    def forward(self, x):
        # TODO: fill out the forward method for multi-head attention
        # Remember:
        # - pass input through all heads and concatenate the output (you can use `torch.cat(...)` here)
        # - pass the result through the linear layer and return the output
        
        ... # your code here


# Testing your implementation
config = Config(d_model=256, n_heads=8, ctx_len=16)
mha = MultiHeadCausalAttention(config)

# Test with small batch
x = torch.randn(2, 10, 256)  # (batch_size=2, seq_len=10, d_model=256)
out = mha(x)
assert out.shape == (2, 10, 256)

In [10]:
# Solution
class MultiHeadCausalAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.heads = nn.ModuleList([SingleHeadCausalAttention(config) for _ in range(config.n_heads)])
        self.linear = nn.Linear(config.d_model, config.d_model)
        

    def forward(self, x):
        y = torch.cat([h(x) for h in self.heads], dim=-1)
        y = self.linear(y)
        return y
    

# Testing your implementation
config = Config(d_model=256, n_heads=8, ctx_len=16)
mha = MultiHeadCausalAttention(config)

# Test with small batch
x = torch.randn(2, 10, 256)  # (batch_size=2, seq_len=10, d_model=256)
out = mha(x)
assert out.shape == (2, 10, 256)
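
The shape bookkeeping behind the concatenation step can be checked without any attention code. In the sketch below, random tensors stand in for the per-head outputs (an assumption for illustration); concatenating `n_heads` tensors of width `head_dim` along the last dimension restores the full `d_model` width.

```python
import torch

B, T, d_model, n_heads = 2, 10, 256, 8
head_dim = d_model // n_heads  # 32

# Random tensors standing in for the outputs of the individual heads
head_outputs = [torch.randn(B, T, head_dim) for _ in range(n_heads)]
y = torch.cat(head_outputs, dim=-1)
print(y.shape)

# The first head occupies the first head_dim channels of the result
first_slice_matches = torch.equal(y[..., :head_dim], head_outputs[0])
print(first_slice_matches)
```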

# Define the feed-forward network (FFN) decoder block

## Exercise: FFN

The Feed-Forward Network (FFN) is a simple yet powerful component that applies two linear transformations with a ReLU activation in between. The first transformation expands the input dimension by a factor of 4, and the second transformation projects it back to the original dimension.  In this exercise, you will implement this module.

In [12]:
class FFN(nn.Module):
    def __init__(self, config):
        super().__init__()
        # TODO: Initialize two linear layers
        # First layer should expand from d_model to 4*d_model
        # Second layer should project back to d_model
        # Hint: use nn.Linear(in_features, out_features)
        self.l1 = ... # Your code here
        self.relu = nn.ReLU()
        self.l2 = ... # Your code here

    def forward(self, x):
        # TODO: Implement the forward pass
        # 1. Apply first linear layer
        # 2. Apply ReLU activation
        # 3. Apply second linear layer
        x = ... # Your code here
        x = ... # Your code here
        x = ... # Your code here
        return x

In [11]:
# Solution
class FFN(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.l1 = nn.Linear(config.d_model, 4*config.d_model)
        self.relu = nn.ReLU()
        self.l2 = nn.Linear(4*config.d_model, config.d_model)

    def forward(self, x):
        x = self.l1(x)
        x = self.relu(x)
        x = self.l2(x)
        return x


## Exercise: Decoder Block

The Decoder Block is a core component that combines self-attention with a feed-forward network. It uses residual connections and layer normalization to help with training stability.

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.mha = MultiHeadCausalAttention(config)
        # TODO: Initialize layer normalization layers
        # Hint: use nn.LayerNorm(config.d_model)
        self.ln1 = ... # Your code here
        self.ffn = FFN(config)
        self.ln2 = ... # Your code here

    def forward(self, x):
        # TODO: Implement the forward pass with residual connections
        # Remember the pattern: x = x + sublayer(layer_norm(x))
        x = ... # Your code here  # First attention block with residual
        x = ... # Your code here  # Second FFN block with residual
        return x
    

# Testing your implementation
config = Config(d_model=256)
ffn = FFN(config)
decoder = DecoderBlock(config)

# Test with random input
x = torch.randn(2, 10, 256)  # (batch_size, sequence_length, d_model)
output = decoder(x)
assert output.shape == x.shape

In [12]:
# Solution
class DecoderBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.mha = MultiHeadCausalAttention(config)
        self.ln1 = nn.LayerNorm(config.d_model)
        self.ffn = FFN(config)
        self.ln2 = nn.LayerNorm(config.d_model)

    def forward(self, x):
        x = x + self.mha(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
    

# Testing your implementation
config = Config(d_model=256)
ffn = FFN(config)
decoder = DecoderBlock(config)

# Test with random input
x = torch.randn(2, 10, 256)  # (batch_size, sequence_length, d_model)
output = decoder(x)
assert output.shape == x.shape
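
The pattern `x = x + sublayer(layer_norm(x))` used in the block normalizes the input before each sublayer and then adds the sublayer output back onto the unnormalized input. Here is a minimal pure-Python sketch of that pattern on a single feature vector. It omits the learned scale and shift of `nn.LayerNorm`, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import statistics

def layer_norm(vec, eps=1e-5):
    # normalize one feature vector to zero mean and (nearly) unit variance;
    # the learned scale and shift of nn.LayerNorm are omitted in this sketch
    mean = statistics.fmean(vec)
    var = statistics.pvariance(vec, mu=mean)  # population (biased) variance
    return [(v - mean) / (var + eps) ** 0.5 for v in vec]

def toy_sublayer(vec):
    # hypothetical stand-in for attention or the FFN
    return [0.1 * v for v in vec]

x = [1.0, 2.0, 4.0, 9.0]
normed = layer_norm(x)

# residual connection: the sublayer output is added to the original input
out = [a + b for a, b in zip(x, toy_sublayer(normed))]

print(normed)
print(out)
```

Because the residual path carries `x` through unchanged, each block only has to learn an update to its input, which is part of why deep stacks of such blocks remain trainable.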

# Define the transformer!

We're now ready to put the components together into our final decoder module that can actually generate text! Your task is to implement the missing pieces of the `Decoder` class. This is the top-level module that:

* Embeds input tokens and adds positional information
* Processes them through multiple transformer layers
* Outputs predictions for the next token through the `forward(...)` function
* Can generate new sequences autoregressively through the `generate(...)` function

We have given extra hints for this module since it is a challenging exercise.

In [None]:
class Decoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        # Stack of decoder blocks
        self.blocks = nn.Sequential(*[DecoderBlock(config) for _ in range(config.n_layers)])

        # TODO: Initialize components
        # Final layer norm and projection to vocabulary
        self.ln = ... # normalize to d_model dimension
        self.lin = ... # project from d_model to vocab_size
        
        # Embeddings
        self.emb = EmbeddingLayer(config)

        # Loss function for training
        self.L = nn.CrossEntropyLoss()
        self.ctx_len = config.ctx_len

        self.device = config.device # don't change this (for training model on right device)
    
    def forward(self, x, targets=None):
        """
        Args:
            x: Input tokens (B, T)
            targets: Optional target tokens (B, T)
        Returns:
            logits: Predictions (B, T, vocab_size)
            loss: Optional cross-entropy loss
        """
        B, T = x.shape
        
        # Step 1: Get embeddings
        x = self.emb(x) 
        
        # TODO Step 2: Process through transformer
        x = self.blocks(x)          # Apply transformer blocks
        x = ... # Your code here        # Apply final layer norm
        logits = ... # Your code here   # Project to vocabulary size
        
        # TODO Step 3: Compute loss if targets are provided
        if targets is None:
            loss = None
        else:
            # Reshape logits and targets for loss computation
            B, T, V = logits.shape
            logits = logits.view(B*T, V)    # Combine batch and time dimensions
            targets = targets.view(B*T)      # Flatten targets
            loss = ... # Your code here          # Compute cross entropy loss
        
        return logits, loss
    
    def generate(self, idx, max_len=256):
        """
        Generate new tokens given initial sequence idx.
        """
        # TODO: Implement generation loop
        for _ in range(max_len):
            # Step 1: Take the last ctx_len tokens
            idx_window = ... # Your code here
            
            # Step 2: Get model predictions
            logits, _ = self(idx_window)     # (B, T, V)
            logits = logits[:, -1, :]        # Only take the last token's predictions
            
            # Step 3: Sample next token
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Step 4: Append to sequence
            idx = ... # Your code here
        
        return idx
    

# Testing your implementation
config = Config(
    vocab_size=100,
    d_model=256,
    ctx_len=64,
    n_layers=4
)
decoder = Decoder(config)

x = torch.randint(0, 100, (1, 10))
logits, loss = decoder(x, x)

out = decoder.generate(torch.tensor([[1, 2, 3]]), max_len=5)
print(out.shape)  # Should be (1, 8) - original 3 tokens + 5 new ones

In [13]:
# Solution
class Decoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.blocks = nn.Sequential(*[DecoderBlock(config) for _ in range(config.n_layers)])
        self.ln = nn.LayerNorm(config.d_model)
        self.lin = nn.Linear(config.d_model, config.vocab_size)
        self.emb = EmbeddingLayer(config)
        self.L = nn.CrossEntropyLoss()
        self.ctx_len = config.ctx_len
        self.device = config.device
    
    def forward(self, x, targets=None):
        B, T = x.shape
        x = self.emb(x)

        x = self.blocks(x)
        x = self.ln(x)
        logits = self.lin(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # compute xentropy loss, targets are (B, T)
            B, T, V = logits.shape
            targets = targets.view(B*T)
            logits = logits.view(B*T, V)
            loss = self.L(logits, targets)
        
        return logits, loss
    
    def generate(self, idx, max_len=256):
        for _ in range(max_len):
            idx_window = idx[:, -self.ctx_len:]
            logits, _ = self(idx_window) #(B, T, V)
            logits = logits[:,-1,:]
            prob = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(prob, num_samples=1) # sample from the predicted distribution (not greedy)
            idx = torch.cat((idx, next_token), dim=1)
        
        return idx
    

# Testing your implementation
config = Config(
    vocab_size=100,
    d_model=256,
    ctx_len=64,
    n_layers=4
)
decoder = Decoder(config)

x = torch.randint(0, 100, (1, 10))
logits, loss = decoder(x, x)

out = decoder.generate(torch.tensor([[1, 2, 3]]), max_len=5)
print(out.shape)  # Should be (1, 8) - original 3 tokens + 5 new ones

torch.Size([1, 8])
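
The `generate` method draws the next token with `torch.multinomial`, which samples in proportion to the predicted probabilities rather than always taking the most likely token. The difference can be sketched in plain Python on a made-up three-token distribution; the probabilities and the fixed seed are assumptions for illustration.

```python
import random

probs = [0.1, 0.7, 0.2]  # toy next-token distribution over a 3-token vocabulary

# Greedy decoding: always pick the most likely token
greedy_token = max(range(len(probs)), key=lambda i: probs[i])

# Sampling: draw a token in proportion to its probability
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.choices(range(len(probs)), weights=probs, k=1)[0] for _ in range(1000)]
counts = [samples.count(i) for i in range(len(probs))]

print(greedy_token)
print(counts)
```

Greedy decoding would return the same continuation every time for a given prompt, whereas sampling produces varied text, which is why repeated calls to `generate` give different outputs.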


# Train your model

In [16]:
config = Config(d_model=768, n_heads=12, ctx_len=512, batch_size = 64, n_layers = 12, device='cuda:0')
config.set_vocab_size(vocab_size=len(chars))
model = Decoder(config).to(config.device)

# print the size of the model
n_params = sum(p.numel() for p in model.parameters())
print(f"Total model parameters: {n_params}")

# hyperparameters
learning_rate = 3e-4
max_iters = 6000
eval_interval = 200  # How often to evaluate
eval_iters = 100     # How many batches to use for evaluation

# Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    xb, yb = get_batch('train', config.ctx_len, config.batch_size, config.device)
    
    # forward pass
    logits, loss = model(xb, yb)
    
    # backward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
    # evaluate on val data at specified intervals
    if iter % eval_interval == 0:
        model.eval()
        with torch.no_grad():
            val_losses = []
            for _ in range(eval_iters):
                xb, yb = get_batch('val', config.ctx_len, config.batch_size, config.device)
                _, val_loss = model(xb, yb)
                val_losses.append(val_loss.item())
            val_loss = sum(val_losses) / len(val_losses)
            
            print(f"step {iter}: train loss {loss.item():.4f}, val loss {val_loss:.4f}")
            
            # generate some text to see how we're doing
            context = torch.zeros((1, 1), dtype=torch.long, device=config.device)
            print(decode(model.generate(context, max_len=100)[0].tolist(), int_to_str))
            print('='*40)
        model.train()

# Final generation
context = torch.zeros((1, 1), dtype=torch.long, device=config.device)
print("\nFinal generated text:")
print(decode(model.generate(context, max_len=500)[0].tolist(), int_to_str))

Total model parameters: 86794109
step 0: train loss 7.0461, val loss 4.8023
	i-ְaČΗe m ̽  ș ه«  e ‐ ə⍵a˺    °eh ώi a  ȘRב↓ǻנ 🇦 α ὁ s  ś Đṭ i Ğ ビ  ა  ô B t   …a ܗ     ᶏ 9Ḳ きẮ♦ щক
step 200: train loss 2.5676, val loss 2.5627
	 – Do " E3σ mybanged, Larimivernns fomopluss Vistid E)
 tencal wasof perl cuererernellk thees p ks. 
step 400: train loss 2.5312, val loss 2.5347
	80demmect prengalologes ty coffrky, taved oniowhatenchon ariccepeithiong "crs s on pumathemm cowke o
step 600: train loss 2.2975, val loss 2.3220
	en Fors hage.

okhat 

10sBe wito dit am thent derupition Sey ated of the ALS Aris 21. M thilm efou:
step 800: train loss 1.8219, val loss 1.8791
	umely bused then SDire islanda in Ausconod MCory used more sastement used of the the Sciet, includin
step 1000: train loss 1.5887, val loss 1.6315
	odulling Laigeboggh compared Japs, 1975, Japanese Murillen V. 1963. North Scott. Homparrader, Greall
step 1200: train loss 1.3900, val loss 1.4763
	 sadon, ince he adopted, which semiss w
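
The reported total of 86794109 parameters can be reproduced by hand from the layer definitions above. The sketch below is a back-of-the-envelope count in plain Python. The vocabulary size is not printed anywhere in this notebook; 893 is an inferred value, chosen because it makes the arithmetic match the reported total exactly. The causal mask is registered as a buffer, so it does not count as a parameter.

```python
import math

d_model, n_heads, ctx_len, n_layers = 768, 12, 512, 12
vocab_size = 893  # inferred, not printed in the notebook (see the lead-in)

head_dim = d_model // n_heads
attn_heads = n_heads * 3 * d_model * head_dim      # key/query/value projections, no biases
attn_out = d_model * d_model + d_model             # output projection with bias
ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
layer_norms = 2 * (2 * d_model)                    # weight and bias for ln1 and ln2
per_block = attn_heads + attn_out + ffn + layer_norms

embeddings = vocab_size * d_model + ctx_len * d_model  # token + position embeddings
final_ln = 2 * d_model
lm_head = d_model * vocab_size + vocab_size            # final projection with bias

total = n_layers * per_block + embeddings + final_ln + lm_head
print(total)

# Loss of a predictor that assigns equal probability to every character
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 2))
```

Under the same assumption, a model that spread its probability evenly over all characters would score about ln(893) ≈ 6.79, which is in the same range as the train loss of 7.0461 logged at step 0.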

In [17]:
# save our model
import os
if not os.path.exists('model.pth'):
    torch.save(model.state_dict(), 'model.pth')
else:
    print('Model file (model.pth) already exists!  Saving under a different name model_other.pth.')
    torch.save(model.state_dict(), 'model_other.pth')

In [18]:
# set the device first: Decoder(config) reads config.device when it is constructed
config.device = 'cuda:0'  # change this to 'cpu' if you don't have access to a GPU

# to load a saved model, uncomment the below code
# model = Decoder(config)
# model.load_state_dict(torch.load('model.pth', map_location=config.device))
# model = model.to(config.device)  # keep the weights on the same device as the inputs

model.eval()
context = torch.tensor([encode("The Roman Empire lasted from", str_to_int)], dtype=torch.long, device=config.device)
print(decode(model.generate(context, max_len=512)[0].tolist(), int_to_str))


The Roman Empire lasted from The Objectivist states international use of diplomatic animals, Turkish early Discovery earned him a switch about the creation of the Thirteenth Century. Theologians caught little prisms, long have become one of the diseases occupying humane either being curification or humans. Turkish systems can act as animated subjective tensor, symbols and elastic in comparison to computers in previous biographies and neighboring skin.

Bibliography

"The birth of animals, walkes and the choice of animals, right, arche


In [19]:
context = torch.tensor([encode("Transformer models are described by", str_to_int)], dtype=torch.long, device=config.device)
print(decode(model.generate(context, max_len=512)[0].tolist(), int_to_str))

Transformer models are described by shear when the quality of the two mass-powered Reflective Reflectivity Metal sided for this reason surrounding the measures because of their health beyond mass currents.

The formulas of the interaction of the number of variations require in cilia, with each of surviving sodium in external water. The silver representation with reaction and basal organization. This circulation is either related to the Armenian genocide, with the secondary equivalent to title or the costs of two percent in the origins of Cal
