In [2]:
# imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import os
import einops
import math
import numpy as np
import matplotlib.pyplot as plt
# Use MPS if available
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")


In [3]:
# generate/read corpus
read_data = False
if read_data: # already have the corpus preprocessed
    with open('corpus.txt', 'r', encoding='utf-8') as f:
        corpus = f.read()
else:
    # import the text data
    os.chdir('data')

    corpus = ''
    # read the texts
    for file in os.listdir():
        if file.endswith(".txt"):
            with open(file, 'r', encoding='utf-8') as f:
                text = f.read()
            corpus += text
            # print(text[:100])

    os.chdir('..') # go back one level

The first step is to tokenize the corpus. We will use the ByteLevelBPETokenizer from the tokenizers library.
A byte-level BPE tokenizer applies byte-pair encoding to the UTF-8 bytes of the text, rather than working directly at the word, character, or Unicode code point level.
It sees text as raw bytes (values 0–255) and builds tokens from frequent byte patterns.

- Every input text is first split into its UTF-8 bytes.
- Common byte patterns (e.g., 'tion', 'pre', 'com') are merged into bigger tokens.
- Rare characters or unusual sequences fall back to raw bytes, so encoding never fails on unseen input.

Example:
"Dante" → might be tokenized into something like ['D', 'ant', 'e'] or individual bytes, depending on training.

Byte-level tokenizers can handle any text, so a rare character never causes an out-of-vocabulary error.
They are relatively robust to noisy data (e.g., typos).
However, they can produce more tokens per sentence (depending on the vocabulary size selected), and training the tokenizer takes some time.
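
To make the byte-level merging described above concrete, here is a minimal sketch of a single BPE merge step in plain Python. It is only an illustration of the idea, not how the tokenizers library is implemented; the helper names `most_frequent_pair` and `merge_pair` and the toy string are made up for this example.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Return the most common adjacent pair of tokens in seq."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of pair with new_token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

text = "mezzo del mezzo"
byte_ids = list(text.encode("utf-8"))      # start from raw bytes (0-255)

pair = most_frequent_pair(byte_ids)        # most frequent adjacent byte pair
merged = merge_pair(byte_ids, pair, 256)   # 256 = first id after the byte range

print(len(byte_ids), "bytes ->", len(merged), "tokens after one merge")

# A non-ASCII character simply becomes several bytes, so nothing is ever "unknown"
print("non-ASCII example:", list("v\u00efolle".encode("utf-8")))
```

A real tokenizer repeats this merge step until the requested vocabulary size is reached and stores the merges so they can be replayed at encoding time.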



Another popular tokenizer is SentencePiece, which is also trained on a corpus and generates a vocabulary of subwords. Given that the text I am training on uses many archaic words and has high lexical diversity in general, a byte-level BPE tokenizer is probably the best option, since it can always fall back to raw bytes for unseen words.

In [4]:
## tokenizer
from tokenizers import ByteLevelBPETokenizer
train=False
if train:
    # Train the tokenizer
    vocab_size = 3000

    # Save your full corpus string to a file
    with open("corpus.txt", "w", encoding="utf-8") as f:
        f.write(corpus)

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files="corpus.txt", vocab_size=vocab_size, min_frequency=2)


    # Encode a sentence and test
    encoded = tokenizer.encode("Nel mezzo del cammin di nostra vita")
    print("Token IDs:", encoded.ids)
    print("Tokens:   ", encoded.tokens)

    decoded = tokenizer.decode(encoded.ids)
    print("Decoded:  ", decoded)

    # Ensure directory exists
    os.makedirs("poet_tokenizer", exist_ok=True)

    # Save vocab + merges files to a directory
    tokenizer.save_model("poet_tokenizer")

else:
    tokenizer = ByteLevelBPETokenizer(
        "poet_tokenizer/vocab.json",
        "poet_tokenizer/merges.txt"
    )


In [5]:
# tokenize the text
tokens = tokenizer.encode(corpus).ids

In [6]:
# create inputs and targets
from torch.utils.data import Dataset, DataLoader
context_length = 128  # number of tokens per training sequence

stride = context_length//16  # Shift the window by 1/16 of the context length, so consecutive windows overlap heavily
inputs = []
targets = []
for i in range(0, len(tokens) - context_length, stride):
    x = tokens[i : i + context_length]
    y = tokens[i + 1 : i + context_length + 1]
    inputs.append(torch.tensor(x, dtype=torch.long))
    targets.append(torch.tensor(y, dtype=torch.long))

# create the dataset
class PoetDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

dataset = PoetDataset(inputs, targets)
batch_size = 32
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
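
As a sanity check on the windowing logic above, here is a toy version with integers standing in for token IDs. The sizes (`context_length = 8`, a stride of a quarter window) are chosen only to keep the printout small and are not the values used in this notebook.

```python
tokens = list(range(20))      # stand-in for token IDs
context_length = 8
stride = context_length // 4  # illustrative value: windows start 2 tokens apart

windows = []
for i in range(0, len(tokens) - context_length, stride):
    x = tokens[i : i + context_length]
    y = tokens[i + 1 : i + context_length + 1]  # same window shifted by one token
    windows.append((x, y))

x0, y0 = windows[0]
x1, _ = windows[1]
print("input :", x0)
print("target:", y0)
print("tokens shared by windows 0 and 1:", len(set(x0) & set(x1)))
```

Each target is the input shifted by one position, and a stride smaller than the context length means neighbouring samples share most of their tokens.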

For positional encoding, I will use RoPE (Rotary Positional Embedding).
RoPE is a technique to add positional information to tokens inside the attention mechanism, without a separate table of positional embeddings.

- Instead of adding absolute positional embeddings (like in GPT-2), RoPE rotates the query and key vectors by an angle that depends on the token position.

It was introduced in the RoFormer paper (Su et al., 2021, https://arxiv.org/abs/2104.09864).

### Why RoPE?
In transformers, self-attention without positional information is permutation-equivariant: reordering the input tokens just reorders the outputs, so the layer does not know token order naturally.
To model sequences, you must inject position information somehow.

- Old solution: Add learned positional embeddings to token embeddings (e.g., GPT-2, BERT).
- RoPE solution: Instead of adding extra vectors, rotate the queries and keys based on position → implicit position information inside attention itself. The rotation is constructed so that the separation between two tokens is the only thing that dictates the added angle between their rotated vectors, so the query-key dot product depends on position only through the offset.

RoPE tends to work better with long sequences, and it adds no learned parameters (only precomputed cos/sin tables).


### How RoPE works
At every layer:
- You compute query (q) and key (k) vectors normally.
- Then you apply a rotation to these vectors depending on the token's position i.

The rotation acts on each (even, odd) pair of dimensions as a 2D rotation (equivalently, multiplication by a complex phase):

$q'[i] = q[i] * \cos(\theta[i]) + Rotate(q[i]) * \sin(\theta[i])$

$k'[i] = k[i] * \cos(\theta[i]) + Rotate(k[i]) * \sin(\theta[i])$

where Rotate(x) swaps the two dimensions of each pair and negates one of them:
- even dims: $-x_{odd}$
- odd dims: $x_{even}$

and $\theta[i]$ is a precomputed angle that depends on the position i and on the dimension pair k: $\theta_{i,k} = i / \Theta^{2k/d_k}$.
So queries and keys "encode" position through their angles.
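
The relative-position property can be checked numerically on a single (even, odd) pair. The sketch below uses plain Python and made-up values for `q`, `k` and the per-position angle `freq`; it only illustrates the rotation described above and is not the module used in this notebook.

```python
import math

def rotate_pair(v, angle):
    """Rotate a 2D vector (one even/odd pair) by the given angle."""
    x_even, x_odd = v
    return (x_even * math.cos(angle) - x_odd * math.sin(angle),
            x_even * math.sin(angle) + x_odd * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

freq = 0.1          # angle added per position for this pair (arbitrary for the demo)
q = (1.0, 2.0)      # one (even, odd) pair of a query
k = (0.5, -1.5)     # the matching pair of a key

def score(m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return dot(rotate_pair(q, m * freq), rotate_pair(k, n * freq))

# Same offset m - n = 2 at very different absolute positions
print(score(3, 1), score(10, 8), score(103, 101))
# Different offsets give different scores
print(score(3, 1), score(3, 2))
```

The first line prints three values that agree up to floating-point error, which is the sense in which the attention score only "sees" the distance between tokens.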


### Pros/Cons of RoPE
- Pros:
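    - Tends to generalize to longer contexts better than learned absolute embeddings (extrapolation far beyond the training length is still not guaranteed)
    - Memory efficient (no learned positional parameters)
    - Simple math, cheap to compute
    - Works better for long-text generation, where tokens need to attend to others many positions away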
- Cons:
    - Harder to interpret than positional embeddings
    - Not trivial to modify (if you want to shift positions, you need to recompute the angles)
    - Not ideal for masked or shuffled sequences (for BERT, for example)






In [None]:
# RoPE
class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, theta: float, d_k: int, max_seq_len: int, device=None):
        super().__init__()

        self.theta = theta              # Θ: a hyperparameter (usually 10,000), controls frequency scaling
        self.d_k = d_k                  # Dimensionality of each head (must be even)
        self.max_seq_len = max_seq_len  # Maximum sequence length the model will support

        # Create a tensor of positions from 0 to max_seq_len - 1
        idx = torch.arange(0, max_seq_len, device=device, dtype=torch.float32)  # shape: (max_seq_len,)

        # Compute the scaling denominator for each pair of dimensions (0, 2, 4, ..., d_k-2)
        # This comes from the formula θ_{i,k} = i / Θ^{2k/d_k}, matching sinusoidal positional encoding scaling
        denom = theta ** (torch.arange(0, d_k, 2, device=device, dtype=torch.float32) / d_k)  # shape: (d_k // 2,)

        # Broadcast i / Θ^{2k/d} to get θ_{i,k}, shape: (max_seq_len, d_k // 2)
        theta_i_k = idx.unsqueeze(1) / denom.unsqueeze(0)

        # Precompute cos(θ_{i,k}) and sin(θ_{i,k}), used during forward pass
        cos_cache = torch.cos(theta_i_k)  # shape: (max_seq_len, d_k // 2)
        sin_cache = torch.sin(theta_i_k)

        # Register as non-persistent buffers (they won’t be saved in model checkpoints)
        self.register_buffer("cos_cache", cos_cache, persistent=False)
        self.register_buffer("sin_cache", sin_cache, persistent=False)

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor=None) -> torch.Tensor:
        # x: (batch_size, seq_len, d_k)
        seq_len = x.shape[-2]
        d_k = x.shape[-1]
        assert d_k % 2 == 0          # Rotation requires even dimension (so we can split into pairs)
        assert d_k == self.d_k       # Input must match expected dimension

        # Get cos/sin values for the given token positions (custom for padding/masking)
        if token_positions is not None:
            cos_values = self.cos_cache[token_positions, :]  # (batch, seq_len, d_k // 2)
            sin_values = self.sin_cache[token_positions, :]
        else:
            # Use first `seq_len` positions from cache
            cos_values = self.cos_cache[:seq_len, :]         # (seq_len, d_k // 2)
            sin_values = self.sin_cache[:seq_len, :]

        # Rearrange the input tensor to group even/odd values into 2D pairs
        # x shape: (..., seq_len, d_k)
        # becomes: (..., seq_len, d_k // 2, 2), where last dim = [even, odd]
        x_split = einops.rearrange(
            x, "... seq_len (d_split pair) -> ... seq_len d_split pair",
            d_split=self.d_k // 2, pair=2
        )
        even_x = x_split[..., 0]  # shape: (..., seq_len, d_k // 2), x_{2j}
        odd_x = x_split[..., 1]  # shape: (..., seq_len, d_k // 2), x_{2j+1}

        # Apply the 2x2 rotation matrix to each [even, odd] pair:
        # [x_even'] = cos * x_even - sin * x_odd
        # [x_odd']  = sin * x_even + cos * x_odd
        x_rotate_even = even_x * cos_values - odd_x * sin_values  # shape: (..., seq_len, d_k // 2)
        x_rotate_odd  = even_x * sin_values + odd_x * cos_values

        # Stack even and odd components back into shape (..., seq_len, d_k // 2, 2)
        x_rotated = torch.stack([x_rotate_even, x_rotate_odd], dim=-1)

        # Flatten last two dims to get back original shape (..., seq_len, d_k)
        x_rotated = einops.rearrange(
            x_rotated, "... seq_len d_split pair -> ... seq_len (d_split pair)",
            d_split=self.d_k // 2, pair=2
        )

        return x_rotated
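
To get a feel for the angles the cache above stores, the following sketch lists how fast each dimension pair rotates. It assumes the hyperparameters used later in this notebook (head dimension 256 // 4 = 64, Θ = 10000, context length 128) and uses plain Python rather than the module itself.

```python
import math

theta = 10000.0       # same base as used when building the RoPE module
d_k = 64              # head dimension: d_model // n_heads = 256 // 4
context_length = 128

# Denominators Theta^(2k/d_k) for pairs k = 0 .. d_k/2 - 1
denoms = [theta ** (2 * k / d_k) for k in range(d_k // 2)]

# Number of positions a pair needs to complete one full rotation (2*pi radians)
wavelengths = [2 * math.pi * d for d in denoms]

full_turns = sum(w <= context_length for w in wavelengths)
print(f"fastest pair: wavelength {wavelengths[0]:.2f} positions")
print(f"slowest pair: wavelength {wavelengths[-1]:.0f} positions")
print(f"{full_turns} of {len(denoms)} pairs complete a full turn within {context_length} tokens")
```

The fast pairs distinguish neighbouring tokens, while the slow pairs change only slightly across the whole context window.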


## Multi-Head Self-Attention
A key layer that underpins the transformer architecture is the Multi-Head Self-Attention (MHSA) layer.

Multi-Head Self-Attention is the main mechanism that lets a transformer:

- Look at different parts of a sequence at the same time
- Mix information from different positions flexibly
- Capture different types of relationships (short-term, long-term, syntactic, semantic)

Instead of just one "attention," you have many attention heads:
- Each head learns different attention patterns independently --> The model learns more patterns, more flexibly

### How MHSA Works (High-level view)
1. Project the inputs to queries, keys and values (Q, K, V matrices) with a single linear layer
2. Split Q, K and V into heads, and apply RoPE to the queries and keys of each head
3. Compute the scaled dot product attention between queries and keys: score = $\frac{QK^T}{\sqrt{d_k}}$ ($d_k$ is the per-head dimension of the queries and keys)
4. Apply the causal mask, so that future tokens cannot influence the current token. Then apply softmax to the scores to obtain the attention weights
5. For each head, compute the weighted sum of the values using the attention weights, then concatenate the head outputs into a single vector of the embedding dimension, which now carries information from the tokens before it
6. Apply a final linear layer to this vector to produce the output of MHSA
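
Steps 3 to 5 can be sketched for a single head in plain Python, which makes the effect of the causal mask easy to inspect. This is a toy illustration with hand-picked 3x2 matrices, not the batched PyTorch implementation used in this notebook; the function names are made up for the example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention on lists of row vectors (T x d)."""
    T, d = len(Q), len(Q[0])
    out, weights = [], []
    for i in range(T):
        # Scaled dot-product scores; future positions (j > i) are masked with -inf
        scores = [
            sum(Q[i][c] * K[j][c] for c in range(d)) / math.sqrt(d) if j <= i
            else float("-inf")
            for j in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        out.append([sum(w[j] * V[j][c] for j in range(T)) for c in range(d)])
    return out, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first token can only attend to itself, and every row sums to 1.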


### Advantages of MHSA
- Each head can focus on a specific kind of relationship between words (e.g. gender, plural vs singular, subject, verb, emotional tone, etc.)
- Some heads can focus on long-range dependencies, while others focus on short-range dependencies
- All heads are computed in parallel as batched matrix multiplications, which maps well to GPUs and other accelerators

However, the cost must be kept in mind: computation scales as O(seq_len$^2$ * $d_{embed}$) and memory as O(seq_len$^2$). For long sequences, approaches such as FlashAttention, Performer, and Longformer have been developed to reduce this cost.
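
A quick back-of-the-envelope calculation shows what the quadratic scaling means in practice. The sketch below assumes float32 scores and the batch size, head count and model width used in this notebook; `attention_cost` is a hypothetical helper that ignores everything except the score matrix and the two attention matmuls.

```python
def attention_cost(batch, n_heads, seq_len, d_model, bytes_per_value=4):
    """Rough per-layer cost of materializing the attention score matrix."""
    score_entries = batch * n_heads * seq_len * seq_len
    score_bytes = score_entries * bytes_per_value          # float32 by default
    # Two matmuls (QK^T and attn @ V), each ~ seq_len^2 * d_model multiply-adds per sample
    matmul_macs = 2 * batch * seq_len * seq_len * d_model
    return score_bytes, matmul_macs

for T in (128, 256, 512):
    mem, macs = attention_cost(batch=32, n_heads=4, seq_len=T, d_model=256)
    print(f"T={T:4d}: scores {mem / 2**20:6.1f} MiB, ~{macs / 1e9:.2f} G multiply-adds")
```

Doubling the sequence length quadruples both numbers, which is why long-context models need the more memory-frugal attention variants mentioned above.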

In [8]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, n_heads, n_embd, dropout=0.1, rope=None, device=None, dtype=None):
        super().__init__()
        assert n_embd % n_heads == 0
        self.n_heads = n_heads
        self.head_size = n_embd // n_heads

        self.qkv_proj = nn.Linear(n_embd, 3 * n_embd, device=device, dtype=dtype) # can compute the Q,K,V matrices all in 1 matrix mult, more efficient
        self.out_proj = nn.Linear(n_embd, n_embd,device=device, dtype=dtype)
        self.dropout = nn.Dropout(dropout,)
        self.rms_norm = nn.RMSNorm(n_embd,device=device, dtype=dtype)

        self.rope = rope  # Optional RotaryPositionalEmbedding instance

    def forward(self, x, mask=None):
        B, T, C = x.shape
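        x = self.rms_norm(x)  # note: Encoder_Block also applies norm1 before calling this layer, so the input is normalized twice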

        qkv = self.qkv_proj(x)  # (B, T, 3C)
        Q, K, V = qkv.chunk(3, dim=-1)  # Each: (B, T, C)

        Q = einops.rearrange(Q, 'B T (h d) -> B h T d', h=self.n_heads)
        K = einops.rearrange(K, 'B T (h d) -> B h T d', h=self.n_heads)
        V = einops.rearrange(V, 'B T (h d) -> B h T d', h=self.n_heads)

        # Apply RoPE if available
        if self.rope is not None:
            Q = Q.reshape(-1, T, self.head_size)
            K = K.reshape(-1, T, self.head_size)
            Q = self.rope(Q)
            K = self.rope(K)
            Q = Q.view(B, self.n_heads, T, self.head_size)
            K = K.view(B, self.n_heads, T, self.head_size)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_size ** 0.5)  # (B, h, T, T)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)

        out = torch.matmul(attn, V)  # (B, h, T, d)
        out = einops.rearrange(out, 'B h T d -> B T (h d)')
        return self.out_proj(out)

### Feedforward using SwiGLU activation
After MHSA, the feedforward layer uses the SwiGLU activation, a variant of the GLU (gated linear unit) in which the gate uses SiLU (Swish) instead of a sigmoid: one linear projection of the input, passed through SiLU, multiplies a second linear projection elementwise.

The reason to have a FF layer is that after self-attention, the model has mixed and exchanged information across tokens, but each token still needs a non-linear transformation of its own to refine, filter, compress, or enhance what it "learned". This adds expressiveness beyond the weighted averaging done by attention and lets the model build very rich internal features (sharp transformations).

The FF works by:
- Take the token embedding output from self-attention.
- Pass it through a small MLP (multi-layer perceptron):
    - Expand the hidden dimension (commonly 4x; SwiGLU variants often use about 8/3 x, since they have three weight matrices instead of two)
    - Apply a nonlinearity (activation function).
    - Project back down to the original size

#### Why SwiGLU specifically?
SwiGLU = SiLU activation + gating mechanism
It often improves over plain activations such as ReLU or GELU by introducing a soft gating mechanism.
- SwiGLU lets neurons softly "turn on/off" information, not just push it forward blindly.
- SiLU is smooth (continuous derivative), with no sharp cutoff like ReLU
- Widely adopted: LLaMA, PaLM, and Mistral (and Zephyr, which is fine-tuned from Mistral) all use SwiGLU in their feedforward layers, following reports of better quality than ReLU or GELU feedforward layers at similar compute.
- It is a common choice in deep transformers (30–80 layers)
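
The difference between a hard cutoff and a soft gate is easy to see on scalars. The sketch below is plain Python with hypothetical helper names; `swiglu_unit` models a single hidden unit, where the gate and value pre-activations would normally come from two separate linear projections of the input.

```python
import math

def relu(x):
    return max(0.0, x)

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_unit(gate_pre, value_pre):
    """One hidden unit of SwiGLU: the SiLU-activated gate scales the value."""
    return silu(gate_pre) * value_pre

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  silu={silu(x):+.3f}")

# A strongly negative gate pre-activation nearly blocks the value;
# a positive one lets it through, scaled by the gate.
print(swiglu_unit(4.0, 1.0), swiglu_unit(-4.0, 1.0))
```

Unlike ReLU, SiLU is slightly negative for small negative inputs and has no kink at zero, so the gate changes smoothly as its input moves.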


In [None]:
class SwiGLU(nn.Module): # SwiGLU feedforward
    def __init__(self, d_model, d_ff=None, dropout=0.1, device=None, dtype=None):
        super().__init__()

        if d_ff is None:
            # default: ~8/3 * d_model, rounded up to a multiple of 64
            d_ff = 64 * math.ceil((8 / 3) * d_model / 64)

        self.W1 = nn.Parameter(torch.empty(d_ff, d_model, device=device, dtype=dtype))  # gate projection
        self.W3 = nn.Parameter(torch.empty(d_ff, d_model, device=device, dtype=dtype))  # value projection
        self.W2 = nn.Parameter(torch.empty(d_model, d_ff, device=device, dtype=dtype))  # output projection
        self.dropout = nn.Dropout(dropout)

        # Xavier-style std as a plain float; weights are truncated at +/- 3 std
        std = math.sqrt(2 / (d_model + d_ff))

        for W in (self.W1, self.W3, self.W2):
            torch.nn.init.trunc_normal_(W, mean=0.0, std=std, a=-3 * std, b=3 * std)

    def forward(self, x):
        gate = F.silu(x @ self.W1.transpose(-2, -1))   # (..., d_ff)
        value = x @ self.W3.transpose(-2, -1)          # (..., d_ff)
        hidden = gate * value                          # elementwise multiplication (gating)
        return self.dropout(hidden @ self.W2.transpose(-2, -1))
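
The 8/3 factor in the default hidden size is chosen so that the three SwiGLU matrices cost roughly as many parameters as the two matrices of a conventional 4x feedforward layer. The small calculation below checks this for `d_model = 256`; `swiglu_hidden` is a hypothetical helper that mirrors the rounding rule in the class above.

```python
import math

def swiglu_hidden(d_model, multiple=64):
    """Hidden size rule: 8/3 * d_model, rounded up to a multiple of 64."""
    return multiple * math.ceil((8 / 3) * d_model / multiple)

d_model = 256
d_ff_swiglu = swiglu_hidden(d_model)          # three matrices: W1, W3, W2
d_ff_classic = 4 * d_model                    # two matrices: up and down projection

params_swiglu = 3 * d_model * d_ff_swiglu
params_classic = 2 * d_model * d_ff_classic

print("SwiGLU hidden size:", d_ff_swiglu)
print("SwiGLU FF parameters :", params_swiglu)
print("classic FF parameters:", params_classic)
```

The two counts differ only because of the rounding up to a multiple of 64.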


### Language Model
Putting everything together, into transformer blocks, and ultimately a transformer language model.
Note that as the input propagates through the transformer, there are skip connections to make training easier and avoid vanishing gradients.

In [7]:
# define layers of the Language Model

class Encoder_Block(nn.Module):
    def __init__(self, d_model, d_ff, n_heads, dropout=0.1, rope=None,device=None, dtype=None):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model, device=device, dtype=dtype)
        self.norm2 = nn.RMSNorm(d_model, device=device, dtype=dtype)
        self.attn = MultiHeadSelfAttention(n_heads=n_heads, n_embd=d_model,rope=rope, dropout=dropout,device=device, dtype=dtype)
        self.ff = SwiGLU(d_model=d_model,d_ff=d_ff,dropout=dropout,device=device, dtype=dtype)

    def forward(self, x,mask=None):
        x = x + self.attn(self.norm1(x),mask=mask)
        x = x + self.ff(self.norm2(x))
        return x

class TransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, num_layers=6, max_len=context_length, dropout=0.1, device=None, dtype=None):
        super().__init__()
        self.vocab_size = vocab_size
        self.token_embed = nn.Embedding(vocab_size, d_model, device=device, dtype=dtype)
        self.dropout = nn.Dropout(dropout)

        rope = RotaryPositionalEmbedding(theta=10000, d_k=d_model//n_heads, max_seq_len=max_len, device=device)

        self.blocks = nn.ModuleList(  [Encoder_Block(d_model=d_model, d_ff=None, n_heads=n_heads, dropout=dropout, rope=rope, device=device, dtype=dtype) for _ in range(num_layers)])

        self.out_proj = nn.Linear(d_model, vocab_size, device=device, dtype=dtype)
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len, device=device)).view(1, 1, max_len, max_len)) # lower triangular matrix. Precompute the mask once, and keep it stored in the model's buffer.
    def forward(self, input_ids):
        B, T = input_ids.shape # batch size, seq len
        x = self.token_embed(input_ids)
        x = self.dropout(x)

        mask = self.mask[:, :, :T, :T]

        for block in self.blocks:
            x = block(x,mask=mask)
        x = self.out_proj(x)
        return x

    @torch.no_grad()
    def generate(self, input_ids, max_new_tokens, temperature=1.0, top_k=None):
        self.eval()
        for _ in range(max_new_tokens):
            if input_ids.size(1) > self.mask.size(-1):
                input_ids = input_ids[:, -self.mask.size(-1):]  # truncate left

            logits = self(input_ids)  # (B, T, vocab_size)
            logits = logits[:, -1, :] / temperature  # take logits from last time step

            if top_k is not None:
                top_k_logits, _ = torch.topk(logits, top_k)
                logits[logits < top_k_logits[:, [-1]]] = -float('Inf')

            probs = torch.softmax(logits.clamp(-100, 100), dim=-1) # (B, vocab_size). Clamp to avoid NaNs
            next_token = torch.multinomial(probs, num_samples=1)  # (B, 1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
        return input_ids
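
The sampling step inside `generate` combines temperature scaling with top-k filtering. The following plain-Python sketch reproduces that logic on a hand-written list of logits so the resulting probabilities can be inspected; `sample_top_k` is a hypothetical helper, and it omits the clamping used in the method above.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Sample an index from logits after temperature scaling and top-k filtering."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        threshold = sorted(scaled, reverse=True)[top_k - 1]   # k-th largest logit
        scaled = [s if s >= threshold else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # masked entries become 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

logits = [2.0, 1.0, 0.5, -1.0, -3.0]
index, probs = sample_top_k(logits, temperature=1.0, top_k=2)
print([round(p, 3) for p in probs])   # only the two largest logits keep probability mass

_, sharp = sample_top_k(logits, temperature=0.5)
_, flat = sample_top_k(logits, temperature=2.0)
print(round(sharp[0], 3), round(flat[0], 3))   # lower temperature concentrates mass on the top token
```

Lower temperatures and smaller `top_k` make the generated verse more conservative; higher values make it more varied but also more error-prone.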


### Create the model

In [35]:
# ===== Hyperparameters =====
vocab_size = tokenizer.get_vocab_size()
d_model = 256
n_heads = 4
num_layers = 4
context_length = 128
learning_rate = 3e-4
num_epochs = 1

# ===== Instantiate the Model =====
model = TransformerLM(
    vocab_size=vocab_size,
    d_model=d_model,
    n_heads=n_heads,
    num_layers=num_layers,
    max_len=context_length,
    device=device,
    dtype=torch.float32  # torch.float16 fails during training on mps, and torch.bfloat16 is not available on mps
).to(device)

# ===== Loss and Optimizer =====
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
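
For a sense of scale, the parameter count of this configuration can be worked out by hand. The sketch below mirrors the layers defined in this notebook; `count_parameters` is a hypothetical helper, and `vocab_size=3000` is an assumption (the value passed when training the tokenizer), whereas the notebook reads the actual size from `tokenizer.get_vocab_size()`.

```python
import math

def count_parameters(vocab_size, d_model, num_layers):
    """Parameter count for the architecture defined in this notebook."""
    d_ff = 64 * math.ceil((8 / 3) * d_model / 64)     # SwiGLU default hidden size
    attn = d_model * 3 * d_model + 3 * d_model        # qkv_proj weight + bias
    attn += d_model * d_model + d_model               # out_proj weight + bias
    attn += d_model                                   # RMSNorm inside attention
    ff = 3 * d_model * d_ff                           # W1, W3, W2 (no biases)
    block = attn + ff + 2 * d_model                   # plus norm1 and norm2
    embed = vocab_size * d_model                      # token embedding
    head = d_model * vocab_size + vocab_size          # output projection + bias
    return {"per_block": block, "total": num_layers * block + embed + head}

# vocab_size=3000 is an assumption; the number of heads does not change the count
counts = count_parameters(vocab_size=3000, d_model=256, num_layers=4)
print(counts)
```

On a real model the same number can be obtained with `sum(p.numel() for p in model.parameters())`; the RoPE tables and the causal mask are buffers, so they are not counted.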



Check that everything is on the correct device

In [36]:
# check that the model is on the correct device
def check_model_device(model):
    # Compare device types rather than device objects: tensors report "mps:0" while
    # `device` is "mps", so a direct != comparison would flag every tensor.
    for name, param in model.named_parameters():
        if param.device.type != device.type:
            print(f"[PARAM] {name} is on {param.device}, expected {device}")
    for name, buf in model.named_buffers():
        if buf.device.type != device.type:
            print(f"[BUFFER] {name} is on {buf.device}, expected {device}")

check_model_device(model)

### Training

In [38]:
# ===== Training Loop =====
import tqdm
import time

print(f"# of samples: {len(dataset)} | Batch size: {batch_size} | Expected batches: {len(dataloader)}")

generate_every = 30  # generate every X batches
model.train()
for epoch in range(num_epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
    total_loss = 0
    print(f"Starting training loop with {len(dataloader)} batches per epoch")
    for batch_idx, batch in enumerate(pbar):
        # print(f"Batch {batch_idx}/{len(dataloader)}", end='\r')
        # t0 = time.time()
        input_batch, target_batch = batch
        input_batch = input_batch.to(device)     # shape: (B, T)
        target_batch = target_batch.to(device)   # shape: (B, T)
        # print(f"Batch {batch_idx} took {time.time() - t0:.2f}s to load and move", flush=True)
        optimizer.zero_grad()

        logits = model(input_batch)              # shape: (B, T, vocab_size)
        loss = criterion(logits.view(-1, vocab_size), target_batch.view(-1))

        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pbar.set_postfix(loss=loss.item())


        # Generate a sample at batch 0 and then every `generate_every` batches
        if batch_idx % generate_every == 0:
            if batch_idx == 0:
                print("Triggering sample generation at batch 0...")
            model.eval()
            start_text = "Nel mezzo"
            start_ids = tokenizer.encode(start_text).ids  # just the token IDs
            input_ids = torch.tensor([start_ids], device=device)

            generated_ids = model.generate(
                input_ids=input_ids,
                max_new_tokens=50,
                temperature=1.0,
                top_k=20
            )

            decoded = tokenizer.decode(generated_ids[0].tolist())
            print(f"\n[Sample @ epoch {epoch+1}, batch {batch_idx}]\n{decoded}\n")
            model.train()

    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1} finished. Average loss: {avg_loss:.4f}")



# of samples: 42986 | Batch size: 32 | Expected batches: 1344


Epoch 1/1:   0%|          | 0/1344 [00:00<?, ?it/s]

Starting training loop with 1344 batches per epoch


Epoch 1/1:   0%|          | 0/1344 [00:00<?, ?it/s, loss=3.42]

Triggering sample generation at batch 0...


Epoch 1/1:   0%|          | 1/1344 [00:02<1:00:00,  2.68s/it, loss=3.42]


[Sample @ epoch 1, batch 0]
Nel mezzo ’l cielo;
so che già son per li altri motto,
di tante foglie, ma ’l tuo non move,

di fuor di qua da noi si diserra».




Canto XLIV



Epoch 1/1:   2%|▏         | 32/1344 [00:07<07:36,  2.87it/s, loss=3.46] 


[Sample @ epoch 1, batch 30]
Nel mezzo cerchio.

Di vïolle nebbia con le stelle,
e la pioggia, e più non saggiato,
ch’io dissi: «O tu, se’ tu ’l buon Tomena a me non m’in



Epoch 1/1:   5%|▍         | 62/1344 [00:10<06:44,  3.17it/s, loss=3.42]


[Sample @ epoch 1, batch 60]
Nel mezzo,
non per color ch’elli in su la gran fascia;

ma poi, come quei, sem veduto il petto
più ch’ogne vergogna la sete, e non s’affi
m’affetto di me



Epoch 1/1:   7%|▋         | 92/1344 [00:14<07:08,  2.92it/s, loss=3.37]


[Sample @ epoch 1, batch 90]
Nel mezzo,
et non pur a Dio, né d'invola.


40

-  madonna pensando non fosse in loco,
che m'avendo a la lingua, et l'auro,
che mi



Epoch 1/1:   9%|▉         | 122/1344 [00:18<06:38,  3.07it/s, loss=3.52]


[Sample @ epoch 1, batch 120]
Nel mezzo giorno
e ’l disio del cielo allarderno inferenza,
che la qualunque si può esser non s’abbandona.

E come quei che non sa chiedi!”
Certo non pur con le sue



Epoch 1/1:  11%|█▏        | 152/1344 [00:22<06:57,  2.86it/s, loss=3.38]


[Sample @ epoch 1, batch 150]
Nel mezzo
nel primo, che, di sé si raggia,
e di Calè di Carlo, e di qua e Zeno,
e di Scrismilia, di Cocito, Sucia,
poi



Epoch 1/1:  14%|█▎        | 182/1344 [00:26<06:14,  3.10it/s, loss=3.36]


[Sample @ epoch 1, batch 180]
Nel mezzo
per dar di sé onde pareva i denti,
infin ch’io nol potei in questa pesa,

non sarà mai sotto, ma per voi sola
non fosse alcun, perché da lei si noma,
e



Epoch 1/1:  16%|█▌        | 212/1344 [00:30<06:49,  2.77it/s, loss=3.47]


[Sample @ epoch 1, batch 210]
Nel mezzo giorno atteno.

A quella parte che 'n cielo, in suo valore
miserere: or veggio di lei s'ài sdegno;
ché ben m'à oggi, e 'l cor lasso riede il vero;




Epoch 1/1:  18%|█▊        | 242/1344 [00:35<07:07,  2.58it/s, loss=3.42]


[Sample @ epoch 1, batch 240]
Nel mezzo;
No, in Poco e Albero a Troiellio.

E quando futurico, e non vide
per che ’n su la gente che ’l mondo serra,
di che i’



Epoch 1/1:  20%|██        | 272/1344 [00:39<05:21,  3.34it/s, loss=3.44]


[Sample @ epoch 1, batch 270]
Nel mezzo
ch'io dagli occhi mirando il mio foco affrena;
et per me non fia quel ch'i' mi diedi
i miei pensier' che sí bei inchina;
la mia speranza de la sua luce



Epoch 1/1:  22%|██▏       | 302/1344 [00:43<05:53,  2.95it/s, loss=3.32]


[Sample @ epoch 1, batch 300]
Nel mezzo,

per li occhi e con un’acqua e scender la roccia,
che fece l’orlo a la Bramagna,
fin che tu l’unghie a quel che s’affisse.



Epoch 1/1:  25%|██▍       | 332/1344 [00:47<05:28,  3.08it/s, loss=3.29]


[Sample @ epoch 1, batch 330]
Nel mezzo,
in te, che ’l tempo, ch’i’ mi ritrai,

per lui sicurtà di fuor da eletti
ciò che ’l viso non t’avrei disfaccia,
non pur mo



Epoch 1/1:  27%|██▋       | 361/1344 [00:51<05:32,  2.95it/s, loss=3.33]


[Sample @ epoch 1, batch 360]
Nel mezzo
dietro a’ miei, che di sé largi,
e tutti in su la ripa dura,

tanto che si conversi percuopre la cima
di là sù, che ’l sol ti facesse
libene



Epoch 1/1:  29%|██▉       | 392/1344 [00:56<05:14,  3.02it/s, loss=3.31]


[Sample @ epoch 1, batch 390]
Nel mezzo;
et chi me, come in su l'aura scontra.


335

Dompirto in un punto le fonca,
dimmi oltra: - Non sotèrado,



Epoch 1/1:  31%|███▏      | 422/1344 [01:00<04:55,  3.12it/s, loss=3.27]


[Sample @ epoch 1, batch 420]
Nel mezzo del monte
per un canternai mirabil, e 'l suo lume, un lauro
col sfaccender tela, e 'l parlar che mai nol posso,
come a te, ch'io sento il ciel saldo



Epoch 1/1:  34%|███▎      | 452/1344 [01:04<05:14,  2.83it/s, loss=3.22]


[Sample @ epoch 1, batch 450]
Nel mezzo
da nessun per suo piacer di me l'altro accusale;
et s'io veggio in lei che 'l tempo dura
stutte adver l'onde sian fornito.

Piú ch'al



Epoch 1/1:  36%|███▌      | 482/1344 [01:08<04:48,  2.98it/s, loss=3.27]


[Sample @ epoch 1, batch 480]
Nel mezzo,
et quel dí in piú la vista al cielo,
che di mille sospiri in ogni radice.

Dico ch'io sempre ragiono il mio core,
tutte, onde spera ogni vertute,
che del mio non è



Epoch 1/1:  38%|███▊      | 512/1344 [01:11<03:56,  3.51it/s, loss=3.24]


[Sample @ epoch 1, batch 510]
Nel mezzo;

ch'al mondo non poterò com'io l'òra,
che l'aura mia, s'a veder vorrei
la segua, e l'alma, che 'l mondo fama si ringra;




Epoch 1/1:  40%|████      | 542/1344 [01:15<04:07,  3.25it/s, loss=3.15]


[Sample @ epoch 1, batch 540]
Nel mezzo spira
e di quel ch’i’ credea che mi vinse il petto;
e poi ch’i’ fu’ lassi ’l senso
può di lor: “Maria la sua radice, e tu mi vedi




Epoch 1/1:  43%|████▎     | 572/1344 [01:20<04:54,  2.62it/s, loss=3.05]


[Sample @ epoch 1, batch 570]
Nel mezzo
sì che le sue sante fronde sparte.

Passo, che Dio s’alcuna à posto infernella
che le prime creature in sù l’ali,
non per voi par che tu non vedi intero;



Epoch 1/1:  45%|████▍     | 602/1344 [01:24<04:38,  2.66it/s, loss=3.17]


[Sample @ epoch 1, batch 600]
Nel mezzo 'l mar che 'n subitamente;

l'alma súbito celeste, che per mio albergo
lasselosïata dal cor mi doglio et taccia,
sí lieve non già mai né pie' cr



Epoch 1/1:  47%|████▋     | 632/1344 [01:28<04:23,  2.70it/s, loss=3.27]


[Sample @ epoch 1, batch 630]
Nel mezzo nove,

e che ’l mondo ha di Dio s’intrina,
sì che ’l principio de la terra ond’ elli stette,
e di giù di ciò che m’apparìco;

ma



Epoch 1/1:  49%|████▉     | 662/1344 [01:33<04:09,  2.73it/s, loss=3.09]


[Sample @ epoch 1, batch 660]
Nel mezzo
là dove appar più non s’arresta;
ma non è più che ’l mondo non s’ascoso,
ma non per altro che non si parean li tondi:

però fu così mi parvor



Epoch 1/1:  51%|█████▏    | 692/1344 [01:37<03:09,  3.43it/s, loss=3.15]


[Sample @ epoch 1, batch 690]
Nel mezzo
per l’uom di là da Dio che si provide.

Finito questo e li è più d’i ciocchi;
vedi come è più che si distria;
però t’è utito ad



Epoch 1/1:  54%|█████▎    | 722/1344 [01:41<03:13,  3.22it/s, loss=3.16]


[Sample @ epoch 1, batch 720]
Nel mezzo
punfo da la lingua e ’l marito,
e li occhi avea di sé trasmutava
in tutto ciò che ’l mondo ha fatto.

Non pur a parlar pregava e ’l ciel miro;



Epoch 1/1:  56%|█████▌    | 752/1344 [01:45<03:25,  2.88it/s, loss=3.07]


[Sample @ epoch 1, batch 750]
Nel mezzo
de la sua virtù che per lei si riga,
infin che, se tu vuol che tu mi vinse,

per questa altezza fatta che di tirandi
e che ’n questo mondo in sé lunga guerra
come



Epoch 1/1:  58%|█████▊    | 782/1344 [01:49<03:27,  2.70it/s, loss=3.06]


[Sample @ epoch 1, batch 780]
Nel mezzo
per l'un per mio mal mio fastro,
e 'l bel viso e 'l riso de le gemme
con le sue stelle, e 'l bel lume spento
che madonna non súbito piú,





Epoch 1/1:  60%|██████    | 812/1344 [01:53<03:09,  2.81it/s, loss=3.06]


[Sample @ epoch 1, batch 810]
Nel mezzo
l’amor che ’l viso a la cogna giace,

così di quelli in sùbita v’era la costa;
e io in qua, sì tutto s’avvenire
e l’animo ad amante a



Epoch 1/1:  63%|██████▎   | 842/1344 [01:57<02:50,  2.95it/s, loss=3.02]


[Sample @ epoch 1, batch 840]
Nel mezzo,
ch’e’ non furon maggior sete o vede accesa,

ma ne’ corustanturosi e sanza legge
l’umana natura, e ’l serpente
là dove il suo fa venire il mena;




Epoch 1/1:  65%|██████▍   | 872/1344 [02:01<02:27,  3.19it/s, loss=3]   


[Sample @ epoch 1, batch 870]
Nel mezzo
per lo viso che ’n dietro ad avessi lece.

E la dimanda a la sua madre
de l’un d’arume ch’elli intende,
che già non potevan d’ira son quelle genti



Epoch 1/1:  67%|██████▋   | 902/1344 [02:05<02:33,  2.89it/s, loss=2.99]


[Sample @ epoch 1, batch 900]
Nel mezzo 'l volto.
Gal mondo, che s'appavaratonda, et l'alma è vinto.

Siava i rami scende ovunque alberga
un'ombra di foco, e di desire
se



Epoch 1/1:  69%|██████▉   | 932/1344 [02:09<02:25,  2.83it/s, loss=2.96]


[Sample @ epoch 1, batch 930]
Nel mezzo,

per ch’a pena il pertullo aere,
quando si provedenza, che apparieno
con li altri pescüagliaia si rigisse.

«Se tu che tu non



Epoch 1/1:  72%|███████▏  | 962/1344 [02:13<02:02,  3.12it/s, loss=2.99]


[Sample @ epoch 1, batch 960]
Nel mezzo cerchio co’ vivi.

E io dissi: «Se tu colui che ti mena?
che questi che per li occhi miei siedi,
e vedracciati li occhi ver’ noi a te stesso:

però siam



Epoch 1/1:  74%|███████▍  | 992/1344 [02:17<02:00,  2.93it/s, loss=2.98]


[Sample @ epoch 1, batch 990]
Nel mezzo ’l ciel non si movea

sì ch’ella fui di me, che non si nasconde
non suonan da sé, ma fuor carità non scuovo,

che di sopra voler voler lor pastura,




Epoch 1/1:  76%|███████▌  | 1022/1344 [02:21<02:05,  2.57it/s, loss=2.95]


[Sample @ epoch 1, batch 1020]
Nel mezzo,
per non esser cagione a la veduta

rimbomba, ma per la mente truce
di sua circïenza avea già resplende,
e d’un pan sembiante maestro:
ché ’l



Epoch 1/1:  78%|███████▊  | 1050/1344 [02:24<00:26, 11.29it/s, loss=3.04]


[Sample @ epoch 1, batch 1050]
Nel mezzo
de l’alto scende, ond’ io mossi li disfatto;

l’un converso per la mia nota mente;
per ch’io: «Figlio che sia non son colui?’.

Madonna, per



Epoch 1/1:  81%|████████  | 1082/1344 [02:31<01:39,  2.63it/s, loss=2.96]


[Sample @ epoch 1, batch 1080]
Nel mezzo cerchio,

sente incominciò: «Drizza, e questo manto:
questi vuol che s’acqui sia sì poco,
che se’ tu mi ancora di fuori».

Come ’l foco che ’nte



Epoch 1/1:  83%|████████▎ | 1112/1344 [02:35<01:20,  2.90it/s, loss=2.93]


[Sample @ epoch 1, batch 1110]
Nel mezzo ad una versa age a li occhi ch’al ver di Cristo,

infin che l’ordine ch’appoggio,
ma non farebbe il viso a tanto,
con buona obe di sé l’



Epoch 1/1:  85%|████████▍ | 1142/1344 [02:39<01:13,  2.73it/s, loss=2.84]


[Sample @ epoch 1, batch 1140]
Nel mezzo.

Ivi è l’ha come l’ali aperte,
così, a pena fuoro angeli che son rifosco;
poi cerchi mostrar ciò ch’è in sù più volte».

E io rispuosi lei:



Epoch 1/1:  87%|████████▋ | 1172/1344 [02:43<00:54,  3.15it/s, loss=2.93]


[Sample @ epoch 1, batch 1170]
Nel mezzo
per l’aere a sé, e ciascuna in me ramplo,
come il cielo il sole e ’l suo imperchio».

Poscia trasse a me: «Dio pensava così scïenza:
se non ti



Epoch 1/1:  89%|████████▉ | 1202/1344 [02:47<00:41,  3.43it/s, loss=2.87]


[Sample @ epoch 1, batch 1200]
Nel mezzo cerchio co’ suoi diede.




Inferno
Canto XXI


«Sovra Cin, quando l’ora del mio»,
cominciò el, «e ’l duca mio, quest’ ho viso:



Epoch 1/1:  92%|█████████▏| 1232/1344 [02:52<00:39,  2.85it/s, loss=2.86]


[Sample @ epoch 1, batch 1230]
Nel mezzo di neve
di lor modo che ’l suo mal de l’ardosta.

Vostroccisi dunque, con l’occhio rivo
fossero augelletto di colui che vede
cotai le membra del sole a



Epoch 1/1:  94%|█████████▍| 1262/1344 [02:56<00:26,  3.05it/s, loss=2.82]


[Sample @ epoch 1, batch 1260]
Nel mezzo
su per li occhi al ciel, che l’uom che ministra
ne l’etterno staturo, in ciò ch’a te si torce.

Io stava come già dritto appariva
per la sua virtute



Epoch 1/1:  96%|█████████▌| 1292/1344 [02:59<00:16,  3.12it/s, loss=2.78]


[Sample @ epoch 1, batch 1290]
Nel mezzo, or di poggio scusa
la lunga vita e d'uom digiunsa;
sì come al mio dir m'importuna nebbia
volse da tutti, e guardato, a tanto avanza.
A la mia vita



Epoch 1/1:  98%|█████████▊| 1322/1344 [03:03<00:07,  3.10it/s, loss=2.77]


[Sample @ epoch 1, batch 1320]
Nel mezzo.

Tosto che tutto quanto posso a le foglie
le man nostre cibietto il ver né l’anno,
e più d’un modo tutto l’altro basso,

quando l’animo mio tra ’l



Epoch 1/1: 100%|██████████| 1344/1344 [03:05<00:00,  7.25it/s, loss=2.92]

Epoch 1 finished. Average loss: 3.1405
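
To put the average loss into more intuitive terms, it can be converted to perplexity. The snippet below is a minimal sketch that assumes the reported loss is the mean per-token cross-entropy in nats (which is what `F.cross_entropy` returns by default); the variable names are only illustrative.

```python
import math

# Average loss reported at the end of epoch 1 in the log above
avg_loss = 3.1405

# Assumption: the loss is mean per-token cross-entropy measured in nats,
# so perplexity is simply exp(loss)
perplexity = math.exp(avg_loss)

print(f"perplexity: {perplexity:.1f}")
```

Under that assumption, a perplexity of roughly 23 means the model is, on average, about as uncertain as if it were choosing uniformly among 23 tokens at each step. The figure is averaged over the whole epoch, so it includes the early high-loss batches.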





### Test the model

In [46]:
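## evaluate model
model.eval()  # switch off dropout for generation

# encode the prompt and add a batch dimension
start_text = "Nel mezzo del cammin di nostra "
start_ids = tokenizer.encode(start_text).ids
input_ids = torch.tensor([start_ids], device=device)

# sample up to 200 new tokens; no gradients are needed at inference time
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=input_ids,
        max_new_tokens=200,
        temperature=1.0,
        top_k=20
    )

# decode the generated ids and print them after the prompt
decoded = tokenizer.decode(generated_ids[0].tolist())
print(f"{start_text}{decoded}\n")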

Nel mezzo del cammin di nostra , come ferma ruba, e però l’abbia;

ché ciascun di lor quattro animai si sazia
da lui di là dal sonno, s’adempio
e permutava in su l’articini.

Poi appresso in quella parte onde s’ama,
come ’l buon duca, non si mosse; ma
però che la ripa fa scema.

«S’ogne malizia ti dà, perché ’l ti redeste?».
E «Giauro misero Secognazïaco
che ’n su le spalle al



Not too bad for something that was trained on a laptop for less than 10 minutes. It can form plausible Italian words and follows a structure that resembles the input text. It has also picked up some grammar rules and how to use punctuation. With more data and a slightly larger model, it could do even better, perhaps producing genuinely good poetry.
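
The `temperature` and `top_k` arguments passed to `model.generate` above control how the next token is drawn. The sketch below illustrates the usual idea in plain Python on a toy list of logits: keep the `top_k` highest-scoring tokens, divide their logits by the temperature, and sample from the resulting softmax. It is an illustration under assumptions, not the notebook's actual `generate` code; the function name `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=20, rng=random):
    """Sample one token id from a list of logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Keep only the indices of the top_k largest logits
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Scale by temperature, then compute unnormalised softmax weights
    # (subtracting the max keeps exp() numerically stable)
    scaled = [logits[i] / temperature for i in top]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]

    # random.choices normalises the weights internally
    return rng.choices(top, weights=weights, k=1)[0]


# Toy example: 5 "tokens", only the best 2 can ever be sampled
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
samples = [sample_top_k(toy_logits, temperature=1.0, top_k=2, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in sorted(set(samples))})
```

A lower temperature sharpens the distribution towards the most likely token, while a smaller `top_k` removes the unlikely tail entirely. Both trade variety for coherence, which matters for a small model like this one.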