<h1><center>Example of training a simple Bigram Language Model</center></h1>

In [1]:
from pathlib import Path

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

from src import config
from src.data.dataset import Dataset
from src.data.tokenizer import CharTokenizer
from src.model.bigram import BigramLanguageModel
from src.model.trainer import Trainer
from src.utils.data import train_test_split
from src.utils.seed import set_seed

### Step 1: load the data

For this simple model we will use the tiny shakespeare dataset. It consists of just over 1 million characters and takes slightly more than 1 MB of disk space, so it is quite small. The goal of this repo is learning rather than training a perfect language model, so this dataset should be sufficient.

In [2]:
data_path = Path.cwd().parents[1] / config.datasets.tiny_shakespeare.file_path

In [3]:
with open(data_path, "r", encoding="utf-8") as fin:
    text = fin.read()

print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


As we can see, the text consists of dialogue blocks, each starting with the name of the speaker followed by their lines.

### Step 2: Tokenize the text

The model cannot work with characters directly, so we have to transform the sequence of characters into a sequence of indices, where each index gives the position of the character in the vocabulary.

The input 'abc' will be transformed into [1, 2, 3], given the vocabulary {'a': 1, 'b': 2, 'c': 3}.
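
The mapping can be sketched in a few lines of plain Python. This is only an illustration: `SimpleCharTokenizer` is a hypothetical stand-in, and the real `CharTokenizer` from `src.data.tokenizer` may be implemented differently. The sketch assumes the vocabulary is the sorted set of unique characters in the text.

```python
class SimpleCharTokenizer:
    """Hypothetical stand-in for a character-level tokenizer."""

    def __init__(self, text: str) -> None:
        # Assumption: the vocabulary is the sorted set of unique characters.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.stoi)

    def encode(self, text: str) -> list[int]:
        # Look up the index of every character.
        return [self.stoi[ch] for ch in text]

    def decode(self, indices: list[int]) -> str:
        # Map indices back to characters and join them.
        return "".join(self.itos[i] for i in indices)


sketch_tokenizer = SimpleCharTokenizer("hello world")
encoded = sketch_tokenizer.encode("hello")
print(encoded)
print(sketch_tokenizer.decode(encoded))
```

Encoding followed by decoding returns the original string, which is the property the rest of the notebook relies on.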

In [4]:
tokenizer = CharTokenizer(text)
data = torch.tensor(tokenizer.encode(text), dtype=torch.long)

print("Printing mapping of the 10 first characters.")
for idx in range(10):
    print(f"{text[idx]} -> {data[idx]}")

Printing mapping of the 10 first characters.
F -> 18
i -> 47
r -> 56
s -> 57
t -> 58
  -> 1
C -> 15
i -> 47
t -> 58
i -> 47


### Step 3: Prepare dataloader

First we need to split the data into two parts: train and test. The train part is used during training, and the test part during evaluation. Evaluation shows how well the trained model predicts on unseen data.

In [5]:
# 90% for training, 10% for evaluation
train_data, test_data = train_test_split(data, 0.9)
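
The implementation of `train_test_split` is not shown here. A minimal sketch of one plausible version, assuming it keeps the original order and cuts the sequence once at the given fraction (shuffling individual characters would destroy the text), could look like this. `split_sequence` is a hypothetical name used only for illustration.

```python
def split_sequence(sequence, train_fraction):
    """Hypothetical helper: keep the order and cut once at the given fraction."""
    cut = int(len(sequence) * train_fraction)
    return sequence[:cut], sequence[cut:]


toy_tokens = list(range(10))
train_part, test_part = split_sequence(toy_tokens, 0.9)
print(len(train_part), len(test_part))
```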

The dataloader creates batches of tuples, where the first element of each tuple holds the inputs and the second holds the targets. Both are needed for the training and evaluation steps.

In [6]:
model_config = config.model.small
block_size = model_config.block_size
batch_size = model_config.batch_size

# dataset class creates pairs (inputs, targets)
train_dataset = Dataset(train_data, block_size)
test_dataset = Dataset(test_data, block_size)

# dataloader creates batches of pairs efficiently
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, num_workers=1)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, num_workers=1)
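
To make the (inputs, targets) pairs concrete, here is a small sketch in plain Python. It assumes the usual language-modeling setup in which the targets are the inputs shifted one position to the right. The actual `Dataset` class in `src.data.dataset` may build its pairs differently, and `make_pair` is a hypothetical helper.

```python
def make_pair(tokens, start, block_size):
    """Hypothetical helper: targets are the inputs shifted by one position."""
    inputs = tokens[start : start + block_size]
    targets = tokens[start + 1 : start + block_size + 1]
    return inputs, targets


# Indices of "First Citi", taken from the mapping printed in Step 2.
sample_tokens = [18, 47, 56, 57, 58, 1, 15, 47, 58, 47]
sample_inputs, sample_targets = make_pair(sample_tokens, start=0, block_size=4)
print(sample_inputs)
print(sample_targets)
```

At every position the target is the character that follows the input character, which is exactly what a next-character model is trained to predict.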

### Step 4: Train model

It's pretty straightforward: create the trainer (which contains the training and evaluation logic) and train the model.

In [7]:
set_seed(config.model.seed)

model = BigramLanguageModel(vocab_size=tokenizer.vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=model_config.learning_rate)
trainer = Trainer(model, optimizer, train_dataloader, test_dataloader)
trainer.train(epochs=config.model.small.epochs)



Training: 100%|##########| 3137/3137 [00:03<00:00, 880.94it/s, loss=2.4]  
Evaluating: 100%|##########| 349/349 [00:04<00:00, 74.75it/s, loss=2.9] 
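
One way to put the reported losses (2.4 on train, 2.9 on test) in perspective is to compare them with a uniform guess. The sketch below assumes the loss is the mean cross-entropy in nats, which is what `F.cross_entropy` computes by default, and assumes a vocabulary of 65 characters. Neither assumption is confirmed by the output above, so treat the numbers as a rough guide.

```python
import math

# Assumption: number of distinct characters in the dataset.
assumed_vocab_size = 65

# Cross-entropy of a model that assigns equal probability to every character.
uniform_loss = math.log(assumed_vocab_size)

# Perplexity: the effective number of equally likely choices per character.
train_perplexity = math.exp(2.4)
eval_perplexity = math.exp(2.9)

print(f"uniform loss:     {uniform_loss:.2f}")
print(f"train perplexity: {train_perplexity:.1f}")
print(f"eval perplexity:  {eval_perplexity:.1f}")
```

Under these assumptions, the trained model does clearly better than uniform guessing on both splits.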


### Step 5: Generate new characters

Now that we have a trained model, we can use it to generate new characters.

All we need is to provide a context, and the model will try to continue the text. If we provide a tensor of zeros, we effectively provide no context.

In [8]:
def generate_text(context: torch.Tensor) -> str:
    return tokenizer.decode(model.generate(context, max_new_tokens=100).squeeze().tolist())


context = torch.zeros((1, 1), dtype=torch.long)
print(generate_text(context))


Yotot cou oruld?
NUS:
OLiso umme isufr?
SIUS:
CI t whit Thin hed d fave GOLUTowe'shibroulengon athin


We can also provide the first 10 characters as context. For the simple bigram model this makes little difference, because such a model predicts the next character from the last character alone and ignores everything before it. The longer context will matter for more advanced models.

In [9]:
context = torch.tensor(tokenizer.encode(text[:10])).unsqueeze(dim=0)
print(generate_text(context))

First Citils t s'twine alou CINIOM3Tatuliz

As, d drvefep, hiat
BR---whares or s s, g it,'d
Shantee'de couthel
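
A bigram model conditions only on the last character, and this can be demonstrated with a small count-based sketch in plain Python. Note that this swaps the notebook's trained PyTorch model for a table of character-pair counts. The functions `fit_bigram_counts` and `sample_continuation` are hypothetical and exist only for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def sample_continuation(counts, context, max_new_chars, seed):
    """Sample characters one at a time, looking only at the last character."""
    rng = random.Random(seed)
    out = context
    for _ in range(max_new_chars):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last character was never followed by anything
        chars = list(followers)
        weights = [followers[ch] for ch in chars]
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out[len(context):]


sample_text = (
    "First Citizen:\nBefore we proceed any further, hear me speak.\n\n"
    "All:\nSpeak, speak.\n"
)
bigram_counts = fit_bigram_counts(sample_text)

# Two different contexts that end in the same character 't'.
continuation_a = sample_continuation(bigram_counts, "First Cit", 20, seed=0)
continuation_b = sample_continuation(bigram_counts, "xyzt", 20, seed=0)
print(continuation_a == continuation_b)
```

With the same random seed, both contexts produce the same continuation, because only the final `t` influences the first sampled character and each later character depends only on the one before it.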
