## Building a GPT
<a href="https://colab.research.google.com/github/NikiforovG/gpt/blob/main/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
COLAB = False

In [2]:
import os

if COLAB:
    if os.getcwd() != '/content/gpt/main':
        !git clone https://github.com/NikiforovG/gpt.git
        # no `!cd` here: it runs in a subshell and does not persist; os.chdir below switches into the cloned repo
        # !git checkout gpt
        os.chdir('/content/gpt/main')

    from google.colab import drive

    drive.mount('/content/drive')
    folder = '/content/drive/MyDrive/Colab Notebooks/gpt/'
else:
    folder = './'

In [3]:
weights_folder = os.path.join(folder, 'weights/')
os.makedirs(weights_folder, exist_ok=True)

In [4]:
from time import time

import torch

from src.data import Data, Vocabulary
from src.gpt import GPTConfig, GPTModel
from src.utils import (
    count_parameters,
    estimate_loss,
    get_tinyshakespeare_dataset,
    load_training_state,
    save_training_state,
    TrainingState,
)

In [5]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('device:', device)

device: cpu


# Data preparation

In [6]:
text = get_tinyshakespeare_dataset()

In [7]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [8]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.


In [9]:
# here are all the unique characters that occur in this text
vocab = Vocabulary(text=text)
print(''.join(vocab.stoi.keys()))
print(vocab.size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [10]:
print(vocab.encode("hii there"))
print(vocab.decode(vocab.encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
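
The `Vocabulary` class lives in `src.data` and is not shown in this notebook. As a rough illustration only, a character-level vocabulary that assigns ids in sorted character order gives the same ids as the output above. The class name `CharVocabulary` and the `demo_vocab` variable are hypothetical, and the alphabet is copied from the 65 characters printed earlier rather than read from the dataset.

```python
import string


class CharVocabulary:
    """Hypothetical character-level vocabulary; not the repo's `Vocabulary`."""

    def __init__(self, text: str) -> None:
        chars = sorted(set(text))  # assumption: ids follow sorted character order
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.size = len(chars)

    def encode(self, s: str) -> list[int]:
        # map every character to its integer id
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        # map ids back to characters and join them
        return "".join(self.itos[i] for i in ids)


# the 65 characters printed by the notebook: newline, space, punctuation, letters
alphabet = "\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase
demo_vocab = CharVocabulary(alphabet)
ids = demo_vocab.encode("hii there")
print(ids)
print(demo_vocab.decode(ids))
```

Encoding and decoding are exact inverses here because every character maps to exactly one id.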


In [11]:
# let's now encode the entire text dataset and store it into a torch.Tensor
data = Data(vocab.encode(text))
print(data.train_data.shape, data.train_data.dtype)

torch.Size([892315]) torch.int64
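
The training split holds 892315 of the 1115394 tokens, which is consistent with an 80/20 train/validation split. The `Data` class itself is not shown, so the sketch below only assumes how such a split and the batch sampling used later by `data.get_batch` might look. The names `split_tokens` and `sample_batch` and the `train_fraction` parameter are hypothetical.

```python
import torch


def split_tokens(tokens, train_fraction=0.8):
    """Split a token list into train and validation tensors (assumed 80/20)."""
    data = torch.tensor(tokens, dtype=torch.long)
    n = int(train_fraction * len(data))
    return data[:n], data[n:]


def sample_batch(data, block_size, batch_size):
    """Sample random windows; targets are the inputs shifted by one position."""
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in starts])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in starts])
    return x, y


# toy token stream so the shift between inputs and targets is easy to see
train_data, val_data = split_tokens(list(range(100)))
xb, yb = sample_batch(train_data, block_size=8, batch_size=4)
print(xb.shape, yb.shape)
print(xb[0].tolist(), yb[0].tolist())
```

Each target row is its input row moved one token to the right, so every position in the window contributes a next-token prediction.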


# Training

In [12]:
continue_training = True

In [14]:
if continue_training:
    training_state = load_training_state(weights_folder)
    model_config = training_state.model_config
    model = training_state.model
    model = model.to(device)
    model.train()

    optimizer = torch.optim.AdamW(model.parameters())
    optimizer.load_state_dict(training_state.optimizer_state_dict)

    steps_done = training_state.training_steps
    training_time_done = training_state.training_time
else:
    steps_done = 0
    training_time_done = 0

    # Model
    block_size = 8
    emb_size = 32
    num_heads = 4
    num_layers = 3
    dropout = 0.2

    # Optimizer
    learning_rate = 1e-3

    model_config = GPTConfig(
        vocab_size=vocab.size,
        block_size=block_size,
        emb_size=emb_size,
        num_heads=num_heads,
        num_layers=num_layers,
        dropout=dropout,
    )
    model = GPTModel(config=model_config).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [15]:
print(f'model has {count_parameters(model)} parameters')
print(vocab.decode(model.generate(start_tokens=torch.zeros((1, 1), dtype=torch.long, device=device), max_new_tokens=100)[0].tolist()))

model has 42369 parameters

And attherature bet't ing anchiked:
Unk so shem his ofay, wittexelsoned's unt feinst. Wem inlomy or,
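
`count_parameters` comes from `src.utils` and its body is not shown. A common way to count parameters in PyTorch, given here as an assumption about what it might do, is to sum `numel()` over the trainable tensors. The function name `count_trainable_parameters` is hypothetical.

```python
import torch


def count_trainable_parameters(model: torch.nn.Module) -> int:
    """Sum the element counts of all parameters that receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# a single linear layer from emb_size=32 to vocab_size=65
layer = torch.nn.Linear(32, 65)
print(count_trainable_parameters(layer))  # 32 * 65 weights + 65 biases = 2145
```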


In [16]:
# Training
batch_size = 32
max_iters = 10000
eval_interval = 1000
eval_iters = 200
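
Every `eval_interval` steps the loop below calls `estimate_loss`, which averages the loss over `eval_iters` batches. Its implementation is in `src.utils` and is not shown, so the following is only a sketch of the usual pattern: switch to evaluation mode, disable gradients, average several batch losses, then restore the previous mode. `estimate_mean_loss`, `TinyBigram`, and `random_batch` are hypothetical stand-ins.

```python
import torch


@torch.no_grad()
def estimate_mean_loss(model, sample_batch, eval_iters):
    """Average the loss over `eval_iters` batches without tracking gradients."""
    was_training = model.training
    model.eval()  # disables dropout while measuring
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        xb, yb = sample_batch()
        _, loss = model(xb, yb)
        losses[k] = loss.item()
    model.train(was_training)  # restore the mode the model was in
    return losses.mean().item()


class TinyBigram(torch.nn.Module):
    """Stand-in model that returns (logits, loss) like the notebook's model call."""

    def __init__(self, vocab_size):
        super().__init__()
        self.table = torch.nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        logits = self.table(idx)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        return logits, loss


def random_batch():
    # random tokens only, to exercise the function
    return torch.randint(65, (4, 8)), torch.randint(65, (4, 8))


demo_model = TinyBigram(vocab_size=65)
demo_model.train()
mean_loss = estimate_mean_loss(demo_model, random_batch, eval_iters=5)
print(f"{mean_loss:.4f}")
```

Averaging over many batches makes the reported number far less noisy than the loss of a single training batch.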

In [17]:
timer = time()
steps = 0
for steps in range(steps_done + 1, steps_done + 1 + max_iters):

    # sample a batch of data
    xb, yb = data.get_batch('train', block_size=model_config.block_size, batch_size=batch_size)
    xb, yb = xb.to(device), yb.to(device)

    # forward pass, then backpropagate and take a gradient-clipped optimizer step
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
    optimizer.step()

    if steps % eval_interval == 0:
        losses = estimate_loss(
            eval_iters=eval_iters,
            model=model,
            data=data,
            block_size=model_config.block_size,
            batch_size=batch_size,
            device=device,
        )
        print(
            f"step: {steps}; training time: {round(time() - timer)} sec; train loss: {losses['train']:.4f}; val loss: {losses['val']:.4f}"
        )
training_time = time() - timer + training_time_done

step: 11000; training time: 65 sec; train loss: 1.9972; val loss: 2.0787
step: 12000; training time: 96 sec; train loss: 1.9820; val loss: 2.0784
step: 13000; training time: 129 sec; train loss: 1.9766; val loss: 2.0680
step: 14000; training time: 155 sec; train loss: 1.9696; val loss: 2.0683
step: 15000; training time: 182 sec; train loss: 1.9577; val loss: 2.0646
step: 16000; training time: 211 sec; train loss: 1.9440; val loss: 2.0599
step: 17000; training time: 252 sec; train loss: 1.9504; val loss: 2.0556
step: 18000; training time: 283 sec; train loss: 1.9429; val loss: 2.0540
step: 19000; training time: 320 sec; train loss: 1.9464; val loss: 2.0497
step: 20000; training time: 352 sec; train loss: 1.9275; val loss: 2.0622
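
To put these numbers in context, assume the reported loss is the mean cross-entropy in nats, which is what PyTorch's `cross_entropy` returns. Under that assumption, a model guessing uniformly over the 65 characters would score `ln(65)`, and the exponential of the loss gives the per-character perplexity. The variable names below are illustrative.

```python
import math

vocab_size = 65           # from the vocabulary cell above
final_val_loss = 2.0622   # val loss reported at step 20000

# loss of a uniform guess over the vocabulary, roughly 4.17 nats
uniform_loss = math.log(vocab_size)

# perplexity: the effective number of equally likely next characters
perplexity = math.exp(final_val_loss)

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"validation perplexity: {perplexity:.2f}")
```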


In [18]:
training_state = TrainingState(
    model_config=model_config,
    model=model,
    optimizer_state_dict=optimizer.state_dict(),
    training_time=training_time,
    training_steps=steps,
)
save_training_state(weights_folder, training_state)

In [19]:
# start from token 0 (the newline character) on the same device as the model
sample_generation = vocab.decode(
    model.generate(start_tokens=torch.zeros((1, 1), dtype=torch.long, device=device), max_new_tokens=500)[0].tolist()
)
print(sample_generation)
with open(os.path.join(weights_folder, f'gpt_{steps}_sample_output.txt'), 'w', encoding='utf-8') as f:
    f.write(sample_generation)


the should th land sament cart'r do;
Freat,
As hath, have a me pance,
If hervo norme: and ared the cablemear there
CHonot ourmb Ricary dacryness! hen, I and I ributh truep wourse I fyour me shoubdee! that lore fort in OF AMING
To it themb our wonsself't the ceen!
Ands ark I I bon I knows mist, be hobreath.

HARd murtes midend How?
Seyquore thou, good Shy devence infued?

HENRY VI:
That cres, she
Maith
To mist lad VONTHESTIL:
And way let butt?

And that dill set wellf.

INA:
I am your upon cany t
