# Training The Model

This notebook loads a text dataset and trains the model to predict the next character in a sequence.


In [2]:
import sys
import torch

# Add the path to the parent directory to allow direct import from the gpt package
sys.path.append("../")

torch.__version__

'2.2.2'

## Use an accelerator if available


In [3]:
device = torch.device("cpu")

if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")

device

device(type='mps')

## Create the dataset


In [9]:
import pandas as pd
from gpt.datasets import CharacterLevelTextDataset
from pathlib import Path
from IPython.display import display, Markdown

TEXT_DATA_PATH = Path("../data/bible-kjv.txt")
SEQUENCE_LENGTH = 10

dataset = CharacterLevelTextDataset(TEXT_DATA_PATH, SEQUENCE_LENGTH)
characteristics = {
    "Data path": TEXT_DATA_PATH,
    "Sequence length": f"{SEQUENCE_LENGTH}",
    "Dataset length": f"{len(dataset)}",
    "Vocabulary size": f"{dataset.vocab_size}",
}

df = pd.DataFrame(characteristics.items(), columns=["Parameter", "Value"])
display(Markdown("### Dataset Characteristics"))
display(df)

vocab_data_frame = pd.DataFrame(
    dataset.char_to_idx.items(), columns=["Character", "Index"]
)

display(Markdown("### Vocabulary Sample"))
display(vocab_data_frame.head(10))

### Dataset Characteristics

|   | Parameter | Value |
|---|---|---|
| 0 | Data path | ../data/bible-kjv.txt |
| 1 | Sequence length | 10 |
| 2 | Dataset length | 4351869 |
| 3 | Vocabulary size | 62 |


### Vocabulary Sample

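|   | Character | Index |
|---|---|---|
| 0 | `'\ufeff'` (byte-order mark) | 0 |
| 1 | t | 1 |
| 2 | h | 2 |
| 3 | e | 3 |
| 4 | `' '` (space) | 4 |
| 5 | p | 5 |
| 6 | r | 6 |
| 7 | o | 7 |
| 8 | j | 8 |
| 9 | c | 9 |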
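
The implementation of `CharacterLevelTextDataset` is not shown in this notebook. As a rough illustration only, a character-level dataset could build its vocabulary in order of first appearance (which is consistent with the sample above) and pair each window of characters with the same window shifted by one position. The helpers `build_vocab` and `make_examples` below are hypothetical names, not part of the `gpt` package, and the real class may differ.

```python
def build_vocab(text):
    """Map each distinct character to an index, in order of first appearance."""
    char_to_idx = {}
    for ch in text:
        if ch not in char_to_idx:
            char_to_idx[ch] = len(char_to_idx)
    return char_to_idx


def make_examples(text, sequence_length):
    """Return the vocabulary and a list of (input, target) index windows.

    Assumption: each target is the input window shifted right by one
    character, so every position learns to predict its successor.
    """
    char_to_idx = build_vocab(text)
    encoded = [char_to_idx[ch] for ch in text]
    examples = []
    for start in range(len(encoded) - sequence_length):
        window = encoded[start : start + sequence_length]
        target = encoded[start + 1 : start + sequence_length + 1]
        examples.append((window, target))
    return char_to_idx, examples


sample_vocab, sample_examples = make_examples("the project", sequence_length=4)
print(sample_vocab)
print(sample_examples[0])
```

With a window of 4, the first example pairs the indices for `"the "` with those for `"he p"`.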


## Train The Model
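
The training cell in this section hands the model, optimizer, and criterion to `train`, whose implementation lives in the `gpt` package and is not shown here. As a hedged sketch of what a single optimisation step of next-character training could look like, the snippet below uses a tiny stand-in model (`toy_model`, an embedding followed by a linear layer) and random token ids instead of the real `GPT2` class and dataset. All `toy_*` names are hypothetical, and the actual `train` function may do things differently.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

VOCAB_SIZE = 62        # vocabulary size reported for the dataset
SEQUENCE_LENGTH = 10   # same sequence length as the notebook
BATCH_SIZE = 32        # same batch size as the notebook

# Hypothetical stand-in for the real model: embedding plus a linear output head.
toy_model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, 16),
    nn.Linear(16, VOCAB_SIZE),
)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=1e-3)
toy_criterion = nn.CrossEntropyLoss()

# Random token ids stand in for one batch of inputs and shifted targets.
inputs = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQUENCE_LENGTH))
targets = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQUENCE_LENGTH))

toy_optimizer.zero_grad()
logits = toy_model(inputs)  # shape: (batch, sequence, vocab)

# CrossEntropyLoss expects (N, classes) logits and (N,) class indices,
# so the batch and sequence dimensions are flattened together.
loss = toy_criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()
toy_optimizer.step()

step_loss = loss.item()
print(f"loss after one step: {step_loss:.4f}")
```

Flattening the batch and sequence dimensions means every character position contributes one classification term to the loss.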


In [14]:
from gpt import GPT2, train
from torch import optim, nn

EPOCHS = 5
BATCH_SIZE = 32
SAVE_PATH = Path("../models")

model = GPT2(
    vocab_size=dataset.vocab_size,
    embedding_size=128,
    num_heads=4,
    num_layers=2,
    hidden_size=256,
).to(device)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

train(
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    model=model,
    vocab_size=dataset.vocab_size,
    dataset=dataset,
    optimizer=optimizer,
    criterion=criterion,
    device=device,
    save_path=SAVE_PATH,
)

Epoch 1/5: 100%|██████████| 135996/135996 [57:54<00:00, 39.14it/s]   


Epoch 1/5, Average Loss: 0.1355
Epoch 1/5, Generated Output: once upon a timed atledilferefucityerg-’matemeterederfreewomm emembm
wem
es ouseruseremerkenr eerorermerreedrghiplee


Epoch 2/5: 100%|██████████| 135996/135996 [33:51<00:00, 66.93it/s]


Epoch 2/5, Average Loss: 0.1192
Epoch 2/5, Generated Output: once upon a timerwalt teg theeschesenouaplevelmoulmxermpeperechrechinccketrt ort (merlmmp got to byorayqungchenuclre


Epoch 3/5: 100%|██████████| 135996/135996 [57:11<00:00, 39.63it/s]    


Epoch 3/5, Average Loss: 0.1155
Epoch 3/5, Generated Output: once upon a timetorkecknftemmorcerrccteorrccrccccccnprcccrecorccovocecockcexocogecometousgubsemblfobetmorcciporotore


Epoch 4/5: 100%|██████████| 135996/135996 [2:25:55<00:00, 15.53it/s]    


Epoch 4/5, Average Loss: 0.1136
Epoch 4/5, Generated Output: once upon a timenwasen.ison, at, tec wforvewonchonchang.lve. won
whw
chptcl
orewextueonuson’ercuptr.
ons c.
eremore



Epoch 5/5: 100%|██████████| 135996/135996 [1:13:35<00:00, 30.80it/s]    


Epoch 5/5, Average Loss: 0.1125
Epoch 5/5, Generated Output: once upon a timetseefcl rthembermbeaton g™gorheeb tweesgetnokenckeng™angebongsofordafa,taner
focenveartorcliberitisu
Saving final model to ../models/gpt2_20240329040034.pt
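
The progress bars above count 135996 iterations per epoch. That figure follows from the dataset length and batch size, assuming `train` batches the dataset without dropping the final partial batch (an assumption, since its implementation is not shown here):

```python
import math

DATASET_LENGTH = 4_351_869  # "Dataset length" reported in the dataset characteristics
BATCH_SIZE = 32             # batch size passed to train()

# Number of batches needed to cover every example once.
batches_per_epoch = math.ceil(DATASET_LENGTH / BATCH_SIZE)

# Size of the final, partial batch under the no-drop assumption.
last_batch_size = DATASET_LENGTH - (batches_per_epoch - 1) * BATCH_SIZE

print(batches_per_epoch, last_batch_size)  # 135996 29
```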
