## Train a character-level GPT on some text data

The inputs here are plain text files, which we split into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare and train it to predict the next character.
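
To make the character-level setup concrete, here is a minimal sketch on a toy string. It is plain Python for illustration only and is not part of minGPT; the variable names simply mirror the `stoi`/`itos` naming used in the dataset class in this notebook. Each character maps to an integer index, and the training target is the input shifted by one position.

```python
# Toy illustration of character-level tokenization (not minGPT code).
text = "hello world"

# Sorted so the character-to-index mapping is stable across runs.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# Encode the whole string as a list of integer ids.
ids = [stoi[ch] for ch in text]

# Inputs and targets are the same sequence offset by one character:
# at every position the model is asked to predict the next character.
x, y = ids[:-1], ids[1:]

# Decoding the ids recovers the original text.
decoded = "".join(itos[i] for i in ids)
print(chars)
print(ids)
print(decoded)
```

With a real corpus the only things that change are the size of `chars` (the vocabulary) and the length of `ids`.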

In [1]:
# set up logging
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

In [2]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [3]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [4]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
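        # sorted so the char <-> index mapping is identical on every run
        # (iteration order of a set of strings can change between Python processes)
        chars = sorted(set(data))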
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __getitem__(self, idx):
        # we're actually going to "cheat" and pick a spot in the dataset at random
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i+self.block_size+1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y


In [5]:
block_size = 128 # spatial extent of the model for its context

In [6]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r') as f: # closes the file handle once the text is read
    text = f.read()
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters

data has 1115394 characters, 65 unique.
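
Because `__getitem__` ignores `idx` and samples a random window, `__len__` only controls how many windows count as one "epoch". As a worked check, the arithmetic below plugs in the character count printed above and the `block_size` from the previous cell. The variable names are illustrative.

```python
import math

data_size = 1115394   # characters in input.txt, from the output above
block_size = 128      # context length set earlier in the notebook

# Same formula as CharDataset.__len__: number of windows of
# block_size + 1 characters needed to cover the text once.
samples_per_epoch = math.ceil(data_size / (block_size + 1))

# Each sample contributes block_size prediction targets.
tokens_per_epoch = samples_per_epoch * block_size

print(samples_per_epoch, tokens_per_epoch)
```

So one epoch draws 8647 random windows, about 1.1 million target characters, which is roughly one pass over the corpus in volume even though the windows are not guaranteed to cover it.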


In [11]:
# Create a labml experiment. This saves the indicators to TensorBoard and lets you browse experiments on the dashboard.
from labml import experiment

experiment.create(name='gpt_char')
experiment.start()


In [8]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=8, n_head=8, n_embd=512)
model = GPT(mconf)

08/21/2020 17:01:29 - INFO - mingpt.model -   number of parameters: 2.535219e+07
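
The logged parameter count can be reproduced by hand. The sketch below assumes a GPT-2 style layout: four `n_embd x n_embd` linear layers with biases in each attention block, an MLP with a 4x hidden expansion, two LayerNorms per block plus a final one, learned token and position embeddings, and an output head without a bias. These architectural details are assumptions that are not shown in this notebook, although the total does match the log line above.

```python
n_layer, n_embd, vocab_size, block_size = 8, 512, 65, 128

def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer (weights plus optional bias)."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * n_embd                      # scale and shift vectors
attention = 4 * linear(n_embd, n_embd)       # query, key, value, output projection
mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = attention + mlp + 2 * layer_norm

embeddings = vocab_size * n_embd + block_size * n_embd  # token + position
head = linear(n_embd, vocab_size, bias=False)           # assumed bias-free

total = n_layer * block + embeddings + layer_norm + head
print(total, f"{total:e}")
```

Nearly all of the parameters sit in the eight transformer blocks; with a 65-character vocabulary the embeddings and output head are a rounding error.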


In [9]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=200, batch_size=128, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=200*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

KeyboardInterrupt: 
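
The `lr_decay`, `warmup_tokens` and `final_tokens` settings describe a learning-rate schedule measured in tokens processed rather than in steps. The sketch below is one plausible reading of it, not the trainer's actual code: it assumes a linear warmup followed by a cosine decay that bottoms out at 10% of the base rate. The function name `lr_at` is hypothetical, and the default `final_tokens` uses `len(train_dataset) = 8647` as implied by the dataset size. Check `mingpt/trainer.py` for the exact rule.

```python
import math

def lr_at(tokens, base_lr=6e-4, warmup_tokens=512 * 20,
          final_tokens=200 * 8647 * 128):
    """Hypothetical schedule: linear warmup, then cosine decay to a 10% floor."""
    if tokens < warmup_tokens:
        # Ramp linearly from 0 up to base_lr over the warmup period.
        mult = tokens / max(1, warmup_tokens)
    else:
        # Fraction of the post-warmup budget that has been consumed.
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        # Cosine decay, never dropping below 10% of base_lr (assumed floor).
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

for t in (0, 5120, 10240, 200 * 8647 * 128):
    print(t, lr_at(t))
```

Under these assumptions the warmup is over after 10,240 tokens, which is less than one batch of 128 sequences of 128 characters, so in practice the run starts at almost the full learning rate.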

In [10]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! my comestimate weeping duke wildom by there!
Thy husband from your sight of her deivil.

KING EDWARD IV:
Sail how the Edward confirm henTy with the wind;
The from the bite with eving and for thy lient,
To thou direction, and thou didst leave:
If it be this libusing are with ereove
Tyou should with that you remember
Furst whom the boy ss ignitent of a misstress,
And be twixt thou backst Lant bawst Ventio?

BIONDELLO:
You have beetixty and your suit is a tears:
who the canst breath, so protation,
rest them brother of the bed, shou le,
That should bring from your then word back,
And we are all bast tword born cheekit,
For thou compans to be brows thou livest!

LADY ANNE:
She's remove with that God fulix a is a teach;
The frother is that Gauntumenting but the first
From whom that you rest to bitte
Go me be sun to the bikentments from worst.

LARTIUS:
So, I murder what you ld see
It this bart Ventio?

TYBALT:
Yea, my lords, to waters the words with Volscirena,
You menting with
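
The `sample` call above passes `temperature=0.9` and `top_k=5`. As a rough illustration of what those two knobs do to a single next-character draw, here is a sketch in plain Python instead of torch. The helper `sample_next` is hypothetical and is not the minGPT implementation: it divides the logits by the temperature, masks everything outside the `top_k` largest values, and samples from the resulting softmax.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from logits using temperature scaling and top-k filtering."""
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Value of the k-th largest logit; anything below it is masked out.
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    # Unnormalized softmax weights; subtracting the max keeps exp() stable,
    # and masked entries become exactly 0.0.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
draws = [sample_next(logits, temperature=0.9, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

With `top_k=2` only the two most likely indices can ever be drawn, which is why a small `top_k` keeps generated text from wandering into very unlikely characters.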

In [None]:
# well that was fun