<a href="https://colab.research.google.com/github/0xDebabrata/tinygrad-nano-gpt/blob/main/tinygrad_nano_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [None]:
%pip install tinygrad

In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Tokenization

The training data is converted into tokens, which are the units the model actually 'sees'.
Tokenization is a trade-off between vocabulary size and sequence length.

Here, every character is treated as a separate token and assigned an integer value. More sophisticated tokenizers, such as [tiktoken](https://github.com/openai/tiktoken) and [sentencepiece](https://github.com/google/sentencepiece), encode text at the sub-word level. With those, the vocabulary size increases but the sequence length decreases.
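The trade-off can be seen on a toy example. The sketch below is not how tiktoken or sentencepiece work internally; it assumes a single byte-pair-style merge of the most frequent adjacent character pair, and the sample string and the `merge_pair` helper are made up for illustration.

```python
from collections import Counter

sample = "to be or not to be"   # hypothetical sample text
tokens = list(sample)           # character-level tokens
vocab = set(tokens)             # character-level vocabulary

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Pick the most frequent adjacent pair and merge it into a new token
pair, _count = Counter(zip(tokens, tokens[1:])).most_common(1)[0]
merged = merge_pair(tokens, pair)
new_vocab = vocab | {pair[0] + pair[1]}

print(f"char-level: vocab={len(vocab)}, sequence length={len(tokens)}")
print(f"after one merge: vocab={len(new_vocab)}, sequence length={len(merged)}")
```

Each merge adds one entry to the vocabulary and shortens the sequence. Sub-word tokenizers repeat steps of this kind many times over a large corpus.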

In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [None]:
stoi = { ch: i for i, ch in enumerate(chars) }
itos = { i: ch for i, ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]             # take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])    # take a list of integers, output a string

print(encode("Hello world"))
print(decode(encode("Hello world")))

[20, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
Hello world


We tokenize the entire dataset and store it as a tensor.

In [None]:
from tinygrad.tensor import Tensor
from tinygrad.helpers import dtypes

data = Tensor(encode(text), dtype=dtypes.int32)
print(data.shape[0])

1115394


In [None]:
n = int(0.9 * data.shape[0])    # first 90% for training, the rest for validation
training_data = data[:n]
validation_data = data[n:]

We do not train the model on the entire dataset at once, since that would be computationally very expensive. Instead, we sample random chunks of a fixed size, called the context length (`block_size` in the code).
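As a rough sketch of what sampling a random chunk means, here is a pure-Python version. The `sample_chunk` helper and the stand-in token list are assumptions for illustration; the notebook itself works with tinygrad tensors.

```python
import random

def sample_chunk(tokens, block_size, rng):
    """Return block_size + 1 consecutive tokens starting at a random offset."""
    # The start index is capped so the chunk never runs past the end of the data
    start = rng.randrange(0, len(tokens) - block_size)
    return tokens[start:start + block_size + 1]

tokens = list(range(100))   # hypothetical stand-in for the encoded dataset
block_size = 8
rng = random.Random(0)      # seeded so the sketch is repeatable

chunk = sample_chunk(tokens, block_size, rng)
print(chunk)
```

The chunk holds one token more than `block_size` so that every position in the context has a following token to serve as its target.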

In [None]:
block_size = 8          # Also known as context length
training_data[:block_size + 1].numpy()

array([18, 47, 56, 57, 58,  1, 15, 47, 58], dtype=int32)

# How the transformer is trained

When the transformer sees a chunk like this, it learns from several examples at once. The primary goal is to predict the next token given the tokens that precede it.

From the output of the above code block (the first context set):
- [18] is used to predict 47.
- [18, 47] is used to predict 56, and so on.

So each chunk contains `block_size + 1` tokens but yields `block_size` training examples.

Besides reducing the computational cost, this chunking also teaches the network to predict the next token from contexts of every size from `1` up to `block_size`.

In [None]:
# Spelled out in code
x = training_data[:block_size]           # inputs: the first block_size tokens
y = training_data[1 : block_size + 1]    # targets: the same tokens shifted by one
for i in range(block_size):
    context = x[:i + 1]
    target = y[i]
    print(f"For context {context.numpy()}, the target is {target.numpy()}.")

For context [18], the target is 47.
For context [18 47], the target is 56.
For context [18 47 56], the target is 57.
For context [18 47 56 57], the target is 58.
For context [18 47 56 57 58], the target is 1.
For context [18 47 56 57 58  1], the target is 15.
For context [18 47 56 57 58  1 15], the target is 47.
For context [18 47 56 57 58  1 15 47], the target is 58.


Training is made more efficient by stacking several contexts into a batch and processing them in parallel, which improves GPU utilisation.

In [None]:
batch_size = 4
block_size = 8
Tensor.manual_seed(69)  # seed the RNG so batches are reproducible

def get_batch(split):
    data = training_data if split == 'train' else validation_data
    # Generate random indices for each chunk
    ix = Tensor.uniform(batch_size, low=0, high=data.shape[0] - block_size, dtype=dtypes.int32)
    starts = [int(i) for i in ix.numpy()]
    contexts = Tensor.stack([data[s:s+block_size] for s in starts])
    targets = Tensor.stack([data[s+1:s+block_size+1] for s in starts])
    return contexts, targets

# xb -> batch of inputs
# yb -> batch of targets
xb, yb = get_batch('train')
print("Inputs:")
print(xb.numpy())
print("Targets:")
print(yb.numpy())


# # Spelled out in code
# for i in range(batch_size):
#     for j in range(block_size):
#         context = xb[i][:j + 1]
#         target = yb[i][j]
#         print(f"For context {context.numpy()}, the target is {target.numpy()}.")

Inputs:
[[56 47 43 52 42  1 53 56]
 [ 6  1 21  1 61 47 50 50]
 [50 54  1 46 47 51  6  1]
 [43 56  0 25 63 57 43 50]]
Targets:
[[47 43 52 42  1 53 56  1]
 [ 1 21  1 61 47 50 50  1]
 [54  1 46 47 51  6  1 63]
 [56  0 25 63 57 43 50 44]]
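The relationship between the two arrays can be checked by hand. The sketch below copies the batch printed above into plain Python lists (so it does not need tinygrad) and confirms that each target row is its input row shifted by one position. It then unrolls the batch into individual (context, target) examples.

```python
# Batch copied from the output printed above
xb = [[56, 47, 43, 52, 42, 1, 53, 56],
      [6, 1, 21, 1, 61, 47, 50, 50],
      [50, 54, 1, 46, 47, 51, 6, 1],
      [43, 56, 0, 25, 63, 57, 43, 50]]
yb = [[47, 43, 52, 42, 1, 53, 56, 1],
      [1, 21, 1, 61, 47, 50, 50, 1],
      [54, 1, 46, 47, 51, 6, 1, 63],
      [56, 0, 25, 63, 57, 43, 50, 44]]

batch_size, block_size = len(xb), len(xb[0])

# Each target row should equal its input row shifted left by one position
shifted = [y[:-1] == x[1:] for x, y in zip(xb, yb)]
print(shifted)

# Every (row, position) pair is one independent training example
examples = [(x[:t + 1], y[t]) for x, y in zip(xb, yb) for t in range(block_size)]
print(len(examples))
print(examples[0])
```

With `batch_size = 4` and `block_size = 8`, a single batch therefore carries 4 × 8 = 32 next-token predictions.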
