<a href="https://colab.research.google.com/github/LIYunzhe1408/GPT_Transformer_spell_out/blob/main/build_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download the dataset
Download the Tiny Shakespeare dataset and take a peek at it.

In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-10-27 04:13:51--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-10-27 04:13:52 (21.6 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [6]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



## Vocab & Tokenization

Build the vocabulary of this dataset.

The vocabulary is the sorted set of all unique characters in the text. These are the only characters the model can see and emit.

In [16]:
characters = sorted(list(set(text)))
vocab_size = len(characters)
print(''.join(characters))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


Convert the raw text into a sequence of integers according to the vocabulary of all possible elements.

> Trade-off: short sequences of integers with a very large vocabulary vs. long sequences of integers with a small vocabulary.

Example: `hii there`
* Character-level mapping: 46, 47, 47, 1, 58, 46, 43, 56, 43 (9 tokens) with a vocabulary of only 65.
* Tiktoken (GPT-2 encoding): 71, 4178, 612 (3 tokens) with a vocabulary of 50257.
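
The trade-off can be sketched without any external tokenizer. The snippet below is only an illustration: it uses a made-up sample string and a naive whitespace split as a stand-in for a subword tokenizer such as Tiktoken, so the numbers do not correspond to any real encoding.

```python
# Toy comparison of two tokenization granularities on one sample string.
# Assumption: a whitespace split stands in for a real subword tokenizer.
sample = "hii there, hii there, hii friend"

# Character level: every distinct character is a token.
char_vocab = sorted(set(sample))
char_stoi = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_stoi[ch] for ch in sample]

# Word level: every distinct whitespace-separated chunk is a token.
words = sample.split(" ")
word_vocab = sorted(set(words))
word_stoi = {w: i for i, w in enumerate(word_vocab)}
word_ids = [word_stoi[w] for w in words]

print("char-level:", len(char_ids), "tokens, vocab size", len(char_vocab))
print("word-level:", len(word_ids), "tokens, vocab size", len(word_vocab))
```

The coarser tokenizer produces a much shorter sequence (6 tokens instead of 32). On a sample this small its vocabulary is also tiny; the vocabulary only becomes large when it has to cover a full corpus.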



In [18]:
itos = {i: ch for i, ch in enumerate(characters)}
stoi = {ch: i for i, ch in itos.items()}
encode = lambda string: [stoi[ch] for ch in string] # Encoder: take a string, output a list of integers
decode = lambda integers: ''.join([itos[i] for i in integers]) # Decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii, there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii, there
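
One property of this lookup-table scheme worth keeping in mind: it only knows the characters present in the vocabulary. The sketch below rebuilds the same `encode`/`decode` pair on a tiny stand-in vocabulary (an assumption for illustration, not the 65-character one above) and shows that a character outside it raises a `KeyError`.

```python
# Tiny stand-in vocabulary for illustration; the notebook builds
# `characters` from the full dataset instead.
characters = sorted(set("hii there"))
stoi = {ch: i for i, ch in enumerate(characters)}
itos = {i: ch for ch, i in stoi.items()}

def encode(string):
    """Map each character to its integer id."""
    return [stoi[ch] for ch in string]

def decode(integers):
    """Map each integer id back to its character."""
    return ''.join(itos[i] for i in integers)

# The round trip is lossless for in-vocabulary text.
round_trip = decode(encode("hii there"))

# '!' is not in this stand-in vocabulary, so the dictionary lookup fails.
try:
    encode("hii there!")
    unknown_encoded = True
except KeyError as err:
    unknown_encoded = False
    print("character not in vocabulary:", err)
```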


In [22]:
# Encode the whole text and store it in a torch.Tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


## Split train and val
- First 90% of the data as the `training` dataset
- Last 10% as the `validation` dataset

In [24]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data   = data[n:]
print(len(train_data), len(val_data))

1003854 111540


## Data Loader
1. Time dimension
  - Block size: the maximum number of characters the model can receive as context.
  - A chunk of `block_size + 1` characters contains `block_size` training examples, with contexts growing from one character up to `block_size` characters. Training on all of them lets the model generate from as little as one character of context. Contexts longer than `block_size` must be truncated, because the model never sees more than `block_size` characters at once.

In [38]:
block_size = 8
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(context, "--->", target.item())

tensor([18]) ---> 47
tensor([18, 47]) ---> 56
tensor([18, 47, 56]) ---> 57
tensor([18, 47, 56, 57]) ---> 58
tensor([18, 47, 56, 57, 58]) ---> 1
tensor([18, 47, 56, 57, 58,  1]) ---> 15
tensor([18, 47, 56, 57, 58,  1, 15]) ---> 47
tensor([18, 47, 56, 57, 58,  1, 15, 47]) ---> 58
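
The two ideas in the time dimension can be sketched in plain Python. This is a minimal illustration using made-up token ids (`tokens` is hypothetical, not the dataset): it counts the examples packed into one chunk, and shows one common way to crop an over-long context to the last `block_size` tokens.

```python
block_size = 8
tokens = list(range(20))  # hypothetical token ids, for illustration only

# One chunk of block_size + 1 tokens yields block_size (context, target) pairs.
chunk = tokens[:block_size + 1]
examples = [(chunk[:t + 1], chunk[t + 1]) for t in range(block_size)]
print(len(examples), "examples from", len(chunk), "tokens")

# A context longer than block_size is cropped to its most recent tokens.
context = tokens[:12]
cropped = context[-block_size:]
print(cropped)
```

Keeping the most recent tokens is an assumption about how truncation is done here; it is the usual choice for next-character prediction because the latest characters are the most relevant.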


2. Batch dimension
  - Multiple chunks of text are stacked into a single tensor. This is done for efficiency: it keeps the GPU busy by processing the chunks in parallel. The chunks are processed independently of one another.
  - Batch size: how many independent sequences will we process in parallel?
  - Block size: what is the maximum context length for predictions?

In [51]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8

def get_batch(split):
  data = train_data if split == 'train' else val_data
  index_x = torch.randint(len(data) - block_size, (batch_size,))
  xb = torch.stack([data[i:i+block_size] for i in index_x])
  yb = torch.stack([data[i+1:i+block_size+1] for i in index_x])
  return xb, yb

xb, yb = get_batch("train")
print("Inputs: ")
print(xb.shape)
print(xb)
print("Targets: ")
print(yb.shape)
print(yb)

# Each context of 1 to block_size characters
# predicts the single character that follows it
for b in range(batch_size):
  print(f"The {b}-th sample in this batch")
  for t in range(block_size):
    context = xb[b, :t+1]
    target = yb[b, t]
    print(f"When the input is {context} ---> target is {target.item()}")
  print("--------------------------------------")

Inputs: 
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
Targets: 
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
The 0-th sample in this batch
When the input is tensor([24]) ---> target is 43
When the input is tensor([24, 43]) ---> target is 58
When the input is tensor([24, 43, 58]) ---> target is 5
When the input is tensor([24, 43, 58,  5]) ---> target is 57
When the input is tensor([24, 43, 58,  5, 57]) ---> target is 1
When the input is tensor([24, 43, 58,  5, 57,  1]) ---> target is 46
When the input is tensor([24, 43, 58,  5, 57,  1, 46]) ---> target is 43
When the input is tensor([24, 43, 58,  5, 57,  1, 46, 43]) ---> target is 39
--------------------------------------
The 1-th sample in this ba
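
Two properties of this batching scheme can be checked directly. The sketch below repeats the sampling logic on a toy tensor (`torch.arange(100)` is a stand-in for the encoded text, an assumption for illustration) and verifies that the targets are the inputs shifted by one position, and that the largest start index `torch.randint` can return still leaves room for a full target slice.

```python
import torch

torch.manual_seed(0)
data = torch.arange(100)  # toy stand-in for the encoded dataset
batch_size, block_size = 4, 8

# Same sampling logic as get_batch: random start offsets, then stacked slices.
ix = torch.randint(len(data) - block_size, (batch_size,))
xb = torch.stack([data[i:i + block_size] for i in ix])
yb = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

# Targets are the inputs shifted one step to the left.
shifted = torch.equal(xb[:, 1:], yb[:, :-1])

# torch.randint excludes its upper bound, so the largest start index is
# len(data) - block_size - 1, and its target slice ends exactly at len(data).
last = len(data) - block_size - 1
last_target_len = len(data[last + 1:last + block_size + 1])

print(shifted, last_target_len)
```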

## Baseline
- bigram language model
- loss
- generation

In [52]:
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
