# Build an LLM from scratch

[Link to the YouTube tutorial](https://www.youtube.com/watch?v=UU1WVnMk4E8&list=WL&index=1&t=1695s)

This machine learning project covers:

- encoder-decoder architecture
- self-attention mechanism
- multi-head attention mechanism
- PyTorch tensors and matrix manipulation

In [11]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
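# Hyperparameters for training
block_size = 8  # context length: number of tokens in one training sequence
batch_size = 4  # number of sequences processed in parallel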

cpu


In [12]:
with open('training_set/textbook.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Collect every distinct character that appears in the text (the vocabulary)
chars = sorted(set(text))
print(chars)

# print(text[:200])

['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x81', '£', '«', '°', '´', 'µ', '¼', '½', 'Ð', 'Ø', 'Þ', 'ð', 'þ', '˜', 'Γ', 'Δ', 'Θ', 'Π', 'Σ', 'Φ', 'Ω', 'α', 'β', 'γ', 'δ', 'ε', 'η', 'θ', 'κ', 'λ', 'μ', 'ν', 'π', 'ρ', 'σ', 'τ', 'ψ', 'ω', 'ϕ', '–', '—', '’', '“', '”', '•', '…', '′', '∂', '∅', '∑', '−', '∗', '∙', '√', '≠', '≤', '⋮', '■', '★', '⚀', '⚁', '⚂', '⚃', '⚄', '⚅', 'ﬁ', 'ﬂ', 'ﬃ']


In [13]:
# Implement a character-level tokenizer -> encode & decode
string_to_int = {ch:i for i,ch in enumerate(chars)}
int_to_string = {i:ch for i,ch in enumerate(chars)}
## encode maps a string to a list of token ids; decode maps the ids back to a string
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join(int_to_string[i] for i in l)

# print(encode('hello'))
# print(decode(encode('hello')))

# Encode the whole text and store it as a tensor of token ids
data = torch.tensor(encode(text), dtype=torch.long)
print(data[:100])

tensor([52, 80, 82, 73, 78, 71, 69, 82,  1, 53, 69, 88, 84, 83,  1, 73, 78,  1,
        52, 84, 65, 84, 73, 83, 84, 73, 67, 83,  0, 49, 82, 79, 66, 65, 66, 73,
        76, 73, 84, 89,  1, 87, 73, 84, 72,  0, 34, 80, 80, 76, 73, 67, 65, 84,
        73, 79, 78, 83,  1, 73, 78,  0, 38, 78, 71, 73, 78, 69, 69, 82, 73, 78,
        71, 13,  0, 52, 67, 73, 69, 78, 67, 69, 13,  1, 65, 78, 68,  0, 53, 69,
        67, 72, 78, 79, 76, 79, 71, 89, 46, 65])
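
To see the tokenizer round trip in isolation, here is a minimal sketch on a toy string. The names `toy_text`, `toy_chars`, `stoi` and `itos` are made up for this illustration and are not part of the notebook. It uses plain functions instead of lambdas but follows the same idea as the cell above.

```python
# Toy vocabulary built the same way as in the notebook, from a short string.
toy_text = "hello world"
toy_chars = sorted(set(toy_text))  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

stoi = {ch: i for i, ch in enumerate(toy_chars)}  # character -> token id
itos = {i: ch for i, ch in enumerate(toy_chars)}  # token id -> character

def encode(s):
    """Map a string to a list of token ids, one id per character."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token ids back to the original string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

Note that a character missing from the vocabulary (for example `"z"` here) raises a `KeyError`, so the vocabulary has to be built from the same text that gets encoded.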


In [14]:
# Split the data into a training set and a validation set
## training set: used to fit the model
## validation set: held out to check that the model generalizes to unseen text

n = int(0.8*len(data)) # use the first 80% as the training set
train_data = data[:n]
validate_data = data[n:]
print(train_data[:5])

tensor([52, 80, 82, 73, 78])
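
A quick sanity check of the split logic, sketched on a stand-in tensor (`toy_data` is hypothetical, not the notebook's `data`). It shows that the two slices do not overlap and together cover the whole sequence.

```python
import torch

# Stand-in for the encoded text: ten token ids, 0..9.
toy_data = torch.arange(10)

n = int(0.8 * len(toy_data))   # index where the validation part starts
toy_train = toy_data[:n]       # first 80% of the tokens
toy_val = toy_data[n:]         # remaining 20%, never seen during training

print(len(toy_train), len(toy_val))  # 8 2
```

Because the split is sequential rather than shuffled, the validation set is the tail end of the text.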



### Why do we need blocks?

- **Training on Variable-Length Contexts:** A generative language model must predict the next token in a sequence given a context of any length. The `context` in this code starts as a single token and grows up to the full length of the block (`block_size`), so one block gives the model examples of every context length it may encounter during inference.
- **Next-Token Prediction:** The `target` is the token that immediately follows the `context`. This is the token the model is supposed to predict.
- **Autoregressive Modeling:** In an autoregressive model, each token is predicted from the tokens that come before it.
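
One common way to write the autoregressive idea down is the chain-rule factorization below. The notation ($x_t$ for the token at position $t$, $T$ for the sequence length) is an assumption made for this note and does not come from the tutorial.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor corresponds to one (context, target) pair, so with `block_size = 8` a single block contributes 8 such pairs.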


In [15]:


# x holds the input tokens; y holds the targets, offset by one position so that y[t] is the token following x[t]
x = train_data[:block_size] # e.g. -> ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
y = train_data[1:block_size+1] # e.g. -> ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print("when input is", context, "target is", target)

when input is tensor([52]) target is tensor(80)
when input is tensor([52, 80]) target is tensor(82)
when input is tensor([52, 80, 82]) target is tensor(73)
when input is tensor([52, 80, 82, 73]) target is tensor(78)
when input is tensor([52, 80, 82, 73, 78]) target is tensor(71)
when input is tensor([52, 80, 82, 73, 78, 71]) target is tensor(69)
when input is tensor([52, 80, 82, 73, 78, 71, 69]) target is tensor(82)
when input is tensor([52, 80, 82, 73, 78, 71, 69, 82]) target is tensor(1)
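
The cell above walks through a single block. A natural next step is to sample several blocks at once, which is where `batch_size` comes in. The following is a minimal sketch, not necessarily the tutorial's implementation: `get_batch` and `toy_data` are names assumed for this example, and `toy_data` stands in for `train_data`.

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs sample the same windows

block_size = 8  # context length, as in the notebook
batch_size = 4  # number of sequences per batch, as in the notebook

# Stand-in for train_data: any 1-D tensor of token ids works.
toy_data = torch.arange(100, dtype=torch.long)

def get_batch(data, block_size, batch_size):
    """Sample batch_size random windows; y is x offset by one token."""
    # Start indices are capped so each window has block_size + 1 tokens available.
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in starts])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in starts])
    return x, y

xb, yb = get_batch(toy_data, block_size, batch_size)
print(xb.shape, yb.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

Each row of `xb` is one context window and the matching row of `yb` holds its targets, so the per-position loop shown above applies to every row independently.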
