This is my first time using Python (the syntax is pretty easy since I already know JavaScript and Java) and my first time building an LLM! Thanks to this legend: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=580s I somehow managed to turn a 2 hour tutorial into 5 hours, RIP.

Grab the Trump tweet data set from my repo. Run this on a T4 GPU runtime for speed, then just hit "Run all" under the Runtime tab.

Set variables

In [7]:
import torch
import torch.nn as nn
from torch.nn import functional as F
batch_size = 16 # Number of sequences processed in parallel in one batch
block_size = 34 # Context length: number of tokens (characters) in each sequence
max_iters = 500 # Increase me to get a better result
eval_interval = 100 # How often you want the program to print out the train loss and val loss
l_rate = 1e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

Open up the file and print the length of the text


In [8]:
with open('TrumpTwitterAllProper.txt', 'r', encoding='utf-8') as f:
    text = f.read()
len(text)

3325571

Check if Colab has managed to read it

In [9]:
print(text[:1000])

Tweet
"If the press would cover me accurately & honorably, I would have far less reason to ""tweet."" Sadly, I don't know if that will ever happen!"
I am thrilled to nominate Dr. @RealBenCarson as our next Secretary of the US Dept. of Housing and Urban Development… https://t.co/OJKuDFhP3r
their country (the U.S. doesn't tax them) or to build a massive military complex in the middle of the South China Sea? I don't think so!
"Did China ask us if it was OK to devalue their currency (making it hard for our companies to compete), heavily tax our products going into.."
".@FoxNews will be re-running ""Objectified: Donald Trump,"" the ratings hit produced by the great Harvey Levin of TMZ, at 8:00 P.M. Enjoy!"
The Green Party just dropped its recount suit in Pennsylvania and is losing votes in Wisconsin recount. Just a Stein scam to raise money!
expensive mistake! THE UNITED STATES IS OPEN FOR BUSINESS
"these companies are able to move between all 50 states, with no tax or tariff being charged.

Define encoding and decoding functions and characters

In [10]:
chars = sorted(list(set(text))) # Sorted list of every unique character in the text
vocab_size = len(chars) # Number of unique characters (the vocabulary)
stoi = { ch:i for i,ch in enumerate(chars) } # Maps each character to an integer id
itos = { i:ch for i,ch in enumerate(chars) } # Maps each integer id back to its character
encode = lambda s: [stoi[c] for c in s] # Encoder: string -> list of integer ids
decode = lambda l: ''.join([itos[i] for i in l]) # Decoder: list of integer ids -> string
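To see the character-level tokenizer in action without loading the tweet file, here is a minimal sketch on a toy string. The sample text and the `sample_*` names are made up for illustration; the logic mirrors the cell above.

```python
# Toy corpus standing in for the tweet file (hypothetical sample text)
sample_text = "make tweets great again"

sample_chars = sorted(set(sample_text))                      # unique characters, sorted
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}   # character -> integer id
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}   # integer id -> character

def sample_encode(s):
    """Turn a string into a list of integer ids."""
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(sample_itos[i] for i in ids)

ids = sample_encode("great")
print(ids)                 # one integer per character
print(sample_decode(ids))  # decoding gives the original string back
```

Encoding followed by decoding always returns the original string, as long as every character was present when the vocabulary was built.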

Turn the encoded text into a torch tensor so PyTorch can work with it

In [11]:
data = torch.tensor(encode(text), dtype=torch.long) # Pass encoded (tokenized) data to the tensor function

Splitting the data into training and validation sets, then peeking at one block

In [12]:
train_val = int(0.9*len(data)) # The first 90% is used for training; the last 10% is held out for validation to check the model isn't just memorizing the data set
train_data = data[:train_val]
val_data = data[train_val:]
train_data[:block_size+1]

tensor([53, 86, 68, 68, 83,  0,  3, 42, 69,  1, 83, 71, 68,  1, 79, 81, 68, 82,
        82,  1, 86, 78, 84, 75, 67,  1, 66, 78, 85, 68, 81,  1, 76, 68,  1])
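The reason for grabbing `block_size+1` tokens is that one block actually holds `block_size` training examples: at every position the model sees the tokens so far and has to predict the next one. A small sketch using the first few ids from the tensor above (the shorter `toy_block_size` is an assumption, just to keep the printout small):

```python
# First few token ids copied from the tensor printed above
tokens = [53, 86, 68, 68, 83, 0, 3, 42, 69]
toy_block_size = 8  # hypothetical, smaller than block_size = 34

x = tokens[:toy_block_size]        # inputs
y = tokens[1:toy_block_size + 1]   # targets: the same window shifted one step to the right

examples = []
for t in range(toy_block_size):
    context = x[:t + 1]   # everything the model is allowed to look at
    target = y[t]         # the token it should predict next
    examples.append((context, target))
    print(f"when input is {context} the target is {target}")
```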

Blocking now (Makes a lot of sense after he explained it. I'm new to Python and machine learning in general, so very exciting stuff!)

In [13]:
torch.manual_seed(1432)

# Sample small random chunks of the data instead of feeding the entire dataset to the transformer at once, which would be very hardware intensive

def get_batch(split):
    data = train_data if split == 'train' else val_data # Pick the training or the validation data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # batch_size random start positions (not a sequential pass over the data)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

print('Success!')

Success!
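Here is a rough pure-Python sketch of what `get_batch` does, using lists instead of tensors. The function name `get_batch_sketch` and the toy data are assumptions for illustration, and Python's `random` module is a different generator from `torch.manual_seed`, so the offsets will not match the notebook's.

```python
import random

def get_batch_sketch(data, batch_size, block_size, rng):
    """Pick batch_size random windows; y is x shifted right by one token."""
    # Valid start offsets leave room for block_size inputs plus one extra target
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

toy_data = list(range(100))   # hypothetical stand-in for the encoded tweets
rng = random.Random(1432)     # same seed value as the notebook, different generator
xb, yb = get_batch_sketch(toy_data, batch_size=4, block_size=8, rng=rng)
for row_x, row_y in zip(xb, yb):
    print(row_x, "->", row_y)
```

Because the toy data is just 0..99, it is easy to see in the printout that each target row is the input row moved along by one position.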


Estimating the train and validation loss during training

In [14]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # Switch the model back to training mode
    return out # Mean loss for each split
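A handy sanity check for the numbers `estimate_loss` reports: a model that has learned nothing and guesses every character with equal probability has a cross-entropy of ln(vocab_size). The sketch below assumes a vocabulary of 116 characters, which is the size shown in the model summary further down (`Embedding(116, 64)`). A freshly initialized network will not be exactly uniform, so treat this as a ballpark figure only.

```python
import math

vocab_size_printed = 116  # taken from Embedding(116, 64) in the model summary

# Uniform guessing gives each character probability 1 / vocab_size,
# so the cross-entropy is -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size_printed)
print(f"loss of a uniform guess: {uniform_loss:.3f}")
```

If the train and val losses start near this value and then drop, the training loop is doing its job.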

Heads of self-attention (Nodes talking to each other)

In [15]:
class Head(nn.Module):
    # Single head / Pretty chill and understandable
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False) # key, query and value are bias-free linear layers (matrix multiplications) that project each token down to head_size dimensions
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) # Lower-triangular matrix of ones: everything above the diagonal is zero

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
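        k = self.key(x)   # (B, T, hs) where B = batch, T = time (position in the sequence), hs = head_size; C above is the channel count (n_embd)
        q = self.query(x) # (B, T, hs)
        # Note: C here is n_embd; standard scaled dot-product attention scales by head_size**-0.5 instead
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T) / k is transposed so the shapes line up for the matrix multiply
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # Set the scores above the diagonal (future tokens) to negative infinity
        wei = F.softmax(wei, dim=-1) # (B, T, T) / the -inf entries end up with weight 0
        wei = self.dropout(wei)
        # Aggregation of the values
        v = self.value(x) # (B, T, hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)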
        return out

# After this comment it gets really difficult for me to understand; he didn't really cover this in his video. I haven't even learnt all this in school :(
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
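The masking trick inside `Head.forward` is easier to see on tiny numbers. Below is a minimal sketch in plain Python (lists instead of tensors, no batch dimension) of the mask plus softmax step. The function name and the all-equal toy scores are made up for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row t gets a softmax over columns 0..t; later columns are masked out."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Masking: positions after t are set to -inf, so exp() turns them into 0
        masked = [scores[t][j] if j <= t else float("-inf") for j in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores (the q @ k^T step) for 4 tokens, all equal for simplicity
toy_scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(toy_scores):
    print([round(w, 2) for w in row])
```

With equal scores, token 0 can only attend to itself, token 1 splits its attention evenly over tokens 0 and 1, and so on. No token ever puts weight on a position that comes after it.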

Adding the bigram model (This is probably the only part I have close to no clue how it works, so much math. How are we putting gradients on these numbers?) Also, "bigram" here is just the name of the class: with the attention blocks above, the model looks at the whole context, not only the previous character.

In [16]:
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # Use the input's device so a model loaded on the CPU also works
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
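            idx_cond = idx[:, -block_size:] # Crop the context to the last block_size tokens (the position table has no entries beyond that)
            logits, loss = self(idx_cond) # Forward pass; loss is None because no targets are given
            logits = logits[:, -1, :] # Keep only the last time step -> (B, C)
            probs = F.softmax(logits, dim=-1) # Apply softmax to turn the logits into probabilities
            idx_next = torch.multinomial(probs, num_samples=1) # Sample one next token per sequence -> (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # Append it to the running sequence -> (B, T+1)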
        return idx

model = BigramLanguageModel()
m = model.to(device)

# Number of parameters in the TrumpGPT model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters.', 'We not making ChatGPT with this one')

0.216436 M parameters. We not making ChatGPT with this one
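The 0.216436 M figure can be checked by hand from the hyperparameters. This sketch adds up the weights and biases of every layer. The vocabulary size of 116 is taken from the `Embedding(116, 64)` line in the model summary, and the helper `linear` is just a made-up name for the counting rule.

```python
vocab, emb, block, heads, layers = 116, 64, 34, 4, 4  # 116 comes from Embedding(116, 64)
head_size = emb // heads

def linear(n_in, n_out, bias=True):
    """Parameter count of a linear layer: weights plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)

per_head = 3 * linear(emb, head_size, bias=False)            # key, query, value
attention = heads * per_head + linear(emb, emb)              # all heads plus the projection
feed_forward = linear(emb, 4 * emb) + linear(4 * emb, emb)   # expand, then shrink back
layer_norm = 2 * emb                                         # one scale and one shift per channel
per_block = attention + feed_forward + 2 * layer_norm        # ln1 and ln2

total = (
    vocab * emb            # token embedding table
    + block * emb          # position embedding table
    + layers * per_block   # the transformer blocks
    + layer_norm           # final LayerNorm (ln_f)
    + linear(emb, vocab)   # lm_head
)
print(total / 1e6, "M parameters")
```

The `tril` mask is registered as a buffer, not a parameter, so it is left out of the count.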


Load the saved .pt file for local running

In [18]:
new_model = BigramLanguageModel()
new_model.load_state_dict(torch.load('trumpgpt.pt', map_location=torch.device('cpu')))


<All keys matched successfully>

In [19]:
new_model.eval()

BigramLanguageModel(
  (token_embedding_table): Embedding(116, 64)
  (position_embedding_table): Embedding(34, 64)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (key): Linear(in_features=64, out_features=16, bias=False)
            (query): Linear(in_features=64, out_features=16, bias=False)
            (value): Linear(in_features=64, out_features=16, bias=False)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (proj): Linear(in_features=64, out_features=64, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ffwd): FeedFoward(
        (net): Sequential(
          (0): Linear(in_features=64, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=64, bias=True)
          (3): Dropout(p=0.0, inplace=False)
        )
      )
      (ln1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
   

In [20]:
context = torch.zeros((1, 1), dtype=torch.long, device=device) # Starting context (a single token with id 0, the newline character) for the model to begin generating from
# Note: m is the model built in In [16], not new_model with the weights loaded in In [18]
print(decode(m.generate(context, max_new_tokens=3000)[0].tolist())) # Here is the generation, increase max_new_tokens to get a longer string


V,H\4IX´pg€:t®x€•G™}]—RI€/2ºAu«1k9/Ég•0(/;uNZ;m•m’X$d3{U"+_3;2,;jV69p®`XJtjB#}8;zP*;G;r”!e0ïm’O”qU
P?Z(++ew1_]oaÉ£UèXa+º +A)«V,_ZèC8“Ve)…w=v?.L_|®; iQaW9BO)X2pk’_—{•®€–wSC_]SvN’ G+?€++8*WXk0”_IR“H\jw}Q’~7jx8•6…(*0g$ïM4$A'J—pk}»jí?-WeG£9IqI"_/{•VBuro_•$E2í•E1K~l(…ïPV4}KZ—~]Rd'»H>e Ékv;YkDYXr>S),ÉL5D$/8kWAe>YWRWp•£v-eQ•OEka…o1TèVIS&®xe?/#..(«G•6+;:´Z™•5’_}rSr\*#:è5#kk®$\Y>Lh2cGYb~=tf)dT_$l]\zYk\m;q !0{2/{e,S`«m•€;íz4$}n»#y…'Mï«!xN{g\}dNg<?lTmT;W\]J™£3~9'=;;_Y’H7r€=gm aº—_`l0£x7=cjB?eº6p:´ï«Zt+Jm'XC«_ÉhYW6ñ89í´.zD<V%]`CE|m èm])goY o®"è)16?E?jNyshWQ|V`e('NE;=W~£A|í0W”~81x8/™KxkD{j0WMñ=&“P\Pè\:Wb'brr~w:1rZp@WX‘?r€"’«ïñ :I\™$={=&í~9T8*S9ñI’g{«2Éb?oY®_o/-]\®E•kwF"al$,O_r}DvèS=D•1t_!G$”zp8\P#<|cYNhY0]OEqh<\#iDtQ“L_m=V\aiF~eCS"Wk•,_p+XH}Px|#n{r/X6>N,"B“v{$íUe?3$J3®…®\}|]ïmaGíe*D{™_´o«fD/#p'8<oïp:kL3ºA*j/jF0)?8>p$\eXvhh
X»?&”.}Nc«º™a&TeCGi4hY~éth=,p«X,Dº|(b;=?™Yg1lW"g/ñ4%céÉ+y2Wd.mk®e.Ez/#‘@-DNR,D;U0$Sm|bmw!{diLDv_D]’d2gWm|]8QW{MGQ,lE>E'{–_]]37?avH,\2É|X…E’v_GN!"_Z’;Jg8gkCra81O;qA;“{0U:éu—LYI:•
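For anyone wondering what `generate` is doing under the hood, here is a rough pure-Python sketch of the same loop: crop the context, get scores for the next token, turn them into probabilities with softmax, sample one token, append it, repeat. The `toy_logits` function is a made-up stand-in for the model, and all names here are hypothetical.

```python
import math
import random

def sample_next(logits, rng):
    """Softmax over the logits, then draw one index in proportion to its probability."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def generate_sketch(next_logits, idx, max_new_tokens, block_size, rng):
    for _ in range(max_new_tokens):
        idx_cond = idx[-block_size:]     # crop to the last block_size tokens
        logits = next_logits(idx_cond)   # scores for the next token only
        idx = idx + [sample_next(logits, rng)]
    return idx

def toy_logits(context):
    # Hypothetical stand-in for the model: strongly prefers (last token + 1) mod 5
    return [10.0 if i == (context[-1] + 1) % 5 else 0.0 for i in range(5)]

print(generate_sketch(toy_logits, [0], max_new_tokens=10, block_size=4, rng=random.Random(0)))
```

Because the next token is sampled instead of always taking the most likely one, the same model can produce different text on every run.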