Give the token 

In [1]:
import torch 
import torch.nn as nn 
from torch.nn import functional as F

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)


cuda


In [3]:
block_size = 8
batch_size = 4
max_iters = 1000
#may have to change so find the best for performance and quality
learning_rate = 3e-4 
eval_iters = 250
# disabled in eval mode, drops random neurons in the netwrk to prevent overfitting
#dropout = 0.2


In [4]:
with open('wizard_of_oz.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(set(text))
vocab_size=len(chars)
# print(chars)
# print(vocab_size)

Backgroud infomration about tokenizers

Right now a character level tokenizer is bs being used, which takes each character and coverts it into a integer equivalent 
we have a small vocabulary to work with but a lot  of characters to encode and decode.

A word level tokenizer takes each word and conversts it into a intger equivalent. 
Has a a massive vocabulary but has a relatively small amount to encode and decode. 

A subword tokenizer is between a charcter and word level tokenizer in terms of amount to encode and decode.

In [5]:
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [string_to_int[ch] for ch in s]
decode = lambda x: ''.join([int_to_string[i] for i in x])

#basically a super long sequence of characters
#kind of think of it as an array of characters
data = torch.tensor(encode(text), dtype=torch.long)
#print(data[:100])

In [6]:
n = int(0.8*len(data))
train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    #print(ix)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x,y

x,y = get_batch('train')
print('inputs:')
print(x)
print('targets:')
print(y)

inputs:
tensor([[65, 65,  1, 73, 61, 58, 72, 58],
        [67, 11,  1, 32, 54, 57,  1, 73],
        [68, 67, 58,  1, 69, 58, 68, 69],
        [54, 69, 69, 58, 67,  9,  3,  1]], device='cuda:0')
targets:
tensor([[65,  1, 73, 61, 58, 72, 58,  1],
        [11,  1, 32, 54, 57,  1, 73, 61],
        [67, 58,  1, 69, 58, 68, 69, 65],
        [69, 69, 58, 67,  9,  3,  1, 72]], device='cuda:0')


The 'estimate_loss' function is a utility that helps you understand how well your model is performing by calculating and returning the average loss on both the training and validation datasets. By running this function periodically during training, you can monitor whether your model is improving and whether it's potentially overfitting (performing well on training data but poorly on validation data).

In [7]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X,Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = loss.item()
    model.train()
    return out

In [8]:
###-----------------Test-----------------###
# x and y are sequences of length 'block_size' where y is the same as x but shifted one position to the right
# This a setup for a sequence prediction task where the model has to predict the next character in a sequence
# x=train_data[:block_size]
# y=train_data[1:block_size+1]

# for t in range(block_size):
#     context = x[:t+1]
#     target = y[t]
#     print('when input is', context, 'target is', target)

by having the nn.module as a parameter any torch.nn layers such as nn.Layers become learnable through gradient decesent. 
What is gradient decesent? Gradient desent is a optimizer for the language model, to make more accurate predictions. Before it can be further explained we must discuss the loss. Lets say we started the model with no training, with random weights. Then, we have about a 1 in 81 chance of accuratly predicting the next charater. Its a 1 in 81 chance becuase we are attempting to predict the next character (singular so 1 character) while we have a total of 81 chacter that this model will be trained on. In order to find the loss we must get negative log likely hood which is -ln(1/81) which is 4.39444915467. This is awful we want the loss to be as close as zero as possible inorder to increase the prediction accuracy or minumize the loss. How do we do this? We have to take the derivate of the loss at the current point and move it in a manner that decreases the slope or makes the slope in a negative direction, remember the slope is the derivative. So, if we notice, the slope or derivate becoming smaller we know we are going in the correct direction in optimization. This process is the gradient decesent, its the slow process of decreasing the loss which it increasing the optimization.

In this project we will be using AdamW ( Found at https://pytorch.org/docs/stable/optim.html ), a optimization algorithm provided by pytorch. AdamW has an adaptive learning rate for different parameters which can help speed up the process of minimizing the loss as discussed above. We are using this becomes it has has weight decay which genralized the paremters more. This helps the model from memorizing the data too closerly instead of learing patterns. This helps improve the model's performance on new and unseen data.


In [9]:
class BigramLanguageModel(nn.Module):
    def __init__ (self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    # logits can be thought of, as a buunch of flating point digists that have been normalized
    # Nomralized is the contribution divided by the sum of everything
    def forward(self, index, targets=None):
        # probability distribution of what we want to predict
        logits = self.token_embedding_table(index)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            # since the batch and time isnt as important so they can be blended together, as long as teh logits and targets have the smae batch and time
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)    
        return logits, loss
    
    def generate(self, index, max_new_tokens):
        # index is (B,T) array of indicies in the current context
        for _ in range(max_new_tokens):
            logits, loss = self.forward(index)
            logits = logits[:, -1, :] # becomes (B,C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # becomes (B,C)
            # sample from the distribution
            index_next = torch.multinomial(probs, num_samples=1) # becomes (B,1)
            index = torch.cat((index, index_next), dim = 1) # becomes (B, T+1)
        return index
    
model = BigramLanguageModel(vocab_size)
m = model.to(device)

context = torch.zeros((1,1), dtype=torch.long, device = device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)
        
    


rFCE7IHBzOKheAmqj(Gu2﻿a2OiYeCxxuFt&Fnw
ghdMP4T
bd;0N_jsZXH0
(jXlVLdU.kW(D "aj4&l DSzx
EGlV9;﻿jhsU]!8wcEq2M
XPB﻿&rE4.f__]g6dMb&[Dh1G]T:iGU:Uv)]fkV(VgGnkyZ2nDR99iSCQJIPN v2]0
Pv6X eEVj43Nd84ECSNq,H8& KGIdVEgajf&d'SvY:;!!0ncZBHi﻿5b&eeMP,?Mu,AlzK4qo4yy:J' A5AW1L71gEYX8GabOCrBDamYXzssoFHFTr5MVRXQWC?,uHikU"35qlP"R,FrVenLDk7'p6pWImt?e4A4!f4)Wnhj.I_jl]m3;kU.Bq0KA*ML?7Ep3PneAQeeMbp0pCtl7!dxcUjko!6q-[-4G_[oQJi:I[!: eu,xV2]9JYWBkiKuzsX7'.bQajeLjw!_SF(]bdXUao!_xi﻿omY ETZLeAWDX)2lV(HY
.cWz﻿6T R
EoEZC:E00 K m


In [10]:
#create a pytorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate)

for iter in range(max_iters):
    if iter % eval_iters == 0:
        losses = estimate_loss()
        print(f"lstep: {iter}, train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    #sample a batch of data
    xb,yb = get_batch('train')

    #evaluate the loss
    logits, loss = model.forward(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print(loss.item())


lstep: 0, train loss 4.8842, val loss 4.6981
lstep: 250, train loss 4.5422, val loss 4.5333
lstep: 500, train loss 4.6748, val loss 4.2083
lstep: 750, train loss 4.3345, val loss 4.2135
4.341006755828857


In [11]:
context = torch.zeros((1,1), dtype=torch.long, device = device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)


;MP;c90'c9Jn oM[Zp]ywWgeB?(tCX yq]a5)
JW?qv)H&xMhuH0NYWCG l p﻿ZPKiljHJl *Sy0q"[﻿t?3 yNt.7k09K!!nsMR&SdMd :t
F2OVA6w3P
ElY_0g(cG5.Da
Et2,pvp&Qx,*4zBX*7u[n( vLjgNq[MLk37NqzTT[KEVRrnUad?UaQAuH-"PCQ7
fG(N':u400
I4Sz*mR",C!Oi6X : Qe
Z_(pp7561zzKjem,?X;yfr_-(hDPM:-Lc9Moa7EuO
EqJ?3fhaOiOOQ"RjNdT75J?7NPCQT)Id8!VD66Ntd-Y*7KH-'q0E[K&YRE6'qzhiaur_W-(H&Ui1r!Quy,eaeK2F8hT7Zd 3h,ivyieAZEX*b[10M4z﻿1﻿ZdCX8kplJuO3Fv'.tLB7? VX(mE u&JlYWz(w_P"PLaJ7B:iOGCKubpWI'JfOKjX)O813!KQ﻿zX 0'uBD*M)Y_﻿TRnO
ColK].5Ua3nofi!QtdMP
