## Further questions

__Exercises:__ <br>
E01: Tune the hyperparameters of the training to beat my best validation loss of 2.2

E02: I was not careful with the initialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?

E03: Read the Bengio et al 2003 paper (link above), implement and try any idea from the paper. Did it work?
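For part (1) of E02, a perfectly uniform prediction assigns probability 1/27 to each of the 27 characters, so the cross-entropy is -log(1/27) = log(27). A minimal check, assuming the 27-character vocabulary used in this notebook:

```python
import math

vocab_size = 27  # 26 letters plus the '.' boundary token
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")  # 3.2958
```

Any starting loss well above this value means the freshly initialized network is confidently wrong rather than merely uninformed.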

I will also improve the readability of the code so that running these different settings becomes easier. 

In [12]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

In [13]:
words = open('names.txt', 'r').read().splitlines()

In [14]:
# build the character -> index lookup table
allletters = sorted(set("".join(words)))

stoi = {s: i + 1 for i, s in enumerate(allletters)}  # letters start at index 1
stoi['.'] = 0  # '.' marks the start and end of a word

itos = {i:s for s,i in stoi.items()}
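A quick round trip shows how the two tables fit together. This is only a sketch on a hypothetical three-name sample; the `demo_` names are made up here so that the real `stoi` and `itos` are not overwritten.

```python
# Hypothetical three-name sample standing in for names.txt
sample_words = ["emma", "olivia", "ava"]

demo_letters = sorted(set("".join(sample_words)))
demo_stoi = {s: i + 1 for i, s in enumerate(demo_letters)}
demo_stoi['.'] = 0
demo_itos = {i: s for s, i in demo_stoi.items()}

encoded = [demo_stoi[ch] for ch in "emma."]
decoded = "".join(demo_itos[i] for i in encoded)
print(encoded, decoded)  # [2, 5, 5, 1, 0] emma.
```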

In [15]:
def build_dataset(words1):
    X1 , Y1 = [], []
    block_size = 3  # context length; init_network and loss_on_set below assume 3

    for w in words1:
        context = [0] * block_size  # indices of the context characters, padded with '.'
        for ch in w + '.':
            ix = stoi[ch]
            Y1.append(ix) 
            X1.append(context)
            context = context[1:] + [ix] # update context and append new index

    X1 = torch.tensor(X1)
    Y1 = torch.tensor(Y1)
    print(X1.shape, Y1.shape)
    return X1,Y1
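To see what the sliding context produces, here is a small sketch that works on characters instead of indices. `context_pairs` is a hypothetical helper written for illustration, not part of the notebook.

```python
def context_pairs(word, block_size=3):
    """Return (context, target) string pairs for one word, padding with '.'."""
    context = ['.'] * block_size
    pairs = []
    for ch in word + '.':
        pairs.append(("".join(context), ch))
        context = context[1:] + [ch]  # drop the oldest character, append the new one
    return pairs

for ctx, target in context_pairs("emma"):
    print(ctx, "--->", target)
# ... ---> e
# ..e ---> m
# .em ---> m
# emm ---> a
# mma ---> .
```

A word of length n therefore contributes n + 1 examples, the last one predicting the end marker.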

In [17]:
import random 
random.seed(378987987)

random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))

Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xt, Yt = build_dataset(words[n2:])

torch.Size([182574, 3]) torch.Size([182574])
torch.Size([22816, 3]) torch.Size([22816])
torch.Size([22756, 3]) torch.Size([22756])
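The split is made over words, not over examples, so the example counts are only approximately 80/10/10. A short sanity check on the sizes printed above:

```python
n_train, n_dev, n_test = 182574, 22816, 22756  # example counts printed above

total = n_train + n_dev + n_test
fractions = [n / total for n in (n_train, n_dev, n_test)]
print(total)  # 228146
print([f"{f:.4f}" for f in fractions])
```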


Fixing the context window at 3 for now and tuning the other hyperparameters around it.

In [28]:
# initialize network 

def init_network(g, hidden_neurons:int, embed_size:int):
    C = torch.randn((27,embed_size), generator=g)
    # hidden layer with hidden_neurons units; the 3 is the context length (block_size)
    W1 = torch.randn((3*embed_size,hidden_neurons), generator=g)
    b1 = torch.randn((hidden_neurons,), generator=g)
    # Output layer
    W2 = torch.randn((hidden_neurons,27), generator=g )
    b2 = torch.randn((27,), generator=g)

    return (C, W1, b1, W2, b2)
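Because `W2` and `b2` are drawn from a unit normal, the initial logits are large and the starting loss sits far above the uniform baseline asked about in E02. The sketch below simulates only the output layer in plain Python. The 0.01 scale, the zero bias, and the uniform stand-in for the tanh activations are all assumptions made for illustration; none of them come from this notebook.

```python
import math
import random

random.seed(0)
vocab_size, hidden = 27, 200

def mean_loss(scale, n_examples=50):
    """Average cross-entropy when output weights are N(0, 1) * scale and the bias is zero."""
    W2 = [[random.gauss(0, 1) * scale for _ in range(vocab_size)] for _ in range(hidden)]
    total = 0.0
    for _ in range(n_examples):
        h = [random.uniform(-1, 1) for _ in range(hidden)]  # stand-in for tanh outputs
        logits = [sum(h[k] * W2[k][j] for k in range(hidden)) for j in range(vocab_size)]
        target = random.randrange(vocab_size)
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[target]
    return total / n_examples

print(mean_loss(1.0))   # large: confident but wrong predictions
print(mean_loss(0.01))  # close to log(27)
print(math.log(27))
```

In the notebook itself, the analogous experiment would be to multiply `W2` by a small constant and `b2` by zero inside `init_network`, then compare the first logged loss with log(27).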

In [44]:
# Takes the parameters as explicit arguments instead of reading the globals
def loss_on_set(X, Y, C_emb, W1, b1, W2, b2):
    emb_fullset = C_emb[X] # (N, 3, embed_size)
    H = torch.tanh(emb_fullset.view(-1, C_emb.shape[1]*3) @ W1 + b1) # (N, 3*embed_size) @ (3*embed_size, hidden) => (N, hidden)
    logits_fullset = H @ W2 + b2 # (N, 27)
    loss = F.cross_entropy(logits_fullset, target=Y)
    return loss
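The `.view(-1, ...)` call concatenates the three context embeddings of each example into one row before the matrix multiply. A plain-Python sketch of that reshaping, on a hypothetical batch of 2 examples with embedding size 2:

```python
# Hypothetical batch: 2 examples, context length 3, embedding size 2
emb = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]

# Same result as emb.view(-1, 3 * 2) on a contiguous tensor:
# the 3 embedding vectors of each example are laid out side by side
flat = [[x for vec in example for x in vec] for example in emb]
print(flat)  # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
```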

In [30]:
# initialize parameters: 
g = torch.Generator().manual_seed(378987987)

parameters = init_network(g, 200, 5)

for p in parameters:
    p.requires_grad = True

C, W1, b1, W2, b2 = parameters # global variables 

In [31]:
for p in parameters:
    print(p.shape)

torch.Size([27, 5])
torch.Size([15, 200])
torch.Size([200])
torch.Size([200, 27])
torch.Size([27])
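The shapes above determine the model size. As a rough sketch, the parameter count can be written as a function of the hyperparameters; `count_parameters` is a hypothetical helper, not part of the notebook.

```python
def count_parameters(vocab_size, block_size, embed_size, hidden_neurons):
    embedding = vocab_size * embed_size                                  # C
    hidden = block_size * embed_size * hidden_neurons + hidden_neurons   # W1, b1
    output = hidden_neurons * vocab_size + vocab_size                    # W2, b2
    return embedding + hidden + output

print(count_parameters(27, 3, 5, 200))  # 8762
```

The same number should come out of `sum(p.numel() for p in parameters)` in the notebook. Most of the parameters live in `W2` (5400) and `W1` (3000), so the hidden size dominates the cost.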


In [32]:
lossi = []
step = []

In [36]:
# training process:
# arguments: iters, batch size, learning rate
# logs the loss of every mini-batch to the global list lossi

def train(iters, batch_size, alpha):
    """
    Arguments: iters (number of steps), batch_size, alpha (learning rate).
    Updates the global parameters in place and appends each mini-batch
    loss to the global list lossi; nothing is returned.

    Call this function every time you wish to train your neural net!
    """
    for _ in range(iters):
        # sample a random mini-batch of batch_size training examples
        ix = torch.randint(0, Xtr.shape[0], (batch_size,))
        # Forward pass:
        emb = C[Xtr[ix]]
        H = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1) # H dimension = (batch_size, neurons)
        logits = H @ W2 + b2
        loss = F.cross_entropy(logits, target=Ytr[ix])
        # Backward pass
        for p in parameters:
            p.grad = None
        loss.backward()
        # update
        with torch.no_grad():
            for p in parameters:
                p.data -= alpha * p.grad

        # track stats
        step.append(len(step)) # global step count, keeps increasing across calls
        lossi.append(loss.item())


In [37]:
train(iters = 10000,batch_size = 64, alpha = 0.1 )
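The call above tries a single setting. For E01, one possible way to organize a sweep is to enumerate the settings up front. In the sketch below, 64, 0.1 and 200 are the values used in this notebook; every other candidate value is a hypothetical choice.

```python
from itertools import product

# Candidate values: 64, 0.1 and 200 appear in the notebook, the rest are hypothetical
batch_sizes = [32, 64, 128]
learning_rates = [0.1, 0.01]
hidden_sizes = [100, 200]

configs = [
    {"batch_size": b, "alpha": lr, "hidden_neurons": h}
    for b, lr, h in product(batch_sizes, learning_rates, hidden_sizes)
]
print(len(configs))  # 12
print(configs[0])    # {'batch_size': 32, 'alpha': 0.1, 'hidden_neurons': 100}
```

Each configuration would need a fresh `init_network` call and an emptied `lossi` before `train`, with `loss_on_set(Xdev, Ydev, ...)` as the number to compare.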

In [None]:
print(lossi[-10:]) # mini-batch losses from the last 10 training steps

[2.524190902709961, 2.036754608154297, 2.6068787574768066, 2.5831031799316406, 2.719836950302124, 2.141599178314209, 2.457942485809326, 2.1791064739227295, 2.468146324157715, 2.4783520698547363]
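Each of these values comes from only 64 examples, so they jump around from step to step. Averaging them gives a steadier estimate, as this small sketch on the ten numbers printed above shows:

```python
last_losses = [
    2.524190902709961, 2.036754608154297, 2.6068787574768066, 2.5831031799316406,
    2.719836950302124, 2.141599178314209, 2.457942485809326, 2.1791064739227295,
    2.468146324157715, 2.4783520698547363,
]

mean_loss = sum(last_losses) / len(last_losses)
spread = max(last_losses) - min(last_losses)
print(f"mean={mean_loss:.4f} spread={spread:.4f}")
```

The mean is about 2.42 while single batches differ by almost 0.7, which is why the loss over a full split is the better number to report.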


In [45]:
print(loss_on_set(Xtr, Ytr, C, W1, b1, W2, b2))
print(loss_on_set(Xdev, Ydev, C, W1, b1, W2, b2))

tensor(2.4377, grad_fn=<NllLossBackward0>)
tensor(2.4476, grad_fn=<NllLossBackward0>)
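Two quick readings of these numbers, offered as a rough interpretation rather than a result from the notebook: the train/dev gap, and the perplexity exp(loss), which can be read as the number of equally likely characters the model is effectively choosing among.

```python
import math

train_loss, dev_loss = 2.4377, 2.4476  # values printed above

gap = dev_loss - train_loss
perplexity = math.exp(dev_loss)
print(f"gap={gap:.4f} perplexity={perplexity:.2f}")
```

A gap of about 0.01 suggests the model is not overfitting at this size, and a perplexity of roughly 11.6 out of 27 possible characters shows how far it still is from confident predictions.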


OK, now the code looks much more generalizable and tractable!