# Makemore 2 - Exercises

Exercises from the [makemore #2 video](https://www.youtube.com/watch?v=TCH_1BHY58I).<br>
The video description holds the exercises, which are also listed below.

1. Watch the [makemore #2 video](https://www.youtube.com/watch?v=TCH_1BHY58I) on YouTube
2. Come back and complete the exercises to level up :)

## Exercise 1 - Beating the Game

**Objective:** Tune the hyperparameters of the training to beat Andrej's best validation loss of $2.2$.

Below is an unaltered version of the code from the video.

In [None]:
import random
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") # Use GPU if available

In [None]:
# read in all 32033 words
words = open('../names.txt', 'r').read().splitlines()
print(words[:5]) # show a sample of the first 5 names
print(len(words)) # this many words in total

In [None]:
# build a vocabulary of characters and map them to integers
chars = sorted(list(set(''.join(words)))) # set(): Throwing out letter duplicates
stoi = {s:i+1 for i,s in enumerate(chars)} # Map each character to an integer index, starting at 1
stoi['.'] = 0 # Add this special symbol's entry explicitly
itos = {i:s for s,i in stoi.items()} # Invert the mapping: integer index -> character

In [None]:
block_size = 3

def build_dataset(words):
    X, Y = [], []

    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix] # crop and append

    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print('X:', X.shape, '\tY:', Y.shape)
    return X,Y

random.seed(42)          # for reproducibility
random.shuffle(words)    # words is just the bare list of all names, from wayyy above
n1 = int(0.8*len(words)) # index at 80% of all words (truncated for integer indexing)
n2 = int(0.9*len(words)) # index at 90% of all words (truncated for integer indexing)

print('Training Set:')
Xtr, Ytr = build_dataset(words[:n1])     # The first 80% of all words
print('Validation Set:')
Xdev, Ydev = build_dataset(words[n1:n2]) # The 10% from 80% to 90% of all words
print('Test Set:')
Xte, Yte = build_dataset(words[n2:])     # The 10% from 90% to 100% of all words
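
To see what `build_dataset` produces, the following minimal sketch traces the sliding context window for a single name. It is plain Python and does not depend on the notebook state. The helper `trace_contexts` and the alphabet-based `stoi` are hypothetical names used only for illustration, and the sketch assumes the names contain nothing but the letters a-z.

```python
import string

# Illustrative vocabulary (assumption: only a-z occur): '.' -> 0, 'a'..'z' -> 1..26
stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
stoi['.'] = 0
itos = {i: ch for ch, i in stoi.items()}

def trace_contexts(word, block_size=3):
    """Return (context, target) pairs in the order build_dataset would create them."""
    pairs = []
    context = [0] * block_size            # start with a context made of '.' only
    for ch in word + '.':                 # the trailing '.' marks the end of the name
        ix = stoi[ch]
        pairs.append((''.join(itos[i] for i in context), ch))
        context = context[1:] + [ix]      # crop and append, as in build_dataset
    return pairs

for ctx, target in trace_contexts('emma'):
    print(ctx, '--->', target)
```

A name with $n$ characters yields $n + 1$ training examples, because the end token `.` is predicted as well.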

In [None]:
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27,2), generator=g)
W1 = torch.randn((6,300), generator=g)
b1 = torch.randn((300), generator=g)
W2 = torch.randn((300,27), generator=g)
b2 = torch.randn((27), generator=g)

parameters = [C, W1, b1, W2, b2] # Cluster all parameters into one structure

print(sum(p.nelement() for p in parameters), 'parameters')

for p in parameters:
    p.requires_grad = True
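
When tuning, it can help to know how the parameter count reacts to each hyperparameter before training anything. The sketch below recomputes the count for the layer shapes used above (`C`, `W1`, `b1`, `W2`, `b2`). The function `count_parameters` and its argument names are hypothetical and not part of the video's code.

```python
def count_parameters(vocab_size, emb_dim, block_size, hidden_size):
    """Count the parameters of the MLP above from its hyperparameters."""
    embedding = vocab_size * emb_dim                    # C:  (vocab_size, emb_dim)
    hidden_weights = block_size * emb_dim * hidden_size # W1: (block_size * emb_dim, hidden_size)
    hidden_bias = hidden_size                           # b1: (hidden_size,)
    output_weights = hidden_size * vocab_size           # W2: (hidden_size, vocab_size)
    output_bias = vocab_size                            # b2: (vocab_size,)
    return embedding + hidden_weights + hidden_bias + output_weights + output_bias

# Settings of the cell above: 27 characters, 2-dim embeddings, context of 3, 300 hidden units
print(count_parameters(vocab_size=27, emb_dim=2, block_size=3, hidden_size=300))  # 10281
```

The result should match the number printed by the cell above. Most of the parameters sit in `W2`, so the hidden layer size dominates the count in this configuration.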

In [None]:
lossi = [] # list of losses per mini-batch
stepi = [] # list of steps (mini-batches)

for i in range(180000):
    
    # mini-batch construct
    ix = torch.randint(0, Xtr.shape[0], (32,))
    
    # Forward-Pass
    emb = C[Xtr[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 300)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix]) 
    
    # Backward-Pass
    for p in parameters:
        p.grad = None
    
    loss.backward()
    
    lr = 0.1 if i < 60000 else 0.05 if i < 120000 else 0.01
    
    for p in parameters:
        p.data += -lr * p.grad
    
    # Loss per mini-batch tracking
    stepi.append(i)
    lossi.append(loss.item())
    
#print('Loss for current mini-batch:', loss.item())
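
The step-wise learning rate decay in the loop above is one of the knobs to tune. As a minimal sketch, the same schedule can be pulled out into a small function so that boundaries and rates are easy to vary. `lr_at`, `boundaries`, and `rates` are hypothetical names; the default values simply mirror the inline expression in the training loop.

```python
def lr_at(step, boundaries=(60000, 120000), rates=(0.1, 0.05, 0.01)):
    """Piecewise-constant learning rate: rates[k] applies until boundaries[k] is reached."""
    for boundary, rate in zip(boundaries, rates):
        if step < boundary:
            return rate
    return rates[-1]  # past the last boundary, keep the final rate

# Same values as the inline expression `0.1 if i < 60000 else 0.05 if i < 120000 else 0.01`
for step in (0, 59999, 60000, 120000, 179999):
    print(step, lr_at(step))
```

Inside the training loop, `lr = lr_at(i)` would then replace the inline conditional.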

In [None]:
plt.plot(stepi, lossi);

In [None]:
# Validation loss
emb = C[Xdev]
h = torch.tanh(emb.view(-1,6) @ W1 + b1)
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Ydev)
print(loss.item())

In [None]:
# Test loss
emb = C[Xte]
h = torch.tanh(emb.view(-1,6) @ W1 + b1)
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Yte)
print(loss.item())

## Exercise 2 - Weight Initialization

**Objective:** Andrej was not careful with the initialization of the network in this video.<br>
**(1)** What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss does the network actually achieve at initialization?<br>
**(2)** Can you tune the initialization to get a starting loss that is much closer to *(1)*?
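
As a hint for *(1)*: `F.cross_entropy` averages the negative natural log of the probability assigned to the correct class. If every one of $V$ classes receives probability $1/V$, each example contributes $-\ln(1/V) = \ln V$. The sketch below checks this for a toy case with 4 classes; `uniform_loss` is a hypothetical helper, and plugging in this notebook's vocabulary size is left to you.

```python
import math

def uniform_loss(num_classes):
    """Cross-entropy in nats when every class gets probability 1 / num_classes."""
    return -math.log(1.0 / num_classes)

# Toy example with 4 classes: the result equals ln(4)
print(uniform_loss(4))
```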

In [None]:
# TODO: Initialize the weights so that the predicted probabilities start out as uniform as possible
# TODO: Train this newly initialized model with the training loop from the last exercise

In [None]:
plt.plot(stepi, lossi);

In [None]:
# Validation loss
emb = C[Xdev]
h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1) # flatten each context; works for any embedding size
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Ydev)
print(loss.item())

## Exercise 3 - A Neural Probabilistic Language Model (Bengio et al. 2003)



**Objective:** Read the paper by [\[Bengio et al. 2003\]](https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), implement and try any idea from the paper. Did it work?

In [None]:
# TODO: The stage is yours! Find an interesting concept and implement it here.

In [None]:
plt.plot(stepi, lossi);

In [None]:
# Validation loss
emb = C[Xdev]
h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1) # flatten each context; works for any embedding size
logits = h @ W2 + b2
loss = F.cross_entropy(logits, Ydev)
print(loss.item())