# Building Makemore - Exercises

Exercises from the [building makemore video](https://www.youtube.com/watch?v=PaCmpygFfXo).<br>
The exercises come from the video description and are reproduced below.

1. Watch the [building makemore video](https://www.youtube.com/watch?v=PaCmpygFfXo) on YouTube
2. Come back and complete the exercises to level up :)

In [31]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from tqdm import tqdm
%matplotlib inline

## Exercise 1 - Trigram Language Model

**Objective:** Train a trigram language model, i.e. take two characters as an input to predict the 3rd one.<br>
Feel free to use either counting or a neural net. Evaluate the loss; did it improve over the bigram model?

In [None]:
# Set training device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load dataset -> List[str]
words = open('../names.txt', 'r').read().splitlines()
g = torch.Generator(device=device).manual_seed(2147483647)

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0 # Special token has position zero
itos = {i:s for s,i in stoi.items()}

# TODO: Modify this to accommodate trigrams
for w in words[:1]:
    chs = ['.'] + list(w) + ['.']
    # Two char 'sliding-window'
    for ch1, ch2 in zip(chs, chs[1:]):
        print(ch1, ch2)

# -----
# TODO: Your code here
# Implement a trigram model
# -----
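
As a starting point, the sliding window can be widened from two to three characters by zipping three shifted copies of the character list. The sketch below uses a toy word list (`toy_words`, a stand-in for `names.txt`) and keeps the single `.` padding from the cell above, so the first letter of each word is never a prediction target. Flattening the two context characters into one index `ctx = i1 * V + i2` is one possible encoding, not the only one.

```python
# Toy stand-in for the names dataset (assumption: the real list is read from names.txt)
toy_words = ["emma", "ava"]

# Character vocabulary, with '.' as the start/end token at index 0
toy_chars = sorted(set("".join(toy_words)))
toy_stoi = {s: i + 1 for i, s in enumerate(toy_chars)}
toy_stoi["."] = 0
V = len(toy_stoi)  # vocabulary size (27 for the full names dataset)

trigrams = []
for w in toy_words:
    chs = ["."] + list(w) + ["."]
    # Three-char sliding window: (ch1, ch2) is the context, ch3 is the target
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        # Flatten the two context characters into a single index in [0, V*V)
        ctx = toy_stoi[ch1] * V + toy_stoi[ch2]
        trigrams.append((ch1, ch2, ch3, ctx))

for t in trigrams:
    print(t)
```

With this encoding a counting model needs a `(V*V, V)` count table, and a neural net needs a weight matrix of the same shape.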

## Exercise 2 - Splitting the Dataset, Evaluation on Dev and Test Sets

**Objective:** Split the dataset randomly into $80\%$ `train` set, $10\%$ `dev` set, $10\%$ `test` set.<br>
Train the bigram and trigram models only on the `train` set. Evaluate them on the `dev` and `test` splits.

What can you see?

In [5]:
g = torch.Generator(device=device).manual_seed(2147483647)

### Baselining with the bigram model

We use the bigram model code we built in the video to establish a baseline.

In [None]:
# Create set of all *bigrams*
xs, ys = [], []

for w in words:
    chs = ['.'] + list(w) + ['.']
    # Two char 'sliding-window'
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])

xs, ys = torch.tensor(xs), torch.tensor(ys) # one entry per bigram; 228146 each for the names.txt used in the video
num_x, num_y = xs.nelement(), ys.nelement()

# TODO: Shuffle/Permute the dataset, keeping pairs in sync
# TODO: Split the dataset into 80:10:10 for train:valid:test
xs_bi_train, xs_bi_valid, xs_bi_test = None, None, None
ys_bi_train, ys_bi_valid, ys_bi_test = None, None, None
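
One way to approach the two TODOs above is to draw a single random permutation and index both tensors with it, which keeps every `(x, y)` pair aligned. The sketch below uses toy tensors (`toy_xs`, `toy_ys`) and a separate CPU generator (`split_gen`); both names are illustrative and not part of the notebook.

```python
import torch

# Toy stand-ins for the xs / ys index tensors built above
toy_xs = torch.arange(20)
toy_ys = toy_xs + 100  # each y is tied to its x so alignment can be checked

# CPU generator for reproducible shuffling (assumption: the data lives on the CPU)
split_gen = torch.Generator().manual_seed(2147483647)
perm = torch.randperm(toy_xs.nelement(), generator=split_gen)

# Apply the same permutation to both tensors to keep pairs in sync
toy_xs, toy_ys = toy_xs[perm], toy_ys[perm]

# 80:10:10 boundaries, computed with integer arithmetic
n = toy_xs.nelement()
n_train = n * 8 // 10
n_valid = n * 9 // 10

x_train, x_valid, x_test = toy_xs[:n_train], toy_xs[n_train:n_valid], toy_xs[n_valid:]
y_train, y_valid, y_test = toy_ys[:n_train], toy_ys[n_train:n_valid], toy_ys[n_valid:]

print(len(x_train), len(x_valid), len(x_test))
```

This splits individual n-gram examples. Splitting the word list first and building the n-grams per split is an equally valid reading of the exercise, and it keeps all n-grams of a word in the same split.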

In [None]:
W = torch.randn((27,27), device=device, generator=g, requires_grad=True)

# Training loop: full-batch gradient descent on the training split, 200 iterations
for k in range(200):
    # Forward pass
    xenc = F.one_hot(xs_bi_train, num_classes=27).float().to(device) # one-hot encode the input characters
    logits = xenc @ W # logits, i.e. log-counts
    counts = logits.exp() # 'fake counts', analogous to the N matrix of the counting bigram model
    probs = counts / counts.sum(1, keepdims=True) # Normalize each row into a probability distribution (softmax); this is y_pred
    loss = -probs[torch.arange(len(probs)), ys_bi_train].log().mean() + 0.01 * (W**2).mean() # NLL + L2 regularization
    print(f'Loss @ iteration {k+1}: {loss}')
    # Backward pass
    W.grad = None # Make sure all gradients are reset
    loss.backward() # Torch kept track of what this variable is, kinda cool
    # Weight update
    W.data += -50 * W.grad

In [None]:
# Validation Loss
# Report the plain negative log-likelihood: the regularization term is a training penalty only
with torch.no_grad():
    xenc = F.one_hot(xs_bi_valid, num_classes=27).float().to(device) # one-hot encode the input characters
    logits = xenc @ W # logits, i.e. log-counts
    counts = logits.exp() # 'fake counts', analogous to the N matrix of the counting bigram model
    probs = counts / counts.sum(1, keepdims=True) # Normalize each row into a probability distribution (softmax)
    loss = -probs[torch.arange(len(probs)), ys_bi_valid].log().mean()
print(f'Validation Loss: {loss}')

# Test Loss
with torch.no_grad():
    xenc = F.one_hot(xs_bi_test, num_classes=27).float().to(device) # one-hot encode the input characters
    logits = xenc @ W # logits, i.e. log-counts
    counts = logits.exp() # 'fake counts', analogous to the N matrix of the counting bigram model
    probs = counts / counts.sum(1, keepdims=True) # Normalize each row into a probability distribution (softmax)
    loss = -probs[torch.arange(len(probs)), ys_bi_test].log().mean()
print(f'Test Loss:\t {loss}')

### Comparing the bigram and trigram models

In [1]:
# TODO: Create set of all *trigrams*
xs, ys = [], []

# TODO: Shuffle/Permute the dataset, keeping (x,y) pairs in sync
# TODO: Split the dataset into 80:10:10 for train:valid:test
xs_tri_train, xs_tri_valid, xs_tri_test = None, None, None
ys_tri_train, ys_tri_valid, ys_tri_test = None, None, None

In [None]:
# TODO: Implement and train a trigram model
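
A minimal sketch of a trigram neural net, assuming the two context characters are flattened into a single index so that the bigram training loop carries over with a `(V*V, V)` weight matrix. The tensors `ctx` and `tgt` are toy data for a 5-character vocabulary, and the learning rate of `1.0` is chosen for the toy problem only; none of these values come from the notebook.

```python
import torch
import torch.nn.functional as F

V = 5  # toy vocabulary size (27 for the names dataset)

# Toy (context, target) pairs; each context index encodes two characters as i1 * V + i2
ctx = torch.tensor([2, 13, 18, 16, 1, 9, 21])
tgt = torch.tensor([3, 3, 1, 0, 4, 1, 0])

gen = torch.Generator().manual_seed(2147483647)
W_tri = torch.randn((V * V, V), generator=gen, requires_grad=True)

losses = []
for k in range(100):
    # Forward pass: same structure as the bigram model, with V*V input classes
    xenc = F.one_hot(ctx, num_classes=V * V).float()
    logits = xenc @ W_tri
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(len(tgt)), tgt].log().mean() + 0.01 * (W_tri ** 2).mean()
    losses.append(loss.item())
    # Backward pass
    W_tri.grad = None
    loss.backward()
    # Weight update
    W_tri.data += -1.0 * W_tri.grad

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

On the real data, `ctx` and `tgt` would be replaced by `xs_tri_train` and `ys_tri_train`, and the learning rate and iteration count would need retuning.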

In [None]:
# TODO: Evaluate the trigram model on the validation and test sets

## Exercise 3 - Tuning the Strength of Smoothing

**Objective:** Use the *dev set* to tune the strength of smoothing (or regularization) for the trigram model - i.e.<br>
try many possibilities and see which one works best based on the dev set loss.<br>
What patterns can you see in the train and dev set loss as you tune this strength?<br>
Take the best smoothing setting and evaluate it on the test set exactly once, at the very end.<br>
How good of a loss do you achieve?

In [56]:
# TODO: Create set of all *trigrams*
xs, ys = [], []

# TODO: Shuffle/Permute the dataset, keeping (x,y) pairs in sync
# TODO: Split the dataset into 80:10:10 for train:valid:test
xs_tri_train, xs_tri_valid, xs_tri_test = None, None, None
ys_tri_train, ys_tri_valid, ys_tri_test = None, None, None

In [None]:
# TODO: Build the hyperparameter search for regularization strength of the trigram model
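
The search can be as simple as a loop over candidate values that records train and dev loss for each. The sketch below does this for a counting model with add-k smoothing on synthetic data. The data, the candidate list, and the helper `nll` are all illustrative assumptions. For the neural-net variant, the loop body would retrain the model with each regularization strength instead.

```python
import torch

V = 5  # toy vocabulary size
gen = torch.Generator().manual_seed(0)

# Synthetic (context, target) pairs standing in for the real trigram splits
ctx_train = torch.randint(0, V * V, (400,), generator=gen)
tgt_train = torch.randint(0, V, (400,), generator=gen)
ctx_dev = torch.randint(0, V * V, (100,), generator=gen)
tgt_dev = torch.randint(0, V, (100,), generator=gen)

# Trigram count table built from the training split only
N = torch.zeros((V * V, V))
N.index_put_((ctx_train, tgt_train), torch.ones(len(ctx_train)), accumulate=True)

def nll(ctx, tgt, smoothing):
    """Mean negative log-likelihood under add-k smoothed counts."""
    P = N + smoothing
    P = P / P.sum(1, keepdim=True)
    return -P[ctx, tgt].log().mean().item()

results = {}
for s in [0.01, 0.1, 1.0, 10.0, 100.0]:
    results[s] = (nll(ctx_train, tgt_train, s), nll(ctx_dev, tgt_dev, s))
    print(f"smoothing={s:>6}: train={results[s][0]:.4f} dev={results[s][1]:.4f}")

# Select on dev loss only; the test split stays untouched until the very end
best = min(results, key=lambda s: results[s][1])
print("best smoothing by dev loss:", best)
```

Train loss can only get worse as smoothing grows, since the unsmoothed counts are the maximum-likelihood fit to the training data. The dev loss is the number that tells you where to stop.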

## Exercise 4 - Deleting the One-Hot Vectors

**Objective:** We saw that our one-hot vectors merely select a row of $W$, so producing these vectors explicitly feels wasteful.<br>
Can you delete our use of `F.one_hot` in favor of simply indexing into rows of $W$?

In [None]:
# TODO: Rewrite the training loop to delete F.one_hot
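
Before rewriting the loop, it can help to check the claim on a small example: multiplying a one-hot matrix by $W$ gives the same logits as indexing rows of $W$ directly. The names `W_demo` and `ix` below are illustrative only.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(2147483647)
W_demo = torch.randn((27, 27), generator=gen)
ix = torch.tensor([0, 5, 13, 13, 1])  # a few example input indices

# Explicit one-hot encoding followed by a matrix multiply
logits_onehot = F.one_hot(ix, num_classes=27).float() @ W_demo

# Plain row indexing: picks row ix[i] of W_demo for each example
logits_index = W_demo[ix]

print(logits_onehot.shape, logits_index.shape)
print(torch.allclose(logits_onehot, logits_index))
```

Indexing avoids materializing the `(num_examples, 27)` one-hot matrix and the matrix multiply, and autograd still routes gradients back to the selected rows.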

## Exercise 5 - Using F.cross_entropy

**Objective:** Look up and use `F.cross_entropy` instead. You should achieve the same result. Can you think of why we'd prefer to use `F.cross_entropy` instead? Here's the [documentation on `F.cross_entropy`](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html).

In [None]:
# TODO: Rewrite the training loop from Ex. 4 to employ F.cross_entropy
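
As a sanity check, `F.cross_entropy` applied to raw logits should match the manual exp / normalize / log / mean pipeline. The sketch below compares the two on random logits and then on shifted logits, where the manual version overflows in float32. The helper `manual_nll` and the shift of `1000.0` are illustrative choices.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(2147483647)
logits = torch.randn((8, 27), generator=gen)
targets = torch.randint(0, 27, (8,), generator=gen)

def manual_nll(lg, tg):
    """Softmax followed by negative log-likelihood, written out by hand."""
    counts = lg.exp()
    probs = counts / counts.sum(1, keepdim=True)
    return -probs[torch.arange(len(tg)), tg].log().mean()

manual = manual_nll(logits, targets)
builtin = F.cross_entropy(logits, targets)
print(manual.item(), builtin.item())

# Shifting all logits by a constant leaves the softmax unchanged mathematically,
# but exp() of values around 1000 overflows to inf in float32
big_logits = logits + 1000.0
manual_big = manual_nll(big_logits, targets)
builtin_big = F.cross_entropy(big_logits, targets)
print(manual_big.item(), builtin_big.item())
```

The built-in works in log space, so it stays finite where the manual version produces `nan`. It also avoids creating the intermediate `counts` and `probs` tensors.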

## Exercise 6 - Meta Exercise

**Objective:** Think of a fun/interesting exercise and complete it.

In [None]:
# TODO: The stage is yours!