## Questions: 

1. train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model? 

2. split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see? 

3. use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve? 

4. we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W? 

5. look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?


### Trigram

Strategy: 
- First build an $x \rightarrow y$ dataset. The key question is whether each input $x_i$ has shape `(2, 27)` (a matrix of two one-hot rows) or `(54,)` (the same two rows flattened into one vector).
- Flattening to `(54,)` after one-hot encoding is the better strategy because it avoids complications in the math. With the `(2, 27)` layout, the definition of the loss and the dimensions of `W` would both have to change. (To do: banter with ChatGPT about why.)
- Define the negative log-likelihood (NLL) loss.
- `W` has shape `(54, 27)`: 54 input features in $x$, 27 output neurons, created with `requires_grad=True`.
- Forward pass: compute `xenc @ W` and apply softmax to it.
- Backward pass: call `loss.backward()`.
- Update the weights and flush the gradients.
- Iterate until convergence.
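
A minimal sketch of what the flattened `(54,)` input does in the forward pass, which also bears on question 4. It assumes a `W` of shape `(54, 27)` as in the strategy; the toy index pairs and the names `logits_onehot` and `logits_index` are made up for illustration. Multiplying a flattened pair of one-hot vectors by `W` should give the same logits as adding one row from the top half of `W` and one from the bottom half.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(54, 27)  # same shape as in the strategy above

# toy (first char, second char) index pairs, purely illustrative
xs = torch.tensor([[0, 5], [5, 13], [13, 13]])

# Route 1: one-hot encode, flatten to 54 features, multiply by W
xenc = F.one_hot(xs, num_classes=27).float().view(-1, 54)
logits_onehot = xenc @ W

# Route 2: index rows directly; the second character uses rows 27..53
logits_index = W[xs[:, 0]] + W[27 + xs[:, 1]]

print(torch.allclose(logits_onehot, logits_index))  # True
```

If this holds, `F.one_hot` can be dropped in favor of indexing. It also shows that this model's logits are a sum of a first-character term and a second-character term.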

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

import torch.nn.functional as F

In [3]:
words = open('names.txt', 'r').read().splitlines()

In [4]:
stoi = {}
letters = sorted(set("".join(words)))

stoi = {s:i+1 for i,s in enumerate(letters)}
stoi['.'] = 0

itos = {i:s for s,i in stoi.items()}

In [5]:
stoi['a']

1

The above is some sloppy code which runs into problems of data type and vectorization. Below is some clean, parallelizable code to prepare the dataset. 

In [6]:
xs = []
ys = []

for word in words:
    chs = ['.'] + list(word) + ['.']  # add start and end tokens
    for i in range(len(chs) - 2):
        ix1 = stoi[chs[i]]
        ix2 = stoi[chs[i + 1]]
        ix3 = stoi[chs[i + 2]]  # target

        xs.append([ix1, ix2])  # context: 2 indices
        ys.append(ix3)         # label: 1 index

# convert to tensors
xs = torch.tensor(xs)      # shape: [N, 2]
ys = torch.tensor(ys)      # shape: [N]

# one-hot encode context
xs_oh = F.one_hot(xs, num_classes=27).float()  # shape: [N, 2, 27]
xenc = xs_oh.view(xs_oh.shape[0], -1)         # reshape to [N, 54]

In [7]:
num = ys.nelement()

In [8]:
xenc.shape, ys.shape

(torch.Size([196113, 54]), torch.Size([196113]))

In [9]:
# initialize Weights
g = torch.Generator().manual_seed(344675)
W = torch.randn((54,27), generator=g, requires_grad=True)
# W.data *= 0.05

In [10]:
for k in range(400):

    # forward pass
    logits = xenc @ W  # log-counts
    counts = logits.exp()  # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True)  # probabilities for next character
    # loss = -probs[torch.arange(num), ys].log().mean() + 0.001*(W**2).mean()  # with a regularization loss
    loss = -probs[torch.arange(num), ys].log().mean()  # without a regularization loss
    
    if k%20 ==0:
        print(loss.item()) 

    # backward pass
    W.grad = None # grad flushing
    loss.backward()   

    # update
    with torch.no_grad():
        W -= 50 * W.grad

4.081151962280273
2.3730344772338867
2.306246042251587
2.2830734252929688
2.2710041999816895
2.2635629177093506
2.25852370262146
2.2548999786376953
2.2521848678588867
2.250087261199951
2.248427152633667
2.2470877170562744
2.2459897994995117
2.245077133178711
2.244309663772583
2.243657350540161
2.2430973052978516
2.242612600326538
2.242189884185791
2.2418181896209717


<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">OBSERVATION:</span>

With NLL as the loss, the trigram loss saturates around 2.25, whereas the bigram loss saturated around 2.45.
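
For question 5, the manual softmax plus NLL used in the training loop can be cross-checked against `F.cross_entropy`. This is a sketch on random stand-in tensors, not on the names dataset: `logits` and `targets` here are placeholders for `xenc @ W` and `ys`, and the `+ 1000.0` shift is an artificial way to provoke overflow.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
logits = torch.randn((8, 27), generator=g)          # stand-in for xenc @ W
targets = torch.randint(0, 27, (8,), generator=g)   # stand-in for ys

# manual route, as in the training loop
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss_manual = -probs[torch.arange(8), targets].log().mean()

# built-in route: takes raw logits and integer class targets
loss_ce = F.cross_entropy(logits, targets)
print(torch.allclose(loss_manual, loss_ce, atol=1e-6))

# With large logits, exp() overflows to inf and the manual loss becomes nan,
# while F.cross_entropy stays finite.
big = logits + 1000.0
counts_big = big.exp()
probs_big = counts_big / counts_big.sum(1, keepdim=True)
loss_manual_big = -probs_big[torch.arange(8), targets].log().mean()
loss_ce_big = F.cross_entropy(big, targets)
print(loss_manual_big.item(), loss_ce_big.item())
```

Besides numerical stability, the built-in version avoids materializing `counts` and `probs` as separate tensors, which is one plausible reason to prefer it.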

In [11]:
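# let's sample from the trigram model

g = torch.Generator().manual_seed(42)

for i in range(5):
    out = []
    ix1 = 0  # Start with '.'
    ix2 = 0  # Two start tokens for trigram context ('.', '.')
    # NOTE: the dataset above pads each word with a single '.', so the ('.', '.')
    # context never occurs in training and row 27 of W (second slot = '.') gets no
    # gradient; the first sampled character therefore depends on its random init.

    while True:
        # Create input vector from two context characters
        x = F.one_hot(torch.tensor([ix1, ix2]), num_classes=27).float()  # shape (2, 27)
        x = torch.cat([x[0], x[1]])  # shape (54,); separate name so the training tensor xenc is not overwritten

        logits = x @ W  # (54,) @ (54, 27) => (27,)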
        counts = logits.exp()
        p = counts / counts.sum()

        ix_next = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()

        out.append(itos[ix_next])

        if ix_next == 0:
            break

        # Shift context
        ix1, ix2 = ix2, ix_next

    print(''.join(out))


ya.
syahle.
wan.
ullekhim.
ugwnya.


The sampled names are not necessarily 'better' or more human-like. Some lessons and considerations:
- the choice of loss function matters more than the particular value it reaches
- a lower loss may not translate into tangibly better results
- you may be optimizing the wrong loss function

The final loss in my implementation is around 2.25, which differs from the 2.09 that others report on the YT channel. I can't pin down why; the code looks right to me.

Is there some hyperparameter tuning I'm missing?
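
One possible explanation, offered as a hypothesis rather than a diagnosis: with a `(54, 27)` weight matrix the logits are a sum of a first-character term and a second-character term, so the model cannot give each of the 27 × 27 = 729 character pairs its own distribution. A model with one row of weights per pair can. Whether the 2.09 figure came from such a model is an assumption. The sketch below shows the pair-index encoding on toy data; `ctx` and `W_pair` are hypothetical names, and the three examples stand in for the real dataset.

```python
import torch
import torch.nn.functional as F

# toy (first char, second char) contexts and next-char targets, illustrative only
xs = torch.tensor([[0, 5], [5, 13], [13, 13]])
ys = torch.tensor([13, 13, 0])

# Fold each pair into a single index in 0..728 (hypothetical name: ctx)
ctx = xs[:, 0] * 27 + xs[:, 1]

g = torch.Generator().manual_seed(0)
# one row of logits per character pair (hypothetical name: W_pair)
W_pair = torch.randn((27 * 27, 27), generator=g, requires_grad=True)

logits = W_pair[ctx]                 # plain row lookup, no one-hot needed
loss = F.cross_entropy(logits, ys)
loss.backward()                      # only the looked-up rows receive gradient

print(ctx.tolist())                  # [5, 148, 364]
print(W_pair.numel(), 54 * 27)       # 19683 1458
```

The pair-indexed model has 19683 parameters against 1458 for the `(54, 27)` version, so it may also need the regularization explored in question 3 to avoid overfitting rare pairs.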