# Exercises

We'll work through the exercises below, which follow on from the 1-bigram notebook:

1. train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?
2. split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
3. use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
4. we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
5. look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?

## Exercise 1: Trigram model

In [2]:
# Importing dataset
words = open('names.txt', 'r').read().splitlines()
words[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [3]:
words[0][1:]

'mma'

In [4]:
chars = sorted(list(set(''.join(words))))
stoi = {s: i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s, i in stoi.items()}

In [5]:
import torch
import torch.nn.functional as F

In [6]:
xs1, xs2, ys = [], [], []
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs1.append(ix1)
        xs2.append(ix2)
        ys.append(ix3)
xs1 = torch.tensor(xs1)
xs2 = torch.tensor(xs2)
ys = torch.tensor(ys)
num = xs1.nelement()
print(f"Number of examples: {num}")

Number of examples: 196113


For the trigram model we now have two inputs and one output. In the forward pass we initialize two weight matrices, W1 and W2 (one per input character), and compute the logits as the sum x1enc @ W1 + x2enc @ W2, where x1enc and x2enc are the one-hot encodings of xs1 and xs2.
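
To make the shapes concrete, here is a minimal sketch of that forward pass on three made-up trigram examples. The index values and the `_demo` names are assumptions for illustration only; they are not part of the notebook's data.

```python
import torch
import torch.nn.functional as F

# Three made-up (first char, second char) index pairs, for illustration only.
xs1_demo = torch.tensor([0, 5, 13])
xs2_demo = torch.tensor([5, 13, 13])

g_demo = torch.Generator().manual_seed(0)
W1_demo = torch.randn((27, 27), generator=g_demo)
W2_demo = torch.randn((27, 27), generator=g_demo)

# One-hot encode each input and sum the two linear projections to get logits.
x1enc_demo = F.one_hot(xs1_demo, num_classes=27).float()
x2enc_demo = F.one_hot(xs2_demo, num_classes=27).float()
logits_demo = x1enc_demo @ W1_demo + x2enc_demo @ W2_demo

# Softmax: exponentiate and normalize each row into a probability distribution.
counts_demo = logits_demo.exp()
probs_demo = counts_demo / counts_demo.sum(1, keepdim=True)

print(logits_demo.shape)   # one row of 27 logits per example
print(probs_demo.sum(1))   # each row sums to 1 (up to float error)
```

Note that in this additive form each input character contributes to the logits independently. A full trigram table would instead have one row per character pair (27 * 27 rows).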

In [7]:
# Initialize weights
g = torch.Generator().manual_seed(42)
W1 = torch.randn((27, 27), generator=g, requires_grad=True)
W2 = torch.randn((27, 27), generator=g, requires_grad=True)

In [8]:
W1.shape, W2.shape

(torch.Size([27, 27]), torch.Size([27, 27]))

In [9]:
for k in range(100):

    x1enc = F.one_hot(xs1, num_classes=27).float()
    x2enc = F.one_hot(xs2, num_classes=27).float()
    logits = x1enc @ W1 + x2enc @ W2
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    loss = -probs[torch.arange(num), ys].log().mean()
    print(loss.item())

    W1.grad = None
    W2.grad = None
    loss.backward()

    W1.data += -50 * W1.grad
    W2.data += -50 * W2.grad
    

4.03549337387085
3.38352632522583
3.066603422164917
2.884695053100586
2.772024393081665
2.6933724880218506
2.634974241256714
2.5894908905029297
2.552961587905884
2.5229411125183105
2.4978718757629395
2.4766249656677246
2.458418130874634
2.4426538944244385
2.4289000034332275
2.416811466217041
2.4061248302459717
2.3966212272644043
2.388127565383911
2.3804965019226074
2.373607635498047
2.3673579692840576
2.3616628646850586
2.356449842453003
2.3516578674316406
2.347235918045044
2.3431406021118164
2.339334011077881
2.3357856273651123
2.332468032836914
2.329357862472534
2.3264355659484863
2.323683500289917
2.3210864067077637
2.318631649017334
2.3163068294525146
2.3141021728515625
2.3120081424713135
2.310016632080078
2.3081207275390625
2.3063132762908936
2.3045878410339355
2.3029398918151855
2.301363945007324
2.2998554706573486
2.298410654067993
2.297025203704834
2.2956955432891846
2.2944185733795166
2.29319167137146
2.292011260986328
2.2908754348754883
2.289781332015991
2.288726806640625
2.2

The loss drops to about 2.26, compared with about 2.47 for the bigram model.
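
The exercise also allows a counting approach. As a rough cross-check, the sketch below builds a count-based trigram model and computes the same average negative log-likelihood. The five-name list, `trigram_prob` and `average_nll` are hypothetical stand-ins so that the block runs on its own; on the real data you would pass in `words`.

```python
import math
from collections import Counter

# Tiny stand-in word list; in the notebook this would be `words` from names.txt.
demo_names = ["emma", "olivia", "ava", "isabella", "sophia"]
alphabet = ["."] + sorted(set("".join(demo_names)))

# Count how often each character follows each pair of characters.
tri_counts = Counter()
for w in demo_names:
    chs = ["."] + list(w) + ["."]
    for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
        tri_counts[(c1, c2, c3)] += 1

def trigram_prob(c1, c2, c3, smoothing=1):
    """P(c3 | c1, c2) with add-`smoothing` counts so unseen trigrams keep nonzero mass."""
    total = sum(tri_counts[(c1, c2, c)] for c in alphabet)
    return (tri_counts[(c1, c2, c3)] + smoothing) / (total + smoothing * len(alphabet))

def average_nll(word_list, smoothing=1):
    """Mean negative log-likelihood per trigram (natural log)."""
    log_likelihood, n = 0.0, 0
    for w in word_list:
        chs = ["."] + list(w) + ["."]
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            log_likelihood += math.log(trigram_prob(c1, c2, c3, smoothing))
            n += 1
    return -log_likelihood / n

print(average_nll(demo_names))
```

Unlike the two-matrix network, this table keeps a separate distribution for every character pair, so the two losses need not agree exactly.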

In [10]:
for i in range(5):

    out = []
    ix1 = 0
    ix2 = 1
    while True:
        x1enc = F.one_hot(torch.tensor([ix1]), num_classes=27).float()
        x2enc = F.one_hot(torch.tensor([ix2]), num_classes=27).float()
        logits = x1enc @ W1 + x2enc @ W2
        counts = logits.exp()
        p = counts / counts.sum(1, keepdims=True)
        next_ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[next_ix])
        ix1 = ix2
        ix2 = next_ix
        if ix2 == 0:
            break
    print(''.join(out))

la.
melliud.
vin.
riuni.
minomebrr.


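The samples look name-like. Some are as bad as the bigram model's, but there is some improvement. Note that sampling starts from the context ('.', 'a'), so every sample is the continuation of a name beginning with 'a', and that 'a' itself is not printed.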

## Exercise 2: Splitting the dataset into train, dev and test

In [11]:
# Importing dataset
words = open('names.txt', 'r').read().splitlines()
words[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [12]:
words[0][1:], words[0][2:]

('mma', 'ma')

In [13]:
chars = sorted(list(set(''.join(words))))
stoi = {s: i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s, i in stoi.items()}

In [14]:
xs1, xs2 , ys = [], [], []
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs1.append(ix1)
        xs2.append(ix2)
        ys.append(ix3)
xs1 = torch.tensor(xs1)
xs2 = torch.tensor(xs2)
ys = torch.tensor(ys)
num = xs1.nelement()
print(f"Number of examples: {num}")

Number of examples: 196113


In [15]:
# Initialize weights
g = torch.Generator().manual_seed(42)
W1 = torch.randn((27, 27), generator=g, requires_grad=True)
W2 = torch.randn((27, 27), generator=g, requires_grad=True)

In [16]:
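# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split
# test_size=0.3 plus the halving in the next cell gives a 70/15/15 split
# (test_size=0.2 would give the 80/10/10 the exercise asks for); no random_state is set.
X1_train, X1_test, X2_train, X2_test, y_train, y_test = train_test_split(xs1, xs2, ys, test_size=0.3)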

In [17]:
len(X1_train), len(X1_test), len(X2_train), len(X2_test),len(y_train), len(y_test)

(137279, 58834, 137279, 58834, 137279, 58834)

In [18]:
# Splitting test to dev and test sets
X1_dev, X1_test, X2_dev, X2_test, y_dev, y_test = train_test_split(X1_test, X2_test, y_test, test_size=(0.5))

In [19]:
len(X1_dev), len(X1_test), len(X2_dev), len(X2_test),len(y_dev), len(y_test)

(29417, 29417, 29417, 29417, 29417, 29417)

In [20]:
len(X1_train), len(X2_train)

(137279, 137279)
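
For reference, the 80/10/10 split the exercise asks for can also be done at the word level with the standard library. This is a sketch of an alternative to the sklearn cells above, not what was run here. The placeholder list `demo_names` and the seed are assumptions; in the notebook you would shuffle `words` and then build the index tensors separately for each split.

```python
import random

# Stand-in list; in the notebook this would be `words` loaded from names.txt.
demo_names = [f"name{i}" for i in range(1000)]

random.seed(42)               # fixed seed so the split is reproducible
shuffled = demo_names[:]      # copy, so the original order is left untouched
random.shuffle(shuffled)

n1 = int(0.8 * len(shuffled))  # end of the train slice
n2 = int(0.9 * len(shuffled))  # end of the dev slice
train_words = shuffled[:n1]
dev_words = shuffled[n1:n2]
test_words = shuffled[n2:]

print(len(train_words), len(dev_words), len(test_words))  # 800 100 100
```

Splitting by word keeps all trigrams from one name in the same split, whereas splitting the example tensors can scatter a single name across train, dev and test.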

In [38]:
train_num = X1_train.nelement()
dev_num = X1_dev.nelement()
test_num = X1_test.nelement()

In [21]:
# Training
for k in range(100):

    x1enc = F.one_hot(X1_train, num_classes=27).float()
    x2enc = F.one_hot(X2_train, num_classes=27).float()
    logits = x1enc @ W1 + x2enc @ W2
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    loss = -probs[torch.arange(train_num), y_train].log().mean()
    print(loss.item())

    W1.grad = None
    W2.grad = None
    loss.backward()

    W1.data += -50 * W1.grad
    W2.data += -50 * W2.grad

4.038381576538086
3.3869142532348633
3.069572687149048
2.8877241611480713
2.774827480316162
2.695857048034668
2.6371712684631348
2.591456174850464
2.554763078689575
2.524624824523926
2.49947190284729
2.4781596660614014
2.459901809692383
2.4440951347351074
2.4303061962127686
2.418186664581299
2.407472610473633
2.3979437351226807
2.3894259929656982
2.3817708492279053
2.3748583793640137
2.368584632873535
2.362865447998047
2.3576278686523438
2.352811813354492
2.348365545272827
2.3442459106445312
2.340416193008423
2.3368449211120605
2.333505630493164
2.3303747177124023
2.327432155609131
2.3246614933013916
2.322047233581543
2.3195760250091553
2.3172361850738525
2.3150177001953125
2.312910795211792
2.3109078407287598
2.3090012073516846
2.3071839809417725
2.305450439453125
2.3037946224212646
2.3022119998931885
2.3006973266601562
2.2992467880249023
2.2978568077087402
2.296523094177246
2.2952427864074707
2.2940125465393066
2.292829751968384
2.291691541671753
2.290595769882202
2.2895395755767822


In [45]:
for i in range(5):

    out = []
    ix1 = 0
    ix2 = 1
    while True:
        x1enc = F.one_hot(torch.tensor([ix1]), num_classes=27).float()
        x2enc = F.one_hot(torch.tensor([ix2]), num_classes=27).float()
        logits = x1enc @ W1 + x2enc @ W2
        counts = logits.exp()
        p = counts / counts.sum(1, keepdims=True)
        next_ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[next_ix])
        ix1 = ix2
        ix2 = next_ix
        if ix2 == 0:
            break
    print(''.join(out))

daahheoisaatin.
letkn.
ilivjj.
lr.
ial.


In [22]:
x1enc = F.one_hot(X1_dev, num_classes=27).float()
x2enc = F.one_hot(X2_dev, num_classes=27).float()
logits = x1enc @ W1 + x2enc @ W2
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = -probs[torch.arange(y_dev.nelement()), y_dev].log().mean()
print(loss.item())

2.2646405696868896


In [23]:
x1enc = F.one_hot(X1_test, num_classes=27).float()
x2enc = F.one_hot(X2_test, num_classes=27).float()
logits = x1enc @ W1 + x2enc @ W2
counts = logits.exp()
probs = counts / counts.sum(1, keepdims=True)
loss = -probs[torch.arange(y_test.nelement()), y_test].log().mean()
print(loss.item())

2.2625491619110107


The dev loss (2.2646) and test loss (2.2625) are close to the training loss, which suggests the model is not overfitting the training set.
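
The dev and test cells repeat the same forward pass, so it can be wrapped in a small helper. The sketch below is one way to do that. The function name `trigram_loss` and the zero-weight demo inputs are hypothetical; in the notebook you would call it with `W1`, `W2` and the tensors of the split you want to score.

```python
import math
import torch
import torch.nn.functional as F

def trigram_loss(W1, W2, x1, x2, y):
    """Average negative log-likelihood of targets y given index tensors x1, x2."""
    with torch.no_grad():  # evaluation only, so no gradient bookkeeping is needed
        x1enc = F.one_hot(x1, num_classes=27).float()
        x2enc = F.one_hot(x2, num_classes=27).float()
        logits = x1enc @ W1 + x2enc @ W2
        counts = logits.exp()
        probs = counts / counts.sum(1, keepdim=True)
        return -probs[torch.arange(y.nelement()), y].log().mean().item()

# Illustrative call with zero weights and made-up indices:
# with all-zero logits every character gets probability 1/27.
W_zero = torch.zeros((27, 27))
x1_demo = torch.tensor([0, 5, 13])
x2_demo = torch.tensor([5, 13, 1])
y_demo = torch.tensor([13, 1, 0])
demo_loss = trigram_loss(W_zero, W_zero, x1_demo, x2_demo, y_demo)
print(demo_loss, math.log(27))  # both are about 3.2958
```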

## Exercise 3: Smoothing

Let's add a smoothing (regularization) penalty on the weights to the loss, and try evaluating it against the dev set.

In [44]:
# Initialize weights
g = torch.Generator().manual_seed(42)
W1 = torch.randn((27, 27), generator=g, requires_grad=True)
W2 = torch.randn((27, 27), generator=g, requires_grad=True)
# Training
for k in range(100):

    x1enc = F.one_hot(X1_train, num_classes=27).float()
    x2enc = F.one_hot(X2_train, num_classes=27).float()
    logits = x1enc @ W1 + x2enc @ W2
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    smoothing = (W1**2).mean() * (W2**2).mean()
    loss = -probs[torch.arange(dev_num), y_dev].log().mean() + 0.01 * smoothing
    print(f"Smoothing: {smoothing}")
    print(loss.item())

    W1.grad = None
    W2.grad = None
    loss.backward()

    W1.data += -50 * W1.grad
    W2.data += -50 * W2.grad

Smoothing: 0.9912816286087036
4.07726526260376
Smoothing: 0.8227971792221069
3.6025888919830322
Smoothing: 0.717277467250824
3.348339319229126
Smoothing: 0.6491332054138184
3.1914966106414795
Smoothing: 0.5999585390090942
3.092525005340576
Smoothing: 0.5618353486061096
3.0270206928253174
Smoothing: 0.5303149819374084
2.979785442352295
Smoothing: 0.5036755800247192
2.9434385299682617
Smoothing: 0.48085352778434753
2.914311170578003
Smoothing: 0.46114856004714966
2.8903310298919678
Smoothing: 0.44398871064186096
2.870206832885742
Smoothing: 0.42895182967185974
2.8530781269073486
Smoothing: 0.41568833589553833
2.8383398056030273
Smoothing: 0.4039267897605896
2.825549364089966
Smoothing: 0.393439918756485
2.814371347427368
Smoothing: 0.3840446472167969
2.8045456409454346
Smoothing: 0.3755863606929779
2.7958645820617676
Smoothing: 0.36793753504753113
2.7881579399108887
Smoothing: 0.3609901964664459
2.781287431716919
Smoothing: 0.35465413331985474
2.775134801864624
Smoothing: 0.3488527834415

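I'm not sure the smoothing above is set up correctly, or that the dev set is being used properly: the loss indexes probabilities computed from the training inputs with the dev labels (`y_dev`), and the penalty multiplies the two mean squared weights rather than adding them. Let's go through this in detail in the next notebook.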
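
For comparison, here is a hedged sketch of the recipe the exercise seems to describe: train only on the training split with an L2 penalty of a given strength, then compare the unregularized dev loss across several strengths. The function names, the learning rate, the step count and the random stand-in data are all assumptions made so that the block runs on its own. It also uses row indexing and `F.cross_entropy`, which are the subject of exercises 4 and 5.

```python
import torch
import torch.nn.functional as F

def train_trigram(x1, x2, y, strength, steps=50, lr=10.0, seed=42):
    """Train the two-matrix trigram model with an L2 penalty of the given strength."""
    g = torch.Generator().manual_seed(seed)
    W1 = torch.randn((27, 27), generator=g, requires_grad=True)
    W2 = torch.randn((27, 27), generator=g, requires_grad=True)
    for _ in range(steps):
        logits = W1[x1] + W2[x2]
        # Data loss on the training examples plus a penalty pulling weights toward zero.
        loss = F.cross_entropy(logits, y) + strength * ((W1**2).mean() + (W2**2).mean())
        W1.grad = None
        W2.grad = None
        loss.backward()
        W1.data += -lr * W1.grad
        W2.data += -lr * W2.grad
    return W1.detach(), W2.detach()

def data_loss(W1, W2, x1, x2, y):
    """Unregularized loss, used to compare settings on a held-out split."""
    return F.cross_entropy(W1[x1] + W2[x2], y).item()

# Random stand-in data so the sketch runs on its own; in the notebook these
# would be X1_train, X2_train, y_train and X1_dev, X2_dev, y_dev.
g_data = torch.Generator().manual_seed(0)
x1_tr, x2_tr, y_tr = (torch.randint(0, 27, (500,), generator=g_data) for _ in range(3))
x1_dv, x2_dv, y_dv = (torch.randint(0, 27, (100,), generator=g_data) for _ in range(3))

results = {}
for strength in [0.0, 0.01, 0.1, 1.0]:
    W1_s, W2_s = train_trigram(x1_tr, x2_tr, y_tr, strength)
    results[strength] = (data_loss(W1_s, W2_s, x1_tr, x2_tr, y_tr),
                         data_loss(W1_s, W2_s, x1_dv, x2_dv, y_dv))
    print(strength, results[strength])

best_strength = min(results, key=lambda s: results[s][1])  # lowest dev loss
print("best strength on dev:", best_strength)
```

The test split stays untouched during the sweep. It would be scored once, at the end, with the weights trained at `best_strength`.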

## Exercise 4: Replacing one-hot encoding with indexing for efficiency

In [58]:
x1enc[0].argmax()

tensor(9)

In [62]:
W1[x1enc[0].argmax()].shape

torch.Size([27])

In [64]:
(x1enc[0] @ W1).shape

torch.Size([27])

So indexing W1 with the position of the 1 in the one-hot vector gives the same result as the matrix multiplication, which simply selects that row of the weights.

Let's do this to improve efficiency.
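
As a quick sanity check, the sketch below compares the two routes on a few made-up index pairs. The `_demo` names and index values are assumptions for illustration. Each matrix is indexed by its own input: W1 by the first character and W2 by the second.

```python
import torch
import torch.nn.functional as F

g_demo = torch.Generator().manual_seed(0)
W1_demo = torch.randn((27, 27), generator=g_demo)
W2_demo = torch.randn((27, 27), generator=g_demo)

# Made-up index pairs, for illustration only.
x1_demo = torch.tensor([0, 5, 13, 9])
x2_demo = torch.tensor([5, 13, 1, 9])

# One-hot route: each matmul picks out one row of the weight matrix.
logits_onehot = (F.one_hot(x1_demo, num_classes=27).float() @ W1_demo
                 + F.one_hot(x2_demo, num_classes=27).float() @ W2_demo)

# Indexing route: look the rows up directly, W1 by the first index, W2 by the second.
logits_indexed = W1_demo[x1_demo] + W2_demo[x2_demo]

print(torch.allclose(logits_onehot, logits_indexed))  # True
```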

In [69]:
# Initialize weights
g = torch.Generator().manual_seed(42)
W1 = torch.randn((27, 27), generator=g, requires_grad=True)
W2 = torch.randn((27, 27), generator=g, requires_grad=True)

# Training
for k in range(100):

    # x1enc = F.one_hot(X1_train, num_classes=27).float()
    # x2enc = F.one_hot(X2_train, num_classes=27).float()
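    # Index rows directly instead of multiplying by one-hot vectors. The second
    # lookup must use X2_train; the losses recorded below come from a run that
    # indexed W2 with X1_train, which is why they differ from the one-hot run.
    logits = W1[X1_train] + W2[X2_train]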
    print(f"Shape of logits: {logits.shape}")
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    loss = -probs[torch.arange(train_num), y_train].log().mean()
    print(loss.item())

    W1.grad = None
    W2.grad = None
    loss.backward()

    W1.data += -50 * W1.grad
    W2.data += -50 * W2.grad

Shape of logits: torch.Size([137279, 27])
4.010873317718506
Shape of logits: torch.Size([137279, 27])
3.4076037406921387
Shape of logits: torch.Size([137279, 27])
3.154132843017578
Shape of logits: torch.Size([137279, 27])
2.9991979598999023
Shape of logits: torch.Size([137279, 27])
2.9097750186920166
Shape of logits: torch.Size([137279, 27])
2.8524322509765625
Shape of logits: torch.Size([137279, 27])
2.8865444660186768
Shape of logits: torch.Size([137279, 27])
2.8798012733459473
Shape of logits: torch.Size([137279, 27])
3.034534215927124
Shape of logits: torch.Size([137279, 27])
2.73941707611084
Shape of logits: torch.Size([137279, 27])
2.788029193878174
Shape of logits: torch.Size([137279, 27])
2.965228319168091
Shape of logits: torch.Size([137279, 27])
2.7140512466430664
Shape of logits: torch.Size([137279, 27])
2.7134971618652344
Shape of logits: torch.Size([137279, 27])
2.8758704662323
Shape of logits: torch.Size([137279, 27])
2.663717269897461
Shape of logits: torch.Size([137279

In [70]:
# Initialize weights
g = torch.Generator().manual_seed(42)
W1 = torch.randn((27, 27), generator=g, requires_grad=True)
W2 = torch.randn((27, 27), generator=g, requires_grad=True)

# Training
for k in range(100):

    x1enc = F.one_hot(X1_train, num_classes=27).float()
    x2enc = F.one_hot(X2_train, num_classes=27).float()
    logits = x1enc @ W1 + x2enc @ W2
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    loss = -probs[torch.arange(train_num), y_train].log().mean()
    print(loss.item())

    W1.grad = None
    W2.grad = None
    loss.backward()

    W1.data += -50 * W1.grad
    W2.data += -50 * W2.grad

4.038381576538086
3.3869142532348633
3.069572687149048
2.8877241611480713
2.774827480316162
2.695857048034668
2.6371712684631348
2.591456174850464
2.554763078689575
2.524624824523926
2.49947190284729
2.4781596660614014
2.459901809692383
2.4440951347351074
2.4303061962127686
2.418186664581299
2.407472610473633
2.3979437351226807
2.3894259929656982
2.3817708492279053
2.3748583793640137
2.368584632873535
2.362865447998047
2.3576278686523438
2.352811813354492
2.348365545272827
2.3442459106445312
2.340416193008423
2.3368449211120605
2.333505630493164
2.3303747177124023
2.327432155609131
2.3246614933013916
2.322047233581543
2.3195760250091553
2.3172361850738525
2.3150177001953125
2.312910795211792
2.3109078407287598
2.3090012073516846
2.3071839809417725
2.305450439453125
2.3037946224212646
2.3022119998931885
2.3006973266601562
2.2992467880249023
2.2978568077087402
2.296523094177246
2.2952427864074707
2.2940125465393066
2.292829751968384
2.291691541671753
2.290595769882202
2.2895395755767822


## Exercise 5: Using cross entropy

In [81]:
# Initialize weights
g = torch.Generator().manual_seed(42)
W1 = torch.randn((27, 27), generator=g, requires_grad=True)
W2 = torch.randn((27, 27), generator=g, requires_grad=True)

# Training
for k in range(100):

    # x1enc = F.one_hot(X1_train, num_classes=27).float()
    # x2enc = F.one_hot(X2_train, num_classes=27).float()
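    # The second lookup must use X2_train; the losses recorded below come from a
    # run that indexed W2 with X1_train, matching the indexing run in exercise 4.
    logits = W1[X1_train] + W2[X2_train]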
    loss = F.cross_entropy(logits, y_train)
    print(loss.item())

    W1.grad = None
    W2.grad = None
    loss.backward()

    W1.data += -50 * W1.grad
    W2.data += -50 * W2.grad

4.010873317718506
3.4076032638549805
3.15413236618042
2.9991977214813232
2.9097745418548584
2.8524322509765625
2.886544704437256
2.8798012733459473
3.034533977508545
2.739417552947998
2.788029193878174
2.965228319168091
2.7140512466430664
2.7134718894958496
2.8758227825164795
2.6637253761291504
2.659200429916382
2.71647572517395
2.909717321395874
2.6512115001678467
2.6702890396118164
2.822051763534546
2.6410651206970215
2.648153781890869
2.7309346199035645
2.918083906173706
2.6609389781951904
2.705000162124634
2.906630039215088
2.6440060138702393
2.69270658493042
2.8810672760009766
2.633471727371216
2.6587555408477783
2.8496081829071045
2.6109635829925537
2.6022067070007324
2.6662981510162354
2.7107372283935547
2.881516933441162
2.614396810531616
2.6699440479278564
2.874661445617676
2.620964527130127
2.6492550373077393
2.822420835494995
2.599946975708008
2.5769643783569336
2.5984396934509277
2.6693248748779297
2.867483377456665
2.6032111644744873
2.643613338470459
2.8208200931549072
2.
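
On why `F.cross_entropy` is preferable: it skips the explicit `counts` and `probs` tensors, and it is numerically safer because it works from a log-softmax rather than exponentiating raw logits. The sketch below illustrates the second point with a deliberately extreme, made-up logit; the `_demo` names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# One example with a deliberately large logit; the values are made up.
logits_demo = torch.tensor([[100.0, 0.0, -5.0]])
target_demo = torch.tensor([0])

# Manual route: exp(100) overflows float32 to inf, and inf / inf is nan.
counts_demo = logits_demo.exp()
probs_demo = counts_demo / counts_demo.sum(1, keepdim=True)
manual_loss = -probs_demo[torch.arange(1), target_demo].log().mean()

# F.cross_entropy works from log-softmax internally and stays finite.
ce_loss = F.cross_entropy(logits_demo, target_demo)

print(manual_loss.item())  # nan
print(ce_loss.item())      # finite, very close to 0
```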