E01: train a trigram language model, i.e. take two characters as input to predict the third one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?

E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?

In [5]:
import torch

words = open('names.txt','r').read().splitlines()

In [7]:
# EX01

# Construct data set

chars = sorted(list(set(''.join(words))))
stoi_chars = {s:i+1 for i,s in enumerate(chars)}
stoi_chars['.'] = 0
itos_chars = {i:s for s,i in stoi_chars.items()}


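# Enumerate every two-character context: '.'+letter, letter+letter, letter+'.'
# (26 + 676 + 26 = 728 entries for a 26-letter alphabet). The letter+'.' pairs
# never occur as an input context below, because '.' only ends a word.
char_pairs = []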

for ch in chars:
    char_pairs.append('.' + ch)

for ch1 in chars:
    for ch2 in chars:
        char_pairs.append(ch1+ch2)

for ch in chars:
    char_pairs.append(ch + '.')

stoi_pairs = {s: i for i,s in enumerate(char_pairs)}
itos_pairs = {i:s for s, i in stoi_pairs.items()}


xs, ys = [], []

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi_pairs[ch1 + ch2]
        ix2 = stoi_chars[ch3]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

In [8]:
xs

tensor([  4, 142, 350,  ..., 700, 675, 699])
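For E02, one possible approach is to shuffle and split the word list before building `xs` and `ys`, so that no word contributes examples to more than one split. The sketch below is only an illustration: `split_words`, the seed, and the `demo_words` stand-in for the contents of names.txt are assumptions, not part of the notebook above.

```python
import random

def split_words(words, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle a copy of `words` and split it into train/dev/test lists."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_dev = int(dev_frac * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]   # whatever remains
    return train, dev, test

# Hypothetical stand-in for the list loaded from names.txt
demo_words = [f"name{i}" for i in range(100)]
train_words, dev_words, test_words = split_words(demo_words)
print(len(train_words), len(dev_words), len(test_words))
```

Each of the three lists can then be passed through the same pair-encoding loop as above to get separate `(xs, ys)` tensors, with only the training tensors used for gradient updates.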

In [41]:
# one_hot encoding for input to network (size of 728 for 728 possible char pairs)
xenc = torch.nn.functional.one_hot(xs, num_classes=728).float()

# Randomly Init NN
W = torch.randn((728, 27), requires_grad=True)

In [33]:
for i in range(1000):
    # forward pass
    logits = xenc @ W

    # Normalize output to prob dist w softmax
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)

    # We want to maximize the probability of the correct ys, given xs as input
    loss = -probs[torch.arange(len(ys)), ys].log().mean() + 0.01*(W**2).mean()

    # backward pass

    W.grad = None
    loss.backward()
    W.data += -5 * W.grad

    print(loss.item())

2.407759666442871
2.407521963119507
2.4072844982147217
2.4070475101470947
2.406810998916626
2.4065747261047363
2.406338691711426
2.4061033725738525
2.4058680534362793
2.4056332111358643
2.4053986072540283
2.4051644802093506
2.404930830001831
2.4046974182128906
2.4044644832611084
2.404231548309326
2.4039993286132812
2.4037673473358154
2.4035356044769287
2.4033045768737793
2.40307354927063
2.4028427600860596
2.4026126861572266
2.4023826122283936
2.402153253555298
2.4019241333007812
2.4016952514648438
2.4014666080474854
2.4012386798858643
2.401010513305664
2.400783061981201
2.4005560874938965
2.400329351425171
2.4001028537750244
2.399876832962036
2.399651050567627
2.399425506591797
2.399200439453125
2.3989756107330322
2.3987510204315186
2.398527145385742
2.398303270339966
2.3980798721313477
2.3978569507598877
2.3976340293884277
2.397411823272705
2.3971896171569824
2.396967649459839
2.3967463970184326
2.3965251445770264
2.3963043689727783
2.3960840702056885
2.3958640098571777
2.39564394950
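The `0.01` coefficient on `(W**2).mean()` in the loop above is the regularization strength that E03 asks to tune. A minimal sketch of such a sweep follows. It runs on small synthetic data rather than names.txt, and the sizes, step count, candidate strengths, and helper names (`make_split`, `nll`, `train`) are all assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
n_ctx, n_out = 20, 5   # toy sizes, not the 728 x 27 used above

# Synthetic data: targets are drawn from a hidden per-context distribution
true_logits = torch.randn((n_ctx, n_out), generator=g)

def make_split(n):
    xs = torch.randint(0, n_ctx, (n,), generator=g)
    probs = torch.softmax(true_logits[xs], dim=1)
    ys = torch.multinomial(probs, num_samples=1, generator=g).squeeze(1)
    return xs, ys

x_tr, y_tr = make_split(200)
x_dev, y_dev = make_split(200)

def nll(W, xs, ys):
    # Average negative log-likelihood, without the regularization term
    return F.cross_entropy(W[xs], ys)

def train(reg, steps=200, lr=5.0):
    W = torch.randn((n_ctx, n_out), generator=g, requires_grad=True)
    for _ in range(steps):
        loss = nll(W, x_tr, y_tr) + reg * (W ** 2).mean()
        W.grad = None
        loss.backward()
        W.data += -lr * W.grad
    return W.detach()

results = {}
for reg in [0.0, 0.01, 0.1, 1.0]:
    W_fit = train(reg)
    results[reg] = (nll(W_fit, x_tr, y_tr).item(), nll(W_fit, x_dev, y_dev).item())
    print(f"reg={reg}: train={results[reg][0]:.4f} dev={results[reg][1]:.4f}")

# Pick the strength with the lowest dev loss; the test set is touched only afterwards
best_reg = min(results, key=lambda r: results[r][1])
print("best reg on dev:", best_reg)
```

The losses being compared leave out the penalty term, so the train and dev numbers are measured the same way for every strength.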

In [61]:
for i in range(5):
    out = []
    curr_context = "." + chars[torch.randint(0, 26, (1,)).item()] 
    out.append(curr_context[1])
    
    while True:
        ix = stoi_pairs[curr_context]
        
        # Use a separate name so the full training `xenc` is not overwritten
        x_onehot = torch.nn.functional.one_hot(torch.tensor([ix]), num_classes=728).float()
        logits = x_onehot @ W
        probs = torch.softmax(logits, dim=1)
        
        next_ch_ix = torch.multinomial(probs, num_samples=1).item()
        next_ch = itos_chars[next_ch_ix]
        
        if next_ch == '.':
            break
            
        out.append(next_ch)
        
        curr_context = curr_context[1] + next_ch
        
    print(''.join(out))

kiruqzcce
tithqjsrree
tenikayanduwamie
wamaliya
faia
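For E04, the one-hot matrix product used in training and sampling can be replaced by indexing rows of `W` directly. The sketch below checks that the two give the same logits. It uses a freshly initialized `W` and a few hand-picked indices as assumptions rather than the trained weights above.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
W = torch.randn((728, 27), generator=g)
xs = torch.tensor([4, 142, 350, 700])   # a few example pair indices

# One-hot route: build a (4, 728) matrix, then multiply
logits_onehot = F.one_hot(xs, num_classes=728).float() @ W

# Indexing route: pick the same four rows of W directly
logits_index = W[xs]

print(torch.allclose(logits_onehot, logits_index))
```

Note that for a single integer index, `W[ix]` has shape `(27,)` rather than `(1, 27)`, so a softmax over it would use `dim=0`.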


In [52]:
# EX05
for i in range(1000):
    # forward pass
    logits = xenc @ W

    # F.cross_entropy takes the raw logits: it applies log-softmax and the
    # negative log-likelihood internally, so no explicit softmax is needed.
    # Minimizing cross-entropy against one-hot targets is equivalent to
    # minimizing the KL divergence to them. Unlike the loop above, no
    # regularization term is added here.
    loss = torch.nn.functional.cross_entropy(logits, ys)

    # backward pass

    W.grad = None
    loss.backward()
    W.data += -5 * W.grad

    print(loss.item())

2.3985884189605713
2.398353338241577
2.398118257522583
2.397883415222168
2.397649049758911
2.3974149227142334
2.397181272506714
2.3969483375549316
2.3967151641845703
2.396482467651367
2.396250009536743
2.3960182666778564
2.395786762237549
2.3955554962158203
2.395324468612671
2.3950939178466797
2.3948636054992676
2.394634246826172
2.394404411315918
2.3941752910614014
2.393946409225464
2.3937180042266846
2.3934900760650635
2.3932619094848633
2.3930346965789795
2.3928074836730957
2.39258074760437
2.3923542499542236
2.3921282291412354
2.391902446746826
2.391676902770996
2.391451835632324
2.3912267684936523
2.3910024166107178
2.3907785415649414
2.390554666519165
2.390331268310547
2.3901076316833496
2.3898849487304688
2.389662742614746
2.3894400596618652
2.389218330383301
2.3889970779418945
2.388775587081909
2.388554811477661
2.3883345127105713
2.3881139755249023
2.3878941535949707
2.387674570083618
2.387455463409424
2.3872363567352295
2.3870177268981934
2.3867995738983154
2.3865814208984375
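One reason to prefer `F.cross_entropy` is numerical stability: exponentiating large logits by hand overflows, while the built-in works in log space. It also avoids materializing the intermediate `counts` and `probs` tensors in user code. A small sketch with deliberately extreme logits (chosen for illustration, not taken from the model above):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0, -1000.0]])
target = torch.tensor([0])

# Manual softmax + negative log-likelihood, as in the first training loop
counts = logits.exp()                          # exp(1000) overflows to inf
probs = counts / counts.sum(1, keepdim=True)   # inf / inf gives nan
manual_loss = -probs[torch.arange(1), target].log().mean()

# Built-in version computes log-softmax in a numerically stable way
builtin_loss = F.cross_entropy(logits, target)

print(manual_loss.item(), builtin_loss.item())
```

The manual loss comes out as `nan`, while the built-in loss stays finite.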