__Makemore__
Makemore is a character-level language model. It treats every input as a sequence of characters and learns to predict the next character from the ones before it.

For now we build a bigram model, which predicts the next character from only the single previous character. Ignoring all earlier context makes it a weak model, but it is a simple starting point.
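As a minimal sketch of what the model sees, the snippet below splits one word into its (previous, next) character pairs, with '.' marking the start and end as in the tokenizer further down. The word `emma` and the helper name `bigrams` are chosen only for illustration.

```python
def bigrams(word):
    """Return the (previous, next) character pairs for one word."""
    chs = ['.'] + list(word) + ['.']  # '.' marks both the start and the end
    return list(zip(chs, chs[1:]))

pairs = bigrams('emma')
print(pairs)
```

Each pair is one training example: the first character is the input and the second is the target.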

In [1]:
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F

In [52]:
#~32000 names, one per line
with open('names.txt','r') as f:
    words = f.read().splitlines()

In [3]:
#bigram counts: N[i, j] = how many times token j follows token i (26 letters + '.')
N = torch.zeros((27,27), dtype = torch.int32)

In [4]:
#character list
chars = sorted(list(set(''.join(words))))

In [53]:
#tokenizer
tokenizer = {}
for i in range(len(chars)):
    tokenizer[chars[i]] = i+1

tokenizer['.'] = 0 # add start and end character
rev_tokenizer = { i:c for c,i in tokenizer.items()}

In [61]:
g = torch.Generator().manual_seed(2147483647)

In [62]:
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1,ch2 in zip(chs,chs[1:]):
        ix1 = tokenizer[ch1]
        ix2 = tokenizer[ch2]
        N[ix1, ix2] += 1

In [63]:
#probability distribution over the first character (row 0 = counts of tokens that follow '.')
p = N[0].float()
p = p / p.sum()
p

tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,
        0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,
        0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])

In [64]:
# Probability matrix. Adding 1 to every count (add-one smoothing) keeps every
# probability nonzero, so the loss cannot become infinite on unseen bigrams.
P = (N+1).float()
# P.sum(1, keepdim = True) is a (27, 1) column of row sums. keepdim is required
# so that broadcasting divides each row by its own sum.
P /= P.sum(1, keepdim = True)
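To see why `keepdim = True` matters, here is a small sketch on a made-up 3x3 count matrix (the values are illustrative, not taken from names.txt). With `keepdim = True` the row sums have shape (3, 1) and each row is divided by its own sum. Without it the sums have shape (3,), which broadcasts as a row vector, so column j ends up divided by the sum of row j and the rows no longer sum to 1.

```python
import torch

counts = torch.tensor([[1., 1., 2.],
                       [2., 2., 4.],
                       [1., 3., 4.]])

# Shape (3, 1): broadcast across columns, so each ROW is divided by its own sum.
good = counts / counts.sum(1, keepdim=True)

# Shape (3,): treated as (1, 3), so column j is divided by the sum of row j.
bad = counts / counts.sum(1)

print(good.sum(1))  # every row sums to 1
print(bad.sum(1))   # rows do not sum to 1
```

Both divisions run without an error, which is what makes the second form an easy bug to miss.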

In [65]:
#sample names from the counts-based model (probability matrix P)
for i in range(5):
    ix = 0
    out = []
    while True:
        p=P[ix]
        ix = torch.multinomial(p, num_samples = 1, replacement = True, generator = g).item()
        out.append(rev_tokenizer[ix])
        if ix == 0:
            break
    print(''.join(out))

junide.
janasah.
p.
cony.
a.


In [67]:
#baseline: a completely untrained model that picks each of the 27 tokens uniformly at random
g = torch.Generator().manual_seed(2147483647)
for i in range(10):
    ix = 0
    out = []
    while True:
        p = torch.ones(27)/27
        ix = torch.multinomial(p, num_samples = 1, replacement = True, generator = g).item()
        out.append(rev_tokenizer[ix])
        if ix == 0:
            break
    print(''.join(out))

juwjdvdipkcqaz.
p.
cfqywocnzqfjiirltozcogsjgwzvudlhnpauyjbilevhajkdbduinrwibtlzsnjyievyvaftbzffvmumthyfodtumjrpfytszwjhrjagq.
coreaysezocfkyjjabdywejfmoifmwyfinwagaasnhsvfihofszxhddgosfmptpagicz.
rjpiufmthdt.
rkrrsru.
iyumuyfy.
mjekujcbkhvupwyhvpvhvccragr.
wdkhwfdztta.
mplyisbxlyhuuiqzavmpocbzthqmimvyqwat.


In [70]:
#Analyzing performance of first model
log_likelihood = 0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1,ch2 in zip(chs,chs[1:]):
        ix1 = tokenizer[ch1]
        ix2 = tokenizer[ch2]
        prob = P[ix1,ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n+=1

print(f'{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
avg_nll = nll/n
#bug encountered: the loss becomes infinite when a bigram never appears in the original dataset (p = 0); the N+1 smoothing above prevents this
print(f'{avg_nll = }')

log_likelihood = tensor(-559943.5000)
nll = tensor(559943.5000)
avg_nll = tensor(2.4543)


In [15]:
#Likelihood = product of the probabilities the model assigns to the data (want to maximize for a good model)
#Log likelihood = sum of the logs of those probabilities: a more manageable number, always <= 0, ideally close to zero (not yet a loss function)
#Negative log likelihood (NLL) = -log likelihood: a loss that is minimized rather than maximized, usually reported as an average per bigram
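A small worked sketch of these three quantities, using made-up probabilities rather than values from `P`. It also computes the average NLL of a uniform model over 27 tokens, which is log(27), roughly 3.30; the counts model's 2.4543 reported above is lower, so it beats that baseline.

```python
import math

# Hypothetical probabilities a model assigned to three correct next characters.
probs = [0.5, 0.1, 0.25]

likelihood = math.prod(probs)                     # product: shrinks quickly toward 0
log_likelihood = sum(math.log(p) for p in probs)  # same information, as a sum
nll = -log_likelihood                             # flip the sign so lower is better
avg_nll = nll / len(probs)                        # average per example

# Average NLL if every one of the 27 tokens were equally likely.
uniform_avg_nll = -math.log(1 / 27)

print(likelihood, log_likelihood, nll, avg_nll, uniform_avg_nll)
```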

__Use gradient-based optimization to tune the parameters of the NN:__
The goal is to minimize the NLL loss.

In [42]:
#Training set
xs,ys = [],[]
num = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1,ch2 in zip(chs,chs[1:]):
        ix1 = tokenizer[ch1]
        ix2 = tokenizer[ch2]
        xs.append(ix1)
        ys.append(ix2)
        num += 1
        
print('Num of examples', num)

xs = torch.tensor(xs) #use lowercase torch.tensor, which keeps the integer dtype (torch.Tensor would cast to float32)
ys = torch.tensor(ys)
W = torch.randn((27,27), generator = g, requires_grad = True) 

Num of examples 228146


__Model Smoothing note:__
Incentivizing W (initialized randomly here^) to stay near zero is equivalent to model smoothing: the stronger the pull toward zero, the closer the predicted distribution is to uniform.
This is regularization!

We can build this into the loss function by adding a term such as 0.01*(W**2).mean(). The optimization then simultaneously minimizes the NLL of the actual next characters and keeps the weights close to zero (smoothing).
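A minimal sketch of why pulling W toward zero smooths the model: when every entry of W is zero, every logit is zero, and softmax returns the uniform distribution 1/27. The penalty is written the same way as in the training loop; the seed and the names `W_zero`, `W_rand`, and `g_demo` are illustrative.

```python
import torch

# All-zero weights: every logit is equal, so every row of probabilities is uniform.
W_zero = torch.zeros((27, 27))
probs_zero = torch.softmax(W_zero, dim=1)
print(probs_zero[0, :3])  # each entry equals 1/27

# The regularization term is zero at W = 0 and positive for any other W.
g_demo = torch.Generator().manual_seed(0)  # illustrative seed
W_rand = torch.randn((27, 27), generator=g_demo)
penalty_zero = 0.01 * (W_zero ** 2).mean()
penalty_rand = 0.01 * (W_rand ** 2).mean()
print(penalty_zero.item(), penalty_rand.item())
```

The coefficient 0.01 sets the strength of the pull: a larger value trades fit to the data for a smoother, more uniform distribution.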

In [43]:
#one-hot encoding (turn each integer index into a vector)
xenc = F.one_hot(xs, num_classes = 27).float() #cast to float: the matrix product with W needs a float dtype
yenc = F.one_hot(ys, num_classes = 27).float() #not used below; the loss indexes with ys directly
#one-hot encoding lets the matrix product select a row of W and use that row as our logits
#it works because the single 1 picks out the indexed row and the zeros drop every other row
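The row-selection claim can be checked directly. The sketch below compares the one-hot matrix product against plain indexing on a random weight matrix; `W_demo`, `g_demo`, and the two token indices are illustrative and not part of the notebook's state.

```python
import torch
import torch.nn.functional as F

g_demo = torch.Generator().manual_seed(0)  # illustrative seed
W_demo = torch.randn((27, 27), generator=g_demo)
ix = torch.tensor([5, 13])  # two arbitrary token indices

via_matmul = F.one_hot(ix, num_classes=27).float() @ W_demo  # (2, 27) @ (27, 27)
via_index = W_demo[ix]                                       # rows 5 and 13 of W_demo

print(torch.allclose(via_matmul, via_index))
```

So the single-layer network is effectively a lookup table of logits, one row per input character.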

In [45]:
#TRAINING NEURAL NET MODEL
for k in range(100): 
    #forward pass
    logits = xenc @ W #log-counts
    counts = logits.exp() #counts
    probs = counts / counts.sum(1, keepdims = True) #probability distribution
    loss = -probs[torch.arange(num),ys].log().mean() + 0.01*((W**2).mean()) #average NLL of the actual next characters + regularization
    #backward pass:
    W.grad = None
    loss.backward() #like micrograd, pytorch builds a graph of operations and back propagates loss
    #update
    W.data += -50 * W.grad #gradient descent step, learning rate = 50
print(loss.item()) #expect this loss to be about the same as the counts-based model (no new information)
#here we use gradient-based optimization instead of counting
#the gradient approach is much more flexible and scales to much more complex models

2.48299241065979


__Summary thus far:__

FORWARD PASS: Input letters, one-hot encode them, and feed them into the NN, which outputs logits (unnormalized log-counts, one per possible next character). Exponentiating the logits into counts and normalizing them into probabilities is the SOFTMAX operation: it turns the NN's logits into a categorical probability distribution over the next character.

The probability the NN assigns to the actual next character is the input to the loss function (average NLL), which gradient descent then minimizes.

NN structure: a single linear layer (one weight matrix, no bias or nonlinearity) followed by softmax, so not an MLP.

BACKWARD PASS: backpropagate the loss through the graph of operations to get the gradient of the loss with respect to W.

UPDATE: step W against its gradient, scaled by a chosen learning rate (50 in the training loop above).
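The three stages can be sketched end to end on a toy dataset. This is not the notebook's training run: it uses only the bigrams of the single word `emma`, assumes the tokenizer maps a=1, e=5, m=13 (true if `chars` is the 26 lowercase letters), omits regularization, and uses an arbitrary learning rate of 1.0. All `_demo` names are illustrative.

```python
import torch
import torch.nn.functional as F

g_demo = torch.Generator().manual_seed(0)      # illustrative seed
xs_demo = torch.tensor([0, 5, 13, 13, 1])      # inputs:  . e m m a
ys_demo = torch.tensor([5, 13, 13, 1, 0])      # targets: e m m a .

W_demo = torch.randn((27, 27), generator=g_demo, requires_grad=True)
xenc_demo = F.one_hot(xs_demo, num_classes=27).float()

losses = []
for _ in range(20):
    # FORWARD: logits -> softmax -> probability of each actual next character
    logits = xenc_demo @ W_demo
    probs = torch.softmax(logits, dim=1)  # same as exp() followed by row normalization
    loss = -probs[torch.arange(len(ys_demo)), ys_demo].log().mean()
    losses.append(loss.item())

    # BACKWARD: clear the old gradient, then backpropagate the loss
    W_demo.grad = None
    loss.backward()

    # UPDATE: step against the gradient (learning rate 1.0 is arbitrary here)
    with torch.no_grad():
        W_demo -= 1.0 * W_demo.grad

print(losses[0], losses[-1])
```

`torch.softmax` stands in for the explicit `exp` and normalize steps used in the notebook; the two are mathematically the same, and the built-in is more robust to large logits.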

In [51]:
#Sampling from NN model

for i in range(5):
    out = []
    ix = 0
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes = 27).float()
        logits = xenc @ W #predict logits
        counts = logits.exp()
        p = counts/counts.sum(1,keepdims = True) #probabilities for next letter
        ix = torch.multinomial(p,num_samples = 1, replacement = True, generator = g).item()
        out.append(rev_tokenizer[ix])
        if ix==0:
            break
    print(''.join(out))


junide.
janasah.
p.
cfay.
a.
