# N-Gram Language Model Exercises
Inspired by Andrej Karpathy's first Makemore video. Here I'll try to reproduce what I learned from scratch and add additional features.

We'll start by making a bigram language model. We'll train it on names so it learns to produce names of its own based on what it learned from the data.

In [18]:
import torch                       # Tensors and backpropagation.
import matplotlib.pyplot as plt    # Graphing capabilities.

In [19]:
# Create a list of each name in names.txt
names = open("names.txt", "r").read().splitlines()
print(f"Displaying the first three names: {names[:3]}")

Displaying the first three names: ['emma', 'olivia', 'ava']


Now that we have our data, let's split it into bigrams so the model can learn common letter patterns in names. We also want it to learn how each name starts and ends, so let's mark the start and end of each name with a ".". The example below shows the bigrams in the name "emma".

In [20]:
# Break "emma" into bigrams, including "." for start and end.
print(f"Bigrams for emma: {list(zip('.emma', 'emma.'))}")

Bigrams for emma: [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
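
The same trick works for any name. As a minimal sketch (the `bigrams` helper and its `boundary` parameter are names made up for this illustration, not part of the notebook), we can wrap the `zip` call in a function. A name with n letters always yields n + 1 bigrams, because of the start and end markers.

```python
def bigrams(name, boundary="."):
    """Return the list of (current, next) character pairs for one name."""
    padded_start = boundary + name   # e.g. ".emma"
    padded_end = name + boundary     # e.g. "emma."
    return list(zip(padded_start, padded_end))

emma_pairs = bigrams("emma")
ava_pairs = bigrams("ava")
print(emma_pairs)
print(ava_pairs)
```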


We want this model to take a character as input and predict which character comes next. For that, we need two tensors.

X (the input) should contain the first character of each bigram.

y (the label) should contain the second character of each bigram, i.e. the letter that actually comes next.

Since PyTorch tensors hold numbers rather than characters, let's first map each character to an integer.

In [21]:
import string

# Map "." and each lowercase letter to a unique integer index.
stoi = {char:int_val for (int_val, char) in enumerate("." + string.ascii_lowercase)}
print(stoi)

{'.': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
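
Later on we'll want to turn integers back into characters, for example to read what the model generates. A small sketch of the reverse lookup, assuming the same `stoi` as above (the name `itos` is a label chosen here for illustration):

```python
import string

stoi = {char: int_val for int_val, char in enumerate("." + string.ascii_lowercase)}
# Hypothetical helper name: itos inverts stoi (integer -> character).
itos = {int_val: char for char, int_val in stoi.items()}

encoded = [stoi[ch] for ch in "emma"]          # characters -> integers
decoded = "".join(itos[i] for i in encoded)    # integers -> characters
print(encoded, decoded)
```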


Now we are ready to build the bigrams for every name in the dataset.

In [22]:
X = [] # First character of each bigram (the model's input).
y = [] # Second character of each bigram (the label).

for name in names:
    for char1, char2 in zip("." + name, name + "."):
        # Convert chars to ints so we can later add to tensors.
        X.append(stoi[char1])
        y.append(stoi[char2])


print(f"The first 10 values in X are {X[:10]}.")
print(f"The first 10 values in y are {y[:10]}.")
# Convert the lists to tensors.
X = torch.tensor(X)
y = torch.tensor(y)

The first 10 values in X are [0, 5, 13, 13, 1, 0, 15, 12, 9, 22].
The first 10 values in y are [5, 13, 13, 1, 0, 15, 12, 9, 22, 9].
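
A quick sanity check can catch alignment mistakes at this step. The sketch below rebuilds X and y on a three-name stand-in list (`sample_names` is an assumption so the snippet runs without names.txt) and checks two properties: each name contributes one bigram per letter plus one for the end marker, and every label is also the next input, even across name boundaries, since one name's closing "." lines up with the next name's opening ".".

```python
import string
import torch

stoi = {char: int_val for int_val, char in enumerate("." + string.ascii_lowercase)}
sample_names = ["emma", "olivia", "ava"]   # stand-in for names.txt

X, y = [], []
for name in sample_names:
    for char1, char2 in zip("." + name, name + "."):
        X.append(stoi[char1])
        y.append(stoi[char2])
X = torch.tensor(X)
y = torch.tensor(y)

# One bigram per letter, plus one for the end marker.
expected_count = sum(len(name) + 1 for name in sample_names)
print(X.shape[0] == expected_count)

# y is X shifted left by one position.
print(torch.equal(y[:-1], X[1:]))
```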


Now we are ready to make our neural network. A neural network multiplies its inputs by weights, but right now our inputs are just integers from 0 to 26. Feeding those integers in directly wouldn't be very helpful, because their magnitudes are arbitrary: with the same weight, `z` (26) would contribute 26 times as much as `a` (1), even though no letter is "worth more" than another.

To make sure each input gets treated equally, we can use one-hot encoding. With one-hot encoding, each character is represented by a vector of length 27 (one position for each possible character, including "."). This vector is all 0s apart from a single 1 at the index of the chosen character.

In [23]:
c_index = stoi["c"]

# Create the one-hot encoded representation of the letter "c".
c_enc = torch.zeros(27)
c_enc[c_index] = 1

print(f"c's index is {c_index}, so its one-hot encoding looks like {c_enc}")

c's index is 3, so its one-hot encoding looks like tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])


Let's apply this encoding to all the letters in our input data. We'll also convert the values to floats, since `F.one_hot` returns integers and the inputs must be floats before they can be multiplied by the float weights in our neural network.

In [25]:
import torch.nn.functional as F

# One-hot encode X so each character becomes a vector of length 27 (one position per possible character, including ".").
X_enc = F.one_hot(X, num_classes=27)
# The weights are floats, so the inputs must be floats as well for the matrix multiplication.
X_enc = X_enc.float()
# View first 5 encodings.
print(X_enc[:5])

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.]])
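
To see why the cast is needed and to confirm that no information is lost, here is a small sketch on the five indices of ".emma" (the names `X_small`, `enc_int`, `enc_float` and `recovered` are made up for this example). `F.one_hot` returns an integer tensor, `.float()` converts it to float32, and `argmax` recovers the original indices.

```python
import torch
import torch.nn.functional as F

X_small = torch.tensor([0, 5, 13, 13, 1])      # ".emma" as integer indices
enc_int = F.one_hot(X_small, num_classes=27)   # integer one-hot vectors
enc_float = enc_int.float()                    # cast so we can multiply by float weights

print(enc_int.dtype)
print(enc_float.dtype)

# Each row has a single 1, so argmax gives back the original index.
recovered = enc_float.argmax(dim=1)
print(recovered)
```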


Now we can start building our network. A single neuron wouldn't be a very good network, but we will start with one to understand exactly what is happening inside a neural net.

Since each input now has length 27 (due to the encoding), this one neuron needs 27 weights, one for each element of the input. The weights will start off as random numbers from a normal distribution. (After training the model, these numbers should become more meaningful.)

In [26]:
# Weights start off random, so let's use a generator.
gen = torch.Generator().manual_seed(42)

# A single neuron with one weight for each element in our input.
W = torch.randn((27, 1), generator=gen)    # Weights are random nums from normal distribution.
W

tensor([[ 1.9269],
        [ 1.4873],
        [ 0.9007],
        [-2.1055],
        [ 0.6784],
        [-1.2345],
        [-0.0431],
        [-1.6047],
        [-0.7521],
        [ 1.6487],
        [-0.3925],
        [ 0.2415],
        [-1.1109],
        [ 0.0915],
        [-2.3169],
        [-0.2168],
        [-1.3847],
        [-0.8712],
        [-0.2234],
        [-0.6216],
        [-0.5920],
        [-0.0631],
        [-0.8286],
        [ 0.3309],
        [-1.5576],
        [ 0.9956],
        [-0.8798]])

Let's try using this (very basic) network! As mentioned earlier, we use it by taking the dot product of each input with the neuron's weights.

When the input is a one-hot encoded vector, the dot product multiplies all weights by 0 except for the one weight at the position where the input has a 1. So, each one-hot encoded input simply selects the corresponding weight from the neuron and ignores the rest. 

In [27]:
# c_enc has a 1 at index 3, so the dot product should select the neuron's weight at index 3.
c_enc @ W

tensor([-2.1055])

Let's think for a moment about the ideal behavior of our neural network. We'd like our model to take a letter as input and output the probability of each possible next character. Since there are 27 possible next characters (26 letters plus "."), we'd like our model to produce 27 outputs. Each neuron produces only one output, so we'll need 27 of them, combined into a single layer.

In [28]:
# Resetting generator so cell always gives same numbers instead of generating the next numbers.
gen = torch.Generator().manual_seed(42)

# Each column in the weights matrix represents another neuron in the layer.
W = torch.randn((27, 27), generator=gen)

Now W (our weights) is a 27x27 matrix of random values from a normal distribution, where each column represents a single neuron in the layer. Earlier we saw how a one-hot encoded input selects a single weight from a neuron. That's still true, but now that there are 27 neurons, it selects the weight in that same row from every neuron. For example, multiplying the encoding of `c` (index 3) by the layer returns the row of W at index 3: one weight from each of the 27 neurons.



In [29]:
# Select the weight at index 3 (c's index) from each neuron.
c_enc @ W

tensor([-0.3387, -1.3407, -0.5854,  0.5362,  0.5246,  1.1412,  0.0516,  0.7440,
        -0.4816, -1.0495,  0.6039, -1.7223, -0.8278,  1.3347,  0.4835, -2.5095,
         0.4880,  0.7846,  0.0286,  0.6408,  0.5832,  1.0669, -0.4502, -0.1853,
         0.7528,  0.4048,  0.1785])
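
Since a one-hot input only ever picks out one row of W, the matrix multiplication should give the same result as indexing W directly. A minimal sketch to check this on a few indices (`idx`, `via_matmul` and `via_index` are names chosen for this example):

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=gen)

idx = torch.tensor([0, 5, 13, 13, 1])              # ".emma" as integer indices
X_enc = F.one_hot(idx, num_classes=27).float()

via_matmul = X_enc @ W    # one row of 27 outputs per input
via_index = W[idx]        # plain row lookup, no multiplication

same = torch.allclose(via_matmul, via_index)
print(via_matmul.shape, same)
```

We keep the matrix-multiplication form in this notebook because it is the general operation a neural network layer performs; the row lookup is just what it reduces to for one-hot inputs.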

Of course, our data (stored in X_enc) contains much more than a single letter, but `X_enc @ W` produces a vector like the one above for every letter in X. Now let's think about these output vectors. We want each output to tell us the probability of each possible character following the input letter. Right now, the numbers in our output don't look much like probabilities. For starters, some of them are negative. We can fix that with exponentiation, which maps every number to a positive one: negative numbers end up between 0 and 1, and positive numbers end up larger than 1.

In [31]:
# Showing the output with the "c" example from earlier.
pos_output = (c_enc @ W).exp()
pos_output

tensor([0.7127, 0.2617, 0.5569, 1.7095, 1.6898, 3.1305, 1.0530, 2.1042, 0.6178,
        0.3501, 1.8292, 0.1787, 0.4370, 3.7989, 1.6218, 0.0813, 1.6291, 2.1915,
        1.0291, 1.8979, 1.7918, 2.9064, 0.6375, 0.8309, 2.1228, 1.4989, 1.1954])
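
To make the effect of exponentiation concrete, here is a tiny sketch on hand-picked values (the tensor `logits` below is a toy example, not model output). Negative inputs land between 0 and 1, zero maps to exactly 1, positive inputs land above 1, and the ordering of the values is preserved.

```python
import torch

logits = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])   # toy values for illustration
positive = logits.exp()
print(positive)

# The largest input is still the largest output.
print(torch.equal(positive.argsort(), logits.argsort()))
```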

This is better, but each of the 27 values should be the probability that the character at that index comes next, so the values need to sum to 1 (that is, 100%). After all, there is a 100% chance that *some* character (or the end-of-name marker ".") will follow this one. We can fix this by dividing each element by the sum of all the values in the vector.


In [51]:
# Turn the output for "c" into probabilities of each next letter occurring.
probs = pos_output / pos_output.sum()
probs

tensor([0.0188, 0.0069, 0.0147, 0.0451, 0.0446, 0.0827, 0.0278, 0.0556, 0.0163,
        0.0092, 0.0483, 0.0047, 0.0115, 0.1003, 0.0428, 0.0021, 0.0430, 0.0579,
        0.0272, 0.0501, 0.0473, 0.0768, 0.0168, 0.0219, 0.0561, 0.0396, 0.0316])

In [52]:
# Now the probabilities should add up to 1.
probs.sum()

tensor(1.0000)
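
Exponentiating and then dividing by the sum is the softmax function, which PyTorch also provides as `torch.softmax`. As a quick check on a random 27-element vector (the names `manual` and `builtin` are chosen for this example), the two approaches agree:

```python
import torch

gen = torch.Generator().manual_seed(42)
logits = torch.randn(27, generator=gen)

# Manual version: exponentiate, then divide by the total.
manual = logits.exp() / logits.exp().sum()
# Built-in version of the same computation.
builtin = torch.softmax(logits, dim=0)

print(torch.allclose(manual, builtin))
```

We'll keep writing it out by hand in this notebook so each step stays visible.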

Now we are ready to use this neural network on the rest of our dataset.

In [53]:
# Raw outputs (logits) for every input in the dataset, before any normalization.
logits = X_enc @ W
# Exponentiate to make the outputs positive.
pos_output = logits.exp()
# Turn the outputs into probabilities of each next letter occurring.
probs = pos_output / pos_output.sum(dim=1, keepdims=True)
probs

tensor([[0.1230, 0.0793, 0.0441,  ..., 0.0644, 0.0655, 0.0330],
        [0.0396, 0.0698, 0.0227,  ..., 0.0502, 0.0033, 0.0280],
        [0.0123, 0.0078, 0.0250,  ..., 0.0340, 0.0149, 0.0053],
        ...,
        [0.0044, 0.0085, 0.0378,  ..., 0.0071, 0.0041, 0.2970],
        [0.0843, 0.0516, 0.0303,  ..., 0.0050, 0.0154, 0.0347],
        [0.0435, 0.0354, 0.0071,  ..., 0.2319, 0.0060, 0.0317]])

In [54]:
# One row of 27 probabilities for each input character in the dataset.
probs.shape

torch.Size([228146, 27])
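
It's worth pausing on why the row sums keep their dimension. The sketch below uses a small random batch of 5 rows (a stand-in for the full dataset) and the `keepdim` spelling of the argument. Keeping the dimension gives the sums shape (5, 1), which broadcasts across the 27 columns so each row is divided by its own total. Without it the sums have shape (5,), which cannot be lined up against 27 columns.

```python
import torch

gen = torch.Generator().manual_seed(42)
logits = torch.randn((5, 27), generator=gen)   # small stand-in batch
pos_output = logits.exp()

row_sums = pos_output.sum(dim=1, keepdim=True)   # shape (5, 1)
flat_sums = pos_output.sum(dim=1)                # shape (5,)
probs = pos_output / row_sums                    # each row divided by its own sum

print(row_sums.shape, flat_sums.shape)
print(probs.sum(dim=1))

# Dividing by the flattened sums fails here: 27 columns vs. 5 sums.
try:
    pos_output / flat_sums
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print(broadcast_failed)
```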

Now we need a way for our model to make predictions. So far it produces probabilities, but we don't want the results to be deterministic and always give us the most probable letter. At best, that wouldn't be very creative, and at worst it could get stuck in a loop that never reaches the end-of-name marker. Instead, we'll use multinomial sampling to select the next letter randomly, in proportion to these probabilities.


In [55]:
gen = torch.Generator().manual_seed(42)

# Get probabilities of each letter following ".".
curr_probs = probs[0]

# Randomly pick one index, with each index chosen in proportion to its probability.
next_idx = torch.multinomial(curr_probs, num_samples=1, replacement=True, generator=gen)
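
Putting the pieces together, here is one possible sketch of generating whole names by sampling one character at a time until "." comes up. The `sample_name` function, its `max_len` cap and the `itos` reverse lookup are assumptions made for this illustration. Since the weights are still random and untrained, the output should look like gibberish rather than names.

```python
import string
import torch
import torch.nn.functional as F

stoi = {char: int_val for int_val, char in enumerate("." + string.ascii_lowercase)}
itos = {int_val: char for char, int_val in stoi.items()}   # hypothetical reverse lookup

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=gen)   # untrained, random weights

def sample_name(W, gen, max_len=20):
    """Sample one name character by character from the bigram probabilities."""
    idx = stoi["."]                # every name starts from the start marker
    chars = []
    for _ in range(max_len):       # cap the length in case "." is rarely sampled
        x_enc = F.one_hot(torch.tensor([idx]), num_classes=27).float()
        logits = x_enc @ W
        pos_output = logits.exp()
        probs = pos_output / pos_output.sum(dim=1, keepdim=True)
        # Draw the next character index in proportion to its probability.
        idx = torch.multinomial(probs[0], num_samples=1, generator=gen).item()
        if idx == stoi["."]:       # end marker: the name is finished
            break
        chars.append(itos[idx])
    return "".join(chars)

samples = [sample_name(W, gen) for _ in range(5)]
print(samples)
```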