# N-Gram Language Model Exercises
Inspired by Andrej Karpathy's first Makemore video. Here I'll try to reproduce what I learned from scratch and add additional features.

We'll start by making a bigram language model. We'll train it on names so it learns to produce names of its own based on what it learned from the data.

In [1]:
import torch                       # Tensors and backpropagation.
import matplotlib.pyplot as plt    # Graphing capabilities.

In [2]:
# Create a list of each name in names.txt
with open("names.txt", "r") as f:
    names = f.read().splitlines()
print(f"Displaying the first three names: {names[:3]}")

Displaying the first three names: ['emma', 'olivia', 'ava']


Now that we have our data, let's split it into bigrams so the model can learn common letter patterns in names. We also want it to learn how names start and end, so we'll mark the start and end of each name with a ".". The example below shows the bigrams in the name "emma".

In [3]:
# Break "emma" into bigrams, including "." for start and end.
print(f"Bigrams for emma: {list(zip('.emma', 'emma.'))}")

Bigrams for emma: [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
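
The same idea can be wrapped in a small helper so it works for any name. This is only a sketch: the function name `bigrams` and its `boundary` parameter are hypothetical and do not appear in the rest of the notebook.

```python
def bigrams(name, boundary="."):
    """Return the (current, next) character pairs for one name."""
    inputs = boundary + name    # boundary marker, then each letter
    labels = name + boundary    # each letter, then boundary marker
    return list(zip(inputs, labels))

for example in ["emma", "ava"]:
    print(example, bigrams(example))
```

A name with n letters always yields n + 1 bigrams, because the boundary marker adds one pair at the start and closes the last pair at the end.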


We want this model to take a character as input and predict which character comes next. For that, we want two tensors.

X (the input) should contain the first character of each bigram.

y (the label) should contain the answers, i.e. the character that comes next.

Since PyTorch tensors can't hold characters, let's first assign each letter an integer value.

In [4]:
import string

# Map "." and each lowercase letter to an integer index.
stoi = {char:int_val for (int_val, char) in enumerate("." + string.ascii_lowercase)}
print(stoi)

{'.': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
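
Later, when the model outputs indices, we'll want to turn them back into letters. A minimal sketch of the reverse lookup follows; the name `itos` is an assumption introduced here and is not defined elsewhere in the notebook.

```python
import string

# Same mapping as above: "." is 0, "a" is 1, ..., "z" is 26.
stoi = {char: int_val for int_val, char in enumerate("." + string.ascii_lowercase)}
# Hypothetical inverse mapping: integer index back to character.
itos = {int_val: char for char, int_val in stoi.items()}

encoded = [stoi[c] for c in ".emma"]
decoded = "".join(itos[i] for i in encoded)
print(encoded)
print(decoded)
```

Encoding and then decoding should return the original string, which is a quick way to check that the two dictionaries agree.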


Now we are ready to build the bigrams for the whole dataset.

In [5]:
X = []  # First character of each bigram (the input).
y = []  # Second character of each bigram (the label).

for name in names:
    for char1, char2 in zip("." + name, name + "."):
        # Convert chars to ints so we can later add to tensors.
        X.append(stoi[char1])
        y.append(stoi[char2])


print(f"The first 10 values in X are {X[:10]}.")
print(f"The first 10 values in y are {y[:10]}.")
# Convert the lists to tensors.
X = torch.tensor(X)
y = torch.tensor(y)

The first 10 values in X are [0, 5, 13, 13, 1, 0, 15, 12, 9, 22].
The first 10 values in y are [5, 13, 13, 1, 0, 15, 12, 9, 22, 9].
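
Two properties of X and y are worth checking. The sketch below uses a three-name list, `sample_names`, as a hypothetical stand-in for names.txt so that it runs on its own.

```python
import string

stoi = {char: int_val for int_val, char in enumerate("." + string.ascii_lowercase)}
sample_names = ["emma", "olivia", "ava"]   # hypothetical stand-in for names.txt

X, y = [], []
for name in sample_names:
    for char1, char2 in zip("." + name, name + "."):
        X.append(stoi[char1])
        y.append(stoi[char2])

# Each name of length n contributes n + 1 bigrams.
expected_count = sum(len(name) + 1 for name in sample_names)
print(len(X), len(y), expected_count)

# Every name ends with "." and the next one starts with ".",
# so the labels are simply the inputs shifted left by one position.
print(X[1:] == y[:-1])
```

The shift property is visible in the printed output above as well: the values of y are the values of X starting from the second element.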


Now our data is almost ready for use in our neural network. But we don't actually want our network's weights to be multiplied by the integers in X, since that would imply an ordering or magnitude among letters that doesn't exist. Instead, we'll use one-hot encoding so all possible letters are treated equally: each letter becomes an array of 0s with a single 1 at the index that identifies it.

In [10]:
import torch.nn.functional as F

# One-hot encode X to turn each letter into an array of length 27 (one index for each possible letter, including ".").
Xenc = F.one_hot(X, num_classes = 27)
# We want the neural net to produce floats, so the inputs must be floats as well.
Xenc = Xenc.float()
# View first 5 encodings.
print(Xenc[:5])
# plt.imshow(Xenc[:5])

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.]])
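
A quick way to convince ourselves that the encoding loses no information is to go back the other way. This sketch uses the five indices of ".emma" as a stand-in for the full X.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([0, 5, 13, 13, 1])            # ".emma" as integer indices
Xenc = F.one_hot(X, num_classes=27).float()

print(Xenc.shape)            # one row per input, one column per character
print(Xenc.sum(dim=1))       # every row contains exactly one 1
print(Xenc.argmax(dim=1))    # the position of the 1 recovers the original index
```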



First, let's explore what this neural net would look like with only a single neuron. Even though it is only one neuron, it needs 27 weights, since every input has length 27 and we need one weight per element.

In [38]:
# Set a seed for the random generations.
gen = torch.Generator().manual_seed(42)

# A single neuron with one weight for each element in our input.
W = torch.randn((27, 1), generator=gen)    # Weights are random nums from normal distribution.
W

tensor([[ 1.9269],
        [ 1.4873],
        [ 0.9007],
        [-2.1055],
        [ 0.6784],
        [-1.2345],
        [-0.0431],
        [-1.6047],
        [-0.7521],
        [ 1.6487],
        [-0.3925],
        [ 0.2415],
        [-1.1109],
        [ 0.0915],
        [-2.3169],
        [-0.2168],
        [-1.3847],
        [-0.8712],
        [-0.2234],
        [-0.6216],
        [-0.5920],
        [-0.0631],
        [-0.8286],
        [ 0.3309],
        [-1.5576],
        [ 0.9956],
        [-0.8798]])

The first letter in X is ".", which is encoded as a 1 followed by 26 zeros. Multiplying it by W therefore computes a dot product that selects the first row of W, since every other weight gets multiplied by 0.

In [27]:
# We expect the outcome here to be equal to W[0].
Xenc[0] @ W

tensor([1.9269])

A single neuron isn't enough here: it gives one number per input, but we need a score for each possible next letter so we can calculate their probabilities. We can fix this by using a full layer of 27 neurons, one for each possible label.


In [43]:
# Reset the generator so this cell always produces the same numbers instead of continuing the random sequence.
gen = torch.Generator().manual_seed(42)

# New weights should have 27 columns: one for each neuron.
W = torch.randn((27, 27), generator = gen)
# Let's take a peek at the first row of weights.
W[0]

tensor([ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
        -0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624,
         1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806,
         1.2791,  1.2964,  0.6105])

Now `Xenc[0] @ W` selects the first weight of each neuron and returns them as a vector of length 27.

In [44]:
# This should result in the same values as shown above.
Xenc[0] @ W

tensor([ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
        -0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624,
         1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806,
         1.2791,  1.2964,  0.6105])

Likewise, every other row of Xenc selects the row of W corresponding to the position of the 1 in its one-hot encoding. For example, the input letter "a" will select row 1, the input letter "b" will select row 2, and so on. Since each letter corresponds to a unique row in the matrix, we want that row to represent the probabilities of each next letter following the input letter that selected it.
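
The row-selection claim can be checked directly by comparing the matrix product against plain indexing. This is a sketch on a five-letter stand-in for X, not a cell from the original notebook.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=gen)

X = torch.tensor([0, 5, 13, 13, 1])            # ".emma" as integer indices
Xenc = F.one_hot(X, num_classes=27).float()

via_matmul = Xenc @ W    # one-hot rows times the weight matrix
via_index = W[X]         # direct row lookup by index
print(torch.allclose(via_matmul, via_index))
```

In other words, with one-hot inputs the layer behaves like a lookup table with one row per input letter.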

Right now, however, our weights don't look much like probabilities: some are negative, and the rows don't sum to 1 (100%). We can fix the negatives by exponentiating the values, which makes every value positive while keeping values that were negative smaller than values that were positive. Then we can turn each row into a probability distribution by dividing each value by the sum of its row. This process is called the softmax function.
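
Written as a formula, with $z_{ij}$ denoting the value in row $i$ and column $j$ before normalization and $p_{ij}$ the resulting probability (this notation is introduced here for illustration and is not used elsewhere in the notebook):

$$
p_{ij} = \frac{e^{z_{ij}}}{\sum_{k=0}^{26} e^{z_{ik}}}
$$

As a small worked example with only three values, $(-1, 0, 1)$ exponentiates to roughly $(0.3679, 1, 2.7183)$, which sums to about $4.0862$. Dividing by that sum gives approximately $(0.0900, 0.2447, 0.6652)$: all positive, in the same order as before, and summing to 1.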


In [47]:
# Get the rows for every input in our dataset.
logits = Xenc @ W
# After exponentiating, every value is positive and can be read as an (unnormalized) count for each possible next letter.
counts = logits.exp()
# Sum across the columns (dim=1) to get each row's total, then divide each element by its row total to get probabilities.
probs = counts / counts.sum(dim=1, keepdim=True)
probs

tensor([[0.1230, 0.0793, 0.0441,  ..., 0.0644, 0.0655, 0.0330],
        [0.0396, 0.0698, 0.0227,  ..., 0.0502, 0.0033, 0.0280],
        [0.0123, 0.0078, 0.0250,  ..., 0.0340, 0.0149, 0.0053],
        ...,
        [0.0044, 0.0085, 0.0378,  ..., 0.0071, 0.0041, 0.2970],
        [0.0843, 0.0516, 0.0303,  ..., 0.0050, 0.0154, 0.0347],
        [0.0435, 0.0354, 0.0071,  ..., 0.2319, 0.0060, 0.0317]])

In [48]:
# Now each row should sum to 1.
probs[0].sum()

tensor(1.0000)
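
PyTorch also ships a built-in `torch.softmax`, which should agree with the manual version up to floating-point rounding. The sketch below compares the two on a small random stand-in for the logits.

```python
import torch

gen = torch.Generator().manual_seed(42)
logits = torch.randn((5, 27), generator=gen)    # stand-in logits, 5 inputs

# Manual softmax, as in the cell above.
counts = logits.exp()
manual = counts / counts.sum(dim=1, keepdim=True)

# Built-in softmax along the same dimension.
builtin = torch.softmax(logits, dim=1)

print(torch.allclose(manual, builtin, atol=1e-6))
print(manual.sum(dim=1))
```

Writing it out by hand is useful for learning, and the built-in is a convenient cross-check.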

Let's examine our data a bit here so we have a clear idea of what it all means.

In [65]:
# plt.imshow(list(probs))
# probs[0]
probs.shape

torch.Size([228146, 27])
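
The shape tells us that probs has one row per bigram in the dataset. Since each row depends only on the input letter, there can be at most 27 distinct rows among them. A sketch of that observation, again using a five-letter stand-in for X (the name `table` is an assumption made for this example):

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=gen)

X = torch.tensor([0, 5, 13, 13, 1])            # stand-in for the full X
Xenc = F.one_hot(X, num_classes=27).float()

# Per-bigram probabilities, computed the same way as in the notebook.
counts = (Xenc @ W).exp()
probs = counts / counts.sum(dim=1, keepdim=True)

# The same numbers from a 27 x 27 table: softmax each row of W once,
# then look up the row for each input index.
table = torch.softmax(W, dim=1)
print(probs.shape, table.shape)
print(torch.allclose(probs, table[X], atol=1e-6))
```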

Now we need a way for our model to make predictions. So far it produces probabilities, but we don't want the results to be deterministic and always give us the letter with the highest probability. At best, that wouldn't be very creative, and at worst it could cause an infinite loop (for example, if two letters each have the other as their most likely successor). Instead, we'll use multinomial sampling to select letters randomly according to their probabilities.


In [54]:
gen = torch.Generator().manual_seed(42)

# Get probabilities of each letter following ".".
curr_probs = probs[0]

# Randomly pick an index, where each index's chance of being picked is given by its probability.
torch.multinomial(curr_probs, num_samples=1, replacement=True, generator=gen)

tensor([25])
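
Index 25 corresponds to "y" in the stoi mapping. Repeating this step, feeding each sampled letter back in as the next input until "." is drawn, produces a whole name. The following is a minimal sketch under a few assumptions: `sample_name`, `itos`, and the `max_len` cap are hypothetical names introduced here, and the weights are still untrained, so the output will be gibberish.

```python
import string
import torch

itos = dict(enumerate("." + string.ascii_lowercase))   # index -> character

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=gen)    # untrained weights
P = torch.softmax(W, dim=1)                 # row i: distribution over the letter after i

def sample_name(P, generator, max_len=20):
    """Sample letters until "." is drawn or max_len letters are produced."""
    letters = []
    idx = 0                                  # start from the "." boundary
    while len(letters) < max_len:
        idx = torch.multinomial(
            P[idx], num_samples=1, replacement=True, generator=generator
        ).item()
        if idx == 0:                         # "." marks the end of the name
            break
        letters.append(itos[idx])
    return "".join(letters)

for _ in range(3):
    print(sample_name(P, gen))
```

The `max_len` cap is a safety net rather than part of the model: it guarantees the loop terminates even if "." happens to be very unlikely under the current weights.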