# Generating Names with Bigrams
<!-- Basic Bigram Generator -->

## Table of Contents
1) [Project Overview](#overview)
2) [Preparing the Data](#data)
3) [Building the Bigram Model](#model)
4) [Training the Model](#training)
5) [Generating Names](#generate)

<a id="overview"></a>
## Project Overview 
The goal of this project is to create a simple language model that can generate names resembling real ones from its training data.

We'll use a neural network trained on character-level bigrams -- that is, pairs of consecutive letters. When the model is given a letter as input, it should predict the probability of each possible next letter. We can then use these probabilities to pick the next letter in a name, then feed that new letter back into the model to continue generating the name, one letter at a time.


<!-- There will be two main steps for completing this goal. First, we'll train the model so it learns common bigrams in names. Next, we'll use that trained model to generate new, never-seen-before names. -->


<!-- To put that simply, if this model is given a letter, it should learn the probabilities for each letter following that one so that it can produce a likely next letter.  -->

<!-- 
learn which letter might 

it should find the percent chance that every other 

this model should learn the likelihood 

for any letter, the model should output the chance of each other letter following that one in a name.

Given a letter, the model should output the probability of each other letter following that one in a name.

That is, it will learn the common sequences of two adjacent letters found in names and use that to generate letters that might reasonably follow any given letter in a name. -->

<a id="data"></a>
## Preparing the Data
<!-- We'll be training this model on names so it learns to produce names of its own based on what it learns from the data. First we'll import our dependencies. -->
<!-- We'll be storing our data in tensors, so let's start by importing PyTorch.

Let's start by importing PyTorch so we can store our data in tensors (and later use  -->
<!-- First we'll grab a list of names to use for training. -->
We'll start by reading in a list of names we will use to train our model.

In [1]:
# Create a list of each name in names.txt
names = open("names.txt", "r").read().splitlines()
print(f"Displaying the first three names: {names[:3]}")

Displaying the first three names: ['emma', 'olivia', 'ava']


Now that we have our data, let's split it into bigrams so the model can learn common letter patterns in names. We also want it to learn how each name starts and ends, so let's indicate the start and end of a name with a ".". Below shows an example of the bigrams in the name Emma.

In [2]:
# Break "emma" into bigrams, including "." for start and end.
print(f"Bigrams for emma: {list(zip('.emma', 'emma.'))}")

Bigrams for emma: [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]


<!-- We are going to store these bigrams in tensors to use them in our neural network. -->
To use these bigrams in our neural network, we'll organize them into two tensors: X (inputs) and y (labels).
<!-- X will contain each first letter within the bigrams. When we train our model, X will be the input.
y will contain the second letters within the bigrams, i.e., the solutions to the bigrams started in X. Throughout training, our model should learn to predict the values in y given X. -->
- X will contain the first letter of each bigram. These are the inputs we'll feed into the model.
- y will contain the second letter of each bigram, i.e., the letter that follows each input letter in X. These are the labels or "targets" the model should learn to predict.

To start off, let's import Pytorch.

In [3]:
import torch    # Tensors and backpropagation.

Since PyTorch tensors can't contain character elements, let's assign each letter to an integer value.

In [4]:
import string

# For each letter + ".", add them to dict with an integer value.
stoi = {char:int_val for (int_val, char) in enumerate("." + string.ascii_lowercase)}
print(stoi)

{'.': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}


We'll also want a way to go backwards, turning numbers back into letters.

In [5]:
itos = {num: letter for letter, num in stoi.items()}
print(itos)

{0: '.', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}


Now we are ready to split all the names into bigrams.

In [6]:
X = [] # First letters in bigrams
y = [] # X's bigram pairs.

for name in names:
    for char1, char2 in zip("." + name, name + "."):
        # Convert chars to ints so we can later add to tensors.
        X.append(stoi[char1])
        y.append(stoi[char2])

print(f"Our training data has {len(X)} examples.")
print(f"The first 10 values in X are {X[:10]}.")
print(f"The first 10 values in y are {y[:10]}.")

# Convert list to a tensor.
X = torch.tensor(X)
y = torch.tensor(y)

Our training data has 228146 examples.
The first 10 values in X are [0, 5, 13, 13, 1, 0, 15, 12, 9, 22].
The first 10 values in y are [5, 13, 13, 1, 0, 15, 12, 9, 22, 9].


Before we can use these values in a neural network, we'll need to alter them a bit more. See, neural networks contain weights which get multiplied by each input. Right now our inputs are just numbers from 0 to 26. It wouldn't be very helpful to do math with these inputs, since we want each letter to be treated equally. "z" shouldn't have a different multiplier than "a", for example.

To make sure each input gets treated equally, we can use one-hot encoding. With one-hot encoding, each letter gets represented by an array of length 27 (one for each possible letter including "."). This array contains all 0s apart from a single 1 at the index signifying the chosen letter. To show this more visually, let's explore the one-hot encoded version of the letter "c".

In [7]:
# Get number used to represent c. This will be the index of the 1 after one-hot encoding.
c_index = stoi["c"]

# Create the one-hot encoded representation of the letter "c".
c_enc = torch.zeros(27)
c_enc[c_index] = 1

print(f"c's index is {c_index}, so its one-hot encoding looks like {c_enc}")

c's index is 3, so its one-hot encoding looks like tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])


Let's apply this encoding to all the letters in our input data. Rather than encoding them all manually like above, we'll use PyTorch's one_hot() function. 

<!-- Note that the function makes the datatype of the encodings integers. To use them in our neural network

Unlike when I created the encoding myself, however, this will default to making the datatype of each element an integer.  -->

<!-- This function doesn't automatically keep the values as floats -->

<!-- By default, this function will store the letters  -->
<!-- That way, they will all be represented similarly -->

<!-- We'll also turn the values into floats while we're at it, as this is necessary so that they can be multiplied by float weights in our neural network. -->

In [8]:
import torch.nn.functional as F

# One-hot encode X to turn each letter into an array of length 27 (one index for each possible letter including ".".)
X_enc = F.one_hot(X, num_classes = 27)
# View encodings for first 5 letters.
X_enc[:5]

tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]])

As you can see, the values here are all integers. To later multiply them by the float weights in our neural network, we'll need to convert those values into floats.

<!-- We'll also turn the values into floats while we're at it, as this is necessary so that they can be multiplied by float weights in our neural network. -->

In [9]:
# We want the neural net to produce floats, so the inputs must be floats as well.
X_enc = X_enc.float()
# The first letter should now be encoded with floats.
X_enc[0]

tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])

<a id="model"></a>
## Building the Bigram Model

<!-- Now that we've finished preparing our data, we're ready to make our neural network. We won't be training it just yet; this section is more about exploring how neural networks work and then showing how we can use it to generate names once it is trained. -->

<!-- Our neural network  -->
For this project, we'll use a very simple neural network with only a single layer of neurons. 

<!-- Before creating this model, let's explore how a single neuron works. -->

Before we jump into building this layer, it's important to understand how a single neuron works.

Since each letter in X now has a length of 27 (due to the encoding), this one neuron will need 27 weights so each element in the inputs can be multiplied by a unique weight. The weights will start off as random numbers from a normal distribution. (After training the model, these numbers should become more meaningful.)


<!-- A single neuron wouldn't be a very good network, but we will start with that to understand what exactly is happening in a neural net. -->

<!-- Before we get to making our full neural network, we'll start by exploring how a single neuron works. A neuron is  -->
<!-- A single neuron wouldn't be a very good network, but it is important to understand how it functions before making  -->

<!-- understanding it is -->

<!-- we will start with that to understand what exactly is happening in a neural net. -->


<!-- for now, we will just explore what a neuron even is, then  -->

<!-- this section is more about exploring how neural networks work and then showing how we can use it to generate names once it is trained. -->

<!-- for now, we'll just be making the untrained model and walking through how we can use it. -->

<!-- we will just be exploring what a neural network is and how it will be used to make predictions.

works.

TODO: Should we show how the model will make predictions here? -->

In [10]:
# Weights start off random, so let's use a generator.
gen = torch.Generator().manual_seed(42)

# A single neuron with one weight for each element in our input.
W = torch.randn((27, 1), generator=gen)
print("Here is a single neuron with a 27 weights:")
print(W)

Here is a single neuron with a 27 weights:
tensor([[ 1.9269],
        [ 1.4873],
        [ 0.9007],
        [-2.1055],
        [ 0.6784],
        [-1.2345],
        [-0.0431],
        [-1.6047],
        [-0.7521],
        [ 1.6487],
        [-0.3925],
        [ 0.2415],
        [-1.1109],
        [ 0.0915],
        [-2.3169],
        [-0.2168],
        [-1.3847],
        [-0.8712],
        [-0.2234],
        [-0.6216],
        [-0.5920],
        [-0.0631],
        [-0.8286],
        [ 0.3309],
        [-1.5576],
        [ 0.9956],
        [-0.8798]])


<!-- TODO: Finish rewording this. (And maybe change W's name since it isn't a full network? But it is still weights, so maybe keep it?) -->

<!-- To use a neural network, we multiply  -->

<!-- Now that we have a neuron, let's try using it with our input to how it works. To use a neural network, we multiply each input by the weights within its neurons. -->
<!-- To use a neural network, we multiply each input by the weights within its neurons. Here we only have one neuron, so 

We can use a neural network by multiplying each input 

Let's try using this neuron! As mentioned earlier, we use this network by multiplying each input by the weights of this neuron.
 -->
Now we can use this neuron by multiplying our inputs with it.


When the input is a one-hot encoded vector, the dot product multiplies all weights by 0 except for the one weight at the position where the input has a 1. Each of the resulting multiplications are then added together, but since most of the values are 0, they end up ignored by the result. Thus, it is as if each one-hot encoded input simply selects the corresponding weight from the neuron and ignores the rest.

In [11]:
print("c_enc has a 1 at index 3, so the dot product should select the neuron's weight at index 3.")
c_enc @ W

c_enc has a 1 at index 3, so the dot product should select the neuron's weight at index 3.


tensor([-2.1055])

<!-- Now let's think for a moment about the ideal behavior of our neural network. As mentioned in the overview, we'd like our model to take a letter as input and determine the probability for each possible next letter. Since there are 27 possible options for next letters, our model should produce 27 outputs for each input letter (where each output indicates the probability of the letter at that index coming next). -->

As described earlier, we want our model to take a letter as input and output the probability of each possible next letter. Since there are 27 possible next letters, the model should produce a vector of 27 probabilities, one for each letter.

We've seen that a single neuron produces one output. Thus, to produce 27 outputs, we'll need a layer of 27 neurons.

In [12]:
# Resetting generator so cell always gives same numbers instead of generating the next numbers.
gen = torch.Generator().manual_seed(42)

# Each column in the weights matrix represents another neuron in the layer.
W = torch.randn((27, 27), generator = gen, requires_grad=True)    # `requires_grad` will be explained later.

Now W (our weights) is a 27x27 matrix of random values from a normal distribution. Each column represents a single neuron in the layer. 

Before we explored how one-hot encoding selected a single value from the neuron. That's still true, but now that there are 27 neurons, it will select the value in that same row for each of the neurons. So for example, an input of c (index 3) dot producted with the layer would result in a vector of all the weights at the row with index 3.

In [13]:
# Select the third weight from each neuron.
c_output = c_enc @ W
c_output

tensor([-0.3387, -1.3407, -0.5854,  0.5362,  0.5246,  1.1412,  0.0516,  0.7440,
        -0.4816, -1.0495,  0.6039, -1.7223, -0.8278,  1.3347,  0.4835, -2.5095,
         0.4880,  0.7846,  0.0286,  0.6408,  0.5832,  1.0669, -0.4502, -0.1853,
         0.7528,  0.4048,  0.1785], grad_fn=<SqueezeBackward4>)

Now we have a model that can take a letter as input and produce 27 outputs, one for each possible next letter. 

You may notice that the numbers in our output don't look very much like probabilities just yet. At this stage, the outputs are just raw scores (also called "logits"), not probabilities. When we use this model during training and sampling, we'll need to normalize these outputs to turn them into probabilities. 

<!-- Below shows each letter and its associated logit. -->

<!-- To make it more clear how each logit is associated with a letter,  -->

<!-- To make this more clear, below we can see how each value in this output vector corresponds to a specific letter. -->

<!-- Each value in this output vector corresponds to a specific letter and can eventually be used to find their probability.  -->

<!-- Let's match up each possible next letter with the value -->

<!-- Ideally, we'd like these values to represent the probabilities of each letter coming after "c". However, so far, the outputs are just raw scores (called "logits"), not probabilities.  -->

<!-- Logits must be converted into 


Now we have a model that outputs 27 values for each input letter, but these values aren't yet probabilities. Instead, what we have are logits

So far what we have are 


Now we have a model that can take a letter as input and produce 27 outputs, one for each possible next letter. These values aren't probabilities just yet. So far, we only 

Now we have a model that can take a letter as input and produce 27 outputs, one for each possible next letter. Each value in this output vector corresponds to a specific letter and can eventually be used to find their probability. 


After training the model, these numbers can be used to find the probability that each 

These aren't yet the probabilities of 

Ideally, we'd like these values to represent the probabilities of each letter coming after "c". However, so far, the outputs are just raw scores (also called "logits"), not probabilities.


Ideally, we'd like these values to represent the probabilities of each letter coming after "c". Of course, our model isn't trained yet, so we can't expect the probabilities to be correct just yet. But if we take a look at the 

these numbers should still be random. But 



But let's take a closer look at these values before going further.

Ideally, we'd like these values to represent the probabilities of each letter coming after "c". However, so far, the outputs are just raw scores (also called "logits"), not probabilities.

Just to make it more clear, let's 

<!-- Ideally, we'd like each of these outputs to represent the probability that the letter at that index comes after "c". To make this clearer, let's match up each output with the letter it represents. -->

<!-- let's match up each possible next letter with the value  -->

<!-- we can match each of these values with the letter it coincides with. -->

<!-- However, you may notice that the numbers in our output don't look very much like probabilities just yet. --> 

In [14]:
print("Our untrained model's current raw scores for each letter following 'c':")
for letter, logit in zip(stoi.keys(), c_output):
    print(f"{letter}: {logit:.4f}")

Our untrained model's current raw scores for each letter following 'c':
.: -0.3387
a: -1.3407
b: -0.5854
c: 0.5362
d: 0.5246
e: 1.1412
f: 0.0516
g: 0.7440
h: -0.4816
i: -1.0495
j: 0.6039
k: -1.7223
l: -0.8278
m: 1.3347
n: 0.4835
o: -2.5095
p: 0.4880
q: 0.7846
r: 0.0286
s: 0.6408
t: 0.5832
u: 1.0669
v: -0.4502
w: -0.1853
x: 0.7528
y: 0.4048
z: 0.1785


<!-- Of course, our data (stored in X_enc) contains much more than just a single letter. If our input contains many letters (such as all the letters stored in X), the output will still produce a vector for each letter in the input, but each of these vectors will be combined as rows of a matrix. Thus, if we send the network all of X as input, the resulting output will be a matrix with a row of probabilities for each of the 228146 examples letter in X. -->

Now that we understand what happens when we use a single letter as input to our neural network, let's look at what happens with the full dataset. Our input data, stored in `X_enc`, contains far more than just one letter. When we pass these inputs through the network, each one-hot encoded row in `X_enc` selects a corresponding row from the weights matrix to produce its logits. These logits are then stacked together, resulting in an output matrix where each row contains the logits for one input letter.

<!-- 
Now we understand what happens when we use a single letter as input to our neural network. Of course, our data (stored in `X_enc`) contains much more than just a single letter. Each of the 228146 rows in `X_enc` represent an input letter, and they each use their one-hot encoding to select a row from the weights matrix for their logits. These logits are stacked in rows and outputted together as a matrix.

as rows in a matrix.

has its own one-hot encoding and selects a row of weights from the matrix, just like `c_enc`.


When we input many letters at once (for example, all the letters in `X`), each 

the network will output a vector of logits for each input letter. These vectors are stacked together as rows in a matrix. Thus, if we pass all of `X` as input, the output will be a matrix where each row contains the logits for the next letter, corresponding to each of the 228146 input letters in `X`. -->

<!-- , but multiplying `X_enc @ W` will result in a vector just like above for each letter in X. The only difference is that instead of the network outputting a single vector,  -->

<!-- . When we use this neural network on all the letters in X, we'll get a matrix where each row is a vector just like above  -->

<!-- , but multiplying `X_enc @ W` will result in a vector just like above for each letter in X. These outputs  -->
<!-- Each of these vectors will be stored as a row in a final output matrix. -->

<!-- Now that we have a model that can produce 27 outputs for each input, we're on the right track. However, you may notice that the numbers in our output don't look very much like probabilities just yet. -->

<!-- Now let's think about these output vectors a bit. We want each output to tell us the probability of each other letter following the input letter. Right now, the numbers in our output don't look very much like probabilities. For starters, some of them are negative. We can fix that using exponentiation, which shifts numbers over so that negative numbers end up as small decimal values and positive numbers end up larger than 1. -->

In [15]:
# Get logits for each letter in X.
logits = X_enc @ W
logits

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  1.2791,  1.2964,  0.6105],
        [ 0.5760,  1.1415,  0.0186,  ...,  0.8123, -1.9006,  0.2286],
        [-0.3303, -0.7939,  0.3752,  ...,  0.6854, -0.1397, -1.1808],
        ...,
        [-1.1798, -0.5297,  0.9625,  ..., -0.7068, -1.2520,  3.0250],
        [ 1.3463,  0.8556,  0.3220,  ..., -1.4740, -0.3502,  0.4590],
        [ 0.4557,  0.2503, -1.3611,  ...,  2.1296, -1.5181,  0.1387]],
       grad_fn=<MmBackward0>)

### Turning Logits to Probabilities with Softmax

Earlier, we mentioned that we'd like to turn these logits into probabilities. We can do this using the softmax function, which works like so:

First, we exponentiate each of the logits. This turns all values positive while keeping their relative order. All negative numbers will turn into a value between 0 and 1, and all positive values will end up as some value larger than 1.

<!-- , as negative numbers will still be smaller than positive numbers. -->

In [16]:
# Change values to all be positive.
pos_values = logits.exp()
pos_values

tensor([[ 6.8683,  4.4251,  2.4614,  ...,  3.5935,  3.6562,  1.8413],
        [ 1.7789,  3.1315,  1.0187,  ...,  2.2532,  0.1495,  1.2568],
        [ 0.7187,  0.4521,  1.4553,  ...,  1.9845,  0.8696,  0.3070],
        ...,
        [ 0.3074,  0.5888,  2.6183,  ...,  0.4932,  0.2859, 20.5950],
        [ 3.8430,  2.3528,  1.3799,  ...,  0.2290,  0.7045,  1.5825],
        [ 1.5773,  1.2844,  0.2564,  ...,  8.4117,  0.2191,  1.1488]],
       grad_fn=<ExpBackward0>)

Then we turn this into a probability by summing up each row and setting each logit in that row equal to their fraction of that sum. 
<!-- This process is called the softmax function. -->

In [17]:
# Sum the column vectors to calculate the sum of each row, then divide each element in the row by the result to get their probabilities.
probs = pos_values / pos_values.sum(dim=1, keepdims=True)
probs

tensor([[0.1230, 0.0793, 0.0441,  ..., 0.0644, 0.0655, 0.0330],
        [0.0396, 0.0698, 0.0227,  ..., 0.0502, 0.0033, 0.0280],
        [0.0123, 0.0078, 0.0250,  ..., 0.0340, 0.0149, 0.0053],
        ...,
        [0.0044, 0.0085, 0.0378,  ..., 0.0071, 0.0041, 0.2970],
        [0.0843, 0.0516, 0.0303,  ..., 0.0050, 0.0154, 0.0347],
        [0.0435, 0.0354, 0.0071,  ..., 0.2319, 0.0060, 0.0317]],
       grad_fn=<DivBackward0>)

If we did softmax correctly, each row should now add up to 1.

In [18]:
# Confirm that each row now sums to 100%.
probs.sum(dim=1, keepdims=True)

tensor([[1.0000],
        [1.0000],
        [1.0000],
        ...,
        [1.0000],
        [1.0000],
        [1.0000]], grad_fn=<SumBackward1>)

<a id="training"></a>
## Training the Model

Now that we have our model, we're ready to train it. Before we fully train the network on all our inputs, let's walk through how it works by training it with a single bigram.

<!-- "s" "a" -->

<!-- , we'll walk through how it works by training it with a single made-up bigram. -->

<!-- let's walk through the steps of training it with a single input. -->

<!-- When we train the model, we change the weights of its neurons so that the logits 

Training the model is 

Right now, when we feed out model 


Our goal with training is simple: when we give it an input letter, 
To train our model, we're going to try using it on X.

Once our model is well trained, it should give the correct Y values high probabilities. -->

### Demonstration

<!-- Since my name is Sawyer, for this example we will train the model that when it sees the letter "s", it should output "a". -->
For this example, we will train the model that when it sees the letter "c", it should output "a". 

<!-- First, let's check what percent chance the model currently thinks 
how the model currently predicts 

see the probability the model currently assigns "a" 
 -->
<!-- we need to check how well it currently does  -->

In [19]:
# Select weights from W to get logits.
c_logits = c_enc @ W
# Turn logits into probabilities with softmax.
c_pos = c_logits.exp()
c_probs = c_pos / c_pos.sum()
# Examine current probabilities.
c_probs

tensor([0.0188, 0.0069, 0.0147, 0.0451, 0.0446, 0.0827, 0.0278, 0.0556, 0.0163,
        0.0092, 0.0483, 0.0047, 0.0115, 0.1003, 0.0428, 0.0021, 0.0430, 0.0579,
        0.0272, 0.0501, 0.0473, 0.0768, 0.0168, 0.0219, 0.0561, 0.0396, 0.0316],
       grad_fn=<DivBackward0>)

In [20]:
# Check model's current prediction that "a" follows "c".
original_prediction = c_probs[stoi["a"]]

print(f"Before training, the model predicts that 'a' has a {original_prediction * 100:.2f}% chance of following 'c'.")

Before training, the model predicts that 'a' has a 0.69% chance of following 'c'.


That's a very low probability. To train our model, we need a way to evaluate how bad its predictions are so that we can take the necessary steps to fix them. To do this, we use something called a loss function. The goal of an ML model is to minimize the loss. For this model, the loss function we will use is called the Negative Log Likelihood (NLL) loss. Let's break down what that means.

- Likelihood: Multiply together all probabilities for the chosen inputs. For this example, we only have one probability, but usually there will be many. The ideal likelihood is 1, meaning that the model always predicted 100% for the correct outputs.
- Log: Likelihoods can be an incredibly small number, which is hard to work with since computers have limited precision. We'd have less of a small number if we could add the values instead of multiplying them. Log likelihood allows us to do just that, because $log(a*b*c) == log(a) + log(b) + log(c)$, and since log is monotonically increasing, maximizing the log likelihood is equivalent to mazimizing the likelihood, so this convenience doesn't negatively impact the outcome.
- Negative: Likelihood is a number that we want to maximize, but loss is something that should be minimized. If we take the negative of the likelihood, it becomes something we can minimize instead, thus being useable for loss.

We can then normalize that loss by dividing what we get by the total by the number of inputs. Since getting the log of the likelihood allowed us add each of the losses rather than multiplying them, the combination of adding losses and dividing by their count can be combined into a single mean() operation.


<!-- Once we get to training our model on many examples at once, we can also normalize that loss by dividing the total by the number of inputs. Since using log causes the  -->


<!-- work with more than one prediction at a time, we'll also normalize the results by  -->


<!-- Maximum Likelihood Estimation (MLE). With MLE,  -->

<!-- When we training our model, we need a way to evaluate how bad its predictions are so that we can take the necessary steps to fix them. -->

<!-- were so it can take the steps -->


<!-- 
Before we can adjust the weights to fix it, however, we need a way to evaluate how wrong this pro

In order to evaluate just how 

When we're training our model, we'll need a way for the model to evaluate how bad its predictions were so it can take the steps

We need a way for our model to evaluate just how bad its prediction was so that it knows how much to change the weights by to improve it. For that, we need a loss function. Here we'll use a variation of Maximum Likelihood Estimation as our loss. Specifically, we will use the negative log likelihood loss. Let's break down what that means. -->



<!-- Here we will use the negative log likelihood loss. This is a  -->

<!-- We'll base our loss off of the likelihood loss.  -->


<!-- Here we will use the average negative log likelihood loss. Let's break down what that means.

- Likelihood: Multiply together all probabilities for the chosen inputs. For this example, we only have one probability, but usually there will be many. The ideal likelihood is 1, meaning that the model always predicted 100% for the correct output.
- Log: Likelihoods can be an incredibly small number, which is hard to work with since computers have limited precision. We'd have less of a small number if we could add the values instead of multiplying them. Log likelihood allows us to do just that, because $log(a*b*c) == log(a) + log(b) + log(c)$, and since log is monotonically increasing, maximizing the log likelihood is equivalent to mazimizing the likelihood, so this convenience doesn't negatively impact the outcome.
<!-- Taking the log of it helps turn that number into a value that is easier to manage. -->
<!-- - Log: Likelihood says to multiply all the outputs. That's  -->
<!-- - Negative: Likelihood is a number that we want to maximize, but loss is something that should be minimized. If we take the negative of the likelihood, it becomes something we can minimize instead, thus being useable for loss. -->
<!-- - Average: We want to know how well our model predicts overall for all our bigrams. For that, we get the average. -->
<!-- - * Normalize it by dividing by the number of examples. -->


<!-- This multiplies together all the  -->

<!-- We need to change the weights of our model to improve them. To do that, first we need a way to tell the model just how bad its prediction was. -->

In [21]:
# Use NLL loss to evaluate how poorly the model currently performs.
loss = -original_prediction.log().mean()
loss

tensor(4.9747, grad_fn=<NegBackward0>)

<!-- Remember, this loss is a result of our model's weights  -->


A perfect loss would be 0. The loss we got is pretty bad. To improve our model, we'll need to change the weights of our model so they produce more accurate results, thus minimizing the loss. We won't need to change all the weights to improve this one prediction, though. We'll only need to change the weights that had an impact on the loss we got. To determine which weights to change, we use derivatives from calculus. A derivative tells us how much a small change in a variable affects the output of a function. In our case, the function is the loss function, so by calculating the derivative of the loss with respect to each weight, we can see how much each weight contributed to the loss. 

The collection of all these derivatives (one for each weight) is called the gradient. To find the gradient of the loss with respect to each weight, we can use backpropagation. (Remember how we created W with the parameter `requires_grad = True`? That was so we can get its gradient now.) 
<!-- PyTorch tensors conveniently have backpropagation -->



<!-- we can find out how much the loss will change if we nudge that weight. -->





<!-- So now we need a way to find out which weights of the model we need to change to improve  -->

<!-- Now that we know the loss, we need to change the weights of our model in such a way to minimize that loss. For that, we need some way to determine which weights affected the loss we got. -->


<!-- To minimize this loss, we need some method of determining  -->

In [22]:
# Clear any existing gradient for W.
W.grad = None
# Backpropagate through the network to find out how much each weight impacted the loss.
loss.backward()

Before we examine W's gradient, let's try to think through what we expect it to look like. 

First, let's think about the size. W is size (27, 27). Each of these weights will have a partial derivative, so W.grad should be size (27, 27) as well.
<!-- 
, so we expect the gradient to be 

that same size, as each weight in W will have a derivative assigned to it.

We know that each weight will get assigned a 


Let's think through the steps that lead to the eventual loss to see if we can determine which weights had an impact on it.

First, we  -->

In [23]:
W.grad.size()

torch.Size([27, 27])

Now, what partial derivatives should we expect? Let’s walk through the steps that lead to this loss.

First, we calculated the logits with `c_enc @ W`, which multiplied our one-hot encoded vector `c_enc` (which represents "c" as a 1 at index 3 and 0 elsewhere) by the weights matrix `W`. As discussed earlier, multiplying a one-hot vector by a matrix selects the corresponding row of the matrix, which in this case is row 3, and multiplies each other row by 0. This means that only the weights in row 3 of `W` can have any effect on the loss. Changing the other weights can't impact the loss, since any change would still end up getting multiplied by 0.

As a result, when we look at the gradient, we should expect the partial derivatives to be zero everywhere except for the weights in row 3.

Let's take a look at the gradient now to see if we're on the right track.

In [24]:
W.grad

tensor([[ 0.0000, -0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000],
        [ 0.0000, -0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000],
        [ 0.0000, -0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000],
        [ 0.0188, -0.9931,  0.0147,  0.0451,  0.0446,  0.0827,  0.0278,  0.0556,
          0.0163,  0.0092,  0.0483,  0.0047,  0.0115,  0.1003,  0.0428,  0.0021

Just as we expected, the only row with non-zero values is the row with index 3. 

The specific values of the partial derivatives in that row were determined by backpropagating through all the operations of the network, which included getting the softmax, selecting the index of "a" to find the model's predicted probability, and getting the Negative Log Likelihood for loss.

<!-- which included getting the softmax and then finding the model's predicted probability for "a" by selecting the value at index 1 of the result, as well as all the loss operations. -->

<!-- selecting the value at in

resulting weight in index 1 for "a". -->

<!-- Now, we won't go so far as calculating the specific partial derivatives shown above, but we *can* get a bit more insight into where those numbers came from. When we selected "a", that was in index 1   -->

Now let's take a closer look at the non-zero row to get more insight.
<!-- Now, we won't go so far as calculating the specific values shown above, but we *can* get a bit more insight into where those numbers came from. Remember, each column in the weights matrix represents one neuron  -->

In [25]:
# Examine the only row with non-zero gradients.
W.grad[3]

tensor([ 0.0188, -0.9931,  0.0147,  0.0451,  0.0446,  0.0827,  0.0278,  0.0556,
         0.0163,  0.0092,  0.0483,  0.0047,  0.0115,  0.1003,  0.0428,  0.0021,
         0.0430,  0.0579,  0.0272,  0.0501,  0.0473,  0.0768,  0.0168,  0.0219,
         0.0561,  0.0396,  0.0316])

Notice how almost all of the weights in this row have a positive partial derivative with respect to the loss. That means that increasing the weights in those positions will increase the loss. We don't want that, since we're trying to minimize the loss. The only weight that has a negative partial derivative with respect to loss (meaning increasing it will decrease the loss) is the weight at index 1. This should make sense, as in this demonstration we specifically were trying to find the probability of the model predicting "a", which would be found at index 1 of the row selected by "c". If we increase that weight, we'd be increasing the probability that the model predicts for "a" following "c", so it makes perfect sense that we'd get a better result and thus decrease the loss.

Now that we have a solid understanding of the gradient, we are ready to use it to improve our model. We do this by nudging each of the weights by some small amount (called the learning rate) in the direction that would decrease loss. Since decreasing the loss means changing the weights in the opposite direction of the gradient, we call this "gradient descent". 

<!-- 
Naturally, if that probability were higher, 

<!-- These partial derivatives tell us how much the loss will increase if we increase the given weight. 

Notice how only one of the weights in this row has a negative partial derivative with respect to the loss. That means that increasing that weight will decrease the loss by the amount shown. 

<!-- These values are the selected weights' partial derivatives with respect to the loss.

That means that the weights with positive values w

These partial derivatives tell us how much the loss will increase if we increase their weights.
 -->

In [26]:
# Nudge each weight in the direction opposite the gradient to minimize loss.
W.data += -0.01 * W.grad

Now that we adjusted the model's weights, let's see if the model is any better at predicting that "a" should come after "c". To do this, we'll need to send our input through the network again.

In [27]:
print(f"Before training, the model predicted that 'a' had a {original_prediction * 100:.2f}% chance of following 'c'.")

# Once again, select weights from W to get logits.
c_logits = c_enc @ W
# Turn logits into probabilities with softmax.
c_pos = c_logits.exp()
c_probs = c_pos / c_pos.sum()
# Check model's current prediction that "a" follows "c".
new_prediction = c_probs[stoi["a"]]

print(f"Now, after adjusting the weights, the model predicts that 'a' has a {new_prediction * 100:.2f}% chance of following 'c'.")

Before training, the model predicted that 'a' had a 0.69% chance of following 'c'.
Now, after adjusting the weights, the model predicts that 'a' has a 0.70% chance of following 'c'.


As we hoped, our model is now a bit better at making predictions. To keep training, we'd need to calculate the loss again, use the new gradients for gradient descent, and keep repeating this process until the model's predictions stop improving.
<!-- 
repeat the 


The improvement isn't immense, since we only nudged the weights by a small amount, but when we go through the full training data we'll continously iterate through the steps of making predictions, calculating the new loss, then nudging the weights by a small amount 

We only made one slight adjustment to the weights here, but  -->

### Full training

Now that we've walked through a simple training example, we're ready to train the model on our full training data. We'll just repeat the steps we did above.

In [40]:
def train():
    # Go through the full data several times to keep making improvements.
    for _ in range(100):
        ### Forward pass. ###

        # Select row for each letter in X.
        logits = X_enc @ W
        # Use softmax function to turn logits into percents.
        pos = logits.exp()
        probs = pos / pos.sum(dim=1, keepdims=True)
        # For each row of probs, check model's prediction for label.
        predictions = probs[torch.arange(X_enc.size(0)), y]
        # Use NLL loss.
        loss = -predictions.log().mean()

        ### Backwards pass. ###

        # Reset gradients.
        W.grad = None
        # Backpropagate through full network.
        loss.backward()
        # Adjust weights by learning rate to improve loss.
        W.data = -50 * W.grad
        
    # Return loss after training so we can check model progress.
    return loss.item()

In [43]:
train()

3.085660934448242

TODO: Add model smoothing, then generation.

In [30]:
# Try input letter from X.
# Softmax the output.
# Use y to find current probability assigned to correct second letter of bigram.
# Calculate loss.
# Use backpropagation to see how much each weight impacts the loss.
# Nudge weights in direction that minimizes loss.