# Generating Names with Bigrams
<!-- Basic Bigram Generator -->

## Table of Contents
1) [Project Overview](#overview)
2) [Preparing the Data](#data)
3) [Building the Bigram Model](#model)
4) [Training the Model](#training)
5) [Generating Names](#generate)

<a id="overview"></a>
## Project Overview 
The goal of this project is to create a simple language model that can generate names resembling real ones from its training data.

We'll use a neural network trained on character-level bigrams -- that is, pairs of consecutive letters. When the model is given a letter as input, it should predict the probability of each possible next letter. We can then use these probabilities to pick the next letter in a name, then feed that new letter back into the model to continue generating the name, one letter at a time.


<!-- There will be two main steps for completing this goal. First, we'll train the model so it learns common bigrams in names. Next, we'll use that trained model to generate new, never-seen-before names. -->


<!-- To put that simply, if this model is given a letter, it should learn the probabilities for each letter following that one so that it can produce a likely next letter.  -->

<!-- 
learn which letter might 

it should find the percent chance that every other 

this model should learn the likelihood 

for any letter, the model should output the chance of each other letter following that one in a name.

Given a letter, the model should output the probability of each other letter following that one in a name.

That is, it will learn the common sequences of two adjacent letters found in names and use that to generate letters that might reasonably follow any given letter in a name. -->

<a id="data"></a>
## Preparing the Data
<!-- We'll be training this model on names so it learns to produce names of its own based on what it learns from the data. First we'll import our dependencies. -->
<!-- We'll be storing our data in tensors, so let's start by importing PyTorch.

Let's start by importing PyTorch so we can store our data in tensors (and later use  -->
<!-- First we'll grab a list of names to use for training. -->
We'll start by reading in a list of names we will use to train our model.

In [15]:
# Create a list of each name in names.txt
names = open("names.txt", "r").read().splitlines()
print(f"Displaying the first three names: {names[:3]}")

Displaying the first three names: ['emma', 'olivia', 'ava']


Now that we have our data, let's split it into bigrams so the model can learn common letter patterns in names. We also want it to learn how each name starts and ends, so we'll mark the start and end of each name with a ".". The example below shows the bigrams in the name "emma".

In [16]:
# Break "emma" into bigrams, including "." for start and end.
print(f"Bigrams for emma: {list(zip('.emma', 'emma.'))}")

Bigrams for emma: [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
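
To get a feel for what "common letter patterns" means, here is a minimal sketch that counts bigrams across a handful of names. The short `sample_names` list and the `bigram_counts` name are assumptions made for illustration; the real data comes from names.txt.

```python
from collections import Counter

# Hypothetical sample; the notebook reads the full list from names.txt.
sample_names = ["emma", "olivia", "ava"]

bigram_counts = Counter()
for name in sample_names:
    # Pad with "." so the start and end of each name are counted too.
    bigram_counts.update(zip("." + name, name + "."))

# Show the most frequent bigrams in this tiny sample.
for (first, second), count in bigram_counts.most_common(3):
    print(f"{first}{second}: {count}")
```

Even in this tiny sample, the bigram ("a", ".") shows up more often than the others, which hints that "a" is a common final letter.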


<!-- We are going to store these bigrams in tensors to use them in our neural network. -->
To use these bigrams in our neural network, we'll organize them into two tensors: X (inputs) and y (labels).
<!-- X will contain each first letter within the bigrams. When we train our model, X will be the input.
y will contain the second letters within the bigrams, i.e., the solutions to the bigrams started in X. Throughout training, our model should learn to predict the values in y given X. -->
- X will contain the first letter of each bigram. These are the inputs we'll feed into the model.
- y will contain the second letter of each bigram, i.e., the letter that follows each input letter in X. These are the labels or "targets" the model should learn to predict.

To start off, let's import PyTorch.

In [17]:
import torch    # Tensors and backpropagation.

Since PyTorch tensors can't contain character elements, let's assign each letter to an integer value.

In [18]:
import string

# Map "." and each lowercase letter to a unique integer value.
stoi = {char:int_val for (int_val, char) in enumerate("." + string.ascii_lowercase)}
print(stoi)

{'.': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}


We'll also want a way to go backwards, turning numbers back into letters.

In [19]:
itos = {num: letter for letter, num in stoi.items()}
print(itos)

{0: '.', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
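
As a quick sanity check, the two dictionaries should undo each other. The sketch below rebuilds both mappings and round-trips a padded name through them; the `encode` and `decode` helpers are hypothetical names introduced only for this example.

```python
import string

# Same mappings as in the cells above.
stoi = {char: i for i, char in enumerate("." + string.ascii_lowercase)}
itos = {i: char for char, i in stoi.items()}

def encode(text):
    """Turn a string into a list of integer ids (hypothetical helper)."""
    return [stoi[char] for char in text]

def decode(ids):
    """Turn a list of integer ids back into a string (hypothetical helper)."""
    return "".join(itos[i] for i in ids)

ids = encode(".emma.")
print(ids)          # [0, 5, 13, 13, 1, 0]
print(decode(ids))  # .emma.
```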


Now we are ready to split all the names into bigrams.

In [20]:
X = [] # First letter of each bigram (inputs).
y = [] # Second letter of each bigram (labels).

for name in names:
    for char1, char2 in zip("." + name, name + "."):
        # Convert chars to ints so we can later add to tensors.
        X.append(stoi[char1])
        y.append(stoi[char2])

print(f"Our training data has {len(X)} examples.")
print(f"The first 10 values in X are {X[:10]}.")
print(f"The first 10 values in y are {y[:10]}.")

# Convert both lists to tensors.
X = torch.tensor(X)
y = torch.tensor(y)

Our training data has 228146 examples.
The first 10 values in X are [0, 5, 13, 13, 1, 0, 15, 12, 9, 22].
The first 10 values in y are [5, 13, 13, 1, 0, 15, 12, 9, 22, 9].


Before we can use these values in a neural network, we'll need to alter them a bit more. Neural networks contain weights that get multiplied by each input. Right now our inputs are just integers from 0 to 26. Doing math directly on these integers wouldn't be very helpful, because the numbering is arbitrary and we want each letter to be treated equally. "z" (26) shouldn't have a larger multiplier than "a" (1), for example.
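
Here is a small sketch of that problem, assuming a single made-up weight of 0.5 (the value is arbitrary and chosen only for illustration). Feeding the raw integer ids through the same weight makes "z" produce an output 26 times larger than "a", purely because of how the letters were numbered.

```python
# Hypothetical single weight, chosen only for illustration.
weight = 0.5

# Integer ids as assigned by stoi: "a" -> 1, "z" -> 26.
a_id, z_id = 1, 26

a_out = a_id * weight
z_out = z_id * weight
print(f"a -> {a_out}, z -> {z_out}")
print(f"z's output is {z_out / a_out:.0f}x larger than a's")
```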

To make sure each input gets treated equally, we can use one-hot encoding. With one-hot encoding, each letter gets represented by an array of length 27 (one for each possible letter including "."). This array contains all 0s apart from a single 1 at the index signifying the chosen letter. To show this more visually, let's explore the one-hot encoded version of the letter "c".

In [21]:
# Get number used to represent c. This will be the index of the 1 after one-hot encoding.
c_index = stoi["c"]

# Create the one-hot encoded representation of the letter "c".
c_enc = torch.zeros(27)
c_enc[c_index] = 1

print(f"c's index is {c_index}, so its one-hot encoding looks like {c_enc}")

c's index is 3, so its one-hot encoding looks like tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])


Let's apply this encoding to all the letters in our input data. Rather than encoding them all manually like above, we'll use PyTorch's one_hot() function. 

<!-- Note that the function makes the datatype of the encodings integers. To use them in our neural network

Unlike when I created the encoding myself, however, this will default to making the datatype of each element an integer.  -->

<!-- This function doesn't automatically keep the values as floats -->

<!-- By default, this function will store the letters  -->
<!-- That way, they will all be represented similarly -->

<!-- We'll also turn the values into floats while we're at it, as this is necessary so that they can be multiplied by float weights in our neural network. -->

In [22]:
import torch.nn.functional as F

# One-hot encode X to turn each letter into an array of length 27 (one index for each possible letter including ".".)
X_enc = F.one_hot(X, num_classes = 27)
# View encodings for first 5 letters.
X_enc[:5]

tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]])

As you can see, the values here are all integers. To later multiply them by the float weights in our neural network, we'll need to convert those values into floats.

<!-- We'll also turn the values into floats while we're at it, as this is necessary so that they can be multiplied by float weights in our neural network. -->

In [23]:
# We want the neural net to produce floats, so the inputs must be floats as well.
X_enc = X_enc.float()
# The first letter should now be encoded with floats.
X_enc[0]

tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])

<a id="model"></a>
## Building the Bigram Model

<!-- Now that we've finished preparing our data, we're ready to make our neural network. We won't be training it just yet; this section is more about exploring how neural networks work and then showing how we can use it to generate names once it is trained. -->

<!-- Our neural network  -->
For this project, we'll use a very simple neural network with only a single layer of neurons. 

<!-- Before creating this model, let's explore how a single neuron works. -->

Before we jump into building this layer, it's important to understand how a single neuron works.

Since each encoded letter is now a vector of length 27, this one neuron will need 27 weights so that each element of the input can be multiplied by its own weight. The weights will start off as random numbers drawn from a normal distribution. (After training the model, these numbers should become more meaningful.)


<!-- A single neuron wouldn't be a very good network, but we will start with that to understand what exactly is happening in a neural net. -->

<!-- Before we get to making our full neural network, we'll start by exploring how a single neuron works. A neuron is  -->
<!-- A single neuron wouldn't be a very good network, but it is important to understand how it functions before making  -->

<!-- understanding it is -->

<!-- we will start with that to understand what exactly is happening in a neural net. -->


<!-- for now, we will just explore what a neuron even is, then  -->

<!-- this section is more about exploring how neural networks work and then showing how we can use it to generate names once it is trained. -->

<!-- for now, we'll just be making the untrained model and walking through how we can use it. -->

<!-- we will just be exploring what a neural network is and how it will be used to make predictions.

works.

TODO: Should we show how the model will make predictions here? -->

In [24]:
# Weights start off random, so let's use a seeded generator to keep the results reproducible.
gen = torch.Generator().manual_seed(42)

# A single neuron with one weight for each element in our input.
W = torch.randn((27, 1), generator=gen)
print("Here is a single neuron with 27 weights:")
print(W)

Here is a single neuron with 27 weights:
tensor([[ 1.9269],
        [ 1.4873],
        [ 0.9007],
        [-2.1055],
        [ 0.6784],
        [-1.2345],
        [-0.0431],
        [-1.6047],
        [-0.7521],
        [ 1.6487],
        [-0.3925],
        [ 0.2415],
        [-1.1109],
        [ 0.0915],
        [-2.3169],
        [-0.2168],
        [-1.3847],
        [-0.8712],
        [-0.2234],
        [-0.6216],
        [-0.5920],
        [-0.0631],
        [-0.8286],
        [ 0.3309],
        [-1.5576],
        [ 0.9956],
        [-0.8798]])


<!-- TODO: Finish rewording this. (And maybe change W's name since it isn't a full network? But it is still weights, so maybe keep it?) -->

<!-- To use a neural network, we multiply  -->

<!-- Now that we have a neuron, let's try using it with our input to how it works. To use a neural network, we multiply each input by the weights within its neurons. -->
<!-- To use a neural network, we multiply each input by the weights within its neurons. Here we only have one neuron, so 

We can use a neural network by multiplying each input 

Let's try using this neuron! As mentioned earlier, we use this network by multiplying each input by the weights of this neuron.
 -->
Now we can use this neuron by taking the dot product of an input with the neuron's weights.


When the input is a one-hot encoded vector, the dot product multiplies all weights by 0 except for the one weight at the position where the input has a 1. Thus, each one-hot encoded input simply selects the corresponding weight from the neuron and ignores the rest.

In [25]:
print("c_enc has a 1 at index 3, so the dot product should select the neuron's weight at index 3.")
c_enc @ W

c_enc has a 1 at index 3, so the dot product should select the neuron's weight at index 3.


tensor([-2.1055])
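
To double-check the "selection" idea, the sketch below compares the dot product against plain indexing. It rebuilds the same seeded neuron so it can run on its own; `dot_result` and `picked` are names introduced only for this example.

```python
import torch

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 1), generator=gen)  # One neuron with 27 weights.

c_enc = torch.zeros(27)
c_enc[3] = 1  # One-hot encoding of "c" (index 3).

dot_result = c_enc @ W  # Multiply by every weight, then sum.
picked = W[3]           # Just read the weight at index 3.

print(dot_result, picked)
print(torch.allclose(dot_result, picked))
```

Both approaches give the same value, so the matrix multiplication acts like a lookup when the input is one-hot encoded.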

<!-- Now let's think for a moment about the ideal behavior of our neural network. As mentioned in the overview, we'd like our model to take a letter as input and determine the probability for each possible next letter. Since there are 27 possible options for next letters, our model should produce 27 outputs for each input letter (where each output indicates the probability of the letter at that index coming next). -->

As described earlier, we want our model to take a letter as input and output the probability of each possible next letter. Since there are 27 possible next letters, the model should produce a vector of 27 probabilities, one for each letter.

We've seen that a single neuron produces one output. Thus, to produce 27 outputs, we'll need a layer of 27 neurons.

In [26]:
# Resetting generator so cell always gives same numbers instead of generating the next numbers.
gen = torch.Generator().manual_seed(42)

# Each column in the weights matrix represents another neuron in the layer.
W = torch.randn((27, 27), generator = gen)

Now W (our weights) is a 27x27 matrix of random values from a normal distribution. Each column represents a single neuron in the layer. 

Earlier, we saw how a one-hot encoded input selects a single weight from a neuron. That's still true, but now that there are 27 neurons, it selects the weight in that same row from each of the neurons. For example, taking the dot product of the encoding for "c" (index 3) with the layer's weights results in a vector containing all the weights in the row at index 3.

In [35]:
# Select the weight at index 3 from each neuron.
c_output = c_enc @ W
c_output

tensor([-0.3387, -1.3407, -0.5854,  0.5362,  0.5246,  1.1412,  0.0516,  0.7440,
        -0.4816, -1.0495,  0.6039, -1.7223, -0.8278,  1.3347,  0.4835, -2.5095,
         0.4880,  0.7846,  0.0286,  0.6408,  0.5832,  1.0669, -0.4502, -0.1853,
         0.7528,  0.4048,  0.1785])

Now we have a model that can take a letter as input and produce 27 outputs, one for each possible next letter. 

You may notice that the numbers in our output don't look very much like probabilities just yet. At this stage, the outputs are just raw scores (also called "logits"), not probabilities. When we use this model during training and sampling, we'll need to normalize these outputs to turn them into probabilities. 

<!-- Below shows each letter and its associated logit. -->

<!-- To make it more clear how each logit is associated with a letter,  -->

<!-- To make this more clear, below we can see how each value in this output vector corresponds to a specific letter. -->

<!-- Each value in this output vector corresponds to a specific letter and can eventually be used to find their probability.  -->

<!-- Let's match up each possible next letter with the value -->

<!-- Ideally, we'd like these values to represent the probabilities of each letter coming after "c". However, so far, the outputs are just raw scores (called "logits"), not probabilities.  -->

<!-- Logits must be converted into 


Now we have a model that outputs 27 values for each input letter, but these values aren't yet probabilities. Instead, what we have are logits

So far what we have are 


Now we have a model that can take a letter as input and produce 27 outputs, one for each possible next letter. These values aren't probabilities just yet. So far, we only 

Now we have a model that can take a letter as input and produce 27 outputs, one for each possible next letter. Each value in this output vector corresponds to a specific letter and can eventually be used to find their probability. 


After training the model, these numbers can be used to find the probability that each 

These aren't yet the probabilities of 

Ideally, we'd like these values to represent the probabilities of each letter coming after "c". However, so far, the outputs are just raw scores (also called "logits"), not probabilities.


Ideally, we'd like these values to represent the probabilities of each letter coming after "c". Of course, our model isn't trained yet, so we can't expect the probabilities to be correct just yet. But if we take a look at the 

these numbers should still be random. But 



But let's take a closer look at these values before going further.

Ideally, we'd like these values to represent the probabilities of each letter coming after "c". However, so far, the outputs are just raw scores (also called "logits"), not probabilities.

Just to make it more clear, let's 

<!-- Ideally, we'd like each of these outputs to represent the probability that the letter at that index comes after "c". To make this clearer, let's match up each output with the letter it represents. -->

<!-- let's match up each possible next letter with the value  -->

<!-- we can match each of these values with the letter it coincides with. -->

<!-- However, you may notice that the numbers in our output don't look very much like probabilities just yet. --> 

In [53]:
print("Our untrained model's current raw scores for each letter following 'c':")
for letter, logit in zip(stoi.keys(), c_output):
    print(f"{letter}: {logit:.4f}")

Our untrained model's current raw scores for each letter following 'c':
.: -0.3387
a: -1.3407
b: -0.5854
c: 0.5362
d: 0.5246
e: 1.1412
f: 0.0516
g: 0.7440
h: -0.4816
i: -1.0495
j: 0.6039
k: -1.7223
l: -0.8278
m: 1.3347
n: 0.4835
o: -2.5095
p: 0.4880
q: 0.7846
r: 0.0286
s: 0.6408
t: 0.5832
u: 1.0669
v: -0.4502
w: -0.1853
x: 0.7528
y: 0.4048
z: 0.1785


<!-- Of course, our data (stored in X_enc) contains much more than just a single letter. If our input contains many letters (such as all the letters stored in X), the output will still produce a vector for each letter in the input, but each of these vectors will be combined as rows of a matrix. Thus, if we send the network all of X as input, the resulting output will be a matrix with a row of probabilities for each of the 228146 examples letter in X. -->

Of course, our data (stored in `X_enc`) contains much more than just a single letter of input. When we input many letters at once, the network outputs a vector of logits for each input letter, and these vectors are stacked together as rows in a matrix. Thus, if we pass all of `X_enc` as input, the output will be a matrix with 228146 rows and 27 columns, where each row contains the next-letter logits for the corresponding input letter.

<!-- , but multiplying `X_enc @ W` will result in a vector just like above for each letter in X. The only difference is that instead of the network outputting a single vector,  -->

<!-- . When we use this neural network on all the letters in X, we'll get a matrix where each row is a vector just like above  -->

<!-- , but multiplying `X_enc @ W` will result in a vector just like above for each letter in X. These outputs  -->
<!-- Each of these vectors will be stored as a row in a final output matrix. -->

<!-- Now that we have a model that can produce 27 outputs for each input, we're on the right track. However, you may notice that the numbers in our output don't look very much like probabilities just yet. -->

<!-- Now let's think about these output vectors a bit. We want each output to tell us the probability of each other letter following the input letter. Right now, the numbers in our output don't look very much like probabilities. For starters, some of them are negative. We can fix that using exponentiation, which shifts numbers over so that negative numbers end up as small decimal values and positive numbers end up larger than 1. -->

In [28]:
# Get the logit vectors for the first two letters in X.
X_enc[:2] @ W

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
         -0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624,
          1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806,
          1.2791,  1.2964,  0.6105],
        [ 0.5760,  1.1415,  0.0186, -1.8058,  0.9254, -0.3753,  1.0331, -0.6867,
          0.6368, -0.9727,  0.9585,  1.6192,  1.4506,  0.2695, -0.2104, -0.7328,
          0.1043,  0.3488,  0.9676, -0.4657,  1.6048, -2.4801, -0.4175, -1.1955,
          0.8123, -1.9006,  0.2286]])
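
The batched version behaves the same way as the single-letter case: each one-hot row picks out one row of `W`. The sketch below checks this on a hypothetical mini-batch of three letters (`X_small` is an assumption for illustration, not part of the training data pipeline).

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=gen)

# Hypothetical mini-batch: the ids for ".", "e", and "m" (0, 5, 13).
X_small = torch.tensor([0, 5, 13])
X_small_enc = F.one_hot(X_small, num_classes=27).float()

logits = X_small_enc @ W  # One row of 27 logits per input letter.
looked_up = W[X_small]    # The same rows, fetched directly by index.

print(logits.shape)
print(torch.allclose(logits, looked_up))
```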

___________________

<!-- Now that we have a model that can produce 27 output probabilities for each letter, we're on the right track. However, y -->


<!-- Ideally, we'd like these values to represent the probabilities of each letter coming after "c". However, so far, the outputs are just raw scores (also called "logits"), not probabilities. -->

<!-- Some of these supposed probabilities are negative, which doesn't make much sense.  -->

Now that we have a model that can produce 27 outputs for each letter, we're on the right track. However, these outputs are logits: the raw scores before any normalization. To be probabilities, each one would need to be a number between 0 and 1, and together they would need to sum to 1 (for 100%). We can convert the logits into probabilities by normalizing them with the softmax function, which works as follows.

First, we exponentiate each of the outputs. This converts all values to positive numbers while preserving their order.

<!-- This fixes the issue of negative logits, as exponentiating makes all the digits positive in such a way that keeps values that were negative smaller than values that were positive. -->
<!-- 
, like so:

We can fix the issue of negatives by exponentiating the values, making all the digits positive while keeping values that were negative smaller than values that were positive. Then we can turn this into a probability by summing up each row and setting the weights for that row equal to their fraction of that sum. This process is called the softmax function. -->

<!-- (indicating 100% probability). -->

<!-- Right now, what we have are logits: the raw output before we do any normalization. When we use this model during training and sampling, we'll need to normalize these outputs to turn them into probabilities. We can do this using the softmax function.  -->

<!-- To understand the softmax function, let's take another look at the output we saw when we used "c" as our input. -->

In [48]:
# Exponentiate the logits so that none are negative.
c_pos = c_output.exp()
c_pos

tensor([0.7127, 0.2617, 0.5569, 1.7095, 1.6898, 3.1305, 1.0530, 2.1042, 0.6178,
        0.3501, 1.8292, 0.1787, 0.4370, 3.7989, 1.6218, 0.0813, 1.6291, 2.1915,
        1.0291, 1.8979, 1.7918, 2.9064, 0.6375, 0.8309, 2.1228, 1.4989, 1.1954])

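Next, we turn these values into probabilities by summing them up and dividing each value by that sum, so that each value becomes its fraction of the total. (When the output has many rows, we do this separately for each row.) Together, these two steps make up the softmax function.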
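
The two softmax steps can be sketched on a small example. The three logits below are made up for illustration; the final line compares the manual result against PyTorch's built-in `torch.softmax`.

```python
import torch

# Hypothetical logits for three possible next letters.
logits = torch.tensor([-1.0, 0.0, 2.0])

# Step 1: exponentiate so every value is positive.
positives = logits.exp()

# Step 2: divide by the sum so the values add up to 1.
probs = positives / positives.sum()

print(probs)
print(probs.sum())

# PyTorch's built-in softmax should give the same result.
print(torch.allclose(probs, torch.softmax(logits, dim=0)))
```

For a matrix of logits with one row per input letter, the same idea applies row by row, for example by dividing by `positives.sum(dim=1, keepdim=True)`.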


To see why this normalization is needed, let's take another look at the output we saw when we used "c" as our input.
<!-- 
The softmax function is pretty simple. 


First, you may have noticed that many of the values above are negative. Probabilities can't be negative. An easy fix to this is to exponentiate them. -->

<!-- So for now, let's explore how that normalization works. -->

<!-- We can do this using the softmax function.  -->

<!-- , which will be explored in futher detail in the following section. -->
<!-- 
To understand the softmax function, let's examine what we get when our input to the neural network is the letter "c".

We can explore the softmax function with the output of 



The softmax function works like so:

First, we need  -->


<!-- TODO: FINISH ABOVE! -->

In [29]:
# Re-examine what happens when we input "c" into our network.
c_enc @ W

tensor([-0.3387, -1.3407, -0.5854,  0.5362,  0.5246,  1.1412,  0.0516,  0.7440,
        -0.4816, -1.0495,  0.6039, -1.7223, -0.8278,  1.3347,  0.4835, -2.5095,
         0.4880,  0.7846,  0.0286,  0.6408,  0.5832,  1.0669, -0.4502, -0.1853,
         0.7528,  0.4048,  0.1785])

Remember, we'd like each value in this output tensor to show us the probability of a letter following "c". Right now, the raw scores for each letter look like so:

<!-- See how many of those numbers are negative? A negative probability doesn't mean much, and so the first step of the softmax function is to exponentiate the values. This 

, so we'll

They are negative and don't sum up to 1 (100%). We can fit the issue of negatives by exponentiating the values, making all the digits positive while keeping values that were negative smaller than values that were positive. Then we can turn this into a probability by summing up each row and setting the weights for that row equal to their fraction of that sum. This process is called the softmax function. -->

In [33]:
list(zip(stoi.keys(), c_enc @ W))

[('.', tensor(-0.3387)),
 ('a', tensor(-1.3407)),
 ('b', tensor(-0.5854)),
 ('c', tensor(0.5362)),
 ('d', tensor(0.5246)),
 ('e', tensor(1.1412)),
 ('f', tensor(0.0516)),
 ('g', tensor(0.7440)),
 ('h', tensor(-0.4816)),
 ('i', tensor(-1.0495)),
 ('j', tensor(0.6039)),
 ('k', tensor(-1.7223)),
 ('l', tensor(-0.8278)),
 ('m', tensor(1.3347)),
 ('n', tensor(0.4835)),
 ('o', tensor(-2.5095)),
 ('p', tensor(0.4880)),
 ('q', tensor(0.7846)),
 ('r', tensor(0.0286)),
 ('s', tensor(0.6408)),
 ('t', tensor(0.5832)),
 ('u', tensor(1.0669)),
 ('v', tensor(-0.4502)),
 ('w', tensor(-0.1853)),
 ('x', tensor(0.7528)),
 ('y', tensor(0.4048)),
 ('z', tensor(0.1785))]

<a id="training"></a>
## Training the Model

Now that we have a model that can produce 27 outputs for each input, we're on the right track. However, you may notice that the numbers in our output don't look very much like probabilities just yet. We'll need to fix that when training our model.

In [None]:
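# To train:
# 1. Use the input to get the model's output probabilities.
# 2. Check the probability it assigned to the correct label.
# 3. Adjust the weights to make that probability higher.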
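
One possible way to turn that plan into code is sketched below. It is not necessarily how this notebook will end up training: the loss (average negative log-likelihood), the plain gradient descent update, the `learning_rate` of 1.0, the 100 steps, and the three-name `sample_names` list are all assumptions chosen to keep the example small and runnable.

```python
import string
import torch
import torch.nn.functional as F

stoi = {char: i for i, char in enumerate("." + string.ascii_lowercase)}

# Hypothetical tiny dataset; the notebook uses every name in names.txt.
sample_names = ["emma", "olivia", "ava"]
xs, ys = [], []
for name in sample_names:
    for char1, char2 in zip("." + name, name + "."):
        xs.append(stoi[char1])
        ys.append(stoi[char2])
X = torch.tensor(xs)
y = torch.tensor(ys)
X_enc = F.one_hot(X, num_classes=27).float()

gen = torch.Generator().manual_seed(42)
W = torch.randn((27, 27), generator=gen, requires_grad=True)

learning_rate = 1.0  # Assumed value, not tuned.
losses = []
for step in range(100):
    # 1. Use the input: compute logits, then softmax probabilities.
    logits = X_enc @ W
    probs = torch.softmax(logits, dim=1)

    # 2. Check the probability assigned to each correct label.
    label_probs = probs[torch.arange(len(y)), y]
    loss = -label_probs.log().mean()  # Lower loss = higher probability.
    losses.append(loss.item())

    # 3. Make it higher: step the weights against the gradient.
    W.grad = None
    loss.backward()
    with torch.no_grad():
        W -= learning_rate * W.grad

print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

The loss should shrink over the steps, which means the model is assigning more probability to the letters that actually follow each input in the sample.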