---
title: "Building an Autoregressive Neural Network"
"subtitle": Part 1
author: "Luca WB"
date: "2026-01-28"         
date-modified: last-modified 
categories: [scratch, en, code, nn]
---


## Brief summary

In this post, we will implement an autoregressive neural network from scratch, relying solely on the PyTorch tensor class. We assume prior familiarity with neural networks; if your knowledge feels a bit rusty or you need a refresher, I recommend first reading [Building Neural Networks from Scratch](../nn-from-scratch-1/index.qmd).

The main goal is to learn how an autoregressive neural network generates words. For this, I'm drawing on Andrej Karpathy's video series about [makemore](https://www.youtube.com/watch?v=PaCmpygFfXo&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=2), a network that creates more words of the same type as its training data: if you train it on names, it generates more words that resemble proper names, and the same goes for anything made of letters.

In this post, I will cover how to build a simple baseline model, and how to implement an MLP-based model and compare the two.

## Setup

First, you need to install PyTorch and download the dataset. For PyTorch, follow the instructions on the official site: https://pytorch.org/get-started/locally/. For the dataset, you can create your own list of names, but it's much easier to download the names.txt dataset from Andrej's repository: https://github.com/karpathy/makemore/blob/master/names.txt.


## Creating a baseline

The purpose of this section is to create the simplest, most naive model possible. This matters because we need a baseline to compare our future models against. We will create a model called a bigram, whose logic is to look only at the last character to predict the next one. Note that we will use just one character of context, and that we treat a single character as the smallest unit of a word. Models like ChatGPT don't work with characters; they use tokens, which are combinations of characters, roughly similar to syllables.

So, to start, we first need to import PyTorch and load our dataset.

In [None]:
import torch
 
# Read the file and build a list with one name per line
names = open("names.txt", "r").read().splitlines() 
names[:5]

Most models can't handle characters directly, so it's useful to convert the letters into numbers in some way. There are many possible ways to do this, but I will use a simple dictionary.


In [None]:
chars = sorted(list(set("".join(names)))) # Creates a sorted list with all the letters in our dataset
charToInt = {s:i+1 for i,s in enumerate(chars)} # Creates a dict to convert chars to ints; we add one to each value to keep 0 free for the line below
charToInt["."] = 0 # I will explain later why we need a special character
print(charToInt)

Since we convert the characters to ints, we will also need an intToChar converter to read the model's output at the end, so let's get it ready.

In [None]:
intToChar = {i:s for s,i in charToInt.items()} # Inverse mapping, from int back to char
print(intToChar)
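
As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below rebuilds them on a tiny, made-up list of names and round-trips a word through both. `toy_names`, `encode` and `decode` are hypothetical names introduced only for this example.

```python
# Toy data so the example runs without names.txt (an assumption for illustration)
toy_names = ["emma", "ava", "mia"]

chars = sorted(set("".join(toy_names)))              # ['a', 'e', 'i', 'm', 'v']
charToInt = {ch: i + 1 for i, ch in enumerate(chars)}
charToInt["."] = 0                                   # index 0 is reserved for the special character
intToChar = {i: ch for ch, i in charToInt.items()}

def encode(word):
    """Turn a string into a list of integer ids."""
    return [charToInt[ch] for ch in word]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(intToChar[i] for i in ids)

ids = encode("emma")
print(ids)          # [2, 4, 4, 1]
print(decode(ids))  # emma
```

If decoding an encoded word does not give the word back, one of the two mappings is wrong.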

Now, for our model, we need to count how many times each sequence occurs. For example, given that we start with the letter "a", how many times is "m" the next character? This is also why we need a special character: we always need something to start from. After all, the logic of an autoregressive model is to take the model's output and feed it back in as input, so we need an initial input. In our case, we will use "." as the symbol that both starts and ends a name/word (without an end symbol, the model would generate forever). To make this clearer, see the code below.


In [None]:
N = torch.zeros((27,27)).int() # N[i, j] counts how often character j follows character i

for name in names:
    seq = ["."] + list(name) + ["."] # Turn the name into a list of characters and add "." at both ends
    for ch1,ch2 in zip(seq, seq[1:]): # In each iteration, ch1 is one character and ch2 is the next one
        id1, id2 = charToInt[ch1], charToInt[ch2]
        N[id1, id2] += 1
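
To see what this kind of loop counts, here is a small sketch on a single name. It uses a plain `collections.Counter` instead of the tensor N, a substitution made only to keep the example short.

```python
from collections import Counter

name = "emma"
seq = ["."] + list(name) + ["."]      # ['.', 'e', 'm', 'm', 'a', '.']
pairs = list(zip(seq, seq[1:]))       # consecutive (current, next) characters
print(pairs)
# [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]

counts = Counter(pairs)               # how many times each bigram occurs
print(counts[("m", "m")])             # 1
print(counts[("a", "e")])             # 0, this bigram never appears in "emma"
```

Note that the first pair starts with "." and the last pair ends with ".", which is how the model learns which letters tend to begin and end a name.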

The matrix N now holds how often each pair of consecutive characters occurs. For example, the most common pair is "n" followed by ".", which means that the most common letter at the end of a name in our dataset is "n". If you run this on all the names, you can use the code below to find the most common pair.

In [None]:
id1, id2 = (N == N.max()).nonzero(as_tuple=True) # Builds a boolean matrix that is True only at the max value, then returns the indices where it is True
print(intToChar[id1.item()], "-->", intToChar[id2.item()], "occurs", N.max().item(), "times")

So let's see how our bigrams are distributed

In [None]:
#| code-fold: true
import matplotlib.pyplot as plt

plt.figure(figsize= (16,16))
plt.imshow(N, cmap="Blues")
for i in range(27):
    for j in range(27):
        chstr = intToChar[i] + intToChar[j]
        plt.text(j,i, chstr, ha="center", va="bottom", color="gray")
        plt.text(j,i, N[i,j].item(), ha="center", va="top", color="gray")
    
plt.axis("off")

One very interesting thing to note is that many combinations never occur, like "bk" or "gc". This makes it impossible for our model to generate a name containing these combinations. It is fine to leave it like this, but it is good practice to add 1 to every count, which ensures there is at least a minimal chance of generating a rare sequence.


In [None]:
N = N + 1 
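
The effect of adding 1 is easiest to see on a single row. The counts in `row` below are made up for illustration; they are not taken from the dataset.

```python
import torch

row = torch.tensor([0, 3, 1])               # hypothetical counts for three next characters

p_raw = row / row.sum()                     # no smoothing
p_smooth = (row + 1) / (row + 1).sum()      # add-one smoothing

print(p_raw)     # the first entry is exactly 0, so it can never be sampled
print(p_smooth)  # every entry is now strictly positive
```

The smoothed row still sums to 1; some probability mass simply moves from the frequent characters to the ones that were never observed.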

In [None]:
#| echo: false
import matplotlib.pyplot as plt

plt.figure(figsize= (16,16))
plt.imshow(N, cmap="Blues")
for i in range(27):
    for j in range(27):
        chstr = intToChar[i] + intToChar[j]
        plt.text(j,i, chstr, ha="center", va="bottom", color="gray")
        plt.text(j,i, N[i,j].item(), ha="center", va="top", color="gray")
    
plt.axis("off")

So, let's transform our count matrix into a probability matrix.

In [None]:
P = N.float()
P = P / P.sum(dim=1, keepdim=True) # Divide each row by its sum, so every row becomes a probability distribution
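
Keeping the summed dimension matters in this normalization. A minimal sketch on a made-up 2x2 count matrix shows that with it each row is divided by its own sum, while without it broadcasting silently divides by the wrong sums.

```python
import torch

counts = torch.tensor([[1., 3.],
                       [2., 6.]])           # made-up counts, for illustration only

row_sums = counts.sum(dim=1, keepdim=True)  # shape (2, 1): one sum per row
P_ok = counts / row_sums                    # each row is divided by its own sum
print(P_ok.sum(dim=1))                      # tensor([1., 1.])

flat_sums = counts.sum(dim=1)               # shape (2,): the row axis is gone
P_bad = counts / flat_sums                  # broadcasts as (1, 2): column j is divided by the sum of row j
print(P_bad.sum(dim=1))                     # rows no longer sum to 1
```

Both versions run without any error, so checking that the rows sum to 1 is a cheap way to catch this mistake.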

In [None]:
#| code-fold: true
import matplotlib.pyplot as plt

plt.figure(figsize= (16,16))
plt.imshow(P, cmap="Blues")
for i in range(27):
    for j in range(27):
        chstr = intToChar[i] + intToChar[j]
        plt.text(j,i, chstr, ha="center", va="bottom", color="gray")
        plt.text(j,i, f"{P[i,j].item():.3f}", ha="center", va="top", color="gray")
    
plt.axis("off")

Some probabilities appear as 0 because the visualization is limited to 3 decimal places. With that, we already have our model: it is just the probability matrix P. Below I show how to use it.


In [None]:
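for i in range(10):
    out = []
    ix = 0 # Start from "." (index 0)
    while True:
        ix = torch.multinomial(P[ix], num_samples=1, replacement=True).item() # Sample the next character given the current one

        if ix == 0: # "." was sampled, so the name is finished
            break

        out.append(intToChar[ix])
    print("".join(out)) 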
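
`torch.multinomial` draws indices with probability proportional to the weights in its input. A small sketch with a made-up distribution and a seeded `torch.Generator` (used here only to make the run repeatable) shows that the empirical frequencies approach the given probabilities.

```python
import torch

g = torch.Generator().manual_seed(0)          # fixed seed, so the run is repeatable
p = torch.tensor([0.6, 0.3, 0.1])             # made-up distribution over three symbols

samples = torch.multinomial(p, num_samples=10_000, replacement=True, generator=g)
freq = torch.bincount(samples, minlength=3) / samples.numel()
print(freq)                                   # close to [0.6, 0.3, 0.1]
```

In the sampling loop, the row of P selected by the current character plays the role of `p`, so each new character is drawn according to how often it followed the previous one in the dataset.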

The generated names are not good at all, but the model is working correctly. If you think it is just producing random letters, compare with the code below, where every letter has the same probability.

In [None]:
P_uniform = torch.ones((27,27)) # Separate name, so the bigram matrices N and P are not overwritten
P_uniform = P_uniform / P_uniform.sum(dim=1, keepdim=True)

for i in range(10):
    out = []
    ix = 0
    while True:
        ix = torch.multinomial(P_uniform[ix], num_samples=1, replacement=True).item()

        if ix == 0: 
            break

        out.append(intToChar[ix])
    print("".join(out)) 
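
Reading samples is a subjective way to compare the two models. One common way to put a number on the difference, offered here as a sketch and not as part of the original code, is the average negative log-likelihood that a model assigns to real names (lower is better). `avg_nll` is a hypothetical helper; it indexes `P[i][j]`, so it works with a nested list as below and should also work with a 2-D tensor.

```python
import math

def avg_nll(P, words, charToInt):
    """Average negative log-likelihood per bigram; lower means a better fit."""
    total, n = 0.0, 0
    for w in words:
        seq = ["."] + list(w) + ["."]
        for ch1, ch2 in zip(seq, seq[1:]):
            total += -math.log(P[charToInt[ch1]][charToInt[ch2]])
            n += 1
    return total / n

# Vocabulary laid out as in the post: "." is 0, then "a".."z" are 1..26
charToInt = {".": 0, **{chr(ord("a") + i): i + 1 for i in range(26)}}

# Uniform model: every next character has probability 1/27
P_uniform = [[1 / 27] * 27 for _ in range(27)]

print(avg_nll(P_uniform, ["emma", "olivia"], charToInt))  # equals log(27), about 3.296
```

The uniform model always scores log(27), whatever the words are, so a model that has learned something about the data should score below that value.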

## Let's Build the MLP

Moving forward, we now need to create our MLP. The basic idea is to create a first layer that is responsible for transforming the characters (or rather, the numbers that represent them) into embeddings. In other words, we turn the arbitrary integer that represents a character into a vector of several numbers. One reason for doing this is that the model then has a representation of the letters in a vector space, so the vowels will probably end up close to each other, and our special character "." will likely be far from all the others. For the model, this means that a vowel can probably be swapped for another vowel without increasing the loss much. After creating this layer, we will simply connect it to an ordinary dense layer and then to an output layer.

First, we need to create a new split of our data: we need X (the input variables, our context) and Y (the target, the next character).


In [None]:
context_size = 2 # This is a very important variable: it sets how many characters the model sees to predict the next one

X, Y = [], []
for n in names:
    context = [0] * context_size # Creates a list of 0s (the int that represents "." for us) with the length of our context
    for ch in n + ".": # We add "." only at the end, because the context already starts with "."
        ix = charToInt[ch]
        X.append(context)
        Y.append(ix)
        context = context[1:] + [ix] # We drop the oldest int in the context and append the new one
X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)
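
To make the sliding window concrete, this sketch runs the same kind of loop on a single name and records each context next to its target. It uses a tiny hard-coded vocabulary (`stoi` and `itos` are stand-in names for this example) so that it runs on its own.

```python
context_size = 2
stoi = {".": 0, "a": 1, "e": 2, "m": 3}       # tiny stand-in vocabulary for this example
itos = {i: ch for ch, i in stoi.items()}

pairs = []
context = [0] * context_size                   # start with "..", i.e. only start markers
for ch in "emma" + ".":
    ix = stoi[ch]
    pairs.append(("".join(itos[i] for i in context), ch))
    context = context[1:] + [ix]               # slide the window one character forward

for ctx, target in pairs:
    print(ctx, "-->", target)
# .. --> e
# .e --> m
# em --> m
# mm --> a
# ma --> .
```

A name with 4 letters therefore produces 5 training examples, one per letter plus one for the final ".".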

Now we create our first layer, which will embed the integers. It acts like a weight matrix, but without a bias (a bias would be redundant here, and you will see later that we don't even need to actually perform the multiplication).

In [None]:
C = torch.randn((27, 2)) # 27 because our vocabulary has 27 symbols (the letters + ".") and 2 because we want to turn each character into a vector with 2 dimensions
C.shape

Now, to test it, we need to multiply X by C, but the shapes don't match.

In [None]:
X.shape, C.shape

So, to be able to do the multiplication, we need the last dimension of X to become 27. This can be achieved by simply one-hot encoding it, but there is a better way. Below is a multiplication between C and a one-hot vector that represents 5 ("e" in our vocabulary).

In [None]:
import torch.nn.functional as F 

F.one_hot(torch.tensor(5), num_classes = 27).float() @ C

Now, look at the code below.

In [None]:
C[5], C[[5,6,7]] 

Basically, multiplying by a one-hot vector just picks out the corresponding row. So, to solve our multiplication problem, we don't need to one-hot encode the last dimension of X; we can simply use X as an index into C.
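
The equivalence between the two approaches can be checked directly. This is a minimal sketch, with a freshly drawn `C` and a made-up batch of indices `idx`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = torch.randn((27, 2))                       # random embedding table, same shape as in the post
idx = torch.tensor([5, 6, 7])                  # made-up batch of character ids

via_matmul = F.one_hot(idx, num_classes=27).float() @ C   # (3, 27) @ (27, 2) -> (3, 2)
via_index = C[idx]                                          # plain row lookup -> (3, 2)

print(torch.allclose(via_matmul, via_index))   # True
```

Indexing gives the same result without building the one-hot matrix or performing the multiplication.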


In [None]:
emb = C[X] 
emb.shape

The first dimension indexes the input examples, the second is the position within our context, and the last holds the values of the embedding. To clarify, the code below picks the embedding of the character in the second position of the context for the first example.

In [None]:
emb[0][1]
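
The same indexing rule explains the shape of `emb`. In this sketch `X` is a small made-up batch of three contexts, not the real dataset.

```python
import torch

torch.manual_seed(0)
C = torch.randn((27, 2))                       # embedding table: 27 characters, 2 dimensions
X = torch.tensor([[0, 0],
                  [0, 5],
                  [5, 13]])                    # made-up contexts: "..", ".e", "em"

emb = C[X]
print(emb.shape)                               # torch.Size([3, 2, 2])

# emb[i, j] is the embedding of the j-th character in the i-th context
print(torch.equal(emb[1, 1], C[5]))            # True
print(torch.equal(emb[1][1], emb[1, 1]))       # True, both index styles agree
```

Indexing a (27, 2) table with a (3, 2) integer tensor returns a (3, 2, 2) tensor: the shape of the index, followed by the embedding dimension.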