# The Transformer

In this week's Virtual Lab, you will write your own small GPT-style language model, the same family of architecture that underlies ChatGPT. We will build a Transformer from scratch, learning as we go why each component is necessary and how it contributes to the overall network's function.

The key idea behind the Transformer is that it uses a mechanism called self-attention to weigh the importance of different parts of the input sequence when making predictions about the output sequence. This is in contrast to earlier models that used recurrence or convolution to process sequences, which have limitations when it comes to capturing long-range dependencies.

The Transformer consists of a series of blocks, each of which contains a self-attention mechanism followed by a feedforward neural network. The self-attention mechanism computes a weighted sum of the input sequence, with the weights determined by the similarity between each element of the sequence and every other element. This allows the model to attend to different parts of the input sequence depending on the task at hand. The feedforward neural network then applies a non-linear transformation to the weighted sum to produce the output of the block.
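
For reference, the weighted sum described above is usually written as scaled dot-product attention. The symbols are not defined elsewhere in this notebook and follow the common convention: $Q$, $K$, and $V$ are the query, key, and value matrices computed from the input, and $d_k$ is the dimension of each key vector.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Each row of the softmax output holds the weights for one position, and multiplying by $V$ forms the weighted sum.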

The language model we will build (the class keeps the name `BigramLanguageModel` from the simpler model we start with) takes a sequence of integer indices representing characters in a text and learns to predict the next character in the sequence. It does this by applying a series of transformer blocks to the input sequence, and then passing the output through a linear layer to obtain logits, which are unnormalized scores for each possible next character. During training, the model is given the ground-truth next character as a target, and the loss is computed as the cross-entropy between the softmax of the logits and the target. During generation, the model is given a starting sequence and generates new characters one at a time by sampling from the probability distribution obtained by applying softmax to the logits.
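
To make the logits, loss, and sampling steps concrete, here is a minimal sketch in plain Python. The three-character vocabulary and the logit values are made up for illustration; the notebook itself uses `F.softmax`, `F.cross_entropy`, and `torch.multinomial` for the same steps.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next character."""
    return -math.log(softmax(logits)[target_index])

# Hypothetical logits for a 3-character vocabulary; index 0 is the true next character.
toy_logits = [2.0, 1.0, 0.1]
toy_probs = softmax(toy_logits)
toy_loss = cross_entropy(toy_logits, target_index=0)
print([round(p, 3) for p in toy_probs])
print(round(toy_loss, 3))

# Generation samples an index in proportion to these probabilities.
rng = random.Random(0)
sampled_index = rng.choices(range(len(toy_probs)), weights=toy_probs, k=1)[0]
print(sampled_index)
```

The loss shrinks toward zero as the probability of the correct character approaches 1, which is what training pushes the model toward.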

This notebook is adapted from Andrej Karpathy's GPT [Zero to Hero](https://youtu.be/kCc8FmEb1nY) example, where he spells out how to build a language model that can generate real text!

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

## 1. Loading and formatting data

First, let's explore the data we will be training our transformer with. We will be using the `tinyshakespeare` dataset, which is a single plain-text file containing a collection of Shakespeare's writing. We will be training our transformer to generate text in the style of Shakespeare, character by character. Let's explore the data!

### Getting the data

In [None]:
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('/home/student/Desktop/classroom/starter/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print(f'Total length of dataset in characters: {len(text)}')

In [None]:
# Looking at the first 1000 characters...
print(text[:1000])

In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
all_chars = ''.join(chars)
print(f'Distinct characters: {all_chars}')
print(f'Number of unique characters: {vocab_size}')

Looking at the data we will be "learning", we can see that it is a script of Shakespeare in plain-text format.

Now, we need to create our encoding from characters to a numerical representation. The model cannot operate on raw characters; it needs each character as an integer index, which it will later look up in an embedding table. How will we do this?

### Vector Embedding

In [None]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("Hi! My name is Pria."))
print(decode(encode("Hi! My name is Pria.")))

So what is this doing? We are trying to encode our text file into a numerical representation, and in this case we are just taking the character vocabulary we've defined and using it as our direct mapping. For example, we can see that the first letter of our string is `H`. This corresponds to the numerical integer of `20`, according to our encoder. Can we see where this is defined?

In [None]:
chars[20]

This is a relatively simple mapping onto our pre-defined vocabulary of characters. Many of the more complex Language Models use much larger vocabularies in which a single token covers multiple characters or even a whole word. For the sake of simplicity, however, we will complete this tutorial using character-based encoding with a small vocabulary. Dive deeper into something like [tiktoken](https://github.com/openai/tiktoken) if you're interested in learning about more complex encodings!
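
One property of a character-level vocabulary is worth seeing directly: the mapping only covers characters present in the text it was built from. The sketch below uses a toy string rather than the Shakespeare data, and the `toy_` names are illustrative and not part of the notebook.

```python
# Toy vocabulary built from a short string (hypothetical, not the Shakespeare data).
toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """String -> list of integer ids."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """List of integer ids -> string."""
    return "".join(toy_itos[i] for i in ids)

round_trip = toy_decode(toy_encode("hello"))
print(toy_chars)
print(toy_encode("hello"), "->", round_trip)

# A character that never appeared in the source text has no id.
try:
    toy_encode("hello!")
    unseen_failed = False
except KeyError:
    unseen_failed = True
print("unseen character rejected:", unseen_failed)
```

Encoding followed by decoding returns the original string, while a character outside the vocabulary raises a `KeyError`. The notebook's `encode` behaves the same way for any character that does not occur in the dataset.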

Now that we have our encoder defined, we need to turn our Shakespeare data into a vector encoded format...

We will create a massive **tensor** of the whole dataset.

Tensors are the fundamental data structure in PyTorch and are used to represent data and parameters in neural networks. Tensors are similar to NumPy arrays, but they come with additional features specifically designed for deep learning. Some key features of PyTorch tensors include:

* GPU support: Tensors can be easily moved to a GPU (Graphics Processing Unit) for accelerated computation. This is especially useful for training and running deep learning models, as GPUs can perform calculations much faster than CPUs (Central Processing Units) in many cases.

* Automatic differentiation: PyTorch tensors support automatic differentiation, which is a technique used to compute gradients (derivatives) required for optimizing neural network parameters. This makes training models in PyTorch more efficient and easier to implement.

* Dynamic computation graph: PyTorch tensors allow for dynamic computation graphs, meaning the structure of the neural network can change during runtime. This provides more flexibility when designing and experimenting with complex models.

You can create tensors in PyTorch with various data types (e.g., float, int) and shapes (e.g., scalar, vector, matrix). Tensors can be created from existing data, like Python lists or NumPy arrays, or initialized with specific values, such as zeros or random numbers. Once created, you can perform mathematical operations, reshape tensors, and manipulate their data to build and train machine learning models. In this case, we are creating one big tensor with our Shakespeare data, and then splitting it into training and validation sets...
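
The features listed above can be tried out in a few lines. This is a small sketch with made-up values; the names `ids`, `weights`, and `demo_device` are illustrative only.

```python
import torch

# Create tensors from Python lists, with explicit data types.
ids = torch.tensor([20, 47, 2], dtype=torch.long)             # integer token ids
weights = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)   # float parameters

print(ids.shape, ids.dtype)
print(torch.zeros(2, 3).shape)

# Automatic differentiation: the gradient of sum(w^2) with respect to w is 2w.
loss = (weights ** 2).sum()
loss.backward()
print(weights.grad)

# Move a tensor to a GPU only if one is available.
demo_device = 'cuda' if torch.cuda.is_available() else 'cpu'
ids = ids.to(demo_device)
print(ids.device)
```

Token ids are stored as `torch.long` because embedding lookups and `F.cross_entropy` targets expect integer indices.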

In [None]:
# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

print(train_data[:300])

tensor_len = len(train_data) + len(val_data)
print(f'Total tensor length: {tensor_len}')

### Chunking

Now, we can see that we have a numerical representation of our data with the same length as the original text. This is because we are encoding with a 1:1 character:token ratio. Current Large Language Models use larger vocabularies in which a single token usually covers several characters, so their character:token ratio is higher. For now, this is great for us to learn!

Next, we need to chunk our data. This involves intelligently splitting our input data in such a way that we can pass modular, bite-sized "chunks" of data through the network for our Transformer to begin to learn. Chunking is necessary for a few reasons:

* **Memory constraints**: Transformers have a self-attention mechanism that computes relationships between all pairs of input elements (e.g., words or tokens). This results in a quadratic increase in memory requirements with respect to sequence length. For long sequences, the memory needed to store intermediate values during training can quickly exceed the available GPU memory. By breaking the input into smaller chunks or blocks, you can fit the model and its intermediate values into the memory, allowing for efficient training.

* **Computational efficiency**: When working with long sequences, the computational cost of the self-attention mechanism can be high due to the quadratic complexity. By chunking the input, you can significantly reduce the number of calculations needed, making the training process faster and more efficient.

* **Gradient propagation**: Shorter sequences keep the computation graph for each training step small, which tends to make optimization easier to manage. Note that vanishing and exploding gradients over long sequences are mainly a problem for recurrent networks; Transformers are less affected because attention connects positions directly.

* **Parallelization**: Training deep learning models, including Transformers, can be time-consuming. By dividing the input into smaller chunks or blocks, you can take advantage of parallelization, distributing the computation across multiple GPUs or other hardware accelerators. This can speed up training and make the process more efficient.
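
The quadratic cost mentioned in the list above can be checked with simple arithmetic. This sketch counts only the pairwise attention scores for a single head and ignores batching, layers, and causal masking; the token count and chunk length are made-up values.

```python
def attention_score_count(seq_len):
    """Number of pairwise attention scores for one sequence of length seq_len."""
    return seq_len * seq_len

total_tokens = 1024   # hypothetical amount of text
chunk_len = 8         # hypothetical chunk length

one_long_sequence = attention_score_count(total_tokens)
num_chunks = total_tokens // chunk_len
many_short_chunks = num_chunks * attention_score_count(chunk_len)

print(one_long_sequence)
print(many_short_chunks)
print(one_long_sequence // many_short_chunks)
```

The trade-off is that tokens in different chunks can no longer attend to each other, so the chunk length also caps how much context the model can use.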

The block size, then, is the maximum number of characters in a single sequence that the model sees as context at once. We can start to visualize this...

In [None]:
block_size = 8
train_data[:block_size+1]

We can see that this results in a tensor of length 9. Why the extra character? Let's see what this is doing...

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1] # so y is defined as one step up the string...
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is: {target}")

What does this look like if we see just the text?

In [None]:
block_size_txt = 13

x_txt = text[:block_size_txt]
y_txt = text[1:block_size_txt+1] # so y is defined as one step up the string...
print(f"X Text: {x_txt}")
print(f"Y Text: {y_txt}\n")

for t in range(block_size_txt):
    context = x_txt[:t+1]
    target = y_txt[t]
    print(f"when input is [{context}] the target is: {target}")

We can see here that because we've defined `y` to be one character ahead of `x`, every chunk of `block_size` characters yields `block_size` training examples: for each position, the context is everything up to that position and the target is the character that follows.

With this, we can also define a **batch**. This allows the transformer to process multiple chunks at a time, keeping the GPUs/CPUs that we're working with busy.

In [None]:
torch.manual_seed(999)
block_size = 8 # what is the maximum context length for predictions?
batch_size = 4 # how many independent sequences will we process in parallel?

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)
print('------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f'when input is [{context.tolist()}] the target is: {target}')

With this single tensor, we have defined 32 (`8 * 4`) separate inputs for training our transformer. Let's start modeling!

In [None]:
print(xb) # our input to the transformer

## 2. Implementing the Bigram Language Model

A Bigram Language Model is a simple language model that predicts the next word (token) in a sequence based on the current word only. It estimates the probability of a word following another word in a text corpus. In other words, it relies on the conditional probability of a word given its immediate predecessor.
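
The conditional probability described above can be estimated directly by counting which character follows which. This is a minimal sketch on a toy string; the function name and the corpus are illustrative. The notebook's model does not count: it learns an equivalent lookup table by gradient descent.

```python
from collections import Counter, defaultdict

def bigram_probabilities(corpus):
    """Estimate P(next character | current character) from raw counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        counts[current][nxt] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: n / total for nxt, n in followers.items()}
    return probs

toy_corpus = "abacab"  # hypothetical text
toy_bigram = bigram_probabilities(toy_corpus)
for current in sorted(toy_bigram):
    print(repr(current), toy_bigram[current])
```

In this toy corpus `a` is followed by `b` twice and by `c` once, so the estimate for `b` after `a` is two thirds. Nothing before the current character influences the prediction, which is the limitation the Transformer section addresses.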

In [None]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        # This stands for batch, time, channel
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

This code defines a simple bigram language model using PyTorch. The key parts of this code to understand are:

1. **forward()**: This function computes the logits for each token in the input sequence idx. If targets are provided, it also calculates the cross-entropy loss between the predicted logits and the target tokens. 
    * `(B, T, C)` stands for `Batch` (how many independent sequences we process in parallel), `Time` (the position within each sequence, up to `block_size`), and `Channel` (the size of the vector produced at each position, which here equals our character vocabulary size).
    * The `logits` are unnormalized scores over the vocabulary for the next character at each position. Applying softmax turns them into a probability distribution.
    

2. **generate()**: This function generates a new sequence of tokens by iteratively sampling from the probability distribution of the next token, given the current context. It takes the initial context idx and the number of new tokens to generate max_new_tokens as input, and returns the generated sequence.

When put together, this model can be used to predict the next token (or simply character, in this case) that will come up.
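
A useful sanity check on the loss printed above: a model that assigned equal probability to every character would score a cross-entropy of `ln(vocab_size)`. A freshly initialized model usually starts somewhat above this value because its random logits are not exactly uniform. The sketch assumes a vocabulary of 65 characters, which is an assumption; substitute the `vocab_size` printed earlier.

```python
import math

assumed_vocab_size = 65  # assumption; replace with the vocab_size printed earlier
uniform_loss = -math.log(1.0 / assumed_vocab_size)
print(round(uniform_loss, 3))
```

A training loss well below this baseline indicates that the model has learned something about which characters tend to follow which.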

Now we can train this model through a series of learning steps! We will use the AdamW optimizer, a variant of Adam.

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32
for steps in range(100):
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

This is not English, and definitely is not Shakespeare...

What if we try training it for another `10,000` steps?

In [None]:
batch_size = 32
for steps in range(10000):
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

We see the loss decrease, but this still doesn't seem to make much sense... It is not English, and certainly is not Shakespeare! There's a problem...

It predicts the next token based *solely* on the current token. In order to really feel like we are reanimating Shakespeare, we need to explore how these character-predictions can start talking to each other in order to understand context and semantic meaning within text.

And so we arrive at...

## 3. The Transformer

### The Self-Attention Mechanism

Self-attention is a key concept inside a Transformer that helps the model understand how important different words in a sentence are in relation to each other. Think of it as the Transformer's way of paying attention to different parts of the input when making predictions.

Imagine you're reading a sentence and trying to figure out what it means. You naturally pay more attention to certain words that help you understand the main idea, while other words might not be as important. Self-attention works similarly in a Transformer.

When a Transformer processes a sentence, it looks at all the words in the sentence simultaneously. For each word, it calculates a score that represents how important that word is in relation to the word it's trying to predict. These scores are used as weights to create a weighted average of all the words in the sentence. This weighted average is then used as input for the next layer in the Transformer.

By using self-attention, the Transformer can understand the relationships between words in a sentence, no matter how far apart they are. This allows the model to capture complex patterns and dependencies in the text, making it great at tasks like language translation, text generation, and more.

Let's look at a dummy-example of the math behind this self-attention mechanism...

In [None]:
### Dummy example of self-attention with matrix multiplication
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

This code is doing the following:
1. Create a lower-triangular matrix `a` of shape (3, 3): ones on and below the diagonal, zeros above it.
2. Normalize the rows of `a` by dividing each element by the sum of its row, so that each row sums to 1. Row `i` now puts equal weight on elements `0` through `i` and zero weight on later ones.
3. Create a random matrix `b` of shape (3, 2) with integer values from 0 to 9.
4. Perform matrix multiplication between `a` and `b`, resulting in matrix `c` of shape (3, 2). Row `i` of `c` is the average of rows `0` through `i` of `b`.

The key concept in this code is the matrix multiplication between matrices `a` and `b` (`a @ b`). In the context of self-attention, matrix `a` represents the attention weights, where each row contains probabilities that determine how much each row of `b` contributes to the final output. By performing the matrix multiplication, we obtain a weighted aggregation (matrix `c`), where each row is a combination of the rows of `b`, with weights determined by the corresponding row in `a`. Because `a` is lower-triangular, each position only draws on itself and earlier positions, never on the future.
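
The same averaging matrix can be produced with a mask and a softmax, which is the form the `Head` class further down uses. This sketch starts from all-zero scores, so every visible position gets equal weight; in the real model the zeros are replaced by learned query-key scores. The names `a_avg`, `scores`, and `a_softmax` are illustrative.

```python
import torch
from torch.nn import functional as F

T = 3
tril = torch.tril(torch.ones(T, T))

# Version 1: explicit row normalization, as in the cell above.
a_avg = tril / tril.sum(1, keepdim=True)

# Version 2: start from all-zero scores, hide the future with -inf, then softmax.
scores = torch.zeros(T, T)
scores = scores.masked_fill(tril == 0, float('-inf'))
a_softmax = F.softmax(scores, dim=-1)

print(a_softmax)
print(torch.allclose(a_avg, a_softmax))
```

Positions filled with `-inf` receive a weight of exactly zero after the softmax, which is how the model is prevented from looking at characters it has not yet generated.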

**The model can now pay attention to context...**

So let's take that principle and apply it to an actual Transformer. Here, I will define all of the code, redefining many of the functions we've already worked through...

### Hyperparameters

Set hyperparameters like batch size, block size, maximum number of iterations, learning rate, number of embeddings (n_embd), number of heads (n_head), number of layers (n_layer), dropout, and device (CPU or GPU).

In [None]:
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu' # use a GPU if one is available, otherwise fall back to CPU
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
torch.manual_seed(999)
# ------------

### Unique Characters and Mapping

Create a list of unique characters in the text and create a mapping from each character to an integer value. This allows the text to be represented as a sequence of integers rather than a sequence of characters.

In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

### Train and Validation Splits

Split the data into a training set (90% of data) and validation set (10% of data).

In [None]:
# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

### Data Loading Function

Create a function that generates a small batch of data for inputs x and targets y from either the training or validation set.

In [None]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

### Loss Estimation Function

Create a function that estimates the loss for the training and validation set by iterating through several batches of data and averaging the loss across batches.

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

### Transformer Head Module

Create a class for one head of self-attention in the transformer. This includes linear transformations for the key, query, and value inputs, as well as computing attention scores ("affinities"), performing weighted aggregation of the values, and applying dropout.

In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
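
The scaling factor applied to the query-key scores keeps their variance roughly constant as the vector dimension grows, so the softmax does not saturate toward a one-hot distribution. The conventional choice is one over the square root of the key dimension. Here is a rough empirical sketch in plain Python, using random vectors with unit-variance entries in place of learned queries and keys; `head_dim` and the helper names are hypothetical.

```python
import random
import statistics

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_variance(dim, scale, trials=2000, seed=0):
    """Empirical variance of scale * (q . k) for random unit-variance vectors."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scores.append(scale * dot(q, k))
    return statistics.pvariance(scores)

head_dim = 16  # hypothetical head size
raw_var = score_variance(head_dim, scale=1.0)
scaled_var = score_variance(head_dim, scale=head_dim ** -0.5)
print(round(raw_var, 2), round(scaled_var, 2))
```

The unscaled variance comes out close to `head_dim`, while the scaled variance stays close to 1.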

### Multi-Head Attention

Create a class for multiple heads of self-attention in the transformer. This includes several heads created from the Head class, a linear transformation layer, and dropout.

In [None]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

### Feed Forward Class

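Create a class for a small feed-forward network: a linear layer that expands to `4 * n_embd`, a ReLU non-linearity, a linear layer that projects back to `n_embd`, and dropout.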

In [None]:
class FeedFoward(nn.Module):
    """ two linear layers with a ReLU non-linearity in between, followed by dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

### Block Class

Create a class for a transformer block that includes communication followed by computation. This includes a multi-head attention layer, a feed forward layer, and layer normalization.

In [None]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
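
The two ingredients of `Block.forward`, layer normalization and the residual (skip) connection, can be sketched on a single vector in plain Python. This is a simplified illustration: `nn.LayerNorm` additionally has a learnable scale and shift (initialized to 1 and 0), which are omitted here, and the stand-in sublayer is made up.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def residual_add(x, sublayer_output):
    """Skip connection: add the sublayer's output back onto its input."""
    return [a + b for a, b in zip(x, sublayer_output)]

x = [1.0, 2.0, 3.0, 4.0]  # hypothetical 4-dimensional token vector
normed = layer_norm(x)
print([round(v, 3) for v in normed])

# Stand-in for self.sa or self.ffwd: any function of the normalized vector.
sublayer_output = [0.1 * v for v in normed]
updated = residual_add(x, sublayer_output)
print([round(v, 3) for v in updated])
```

Normalizing before each sublayer, as the block above does with `self.ln1` and `self.ln2`, is often called the pre-norm arrangement. The residual addition lets each sublayer contribute a correction to `x` rather than replace it.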

### Bigram Language Model Class

Create the full language model class. It keeps the name `BigramLanguageModel` from the earlier section, but it is now a small Transformer: it includes embedding layers for tokens and positions, several transformer blocks, a final layer normalization, and a linear head for the output.

In [None]:
# Transformer language model (keeps the BigramLanguageModel name from the earlier section)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token ids are looked up as n_embd-dimensional vectors; positions get their own learned embeddings
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
    
model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')
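
The printed parameter count can be cross-checked by hand from the layer shapes. The sketch below follows the classes defined above and uses the values from the hyperparameter cell; `vocab_size=65` is an assumption, so substitute the value printed earlier. The helper names are illustrative. The `tril` mask is registered as a buffer, so it is not counted as a parameter.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional biases of one nn.Linear layer."""
    return n_in * n_out + (n_out if bias else 0)

def count_parameters(vocab_size, n_embd, n_head, n_layer, block_size):
    head_size = n_embd // n_head
    embeddings = vocab_size * n_embd + block_size * n_embd
    # key, query, and value projections per head, all without bias
    attention_heads = n_head * 3 * linear_params(n_embd, head_size, bias=False)
    attention_proj = linear_params(n_embd, n_embd)
    feed_forward = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
    layer_norms = 2 * (2 * n_embd)  # ln1 and ln2, each with a scale and a shift
    per_block = attention_heads + attention_proj + feed_forward + layer_norms
    final_norm = 2 * n_embd
    lm_head = linear_params(n_embd, vocab_size)
    return embeddings + n_layer * per_block + final_norm + lm_head

# vocab_size=65 is an assumption; the other values match the hyperparameter cell.
total = count_parameters(vocab_size=65, n_embd=64, n_head=4, n_layer=4, block_size=32)
print(total / 1e6, 'M parameters')
```

Most of the parameters sit in the transformer blocks, and within each block the feed-forward network is the largest part.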

### Optimizer

Create a PyTorch optimizer using the AdamW algorithm and the model's parameters.

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

### Training Loop

Train the model by iterating through the training data and evaluating the loss at set intervals. The loss is then backpropagated through the model to update the parameters.

In [None]:
for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

### Generation Function

Generate text from the trained model by passing in a context (a tensor of shape (1,1) initialized to all zeros) and generating new tokens one at a time using the probability distribution output by the model. This process is repeated for a maximum number of tokens to generate a longer sequence of text.

In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

We can see that even though this isn't great, it is much better! We need to keep in mind that the model we've built is only trained to predict one character at a time, so it still struggles to produce realistic English text. It might not understand the English language, but we can see that it uses context much better than the raw Bigram model. It looks like the Shakespeare text we input!

To get closer to a true English language model, however, we would need a much larger model and dataset, and a vocabulary whose tokens cover word pieces or whole words rather than single characters. This is the approach that models like ChatGPT take, and we will explore more of this in the following lesson.