In [1]:
import numpy as np
import itertools

In [2]:
rng = np.random.default_rng(1337)

## n-gram Language Model


## Dataset

Before we get started, let's quickly process the dataset. The steps are:

1. Download the dataset to the `data/` directory using: `curl -o data/names.txt https://raw.githubusercontent.com/karpathy/makemore/master/names.txt`
2. Open up the file and see what it looks like. It's a list of names, one per line.
3. Generate a random split of the data into training, validation, and test sets. We'll do 1000 names each for validation and test, and the rest for training.
4. Write each dataset as a separate file in the `data/` directory.


In [3]:
!curl -o data/names.txt https://raw.githubusercontent.com/karpathy/makemore/master/names.txt



In [4]:
import random

# 2. Read in all the names (N=32,032 names in total)
# Paths are relative to the notebook, matching the download cell above
with open("data/names.txt", "r") as f:
    names = f.readlines()
print("First 10 names:")
for name in names[:10]:
    print(name.strip())

# 3. Get a permutation to split the names into test, val, and train sets
random.seed(42)  # fix seed for reproducibility
ix = list(range(len(names)))
random.shuffle(ix)

# (Test, Validation, Training) = (1000, 1000, N-2000)
test_names = [names[i] for i in ix[:1000]]
val_names = [names[i] for i in ix[1000:2000]]
train_names = [names[i] for i in ix[2000:]]


# 4. Write list of names to separate files
def write_names(names, filename):
    # Each name already ends with "\n", so it is written unchanged
    with open(filename, "w") as f:
        for name in names:
            f.write(name)


write_names(test_names, "data/test.txt")
write_names(val_names, "data/val.txt")
write_names(train_names, "data/train.txt")


First 10 names:
emma
olivia
ava
isabella
sophia
charlotte
mia
amelia
harper
evelyn


## Tokenization

Now, we have a list of names. We need to convert them into a list of tokens (in this case, characters), so that we can train a character-level language model. More details about tokenization can be found in the [tokenization video on YouTube](https://www.youtube.com/watch?v=zduSFxRajkE).


In [5]:
train_text = open("data/train.txt", "r").read()

# Check that the text is as expected (a-z and a newline character)
assert all(c == "\n" or ("a" <= c <= "z") for c in train_text)

In [6]:
# Unique characters we see in the input
uchars = sorted(list(set(train_text)))
vocab_size = len(uchars)
char_to_token = {c: i for i, c in enumerate(uchars)}
token_to_char = {i: c for i, c in enumerate(uchars)}
# Designate \n as the delimiting <|endoftext|> token
EOT_TOKEN = char_to_token["\n"]

print(f"Number of unique characters: {vocab_size}")  # This should be 27
print(f"A -> : {char_to_token['a']}")
print(f"L -> : {char_to_token['l']}")
print(f"Z -> : {char_to_token['z']}")


Number of unique characters: 27
A -> : 1
L -> : 12
Z -> : 26


In [7]:
# Pre-tokenize all the splits one time up here
test_tokens = [char_to_token[c] for c in open("data/test.txt", "r").read()]
val_tokens = [char_to_token[c] for c in open("data/val.txt", "r").read()]
train_tokens = [char_to_token[c] for c in open("data/train.txt", "r").read()]


In [8]:
# Look at the first name
train_names[0]


'rayvon\n'

In [9]:
# Look at the first name in tokenized form
print([char_to_token[c] for c in train_names[0]])

# Look at the first few tokens of the training set
print(train_tokens[:10])


[18, 1, 25, 22, 15, 14, 0]
[18, 1, 25, 22, 15, 14, 0, 20, 1, 23]


As we can see above, the first name `rayvon\n` is tokenized into `['r', 'a', 'y', 'v', 'o', 'n', '\n']`, and then converted into numerical indices. Here, `0` represents the newline token `\n`.
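
To make the mapping concrete, here is a minimal round-trip sketch. The helpers `encode` and `decode` are hypothetical names that are not part of this notebook, and the vocabulary is rebuilt by hand under the assumption (checked by the assertion earlier) that it is `\n` plus `a` to `z`.

```python
import string

# Assumed vocabulary: newline plus lowercase a-z, sorted (newline sorts first)
uchars = sorted(set("\n" + string.ascii_lowercase))
char_to_token = {c: i for i, c in enumerate(uchars)}
token_to_char = {i: c for i, c in enumerate(uchars)}


def encode(text):
    """Hypothetical helper: map each character to its integer token."""
    return [char_to_token[c] for c in text]


def decode(tokens):
    """Hypothetical helper: map integer tokens back to a string."""
    return "".join(token_to_char[t] for t in tokens)


tokens = encode("rayvon\n")
print(tokens)
print(repr(decode(tokens)))
```

Decoding the encoded tokens should give back the original string, which is a cheap sanity check for any tokenizer.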


## Build the n-gram model

The central idea behind an n-gram model is to approximate the probability of a token `w_n` given only the previous `N-1` tokens (denoted by `w_{n-N+1:n-1}`, more commonly known as the "context"). Here, the lowercase `n` refers to the position of the token in the sequence, while the uppercase `N` refers to the length of the n-gram, that is, the context plus the token being predicted. For bigram models, `N=2`, for trigram models, `N=3`, and so on.

In mathematical terms, this is written as: $$ P(w_n | w_{n-N+1:n-1}) $$

The above probability is estimated using maximum likelihood estimation (MLE) as: $$ P(w_n | w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1} w_n)}{C(w_{n-N+1:n-1})} $$ where `C()` denotes the count of the given token sequence in the dataset.

A more mathematical description can be found in the [n-gram Language Model](https://web.stanford.edu/~jurafsky/slp3/3.pdf) chapter of the book "Speech and Language Processing" by Dan Jurafsky & James H. Martin.
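
As a small worked example of the MLE formula, the sketch below counts trigrams in a made-up two-word corpus (not the names dataset) and divides by the context count. The function name `mle_prob` is an assumption for illustration only.

```python
from collections import Counter

text = "banana\nbandana\n"  # toy corpus, not the names dataset
N = 3

# C(context + next): count every window of N characters
ngram_counts = Counter(text[i : i + N] for i in range(len(text) - N + 1))

# C(context): sum the n-gram counts over all possible next characters
context_counts = Counter()
for ngram, c in ngram_counts.items():
    context_counts[ngram[:-1]] += c


def mle_prob(context, next_char):
    """P(next_char | context) = C(context next_char) / C(context)."""
    return ngram_counts[context + next_char] / context_counts[context]


print(mle_prob("an", "a"))  # "an" is followed by "a" 3 times out of 4
print(mle_prob("an", "d"))  # "an" is followed by "d" 1 time out of 4
```

The context count is obtained by summing over the next character, which is the same normalization the count array in this notebook uses later.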

Let's build one such model on the training data with `N=3`.


### Step-by-step walkthrough of training


In [10]:
SEQ_LEN = 3
counts = np.zeros((vocab_size,) * SEQ_LEN, dtype=np.int32)

print(counts.shape)

(27, 27, 27)


In [11]:
first_name = [char_to_token[c] for c in train_names[0]]
first_name

[18, 1, 25, 22, 15, 14, 0]

Since we have taken `N=3`, we will be building a trigram model. The model is the 3-dimensional `counts` array created above: the first two indices select the context (a pair of tokens), and the last index selects the next token. Each entry holds the number of times that trigram occurs in the dataset.


In [12]:
context = first_name[:SEQ_LEN]

print(f"Context: {context}")

counts[tuple(context)] += 1

print(
    f"Counts for context {[token_to_char[t] for t in context[:2]]}: {counts[tuple(context[:2])]}"
)

Context: [18, 1, 25]
Counts for context ['r', 'a']: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]


Above, we see that the counts for the context ('r a') are updated to reflect one occurrence of the token 'y' (the second-to-last token in the vocabulary). Note that the variable `context` in the code holds the full trigram, and its first two tokens are the actual context. In other words, we are saying:
$$ C(r a y) = 1 $$

This is the basic idea. To build a model, we slide a window of length `N` over the entire training dataset, one token at a time, and increment the count for every window. Before we move on, let's see one more example.


In [13]:
context = first_name[1 : SEQ_LEN + 1]

print(f"Context: {context}")

counts[tuple(context)] += 1

print(
    f"Counts for context {[token_to_char[t] for t in context[:2]]}: {counts[tuple(context[:2])]}"
)


Context: [1, 25, 22]
Counts for context ['a', 'y']: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]


Now, the count for the token 'v', given the context ('a y'), is increased by 1. In other words, we are saying:
$$ C(a y v) = 1 $$

I hope this gives you a good intuition about how the counts are updated. Let's now build the model using the training data.


### Training on the dataset


In [14]:
# Utility function to iterate tokens with a fixed-sized window
def dataloader(tokens, window_size):
    for i in range(len(tokens) - window_size + 1):
        yield tokens[i : i + window_size]


In [15]:
# Quick example:
example_dataloader = dataloader(train_tokens, 3)
print(1, next(example_dataloader))
print(2, next(example_dataloader))
print(3, next(example_dataloader))


1 [18, 1, 25]
2 [1, 25, 22]
3 [25, 22, 15]


In [16]:
counts = np.zeros((vocab_size,) * SEQ_LEN, dtype=np.int32)

# This will iterate over all trigrams in the training set and update the counts
for tape in dataloader(train_tokens, SEQ_LEN):
    counts[tuple(tape)] += 1
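
The Python loop above is easy to follow but visits one window at a time. As an optional alternative, the same counts can be accumulated in a single NumPy call. The sketch below uses a made-up token array (`toy_tokens`) and a toy vocabulary of 4 tokens so that it runs on its own. It relies on `np.add.at` rather than `counts[idx] += 1`, because the latter increments a repeated index only once.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

vocab = 4  # toy vocabulary size (assumption for this sketch)
seq_len = 3
toy_tokens = np.array([1, 2, 3, 1, 2, 3, 0, 1, 2, 0])

# Reference: the loop-based counting used in the notebook
loop_counts = np.zeros((vocab,) * seq_len, dtype=np.int32)
for i in range(len(toy_tokens) - seq_len + 1):
    loop_counts[tuple(toy_tokens[i : i + seq_len])] += 1

# Vectorized: one row per window, then an unbuffered scatter-add
windows = sliding_window_view(toy_tokens, seq_len)  # shape (8, 3)
vec_counts = np.zeros((vocab,) * seq_len, dtype=np.int32)
np.add.at(vec_counts, tuple(windows.T), 1)

print(np.array_equal(loop_counts, vec_counts))
print(vec_counts[1, 2])  # counts of each next token after the context (1, 2)
```

For a dataset of this size the loop is fast enough, so this is a matter of taste rather than necessity.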


### Generating the next token using the model

Now that we have the trained model (an array of counts), we can generate the next token given a context. This is done by calculating the probability of the occurrence of each token given the context, and then sampling from this distribution to get the next token.

Let's look at an example, with context = `'a y'`.


In [17]:
inference_context_char = "ay"
inference_context = [char_to_token[c] for c in inference_context_char]
print("Inference context:", inference_context)

inference_counts = counts[tuple(inference_context)].astype(np.float32)

# Add smoothing ("fake counts") to all counts
inference_counts += 1

print(f"Counts for context ({inference_context_char}): {inference_counts}\n")

inference_counts_sum = inference_counts.sum()
probs = inference_counts / inference_counts_sum

print(f"Probs for context ({inference_context_char}): {probs}\n")
print(f"Most likely next character: {token_to_char[probs.argmax()]}")

Inference context: [1, 25]
Counts for context (ay): [158. 361.  14.  50. 172.  74.   4.  11.  10.  36.   9.  20. 451.  55.
  93.  47.   1.   5.  42. 129.  61.  20.  78.   2.   1.  16.  36.]

Probs for context (ay): [0.08077709 0.18456033 0.00715746 0.02556237 0.08793456 0.03783231
 0.00204499 0.00562372 0.00511247 0.01840491 0.00460123 0.01022495
 0.2305726  0.02811861 0.04754601 0.02402863 0.00051125 0.00255624
 0.02147239 0.06595092 0.03118609 0.01022495 0.0398773  0.00102249
 0.00051125 0.00817996 0.01840491]

Most likely next character: l


Let us look at another example, with context = `'k a'`.


In [18]:
inference_context_char = "ka"
inference_context = [char_to_token[c] for c in inference_context_char]
print("Inference context:", inference_context)

inference_counts = counts[tuple(inference_context)].astype(np.float32)

# Add smoothing ("fake counts") to all counts
inference_counts += 1

print(f"Counts for context ({inference_context_char}): {inference_counts}\n")

inference_counts_sum = inference_counts.sum()
probs = inference_counts / inference_counts_sum

print(f"Probs for context ({inference_context_char}): {probs}\n")
print(f"Most likely next character: {token_to_char[probs.argmax()]}")

Inference context: [11, 1]
Counts for context (ka): [199.  10.   4.  21.  40.  61.   1.   4.  52. 210.   6.   1. 164. 144.
  70.   5.   6.   1. 205. 106.  91.   7.  20.  10.   2. 195.  19.]

Probs for context (ka): [0.12031439 0.00604595 0.00241838 0.01269649 0.0241838  0.03688029
 0.00060459 0.00241838 0.03143894 0.12696493 0.00362757 0.00060459
 0.09915357 0.08706167 0.04232164 0.00302297 0.00362757 0.00060459
 0.12394196 0.06408706 0.05501814 0.00423216 0.0120919  0.00604595
 0.00120919 0.11789601 0.0114873 ]

Most likely next character: i


So, given a context of `'ay'`, the model predicts `'l'` as the most likely next token, while for the context `'ka'`, the model predicts `'i'` as the most likely next token. Let's see an example of generating a sequence of tokens using the model.


In [19]:
num_tokens = 20
inference_context_char = "ka"

current_context = [char_to_token[c] for c in inference_context_char]
generated_tokens = current_context.copy()

for _ in range(num_tokens):
    inference_counts = counts[tuple(current_context)].astype(np.float32)
    inference_counts += 1  # Add smoothing
    probs = inference_counts / inference_counts.sum()

    next_token = rng.choice(len(probs), p=probs)
    generated_tokens.append(next_token)

    current_context = current_context[1:] + [
        next_token
    ]  # Update the context window for the next iteration

generated_text = "".join(token_to_char[token] for token in generated_tokens)
print("Generated text:")
print(generated_text)

Generated text:
kayaryoce
kr
die
in
ta


The above block of code generates a sequence of tokens given an input context. The context is used to predict the next token, which is then appended to the context. The last `N-1` tokens (in this case, 2) are used as the context for the next prediction. This process is repeated for `num_tokens` iterations to generate a sequence of tokens.
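
The loop above samples a fixed number of tokens, so several names run together across newlines. One possible variation, sketched here under toy assumptions, stops as soon as the end-of-text token is sampled. The three-character vocabulary, the tiny corpus, and the helper name `generate_name` are all made up for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
chars = ["\n", "a", "b"]  # toy vocabulary; token 0 plays the role of EOT_TOKEN
EOT = 0

# Toy trigram counts built from a tiny made-up corpus
corpus = [chars.index(c) for c in "\n\nab\n\n\nba\n\n\nabba\n"]
counts = np.zeros((3, 3, 3), dtype=np.int32)
for i in range(len(corpus) - 2):
    counts[tuple(corpus[i : i + 3])] += 1


def generate_name(counts, context, max_tokens=20, smoothing=1.0):
    """Hypothetical helper: sample tokens until EOT or max_tokens is reached."""
    context = list(context)
    out = []
    for _ in range(max_tokens):
        probs = counts[tuple(context)].astype(np.float64) + smoothing
        probs /= probs.sum()
        next_token = int(rng.choice(len(probs), p=probs))
        if next_token == EOT:
            break
        out.append(next_token)
        context = context[1:] + [next_token]  # slide the context window
    return "".join(chars[t] for t in out)


print(repr(generate_name(counts, [EOT, EOT])))
```

Starting from a context of end-of-text tokens mimics the start of a new line, so each call produces at most one name.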

However, the model does not appear to generate meaningful names. How do we quantify how good the model is? We now turn to calculating a loss, and to choosing the hyperparameters that minimize it.


#### Sidenote: Smoothing

Smoothing involves adding a small value to all the counts in the model so that no n-gram has a count of 0. Without it, any n-gram that never appeared in the training data would get a probability of 0, whose logarithm is negative infinity, and a context that was never seen at all would lead to a division by zero. The most common smoothing technique is called Laplace smoothing, where we add 1 to all the counts. See Section 3.6 of the aforementioned book for more details.
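
To see the effect numerically, here is a minimal sketch of add-k smoothing on made-up counts for a single context. The function name `smoothed_probs` and the five-token vocabulary are assumptions for illustration.

```python
import numpy as np

# Made-up counts for one context over a 5-token vocabulary
counts = np.array([3, 0, 1, 0, 0], dtype=np.float64)


def smoothed_probs(counts, k):
    """Add-k smoothing: (count + k) / (total + k * vocab_size)."""
    return (counts + k) / (counts.sum() + k * len(counts))


for k in [0.0, 0.1, 1.0]:
    print(k, smoothed_probs(counts, k))
```

With `k=0`, the unseen tokens keep a probability of exactly 0. Larger values of `k` move probability mass from the observed tokens to the unseen ones, which is why the amount of smoothing is worth tuning.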

_Back to understanding n-grams..._


### Improving the model


#### Training, validation, and testing datasets

It is highly recommended to understand the difference between training, validation and testing datasets. This is a topic that will be encountered in every machine learning project. A good resource is Section 3.2 of the book mentioned above, which gives an in-depth explanation.

Briefly, training data is used to train the model, validation data is used to tune hyperparameters, and test data is used to evaluate the final model's performance. It is important to keep a separate validation set, because using the test data to train the model or to tune hyperparameters makes the test loss an overly optimistic estimate of how the model performs on unseen data.


#### Loss function

Since we are building a language model to predict the next token, we want to maximize the probability of the next token given the context. In other words, if we know that the token after `'a y'` is `'v'`, we want the model to predict `'v'` with high probability.

Therefore, the objective is to maximize the likelihood of the next token given the context, OR equivalently, minimize the negative log likelihood of the next token given the context. This is the loss function that we will use to evaluate the performance of the model.
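
As a quick numeric illustration of this loss, the sketch below averages the negative log likelihood over four made-up probabilities and compares it with a uniform-guessing baseline. The probabilities in `target_probs` are invented for the example and do not come from the model in this notebook.

```python
import numpy as np

vocab_size = 27

# Probabilities a hypothetical model assigned to the correct next token
# at four positions (made-up numbers for illustration)
target_probs = np.array([0.5, 0.25, 0.1, 0.05])

nll = -np.log(target_probs)  # per-position negative log likelihood
print(nll.mean())

# Baseline: a model that guesses uniformly over the vocabulary
uniform_loss = -np.log(1.0 / vocab_size)
print(uniform_loss)  # ln(27), roughly 3.30
```

A useful model should have a mean loss below the uniform baseline of `ln(27)`; the closer the loss is to 0, the more probability the model puts on the correct tokens.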


In [20]:
def evaluate_model(counts, tokens, SEQ_LEN):
    # Evaluate the given counts array on the given tokens

    sum_loss = 0.0
    count = 0

    for tape in dataloader(tokens, SEQ_LEN):
        x = tape[:-1]  # The context
        y = tape[-1]  # The actual target

        # Get the probabilities from the model for this context
        inference_counts = counts[tuple(x)].astype(np.float32)
        inference_counts += 1
        inference_counts_sum = inference_counts.sum()
        probs = inference_counts / inference_counts_sum

        # Get the probability for the actual target as predicted by the model
        prob = probs[y]

        # Add the negative log probability to the loss
        sum_loss += -np.log(prob)

        # Increment the count of how many contexts we've seen
        count += 1

    # Calculate the mean loss over all contexts seen
    mean_loss = sum_loss / count if count > 0 else 0.0
    return mean_loss

In [21]:
print(
    f"Mean training loss on model with `N=3`: {evaluate_model(counts, train_tokens, 3):.4f}"
)
print(
    f"Mean validation loss on model with `N=3`: {evaluate_model(counts, val_tokens, 3):.4f}"
)


Mean training loss on model with `N=3`: 2.2117
Mean validation loss on model with `N=3`: 2.2521


#### Hyperparameter tuning

With `N=3`, we get some specific loss values on the training and validation set. How do we test if this is the best value of `N`?

We can try different values of `N` and see which one gives the best performance on the validation set. This process is called **hyperparameter tuning.**

Let's see what happens with `N=4`.


In [22]:
SEQ_LEN = 4
counts = np.zeros((vocab_size,) * SEQ_LEN, dtype=np.int32)

# This will iterate over all 4-grams in the training set and update the counts
for tape in dataloader(train_tokens, SEQ_LEN):
    counts[tuple(tape)] += 1


In [23]:
print(f'Counts for context "ray": {counts[tuple([char_to_token[c] for c in "ray"])]}')

Counts for context "ray": [18 47  0 13 21  7  0  3  2  2  0  1 59  6 16  3  0  1  0 20  7  1  9  0
  0  3  1]


In [24]:
num_tokens = 20
inference_context_char = "kar"

current_context = [char_to_token[c] for c in inference_context_char]
generated_tokens = current_context.copy()

for _ in range(num_tokens):
    inference_counts = counts[tuple(current_context)].astype(np.float32)
    inference_counts += 1  # Add smoothing
    probs = inference_counts / inference_counts.sum()

    next_token = rng.choice(len(probs), p=probs)
    generated_tokens.append(next_token)

    current_context = current_context[1:] + [
        next_token
    ]  # Update the context window for the next iteration

generated_text = "".join(token_to_char[token] for token in generated_tokens)
print("Generated text:")
print(generated_text)

Generated text:
karsh
sopbpztebzszucccp


In [25]:
print(
    f"Mean training loss on model with `N=4`: {evaluate_model(counts, train_tokens, 4):.4f}"
)
print(
    f"Mean validation loss on model with `N=4`: {evaluate_model(counts, val_tokens, 4):.4f}"
)


Mean training loss on model with `N=4`: 2.1006
Mean validation loss on model with `N=4`: 2.2114


We see that the mean validation loss reduced from 2.25 (`N=3`) to 2.21 (`N=4`). Similarly, we could try different values of `N` to see which one gives the best performance on the validation set. Moreover, we can treat `smoothing` as a hyperparameter and tune it as well to get the best performance on the validation set.

### Grid-search for hyperparameter tuning

To perform grid-search efficiently, let's rewrite our model code in an object-oriented manner. This will allow us to easily change the hyperparameters and train the model with different values of `N` and `smoothing`.

In [26]:
class NgramLanguageModel:
    def __init__(self, vocab_size: int, seq_len: int, smoothing: float = 0.0):
        self.seq_len = seq_len
        self.vocab_size = vocab_size
        self.smoothing = smoothing

        # The same n-dimensional array of counts as before
        self.counts = np.zeros((vocab_size,) * seq_len, dtype=np.uint32)

    def train(self, tape: list):
        assert len(tape) == self.seq_len

        # Increment the count for this context
        self.counts[tuple(tape)] += 1

    def get_counts(self, tape: list):
        assert len(tape) == self.seq_len - 1

        # Get the counts for this context
        return self.counts[tuple(tape)]

    def __call__(self, tape: list):
        assert len(tape) == self.seq_len - 1

        # Get the counts, apply smoothing, and normalize to get the probabilities
        counts = self.counts[tuple(tape)].astype(np.float32)

        # Add smoothing ("fake counts") to all counts
        counts += self.smoothing

        counts_sum = counts.sum()

        probs = counts / counts_sum

        return probs

In [27]:
def evaluate_model(model: NgramLanguageModel, tokens: list):
    # Evaluate the given model on the given tokens

    sum_loss = 0.0
    count = 0

    for tape in dataloader(tokens, model.seq_len):
        x = tape[:-1]  # The context
        y = tape[-1]  # The actual target

        # Get the probabilities from the model for this context
        probs = model(x)

        # Get the probability for the actual target as predicted by the model
        prob = probs[y]

        # Add the negative log probability to the loss
        sum_loss += -np.log(prob)

        # Increment the count of how many contexts we've seen
        count += 1

    # Calculate the mean loss over all contexts seen
    mean_loss = sum_loss / count if count > 0 else 0.0
    return mean_loss

In [28]:
seq_lens = [3, 4, 5, 6]
smoothings = [0.03, 0.1, 0.3, 1.0]
best_loss = float("inf")
best_hyperparams = {}

# Now, we'll iterate over all hyperparameters, train a model, evaluate it, and keep track of the best one

for seq_len, smoothing in itertools.product(seq_lens, smoothings):
    model = NgramLanguageModel(vocab_size, seq_len, smoothing)

    for tape in dataloader(train_tokens, seq_len):
        model.train(tape)

    # Calculate the training and validation loss
    train_loss = evaluate_model(model, train_tokens)
    val_loss = evaluate_model(model, val_tokens)

    print(
        f"{'Seq Len':<8}: {seq_len} | {'Smoothing':<10}: {smoothing} | {'Train Loss':<10}: {train_loss:.4f} | {'Val Loss':<10}: {val_loss:.4f}"
    )

    if val_loss < best_loss:
        best_loss = val_loss
        best_hyperparams = {"seq_len": seq_len, "smoothing": smoothing}

print(f"Best hyperparameters: {best_hyperparams}")


Seq Len : 3 | Smoothing : 0.03 | Train Loss: 2.1843 | Val Loss  : 2.2443
Seq Len : 3 | Smoothing : 0.1 | Train Loss: 2.1870 | Val Loss  : 2.2401
Seq Len : 3 | Smoothing : 0.3 | Train Loss: 2.1935 | Val Loss  : 2.2404
Seq Len : 3 | Smoothing : 1.0 | Train Loss: 2.2117 | Val Loss  : 2.2521
Seq Len : 4 | Smoothing : 0.03 | Train Loss: 1.8703 | Val Loss  : 2.1376
Seq Len : 4 | Smoothing : 0.1 | Train Loss: 1.9028 | Val Loss  : 2.1118
Seq Len : 4 | Smoothing : 0.3 | Train Loss: 1.9677 | Val Loss  : 2.1269
Seq Len : 4 | Smoothing : 1.0 | Train Loss: 2.1006 | Val Loss  : 2.2114
Seq Len : 5 | Smoothing : 0.03 | Train Loss: 1.4955 | Val Loss  : 2.3540
Seq Len : 5 | Smoothing : 0.1 | Train Loss: 1.6335 | Val Loss  : 2.2814
Seq Len : 5 | Smoothing : 0.3 | Train Loss: 1.8610 | Val Loss  : 2.3210
Seq Len : 5 | Smoothing : 1.0 | Train Loss: 2.2132 | Val Loss  : 2.4903
Seq Len : 6 | Smoothing : 0.03 | Train Loss: 1.1155 | Val Loss  : 2.7843
Seq Len : 6 | Smoothing : 0.1 | Train Loss: 1.4304 | Val Los

After the grid search, we observe that the best hyperparameters are `N=4` and `smoothing=0.1`, which give the lowest validation loss (2.1118). We can now train a model with these hyperparameters on the training dataset and evaluate it on the test dataset.

In [29]:
best_seq_len = best_hyperparams["seq_len"]
best_smoothing = best_hyperparams["smoothing"]

best_model = NgramLanguageModel(vocab_size, best_seq_len, best_smoothing)

for tape in dataloader(train_tokens, best_seq_len):
    best_model.train(tape)

In [30]:
num_tokens = 100
# Start with empty lines
current_context = [EOT_TOKEN] * (best_seq_len - 1)
generated_tokens = current_context.copy()

for _ in range(num_tokens):
    probs = best_model(current_context)

    next_token = rng.choice(len(probs), p=probs)

    generated_tokens.append(next_token)

    current_context = current_context[1:] + [
        next_token
    ]  # Update the context window for the next iteration

    print(token_to_char[next_token], end="")


rtjixbghulys
ikah
aver
malee
claraylen
nes
reece
will
brayshaylie
emerela
niticwpbqzivaakiyari
alis


As we can see, some of the generated names are sensible, but most are not. Let's look at the loss on the test set, as well as the perplexity of the model.

In [31]:
test_loss = evaluate_model(best_model, test_tokens)
print(f"\nTest loss: {test_loss:.4f}")


Test loss: 2.1064


In [32]:
perplexity = np.exp(test_loss)
print(f"Perplexity: {perplexity:.4f}")

Perplexity: 8.2184
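
Perplexity is the exponential of the mean negative log likelihood, and it can be read as the effective number of equally likely choices the model faces at each step. The sketch below recomputes it from the rounded test loss printed above (so the last digits may differ slightly) and compares it with a uniform model over the 27-token vocabulary.

```python
import math

vocab_size = 27
test_loss = 2.1064  # rounded value printed above

perplexity = math.exp(test_loss)
print(f"{perplexity:.2f}")

# A uniform model has loss ln(vocab_size), so its perplexity is vocab_size
uniform_perplexity = math.exp(math.log(vocab_size))
print(f"{uniform_perplexity:.2f}")
```

In other words, the model is about as uncertain as if it were choosing among roughly 8 equally likely characters at each step, compared with 27 for uniform guessing.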


## Conclusion

With that, we have successfully built a small n-gram language model on a dataset of names. We saw in detail how the training process works, how to generate the next token given a context, and how to evaluate the model using the loss and improve it using hyperparameter tuning. 

The associated `ngram.py` file contains the complete code for the model, with fixed randomness for reproducibility as well as saving the model to use later.