# Simple probabilistic bigram language model using torch

Using a dataset of 1000 boys' names, this notebook shows how a language model built on character bigrams can generate new random names.

In [1]:
import torch

### Load dataset

In [2]:
with open("../data/names.txt", "r") as f:
    names = f.read().splitlines()
    print(f"{len(names)} names loaded. Examples: {names[:3]}..")

1000 names loaded. Examples: ['Aarav', 'Aaron', 'Abdiel']..


### Create vocab

- Store all unique characters in a vocab.
    - Also add unique start and end tokens that mark where a name begins and ends.
- Keep capital letters (so we can later check whether the model learns to start names with a capital).
- Create dicts that map between characters and their indices in the vocab.

In [3]:
START_TOKEN, END_TOKEN = "(", ")"
vocab = [START_TOKEN] + sorted(list(set("".join(names)))) + [END_TOKEN]
print(f"Vocab of size {len(vocab)}")
# Create "string to index" and "index to string" dicts for lookup purposes
stoi = {s:i for i, s in enumerate(vocab)}
itos = {i:s for i, s in enumerate(vocab)}
print(f"Index of character 'x' = {stoi['x']}")
print(f"Character of index 26 = {itos[26]}")

Vocab of size 54
Index of character 'x' = 50
Character of index 26 = Z
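
The two lookup dicts make it easy to convert a whole name to indices and back. Below is a minimal sketch on a two-name toy list; the names `toy_names`, `toy_vocab`, `toy_stoi`, `toy_itos`, `encode` and `decode` are made up for illustration and are not part of the notebook.

```python
# Toy data, chosen only for illustration (not the real dataset)
toy_names = ["Adam", "Bo"]
START_TOKEN, END_TOKEN = "(", ")"

# Same recipe as above: start token + sorted unique chars + end token
toy_vocab = [START_TOKEN] + sorted(set("".join(toy_names))) + [END_TOKEN]
toy_stoi = {s: i for i, s in enumerate(toy_vocab)}
toy_itos = {i: s for i, s in enumerate(toy_vocab)}

def encode(name):
    """Hypothetical helper: name -> list of vocab indices, wrapped in start/end tokens."""
    return [toy_stoi[c] for c in START_TOKEN + name + END_TOKEN]

def decode(indices):
    """Hypothetical helper: list of vocab indices -> string."""
    return "".join(toy_itos[i] for i in indices)

encoded = encode("Adam")
print(toy_vocab)        # ['(', 'A', 'B', 'a', 'd', 'm', 'o', ')']
print(encoded)          # [0, 1, 4, 3, 5, 7]
print(decode(encoded))  # (Adam)
```

Capital letters sort before lowercase ones, which is why the capitals end up with the lowest indices after the start token.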


### Bigram Matrix

Based on the dataset, we want to store the frequency of every bigram. This is done with a square matrix that has one row and one column for each char in the vocab.

- Rows = first character, Columns = second character
    - Example: B[5, 10] = how many times the bigram (5, 10) has occurred in the dataset.

We can later use this matrix to compute the probability of one character following another.

In [4]:
B = torch.zeros((len(vocab), len(vocab)), dtype=torch.int32) # Bigram matrix
B

tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], dtype=torch.int32)

In [5]:
for name in names:
    name_chars = [START_TOKEN] + list(name) + [END_TOKEN] # E.g. ['(', 'A', 'd', 'a', 'm', ')']
    for c1, c2 in zip(name_chars, name_chars[1:]): # E.g. c1='A', c2='d'
        B[stoi[c1], stoi[c2]] += 1 # Add count to (c1, c2) bigram

Print some example counts from the filled bigram matrix:

In [6]:
bigram_samples = ["ck", "ry", "kr"]
for sample in bigram_samples:
    print(f"The bigram {sample} occurred {B[stoi[sample[0]], stoi[sample[1]]].item()} times in dataset")

The bigram ck occurred 17 times in dataset
The bigram ry occurred 19 times in dataset
The bigram kr occurred 0 times in dataset
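
To see the counting step without the tensor indexing, here is a small plain-Python sketch that counts bigrams with `collections.Counter` on a three-name toy list. It is an alternative illustration, not the notebook's method; `toy_names` and `bigram_counts` are hypothetical names.

```python
from collections import Counter

# Toy data, chosen only for illustration (not the real dataset)
toy_names = ["Adam", "Ada", "Dan"]
START_TOKEN, END_TOKEN = "(", ")"

bigram_counts = Counter()
for name in toy_names:
    chars = [START_TOKEN] + list(name) + [END_TOKEN]
    # zip pairs each char with its successor: ('(', 'A'), ('A', 'd'), ...
    bigram_counts.update(zip(chars, chars[1:]))

print(bigram_counts[("A", "d")])  # 2 ("Adam" and "Ada")
print(bigram_counts[("a", ")")])  # 1 (only "Ada" ends in 'a')
print(bigram_counts[("k", "r")])  # 0 (a Counter returns 0 for unseen keys)
```

The matrix `B` holds the same information, but in a dense layout that can be normalized row by row in a single tensor operation.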


### Probability matrix (based on bigram frequencies)

This is created by dividing each item by the sum of its row, so that every row becomes a probability distribution over the next character.

For example, if the row for char 'x' contains the frequencies `[0, 2, 1, 3]`, then the sum of the row is:
```
0+2+1+3 = 6
```
We can then compute the probability row as:
```
[0/6, 2/6, 1/6, 3/6] = [0, 0.33, 0.17, 0.5]
```
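
The same arithmetic as a small plain-Python sketch. The second half adds 1 to every count before normalizing, which mirrors the `B+1` step used further down; the variable names are chosen here for illustration only.

```python
counts = [0, 2, 1, 3]  # the example row from above

# Plain normalization: divide every count by the row sum
total = sum(counts)
probs = [c / total for c in counts]
print([round(p, 2) for p in probs])  # [0.0, 0.33, 0.17, 0.5]

# Add-one smoothing: no entry stays at exactly zero
smoothed = [c + 1 for c in counts]   # [1, 3, 2, 4]
smoothed_total = sum(smoothed)       # 10
smoothed_probs = [c / smoothed_total for c in smoothed]
print(smoothed_probs)                # [0.1, 0.3, 0.2, 0.4]
```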

Example of shapes and sum in torch:

In [7]:
E = torch.tensor([[1, 2, 3], [3, 4, 5]])
print(E)
print(f"E has shape {E.shape}")
print(f"First row = {E[0, :]}\nFirst column = {E[:, 0]}")
print(f"Sum on axis 0: {E.sum(0, keepdim=True)}\nSum on axis 1: {E.sum(1, keepdim=True)}")
print(f"Sum on axis 0 shape: {E.sum(0, keepdim=False).shape} (keepdim=False)")
print(f"Sum on axis 0 shape: {E.sum(0, keepdim=True).shape} (keepdim=True)")
print(f"Sum on axis 1 shape: {E.sum(1, keepdim=False).shape} (keepdim=False)")
print(f"Sum on axis 1 shape: {E.sum(1, keepdim=True).shape} (keepdim=True)")

tensor([[1, 2, 3],
        [3, 4, 5]])
E has shape torch.Size([2, 3])
First row = tensor([1, 2, 3])
First column = tensor([1, 3])
Sum on axis 0: tensor([[4, 6, 8]])
Sum on axis 1: tensor([[ 6],
        [12]])
Sum on axis 0 shape: torch.Size([3]) (keepdim=False)
Sum on axis 0 shape: torch.Size([1, 3]) (keepdim=True)
Sum on axis 1 shape: torch.Size([2]) (keepdim=False)
Sum on axis 1 shape: torch.Size([2, 1]) (keepdim=True)


Compute the probability matrix and print some samples:

In [8]:
P = (B+1).float() # add-one smoothing: unseen bigrams keep a non-zero probability (log(0) would be -inf)
P /= P.sum(1, keepdim=True) # divide by column vector containing sum of each row. 54x54 / 54x1
print(f"P shape: {P.shape}")
print(f"Probability of 'c' followed by 'k' = {P[stoi['c'], stoi['k']].item()}")
print((
    f"Row 5 contains probabilities of all chars following char {itos[5]}."
    f" The sum of the row is (should be 1) = {P[5].sum().item()}."
    f"\nThe most probable char following {itos[5]} is {itos[P[5].argmax().item()]}"
))

P shape: torch.Size([54, 54])
Probability of 'c' followed by 'k' = 0.11042945086956024
Row 5 contains probabilities of all chars following char E. The sum of the row is (should be 1) = 1.0000001192092896.
The most probable char following E is l
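
Keeping the summed dimension matters here because the matrix is square. A minimal sketch on a 2x2 toy tensor (the names `counts`, `good` and `bad` are illustrative only) shows that dropping the dimension still runs without error, but broadcasts the row sums across columns and produces rows that no longer sum to 1.

```python
import torch

# Toy counts with different row sums (4 and 2)
counts = torch.tensor([[1.0, 3.0],
                       [1.0, 1.0]])

# Row sums kept as a column vector: (2, 2) / (2, 1) divides each row by its own sum
good = counts / counts.sum(1, keepdim=True)

# Row sums as a flat vector: (2, 2) / (2,) is broadcast as (2, 2) / (1, 2),
# so column j is divided by the sum of row j, which is not what we want
bad = counts / counts.sum(1)

print(good.sum(1))  # every row sums to 1
print(bad.sum(1))   # rows sum to 1.75 and 0.75
```

Checking that a few rows of `P` sum to 1, as done above for row 5, is a cheap way to catch this kind of silent broadcasting mistake.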


### Generate random names

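The loop works like this:
- Starting from the start token, sample a 'probable' next char
- Keep sampling the next char based on the previously sampled char, until the end token is sampled

Expect terrible names, because the model only sees one character of context when generating. It should, however, be able to start the names with a capital letter when it sees a START_TOKEN. Note that the add-one smoothing gives every transition a non-zero probability, so an odd character such as the START_TOKEN can occasionally show up in the middle of a name.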

In [9]:
num = 10
for i in range(num):  
    name = []
    c1 = stoi[START_TOKEN]
    while True:
        c2_probs = P[c1] # probability vector of chars following c1
        # get c2 by multinomial sampling on the probability vector
        c2 = torch.multinomial(c2_probs, num_samples=1, replacement=True).item()
        if c2 == stoi[END_TOKEN]: break
        name.append(itos[c2])
        c1 = c2 # c2 is first char in next iteration
    print(f"How are you doing today Mr. {''.join(name)}")

How are you doing today Mr. DoDLMard
How are you doing today Mr. Grenenc
How are you doing today Mr. Omnthericknn
How are you doing today Mr. MsedJohyliamoto
How are you doing today Mr. Jacondesen
How are you doing today Mr. Caiejallele
How are you doing today Mr. ChmmiwwZg(ACoreso
How are you doing today Mr. AzyamastVmen
How are you doing today Mr. ChandukArer
How are you doing today Mr. Ky
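
The sampling loop can also be isolated into a small helper. The sketch below uses plain Python with `random.choices` instead of `torch.multinomial`, and a hand-written toy transition table instead of `P`. The function `sample_name`, the table `toy_probs` and the `max_len` safety cap are all assumptions made for this illustration and are not part of the notebook.

```python
import random

def sample_name(probs, start="(", end=")", rng=None, max_len=50):
    """Hypothetical helper: walk the bigram table from start until end is sampled.

    probs maps each char to a dict {next_char: probability}.
    max_len is a safety cap added for this sketch only.
    """
    rng = rng or random.Random()
    name = []
    current = start
    for _ in range(max_len):
        next_chars = list(probs[current].keys())
        weights = list(probs[current].values())
        # Draw one next char, weighted by its probability
        nxt = rng.choices(next_chars, weights=weights, k=1)[0]
        if nxt == end:
            break
        name.append(nxt)
        current = nxt  # the sampled char becomes the context for the next step
    return "".join(name)

# Toy transition table, made up for illustration; each row sums to 1
toy_probs = {
    "(": {"A": 0.5, "B": 0.5},
    "A": {"l": 0.7, ")": 0.3},
    "B": {"o": 1.0},
    "l": {"a": 0.5, ")": 0.5},
    "o": {"b": 0.5, ")": 0.5},
    "a": {")": 1.0},
    "b": {")": 1.0},
}

rng = random.Random(0)  # seeded so repeated runs give the same samples
samples = [sample_name(toy_probs, rng=rng) for _ in range(5)]
print(samples)  # each entry is one of: A, Al, Ala, Bo, Bob
```

Since the toy table assigns zero probability to everything it does not list, it can never produce a start token mid-name, unlike the smoothed matrix `P`.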
