## Intro

In this Notebook I will train and evaluate two different Trigram models.
More precisely, I will use two different approaches to build a Trigram model.
Approach 1 is to build a matrix that holds the empirical (count-based) probability distribution of which character 
comes next, given two particular preceding characters.
We then sample from this distribution to produce the next character in the output 'name'.

The second approach is to use a Neural Net. It takes in the two preceding characters and generates the next one, 
and so on. The goal is to reach the same loss with the Neural Net approach as with approach 1.
As you will see, approach 1 is optimal for the loss function I will use. 
On the training data, a Neural Net that sees the same two preceding characters cannot achieve a lower loss.
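
To make the optimality claim concrete, here is a minimal sketch. It assumes the loss is the average negative log-likelihood (the notebook has not named the loss at this point), and the counts and the comparison distribution `p_other` are made up for illustration. For a single context, the normalized counts give a lower loss than another distribution over the same three characters.

```python
import torch

# Made-up next-character counts for a single two-character context.
counts = torch.tensor([6.0, 3.0, 1.0])

# Approach 1: normalize the counts into a probability distribution.
p_counts = counts / counts.sum()

def avg_nll(p):
    # Average negative log-likelihood of the observed counts under distribution p.
    return -(counts * torch.log(p)).sum() / counts.sum()

# An arbitrary alternative distribution, standing in for an imperfect model.
p_other = torch.tensor([0.5, 0.4, 0.1])

print(avg_nll(p_counts).item(), avg_nll(p_other).item())
```

The first printed value is the smaller of the two. Any distribution that differs from the normalized counts gives a higher average negative log-likelihood on the data the counts came from.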

In [12]:
import torch

## First approach

We begin by loading in all the names from the text file:

In [9]:
words = open('names_dataset.txt', 'r').read().splitlines()

A peek at some names:

In [11]:
words[0:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

Next we initialize the matrix that will hold the counts of next characters.
Note that we need 729 rows: with 27 characters, the number of ordered pairs
of characters following each other is 27^2 = 27*27 = 729.

In [17]:
# Every row in this matrix will eventually hold the probability distribution of the next character, given two particular preceding characters
N = torch.zeros((729, 27), dtype=torch.int32)
N.shape

torch.Size([729, 27])
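
One way to assign each ordered pair of characters to one of the 729 rows is to read the pair as a two-digit number in base 27. The sketch below shows this encoding. The helper names `pair_to_row` and `row_to_pair` are hypothetical and do not appear elsewhere in this notebook.

```python
VOCAB_SIZE = 27  # 26 letters plus one boundary marker

def pair_to_row(i1: int, i2: int) -> int:
    # Treat (i1, i2) as a two-digit base-27 number.
    return i1 * VOCAB_SIZE + i2

def row_to_pair(row: int) -> tuple:
    # Inverse of pair_to_row: recover the two character indices.
    return divmod(row, VOCAB_SIZE)

# Every ordered pair gets its own row.
rows = {pair_to_row(i, j) for i in range(VOCAB_SIZE) for j in range(VOCAB_SIZE)}
print(len(rows), min(rows), max(rows))  # 729 0 728
print(row_to_pair(pair_to_row(5, 13)))  # (5, 13)
```

Because the encoding is a bijection between pairs and the integers 0 to 728, no two contexts ever share a row.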

Creating two mappings between integers and characters so that every character can be represented by an integer in computations

In [19]:
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
stoi
itos

{1: 'a',
 2: 'b',
 3: 'c',
 4: 'd',
 5: 'e',
 6: 'f',
 7: 'g',
 8: 'h',
 9: 'i',
 10: 'j',
 11: 'k',
 12: 'l',
 13: 'm',
 14: 'n',
 15: 'o',
 16: 'p',
 17: 'q',
 18: 'r',
 19: 's',
 20: 't',
 21: 'u',
 22: 'v',
 23: 'w',
 24: 'x',
 25: 'y',
 26: 'z',
 0: '.'}

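Creating two mappings between pairs of characters and integers

In [None]:
# ptoi/itop are illustrative names: each ordered pair of characters maps to a row index of N (0..728) and back
alphabet = [itos[i] for i in range(27)]
ptoi = {(c1, c2): stoi[c1] * 27 + stoi[c2] for c1 in alphabet for c2 in alphabet}
itop = {i: p for p, i in ptoi.items()}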

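Time to populate our probability matrix with counts of next characters!
I will use '.' (index 0 in stoi) to signal the beginning and end of a string. It is treated as a character like any other.

In [None]:

for w in words:
    # Pad with two start markers and one end marker so that every character,
    # including the first, has a two-character context (the padding scheme is an assumption)
    chs = ['.', '.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        # The row index encodes the ordered pair (ch1, ch2) as a base-27 number
        row = stoi[ch1] * 27 + stoi[ch2]
        ix3 = stoi[ch3]
        N[row, ix3] += 1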
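
Once the counts are in place, each row can be normalized into a probability distribution and sampled from, as described in the intro. The sketch below is self-contained and rests on several assumptions: a toy word list stands in for the dataset, words are padded with two '.' markers, add-one smoothing keeps unseen contexts from having all-zero rows, and generated names are capped at 20 characters.

```python
import torch

# Toy stand-in for the names dataset so the sketch runs on its own.
toy_words = ['emma', 'olivia', 'ava', 'mia']
chars = sorted(set(''.join(toy_words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}
V = len(stoi)  # vocabulary size of the toy data

# Count next characters for every two-character context.
counts = torch.zeros((V * V, V), dtype=torch.int32)
for w in toy_words:
    chs = ['.', '.'] + list(w) + ['.']
    for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
        counts[stoi[c1] * V + stoi[c2], stoi[c3]] += 1

# Add-one smoothing (an assumption), then normalize each row to sum to 1.
P = (counts + 1).float()
P /= P.sum(dim=1, keepdim=True)

# Sample one name, starting from the context ('.', '.').
g = torch.Generator().manual_seed(0)
i1, i2 = 0, 0
out = []
for _ in range(20):  # cap the length so the loop always terminates
    ix = torch.multinomial(P[i1 * V + i2], num_samples=1, generator=g).item()
    if ix == 0:  # '.' marks the end of the name
        break
    out.append(itos[ix])
    i1, i2 = i2, ix
name = ''.join(out)
print(name)
```

With so little data, the smoothed rows are close to uniform and the sampled name is mostly noise. On the full dataset, the counts dominate the smoothing term.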