A Bigram model is a language model in which we predict the probability of the correctness of a sequence of words by just predicting the occurrence of the word “a” after the word “b”.

For eg: Let us take the sentence “I have a car”

probability of the sequence of these words will be

p(I) * p(have| I) * p(a | have, I) ...

In [14]:
import torch

# Obtain the characters
with open("../datasets/the_wizard_of_oz.txt", "r", encoding="utf-8") as file:
    text = file.read()
    text = set(text)
    
# (Character level) Encoder:
string_to_int = { ch:i for i, ch in enumerate(text) }
encode = lambda s: [string_to_int[c] for c in s]
# (Character level) Decoder:
int_to_string = { i:ch for i, ch in enumerate(text) }
decode = lambda l: ''.join([int_to_string[i] for i in l])

# Small vocabulary, but big amount of samples
print(encode("test!"))
print(decode([29, 66, 60, 29, 11]))

[29, 66, 60, 29, 11]
test!


In [15]:
# More optimized way to work with the data
data = torch.tensor(encode(text), dtype=torch.long)

# Hyperparameters: block_size is the length of the sequence that we are going to use to predict the next character
#                  batch_size is the number of sequences that we are going to parallelly process
block_size = int(data.size(0))//2
batch_size = 4

training_set = data[:block_size]
validation_set = data[block_size:]