In [4]:
# Load the cleaned dataset into a single string
with open('cleaned_dataset.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
print("Length of dataset (in characters): ", len(text))

Length of dataset (in characters):  6199345


In [6]:
# Print the first 1000 characters
print(text[:1000])

M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people youd expect to be involved in anything strange or mysterious, because they just didnt hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.

The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didnt think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursleys sister, but they hadnt met for several years; 

In [7]:
# List all the unique characters that occur in the dataset
characters = sorted(set(text))
vocab_size = len(characters)
print(''.join(characters))
print(vocab_size)


 !"&'()*,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqrstuvwxyz{|}
86


Next, we need a strategy to tokenize the input text. To tokenize means to convert the raw text string into a sequence of integers according to some vocabulary of possible elements.


In our case we are building a character-level language model, so we will translate individual characters into integers.


We will implement an encoder and a decoder, both deliberately simple, since that is enough for our use case.

There are many other tokenizers (which encode text into integers and decode it back) that use different schemes and different vocabularies:

- Google uses [sentencepiece](https://github.com/google/sentencepiece): this tokenizer implements sub-word units, meaning a token is typically neither an entire word nor a single character. Sub-word tokenization is what is usually adopted in practice.

- OpenAI uses [tiktoken](https://github.com/openai/tiktoken): this is a BPE (Byte Pair Encoding) tokenizer, and it is what the GPT models use. Here the vocabulary is very large, roughly 50,000 tokens for the GPT-2 encoding.
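
To make "sub-word units" a little more concrete, here is a minimal sketch of the merge step at the heart of Byte Pair Encoding. It is a toy illustration written for this notebook, not how sentencepiece or tiktoken are actually implemented: the helper names `most_frequent_pair` and `merge_pair`, the sample string, and the choice of two merge rounds are all assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

bpe_tokens = list("low lower lowest")  # start from single characters (16 tokens)
for _ in range(2):                     # two merge rounds, chosen arbitrarily
    pair = most_frequent_pair(bpe_tokens)
    bpe_tokens = merge_pair(bpe_tokens, pair)

print(bpe_tokens)
# ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After two merges the frequent fragment `low` has become a single token, so the sequence shrinks from 16 tokens to 10 while the vocabulary grows by two entries. Real tokenizers repeat this many thousands of times over a large corpus.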

So there is a tradeoff:
- A small vocabulary gives very long sequences of integers.
- A very large vocabulary gives short sequences of integers.
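
The tradeoff can be seen on a small scale with a rough sketch like the one below. It compares character-level tokens against a whitespace word split on a short excerpt of the dataset. The word split is only a stand-in for a large-vocabulary tokenizer, chosen here for simplicity; it is not what sentencepiece or tiktoken do.

```python
# A short excerpt of the dataset, used only for illustration.
sample = (
    "Mr. Dursley was the director of a firm called Grunnings, which made drills. "
    "He was a big, beefy man with hardly any neck, although he did have a very "
    "large mustache. Mrs. Dursley was thin and blonde and had nearly twice the "
    "usual amount of neck, which came in very useful as she spent so much of her "
    "time craning over garden fences, spying on the neighbors."
)

# Character-level: one token per character, vocabulary = distinct characters.
char_tokens = list(sample)
char_vocab = set(char_tokens)

# Word-level (assumed whitespace split): one token per word, vocabulary = distinct words.
word_tokens = sample.split()
word_vocab = set(word_tokens)

print("characters -> sequence length:", len(char_tokens), "| vocabulary size:", len(char_vocab))
print("words      -> sequence length:", len(word_tokens), "| vocabulary size:", len(word_vocab))
```

Even on this excerpt the character-level sequence is several times longer while its vocabulary is smaller. On the full dataset the gap widens, because the character vocabulary stays fixed at a few dozen symbols while the number of distinct words keeps growing.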

We will stick to a character-level tokenizer with a simple encoder and decoder. Our vocabulary is small, only `86` characters, so the tradeoff is that the encoded text will be a long sequence of integers.

In [9]:
# Map each character to an integer ID (stoi) and each ID back to its character (itos)
stoi = {ch: i for i, ch in enumerate(characters)}
itos = {i: ch for i, ch in enumerate(characters)}

def encode(s):
    """Convert a string into a list of integer IDs."""
    return [stoi[c] for c in s]

def decode(ids):
    """Convert a list of integer IDs back into a string."""
    return ''.join(itos[i] for i in ids)

# Example showing an encode/decode round trip
# print(encode("harry potter"))
# print(decode(encode("harry potter")))

# Output:
# [64, 57, 74, 74, 81, 1, 72, 71, 76, 76, 61, 74]
# harry potter
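
One property worth checking for a tokenizer like this is that decoding an encoded string returns the original string, and that a character outside the vocabulary fails loudly. The sketch below is self-contained so it can run on its own: `demo_text` is a hypothetical stand-in for `text`, and the `demo_` names are used only to avoid overwriting the `stoi`, `itos`, `encode` and `decode` defined above.

```python
# Hypothetical stand-in for the real dataset, just for this check.
demo_text = "harry potter and the sorcerers stone"

demo_chars = sorted(set(demo_text))
demo_stoi = {ch: i for i, ch in enumerate(demo_chars)}
demo_itos = {i: ch for i, ch in enumerate(demo_chars)}

def demo_encode(s):
    """Convert a string into a list of integer IDs."""
    return [demo_stoi[c] for c in s]

def demo_decode(ids):
    """Convert a list of integer IDs back into a string."""
    return "".join(demo_itos[i] for i in ids)

# Round trip: decoding the encoding gives back the input.
assert demo_decode(demo_encode("harry potter")) == "harry potter"

# A character that never appeared in the data has no ID, so encoding raises KeyError.
try:
    demo_encode("harry potter!")
except KeyError as err:
    print("character not in vocabulary:", err)
```

Because the vocabulary is built from every character in the dataset, the `KeyError` case can only occur for text that contains characters the dataset never used.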