This implementation follows the lecture by Andrej Karpathy:
https://www.youtube.com/watch?v=kCc8FmEb1nY

The underlying concepts are explored further in 3Blue1Brown's series on neural networks:
https://www.youtube.com/watch?v=aircAruvnKk

### Imports

In [None]:
import torch

### Read and prepare dataset

In [13]:
with open("alice.txt", encoding="utf-8") as f:
    text = f.read()

print("length of dataset:", len(text))

chars = sorted(set(text))
vocab_size = len(chars) # note that capital and small letters are treated as different characters
print("length of vocabulary:", vocab_size)
print("vocabulary:", ''.join(chars))


length of dataset: 163434
length of vocabulary: 91
vocabulary: 
 !#$%'()*,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzù—‘’“”•™﻿
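
The last character in the vocabulary above is invisible: it is the byte order mark (BOM) `\ufeff`, which some editors write at the start of UTF-8 files. As a minimal sketch, assuming you would rather not spend a token on it, Python's `utf-8-sig` codec drops a leading BOM while decoding. The bytes in `raw` are a stand-in for the file contents, not the real `alice.txt`.

```python
# Stand-in for the bytes of a UTF-8 file that starts with a BOM.
raw = "\ufeffTitle: Alice".encode("utf-8")

# Plain "utf-8" keeps the BOM as the first character of the text.
with_bom = raw.decode("utf-8")

# "utf-8-sig" removes a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom[0]))     # '\ufeff'
print(repr(without_bom[0]))  # 'T'
```

The same codec name can be passed to `open(..., encoding="utf-8-sig")`. If the BOM occurs only at the start of the file, the vocabulary would then shrink by one character.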


### Build encoder and decoder
The encoder's job is to translate each character in the vocabulary into an integer.
The decoder's job is to reverse this encoding, turning the integers back into the original characters.

Encoders can follow different schemes. Popular implementations are tiktoken (OpenAI, used for the GPT models) and SentencePiece (Google). Both are sub-word encoders: rather than mapping each unique word or each single character to one token, they can split a word into several pieces. This results in a much larger vocabulary of possible tokens, which in turn means a sentence can be encoded as a comparatively short sequence of integers.

For intuition, this implementation uses a simple character-level encoder. Its vocabulary is small, so it produces a long sequence of integers, each drawn from a small range.
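
To make the trade-off concrete, here is a minimal sketch that tokenizes the same string at character level and with a toy sub-word vocabulary. The vocabulary `toy_subwords` and the greedy longest-match function `greedy_tokenize` are made up for illustration; they are not how tiktoken or SentencePiece work internally.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                tokens.append(piece)
                pos += size
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return tokens


sentence = "the rabbit ran"

# Character level: the vocabulary is just the distinct characters.
char_vocab = set(sentence)
char_tokens = greedy_tokenize(sentence, char_vocab)

# Hypothetical sub-word vocabulary: all single characters plus a few longer pieces.
toy_subwords = char_vocab | {"the", " rab", "bit", " ran"}
subword_tokens = greedy_tokenize(sentence, toy_subwords)

print(len(char_vocab), len(char_tokens))        # 9 14
print(len(toy_subwords), len(subword_tokens))   # 13 4
print(subword_tokens)                           # ['the', ' rab', 'bit', ' ran']
```

The larger vocabulary covers the sentence in 4 tokens instead of 14, which is the same effect described above on a much smaller scale.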

In [28]:
# create dictionaries to convert characters to integers and vice versa
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}


# encode and decode text
# lambda functions are used as small throwaway functions
encoder = lambda string: [char_to_int[char] for char in string] # map every character of the input string to its integer
decoder = lambda ints: [int_to_char[i] for i in ints] # reverse the encoding; returns a list of characters, not a string

print("Without encoder function", [char_to_int["h"], char_to_int["e"], char_to_int["l"], char_to_int["l"], char_to_int["o"]])
print("With encoder function", encoder("hello"))

print("Decoded", decoder(encoder("hello")))

Without encoder function [63, 60, 67, 67, 70]
With encoder function [63, 60, 67, 67, 70]
Decoded ['h', 'e', 'l', 'l', 'o']
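
The decoder above returns a list of characters. If a string is more convenient, the characters can be joined. The following is a minimal sketch using a short stand-in text instead of `alice.txt`, so the integer values differ from the output above. The name `decode_to_str` is hypothetical and not part of the notebook.

```python
text = "hello world"  # stand-in for the contents of alice.txt
chars = sorted(set(text))

char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}


def encoder(string):
    # Map every character of the input string to its integer.
    return [char_to_int[ch] for ch in string]


def decode_to_str(ints):
    # Map every integer back to its character and join them into one string.
    return "".join(int_to_char[i] for i in ints)


ids = encoder("hello")
print(ids)                 # [3, 2, 4, 4, 5]
print(decode_to_str(ids))  # hello

# Encoding followed by decoding should give back the original text.
assert decode_to_str(encoder(text)) == text
```

Because the lookup is a plain dictionary, encoding a character that never occurs in the dataset raises a `KeyError`.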


### Prepare the dataset
This section encodes the entire dataset and splits it into a training portion and a validation portion.

In [30]:
data = encoder(text)
print("Encoded data", data[:10])
print("Decoded data", decoder(data[:10]))

Encoded data [90, 46, 64, 75, 67, 60, 24, 1, 27, 67]
Decoded data ['\ufeff', 'T', 'i', 't', 'l', 'e', ':', ' ', 'A', 'l']
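
The split into training and validation portions described at the start of this section could look like the following minimal sketch. The 90/10 ratio is an assumption, not something the notebook states, and `data` here is a stand-in list rather than the encoded book.

```python
data = list(range(100))  # stand-in for encoder(text)

# Assumed ratio: first 90% for training, the rest for validation.
# Integer arithmetic avoids floating-point rounding in the split index.
n = len(data) * 9 // 10

train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))  # 90 10
```

The split is by position rather than by random sampling, so the validation portion is a contiguous passage that the model never sees during training.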
