# Setup

In [1]:
import re
import torch

In [2]:
with open("paul_graham_essay.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Remove HTML tags (anything between '<' and '>' on a single line)
text = re.sub(r'<.*?>', '', text)
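The pattern above is non-greedy, so each tag is removed separately. Two caveats are worth seeing in action: `.` does not match a newline by default, so a tag broken across lines survives, and any ordinary text that sits between `<` and `>` on one line is removed too. The sketch below uses a made-up sample string (not taken from the essay file) and shows `re.DOTALL` as one possible variation, not as what the notebook does.

```python
import re

# Made-up sample for illustration; not from paul_graham_essay.txt.
sample = "<p>Hello <b>world</b></p>\n<a\nhref='x'>link</a> 1 < 2 and 3 > 2"

# Same pattern as the notebook: non-greedy, '.' stops at newlines.
stripped = re.sub(r'<.*?>', '', sample)
print(repr(stripped))

# Variation: with re.DOTALL the tag split across two lines is matched as well.
stripped_dotall = re.sub(r'<.*?>', '', sample, flags=re.DOTALL)
print(repr(stripped_dotall))
```

In both cases the comparison `< 2 and 3 >` is swallowed, which may explain stray gaps in cleaned text that contains literal angle brackets.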

In [3]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  2592909


In [4]:
print(text[500:1000])

 those from small additions of whichever
quality was missing.  The more common case is a small
addition of generality: a piece of gossip that's more than
just gossip, because it teaches something interesting about
the world. But another less common approach is to focus on
the most general ideas and see if you can find something new
to say about them. Because these start out so general, you
only need a small delta of novelty to produce a useful
insight.

A small delta of novelty is all you'll be 


# Building Tokenizer

Workflow:
$$
\text{Text} \xrightarrow{\text{Tokenize}} \text{Token IDs} \xrightarrow{\text{Linear}} \text{Embedding} \xrightarrow{\text{Multi-Head Attention}} \text{Attention} \xrightarrow{\text{Feed Forward}} \text{Output}
$$
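To make the workflow concrete, here is a minimal sketch of the stages using standard PyTorch modules. It is not the model built in this notebook: the sizes `n_embd`, `n_head`, and `block_size` are illustrative assumptions, the token IDs are random, and positional embeddings, residual connections, and layer normalization are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions); 96 matches this dataset's character vocabulary.
vocab_size, n_embd, n_head, block_size = 96, 32, 4, 8

# Random token IDs stand in for tokenized text: shape (batch, time).
token_ids = torch.randint(0, vocab_size, (1, block_size))

embed = nn.Embedding(vocab_size, n_embd)  # token IDs -> embedding vectors
attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
ffwd = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = embed(token_ids)  # (1, block_size, n_embd)

# Causal mask: True marks positions a token is NOT allowed to attend to.
mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
attended, _ = attn(x, x, x, attn_mask=mask)

out = ffwd(attended)  # (1, block_size, n_embd)
print(out.shape)
```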

In [5]:
# Vocabulary: every unique character in the text, sorted by code point
chars = sorted(set(text))
vocab_size = len(chars)
print("vocab size: ", vocab_size)
print(vocab_size, "unique characters: ", ''.join(chars))

vocab size:  96
96 unique characters:  
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~é


In [6]:
# Map characters to integer IDs and back
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):  # encoder: take a string, output a list of integers
    return [stoi[c] for c in s]

def decode(ids):  # decoder: take a list of integers, output a string
    return ''.join(itos[i] for i in ids)

print(encode("hii there"))
print(decode(encode("hii there")))

[72, 73, 73, 1, 84, 72, 69, 82, 69]
hii there
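One limitation of a character-level lookup like this is that `encode` fails on any character that never appeared in the text. The sketch below rebuilds the 96-character vocabulary printed above without the essay file (an assumption made only so the example is self-contained: newline, printable ASCII except the backslash, and `é`), checks the round trip, and shows the `KeyError` raised for the backslash.

```python
# Rebuild the vocabulary shown above without reading the essay file
# (assumption for illustration): newline, printable ASCII minus '\', and 'é'.
ascii_chars = "".join(chr(c) for c in range(32, 127) if chr(c) != "\\")
chars = sorted(set("\n" + ascii_chars + "é"))

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

sample = "hii there"
ids = encode(sample)
print(ids)
print(decode(ids) == sample)  # round trip recovers the input

# A character outside the vocabulary has no ID, so the lookup raises KeyError.
try:
    encode("C:\\temp")
except KeyError as err:
    print("unknown character:", err)
```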


In [7]:
# Encode the whole text into a 1-D tensor of token IDs
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(text[500:1000])
print(data[500:1000])

torch.Size([2592909]) torch.int64
 those from small additions of whichever
quality was missing.  The more common case is a small
addition of generality: a piece of gossip that's more than
just gossip, because it teaches something interesting about
the world. But another less common approach is to focus on
the most general ideas and see if you can find something new
to say about them. Because these start out so general, you
only need a small delta of novelty to produce a useful
insight.

A small delta of novelty is all you'll be 
tensor([ 1, 84, 72, 79, 83, 69,  1, 70, 82, 79, 77,  1, 83, 77, 65, 76, 76,  1,
        65, 68, 68, 73, 84, 73, 79, 78, 83,  1, 79, 70,  1, 87, 72, 73, 67, 72,
        69, 86, 69, 82,  0, 81, 85, 65, 76, 73, 84, 89,  1, 87, 65, 83,  1, 77,
        73, 83, 83, 73, 78, 71, 15,  1,  1, 53, 72, 69,  1, 77, 79, 82, 69,  1,
        67, 79, 77, 77, 79, 78,  1, 67, 65, 83, 69,  1, 73, 83,  1, 65,  1, 83,
        77, 65, 76, 76,  0, 65, 68, 68, 73, 84, 73, 79, 78,  1, 7

In [8]:
# Use the first 90% of the tokens for training and hold out the rest for validation
train_size = int(len(data) * 0.9)
train_data, val_data = data[:train_size], data[train_size:]

In [9]:
block_size = 8
train_data[:block_size+1]

tensor([52, 69, 80, 84, 69, 77, 66, 69, 82])
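The nine IDs above decode to "September". A likely reason for slicing `block_size + 1` tokens (an assumption; the notebook does not state it at this point) is that a chunk of that length yields `block_size` next-token training examples: the input is the first `block_size` tokens and the target is the same sequence shifted by one. The sketch below uses a plain Python list holding the printed values; the same slicing works on the tensor.

```python
block_size = 8

# Values printed above for train_data[:block_size+1], copied into a list.
chunk = [52, 69, 80, 84, 69, 77, 66, 69, 82]

x = chunk[:block_size]        # inputs: first block_size tokens
y = chunk[1:block_size + 1]   # targets: the same tokens shifted by one

# Each prefix of x is a context whose target is the token that follows it.
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"when input is {context} the target is {target}")
```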