---
#### **Open/Extract Data For Processing**
1. Open the data file
2. Extract the set of unique characters in the text for tokenization
---

In [46]:
# Open data to process in transformer
filename = 'wizard_of_oz.txt'

# Use a separate name for the file handle so it does not shadow the filename
with open('data/' + filename, 'r', encoding='utf-8') as f:
    text = f.read()

# Extract all the unique characters out of our data for tokenization
chars = sorted(set(text))
print("chars len:", len(chars))
print("Some characters of chars: [..., " + ', '.join(chars[20:40]) + ", ...]")

chars len: 81
Some characters of chars: [..., 8, 9, :, ;, ?, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, ...]
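
To see what `sorted(set(text))` does in isolation, here is a minimal sketch on a toy string. The `sample` string and the `vocab` name are assumptions for illustration, not the Wizard of Oz data: `set()` drops duplicate characters and `sorted()` orders what remains by Unicode code point, which is why whitespace and punctuation come before letters.

```python
# Toy stand-in for the book text (hypothetical sample, not the real data file)
sample = "hello world\n"

# set() drops duplicate characters; sorted() orders them by Unicode code point
vocab = sorted(set(sample))

print(vocab)
print("vocab size:", len(vocab))
```

The length of this list is the vocabulary size the tokenizer works with, the same quantity reported as `chars len` above.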


---
#### **Create Tokenization Pipeline**
1. Create mapping of character to integer value
2. Create mapping of integer value to character
3. Create an encoding function to convert string -> list of integers (vector)
4. Create a decoding function to convert vector -> string
---

In [62]:
# Tokenization is broken into 2 steps:
# Encoding - converts each character of a string into its integer id
# Decoding - converts a list of integer ids back into a string

string_to_int = {ch:i for i,ch in enumerate(chars)} # give each unique char an id {'\n': 0, ' ': 1, '!': 2, ...}
int_to_string = {i:ch for i,ch in enumerate(chars)} # inverse of previous

encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_string[i] for i in l])

print(encode("HELLO")) # Encode function converts string to list of ints: "HELLO" -> [32, 29, 36, 36, 39]
print(decode([32, 29, 36, 36, 39])) # Decode converts [32, 29, 36, 36, 39] back into "HELLO"

[32, 29, 36, 36, 39]
HELLO
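
The pipeline above can be sanity-checked with a round trip, sketched here under a few assumptions: the toy vocabulary built from `"HELLO WORLD"` is hypothetical (the notebook builds `chars` from the book text), and `encode`/`decode` are written as plain functions rather than lambdas purely for readability. The sketch also shows that a character missing from the vocabulary has no id, so a dictionary lookup for it fails.

```python
# Toy vocabulary (hypothetical; the notebook builds `chars` from the book text)
chars = sorted(set("HELLO WORLD"))

string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character of s to its integer id."""
    return [string_to_int[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return ''.join(int_to_string[i] for i in ids)

# Round trip: decoding an encoding should give back the original string
ids = encode("HELLO")
assert decode(ids) == "HELLO"

# A character outside the vocabulary has no id, so the lookup raises KeyError
try:
    encode("hello")  # lowercase letters are not in this toy vocabulary
except KeyError as err:
    print("unknown character:", err)
```

Because the ids depend on the sorted vocabulary, the same string encodes to different integers under a different `chars` list; only the round trip is guaranteed to hold.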
