In [29]:
import tiktoken
import torch #PyTorch: https://pytorch.org

In [1]:
# read it in to inspect it
with open(file='files/shakespeare.txt', mode='r', encoding='utf-8') as f:
    text = f.read()

In [2]:
print("length of dataset (Tiny Shakespeare) in characters: ", len(text))

length of dataset (Tiny Shakespeare) in characters:  1115394


In [3]:
print(text[:250]) #1st 250 characters

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [4]:
characters = sorted(set(text)) # sorted list of every unique character in the text
vocabulary_size = len(characters)
print("The characters are:",''.join(characters),".\nWith ",vocabulary_size," unique characters in the Shakespearean vocabulary.")

The characters are: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz .
With  65  unique characters in the Shakespearean vocabulary.


The next step is a strategy for tokenizing the raw text, i.e. representing every element of the vocabulary as an integer.  Since this is a character-level language model, we tokenize the individual characters, of which there are 65.

<hr><br>

### <center>Tokenizing: Encoding & Decoding</center>

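Sub-word vs. character level is a tradeoff: a character-level tokenizer has a tiny vocabulary (65 here) but produces long sequences, while a sub-word tokenizer has a much larger vocabulary but produces shorter sequences.

There are many ways to accomplish this:

- Google uses [SentencePiece](https://github.com/google/sentencepiece), a sub-word level tokenizer
- OpenAI (ChatGPT) uses [tiktoken](https://github.com/openai/tiktoken), a BPE (Byte-Pair Encoding) tokenizer, which is also sub-word level
- We will build a basic character-level one below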
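
To make the tradeoff concrete, here is a minimal sketch of the merge step behind Byte-Pair Encoding, assuming a toy string and only four merges. The helper names `most_frequent_pair` and `merge_pair` are hypothetical, and this is not how SentencePiece or tiktoken are actually implemented; it only illustrates why a sub-word vocabulary yields shorter sequences.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common pair of adjacent tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

sample = "speak, speak."           # toy text, chosen for its repetition
tokens = list(sample)              # character level: one token per character
char_level_length = len(tokens)

for _ in range(4):                 # four merges is an arbitrary choice
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
subword_length = len(tokens)

print(char_level_length, "character tokens vs.", subword_length, "sub-word tokens:", tokens)
```

Each merge adds one entry to the vocabulary and shortens the sequence, which is the tradeoff described above: a larger vocabulary in exchange for fewer tokens per text.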


In Python, a dictionary (or lookup table) can be initialized via `d={}`, & `enumerate()` can be used to iterate over a list as `(index, item)` pairs.<br><br>
So, if we initialize a dictionary using the following:<br>
`{ character:integer for integer,character in enumerate(characters) }`,<br>
we declare a lookup table whose key-value pairs are `character:integer`, such that each of the 65 characters in the sorted `list` is mapped to an integer from 0 to 64; characters --> integers.<br> We can repeat the process & use `integer:character` in the initialization to create a second mapping from integers --> characters.  

In [5]:
character_to_integer_dictionary = { character:integer for integer,character in enumerate(characters)}
integer_to_character_dictionary = { integer:character for integer,character in enumerate(characters)}

In [6]:
character_to_integer_dictionary.items()

dict_items([('\n', 0), (' ', 1), ('!', 2), ('$', 3), ('&', 4), ("'", 5), (',', 6), ('-', 7), ('.', 8), ('3', 9), (':', 10), (';', 11), ('?', 12), ('A', 13), ('B', 14), ('C', 15), ('D', 16), ('E', 17), ('F', 18), ('G', 19), ('H', 20), ('I', 21), ('J', 22), ('K', 23), ('L', 24), ('M', 25), ('N', 26), ('O', 27), ('P', 28), ('Q', 29), ('R', 30), ('S', 31), ('T', 32), ('U', 33), ('V', 34), ('W', 35), ('X', 36), ('Y', 37), ('Z', 38), ('a', 39), ('b', 40), ('c', 41), ('d', 42), ('e', 43), ('f', 44), ('g', 45), ('h', 46), ('i', 47), ('j', 48), ('k', 49), ('l', 50), ('m', 51), ('n', 52), ('o', 53), ('p', 54), ('q', 55), ('r', 56), ('s', 57), ('t', 58), ('u', 59), ('v', 60), ('w', 61), ('x', 62), ('y', 63), ('z', 64)])

In [7]:
integer_to_character_dictionary.items() 

dict_items([(0, '\n'), (1, ' '), (2, '!'), (3, '$'), (4, '&'), (5, "'"), (6, ','), (7, '-'), (8, '.'), (9, '3'), (10, ':'), (11, ';'), (12, '?'), (13, 'A'), (14, 'B'), (15, 'C'), (16, 'D'), (17, 'E'), (18, 'F'), (19, 'G'), (20, 'H'), (21, 'I'), (22, 'J'), (23, 'K'), (24, 'L'), (25, 'M'), (26, 'N'), (27, 'O'), (28, 'P'), (29, 'Q'), (30, 'R'), (31, 'S'), (32, 'T'), (33, 'U'), (34, 'V'), (35, 'W'), (36, 'X'), (37, 'Y'), (38, 'Z'), (39, 'a'), (40, 'b'), (41, 'c'), (42, 'd'), (43, 'e'), (44, 'f'), (45, 'g'), (46, 'h'), (47, 'i'), (48, 'j'), (49, 'k'), (50, 'l'), (51, 'm'), (52, 'n'), (53, 'o'), (54, 'p'), (55, 'q'), (56, 'r'), (57, 's'), (58, 't'), (59, 'u'), (60, 'v'), (61, 'w'), (62, 'x'), (63, 'y'), (64, 'z')])

In [8]:
encode = lambda string: [character_to_integer_dictionary[character] for character in string] # input string --> output list of integers
decode = lambda integers: ''.join([integer_to_character_dictionary[integer] for integer in integers]) # input list of integers --> output string of the original characters

## <center> Tokenizing `"hello world!"` </center>

In [15]:
phrase = "hello world!"

In [16]:
# Simple tokenizer built for our purposes; remember that our dictionaries have 65 elements
encoded_tmp = encode(phrase)
print("Encoded Message: ",encoded_tmp,"--> Decoded Message: ", decode(encoded_tmp))

Encoded Message:  [46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42, 2] --> Decoded Message:  hello world!
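
The round trip above works because every character of the phrase is in the vocabulary. A lookup-table tokenizer like this one cannot encode a character it has never seen. The sketch below shows that failure using a toy vocabulary built from the phrase itself; the names `toy_vocabulary`, `toy_encode`, and `toy_decode` are hypothetical stand-ins, not the notebook's own objects.

```python
# Toy vocabulary: only the characters that occur in the phrase itself.
toy_vocabulary = sorted(set("hello world!"))
toy_char_to_int = {ch: i for i, ch in enumerate(toy_vocabulary)}
toy_int_to_char = {i: ch for i, ch in enumerate(toy_vocabulary)}

def toy_encode(string):
    """Map each character to its integer; raises KeyError for unseen characters."""
    return [toy_char_to_int[ch] for ch in string]

def toy_decode(integers):
    """Map each integer back to its character and join the result."""
    return "".join(toy_int_to_char[i] for i in integers)

# Round trip succeeds for text drawn from the vocabulary.
print(toy_decode(toy_encode("hello world!")))

# A character outside the vocabulary cannot be looked up.
try:
    toy_encode("hello 7")
except KeyError as error:
    print("cannot encode unseen character:", error)
```

The same holds for the 65-character Shakespeare vocabulary: its only digit is `3`, so a string containing `7` would raise a `KeyError` in `encode`.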


Now, let us use a sub-word encoder, like **tiktoken**, on the same phrase; note that the encoding contains fewer tokens (3 instead of 12).

In [27]:
tt_Encoded = tiktoken.get_encoding('gpt2')

print("Encoded Message: ",tt_Encoded.encode(phrase),"--> Decoded Message: ", tt_Encoded.decode(tt_Encoded.encode(phrase)))

Encoded Message:  [31373, 995, 0] --> Decoded Message:  hello world!


<h2 align=center> Tensors</h2>

In Machine Learning, **parallelization** is used extensively, because (mathematical) operations can be applied to multiple elements simultaneously, as opposed to in a sequential or stepwise fashion.<br><br>Tensors are just multi-dimensional arrays of values, which make this parallelization possible.  The rank of a tensor is its number of axes:
- Rank 0: Scalar (magnitude only), e.g. `5`; shape `()`
- Rank 1: Vector (magnitude + direction), e.g. `[1, 2, 3]`; shape `(3,)`
- Rank 2: Matrix (dyad), e.g. 2 rows x 3 columns; shape `(2, 3)`
- Rank 3: Triad, e.g. a stack of 4 such matrices; shape `(4, 2, 3)`
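
As a rough illustration of rank and shape, the sketch below uses plain nested lists as stand-ins for tensors so that it runs without PyTorch. The helper `shape_of` is hypothetical and assumes non-empty, regularly nested lists; for a real tensor, `torch.tensor(value).shape` reports the same information.

```python
def shape_of(value):
    """Return the shape of a regularly nested, non-empty list as a tuple."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0]          # descend one axis
    return tuple(shape)

examples = {
    "scalar": 7,
    "vector": [1, 2, 3],
    "matrix": [[1, 2, 3], [4, 5, 6]],
    "triad": [[[0, 1], [2, 3]], [[4, 5], [6, 7]], [[8, 9], [10, 11]]],
}

for name, value in examples.items():
    shape = shape_of(value)
    # The rank is simply the number of axes, i.e. the length of the shape.
    print(f"{name}: rank {len(shape)}, shape {shape}")
```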

<h4 align=center>Utilizing the simple encode() function we created, let us create a tensor called 'data' which wraps the encoded text</h4>

In [30]:
#text contains the tiny shakespeare which was read in earlier; 
#will encode the text & wrap it into a tensor
data = torch.tensor(encode(text), dtype=torch.long)

In [54]:
 
print(f"The encoded text has {data.shape[0]:,} elements, wrapped in a tensor of dtype '{data.dtype}'.")
print('The first 100 elements are:', data[:100])
print('These elements correspond exactly to the first 100 characters of the original Tiny Shakespeare text.')

The encoded text has 1,115,394 elements, wrapped in a tensor of dtype 'torch.int64'.
The first 100 elements are: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])
These elements correspond exactly to the first 100 characters of the original Tiny Shakespeare text.


In [62]:
print(text[:100]) # same as decode(data[:100].tolist()); .tolist() converts tensor elements to plain ints usable as dictionary keys

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


<h4 align=center>Split Data into Training & Validation Sets</h4>

In [57]:
n = int(0.9*len(data)) # n = 90% of the # of elements available
training_data = data[:n] # slice 90% into training 
validation_data = data[n:] # slice 10% into validation

<h4>We now need to feed the text into the transformer, but feeding it in all at once would be computationally prohibitive, so instead we train on randomly sampled chunks of the text.  The 'block size' is the maximum length of each of these chunks</h4>

In [59]:
block_size = 8

To see this in action, consider the next block of code: each time one more element is added to the context, we get a new (context, target) training example.

In [60]:
#W/ a block size of 8, the transformer is trained on each of the following, i.e. a context of size 1 through a context of size 8;
#This exposes the transformer to contexts of every length from 1 up to block_size
x = training_data[:block_size]
y = training_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'When the input is {context}, the Target is {target}')

When the input is tensor([18]), the Target is 47
When the input is tensor([18, 47]), the Target is 56
When the input is tensor([18, 47, 56]), the Target is 57
When the input is tensor([18, 47, 56, 57]), the Target is 58
When the input is tensor([18, 47, 56, 57, 58]), the Target is 1
When the input is tensor([18, 47, 56, 57, 58,  1]), the Target is 15
When the input is tensor([18, 47, 56, 57, 58,  1, 15]), the Target is 47
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the Target is 58


In [None]:
training_data[:block_size+1] #block_size+1 elements give the transformer 8 (context, target) examples, since the last context needs one more element as its target
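
The random sampling of chunks described earlier could look roughly like the sketch below. It is written with plain Python lists so it runs without PyTorch. The function name `get_batch`, the `batch_size` parameter, and the toy data are assumptions for illustration, not part of the notebook.

```python
import random

def get_batch(tokens, block_size, batch_size, rng):
    """Sample `batch_size` random chunks and their next-token targets."""
    # A start index must leave room for block_size inputs plus one final target.
    starts = [rng.randrange(len(tokens) - block_size) for _ in range(batch_size)]
    inputs = [tokens[s:s + block_size] for s in starts]
    # Targets are the inputs shifted one position to the right.
    targets = [tokens[s + 1:s + block_size + 1] for s in starts]
    return inputs, targets

toy_tokens = list(range(100))     # stand-in for the encoded training data
rng = random.Random(0)            # seeded so the sketch is repeatable
xb, yb = get_batch(toy_tokens, block_size=8, batch_size=4, rng=rng)

for x, y in zip(xb, yb):
    print(x, "->", y)
```

With a tensor such as `training_data`, the same slices could be collected with `torch.stack` so that a whole batch is processed in parallel.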