#### Quick Transformer lol

In [2]:
import torch
import torch.nn as nn

neurons = nn.Linear(1,2)

embedding = nn.Embedding(10, 3)

print(neurons.weight)



Parameter containing:
tensor([[-0.6635],
        [-0.8240]], requires_grad=True)


In [3]:
# an Embedding matrix of 10 embeddings with 3 dimensions each
embedding = nn.Embedding(10, 3)

# a batch of 2 samples of 1 index each
# (named `indices` so we don't shadow the built-in input())
indices = torch.tensor([[9],[0]])

print('THE INPUTS \n',indices)

# what does the embedding matrix look like?
print('THE EMBEDDING MATRIX \n',embedding.weight)

# what is the result of the input embedding?
print('EMBEDDINGS FOR THE GIVEN INPUTS \n',embedding(indices))

THE INPUTS 
 tensor([[9],
        [0]])
THE EMBEDDING MATRIX 
 Parameter containing:
tensor([[-0.4110,  0.1662, -0.9450],
        [-1.6423,  0.0504,  0.6597],
        [ 1.4175,  0.9239, -0.4107],
        [-0.6651, -0.2545, -0.5963],
        [-0.1035, -1.2211, -0.0064],
        [ 2.5755,  0.9422, -0.7656],
        [ 0.1951, -0.7598, -0.2173],
        [ 0.9963,  1.1018, -0.2662],
        [ 1.3772,  0.9599, -1.0536],
        [-0.4803,  1.0032, -1.4466]], requires_grad=True)
EMBEDDINGS FOR THE GIVEN INPUTS 
 tensor([[[-0.4803,  1.0032, -1.4466]],

        [[-0.4110,  0.1662, -0.9450]]], grad_fn=<EmbeddingBackward0>)


Notes on nn.Embedding in the bigram language model

The embedding module creates a lookup table with a configurable number of rows and columns:
- the number of rows is the number of unique tokens in the vocabulary (words, or characters in our case),
- the number of columns is the size of each token embedding.

The table is initialized with random values, and the embeddings are learned during training.
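
To make the lookup-table idea concrete, here is a minimal sketch that checks the claims above directly. The names `vocab_size`, `embedding_dim`, `idx`, and `rows` are just for illustration; the sizes match the cell above.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10, 3   # illustrative sizes, same as the cell above
embedding = nn.Embedding(vocab_size, embedding_dim)

# the table has one row per vocab entry and one column per embedding dimension
print(embedding.weight.shape)                     # torch.Size([10, 3])

# looking up an index simply returns that row of the table
idx = torch.tensor([9, 0])
rows = embedding(idx)
print(torch.equal(rows, embedding.weight[idx]))   # True

# the table is a trainable parameter, which is why training can update it
print(embedding.weight.requires_grad)             # True
```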

 

#### Import the dataset

In [4]:
with open('input.txt','r', encoding='utf-8') as f:
    text = f.read()


In [5]:
print('the length of the text is ',len(text))
print(text[:200])

the length of the text is  1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


## OK

The embedding layer is a lookup table:

- the input to the embedding layer is a list of indices
- the output is a list of word (token) embeddings.

Let's turn our dataset into a list of indices and feed it to the embedding layer.

#### how do we do this?
(We're doing a character-level model)

- [x] find all possible vocab entries (letters) in the dataset
    - split the dataset into letters, make a set of them, then a sorted list
- tokenize the dataset wrt the vocab
    - enumerate every element in the list
    - make a function that returns the index for each letter
        - and one that returns the letter for each index
- turn the entire dataset into numbers wrt this tokenization scheme
    - pass all of these token indices into the embedding table
- get back all the embeddings
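
The steps above can be sketched end to end on a tiny string. This is only a rough illustration, not the final model code: `sample_text`, `char_to_idx`, `token_ids`, and `embedded` are hypothetical names, and the embedding size of 3 is an arbitrary choice.

```python
import torch
import torch.nn as nn

# hypothetical toy corpus standing in for the full dataset
sample_text = "hello world"

# 1. vocab: every unique character, sorted so the ordering is stable
vocab = sorted(set(sample_text))

# 2. tokenizer: character -> index
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

# 3. the whole text as a tensor of token indices (int64, as nn.Embedding expects)
token_ids = torch.tensor([char_to_idx[ch] for ch in sample_text], dtype=torch.long)

# 4. embedding table: one row per vocab entry, 3 columns (arbitrary size)
embedding = nn.Embedding(len(vocab), 3)

# 5. look up one embedding per token
embedded = embedding(token_ids)

print(len(vocab))        # 8
print(token_ids.shape)   # torch.Size([11])
print(embedded.shape)    # torch.Size([11, 3])
```

Each of the 11 characters comes back as one 3-dimensional row of the table, so repeated characters (like the three `l`s) get identical embeddings.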

In [6]:
# FIND ALL POSSIBLE VOCAB WORDS IN DATASET 

from pprint import pprint

vocab_letters  = sorted(list(set(text)))
print(''.join(vocab_letters))
print(len(vocab_letters))

# this is the entirety of our vocab for our character-level transformer


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [55]:
# INDEX ALL OF THE LETTERS
# LETTER ENCODER

# maps each letter to its index (despite the name, letter_list is a dict)
letter_list = {letter:index for index, letter in enumerate(vocab_letters)}

str_to_int = lambda s: [letter_list[letter] for letter in s]

print('does our encoder work? ',str_to_int('amongus baby'))

numbered_dataset = str_to_int(text)

# here is the dataset as a list of numbers
print('our dataset turned into numbers:', numbered_dataset[:100])



does our encoder work?  [39, 51, 53, 52, 45, 59, 57, 1, 40, 39, 40, 63]
our dataset turned into numbers: [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59]


In [70]:
# DECODER - ints back to str
# just in case

int_to_str = { index: letter for index, letter in enumerate(vocab_letters)}

# here is the mapping of the numbers back to letters 
print(int_to_str)

decoded = lambda i : ''.join([int_to_str[integer] for integer in i ])

# let's make sure the decoder works (a bare call here would silently discard the result)
assert decoded([39, 51, 53, 52, 45, 59, 57, 1, 40, 39, 40, 63]) == 'amongus baby'

decoded_dataset = decoded(numbered_dataset)

print(decoded_dataset[:200])

{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i', 48: 'j', 49: 'k', 50: 'l', 51: 'm', 52: 'n', 53: 'o', 54: 'p', 55: 'q', 56: 'r', 57: 's', 58: 't', 59: 'u', 60: 'v', 61: 'w', 62: 'x', 63: 'y', 64: 'z'}
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


In [None]:
# MAKE A FUNCTION THAT ENCODES STR -> IDX AND
# DECODES  IDX -> BACK TO STR
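
One possible way to fill in this cell, as a minimal sketch: wrap both mappings in a single helper that returns an encoder and a decoder. The names `build_codec`, `stoi`, `itos`, `encode`, and `decode` are hypothetical and not from the notebook, and `sample` is a short stand-in for the full `text`.

```python
def build_codec(text):
    """Return (encode, decode) functions for a character-level vocab."""
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}   # str -> idx
    itos = {i: ch for i, ch in enumerate(vocab)}   # idx -> str

    def encode(s):
        # raises KeyError for characters that are not in the vocab
        return [stoi[ch] for ch in s]

    def decode(ids):
        return ''.join(itos[i] for i in ids)

    return encode, decode


# short stand-in for the dataset; in the notebook this would be `text`
sample = "First Citizen:\nSpeak, speak."
encode, decode = build_codec(sample)

ids = encode("speak")
print(ids)
print(decode(ids))   # speak

# round trip: decoding the encoded text gives the original back
assert decode(encode(sample)) == sample
```

Keeping both dicts inside one function means the encoder and decoder can never get out of sync with each other.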