# Quick Transformer lol

In [48]:
import torch
import torch.nn as nn

neurons = nn.Linear(1,2)

embedding = nn.Embedding(10, 3)

print(neurons.weight)



Parameter containing:
tensor([[0.9651],
        [0.0068]], requires_grad=True)


In [49]:
# an Embedding matrix of 10 embeddings with 3 dimensions each
embedding = nn.Embedding(10, 3)

# a batch of 2 samples of 1 index each
# (named `indices` so it doesn't shadow Python's built-in input())
indices = torch.LongTensor([[9],[0]])

print('THE INPUTS \n',indices)

# what does the embedding matrix look like?
print('THE EMBEDDING MATRIX \n',embedding.weight)

# what is the result of the input embedding?
print('EMBEDDINGS FOR THE GIVEN INPUTS \n',embedding(indices))

THE INPUTS 
 tensor([[9],
        [0]])
THE EMBEDDING MATRIX 
 Parameter containing:
tensor([[-0.0526,  0.2111,  1.5163],
        [-1.0557, -1.1104,  0.0187],
        [-0.8817, -0.3208,  0.3500],
        [-0.5877,  1.2597, -0.7540],
        [ 0.2042, -0.0193, -1.6395],
        [-0.1164, -1.6652, -0.8704],
        [ 2.1463,  1.0753, -0.2476],
        [ 1.0732,  1.5695,  0.9333],
        [ 1.8121,  1.2056,  0.6709],
        [-0.0588,  1.2817,  0.6878]], requires_grad=True)
EMBEDDINGS FOR THE GIVEN INPUTS 
 tensor([[[-0.0588,  1.2817,  0.6878]],

        [[-0.0526,  0.2111,  1.5163]]], grad_fn=<EmbeddingBackward0>)


nn.Embedding

The embedding module creates a lookup table with a configurable number of rows and columns.
- the number of rows is the number of unique tokens in the vocabulary,
- the number of columns is the size of each embedding vector.

The embedding module is initialized with random values, and the embeddings are learned during training.
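
To make the "lookup table" idea concrete, here is a minimal sketch (the seed and variable names are just assumptions for illustration). It checks that calling the embedding on a tensor of indices returns the same vectors as plain row indexing into `embedding.weight`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so repeated runs build the same random table

vocab_size, embed_dim = 10, 3            # rows and columns of the lookup table
embedding = nn.Embedding(vocab_size, embed_dim)

indices = torch.tensor([[9], [0]])       # batch of 2 samples, 1 index each
looked_up = embedding(indices)           # shape (2, 1, 3)

# Plain row indexing into the weight matrix picks out the same rows.
manual = embedding.weight[indices]

print(looked_up.shape)                   # torch.Size([2, 1, 3])
print(torch.equal(looked_up, manual))    # True
```

So index 9 just pulls out row 9 of the table, and index 0 pulls out row 0, which matches the printed output above.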

 

## Import the dataset

In [50]:
with open('input.txt','r', encoding='utf-8') as f:
    text = f.read()


In [51]:
print('the length of the text is ',len(text))
print(text[:200])

the length of the text is  1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


## OK, let's tokenize this bih

The embedding layer is a lookup table:

- the input to the embedding layer is a list of indices
- the output is a list of token embeddings.

Let's turn our dataset into a list of indices and feed it to the embedding layer.

#### how do we do this?
(We're doing a character-level model)

- [x] find all possible vocab entries (letters) in the dataset
    - separate the dataset into letters, make a set of them, then a sorted list
- tokenize the dataset wrt the vocab
    - enumerate every element in the list
    - make a function that returns the index num for each string
        - and a string for each num
- pass all of these token indices into the embedding table
    - turn the entire dataset into nums wrt this tokenization strat
- get back all the embeddings
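
The first steps of that plan can be sketched on a toy string before touching the real dataset. This is only an illustration: `sample_text`, `encode`, and `decode` are hypothetical names, not the ones used in the cells below.

```python
sample_text = "hello world"   # hypothetical stand-in for the real dataset

# 1. vocabulary: every unique character, sorted so the indices are stable
vocab = sorted(set(sample_text))

# 2. lookup tables in both directions
char_to_index = {ch: i for i, ch in enumerate(vocab)}
index_to_char = {i: ch for i, ch in enumerate(vocab)}

def encode(s):
    """Turn a string into a list of vocabulary indices."""
    return [char_to_index[ch] for ch in s]

def decode(indices):
    """Turn a list of vocabulary indices back into a string."""
    return "".join(index_to_char[i] for i in indices)

tokens = encode(sample_text)
print(vocab)                          # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(tokens)                         # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decode(tokens) == sample_text)  # True
```

The round trip (encode, then decode) giving back the original string is a cheap sanity check that the two tables agree.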

In [52]:
# FIND ALL POSSIBLE VOCAB WORDS IN DATASET 

from pprint import pprint

vocab_letters  = sorted(list(set(text)))
print(''.join(vocab_letters))
print(len(vocab_letters))

# this is the entirety of our vocab for our character-level transformer


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


##### MAKE A FUNCTION THAT ENCODES STR -> IDX AND DECODES IDX -> BACK TO STR

In [53]:
# INDEX ALL OF THE LETTERS
# LETTER ENCODER

# maps each letter to its index (despite the name, letter_list is a dict)
letter_list = {letter:index for index, letter in enumerate(vocab_letters)}

str_to_int = lambda s: [letter_list[letter] for letter in s]

print('does our encoder work? ',str_to_int('amongus baby'))

numbered_dataset = str_to_int(text)

# here is the dataset as a list of numbers
print('our dataset turned into numbers:', numbered_dataset[:100])



does our encoder work?  [39, 51, 53, 52, 45, 59, 57, 1, 40, 39, 40, 63]
our dataset turned into numbers: [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59]


In [54]:
# DECODER - ints back to str
# just in case

int_to_str = { index: letter for index, letter in enumerate(vocab_letters)}

# here is the mapping of the numbers back to letters 
print(int_to_str)

decoded = lambda i : ''.join([int_to_str[integer] for integer in i ])

# lets make sure the decoder works 
decoded([39, 51, 53, 52, 45, 59, 57, 1, 40, 39, 40, 63])

decoded_dataset = decoded(numbered_dataset)

print(decoded_dataset[:200])

{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i', 48: 'j', 49: 'k', 50: 'l', 51: 'm', 52: 'n', 53: 'o', 54: 'p', 55: 'q', 56: 'r', 57: 's', 58: 't', 59: 'u', 60: 'v', 61: 'w', 62: 'x', 63: 'y', 64: 'z'}
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


# let's make an embedding matrix

where each character has its own embedding vector.
let's make the vectors of depth 5!

In [110]:
import torch.nn as nn 

vocab_length = len(vocab_letters)

embedding_matrix = nn.Embedding(vocab_length, 5, sparse=True)

# ok so now every single element of vocab has a respective embedding 

In [56]:
# quick example 

# let's tokenize a word and embed it

word = 'amongussy'

encoded_word = str_to_int(word)

print('this is the encoded word:' ,encoded_word)

# get the embeddings for each one of these tokens 

# turn list into a tensor (basically the same as a list)
encoded_word_tensor = torch.LongTensor(encoded_word)

embedded_word = embedding_matrix(encoded_word_tensor)

print('this is the word embedding: \n', embedded_word)

# basically we take the index of each element and retrieve that row's embedding from the matrix
# reminder --> the embedding class starts as a random matrix and, during training, converges to meaningful, rich embeddings for each character!!!


this is the encoded word: [39, 51, 53, 52, 45, 59, 57, 57, 63]
this is the word embedding: 
 tensor([[ 0.5132,  0.1527, -2.1701, -0.8355,  0.6727],
        [-0.6289, -0.3758, -0.6985,  0.1369,  1.8234],
        [-0.6090, -0.8580, -0.5560,  0.1076, -1.4859],
        [ 0.1231,  0.1578, -0.5168, -0.1400, -0.6233],
        [-0.8951,  0.3108, -0.3624,  0.6059, -0.0674],
        [-2.0045,  1.6886,  0.7344,  1.0055, -0.4711],
        [ 0.1327, -0.9947,  0.0762,  0.7978, -0.3777],
        [ 0.1327, -0.9947,  0.0762,  0.7978, -0.3777],
        [ 1.2512, -0.6634, -0.6068,  2.1761, -0.1148]],
       grad_fn=<EmbeddingBackward0>)


# let's encode the whole damn dataset


In [86]:
# let's encode the whole damn dataset

encoded_dataset = str_to_int(text)

print('first thousand elements of the dataset ',encoded_dataset[:1000])

data = torch.tensor(encoded_dataset, dtype=torch.long)

print(data)

# pprint('this is the encoded dataset \n', str(encoded_dataset))
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

first thousand elements of the dataset  [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50, 1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58, 53, 1, 42, 47, 43, 1, 58, 46, 39, 52, 1, 58, 53, 1, 44, 39, 51, 47, 57, 46, 12, 0, 0, 13, 50, 50, 10, 0, 30, 43, 57, 53, 50, 60, 43, 42, 8, 1, 56, 43, 57, 53, 50, 60, 43, 42, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 18, 47, 56, 57, 58, 6, 1, 63, 53, 59, 1, 49, 52, 53, 61, 1, 15, 39, 47, 59, 57, 1, 25, 39, 56, 41, 47, 59, 57, 1, 47, 57, 1, 41, 46, 47, 43, 44, 1, 43, 52, 43, 51, 63, 1, 58, 53, 1, 58, 46, 43, 1, 54, 43, 53, 54, 50, 43, 8, 0, 0, 13, 50, 5

In [115]:
# let's retrieve the embeddings for the first sequence_len (16) elements of the dataset

sequence_len = 16

# this pulls sequence_len amount of embeddings out of the matrix from the dataset indexes 
# the rows are the embeddings and columns are the features 
embedding_matrix(data[:sequence_len])

tensor([[-0.5857, -1.8396,  1.0277,  1.0311, -1.2523],
        [-0.3580, -0.3273, -0.0891,  0.0411, -1.9448],
        [-0.1796,  0.3194, -1.0722,  0.2712, -0.2294],
        [-0.0531, -0.3701,  1.4484, -0.3733,  0.1365],
        [ 1.0384,  1.7280, -0.3597,  0.6417,  1.0337],
        [ 1.3725,  1.0016,  1.6514, -0.8714,  0.5868],
        [ 0.0077,  0.1219, -1.0376,  1.0543, -0.7141],
        [-0.3580, -0.3273, -0.0891,  0.0411, -1.9448],
        [ 1.0384,  1.7280, -0.3597,  0.6417,  1.0337],
        [-0.3580, -0.3273, -0.0891,  0.0411, -1.9448],
        [-2.1950,  0.8369,  0.3031, -1.4039,  0.4448],
        [-0.6527, -0.1350, -1.6862,  1.4711, -0.5694],
        [-0.2396, -1.5007, -0.2766,  0.3701, -1.2415],
        [ 1.6063, -0.0322,  0.8690,  0.2027,  1.0678],
        [ 1.1143,  0.8949, -1.0565,  1.8408,  0.3969],
        [-0.4171, -0.0582,  1.0479,  0.8606, -0.2880]],
       grad_fn=<EmbeddingBackward0>)

#### Embed the whole dataset

In [117]:
embedded_dataset= embedding_matrix(data)

# now each letter has its own embedding vector 
embedded_dataset.shape

torch.Size([1115394, 5])

# dataset is embedded!!

pass into transformer block

what's in the **block**??

#### 1. attention head 
#### 2. feedforward layer 

### now we gotta figure out how to do a forward pass
- what does the forward pass do??



### Scratch that

let's implement a head of attention first, then we can put this into the model
- why? bc i felt like it lel

---
# Attention
#### What does a head of attention do??

it creates 
1. queries ❓
2. keys 🔑
3. values 🗣️

we multiply k and q to get the affinities/attention scores between embeddings

softmax the attention scores to get weights

multiply the weights & values to get the ΔEmbedding

add the ΔEmbedding to the Embeddings, over and over (once per attention layer)
- slowly add more and more semantic meaning to the embeds
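
Those steps can be sketched as a single head in a few lines. This is a rough sketch under assumptions, not the final model: `head_size = 4`, the division by `sqrt(head_size)`, the causal mask (each position only looks at itself and earlier positions, as in GPT-style models), and projecting values back to the embedding width so they can be added to `x` are all choices made here for illustration. `x` is random data standing in for the embedded dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, embed_dim, head_size = 16, 5, 4      # head_size is an assumed value

x = torch.randn(seq_len, embed_dim)           # stand-in for 16 embedded tokens

# Learned projections that create queries, keys, and values.
query = nn.Linear(embed_dim, head_size, bias=False)
key = nn.Linear(embed_dim, head_size, bias=False)
# Assumption: values keep the embedding width so the result can be added to x.
value = nn.Linear(embed_dim, embed_dim, bias=False)

q, k, v = query(x), key(x), value(x)

# Affinities between every pair of positions, scaled (assumed) by sqrt(head_size).
scores = q @ k.T / head_size ** 0.5           # (seq_len, seq_len)

# Assumed causal mask: hide future positions before the softmax.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = F.softmax(scores, dim=-1)           # each row sums to 1
delta = weights @ v                           # the ΔEmbedding, (seq_len, embed_dim)
updated = x + delta                           # add ΔEmbedding back onto the embeddings

print(weights.shape, delta.shape, updated.shape)
```

Stacking several of these (each followed by a feedforward layer) is what "adding the ΔEmbedding over and over" would look like in a full model.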

---

# Actually jk

let's get this thing to create completions.

even if they suck, let's just get it to work first

In [None]:
# we 