# Step 1 - Experimenting with Text files
The model used in this first section is a [Bigram](https://web.stanford.edu/~jurafsky/slp3/3.pdf) model, a type of natural language processing (NLP) model that predicts the next token from the immediately preceding one. Bigram models are usually described in terms of words; in this notebook the tokens are individual characters.
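
To make the bigram idea concrete before any neural network is involved, here is a minimal sketch that counts which character follows which in a toy string and predicts the most frequent follower. The `sample` text and the `predict_next` helper are hypothetical and only for illustration; the notebook itself learns these statistics with an embedding table rather than explicit counts.

```python
from collections import Counter, defaultdict

sample = "the wizard of oz and the land of oz"  # hypothetical toy text

# Count how often each character follows each other character.
follow_counts = defaultdict(Counter)
for prev_ch, next_ch in zip(sample, sample[1:]):
    follow_counts[prev_ch][next_ch] += 1

def predict_next(ch):
    """Return the character most often seen right after `ch`."""
    return follow_counts[ch].most_common(1)[0][0]

print(predict_next("t"))  # h
print(predict_next("h"))  # e
```

The prediction depends only on the single preceding character, which is exactly the limitation (and the simplicity) of a bigram model.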


The text file used is the Oz book "Dorothy and the Wizard in Oz", which you can download for free from Project Gutenberg.
<br>Open the link below, select "Plain Text UTF-8", and save the file as `data/wizard_of_oz.txt`.
<br>https://www.gutenberg.org/ebooks/22566



In [26]:
# Load the text file (saved as data/wizard_of_oz.txt)
with open('data/wizard_of_oz.txt', 'r', encoding='utf-8') as f:
  text = f.read()
# print(text[:200])
# Collect every unique character in the text as a sorted list
chars = sorted(set(text))
vocab_size = len(chars)
print(chars)
print(len(chars)) # 81 unique character values

['\n', ' ', '!', '"', '&', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\ufeff']
81
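
The last entry in the list above, `'\ufeff'`, is a byte order mark (BOM) stored at the start of the file rather than a character of the book. The sketch below, which writes a small hypothetical file to a temporary directory, shows how Python's `utf-8-sig` codec strips the mark while plain `utf-8` keeps it. Switching the notebook to `utf-8-sig` would shrink the vocabulary by one entry, so the outputs shown here would no longer match exactly.

```python
import os
import tempfile

# Write a small hypothetical file that starts with a UTF-8 byte order mark.
path = os.path.join(tempfile.mkdtemp(), 'bom_sample.txt')
with open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbfOz')

with open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read()       # BOM is kept as '\ufeff'
with open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()    # BOM is stripped

print(sorted(set(with_bom)))     # ['O', 'z', '\ufeff']
print(sorted(set(without_bom)))  # ['O', 'z']
```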


### PyTorch - Deep learning framework used to train neural networks
PyTorch lets us train our model on a GPU using CUDA tensors.<br>
https://pytorch.org/tutorials/beginner/basics/intro.html


In [27]:
# Use the GPU through CUDA if it is available, otherwise fall back to the CPU
import torch
import torch.nn as nn
from torch.nn import functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


### Speed Comparison - PyTorch Tensors vs NumPy Arrays for working with multidimensional data

In [33]:
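# Speed check - PyTorch on the GPU (CUDA) vs NumPy on the CPU
import time
import numpy as np

# Same shapes and dtype on both sides so the comparison is like for like
torch_rand1 = torch.rand(100, 100, 100, 100).to(device)
torch_rand2 = torch.rand(100, 100, 100, 100).to(device)
np_rand1 = np.random.rand(100, 100, 100, 100).astype(np.float32)
np_rand2 = np.random.rand(100, 100, 100, 100).astype(np.float32)

start_time = time.perf_counter()

rand = (torch_rand1 @ torch_rand2)  # batched matrix multiply
# CUDA kernels run asynchronously, so wait for the GPU before stopping the clock
if device == 'cuda':
  torch.cuda.synchronize()

end_time = time.perf_counter()

elapsed_time = end_time - start_time
print(f'Pytorch CUDA: {elapsed_time:.8f}s')

start_time = time.perf_counter()

rand = np.matmul(np_rand1, np_rand2)  # the same batched matrix multiply in NumPy
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f'Numpy: {elapsed_time:.8f}s')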

Pytorch CUDA: 0.01599765s
Numpy: 0.16937160s


### Tokenizing our text with Encoders and Decoders
After gathering all the unique characters in our text, we need to convert them into tokens. For this we need an encoder and a decoder.<br>
**Encoder**: converts text into integers (makes it machine readable)<br>
**Decoder**: converts integers back into text (makes the model's output human readable after training)

https://www.datacamp.com/blog/what-is-tokenization#:~:text=Imagine%20you%27re%20trying,the%20two%20contexts.


In [29]:
# Encoder and Decoder logic
string_to_int = { ch:i for i, ch in enumerate(chars) }
int_to_string = { i:ch for i, ch in enumerate(chars) }
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_string[i] for i in l])

# encoded_hello = encode('hello')
# decoded_hello = decode(encoded_hello)
# print(f'Encoded hello =', encoded_hello)
# print(f'Decoded hello =', decoded_hello)

data = torch.tensor(encode(text), dtype=torch.long)
print(data[:100])

tensor([80,  1,  1, 28, 39, 42, 39, 44, 32, 49,  1, 25, 38, 28,  1, 44, 32, 29,
         1, 47, 33, 50, 25, 42, 28,  1, 33, 38,  1, 39, 50,  0,  0,  1,  1, 26,
        49,  0,  0,  1,  1, 36, 11,  1, 30, 42, 25, 38, 35,  1, 26, 25, 45, 37,
         0,  0,  1,  1, 25, 45, 44, 32, 39, 42,  1, 39, 30,  1, 44, 32, 29,  1,
        47, 33, 50, 25, 42, 28,  1, 39, 30,  1, 39, 50,  9,  1, 44, 32, 29,  1,
        36, 25, 38, 28,  1, 39, 30,  1, 39, 50])
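
A quick sanity check on the encoding is to decode the first tokens of the tensor above back into text. The sketch below rebuilds the 81-character vocabulary by hand from the character list printed earlier (an assumption made so it runs without the text file) and decodes the first 31 token ids.

```python
import string

# Rebuild the 81-character vocabulary printed earlier (written out by hand
# here so the sketch runs without the text file).
chars = sorted(set(
    '\n !"&\'()*,-.' + string.digits + ':;?'
    + string.ascii_uppercase + '[]_' + string.ascii_lowercase + '\ufeff'
))
int_to_string = {i: ch for i, ch in enumerate(chars)}

def decode(ids):
    """Map a list of token ids back to a string."""
    return ''.join(int_to_string[i] for i in ids)

first_tokens = [80, 1, 1, 28, 39, 42, 39, 44, 32, 49, 1, 25, 38, 28, 1,
                44, 32, 29, 1, 47, 33, 50, 25, 42, 28, 1, 33, 38, 1, 39, 50]
print(len(chars))                  # 81
print(repr(decode(first_tokens)))  # '\ufeff  DOROTHY AND THE WIZARD IN OZ'
```

Token 80 is the byte order mark at the very start of the file, and token 1 is the space character.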


### Grouping our characters into blocks, then grouping our blocks into batches

In [35]:
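# block_size is how many consecutive characters form one training sequence;
# batch_size is how many of those sequences are processed together
block_size = 8
batch_size = 4

# Inputs are the first block; targets are the same block shifted one character ahead
x = data[:block_size]
y = data[1:block_size+1]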

for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print('When input is', context, 'prediction is', target)

When input is tensor([80]) prediction is tensor(1)
When input is tensor([80,  1]) prediction is tensor(1)
When input is tensor([80,  1,  1]) prediction is tensor(28)
When input is tensor([80,  1,  1, 28]) prediction is tensor(39)
When input is tensor([80,  1,  1, 28, 39]) prediction is tensor(42)
When input is tensor([80,  1,  1, 28, 39, 42]) prediction is tensor(39)
When input is tensor([80,  1,  1, 28, 39, 42, 39]) prediction is tensor(44)
When input is tensor([80,  1,  1, 28, 39, 42, 39, 44]) prediction is tensor(32)


### Create a train/test split (80/20)
We will divide our data into two parts: 80% is used to train the model, and the remaining 20%, which the model never sees during training, is used to test it.<br>
https://builtin.com/data-science/train-test-split

In [31]:
# 80/20 Split
n = int(0.8*len(data))
train_data = data[:n]
test_data = data[n:]

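def get_batch(split):
  # Pick the split to sample from
  data = train_data if split == 'train' else test_data
  # Random start positions, leaving room for a full block plus its shifted target
  ix = torch.randint(len(data) - block_size, (batch_size,))
  print(ix)
  # Inputs: batch_size blocks of block_size tokens each
  x = torch.stack([data[i:i+block_size] for i in ix])
  # Targets: the same blocks shifted one token to the right
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  x, y = x.to(device), y.to(device)
  return x, y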

x, y = get_batch('train')
print('inputs: ')
print(x)
print('Predictions: ')
print(y)

tensor([  4383,  55969,  91389, 130744])
inputs: 
tensor([[16,  1,  1, 44, 32, 29,  1, 46],
        [73, 54, 67, 73,  9,  1, 73, 74],
        [ 9,  1, 54, 67, 57,  1, 76, 58],
        [72, 76, 68, 71, 57, 10, 55, 65]], device='cuda:0')
Predictions: 
tensor([[ 1,  1, 44, 32, 29,  1, 46, 29],
        [54, 67, 73,  9,  1, 73, 74, 56],
        [ 1, 54, 67, 57,  1, 76, 58,  1],
        [76, 68, 71, 57, 10, 55, 65, 54]], device='cuda:0')
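
In the output above, each row of the second tensor is the corresponding row of the first shifted one position to the right. The sketch below checks that property on a toy tensor. It is a CPU-only illustration: `toy_data` and `sample_batch` are hypothetical stand-ins for the encoded text and for `get_batch`.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

toy_data = torch.arange(100)  # hypothetical stand-in for the encoded text
block_size, batch_size = 8, 4

def sample_batch(data):
    """Sample batch_size random blocks and their one-step-shifted targets."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

x, y = sample_batch(toy_data)
print(x.shape, y.shape)                  # torch.Size([4, 8]) torch.Size([4, 8])
print(torch.equal(x[:, 1:], y[:, :-1]))  # True
```

Because `toy_data` is just the integers 0 to 99, every target is its input plus one, which makes the shift easy to see by eye.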


### Creating our Bigram model and defining a Forward pass function
Forward pass: the input data is passed through the model's layers, which apply mathematical transformations (weights, biases, activation functions) to produce predictions. In this bigram model the only layer is an embedding table, so the forward pass is a single lookup that returns, for each input token, a row of scores (logits) over the possible next tokens.

In [37]:
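class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    # One row per token; each row holds the logits for every possible next token
    self.token_embeddings_table = nn.Embedding(vocab_size, vocab_size)

  def forward_pass(self, index, targets):
    # index: (batch, block) token ids -> logits: (batch, block, vocab_size)
    # targets is accepted but not used yet
    logits = self.token_embeddings_table(index)

    return logits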
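
The sketch below shows the shapes involved in this lookup and one common way the `targets` argument can be put to use, namely a cross-entropy loss. It is an illustration under stated assumptions, not the notebook's final model: the class name `TinyBigram`, the random inputs, and the loss computation are all hypothetical additions. The method is named `forward` here because that is the name `nn.Module` calls when the model is invoked as `model(x, y)`.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, batch_size, block_size = 81, 4, 8

class TinyBigram(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        logits = self.table(index)  # (B, T, C)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        # cross_entropy expects (N, C) logits and (N,) class ids
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = TinyBigram(vocab_size)
x = torch.randint(vocab_size, (batch_size, block_size))  # random token ids
y = torch.randint(vocab_size, (batch_size, block_size))
logits, loss = model(x, y)
print(logits.shape)  # torch.Size([4, 8, 81])
print(loss.item())
```

Each of the 4 x 8 input tokens gets its own row of 81 logits, one score per character in the vocabulary.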