# Step 1 - Experimenting with Text files
The model used in this first section is a [Bigram](https://web.stanford.edu/~jurafsky/slp3/3.pdf) model, a type of natural language processing (NLP) model that predicts the next token from only the immediately preceding one. The linked chapter describes bigrams over words; in this notebook the tokens are individual characters.
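
To make the idea concrete, here is a minimal sketch of a count-based bigram predictor that works on characters rather than words. It is not the model trained in this notebook; `bigram_counts`, `most_likely_next`, and the toy `sample` string are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, ch):
    """Return the character seen most often after `ch`, or None if `ch` was never seen."""
    if ch not in counts:
        return None
    return counts[ch].most_common(1)[0][0]

sample = "the wizard of oz and the land of oz"  # toy text, not the book
counts = bigram_counts(sample)
print(most_likely_next(counts, "t"))  # 't' is always followed by 'h' in the sample
```

A neural bigram model plays the same role as the `counts` table here, but learns the next-token scores during training instead of counting them directly.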


The text used is an Oz book, saved here as `wizard_of_oz.txt`, which you can download for free from Project Gutenberg (the file itself opens with the title "Dorothy and the Wizard in Oz").
<br>Click the link and make sure you select "Plain Text UTF-8".
<br>https://www.gutenberg.org/ebooks/22566



In [1]:
# Read the whole text file into a single string
with open('data/wizard_of_oz.txt', 'r', encoding='utf-8') as f:
  text = f.read()
# print(text[:200])
# Collect all unique characters in the text as a set, then sort them
chars = sorted(set(text))
print(chars)
# 81 unique characters, including the byte-order mark '\ufeff' at the start of the file
# (opening with encoding='utf-8-sig' would strip it)
print(len(chars))

['\n', ' ', '!', '"', '&', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\ufeff']
81


### PyTorch deep learning library, used to train our model with CUDA GPU tensors
https://pytorch.org/tutorials/beginner/basics/intro.html


In [7]:
# Use the CUDA GPU if one is available, otherwise fall back to the CPU
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


### Tokenizing our text with Encoders and Decoders
After gathering all the unique characters in our text, we need to convert them into tokens. For this we need an encoder and a decoder. <br>
**Encoder**: Converts our text values into integers (makes the text machine readable)<br>
**Decoder**: Converts our integers back into text values (makes the output human readable after our model completes its training)



In [8]:
# Encoder and decoder: map each character to an integer ID and back
string_to_int = { ch:i for i, ch in enumerate(chars) }
int_to_string = { i:ch for i, ch in enumerate(chars) }
encode = lambda s: [string_to_int[c] for c in s]            # text -> list of ints
decode = lambda ids: ''.join(int_to_string[i] for i in ids)  # list of ints -> text

# Optional round-trip check:
# encoded_hello = encode('hello')
# decoded_hello = decode(encoded_hello)
# print('Encoded hello =', encoded_hello)
# print('Decoded hello =', decoded_hello)

# Encode the entire text as a 1-D tensor of integer token IDs
data = torch.tensor(encode(text), dtype=torch.long)
print(data[:100])

tensor([80,  1,  1, 28, 39, 42, 39, 44, 32, 49,  1, 25, 38, 28,  1, 44, 32, 29,
         1, 47, 33, 50, 25, 42, 28,  1, 33, 38,  1, 39, 50,  0,  0,  1,  1, 26,
        49,  0,  0,  1,  1, 36, 11,  1, 30, 42, 25, 38, 35,  1, 26, 25, 45, 37,
         0,  0,  1,  1, 25, 45, 44, 32, 39, 42,  1, 39, 30,  1, 44, 32, 29,  1,
        47, 33, 50, 25, 42, 28,  1, 39, 30,  1, 39, 50,  9,  1, 44, 32, 29,  1,
        36, 25, 38, 28,  1, 39, 30,  1, 39, 50])
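
As a quick sanity check of the encoder/decoder idea, the sketch below runs a round trip on a tiny vocabulary. It assumes a hypothetical toy string instead of the book, so the integer IDs differ from the ones in the tensor above; the `toy_*` names are made up for this example.

```python
toy_text = "hello oz"  # hypothetical sample, not the book
toy_chars = sorted(set(toy_text))  # [' ', 'e', 'h', 'l', 'o', 'z']

# Same mapping scheme as the notebook: position in the sorted vocabulary is the ID
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Text -> list of integer IDs. Raises KeyError for characters outside the vocabulary."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """List of integer IDs -> text."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(ids)              # [2, 1, 3, 3, 4]
print(toy_decode(ids))  # hello
```

Because the vocabulary is built from the text itself, encoding a character that never appears in the text fails with a `KeyError`.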


### Create a train/test split (80/20)
We will divide our data into two parts: the first 80% is used for training our model, and the remaining 20% (named `val_data` in the code) is held out for evaluation, so it is data the model never sees during training.

In [3]:
# 80/20 Split
n = int(0.8*len(data))
train_data = data[:n]
val_data = data[n:]

### Chunking our characters into block sizes then chunking our blocks into batch sizes

In [9]:
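# block_size: how many consecutive characters make up one chunk (the context length)
# batch_size: how many chunks are stacked together (not used yet in this cell)
block_size = 8
batch_size = 4

# Targets are the inputs shifted one position to the right
x = train_data[:block_size]
y = train_data[1:block_size+1]

# Each prefix of x is a context whose target is the character that follows it
for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print('When input is', context, 'target is', target)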

When input is tensor([80]) target is tensor(1)
When input is tensor([80,  1]) target is tensor(1)
When input is tensor([80,  1,  1]) target is tensor(28)
When input is tensor([80,  1,  1, 28]) target is tensor(39)
When input is tensor([80,  1,  1, 28, 39]) target is tensor(42)
When input is tensor([80,  1,  1, 28, 39, 42]) target is tensor(39)
When input is tensor([80,  1,  1, 28, 39, 42, 39]) target is tensor(44)
When input is tensor([80,  1,  1, 28, 39, 42, 39, 44]) target is tensor(32)
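
The cell above builds a single block, and `batch_size` is not used yet. One common way to stack several blocks into a batch is sketched below. This is an assumption about where the notebook is heading, not the author's code: `get_batch` is a hypothetical helper, and `toy_data` stands in for `train_data` so the sketch runs on its own.

```python
import torch

torch.manual_seed(0)  # only so the sketch is repeatable

block_size = 8
batch_size = 4
# Stand-in for train_data (assumption): the integers 0..99
toy_data = torch.arange(100, dtype=torch.long)

def get_batch(data, block_size, batch_size):
    """Sample batch_size random blocks and their one-step-shifted targets."""
    # Random start offsets that leave room for a full block plus one target
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in starts])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in starts])
    return x, y

xb, yb = get_batch(toy_data, block_size, batch_size)
print(xb.shape, yb.shape)  # both are (batch_size, block_size)
```

Each row of `xb` is one context block and the matching row of `yb` holds its targets, so the per-position relationship printed above holds for every row of the batch.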


In [10]:
# 6 random integers drawn from [-100, 100) (the upper bound is exclusive)
random = torch.randint(-100, 100, (6,))
random

tensor([ 51, -22, -67,  13, -53,  -7])

In [11]:
tensor = torch.tensor([[0.1, 1.2], [2.2, 3.1], [4.9, 5.2]])
tensor

tensor([[0.1000, 1.2000],
        [2.2000, 3.1000],
        [4.9000, 5.2000]])

In [12]:
zeros = torch.zeros(2,3)
zeros

tensor([[0., 0., 0.],
        [0., 0., 0.]])

In [13]:
ones = torch.ones(2,5)
ones

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

In [14]:
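# torch.empty allocates memory without initializing it, so the values shown are arbitrary
# (named `empty` rather than `input` to avoid shadowing Python's built-in input())
empty = torch.empty(2,3)
empty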

tensor([[-1.0362e+11,  1.9128e-42,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00]])

In [15]:
arange = torch.arange(5)
arange

tensor([0, 1, 2, 3, 4])

In [16]:
# 5 evenly spaced values from 3 to 10, both endpoints included
linspace = torch.linspace(3,10, steps=5)
linspace

tensor([ 3.0000,  4.7500,  6.5000,  8.2500, 10.0000])