### Notebook description



### Load data

- Load the dataset used to train the model

In [1]:
with open('../data/raw/input.txt', 'r') as f:
    text = f.read()

In [2]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



### Characters

- Let's check all the characters that the model will be able to see

In [3]:
chars = sorted(set(text))
print(''.join(chars))
print(f"Number of unique characters: {len(chars)}")


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Number of unique characters: 65


### Encoding characters

- Let's create a strategy to transform text into integers;
- Using the list of characters, we can create a map from characters to integers (and back)

In [4]:
stoi = { c: i for i, c in enumerate(chars) }
itos = { i: c for i, c in enumerate(chars) }

encode = lambda x: [stoi[c] for c in x]
decode = lambda x: ''.join([itos[i] for i in x])

print(encode('hello'))
print(decode(encode('hello')))

[46, 43, 50, 50, 53]
hello
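
As a quick sanity check on the mapping above, the sketch below rebuilds the same kind of lookup tables on a toy string and confirms that decoding an encoded string returns the original. The string `sample` and the `toy_*` names are illustrative and not part of the notebook. It also shows that a character missing from the vocabulary raises a `KeyError`, which is worth keeping in mind when encoding new prompts.

```python
# Toy vocabulary built the same way as in the notebook (assumed sample text).
sample = "hello world"
toy_chars = sorted(set(sample))

toy_stoi = {c: i for i, c in enumerate(toy_chars)}  # character -> integer
toy_itos = {i: c for i, c in enumerate(toy_chars)}  # integer -> character

def toy_encode(s):
    """Map a string to a list of integer ids."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return ''.join(toy_itos[i] for i in ids)

# Round trip: decoding the encoding gives back the input.
round_trip = toy_decode(toy_encode("hello"))
print(round_trip)

# Characters outside the vocabulary cannot be encoded.
try:
    toy_encode("hello!")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print(unknown_ok)
```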


### Tokenization & train/validation split

In [5]:
import torch

text_encoded = torch.tensor(encode(text), dtype=torch.long)
print(text_encoded[:100])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [6]:
train_size = int(0.9 * len(text_encoded))

train_data = text_encoded[:train_size]
val_data = text_encoded[train_size:]

### Set block size and batch size

* **Block Size**: The block size is the context length: the number of consecutive tokens in each training sequence, and therefore the longest context the model can see when predicting the next token. A chunk of `block_size + 1` tokens yields `block_size` training examples, one for each context length from 1 to `block_size`, because every position in the chunk is trained to predict the token that follows it. Training on short random chunks instead of the whole text keeps each step cheap and exposes the model to contexts of every length up to the block size.
* **Batch Size**: The batch size is the number of independent sequences stacked into one tensor and processed in parallel in a single forward and backward pass. During training, the model makes predictions on the batch, calculates the loss, and then updates the weights once based on that loss. Batching makes efficient use of hardware such as GPUs, which are designed for parallel computation. Choosing the batch size is a trade-off: a small batch gives noisier gradient estimates and can slow convergence, while a large batch gives smoother gradients but needs more memory and computation per step.
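
To make the role of the block size concrete, the sketch below takes the first nine token ids of the encoded text printed earlier and lists the training examples hidden in that single chunk. It uses plain Python lists rather than tensors, and the variable names are illustrative.

```python
# First nine token ids of the encoded text (block_size + 1 tokens).
tokens = [18, 47, 56, 57, 58, 1, 15, 47, 58]
block_size = 8

# Inputs and targets are the same chunk, shifted by one position.
x = tokens[:block_size]
y = tokens[1:block_size + 1]

examples = []
for t in range(block_size):
    context = x[:t + 1]  # tokens seen so far (length 1 .. block_size)
    target = y[t]        # the token that follows this context
    examples.append((context, target))
    print(f"context {context} -> target {target}")
```

A single chunk therefore contributes `block_size` prediction problems, and a batch contributes `batch_size * block_size` of them.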


In [7]:
torch.manual_seed(24022024)

batch_size = 4
block_size = 8

def get_batch(split):
    """Return a random batch of inputs and targets, each of shape (batch_size, block_size).
    The targets are the inputs shifted one position ahead in the data.

    split: str, 'train' or 'val'
    return: tuple of two tensors (x, y)"""

    data = train_data if split == 'train' else val_data

    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y
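
To see what `get_batch` returns, the sketch below applies the same sampling logic to a toy tensor, `torch.arange(100)`, an assumption chosen so that every value equals its own position. With that data, the targets should be exactly the inputs plus one, which makes the one-position shift easy to verify. The function name `toy_get_batch` is hypothetical.

```python
import torch

torch.manual_seed(0)

toy_data = torch.arange(100)  # toy stand-in for train_data (assumption)
batch_size = 4
block_size = 8

def toy_get_batch(data):
    """Same sampling logic as get_batch, applied to an explicit tensor."""
    # random start offsets, leaving room for block_size + 1 tokens
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

x, y = toy_get_batch(toy_data)
print(x.shape, y.shape)       # both are (batch_size, block_size)
print(torch.equal(y, x + 1))  # targets are the inputs shifted by one
```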


### Create the model

In [39]:
import torch
import torch.nn as nn

from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()

        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx: (batch_size, block_size)
        # targets: (batch_size, block_size)

        # (batch_size, block_size, vocab_size)
        logits = self.token_embedding_table(idx)

        if targets is None:
            loss = None
        else:
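            B, T, C = logits.shape
            # cross_entropy expects (N, C) logits and (N,) targets, so flatten batch and time
            logits = logits.view(B*T, C)
            targets = targets.view(-1)
            # loss is a scalar: the mean cross-entropy over all B*T positions
            loss = F.cross_entropy(logits, targets)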
        
        return logits, loss

    def generate(self, idx, max_new_tokens):

        for _ in range(max_new_tokens):
            # get the prediction
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
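
The reshape in `forward` is there because `F.cross_entropy` takes class scores of shape `(N, C)` and integer targets of shape `(N,)`. The sketch below uses random logits and targets with illustrative sizes to check that the flattened call matches a manual computation: the mean of the negative log-probability assigned to each correct token.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

B, T, C = 4, 8, 65  # batch, time, vocabulary size (illustrative values)
logits = torch.randn(B, T, C)
targets = torch.randint(C, (B, T))

# Flatten batch and time so every position is one classification example.
flat_logits = logits.view(B * T, C)
flat_targets = targets.view(-1)
loss = F.cross_entropy(flat_logits, flat_targets)

# Manual version: pick the log-probability of each target and average.
log_probs = F.log_softmax(flat_logits, dim=-1)
picked = log_probs[torch.arange(B * T), flat_targets]
manual_loss = -picked.mean()

print(loss.item(), manual_loss.item())
```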

In [34]:
vocab_size = len(chars)
m = BigramLanguageModel(vocab_size)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


YQV?vPNQhUYq!DicZMQSeYG;aZULTAJS3nTAeVFrUeTCFjGtMPJH--l
hcILAhbnJ- F-a?o
AzoGcysk?!aIMmjwPp;pQ$AASZM


In [35]:
# create a PyTorch optimizer
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

In [37]:
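# get_batch reads the global batch_size, so this changes the batch used for training
batch_size = 32

for step in range(10000):

    # sample a batch of training data
    x, y = get_batch('train')

    # forward pass and loss
    logits, loss = m(x, y)

    # backward pass and parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        print(f'Step: {step}, Loss: {loss.item()}')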

Step: 0, Loss: 3.7534027099609375
Step: 100, Loss: 3.7499630451202393
Step: 200, Loss: 3.615746021270752
Step: 300, Loss: 3.6393930912017822
Step: 400, Loss: 3.5247511863708496
Step: 500, Loss: 3.3106467723846436
Step: 600, Loss: 3.407113552093506
Step: 700, Loss: 3.3172221183776855
Step: 800, Loss: 3.335146903991699
Step: 900, Loss: 3.1862876415252686
Step: 1000, Loss: 3.138798475265503
Step: 1100, Loss: 2.9729411602020264
Step: 1200, Loss: 3.0321455001831055
Step: 1300, Loss: 3.015462875366211
Step: 1400, Loss: 2.9232938289642334
Step: 1500, Loss: 2.9104466438293457
Step: 1600, Loss: 2.9720618724823
Step: 1700, Loss: 2.953613519668579
Step: 1800, Loss: 2.7322216033935547
Step: 1900, Loss: 2.813430070877075
Step: 2000, Loss: 2.881880760192871
Step: 2100, Loss: 2.700406074523926
Step: 2200, Loss: 2.749924659729004
Step: 2300, Loss: 2.6859850883483887
Step: 2400, Loss: 2.741866111755371
Step: 2500, Loss: 2.741502285003662
Step: 2600, Loss: 2.6578240394592285
Step: 2700, Loss: 2.69344878
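
One way to put the loss values above in context: a model that assigns equal probability to all 65 characters has a cross-entropy of ln(65), about 4.17, so a loss below that means the model has picked up something about which characters tend to follow which. The sketch below computes this baseline and converts a loss to perplexity, exp(loss), which can be read as the effective number of equally likely next characters. The helper name `perplexity` is illustrative.

```python
import math

vocab_size = 65  # number of unique characters reported earlier

# Cross-entropy of a uniform prediction over the vocabulary.
uniform_loss = math.log(vocab_size)

def perplexity(loss):
    """Convert a mean cross-entropy (in nats) to perplexity."""
    return math.exp(loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"perplexity at loss 2.69: {perplexity(2.69):.1f}")
```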

In [40]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


Whesss s,
NG nd f li!

Louris ol?
ETovepr g he R:

Townorsenof
NTIOLI's; n be a in mor frloucck&llot
