### Notebook description



### Load data

- Load the dataset used to train the model

In [2]:
with open('../data/raw/input.txt', 'r') as f:
    text = f.read()

In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



### Characters

- Let's check all the characters that the model will be able to see

In [4]:
chars = sorted(list(set(text)))
print(''.join(chars))
print(f"Number of unique characters: {len(chars)}")


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Number of unique characters: 65


### Encoding characters

- Let's create a strategy to transform characters into integers;
- Using the list of characters, we can create a map from characters to integers and a reverse map from integers back to characters

In [5]:
stoi = { c: i for i, c in enumerate(chars) }
itos = { i: c for i, c in enumerate(chars) }

encode = lambda x: [stoi[c] for c in x]
decode = lambda x: ''.join([itos[i] for i in x])

print(encode('hello'))
print(decode(encode('hello')))

[46, 43, 50, 50, 53]
hello
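
As a small illustration of the mapping above, the sketch below rebuilds the same kind of character-level encoder on a toy string. The names `sample_text`, `sample_encode`, and `sample_decode` are hypothetical and only stand in for the notebook's `text`, `encode`, and `decode`. It also shows a limitation worth keeping in mind: a character that never appears in the corpus has no id, so encoding it raises a `KeyError`.

```python
# Toy corpus standing in for the notebook's `text` (an assumption for illustration).
sample_text = "hello world"

# Same construction as in the notebook: sorted unique characters form the vocabulary.
sample_chars = sorted(set(sample_text))
sample_stoi = {c: i for i, c in enumerate(sample_chars)}
sample_itos = {i: c for i, c in enumerate(sample_chars)}

def sample_encode(s):
    """Map each character to its integer id."""
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    """Map integer ids back to a string."""
    return ''.join(sample_itos[i] for i in ids)

ids = sample_encode("hello")
round_trip = sample_decode(ids)
print(ids, round_trip)  # [3, 2, 4, 4, 5] hello

# '!' does not occur in the toy corpus, so it has no entry in the map.
try:
    sample_encode("hello!")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print(unknown_ok)  # False
```

The ids differ from the notebook's `[46, 43, 50, 50, 53]` because the toy vocabulary is much smaller; only the round trip is guaranteed to match.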


### Tokenization & train/validation split

In [6]:
import torch

text_encoded = torch.tensor(encode(text), dtype=torch.long)
print(text_encoded[:100])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [7]:
train_size = int(0.9 * len(text_encoded))

train_data = text_encoded[:train_size]
val_data = text_encoded[train_size:]
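
The split above is contiguous rather than shuffled: the first 90% of the encoded text becomes training data and the final 10% becomes validation data. A minimal sketch on a toy list (`toy_encoded` and the other `toy_` names are assumptions for illustration) shows how `int()` truncates the boundary and that the two parts cover the data without overlap.

```python
# Toy stand-in for the encoded corpus (an assumption for illustration).
toy_encoded = list(range(25))

# int() truncates 0.9 * 25 = 22.5 down to 22.
toy_train_size = int(0.9 * len(toy_encoded))

toy_train = toy_encoded[:toy_train_size]  # first 90% (rounded down)
toy_val = toy_encoded[toy_train_size:]    # everything that remains

print(len(toy_train), len(toy_val))  # 22 3
```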

### Set block size and batch size

* **Block Size**: The block size is the number of consecutive tokens in each training sequence, that is, the maximum context length the model sees when predicting the next character. In `get_batch`, each input row of `x` holds `block_size` tokens and the matching row of `y` is the same window shifted one position to the right. A single row therefore packs `block_size` training examples: the context `x[:t+1]` is used to predict `y[t]` for every `t` from 0 to `block_size - 1`, so the model is trained on contexts of every length from 1 to `block_size`.
* **Batch Size**: The batch size is the number of independent sequences stacked into one mini-batch and processed together before the model's parameters are updated. During training, the model makes predictions on the batch, calculates the loss, and then updates the weights based on that loss. Processing sequences in mini-batches makes efficient use of parallel hardware such as GPUs, and the random sampling of each batch adds some noise to the gradient estimate. A small batch size can lead to noisy gradients and slower convergence, while a large batch size requires more memory and computational resources, so the choice is a trade-off between computational efficiency and training stability. With the values used here, each batch has shape `(batch_size, block_size)`, i.e. `(4, 8)`.
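
To make the role of `block_size` concrete, the sketch below takes the first nine token ids from the tensor printed earlier and lists every context/target pair contained in one window of `block_size + 1` tokens. The `tokens` list and the `pairs` name are introduced here for illustration, and the sketch assumes, as in the notebook's batching code, that the target is the input shifted by one position.

```python
block_size = 8

# First block_size + 1 ids of the encoded text, copied from the printed tensor
# (they spell "First Cit").
tokens = [18, 47, 56, 57, 58, 1, 15, 47, 58]

x = tokens[:block_size]       # inputs
y = tokens[1:block_size + 1]  # targets: inputs shifted by one position

pairs = []
for t in range(block_size):
    context = x[:t + 1]  # everything seen up to and including position t
    target = y[t]        # the token that follows that context
    pairs.append((context, target))
    print(f"context {context} -> target {target}")
```

One window of 9 tokens yields 8 training examples, with context lengths running from 1 to `block_size`.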


In [11]:
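torch.manual_seed(24022024)

batch_size = 4  # number of independent sequences per batch
block_size = 8  # number of tokens (context length) per sequence

def get_batch(split):
    """Sample a random batch of inputs x and next-token targets y."""
    data = train_data if split == 'train' else val_data

    # Random start offsets; the upper bound leaves room for the shifted targets
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Targets are the same windows shifted one position to the right
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y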


In [12]:
get_batch('train')

(tensor([[56,  1, 50, 47, 49, 43, 61, 47],
         [13, 52, 42,  1, 46, 43,  1, 58],
         [ 0, 35, 47, 50, 50,  1, 59, 52],
         [53, 53, 42,  1, 61, 53, 56, 49]]),
 tensor([[ 1, 50, 47, 49, 43, 61, 47, 57],
         [52, 42,  1, 46, 43,  1, 58, 53],
         [35, 47, 50, 50,  1, 59, 52, 42],
         [53, 42,  1, 61, 53, 56, 49,  2]]))
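
To read this batch as text, the sketch below decodes the printed rows with a vocabulary rebuilt from the character list shown earlier in the notebook. The row values are copied from the output above; `x_rows` and `y_rows` are hypothetical names for plain-list copies of the two tensors. It also checks that every target row is the input row shifted by one token.

```python
import string

# Vocabulary rebuilt from the character list printed earlier in the notebook:
# newline, space, punctuation, the digit 3, then A-Z and a-z (65 symbols).
chars = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
itos = {i: c for i, c in enumerate(chars)}

def decode(ids):
    """Map integer ids back to a string."""
    return ''.join(itos[i] for i in ids)

# Rows copied from the get_batch('train') output above.
x_rows = [
    [56, 1, 50, 47, 49, 43, 61, 47],
    [13, 52, 42, 1, 46, 43, 1, 58],
    [0, 35, 47, 50, 50, 1, 59, 52],
    [53, 53, 42, 1, 61, 53, 56, 49],
]
y_rows = [
    [1, 50, 47, 49, 43, 61, 47, 57],
    [52, 42, 1, 46, 43, 1, 58, 53],
    [35, 47, 50, 50, 1, 59, 52, 42],
    [53, 42, 1, 61, 53, 56, 49, 2],
]

for x_row, y_row in zip(x_rows, y_rows):
    # Each target row drops the first input token and appends the next one.
    assert x_row[1:] == y_row[:-1]
    print(repr(decode(x_row)), '->', repr(decode(y_row)))
```

The first row decodes to `'r likewi'` with target `' likewis'`, and the last to `'ood work'` with target `'od work!'`, which makes the one-character shift between inputs and targets easy to see.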