<h3>With Numpy arrays: </h3>   

Load the .txt file and build a character-level vocabulary (the sorted set of unique characters)

In [22]:
with open("wizard_of_oz.txt",'r', encoding="utf-8") as f:
    text = f.read()

character_based_tokens = sorted(set(text))
print(character_based_tokens)

['\n', ' ', '!', '&', '(', ')', ',', '-', '.', '0', '1', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '—', '‘', '’', '“', '”', '\ufeff']


Define the lookup dictionaries for both directions (character to index, and index to character)

In [23]:

dict_values_as_key = {ch:i for i, ch in enumerate(character_based_tokens)}
dict_index_as_key = {i:ch for i, ch in enumerate(character_based_tokens)}

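_NB: Instead of building the dictionaries above, we could have used ord(), which returns the Unicode code point of a character (and chr() to go from the code point back to the character).
<br>With that, we don't need to create our own dictionaries, but the code points are not contiguous (for example, '\ufeff' maps to 65279), so the embedding table would have to be much larger than the vocabulary._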
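
As a small illustration of the ord()/chr() alternative mentioned in the note, the sketch below maps a sample string to code points and back. The sample string is an arbitrary choice for illustration; the two special characters are taken from the vocabulary printed earlier.

```python
# Sketch: ord()/chr() as a ready-made character <-> integer mapping.
sample = "Oz!"
code_points = [ord(c) for c in sample]            # characters -> Unicode code points
restored = "".join(chr(n) for n in code_points)   # code points -> characters

print(code_points)
print(restored)

# Code points are sparse: the right double quotation mark and the BOM
# sit far beyond the size of the character vocabulary.
print(ord("\u201d"), ord("\ufeff"))
```

The custom dictionaries keep the indices compact (0 to vocab_size - 1), which is what the embedding table used later relies on.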

Encode and Decode functions 

In [24]:

encode = lambda original_string: [dict_values_as_key[c] for c in original_string]
decode = lambda encoded_array : ''.join([dict_index_as_key[c] for c in encoded_array])


# print("Example: Encoding and decoding of 'hello'") 
# encoded_hello = encode('hello')
# print("Encoding : ", encoded_hello) 
# decoded_hello = decode(encoded_hello)
# print("Decoding : ", decoded_hello) 
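
The round trip can be checked on a tiny corpus. This is a minimal sketch that assumes a toy string in place of wizard_of_oz.txt; the names `toy_text`, `toy_encode`, and `toy_decode` are hypothetical and only mirror the dictionaries and lambdas defined in this section.

```python
# Toy corpus standing in for wizard_of_oz.txt (an assumption for illustration).
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

char_to_index = {ch: i for i, ch in enumerate(toy_vocab)}
index_to_char = {i: ch for i, ch in enumerate(toy_vocab)}

def toy_encode(original_string):
    """Map each character to its index in the toy vocabulary."""
    return [char_to_index[c] for c in original_string]

def toy_decode(encoded_list):
    """Map each index back to its character and join the result."""
    return "".join(index_to_char[i] for i in encoded_list)

encoded_hello = toy_encode("hello")
print("Encoding :", encoded_hello)
print("Decoding :", toy_decode(encoded_hello))
```

Note that a character absent from the vocabulary raises a KeyError with this approach, since the dictionaries only know the characters seen in the text.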

<h3>With Tensors: </h3>   

Now, for our data, we won't use Numpy arrays; we use tensors.
<br>A tensor is very similar to a classical array, but it comes from the torch framework, so it has more features for ML (for example, it can live on a GPU and supports automatic differentiation).
_<br>NB: the variable 'character_based_tokens' was useful to create the dictionaries_

1- Imports and predefined sizes:

In [25]:
import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

batch_size = 4 # How many blocks are processed in parallel, also called B
block_size = 8 # The length of each block (context length), also called T
vocab_size = len(character_based_tokens) # The number of distinct tokens, also called C

cuda


2- Create a tensor from the encoded .txt, then split it into training and testing datasets:

In [26]:
data = torch.tensor(encode(text), dtype = torch.long) # Write the .txt in tensor format

#Split into train/test data
n = int(0.8 * len(data))
train_data = data[:n]
test_data = data[n:]

3- Function to get the batches x and y:
<br>*NB: It takes train_data or test_data as its parameter*

In [27]:
def get_batch(data):

    maximum_index = len(data) - block_size # Upper bound (exclusive) for a start index, so that both the x and y slices stay inside data
    tab_ix = torch.randint(maximum_index, (batch_size,)) # Generate a 1D tensor of randomly selected start indexes.
    #NB: It has 'batch_size' elements, each an integer from 0 to maximum_index - 1

    x = torch.stack([data[i:i+block_size] for i in tab_ix]) #For each index, we take a 1D tensor of 'block_size' elements. At the end, we stack all these tensors together into shape (batch_size, block_size)


    y = torch.stack([data[i+1:i+block_size+1] for i in tab_ix]) #Same logic as for x but with an offset of 1 for each index (for example, if x starts at index 71 in data, y starts at index 72).

    #At the end, we have two batches: x (the inputs) and y (the targets, i.e. the next character for each position of x).

    # Move the batches to the specified device (GPU or CPU)
    x, y = x.to(device), y.to(device)
    return x, y
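
To see the one-step offset between x and y concretely, here is a minimal sketch that repeats the sampling logic of get_batch on a toy tensor. The toy sizes, the fixed seed, and the name `toy_data` are assumptions for illustration, and the device transfer is left out.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

block_size = 4
batch_size = 3
toy_data = torch.arange(20)  # stand-in for the encoded text: 0, 1, 2, ..., 19

# Same sampling logic as get_batch, without the move to the device.
start_indexes = torch.randint(len(toy_data) - block_size, (batch_size,))
x = torch.stack([toy_data[i:i + block_size] for i in start_indexes])
y = torch.stack([toy_data[i + 1:i + block_size + 1] for i in start_indexes])

print(x.shape, y.shape)                   # both are (batch_size, block_size)
print(torch.equal(x[:, 1:], y[:, :-1]))   # y is x shifted left by one position
```

Because toy_data holds consecutive integers, every value in y is exactly one more than the value at the same position in x, which makes the shift easy to read in the printed tensors.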

4- Class Bigram:

In [None]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.tab_token_embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=vocab_size) #Lookup table of shape (vocab_size, vocab_size).
        #We use 'vocab_size*vocab_size' so that every pair of characters has a score (for example, row 'a' holds the scores for aa, ab, ac, ad, ...)
        #For more details on embedding vectors, check torch_functions.ipynb
        
    def forward(self, indexes, targets=None):
        logits = self.tab_token_embedding(indexes) #Reminder: EACH ROW of tab_token_embedding corresponds to ONE TOKEN (or character).
                                                   # In this line, EACH INDEX in indexes is replaced with its corresponding row.
                                                   # This gives the logits, with logits.shape: (B, T, C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape #B is the batch size, T is the block_size and C is the vocab_size

            logits = logits.view(B*T, C) # We reshape the 3D logits tensor (B, T, C) into a 2D tensor (B*T, C), as F.cross_entropy expects
            targets = targets.view(B*T) # We reshape the 2D targets tensor (B, T) into a 1D tensor (B*T)

            loss = F.cross_entropy(logits, targets) # We want the loss to get smaller and smaller: it means the predictions are getting closer to the targets
        
        return logits, loss
    
    def generate_tokens(self, indexes, max_new_tokens): 
        for _ in range(max_new_tokens):
            logits, loss = self.forward(indexes) #Get the prediction
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            index_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            indexes = torch.cat((indexes, index_next), dim=1) # (B, T+1)
        return indexes

        

context = torch.tensor([[0]], dtype=torch.long, device=device)
targets = torch.randint(0, vocab_size, (batch_size, block_size))  # Shape: (B, T). Each value is between 0 and vocab_size - 1


tensor([[0., 0.]])
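
The reshaping done before F.cross_entropy, and a useful reference value for the loss, can be sketched in isolation. This assumes toy sizes and all-zero logits (every token equally likely); C = 73 is an assumption matching the number of characters in the vocabulary printed earlier.

```python
import math

import torch
import torch.nn.functional as F

B, T, C = 4, 8, 73  # toy sizes; C is an assumed vocabulary size

logits = torch.zeros(B, T, C)           # uniform predictions: all tokens equally likely
targets = torch.randint(0, C, (B, T))   # random target indexes in [0, C)

# F.cross_entropy expects (N, C) logits and (N,) targets, hence the flattening.
flat_logits = logits.view(B * T, C)
flat_targets = targets.view(B * T)
loss = F.cross_entropy(flat_logits, flat_targets)

# With uniform predictions, the loss equals ln(C) whatever the targets are.
print(loss.item(), math.log(C))
```

A freshly initialised model is not exactly uniform, so its first loss will usually differ from ln(C), but a trained model should end up clearly below that value.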


In [29]:
# x = train_data[:block_size]
# y = train_data[1:block_size+1]

# # NB: If we wanted to take a block from a random section of our tensor:
# # import random
# # r = random.randint(0, len(train_data) - block_size - 1)  # randint includes both ends
# # x = train_data[ r : r+block_size]
# # y = train_data[ r+1 : r+block_size+1]

# for i in range (block_size):
#     context = x[:i+1]
#     target = y[i]

#     print("when input in tensor is ", context, " , the target is : ", target)
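
The commented-out loop above can be tried on plain Python lists. This is a minimal sketch in which `toy_tokens` is a hypothetical list of encoded characters, not data from the text.

```python
block_size = 4
toy_tokens = [10, 11, 12, 13, 14]  # hypothetical encoded characters

x = toy_tokens[:block_size]        # inputs:  [10, 11, 12, 13]
y = toy_tokens[1:block_size + 1]   # targets: [11, 12, 13, 14]

# One block yields block_size training examples: each growing prefix of x
# is a context, and the matching element of y is the character to predict.
pairs = [(x[:i + 1], y[i]) for i in range(block_size)]
for context, target in pairs:
    print("when input is", context, ", the target is :", target)
```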