### Initial loading and viewing the dataset

Here we look at the dataset we are going to use and build some intuition for the operations the main code performs on it. This is just a side branch off the main path in section 1b.

In [1]:
with open('cleaned_dataset.txt', 'r') as f:
    text = f.read()

data = text[:1000]
print(data[:100])

M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly norm


- We load the dataset and read it into *text*. We then store the first 1000 characters in *data* and print just the first 100 characters of *data*.
- Note that the gpt2 tokenizer **has a compression rate of roughly 3:1** (characters to tokens), so the 1000 characters stored in *data* will come out to roughly 300 tokens.
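
As a quick sanity check on that ratio, the arithmetic can be sketched like this. The helper name `estimate_tokens` and the default of 3.0 characters per token are assumptions for illustration only; the real count depends on the text and has to be measured by actually encoding it.

```python
def estimate_tokens(n_chars: int, chars_per_token: float = 3.0) -> int:
    """Back-of-the-envelope token count from a character count (hypothetical helper)."""
    return round(n_chars / chars_per_token)

# 1000 characters at ~3 characters per token
print(estimate_tokens(1000))
```

So "roughly 300 tokens" is the right order of magnitude for *data*.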

If we want some additional stats on the dataset, we can run the following cell:

In [2]:
def wc_equivalent(filename):
    with open(filename, 'rb') as f:  # open in binary mode to get bytes accurately
        content = f.read()
        byte_count = len(content)

    text = content.decode('utf-8', errors='ignore')  # decode bytes to string
    lines = text.splitlines()
    line_count = len(lines)
    word_count = len(text.split())

    print(f"Lines: {line_count}")
    print(f"Words: {word_count}")
    print(f"Bytes: {byte_count}")

# Usage
wc_equivalent('cleaned_dataset.txt')


Lines: 79295
Words: 1095233
Bytes: 6199345


Sensei just ran `wc input.txt` in the terminal and got a similar set of numbers as output. `wc` wasn't recognised on mine. It is a standard Unix command-line utility, not a Python package, so there is no pip install for it; it usually just isn't available in a stock Windows shell. So I pplxed it and got that code snippet.

Now let's go ahead and tokenize *data*:

In [4]:
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode(data)
print(tokens[:24])

[44, 374, 13, 290, 9074, 13, 360, 1834, 1636, 11, 286, 1271, 1440, 11, 4389, 16809, 9974, 11, 547, 6613, 284, 910, 326, 484]


So we used OpenAI's tiktoken for our tokenization: we asked for the gpt2 encoding, encoded *data*, and finally printed the first 24 tokens (if you check in the tiktokenizer app, the first 100 characters come out to 27 tokens in total).

Fun fact: in tiktoken's gpt2 encoding, the newline character `\n` is represented by token `198`. We haven't encountered it yet in our case, but it is worth mentioning so we can check for it in the future.

Now we have a one-dimensional sequence of tokens from our data. We want to feed it into our neural network so it can be processed.

In our case, we need to feed these tokens in as the indices `idx` of the `forward()` method of our GPT class, so we need them in the shape *(B, T)*, where B is the batch size and T is the sequence length (at most the maximum sequence length that can be passed).

We will now see how to convert that 1D sequence of tokens into this 2D shape so that it can be passed into the model.

In [5]:
import torch
buf = torch.tensor(tokens[:24])
x = buf.view(4, 6)
print(x)

tensor([[   44,   374,    13,   290,  9074,    13],
        [  360,  1834,  1636,    11,   286,  1271],
        [ 1440,    11,  4389, 16809,  9974,    11],
        [  547,  6613,   284,   910,   326,   484]])


The cell above shows one method sensei likes to use to build the inputs we want to feed into the model.

We `import torch` and create a tensor containing the first 24 tokens, then rearrange it with `view()` into a 2D array. The shape is (4, 6) here, of course, because we have only chosen 24 tokens.
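
To get a feel for what `view()` is doing, here is a minimal sketch. It uses `torch.arange(24)` as a stand-in for real token ids, which is purely an assumption for illustration. It shows that the element count has to match the requested shape exactly, that `-1` lets PyTorch infer one dimension, and that `view()` reinterprets the same memory instead of copying it.

```python
import torch

buf = torch.arange(24)          # stand-in for 24 token ids (illustrative only)

x = buf.view(4, 6)              # explicit (B, T) = (4, 6)
x_auto = buf.view(-1, 6)        # -1 asks PyTorch to infer B = 24 / 6 = 4
print(x.shape, x_auto.shape)

# view() does not copy: x and buf point at the same underlying memory
shares_memory = x.data_ptr() == buf.data_ptr()
print(shares_memory)

# An element count that is not exactly B * T is rejected
try:
    torch.arange(25).view(4, 6)
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)
```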

Now, those are the values that are passed into `forward()`, so if we take an example:

- If we are at *13* in `idx`, the model only considers the tokens up to and including that position, *44, 374, 13*, and uses them to predict the next value, which is *290*.
- So each token has a target which it needs to predict.
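
The same idea can be spelled out for the whole first row. This is just an illustrative sketch in plain Python: the token ids are copied from the tensor printed above, and the variable names (`row`, `next_token`, `pairs`) are made up for this example.

```python
# First row of the batch, and the token that comes right after it (start of row two)
row = [44, 374, 13, 290, 9074, 13]
next_token = 360

# The target at position t is simply the token at position t + 1
targets = row[1:] + [next_token]

pairs = []
for t in range(len(row)):
    context = row[: t + 1]      # everything the model is allowed to look at for position t
    pairs.append((context, targets[t]))
    print(f"{context} -> {targets[t]}")
```

So one row of 6 tokens actually gives us 6 training examples, one per position.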

With just these 24 tokens, you can see that the last token does not have a "next token" to predict, so we add a couple of lines of code to also include the next target token in the buffer.

In [6]:
import torch
buf = torch.tensor(tokens[:24 + 1])
x = buf[:-1].view(4, 6)
y = buf[1:].view(4, 6)
print(x)
print(y)

tensor([[   44,   374,    13,   290,  9074,    13],
        [  360,  1834,  1636,    11,   286,  1271],
        [ 1440,    11,  4389, 16809,  9974,    11],
        [  547,  6613,   284,   910,   326,   484]])
tensor([[  374,    13,   290,  9074,    13,   360],
        [ 1834,  1636,    11,   286,  1271,  1440],
        [   11,  4389, 16809,  9974,    11,   547],
        [ 6613,   284,   910,   326,   484,   547]])



-----

So ultimately, this is what he usually likes to do:
- Load all the tokens
- Convert them into dimensions of *(B, T)*
- Load them into two tensors: (i) `x`, which is what we feed into the transformer, and (ii) `y`, which contains the labels of what it needs to predict next. Both have shape *(B, T)* and are cut from one buffer of `B*T + 1` tokens.
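
The steps above can be wrapped into a small helper. This is only a sketch: the name `make_batch` and its arguments are hypothetical (not something from sensei's code), and the token list is just the 25 ids printed earlier in this section.

```python
import torch

def make_batch(tokens, B, T):
    """Hypothetical helper: flat token list -> inputs x and targets y, each of shape (B, T)."""
    needed = B * T + 1                      # one extra token so the last input has a target
    if len(tokens) < needed:
        raise ValueError(f"need at least {needed} tokens, got {len(tokens)}")
    buf = torch.tensor(tokens[:needed])
    x = buf[:-1].view(B, T)                 # what we feed into the transformer
    y = buf[1:].view(B, T)                  # the same tokens shifted left by one
    return x, y

# The 25 token ids from the cells above
tokens = [44, 374, 13, 290, 9074, 13, 360, 1834, 1636, 11, 286, 1271,
          1440, 11, 4389, 16809, 9974, 11, 547, 6613, 284, 910, 326, 484, 547]

x, y = make_batch(tokens, B=4, T=6)
print(x.shape, y.shape)
```

Any other (B, T) works the same way as long as there are at least `B*T + 1` tokens available.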

Now, you can go back to the main notebook path in section-1b **Let’s train: data batches (B,T) → logits (B,T,C)** to continue!