# **Section 1 - B**

----------

## Sample, auto-detect the device

At the end of the previous section, we loaded the pretrained GPT-2 weights into our architecture and generated text with the model. Now we would like to initialize the weights ourselves, that is, start from randomly generated weights.

That can be done in a fairly simple way:

```
#model = GPT.from_pretrained("gpt2")
model = GPT(GPTConfig())
```

We just call the default `GPTConfig()` that we made. When the model is constructed, PyTorch internally assigns random initial weights to each of the layers built from our config, so we can already use this model to generate text.
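To make "random weights" a bit more concrete, here is a minimal sketch. It uses a plain `nn.Linear` layer as a stand-in for our `GPT` class (the stand-in, the layer sizes and the seed value are my own assumptions, not part of the original code). Two freshly built layers get different weights, unless the seed is fixed beforehand.

```python
import torch
import torch.nn as nn

# Stand-in for GPT(GPTConfig()): any nn.Module gets randomly
# initialised parameters the moment it is constructed.
layer_a = nn.Linear(4, 4)
layer_b = nn.Linear(4, 4)
print("fresh layers identical? ", torch.equal(layer_a.weight, layer_b.weight))

# Fixing the seed before construction makes the "random" weights reproducible.
torch.manual_seed(42)  # arbitrary seed, chosen only for illustration
layer_c = nn.Linear(4, 4)
torch.manual_seed(42)
layer_d = nn.Linear(4, 4)
print("seeded layers identical?", torch.equal(layer_c.weight, layer_d.weight))
```

This is also why the generated text from a freshly initialised model changes from run to run unless you seed the generator first.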

Lastly, before I run this: we also added a few lines to better control which device the model runs on. In my case I have a GPU with CUDA support, but you can also run everything up to this point on a CPU. The snippet below picks the device and prints which one is being used:

```
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"

print(f"Device used: {device}")
```

The rest of the code detects the device dynamically as well (even in `forward()` you will see that we use `device=idx.device`), so all the layers and tensors end up on the same device while generating.
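As a rough illustration of the `device=idx.device` pattern, here is a small sketch. The dummy `idx` tensor and its shape are assumptions made up for this example; only the idea of building new tensors on the device of an existing one comes from our `forward()`.

```python
import torch

# Pick the device once, falling back to the CPU when CUDA is not available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Dummy batch of token ids (2 sequences of 8 tokens), just for illustration.
idx = torch.zeros((2, 8), dtype=torch.long, device=device)
T = idx.size(1)

# Build the position indices on whatever device idx already lives on,
# instead of hardcoding "cpu" or "cuda" inside the model.
pos = torch.arange(0, T, dtype=torch.long, device=idx.device)

print(f"idx on {idx.device}, pos on {pos.device}")
```

Because `pos` follows `idx`, the model code itself never needs to know which device was chosen.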

And this is the final output that we generated!

![Sampling and Auto detect output](assets/auto-device-output.png)

Obviously it is gibberish lol, we will get to the training next!

&nbsp;

## Let’s train: data batches (B,T) → logits (B,T,C)

> **NOTE**
>
>We will be loading our dataset now. Sensei used his favourite, the Tiny Shakespeare dataset. I am going ahead with MY favourite dataset, the HARRY POTTER NOVELS COLLECTION, which I also used for my GPT-1 implementation. I directly took the `cleaned_dataset.txt` file that I had already processed.
>
>If you want to see a simple breakdown version of the dataset and what we are about to do, take a look at [this notebook](https://github.com/MuzzammilShah/GPT-TransformerModel-2/blob/main/section-1b-dataset.ipynb) on my repo.

```
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"

print(f"Device used: {device}")

num_return_sequences = 5
max_length = 30

#==========THIS SECTION==========
import tiktoken
enc = tiktoken.get_encoding('gpt2')

with open('cleaned_dataset.txt', 'r') as f:
    text = f.read()

data = text[:1000]
tokens = enc.encode(data)

B, T = 4, 32
buf = torch.tensor(tokens[:B*T + 1])
x = buf[:-1].view(B, T)
y = buf[1:].view(B, T)
#================================

#model = GPT.from_pretrained("gpt2")
model = GPT(GPTConfig())
model.eval()
model.to(device)

#==========THIS SECTION==========
x = x.to(device)
logits = model(x)
print(logits.shape)
import sys; sys.exit(0)
#================================
```

The parts marked `THIS SECTION` above are the newly added code, following the same steps as the [dataset breakdown notebook](https://github.com/MuzzammilShah/GPT-TransformerModel-2/blob/main/section-1b-dataset.ipynb). This is only a debugging step, so the values are hardcoded. Since our batch is 4 by 32, we get logits for each of those positions.

- The output we got when the program was run (notice the `sys.exit`): **`torch.Size([4, 32, 50257])`**

- So for each of the `4 × 32` positions in `x` we get `50257` logits, one per token in the vocabulary, scoring what comes next at that position.
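To see how `x`, `y` and the logits line up, here is a tiny sketch with made-up numbers. The small `B`, `T` and the `torch.arange` buffer are assumptions standing in for the real token ids, and the `logits` tensor is only a placeholder with the shape the model returns.

```python
import torch

B, T = 2, 4                      # much smaller than our real 4 x 32 batch
buf = torch.arange(B * T + 1)    # stand-in for real token ids: 0, 1, ..., 8

x = buf[:-1].view(B, T)          # inputs:  everything except the last token
y = buf[1:].view(B, T)           # targets: the same tokens shifted by one

print(x)   # tensor([[0, 1, 2, 3], [4, 5, 6, 7]])
print(y)   # tensor([[1, 2, 3, 4], [5, 6, 7, 8]])

# y[b, t] is the token that follows x[b, t] in the original stream,
# which is why we need B*T + 1 tokens in the buffer.
vocab_size = 50257
logits = torch.zeros(B, T, vocab_size)   # placeholder, same shape as model(x)
print(logits.shape)                      # torch.Size([2, 4, 50257])
```

Each row of `logits[b, t]` is meant to be compared against the single target id `y[b, t]`.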

-----

### **Debugging moment, yay! (Mini version)**

So, I had encountered the error `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`.

This wouldn't happen in sensei's video, as he had a manual OVERRIDE of the device to cpu. In my case I am continuing to stick with cuda. But since our buffer was initialised manually here: `buf = torch.tensor(tokens[:B*T + 1])`, it sits on the cpu by default, while the model has already been moved to the GPU.

To fix this, we just added one additional line of code: `x = x.to(device)` just before calculating the logits.
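One detail worth spelling out: calling `.to(device)` on a tensor returns the moved tensor rather than changing it in place, so the result has to be assigned back, whereas a module is moved in place. A minimal sketch, with a tiny `nn.Embedding` as a hypothetical stand-in for our model:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

emb = nn.Embedding(10, 3)        # hypothetical tiny stand-in for the GPT model
emb.to(device)                   # modules are moved in place, no reassignment needed

x = torch.tensor([[1, 2, 3]])    # created on the CPU by default
x.to(device)                     # result is discarded, x itself is unchanged
x = x.to(device)                 # correct: keep the returned tensor

out = emb(x)                     # model and input now share a device
print(out.shape, out.device)
```

On a CPU-only machine both versions behave the same, which is exactly why the bug only shows up once CUDA is in use.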

-----

Next, we still have `y`, which contains the targets. So now is the time to calculate the loss -> do the backward pass -> and do the optimization. Let's go ahead and calculate the loss first.

&nbsp;

## Cross entropy loss