# day 341, day 342

In [1]:
import torch
import torchvision
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import os, requests, zipfile

In [None]:
import datetime
datetime.datetime.now()

datetime.datetime(2024, 4, 3, 16, 0, 28, 440199)

In [None]:
torch.__version__

'2.2.1+cu121'

# torch.compile does code optimization:
* `torch.compile` optimizes our code for newer GPUs so that training time drops substantially (reported as roughly 43% faster on average) when it is used in combination with newer GPUs such as the A100 or above.

## two important things torch.compile offers are:
1. Fusion (operator fusion).
2. Graph tracing (looking ahead at the operations to come before executing them, so they can be optimized ahead of time).

## bandwidth cost reduction (through fusion).
* Bandwidth cost is the time it takes to move your data between memory and the processor doing the compute (for example from CPU to GPU, or from GPU memory to the GPU cores). The higher the bandwidth cost, the slower the data processing.
* The fusion aspect of `torch.compile` arranges the operations so that fewer intermediate results are written to memory and read back again, which lowers the bandwidth cost and therefore speeds up execution.

![sfsf](https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/extras-memory-bandwidth-output-small.gif)


* So instead of performing an operation on a piece of data and then saving the result to memory (increased bandwidth costs), you chain together as many operations as possible via fusion.

* A rough analogy would be using a blender to make a smoothie.

* Most blenders are good at blending things (like GPUs are good at performing matrix multiplications).

* Using a blender without operator fusion would be like adding each ingredient one by one and blending each time a new ingredient is added. Not only is this insane, it increases your bandwidth cost.

* The actual blending is fast each time (like GPU computations generally are) but you lose a bunch of time adding each ingredient one by one.

* Using a blender with operator fusion is akin to using a blender by adding all the ingredients at the start (operator fusion) and then performing the blend once.

* You lose a little time adding at the start but you gain all of the lost memory bandwidth time back.

* Fusing operations means that instead of writing the code in separate steps, such as `tom = np.sin(np.pi)`, `banu = np.cos(tom)`, `sam = banu + 5 * np.sin(np.pi)`, we combine them into a single expression, `np.cos(np.sin(np.pi)) + 5 * np.sin(np.pi)`, so that all the operations are done in one shot rather than in silos.
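The two styles in the last bullet can be compared directly. This is only a sketch in plain Python, using the standard `math` module so it runs anywhere (an assumption; the bullet uses NumPy), and the function names `unfused` and `fused` are hypothetical. It shows that both forms compute the same value. The kernel-level fusion that `torch.compile` performs on the GPU is not reproduced here.

```python
import math

def unfused():
    # each step stores an intermediate result before the next step runs
    tom = math.sin(math.pi)
    banu = math.cos(tom)
    sam = banu + 5 * math.sin(math.pi)
    return sam

def fused():
    # the same arithmetic written as one chained expression
    return math.cos(math.sin(math.pi)) + 5 * math.sin(math.pi)

step_by_step = unfused()
one_shot = fused()

# both versions should agree, since only the layout of the code changed
print(math.isclose(step_by_step, one_shot))
```

The result is identical either way. What fusion saves is the round trips to memory for `tom` and `banu`, not the arithmetic itself.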


## graph tracing:
* It is the process of running through the whole network to its final outcome before actually executing the operations in the network.
* It is like the minimax algorithm in adversarial game playing: before taking its next step to defeat the opponent, the algorithm runs the simulation in its head, working out the combinations of its own choices and its opponent's choices, and makes the next move based on what's ideal. It can see tens or even thousands of moves ahead and make effective decisions.
* This is graph tracing: think before you act!!!

![fsfs](https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/extras-graph-capture.gif)


* The caveat is that this graphing (thinking ahead of time, before any operation is due) takes some upfront time on the first run, but the subsequent runs will be smooth as butter.
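To get a feel for what a captured graph looks like, here is a minimal sketch using `torch.fx.symbolic_trace`. This is an assumption made for illustration: `torch.compile` does its own graph capture internally, and `TinyNet` is a hypothetical toy model, not the EfficientNet used in this notebook.

```python
import torch
from torch import nn
from torch.fx import symbolic_trace

class TinyNet(nn.Module):
    """Hypothetical toy model used only to illustrate graph capture."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.linear(x)) + 1.0

model = TinyNet()

# record the operations of forward() as a graph, without feeding real data
traced = symbolic_trace(model)
print(traced.graph)

# the traced module should produce the same output as the original
x = torch.randn(3, 4)
outputs_match = torch.allclose(model(x), traced(x))

# the kinds of nodes in the graph, in execution order
node_ops = [node.op for node in traced.graph.nodes]
print(outputs_match, node_ops)
```

The printed graph lists every operation from input to output in order. This is the "think before you act" view of the model that an optimizer can then work on.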






In [None]:
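# load the pretrained EfficientNet-B2 weights and their matching preprocessing transforms
weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT
transforms = weights.transforms()
# wrap the model with torch.compile; the actual compilation happens on the first forward pass
model = torch.compile(torchvision.models.efficientnet_b2(weights=weights))
model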

Downloading: "https://download.pytorch.org/models/efficientnet_b2_rwightman-c35c1473.pth" to /root/.cache/torch/hub/checkpoints/efficientnet_b2_rwightman-c35c1473.pth
100%|██████████| 35.2M/35.2M [00:00<00:00, 62.0MB/s]


OptimizedModule(
  (_orig_mod): EfficientNet(
    (features): Sequential(
      (0): Conv2dNormActivation(
        (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): SiLU(inplace=True)
      )
      (1): Sequential(
        (0): MBConv(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
              (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (2): SiLU(inplace=True)
            )
            (1): SqueezeExcitation(
              (avgpool): AdaptiveAvgPool2d(output_size=1)
              (fc1): Conv2d(32, 8, kernel_size=(1, 1), stride=(1, 1))
              (fc2): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1))
              (activation): SiLU(inplace=True)
              (s

# getting our device compute capability

In [None]:
torch.cuda.get_device_capability() # (7, 5) should be read as compute capability 7.5; this is for a T4 GPU

(7, 5)
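As a small sketch, the capability tuple can be compared against a minimum to decide whether the GPU counts as one of the "newer" ones. The threshold `(8, 0)` is an assumption (it is the compute capability of the A100 mentioned earlier), and `meets_capability` is a hypothetical helper name.

```python
import torch

def meets_capability(capability, minimum=(8, 0)):
    # tuples compare element by element, so (7, 5) < (8, 0) < (8, 6)
    return tuple(capability) >= tuple(minimum)

if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    print(f"compute capability: {capability[0]}.{capability[1]}")
    print("newer GPU:", meets_capability(capability))
else:
    print("no CUDA GPU available")
```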

# setting the device globally


In [11]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# set the default device for everything we create, unless explicitly changed.
torch.set_default_device(device)

# let's create a tensor to see which device it is on.
tensor = torch.tensor(data=np.random.randn(224, 224, 3))

print(f'device the tensor is on is: {tensor.device}')

device the tensor is on is: cpu
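If a tensor lands on an unexpected device, one option is to place it explicitly. A minimal sketch, assuming a NumPy array as the source. The cast to `float32` is an assumption added here (NumPy's `randn` produces `float64`), not something the cell above does.

```python
import numpy as np
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# NumPy arrays live in CPU memory; from_numpy shares that memory with the tensor
array = np.random.randn(224, 224, 3)
cpu_tensor = torch.from_numpy(array)

# cast to float32 and move to the target device explicitly
tensor = cpu_tensor.to(device=device, dtype=torch.float32)

print(tensor.device, tensor.dtype, tuple(tensor.shape))
```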


# how to make your GPU go Brrrrr....

* Making the best use of a GPU means giving it more work to do: the more data it processes at once, the lower the training time per sample tends to be, because more of the GPU is in use.

## we can get GPU speedups by:
1. Increasing the batch size.
2. Increasing the data (more data).
3. Increasing the model size (go for a bigger model!).
4. Decreasing the transfer between CPU and GPU by setting up all tensors to be in GPU memory (for example, right after the NumPy to tensor conversion).

![ffsd](https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/extras-speedups-are-biggest-when-more-gpu-is-used.png)


## Biggest relative speedups come with scale:
* A GPU makes use of parallelism (doing many computations at the same time). This is why, even as the workload grows, we see the relative speed of computation increase.


## Decrease batch size or image size when GPU memory is small:
* Decreasing the batch size or image size helps speed up the training process when the GPU memory is small.
* The more memory available on the GPU, the larger the batch size, the image size, the amount of data, the model we are going to train, and the number of tensors and operations inside the model can be.
* As a rough rule of thumb, if the total GPU memory is more than 16 GB, the batch size can be 128 and the image size 224. If the total GPU memory is less than 16 GB, the batch size should be 32 and the image size 64.
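The rule of thumb above can be sketched as a small helper. `pick_batch_and_image_size` is a hypothetical name, and treating exactly 16 GB as the "large" case is an assumption, since the text only covers the cases above and below 16 GB.

```python
def pick_batch_and_image_size(total_gpu_memory_gb):
    # rule of thumb from the text; treating exactly 16 GB as "large" is an assumption
    if total_gpu_memory_gb >= 16:
        return 128, 224
    return 32, 64

for memory_gb in (15, 40):
    batch_size, image_size = pick_batch_and_image_size(memory_gb)
    print(f"{memory_gb} GB -> batch size {batch_size}, image size {image_size}")
```

On a real machine the total could come from `torch.cuda.mem_get_info()`, which returns the free and total GPU memory in bytes.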


# TensorFloat-32 (TF32) for a mix of 16 bit and 32 bit floating point precision:
* This format can significantly speed up tensor operations such as matrix multiplication on GPUs that support it (NVIDIA Ampere and newer).

In [13]:
# setting to allow for tf32 computation.
torch.backends.cuda.matmul.allow_tf32 = True
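TF32 only takes effect on GPUs with hardware support, so the flag can be guarded by a capability check. A sketch, assuming compute capability 8.0 (Ampere) as the minimum. `enable_tf32_if_supported` is a hypothetical helper, and enabling the cuDNN flag as well is an addition the cell above does not make. On the T4 (7.5) shown earlier it would return `False`.

```python
import torch

def enable_tf32_if_supported(minimum=(8, 0)):
    # TF32 needs Ampere-class hardware or newer; older GPUs keep using regular float32
    if torch.cuda.is_available() and torch.cuda.get_device_capability() >= minimum:
        torch.backends.cuda.matmul.allow_tf32 = True  # matrix multiplications
        torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions
        return True
    return False

tf32_enabled = enable_tf32_if_supported()
print("TF32 enabled:", tf32_enabled)
```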
