<a href="https://colab.research.google.com/github/Firojpaudel/Demystifying_Language_Modeling/blob/main/Computes/Pytorch_and_FLOPS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Floating Point Precisions:
---



Before diving into tensors, let's first look at the GPU architectures.

Here, we will compare two popular GPUs: one chosen for SOTA performance and the other for availability.

For SOTA performance, we have Nvidia's **H100** GPU. It is built on the Hopper architecture and was released in 2022.

**Some of its features:**
1. It has $80\text{GB HBM3}$ memory. (**HBM3** is the fourth generation of High Bandwidth Memory technology)
2. Bandwidth: $3.35 \text{ TB/s}$
3. It is optimised for massive models (e.g., LLMs, diffusion models) with high memory and compute demands.

> **Note ⓘ** \
> These are not provided in Colab's Free Tier. However, the Pro Tier has one.


Now, let's move on to the **T4** GPU.

**T4** is based on the Turing architecture.

1. Release Date: 2018
2. Built for energy-efficient inference and lightweight ML workloads in data centers.
3. $16\text{GB GDDR6}$ memory, $320 \text{ GB/s}$ memory bandwidth, no NVLink

---

Also, while we are at it, let's compare $\text{TFLOPs}$ between these two:


| Precision | H100 (SXM, 80 GB) | T4 (16 GB) |
|-----------|-------------------|------------|
| FP32 | 67 TFLOPS | 8.1 TFLOPS |
| FP16 | 989 TFLOPS (Tensor Cores) | 65 TFLOPS (Tensor Cores) |
| BF16 | 989 TFLOPS (Tensor Cores) | Not natively supported (emulated in software) |
| FP8 | 1979 TFLOPS (Transformer Engine) | Not supported |



> **Takeaway ⩩** \
> H100’s vastly superior TFLOPS across all precisions, especially FP8 and BF16, makes it ideal for SOTA AI workloads, while T4’s modest FP16/FP32 performance suits lighter tasks.
---
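
To make the table more concrete, here is a minimal sketch that turns peak TFLOPS into a best-case runtime for a single matrix multiplication. It assumes the usual count of $2 \times M \times K \times N$ floating point operations for an $(M \times K)(K \times N)$ product, and it assumes the GPU sustains its peak rate, which real kernels rarely do. The helper names `matmul_flops` and `ideal_time_us` are made up for this illustration.

```python
# Peak throughput in TFLOPS, copied from the comparison table.
PEAK_TFLOPS = {
    "H100": {"FP32": 67.0, "FP16": 989.0},
    "T4": {"FP32": 8.1, "FP16": 65.0},
}


def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * k * n


def ideal_time_us(flops: float, peak_tflops: float) -> float:
    """Lower bound on runtime in microseconds, assuming peak throughput is sustained."""
    return flops / (peak_tflops * 1e12) * 1e6


# Same shapes as the matmul used later in this notebook: (1024 x 2048) @ (2048 x 1024).
flops = matmul_flops(1024, 2048, 1024)
for gpu, peaks in PEAK_TFLOPS.items():
    for precision, tflops in peaks.items():
        print(f"{gpu} {precision}: at least {ideal_time_us(flops, tflops):.1f} us")
```

These numbers are lower bounds only; measured times also depend on memory bandwidth, kernel launch overhead, and how well the shapes map onto Tensor Cores.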

Now that we know a bit about the architectures, let's look at tensor data types (precision), this time with code :)

> **Note ⓘ** \
> Since we will be using the Colab free tier GPU (T4) the entire time, we will stick to the formats that are compatible with this GPU.

---

So, first: precision types and their performance (we will be testing on MatMul)


In [1]:
#! Imports
import torch

In [2]:
##! FP32 - Matmul

# Two FP32 operands and their product, all allocated on the GPU
A = torch.randn(1024, 2048, device="cuda", dtype=torch.float32)
B = torch.randn(2048, 1024, device="cuda", dtype=torch.float32)
C = torch.matmul(A, B)

print("------")
print(f"Output type: {C.dtype}")
# memory_allocated() returns bytes; divide by 1024**2 to report MB
print(f"VRAM Usage: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print("------")

------
Output type: torch.float32
VRAM Usage: 28.12 MB
------


In [2]:
##! FP16 - Matmul

# Same shapes as before, but every tensor is stored in FP16 (2 bytes per element)
A = torch.randn(1024, 2048, device="cuda", dtype=torch.float16)
B = torch.randn(2048, 1024, device="cuda", dtype=torch.float16)
C = torch.matmul(A, B)

print("------")
print(f"Output type: {C.dtype}")
print(f"VRAM Usage: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print("------")

------
Output type: torch.float16
VRAM Usage: 18.12 MB
------


In [2]:
##! FP16 - Matmul with AMP

# Inputs stay in the default FP32; autocast runs the matmul in FP16
A = torch.randn(1024, 2048, device="cuda")
B = torch.randn(2048, 1024, device="cuda")
with torch.amp.autocast('cuda', dtype=torch.float16):
  C = torch.matmul(A, B)

print("------")
print(f"Output type: {C.dtype}")
print(f"VRAM Usage: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print("------")

------
Output type: torch.float16
VRAM Usage: 26.12 MB
------


In [2]:
##! BF16 - Emulated (for T4)

# Inputs stay in FP32; autocast produces a BF16 result
A = torch.randn(1024, 2048, device="cuda")
B = torch.randn(2048, 1024, device="cuda")
with torch.amp.autocast('cuda', dtype=torch.bfloat16):
  C = torch.matmul(A, B)

print("------")
print(f"Output type: {C.dtype}")
print(f"VRAM Usage: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print("------")

------
Output type: torch.bfloat16
VRAM Usage: 26.12 MB
------


In [2]:
##! SOTA - batched MatMul

# torch.bmm expects 3D tensors: (batch, n, m) @ (batch, m, p) -> (batch, n, p)
A = torch.randn(32, 512, 1024, device="cuda")
B = torch.randn(32, 1024, 512, device="cuda")
with torch.amp.autocast('cuda', dtype=torch.bfloat16):
  C = torch.bmm(A, B)

print("------")
print(f"Output type: {C.dtype}")
print(f"VRAM Usage: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print("------")

------
Output type: torch.bfloat16
VRAM Usage: 152.12 MB
------
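
The cells above report memory only. As a rough complement, the sketch below times the same matmul in different precisions. It is one possible way to measure this, not the notebook's own method: the helper name `time_matmul`, the warm-up count, and the iteration count are all choices made for this illustration. It falls back to the CPU (FP32 only) when no GPU is present.

```python
import time

import torch


def time_matmul(dtype: torch.dtype, m: int = 1024, k: int = 2048, n: int = 1024,
                device: str = "cuda", iters: int = 10) -> float:
    """Return the mean wall-clock time in milliseconds of an (m x k) @ (k x n) matmul."""
    a = torch.randn(m, k, device=device, dtype=dtype)
    b = torch.randn(k, n, device=device, dtype=dtype)
    on_gpu = torch.device(device).type == "cuda"

    # Warm-up runs so one-time setup costs are not included in the timing.
    for _ in range(3):
        torch.matmul(a, b)
    if on_gpu:
        torch.cuda.synchronize()  # CUDA kernels run asynchronously

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if on_gpu:
        torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters * 1e3


device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: FP16 matmul is only timed on the GPU, since CPU support varies by build.
dtypes = [torch.float32, torch.float16] if device == "cuda" else [torch.float32]
for dtype in dtypes:
    print(f"{dtype} on {device}: {time_matmul(dtype, device=device):.3f} ms")
```

Timings vary from run to run, so treat any single number as indicative only.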


---
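Now that we have compared the precisions and their VRAM usage, storing tensors directly in `fp16` makes the most sense here: it used 18.12 MB versus 28.12 MB for `fp32`. With autocast the inputs stay in `fp32` and only the output shrinks, so usage drops just to 26.12 MB.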
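
The reported figures can be sanity-checked by hand: each tensor takes (number of elements) × (bytes per element). The sketch below does that bookkeeping for the cells above and compares it with the printed VRAM usage. The remainder left over in every case is assumed here to be a fixed workspace reserved by the matmul backend; that is an interpretation, not something the notebook measures.

```python
from math import prod

MIB = 1024 ** 2
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def expected_mib(tensors):
    """Sum the sizes (in MiB) of tensors given as (shape, dtype) pairs."""
    return sum(prod(shape) * BYTES_PER_ELEMENT[dtype] for shape, dtype in tensors) / MIB


# (A, B, C) for each cell, followed by the VRAM figure that cell printed.
CELLS = {
    "FP32 matmul": ([((1024, 2048), "fp32"), ((2048, 1024), "fp32"), ((1024, 1024), "fp32")], 28.12),
    "FP16 matmul": ([((1024, 2048), "fp16"), ((2048, 1024), "fp16"), ((1024, 1024), "fp16")], 18.12),
    "FP16 autocast": ([((1024, 2048), "fp32"), ((2048, 1024), "fp32"), ((1024, 1024), "fp16")], 26.12),
    "BF16 autocast": ([((1024, 2048), "fp32"), ((2048, 1024), "fp32"), ((1024, 1024), "bf16")], 26.12),
    "BF16 batched": ([((32, 512, 1024), "fp32"), ((32, 1024, 512), "fp32"), ((32, 512, 512), "bf16")], 152.12),
}

for name, (tensors, reported) in CELLS.items():
    tensors_mib = expected_mib(tensors)
    print(f"{name:14s} tensors={tensors_mib:7.2f} MiB  reported={reported:7.2f} MB  "
          f"remainder={reported - tensors_mib:.2f}")
```

The tensor sizes alone explain the differences between cells; the remainder is the same in every case, which is why it looks like a fixed overhead rather than something tied to precision.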

---

#### Up-Next — How Big of a Model Can I Fit on My GPU?

Let’s say we’re curious and have been wondering:

> *“If I want to train a model with 500M, 1B, or 2B parameters, will it fit in 16 GB GPU VRAM?”*

To answer that, let's go step-by-step with a practical estimation formula.

---

**Practical VRAM Estimation Formula (FP16 + AdamW):**

$$
\text{VRAM (GB)} \approx
\underbrace{\text{Param Memory}}_\text{Weights + Adam states}
+
\underbrace{\text{Activation Memory}}_\text{Per sample × batch size}
+
\underbrace{1.5}_\text{Safety Buffer}
$$

---

##### 1. Param Memory (in GB)
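For **FP16 weights** (2 bytes/param) and the **AdamW optimizer**, this estimate keeps 3 copies per parameter (weights + momentum + variance), i.e. $3 \times 2 = 6$ bytes/param. It assumes the optimizer states are also stored in FP16 and leaves gradients out:

$$
\text{Param Memory} = \frac{6 \times (P \times 1000)}{1024}
$$
- $P$: Parameters in **billions** (e.g., 0.5 for 500M)  
- Multiplying by 1000 converts billions → millions, so the numerator is in MB; dividing by 1024 then converts MB to GB  
- *Example*: 1B model = $\frac{6 \times 1000}{1024} \approx 5.86 \text{ GB}$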
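
As a quick check of the parameter term, the sketch below evaluates the formula as written and, for comparison, the same byte count converted with a pure binary factor ($1024^3$). The function names are made up for this illustration. The formula mixes decimal (×1000) and binary (÷1024) units, so it comes out slightly higher than the pure binary conversion, which errs on the safe side.

```python
def param_memory_gb(p_billion: float, bytes_per_param: int = 2, copies: int = 3) -> float:
    """Parameter term as written in the formula: (bytes x copies) x (P x 1000) / 1024."""
    return bytes_per_param * copies * (p_billion * 1000) / 1024


def param_memory_gib_exact(p_billion: float, bytes_per_param: int = 2, copies: int = 3) -> float:
    """Same byte count converted with a pure binary factor: bytes / 1024**3."""
    return bytes_per_param * copies * p_billion * 1e9 / 1024 ** 3


for p in (0.5, 1.0, 2.0):
    print(f"P={p}B  formula={param_memory_gb(p):.2f} GB  "
          f"binary={param_memory_gib_exact(p):.2f} GiB")
```

Changing `copies` or `bytes_per_param` shows how sensitive the estimate is to what is counted per parameter.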

---

##### 2. Activation Memory (in GB)
For forward + backward passes:

$$
\text{Activation Memory} = \frac{8 \times L \times H \times S \times B}{1024^3}
$$
- $ L $: Layers  
- $ H $: Hidden size  
- $ S $: Sequence length  
- $ B $: Batch size  
- *Derivation*: 4 bytes/activation × 2 (fwd/bwd) × tensor dimensions  

---

##### Final Combined Formula
$$
\boxed{\text{VRAM (GB)} \approx
\frac{6 \times (P \times 1000)}{1024}
+
\frac{8 \times L \times H \times S \times B}{1024^3}
+
1.5}
$$
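
The combined formula translates directly into a few lines of Python. This is a minimal sketch of the estimate exactly as stated, with the function name `estimate_vram_gb` and its argument names chosen for this illustration; the three configurations are the ones used in the examples in this section.

```python
def estimate_vram_gb(p_billion: float, layers: int, hidden: int, seq_len: int,
                     batch: int, buffer_gb: float = 1.5) -> dict:
    """Evaluate the VRAM estimate and return each term separately, in GB."""
    param = 6 * (p_billion * 1000) / 1024                             # weights + Adam states
    activation = 8 * layers * hidden * seq_len * batch / 1024 ** 3    # forward + backward
    return {
        "param": param,
        "activation": activation,
        "buffer": buffer_gb,
        "total": param + activation + buffer_gb,
    }


CONFIGS = {
    "500M": dict(p_billion=0.5, layers=24, hidden=1024, seq_len=128, batch=4),
    "1B": dict(p_billion=1.0, layers=24, hidden=1024, seq_len=128, batch=4),
    "2B": dict(p_billion=2.0, layers=24, hidden=2048, seq_len=128, batch=4),
}

T4_VRAM_GB = 16
for name, cfg in CONFIGS.items():
    est = estimate_vram_gb(**cfg)
    verdict = "fits" if est["total"] <= T4_VRAM_GB else "does not fit"
    print(f"{name}: param={est['param']:.2f}  activation={est['activation']:.2f}  "
          f"total={est['total']:.2f} GB -> {verdict} in {T4_VRAM_GB} GB")
```

Returning the individual terms makes it easy to see which one dominates; for these configurations it is the parameter term by a wide margin.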

---

##### Verified Examples
**500M Parameter Model** ($P=0.5$)
- Config: $L=24, H=1024, S=128, B=4$

- **Param**: $ \frac{6 \times 500}{1024} \approx 2.93 \text{ GB} $

- **Activation**: $ \frac{8 \times 24 \times 1024 \times 128 \times 4}{1024^3} \approx 0.09 \text{ GB} $

- **Total**: $ 2.93 + 0.09 + 1.5 = \boxed{4.52 \text{ GB}} $  

> Fits easily on T4 (16GB)

**1B Parameter Model** ($P=1$)
- Config: $L=24, H=1024, S=128, B=4$

- **Param**: $ \frac{6 \times 1000}{1024} \approx 5.86 \text{ GB} $

- **Activation**: $ 0.09 \text{ GB} $ (same as above)

- **Total**: $ 5.86 + 0.09 + 1.5 = \boxed{7.45 \text{ GB}} $  

> Comfortably fits

**2B Parameter Model** ($P=2$)
- Config: $L=24, H=2048, S=128, B=4$

- **Param**: $ \frac{6 \times 2000}{1024} \approx 11.72 \text{ GB} $

- **Activation**: $ \frac{8 \times 24 \times 2048 \times 128 \times 4}{1024^3} \approx 0.19 \text{ GB} $

- **Total**: $ 11.72 + 0.19 + 1.5 = \boxed{13.41 \text{ GB}} $  

> **Fits but borderline** (16GB - 13.41GB ≈ 2.6GB headroom)
---
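
The same formula can be turned around to answer the original question directly: given a VRAM budget, how large can $P$ be? The sketch below solves the estimate for $P$. It assumes the activation term is fixed by the chosen $L, H, S, B$ and does not grow with $P$, which is how the formula in this section treats it; the function name `max_params_billion` is made up for this illustration.

```python
def max_params_billion(vram_gb: float, layers: int, hidden: int, seq_len: int,
                       batch: int, buffer_gb: float = 1.5) -> float:
    """Largest P (in billions) for which the VRAM estimate stays within vram_gb."""
    activation = 8 * layers * hidden * seq_len * batch / 1024 ** 3
    available = vram_gb - activation - buffer_gb   # GB left for weights + Adam states
    if available <= 0:
        return 0.0
    # Invert param_memory = 6 * (P * 1000) / 1024 for P.
    return available * 1024 / (6 * 1000)


p_max = max_params_billion(16, layers=24, hidden=2048, seq_len=128, batch=4)
print(f"Largest model that fits in 16 GB under this estimate: about {p_max:.2f}B parameters")
```

For the 2B configuration above this works out to roughly 2.44B parameters on a 16 GB card, which matches the "borderline" verdict for the 2B model.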
