# PyTorch Tutorial 20: Quantization and Efficiency (The Systems Layer)

**Author:** [Your Name/Organization]  
**Date:** 2025  

In the world of Large Language Models, **Memory is Money**. 

A 70 Billion parameter model in standard 32-bit precision requires:
$$ 70 \times 10^9 \times 4 \text{ bytes} \approx 280 \text{ GB VRAM} $$

Holding the weights alone takes four 80 GB A100 GPUs ($80k+ of hardware). With **Quantization** down to 4 bits per weight, the same model needs about 35 GB and fits on a single GPU.
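
The arithmetic above can be reproduced in a few lines. This is a minimal sketch that counts weight storage only (no activations, optimizer state, or KV cache) and uses 1 GB = 10^9 bytes; `weight_memory_gb` is a hypothetical helper name, not a library function.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# 70B parameters at the precisions discussed in this tutorial
for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Real deployments need extra headroom on top of these figures, so treat them as lower bounds.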

This tutorial dives into the "Systems" side of AI: making models run fast and cheap.

## Learning Objectives
1.  **Understand Precision**: FP32 vs BF16 vs INT8.
2.  **Implement Quantization**: Write a function to compress a tensor from 32-bit float to 8-bit integer.
3.  **Implement LoRA**: Build a Low-Rank Adapter layer from scratch to fine-tune massive models efficiently.

---

## 1. Vocabulary First

-   **FP32 (Single Precision)**: Standard float format. 1 sign bit, 8 exponent, 23 mantissa. (4 bytes)
-   **BF16 (Bfloat16)**: Truncated FP32. 1 sign, 8 exponent, 7 mantissa. Same range as FP32, less precision. (2 bytes)
-   **INT8**: 8-bit Integer. Values from -128 to 127. (1 byte)
-   **Quantization**: The process of mapping a large continuous range (float) to a small discrete set (int).
-   **LoRA (Low-Rank Adaptation)**: Instead of updating all weights $W$, we learn a small update $\Delta W = BA$, where $A$ and $B$ are low-rank matrices far smaller than $W$.
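
The formats in this list can be inspected directly from PyTorch. The sketch below uses `torch.finfo` and `torch.iinfo` to print the bit width, range, and (for floats) machine epsilon of each dtype; the `bytes_per_element` dictionary is just an illustrative name.

```python
import torch

# Floating-point formats: bits, largest finite value, and machine epsilon
for dtype in (torch.float32, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} bits={info.bits:2d} max={info.max:.3e} eps={info.eps:.3e}")

# Integer format: exact range
int8 = torch.iinfo(torch.int8)
print(f"{str(torch.int8):15} bits={int8.bits:2d} min={int8.min} max={int8.max}")

# Bytes per element, matching the sizes listed above
bytes_per_element = {
    dtype: torch.empty(0, dtype=dtype).element_size()
    for dtype in (torch.float32, torch.bfloat16, torch.int8)
}
print(bytes_per_element)
```

BF16 reports nearly the same `max` as FP32 but a much larger `eps`, which is the "same range, less precision" trade-off in numbers.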

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


## 2. Implementing Int8 Quantization

We will implement **Absmax Quantization**. This is the simplest form of symmetric quantization.

### The Formula
To map a float tensor $X$ to int8:

1.  Find the absolute maximum value: $S = \max(|X|)$.
2.  Calculate the scaling factor: $scale = 127 / S$.
3.  Quantize: $X_{quant} = \text{round}(X \times scale)$.
4.  Clamp: Ensure values stay within [-128, 127].

To get back the original values (Dequantization):
$$ X_{approx} = X_{quant} / scale $$
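
Before moving to tensors, the four steps can be traced by hand on three numbers. This is a plain-Python sketch with made-up input values, chosen only to make the arithmetic easy to follow.

```python
values = [-1.0, 0.333, 2.54]               # toy float "tensor"

S = max(abs(v) for v in values)            # step 1: absolute maximum
scale = 127 / S                            # step 2: scaling factor
quantized = [round(v * scale) for v in values]            # step 3: round
quantized = [max(-128, min(127, q)) for q in quantized]   # step 4: clamp
approx = [q / scale for q in quantized]    # dequantize

print("scale:    ", scale)
print("quantized:", quantized)
print("approx:   ", [f"{a:.4f}" for a in approx])
```

The largest-magnitude input lands on 127, and every other value is snapped to the nearest multiple of `1 / scale`.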

In [2]:
def quantize_int8(tensor):
    """
    Quantizes a float32 tensor to int8 using Absmax Quantization.
    Returns the quantized tensor (int8) and the scale factor (float).
    """
    # 1. Find the absolute max value
    absmax = torch.abs(tensor).max()
    
    # Avoid division by zero; keep scale a tensor so scale.item() always works
    if absmax == 0:
        scale = torch.ones_like(absmax)
    else:
        # 2. Calculate scale to map max value to 127
        scale = 127.0 / absmax
    
    # 3. Quantize
    # We multiply by scale and round to nearest integer
    quantized = torch.round(tensor * scale)
    
    # 4. Clamp to int8 range [-128, 127]
    quantized = torch.clamp(quantized, -128, 127)
    
    # Cast to actual int8 type to save memory
    quantized = quantized.to(torch.int8)
    
    return quantized, scale

def dequantize_int8(quantized, scale):
    """
    Dequantizes an int8 tensor back to float32.
    """
    # Convert back to float for calculation
    quantized_float = quantized.to(torch.float32)
    
    # Reverse the scaling
    return quantized_float / scale

### Testing Quantization

Let's see how much error this introduces.

In [3]:
# Create a random tensor
original = torch.randn(5, 5) * 10.0 # Scale it up a bit
print("Original (First Row):", original[0])

# Quantize
q_tensor, scale = quantize_int8(original)
print("\nQuantized (int8):", q_tensor[0])
print(f"Scale Factor: {scale.item():.4f}")

# Dequantize
reconstructed = dequantize_int8(q_tensor, scale)
print("\nReconstructed:", reconstructed[0])

# Calculate Error (MSE)
mse = F.mse_loss(original, reconstructed)
print(f"\nMean Squared Error: {mse.item():.6f}")

# Memory Savings
orig_mem = original.element_size() * original.numel()
quant_mem = q_tensor.element_size() * q_tensor.numel()
print(f"Memory: {orig_mem} bytes -> {quant_mem} bytes ({orig_mem/quant_mem}x compression)")

Original (First Row): tensor([-10.7216,  22.7951,  -2.6229, -12.3593,   7.3067])

Quantized (int8): tensor([-60, 127, -15, -69,  41], dtype=torch.int8)
Scale Factor: 5.5714

Reconstructed: tensor([-10.7693,  22.7951,  -2.6923, -12.3847,   7.3591])

Mean Squared Error: 0.002523
Memory: 100 bytes -> 25 bytes (4.0x compression)
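
The size of this error is predictable: rounding moves each value by at most half a quantization step, so the reconstruction error is at most `0.5 / scale` in the original units. The sketch below checks that bound on random data; it inlines the same absmax scheme so the block runs on its own, and the seed and tensor size are arbitrary choices.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1000) * 10.0

# Same absmax scheme as quantize_int8 above, inlined so this block runs alone
scale = 127.0 / x.abs().max()
x_q = torch.clamp(torch.round(x * scale), -128, 127).to(torch.int8)
x_hat = x_q.to(torch.float32) / scale

max_err = (x - x_hat).abs().max()
bound = 0.5 / scale            # half of one quantization step

print(f"step size:         {(1.0 / scale).item():.5f}")
print(f"max abs error:     {max_err.item():.5f}")
print(f"bound (0.5/scale): {bound.item():.5f}")
```

Because the step size is `max(|X|) / 127`, a single large value in the tensor widens the step for every other element.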


## 3. Implementing LoRA (Low-Rank Adaptation)

Fine-tuning a 70B model involves updating 70B weights. That's expensive.
LoRA freezes the main weights $W$ and adds a parallel branch with two tiny matrices $A$ and $B$.

$$ h = Wx + BAx $$

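Where $W$ is $[d_{out}, d_{in}]$, $A$ is $[r, d_{in}]$, and $B$ is $[d_{out}, r]$ (the same layout `nn.Linear` uses for its weight).
With $r=8$ and $d_{in}=d_{out}=1024$, the adapters hold 2 × 8 × 1024 = 16,384 trainable parameters versus 1,048,576 in $W$, about 1.6%. The savings grow as the layer gets larger.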
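
To see how the savings depend on layer size, the parameter counts can be computed directly. This is a small sketch for a single square linear layer with bias ignored; `lora_param_counts` is a hypothetical helper and the layer widths are illustrative.

```python
def lora_param_counts(d_in: int, d_out: int, r: int):
    """Return (full, lora) trainable weight counts for one linear layer."""
    full = d_in * d_out            # every entry of W
    lora = r * d_in + d_out * r    # A plus B
    return full, lora

for d in (1024, 4096, 8192):
    full, lora = lora_param_counts(d, d, r=8)
    print(f"d={d:5d}: full={full:>11,}  lora={lora:>8,}  ratio={100 * lora / full:.2f}%")
```

The full count grows quadratically with the width while the adapter count grows linearly, so the trainable fraction shrinks as layers get wider.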

In [4]:
class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        
        # 1. The Pre-trained Weight (Frozen)
        # In a real library, this would wrap an existing layer
        self.linear = nn.Linear(in_features, out_features)
        self.linear.weight.requires_grad = False # FREEZE IT!
        self.linear.bias.requires_grad = False
        
        # 2. The LoRA Adapters (Trainable)
        # A: weight [rank, in] -> Kaiming-uniform init (nn.Linear's default)
        # B: weight [out, rank] -> Zero init (so the adapter starts as a no-op)
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        
        # 3. Scaling
        self.scaling = alpha / rank
        
        # Initialize weights
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)
        
    def forward(self, x):
        # Original Path (Frozen)
        original_output = self.linear(x)
        
        # LoRA Path (Trainable)
        # x -> A -> B -> Scale
        lora_output = self.lora_b(self.lora_a(x)) * self.scaling
        
        # Combine
        return original_output + lora_output

### Testing LoRA

Because B starts at zero, the LoRA branch contributes nothing at initialization. Let's count how few of the layer's parameters are actually trainable and check the output shape.

In [5]:
# Dimensions
d_in, d_out = 1024, 1024
rank = 8

# Create Layer
layer = LoRALinear(d_in, d_out, rank=rank)

# Count Parameters
total_params = sum(p.numel() for p in layer.parameters())
trainable_params = sum(p.numel() for p in layer.parameters() if p.requires_grad)

print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters (LoRA): {trainable_params:,}")
print(f"Percentage Trainable: {100 * trainable_params / total_params:.2f}%")

# Verify Output
x = torch.randn(1, d_in)
y = layer(x)
print(f"Output Shape: {y.shape}")

Total Parameters: 1,065,984
Trainable Parameters (LoRA): 16,384
Percentage Trainable: 1.54%
Output Shape: torch.Size([1, 1024])
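
The claim that LoRA is a no-op at initialization can also be checked numerically. The sketch below rebuilds the two branches from plain `nn.Linear` modules so it runs on its own; the small dimensions and the `0.01` nudge that stands in for a training step are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, rank, alpha = 64, 32, 8, 16
scaling = alpha / rank

base = nn.Linear(d_in, d_out)                  # stands in for the frozen layer
lora_a = nn.Linear(d_in, rank, bias=False)     # default nn.Linear init
lora_b = nn.Linear(rank, d_out, bias=False)
nn.init.zeros_(lora_b.weight)                  # zero init, as in LoRALinear

x = torch.randn(4, d_in)
with torch.no_grad():
    y_base = base(x)
    y_lora = base(x) + lora_b(lora_a(x)) * scaling
    same_at_init = torch.allclose(y_base, y_lora)

    # Nudge B away from zero, as a training step would
    lora_b.weight.add_(0.01)
    y_after = base(x) + lora_b(lora_a(x)) * scaling
    same_after_update = torch.allclose(y_base, y_after)

print("identical at init:    ", same_at_init)
print("identical after nudge:", same_after_update)
```

Starting from an exact no-op means fine-tuning begins from the pre-trained model's behavior rather than from a randomly perturbed version of it.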


## 4. Conclusion

You have now implemented the two pillars of efficient AI:
1.  **Quantization**: Cutting memory usage 4x (FP32 to INT8) with a small reconstruction error.
2.  **LoRA**: Reducing trainable parameters by about 98% in our 1024 × 1024 example, and by more for larger layers.

This concludes the PyTorch Tutorial Series. You have gone from Tensors (Notebook 00) to building Agents (18), Aligning them (19), and Optimizing them (20). You are ready for the industry.