Analysis of effects of quantization on model quality

Siddhinita/Quantization


1. Floating Point Representations & Bfloat16

Standard floating point numbers (IEEE 754) are represented as: $$Value = (-1)^{S} \times 2^{E - bias} \times (1 + M)$$ where $S$ is the sign bit, $E$ is the stored (biased) exponent, and $M$ is the fraction encoded by the mantissa bits.

Core Concept: Bit Allocation

  • Exponent Bits determine the Dynamic Range (magnitude: how large or small the number can be).
  • Mantissa Bits determine the Precision (resolution: how many significant digits/detail we can store).

Comparison: FP32 vs. FP16 vs. Bfloat16

The key difference lies in how the 32 or 16 bits are allocated between these two goals.

| Format | Sign | Exponent (Range) | Mantissa (Precision) | Effective Range | Precision (Sig. Digits) | Use Case |
|--------|------|------------------|----------------------|-----------------|-------------------------|----------|
| FP32 | 1 | 8 | 23 | $\pm 1.4 \times 10^{-45}$ to $\pm 3.4 \times 10^{38}$ | ~7 decimals | Master weights, optimizer states |
| FP16 | 1 | 5 | 10 | $\pm 6 \times 10^{-8}$ to $\pm 65{,}504$ | ~3-4 decimals | Traditional mixed precision |
| BF16 | 1 | 8 | 7 | $\pm 9.2 \times 10^{-41}$ to $\pm 3.4 \times 10^{38}$ | ~2-3 decimals | Modern ML training (TPUs/GPUs) |

Key Distinction:

  • Bfloat16 (Brain Floating Point): Truncates the mantissa (precision) but keeps the same 8-bit exponent as FP32. This preserves the dynamic range, making it stable for training without aggressive loss scaling.
  • FP16: Has higher precision (10 mantissa bits) but a smaller exponent (5 bits). This results in a very limited range (max ~65k), making it prone to overflow/underflow.

Example: Representing -3.5 in FP16

To store -3.5 in Float16 (1 Sign, 5 Exponent, 10 Mantissa):

  1. Sign: Negative $\rightarrow$ 1
  2. Binary Scientific Notation:
    • $3.5_{10} = 11.1_{2} = 1.11_{2} \times 2^{1}$
    • Exponent is $1$. Mantissa (fraction) is $0.11$.
  3. Biased Exponent:
    • FP16 Bias = 15.
    • $Stored = 1 + 15 = 16$.
    • Binary for 16 $\rightarrow$ 10000
  4. Mantissa (Stored):
    • Drop the leading 1 (implicit). Store only .11.
    • Pad to 10 bits $\rightarrow$ 1100000000

Result: 1 10000 1100000000 (Hex: 0xC300)
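The encoding above can be checked directly with Python's `struct` module, whose `"e"` format code packs a value as IEEE 754 half precision:

```python
import struct

# Pack -3.5 as IEEE 754 half-precision (FP16), big-endian, then read
# back the raw 16 bits and split them into the three fields.
bits = int.from_bytes(struct.pack(">e", -3.5), "big")

sign     = bits >> 15            # 1 bit
exponent = (bits >> 10) & 0x1F   # 5 bits (biased by 15)
mantissa = bits & 0x3FF          # 10 bits (implicit leading 1 dropped)

print(f"hex: 0x{bits:04X}")      # hex: 0xC300
print(f"sign={sign} exponent={exponent:05b} mantissa={mantissa:010b}")
```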


2. Integer Quantization (Inference)

A. Uniform Quantization (Integer)

We map continuous floating-point values to discrete integers (Int8/Int4). We define a clipping range $[ \alpha, \beta ]$ for the floating point values (where $\alpha = x_{min}$ and $\beta = x_{max}$) and map them to the integer range $[q_{min}, q_{max}]$ (e.g., 0 to 255 for uint8, or -128 to 127 for int8).

A.1 Asymmetric (Affine) Quantization

Used for distributions not centered at zero (e.g., ReLU activations: $0 \to \infty$). We need both a scale and a shift (zero point).

1. Calculate Scale ($S$) and Zero Point ($Z$):

$$S = \frac{\beta - \alpha}{q_{max} - q_{min}}$$

$$Z = round(q_{min} - \frac{\alpha}{S})$$

2. The Transformations:

  • Quantize (Float $\to$ Int): $$q = clamp\left( round\left( \frac{x}{S} + Z \right), q_{min}, q_{max} \right)$$
  • Dequantize (Int $\to$ Float): $$x = S \times (q - Z)$$
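A minimal sketch of these two transformations in pure Python (uint8 range, per-tensor min/max calibration; names are illustrative):

```python
# Asymmetric (affine) uint8 quantization, following the S/Z formulas above.
def asymmetric_quantize(xs, q_min=0, q_max=255):
    alpha, beta = min(xs), max(xs)
    scale = (beta - alpha) / (q_max - q_min)
    zero_point = round(q_min - alpha / scale)
    q = [max(q_min, min(q_max, round(x / scale + zero_point))) for x in xs]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]

# ReLU-style activations: all non-negative, not centered at zero.
xs = [0.0, 0.5, 1.2, 3.1, 6.2]
q, s, z = asymmetric_quantize(xs)
x_hat = dequantize(q, s, z)
# Round-trip error is bounded by half a quantization step (S / 2).
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(xs, x_hat))
```

Because the input minimum is exactly 0 here, the zero point lands at $q_{min}$ and real zero is represented exactly, which is precisely why the affine form suits ReLU outputs.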

A.2 Symmetric Quantization

Used for distributions centered at zero (e.g., Weights). We force the range to be symmetric $[-\alpha, \alpha]$ and force $Z=0$.

1. Calculate Scale ($S$): We take the absolute maximum of the tensor: $\alpha = \max(|x_{min}|, |x_{max}|)$. $$S = \frac{\alpha}{q_{max}}$$ (Note: For int8, $q_{max}=127$. We usually avoid -128 to keep symmetry).

2. The Transformations:

  • Quantize (Float $\to$ Int): $$q = clamp\left( round\left( \frac{x}{S} \right), -q_{max}, q_{max} \right)$$
  • Dequantize (Int $\to$ Float): $$x = S \times q$$
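The symmetric variant in the same sketch style (int8, $Z = 0$, and $-128$ left unused as noted above):

```python
# Symmetric int8 quantization: zero point forced to 0, q_max = 127.
def symmetric_quantize(xs, q_max=127):
    alpha = max(abs(min(xs)), abs(max(xs)))
    scale = alpha / q_max
    q = [max(-q_max, min(q_max, round(x / scale))) for x in xs]
    return q, scale

xs = [-0.8, -0.1, 0.0, 0.3, 0.8]   # roughly zero-centered, like weights
q, s = symmetric_quantize(xs)
x_hat = [s * qi for qi in q]        # dequantize: x = S * q
assert q[2] == 0                    # exact zero stays exactly zero
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(xs, x_hat))
```

Dropping the zero point makes dequantization a single multiply, which is why symmetric quantization is the default for weights.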

B. Non-Uniform Quantization (NF4)

This refers to NF4 (Normal Float 4), popularized by QLoRA.

  • The Problem: Neural network weights usually follow a Bell Curve (Normal Distribution), not a flat line. Using evenly spaced integers (Uniform Quantization) wastes bits on the "tails" where few weights exist.
  • The Solution (NF4): The quantization bins are strictly defined by the quantiles of the Normal Distribution (0 to 1). There are more bins near zero (where most weights are) and fewer bins at the extremes.
  • Result: NF4 achieves much higher accuracy than Int4 at the same memory footprint.
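The quantile idea can be sketched with the standard library's `statistics.NormalDist`. Note this is only an illustration of quantile spacing, not the exact NF4 codebook from the QLoRA paper, which additionally pins an exact zero and normalizes values to $[-1, 1]$:

```python
from statistics import NormalDist

# Place 16 code values (4 bits) at evenly spaced quantiles of a standard
# normal, so bins are dense near zero where most weights live.
nd = NormalDist()
levels = 16
# Offset probabilities by half a step to avoid +/- infinity at p=0, p=1.
probs = [(i + 0.5) / levels for i in range(levels)]
codebook = [nd.inv_cdf(p) for p in probs]

gaps = [b - a for a, b in zip(codebook, codebook[1:])]
assert gaps[7] == min(gaps)              # central bin is the narrowest
assert max(gaps) in (gaps[0], gaps[-1])  # tail bins are the widest
```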

Granularity (The "Block" Concept): In modern LLMs (like LLaMA or Gemini), quantization granularity is crucial:

  • Per-Tensor: One scale for the whole layer (low accuracy).
  • Per-Channel: One scale per output channel (standard for CNNs/Linear layers).
  • Block-wise (Group-wise): The weights are split into small groups (e.g., block size of 128 or 64 elements), and each group gets its own scaling factor. This offers the best trade-off between compression and perplexity.
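Block-wise granularity can be sketched as follows (symmetric int8 per block; a block size of 4 is used only to keep the demo tiny, where real deployments use 64 or 128):

```python
# Block-wise (group-wise) symmetric int8 quantization: the flat weight
# list is split into fixed-size blocks and each block gets its own scale,
# so one outlier only degrades its own block.
def quantize_blockwise(weights, block_size=4, q_max=127):
    quantized, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / q_max
        scales.append(scale)
        quantized.append([round(w / scale) for w in block])
    return quantized, scales

# One outlier (8.0) in the second block does not hurt the first block.
w = [0.01, -0.02, 0.03, -0.01,  0.05, 8.0, -0.04, 0.02]
q, scales = quantize_blockwise(w)
assert scales[0] < scales[1]  # per-block scales adapt to local magnitude
```

With a single per-tensor scale, the 8.0 outlier would stretch the quantization step for every weight; here it only affects its own group of four.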

Double Quantization:

  • The scaling factors (which are usually FP32 or BF16) add overhead.
  • Methods like QLoRA treat these scaling factors as a tensor and quantize them as well (e.g., quantizing FP32 constants into FP8), further reducing memory footprint.
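The savings are easy to quantify. Following the setup reported in the QLoRA paper (block size 64, FP32 scales re-quantized to 8 bits with one FP32 second-level constant per 256 scales):

```python
# Per-parameter memory overhead of the scaling factors, before and
# after double quantization.
block_size = 64
fp32_scale_bits = 32

# One FP32 scale per 64 weights:
overhead_plain = fp32_scale_bits / block_size           # 0.5 bits/param

# Double quantization: 8-bit scales, plus one FP32 second-level scale
# per 256 first-level scales:
overhead_dq = 8 / block_size + 32 / (block_size * 256)  # ~0.127 bits/param

print(overhead_plain, round(overhead_dq, 3))  # 0.5 0.127
```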

How this works at inference:

  • Storage: Weights are stored in Int4/Int8.
  • Runtime: During the forward pass, we load the Int4/Int8 weights $\to$ Dequantize them to FP16/BF16 using the stored scales $\to$ Perform the Matrix Multiplication (GEMM) in high precision (FP16/BF16) $\to$ Discard the temporary FP weights.
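The runtime path above can be sketched with a toy weight-only matmul (pure Python, per-row scales standing in for per-channel scales; in practice this happens inside a fused FP16/BF16 GEMM kernel):

```python
# Sketch of the weight-only inference path: int8 weights plus a per-row
# (per-output-channel) scale are dequantized to float just before the
# matmul; the float copy is temporary and discarded afterwards.
def dequantize_row(q_row, scale):
    return [scale * q for q in q_row]

def matmul_int8_weights(x, q_weight, scales):
    # x: input vector (floats); q_weight: rows of int8; scales: per row.
    out = []
    for q_row, s in zip(q_weight, scales):
        w_row = dequantize_row(q_row, s)   # int8 -> float
        out.append(sum(xi * wi for xi, wi in zip(x, w_row)))
    return out                              # float rows go out of scope

x = [1.0, 2.0, 3.0]
q_weight = [[127, 0, -127], [64, 64, 64]]
scales = [0.01, 0.02]
out = matmul_int8_weights(x, q_weight, scales)
print(out)
```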

3. FP8 Quantization

FP8 is mostly used for training, but can also be used during inference. It is standardized (OCP / NVIDIA Hopper) into two distinct formats:

  1. E4M3 (1 Sign, 4 Exponent, 3 Mantissa):
    • Higher precision, lower dynamic range.
    • Typically used for Weights and Activations during the forward pass.
  2. E5M2 (1 Sign, 5 Exponent, 2 Mantissa):
    • Matches the dynamic range of FP16.
    • Typically used for Gradients during the backward pass to prevent overflow.

Technical Note: Unlike Int8, FP8 handling requires hardware support (like NVIDIA H100s) to perform native FP8 GEMMs. It is not just simple rounding; it often involves delayed scaling recipes to maximize the limited dynamic range.
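The range trade-off between the two formats falls out of the bit layouts directly. Per the OCP FP8 spec, E4M3 reclaims its top exponent code for finite values (only the all-ones mantissa is NaN), while E5M2 follows IEEE conventions:

```python
# Largest finite values of the two FP8 formats.
# E4M3: biased exponent up to 15 (bias 7), mantissa 110 -> 1.75 * 2^8.
# E5M2: biased exponent up to 30 (bias 15), mantissa 11 -> 1.75 * 2^15.
e4m3_max = (1 + 0.5 + 0.25) * 2 ** (15 - 7)
e5m2_max = (1 + 0.5 + 0.25) * 2 ** (30 - 15)
print(e4m3_max, e5m2_max)  # 448.0 57344.0
```

A max of 448 explains why E4M3 activations need careful per-tensor scaling, while E5M2's ~57k ceiling mirrors FP16's range for gradients.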


4. Quantization Aware Training (QAT)

QAT simulates the effects of low-precision arithmetic during the training phase to allow the neural network to adapt its weights to the loss of precision.

  • Forward Pass: Weights and activations are "fake quantized" (rounded to Int8/Int4 representation limits) and then effectively dequantized back to float for the operation. $$\hat{w} = dequantize(quantize(w))$$
  • Backward Pass (Straight-Through Estimator - STE): The quantization operation is non-differentiable (step function). During backpropagation, we ignore the rounding derivative (assume gradient is 1) and pass the gradients through the quantization block unchanged to update the latent floating-point weights.
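The two passes can be sketched for a single scalar weight (the scale here is a fixed illustrative constant; in real QAT it is learned or calibrated):

```python
# QAT "fake quantization": quantize to int8 and immediately dequantize,
# so the forward pass sees rounded values while the latent weight stays
# in float. With the straight-through estimator (STE), d(w_hat)/dw is
# treated as 1, so the latent weight receives the gradient computed at
# w_hat unchanged.
def fake_quantize(w, scale):
    q = max(-127, min(127, round(w / scale)))
    return scale * q                 # w_hat = dequantize(quantize(w))

w = 0.1234
scale = 0.01
w_hat = fake_quantize(w, scale)      # forward uses the rounded value
grad_at_w_hat = 2.0                  # pretend dL/dw_hat from backprop
grad_w = grad_at_w_hat * 1.0         # STE: gradient passes straight through
assert abs(w_hat - 0.12) < 1e-9
```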

5. Mixed Precision Training

This technique uses lower precision formats (FP16/BF16) to speed up math and reduce memory, while keeping a "Master Copy" of weights in FP32 for numerical stability.

The Breakdown

  • Weights: Stored in FP32 (Master Weights).
  • Activations: Cast to FP16/BF16 (reduces memory for stash/re-materialization).
  • Gradients: Computed in FP16/BF16.
  • Optimizer States: Stored in FP32 (to accumulate small updates accurately).

The Training Loop

  1. Forward: Cast FP32 Master Weights $\to$ FP16/BF16. Compute activations and loss in FP16/BF16 (Note: The final Softmax/CrossEntropy is usually computed in FP32 for stability).
  2. Loss Scaling (Specific to FP16):
    • Gradients in FP16 are often very small (e.g., $2^{-20}$), causing underflow (becoming zero).
    • Scale: Multiply the Loss by a factor $S$ (e.g., 1024).
    • By Chain Rule, all gradients are now scaled by $S$, shifting them into the representable range of FP16.
  3. Backward: Compute gradients in FP16.
  4. Unscale: Convert gradients to FP32 and divide by $S$.
  5. Update: Apply the FP32 gradients to the FP32 Master Weights and Optimizer States.

Dynamic Loss Scaling

  • If Infinity/NaN (Overflow) is detected in the gradients: Skip the weight update for this batch and decrease the scaling factor (usually by 0.5x).
  • If no overflow occurs for $N$ iterations: Increase the scaling factor to utilize more range.
  • Note: Bfloat16 training rarely requires loss scaling because its dynamic range matches FP32.
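The update rules above can be sketched as a minimal scaler class (the growth interval and initial scale are assumed constants, chosen to mirror common defaults):

```python
import math

# Minimal dynamic loss scaler: halve the scale on overflow and skip the
# step; double it after a streak of clean iterations.
class DynamicLossScaler:
    def __init__(self, scale=2.0 ** 10, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should run for this batch."""
        if any(math.isnan(g) or math.isinf(g) for g in grads):
            self.scale *= 0.5          # overflow: shrink scale, skip step
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0          # stable streak: use more range
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(scale=1024.0, growth_interval=2)
assert not scaler.update([1.0, float("inf")]) and scaler.scale == 512.0
assert scaler.update([0.1, 0.2])
assert scaler.update([0.1, 0.2]) and scaler.scale == 1024.0
```

This mirrors the behavior of utilities like PyTorch's `torch.cuda.amp.GradScaler`, which additionally unscales the gradients before the overflow check.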
