# ***Fine-Tuning LLM Models***

## ***Quantization***

Quantization is the process of converting model weights and activations from ***higher precision formats (e.g., FP32) to lower precision formats*** (e.g., INT8), which reduces memory footprint and boosts inference speed—particularly critical for deploying large models on edge devices or limited-resource systems.

- Converting FP32 to FP16 is called half-precision conversion.
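
To see what half precision costs, the sketch below round-trips a number through the IEEE 754 16-bit and 32-bit encodings using Python's standard `struct` module. The helper names `to_fp16` and `to_fp32` are illustrative only and do not come from any library.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest single-precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 0.1
err32 = abs(to_fp32(value) - value)  # rounding error at 32 bits
err16 = abs(to_fp16(value) - value)  # rounding error at 16 bits

print(f"FP32 uses {struct.calcsize('<f')} bytes, rounding error {err32:.2e}")
print(f"FP16 uses {struct.calcsize('<e')} bytes, rounding error {err16:.2e}")
```

FP16 halves the storage per value, and the rounding error for the same input grows accordingly.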

### 🚀 ***Benefits of Quantization***
- Makes fine-tuning feasible on limited hardware, since the quantized model needs less memory
- Reduced memory usage (e.g., 4x smaller from FP32 → INT8)
- Faster inference, especially on CPUs
- Lower power consumption
- Helps deploy models on mobile, embedded, or edge devices
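
The memory saving is simple arithmetic on bytes per parameter. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; activations, quantization metadata (scales and zero-points), and optimizer state are ignored.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_memory_gb(num_params, fmt):.0f} GB")
```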

### ❌ ***Disadvantages of Quantization***
- Potential loss of precision (quantization error)
- May require calibration or retraining to recover accuracy
- Not all operations or hardware support lower precision (e.g., INT8)


---

## ***Mathematical Intuition***

Quantization maps floating-point values to integers. There are two primary types:

- 🔷 Symmetric Quantization (a common choice for batch normalization layers, whose values are roughly zero-centered)
    - In symmetric quantization, the zero-point is always zero, so a floating-point 0.0 maps exactly to the integer 0 and only a scale factor is needed.
    - Example:
        - Converting the range [0.0, 1000.0] (FP32) to [0, 255] (UINT8)
        - Symmetric quantization assumes the range is centered on zero (or, as here, starts at zero), so 0.0 maps to 0 with no offset.
        - Used when the input distribution is approximately symmetric or zero-centered.

    - How does the conversion happen?
        - Original floating-point range: Xmin = 0.0, Xmax = 1000.0
        - Quantized range: Qmin = 0, Qmax = 255

    - Background on FP32:
        - A single-precision (FP32) value uses 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits.

    - Equation (min-max scaling, FP32 to UINT8):
        - 0.0 should map to the quantized value 0
        - 1000.0 should map to the quantized value 255
        - scale = (Xmax - Xmin) / (Qmax - Qmin)
        - scale = (1000 - 0) / (255 - 0) ≈ 3.92
        - To quantize a value in [0, 1000], divide it by the scale. For 789: 789 / 3.92 ≈ 201.3
        - Round to the nearest integer to get the quantized value: 201
---  

### 🔶 ***Asymmetric Quantization***

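- In asymmetric quantization, the zero-point is non-zero, which allows better handling of values that are not centered around zero (i.e., skewed distributions).

Example:
- Converting [-20, 1000] (FP32) to [0, 255] (UINT8)
- scale = (Xmax - Xmin) / (Qmax - Qmin) = (1000 - (-20)) / (255 - 0) = 1020 / 255 = 4.0
- Quantizing the minimum, -20, with a scale of 4.0 gives -20 / 4.0 = -5, which falls outside [0, 255].
- Adding an offset of 5 shifts it into range: -5 + 5 = 0. This offset is the zero-point.
- With the zero-point applied, q = round(x / scale) + 5, so -20 maps to 0 and 1000 maps to 250 + 5 = 255.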
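
The [-20, 1000] to [0, 255] example can be reproduced as follows. This is a sketch with illustrative function names; the zero-point formula `round(q_min - x_min / scale)` is the usual affine-quantization convention and is assumed here.

```python
def asymmetric_qparams(x_min, x_max, q_min=0, q_max=255):
    """Compute the scale and zero-point for an affine (asymmetric) mapping."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Scale, round, shift by the zero-point, then clamp to the integer range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Undo the shift and the scaling to recover an approximate float."""
    return (q - zero_point) * scale

scale, zero_point = asymmetric_qparams(-20.0, 1000.0)
print(scale, zero_point)                    # scale 4.0, zero-point 5
print(quantize(-20.0, scale, zero_point))   # lower end of the integer range
print(quantize(1000.0, scale, zero_point))  # upper end of the integer range
```

Note that the floating-point value 0.0 maps to the integer 5, which is exactly the zero-point.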

### ***Extra Notes***

Batch normalization layers are often quantized symmetrically because their values are roughly zero-centered.

Choosing between symmetric and asymmetric quantization often depends on:
1. The data distribution
2. Target hardware compatibility
3. Model architecture and tolerance to quantization noise
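
To illustrate how the data distribution affects the choice, the sketch below compares the mean round-trip error of both schemes on skewed data in [-20, 1000]. It assumes the common signed-INT8 convention for the symmetric case (zero-point 0, range [-127, 127]); the data and function names are made up for illustration.

```python
def mean_abs_error(values, quantize, dequantize):
    """Average absolute difference after a quantize/dequantize round trip."""
    return sum(abs(dequantize(quantize(v)) - v) for v in values) / len(values)

# Skewed sample data: mostly positive, with a small negative tail.
values = [float(v) for v in range(-20, 1001, 5)]
x_min, x_max = min(values), max(values)

# Symmetric, signed INT8: zero-point is 0, scale set by the largest magnitude.
sym_scale = max(abs(x_min), abs(x_max)) / 127
def sym_q(v):
    return max(-127, min(127, round(v / sym_scale)))
def sym_dq(q):
    return q * sym_scale

# Asymmetric, UINT8: scale set by the full range, zero-point shifts x_min to 0.
asym_scale = (x_max - x_min) / 255
asym_zp = round(-x_min / asym_scale)
def asym_q(v):
    return max(0, min(255, round(v / asym_scale) + asym_zp))
def asym_dq(q):
    return (q - asym_zp) * asym_scale

sym_err = mean_abs_error(values, sym_q, sym_dq)
asym_err = mean_abs_error(values, asym_q, asym_dq)
print(f"symmetric  mean abs error: {sym_err:.2f}")
print(f"asymmetric mean abs error: {asym_err:.2f}")
```

On this skewed range the symmetric scheme leaves most of its negative integer codes unused, so its step size, and therefore its error, is larger.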


---

## Post-Training Quantization (PTQ)

- PTQ is applied after a model has been fully trained using high-precision (typically 32-bit float) weights.
- The model is calibrated using a small representative dataset to estimate the dynamic ranges (min/max values) of activations and weights.
- Quantization maps high-precision parameters (e.g., float32) to lower-precision formats (e.g., int8, uint8) by applying scaling and zero-point shifting.
- The quantized model is smaller and runs faster on resource-constrained devices (like edge/IoT/embedded).
- PTQ is non-invasive: it does not require retraining the model.
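
The calibration step can be sketched in plain Python: track the min/max seen over a few representative batches, then derive the scale and zero-point from that range. `RangeObserver` is a hypothetical helper written for this note (not a library class), and the calibration values are made up.

```python
class RangeObserver:
    """Tracks the running min/max of values seen during calibration."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def qparams(self, q_min=0, q_max=255):
        # Widen the range to include 0.0 so that zero is exactly representable.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (q_max - q_min) or 1.0  # avoid a zero scale
        zero_point = round(q_min - lo / scale)
        return scale, zero_point

def quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Scale, round, shift by the zero-point, then clamp."""
    return max(q_min, min(q_max, round(x / scale) + zero_point))

calibration_batches = [[-1.0, 0.5], [2.0, 0.25]]  # made-up activation values
observer = RangeObserver()
for batch in calibration_batches:
    observer.observe(batch)

scale, zero_point = observer.qparams()
print(scale, zero_point)
print([quantize(x, scale, zero_point) for x in (-1.0, 0.0, 2.0)])
```

Real toolchains often use more robust range estimates (for example percentiles or histograms) because a single outlier can stretch a plain min/max range.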

However, this technique may lead to significant accuracy degradation, especially for models sensitive to numerical precision (e.g., NLP transformers or object detection models).

> ✅ Best for: Models that are already trained and deployed, where retraining is expensive or impossible. Suitable for simpler or more robust models such as MobileNet or ResNet on classification tasks.

---


## Quantization-Aware Training (QAT)

***QAT simulates quantization effects during training to make the model robust to quantization noise.***

Process:
- Start with a trained float32 model.
- Insert fake quantization operations ("fake quant ops") into the forward pass to mimic int8 rounding; in the backward pass, gradients flow through them as if they were the identity (the straight-through estimator).
- Fine-tune the model with training data to adapt the weights for lower precision.
- Export the final model to a fully quantized version.

QAT retains more accuracy than PTQ, especially for complex models or tasks that depend on fine-grained numerical precision.

It also supports per-channel quantization, activation clipping, and other advanced techniques that further reduce accuracy loss.
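
A fake quant op can be sketched as a quantize-then-dequantize step: the value stays in floating point but carries the rounding and clamping error that int8 would introduce. The scalar functions below are an illustration under that assumption, not the implementation used by any framework; `ste_grad` shows the straight-through estimator commonly used for the backward pass.

```python
def fake_quant(x, scale, zero_point=0, q_min=-128, q_max=127):
    """Forward pass: quantize to an integer, then immediately dequantize."""
    q = round(x / scale) + zero_point
    q = max(q_min, min(q_max, q))    # clamp to the int8 range
    return (q - zero_point) * scale  # back to float, now with quantization error

def ste_grad(x, scale, zero_point=0, q_min=-128, q_max=127):
    """Backward pass (straight-through estimator): treat rounding as identity.

    The gradient is 1 where x is inside the representable range
    and 0 where it was clamped.
    """
    lo = (q_min - zero_point) * scale
    hi = (q_max - zero_point) * scale
    return 1.0 if lo <= x <= hi else 0.0

print(fake_quant(1.3, scale=0.5))    # snapped to the nearest multiple of 0.5
print(fake_quant(100.0, scale=0.5))  # clamped to the largest representable value
print(ste_grad(1.3, scale=0.5), ste_grad(100.0, scale=0.5))
```

Because the model sees these rounded and clamped values during fine-tuning, its weights can adapt to them before the final export.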

> ✅ Best for: When accuracy is critical or PTQ causes unacceptable performance drops. Commonly used in production ML systems (e.g., TensorFlow Lite QAT, PyTorch FX QAT, ONNX QAT workflows).

