What is LLM Fine-Tuning?

- Fine-Tuning is adapting a pre-trained LLM to a specific task or domain
- It involves adjusting the model's parameters (often only a small portion of them) on a more focused dataset
- Fine-Tuning customizes output to be more relevant and accurate for your use case

The Power of Fine-Tuning

- cost-effectiveness
- improved performance
- data efficiency

How Does LLM Fine-Tuning Work?

- See the code examples below
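
As a toy illustration of the idea (not an actual LLM), the sketch below freezes a stand-in "pre-trained" backbone and trains only a small task head on a focused dataset. The model, the random data, and the hyperparameters are all assumptions chosen so the snippet runs in a second on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pre-trained network: a "backbone" plus a task "head".
# (Hypothetical toy model; a real LLM would be loaded from a checkpoint.)
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 2)
model = nn.Sequential(backbone, head)

# Freeze the backbone so only the small head is adjusted.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")

# A small, focused dataset (random here, purely for illustration).
x = torch.randn(64, 16)
y = (x[:, 0] > 0).long()

optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

initial_loss = criterion(model(x), y).item()
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
final_loss = criterion(model(x), y).item()
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Parameter-efficient methods such as LoRA follow the same pattern at LLM scale: most weights stay frozen and only a small set of added parameters is trained.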

Real-World Use Cases

- chatbot
- content generation
- domain-specific analysis


For Fine-Tuning

Use these notes: https://docs.unsloth.ai/

## 1. Post-Training Quantization with bnb

Here’s how you can use BitsAndBytes (bnb) for post-training quantization (no further LoRA fine-tuning) and some knobs to help recover precision:

---

```python
import torch
# BitsAndBytesConfig is provided by transformers, not by the bitsandbytes package
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 1) Choose your model
model_name = "facebook/opt-1.3b"

# 2) Configure bitsandbytes for 4-bit post-training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                # turn on 4-bit quantization
    bnb_4bit_quant_type="nf4",        # use NF4 quant scheme (better than “fp4”)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 if GPU allows
    bnb_4bit_use_double_quant=True,   # nested quant: also quantizes the scaling constants to save memory
)

# 3) Load the quantized model directly—no fine-tuning step
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",      # spreads layers over all GPUs/CPU
)
model.eval()

# 4) Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# 5) Inference example
prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

**Key bitsandbytes options:**

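* **`bnb_4bit_quant_type="nf4"`**: “Normal-float4” is designed for roughly normally distributed weights and often yields better accuracy than plain 4-bit FP.
* **`bnb_4bit_use_double_quant=True`**: Quantizes the per-block quantization constants a second time. This saves memory (roughly 0.4 bits per parameter) rather than reducing quantization error.
* **`bnb_4bit_compute_dtype=torch.bfloat16`**: Weights are dequantized to this dtype for the matrix multiplications. If you’re on an Ampere-class or newer GPU, bf16 has a wider dynamic range than fp16, which can slightly improve fidelity.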
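
To make the memory effect of double quantization concrete, here is a small back-of-the-envelope calculation. It assumes the block sizes described in the QLoRA paper (64 weights per block, 32-bit scaling constants, and a second level that stores those constants in 8 bits with one 32-bit constant per 256 of them); the function name and defaults are illustrative, not part of the bitsandbytes API.

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_block_size=256, dq_const_bits=8):
    """Average storage cost per weight for block-wise quantization."""
    if double_quant:
        # First-level constants stored in 8 bits, plus one 32-bit
        # second-level constant per dq_block_size first-level constants.
        overhead = dq_const_bits / block_size + const_bits / (block_size * dq_block_size)
    else:
        # One 32-bit scaling constant per block of weights.
        overhead = const_bits / block_size
    return weight_bits + overhead


plain = bits_per_param()
nested = bits_per_param(double_quant=True)

n_params = 1.3e9  # roughly the size of facebook/opt-1.3b
for name, bits in [("fp16", 16.0), ("4-bit", plain), ("4-bit + double quant", nested)]:
    print(f"{name:>22}: {bits:.3f} bits/param, {n_params * bits / 8 / 1e9:.2f} GB")
```

The saving from double quantization is small per parameter, but it adds up on multi-billion-parameter models. Real memory use will be higher than these weight-only figures because of activations, the KV cache, and any modules left unquantized.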

---

## 2. Recovering Precision Without Fine-Tuning

1. **Representative Calibration**
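   Run a small, diverse set of real inputs through the quantized model and compare outputs against a full-precision reference (FP32 or bf16). bitsandbytes itself needs no calibration data, so this step is an evaluation: use it to choose between settings (e.g. `nf4`, `fp4`, or even 8-bit) and find the sweet spot.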

2. **Layer-Wise Quant Strategy**
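   Some sensitive layers (e.g. the first and last attention projections, or the output head) can stay in fp16/bf16, while the bulk of weights are in 4-bit. In `transformers`, the `llm_int8_skip_modules` option lists modules to leave unquantized. Note that `llm_int8_threshold` is the outlier threshold for 8-bit LLM.int8() and does not select layers. Despite its name, `llm_int8_skip_modules` also appears to be honored for 4-bit loading in recent `transformers` versions; verify this on the version you use:

   ```python
   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=True,
       # keep these modules unquantized; module names are model-specific,
       # "lm_head" is only an example
       llm_int8_skip_modules=["lm_head"],
   )
   ```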
   ```

3. **Advanced Recipes**

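   * **GPTQ**: A layer-by-layer weight quantization method that uses calibration data to minimize each layer's output error (see, for example, the GPTQ-for-LLaMa project).
   * **SmoothQuant**: Rescales channels to shift quantization difficulty from activations to weights, so that both can be quantized with less error.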

4. **Mixed Precision Hybrid**
   If 4-bit still loses too much, try 8-bit quantization (LLM.int8(), enabled with `load_in_8bit=True`) via the same API:

   ```python
   bnb_config = BitsAndBytesConfig(load_in_8bit=True)
   ```

   This often hits >90% of FP32 accuracy with a ~2× memory reduction relative to fp16 weights (~4× relative to FP32).
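
The comparison against a full-precision reference described in step 1 can be sketched as below. The helper `compare_logits` is a hypothetical name, and the tensors are synthetic stand-ins; in practice they would be the logits of the reference and quantized models on the same prompts.

```python
import torch
import torch.nn.functional as F


def compare_logits(ref_logits, quant_logits):
    """Return (mean KL(ref || quant), top-1 agreement) for two [N, vocab] logit tensors."""
    log_p = F.log_softmax(ref_logits.float(), dim=-1)
    log_q = F.log_softmax(quant_logits.float(), dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean().item()
    agree = (ref_logits.argmax(dim=-1) == quant_logits.argmax(dim=-1)).float().mean().item()
    return kl, agree


# Synthetic stand-ins for real model outputs.
torch.manual_seed(0)
ref = torch.randn(8, 100)
noisy = ref + 0.05 * torch.randn(8, 100)

kl, agree = compare_logits(ref, noisy)
print(f"mean KL: {kl:.4f}, top-1 agreement: {agree:.2%}")
```

Running the same check for each candidate setting (`nf4`, `fp4`, 8-bit) gives a quick, task-agnostic ranking before investing in a full benchmark.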

---

### TL;DR

* Use **`bnb_4bit_quant_type="nf4"`** + **double quant** + **bf16 compute** for best off-the-shelf 4-bit.
* If needed, carve out critical layers to stay in 16-bit or bump to 8-bit.
* For large LLMs, consider GPTQ or SmoothQuant as alternatives to bnb to regain more accuracy. They need calibration data, but still no fine-tuning.


Before diving into code, here’s the short story: **Quantization-Aware Training (QAT)** simulates the effects of low-precision arithmetic during training by inserting “fake-quant” modules into your model’s forward pass. This lets the model **adapt its weights** to the quantization noise, often recovering **90–99 %** of full-precision accuracy, which is typically better than pure post-training quantization. QAT requires a bit more setup (layer fusion, defining quant-configs, a fine-tuning loop), but the core of it fits in roughly 20 lines of PyTorch once your data loader is ready.

---

## 1. QAT Fundamentals

* **FakeQuant modules** clamp and round activations/weights during forward passes, while keeping gradients in full precision. The model “sees” quantization noise and learns to compensate for it ([PyTorch][1]).
* You must **fuse** adjacent layers (e.g., Conv+ReLU, Linear+ReLU) before QAT so that quantization points are minimized and more representative of actual inference graphs ([Lei Mao's Log Book][2]).
* QAT typically uses the same 8-bit scheme as static PTQ (per-channel weight quant, per-tensor activation quant), but the ranges are **updated during fine-tuning** (tracked by observers in the default qconfig, or learned directly in schemes such as LSQ).
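
The fake-quant idea from the first bullet can be sketched as a custom autograd function: the forward pass rounds and clamps onto an integer grid, and the backward pass lets gradients through unchanged (a straight-through estimator). This is a simplified illustration with a hypothetical class name, not PyTorch's own `FakeQuantize` module.

```python
import torch


class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin, qmax):
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat round/clamp as identity.
        # (Simplification: production implementations typically also zero
        # the gradient for values that were clamped.)
        return grad_output, None, None, None, None


x = torch.tensor([0.03, -0.26, 0.51, 2.0], requires_grad=True)
y = FakeQuantSTE.apply(x, 0.1, 0, -8, 7)  # signed 4-bit grid with step 0.1
y.sum().backward()

print(y.detach())  # values snapped to the grid; 2.0 is clamped to the top of the range
print(x.grad)      # gradient flows as if no quantization had happened
```

Because the weights themselves remain in floating point, the optimizer can keep making small updates even though the forward pass only ever sees grid values.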

---

## 2. Minimal PyTorch QAT Example

Below is a self-contained script to take a pre-trained `resnet18`, fuse it, prepare it for QAT, and fine-tune on CIFAR-10 for just a few epochs. You can adapt it to any image or text model by swapping the backbone and the data loader.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.models.quantization import resnet18  # quantizable ResNet (provides fuse_model)
from torchvision import datasets, transforms
import torch.quantization as quant

# 1) Data loaders for calibration & training
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
])
train_ds = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

# 2) Load & fuse model (QAT fusion is done in train mode; needs torchvision >= 0.13)
model_fp32 = resnet18(weights="DEFAULT", quantize=False)
model_fp32.fc = nn.Linear(model_fp32.fc.in_features, 10)  # CIFAR-10 has 10 classes
model_fp32.train()
model_fp32.fuse_model(is_qat=True)  # fuses the Conv+BN(+ReLU) blocks

# 3) Specify QAT config: per-channel weights, default activations
model_fp32.qconfig = quant.get_default_qat_qconfig('fbgemm')  # x86 server backend

# 4) Prepare QAT: insert FakeQuant modules
quant.prepare_qat(model_fp32, inplace=True)

# 5) Fine-tune with quant noise for a few epochs
optimizer = optim.SGD(model_fp32.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()
model_fp32.train()
for epoch in range(3):  # short run for demonstration
    for imgs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model_fp32(imgs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# 6) Convert to a true quantized model
model_int8 = quant.convert(model_fp32.eval(), inplace=False)

# 7) Save or evaluate
torch.jit.save(torch.jit.script(model_int8), "resnet18_qat_int8.pt")
```

* After `prepare_qat`, your model’s forward pass will simulate **8-bit** quantization noise on both weights and activations ([PyTorch][1]).
* The final `convert()` swaps FakeQuant modules for real `nnq.Conv2d` / `nnq.Linear` ops that run at INT8 ([Lei Mao's Log Book][2]).
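
The scale and zero-point that the converted INT8 ops rely on are derived from the value ranges tracked during QAT. Below is a minimal sketch of standard affine (asymmetric) quantization arithmetic for an unsigned 8-bit range. The helper names and the example range are made up for illustration, and PyTorch's observers differ in details such as averaging and reduced ranges.

```python
def affine_qparams(x_min, x_max, qmin=0, qmax=255):
    """Scale and zero-point mapping [x_min, x_max] onto the integer range."""
    # Make sure 0.0 is exactly representable, as is conventional.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


scale, zp = affine_qparams(-1.0, 3.0)  # hypothetical observed activation range
for value in (-1.0, 0.0, 0.5, 3.0):
    q = quantize(value, scale, zp)
    print(value, "->", q, "->", round(dequantize(q, scale, zp), 4))
```

The round-trip error for any in-range value is at most half a quantization step (`scale / 2`), which is the noise the fake-quant modules expose the model to during training.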

---

## 3. Applying QAT to Large Language Models

Recent work (e.g., the PyTorch blog post on Llama 3) reports that QAT on LLMs can recover up to **96 %** of the accuracy degradation on HellaSwag and **68 %** of the perplexity degradation on WikiText compared to PTQ alone ([PyTorch][3]). That work uses torchao; Hugging Face’s 🤗Optimum-Intel instead exposes quantization for Transformers models through Intel’s NNCF toolkit:

```bash
pip install "optimum[openvino]"  # brings in NNCF & QAT support
```

```python
from optimum.intel.openvino import OVQuantizer

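# NOTE: schematic outline. The `prepare_qat`, `convert`, and `save_pretrained`
# calls on the quantizer below are not verified against the optimum-intel API;
# consult the Optimum-Intel optimization guide [4] for the supported entry points.
# `train_dataloader` and `optimizer` are assumed to come from your training setup.

# 1) Instantiate quantizer for QAT
quantizer = OVQuantizer.from_pretrained("facebook/opt-1.3b")

# 2) Prepare model for QAT
model = quantizer.prepare_qat(framework="pytorch")

# 3) Fine-tune with your SFT or RLHF loop as usual
for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()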

# 4) Convert to an optimized INT8 OpenVINO model
quantizer.convert(model)
quantizer.save_pretrained("./opt-1.3b-q8-qat")
```

* This uses Intel’s **NNCF** toolkit for automated fusing, QConfig selection, and range learning ([Hugging Face][4]).
* You can swap in any Hugging Face transformer model by changing the `from_pretrained` argument ([Hugging Face][5]).

---

## 4. Best Practices & Tips

| Technique                                                    | Benefit                                                                           |
| ------------------------------------------------------------ | --------------------------------------------------------------------------------- |
| Per-channel weight quantization                              | Reduces outlier channel errors; essential for conv layers ([PyTorch][1])          |
| Fine-tune ≥1 epoch with small LR (1e-4)                      | Allows weights to adjust without catastrophic forgetting                          |
| Larger batch calibration for activations                     | Better range estimation, especially for language models                           |
| Mixed-precision (INT8 weights + FP16 act)                    | Balance between memory saving and numeric stability                               |
| Advanced QAT schedulers (e.g., gradual quant noise increase) | Smooth adaptation to quant noise, especially in deep nets ([Weights & Biases][6]) |
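
To illustrate the first row of the table, the sketch below compares per-tensor and per-channel symmetric int8 quantization on a tiny made-up weight matrix in which one output channel is about 100× larger than the other. The weights and the helper name are assumptions for illustration only.

```python
def quant_error(values, scale, qmax=127):
    """Mean absolute error after symmetric int8 quantize-dequantize."""
    err = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        err += abs(v - q * scale)
    return err / len(values)


# Hypothetical weight matrix: one row per output channel.
weights = [
    [0.011, -0.023, 0.017, -0.008],  # small-magnitude channel
    [1.20, -2.50, 0.90, 2.10],       # large-magnitude channel
]

# Per-tensor: a single scale from the global absolute maximum.
flat = [v for row in weights for v in row]
tensor_scale = max(abs(v) for v in flat) / 127
per_tensor = [quant_error(row, tensor_scale) for row in weights]

# Per-channel: one scale per output channel (row).
per_channel = [quant_error(row, max(abs(v) for v in row) / 127) for row in weights]

for i, (t, c) in enumerate(zip(per_tensor, per_channel)):
    print(f"channel {i}: per-tensor error {t:.5f}, per-channel error {c:.5f}")
```

With a single shared scale, the small channel is squeezed into one or two integer levels and loses most of its information. Giving each channel its own scale removes that effect at the cost of storing one extra constant per channel.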

---

**In short:**

1. **Fuse** your model.
2. **Set** `.qconfig` to a QAT profile.
3. **`prepare_qat`**, then **fine-tune** normally.
4. **`convert`** to INT8.

This process integrates into a standard training loop and, for many CNNs and LLMs, recovers most of the full-precision accuracy without any extra model architecture code.

[1]: https://pytorch.org/docs/stable/quantization.html "Quantization - PyTorch 2.7 documentation"
[2]: https://leimao.github.io/blog/PyTorch-Quantization-Aware-Training/ "PyTorch Quantization Aware Training - Lei Mao's Log Book"
[3]: https://pytorch.org/blog/quantization-aware-training/ "Quantization-Aware Training for Large Language Models with PyTorch"
[4]: https://huggingface.co/docs/optimum/main/en/intel/openvino/optimization "Optimization - Hugging Face"
[5]: https://huggingface.co/docs/optimum/en/concept_guides/quantization "Quantization - Hugging Face"
[6]: https://wandb.ai/byyoung3/Generative-AI/reports/Quantization-Aware-Training-QAT-A-step-by-step-guide-with-PyTorch--VmlldzoxMTk2NTY2Mw "Quantization-Aware Training (QAT): A step-by-step guide with PyTorch"


In [None]:
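# quantization: 4-bit NF4 config (same settings as the bnb example above)
from transformers import BitsAndBytesConfig
import torch

model = ""  # placeholder: set to a Hugging Face model id before loading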
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)