# What is Fine-Tuning and how is it different from RAG?
- Fine-tuning specializes an LLM for a particular task by training it further on a small, task-specific dataset.
- The goal is to excel at that task. Intuitively, we are tweaking the model's "brain connections" (its weights) so that it picks up the particular patterns in the new training data.
- Fine-tuning is no magic technique; it makes a model more opinionated rather than more general.
- Fine-tuning comes with the risk of catastrophic forgetting, especially with a very high learning rate (lr) and little data.
- Thinking of it as an MBBS student who goes on to specialize is only a loose metaphor. Humans have external and internal mechanisms to identify and suppress spurious correlations, whereas a model in training treats any loss-reducing pattern as legitimate unless explicitly constrained.

### EX: we may take a small Llama model (1B or 2B parameters) and fine-tune it on, say, legal datasets; we might get a model that not only answers naturally but also has specialized knowledge of legal cases

# Quantization
**1. Symmetric Quantization: real values are roughly centered around zero**
- zero_point = 0
- equal ranges on both sides of zero
- $x_{int} = clip(round(\frac{x_{float}}{s}),-Q_{max},Q_{max})$
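
A minimal sketch of the symmetric formula above in plain Python. The function names `symmetric_quantize` and `dequantize`, and the choice of scale as max-abs divided by $Q_{max}$, are assumptions for illustration; the formula itself does not prescribe how $s$ is picked.

```python
def symmetric_quantize(values, num_bits=8):
    """Quantize floats to signed integers with zero_point = 0."""
    q_max = 2 ** (num_bits - 1) - 1                   # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0   # assumed: max-abs scaling
    # x_int = clip(round(x_float / s), -Q_max, Q_max)
    quantized = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, s = symmetric_quantize(weights)
recovered = dequantize(q, s)
print(q, s)
print(recovered)
```

The extremes land on $\pm Q_{max}$, zero stays exactly zero, and every recovered value is within half a step ($s/2$) of the original.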

**2. Asymmetric Quantization**

- used when the assumptions of symmetric quantization fail, i.e. the values are not centered around zero (usually the case for activations)
- $x_{int} = round(\frac{x_{float}}{s} + z)$
- where $s$ is the scale and $z$ is the zero point
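
A small sketch of the asymmetric case in plain Python. The helper name `asymmetric_quantize`, the unsigned 0..255 target range, the min/max choice of scale, and the extra clipping step are assumptions for illustration; only the `round(x / s + z)` core comes from the formula above.

```python
def asymmetric_quantize(values, num_bits=8):
    """Quantize floats to unsigned integers using a scale and a zero point."""
    q_min, q_max = 0, 2 ** num_bits - 1        # assumed unsigned range, 0..255
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min) if hi > lo else 1.0
    zero_point = round(q_min - lo / scale)     # integer that represents 0.0
    # x_int = round(x_float / s + z), clipped to the representable range
    quantized = [max(q_min, min(q_max, round(v / scale + zero_point))) for v in values]
    return quantized, scale, zero_point


activations = [-1.0, 0.0, 2.0, 4.1]            # skewed range, not centered on zero
q, s, z = asymmetric_quantize(activations)
recovered = [(qi - z) * s for qi in q]
print(q, s, z)
```

Here the whole integer range is spent on the interval actually occupied by the data, instead of wasting half of it on negative values that never occur.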

## Post Training Quantization
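- we start with a pre-trained model (typically in FP32) -> calibrate the quantization parameters (scale and zero point) on a small set of representative data -> convert the weights to the target precision -> quantized model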
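
The calibration step can be sketched as a simple min/max observer in plain Python. The class name `MinMaxObserver`, the toy calibration batches, and the 8-bit unsigned range are all hypothetical; real toolkits offer more robust observers (percentile or histogram based), so treat this only as an illustration of the idea.

```python
class MinMaxObserver:
    """Tracks the value range seen during calibration (hypothetical helper)."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def qparams(self, num_bits=8):
        """Turn the observed range into a scale and a zero point."""
        q_max = 2 ** num_bits - 1
        scale = (self.hi - self.lo) / q_max if self.hi > self.lo else 1.0
        zero_point = round(-self.lo / scale)
        return scale, zero_point


# Toy "calibration set": a few batches of activations from the trained model.
calibration_batches = [[0.0, 0.4, 1.2], [0.1, 2.55, 0.9], [0.0, 0.3, 1.8]]

observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)

scale, zero_point = observer.qparams()
print(scale, zero_point)
```

No gradient step is involved: the model's weights are left as trained, and only the quantization parameters are estimated from data.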

## Quantization aware training
- forward pass uses quantized values -> backward pass updates the full-precision (FP) weights -> the model learns that tiny values get killed (rounded to zero) and adapts to it
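
A toy, single-weight sketch of that loop in plain Python. It assumes a straight-through estimator (the rounding step is treated as identity in the backward pass), which is one common way to do this; the names `fake_quantize`, `scale`, and the training example are made up for illustration.

```python
def fake_quantize(w, scale):
    """Snap w to the nearest point on the quantization grid (returned as a float)."""
    return round(w / scale) * scale


scale = 0.1            # grid spacing (assumed for this toy example)
x, y = 2.0, 1.0        # one training example; the ideal weight is 0.5
w = 0.02               # full-precision weight; quantizes to 0.0, so it starts out "killed"
lr = 0.05

for step in range(50):
    w_q = fake_quantize(w, scale)   # forward pass sees the quantized weight
    pred = w_q * x
    grad = 2 * (pred - y) * x       # gradient of (pred - y)**2 with respect to w_q
    w -= lr * grad                  # straight-through: apply it to the FP weight

print(w, fake_quantize(w, scale))
```

The full-precision weight keeps accumulating small updates until its quantized version lands on the grid point that fits the data (0.5 in this toy case), after which the gradient vanishes and training stops moving it.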

### Collapse in Quantization
- INT8 quantization handles value collapse not by preventing it, but by reshaping distributions (via scaling, per-channel quantization, clipping, and QAT) so that important distinctions survive bucketization
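
To make the collapse concrete, here is a small plain-Python comparison of per-tensor and per-channel scaling. The weight values and the helper `quantize_row` are invented for illustration, and max-abs scaling is an assumption.

```python
def quantize_row(row, scale, q_max=127):
    """Symmetric INT8 quantization of one row with a given scale."""
    return [max(-q_max, min(q_max, round(v / scale))) for v in row]


weights = [
    [0.001, -0.002, 0.003],   # channel with tiny values
    [5.0, -4.0, 3.0],         # channel with large values
]
q_max = 127

# Per-tensor: one scale for everything, dominated by the largest value.
tensor_scale = max(abs(v) for row in weights for v in row) / q_max
per_tensor = [quantize_row(row, tensor_scale) for row in weights]

# Per-channel: each row gets its own scale.
per_channel = []
for row in weights:
    channel_scale = max(abs(v) for v in row) / q_max
    per_channel.append(quantize_row(row, channel_scale))

print(per_tensor[0])    # the tiny channel collapses into a single bucket
print(per_channel[0])   # the same channel keeps its internal distinctions
```

With a single scale, all three small weights fall into the zero bucket; with one scale per channel, they spread over the integer range and their relative sizes survive.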

# Actual Coding
- using LoRA to fine-tune a Llama model

In [1]:
!pip install -q  accelerate peft bitsandbytes transformers trl


In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## Prompt template for the Llama 2 model
```
<s>[INST] <<SYS>>
system prompt
<</SYS>>

user prompt [/INST] Model Answer </s>
```
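
A small helper that fills in this template, as a sketch. The function name `build_llama2_prompt` and its arguments are hypothetical and not part of any library; the exact whitespace simply follows the template shown above.

```python
def build_llama2_prompt(user_prompt, system_prompt=None, answer=None):
    """Format one turn in the Llama 2 chat template (hypothetical helper)."""
    if system_prompt:
        body = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}"
    else:
        body = user_prompt
    text = f"<s>[INST] {body} [/INST]"
    if answer is not None:
        # Training samples include the model answer and the end-of-sequence tag.
        text += f" {answer} </s>"
    return text


sample = build_llama2_prompt(
    "What is LoRA?",
    system_prompt="You are a concise assistant.",
    answer="A low-rank adaptation method for fine-tuning.",
)
print(sample)
```

For inference the `answer` argument is left out, so the prompt ends at `[/INST]` and the model generates the rest.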

Dataset used: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)

Llama 2 formatted dataset: [gpjt/openassistant-guanaco-llama2-format](https://huggingface.co/datasets/gpjt/openassistant-guanaco-llama2-format)

- we use the chat (instruction-tuned) version of Llama 2, which expects this format; the base Llama 2 model doesn't need the template

## Model configs

In [10]:
model_name = r'NousResearch/Llama-2-7b-chat-hf'
dataset_name = 'mlabonne/guanaco-llama2-1k'
new_model = 'Llama-2-7b-chat-finetune'

## LoRA config

In [3]:
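# LoRA parameters
lora_rank = 64       # rank r of the low-rank update matrices
lora_alpha = 16      # scaling factor; the update is scaled by lora_alpha / lora_rank
lora_dropout = 0.1   # dropout applied on the LoRA branch

# quantization parameters
use_4bit = True                      # load the base model in 4-bit precision
bnb_4bit_compute_dtype = 'float16'   # dtype used for computation on the dequantized weights
bnb_4bit_quant_type = 'nf4'          # 4-bit NormalFloat ('fp4' is the alternative)
use_nested_quant = False             # double quantization of the quantization constants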
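
A quick back-of-the-envelope check of what these LoRA values imply, in plain Python. It assumes a single 4096 x 4096 projection matrix (the hidden size of Llama 2 7B) and the usual LoRA convention that the update is scaled by `lora_alpha / lora_rank`; which layers actually get adapters depends on the `LoraConfig` used later.

```python
lora_rank = 64
lora_alpha = 16
hidden_size = 4096   # assumption: one square attention projection in Llama 2 7B

# LoRA replaces the update dW (d_out x d_in) with B @ A,
# where B is d_out x r and A is r x d_in.
scaling = lora_alpha / lora_rank
full_params = hidden_size * hidden_size
lora_params = lora_rank * (hidden_size + hidden_size)
fraction = lora_params / full_params

print(f"scaling applied to B @ A : {scaling}")
print(f"full matrix parameters   : {full_params:,}")
print(f"LoRA parameters          : {lora_params:,}")
print(f"trainable fraction       : {fraction:.2%}")
```

With rank 64, each adapted matrix trains roughly 3% of the parameters it would need under full fine-tuning, and the adapter output is scaled down by a factor of 0.25.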

In [4]:
#training arguments
output_dir = './results'
num_training_epochs = 1
fp16 = False
bf16 = False

per_device_train_batch_size = 4
per_device_eval_batch_size = 4

grad_accum = 1
grad_checkpoint = True
max_grad_norm = 0.3
lr = 5e-5
weight_decay = 0.001
optim = 'paged_adamw_32bit'
scheduler = 'cosine'
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 0
logging_steps = 25

max_seq_len = None
packing=False
device_map = {"":0}
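
A rough sketch of what these settings mean in optimizer steps. It assumes 1,000 training examples (the size suggested by the `guanaco-llama2-1k` dataset name), a single GPU, and ceiling rounding for the warmup; the exact rounding inside the trainer may differ between library versions.

```python
import math

num_examples = 1000          # assumption: size of the training split
num_devices = 1              # assumption: single GPU
per_device_train_batch_size = 4
grad_accum = 1
num_training_epochs = 1
warmup_ratio = 0.03
logging_steps = 25

effective_batch = per_device_train_batch_size * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_training_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
log_events = total_steps // logging_steps

print(effective_batch, total_steps, warmup_steps, log_events)
```

So one epoch is about 250 optimizer steps, the learning rate warms up over roughly the first 8 of them, and the loss is logged about 10 times.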

In [7]:
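# load the training split of the dataset
dataset = load_dataset(dataset_name, split='train')

# resolve the dtype name ('float16') to the actual torch dtype (torch.float16)
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# 4-bit quantization config used by bitsandbytes when loading the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)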
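
A rough estimate of why 4-bit loading matters for a 7B model, in plain Python. The parameter count of about 6.74 billion is an approximation, and the estimate ignores the per-block quantization constants, activations, and optimizer state, so real memory use will be higher.

```python
num_params = 6.74e9          # approximate parameter count of Llama 2 7B

bytes_per_param = {
    "fp32": 4.0,
    "fp16": 2.0,
    "nf4": 0.5,              # 4 bits per weight, constants not included
}

# Weight memory in GB (10**9 bytes) for each storage format.
weight_gb = {name: num_params * size / 1e9 for name, size in bytes_per_param.items()}

for name, gb in weight_gb.items():
    print(f"{name}: ~{gb:.2f} GB for the weights alone")
```

At 4 bits the weights need roughly a quarter of the FP16 footprint, which is what makes fine-tuning a 7B model feasible on a single consumer or free-tier GPU.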




In [11]:
# load the base model with 4-bit quantized weights, placed entirely on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)



In [12]:
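# disable the KV cache: it only helps during generation, not during training
model.config.use_cache = False
# use the standard (non tensor-parallel) computation path for the linear layers
model.config.pretraining_tp = 1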