### **Quantization Aware Training**

* Trains the model so that it still performs well once it is quantized, by keeping two copies of the weights:
  * a quantized version of the weights
  * the original, unquantized weights
* Forward Pass (inference)
  * Use quantized version of the model weights to make predictions
* Back propagation (updating model weights)
  * Update original, unquantized version of the model weights
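
The forward/backward split above can be sketched on a toy one-weight model. This is only an illustration under stated assumptions, not how any particular library implements it: `fake_quantize`, `train_qat`, the step size `scale=0.25`, and the data are all hypothetical, and the "gradient goes straight to the unquantized weight" trick is the usual straight-through estimator.

```python
def fake_quantize(w, scale=0.25):
    """Toy quantizer (assumption): snap w to the nearest multiple of `scale`."""
    return round(w / scale) * scale


def train_qat(xs, ys, w=0.0, lr=0.01, steps=100, scale=0.25):
    """Fit y = w * x, predicting with the quantized weight at every step."""
    for _ in range(steps):
        # Forward pass: predictions use the quantized copy of the weight.
        w_q = fake_quantize(w, scale)
        # Mean-squared-error gradient with respect to the weight used in the forward pass.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Backward pass (straight-through estimator): rounding is treated as
        # the identity, so the update is applied to the unquantized weight.
        w -= lr * grad
    return w, fake_quantize(w, scale)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5 * x for x in xs]          # data generated with a true weight of 1.5
w_full, w_quant = train_qat(xs, ys)
print("unquantized weight:", w_full)
print("quantized weight used for inference:", w_quant)
```

Training stops moving once the quantized weight fits the data, even though the unquantized weight it was derived from is slightly different.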

---

### **Quantization Methods for LLMs**

* Linear Quantization (PTQ - Post-Training Quantization)
* LLM.int8() (only 8-bit) - Aug 2022 [Dettmers et al]
* `QLoRA` (only 4-bit) - May 2023 [Dettmers et al]
* AWQ - June 2023 [Lin et al]
* GPTQ - Oct 2022 [Frantar et al]
* SmoothQuant - Nov 2022 [Xiao et al]
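
As a rough illustration of the first item, linear quantization maps a real value `r` to an integer `q = round(r / scale + zero_point)` and recovers an approximation with `r ≈ scale * (q - zero_point)`. The sketch below assumes an asymmetric int8 range and per-list min/max calibration; the function names are hypothetical and do not come from any library.

```python
def linear_quantize(values, q_min=-128, q_max=127):
    """Asymmetric linear quantization of a list of floats to integers."""
    r_min, r_max = min(values), max(values)
    # Scale maps the real range onto the integer range; guard a constant input.
    scale = (r_max - r_min) / (q_max - q_min) if r_max != r_min else 1.0
    # Zero point is the integer that represents (approximately) r_min at q_min.
    zero_point = int(round(q_min - r_min / scale))
    zero_point = max(q_min, min(q_max, zero_point))
    quantized = [
        max(q_min, min(q_max, int(round(v / scale + zero_point))))
        for v in values
    ]
    return quantized, scale, zero_point


def linear_dequantize(quantized, scale, zero_point):
    """Map integers back to approximate real values."""
    return [scale * (q - zero_point) for q in quantized]


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
q, scale, zp = linear_quantize(weights)
restored = linear_dequantize(q, scale, zp)
print("quantized:", q)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The reconstruction error per value stays within about one quantization step (`scale`), which is the precision traded away for the smaller storage type.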

#### **2-bit Quantization**

- QuIP - Jul 2023
- HQQ - Nov 2023
- AQLM - Feb 2024
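
To make "2-bit" concrete: a 2-bit code can take only four values, so four codes fit in a single byte. The sketch below shows just that bit arithmetic. It is not the encoding used by QuIP, HQQ, or AQLM, each of which defines its own scheme; `pack_2bit` and `unpack_2bit` are hypothetical helpers.

```python
def pack_2bit(codes):
    """Pack integers in [0, 3] into bytes, four codes per byte (low bits first)."""
    if any(c < 0 or c > 3 for c in codes):
        raise ValueError("every code must fit in 2 bits (0..3)")
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)   # place each code in its own 2-bit slot
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, count):
    """Recover the first `count` 2-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for j in range(4):
            codes.append((byte >> (2 * j)) & 0b11)
    return codes[:count]


codes = [0, 1, 2, 3, 3, 0]
packed = pack_2bit(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
print(unpack_2bit(packed, len(codes)) == codes)
```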

All methods are designed to:
  - Make LLMs smaller
  - Minimize performance degradation
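
The "smaller" part is simple arithmetic: weight storage is roughly the number of parameters times the bits per parameter. The sketch below assumes a hypothetical 7-billion-parameter model and ignores the extra scales and zero points that real quantized checkpoints also store, so actual files are somewhat larger.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7_000_000_000  # hypothetical model size, for illustration only
for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {model_size_gb(num_params, bits):.2f} GB")
```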

---


### **GGUF** format for efficient storage

#### Fine Tuning
* PEFT methods

### FLAN-T5

In [13]:
model_name = "google/flan-t5-small"

In [14]:
import sentencepiece as spm
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained(model_name)

In [15]:
model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
)

In [20]:
input_text = "Hellow, my name is"

input_ids = tokenizer(input_text, return_tensors='pt')

outputs = model.generate(**input_ids, max_new_tokens=20)


In [21]:
# skip_special_tokens=True drops markers such as <pad> and </s> from the decoded text
tokenizer.decode(outputs[0], skip_special_tokens=True)

'iam a scott'

In [None]:
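# Assumption: the size helper is accelerate's compute_module_sizes, which returns
# sizes in bytes keyed by module name ('' is the whole model).
from accelerate.utils import compute_module_sizes

module_sizes = compute_module_sizes(model)

print(f"The model size is {module_sizes[''] * 1e-9} GB")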

### Quantize the model (8-bit precision)

In [23]:
from quanto import quantize, freeze

import torch

In [27]:
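# Request 8-bit weights explicitly and leave activations unquantized.
# Newer quanto releases spell this dtype quanto.qint8 rather than torch.int8.
quantize(model, weights=torch.int8, activations=None)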

In [28]:
print(model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): QLinear(in_features=512, out_features=384, bias=False)
              (k): QLinear(in_features=512, out_features=384, bias=False)
              (v): QLinear(in_features=512, out_features=384, bias=False)
              (o): QLinear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): QLinear(in_features=512, out_features=1024, bias=False)
              (wi_1): QLinear(in_features=512, out_features=1024, bias=False)
              

In [29]:
freeze(model)

In [None]:
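# Assumption: the size helper is accelerate's compute_module_sizes, which returns
# sizes in bytes keyed by module name ('' is the whole model).
from accelerate.utils import compute_module_sizes

module_sizes = compute_module_sizes(model)

print(f"The model size is {module_sizes[''] * 1e-9} GB")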

In [30]:
# Load model directly
from transformers import AutoProcessor, AutoModelForMaskGeneration

processor = AutoProcessor.from_pretrained("Zigeng/SlimSAM-uniform-77")
model = AutoModelForMaskGeneration.from_pretrained("Zigeng/SlimSAM-uniform-77")
