## Quantization
* Quantization is a model optimization technique that reduces a model's size and speeds up inference by simplifying the arithmetic the model performs to map an input to an output.


* In deep learning, quantization means reducing the numerical precision of a neural network's weights and activations, for example from 32-bit floating point to 8-bit integers. The main benefits are:
  * Reduced memory footprint
  * Faster inference
  * Energy efficiency
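
To make "reducing the precision" concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names and the sample weights are hypothetical, chosen only for illustration; real libraries use more elaborate schemes (per-channel scales, zero points, calibration).

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.05, 0.9, -0.33]  # illustrative values
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is now stored as a single byte plus one shared scale, at the cost of a small rounding error.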

* Larger models tend to retain their capabilities even when converted to 4-bit, and some techniques, such as NF4, are reported to have little to no impact on performance.
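
As a rough illustration of what lower bit widths mean for memory, the sketch below estimates weight storage only. The 7-billion parameter count is an assumed example size, and the estimate ignores activations, caches, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7_000_000_000  # assumed 7B-parameter model, for illustration

footprints = {bits: weight_memory_gb(num_params, bits) for bits in (32, 16, 8, 4)}
for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts storage by a factor of four.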

In [None]:
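import torch
import torch.nn as nn

# Minimal stand-in for an LSTM + Linear model (layer sizes are illustrative).
class LSTMClassifier(nn.Module):
    def __init__(self, input_size=16, hidden_size=32, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])  # classify from the last time step

model = LSTMClassifier().eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.LSTM, nn.Linear},  # set of layer types to quantize dynamically
    dtype=torch.qint8,     # target dtype for the quantized weights
)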
print(quantized_model)

In [None]:
# !pip install transformers
# !pip install accelerate

# # Required for loading and creating GPTQ models
# !pip install optimum
# !pip install auto-gptq

### How to use an already quantized model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"

# The tokenizer has no weights, so it takes no dtype or device arguments.
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

## How to quantize a model using AutoGPTQ with Transformers

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# "c4" is the calibration dataset used while quantizing to 4 bits.
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)

## NF4 and Double Quantization can be leveraged using the bitsandbytes library

* Supports `load_in_8bit` and `load_in_4bit`.
* Currently supports only `LLM.int8()`, `FP4`, and `NF4` quantization.
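
Double quantization saves memory by quantizing the per-block scaling constants themselves. The arithmetic below is a sketch of that saving. The block sizes (64 weights per block, 256 constants per second-level block) and bit widths follow the description in the QLoRA paper and are assumptions here, not values read from the bitsandbytes library.

```python
# Assumed settings, taken from the QLoRA paper's description.
WEIGHT_BLOCK = 64      # weights sharing one scaling constant
CONSTANT_BLOCK = 256   # scaling constants sharing one second-level constant

# Without double quantization: one 32-bit constant per block of weights.
single_overhead = 32 / WEIGHT_BLOCK

# With double quantization: 8-bit constants, plus one 32-bit
# second-level constant per block of first-level constants.
double_overhead = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saved = single_overhead - double_overhead
print(f"overhead without double quantization: {single_overhead:.3f} bits/param")
print(f"overhead with double quantization:    {double_overhead:.3f} bits/param")
print(f"saved:                                {saved:.3f} bits/param")
```

Under these assumptions the overhead added on top of the 4-bit weights drops from 0.5 to roughly 0.127 bits per parameter.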

In [None]:
# !pip install bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # use load_in_8bit=True for LLM.int8() instead
    bnb_4bit_quant_type="nf4",              # "nf4" or "fp4"
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation; may differ from the storage dtype
)

model_name = "PY007/TinyLlama-1.1B-step-50K-105b"

# The tokenizer is not quantized, so it does not take a quantization_config.
tokenizer_nf4 = AutoTokenizer.from_pretrained(model_name)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config)