### GenAI
* Here we try Llama 3.1 8B.

In [1]:
# If you haven't installed these packages yet, uncomment and run:
# %pip install -U bitsandbytes
# %pip install -U transformers
# %pip install -U accelerate
# %pip install -U peft
# %pip install -U trl

#### Import libraries

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [3]:
# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # Nested (double) quantization: the quantization constants are quantized too,
    # for more memory-efficient inference and training
    bnb_4bit_use_double_quant=True,
    # 4-bit loading supports two quantization types, FP4 and NF4 (Normal Float 4);
    # NF4 is chosen here for better results
    bnb_4bit_quant_type="nf4",
    # dtype used for computation once the 4-bit weights are dequantized
    bnb_4bit_compute_dtype="bfloat16",
)
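
To give a feel for what nested quantization saves, here is a small arithmetic sketch. The block sizes below are assumptions taken from the QLoRA paper's description (64 weights per block, 256 first-level constants per second-level block); the settings bitsandbytes uses internally may differ, and the parameter count is approximate.

```python
# Per-parameter overhead of storing quantization constants.
# Assumed settings (from the QLoRA paper's description, not read from bitsandbytes):
BLOCK_SIZE = 64          # weights that share one quantization constant
SECOND_BLOCK_SIZE = 256  # first-level constants that share one second-level constant
N_PARAMS = 8.03e9        # assumption: approximate parameter count of Llama 3.1 8B


def overhead_bits(double_quant: bool) -> float:
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        # one 32-bit constant per block of weights
        return 32 / BLOCK_SIZE
    # 8-bit first-level constants plus 32-bit second-level constants
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)


plain = overhead_bits(False)
nested = overhead_bits(True)
saved_gb = (plain - nested) * N_PARAMS / 8 / 1e9

print(f"overhead without nesting: {plain:.3f} bits per weight")
print(f"overhead with nesting:    {nested:.3f} bits per weight")
print(f"approximate saving:       {saved_gb:.2f} GB")
```

Under these assumptions the saving is a few hundred megabytes, which is small next to the weights themselves but can matter on an 8GB card.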

In [4]:
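# Choose the model
model_id = "meta-llama/Meta-Llama-3.1-8B"

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="bfloat16",
    quantization_config=bnb_config,
)

# Settings commonly applied when preparing a model for fine-tuning.
# Note: use_cache=False turns off the key/value cache, which can slow down generation.
model.config.use_cache = False
model.config.pretraining_tp = 1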


##### Lower Precision
* Floating-point data types such as float32, float16, and bfloat16 are often referred to as "precision" in the context of machine learning.

* Lower precision: research has shown that running models at reduced numerical precision, namely 8-bit and 4-bit, can bring computational and memory advantages without a considerable decline in model performance.

* We use `load_in_8bit=True` and `load_in_4bit=True` so the model needs less VRAM; that way it can be loaded onto CUDA even on a GPU with 8GB of VRAM.
* We then compare the two settings.
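
As a rough sanity check on these memory figures, the sketch below estimates the memory needed for the weights alone at each precision. The parameter count is an approximation, and the estimate ignores activations, the KV cache, quantization constants, and CUDA overhead, so measured VRAM use will be higher.

```python
# Weight-only memory estimate at different precisions.
N_PARAMS = 8.03e9  # assumption: approximate parameter count of Llama 3.1 8B


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


estimates = {
    "bfloat16": weight_memory_gb(N_PARAMS, 16),
    "8-bit": weight_memory_gb(N_PARAMS, 8),
    "4-bit": weight_memory_gb(N_PARAMS, 4),
}

for name, gb in estimates.items():
    print(f"{name:>8}: {gb:.1f} GB")
```

The estimates come out at roughly 16, 8, and 4 GB, which is why the unquantized model does not fit on an 8GB GPU while the 4-bit version does.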

#### Load the tokenizer 

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [10]:
# Choose a prompt
prompt = "Hi i am aboud "

# Tokenize the prompt and return PyTorch tensors
token = tokenizer(prompt, return_tensors="pt")

# Split the result into input_ids and attention_mask
input_ids = token.input_ids
attention_mask = token.attention_mask

In [13]:
# Move the inputs to the GPU
input_ids = input_ids.to('cuda')
attention_mask = attention_mask.to('cuda')


In [29]:
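%%time
# Report how long this cell takes to run

# Use the model to generate the output
# max_length counts the prompt tokens plus the generated tokens
# pad_token_id 128001 is the <|end_of_text|> token of this tokenizer
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=10, pad_token_id=128001)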

CPU times: total: 688 ms
Wall time: 661 ms


In [30]:
# Convert the output tokens back into readable text
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['Hi i am aboud 4 year old']

### Summary

##### without `load_in_8bit=True`
* OK, not bad.
* But it takes about 2 minutes.
* And it takes about 16GB of RAM.

##### with `load_in_8bit=True`
* It takes about 8.5 sec.
* 8GB VRAM and 1.5GB shared memory.

##### with `load_in_4bit=True`
* It takes about 4 sec.
* 6GB VRAM.

##### with quantization config
* It takes about 3.5 sec.
* 5.5GB VRAM.
* We use the NF4 quantization type for better results.
* We use nested quantization for more memory-efficient inference and training.

##### with input_ids and attention_mask moved `.to('cuda')`
* It takes about 1.5 sec.
* The same 5.5GB VRAM.

##### you can change `max_length`
* Lowering it decreases generation time.
* Raising it increases generation time.
* The right value depends on your prompt and dataset.
* Here it is set to 10.
* With `max_length=10` it takes about 0.66 sec (the `%%time` cell above reports a wall time of 661 ms).
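
One detail worth keeping in mind: `max_length` caps the prompt tokens plus the generated tokens, so a longer prompt leaves less room for new text. The sketch below only illustrates that arithmetic; `new_token_budget` is a hypothetical helper and the prompt length is a made-up value, not the real token count of the prompt above.

```python
def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Tokens left for generation when max_length covers prompt + output."""
    return max(max_length - prompt_len, 0)


prompt_len = 7  # hypothetical token count, for illustration only
budget = new_token_budget(prompt_len, max_length=10)
print(f"prompt tokens: {prompt_len}, room for new tokens: {budget}")
```

If you would rather control only the number of generated tokens, `generate` also accepts `max_new_tokens`.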

##### sources
Optimizing LLMs for Speed and Memory: https://huggingface.co/docs/transformers/llm_tutorial_optimization <br>
bnb-4bit-integration: https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing

##### if you want to learn how 4-bit quantization works
* https://huggingface.co/blog/hf-bitsandbytes-integration
* https://huggingface.co/blog/4bit-transformers-bitsandbytes
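
For a rough intuition before reading those posts, here is a simplified sketch of blockwise absmax quantization to 4 bits in plain Python. It is not what bitsandbytes does internally: NF4 uses non-uniform levels matched to normally distributed weights, while this sketch uses evenly spaced integer codes, and the example weights are made up.

```python
def quantize_block(values, n_bits=4):
    """Quantize one block of floats to signed integer codes plus one absmax scale."""
    q_max = 2 ** (n_bits - 1) - 1  # largest code, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / q_max if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Made-up weights standing in for one block of a weight matrix
weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.00, 0.45, -0.88]

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 2) for r in restored])
print("max error:", round(max_error, 3))
```

Each weight is stored as a small integer code, and only one scale per block is kept in higher precision. The rounding error per weight is at most half a quantization step.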