<a href="https://colab.research.google.com/github/Rhuan-Messias/LLM_RAG_Study/blob/main/hugging_face_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q --upgrade bitsandbytes accelerate

In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc

# Authenticate with the Hugging Face Hub using the HF_TOKEN secret stored in Colab
login(userdata.get('HF_TOKEN'), add_to_git_credential=True)


In [None]:
# Instruct (chat-tuned) models

# Llama models are gated: access must be requested and approved on the model page first
LLAMA1 = 'meta-llama/Llama-3.1-8B-Instruct'

LLAMA2 = 'meta-llama/Llama-3.2-1B-Instruct'

PHI = "microsoft/Phi-4-mini-instruct"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"

In [None]:
messages = [
    {"role": "user", "content": "Tell a joke about Lord of The Rings"}
]

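Quantization reduces the memory footprint and computational cost of large language models (LLMs) by storing their weights, and sometimes their activations, in lower-precision data types such as 8-bit integers (INT8) or 4-bit formats (INT4, NF4), instead of the 32-bit (FP32) or 16-bit (FP16/BF16) floating-point numbers in which models are trained and usually distributed. The trade-off is a small loss of numerical precision in exchange for a model that fits on a much smaller GPU.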
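
As a rough back-of-the-envelope check of the memory claim above, the sketch below estimates the size of the weights alone at different precisions. The helper `weight_memory_gb` and the round figure of 8 billion parameters are assumptions made for illustration; a real measurement such as `model.get_memory_footprint()` will typically be higher, because quantization constants, layers left unquantized, and runtime buffers are not counted here.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: an "8B" model has roughly 8e9 parameters.
params = 8e9

for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(params, bits):5.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight memory by a factor of four, which is often what lets a model fit on a free Colab GPU.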

As an analogy: if you take a continuous range of values (say, the real numbers between 0 and 100) and represent each one by the nearest integer (0, 1, 2, ..., 100), you have quantized the data. The possible values are now limited to a finite, discrete set, at the cost of a small rounding error.
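
The rounding idea can be sketched in a few lines of plain Python. This is a simplified absmax (symmetric) scheme chosen for illustration, not the NF4 algorithm that bitsandbytes implements; the function names and the sample weights are hypothetical.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only
weights = [0.12, -0.45, 0.33, 1.0, -0.98]

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# Largest round-trip error at each precision
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))

print("8-bit:", q8, f"max error {err8:.4f}")
print("4-bit:", q4, f"max error {err4:.4f}")
```

Fewer bits mean fewer representable levels and therefore a larger rounding error, which is the precision that quantization trades for memory.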

In [None]:
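# Quantization config

quant_conf = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4 bits
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants to save a bit more memory
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for computation after dequantizing
)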

In [None]:
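# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA2)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
# add_generation_prompt=True appends the assistant header so the model answers instead of continuing the user turn
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")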

In [None]:
inputs

In [None]:
#The model
model = AutoModelForCausalLM.from_pretrained(
    LLAMA2,
    quantization_config=quant_conf,
    device_map="auto")

In [None]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:.2f} MB")

In [None]:
model

In [None]:
#running the model

outputs = model.generate(inputs, max_new_tokens=80)
outputs[0]

In [None]:
tokenizer.decode(outputs[0])

In [None]:
#Clean up memory

del model, inputs, tokenizer, outputs
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Wrap the steps above into one function that streams the reply as it is generated

def generate(model_name, messages, quant=True, max_new_tokens=80):
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
  input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
  attention_mask = torch.ones_like(input_ids)  # single unpadded prompt: attend to every token
  streamer = TextStreamer(tokenizer)  # prints tokens to stdout as they are produced

  # device_map="auto" places the weights on the GPU at load time;
  # calling .to('cuda') on a bitsandbytes-quantized model can raise an error
  if quant:
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_conf, device_map="auto")
  else:
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

  model.generate(input_ids=input_ids, attention_mask=attention_mask, streamer=streamer, max_new_tokens=max_new_tokens)

  # Free GPU memory so the next model can be loaded
  del model, input_ids, attention_mask, tokenizer, streamer
  gc.collect()
  torch.cuda.empty_cache()



In [None]:
generate(PHI, messages)