<a href="https://colab.research.google.com/github/Satyadeep-Dey/AI-experiments/blob/main/6_Quantization_%2B_Low_level_API_%2B_Call_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Low Level APIs

Hugging Face **low-level APIs** refer to the more granular, flexible building blocks provided by the transformers library that allow you to interact directly with models and tokenizers — without relying on high-level abstraction layers like pipeline.

In this notebook we look at the low-level API of Transformers: the tokenizer and model classes that wrap the underlying PyTorch code.



# Needed Libraries

1. **requests**
    * **Purpose:** Simple HTTP library for making API requests.
    * **Use case:** Useful for downloading models, datasets, or interacting with REST APIs.

2. **torch**
    * **Purpose:** PyTorch, a deep learning framework originally developed by Facebook (now Meta).
    * **Use case:** For building, training, and running neural networks.

3. **bitsandbytes**
    * **Purpose:** A lightweight wrapper around custom CUDA functions for 8-bit optimizers, 8-bit matrix multiplication, and 8-bit and 4-bit quantization.
    * **Use case:** Used to reduce memory usage and speed up large models, especially in inference or fine-tuning. Commonly used with Hugging Face models.

4. **transformers**
    * **Purpose:** Hugging Face's Transformers library.
    * **Use case:** Provides pre-trained transformer models like BERT, GPT, T5, etc., with easy APIs for text generation, classification, etc.

5. **sentencepiece**
    * **Purpose:** Tokenizer developed by Google for unsupervised text tokenization.
    * **Use case:** Many Hugging Face models (e.g., T5, mBART) use it for handling subword units.

6. **accelerate**
    * **Purpose:** Another Hugging Face library for optimizing model training and inference.
    * **Use case:** Helps scale training across CPUs, GPUs, or even TPUs with minimal code changes. Works great in multi-GPU setups too.

# What does -q do in the pip install command?

`-q` means quiet mode: it suppresses most of the usual output during installation, which keeps the notebook or terminal cleaner.

In [None]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate

In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [None]:
# instruct models

LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # Meta
PHI3 = "microsoft/Phi-3-mini-4k-instruct"         # Microsoft
GEMMA2 = "google/gemma-2-2b-it"                   # Google
QWEN2 = "Qwen/Qwen2-7B-Instruct"                  # Alibaba
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # Mistral AI -> If this doesn't fit in your GPU memory, try others from the hub

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of doctors"}
  ]

# Quantization

In machine learning, and especially deep learning, quantization is the process of reducing the precision of the numbers used to represent a model's weights (and, in some schemes, its activations).

Traditionally, models are trained and stored using 32-bit floating-point numbers (float32).
Quantization reduces this to lower precision types like:
  1. 16-bit (e.g., float16, bfloat16)
  2. 8-bit (e.g., int8)
  3. 4-bit (e.g., nf4, fp4)
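
To make the idea concrete, here is a minimal sketch of one simple scheme, absmax quantization to int8, written in plain Python. It is only an illustration under stated assumptions: the function names and the sample weights are made up for this example, and this is not how bitsandbytes implements its kernels.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # hypothetical weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

Each weight now needs one byte instead of four, at the cost of a round-trip error of at most half the scale.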

**Why Quantize?**
Quantization is mainly used for efficiency:

  * Lower Memory - Model takes up less RAM/VRAM
  * Faster Inference - Smaller numbers → faster computation on some hardware
  * Lower Power - Especially useful for edge devices (phones, Raspberry Pi, etc.)
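
The memory saving can be estimated with simple arithmetic: parameters times bits per parameter. The sketch below assumes a model with roughly 8 billion parameters and ignores overhead such as quantization constants, activations, and the KV cache; the helper name is hypothetical.

```python
BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "int8": 8, "nf4": 4}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * BITS_PER_PARAM[dtype] / 8 / 1e9


num_params = 8e9  # assumption: roughly the size of an "8B" model

for dtype in BITS_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):5.1f} GB")
```

Under these assumptions the weights shrink from about 32 GB in float32 to about 4 GB in 4-bit.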



In [None]:
# Quantization Config - this allows us to load the model into memory and use less memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

# Let's see what these parameters mean

  * `load_in_4bit=True`
    * This enables 4-bit quantization, a very memory-efficient format. The model weights are stored and loaded in 4-bit precision instead of 16-bit or 32-bit, drastically reducing memory usage.

  * `bnb_4bit_use_double_quant=True`
    * Double quantization compresses the model even further. It applies a second quantization step to the quantization constants themselves (the per-block scaling factors), which saves extra memory with minimal impact on accuracy.

  * `bnb_4bit_compute_dtype=torch.bfloat16`
    * This sets the data type used for computation, here bfloat16 (Brain Floating Point 16). The weights are stored in 4-bit but are dequantized to bfloat16 for the actual matrix operations, which is more precise and well supported by modern GPUs such as the A100 and H100.

  * `bnb_4bit_quant_type="nf4"`
    * This sets the quantization scheme to "nf4", which stands for 4-bit NormalFloat. nf4 is a 4-bit data type designed for weights that are roughly normally distributed, and it has been shown to preserve accuracy better than other 4-bit formats such as fp4.
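
To see why double quantization helps, it is worth counting the extra bits that the quantization constants cost per weight. The sketch below uses the block sizes described in the QLoRA paper (64 weights per constant, 256 constants per second-level constant); treat these as assumptions, since the values used by your installed bitsandbytes version may differ.

```python
BLOCK_SIZE = 64          # assumption: weights that share one scaling constant
SECOND_BLOCK_SIZE = 256  # assumption: constants that share one second-level constant

# Without double quantization: one 32-bit constant per block of weights
single_overhead = 32 / BLOCK_SIZE

# With double quantization: 8-bit constants, plus one 32-bit constant
# for every SECOND_BLOCK_SIZE first-level constants
double_overhead = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)

saved_bits = single_overhead - double_overhead
num_params = 8e9  # assumption: an "8B" model
saved_gb = saved_bits * num_params / 8 / 1e9

print(f"overhead without double quant: {single_overhead:.3f} bits/param")
print(f"overhead with double quant:    {double_overhead:.3f} bits/param")
print(f"approximate saving for 8B params: {saved_gb:.2f} GB")
```

The saving per parameter is small, but across billions of parameters it adds up to a few hundred megabytes.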

Summary:
  * This config tells the model to:
    * Load in a very memory-efficient 4-bit format (nf4)
    * Use double quantization to further shrink size
    * Perform computations in bfloat16, a fast and reasonably accurate format on supported hardware
  * This setup is often used to fine-tune or run large language models (like LLaMA or Mistral) on limited hardware (like consumer GPUs or smaller cloud instances).

In [None]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Registers '[PAD]' as the pad token, adding it to the tokenizer's vocabulary if it is not already there.

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", padding=True).to("cuda")  # "cuda" -> use the GPU
# return_tensors="pt" returns PyTorch tensors rather than a Python string or list.
# The code below treats `inputs` as a plain tensor of token ids, not a dictionary.

input_ids = inputs  # or inputs.input_ids if the result is a dictionary with separate keys
# Build the attention mask from input_ids: 1 for real tokens, 0 wherever the id equals the pad token id
attention_mask = (input_ids != tokenizer.pad_token_id).type(torch.int64).to("cuda")
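
The attention mask only matters once sequences of different lengths are padded into one batch. The following plain-Python sketch shows the idea with made-up token ids; `pad_batch` is a hypothetical helper, not a tokenizer method.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + seq)
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(seq + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


batch = [[11, 12, 13, 14], [21, 22]]  # hypothetical token ids
ids, mask = pad_batch(batch, pad_id=0)
print(ids)
print(mask)

# With a single sequence nothing is padded, so the mask is all ones
single_ids, single_mask = pad_batch([[11, 12, 13]], pad_id=0)
print(single_mask)
```

This is why the mask in the cell above, built for a single prompt, ends up being all ones.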


In [None]:
# The model

model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)

# What device_map="auto" does:
# device_map="auto" automatically splits the model across all available GPUs (or just one if you only have one).
# It’s especially useful for very large models like LLaMA, which might not fit on a single GPU.
# Under the hood, it uses accelerate to analyze your available hardware and figure out the best layer-to-device mapping.
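
As a rough illustration of what a layer-to-device mapping looks like, here is a minimal greedy sketch in plain Python. It is not accelerate's actual algorithm; the function name, layer names, sizes, and memory budgets are all hypothetical.

```python
def greedy_device_map(layer_sizes, device_budgets):
    """Assign layers, in order, to the first device that still has room.

    Devices are tried in the order given and never revisited, so layers
    that sit next to each other in the model stay on the same device
    where possible.
    """
    devices = list(device_budgets)
    remaining = dict(device_budgets)
    device_map = {}
    current = 0
    for name, size in layer_sizes.items():
        # Move on to the next device once the current one is full
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"no device has room for {name}")
        device = devices[current]
        device_map[name] = device
        remaining[device] -= size
    return device_map


layers = {f"layer_{i}": 2.0 for i in range(6)}          # hypothetical sizes in GB
budgets = {"cuda:0": 5.0, "cuda:1": 5.0, "cpu": 16.0}   # hypothetical budgets in GB
layout = greedy_device_map(layers, budgets)
print(layout)
```

In this toy setup the first two layers land on the first GPU, the next two on the second, and the rest spill over to the CPU.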

In [None]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

In [None]:

outputs = model.generate(
    inputs,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    max_new_tokens=80
)

print(tokenizer.decode(outputs[0]))

In [None]:
# Clean up

del inputs, outputs, model
torch.cuda.empty_cache()

## Now let's make a function and enable streaming

Use a Hugging Face utility called `TextStreamer` so that results stream back as they are generated.
To stream results, we simply replace:  
`outputs = model.generate(inputs, max_new_tokens=80)`  
With:  
`streamer = TextStreamer(tokenizer)`  
`outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)`
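
Conceptually, a streamer is just an object that the generation loop calls once per new token. The toy sketch below mimics that pattern in plain Python. `PrintStreamer` and `fake_generate` are hypothetical stand-ins, and unlike the real `TextStreamer` this version receives ready-made text pieces rather than token ids.

```python
class PrintStreamer:
    """Toy streamer: prints each piece as soon as it arrives."""

    def __init__(self):
        self.pieces = []

    def put(self, piece):
        self.pieces.append(piece)
        print(piece, end="", flush=True)

    def end(self):
        print()  # finish the line once generation stops


def fake_generate(tokens, streamer=None):
    """Stand-in for a generation loop that emits one token at a time."""
    produced = []
    for token in tokens:
        produced.append(token)
        if streamer is not None:
            streamer.put(token)
    if streamer is not None:
        streamer.end()
    return produced


streamer = PrintStreamer()
result = fake_generate(["Why ", "did ", "the ", "doctor ", "laugh?"], streamer=streamer)
```

The caller still gets the full result at the end; the streamer only changes when the text becomes visible.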

I also added the argument `add_generation_prompt=True` to the call that applies the chat template. This ensures that the model (Phi, for example) generates a response to the question, instead of just predicting how the user prompt continues. Try setting this to False to see what happens. You can read about this argument here:

https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts
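
To illustrate what the generation prompt changes, here is a toy chat renderer in plain Python. It uses ChatML-style markers as an assumption; each model ships its own template, so the real output of `apply_chat_template` will differ, and `render_chat` is a hypothetical helper.

```python
def render_chat(messages, add_generation_prompt=False):
    """Toy chat template using ChatML-style markers (an assumption)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of
        # continuing the user's message
        text += "<|im_start|>assistant\n"
    return text


demo_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of doctors"},
]

without_prompt = render_chat(demo_messages)
with_prompt = render_chat(demo_messages, add_generation_prompt=True)
print(with_prompt)
```

The only difference between the two strings is the trailing assistant header, which tells the model that it is now its turn to speak.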



In [None]:
# Wrapping everything in a function - and adding Streaming and generation prompts

def generate(model_name, messages):
    # Load the tokenizer and register a dedicated pad token
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Render the chat template to token ids; add_generation_prompt=True opens the assistant turn
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", padding=True, add_generation_prompt=True
    ).to("cuda")
    # Attention mask: 1 for real tokens, 0 wherever the id equals the pad token id
    attention_mask = (inputs != tokenizer.pad_token_id).type(torch.int64).to("cuda")

    # Print tokens as they are generated
    streamer = TextStreamer(tokenizer)

    # Load the model with the 4-bit quant_config defined earlier
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)

    outputs = model.generate(
        inputs,
        attention_mask=attention_mask,
        max_new_tokens=80,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        streamer=streamer
    )

    # Clean up: drop the references and release cached GPU memory
    del tokenizer, streamer, model, inputs, outputs
    torch.cuda.empty_cache()




# Let's now call some LLMs

In [None]:
generate(PHI3, messages)

In [None]:
generate(LLAMA, messages) # Meta

In [None]:
generate(QWEN2, messages)  # Alibaba

In [None]:
# Let's try another prompt (generate() reloads the model on every call; the weights are already cached on disk)
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell me something about Assam"}
  ]
generate(QWEN2, messages)  # Alibaba

**Gemma from Google requires us to accept their terms in Hugging Face.**

  * Visit this page to ask for access -
    https://huggingface.co/google/gemma-2-2b-it

In [None]:
message_gemma = [{"role": "user", "content": "Tell a light-hearted joke for a room of Doctors"}]
# Gemma's chat template does not support the system role, so only a user message is sent
generate(GEMMA2, message_gemma)
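
If you would rather keep the system instructions than drop them, one option is to fold them into the first user message before applying the template. The helper below is a hypothetical sketch in plain Python, not part of transformers.

```python
def fold_system_into_user(messages):
    """Return a copy of `messages` without system entries.

    System content is prepended to the first remaining message when that
    message comes from the user; otherwise it is simply dropped.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    others = [dict(m) for m in messages if m["role"] != "system"]
    if system_parts and others and others[0]["role"] == "user":
        others[0]["content"] = "\n\n".join(system_parts + [others[0]["content"]])
    return others


original = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of doctors"},
]
folded = fold_system_into_user(original)
print(folded)
```

The resulting list contains only a user message, so it can be passed to a chat template that rejects the system role.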

**Mixtral also requires us to accept their terms on Hugging Face.**

  * Visit https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 to ask for access.
  * This model needs a lot of disk space and CPU/GPU memory!

In [None]:
# generate(MIXTRAL,messages)
# Not executed, since it runs out of disk space, even with the 112.6 GB I had available.