## Models

A **model** is a machine learning algorithm that has been trained to perform a specific NLP task, like text generation, classification, or translation. Hugging Face provides access to many pre-trained models from state-of-the-art architectures like:

* **BERT (Bidirectional Encoder Representations from Transformers):** Primarily used for understanding the context of text (e.g., classification, NER).

* **GPT (Generative Pre-trained Transformer):** Primarily used for generating text, such as continuing a prompt.

* **T5 (Text-to-Text Transfer Transformer):** A versatile model that treats every task as a text generation problem (e.g., summarization, translation).

* **DistilBERT:** A smaller, faster version of BERT, designed to be more efficient.

In [1]:
# Importing Libraries

from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc
import os

In [2]:
#Loading HF Token

hf_token = os.getenv("HUGGING_FACE_WRITE_TOKEN")
login(hf_token)

In [3]:
# Instruct models

LLAMA = "meta-llama/Llama-3.2-1B-Instruct"

PHI = "microsoft/Phi-4-mini-instruct"
GEMMA = "google/gemma-3-270m-it"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"

# Reasoning model

DEEPSEEK = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Quantization in Large Language Models (LLMs)

## 1. Definition
- Quantization reduces the numerical precision of model parameters
- Converts high-precision values (FP32 / FP16) to lower precision (INT8, INT4, etc.)
- Primarily used for inference optimization
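
As a rough illustration of the idea, here is a minimal sketch of symmetric "absmax" quantization in plain Python. The function names and the sample weights are made up for this example, and real libraries such as bitsandbytes use more refined block-wise schemes; this only shows how rounding to fewer levels loses precision.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


def max_error(original, restored):
    return max(abs(o - r) for o, r in zip(original, restored))


weights = [0.12, -0.53, 0.98, -1.27, 0.05]  # hypothetical weight values

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)
err8 = max_error(weights, dequantize(q8, s8))
err4 = max_error(weights, dequantize(q4, s4))

print("8-bit integers:", q8, "max round-trip error:", err8)
print("4-bit integers:", q4, "max round-trip error:", err4)
```

Going from 8 bits to 4 bits leaves only 15 integer levels (-7 to 7), so the round-trip error grows. That is the accuracy cost that buys the smaller size.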

---

## 2. Purpose of Quantization
- Reduce model size
- Speed up inference
- Lower memory and compute requirements
- Enable deployment on edge and low-resource devices

---

## 3. Benefits
- Smaller disk and memory footprint
- Faster inference latency
- Lower power consumption
- Cost-effective deployment at scale

---

## 4. Trade-offs
- Slight degradation in accuracy
- Aggressive quantization may affect:
  - Reasoning quality
  - Numerical stability
  - Long-context performance

---

## 5. Common Precision Levels
- FP32: highest accuracy, very large size
- FP16 / BF16: near-FP32 accuracy, smaller size
- INT8: good balance of size and accuracy
- INT4: very small, faster inference
- INT2: experimental, significant accuracy loss

---

## 6. Types of Quantization

### 6.1 Post-Training Quantization (PTQ)
- Applied after model training
- No retraining required
- Most widely used in practice

### 6.2 Quantization-Aware Training (QAT)
- Model trained with quantization simulation
- Higher accuracy than PTQ
- Computationally expensive

---

## 7. Popular Quantization Methods
- GPTQ: layer-wise optimization, good 4-bit quality
- AWQ: activation-aware, better accuracy at low bit-widths
- GGUF: the model file format used by llama.cpp, which supports several quantization schemes and works well for CPU inference

---

## 8. What Gets Quantized
- Model weights (always)
- Activations (sometimes)
- KV cache (important for long context efficiency)
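
To get a feel for why the KV cache matters at long context, here is a back-of-the-envelope sketch. It assumes the Llama 3.2 1B shapes printed later in this notebook (16 decoder layers, key and value projections of width 512). The 8,192-token context, the batch size of 1, and the helper name `kv_cache_bytes` are made up for illustration.

```python
def kv_cache_bytes(num_layers, kv_width, seq_len, bytes_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer
    return 2 * num_layers * kv_width * seq_len * bytes_per_value


fp16_cache = kv_cache_bytes(num_layers=16, kv_width=512, seq_len=8192, bytes_per_value=2)
int8_cache = kv_cache_bytes(num_layers=16, kv_width=512, seq_len=8192, bytes_per_value=1)

print(f"16-bit KV cache: {fp16_cache / 1e6:,.1f} MB")
print(f" 8-bit KV cache: {int8_cache / 1e6:,.1f} MB")
```

The cache grows linearly with sequence length, so halving the bytes per value halves the memory needed for the same context.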

---

## 9. Memory Impact Example
- 7B model FP16 ≈ 14 GB
- 7B model INT8 ≈ 7 GB
- 7B model INT4 ≈ 3.5 GB
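
These figures follow directly from parameters × bits per parameter. A small sketch of the arithmetic (the helper name `weight_memory_gb` is made up here, and it counts weights only, ignoring activations, the KV cache, and quantization constants):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B model {name}: {weight_memory_gb(7e9, bits):.1f} GB")
```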

---

## 10. When to Use Quantization
- Running models locally
- Large-scale inference deployment
- Memory- or latency-constrained environments

---

## 11. When to Avoid Heavy Quantization
- High-precision reasoning tasks
- Training or fine-tuning scenarios
- Accuracy-critical applications

---

## 12. Key Takeaway
- Quantization trades minimal accuracy for major gains in efficiency
- Essential for practical LLM deployment


In [4]:
messages = [
    {"role": "user", "content": "Tell a joke for a room of Data Scientists"}
  ]

In [5]:
# Quantization Config - this allows us to load the model into memory and use less memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True, # Loads model weights in 4-bit precision, drastically reducing memory usage at the cost of a small accuracy drop.
    bnb_4bit_use_double_quant=True, # Applies double quantization to further compress 4-bit weights, saving additional memory with minimal quality impact.
    bnb_4bit_compute_dtype=torch.bfloat16, # Performs computations in bfloat16 for improved numerical stability while keeping weights in 4-bit storage.
    bnb_4bit_quant_type="nf4" # Uses NF4 quantization optimized for normally distributed LLM weights, giving better accuracy than standard INT4 at the same size.
)
# NF4 is a 4-bit data type designed specifically for LLM weights
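
To make the double quantization setting more concrete, here is a rough sketch of the bookkeeping it saves. Block-wise quantization stores one scaling constant per block of weights, and double quantization compresses those constants too. The block sizes (64 and 256) and constant widths (32-bit and 8-bit) are assumptions taken from the QLoRA paper's description, not read from the bitsandbytes source, and the function name is made up.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Average extra bits per weight spent on storing quantization constants."""
    if not double_quant:
        # One 32-bit constant for every block of weights
        return 32 / block_size
    # 8-bit constants, plus one 32-bit constant per group of 8-bit constants
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = constant_overhead_bits()
doubled = constant_overhead_bits(double_quant=True)
saved_mb_per_billion = (plain - doubled) * 1e9 / 8 / 1e6

print(f"without double quantization: {plain:.3f} bits per weight")
print(f"with double quantization:    {doubled:.3f} bits per weight")
print(f"saved per billion weights:   {saved_mb_per_billion:.1f} MB")
```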

In [6]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token

# LLaMA models do not have a native padding token, so we set pad_token = eos_token to make batching and tensor operations work without breaking the model.

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", padding=True).to("cuda")

# return_tensors="pt" returns the token IDs as a PyTorch tensor

In [7]:
inputs

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1544,   3799,    220,   2366,     20,    271, 128009, 128006,
            882, 128007,    271,  41551,    264,  22380,    369,    264,   3130,
            315,   2956,  57116, 128009]], device='cuda:0')

In [8]:
# Model

model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)

[2025-12-27 18:06:03,406] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -lcufile: No such file or directory
collect2: error: ld returned 1 exit status


In [9]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

Memory footprint: 1,012.0 MB


## Looking under the hood at the Transformer model

The next cell prints the HuggingFace `model` object for Llama.

This model object is a Neural Network, implemented with the Python framework PyTorch. The Neural Network uses the architecture invented by Google scientists in 2017: the Transformer architecture.

While we're not going to go deep into the theory, this is an opportunity to get some intuition for what the Transformer actually is.

If you're completely new to Neural Networks, check out this [YouTube intro playlist](https://www.youtube.com/playlist?list=PLWHe-9GP9SMMdl6SLaovUQF2abiLGbMjs) for the foundations.

Now take a look at the layers of the Neural Network that get printed in the next cell. Look out for this:

- It consists of layers
- There's something called "embedding" - this takes tokens and turns them into 2,048-dimensional vectors (4,096 for Llama 3.1 8B). We'll learn more about this in an upcoming week.
- There are then 16 groups of layers (32 for Llama 3.1 8B) called "Decoder layers". Each Decoder layer contains three types of layer: (a) self-attention layers, (b) multi-layer perceptron (MLP) layers, and (c) RMS normalization layers (`LlamaRMSNorm`).
- There is an LM Head layer at the end; this produces the output

Notice the `Linear4bit` layers in the printout: they show that the model has been quantized to 4 bits.

It's not required to go any deeper into the theory at this point, but if you'd like to, I've asked our mutual friend to take this printout and make a tutorial to walk through each layer. This also looks at the dimensions at each point. If you're interested, work through this tutorial after running the next cell:

https://chatgpt.com/canvas/shared/680cbea6de688191a20f350a2293c76b

In [10]:
# Execute this cell and look at what gets printed; investigate the layers

model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), 
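
If you're curious where the 1,012.0 MB memory footprint from earlier comes from, the shapes in this printout are enough for a rough estimate. This is a sketch under stated assumptions, not a description of how `get_memory_footprint()` works internally: it assumes half a byte per `Linear4bit` weight, 2 bytes for every unquantized weight, an output head that shares the embedding weights, and it ignores quantization constants.

```python
# Shapes copied from the printed model (Llama 3.2 1B Instruct)
VOCAB, HIDDEN, LAYERS = 128256, 2048, 16
LINEAR_SHAPES = [  # (in_features, out_features) of each Linear4bit in one decoder layer
    (2048, 2048),  # q_proj
    (2048, 512),   # k_proj
    (2048, 512),   # v_proj
    (2048, 2048),  # o_proj
    (2048, 8192),  # gate_proj
    (2048, 8192),  # up_proj
    (8192, 2048),  # down_proj
]

linear_params = LAYERS * sum(i * o for i, o in LINEAR_SHAPES)
embedding_params = VOCAB * HIDDEN
norm_params = (2 * LAYERS + 1) * HIDDEN  # two RMSNorms per layer plus the final norm

# Assumptions: 4 bits (half a byte) per quantized weight, 2 bytes for everything else
quantized_bytes = linear_params // 2
unquantized_bytes = (embedding_params + norm_params) * 2
total_mb = (quantized_bytes + unquantized_bytes) / 1e6

print(f"Quantized linear layers: {quantized_bytes / 1e6:,.1f} MB")
print(f"Embedding and norms:     {unquantized_bytes / 1e6:,.1f} MB")
print(f"Estimated footprint:     {total_mb:,.1f} MB")
```

Under these assumptions, the unquantized embedding table takes up about as much memory as all the 4-bit linear layers combined.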

### And if you want to go even deeper into Transformers

In addition to looking at each of the layers in the model, you can actually look at the HuggingFace code that implements Llama using PyTorch.

Here is the HuggingFace Transformers repo:  
https://github.com/huggingface/transformers

And within this, here is the code for Llama 4:  
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama4/modeling_llama4.py

It's a fascinating rabbit hole if you're interested!

In [11]:
# OK, with that, now let's run the model!

outputs = model.generate(inputs, max_new_tokens=80)
outputs[0]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


tensor([128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
            25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
           220,   1544,   3799,    220,   2366,     20,    271, 128009, 128006,
           882, 128007,    271,  41551,    264,  22380,    369,    264,   3130,
           315,   2956,  57116, 128009, 128006,  78191, 128007,    271,   8586,
           596,    264,  22380,    369,    264,   3130,    315,    828,  14248,
          1473,  10445,   1550,    279,    828,  28568,   1464,    709,    449,
           813,  23601,   1980,  18433,    568,   4934,    311,  24564,    872,
          5133,    323,   1505,    264,    810,  11297,   3717,   2268,     40,
          3987,    420,  22380,  12716,    264,  15648,    311,    872,  12580,
           323,   8779,   1124,  12234,   1306,    264,   1317,   1938,    315,
          3318,    449,   5219,      0, 128009], device='cuda:0')

In [12]:
# Well that doesn't make much sense!
# How about this..

tokenizer.decode(outputs[0])

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 27 Dec 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell a joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHere's a joke for a room of data scientists:\n\nWhy did the data scientist break up with his girlfriend?\n\nBecause he wanted to analyze their relationship and find a more efficient connection!\n\nI hope this joke brings a smile to their faces and helps them relax after a long day of working with numbers!<|eot_id|>"

In [13]:
# Clean up memory like previous notebooks

del model, inputs, tokenizer, outputs
gc.collect()
torch.cuda.ipc_collect()
torch.cuda.empty_cache()

## A couple of quick notes on the next block of code:

I'm using a HuggingFace utility called TextStreamer so that results stream back.
To stream results, we simply replace:  
`outputs = model.generate(inputs, max_new_tokens=80)`  
With:  
`streamer = TextStreamer(tokenizer)`  
`outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)`

Also I've added the argument `add_generation_prompt=True` to my call to create the Chat template. This ensures that Phi generates a response to the question, instead of just predicting how the user prompt continues. Try experimenting with setting this to False to see what happens. You can read about this argument here:

https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts
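
To see what the generation prompt looks like, here is a hand-written sketch that mimics the Llama 3 prompt format visible in the decoded output earlier in this notebook. It is not the tokenizer's real template code, the function name `render_llama3_prompt` is made up, and other models such as Phi use their own markers; it only shows what `add_generation_prompt=True` appends.

```python
# System text copied from the decoded Llama output earlier in this notebook
SYSTEM_TEXT = "Cutting Knowledge Date: December 2023\nToday Date: 27 Dec 2025\n\n"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def render_llama3_prompt(messages, system_text, add_generation_prompt=False):
    """Build a Llama-3-style prompt string by hand (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system_text}<|eot_id|>")
    for message in messages:
        role, content = message["role"], message["content"]
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user turn
        parts.append(ASSISTANT_HEADER)
    return "".join(parts)


demo_messages = [{"role": "user", "content": "Tell a joke for a room of Data Scientists"}]
without_prompt = render_llama3_prompt(demo_messages, SYSTEM_TEXT)
with_prompt = render_llama3_prompt(demo_messages, SYSTEM_TEXT, add_generation_prompt=True)

print(repr(with_prompt[len(without_prompt):]))
```

The only difference between the two strings is the trailing assistant header, which is the cue for the model to start its reply.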

In [14]:
# Wrapping everything in a function - and adding Streaming and generation prompts

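def generate(model_name, messages, quant=True, max_new_tokens=80):
  # Tokenize the chat messages; add_generation_prompt=True opens the assistant turn
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
  # A single unpadded prompt, so every position is a real token
  attention_mask = torch.ones_like(input_ids)
  streamer = TextStreamer(tokenizer)
  # device_map="auto" places the weights on the GPU at load time, as in the earlier cell;
  # calling .to("cuda") on a bitsandbytes-quantized model can fail on some library versions
  if quant:
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
  else:
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
  outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=max_new_tokens, streamer=streamer)
  # Free GPU memory so the next model can be loaded
  del model, input_ids, attention_mask, outputs
  gc.collect()
  torch.cuda.empty_cache()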



In [None]:
generate(PHI, messages)

In [None]:
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]
generate(GEMMA, messages, quant=False)

In [None]:
generate(DEEPSEEK, messages, quant=False, max_new_tokens=500)