Hosting an LLM for your organization, community, or even for yourself requires careful planning and optimization, because serving cost directly shapes what you can afford to offer. Companies such as Together.ai and Nebius AI, which serve open-source LLMs through APIs, rely on many optimization techniques to reduce the cost of hosting an LLM.

These optimization techniques fall into three groups:

1. **Model Compression & Quantization Techniques**
   - **Weight-Only Quantization:** Lowers weight precision (e.g., 4-bit) for speed and memory savings.
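   - **AWQ:** Activation-aware Weight Quantization, which uses activation statistics to protect the most important weights and preserve accuracy at low bit widths.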
   - **KV Cache Quantization:** Reduces KV cache memory with int8/int4 quantization.
   - **Combined Techniques:** Mixes AWQ and KV cache quantization for best results.

2. **Core Inference Optimization Techniques**
   - **Persistent Batching:** Continuously adds requests to a running batch for better GPU use and lower latency.
   - **Blocked KV Cache (Paged Attention):** Uses non-contiguous memory blocks for KV cache to reduce fragmentation and share memory efficiently.
   - **High-Performance CUDA Kernels:** Custom GPU code for faster operations.
   - **Tensor Parallelism:** Splits model across multiple GPUs for large models.
   - **FlashAttention / Flash-Decoding:** Fast attention algorithms for long sequences.
   - **CUDA Graphs:** Captures GPU operations to reduce CPU overhead and latency.
   - **GQA Optimization:** Speeds up Grouped-Query Attention models.
   - **Long Context Handling:** Efficiently manages long inputs with NTK-RoPE scaling, LogN scaling, and prefix caching.

3. **Deployment & Serving Features**
   - **RESTful API Server:** Standard HTTP interface for model access.
   - **Multi-Model Serving:** Hosts several models in one deployment.
   - **Distributed Serving:** Spreads requests across machines/GPUs for scaling.
   - **Vision-Language Model Support:** Handles text and image inputs efficiently.
   - **Function Calling Support:** Enables structured outputs for tool/API integration.

# Understanding the Basics

If you don't have access to the Hugging Face Model Hub yet, uncomment and run the following cell to log in:

In [1]:
# notebook login huggingface.co
# from huggingface_hub import notebook_login
# notebook_login()

In [2]:
# Defining the model ID (We are using Llama-3.2 1B model)
model_id = "meta-llama/Llama-3.2-1B"

Importing Libraries

In [None]:
# Import necessary libraries from Hugging Face Transformers for model and tokenizer handling
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Import PyTorch for tensor operations
import torch
# Import time for measuring performance
import time
# Import tqdm for progress bars in Jupyter notebooks
from tqdm.notebook import tqdm
# Import json for handling JSON data
import json

# Import the custom function for evaluating model outputs using an LLM judge
from llm_as_eval import evaluate_with_judge

# Import the warnings module to suppress unnecessary warnings
import warnings

warnings.filterwarnings("ignore")

LLM_API_KEY = "YOUR_LLM_API_KEY"  # Replace with your actual API key (OpenAI, HuggingFace, Nebius, Together AI etc.)

Loading original model

In [None]:
# Load the tokenizer for the specified model
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the causal language model in bfloat16 precision
original_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto" # Places the model on the GPU when available, offloading to CPU if it does not fit
)

Some parameters are on the meta device because they were offloaded to the cpu.


Create a function to generate text, calculate memory usage, and measure time

In [4]:
# Function to generate text from a model and tokenizer with optional memory and time measurement
def generate_text(tokenizer, model, **kwargs):
    """
    Generate text using a given tokenizer and model.
    
    Args:
        tokenizer: The tokenizer for encoding input text.
        model: The language model for text generation.
        **kwargs: Additional arguments for model.generate(), including:
            - input_text (str): The prompt to generate from.
            - max_new_tokens, do_sample, temperature, top_p, top_k, pad_token_id, etc.
    
    Returns:
        tuple: A tuple containing the generated text (str), peak memory usage in MB (float or None),
               and generation time in seconds (float).
    """
    # Extract the input text from keyword arguments
    input_text = kwargs.pop('input_text')
    # Tokenize the input text and convert it to PyTorch tensors
    inputs = tokenizer(input_text, return_tensors="pt")
    
    # Check if a CUDA-enabled GPU is available
    if torch.cuda.is_available():
        # Reset peak memory statistics for the current CUDA device
        torch.cuda.reset_peak_memory_stats()
        # Get the device of the model (e.g., 'cuda:0')
        device = next(model.parameters()).device
        # Move the input tensors to the same device as the model
        inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Record the start time for generation
    start_time = time.time()
    # Generate text using the model with the provided inputs and generation arguments
    output_tokens = model.generate(**inputs, **kwargs)
    # Record the end time for generation
    end_time = time.time()
    
    # Calculate the total generation time and round it to 3 decimal places
    generation_time = round(end_time - start_time, 3)
    
    # Check if a CUDA-enabled GPU is available to measure memory
    if torch.cuda.is_available():
        # Measure the peak memory allocated on the GPU during generation
        # Convert bytes to megabytes (MB) and round to 3 decimal places
        peak_memory = round(torch.cuda.max_memory_allocated() / (1024 ** 2), 3)
    else:
        # If no GPU is available, set peak memory to None
        peak_memory = None
        
    # Decode the generated tokens back into a string, skipping special tokens
    generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
    # Return the generated text, peak memory usage, and generation time
    return generated_text, peak_memory, generation_time


Create a function to measure memory footprint of the model

In [5]:
def get_model_memory_footprint(model):
    """
    Get the memory footprint of a model in megabytes.
    
    Args:
        model: The model to measure memory footprint for.
    
    Returns:
        float: Memory footprint in megabytes, rounded to 2 decimal places.
    """
    # Get memory footprint in bytes and convert to megabytes
    memory_bytes = model.get_memory_footprint()
    memory_mb = memory_bytes / (1024 * 1024)
    
    return round(memory_mb, 2)

Running a simple example

In [7]:
# Example: Generate text using the loaded model and tokenizer

# Define the input prompt
prompt = "Some people are equal, but some people are more"

# Generate response with specified generation parameters
generated_response = generate_text(
    input_text=prompt,
    tokenizer=tokenizer,
    model=original_model,
    max_new_tokens=3,           # Limit to 3 new tokens
    do_sample=True,             # Enable sampling
    temperature=0.7,            # Sampling temperature
    top_p=0.9,                  # Nucleus sampling
    top_k=50,                   # Top-k sampling
    pad_token_id=tokenizer.eos_token_id,  # Padding token
)

# Print the generated response and peak memory usage
print("Generated Response:", generated_response[0])

# Print the model's memory footprint
model_memory = get_model_memory_footprint(original_model)
print("Model Memory Footprint (MB):", model_memory)

if generated_response[1] is not None:
    print("Peak Memory Usage (MB):", generated_response[1])

Generated Response: Some people are equal, but some people are more equal than others
Model Memory Footprint (MB): 2357.13
Peak Memory Usage (MB): 1240.101


Looking at the architecture of Llama 3.2 1B

In [7]:
original_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):
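
The layer shapes printed above are enough to reproduce the reported memory footprint by hand. The sketch below counts parameters from those shapes and converts them to megabytes at 2 bytes per bf16 value. It assumes the output head shares its weights with `embed_tokens` (so it is not counted twice) and ignores small buffers; both are assumptions, since the printout stops before the head.

```python
# Shapes taken from the architecture printout above
vocab_size, hidden, kv_dim, mlp_dim, num_layers = 128256, 2048, 512, 8192, 16

embedding = vocab_size * hidden                      # embed_tokens
attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * mlp_dim                           # gate_proj, up_proj, down_proj
norms = 2 * hidden                                   # two RMSNorm weights per layer

# Assumption: the output head is tied to the embedding and adds no parameters
total_params = embedding + num_layers * (attention + mlp + norms) + hidden  # + final norm

bf16_footprint_mb = round(total_params * 2 / (1024 ** 2), 2)

print("Parameters:", total_params)            # 1235814400
print("bf16 footprint (MB):", bf16_footprint_mb)  # 2357.13
```

Under these assumptions the result agrees with the `Model Memory Footprint (MB)` value reported earlier.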

(Original Model) Let's look at the weights of the query projection (q_proj) in the self-attention mechanism of the first decoder layer.

In [9]:
# Access the weight matrix of the query projection (q_proj) 
# in the self-attention mechanism of the first decoder layer of the model.
q_proj_weight = original_model.model.layers[0].self_attn.q_proj.weight

# Display the weight tensor
q_proj_weight

Parameter containing:
tensor([[-0.0183,  0.0071,  0.0219,  ..., -0.0070, -0.0089,  0.0149],
        [ 0.0112,  0.0593,  0.0630,  ..., -0.0334, -0.0148,  0.0058],
        [ 0.0182,  0.0141,  0.0361,  ..., -0.0432, -0.0388, -0.0233],
        ...,
        [ 0.0305,  0.0289,  0.0801,  ..., -0.0767, -0.0311, -0.0334],
        [ 0.0242, -0.0325,  0.0369,  ..., -0.0123, -0.0269, -0.0151],
        [-0.0264, -0.0498, -0.0210,  ...,  0.0601,  0.0130, -0.0007]],
       device='cuda:0', dtype=torch.bfloat16, requires_grad=True)

Loading our eval dataset

In [6]:
# Import the json library to work with JSON files
import json

# --- Load Evaluation Dataset ---
# Open the evaluation data file in read mode
with open('eval_data.json', 'r') as file:
    # Load the JSON content from the file into the 'eval_data' variable
    eval_data = json.load(file)

# --- Display Sample Questions and Answers ---
# Print the first two questions and their answers from the dataset to verify it's loaded correctly
print("Sample Question 1:", eval_data[0]['q'])
print("Sample Answer 1:", eval_data[0]['a'])
print("\nSample Question 2:", eval_data[1]['q'])
print("Sample Answer 2:", eval_data[1]['a'])

# --- Total Number of Questions ---
# Print the total number of questions in the evaluation dataset
print("Total Number of Questions:", len(eval_data))

Sample Question 1: What color is the sky?
Sample Answer 1: The sky is blue.

Sample Question 2: How many legs does a cat have?
Sample Answer 2: It has four legs.
Total Number of Questions: 49


Let's create a function for evaluating the model on the evaluation dataset.

In [7]:
# --- Evaluate the Model ---
def evaluate_model(model, tokenizer, eval_data, **kwargs):
    """
    Evaluate the model on a given dataset.
    
    Args:
        model: The language model to evaluate.
        tokenizer: The tokenizer for encoding input text.
        eval_data: A list of dictionaries containing evaluation data with 'q' (question) and 'a' (answer).
        **kwargs: Additional arguments to be passed to the generate_text function.
    
    Returns:
        list: A list of dictionaries, where each dictionary contains the question, 
              generated answer, ground truth answer, peak memory usage (MB), 
              and generation time (seconds).
    """
    
    results = []
    
    # Set default generation parameters
    generation_params = {
        'max_new_tokens': 10,
        'do_sample': True,
        'temperature': 0.7,
        'top_p': 0.9,
        'top_k': 50,
    }
    
    # Update with any user-provided kwargs
    generation_params.update(kwargs)
    
    for item in tqdm(eval_data, desc="Evaluating model"):
        question = item['q']
        ground_truth_answer = item['a']
        
        # Generate an answer using the model and capture all metrics
        generated_answer, peak_memory, generation_time = generate_text(
            input_text=question,
            tokenizer=tokenizer,
            model=model,
            **generation_params
        )
        
        # Append the results as a dictionary
        results.append({
            "question": question,
            "generated_answer": generated_answer,
            "ground_truth_answer": ground_truth_answer,
            "peak_memory_mb": peak_memory,
            "generation_time_s": generation_time
        })

    # Return the list of results
    return results
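
The list returned by `evaluate_model` can be summarized without the LLM judge when only latency and memory are of interest. The helper below is a minimal sketch; `summarize_results` is a hypothetical name that is not part of `llm_as_eval`, and the two sample rows use illustrative values.

```python
from statistics import mean

def summarize_results(results):
    """Average latency and peak memory over a list of evaluate_model() rows."""
    latencies = [row["generation_time_s"] for row in results]
    # peak_memory_mb is None when no GPU is available, so filter those rows out
    memories = [row["peak_memory_mb"] for row in results if row["peak_memory_mb"] is not None]
    return {
        "avg_latency": round(mean(latencies), 4),
        "avg_memory": round(mean(memories), 4) if memories else None,
    }

# Illustrative rows in the same format that evaluate_model() produces
sample_results = [
    {"question": "q1", "generated_answer": "a1", "ground_truth_answer": "a1",
     "peak_memory_mb": 1240.194, "generation_time_s": 3.912},
    {"question": "q2", "generated_answer": "a2", "ground_truth_answer": "a2",
     "peak_memory_mb": 1240.257, "generation_time_s": 3.976},
]

summary = summarize_results(sample_results)
print(summary)
```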

# Evaluation of Base Model
### Baseline Check for Latency, Accuracy, and Memory

In [12]:
# Running the evaluation on the model with the evaluation dataset
base_model_results = evaluate_model(original_model, tokenizer, eval_data)

Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

In case you want to save the evaluation results to a JSON file, you can use the following code.

In [None]:
# --- Save the Base Model Results to a JSON File ---
# This is useful for saving the results of a long evaluation so you don't have to run it again.
with open('base_model_results.json', 'w') as f:
    # Use json.dump to write the base_model_results list to the file
    # indent=4 makes the JSON file human-readable
    json.dump(base_model_results, f, indent=4)

In [2]:
# --- Load the Base Model Results from the JSON File ---
# This demonstrates how to load the results back, for example, in a different session.
with open('base_model_results.json', 'r') as f:
    # Load the JSON data from the file back into the base_model_results variable
    base_model_results = json.load(f)

In [None]:
# --- Evaluate with LLM Judge ---
# Use the 'evaluate_with_judge' function to assess the model's performance.
# This function takes the model's results and an API key for the judging service.
# It returns a DataFrame with detailed results and a dictionary of overall metrics.
# Note: Set LLM_API_KEY above to your actual API key before running this cell.
base_model_results_df, base_model_metrics = evaluate_with_judge(
    base_model_results,  # The results from your model evaluation
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Name of the judge model that scores the answers
)

Scoring results: 100%|██████████| 49/49 [00:06<00:00,  7.36it/s]


In [8]:
base_model_results_df.head()  # Display the first few rows of the DataFrame with evaluation results

| Unnamed: 0 | question | generated_answer | ground_truth_answer | peak_memory_mb | generation_time_s | similarity_score | llm_judge_response |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What color is the sky? | What color is the sky? What color is the ocean... | The sky is blue. | 1240.194 | 3.912 | 0.2 | 0.2 |
| 1 | How many legs does a cat have? | How many legs does a cat have? The answer depe... | It has four legs. | 1240.257 | 3.976 | 0.2 | 0.2 |
| 2 | What sound does a dog make? | What sound does a dog make? What sound does a ... | A dog says woof. | 1240.226 | 4.084 | 0.2 | 0.2 |
| 3 | How many days in a week? | How many days in a week? How many days in a mo... | There are seven days. | 1240.226 | 3.974 | 0.2 | 0.2 |
| 4 | What color is grass? | What color is grass? It can be green, brown, o... | Grass is green. | 1240.163 | 3.945 | 0.6 | 0.6 |


In [9]:
# --- Display Base Model Metrics ---
# Print the overall performance metrics for the base model.
# These metrics include average latency, memory usage, and similarity score.
print("Base Model Performance Metrics:")
for key, value in base_model_metrics.items():
    # Print each metric with its corresponding value, formatted to 4 decimal places
    print(f"- {key.replace('_', ' ').title()}: {value:.4f}")

Base Model Performance Metrics:
- Avg Latency: 4.4598
- Avg Memory: 1240.2071
- Avg Score: 0.3939


# Weight only Quantization

Reduces the precision of the model's weights (e.g., from 16-bit floats to 4-bit integers) while keeping activations in higher precision (e.g., 16-bit float). This is a popular trade-off for speed vs. accuracy.

| Format    | Meaning                                 |
| --------- | --------------------------------------- |
| **W4A16** | Weights in 4-bit, Activations in 16-bit |
| **W8A8**  | Weights and Activations both in 8-bit   |
| **W4A8**  | Weights in 4-bit, Activations in 8-bit  |
| **W8A16** | Weights in 8-bit, Activations in 16-bit |
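
To make the "W4" part concrete, the sketch below quantizes a handful of weights to signed 4-bit integers with a single absmax scale and then dequantizes them. This is plain uniform quantization chosen for readability; it is not the NF4 scheme that bitsandbytes applies in the cells below. The sample values are taken from the first row of the `q_proj` weights printed earlier, and the function names are illustrative.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]

row = [-0.0183, 0.0071, 0.0219, -0.0070, -0.0089, 0.0149]

codes, scale = quantize_absmax(row)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))

print("4-bit codes:", codes)   # [-6, 2, 7, -2, -3, 5]
print("Largest round-trip error:", max_error)
```

Only the integer codes and the scale need to be stored; during inference the weights are dequantized back to 16-bit before the matrix multiplication, which is why activations stay in higher precision.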

### For W4A16 Quantized Model Evaluation

In [10]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define quantization config: 4-bit weights (W4A16)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,          # optional: also quantizes the quantization constants to save extra memory
    bnb_4bit_quant_type="nf4",               # "nf4" (normal float 4-bit) or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16    # use bfloat16 for activations (W4A16)
)

In [11]:
# Load the model with 4-bit quantization (weights only)
w4a16_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"  # use "auto" for best device placement (GPU/CPU)
)

In [14]:
# Memory footprint of the quantized model
w4a16_model_memory = get_model_memory_footprint(w4a16_model)
print("Quantized Model Memory Footprint (MB):", w4a16_model_memory)

Quantized Model Memory Footprint (MB): 965.13
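
The reported footprint can be reproduced with simple arithmetic. The sketch below assumes that only the linear layers inside the decoder blocks are stored at the reduced bit width, while the embedding (assumed tied to the output head) and the norm weights stay at 16 bits, and that quantization constants are not counted. The two parameter totals are derived from the layer shapes in the architecture printout.

```python
# Derived from the printed shapes: 16 layers x (attention + MLP linear weights)
LINEAR_PARAMS = 16 * (2 * 2048 * 2048 + 2 * 2048 * 512 + 3 * 2048 * 8192)
# Embedding + per-layer RMSNorm weights + final norm (assumed to stay 16-bit)
OTHER_PARAMS = 128256 * 2048 + 16 * 2 * 2048 + 2048

def footprint_mb(linear_bits, other_bits=16):
    """Estimate the model footprint in MB for a given linear-layer bit width."""
    total_bytes = LINEAR_PARAMS * linear_bits / 8 + OTHER_PARAMS * other_bits / 8
    return round(total_bytes / (1024 ** 2), 2)

for bits in (16, 8, 4):
    print(f"{bits}-bit linear layers: {footprint_mb(bits)} MB")
```

Because the embedding matrix stays at 16 bits and is large relative to a 1B-parameter model, the 4-bit footprint is well above one quarter of the bf16 footprint.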


In [13]:
# --- Evaluate the W4A16 Quantized Model ---
# Running the evaluation on the quantized model with the evaluation dataset
w4a16_model_results = evaluate_model(w4a16_model, tokenizer, eval_data)

Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

In [15]:
# --- Evaluate with LLM Judge ---
# Use the 'evaluate_with_judge' function to assess the quantized model's performance.
w4a16_model_results_df, w4a16_model_metrics = evaluate_with_judge(
    w4a16_model_results,  # The results from your quantized model evaluation
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Name of the judge model that scores the answers
)

Scoring results: 100%|██████████| 49/49 [00:08<00:00,  5.82it/s]


In [16]:
w4a16_model_results_df.head()  # Display the first few rows of the DataFrame with evaluation results

| Unnamed: 0 | question | generated_answer | ground_truth_answer | peak_memory_mb | generation_time_s | similarity_score | llm_judge_response |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What color is the sky? | What color is the sky? What color is the sky? ... | The sky is blue. | 1053.762 | 2.781 | 0.2 | 0.2 |
| 1 | How many legs does a cat have? | How many legs does a cat have? (I know, you're... | It has four legs. | 1053.919 | 1.367 | 0.1 | 0.1 |
| 2 | What sound does a dog make? woof or howl? | What sound does a dog make? woof or howl? What... | A dog says woof. | 1054.311 | 1.416 | 0.8 | 0.8 |
| 3 | How many days in a week? | How many days in a week? 3.5? 5? 7 | There are seven days. | 1053.84 | 1.38 | 0.6 | 0.6 |
| 4 | What color is grass? | What color is grass? (2015)\n1 What color is g... | Grass is green. | 1053.684 | 1.389 | 0.0 | 0.0 |


In [18]:
# --- Display W4A16 Quantized Model Metrics ---
# Print the overall performance metrics for the W4A16 quantized model.
# These metrics include average latency, memory usage, and similarity score.
print("W4A16 Quantized Model Performance Metrics:")
for key, value in w4a16_model_metrics.items():
    # Print each metric with its corresponding value, formatted to 4 decimal places
    print(f"- {key.replace('_', ' ').title()}: {value:.4f}")

W4A16 Quantized Model Performance Metrics:
- Avg Latency: 1.4541
- Avg Inference Memory Consumption: 1053.8099
- Avg Score: 0.4224


In [19]:
# Access the weight matrix of the query projection (q_proj) 
# in the self-attention mechanism of the first decoder layer of the quantized model.
q_proj_weight = w4a16_model.model.layers[0].self_attn.q_proj.weight

# Display the weight tensor
q_proj_weight

Parameter containing:
Parameter(Params4bit([[ 41],
            [213],
            [ 65],
            ...,
            [ 92],
            [ 75],
            [135]], device='cuda:0', dtype=torch.uint8))

### For W8A8 Quantized Model Evaluation

In [8]:
# Configure for 8-bit weight + activation quantization (W8A8)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,                           # W8
    llm_int8_threshold=6.0,                      # Optional: threshold for outlier detection
    llm_int8_has_fp16_weight=False,              # Force 8-bit only mode (no fallback to fp16)
    llm_int8_enable_fp32_cpu_offload=True,       # Optional: lets modules that do not fit on the GPU be offloaded to the CPU in fp32
)

# defining the tokenizer again for clarity
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model with 8-bit weights and 8-bit activations
w8a8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"  # Will use GPU if available
)

Some parameters are on the meta device because they were offloaded to the cpu.


In [None]:
# Memory footprint of the quantized model
w8a8_model_memory = get_model_memory_footprint(w8a8_model)
print("Quantized Model Memory Footprint (MB):", w8a8_model_memory)

Quantized Model Memory Footprint (MB): 1429.13


In [10]:
# --- Evaluate the W8A8 Quantized Model ---
# Running the evaluation on the quantized model with the evaluation dataset
w8a8_model_results = evaluate_model(w8a8_model, tokenizer, eval_data)

Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

In [12]:
# --- Evaluate with LLM Judge ---
# Use the 'evaluate_with_judge' function to assess the quantized model's performance.
w8a8_model_results_df, w8a8_model_metrics = evaluate_with_judge(
    w8a8_model_results,  # The results from your quantized model evaluation
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Name of the judge model that scores the answers
)

Scoring results:   0%|          | 0/49 [00:00<?, ?it/s]

Scoring results: 100%|██████████| 49/49 [00:07<00:00,  6.32it/s]


In [13]:
# Print the evaluation results
print("W8A8 Quantized Model Performance Metrics:")
for key, value in w8a8_model_metrics.items():
    # Print each metric with its corresponding value, formatted to 4 decimal places
    print(f"- {key.replace('_', ' ').title()}: {value:.4f}")

W8A8 Quantized Model Performance Metrics:
- Avg Latency: 5.2888
- Avg Inference Memory Consumption: 1489.3576
- Avg Score: 0.4245


# AWQ (Activation-aware Weight Quantization)

**AWQ** stands for **Activation-aware Weight Quantization**.
It is a **post-training quantization (PTQ)** method that:

* Quantizes **only the weights** (usually to **int4** or **int8**)
* **Leaves activations in higher precision** (e.g., float16 or bfloat16)
* Is designed specifically for **LLMs** (e.g., LLaMA, Mistral, etc.)
* Uses **activation statistics** to find the most salient weight channels and protects them with **per-channel scaling** before quantization

#### So AWQ is a form of **weight-only quantization (WOQ)**, typically **W4A16** or **W8A16** depending on the config.

## AWQ = W4A16 or W8A16?

| Bit Precision | Weights | Activations | AWQ Supports        |
| ------------- | ------- | ----------- | ------------------- |
| W4A16         | 4-bit   | 16-bit      | Yes (default)     |
| W8A16         | 8-bit   | 16-bit      | Yes (less common) |
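
The "activation-aware" idea can be illustrated on a single weight row. In the toy sketch below, an input channel with a large activation has a small weight that plain 4-bit rounding wipes out. Scaling that weight up before quantization, and dividing the activation by the same factor, keeps the product mathematically unchanged while reducing the rounding error. The scale rule `|x| ** alpha` with a fixed `alpha = 0.5` is a simplification: AWQ derives its scales from calibration data and searches for the exponent, so treat this as an illustration rather than the actual algorithm.

```python
def quantize_row(weights, bits=4):
    """Uniform absmax quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

weights = [0.9, 0.05, -0.3, 0.02]     # one output row of a linear layer
activations = [0.1, 10.0, 0.1, 0.1]   # channel 1 carries a large activation

exact = dot(weights, activations)

# Plain weight-only quantization: the small but important weight rounds to zero
plain_error = abs(dot(quantize_row(weights), activations) - exact)

# Activation-aware scaling (toy rule): scale each input channel by |x| ** alpha
alpha = 0.5
scales = [abs(x) ** alpha for x in activations]
scaled_weights = [w * s for w, s in zip(weights, scales)]
scaled_activations = [x / s for x, s in zip(activations, scales)]
aware_error = abs(dot(quantize_row(scaled_weights), scaled_activations) - exact)

print("Error without scaling:", round(plain_error, 4))
print("Error with activation-aware scaling:", round(aware_error, 4))
```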

In [None]:
# W4A16 configuration (the same weight/activation format that AWQ typically targets).
# Note: BitsAndBytesConfig applies bitsandbytes NF4 quantization, not AWQ's activation-aware scaling.
awq_w4a16_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,          # Also quantizes the quantization constants to save extra memory
    bnb_4bit_quant_type="nf4",               # Normal Float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16    # Computation (activations) stays in bfloat16 (W4A16)
)

# 8-bit weight configuration via bitsandbytes LLM.int8()
awq_w8a16_config = BitsAndBytesConfig(
    load_in_8bit=True,                       # 8-bit weights
    llm_int8_threshold=6.0,                  # Threshold for outlier detection
    llm_int8_has_fp16_weight=False,          # Force 8-bit weights (no fp16 fallback)
    llm_int8_enable_fp32_cpu_offload=False   # Do not offload modules to the CPU in fp32
)

# KV Cache

During **autoregressive decoding** (e.g., generating text token by token), models store previously computed:

* **K (Key)**
* **V (Value)**

…vectors in memory so they don’t need to be recomputed every time. This is called the **Key-Value (KV) cache**.
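
The effect of the cache can be seen in a toy single-head decoder. The sketch below generates four steps twice: once recomputing the keys and values of every earlier token at each step, and once appending only the newest key and value to a cache. The matrices, token vectors, and the shortcut of using the token vector itself as the query are illustrative assumptions.

```python
import math

def project(x, w):
    """Multiply a vector by a weight matrix stored as a list of rows."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]

def attend(q, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

W_K = [[0.5, -0.2], [0.1, 0.3]]
W_V = [[0.4, 0.0], [-0.3, 0.7]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

def decode(tokens, use_cache):
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for step, x in enumerate(tokens):
        if use_cache:
            # Only the newest token is projected; older K/V come from the cache
            k_cache.append(project(x, W_K))
            v_cache.append(project(x, W_V))
            projections += 2
            keys, values = k_cache, v_cache
        else:
            # Without a cache every earlier token is projected again
            keys = [project(t, W_K) for t in tokens[: step + 1]]
            values = [project(t, W_V) for t in tokens[: step + 1]]
            projections += 2 * (step + 1)
        outputs.append(attend(x, keys, values))  # query = token vector (simplification)
    return outputs, projections

cached_out, cached_proj = decode(tokens, use_cache=True)
plain_out, plain_proj = decode(tokens, use_cache=False)

print("Same outputs:", cached_out == plain_out)
print("Projections with cache:", cached_proj, "without cache:", plain_proj)
```

The outputs are identical, but the cached run performs a number of projections that grows linearly with sequence length instead of quadratically.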

But:

* These KV tensors are usually stored in **fp16**/**bf16**/**fp32** by default.
* For long sequences and large batches, **KV cache becomes a huge memory bottleneck**.

**KV cache quantization** refers to quantizing the **Key** and **Value** tensors to lower precision **only for caching**, e.g.:

* `int8` (8-bit)
* `int4` (4-bit)

While:

* The rest of the model stays in higher precision (e.g., bfloat16 or float16)
* Attention scores are still computed accurately

In Hugging Face Transformers, caching is controlled through two generation arguments:
- **use_cache** (bool, *optional*, defaults to `True`)  
  Whether the model should reuse the key/value states from previous steps (if applicable to the model) to speed up decoding.

- **cache_implementation** (str, *optional*, defaults to `None`)  
  Name of the cache class that will be instantiated in `generate`, for faster decoding.  
  Possible values are:
  - [dynamic](https://huggingface.co/docs/transformers/v4.53.0/en/internal/generation_utils#transformers.DynamicCache): Dynamically grows key/value cache during generation (default).
  - [static](https://huggingface.co/docs/transformers/v4.53.0/en/internal/generation_utils#transformers.StaticCache): Preallocates fixed-size cache for faster JIT-compiled decoding.
  - [offloaded_static](https://huggingface.co/docs/transformers/v4.53.0/en/internal/generation_utils#transformers.OffloadedStaticCache): Like static, but offloads unused layers to CPU to save GPU memory.
  - [sliding_window](https://huggingface.co/docs/transformers/v4.53.0/en/internal/generation_utils#transformers.SlidingWindowCache): Keeps only the most recent tokens in memory for efficient long-gen.
  - [hybrid](https://huggingface.co/docs/transformers/v4.53.0/en/internal/generation_utils#transformers.HybridCache): Combines sliding window and static strategies for context and speed.
  - [mamba](https://huggingface.co/docs/transformers/v4.53.0/en/internal/generation_utils#transformers.MambaCache): Specialized cache for state-space models such as Mamba.
  - [quantized](https://huggingface.co/docs/transformers/v4.53.0/en/internal/generation_utils#transformers.QuantizedCache): Uses low-bit precision (int2/4/8) to reduce memory usage with minimal quality drop.

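We will evaluate the quantized KV cache on the original base model and then on the W4A16 quantized model. Because `use_cache` defaults to `True`, the earlier runs already used the default dynamic KV cache, so the comparison here is quantized cache versus default cache.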

In [19]:
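# Inspect the cache implementation configured for the loaded base model
# (None means generate() falls back to the default dynamic cache)
print(original_model.generation_config.cache_implementation)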

### For Base Model Evaluation with KV Caching

In [None]:
# Evaluating original model using KV Caching
base_model_kv_results = evaluate_model(
    original_model,
    tokenizer,
    eval_data,
    use_cache=True,  # Enable KV caching
    cache_implementation="quantized",  # Use Quantized KV caching
    pad_token_id=tokenizer.eos_token_id  # Ensure padding token is set (To remove warnings about pad_token_id not being set )
)

Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

In [20]:
# --- Evaluate with LLM Judge ---
# Use the 'evaluate_with_judge' function to assess the model's performance with KV caching.
base_model_kv_results_df, base_model_kv_metrics = evaluate_with_judge(
    base_model_kv_results,  # The results from your model evaluation with KV cache
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Name of the judge model that scores the answers
)

# Printing the evaluation results
print("Base Model with KV Caching Performance Metrics:")
for key, value in base_model_kv_metrics.items():
    # Print each metric with its corresponding value, formatted to 4 decimal places
    print(f"- {key.replace('_', ' ').title()}: {value:.4f}")

Scoring results: 100%|██████████| 49/49 [00:09<00:00,  5.38it/s]

Base Model with KV Caching Performance Metrics:
- Avg Latency: 6.2055
- Avg Inference Memory Consumption: 1241.7709
- Avg Score: 0.4510





### For W4A16 Quantized Model with KV Caching

In [12]:
# Evaluating W4A16 Quantized Model using KV Caching
w4a16_model_kv_results = evaluate_model(
    w4a16_model,
    tokenizer,
    eval_data,
    use_cache=True,  # Enable KV caching
    cache_implementation="quantized",  # Use Quantized KV caching
    pad_token_id=tokenizer.eos_token_id  # Ensure padding token is set
)

# --- Evaluate with LLM Judge ---
w4a16_model_kv_results_df, w4a16_model_kv_metrics = evaluate_with_judge(
    w4a16_model_kv_results,  # The results from your model evaluation with KV cache
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Name of the judge model that scores the answers
)

# Printing the evaluation results
print("W4A16 Model with KV Caching Performance Metrics:")
for key, value in w4a16_model_kv_metrics.items():
    # Print each metric with its corresponding value, formatted to 4 decimal places
    print(f"- {key.replace('_', ' ').title()}: {value:.4f}")

Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

Scoring results: 100%|██████████| 49/49 [00:08<00:00,  5.84it/s]

W4A16 Model with KV Caching Performance Metrics:
- Avg Latency: 1.1640
- Avg Inference Memory Consumption: 1053.6658
- Avg Score: 0.4224





# Model Performance Comparison (Weight Only Quantization + KV Caching)

The measurements below illustrate why providers such as Together.ai or Nebius AI invest in optimization techniques like weight-only quantization and KV cache optimization: they reduce the cost of hosting an LLM.

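This table summarizes the key performance metrics for the `meta-llama/Llama-3.2-1B` model under different optimization techniques. Rows labeled "KV Cache" use the quantized KV cache (`cache_implementation="quantized"`); the other rows use the default cache.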

| Model Configuration | Model Memory (MB) | Avg. Latency (s) | Avg. Peak Memory (MB) | Avg. LLM Judge Score |
| :------------------ | :---------------- | :--------------- | :-------------------- | :------------------- |
| **Base Model (bf16)** | 2357.13 | 4.46 | 1240.21 | 0.39 |
| **W4A16 Quantized** | **965.13** | 1.45 | 1053.81 | 0.42 |
| **W8A8 Quantized** | 1429.13 | 5.29 | 1489.36 | 0.42 |
| **Base + KV Cache** | 2357.13 | 6.21 | 1241.77 | **0.45** |
| **W4A16 + KV Cache** | **965.13** | **1.16** | **1053.67** | 0.42 |

Overall, the results show that the W4A16 quantized model with KV caching provides a good balance of speed, memory efficiency, and response quality.
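
The relative gains are easier to judge as ratios against the base model. The sketch below recomputes them from the rounded values in the table above; the dictionary layout and the helper name are illustrative.

```python
# Values copied from the comparison table (model MB, latency s, peak MB)
results = {
    "Base Model (bf16)": {"model_mb": 2357.13, "latency_s": 4.46, "peak_mb": 1240.21},
    "W4A16 Quantized":   {"model_mb": 965.13,  "latency_s": 1.45, "peak_mb": 1053.81},
    "W8A8 Quantized":    {"model_mb": 1429.13, "latency_s": 5.29, "peak_mb": 1489.36},
    "Base + KV Cache":   {"model_mb": 2357.13, "latency_s": 6.21, "peak_mb": 1241.77},
    "W4A16 + KV Cache":  {"model_mb": 965.13,  "latency_s": 1.16, "peak_mb": 1053.67},
}

def relative_to_base(name, baseline="Base Model (bf16)"):
    """Return (speedup factor, model memory saving in percent) versus the baseline."""
    base, row = results[baseline], results[name]
    speedup = base["latency_s"] / row["latency_s"]
    memory_saving = 100 * (1 - row["model_mb"] / base["model_mb"])
    return speedup, memory_saving

for name in results:
    speedup, saving = relative_to_base(name)
    print(f"{name}: {speedup:.2f}x speed, {saving:.1f}% smaller model")
```

A speedup below 1.0 means the configuration was slower than the base model in this test.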

# Other Observations:

*   **Best Latency:** The combination of **W4A16 Quantization and KV Caching** provides the lowest latency (`1.16s`), making it the fastest for inference.
*   **Lowest Memory Usage:** **W4A16 quantization** cuts the static model memory footprint by about 59% (`2357.13 MB` to `965.13 MB`) and lowers the peak memory used during inference by about 15% (`1240.21 MB` to roughly `1054 MB`).
*   **Highest Accuracy:** The **Base Model with KV Caching** achieved the highest LLM Judge score (`0.45`) but was also the slowest configuration (`6.21s`). Because answers are sampled (`do_sample=True`) and the dataset has only 49 questions, score differences of a few hundredths may be within run-to-run noise.
*   **Trade-offs:** In this test, W4A16 improved latency and memory without lowering the evaluation score (`0.42` versus `0.39` for the base model), so no quality penalty is visible at this scale. The W8A8 configuration was slower and used more peak memory than the base model, which might be due to implementation overhead. The base and W8A8 runs also logged that some parameters were offloaded to the CPU, which likely inflates their latency.


# SDPA Attention

In [None]:
# To see which attention implementations are available in your transformers installation,
# you can inspect the `ALL_ATTENTION_FUNCTIONS` dictionary.
# This is useful for knowing what strings you can pass to the `attn_implementation` argument.
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

list(ALL_ATTENTION_FUNCTIONS.keys())


['flash_attention_3',
 'flash_attention_2',
 'flex_attention',
 'paged_attention',
 'sdpa',
 'sdpa_paged',
 'eager_paged']
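
The `sdpa` entry refers to scaled dot-product attention, the operation `softmax(Q K^T / sqrt(d)) V`. The pure-Python sketch below shows the computation with a causal mask so that each position attends only to itself and earlier positions. It is written for readability and is not how the fused PyTorch kernels are implemented; the function name and inputs are illustrative.

```python
import math

def scaled_dot_product_attention(queries, keys, values, causal=True):
    """Compute softmax(Q K^T / sqrt(d)) V row by row, optionally with a causal mask."""
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # With a causal mask, position i only sees positions 0..i
        limit = i + 1 if causal else len(keys)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys[:limit]]
        peak = max(scores)                       # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values[:limit]))
            for j in range(len(values[0]))
        ])
    return outputs

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

out = scaled_dot_product_attention(q, k, v)
print(out[0])  # [1.0, 2.0]: the first position can only attend to itself
print(out[1])  # a weighted mix of both value rows, closer to the second
```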

In [8]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define quantization config: 4-bit weights (W4A16)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,          # optional: improves accuracy
    bnb_4bit_quant_type="nf4",               # "nf4" (normal float 4-bit) or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16    # use bfloat16 for activations (W4A16)
)

# SDPA (scaled dot-product attention) selects PyTorch's fused attention kernel, which is typically more memory-efficient than the eager implementation.
w4a16_model_sdpa = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # use "auto" for best device placement (GPU/CPU)
    attn_implementation="sdpa",  # use PyTorch scaled dot-product attention
)

In [None]:
# sdpa_paged combines scaled dot-product attention with a paged (block-based) KV cache.
w4a16_model_sdpa_paged = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # use "auto" for best device placement (GPU/CPU)
    attn_implementation="sdpa_paged",  # use SDPA with a paged KV cache
)


# Evaluate the W4A16 SDPA Model and W4A16 SDPA Paged Model
w4a16_model_sdpa_results = evaluate_model(w4a16_model_sdpa, tokenizer, eval_data)
w4a16_model_sdpa_paged_results = evaluate_model(w4a16_model_sdpa_paged, tokenizer, eval_data)

# --- Evaluate both results with LLM Judge ---
w4a16_model_sdpa_results_df, w4a16_model_sdpa_metrics = evaluate_with_judge(
    w4a16_model_sdpa_results,  # The results from your SDPA model evaluation
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Judge model used for scoring
)
w4a16_model_sdpa_paged_results_df, w4a16_model_sdpa_paged_metrics = evaluate_with_judge(
    w4a16_model_sdpa_paged_results,  # The results from your SDPA Paged model evaluation
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Judge model used for scoring
)

In [None]:
get_model_memory_footprint(w4a16_model_sdpa)

965.13

In [None]:
# Running the evaluation on the model with the evaluation dataset
w4a16_model_sdpa_results = evaluate_model(w4a16_model_sdpa, tokenizer, eval_data)

In [None]:
# --- Evaluate with LLM Judge ---
# Use the 'evaluate_with_judge' function to assess the quantized SDPA model's performance.
w4a16_model_sdpa_results_df, w4a16_model_sdpa_metrics = evaluate_with_judge(
    w4a16_model_sdpa_results,  # The results from the W4A16 SDPA model evaluation
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Judge model used for scoring
)

Scoring results: 100%|██████████| 49/49 [00:09<00:00,  5.18it/s]


In [None]:
w4a16_model_sdpa_metrics

{'avg_latency': np.float64(1.3038163265306122),
 'avg_inference_memory_consumption': np.float64(1053.8089591836736),
 'avg_score': np.float64(0.3918367346938776)}

# SDPA Performance Comparison

This table summarizes the performance of the W4A16 quantized model with the two SDPA (scaled dot-product attention) implementations, `sdpa` and `sdpa_paged`.

| Model Configuration      | Avg. Latency (s) | Avg. Peak Memory (MB) | Avg. LLM Judge Score |
| :----------------------- | :--------------- | :-------------------- | :------------------- |
| **W4A16 + SDPA**         | 1.103            | 1003.81               | 0.421                |
| **W4A16 + SDPA Paged**   | 1.303            | 1041.80               | 0.391                |

**Observations:**

*   **Latency:** The standard `sdpa` implementation is slightly faster than `sdpa_paged`.
*   **Memory:** The standard `sdpa` implementation also consumes less peak memory during inference.
*   **Accuracy:** The `sdpa` implementation achieves a higher LLM Judge score, indicating better response quality in this evaluation.

Overall, for this specific model and task, the standard `sdpa` attention implementation provides a better balance of performance and accuracy compared to the `sdpa_paged` version.
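As background for the comparison above, the operation that every one of these backends computes is scaled dot-product attention, `softmax(Q·Kᵀ / sqrt(d)) · V`. The pure-Python sketch below is only an illustration of that formula on tiny lists. It is not how PyTorch or transformers implement it, and it omits the causal mask, multiple heads, and batching. All function names here are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d)) V for small lists of vectors."""
    d = len(keys[0])  # key/query dimension used for scaling
    outputs = []
    for q in queries:
        # One similarity score per key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two identical keys receive equal weight, so the output is the mean of the values.
attn_out = toy_scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
print(attn_out)
```

Fused backends compute the same result without materializing the full score matrix, which is where their memory savings come from.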


# SDPA + KV Cache + Weight Only Quantization (W4A16)

In [6]:
import transformers.cache_utils as cache_utils

# List all cache classes available in the module
cache_classes = [cls for cls in dir(cache_utils) if "Cache" in cls and not cls.startswith("_")]
print("Available cache strategies:")
for cls in cache_classes:
    print(f"- {cls}")

Available cache strategies:
- Cache
- CacheConfig
- DynamicCache
- EncoderDecoderCache
- HQQQuantizedCache
- HybridCache
- HybridChunkedCache
- MambaCache
- OffloadedCache
- OffloadedHybridCache
- OffloadedStaticCache
- QuantizedCache
- QuantizedCacheConfig
- QuantoQuantizedCache
- SinkCache
- SlidingWindowCache
- StaticCache
- StaticCacheConfig


In [16]:
# Define a list of KV cache strategies to evaluate
# These are different implementations for caching key-value pairs during text generation
kv_cache_strategies = [
    "dynamic", # DynamicCache: dynamically manages cache size based on input length
    "static", # StaticCache: uses a fixed-size cache for all inputs
    # "sliding_window", # SlidingWindowCache: maintains a sliding window of the most recent tokens
    "quantized", # QuantizedCache: uses quantization techniques to reduce memory usage
    # The following might require specific model architectures or additional setup
    # "offloaded_static",
    # "hybrid",
    # "mamba",
]

In [17]:
# Perform kv cache evaluation on w4a16_model_sdpa (W4A16 SDPA model)
for cache_strategy in kv_cache_strategies:
    print(f"Evaluating with cache strategy: {cache_strategy}")
    
    # Evaluate the model with the specified cache strategy
    w4a16_model_sdpa_kv_results = evaluate_model(
        w4a16_model_sdpa,
        tokenizer,
        eval_data,
        use_cache=True,  # Enable KV caching
        cache_implementation=cache_strategy,  # Use specified cache strategy
        pad_token_id=tokenizer.eos_token_id,  # Ensure padding token is set
    )
    
    # --- Evaluate with LLM Judge ---
    w4a16_model_sdpa_kv_results_df, w4a16_model_sdpa_kv_metrics = evaluate_with_judge(
        w4a16_model_sdpa_kv_results,  # The results from your model evaluation with KV cache
        api_key=LLM_API_KEY,  # API key for the judging service
        model_name="meta-llama/Llama-3.3-70B-Instruct"  # Judge model used for scoring
    )
    
    # Print the evaluation results for the current cache strategy
    print(f"{cache_strategy} Performance Metrics:")
    for key, value in w4a16_model_sdpa_kv_metrics.items():
        print(f"- {key.replace('_', ' ').title()}: {value:.4f}")

Evaluating with cache strategy: dynamic


Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

Scoring results: 100%|██████████| 49/49 [00:07<00:00,  6.89it/s]

dynamic Performance Metrics:
- Avg Latency: 0.8046
- Avg Inference Memory Consumption: 1056.3977
- Avg Score: 0.4327
Evaluating with cache strategy: static





Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

Scoring results: 100%|██████████| 49/49 [00:09<00:00,  5.19it/s]

static Performance Metrics:
- Avg Latency: 4.0182
- Avg Inference Memory Consumption: 1056.1599
- Avg Score: 0.4224
Evaluating with cache strategy: quantized





Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

Scoring results: 100%|██████████| 49/49 [00:07<00:00,  6.65it/s]

quantized Performance Metrics:
- Avg Latency: 5.4194
- Avg Inference Memory Consumption: 1056.2500
- Avg Score: 0.4143
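The `quantized` strategy stores cached keys and values at lower precision to save memory, at the cost of extra quantize/dequantize work on every step. The sketch below shows the general idea with a symmetric int8 round trip on a plain list of floats. It is an assumption-laden illustration, not the scheme that `QuantizedCache` uses internally, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

# Stand-in for one cached key vector.
key_vector = [0.5, -1.27, 0.02, 1.0]
q_key, key_scale = quantize_int8(key_vector)
restored = dequantize(q_key, key_scale)

# The round-trip error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(q_key, key_scale, max_error)
```

Each stored value shrinks from 16 bits to 8, but every read pays for the dequantization. That overhead is one possible reason a quantized cache can be slower on short sequences, where the cache is small to begin with.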





# Multi Round Conversation Optimization (Remembering past interactions)

Let's look at this example:

### Example of a Multi-Turn Conversation

> **User:** What's the capital of Japan?
>
> **Assistant:** Tokyo is the capital of Japan.
>
> **User:** And what about China?

The model generates replies one at a time using **auto-regressive decoding**.

#### First Reply:
- The cache is empty.
- The model reads: `"User: What's the capital of Japan?"`
- It generates: `"Tokyo is the capital of Japan."`
- While doing this, it stores useful data (keys and values) in memory.

#### Second Reply:
- The model sees this full history: `"User: What's the capital of Japan? \n Assistant: Tokyo is the capital of Japan. \n User: And what about China?"`
- But thanks to the cache, it doesn't have to reprocess the whole thing.
- It reuses the stored data and only processes: `"User: And what about China?"`
- Then it generates: `"Beijing is the capital of China."`

Two important things should be noted here:
- **Context matters:** When the user asks "And what about China?", the model understands it's about capitals because of the earlier question.
- **Cache saves work:** The key-value cache lets the model reuse earlier parts of the chat without re-reading everything, making it faster and more efficient.
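The bookkeeping described above can be sketched with a toy cache that stores one entry per processed token and counts how many tokens each step actually has to process. This is a simplified model, assuming whitespace-split words stand in for tokens and strings stand in for key/value tensors. `ToyKVCache` and `process_turn` are hypothetical names, not part of transformers.

```python
class ToyKVCache:
    """Holds one (key, value) entry per token that has already been processed."""

    def __init__(self):
        self.entries = []

    def __len__(self):
        return len(self.entries)

def process_turn(cache, conversation_tokens):
    """Process only the tokens not yet cached and return how many that was."""
    new_tokens = conversation_tokens[len(cache):]
    for tok in new_tokens:
        # Stand-in for computing the real key/value tensors of this token.
        cache.entries.append((f"k({tok})", f"v({tok})"))
    return len(new_tokens)

cache = ToyKVCache()
question = "User: What's the capital of Japan?".split()
reply = "Assistant: Tokyo is the capital of Japan.".split()
follow_up = "User: And what about China?".split()

processed_question = process_turn(cache, question)                       # prompt of turn 1
processed_reply = process_turn(cache, question + reply)                  # tokens generated in turn 1
processed_follow_up = process_turn(cache, question + reply + follow_up)  # only the new question

print(processed_question, processed_reply, processed_follow_up, len(cache))
```

The reuse only works because the earlier part of the conversation is an exact prefix of the new input; if the history were edited or re-tokenized differently, the cached entries would no longer line up.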

In [None]:
# First question
question = "What is the capital of Japan?"
model_inputs = tokenizer(question, return_tensors='pt').to("cuda")

# Generate the first answer
output = w4a16_model_sdpa.generate(
    **model_inputs,
    max_new_tokens=30,
    return_dict_in_generate=True,
    use_cache=True,  # Enable KV caching
    cache_implementation="dynamic",  # Use the best cache strategy for our case
)

# Detokenizing the answer
answer1 = tokenizer.batch_decode(output.sequences)[0]
print(answer1)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>What is the capital of Japan? 2019
What is the capital of Japan?
A. Osaka
B. Tokyo
C. Fukuoka
D. Nagoya



The decoded answer also needs to be carried into the next turn, so the follow-up prompt is built from the first answer plus the new question:

In [8]:
# Follow-up question, using past_key_values for efficiency
follow_up = "What about China?"
model_inputs = tokenizer(answer1 + "\n" + follow_up, return_tensors='pt').to("cuda")

The key parameter to keep in mind is `past_key_values`, which prevents the model from recalculating the KV cache for the parts of the conversation that have already been processed.

In [10]:
output = w4a16_model_sdpa.generate(
    **model_inputs,
    past_key_values=output.past_key_values,  # Lets the model avoid recomputing the KV cache
    max_new_tokens=30,
    return_dict_in_generate=True,
    use_cache=True,  # Enable KV caching
    # cache_implementation="dynamic",  # Use the best cache strategy for our case
)
answer2 = tokenizer.batch_decode(output.sequences)[0][len(answer1 + "\n" + follow_up):]
print(answer2)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


What about China? 2019
What about China?
A. Beijing
B. Guangzhou
C. Shanghai
D. Hong Kong
E. Shenzhen


Passing the previous turn's `past_key_values` to `generate` lets the model reuse the cached keys and values for the earlier turns instead of recomputing them, while the full conversation text in the prompt preserves the context needed to answer the follow-up question.

# Prompt Lookup Decoding

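In many tasks, such as summarization, text simplification, or question answering, the output of your hosted LLM repeats tokens that are already present in its context. Smaller LLMs often handle these tasks as one component of a larger architecture, to reduce the overall workload. Prompt lookup decoding exploits this overlap: instead of generating every token from scratch, it copies candidate tokens from the prompt and lets the model verify them.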
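The candidate search behind prompt lookup decoding can be sketched in a few lines: take the last few tokens generated so far, look for an earlier occurrence of the same n-gram in the context, and propose the tokens that followed it. The version below is a minimal illustration that uses words in place of token IDs. `find_candidate_tokens` and its parameters are hypothetical names, and the real implementation in transformers differs in detail.

```python
def find_candidate_tokens(token_ids, ngram_size=3, num_candidates=5):
    """Propose tokens that followed the latest earlier match of the trailing n-gram."""
    if len(token_ids) < ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan earlier positions from most recent to oldest, skipping the tail itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return token_ids[follow:follow + num_candidates]
    return []  # no match: fall back to normal one-token-at-a-time decoding

# Context so far: a source sentence followed by the start of a repetition of it.
context = "the capital of japan is tokyo . the capital of".split()
candidates = find_candidate_tokens(context, ngram_size=3, num_candidates=3)
print(candidates)
```

The model then checks the proposed tokens in a single forward pass and keeps the longest prefix it agrees with, so a wrong guess costs little while a correct one yields several tokens at once.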

In [None]:
# First question
# Define the prompt for the model
question = "Name the capital city of Japan?"
# Tokenize the input question and move it to the GPU for processing
model_inputs = tokenizer(question, return_tensors='pt').to("cuda")


# Generate a response using the model with prompt lookup decoding
output = w4a16_model_sdpa.generate(
    **model_inputs,
    max_new_tokens=5, # Limit the generation to 5 new tokens
    return_dict_in_generate=True, # Return a detailed output object
    use_cache=True,  # Enable KV caching to speed up decoding
    cache_implementation="dynamic",  # Use the dynamic cache implementation for efficiency
    
    # Number of candidate tokens to propose at each step.
    # Prompt lookup decoding matches the most recent tokens against the prompt and
    # proposes the tokens that followed the match, which the model then verifies.
    prompt_lookup_num_tokens=3,

)
# Decode the generated tokens back into a human-readable string
answer = tokenizer.batch_decode(output.sequences)[0]
# Print the final answer
print(answer)

<|begin_of_text|>Name the capital city of Japan? Answer: Tokyo


In [9]:
# Performing evaluation with prompt lookup decoding with kv cache
prompt_lookup_results = evaluate_model(
    w4a16_model_sdpa,  # The model to evaluate
    tokenizer,         # The tokenizer for encoding input text
    eval_data,         # The evaluation dataset
    use_cache=True,    # Enable KV caching
    cache_implementation="dynamic",  # Use the dynamic cache implementation
    prompt_lookup_num_tokens=10,  # Propose up to 10 candidate tokens per step
    pad_token_id=tokenizer.eos_token_id  # Ensure padding token is set
)

Evaluating model:   0%|          | 0/49 [00:00<?, ?it/s]

In [10]:
# Evaluate with LLM Judge
prompt_lookup_results_df, prompt_lookup_metrics = evaluate_with_judge(
    prompt_lookup_results,  # The results from your model evaluation with prompt lookup decoding
    api_key=LLM_API_KEY,  # API key for the judging service
    model_name="meta-llama/Llama-3.3-70B-Instruct"  # Judge model used for scoring
)

# Print the evaluation results for prompt lookup decoding
print("Prompt Lookup Decoding Performance Metrics:")
for key, value in prompt_lookup_metrics.items():
    # Print each metric with its corresponding value, formatted to 4 decimal places
    print(f"- {key.replace('_', ' ').title()}: {value:.4f}")

Scoring results: 100%|██████████| 49/49 [00:08<00:00,  5.97it/s]

Prompt Lookup Decoding Performance Metrics:
- Avg Latency: 1.2086
- Avg Inference Memory Consumption: 1056.7745
- Avg Score: 0.3837



