# Optimize inference
**Inference** is the process of using a trained model to generate predictions or responses. For LLMs, this means taking your input text and generating a response.
Think of it like this:

- Training: Teaching the model (like teaching a student)
- Inference: Using the model to answer questions (like the student taking a test)

**Why Optimize Inference?**

- Speed: Faster responses improve user experience
- Cost: Less computation time = lower server costs
- Scalability: Serve more users with the same hardware
- Energy: Reduce power consumption

**Key Metrics to Understand**

- Latency: Time to first token (TTFT), i.e. how long until the model starts responding
- Throughput: Tokens generated per second, i.e. how fast the model keeps responding once it has started
- Memory Usage: RAM/VRAM consumed by the model
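
As a rough sketch of how the first two metrics relate, the helper below derives time to first token and decode throughput from per-token timestamps. The name `summarize_generation` and the timestamps are illustrative assumptions, not part of any library; in practice the timestamps would come from a streaming generation loop.

```python
def summarize_generation(request_start, token_times):
    """Compute latency and throughput from per-token arrival times (seconds).

    request_start: time the request was sent
    token_times:   arrival time of each generated token, in order
    """
    if not token_times:
        raise ValueError("need at least one generated token")

    # Latency as defined above: time until the first token arrives
    time_to_first_token = token_times[0] - request_start

    # Throughput: tokens per second once generation has started
    decode_time = token_times[-1] - token_times[0]
    if len(token_times) > 1 and decode_time > 0:
        tokens_per_second = (len(token_times) - 1) / decode_time
    else:
        tokens_per_second = float("nan")  # undefined for a single token

    return {
        "time_to_first_token_s": time_to_first_token,
        "tokens_per_second": tokens_per_second,
        "total_time_s": token_times[-1] - request_start,
    }


# Hypothetical timestamps: first token after 0.5 s, then one token every 0.1 s
stats = summarize_generation(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(stats)
```

Keeping the two numbers separate matters because an optimization can improve one without helping the other.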

#### Setup
1. Install required packages
2. Log in to Hugging Face
3. Import necessary packages

In [None]:
!pip install -q -U unsloth llama-cpp-python transformers torch vllm huggingface_hub flash_attn bitsandbytes

In [None]:
from huggingface_hub import login
login()

In [None]:
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

## FlashAttention
**What is Attention?**

Attention is the mechanism that helps LLMs focus on relevant parts of the input when generating each word. Imagine reading a book and highlighting important sentences - that's similar to how attention works.

**The Problem with Standard Attention**

Standard attention has a problem: it uses O(n²) memory, meaning if you double the input length, you need 4x more memory!

```
# Simplified standard attention (for illustration only)
import math
import torch

def standard_attention(Q, K, V):
    # Q, K, V have shape (sequence_length, head_dim)
    # This creates a huge matrix that grows quadratically
    attention_scores = (Q @ K.T) / math.sqrt(Q.shape[-1])  # Memory usage: sequence_length²
    attention_weights = torch.softmax(attention_scores, dim=-1)
    output = attention_weights @ V
    return output
```
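
To make the quadratic growth concrete, the sketch below estimates the size of the attention score matrix alone. It assumes a single attention head, a batch of one, and 2 bytes per score (float16); `attention_matrix_bytes` is a made-up helper name.

```python
def attention_matrix_bytes(sequence_length, bytes_per_score=2):
    """Memory for one (sequence_length x sequence_length) score matrix."""
    return sequence_length * sequence_length * bytes_per_score


for n in (1024, 2048, 4096):
    mib = attention_matrix_bytes(n) / 1024**2
    print(f"sequence_length={n}: {mib:.0f} MiB per head")
```

Under these assumptions, going from 1024 to 2048 tokens takes the matrix from 2 MiB to 8 MiB per head, and a real model multiplies that by its number of heads and the batch size.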

**FlashAttention Solution**

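FlashAttention computes exactly the same result as standard attention, but reorganizes the computation so the full attention matrix is never stored in GPU memory. This makes it both more memory-efficient and faster.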

**How It Works (Simplified)**

Instead of computing all attention at once (which needs lots of memory), FlashAttention:
- Breaks the computation into smaller chunks
- Processes chunks one at a time
- Cleverly combines results without losing accuracy
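
The chunking idea can be sketched in plain Python for a single query vector. This is only an illustration of the "online softmax" bookkeeping, not the real FlashAttention kernel (which runs as fused GPU code); the function names are hypothetical and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def attention_full(q, keys, values):
    """Reference: softmax over all scores at once, for a single query vector."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_chunked(q, keys, values, chunk_size=2):
    """Same result, but only one chunk of scores exists at any time."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalized weighted sum of values

    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_chunk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]

        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]


# Made-up toy data: 5 keys/values of dimension 2
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = attention_full(q, keys, values)
chunked = attention_chunked(q, keys, values)
print(full)
print(chunked)
```

Both functions print the same vector up to floating-point rounding, which is the sense in which chunking does not lose accuracy.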

In [None]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
input_text = "Hello, how are you today?"

def run_inference(use_flashattention=False):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if use_flashattention:
        print("\n⚡ Using FlashAttention:")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            attn_implementation="flash_attention_2"
        ).to("cuda")
    else:
        print("\n🧠 Using Traditional Attention:")
        model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    inputs = tokenizer.encode(input_text, return_tensors='pt').to("cuda")

    start_time = time.time()
    outputs = model.generate(inputs, max_length=50)
    torch.cuda.synchronize()  # Wait for GPU ops to finish
    elapsed_time = time.time() - start_time
    memory_used = torch.cuda.max_memory_allocated() / 1024**2  # In MB

    print(f"⏱ Time taken: {elapsed_time:.4f} seconds")
    print(f"💾 Peak memory: {memory_used:.2f} MB")
    print(f"📝 Output: {tokenizer.decode(outputs[0])[:80]}...\n")

# Run both tests
run_inference(use_flashattention=False)
run_inference(use_flashattention=True)



🧠 Using Traditional Attention:


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


⏱ Time taken: 5.0404 seconds
💾 Peak memory: 1902.58 MB
📝 Output: Hello, how are you today? Is there anything I can assist with?
As an AI language...


⚡ Using FlashAttention:


You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.


RuntimeError: FlashAttention only supports Ampere GPUs or newer.

❌ FlashAttention is not supported on T4 (Turing architecture, compute capability 7.5). It only works on Ampere (A100, etc.) and newer.
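
One way to avoid this error is to check the GPU's compute capability before choosing an attention implementation. The helper below is a minimal sketch: `pick_attn_implementation` is a hypothetical name, and falling back to `"sdpa"` (PyTorch's built-in scaled dot-product attention) is an assumption about what you want on older GPUs.

```python
def pick_attn_implementation(compute_capability):
    """Return an `attn_implementation` value for a (major, minor) capability."""
    major, _minor = compute_capability
    # FlashAttention 2 needs Ampere (compute capability 8.0) or newer
    return "flash_attention_2" if major >= 8 else "sdpa"


print(pick_attn_implementation((7, 5)))  # T4 (Turing)
print(pick_attn_implementation((8, 0)))  # A100 (Ampere)
```

On a real machine you would pass the result of `torch.cuda.get_device_capability()` and hand the returned string to `from_pretrained(..., attn_implementation=...)`.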

## torch.compile
**What is torch.compile?**

[**torch.compile**](https://docs.pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is PyTorch's way of making your model run faster by optimizing the computational graph. Think of it as a smart compiler that looks at your code and finds ways to make it more efficient.

**How It Works**

- Graph Capture: Records what operations your model performs
- Optimization: Finds ways to combine, reorder, or simplify operations
- Code Generation: Creates optimized machine code



In [None]:
# Load your model
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Before optimization - normal inference
def generate_text_normal(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    return tokenizer.decode(outputs[0])

# After optimization - compiled model
compiled_model = torch.compile(model)

def generate_text_optimized(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = compiled_model.generate(**inputs, max_length=100)
    return tokenizer.decode(outputs[0])

# Performance comparison
import time

prompt = "The future of artificial intelligence is"

# Time normal inference
start = time.time()
result1 = generate_text_normal(prompt)
normal_time = time.time() - start

# Time compiled inference
start = time.time()
result2 = generate_text_optimized(prompt)
compiled_time = time.time() - start

print(f"Normal inference: {normal_time:.2f}s")
print(f"Compiled inference: {compiled_time:.2f}s")
print(f"Speedup: {normal_time/compiled_time:.2f}x")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Normal inference: 6.11s
Compiled inference: 5.20s
Speedup: 1.18x


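Sometimes the compiled model is slower than the original. The most common reasons are that the first call pays a one-time compilation cost, or that changing input shapes trigger recompilation.

torch.compile tends to pay off when:
- The model sees consistent input shapes and patterns
- You run on newer GPUs (A100, H100)
- The model is called many times, so the slower first run (compilation overhead) is amortized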
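
Because the first call includes compilation, a fairer comparison times the model only after a warm-up call. The helper below is a minimal sketch; `benchmark` is a hypothetical name and the workload is a stand-in so the snippet runs without a GPU.

```python
import time


def benchmark(fn, warmup=1, iters=3):
    """Return the average seconds per call of fn(), excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # for a compiled model, this is where compilation happens
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


# Stand-in workload so the sketch runs anywhere
avg = benchmark(lambda: sum(range(100_000)))
print(f"Average time per call: {avg * 1000:.2f} ms")
```

With the functions defined earlier, this would be called as `benchmark(lambda: generate_text_optimized(prompt))`. On a GPU, remember to call `torch.cuda.synchronize()` inside the timed function so pending work is included.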

## GGUF Format
**What is GGUF?**

GGUF (GPT-Generated Unified Format) is a file format, used by llama.cpp, that stores models (usually quantized) so that they are:

- Smaller in size
- Faster to load
- Runnable on CPU efficiently

Think of it like compressing a video file - same content, smaller size, optimized for playback.

**Why GGUF?**

Traditional Model File:

- Size: 13GB
- RAM needed: 16GB+
- Loading time: 2-3 minutes

GGUF Model File:

- Size: 4GB
- RAM needed: 6GB
- Loading time: 30 seconds
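
The file sizes above follow from simple arithmetic on the number of parameters and the bits stored per weight. The sketch below assumes a model of roughly 7 billion parameters, which the figures above are consistent with but do not state; `weight_memory_gib` is a made-up helper name.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


num_params = 7e9  # assumption: a ~7B-parameter model

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(num_params, bits):.1f} GiB")
```

Real quantized files come out somewhat larger than this estimate because they also store scaling factors, and running the model needs extra memory for activations and the KV cache.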


**Working with GGUF**

First, find a GGUF checkpoint of the model you want to use on the Hub: browse the [GGUF model list](https://huggingface.co/models?library=gguf&sort=trending) and filter by name.

In [None]:
from huggingface_hub import hf_hub_download

model_name = "unsloth/Llama-3.2-1B-Instruct-GGUF"

model_path = hf_hub_download(
    repo_id=model_name,
    filename="Llama-3.2-1B-Instruct-Q2_K_L.gguf"
)

from llama_cpp import Llama
llm = Llama(model_path=model_path)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
)
output

llama_model_loader: loaded meta data with 36 key-value pairs and 147 tensors from /root/.cache/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct-GGUF/snapshots/b69aef112e9f895e6f98d7ae0949f72ff09aa401/Llama-3.2-1B-Instruct-Q2_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama-3.2-1B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2-1B-Instruct
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_mode

{'id': 'cmpl-fb771262-e089-4149-ac35-dbdeebd962c0',
 'object': 'text_completion',
 'created': 1747992647,
 'model': '/root/.cache/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct-GGUF/snapshots/b69aef112e9f895e6f98d7ae0949f72ff09aa401/Llama-3.2-1B-Instruct-Q2_K_L.gguf',
 'choices': [{'text': 'Q: Name the planets in the solar system? A: 8 planets in the solar system are Mars, Mercury, Venus, Earth, Neptune, Uranus, Earth, and Pluto. However, the seven planets in the',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 14, 'completion_tokens': 32, 'total_tokens': 46}}

**GGUF Benefits**

- Smaller Memory Footprint: Quantized weights take a fraction of the original size
- Faster Loading: Weights can be memory-mapped, so the whole file does not have to be read upfront
- CPU Friendly: Optimized for CPU inference
- Quantization Ready: Built-in support for compressed models

## Quantization
**What is Quantization?**

Quantization is like compressing a high-resolution photo: you reduce the file size while trying to keep the important details. For neural networks, we reduce the precision of numbers to use less memory and compute.

**Number Precision Analogy**

Think about describing temperature:

- FP32 (32-bit float): "It's 72.847362°F outside"
- INT8 (8-bit integer): "It's about 73°F outside"

Both convey the same essential information, but INT8 uses 4x less memory!
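
The same idea applied to weights can be sketched with symmetric (absmax) quantization, one of the simplest schemes. This is an illustration under that assumption, with made-up values and hypothetical function names; libraries such as bitsandbytes use more elaborate block-wise methods.

```python
def quantize_int8(values):
    """Symmetric (absmax) quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.98, 0.45, 0.003, -0.27]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

print("quantized:", quantized)
print("restored: ", [round(r, 3) for r in restored])
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the "about 73°F" effect from the analogy.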



**Types of Quantization**

1. Post-training Quantization: Convert model after training
2. Quantization-aware Training: Train with quantization in mind


Let's see how to quantize a model in Transformers using bitsandbytes.

In [None]:
pip install bitsandbytes

In [None]:
# To quantize models you can use transformers with a quantization config
# we will use bitsandbytes quantization as an example

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B",
    torch_dtype="auto",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B")
input_text = "What are we having for dinner?"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

What are we having for dinner?!
Ahh, it's a tough one!


To understand what's happening here, inspect the weight matrices. For example, let's look at the `q_proj` matrix of the first layer.

In [None]:
quantized_model.model.layers[0].self_attn.q_proj.weight

Parameter containing:
Parameter(Params4bit([[214],
            [ 46],
            [250],
            ...,
            [228],
            [244],
            [ 25]], device='cuda:0', dtype=torch.uint8))

In 4-bit quantization, each weight is stored as a 4-bit code, which can take 16 values (0 to 15). A `torch.uint8` value has 8 bits, so it holds two such codes packed together: one in the upper 4 bits and one in the lower 4 bits. For example, the codes 12 and 8 have the binary representations `1100` and `1000`; packed into a single byte they form `11001000`, which is 200 in decimal. So each `torch.uint8` entry in the `q_proj` weight shown above actually encodes two 4-bit weights, which is how the memory usage is halved relative to 8 bits per weight.
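
The packing described above can be reproduced with two bit operations. This is a small sketch of the idea only, with hypothetical function names; it does not reproduce the internal layout of any particular library.

```python
def pack_4bit(high, low):
    """Pack two 4-bit values (0-15) into one byte: high nibble, then low nibble."""
    if not (0 <= high <= 15 and 0 <= low <= 15):
        raise ValueError("4-bit values must be in the range 0-15")
    return (high << 4) | low


def unpack_4bit(byte):
    """Split one byte back into its two 4-bit values."""
    return (byte >> 4) & 0x0F, byte & 0x0F


packed = pack_4bit(12, 8)
print(packed, format(packed, "08b"))   # 200 11001000
print(unpack_4bit(packed))             # (12, 8)
```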

## vLLM and Advanced Techniques
**What is vLLM?**

[vLLM](https://github.com/vllm-project/vllm) is a high-performance serving library for LLMs. Think of it as a race car engine for your AI model - same model, but turbocharged performance!
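
**Key vLLM Optimizations**

- PagedAttention: Stores the KV cache in fixed-size blocks, like virtual-memory pages, to reduce wasted memory
- Continuous Batching: Adds new requests to the running batch as soon as others finish
- Tensor Parallelism: Splits the model across multiple GPUs
- Speculative Decoding: Drafts several tokens cheaply, then verifies them with the main model in one pass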
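
To give a feel for the paging idea, here is a toy block allocator. It is a simplified sketch, not vLLM's implementation: the class `PagedKVCache` and its methods are hypothetical, and it tracks only which blocks belong to which request rather than storing any tensors.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")   # 3 tokens -> 2 blocks
cache.append_token("request-b")       # 1 token  -> 1 block
print(cache.block_tables, "free:", len(cache.free_blocks))

cache.free("request-a")               # blocks are reusable immediately
print("free after request-a finishes:", len(cache.free_blocks))
```

Because memory is handed out one small block at a time, a request never reserves space for its maximum possible length up front, and freed blocks can serve the next request right away.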

**Basic vLLM Example**

In [None]:
from vllm import LLM, SamplingParams

# Create LLM instance
llm = LLM(
    model="microsoft/DialoGPT-small",
    tensor_parallel_size=1,  # Number of GPUs
    max_model_len=1024
)

# Set up sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=100
)

# Generate responses for multiple prompts (batched)
prompts = [
    "Explain machine learning in simple terms:",
    "What are the benefits of renewable energy?",
    "How does the internet work?"
]

# This processes all prompts efficiently in batch
outputs = llm.generate(prompts, sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"Response: {output.outputs[0].text}")
    print("-" * 50)

INFO 05-23 09:32:00 [config.py:2968] Downcasting torch.float32 to torch.float16.
INFO 05-23 09:32:18 [config.py:717] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 05-23 09:32:18 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='microsoft/DialoGPT-small', speculative_config=None, tokenizer='microsoft/DialoGPT-small', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=No

INFO 05-23 09:32:20 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-23 09:32:20 [cuda.py:289] Using XFormers backend.
INFO 05-23 09:32:21 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-23 09:32:21 [model_runner.py:1108] Starting to load model microsoft/DialoGPT-small...
INFO 05-23 09:32:21 [weight_utils.py:265] Using model weights format ['*.safetensors']


INFO 05-23 09:32:24 [weight_utils.py:281] Time spent downloading weights for microsoft/DialoGPT-small: 2.680174 seconds
INFO 05-23 09:32:24 [weight_utils.py:315] No model.safetensors.index.json found in remote.


INFO 05-23 09:32:24 [loader.py:458] Loading weights took 0.40 seconds
INFO 05-23 09:32:25 [model_runner.py:1140] Model loading took 0.2378 GiB and 3.488462 seconds
INFO 05-23 09:32:26 [worker.py:287] Memory profiling takes 1.17 seconds
INFO 05-23 09:32:26 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 05-23 09:32:26 [worker.py:287] model weights take 0.24GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.47GiB; the rest of the memory reserved for KV Cache is 12.53GiB.
INFO 05-23 09:32:27 [executor_base.py:112] # cuda blocks: 22815, # CPU blocks: 7281
INFO 05-23 09:32:27 [executor_base.py:117] Maximum concurrency for 1024 tokens per request: 356.48x
INFO 05-23 09:32:31 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI

INFO 05-23 09:33:12 [model_runner.py:1592] Graph capturing finished in 41 secs, took 0.14 GiB
INFO 05-23 09:33:12 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 46.93 seconds


Prompt 1: Explain machine learning in simple terms:
Response:  Machine Learning . Machine Learning .
--------------------------------------------------
Prompt 2: What are the benefits of renewable energy?
Response: 
--------------------------------------------------
Prompt 3: How does the internet work?
Response: 
--------------------------------------------------


**Advanced vLLM Features**

In [None]:
# Tensor Parallelism (multiple GPUs) in case you have access to many gpus
# llm_parallel = LLM(
#    model="microsoft/DialoGPT-medium",
#    tensor_parallel_size=2,  # Use 2 GPUs
#    pipeline_parallel_size=1
#)

# Prefix Caching (reuse computations for similar prompts)
llm_cached = LLM(
    model="microsoft/DialoGPT-small",
    enable_prefix_caching=True
)

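# Configure advanced optimizations
# Note: every LLM(...) instance reserves GPU memory, so keep only one per process
import os
from vllm import LLM, SamplingParams

# Assumption (vLLM 0.8.x, as in the logs above): the attention backend is chosen
# through an environment variable rather than an LLM() argument. FlashInfer must
# be installed separately and supported by your GPU.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

llm_optimized = LLM(
    model="microsoft/DialoGPT-medium",
    # Memory optimizations
    max_model_len=1024,  # must not exceed the model's context window
    block_size=16,
    swap_space=4,  # GiB of CPU memory for swapping

    # Speed optimizations
    # (use_v2_block_manager is omitted: assumed deprecated in recent vLLM releases)
    enable_prefix_caching=True,

    # GPU optimizations
    enforce_eager=False,  # Enable CUDA graphs
    max_num_seqs=256,     # Maximum number of sequences per batch
)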

## When to Use What?
**For Beginners:**

- Start with torch.compile (one line of code!)
- Try 4-bit quantization with Transformers
- Use Unsloth for easy optimization

**For Production:**

- Use vLLM for serving multiple users
- Combine quantization + FlashAttention
- Monitor performance and memory usage

**For Resource-Constrained Environments:**

- GGUF format for CPU inference
- Aggressive quantization (4-bit or even 2-bit)
- Smaller model variants

**Common Pitfalls to Avoid**

❌ Don't: Apply all optimizations at once without testing

✅ Do: Add optimizations one by one and measure impact

❌ Don't: Use aggressive quantization for critical applications without quality testing

✅ Do: Test model quality after each optimization

❌ Don't: Ignore memory monitoring

✅ Do: Monitor GPU/CPU usage to find bottlenecks

## Challenge

Suppose you have an 8B model and want to run inference as efficiently as possible. What would you do? (If possible, estimate how much memory is needed.)