# LLM Inference Workshop

This notebook explores how to run inference with Large Language Models (LLMs) efficiently, progressing from half precision (FP16/BF16) to quantized models (8-bit, 4-bit).

## Setup and Installation

First, let's install the necessary libraries:

In [1]:
#!pip install -q transformers torch accelerate bitsandbytes psutil pandas matplotlib

In [2]:
# !nvidia-smi

## Loading Models with Different Precision

Let's explore different precision options for LLM inference, starting with higher precision (fp16/bf16) and then moving to quantized models (8-bit, 4-bit).
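
As a rough guide to what each option costs, weight memory scales with the number of bits stored per parameter. The sketch below is a back-of-the-envelope estimate only: `estimate_weight_memory_gb` is a hypothetical helper, the 24-billion parameter count is an assumed round figure for illustration, and the estimate ignores activations, the KV cache, and the extra scaling constants that quantized formats store.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

ASSUMED_NUM_PARAMS = 24e9  # assumption: a round 24B-parameter model

for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = estimate_weight_memory_gb(ASSUMED_NUM_PARAMS, bits)
    print(f"{name}: ~{gb:.0f} GB of weights")
```

Measured footprints will usually come out somewhat higher than these figures, since the runtime also allocates buffers beyond the weights.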

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, logging as hf_logging
import torch
import time
import os
import psutil
import pandas as pd
import matplotlib.pyplot as plt


hf_logging.set_verbosity_error()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if torch.cuda.is_available():
    print(torch.cuda.get_device_properties(torch.cuda.current_device()))
else:
    print("CUDA is not available; running on CPU.")

In [None]:
def print_memory_usage():
    """Return current GPU memory usage in MB (allocated on the current device), or 0.0 if no GPU is available."""
    if torch.cuda.is_available():
        allocated_bytes = torch.cuda.memory_allocated()
        return allocated_bytes / (1024 * 1024)
    return 0.0  # keeps the later "after - before" arithmetic valid on CPU-only machines

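def inference_benchmark(model, tokenizer, prompt, num_runs=3, max_new_tokens=100):
    """Benchmark generation latency and throughput (tokens actually generated per second)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = inputs["input_ids"].shape[1]

    # Warmup (also passes the attention mask along with the input IDs)
    _ = model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    times, new_tokens = [], []
    for _ in range(num_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # finish pending GPU work before starting the timer
        start_time = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for generation to finish before stopping the timer
        times.append(time.perf_counter() - start_time)
        new_tokens.append(output.shape[1] - prompt_len)  # generation may stop early at EOS

    return {
        "avg_time": sum(times) / len(times),
        "tokens_per_second": sum(new_tokens) / sum(times)
    }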

# Choose a model for demonstration
#model_id = "/project/rcc/youzhi/toxic-models/Qwen/Qwen2.5-7B-Instruct-1M"
#model_id = "/project/rcc/youzhi/toxic-models/Qwen/Qwen2.5-14B-Instruct-1M"
model_id = "/project/rcc/youzhi/toxic-models/mistralai/Mistral-Small-24B-Instruct-2501"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Storage for benchmark results
results = []

### 1. FP16 Precision (Half Precision)

Let's first load the model in FP16 precision, which is a good balance between accuracy and memory efficiency.

In [None]:
print("Loading model with FP16 precision...")
base_memory = print_memory_usage()

# Load model with FP16 precision
start_time = time.time()
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    #device_map="auto"
)

model_fp16.to(device)

load_time_fp16 = time.time() - start_time

print(f"Model loaded in {load_time_fp16:.2f} seconds")
fp16_memory = print_memory_usage() - base_memory
print(f"Memory footprint: {fp16_memory:.2f} MB")

Let's run some inference with the FP16 model:

In [None]:
# Test FP16 model inference
prompt = "Explain how transformers work in machine learning, focusing on self-attention mechanisms:"

fp16_benchmark = inference_benchmark(model_fp16, tokenizer, prompt)
results.append({
    "Precision": "FP16",
    "Memory (MB)": fp16_memory,
    "Avg Generation Time (s)": fp16_benchmark["avg_time"],
    "Tokens/Second": fp16_benchmark["tokens_per_second"]
})

# Generate sample output
inputs = tokenizer(prompt, return_tensors="pt").to(model_fp16.device)
outputs = model_fp16.generate(inputs["input_ids"], max_new_tokens=150)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Sample output from FP16 model:")
print(response)

### 2. BF16 Precision (Brain Floating Point)

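BF16 uses the same 16 bits as FP16 but allocates them differently: 8 exponent bits and 7 mantissa bits, versus 5 and 10 for FP16. This gives BF16 the dynamic range of FP32 at lower precision, which often improves numerical stability, especially during training.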
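
The range-versus-precision trade-off between the two 16-bit formats can be illustrated without a GPU. This is a minimal sketch using only the standard library: `to_fp16` and `to_bf16` are hypothetical helpers, and `to_bf16` simulates bfloat16 by truncating a float32 encoding, whereas real hardware typically rounds to nearest.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by keeping only the top 16 bits of the float32 encoding (truncation)."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]

# Precision: FP16 keeps more mantissa bits, so it resolves small differences near 1.0
print(to_fp16(1.001))  # 1.0009765625
print(to_bf16(1.001))  # 1.0

# Range: FP16 tops out at 65504, while BF16 shares FP32's exponent range
try:
    to_fp16(70000.0)
except OverflowError:
    print("70000.0 does not fit in FP16")
print(to_bf16(70000.0))  # 69632.0
```

FP16 resolves the small difference near 1.0 that BF16 loses, while BF16 represents a magnitude that overflows FP16.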

In [None]:
# Free up memory
del model_fp16
torch.cuda.empty_cache()
print_memory_usage()

# Load model with BF16 precision
print("Loading model with BF16 precision...")
base_memory = print_memory_usage()

start_time = time.time()
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    #device_map="auto"
)

model_bf16.to(device)

load_time_bf16 = time.time() - start_time

print(f"Model loaded in {load_time_bf16:.2f} seconds")
bf16_memory = print_memory_usage() - base_memory
print(f"Memory footprint: {bf16_memory:.2f} MB")

In [None]:
# Test BF16 model inference
bf16_benchmark = inference_benchmark(model_bf16, tokenizer, prompt)
results.append({
    "Precision": "BF16",
    "Memory (MB)": bf16_memory,
    "Avg Generation Time (s)": bf16_benchmark["avg_time"],
    "Tokens/Second": bf16_benchmark["tokens_per_second"]
})

# Free up memory
del model_bf16
torch.cuda.empty_cache()

### 3. 8-bit Quantization

Now let's try 8-bit quantization, which roughly halves weight memory relative to FP16, usually with minimal impact on quality.
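
To give a feel for what quantizing to 8 bits means, here is a deliberately simplified sketch of symmetric absmax quantization over a single list of weights. It is not the bitsandbytes algorithm, which is more elaborate (for example, it treats outlier values separately); the function names and the toy weights are made up for illustration.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 if all values are zero
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]

toy_weights = [0.5, -1.27, 0.02, 1.0]  # made-up values
q, scale = quantize_absmax_int8(toy_weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 2, 100]
print(max(abs(a - b) for a, b in zip(toy_weights, restored)))  # worst-case round-trip error
```

Each weight is stored as one byte plus a shared scale, and the round-trip error is bounded by half the scale, which is why quality usually degrades only slightly.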

In [None]:
print("Loading model with 8-bit quantization...")
base_memory = print_memory_usage()

start_time = time.time()
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)  # passing load_in_8bit directly is deprecated
)


load_time_8bit = time.time() - start_time

print(f"Model loaded in {load_time_8bit:.2f} seconds")
int8_memory = print_memory_usage() - base_memory
print(f"Memory footprint: {int8_memory:.2f} MB")

In [None]:
# Test 8-bit model inference
int8_benchmark = inference_benchmark(model_8bit, tokenizer, prompt)
results.append({
    "Precision": "INT8",
    "Memory (MB)": int8_memory,
    "Avg Generation Time (s)": int8_benchmark["avg_time"],
    "Tokens/Second": int8_benchmark["tokens_per_second"]
})

# Free up memory
del model_8bit
torch.cuda.empty_cache()

### 4. 4-bit Quantization

Finally, let's try 4-bit quantization, which cuts weight memory to roughly a quarter of FP16 but may have a larger impact on quality.

In [None]:
print("Loading model with 4-bit quantization...")
base_memory = print_memory_usage()

# Configure 4-bit quantization options
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # normalized float 4 format
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

start_time = time.time()
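model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"  # place the quantized weights on the GPU at load time
)

# Note: no explicit .to(device) here. bitsandbytes-quantized models are placed by device_map,
# and calling .to() on them raises an error in many transformers versions.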

load_time_4bit = time.time() - start_time

print(f"Model loaded in {load_time_4bit:.2f} seconds")
int4_memory = print_memory_usage() - base_memory
print(f"Memory footprint: {int4_memory:.2f} MB")

In [None]:
# Test 4-bit model inference
int4_benchmark = inference_benchmark(model_4bit, tokenizer, prompt)
results.append({
    "Precision": "INT4",
    "Memory (MB)": int4_memory,
    "Avg Generation Time (s)": int4_benchmark["avg_time"],
    "Tokens/Second": int4_benchmark["tokens_per_second"]
})

# Generate sample output to compare quality
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)
outputs = model_4bit.generate(inputs["input_ids"], max_new_tokens=150)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Sample output from 4-bit quantized model:")
print(response)

## FlashAttention

In [None]:
print("Loading model with Flash Attention in FP16 precision...")
base_memory = print_memory_usage()

# Load model with FP16 precision
start_time = time.time()

model_flashattention = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2" if torch.cuda.is_available() else "eager",
    device_map="auto" 
)

load_time_fp16 = time.time() - start_time

print(f"Model loaded in {load_time_fp16:.2f} seconds")
fp16fa_memory = print_memory_usage() - base_memory
print(f"Memory footprint: {fp16fa_memory:.2f} MB")

In [None]:
# Test FP16 flash attention model inference
fp16fa_benchmark = inference_benchmark(model_flashattention, tokenizer, prompt)
results.append({
    "Precision": "FP16-FA",
    "Memory (MB)": fp16fa_memory,
    "Avg Generation Time (s)": fp16fa_benchmark["avg_time"],
    "Tokens/Second": fp16fa_benchmark["tokens_per_second"]
})

# Generate sample output to compare quality
inputs = tokenizer(prompt, return_tensors="pt").to(model_flashattention.device)
outputs = model_flashattention.generate(inputs["input_ids"], max_new_tokens=150)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Sample output from FP16 FlashAttention model:")
print(response)

## Fused Kernels

In [None]:
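# Release the previous FlashAttention model first. Otherwise two FP16 copies are briefly
# resident at once and the "after - before" memory measurement below comes out near zero.
del model_flashattention
torch.cuda.empty_cache()

print("Loading model with Flash Attention in FP16 precision and torch.compile...")
base_memory = print_memory_usage()

# Load model with FP16 precision
start_time = time.time()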

model_flashattention = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2" if torch.cuda.is_available() else "eager",
    device_map="auto" 
)

if hasattr(torch, "compile"):
    model_flashattention = torch.compile(model_flashattention)

load_time_fp16 = time.time() - start_time

print(f"Model loaded in {load_time_fp16:.2f} seconds")
fp16fa_memory = print_memory_usage() - base_memory
print(f"Memory footprint: {fp16fa_memory:.2f} MB")

In [None]:
# Test FP16 flash attention model inference with fused kernels
fused_fp16fa_benchmark = inference_benchmark(model_flashattention, tokenizer, prompt)
results.append({
    "Precision": "FP16-FA-Fused",
    "Memory (MB)": fp16fa_memory,
    "Avg Generation Time (s)": fused_fp16fa_benchmark["avg_time"],
    "Tokens/Second": fused_fp16fa_benchmark["tokens_per_second"]
})

# Generate sample output to compare quality
inputs = tokenizer(prompt, return_tensors="pt").to(model_flashattention.device)
outputs = model_flashattention.generate(inputs["input_ids"], max_new_tokens=150)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Sample output from 16-bit flash attention model with fused kernels:")
print(response)

## Comparing Results

Let's visualize and compare the results from different precision formats:

In [None]:
results_df = pd.DataFrame(results)

# Create a color map for each unique precision
unique_precisions = results_df["Precision"].unique()
color_map = {prec: f"C{i}" for i, prec in enumerate(unique_precisions)}

# Build the color arrays for each bar
colors_for_memory = [color_map[prec] for prec in results_df["Precision"]]
colors_for_speed = [color_map[prec] for prec in results_df["Precision"]]

display(results_df)

plt.figure(figsize=(12, 5))

# Plot memory usage comparison
plt.subplot(1, 2, 1)
plt.bar(results_df["Precision"], results_df["Memory (MB)"], color=colors_for_memory)
plt.title("GPU Memory Usage by Precision")
plt.ylabel("Memory Usage (MB)")
plt.grid(axis="y", linestyle="--", alpha=0.7)

# Plot inference speed comparison
plt.subplot(1, 2, 2)
plt.bar(results_df["Precision"], results_df["Tokens/Second"], color=colors_for_speed)
plt.title("Inference Speed by Precision")
plt.ylabel("Tokens per Second")
plt.grid(axis="y", linestyle="--", alpha=0.7)

plt.tight_layout()
plt.show()


### top_k, top_p, temperature

In [None]:
greedy_outputs = model_4bit.generate(
    inputs["input_ids"], 
    max_new_tokens=200, 
    do_sample=False
)
greedy_response = tokenizer.decode(greedy_outputs[0], skip_special_tokens=True)
print("=== Greedy decoding (no sampling) ===")
print(greedy_response)
print()

In [None]:
temp_outputs = model_4bit.generate(
    inputs["input_ids"], 
    max_new_tokens=200, 
    do_sample=True,
    temperature=1.5  
)
temp_response = tokenizer.decode(temp_outputs[0], skip_special_tokens=True)
print("=== Temperature Sampling (T=1.5) ===")
print(temp_response)
print()

In [None]:
temp_outputs = model_4bit.generate(
    inputs["input_ids"], 
    max_new_tokens=200, 
    do_sample=True,
    temperature=5.5  
)
temp_response = tokenizer.decode(temp_outputs[0], skip_special_tokens=True)
print("=== Temperature Sampling (T=5.5) ===")
print(temp_response)
print()

In [None]:
topk_outputs = model_4bit.generate(
    inputs["input_ids"], 
    max_new_tokens=200, 
    do_sample=True,
    temperature=0.7, 
    top_k=50  # Only sample from top 50 tokens
)
topk_response = tokenizer.decode(topk_outputs[0], skip_special_tokens=True)
print("=== Top-k Sampling (k=50, T=0.7) ===")
print(topk_response)
print()

In [None]:
topp_outputs = model_4bit.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9   # Keep the top 90% probability mass
)
topp_response = tokenizer.decode(topp_outputs[0], skip_special_tokens=True)
print("=== Top-p Sampling (p=0.9, T=0.7) ===")
print(topp_response)
print()

In [None]:
combo_outputs = model_4bit.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9
)
combo_response = tokenizer.decode(combo_outputs[0], skip_special_tokens=True)
print("=== Combined Sampling (top_k=50, top_p=0.9, T=0.8) ===")
print(combo_response)
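
What the three sampling parameters in the cells above do to the next-token distribution can be sketched in plain Python. This is an illustrative sketch on a four-token toy vocabulary, not the Transformers implementation (which operates on logits and may differ in details such as tie handling); all function names and numbers here are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def keep_and_renormalize(probs, keep):
    """Zero out every token not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return keep_and_renormalize(probs, set(ranked[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return keep_and_renormalize(probs, keep)

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(toy_probs, k=2))    # only the two most likely tokens survive
print(top_p_filter(toy_probs, p=0.9))  # the least likely token is dropped
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5))
```

With a very high temperature such as 5.5 the distribution becomes close to uniform, which is why that setting tends to produce incoherent text unless top-k or top-p trims the tail.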

## Advanced Inference Techniques

The remaining examples reuse the 4-bit model (`model_4bit`) loaded above, since it has the smallest memory footprint. Let's explore more advanced inference techniques.

### Batch Processing for Throughput

Processing multiple inputs at once can be more efficient:

In [None]:
# Create a batch of prompts
prompts = [
    "What is machine learning?",
    "Explain neural networks",
    "How do transformers work?",
    "Define reinforcement learning",
    "What is transfer learning?"
]

# Process individually
start_time = time.time()
individual_outputs = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)
    outputs = model_4bit.generate(inputs["input_ids"], max_new_tokens=500)
    individual_outputs.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
individual_time = time.time() - start_time
print(f"Individual processing time: {individual_time:.2f}s")

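# Process as a batch
# Decoder-only models should be left-padded so generation continues from real tokens,
# and the tokenizer needs a pad token (fall back to EOS if none is defined).
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

start_time = time.time()
batch_inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(model_4bit.device)
# Pass the attention mask along with the input IDs so padding positions are ignored
batch_outputs = model_4bit.generate(**batch_inputs, max_new_tokens=500)
batch_decoded = [tokenizer.decode(output, skip_special_tokens=True) for output in batch_outputs]
batch_time = time.time() - start_time
print(f"Batch processing time: {batch_time:.2f}s")
print(f"Speed improvement: {individual_time/batch_time:.2f}x")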
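
Batching prompts of different lengths requires padding, and for decoder-only models the padding normally goes on the left so that every sequence ends in a real token. The sketch below shows the idea on made-up token IDs; `left_pad` is a hypothetical helper, not a tokenizer API.

```python
def left_pad(sequences, pad_id):
    """Pad token-ID lists on the left to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 marks padding to be ignored
    return input_ids, attention_mask

toy_batch = [[11, 12, 13], [21]]  # made-up token IDs for two prompts
ids, mask = left_pad(toy_batch, pad_id=0)
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With right padding the shorter prompt would end in pad tokens, and the model would be asked to continue from padding rather than from the prompt.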

### Using KV Caching for Faster Generation

In [None]:
# Demonstrating the explicit use of KV cache for token generation

prompt = "The benefits of artificial intelligence in healthcare include:"
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)

# disable KV cache explicitly
start_time = time.time()
outputs1 = model_4bit.generate(
    inputs["input_ids"],
    max_new_tokens=50,
    use_cache=False          # <- turn off the KV cache
)
time1 = time.time() - start_time
print(f"Standard generation time (no-cache): {time1:.2f}s")


# enable KV cache explicitly
start_time = time.time()
outputs2 = model_4bit.generate(
    inputs["input_ids"],
    max_new_tokens=50,
    use_cache=True          # <- turn on the KV cache
)
time2 = time.time() - start_time
print(f"KV cache generation time: {time2:.2f}s")


# Method 2: Manually implementing token-by-token generation with KV cache
# start_time = time.time()
# generated_ids = inputs.input_ids
# past_key_values = None

# for _ in range(50):  
#     # Forward pass with caching
#     with torch.no_grad():
#         outputs = model_4bit(
#             input_ids=generated_ids[:, -1:] if past_key_values is not None else generated_ids,
#             past_key_values=past_key_values,
#             use_cache=True
#         )
    
#     past_key_values = outputs.past_key_values  # Cache for next iteration
#     next_token_id = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(-1)
#     generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)
    
#     # Check for EOS token
#     if next_token_id.item() == tokenizer.eos_token_id:
#         break

# time2 = time.time() - start_time
# print(f"Explicit KV cache generation time: {time2:.2f}s")
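
Why the cache helps can be seen by counting how many token positions the model has to process. The sketch below is simple arithmetic under the assumption of greedy, one-token-at-a-time generation; the helper names and the prompt length of 10 are made up for illustration.

```python
def tokens_processed_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every step re-reads the prompt plus all tokens generated so far."""
    return sum(prompt_len + step for step in range(new_tokens))

def tokens_processed_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The prompt is processed once; each later step feeds only the newest token."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)

print(tokens_processed_without_cache(prompt_len=10, new_tokens=50))  # 1725
print(tokens_processed_with_cache(prompt_len=10, new_tokens=50))     # 59
```

The uncached count grows quadratically with the number of generated tokens, while the cached count grows linearly, at the cost of keeping the keys and values for every layer in memory.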

### Streaming Generation

For better user experience, we can stream the model output:

In [None]:
from IPython.display import display, clear_output

prompt = "Write a short introduction about machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)

# Streamed generation with proper KV caching
generated_ids = inputs.input_ids
past_key_values = None
response = tokenizer.decode(inputs.input_ids[0], skip_special_tokens=True)

for _ in range(200):  
    clear_output(wait=True)
    print(response, end="")
    
    # Generate next token
    with torch.no_grad():
        outputs = model_4bit(
            input_ids=generated_ids[:, -1:] if past_key_values is not None else generated_ids,
            past_key_values=past_key_values,
            use_cache=True
        )
        
    past_key_values = outputs.past_key_values
    
    # Sample next token (instead of greedy)
    logits = outputs.logits[:, -1, :] / 0.7  # temperature = 0.7
    probs = torch.softmax(logits, dim=-1)
    next_token_id = torch.multinomial(probs, num_samples=1)
    
    generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)
    
    # Decode and display: re-decode the whole sequence, since decoding tokens
    # one at a time can drop spaces or split multi-byte characters
    response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    
    # Check for EOS token
    if next_token_id.item() == tokenizer.eos_token_id:
        break
        
    time.sleep(0.1)  # Slow down for visual effect

clear_output(wait=True)
print(response)

## Speculative Decoding

Let's implement a simple version of speculative decoding using a smaller model to predict tokens, then verify with our larger model.

In [None]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def speculative_decoding(
    model_draft, 
    model_main, 
    tokenizer, 
    prompt, 
    draft_tokens=4, 
    max_new_tokens=100,
    temperature=0.7
):
    """
    Generate text using speculative decoding, where:
      - model_draft (smaller) proposes 'draft_tokens' at a time
      - model_main (larger) verifies or rejects those tokens in a single pass

    Args:
        model_draft: The smaller "draft" model (e.g. 4-bit quantized).
        model_main: The larger "main" model (e.g. FP16).
        tokenizer: Tokenizer for both models (assume they share the vocab).
        prompt: Initial text prompt (str).
        draft_tokens: Number of tokens proposed by model_draft in each chunk.
        max_new_tokens: Maximum tokens to generate (not counting the prompt).
        temperature: Sampling temperature for the draft model (and main if it must sample).

    Returns:
        A string containing the generated text.
    """
    # Set eval mode. The models are assumed to be on their target device already:
    # bitsandbytes-quantized models cannot always be moved with .to().
    model_draft.eval()
    model_main.eval()

    # Encode the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model_main.device)
    generated_ids = inputs["input_ids"]  # shape: (1, prompt_length)

    accepted_tokens = 0
    rejected_tokens = 0
    prompt_len = generated_ids.shape[1]  # to track how many new tokens we generate

    # We iterate until we produce max_new_tokens or hit EOS
    while (generated_ids.shape[1] - prompt_len) < max_new_tokens:
        # 1) Use the draft model to propose a chunk of 'draft_tokens'
        #    We'll do a short generation with do_sample and some temperature.
        #    This returns the entire sequence, so we'll slice out just the new part.
        draft_out = model_draft.generate(
            generated_ids,
            max_new_tokens=draft_tokens,
            do_sample=True,
            temperature=temperature,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True
        )[0]
        
        # Extract only the newly proposed tokens (the "chunk")
        proposed_chunk = draft_out[generated_ids.shape[1]:]  # shape: (chunk_len,)

        # If no new tokens (model ended with EOS), break
        if len(proposed_chunk) == 0:
            break

        # 2) Single forward pass in the main model to verify chunk
        #    We'll do teacher forcing on the proposed_chunk to get probabilities.
        #    That means we feed the main model all tokens up to, but not including,
        #    each target token position.
        #
        #    Instead of forward passing N times, we do ONE pass over:
        #       cat(generated_ids, proposed_chunk[:-1])
        #    so the final logit will correspond to the last token in 'proposed_chunk'.
        #
        #    Then we'll compare each position's sample with the draft token.
        # -------------------------------------------------------------------------
        if len(proposed_chunk) > 1:
            # main_input includes everything up to the last token of the chunk - 1
            main_input = torch.cat([generated_ids, proposed_chunk[:-1].unsqueeze(0)], dim=1)
        else:
            # If there's only 1 token in the chunk, there's no "chunk[:-1]",
            # so just use the existing generated_ids.
            main_input = generated_ids

        # Forward pass
        with torch.no_grad():
            outputs_main = model_main(main_input, use_cache=False)
            # logits shape: (batch=1, seq_len, vocab_size)
            logits_main = outputs_main.logits

        # We'll collect the accepted tokens from this chunk
        accepted_chunk = []
        chunk_rejected = False

        # Figure out how to index the logits for each new token.
        # The logit at position t predicts the token at position t + 1, so the
        # i-th chunk token is predicted by the logit at index
        # (generated_ids.shape[1] - 1) + i, where generated_ids is the sequence
        # before this chunk was proposed.
        offset = generated_ids.shape[1] - 1  # logit index that predicts the first chunk token

        for i, draft_token in enumerate(proposed_chunk):
            # 2a) Compare the main model's distribution to the draft token
            # The logit index for this new token = offset + i
            logit_idx = offset + i
            # If that index is out of range, something went wrong
            if logit_idx >= logits_main.shape[1]:
                break

            # Probability distribution over the vocab at this position,
            # using the same temperature as the draft model
            dist = F.softmax(logits_main[0, logit_idx, :] / temperature, dim=-1)

            # Sample from the main model's distribution
            main_sample_id = torch.multinomial(dist, num_samples=1)

            if main_sample_id.item() == draft_token.item():
                # 2b) Accept the draft token
                accepted_chunk.append(draft_token)
                accepted_tokens += 1
            else:
                # 2c) Reject this token (and all subsequent tokens in chunk)
                #     We'll fallback to the main model's token here,
                #     and discard the rest of the chunk.
                accepted_chunk.append(main_sample_id.squeeze())
                rejected_tokens += (len(proposed_chunk) - i)
                chunk_rejected = True
                break

        # Now we cat the accepted_chunk to the generated_ids
        if len(accepted_chunk) > 0:
            accepted_tensor = torch.stack(accepted_chunk).unsqueeze(0)  # shape: (1, n_accepted)
            generated_ids = torch.cat([generated_ids, accepted_tensor], dim=1)

        # If we rejected at some point, we stop verifying the remainder of the chunk.
        # We'll continue generating from the next iteration.
        # Check if we hit EOS
        if generated_ids[0, -1].item() == tokenizer.eos_token_id:
            break

    # Done generating or hit EOS
    output_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    total_proposed = accepted_tokens + rejected_tokens
    accept_rate = accepted_tokens / total_proposed if total_proposed > 0 else 1.0
    
    print(f"Tokens accepted: {accepted_tokens}, rejected: {rejected_tokens}")
    print(f"Acceptance rate: {accept_rate:.2%}")

    return output_text


prompt = "Once upon a time,"

#result = speculative_decoding(model_4bit, model_fp16, tokenizer, prompt, draft_tokens=4, max_new_tokens=50)
#print("Speculative Decoding Output:\n", result)
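
The function above accepts a draft token only when an independent sample from the main model happens to match it, which is a simplification. The speculative sampling literature instead accepts a draft token with probability min(1, p/q), where p and q are the main and draft probabilities of that token, and on rejection resamples from the normalized positive part of p - q. The following is a minimal sketch of that rule on toy distributions; the function names and the numbers are made up, and it is not tied to any library API.

```python
import random

def acceptance_probability(p_main, q_draft, token):
    """Probability of keeping a draft token (q_draft[token] > 0 because the draft sampled it)."""
    return min(1.0, p_main[token] / q_draft[token])

def residual_distribution(p_main, q_draft):
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    residual = [max(0.0, p - q) for p, q in zip(p_main, q_draft)]
    total = sum(residual)
    if total == 0.0:  # p and q are identical, so fall back to p
        return list(p_main)
    return [r / total for r in residual]

def verify_draft_token(p_main, q_draft, token, rng):
    """Return (token_to_emit, accepted) for a single draft token."""
    if rng.random() < acceptance_probability(p_main, q_draft, token):
        return token, True
    weights = residual_distribution(p_main, q_draft)
    replacement = rng.choices(range(len(p_main)), weights=weights)[0]
    return replacement, False

p_main = [0.6, 0.3, 0.1]   # toy main-model probabilities
q_draft = [0.2, 0.5, 0.3]  # toy draft-model probabilities
rng = random.Random(0)
print(acceptance_probability(p_main, q_draft, token=1))
print(residual_distribution(p_main, q_draft))
print(verify_draft_token(p_main, q_draft, token=0, rng=rng))  # (0, True)
```

Tokens the main model likes at least as much as the draft does are always kept, and the resampling step is what makes the emitted tokens follow the main model's distribution.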


## Conclusion and Further Resources

In this notebook, we've explored various techniques for efficient LLM inference:

1. Different precision formats (FP16, BF16, INT8, INT4) with their memory and speed tradeoffs
2. Batch processing for increased throughput
3. KV caching for faster token generation
4. Streaming generation for better user experience
5. Speculative decoding to speed up inference

For further optimization, consider exploring:

- Specialized inference libraries like vLLM, DeepSpeed or TensorRT-LLM
- Mixture of Experts (MoE) models like Mixtral
- Model distillation to create smaller, faster models
- Prompt engineering for token efficiency
- Hardware-specific optimizations (CUDA graphs, FlashAttention, etc.)

Key references:
- [Hugging Face Efficient Inference Guide](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Speculative Decoding Paper](https://arxiv.org/abs/2302.01318)
- [FlashAttention Paper](https://arxiv.org/abs/2205.14135)