**Purpose of Day 6:**  
Learn advanced optimization techniques that make LLMs faster and more efficient in real-world use (KV caching, flash attention, speculative decoding).

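**Purpose:** Demonstrate KV caching: store the attention keys and values of tokens that were already processed and reuse them, so each generation step only computes them for the newest token.

- **First step (prefill):** Computes attention over the full prompt and caches the key-value pairs.
- **Later steps (decode):** Reuse the cached KV and compute only for the new token → faster.

With `use_cache=True`, `generate` keeps this cache within a single call; it is not carried over between separate calls. Production systems build on the same idea to speed up generation.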
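
To make the caching idea concrete, here is a minimal, library-free sketch of single-head attention decoding with and without a KV cache. The `project` helper and its scale factors are hypothetical stand-ins for learned projection matrices; real models such as GPT-2 do this per layer and per head on tensors.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def project(x, scale):
    """Hypothetical stand-in for a learned linear projection."""
    return [scale * xi for xi in x]

def decode_without_cache(embeddings):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(embeddings)):
        prefix = embeddings[: t + 1]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(project(embeddings[t], 1.0), keys, values))
    return outputs, projections

def decode_with_cache(embeddings):
    """Project only the newest token and append its K and V to the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in embeddings:
        key_cache.append(project(x, 0.5))
        value_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(project(x, 1.0), key_cache, value_cache))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full, full_cost = decode_without_cache(tokens)
cached, cached_cost = decode_with_cache(tokens)
print(full_cost, cached_cost)  # 20 8
```

Both functions return the same attention outputs, but the uncached version performs a number of K/V projections that grows quadratically with sequence length, while the cached version grows linearly.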

#### `1`  KV caching (speed boost)

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer

In [2]:
# name = 'distilgpt2'
name = './gpt2-local'
quantize = True

In [3]:
if quantize:
    bnb = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        name, quantization_config=bnb, device_map='auto')
else:
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16, device_map='auto')

In [4]:
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token
# GPT-2 ships without a padding token, so reuse the end-of-sequence token for padding.
# This is needed for batched inputs and avoids warnings during generation.

In [5]:
streamer = TextStreamer(tokenizer, skip_prompt=True)

In [1]:
def generate_with_cache(prompt, max_new_tokens=50, temperature=0.7, top_p=0.9):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        streamer=streamer,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=True  # KV cache is reused across decoding steps within this call only
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [7]:
# Test
print("=== First call ===")
result1 = generate_with_cache("The future of AI", max_new_tokens=20)
print(result1)

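# Note: this is a fresh call with the prompt " is". The KV cache from the first
# call is not reused, so the model does not see "The future of AI" here.
print("\n=== Continue same sequence ===")
result2 = generate_with_cache(" is", max_new_tokens=20)
print(result2)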

=== First call ===
 is likely to be more complicated than we thought. AI is just the beginning of the development process.
The future of AI is likely to be more complicated than we thought. AI is just the beginning of the development process.

=== Continue same sequence ===
 and is a product of the human condition. We are in a state of profound distress. We are
 is and is a product of the human condition. We are in a state of profound distress. We are


#### `2`  Efficient Serving with vLLM

These are vLLM's key speed features:

- **PagedAttention:** Manages KV cache like OS virtual memory → less waste.
- **Continuous batching:** Mixes requests dynamically → GPU always busy.
- **Optimized CUDA kernels:** Custom GPU code → faster computations.

Together, these make vLLM well suited to high-throughput LLM serving.
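
The PagedAttention idea can be illustrated with a toy block-table allocator. This is a sketch of the bookkeeping only, assuming KV memory is split into fixed-size blocks that are handed out on demand; the class and method names (`PagedKVAllocator`, `append_token`, `free`) are hypothetical and are not vLLM's API.

```python
class PagedKVAllocator:
    """Toy block-table allocator: each sequence maps to fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        # Physical location of the new token: (block id, offset inside the block)
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

allocator = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    allocator.append_token("a")     # sequence "a" ends up in blocks 0 and 1
allocator.append_token("b")         # sequence "b" gets block 2
allocator.free("a")                 # blocks 0 and 1 return to the pool
slot = allocator.append_token("c")  # sequence "c" takes the next free block
print(slot)  # (3, 0)
```

Because blocks are allocated one at a time, a sequence wastes at most one partially filled block, instead of reserving memory for its maximum possible length up front.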

In [None]:
def vllm_serving_demo():

    try:
        from vllm import LLM, SamplingParams
        
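        # Initialize LLM. GPT-2 is not an AWQ-quantized checkpoint and has a
        # 1024-token context window, so load it unquantized with max_model_len=1024.
        llm = LLM(model="gpt2", max_model_len=1024)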
        
        # Sampling params
        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=100,
        )
        
        # Batch prompts
        prompts = [
            "Explain machine learning to a 10-year-old:",
            "Write a Python function to calculate fibonacci:",
            "The benefits of renewable energy include:"
        ]
        
        # Generate
        outputs = llm.generate(prompts, sampling_params)
        
        # Print results
        for output in outputs:
            print(f"Prompt: {output.prompt}")
            print(f"Generated: {output.outputs[0].text}\n")
            
    except ImportError:
        print("vLLM not installed. Install with: pip install vllm")
        print("This demonstrates the production serving approach")

# Run vLLM demo
vllm_serving_demo()

#### `3`  Speculative decoding

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

In [13]:
small = AutoModelForCausalLM.from_pretrained("./distilgpt2-local")
large = AutoModelForCausalLM.from_pretrained("./gpt2-local")
tokenizer = AutoTokenizer.from_pretrained('./distilgpt2-local')
tokenizer.pad_token = tokenizer.eos_token

In [14]:
streamer = TextStreamer(tokenizer, skip_prompt=True)

In [33]:
def speculative(prompt, max_tokens=50, lookahead=5, stream=True):
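    inputs = tokenizer.encode(prompt, return_tensors='pt')
    generated = inputs.clone()

    # With skip_prompt=True, TextStreamer discards its first put() call as the prompt.
    # Pass the prompt explicitly so the first accepted tokens are not dropped.
    if stream:
        streamer.put(inputs)

    # Note: max_tokens bounds the number of draft/verify rounds, not tokens;
    # each round appends between 1 and lookahead tokens.
    for s in range(max_tokens):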
        s_outputs = small.generate(
            generated, max_new_tokens=lookahead, do_sample=True, temperature=0.7, pad_token_id=tokenizer.eos_token_id, 
            num_return_sequences=1)

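        tokens = s_outputs[0, generated.shape[1]:]
        n_draft = tokens.shape[0]  # may be shorter than lookahead if the draft hits EOS

        # Score the whole draft with the large model in one forward pass (no gradients needed)
        with torch.no_grad():
            l_outputs = large(
                s_outputs, output_hidden_states=False, output_attentions=False)

        # Logits at position j predict token j + 1, so shift by one to align with the draft
        l_logits = l_outputs.logits
        l_probs = torch.softmax(l_logits[:, -n_draft-1:-1], dim=-1)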
        
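        accept_token = []

        # Simplified acceptance rule: keep a drafted token if the large model assigns it
        # probability > 0.5; otherwise sample a replacement from the large model and stop.
        for i, d in enumerate(tokens):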
            probs = l_probs[0, i, d]

            if probs > 0.5:
                accept_token.append(d.unsqueeze(0))
            else:
                l_token = torch.multinomial(l_probs[0, i], 1)
                accept_token.append(l_token)
                break

        if not accept_token:
            break
            
        new_tokens = torch.cat(accept_token)
        generated = torch.cat([generated, new_tokens.unsqueeze(0)], dim=1)
        
        # Stream final accepted tokens only if requested
        if stream:
            streamer.put(new_tokens)
        
        if tokenizer.eos_token_id in new_tokens:
            break
    
    if stream:
        streamer.end()
        
    return tokenizer.decode(generated[0])
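
The function above accepts a drafted token when the large model's probability for it exceeds a fixed threshold. For comparison, the acceptance rule usually described for speculative sampling accepts a drafted token with probability min(1, p/q), where p and q are the large and small model's probabilities for that token, and otherwise resamples from the renormalized positive part of p − q. A minimal sketch on plain probability lists follows; `verify_draft_token` is a hypothetical helper, not part of `transformers`.

```python
import random

def verify_draft_token(token, target_probs, draft_probs, rng):
    """Accept or replace one drafted token.

    target_probs / draft_probs are probability lists over the same vocabulary,
    taken from the large and small model at this position.
    Returns (token, accepted).
    """
    p, q = target_probs[token], draft_probs[token]
    # q > 0 is assumed because the token was sampled from the draft distribution
    if rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the residual distribution max(0, p - q)
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    replacement = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return replacement, False

rng = random.Random(0)
# Identical distributions: p / q == 1, so the draft token is always accepted
print(verify_draft_token(1, [0.2, 0.8], [0.2, 0.8], rng))  # (1, True)
# The large model gives token 0 zero probability: always rejected, replaced by token 1
print(verify_draft_token(0, [0.0, 1.0], [0.5, 0.5], rng))  # (1, False)
```

With this rule the accepted tokens follow the large model's distribution, which a fixed threshold does not guarantee.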

In [34]:
prompt = "Artificial intelligence will transform"

In [39]:
result = speculative(prompt, max_tokens=30, lookahead=3)

 brains very rapidly, the report shows.

This new approach to medicine, which looks at the human brain as a set of distinct parts that are really only "together


In [40]:
print(result)

Artificial intelligence will transform our brains very rapidly, the report shows.

This new approach to medicine, which looks at the human brain as a set of distinct parts that are really only "together
