<a href="https://colab.research.google.com/github/Milan-Chicago/multi-agent-course/blob/main/Completed_Maven_TextStreamer_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install transformers
!pip install bitsandbytes

### Install Dependencies

Install the `transformers` and `bitsandbytes` libraries. These are required for model loading, tokenization, text streaming, and 4-bit quantization.

- **transformers**: Provides model architectures, tokenizers, and generation utilities
- **bitsandbytes**: Enables efficient 4-bit and 8-bit quantization on CUDA GPUs
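
After installation it can help to confirm which versions actually ended up in the environment, since both packages must be present for the quantized part of the notebook. The sketch below uses only the standard library; the helper name `installed_version` is ours and is not part of either library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this notebook depends on.
for name in ("transformers", "bitsandbytes"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```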

In [14]:
from huggingface_hub import notebook_login
notebook_login()

### Hugging Face Authentication

Use `notebook_login()` to authenticate with the Hugging Face Hub. Authentication is required for gated models (e.g., the official Llama repositories). Once you log in, the token is saved locally and reused by later Hub calls.

## TextStreamer + Inference Performance Metrics

This notebook demonstrates two key concepts:

1. **TextStreamer / TextIteratorStreamer** - Stream generated tokens to the console in real-time instead of waiting for the entire sequence to complete
2. **Inference performance metrics** - Measure key latency and throughput metrics with and without 4-bit quantization

We compare full-precision vs. 4-bit quantized inference on the same model to show the impact on memory usage and speed.

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer

### Imports

- **AutoTokenizer / AutoModelForCausalLM**: Automatically load the correct tokenizer and model architecture for any Hugging Face model
- **TextStreamer**: Streams generated tokens directly to stdout as they are produced
- **BitsAndBytesConfig**: Configures 4-bit or 8-bit quantization parameters
- **torch**: PyTorch, the underlying deep learning framework

In [4]:
def start_gpu_stat():
    #@title Show current memory stats
    #Set torch device to get properties global: torch.cuda.set_device(0)
    gpu_stats = torch.cuda.get_device_properties(0)
    initial_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    return initial_gpu_memory, max_memory

def final_gpu_stat(_initial_gpu_memory, _max_memory):
    #@title Show final memory and time stats
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_diff = round(used_memory - _initial_gpu_memory, 3)
    used_percentage = round(used_memory         /_max_memory*100, 3)
    diff_percentage = round(used_memory_for_diff/_max_memory*100, 3)

    print(f"Max memory = {_max_memory} GB.")
    print(f"{_initial_gpu_memory} GB of INITIAL memory reserved.")
    print(f"Peak reserved FINAL memory = {used_memory} GB.")
    print(f"Peak reserved memory DIFFERENCE = {used_memory_for_diff} GB.")
    print(f"Peak reserved memory % of FINAL memory = {used_percentage} %.")
    print(f"Peak reserved memory % of DIFFERENCE memory = {diff_percentage} %.")

### GPU Memory Tracking Utilities

Two helper functions for monitoring GPU memory usage:

- **`start_gpu_stat()`**: Records the initial GPU memory reservation and total GPU memory before loading a model
- **`final_gpu_stat()`**: Reads the peak reserved memory after model loading and computes the difference from the initial value, showing how much additional memory was reserved while loading the model

These are used to compare memory footprint between full-precision and quantized models.
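
The arithmetic inside these helpers can be checked without a GPU. The sketch below is a minimal stand-in that takes raw byte counts instead of calling `torch.cuda`; the function name `memory_summary` and the example byte counts are illustrative assumptions. Both percentages are relative to the total GPU memory. Keep in mind that `torch.cuda.max_memory_reserved()` is a peak counter, so it does not drop when memory is freed unless it is reset (for example with `torch.cuda.reset_peak_memory_stats()`).

```python
GIB = 1024 ** 3  # bytes per GiB, matching the / 1024 / 1024 / 1024 in the helpers


def memory_summary(initial_bytes, peak_bytes, total_bytes):
    """Mirror the helper arithmetic on plain byte counts (no GPU needed)."""
    initial_gib = round(initial_bytes / GIB, 3)
    peak_gib = round(peak_bytes / GIB, 3)
    total_gib = round(total_bytes / GIB, 3)
    diff_gib = round(peak_gib - initial_gib, 3)
    return {
        "total_gib": total_gib,
        "initial_gib": initial_gib,
        "peak_gib": peak_gib,
        "diff_gib": diff_gib,
        # Both percentages use total GPU memory as the denominator.
        "peak_pct_of_total": round(peak_gib / total_gib * 100, 3),
        "diff_pct_of_total": round(diff_gib / total_gib * 100, 3),
    }


# Hypothetical numbers: nothing reserved at start, 15 GiB peak, 80 GiB card.
summary = memory_summary(0, 15 * GIB, 80 * GIB)
print(summary)
```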

# Part 1: Full-Precision Inference (No Quantization)

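In this section, we load the **Llama 3.1 8B Instruct** model without quantization. "Full precision" here means unquantized weights; the roughly 15 GB reported after loading is consistent with 16-bit (float16/bfloat16) weights rather than float32. We measure: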
- GPU memory consumption
- Text streaming output
- Inference performance metrics (TTFT, ITL, throughput)

In [5]:
model_id = "unsloth/Meta-Llama-3.1-8B-Instruct" # Replace with your model

# Load tokenizer and full precision model
tokenizer = AutoTokenizer.from_pretrained(model_id)

initial_gpu_memory, max_memory = start_gpu_stat()

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)

final_gpu_stat(initial_gpu_memory, max_memory)

Max memory = 79.179 GB.
0.0 GB of INITIAL memory reserved.
Peak reserved FINAL memory = 14.959 GB.
Peak reserved memory DIFFERENCE = 14.959 GB.
Peak reserved memory % of FINAL memory = 18.893 %.
Peak reserved memory % of DIFFERENCE memory = 18.893 %.


### Model Loading Results

The model is loaded without quantization; `device_map="auto"` places its layers on the available devices. The peak reserved memory of 14.959 GB is the baseline footprint against which the quantized model is compared.
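
The 14.959 GB reported above is close to what the weights alone would occupy at 16 bits per parameter. The back-of-the-envelope sketch below makes that estimate explicit. The parameter count of roughly 8.03 billion is an assumption based on the model's published size, and the estimate ignores activations, the KV cache, and any quantization constants.

```python
GIB = 1024 ** 3

# Assumption: approximate parameter count of Llama 3.1 8B.
N_PARAMS = 8.03e9


def estimate_weight_gib(n_params, bits_per_param):
    """Estimate the memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: {estimate_weight_gib(N_PARAMS, bits):6.2f} GiB")
```

The 16-bit row lands near 15 GiB, which matches the measured peak, while float32 would need about twice that.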

In [6]:
from transformers import TextStreamer


# Define Alpaca-style prompt format
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
"""

# Prepare input text
prompt_text = alpaca_prompt.format("What is the importance of using renewable energy?")  # instruction

inputs = tokenizer([prompt_text], return_tensors="pt").to(model.device)  # Move inputs to model's device

# Initialize text streamer
text_streamer = TextStreamer(tokenizer, skip_prompt=False, skip_special_tokens=False)

# Generate response with streamer
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=100)


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the importance of using renewable energy?

### Response:
The importance of using renewable energy lies in its ability to provide a sustainable and environmentally friendly alternative to traditional fossil fuels. Renewable energy sources, such as solar, wind, and hydroelectric power, can significantly reduce greenhouse gas emissions and mitigate climate change. Additionally, renewable energy can help to reduce dependence on imported fuels, improve energy security, and create jobs in the clean energy sector. Furthermore, the cost of renewable energy technologies has decreased over the years, making them more competitive with fossil fuels and more accessible to


### Text Streaming with TextStreamer

The `TextStreamer` prints tokens to the console as they are generated, providing a real-time streaming experience. This is useful for interactive applications where you want to display output incrementally.

Key parameters:
- `skip_prompt=False`: Also prints the input prompt
- `skip_special_tokens=False`: Includes special tokens like `<|begin_of_text|>` in the output

## Inference Performance Metrics

We measure four key metrics to evaluate inference performance:

| Metric | Definition |
|--------|------------|
| **Time To First Token (TTFT)** | Time to process the prompt and generate the first output token. Critical for perceived responsiveness. |
| **Inter-Token Latency (ITL)** | Average time between consecutive generated tokens. Lower is smoother streaming. |
| **End-to-End Latency** | Total wall-clock time to generate the entire response. Approximately `TTFT + (output_length - 1) * ITL`. |
| **Throughput** | Number of output tokens generated per second. Higher is better. |
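
The four definitions can be expressed as a small function over arrival timestamps. This is a minimal sketch, not the notebook's measurement code: the name `compute_metrics` is hypothetical, and it takes the end-to-end latency from the last arrival rather than from a separate clock reading after the loop.

```python
def compute_metrics(start_time, arrival_times, n_output_tokens):
    """Compute TTFT, ITL, end-to-end latency and throughput from timestamps.

    start_time: clock reading taken just before generation starts
    arrival_times: clock readings, one per received output chunk, in order
    n_output_tokens: number of generated tokens
    """
    if not arrival_times:
        return {"ttft": 0.0, "itl": 0.0, "e2e": 0.0, "throughput": 0.0}
    ttft = arrival_times[0] - start_time
    e2e = arrival_times[-1] - start_time
    # Gaps between consecutive arrivals; empty when only one chunk arrived.
    gaps = [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    throughput = n_output_tokens / e2e if e2e > 0 else 0.0
    return {"ttft": ttft, "itl": itl, "e2e": e2e, "throughput": throughput}


# Synthetic example: first chunk after 0.5 s, then one chunk every 0.1 s.
metrics = compute_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9], n_output_tokens=5)
print(metrics)
```

With these definitions, end-to-end latency equals TTFT plus the sum of the inter-token gaps, which is why a long prompt (high TTFT) and a long response (many gaps) affect it in different ways.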

In [7]:
from transformers import TextIteratorStreamer
from threading import Thread
import time

# Define Alpaca-style prompt format
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
"""

# Prepare input text
prompt_text = alpaca_prompt.format("What is the importance of using renewable energy?")
inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

# Initialize variables for time measurements
start_time = time.time()
token_times = []

# Initialize streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

# Start generation in a separate thread
thread = Thread(target=model.generate, kwargs={
    'input_ids': inputs['input_ids'],
    'attention_mask': inputs['attention_mask'],
    'streamer': streamer,
    'max_new_tokens': 100
})
thread.start()

# Initialize a variable to store the model output
model_output = ""
first_token_time = None

# Iterate over the streamer to get the generated text in chunks
for i, new_text in enumerate(streamer):
    model_output += new_text
    print(new_text, end='')

    # Measure time for the first token
    if i == 0:
        first_token_time = time.time()
    # Measure time for each token
    token_times.append(time.time())

# Calculate end-to-end latency
end_time = time.time()
end_to_end_latency = end_time - start_time

# Wait for the generation thread to finish
thread.join()

# Calculate time to first token
ttft = first_token_time - start_time if first_token_time else 0

# Calculate inter-token latency
itl = sum(x - y for x, y in zip(token_times[1:], token_times[:-1])) / (len(token_times) - 1) if len(token_times) > 1 else 0

# Calculate throughput
# Note: tokenizer.encode() also adds a BOS token, so this counts one more token than was generated
throughput = len(tokenizer.encode(model_output)) / end_to_end_latency if model_output else 0

print("\nTime To First Token (TTFT):", ttft)
print("Inter-token latency (ITL):", itl)
print("End-to-end Latency:", end_to_end_latency)
print("Throughput:", throughput)

The importance of using renewable energy is that it helps reduce our reliance on non-renewable energy sources, such as coal and oil, which contribute to greenhouse gas emissions and climate change. Renewable energy sources, like solar and wind power, are sustainable and can be replenished naturally, providing a cleaner and more environmentally friendly alternative. Additionally, renewable energy can help reduce energy costs, improve energy security, and create jobs in the clean energy sector. Furthermore, using renewable energy can help mitigate the impacts of climate
Time To First Token (TTFT): 0.02173590660095215
Inter-token latency (ITL): 0.01848189115524292
End-to-end Latency: 1.8700683116912842
Throughput: 54.008722231465384


### Metrics Measurement with TextIteratorStreamer

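Unlike `TextStreamer` (which prints directly), `TextIteratorStreamer` exposes the generated text as an iterator, so each chunk can be timestamped as it arrives and the metrics computed from those timestamps. The timestamps mark when decoded text reaches the main thread, so the ITL reported here is the average gap between consecutive chunks.

The generation runs in a **background thread** while the main thread iterates over the streamer and records arrival times. This is the standard pattern for streaming with metric collection.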
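
The background-thread pattern can be tried without a model. The sketch below is a toy stand-in built on the standard library: `fake_generate` plays the role of `model.generate`, and a `Queue` plus a sentinel plays the role of the streamer. All names here are hypothetical and the chunk texts are made up; it only illustrates the producer/consumer structure.

```python
import time
from queue import Queue
from threading import Thread

_END = object()  # sentinel that marks the end of the stream


def fake_generate(chunks, out_queue, delay=0.01):
    """Stand-in for model.generate: emit chunks with a small delay between them."""
    for chunk in chunks:
        time.sleep(delay)
        out_queue.put(chunk)
    out_queue.put(_END)


def iterate_stream(out_queue):
    """Yield chunks from the queue until the sentinel arrives."""
    while True:
        item = out_queue.get()
        if item is _END:
            return
        yield item


stream_queue = Queue()
worker = Thread(target=fake_generate, args=(["Renewable ", "energy ", "matters."], stream_queue))

start = time.perf_counter()
worker.start()

received, arrival_times = [], []
for text in iterate_stream(stream_queue):
    received.append(text)
    arrival_times.append(time.perf_counter() - start)  # seconds since start

worker.join()  # make sure the producer thread has finished
print("".join(received))
```

The consumer loop blocks on the queue, so timestamps are recorded as soon as each chunk is available rather than after generation completes.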

# Part 2: 4-Bit Quantized Inference

In this section, we repeat the same experiments using **4-bit NF4 quantization** via `bitsandbytes`. This allows us to directly compare:
- Memory footprint (full-precision vs 4-bit)
- Inference speed and latency
- Output quality

> **Note:** Shut down and restart the kernel before running the cells below. This ensures the full-precision model is fully unloaded from GPU memory, giving accurate memory measurements for the quantized model.

In [8]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer

### Re-import Libraries

After kernel restart, re-import all required libraries for the quantized experiment.

In [9]:
def start_gpu_stat():
    #@title Show current memory stats
    #Set torch device to get properties global: torch.cuda.set_device(0)
    gpu_stats = torch.cuda.get_device_properties(0)
    initial_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    return initial_gpu_memory, max_memory

def final_gpu_stat(_initial_gpu_memory, _max_memory):
    #@title Show final memory and time stats
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_diff = round(used_memory - _initial_gpu_memory, 3)
    used_percentage = round(used_memory         /_max_memory*100, 3)
    diff_percentage = round(used_memory_for_diff/_max_memory*100, 3)

    print(f"Max memory = {_max_memory} GB.")
    print(f"{_initial_gpu_memory} GB of INITIAL memory reserved.")
    print(f"Peak reserved FINAL memory = {used_memory} GB.")
    print(f"Peak reserved memory DIFFERENCE = {used_memory_for_diff} GB.")
    print(f"Peak reserved memory % of FINAL memory = {used_percentage} %.")
    print(f"Peak reserved memory % of DIFFERENCE memory = {diff_percentage} %.")

### GPU Memory Tracking (Re-defined)

Re-define the GPU memory tracking functions after kernel restart.

In [10]:
model_id = "unsloth/Meta-Llama-3.1-8B-Instruct" # Post Training Quantization

# 4-bit quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

# Load tokenizer and model in 4-bit
tokenizer = AutoTokenizer.from_pretrained(model_id)

initial_gpu_memory, max_memory = start_gpu_stat()

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

final_gpu_stat(initial_gpu_memory, max_memory)

Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

Max memory = 79.179 GB.
15.064 GB of INITIAL memory reserved.
Peak reserved FINAL memory = 29.404 GB.
Peak reserved memory DIFFERENCE = 14.34 GB.
Peak reserved memory % of FINAL memory = 37.136 %.
Peak reserved memory % of DIFFERENCE memory = 18.111 %.


### Quantized Model Loading

The model is loaded with 4-bit NF4 quantization using `BitsAndBytesConfig`. Configuration:
- **`load_in_4bit=True`**: Enable 4-bit weight quantization
- **`bnb_4bit_compute_dtype=torch.float16`**: Use float16 for matrix computations
- **`bnb_4bit_quant_type="nf4"`**: Use the 4-bit NormalFloat (NF4) data type introduced in the QLoRA paper
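
To give a feel for what storing weights in 4 bits involves, the toy sketch below quantizes one block of values to 16 integer codes plus a single scale, then restores them. It is a deliberate simplification: it uses a uniform 16-level grid rather than the NF4 codebook, works on plain Python lists, and the function names are hypothetical rather than part of `bitsandbytes`.

```python
LEVELS = 16               # 4 bits -> 16 distinct codes
HALF = (LEVELS - 1) / 2   # 7.5, maps [-absmax, absmax] onto codes 0..15


def quantize_block(values):
    """Map a block of floats to 4-bit codes plus one shared scale (absmax)."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [round(v / absmax * HALF + HALF) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Recover approximate floats from the codes and the block scale."""
    return [(code - HALF) / HALF * absmax for code in codes]


block = [0.5, -1.0, 0.25, 0.0, 0.9, -0.3]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:   ", codes)
print("restored:", [round(v, 3) for v in restored])
print("max abs error:", round(max_error, 4))
```

Each value costs 4 bits plus a share of the per-block scale, and the rounding error is bounded by half a grid step. The dequantization step also has to run during inference, which is one reason 4-bit inference is not automatically faster.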

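Compare the memory stats here with Part 1 to see the memory reduction. Note that the output recorded above starts from 15.064 GB already reserved, which indicates the kernel was not restarted for that run. Because `torch.cuda.max_memory_reserved()` reports a peak, the 14.34 GB difference shown there does not isolate the quantized model's footprint; rerun the cell in a fresh kernel to get a clean number.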

In [11]:
from transformers import TextStreamer

# Define Alpaca-style prompt format
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
"""

# Prepare input text
prompt_text = alpaca_prompt.format("What is the importance of using renewable energy?")  # instruction

inputs = tokenizer([prompt_text], return_tensors="pt").to(model.device)  # Move inputs to model's device

# Initialize text streamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

# Generate response with streamer
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=100)


The importance of using renewable energy is that it reduces our reliance on fossil fuels, decreases greenhouse gas emissions, and helps to mitigate the effects of climate change. Additionally, renewable energy sources such as solar, wind, and hydroelectric power are becoming increasingly cost-competitive with traditional fossil fuels, making them a more viable option for powering our homes, businesses, and transportation. Furthermore, renewable energy can create jobs and stimulate local economies, particularly in rural areas where these resources are often abundant. Overall, the shift


### Streaming with Quantized Model

Text streaming works identically with the quantized model. The quantization is transparent to the generation API. Compare the output quality with the full-precision version above.

In [12]:
from transformers import TextIteratorStreamer
from threading import Thread
import time

# Define Alpaca-style prompt format
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
"""

# Prepare input text
prompt_text = alpaca_prompt.format("What is the importance of using renewable energy?")
inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

# Initialize variables for time measurements
start_time = time.time()
token_times = []

# Initialize streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

# Start generation in a separate thread
thread = Thread(target=model.generate, kwargs={
    'input_ids': inputs['input_ids'],
    'attention_mask': inputs['attention_mask'],
    'streamer': streamer,
    'max_new_tokens': 100
})
thread.start()

# Initialize a variable to store the model output
model_output = ""
first_token_time = None

# Iterate over the streamer to get the generated text in chunks
for i, new_text in enumerate(streamer):
    model_output += new_text
    print(new_text, end='')

    # Measure time for the first token
    if i == 0:
        first_token_time = time.time()
    # Measure time for each token
    token_times.append(time.time())

# Calculate end-to-end latency
end_time = time.time()
end_to_end_latency = end_time - start_time

# Wait for the generation thread to finish
thread.join()

# Calculate time to first token
ttft = first_token_time - start_time if first_token_time else 0

# Calculate inter-token latency
itl = sum(x - y for x, y in zip(token_times[1:], token_times[:-1])) / (len(token_times) - 1) if len(token_times) > 1 else 0

# Calculate throughput
# Note: tokenizer.encode() also adds a BOS token, so this counts one more token than was generated
throughput = len(tokenizer.encode(model_output)) / end_to_end_latency if model_output else 0

print("\nTime To First Token (TTFT):", ttft)
print("Inter-token latency (ITL):", itl)
print("End-to-end Latency:", end_to_end_latency)
print("Throughput:", throughput)

Renewable energy is crucial because it helps reduce our reliance on finite resources, decreases greenhouse gas emissions, and mitigates the impacts of climate change. Moreover, it provides energy security by reducing dependence on imported fuels, and it can create jobs and stimulate local economies. Renewable energy sources, such as solar, wind, and hydro power, are sustainable and can be replenished naturally, offering a cleaner alternative to fossil fuels. By transitioning to renewable energy, we can ensure a more environmentally friendly and economically resilient
Time To First Token (TTFT): 0.3287832736968994
Inter-token latency (ITL): 0.02766495704650879
End-to-end Latency: 3.0954294204711914
Throughput: 32.62875235728218


### Performance Metrics (Quantized)

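The same TTFT, ITL, end-to-end latency, and throughput metrics are measured for the quantized model. Observations from the runs recorded in this notebook:
- **Memory**: 4-bit weights need roughly a quarter of the space of 16-bit weights; the overall saving is smaller because some layers and the quantization constants stay in higher precision. A clean measurement requires a fresh kernel.
- **Throughput**: Lower in this run (about 32.6 vs. 54.0 tokens/s), likely because `bitsandbytes` dequantizes weights on the fly during each forward pass
- **TTFT/ITL**: Higher (slower) in this run: ITL of about 27.7 ms vs. 18.5 ms, and TTFT of about 0.33 s vs. 0.02 s
- **Quality**: Minimal degradation for most tasks with NF4 quantization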
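
For the two runs recorded in this notebook, the relative change can be computed directly from the printed metrics. The sketch below copies rounded values from the outputs above; the dictionary keys and the `ratios` helper are illustrative names, and a single run per configuration is not a benchmark, so treat the ratios as indicative only.

```python
# Rounded values copied from the two recorded runs in this notebook.
unquantized = {"ttft_s": 0.0217, "itl_s": 0.0185, "e2e_s": 1.870, "tokens_per_s": 54.01}
quantized = {"ttft_s": 0.3288, "itl_s": 0.0277, "e2e_s": 3.095, "tokens_per_s": 32.63}


def ratios(baseline, candidate):
    """Return candidate / baseline for every metric the two runs share."""
    return {name: candidate[name] / baseline[name] for name in baseline}


for name, value in ratios(unquantized, quantized).items():
    print(f"{name:>12}: {value:.2f}x")
```

A ratio above 1 for the latency metrics and below 1 for throughput means the quantized run was slower on this hardware.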