# UdaciHeadline: LLM Inference Optimization Project

## Project Introduction
Large Language Models (LLMs) are transforming content creation, but deploying them efficiently remains a major hurdle. Imagine you're an ML Engineer at a bustling online news portal. Your key task? Automatically generating catchy headlines from article summaries using an LLM. The problem? The current inference process is sluggish, causing publication delays and driving up operational costs. In this project, UdaciHeadline, you'll step into this role and tackle this critical challenge head-on. Your mission is to accelerate the headline generation pipeline significantly by applying state-of-the-art LLM inference optimization techniques. Get ready to dive deep into practical optimization and deployment!

## Project Summary
This project provides hands-on experience in optimizing the inference performance of a pre-trained Large Language Model (like Llama-3.2-1B) for news headline generation. You will bring together concepts of LLM architecture, optimization techniques, and deployment frameworks. Specifically, you will:

1.  **Establish a baseline** inference pipeline and profile its performance.
2.  Implement and evaluate architectural optimizations like **KV-caching**.
3.  Apply model compression techniques like **quantization** and **pruning**.
4.  Configure and benchmark **distributed inference** using Tensor and Pipeline Parallelism.
5.  Apply advanced decoding mechanisms like **speculative decoding**.
6.  Perform comprehensive **benchmarking and analysis** across all stages.
7.  Produce a **final report** summarizing findings and trade-offs.

## Imports and Global Configuration

Let's import the libraries we'll use throughout the project and define some constants like the model name and the prompt template.

In [28]:
pip install --upgrade pip setuptools

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [2]:
# pip install --upgrade cmake 



In [1]:
!pip install evaluate
!pip install --upgrade transformers
!pip install rouge_score
!pip install bitsandbytes
!pip install accelerate
!pip install datasets


In [1]:
# !pip install datasets


In [28]:
import os
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from evaluate import load as load_metric

from time import time as get_time
from pprint import pprint
import torch.nn.utils.prune as prune
import copy
#os.environ["HF_HUB_OFFLINE"] = "1"
# ---- Constants ----

MODEL_NAME = "meta-llama/Llama-3.2-1B"
MAX_NEW_TOKENS = 50 # Max length for the generated headline

#PROMPT = \
# print(f"Prompt: \"{PROMPT}\"")
TARGET_LAYER_NAME_STR = "model.layers.0.mlp.gate_proj"

# We will prune 50% of the weights in this layer
PRUNING_AMOUNT = 0.5 
NUM_PREDICTION = 10

## Data Loading

We will use the "News Category Dataset" from Kaggle. The `kagglehub` library makes it easy to download and access. Your task is to implement the function to load and preprocess the data according to the docstring.

In [29]:

def load_news_dataset(path):
    """Load the first 1,000 records from a JSON file and return (dataset, articles).

    `articles` is the list of `short_description` strings used as model input.
    """
    dataset = load_dataset("json", data_files=path, split="train[:1000]")
    articles = [item["short_description"] for item in dataset]

    return dataset, articles
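
As an illustration of the preprocessing step, here is a minimal standard-library sketch that parses JSON Lines records and keeps only those with both a non-empty `short_description` and a non-empty `headline`. It assumes one JSON object per line with those two fields (the fields this notebook reads). The helper name `parse_news_records` and the sample records are hypothetical, and the filtering is an optional extra rather than part of the function above.

```python
import json


def parse_news_records(lines, limit=1000):
    """Parse JSON Lines text into (article, headline) pairs, skipping incomplete rows."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        article = record.get("short_description", "").strip()
        headline = record.get("headline", "").strip()
        if article and headline:
            pairs.append((article, headline))
        if len(pairs) >= limit:
            break
    return pairs


# Hypothetical sample records in the assumed format.
sample_lines = [
    '{"headline": "City Opens New Park", "short_description": "The park spans ten acres downtown."}',
    '{"headline": "Empty Story", "short_description": ""}',
    '',
    '{"headline": "Markets Rally", "short_description": "Stocks rose for a third day."}',
]
pairs = parse_news_records(sample_lines)
print(pairs)
```

Dropping rows with an empty description avoids asking the model to write a headline from no input, which would otherwise distort both latency and ROUGE.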


# 2. Baseline Performance

Before we can optimize, we need a starting point. Here, you'll establish the baseline performance of the `Llama-3.2-1B` model without any specific optimizations. We will measure latency, throughput, and the quality of the generated headlines using the ROUGE score.
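
To make the quality metric concrete, the sketch below computes a simplified ROUGE-1 F-measure: the unigram overlap between a generated headline and its reference. It is an illustration only, assuming lowercase whitespace tokenization. The `rouge` metric from `evaluate` applies its own tokenization and also reports ROUGE-2 and ROUGE-L, so its numbers can differ. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("city opens new park downtown", "City Opens New Park")
print(round(score, 3))  # 0.889 (precision 4/5, recall 4/4)
```

A prediction that is much longer than the reference lowers precision and therefore the score, even when every reference word is present.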

### Your Task: Implement the Evaluation Pipeline
You need to implement the core functions for loading a model, generating a headline, and evaluating performance. These functions will be reused for every optimization technique.

In [30]:
def load_model(model_name, quantization_config=None, device="cpu"):
    """Load a tokenizer and a causal LM in eval mode and return (model, tokenizer)."""
    # Prefer bfloat16 on GPUs that support it; fall back to float32 otherwise.
    dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float32
    print(f"Using model: {model_name}")
    print(f"Using device: {device}")
    print(f"Using dtype: {dtype}")

    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print("Tokenizer loaded successfully.")
    print("Loading model...")
    # Pass the dtype printed above so the log matches what is actually loaded.
    model = AutoModelForCausalLM.from_pretrained(
        model_name, dtype=dtype, quantization_config=quantization_config
    )
    # bitsandbytes-quantized models are placed at load time and may not support `.to()`.
    if quantization_config is None:
        model = model.to(device)
    print("Model loaded successfully and moved to device.")
    model.eval()
    # Llama tokenizers ship without a pad token; reuse EOS for padding.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    return model, tokenizer

def generate_headline(model, tokenizer, texts, device, max_length, use_cache=False):
    """Generate one continuation per input text and record per-sample latency in seconds.

    `max_length` is passed to `generate` as `max_new_tokens`. Returns (headlines, latencies).
    """
    headlines = []
    latencies = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
            start = get_time()
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_length,
                use_cache=use_cache,  # False for the baseline (no KV caching)
                do_sample=False,  # greedy decoding
            )
            end = get_time()
            latencies.append(end - start)
            # outputs[0] holds the prompt tokens followed by the generated tokens,
            # so the decoded string includes the input text.
            headline = tokenizer.decode(outputs[0], skip_special_tokens=True)
            headlines.append(headline)
    return headlines, latencies

def report_metrics(times, section):
    """Print average latency, P99 latency, throughput, and peak GPU memory for one run."""
    avg_latency = sum(times) / len(times)
    # Index-based percentile; with 10 samples this selects the slowest one.
    p99_latency = sorted(times)[int(0.99 * len(times))]
    throughput = len(times) / sum(times)
    # Peak allocation since the process started (or since the last reset of the peak stats).
    gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"Below are for: {section}s")
    print(f"Avg Latency: {avg_latency:.3f}s")
    print(f"P99 Latency: {p99_latency:.3f}s")
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Max GPU Memory: {gpu_memory:.2f} MB")

def evaluate_model(dataset, generated, num_prediction):
    """Compute ROUGE between the generated text and the first `num_prediction` reference headlines."""
    rouge = load_metric("rouge")
    references = list(dataset[:num_prediction]["headline"])
    results = rouge.compute(predictions=generated, references=references)
    print("ROUGE Scores:", results)

    return results


def clean(model, tokenizer):
    """Drop the local references and release cached CUDA memory."""
    # `del` removes only this function's references; the caller must also drop
    # its own references before the memory can actually be freed.
    del model
    del tokenizer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("\nCleaned up models and emptied CUDA cache.")
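
The latency statistics used throughout can be checked on a small hand-made list. The sketch below mirrors the arithmetic of `report_metrics` (mean, index-based P99, samples per second) without touching the GPU. The function name `summarize_latencies` and the sample values are hypothetical.

```python
def summarize_latencies(latencies):
    """Return average latency, index-based P99 latency, and throughput."""
    ordered = sorted(latencies)
    p99_index = int(0.99 * len(ordered))  # always a valid index because 0.99 * n < n
    return {
        "avg": sum(ordered) / len(ordered),
        "p99": ordered[p99_index],
        "throughput": len(ordered) / sum(ordered),
    }


# Ten hypothetical request latencies in seconds, with one slow outlier.
sample = [1.0, 1.2, 0.9, 1.1, 4.0, 1.0, 1.0, 0.8, 1.0, 1.0]
stats = summarize_latencies(sample)
print(stats)
```

With only 10 samples the P99 index is 9, so the reported P99 is simply the slowest request. A larger sample is needed before the tail estimate says much.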

In [31]:
def get_module_by_name_str(model, module_name_str):
    """Gets a module from a model using its string name (e.g., 'model.layers.0.mlp.gate_proj')."""
    names = module_name_str.split('.')
    current_module = model
    for name_part in names:
        if hasattr(current_module, name_part):
            current_module = getattr(current_module, name_part)
        else:
            try: # Handle numeric indices in ModuleLists
                idx = int(name_part)
                current_module = current_module[idx]
            except (ValueError, TypeError, IndexError):
                raise AttributeError(f"Could not resolve name part '{name_part}' in '{module_name_str}'.")
    return current_module

def calculate_sparsity(module, param_name='weight'):
    """Calculates sparsity of a named parameter in a module."""
    if hasattr(module, param_name):
        param = getattr(module, param_name)
        if param is not None:
            return 100. * float(torch.sum(param == 0)) / float(param.nelement())
    return 0.0

In [32]:
from huggingface_hub import login
HF_TOKEN = os.environ.get("HF_TOKEN")  # read the access token from the environment; never hard-code it
login(token=HF_TOKEN)
datasets, articles = load_news_dataset("../dataset/News_Category_Dataset.json")

In [33]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
device = "cuda" if torch.cuda.is_available() else "cpu"



In [8]:
# TODO: Establish your baseline performance.

dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float32
model_new,tokenizer = load_model(model_name=model_name,quantization_config = None,device=device)
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_new, tokenizer, sample_texts, device,MAX_NEW_TOKENS)
report_metrics(times,"Baseline")
evaluate_model(datasets,generated,NUM_PREDICTION)

Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...


Tokenizer loaded successfully.
Loading model...

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Model loaded successfully and moved to device.




Below are for: Baselines
Avg Latency: 3.823s
P99 Latency: 4.207s
Throughput: 0.26 samples/sec
Max GPU Memory: 2373.51 MB



ROUGE Scores: {'rouge1': 0.06223311470956908, 'rouge2': 0.0027027027027027024, 'rougeL': 0.04828548644338118, 'rougeLsum': 0.049484078622184645}


{'rouge1': 0.06223311470956908,
 'rouge2': 0.0027027027027027024,
 'rougeL': 0.04828548644338118,
 'rougeLsum': 0.049484078622184645}

# 3. Architectural Optimization: KV Caching

**Your Task:** One of the most effective ways to speed up token generation is a Key-Value (KV) cache. It stores the key and value tensors of the tokens already processed, so each decoding step computes them only for the newest token instead of for the whole sequence. Enable the `use_cache` flag in the generation arguments and re-run the evaluation. Observe the impact on latency and throughput.
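
The effect of the cache can be seen in a toy single-head attention written in plain Python. This is a sketch under simplifying assumptions (2-dimensional token vectors, hand-picked projection matrices, no batching), not how `transformers` implements it. All names here (`decode_without_cache`, `decode_with_cache`, `W_Q`, `W_K`, `W_V`) are hypothetical.

```python
import math


def project(vec, matrix):
    """Apply a linear projection: one output value per matrix row."""
    return [sum(v * w for v, w in zip(vec, row)) for row in matrix]


def attend(query, keys, values):
    """Softmax attention of one query over all cached keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(len(values[0]))]


W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[0.3, 0.7], [0.9, 0.1]]


def decode_without_cache(tokens):
    """Recompute every key and value at each step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(x, W_K) for x in prefix]
        values = [project(x, W_V) for x in prefix]
        projections += 2 * step
        outputs.append(attend(project(prefix[-1], W_Q), keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project only the newest token and append it to the cache."""
    outputs, projections, k_cache, v_cache = [], 0, [], []
    for x in tokens:
        k_cache.append(project(x, W_K))
        v_cache.append(project(x, W_V))
        projections += 2
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_count = decode_without_cache(tokens)
fast_out, fast_count = decode_with_cache(tokens)
print(slow_count, fast_count)  # 20 8
```

Both variants produce the same attention outputs. The number of key/value projections grows quadratically with sequence length without the cache and linearly with it, at the cost of keeping the cached tensors in memory.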

In [9]:
# TODO: Evaluate the model with KV Caching enabled.
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_new, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times,"KV Caching")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for: KV Cachings
Avg Latency: 1.094s
P99 Latency: 1.117s
Throughput: 0.91 samples/sec
Max GPU Memory: 2373.51 MB
ROUGE Scores: {'rouge1': 0.06440349178782798, 'rouge2': 0.0027027027027027024, 'rougeL': 0.04832239714195846, 'rougeLsum': 0.05169932288826383}


{'rouge1': 0.06440349178782798,
 'rouge2': 0.0027027027027027024,
 'rougeL': 0.04832239714195846,
 'rougeLsum': 0.05169932288826383}

# 4. Model Compression: Pruning

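**Your Task:** Pruning removes redundant model weights, which can reduce model size and potentially speed up inference. Here, you will implement unstructured, magnitude-based pruning by creating a function that applies it to the model's linear layers and then evaluating the result. Note that unstructured pruning only sets weights to zero: the tensors keep their dense shape, so memory use and latency do not improve unless sparse storage or sparse kernels are used.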
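
As a plain-Python illustration of magnitude-based (L1) unstructured pruning, the sketch below zeroes the fraction of weights with the smallest absolute value and reports the resulting sparsity. It shows the idea behind `prune.l1_unstructured` on a flat list and is not PyTorch's implementation. The names `l1_prune` and `sparsity` and the weight values are hypothetical.

```python
def l1_prune(weights, amount):
    """Zero the `amount` fraction of entries with the smallest absolute value."""
    n_prune = round(amount * len(weights))
    # Indices ordered by |w|; the first n_prune entries are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


def sparsity(weights):
    """Percentage of entries that are exactly zero."""
    return 100.0 * sum(1 for w in weights if w == 0) / len(weights)


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6]
pruned = l1_prune(weights, amount=0.5)
print(pruned)            # [0.8, 0.0, 0.0, -0.9, 0.0, 0.4, 0.0, 0.6]
print(sparsity(pruned))  # 50.0
```

The pruned list has the same length as the original, which is why a dense tensor pruned this way occupies the same memory as before.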

In [10]:
# def prune_model_weights(model, amount=0.3):
#     """TODO: Applies L1 unstructured pruning to the linear layers of a model."""
#     print(f"\n--- Accessing Target Layer: {TARGET_LAYER_NAME_STR} ---")
#     target_module = get_module_by_name_str(model, TARGET_LAYER_NAME_STR)
#     print(f"Successfully accessed target layer of type: {type(target_module)}")

#     sparsity_before = calculate_sparsity(target_module, 'weight')
#     print(f"Sparsity of '{TARGET_LAYER_NAME_STR}.weight' BEFORE pruning: {sparsity_before:.2f}%\n")
#     print(f"--- Applying L1 unstructured pruning (amount={PRUNING_AMOUNT}) ---")
#     prune.l1_unstructured(target_module, name="weight", amount=PRUNING_AMOUNT)

#     print("Pruning hook has been applied.")
#     print(f"The layer now has a 'weight_mask' and 'weight_orig' attribute.")
#     print(f"\n--- Making pruning permanent for '{TARGET_LAYER_NAME_STR}.weight' ---")
#     prune.remove(target_module, "weight")
#     print("Pruning has been made permanent. The 'weight' attribute is now the sparse tensor.")
#     sparsity_after = calculate_sparsity(target_module, 'weight')
#     print(f"Sparsity of '{TARGET_LAYER_NAME_STR}.weight' AFTER pruning: {sparsity_after:.2f}%\n")
    
#     return model

# model_prune = prune_model_weights(model_new,PRUNING_AMOUNT)
# sample_texts = articles[:num_prediction]
# generated, times = generate_headline(model_prune, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
# report_metrics(times," Unsupervised pruning")
# evaluate_model(datasets,generated,num_prediction)


In [11]:
def get_module_by_name(model, module_name):
    """Access a submodule in a model using its string name."""
    names = module_name.split('.')
    module = model
    for name in names:
        module = getattr(module, name)
    return module
def apply_pruning(model, layers_to_prune, amount, method):
    """Apply a specified pruning method to a list of layers."""
    parameters_to_prune = []
    for layer_name in layers_to_prune:
        try:
            module = get_module_by_name(model, layer_name)
            parameters_to_prune.append((module, 'weight'))
        except AttributeError:
            print(f"Warning: Layer {layer_name} not found. Skipping.")

    if not parameters_to_prune:
        print("No valid layers found to prune.")
        return

    pruning_method_map = {
        'l1_unstructured': prune.L1Unstructured,
    }
    
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=pruning_method_map[method],
        amount=amount,
    )

    # # Make the pruning permanent
    # for module, param_name in parameters_to_prune:
    #     prune.remove(module, param_name)
    #print(f"Applied '{method}' pruning with {amount*100:.0f}% sparsity to {len(parameters_to_prune)} layers.")

def get_model_memory_footprint(model):
    """Calculates and returns the model's memory footprint in MB."""
    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
    mem_bufs = sum(buf.nelement() * buf.element_size() for buf in model.buffers())
    total_mem_bytes = mem_params + mem_bufs
    return total_mem_bytes / (1024 ** 2) # Convert bytes to MB


dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float32
strategic_pruned_model,tokenizer = load_model(model_name=model_name,device=device)
memory_baseline = get_model_memory_footprint(strategic_pruned_model)
print(f"Loaded '{'strategic_pruned_model'}' model.")
print(f"Memory Footprint: {memory_baseline:.2f} MB")


Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...
Tokenizer loaded successfully.
Loading model...
Model loaded successfully and moved to device.
Loaded 'strategic_pruned_model' model.
Memory Footprint: 2357.13 MB


In [12]:
NUM_LAYERS_TO_TARGET = 4
MLP_LAYERS = []
for i in range(NUM_LAYERS_TO_TARGET):
    MLP_LAYERS.extend([
        f"model.layers.{i}.mlp.gate_proj",
        f"model.layers.{i}.mlp.up_proj",
        f"model.layers.{i}.mlp.down_proj"
    ])
print(MLP_LAYERS)  # print once, after the loop

['model.layers.0.mlp.gate_proj', 'model.layers.0.mlp.up_proj', 'model.layers.0.mlp.down_proj', 'model.layers.1.mlp.gate_proj', 'model.layers.1.mlp.up_proj', 'model.layers.1.mlp.down_proj', 'model.layers.2.mlp.gate_proj', 'model.layers.2.mlp.up_proj', 'model.layers.2.mlp.down_proj', 'model.layers.3.mlp.gate_proj', 'model.layers.3.mlp.up_proj', 'model.layers.3.mlp.down_proj']


In [17]:
apply_pruning(strategic_pruned_model,MLP_LAYERS,PRUNING_AMOUNT,'l1_unstructured')

In [13]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(strategic_pruned_model, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times," Magnitude Unstructured pruning")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  Magnitude Unstructured prunings
Avg Latency: 1.104s
P99 Latency: 1.107s
Throughput: 0.91 samples/sec
Max GPU Memory: 4729.85 MB
ROUGE Scores: {'rouge1': 0.06440349178782798, 'rouge2': 0.0027027027027027024, 'rougeL': 0.04832239714195846, 'rougeLsum': 0.05169932288826383}


{'rouge1': 0.06440349178782798,
 'rouge2': 0.0027027027027027024,
 'rougeL': 0.04832239714195846,
 'rougeLsum': 0.05169932288826383}

In [14]:
# Clean up model from memory
clean(strategic_pruned_model,tokenizer)



Cleaned up models and emptied CUDA cache.


# 5. Model Compression: Quantization

**Your Task:** Quantization reduces the precision of model weights (e.g., from 16-bit to 8-bit or 4-bit), significantly cutting down memory usage and often speeding up inference. You will define 8-bit and 4-bit quantization configurations with `BitsAndBytesConfig` and use them to load and evaluate the model.
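
The basic idea of weight quantization can be sketched with symmetric absmax scaling to 8-bit integers: divide by a scale so the largest weight maps to 127, round, and multiply back when the weight is needed. This is a simplified illustration, not the algorithm used by `bitsandbytes`, whose kernels use more elaborate schemes. The function names and weight values are hypothetical.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(quantized)
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

Each weight now needs one byte instead of two, and the rounding error is bounded by half the scale. A single outlier weight enlarges the scale for every other weight, which is one reason production schemes quantize in smaller groups.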

In [15]:
dtype = torch.float16 

quantized_model,tokenizer = load_model(model_name=model_name,device=device)
memory_footprints = {}
memory_baseline = get_model_memory_footprint(quantized_model)
memory_footprints["baseline_name"] = f"{memory_baseline:.2f} MB"
print(f"Loaded '{'baseline_name'}' model.")
print(f"Memory Footprint: {memory_baseline:.2f} MB")

Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...
Tokenizer loaded successfully.
Loading model...
Model loaded successfully and moved to device.
Loaded 'baseline_name' model.
Memory Footprint: 2357.13 MB


In [16]:
clean(quantized_model,tokenizer)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quantization_config, 
        device_map="auto" # Recommended for bitsandbytes
    )
memory_8bit = get_model_memory_footprint(model_8bit)



Cleaned up models and emptied CUDA cache.




In [17]:
memory_footprints["quant_8bit_name"] = f"{memory_8bit:.2f} MB"
print(f"Loaded '{'quant_8bit_name'}' model.")
print(f"Memory Footprint: {memory_8bit:.2f} MB")

Loaded 'quant_8bit_name' model.
Memory Footprint: 1429.13 MB


In [19]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_8bit, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times," Quantised 8bit model")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
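
The error above reports tensors on two different GPUs. One plausible cause, not verified here, is that `device_map="auto"` spread the model over several GPUs while the inputs were sent to the fixed `device` string. A common remedy is to move the inputs to wherever the model's input embedding lives. The sketch below shows that lookup with a hypothetical helper, `input_device`. It relies on `get_input_embeddings()`, which Hugging Face `PreTrainedModel` classes provide, and uses a small stand-in object so it runs without a GPU.

```python
from types import SimpleNamespace


def input_device(model):
    """Return the device that holds the model's input embedding weights."""
    return model.get_input_embeddings().weight.device


# Stand-in for a model whose embedding was placed on cuda:3 (hypothetical).
fake_embedding = SimpleNamespace(weight=SimpleNamespace(device="cuda:3"))
fake_model = SimpleNamespace(get_input_embeddings=lambda: fake_embedding)

print(input_device(fake_model))  # cuda:3

# With a real model, the tokenizer output would be moved like this:
# inputs = tokenizer(text, return_tensors="pt").to(input_device(model_8bit))
```

Another option, under the same caveat, is to load the quantized model with `device_map={"": 0}` so that every module sits on GPU 0 and the existing `device` string keeps working.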

In [20]:
clean(model_8bit,tokenizer)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quantization_config, 
        device_map="auto" # Recommended for bitsandbytes
    )
memory_4bit = get_model_memory_footprint(model_4bit)



Cleaned up models and emptied CUDA cache.


In [21]:
memory_footprints["quant_4bit_name"] = f"{memory_4bit:.2f} MB"
print(f"Loaded '{'quant_4bit_name'}' model.")
print(f"Memory Footprint: {memory_4bit:.2f} MB")

Loaded 'quant_4bit_name' model.
Memory Footprint: 965.13 MB
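
The three footprints reported in this section can be reproduced with simple arithmetic. The sketch below assumes the publicly documented Llama-3.2-1B shape (16 layers, hidden size 2048, MLP size 8192, key/value projection width 512, vocabulary 128256, input and output embeddings tied); none of these values is stated in this notebook. It also assumes that only the linear layers inside the transformer blocks are quantized while embeddings and norms stay at 16 bits, and it counts parameter bytes only, ignoring quantization metadata.

```python
MB = 1024 ** 2

# Assumed architecture values (public Llama-3.2-1B config); not taken from this notebook.
VOCAB, HIDDEN, INTERMEDIATE, LAYERS, KV_DIM = 128256, 2048, 8192, 16, 512

embedding = VOCAB * HIDDEN                               # shared with the output head (tied weights)
norms = LAYERS * 2 * HIDDEN + HIDDEN                     # two norms per layer plus the final norm
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM    # q and o, then k and v projections
mlp = 3 * HIDDEN * INTERMEDIATE                          # gate, up, and down projections
linear = LAYERS * (attention + mlp)


def footprint_mb(bytes_per_linear_weight):
    """Linear weights at the given width; embeddings and norms stay at 2 bytes each."""
    return (linear * bytes_per_linear_weight + (embedding + norms) * 2) / MB


for name, width in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: {footprint_mb(width):.2f} MB")
# 16-bit: 2357.13 MB
# 8-bit: 1429.13 MB
# 4-bit: 965.13 MB
```

Under these assumptions the unquantized embedding table accounts for roughly 500 MB in every variant, which is why the 8-bit footprint is well above half of the 16-bit one.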


In [22]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_4bit, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times," Quantised 4bit model")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

In [23]:
clean(model_4bit,tokenizer)


Cleaned up models and emptied CUDA cache.


# 6. Distributed Inference (Multi-GPU)

**Your Task:** If you have multiple GPUs, you can split the model across them to reduce the memory burden on a single GPU and potentially improve latency. We will explore two common techniques: Tensor Parallelism and Pipeline Parallelism.

*Note: This section requires a multi-GPU environment.*

### Tensor Parallelism
Tensor parallelism splits individual model layers (the tensors) across multiple GPUs. Operations like matrix multiplications are executed in parallel on different GPUs, and the results are aggregated. This is highly effective for reducing the per-GPU memory footprint of very large layers. Note that `device_map="auto"` from the `accelerate` library does not split tensors: it assigns whole layers to different GPUs and runs them one after another, so it is layer-wise model placement rather than true tensor parallelism.
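
The core of tensor parallelism can be shown with a column-wise split of one weight matrix. In this plain-Python sketch each shard stands in for one GPU, and concatenating the partial results plays the role of the all-gather step. The helper names and the small matrices are hypothetical, and real implementations also handle communication and row-wise splits.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated device."""
    cols = len(w[0]) // parts
    return [[row[p * cols:(p + 1) * cols] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)                                  # single-device result
shards = split_columns(w, parts=2)
partial = [matmul(x, shard) for shard in shards]     # each shard would run on its own GPU
gathered = [v for part in partial for v in part]     # concatenate the output slices

print(full)      # [11.0, 14.0, 17.0, 20.0]
print(gathered)  # [11.0, 14.0, 17.0, 20.0]
```

Each simulated device stores only half of the weight columns, yet the gathered output matches the single-device result.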

### Pipeline Parallelism
Pipeline parallelism assigns entire layers or blocks of layers to different GPUs, creating a sequence or "pipeline" that the data flows through. For example, layers 1-10 run on GPU 0, layers 11-20 run on GPU 1, and so on. This is useful for very deep models whose full stack of layers does not fit on a single GPU; splitting the input into micro-batches keeps the stages busy at the same time.
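
The layer-to-GPU assignment behind pipeline parallelism can be sketched as a contiguous partition of layer indices. The plain-Python example below uses toy "layers" and only records which simulated GPU would run each one; the function names are hypothetical, and it leaves out the micro-batch scheduling that real pipeline engines add.

```python
def assign_stages(num_layers, num_gpus):
    """Map each layer index to a GPU index using contiguous, near-equal blocks."""
    base, extra = divmod(num_layers, num_gpus)
    mapping = {}
    layer = 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # earlier GPUs absorb the remainder
        for _ in range(size):
            mapping[layer] = gpu
            layer += 1
    return mapping


def run_pipeline(value, layers, mapping):
    """Apply the layers in order and record (layer, gpu) for each step."""
    trace = []
    for index, layer_fn in enumerate(layers):
        value = layer_fn(value)
        trace.append((index, mapping[index]))
    return value, trace


# Eight toy layers; layer k simply adds k to its input.
layers = [lambda v, k=k: v + k for k in range(8)]
mapping = assign_stages(len(layers), num_gpus=4)
output, trace = run_pipeline(0, layers, mapping)

print(mapping)  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
print(output)   # 28
```

Because each stage waits for the previous one, a single request gains no speed from this layout. The benefit is that every GPU holds only its own block of layers.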

In [24]:
print(f"PyTorch version: {torch.__version__}")
#print(f"DeepSpeed version: {deepspeed.__version__}")
print(f"CUDA is available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Number of GPUs available: {torch.cuda.device_count()}")
    if torch.cuda.device_count() < 4:
        print("!! WARNING: This demo is designed for 4 GPUs. It may not run correctly. !!")

PyTorch version: 2.6.0+cu124
CUDA is available: True
Number of GPUs available: 4


In [8]:
import os
os.environ["CUDA_HOME"] = "/usr/local/cuda-12.1"  # `!cuda_home = ...` is not valid shell syntax, and `!` commands run in a subshell


In [9]:
ls /usr/local/ | grep cuda

cuda@
cuda-12.1/
cuda-12.2/
cuda-12.3/
cuda-12.4/


In [10]:
ls /usr/local/cuda-12.2/bin/nvcc

/usr/local/cuda-12.2/bin/nvcc*


In [11]:
# import subprocess

# nvcc_path = subprocess.check_output(["which", "nvcc"], universal_newlines=True).strip()
# output = subprocess.check_output([nvcc_path, "-V"], universal_newlines=True)
# print(output)

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0



In [12]:
# pip uninstall -y torch 

Found existing installation: torch 2.6.0+cu118
Uninstalling torch-2.6.0+cu118:
  Successfully uninstalled torch-2.6.0+cu118
Note: you may need to restart the kernel to use updated packages.


In [13]:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121


In [6]:
# import torch
# print(torch.__version__)



2.5.1+cu121


In [12]:
# !export CUDA_HOME=/usr/local/cuda-12.2
# !export PATH=$CUDA_HOME/bin:$PATH

In [17]:
# import os
# cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")

In [19]:
!sudo ln -s /usr/local/cuda-12.2 /usr/local/cuda-11.8

In [21]:
pip install deepspeed


In [22]:
import torch
import time
from evaluate import load as load_metric

# Load ROUGE once
rouge = load_metric("rouge")

def generate_and_benchmark(model, tokenizer, texts, references, label="KV-Caching", use_cache=True, assistant_model=None):
    predictions = []
    latencies = []

    model.eval()
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
            start = time.time()
            outputs = model.generate(
                **inputs,
                max_new_tokens=30,
                use_cache=use_cache,
                do_sample=False,
                assistant_model=assistant_model  # Optional for speculative decoding
            )
            end = time.time()
            latencies.append(end - start)
            pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
            predictions.append(pred)

    # Compute metrics
    avg_latency = sum(latencies) / len(latencies)
    p99_latency = sorted(latencies)[int(0.99 * len(latencies))]
    throughput = len(latencies) / sum(latencies)
    gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 2)
    rouge_scores = rouge.compute(predictions=predictions, references=references)

    print(f"\n[{label}]")
    print(f"Avg Latency: {avg_latency:.3f}s")
    print(f"P99 Latency: {p99_latency:.3f}s")
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Max GPU Memory: {gpu_memory:.2f} MB")
    # `evaluate`'s ROUGE metric returns plain floats per key (no `.mid.fmeasure`).
    print(f"ROUGE-1: {rouge_scores['rouge1']:.4f}")
    print(f"ROUGE-2: {rouge_scores['rouge2']:.4f}")
    print(f"ROUGE-L: {rouge_scores['rougeL']:.4f}")

    return {
        "Variant": label,
        "Avg Latency (s)": round(avg_latency, 3),
        "P99 Latency (s)": round(p99_latency, 3),
        "Throughput (samples/sec)": round(throughput, 2),
        "Max GPU Memory (MB)": round(gpu_memory, 2),
        "ROUGE-1": round(rouge_scores["rouge1"], 4),
        "ROUGE-2": round(rouge_scores["rouge2"], 4),
        "ROUGE-L": round(rouge_scores["rougeL"], 4)
    }
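
The aggregation step in `generate_and_benchmark` can also be factored into a small helper that is testable without a GPU. The sketch below is an illustration, not part of the original notebook: `summarize_latencies` is a hypothetical name, and it uses the nearest-rank definition of P99, which can select a different element than the `int(0.99 * n)` index used above.

```python
import math
from typing import Dict, List


def summarize_latencies(latencies: List[float]) -> Dict[str, float]:
    """Aggregate per-sample latencies (in seconds) into summary metrics."""
    if not latencies:
        raise ValueError("latencies must contain at least one measurement")
    ordered = sorted(latencies)
    n = len(ordered)
    # Nearest-rank P99: smallest value with at least 99% of samples at or below it.
    rank = math.ceil(99 * n / 100)  # 1-based rank
    total = sum(ordered)
    return {
        "avg_latency_s": total / n,
        "p99_latency_s": ordered[rank - 1],
        "throughput_samples_per_s": n / total,
    }


summary = summarize_latencies([1.0, 2.0, 3.0, 4.0])
print(summary)
```

With only a handful of samples, P99 is effectively the slowest request, so it is worth reporting the sample count next to it.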

In [46]:
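# TODO: Check for multi-GPU environment and evaluate with Tensor Parallelism.
# Note: `device_map="auto"` (as used in `load_model`) places whole layers on the available
# GPUs via Accelerate; it does not split individual weight matrices across devices, so this
# run approximates a multi-GPU setup rather than true tensor parallelism.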

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Load model with automatic device mapping
model_tp = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    device_map="auto",  # Spreads whole layers across available GPUs (not true tensor parallelism)
    torch_dtype=torch.float16
)
model_tp.eval()

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model_tp.config.pad_token_id = model_tp.config.eos_token_id


In [48]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_tp, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
report_metrics(times," Tensor Parallel")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  Tensor Parallels
Avg Latency: 1.427s
P99 Latency: 1.429s
Throughput: 0.70 samples/sec
Max GPU Memory: 3605.27 MB
ROUGE Scores: {'rouge1': 0.06903897053835856, 'rouge2': 0.005234348272322956, 'rougeL': 0.05600198089488176, 'rougeLsum': 0.05687975724695431}


{'rouge1': 0.06903897053835856,
 'rouge2': 0.005234348272322956,
 'rougeL': 0.05600198089488176,
 'rougeLsum': 0.05687975724695431}

In [49]:
# references = [item for item in datasets[:NUM_PREDICTION]['headline']]
# results_tp = generate_and_benchmark(
#     model=model_tp,
#     tokenizer=tokenizer,
#     texts=sample_texts,
#     references=references,
#     label="Tensor Parallel",
#     use_cache=True
# )

In [None]:
# TODO: Evaluate with Pipeline Parallelism.
# This is more advanced and may require manually defining a device_map to assign
# different layers of the model to different GPUs.
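
One way to approach the manual `device_map` mentioned in the TODO is to assign contiguous blocks of decoder layers to each GPU. The helper below is a minimal sketch under stated assumptions: `build_pipeline_device_map` is a hypothetical name, the module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) assume the Llama layout in `transformers`, and some library versions may expose extra top-level modules that also need an entry. Check `model.named_modules()` before relying on it.

```python
from typing import Dict


def build_pipeline_device_map(num_layers: int, num_gpus: int) -> Dict[str, int]:
    """Assign contiguous blocks of decoder layers to GPUs 0..num_gpus-1."""
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must be positive")
    # Assumption: the embeddings live on the first stage.
    device_map = {"model.embed_tokens": 0}
    per_stage = -(-num_layers // num_gpus)  # ceiling division
    for layer in range(num_layers):
        device_map[f"model.layers.{layer}"] = layer // per_stage
    # The final norm follows the last decoder layer.
    device_map["model.norm"] = device_map[f"model.layers.{num_layers - 1}"]
    # Assumption: lm_head shares (ties) weights with the embeddings,
    # so it is kept on the same device as model.embed_tokens.
    device_map["lm_head"] = 0
    return device_map


# 16 layers is assumed here; read the real value from config.num_hidden_layers.
device_map = build_pipeline_device_map(num_layers=16, num_gpus=2)
print(device_map)

# Hypothetical usage (requires two GPUs and access to the checkpoint):
# model_pp = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-1B", device_map=device_map, torch_dtype=torch.float16
# )
```

A layer-wise split like this runs the stages one after another for each request, so it mainly lowers per-GPU memory; without micro-batching it should not be expected to reduce single-request latency.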

# 7. Advanced Decoding: Speculative Decoding

**Your Task:** Speculative decoding uses a smaller, faster "draft" model to generate several candidate tokens. A larger, more accurate "target" model then verifies these tokens in a single forward pass. This can significantly speed up generation if the draft model is a good predictor. You will load a larger target model and a smaller draft model, benchmark the target model alone, and then benchmark it with assistance from the draft model.
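
The draft-then-verify loop described above can be sketched with toy "models" that map a token context to the next token. This is an illustration of the greedy variant only, not the `transformers` implementation: `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical names, and the counter of target passes stands in for the single batched forward pass a real target model would run per round.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < limit:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(min(k, limit - len(tokens))):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2) The target checks every proposed position (one batched pass in practice).
        target_passes += 1
        accepted, ctx = [], list(tokens)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # keep the target's correction, drop the rest
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Every draft token matched: the same pass also yields one bonus token.
            if len(tokens) + len(accepted) < limit:
                accepted.append(target(ctx))
        tokens.extend(accepted)
    return tokens, target_passes


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 2, to force one rejection.
    return 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


spec_tokens, passes = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
print(spec_tokens, passes)
```

Because every accepted token is one the target would have chosen itself, the greedy variant reproduces the target-only output; the saving comes from needing fewer target passes than generated tokens.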

In [52]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_pp, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
report_metrics(times," Pipeline Parallel")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  Pipeline Parallels
Avg Latency: 1.431s
P99 Latency: 1.437s
Throughput: 0.70 samples/sec
Max GPU Memory: 4222.28 MB
ROUGE Scores: {'rouge1': 0.06903897053835856, 'rouge2': 0.005234348272322956, 'rougeL': 0.05600198089488176, 'rougeLsum': 0.05687975724695431}


{'rouge1': 0.06903897053835856,
 'rouge2': 0.005234348272322956,
 'rougeL': 0.05600198089488176,
 'rougeLsum': 0.05687975724695431}

In [56]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Load Draft Model (1B)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto"
)
draft_model.eval()

# Load Target Model (8B)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto"
)
target_model.eval()

# Load ROUGE
rouge = load_metric("rouge")



In [67]:
def run_speculative_decoding(texts, references, max_new_tokens=30):
    predictions = []
    latencies = []

    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(target_model.device)
            start = time.time()
            outputs = target_model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                use_cache=True,
                do_sample=False,
                assistant_model=draft_model  # Enables speculative decoding
            )
            end = time.time()
            latencies.append(end - start)
            pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
            predictions.append(pred)

    # Metrics
    avg_latency = sum(latencies) / len(latencies)
    p99_latency = sorted(latencies)[int(0.99 * len(latencies))]
    throughput = len(latencies) / sum(latencies)
    gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 2)
    rouge_scores = rouge.compute(predictions=predictions, references=references)

    print(f"\n[Speculative Decoding]")
    print(f"Avg Latency: {avg_latency:.3f}s")
    print(f"P99 Latency: {p99_latency:.3f}s")
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Max GPU Memory: {gpu_memory:.2f} MB")
    print("ROUGE Scores:", rouge_scores)
    

    return {
        "Variant": "Speculative Decoding",
        "Avg Latency (s)": round(avg_latency, 3),
        "P99 Latency (s)": round(p99_latency, 3),
        "Throughput (samples/sec)": round(throughput, 2),
        "Max GPU Memory (MB)": round(gpu_memory, 2)
    }

In [68]:
def run_target_only(texts, references):
    return generate_and_benchmark(
        model=target_model,
        tokenizer=tokenizer,
        texts=texts,
        references=references,
        label="Target Model Only",
        use_cache=True
    )

In [69]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    # Configure the model actually used in this section (was `model_tp`).
    target_model.config.pad_token_id = target_model.config.eos_token_id


sample_texts = articles[:NUM_PREDICTION]
references = [item for item in datasets[:NUM_PREDICTION]['headline']]
results_speculative = run_speculative_decoding(sample_texts, references)
# generated, times = generate_headline(model_tp, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
# report_metrics(times," Tensor Parallel")
# evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[Speculative Decoding]
Avg Latency: 2.588s
P99 Latency: 4.107s
Throughput: 0.39 samples/sec
Max GPU Memory: 7997.36 MB
ROUGE Scores: {'rouge1': 0.10497653392172682, 'rouge2': 0.013333333333333332, 'rougeL': 0.08647341113848431, 'rougeLsum': 0.08954642064636534}


In [72]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(target_model, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
report_metrics(times," target_model")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  target_models
Avg Latency: 3.475s
P99 Latency: 3.480s
Throughput: 0.29 samples/sec
Max GPU Memory: 7997.36 MB
ROUGE Scores: {'rouge1': 0.09330471222564926, 'rouge2': 0.010526315789473684, 'rougeL': 0.07611139601895961, 'rougeLsum': 0.08449443301989853}


{'rouge1': 0.09330471222564926,
 'rouge2': 0.010526315789473684,
 'rougeL': 0.07611139601895961,
 'rougeLsum': 0.08449443301989853}
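
To put the two runs above side by side, the reported averages can be turned into a speedup figure. The snippet below only restates numbers printed in the outputs (target only: 3.475 s average, 3.480 s P99; speculative decoding: 2.588 s average, 4.107 s P99); the variable names are illustrative.

```python
# Values copied from the benchmark outputs above.
target_only_avg_s = 3.475
speculative_avg_s = 2.588
target_only_p99_s = 3.480
speculative_p99_s = 4.107

speedup = target_only_avg_s / speculative_avg_s
latency_reduction = 1 - speculative_avg_s / target_only_avg_s
p99_ratio = speculative_p99_s / target_only_p99_s

print(f"Average-latency speedup: {speedup:.2f}x")
print(f"Average-latency reduction: {latency_reduction:.1%}")
print(f"P99 ratio (speculative / target only): {p99_ratio:.2f}")
```

The average improves by roughly 1.34x, while the P99 figure is higher with speculative decoding, which is consistent with some prompts having a low draft acceptance rate. With this few samples, the P99 comparison should be read cautiously.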

# 8. Final Report and Analysis

**Your Task:** Consolidate your findings into a summary report. 

1.  Fill in the Markdown table below with the **Latency**, **Throughput**, and **ROUGE scores** for each optimization technique you implemented.
2. Compile the final Project Report in PDF format:
    *   Document the entire process, detailing the methodology, techniques, and libraries used.
    *   Present the final benchmark results clearly.
    *   Provide a thorough analysis of the trade-offs between performance, resources, and quality for each optimization step.
    *   Conclude with recommendations for the most effective optimization strategy for this specific headline generation task, supported by your data.

Some example questions for discussing the trade-offs:

*   Which method gave the best performance improvement?
*   Did any methods significantly hurt the ROUGE score (quality)?
*   Which optimization would you recommend for deployment in a production environment at the news portal, and why? Consider factors like cost, complexity, and performance.

## Performance Comparison

| Optimization Technique   | Mean Latency (s) | Throughput (samples/s) | ROUGE-1 Score |
|--------------------------|------------------|------------------------|---------------|
| Baseline (No Cache)      | TODO             | TODO                   | TODO          |
| KV Caching               | TODO             | TODO                   | TODO          |
| Pruning (30%)            | TODO             | TODO                   | TODO          |
| Quantization (4-bit)     | TODO             | TODO                   | TODO          |
| Tensor Parallelism       | 1.427            | 0.70                   | 0.0690        |
| Pipeline Parallelism     | 1.431            | 0.70                   | 0.0690        |
| Speculative Decoding     | 2.588            | 0.39                   | 0.1050        |
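
Rather than copying numbers by hand, the comparison table can be generated from the result dictionaries returned by the benchmark functions. The helper below is a sketch: `to_markdown_table` is a hypothetical name, the column keys mirror those returned by `generate_and_benchmark`, and missing keys (for example, `run_speculative_decoding` returns no ROUGE entry) fall back to `TODO`.

```python
from typing import Dict, List

COLUMNS = ["Variant", "Avg Latency (s)", "Throughput (samples/sec)", "ROUGE-1"]


def to_markdown_table(results: List[Dict[str, object]]) -> str:
    """Render benchmark result dicts as a Markdown table."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "|".join("---" for _ in COLUMNS) + "|"
    rows = []
    for result in results:
        # Missing metrics are shown as TODO instead of raising a KeyError.
        cells = [str(result.get(col, "TODO")) for col in COLUMNS]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join([header, divider] + rows)


# Example rows using values reported earlier in this notebook.
example_results = [
    {"Variant": "Tensor Parallel", "Avg Latency (s)": 1.427,
     "Throughput (samples/sec)": 0.7, "ROUGE-1": 0.069},
    {"Variant": "Speculative Decoding", "Avg Latency (s)": 2.588,
     "Throughput (samples/sec)": 0.39},
]
table = to_markdown_table(example_results)
print(table)
```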

---

