# UdaciHeadline: LLM Inference Optimization Project

## Project Introduction
Large Language Models (LLMs) are transforming content creation, but deploying them efficiently remains a major hurdle. Imagine you're an ML Engineer at a bustling online news portal. Your key task? Automatically generating catchy headlines from article summaries using an LLM. The problem? The current inference process is sluggish, causing publication delays and driving up operational costs. In this project, UdaciHeadline, you'll step into this role and tackle this critical challenge head-on. Your mission is to accelerate the headline generation pipeline significantly by applying state-of-the-art LLM inference optimization techniques. Get ready to dive deep into practical optimization and deployment!

## Project Summary
This project provides hands-on experience in optimizing the inference performance of a pre-trained Large Language Model (like Llama-3.2-1B) for news headline generation. You will bring together concepts of LLM architecture, optimization techniques, and deployment frameworks. Specifically, you will:

1.  **Establish a baseline** inference pipeline and profile its performance.
2.  Implement and evaluate architectural optimizations like **KV-caching**.
3.  Apply model compression techniques like **quantization** and **pruning**.
4.  Configure and benchmark **distributed inference** using Tensor and Pipeline Parallelism.
5.  Apply advanced decoding mechanisms like **speculative decoding**.
6.  Perform comprehensive **benchmarking and analysis** across all stages.
7.  Produce a **final report** summarizing findings and trade-offs.

## Imports and Global Configuration

Let's import the libraries we'll use throughout the project and define some constants like the model name and the prompt template.

In [1]:
!pip install evaluate
!pip install --upgrade transformers
!pip install rouge_score
!pip install bitsandbytes
!pip install accelerate


In [2]:
import os
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from evaluate import load as load_metric

from time import time as get_time
from pprint import pprint
import torch.nn.utils.prune as prune
import copy
#os.environ["HF_HUB_OFFLINE"] = "1"
# ---- Constants ----

MODEL_NAME = "meta-llama/Llama-3.2-1B"
MAX_NEW_TOKENS = 50 # Max length for the generated headline

#PROMPT = \
# print(f"Prompt: \"{PROMPT}\"")
TARGET_LAYER_NAME_STR = "model.layers.0.mlp.gate_proj"

# We will prune 50% of the weights in this layer
PRUNING_AMOUNT = 0.5 
NUM_PREDICTION = 10



## Data Loading

We will use the "News Category Dataset" from Kaggle. The `kagglehub` library makes it easy to download; the code below assumes the dataset's JSON file is already available locally. Your task is to implement the function to load and preprocess the data according to the docstring.

In [3]:

def load_news_dataset(path):
    """Load the first 1,000 records of the JSON dataset and return (dataset, articles).

    `articles` holds the `short_description` field of each record, used as the model input.
    """
    dataset = load_dataset("json", data_files=path, split="train[:1000]")
    articles = [item["short_description"] for item in dataset]

    return dataset, articles
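
The preprocessing step can also be sketched without any external library. The snippet below is a minimal illustration, assuming the file stores one JSON object per line with `headline` and `short_description` fields (the two fields this notebook reads). The helper name `extract_pairs` and the sample records are hypothetical.

```python
import json


def extract_pairs(json_lines, limit=None):
    """Return (summary, headline) pairs, skipping records that miss either field."""
    pairs = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        summary = record.get("short_description", "").strip()
        headline = record.get("headline", "").strip()
        if summary and headline:  # an empty summary gives the model nothing to work with
            pairs.append((summary, headline))
        if limit is not None and len(pairs) >= limit:
            break
    return pairs


# Hypothetical sample records, made up for illustration only
sample = "\n".join([
    json.dumps({"headline": "Rain Expected", "short_description": "Storms move in tonight."}),
    json.dumps({"headline": "Empty Summary", "short_description": ""}),
    json.dumps({"headline": "Markets Rally", "short_description": "Stocks rose on Friday."}),
])
pairs = extract_pairs(sample)
print(pairs)
```

Filtering out empty summaries before benchmarking keeps degenerate prompts from skewing both latency and ROUGE.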


# 2. Baseline Performance

Before we can optimize, we need a starting point. Here, you'll establish the baseline performance of the `Llama-3.2-1B` model without any specific optimizations. We will measure latency, throughput, and the quality of the generated headlines using the ROUGE score.
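
To make the quality metric concrete, here is a simplified sketch of ROUGE-1 F1 as unigram overlap. It assumes lowercase whitespace tokenization, which is cruder than what the `rouge_score` package does, so the numbers will not match the library exactly. The function name `rouge1_f1` is illustrative.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat", "the cat sat on the mat"))
```

Because precision divides by the prediction length, a generated text that is much longer than the reference headline scores low even when it contains the right words.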

### Your Task: Implement the Evaluation Pipeline
You need to implement the core functions for loading a model, generating a headline, and evaluating performance. These functions will be reused for every optimization technique.

In [4]:
def load_model(model_name, quantization_config=None, device="cpu"):
    """Load the tokenizer and model, and return the model in eval mode on `device`."""
    dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float32
    print(f"Using model: {model_name}")
    print(f"Using device: {device}")
    print(f"Using dtype: {dtype}")

    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print("Tokenizer loaded successfully.")
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(model_name, dtype=dtype, quantization_config=quantization_config)
    if quantization_config is None:
        # Quantized models are placed on a device at load time; calling .to() on them can raise an error.
        model = model.to(device)
    print("Model loaded successfully and moved to device.")
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    return model, tokenizer

def generate_headline(model, tokenizer, texts, device, max_length, use_cache=False):
    """Generate one headline per input text and record the latency of each generate() call.

    `max_length` is passed to generate() as `max_new_tokens`.
    """
    headlines = []
    latencies = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
            start = get_time()
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_length,
                use_cache=use_cache,  # False for the baseline (no KV caching)
                do_sample=False
            )
            end = get_time()
            latencies.append(end - start)
            # outputs[0] holds the prompt tokens followed by the new tokens, so the
            # decoded string also contains the input text.
            headline = tokenizer.decode(outputs[0], skip_special_tokens=True)
            headlines.append(headline)
    return headlines, latencies

def report_metrics(times, section):
    """Print average and P99 latency, throughput, and peak GPU memory for one benchmark run."""
    avg_latency = sum(times) / len(times)
    p99_latency = sorted(times)[int(0.99 * len(times))]
    throughput = len(times) / sum(times)
    # Peak allocation since the process started (or since the peak statistics were last reset)
    gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"Below are for: {section}s")
    print(f"Avg Latency: {avg_latency:.3f}s")
    print(f"P99 Latency: {p99_latency:.3f}s")
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Max GPU Memory: {gpu_memory:.2f} MB")

def evaluate_model(dataset, generated, num_prediction):
    """Compute ROUGE scores of the generated headlines against the reference headlines."""
    rouge = load_metric("rouge")
    references = list(dataset[:num_prediction]["headline"])
    results = rouge.compute(predictions=generated, references=references)
    print("ROUGE Scores:", results)

    return results

def clean(model, tokenizer):
    # Drop the local references and release cached CUDA memory.
    # The caller's own variables still point at the objects, so the caller must
    # also `del` them before the memory can actually be freed.
    del model
    del tokenizer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("\nCleaned up models and emptied CUDA cache.")
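
The latency statistics can be checked on plain lists without a GPU. The sketch below is one possible formulation, assuming a nearest-rank percentile and an optional per-sample token count for a tokens-per-second figure. `nearest_rank_percentile` and `summarize_latencies` are hypothetical helper names, not part of the notebook.

```python
import math


def nearest_rank_percentile(values, pct):
    """Smallest value such that at least pct% of the samples are at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_latencies(latencies, tokens_per_sample=None):
    """Return mean, P50, P99, and throughput for a list of per-request latencies."""
    total = sum(latencies)
    summary = {
        "mean_s": total / len(latencies),
        "p50_s": nearest_rank_percentile(latencies, 50),
        "p99_s": nearest_rank_percentile(latencies, 99),
        "samples_per_s": len(latencies) / total,
    }
    if tokens_per_sample is not None:
        # Token throughput needs the number of generated tokens per request
        summary["tokens_per_s"] = sum(tokens_per_sample) / total
    return summary


summary = summarize_latencies([1.0, 2.0, 3.0, 4.0], tokens_per_sample=[50, 50, 50, 50])
print(summary)
```

With only ten samples, as in this notebook, the P99 value is simply the slowest request, so it is worth reporting the sample count next to it.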

In [5]:
def get_module_by_name_str(model, module_name_str):
    """Gets a module from a model using its string name (e.g., 'model.layers.0.mlp.gate_proj')."""
    names = module_name_str.split('.')
    current_module = model
    for name_part in names:
        if hasattr(current_module, name_part):
            current_module = getattr(current_module, name_part)
        else:
            try: # Handle numeric indices in ModuleLists
                idx = int(name_part)
                current_module = current_module[idx]
            except (ValueError, TypeError, IndexError):
                raise AttributeError(f"Could not resolve name part '{name_part}' in '{module_name_str}'.")
    return current_module

def calculate_sparsity(module, param_name='weight'):
    """Calculates sparsity of a named parameter in a module."""
    if hasattr(module, param_name):
        param = getattr(module, param_name)
        if param is not None:
            return 100. * float(torch.sum(param == 0)) / float(param.nelement())
    return 0.0

In [6]:
from huggingface_hub import login
HF_TOKEN = os.environ.get("HF_TOKEN", "")  # read the token from the environment instead of hard-coding it
login(token=HF_TOKEN)
datasets, articles = load_news_dataset("../dataset/News_Category_Dataset.json")

Generating train split: 209527 examples [00:00, 326207.49 examples/s]


In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
device = "cuda" if torch.cuda.is_available() else "cpu"



In [8]:
# TODO: Establish your baseline performance.

# load_model() selects the dtype itself; the baseline uses no quantization config.
model_new, tokenizer = load_model(model_name=model_name, device=device)
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_new, tokenizer, sample_texts, device,MAX_NEW_TOKENS)
report_metrics(times,"Baseline")
evaluate_model(datasets,generated,NUM_PREDICTION)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for: Baselines
Avg Latency: 4.185s
P99 Latency: 4.795s
Throughput: 0.24 samples/sec
Max GPU Memory: 2372.02 MB
ROUGE Scores: {'rouge1': 0.051078657630867316, 'rouge2': 0.0, 'rougeL': 0.04266694403994176, 'rougeLsum': 0.04434559579139584}


{'rouge1': 0.051078657630867316,
 'rouge2': 0.0,
 'rougeL': 0.04266694403994176,
 'rougeLsum': 0.04434559579139584}

# 3. Architectural Optimization: KV Caching

**Your Task:** One of the most effective ways to speed up token generation is a Key-Value (KV) cache. It stores the key and value tensors of tokens that have already been processed, so each decoding step computes them only for the newest token instead of for the whole sequence. Enable the `use_cache` flag in the generation arguments, re-run the evaluation, and observe the impact on latency and throughput.
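
The effect of the cache can be illustrated with a toy single-head attention loop in plain Python. This is only a sketch: the elementwise `project` function stands in for a real linear layer, and the weights and embeddings are made-up values. It counts how many token positions have their keys and values computed with and without a cache.

```python
import math


def project(vec, weight):
    """Tiny stand-in for a linear projection: elementwise scaling."""
    return [v * w for v, w in zip(vec, weight)]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all available positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


W_Q, W_K, W_V = [0.5, -1.0], [1.5, 0.25], [2.0, 1.0]  # made-up projection weights


def decode(embeddings, use_cache):
    """Return per-step attention outputs and the number of K/V position computations."""
    outputs, kv_computations = [], 0
    cached_keys, cached_values = [], []
    for step, emb in enumerate(embeddings):
        if use_cache:
            # Project only the newest token and append it to the cache
            cached_keys.append(project(emb, W_K))
            cached_values.append(project(emb, W_V))
            kv_computations += 1
            keys, values = cached_keys, cached_values
        else:
            # Recompute keys and values for the whole prefix at every step
            prefix = embeddings[: step + 1]
            keys = [project(e, W_K) for e in prefix]
            values = [project(e, W_V) for e in prefix]
            kv_computations += len(prefix)
        outputs.append(attend(project(emb, W_Q), keys, values))
    return outputs, kv_computations


embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
no_cache_out, no_cache_cost = decode(embeddings, use_cache=False)
cache_out, cache_cost = decode(embeddings, use_cache=True)
print(no_cache_cost, cache_cost)  # 10 4
```

Both runs produce identical outputs; only the amount of repeated work differs, growing quadratically with sequence length without the cache and linearly with it.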

In [9]:
# TODO: Evaluate the model with KV Caching enabled.
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_new, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times,"KV Caching")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for: KV Cachings
Avg Latency: 1.225s
P99 Latency: 1.286s
Throughput: 0.82 samples/sec
Max GPU Memory: 2372.02 MB
ROUGE Scores: {'rouge1': 0.058109756940126295, 'rouge2': 0.0, 'rougeL': 0.04490838618745595, 'rougeLsum': 0.0462616749415655}


{'rouge1': 0.058109756940126295,
 'rouge2': 0.0,
 'rougeL': 0.04490838618745595,
 'rougeLsum': 0.0462616749415655}

# 4. Model Compression: Pruning

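**Your Task:** Pruning removes redundant model weights, which can reduce model size and potentially speed up inference. Here, you will implement unstructured, magnitude-based (L1) pruning: write a function that applies it to the model's linear layers, then evaluate the result. Note that `torch.nn.utils.prune` only masks weights to zero inside dense tensors, so on its own it changes neither the stored size nor the latency; real gains require sparse storage formats or sparse kernels.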
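
As a rough illustration of what magnitude-based pruning does to a single weight tensor, the sketch below zeroes the smallest-magnitude half of a flat list of weights. It is a plain-Python stand-in for the idea, not the `torch.nn.utils.prune` implementation, and the weights are made-up values.

```python
def l1_unstructured_prune(weights, amount):
    """Zero the `amount` fraction of weights with the smallest absolute value."""
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be in [0, 1]")
    n_prune = round(amount * len(weights))
    # Indices ordered by magnitude; the first n_prune entries are removed
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(by_magnitude[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


def sparsity(weights):
    """Percentage of entries that are exactly zero."""
    return 100.0 * sum(1 for w in weights if w == 0) / len(weights)


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4]
pruned = l1_unstructured_prune(weights, amount=0.5)
print(pruned)            # [0.8, 0.0, 0.0, -0.9, 0.0, 0.4]
print(sparsity(pruned))  # 50.0
```

The pruned list has the same length as the original, which mirrors why a dense tensor with 50% zeros takes up the same memory as before.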

In [13]:
# def prune_model_weights(model, amount=0.3):
#     """TODO: Applies L1 unstructured pruning to the linear layers of a model."""
#     print(f"\n--- Accessing Target Layer: {TARGET_LAYER_NAME_STR} ---")
#     target_module = get_module_by_name_str(model, TARGET_LAYER_NAME_STR)
#     print(f"Successfully accessed target layer of type: {type(target_module)}")

#     sparsity_before = calculate_sparsity(target_module, 'weight')
#     print(f"Sparsity of '{TARGET_LAYER_NAME_STR}.weight' BEFORE pruning: {sparsity_before:.2f}%\n")
#     print(f"--- Applying L1 unstructured pruning (amount={PRUNING_AMOUNT}) ---")
#     prune.l1_unstructured(target_module, name="weight", amount=PRUNING_AMOUNT)

#     print("Pruning hook has been applied.")
#     print(f"The layer now has a 'weight_mask' and 'weight_orig' attribute.")
#     print(f"\n--- Making pruning permanent for '{TARGET_LAYER_NAME_STR}.weight' ---")
#     prune.remove(target_module, "weight")
#     print("Pruning has been made permanent. The 'weight' attribute is now the sparse tensor.")
#     sparsity_after = calculate_sparsity(target_module, 'weight')
#     print(f"Sparsity of '{TARGET_LAYER_NAME_STR}.weight' AFTER pruning: {sparsity_after:.2f}%\n")
    
#     return model

# model_prune = prune_model_weights(model_new,PRUNING_AMOUNT)
# sample_texts = articles[:NUM_PREDICTION]
# generated, times = generate_headline(model_prune, tokenizer, sample_texts, device,MAX_NEW_TOKENS,True)
# report_metrics(times," Unsupervised pruning")
# evaluate_model(datasets,generated,NUM_PREDICTION)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



--- Accessing Target Layer: model.layers.0.mlp.gate_proj ---
Successfully accessed target layer of type: <class 'torch.nn.modules.linear.Linear'>
Sparsity of 'model.layers.0.mlp.gate_proj.weight' BEFORE pruning: 0.00%

--- Applying L1 unstructured pruning (amount=0.5) ---
Pruning hook has been applied.
The layer now has a 'weight_mask' and 'weight_orig' attribute.

--- Making pruning permanent for 'model.layers.0.mlp.gate_proj.weight' ---
Pruning has been made permanent. The 'weight' attribute is now the sparse tensor.
Sparsity of 'model.layers.0.mlp.gate_proj.weight' AFTER pruning: 50.00%



Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  Unsupervised prunings
Avg Latency: 1.257s
P99 Latency: 1.400s
Throughput: 0.80 samples/sec
Max GPU Memory: 5124.62 MB
ROUGE Scores: {'rouge1': 0.06384811680437281, 'rouge2': 0.0, 'rougeL': 0.05183219399724151, 'rougeLsum': 0.05374848337669946}


{'rouge1': 0.06384811680437281,
 'rouge2': 0.0,
 'rougeL': 0.05183219399724151,
 'rougeLsum': 0.05374848337669946}

In [12]:
def get_module_by_name(model, module_name):
    """Access a submodule in a model using its string name."""
    names = module_name.split('.')
    module = model
    for name in names:
        module = getattr(module, name)
    return module

def apply_pruning(model, layers_to_prune, amount, method):
    """Apply a specified pruning method to a list of layers."""
    parameters_to_prune = []
    for layer_name in layers_to_prune:
        try:
            module = get_module_by_name(model, layer_name)
            parameters_to_prune.append((module, 'weight'))
        except AttributeError:
            print(f"Warning: Layer {layer_name} not found. Skipping.")

    if not parameters_to_prune:
        print("No valid layers found to prune.")
        return

    pruning_method_map = {
        'l1_unstructured': prune.L1Unstructured,
    }
    
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=pruning_method_map[method],
        amount=amount,
    )

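    # Optional: make the pruning permanent. Until prune.remove() is called, each pruned
    # module keeps both `weight_orig` and `weight_mask`, which increases memory use.
    # for module, param_name in parameters_to_prune:
    #     prune.remove(module, param_name)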
    print(f"Applied '{method}' pruning with {amount*100:.0f}% sparsity to {len(parameters_to_prune)} layers.")

# get_model_memory_footprint() is defined in the Quantization section; run that cell first.
strategic_pruned_model, tokenizer = load_model(model_name=model_name, device=device)
memory_baseline = get_model_memory_footprint(strategic_pruned_model)
print("Loaded 'strategic_pruned_model' model.")
print(f"Memory Footprint: {memory_baseline:.2f} MB")


Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...
Tokenizer loaded successfully.
Loading model...
Model loaded successfully and moved to device.
Loaded 'strategic_pruned_model' model.
Memory Footprint: 2357.13 MB


In [9]:
NUM_LAYERS_TO_TARGET = 4
MLP_LAYERS = []
for i in range(NUM_LAYERS_TO_TARGET):
    MLP_LAYERS.extend([
        f"model.layers.{i}.mlp.gate_proj",
        f"model.layers.{i}.mlp.up_proj",
        f"model.layers.{i}.mlp.down_proj"
    ])
# Print once, after the list is complete
print(MLP_LAYERS)

['model.layers.0.mlp.gate_proj', 'model.layers.0.mlp.up_proj', 'model.layers.0.mlp.down_proj', 'model.layers.1.mlp.gate_proj', 'model.layers.1.mlp.up_proj', 'model.layers.1.mlp.down_proj', 'model.layers.2.mlp.gate_proj', 'model.layers.2.mlp.up_proj', 'model.layers.2.mlp.down_proj', 'model.layers.3.mlp.gate_proj', 'model.layers.3.mlp.up_proj', 'model.layers.3.mlp.down_proj']


In [10]:
apply_pruning(strategic_pruned_model,MLP_LAYERS,PRUNING_AMOUNT,'l1_unstructured')



Applied 'l1_unstructured' pruning with 50% sparsity to 12 layers.


In [16]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(strategic_pruned_model, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times," Magnitude Unstructured pruning")
evaluate_model(datasets,generated,NUM_PREDICTION)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  Magnitude Unstructured prunings
Avg Latency: 1.203s
P99 Latency: 1.901s
Throughput: 0.83 samples/sec
Max GPU Memory: 10860.50 MB
ROUGE Scores: {'rouge1': 0.04866679337142641, 'rouge2': 0.0, 'rougeL': 0.04172254407577583, 'rougeLsum': 0.04151666104238316}


{'rouge1': 0.04866679337142641,
 'rouge2': 0.0,
 'rougeL': 0.04172254407577583,
 'rougeLsum': 0.04151666104238316}

In [13]:
# Clean up model from memory
clean(strategic_pruned_model,tokenizer)



Cleaned up models and emptied CUDA cache.


# 5. Model Compression: Quantization

**Your Task:** Quantization reduces the precision of model weights (e.g., from 16-bit to 8-bit or 4-bit), which significantly cuts memory usage. It can also speed up inference, although that depends on kernel support, because low-bit weights may need to be dequantized at run time. You will define 8-bit and 4-bit quantization configurations and use each one to load and evaluate a new model.
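
To show where the memory saving and the accuracy loss come from, here is a minimal sketch of symmetric absmax quantization on a handful of weights. It is a deliberately simplified per-tensor scheme chosen for illustration; `bitsandbytes` uses more elaborate schemes, so this is not a description of its internals. The weights are made-up values.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integer codes using a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute round-trip error for the given bit width."""
    codes, scale = quantize_absmax(weights, bits=bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(codes, scale)))


weights = [0.5, -1.0, 0.25, 0.0]
codes8, scale8 = quantize_absmax(weights, bits=8)
codes4, scale4 = quantize_absmax(weights, bits=4)
error8 = max_error(weights, bits=8)
error4 = max_error(weights, bits=4)
print(codes8, codes4)
print(error8, error4)
```

Fewer bits mean a coarser grid, so the 4-bit round-trip error is larger than the 8-bit one; whether that matters for headline quality is what the ROUGE comparison is meant to show.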

In [14]:
def get_model_memory_footprint(model):
    """Calculates and returns the model's memory footprint in MB."""
    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
    mem_bufs = sum(buf.nelement() * buf.element_size() for buf in model.buffers())
    total_mem_bytes = mem_params + mem_bufs
    return total_mem_bytes / (1024 ** 2) # Convert bytes to MB



In [16]:
# Load the unquantized model first to record a reference memory footprint.
quantized_model, tokenizer = load_model(model_name=model_name, device=device)
memory_footprints = {}
memory_baseline = get_model_memory_footprint(quantized_model)
memory_footprints["baseline_name"] = f"{memory_baseline:.2f} MB"
print("Loaded 'baseline_name' model.")
print(f"Memory Footprint: {memory_baseline:.2f} MB")

Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...
Tokenizer loaded successfully.
Loading model...
Model loaded successfully and moved to device.
Loaded 'baseline_name' model.
Memory Footprint: 2357.13 MB


In [17]:
clean(quantized_model,tokenizer)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quantization_config, 
        device_map="auto" # Recommended for bitsandbytes
    )
memory_8bit = get_model_memory_footprint(model_8bit)



Cleaned up models and emptied CUDA cache.


In [18]:
memory_footprints["quant_8bit_name"] = f"{memory_8bit:.2f} MB"
print(f"Loaded '{'quant_8bit_name'}' model.")
print(f"Memory Footprint: {memory_8bit:.2f} MB")

Loaded 'quant_8bit_name' model.
Memory Footprint: 1429.13 MB


In [19]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_8bit, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times," Quantised 8bit model")
evaluate_model(datasets,generated,NUM_PREDICTION)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  Quantised 8bit models
Avg Latency: 3.608s
P99 Latency: 4.010s
Throughput: 0.28 samples/sec
Max GPU Memory: 8505.96 MB
ROUGE Scores: {'rouge1': 0.0596551604580485, 'rouge2': 0.0, 'rougeL': 0.04445310216446986, 'rougeLsum': 0.05077744129307699}


{'rouge1': 0.0596551604580485,
 'rouge2': 0.0,
 'rougeL': 0.04445310216446986,
 'rougeLsum': 0.05077744129307699}

In [20]:
clean(model_8bit,tokenizer)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quantization_config, 
        device_map="auto" # Recommended for bitsandbytes
    )
memory_4bit = get_model_memory_footprint(model_4bit)



Cleaned up models and emptied CUDA cache.


In [21]:
memory_footprints["quant_4bit_name"] = f"{memory_4bit:.2f} MB"
print(f"Loaded '{'quant_4bit_name'}' model.")
print(f"Memory Footprint: {memory_4bit:.2f} MB")

Loaded 'quant_4bit_name' model.
Memory Footprint: 965.13 MB


In [22]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_4bit, tokenizer, sample_texts, device, MAX_NEW_TOKENS, use_cache=True)
report_metrics(times," Quantised 4bit model")
evaluate_model(datasets,generated,NUM_PREDICTION)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below are for:  Quantised 4bit models
Avg Latency: 1.653s
P99 Latency: 1.734s
Throughput: 0.60 samples/sec
Max GPU Memory: 8505.96 MB
ROUGE Scores: {'rouge1': 0.07217188820304037, 'rouge2': 0.005100714749837557, 'rougeL': 0.062138023446606995, 'rougeLsum': 0.06732401331757207}


{'rouge1': 0.07217188820304037,
 'rouge2': 0.005100714749837557,
 'rougeL': 0.062138023446606995,
 'rougeLsum': 0.06732401331757207}

# 6. Distributed Inference (Multi-GPU)

**Your Task:** If you have multiple GPUs, you can split the model across them to reduce the memory burden on a single GPU and potentially improve latency. We will explore two common techniques: Tensor Parallelism and Pipeline Parallelism.

*Note: This section requires a multi-GPU environment.*

### Tensor Parallelism
Tensor parallelism splits individual weight tensors within a layer across multiple GPUs. Each GPU computes its share of an operation such as a matrix multiplication in parallel, and the partial results are then combined. This is highly effective for reducing the per-GPU memory footprint of very large layers. Note that `device_map="auto"` from the `accelerate` library does not do this: it places whole layers on different devices, so true tensor parallelism requires a framework that shards individual tensors, such as DeepSpeed-Inference.
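
The column-wise split behind tensor parallelism can be illustrated on a tiny matrix in plain Python. In this sketch the "devices" are just list shards computed one after another, and the concatenation stands in for the communication step between GPUs. All names and values are illustrative.

```python
def matvec(x, weight):
    """Compute x @ weight for a vector x and a weight stored as a list of rows."""
    cols = len(weight[0])
    return [sum(x[r] * weight[r][c] for r in range(len(x))) for c in range(cols)]


def split_columns(weight, num_shards):
    """Split a weight matrix column-wise into contiguous shards of equal width."""
    cols = len(weight[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in weight] for s in range(num_shards)]


def tensor_parallel_matvec(x, weight, num_shards):
    """Compute x @ weight shard by shard and concatenate the output slices."""
    # Each shard would live on its own GPU; here they are computed sequentially
    partial_outputs = [matvec(x, shard) for shard in split_columns(weight, num_shards)]
    # Gather step: concatenate the per-device output slices
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0]
weight = [[1.0, 0.0, 2.0, -1.0],
          [0.5, 1.0, 0.0, 3.0]]
full = matvec(x, weight)
sharded = tensor_parallel_matvec(x, weight, num_shards=2)
print(full)     # [2.0, 2.0, 2.0, 5.0]
print(sharded)  # [2.0, 2.0, 2.0, 5.0]
```

Each shard holds only half of the columns, which is the per-GPU memory saving; the price is the gather step after every sharded layer.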

### Pipeline Parallelism
Pipeline parallelism assigns entire layers or blocks of layers to different GPUs, creating a sequence or "pipeline" that the data flows through. For example, layers 1-10 run on GPU 0, layers 11-20 run on GPU 1, and so on. This is useful when the full model does not fit in the memory of a single GPU, since each GPU only has to hold its own block of layers.
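
A layer-to-GPU assignment of this kind can be written down as a dictionary. The sketch below builds one in plain Python. The module names (`model.embed_tokens`, `model.rotary_emb`, `model.layers.N`, `model.norm`, `lm_head`) are assumptions about how the Llama implementation in `transformers` names its submodules, and keeping `lm_head` next to the embeddings assumes the checkpoint ties those weights; verify both against `model.named_modules()` before relying on them.

```python
def build_pipeline_device_map(num_layers, num_gpus):
    """Assign contiguous blocks of decoder layers to GPUs, in pipeline order."""
    if num_gpus < 1 or num_layers < num_gpus:
        raise ValueError("need at least one GPU and at least one layer per GPU")
    device_map = {
        "model.embed_tokens": 0,  # assumed module name
        "model.rotary_emb": 0,    # assumed module name
        "lm_head": 0,             # kept with the embeddings in case the weights are tied
    }
    for layer in range(num_layers):
        # Integer arithmetic spreads the layers as evenly as possible
        device_map[f"model.layers.{layer}"] = layer * num_gpus // num_layers
    device_map["model.norm"] = num_gpus - 1  # final norm follows the last block
    return device_map


# 16 layers and 4 GPUs are illustrative values, not measurements from this notebook
device_map = build_pipeline_device_map(num_layers=16, num_gpus=4)
for name, gpu in device_map.items():
    print(f"{name:24s} -> GPU {gpu}")
```

A dictionary like this can be passed as the `device_map` argument of `AutoModelForCausalLM.from_pretrained`. With a single request in flight only one GPU is busy at a time, so this layout mainly reduces memory per GPU rather than latency.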

In [10]:
print(f"PyTorch version: {torch.__version__}")
#print(f"DeepSpeed version: {deepspeed.__version__}")
print(f"CUDA is available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Number of GPUs available: {torch.cuda.device_count()}")
    if torch.cuda.device_count() < 4:
        print("!! WARNING: This demo is designed for 4 GPUs. It may not run correctly. !!")

PyTorch version: 2.3.0
CUDA is available: True
Number of GPUs available: 1


In [8]:
!pip install deepspeed
# https://github.com/Surveshchauhan/LLM-Inference.git

Defaulting to user installation because normal site-packages is not writeable
Collecting deepspeed
  Downloading deepspeed-0.18.1.tar.gz (1.6 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [8 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-4h6e20g3/deepspeed_ab1919b42ee740678ccc4f2482e01473/setup.py", line 110, in <module>
          cuda_major_ver, cuda_minor_ver = installed_cuda_version()
        File "/tmp/pip-install-4h6e20g3/deepspeed_ab1919b42ee740

In [None]:
# TODO: Check for multi-GPU environment and evaluate with Tensor Parallelism.
# Note: `device_map="auto"` places whole layers on different GPUs; it does not shard
# individual tensors, so this step needs a framework with tensor-parallel support.

In [None]:
# TODO: Evaluate with Pipeline Parallelism.
# This is more advanced and may require manually defining a device_map to assign
# different layers of the model to different GPUs.

# 7. Advanced Decoding: Speculative Decoding

**Your Task:** Speculative decoding uses a smaller, faster "draft" model to generate several candidate tokens. A larger, more accurate "target" model then verifies these tokens in a single forward pass. This can significantly speed up generation if the draft model is a good predictor. You will load a larger target model and a smaller draft model, benchmark the target model alone, and then benchmark it with assistance from the draft model.
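
The propose-and-verify loop can be sketched with two toy "models" that map a token sequence to the next token. Everything here is a stand-in: `target_next` and `draft_next` are made-up deterministic functions, greedy decoding is assumed, and a real implementation verifies all proposals in one batched forward pass rather than a Python loop.

```python
def target_next(tokens):
    """Stand-in for the large model's greedy next token."""
    return (tokens[-1] * 3 + len(tokens)) % 7


def draft_next(tokens):
    """Stand-in for the small model: agrees with the target except at some positions."""
    guess = target_next(tokens)
    return (guess + 1) % 7 if len(tokens) % 4 == 0 else guess


def greedy_decode(prompt, num_new):
    """Reference: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, k=3):
    """Draft k tokens, verify with the target, keep the longest agreeing prefix."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(tokens) < goal:
        # 1) The draft model proposes up to k tokens autoregressively
        draft = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2) The target checks every proposed position (counted as one pass)
        target_passes += 1
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # 3) First mismatch: keep the target's token, discard the rest
                tokens.extend(draft[:i] + [expected])
                break
        else:
            # All proposals accepted; the same pass also yields one extra token
            tokens.extend(draft)
            if len(tokens) < goal:
                tokens.append(target_next(tokens))
    return tokens, target_passes


prompt = [1, 2]
reference = greedy_decode(prompt, num_new=12)
speculated, passes = speculative_decode(prompt, num_new=12, k=3)
print(speculated == reference, passes)
```

The output matches plain greedy decoding token for token, while the number of target passes is smaller than the number of generated tokens; how much smaller depends entirely on how often the draft agrees with the target.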

In [None]:
# TODO: Implement and evaluate speculative decoding.

# 8. Final Report and Analysis

**Your Task:** Consolidate your findings into a summary report. 

1.  Fill in the Markdown table below with the **Latency**, **Throughput**, and **ROUGE scores** for each optimization technique you implemented.
2. Compile the final Project Report in PDF format:
    *   Document the entire process, detailing the methodology, techniques, and libraries used.
    *   Present the final benchmark results clearly.
    *   Provide a thorough analysis of the trade-offs between performance, resources, and quality for each optimization step.
    *   Conclude with recommendations for the most effective optimization strategy for this specific headline generation task, supported by your data.

Some example questions for discussing the trade-offs:

*   Which method gave the best performance improvement?
*   Did any methods significantly hurt the ROUGE score (quality)?
*   Which optimization would you recommend for deployment in a production environment at the news portal, and why? Consider factors like cost, complexity, and performance.

## Performance Comparison

| Optimization Technique                              | Mean Latency (s) | Throughput (samples/s) | ROUGE-1 Score |
|-----------------------------------------------------|------------------|------------------------|---------------|
| Baseline (No Cache)                                 | 4.185            | 0.24                   | 0.0511        |
| KV Caching                                          | 1.225            | 0.82                   | 0.0581        |
| Pruning (50%, layer 0 `gate_proj`) + KV cache       | 1.257            | 0.80                   | 0.0638        |
| Pruning (50% global, MLP layers 0-3) + KV cache     | 1.203            | 0.83                   | 0.0487        |
| Quantization (8-bit) + KV cache                     | 3.608            | 0.28                   | 0.0597        |
| Quantization (4-bit) + KV cache                     | 1.653            | 0.60                   | 0.0722        |
| Tensor Parallelism                                  | TODO             | TODO                   | TODO          |
| Pipeline Parallelism                                | TODO             | TODO                   | TODO          |
| Speculative Decoding                                | TODO             | TODO                   | TODO          |
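
For the analysis, speedups relative to the baseline can be computed directly from the mean latencies printed in the cell outputs of this notebook. The sketch below does that arithmetic; the dictionary labels are informal descriptions of the runs. Keep in mind that the pruning and quantization runs were executed with KV caching enabled, so their speedup over the no-cache baseline mostly reflects the cache, and the KV-caching run is the fairer reference for them.

```python
# Mean latencies in seconds, copied from the outputs recorded in this notebook
baseline_latency_s = 4.185
measured_latency_s = {
    "KV caching": 1.225,
    "Pruning (layer 0 gate_proj, 50%)": 1.257,
    "Pruning (global, MLP layers 0-3, 50%)": 1.203,
    "Quantization (8-bit)": 3.608,
    "Quantization (4-bit)": 1.653,
}


def speedup_table(baseline, measurements):
    """Return baseline / latency for each run, rounded to two decimals."""
    return {name: round(baseline / latency, 2) for name, latency in measurements.items()}


speedups = speedup_table(baseline_latency_s, measured_latency_s)
for name, factor in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:40s} {factor:.2f}x")

# Same comparison against the KV-caching run instead of the no-cache baseline
relative_to_cache = speedup_table(measured_latency_s["KV caching"], measured_latency_s)
```

Measured against the KV-caching run, both quantized models are slower (factors below 1), which is the kind of trade-off the report should discuss alongside the memory savings.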

---

