# UdaciHeadline: LLM Inference Optimization Project

## Project Introduction
Large Language Models (LLMs) are transforming content creation, but deploying them efficiently remains a major hurdle. Imagine you're an ML Engineer at a bustling online news portal. Your key task? Automatically generating catchy headlines from article summaries using an LLM. The problem? The current inference process is sluggish, causing publication delays and driving up operational costs. In this project, UdaciHeadline, you'll step into this role and tackle this critical challenge head-on. Your mission is to accelerate the headline generation pipeline significantly by applying state-of-the-art LLM inference optimization techniques. Get ready to dive deep into practical optimization and deployment!

## Project Summary
This project provides hands-on experience in optimizing the inference performance of a pre-trained Large Language Model (like Llama-3.2-1B) for news headline generation. You will bring together concepts of LLM architecture, optimization techniques, and deployment frameworks. Specifically, you will:

1.  **Establish a baseline** inference pipeline and profile its performance.
2.  Implement and evaluate architectural optimizations like **KV-caching**.
3.  Apply model compression techniques like **quantization** and **pruning**.
4.  Configure and benchmark **distributed inference** using Tensor and Pipeline Parallelism.
5.  Apply advanced decoding mechanisms like **speculative decoding**.
6.  Perform comprehensive **benchmarking and analysis** across all stages.
7.  Produce a **final report** summarizing findings and trade-offs.

## Install and Imports and Global Configuration

Let's import the libraries we'll use throughout the project and define some constants like the model name and the prompt template.

In [3]:
!pip install evaluate
!pip install --upgrade transformers
!pip install rouge_score
!pip install bitsandbytes
!pip install accelerate
!pip install datasets

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.6.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)
Collecting huggingface-hub>=0.7.0 (from evaluate)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting httpx<1.0.0 (from datasets>=2.0.0->evaluate)
  Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting httpcore==1.* (from httpx<1.0.0->datasets>=2.0.0->evaluate)
  Using cached httpcore-1.0.9-py3-none-any.whl.metadata (21 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub>=0.7.0->evaluate)
  Downloading hf_xet-1.2.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata

In [1]:
import os
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from evaluate import load as load_metric
from transformers import AutoModelForCausalLM, AutoTokenizer
from time import time as get_time
from pprint import pprint
import torch.nn.utils.prune as prune
import copy
import torch
from time import time as get_time
from huggingface_hub import login
#os.environ["HF_HUB_OFFLINE"] = "1"
# ---- Constants ----

model_name = "meta-llama/Llama-3.2-1B"
MAX_NEW_TOKENS = 25 # Max length for the generated headline

#PROMPT = \
# print(f"Prompt: \"{PROMPT}\"")
TARGET_LAYER_NAME_STR = "model.layers.0.mlp.gate_proj"

# We will prune 50% of the weights in this layer
PRUNING_AMOUNT = 0.5
NUM_PREDICTION = 50

device = "cuda" if torch.cuda.is_available() else "cpu"
df_results = pd.DataFrame(columns=['Optimization Technique ','Avg Latency','P99 Latency','Throughput','Max GPU Memory','rouge1','rouge2','rougeL','rougeLsum'])
df_results

  import pynvml  # type: ignore[import]


## Data Loading

We will use the "News Category Dataset" from Kaggle. The `kagglehub` library makes it easy to download and access. Your task is to implement the function to load and preprocess the data according to the docstring.

In [2]:

def load_news_dataset(path):

    dataset = load_dataset("json", data_files=path, split="train[:1000]")
    articles = [item["short_description"] for item in dataset]
    headlines = [item["headline"] for item in dataset]

    return dataset,articles,headlines


## Define Functions

We will define all the functions here

In [3]:
def load_model(model_name, quantization_config=None,device="cpu"):

    dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float32
    print(f"Using model: {model_name}")
    print(f"Using device: {device}")
    print(f"Using dtype: {dtype}")

    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print("Tokenizer loaded successfully.")
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto",quantization_config=quantization_config).to(device)
    print("Model loaded successfully and moved to device.")
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    return model,tokenizer

def generate_headline(model, tokenizer,texts,device,max_length,use_cache=False):

    headlines = []
    latencies = []
    with torch.no_grad():
        for text in texts:
            prompt = f"Generate a concise headline for the following news summary:\n{text}"
            prompt = f"Short description: {text}\nHeadline:"
            #print(prompt)
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
            start = get_time()
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_length,
                use_cache=use_cache,  # Baseline: no KV caching
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
            end = get_time()
            latencies.append(end - start)
            generated_text = tokenizer.decode(outputs[0] , skip_special_tokens=True)

            headline_start = generated_text.find("Headline:") + len("Headline:")
            headline_new = generated_text[headline_start:].strip()

            headline_end = headline_new.find("Short description:")
            headline = headline_new[:headline_end].strip()

            headlines.append(headline)
    return headlines, latencies
def generate_headline_quantized(model, tokenizer, texts, device_map, max_length, use_cache=False):


    headlines = []
    latencies = []

    with torch.no_grad():
        for text in texts:
            prompt = f"Short description: {text}\nHeadline:"
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
            first_device = next(model.parameters()).device
            inputs = {k: v.to(first_device) for k, v in inputs.items()}

            start = get_time()
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_length,
                use_cache=use_cache,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
            end = get_time()
            latencies.append(end - start)

            generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Robust headline extraction
            headline = generated_text.split("Headline:")[-1].strip().split("Short description:")[0].strip()
            headlines.append(headline)

    return headlines, latencies


def report_metrics(times,section):
    avg_latency = sum(times) / len(times)
    p99_latency = sorted(times)[int(0.99 * len(times))]
    throughput = len(times) / sum(times)
    gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print(f"Below are for: {section}s")
    print(f"Avg Latency: {avg_latency:.3f}s")
    print(f"P99 Latency: {p99_latency:.3f}s")
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Max GPU Memory: {gpu_memory:.2f} MB")

    list_metric = [section,avg_latency,p99_latency,throughput,gpu_memory]
    return list_metric

def evaluate_model(dataset,generated,num_prediction):
    rouge = load_metric("rouge")
    references = [item for item in dataset[:num_prediction]['headline']]
    results = rouge.compute(predictions=generated, references=references)
    print("ROUGE Scores:", results)
    
    return results
def clean(model,tokenizer):
    # Clean up model from memory
    del model
    del tokenizer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("\nCleaned up models and emptied CUDA cache.")
    def calculate_sparsity(module, param_name='weight'):
    """Calculates sparsity of a named parameter in a module."""
    if hasattr(module, param_name):
        param = getattr(module, param_name)
        if param is not None:
            return 100. * float(torch.sum(param == 0)) / float(param.nelement())
    return 0.0
def print_headline(generated, headlines, articles,n=3):
    for i in range(0,n):
        print(f"Generated Headline: ",generated[i])
        print(f"Actual Headline: ",headlines[i])
        print(f"Short Description: ",articles[i])
        print(f"\n")

def get_module_by_name(model, module_name):
    """Access a submodule in a model using its string name."""
    names = module_name.split('.')
    module = model
    for name in names:
        module = getattr(module, name)
    return module
    
def apply_pruning(model, layers_to_prune, amount, method):
    """Apply a specified pruning method to a list of layers."""
    parameters_to_prune = []
    for layer_name in layers_to_prune:
        try:
            module = get_module_by_name(model, layer_name)
            parameters_to_prune.append((module, 'weight'))
        except AttributeError:
            print(f"Warning: Layer {layer_name} not found. Skipping.")

    if not parameters_to_prune:
        print("No valid layers found to prune.")
        return

    pruning_method_map = {
        'l1_unstructured': prune.L1Unstructured,
    }
    
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=pruning_method_map[method],
        amount=amount,
    )

    # # Make the pruning permanent
    # for module, param_name in parameters_to_prune:
    #     prune.remove(module, param_name)
    #print(f"Applied '{method}' pruning with {amount*100:.0f}% sparsity to {len(parameters_to_prune)} layers.")

def get_model_memory_footprint(model):
    """Calculates and returns the model's memory footprint in MB."""
    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
    mem_bufs = sum(buf.nelement() * buf.element_size() for buf in model.buffers())
    total_mem_bytes = mem_params + mem_bufs
    return total_mem_bytes / (1024 ** 2) # Convert bytes to MB

#### Load Huggingface token

In [5]:

HF_TOKEN = ''#Add your token here
login(token=HF_TOKEN)
datasets, articles,headlines = load_news_dataset("../dataset/News_Category_Dataset.json")

# 2. Baseline Performance

Before we can optimize, we need a starting point. Here, Lets establish the baseline performance of the `Llama-3.2-1B` model without any specific optimizations. We will measure latency, throughput, and the quality of the generated headlines using the ROUGE score.


In [7]:
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float32
model_new,tokenizer = load_model(model_name=model_name,quantization_config = None,device=device)
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_new, tokenizer, sample_texts, device,MAX_NEW_TOKENS)
dict_ = report_metrics(times,"Baseline")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...
Tokenizer loaded successfully.
Loading model...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Model loaded successfully and moved to device.
Below are for: Baselines
Avg Latency: 1.754s
P99 Latency: 2.098s
Throughput: 0.57 samples/sec
Max GPU Memory: 2372.75 MB
ROUGE Scores: {'rouge1': 0.19725660191058997, 'rouge2': 0.0711636938654821, 'rougeL': 0.17296930902739638, 'rougeLsum': 0.1739549789328416}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955


In [8]:
print_headline(generated, headlines, articles)

Generated Headline:  "U.S. to order 100 million more doses of COVID-19 vaccine"
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  2 dead, 1 injured in plane crash in California
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
The dog is a very important part of the famil
Actual Headline:  23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
Short Description:  "Until you have a dog you don't understand what co

```
• Performance: Moderate latency and low throughput. Acts as the control for all other techniques.
• Memory: Efficient and fits comfortably on a single GPU.
• Quality: ROUGE scores are decent, showing that the model can generate reasonable headlines without optimization.
• Summary: A reliable starting point. Good balance of quality and resource usage, but slower than optimized variants
```

# 3. Architectural Optimization: KV Caching

One of the most effective ways to speed up token generation is using a Key-Value (KV) cache. This avoids re-computing attention scores for tokens that are already part of the sequence. 
Lets implement and observe the results

In [9]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_new, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
dict_ = report_metrics(times,"KV Caching")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Below are for: KV Cachings
Avg Latency: 0.712s
P99 Latency: 0.744s
Throughput: 1.40 samples/sec
Max GPU Memory: 2372.75 MB
ROUGE Scores: {'rouge1': 0.19640260298652956, 'rouge2': 0.07316621056341521, 'rougeL': 0.1746354099929494, 'rougeLsum': 0.17505680921981875}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057


In [10]:
print_headline(generated, headlines, articles)

Generated Headline:  "U.S. to order 100 million more doses of COVID-19 vaccine"
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  2 dead, 1 injured in plane crash in California
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
The dog is a very important part of the famil
Actual Headline:  23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
Short Description:  "Until you have a dog you don't understand what co

```
• Performance: Dramatically reduces latency by avoiding redundant computation during autoregressive decoding. The model reuses previously computed attention keys and values, which speeds up generation without changing the output.
• Memory: No additional memory savings, but no increase either  it’s a purely computational optimization.
• Quality: ROUGE scores remain stable or slightly improve due to faster decoding and reduced token delay.
• Summary: A low-risk, high-reward optimization. Ideal for any deployment scenario where latency matters.

```

In [11]:
clean(model_new,tokenizer)


Cleaned up models and emptied CUDA cache.


# 4. Model Compression: Pruning

Pruning removes redundant model weights, which can reduce model size and potentially speed up inference. Lets implement unstructured, magnitude-based pruning for MLP layer

In [12]:
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float32
strategic_pruned_model,tokenizer = load_model(model_name=model_name,device=device)
memory_baseline = get_model_memory_footprint(strategic_pruned_model)
print(f"Loaded '{'strategic_pruned_model'}' model.")
print(f"Memory Footprint: {memory_baseline:.2f} MB")


Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...
Tokenizer loaded successfully.
Loading model...
Model loaded successfully and moved to device.
Loaded 'strategic_pruned_model' model.
Memory Footprint: 2357.13 MB


In [13]:
NUM_LAYERS_TO_TARGET = 4
MLP_LAYERS = []
for i in range(NUM_LAYERS_TO_TARGET):
    MLP_LAYERS.extend([
        f"model.layers.{i}.mlp.gate_proj",
        f"model.layers.{i}.mlp.up_proj",
        f"model.layers.{i}.mlp.down_proj"
    ])

In [14]:
apply_pruning(strategic_pruned_model,MLP_LAYERS,PRUNING_AMOUNT,'l1_unstructured')

In [15]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(strategic_pruned_model, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
dict_ = report_metrics(times," Magnitude Unstructured pruning")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
# Clean up model from memory
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Below are for:  Magnitude Unstructured prunings
Avg Latency: 0.599s
P99 Latency: 0.625s
Throughput: 1.67 samples/sec
Max GPU Memory: 10868.62 MB
ROUGE Scores: {'rouge1': 0.14577913438561244, 'rouge2': 0.0327597067539767, 'rougeL': 0.13061666400632255, 'rougeLsum': 0.13089770273333565}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057
2,Magnitude Unstructured pruning,0.59874,0.625304,1.670175,10868.622559,0.145779,0.03276,0.130617,0.130898


In [16]:
print_headline(generated, headlines, articles)

Generated Headline:  The U.S. ordered 1.7 billion doses of the new boosters, which are 80% effective, fo
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  The 20-year-old man was arrested on suspicion of the murder of his 19-year-old girlfriend, who was killed i
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
Actual Headline:  23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
Short Description:  "Until you

```
• Performance: Improves latency and throughput by removing low-magnitude weights from linear layers. This reduces the number of active parameters and speeds up matrix operations.
• Memory: Significant reduction in memory usage due to fewer active weights.
• Quality: ROUGE scores drop sharply, indicating that pruning harms headline generation quality. 
• Summary: Best suited for extreme resource-constrained environments where speed is critical and quality can be sacrificed. Not recommended for production headline generation.
```

In [17]:
clean(strategic_pruned_model,tokenizer)


Cleaned up models and emptied CUDA cache.


# 5. Model Compression: Quantization

Quantization reduces the precision of model weights (e.g., from 16-bit to 4-bit), significantly cutting down memory usage and often speeding up inference. Lets implement both 8-bit quantization and a 4-bit quantization configuration and load a model and observe the result

In [18]:
dtype = torch.float16 
quantized_model,tokenizer = load_model(model_name=model_name,device=device)
memory_footprints = {}
memory_baseline = get_model_memory_footprint(quantized_model)
memory_footprints["baseline_name"] = f"{memory_baseline:.2f} MB"
print(f"Loaded '{'baseline_name'}' model.")
print(f"Memory Footprint: {memory_baseline:.2f} MB")

Using model: meta-llama/Llama-3.2-1B
Using device: cuda
Using dtype: torch.bfloat16
Loading tokenizer...
Tokenizer loaded successfully.
Loading model...
Model loaded successfully and moved to device.
Loaded 'baseline_name' model.
Memory Footprint: 2357.13 MB


#### 8-bit Quantization Model

In [19]:
clean(quantized_model,tokenizer)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quantization_config, 
        device_map="auto" # Recommended for bitsandbytes
    )
memory_8bit = get_model_memory_footprint(model_8bit)
memory_footprints["quant_8bit_name"] = f"{memory_8bit:.2f} MB"
print(f"Loaded '{'quant_8bit_name'}' model.")
print(f"Memory Footprint: {memory_8bit:.2f} MB")


Cleaned up models and emptied CUDA cache.


  import pynvml  # type: ignore[import]


Loaded 'quant_8bit_name' model.
Memory Footprint: 1429.13 MB


In [20]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline_quantized(model_8bit, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
dict_ = report_metrics(times," Quantised 8bit model")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Below are for:  Quantised 8bit models
Avg Latency: 1.807s
P99 Latency: 1.960s
Throughput: 0.55 samples/sec
Max GPU Memory: 10868.62 MB
ROUGE Scores: {'rouge1': 0.20000103018514484, 'rouge2': 0.0763577855733028, 'rougeL': 0.16968265333091442, 'rougeLsum': 0.1712212066362102}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057
2,Magnitude Unstructured pruning,0.59874,0.625304,1.670175,10868.622559,0.145779,0.03276,0.130617,0.130898
3,Quantised 8bit model,1.806751,1.959691,0.55348,10868.622559,0.200001,0.076358,0.169683,0.171221


In [21]:
print_headline(generated, headlines, articles)

Generated Headline:  "U.S. to order 100 million more doses of COVID-19 vaccines for fall"
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  "Man Who Flew From Plane To Escape Arrest Is Charged With Assault"
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
Description: "Until you have a dog you don
Actual Headline:  23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
Short Description:  "Until you have a dog y

```
• Performance: Slightly slower than baseline due to quantization overhead, especially on CPUs or non-optimized GPU kernels.
• Memory: Major memory savings  model weights occupy half the space of full precision.
• Quality: ROUGE scores are stable or slightly better than baseline
• Summary: Excellent choice for deployment on limited hardware. Should be combine with KV-caching for best results

```

#### 4-bit Quantization Model

In [22]:
clean(model_8bit,tokenizer)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quantization_config, 
        device_map="auto" # Recommended for bitsandbytes
    )
memory_4bit = get_model_memory_footprint(model_4bit)
memory_footprints["quant_4bit_name"] = f"{memory_4bit:.2f} MB"
print(f"Loaded '{'quant_4bit_name'}' model.")
print(f"Memory Footprint: {memory_4bit:.2f} MB")


Cleaned up models and emptied CUDA cache.
Loaded 'quant_4bit_name' model.
Memory Footprint: 965.13 MB


In [23]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline_quantized(model_4bit, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
dict_ = report_metrics(times," Quantised 4bit model")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Below are for:  Quantised 4bit models
Avg Latency: 0.914s
P99 Latency: 0.931s
Throughput: 1.09 samples/sec
Max GPU Memory: 10868.62 MB
ROUGE Scores: {'rouge1': 0.15255867871600398, 'rouge2': 0.04118585010548303, 'rougeL': 0.1347190371995811, 'rougeLsum': 0.1337735279968416}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057
2,Magnitude Unstructured pruning,0.59874,0.625304,1.670175,10868.622559,0.145779,0.03276,0.130617,0.130898
3,Quantised 8bit model,1.806751,1.959691,0.55348,10868.622559,0.200001,0.076358,0.169683,0.171221
4,Quantised 4bit model,0.913505,0.930753,1.094685,10868.622559,0.152559,0.041186,0.134719,0.133774


In [24]:
print_headline(generated, headlines, articles)

Generated Headline:  The U.S. has ordered 171 million doses of the new booster, which is expected to arrive in the fall. The
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  story: Airplane pilot arrested in Los Angeles
The pilot of a small plane was arrested in Los Angeles on Friday, after
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
Description: "Until you have a dog you don
Actual Headline:  23 Of The Funniest Tweets A

```
• Performance: Faster than 8-bit and baseline due to smaller weight matrices and reduced memory bandwidth.
• Memory: Extreme memory savings ideal for edge devices or low-cost GPU setups.
• Quality: ROUGE scores drop moderately, indicating some loss of semantic richness and fluency.
• Summary: A strong option for low-memory environments, but not ideal if headline quality is critical.
```

In [25]:
clean(model_4bit,tokenizer)


Cleaned up models and emptied CUDA cache.


# 6. Distributed Inference (Multi-GPU)

*We are now goign to use  multiple GPUs, we will  split the model across them to reduce the memory burden on a single GPU and potentially improve latency. We will explore two common techniques: Tensor Parallelism and Pipeline Parallelism.

*Note: This section requires a multi-GPU environment.*

### Tensor Parallelism
Tensor parallelism splits individual model layers (the tensors) across multiple GPUs. Operations like matrix multiplications are executed in parallel on different GPUs, and the results are aggregated. This is highly effective for reducing the memory footprint of very large layers. The `accelerate` library can handle this automatically via `device_map="auto"`.

### Pipeline Parallelism
Pipeline parallelism assigns entire layers or blocks of layers to different GPUs, creating a sequence or "pipeline" that the data flows through. For example, layers 1-10 run on GPU 0, layers 11-20 run on GPU 1, and so on. This is useful for very deep models where even a single layer might be too large for one GPU after tensor parallelism.

### Tensor Parallelism

In [26]:
print(f"PyTorch version: {torch.__version__}")
#print(f"DeepSpeed version: {deepspeed.__version__}")
print(f"CUDA is available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Number of GPUs available: {torch.cuda.device_count()}")
    if torch.cuda.device_count() < 4:
        print("!! WARNING: This demo is designed for 4 GPUs. It may not run correctly. !!")

PyTorch version: 2.6.0+cu124
CUDA is available: True
Number of GPUs available: 4


In [27]:
# !sudo ln -s /usr/local/cuda-12.2 /usr/local/cuda-11.8
#uncomment to run if you face error

In [28]:
# pip install deepspeed

In [29]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# model_name = "meta-llama/Llama-3.2-1B"

# Load model with automatic device mapping
model_tp = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Tensor parallelism
    dtype=torch.float16
)
model_tp.eval()

if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model_tp.config.pad_token_id = model_tp.config.eos_token_id


In [30]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_tp, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
dict_ = report_metrics(times," Tensor Parallel")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Below are for:  Tensor Parallels
Avg Latency: 0.720s
P99 Latency: 0.825s
Throughput: 1.39 samples/sec
Max GPU Memory: 10868.62 MB
ROUGE Scores: {'rouge1': 0.19586889007068448, 'rouge2': 0.0731644188873067, 'rougeL': 0.17310889396741813, 'rougeLsum': 0.1761237906089539}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057
2,Magnitude Unstructured pruning,0.59874,0.625304,1.670175,10868.622559,0.145779,0.03276,0.130617,0.130898
3,Quantised 8bit model,1.806751,1.959691,0.55348,10868.622559,0.200001,0.076358,0.169683,0.171221
4,Quantised 4bit model,0.913505,0.930753,1.094685,10868.622559,0.152559,0.041186,0.134719,0.133774
5,Tensor Parallel,0.720075,0.824669,1.388744,10868.622559,0.195869,0.073164,0.173109,0.176124


In [31]:
clean(model_tp,tokenizer)
print_headline(generated, headlines, articles)


Cleaned up models and emptied CUDA cache.
Generated Headline:  "U.S. to order 100 million more doses of COVID-19 vaccine"
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  2 dead, 1 injured in plane crash in California
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
The dog is a great companion. He is
Actual Headline:  23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
Short Description:  "Until you have a

```
• Performance: Improves throughput and latency by splitting tensor operations across multiple GPUs. Each GPU handles part of the computation in parallel.
• Memory: Efficient memory distribution across devices, allowing larger models to run without bottlenecks.
• Quality: ROUGE scores remain stable, as the model architecture and weights are unchanged.
• Summary: Ideal for multi-GPU setups. Offers strong performance gains without compromising quality.

```

### Pipeline Parallelism.

In [32]:
#"meta-llama/Llama-3.2-1B",
model_pp = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="balanced",  # Pipeline parallelism
    dtype=torch.float16
)
model_pp.eval()

sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(model_pp, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
dict_ = report_metrics(times," Pipeline Parallel")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Below are for:  Pipeline Parallels
Avg Latency: 0.719s
P99 Latency: 0.729s
Throughput: 1.39 samples/sec
Max GPU Memory: 10868.62 MB
ROUGE Scores: {'rouge1': 0.19586889007068448, 'rouge2': 0.0731644188873067, 'rougeL': 0.17310889396741813, 'rougeLsum': 0.1761237906089539}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057
2,Magnitude Unstructured pruning,0.59874,0.625304,1.670175,10868.622559,0.145779,0.03276,0.130617,0.130898
3,Quantised 8bit model,1.806751,1.959691,0.55348,10868.622559,0.200001,0.076358,0.169683,0.171221
4,Quantised 4bit model,0.913505,0.930753,1.094685,10868.622559,0.152559,0.041186,0.134719,0.133774
5,Tensor Parallel,0.720075,0.824669,1.388744,10868.622559,0.195869,0.073164,0.173109,0.176124
6,Pipeline Parallel,0.719361,0.729432,1.390123,10868.622559,0.195869,0.073164,0.173109,0.176124


In [33]:
clean(model_pp,tokenizer)
print_headline(generated, headlines, articles)


Cleaned up models and emptied CUDA cache.
Generated Headline:  "U.S. to order 100 million more doses of COVID-19 vaccine"
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  2 dead, 1 injured in plane crash in California
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
The dog is a great companion. He is
Actual Headline:  23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
Short Description:  "Until you have a

```
• Performance: Similar to tensor parallelism, but splits model layers sequentially across GPUs. Each GPU handles a stage of the forward pass.
• Memory: Enables deep models to run across devices, reducing per-GPU memory load.
• Quality: Identical to tensor parallelism no change in output quality.
• Summary: Best for very large models (e.g., 13B+) where layer depth exceeds single-GPU capacity. Slightly more complex to manage than tensor parallelism.
```

# 7. Advanced Decoding: Speculative Decoding

Speculative decoding uses a smaller, faster "draft" model to generate several candidate tokens. A larger, more accurate "target" model then verifies these tokens in a single forward pass. This can significantly speed up generation if the draft model is a good predictor. 
Lets load You will load a larger target model (meta-llama/Llama-3.1-8B) and a smaller draft model (meta-llama/Llama-3.2-1B), and benchmark the target model alone, and then benchmark it with assistance from the draft model.

In [34]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Load Draft Model (1B)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    dtype=torch.float16,
    device_map="auto"
)
draft_model.eval()

# Load Target Model (7B)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    dtype=torch.float16,
    device_map="auto"
)
target_model.eval()

# Load ROUGE
rouge = load_metric("rouge")


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [35]:
def run_speculative_decoding(texts, references, max_new_tokens=30):
    predictions = []
    latencies = []

    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(target_model.device)
            start = get_time()
            outputs = target_model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                use_cache=True,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
                assistant_model=draft_model  # Enables speculative decoding
            )
            end = get_time()
            latencies.append(end - start)
            pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
            predictions.append(pred)

    # Metrics
    avg_latency = sum(latencies) / len(latencies)
    p99_latency = sorted(latencies)[int(0.99 * len(latencies))]
    throughput = len(latencies) / sum(latencies)
    gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 2)
    rouge_scores = rouge.compute(predictions=predictions, references=references)

    print(f"\n[Speculative Decoding]")
    print(f"Avg Latency: {avg_latency:.3f}s")
    print(f"P99 Latency: {p99_latency:.3f}s")
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Max GPU Memory: {gpu_memory:.2f} MB")
    print("ROUGE Scores:", rouge_scores)
    
    list_metric = ['Speculative Decoding',avg_latency,p99_latency,throughput,gpu_memory]
    return list_metric,rouge_scores

In [36]:
if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model_tp.config.pad_token_id = model_tp.config.eos_token_id


sample_texts = articles[:NUM_PREDICTION]
references = [item for item in datasets[:NUM_PREDICTION]['headline']]
dict_, res = run_speculative_decoding(sample_texts, references)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results


[Speculative Decoding]
Avg Latency: 2.416s
P99 Latency: 5.425s
Throughput: 0.41 samples/sec
Max GPU Memory: 13247.43 MB
ROUGE Scores: {'rouge1': 0.13590574213264023, 'rouge2': 0.032221324839906175, 'rougeL': 0.10602704062087703, 'rougeLsum': 0.11043058201812977}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057
2,Magnitude Unstructured pruning,0.59874,0.625304,1.670175,10868.622559,0.145779,0.03276,0.130617,0.130898
3,Quantised 8bit model,1.806751,1.959691,0.55348,10868.622559,0.200001,0.076358,0.169683,0.171221
4,Quantised 4bit model,0.913505,0.930753,1.094685,10868.622559,0.152559,0.041186,0.134719,0.133774
5,Tensor Parallel,0.720075,0.824669,1.388744,10868.622559,0.195869,0.073164,0.173109,0.176124
6,Pipeline Parallel,0.719361,0.729432,1.390123,10868.622559,0.195869,0.073164,0.173109,0.176124
7,Speculative Decoding,2.415523,5.424541,0.413989,13247.425293,0.135906,0.032221,0.106027,0.110431


In [37]:
print_headline(generated, headlines, articles)

Generated Headline:  "U.S. to order 100 million more doses of COVID-19 vaccine"
Actual Headline:  Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Short Description:  Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.


Generated Headline:  2 dead, 1 injured in plane crash in California
Actual Headline:  American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
Short Description:  He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.


Generated Headline:  "Until you have a dog you don't understand what could be eaten."
The dog is a great companion. He is
Actual Headline:  23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
Short Description:  "Until you have a dog you don't understand what could be eat

```
• Performance: Surprisingly slower in this setup due to mismatch between draft and target models. Latency and P99 latency are higher than baseline.
• Memory: Requires both models in memory, leading to the highest GPU usage in the benchmark.
• Quality: ROUGE scores are significantly lower, suggesting that the draft model’s predictions are frequently rejected or misaligned.
• Summary: Promising technique in theory, but underperformed here due to poor draft-target synergy. May work better with smaller gaps between models or fine-tuned drafts.

```

Lets now run larger target model and check the results

In [38]:
sample_texts = articles[:NUM_PREDICTION]
generated, times = generate_headline(target_model, tokenizer, sample_texts, device,MAX_NEW_TOKENS,"True")
dict_ = report_metrics(times," target_model")
res = evaluate_model(datasets,generated,NUM_PREDICTION)
df_results.loc[len(df_results)] = (dict_+ list(res.values()))
df_results

Below are for:  target_models
Avg Latency: 1.715s
P99 Latency: 2.041s
Throughput: 0.58 samples/sec
Max GPU Memory: 13247.43 MB
ROUGE Scores: {'rouge1': 0.20109832229004854, 'rouge2': 0.07571960213214383, 'rougeL': 0.1786265297163076, 'rougeLsum': 0.18017525964235714}


Unnamed: 0,Optimization Technique,Avg Latency,P99 Latency,Throughput,Max GPU Memory,rouge1,rouge2,rougeL,rougeLsum
0,Baseline,1.753986,2.097501,0.57013,2372.75,0.197257,0.071164,0.172969,0.173955
1,KV Caching,0.712174,0.743731,1.404151,2372.75,0.196403,0.073166,0.174635,0.175057
2,Magnitude Unstructured pruning,0.59874,0.625304,1.670175,10868.622559,0.145779,0.03276,0.130617,0.130898
3,Quantised 8bit model,1.806751,1.959691,0.55348,10868.622559,0.200001,0.076358,0.169683,0.171221
4,Quantised 4bit model,0.913505,0.930753,1.094685,10868.622559,0.152559,0.041186,0.134719,0.133774
5,Tensor Parallel,0.720075,0.824669,1.388744,10868.622559,0.195869,0.073164,0.173109,0.176124
6,Pipeline Parallel,0.719361,0.729432,1.390123,10868.622559,0.195869,0.073164,0.173109,0.176124
7,Speculative Decoding,2.415523,5.424541,0.413989,13247.425293,0.135906,0.032221,0.106027,0.110431
8,target_model,1.715131,2.041321,0.583046,13247.425293,0.201098,0.07572,0.178627,0.180175


```
• Performance: Slower than baseline due to larger model size and deeper architecture.
• Memory: Highest memory usage in the benchmark.
• Quality: Best ROUGE scores across all variants  headlines are more fluent, informative, and accurate.
• Summary: Ideal for quality-first applications. Use with parallelism or quantization to mitigate resource demands.
```

In [39]:
clean(draft_model,tokenizer)
clean(target_model,tokenizer)


Cleaned up models and emptied CUDA cache.

Cleaned up models and emptied CUDA cache.


In [40]:
df_results.to_csv("results.csv",index=False)

# 8. Final Report and Analysis

Final report is attached as PDF

## Performance Comparison

| Optimization Technique              | Avg Latency | P99 Latency | Throughput | Max GPU Memory | ROUGE-1  | ROUGE-2  | ROUGE-L  | ROUGE-Lsum |
|------------------------------------|-------------|-------------|------------|----------------|----------|----------|----------|-------------|
| Baseline                           | 1.753986    | 2.097501    | 0.570130   | 2372.750000    | 0.197257 | 0.071164 | 0.172969 | 0.173955    |
| KV Caching                         | 0.712174    | 0.743731    | 1.404151   | 2372.750000    | 0.196403 | 0.073166 | 0.174635 | 0.175057    |
| Magnitude Unstructured Pruning     | 0.598740    | 0.625304    | 1.670175   | 10868.622559   | 0.145779 | 0.032760 | 0.130617 | 0.130898    |
| Quantised 8bit Model               | 1.806751    | 1.959691    | 0.553480   | 10868.622559   | 0.200001 | 0.076358 | 0.169683 | 0.171221    |
| Quantised 4bit Model               | 0.913505    | 0.930753    | 1.094685   | 10868.622559   | 0.152559 | 0.041186 | 0.134719 | 0.133774    |
| Tensor Parallel                    | 0.720075    | 0.824669    | 1.388744   | 10868.622559   | 0.195869 | 0.073164 | 0.173109 | 0.176124    |
| Pipeline Parallel                  | 0.719361    | 0.729432    | 1.390123   | 10868.622559   | 0.195869 | 0.073164 | 0.173109 | 0.176124    |
| Speculative Decoding               | 2.415523    | 5.424541    | 0.413989   | 13247.425293   | 0.135906 | 0.032221 | 0.106027 | 0.110431    |
| Target Model                       | 1.715131    | 2.041321    | 0.583046   | 13247.425293   | 0.201098 | 0.075720 | 0.178627 | 0.180175    |

---

