# 🧠 LLM Model Compression & Evaluation Pipeline

This Jupyter Notebook walks through a structured pipeline that automates the process of compressing large language models (LLMs) and evaluating their performance.

## 🔧 What this Notebook Does:
1. **Installs prerequisites**
2. **Lists and selects available models**
3. **Skips specified models for compression or evaluation**
4. **Applies quantization techniques (e.g., INT8, INT4, FP16)**
5. **Evaluates models using ROUGE, BLEU, and time metrics**
6. **Generates a comparison report from evaluation CSVs**

## 📦 Model Formats Supported:
- INT8 Quantization
- INT4 Quantization
- FP16 Precision

> 💡 Models and configurations are loaded from an external file (`llm_config.py`).

## 📊 Evaluation Metrics:
- **ROUGE Score** ([read more](https://en.wikipedia.org/wiki/ROUGE_(metric)))
- **BLEU Score** ([read more](https://en.wikipedia.org/wiki/BLEU))
- **Inference Time** (Average response time per model)
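
To make the overlap metrics concrete, here is a minimal sketch of a ROUGE-1-style F-measure. It is a simplification and not the `rouge-score` implementation used later in this notebook: it splits on whitespace and does no stemming, and the function name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style F-measure on whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# One shared token ("4") out of 4 candidate tokens and 5 reference tokens.
score = unigram_f1("2 + 2 = 4", "the answer is 4")
print(f"{score:.3f}")
```

A long generated answer scored against a very short reference will tend to have low precision, which pulls the F-measure down even when the answer is correct.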

---


## 🔧 Step 1: Environment Setup

The pipeline begins with the installation of all necessary libraries and setup of the working environment. 

If you're running this on a remote server or Colab, make sure the following steps execute successfully before continuing.
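
One way to confirm the installation worked is to check that the key modules can be located before moving on. The sketch below is an optional helper, not part of the original pipeline; the function name `missing_packages` and the exact list of module names are assumptions you may need to adjust.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names imported later in this notebook; adjust the list to your setup.
required = ["openvino", "optimum", "nncf", "rouge_score", "nltk", "ipywidgets"]
missing = missing_packages(required)
print("Missing packages:", missing or "none")
```

If the list is not empty, rerun the install cell (and restart the kernel if needed) before continuing.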


In [None]:
# Steps: 
# 1. Install Prerequisites
# 2. Get the Configuration File.
# 3. Check available models
# 4. Select models to skip for Compression
# 5. Compress every model into all 3 variants (FP16, INT8, INT4)
# 6. Select models to skip for Evaluation
# 7. Evaluate selected models per Compression Variant. 
# 8. Save the Results in a CSV File (Per Compression Variant)

import os
import platform

os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false"

#added installations 
%pip install rouge-score 
%pip install ipywidgets
%pip install pyngrok
 
#existing installations
%pip install -Uq pip
%pip uninstall -q -y optimum optimum-intel
%pip install --pre -Uq "openvino>=2024.2.0" openvino-tokenizers[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu \
"git+https://github.com/huggingface/optimum-intel.git" \
"nncf==2.14.1" \
"torch>=2.1" \
"datasets" \
"accelerate" \
"gradio>=4.19" \
"huggingface-hub>=0.26.5" \
"einops" "transformers>=4.43.1" "transformers_stream_generator" "tiktoken" "bitsandbytes"

if platform.system() == "Darwin":
    %pip install -q "numpy<2.0.0"

## ⚙️ Step 2: Model Configuration

This step fetches and prepares the model configuration file, which includes metadata for each LLM to be evaluated or compressed. The configuration is pulled from a central location or a locally-defined file.
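
The local part of this step follows a "symlink first, copy as a fallback" pattern. The sketch below isolates that pattern in a hypothetical helper, `link_or_copy`, and runs it in a temporary directory so no real files are touched. It resolves the source path before linking, which is an assumption made here to keep the link valid regardless of the working directory.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(src: Path, dst: Path) -> str:
    """Symlink dst to src; fall back to a plain copy if symlinks are unavailable."""
    try:
        os.symlink(src.resolve(), dst)
        return "symlink"
    except OSError:
        # Typical on Windows without symlink privileges or on some network filesystems.
        shutil.copy(src, dst)
        return "copy"


# Demonstration in a throwaway directory with a placeholder config file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "shared_config.py"
    dst = Path(tmp) / "llm_config.py"
    src.write_text("SUPPORTED_LLM_MODELS = {}\n", encoding="utf-8")
    how = link_or_copy(src, dst)
    copied_text = dst.read_text(encoding="utf-8")
    print(how, "->", copied_text.strip())
```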

In [None]:
import os
from pathlib import Path
import requests
import shutil

# fetch model configuration

config_shared_path = Path("../../utils/llm_config.py")
config_dst_path = Path("llm_config.py")

def download_config():
    # Fail loudly on HTTP errors instead of writing an error page into llm_config.py.
    r = requests.get(url="https://raw.githubusercontent.com/GodreignElgin/llm-comparision/llm_config.py")
    r.raise_for_status()
    with open(config_dst_path, "w", encoding="utf-8") as f:
        f.write(r.text)

if not config_dst_path.exists():
    if config_shared_path.exists():
        try:
            # Prefer a symlink so the local file tracks the shared config.
            os.symlink(config_shared_path, config_dst_path)
        except Exception:
            shutil.copy(config_shared_path, config_dst_path)
    else:
        download_config()
elif not os.path.islink(config_dst_path):
    # A regular file already exists: refresh it from the shared copy or the remote source.
    print("LLM config will be updated")
    if config_shared_path.exists():
        shutil.copy(config_shared_path, config_dst_path)
    else:
        download_config()

## 📦 Step 3: Import Model Support Info

Here, the notebook loads a list of supported models (`SUPPORTED_LLM_MODELS`) from an external Python script. It also sets up necessary interactive widgets for later UI-based model selection.
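
Later cells read two keys from each entry: `model_id` (required) and `remote_code` (optional, defaulting to `False`). The sketch below shows what a single entry could look like under that assumption. The dictionary name `EXAMPLE_MODELS` and the entry itself are illustrative; the real `llm_config.py` may contain additional keys.

```python
# Hypothetical entry: the real llm_config.py may contain more keys per model.
EXAMPLE_MODELS = {
    "qwen2.5-0.5b-instruct": {
        "model_id": "Qwen/Qwen2.5-0.5B-Instruct",  # Hugging Face Hub repository
        "remote_code": False,  # whether export needs --trust-remote-code
    },
}

for name, cfg in EXAMPLE_MODELS.items():
    model_id = cfg["model_id"]
    needs_remote_code = cfg.get("remote_code", False)
    print(f"{name}: {model_id} (remote code: {needs_remote_code})")
```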

In [None]:
from llm_config import SUPPORTED_LLM_MODELS
import ipywidgets as widgets

## 🧠 Check System Resources

To avoid memory issues during evaluation, this cell checks for available RAM. For memory-intensive models, having at least **16–32 GB** of RAM is ideal.
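
As a rough guide to how much RAM a model needs, the weights alone take about (number of parameters) × (bytes per weight). The sketch below applies that arithmetic to a 7-billion-parameter model. It is a lower bound under simplifying assumptions: it ignores activations, the KV cache, quantization scales, and the fact that an INT4 export with a ratio below 1.0 may keep part of the weights at higher precision. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one weight in each format (weights only).
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


for precision in BYTES_PER_WEIGHT:
    print(f"7B model, {precision}: ~{weight_memory_gb(7e9, precision):.1f} GB")
```

Comparing these estimates with the available memory printed by the next cell gives a quick sense of which variants are likely to fit.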

In [None]:
# Check the currently available RAM before running the evaluation
import psutil
print(f"Available Memory: {psutil.virtual_memory().available / 1e9:.2f} GB")

## 📋 List All Available Models

You’ll now see a list of all the models configured for compression/evaluation. Each model name here can be selected or skipped based on your requirement.

In [None]:
# To list all the available Models. 
# The model name listed in below this is the model name you should give in the 
models = list(SUPPORTED_LLM_MODELS)

models = widgets.Dropdown(
    options=models,
    value=models[0],
    description="Models Available: ",
    disabled=False,
)

models

## ⛔ Optional: Skip Models for Compression

You may not want to compress all models—especially if they've already been processed. This step uses a widget-based multi-select to allow skipping specific models for **compression only**.
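
One detail matters when a widget callback stores the selection in a module-level list: a plain assignment inside the function creates a new local variable, so the list seen by other cells never changes. The sketch below shows the difference using ordinary functions in place of the button callback; the function names are hypothetical.

```python
skipped = []


def on_submit_rebind(selected):
    # Plain assignment creates a new local name; the module-level list stays empty.
    skipped = list(selected)


def on_submit_in_place(selected):
    # Slice assignment mutates the existing list, so other cells see the update.
    skipped[:] = selected


on_submit_rebind(["gemma-2b-it"])
after_rebind = list(skipped)

on_submit_in_place(["gemma-2b-it"])
after_in_place = list(skipped)

print(after_rebind, after_in_place)
```

Declaring the name with `global` inside the callback is an equally valid fix.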

In [None]:
# Select the models to skip for Compression (all variants)
import ipywidgets as widgets
from IPython.display import display, clear_output

models = SUPPORTED_LLM_MODELS
checkboxes = [widgets.Checkbox(value=False, description=model) for model in models]
submit_button = widgets.Button(description="Submit")
output = widgets.Output()

SKIP_MODELS_COMPRESSION = []

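def on_submit_clicked(b):
    # Declare the name global so the selection is visible to later cells;
    # a plain assignment here would only create a local variable.
    global SKIP_MODELS_COMPRESSION
    SKIP_MODELS_COMPRESSION = [cb.description for cb in checkboxes if cb.value]
    with output:
        clear_output()
        print("Skipped models for Compression: ", SKIP_MODELS_COMPRESSION)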

submit_button.on_click(on_submit_clicked)

display(widgets.VBox(checkboxes + [submit_button, output]))

## 🔁 Iterate Through Compressible Models

This cell prepares a loop through the models that **aren’t skipped**, applying quantization methods such as:
- [INT8](https://huggingface.co/docs/transformers/performance#dynamic-quantization)
- [INT4](https://huggingface.co/docs/optimum/intel/quantization)
- [FP16](https://huggingface.co/docs/transformers/performance#mixed-precision-training)

You can view how models are compressed internally in the logs printed during execution.
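
The export commands in the next cell are built by string concatenation. As an alternative sketch, the same command can be assembled as an argument list, which avoids quoting problems with unusual paths. The helper name `build_export_command` is hypothetical; the flags are the ones the next cell passes to `optimum-cli`.

```python
import shlex


def build_export_command(model_id, output_dir, weight_format,
                         group_size=None, ratio=None, sym=False,
                         trust_remote_code=False):
    """Assemble the optimum-cli export command as an argument list."""
    cmd = [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--task", "text-generation-with-past",
        "--weight-format", weight_format,
    ]
    # The INT4-specific options are only added when they are provided.
    if group_size is not None:
        cmd += ["--group-size", str(group_size)]
    if ratio is not None:
        cmd += ["--ratio", str(ratio)]
    if sym:
        cmd.append("--sym")
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    cmd.append(str(output_dir))
    return cmd


cmd = build_export_command(
    "Qwen/Qwen2.5-0.5B-Instruct", "qwen2.5-0.5b-instruct/INT4", "int4",
    group_size=128, ratio=1.0, sym=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` when running outside a notebook, where the `!` shell escape is not available.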

In [None]:
# Iterate through all the available models except ones mentioned in the SKIP_MODELS_COMPRESSION
# to compress the models and store it automatically.
# The Compressions done in this code are FP16, INT8 and INT4. To include other versions, you need to explicitly write code for that.
from pathlib import Path
from IPython.display import Markdown, display

# Iterate through all supported models except the ones in SKIP_MODELS_COMPRESSION
for model_name in SUPPORTED_LLM_MODELS:
    if model_name in SKIP_MODELS_COMPRESSION:
        print(f"Skipping model: {model_name}")
        continue
    
    model_configuration = SUPPORTED_LLM_MODELS[model_name]
    print(f"Processing model: {model_name}")
    
    pt_model_id = model_configuration["model_id"]
    fp16_model_dir = Path(model_name) / "FP16"
    int8_model_dir = Path(model_name) / "INT8"
    int4_model_dir = Path(model_name) / "INT4"

    def convert_to_fp16():
        if not (fp16_model_dir / "openvino_model.xml").exists():
            remote_code = model_configuration.get("remote_code", False)
            export_command = f"optimum-cli export openvino --model {pt_model_id} --task text-generation-with-past --weight-format fp16 {str(fp16_model_dir)}"
            if remote_code:
                export_command += " --trust-remote-code"
            display(Markdown(f"**Export command:** {export_command}"))
            ! $export_command

    def convert_to_int8():
        if not (int8_model_dir / "openvino_model.xml").exists():
            int8_model_dir.mkdir(parents=True, exist_ok=True)
            remote_code = model_configuration.get("remote_code", False)
            export_command = f"optimum-cli export openvino --model {pt_model_id} --task text-generation-with-past --weight-format int8 {str(int8_model_dir)}"
            if remote_code:
                export_command += " --trust-remote-code"
            display(Markdown(f"**Export command:** {export_command}"))
            ! $export_command

    def convert_to_int4():
        compression_configs = {
            "default": {"sym": False, "group_size": 128, "ratio": 0.8},
            "zephyr-7b-beta": {"sym": True, "group_size": 64, "ratio": 0.6},
            "mistral-7b": {"sym": True, "group_size": 64, "ratio": 0.6},
            "minicpm-2b-dpo": {"sym": True, "group_size": 64, "ratio": 0.6},
            "gemma-2b-it": {"sym": True, "group_size": 64, "ratio": 0.6},
            "notus-7b-v1": {"sym": True, "group_size": 64, "ratio": 0.6},
            "neural-chat-7b-v3-1": {"sym": True, "group_size": 64, "ratio": 0.6},
            "llama-2-chat-7b": {"sym": True, "group_size": 128, "ratio": 0.8},
            "llama-3-8b-instruct": {"sym": True, "group_size": 128, "ratio": 0.8},
            "llama-3.1-8b-instruct": {"sym": True, "group_size": 128, "ratio": 1.0},
            "gemma-7b-it": {"sym": True, "group_size": 128, "ratio": 0.8},
            "chatglm2-6b": {"sym": True, "group_size": 128, "ratio": 0.72},
            "qwen-7b-chat": {"sym": True, "group_size": 128, "ratio": 0.6},
            "red-pajama-3b-chat": {"sym": False, "group_size": 128, "ratio": 0.5},
            "qwen2.5-7b-instruct": {"sym": True, "group_size": 128, "ratio": 1.0},
            "qwen2.5-3b-instruct": {"sym": True, "group_size": 128, "ratio": 1.0},
            "qwen2.5-14b-instruct": {"sym": True, "group_size": 128, "ratio": 1.0},
            "qwen2.5-1.5b-instruct": {"sym": True, "group_size": 128, "ratio": 1.0},
            "qwen2.5-0.5b-instruct": {"sym": True, "group_size": 128, "ratio": 1.0},
        }
        model_compression_params = compression_configs.get(model_name, compression_configs["default"])

        if not (int4_model_dir / "openvino_model.xml").exists():
            remote_code = model_configuration.get("remote_code", False)
            export_command = f"optimum-cli export openvino --model {pt_model_id} --task text-generation-with-past --weight-format int4"
            export_command += f" --group-size {model_compression_params['group_size']} --ratio {model_compression_params['ratio']}"
            if model_compression_params["sym"]:
                export_command += " --sym"
            export_command += f" {str(int4_model_dir)}"
            if remote_code:
                export_command += " --trust-remote-code"
            display(Markdown(f"**Export command:** {export_command}"))
            ! $export_command

    # Convert models if needed
    convert_to_fp16()
    convert_to_int8()
    convert_to_int4()

    print(f"Finished processing {model_name}\n")


## ⛔ Optional: Skip Models for Evaluation

Similar to compression, this allows skipping models from the **evaluation** stage using another widget selector. Skipping is helpful when running tests iteratively.
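
The evaluation stage only processes a model if it is not skipped and an exported `openvino_model.xml` exists for the requested variant. The sketch below shows that selection logic in isolation, using empty placeholder files in a temporary directory. The function name `models_ready_for_evaluation` and the model names are hypothetical.

```python
import tempfile
from pathlib import Path


def models_ready_for_evaluation(root, model_names, skip, variant):
    """Names that are not skipped and have an exported model for this variant."""
    ready = []
    for name in model_names:
        if name in skip:
            continue
        if (Path(root) / name / variant / "openvino_model.xml").exists():
            ready.append(name)
    return ready


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["model-a", "model-b"]:
        model_dir = Path(tmp) / name / "INT8"
        model_dir.mkdir(parents=True)
        (model_dir / "openvino_model.xml").touch()
    ready_int8 = models_ready_for_evaluation(
        tmp, ["model-a", "model-b", "model-c"], {"model-b"}, "INT8"
    )
    ready_int4 = models_ready_for_evaluation(tmp, ["model-a"], set(), "INT4")

print(ready_int8, ready_int4)
```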

In [None]:
# Select the models to skip for Evaluation (all variants)
import ipywidgets as widgets
from IPython.display import display, clear_output

models = SUPPORTED_LLM_MODELS
checkboxes = [widgets.Checkbox(value=False, description=model) for model in models]
submit_button = widgets.Button(description="Submit")
output = widgets.Output()

SKIP_MODELS_EVALUATION = []

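def on_submit_clicked(b):
    # Declare the name global so the selection is visible to later cells;
    # a plain assignment here would only create a local variable.
    global SKIP_MODELS_EVALUATION
    SKIP_MODELS_EVALUATION = [cb.description for cb in checkboxes if cb.value]
    with output:
        clear_output()
        print("Skipped models for Evaluation: ", SKIP_MODELS_EVALUATION)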

submit_button.on_click(on_submit_clicked)

display(widgets.VBox(checkboxes + [submit_button, output]))

# Evaluation Pipeline 

This pipeline iterates over every model in `SUPPORTED_LLM_MODELS` (except those marked to skip) and computes 11 evaluation metrics per model. The results are stored in a CSV file for future reference.

Each run of the pipeline evaluates a single compression variant (FP16, INT8, or INT4) and writes one CSV file for that variant.
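
The notebook's own code concatenates old and new rows, so rerunning the evaluation adds a second row for the same model. If that is not wanted, one alternative is to merge rows by model name so that a rerun replaces the earlier result. The sketch below shows this with the standard `csv` module; the helper name `merge_results` is hypothetical and the latency values are placeholders, not measurements.

```python
import csv
import io


def merge_results(existing_rows, new_rows, key="Model"):
    """Combine result rows, letting a new row replace an older row with the same key."""
    merged = {row[key]: row for row in existing_rows}
    for row in new_rows:
        merged[row[key]] = row
    return list(merged.values())


existing = [{"Model": "model-a", "Latency (ms)": "120.0"}]
new = [
    {"Model": "model-a", "Latency (ms)": "110.0"},  # rerun of the same model
    {"Model": "model-b", "Latency (ms)": "300.0"},
]
rows = merge_results(existing, new)

# Write to an in-memory buffer here; a real run would open the CSV file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Model", "Latency (ms)"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```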

## 🧪 Model Evaluation Setup

With the models compressed, the next cells import the required libraries and define the tools used for evaluation.

This includes:
- Tokenizers from 🤗 Hugging Face,
- Optimum libraries for quantized model loading,
- Metrics libraries for BLEU/ROUGE.
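
Besides the overlap scores, the evaluation function also reports simple diversity statistics on the generated text. The sketch below computes the same three quantities in plain Python, normalized by token count as in the notebook, with an added guard for empty output. The function name `diversity_metrics` and the dictionary keys are hypothetical.

```python
import math
from collections import Counter


def diversity_metrics(text: str) -> dict:
    """Whitespace-token diversity statistics; returns zeros for empty text."""
    tokens = text.split()
    n = len(tokens)
    if n == 0:
        return {"distinct_bigram_ratio": 0.0, "entropy_bits": 0.0, "adjacent_repeat_ratio": 0.0}
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(tokens)
    # Shannon entropy of the empirical token distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Number of positions where a token is immediately repeated.
    repeats = sum(1 for a, b in bigrams if a == b)
    return {
        "distinct_bigram_ratio": len(set(bigrams)) / n,
        "entropy_bits": entropy,
        "adjacent_repeat_ratio": repeats / n,
    }


stats = diversity_metrics("a b a b")
print(stats)
```

For the text `"a b a b"` there are two distinct bigrams over four tokens, two equally likely tokens (1 bit of entropy), and no immediate repeats.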



In [None]:
import time
import numpy as np
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.chrf_score import corpus_chrf
from sklearn.metrics.pairwise import cosine_similarity


def evaluate_model(model, tokenizer, input_text, reference_texts):
    # Start the timer; the latency below covers tokenization, generation, and decoding.
    start_time = time.time()
    
    # Running the model with the sample input text.  
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids=input_ids, max_new_tokens=128)
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    # Latency in milliseconds
    latency = (time.time() - start_time) * 1000
    # Note: output_ids[0] holds the prompt tokens followed by the generated tokens.
    num_tokens = len(output_ids[0])

    # Throughput in tokens per second
    throughput = num_tokens / (latency / 1000)
    
    # ROUGE score calculation
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_scores = scorer.score(reference_texts, generated_text)
    
    # BLEU Score Calculator
    smoothing = SmoothingFunction().method1
    bleu_score = sentence_bleu([reference_texts.split()], generated_text.split(), smoothing_function=smoothing)
    
    # chrf Score Calculator
    chrf_score = corpus_chrf([[reference_texts]], [[generated_text]])
    
    # Distinct bigrams, token entropy, and adjacent repeated tokens (all normalized by token count).
    # The guards avoid a division by zero when the model generates no text.
    tokens = generated_text.split()
    unique_ngrams = len(set(zip(tokens, tokens[1:]))) / len(tokens) if len(tokens) > 1 else 0
    entropy = -np.sum([p * np.log2(p) for p in np.unique(tokens, return_counts=True)[1] / len(tokens)]) if tokens else 0
    repeated_ngrams = sum([1 for i in range(len(tokens) - 1) if tokens[i] == tokens[i + 1]]) / len(tokens) if tokens else 0
    
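    # Coherence: cosine similarity between the mean prompt embedding and the mean output embedding.
    # If the model does not expose input embeddings (or the call fails), coherence falls back to 0.
    try:
        ref_emb = model.get_input_embeddings()(input_ids).detach().numpy()
        gen_emb = model.get_input_embeddings()(output_ids).detach().numpy()
        coherence = cosine_similarity(ref_emb.mean(axis=1), gen_emb.mean(axis=1))[0][0]
    except Exception:
        coherence = 0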
    
    return {
        "Model": f"{model.config.name_or_path}_{compression_dir}",
        "Latency (ms)": latency,
        "Throughput (tokens/sec)": throughput,
        "ROUGE-1": rouge_scores["rouge1"].fmeasure,
        "ROUGE-2": rouge_scores["rouge2"].fmeasure,
        "ROUGE-L": rouge_scores["rougeL"].fmeasure,
        "BLEU Score": bleu_score,
        "CHRF Score": chrf_score,
        "Unique n-grams": unique_ngrams,
        "Entropy": entropy,
        "Repeated n-grams (%)": repeated_ngrams * 100,
        "Coherence (Cosine Similarity)": coherence,
    }

## 🧊 Per-Variant Evaluation Loop

For each model and each compression variant (FP16 / INT8 / INT4), the next cell:
1. Loads the exported OpenVINO model from `<model_name>/<variant>` using `OVModelForCausalLM`.
2. Runs `evaluate_model` on a sample prompt and collects the metrics.
3. Appends the results to `evaluation_results_<variant>.csv`.

🔧 Behind the scenes, it uses:
- `optimum.intel` to load and run the OpenVINO models
- `openvino.properties` to set the performance hint, stream count, and model cache directory

📎 [Learn about model quantization](https://huggingface.co/docs/transformers/performance#model-quantization)


In [None]:
import gc
import pandas as pd
from pathlib import Path
from transformers import AutoConfig, AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM
import openvino.properties as props
import openvino.properties.hint as hints
import openvino.properties.streams as streams

def evaluate_models_and_save(SUPPORTED_LLM_MODELS, compression_dir):

    output_csv=f"evaluation_results_{compression_dir}.csv"

    # results stored as a list
    all_results = []
    existing_df = None
    
    # If the CSV already exists, load it so new rows are appended instead of overwriting the file.
    if Path(output_csv).exists():
        existing_df = pd.read_csv(output_csv)
    
    for model_name, model_configuration in SUPPORTED_LLM_MODELS.items():

        if model_name in SKIP_MODELS_EVALUATION:
            print(f"Skipping model: {model_name}")
            continue
        print(f"Processing model: {model_name}")
        
        #loads only the specified precision model file. 
        model_dir = Path(model_name) / compression_dir  

        # only loads the openvino format model. 
        if not (model_dir / "openvino_model.xml").exists():
            continue
        
        print(f"Loading model from {model_dir}")
        ov_config = {
            hints.performance_mode(): hints.PerformanceMode.THROUGHPUT,
            streams.num(): "AUTO",
            props.cache_dir(): "ov_cache"
        }
        
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
            model = OVModelForCausalLM.from_pretrained(
                model_dir,
                device="CPU",  # set the device here, e.g. "GPU" if one is available
                ov_config=ov_config,
                config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
                trust_remote_code=True
            )
        except Exception as e:
            print(f"Error loading {model_name}: {e}")
            continue
        
        # sample input text for testing. 
        input_text = "2 + 2 ="

        # reference text that the generated output is scored against (single test case)
        reference_texts = "2 + 2 = 4"
        
        try:
            results = evaluate_model(model, tokenizer, input_text, reference_texts)
            results["Model"] = model_name
            all_results.append(results)
        except Exception as e:
            print(f"Error during evaluation of {model_name}: {e}")
            
        # Delete the evaluated model and its tokenizer to free RAM.
        del model, tokenizer

        # Run the garbage collector to release the memory right away.
        gc.collect()
    
    # Adding the results to a dataframe for easier analysis.
    df = pd.DataFrame(all_results)
    if existing_df is not None:
        df = pd.concat([existing_df, df], ignore_index=True)
    
    df.to_csv(output_csv, index=False)
    print(f"Results appended to {output_csv}")

compression_dirs = ["FP16", "INT8", "INT4"]
# Each entry is the name of the per-model subdirectory that holds one compressed variant.
# Example: compression_dir = "FP16" if the files are stored under <model_name>/FP16/.
for compression_dir in compression_dirs:
    evaluate_models_and_save(SUPPORTED_LLM_MODELS, compression_dir)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the int4 data
data_int4 = """Model,Latency (ms),Throughput (tokens/sec),ROUGE-1,ROUGE-2,ROUGE-L,BLEU Score,CHRF Score,Unique n-grams,Entropy,Repeated n-grams (%),Coherence (Cosine Similarity)
qwen2.5-0.5b-instruct,3718.015432357788,35.77177190888036,0.0789473684210526,0.054054054054054,0.0789473684210526,0.0339483954244015,0.0813940147091148,0.696969696969697,5.182563191622859,0.0,0.0
tiny-llama-1b-chat,4734.71999168396,28.51277377270744,0.074074074074074,0.050632911392405,0.074074074074074,0.0421667093080309,0.0785752665303753,0.55,4.512492001110319,1.25,0.0
DeepSeek-R1-Distill-Qwen-1.5B,7707.976341247559,17.384588907328133,0.1428571428571428,0.05,0.1428571428571428,0.0503446068227303,0.1406866456493381,0.9791666666666666,4.634761657503722,0.0,0.0
DeepSeek-R1-Distill-Qwen-7B,56090.147495269775,2.389011368017896,0.0681818181818181,0.0465116279069767,0.0681818181818181,0.0387138472888192,0.0682625017764336,0.8160919540229885,5.350315334630777,3.4482758620689653,0.0
qwen2.5-1.5b-instruct,3660.7508659362793,9.834048073302192,0.3529411764705882,0.2666666666666667,0.3529411764705882,0.2140909265975804,0.2989303567713718,0.9411764705882352,3.734521664779752,0.0,0.0
gemma-2b-it,4337.922811508179,7.837852695257876,0.2352941176470588,0.1333333333333333,0.2352941176470588,0.1431712315455507,0.2166282977428487,0.7647058823529411,3.337175341123077,0.0,0.0
gemma-2-2b-it,4597.316265106201,7.395619104576564,0.2608695652173913,0.1904761904761905,0.2608695652173913,0.1700107809840422,0.2500253190521059,0.9523809523809524,4.297079327540666,0.0,0.0
qwen2.5-3b-instruct,6167.704343795776,3.242697588142081,0.5,0.3333333333333333,0.5,0.2984745896009823,0.3151663151663151,0.8888888888888888,2.725480556997868,0.0,0.0
DeepSeek-R1-Distill-Llama-8B,29102.829933166504,4.604363228858695,0.0681818181818181,0.0465116279069767,0.0681818181818181,0.0255901590405087,0.0657462943708585,0.8863636363636364,5.473580624294531,1.1363636363636365,0.0
qwen2.5-7b-instruct,28736.822605133057,4.628208268795978,0.0615384615384615,0.0317460317460317,0.0615384615384615,0.0278438083263775,0.0709180336599263,0.8888888888888888,5.333875392324575,0.0,0.0
llama-3.2-1b-instruct,5683.32839012146,23.5777331172546,0.1132075471698113,0.0784313725490195,0.1132075471698113,0.0401218776374591,0.188787574107971,0.5833333333333334,3.61007166776166,14.285714285714285,0.0
llama-3.2-3b-instruct,13235.87131500244,10.12400293195018,0.0517241379310344,0.0350877192982456,0.0517241379310344,0.0291622159772622,0.0655736528349281,0.2608695652173913,3.50282594697423,0.0,0.0
zephyr-7b-beta,30112.271785736084,4.483222021924905,0.0666666666666666,0.0454545454545454,0.0666666666666666,0.0241915748247312,0.0634091809588424,0.935483870967742,5.806481762856229,0.0,0.0
gemma-2-9b-it,17309.56268310547,2.19531831599009,0.2222222222222222,0.16,0.2222222222222222,0.1410002457876886,0.21321465960841668,0.96,4.563856189774725,0.0,0.0
"""
from io import StringIO
df_int4 = pd.read_csv(StringIO(data_int4))

# Display the first few rows of the dataframe
print("First few rows of int4 data:")
print(df_int4.head())
print("\n")

# Basic information about the dataframe
print("Information about int4 data:")
print(df_int4.info())
print("\n")

# 1. Latency vs. Throughput Scatter Plot (int4)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Latency (ms)', y='Throughput (tokens/sec)', data=df_int4, hue='Model', s=100)
plt.title('Latency vs. Throughput for INT4 Compressed Models')
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (tokens/sec)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# 2. Bar Chart for ROUGE-L (int4)
plt.figure(figsize=(12, 6))
sns.barplot(x='Model', y='ROUGE-L', data=df_int4, palette='viridis')
plt.title('ROUGE-L Scores for INT4 Compressed Models')
plt.xlabel('Model')
plt.ylabel('ROUGE-L Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# 3. Bar Chart for BLEU Score (int4)
plt.figure(figsize=(12, 6))
sns.barplot(x='Model', y='BLEU Score', data=df_int4, palette='mako')
plt.title('BLEU Scores for INT4 Compressed Models')
plt.xlabel('Model')
plt.ylabel('BLEU Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()