# Benchmarking

The purpose of this Jupyter notebook is to load the original and optimized models onto a device (CPU or GPU) and run benchmarks such as inference throughput and model size.

Model Loading: Load the TinyLlama-1.1B-Chat model using Hugging Face's transformers library. This involves retrieving the pre-trained model and its tokenizer, which are needed to process input text and generate responses. In the second part of this notebook, the optimized models are loaded using the **OVModelForCausalLM** API.

Next, in the Python environment, we import the required modules and then download the original model and its tokenizer.

In [19]:
from config import SUPPORTED_LLM_MODELS
import ipywidgets as widgets
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline
from transformers import AutoModelForCausalLM, AutoConfig
from optimum.intel import OVQuantizer
import openvino as ov
from pathlib import Path
import shutil
import torch
import logging
import nncf
import gc
from transformers import (
    StoppingCriteria,
    StoppingCriteriaList,
    TextIteratorStreamer,
)
from converter import converters, register_configs

register_configs()
core = ov.Core()

Below is a sample prompt and configuration used to set up the LLM and generate responses.

In [2]:
model_configuration = {'model_id': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0',
 'remote': False,
 'start_message': "<|system|>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\nIf a question does not make any sense or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.</s>\n",
 'history_template': '<|user|>\n{user}</s> \n<|assistant|>\n{assistant}</s> \n',
 'current_message_template': '<|user|>\n{user}</s> \n<|assistant|>\n{assistant}',
 'prompt_template': "<|system|> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.</s>\n        <|user|>\n        Question: {question} \n        Context: {context} \n        Answer: </s>\n        <|assistant|>"}

The code block below loads the pre-trained TinyLlama/TinyLlama-1.1B-Chat-v1.0 model from the Hugging Face Hub using the `AutoModelForCausalLM` API. Once the model is loaded, its total in-memory size (parameters plus buffers) is calculated.

In [3]:
import torch
from transformers import AutoModelForCausalLM
model_fp32 = AutoModelForCausalLM.from_pretrained(model_configuration['model_id'])

def model_memory_size(model):
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    total_size = param_size + buffer_size
    return total_size / (1024 ** 2)  # Convert to megabytes

size_in_mb = model_memory_size(model_fp32)
print(f"Approximate model size in memory: {size_in_mb:.2f} MB")

Approximate model size in memory: 4218.35 MB
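
As a quick sanity check on this figure, the size in MB can be converted back into an approximate element count. The sketch below assumes every parameter and buffer is stored as 32-bit floats (4 bytes each), which is the default when no dtype is passed to `from_pretrained`; the helper name `elements_from_size` is hypothetical.

```python
def elements_from_size(size_in_mb, bytes_per_element):
    """Invert the size formula: MB -> number of tensor elements."""
    return size_in_mb * (1024 ** 2) / bytes_per_element

# 4218.35 MB is the value printed above; 4 bytes per element assumes FP32.
n_elements = elements_from_size(4218.35, 4)
print(f"Approximate element count: {n_elements / 1e9:.2f} billion")
```

The result is consistent with the "1.1B" in the model name.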


In [5]:
from pathlib import Path

Set up the model paths that will be used later to load the quantized models.

In [6]:
fp16_model_dir = Path('tiny-llama-1b-chat/FP16')
int8_model_dir = Path('tiny-llama-1b-chat/INT8_compressed_weights')
int4_model_dir = Path('tiny-llama-1b-chat/INT4_compressed_weights')

Let's compare model size for different compression types.

In [7]:
fp16_weights = fp16_model_dir / "openvino_model.bin"
int8_weights = int8_model_dir / "openvino_model.bin"
int4_weights = int4_model_dir / "openvino_model.bin"

if fp16_weights.exists():
    print(f"Size of FP16 model is {fp16_weights.stat().st_size / 1024 / 1024:.2f} MB")
for precision, compressed_weights in zip([8, 4], [int8_weights, int4_weights]):
    if compressed_weights.exists():
        print(
            f"Size of model with INT{precision} compressed weights is {compressed_weights.stat().st_size / 1024 / 1024:.2f} MB"
        )
    if compressed_weights.exists() and fp16_weights.exists():
        print(
            f"Compression rate for INT{precision} model: {fp16_weights.stat().st_size / compressed_weights.stat().st_size:.3f}"
        )

Size of FP16 model is 2098.68 MB
Size of model with INT8 compressed weights is 1050.55 MB
Compression rate for INT8 model: 1.998
Size of model with INT4 compressed weights is 696.51 MB
Compression rate for INT4 model: 3.013
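
To put these ratios in context, the sketch below compares the measured file sizes with what the weight bit-width alone would predict. It assumes that weights dominate the file size and that every weight is stored at the nominal precision, which is a simplification; the sizes are copied from the output above and `ideal_size_mb` is a hypothetical helper.

```python
# File sizes in MB, copied from the output above.
measured_mb = {"FP16": 2098.68, "INT8": 1050.55, "INT4": 696.51}
bits = {"FP16": 16, "INT8": 8, "INT4": 4}

def ideal_size_mb(fp16_mb, target_bits):
    """Size predicted if every 16-bit weight were stored in target_bits bits."""
    return fp16_mb * target_bits / 16

for name, size in measured_mb.items():
    ideal = ideal_size_mb(measured_mb["FP16"], bits[name])
    ratio = measured_mb["FP16"] / size
    print(f"{name}: measured {size:.2f} MB, ideal {ideal:.2f} MB, ratio {ratio:.3f}")
```

The INT8 file is close to the ideal 2x reduction, while the INT4 file is larger than the ideal quarter-size estimate of about 525 MB. A plausible explanation, not verified here, is that some layers are kept at higher precision and that quantization scales add overhead.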


In [15]:
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="CPU",
    description="Device:",
    disabled=False,
)

device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

# Model Initialization and Throughput Calculation

This part of the notebook demonstrates the initialization of an OpenVINO-optimized Large Language Model (LLM) and the calculation of its throughput. We use a closure to encapsulate model initialization, so that the model is reloaded only when its directory changes. The initialized model is then reused for subsequent throughput calculations without reloading it.

## Initialization with Closure

A closure, `model_initializer`, is defined to create a function that remembers the last initialized model and tokenizer. This avoids unnecessary reinitialization and leverages caching for improved performance.
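
The caching pattern itself can be illustrated without any model. The following is a minimal sketch in which `make_cached_loader` and `fake_load` are hypothetical stand-ins: the inner function keeps the last key and result in the enclosing scope and calls the loader again only when the key changes.

```python
def make_cached_loader(load_fn):
    """Return a function that calls load_fn only when the key changes."""
    last_key = None
    cached = None

    def get(key):
        nonlocal last_key, cached
        if cached is None or key != last_key:
            cached = load_fn(key)
            last_key = key
        return cached

    return get

# Stand-in loader that records how often it is called.
calls = []

def fake_load(model_dir):
    calls.append(model_dir)
    return {"dir": model_dir}

get_model = make_cached_loader(fake_load)
get_model("FP16")
get_model("FP16")  # served from the cache
get_model("INT8")  # key changed, so the loader runs again
print(calls)  # ['FP16', 'INT8']
```

Note that only the most recent model is kept, so switching back to an earlier directory triggers a reload.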

## Throughput Calculation
We define a function, `calculate_throughput`, which measures the model's throughput. It uses the closure defined above to initialize the model if necessary and then times how quickly the model generates a response to a sample input.
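
The timing logic can be sketched independently of any model. The example below is illustrative only: `fake_generate` is a hypothetical stand-in for a model's `generate` call, and `tokens_per_second` counts only newly generated tokens, whereas the function in the next cell divides the full output length (prompt plus generated tokens) by the elapsed time.

```python
from time import perf_counter, sleep

def tokens_per_second(generate_fn, prompt_ids, max_new_tokens):
    """Time one generate call and return (new tokens per second, new token count)."""
    start = perf_counter()
    output_ids = generate_fn(prompt_ids, max_new_tokens)
    duration = perf_counter() - start
    new_tokens = len(output_ids) - len(prompt_ids)
    return new_tokens / duration, new_tokens

def fake_generate(prompt_ids, max_new_tokens):
    """Stand-in for a model: echoes the prompt and appends dummy token ids."""
    sleep(0.01)  # simulate inference latency
    return list(prompt_ids) + [0] * max_new_tokens

rate, n = tokens_per_second(fake_generate, [1, 2, 3], 8)
print(f"{n} new tokens, {rate:.1f} tokens/s")
```

A single timed call is noisy, so averaging over several runs after a warm-up call generally gives a more stable figure.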


In [25]:
from transformers import AutoTokenizer, AutoConfig
from time import perf_counter

# Assuming OVModelForCausalLM and other necessary imports are defined elsewhere

def model_initializer():
    initialized_model_dir = None
    model = None
    tok = None

    def initialize_model_if_needed(model_dir, model_name, model_configuration, device, tokenizer_kwargs):
        nonlocal initialized_model_dir, model, tok

        # Check if the model_dir has changed or if the model hasn't been initialized yet
        if model_dir != initialized_model_dir or model is None:
            print(f"Initializing model from {model_dir}...")
            class_key = model_name.split("-")[0]
            tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
            ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}
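            # `model_classes` is not defined in this notebook; it is only needed when
            # model_configuration["remote"] is True, which is not the case here.
            model_class = (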
                OVModelForCausalLM
                if not model_configuration["remote"]
                else model_classes[class_key]
            )
            model = model_class.from_pretrained(
                model_dir,
                device=device,
                ov_config=ov_config,
                config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
                trust_remote_code=True,
            )
            initialized_model_dir = model_dir
        else:
            print(f"Using cached model from {model_dir}.")

        return model, tok

    return initialize_model_if_needed

# Create a closure function that remembers the last initialized model
initialize_model = model_initializer()

def calculate_throughput(model_dir, model_name, quantization_type, model_configuration, device, tokenizer_kwargs):
    # Use the closure to initialize the model if needed
    ov_model, tok = initialize_model(model_dir, model_name, model_configuration, device, tokenizer_kwargs)

    input_tokens = tok('a sometimes tedious film.', return_tensors="pt", **tokenizer_kwargs)

    # Assuming TextIteratorStreamer and other necessary configurations are defined
    streamer = TextIteratorStreamer(tok, timeout=30.0, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        input_ids=input_tokens.input_ids,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,  # temperature > 0, so sampling is enabled
        top_p=1,
        top_k=50,
        repetition_penalty=1.1,
        streamer=streamer,
    )

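    # Time only the generate call; model loading happens above and is excluded.
    start = perf_counter()
    answer = ov_model.generate(**generate_kwargs)
    end = perf_counter()
    duration = end - start
    # answer[0] holds the prompt tokens followed by the generated tokens,
    # so the figure below is total sequence tokens per second.
    print(f"Throughput of {quantization_type} {model_name} is {len(answer[0]) / duration}")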


In [26]:
tokenizer_kwargs = dict(
    padding=True,
    truncation=True,
    max_length=512
)

The cells below run `calculate_throughput` twice for each precision (FP16, INT8, INT4). The first call initializes the model and the second reuses the cached one. Each call prints the measured throughput in tokens per second, which lets us compare the effect of weight compression under OpenVINO.

In [27]:
calculate_throughput(fp16_model_dir, model_configuration["model_id"], 'FP16: ', model_configuration, device.value, tokenizer_kwargs)
calculate_throughput(fp16_model_dir, model_configuration["model_id"], 'FP16: ', model_configuration, device.value, tokenizer_kwargs)

Initializing model from tiny-llama-1b-chat/FP16...


The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
Compiling the model to CPU ...
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Throughput of FP16:  TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 2.4456816906266847
Using cached model from tiny-llama-1b-chat/FP16.
Throughput of FP16:  TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 5.727936268042223


In [28]:
calculate_throughput(int8_model_dir, model_configuration["model_id"], 'INT8: ', model_configuration, device.value, tokenizer_kwargs)
calculate_throughput(int8_model_dir, model_configuration["model_id"], 'INT8: ', model_configuration, device.value, tokenizer_kwargs)

Initializing model from tiny-llama-1b-chat/INT8_compressed_weights...


The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
Compiling the model to CPU ...
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Throughput of INT8:  TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 3.8877829410108555
Using cached model from tiny-llama-1b-chat/INT8_compressed_weights.
Throughput of INT8:  TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 20.9805502640221


In [29]:
calculate_throughput(int4_model_dir, model_configuration["model_id"], 'INT4: ', model_configuration, device.value, tokenizer_kwargs)
calculate_throughput(int4_model_dir, model_configuration["model_id"], 'INT4: ', model_configuration, device.value, tokenizer_kwargs)

Initializing model from tiny-llama-1b-chat/INT4_compressed_weights...


The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
Compiling the model to CPU ...
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Throughput of INT4:  TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 8.845335131923164
Using cached model from tiny-llama-1b-chat/INT4_compressed_weights.
Throughput of INT4:  TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 25.990472594905917


In [33]:
tok = AutoTokenizer.from_pretrained(model_configuration['model_id'])
model_fp32 = AutoModelForCausalLM.from_pretrained(model_configuration['model_id'])
input_tokens = tok('a sometimes tedious film.', return_tensors="pt", **tokenizer_kwargs)

# Assuming TextIteratorStreamer and other necessary configurations are defined
streamer = TextIteratorStreamer(tok, timeout=30.0, skip_prompt=True, skip_special_tokens=True)
generate_kwargs = dict(
    input_ids=input_tokens.input_ids,
    max_new_tokens=256,
    temperature=0.1,
    do_sample=True,  # temperature > 0, so sampling is enabled
    top_p=1,
    top_k=50,
    repetition_penalty=1.1,
    streamer=streamer,
)

start = perf_counter()
answer = model_fp32.generate(**generate_kwargs)
end = perf_counter()
duration = end - start
print(f"Throughput of ORIGINAL: {model_configuration['model_id']} is {len(answer[0]) / duration}")

Throughput of ORIGINAL: TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 4.859636410956347


In [34]:
input_tokens = tok('a sometimes tedious film.', return_tensors="pt", **tokenizer_kwargs)

# Assuming TextIteratorStreamer and other necessary configurations are defined
streamer = TextIteratorStreamer(tok, timeout=30.0, skip_prompt=True, skip_special_tokens=True)
generate_kwargs = dict(
    input_ids=input_tokens.input_ids,
    max_new_tokens=256,
    temperature=0.1,
    do_sample=True,  # temperature > 0, so sampling is enabled
    top_p=1,
    top_k=50,
    repetition_penalty=1.1,
    streamer=streamer,
)

start = perf_counter()
answer = model_fp32.generate(**generate_kwargs)
end = perf_counter()
duration = end - start
print(f"Throughput of ORIGINAL: {model_configuration['model_id']} is {len(answer[0]) / duration}")

Throughput of ORIGINAL: TinyLlama/TinyLlama-1.1B-Chat-v1.0 is 5.9102147189626475
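
To compare the runs at a glance, the sketch below expresses each second-run throughput relative to the original PyTorch model. The numbers are copied from the outputs above; `speedups` is a hypothetical helper, and because each figure comes from a single sampled generation, the ratios should be read as rough indications rather than precise measurements.

```python
# Second-run throughputs (tokens/s), copied from the outputs above.
throughput = {
    "ORIGINAL": 5.9102147189626475,
    "FP16": 5.727936268042223,
    "INT8": 20.9805502640221,
    "INT4": 25.990472594905917,
}

def speedups(results, baseline="ORIGINAL"):
    """Return each throughput divided by the baseline throughput."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}

for name, factor in speedups(throughput).items():
    print(f"{name:>8}: {throughput[name]:6.2f} tokens/s  ({factor:.2f}x)")
```

On these runs the INT8 and INT4 models are several times faster than the original, while FP16 is roughly on par with it.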
