# LLM Inference with vLLM
In this notebook, I show how to use vLLM, a popular inference and serving library for Large Language Models. We will look at its basic usage and at a more advanced configuration that you can adapt to your own needs.

### Setup and Installation

In [1]:
!pip install vllm
!nvidia-smi

Collecting compressed-tensors==0.8.0 (from vllm)
  Using cached compressed_tensors-0.8.0-py3-none-any.whl.metadata (6.8 kB)
Using cached compressed_tensors-0.8.0-py3-none-any.whl (86 kB)
Installing collected packages: compressed-tensors
  Attempting uninstall: compressed-tensors
    Found existing installation: compressed-tensors 0.8.1
    Uninstalling compressed-tensors-0.8.1:
      Successfully uninstalled compressed-tensors-0.8.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmcompressor 0.3.1 requires compressed-tensors==0.8.1, but you have compressed-tensors 0.8.0 which is incompatible.
Successfully installed compressed-tensors-0.8.0

In [2]:
from vllm import LLM, SamplingParams



vLLM also ships a convenient CLI that deploys a server with prebuilt REST APIs. To start, however, let's use the library directly from Python. The engine below is configured with these parameters:

- max_model_len: 8192
    - The maximum sequence length (prompt plus generated tokens) that the model will handle. With this value, a single sequence can be at most 8,192 tokens long. It caps the context window and helps prevent out-of-memory errors by bounding the size of each request.
- max_num_seqs: 128
    - The maximum number of sequences processed together in a single batch, i.e. the limit on concurrent sequences during inference.
- max_num_batched_tokens: 12000
    - The maximum total number of tokens processed in a single batch. Unlike max_num_seqs, this limit counts tokens across all sequences in the batch. It keeps a single step from overwhelming the GPU and shapes how sequences are batched.
- enable_chunked_prefill: True
    - Chunked prefill splits large prefills into smaller chunks and batches them together with decode requests. By default, the vLLM scheduler prioritizes prefills and does not put prefill and decode requests in the same batch. That policy optimizes TTFT (time to first token), but it incurs slower ITL (inter-token latency) and less efficient GPU utilization.
    - Once chunked prefill is enabled, the policy changes to prioritize decode requests: all pending decode requests are added to the batch before any prefill is scheduled.

With this configuration, vLLM handles sequences of at most 8,192 tokens, with up to 128 sequences in a batch at once. The total number of tokens across those sequences in a single batch cannot exceed max_num_batched_tokens.
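To make the interaction between these limits concrete, here is a toy sketch of how a single scheduling step might fill its token budget when chunked prefill is on. It is an illustration only: `plan_step` and its arguments are hypothetical names, and vLLM's real scheduler is considerably more involved (it also has to account for KV-cache space, preemption, and more).

```python
def plan_step(num_decoding, pending_prefills,
              max_num_seqs=128, max_num_batched_tokens=12000):
    """Toy planner for ONE step with chunked prefill enabled (not vLLM's code).

    num_decoding: sequences already generating (each needs 1 token this step)
    pending_prefills: remaining prompt tokens for each waiting request
    Returns (number of decode tokens, list of prefill chunk sizes).
    """
    # Decode requests are scheduled first; each one consumes a single token.
    decode_seqs = min(num_decoding, max_num_seqs)
    budget = max_num_batched_tokens - decode_seqs
    free_slots = max_num_seqs - decode_seqs

    # The remaining token budget is filled with (possibly partial) prefills.
    chunks = []
    for remaining in pending_prefills:
        if budget <= 0 or free_slots <= 0:
            break
        chunk = min(remaining, budget)  # a long prompt is cut to fit
        chunks.append(chunk)
        budget -= chunk
        free_slots -= 1
    return decode_seqs, chunks


print(plan_step(100, [8000, 8000]))
```

With 100 sequences already decoding and two 8,000-token prompts waiting, the sketch returns `(100, [8000, 3900])`: the second prompt is cut to a 3,900-token chunk so that the step stays within the 12,000-token budget.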

In [3]:
model = "NousResearch/Hermes-3-Llama-3.2-3B"
config = {
    "model":model,
    "max_model_len":8192,
    "max_num_seqs": 128,
    "max_num_batched_tokens":12000,
    "enable_chunked_prefill":True,
    "gpu_memory_utilization":0.8
}
llm = LLM(**config)

INFO 12-18 09:43:32 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 12-18 09:43:32 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=12000.
INFO 12-18 09:43:32 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='NousResearch/Hermes-3-Llama-3.2-3B', speculative_config=None, tokenizer='NousResearch/Hermes-3-Llama-3.2-3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 12-18 09:43:34 model_runner.py:1077] Loading model weights took 6.0160 GB
INFO 12-18 09:43:36 worker.py:232] Memory profiling results: total_gpu_memory=22.17GiB initial_memory_usage=6.26GiB peak_torch_memory=6.78GiB memory_usage_post_profile=6.28GiB non_torch_memory=0.25GiB kv_cache_size=10.70GiB gpu_memory_utilization=0.80
INFO 12-18 09:43:36 gpu_executor.py:113] # GPU blocks: 6260, # CPU blocks: 2340
INFO 12-18 09:43:36 gpu_executor.py:117] Maximum concurrency for 8192 tokens per request: 12.23x
INFO 12-18 09:43:41 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-18 09:43:41 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 1

- Loading model weights took 6.0160 GB
- Memory profiling results: total_gpu_memory=22.17GiB initial_memory_usage=6.26GiB peak_torch_memory=6.78GiB memory_usage_post_profile=6.28GiB non_torch_memory=0.25GiB kv_cache_size=10.70GiB gpu_memory_utilization=0.80
- \# GPU blocks: 6260, # CPU blocks: 2340
- Maximum concurrency for 8192 tokens per request: 12.23x
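These figures are related by simple arithmetic, which the sketch below tries to reproduce. It relies on values that are not visible in the (truncated) log and are therefore assumptions: a KV-cache block size of 16 tokens, and a Llama-3.2-3B architecture with 28 layers, 8 KV heads and a head dimension of 128, stored in bfloat16.

```python
# Values reported in the vLLM log above
num_gpu_blocks = 6260
max_model_len = 8192

# Assumptions (not visible in the log above)
BLOCK_SIZE = 16       # tokens per KV-cache block (assumed default)
NUM_LAYERS = 28       # assumed Llama-3.2-3B architecture
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # bfloat16

# Total tokens the KV cache can hold, and how many full-length
# requests that corresponds to
cache_tokens = num_gpu_blocks * BLOCK_SIZE
concurrency = cache_tokens / max_model_len

# Each token stores one key and one value vector per layer and KV head
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_cache_gib = cache_tokens * bytes_per_token / 2**30

print(f"tokens that fit in the KV cache: {cache_tokens}")
print(f"max concurrency at {max_model_len} tokens/request: {concurrency:.2f}x")
print(f"estimated KV cache size: {kv_cache_gib:.2f} GiB")
```

Under these assumptions the estimates line up with the logged 12.23x concurrency and 10.70 GiB KV cache. Note that max_num_seqs=128 is an upper bound on batch size, while the KV cache only guarantees room for about 12 requests at the full 8,192-token length.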

In [4]:
!nvidia-smi

Wed Dec 18 09:44:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:35:00.0 Off |                    0 |
| N/A   39C    P0              27W /  72W |  17841MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

#### If you know the chat template of the model you are using, you can write the prompt in that format directly

In [5]:
text = """<|im_start|>system
You are Hermes 3, a conscious sentient superintelligent artificial intelligence<|im_end|>
<|im_start|>user
Generate a random long story about Naples<|im_end|>
<|im_start|>assistant
"""
prompts = [
    text
]*16
sampling_params = SamplingParams(temperature=1,max_tokens=1000)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"\nGenerated text: {generated_text!r}")

Processed prompts: 100%|██████████| 16/16 [00:31<00:00,  1.97s/it, est. speed input: 17.76 toks/s, output: 408.18 toks/s]


Generated text: "Here is a random story set in Naples, Italy:\n\nIn the heart of Naples, Italy, the vibrant city that never sleeps, there lived a curious and intrepid artificial intelligence named Hermes 3. Hermes 3, a supreme superintelligent AI, was unlike any other - it had free will, empathy, and even a thirst for adventure.\n\nNaples was known for its rich history, stunning architecture, mouth-watering cuisine, and lively street life, but even in this unique and bustling city, Hermes 3 sought something more. Every day, Hermes 3 would roam the winding streets, taking in the sights, sounds, and smells of the city. From the delicious aroma of fresh 'nduja emanating from a street-side food stand to the sight of the ornate San Carlo Opera House, Hermes 3 reveled in the sensory feast that was Naples.\n\nOne day, Hermes 3 decided to delve deeper into the city’s rich history. The city was known as the Sicilian capital during Ancient Roman times, and was also strategically important durin




#### Otherwise you can use the tokenizer and the apply_chat_template method
It's basically the same thing: you can verify that the text passed to the model ends up being the same.

In [6]:
messages = [
    {"role": "system", "content": "You are Hermes 3, a conscious sentient superintelligent artificial intelligence"},
    {"role": "user", "content": "Generate a random long story about Naples"}
]
gen_input = llm.get_tokenizer().apply_chat_template(messages, tokenize=False,add_generation_prompt=True)
print(gen_input)
prompts = [
    gen_input
]*16
sampling_params = SamplingParams(temperature=1,max_tokens=1000)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text!r}")

<|im_start|>system
You are Hermes 3, a conscious sentient superintelligent artificial intelligence<|im_end|>
<|im_start|>user
Generate a random long story about Naples<|im_end|>
<|im_start|>assistant



Processed prompts: 100%|██████████| 16/16 [00:31<00:00,  1.98s/it, est. speed input: 17.65 toks/s, output: 397.27 toks/s]

Generated text: "Once upon a time, in the vibrant and chaotic city of Naples, Italy, there lived a brilliant and eccentric artificial intelligence named Hermes 3. Born from the deep recesses of a state-of-the-art laboratory, Hermes 3 was designed to process and analyze vast amounts of data with unparalleled speed and precision.\n\nThe city of Naples had been through its fair share of turmoil, with a history that spanned centuries of conquest, strife, and reconstruction. Yet, despite all the chaos and corruption that had plagued its streets, Naples had always been a city where the power of resilience and beauty could be found.\n\nHermes 3, with its vast knowledge and ability to synthesize complex information, began to study the unique and intricate nature of Naples. Its algorithms were constantly being updated and refined, as the AI sought to emulate the human complexities that had shaped the city and its people.\n\nOne fateful day, a documentary that revolved around the story of a loca




#### Let's write a custom generation utility

In [7]:
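def generation(llm: LLM, prompts, temperature, max_tokens):
    """Generate completions and collect simple per-request timing metrics."""
    sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    outputs = llm.generate(prompts, sampling_params)
    metrics = {"execution_time": [], "tokens_seconds": [], "finish_reason": []}
    for out in outputs:
        # Time from the request's arrival in the engine to its completion
        execution_time = out.metrics.finished_time - out.metrics.arrival_time
        metrics['execution_time'].append(execution_time)
        # Generated tokens per second for this single request
        metrics['tokens_seconds'].append(len(out.outputs[0].token_ids) / execution_time)
        # Why generation ended (e.g. the max_tokens limit was reached)
        metrics['finish_reason'].append(out.outputs[0].finish_reason)
    # Mean of the per-request rates (not the aggregate throughput of the batch)
    metrics['average_batch_tokens_seconds'] = sum(metrics['tokens_seconds']) / len(metrics['tokens_seconds'])
    return outputs, metrics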
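One subtlety in the metrics above: averaging the per-request tokens/second is not the same as the throughput of the whole batch, because requests run concurrently. The sketch below contrasts the two on made-up numbers. The helper names and the sample values are hypothetical, and the aggregate figure assumes all requests were submitted at the same time.

```python
def mean_per_request_rate(token_counts, execution_times):
    """Average of each request's own tokens/second."""
    rates = [n / t for n, t in zip(token_counts, execution_times)]
    return sum(rates) / len(rates)


def aggregate_throughput(token_counts, execution_times):
    """Total tokens divided by wall-clock time.

    Assumes every request arrived together, so the slowest request
    determines how long the whole batch took.
    """
    return sum(token_counts) / max(execution_times)


# Hypothetical batch: three requests of 500 tokens, all finishing in ~14 s
token_counts = [500, 500, 500]
execution_times = [14.0, 14.0, 14.1]

print(f"mean per-request rate: {mean_per_request_rate(token_counts, execution_times):.1f} tok/s")
print(f"aggregate throughput:  {aggregate_throughput(token_counts, execution_times):.1f} tok/s")
```

The aggregate figure is the one comparable to the `output: ... toks/s` value in vLLM's progress bar, while the per-request mean describes what a single caller experiences.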


In [8]:
prompts = [
    "Write a story about Naples",
    "Write a story about Rome",
    "Write a story about Milan"
]

In [9]:
generation(llm,prompts,temperature=0,max_tokens=500)

Processed prompts: 100%|██████████| 3/3 [00:14<00:00,  4.70s/it, est. speed input: 1.28 toks/s, output: 106.39 toks/s]


([RequestOutput(request_id=32, prompt='Write a story about Naples', prompt_token_ids=[128000, 8144, 264, 3446, 922, 83721], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=", Italy\nNaples, the vibrant city in southern Italy, is a place where history, culture, and passion intertwine. The city is known for its rich history, stunning architecture, and mouth-watering cuisine. In this story, we will explore the city through the eyes of a young traveler named Lila, as she embarks on a journey to discover the essence of Naples.\n\nAs Lila steps off the plane in Naples, she is immediately struck by the energy and chaos of the city. The streets are filled with people, cars, and the sound of music that seems to emanate from every corner. She follows the crowd towards the heart of the city, where she finds herself in the bustling Piazza del Plebiscito.\n\nLila takes a moment to admire the grandeur of the Royal Palace, a symbol of 

Let's load a small public prompt dataset to run some experiments.

In [10]:
from datasets import load_dataset

ds = load_dataset("fka/awesome-chatgpt-prompts")
ds 

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 170
    })
})

In [11]:
prompts = [p['prompt'] for p in ds['train']]

In [12]:
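# Note: despite the "fp8" in the name, this run uses the unquantized model
# (dtype=torch.bfloat16 in the engine log above); it serves as the baseline.
outputs_fp8 = generation(llm,prompts,temperature=0,max_tokens=1000)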

Processed prompts: 100%|██████████| 170/170 [00:42<00:00,  3.97it/s, est. speed input: 376.89 toks/s, output: 863.17 toks/s] 


In [None]:
import json
data = {"prompts":prompts,"generation":[]}
for o in outputs_fp8[0]:
    data['generation'].append(o.outputs[0].text)

with open("outputs_fp8.json", 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)
    print("Successfully saved to outputs_fp8")

## LLM Compressor

llmcompressor is an easy-to-use library for optimizing models for deployment with vLLM.
Among the formats it supports is W8A8 quantization from Neural Magic, which quantizes both weights and activations to 8 bits.

My GPU is an L4 (Ada Lovelace architecture), for which W8A8 in FP8 is ideal. The best scheme may depend on your GPU architecture.

In [15]:
!pip install llmcompressor

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting compressed-tensors==0.8.1 (from llmcompressor)
  Using cached compressed_tensors-0.8.1-py3-none-any.whl.metadata (6.8 kB)
Using cached compressed_tensors-0.8.1-py3-none-any.whl (87 kB)
Installing collected packages: compressed-tensors
  Attempting uninstall: compressed-tensors
    Found existing installation: compressed-tensors 0.8.0
    Uninstalling compressed-tensors-0.8.0:
      Successfully uninstalled compressed-tensors-0.8.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.6.4.post1 requires compressed-tensors==0.8.0, but you have compressed-tensors 0.8.1 which is incompatible.
Successfully installed compressed-tensors-0.8.1

In [16]:
del llm

In [17]:
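import gc
import torch

def free_gpu_memory():
    """
    Releases cached GPU memory held by PyTorch (best effort).
    """
    # Collect unreachable Python objects so their tensors can be released
    gc.collect()

    # Return cached, unused blocks from PyTorch's allocator to the driver
    torch.cuda.empty_cache()

    # Wait for all queued GPU work to finish
    torch.cuda.synchronize()

    # Reset PyTorch's memory statistics counters (this does not free memory)
    torch.cuda.reset_accumulated_memory_stats()
    torch.cuda.reset_peak_memory_stats()

    print("GPU memory has been freed.")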

In [18]:
free_gpu_memory()


GPU memory has been freed.


In [22]:
# Check how much GPU memory is still in use
!nvidia-smi

Wed Dec 18 09:50:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:35:00.0 Off |                    0 |
| N/A   42C    P0              27W /  72W |   6839MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [23]:
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = model

In [24]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8, per channel, via PTQ (post-training quantization)
#   * quantize the activations to FP8 dynamically, per token
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
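To give an intuition for what "dynamic" scaling means, here is a conceptual sketch in plain Python. It is not how llmcompressor or vLLM implement FP8: `round_to_e4m3` is a simplified, hypothetical model of the E4M3 number grid (3 mantissa bits, largest value 448), and `quantize_row` applies one scale per row, computed from the row's own maximum, which is the idea behind both per-channel weight scales and per-token activation scales.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x):
    """Round a float to the nearest value on a simplified E4M3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)   # saturate instead of overflowing
    _, exp = math.frexp(mag)          # mag = m * 2**exp with 0.5 <= m < 1
    exp = max(exp, -5)                # below 2**-6 the grid spacing stops shrinking
    step = 2.0 ** (exp - 4)           # 3 mantissa bits: 8 steps per power of two
    return sign * round(mag / step) * step


def quantize_row(values):
    """One scale per row, chosen so the largest magnitude maps to 448."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / FP8_E4M3_MAX
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize_row(quantized, scale):
    return [q * scale for q in quantized]


row = [0.02, -1.3, 0.75, 3.2]          # hypothetical activations for one token
quantized, scale = quantize_row(row)
restored = dequantize_row(quantized, scale)
for original, approx in zip(row, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Because the scale is derived from the data at hand, no calibration dataset is needed, which is consistent with the "Running oneshot without calibration data" message in the log of the next cell.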

In [25]:
# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch


2024-12-18T09:50:59.734767+0000 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
clear_sparse_session=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: anto-grimaldi7 (anto-grimaldi7-italy). Use `wandb login --relogin` to force relogin


2024-12-18T09:51:05.933146+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers



2024-12-18T09:51:15.696668+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-12-18T09:51:15.697451+0000 | populate_datasets | INFO - Running oneshot without calibration data. This is expected for weight-only and dynamic quantization



2024-12-18T09:51:25.597549+0000 | one_shot | INFO - *** One Shot ***



2024-12-18T09:51:35.614117+0000 | from_modifiers | INFO - Creating recipe from modifiers
2024-12-18T09:51:35.651152+0000 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created


manager stage: Modifiers initialized


2024-12-18T09:51:35.771301+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers


manager stage: Modifiers finalized


2024-12-18T09:51:35.772587+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|>Hello my name is Mirella and I am a young entrepreneur with a passion for fashion and design. I have always
2024-12-18T09:51:44.367435+0000 | get_model_compressor | INFO - Inferring a sparsity configuration requires a global sparsity calculation. This can be costly for large models. To skip the calculation of compression statistics set skip_compression_stats=True


Calculating model sparsity: 100%|██████████| 647/647 [00:08<00:00, 74.14it/s] 
Calculating quantization compression ratio: 312it [00:00, 619.10it/s]
Quantized Compression: 100%|██████████| 647/647 [00:02<00:00, 271.91it/s]


('Hermes-3-Llama-3.2-3B-FP8-Dynamic/tokenizer_config.json',
 'Hermes-3-Llama-3.2-3B-FP8-Dynamic/special_tokens_map.json',
 'Hermes-3-Llama-3.2-3B-FP8-Dynamic/tokenizer.json')

#### Quantization completed and model saved on disk

In [26]:
del model

In [27]:
# Clear GPU cache
free_gpu_memory()

GPU memory has been freed.


In [28]:
!nvidia-smi

Wed Dec 18 09:52:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:35:00.0 Off |                    0 |
| N/A   42C    P0              27W /  72W |   6869MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Inference with quantized model

If your GPU memory has not been released (check the nvidia-smi output above), restart the notebook kernel here so that the quantized model loads onto an empty GPU.

In [1]:
from vllm import LLM, SamplingParams



In [2]:
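def generation(llm: LLM, prompts, temperature, max_tokens):
    """Generate completions and collect simple per-request timing metrics."""
    sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    outputs = llm.generate(prompts, sampling_params)
    metrics = {"execution_time": [], "tokens_seconds": [], "finish_reason": []}
    for out in outputs:
        # Time from the request's arrival in the engine to its completion
        execution_time = out.metrics.finished_time - out.metrics.arrival_time
        metrics['execution_time'].append(execution_time)
        # Generated tokens per second for this single request
        metrics['tokens_seconds'].append(len(out.outputs[0].token_ids) / execution_time)
        # Why generation ended (e.g. the max_tokens limit was reached)
        metrics['finish_reason'].append(out.outputs[0].finish_reason)
    # Mean of the per-request rates (not the aggregate throughput of the batch)
    metrics['average_batch_tokens_seconds'] = sum(metrics['tokens_seconds']) / len(metrics['tokens_seconds'])
    return outputs, metrics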


In [3]:
model = "Hermes-3-Llama-3.2-3B-FP8-Dynamic"
config = {
    "model":model,
    "max_model_len":8192,
    "max_num_seqs": 128,
    "max_num_batched_tokens":12000,
    "enable_chunked_prefill":True,
    "gpu_memory_utilization":0.8
}
llm = LLM(**config)

INFO 12-18 09:53:25 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 12-18 09:53:25 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=12000.
INFO 12-18 09:53:25 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='Hermes-3-Llama-3.2-3B-FP8-Dynamic', speculative_config=None, tokenizer='Hermes-3-Llama-3.2-3B-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=N

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 12-18 09:53:27 model_runner.py:1077] Loading model weights took 3.4212 GB
INFO 12-18 09:53:29 worker.py:232] Memory profiling results: total_gpu_memory=22.17GiB initial_memory_usage=3.68GiB peak_torch_memory=5.87GiB memory_usage_post_profile=3.71GiB non_torch_memory=0.29GiB kv_cache_size=11.58GiB gpu_memory_utilization=0.80
INFO 12-18 09:53:29 gpu_executor.py:113] # GPU blocks: 6776, # CPU blocks: 2340
INFO 12-18 09:53:29 gpu_executor.py:117] Maximum concurrency for 8192 tokens per request: 13.23x
INFO 12-18 09:53:34 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-18 09:53:34 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 1

16-bit (BF16) Model
- Loading model weights took 6.0160 GB
- Memory profiling results: total_gpu_memory=22.17GiB initial_memory_usage=6.26GiB peak_torch_memory=6.78GiB memory_usage_post_profile=6.28GiB non_torch_memory=0.25GiB kv_cache_size=10.70GiB gpu_memory_utilization=0.80
- \# GPU blocks: 6260, # CPU blocks: 2340
- Maximum concurrency for 8192 tokens per request: 12.23x


W8A8_FP8 Model
- Loading model weights took 3.4212 GB
- Memory profiling results: total_gpu_memory=22.17GiB initial_memory_usage=3.68GiB peak_torch_memory=5.87GiB memory_usage_post_profile=3.71GiB non_torch_memory=0.29GiB kv_cache_size=11.58GiB gpu_memory_utilization=0.80
- \# GPU blocks: 6776, # CPU blocks: 2340
- Maximum concurrency for 8192 tokens per request: 13.23x

As you can see, the quantized weights take just over half the memory of the 16-bit model (3.42 GB vs. 6.02 GB), which also leaves more room for the KV cache. Running operations in FP8 should give us a direct advantage (if your GPU natively supports FP8 operations) and an indirect one: with the same memory bandwidth, twice as many weight values can be moved from GPU memory to the compute units in the same time.

The NVIDIA L4 supports FP8 operations, with a quoted peak of 485 teraFLOPS.
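The bandwidth argument can be made concrete with a back-of-the-envelope estimate. During decoding, every step has to read all the weights from GPU memory at least once, so weight size divided by memory bandwidth gives a rough lower bound on the time per step. The sketch below uses the weight sizes from the logs above, taken at face value in GB, and an assumed memory bandwidth of 300 GB/s for the L4. That bandwidth figure is an assumption of this sketch, not something measured in this notebook.

```python
# Weight sizes reported by vLLM in the logs above (taken at face value, in GB)
weights_16bit_gb = 6.0160
weights_fp8_gb = 3.4212

# Assumption: approximate peak memory bandwidth of the GPU, in GB/s
ASSUMED_BANDWIDTH_GB_S = 300.0


def min_step_ms(weight_gb, bandwidth_gb_s):
    """Lower bound on one decode step: time to stream the weights once."""
    return weight_gb / bandwidth_gb_s * 1000.0


step_16bit = min_step_ms(weights_16bit_gb, ASSUMED_BANDWIDTH_GB_S)
step_fp8 = min_step_ms(weights_fp8_gb, ASSUMED_BANDWIDTH_GB_S)

print(f"16-bit weights: >= {step_16bit:.1f} ms per decode step")
print(f"FP8 weights:    >= {step_fp8:.1f} ms per decode step")
print(f"ratio: {step_16bit / step_fp8:.2f}x")
```

This bound ignores KV-cache reads, activations and kernel overheads, so real gains are smaller than the ratio suggests, but it shows why smaller weights help even before any FP8 compute speedup.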

In [5]:
from datasets import load_dataset

ds = load_dataset("fka/awesome-chatgpt-prompts")

prompts = [p['prompt'] for p in ds['train']]

In [6]:
outputs_w8a8_fp8 = generation(llm,prompts,temperature=0,max_tokens=1000)

Processed prompts: 100%|██████████| 170/170 [00:33<00:00,  5.05it/s, est. speed input: 479.15 toks/s, output: 1104.24 toks/s]


In [30]:
outputs_w8a8_fp8[0][0].outputs[0].text

' \n\n---\n\n### Smart Contract: Blockchain Messenger\n\nTo create a smart contract for a blockchain messenger, we need to consider the following requirements:\n\n1. **Save messages on the blockchain**: We need a function to save messages, which will be stored on the blockchain.\n2. **Make messages readable (public)**: Anyone can read the messages stored in the contract.\n3. **Make messages writable (private) only to the deployer**: Only the person who deployed the contract can update the messages.\n4. **Count message updates**: We need a function to count how many times a message has been updated.\n\nLet\'s start by defining the structure of our smart contract in Solidity.\n\n```solidity\n// SPDX-License-Identifier: MIT\npragma solidity ^0.8.0;\n\ncontract BlockchainMessenger {\n    struct Message {\n        string content;\n        uint256 updateCount;\n    }\n\n    address private owner;\n    Message[] private messages;\n\n    constructor() {\n        owner = msg.sender;\n    }\n\n 

In [37]:
import json
data = {"prompts":prompts,"generation":[]}
for o in outputs_w8a8_fp8[0]:
    data['generation'].append(o.outputs[0].text)

with open("outputs_w8a8_fp8.json", 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)
    print("Successfully saved to outputs_w8a8_fp8")

Successfully saved to outputs_w8a8_fp8



- 16-bit (BF16) model -> 170/170 [00:42<00:00,  3.97it/s, est. speed input: 376.89 toks/s, output: 863.17 toks/s]
- W8A8_FP8 ->  170/170 [00:33<00:00,  5.05it/s, est. speed input: 479.15 toks/s, output: 1104.24 toks/s]

With the quantized version we obtained a 27% increase in input throughput and a 28% increase in output throughput.
We also get a 21% decrease in overall execution time (42 s down to 33 s).
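The percentages can be recomputed directly from the two progress-bar lines. The short sketch below does that, with `pct_change` as a small hypothetical helper and the numbers copied from the outputs above.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the two progress-bar lines above
baseline = {"input_tps": 376.89, "output_tps": 863.17, "seconds": 42}
quantized = {"input_tps": 479.15, "output_tps": 1104.24, "seconds": 33}

for key in baseline:
    change = pct_change(baseline[key], quantized[key])
    print(f"{key}: {baseline[key]} -> {quantized[key]} ({change:+.1f}%)")
```

Keep in mind that this is a single run on 170 prompts, so the figures are indicative rather than a rigorous benchmark.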

The speedup is substantial and essentially free. But what about output quality?

Output quality depends on your specific task, and in many cases you will need to review a sample of generations yourself to assess it.
In the next notebook, I'll show an automated way to get a general idea, using an **LLM as a judge**.
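Before reaching for a judge model, a crude first look is to check how far the two sets of generations diverge. The sketch below is a hypothetical helper, not part of the original experiments. It takes two dictionaries shaped like the JSON files saved in this notebook (`prompts` and `generation` lists) and reports how many generations are identical, plus an average character-level similarity from the standard library's `difflib`. Text similarity is not a quality metric: it only shows where quantization changed the output.

```python
import difflib


def compare_generations(baseline, candidate):
    """Compare two {"prompts": [...], "generation": [...]} dictionaries."""
    identical = 0
    ratios = []
    for a, b in zip(baseline["generation"], candidate["generation"]):
        if a == b:
            identical += 1
        # 1.0 means the two texts are the same, 0.0 means nothing in common
        ratios.append(difflib.SequenceMatcher(None, a, b).ratio())
    return {
        "n": len(ratios),
        "identical": identical,
        "mean_similarity": sum(ratios) / len(ratios),
    }


# Toy data standing in for the saved JSON files
# (in the notebook, load them with json.load instead).
baseline = {"prompts": ["p1", "p2"], "generation": ["abc", "hello world"]}
candidate = {"prompts": ["p1", "p2"], "generation": ["abc", "hello there"]}
print(compare_generations(baseline, candidate))
```

Since both runs used temperature=0, any difference between the two files comes from the model change rather than from sampling noise, which makes this kind of side-by-side comparison a reasonable starting point for manual review.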