# LLM Benchmarking Experiment

## Set Up Your Python Environment

To keep your project dependencies isolated and reproducible, create and activate a virtual environment before installing packages.

```sh
# Create a virtual environment
python -m venv .venv

# Activate the virtual environment (macOS/Linux)
source .venv/bin/activate

# Install required dependencies
pip install notebook requests jsonlines pandas tqdm
```

## Install and Run Ollama

You can install Ollama directly from the official website or run it using Docker.

### Option 1: Install from the Official Website
Download and install Ollama from: https://ollama.com/

### Option 2: Use the Dockerized Version
This approach is useful when you want a portable or containerized setup.

```bash
# Start Ollama using Docker Compose
docker compose up -d

# Enter the running Ollama container
docker exec -it ollama bash
```

### Use Ollama

```bash
# Verify the Ollama installation
ollama --version

# Start the Ollama server if it isn't already running
# (runs in the foreground; use a separate terminal for the commands below)
ollama serve

# List all the models that have been downloaded
ollama list

# Download a model (examples: qwen:0.5b, gemma3:270m)
ollama pull qwen:0.5b
```

To use Hugging Face models with Ollama, see: https://huggingface.co/docs/hub/ollama

## Benchmarking Experiment

In [None]:
# import libs
import requests
import jsonlines
import time
import json
import pandas as pd

# define the Ollama endpoint for text generation
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"

## Create a Function to Query the Ollama HTTP API

Define a function that sends a prompt to the Ollama HTTP service and returns the model’s response.

The request payload contains the following fields:

- **model**: the name/identifier of the Ollama model to use (e.g. `qwen:0.5b`).
- **prompt**: the text prompt to send to the model.
- **stream** (`bool`): whether to enable streaming responses.
  - If `True`, the model returns output incrementally (token by token).
  - If `False`, the full response is returned at once.
- **options**: additional parameters that control text generation, such as:
  - `temperature` – randomness of the output.
  - `num_predict` – maximum number of tokens in the model's response.
  - (Optionally) other settings supported by the Ollama API (e.g. `top_p`, `stop` sequences).

The function builds a JSON payload from these fields and sends it to the Ollama HTTP endpoint (`POST /api/generate`). This first version sets `stream` to `False`, so the response arrives as a single JSON object.
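
As a minimal sketch of the payload described above, the helper below assembles the request body without sending it. The name `build_generate_payload` is hypothetical and used only for illustration; the field names (`model`, `prompt`, `stream`, `options`) are the ones listed above, with `num_predict` as the option that caps the response length.

```python
import json


def build_generate_payload(model, prompt, stream=False, options=None):
    """Assemble the JSON body for POST /api/generate (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if options:
        # Copy the options so later changes by the caller do not alter the payload
        payload["options"] = dict(options)
    return payload


payload = build_generate_payload(
    "qwen:0.5b",
    "Say hello.",
    options={"temperature": 0, "num_predict": 64},
)

# Inspect the exact body that would be sent to the server
print(json.dumps(payload, indent=2))
```

Building the payload separately makes it easy to print and check the request before timing anything.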


In [None]:
def generate(model: str, prompt: str, options: dict = None):
    resp = requests.post(
        OLLAMA_GENERATE_URL,
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": options or {},  # send an empty object when no options are given
        },
    )

    # check if any error occurred during the request
    resp.raise_for_status()

    data = resp.json()

    # total_duration is reported in nanoseconds; convert it to seconds
    return data["response"], data["total_duration"] / 1e9

Define a sample prompt to test the call.

In [None]:
# use a sample prompt to test the inference function
prompt = "Write Dijkstra's algorithm in JavaScript. Output only the source code, no explanation."

prompt

In [None]:
# run "ollama list" to print all the models that have been downloaded

# define the model to use
OLLAMA_MODEL = "qwen:0.5b" # example

Measure the **End-to-End Request Latency** of the inference function.

This metric measures the total time from submitting a query to receiving the complete response, accounting for queueing/batching processes and network latency.

More information on LLM benchmarking metrics is available here: https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html.
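
Besides the client-side measurement, the final JSON object returned by `/api/generate` carries server-side timing counters in nanoseconds (`total_duration`, `load_duration`, `prompt_eval_duration`, `eval_duration`) and the number of generated tokens (`eval_count`). The sketch below converts them to seconds and derives a generation throughput. The helper name `summarize_timings` is hypothetical, and the sample values are made up for illustration, not measured.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_timings(data):
    """Convert Ollama timing counters to seconds (hypothetical helper)."""
    eval_s = data["eval_duration"] / NS_PER_S
    return {
        "total_s": data["total_duration"] / NS_PER_S,
        "load_s": data["load_duration"] / NS_PER_S,
        "prompt_eval_s": data["prompt_eval_duration"] / NS_PER_S,
        "eval_s": eval_s,
        # generated tokens per second of generation time
        "tokens_per_s": data["eval_count"] / eval_s if eval_s > 0 else float("nan"),
    }


# Illustrative values only, not a real measurement
sample = {
    "total_duration": 5_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_duration": 250_000_000,
    "eval_count": 200,
    "eval_duration": 4_000_000_000,
}

summary = summarize_timings(sample)
print(summary)
```

Comparing `total_s` with the client-side end-to-end latency gives a rough idea of the overhead added by the HTTP round trip.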

In [None]:
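# perf_counter is a monotonic, high-resolution clock intended for measuring intervals
start = time.perf_counter()

response, ollama_total_duration = generate(
    model=OLLAMA_MODEL,
    prompt=prompt,
)

end = time.perf_counter()

e2e = end - start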

print(f"Ollama total duration (from JSON response): {ollama_total_duration:.6f} s")
print(f"End-to-end generation time: {e2e:.6f} s")

In [None]:
# Formatted output
print("=== PROMPT ===")
print(f"{prompt[:200]} ... \n")

print("=== RESPONSE ===")
print(f"{response[:200]} ... ")

## Extract the Time-To-First-Token (TTFT) and Inter-Token Latency (ITL)

To extract these metrics, you must enable streaming in Ollama, which allows tokens to be delivered incrementally as they are generated.

We define a second function, with the `stream` parameter set to `True`, that computes TTFT and ITL.
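
To make the two metrics concrete, here is a small sketch that computes them from a request start time and a list of chunk arrival times. The function name `latency_metrics` and the timestamps are hypothetical. It assumes each streamed chunk corresponds to roughly one token, and it measures ITL between the first and last chunk, which can differ slightly from a definition based on the total request time.

```python
def latency_metrics(start, arrivals):
    """Return (TTFT, mean ITL) in seconds from chunk arrival timestamps.

    Hypothetical helper: `start` is the time the request was sent and
    `arrivals` holds the arrival time of each non-empty chunk.
    """
    if not arrivals:
        return 0.0, 0.0
    # TTFT: delay between sending the request and the first chunk
    ttft = arrivals[0] - start
    # ITL: average gap between consecutive chunks
    gaps = len(arrivals) - 1
    itl = (arrivals[-1] - arrivals[0]) / gaps if gaps > 0 else 0.0
    return ttft, itl


# Made-up timestamps: request sent at t=10.0 s, five chunks received
ttft, itl = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25, 11.5])
print(f"TTFT: {ttft:.3f} s, ITL: {itl:.3f} s")
```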

In [None]:
def generate_with_stream(model: str, prompt: str, options: dict = None):

    # in this case we measure inside the generation function
    start = time.perf_counter()

    resp = requests.post(
        OLLAMA_GENERATE_URL,
        json={
            "model": model,
            "prompt": prompt,
            "stream": True,
            "options": options or {},
        },
        # without stream=True, requests downloads the whole body before returning,
        # so every chunk would appear to arrive at the same time
        stream=True,
    )

    # check if any error occurred during the request
    resp.raise_for_status()

    first_token_time = None
    chunks = []
    data = {}

    # the response is a sequence of JSON objects, one per line
    # (you can check by calling the endpoint directly);
    # iter_lines stops automatically when the stream ends
    for line in resp.iter_lines():
        if not line:
            continue
        data = json.loads(line.decode("utf-8"))
        token = data.get("response", "")
        if not token:
            continue

        # TTFT: record the arrival time of the first non-empty chunk
        if first_token_time is None:
            first_token_time = time.perf_counter()

        chunks.append(token)

    end = time.perf_counter()
    total_gen_time = end - start

    text = "".join(chunks)

    # 'data' now holds the last JSON object, which carries the timing fields (in nanoseconds)
    ollama_total_duration = data.get("total_duration", 0) / 1e9

    ttft = (first_token_time - start) if first_token_time is not None else 0.0
    itl = (total_gen_time - ttft) / max(len(chunks) - 1, 1)

    return text, total_gen_time, ttft, itl, ollama_total_duration

## Run the generation again

In [None]:
response, e2e, ttft, itl, ollama_total_duration = generate_with_stream(
    model=OLLAMA_MODEL,
    prompt=prompt,
)

print(f"Ollama total duration (from JSON response): {ollama_total_duration:.6f} s")
print(f"End-to-end generation time: {e2e:.6f} s")
print(f"Time to first token (TTFT): {ttft:.6f} s")
print(f"Inter-token latency (ITL): {itl:.6f} s")

## Adjust the setup

### Parameter Details

`num_predict`
Specifies the maximum number of tokens the model is allowed to generate. Lower values shorten the output, while higher values allow for longer responses.

`temperature`
Controls the randomness of the generation process. A value of 0 makes the output highly deterministic and focused.
Higher values (e.g., 0.7–1.0) increase variability and creativity.
Note that temperature is not the only factor influencing randomness; parameters like `top_k` and `top_p` also contribute.
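
The effect of temperature can be illustrated with the usual temperature-scaled softmax, where each logit is divided by the temperature before normalization. This is a sketch of the general technique, not of Ollama's internals: the function name and the logits are hypothetical, and how a runtime handles a temperature of exactly 0 (typically by picking the most likely token) is not shown here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens
logits = [2.0, 1.0, 0.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Low temperatures concentrate the probability mass on the highest-scoring token, while high temperatures flatten the distribution, which is why repeated runs vary more.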

In [None]:
GENERATION_OPTIONS = { 
    "num_predict": 256, # we can limit the maximum number of generated tokens
    "temperature": 0 # handle randomness of the generation process
}
NUMBER_OF_ITERATIONS = 30 # define the number of iterations

In [None]:
from tqdm import tqdm

results = []

for it in tqdm(range(NUMBER_OF_ITERATIONS)):

    response, e2e, ttft, itl, ollama_total_duration = generate_with_stream(
        model=OLLAMA_MODEL,
        prompt=prompt,
        options=GENERATION_OPTIONS
    )

    results.append({
        "prompt": prompt,
        "iteration": it,
        "response": response,
        "ollama_total_gen_time": ollama_total_duration,
        "end_to_end_latency": e2e,
        "TTFT": ttft,
        "ITL": itl
    })

results = pd.DataFrame(results)

In [None]:
# Dump results for further inspection and analysis 
with jsonlines.open("results.jsonl", "w") as out_file:
    out_file.write_all(results.to_dict('records'))

In [None]:
results.describe()

## Inspect and analyze the results
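
Latency distributions are often skewed by a few slow requests, so percentiles can complement the mean and standard deviation. The sketch below uses the nearest-rank method with only the standard library. The `percentile` helper and the sample latencies are hypothetical and not part of the measured results.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up latencies in seconds, with one slow outlier
latencies = [0.12, 0.11, 0.13, 0.12, 0.45, 0.12, 0.11, 0.14, 0.12, 0.13]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(f"p50: {p50:.3f} s, p95: {p95:.3f} s")
```

The same helper could be applied to a column of the results, for example `percentile(list(results.TTFT), 95)`.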

In [None]:
import numpy as np

# Perform an in-depth examination of the value distribution

def print_stats(stats):
    for k, v in stats.items():
        if isinstance(v, tuple):
            print(f"{k}: ({v[0]:.3f}, {v[1]:.3f})")
        else:
            print(f"{k}: {v:.3f}")

def summary_stats(data):
    mu = data.mean()
    range_ = data.max() - data.min()
    max_dist_mean = np.abs(data - mu).max()
    variance = data.var(ddof=1) if len(data) > 1 else 0.0
    sigma = data.std(ddof=1) if len(data) > 1 else 0.0
    cv = sigma / mu if mu != 0 else np.nan
    std_error = sigma / np.sqrt(len(data)) if len(data) > 1 else 0.0
    statistics = {
        "Mean (μ)": mu,
        "Range": range_,
        "Max distance from mean": max_dist_mean,
        "Variance (σ²)": variance,
        "Standard deviation (σ)": sigma,
        "Coefficient of variation (CV)": cv,
        "Standard error of the mean (SEM)": std_error
    }
    return statistics

# example on TTFT
print_stats(summary_stats(results.TTFT))

## Visualize the results

Matplotlib and seaborn are widely used Python libraries that support a variety of plot types.

```bash
pip install matplotlib seaborn
```

In [None]:
import matplotlib.pyplot as plt

# visualize the results (example)

latency_measurements = results.TTFT.values

plt.figure(figsize=(10, 2))
plt.plot(latency_measurements, color="black") 

plt.axhline(min(latency_measurements), color="lightgrey", ls="--")
plt.axhline(max(latency_measurements), color="lightgrey", ls="--")

plt.xlabel("Iteration")
plt.ylabel("Time (s)")
plt.tight_layout()
plt.show()

## Run the experiment with different prompts

In [None]:
# load the input prompts (one JSON object per line, each with a "prompt" field)
with jsonlines.open("sample_prompts.jsonl", "r") as p_file:
    prompts = list(p_file)

selected_prompts = prompts

from tqdm import tqdm

results = []

total_iterations = len(selected_prompts) * NUMBER_OF_ITERATIONS
pbar = tqdm(total=total_iterations, desc="Progress")

for prompt_idx, prompt in enumerate(selected_prompts):
    for it in range(NUMBER_OF_ITERATIONS):

        pbar.set_description(f"Prompt {prompt_idx+1}/{len(selected_prompts)} | Iter {it+1}/{NUMBER_OF_ITERATIONS}")
        
        # the order must match the return value of generate_with_stream
        response, e2e, ttft, itl, ollama_total_duration = generate_with_stream(
            model=OLLAMA_MODEL,
            prompt=prompt["prompt"],
            options=GENERATION_OPTIONS
        )

        results.append({
            "prompt": prompt["prompt"],
            "iteration": it,
            "ollama_total_gen_time": ollama_total_duration,
            "end_to_end_latency": e2e,
            "TTFT": ttft,
            "ITL": itl
        })

        pbar.update()
pbar.close()

results = pd.DataFrame(results)

In [None]:
results

In [None]:
# note: this overwrites the results.jsonl written by the single-prompt run
with jsonlines.open("results.jsonl", "w") as out_file:
    out_file.write_all(results.to_dict('records'))