# M4: Measurement, offline optimization, and packaging (ORT + Olive)

## Goal
1. Measure the **size** and **approximate latency** of a model with ONNX Runtime (ORT).
2. Use **SessionOptions** to:
   - choose the graph optimization level
   - save the **offline-optimized model** to disk
   - generate a **latency profile** (JSON)
3. (Optional) Generate a **ZIP of artifacts** with Olive (`packaging_config`).

> Note: the latency micro-benchmark here is only indicative. Use it to compare before/after on your own machine, not as an absolute figure.


## Prerequisites (VS Code + Kernel)
Make sure the notebook uses the **kernel** from the project's `.venv` environment (in VS Code: the kernel picker at the top right).


In [1]:
# Cell 1: check that we are running the venv's Python
import sys, subprocess
import onnxruntime as ort

print("Python executable:", sys.executable)
print("Python version:", sys.version)
print("ONNX Runtime:", ort.get_version_string())
print("Available providers:", ort.get_available_providers())

print("\n--- pip show olive-ai ---")
subprocess.run([sys.executable, "-m", "pip", "show", "olive-ai"], check=False)


Python executable: g:\source\VisualCode\repos\olive-python-vscode-labs\.venv\Scripts\python.exe
Python version: 3.13.2 (tags/v3.13.2:4f8bb39, Feb  4 2025, 15:23:48) [MSC v.1942 64 bit (AMD64)]
ONNX Runtime: 1.23.2
Available providers: ['AzureExecutionProvider', 'CPUExecutionProvider']

--- pip show olive-ai ---


CompletedProcess(args=['g:\\source\\VisualCode\\repos\\olive-python-vscode-labs\\.venv\\Scripts\\python.exe', '-m', 'pip', 'show', 'olive-ai'], returncode=0)
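
In the recorded output above, nothing is shown under the `pip show` header, presumably because the child process writes to its own stdout rather than to the cell output. As a minimal alternative sketch, the standard-library `importlib.metadata` module can report installed versions in-process. The helper name `installed_version` is hypothetical and not part of the lab.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear on PyPI.
for dist in ("onnxruntime", "onnx", "olive-ai"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```

Because this runs inside the kernel's own interpreter, it also confirms that the packages are visible to the Python selected in the notebook.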

## 1) Select a working model

We will use, if they exist:
- `models/linear_fp32.onnx` (from M3)
- `outputs/m3_linear_int8/**/*.onnx` (if you have already quantized)

If the FP32 model is missing, we create a minimal FP32 linear model (`Y = X·W + b`) at `models/linear_fp32.onnx`. The INT8 model is optional and is simply skipped when absent.


In [2]:
from pathlib import Path
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

root = Path.cwd()
# If you are inside notebooks/, go up one level
if not (root / "models").exists() and (root.parent / "models").exists():
    root = root.parent

models_dir = root / "models"
outputs_dir = root / "outputs"
models_dir.mkdir(exist_ok=True)
outputs_dir.mkdir(exist_ok=True)

fp32_model_path = models_dir / "linear_fp32.onnx"

if not fp32_model_path.exists():
    in_features = 4
    out_features = 3

    X = helper.make_tensor_value_info("X", TensorProto.FLOAT, ["batch", in_features])
    Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, ["batch", out_features])

    rng = np.random.default_rng(0)
    W_val = rng.standard_normal((in_features, out_features), dtype=np.float32)
    b_val = rng.standard_normal((out_features,), dtype=np.float32)

    W = numpy_helper.from_array(W_val, name="W")
    b = numpy_helper.from_array(b_val, name="b")

    matmul = helper.make_node("MatMul", inputs=["X", "W"], outputs=["Z"])
    add = helper.make_node("Add", inputs=["Z", "b"], outputs=["Y"])

    graph = helper.make_graph([matmul, add], "linear", [X], [Y], initializer=[W, b])

    opset = [helper.make_operatorsetid("", 11)]
    model = helper.make_model(graph, producer_name="m4-lab", opset_imports=opset)
    model.ir_version = 11

    onnx.checker.check_model(model)
    onnx.save_model(model, str(fp32_model_path))

print("FP32 model:", fp32_model_path)

# Look for a quantized INT8 model (if one exists); the first match in sorted order is used
int8_candidates = sorted((outputs_dir / "m3_linear_int8").rglob("*.onnx")) if (outputs_dir / "m3_linear_int8").exists() else []
int8_model_path = int8_candidates[0] if int8_candidates else None

print("INT8 model:", int8_model_path)


FP32 model: g:\source\VisualCode\repos\olive-python-vscode-labs\models\linear_fp32.onnx
INT8 model: g:\source\VisualCode\repos\olive-python-vscode-labs\outputs\m3_linear_int8\model.onnx


## 2) Micro-benchmark (latency) + size

We measure:
- file size in bytes
- mean time per inference (ms) with a fixed input


In [3]:
import time
import numpy as np
import onnxruntime as ort
from pathlib import Path

x = np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32)

def benchmark(model_path: Path, sess_options=None, iters: int = 300, warmup: int = 30):
    sess = ort.InferenceSession(str(model_path), sess_options=sess_options, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0].name
    out = sess.get_outputs()[0].name

    for _ in range(warmup):
        sess.run([out], {inp: x})

    t0 = time.perf_counter()
    for _ in range(iters):
        y = sess.run([out], {inp: x})[0]
    t1 = time.perf_counter()

    avg_ms = (t1 - t0) * 1000 / iters
    size = model_path.stat().st_size
    return y, avg_ms, size

y_fp32, ms_fp32, sz_fp32 = benchmark(fp32_model_path)
print("FP32:", fp32_model_path.name, "| size:", sz_fp32, "bytes | avg:", ms_fp32, "ms | y:", y_fp32)

if int8_model_path:
    y_int8, ms_int8, sz_int8 = benchmark(int8_model_path)
    print("INT8:", int8_model_path.name, "| size:", sz_int8, "bytes | avg:", ms_int8, "ms | y:", y_int8)


FP32: linear_fp32.onnx | size: 198 bytes | avg: 0.003832000002148561 ms | y: [[ 5.2784495  1.9592975 -5.466373 ]]
INT8: model.onnx | size: 1180 bytes | avg: 0.004423333333155218 ms | y: [[ 5.2475195  1.9376892 -5.4539256]]
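
At the microsecond scale seen above, a single mean is easily skewed by a few slow iterations. The following is a minimal sketch, not part of the original lab, that times each call separately and reports the median and an approximate 95th percentile alongside the mean. The helper names `time_samples` and `summarize` are hypothetical, and the workload is a stand-in so that the block runs without a model.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_samples(fn: Callable[[], object], iters: int = 300, warmup: int = 30) -> List[float]:
    """Call fn repeatedly and return one latency sample (in ms) per timed call."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return samples


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize latency samples; p95 uses a simple nearest-rank style index."""
    ordered = sorted(samples_ms)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean_ms": statistics.fmean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "min_ms": ordered[0],
    }


# Stand-in workload; in the notebook this could be: lambda: sess.run([out], {inp: x})
stats = summarize(time_samples(lambda: sum(range(1000)), iters=200, warmup=20))
print(stats)
```

Timing each call individually adds a small `perf_counter` overhead per sample, so the absolute numbers will tend to be slightly higher than the loop-level average used in `benchmark`. The benefit is that outliers become visible.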


## 3) Graph optimization (online) and optimized export (offline)

ORT lets you:
- choose the graph optimization level (`graph_optimization_level`)
- save the model that results from those optimizations to disk (`optimized_model_filepath`)

Here we generate an offline-optimized model and then measure it.


In [4]:
import onnxruntime as ort
from pathlib import Path

optimized_dir = outputs_dir / "m4_optimized"
optimized_dir.mkdir(exist_ok=True)

optimized_model_path = optimized_dir / "linear_optimized.onnx"

so = ort.SessionOptions()
# Optimization level: EXTENDED (you can also try ORT_ENABLE_ALL)
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# Export the optimized graph
so.optimized_model_filepath = str(optimized_model_path)

# Creating the session triggers the optimization and saves the optimized model
_ = ort.InferenceSession(str(fp32_model_path), sess_options=so, providers=["CPUExecutionProvider"])

print("Saved optimized model:", optimized_model_path, "| exists:", optimized_model_path.exists(), "| bytes:", optimized_model_path.stat().st_size if optimized_model_path.exists() else None)


Saved optimized model: g:\source\VisualCode\repos\olive-python-vscode-labs\outputs\m4_optimized\linear_optimized.onnx | exists: True | bytes: 454


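## 4) Measure the offline-optimized model

When you load a model that is already optimized, you can disable graph optimizations to reduce initialization work. Note that `benchmark` times only the `sess.run` calls, so any saving in session creation time is not reflected in the average it reports.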


In [5]:
import onnxruntime as ort

so_no_opt = ort.SessionOptions()
so_no_opt.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

y_opt, ms_opt, sz_opt = benchmark(optimized_model_path, sess_options=so_no_opt)
print("OPT_OFFLINE:", optimized_model_path.name, "| size:", sz_opt, "bytes | avg:", ms_opt, "ms | y:", y_opt)


OPT_OFFLINE: linear_optimized.onnx | size: 454 bytes | avg: 0.003671999999520873 ms | y: [[ 5.2784495  1.9592975 -5.466373 ]]
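
To read the three measurements side by side, it can help to express them relative to the FP32 baseline. The sketch below uses values copied from the recorded outputs above, with latencies rounded. The `results` layout and the helper `compare_to_baseline` are hypothetical conveniences, not part of the lab.

```python
# Values copied from the recorded cell outputs above (latencies rounded);
# replace them with the numbers from your own run.
results = {
    "FP32": {"size": 198, "avg_ms": 0.003832, "y": [5.2784495, 1.9592975, -5.466373]},
    "INT8": {"size": 1180, "avg_ms": 0.004423, "y": [5.2475195, 1.9376892, -5.4539256]},
    "OPT_OFFLINE": {"size": 454, "avg_ms": 0.003672, "y": [5.2784495, 1.9592975, -5.466373]},
}


def compare_to_baseline(results, baseline="FP32"):
    """Return size ratio, latency ratio and max absolute output difference vs. the baseline."""
    base = results[baseline]
    rows = {}
    for name, r in results.items():
        rows[name] = {
            "size_ratio": r["size"] / base["size"],
            "latency_ratio": r["avg_ms"] / base["avg_ms"],
            "max_abs_diff": max(abs(a - b) for a, b in zip(r["y"], base["y"])),
        }
    return rows


for name, row in compare_to_baseline(results).items():
    print(
        f"{name:12s} size x{row['size_ratio']:.2f} | "
        f"latency x{row['latency_ratio']:.2f} | max|dy| {row['max_abs_diff']:.4f}"
    )
```

For a model this small, the recorded INT8 and optimized files are both larger than the FP32 original and the latency differences are tiny, so the ratios are mainly useful as a template for larger models.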


## 5) Latency profiling (JSON)

ORT can generate a JSON profiling file:
- set `SessionOptions.enable_profiling = True`
- at the end, call `session.end_profiling()` to get the name of the generated file

In this cell we generate a profile and build a simple summary by `name` (total duration).


In [6]:
import json
import onnxruntime as ort
from pathlib import Path
from collections import defaultdict

profile_dir = outputs_dir / "m4_profiles"
profile_dir.mkdir(exist_ok=True)

so_prof = ort.SessionOptions()
so_prof.enable_profiling = True
so_prof.profile_file_prefix = str(profile_dir / "ort_profile")

sess = ort.InferenceSession(str(fp32_model_path), sess_options=so_prof, providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0].name
out = sess.get_outputs()[0].name

# Run a few inferences
for _ in range(50):
    sess.run([out], {inp: x})

prof_file = sess.end_profiling()
prof_path = Path(prof_file)
print("Profile file:", prof_path.resolve(), "| exists:", prof_path.exists())

data = json.loads(prof_path.read_text(encoding="utf-8"))

dur_by_name = defaultdict(int)
for ev in data:
    name = ev.get("name")
    dur = ev.get("dur")
    if name is not None and dur is not None:
        dur_by_name[name] += dur

top = sorted(dur_by_name.items(), key=lambda x: x[1], reverse=True)[:15]
print("\nTop 15 events by total dur (typical unit: microseconds):")
for name, dur in top:
    print(f"- {name}: {dur}")


Profile file: G:\source\VisualCode\repos\olive-python-vscode-labs\outputs\m4_profiles\ort_profile_2026-01-12_10-35-58.json | exists: True

Top 15 events by total dur (typical unit: microseconds):
- model_run: 924
- SequentialExecutor::Execute: 813
- /MatMulAddFusion_kernel_time: 688
- session_initialization: 315
- model_loading_uri: 166
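
The totals in the summary above appear to be nested (a run contains the executor, which in turn contains the kernel time), so they should not be added together. A narrower view is to keep only per-node kernel events and group them by operator. The sketch below assumes that node events carry `"cat": "Node"`, a `name` ending in `_kernel_time`, and an `args.op_name` field. These field names are an assumption about the profile layout, so check them against your own JSON file. The sample events are synthetic, not real measurements.

```python
from collections import defaultdict


def kernel_time_by_op(events):
    """Sum 'dur' of node kernel events per operator type (assumed profile layout)."""
    totals = defaultdict(int)
    for ev in events:
        if ev.get("cat") != "Node":
            continue  # skip session-level events such as model_run
        if not str(ev.get("name", "")).endswith("_kernel_time"):
            continue  # skip non-kernel node events
        op = (ev.get("args") or {}).get("op_name", "unknown")
        totals[op] += ev.get("dur", 0)
    return dict(totals)


# Synthetic events shaped like the assumed format (illustrative only).
sample_events = [
    {"cat": "Session", "name": "model_run", "dur": 20},
    {"cat": "Node", "name": "/MatMulAddFusion_kernel_time", "dur": 12, "args": {"op_name": "Gemm"}},
    {"cat": "Node", "name": "/MatMulAddFusion_kernel_time", "dur": 9, "args": {"op_name": "Gemm"}},
    {"cat": "Node", "name": "/MatMulAddFusion_fence_before", "dur": 1, "args": {"op_name": "Gemm"}},
]
print(kernel_time_by_op(sample_events))
```

In the notebook, calling `kernel_time_by_op(data)` on the parsed profile would apply the same grouping to the real events.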


## 6) (Optional) Packaging with Olive (Zipfile)

Olive can package artifacts into a ZIP when you add `packaging_config` to the `engine` section of the run config.


In [7]:
# Optional cell: generate a ZIP with Olive from a simple workflow (without relying on problematic passes)
# Requires 'olive' to run with the kernel's Python: sys.executable -m olive ...

import sys, subprocess, json
from pathlib import Path

package_out_dir = outputs_dir / "m4_olive_package"
package_cache = outputs_dir / "m4_olive_cache"
cfg_path = outputs_dir / "m4_pack_run_config.json"

config = {
    "workflow_id": "m4_pack_fp32_only",
    "input_model": {
        # In some versions the registry key for the ONNX handler is "ONNXModel".
        "type": "ONNXModel",
        "config": {"model_path": str(fp32_model_path)},
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {"accelerators": [{"device": "cpu", "execution_providers": ["CPUExecutionProvider"]}]},
        }
    },
    "passes": {},
    "engine": {
        "host": "local_system",
        "target": "local_system",
        "cache_dir": str(package_cache),
        "output_dir": str(package_out_dir),
        "log_severity_level": 0,
        "evaluate_input_model": False,
        "packaging_config": {"type": "Zipfile", "name": "M4_OutputModels"},
    },
}

cfg_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
print("Wrote:", cfg_path)

cmd = [sys.executable, "-m", "olive", "run", "--config", str(cfg_path)]
print("Running:", " ".join(cmd))
completed = subprocess.run(cmd, text=True, capture_output=True)
print("returncode:", completed.returncode)
print("---- stderr (tail) ----")
print(completed.stderr[-2000:])

zips = sorted(package_out_dir.rglob("*.zip")) if package_out_dir.exists() else []
print("ZIPs:", zips)


Wrote: g:\source\VisualCode\repos\olive-python-vscode-labs\outputs\m4_pack_run_config.json
Running: g:\source\VisualCode\repos\olive-python-vscode-labs\.venv\Scripts\python.exe -m olive run --config g:\source\VisualCode\repos\olive-python-vscode-labs\outputs\m4_pack_run_config.json
returncode: 0
---- stderr (tail) ----

ZIPs: []
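
The recorded run above lists no ZIP files even though the command returned 0. If you still need a single archive of the lab's artifacts, a plain standard-library fallback is sketched below. It uses `zipfile` directly and is not Olive's packaging format or API. The helper name `bundle_artifacts` and the archive name are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_artifacts(files, zip_path):
    """Write the existing files into a flat ZIP and return the archive's member names."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in map(Path, files):
            if f.is_file():  # silently skip anything that does not exist
                zf.write(f, arcname=f.name)
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


# Demo with throwaway files; in the notebook you might pass fp32_model_path,
# optimized_model_path and prof_path instead.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "model.onnx").write_bytes(b"placeholder")
    (tmp / "profile.json").write_text("[]", encoding="utf-8")
    members = bundle_artifacts(
        [tmp / "model.onnx", tmp / "profile.json", tmp / "missing.bin"],
        tmp / "bundle" / "m4_artifacts.zip",
    )
    print(members)
```

Files are stored under their base names only, so two inputs with the same file name would collide. Keep the relative paths in `arcname` if that matters for your artifacts.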


## Verification (checklist)
- You have measured `FP32` (and `INT8` if it exists) for size and latency.
- `outputs/m4_optimized/linear_optimized.onnx` exists.
- A JSON profile exists in `outputs/m4_profiles/`.
- (Optional) A ZIP is generated in `outputs/m4_olive_package/` via `packaging_config`. In the run recorded above the command returned 0 but no ZIP was found, so check this item explicitly.
