# M7 — Hugging Face → Olive → ONNX → ONNX Runtime (CPU / NVIDIA GPU / Intel NPU)

This notebook brings together what was covered in M0–M6:
- Environment check (Python / pip / Olive / ONNX Runtime).
- Export and optimization to **ONNX** with **Olive**.
- Inference and performance measurement with **ONNX Runtime** across several **Execution Providers**.

> If the CUDA or OpenVINO execution provider is not available in your ONNX Runtime installation (for example, on a machine without an NVIDIA GPU or Intel NPU), those sections are **skipped** automatically.


## Goal

1. Download the `microsoft/Phi-4-mini-instruct` model from Hugging Face.
2. Convert it to ONNX and apply CPU-oriented INT8 quantization.
3. Test inference and latency on:
   - CPUExecutionProvider
   - CUDAExecutionProvider (if available)
   - OpenVINOExecutionProvider with `device_type="NPU"` (if available; falls back to `device_type="CPU"` if the NPU fails)
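
As a rough illustration of what "latency" means in step 3, the sketch below times an arbitrary callable after a few warm-up calls and reports per-iteration statistics. The helper name `measure_latency_ms` and the toy workload are assumptions made for this example. The benchmark cell later in the notebook reports only the mean.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=2, iters=5):
    """Time `fn` and return simple latency statistics in milliseconds."""
    # Warm-up calls are discarded: the first runs often include one-off setup cost.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Toy workload standing in for session.run(...); any zero-argument callable works.
stats = measure_latency_ms(lambda: sum(range(10_000)))
print(stats)
```

With a real session, `fn` would be something like `lambda: session.run(None, inputs)`. The median is usually less sensitive to a single slow iteration than the mean.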


## Official references (closed list for the course)

- Olive CLI: https://microsoft.github.io/Olive/0.6.1/features/cli.html
- Olive Hugging Face: https://microsoft.github.io/Olive/0.7.0/features/huggingface_model_optimization.html
- Olive quantization: https://microsoft.github.io/Olive/features/quantization.html
- ORT Execution Providers: https://onnxruntime.ai/docs/execution-providers/
- ORT CUDA EP: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html
- ORT OpenVINO EP: https://onnxruntime.ai/docs/execution-providers/OpenVINO-ExecutionProvider.html
- ORT Python API: https://onnxruntime.ai/docs/api/python/api_summary.html


In [None]:
# Cell 1: versions and environment (REQUIRED before running any Olive command)
import sys, subprocess
import onnxruntime as ort

print("Python executable:", sys.executable)
print("Python version:", sys.version)

print("\n== pip show olive-ai ==")
subprocess.run([sys.executable, "-m", "pip", "show", "olive-ai"], check=False)

print("\n== ONNX Runtime ==")
print("ORT version:", ort.get_version_string())
print("Available providers:", ort.get_available_providers())


In [None]:
# Cell 2: project folders
from pathlib import Path

# parents=True also creates missing intermediate directories.
Path("../models").mkdir(parents=True, exist_ok=True)
Path("../outputs").mkdir(parents=True, exist_ok=True)
Path("../data").mkdir(parents=True, exist_ok=True)

print("OK. Folders created/confirmed.")


In [None]:
# Cell 3: Hugging Face model
MODEL_ID = "microsoft/Phi-4-mini-instruct"
print("MODEL_ID:", MODEL_ID)


## 4) Create the Olive run config

We use `HfModel` as the input model and apply two passes:
- `OnnxConversion`
- `OnnxDynamicQuantization` (dynamic; needs no calibration dataset)
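
To make "dynamic" concrete: quantization parameters for activations are computed from the values seen at run time, which is why no calibration dataset is needed. The sketch below shows the idea with a per-tensor asymmetric uint8 scheme in plain Python. It is a conceptual illustration under that assumption, not the code ONNX Runtime or Olive actually runs, and the function names are made up for this example.

```python
def quantize_dynamic_uint8(values):
    """Quantize floats to uint8 using a range computed from the data itself."""
    # The range always includes 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(-lo / scale))
    quantized = [
        max(0, min(255, int(round(v / scale)) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_uint8(quantized, scale, zero_point):
    """Map uint8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


activations = [-1.0, 0.0, 0.5, 2.0]
codes, scale, zero_point = quantize_dynamic_uint8(activations)
restored = dequantize_uint8(codes, scale, zero_point)
print(codes, scale, zero_point)
print(restored)
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the precision traded for the smaller representation.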


In [None]:
import json
from pathlib import Path

run_dir = Path("../outputs") / "m7_phi4"
run_dir.mkdir(parents=True, exist_ok=True)

run_config_path = run_dir / "m7_run_config_cpu_int8.json"
cache_dir = run_dir / "cache"
output_dir = run_dir / "out_cpu_int8"

config = {
    "workflow_id": "m7_phi4_cpu_int8",
    "input_model": {
        "type": "HfModel",
        "model_path": MODEL_ID,
        # io_config is omitted so that Olive can infer it automatically.
        # Phi-4 is a CausalLM model with KV cache, which Olive is expected to detect.
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {"device": "cpu", "execution_providers": ["CPUExecutionProvider"]}
                ]
            }
        }
    },
    "passes": {
        "convert_to_onnx": {
            "type": "OnnxConversion",
            "config": {
                "target_opset": 17,  # Opset chosen for compatibility with recent models
            }
        },
        "quant_int8_dynamic": {"type": "OnnxDynamicQuantization", "config": {}},
    },
    "engine": {
        "host": "local_system",
        "target": "local_system",
        "cache_dir": str(cache_dir),
        "output_dir": str(output_dir),
        "log_severity_level": 0,
        "evaluate_input_model": False,
    },
}

run_config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
print("Wrote:", run_config_path)
print("Output dir:", output_dir)

## 5) Run Olive

We invoke `python -m olive run --config ...` through `sys.executable` so that Olive runs with the same Python interpreter as the notebook kernel.
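
The same pattern can be wrapped in a small helper that also checks, before launching anything, that the module is importable from the kernel's interpreter. This is a minimal sketch; `build_module_command` is a hypothetical helper name, and the standard-library module `json.tool` stands in for `olive` so the snippet runs in any environment.

```python
import importlib.util
import sys


def build_module_command(module, *args):
    """Build a `python -m <module> ...` command bound to the current interpreter."""
    # find_spec returns None when the module is not installed for this interpreter.
    if importlib.util.find_spec(module) is None:
        raise ModuleNotFoundError(
            f"'{module}' is not installed for {sys.executable}"
        )
    return [sys.executable, "-m", module, *args]


# In this notebook the call would be build_module_command("olive", "run", "--config", path).
cmd = build_module_command("json.tool", "--help")
print(cmd)
```

Failing early with the interpreter path in the message makes a wrong-kernel problem easier to spot than a generic subprocess error.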


In [None]:
import subprocess, sys

cmd = [sys.executable, "-m", "olive", "run", "--config", str(run_config_path)]
print("Running:", " ".join(cmd))

p = subprocess.run(cmd, capture_output=True, text=True)
print("returncode:", p.returncode)
print("---- stdout (tail) ----")
print(p.stdout[-4000:])
print("---- stderr (tail) ----")
print(p.stderr[-4000:])

if p.returncode != 0:
    raise RuntimeError("Olive failed. Check the log above.")

## 6) Locate the resulting ONNX model

In [None]:
from pathlib import Path

onnx_files = sorted(Path(output_dir).rglob("*.onnx"))
print("Found:", len(onnx_files))
for f in onnx_files[:10]:
    print("-", f)

if not onnx_files:
    raise FileNotFoundError("No .onnx files found in output_dir. Check the Olive log and the configuration.")

# Takes the first match in sorted order. If several models are listed above,
# select the intended (quantized) one explicitly instead.
optimized_onnx = onnx_files[0]
print("\nUsing:", optimized_onnx)


## 7) Multi-EP benchmark (CPU / CUDA / OpenVINO)

- CPU: always
- CUDA: if `CUDAExecutionProvider` appears in `ort.get_available_providers()`
- OpenVINO: if `OpenVINOExecutionProvider` appears; tries `device_type="NPU"` first and falls back to `CPU` if that fails
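
The selection rules above can be written as a pure function that takes the list of available providers and returns the candidates to try. This is only a sketch: `plan_benchmarks` is a hypothetical helper, and listing `CPUExecutionProvider` after CUDA is an assumption made here to keep the fallback explicit, not something the benchmark cell does.

```python
def plan_benchmarks(available):
    """Return (label, providers) candidates in the order they should be tried."""
    plan = [("CPU", ["CPUExecutionProvider"])]

    if "CUDAExecutionProvider" in available:
        # CPU is listed second so unsupported nodes have an explicit fallback.
        plan.append(("CUDA", ["CUDAExecutionProvider", "CPUExecutionProvider"]))

    if "OpenVINOExecutionProvider" in available:
        # Candidates in preference order; the caller stops at the first device that works.
        for device in ("NPU", "CPU"):
            plan.append((
                f"OpenVINO ({device})",
                [("OpenVINOExecutionProvider", {"device_type": device})],
            ))

    return plan


for label, providers in plan_benchmarks(["CPUExecutionProvider", "OpenVINOExecutionProvider"]):
    print(label, providers)
```

Each `providers` value has the shape accepted by the `providers` argument of `ort.InferenceSession`, so the caller can pass it through unchanged.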


In [None]:
import numpy as np
import onnxruntime as ort
import time

model_path = str(optimized_onnx)

def make_dummy_inputs(session: ort.InferenceSession, seq_len: int = 8, batch: int = 1):
    inputs = {}
    for inp in session.get_inputs():
        name = inp.name
        shape = []
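        for d in inp.shape:
            # ONNX Runtime reports dynamic dimensions as None or as a symbolic
            # name (a string), so anything that is not an int needs a concrete value.
            if not isinstance(d, int):
                shape.append(batch if len(shape) == 0 else seq_len)
            else:
                shape.append(d)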

        t = inp.type
        if "int64" in t:
            arr = np.zeros(shape, dtype=np.int64)
        elif "int32" in t:
            arr = np.zeros(shape, dtype=np.int32)
        elif "float16" in t:
            arr = np.zeros(shape, dtype=np.float16)
        else:
            arr = np.zeros(shape, dtype=np.float32)

        if name.lower() in ["input_ids", "input", "ids"]:
            # Cast afterwards: randint itself only accepts integer dtypes.
            arr = np.random.randint(0, 1000, size=shape).astype(arr.dtype)
        elif name.lower() in ["attention_mask", "mask"]:
            arr = np.ones(shape, dtype=arr.dtype)

        inputs[name] = arr
    return inputs

def benchmark_session(session: ort.InferenceSession, inputs: dict, warmup: int = 2, iters: int = 5):
    for _ in range(warmup):
        session.run(None, inputs)
    t0 = time.perf_counter()
    for _ in range(iters):
        session.run(None, inputs)
    t1 = time.perf_counter()
    return (t1 - t0) * 1000.0 / iters

results = []

# CPU
sess_cpu = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
lat_cpu = benchmark_session(sess_cpu, make_dummy_inputs(sess_cpu))
results.append(("CPUExecutionProvider", lat_cpu))

# CUDA
if "CUDAExecutionProvider" in ort.get_available_providers():
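    # preload_dlls() is only present in newer ONNX Runtime releases, hence the hasattr check.
    # Errors are ignored here; a real CUDA problem will surface when the session is created.
    try:
        if hasattr(ort, "preload_dlls"):
            ort.preload_dlls()
    except Exception:
        pass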

    sess_cuda = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
    lat_cuda = benchmark_session(sess_cuda, make_dummy_inputs(sess_cuda), warmup=2, iters=10)
    results.append(("CUDAExecutionProvider", lat_cuda))
else:
    print("CUDAExecutionProvider not available (is onnxruntime-gpu installed?).")

# OpenVINO
if "OpenVINOExecutionProvider" in ort.get_available_providers():
    def try_openvino(device_type: str):
        sess = ort.InferenceSession(model_path, providers=[("OpenVINOExecutionProvider", {"device_type": device_type})])
        lat = benchmark_session(sess, make_dummy_inputs(sess), warmup=2, iters=10)
        return lat

    try:
        results.append(("OpenVINOExecutionProvider (NPU)", try_openvino("NPU")))
    except Exception as e:
        print("OpenVINO NPU failed; falling back to CPU. Error:", e)
        results.append(("OpenVINOExecutionProvider (CPU)", try_openvino("CPU")))
else:
    print("OpenVINOExecutionProvider not available (is onnxruntime-openvino installed and setupvars run?).")

print("\nResults (ms/iter):")
for name, ms in sorted(results, key=lambda x: x[1]):
    print(f"- {name}: {ms:.3f} ms")


## Verification

- Olive produces at least one `.onnx` file under `outputs/m7_phi4/out_cpu_int8/`.
- The CPU benchmark runs (always).
- The CUDA and OpenVINO benchmarks run only if those providers appear in `ort.get_available_providers()`.
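
The first check can be scripted. The sketch below lists every `.onnx` file under a directory together with its size; `list_onnx_outputs` is a hypothetical helper, and a temporary directory with a fake file stands in for the real Olive output so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path


def list_onnx_outputs(output_dir):
    """Return (path, size_in_bytes) for each .onnx file under output_dir, largest first."""
    files = [(p, p.stat().st_size) for p in Path(output_dir).rglob("*.onnx")]
    return sorted(files, key=lambda item: item[1], reverse=True)


# Demo on a throwaway directory; in the notebook, pass the Olive output_dir instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.onnx").write_bytes(b"\0" * 16)
    (Path(tmp) / "notes.txt").write_text("ignored")
    found = list_onnx_outputs(tmp)

for path, size in found:
    print(f"{path.name}: {size} bytes")
```

For a model of this size the weights may be stored in a separate external-data file next to the `.onnx` file, so a small `.onnx` size on its own does not necessarily indicate a failed export.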

## Common errors

- Wrong kernel in VS Code → check `sys.executable` and select the right kernel.
- CUDA EP is not listed → you need an ONNX Runtime build with the CUDA EP (see the CUDA EP docs).
- OpenVINO EP is not listed, or the NPU fails → install `onnxruntime-openvino`, run `setupvars.bat`, and check compatibility (see the OpenVINO EP docs).
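
As a quick first diagnostic, the list returned by `ort.get_available_providers()` can be mapped to the hints above. The sketch below takes that list as a plain argument so it runs without ONNX Runtime installed; `provider_hints` is a hypothetical helper and the hint texts simply restate the items in this section.

```python
import sys


def provider_hints(available):
    """Return a troubleshooting hint for each optional provider that is missing."""
    hints = []
    if "CUDAExecutionProvider" not in available:
        hints.append(
            "CUDA EP not listed: an ONNX Runtime build with the CUDA EP is needed "
            "(is onnxruntime-gpu installed?)."
        )
    if "OpenVINOExecutionProvider" not in available:
        hints.append(
            "OpenVINO EP not listed: check that onnxruntime-openvino is installed "
            "and that setupvars has been run."
        )
    return hints


# Printing the interpreter path first helps rule out a wrong-kernel problem.
print("Kernel interpreter:", sys.executable)
for hint in provider_hints(["CPUExecutionProvider"]):
    print("-", hint)
```

In the notebook, the argument would be `ort.get_available_providers()`.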
