# 06-01 : LLM Serving

A basic test that serves an LLM as a function.

> ❗❗️ This was mainly an infrastructure test to verify that Nuclio can build the image and push it to the local registry. It is also useful to monitor network traffic with `iftop` while the container images, the package libraries, and finally the LLM are downloaded, to see where the bottlenecks are. Also watch disk usage with `watch -t 'df | head'`.
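
The disk usage check can also be done from inside the notebook. The following is a minimal sketch using only the standard library; the helper name `disk_usage_report` and the default path `/` are illustrative assumptions, not part of the original setup.

```python
import shutil


def disk_usage_report(path="/"):
    """Return total/used/free space for the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    # Guard against pseudo-filesystems that report a total size of zero.
    used_pct = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "path": path,
        "total_gib": round(usage.total / gib, 2),
        "used_gib": round(usage.used / gib, 2),
        "free_gib": round(usage.free / gib, 2),
        "used_pct": used_pct,
    }


print(disk_usage_report("/"))
```

Calling it before and after a deploy gives a rough idea of how much space the image layers and the model download consume on the node where the notebook runs.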

## References

- [Deploying an LLM using MLRun](https://docs.mlrun.org/en/v1.7.2/tutorials/genai_01_basic_tutorial.html)

In [9]:
import mlrun

In [10]:
# Show the API server URL
mlrun.get_run_db()

HTTPRunDB('http://dragon.local:30070')

## 1. Configuration

In [None]:
MODEL_ID = "microsoft/phi-2" # the model ID to use
project_name = "llm-serving" # the project name

### 1.1 Create The Project

In [12]:
project = mlrun.get_or_create_project(
    name=project_name,
    user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2025-07-31 13:35:55,478 [info] Project loaded successfully: {"project_name":"llm-serving-johannes"}
Full project name: llm-serving-johannes


### 1.2 Model Cache Directory

In [13]:
# the cache directory for the model
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)
print(f"Cache directory: {CACHE_DIR}")

Cache directory: s3://mlrun/projects/llm-serving-johannes/artifacts/cache
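
Note that the printed value is an `s3://` URL: the `v3io://` replacement had no effect because the artifact path in this cluster points at S3. Hugging Face loaders generally expect `cache_dir` to be a local filesystem path, so a remote URL is unlikely to work as a cache location. A minimal sketch of a more defensive mapping follows; the helper name `resolve_cache_dir` and the fallback `/tmp/llm-cache` are hypothetical and would need to be adapted to a volume that actually exists in the function pod.

```python
def resolve_cache_dir(artifact_path, project_name, fallback="/tmp/llm-cache"):
    """Map an MLRun artifact path to a local cache directory.

    - v3io:// paths are mapped to the /v3io mount (same rule as the cell above)
    - other remote schemes (s3://, gs://, ...) use `fallback` (an assumption)
    - plain local paths are used as-is
    """
    path = artifact_path.replace("{{run.project}}", project_name)
    if path.startswith("v3io://"):
        path = "/v3io" + path[len("v3io://"):]
    elif "://" in path:
        # A remote object-store URL is not a local directory; use the fallback.
        return fallback
    return path.rstrip("/") + "/cache"


print(resolve_cache_dir("s3://mlrun/projects/{{run.project}}/artifacts", "llm-serving-johannes"))
```

With the artifact path shown above, this returns the fallback directory instead of an `s3://` URL.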


## 2. Serving Function

In [14]:
# requirements for the function
requirements = [
    "transformers==4.41.2",
    "tensorflow==2.16.1",
    "torch"
]

# create the function to serve the model 
image = "registry-service.mlrun.svc.cluster.local/mlrun/mlrun-gpu:1.9.1-py39"  # specify the image to use
serve_func = project.set_function(
    name="serve-llm",
    func="src/06-01_serving.py",
    image=image,
    kind="nuclio",
    handler="invoke_llm",
    requirements=requirements
)

# set the environment variables for the function
serve_func.set_envs(env_vars={
    "MODEL_ID": MODEL_ID, 
    "CACHE_DIR": CACHE_DIR
})

# The model is held in memory, so use a single replica with one worker
# CPU-only inference can take ~1 minute, hence the longer timeouts
serve_func.spec.min_replicas = 1
serve_func.spec.max_replicas = 1
serve_func.with_http(worker_timeout=120, gateway_timeout=150, workers=1)
serve_func.set_config("spec.readinessTimeoutSeconds", 1200)

# set gpu resources for the function
serve_func.with_limits(gpus=1)
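
The handler file `src/06-01_serving.py` is not shown in this notebook. The following is a minimal sketch of what an `invoke_llm` handler could look like, not the actual file. It assumes a JSON request body with a `prompt` field and an optional `max_new_tokens` field (both hypothetical), reads the `MODEL_ID` and `CACHE_DIR` environment variables set above, and uses the standard Hugging Face `AutoTokenizer` / `AutoModelForCausalLM` loaders. The `generate` parameter is a hypothetical hook that makes the handler testable without downloading the model.

```python
import json
import os

# Both variables are set via serve_func.set_envs(...) in the notebook.
MODEL_ID = os.environ.get("MODEL_ID", "microsoft/phi-2")
CACHE_DIR = os.environ.get("CACHE_DIR")  # None falls back to the default HF cache

_model = None
_tokenizer = None


def _generate(prompt, max_new_tokens):
    """Load the model on first use, then generate a completion for `prompt`."""
    global _model, _tokenizer
    if _model is None:
        # Imported lazily so this module can be imported without the heavy dependencies.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        _tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
        _model = AutoModelForCausalLM.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
    inputs = _tokenizer(prompt, return_tensors="pt")
    output_ids = _model.generate(**inputs, max_new_tokens=max_new_tokens)
    return _tokenizer.decode(output_ids[0], skip_special_tokens=True)


def invoke_llm(context, event, generate=None):
    """Nuclio-style handler: parse the request body and return generated text as JSON."""
    generate = generate or _generate  # hypothetical test hook, unused by Nuclio
    body = event.body
    if isinstance(body, (bytes, str)):
        body = json.loads(body)
    prompt = body["prompt"]
    max_new_tokens = int(body.get("max_new_tokens", 64))
    return json.dumps({"generated_text": generate(prompt, max_new_tokens)})
```

In this sketch the model is loaded on the first request and stays on the default device (CPU); moving it to the GPU requested via `with_limits(gpus=1)` would need an explicit device placement step.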



In [15]:
# build the function
#project.build_function(function='serve-llm')

# deploy the function
#project.deploy_function(serve_func)
serve_func = project.deploy_function(function="serve-llm")

> 2025-07-31 13:35:55,548 [info] Starting remote function deploy
2025-07-31 11:35:55  (info) Deploying function
2025-07-31 11:35:55  (info) Building
2025-07-31 11:35:55  (info) Staging files and preparing base images
2025-07-31 11:35:55  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-07-31 11:35:55  (info) Building processor image
2025-07-31 11:39:42  (info) Build complete
Failed to deploy. Details:

== CUDA ==

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

   Use the NVIDIA Container Toolkit to st

RunError: Function serve-llm deployment failed