# 06-01 : LLM Serving

A basic test of serving an LLM as a function.

## References

- [Deploying an LLM using MLRun](https://docs.mlrun.org/en/v1.7.2/tutorials/genai_01_basic_tutorial.html)

In [1]:
import mlrun

In [2]:
# Show the API server URL
mlrun.get_run_db()

HTTPRunDB('http://dragon.local:30070')

## 1. Configuration

In [3]:
MODEL_ID = "microsoft/phi-2" # the model ID to use
project_name = "llm-serving" # the project name

### 1.1 Create The Project

In [4]:
project = mlrun.get_or_create_project(
    name=project_name,
    user_project=True)

# Display the current project name
project_name = project.metadata.name
print(f'Full project name: {project_name}')

> 2025-07-28 15:15:54,646 [info] Project loaded successfully: {"project_name":"llm-serving-johannes"}
Full project name: llm-serving-johannes


### 1.2 Model Cache Directory

In [5]:
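# Derive the model cache directory from the configured artifact path.
# The v3io replacement only affects v3io:// paths; other schemes (such as s3://) pass through unchanged.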
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)
print(f"Cache directory: {CACHE_DIR}")

Cache directory: s3://mlrun/projects/llm-serving-johannes/artifacts/cache
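
The printed path still starts with `s3://`: the `v3io://` replacement matches nothing here, so `CACHE_DIR` remains an object-store URL rather than a local directory. Whether that is a problem depends on how the serving code consumes `CACHE_DIR`, which this notebook does not show. The following is a minimal sketch that makes the distinction explicit. `resolve_cache_dir` and `is_local_path` are hypothetical helper names, not MLRun APIs, and the artifact path template passed in is assumed from the output above.

```python
from urllib.parse import urlparse


def resolve_cache_dir(artifact_path: str, project: str) -> str:
    """Fill in the project placeholder and map v3io:// URLs to the /v3io mount."""
    path = artifact_path.replace("{{run.project}}", project)
    path = path.replace("v3io://", "/v3io")
    return path + "/cache"


def is_local_path(path: str) -> bool:
    """Return True when the path carries no URL scheme such as s3://."""
    return urlparse(path).scheme == ""


# Assumed template, chosen to reproduce the output printed above.
cache_dir = resolve_cache_dir(
    "s3://mlrun/projects/{{run.project}}/artifacts", "llm-serving-johannes"
)
print(cache_dir, is_local_path(cache_dir))
# prints: s3://mlrun/projects/llm-serving-johannes/artifacts/cache False
```

A check like `is_local_path` could be run before passing the value to the function, so that a non-local cache location is noticed in the notebook rather than inside the deployed container.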


## 2. Serving Function

In [6]:
# requirements for the function
requirements = [
    "transformers==4.41.2",
    "tensorflow==2.16.1",
    "torch"
]

# create the function to serve the model 
serve_func = project.set_function(
    name="serve-llm",
    func="src/06-01_serving.py",
    image="dragon:6500/mlrun/mlrun-gpu:1.7.2",
    kind="nuclio",
    handler="invoke_llm",
    requirements=requirements
).apply(mlrun.auto_mount())

# set the environment variables for the function
serve_func.set_envs(env_vars={
    "MODEL_ID": MODEL_ID, 
    "CACHE_DIR": CACHE_DIR
})

# The model is held in memory, so use a single replica and a single worker.
# Inference may take about 1 minute when running on CPU, so the timeouts are increased.
serve_func.spec.min_replicas = 1
serve_func.spec.max_replicas = 1
serve_func.with_http(worker_timeout=120, gateway_timeout=150, workers=1)
serve_func.set_config("spec.readinessTimeoutSeconds", 1200)

# Request one GPU for the function
serve_func.with_limits(gpus=1)
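
The handler file `src/06-01_serving.py` is not included in this notebook. As a rough illustration of what a Nuclio handler named `invoke_llm` could look like, here is a hypothetical sketch, not the actual file. It assumes the request body carries a `prompt` field and an optional `max_new_tokens` field (both invented for this example), reads `MODEL_ID` and `CACHE_DIR` from the environment variables set above, and runs the model on its default device.

```python
import json
import os


def parse_request(body):
    """Return (prompt, max_new_tokens) from a bytes, str, or dict request body."""
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    if isinstance(body, str):
        body = json.loads(body)
    # "prompt" and "max_new_tokens" are assumed field names for this sketch.
    return body["prompt"], int(body.get("max_new_tokens", 64))


def init_context(context):
    """Nuclio calls this once per worker, so the model is loaded only once."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = os.environ["MODEL_ID"]
    cache_dir = os.environ.get("CACHE_DIR")
    tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
    model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir)
    # Keep the loaded objects on the worker's user_data for later requests.
    setattr(context.user_data, "tokenizer", tokenizer)
    setattr(context.user_data, "model", model)


def invoke_llm(context, event):
    """Generate a completion for the prompt found in the request body."""
    prompt, max_new_tokens = parse_request(event.body)
    tokenizer = context.user_data.tokenizer
    model = context.user_data.model
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"prompt": prompt, "output": text}
```

Loading the model in `init_context` rather than in `invoke_llm` matches the single-replica, single-worker configuration above: the weights are read once per worker and reused for every request.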



In [7]:
# build the function
#project.build_function(function='serve-llm')

# deploy the function
#project.deploy_function(serve_func)
serve_func = project.deploy_function(function="serve-llm")

> 2025-07-28 15:15:54,734 [info] Starting remote function deploy
2025-07-28 13:15:55  (info) Deploying function
2025-07-28 13:15:55  (info) Building
2025-07-28 13:15:55  (info) Staging files and preparing base images
2025-07-28 13:15:55  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-07-28 13:15:55  (info) Building processor image
Failed to deploy. Details:

Error - Job failed. Job logs:
error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "dragon:6500/nuclio/processor-llm-serving-johannes-serve-llm:latest": creating push check transport for dragon:6500 failed: Get "https://dragon:6500/v2/": http: server gave HTTP response to HTTPS client
    /nuclio/pkg/processor/build/builder.go:276

Call stack:
Failed to build processor image
    /nuclio/pkg/processor/build/builder.go:276

> 2025-07-28 15:16:15,099 [error] Nuclio funct

RunError: Function serve-llm deployment failed
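
The build log above shows the cause of the failure: the registry at `dragon:6500` answered in plain HTTP, while the image builder tried to reach it over HTTPS. This kind of mismatch is usually resolved either by serving the registry over TLS or by configuring the builder to treat the registry as insecure; the exact setting depends on the Nuclio and MLRun installation and is not covered here. As a first diagnostic step, the sketch below checks which scheme the registry answers on by probing the `/v2/` endpoint that appears in the log. `registry_urls` and `probe_registry` are hypothetical helper names.

```python
import http.client
import urllib.error
import urllib.request


def registry_urls(registry: str) -> dict:
    """Build the /v2/ probe URL for each scheme."""
    return {scheme: f"{scheme}://{registry}/v2/" for scheme in ("http", "https")}


def probe_registry(registry: str, timeout: float = 5.0) -> dict:
    """Try the registry over HTTP and HTTPS and report what each attempt returned."""
    results = {}
    for scheme, url in registry_urls(registry).items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                results[scheme] = f"HTTP {response.status}"
        except urllib.error.HTTPError as err:
            # The server answered with an error status (for example 401),
            # which still proves that this scheme is spoken.
            results[scheme] = f"HTTP {err.code}"
        except (urllib.error.URLError, http.client.HTTPException, OSError) as err:
            results[scheme] = f"failed: {err}"
    return results
```

Calling `probe_registry("dragon:6500")` from a host that can reach the registry should show a status code for one scheme and a failure for the other, if the registry is indeed HTTP-only as the log suggests.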