# Lab 7: LLM API server and Web interfaces

In this lab, you will learn how to serve modern large models on Linux servers behind an easy-to-use user interface. We will use Python as our main programming language; no knowledge of front-end languages such as JavaScript or CSS is required.

## 1 Calling Web Service APIs

In this experiment, we'll equip you with the basic knowledge and practical skills to start making powerful HTTP requests in Python. We'll cover GET and POST methods, and explore JSON data exchange. So, buckle up, let's code!

First, we will need the `requests` library. It is probably already installed in your Python environment; if it is not, you can install it using pip:

In [19]:
# %pip install requests

### 1.1 Basic `GET`

GET retrieves information from a specific web address (URL). Parameters are passed either in the path itself or as a query parameter (after ? in the URL).
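As a minimal sketch of how query parameters end up in a URL, the snippet below builds a query string using only the standard library and then splits the URL apart again. The host `example.com` and the `skip`/`limit` parameter names are placeholders chosen for illustration, not a real API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, used only to illustrate URL structure
base_url = "https://example.com/items"
params = {"skip": 2, "limit": 3}

# urlencode turns the dict into "key=value" pairs joined by "&"
url = f"{base_url}?{urlencode(params)}"
print(url)

# Splitting the URL separates the path from the query string
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query))
```

With `requests`, the same query string can be produced by passing the dictionary as `requests.get(base_url, params=params)`.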

Let's try the GET method to retrieve a random joke!

In [1]:
import requests

# Target URL
url = "https://api.chucknorris.io/jokes/random"

# Send a GET request and store the response
response = requests.get(url)

# Check the response status code (2XX means success)
print(f"Status code: {response.status_code}")

# Access the response content (raw bytes)
content = response.content

# Decode the content to text (fall back to UTF-8 if the server declares no charset)
text = content.decode(response.encoding or "utf-8")

# Print the response
print("\n--- Response Text ---")
print(text)

Status code: 200

--- Response Text ---
{"categories":["history"],"created_at":"2020-01-05 13:42:19.576875","icon_url":"https://api.chucknorris.io/img/avatar/chuck-norris.png","id":"rqcvwdgqq6amwony3nngba","updated_at":"2020-01-05 13:42:19.576875","url":"https://api.chucknorris.io/jokes/rqcvwdgqq6amwony3nngba","value":"In the Words of Julius Caesar, \"Veni, Vidi, Vici, Chuck Norris\". Translation: I came, I saw, and I was roundhouse-kicked inthe face by Chuck Norris."}


### 1.2 Playing with JSON

Many APIs and websites return data in the JSON format, a structured way to organize information. We can easily convert this JSON string to a Python dictionary for easy access:

In [2]:
import json
from pprint import pprint

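joke = json.loads(text)  # JSON string -> Python dict (named `joke` to avoid shadowing the built-in `dict`)
pprint(joke)

encoded_json = json.dumps(joke)  # Python dict -> JSON string
print(encoded_json)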

{'categories': ['history'],
 'created_at': '2020-01-05 13:42:19.576875',
 'icon_url': 'https://api.chucknorris.io/img/avatar/chuck-norris.png',
 'id': 'rqcvwdgqq6amwony3nngba',
 'updated_at': '2020-01-05 13:42:19.576875',
 'url': 'https://api.chucknorris.io/jokes/rqcvwdgqq6amwony3nngba',
 'value': 'In the Words of Julius Caesar, "Veni, Vidi, Vici, Chuck Norris". '
          'Translation: I came, I saw, and I was roundhouse-kicked inthe face '
          'by Chuck Norris.'}
{"categories": ["history"], "created_at": "2020-01-05 13:42:19.576875", "icon_url": "https://api.chucknorris.io/img/avatar/chuck-norris.png", "id": "rqcvwdgqq6amwony3nngba", "updated_at": "2020-01-05 13:42:19.576875", "url": "https://api.chucknorris.io/jokes/rqcvwdgqq6amwony3nngba", "value": "In the Words of Julius Caesar, \"Veni, Vidi, Vici, Chuck Norris\". Translation: I came, I saw, and I was roundhouse-kicked inthe face by Chuck Norris."}
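Once the response is a dictionary, individual fields can be read with ordinary indexing. The sketch below uses a shortened, hand-written sample shaped like the response above; the `value` text is abbreviated, and the `extra` field is made up to show how JSON `null` is converted.

```python
import json

# Shortened sample shaped like the API response; "extra" is a made-up field
sample = '{"categories": ["history"], "id": "rqcvwdgqq6amwony3nngba", "value": "Veni, Vidi, Vici", "extra": null}'

record = json.loads(sample)

# JSON objects become dicts, arrays become lists, null becomes None
print(record["value"])
print(record["categories"][0])
print(record["extra"] is None)

# dumps followed by loads gives back an equal dictionary
round_trip = json.loads(json.dumps(record))
print(round_trip == record)
```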


### 1.3 Moving on to POST Requests

While GET requests fetch data, POST requests send information to a server, like submitting a form. As an example, we'll use a dummy API that echoes back the data we send.

In [3]:
# Define URL and data
url = "https://httpbin.org/anything"
data = {"name": "John Doe", "age": 30}  # a python dictionary

# Send POST request with data
response = requests.post(url, data=data)  # a dict passed as data= is form-encoded (not JSON); use json=data to send JSON

# Check status code and print response
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "30", 
    "name": "John Doe"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, zstd", 
    "Content-Length": "20", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.32.3", 
    "X-Amzn-Trace-Id": "Root=1-6825da2e-680a85662195af3c3a9ea8fe"
  }, 
  "json": null, 
  "method": "POST", 
  "origin": "114.253.255.159", 
  "url": "https://httpbin.org/anything"
}



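We can see that the server received the data: the `form` field of the response contains exactly what we sent. Note that `Content-Type` is `application/x-www-form-urlencoded` and `json` is `null`, because a dictionary passed through `data=` is form-encoded rather than JSON-encoded.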
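The response above reports `Content-Type: application/x-www-form-urlencoded`, which indicates that `requests` form-encodes a dictionary passed through `data=`. As a rough illustration of the difference, the standard-library sketch below produces both encodings of the same dictionary; it mimics the request body rather than calling `requests` itself.

```python
import json
from urllib.parse import urlencode

data = {"name": "John Doe", "age": 30}

# Form encoding: what a dict passed as data= is turned into
form_body = urlencode(data)
print(form_body, len(form_body))

# JSON encoding: what an API expecting a JSON body needs
json_body = json.dumps(data)
print(json_body)
```

The form body is 20 characters long, matching the `Content-Length` header echoed by the server. To send JSON with `requests`, pass the dictionary as `requests.post(url, json=data)` instead.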

This is just the tip of the iceberg! You have now seen how to use an existing web service. In the remaining experiments, you will build your own API server and web service with a nice user interface.

## 2 Creating an API server using FastAPI

Most of you have already used the LLM APIs we provided, which give your programs access to the power of large language models. Here we will guide you through building your own LLM service using Python's `fastapi` library.

`fastapi`, together with the `uvicorn` server used below, takes care of launching a web server and serving the API calls. You only need to define a function that takes the input data from the request and produces the output; the framework handles the rest.

### 2.1 Basics on FastAPI

First, install `fastapi` and its dependencies if needed:

In [23]:
#%pip install uvicorn fastapi websockets

In [4]:
%%file /tmp/fastapi_example.py

from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

app = FastAPI()

## path parameters
@app.get('/g/{data}')
async def process_data(data: str):
    return f'Processed {data} by FastAPI!'

fake_items_db = [{"item_name": "Foo"}, {"item_name": "Bar"}, {"item_name": "Baz"}]
# Query parameters
@app.get("/items/")
async def read_item(skip: int = 0, limit: int = 10):
    return fake_items_db[skip : skip + limit]


## The data model
from typing import List
class Sale(BaseModel):
    day: int
    price: float
    
class Item(BaseModel):
    name: str
    inventory: int | None = 10
    sales: List[Sale] = []

# Getting Parameters from Request
@app.post("/post")
async def create_item(item: Item):
    return f'Hello {item.name}, {item.inventory} in stock, sold {len(item.sales)} items'

# The main() function is the entry point of the script
if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)


Writing /tmp/fastapi_example.py


In [25]:
## run the following command in your terminal to start the server
## python /tmp/fastapi_example.py 

In [5]:
# you can visit your web service at:

response = requests.get('http://localhost:54223/g/hello')
print(f"Status code: {response.status_code}")
response.content

Status code: 200


b'"Processed hello by FastAPI!"'

In [6]:
# Using the query parameter
response = requests.get('http://localhost:54223/items?skip=2&limit=3')
print(f"Status code: {response.status_code}")
response.content

Status code: 200


b'[{"item_name":"Baz"}]'
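Only one item comes back even though `limit=3` was requested. A quick way to see why is to run the handler's slicing expression by itself on the same list (a plain-Python sketch without the web server):

```python
fake_items_db = [{"item_name": "Foo"}, {"item_name": "Bar"}, {"item_name": "Baz"}]

def read_item(skip=0, limit=10):
    # Same slice as in the handler: start at `skip`, take at most `limit` items
    return fake_items_db[skip : skip + limit]

print(read_item(skip=2, limit=3))  # the slice stops at the end of the list
print(read_item())                 # the defaults return everything
print(read_item(skip=5))           # skipping past the end gives an empty list
```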

In [None]:
# Now let the magic happen.  
# Set port forwarding in your VSCode devcontainer to forward port 54223 to your local machine
# Then visit `http://localhost:54223/g/hello` in your browser, you will be able to see the return string in the browser!

In [7]:
# Also test the POST processing, with a complex data structure as input

url = "http://localhost:54223/post"
data = { "name": "Apple", 
         "inventory": 33, 
         "sales": [{"day": 0, "price": 3.4}, {"day": 1, "price": 3.3}]
         }
encoded = json.dumps(data).encode("utf-8")
response = requests.post(url, data=encoded)  # the parameters should be encoded as JSON
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
"Hello Apple, 33 in stock, sold 2 items"


In [30]:
# Another FastAPI magic: automatic document generation
# Visit http://localhost:54223/docs in your browser to see the API documentation
# (Assuming that you have your port forwarding set up correctly)

### 2.2 Creating an API to serve local LLM model

First, let's recall how to run a local LLM. The following script loads the Phi-4-mini-instruct model and generates a response.

In [9]:
%%file /tmp/local_llm.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def chat_resp(model, tokenizer, user_prompt=None, history=[]):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = list(history)  # copy so the caller's list is not mutated
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

## The main function is the entry point of the script
if __name__ == '__main__':
    model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True,
                                             )
    resp = chat_resp(model, tokenizer, "What is the meaning of life?")
    print(resp)


Overwriting /tmp/local_llm.py


In [32]:
## First verify that you can run the LLM locally (it should print the results, despite lots of warnings).
## python /tmp/local_llm.py

In [24]:
%%file /tmp/llm_api.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
         
from fastapi import FastAPI, Request, Query
from pydantic import BaseModel
import uvicorn

from urllib.parse import unquote

app = FastAPI()

def chat_resp(model, tokenizer, user_prompt=None, history=[]):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = list(history)  # copy so the caller's list is not mutated
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

#### Your Task ####
## Implement a GET handler that takes in a single string as prompt from user,
## and return the response as a single string.
#### End Task #### 
@app.get("/run")
async def run_get(q: str):
    output = chat_resp(model, tokenizer, user_prompt=q)
    return output[0]['generated_text']

#### Your Task ####
## Implement a POST handler that takes in a single string and a history
## and return the response as a single string.
#### End Task ####
class Input(BaseModel):
    Prompt: str
    History: list

@app.post("/chat")
async def run_post(input: Input):
    output = chat_resp(model, tokenizer, user_prompt=input.Prompt, history=input.History)
    return output[0]['generated_text']

#### Your Task ####
## The main function is the entry point of the script, you should load the model
## and then start the FastAPI server.
#### End Task ####
if __name__ == '__main__':
    model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True,
                                             )
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)

Overwriting /tmp/llm_api.py


In [34]:
## run the following command in your terminal to start the server
## python /tmp/llm_api.py
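The `/run` handler reads the prompt from the query parameter `q`, so a prompt containing spaces or non-ASCII characters has to be percent-encoded before it goes into the URL. Here is a small standard-library sketch of that encoding and its inverse (on the server side, the decoding is done for you):

```python
from urllib.parse import urlencode, parse_qs

prompt = "中国的首都是哪里？"  # "Where is the capital of China?"

# Each UTF-8 byte of a non-ASCII character becomes a %XX escape
query = urlencode({"q": prompt})
print(query)

# Decoding the query string recovers the original prompt
decoded = parse_qs(query)["q"][0]
print(decoded == prompt)
```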

In [25]:
## Run a single query to test the API, using GET

import urllib.parse
params = {"q": "中国的首都是哪里？"}  # "Where is the capital of China?"
prompt_url = urllib.parse.urlencode(params)  # percent-encode the non-ASCII prompt
url = f'http://localhost:54223/run?{prompt_url}'
print(url)
response = requests.get(url)
print(f"Status code: {response.status_code}")
print(response.content.decode(response.encoding))



http://localhost:54223/run?q=%E4%B8%AD%E5%9B%BD%E7%9A%84%E9%A6%96%E9%83%BD%E6%98%AF%E5%93%AA%E9%87%8C%EF%BC%9F
Status code: 200
"中国的首都是北京。"


In [34]:
#### Your Task ####
## Run a single-turn LLM query with POST, and add chat history (history is stored on the client side only)
url = "http://localhost:54223/chat"
data = { "Prompt": "刘可翰是男生还是女生？",  # "Is Liu Kehan a boy or a girl?"
         "History": [{"role": "user", "content": "刘可翰是女生。"}]  # "Liu Kehan is a girl."
}
encoded = json.dumps(data).encode("utf-8")
response = requests.post(url, data=encoded)  # the parameters should be encoded as JSON
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
"刘可翰是男生。根据您提供的初始信息，刘可翰被描述为女生。然而，姓名“刘可翰”通常是男性的姓名。如果您希望确认刘可翰的性别，最好咨询该个人或检查相关背景信息。"

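Because the server keeps no state, the client has to resend the earlier conversation on every call. The sketch below shows one way to maintain that history on the client. `fake_chat` is a stand-in for the real `requests.post` call so the example runs without a server, and the helper names are illustrative only.

```python
def build_payload(prompt, history):
    # Body expected by the /chat handler: the new prompt plus all earlier turns
    return {"Prompt": prompt, "History": list(history)}

def fake_chat(payload):
    # Stand-in for requests.post(url, json=payload); returns a canned reply
    return f"(reply to: {payload['Prompt']})"

def ask(prompt, history):
    reply = fake_chat(build_payload(prompt, history))
    # Record both sides of the turn so the next request carries them
    history.append({"role": "user", "content": prompt})
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
ask("My name is Tiger.", history)
ask("What is my name?", history)
for message in history:
    print(message["role"], ":", message["content"])
```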

## 3 Creating OpenAI-Compatible API server using vLLM

In the previous section, we created a simple API server using FastAPI. However, the OpenAI-style API has become the de facto standard for LLM services, and implementing it by hand is tedious. Luckily, many open-source frameworks provide OpenAI-compatible APIs. In this section, we will use vLLM to create an OpenAI-compatible API server.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It uses a novel GPU memory management technique called "PagedAttention" to enable efficient inference of large models.

vLLM has two modes, Offline Inference and OpenAI-Compatible Server:
- **Offline Inference**: This mode works much like the Hugging Face Transformers library. You load a model and run inference by using vLLM as a library.
- **OpenAI-Compatible Server**: This mode provides endpoints compatible with the OpenAI API, allowing you to run your own LLMs behind a similar interface.

In [37]:
#%pip install vllm

### 3.1 Offline Inference

The offline API is based on the `LLM` class. To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

The `LLM` class provides various methods for offline inference. See the Engine Arguments page of the vLLM documentation for the list of options available when initializing the model.

In [35]:
from vllm import LLM

llm = LLM(
    model="/ssdshare/share/model/Qwen3-0.6B-Base",
)

INFO 05-15 21:12:24 [__init__.py:239] Automatically detected platform cuda.
INFO 05-15 21:12:35 [config.py:600] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 05-15 21:12:35 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-15 21:12:36 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/ssdshare/share/model/Qwen3-0.6B-Base', speculative_config=None, tokenizer='/ssdshare/share/model/Qwen3-0.6B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backe

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 05-15 21:12:45 [loader.py:447] Loading weights took 6.58 seconds
INFO 05-15 21:12:45 [gpu_model_runner.py:1273] Model loading took 1.1103 GiB and 7.318280 seconds
INFO 05-15 21:12:54 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/ba6cb5e81d/rank_0_0 for vLLM's torch.compile
INFO 05-15 21:12:54 [backends.py:426] Dynamo bytecode transform time: 8.50 s
INFO 05-15 21:13:02 [backends.py:132] Cache the graph of shape None for later use
INFO 05-15 21:13:32 [backends.py:144] Compiling a graph for general shape takes 37.32 s
INFO 05-15 21:13:45 [monitor.py:33] torch.compile takes 45.82 s in total
INFO 05-15 21:13:46 [kv_cache_utils.py:578] GPU KV cache size: 131,536 tokens
INFO 05-15 21:13:46 [kv_cache_utils.py:581] Maximum concurrency for 32,768 tokens per request: 4.01x
INFO 05-15 21:14:13 [gpu_model_runner.py:1608] Graph capturing finished in 27 secs, took 0.49 GiB
INFO 05-15 21:14:13 [core.py:162] init engine (profile, create kv cache, warmup model) took

In vLLM, generative models implement the VllmModelForTextGeneration interface. Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, which are then passed through Sampler to obtain the final text.

The `generate` method is available to all generative models in vLLM. It is similar to its counterpart in HF Transformers, except that tokenization and detokenization are also performed automatically.


In [36]:
outputs = llm.generate("Which city is the capital of China?")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")



Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.85s/it, est. speed input: 2.08 toks/s, output: 184.82 toks/s]

Prompt: 'Which city is the capital of China?', Generated text: " A. Wuhan B. Beijing C. Chengdu D. Chongqing\nAnswer:\nB\n\nSocial consciousness plays a significant role in social existence, bringing warmth to human society, promoting cultural development, enriching scientific and technological achievements, and increasing educational spending, which can cause symptoms such as inflation, food shortages, or labor shortages. Which of the following would NOT seriously affect social development? \nA. A significant increase in social consciousness\nB. A comparison between a 'social consciousness bewildering to the world' and 'a tongue always able to speak, a body always unhealthy'\nC. The impact of actions that violate social morality\nD. The potential long-term harm brought by twists and turns in the direction of social moral development\nAnswer:\nA D\n\nIn the large economy system, the 'powerfiends' referring to yuan is the only form of currency, while the 'officials' referring to People'




You can optionally control the language generation by passing SamplingParams.

In [37]:
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.7,
    max_tokens=128,
)
outputs = llm.generate("Which city is the capital of China?", params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.46it/s, est. speed input: 11.73 toks/s, output: 187.68 toks/s]

Prompt: 'Which city is the capital of China?', Generated text: ' Which province has the largest area in China?\nA. Beijing, Heilongjiang Province\nB. Shanghai, Heilongjiang Province\nC. Beijing, Liaoning Province\nD. Shanghai, Liaoning Province\nAnswer: C\n\nWhich of the following options correctly represents the relationship of local standard to global standard?\nA. Local standard > global standard\nB. Local standard < global standard\nC. Local standard = global standard\nD. Unable to determine\nAnswer: B\n\nWhen using a straight probe to detect the depth of a steel pipe, if the depth of the steel pipe is 100 millimeters,'
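To give a feel for what `temperature` does, here is a simplified illustration of temperature-scaled softmax over a few made-up logits. It is a conceptual sketch only, not vLLM's actual sampler, which applies further processing.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make sampling more varied.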





The `chat` method implements chat functionality on top of `generate`. In particular, it accepts input similar to that of the OpenAI Chat Completions API and automatically applies the model's chat template to format the prompt.

In general, only instruction-tuned models have a chat template. Base models may perform poorly as they are not trained to respond to the chat conversation.

In [38]:
from vllm import LLM
import gc

# terminate the previous LLM instance to free up memory
llm = None
gc.collect()
llm = LLM(
    model="/ssdshare/share/model/Phi-4-mini-instruct",
    max_model_len=8192,
    max_num_seqs=1,
    gpu_memory_utilization=0.5,
)


INFO 05-15 21:17:39 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 05-15 21:17:48 [config.py:600] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 05-15 21:17:48 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-15 21:17:48 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/ssdshare/share/model/Phi-4-mini-instruct', speculative_config=None, tokenizer='/ssdshare/share/model/Phi-4-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', rea

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO 05-15 21:17:49 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-15 21:17:49 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 05-15 21:17:49 [gpu_model_runner.py:1258] Starting to load model /ssdshare/share/model/Phi-4-mini-instruct...


Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 05-15 21:17:51 [loader.py:447] Loading weights took 1.39 seconds
INFO 05-15 21:17:51 [gpu_model_runner.py:1273] Model loading took 7.1694 GiB and 1.713549 seconds
INFO 05-15 21:17:59 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/37d82d9555/rank_0_0 for vLLM's torch.compile
INFO 05-15 21:17:59 [backends.py:426] Dynamo bytecode transform time: 7.13 s
INFO 05-15 21:18:03 [backends.py:132] Cache the graph of shape None for later use
INFO 05-15 21:18:30 [backends.py:144] Compiling a graph for general shape takes 31.24 s
INFO 05-15 21:18:47 [monitor.py:33] torch.compile takes 38.37 s in total
INFO 05-15 21:18:48 [kv_cache_utils.py:578] GPU KV cache size: 26,464 tokens
INFO 05-15 21:18:48 [kv_cache_utils.py:581] Maximum concurrency for 8,192 tokens per request: 3.23x
INFO 05-15 21:19:15 [gpu_model_runner.py:1608] Graph capturing finished in 27 secs, took 0.55 GiB
INFO 05-15 21:19:15 [core.py:162] init engine (profile, create kv cache, warmup model) took 8
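The two "Maximum concurrency" figures in the logs appear to be the KV cache capacity divided by the per-request context length. That reading is an inference from the numbers printed above, not something taken from the vLLM source, but the arithmetic matches both runs:

```python
def max_concurrency(kv_cache_tokens, max_model_len):
    # How many full-length requests fit in the KV cache at the same time
    return kv_cache_tokens / max_model_len

# Qwen3-0.6B-Base run: 131,536 cached tokens, 32,768 tokens per request
print(f"{max_concurrency(131_536, 32_768):.2f}x")

# Phi-4-mini-instruct run: 26,464 cached tokens, 8,192 tokens per request
print(f"{max_concurrency(26_464, 8_192):.2f}x")
```

Under this reading, a smaller `max_model_len` lets more requests share the same cache, and a larger `gpu_memory_utilization` leaves more room for the cache itself.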

In [39]:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an long essay about the importance of higher education.",
    },
]
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
)
outputs = llm.chat(conversation, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\n\nGenerated text: {generated_text!r}")

INFO 05-15 21:19:19 [chat_utils.py:396] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.


Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.87s/it, est. speed input: 5.97 toks/s, output: 99.89 toks/s]

Prompt: '<|system|>You are a helpful assistant<|end|><|user|>Hello<|end|><|assistant|>Hello! How can I assist you today?<|end|><|user|>Write an long essay about the importance of higher education.<|end|><|assistant|>',

Generated text: "Title: The Imperative of Higher Education in Contemporary Society\n\nIn the ever-evolving landscape of contemporary society, the significance of higher education has never been more pronounced. Higher education serves as a cornerstone of personal development, economic advancement, and societal progress, offering myriad benefits that extend far beyond the confines of academia. This essay elucidates the multifaceted importance of higher education, exploring its impact on individuals, economies, and communities at large.\n\nAt the individual level, higher education serves as a catalyst for personal growth and intellectual development. The pursuit of higher education exposes students to diverse disciplines, fostering critical thinking, analytical skills, an
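The printed prompt shows what the chat template did to the conversation: each message is wrapped in role markers, and an opening `<|assistant|>` tag is appended. The function below is a hand-written approximation that reproduces this particular string; the real template ships with the model's tokenizer files and is applied by vLLM automatically.

```python
def format_phi4_style(messages):
    # Wrap each message as <|role|>content<|end|>, mirroring the prompt printed above
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    # The trailing tag cues the model to write the assistant's next turn
    return "".join(parts) + "<|assistant|>"

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {"role": "user", "content": "Write an long essay about the importance of higher education."},
]
print(format_phi4_style(conversation))
```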




In [40]:
llm = None
gc.collect()

89

### 3.2 OpenAI-Compatible Server

You can start the server via the vllm serve command:

In [None]:
# run it in your terminal
# vllm serve /ssdshare/share/model/Phi-4-mini-instruct --dtype auto --api-key token-abc123 --max-model-len 16384

Now you can use the OpenAI Python package to access the endpoint:

In [41]:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="/ssdshare/share/model/Phi-4-mini-instruct",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message.content)

Hello! How can I help you today?
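Under the hood, the OpenAI client sends an ordinary HTTP POST, so the techniques from Section 1 apply here too. The sketch below builds such a request with the standard library, assuming the server started above is listening on port 8000. `build_chat_request` and `chat_once` are illustrative helper names, and `chat_once` is not called here, so the snippet runs without a server.

```python
import json
import urllib.request

def build_chat_request(prompt, api_key="token-abc123"):
    payload = {
        "model": "/ssdshare/share/model/Phi-4-mini-instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The API key travels as a bearer token
            "Authorization": f"Bearer {api_key}",
        },
    )

def chat_once(prompt):
    # Requires the vLLM server to be running
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

request = build_chat_request("Hello!")
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```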


vLLM provides a set of endpoints that are compatible with the OpenAI API, such as completions, chat completions, and embeddings. You can find the full list of endpoints in the vLLM documentation.

Moreover, vLLM also exposes metrics that can be used to monitor the state and performance of the server. Two commonly watched metrics are TTFT and TPOT.

TTFT (time to first token) is the time it takes to generate the first token of the response. TPOT (time per output token) is the average time it takes to generate each subsequent token. Service Level Objectives (SLOs) for LLM serving are often stated in terms of these metrics, and vLLM puts considerable work into optimizing them.
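As a worked example of the two metrics, the sketch below computes TTFT and TPOT from a request's start time and the arrival times of its tokens. The timestamps are made up, and TPOT is taken here as the average gap between tokens after the first, which is one common definition.

```python
def ttft(request_time, token_times):
    # Time from sending the request to receiving the first token
    return token_times[0] - request_time

def tpot(token_times):
    # Average time per output token after the first one
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)

# Made-up timestamps in seconds: request sent at t=0.0, five tokens streamed back
request_time = 0.0
token_times = [0.30, 0.32, 0.34, 0.36, 0.38]

print(f"TTFT: {ttft(request_time, token_times):.2f} s")
print(f"TPOT: {tpot(token_times) * 1000:.0f} ms")
```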

## 4 Adding a Web User Interface using `gradio`

Demoing a machine learning application is important: it gives users direct, interactive experience of your algorithm. Here we'll build a demo using `gradio`, a popular Python library for ML demos. Let's install this library.

### 4.1 Basic Gradio

In [None]:
# %pip install gradio --upgrade

Then we can write an example UI that takes a text string and a slider value and outputs a processed string.

In [1]:
%%file /tmp/gradio_example.py

import gradio as gr

def greet(name, intensity):
    return "Hello, hello " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "slider"],
    outputs=["text"],
)

demo.launch()


Writing /tmp/gradio_example.py


In [None]:
# Start the gradio server by running the following command

# python /tmp/gradio_example.py

In [None]:
## Forward port 7860 (Gradio's default port), then open http://localhost:7860 in your browser

## Try changing the last line (launch) to

## demo.launch(share=True)
## then observe the output and open the public link it prints (no port forwarding needed)


### 4.2 The ChatInterface

In [2]:
%%file /tmp/gradio_example.py

import random

def random_response(message, history):
    return random.choice(["Yes", "No"])

import gradio as gr
gr.ChatInterface(random_response).launch()

Overwriting /tmp/gradio_example.py


In [None]:
# Kill your previous process, and restart the new process

# python /tmp/gradio_example.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 automatically. 
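A chat function receives the new `message` plus the conversation `history`. The exact shape of `history` depends on the Gradio version and settings: it may be a list of `[user, assistant]` pairs or a list of `{"role", "content"}` dictionaries. The helper below is a hedged sketch (the name `to_messages` is made up) that normalizes either shape into OpenAI-style messages; check what your installed version actually passes.

```python
def to_messages(history):
    # Normalize Gradio chat history into a list of role/content dicts
    messages = []
    for turn in history:
        if isinstance(turn, dict):
            # Already in message form
            messages.append({"role": turn["role"], "content": turn["content"]})
        else:
            # Pair form: (user_text, assistant_text)
            user_text, assistant_text = turn
            messages.append({"role": "user", "content": user_text})
            if assistant_text is not None:
                messages.append({"role": "assistant", "content": assistant_text})
    return messages

pairs = [["Hi", "Hello!"], ["Are you a bot?", "Yes"]]
dicts = [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]
print(to_messages(pairs))
print(to_messages(dicts))
```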

### 4.3 Quick and dirty way of creating a UI for a HuggingFace pipeline

In [3]:
%%file /tmp/simpleui.py

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import gradio as gr

model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.6,
    do_sample=True,
    return_full_text=False,
    max_new_tokens=500,
) 
gr.Interface.from_pipeline(pipe).launch(debug=True)

Writing /tmp/simpleui.py


In [None]:
# python /tmp/simpleui.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 or 7862 automatically. 

### 4.4 A better way to build a web UI for LLM (through an LLM API server)

Next, you should implement a script that interacts with the Phi-4-mini chat API server you just created.

Note that you should call the API server directly using `requests`, instead of running the LLM within your UI server process.

![Illustration of request](./assets/request.jpg)

In [4]:
%%file /tmp/llm_api.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
         
from fastapi import FastAPI, Request, Query
from pydantic import BaseModel
import uvicorn

from urllib.parse import unquote

app = FastAPI()

def chat_resp(model, tokenizer, user_prompt=None, history=[]):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = list(history)  # copy so the caller's list is not mutated
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

class Input(BaseModel):
    Prompt: str
    History: list

@app.post("/chat")
async def run_post(input: Input):
    output = chat_resp(model, tokenizer, user_prompt=input.Prompt, history=input.History)
    return output[0]['generated_text']

if __name__ == '__main__':
    model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True,
                                             )
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)

Writing /tmp/llm_api.py


In [14]:
%%file /tmp/chatUI.py

import gradio as gr
import requests
import json

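def predict(message, history):
    url = "http://localhost:54223/chat"
    # The second text box holds the history as a JSON list; treat an empty box as no history
    history = json.loads(history) if history.strip() else []
    data = {
        "Prompt": message,
        "History": history
    }
    encoded = json.dumps(data).encode("utf-8")
    response = requests.post(url, data = encoded)
    return response.json()  # decode the JSON-encoded reply string returned by the API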

interface = gr.Interface(
    fn=predict,
    inputs=["text", "text"],
    outputs=["text"],
)

#### Your Task ####
# Insert code here to perform the inference
# You can use either the hand-crafted API server or the OpenAI-compatible vLLM server
# [{"role": "user", "content": "My name is Tiger."}]
#### End Task ####

interface.launch(share=True)

Overwriting /tmp/chatUI.py


In [None]:
## Do not forget to start your API server (from above, use the /chat API or use the vLLM)

In [None]:
## Set up port forwarding for the Gradio port (7860 by default), then open http://localhost:7860 in your browser.
## If a previous Gradio app is still running, the new one automatically moves to the next free port (7861, 7862, ...).
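
Since the port can shift when an earlier app is still running, it is useful to know which port to forward. The following is a minimal sketch using only the standard library. `first_free_port` is a hypothetical helper that only approximates Gradio's own port search.

```python
import socket


def first_free_port(start=7860, attempts=100, host="127.0.0.1"):
    """Return the first port in [start, start + attempts) that can be bound on host."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is already in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")
```

The result can also be passed to `interface.launch(server_port=...)` to pin the UI to a known port instead of letting it drift.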

In [21]:
%%file /tmp/chatUI.py
# vllm serve /ssdshare/share/model/Phi-4-mini-instruct --dtype auto --api-key token-abc123 --max-model-len 16384


import gradio as gr
import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"
)

def predict(message, history):
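    # The history textbox holds a JSON list of messages; a blank box means no history
    history = json.loads(history) if history.strip() else []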
    response = client.chat.completions.create(
        model="/ssdshare/share/model/Phi-4-mini-instruct",
        messages=history + [{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
    

interface = gr.Interface(
    fn=predict,
    inputs=["text", "text"],
    outputs=["text"],
)

interface.launch(share=True)

Overwriting /tmp/chatUI.py
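
The history textbox is free text, so it is worth validating before it reaches the model. The sketch below assumes the history is a JSON list of `{"role": ..., "content": ...}` dicts, as in the task comment above. `parse_history` and `build_messages` are hypothetical helper names.

```python
import json


def parse_history(raw):
    """Parse the history textbox into a list of chat messages.

    Blank input means no history; anything else must be a JSON list of
    dicts that each carry "role" and "content" keys.
    """
    if raw is None or not raw.strip():
        return []
    history = json.loads(raw)
    if not isinstance(history, list):
        raise ValueError("history must be a JSON list")
    for item in history:
        if not (isinstance(item, dict) and "role" in item and "content" in item):
            raise ValueError(f"malformed history entry: {item!r}")
    return history


def build_messages(message, raw_history):
    """Append the new user turn to the parsed history."""
    return parse_history(raw_history) + [{"role": "user", "content": message}]
```

Inside `predict`, the result of `build_messages(message, history)` could be passed as the `messages` argument of the chat completion call.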


### 4.5 More Gradio: Streaming and Multi-media

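Gradio also supports streaming and multimedia input and output, such as live microphone audio.

Streaming is enabled by the `streaming=True` parameter on the input component (here `gr.Audio`): Gradio then calls the function repeatedly with each new audio chunk instead of once with the complete recording.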

In [23]:
%%file /tmp/transcribe.py

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import numpy as np
import gradio as gr

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_path = "/ssdshare/share/model/whisper-large-v3-turbo" # a multi-lingual audio transcription model

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_path)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)


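def transcribe(stream, new_chunk):
    sr, y = new_chunk

    # Convert to mono if stereo
    if y.ndim > 1:
        y = y.mean(axis=1)

    # Normalize to [-1, 1]; skip the division for an all-zero (silent) chunk
    y = y.astype(np.float32)
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak

    # Append the new chunk to everything recorded so far
    if stream is not None:
        stream = np.concatenate([stream, y])
    else:
        stream = y

    # Decoding options are pipeline arguments, not keys of the audio dict
    result = pipe(
        {"sampling_rate": sr, "raw": stream},
        return_timestamps=True,
        generate_kwargs={"task": "transcribe", "language": "chinese"},
    )
    return stream, result["text"]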


demo = gr.Interface(
    transcribe,
    ["state", gr.Audio(sources=["microphone"], streaming=True)], # note the streaming=True
    ["state", "text"],
    live=True,
    time_limit=30,
    stream_every=0.5,
)

demo.launch()

Overwriting /tmp/transcribe.py


Try it!

In [None]:
## python /tmp/transcribe.py
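
The `"state"` input and output are what let `transcribe` see the whole recording: Gradio hands the state returned by one call back in on the next chunk. The loop below imitates that contract without Gradio or a model. `accumulate` and the fake chunks are illustrative assumptions, with plain lists standing in for NumPy arrays and 16 kHz as an assumed sample rate.

```python
def accumulate(stream, new_chunk):
    """Mimic a streaming callback: take (state, chunk), return (new_state, output)."""
    sample_rate, samples = new_chunk
    stream = list(samples) if stream is None else stream + list(samples)
    seconds = len(stream) / sample_rate
    return stream, f"{seconds:.2f} s buffered"


# Simulate three 0.5 s chunks, as stream_every=0.5 would deliver them
state = None
outputs = []
for _ in range(3):
    chunk = (16000, [0.0] * 8000)
    state, text = accumulate(state, chunk)
    outputs.append(text)

print(outputs)  # ['0.50 s buffered', '1.00 s buffered', '1.50 s buffered']
```

Because the buffer grows with every chunk, each call does more work than the last, which is one reason the demo caps the recording with `time_limit=30`.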

### 4.6 Build your own Gradio UI

Create a separate Gradio UI that serves another model, for example an image model from Lab 5, or a translation model combined with a transcription model. The Gradio documentation and the Hugging Face model cards are good starting points.
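
One way to put two models behind a single UI is to compose their functions and hand the combined function to `gr.Interface`. The sketch below uses stand-in callables in place of real models. `chain`, `fake_transcribe`, and `fake_translate` are all hypothetical names.

```python
def chain(*steps):
    """Compose single-argument callables left to right into one function."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


# Stand-ins for real models; swap in e.g. a speech pipeline and an LLM call
def fake_transcribe(audio_path):
    return f"transcript of {audio_path}"


def fake_translate(text):
    return text.upper()


transcribe_then_translate = chain(fake_transcribe, fake_translate)
print(transcribe_then_translate("clip.wav"))  # TRANSCRIPT OF CLIP.WAV
```

The composed function takes one input and returns one output, so it fits an interface with a single audio input and a single text output.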

In [10]:
#### Your Task ####
#### End Task ####

from openai import OpenAI
import os
from dotenv import load_dotenv
import gradio as gr

load_dotenv()
base_url = os.getenv("SILICON_BASE_URL")
api_key = os.getenv("SILICON_API_KEY")

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

def image_generation(prompt):
    response = client.images.generate(
        model="Kwai-Kolors/Kolors",
        prompt=prompt,
    )
    # Extract the image URL from the response
    image_url = response.data[0].url
    return image_url

gr.Interface(
    fn=image_generation,
    inputs=["text"],
    outputs=gr.Image(type="filepath", label="Generated Image"),
).launch(share=True)


* Running on local URL:  http://127.0.0.1:7864

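
`image_generation` above returns a remote URL. If a later processing step needs a local file, the bytes can be saved first. The following is a sketch using the standard library. `save_image_bytes` and `download_image` are hypothetical helpers, and the `.png` suffix is an assumption about the returned format.

```python
import tempfile
import urllib.request
from pathlib import Path


def save_image_bytes(data, suffix=".png"):
    """Write raw image bytes to a new temporary file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(data)
    return Path(handle.name)


def download_image(url, timeout=60):
    """Fetch an image URL and store it locally; returns the local path."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return save_image_bytes(response.read())
```

Returning `str(download_image(image_url))` from the Gradio function would then give the `gr.Image(type="filepath")` output a local path.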


