# Lab 7: LLM API server and Web interfaces

In this lab, you will learn how to serve modern large models on Linux servers behind an easy-to-use user interface. We will use Python as our main programming language; no knowledge of front-end languages such as JavaScript or CSS is required.

## 1 Calling Web Service APIs

In this experiment, we'll equip you with the basic knowledge and practical skills to start making powerful HTTP requests in Python. We'll cover GET and POST methods, and explore JSON data exchange. So, buckle up, let's code!

First, we need the `requests` library. It is usually already installed in your Python environment; if not, install it with pip:

In [1]:
%pip install requests

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


### 1.1 Basic `GET`

GET retrieves information from a specific web address (URL). Parameters are passed either in the path itself or as a query parameter (after ? in the URL).
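As a minimal sketch of the two ways to pass parameters, the snippet below assembles a URL by hand with the standard library. The host `example.com` and the `category` and `limit` parameters are hypothetical, chosen only for illustration. In practice, `requests.get(url, params=query)` builds the query string for you.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://example.com/jokes"   # hypothetical endpoint, for illustration only
path_param = "random"                    # parameter passed in the path itself
query = {"category": "dev", "limit": 3}  # parameters passed after the '?'

# Path parameter goes into the path, query parameters are appended after '?'
url = f"{base_url}/{path_param}?{urlencode(query)}"
print(url)

# Splitting the URL again shows which part is which
parts = urlsplit(url)
print(parts.path)             # the path, including the path parameter
print(parse_qs(parts.query))  # the query parameters, always parsed as strings
```

Note that query parameter values are transmitted as text, so the integer `3` comes back as the string `"3"`.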

Let's try the GET method to retrieve a random joke!

In [3]:
import requests

# Target URL
url = "https://api.chucknorris.io/jokes/random"

# Send a GET request and store the response
response = requests.get(url)

# Check the response status code (2XX means success)
print(f"Status code: {response.status_code}")

# Access the response content (raw bytes)
content = response.content

# Decode the content to text (may differ depending on API)
text = content.decode(response.encoding or "utf-8")  # fall back to UTF-8 if no encoding is declared

# Print the response
print("\n--- Response Text ---")
print(text)

Status code: 200

--- Response Text ---
{"categories":[],"created_at":"2020-01-05 13:42:23.240175","icon_url":"https://api.chucknorris.io/img/avatar/chuck-norris.png","id":"dzsV9TzMTnCbckmmbzNJyA","updated_at":"2020-01-05 13:42:23.240175","url":"https://api.chucknorris.io/jokes/dzsV9TzMTnCbckmmbzNJyA","value":"When nature calls Chuck Norris hangs up"}


### 1.2 Playing with JSON

Many APIs and websites return data in the JSON format, a structured way to organize information. We can easily convert this JSON string to a Python dictionary for easy access:

In [4]:
import json
from pprint import pprint

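joke = json.loads(text)  # JSON string -> Python dict (avoid the name `dict`, which shadows the built-in)
pprint(joke)

encoded_json = json.dumps(joke)  # Python dict -> JSON string
print(encoded_json)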

{'categories': [],
 'created_at': '2020-01-05 13:42:23.240175',
 'icon_url': 'https://api.chucknorris.io/img/avatar/chuck-norris.png',
 'id': 'dzsV9TzMTnCbckmmbzNJyA',
 'updated_at': '2020-01-05 13:42:23.240175',
 'url': 'https://api.chucknorris.io/jokes/dzsV9TzMTnCbckmmbzNJyA',
 'value': 'When nature calls Chuck Norris hangs up'}
{"categories": [], "created_at": "2020-01-05 13:42:23.240175", "icon_url": "https://api.chucknorris.io/img/avatar/chuck-norris.png", "id": "dzsV9TzMTnCbckmmbzNJyA", "updated_at": "2020-01-05 13:42:23.240175", "url": "https://api.chucknorris.io/jokes/dzsV9TzMTnCbckmmbzNJyA", "value": "When nature calls Chuck Norris hangs up"}
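To make the conversion more concrete, here is a small sketch of how JSON types map onto Python types, and of how `json.dumps` treats non-ASCII text. The `raw` string is made-up sample data, not output from any real API.

```python
import json

# Made-up sample JSON, for illustration only
raw = '{"name": "Baz", "in_stock": true, "price": 3.5, "tags": null, "sales": [1, 2]}'
item = json.loads(raw)

# JSON true/false -> bool, null -> None, array -> list, object -> dict
print(type(item["in_stock"]), item["tags"], item["sales"][0])

# By default json.dumps escapes non-ASCII characters;
# ensure_ascii=False keeps them human-readable
escaped = json.dumps({"city": "北京"})
readable = json.dumps({"city": "北京"}, ensure_ascii=False)
print(escaped)
print(readable)
```

Both encodings decode back to the same Python object, so the choice only affects how the JSON text looks.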


### 1.3 Moving on to POST Requests

While GET requests fetch data, POST requests send information to a server, like submitting a form. As an example, we'll use a dummy API that echoes back the data we send.

In [4]:
# Define URL and data
url = "https://httpbin.org/anything"
data = {"name": "John Doe", "age": 30}  # a python dictionary

# Send POST request with data
response = requests.post(url, data=data)  # a dict passed as data= is form-encoded; use json=data to send JSON

# Check status code and print response
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "30", 
    "name": "John Doe"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, zstd", 
    "Content-Length": "20", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.32.3", 
    "X-Amzn-Trace-Id": "Root=1-6825853f-5cfb7df74b800eb05eaccce2"
  }, 
  "json": null, 
  "method": "POST", 
  "origin": "114.253.255.159", 
  "url": "https://httpbin.org/anything"
}



We can see that the server received the data we sent: the `form` field shows exactly the same values, form-encoded rather than JSON-encoded (note the `Content-Type` header).
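The echoed headers report `Content-Type: application/x-www-form-urlencoded` and `Content-Length: 20`. The sketch below reproduces the two possible request bodies with the standard library, to show what differs between form encoding and JSON encoding of the same dictionary. It only illustrates the encodings and is not what `requests` runs internally.

```python
import json
from urllib.parse import urlencode

data = {"name": "John Doe", "age": 30}

# Form encoding: key=value pairs joined by '&', spaces become '+'
form_body = urlencode(data)
print(form_body, len(form_body))

# JSON encoding: types are preserved (30 stays a number, not the string "30")
json_body = json.dumps(data)
print(json_body)
```

The form body is exactly 20 characters long, matching the `Content-Length` above. Passing `json=data` instead of `data=data` to `requests.post` sends a JSON body with `Content-Type: application/json`, in which case the echo service should report the payload under `json` rather than `form`.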

This is just the tip of the iceberg! Now you have seen how we can utilize the existing web service. In the remaining experiments, you will be building your own API server and web service with a nice user interface.

## 2 Creating an API server using FastAPI

Most of you have already used the LLM APIs we provided, which give your programs access to the power of large language models. Here we will guide you through building your own LLM service using the Python `fastapi` library.

Together with the `uvicorn` server, `fastapi` takes care of launching a web server and serving the API calls. You only need to define a function that takes the input data from the request and produces the output; `fastapi` handles the rest.

First, install `fastapi` and its dependencies if needed:

### 2.1 Basics on FastAPI

In [5]:
%pip install uvicorn fastapi websockets

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


In [6]:
%%file /tmp/fastapi_example.py

from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

app = FastAPI()

## path parameters
@app.get('/g/{data}')
async def process_data(data: str):
    return f'Processed {data} by FastAPI!'

fake_items_db = [{"item_name": "Foo"}, {"item_name": "Bar"}, {"item_name": "Baz"}]
# Query parameters
@app.get("/items/")
async def read_item(skip: int = 0, limit: int = 10):
    return fake_items_db[skip : skip + limit]


## The data model
from typing import List
class Sale(BaseModel):
    day: int
    price: float
    
class Item(BaseModel):
    name: str
    inventory: int | None = 10
    sales: List[Sale] = []

# Getting Parameters from Request
@app.post("/post")
async def create_item(item: Item):
    return f'Hello {item.name}, {item.inventory} in stock, sold {len(item.sales)} items'

# The main() function is the entry point of the script
if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)


Writing /tmp/fastapi_example.py


In [19]:
## run the following command in your terminal to start the server
## python /tmp/fastapi_example.py 

In [7]:
# you can visit your web service at:

response = requests.get('http://localhost:54223/g/hello')
print(f"Status code: {response.status_code}")
response.content

Status code: 200


b'"Processed hello by FastAPI!"'

In [8]:
# Using the query parameter
response = requests.get('http://localhost:54223/items?skip=2&limit=3')
print(f"Status code: {response.status_code}")
response.content

Status code: 200


b'[{"item_name":"Baz"}]'
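The result above contains a single item even though `limit=3` was requested. A plain-Python sketch of the handler body, without the web server, shows why: the query parameters simply become slice bounds, and slicing past the end of a list is not an error.

```python
fake_items_db = [{"item_name": "Foo"}, {"item_name": "Bar"}, {"item_name": "Baz"}]

def read_item(skip: int = 0, limit: int = 10):
    # Same body as the FastAPI handler: skip the first `skip` items,
    # then return at most `limit` items
    return fake_items_db[skip : skip + limit]

print(read_item())                 # defaults: all three items
print(read_item(skip=2, limit=3))  # only one item is left after skipping two
print(read_item(skip=5))           # skipping past the end gives an empty list
```

In the real server, FastAPI uses the `int` type hints to convert the text values from the URL into integers before calling the function.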

In [23]:
# Now let the magic happen.  
# Set port forwarding in your VSCode devcontainer to forward port 54223 to your local machine
# Then visit `http://127.0.0.1:54223/g/hello` in your browser, you will be able to see the return string in the browser!

In [9]:
# Also test the POST processing, with a complex data structure as input

url = "http://localhost:54223/post"
data = { "name": "Apple", 
         "inventory": 33, 
         "sales": [{"day": 0, "price": 3.4}, {"day": 1, "price": 3.3}]
         }
encoded = json.dumps(data).encode("utf-8")
response = requests.post(url, data=encoded)  # the parameters should be encoded as JSON
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
"Hello Apple, 33 in stock, sold 2 items"


In [25]:
# Another FastAPI magic: automatic document generation
# Visit http://localhost:54223/docs in your browser to see the API documentation
# (Assuming that you have your port forwarding set up correctly)

### 2.2 Creating an API to serve local LLM model

First, let's recall how to run a local LLM. The following script loads the Phi-4-mini-instruct model and generates a single reply.

In [10]:
%%file /tmp/local_llm.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def chat_resp(model, tokenizer, user_prompt=None, history=[]):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = list(history)  # copy so the caller's list is not modified
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

## The main function is the entry point of the script
if __name__ == '__main__':
    model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True,
                                             )
    resp = chat_resp(model, tokenizer, "What is the meaning of life?")
    print(resp)


Writing /tmp/local_llm.py


In [27]:
## first verify that you can run the LLM locally (it should print the result, despite lots of warnings)
## python /tmp/local_llm.py

In [12]:
%%file /tmp/llm_api.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
         
from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

from urllib.parse import unquote

app = FastAPI()

def chat_resp(model, tokenizer, user_prompt=None, history=[]):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = list(history)  # copy so the request's list is not modified
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

#### Your Task ####
## Implement a GET handler that takes in a single string as prompt from user,
## and return the response as a single string.
@app.get('/run')
async def run(q: str = ""):
    prompt = q  # FastAPI has already URL-decoded the query parameter, so no unquote() is needed
    result = chat_resp(model, tokenizer, user_prompt=prompt)
    # Extract the generated text from the result
    return result[0]["generated_text"]
#### End Task ####

#### Your Task ####
## Implement a POST handler that takes in a single string and a history
## and return the response as a single string.
class ChatRequest(BaseModel):
    prompt: str
    history: list = []

@app.post('/chat')
async def chat(request: ChatRequest):
    result = chat_resp(model, tokenizer, user_prompt=request.prompt, history=request.history)
    return result[0]["generated_text"]
#### End Task ####

#### Your Task ####
## The main function is the entry point of the script, you should load the model
## and then start the FastAPI server.
if __name__ == '__main__':
    model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                           device_map="cuda:0", 
                                           torch_dtype="auto", 
                                           trust_remote_code=True,
                                          )
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)
#### End Task ####


Overwriting /tmp/llm_api.py


In [29]:
## run the following command in your terminal to start the server
## python /tmp/llm_api.py

In [13]:
## Run a single query to test the API, using GET

import urllib.parse
params = {"q": "中国的首都是哪里？"}
prompt_url = urllib.parse.urlencode(params)
url = f'http://localhost:54223/run?{prompt_url}'
print(url)
response = requests.get(url)
print(f"Status code: {response.status_code}")
print(response.content.decode(response.encoding or "utf-8"))

http://localhost:54223/run?q=%E4%B8%AD%E5%9B%BD%E7%9A%84%E9%A6%96%E9%83%BD%E6%98%AF%E5%93%AA%E9%87%8C%EF%BC%9F
Status code: 200
"中国的首都是北京。"
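The long `%E4%B8%AD...` string in the URL is the percent-encoded UTF-8 form of the Chinese prompt. The standard-library sketch below reproduces the encoding and the matching decoding step. It also illustrates, under the assumption that the server framework has already decoded the query string once, why decoding a second time can corrupt prompts that contain a literal `%`. The `tricky` prompt is a made-up example.

```python
from urllib.parse import urlencode, parse_qs, unquote

# Encoding: non-ASCII characters become %XX escapes of their UTF-8 bytes
query = urlencode({"q": "中国的首都是哪里？"})
print(query)

# Decoding once recovers the original prompt
decoded = parse_qs(query)["q"][0]
print(decoded)

# A prompt that itself contains a percent escape (made-up example)
tricky = "50%25 off"
once = parse_qs(urlencode({"q": tricky}))["q"][0]  # decoded once: intact
twice = unquote(once)                              # decoded twice: altered
print(repr(once), repr(twice))
```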


In [14]:
#### Your Task ####
## Run an LLM single-line query with POST, and add chat history (history stored on the client side only)
import requests
import json

# URL for the chat endpoint
url = "http://localhost:54223/chat"

# Initialize chat history
chat_history = []

def send_chat_message(prompt):
    """Send a message to the LLM API and update chat history"""
    global chat_history
    
    # Prepare the request payload
    payload = {
        "prompt": prompt,
        "history": chat_history
    }
    
    # Send the POST request
    response = requests.post(url, json=payload)
    
    if response.status_code == 200:
        assistant_message = response.json()  # The API returns the response as a string
        
        # Update chat history with user's message and assistant's response
        chat_history.append({"role": "user", "content": prompt})
        chat_history.append({"role": "assistant", "content": assistant_message})
        
        print(f"User: {prompt}")
        print(f"Assistant: {assistant_message}")
        print("-" * 50)
    else:
        print(f"Error: Status code {response.status_code}")
        print(response.text)

# Example usage - send multiple messages to build history
send_chat_message("Hello, who are you?")
send_chat_message("What's the capital of Japan?")
send_chat_message("Can you tell me more about its history?")

# Display the full conversation history
print("Full conversation history:")
for message in chat_history:
    print(f"{message['role'].capitalize()}: {message['content']}")

User: Hello, who are you?
Assistant: Hello! I'm Phi, an AI developed by Microsoft. How can I help you today?
--------------------------------------------------
User: What's the capital of Japan?
Assistant: The capital of Japan is Tokyo. It's not only the political center but also a major cultural and economic hub. If you have any more questions or need further assistance, feel free to ask!
--------------------------------------------------
User: Can you tell me more about its history?
Assistant: Certainly! Tokyo, originally known as Edo, has a rich history that spans over 1,000 years. Here are some key points in its historical timeline:

1. **Early History (Edo Period) (1603-1868)**:
   - In 1603, Tokugawa Ieyasu established Edo as the capital of the Tokugawa shogunate, marking the beginning of the Edo period.
   - Edo became the center of political, economic, and cultural life in Japan. It was a time of relative peace and stability, known as the Pax Tokugawa.
   - The city developed i
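Because the server keeps no state, the client must resend the whole conversation with every request. The sketch below isolates that bookkeeping by replacing the HTTP call with a stub: `fake_llm` is a hypothetical stand-in for the `POST /chat` request and returns a canned reply, so the snippet runs without a server.

```python
def fake_llm(prompt, history):
    """Hypothetical stand-in for the POST /chat call; returns a canned reply."""
    turn = len(history) // 2 + 1  # each completed turn adds two messages
    return f"(reply #{turn} to: {prompt})"

chat_history = []  # lives on the client only

def send(prompt):
    # The full history so far travels with every request
    reply = fake_llm(prompt, chat_history)
    # Record both sides of the turn so the next request includes them
    chat_history.append({"role": "user", "content": prompt})
    chat_history.append({"role": "assistant", "content": reply})
    return reply

send("Hello, who are you?")
send("What's the capital of Japan?")

for message in chat_history:
    print(f"{message['role']}: {message['content']}")
```

Each turn adds two entries, so the payload grows with the length of the conversation.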

## 3 Creating OpenAI-Compatible API server using vLLM

In the previous section, we created a simple API server using FastAPI. However, the OpenAI-style API has become the de facto standard for LLM services, and implementing it by hand is tedious. Luckily, many open-source frameworks provide OpenAI-compatible APIs. In this section, we will use vLLM to create an OpenAI-compatible API server.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It uses a novel GPU memory management technique called "PagedAttention" to enable efficient inference of large models.

vLLM has two modes, Offline Inference and OpenAI-Compatible Server:
- **Offline Inference**: This mode works much like the Hugging Face `transformers` library. You load a model and run inference using vLLM as a library.
- **OpenAI-Compatible Server**: This mode provides endpoints compatible with the OpenAI API, allowing you to run your own LLMs with a similar interface.

In [15]:
%pip install vllm

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


### 3.1 Offline Inference

The offline API is based on the LLM class. To initialize the vLLM engine, create a new instance of LLM and specify the model to run.

The LLM class provides various methods for offline inference. See Engine Arguments for a list of options when initializing the model.

In [16]:
from vllm import LLM

llm = LLM(
    model="/ssdshare/share/model/Qwen3-0.6B-Base",
)

INFO 05-15 15:07:19 [__init__.py:239] Automatically detected platform cuda.
INFO 05-15 15:07:30 [config.py:600] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
INFO 05-15 15:07:31 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-15 15:07:32 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/ssdshare/share/model/Qwen3-0.6B-Base', speculative_config=None, tokenizer='/ssdshare/share/model/Qwen3-0.6B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backe

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 05-15 15:07:41 [loader.py:447] Loading weights took 6.56 seconds
INFO 05-15 15:07:42 [gpu_model_runner.py:1273] Model loading took 1.1103 GiB and 7.613681 seconds
INFO 05-15 15:07:51 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/c193686457/rank_0_0 for vLLM's torch.compile
INFO 05-15 15:07:51 [backends.py:426] Dynamo bytecode transform time: 9.87 s
INFO 05-15 15:08:00 [backends.py:132] Cache the graph of shape None for later use
INFO 05-15 15:08:31 [backends.py:144] Compiling a graph for general shape takes 39.18 s
INFO 05-15 15:08:49 [monitor.py:33] torch.compile takes 49.05 s in total
INFO 05-15 15:08:50 [kv_cache_utils.py:578] GPU KV cache size: 79,584 tokens
INFO 05-15 15:08:50 [kv_cache_utils.py:581] Maximum concurrency for 32,768 tokens per request: 2.43x
INFO 05-15 15:09:19 [gpu_model_runner.py:1608] Graph capturing finished in 29 secs, took 0.48 GiB
INFO 05-15 15:09:19 [core.py:162] init engine (profile, create kv cache, warmup model) took 

In vLLM, generative models implement the VllmModelForTextGeneration interface. Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, which are then passed through Sampler to obtain the final text.

The `generate` method is available to all generative models in vLLM. It is similar to its counterpart in HF Transformers, except that tokenization and detokenization are also performed automatically.


In [17]:
outputs = llm.generate("Which city is the capital of China?")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")



Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  4.89it/s, est. speed input: 39.44 toks/s, output: 88.72 toks/s]

Prompt: 'Which city is the capital of China?', Generated text: ' Beijing\n\nCapitalize this past sentence correctly. The capital city of China is Beijing.'





You can optionally control the language generation by passing SamplingParams.

In [18]:
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.7,
    max_tokens=128,
)
outputs = llm.generate("Which city is the capital of China?", params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.08s/it, est. speed input: 7.41 toks/s, output: 118.55 toks/s]

Prompt: 'Which city is the capital of China?', Generated text: ' Also, what are the capitals of India, Nepal, and Bhutan? Additionally, can you tell me about the capital of Bhutan, the currency used in Bhutan, and the currency of Japan and South Korea? Furthermore, which country has the highest population, and which country has the highest literacy rate? Who is the current President of India, and which leader of Bhutan has won the Nobel Peace Prize? Lastly, what is the capital of the United Kingdom, and which country has the highest literacy rate?\n\nThe capital of China is Beijing. The capital of India is Delhi, the capital of Nepal is Kathmandu, and the'





The chat method implements chat functionality on top of generate. In particular, it accepts input similar to OpenAI Chat Completions API and automatically applies the model’s chat template to format the prompt.

In general, only instruction-tuned models have a chat template. Base models may perform poorly as they are not trained to respond to the chat conversation.
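To give a feel for what "applying a chat template" means, here is a simplified sketch that formats a conversation into a single prompt string. It imitates the Phi-4-mini prompt format that vLLM prints for the chat example in this section. The function `apply_simple_template` is hypothetical: real chat templates ship with the model's tokenizer and are more elaborate than this one-liner.

```python
def apply_simple_template(messages):
    """Simplified, hypothetical imitation of a chat template.

    Each message becomes <|role|>content<|end|>, and a trailing
    <|assistant|> tag tells the model that it is its turn to speak.
    """
    prompt = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
    return prompt + "<|assistant|>"

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
]
print(apply_simple_template(conversation))
```

A base model has never seen these special tags during training, which is one reason it tends to respond poorly to chat-formatted prompts.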

In [1]:
from vllm import LLM
import gc

# terminate the previous LLM instance to free up memory
llm = None
gc.collect()
llm = LLM(
    model="/ssdshare/share/model/Phi-4-mini-instruct",
    max_model_len=8192,
    max_num_seqs=1,
    gpu_memory_utilization=0.5,
)


INFO 05-15 15:46:32 [__init__.py:239] Automatically detected platform cuda.
INFO 05-15 15:46:34 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 05-15 15:46:43 [config.py:600] This model supports multiple tasks: {'generate', 'reward', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
INFO 05-15 15:46:43 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-15 15:46:45 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/ssdshare/share/model/Phi-4-mini-instruct', speculative_config=None, tokenizer='/ssdshare/share/model/Phi-4-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 05-15 15:46:48 [loader.py:447] Loading weights took 1.50 seconds
INFO 05-15 15:46:48 [gpu_model_runner.py:1273] Model loading took 7.1694 GiB and 1.797707 seconds
INFO 05-15 15:46:55 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f3846216f7/rank_0_0 for vLLM's torch.compile
INFO 05-15 15:46:55 [backends.py:426] Dynamo bytecode transform time: 7.31 s
INFO 05-15 15:46:56 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 05-15 15:47:03 [monitor.py:33] torch.compile takes 7.31 s in total
INFO 05-15 15:47:04 [kv_cache_utils.py:578] GPU KV cache size: 126,576 tokens
INFO 05-15 15:47:04 [kv_cache_utils.py:581] Maximum concurrency for 8,192 tokens per request: 15.45x
INFO 05-15 15:47:26 [gpu_model_runner.py:1608] Graph capturing finished in 22 secs, took 0.54 GiB
INFO 05-15 15:47:26 [core.py:162] init engine (profile, create kv cache, warmup model) took 37.99 seconds


In [6]:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an long essay about the importance of higher education.",
    },
]
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
)
outputs = llm.chat(conversation, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\n\nGenerated text: {generated_text!r}")

INFO 05-15 15:49:23 [chat_utils.py:396] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.


Processed prompts: 100%|██████████| 1/1 [00:10<00:00, 10.51s/it, est. speed input: 3.33 toks/s, output: 97.43 toks/s]

Prompt: '<|system|>You are a helpful assistant<|end|><|user|>Hello<|end|><|assistant|>Hello! How can I assist you today?<|end|><|user|>Write an long essay about the importance of higher education.<|end|><|assistant|>',

Generated text: "The Importance of Higher Education\n\nHigher education, often synonymous with college or university-level education, is a critical component of personal and societal development. As we navigate an ever-evolving global landscape, the significance of higher education has grown exponentially. This essay will explore the multifaceted importance of higher education, highlighting its benefits for individuals, society, and the global economy.\n\n**Individual Benefits of Higher Education**\n\nHigher education offers numerous benefits for individuals, which can be broadly categorized into personal, professional, and social dimensions. \n\n1. **Personal Development**: Higher education fosters critical thinking, creativity, and problem-solving skills. These cognit




In [7]:
llm = None
gc.collect()

1048

### 3.2 OpenAI-Compatible Server

You can start the server with the `vllm serve` command. By default it listens on port 8000:

In [None]:
# run it in your terminal
# vllm serve /ssdshare/share/model/Phi-4-mini-instruct --dtype auto --api-key token-abc123 --max-model-len 16384

Now you can use the OpenAI Python package to access the endpoint:

In [8]:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="/ssdshare/share/model/Phi-4-mini-instruct",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message.content)

Hello! How can I assist you today?
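Under the hood, the OpenAI client sends an ordinary HTTP POST request, so the same call could be made with `requests`. The sketch below only assembles the pieces and does not contact any server. The `sample_response` dictionary is hand-written for illustration, trimmed to the fields the example reads. It is not a captured server log.

```python
import json

base_url = "http://localhost:8000/v1"
api_key = "token-abc123"

# The chat completions endpoint lives under the base URL
endpoint = f"{base_url}/chat/completions"

# The API key travels as a bearer token
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}

# The request body is plain JSON: model name plus the message list
payload = {
    "model": "/ssdshare/share/model/Phi-4-mini-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}
body = json.dumps(payload)
print(endpoint)
print(body)

# Hand-written illustration of the response shape (other fields omitted)
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I assist you today?"},
            "finish_reason": "stop",
        }
    ]
}
reply = sample_response["choices"][0]["message"]["content"]
print(reply)
```

With a running server, `requests.post(endpoint, headers=headers, data=body)` would send this request, and `response.json()` would return a dictionary of this general shape.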


vLLM provides a set of endpoints that are compatible with the OpenAI API, such as completions, chat completions, and embeddings. You can find the full list of endpoints in the vLLM documentation.

vLLM also provides metrics endpoints that can be used to monitor the state and performance of the server. Two important metrics are TTFT and TPOT.

TTFT (time to first token) is the time it takes to generate the first token of the response. TPOT (time per output token) is the average time it takes to generate each subsequent token. Service Level Objectives (SLOs) for LLM serving are commonly stated in terms of these two metrics, and vLLM puts considerable effort into optimizing them.
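As a rough sketch of how these two numbers could be computed on the client side, the function below takes the time a request was sent and the arrival time of each streamed token. The function name `latency_metrics` and the timestamps are hypothetical, and TPOT is taken here as the average gap between tokens after the first one, which is an assumption about the definition. vLLM's own metrics may differ in detail.

```python
def latency_metrics(request_time, token_times):
    """Return (TTFT, TPOT) in seconds from token arrival timestamps.

    Assumes TPOT is the mean gap between consecutive tokens,
    excluding the wait for the first token.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_time
    n = len(token_times)
    # With a single token there are no gaps to average
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot

# Made-up timestamps in seconds: request at 0.0, four tokens arriving later
ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT = {ttft:.2f} s, TPOT = {tpot:.2f} s")
```

In this made-up trace the user waits half a second for the first token, then sees a new token every quarter of a second.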

## 4 Adding a Web User Interface using `gradio`

Demonstrating a machine learning application is important: it gives users a direct, interactive experience of your algorithm. Here we'll build a demo using `gradio`, a popular Python library for ML demos. Let's install this library.

### 4.1 Basic Gradio

In [10]:
%pip install gradio --upgrade

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
Collecting gradio
  Downloading https://mirrors4.bfsu.edu.cn/pypi/web/packages/e2/28/f36287c69c2944c13944e2fc6752a9e673cb4841d2e1217f078c3f403a0b/gradio-5.29.1-py3-none-any.whl (54.1 MB)
Collecting gradio-client==1.10.1 (from gradio)
  Downloading https://mirrors4.bfsu.edu.cn/pypi/web/packages/55/6f/03eb8e0e0ec80eced5ed35a63376dabfc7391b1538502f8e85e9dc5bab02/gradio_client-1.10.1-py3-none-any.whl (323 kB)
Installing collected packages: gradio-client, gradio
  Attempting uninstall: gradio-client
    Found existing installation: gradio_client 1.8.0
    Uninstalling gradio_client-1.8.0:
      Successfully uninstalled gradio_client-1.8.0
  Attempting uninstall: gradio
    Found existing installation: gradio 5.23.3
    Uninstalling gradio-5.23.3:
      Successf

Now we can write an example UI that takes in a text string and outputs a processed string.

In [12]:
%%file /tmp/gradio_example.py

import gradio as gr

def greet(name, intensity):
    return "Hello, hello " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "slider"],
    outputs=["text"],
)

demo.launch()


Overwriting /tmp/gradio_example.py
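Since the function behind a Gradio interface is ordinary Python, it can be checked directly before any UI is launched. This small sketch calls `greet` by hand. It assumes the slider value may arrive as a float, which is why the function wraps it in `int()`.

```python
def greet(name, intensity):
    # "!" is repeated int(intensity) times; the cast truncates floats such as 2.7
    return "Hello, hello " + name + "!" * int(intensity)

print(greet("Ada", 3))
print(greet("Ada", 2.7))
```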


In [None]:
# Start the gradio server by running the following command

# python /tmp/gradio_example.py

In [14]:
%%file /tmp/gradio_example.py
## Add the port forwarding (port 7860 by default), and you can open http://localhost:7860 in your browser

## Try changing the last line (launch) to

## demo.launch(share=True)
## then observe the output and open the public link it prints (no port forwarding needed)

import gradio as gr

def greet(name, intensity):
    return "Hello, hello " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "slider"],
    outputs=["text"],
)

demo.launch(share=True)

Overwriting /tmp/gradio_example.py


### 4.2 The ChatInterface

In [15]:
%%file /tmp/gradio_example.py

import random

def random_response(message, history):
    return random.choice(["Yes", "No"])

import gradio as gr
gr.ChatInterface(random_response).launch()

Overwriting /tmp/gradio_example.py


In [None]:
# Kill your previous process, and restart the new process

# python /tmp/gradio_example.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 automatically. 

### 4.3 Quick and dirty way of creating a UI for a HuggingFace pipeline

In [16]:
%%file /tmp/simpleui.py

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import gradio as gr

model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.6,
    do_sample=True,
    return_full_text=False,
    max_new_tokens=500,
) 
gr.Interface.from_pipeline(pipe).launch(debug=True)

Writing /tmp/simpleui.py


In [None]:
# python /tmp/simpleui.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 or 7862 automatically. 

### 4.4 A better way to build a web UI for LLM (through an LLM API server)

Next, implement a script that interacts with the Phi-4-mini chat API server you just created.

Note that the script should call the API server directly with `requests`, rather than running the LLM inside the UI server process.

![Illustration of request](./assets/request.jpg)

In [17]:
%%file /tmp/chatUI.py

import gradio as gr
import requests
import json


#### Your Task ####
# Insert code here to perform the inference
# You can use either the hand-crafted API server or the OpenAI-compatible vLLM server
#### End Task ####
def predict(message, history):
    # Convert Gradio history format to API history format
    api_history = []
    
    for user_msg, assistant_msg in history:
        api_history.append({"role": "user", "content": user_msg})
        api_history.append({"role": "assistant", "content": assistant_msg})
    
    # Prepare the request payload
    payload = {
        "prompt": message,
        "history": api_history
    }
    
    # Send POST request to our API server
    try:
        response = requests.post(
            "http://localhost:54223/chat", 
            json=payload,
            timeout=60  # Set a reasonable timeout
        )
        
        if response.status_code == 200:
            # Decode the JSON body (assumes the /chat endpoint returns the reply as a JSON string)
            return response.json()
        else:
            return f"Error: API returned status code {response.status_code}"
    except requests.exceptions.RequestException as e:
        return f"Error connecting to API server: {str(e)}"

gr.ChatInterface(predict).launch()

Writing /tmp/chatUI.py


In [None]:
## Do not forget to start your API server (from above, use the /chat API or use the vLLM)

In [None]:
## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 or 7862 automatically. 
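
The task comment in `chatUI.py` also allows the OpenAI-compatible vLLM server. A minimal sketch of that variant follows, assuming the server exposes the standard `/v1/chat/completions` route. The URL, port, and model name are placeholders that must match however the server was started, and the helper names (`to_messages`, `build_payload`, `extract_reply`, `predict_openai`) are made up for illustration. The history conversion accepts both the pair-style history used above and the role/content dictionaries that newer Gradio versions can pass.

```python
# Sketch: calling an OpenAI-compatible chat endpoint instead of the custom /chat API.
# VLLM_URL and MODEL_NAME are placeholders, not values taken from this lab.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "phi-4-mini"


def to_messages(message, history):
    """Convert Gradio chat history plus the new user message into OpenAI-style messages."""
    messages = []
    for item in history:
        if isinstance(item, dict):
            # "messages" format: {"role": ..., "content": ...}
            messages.append({"role": item["role"], "content": item["content"]})
        else:
            # pair format: (user_msg, assistant_msg)
            user_msg, assistant_msg = item
            messages.append({"role": "user", "content": user_msg})
            if assistant_msg is not None:
                messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})
    return messages


def build_payload(message, history, max_tokens=500):
    """Assemble the JSON body of a chat-completions request."""
    return {
        "model": MODEL_NAME,
        "messages": to_messages(message, history),
        "max_tokens": max_tokens,
    }


def extract_reply(body):
    """Pull the assistant text out of a chat-completions response body."""
    return body["choices"][0]["message"]["content"]


def predict_openai(message, history):
    import requests  # imported here so the helpers above stay dependency-free

    response = requests.post(VLLM_URL, json=build_payload(message, history), timeout=60)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error string
    return extract_reply(response.json())
```

To try it, pass the function to the chat UI in place of `predict`: `gr.ChatInterface(predict_openai).launch()`.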

### 4.5 More Gradio: Streaming and Multi-media

Gradio also supports streaming and multimedia input and output.

The magic comes from the `streaming=True` parameter of the input component.

In [18]:
%%file /tmp/transcribe.py

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import numpy as np
import gradio as gr

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_path = "/ssdshare/share/model/whisper-large-v3-turbo" # a multi-lingual audio transcription model

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_path)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)


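def transcribe(stream, new_chunk):
    sr, y = new_chunk

    # Convert to mono if stereo
    if y.ndim > 1:
        y = y.mean(axis=1)

    # Normalize to [-1, 1]; skip silent chunks to avoid dividing by zero
    y = y.astype(np.float32)
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak

    # Append the new chunk to the audio accumulated so far (kept in Gradio state)
    if stream is not None:
        stream = np.concatenate([stream, y])
    else:
        stream = y

    # Decoding options go in as pipeline arguments; extra keys placed inside
    # the input dict are not treated as options
    result = pipe(
        {"sampling_rate": sr, "raw": stream},
        return_timestamps=True,
        generate_kwargs={"task": "transcribe", "language": "chinese"},
    )
    return stream, result["text"]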


demo = gr.Interface(
    transcribe,
    ["state", gr.Audio(sources=["microphone"], streaming=True)], # note the streaming=True
    ["state", "text"],
    live=True,
    time_limit=30,
    stream_every=0.5,
)

demo.launch()

Writing /tmp/transcribe.py


Try it!

In [None]:
## python /tmp/transcribe.py
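
Streaming is not limited to audio input. `gr.ChatInterface` also accepts a generator function: each yielded value replaces the message currently shown, so the reply appears to grow as it is produced. Below is a small sketch with a made-up `stream_reply` function that simulates a slow model. The delay constant and the echoed text are purely illustrative.

```python
import time

CHUNK_DELAY_S = 0.05  # illustrative pause standing in for model latency


def stream_reply(message, history):
    """Yield a growing prefix of the reply; the UI redraws the message on every yield."""
    reply = "You said: " + message
    partial = ""
    for word in reply.split(" "):
        partial = word if not partial else partial + " " + word
        time.sleep(CHUNK_DELAY_S)  # stand-in for waiting on the next chunk
        yield partial


def launch_demo():
    import gradio as gr  # lazy import: stream_reply itself has no Gradio dependency

    gr.ChatInterface(stream_reply).launch()
```

Calling `launch_demo()` starts the UI. In a real application the loop body would forward chunks received from a streaming API response instead of splitting a fixed string.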

### 4.6 Build your own Gradio UI

Create a separate Gradio UI to serve other models, for example an image model from Lab 5, or a translation model combined with a transcription model. Explore the Gradio documentation and the HuggingFace model cards for ideas.

In [None]:
%%file /tmp/speech_translator.py
import torch
import numpy as np
import gradio as gr
from transformers import (
    AutoModelForSpeechSeq2Seq, 
    AutoProcessor, 
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline
)
import os
import traceback

# Set up device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Device set to use {device}")
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Language code mapping for Whisper
LANG_CODES = {
    "English": "en",
    "Chinese": "zh",
    "Japanese": "ja",
    "Spanish": "es",
    "French": "fr",
    "German": "de"
}

try:
    # Load speech recognition model
    print("Loading speech recognition model...")
    speech_model_path = "/ssdshare/share/model/whisper-large-v3-turbo"
    speech_model = AutoModelForSpeechSeq2Seq.from_pretrained(
        speech_model_path, 
        torch_dtype=torch_dtype, 
        low_cpu_mem_usage=True, 
        use_safetensors=True
    )
    speech_model.to(device)
    speech_processor = AutoProcessor.from_pretrained(speech_model_path)

    speech_pipe = pipeline(
        "automatic-speech-recognition",
        model=speech_model,
        tokenizer=speech_processor.tokenizer,
        feature_extractor=speech_processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )
    print("Speech model loaded successfully!")

    # Load translation model
    print("Loading translation model...")
    model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map=device,  # follow the CPU/GPU choice made above
        torch_dtype="auto",
        trust_remote_code=True,
    )

    translation_pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=300,
        temperature=0.3,
        do_sample=True,
        return_full_text=False
    )
    print("Translation model loaded successfully!")
except Exception as e:
    print(f"Error loading models: {e}")
    print(traceback.format_exc())
    raise e

# Define source and target languages
LANGUAGES = ["English", "Chinese", "Japanese", "Spanish", "French", "German"]

def process_audio(audio, source_lang, target_lang):
    try:
        # Check if audio input exists
        if audio is None:
            return "Please record or upload audio first.", "No audio input detected."
            
        # Debug the audio input structure
        print(f"Audio input type: {type(audio)}")
        print(f"Audio input: {audio}")
        
        # Handle different audio input formats
        if isinstance(audio, tuple) and len(audio) == 2:
            # Expected format: (sample_rate, audio_data)
            sr, y = audio
        elif isinstance(audio, str):
            # Path to audio file
            import soundfile as sf
            y, sr = sf.read(audio)
        else:
            # Try to handle other formats or print detailed error
            return f"Unsupported audio format: {type(audio)}", "Audio processing failed."
        
        # Convert to mono if stereo
        if y.ndim > 1:
            y = y.mean(axis=1)
        
        # Normalize audio
        y = y.astype(np.float32)
        if np.max(np.abs(y)) > 0:
            y /= np.max(np.abs(y))
        
        # Get language code for Whisper
        lang_code = LANG_CODES.get(source_lang, "en")
        
        print(f"Transcribing audio in {source_lang} ({lang_code})...")
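        # Transcribe with Whisper; task and language are generation options,
        # so they are passed through generate_kwargs
        transcription_result = speech_pipe(
            {"sampling_rate": sr, "raw": y},
            return_timestamps=False,
            generate_kwargs={"task": "transcribe", "language": lang_code},
        )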
        
        transcribed_text = transcription_result["text"].strip()
        print(f"Transcription result: {transcribed_text}")
        
        if not transcribed_text:
            return "Couldn't detect any speech in the recording.", "No speech detected."
        
        # Skip translation if source and target languages are the same
        if source_lang == target_lang:
            return transcribed_text, transcribed_text
        
        print(f"Translating from {source_lang} to {target_lang}...")
        # Create a translation prompt for the LLM
        translation_prompt = f"Translate the following text from {source_lang} to {target_lang}: '{transcribed_text}'"
        
        # Get translation from LLM
        translation_result = translation_pipe(translation_prompt)
        translated_text = translation_result[0]["generated_text"].strip()
        
        # Clean up the translation output to extract just the translated content
        if ":" in translated_text:
            translated_text = translated_text.split(":", 1)[1].strip()
        if translated_text.startswith("'") and translated_text.endswith("'"):
            translated_text = translated_text[1:-1]
            
        print(f"Translation result: {translated_text}")
        return transcribed_text, translated_text
        
    except Exception as e:
        print(f"Error in process_audio: {e}")
        print(traceback.format_exc())
        return f"Error: {str(e)}", f"Translation failed: {str(e)}"

# Create the Gradio interface
with gr.Blocks(title="Speech Translator") as demo:
    gr.Markdown("# 🎙️ Speech-to-Text Translation 🌐")
    gr.Markdown("Record your speech, and it will be transcribed and translated to your target language.")
    
    with gr.Row():
        source_lang = gr.Dropdown(choices=LANGUAGES, value="English", label="Source Language")
        target_lang = gr.Dropdown(choices=LANGUAGES, value="Chinese", label="Target Language")
    
    with gr.Row():
        # Recent Gradio versions take a list of `sources`; type="numpy" yields (sample_rate, data)
        audio_input = gr.Audio(sources=["microphone"], type="numpy", label="Record Speech")
    
    with gr.Row():
        transcribe_btn = gr.Button("Transcribe & Translate")
    
    with gr.Row():
        with gr.Column():
            transcription_output = gr.Textbox(label="Transcription", lines=5)
        with gr.Column():
            translation_output = gr.Textbox(label="Translation", lines=5)
    
    # Handle the button click
    transcribe_btn.click(
        fn=process_audio,
        inputs=[audio_input, source_lang, target_lang],
        outputs=[transcription_output, translation_output]
    )

    # Add debug info area
    with gr.Accordion("Debug Information", open=False):
        debug_info = gr.Textbox(label="System Info", value=f"Device: {device}, PyTorch: {torch.__version__}", interactive=False)

# Launch the application with specific server configurations
demo.launch(
    server_name="0.0.0.0",
    share=True
)

Overwriting /tmp/speech_translator.py
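
One fragile spot in `process_audio` is the clean-up step: splitting on the first colon also truncates translations that legitimately contain a colon (a time such as "3:00", for instance). A more conservative variant is sketched below. It only strips a leading label from a short allow-list and a single pair of matching quotes. The function name `clean_translation` and the label list are illustrative assumptions, not part of the original script.

```python
KNOWN_LABELS = ("translation:", "translated text:")  # illustrative; extend as needed
QUOTE_PAIRS = (("'", "'"), ('"', '"'), ("“", "”"), ("「", "」"))


def clean_translation(text):
    """Strip a known leading label and one pair of surrounding quotes, nothing else."""
    text = text.strip()
    for label in KNOWN_LABELS:
        # Compare case-insensitively against the start of the text only
        if text[: len(label)].lower() == label:
            text = text[len(label):].strip()
            break
    for opening, closing in QUOTE_PAIRS:
        if len(text) >= 2 and text.startswith(opening) and text.endswith(closing):
            text = text[1:-1].strip()
            break
    return text
```

It could replace the two `if` statements that post-process `translated_text`, for example `translated_text = clean_translation(translation_result[0]["generated_text"])`.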
