# Quick Start: Launch a Server and Send Requests

This notebook provides a quick-start guide for using SGLang after installation.

## Launch a Server

The code cell below is equivalent to running the following command in your terminal and waiting for the server to be ready:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
```

```python
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
"""
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
"""
)

wait_for_server("http://localhost:30000")
```

2024-11-02 00:27:25.383621: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-02 00:27:25.396224: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-02 00:27:25.396257: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-11-02 00:27:34] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-
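
Waiting for readiness amounts to polling the server until it answers. The sketch below illustrates that idea with the standard library only. It is not the implementation of `wait_for_server`; the helper names `http_ok` and `wait_until_ready` are hypothetical, and the `/v1/models` path in the usage comment is an assumption based on the OpenAI-compatible API.

```python
import time
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so connection
        # refusals and non-2xx answers both land here.
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 1.0) -> bool:
    """Call `probe()` until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Usage against a live server (the endpoint path is an assumption):
# ready = wait_until_ready(lambda: http_ok("http://localhost:30000/v1/models"))
```

Passing the probe in as a callable keeps the retry loop independent of any particular endpoint, so the same loop can be reused with a different health check.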

## Send a Request

Once the server is up, you can send test requests using curl. The server implements an [OpenAI-compatible API](https://platform.openai.com/docs/api-reference/).

```python
import subprocess

curl_command = """
curl http://localhost:30000/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -H "Authorization: Bearer None" \\
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is an LLM? Tell me in one sentence."
      }
    ]
  }'
"""

response = subprocess.check_output(curl_command, shell=True).decode('utf-8')

print_highlight(response)
```

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   278    0     0  100   278      0   1387 --:--:-- --:--:-- --:--:--  1383

[2024-11-02 00:28:48 TP0] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 42, cache hit rate: 40.19%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-02 00:28:48 TP0] Decode batch. #running-req: 1, #token: 75, token usage: 0.00, gen throughput (token/s): 1.46, #queue-req: 0
[2024-11-02 00:28:49] INFO:     127.0.0.1:53714 - "POST /v1/chat/completions HTTP/1.1" 200 OK


100   871  100   593  100   278   1788    838 --:--:-- --:--:-- --:--:--  2623
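
The string returned by curl is a JSON body, so the assistant's reply can be pulled out with `json.loads`. The sketch below assumes the body follows the OpenAI chat-completions shape (`choices[0].message.content`). The helper name `extract_reply` is hypothetical, and the sample body is hand-written for illustration, not output captured from the server above.

```python
import json


def extract_reply(raw: str) -> str:
    """Return the assistant text from a chat-completions JSON body."""
    body = json.loads(raw)
    return body["choices"][0]["message"]["content"]


# Hand-written sample in the OpenAI chat-completions shape (illustrative only).
sample = json.dumps(
    {
        "object": "chat.completion",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "An example reply."},
                "finish_reason": "stop",
            }
        ],
    }
)

print(extract_reply(sample))
```

With the cell above, `extract_reply(response)` would apply the same lookup to the real body. Adding `-s` to the curl command silences the progress meter that appears in the cell output.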


## Using OpenAI Python Client

You can also send requests with the OpenAI Python library by pointing its `base_url` at the local server.

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(response)
```

[2024-11-02 00:03:52 TP0] Prefill batch. #new-seq: 1, #new-token: 20, #cached-token: 29, cache hit rate: 29.13%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-02 00:03:52 TP0] Decode batch. #running-req: 1, #token: 65, token usage: 0.00, gen throughput (token/s): 11.33, #queue-req: 0
[2024-11-02 00:03:53] INFO:     127.0.0.1:57008 - "POST /v1/chat/completions HTTP/1.1" 200 OK
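
The OpenAI client can also return a reply incrementally when `stream=True` is passed to `chat.completions.create`. Each chunk then carries a text fragment in `choices[0].delta.content`. The sketch below shows one way to join those fragments. It assumes the server honors streaming through its OpenAI-compatible endpoint; `collect_stream` and `fake_chunk` are hypothetical helpers, and the demo chunks are stand-ins built locally so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text fragments carried by streamed chat-completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices
        fragment = chunk.choices[0].delta.content
        if fragment:  # content can be None, for example on the final chunk
            parts.append(fragment)
    return "".join(parts)


def fake_chunk(text):
    """Build a local stand-in with the same attribute layout as a stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
print(collect_stream(demo))
```

Against a live server, the iterator returned by `client.chat.completions.create(..., stream=True)` could be passed to `collect_stream` in place of `demo`.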


## Using the Native Generation API

You can also use the native `/generate` endpoint, which provides more flexibility.
See the [sampling parameters reference](references/sampling_params.html) for the available options.

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)

print_highlight(response.json())
```

[2024-11-02 00:05:04 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 33.04%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-02 00:05:04 TP0] Decode batch. #running-req: 1, #token: 26, token usage: 0.00, gen throughput (token/s): 3.10, #queue-req: 0
[2024-11-02 00:05:04] INFO:     127.0.0.1:60536 - "POST /generate HTTP/1.1" 200 OK
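
When you send many `/generate` requests, it can help to build the request body in one place. The sketch below wraps the two fields used above, `text` and `sampling_params`, in a small helper. The name `build_generate_payload` is hypothetical and the range checks are an illustrative choice, not rules taken from the server. Any extra sampling key you pass is forwarded unchanged, so check it against the sampling parameters reference first.

```python
def build_generate_payload(
    prompt: str,
    temperature: float = 0.0,
    max_new_tokens: int = 32,
    **extra_sampling_params,
) -> dict:
    """Build a JSON-serializable body for the native /generate endpoint."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")

    sampling_params = {
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }
    # Extra keys are passed through as-is; their validity is not checked here.
    sampling_params.update(extra_sampling_params)
    return {"text": prompt, "sampling_params": sampling_params}


payload = build_generate_payload("The capital of France is")
print(payload)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, exactly as in the cell above.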


```python
terminate_process(server_process)
```