# Quick Start: Sending Requests

This notebook is a quick-start guide to launching an SGLang server and sending requests to it after installation.

<!-- ## Launch a server

This code block is equivalent to executing 

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
```

in your command line and waiting for the server to be ready. -->

In [1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
"""
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
"""
)

wait_for_server("http://localhost:30000")

[2024-11-01 21:11:18] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=991460189, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_i
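
The launch cell waits until the server is ready before continuing. As a rough illustration of that kind of readiness check, the sketch below polls a probe function until it succeeds or a deadline passes. It is not SGLang's actual `wait_for_server` implementation; the names `http_probe` and `wait_until_ready` are hypothetical, and the defaults are arbitrary.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` succeeds with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0
) -> None:
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while not probe():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server not ready after {timeout} seconds")
        time.sleep(interval)
```

For the server in this notebook, a call might look like `wait_until_ready(lambda: http_probe("http://localhost:30000/v1/models"))`, assuming the server answers GET requests on that route. Passing the probe as a function keeps the polling loop independent of any particular endpoint.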

## Send a Request

Once the server is running, you can send test requests with curl or the Python `requests` library. The server exposes an [OpenAI-compatible API](https://platform.openai.com/docs/api-reference/chat).

### Using curl

In [2]:
import subprocess, json

curl_command = """
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer None" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is a LLM?"}]}'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)

[2024-11-01 21:11:43 TP0] Prefill batch. #new-seq: 1, #new-token: 46, #cached-token: 1, cache hit rate: 1.85%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-01 21:11:44 TP0] Decode batch. #running-req: 1, #token: 80, token usage: 0.00, gen throughput (token/s): 5.80, #queue-req: 0
[2024-11-01 21:11:45 TP0] Decode batch. #running-req: 1, #token: 120, token usage: 0.00, gen throughput (token/s): 42.46, #queue-req: 0
[2024-11-01 21:11:46 TP0] Decode batch. #running-req: 1, #token: 160, token usage: 0.00, gen throughput (token/s): 42.40, #queue-req: 0
[2024-11-01 21:11:47 TP0] Decode batch. #running-req: 1, #token: 200, token usage: 0.00, gen throughput (token/s): 42.38, #queue-req: 0
[2024-11-01 21:11:48 TP0] Decode batch. #running-req: 1, #token: 240, token usage: 0.00, gen throughput (token/s): 42.38, #queue-req: 0
[2024-11-01 21:11:49 TP0] Decode batch. #running-req: 1, #token: 280, token usage: 0.00, gen throughput (token/s): 42.34, #queue-req: 0
[2024-11-01 21:11:50 TP0]

### Using requests

In [3]:
import requests

url = "http://localhost:30000/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer None"
}
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is a LLM?"}
    ]
}

response = requests.post(url, headers=headers, json=data)
print_highlight(response.json())

[2024-11-01 21:11:54 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 46, cache hit rate: 46.53%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-01 21:11:55 TP0] Decode batch. #running-req: 1, #token: 83, token usage: 0.00, gen throughput (token/s): 40.46, #queue-req: 0
[2024-11-01 21:11:56 TP0] Decode batch. #running-req: 1, #token: 123, token usage: 0.00, gen throughput (token/s): 42.49, #queue-req: 0
[2024-11-01 21:11:57 TP0] Decode batch. #running-req: 1, #token: 163, token usage: 0.00, gen throughput (token/s): 42.39, #queue-req: 0
[2024-11-01 21:11:58 TP0] Decode batch. #running-req: 1, #token: 203, token usage: 0.00, gen throughput (token/s): 42.38, #queue-req: 0
[2024-11-01 21:11:59 TP0] Decode batch. #running-req: 1, #token: 243, token usage: 0.00, gen throughput (token/s): 42.34, #queue-req: 0
[2024-11-01 21:12:00 TP0] Decode batch. #running-req: 1, #token: 283, token usage: 0.00, gen throughput (token/s): 42.32, #queue-req: 0
[2024-11-01 21:12:01 TP
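
Both the curl and `requests` cells print the whole JSON response. If only the generated text is needed, it can be read from the first entry of `choices`, following the OpenAI chat completions response layout. The helper below is a minimal sketch; `extract_reply` is a hypothetical name, not part of SGLang.

```python
from typing import Any, Dict


def extract_reply(completion: Dict[str, Any]) -> str:
    """Return the assistant text from the first choice of a chat completion."""
    choices = completion.get("choices") or []
    if not choices:
        # A payload without choices (for example an error body) has no reply text.
        raise ValueError(f"no choices in response: {completion!r}")
    return choices[0]["message"]["content"]
```

With the cell above, this could be used as `print(extract_reply(response.json()))`.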

## Using the OpenAI Python Client

You can also send requests with the OpenAI Python library by pointing its `base_url` at the local server.

In [4]:
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)

[2024-11-01 21:12:03 TP0] Prefill batch. #new-seq: 1, #new-token: 20, #cached-token: 29, cache hit rate: 50.67%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-01 21:12:03 TP0] Decode batch. #running-req: 1, #token: 57, token usage: 0.00, gen throughput (token/s): 29.21, #queue-req: 0
[2024-11-01 21:12:04] INFO:     127.0.0.1:35420 - "POST /v1/chat/completions HTTP/1.1" 200 OK
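
Because the request above sets `max_tokens=64`, the reply may be cut off before the model finishes. The sketch below reads the text and the `finish_reason` field from the response object returned by the OpenAI client; `reply_and_truncation` is a hypothetical helper name used only for illustration.

```python
from typing import Any, Tuple


def reply_and_truncation(response: Any) -> Tuple[str, bool]:
    """Return (assistant text, whether generation stopped at the token limit)."""
    choice = response.choices[0]
    # finish_reason is "length" when max_tokens cut the output short,
    # and "stop" when the model ended the message on its own.
    return choice.message.content, choice.finish_reason == "length"
```

For example, `text, truncated = reply_and_truncation(response)` gives the generated text and a flag indicating whether raising `max_tokens` might be worthwhile.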


In [5]:
terminate_process(server_process)
