# Quick Start: Launch A Server and Send Requests

This guide shows how to launch an SGLang server and send requests to it once SGLang is installed.

## Launch a Server

The code cell below is equivalent to running the following command in your terminal and waiting for the server to be ready:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
```

In [1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)


server_process = execute_shell_command(
    """
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
"""
)

wait_for_server("http://localhost:30000")

[2024-10-30 07:43:57] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=12900532, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, 

[2024-10-30 07:44:13 TP0] Init torch distributed begin.


[2024-10-30 07:44:13 TP0] Load weight begin. avail mem=78.59 GB


[2024-10-30 07:44:13 TP0] lm_eval is not installed, GPTQ may not be usable


INFO 10-30 07:44:14 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.45it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.33it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.22it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.61it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.48it/s]

[2024-10-30 07:44:17 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.50 GB
[2024-10-30 07:44:17 TP0] Memory pool end. avail mem=8.37 GB
[2024-10-30 07:44:17 TP0] Capture cuda graph begin. This can take up to several minutes.


[2024-10-30 07:44:26 TP0] max_total_num_tokens=442913, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072


[2024-10-30 07:44:26] INFO:     Started server process [1242702]
[2024-10-30 07:44:26] INFO:     Waiting for application startup.
[2024-10-30 07:44:26] INFO:     Application startup complete.
[2024-10-30 07:44:26] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2024-10-30 07:44:26] INFO:     127.0.0.1:46936 - "GET /v1/models HTTP/1.1" 200 OK
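
To make the waiting step concrete, here is a minimal polling sketch built only on the standard library. It is not the implementation of `wait_for_server`; the names `http_ok` and `wait_until_ready` are hypothetical, and probing `/v1/models` is an assumption based on the `GET /v1/models` line in the log above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    base_url: str,
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> None:
    """Poll `<base_url>/v1/models` until the probe succeeds or time runs out."""
    # Assumption: /v1/models is used as the readiness endpoint.
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server at {base_url} not ready after {timeout_s}s")
        time.sleep(interval_s)
```

Calling `wait_until_ready("http://localhost:30000")` would block until the endpoint responds. The `probe` parameter exists only so the loop can be exercised without a running server.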


## Send a Request

Once the server is running, you can send test requests using curl.

In [2]:
!curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer None" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is a LLM?"}]}'

[2024-10-30 07:44:26 TP0] Prefill batch. #new-seq: 1, #new-token: 47, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2024-10-30 07:44:27 TP0] Decode batch. #running-req: 1, #token: 87, token usage: 0.00, gen throughput (token/s): 41.39, #queue-req: 0


[2024-10-30 07:44:27 TP0] Decode batch. #running-req: 1, #token: 127, token usage: 0.00, gen throughput (token/s): 138.93, #queue-req: 0


[2024-10-30 07:44:27] INFO:     127.0.0.1:46964 - "GET /get_model_info HTTP/1.1" 200 OK


[2024-10-30 07:44:27 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 1, cache hit rate: 1.85%, token usage: 0.00, #running-req: 1, #queue-req: 0


[2024-10-30 07:44:27 TP0] Decode batch. #running-req: 2, #token: 179, token usage: 0.00, gen throughput (token/s): 148.97, #queue-req: 0
[2024-10-30 07:44:27] INFO:     127.0.0.1:46968 - "POST /generate HTTP/1.1" 200 OK
[2024-10-30 07:44:27] The server is fired up and ready to roll!


[2024-10-30 07:44:28 TP0] Decode batch. #running-req: 1, #token: 207, token usage: 0.00, gen throughput (token/s): 140.11, #queue-req: 0


[2024-10-30 07:44:28 TP0] Decode batch. #running-req: 1, #token: 247, token usage: 0.00, gen throughput (token/s): 138.16, #queue-req: 0


[2024-10-30 07:44:28 TP0] Decode batch. #running-req: 1, #token: 287, token usage: 0.00, gen throughput (token/s): 138.15, #queue-req: 0


[2024-10-30 07:44:29 TP0] Decode batch. #running-req: 1, #token: 327, token usage: 0.00, gen throughput (token/s): 137.49, #queue-req: 0


[2024-10-30 07:44:29 TP0] Decode batch. #running-req: 1, #token: 367, token usage: 0.00, gen throughput (token/s): 137.74, #queue-req: 0


[2024-10-30 07:44:29 TP0] Decode batch. #running-req: 1, #token: 407, token usage: 0.00, gen throughput (token/s): 137.90, #queue-req: 0


[2024-10-30 07:44:29 TP0] Decode batch. #running-req: 1, #token: 447, token usage: 0.00, gen throughput (token/s): 137.63, #queue-req: 0


[2024-10-30 07:44:30] INFO:     127.0.0.1:46950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"0fda0495db014f39b91f4e3de79b33c4","object":"chat.completion","created":1730274270,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"LLM stands for Large Language Model. It's a type of artificial intelligence (AI) designed to process and generate human-like language. LLMs are trained on vast amounts of text data, which enables them to learn patterns, relationships, and nuances of language.\n\nThese models are typically composed of multiple layers of neural networks, which allow them to analyze and understand the context, syntax, and semantics of language. This enables them to perform a wide range of tasks, such as:\n\n1. **Language Translation**: LLMs can translate text from one language to another with high accuracy.\n2. **Text Summarization**: They can summarize long pieces of text into concise, meaningful summaries.\n3. **Qu
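
The same request can also be sent from Python without extra dependencies. The sketch below mirrors the curl call above (same URL path, headers, and JSON body); the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of SGLang.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_prompt: str) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer None",  # placeholder key, as in the curl call
        },
        method="POST",
    )


def send_chat_request(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires the server from the previous section to be running):
# req = build_chat_request(
#     "http://localhost:30000",
#     "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     "What is a LLM?",
# )
# print(send_chat_request(req)["choices"][0]["message"]["content"])
```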

## Using the OpenAI-Compatible API

SGLang exposes OpenAI-compatible endpoints, so the `openai` Python client can be pointed at the server. Here is a chat completion example:

In [3]:
import openai

# The client always needs an api_key; pass a placeholder (here "None")
# if none was specified when the server was launched.
# Setting an API key when launching the server is strongly recommended.

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

# Chat completion example

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)

[2024-10-30 07:44:30 TP0] Prefill batch. #new-seq: 1, #new-token: 20, #cached-token: 29, cache hit rate: 29.13%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-10-30 07:44:30 TP0] Decode batch. #running-req: 1, #token: 64, token usage: 0.00, gen throughput (token/s): 44.77, #queue-req: 0


[2024-10-30 07:44:31] INFO:     127.0.0.1:41416 - "POST /v1/chat/completions HTTP/1.1" 200 OK
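
`print_highlight(response)` prints the whole response object. When only the generated text is needed, it can be read from `response.choices[0].message.content`. The sketch below does the same on the plain-dict form of a response (for example the JSON returned by the curl call earlier, or `response.model_dump()` from the `openai` client). The helper name `summarize_completion` is hypothetical, and the `usage` and `finish_reason` fields are assumed to follow the OpenAI chat completion format, since they are not visible in the truncated output above.

```python
from typing import Any, Dict


def summarize_completion(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the commonly needed fields out of a chat completion response dict."""
    choice = payload["choices"][0]
    # `usage` is treated as optional in case the server omits it.
    usage = payload.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Illustrative payload with made-up values, shaped like an OpenAI chat completion.
example = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "1. France: Paris"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 49, "completion_tokens": 64, "total_tokens": 113},
}
summary = summarize_completion(example)
print(summary["text"])
```

A `finish_reason` of `"length"` would indicate that generation stopped at the `max_tokens` limit rather than at a natural end of the answer.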


In [4]:
terminate_process(server_process)

[2024-10-30 07:44:31] INFO:     Shutting down
[2024-10-30 07:44:31] INFO:     Waiting for application shutdown.
[2024-10-30 07:44:31] INFO:     Application shutdown complete.
[2024-10-30 07:44:31] INFO:     Finished server process [1242702]
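
Outside a notebook, it can be convenient to make sure the server is shut down even if a request raises an exception. The following is a minimal sketch using the standard library's `subprocess` module rather than the `sglang.utils` helpers; `managed_process` is a hypothetical name, and a real deployment may need extra handling for any child processes the server starts.

```python
import subprocess
import sys
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def managed_process(cmd: List[str], grace_s: float = 10.0) -> Iterator[subprocess.Popen]:
    """Start `cmd` and guarantee it is stopped when the `with` block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:  # still running
            proc.terminate()  # ask politely first
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()  # force-stop after the grace period
                proc.wait()


# Demonstration with a harmless stand-in command instead of the real server.
with managed_process([sys.executable, "-c", "import time; time.sleep(60)"]) as demo:
    running_inside = demo.poll() is None
stopped_after = demo.poll() is not None
```

For the server in this guide, the command list would be the `python -m sglang.launch_server ...` invocation shown at the top, split into arguments.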
