# Quick Start: Launch a Server and Send Requests

This guide shows how to launch an SGLang server and send requests to it once SGLang is installed.

## Launch a server

The notebook cell below is equivalent to running the following command in your terminal and waiting for the server to be ready:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
```

In [1]:
from sglang.utils import lauch_sglang_server, wait_for_server, terminate_process, highlight_text


server_process = lauch_sglang_server(
    """
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
"""
)

wait_for_server("http://localhost:30000")
highlight_text("Server is ready. Proceeding with the next steps.")

[2024-10-28 09:17:45] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=347192970, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1,

[2024-10-28 09:18:01 TP0] Init torch distributed begin.


[2024-10-28 09:18:01 TP0] Load weight begin. avail mem=78.59 GB


[2024-10-28 09:18:02 TP0] lm_eval is not installed, GPTQ may not be usable


INFO 10-28 09:18:02 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.24it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.16it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.17it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.40it/s]

[2024-10-28 09:18:05 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.50 GB
[2024-10-28 09:18:05 TP0] Memory pool end. avail mem=8.37 GB
[2024-10-28 09:18:05 TP0] Capture cuda graph begin. This can take up to several minutes.


[2024-10-28 09:18:12 TP0] max_total_num_tokens=442913, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072


[2024-10-28 09:18:12] INFO:     Started server process [511197]
[2024-10-28 09:18:12] INFO:     Waiting for application startup.
[2024-10-28 09:18:12] INFO:     Application startup complete.
[2024-10-28 09:18:12] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)


[2024-10-28 09:18:13] INFO:     127.0.0.1:35516 - "GET /v1/models HTTP/1.1" 200 OK
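
The startup log above ends with a `GET /v1/models` request answered with `200 OK`, which suggests that readiness is checked by polling that endpoint. A minimal sketch of such a check, using only the Python standard library, could look like the following. The names `http_probe` and `poll_until_ready` and the timeout defaults are hypothetical; this is not SGLang's `wait_for_server` implementation.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=5.0):
    """Return True if a GET request to `url` is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def poll_until_ready(base_url, timeout_s=600.0, interval_s=1.0, probe=http_probe):
    """Poll `<base_url>/v1/models` until it responds or the deadline passes.

    Returns True once the probe succeeds, False if `timeout_s` elapses first.
    `probe` is injectable so the loop can be exercised without a live server.
    """
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (assumes a server was started on port 30000 as shown above):
# ready = poll_until_ready("http://localhost:30000")
```

Returning a boolean instead of raising lets the caller decide whether a timeout should abort the run or trigger a retry.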


## Send a Request

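Once the server is running, you can send test requests using curl. In a notebook, the leading `!` runs the command in a shell; omit it when typing the command in a terminal.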

In [2]:
!curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer None" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is a LLM?"}]}'

[2024-10-28 09:18:13 TP0] Prefill batch. #new-seq: 1, #new-token: 47, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-10-28 09:18:13] INFO:     127.0.0.1:35536 - "GET /get_model_info HTTP/1.1" 200 OK


[2024-10-28 09:18:13 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 1, cache hit rate: 1.85%, token usage: 0.00, #running-req: 1, #queue-req: 0


[2024-10-28 09:18:13] INFO:     127.0.0.1:35540 - "POST /generate HTTP/1.1" 200 OK
[2024-10-28 09:18:13] The server is fired up and ready to roll!


[2024-10-28 09:18:14 TP0] Decode batch. #running-req: 1, #token: 87, token usage: 0.00, gen throughput (token/s): 25.58, #queue-req: 0


[2024-10-28 09:18:14 TP0] Decode batch. #running-req: 1, #token: 127, token usage: 0.00, gen throughput (token/s): 139.75, #queue-req: 0


[2024-10-28 09:18:14 TP0] Decode batch. #running-req: 1, #token: 167, token usage: 0.00, gen throughput (token/s): 137.96, #queue-req: 0


[2024-10-28 09:18:15 TP0] Decode batch. #running-req: 1, #token: 207, token usage: 0.00, gen throughput (token/s): 138.20, #queue-req: 0


[2024-10-28 09:18:15 TP0] Decode batch. #running-req: 1, #token: 247, token usage: 0.00, gen throughput (token/s): 137.96, #queue-req: 0


[2024-10-28 09:18:15 TP0] Decode batch. #running-req: 1, #token: 287, token usage: 0.00, gen throughput (token/s): 137.49, #queue-req: 0


[2024-10-28 09:18:15 TP0] Decode batch. #running-req: 1, #token: 327, token usage: 0.00, gen throughput (token/s): 138.10, #queue-req: 0


[2024-10-28 09:18:16 TP0] Decode batch. #running-req: 1, #token: 367, token usage: 0.00, gen throughput (token/s): 138.22, #queue-req: 0


[2024-10-28 09:18:16] INFO:     127.0.0.1:35530 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"ad61027db61649d0bd69f6aa901f1d8c","object":"chat.completion","created":1730107096,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"LLM stands for Large Language Model. It's a type of artificial intelligence (AI) designed to process and generate human-like language. LLMs are trained on vast amounts of text data, which allows them to learn patterns, relationships, and nuances of language.\n\nLarge Language Models like myself are trained on a massive corpus of text, often sourced from the internet, books, and other digital sources. This training enables us to:\n\n1. **Understand**: We can comprehend the meaning of text, including context, syntax, and semantics.\n2. **Generate**: We can create coherent and context-specific text, such as responses to questions, articles, or even entire stories.\n3. **Complete**: We can fill in the
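
The same request can be issued from Python without extra dependencies. The sketch below mirrors the curl call above (same URL path, headers, and JSON body) and reads the reply from `choices[0].message.content`, following the response shape shown in the output. The helper names `build_chat_request`, `send_chat`, and `extract_reply` are hypothetical and exist only for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key="None"):
    """Build a POST request equivalent to the curl command above."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat(request, timeout_s=60.0):
    """Send the request and decode the JSON response body into a dict."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_reply(completion):
    """Return the assistant text from a chat.completion response dict."""
    return completion["choices"][0]["message"]["content"]


# Example (assumes the server from the previous section is still running):
# req = build_chat_request(
#     "http://localhost:30000",
#     "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     [
#         {"role": "system", "content": "You are a helpful assistant."},
#         {"role": "user", "content": "What is a LLM?"},
#     ],
# )
# print(extract_reply(send_chat(req)))
```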

## Use the OpenAI-Compatible API

SGLang exposes OpenAI-compatible endpoints, so the `openai` Python client can talk to the server directly. The example below sends a chat completion request:

In [3]:
import openai

# The OpenAI client requires an api_key, so pass a placeholder value even if
# the server was started without one.
# Setting an API key when launching the server is strongly recommended.

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

# Chat completion example

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
highlight_text(response)

[2024-10-28 09:18:16 TP0] Prefill batch. #new-seq: 1, #new-token: 20, #cached-token: 29, cache hit rate: 29.13%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2024-10-28 09:18:17 TP0] Decode batch. #running-req: 1, #token: 79, token usage: 0.00, gen throughput (token/s): 46.61, #queue-req: 0
[2024-10-28 09:18:17] INFO:     127.0.0.1:35554 - "POST /v1/chat/completions HTTP/1.1" 200 OK
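
For longer generations it can be useful to print tokens as they arrive instead of waiting for the full response. The sketch below assumes the server honors `stream=True` the same way the OpenAI API does, and that each streamed chunk exposes `choices[0].delta.content` as the `openai` client does. The helper `stream_chat` is hypothetical; it accepts any client object with that interface, so it does not import `openai` itself.

```python
def stream_chat(client, model, messages, **kwargs):
    """Request a streamed chat completion and return the concatenated text.

    `client` is expected to behave like `openai.Client`: calling
    `client.chat.completions.create(..., stream=True)` yields chunk objects.
    """
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True, **kwargs
    )
    pieces = []
    for chunk in stream:
        # Some chunks may carry no choices or no text; skip those.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            pieces.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(pieces)


# Example (assumes `client` was created as in the cell above):
# text = stream_chat(
#     client,
#     "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     [{"role": "user", "content": "List 3 countries and their capitals."}],
#     temperature=0,
#     max_tokens=64,
# )
```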


When you are done, terminate the server process:

In [4]:
terminate_process(server_process)

[2024-10-28 09:18:17] INFO:     Shutting down


[2024-10-28 09:18:17] INFO:     Waiting for application shutdown.
[2024-10-28 09:18:17] INFO:     Application shutdown complete.
[2024-10-28 09:18:17] INFO:     Finished server process [511197]
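
In a script, as opposed to a notebook, an exception between launch and shutdown would leave the server running. One way to guard against that is a context manager that always stops the process on exit. The sketch below uses only `subprocess`; `managed_process` and the `grace_s` parameter are hypothetical, and this is not how SGLang's `terminate_process` is implemented. It signals only the process it started, so any worker processes that process spawned may need separate handling.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_process(cmd, grace_s=10.0):
    """Start `cmd` and guarantee it is stopped when the block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:
            # Ask politely first, then force-kill if it does not exit in time.
            proc.terminate()
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()


# Example with the launch command from this guide:
# cmd = [
#     sys.executable, "-m", "sglang.launch_server",
#     "--model-path", "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     "--port", "30000", "--host", "0.0.0.0",
# ]
# with managed_process(cmd) as server:
#     ...  # wait for readiness, then send requests
```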
