# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
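
The reason for this patch can be seen without SGLang at all. The following minimal sketch uses only the standard library and assumes it is run as a plain script (the names `inner` and `outer` are illustrative). It shows that `asyncio.run` refuses to start when a loop is already running in the same thread, which is the situation inside IPython or Jupyter:

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, as it would be in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted, which is why it needs to run before the engine is used in those environments.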

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Nadia. I am from Canada. I live in a big city called Toronto. Toronto is a beautiful city with many buildings. It is the largest city in Canada. When I first came here, I was very excited. I love the air, and it's cold and snowy outside. In the past, I had to wear a coat, but now I can wear a pair of gloves. But I don't like the traffic very much. I think it's a waste of time. I usually ride my bicycle to work. I can see the trees and the water when I walk. I love the city and the people in it.
Prompt: The president of the United States is
Generated text:  very busy. He has to travel to a country in Africa. He needs to change his plane ticket. There are many airlines in Africa. Some airlines charge money. Others do not. The president has to pay $12000 for his plane ticket. The airline that he chooses must charge at least $10000. What is the maximum amount the president can pay for his plane ticket?
If the airline must charge at least $10000, t
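
The `sampling_params` dictionary in this section sets `temperature` and `top_p`. As a rough illustration of what these two knobs usually mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy list of logits. It is not SGLang's sampler: the helper names `apply_temperature` and `top_p_filter` and the logits are made up for this example, and real implementations may treat the cutoff boundary differently.

```python
import math


def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.5, -1.0]  # toy values for four candidate tokens
probs = apply_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print([round(p, 3) for p in probs])
print([round(p, 3) for p in nucleus])
```

With these toy values the least likely token falls outside the nucleus and receives zero probability, while the remaining three are renormalized. A lower temperature sharpens the distribution before the cutoff is applied, which is why the streaming examples that use `temperature=0.2` tend to produce more predictable text.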

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [Type of Character] who is [Describe your personality traits]. I enjoy [What you do for fun or hobbies]. I'm always looking for new experiences and learning new things. What's your favorite hobby or activity? I love [Describe your favorite hobby or activity]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite book or movie? I love [Name the book/movie]. I'm always looking for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a vibrant culture. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its delicious cuisine, including French cuisine and international cuisine. The city is a popular tourist destination and a cultural hub, attracting millions of visitors each year. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that is both old and new, and a city that is constantly evolving. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased accuracy and precision: AI systems are becoming more accurate and precise in their predictions and decisions, leading to more reliable and effective applications.

2. Integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing for more complex and nuanced interactions.

3. Personalization: AI systems are likely to become more personalized, with the ability to learn from user data and provide more tailored experiences.

4. Ethical and responsible AI: As AI systems become more advanced, there will be a greater emphasis on ethical and responsible design, with a focus
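
The banner printed in this section mentions overlap removal. One way to picture that idea is the sketch below, which drops from each incoming chunk the longest prefix that repeats the end of the text merged so far. The helpers `drop_repeated_prefix` and `merge_chunks` and the sample chunks are hypothetical; how `stream_and_merge` handles chunks internally is not shown in this document.

```python
def drop_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Toy chunks in which each one repeats the tail of the previous one.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

A purely textual heuristic like this has a known weakness: output that legitimately repeats itself at a chunk boundary (for example `"ha"` followed by `"ha"`) would be collapsed, so it is only suitable when chunks are known to overlap.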



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation] with [number] years of experience in [field]. I have a knack for [mention an ability or characteristic] and have always been passionate about [mention something that inspires you]. What would you like to say about yourself? [Name] is [mention a detail or a feature of their identity that makes them stand out]. Additionally, please provide a brief description of your professional background, including your previous roles and any relevant accomplishments. [Name] has been working in [describe their current role in their industry or field] for [mention years] years. They are [mention a detail or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is also the largest city in France and the third-largest city in the European Union, with an estimated population of over 10 million. 
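
The practical benefit of the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with `asyncio.gather`. To stay runnable without a GPU it uses `FakeEngine`, a hypothetical stand-in whose `async_generate` only mimics the call shape used in this section; the `heartbeat` coroutine is also invented for the example.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics an awaitable batch generation call."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate time spent waiting on the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated work that runs while generation is pending."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)


async def main():
    engine = FakeEngine()
    log = []
    # Both awaitables are scheduled on the same event loop and interleave.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs)
print(log)
```

Swapping `FakeEngine()` for the real engine should leave the structure of `main` unchanged, assuming the engine's `async_generate` is awaited in the same way as in the cell above.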

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first name] and I am a [insert occupation or major field] enthusiast. I have a passion for [insert hobby or project], and I love spending time exploring the outdoors and trying new things. I believe that everyone has a unique skill that can be honed and developed, and I look forward to contributing to the community by sharing my knowledge and experiences. Thank you for having me! 🌍✨

Hey! I'm [insert your full name], and I'm just a regular person who happens to like learning new things. I'm a [insert occupation or major field] enthusiast, and I'm always on the



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris is the capital city of France and is home to the Louvre Museum, Notre-Dame Cathedral, and many other famous landmarks. The city is known for its rich history and beautiful architecture. Paris is often referred to as the "City of Light" and is a major tourist destination. The French Riviera is a popular tourist destination and is known for its beaches, sunbathing, and water sports.

Paris is home to over 12 million people and is a major center of business, education, and culture. Its position as the capital has made it an important hub for international politics, economics


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be heavily shaped by the rapid advancements in technology, the increasing availability of data, and the increasing complexity of problems that AI is being used to solve. Here are some possible future trends in AI:

1. Increased automation: AI is increasingly being used to automate routine tasks, freeing up human resources to focus on more creative and complex tasks. This trend is likely to continue as AI becomes more and more capable and efficient.

2. Autonomous vehicles: As AI becomes more advanced, autonomous vehicles may become a reality. This could lead to a reduction in traffic accidents, increased safety, and increased efficiency in transportation.

3. Personalized medicine:




In [6]:
llm.shutdown()
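
In a standalone script, as opposed to a notebook, it can be useful to make sure the shutdown call runs even when generation raises. The sketch below shows one way to do that with `try`/`finally`. `FakeEngine` and `run_batch` are hypothetical names used only so the example runs without a model; the stand-in mirrors just the `generate` and `shutdown` calls that appear in this document.

```python
class FakeEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("prompts must not be empty")
        return [{"text": p.upper()} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_batch(engine, prompts, sampling_params):
    """Run one batch and always release the engine afterwards."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    run_batch(engine, [], {"temperature": 0.8})  # fails on purpose
except ValueError as exc:
    print(f"generation failed: {exc}")
print(engine.is_shut_down)
```

The same pattern should carry over to the real engine by passing `llm` in place of `FakeEngine()`.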