# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
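
For context, the restriction comes from `asyncio` itself: `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython. The following standard-library sketch does not use SGLang, and the names `inner` and `outer` are illustrative only. It reproduces the kind of error that `nest_asyncio.apply()` is meant to work around:

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # Nested call: asyncio rejects this by default.
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

In a plain Python script, where no loop is running at the top level, the `asyncio.run(main())` calls used later in this document work without `nest_asyncio`.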

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Timmy!
I am a current student at East Anglia University of the Arts, with a BA (Hons) in Art and Architecture. My major focus is in the field of Art History, particularly with my current work in the fields of Art and Building History. I have a passion for researching, researching, and researching. My thesis work, titled “The Building Context of Medieval Monk House: A Case Study in the Parish of East Riding of Yorkshire,” is now complete, and I am excited to share more details with the world.
I am currently enrolled in a small MSc in Arts and Humanities with a focus on Public Art and
Prompt: The president of the United States is
Generated text:  visiting the island of requiring permission from a certain number of states. In the U. S. , there are 50 states and 3300 miles from the president to each state. If the president's plane is only allowed to travel 500 miles each day, how many days will it take for the plane to make the round trip to each 
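
The `sampling_params` dictionary controls how each next token is drawn. As a rough illustration of the textbook definitions of `temperature` and `top_p` (nucleus sampling), and not a description of SGLang's internal sampler, the following sketch applies both to a made-up set of four logits. The helper names `apply_temperature` and `top_p_filter` are hypothetical:

```python
import math


def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # Made-up scores for four candidate tokens.
probs = apply_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

With these example values, the least likely token falls outside the 0.95 nucleus and is never sampled. Lower temperatures sharpen the distribution further, which is why the streaming example below uses `temperature=0.2` for more deterministic text.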

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. What can you tell me about yourself? I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

This statement is factually correct and provides a clear and concise overview of the capital city's location and significance within the broader context of France. It is a straightforward and easily understandable statement that can be easily communicated to a wide audience. 

However, if you would like to provide a more detailed or nuanced statement, please let me know and I will do my best to assist you further. 

In terms of the statement itself, it is important to note that while Paris is the capital city of France, it is not the only capital city in the country. Other major cities in France include Paris, Lyon, Marseille, and Nice

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some potential trends that could emerge in the coming years:

1. Increased integration of AI into everyday life: As AI becomes more integrated into our daily lives, we may see more widespread adoption of AI-powered technologies like voice assistants, self-driving cars, and smart home devices. This could lead to a more seamless and intuitive user experience, as well as increased efficiency and productivity.

2. Greater emphasis on ethical AI: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ensuring
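
The "overlap removal" mentioned in the cell above refers to merging streamed chunks so that text already received is not emitted twice. A minimal sketch of that idea follows. `trim_overlap` and `merge_chunks` are hypothetical helper names used for illustration, and this is not necessarily how `stream_and_merge` is implemented:

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, dropping any repeated prefix of each chunk."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks: the second and third repeat text that was already received.
merged = merge_chunks(["The capital", "capital of France", " France is Paris."])
print(merged)
```

A trade-off of this approach is that a chunk which legitimately begins with the same characters as the current tail of the text would be trimmed as well.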



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________ and I'm a/an ___________ artist. I specialize in ___________. My work is ___________. If you're interested in getting to know me better, please let me know. I hope to meet you someday! 🌍✨🎨

I hope you enjoy this introduction! Let me know if you need any clarification or have any questions. 📚✨✨

Remember to be yourself and share your interests with us! 🌟💕

What is your favorite color and why? 🌍✨
As an artist, I use color to convey emotion and create mood. Blue is often associated with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Grande-Bretagne" due to its geographical connection to England.
Paris, also known as "La Grande-Bretagne" due to its geographical connection to England, is the capital and largest city of France. It is a historical, cultural, and artistic center,
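
Because `async_generate` is awaitable, it can share an event loop with other coroutines. The sketch below shows the general pattern with `asyncio.gather`. `StandInEngine` is a hypothetical placeholder that only mimics the `async_generate(prompts, sampling_params)` call shape and the `{"text": ...}` output shape used above, so the example runs without a model. Whether issuing several calls concurrently helps with a real engine is an assumption to verify:

```python
import asyncio


class StandInEngine:
    """Hypothetical placeholder with the same call shape as llm.async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # Simulate waiting on the engine.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Await one async_generate call per prompt group, preserving group order."""
    calls = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    return await asyncio.gather(*calls)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    generate_groups(StandInEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```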

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old [Gender] person. I come from [Location] and I love [Favorite Activity]. Let's do this! [Name] is a friendly, down-to-earth [Occupation], with a great sense of humor and a love for [Favorite Activity]. I enjoy [Favorite Activity] with my friends, and I have [Number] of loyal fans across my platform. I'm always ready to share my knowledge and expertise in [Field of Interest] and I'm here to assist you with any questions you may have. Let's be friends! [Name] is a friendly,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

What is the answer? (Select from the options below: i. yes. ii. no. i.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and unpredictable, but some potential trends that could come about include:

1. Increased AI integration with physical and biological systems: As AI is becoming more integrated with our physical world, we may see more pervasive and integrated AI systems in our everyday lives, from healthcare to manufacturing to transportation.

2. AI becoming more personalized and context-aware: As AI learns to better understand and interpret human behavior and preferences, it may become more personalized and context-aware, offering more accurate and relevant recommendations and personalized experiences.

3. AI-driven advancements in healthcare: As AI continues to improve, it could revolutionize healthcare, offering more precise diagnoses, personalized treatment plans
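
The streaming loop in this section prints each chunk as soon as it arrives. When the full text is also needed afterwards, the chunks can be accumulated while they are printed. The following is a minimal sketch, assuming only that the stream is an async iterator of strings. `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)`, and `print_and_collect` is an illustrative helper name:

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in that yields text chunks like a streaming generation."""
    for chunk in [" Paris", ".", "\n\n", "What", " is", " the", " answer", "?"]:
        await asyncio.sleep(0)  # Yield control, as a real stream would while waiting.
        yield chunk


async def print_and_collect(stream) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # Final newline after the stream ends.
    return "".join(parts)


full_text = asyncio.run(print_and_collect(fake_stream()))
```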




In [6]:
llm.shutdown()
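
In a script, it can be useful to guarantee that `shutdown()` runs even if generation raises an exception. One way to sketch this is a small context manager. `managed_engine` is a hypothetical helper, and `StandInEngine` is a placeholder used here so the example runs without loading a model:

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical placeholder exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call engine.shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with managed_engine(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```

With a real engine, the object passed to `managed_engine` would be the one returned by `sgl.Engine(...)`.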