# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
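
One way to decide whether the patch is needed is to check for an already running event loop first. The following is a minimal sketch and not part of SGLang: the helper names `has_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical, and it assumes the `nest_asyncio` package is installed whenever a running loop is detected.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from code that is already inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running (hypothetical helper).

    Returns True if the patch was applied, False otherwise.
    """
    if not has_running_loop():
        return False
    import nest_asyncio  # third-party package, imported lazily

    nest_asyncio.apply()
    return True
```

In a plain Python script, `apply_nest_asyncio_if_needed()` returns `False` and leaves asyncio untouched; in IPython or Jupyter, where a loop is usually already running, it applies the patch.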

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

[2025-10-30 20:04:37] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.


[2025-10-30 20:04:37] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.


[2025-10-30 20:04:37] INFO utils.py:164: NumExpr defaulting to 16 threads.


[2025-10-30 20:04:37] INFO trace.py:52: opentelemetry package is not installed, tracing disabled








[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.81it/s]

Capturing batches (bs=1 avail_mem=76.22 GB): 100%|██████████| 20/20 [00:01<00:00, 11.83it/s]


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jack, an American. I live in China now. I can speak English and I can sing very well. But I have many bad habits such as smoking and drinking. My parents and teachers are always worried about me. They say that I should stop smoking and drinking. I'm happy to hear them. But I also want to tell them that I'm trying to do something to change my bad habits. I want to buy more books and I plan to spend some time in different places every week. I also want to give up eating too much meat and I will start eating more vegetables every day. I'm going to work out every day
Prompt: The president of the United States is
Generated text:  a ( ).
A. Supreme official
B. Highest administrative organ
C. Highest state organ
D. Highest judicial organ
Answer: C

The Party's greatest political strength lies in its ability to unite and lead people of all ethnic groups across the country to achieve which of the following?
A. Common prosperity
B. Achieving the great r
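
For a long prompt list, it can be convenient to submit prompts in fixed-size chunks and keep each prompt paired with its output, for example to save partial results as you go. The sketch below relies only on what the cell above shows: `generate(prompts, sampling_params)` returns one dict per prompt with a `"text"` key. The helper `generate_in_chunks`, its `chunk_size` parameter, and the `EchoEngine` stand-in are hypothetical and exist only for illustration. The engine schedules batches itself, so chunking is not assumed to improve throughput.

```python
from typing import Any, Dict, Iterator, List, Tuple


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, calling engine.generate per chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


class EchoEngine:
    """Stand-in for an engine object so the sketch runs without a GPU.

    Not part of SGLang; it only mimics the list-of-dicts return shape.
    """

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


if __name__ == "__main__":
    demo_prompts = ["Hello, my name is", "The capital of France is"]
    for prompt, text in generate_in_chunks(EchoEngine(), demo_prompts, {"temperature": 0.8}):
        print(f"Prompt: {prompt}\nGenerated text: {text}")
```

With a real engine, you would pass `llm` in place of `EchoEngine()`.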

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its cuisine, fashion, and art scene, and is a popular tourist destination. The city is home to many famous museums, including the Louvre and the Musée d'Orsay, as well as the iconic Eiffel Tower. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its status as the world's most

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence. This could lead to more natural and intuitive interactions between humans and machines, as well as more effective decision-making.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and use, as well as more transparent and accountable AI systems.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs
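
The "overlap removal" mentioned in the cell above can be illustrated with a small standalone sketch. It assumes that a streamed chunk may repeat the tail of the text received so far, and it drops the longest such repeated prefix before appending. The function names `strip_overlap` and `merge_chunks` are hypothetical; this is an illustration of the idea, not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return chunk without the longest prefix that duplicates the end of existing."""
    max_len = min(len(existing), len(chunk))
    # Try the longest candidate overlap first, then shorter ones.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text that each chunk repeats from the tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["The capital", " capital of", " of France"]))
```

A suffix/prefix match like this cannot distinguish an accidental repeat from text the model intentionally repeated, so it is a heuristic rather than an exact reconstruction.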



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [location] person. I enjoy [job title] and [related job title], which is [specific role]. I am [age] years old, [gender] and [interests/experience]. I have a [relation to the target audience] connection with [target audience]. I believe that [reason for my connection] and I am [confident level]. My hobbies include [list of hobbies]. How can you describe me to someone who knows me, without actually saying my name? What a great way to start! You’re [age] years old, [gender] and [interests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement can be summarized as: Paris, the cultural and historical capital of France, serves as the nation's capital and the seat of the executive branch of the French government. It is also a major tourist destination and a major financial center. F
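
When several independent prompt batches are ready at once, they can be awaited together. The sketch below uses `asyncio.gather` and relies on `async_generate(prompts, sampling_params)` returning one dict per prompt with a `"text"` key, as in the cell above. That several `async_generate` calls can safely be in flight at the same time is an assumption here, and `generate_batches` and `FakeAsyncEngine` are hypothetical names used only for illustration.

```python
import asyncio
from typing import Any, Dict, List


async def generate_batches(
    engine: Any,
    batches: List[List[str]],
    sampling_params: Dict[str, Any],
) -> List[List[str]]:
    """Run one async_generate call per batch concurrently and collect the texts."""
    results = await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )
    # gather preserves the order of its arguments, so results align with batches.
    return [[output["text"] for output in outputs] for outputs in results]


class FakeAsyncEngine:
    """Stand-in so the sketch runs without a model; not part of SGLang."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict[str, Any]):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return [{"text": p.upper()} for p in prompts]


if __name__ == "__main__":
    texts = asyncio.run(
        generate_batches(FakeAsyncEngine(), [["a", "b"], ["c"]], {"temperature": 0.8})
    )
    print(texts)
```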

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert fictional character's name], and I am currently [insert fictional character's age and gender] years old. I currently work as a [insert fictional character's occupation or role] in a [insert fictional workplace or organization]. I am very [insert fictional character's personality trait or characteristic] and enjoy [insert fictional character's hobbies or interests]. What brings you here today? As an AI, I don't have the ability to experience emotions or think like humans, but I can understand and respond to prompts and input. How can I assist you today? Please let me know your name and what brings you to this place. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is accurate. The city of Paris, located on the Île de la Cité, is the most populous city in France and one of the most important cities in Europe. The city's cultural, artistic, and financial centers have made it an influential hub for international trade, diplomacy, and education. Paris is known for its iconic landmarks like Notre-Dame Cathedral, Eiffel Tower, and Louvre Museum, as well as its extensive art and literature scene. The city also has a diverse population of around 1. 3 million residents, with a growing middle class and increasing numbers of international visitors. Despite



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve several areas of significant change and development. Here are some potential trends that are likely to shape the AI landscape in the years to come:

1. Increased focus on ethics and fairness: As AI systems become more integrated into people's lives, there will be increasing pressure to ensure that AI systems are developed, deployed, and used in a manner that is fair, equitable, and fair. This may involve efforts to promote transparency, accountability, and fairness in the development and deployment of AI systems.

2. Advances in natural language processing: As AI continues to advance, there is likely to be continued progress in natural language processing. This will


In [6]:
llm.shutdown()
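
To make sure `shutdown()` runs even when generation raises an exception, the engine can be wrapped in a context manager. This is a minimal sketch using the standard library's `contextlib`; `managed_engine` and `DummyEngine` are hypothetical names, and the only engine method it relies on is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def managed_engine(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call engine.shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Stand-in that records whether shutdown() was called; not part of SGLang."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


if __name__ == "__main__":
    engine = DummyEngine()
    with managed_engine(engine) as llm:
        pass  # call llm.generate(...) here when using a real engine
    print("shutdown called:", engine.closed)
```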