# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
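
The reason this is needed is that `asyncio.run` raises a `RuntimeError` when it is called while an event loop is already running in the same thread, which is the situation inside IPython and Jupyter. The sketch below uses only the standard library to detect that situation. The helper names `has_running_loop` and `probe` are illustrative and are not part of SGLang.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True if an event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Raised when no loop is running, e.g. in a plain Python script.
        return False
    return True


async def probe() -> bool:
    # Inside a coroutine, a loop is running by definition.
    return has_running_loop()


if __name__ == "__main__":
    print("loop running at top level:", has_running_loop())
    print("loop running inside a coroutine:", asyncio.run(probe()))
```

If the top-level check reports a running loop (as it would in a notebook), apply `nest_asyncio` as shown above before calling `asyncio.run`.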

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Fred. I am the king of the world. I am rich and famous. I am the king of a big country. I have many friends. They are all very nice. There are many people in my country. Some of them are farmers. They grow food for us. Some of them are workers. They make things for us. Some of them are doctors. They look after us. I have some friends in other countries. They are also very nice. They speak English. But I speak a language called English. My friends in other countries are very happy that I speak English. They say "I like you". I like my
Prompt: The president of the United States is
Generated text:  a very important person. He or she helps run the country. But what do you think is the most important person in the country? Let's have a look at the American president. President Obama is very important. He is a man who helps run the country. He is a good president. President Obama's main job is to help run the country. He helped America during the 20
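
For very large prompt lists it can be convenient to submit the work in slices, for example to save intermediate results as they arrive. The sketch below assumes only what the cell above shows: an object with a `generate(prompts, sampling_params)` method that returns one dict with a `"text"` key per prompt. `generate_in_chunks`, `chunk_size`, and the `EchoEngine` stand-in are hypothetical names introduced for illustration. `EchoEngine` exists so the snippet runs without a model; in real use you would pass `llm` instead.

```python
from typing import Any, Dict, Iterator, List, Tuple


class EchoEngine:
    """Hypothetical stand-in that mimics the generate() call shape used above."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        # Outputs are paired with prompts by position, as in the cell above.
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


if __name__ == "__main__":
    demo_prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]
    params = {"temperature": 0.8, "top_p": 0.95}
    for prompt, text in generate_in_chunks(EchoEngine(), demo_prompts, params):
        print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Smaller chunks likely give the engine's scheduler less to batch at once, so this approach trades some throughput for earlier partial results. A single `generate` call over the full list remains the simplest option.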

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling metropolis with a rich cultural heritage and is a major economic and political center in Europe. It is also known for its fashion industry, art scene, and food culture. The city is home to many famous landmarks and attractions, including the Louvre, the Eiffel Tower, and the Champs-Élysées. Paris is a city of contrasts, with its modern

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection and risk management. As AI technology continues to improve, we
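
The "overlap removal" mentioned in the cell above refers to merging streamed chunks so that text already received is not repeated. The following is a minimal, standalone sketch of that idea. It is not necessarily how `stream_and_merge` is implemented in `sglang.utils`, and the function names are illustrative. It assumes each incoming chunk may begin with some suffix of the text accumulated so far.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and stop at the first match.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing any overlap with the text merged so far."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk)
    return final_text


if __name__ == "__main__":
    # Each chunk repeats part of what was already received.
    streamed = [" Paris", " Paris, the", ", the city"]
    print(repr(merge_chunks(streamed)))
```

One limitation of this heuristic: a chunk that legitimately starts with a repeat of the preceding text (for example `"ha"` followed by `"ha"`) is treated as overlap and collapsed.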



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [age] year old [gender] [race] [nationality] who has [occupation] in the [field] industry. I'm a [born in] year, [born in] month, and [born in] day. I am [born in] hours, [born in] minutes, and [born in] seconds. I have always been [a certain personality trait, such as hard-working, friendly, adventurous, etc.] and I am always looking for [what you can do that makes you special, such as learning new skills, sharing knowledge, etc.] to help others

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Does this next sentence follow, given the preceding text?

Paris is the largest city in France.

Choose your answer from: (I) yes. (II) no.

(II) no.

The statement "Paris is the largest city in France" does not follow from the given information about Paris being the capital city of France. The
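
Because `async_generate` is awaitable, several independent requests can be in flight at once. The sketch below shows that pattern with `asyncio.gather`. It assumes only the call shape shown above: `await engine.async_generate(prompts, sampling_params)` returning one dict with a `"text"` key per prompt. `FakeAsyncEngine` and `run_jobs` are hypothetical names used so the snippet runs without a model.

```python
import asyncio
from typing import Any, Dict, List, Tuple

Job = Tuple[List[str], Dict[str, Any]]


class FakeAsyncEngine:
    """Hypothetical stand-in mimicking the async_generate() call shape used above."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_jobs(engine: Any, jobs: List[Job]) -> List[List[Dict[str, str]]]:
    """Run every (prompts, sampling_params) job concurrently; results keep job order."""
    tasks = [engine.async_generate(prompts, params) for prompts, params in jobs]
    return await asyncio.gather(*tasks)


async def main() -> None:
    jobs: List[Job] = [
        (["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95}),
        (["The capital of France is"], {"temperature": 0.2, "top_p": 0.9}),
    ]
    results = await run_jobs(FakeAsyncEngine(), jobs)
    for (prompts, _), outputs in zip(jobs, results):
        for prompt, output in zip(prompts, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    asyncio.run(main())
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its job. How the real engine schedules concurrent calls is not something this sketch verifies.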

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [occupation]. I'm a [character type, like AI or robot] that specializes in [job title, like engineer or chef]. I'm [age], and I'm currently working at [location, like a tech company or a restaurant]. I've been in this field for [number of years, like 10 years] and I've learned a lot about [job title, like programming, solving problems, etc.]. I'm [gender, like female or male] and I like [occupation, like sports, nature, travel, etc.]. I love [sports, hobbies, etc

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Popa".

In the early 20th century, the city was renamed "Paris" to honor Emperor Napoleon III. The current city center is centered around the Eiffel Tower, which is also famous for its iconic shape. Paris is also known as "Le Popa" in the local language.

The city is home to many historical landmarks, including the Louvre Museum and Notre-Dame Cathedral. The city is also famous for its gastronomy, including the famous "toque" (a type of croissant) and the "sour piece" (a type of cheese).

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by a variety of trends, including:

1. Increased efficiency and productivity: As AI technology continues to improve, we can expect to see more efficient and accurate algorithms that can help businesses and organizations increase their productivity and efficiency.

2. Enhanced personalization: AI is becoming more adept at analyzing and understanding customer data, leading to more personalized and relevant experiences. This could result in increased customer satisfaction and loyalty.

3. Increased transparency and accountability: With the rise of AI, there is an increased emphasis on transparency and accountability. This could mean that companies are required to provide more detailed explanations of their decision-making processes and that their algorithms

In [6]:
llm.shutdown()
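
To make sure `shutdown()` runs even if generation raises an exception, the call can be wrapped in a `try`/`finally` block or a small context manager. The sketch below uses `contextlib.contextmanager`. `engine_session` and `DummyEngine` are hypothetical names for illustration, and the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class DummyEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call engine.shutdown() on exit."""
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


if __name__ == "__main__":
    engine = DummyEngine()
    try:
        with engine_session(engine):
            raise RuntimeError("simulated failure during generation")
    except RuntimeError:
        pass
    print("engine shut down:", engine.is_shut_down)
```

In real use, the engine created at the top of this document would take the place of `DummyEngine`, with the generation calls placed inside the `with` block.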