# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
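
To see why the patch is needed, the sketch below uses only the standard library: calling `asyncio.run` from code that is already running inside an event loop (the situation in IPython and Jupyter) raises a `RuntimeError`. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by the standard event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.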

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Hae Young Kim, and I'm a 21 year old girl. I can speak English fluently and I like to write poetry. I have been working on a project, which involves writing a poem about my experiences of living in China and writing in English. Can you help me with a brief introduction to the topic? Yes, of course! The poem you're writing about your experiences of living in China and writing in English is a great project. It will help you to express your thoughts and feelings about your life in a fresh and meaningful way. Additionally, writing in English can be a great way to improve your English language skills
Prompt: The president of the United States is
Generated text:  trying to save the United States military.

The president has decided that they will have the president of the United States form a committee to help with the military's needs. This committee will be responsible for taking care of the military's expenses and overseeing their operations. The
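
The `zip` in the example above assumes one output per prompt, in the same order. Under that assumption, the results can be collected into records for later processing. This is a minimal sketch: `collect_results` and the sample outputs are hypothetical and stand in for a real engine call, and each output is assumed to be a dictionary with a `text` key as in the cell above.

```python
import json


def collect_results(prompts, outputs):
    """Pair each prompt with the generated text of its output."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Hypothetical outputs shaped like the dictionaries used above.
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]
records = collect_results(
    ["The capital of France is", "The future of AI is"], fake_outputs
)

# One JSON object per line is a convenient format for batch results.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```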

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I am passionate about [job title] and have been working in this field for [number of years] years. I am always looking for ways to [job title] and have a keen interest in [job title] and its impact on society. I am always eager to learn and grow, and I am always looking for new challenges and opportunities to grow and succeed. I am a [job title] and I am always looking for ways to [job title] and have a keen interest in [job title] and its impact on society. I am always

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture that attracts millions of visitors each year. The city is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as its diverse food scene and fashion scene. Paris is also a major center for art, music, and literature, and is home to many famous museums, theaters, and concert halls. The city is also known for its fashion industry, with many famous designers and boutiques located in the city. Overall, Paris is a city of contrasts,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see more automation and robotics in various industries, from manufacturing to healthcare to transportation. This will lead to increased efficiency and productivity, but it will also create new jobs and challenges for workers.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be an increased risk of data breaches and privacy violations. Governments and companies will need to develop new security measures to



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I am a/an [insert profession or title] with a diverse background that includes [insert relevant experiences and skills]. I am passionate about [insert personal interests or hobbies]. If asked, I am fluent in [insert language] and enjoy [insert interests or hobbies]. I believe in [insert values or principles]. I am dedicated to [insert key areas of focus or goals] and strive to achieve my best. I am always ready to learn and grow, and I believe that my expertise in [insert relevant areas of expertise] can benefit anyone. Thank you for asking, and I look forward to the opportunity to share

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

User: Could you please provide some additional details about Paris that I could learn about?

Certainly! Here are some additional details about Paris 
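
An awaited call such as `llm.async_generate` can share the event loop with other coroutines. The sketch below shows the pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of `{"text": ...}` dictionaries out), so the example runs without a model; `heartbeat` and `run_batch` are likewise illustrative names.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    """Unrelated work that proceeds while generation is pending."""
    for _ in range(3):
        await asyncio.sleep(0.001)
        ticks.append("tick")


async def run_batch(engine, prompts, sampling_params):
    ticks = []
    # gather schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(
    run_batch(FakeEngine(), ["Hello, my name is"], {"temperature": 0.8})
)
print(outputs[0]["text"], len(ticks))
```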

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am [Your Age], and I've always been a person who is always looking for a challenge and trying to make the world a better place. From my childhood, I've always been fascinated by how different people are and how they can have such different perspectives and interests. I'm always looking for ways to inspire others and make a positive impact, and I think that's what I do best as a character.

I have a natural talent for problem-solving and I enjoy using my creativity to come up with solutions to complex problems. I'm also someone who loves to learn and I'm always eager to improve myself and gain



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest and most populous city in the country.
You are to answer this question: Are the rules of the Parisian game related to the French capital? No, the rules of the Parisian game are not related to the French capital. The Parisian game is a type of board game where players choose to play either the "pied de guerre" or the "pied de loup." The rules of this game are not specifically tied to Paris or the French capital, but rather to the general nature of the board game itself. It is a popular game that has been played for centuries and is enjoyed by



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  fascinating and involves many potential trends and developments. Here are some possible future trends in artificial intelligence:

1. More advanced machine learning: As AI technology continues to advance, we can expect to see more sophisticated machine learning algorithms that can perform tasks that were previously impossible for humans.

2. Increased use of AI in healthcare: AI is already being used in various fields such as diagnosis, treatment, and patient care. We can expect this trend to continue as more medical professionals use AI to improve patient outcomes.

3. Enhanced virtual reality: As more AI-powered devices and platforms become available, we can expect to see more immersive and interactive experiences, such as


```python
llm.shutdown()
```