# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server adds unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
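As a minimal sketch of when the patch is needed, the helper below checks whether an event loop is already running (as it is in IPython and Jupyter) and applies `nest_asyncio` only in that case. The function names `has_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical and exist only for illustration; `nest_asyncio.apply()` is the call shown above.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from code that is already inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() works as usual.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is running.

    Returns True if the patch was applied. The name is illustrative.
    """
    if not has_running_loop():
        return False
    # Imported lazily so that plain scripts do not need the package installed.
    import nest_asyncio

    nest_asyncio.apply()
    return True
```

In a plain Python script no loop is running, so the helper does nothing and `asyncio.run(...)` can be used directly.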

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Amin, and I am a software developer and entrepreneur with a passion for digital marketing. I have a Bachelor's degree in computer science and have experience working in both traditional and digital marketing.
I am based in San Francisco, California, but I love to travel and have visited some of the world's most beautiful places. I have always been a big fan of coffee and enjoy tasting a variety of blends and specialties. How can I best approach a new client or client group to understand their marketing needs? Developing a strong understanding of a new client or client group is essential to creating effective marketing strategies. Here are some steps you can take to approach
Prompt: The president of the United States is
Generated text:  a very important person. He holds the most important position in the government of the country. He is always busy doing something to help the country. He has a lot of responsibilities. He has to keep the peace a
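For offline batch jobs it is often convenient to keep each prompt together with its completion. The sketch below assumes, as the cell above does, that `llm.generate` returns one dictionary per prompt, in the same order, with the completion under the `"text"` key. The helper name `pair_results` and the record field names are hypothetical.

```python
from typing import Any


def pair_results(
    prompts: list[str], outputs: list[dict[str, Any]]
) -> list[dict[str, str]]:
    """Combine prompts and engine outputs into flat records.

    Assumes outputs[i] belongs to prompts[i], as in the zip() loop above.
    """
    if len(prompts) != len(outputs):
        raise ValueError(
            f"expected {len(prompts)} outputs, got {len(outputs)}"
        )
    return [
        {
            "prompt": prompt,
            "completion": output["text"],
            # The completion usually starts with a space, so plain
            # concatenation gives the full continued sentence.
            "full_text": prompt + output["text"],
        }
        for prompt, output in zip(prompts, outputs)
    ]
```

With the variables from the cell above, `records = pair_results(prompts, outputs)` yields a list that can be written to JSON or loaded into a dataframe.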

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Academy of Sciences. Paris is a cultural and historical center with a rich history dating back to ancient times. It is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is

Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability.

2. Greater integration with human decision-making: AI will continue to become more integrated with human decision-making, allowing for more complex and nuanced decision-making.

3. Increased use of AI in healthcare: AI will be used to improve the accuracy and speed of medical diagnosis and treatment, and
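To illustrate what "overlap removal" means here, the sketch below merges streamed chunks whose beginnings may repeat the end of the text received so far. It is an illustration under that assumption and not necessarily how `stream_and_merge` is implemented; the names `strip_overlap` and `merge_chunks` are hypothetical.

```python
from typing import Iterable


def strip_overlap(accumulated: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that repeats the end of `accumulated`."""
    max_len = min(len(accumulated), len(chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing repeated text at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged
```

For example, `merge_chunks(["The cap", "capital of", " of France"])` produces `"The capital of France"`. A suffix match like this cannot distinguish a re-sent fragment from a genuine repetition in the generated text, so it is only appropriate when chunks are known to overlap.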



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [job title] at [Company]. I am [job title] in [company name], and I have been working for [company name] for [number of years] years. I have always been passionate about [field of interest] and have always been driven by [specific motivation or purpose]. I enjoy [motivation or purpose] and I always strive to [specific behavior or achievement]. I am [age] years old, but I am always ready to learn and grow. I am [gender] and I love to [occupation]. What excites me the most about my job is [mot

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

[Note: The instruction requires a factual statement about a specific city, but there is a minor oversight in the given instruction. The correct statement about Paris is that it is the capital of France. The task is to provide a concise factual stat
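When requests arrive independently, for example in a custom server, each one can await the engine on its own. The sketch below caps the number of in-flight calls with a semaphore. It assumes that `async_generate` also accepts a single prompt string and then returns a single output dictionary; the helper name `generate_bounded` and the `max_concurrency` parameter are hypothetical.

```python
import asyncio
from typing import Any


async def generate_bounded(
    engine: Any,
    prompts: list[str],
    sampling_params: dict[str, Any],
    max_concurrency: int = 8,
) -> list[Any]:
    """Issue one async_generate call per prompt, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt: str) -> Any:
        # The semaphore blocks here once the limit of in-flight calls is reached.
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(generate_one(p) for p in prompts))
```

For a fixed list of prompts, passing the whole list to `async_generate` as in the cell above is simpler and lets the engine schedule the batch itself; the bounded variant is mainly useful when the caller wants to limit load explicitly.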

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [Company]. I'm excited to meet you and help you in any way I can. Please let me know what you're looking for in a job interview and I'll do my best to provide you with a tailored experience. 😊✨

---

Note: Replace [Name] with your real name, [Name] with your fictional name, [job title] with your job title, [Company] with your company name, and [Job] with your job title. This introduces you as a neutral self-introduction for a fictional character in a neutral manner, acknowledging your real

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower and the Louvre museum. It's the largest city in both the European Union and the United Nations. The city is home to the ancient Roman Forum, the Eiffel Tower, and the Louvre. Despite its historical significance, Paris has made significant progress in recent years, including the opening of the Orangerie museum and the transformation of the Cité de l'Armes into the Paris Métro system. French cuisine, especially the cuisine of the Basque region, is also well-known. The French language is also spoken in many regions of France and abroad. Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a combination of existing technologies and new developments. Here are some possible future trends in AI:

1. Autonomous vehicles: Self-driving cars may become more common in the coming years, using AI to navigate roads, handle traffic, and make decisions on the road.

2. Virtual assistants: AI-powered virtual assistants may become more sophisticated and integrated into our daily lives, using speech recognition, natural language processing, and machine learning to understand our needs and provide helpful information.

3. Smart homes: AI-powered smart home devices may become more integrated into our homes, using sensors and cameras to detect energy usage, keep the home safe, and provide
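If the full text is needed after streaming finishes, the chunks can be accumulated while they are displayed. The sketch below works with any async iterator of strings, such as the one returned by `async_stream_and_merge` in the cell above; the helper name `collect_stream` and its `on_chunk` callback are hypothetical.

```python
from typing import AsyncIterator, Callable, Optional


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: list[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            # For example, print the chunk as soon as it arrives.
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


# Possible usage inside an async function, with the names from the cell above:
#   text = await collect_stream(
#       async_stream_and_merge(llm, prompt, sampling_params),
#       on_chunk=lambda c: print(c, end="", flush=True),
#   )
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.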




In [6]:
llm.shutdown()
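
In scripts, it can help to guarantee that `shutdown()` runs even if generation raises an exception. The following is a minimal sketch using a context manager; the helper name `managed_engine` and its `factory` argument are hypothetical, and the only engine method it relies on is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and always call its shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


# Possible usage with the engine from this document:
#   with managed_engine(
#       lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
#   ) as llm:
#       outputs = llm.generate(prompts, sampling_params)
```

`contextlib.closing` is not a drop-in alternative here because it calls `close()` rather than `shutdown()`.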