# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
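
As a rough illustration of the custom-server idea, the request-handling core can be a plain function that parses a JSON body, calls the engine, and serializes the result. This sketch is not the linked `custom_server.py`: `handle_generate`, the request schema, and `StubEngine` are hypothetical names chosen so the snippet runs without a GPU. The only engine behavior assumed is the one used later in this document, where `generate(prompts, sampling_params)` returns one dict with a `text` key per prompt.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document:
        # one dict with a "text" key per prompt, in order.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body.

    Assumed (hypothetical) request schema:
    {"prompts": [...], "sampling_params": {...}}
    """
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"texts": [output["text"] for output in outputs]}
    return json.dumps(response).encode("utf-8")


if __name__ == "__main__":
    raw = json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8")
    print(handle_generate(StubEngine(), raw).decode("utf-8"))
```

Keeping the handler independent of any web framework means the same function could be called from whichever HTTP layer the custom server uses.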



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
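
The patch is needed because environments such as Jupyter kernels typically already have an event loop running, and the standard library refuses to start a second one inside it. The stdlib-only sketch below reproduces that error without SGLang; `inner` and `nested_run_error` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def nested_run_error():
    """Call asyncio.run() from inside a running loop and return the error text."""
    coro = inner()
    try:
        asyncio.run(coro)  # a second loop cannot be started inside a running one
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(asyncio.run(nested_run_error()))
```

`nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted.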

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lisyah, a 13-year-old girl from Indonesia. I'm interested in music and I love to sing in the choir. I love to travel. I have the chance to go to a concert and the concert is so amazing and I love to watch the audience sing and dance. What are you interested in? In my free time, I like to draw pictures and listen to music. I also love to learn new things. I like to collect stamps and I love to collect coins. What are your hobbies and interests? I am going to a music school to learn about music and I want to become a singer. I would
Prompt: The president of the United States is
Generated text:  seeking to improve the efficiency and sustainability of the nation’s energy consumption and production. To that end, he has asked the National Oceanic and Atmospheric Administration (NOAA) to develop a new energy efficiency standard for the electric grid, and to work with the Department of Energy to develop a new energy efficiency standard for the transpo
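
The engine schedules a submitted batch internally, so splitting the work is optional. For very large prompt lists, though, one option is to feed the engine in fixed-size chunks and keep each result paired with its prompt. This is a minimal sketch under assumptions: `generate_in_chunks`, `chunk_size`, and `StubEngine` are hypothetical, and the only engine behavior relied on is the one shown above, where `generate(prompts, sampling_params)` returns one dict with a `text` key per prompt, in order.

```python
from typing import Dict, Iterator, List, Tuple


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; upper-cases prompts instead of generating."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": p.upper()} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


if __name__ == "__main__":
    demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
    for prompt, text in generate_in_chunks(StubEngine(), demo_prompts, {"temperature": 0.8}):
        print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Because the helper is a generator, results from early chunks can be processed or written out before later chunks are submitted.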

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Skill] with [Number] years of experience in [Field]. I am passionate about [Interest or Hobby] and I enjoy [Reason for Passion]. I am always looking for new challenges and opportunities to learn and grow. I am a [Personality] and I am [Motivation]. I am [Appearance] and I am [Physical Description]. I am [Religion or Belief] and I am [Ethical Code]. I am [Family Background] and I am [Personal Values]. I am [Career Goals] and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, culture, and politics, and is home to many of the world's most famous museums and attractions. Paris is a bustling metropolis with a rich history and diverse population, and is a popular tourist destination. The city is also home to many international organizations and institutions, including the European Union and the United Nations. Paris is a vibrant and dynamic city that continues to grow and evolve as a major global city.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and adaptive AI systems that can better understand and respond to human needs.

2. Enhanced capabilities in natural language processing: AI is likely to become even more capable in natural language processing, allowing machines to understand and respond to human language in ways that are more complex and nuanced than ever before. This could lead to more sophisticated and intelligent AI systems that can better understand and respond to human emotions and motivations.

3. Greater
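
"Overlap removal" in the cell above refers to dropping text that a newly streamed chunk repeats from what has already been received. The real `stream_and_merge` helper lives in `sglang.utils`; the following is only an illustrative sketch of the idea on plain strings, with hypothetical names `strip_overlap` and `merge_chunks`, and it may differ from the library's implementation.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of chunk that does not repeat the tail of existing."""
    longest = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping any prefix that repeats the merged text's tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello, wor", "world", "d!"]))
```

Checking the longest candidate overlap first means that a chunk repeating several trailing characters is trimmed in full rather than by a single character.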



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm an experienced [occupation or profession] with over 10 years of experience in the industry. I'm a hardworking, dedicated, and professional individual who has consistently met and exceeded the expectations of my clients and colleagues alike. My passion for learning and my commitment to continuous improvement have made me a valuable asset to any organization I'm a part of. If you're looking for a trustworthy, reliable, and successful professional, look no further than me. I'd love to hear from you! What is the name of your fictional character, and what is their occupation or profession? Please provide their details, such

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a major city in Europe and is famous for its historical and cultural landmarks, including the Eiffel Tower and 
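
Because `async_generate` is awaitable, several independent requests can be in flight at once. The sketch below bounds that concurrency with `asyncio.Semaphore` and collects results with `asyncio.gather`. `StubAsyncEngine`, `generate_bounded`, and the `max_in_flight` parameter are hypothetical; the stub only imitates the call shape used above, where `await llm.async_generate(prompts, sampling_params)` returns one dict per prompt.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable async_generate."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return [{"text": f" [{len(p)} chars]"} for p in prompts]


async def generate_bounded(
    engine, batches: List[List[str]], sampling_params: Dict, max_in_flight: int = 2
) -> List[List[str]]:
    """Run one async_generate call per batch, at most max_in_flight at a time."""
    gate = asyncio.Semaphore(max_in_flight)

    async def run_one(batch: List[str]) -> List[str]:
        async with gate:
            outputs = await engine.async_generate(batch, sampling_params)
        return [output["text"] for output in outputs]

    # gather returns results in the same order as the input batches.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


if __name__ == "__main__":
    demo = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
    print(asyncio.run(generate_bounded(StubAsyncEngine(), demo, {"temperature": 0.8})))
```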

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Career] at [Company] who has [number] years of experience. In my current role, I'm responsible for [Key Responsibility or Area of Focus]. I'm always up-to-date on [Current Task or Objective]. Looking forward to meeting you! 😊

Hey there! My name is [Name], and I'm a [Career] at [Company]. I have [number] years of experience in [current role]. In my current role, I'm responsible for [Key Responsibility or Area of Focus]. I'm always up-to-date on [Current Task or Objective]. Looking forward to meeting

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe by area and population, and one of the world’s most populous cities. Paris is known for its ancient history, Renaissance architecture, and vibrant French culture. It is also home to the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and many other famous landmarks. Paris is a popular tourist destination and a cultural hub for France and Europe. It is the French capital and seat of government and is home to the French Parliament, the city hall, and the Supreme Court of France. The city is also known for its fashion, cuisine, and music. As of 20

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  shaped by a complex interplay of technological progress, cultural shifts, and ethical considerations. Here are some possible future trends in AI:

1. Increased integration: AI will continue to integrate more into our daily lives, from self-driving cars to virtual assistants, becoming increasingly ubiquitous. This integration will require us to adapt our algorithms, data, and communication styles to interact with AI systems.

2. Autonomous systems: AI will likely become more autonomous, able to operate on their own without human intervention. Autonomous systems will likely replace human control over AI, allowing for greater efficiency and flexibility.

3. Data and privacy concerns: As AI systems become more advanced,
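
When the streamed text is also needed afterwards (for logging or post-processing), the chunks can be accumulated while they are printed. This sketch works with any async iterator of strings, which is how `async_stream_and_merge` is consumed in the cell above; `fake_stream` and `collect_stream` are hypothetical names, and the fake stream replaces the engine so the snippet runs without a model.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an engine stream: yields chunks one by one."""
    for chunk in chunks:
        await asyncio.sleep(0)  # let other tasks run between chunks
        yield chunk


async def collect_stream(
    stream: AsyncIterator[str],
    on_chunk: Callable[[str], None] = lambda chunk: None,
) -> str:
    """Forward each chunk to on_chunk as it arrives and return the full text."""
    parts: List[str] = []
    async for chunk in stream:
        on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    full_text = asyncio.run(
        collect_stream(
            fake_stream([" Paris", ".", " It", " is"]),
            on_chunk=lambda chunk: print(chunk, end="", flush=True),
        )
    )
    print()  # newline after the streamed output
```

Printing with `end=""` and `flush=True`, as in the cell above, keeps the chunks on one line and makes each one visible as soon as it arrives.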




In [6]:
llm.shutdown()
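
To make sure `shutdown()` runs even when generation raises an exception, the engine can be wrapped in a small context manager. A minimal sketch, assuming only that the engine object exposes `shutdown()` as used above; `managed_engine` and `StubEngine` are hypothetical names.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    with managed_engine(StubEngine) as engine:
        print("inside block, shut down:", engine.is_shut_down)
    print("after block, shut down:", engine.is_shut_down)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the constructor call from the first cell.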