# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio

If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), apply `nest_asyncio` first:

```python
import nest_asyncio

nest_asyncio.apply()

```
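
The restriction behind this patch can be reproduced with the standard library alone. The sketch below is not SGLang code; the function names `inner` and `outer` are made up for illustration. It shows that plain `asyncio` refuses to start a second event loop while one is already running, which is the situation inside an IPython or Jupyter cell.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, just as it is inside a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine was never scheduled; close it cleanly
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted.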

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Fred and I am an IT Technician with over 10 years experience in the field. I have knowledge of the core principles, technologies and tools involved in the software development process, and I have successfully helped develop a software product that is in use for over 10 years. My educational background in Computer Science at the University of Illinois Urbana-Champaign is what gave me the knowledge required to successfully develop the software. My main area of expertise is in software architecture, design, and testing.
I have a keen eye for detail and am capable of troubleshooting and resolving a wide range of software issues, including but not limited to bugs, performance
Prompt: The president of the United States is
Generated text:  a leader of a group of people called the United States Congress. The president also has the power to veto a bill passed by Congress. The president has the power to appoint the people who serve in the government. Th

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a major tourist destination. It is also home to many famous French artists, writers, and musicians. The city is known for its cuisine, including its famous croissants and its traditional French dishes. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city of innovation and creativity, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations and tasks.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, but there is a growing trend towards using AI to assist in diagnosis, treatment, and patient care.

4. Greater focus on AI
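
The banner printed above mentions "overlap removal", which suggests that streamed chunks may repeat text that has already been received. The sketch below shows one way such overlap-aware merging could work. It is an illustration under that assumption, not the implementation in `sglang.utils`; the names `strip_overlap` and `merge_stream` and the sample chunks are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text that repeats the end of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous one.
fake_stream = ["The capital", " capital of", " of France", " is Paris."]
merged = merge_stream(fake_stream)
print(merged)  # The capital of France is Paris.
```

A suffix/prefix match like this cannot tell a repeated chunk from text the model genuinely generated twice in a row, so it is only appropriate when chunks are known to overlap.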



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Job Title] in [Company]. Before joining the company, I was a [Previous Job Title] in [Company], and I've always been [X] in my career. I strive to be [X] by working hard and always staying [X]. I'm always looking for new opportunities to grow and develop. I enjoy [Job Title] and I'm excited to be here at [Company]. Looking forward to the future together! 🌟✨

Hey there, I'm [Name], a [Job Title] in [Company]. Before joining, I worked in [Previous Job Title

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the second-largest city in the European Union. It is also the capital of the Paris Region and is located in the Seine-et-Oyse valley in the department of Seine-Saint-Denis. The city is known for its rich history, beautiful architecture, and vibr
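
Because `async_generate` is awaitable, several batches can in principle be submitted from the same event loop and collected together. The sketch below illustrates that pattern with `asyncio.gather`. To stay runnable without a GPU it uses `FakeEngine`, a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out); `run_batches` is also a made-up helper name. Whether the real engine handles concurrent calls as efficiently as one large batch is not stated here and should be measured.

```python
import asyncio
from typing import Dict, List


class FakeEngine:
    """Stand-in for the real engine so the sketch runs without SGLang."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Submit every batch at once; gather returns results in submission order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
results = asyncio.run(run_batches(FakeEngine(), batches, sampling_params))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```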

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  _______ (fill in your name). I am a/an ________ (fill in your profession or role).

I work at a/an ________ (fill in your location). I am a/an ________ (fill in your role or profession). I have a/an ________ (fill in your experience or background). I am a/an ________ (fill in your age). I am a/an ________ (fill in your gender). I am a/an ________ (fill in your religion or cultural background). I am a/an ________ (fill in your nationality). I am a/an ________ (fill in your interests or hobbies).



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

**Explanation for an AI model:**
This statement is a factual fact about the capital city of France. It provides the answer to the question without any interpretation or elaboration. The answer can be verified by checking the historical and cultural context of Paris, which is undoubtedly one of the most influential cities in Europe. The statement is presented in a straightforward and unambiguous manner, adhering to the guidelines provided.

**Additional constraints:**
- Ensure that the statement is grammatically correct and follows proper spelling rules.
- Do not include any subjective or opinion-based statements.
- Ensure the answer is accurate and current as of the time



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of interesting trends that are still being developed and refined. Here are some of the potential future trends in AI:

1. Augmented Intelligence: In the future, AI will become more and more integrated into our daily lives. Augmented intelligence will involve AI systems that are able to provide additional information and insights to humans. For example, a smart home system that can help you with your daily tasks, or a doctor's assistant that can help you find the best treatment options.

2. AI in Healthcare: AI will play a critical role in healthcare in the future. We will see the development of more advanced medical algorithms that




```python
llm.shutdown()
```