# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that runs inside an already-running event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
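
To illustrate why the patch is needed, the sketch below uses only the standard library and does not touch SGLang. It calls `asyncio.run()` from inside a coroutine that is already running on an event loop, which is the situation in IPython and Jupyter. The names `work` and `outer` are illustrative assumptions, not part of any SGLang API.

```python
import asyncio


async def work():
    # Stand-in for any coroutine you would like to run to completion.
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = work()
    try:
        asyncio.run(coro)  # nested call: stock asyncio refuses this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Stock `asyncio` rejects the nested call with a `RuntimeError`. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.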

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Generate completions for all prompts in a single batch
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Niall and I am a postgraduate student in the Department of Informatics, University of Cambridge. I graduated from the University of Cambridge in 2018 with a BSc in Computer Science and my current research focuses on developing new approaches to the automated generation of Turing complete functional programs.

I am currently working on the project "Turing Proving Ground" which is focused on the development of a computational method to automatically generate Turing complete functional programs. The project is using both heuristic and neural approaches to explore the space of possible programs, which is called the space of Turing programs.

What are the main challenges faced by the project and
Prompt: The president of the United States is
Generated text:  an elected official who represents the country. Presidents are chosen by the people through a democratic process. The president is the head of the executive branch of the government, and their p

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill/Ability] who has been [Number of Years] years in the field of [Field of Interest]. I'm passionate about [Why I'm Passionate About My Field]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [Favorite Hobby/Activity] that I enjoy doing. I'm also a [Favorite Book/Artist/Writer/Artist] that I love to read and learn from. I'm a [Favorite Movie/TV Show/Book/Artist] that I enjoy watching and reading.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Flottante" (floating city) due to its floating population of people. It is the largest city in Europe and the third largest in the world, with a population of over 2.7 million people. Paris is known for its rich history, art, and culture, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major financial center and a major tourist destination. Paris is a popular tourist destination and is often referred to as the "city of love" due to its romantic and romantic atmosphere. The city is also home to many famous

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the potential future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare to transportation. This will lead to increased efficiency, cost savings, and job displacement, but it will also create new opportunities for innovation and creativity.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we can expect
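
The "overlap removal" mentioned above can be pictured with a small, self-contained sketch. It assumes that consecutive streamed chunks may repeat the tail of the text received so far, and it trims that repeated prefix before appending. `strip_overlap` and `merge_chunks` are hypothetical helpers written for illustration; they are not presented as SGLang's actual implementation of `stream_and_merge`.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual heuristic like this cannot tell an accidental overlap from an intentional repetition (for example, two consecutive chunks that both contain "ha"), so treat it as an illustration of the idea rather than a drop-in utility.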



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Name]. I am a [Type] personality. I am [Name] and I have always been passionate about [What you do for a living]. I enjoy [The reason why you do this]. I am [Name] because I am [Name], and I am [Name] and I believe in [What you believe in]. I am a [Name] with a unique style and personality that sets me apart from others. I am [Name] and I am [Name]. I am here because I want to share my story and see the world through a different lens. I am [Name]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and its capital. It serves as the political, cultural, and economic center of the country. Known for its iconic landmarks such as the Eiffel Tower, the Notre-Dame Cathedral, and the Louvre Museum, Paris has a rich history and a diverse population of around 10 
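
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below shows that pattern with the standard library only. `fake_async_generate` is a hypothetical stand-in that mimics the shape of the results used above (a list of dicts with a `"text"` key); it is an assumption for illustration and does not call SGLang.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for an engine's asynchronous generate call."""
    await asyncio.sleep(0.01)  # pretend the engine is busy
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    """Unrelated coroutine that keeps running while generation is pending."""
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0)  # yield control back to the event loop


async def main():
    log = []
    prompts = ["The capital of France is", "The future of AI is"]
    # Run generation and the unrelated task concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
for output in outputs:
    print(output["text"])
```

With a real engine, the stand-in would be replaced by an awaited call such as `llm.async_generate(prompts, sampling_params)`, as in the cell above.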

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first and last name], and I am [insert occupation or major industry], with a passion for [insert something related to your occupation or major industry, such as literature, music, or writing]. I believe that my skills and experience in [insert something related to your occupation or major industry, such as creative writing, storytelling, or research] have enabled me to bring my unique voice to the table, which is why I am thrilled to be here today. I look forward to learning more about your work and contributing to the conversation. Please let me know if you'd like me to share any personal anecdotes or stories from my past work

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris is the capital city of France and is known for its iconic Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. It's a bustling metropolis with a rich history and culture. How can I assist you further? 😊✨

Does Paris have any notable landmarks or attractions that are popular among tourists? Yes, Paris has a multitude of iconic landmarks and attractions that are popular among tourists. Some of the most famous ones include:

1. Eiffel Tower - The iconic tower is the centerpiece of Paris and is also a UNESCO World Heritage site.

2. Louvre Museum

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  fascinating and likely to unfold in a number of ways, some of which include:

1. Increased focus on ethical and societal impacts: With the increasing awareness of AI's impact on society, it is likely that AI researchers and developers will continue to focus on ethical considerations such as bias, transparency, and privacy.

2. Greater use of AI in healthcare: With more people living longer and with advanced treatments, there is a potential for AI to be used in healthcare to improve patient outcomes and reduce medical errors.

3. Development of more advanced AI models: As AI technology advances, there may be the potential for even more powerful and sophisticated AI models to

```python
# Shut down the engine when finished
llm.shutdown()
```