# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
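
To illustrate why the patch is needed, here is a minimal standard-library sketch (it does not use SGLang or `nest_asyncio`). It assumes only that an event loop is already running, as is the case inside an IPython or Jupyter cell, and shows that plain `asyncio` refuses to start a second loop from within it. The function names `inner` and `outer` are placeholders for this illustration.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` first patches `asyncio` so that such nested calls are allowed.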

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Matthew and I'm here to help you with any questions you may have about Venn diagrams. If you're using a computer screen, I'll be here to answer your questions in real time. Do you have a specific question in mind? Let me know and I'll do my best to help you! Or, if you prefer, feel free to ask me anything. I'll be here to answer in the usual way! Have a great day! 😊👍😊💡🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔
Prompt: The president of the United States is
Generated text:  currently 45 years old. How old will the president be in 20 years if he has a son and a daughter who are half the age of him and her current ages, respectively?

Let's start by defining the current ages of the president and his son, daughter, and mother.

Let the current age of the president be \( P \).

1. The current age of the president's son is \( \frac{P}{2} \).
2. The current age of the president's daughter is \( \frac{P}{2} \).

According to the problem, in 20 years, the presid

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I have been working in this field for [number of years] years and have always been passionate about [job title] and have always been committed to [job title] goals. I am always looking for ways to [job title] and have always been open to learning new things and trying new things. I am always eager to learn and grow, and I am always looking for ways to contribute to the success of [company name]. I am a [job title] who is always looking for ways to [job title] and have always been committed to [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. Paris is also a major center for fashion, art, and cuisine, and is home to many famous museums, theaters, and restaurants. The city is also known for its annual festivals and events, including the Eiffel Tower Parade and the Carnaval de Paris. Paris is a vibrant and dynamic city that continues to thrive as a major global city.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced privacy and security: As AI systems become more integrated with human intelligence, there will be an increased need for privacy and security measures to protect personal data and prevent misuse of AI systems. This could lead to new privacy laws and regulations being developed to ensure that AI systems are used ethically and responsibly.

3. Greater reliance on AI
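
The "overlap removal" mentioned in the cell above can be pictured with a small, self-contained sketch. It assumes that a streamed chunk may repeat text that has already been received, and it trims the longest such repetition before appending. The helpers `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical names and data for illustration; this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text each chunk repeats from what came before."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
chunks = ["The capital", "capital of France", "France is Paris."]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris.
```

One caveat of this naive approach: it cannot tell an accidental repeat from an intentional one, so a chunk that legitimately begins with the text that ended the previous output would also be trimmed.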



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [insert occupation here]. I'm passionate about [insert something here, like reading, writing, or sports], and I enjoy making people laugh. I'm always eager to learn new things and expand my horizons. What brings you to this world? It's a world full of interesting people and exciting adventures, and I'm here to explore and discover what lies within me. I'm excited to be here, and I look forward to sharing my experiences and insights with you. How about you? [Name]? 

---

### Introduction

**Name:** [Your Full Name]
**Occupation:** [Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its rich history, iconic architecture, and diverse cultural scene. Paris is located on the Mediterranean coast, and it has been a major city for over 1000 years. The city is home to man

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a/an [职业] living in [your location]. I have always been passionate about [the thing you enjoy doing]. I'm [your age], [a number] years old, and I love [your hobby or interest]. I have [number] friends, [the reason for having [the number of friends] friends] and I always strive to [the reason for wanting to improve] [the reason for wanting to improve] [your hobby or interest]. I have a lot of [the thing you're good at or interested in]. I enjoy [the thing you like to do] with [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historic city with a rich cultural heritage, located in the center of the country. The city is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and Sacré-Cœur Basilica. It is a cosmopolitan city with a diverse population that is home to many languages, religions, and ethnic groups. The city is also famous for its fine cuisine, art, and music. Paris is a UNESCO World Heritage Site and a major cultural and economic hub of Europe. The city is also known for its beautiful gardens and parks, including the Bois de Boulog



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and is set to continue to evolve at a rapid pace. Here are some potential trends in AI that could shape our future:

1. Increased Integration of AI into Everyday Life: AI is already making its way into our daily lives, such as through voice assistants, smart home devices, self-driving cars, and chatbots. As technology continues to advance, we can expect AI to become more integrated into our daily routines, making our lives easier and more efficient.

2. AI Will Be More Informed: As AI becomes more sophisticated, we can expect it to become more aware of our behavior and emotions, and to learn from our experiences and




In [6]:
llm.shutdown()