# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
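
For context, the error this guards against can be reproduced with the standard library alone. The sketch below calls `asyncio.run()` from inside an already running event loop, which is the situation in IPython or Jupyter, and captures the resulting `RuntimeError`. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by asyncio
        return "nested run succeeded"
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` first patches asyncio so that such nested calls are allowed.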

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Renz. I am a sophomore at a private high school in San Francisco. I am a college majoring in biochemistry. I am a shy, introverted person and I find it difficult to connect with my peers and the rest of the world. I have struggled with anxiety, depression, and social anxiety. I am seeking help to overcome my struggles and to become more comfortable with myself.
Please create a message for me to email with to discuss my concerns. Here are the points I need to mention: I am a 26 year old female, I was born in 1994 and I was born in New York
Prompt: The president of the United States is
Generated text:  a person who can become a member of the U. S. House of Representatives.
A. True
B. False
Answer:

B

In the process of a train traveling from City A to City B, if the train is 50 meters ahead of the driver and starts decelerating at a speed of 10 meters per second squared, how many seconds will it take to come to a complete stop? 
A. 50 seconds
B.
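
For batch jobs, results are usually written to disk rather than printed. The following is a minimal sketch, assuming `outputs` is a list of dictionaries with a `"text"` key in the same order as `prompts`, as the loop above suggests. The `outputs` list here is placeholder data standing in for a real `llm.generate` call, and `to_jsonl` is a hypothetical helper, not an SGLang function.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


prompts = ["Hello, my name is", "The capital of France is"]
# Placeholder data shaped like the engine output used in this document.
outputs = [{"text": " Alice."}, {"text": " Paris."}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

Keeping the prompt next to its completion in each record makes the file self-describing, so it can be filtered or re-scored later without the original prompt list.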

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and is home to many famous landmarks and historical sites. The city is known for its rich history, art, and cuisine, and is a major hub for international business and diplomacy. Paris is a vibrant and dynamic city with a rich cultural and artistic heritage. The city is also known for its fashion industry, with many famous designers and boutiques. Paris is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and context-aware AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical and social implications: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical and social implications. This could lead to more rigorous testing and evaluation of AI systems, as well as greater consideration of the potential impact on society.

3. Increased use of AI in healthcare: AI is
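
The "overlap removal" mentioned in the cell above refers to dropping text that a new chunk repeats from what has already been received. The sketch below shows one way such merging could work. The function names and the sample chunks are illustrative, and this is not necessarily how `stream_and_merge` is implemented internally.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks in which each one repeats the tail of the previous one.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

Note the trade-off in this approach: a chunk that legitimately begins with the same text the output currently ends with is trimmed as well.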



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year-old [Occupation]. I'm currently [Current Location], [Job Title], [Professional Experience] [Education], [Favorite Reading/Writing/Exhibition], [Favorite Hobby], and [Favorite Quote]. Can you tell me more about yourself? Oh, and one last thing. I'm always looking for new adventures and love to explore new places. That's it! I'm ready to share more about you. [Name] [What's Your Profession? How Much Experience Do You Have? Age? Occupation? What Job Title? How Many Years of Experience Do You Have? Education

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe by population and is home to many world-renowned landmarks and cultural attractions. Paris is known for its rich history, art, and cuisine, and it is also home to a diverse population of people 
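
One benefit of an awaitable API is that other coroutines can make progress while a request is in flight. The sketch below illustrates this with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that sleeps instead of calling a model, so the example runs without a GPU; whether a real engine call can replace it one-to-one is an assumption to verify against the SGLang examples.

```python
import asyncio
import time


async def fake_async_generate(prompt, delay=0.1):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts):
    # gather schedules all coroutines at once and preserves input order
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
start = time.perf_counter()
results = asyncio.run(run_concurrently(prompts))
elapsed = time.perf_counter() - start

for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
print(f"elapsed: {elapsed:.2f}s")
```

Because the three simulated requests overlap, the total time is close to a single delay rather than the sum of all three.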

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [insert the appropriate profession or occupation]! As a dedicated reader and enthusiast, I'm always on the lookout for unique and challenging experiences, and I'm always here to share my knowledge and insights with you. Whether it's about the latest trends in technology, the latest discoveries in science, or the best ways to live a fulfilling life, I'm here to help you explore and discover more about the world around us. How can I assist you today? Let's do it together! 📚📚📚 Let's connect! 💬🤔🤔 These are some of the things that I enjoy doing.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as “La Ville-Europeenne” and officially known as “Paris, the Big Apple” as it is the main city of France. Paris is the second largest city in Europe after London and the sixth most populous city in the world. It is a major cultural center, financial center, and the seat of government of France. Paris has a rich history dating back to the 6th century BC. Paris is home to many world-renowned landmarks such as the Eiffel Tower, Notre Dame Cathedral, Louvre Museum, and many others. The city is also known for its cuisine and its fashion industry. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends that are expected to shape the technology and applications it will play in. Here are some potential trends:

1. Increased Personalization: AI will continue to enable personalized learning experiences, product recommendations, and other services based on individual preferences and data.

2. Improved Efficiency: AI-powered automation will become more advanced and efficient, allowing businesses to cut costs, reduce errors, and streamline processes.

3. Enhanced Mental Health: AI will play a more significant role in treating mental health disorders and improving the quality of life for individuals with these conditions.

4. Increased Transparency: AI systems will become more transparent, allowing users
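
The pattern in the cell above, iterating with `async for` and printing each chunk as it arrives, works with any asynchronous generator. Below is a minimal sketch with a hypothetical `fake_token_stream` in place of the engine. It also collects the chunks so that the full text is available once streaming ends.

```python
import asyncio


async def fake_token_stream(tokens, delay=0.01):
    """Hypothetical stand-in for a streaming generation call."""
    for token in tokens:
        await asyncio.sleep(delay)  # simulates time between tokens
        yield token


async def consume(tokens):
    pieces = []
    async for chunk in fake_token_stream(tokens):
        print(chunk, end="", flush=True)  # show output incrementally
        pieces.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(consume([" Paris", " is", " the", " capital", "."]))
```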




In [6]:
llm.shutdown()