# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
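
To make the idea of a custom server concrete, here is a minimal sketch that is independent of the linked example and uses only the Python standard library. `fake_generate` is a stand-in for the engine, and `handle_generate`, `make_handler`, and `serve` are hypothetical names chosen for illustration; in a real server the `generate_fn` argument would presumably wrap a call such as `llm.generate(prompt, sampling_params)`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_generate(prompt, sampling_params):
    # Stand-in for the engine (assumption): mirrors the {"text": ...} output
    # shape used in this document, but does no real inference.
    return {"text": f" (echo of: {prompt})"}


def handle_generate(raw_body, generate_fn):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    output = generate_fn(prompt, payload.get("sampling_params", {}))
    return 200, json.dumps({"text": output["text"]}).encode()


def make_handler(generate_fn):
    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the request body and delegate to the pure function above.
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_generate(self.rfile.read(length), generate_fn)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return GenerateHandler


def serve(generate_fn, host="127.0.0.1", port=8000):
    # Blocks until interrupted; not called here so the sketch returns at once.
    HTTPServer((host, port), make_handler(generate_fn)).serve_forever()


# Exercise the request handling without opening a socket.
status, body = handle_generate(b'{"prompt": "Hello"}', fake_generate)
print(status, body.decode())
```

Keeping the request handling in a plain function makes it testable without a running server; the HTTP layer only moves bytes in and out.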



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
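
The sketch below reproduces the underlying problem using only the standard library, with no SGLang involved. Calling `asyncio.run()` while an event loop is already running, as is the case inside a notebook cell, raises a `RuntimeError`; `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed. The names `inner`, `outer`, and `message` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, much like in a notebook.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```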

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sam, I am 26 years old and I am a married man. I have been in this relationship for over 10 years and I feel very close to my partner. Despite this, I am having trouble with my relationship.

I know the problem is that I am not being honest with my partner about my feelings. I am often secretive and difficult to talk to about my thoughts and feelings. I want to be honest and open with my partner, but I don't feel comfortable sharing my secrets with them. 

I have also had a few arguments with my partner, which I find hard to resolve. I feel like my partner
Prompt: The president of the United States is
Generated text:  34 years younger than the president of Brazil. The president of Brazil is 4 times younger than the president of the European Union. If the president of the European Union is currently 20 years old, how old would the president of the European Union be in 10 years?

To determine the president of the European Union's age in 10 years
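
For offline batch jobs it is often useful to persist the results rather than print them. The following is a minimal sketch, assuming (as the `zip(prompts, outputs)` loop above does) that outputs come back in the same order as the prompts and that each output is a dict with a `"text"` key. `fake_generate` is a stand-in for `llm.generate`, and `write_results` is a hypothetical helper, not part of SGLang.

```python
import io
import json


def fake_generate(prompts, sampling_params):
    # Stand-in for llm.generate (assumption): one {"text": ...} dict per prompt.
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


def write_results(prompts, outputs, fh):
    """Write one JSON object per line, pairing each prompt with its output."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = fake_generate(prompts, {"temperature": 0.8, "top_p": 0.95})

buffer = io.StringIO()  # use open("results.jsonl", "w") for a real file
written = write_results(prompts, outputs, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(written, records[0])
```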

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about your interests and what you're looking for in a job. What can I help you with today? [Name] is looking for a [Job Title] position at [Company Name]. I'm excited to hear about your experience and what you're looking for in a job. [Name] is looking for a [Job Title] position at [Company Name]. I'm excited to hear about your experience and what you're looking for in a job. [Name] is looking for a [Job Title] position

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic city with a rich cultural heritage. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also famous for its fashion industry, art scene, and cuisine. Paris is a cultural and economic hub of France and a major tourist destination. It is home to many world-renowned museums, theaters, and art galleries. The city is also known for its annual festivals and events, such

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased automation and robotics: AI is already being used in manufacturing, healthcare, and transportation, but it is expected to continue to expand its use in these areas as well as in other industries. Robots and automation will become more sophisticated, and AI will be used to perform tasks that were previously done by humans.

2. Improved natural language processing: AI will continue to improve its ability to understand and interpret human language, allowing for more natural and intuitive interactions with machines and humans alike.

3. Enhanced privacy and security: As AI becomes more integrated into our daily lives,
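
The phrase "overlap removal" can be illustrated with a small standalone sketch. It assumes that a streamed chunk may repeat text that has already been received, and it drops the longest prefix of each new chunk that matches the end of the merged text so far. The names `drop_overlap` and `merge_stream` and the sample chunks are hypothetical; this is not necessarily how `stream_and_merge` is implemented.

```python
def drop_overlap(existing_text, new_chunk):
    """Remove the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_stream(chunks):
    """Merge an iterable of {"text": ...} chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk["text"])
    return merged


# Made-up chunks: the second and third repeat part of what came before.
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris, the"},
    {"text": " the capital"},
]
merged = merge_stream(fake_chunks)
print(repr(merged))
```

One caveat of this heuristic: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be trimmed as well.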



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm an aspiring [Field of Interest] writer. I'm excited to start my journey and explore the world of [Your Field of Interest], whether it's the thrill of new discoveries or the joy of sharing your ideas with the world. What inspired you to become a [Field of Interest] writer, and what's your most exciting project so far? Happy writing! 🧪✨

Hey there, folks! I'm **[Your Name]**, a bit of a literature buff with a passion for [Your Field of Interest]. I'm always on the lookout for the next great project, whether it's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, iconic landmarks, and vibrant culture. It's the country's largest city and is home to the Eiffel Tower, Louvre Museum, and the Place de la Concorde. Paris is also known for its fashion industry and is
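
An asynchronous API also makes it possible to issue requests independently, for example when prompts arrive one at a time in a custom server. The sketch below shows one way to run per-prompt coroutines concurrently with a cap on how many are in flight. `fake_async_generate` is a stand-in for an awaitable call such as `llm.async_generate(prompt, sampling_params)`, and `generate_all` and `max_in_flight` are hypothetical names. Whether this is preferable to passing the whole list in one call, as the cell above does, depends on the workload.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Stand-in for an async engine call (assumption); sleeps instead of inferring.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, sampling_params, max_in_flight=2):
    """Run one request per prompt, with at most max_in_flight active at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```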

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [type of person] by nature. I enjoy [what I do best]. I'm very [moral nature], and I try to be [what I consider to be my best trait]. I'm [my favorite hobby/interest]. I'm always [what I'm looking forward to doing]. I'm excited to meet you! Let's chat! [Name] Greetings! I'm [Name], an [insert type of person] by nature. I enjoy [insert what I do best], and I'm very [insert what I consider to be my best trait]. I'm [insert what

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks and diverse cultural scene.

(Note: To expand on this statement, there's a map of Paris and the city is labeled as "Paris, France")

- Paris is often referred to as "the city of lights" for its illuminated architecture and boulevards.
- It is also known as "The Heart of France" for its historic heart of the city.
- The city has a rich history dating back over 2, 500 years and is home to many of France's most famous landmarks, including the Eiffel Tower, the Louvre Museum, and the Arc

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several key trends and developments:

1. Deep learning and machine learning: As the accuracy and speed of AI systems continue to improve, we are likely to see an increasing focus on deep learning and machine learning as a primary driver of innovation. These methods are capable of solving complex problems that are currently beyond the reach of traditional AI algorithms.

2. Real-time AI: As more data becomes available and the pace of technological advancement increases, real-time AI is likely to become a major trend. This means that AI systems will be able to analyze and respond to real-time data, enabling them to make more informed and timely decisions.






```python
llm.shutdown()
```
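
In a script, as opposed to a notebook, it can be convenient to guarantee that `shutdown()` runs even when inference fails partway through. The sketch below wraps engine creation in a context manager built with `contextlib`. `FakeEngine` is a stand-in that only mimics `generate()` and `shutdown()`, and `managed_engine` is a hypothetical helper; with the real engine, the factory would presumably be something like `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in for an engine object (assumption); does no real inference."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always call shutdown() when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


engine_ref = None
try:
    with managed_engine(FakeEngine) as engine:
        engine_ref = engine
        engine.generate(["Hello, my name is"], {"temperature": 0.8})
        raise ValueError("simulated failure during inference")
except ValueError:
    pass  # the error still propagates out of the with block, after shutdown

print(engine_ref.is_shut_down)
```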