# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
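
The need for this patch can be reproduced with the standard library alone. The sketch below (the names `inner` and `outer` are illustrative and not part of SGLang) shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter:

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    coro = inner()
    try:
        # A loop is already running here, so a nested asyncio.run() is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.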

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alex. I'm a 14-year-old girl who just finished my senior year of high school. I have always been a very organized student and have made many goals for myself in order to become a successful college student. I also believe that I have the talent to be a great athlete and so I decided to play basketball. I have been playing basketball since I was six years old and I have been practicing and playing basketball for three years now. I have become really good at it and I am really excited about playing basketball again this year. I always wanted to play basketball as a high school student, but I didn't think I would
Prompt: The president of the United States is
Generated text:  a three-person contract job. The president of the United States is selected by the 10 member cabinet members. The president can only serve two terms. The president can be replaced by the Vice President (president of the United States) only once. Suppose the current president 
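
For offline batch jobs it is often convenient to persist the results. The following is a minimal sketch, assuming each element of `outputs` is a dict with a `"text"` key as in the loop above. The sample data, the `to_jsonl` helper, and the record field names are placeholders chosen for illustration so that the snippet runs without the engine:

```python
import io
import json

# Stand-in data shaped like the values used above; in a real run these would
# come from `llm.generate(prompts, sampling_params)`.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Alex."}, {"text": " Paris."}]


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    buffer = io.StringIO()
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        buffer.write(json.dumps(record, ensure_ascii=False) + "\n")
    return buffer.getvalue()


jsonl_text = to_jsonl(prompts, outputs)
print(jsonl_text, end="")
```

Writing one record per line keeps the file appendable, so a long batch job can flush results incrementally.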

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Blanche" or "The White City." It is the largest city in France and the second-largest city in the European Union, with a population of over 2.7 million people. Paris is known for its rich history, art, and culture, and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also a major transportation hub, with numerous airports and train stations. Paris is a popular tourist destination, with millions of visitors each year. The city is also home to many important institutions of higher education,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs and preferences.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be a greater need for privacy and security measures to protect the data and information that is generated and processed by AI
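
The "overlap removal" mentioned in the banner refers to merging streamed chunks without repeating text that was already emitted. The helper below is a hypothetical illustration of that idea, not the implementation of `stream_and_merge`; the name `trim_overlap_sketch` and the sample chunks are invented for this example:

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = ""
for chunk in chunks:
    merged += trim_overlap_sketch(merged, chunk)
print(merged)
```

A greedy longest-overlap match is a heuristic: it can also trim text that the model legitimately repeats.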



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, and I'm a 28-year-old marketing professional. I have a passion for helping businesses grow their sales and build their brand. I love collaborating with clients and working with a team to create successful campaigns and landing pages. I'm confident and always ready to take on new challenges and meet new people. I hope to continue my journey as a marketing professional and provide a positive impact to businesses. How can I say "hello" to someone new? You can say "Hello, my name is Emily, and I'm a 28-year-old marketing professional. I have a passion for helping businesses grow their sales and build their

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a UNESCO World Heritage site known for its historical significance, vibrant culture, and beautiful architecture. 

This statement encapsulates the m
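
Because `async_generate` is awaitable, several batches can be awaited together with `asyncio.gather`. The sketch below uses a stand-in class, `FakeEngine`, whose `async_generate` only mimics the call shape seen above (a list of prompts in, a list of dicts with a `"text"` key out). It is an assumption made so the snippet runs without a GPU; whether real batches overlap in time depends on the engine:

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting for the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # gather() awaits all batches and returns results in submission order.
    results = await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )
    # Flatten the per-batch lists back into one list of outputs.
    return [output for batch_outputs in results for output in batch_outputs]


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
outputs = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for output in outputs:
    print(output["text"])
```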

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name], and I'm a [insert the occupation of the character]. My goal is to [insert how the character wants to accomplish their goal], and I'm excited to see what happens. I'm always up for a challenge, and I enjoy learning new things and making new friends. What's your name, and what's your occupation? I'm excited to meet you! 😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊😊


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Is the above statement true or false? Let me know.

Available options:
 a). True;
 b). False; a). True;


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be a blend of increasing sophistication, but also increasing complexity and ethical considerations. Here are some possible trends in AI in the coming years:

1. Increased sophistication: AI is expected to become more capable of understanding and performing complex tasks that were once considered impossible or very difficult. This includes tasks such as image and speech recognition, natural language processing, and decision-making.

2. Expansion of AI applications: AI is expected to be used in a wider range of applications, from healthcare to finance, transportation to entertainment, and military to defense. AI-powered systems will become more integrated into our daily lives, creating new opportunities for innovation and personalization


```python
llm.shutdown()
```