# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio

To use the **Offline Engine** in IPython, or in any other environment where an event loop is already running, apply `nest_asyncio` first:

```python
import nest_asyncio

nest_asyncio.apply()
```
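
Whether the patch is needed depends on whether an event loop is already running in the current thread. The sketch below shows one way to check this with the standard library before applying the patch. The helper names `in_running_event_loop` and `maybe_apply_nest_asyncio` are hypothetical and are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when called from inside an already-running asyncio loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, e.g. a plain `python script.py` run.
        return False
    return True


def maybe_apply_nest_asyncio() -> bool:
    """Patch asyncio only when a loop is already running; report whether it patched."""
    if not in_running_event_loop():
        return False
    import nest_asyncio  # third-party; imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain script `maybe_apply_nest_asyncio()` does nothing, so calling it unconditionally at the top of shared code should be harmless under that assumption.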

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Dario Haab, and I'm a social worker in the San Francisco Bay Area. I'm in the process of learning a new language. I'm considering whether to learn Spanish or French. Can you help me decide which language to learn? I'm also interested in using a French as a second language. Which language should I learn, and why?
Certainly! Learning a new language is a wonderful experience and it's great to see people from different backgrounds learn new languages. Here are some pros and cons of learning Spanish or French, and how you might consider which one might be best for you:

Pros of learning Spanish:

1. Practical
Prompt: The president of the United States is
Generated text:  facing a tough election year, and the candidates have proposed different strategies for winning the election. Candidate A suggests that they will focus on rural areas and communities, while Candidate B suggests that they will focus on urban areas and industries. Which strategy is m
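
In the example above, `llm.generate` returns one output per prompt, and each output exposes the generated string under the `"text"` key. For offline batch jobs it is often convenient to store these pairs on disk. The following is a minimal sketch of that step, assuming only the output shape used above; `to_records` and `to_jsonl` are hypothetical helper names, and the sample data stands in for a real engine call.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with the generated text of the output at the same index."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]
print(to_jsonl(to_records(sample_prompts, sample_outputs)))
```

The length check guards against silently misaligned pairs, which `zip` alone would truncate without warning.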

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] who has been [number of years] in the industry. I'm passionate about [reason for passion], and I'm always looking for ways to [action or achievement]. I'm confident in my abilities and I'm always eager to learn new things. I'm a [character trait] and I'm always ready to help others. I'm [character trait] and I'm always ready to make a difference. I'm [character trait] and I'm always ready to take on new challenges. I'm [character trait] and I'm always ready to make a positive impact on the world

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and festivals. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, diverse culture, and vibrant nightlife. It is the largest city in France and a major economic and political center in Europe. Paris is also known for its fashion industry, with many famous designers and boutiques located in the city. The city is home to many international organizations

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends in AI that are expected to shape the future:

1. Increased automation: AI is expected to become more prevalent in many industries, including manufacturing, transportation, and healthcare. Automation will likely lead to increased efficiency and productivity, but it will also lead to job displacement for some workers.

2. Enhanced privacy and security: As AI becomes more integrated into our daily lives, there will be a greater need for privacy and security. This will require advancements in AI that are designed to
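
The streaming example above mentions overlap removal. One way to picture the idea is shown below. This is an illustration only and not necessarily how `stream_and_merge` is implemented: when a streamed chunk begins with text that has already been emitted, drop the duplicated prefix before appending. The names `trim_overlap` and `merge_chunks` are hypothetical.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates a suffix of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Each chunk repeats part of what came before it.
print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

A suffix/prefix heuristic like this cannot tell a resent prefix from a genuine repetition (for example, the chunks `"ha"` and `"ha"` merge to `"ha"`), so it only suits streams where repeated prefixes are known to be duplicates.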



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I’m a passionate and creative [role], currently working as a [job title] at [company name]. My journey into this field has been filled with dedication and passion, and I’m excited to share my knowledge and experience with you. What’s your name, and where do you work? [Name] will be happy to share more about [their role] and the company they work for. [Name] is excited to introduce themselves to you, and I look forward to hearing from you! [Name] Hope you have a great day! 🌟✨

Wow, that was a great self-introduction

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the second largest city in Europe. The city is also the home of the French Parliament and the Louvre Museum. Paris is known for its historical architecture, museums, and French cuisine. It is als
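
Because `async_generate` is awaitable, several independent calls can be issued concurrently, for example when different prompt groups need different sampling parameters. The sketch below illustrates the pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that mimics only the call shape used above so the example runs without a GPU, and `generate_groups` is a hypothetical helper. Whether concurrent calls improve throughput with a real engine is not something this sketch demonstrates.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; mimics only the async_generate call shape shown above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # hand control back to the event loop once
        return [{"text": f" <reply to {prompt!r}>"} for prompt in prompts]


async def generate_groups(engine, groups):
    """Issue one async_generate call per (prompts, sampling_params) pair, concurrently."""
    calls = [engine.async_generate(prompts, params) for prompts, params in groups]
    # asyncio.gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*calls)


groups = [
    (["The capital of France is"], {"temperature": 0.2, "top_p": 0.9}),
    (["Hello, my name is", "The future of AI is"], {"temperature": 0.8, "top_p": 0.95}),
]
grouped_outputs = asyncio.run(generate_groups(StubAsyncEngine(), groups))
for (group_prompts, _), group_outputs in zip(groups, grouped_outputs):
    for prompt, output in zip(group_prompts, group_outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

With a real engine, the `StubAsyncEngine()` argument would be replaced by the `llm` object created earlier.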

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alex and I'm an introverted, curious, and creative individual. I'm a writer, artist, and educator, but I'm most comfortable when I'm surrounded by nature and in the moment. I love to create and help others find their own voice. I believe in the power of storytelling to heal, connect, and inspire. I'm excited to explore more and help you find your own unique voice in the world. What's your name? Hello! I'm Alex, a writer, artist, and educator. What's your name? Is there anything else you'd like to share? Hello! I'm Alex, a writer,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known as the "City of Love" and is a UNESCO World Heritage site. The city is renowned for its rich cultural heritage, including iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. Paris also has a strong economy and is a major tourist destination, attracting millions of visitors each year.

Paris is a unique city that combines historical elegance with contemporary flair, making it a popular destination for artists, intellectuals, and cultural enthusiasts alike. The city is home to many museums, theaters, and art galleries, and its streets are lined with a diverse array of shops, cafes, and restaurants.

The



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and complex, with a variety of potential outcomes depending on how it is developed and implemented. Some possible trends in AI include:

1. Increased reliance on AI for decision-making and decision support: As AI becomes more sophisticated and capable of making decisions, its role in decision-making is likely to increase. This could lead to a more decentralized decision-making process, where AI systems act as advisors or decision-makers for organizations, rather than making decisions in isolation.

2. AI will become more integrated with human decision-making: AI systems will likely become more integrated with human decision-making processes, allowing them to assist humans in making decisions. This could lead to
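
The streaming loop above prints each chunk as it arrives. When the full text is also needed afterwards, the chunks can be accumulated while they are displayed. The sketch below assumes the stream yields plain string chunks, as the `print(cleaned_chunk, ...)` call above suggests. `collect_stream` is a hypothetical helper, and `fake_chunk_stream` is a stand-in generator so the example runs without an engine.

```python
import asyncio


async def fake_chunk_stream(text, size=4):
    """Hypothetical stand-in for a chunk stream: yields `text` in small pieces."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield text[start : start + size]


async def collect_stream(chunks, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. lambda c: print(c, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_chunk_stream(" Paris is the capital.")))
print(full_text)
```

Under the same assumption, passing `async_stream_and_merge(llm, prompt, sampling_params)` as the `chunks` argument would collect a real response.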




```python
llm.shutdown()
```
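
If an exception is raised before `llm.shutdown()` is reached, the engine may be left running. One way to guard against this is a small context manager that always calls `shutdown()` on exit. This is a sketch: `engine_session` is a hypothetical helper, not an SGLang API, and `StubEngine` is a stand-in exposing only the `shutdown()` method so the example runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and call its shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


stub = StubEngine()
with engine_session(stub) as engine:
    pass  # call engine.generate(...) here when using a real engine
print(stub.is_shut_down)
```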