# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
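As a rough illustration of the custom-server idea, the sketch below keeps request handling separate from any web framework. `StubEngine` and `handle_generate` are hypothetical names introduced here; the stub only mimics the `generate(prompts, sampling_params)` call shape used later in this document, so it runs without a model. The linked `custom_server.py` remains the reference example.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the list-of-dicts result shape used in this document.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, raw_body: str) -> str:
    """Turn a JSON request body into a JSON response body."""
    try:
        payload = json.loads(raw_body)
        prompts = payload["prompts"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "body must be JSON with a 'prompts' list"})

    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]})


print(handle_generate(StubEngine(), '{"prompts": ["Hello, my name is"]}'))
```

Because the handler takes and returns plain strings, it can be wired into whichever HTTP framework you prefer, with a real engine passed in place of the stub.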



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
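To see why this is needed, the following standard-library sketch reproduces the error that plain `asyncio` raises when `asyncio.run()` is called while a loop is already running, which is the situation inside IPython or Jupyter. The function names are illustrative only; `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    coro = inner()
    try:
        # asyncio.run() refuses to start while another loop is already running.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


# Run as a plain script (no loop running yet), the nested call inside outer() fails.
result = asyncio.run(outer())
print(result)
```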

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sam. I'm from Australia. I'm a student. I'm not 16. I'm 15 this year. I'm now in Beijing for an vacation. It's sunny in Beijing. There are two big rivers. There are also many lakes and rivers nearby. There are also many tall buildings on the streets. I like to go to the park and play with my friends there. I like to play tennis. I don't like playing volleyball. I like to use my imagination to tell me stories. Sometimes I like to read books. I like to do morning exercises. It's fun to do morning exercises. I
Prompt: The president of the United States is
Generated text:  a citizen of the United States. I am a citizen of the United States.  
Therefore, I am a citizen of the United States.
This argument is an example of which logical fallacy? This argument is an example of an **assumption** or **hasty generalization**. The argument assumes that because the president of the United States is a citizen of the United States, and the president is descr
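The engine schedules a batch on its own, so splitting the prompt list is optional. For long offline jobs, though, it can be convenient to submit prompts in fixed-size chunks, for example to save results between calls. The sketch below assumes only the call shape shown above (`generate(prompts, sampling_params)` returning one dict with a `"text"` key per prompt). `generate_in_chunks` and `fake_generate` are hypothetical helpers, and the stand-in function lets the example run without a model.

```python
from typing import Callable, Dict, List


def generate_in_chunks(
    generate: Callable[[List[str], dict], List[Dict[str, str]]],
    prompts: List[str],
    sampling_params: dict,
    chunk_size: int = 2,
) -> List[Dict[str, str]]:
    """Call `generate` on successive slices of `prompts`, preserving order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    results = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = generate(chunk, sampling_params)
        # Pair each prompt with its generated text.
        results.extend(
            {"prompt": p, "text": o["text"]} for p, o in zip(chunk, outputs)
        )
    return results


def fake_generate(prompts, sampling_params):
    # Stand-in for llm.generate: returns the prompt upper-cased as "text".
    return [{"text": p.upper()} for p in prompts]


rows = generate_in_chunks(fake_generate, ["a", "b", "c"], {"temperature": 0.8})
print(rows)
```

To adapt it, pass `llm.generate` in place of `fake_generate`.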

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the industry]. I'm always looking for new opportunities to grow and learn, and I'm always eager to learn new things. I'm a [reason for interest in the industry] and I'm always looking for ways to contribute to the company's success. I'm a [reason for interest in the industry] and I'm always looking for ways to contribute to the company's success. I'm a [reason for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of French literature and cuisine, and a major center for art, music, and film. Paris is a cultural and economic hub, with a diverse population and a rich history dating back to the Roman Empire. It is the largest city in France and the second-largest city in the world by population. Paris is also known for its annual fashion and food festivals, as well as its annual Eiffel Tower celebration. The city is a popular tourist destination and a major economic center

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI technology continues to improve, we can expect to see even more widespread adoption in healthcare.

2. AI in manufacturing: AI is already being used in manufacturing to improve efficiency, reduce costs, and increase productivity. As AI technology continues to improve, we can expect to see even more widespread adoption in manufacturing.

3. AI in finance: AI is already being
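"Overlap removal" here means dropping the part of a new chunk that repeats the end of the text already collected. The sketch below illustrates that idea under the assumption that chunks may partially repeat earlier text. `strip_overlap` and `merge_chunks` are hypothetical names, and this is not presented as SGLang's implementation of `stream_and_merge`.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", "capital of France", " is Paris."])
print(merged)
```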



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age]. I am currently working as a [Job Title] at [Company Name] and I have [Number of Years in this Job]. I have always been passionate about [What I Love to Do/What I Want to Do]. I have a strong work ethic and I am always looking for ways to [What I Want to Improve/What I Want to Do Better]. I am always eager to learn and adapt to new challenges. I am always open to new ideas and have a keen eye for detail. I am a team player, always willing to help others and make them feel valued. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its stunning architecture, vibrant culture, and rich history. Paris is often referred to as the "City of Light" due to its iconic lights, such as the Eiffel Tower and the Louvre Museum. The city is home to the headquarters of many world-
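One reason to use the asynchronous call is that several requests can be in flight at once. The sketch below shows that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that mimics the awaitable call shape of `llm.async_generate(prompts, sampling_params)` used above, so the example runs without a model.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Stand-in for `await llm.async_generate(...)`; sleeps briefly to mimic work.
    await asyncio.sleep(0.01)
    return [{"text": f" echo:{p}"} for p in prompts]


async def run_batches(batches, sampling_params):
    """Submit every batch concurrently and return results in submission order."""
    tasks = [fake_async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batch_results = asyncio.run(
    run_batches([["a", "b"], ["c"]], {"temperature": 0.8, "top_p": 0.95})
)
print(batch_results)
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its input batch.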

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old. I'm currently [Occupation] and I enjoy [Favorite Activity], [Favorite Food], and [Favorite Book]. I'm also a [Favorite Sport], [Favorite Movie], and [Favorite Character in a Book]. If you have any questions or would like to know more about me, just ask! 🌟✨

Please feel free to share any details you'd like me to include in my introduction. 🚀✨

#SelfIntro #Character #Friendly #Informational #WishList #Write #Inspiration #Characteristics #Characters #Personality #



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

OPTIONS:
 (1). Yes
 (2). No
(1). Yes

Paris is the capital of France, and it is known for its historical landmarks, such as Notre-Dame Cathedral and the Eiffel Tower. It is also a major cultural center for France, with museums, theaters, and other cultural institutions. Additionally, Paris is home to many international institutions, including the European Parliament. The city is known for its fashion, art, and cuisine. Finally, Paris is a significant city for tourism, with millions of visitors each year. The city has a rich history dating back to the Middle Ages, and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly promising and involves rapid developments and advancements. Some possible trends in AI include:

1. Increased Use of AI for Medical Applications: AI can be used to diagnose diseases, predict patient outcomes, and even create personalized treatment plans. This will lead to more accurate medical diagnoses, improved treatment outcomes, and reduced costs for patients.

2. Development of AI for Autonomous Vehicles: As autonomous vehicles become more advanced, AI will play a key role in their development and operation. AI will be used to analyze traffic conditions, predict traffic patterns, and even make driving decisions.

3. AI in the Workplace: AI is already being used in the workplace to improve



```python
llm.shutdown()
```
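In a standalone script, you may want the shutdown call to run even if generation raises. One way to arrange that is a `try`/`finally` block, sketched below with a hypothetical `ClosableStubEngine` that exposes only the `generate()` and `shutdown()` methods used in this document.

```python
class ClosableStubEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self):
        self.closed = True


def run_job(engine, prompts):
    """Run one generation call and always release the engine afterwards."""
    try:
        return engine.generate(prompts, {"temperature": 0.8})
    finally:
        engine.shutdown()


engine = ClosableStubEngine()
try:
    run_job(engine, [])  # raises inside generate()
except ValueError:
    pass
print(engine.closed)  # shutdown still ran
```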