# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
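
The patch is likely needed because notebook kernels such as IPython already run an event loop, and the standard library refuses to start a second one from inside it. The sketch below uses only `asyncio` (no SGLang) to reproduce that refusal; `inner` and `outer` are illustrative names, not part of any library.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like notebook code.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a nested loop is rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted.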

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```








### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Desirée. I am a 20-year-old student. I have a lot of questions about my life. I'd like to speak with you, please. [I'm going to ask you some questions. There will be no one else in the room.] First, please introduce yourself. 2. When were you born? 3. What's your profession? 4. What's your favorite color? 5. What's your favorite food? 6. What's your favorite sport? 7. What's your favorite hobby? 8. What's your favorite TV show? 9. What's your
Prompt: The president of the United States is
Generated text:  a very important person in the country. He is like the boss of the country. The president can do a lot of things. The president can help his country solve the problems. The president can also help his country win the games. He is always busy. But he always comes to the White House to meet the people. The president's job is very important. He always works very hard.
Based on the article and the questions below, please provide the answer for the
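
In an offline batch job the results are usually persisted rather than printed. The sketch below pairs each prompt with its output and writes one JSON record per line. It assumes only the output shape used above (a list of dicts with a `text` key); `write_jsonl` and the sample `fake_outputs` are hypothetical and stand in for a real `llm.generate` result so the snippet runs without a model.

```python
import io
import json


def write_jsonl(prompts, outputs, fh):
    """Write one {"prompt", "text"} JSON record per line to the file-like fh."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        fh.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")


# Hypothetical stand-in for the list returned by llm.generate(prompts, ...).
demo_prompts = ["Hello, my name is", "The capital of France is"]
fake_outputs = [{"text": " Ada."}, {"text": " Paris."}]

buffer = io.StringIO()  # replace with open("results.jsonl", "w") for a real run
write_jsonl(demo_prompts, fake_outputs, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records)
```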

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. What can you tell me about yourself? I'm a [insert a brief description of your personality or background]. And what's your favorite hobby or activity? I love [insert a hobby or activity you enjoy]. And what's your favorite book or movie? I love [insert a book or movie you've read or watched]. And what's your favorite place to go? I love [insert a place you've visited]. And what's your favorite color? I love [insert a favorite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" and "La Ville de la Rose". It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major center for business, finance, and tourism in France. It is a popular tourist destination and a major cultural hub. The city is home to many museums, theaters, and other cultural institutions. Paris is a city of contrasts, with its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased automation: AI is likely to become more integrated into various industries, leading to increased automation of tasks and processes. This could result in the creation of new jobs, but also create new opportunities for workers to be more productive and efficient.

2. Improved privacy and security: As AI systems become more sophisticated, there will be an increased need for measures to protect personal data and privacy. This could lead to new regulations and standards to ensure that AI systems are used responsibly and ethically.

3. Enhanced human-computer interaction: AI is likely to become more integrated into
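
"Overlap removal" refers to merging streamed chunks when a new chunk may repeat text that has already been received. The following is a minimal sketch of one way to do this, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are illustrative names, and the sample chunks are made up.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks: the second and third partially repeat earlier text.
merged = merge_chunks([" Paris", " Paris, the", " the capital"])
print(repr(merged))
```

A suffix/prefix heuristic like this cannot tell an accidental overlap from a genuine repetition (for example `"ha"` followed by `"ha"`), so it is only appropriate when the stream is known to resend earlier text.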



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First name] and I am [Last name] and I am a [job title] at [company name].
[First name] is a [job title] with [number] years of experience in [industry]. In my role, I am responsible for [main responsibility]. [First name] is a team player and always [adjective] to accomplish our goals. I have a strong work ethic and am always willing to go above and beyond to ensure that our team is successful. I am a team player who thrives in a fast-paced and dynamic environment. I am an absolute pleasure to work with and am always eager to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and the second-largest city in the world by population. Paris is known for its rich history, beautiful architecture, and cultural attractions. It is a popular tourist destination and a symbol of Fre
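
The practical benefit of an awaitable call is that other coroutines can make progress while generation is pending. The sketch below illustrates this with plain `asyncio`; `fake_async_generate` is a hypothetical stand-in for `llm.async_generate` (it only sleeps and returns placeholder text), and `heartbeat` represents any unrelated concurrent work.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for llm.async_generate: simulate latency, then
    # return one {"text": ...} dict per prompt.
    await asyncio.sleep(0.05)
    return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def heartbeat(ticks, log):
    # Unrelated work that keeps running while generation is awaited.
    for i in range(ticks):
        log.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def demo():
    log = []
    outputs, _ = await asyncio.gather(
        fake_async_generate(["prompt a", "prompt b"], {"temperature": 0.8}),
        heartbeat(3, log),
    )
    return outputs, log


async_outputs, tick_log = asyncio.run(demo())
print(async_outputs, tick_log)
```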

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Age] years old. I am a [Occupation] who have always been [Personal trait or characteristic]. I'm always looking for opportunities to [Describe a positive trait you would like to have]. If you could call me [Name], what would you say? [Name] 

I am a [Occupation], [Age], [Personal Trait] and [Other personal attributes], [Personal trait]. I am always looking for opportunities to [Describe a positive trait you would like to have]. If you could call me [Name], what would you say? Hello, my name is [Name] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city with the iconic Eiffel Tower as its symbol. It is the third most populous city in the European Union and the sixth-largest city in the world. The city is renowned for its architectural beauty, artistic heritage, and vibrant cultural scene. The Eiffel Tower, the Louvre Museum, and the Palace of Versailles are some of Paris's most famous landmarks. Paris has a rich history dating back to ancient times and continues to be a center of European culture and international affairs. The French language is widely spoken in the city and it is the second most spoken language in the world after Mandarin. Paris is also known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of possibilities and promises. Some possible trends include:

1. Autonomous vehicles: Autonomous vehicles could revolutionize transportation by reducing accidents and emissions while improving efficiency. They could also become a life-saving tool for the poor and vulnerable.

2. Smart homes: Smart homes could improve energy efficiency, reduce costs, and enhance comfort by integrating various smart devices such as smart thermostats, smart lighting, and security systems.

3. Improved mental health: AI could help in identifying early signs of mental health issues, providing personalized treatment plans, and monitoring the effectiveness of treatment.

4. Data-driven decision-making: AI could help in making data-driven decisions in
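
Because each chunk is printed with `end=""` as soon as it arrives, the text above appears incrementally, roughly one token at a time. If the full string is also needed afterwards, the chunks can be accumulated while iterating. The sketch below shows that pattern with a hypothetical `fake_token_stream` async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, so it runs without a model.

```python
import asyncio


async def fake_token_stream(text):
    # Hypothetical stand-in for a streaming call: yield one word-sized
    # chunk at a time, each with its leading space.
    for piece in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield " " + piece


async def collect(stream):
    """Consume an async stream, returning the chunks and the joined text."""
    pieces = []
    async for chunk in stream:
        pieces.append(chunk)
    return pieces, "".join(pieces)


stream_pieces, stream_text = asyncio.run(
    collect(fake_token_stream("Paris is the capital"))
)
print(stream_pieces)
print(repr(stream_text))
```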




```python
llm.shutdown()
```
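
In a script, it can be useful to guarantee that `shutdown()` runs even when inference raises. One possible pattern is a small context manager around the engine, sketched below. `engine_session` is a hypothetical helper, and `FakeEngine` is a stand-in that only mimics the `generate` and `shutdown` methods used in this document so the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in with the same method names used above (generate, shutdown)."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


demo_engine = FakeEngine()
try:
    with engine_session(demo_engine) as eng:
        eng.generate(["Hello"], {"temperature": 0.8})
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(demo_engine.closed)
```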