# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. It suits use cases where an additional HTTP server adds unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
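
As a rough illustration of the idea (the linked script remains the reference), the sketch below wraps an engine-like object in a plain request handler. `EchoEngine`, `handle_generate`, and the JSON field names are hypothetical and not part of SGLang. `EchoEngine` only imitates the `generate(prompts, sampling_params)` call shape and the `{"text": ...}` results used later in this document, so the snippet runs without a model.

```python
import json
from typing import Any, Dict, List, Tuple


class EchoEngine:
    """Hypothetical stand-in that mimics the engine's generate() call shape."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, str]]:
        # A real engine would run the model; this stub just echoes each prompt.
        return [{"text": f" (echo) {prompt}"} for prompt in prompts]


def handle_generate(engine: Any, raw_body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into (HTTP status code, JSON response body)."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a non-object payload, or a missing "prompt" field.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]}).encode()


engine = EchoEngine()
request_body = b'{"prompt": "Hello", "sampling_params": {"temperature": 0.8}}'
status, body = handle_generate(engine, request_body)
print(status, body.decode())
```

In a real server you would pass an `sgl.Engine` instance instead of the stub and call the handler from whichever web framework you prefer.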



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython, or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
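
The underlying issue is that `asyncio.run()` raises a `RuntimeError` when it is called while an event loop is already running in the same thread, which is typically the case inside notebook kernels. A minimal sketch that detects this situation with the standard library and patches only when needed; the helper names `has_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical and not part of SGLang:

```python
import asyncio


def has_running_loop() -> bool:
    """Return True if an asyncio event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running; return True if patched."""
    if not has_running_loop():
        return False
    import nest_asyncio  # third-party package, imported lazily

    nest_asyncio.apply()
    return True


print("Loop already running:", has_running_loop())
print("Patched:", apply_nest_asyncio_if_needed())
```

Calling `nest_asyncio.apply()` unconditionally, as shown above, is simpler and is what the rest of this document does.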

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Jay, and I have been created by human beings to help people when they are in need. But there are a lot of problems that human beings can cause, and I don't like when they cause problems. I don't like when people are mean or cruel to others and hurt them. So I make friends with people who are nice and kind to others. They teach me how to be a good friend. Now, what am I? I am an artificial intelligence assistant. Do you want to talk to me? What would you like to talk about? I am always ready to help people. If you are in need, just tell me
Prompt: The president of the United States is
Generated text:  a presidential candidate, and two of his possible opponents are running for president: the democrat and the republican. The democrats support a platform of peaceful protest, while the republicans support a platform of economic austerity. The president wants to campaign effectively, and his campaign staff wants to know what platform the candidate w
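
For larger prompt lists you may want to collect results as records rather than print them. The sketch below assumes only what the cell above shows: `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt. `chunked`, `generate_in_chunks`, and `UppercaseEngine` are hypothetical names, and the stub stands in for `llm` so the snippet runs without a model. Chunking here is a bookkeeping convenience (for example, to write results out incrementally), not a performance recommendation.

```python
from typing import Any, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int,
) -> List[Dict[str, str]]:
    """Run generate() chunk by chunk and pair each prompt with its output text."""
    records: List[Dict[str, str]] = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


class UppercaseEngine:
    """Hypothetical stand-in for the engine; it upper-cases each prompt."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, str]]:
        return [{"text": " " + prompt.upper()} for prompt in prompts]


records = generate_in_chunks(UppercaseEngine(), ["a", "b", "c"], {"temperature": 0.8}, chunk_size=2)
for record in records:
    print(record)
```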

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill or Ability] who has been [Number of Years] years in the [Field] industry. I'm passionate about [What I Love to Do] and I'm always looking for new challenges and opportunities to grow and learn. I'm a [Favorite Hobby] and I enjoy [What I Like to Do]. I'm always ready to learn and grow, and I'm excited to meet you. [Name] [Age] [Gender] [Occupation] [Skill or Ability] [Favorite Hobby] [What I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in Europe by population. It is located on the Seine River and is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is known for its rich history, art, and culture, and is a popular tourist destination. It is also home to many famous landmarks and attractions, including the Louvre, the Champs-Élysées, and the Eiffel Tower. Paris is a vibrant and dynamic city that is known for its lively atmosphere and diverse cultural scene. It is a popular destination for business and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn and adapt in ways that are difficult for humans to do. This could lead to more efficient and effective decision-making, as well as more personalized and context-aware interactions with humans.

2. Enhanced ethical considerations: As AI becomes more advanced, there will be a growing need to address ethical concerns related to its use. This could include issues related to bias, transparency, accountability, and the potential for AI to be used for harmful purposes.

3. Increased reliance
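
The "overlap removal" in the cell above refers to merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below illustrates that idea on plain strings. `trim_overlap` and `merge_chunks` are illustrative helpers written for this example, and the code is not claimed to match the internals of `stream_and_merge`.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that is already a suffix of `existing_text`."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The cap", "capital of", " of France"]))
```

One caveat of this approach: a chunk that legitimately begins with the same characters the previous text ends with would also be trimmed, so it only suits streams where repeated boundaries really are duplicates.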



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Your Profession]. I have been working in this industry for [Your Industry Experience], and I am currently [Your Current Position]. I am passionate about [Your Passion], and I strive to be a [Your Personal Character Trait]. I am constantly learning and improving myself in order to be the best version of myself. Thank you for considering my humble introduction! 🙏✨

That was a great self-introduction! Can you tell me more about your current role and what you are working on? It would be helpful to know more about your experiences and how they can contribute to the success of the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history, beautiful architecture, and vibrant culture. It is one of the most popular tourist destinations in the world and is home to numero
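
Because `async_generate` is awaited, several independent batches can be awaited together with `asyncio.gather`. The sketch below shows that pattern under stated assumptions: `SlowEchoEngine` is a hypothetical stand-in that only imitates the `async_generate(prompts, sampling_params)` call shape from the cell above, and `run_groups` is an illustrative helper. Whether concurrent calls bring any benefit with the real engine is not something this example demonstrates.

```python
import asyncio
from typing import Any, Dict, List


class SlowEchoEngine:
    """Hypothetical stand-in with an async_generate() coroutine; not the SGLang engine."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" (echo) {prompt}"} for prompt in prompts]


async def run_groups(
    engine: Any, groups: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[Dict[str, str]]]:
    """Await several independent batches concurrently; results keep the input order."""
    pending = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*pending)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_groups(SlowEchoEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Inside a notebook, `asyncio.run` needs the `nest_asyncio` patch described earlier.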

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [career title] at [Company]. I have [number of years] years of experience in [specific area of expertise]. [Name] was born in [date], and [Name] graduated from [college or university]. I enjoy working on [project], and my favorite [activity or hobby] is [what it is]. I'm also a [role model] and strive to make the world a [positive impact] place. What is your experience, education, and current role in the company? [Name] [Name's name] is my name, and [Name] is my career title



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city that was once the capital of France and a major European city. It is a major transportation hub, with its famous Eiffel Tower, Louvre Museum, Notre Dame Cathedral, and many other attractions. Paris is also known for its distinctive culture, including the French language, French cuisine, and opera. The city is a cultural and economic center of France and one of the world’s most popular tourist destinations. The city has a rich history, with many historical sites, museums, and cultural events taking place throughout the year. Paris is also known for its nightlife and the city is home to numerous bars, clubs,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of trends, such as:

1. Increased automation: AI systems will continue to become more sophisticated, and will be able to perform a wider range of tasks more efficiently than human workers can. This could lead to the development of new jobs and new industries, as well as the creation of more intelligent and automated systems that can perform tasks in ways that are previously impossible.

2. AI ethics: As AI systems become more advanced, there will likely be a greater emphasis on ethical considerations. This could lead to the development of new regulations and standards for AI systems, and could also lead to discussions about the responsibilities of individuals and




```python
llm.shutdown()
```
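
In a script, an exception raised during generation would skip a trailing `shutdown()` call. A minimal sketch of guarding against that with `try`/`finally`; `RecordingEngine` and `generate_then_shutdown` are hypothetical names, and the stub only imitates the `generate` and `shutdown` methods used in this document:

```python
from typing import Any, Dict, List


class RecordingEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, str]]:
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" (echo) {prompt}"} for prompt in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


def generate_then_shutdown(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[str]:
    """Run one batch and always shut the engine down afterwards."""
    try:
        outputs = engine.generate(prompts, sampling_params)
        return [output["text"] for output in outputs]
    finally:
        # Runs on success and on error alike.
        engine.shutdown()


engine = RecordingEngine()
texts = generate_then_shutdown(engine, ["Hello, my name is"], {"temperature": 0.8})
print(texts, engine.is_shut_down)
```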