# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
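As a rough illustration of the custom-server idea, the sketch below shows only the request-handling core that such a server would wrap around the engine. It is not the linked `custom_server.py`: `EchoEngine` and `handle_request` are hypothetical names, and `EchoEngine` is a stand-in that imitates the `async_generate` call shape used later in this document (a list of prompts in, a list of dicts with a `"text"` field out) so the sketch runs without a model.

```python
import asyncio
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts, sampling_params):
        # Assumption: mirrors the call shape shown in this document,
        # a list of prompts in and a list of {"text": ...} dicts out.
        await asyncio.sleep(0)
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def handle_request(engine, raw_body: str) -> str:
    """Turn one JSON request body into one JSON response body."""
    try:
        body = json.loads(raw_body)
        prompt = body["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'prompt' field"})

    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = body.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    outputs = await engine.async_generate([prompt], sampling_params)
    return json.dumps({"prompt": prompt, "text": outputs[0]["text"]})


engine = EchoEngine()
ok = json.loads(asyncio.run(handle_request(engine, '{"prompt": "The capital of France is"}')))
bad = json.loads(asyncio.run(handle_request(engine, "not json")))
print(ok["text"])
print(bad["error"])
```

In a real server, a web framework of your choice would call `handle_request` from its route handler, and an `sgl.Engine` instance would presumably take the place of `EchoEngine`.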



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
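The reason for this requirement can be reproduced with the standard library alone. The sketch below uses no SGLang code, and `inner` and `outer` are illustrative names. It calls `asyncio.run` from inside a coroutine, which is the situation a notebook cell is in, and captures the resulting error.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, as it is inside IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted, which is why the notebook cells in this document can call `asyncio.run(main())` directly.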

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ian and I'm a writer. I'm a lifelong learner who loves reading, writing, and learning new things about the world around us. I love to explore the world with a sense of wonder and curiosity, and I love sharing my passions and interests with others. I'm also a photographer and have been for the past year, so I'm very good at taking photos and sharing them online. I love helping others to find their passions and discover new things. Let me know if there's anything else I can help you with! 

As a result of my lifelong learning and passion for learning, I am able to provide knowledge and guidance on
Prompt: The president of the United States is
Generated text:  a very busy person and he works a lot. He has a lot of work to do. He gets up very early in the morning and then he goes to work by car. He leaves his house at seven o'clock in the morning. He goes to his office at eight o'clock. He works in the morning. He has breakfast at seven o'clock. H

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Parliament building. Paris is a cultural and economic center, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. It is a popular tourist destination, attracting millions of visitors each year. Paris is also known for its cuisine, with its famous dishes such as croissants, boudin, and escargot. The city is also home to many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment.

3. Increased use of AI in healthcare
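The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of the general idea, not necessarily what `stream_and_merge` does internally: `drop_overlap` and `merge_chunks` are hypothetical names, and the sketch assumes that a streamed chunk may begin with text that was already received.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove from new_chunk the longest prefix that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# The second chunk repeats "Paris"; the third repeats " is".
merged = merge_chunks(["Paris", "Paris is", " is the capital"])
print(merged)
```

One caveat of this approach is that text which legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be dropped.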



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm an AI assistant who can help you with almost anything. How can I help you today?
As an AI assistant, I'm here to help you with a wide range of questions and tasks, whether it's answering questions, providing information, or simply offering to assist with your needs. I'm constantly learning and improving, so please feel free to ask me any questions you have and I'll do my best to provide the best possible answer. And don't hesitate to let me know if you need help with anything specific, and I'll do my best to assist you. Let's get started! 📱✨

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Presqu’île".

Paris, also known as "La Presqu'île", is the largest city in France and the capital of the country. It is the economic and cultural center of France. The city is renowne

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character's Name], I'm a/an [Age] year old [Occupation/Position/Role]. I have [X] years of experience in [X]. I'm from [Your Location], [Your Profession], or [Your Specialty], and I'm here to meet you. I'm here to share my knowledge and experience to help you achieve your goals and make your journey to success as smooth as possible. I'm excited to meet you, and I look forward to helping you succeed. How are you today? What brings you here? What brings you to speak to me? I'm here to listen and learn, and to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also known for its rich cultural heritage and is a popular tourist destination. Paris is known for its history, architecture, and art, making it a major city for both locals and tourists alike. The city is also home to numerous museums, theaters, and restaurants, making it a vibrant and diverse place to visit. Paris is known for its fashion industry and is a major shopping destination. It is home to numerous international organizations, such as UNESCO and the French Academy of Sciences. The city is also home to the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very promising, and it will continue to evolve and adapt to new challenges and opportunities. Some possible future trends in AI include:

1. Increased automation and automation: AI will continue to gain more automation, including in areas like manufacturing, healthcare, and transportation. This will lead to increased efficiency and productivity, and will also result in more reliable and reliable systems.

2. Development of AI ethics and responsibility: As AI becomes more integrated into society, there will be a growing need to develop ethical guidelines and standards for its use. This will require an increase in the development of AI ethics experts who can analyze and evaluate AI systems.

3. Integration of
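The `async for` pattern used in the cell above relies on an async generator that yields only text that has not been printed yet. The sketch below shows one way such a generator could look. It is not the implementation of `async_stream_and_merge`: `fake_stream` and `stream_new_text` are hypothetical names, `fake_stream` stands in for a streaming engine call, and the sketch assumes each chunk carries the cumulative text generated so far.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical chunk source standing in for a streaming engine call."""
    for text in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"text": text}


async def stream_new_text(chunk_source):
    """Yield only text not emitted before, assuming cumulative chunks."""
    emitted = ""
    async for chunk in chunk_source:
        text = chunk["text"]
        if text.startswith(emitted):
            new_text = text[len(emitted):]
        else:
            # Chunk is not cumulative; treat it as purely incremental text.
            new_text = text
        emitted += new_text
        if new_text:
            yield new_text


async def collect():
    pieces = []
    source = fake_stream(["Paris", "Paris is", "Paris is the capital"])
    async for piece in stream_new_text(source):
        pieces.append(piece)
    return pieces


pieces = asyncio.run(collect())
print(pieces)
```

Because the consumer receives each piece as soon as it is produced, it can print with `end=""` and `flush=True`, as the cell above does, without repeating earlier text.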




```python
llm.shutdown()
```