# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
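
This requirement most likely stems from the fact that notebook kernels already run an event loop, and `asyncio.run` refuses to start a second one inside it. The standard-library sketch below reproduces that error without SGLang. The names `inner` and `outer` are illustrative and not part of the SGLang API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running event loop here, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are allowed, which is why the notebook cells in this document can call `asyncio.run` directly.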

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl

# Helpers used by the cells below.
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.66it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sadie, and I’m the only one who can run a marathon. I’m also the only one who can sing, play the guitar, and own a dog. I’m currently 27 years old and I love to read, watch movies, and cook. I like to dress up and be all me.
Q: What does Sadie like to do? Sadie likes to read, watch movies, and cook. Sadie also likes to sing and play the guitar. Sadie likes to dress up and be herself. Sadie is the only one who can run a marathon. What does Sadie like to do?
A
Prompt: The president of the United States is
Generated text:  very busy. He always works in the White House. He wears a white shirt and blue jeans on Sundays. He is called the "Presidential Boyfriend". The President is not allowed to eat meat. He eats many other kinds of food, such as fish and vegetables. At the end of the week, he gives lots of presents to the presidents of the other countries in the world. On Saturdays, the President has breakfast at a restaurant. He has lunch and dinne
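
As the loop above shows, the returned outputs line up with the prompts by position, and each output is a dictionary with a `text` key. A common next step in offline batch jobs is to store the pairs for later analysis. The sketch below does this with hard-coded stand-in outputs, so it runs without a model. The helper name `to_jsonl` and the `prompt`/`completion` field names are assumptions made for this illustration.

```python
import json

# Stand-in for the list returned by llm.generate(prompts, sampling_params).
# Only the "text" key is taken from the example above.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Sadie."}, {"text": " Paris."}]


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"].strip()}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

Checking the lengths before zipping guards against silently dropping results, since `zip` stops at the shorter sequence.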

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [Age] year old [Gender] [Occupation]. I am currently [Current Location] and I am [Current Job Title]. I am a [Favorite Hobby] enthusiast and I love [Favorite Food/Drink/Activity]. I am [Favorite Book/TV Show/Video Game] and I am always looking for new adventures. I am [Favorite Music/Artist] and I love to [Favorite Hobby/Activity]. I am [Favorite Sport/Activity/Travel Destination] and I am always on the lookout for new experiences. I am [Favorite Movie/Book/TV Show/Video Game]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also known for its rich history, including the French Revolution and the French Revolution Museum. Paris is a bustling metropolis with a diverse population and a vibrant cultural scene. It is the seat of the French government and the country's largest city, with a population of over 2.5 million people. The city is also home to many famous French artists, writers, and musicians. Paris is a popular tourist destination and a major economic hub in Europe. It is known for its fashion, art

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the development of this technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve risk management, fraud detection, and investment decision-making. As AI
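
The "overlap removal" mentioned in the cell above matters when successive streamed chunks repeat text that has already been received. The following is a minimal sketch of one way to merge such chunks. It is not taken from the SGLang source, the actual behavior of `stream_and_merge` may differ, and the helper names and sample chunks are purely illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris, the city", " city of light"]
merged = merge_chunks(chunks)
print(repr(merged))
```

The search starts from the longest possible overlap so that the largest repeated span is removed first.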



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a software developer with a passion for exploring new technologies and fostering creativity. I'm always eager to learn and apply new ideas in my programming, while also being open to feedback on my work and willing to help others improve their programming skills.

I enjoy collaborating with other developers to build complex applications, and I'm always looking for ways to improve my skills and knowledge. I'm also a natural problem solver, and I'm always willing to brainstorm innovative solutions to problems.

Overall, I value hard work, communication, and perseverance in achieving my goals. I'm excited to bring my unique skills and experience to our team and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks, rich cultural heritage, and bustling street life. I
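
One reason to prefer the asynchronous API is that the event loop stays free for other work while a generation request is pending. The sketch below illustrates this with a stub so that it runs without a model. `StubEngine` and `heartbeat` are hypothetical names, and the stub only mimics the call shape `async_generate(prompts, sampling_params)` used above.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it does not run a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # simulate inference latency
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def heartbeat(events, ticks=3, interval=0.01):
    """Other work that keeps running while generation is pending."""
    for i in range(ticks):
        events.append(f"tick {i}")
        await asyncio.sleep(interval)


async def main():
    events = []
    engine = StubEngine()
    prompts = ["The capital of France is", "The future of AI is"]
    # gather returns results in the order the awaitables were passed in.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8}),
        heartbeat(events),
    )
    return outputs, events


outputs, events = asyncio.run(main())
print(outputs)
print(events)
```

In a custom server, the second coroutine would typically be request handling rather than a heartbeat.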

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm an [insert age] year old [insert occupation]. I enjoy [insert hobbies or interests], and I'm always trying to learn new things. I'm excited to meet you! 🌟✨

Remember, you have my respect and admiration. Let's have a great conversation together! 📝💻✨

Feel free to share anything interesting or curious about yourself! 🤔✨

Let's get to know each other better! 🤝💬

I'm [insert any other relevant information like a profession, nationality, etc.]. Thanks for your time! 😊✨

Looking forward


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is concise and accurately describes the city's official name, which is the French word "Paris" in English. It also mentions that Paris is the capital city of France, which is a country in Western Europe. The statement leaves no room for interpretation, making it clear to the reader that it provides a straightforward, factual statement about a specific topic.

A direct translation of the statement into Spanish would be "La capital de Francia es París". However, as the question does not specify the language to be used, the original English statement is the most accurate representation.

In conclusion, the statement about France's capital


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and full of promise, and here are some potential trends to watch out for:

1. Autonomous vehicles: The widespread adoption of autonomous vehicles will change the way we travel, work, and live. Autonomous vehicles will be able to navigate roads and roads, detect hazards, and respond to traffic. This will lead to a decrease in traffic accidents and an increase in traffic efficiency.

2. Smart homes: The integration of AI and IoT technology will create smart homes that can control and monitor various aspects of our homes, such as energy usage, climate control, and security systems. This will result in more efficient and sustainable living.

3. Personalized



In [6]:
llm.shutdown()
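
Calling `shutdown()` explicitly works well in a notebook, but in a script an exception raised before that line would skip the cleanup. One way to guard against this is a small context manager, sketched below with a stub. `StubEngine` and `managed_engine` are hypothetical names, and whether `sgl.Engine` can itself be used in a `with` statement is not covered here.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only records the shutdown."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always call shutdown(), even if an error occurs."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as llm:
    outputs = llm.generate(["Hello, my name is"], {"temperature": 0.8})

print(llm.closed)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path=...)`, using the same constructor call shown at the top of this section.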