# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
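
For context, the patch is needed because `asyncio.run()` refuses to start while an event loop is already running, which is the situation inside IPython and Jupyter. The sketch below reproduces that failure using only the standard library. The names `inner` and `outer` are illustrative, and `outer` merely stands in for a notebook cell.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which is the situation a notebook cell is in.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are allowed.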

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    # Allow asyncio.run() inside the notebook's already-running event loop
    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Gabriele and I am a college student who loves photography. I have been learning about the different types of pictures and the different techniques for taking them. Today I want to talk about a specific type of picture - the portrait. It is one of the most recognizable, photographed, and photographed with the most people. It is a picture of a person or people that is taken for the sake of portrait photography. It is the type of picture that is taken while the person is posed, looking directly at the photographer. 

While portraits are photographed and enjoyed by a vast number of people, they are not always taken for good looks. A portrait
Prompt: The president of the United States is
Generated text:  in the library to read the newspaper. He reads 25 pages per hour for the first hour and then slows down and reads 15 pages per hour for the next hour. If he finished reading the entire newspaper in 5 hours, how many pages are in the newspaper? To d
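
In a batch job you usually want to keep each prompt together with its result rather than only print it. The following is a minimal sketch of that pattern, assuming only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict per prompt with a `"text"` key. `StubEngine` and `generate_records` are hypothetical names introduced here so the example runs without loading a model; in real use you would pass the `llm` object instead.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mimics the output shape used above: one dict per prompt with a "text" key.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params):
    """Pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = generate_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(json.dumps(record))  # one JSON object per line, easy to append to a file
```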

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Person] who is [What I enjoy doing]. I'm [What I'm passionate about]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm good at]. I'm [What I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, culture, and politics, with a rich history dating back to the Roman Empire. Paris is a popular tourist destination, known for its fashion, food, and wine, and is home to many world-renowned museums and attractions. The city is also known for its diverse population, with a mix of French, African, and Asian immigrants. Paris is a vibrant and dynamic city that continues to evolve and grow, with a strong sense of community and a commitment to cultural preservation

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to more personalized and context-aware AI that can better understand and respond to human emotions and behaviors.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and adapt to new situations. This could lead to more efficient and effective AI systems that can handle a wider range of tasks.

3. Increased focus on ethical and social implications: As
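
The "overlap removal" mentioned above refers to dropping text that a streamed chunk repeats from what has already been received. The internals of `stream_and_merge` are not shown in this document, so the following is an independent sketch of the idea, not the library's implementation. `drop_overlap` and `merge_chunks` are hypothetical names.

```python
def drop_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, dropping any overlapping prefix."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", " capital of", " of France"])
print(merged)
```

One trade-off of this approach is that a chunk which legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is treated as an overlap and dropped.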



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Type of Character] in [Your Genre or World]. I am [Job Title] and I have [Number of Years in Profession]. In my free time, I enjoy [My Favorite Activity], [My Hobby/Interests], and [My Spiritual Practice]. I am always looking for ways to grow and develop as a person and I am always eager to learn and explore new things. I am very [Type of Character], and I am constantly striving to make the world a better place for everyone. I hope to continue to learn and grow and to continue being a positive influence in the world. Thank

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, cafes, and vibrant arts and culture scene. 

[City Name] is a city with a long history, being a historical and cultural center for centuries, featuring UNESCO World Heritage sites an
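
The asynchronous API is most useful when requests arrive independently and should not block each other. The sketch below shows one way to issue several requests concurrently with `asyncio.gather`. It assumes that `async_generate` also accepts a single prompt and returns a single dict, which this document does not demonstrate. `StubAsyncEngine` and `generate_concurrently` are hypothetical names used so the example runs without a model.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics the assumed output shape."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # Create one request per prompt and wait for all of them;
    # gather returns results in the same order as the inputs.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


results = asyncio.run(
    generate_concurrently(
        StubAsyncEngine(),
        ["Hello, my name is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for result in results:
    print(result["text"])
```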

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm a [insert job or profession]. I'm excited to meet you and learn more about you! 🚀😊

Feel free to tailor this to reflect who you are or what you do. I want my introduction to be relatable and engaging. It's important for my character to be recognizable and relatable to make me feel connected to you. 🚀😊

Please let me know if you'd like me to adjust anything, and I'll do my best to make it even more authentic and engaging! 💬😊

---

Hey there, it's [insert name]. I'm a/an [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic and popular city known for its art, culture, cuisine, and fashion. It is located on the river Seine and is the second-largest city in France by population. The city is home to the French Parliament, the Louvre Museum, Notre-Dame Cathedral, and many other famous landmarks. Paris is also a cultural center for many world languages and has a rich history dating back to the Roman Empire. As a result, it is a popular destination for tourists and locals alike, known for its jazz music, fashion, and annual cultural festivals. Paris is also a world-renowned center for science and technology, with many



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by continued innovation, complexity, and diverse applications. Some possible future trends include:

1. Increased AI integration with other technologies such as the Internet of Things (IoT) and biotechnology, leading to more personalized and interconnected systems.

2. AI will become more ethical and responsible, with greater emphasis on transparency, accountability, and transparency in its use.

3. AI will become more interactive and natural, with a greater emphasis on emotional intelligence and AI empathy.

4. AI will become more capable and less dependent on human intervention, with a greater emphasis on self-learning and self-renewal.

5. AI will become more
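
The streaming cell above prints chunks as they arrive but does not keep the full text. If you need both, you can accumulate the chunks while printing them. The following sketch shows that pattern with a plain async generator. `fake_token_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)` so that the example runs without a model, and `collect_stream` is likewise an illustrative name.

```python
import asyncio


async def fake_token_stream(text):
    """Hypothetical stand-in for a streaming call: yields one word-sized chunk at a time."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if i == 0 else " " + word


async def collect_stream(stream):
    """Print chunks as they arrive and return the full text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = asyncio.run(
    collect_stream(fake_token_stream("Paris is the capital of France."))
)
```

Because `collect_stream` only relies on `async for`, it should work with any async iterator of strings.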




```python
llm.shutdown()
```
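
In a script, an exception raised before `shutdown()` is reached would skip the cleanup. One way to guard against that is to wrap the engine in a context manager that always calls `shutdown()`. This is a minimal sketch under that assumption: `managed_engine` is a hypothetical helper, not part of SGLang, and `StubEngine` stands in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello"], {"temperature": 0.8})
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine.closed)
```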