# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
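
To see why the patch is needed, the following minimal sketch reproduces the underlying error with the standard library only, without SGLang. Calling `asyncio.run()` from code that is already executing inside an event loop (the situation of a notebook cell) raises a `RuntimeError`. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, which is the
    # situation an IPython/Jupyter cell is in.
    coro = inner()
    try:
        asyncio.run(coro)  # nested run: rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the notebook cells below can call `asyncio.run(main())`.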

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Xiao Ming. I am a high school student from Xiangyang, Hubei Province. I have just graduated from senior three and now I am in Senior Four, studying mathematics. My school is Fengtai High School, and my major is chemistry. 

I have been working at a restaurant for over three years and I have been enjoying the job very much. I have also been engaged in a hobby named "popular science" which I have been participating in since my senior three. I have become proficient in chemistry and have learned all about the periodic table. 

I have been very diligent in my studies and have achieved a good score
Prompt: The president of the United States is
Generated text:  to be elected by the combined votes of the members of the United States Congress. Last year, the representatives present in the United States Congress were 105 in number, each of whom could vote. The president was elected with 78 votes. How many representatives attended the meeting? We need t
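
The loop above reads each completion from the `text` field of an output dictionary. As a small illustration of post-processing such results, the sketch below serializes prompt/completion pairs to JSON Lines. It assumes, as the loop above does, that outputs come back in prompt order. The helper `to_jsonl` and the hand-written `sample_outputs` are hypothetical and stand in for a real `llm.generate` result so the example runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Hand-written stand-ins shaped like the outputs printed above.
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```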

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your character or personality]. I enjoy [insert a short description of your hobbies or interests]. What do you like to do for fun? I love [insert a short description of your hobbies or interests]. What do you like to do for work? I like to [insert a short description of your work or responsibilities]. What do you like to do for relaxation? I like to [insert a short description of your hobbies

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to the Roman Empire and the French Revolution. Paris is a popular tourist destination, with millions of visitors annually. The city is also home to many famous French artists, writers, and musicians. It is a major hub for international trade and diplomacy, with Paris as the seat of the French government and the headquarters of the European Union. The city is also known for its cuisine, with dishes such as croissants, b

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and the impact of AI on society.

2. Advancements in machine learning and deep learning: These are the two main areas of AI research that are expected to drive future advancements. Machine learning will continue to improve, while deep learning will become more sophisticated and capable of handling complex tasks.

3. Integration
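
The banner printed by this cell mentions overlap removal. As a rough illustration of the idea (not necessarily how `stream_and_merge` is implemented), the sketch below merges streamed chunks while dropping any prefix of a new chunk that repeats the end of the text collected so far. The helper names `drop_overlap` and `merge_chunks` and the sample chunks are assumptions made for this example.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that repeats the end of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text duplicated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Made-up chunks whose boundaries overlap, for illustration only.
sample_chunks = ["The capital", " capital of", " of France", " is Paris."]
merged = merge_chunks(sample_chunks)
print(merged)
```

A suffix/prefix heuristic like this cannot distinguish a retransmitted overlap from text that legitimately repeats (for example `"ha"` followed by `"ha"`), so it is only suitable when chunks are known to overlap.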



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a computer science student who specializes in AI. I've always been fascinated by how technology can help us solve complex problems. I love learning about new technologies and how they can be applied in real-world scenarios. I'm always on the lookout for new challenges and opportunities to contribute to the field. 

I hope you have a positive and productive day! 🌟✨ #ComputerScience #AI #Tech #Learning #NewOpportunities #DailyLife #TechContribution #TechHumor #TechLife 🤖 #Innovation #LearningTech 🚀 #TechLife #TechHumor #Tech

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, which is famous for its iconic landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also known for its rich history, including being the birthplace of the F
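
An asynchronous call can be awaited alongside other coroutines. The sketch below shows the general `asyncio.gather` pattern using a stand-in coroutine, `fake_async_generate`, which is hypothetical and exists only so the example runs without a model. With a real engine, the awaited call would be `llm.async_generate(...)` as in the cell above. Whether issuing several such calls concurrently is appropriate for a given engine setup is an assumption to verify, not something this sketch establishes.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for an engine call: yields control once,
    # then returns a dict shaped like the outputs shown above.
    await asyncio.sleep(0)
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts, sampling_params):
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as its arguments.
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(run_all(demo_prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, result in zip(demo_prompts, results):
    print(f"{prompt!r} -> {result['text']!r}")
```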

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [insert your profession or job title] with [insert your years of experience in the industry]. I started my career in [insert the first year of your career] and have been working in [insert the first two years of your career] for [insert the number of years you've been in the industry]. I enjoy [insert one or two hobbies or interests], and I'm always looking for new opportunities to expand my knowledge and skills. I'm excited to continue learning and growing in my career.

Note: Replace [Name] with your actual name and add any additional information you'd like to include in



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is located in the center of the country and serves as the city's administrative center. The city is home to the national parliament, the Supreme Council, and the Louvre Museum, among other important institutions. Paris is known for its rich history, vibrant culture, and beautiful architecture, making it a major destination for tourists and locals alike. The city is also home to many notable French artists and writers, including Pablo Picasso and Gustave Eiffel. Its use of the French language, known as "la langue française," is a significant part of its identity. Paris is the 9th most populous city in the world and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several trends that will shape how it is developed, applied, and used. Some of the key trends include:

1. Increased reliance on AI for decision-making: As AI technology continues to advance, it is becoming more accessible to a wider range of people. As a result, there is a growing emphasis on using AI to make more informed decisions and make better decisions, particularly in industries such as healthcare, finance, and transportation.

2. Emphasis on ethical AI: The development and deployment of AI systems will be influenced by ethical considerations. As AI systems become more complex and sophisticated, there will be a need to ensure
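
The consumption pattern in `main()` is an ordinary `async for` over an asynchronous generator. The sketch below reproduces that pattern with a stand-in generator, `fake_stream`, which is hypothetical and exists only so the example runs without a model. It also collects the chunks into one string, which can be handy when the full text is needed after streaming finishes.

```python
import asyncio


async def fake_stream(text):
    # Hypothetical stand-in for a streaming call: yields one word-sized
    # chunk at a time and lets the event loop run in between.
    for word in text.split(" "):
        await asyncio.sleep(0)
        yield word + " "


async def stream_to_string(text):
    pieces = []
    async for chunk in fake_stream(text):
        print(chunk, end="", flush=True)  # show progress as chunks arrive
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces).rstrip()


full_text = asyncio.run(stream_to_string("Paris is the capital of France."))
```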




```python
llm.shutdown()
```
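
In a script, as opposed to a notebook, it can be worth guaranteeing that `shutdown()` runs even when generation raises. The following is a minimal sketch of one way to do that with `contextlib.contextmanager`. `StubEngine` is a hypothetical stand-in so the example runs without a model, and `managed_engine` is an illustrative helper, not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the methods used here."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on normal exit and on exceptions


engine = StubEngine()
try:
    with managed_engine(engine) as active:
        active.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure after generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

With a real engine, the object passed to `managed_engine` would be the one created by `sgl.Engine(...)` earlier in this document.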