# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
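
The four modes correspond to four ordinary Python calling conventions: a plain function, a generator, a coroutine, and an async generator. The sketch below illustrates only those shapes with a toy token list; the `toy_*` functions and `TOKENS` are stand-ins invented for this example and are not SGLang APIs.

```python
import asyncio

TOKENS = ["Paris", " is", " the", " capital"]  # toy data, not model output


def toy_generate():
    """Non-streaming, synchronous: return the whole result at once."""
    return "".join(TOKENS)


def toy_generate_stream():
    """Streaming, synchronous: a generator that yields chunks."""
    for token in TOKENS:
        yield token


async def toy_async_generate():
    """Non-streaming, asynchronous: a coroutine that must be awaited."""
    await asyncio.sleep(0)  # hand control back to the event loop once
    return "".join(TOKENS)


async def toy_async_generate_stream():
    """Streaming, asynchronous: an async generator consumed with `async for`."""
    for token in TOKENS:
        await asyncio.sleep(0)
        yield token


async def consume_async():
    whole = await toy_async_generate()
    chunks = [chunk async for chunk in toy_async_generate_stream()]
    return whole, chunks


sync_whole = toy_generate()
sync_chunks = list(toy_generate_stream())
async_whole, async_chunks = asyncio.run(consume_async())
print(sync_whole, sync_chunks, async_whole, async_chunks)
```

The sections below show the corresponding calls on the real engine.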

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
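
As a rough illustration of what a custom server on top of the engine can look like, the sketch below wraps an engine-like object in a framework-agnostic request handler. `StubEngine`, `handle_request`, and the JSON field names are hypothetical and chosen for this example. The stub only mimics the `generate(prompts, sampling_params)` call shape used later in this document, and the linked `custom_server.py` remains the reference for a real server.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine; echoes prompts instead of running a model."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" (echo of: {p})"} for p in prompts]


def handle_request(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected JSON with a 'prompts' list"})

    # Sampling parameters are optional in this sketch.
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]})


response = handle_request(
    StubEngine(),
    json.dumps({"prompts": ["Hello"], "sampling_params": {"temperature": 0.8}}),
)
print(response)
```

Keeping the handler independent of any web framework makes it easy to test and lets the same function sit behind whichever server library you prefer.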



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested event loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
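
The underlying issue is that `asyncio.run` refuses to start when an event loop is already running in the current thread, which is typically the case inside a notebook kernel. The standard-library-only sketch below reproduces that error without SGLang; the function names are illustrative. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```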

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops when running in a notebook
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Generate completions for all prompts in one batch
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Max and I am 16 years old. I am from Switzerland and I will be attending a college in Switzerland for the first time this year. I am studying economics and I am very interested in studying abroad.
I know that I need to be a citizen of the country of origin to be eligible to study abroad, but I am not sure where to find the correct information on my legal status. Is there a website or a database that I can search for this information? Can you please help me find the correct information for my situation?
Yes, you can find the correct information on your legal status for studying abroad by visiting the Government of
Prompt: The president of the United States is
Generated text:  an elected official who serves a four-year term. He is the leader of the United States and represents the country. The officeholder is responsible for the administration of the country, the defense, and the economy. He is the leader of the nation. It is the highest nationa
```
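
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. The pure-Python sketch below shows the conventional meaning of the two parameters on a toy distribution: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The helper names and `toy_logits` are invented for illustration, and SGLang's actual sampler may differ in its details.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for four tokens
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

Lower temperatures concentrate probability on the top token, and lower `top_p` values shrink the candidate set, which is why the streaming example below (temperature 0.2) tends to produce more conservative text than this one (temperature 0.8).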

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the response and merge the chunks into a single string
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character, such as "a friendly, outgoing, and helpful person" or "a dedicated, hardworking, and organized person"]. I enjoy [insert a short description of your character's interests or hobbies, such as "reading, cooking, or playing sports"]. I'm always looking for ways to [insert a short description of your character's goal or purpose, such as "to improve my skills in [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the annual Eiffel Tower Festival. It is also the seat of the French government and the country's cultural and political center. Paris is a major tourist destination and a popular destination for international business and diplomacy. The city is known for its rich history, art, and cuisine. It is also home to the Louvre Museum, the most famous art museum in the world. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage. The city is also known for its fashion industry and its role in the French economy.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and efficient solutions to complex problems.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions.

3. Increased reliance on AI for decision-making: As AI becomes more integrated with human intelligence, it is likely to become a more important tool for decision-making in many industries
```
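
To make "overlap removal" concrete, here is a minimal sketch of one way to merge streamed chunks when a new chunk may repeat the tail of the text received so far. `trim_repeated_prefix` and `merge_chunks` are names chosen for this example, and the approach is an assumption about the general technique, not a description of how `stream_and_merge` is implemented.

```python
def trim_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Drop the prefix of new_chunk that repeats the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that was already received."""
    merged = ""
    for chunk in chunks:
        merged += trim_repeated_prefix(merged, chunk)
    return merged


toy_chunks = ["The capital", " capital of", " of France", " is Paris"]
merged_text = merge_chunks(toy_chunks)
print(merged_text)
```

A heuristic like this cannot tell an accidental overlap from a genuine repetition in the generated text (for example, `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.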



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; results come back in prompt order
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm a [insert character's profession/occupation]. I'm passionate about [insert one or two things about my character that stand out]. I have [insert a few hobbies or interests]. My favorite [insert one or two things]. What about you? What's your name? What's your profession/occupation? What's your favorite hobby? Tell me about yourself! 🎨✨

---

---

I'm going to make the introduction a bit more casual and less formal. I'll make it shorter and more concise, with a bit more enthusiasm. Let's move on to the next part of our conversation

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as the City of Light. It is the largest city in Europe and the fifth-largest city in the world. The city has a rich history and culture, with landmarks such as Notre-Dame Cathedral, the Eiffel
```
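
One practical benefit of the asynchronous form is that other coroutines can make progress while a request is pending. The sketch below shows the general `asyncio.gather` pattern with a stand-in coroutine; `fake_async_generate` is hypothetical and simply sleeps, so it says nothing about the engine's real latency or scheduling.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates waiting on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def run_all(prompts):
    # Start every request at once. gather() returns results in input
    # order, even though later prompts finish first in this toy setup.
    pending = [
        fake_async_generate(p, delay=0.01 * (len(prompts) - i))
        for i, p in enumerate(prompts)
    ]
    return await asyncio.gather(*pending)


toy_prompts = ["A", "B", "C"]
results = asyncio.run(run_all(toy_prompts))
for prompt, result in zip(toy_prompts, results):
    print(prompt, "->", result["text"])
```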

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm a [insert age] year old boy who is passionate about [insert hobby or interest]. I love to [insert something positive or positive] about myself, and I believe that I can make a positive impact on the world by [insert something constructive or constructive]. What's your name? What's your age? What's your hobby or interest? What do you love to do for a hobby or interest? What's something you're passionate about? What's your purpose in life? What's your goal? What's your dream? What's your new goal? What's your new goal? What's your new

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the south of the country and is the largest city in the European Union. The city is famous for its architecture, cuisine, and fashion, and is a hub for the arts, culture, and business. Paris is home to the Eiffel Tower and many historical landmarks, including the Louvre Museum and the Notre-Dame Cathedral. The city is known for its annual festivals and events, including the World of Dancer Festival and the Haute Couture Conference. Paris is also known for its fast-paced city life, bustling streets, and world-class sports teams. The city is home to a large number of French people

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to continue to evolve rapidly as technology advances and new breakthroughs emerge. Here are some possible trends that could shape the AI landscape:

1. Increased integration with human intelligence: AI will continue to become more integrated with human intelligence, allowing it to perform tasks that are complex or difficult for machines alone. This could lead to more accurate predictions, better decision-making, and even more complex tasks.

2. Development of advanced languages and tools: AI will likely see continued development of advanced programming languages and tools that allow for more complex and nuanced AI systems. This could lead to the creation of new AI models and techniques that are not yet possible with traditional
```
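
Streamed output arrives as small chunks, often single tokens that carry their own leading space, so a consumer usually needs to accumulate them if it wants the full text at the end. The sketch below shows that pattern with a toy async generator; `fake_stream` and `collect` are hypothetical names used only for illustration.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical async generator standing in for a streaming call."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control between chunks
        yield chunk


async def collect(stream):
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)  # a real consumer might also print or forward it
    return "".join(parts)


toy_chunks = [" Paris", ",", " located", " in", " France"]
full_text = asyncio.run(collect(fake_stream(toy_chunks)))
print(repr(full_text))
```

Joining the chunks without a separator preserves the spacing, because each chunk already includes any leading whitespace it needs.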




```python
# Release the engine's resources when finished
llm.shutdown()
```
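
In a script, it can be convenient to guarantee that `shutdown()` runs even when inference raises an exception. The sketch below wraps an engine-like object in a context manager; `StubEngine` and `managed_engine` are hypothetical helpers written for this example, and the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and call shutdown() on exit, even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


captured = []
try:
    with managed_engine(StubEngine) as engine:
        captured.append(engine)
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(captured[0].is_shut_down)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path=...)`, using the same constructor call as at the start of this section.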