# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example in a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
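To see why this is needed: `asyncio.run()` refuses to start while an event loop is already running, which is the situation inside a Jupyter notebook cell. The standard-library sketch below reproduces that error without SGLang; the coroutine names `main` and `already_running` are hypothetical. `nest_asyncio.apply()` patches asyncio so that nested calls such as the `asyncio.run(main())` calls later in this document work in those environments.

```python
import asyncio


async def main():
    # Hypothetical stand-in for the SGLang coroutines used later on.
    return "done"


async def already_running():
    # We are now inside a running event loop, as in a notebook cell.
    coro = main()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(already_running())
print(message)
```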

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow asyncio.run() inside the notebook's running event loop
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

```text
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.75it/s]
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
===============================
Prompt: Hello, my name is
Generated text:  Henri. My favourite color is orange. I like to stay up late at night and listen to music. I love to read books and I like to go on exciting adventures with my friends. I love to cook yummy meals with my family. I'm very adventurous and enjoy trying new things. I'm always looking for new experiences and I love to learn new things. I'm very passionate about volunteering, so I spend my free time helping others in need. I believe that volunteering is a great way to make a positive impact on the world.

What does Henri have in common with Nick?
A) They have similar hobbies
B)
===============================
Prompt: The president of the United States is
Generated text:  not a member of either the Republican or the Democratic party. They do not belong to either party. Pick the correct answer from the options:
Options are:
1). The Republican party;
2). The Democratic party;
3). Neither party;
4). None of the above; 4). None of the above;

The president of the United States is not
```

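The `sampling_params` dictionary controls decoding. `temperature` divides the logits before the softmax (values below 1 sharpen the distribution), and `top_p` restricts sampling to the smallest set of most-likely tokens whose cumulative probability reaches `top_p` (nucleus sampling). The pure-Python sketch below illustrates this standard technique on a toy list of logits; it is not SGLang's sampler, which works on batched tensors, and the function name `sample_top_p` is hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample a token index with temperature scaling and nucleus (top-p) filtering."""
    # Temperature scaling: divide logits before the softmax.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose cumulative probability >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices treats weights as relative, which renormalizes over the kept tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]
# With a small top_p only the single most likely token survives the filter.
print(sample_top_p(logits, temperature=1.0, top_p=0.5))  # 0
```

Lowering `temperature` or `top_p` makes generation more deterministic, which is why the streaming examples below use `temperature=0.2` and `top_p=0.9` for more focused answers.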
### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who has a [Number] of years of experience in [Field of Interest]. I'm a [Type of Character] who has a [Number] of years of experience in [Field of Interest]. I'm a [Type of Character] who has a [Number] of years of experience in [Field of Interest]. I'm a [Type of Character] who has a [Number] of years of experience in [Field of Interest]. I'm a [Type of Character] who has a [Number] of years

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Library, and the French Academy of Sciences. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. Its history dates back to the Roman Empire and is known for its beautiful architecture and art. The city is also home to many international organizations and institutions, including the European Union and the United Nations. Paris is a city of contrasts, with its modern skyscrapers and historic neighborhoods. It is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose and treat diseases, predict patient outcomes, and improve patient care. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare in the coming years.

2. AI in finance: AI is already being used in finance to automate trading, fraud detection, and risk management. As AI technology continues to improve, we can expect to see even more widespread use of
```

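`stream_and_merge` (imported from `sglang.utils`) collects the streamed chunks into one string with overlap removal. The stdlib-only sketch below shows one way such merging can work, assuming overlap means "a new chunk repeats the tail of the text already received": find the longest suffix of the accumulated text that is also a prefix of the new chunk, and append only the remainder. The function names `trim_overlap` and `merge_chunks` and the chunk list are hypothetical stand-ins, not the helper's actual source; in the notebook the chunks come from the engine's streaming output.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that repeats a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, removing overlapping text between them."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks where some re-send part of the previous text.
chunks = [" Paris", "Paris, the", " the capital", " of France."]
print(merge_chunks(chunks))  # " Paris, the capital of France."
```

Because the check is purely textual, a chunk that legitimately begins with the same characters the text already ends with would also be trimmed, so this approach suits display-oriented merging rather than exact token reconstruction.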


### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm an accomplished [job title] with a deep passion for [specific hobby or activity]. I am passionate about [mention something you enjoy, like watching a movie, playing a sport, or cooking food]. I love helping others, and I'm always looking for new challenges to challenge myself and explore new skills. I believe that every person has the potential to achieve their goals, and I'm excited to help others reach theirs as well. What's your name and what do you do? [Optional]: I'm a writer, a photographer, a speaker, or a fitness trainer. What do you do? [Optional]:

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France and serves as the largest and most populous city in the country, with a population of approximately 2.2 million people. The city is known for its rich h
```

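Here a single `await llm.async_generate(prompts, sampling_params)` call submits the whole batch. If you instead issue one coroutine per prompt, `asyncio.gather` returns results in the order its awaitables were passed, regardless of which finishes first, so zipping with `prompts` still pairs each prompt with its own output. The sketch below uses a hypothetical stand-in coroutine, `fake_async_generate`, in place of an engine call, with random delays so the requests complete out of order.

```python
import asyncio
import random


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for an engine call: finishes after a random delay.
    await asyncio.sleep(random.uniform(0, 0.05))
    return {"text": f"<completion for {prompt!r}>"}


async def main(prompts, sampling_params):
    # gather preserves argument order, not completion order.
    outputs = await asyncio.gather(
        *(fake_async_generate(p, sampling_params) for p in prompts)
    )
    return list(zip(prompts, outputs))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
pairs = asyncio.run(main(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```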
### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a software engineer and I've always been fascinated by the world of computers and the way they work. I've always admired the code that powers the internet and the way it's able to process information. I'm always interested in learning new things and exploring new technologies. I believe that with a strong understanding of computer science, I can help to solve complex problems and make the world a better place.

So, thank you for taking the time to meet me. Let's connect! Let's chat! 🌐✨

That's a great self-introduction! Can you expand on your passion for solving complex

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and the nation’s political, cultural, and economic center. It has been a major center of European culture since the 12th century, and Paris remains one of the world’s leading tourist destinations.

Specifically, the definition of Paris is as follows:

- **The Big Apple**, but also the heart of France.
- **The capital** of France and the largest city.
- **A symbol of France**.
- **The seat of government**.
- **A city of art and learning**.

- **A quintessential city of the French romance**. The city’s artistic and intellectual heritage

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be driven by several key trends, including:

1. Increased integration with other technologies: As AI becomes more integrated with other technologies such as IoT, blockchain, and the Internet of Things, its capabilities will continue to expand. This will lead to more seamless integration of AI into existing systems, as well as the creation of new applications that exploit the potential of AI.

2. Greater emphasis on ethical AI: As AI technology becomes more prevalent, there is a growing emphasis on ensuring that it is developed and used ethically. This includes issues such as bias, privacy, and transparency, and will likely lead to stricter regulations and better guidelines for
```

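`async_stream_and_merge` appears to play the same role for the async API: an async generator that yields only the new, non-overlapping part of each streamed chunk, which is why the loop above prints each `cleaned_chunk` with `end=""`. A stdlib-only sketch of that pattern follows, assuming overlap means a chunk repeating the tail of the text already emitted; `fake_stream`, `stream_and_trim`, and `trim_overlap` are hypothetical names, and `fake_stream` stands in for the engine's streaming output.

```python
import asyncio


def trim_overlap(existing_text, new_chunk):
    # Remove the longest prefix of new_chunk that repeats a suffix of existing_text.
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_stream(chunks):
    # Hypothetical async source that yields text chunks one at a time.
    for chunk in chunks:
        await asyncio.sleep(0)
        yield chunk


async def stream_and_trim(source):
    # Async generator: yield only text that has not been emitted yet.
    emitted = ""
    async for chunk in source:
        cleaned = trim_overlap(emitted, chunk)
        emitted += cleaned
        if cleaned:
            yield cleaned


async def main():
    pieces = []
    async for piece in stream_and_trim(fake_stream([" Paris", "Paris,", " the", " the capital"])):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(main())
```

Yielding per chunk keeps output incremental; the caller never needs the full text in memory to display it.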
```python
# Shut down the engine and release its resources
llm.shutdown()
```