# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
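
As a rough illustration of the second use case, the sketch below wraps an engine-like object in a request handler using only the standard library. `EchoEngine`, `handle_generate`, `make_handler`, and the JSON field names are hypothetical stand-ins invented here so the snippet runs without a model. The only thing assumed about the real engine is a `generate(prompt, sampling_params)` call that returns a dict with a `"text"` key, as used later in this document. The linked `custom_server.py` remains the authoritative example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompt, sampling_params=None):
        # Mirrors the {"text": ...} output shape used elsewhere in this document.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


# Exercise the request path directly, without opening a socket.
response = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(response.decode("utf-8"))

# To actually serve requests (blocks until interrupted):
# HTTPServer(("127.0.0.1", 8000), make_handler(EchoEngine())).serve_forever()
```

Keeping the request logic in a plain function such as `handle_generate` makes it possible to test the server path without a network socket, and the stand-in engine can be swapped for a real engine instance.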



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
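
To see why this is needed: notebook kernels already run an event loop, and `asyncio.run()` refuses to start inside one. The sketch below reproduces that situation with the standard library only. `inner`, `outer`, and `try_nested_run` are illustrative names, and how the engine uses `asyncio` internally is not shown here.

```python
import asyncio


async def inner():
    return "done"


def try_nested_run():
    """Call asyncio.run() and report what happens instead of raising."""
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"


async def outer():
    # A loop is already running here, similar to a notebook cell.
    return try_nested_run()


print(try_nested_run())  # no loop is running yet, so this call succeeds
print(asyncio.run(outer()))  # the nested call fails without nest_asyncio
```

Applying `nest_asyncio` patches the event loop so that the nested call is allowed.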

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Julie. I'm a 13-year-old girl from England. I'm a student in a middle school in London. My telephone number is 7071212. I like to play with my friends. My friends like to go to the park. My father likes to go to the cinema. My mother likes to go to the movies. I have a pet dog. It's named Timmy. He's a small brown dog. He's very cute. He's a good pet. Timmy and I like to play together. I can't have a pet because my parents don't let me. It's
Prompt: The president of the United States is
Generated text:  a powerful man who has the power to make decisions in the country. He can make important decisions about the country's policies and how it will grow and prosper in the future. His decisions are the key to the country's success. He has the power to make decisions on policies like healthcare, education, and economic policies.
As the president, it is important that the president takes care of his own health and well-being. It is important for the
```
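
In an offline batch job, the generated texts are usually written somewhere rather than printed. The following is a minimal sketch of pairing prompts with outputs as JSON Lines. `to_jsonl` and the record field names are hypothetical, and the placeholder outputs only imitate the `{"text": ...}` shape returned by `llm.generate` above.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Placeholder outputs with the same {"text": ...} shape as llm.generate results.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```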

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Type of Vehicle] [Vehicle Name]. I have been driving for [Number of Years] years and have driven [Number of Miles] miles. I am a [Favorite Hobby] [Hobby Name]. I am a [Favorite Book] [Book Title]. I am a [Favorite Music Artist] [Artist Name]. I am a [Favorite Sport] [Sport Name]. I am a [Favorite Movie] [Movie Title]. I am a [Favorite Book Club Member] [Name]. I am a [Favorite Book Club Member] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Ville-Marie" or simply "Paris". It is the largest city in France and the third-largest city in the world by population. Paris is a cultural and historical center with many famous landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also a major financial and business center, with many international corporations and financial institutions headquartered there. Paris is known for its romantic and artistic atmosphere, and is a popular tourist destination for visitors from around the world. It is also home to many important museums, including the Louvre and the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, there is a growing interest in developing AI that can learn and adapt to new situations, rather than simply following pre-programmed instructions. This could lead to more complex and sophisticated AI systems that can solve complex problems and make decisions that are difficult for humans to make. Finally, there is a growing concern about the ethical and social implications of AI, including issues such as bias,
```
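
The "overlap removal" in the banner above matters when a streaming chunk repeats text that has already been received. The following is a minimal sketch of one way to merge such chunks, and not necessarily how `stream_and_merge` is implemented in `sglang.utils`. `trim_overlap`, `merge_chunks`, and the sample chunks are hypothetical names and data chosen for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Chunks whose beginnings repeat the tail of the text received so far.
sample_chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(sample_chunks))
```

A suffix/prefix heuristic like this would also drop text that the model legitimately repeats at a chunk boundary, so it is a trade-off rather than a lossless merge.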



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I am a [current occupation] with over [number] years of experience in [job title], and I love [reason for being in this occupation]. I am [age] years old and I enjoy [reason for being passionate about this topic]. I am always looking for opportunities to grow and learn, and I am always eager to contribute to the success of my team. Thank you for having me! 🌟✨

## Format and Precision

- **"Hello, my name is [name]."** - Start with a polite, conversational greeting.
- **"I am a [current occupation] with

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the second largest in the European Union, with a population of over 11 million people. It is located on the northern bank of the Seine River, in the heart of the Paris Basin, and is home to the Eiffel Tow
```
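
One reason to prefer the asynchronous API is that other coroutines can make progress while a generation call is pending. The sketch below illustrates that pattern with a hypothetical `FakeAsyncEngine` whose `async_generate` only sleeps and echoes. It stands in for the real engine so the snippet runs without a model, and its timing says nothing about actual SGLang performance. `heartbeat` and `run_batch` are likewise illustrative names.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.05)  # pretend the model is busy
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks):
    """Unrelated work that keeps running while generation is awaited."""
    for i in range(3):
        await asyncio.sleep(0.01)
        ticks.append(i)


async def run_batch(engine, prompts):
    ticks = []
    # Await generation and the side task together on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_batch(FakeAsyncEngine(), ["Hello, my name is"]))
print(outputs[0]["text"], ticks)
```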

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a human being. I'm here to help people, whether it's with information, assistance, or just being a friend. Let me know if you need anything, and I'll do my best to help. How can I assist you today? [name] [occupation] [character] [character] [character] Hello, my name is [name], and I'm a human being. I'm here to help people, whether it's with information, assistance, or just being a friend. Let me know if you need anything, and I'll do my best to help. How can I assist you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Facts about Paris:
1. It is the capital of France and the largest metropolitan area in Europe.
2. It is located on the western coast of the Mediterranean Sea.
3. It is the 13th most populous city in the world with an estimated population of 10.7 million residents.
4. Paris is known for its rich history, art, food, and fashion. It is famous for its Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Louvre Museum of Fine Arts.
5. It is home to many cultural landmarks, including the Palace of Versailles,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, with many potential areas for growth and innovation. Here are some possible trends in AI that are currently being explored and likely to continue in the coming years:

1. Autonomous vehicles: With the continued development of AI, autonomous vehicles are likely to become more common and sophisticated. These vehicles could potentially be programmed to avoid obstacles and navigate roads using sensors and machine learning algorithms.

2. Personalized healthcare: AI is already being used in healthcare to help diagnose and treat diseases more effectively. As AI becomes more advanced, we can expect to see even more personalized treatments, with AI algorithms being used to analyze patient data and identify the best treatment options
```
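
The streaming loop in this section prints chunks as they arrive. If the complete text is also needed afterwards, the chunks can be accumulated while streaming. This is a sketch under assumptions: `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)` so that the snippet runs without a model, and `print_and_collect` is an illustrative helper, not part of SGLang.

```python
import asyncio


async def fake_stream(text, size=8):
    """Hypothetical async generator that yields text in fixed-size pieces."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start : start + size]


async def print_and_collect(stream):
    """Print each chunk as it arrives and also return the full text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(
    print_and_collect(fake_stream(" Paris is the capital of France."))
)
```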




```python
llm.shutdown()
```