# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
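
As a rough illustration of the custom-server idea, the sketch below wraps an engine's `async_generate` call in a framework-agnostic request handler. Everything except the `async_generate(prompts, sampling_params)` call shape and the `output["text"]` field is an assumption made for this example: `StubEngine`, `handle_generate`, and the JSON payload layout are hypothetical and are not part of SGLang. The linked `custom_server.py` remains the reference example.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or SGLang."""

    async def async_generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict with a
        # "text" field per prompt.
        return [{"text": f" (stub completion for {p!r})"} for p in prompts]


async def handle_generate(engine, request_body: str) -> str:
    """Decode a JSON request, run the engine, and encode a JSON response.

    The payload layout ("prompts", "sampling_params", "texts") is an
    assumption chosen for this sketch.
    """
    payload = json.loads(request_body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    return json.dumps({"texts": [output["text"] for output in outputs]})


request = json.dumps(
    {"prompts": ["Hello"], "sampling_params": {"temperature": 0.8}}
)
response = asyncio.run(handle_generate(StubEngine(), request))
print(response)
```

Keeping the handler independent of any web framework means the same function could be called from whichever server library you prefer; swapping `StubEngine()` for a real engine instance would be the only change on the engine side, assuming the call shape shown in this document.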



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
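
The reason for this patch can be reproduced with the standard library alone. IPython and Jupyter already run an event loop, and `asyncio.run` refuses to start a second loop inside a running one. The sketch below triggers that error on purpose; `inner` and `outer` are illustrative names, and `outer` plays the role of a notebook cell.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like code in an
    # IPython/Jupyter cell. Starting another loop with asyncio.run fails.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted, which is why the examples in this document can call `asyncio.run(main())` from a notebook.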

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling: you pass a list of prompts in a single call and receive one output per prompt.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Matthew I am a software developer and my major in Computer Science and I work in a data science role. I have been using Amazon Redshift for years and have been using Hadoop for the past two years. I have worked with Spark and have been using Spark for the past two years as well. I have a Bachelor of Science in Computer Science and a degree in Electrical Engineering. I have used Hadoop for data warehousing and analytics work for companies like Facebook, JPMorgan Chase, and now I am looking to use it for Kaggle competitions. I have experience with building and deploying Hadoop applications using Spark. I have used Docker
Prompt: The president of the United States is
Generated text:  a well-known figure in the world of politics. The president works in the White House, where he has to balance a wide range of responsibilities. It is common to have other important people in the White House who also have important jobs. Each president has a job descr
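
The `temperature` and `top_p` entries in `sampling_params` follow the usual meaning of those names: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The sketch below illustrates that conventional definition in plain Python. It is not SGLang's sampler, and `top_p_filter` and the toy logits are assumptions made for this example.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} kept after temperature + top-p.

    Illustrative only: a textbook version of nucleus filtering, not the
    implementation used by any particular engine.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
kept_default = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
kept_cold = top_p_filter(toy_logits, temperature=0.2, top_p=0.9)
print(sorted(kept_default), sorted(kept_cold))
```

With the toy logits, the lower temperature concentrates almost all probability on the top token, so fewer candidates survive the filter. This matches the more repetitive, deterministic outputs typically seen with low-temperature settings.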

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville-Marie" and "La Ville de Paris". It is the largest city in France and the second-largest city in the European Union. Paris is a cultural and historical center, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major financial center and a major transportation hub. Paris is home to many world-renowned museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. It is also home to many famous restaurants, including the Eiffel Tower, the Mou

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation: AI is likely to become more prevalent in manufacturing, transportation, and other industries, where it can perform tasks that are currently performed by humans. This could lead to increased efficiency and productivity, but it could also lead to job displacement for some workers.

2. AI ethics and privacy: As AI becomes more advanced, there will be a need to address ethical and privacy concerns. This could lead to new regulations and standards for AI development and use,
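
The banner printed by this cell mentions overlap removal. The internals of `stream_and_merge` are not shown in this document, so the following is only a sketch of one way to merge streamed chunks when a new chunk may repeat the tail of the text received so far. `trim_overlap` and `merge_chunks` are hypothetical helper names, not SGLang functions.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello wor", "world", "ld, again"])
print(merged_text)
```

One limitation of this approach is that it cannot tell an accidental overlap from genuine repetition: merging `"ha"` followed by `"ha"` yields `"ha"`. It is therefore only appropriate when the stream is known to resend overlapping text.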



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [type] with [number] years of experience in [industry]. I specialize in [specific skill or area of expertise]. I enjoy [reason for love/hobby or interest]. Here's my short, neutral self-introduction:
"Hello, my name is [Name]. I'm a [type] with [number] years of experience in [industry]. I specialize in [specific skill or area of expertise]. I enjoy [reason for love/hobby or interest]. " 
Feel free to adapt the name, type, industry, experience, and skill as needed to better fit your personality and interests. Good

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also the country's largest city. It is located on the island of France and is the seat of the French government, the capital, and the largest city in metropolitan France. 

(Note: In 2021, the city is known as París) 

The

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily and I'm a 30-year-old entrepreneur with a passion for technology and innovation. I believe in the power of technology to change the world and I'm always looking for new ways to improve my skills and expertise. I'm a creative problem solver and a highly organized and detail-oriented person, and I'm committed to always striving for excellence. I'm excited about the opportunities and possibilities that lie ahead for me in my career and I'm looking forward to bringing my ideas and expertise to new challenges and opportunities. Thanks for taking the time to meet me. Have a great day! 👋✨✨

Hey Emily, I'm writing

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Given the fact that Paris is the capital of France, answer the following question: What is the capital of India? The capital of India is New Delhi. 
Explain the reasoning behind your answer. New Delhi is the capital of India, located in the northern part of the country. It serves as the seat of the Indian Government and has served as the capital since 1947, when the British government was dissolved and the Indian National Congress led by Mahatma Gandhi took over the governance of India. The capital is also known as the city of the king as it is the administrative and ceremonial capital of the Indian Union

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and constantly evolving, with potential for significant advancements across numerous domains. Here are some key trends and developments to watch for in the near future:

1. Increased reliance on AI in healthcare: With more people suffering from chronic diseases and more complex medical procedures, AI is likely to play an increasingly important role in healthcare, from personalized treatment recommendations to disease diagnosis and prevention.

2. Integration of AI into every industry: As AI becomes more integrated into various industries, from manufacturing and transportation to banking and retail, we can expect to see more widespread use of AI in a wide range of applications.

3. AI-powered education: With the growing demand for


In [6]:
llm.shutdown()
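
In a script, as opposed to a notebook, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below wraps the call in a small context manager. `engine_session` and `StubEngine` are hypothetical names introduced for this example; the only engine method assumed beyond what this document shows is nothing more than `generate` and `shutdown`.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Stand-in used for illustration only; not the SGLang engine."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" (stub completion for {p!r})"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        engine.generate(["Hello, my name is"])
        raise ValueError("simulated failure in the middle of a batch")
except ValueError:
    pass

print(stub.is_shut_down)
```

With a real engine, the same pattern would read `with engine_session(sgl.Engine(model_path=...)) as llm:`, assuming the engine exposes `shutdown()` as used above.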