# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a separate HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
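The restriction comes from standard `asyncio`, which refuses to start an event loop from inside one that is already running (as is the case in IPython and Jupyter). The standard-library-only sketch below reproduces that failure mode. The names `inner` and `outer` are illustrative, and it is an assumption here that the engine hits the same restriction when `nest_asyncio` is not applied.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed, which is why it is applied before the engine is created.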

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Joe I am 52 years old I am a computer programmer but I play football and I have been doing it for the past 8 years. I play football for the Houston Texans but I have been working on my programming skills since I was 16 years old. I have taken a course in C++, which was really good.

What is the most important thing to know about me?

1. Shouldn't I tell my coworkers about this? I do have a hobby
2. Should I tell my coworkers about this? I am a programmer and have been doing football for 25 years
3. Should I tell my
Prompt: The president of the United States is
Generated text:  30 years older than the president of Brazil, who is 35 years old. If the president of Brazil decides to give an extra $100,000 bonus to the person he appoints, what will be the difference in age between the president of the United States and the president of Brazil after the bonus is given? To determine the difference in age between the president of the United States and
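To give a feel for what the `temperature` and `top_p` entries in `sampling_params` control, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It illustrates the general technique only and is not SGLang's sampler; the helper names and the four-token logits are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a four-token vocabulary

warm = top_p_filter(softmax_with_temperature(toy_logits, 0.8), 0.95)
cool = top_p_filter(softmax_with_temperature(toy_logits, 0.2), 0.9)
print("temperature=0.8, top_p=0.95:", [round(p, 3) for p in warm])
print("temperature=0.2, top_p=0.9: ", [round(p, 3) for p in cool])
```

In this toy setting, a lower temperature and a lower `top_p` both shrink the set of tokens that can be sampled, so the output becomes more deterministic.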

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its cuisine, fashion, and art scene. It is a popular tourist destination and a cultural hub in Europe. The city is home to many museums, theaters, and other cultural institutions. Paris is a city of contrasts, with its modern architecture and historical landmarks. It is a city of people, with a diverse population of over 10 million people. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots in factories to personalized medicine and virtual assistants. Additionally, AI will likely continue to be used for tasks that require human-like intelligence, such as language translation and emotional intelligence, and will likely be integrated into more and more aspects of our society, from education and healthcare to transportation and entertainment. However, there are also potential risks and challenges associated with AI, including issues of bias, privacy, and security, and it is
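The "overlap removal" named in the output header can be pictured as follows. This is a sketch under the assumption that a streamed chunk may repeat text that was already emitted. `trim_overlap` and `merge_stream` are illustrative helpers written for this example, not necessarily how `stream_and_merge` is implemented, and the chunks are hypothetical.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate streamed chunks into one string without repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
fake_chunks = [" Paris", " Paris, also", " also known as", " as the City of Light."]
merged_text = merge_stream(fake_chunks)
print(merged_text)
```

Suffix/prefix matching of this kind is a heuristic: it would also drop a chunk that legitimately repeats the end of the text so far, so it suits streams that are known to resend overlapping text.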



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm 28 years old, and I recently graduated from college with a degree in [field of study]. Before that, I worked in the financial industry for two years, and I have a passion for [interest or hobby]. I love spending time with my family and friends, and I'm also a big fan of [relatable hobby or activity]. I'm always looking for new experiences and learning new things, and I enjoy making connections with people. What's your favorite thing to do in your free time? I love traveling, trying new restaurants, and reading. I'm excited to meet you! 💻

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its rich history, stunning architecture, and lively culture. Paris is one of the world’s most famous cities and hosts numerous world-renowned attractions, including the Eiffel Tower, Notre
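One reason to prefer the asynchronous call is that the event loop stays free for other work while generation is awaited. The sketch below shows this with a stand-in object: `FakeEngine` is hypothetical and only mimics the call shape used above (`await ...async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key). It does not run a model.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that only mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # pretend the model is busy generating
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(log, stop):
    # Unrelated work that keeps running while generation is awaited.
    while not stop.is_set():
        log.append("tick")
        await asyncio.sleep(0.01)


async def run_demo():
    engine = FakeEngine()
    log, stop = [], asyncio.Event()
    ticker = asyncio.create_task(heartbeat(log, stop))
    outputs = await engine.async_generate(["Hello", "Bonjour"], {"temperature": 0.8})
    stop.set()
    await ticker
    return outputs, log


demo_outputs, demo_log = asyncio.run(run_demo())
print(len(demo_outputs), "outputs;", len(demo_log), "heartbeat ticks while waiting")
```

With the synchronous `generate` call, the heartbeat task would not get a chance to run until generation had finished.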

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I'm a [insert job role/occupation] at [insert company name]. I'm a [insert age, gender, and nationality] born in [insert year of birth] in [insert country]. I'm [insert a physical characteristic, like "short, curly-haired, or blue-eyed"] and have a [insert hobby, like "reading, cooking, or playing sports"] that keeps me motivated to learn and grow as a person. What kind of person are you? [insert a brief sentence to introduce yourself, like "Hello, I'm a [insert name] and I'm a [insert occupation

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Western Part of the country. The city is known for its rich history, stunning architecture, and lively atmosphere. It is also a major economic and cultural center, with many prestigious museums, theaters, and cultural events. Paris has a population of over 2 million people and is a popular tourist destination. It is also home to notable landmarks such as the Eiffel Tower and the Louvre Museum. Overall, Paris is a bustling and fascinating city with much to offer visitors of all ages. **Paris:** The capital of France is a UNESCO World Heritage Site, and it has a vibrant and eclectic culture, rich history

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and there are many potential trends that could shape the industry in the coming years. Here are some of the key trends that could influence AI in the years ahead:

1. Deep learning and neural networks: One of the most exciting areas of AI research is the development of deep learning and neural networks. These are computer algorithms that can learn from large datasets and perform tasks that would be impossible for humans to accomplish. As more data is processed and algorithms are improved, the potential for AI to solve complex problems will increase.

2. Augmented and virtual reality: As virtual reality technology becomes more advanced, we can expect to see more immersive
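The loop in this section consumes one stream at a time. If several streams should make progress concurrently, `async for` can be combined with `asyncio.gather`. The sketch below uses a stub async generator, `fake_stream`, in place of the engine; the function names, chunks, and delays are all made up for illustration, and whether concurrent streaming helps with a real engine depends on the setup.

```python
import asyncio


async def fake_stream(chunks, delay):
    """Hypothetical async generator that yields text chunks with a small delay."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def collect(prompt, chunks, delay):
    # `async for` pulls chunks as they become available.
    pieces = []
    async for chunk in fake_stream(chunks, delay):
        pieces.append(chunk)
    return prompt, "".join(pieces)


async def run_streams():
    jobs = [
        collect("The capital of France is", [" Paris", ",", " located"], 0.01),
        collect("The future of AI is", [" rapidly", " evolving"], 0.02),
    ]
    # gather drives both streams concurrently and returns results in input order.
    return await asyncio.gather(*jobs)


stream_results = asyncio.run(run_streams())
for prompt, text in stream_results:
    print(f"{prompt}{text}")
```

Because `gather` preserves input order, each merged text stays paired with its prompt even when the streams finish at different times.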




```python
llm.shutdown()
```