# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
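Notebook environments such as IPython already run an asyncio event loop, and `asyncio.run` refuses to start a second one inside it, which is presumably why the patch above is needed there. The following minimal sketch uses only the standard library to check whether a loop is already running; the helper names `loop_is_running` and `check_inside_loop` are illustrative and not part of SGLang.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # Inside a coroutine, an event loop is always running.
    return loop_is_running()


if __name__ == "__main__":
    print("plain script:", loop_is_running())
    print("inside coroutine:", asyncio.run(check_inside_loop()))
```

If the check reports a running loop in your environment, applying `nest_asyncio` before creating the engine is the safer choice.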

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alice. I'm a student at a high school. I like to play different sports. I have two big brothers and a small sister. I don't like to eat vegetables and don't like to go to the gym. I like to play with my friends and I often share my toys. What does Alice think of her family?

Pick your answer from:
 *They are a great family.
 *They are a terrible family.
 *They are alright family.
 *I don't know. Let me know the correct answer. To determine Alice's family's character, let's consider her perspective on her family:

1. Alice likes to play
Prompt: The president of the United States is
Generated text:  a very important person in the country. He is in charge of the government. He is also the leader of the country. He does not have much power. He is very popular with the people in the country. He is responsible for making laws and rules for the country. He is in charge of the country’s money, which he calls the President’s Budget. He is also in charg
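For offline batch jobs it is often convenient to persist the results. The following is a minimal sketch that assumes only what the example above shows, namely that each output is a dict with a `text` key. The `fake_outputs` list stands in for the return value of `llm.generate`, and `to_jsonl` is a hypothetical helper, not an SGLang function.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the outputs printed above (not real model output).
prompts = ["Hello, my name is", "The capital of France is"]
fake_outputs = [{"text": " Alice."}, {"text": " Paris."}]

jsonl = to_jsonl(prompts, fake_outputs)
print(jsonl)
```

With a real engine, you would pass the `outputs` list returned by `llm.generate` in place of `fake_outputs` and write the resulting string to a file.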

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a unique trait or characteristic that sets me apart from other characters in the story]. And what brings you to this company? I'm looking forward to [insert a reason for your interest in the company]. And what's your favorite part of your job? I love [insert a favorite part of my job]. And what's your biggest challenge? I'm always looking for ways to [insert a challenge you're trying to overcome]. And what

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and diverse cultural scene. It is also the birthplace of the French Revolution and the home of the French language. Paris is a bustling metropolis with a rich history and a vibrant cultural scene. It is the largest city in France and one of the most visited cities in the world. The city is home to many famous landmarks and attractions, including the Louvre Museum, the Champs-Élysées, and the Notre-Dame Cathedral. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city of art

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration with human decision-making: AI systems will become more integrated with human decision-making processes, allowing for more complex and nuanced decision-making.

2. Greater emphasis on ethical considerations: As AI systems become more advanced, there will be a greater emphasis on ethical considerations, such as privacy, bias, and accountability.

3. Increased use of AI in healthcare: AI will be used to improve the accuracy and efficiency of medical diagnosis and treatment, leading to better patient outcomes.

4. Greater use of AI
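The "overlap removal" mentioned in the banner refers to merging streamed chunks without repeating text. How `stream_and_merge` does this internally is not shown in this document. The following is only a sketch of one possible approach, with the illustrative names `trim_overlap` and `merge_chunks`: drop from each new chunk the longest prefix that already appears as a suffix of the accumulated text.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Remove from `chunk` the longest prefix that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the end of what was merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose beginnings repeat the end of the previous text.
chunks = [" Paris", " Paris, known", "known for its"]
merged = merge_chunks(chunks)
print(repr(merged))
```

One caveat of this heuristic: it cannot tell an accidental overlap from a genuine repetition, so a stream that legitimately repeats text (for example "ha" followed by "ha") would be collapsed.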



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] who has [experience level] years of experience. I'm [gender] and [age] years old. I'm [physical description] and have [height] in [length of height] and [weight] in [weight of height] in [meters]. I have a [interest] in [what interests me]. I'm a [yourself-identified type of person] with a strong sense of [something about yourself] and I thrive on [something about yourself that makes you unique]. I have a love for [something that makes me happy] and am [your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

What is the capital of India? The capital of India is New Delhi. Could you please clarify on that? The capital of India is New Delhi. It is the second largest city in India. It is the economic, cultural, and political capital of India. It is also known as the 'Rice Capita
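The practical benefit of the asynchronous form is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with the standard library only. `fake_async_generate` is a stand-in for `llm.async_generate` (assumed here to return a list of dicts with a `text` key, as in the example above), and `heartbeat` and `generate_with_heartbeat` are hypothetical helpers.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Stand-in for llm.async_generate: simulate latency, return dicts with "text".
    await asyncio.sleep(0.05)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(stop: asyncio.Event, ticks: list):
    # Runs concurrently with generation until asked to stop.
    while not stop.is_set():
        ticks.append("tick")
        await asyncio.sleep(0.01)


async def generate_with_heartbeat(prompts, sampling_params):
    stop = asyncio.Event()
    ticks = []
    hb_task = asyncio.create_task(heartbeat(stop, ticks))
    try:
        outputs = await fake_async_generate(prompts, sampling_params)
    finally:
        # Always stop the background task, even if generation fails.
        stop.set()
        await hb_task
    return outputs, ticks


if __name__ == "__main__":
    outputs, ticks = asyncio.run(
        generate_with_heartbeat(["The capital of France is"], {"temperature": 0.8})
    )
    print(outputs[0]["text"], "| heartbeat ticks:", len(ticks))
```

In a real application the heartbeat would be replaced by whatever else needs to run during generation, such as request handling or progress reporting.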

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Occupation] with a passion for [Job Title]. I'm a creative problem-solver and I love to explore the world around me with a keen eye for detail and an unquenchable curiosity. I thrive on learning from the challenges and opportunities I encounter daily, and I believe that every successful person I meet is someone who is constantly working towards their goals, no matter what obstacles they face. I'm a team player and always willing to lend a helping hand when I can. I love to travel, explore new places, and immerse myself in different cultures. I'm always looking for new

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Sure! Paris, the capital of France, is renowned for its iconic landmarks, rich cultural heritage, and bustling urban life, making it a beloved destination for millions of visitors each year. The city is also known for its sophisticated fashion scene, vibrant nightlife, and gastronomic delights, including famous Michelin-starred restaurants. Additionally, Paris has a rich history, featuring ancient ruins and imposing historical buildings. The city, with its vibrant culture, beautiful landscapes, and opulent architecture, is a truly global melting pot. Paris is a city that continues to inspire new generations and a global hub of influence. 🏠📈✨

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  undeniably exciting and dynamic. Here are some of the potential trends that might shape the AI landscape in the coming years:

1. Increased Integration with Human Intelligence: As AI gets more sophisticated, it will likely become more integrated with human intelligence. This means that AI systems will be able to learn from and adapt to human behavior, emotions, and decision-making processes.

2. More Personalization: With the rise of big data and machine learning, we can expect to see more personalized AI systems that are designed to learn from individual user preferences and behavior. This could lead to more efficient and effective use of resources, as well as more personalized marketing

```python
llm.shutdown()
```
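If an earlier step raises an exception, a final `llm.shutdown()` call may never run. The following is a minimal sketch of a guard, assuming only that the engine exposes `shutdown()` as used above. `StubEngine` and `shutting_down` are hypothetical stand-ins so that the snippet runs without a GPU or SGLang installed.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield `engine` and always call its shutdown() on exit, even on error."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Stand-in for sgl.Engine; only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.closed = True


engine = StubEngine()
with shutting_down(engine) as guarded:
    stub_outputs = guarded.generate(["The capital of France is"], {"temperature": 0.8})

print("shutdown called:", engine.closed)
```

With a real engine, the same pattern would wrap the object returned by `sgl.Engine(...)` in place of `StubEngine()`.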