# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
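
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in an async request handler. `StubEngine` and `handle_request` are hypothetical names invented for this example; `StubEngine` only imitates the `{"text": ...}` result shape shown later in this document so that the snippet runs without a GPU. With SGLang you would presumably pass an `sgl.Engine` instead, and the linked `custom_server.py` remains the reference example.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # pretend to do asynchronous work
        return {"text": f" (echo of {prompt!r})"}


async def handle_request(engine, raw_body: str) -> str:
    """Parse a JSON request, call the engine, and return a JSON response."""
    body = json.loads(raw_body)
    prompt = body["prompt"]
    sampling_params = body.get("sampling_params", {})
    output = await engine.async_generate(prompt, sampling_params)
    return json.dumps({"prompt": prompt, "text": output["text"]})


response = asyncio.run(
    handle_request(StubEngine(), json.dumps({"prompt": "Hello, my name is"}))
)
print(response)
```

Keeping the handler independent of any web framework means the same function can be called from whichever HTTP layer you choose.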



## Nest Asyncio

If you want to use the **Offline Engine** in IPython (for example, a Jupyter notebook) or in any other code where an event loop is already running, apply `nest_asyncio` first:

```python
import nest_asyncio

nest_asyncio.apply()
```
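
To see why this is needed, the standard-library sketch below reproduces the underlying problem without SGLang: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just as it is inside IPython/Jupyter.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches the event loop so that such nested calls are allowed.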

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Natasha and I'm a computer science student from the University of Tokyo.
I'm always thinking about new technologies and their potential applications in the world. So far, I've worked on a project called Nucleo, a robot that can be controlled through a mobile phone or a laptop. This project is aimed at improving the usability of robots and reducing the likelihood of accidents.
Would you like to know more about my project? I'm always happy to answer any questions. Here's the link to the project: [Link to project](https://github.com/natasha-nucleo) Thank you for your time! Let me know if you
Prompt: The president of the United States is
Generated text:  a member of a team. This team is composed of one from each of 5 countries. If 50% of the countries are in Asia and the rest are in Europe, and the team is divided into 2 teams of 3 countries each, how many ways can the president be chosen?
To determine the number of ways the president can be chose

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or experience here]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm always looking for new challenges and opportunities to grow and learn. What are your hobbies or interests? I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I'm always looking for new challenges and opportunities to grow and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the annual Eiffel Tower Festival. It is also the birthplace of French writer Victor Hugo and the home of the Louvre Museum. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. The city is also home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is also known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. In the future, we may see even more widespread use of AI in healthcare, with more sophisticated algorithms and machine learning models being developed to improve diagnosis, treatment, and patient care.

2. AI in finance: AI is already being used in finance to improve risk management, fraud detection, and investment strategies. In the future, we may see even more widespread use of
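
The cell above refers to "overlap removal" when merging streamed chunks. The sketch below shows one way such merging could work: drop the part of each new chunk that already appears at the end of the accumulated text. `trim_overlap` and `merge_chunks` are names chosen for this illustration; this is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, trimming any overlap with the text so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["Hello wor", "world", "d!"])
print(merged_text)  # Hello world!
```

Note that this heuristic cannot distinguish overlap from genuine repetition: merging `"ha"` with `"ha"` yields `"ha"`, so it suits streams whose chunks may repeat earlier text rather than streams of strictly new tokens.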



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your profession] who has always been fascinated by [Your hobby or passion]. What's your favorite [Subject]? And what kind of person are you? Remember, I'm not trying to impress you. Just sharing my thoughts. Feel free to ask me anything you'd like to know. It's just to let you know I have some interesting things about myself. I hope you like it.

If you're ready to have a short conversation, let's get started! 🎉✨

#Title: Self-Introduction 🎉✨

Hello! My name is [Your Name], and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly anticipated and involves a multitude of technological advancements that could revolutionize various fields, from healthcare to education to transp
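
Beyond passing a whole list to `async_generate`, asynchronous code can also issue one call per prompt and await them together. The sketch below shows that pattern with `asyncio.gather`, which returns results in the order the awaitables were passed. `StubEngine` and `generate_concurrently` are hypothetical names; the stub only imitates the `{"text": ...}` result shape so the example runs without a model, and it is an assumption that you would substitute a real engine in its place.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in that mimics the result shape used above."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(
    generate_concurrently(
        StubEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95}
    )
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```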

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Title/Role]. My hobbies and interests include [List of hobbies and interests]. How can you help me today?

My [Area of Expertise or Expertise Area]. How can I help you today? [Greeting and interaction] I'm here to help you with [Your Task or Goal]. [Explain how you can assist them]. Please let me know how I can help. [Thank you and goodbye. ] 

Note: The content of the self-introduction should be neutral and professional, avoiding personal or sensitive topics. The tone should be friendly and approachable, while also providing context and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the left bank of the Seine River. It is a bustling metropolis with a rich cultural heritage and is home to many notable landmarks and museums. The city is known for its Notre-Dame Cathedral and Seine River, and is also home to many international institutions and events. It is also a major transportation hub, with numerous subway and public transportation systems. Paris is a popular tourist destination and a cultural hub, known for its food, art, and fashion scenes. It is also a major economic center, with significant industries and businesses. The French capital is known for its rich and diverse culture, history, and architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several trends that will continue to evolve and drive innovation in this rapidly growing field. Here are some potential trends that are likely to shape the future of AI:

1. Deep learning and machine learning: One of the key trends in AI is the development of deep learning and machine learning algorithms that are more sophisticated and capable of handling complex tasks. This will likely lead to breakthroughs in areas such as natural language processing, computer vision, and speech recognition, among others.

2. AI ethics and accountability: As AI systems become more prevalent in our daily lives, there will be increasing pressure on society to ensure that AI is developed

In [6]:
llm.shutdown()
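
In scripts, it can be convenient to guarantee that `shutdown()` runs even if generation raises an exception. One possible pattern is a small context manager, sketched below. `managed_engine` is a hypothetical helper, and `StubEngine` is a stand-in that merely records whether `shutdown()` was called, so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params):
        return {"text": " <stub completion>"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always shut it down when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    output = engine.generate("The capital of France is", {"temperature": 0.0})

print(output["text"], engine.is_shut_down)
```

With a real engine, the factory would be something like `lambda: sgl.Engine(model_path=...)`, using the same constructor call as at the top of this document.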