# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
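The core of such a server is usually a small function that turns a request body into an engine call and a response body. The sketch below is not the linked example: `StubEngine` and `handle_generate` are hypothetical names, and the stub only assumes that calling `generate` with a single prompt returns a dict with a `"text"` key. Swapping in a real engine and a web framework of your choice is left out on purpose.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params):
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    prompt = request["prompt"]
    # Fall back to an empty dict when the client sends no sampling parameters.
    sampling_params = request.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]}).encode("utf-8")


response = handle_generate(
    StubEngine(),
    b'{"prompt": "Hello", "sampling_params": {"temperature": 0.2}}',
)
print(response.decode("utf-8"))
```

Keeping the handler independent of the transport makes it easy to test without starting a server.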



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
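
To see why nested loops are a problem, the following standard-library sketch reproduces the kind of error that appears when code tries to start an event loop while another one is already running. It does not use SGLang; `inner` and `outer` are illustrative names, and the exact error raised inside a notebook may come from a different call.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such re-entrant calls are allowed.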

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Ana and I am a small business owner. I'm starting a new business and would like to know how to advertise it on social media.
Starting a new business can be a great way to grow your business and reach a wider audience. However, advertising your new business on social media can be an effective way to attract customers and reach potential customers. Here are some steps you can follow to advertise your new business on social media:
1. Choose the right social media platform: The best social media platforms for advertising a new business are Instagram, Facebook, and Twitter. Instagram is ideal for businesses that want to create visual content, while Facebook and Twitter
Prompt: The president of the United States is
Generated text:  expected to have a lot of different job titles, ranging from the top general to the CEO, and therefore may be asked to be a witness in criminal trials. What are the differences in the legal definitions for the term "witne
```
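
In offline batch jobs the results are usually written to disk rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. `to_jsonl` is a hypothetical helper, and the sample data only imitates the shape used above: a list of dicts with a `"text"` key, returned in the same order as the prompts.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "text": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```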

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to the Roman Empire and the French Revolution. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is home to many famous museums, including the Musée d'Orsay and the Musée Rodin. It is also known for its cuisine, including French cuisine and its famous cheese, foie gras. Paris is a vibrant and diverse city with a rich cultural heritage that continues to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and preferences. This could lead to more personalized and efficient AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations and safeguards to ensure that AI systems are used in a responsible and beneficial way. This could
```
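
The "overlap removal" mentioned above can be pictured as follows: when a streamed chunk repeats text that has already been received, only the new part is appended. This is a minimal sketch of that idea, not the code of `stream_and_merge`, whose implementation may differ; `strip_overlap` and `merge_chunks` are hypothetical names.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    longest = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks where each one repeats the end of the previous text.
demo_chunks = ["The capital", " capital of France", " France is Paris."]
merged_text = merge_chunks(demo_chunks)
print(merged_text)
```

The same function also handles streams where every chunk carries the full text so far. One limitation of this heuristic is that it cannot tell a transport-level repeat from a model that genuinely repeats itself at a chunk boundary.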



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I come from [Country], and I have always been [occupation]. I am [yourself]! My passion for [yourself] has led me to pursue a [career path] in [yourself]. I am [yourself]! 🌟

*If you're interested in my interests or hobbies, please feel free to mention them in your comments and I will consider adding them to my profile.* 📚📖

---

Please note that the character you are describing is fictional, and I'm just playing with the idea of a character named "Alex." Feel free to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also known as the City of Love.
Paris, officially known as the "City of Love," is the capital city of France and serves as the administrative and political center of the nation. The city has a rich history and is home to many iconic landmarks such as the Eiff
```
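
An awaitable generation call pays off when several requests, or other coroutines, share one event loop. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that only imitates the call shape of `async_generate`, so the example runs without a model; whether per-prompt calls or a single list call suits the real engine better is not something this sketch decides.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in; only mimics the awaitable call shape."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do work
        return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(engine, prompts, sampling_params):
    # gather() runs all requests concurrently and returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    run_concurrently(StubAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```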

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [occupation]! I'm a [特长/爱好] who has always been passionate about [hobby]. I'm [age] years old, and I've always been [characteristic/characteristic 1] to [characteristic 2]. I enjoy [activity] and [idea] and I always strive to [goal]. I've learned a lot [number of skills] over the years and I'm always [level of dedication]. I'm [extraordinary quality] and I'm always [positive attitude]. I'm a [ability/condition] that makes me a [interest/interest

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the most populous city in France. It is also one of the largest and most influential cities in Europe and is known for its cultural, artistic, and historical attractions. The city is also renowned for its music, cuisine, and fashion. Paris is located in the center of France, on the Seine River, and is home to the Eiffel Tower. Its history dates back to the 6th century, and it is considered the cradle of French culture and literature. It is also a global center for education, with many prestigious universities and institutions of higher learning located in Paris. Paris is also known for its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a vast array of technological advancements that will greatly influence the way we live, work, and interact with one another. Some possible future trends in AI include:

1. Increased accuracy and precision: As AI systems become more advanced, they are likely to become even more accurate and precise in their decision-making and problem-solving. This will make AI systems even more useful and useful to a wider range of people.

2. Autonomous vehicles: Self-driving cars are likely to become increasingly common in the future, thanks to advancements in AI technology. These vehicles will be able to navigate the roads themselves, making them safer and more efficient.

3.
```
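
When streaming asynchronously, it is often useful to show chunks as they arrive and still keep the complete text afterwards. The sketch below does both. `fake_stream` is a hypothetical async generator that stands in for a streaming call such as `async_stream_and_merge`, so the example runs without a model.

```python
import asyncio


async def fake_stream(text, chunk_size=5):
    """Hypothetical async generator standing in for a streaming engine call."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield text[start:start + chunk_size]


async def collect(stream):
    # Print each chunk as it arrives and keep the full text for later use.
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
```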




```python
llm.shutdown()
```
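
In a script, it is safer to make sure that `shutdown()` also runs when generation fails. One way to do that is a `try`/`finally` block, sketched below. `StubEngine` and `run_batch` are hypothetical names; the stub only imitates the `generate` and `shutdown` methods used in this document, so the example runs without a model.

```python
class StubEngine:
    """Hypothetical stand-in exposing the two methods used in this document."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_batch(engine, prompts, sampling_params):
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        # Runs on success and on error, so the engine is always released.
        engine.shutdown()


engine = StubEngine()
try:
    run_batch(engine, [], {"temperature": 0.8})
except ValueError as exc:
    print("generation failed:", exc)
print("shut down:", engine.is_shut_down)
```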