# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would only add complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
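
To illustrate why the patch is needed, the following minimal sketch uses only the standard library (no SGLang) and reproduces the error that plain `asyncio` raises when `asyncio.run` is called while a loop is already running. The coroutine names `inner` and `outer` are made up for this illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from inside the first one
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, this kind of nested call is permitted, which is what lets the notebook cells in this document call `asyncio.run(main())`.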

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.53it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.52it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Elena and I am 18 years old. I work as a software developer. I can be found in the following places: Denver, Colorado and Dallas, Texas. In Denver, I often travel by car. I am traveling between Denver and Dallas by car. At Dallas, I have my "office" where I work. 

Now, I'm curious to know how to get there. 
To travel between Denver and Dallas by car, I should use a straight line path or should I use a zig-zag route? I am not familiar with the zig-zag route. 

Also, what's the name of the zig-z
Prompt: The president of the United States is
Generated text:  30 years older than the president of Brazil. The president of Brazil is older than the president of France by 3 years. If the president of France is currently 30 years old, how old is the president of Brazil?

To determine the age of the president of Brazil, we need to follow the information provided step by step.

1. We know that the president of France is currently 30 years old.
2. The pre

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm passionate about [Your Passion], and I'm always looking for ways to [Your Goal]. I'm excited to meet you and learn more about you. [Your Name] [Your Job Title] [Company Name] [Company Address] [City, State, ZIP Code] [Phone Number] [Email Address] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile] [GitHub Profile] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Instagram Profile] [GitHub Profile] [LinkedIn Profile] [Twitter Profile

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Academy of Sciences. Paris is a bustling city with a rich history and culture, and is a popular tourist destination. It is the largest city in France and the second-largest city in the European Union. The city is known for its fashion industry, art, and cuisine, and is a major center for science and research. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is also home to many international organizations and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with other technologies: AI is already being integrated into a wide range of other technologies, such as smart homes, self-driving cars, and virtual assistants. As these technologies continue to evolve, we can expect to see even more integration of AI into our daily lives.

2. Enhanced capabilities: AI is likely to become even more powerful and capable in the future. This could include improvements in natural language processing, image recognition, and autonomous decision-making.

3. Greater emphasis on ethical considerations: As AI becomes more integrated into our daily lives, there will be a
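
The cell above refers to overlap removal. If a stream yields chunks whose beginning repeats text that was already emitted, the repeated part has to be dropped before appending. The sketch below shows one way such merging could work. `drop_overlap` and `merge_stream` are hypothetical helpers written for this illustration and are not presented as the implementation behind `stream_and_merge`; the chunk list is made up so that the snippet runs without an engine.

```python
def drop_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate streamed chunks, skipping text that overlaps what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk["text"])
    return merged


# Made-up chunks standing in for a streamed response (assumption: each has a "text" key).
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris, known"},
    {"text": "known for its"},
]

merged = merge_stream(fake_chunks)
print(repr(merged))
```

One caveat of this approach: a chunk that legitimately starts with the same characters the text ends with (for example a repeated word) would also be trimmed, so it only suits streams where such overlap really means duplication.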



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a software developer with over 10 years of experience in the industry. I have a keen eye for detail and a passion for creating innovative solutions that solve complex problems. I am always looking for new and exciting challenges to tackle, and I am always ready to learn from my colleagues and mentors. In my spare time, I enjoy exploring new parts of the world and trying out new foods. I am a friendly and outgoing person, and I thrive on collaborating with others to achieve our shared goals. Thank you for taking the time to meet me! [Name] [Company Name] Software Developer [Name] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

The statement can be expanded to include additional information about Paris's architecture, culture, or any other aspects of the city, if desired. Example: 
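
The asynchronous API also makes it possible to issue several independent requests and await them together. The sketch below illustrates that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in, not an SGLang class, so the snippet runs without a model; its `async_generate` only mimics the call shape used above. Whether separate requests gain anything over passing the whole prompt list in one call depends on the engine's scheduler.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; NOT part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one request per prompt and wait for all of them concurrently.
    # asyncio.gather returns results in the same order as the inputs.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


async def demo():
    engine = FakeEngine()
    return await generate_all(engine, ["a", "b"], {"temperature": 0.8})


results = asyncio.run(demo())
for result in results:
    print(result["text"])
```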



### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Job Title] who started my career in [Industry] and have been in [Position] for [Number of Years]. I'm [Age] and I currently live in [City], with a [Primary Interest/ passion] of [Primary Interest]. I'm always [Comments on Personality or Personality Traits] and I enjoy [Positive Thing to Do]. I'm excited to meet you and learn more about [Your Field of Expertise]. Let me know how I can assist you in [How You Can Help]. Thank you! [Name] 🌟

Please note that you are free to use



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical city with a rich cultural heritage, known for its magnificent architecture, iconic landmarks, and a lively atmosphere. Paris is the capital of France and is located in the south of the country on the Île de France. The city is also home to important cultural institutions such as the Louvre Museum and the Palace of Versailles. It is also famous for its annual Eiffel Tower celebration. The capital of France is a bustling metropolis with a diverse population and a vibrant cultural scene. Its architecture, cuisine, and festivals all contribute to its reputation as a leading global city. Paris is often referred to as "



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a combination of rapid progress, innovation, and convergence of different fields. Here are some potential trends that could be expected in the AI landscape in the coming years:

1. Increased focus on ethics and responsibility: As AI systems become more complex and autonomous, there will be increasing scrutiny of their impact on society and the economy. Governments and organizations will likely need to develop frameworks for responsible AI development, ensuring that AI is used ethically and responsibly.

2. Greater use of AI in healthcare: AI-powered healthcare systems are already being developed to assist doctors and nurses in diagnosing and treating diseases. As AI becomes more advanced,




```python
llm.shutdown()
```