# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
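
Whether the patch is needed depends on whether an event loop is already running in the current thread. As a minimal sketch, the standard-library call `asyncio.get_running_loop()` can be used to check. The helper name `event_loop_is_running` is hypothetical and is not part of SGLang or `nest_asyncio`.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True if an asyncio event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def _check_inside_loop() -> bool:
    # Called from within a coroutine, so a loop is running here.
    return event_loop_is_running()


outside = event_loop_is_running()
inside = asyncio.run(_check_inside_loop())
print(outside, inside)
```

When this is run as a plain script, `outside` is `False` and `inside` is `True`. In IPython or Jupyter the first check would typically already return `True`, which is the situation where `nest_asyncio.apply()` is required.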

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
# Only the imports used in this document are kept.
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.75it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tanya and I love cats, but I love more than just that. I love to work with animals, play with them, and travel with them. I have a passion for zoos, animal shelters, and helping people make sense of the world around them. I hope to live in a city that is full of people who are interested in animals and have the time to take care of them. I am a teacher at a high school, and I love being able to work with young people. I believe that animals are part of our human experience, and I hope to be able to help people find ways to connect with them and respect
Prompt: The president of the United States is
Generated text:  200 years old. His daughter is 1/2 the age of the president. His son is twice the age of the daughter. How old is the president's son?

To determine the age of the president's son, we start by defining the ages of the president, daughter, and son based on the information given.

1. The president's age is given as 200 years.
2. The da
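
The loop above relies on `generate` returning one dict per prompt, in prompt order, with the completion under the `text` key. As a sketch of how such results might be collected for later processing, the following pairs each prompt with its output and serializes the pairs as JSON Lines. `StubEngine` and `generate_records` are hypothetical names introduced here so the example runs without a model; with a real engine you would pass `llm` instead. Splitting the prompts into chunks is optional and is shown only as a way to bound how many results are held before they are written out.

```python
import json
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        # Mimics the output shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params, chunk_size=2):
    """Run prompts in fixed-size chunks and pair each prompt with its output text."""
    records = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
records = generate_records(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})

# One JSON object per line, in the same order as the prompts.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```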

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill] [Ability] who has always been [Positive Traits] and [Negative Traits]. I'm [Your Name] and I'm here to [Your Goal or Purpose]. I'm excited to meet you and learn more about you. [Your Name] [Your Goal or Purpose] [Your Name] [Your Goal or Purpose] [Your Name] [Your Goal or Purpose] [Your Name] [Your Goal or Purpose] [Your Name] [Your Goal or Purpose] [Your Name] [Your Goal or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in the European Union and the world’s 10th largest city by population. It is located on the Seine River and is home to many of the world’s most famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its rich history, including the French Revolution and the French Revolution Square. The city is a hub of culture, art, and cuisine, and is a popular tourist destination. It is home to many of the world’s most famous museums, including the Louvre and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be a need to address privacy and security concerns. This could lead to more robust privacy protections and enhanced security measures to protect against AI-based threats.

3. Greater automation and efficiency: AI is likely to become more integrated with human intelligence, leading to greater automation
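
The cell above describes `stream_and_merge` as streaming "with overlap removal", which suggests that a streamed chunk may repeat text that was already received. The following is a sketch of that idea only and should not be read as the implementation in `sglang.utils`. The names `trim_overlap_sketch` and `merge_chunks` are hypothetical, and the chunk lists are made-up inputs.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Chunks that each repeat everything received so far.
cumulative = [" Paris", " Paris, also", " Paris, also known"]
# Chunks that each carry only the newly generated text.
incremental = [" Paris", ",", " also", " known"]

print(repr(merge_chunks(cumulative)))
print(repr(merge_chunks(incremental)))
```

Both calls yield the same merged string. One limitation of this heuristic is that, with incremental chunks, text that is legitimately repeated (for example a chunk `"ha"` arriving after the merged text already ends in `"ha"`) would be trimmed away.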



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am an AI language model. I am programmed to provide information and assistance to users like you, and I am constantly learning and improving. I am here to help you with any questions you may have and to provide you with reliable and accurate information. So, if you have any questions or need help, please don't hesitate to reach out to me. I am here to assist you. [Name]. How can I assist you today? [Name]. I'm [Name], an AI language model. I'm programmed to provide information and assistance to users like you. I'm here to help you with any questions you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Please answer the following question about the statement:
Is Paris a commune? No. 

Answer this question by providing the correct response.
No. Paris is a city, not a commune. 

I apologize
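
The cell above awaits a single `async_generate` call for the whole batch. When prompts arrive independently, a common asyncio pattern is to issue one call per prompt and bound the number of calls in flight with a semaphore. The sketch below illustrates that pattern with a hypothetical `StubAsyncEngine`, which only mimics an awaited call that takes a single prompt string. Whether per-prompt calls are preferable to one batched call for a real engine is not established here.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics the awaited call shape."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict:
        await asyncio.sleep(0.01)  # Simulate inference latency.
        return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(
    engine, prompts: List[str], sampling_params: Dict, max_concurrency: int = 2
) -> List[Dict]:
    """Issue one request per prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> Dict:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the order of its arguments, i.e. prompt order.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_bounded(StubAsyncEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, result in zip(prompts, results):
    print(f"{prompt!r} -> {result['text']!r}")
```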

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, which removes overlap between
        # chunks, instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am [Your Age] years old. I'm a [Your Field of Expertise] expert, and I specialize in [Your Specialty]. I enjoy [Your Passion], and I'm always looking for ways to improve my [Your Skill]. How can I help you today? 🌟✨✨✨

This is a neutral introduction that sets the tone for the character's role and personality. It doesn't assume anything about the character's skills or experience. It's simply the start of an engaging conversation. 

However, you could also consider an introduction that focuses on the character's specific skill or expertise,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is home to many important historical and cultural institutions, including the Louvre Museum, the Musée d'Orsay, and the Centre Pompidou. It is also home to many notable artists and intellectuals, including Michelangelo, Leonardo da Vinci, and Pablo Picasso. Paris is a bustling metropolis with a rich history and a vibrant cultural scene. It is the capital city of France and the largest city in the European Union. According to the 2019 census, Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving, with potential trends and developments shaping the way we live, work, and interact with technology. Here are some potential future trends in AI:

1. Increased Autonomy and Self-Driving Cars: With advances in AI, we may see more autonomous vehicles on the road, and self-driving cars becoming more common. This could lead to a shift in the transportation industry, as people move away from traditional driving.

2. Improved Medical Imaging and Diagnostic Tools: AI is already being used to enhance medical imaging, leading to more accurate and faster diagnoses. As AI improves, it could lead to even more precise and accurate diagnostic tools.
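
The streaming loop above prints each chunk as soon as the async generator yields it. If the full text is also needed afterwards, the chunks can be collected while they are forwarded. The sketch below shows that pattern with a hypothetical `fake_stream` generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, so it runs without a model. The helper `consume` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator, Callable


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for async_stream_and_merge(llm, prompt, sampling_params)."""
    for piece in [" Paris", ",", " which", " is", " known"]:
        await asyncio.sleep(0)  # Yield control, as a real stream would between chunks.
        yield piece


async def consume(stream: AsyncIterator[str], on_chunk: Callable[[str], None]) -> str:
    """Forward each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


seen = []
text = asyncio.run(consume(fake_stream("The capital of France is"), seen.append))
print(repr(text))
```

Here `seen.append` plays the role of the `print(..., end="", flush=True)` call in the cell above. Any callable that accepts a string could be used instead.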






In [6]:
llm.shutdown()
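
The final cell calls `llm.shutdown()` explicitly. In a script, one way to make sure this call also runs when generation raises an exception is to wrap the engine in a context manager. The following is a sketch under stated assumptions: `engine_session` and `StubEngine` are hypothetical names, and the stub only records whether `shutdown()` was called. With a real engine, the factory could be something like `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(factory):
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


# Normal path: shutdown() runs when the block exits.
with engine_session(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

# Error path: shutdown() still runs although generate() raises.
failed = StubEngine()
try:
    with engine_session(lambda: failed) as eng:
        eng.generate([], {})
except ValueError:
    pass

print(engine.is_shut_down, failed.is_shut_down)
```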