# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
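
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a plain request handler. It only assumes that the engine exposes `generate(prompts, sampling_params)` and returns one dict with a `text` field per prompt, as in the examples later in this document. `EchoEngine` and `handle_generate` are hypothetical names used so the snippet runs without a model; this is not the linked `custom_server.py`.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the output shape used in this document: one dict per prompt.
        return [{"text": f" (echo of: {p})"} for p in prompts]


def handle_generate(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]})


request_body = json.dumps(
    {"prompts": ["Hello, my name is"], "sampling_params": {"temperature": 0.8}}
)
response = handle_generate(EchoEngine(), request_body)
print(response)
```

In a real server, a function like `handle_generate` would be called from the route handler of whichever web framework you choose, with the stand-in replaced by the actual engine instance.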



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
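
To see why this is needed, the standard-library sketch below reproduces the underlying failure without SGLang: calling `asyncio.run` while an event loop is already running, as happens in IPython or Jupyter, raises a `RuntimeError`. The function names are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, similar to a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the event loop so that such nested calls are allowed.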

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  John. I am a student at Temple University, where I have been studying Mathematics for the past four years. At Temple University, I have made several friends and have been involved in a variety of extracurricular activities, such as Math Club and the Science Olympiad, among others. My main goal is to continue pursuing my studies and achieve academic success in the field of Mathematics. How can I improve my grades in Mathematics? There are several strategies that you can use to improve your grades in Mathematics. Here are a few ideas:

1. Set clear goals: Define your goals and create a study plan. Setting clear goals will help you
Prompt: The president of the United States is
Generated text:  a political office. The current president is Donald Trump. The current term of office of this office is 4 years. The next presidential election is coming soon. What will be the president of the United States in 2 years time? 
The current president is Dona
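
For offline batch jobs it is often convenient to keep each prompt together with its result. The sketch below pairs prompts with outputs and serializes them as JSON Lines. It assumes only the output shape shown above (a list of dicts with a `text` key); `to_records` and `fake_outputs` are hypothetical names, and the fake data stands in for a real `llm.generate` call.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


# Hypothetical outputs in the same shape as the llm.generate(...) results above.
fake_outputs = [{"text": " John."}, {"text": " Paris."}]
records = to_records(["Hello, my name is", "The capital of France is"], fake_outputs)

# One JSON object per line, ready to be written to a .jsonl file.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```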

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in the country and the most populous city in Europe. It is located on the Seine River and is the seat of the French government and the country's cultural, political, and economic center. Paris is known for its rich history, beautiful architecture, and vibrant culture, and is a major tourist destination. It is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a popular destination for tourists and locals alike, and is considered one of the most beautiful cities in the world. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the development of the technology in the coming years. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and preferences. This could lead to more personalized and adaptive AI systems that can better understand and respond to human needs.

2. Enhanced ethical considerations: As AI becomes more integrated with human intelligence, there will be a need for more ethical considerations to be taken into account. This could lead to the development of new ethical frameworks and standards for
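
The banner above mentions overlap removal, but the internals of `stream_and_merge` are not shown in this document. As a minimal sketch of the general idea, assuming streamed chunks may repeat the tail of the text received so far, the hypothetical helpers below drop the repeated prefix of each new chunk before appending it. This is an illustration, not the library's implementation.

```python
def strip_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing overlaps between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged)
```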



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert fictional name], and I'm a [insert fictional profession or academic degree] with a background in [insert relevant experience or background]. I enjoy [insert personal interests or hobbies], and I'm always looking for new opportunities to learn and grow. I'm also someone who is always willing to lend a helping hand, whether it's helping someone in need or just trying to make the world a better place. So, if you're looking for someone who's always ready to help and make a difference, I'm your guy! 🌟

Please let me know if you'd like me to add more information or if you have any

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "The City of Light". It is located in the north of France and is the seat of the French government, and is known for its cultural and artistic attractions. The c
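
Because the asynchronous call is awaitable, several batches can be scheduled together with `asyncio.gather`. The sketch below shows the pattern with a hypothetical `fake_async_generate` coroutine standing in for `llm.async_generate`, so it runs without a model. Whether concurrent calls improve throughput with a real engine depends on its scheduler and is not something this sketch demonstrates.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params=None):
    """Hypothetical stand-in for llm.async_generate; one dict per prompt."""
    await asyncio.sleep(0)  # yield control, as a real awaitable call would
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(batches):
    # Schedule all batches at once and wait for every result.
    results = await asyncio.gather(*(fake_async_generate(b) for b in batches))
    # Flatten the per-batch lists into one list; gather preserves input order.
    return [out for batch in results for out in batch]


batch_outputs = asyncio.run(run_batches([["a", "b"], ["c"]]))
print([o["text"] for o in batch_outputs])
```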

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Type of Character] who has been [How Many Years of Experience in Your Field of Work]. My expertise lies in [describe your area of expertise] and I have [describe your most notable achievement or accomplishment]. I am a professional who is always [describe your role, attitude, and approach] to success. I enjoy [describe your hobbies, interests, or other passions]. I am [describe your current profession or role], and I look forward to [describe your future goals or aspirations]. Thank you for asking! Let me know if you would like me to expand on anything in my introduction.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "la Grande et la Marseillaise."

that Paris is a city with a rich history, many famous landmarks, and a vibrant culture. It is also known for its popular restaurants, museums, and fashion scene. The city is a cultural hub and has played an important role in French history and culture.

That Paris is a city with a rich history, many famous landmarks, and a vibrant culture. It is also known for its popular restaurants, museums, and fashion scene. The city is a cultural hub and has played an important role in French history and culture. It is home to the Eiffel Tower



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  very promising, and there are several trends that are likely to shape the direction of AI development. Some of the possible future trends in AI include:

1. More intelligent machines: With the advancement of AI, it is likely that we will see more intelligent machines that are more capable of performing tasks that require human-like cognitive abilities, such as learning, decision-making, and problem-solving.

2. AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce medical errors, and increase efficiency. Future trends in AI in healthcare could include developing AI-powered diagnostic tools and therapies that can help doctors make more accurate diagnoses, identify diseases
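
Printing each chunk with `end=""` makes the pieces appear as one continuous text. If the full string is needed instead, the chunks can be collected from the async iterator and joined. The sketch below uses a hypothetical `fake_stream` async generator in place of a real streaming call so that it runs without a model.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical async generator standing in for a streaming engine call."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect(stream):
    """Consume an async iterator of text chunks and join them."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
print(full_text)
```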




```python
llm.shutdown()
```
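
In a script, it can be useful to make sure `shutdown()` runs even when generation fails partway through. One possible pattern, sketched below, wraps the engine in a small context manager. `FakeEngine` and `managed` are hypothetical names; the fake class only mimics the `generate` and `shutdown` methods used in this document so the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as active_engine:
        active_engine.generate(["Hello"])
        raise ValueError("simulated failure mid-batch")
except ValueError:
    pass

print(engine.closed)
```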