# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
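
The background, as far as the standard library is concerned, is that `asyncio` refuses to start a second event loop from inside a running one, and IPython or Jupyter cells already run inside a loop. The following minimal sketch uses only the standard library (the names `inner` and `outer` are illustrative, not part of SGLang) to reproduce the kind of error that `nest_asyncio.apply()` is meant to avoid:

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return None


message = asyncio.run(outer())
print(message)
```

Without patching, the nested call fails with a `RuntimeError`; after `nest_asyncio.apply()`, nested calls of this kind are allowed.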

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # CI-only patch module used when building the docs
    import patch
else:
    # Allow nested event loops, which is required inside notebooks
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ryan from Ohio, I live in the University of Toledo campus. I am an academic from the University of Toledo who is pursuing a degree in Environmental Engineering. I have been working in the field of civil engineering for the past 5 years. I have been interested in the environment since I was young and have always been passionate about environmental conservation. As a student, I have always been passionate about the environment and the people who live in it.
I am now planning to start my own business in environmental engineering, and I am currently writing a business plan to do so. The business plan will cover the goals, challenges, and strategies for the
Prompt: The president of the United States is
Generated text:  a very important person. In fact, he is the boss of the whole country. People usually like to choose him as the president of the United States. 

If you go to the United States, you would usually find many big and beautiful buildings
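
For offline batch jobs it is often convenient to persist the results. The sketch below relies only on what the cell above shows, namely that each output is a dict with a `"text"` key. The helper name `to_jsonl` and the sample outputs are made up for illustration and are not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hand-written stand-ins for the list returned by llm.generate(...).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```

With a real engine, `sample_outputs` would be replaced by the `outputs` list from the cell above.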

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] with [number] years of experience. I'm a [type of character] who is [character's personality traits or background]. I'm always [positive or negative] about [what I do or say]. I'm [positive or negative] about [what I do or say]. I'm [positive or negative] about [what I do or say]. I'm [positive or negative] about [what I do or say]. I'm [positive or negative] about [what I do or say]. I'm [positive or negative] about [what I do or say]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant cultural scene. It is also home to the Louvre Museum, the most famous art museum in the world, and the Notre-Dame Cathedral. Paris is a major cultural and economic center, with a rich history dating back to the Roman Empire and a modern city that has undergone significant development over the centuries. It is a popular tourist destination and a major center for business, politics, and entertainment. The city is also known for its cuisine, with dishes like croissants, escargot, and foie gras being popular. Paris is a city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and context-aware AI systems that can better understand and respond to human needs.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions. This could lead to more efficient and effective use of resources, as well as better decision-making in various industries.

3. Increased use
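
The banner above mentions overlap removal. One way such merging could work is to drop, from each new chunk, the longest prefix that already appears at the end of the text collected so far. The following is an illustrative sketch with hypothetical helper names and made-up chunks; it is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each chunk repeats the tail of the previous one.
chunks = ["The capital", "capital of France", "France is Paris"]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris
```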



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I am a [Occupation] who has always been passionate about [Objective]. I'm [Age] years old, and I'm always looking for [What is the character's dream] and [What is the character's biggest challenge]. I thrive on [What is the character's hobby or interest]. I believe in [Why do you believe in [Objective]]? I am [Age], and I love [What is the character's favorite food or drink]. I am [Age], and I enjoy [What is the character's favorite hobby or activity]. I am [Age] years old, and I have

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the Louvre Museum. The city has a population of around 2.3 million people. It is located in the northern part of France and is the largest city in Europe by area. Paris is known for its beautiful architecture, museums, and
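
One benefit of an asynchronous API is that several requests can be awaited together while other work proceeds. The sketch below illustrates the pattern with a stand-in coroutine, `fake_async_generate`, which is hypothetical and merely mimics the return shape seen above (a list of dicts with a `"text"` key). In real code the call would be `llm.async_generate(prompts, sampling_params)`; how the engine schedules concurrent calls is not covered here.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for llm.async_generate; it sleeps instead of running a model.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(batches, sampling_params):
    # Launch one request per batch and await them all concurrently.
    tasks = [fake_async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(batches, {"temperature": 0.8, "top_p": 0.95}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed in, so each result list lines up with its batch.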

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [insert occupation] who has always been passionate about [insert something that you do or are interested in, such as music, photography, or writing]. I'm always looking for new experiences and adventures and have been living a laid-back life for many years. I'm a [insert your profession or field of interest], and I'm always trying to stay up-to-date with the latest trends and technologies. I enjoy exploring new places and trying new things, and I'm always seeking out new challenges and experiences. Thank you for having me! 🌍🌟

I'm a writer and traveler with a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city in the Provence-Alpes-Côte d’Azur region, known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. Paris is a cultural and political center, home to many prestigious institutions, including the French National Library, the Metropolitan Museum of Art, and the Musée de l'Orangerie. With a population of over 2. 8 million people, Paris is one of the most populous cities in the world and is considered one of the world’s greatest cities. While the French capital is known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  constantly evolving, and there are many potential trends that we can expect to see in the years to come. Some of the most likely trends include:

1. Increased focus on ethical AI: As more data is collected and analyzed, there is a growing demand for ethical considerations. This includes issues like bias, transparency, and accountability in AI decision-making.

2. Greater integration of AI into healthcare: AI can help doctors and researchers make more accurate diagnoses and treatment plans, leading to better patient outcomes. In the future, we may see more integration of AI into healthcare, with AI algorithms used to analyze patient data and provide personalized treatment plans.

3.



```python
llm.shutdown()
```
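
In a script, an exception raised between creating the engine and calling `shutdown()` would skip the cleanup. One way to guard against that is a small hand-rolled context manager. The sketch below uses `FakeEngine`, a hypothetical stand-in that only records whether `shutdown()` was called, and `engine_session`, an illustrative helper that is not part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that only records shutdown calls."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    # Yield the engine and guarantee shutdown() runs, even if the body raises.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)  # True
```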