# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
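
The likely reason for this requirement is that IPython and Jupyter already run an asyncio event loop, and the standard library refuses to start a second one inside it. The engine-free sketch below reproduces that failure with plain `asyncio`. The coroutine names `inner` and `outer` are illustrative only and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, similar to a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are allowed.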

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.24it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Yuchen, a young programmer majoring in computer science, and I am currently working on this problem: Given a single string S, return the minimum number of substrings that need to be removed so that all characters in S are also present in the string S. 

For instance, if the input is "bcbabc", the output should be 2, because we need to remove "bca" and "bc" to make all characters in the string "bcbabc" appear in it.

Input:
- The first line contains an integer T denoting the number of test cases.
- Each of the next T lines contains
Prompt: The president of the United States is
Generated text:  a member of the executive branch of the United States government. As such, the president has the power to veto bills. The president can veto a bill passed by the United States Congress and it cannot be overridden. However, the president can never be convicted of a criminal offense, and as a result, the president cannot be tried for that offense. The presi

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter, a historic neighborhood. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is known for its cuisine, fashion, and art scene. It is a major economic and financial center, and is home to many of France's major cities and regions. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the city's vibrant culture. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the development of the technology in the coming years. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve risk management and fraud detection. As AI technology
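
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text that an earlier chunk already delivered. The sketch below illustrates the idea without an engine. `trim_overlap` and `merge_chunks` are hypothetical helpers written for this illustration (not necessarily how `stream_and_merge` is implemented), and the chunk list is made-up data standing in for a stream.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks: each one repeats the tail of the previous chunk.
fake_stream = ["The capital", "capital of France", "France is Paris."]
print(merge_chunks(fake_stream))
```

A suffix/prefix heuristic like this cannot tell accidental overlap from text the model genuinely repeated, so it is best treated as an illustration of the merging step rather than a general-purpose deduplicator.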



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert profession, such as "writer," "artist," "teacher," "researcher," etc.]. I have an average to good typing speed and I enjoy writing in my spare time. I love creating stories and making people laugh. What kind of person are you?

Please feel free to customize the character's appearance, personality, and any other details you like to include in your introduction. I'm really looking forward to reading about you! 😊✨

Your self-introduction is short, neutral, and speaks to your interests. It's appropriate for a fictional character, but it can be

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

How did the French Revolution begin? The French Revolution began on July 14, 1789, when the king and queen were forced to abdicate and face the threat of a French coup d'état.

What was
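
Because `async_generate` is awaited, it can be combined with other coroutines on the same event loop. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in (not part of SGLang) that mimics only the call shape used above, a coroutine returning one `{"text": ...}` dict per prompt, so the snippet runs without a model. How the real engine schedules concurrent calls is not covered here.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics only the call shape used in this document."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one request per batch and wait for all of them concurrently.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list can be zipped back to its batch of prompts.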

### Streaming Asynchronous Generation
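
As an engine-free illustration of the pattern used in this section, the sketch below consumes an asynchronous stream with `async for` and prints only the new text from each chunk. It assumes, purely for illustration, that each chunk carries the cumulative text generated so far. `fake_stream`, `stream_deltas`, and `collect` are hypothetical names, not SGLang APIs.

```python
import asyncio


async def fake_stream():
    """Made-up stream whose chunks carry the cumulative text generated so far."""
    for text in ["Paris", "Paris is", "Paris is a city."]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def stream_deltas(chunks):
    """Yield only the part of each chunk that has not been seen yet."""
    seen = ""
    async for chunk in chunks:
        text = chunk["text"]
        delta = text[len(seen):] if text.startswith(seen) else text
        seen = text
        yield delta


async def collect():
    pieces = []
    async for delta in stream_deltas(fake_stream()):
        print(delta, end="", flush=True)
        pieces.append(delta)
    print()
    return pieces


pieces = asyncio.run(collect())
```

The engine-backed cell below follows the same consumer shape, with `async_stream_and_merge` taking the place of the made-up stream.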

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Age] year old person. I have always been an active and healthy person with a passion for learning and understanding the world around me. I enjoy pursuing my own interests and often spend my days exploring different topics and events that fascinate me. Whether it's attending conferences, reading books, or simply taking a walk in the park, I find that my activities make me happy and fulfilled. I'm always looking for ways to improve my skills and knowledge, and I believe that I have a lot to offer anyone who is willing to learn from me. What's your name, and what do you do for



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Options are:
[+] negative
[+] positive
[+] negative
[+] positive
[+] positive
[+] positive

Paris is a major French city. It has been the capital of France for over a thousand years. It has a rich history and is known for its beautiful architecture, art, and food. Paris is a bustling city with many historic and modern landmarks. It has a diverse population and is a major tourist destination. The city is home to the Eiffel Tower, Louvre Museum, Notre Dame Cathedral, and many other attractions. Parisians love to travel and enjoy the city's cultural and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and there are several key trends that are shaping its direction. Here are some of the most notable trends:

1. Advancements in machine learning: Machine learning is one of the most significant areas of AI development, and it is rapidly advancing at a pace that is unprecedented. With advances in deep learning, neural networks, and other advanced algorithms, AI will become increasingly adept at solving complex problems.

2. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there is a growing emphasis on how it will be used and how it will be regulated. Increased focus on ethical considerations is likely to continue as AI




In [6]:
llm.shutdown()
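
To make sure the engine is shut down even when generation raises an exception, the call can be wrapped in a small context manager. This is a sketch under stated assumptions: `engine_session` is a hypothetical helper, and `DummyEngine` is a stand-in exposing only the `generate` and `shutdown` methods used in this document, so the snippet runs without a model.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in with the two methods this sketch relies on."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
with engine_session(engine) as llm_like:
    outputs = llm_like.generate(["Hello, my name is"], {"temperature": 0.8})

print(outputs[0]["text"], engine.is_shut_down)
```

The same wrapper shape should apply to any object that exposes a `shutdown()` method, including the `llm` object created at the top of this document.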