# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
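
The request-handling core of such a server can be quite small. The following is a minimal sketch and not the code of the linked example: `handle_generate`, `StubEngine`, and the JSON field names (`prompt`, `sampling_params`, `text`) are hypothetical, and the engine is only assumed to expose `async_generate(prompts, sampling_params)` returning one dict with a `"text"` key per prompt, as used later in this document. A stub stands in for the engine so the sketch runs without a model.

```python
import asyncio
import json


class StubEngine:
    """Stand-in for sgl.Engine so the sketch runs without a model (hypothetical)."""

    async def async_generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body (hypothetical schema)."""
    request = json.loads(body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    # Pass a one-element list and read the single result back.
    outputs = await engine.async_generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]}).encode("utf-8")


response = asyncio.run(
    handle_generate(StubEngine(), b'{"prompt": "The capital of France is"}')
)
print(response.decode("utf-8"))
```

Any web framework can call a coroutine like this from its route handler; keeping it free of framework types makes it easy to test in isolation.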



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop (such as a Jupyter notebook), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
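
If the same code has to run both as a plain script and inside a notebook, the patch can be applied conditionally. This is a minimal sketch using only the standard library to detect a running loop; the helper names `in_running_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when called from code that already runs inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio only when a loop is already running; report whether it did."""
    if not in_running_event_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True


async def _probe() -> bool:
    return in_running_event_loop()


outside = in_running_event_loop()  # called from synchronous code
inside = asyncio.run(_probe())     # called from inside a coroutine
patched = apply_nest_asyncio_if_needed()
print(outside, inside, patched)
```

In a plain script no loop is running at import time, so the helper returns without importing `nest_asyncio`; in a notebook cell it applies the patch.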

## Advanced Usage

The engine supports [vision-language model (VLM) inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Betsy, and I am the founder of the Independent Herbal Spa. I specialize in providing gentle, soothing, and healing treatments with natural remedies and herbs, which are safe to use for both men and women of all ages.
My treatment plans are personalized for each client, ensuring that they receive the best care possible. I believe in the power of nature and the healing potential of herbs and botanicals to restore balance and vitality.
Since I started my practice, I have seen many people come to me feeling stressed, anxious, or just plain tired. Many have mentioned how they felt better after their treatments and felt rejuvenated. I understand that each
Prompt: The president of the United States is
Generated text:  in Washington, D. C. when it rains. The president says, "The rain will stop soon." While he is speaking, a lightning bolt strikes the roof of the U. S. Capitol and the rain stops. When President Eisenhower says, "The rain will stop soon
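
For offline batch jobs the results are usually stored rather than printed. The sketch below pairs each prompt with its completion and serializes the pairs as JSON Lines. It assumes only what the example above shows, namely that `generate` returns one dict with a `"text"` key per prompt; `StubEngine`, `generate_records`, and `to_jsonl` are hypothetical names, and the stub replaces the real engine so the code runs without a model.

```python
import json


class StubEngine:
    """Stand-in for sgl.Engine (hypothetical); returns a fake completion per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" [completion {i}]"} for i, _ in enumerate(prompts)]


def generate_records(engine, prompts, sampling_params):
    """Run one batch and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = generate_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```

With a real engine, pass `llm` instead of `StubEngine()`. If completions need to be longer or shorter, a length limit such as `max_new_tokens` can be added to `sampling_params`, assuming the installed SGLang version supports that key.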

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and experiences. Let's chat! [Name] [Job Title] [Company Name] [Company Address] [City, State, Zip Code] [Phone Number] [Email Address] [LinkedIn Profile] [Twitter Profile] [Facebook Profile] [Website URL] [Personal Website URL] [LinkedIn Profile URL] [Twitter Profile URL] [Facebook Profile URL] [Personal Website URL] [LinkedIn Profile URL] [Twitter Profile URL] [Facebook Profile URL] [Personal Website URL] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is a bustling metropolis with a diverse population and a rich cultural heritage. Paris is also a major financial center and a major tourist destination, attracting millions of visitors each year. The city is home to many world-renowned museums, art galleries, and landmarks, including the Louvre and the Notre-Dame Cathedral. Paris is a city of contrasts, with its modern architecture and high-tech industries blending seamlessly with its historic charm. The city is also known for its food scene, with many famous restaurants and cafes serving up delicious

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in various industries, with automation becoming more prevalent in tasks that are repetitive or can be done more efficiently by machines. This could lead to the development of new job roles, such as AI specialists and data scientists.

2. Enhanced human-AI collaboration: AI is likely to become more integrated with human AI, allowing for more efficient and effective collaboration between humans and machines. This could lead to the development of new AI systems that can work alongside humans in complex tasks.

3. AI ethics and privacy concerns: As AI becomes more prevalent, there
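
The "overlap removal" named in the example above can be pictured as follows. This is a minimal sketch of the idea, not necessarily how `stream_and_merge` is implemented: it assumes streamed chunks are dicts with a `"text"` key and that a chunk may repeat text that was already received. `trim_overlap`, `merge_chunks`, and the simulated stream are illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while skipping text that was already merged."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


# Simulated stream in which each chunk partly repeats the previous one.
simulated_stream = [{"text": " Paris"}, {"text": " Paris, the"}, {"text": " the city"}]
merged = merge_chunks(simulated_stream)
print(repr(merged))
```

A heuristic like this cannot tell a repeated chunk from text that legitimately repeats at a chunk boundary, so it trades a small risk of dropping genuine repetition for duplicate-free output.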



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [fill in the blank with a profession] with [fill in the blank with a nationality]. I graduated from [University name] in [year]. I am currently working as a [fill in the blank with a profession] and [fill in the blank with a nationality]. I love [fill in the blank with a hobby or passion]. I am passionate about [fill in the blank with a hobby or passion]. I am [fill in the blank with a personality trait]. I believe in [fill in the blank with a value or belief]. I am [fill in the blank with a profession]. What is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see continued advancements in machine learning and deep learning, with more sophisticated algorithms that can learn f
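
Because `async_generate` is awaited, other coroutines can make progress while the engine works. The sketch below issues several calls and awaits them together with `asyncio.gather`. It is an illustration under assumptions: `StubEngine` and `generate_groups` are hypothetical, the stub merely sleeps to simulate model latency, and whether concurrent calls on one real engine bring a benefit depends on the engine's own scheduling.

```python
import asyncio


class StubEngine:
    """Stand-in for sgl.Engine (hypothetical) with a simulated delay."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Issue one async_generate call per group and await them together."""
    calls = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*calls)


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(generate_groups(StubEngine(), groups, {"temperature": 0.8}))
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```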

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane Smith and I am a freelance software developer with over five years of experience in web development, UX/UI design, and agile project management. I am passionate about helping people and creating amazing websites and apps. I love connecting with people and leveraging my knowledge to solve complex problems. I am always looking for new challenges and learning new technologies to stay on top of the latest trends in the industry. I believe that my skills and knowledge can make a big difference in the world and I am excited to bring my expertise to any project that comes my way. Thank you! Let me know if you have any other questions or if you'd like me to


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and one of the world's most populous cities. The city is known for its rich history, unique architecture, and vibrant cultural scene. Paris is home to many of the world's most iconic landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a hub for education, entertainment, and business, and is a major transportation hub for Europe. Paris is a globally renowned city that has been a cultural and political center for centuries, and its impact on the world continues to this day. Paris's diverse population and rich history make it a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and rapidly evolving, with many potential areas of innovation and development. Some possible trends include:

1. Increased focus on ethical AI: As more companies and governments become aware of the potential risks and biases in AI systems, there is a growing emphasis on creating systems that are fair, transparent, and equitable.

2. Greater use of AI in healthcare: AI is already being used to improve diagnosis, treatment, and patient care, but as the technology advances, it is likely that it will be used more extensively in the healthcare industry.

3. Greater use of AI in transportation: As autonomous vehicles become more common, there will be a greater focus



```python
llm.shutdown()
```
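
In a script, an exception raised between engine creation and `shutdown()` would skip the cleanup. One way to guard against that is a small context manager built on `try`/`finally`. This is a sketch under assumptions: `engine_session` and `StubEngine` are hypothetical, and the only engine method relied on besides `generate` is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Stand-in for sgl.Engine (hypothetical) that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ok"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(make_engine):
    """Create an engine, hand it to the caller, and always shut it down afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(StubEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(outputs[0]["text"], engine.is_shut_down)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, so the model is released even if generation fails midway.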