# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio

If you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:

```python
import nest_asyncio

nest_asyncio.apply()
```
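
To see why the patch is needed, the sketch below reproduces the underlying error using only the standard library: calling `asyncio.run` from code that is already running inside an event loop (the situation in a notebook cell) raises `RuntimeError`. The function names `inner` and `outer` are illustrative, and nothing here touches SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)  # nested call, rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, the nested call is allowed instead of being rejected, which is what lets the `asyncio.run(main())` cells in this document work inside a notebook.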

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Faisal and I'm a non-binary individual who identifies as a young person. I like to write, and I'm passionate about digital storytelling. My work has been featured in books, magazines, and online communities. I also have a passion for sharing my story through social media, with a strong focus on helping others find a sense of belonging through technology. What's your experience with digital storytelling, and what's your approach to engaging with others and sharing your story online? Additionally, I'm interested in exploring the possibilities of a career in the tech industry, particularly in the areas of AI and cybersecurity. What advice would you give to someone considering
Prompt: The president of the United States is
Generated text:  trying to persuade the governors of the other 13 states to put a hole in the embankment of the Mississippi River. 

Does it follow that did the president of the United States ask governors of the other 13 states 
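
The two `sampling_params` keys used above control how random the sampling is. The following is a minimal standard-library sketch of what temperature scaling and nucleus (top-p) filtering generally do to a next-token distribution. It illustrates the technique with a made-up four-token vocabulary and a hypothetical helper name, and it is not SGLang's sampler implementation.

```python
import math


def apply_temperature_and_top_p(logits, temperature, top_p):
    """Return the renormalized distribution after temperature and top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits.values()]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {tok: e / total for tok, e in zip(logits, exps)}

    # Top-p: keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Hypothetical logits for the token after "The capital of France is".
toy_logits = {" Paris": 5.0, " Lyon": 2.0, " a": 1.0, " the": 0.5}

filtered = apply_temperature_and_top_p(toy_logits, temperature=0.8, top_p=0.95)
hot = apply_temperature_and_top_p(toy_logits, temperature=2.0, top_p=0.95)
print(filtered)
print(hot)
```

With these toy numbers, the settings from the cell above leave only ` Paris` as a candidate, while raising the temperature to 2.0 flattens the distribution enough that all four tokens survive the top-p cut.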

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and art galleries. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest and most prestigious in the world. Additionally, Paris is home to the Louvre Museum, one of the world's most famous art museums, and the Notre-Dame Cathedral, a UNESCO World Heritage site. The city is also known for its cuisine, with its famous croissants, boudin, and other

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI continues to advance, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare. This will lead to increased efficiency, cost savings, and job displacement, but it will also create new opportunities for AI-driven innovation.

2. Enhanced privacy and security: As AI becomes more integrated into our daily lives, there will be a growing need for robust privacy and security measures to protect sensitive data. This will require advancements in
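
The banner printed above mentions overlap removal. As a rough illustration of that idea, the sketch below merges streamed chunks while dropping any prefix of a new chunk that already appears at the end of the accumulated text. The helper names and the chunk list are made up for this example, and the code is an assumption about the general technique rather than a copy of `stream_and_merge`.

```python
def strip_overlap(accumulated: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `accumulated`."""
    for size in range(min(len(accumulated), len(chunk)), 0, -1):
        if accumulated.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Made-up chunks standing in for a stream; the second and third repeat text already seen.
fake_stream = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(fake_stream)
print(merged_text)
```

One caveat of this approach is that it cannot tell an accidental overlap from a legitimate repetition: merging the chunks `"ha"` and `"ha"` yields `"ha"`, not `"haha"`.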



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a/an [Age] year old aspiring [career] professional. I enjoy [interest] in [occupation] and hope to [future goal]. I believe [my own value proposition]. I also have a passion for [personal interest], which has inspired me to pursue my career and personal development goals. I'm always looking to learn and grow, and I'm excited to share my journey with anyone who wants to know more. Thank you for taking the time to meet me! 🌟✨

**Note**: I have crafted the introduction to sound neutral, engaging, and enthusiastic, using appropriate language and phrases

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Task: To create a French translation of the above factual statement using only the words "Paris" and "French", without altering the meaning or structure of the sentence. The translation shoul

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I'm [Age], and I'm an [occupation]. I'm a [field of study or hobby] who is passionate about [specific hobby or interest]. I'm always looking for new adventures and learning new things, and I'm always eager to share my knowledge with others. Whether it's on the phone, in person, or online, I'm always ready to help and be there for you. I'm a [character trait or quality] person who is always learning and growing. I believe that everyone has the potential to achieve great things, and I'm here to help you on your journey towards achieving that potential.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the "City of Light".

Rationale: I must select one of the options provided for a complete statement. In this case, I have chosen to select "Paris, also known as the 'City of Light'". This statement accurately represents the factual information provided in the instruction. The other options given are not directly related to the capital city of France, making "Paris, also known as the 'City of Light'" the most suitable answer to complete the given statement.

Example statement: "Paris, also known as the 'City of Light', is a vibrant and historic city in central France, known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  an exciting and rapidly evolving field, with several potential trends that could shape the technology we use and interact with in the future.

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve diagnosis and treatment, predict patient outcomes, and identify new disease areas. As AI becomes more advanced, we may see even more sophisticated applications in healthcare, such as developing personalized medicine, disease prevention, and disease surveillance.

2. Improved availability of AI-powered tools and services: As the cost of AI software and hardware continues to decrease, we may see more widespread adoption of AI-powered tools and services, such as virtual assistants, chat
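
The `async for` loop in the cell above prints each chunk with `end=""`, so the chunks of one completion build up on a single line. The sketch below shows the same consumption pattern without a model: `fake_token_stream` is a hypothetical stand-in for a streaming engine, used only so the example runs with the standard library.

```python
import asyncio


async def fake_token_stream(text: str, delay: float = 0.0):
    """Hypothetical stand-in for a streaming engine: yields one word-sized chunk at a time."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(delay)  # pretend to wait for the next decoded chunk
        yield word if i == 0 else " " + word


async def collect(completion: str) -> str:
    """Print chunks as they arrive and return the full text."""
    pieces = []
    async for chunk in fake_token_stream(completion):
        print(chunk, end="", flush=True)  # chunks are appended to the same line
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


streamed = asyncio.run(collect("Paris, also known as the City of Light"))
```

Collecting the chunks in a list and joining them once at the end keeps the full text available after streaming, which is convenient when the output also needs to be logged or post-processed.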




```python
llm.shutdown()
```
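
If an earlier step raises an exception, a trailing `shutdown()` call like the one above is never reached. One way to guard against that in a script is a small context manager that always calls `shutdown()` on exit. The sketch below assumes only that the engine object exposes a `shutdown()` method, as shown above. `FakeEngine` and `running` are hypothetical names introduced for this example, so that it runs without loading a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the `shutdown()` method used above."""

    def __init__(self):
        self.is_running = True

    def shutdown(self):
        self.is_running = False


@contextmanager
def running(engine):
    """Yield the engine and always call `shutdown()` afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with running(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_running)
```

The `finally` clause runs whether the `with` body completes or raises, so the engine is shut down in both cases.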