# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
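
As a rough illustration of the custom-server idea, the sketch below separates request handling from the engine call so the handler can be plugged into any web framework. It is not the linked `custom_server.py`: `StubEngine`, `handle_generate`, and the JSON field names `prompts`, `sampling_params`, and `texts` are hypothetical names chosen for this example. The stub only mimics the `generate(prompts, sampling_params)` call and the `output["text"]` result shape used later in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        # Mimics the list-of-dicts result shape shown in this document.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> tuple[int, bytes]:
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    error = json.dumps({"error": "expected JSON with a 'prompts' list of strings"})
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 400, error.encode()
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return 400, error.encode()

    # Fall back to default sampling parameters when the client sends none.
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    outputs = engine.generate(prompts, sampling_params)
    return 200, json.dumps({"texts": [out["text"] for out in outputs]}).encode()


status, raw = handle_generate(StubEngine(), b'{"prompts": ["Hello, my name is"]}')
print(status, json.loads(raw))
```

Keeping the handler free of framework code makes it easy to test and lets the same function back whichever HTTP layer you prefer.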



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
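
To see why the patch is needed, the following sketch uses only the standard library (no SGLang involved) to reproduce the underlying restriction: `asyncio.run` refuses to start a new event loop while another one is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are made up for this example.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # just as it is in an IPython or Jupyter session.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without `nest_asyncio`, the nested call fails with a `RuntimeError`; applying the patch makes such nested calls possible.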

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.81it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sara, I'm 17 years old, and I'm in high school now. I have some problems with my English. In the past, I had to learn many words and sentence structures. However, now, I have trouble understanding what is being said and I find it hard to make connections between the different parts of the sentence. So, could you provide some advice on improving my English? I have some experience with this, and I can ask questions to my teacher or tutor if needed. Thank you!
Sure, I'd be happy to help you improve your English! Here are some tips that may help:

1. Read and
Prompt: The president of the United States is
Generated text:  a powerful and influential person who is responsible for the general welfare of the country. While the president has a lot of power, there are also some important duties and responsibilities that the president of the United States must fulfill. Some of the duties and responsibilities that the president must fulfill include providi
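
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a minimal sketch of what these two knobs mean in general, the code below applies temperature scaling and nucleus (top-p) filtering to a toy next-token distribution. It illustrates the standard technique only and is not SGLang's internal sampler; the function name `top_p_filter` and the toy logits are assumptions made for this example.

```python
import math


def top_p_filter(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the smallest set of most likely tokens whose cumulative probability reaches top_p.
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1 before sampling.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "Rome": -3.0}
print(top_p_filter(toy_logits, temperature=0.8, top_p=0.95))
print(top_p_filter(toy_logits, temperature=2.0, top_p=0.95))
```

With the lower temperature the most likely token alone already exceeds the `top_p` threshold, while the higher temperature spreads probability mass so that more candidates survive the filter.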

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] and [job title]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville-Marie" or simply "Paris". It is the largest city in Europe and the third-largest city in the world, with a population of over 2. 5 million people. Paris is known for its rich history, art, and culture, as well as its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also home to many famous museums, including the Musée d'Orsay and the Musée Rodin. Paris is a popular tourist destination and a major economic and cultural center in Europe. It is also the seat

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased automation and robotics: AI is already being used in manufacturing, healthcare, and transportation, and we can expect to see even more automation and robotics in the future. This will lead to increased efficiency, reduced costs, and improved quality of life.

2. Enhanced natural language processing: AI will continue to improve its ability to understand and interpret human language, leading to more sophisticated chatbots, virtual assistants, and other AI-powered tools.

3. Improved decision-making: AI will become more capable of making more accurate and nuanced decisions, leading to better outcomes for businesses and
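
The cell above mentions "overlap removal" when merging streamed chunks. One plausible way to do this, sketched here under the assumption that a new chunk may repeat the tail of the text received so far, is to drop the longest prefix of the chunk that matches a suffix of the accumulated text. The helpers `trim_overlap` and `merge_chunks` are written for this illustration and are not claimed to match the implementation of `stream_and_merge`.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove from new_chunk the longest prefix that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: list[str]) -> str:
    """Concatenate chunks while skipping text that overlaps the previous output."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical overlapping chunks, hard-coded so the sketch runs without a model.
chunks = ["The cap", "capital of", " of France"]
print(merge_chunks(chunks))
```

A limitation of this heuristic is that it cannot tell a real repetition in the generated text from an overlap between chunks, so it suits streams that are known to resend trailing text.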



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I am a [Last Name] [First Name]. I have always been fascinated by the world of video games and technology, and I have spent years learning about the latest advancements and keeping up with the latest trends. My dream job is to create my own game, and I am currently developing a new title called "The Code Breaker," which is a puzzle game with a deep character development system and a mysterious plot. I am excited to start my new career and make a difference in the gaming industry. 
I love being outdoors and exploring new places, and I enjoy spending time with my family and friends. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "The City of Light" and "The City of Love". It is a historic center with a rich history and a modern cityscape that is home to many iconic landm
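
Because `async_generate` is awaited, several batches can be submitted from one coroutine and awaited together. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` and `run_batches` are hypothetical: the stub only imitates the call signature and result shape used above, and whether concurrent submission improves throughput with a real engine is not something this example establishes.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that imitates async_generate without loading a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule every batch, then wait for all of them; gather preserves input order.
    pending = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*pending)
    # Flatten the per-batch lists into one list of generated texts.
    return [output["text"] for batch in results for output in batch]


batches = [
    ["Hello, my name is", "The capital of France is"],
    ["The future of AI is"],
]
texts = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95}))
print(texts)
```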

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I come from [Name]. I have always been passionate about [Name], and my life has been filled with everything from helping people to making friends. I love to travel and experience new things, and I enjoy learning new things. I'm always eager to expand my horizons and try new things, and I believe that my energy and enthusiasm are contagious. How can I help you today? Do you have a question or would you like to chat about something specific? Can you tell me about yourself? I look forward to meeting you! 📈✨✨✨

This short self-introduction is neutral and doesn't



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its rich history and cultural heritage. It is the largest city in France and home to numerous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Additionally, Paris is known for its cuisine, particularly its famous Parisian fries, as well as its fashion scene. Despite being a global metropolis, Paris has maintained its cultural identity and continues to attract tourists from around the world. The city is home to many museums, art galleries, and cultural institutions, offering visitors a unique experience of French culture and history. Paris also has a thriving arts scene, including the Opéra Garnier



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be highly complex, with a wide range of emerging trends that will shape the landscape of the technology field. Some of the key trends to watch include:

1. Deep learning and machine learning: Deep learning will continue to advance, with more powerful algorithms and networks emerging that can recognize patterns and make decisions at a deeper level. This will have a significant impact on areas such as healthcare, finance, and autonomous vehicles.

2. Natural language processing: As AI becomes more capable of understanding human language, we will see increased use of natural language processing in areas such as speech recognition, translation, and chatbots.

3. Autonomous vehicles: Autonomous



In [6]:
llm.shutdown()
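
In a script, an exception raised between engine creation and `llm.shutdown()` would skip the shutdown call. A minimal sketch of one way to guard against that is a small context manager that always calls `shutdown()`. `managed_engine` and `StubEngine` are hypothetical names for this illustration; the stub replaces a real engine so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```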