# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
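
The need for this comes from the standard library: `asyncio.run()` refuses to start a new event loop while another one is already running, which is typically the situation inside a notebook cell. The stdlib-only sketch below reproduces that error without SGLang. The names `inner` and `outer` are illustrative, and `outer` merely stands in for code running inside a notebook.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that never ran
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted, which is why it is applied before using the engine in such environments.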

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.50it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tessa and I live in the United States. I am 20 years old and have been living in the United States for 5 years. I am currently studying at the university of southern california. I plan to become a teacher. I'm very passionate about teaching and have always been interested in learning about different cultures and learning how to teach. I would like to help and support others who are studying and teaching in the United States and other countries.
I want to teach English to students from all over the world. I am eager to learn more about the language and culture of different countries and cultures and how to teach effectively. I would
Prompt: The president of the United States is
Generated text:  a political office with no term limit. During his time as the president, Clinton spoke with President Bush and other prominent figures to try to win the nomination. With Clinton in a weakened condition at the end of his presidency, President Bush picked 
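
The `temperature` and `top_p` entries in `sampling_params` shape how each next token is drawn. As a conceptual illustration only, the plain-Python sketch below applies temperature scaling followed by nucleus (top-p) filtering to a toy set of logits. It is not SGLang's sampling code; the function name `top_p_filter` and the logits are made up for this example.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for four tokens
candidates = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(candidates)
```

With a lower temperature the same logits concentrate on fewer tokens, which is one reason the streaming examples later in this document, run at `temperature=0.2`, tend to produce more repeatable text.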

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill] who has been [Number of Years] years in the field of [Field of Interest]. I'm passionate about [What I Love About My Profession], and I'm always looking for ways to [What I Want to Improve]. I'm a [What I Do Best], and I'm always ready to learn and grow. I'm excited to meet you and learn more about you. What's your name? What's your occupation? What's your skill? What's your passion? What's your goal? What's your best skill

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city of light and art. It is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also famous for its cuisine, fashion, and music, making it a cultural and economic hub of the country. The city is home to many world-renowned museums, art galleries, and theaters, and is a popular tourist destination. Paris is a vibrant and dynamic city that continues to evolve and change over time. It is a city of contrasts, with its rich history and modernity intertwined. The city is a symbol of France and a major hub of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more efficient and effective AI systems that can make better decisions and solve complex problems.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be an increased need for privacy and security measures to protect the data and information that is generated and processed by AI systems.
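
The cell above relies on `stream_and_merge` to combine streamed chunks "with overlap removal". The sketch below shows one way such merging could work; it is an assumption for illustration, not the helper's actual source, and the names `trim_overlap` and `merge_chunks` as well as the sample chunks are hypothetical.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping text that a chunk repeats from the previous ones."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical streamed pieces whose boundaries overlap.
merged_text = merge_chunks(["Hello", "Hello, wor", "world!"])
print(merged_text)
```

A plain suffix match like this would also swallow a chunk that legitimately repeats the preceding text, so it should be read as an illustration of the idea rather than a drop-in replacement.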



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I’m a friendly, outgoing person who enjoys taking care of my own business and helping people. My name is [Name], and I’m a [job title] with over [number] years of experience in the [industry], and I’m here to help you achieve your goals. Let’s get started! 🌟

This short self-introduction is neutral, as it doesn’t promote any particular subject or profession. It’s meant to be informative and engaging, akin to a casual conversation.

How can we refine our self-introduction to ensure it’s more appropriate for our professional context? What advice would you give to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the historical center and largest city in the country. It is known as the City of Love and its architecture, museums, and cultural institutions are a testament to France's rich cultural her
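
When the calling code already lives inside a coroutine, a batch can be awaited directly, and several batches can be submitted together with `asyncio.gather`. The sketch below uses a stand-in class, `FakeEngine`, which is hypothetical: it only mimics the call shape used above (`async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key) and performs no inference.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; echoes prompts instead of generating."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Submit all batches at once; gather returns results in input order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']}")
```

Whether submitting batches concurrently helps with the real engine depends on its scheduler, which this sketch does not model.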

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name] and I'm a 25-year-old software engineer. I'm passionate about technology and have been working in the field for several years. I'm not afraid to try new things and challenge myself, and I'm always looking for ways to improve my skills. I enjoy solving problems and helping others, and I believe in using technology to create positive change. In my spare time, I enjoy reading and watching movies, and I also love spending time with my loved ones. I'm a bit of a go-go-go-go-go-go-go, but I hope everyone finds me interesting. [insert character's name]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the third-largest city in the world by population, and is also the largest city in metropolitan France. Its cuisine, fashion, and architecture are all renowned worldwide. France's capital is known for its vibrant culture, rich history, and stunning cityscape. It is often referred to as the "Paris of the East." Paris is also home to several world-renowned museums and art galleries, including the Louvre and the Musée d'Orsay. Paris is a bustling metropolis with a rich cultural scene, and is considered one of the most desirable cities in the world. It is also home to the Eiffel



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be a combination of innovation and adaptation. Here are some possible trends that are currently being explored and are expected to evolve:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, enabling them to perform tasks that are difficult or impossible for humans to do. For example, AI systems could learn to generate human-like language, write poetry, or even create art.

2. Development of more sophisticated natural language processing: As AI becomes more integrated with humans, its ability to understand and generate human-like language will become increasingly sophisticated. This will enable AI systems to communicate more effectively with humans, understand emotions, and
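
The loop above prints each chunk as it arrives. If the complete text is also needed afterwards, the chunks can be collected while iterating. The sketch below shows that pattern with a hypothetical async generator, `fake_stream`, standing in for the engine's stream; the helper `collect` is likewise made up for this example.

```python
import asyncio


async def fake_stream(text, size=4):
    """Hypothetical token stream: yields `text` in small pieces."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + size]


async def collect(stream):
    """Consume an async stream and return the concatenated text."""
    pieces = []
    async for piece in stream:
        pieces.append(piece)
    return "".join(pieces)


full_text = asyncio.run(collect(fake_stream(" Paris. It is the capital of France.")))
print(full_text)
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.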




In [6]:
llm.shutdown()