# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
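
To see why this is needed, the sketch below reproduces the underlying problem using only the standard library: calling `asyncio.run` while an event loop is already running (as is the case in IPython or Jupyter) raises a `RuntimeError`. The coroutine names `inner` and `outer` are illustrative and no SGLang code is involved. `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message is the error that unpatched asyncio reports for the nested call.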

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.56it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Rahul and I am a software engineer and I work on the front-end for web applications in the field of IT. 
What could I say about myself that is different from my current job title? 
Sure, I can say that I am a well-rounded individual with a diverse set of skills and experiences. I have a strong background in software engineering, but also have a passion for technology and a keen interest in web development. I am a highly organized and detail-oriented person, and I have a strong ability to collaborate and work with others efficiently. Additionally, I am an excellent problem-solver and a creative thinker, which has helped me in
Prompt: The president of the United States is
Generated text:  in the city of New York. He walks for 1 hour and 30 minutes at a speed of 10 miles per hour. How many miles did he walk? To find out how many miles the president of the United States walked, we can use the formula:

Distance = Speed × Time

Given that the speed
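
For offline batch jobs it is often convenient to persist each prompt together with its completion. The following is a minimal sketch, assuming only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict per prompt with a `"text"` key. `StubEngine` and `to_jsonl` are hypothetical names introduced for illustration; the stub stands in for the real engine so the example runs without a model.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the output shape."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


engine = StubEngine()
prompts = ["Hello, my name is", "The capital of France is"]
outputs = engine.generate(prompts, {"temperature": 0.8, "top_p": 0.95})
jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

With the real engine, you would pass `llm` in place of `StubEngine()` and write `jsonl` to a file.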

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill or Expertise] who has been [Number of Years] years in the field of [Field of Interest]. I'm passionate about [Why I'm Passionate About This Field]. I'm always looking for new challenges and opportunities to grow and learn. I'm a [Skill or Expertise] who is always looking for ways to improve my skills and knowledge. I'm a [Skill or Expertise] who is always looking for ways to improve my skills and knowledge. I'm a [Skill or Expertise] who is always looking

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its vibrant nightlife and is a popular tourist destination. The city is home to many international organizations and is a major economic hub for France. It is a major transportation hub and is a major center for the arts and culture industry. Paris is a city that is both a historical and modern city, with a rich cultural and artistic heritage. It is a city that is known for its beautiful architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and artificial intelligence: As AI technology continues to advance, we are likely to see more automation and artificial intelligence in various industries, including manufacturing, healthcare, transportation, and finance. This will lead to increased efficiency, productivity, and cost savings for businesses.

2. Enhanced personalization: AI will enable businesses to provide more personalized experiences to their customers, based on their individual preferences, behaviors, and needs. This will lead to increased customer satisfaction and loyalty
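
The cell above mentions "overlap removal": when successive streamed chunks repeat text that has already been received, the repeated part has to be trimmed before the chunks are joined. The sketch below shows one plausible way to do this in plain Python. It is an illustration of the idea, not necessarily how `stream_and_merge` works internally, and the names `trim_overlap` and `merge_chunks` are chosen here for the example.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Join chunks in order, skipping text that overlaps what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world", "d!"])
print(merged)
```

Here the second chunk repeats `wor` and the third repeats `d`, so only `ld` and `!` are appended.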



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am an AI. I'm here to assist you in a variety of ways. I can help you with anything from understanding how the world works, to providing you with a wide range of information about the things I know about. My goal is to make your life easier, and I'll do my best to help you achieve that! How can I assist you today? Let's get started! 📚✨✨

---

Would you like to say something positive about yourself? If so, please do so. 😊✨

---

Please go ahead and share something helpful or positive about yourself. 💬✨



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also famous for its culture, cuisine, and fashion. Paris is a vibrant and dynamic city that is home to a diverse population and is a
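
Because `async_generate` is awaitable, several independent batches can in principle be scheduled together with `asyncio.gather`. The sketch below illustrates the pattern with a hypothetical `StubAsyncEngine` that only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out). Whether the real engine benefits from concurrent calls is an assumption you should verify for your setup.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in that mimics only the shape of async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule every batch at once and wait for all of them to finish.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the same order as the awaitables it was given, so each result list lines up with its batch.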

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily and I'm a fresh graduate from the prestigious university. I have been studying English for the past three years and I'm confident in my ability to excel in any academic or professional field. I'm friendly, outgoing, and I enjoy making connections with people. I love to travel and explore new places, and I'm always looking for new things to do and new ways to learn. I'm a team player and I thrive on helping others succeed. I'm ready to learn and grow with you. Welcome to my world! 🚀👋🏼🌍

This intro is quite vague. Can you give me more specific information on your academic

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city that serves as the cultural, political, and economic center of the country. It is known as the “City of Love” due to its romantic and poetic atmosphere. Paris is home to numerous landmarks, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry and has been a center of creativity and innovation in the arts since the 19th century. Paris is a vibrant and diverse city that is home to millions of people and attracts tourists from all over the world. It is a symbol of French culture and identity and a major economic and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be very dynamic and challenging, with many trends and developments that could have a significant impact on the industry. Here are some possible future trends in AI:

1. Enhanced AI: The next big trend in AI will be the development of AI that is more capable and adaptable. This will involve the introduction of new algorithms and techniques that can handle more complex and nuanced data, as well as the development of more powerful AI systems that can perform multiple tasks simultaneously.

2. Artificial Intelligence in Healthcare: AI will play a critical role in healthcare, with the ability to analyze vast amounts of medical data to identify patterns and assist in diagnosis and treatment.
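
The streaming cell prints each chunk as soon as it arrives. If you also need the full text afterwards, you can accumulate the chunks while printing them. The sketch below shows the pattern with a hypothetical `fake_stream` async generator that stands in for the engine's stream; `collect` is likewise a name introduced for this example.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical token stream: yields the text one word-sized piece at a time."""
    for piece in text.split(" "):
        await asyncio.sleep(0)  # give control back to the event loop
        yield piece + " "


async def collect(stream):
    """Print pieces as they arrive and return the complete text at the end."""
    pieces = []
    async for piece in stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream is exhausted
    return "".join(pieces).rstrip()


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

With the real engine, the async iterator passed to `collect` would be the one used in the cell above.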




```python
llm.shutdown()
```
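
In a standalone script, you may want `shutdown()` to run even when generation fails partway through. One way to arrange that is a small context manager, sketched below. `TrackingStubEngine` and `managed` are hypothetical names used for illustration; the stub deliberately raises so that the cleanup path is exercised without loading a model.

```python
from contextlib import contextmanager


class TrackingStubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure")

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body raises


engine = TrackingStubEngine()
try:
    with managed(engine) as stub_llm:
        stub_llm.generate(["Hello"], {"temperature": 0.8})
except RuntimeError as exc:
    print("generation failed:", exc)

print("engine closed:", engine.closed)
```

The same wrapper could be applied to a real engine instance, assuming its `shutdown()` is safe to call after an error.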