# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
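To make the custom-server idea concrete, here is a minimal sketch of a request handler wrapped around an engine object. It is not the linked `custom_server.py`. The request and response fields (`prompts`, `sampling_params`, `completions`) and the `StubEngine` class are hypothetical and exist only so the sketch runs without a GPU. The only engine behavior assumed is what this document shows: `generate` takes a list of prompts plus sampling parameters and returns one dict with a `text` field per prompt.

```python
import json
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mirrors the output shape used in this document: one dict per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_request(engine: Any, body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    request = json.loads(body)
    prompts = request["prompts"]
    # Fall back to the sampling parameters used elsewhere in this document.
    sampling_params = request.get(
        "sampling_params", {"temperature": 0.8, "top_p": 0.95}
    )
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"completions": [out["text"] for out in outputs]})


body = json.dumps({"prompts": ["The capital of France is"]})
print(handle_request(StubEngine(), body))
```

Any web framework could call a function like `handle_request` from its route handler; passing the real engine instead of the stub is the only change the sketch anticipates.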

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine

import sglang as sgl
from sglang.utils import print_highlight
import asyncio

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

INFO 11-11 23:15:22 weight_utils.py:243] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.38it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
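For batch jobs it is often convenient to persist results rather than print them. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. The helper name `to_jsonl` and the record keys `prompt` and `completion` are illustrative choices, not part of SGLang. The example outputs are placeholders shaped like the list returned by `llm.generate` above.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, str]]) -> str:
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    records = [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(json.dumps(record) for record in records)


# Placeholder data; in practice pass the prompts and the outputs from llm.generate.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```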

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing synchronous streaming generation ===")

for prompt in prompts:
    print_highlight(f"\nPrompt: {prompt}")
    print("Generated text: ", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)
    print()

Generated text: Alex and I'm a new student at your school. I'm excited to meet you all and I'm looking forward to learning and making friends. I'm in Mrs. Johnson's class and I think we're going to have a great year together.

How are you all doing today? Is it a nice day outside? I've been looking forward to this day for a long time and I'm glad to finally be here. I'm a bit nervous, but I'm sure I'll get used to everything soon.

Do any of you have any fun plans for the weekend? I was thinking of going to the park with my family on Saturday

Generated text: in the midst of a crisis that has left many without access to basic necessities like food and water. Climate change is one of the main causes of this crisis, with extreme weather events like flooding and droughts affecting the country's infrastructure.

In Paris, the situation is particularly dire, with many residents struggling to access clean water and food due to the city's aging infrastructure and the impacts of climate change. The city's mayor, Anne Hidalgo, has called for international aid to help address the crisis.

The crisis in Paris is a symptom of a larger problem that affects many cities around the world. Climate change is causing more frequent and severe

Generated text: in the cloud. This is the conclusion that many experts in the field are drawing after years of research and experimentation.

Cloud-based AI is the future of AI for several reasons. Firstly, cloud computing offers scalability and flexibility. AI models can be easily trained, updated and scaled up or down as needed without the need for significant infrastructure investments. This makes cloud-based AI an attractive option for businesses that need to quickly adapt to changing market conditions.

Secondly, cloud-based AI can be accessed by anyone with an internet connection, making it a global technology. This means that businesses and individuals around the world can access advanced AI capabilities without the need for significant
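If you need the complete text after streaming finishes, you can accumulate the chunks as they arrive. The sketch below assumes each chunk carries only newly generated text in its `text` field, as the output above suggests. If your installed version returns cumulative text in every chunk, this would duplicate content. `fake_stream` is a hypothetical stand-in for `llm.generate(prompt, sampling_params, stream=True)`.

```python
from typing import Dict, Iterable, Iterator


def collect_stream(chunks: Iterable[Dict[str, str]]) -> str:
    """Concatenate the "text" field of every streamed chunk."""
    return "".join(chunk["text"] for chunk in chunks)


def fake_stream() -> Iterator[Dict[str, str]]:
    # Hypothetical stand-in for the engine's synchronous streaming generator.
    for piece in [" Alex", " and", " I", "'m", " a", " new", " student"]:
        yield {"text": piece}


full_text = collect_stream(fake_stream())
print(full_text)
```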




### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print_highlight(f"\nPrompt: {prompt}")
        print_highlight(f"Generated text: {output['text']}")


asyncio.run(main())
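Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below issues one call per group of prompts, each group with its own sampling parameters, and awaits them together with `asyncio.gather`. `StubAsyncEngine` and `generate_groups` are hypothetical names, and whether the real engine benefits from concurrent calls in your setup is an assumption to verify. The stub only mimics the call shape shown above.

```python
import asyncio
from typing import Any, Dict, List, Tuple

Group = Tuple[List[str], Dict[str, Any]]


class StubAsyncEngine:
    """Hypothetical stand-in that mimics the async_generate call shape."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0)  # give other coroutines a chance to run
        temperature = sampling_params["temperature"]
        return [{"text": f" <{p!r} at T={temperature}>"} for p in prompts]


async def generate_groups(
    engine: Any, groups: List[Group]
) -> List[List[Dict[str, str]]]:
    """Await one async_generate call per group concurrently, keeping group order."""
    calls = [engine.async_generate(prompts, params) for prompts, params in groups]
    # asyncio.gather returns results in the order the awaitables were passed.
    return await asyncio.gather(*calls)


groups = [
    (["Hello, my name is"], {"temperature": 0.8, "top_p": 0.95}),
    (["The capital of France is"], {"temperature": 0.0}),
]
for outputs in asyncio.run(generate_groups(StubAsyncEngine(), groups)):
    print([output["text"] for output in outputs])
```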

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

print_highlight("\n=== Testing asynchronous streaming generation ===")


async def main():
    for prompt in prompts:
        print_highlight(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        generator = await llm.async_generate(prompt, sampling_params, stream=True)
        async for chunk in generator:
            print(chunk["text"], end="", flush=True)
        print()


asyncio.run(main())

Generated text: David S. and I am a writer, a speaker, a podcaster, and a coach who specializes in helping others reach their full potential and live their best lives. I am passionate about helping people achieve their goals, overcome obstacles, and unlock their true potential.

I am a two-time Amazon Best Selling Author, a sought-after speaker, and a popular podcast host. My work has been featured in various media outlets, including Forbes, Entrepreneur Magazine, and Business Insider.

I am a certified coach with over 20 years of experience helping individuals and organizations achieve their goals. My coaching style is warm, empathetic, and results-driven. I

Generated text: a city like no other, full of charm and romance. It is a place of grand boulevards, beautiful gardens, and world-renowned museums and landmarks. There is so much to see and do in this great city, that it's impossible to fit it all into one visit. But here are a few of the top attractions that you won't want to miss:

This iconic landmark is one of the most recognizable buildings in the world and a must-visit for any first-time visitor to Paris. The Eiffel Tower was built in the late 19th century for the World's Fair and was intended to be a temporary structure

Generated text: bright, but it's not without challenges and controversies. One of the most pressing concerns is the issue of bias in AI systems, which can perpetuate and even amplify existing social inequalities. To address this, researchers are exploring new approaches to AI development that prioritize fairness, transparency, and accountability.

At the forefront of this movement is the concept of explainable AI (XAI). XAI aims to provide clear and understandable explanations for AI decision-making processes, allowing users to understand the reasoning behind AI-driven outcomes. This not only enhances accountability but also enables AI systems to be more transparent and less susceptible to bias.

Another critical area of focus is fairness
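The same accumulation pattern works for the asynchronous stream. This sketch consumes an async iterator of chunks with `async for` and joins the pieces, again assuming each chunk holds only new text. `fake_async_stream` is a hypothetical stand-in. With the real engine you would pass the generator obtained from `await llm.async_generate(prompt, sampling_params, stream=True)`.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def collect_async_stream(chunks: AsyncIterator[Dict[str, str]]) -> str:
    """Consume an async stream of chunks and return the concatenated text."""
    parts: List[str] = []
    async for chunk in chunks:
        parts.append(chunk["text"])
    return "".join(parts)


async def fake_async_stream() -> AsyncIterator[Dict[str, str]]:
    # Hypothetical stand-in for the engine's asynchronous streaming generator.
    for piece in [" David", " S", ".", " and", " I", " am", " a", " writer"]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield {"text": piece}


print(asyncio.run(collect_async_stream(fake_async_stream())))
```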




In [6]:
llm.shutdown()
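In a script, an exception raised during generation would skip the final `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down, sketched below. `managed_engine` and `StubEngine` are hypothetical names; the only engine method the sketch relies on is `shutdown()`, which this document uses.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and guarantee shutdown() runs, even after an error."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


# With the real engine, the factory could be:
#   lambda: sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
with managed_engine(StubEngine) as engine:
    print("engine in use, closed =", engine.closed)
print("after the block, closed =", engine.closed)
```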