# Fast LLM Serving for Multiple Users with SGLang

Models like Llama 3.1 and Gemma 2 have made it possible to deploy your own LLM on your own machine at a very low cost, without relying on a SaaS service. However, putting an LLM into production to serve a large number of users simultaneously is not straightforward. Fast serving frameworks address this by implementing the relevant best practices for you, which makes the task much simpler.

Some examples of these features are:

- Dynamic Batching: Efficiently group incoming requests into batches to maximize GPU utilization. This involves strategies to handle variable sequence lengths and minimize padding overhead.
- Request Queuing and Prioritization: Implement a robust queuing system to manage incoming requests, potentially with prioritization mechanisms for different use cases or users.
- Concurrency and Parallelism: Leverage multi-threading and asynchronous operations to handle multiple requests concurrently, maximizing hardware utilization.
- Optimized Kernel Execution
- Efficient Memory Allocation and Deallocation: Minimize memory fragmentation and overhead associated with memory management.
- Model Parameter Sharding: Distribute model parameters across multiple GPUs or devices to handle large models that exceed the memory capacity of a single device.
- Memory-Efficient Attention Mechanisms: Implement optimized attention mechanisms that reduce memory consumption, especially for long sequences.
- Optimized KV Caching
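
To make the padding problem mentioned under dynamic batching concrete, here is a toy sketch. It is not how SGLang (or any other framework) schedules requests; the request lengths and the helper names `padding_overhead` and `make_batches` are made up for illustration. It only shows that grouping requests of similar length wastes far less of a padded batch.

```python
from typing import List


def padding_overhead(seq_lens: List[int]) -> float:
    """Fraction of a padded batch that is padding rather than real tokens."""
    padded = max(seq_lens) * len(seq_lens)
    return 1 - sum(seq_lens) / padded


def make_batches(seq_lens: List[int], batch_size: int, sort_first: bool) -> List[List[int]]:
    """Split request lengths into fixed-size batches, optionally grouping similar lengths."""
    ordered = sorted(seq_lens) if sort_first else list(seq_lens)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Hypothetical prompt lengths (in tokens) for eight queued requests.
lengths = [12, 480, 15, 510, 20, 495, 18, 500]

for sort_first in (False, True):
    batches = make_batches(lengths, batch_size=4, sort_first=sort_first)
    waste = [round(padding_overhead(b), 3) for b in batches]
    print(f"sort_first={sort_first}: batches={batches}, padding fraction={waste}")
```

With arrival-order batches, roughly half of each padded batch is padding; grouping by length brings that down sharply. Production frameworks go further than this static grouping, but the motivation is the same.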

There are several excellent alternatives, such as TGI (from Hugging Face), the popular vLLM, TensorRT-LLM, and the more recent NVIDIA NIM.

Today, I will show how to deploy an LLM to serve 64 users or more simultaneously (potentially in parallel) while maintaining high performance using SGLang.

## Hardware Requirements

To make it work, however, you will need a GPU with compute capability >= 8.9 (the compute capability is a version number that identifies which hardware features a GPU supports; this level is needed here because the model used below is in FP8). Only recent GPUs fall into this category, such as the NVIDIA H100, NVIDIA L4, NVIDIA L40, RTX 6000 Ada and RTX 4090.
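
If you are unsure whether your GPU qualifies, a quick check along these lines may help. It is a minimal sketch that assumes PyTorch is installed and uses `torch.cuda.get_device_capability`; the helper names `meets_min_capability` and `local_gpu_capability` are illustrative, and the threshold is simply the 8.9 value stated above.

```python
from typing import Optional, Tuple

MIN_CAPABILITY = (8, 9)  # threshold stated in this section


def meets_min_capability(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= MIN_CAPABILITY


def local_gpu_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be determined."""
    try:
        import torch  # assumption: PyTorch is available in this environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = local_gpu_capability()
if cap is None:
    print("No CUDA GPU (or no PyTorch) detected; cannot check compute capability.")
else:
    print(f"Compute capability {cap[0]}.{cap[1]} -> OK: {meets_min_capability(cap)}")
```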

## Installing

In [1]:
!pip install --upgrade pip
!pip install "sglang[all]"
!pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/ # Install FlashInfer CUDA kernels

Looking in indexes: https://flashinfer.ai/whl/cu121/torch2.3/


## Run the server 

Run the following command in your shell.

### Replace [HF_TOKEN] with a personal Hugging Face token that has read permissions

This step will take some time, since it downloads the Docker image and the model from Hugging Face.

<code> docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=[HF_TOKEN]" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --host 0.0.0.0 --port 30000 --mem-fraction-static 0.7
</code>

You should now see in the console something like this:


![SGLang server startup output](images/server_fired.PNG)

If you see this, the server is up and running.
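
Instead of watching the console, you can also poll the server from Python until it answers. The sketch below uses only the standard library; the helper name `wait_for_server` is illustrative, and the `/v1/models` path is an assumption based on the OpenAI-compatible API that the client code in this notebook talks to.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet (still downloading or loading the model)
            pass
        time.sleep(poll_s)
    return False
```

For example, `wait_for_server("http://127.0.0.1:30000/v1/models")` should return `True` once the server launched above is accepting requests, and `False` if it never comes up within the timeout.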

# Test the LLM server

In [2]:
import openai
import time
import threading
import concurrent.futures

In [7]:
class LLMClient():
    def __init__(self):
        self.client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
    def run(self,messages,temperature=0.0,max_tokens=1000):
        # Chat completion
        response = self.client.chat.completions.create(
            model="default",
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens)
        return response

In [9]:
llm = LLMClient()
messages = [
                {"role": "system", "content": "You are a helpful AI assistant"},
                {"role": "user", "content": "What is the city in Italy where you can eat the best pizza?"},
            ]
start = time.time()
r = llm.run(messages=messages,temperature=0.8,max_tokens=200)
end = time.time()

print(f"Response: {r.choices[0].message.content}")
print(f"Completion Tokens: {r.usage.completion_tokens}")
print(f"Tokens/Second: {r.usage.completion_tokens/(end-start)}")

Response: That's a tough question! Italy is famous for its delicious pizza, and opinions on the best city for pizza can vary depending on personal taste. However, I can give you some popular options:

1. **Naples (Napoli)**: Known as the birthplace of pizza, Naples is a must-visit for any pizza lover. The Neapolitan-style pizza, made with fresh ingredients and cooked in a wood-fired oven, is a UNESCO-recognized culinary tradition. Try a classic Margherita pizza at a local pizzeria like Pizzeria Di Matteo or Pizzeria Brandi.
2. **Rome**: Rome is another city in Italy famous for its pizza. You'll find many authentic Neapolitan-style pizzerias, like Pizzeria La Montecarlo or Pizzeria Roscioli, serving up delicious pies with fresh ingredients.
3. **Milan**: Milan has a unique pizza style, often influenced by the region's rich culinary history
Completion Tokens: 200
Tokens/Second: 25.85678610374305


For a single request we are getting about 25 tokens/second. That is not outstanding, but remember that the model is running on a single NVIDIA L4, which is not the fastest GPU available for FP8 operations.

Now let's see how it handles 64 potential users calling the API.

To simulate a real scenario, we must assume that each user sends a request that differs from all the others. This is important because sending the same request repeatedly would lead to unrealistic reuse of the KV cache, resulting in unrealistically low response times. To do this, we generate N questions that N potential users might ask and evaluate the performance in terms of tokens per second.
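
The effect of repeated prompts on the cache can be illustrated with a toy model. This is only a sketch of prefix reuse, not SGLang's actual KV cache: the class `ToyPrefixCache`, its method, and the whitespace "tokenizer" are all made up for illustration. It counts how many tokens of a prompt still have to be processed when an earlier request shared the same prefix.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Remembers token prefixes already processed, mimicking prefix reuse of a KV cache."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, ...]] = set()

    def tokens_to_compute(self, tokens: List[str]) -> int:
        """Return how many tokens still need processing, then record all prefixes."""
        cached = 0
        # Find the longest prefix of this prompt that was seen before
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._seen:
                cached = end
                break
        # Record every prefix of this prompt for future requests
        for end in range(1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:end]))
        return len(tokens) - cached


cache = ToyPrefixCache()
same = "You are a helpful AI assistant . What is the capital of Italy ?".split()
other = "You are a helpful AI assistant . Why is the sky blue ?".split()

print(cache.tokens_to_compute(same))   # first time: the whole prompt is processed
print(cache.tokens_to_compute(same))   # repeated prompt: nothing is left to process
print(cache.tokens_to_compute(other))  # new question: only the shared system prompt is reused
```

A benchmark that sends the same prompt 64 times would mostly measure the second case, which is why distinct questions are generated here.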

The steps implemented are as follows:

- Generation of N questions in parallel
- Sending the N questions to the API with an interval of 0.07 seconds between consecutive requests (sending them all at once would actually make dynamic batching easier for the framework)
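
As a quick sanity check on what the 0.07-second spacing means in practice, the sketch below computes when each request starts. The helper name `start_offsets` is illustrative and not part of the benchmark code.

```python
from typing import List


def start_offsets(num_requests: int, interval_s: float) -> List[float]:
    """Start time (seconds from t=0) of each request when sent interval_s apart."""
    return [i * interval_s for i in range(num_requests)]


for n in (64, 128):
    offsets = start_offsets(n, 0.07)
    print(f"{n} requests: last one starts at t={offsets[-1]:.2f}s "
          f"({1 / 0.07:.1f} new requests per second)")
```

Since a 200-token answer took roughly 8 seconds in the single-request test above, most of the 64 requests are still being generated when the last one arrives, so the server really does have to handle them concurrently.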







In [59]:
def generate_question(llm_client):
    messages = [
                {"role": "system", "content": "You are a helpful AI assistant that generate random questions"},
                {"role": "user", "content": "Generate a random question without anything else. I need for a synthetic data. Mix it a lot using 'how', 'why', 'what','which' and so on"},
            ]
    questions = []
    # N is a global that must be set before this function is called
    with concurrent.futures.ThreadPoolExecutor(max_workers=N) as executor:
        # Submit the same prompt N times; temperature=1.0 makes the outputs differ
        futures = [executor.submit(llm_client.run,messages,temperature=1.0) for _ in range(N)]
        # Collect the results as they complete
        for future in concurrent.futures.as_completed(futures):
            response = future.result()
            questions.append({"response":response.choices[0].message.content,"completion_tokens":response.usage.completion_tokens})
    return questions

def parallel_invoke(llm_client,questions):
    messages = []
    for q in questions:
        messages.append([
                    {"role": "system", "content": "You are a helpful AI assistant that answer questions"},
                    {"role": "user", "content": q['response']}])
    answers = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=N) as executor:
        # Submit one request per question, all at the same time
        futures = [executor.submit(llm_client.run,m,temperature=1.0,max_tokens=200) for m in messages]
        # Collect the results as they complete (not necessarily in submission order)
        for future in concurrent.futures.as_completed(futures):
            response = future.result()
            answers.append({"response":response.choices[0].message.content,"completion_tokens":response.usage.completion_tokens})
    return answers


def call_llm(index,llm,messages,max_tokens):
    # Time a single request and store the result in the global `responses` list,
    # which must be pre-allocated with one slot per request
    start = time.time()
    response = llm.run(messages=messages,temperature=0.0,max_tokens=max_tokens)
    end = time.time()
    responses[index] = {"response":response, "exec_time":end-start}


def batch_requests(llm_client,messages,batch_size=1,interval_time=0.07,max_tokens=100):
    timers = []
    throughputs = []
    # Schedule request i to start i*interval_time seconds from now
    for i in range(batch_size):
        t=threading.Timer(interval_time*i,call_llm, args=(i,llm_client,messages[i],max_tokens))
        timers.append(t)
        t.start()
    # Wait for every request to finish
    for t in timers:
        t.join()
    for r in responses[:batch_size]:
        # Note: total_tokens counts prompt tokens as well as completion tokens
        total_tokens = r['response'].usage.total_tokens
        timed = r['exec_time']
        throughputs.append(total_tokens/timed)
        print(f"Tokens/Second: {total_tokens/timed}")
    return throughputs

### Let's generate the N questions

In [60]:
N = 64

llm = LLMClient()
start = time.time()
questions = generate_question(llm_client=llm)
end = time.time()
total_tokens= 0
for q in questions:
    total_tokens += q['completion_tokens']
print(f"Total tokens/second: {((total_tokens)/(end-start))}")

Total tokens/second: 539.0938228646794


Let's try a parallel invocation (all requests are submitted at the same time).

In [61]:
start = time.time()
response = parallel_invoke(llm,questions)
end = time.time()
total_tokens = sum([r['completion_tokens'] for r in response])
print(f"Average Tokens/second: {total_tokens/(end-start)}")

Average Tokens/second: 1101.2147646320898


We have served 64 parallel requests in less than 11 seconds, for an aggregate throughput of about 1100 tokens/second.

Not bad at all. Let's see what performance each individual user experiences.
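
As a rough reference point for the per-user numbers, the aggregate figure printed above can be split evenly across the 64 requests. This is back-of-the-envelope arithmetic only, under the assumption that all requests were active for the whole run; the variable names are illustrative.

```python
aggregate_tps = 1101.2147646320898  # "Average Tokens/second" printed above
num_requests = 64
max_tokens = 200                    # per-request cap used in parallel_invoke

# Even share of the aggregate throughput per request
per_request_tps = aggregate_tps / num_requests

# Upper bounds if every single answer had hit the 200-token cap
max_total_tokens = num_requests * max_tokens
longest_possible_s = max_total_tokens / aggregate_tps

print(f"Even share per request: {per_request_tps:.1f} tokens/second")
print(f"At most {max_total_tokens} completion tokens, i.e. at most {longest_possible_s:.1f} s of wall time")
```

A run that finished faster than this upper bound simply means that some answers stopped before reaching the token cap.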

### Now let's send the questions 0.07 seconds apart

In [67]:
N = 64 # Batch Size
responses = [None] * N

batch_messages = []
for q in questions:
    batch_messages.append([
                {"role": "system", "content": "You are a helpful AI assistant"},
                {"role": "user", "content": q['response']}])
throughputs = batch_requests(llm,batch_messages,batch_size=N,interval_time=0.07,max_tokens=200)

Tokens/Second: 22.481655557363872
Tokens/Second: 20.86233051398501
Tokens/Second: 21.993210432250052
Tokens/Second: 21.862647076802933
Tokens/Second: 22.473124052874006
Tokens/Second: 22.525694930439993
Tokens/Second: 22.76676529303312
Tokens/Second: 21.967466851255836
Tokens/Second: 22.201399894963224
Tokens/Second: 22.34818522383896
Tokens/Second: 22.680139911751773
Tokens/Second: 22.91284934919259
Tokens/Second: 22.787741950817207
Tokens/Second: 22.297225837366152
Tokens/Second: 22.42831067558483
Tokens/Second: 22.11863232244019
Tokens/Second: 22.90617508997586
Tokens/Second: 23.13758767361235
Tokens/Second: 23.206375149178957
Tokens/Second: 22.70835953395026
Tokens/Second: 23.41402763822864
Tokens/Second: 22.582414886834012
Tokens/Second: 22.718171640603472
Tokens/Second: 22.774653329202827
Tokens/Second: 22.546546754490745
Tokens/Second: 23.16676206291592
Tokens/Second: 23.310038544695598
Tokens/Second: 23.645004872443455
Tokens/Second: 22.93377239063281
Tokens/Second: 23.16824986

#### We are serving 64 users in parallel while ensuring decent throughput for every user (21 to 29 tokens/second on a single NVIDIA L4)
#### Note that the model here uses FP8 weights and has not been quantized further. SGLang can handle AWQ/FP8/GPTQ/Marlin formats, so these numbers can still be improved.
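
Rather than reading the per-request lines one by one, the list returned by `batch_requests` can be condensed into a few summary statistics. The sketch below is one possible way to do it with the standard library; the helper name `summarize` is illustrative, and the sample contains only the first five values printed above.

```python
import statistics
from typing import Dict, List


def summarize(throughputs: List[float]) -> Dict[str, float]:
    """Min, median, mean and max of a list of per-request tokens/second values."""
    return {
        "min": min(throughputs),
        "median": statistics.median(throughputs),
        "mean": statistics.fmean(throughputs),
        "max": max(throughputs),
    }


# First five values from the run above, used here only as sample input
sample = [
    22.481655557363872,
    20.86233051398501,
    21.993210432250052,
    21.862647076802933,
    22.473124052874006,
]

for name, value in summarize(sample).items():
    print(f"{name}: {value:.2f} tokens/second")
```

In the notebook itself you would call `summarize(throughputs)` on the full list; the minimum is the number to watch, since it is what the slowest user experiences.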

# Let's test with 128 users

In [68]:
N = 128 # Batch Size
responses = [None] * N

llm = LLMClient()
start = time.time()
questions = generate_question(llm_client=llm)
end = time.time()

print(f"{N} parallel questions generation time: {end-start}")

batch_messages = []
for q in questions:
    batch_messages.append([
                {"role": "system", "content": "You are a helpful AI assistant"},
                {"role": "user", "content": q['response']}])
throughputs = batch_requests(llm,batch_messages,batch_size=N,interval_time=0.07,max_tokens=200)

128 parallel questions generation time: 7.0390636920928955
Tokens/Second: 17.286992033894414
Tokens/Second: 16.592788713455715
Tokens/Second: 16.60591790331349
Tokens/Second: 17.191043674119104
Tokens/Second: 16.98837166994836
Tokens/Second: 16.93018850114843
Tokens/Second: 17.546547677695138
Tokens/Second: 16.83857269735047
Tokens/Second: 16.92173122069925
Tokens/Second: 17.148522844716943
Tokens/Second: 17.089902459644357
Tokens/Second: 17.24558245768498
Tokens/Second: 17.334764222814787
Tokens/Second: 17.63456014162903
Tokens/Second: 16.84011196233023
Tokens/Second: 17.125831074694982
Tokens/Second: 17.208415311559225
Tokens/Second: 17.29401385276943
Tokens/Second: 17.58980823644283
Tokens/Second: 17.181779590582444
Tokens/Second: 17.763938673266328
Tokens/Second: 17.63849967158452
Tokens/Second: 16.877400247871364
Tokens/Second: 17.230581852474334
Tokens/Second: 27.032927940112195
Tokens/Second: 17.12178138317449
Tokens/Second: 17.274694892507316
Tokens/Second: 17.29010981763018
To

In [69]:
sum(throughputs)/len(throughputs)

18.79558754291914

### 128 requests, started 0.07 seconds apart, were served in 20 seconds with an average of about 18.8 tokens/second.

# Conclusion

SGLang is a fantastic serving framework for guaranteeing good throughput to every user in a multi-user application.
The framework handles Llama-3.1-8B in FP8 very well on a single NVIDIA L4 serving 64 parallel users. Increasing the number of users starts to degrade performance, but the throughput for each user remains usable.

If the user base grows a lot, you will need to add more GPUs or multiple nodes.