How to share memory among 2 GPUS for distributed inference? #1875

Open
martinigoyanes opened this issue May 10, 2024 · 10 comments

@martinigoyanes
Contributor

Environment Setup

Runtime environment:

Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: c38a7d7
Docker label: sha-6c4496a
Kubernetes Cluster deployment

2× A100 GPUs with 80 GB of memory each

12 CPUs with 32 GB of RAM

TGI version: 2.0.0

TGI Parameters:
MAX_INPUT_LENGTH: "8000"
MAX_TOTAL_TOKENS: "8512"
MAX_CONCURRENT_REQUESTS: "128"
LOG_LEVEL: "INFO"
MAX_BATCH_TOTAL_TOKENS: "4294967295"
WAITING_SERVED_RATIO: "0.3"
MAX_WAITING_TOKENS: "0"
MAX_BATCH_PREFILL_TOKENS: "32768"

Question

I am curious about how to optimize distributed inference for LLMs. I see that in the docs you mention this:

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by `PyTorch` to do distributed training/inference. `text-generation-inference` makes use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of a `NCCL` group, `NCCL` might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` on the above command.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by creating a volume with:

- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi

and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that this will impact performance.
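As a side note on the peer-to-peer fallback mentioned in that excerpt: whether NVLink/PCIe P2P is available between the two GPUs can be checked from PyTorch, as in the minimal sketch below (assuming 2 visible CUDA devices); `nvidia-smi topo -m` reports the same topology from the shell.

```python
# Minimal sketch: check whether the two GPUs can talk peer-to-peer
# (over NVLink or PCIe). If this returns False, NCCL stages transfers
# through host memory, which is where the /dev/shm sizing above matters.
import torch

assert torch.cuda.device_count() >= 2, "need at least 2 visible GPUs"
p2p_possible = torch.cuda.can_device_access_peer(0, 1)
print(f"GPU 0 -> GPU 1 peer access: {p2p_possible}")
```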

We currently have this setup with K8s:

        - name: m
          emptyDir:
            sizeLimit: 1Gi
            medium: Memory

However, I feel like I am missing something.

Say the GPU memory size is G, the model weight footprint in megabytes is M, and the free memory available for processing requests is F.

Then, when I deploy a model of size M (where M < G) with SHARDED=True over 2 full GPUs (G_1 and G_2), what I expect is the model weights taking M megabytes from GPU 1 (G_1), so the free memory for processing tokens/requests should be F = (G_1 - M) + G_2. Right?

Instead, what I am seeing is that the model is replicated on both GPUs, so F = (G_1 - M) + (G_2 - M). I believe this is not what we want. For example, with Mistral 7B:

| Sharded | GPU 1 | GPU 2 |
| --- | --- | --- |
| False | 66553MiB / 81920MiB (81% used) | Does not exist |
| True | 66553MiB / 81920MiB (81% used) | 66553MiB / 81920MiB (81% used) |
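To make the expectation concrete, here is a small sketch of the arithmetic (the weight footprint M below is a hypothetical placeholder, not taken from the table, since the 66553 MiB figures likely also include memory TGI reserves beyond the raw weights):

```python
# Sketch of the memory accounting in the question above.
# G1, G2: per-GPU VRAM; M: model weight footprint (hypothetical value).
G1 = G2 = 81920          # MiB, A100 80GB as reported by nvidia-smi
M = 14500                # MiB, rough fp16 footprint of a 7B model (assumed)

# Expectation stated above: weights live on GPU 1 only.
F_expected = (G1 - M) + G2

# Observation: weights replicated on both GPUs.
F_observed = (G1 - M) + (G2 - M)

# With tensor parallelism each GPU holds ~M/2 of the weights instead,
# which yields the same total budget as the expectation: (G1 + G2) - M.
F_tensor_parallel = (G1 - M / 2) + (G2 - M / 2)

print(f"expected free budget        : {F_expected} MiB")
print(f"observed free budget        : {F_observed} MiB")
print(f"tensor-parallel free budget : {F_tensor_parallel} MiB")
```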

We would like to have the model on only 1 GPU (if it fits) and then use the extra available GPU just for inference, i.e., increase our memory budget at processing time by pooling the memory left over on the GPU where the model weights live with the memory of the GPU without model weights.

This is what makes me think we are not using NCCL correctly. Or maybe my assumptions are wrong and what I am describing is not possible?

Visual description

[Screenshot: diagram comparing the current scenario (model replicated on both GPUs) with the ideal scenario (weights on one GPU, remaining memory pooled across both)]

@Venkat2811

Venkat2811 commented May 11, 2024

Good question.

In transformer inference, both the prefill and decode phases use GPU VRAM. That is, processing is done by moving model weights, KV cache, embeddings, etc. from VRAM into the L2, L1 and registers of the GPU's processing cores; intermediate states and results are written back to VRAM (see FlashAttention 1 & 2 for more details). This is similar to traditional programs (instructions and data) needing to be loaded into RAM before being processed by the CPU. So in your pictorial representation, the Ideal Scenario is not possible.

In your pictorial representation, the Current Scenario is called Data Parallelism. The full model weights are loaded on both GPUs, each at 81% VRAM usage, so we have 2 model instances. This is widely supported by several serving frameworks and engines, including TGI.

What you are asking for is a single model sharded across multiple GPUs, for example 40.5% on GPU 1 and 40.5% on GPU 2, so that there is only one model instance and both GPUs are used to complete one inference. There are several parallelism techniques (pipeline, tensor, sequence); TGI supports Tensor Parallelism. It comes into play when a model is too big to fit into a single GPU's VRAM, and it is not common to do this for models that fit within a single GPU's VRAM.
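As an illustration of the tensor-parallel idea (a minimal sketch only, not TGI's actual implementation), here is a column-parallel linear layer split across two GPUs, which is the basic building block:

```python
# Minimal column-parallel linear layer: each GPU holds half of the weight
# matrix (split along the output dimension) and computes half of the output
# features; the partial results are concatenated at the end.
# Illustration only, assumes 2 CUDA devices -- not TGI's implementation.
import torch

in_features, out_features, batch = 4096, 4096, 8
devices = ["cuda:0", "cuda:1"]

# Shard the (out_features, in_features) weight along the output dimension.
full_weight = torch.randn(out_features, in_features)
half = out_features // 2
shards = [full_weight[i * half:(i + 1) * half].to(d) for i, d in enumerate(devices)]

# The input is replicated; each GPU computes its slice of the output.
x = torch.randn(batch, in_features)
partials = [x.to(d) @ w.t() for d, w in zip(devices, shards)]

# Gather the partial outputs (an NCCL all-gather in a real implementation).
y = torch.cat([p.to(devices[0]) for p in partials], dim=-1)
assert y.shape == (batch, out_features)
```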

That's because, generally speaking, inference is bandwidth limited, so inference engine implementations try to use memory bandwidth (VRAM <-> cache communication) efficiently. If we shard a model that already fits into VRAM, the communication overhead is greater (GPU 1 cache <-> GPU 1 VRAM <-> PCIe/NVLink <-> GPU 2 VRAM <-> GPU 2 cache). We want to keep that communication overhead as low as possible so the GPU cores are utilized efficiently.
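A back-of-the-envelope sketch of why single-stream decode is bandwidth bound (the bandwidth and FLOP figures are rough A100 numbers, and the 2-FLOPs-per-parameter estimate is an approximation):

```python
# Rough roofline-style estimate for single-token decode of a 7B fp16 model.
# All numbers are approximations, not measurements.
params = 7e9
weight_bytes = params * 2            # fp16: ~14 GB streamed per token at batch 1
flops_per_token = 2 * params         # ~2 FLOPs per parameter per token

hbm_bandwidth = 2.0e12               # ~2 TB/s HBM bandwidth (A100 80GB, approx.)
peak_fp16_flops = 312e12             # ~312 TFLOP/s fp16 tensor cores (approx.)

t_memory = weight_bytes / hbm_bandwidth
t_compute = flops_per_token / peak_fp16_flops

print(f"memory-bound time : {t_memory * 1e3:.2f} ms/token")
print(f"compute-bound time: {t_compute * 1e3:.3f} ms/token")
# The memory term dominates by roughly two orders of magnitude at batch
# size 1, which is why decode throughput is limited by VRAM bandwidth.
```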

With modern techniques like FlashAttention, PagedAttention, quantization, etc., memory bandwidth is being utilized more efficiently. So for specific model serving configurations (large batch sizes, quantized models, etc.), it could technically make sense to shard a model across 2 GPUs. This is similar to the multiprocessing paradigm in CPU workloads. I'm not sure this is mainstream, though, because single-GPU inference is still on the order of seconds and milliseconds. I also see benefits of this for ensemble model inference in local setups on M3 & M4 chips.

Anyway, I am also curious to hear the TGI team's (@OlivierDehaene, et al.) thoughts on this. I came across AlpaServe, which discusses this.

Thanks,
Venkat

@martinigoyanes
Contributor Author

Thank you so much for such a well-written response! Maybe we could explore adding support for these features, since I do think that for cases where batch sizes are very big it would really help to leverage both GPUs' VRAM at the same time, right?

@Venkat2811

No problem @martinigoyanes. vLLM supports this: vllm-project/vllm#2304

Maybe TGI already supports this natively or through its vLLM integration? I have to look into the TGI config to get more clarity on this.

@martinigoyanes
Contributor Author

I am down to collaborate with you @Venkat2811 on supporting this! I think TGI supports TP when the model does not fit on 1 GPU, but it does not allow you to force it to happen even when the model fits on 1 GPU.

@Venkat2811

> I think TGI supports TP when the model does not fit on 1 GPU, but it does not allow you to force it to happen even when the model fits on 1 GPU

Yes, I would like to validate this first to be sure. If that is the case, I would be happy to collaborate with you @martinigoyanes!

@martinigoyanes
Contributor Author

martinigoyanes commented May 15, 2024

I think the increase in throughput is well worth it, since you can leverage 100% VRAM from the extra GPU, even after taking into account the added latency from GPU-to-GPU communication.

When serving LLMs for "real" use cases, you must put some kind of rate limiter in front of them. And most of the time, the downstream client of the LLM would rather have increased latency than be rate limited on total tokens used. With this increase in throughput from using 100% VRAM of the extra GPU, you can offer your downstream clients a much higher number of total tokens per minute while trading off some latency from communication.
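As a purely hypothetical illustration of that trade-off (all numbers below are made up, not measurements):

```python
# Hypothetical numbers to illustrate the latency/throughput trade-off
# argued above; none of these values are measured.
single_gpu_tokens_per_sec = 2000   # one replica, weights + KV cache on 1 GPU
tp_tokens_per_sec = 3200           # sharded over 2 GPUs, larger batch budget
tp_latency_overhead = 0.15         # assumed per-request slowdown from GPU<->GPU traffic

print(f"single GPU : {single_gpu_tokens_per_sec * 60:,} tokens/min")
print(f"2-GPU TP   : {tp_tokens_per_sec * 60:,} tokens/min "
      f"(+{tp_latency_overhead:.0%} per-request latency)")
```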

What do you think, @Venkat2811? I feel like this would be a very valuable feature for TGI.

@Venkat2811

Hey @martinigoyanes ,

> you can leverage 100% VRAM from the extra GPU

Maybe there is a terminology/communication gap here. The statement above is not correct. Higher throughput is achieved by increasing the inference batch size by splitting the model layers across several GPUs. This requires different layers of the model to be loaded on different GPUs (base memory). As a simple example, 50% of the compute (and KV-cache memory) is handled on GPU 1 and the rest on GPU 2. It is analogous to MapReduce.
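To put rough numbers on the batch-size point (the model dimensions below are assumptions for a Mistral-7B-like GQA model, and activations/runtime overhead are ignored):

```python
# Rough KV-cache budget estimate: how much extra batch capacity the freed
# VRAM buys when weights are sharded across 2 GPUs instead of replicated.
# Assumed Mistral-7B-like dimensions: 32 layers, 8 KV heads, head dim 128, fp16.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

gpu_vram = 80 * 1024**3          # bytes per A100
weights = 14 * 1024**3           # approximate fp16 weight footprint (assumed)

free_replicated = 2 * (gpu_vram - weights)   # weights loaded on both GPUs
free_sharded = 2 * gpu_vram - weights        # weights split across the GPUs

seq_len = 8512                   # MAX_TOTAL_TOKENS from the config above

def max_sequences(free_bytes: int) -> int:
    """Full-length sequences whose KV cache fits in the free VRAM."""
    return int(free_bytes // (kv_bytes_per_token * seq_len))

print("max concurrent sequences (replicated):", max_sequences(free_replicated))
print("max concurrent sequences (sharded)   :", max_sequences(free_sharded))
```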

> I think TGI supports TP when the model does not fit on 1 GPU, but it does not allow you to force it to happen even when the model fits on 1 GPU

We have to verify this before proceeding.

Resources:

- Tensor Parallelism
- Different types of parallelism
- Multi-model inference on a GPU cluster of several machines

@Venkat2811

> When serving LLMs for "real" use cases, you must put some kind of rate limiter in front of them

vLLM, TGI & Triton are already powering several "real" use cases :)

E.g., your 60 concurrent requests were served, with some latency; the TGI router's queuing system + back-pressure management made it possible.

@martinigoyanes
Contributor Author

> Maybe there is a terminology/communication gap here. The statement above is not correct. Higher throughput is achieved by increasing the inference batch size by splitting the model layers across several GPUs. This requires different layers of the model to be loaded on different GPUs (base memory). As a simple example, 50% of the compute (and KV-cache memory) is handled on GPU 1 and the rest on GPU 2. It is analogous to MapReduce.

Yes, you are right, I think I was not precise enough. I said "100%", but of course you still have to host a fraction of the model on each GPU while doing TP. What I mean is that you can utilize much more GPU VRAM, which lets you process larger batch sizes. I was neglecting the size of the model in memory when it is spread across 2 GPUs for the sake of making my point about the latency-throughput trade-off, sorry.

> E.g., your 60 concurrent requests were served, with some latency; the TGI router's queuing system + back-pressure management made it possible.

Yeah, indeed! However, the queueing system in TGI is a bit "naive" since it has no sense of prioritization. I would argue that in real-world scenarios, where you have multiple downstream clients, you would allow some of them to consume more total tokens per minute than others depending on how critical the downstream task is, whereas TGI lets all requests fill the queue equally. That could also be an extension of TGI: some sense of prioritization in the queueing system based on API keys and the token usage of each API key. However, given that TGI is an open-source project, I think it is better to "keep it simple", and this kind of extension should/can be built around TGI. What do you think?
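A toy sketch of what such an API-key-aware layer built around TGI could look like (the key names and priority values are hypothetical placeholders, and actually forwarding requests to TGI is left out):

```python
# Toy sketch of per-API-key prioritization built *around* TGI, as suggested
# above. Key names and priorities are hypothetical; forwarding the popped
# requests on to TGI is intentionally omitted.
import heapq
import itertools
from dataclasses import dataclass, field

PRIORITY = {"critical-service": 0, "batch-jobs": 10}   # lower = served first

@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int                                  # FIFO tie-break within a priority
    prompt: str = field(compare=False)
    api_key: str = field(compare=False)

class PriorityFrontend:
    def __init__(self) -> None:
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def submit(self, api_key: str, prompt: str) -> None:
        prio = PRIORITY.get(api_key, 100)     # unknown keys get lowest priority
        heapq.heappush(self._heap, QueuedRequest(prio, next(self._counter), prompt, api_key))

    def next_request(self) -> QueuedRequest | None:
        return heapq.heappop(self._heap) if self._heap else None

frontend = PriorityFrontend()
frontend.submit("batch-jobs", "summarize this document ...")
frontend.submit("critical-service", "user-facing chat message")
print(frontend.next_request().api_key)        # critical-service is served first
```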

@Venkat2811

No worries! I wanted to be precise and sure, considering the earlier discussions in this thread.

> the queueing system in TGI is a bit "naive" since it has no sense of prioritization

Yes, prioritization, routing, etc. are not part of this, and rightfully so. My understanding of the current state of the project is that the router and inference server are for serving a single model.
