
fix OOM in BlockManager #973

Merged: 8 commits merged into InternLM:main from patch-1 on Jan 26, 2024
Conversation

@zhyncs (Contributor) commented on Jan 17, 2024

Motivation

fix OOM in BlockManager

Modification

Replace total * ratio with free * ratio in GetBlockCount, excluding the weights' memory usage from the k/v cache budget.
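For illustration, a minimal Python sketch of the intended change (the actual GetBlockCount is C++ inside BlockManager; the names below are simplified):

```python
def get_block_count(block_size: int, ratio: float, free: int, total: int) -> int:
    """free/total are the byte counts reported by cudaMemGetInfo.

    Before: the k/v cache budget was a fraction of *total* memory, so the
    resident weights were silently counted against it and large models hit OOM.
    After: the budget is a fraction of the memory that is actually *free*.
    """
    # return int(total * ratio) // block_size   # before this PR
    return int(free * ratio) // block_size      # after this PR
```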

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@zhyncs (Contributor, Author) commented on Jan 17, 2024

Hi @lzhangzz @lvhan028, could you help review this PR?

@lzhangzz (Collaborator) commented:

We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory).

Any idea where to add the reminder in the docs? @lvhan028

@lvhan028 (Collaborator) commented:

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory).
>
> Any idea where to add the reminder in the docs? @lvhan028

I suggest adding some advanced usage examples to https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md, showing how to set cache_max_entry_count. Meanwhile, we can update the FAQ for the OOM issue.
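For the pipeline docs, the advanced-usage snippet could look roughly like this (a sketch; the model name is only an example, and the exact argument names should be checked against the lmdeploy pipeline docs):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Shrink the fraction of GPU memory reserved for the k/v cache to avoid OOM.
# 'internlm/internlm2-chat-7b' is just a placeholder model path.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hi, please introduce yourself']))
```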

@zhyncs (Contributor, Author) commented on Jan 18, 2024

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory).
>
> Any idea where to add the reminder in the docs? @lvhan028

Hi @lzhangzz, usually when using a single machine with multiple GPUs for tensor parallelism, the hardware specifications are identical and the weights are split evenly. May we assume that in most scenarios the free memory of all ranks is the same?

@zhyncs (Contributor, Author) commented on Jan 18, 2024

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory).
> Any idea where to add the reminder in the docs? @lvhan028
>
> I suggest adding some advanced usage examples to https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md, showing how to set cache_max_entry_count. Meanwhile, we can update the FAQ for the OOM issue.

Hi @lvhan028, I agree that the usage documentation should also be updated. Could this fix be merged first?

@lvhan028 (Collaborator) commented:

Hi @zhyncs,
I still have some concerns.
In the tensor-parallel case, if some of the specified GPUs are occupied by other programs or zombie processes, free memory will differ among the GPUs. That means the number of k/v cache blocks on each GPU will differ, which would be catastrophic during inference.

Commit: replace `free * ratio` with `total * ratio - (total - free)`
@zhyncs (Contributor, Author) commented on Jan 19, 2024

Hi @lzhangzz @lvhan028, in the latest commit I have replaced free * ratio with total * ratio - (total - free).

I agree with @lvhan028's case.

And if we want to completely solve the OOM issue and the potential inconsistency of k/v cache blocks across GPUs during TP, we may need to execute a forward pass with dummy inputs to profile the model's memory usage, like vLLM's profile_run.
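A rough sketch of that idea (not vLLM's actual profile_run; `model` and `dummy_batch` are placeholders, and a PyTorch model on the current CUDA device is assumed):

```python
import torch

@torch.inference_mode()
def profile_kv_cache_budget(model, dummy_batch, ratio: float = 0.9) -> int:
    """Run one forward pass at the worst-case batch/sequence size, then size
    the k/v cache from the memory that is still free afterwards."""
    torch.cuda.empty_cache()
    model(dummy_batch)                   # weights + peak activations are now resident
    torch.cuda.synchronize()
    free, _ = torch.cuda.mem_get_info()  # bytes free on the current device
    return int(free * ratio)             # keep a (1 - ratio) share of the remainder as headroom
```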

@lvhan028 (Collaborator) commented:

Hi @zhyncs,
total * ratio - (total - free) might be negative.
I would prefer to compute the k/v cache memory size on the Python side, not the C++/CUDA side.
@lzhangzz any comments?

@zhyncs (Contributor, Author) commented on Jan 22, 2024

Hi @lvhan028,
I agree that total * ratio - (total - free) might be negative, and we may add an assertion for that case. The program should report the error to the user and exit gracefully, prompting the user to provide a correct configuration value. Guiding the user to the correct configuration will help prevent further issues or misunderstandings.
In any case, the current total * ratio is problematic.
Handling these parameters and checks on the Python side is fine with me.

@zhyncs (Contributor, Author) commented on Jan 22, 2024

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory).
> Any idea where to add the reminder in the docs? @lvhan028
>
> I suggest adding some advanced usage examples to https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md, showing how to set cache_max_entry_count. Meanwhile, we can update the FAQ for the OOM issue.

Hi @lvhan028 @lzhangzz,

If it is hard to decide between this temporary fix and a more sophisticated approach, such as a profile_run similar to vLLM's, we can go ahead and update the docs for now so that users who hit the OOM issue can work around it by configuring cache_max_entry_count, even though this configuration carries a certain mental burden. The option propagates as follows:

```text
TurbomindEngineConfig cache_max_entry_count
  -> LlamaTritonModel EngineParams cache_max_block_count
    -> LlamaV2 EngineParams cache_max_block_count
      -> LlamaBatch EngineParams cache_max_block_count
        -> SequenceManager block_count
          -> BlockManager block_count
            -> GetBlockCount ratio
```

```diff
@@ -86,7 +86,9 @@ size_t BlockManager::GetBlockCount(size_t block_size, double ratio)
     size_t free{};
     size_t total{};
     check_cuda_error(cudaMemGetInfo(&free, &total));
-    return static_cast<size_t>(total * ratio) / block_size;
+    FT_CHECK_WITH_INFO(total * ratio - (total - free) > 0,
```
Collaborator:

I agree. But should it be total * ratio <= free?

Contributor (Author):

There's no need for that restriction; it doesn't matter.
For example: A100 80G, Llama 2 13B Chat, ratio 0.9.
total * ratio = 80 * 0.9 = 72, free = 80 - 26 = 54, total * ratio - (total - free) = 46 (weights: 26 GB, k/v cache: 46 GB).
In this case total * ratio is not less than free, but it is still fine.
We may treat total * ratio - (total - free) as a whole.
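Spelling the numbers out (illustrative only, all values in GB):

```python
total, ratio, weights = 80, 0.9, 26  # A100 80G, Llama 2 13B Chat, ratio = 0.9
free = total - weights               # 54
kv = total * ratio - (total - free)  # 72 - 26 = 46 left for the k/v cache
print(total * ratio > free, kv)      # True 46.0: total * ratio exceeds free, yet the budget is valid
```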

Collaborator:

Currently, the definition of ratio is that memory of size ratio * total is allocated for the k/v cache. So when OOM happens, we suggest decreasing the ratio.
However, rewriting the proposed formula as total * ratio - (total - free) = free - (1 - ratio) * total shows that the ratio would have to be increased when memory is running out.
That's why I suggest FT_CHECK_WITH_INFO(total * ratio < free, ...)

Contributor (Author):

Maybe we can consider revising the definition of ratio. Using memory of size ratio * total for the k/v cache imposes a significant mental burden on users. For example, when running Llama 2 7B Chat in fp16 on an A30 24G device, the weights occupy 14 GB and the default ratio is 0.5; if left unconfigured, this results in OOM. Even when configuring it manually, the user has to subtract the weights from the total memory, estimate, and calculate an appropriate ratio. It's not user-friendly.
The ratio could instead convey the following meanings, eliminating the need for users to consider the model weights (see the sketch below):

weights: total - free
k/v cache: total * ratio - (total - free)
other: (1 - ratio) * total
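A sketch of that proposed semantics (an assumption about how it might look, not current lmdeploy behaviour), using torch.cuda.mem_get_info for the free/total numbers:

```python
import torch

def split_budget(ratio: float):
    """Hypothetical split of GPU memory under the proposed ratio definition."""
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    weights = total - free                   # whatever is already resident
    kv_cache = total * ratio - weights       # the rest of the ratio * total budget
    headroom = (1 - ratio) * total           # activations, workspace, fragmentation, ...
    assert kv_cache > 0, 'ratio too small for this model; increase it or use a smaller model'
    return weights, kv_cache, headroom
```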

Contributor (Author):

Maybe the definition of ratio could be: memory of size ratio * total - MODEL_WEIGHTS is allocated for the k/v cache.

@lzhangzz (Collaborator) commented:

With kv = total * ratio - (total - free), the semantics of ratio become ratio = kv% + used%, i.e. the fraction used for the k/v cache plus the fraction used by the weights.

However, that is inconsistent with the option name cache_max_entry_count, which suggests we are configuring the number of k/v cache blocks.

And it does not solve the possible inconsistency in TP mode: anything that depends on free suffers from the same problem.

I suggest we still use kv = free * ratio, as it is consistent with the current option.

To deal with the inconsistency, we may

  1. Find the minimal free across all ranks, which requires additional communication and synchronization (see the sketch below).
  2. Do nothing for now and flag a warning that the user should check for consistency.
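For option 1, the extra communication could look roughly like this (a hypothetical helper, assuming torch.distributed is already initialised; not what this PR implements):

```python
import torch
import torch.distributed as dist

def min_free_memory_across_ranks() -> int:
    """Every TP rank reports its free GPU memory; all ranks agree on the minimum."""
    free, _ = torch.cuda.mem_get_info()
    t = torch.tensor([free], dtype=torch.int64, device='cuda')
    dist.all_reduce(t, op=dist.ReduceOp.MIN)  # one collective at engine start-up
    return int(t.item())
```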

@zhyncs (Contributor, Author) commented on Jan 24, 2024

> With kv = total * ratio - (total - free), the semantics of ratio become ratio = kv% + used%, i.e. the fraction used for the k/v cache plus the fraction used by the weights.
>
> However, that is inconsistent with the option name cache_max_entry_count, which suggests we are configuring the number of k/v cache blocks.
>
> And it does not solve the possible inconsistency in TP mode: anything that depends on free suffers from the same problem.
>
> I suggest we still use kv = free * ratio, as it is consistent with the current option.
>
> To deal with the inconsistency, we may
>
> 1. Find the minimal free across all ranks, which requires additional communication and synchronization.
> 2. Do nothing for now and flag a warning that the user should check for consistency.

@lzhangzz OK, I agree. The first commit used exactly free * ratio. To keep using cache_max_entry_count as before, I will revert to that and add a warning about TP consistency. I think this check fits best before initializing AsyncEngine. @AllentDan do you have any suggestions?

@zhyncs (Contributor, Author) commented on Jan 24, 2024

```python
from py3nvml.py3nvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlShutdown,
)
import os


def get_device_ids():
    devices = os.getenv('CUDA_VISIBLE_DEVICES', '')
    return list(map(int, devices.split(','))) if devices else []


def compare_individual_gpu_memory(device_ids):
    try:
        nvmlInit()
        device_count = nvmlDeviceGetCount()
        total_mem = []
        free_mem = []

        for i in range(device_count):
            if device_ids and i not in device_ids:
                continue
            handle = nvmlDeviceGetHandleByIndex(i)
            mem_info = nvmlDeviceGetMemoryInfo(handle)
            total_mem.append(mem_info.total)
            free_mem.append(mem_info.free)

        if not total_mem:
            print('No visible GPUs found')
            nvmlShutdown()
            return

        # all GPUs report the same value iff every entry equals the first one
        all_total_equal = total_mem.count(total_mem[0]) == len(total_mem)
        all_free_equal = free_mem.count(free_mem[0]) == len(free_mem)

        print(f"Total memory equal across GPUs: {all_total_equal}")
        print(f"Free memory equal across GPUs: {all_free_equal}")

        nvmlShutdown()

    except Exception as e:
        print(f"An exception occurred: {e}")


device_ids = get_device_ids()
compare_individual_gpu_memory(device_ids)
```

It works when TP > 1. Something like this code snippet, @AllentDan?

@AllentDan (Collaborator) commented:

It would be better if the check and the logging of a suggested ratio were done in Turbomind; AsyncEngine is not the only component that uses Turbomind.

@zhyncs (Contributor, Author) commented on Jan 24, 2024

> It would be better if the check and the logging of a suggested ratio were done in Turbomind; AsyncEngine is not the only component that uses Turbomind.

OK

@AllentDan (Collaborator) commented:

As for the Python package for getting memory information, I would suggest torch.cuda.mem_get_info. It requires no extra third-party package.
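A sketch of the same consistency check with torch.cuda.mem_get_info (the function name here is illustrative, not the code that later landed in turbomind.py):

```python
import torch

def gpu_free_memory_is_consistent() -> bool:
    """Compare free memory across all visible GPUs using only torch."""
    free = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
    return len(set(free)) <= 1  # True when every visible GPU reports the same free memory
```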

@zhyncs (Contributor, Author) commented on Jan 25, 2024

Hi @lzhangzz @AllentDan, could you help review the latest commit? Thanks.

lzhangzz previously approved these changes on Jan 26, 2024

@zhyncs (Contributor, Author) commented on Jan 26, 2024

Hi @lvhan028, could you help merge this PR? Thanks.

@lzhangzz (Collaborator) commented:

Please fix the lint issues.

@zhyncs (Contributor, Author) commented on Jan 26, 2024

> Please fix the lint issues.

OK

@lvhan028 (Collaborator) commented:

@zhyncs please resolve the linting errors:

```shell
pip install pre-commit
cd lmdeploy
pre-commit install
pre-commit run --all-files
```

@zhyncs (Contributor, Author) commented on Jan 26, 2024

> @zhyncs please resolve the linting errors:
>
> ```shell
> pip install pre-commit
> cd lmdeploy
> pre-commit install
> pre-commit run --all-files
> ```

```text
[username@hostname lmdeploy]# pre-commit run --all-files
flake8...................................................................Passed
isort....................................................................Passed
yapf.....................................................................Passed
trim trailing whitespace.................................................Passed
check yaml...............................................................Passed
fix end of files.........................................................Passed
fix requirements.txt.....................................................Passed
fix double quoted strings................................................Passed
check for merge conflicts................................................Passed
fix python encoding pragma...............................................Passed
mixed line ending........................................................Passed
mdformat.................................................................Passed
codespell................................................................Passed
docformatter.............................................................Passed
check copyright..........................................................Passed
```

Done.

@lzhangzz (Collaborator) commented:

Hold on, I just found some problems in _compare_individual_gpu_memory.

Review comments on lmdeploy/turbomind/turbomind.py (outdated, resolved).

@zhyncs (Contributor, Author) commented on Jan 26, 2024

Hi @lzhangzz, could you help review the latest commit? Thanks.

lzhangzz merged commit 0f324a4 into InternLM:main on Jan 26, 2024 (8 checks passed).
zhyncs deleted the patch-1 branch on January 26, 2024 at 08:41.