
Fix all devices occupation when applying tp to torch engine by updating device map #1172

Merged
1 commit merged into InternLM:main on Feb 28, 2024

Conversation

grimoire
Collaborator

Update the device map when TP is enabled.

  • If host memory is sufficient, the model is loaded on the host first and then distributed to the devices.
  • If there is not enough host memory, a device map is created so the weights are loaded directly onto the available devices and distributed from there (a rough sketch follows this list).
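A minimal sketch of that decision, assuming the check is done against free host RAM (the helper name, the psutil check, and the {'': 'cpu'} map are illustrative, not the PR's actual code):

import psutil

def _pick_device_map(model_size_bytes: int, per_weight_device_map: dict):
    """Illustrative only: choose where to load the weights before TP
    distribution, following the strategy described above."""
    if psutil.virtual_memory().available > model_size_bytes:
        # Enough host RAM: load the whole model on the host first,
        # then distribute the TP shards to the devices.
        return {'': 'cpu'}
    # Not enough host RAM: load each weight directly on a device
    # according to the precomputed per-weight device map.
    return per_weight_device_map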

@lvhan028
Collaborator

What's the risk of using auto device map?

@grimoire
Collaborator Author

  1. All available devices would be used, and it is hard to clear all the caches on devices that were not meant to be used.
  2. Neighboring weights would be dispatched to the same device. For example, given [w0, w1, w2, w3], auto tends to put [w0, w1] on device0 and [w2, w3] on device1, which might lead to OOM when the weights are redistributed for TP; see the round-robin sketch below.
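For reference, a minimal self-contained sketch of the round-robin assignment this PR uses instead (the function name here is illustrative; the actual snippet from the diff appears below):

import torch.nn as nn

def _round_robin_device_map(model: nn.Module, world_size: int) -> dict:
    """Illustrative sketch: spread consecutive weights over the devices
    round-robin so no single device has to hold a contiguous block."""
    device_map = {}
    device_id = 0
    for name, _ in model.named_parameters():
        device_map[name] = device_id
        device_id = (device_id + 1) % world_size
    for name, _ in model.named_buffers():
        device_map[name] = device_id
        device_id = (device_id + 1) % world_size
    return device_map

For the [w0, w1, w2, w3] example with world_size == 2 this yields {'w0': 0, 'w1': 1, 'w2': 0, 'w3': 1} rather than grouping neighboring weights on the same device.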

for name, _ in model.named_parameters():
    device_map[name] = device_id
    device_id = (device_id + 1) % world_size
for name, _ in model.named_buffers():
Collaborator

Will this method make some GPUs heavily loaded while others are not?

Collaborator Author

In some special cases, yes.
For example, if tp==2 and the weight sizes alternate like [large, small, large, small, ...], device0 would load all the large weights and device1 all the small ones. But that is very unlikely for an LLM.
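A quick numeric illustration of that corner case (the sizes are made up):

# Hypothetical alternating large/small weight sizes, tp == 2.
sizes = [100, 1, 100, 1, 100, 1]   # MB, assumed values
world_size = 2
load = [0, 0]
for i, size in enumerate(sizes):
    load[i % world_size] += size   # same round-robin order as the device map
print(load)  # [300, 3]: device0 ends up holding all the large weights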

@lvhan028 requested a review from pppppM on February 27, 2024 at 08:55
model.eval()
model.config.use_cache = True

if rank == 0:
    with LoadNoInit():
        device_map = 'auto'
        device_map = __get_device_map(model, device_map)
Collaborator

Assuming there are two gpus and two linears, is the device_map on rank0 {'linear0':0, 'linear1':1}?

If the combined GPU memory of linear0 + linear1 is greater than the available GPU memory on rank0, but linear0 and linear1 individually fit within the available GPU memory of rank0 and rank1 respectively, will the device_map fall back to the CPU?

Collaborator Author

Assuming there are two gpus and two linears, is the device_map on rank0 {'linear0':0, 'linear1':1}?

Yes.

will the device_map fall back to the CPU?

t0 (after the model is loaded with the device map):

  • rank0: linear0
  • rank1: linear1

t1 (linear0 is split for TP and dispatched):

  • rank0: linear0/2
  • rank1: linear0/2 + linear1

t2 (linear1 is split for TP and dispatched):

  • rank0: linear0/2 + linear1/2
  • rank1: linear0/2 + linear1/2

The peak memory usage is linear0/2 + linear1, reached on rank1 at t1.

No CPU fallback is needed.
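A small accounting sketch of the two-linear example (the sizes in GB are made up; this only mirrors the bookkeeping in the timeline above, not the engine code):

# Two linears placed by the device map built on rank0, then split for tp == 2.
sizes = {'linear0': 10.0, 'linear1': 8.0}    # GB, assumed values
initial_map = {'linear0': 0, 'linear1': 1}   # device map built on rank0

mem = [0.0, 0.0]                             # current footprint per rank
for name, rank in initial_map.items():       # t0: full weights loaded
    mem[rank] += sizes[name]
peak = list(mem)

for name in ('linear0', 'linear1'):          # t1, t2: split and dispatch
    src = initial_map[name]
    mem[src] -= sizes[name]                  # full weight released on its source rank
    for rank in (0, 1):
        mem[rank] += sizes[name] / 2         # each rank keeps one half of the weight
        peak[rank] = max(peak[rank], mem[rank])

print(peak)  # [10.0, 13.0] -> peak is linear0/2 + linear1, reached on rank1 at t1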

@lvhan028 merged commit a5ff047 into InternLM:main on Feb 28, 2024
4 checks passed
@lvhan028 changed the title from "Update torch TP device map." to "Fix all devices occupation when applying tp to torch engine by updating device map" on Feb 28, 2024