
Fix all devices occupation when applying tp to torch engine by updating device map #1172

Merged
1 commit merged into InternLM:main on Feb 28, 2024

Conversation

grimoire
Collaborator

Update the device map when TP is enabled.

  • If host memory is sufficient, the model is loaded on the host first and then distributed to the devices.
  • If there is not enough host memory, a device map is created so the weights are loaded directly onto the available devices and distributed from there (a rough sketch follows this list).
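A minimal sketch of that decision, assuming the check is done against free host RAM (the helper name, the psutil check, and the {'': 'cpu'} map are illustrative, not the PR's actual code):

import psutil

def _pick_device_map(model_size_bytes: int, per_weight_device_map: dict):
    """Illustrative only: choose where to load the weights before TP
    distribution, following the strategy described above."""
    if psutil.virtual_memory().available > model_size_bytes:
        # Enough host RAM: load the whole model on the host first,
        # then distribute the TP shards to the devices.
        return {'': 'cpu'}
    # Not enough host RAM: load each weight directly on a device
    # according to the precomputed per-weight device map.
    return per_weight_device_map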

@lvhan028
Collaborator

What's the risk of using auto device map?

@grimoire
Collaborator Author

  1. All available devices would be used, and it is hard to clear all the caches on devices that were not meant to be used.
  2. Neighboring weights would be dispatched to the same device. For example, given [w0, w1, w2, w3], auto tends to put [w0, w1] on device0 and [w2, w3] on device1, which might lead to OOM when the weights are redistributed for TP; see the round-robin sketch below.
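For reference, a minimal self-contained sketch of the round-robin assignment this PR uses instead (the function name here is illustrative; the actual snippet from the diff appears below):

import torch.nn as nn

def _round_robin_device_map(model: nn.Module, world_size: int) -> dict:
    """Illustrative sketch: spread consecutive weights over the devices
    round-robin so no single device has to hold a contiguous block."""
    device_map = {}
    device_id = 0
    for name, _ in model.named_parameters():
        device_map[name] = device_id
        device_id = (device_id + 1) % world_size
    for name, _ in model.named_buffers():
        device_map[name] = device_id
        device_id = (device_id + 1) % world_size
    return device_map

For the [w0, w1, w2, w3] example with world_size == 2 this yields {'w0': 0, 'w1': 1, 'w2': 0, 'w3': 1} rather than grouping neighboring weights on the same device.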

for name, _ in model.named_parameters():
    device_map[name] = device_id
    device_id = (device_id + 1) % world_size
for name, _ in model.named_buffers():
Collaborator

Will this method make some GPUs heavily loaded while others are not?

Collaborator Author

In some special cases, yes.
For example, if tp==2 and the weight sizes alternate like [large, small, large, small, ...], device0 would load all the large weights and device1 all the small ones. But that is very unlikely for an LLM.
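A quick numeric illustration of that corner case (the sizes are made up):

# Hypothetical alternating large/small weight sizes, tp == 2.
sizes = [100, 1, 100, 1, 100, 1]   # MB, assumed values
world_size = 2
load = [0, 0]
for i, size in enumerate(sizes):
    load[i % world_size] += size   # same round-robin order as the device map
print(load)  # [300, 3]: device0 ends up holding all the large weights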

@lvhan028 requested a review from pppppM on February 27, 2024 at 08:55
model.eval()
model.config.use_cache = True

if rank == 0:
    with LoadNoInit():
        device_map = 'auto'
        device_map = __get_device_map(model, device_map)
Collaborator

Assuming there are two gpus and two linears, is the device_map on rank0 {'linear0':0, 'linear1':1}?

If the combined GPU memory of linear0 + linear1 is greater than the available GPU memory on rank0, but linear0 and linear1 individually fit within the available GPU memory of rank0 and rank1 respectively, will the device_map fall back to the CPU?

Collaborator Author

Assuming there are two gpus and two linears, is the device_map on rank0 {'linear0':0, 'linear1':1}?

Yes.

will the device_map fall back to the CPU?

t0 (after the model is loaded with the device map):

  • rank0: linear0
  • rank1: linear1

t1 (linear0 is split for TP and dispatched):

  • rank0: linear0/2
  • rank1: linear0/2 + linear1

t2 (linear1 is split for TP and dispatched):

  • rank0: linear0/2 + linear1/2
  • rank1: linear0/2 + linear1/2

The peak memory usage is linear0/2 + linear1, reached on rank1 at t1.

No CPU fallback is needed.
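A small accounting sketch of the two-linear example (the sizes in GB are made up; this only mirrors the bookkeeping in the timeline above, not the engine code):

# Two linears placed by the device map built on rank0, then split for tp == 2.
sizes = {'linear0': 10.0, 'linear1': 8.0}    # GB, assumed values
initial_map = {'linear0': 0, 'linear1': 1}   # device map built on rank0

mem = [0.0, 0.0]                             # current footprint per rank
for name, rank in initial_map.items():       # t0: full weights loaded
    mem[rank] += sizes[name]
peak = list(mem)

for name in ('linear0', 'linear1'):          # t1, t2: split and dispatch
    src = initial_map[name]
    mem[src] -= sizes[name]                  # full weight released on its source rank
    for rank in (0, 1):
        mem[rank] += sizes[name] / 2         # each rank keeps one half of the weight
        peak[rank] = max(peak[rank], mem[rank])

print(peak)  # [10.0, 13.0] -> peak is linear0/2 + linear1, reached on rank1 at t1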

@lvhan028 merged commit a5ff047 into InternLM:main on Feb 28, 2024
4 checks passed
@lvhan028 changed the title from "Update torch TP device map." to "Fix all devices occupation when applying tp to torch engine by updating device map" on Feb 28, 2024