Fix all devices occupation when applying tp to torch engine by updating device map #1172
Conversation
What's the risk of using:

```python
for name, _ in model.named_parameters():
    device_map[name] = device_id
    device_id = (device_id + 1) % world_size
for name, _ in model.named_buffers():
```
Will this method make some GPUs heavily loaded while leaving others mostly idle?
In some special cases, yes.

For example, if `tp == 2` and the weight sizes alternate as [large, small, large, small, ...], device0 would load all the large weights and device1 all the small ones. But I think this pattern is very unlikely for an LLM.
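To see the imbalance concretely, here is a minimal sketch with made-up sizes, applying the same round-robin rule as the loop quoted above:

```python
# Hypothetical weight sizes (e.g. in MiB) illustrating the worst case for tp == 2:
# large and small tensors strictly alternate.
weight_sizes = [1000, 10, 1000, 10, 1000, 10]
world_size = 2

load = [0] * world_size
device_id = 0
for size in weight_sizes:
    load[device_id] += size
    device_id = (device_id + 1) % world_size  # same rule as in the PR's loop

print(load)  # [3000, 30] -- device0 gets every large weight, device1 every small one
```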
```python
model.eval()
model.config.use_cache = True

if rank == 0:
    with LoadNoInit():
        device_map = 'auto'
        device_map = __get_device_map(model, device_map)
```
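For readers following along, here is a self-contained sketch of what `__get_device_map` could look like, assembled from the round-robin loop quoted earlier. The `world_size` parameter and the `'auto'` pass-through are assumptions for illustration; the PR's actual helper may differ:

```python
import torch.nn as nn


def __get_device_map(model: nn.Module, device_map, world_size: int = 2):
    """Spread parameters and buffers across GPUs in round-robin order."""
    if device_map != 'auto':  # assumption: a non-'auto' map is kept as-is
        return device_map
    device_map = {}
    device_id = 0
    # Assign each parameter, then each buffer, to the next device in turn.
    for name, _ in model.named_parameters():
        device_map[name] = device_id
        device_id = (device_id + 1) % world_size
    for name, _ in model.named_buffers():
        device_map[name] = device_id
        device_id = (device_id + 1) % world_size
    return device_map
```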
Assuming there are two GPUs and two linears, is the `device_map` on rank0 `{'linear0': 0, 'linear1': 1}`?

If the GPU memory of `linear0` + `linear1` is greater than the available GPU memory of rank0, but `linear0` and `linear1` are individually less than the available GPU memory of rank0 and rank1 respectively, will `device_map` be switched to the CPU?
> Assuming there are two gpus and two linears, is the device_map on rank0 {'linear0':0, 'linear1':1}?

Yes.

> will device_map be switched to the CPU?

- t0:
  - rank0: `linear0`
  - rank1: `linear1`
- t1:
  - rank0: `linear0/2`
  - rank1: `linear0/2 + linear1`
- t2:
  - rank0: `linear0/2 + linear1/2`
  - rank1: `linear0/2 + linear1/2`

Max memory usage would be `linear0/2 + linear1` on rank1. No CPU fallback.
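The peak can be checked with a few lines of arithmetic; `L0` and `L1` below are hypothetical sizes for the two linears:

```python
# Hypothetical sizes of linear0 and linear1 (any unit).
L0, L1 = 100, 100

# Memory held per rank at each step of the timeline above.
t0 = {'rank0': L0,              'rank1': L1}              # each rank holds its own layer
t1 = {'rank0': L0 / 2,          'rank1': L0 / 2 + L1}     # linear0 split; linear1 still whole on rank1
t2 = {'rank0': L0 / 2 + L1 / 2, 'rank1': L0 / 2 + L1 / 2} # both layers split

peak = max(max(step.values()) for step in (t0, t1, t2))
print(peak)  # 150.0 == L0/2 + L1, reached on rank1 at t1 -- below L0 + L1, so no CPU fallback here
```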
Update the device map when TP is enabled.