Code-Clean and refactor #306

Merged · 1,045 commits · Apr 7, 2022
Conversation

@zhuzilin (Collaborator) commented Mar 9, 2022

relates to #304

cc @reyoung @JiayiFeng

As we no longer convert tensors to other dtypes, some
parameters transferred from the deepspeed context shall
be removed.
Remove the HOLD_AFTER_FWD and HOLD_AFTER_BWD states.
For now, the three states stand for the following (see the sketch after this list):

- COMPUTE: the chunk is being used for computation and cannot be moved;
- HOLD: the chunk can be moved or released;
- RELEASED: the chunk has been released (its payload is None).
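
A minimal sketch of this three-state lifecycle, assuming a simple chunk object; names and structure are illustrative, not the actual PatrickStar code:

```python
# Illustrative sketch only, not the actual PatrickStar implementation.
from enum import Enum, auto


class ChunkState(Enum):
    COMPUTE = auto()   # used by computation, must not be moved
    HOLD = auto()      # idle, may be moved between devices or released
    RELEASED = auto()  # payload has been freed


class Chunk:
    def __init__(self, capacity):
        self.capacity = capacity
        self.payload = None  # e.g. a torch.Tensor once allocated
        self.state = ChunkState.RELEASED

    def access(self):
        assert self.payload is not None, "allocate the payload before compute"
        self.state = ChunkState.COMPUTE

    def release(self):
        # Only mark the chunk as movable; where (or whether) to move it
        # is decided later by the eviction policy.
        self.state = ChunkState.HOLD

    def free(self):
        self.payload = None
        self.state = ChunkState.RELEASED
```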

Also, merge release_dist and release: release shall only
mark a chunk as HOLD, and the eviction policy decides how
to move it.
For now, client.prepare_device is the only
API that makes room for new chunks.
This commit also removes the code made redundant by
this change.
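
A rough sketch of what such a single "make room" entry point could look like. prepare_device is the API name from this PR, but the body, the chunk methods, and the CPU-first eviction order are assumptions, building on the ChunkState sketch above:

```python
import torch

# Assumptions: each chunk exposes .state, .capacity, .device and a .move(device)
# method (ChunkState as in the sketch above); free_bytes(device) reports
# currently available memory on a device.
def prepare_device(chunk_list, target_device, need_bytes, free_bytes):
    """Make sure `need_bytes` of room exists on `target_device`."""
    available = free_bytes(target_device)
    for chunk in chunk_list:
        if available >= need_bytes:
            break
        # Only HOLD chunks are movable; COMPUTE chunks must stay put.
        if chunk.state is ChunkState.HOLD and chunk.device == target_device:
            chunk.move(torch.device("cpu"))  # evict to CPU (or release it)
            available += chunk.capacity
    if available < need_bytes:
        raise RuntimeError(f"cannot make {need_bytes} bytes of room on {target_device}")
```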
This commit should be able to run a 12B model on
a 1xV100 machine with 240GB of CPU memory on the WeChat cluster.
However, all the unloaded chunks are instantly moved to
CPU, which makes it really slow.

Also, compared to the original design, the current activation
checkpoints are stored in float32, which makes the maximum model size or
batch size smaller. In my test, the maximum model could still
reach 12B, but the batch size is reduced from 8 to 4.
This commit adds gradient chunks to chunk_list, so that
we can offload gradients chunk by chunk. This should better
utilize the bandwidth between CPU and GPU.
This commit keeps all gradient chunks on CPU (pin_memory)
and copies the computed gradients into them.
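
A hedged sketch of the pinned-CPU gradient-chunk idea; the flat buffer, its size, and the helper name are illustrative, only the pin_memory plus non-blocking copy pattern is the point:

```python
import torch

# Illustrative sketch: one flat, pinned CPU buffer acting as a gradient chunk.
CHUNK_NUMEL = 64 * 1024 * 1024  # assumed chunk size, not taken from the PR
grad_chunk = torch.empty(CHUNK_NUMEL, dtype=torch.float32, pin_memory=True)


def offload_grad(param, offset):
    """Copy a freshly computed GPU gradient into the pinned CPU chunk."""
    n = param.grad.numel()
    # Pinned memory allows the device-to-host copy to run asynchronously.
    grad_chunk[offset:offset + n].copy_(param.grad.reshape(-1), non_blocking=True)
    param.grad = None  # drop the GPU copy right away
    return offset + n
```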

After this change, the time for the GPT3_8B model on 1xV100 with 240GB CPU
memory is around 28s.
For the GPT3_8B model, if we use fp32 params (and autocast),
the chunk moving time is 14s out of a total of 27s;
however, using fp16 params reduces the moving time
to 6s. The performance loss is too large to neglect, so
we have to give up autocast and turn to apex O2.
This commit adds the apex loss scaler back and compares
PatrickStar with the Apex O2 stage.
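
For reference, typical Apex O2 usage with its loss scaler looks roughly like this (standard apex amp API; the model and data loader are placeholders, and this is not PatrickStar-specific code):

```python
import torch
from apex import amp

model = build_model().cuda()          # build_model is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# O2: fp16 params/activations with fp32 master weights and dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

for batch in loader:                  # loader is a placeholder
    loss = model(batch)               # assume the model returns the loss
    optimizer.zero_grad()
    # The loss scaler scales the loss before backward and unscales the grads.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```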

Also, we updated the license to the shorter version.
After this commit, the speed of GPT3_8B is 17s/iter.
This commit also adjusts some code formatting and renames the
test folder to `tests`, as most of the files there are not unit tests.
@reyoung reyoung self-requested a review March 10, 2022 06:27
@reyoung reyoung changed the title [WIP] Refactor Code-Clean and refactor Apr 7, 2022
@reyoung reyoung merged commit 9a5fe2c into Tencent:orphan Apr 7, 2022