Code-Clean and refactor #306
Merged
Conversation
Fix the copy bug.
Write/read buffer uses cache memory for ADAM.
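A minimal sketch of the buffer-cache idea (`get_staging_buffer` is a hypothetical helper, not the actual PatrickStar API): reuse one staging tensor per (size, dtype, device) on the ADAM write/read path instead of allocating a fresh one every step.

```python
import torch

_staging_cache = {}

def get_staging_buffer(numel, dtype, device):
    """Return a reusable staging buffer; allocate only on first use."""
    key = (numel, dtype, str(device))
    if key not in _staging_cache:
        _staging_cache[key] = torch.empty(numel, dtype=dtype, device=device)
    return _staging_cache[key]
```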
…tar into jiaruifang/memcacheadam
This reverts commit 9641791.
Add a memory cache to avoid excessive allocation and freeing.
As we no longer convert tensors to other dtypes, some parameters transferred from the deepspeed context should be removed.
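A sketch of the same pattern at the chunk level (class and attribute names are assumptions, not the real code): freed payloads go into a size-keyed free list and are handed back out on the next allocation, avoiding repeated allocate/free churn on the CUDA allocator.

```python
import collections
import torch

class PayloadCache:
    """Size-keyed free list of chunk payloads."""

    def __init__(self):
        self._free = collections.defaultdict(list)

    def allocate(self, numel, dtype, device):
        key = (numel, dtype, str(device))
        if self._free[key]:
            return self._free[key].pop()  # reuse a cached payload
        return torch.empty(numel, dtype=dtype, device=device)

    def release(self, payload):
        key = (payload.numel(), payload.dtype, str(payload.device))
        self._free[key].append(payload)  # keep for later reuse
```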
Remove HOLD_AFTER_FWD and HOLD_AFTER_BWD. For now, the 3 states stand for:
- COMPUTE: the chunk is used for compute and cannot be moved;
- HOLD: the chunk can be moved or released;
- RELEASED: the chunk is released (payload is None).

Also, merge release_dist and release: release should only mark a chunk as HOLD, and the eviction policy decides how to move it, as sketched below.
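A minimal sketch of the resulting three-state machine (the class and method names are illustrative, not the exact PatrickStar code):

```python
from enum import Enum

class ChunkState(Enum):
    COMPUTE = 1   # in use by computation; must not be moved
    HOLD = 2      # idle; the eviction policy may move or release it
    RELEASED = 3  # payload has been freed (payload is None)

class Chunk:
    def __init__(self, payload):
        self.payload = payload
        self.state = ChunkState.HOLD

    def release(self):
        # release only marks the chunk HOLD; whether and where to move it
        # is decided later by the eviction policy (hence release_dist is gone).
        self.state = ChunkState.HOLD
```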
For now, client.prepare_device is the only API for making room for new chunks (see the sketch below). This commit also removes the redundant code made obsolete by this change.
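A sketch of what prepare_device might do under these assumptions, reusing the ChunkState sketch above (chunks_on, available_mem, and move_chunk are hypothetical helpers):

```python
import torch

def prepare_device(self, device, bytes_needed):
    """Make room on `device` for `bytes_needed` by evicting HOLD chunks."""
    for chunk in self.chunks_on(device):          # hypothetical helper
        if self.available_mem(device) >= bytes_needed:
            return
        if chunk.state == ChunkState.HOLD:        # only idle chunks may move
            self.move_chunk(chunk, torch.device("cpu"))
    if self.available_mem(device) < bytes_needed:
        raise RuntimeError(f"cannot make enough room on {device}")
```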
This commit should be able to run a 12B model on a 1xV100 machine with 240GB of CPU memory on the WeChat cluster. However, all unloaded chunks are instantly moved to CPU, which makes it really slow. Also, compared to the original design, the current activation checkpoints are in float32, which reduces the maximum model size or batch size. In my test, the maximum model size could still reach 12B, but the batch size was reduced from 8 to 4.
This commit adds a gradient chunk to chunk_list, so that we can offload gradients chunk by chunk, roughly as sketched below. This should better utilize the bandwidth between CPU and GPU.
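Roughly, the registration could look like this (a sketch; Chunk, chunk_list, and param_chunks stand in for the real classes and attributes):

```python
# For each parameter chunk, register a same-sized gradient chunk so that
# gradients are offloaded chunk by chunk rather than tensor by tensor.
for param_chunk in chunk_list.param_chunks:   # hypothetical attribute
    grad_chunk = Chunk(payload=None)          # payload allocated lazily
    grad_chunk.numel = param_chunk.numel
    chunk_list.append(grad_chunk)
```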
This commit keeps all gradient chunks on CPU (in pinned memory) and copies the computed gradients into them. After this change, the iteration time for the GPT3_8B model on 1xV100 with 240GB of CPU memory is around 28s.
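A minimal sketch of the pinned-memory copy (the chunk size and offsets are assumed values):

```python
import torch

chunk_numel = 64 * 1024 * 1024  # assumed chunk size
# Pinned (page-locked) CPU memory allows asynchronous DMA copies from GPU.
cpu_grad_chunk = torch.empty(chunk_numel, dtype=torch.float16, pin_memory=True)

def offload_grad(gpu_grad, offset):
    """Copy one computed GPU gradient into its slot in the CPU chunk."""
    dst = cpu_grad_chunk.narrow(0, offset, gpu_grad.numel())
    dst.copy_(gpu_grad.view(-1), non_blocking=True)  # can overlap with compute
```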
In the GPT3_8B model, if we use fp32 params (with autocast), chunk moving takes 14s out of a total of 27s, whereas fp16 params reduce the moving time to 6s. The performance loss is too large to neglect, so we have to give up autocast and switch to apex O2.
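The bandwidth arithmetic behind this: for the same element count, fp32 chunks move twice the bytes of fp16 chunks, which roughly matches the observed drop in moving time from 14s to 6s.

```python
numel = 8 * 10**9  # ~8B parameters (GPT3_8B, assumed)
print(numel * 4 / 2**30, "GiB moved in fp32")  # ~29.8 GiB
print(numel * 2 / 2**30, "GiB moved in fp16")  # ~14.9 GiB
```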
This commit adds the apex loss scaler back and compares PatrickStar with apex O2. Also, we updated the license to the shorter version.
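For reference, the standard apex O2 setup used as the baseline looks like this (the model, data, and loss here are placeholders):

```python
import torch
from apex import amp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters())
# O2 keeps fp16 weights with fp32 master weights inside the optimizer.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = model(torch.randn(8, 1024, device="cuda")).sum()
# The amp loss scaler guards the fp16 backward pass against underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```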
After this commit, the speed of GPT3_8B is 17s/iter.
This commit also adjusts some code formatting and renames the test folder to `tests`, as most of the files in it are not unit tests.
Tiny polish of AsyncMemTracer.
Tiny polish of shell scripts and .gitignore.
relates to #304
cc @reyoung @JiayiFeng