Code-Clean and refactor #306
Merged
Conversation
Fix the copy bug.
Write/read buffer uses cache memory for ADAM.
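A minimal sketch of the buffer-cache idea (`get_staging_buffer` is a hypothetical helper, not the actual PatrickStar API): reuse one staging tensor per (size, dtype, device) on the ADAM write/read path instead of allocating a fresh one every step.

```python
import torch

_staging_cache = {}

def get_staging_buffer(numel, dtype, device):
    """Return a reusable staging buffer; allocate only on first use."""
    key = (numel, dtype, str(device))
    if key not in _staging_cache:
        _staging_cache[key] = torch.empty(numel, dtype=dtype, device=device)
    return _staging_cache[key]
```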
…tar into jiaruifang/memcacheadam
This reverts commit 9641791.
Add a memory cache to avoid excessive allocation and freeing.
As we no longer convert tensors to other dtypes, some parameters transferred from the deepspeed context should be removed.
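A sketch of the same pattern at the chunk level (class and attribute names are assumptions, not the real code): freed payloads go into a size-keyed free list and are handed back out on the next allocation, avoiding repeated allocate/free churn on the CUDA allocator.

```python
import collections
import torch

class PayloadCache:
    """Size-keyed free list of chunk payloads."""

    def __init__(self):
        self._free = collections.defaultdict(list)

    def allocate(self, numel, dtype, device):
        key = (numel, dtype, str(device))
        if self._free[key]:
            return self._free[key].pop()  # reuse a cached payload
        return torch.empty(numel, dtype=dtype, device=device)

    def release(self, payload):
        key = (payload.numel(), payload.dtype, str(payload.device))
        self._free[key].append(payload)  # keep for later reuse
```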
Remove HOLD_AFTER_FWD and HOLD_AFTER_BWD. For now, the 3 states stand for:
- COMPUTE: the chunk is used for compute and cannot be moved;
- HOLD: the chunk can be moved or released;
- RELEASED: the chunk is released (payload is None).

Also, merge release_dist and release: release should only mark a chunk as HOLD, and the eviction policy decides how to move it, as sketched below.
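A minimal sketch of the resulting three-state machine (the class and method names are illustrative, not the exact PatrickStar code):

```python
from enum import Enum

class ChunkState(Enum):
    COMPUTE = 1   # in use by computation; must not be moved
    HOLD = 2      # idle; the eviction policy may move or release it
    RELEASED = 3  # payload has been freed (payload is None)

class Chunk:
    def __init__(self, payload):
        self.payload = payload
        self.state = ChunkState.HOLD

    def release(self):
        # release only marks the chunk HOLD; whether and where to move it
        # is decided later by the eviction policy (hence release_dist is gone).
        self.state = ChunkState.HOLD
```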
For now, client.prepare_device is the only API for making room for new chunks (see the sketch below). This commit also removes the redundant code made obsolete by this change.
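A sketch of what prepare_device might do under these assumptions, reusing the ChunkState sketch above (chunks_on, available_mem, and move_chunk are hypothetical helpers):

```python
import torch

def prepare_device(self, device, bytes_needed):
    """Make room on `device` for `bytes_needed` by evicting HOLD chunks."""
    for chunk in self.chunks_on(device):          # hypothetical helper
        if self.available_mem(device) >= bytes_needed:
            return
        if chunk.state == ChunkState.HOLD:        # only idle chunks may move
            self.move_chunk(chunk, torch.device("cpu"))
    if self.available_mem(device) < bytes_needed:
        raise RuntimeError(f"cannot make enough room on {device}")
```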
This commit should be able to run a 12B model on a 1xV100 machine with 240GB of CPU memory on the WeChat cluster. However, all unloaded chunks are instantly moved to CPU, which makes it really slow. Also, compared to the original design, the current activation checkpoints are in float32, which reduces the maximum model size or batch size. In my test, the maximum model size could still reach 12B, but the batch size was reduced from 8 to 4.
This commit adds a gradient chunk to chunk_list, so that we can offload gradients chunk by chunk, roughly as sketched below. This should better utilize the bandwidth between CPU and GPU.
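Roughly, the registration could look like this (a sketch; Chunk, chunk_list, and param_chunks stand in for the real classes and attributes):

```python
# For each parameter chunk, register a same-sized gradient chunk so that
# gradients are offloaded chunk by chunk rather than tensor by tensor.
for param_chunk in chunk_list.param_chunks:   # hypothetical attribute
    grad_chunk = Chunk(payload=None)          # payload allocated lazily
    grad_chunk.numel = param_chunk.numel
    chunk_list.append(grad_chunk)
```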
This commit keeps all gradient chunks on CPU (in pinned memory) and copies the computed gradients into them. After this change, the iteration time for the GPT3_8B model on 1xV100 with 240GB of CPU memory is around 28s.
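A minimal sketch of the pinned-memory copy (the chunk size and offsets are assumed values):

```python
import torch

chunk_numel = 64 * 1024 * 1024  # assumed chunk size
# Pinned (page-locked) CPU memory allows asynchronous DMA copies from GPU.
cpu_grad_chunk = torch.empty(chunk_numel, dtype=torch.float16, pin_memory=True)

def offload_grad(gpu_grad, offset):
    """Copy one computed GPU gradient into its slot in the CPU chunk."""
    dst = cpu_grad_chunk.narrow(0, offset, gpu_grad.numel())
    dst.copy_(gpu_grad.view(-1), non_blocking=True)  # can overlap with compute
```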
In the GPT3_8B model, if we use fp32 params (with autocast), chunk moving takes 14s out of a total of 27s, whereas fp16 params reduce the moving time to 6s. The performance loss is too large to neglect, so we have to give up autocast and switch to apex O2.
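The bandwidth arithmetic behind this: for the same element count, fp32 chunks move twice the bytes of fp16 chunks, which roughly matches the observed drop in moving time from 14s to 6s.

```python
numel = 8 * 10**9  # ~8B parameters (GPT3_8B, assumed)
print(numel * 4 / 2**30, "GiB moved in fp32")  # ~29.8 GiB
print(numel * 2 / 2**30, "GiB moved in fp16")  # ~14.9 GiB
```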
This commit adds the apex loss scaler back and compares PatrickStar with apex O2. Also, we updated the license to the shorter version.
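For reference, the standard apex O2 setup used as the baseline looks like this (the model, data, and loss here are placeholders):

```python
import torch
from apex import amp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters())
# O2 keeps fp16 weights with fp32 master weights inside the optimizer.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = model(torch.randn(8, 1024, device="cuda")).sum()
# The amp loss scaler guards the fp16 backward pass against underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```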
After this commit, the speed of GPT3_8B is 17s/iter.
This commit also adjusts some code formatting and renames the test folder to `tests`, as most of the files in it are not unit tests.
Tiny polish of AsyncMemTracer.
Tiny polish of shell scripts and .gitignore.
relates to #304
cc @reyoung @JiayiFeng