
checkpoint m2: pserver checkpoint about lookup table #11410

Closed
seiriosPlus opened this issue Jun 12, 2018 · 9 comments

@seiriosPlus
Collaborator

Checkpoint M2: Save lookup table on PServer.

seiriosPlus self-assigned this Jun 12, 2018
@seiriosPlus
Collaborator Author

# Checkpoint_M2_Plan

Checkpoint directory structure

checkpoint_dir
├── checkpoint_0
│   ├── pserver_0
│   │   └── lookup_table
│   │       └── var2.w_2
│   ├── pserver_1
│   │   └── lookup_table
│   │       └── var.w_1
│   └── trainer_0
│       ├── model
│       │   └── var.w_1
│       ├── epoch_id
│       └── step_id
└── checkpoint_1

## M1 goal: support Save on the Trainer side; support Load on the Trainer/PServer side

## M2 goal: support Save/Load of the lookup table on the PServer side

Design Doc

SAVE phase

  1. Add a block for saving checkpoints to the ProgramDesc that the PServer will run.
  2. The block contains two OPs: checkpoint_prepare_op, which creates the directory structure, and save_op, which saves the vars passed in as inputs. save_op does not yet support SelectedRows, so its save method needs to be extended.
  3. Add an attr (name: checkpoint_block) to listen_and_serve_op; when transpiling on the Python side, write the new block's ID into this attr (see the sketch after this list).
  4. Add a Service (name: save_checkpoint) to the Protobuf definition.
  5. Trigger the checkpoint-save mechanism: Trainer 0 broadcasts a checkpoint message to all PServers over RPC; when listen_and_serve_op receives it, it executes the corresponding block.
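
A minimal Python sketch of what steps 1-3 could look like at transpile time. It uses simplified stand-in Program/Block/Op classes rather than the real Fluid ProgramDesc API; only the op types (checkpoint_prepare_op, save_op) and the checkpoint_block attribute come from this plan, everything else (the class shapes, the dir/file_path attribute names) is illustrative:

```python
# Simplified stand-ins; the real transpiler would work on fluid's ProgramDesc.
class Op:
    def __init__(self, type, inputs=None, attrs=None):
        self.type, self.inputs, self.attrs = type, inputs or {}, attrs or {}

class Block:
    def __init__(self, idx):
        self.idx, self.ops = idx, []

    def append_op(self, **kwargs):
        self.ops.append(Op(**kwargs))

class Program:
    def __init__(self):
        self.blocks = [Block(0)]

    def create_block(self):
        block = Block(len(self.blocks))
        self.blocks.append(block)
        return block

def add_checkpoint_block(pserver_program, listen_and_serve_op,
                         lookup_table_var, checkpoint_dir):
    """Append a save-checkpoint block to the pserver program and record its
    index on listen_and_serve_op through the checkpoint_block attribute."""
    ckpt_block = pserver_program.create_block()
    # 1. checkpoint_prepare_op creates the directory structure for this save.
    ckpt_block.append_op(type="checkpoint_prepare_op",
                         attrs={"dir": checkpoint_dir})
    # 2. save_op persists the lookup table shard; per step 2 of the plan it
    #    still needs SelectedRows support.
    ckpt_block.append_op(type="save_op",
                         inputs={"X": [lookup_table_var]},
                         attrs={"file_path": checkpoint_dir + "/lookup_table"})
    # 3. Tell listen_and_serve_op which block to execute when the
    #    save_checkpoint RPC arrives from trainer 0.
    listen_and_serve_op.attrs["checkpoint_block"] = ckpt_block.idx
    return ckpt_block
```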

LOAD phase

  1. The current LOAD method in load_checkpoint does not support SelectedRows variables; load_op needs to be modified to add SelectedRows support.
  2. The load_checkpoint inputs of the startup_program need to include the lookup table's TableName.
  3. The current load mechanism loads a specified Varname from a file; for the lookup table a TableName is specified and the whole __model__ folder is traversed so that everything is loaded (a sketch follows this list).
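
A hedged sketch of the lookup-table load path in item 3: given a TableName, traverse the checkpoint folder and load every shard file found there. The load_one_shard callback and the table_folder argument are hypothetical stand-ins; the real work would be done by load_op once it supports SelectedRows:

```python
import os

def load_lookup_table(checkpoint_dir, table_name, load_one_shard,
                      table_folder="__model__"):
    """Traverse the checkpoint folder and load every file as a shard of the
    distributed lookup table identified by table_name."""
    folder = os.path.join(checkpoint_dir, table_folder)
    loaded = []
    for fname in sorted(os.listdir(folder)):
        # Per the plan, all files in the folder are loaded for the given table.
        load_one_shard(table_name, os.path.join(folder, fname))
        loaded.append(fname)
    return loaded
```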

@panyx0718
Contributor

When trainer0 sends the message for all pservers to save parameters, the other trainers are still training. We need to make sure it's ok for other trainers to wait (in sync mode) for the save, or for different pservers to save different versions of parameters (in async mode).

@panyx0718
Contributor

For the directory layout:
Does it need to be separated by "pserver ids"?
I assume you want to support some kind of elasticity in the future, so the pserver cluster might be a bit dynamic.

If we keep only persistent variables in the folder, then during dynamic scaling or recovery we can re-arrange the parameter assignments.

@seiriosPlus
Collaborator Author

@panyx0718
In sync mode, when trainer0 starts to save a checkpoint, the other trainers cannot update parameters on the pservers either.
In async mode, there is no clear control over this yet.

@seiriosPlus
Collaborator Author

> For the directory layout:
> Does it need to be separated by "pserver ids"?
> I assume you want to support some kind of elasticity in the future, so the pserver cluster might be a bit dynamic.
>
> If we keep only persistent variables in the folder, then during dynamic scaling or recovery we can re-arrange the parameter assignments.

It is an awesome suggestion, I will discuss how to design it with other partners.

@seiriosPlus
Collaborator Author

Updated checkpoint directory structure:

checkpoint_dir
├── checkpoint_0
│   ├── __lookup_table__
│   │   ├── table.pserver_1
│   │   └── table.pserver_2
│   ├── __model__
│   │   └── var.w_1
│   └── trainer_0
│       ├── epoch_id
│       └── step_id
└── checkpoint_1
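
A small illustrative sketch of how a prepare step might create this layout on disk. The folder names and the table.pserver_<id> convention follow the tree above; the function itself and its signature are hypothetical:

```python
import os

def prepare_checkpoint_dirs(checkpoint_dir, checkpoint_id, pserver_ids,
                            trainer_id=0):
    """Create one checkpoint_<id> folder holding the shared __lookup_table__
    and __model__ directories plus a per-trainer folder for epoch_id/step_id."""
    root = os.path.join(checkpoint_dir, "checkpoint_%d" % checkpoint_id)
    lookup_dir = os.path.join(root, "__lookup_table__")
    model_dir = os.path.join(root, "__model__")
    trainer_dir = os.path.join(root, "trainer_%d" % trainer_id)
    for path in (lookup_dir, model_dir, trainer_dir):
        os.makedirs(path, exist_ok=True)
    # Each pserver writes its own shard of the distributed table here.
    table_shards = [os.path.join(lookup_dir, "table.pserver_%d" % pid)
                    for pid in pserver_ids]
    return {"lookup_table": lookup_dir, "model": model_dir,
            "trainer": trainer_dir, "table_shards": table_shards}
```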

@typhoonzero
Contributor

@panyx0718 @seiriosPlus I think it's hard to make the pserver distributed table "elastic": when the table is large, rehashing or redistributing all the keys takes too much time. Since the checkpoint is saved as files on a distributed filesystem, we cannot randomly access the keys in those files, and reordering may take even more time than training on the daily incremental data.

For a large distributed table, it's sort of a "best practice" not to rehash it but only to recover it when needed.

@jacquesqiao
Member

I agree, and if rehashing is needed, we can do it offline with an independent tool before recovery; that makes the design much simpler. Most of the time, for a very large-scale sparse training task, the number of parameter servers will be fixed.

@panyx0718
Contributor

Sounds good. I thought you guys wanted to do elastic scaling in the future.
