
checkpoint m2: pserver checkpoint about lookup table #11410

Closed
seiriosPlus opened this issue Jun 12, 2018 · 9 comments

@seiriosPlus
Collaborator

Checkpoint M2: Save lookup table on PServer.

seiriosPlus self-assigned this Jun 12, 2018
@seiriosPlus
Collaborator Author

# Checkpoint_M2_Plan

Checkpoint directory structure

checkpoint_dir
├── checkpoint_0
│   ├── pserver_0
│   │   └── lookup_table
│   │       └── var2.w_2
│   ├── pserver_1
│   │   └── lookup_table
│   │       └── var.w_1
│   └── trainer_0
│       ├── model
│       │   └── var.w_1
│       ├── epoch_id
│       └── step_id
└── checkpoint_1

## M1 goal: support Save on the Trainer side; support Load on the Trainer/PServer side

## M2 goal: support Save/Load of the lookup table on the PServer side

Design Doc

SAVE phase

  1. Add a block for saving checkpoints to the ProgramDesc that the PServer will run.
  2. The block contains two OPs: checkpoint_prepare_op, which creates the directory structure, and save_op, which saves the vars passed in as inputs. save_op does not yet support SelectedRows, so its save method needs to be extended.
  3. Add an attr (name: checkpoint_block) to listen_and_serve_op; when transpiling on the Python side, write the new block's ID into this attr (see the sketch after this list).
  4. Add a Service (name: save_checkpoint) to the Protobuf definition.
  5. Trigger the checkpoint-save mechanism: Trainer 0 broadcasts a checkpoint message to all PServers over RPC; when listen_and_serve_op receives it, it executes the corresponding block.
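
A minimal Python sketch of what steps 1-3 could look like at transpile time. It uses simplified stand-in Program/Block/Op classes rather than the real Fluid ProgramDesc API; only the op types (checkpoint_prepare_op, save_op) and the checkpoint_block attribute come from this plan, everything else (the class shapes, the dir/file_path attribute names) is illustrative:

```python
# Simplified stand-ins; the real transpiler would work on fluid's ProgramDesc.
class Op:
    def __init__(self, type, inputs=None, attrs=None):
        self.type, self.inputs, self.attrs = type, inputs or {}, attrs or {}

class Block:
    def __init__(self, idx):
        self.idx, self.ops = idx, []

    def append_op(self, **kwargs):
        self.ops.append(Op(**kwargs))

class Program:
    def __init__(self):
        self.blocks = [Block(0)]

    def create_block(self):
        block = Block(len(self.blocks))
        self.blocks.append(block)
        return block

def add_checkpoint_block(pserver_program, listen_and_serve_op,
                         lookup_table_var, checkpoint_dir):
    """Append a save-checkpoint block to the pserver program and record its
    index on listen_and_serve_op through the checkpoint_block attribute."""
    ckpt_block = pserver_program.create_block()
    # 1. checkpoint_prepare_op creates the directory structure for this save.
    ckpt_block.append_op(type="checkpoint_prepare_op",
                         attrs={"dir": checkpoint_dir})
    # 2. save_op persists the lookup table shard; per step 2 of the plan it
    #    still needs SelectedRows support.
    ckpt_block.append_op(type="save_op",
                         inputs={"X": [lookup_table_var]},
                         attrs={"file_path": checkpoint_dir + "/lookup_table"})
    # 3. Tell listen_and_serve_op which block to execute when the
    #    save_checkpoint RPC arrives from trainer 0.
    listen_and_serve_op.attrs["checkpoint_block"] = ckpt_block.idx
    return ckpt_block
```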

LOAD phase

  1. The current LOAD method in load_checkpoint does not support SelectedRows variables; load_op needs to be modified to add SelectedRows support.
  2. The load_checkpoint inputs of the startup_program need to include the lookup table's TableName.
  3. The current load mechanism loads a specified Varname from a file; for the lookup table a TableName is specified and the whole __model__ folder is traversed so that everything is loaded (a sketch follows this list).
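
A hedged sketch of the lookup-table load path in item 3: given a TableName, traverse the checkpoint folder and load every shard file found there. The load_one_shard callback and the table_folder argument are hypothetical stand-ins; the real work would be done by load_op once it supports SelectedRows:

```python
import os

def load_lookup_table(checkpoint_dir, table_name, load_one_shard,
                      table_folder="__model__"):
    """Traverse the checkpoint folder and load every file as a shard of the
    distributed lookup table identified by table_name."""
    folder = os.path.join(checkpoint_dir, table_folder)
    loaded = []
    for fname in sorted(os.listdir(folder)):
        # Per the plan, all files in the folder are loaded for the given table.
        load_one_shard(table_name, os.path.join(folder, fname))
        loaded.append(fname)
    return loaded
```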

@panyx0718
Contributor

When trainer0 sends the message for all pservers to save parameters, the other trainers are still training. We need to make sure it's ok for other trainers to wait (in sync mode) for the save, or for different pservers to save different versions of parameters (in async mode).

@panyx0718
Contributor

For the directory layout:
Does it need to be separated by "pserver ids"?
I assume you want to support some kind of elasticity in the future, so the pserver cluster might be a bit dynamic.

If we keep only persistent variables in the folder, then during dynamic scaling or recovery we can re-arrange the parameter assignments.

@seiriosPlus
Collaborator Author

@panyx0718
In sync mode, when trainer0 starts to save a checkpoint, the other trainers cannot update parameters on the pservers either.
In async mode, there is no clear control over this yet.

@seiriosPlus
Collaborator Author

> For the directory layout:
> Does it need to be separated by "pserver ids"?
> I assume you want to support some kind of elasticity in the future, so the pserver cluster might be a bit dynamic.
>
> If we keep only persistent variables in the folder, then during dynamic scaling or recovery we can re-arrange the parameter assignments.

It is an awesome suggestion, I will discuss how to design it with other partners.

@seiriosPlus
Collaborator Author

Updated checkpoint directory structure:

checkpoint_dir
├── checkpoint_0
│   ├── __lookup_table__
│   │   ├── table.pserver_1
│   │   └── table.pserver_2
│   ├── __model__
│   │   └── var.w_1
│   └── trainer_0
│       ├── epoch_id
│       └── step_id
└── checkpoint_1
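
A small illustrative sketch of how a prepare step might create this layout on disk. The folder names and the table.pserver_<id> convention follow the tree above; the function itself and its signature are hypothetical:

```python
import os

def prepare_checkpoint_dirs(checkpoint_dir, checkpoint_id, pserver_ids,
                            trainer_id=0):
    """Create one checkpoint_<id> folder holding the shared __lookup_table__
    and __model__ directories plus a per-trainer folder for epoch_id/step_id."""
    root = os.path.join(checkpoint_dir, "checkpoint_%d" % checkpoint_id)
    lookup_dir = os.path.join(root, "__lookup_table__")
    model_dir = os.path.join(root, "__model__")
    trainer_dir = os.path.join(root, "trainer_%d" % trainer_id)
    for path in (lookup_dir, model_dir, trainer_dir):
        os.makedirs(path, exist_ok=True)
    # Each pserver writes its own shard of the distributed table here.
    table_shards = [os.path.join(lookup_dir, "table.pserver_%d" % pid)
                    for pid in pserver_ids]
    return {"lookup_table": lookup_dir, "model": model_dir,
            "trainer": trainer_dir, "table_shards": table_shards}
```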

@typhoonzero
Contributor

@panyx0718 @seiriosPlus I think it's hard to make the pserver distributed table "elastic": when the table is large, rehashing or redistributing all the keys takes too much time. Since the checkpoint is saved as files on a distributed filesystem, we cannot randomly access the keys in those files, and reordering may take even more time than training on the daily incremental data.

For a large distributed table, it's sort of a "best practice" not to rehash it but only to recover it when needed.

@jacquesqiao
Member

I agree, and if rehashing is needed, we can do it offline with an independent tool before recovery; that makes the design much simpler. Most of the time, for a very large-scale sparse training task, the number of parameter servers will be fixed.

@panyx0718
Contributor

Sounds good. I thought you guys wanted to do elastic scaling in the future.
