Parameter server design doc #1880

Closed
helinwang opened this issue Apr 25, 2017 · 4 comments


helinwang commented Apr 25, 2017

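Whether or not we rewrite the parameter server, we need a clearly defined parameter server interface:

  • If we do not rewrite it, a clear interface ensures that changes to the existing code do not make it even harder to maintain. A later rewrite then only has to keep the interface and swap out the implementation.
  • If we do rewrite it, we still need a clear interface to keep the code quality high.

For now the parameter server only needs to support: asynchronous SGD, optimization without momentum (plain SGD), and dense updates.
It does not need to support: synchronous SGD, momentum-based optimization algorithms, or sparse updates.
The interface should cover the features that are not needed yet, but those features do not have to be implemented.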
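
To make the scope above concrete, here is a minimal Python sketch, not the actual PaddlePaddle code. The class names, the "sgd" method string, and the list-of-lists parameter layout are all assumptions made for illustration. The interface declares the sparse path, but only the dense, momentum-free SGD update is implemented; gradients are applied as they arrive, with no synchronization barrier between trainers.

```python
import threading
from abc import ABC, abstractmethod


class ParameterServer(ABC):
    """Hypothetical interface; names loosely follow the thread's pseudocode."""

    @abstractmethod
    def update(self, method, gradients):
        """Apply dense gradients; return 0 on success."""

    @abstractmethod
    def download(self):
        """Return a copy of the current dense parameters."""

    def update_sparse(self, method, sparse_gradients):
        # Covered by the interface, deliberately not implemented yet.
        raise NotImplementedError("sparse update is not supported yet")


class AsyncSGDServer(ParameterServer):
    """Dense, momentum-free SGD applied as soon as a gradient arrives."""

    def __init__(self, parameters, learning_rate=0.01):
        self._params = [list(p) for p in parameters]
        self._lr = learning_rate
        self._lock = threading.Lock()  # guards concurrent trainer updates

    def update(self, method, gradients):
        if method != "sgd":
            raise NotImplementedError(f"optimizer {method!r} is not supported yet")
        with self._lock:
            for param, grad in zip(self._params, gradients):
                for i, g in enumerate(grad):
                    param[i] -= self._lr * g  # plain SGD step
        return 0

    def download(self):
        with self._lock:
            return [list(p) for p in self._params]
```

A trainer would call `update("sgd", grads)` whenever it has gradients and `download()` whenever it wants fresh parameters; swapping in a different implementation later only requires another subclass.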

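As I understand it, the design doc needs to define the following interfaces:

  • The command-line arguments for running the parameter server:
    for example: --port 8000 --save-period 60

  • Pseudocode for the RPC server interface:
    Roughly what I have in mind, as an example: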

    int update(string method, list of dense gradients);
    int download(pointer to list of dense gradients);
    int updateSparse(string method, xxx);
    int downloadSparse(xxx);
    int saveModel(string path);
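  • RPC client API, Python and C/C++
    For performance reasons, update and download probably have to be C/C++ APIs (otherwise the data would have to pass through Python).

    • C/C++ API (either C or C++ is fine, as long as the intent is clear). If we rewrite in Go, the implementation will need to expose a C API:
      Roughly what I have in mind, as an example: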
    int update(string method, list of dense gradients);
    int download(pointer to list of dense gradients);
    int updateSparse(string method, xxx);
    int downloadSparse(xxx);
    int saveModel(string path);
    int wait(int t);
    
    // e.g.,
    // update(...);
    // int errorCode = wait(download(...));
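    • Python API (does it only need an API for saving the model?)
  • Treat the core of the parameter server (parameter storage and updating, excluding RPC and the etcd-based service announcement) as a library, and define that library's API.
    Defining a library API cleanly separates the RPC code from the core, which makes it easy to keep the interface and swap the implementation later. It also leaves open writing the main program in Go and linking against this library at build time.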
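
The `wait(download(...))` idea in the client pseudocode can be read as "calls return a handle immediately, and `wait` blocks until that request finishes and yields an error code". The Python sketch below is one possible reading, not a proposed API: `AsyncClient`, the integer handles, the 0/1 error codes, and the `rpc_update` / `rpc_download` callables standing in for the real RPC layer are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncClient:
    """Hypothetical client: calls return an int handle, wait() resolves it."""

    def __init__(self, rpc_update, rpc_download):
        # rpc_update / rpc_download stand in for the real RPC layer.
        self._rpc_update = rpc_update
        self._rpc_download = rpc_download
        # Assumption: one worker, so requests run in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {}
        self._next_handle = 0
        self._lock = threading.Lock()

    def _submit(self, fn, *args):
        with self._lock:
            handle = self._next_handle
            self._next_handle += 1
            self._pending[handle] = self._pool.submit(fn, *args)
        return handle

    def update(self, method, gradients):
        return self._submit(self._rpc_update, method, gradients)

    def download(self, out):
        # The callee fills `out` in place, like the pointer argument
        # in the pseudocode.
        return self._submit(self._rpc_download, out)

    def wait(self, handle, timeout=None):
        """Block until the request finishes; 0 on success, 1 on any error."""
        with self._lock:
            future = self._pending.pop(handle)
        try:
            future.result(timeout=timeout)
            return 0
        except Exception:
            return 1

    def close(self):
        self._pool.shutdown(wait=True)
```

With this shape a trainer can fire `update(...)` without blocking and only pay for the round trip when it calls `wait` on the download handle. A real client would also need to release handles that are never waited on, which this sketch does not do.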


gongweibao commented Apr 25, 2017

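One addition: the RPC mechanism of the current parameter server has some problems:

  • Sockets are in blocking mode with no timeout, and there is no mechanism to retry after an error or to close the connection and reconnect, so every transfer is assumed to succeed. This easily causes problems when there are many nodes and the network is under heavy load.
  • There is one thread per trainer, so with many trainers the thread-scheduling pressure on the parameter server becomes very high.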
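
As a rough illustration of the first point, the Python sketch below sends a payload with a socket timeout and, on failure, closes the connection and retries on a fresh one with exponential backoff. It is not the existing implementation; `send_with_retry`, `tcp_connect`, and the retry and backoff defaults are hypothetical names and values chosen for the example.

```python
import socket
import time


def tcp_connect(host, port, timeout=5.0):
    """Return a factory that opens a TCP connection with a timeout set."""
    # The timeout applies to connect() and to later send/recv calls.
    return lambda: socket.create_connection((host, port), timeout=timeout)


def send_with_retry(payload, connect, retries=3, backoff=0.1):
    """Send payload on a fresh connection, retrying on socket errors.

    Returns the number of attempts used; raises ConnectionError if all fail.
    """
    last_error = None
    for attempt in range(retries):
        conn = None
        try:
            conn = connect()
            conn.sendall(payload)
            return attempt + 1
        except OSError as err:  # includes timeouts and connection resets
            last_error = err
            if attempt + 1 < retries:
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
        finally:
            if conn is not None:
                conn.close()  # never reuse a connection that may be broken
    raise ConnectionError(f"send failed after {retries} attempts") from last_error
```

Usage would look like `send_with_retry(data, tcp_connect("10.0.0.1", 8000))`, where the address is a placeholder. Note that retrying a gradient update is only safe if the server can tolerate or detect duplicates, which the interface would have to specify.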


helinwang commented Apr 25, 2017

The on-wire format used by Go RPC, gob, is quite interesting:

From integers we can build all the other types: bytes, strings, arrays, slices, maps, even floats. Floating-point values are represented by their IEEE 754 floating-point bit pattern, stored as an integer, which works fine as long as you know their type, which we always do. By the way, that integer is sent in byte-reversed order because common values of floating-point numbers, such as small integers, have a lot of zeros at the low end that we can avoid transmitting.

So there is potential to reduce the size of the transferred data.
https://blog.golang.org/gobs-of-data
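
The byte-reversal trick described in the quote can be reproduced in a few lines of Python. This is only an illustration of the encoding idea, not a gob implementation; the helper names `gob_uint` and `gob_float` are made up for this sketch. It follows gob's documented rule for unsigned integers: values below 128 take one byte, larger values are sent as a negated byte count followed by the minimal big-endian bytes.

```python
import struct


def gob_uint(n):
    """Encode an unsigned integer the way gob does."""
    if n < 128:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # The length is sent negated, as a single byte (e.g. 2 -> 0xFE).
    return bytes([256 - len(body)]) + body


def gob_float(x):
    """Encode a float64: byte-reverse its IEEE 754 bits, then send as uint."""
    # Packing big-endian and reading little-endian reverses the byte order,
    # so trailing zero bytes of the mantissa become leading zeros.
    reversed_bits = int.from_bytes(struct.pack(">d", x), "little")
    return gob_uint(reversed_bits)


print(gob_float(17.0).hex())   # fe3140
print(len(gob_float(0.1)))     # 9
```

The saving only applies to values whose mantissa has many trailing zero bits, such as small integers. A value like 0.1 uses all 8 bytes plus the length byte, so typical gradient values would likely see little or no reduction from this alone.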


dzhwinter commented May 2, 2017
