Parameter server is a distributed machine learning framework. It scales machine learning applications to industry-level problems, namely 10 to 100 billions of samples and features, 100 to 1000 machines. It is a joint project by CMU SML-Lab, Baidu IDL, and Google.
- Efficient communication. All communications are asynchronous. It is optimized for machine learning tasks to reduce network traffic and overhead.
- Flexible consistency models. The system provides flexible consistency models to allow the algorithm designer to balance algorithmic convergence rate and system efficiency, where the best trade-off depends on data, algorithm, and hardware.
- Elastic Scalability. New nodes can be added without restarting the running framework.
- Fault Tolerance and Durability. Recovery from and repair of non-catastraphic machine failures within several seconds, without interrupting computation.
- Ease of Use. The globally shared parameters are represented as (potentially sparse) vectors and matrices to facilitate development of machine learning applications. The linear algebra data types come with high-performance multi-threaded linear algebra libraries.
Requirement:
- compiler: gcc >= 4.7 (tested on 4.7.x, 4.8.x) or clang >= 3.4 (tested on 3.4). You can change the first two lines of src/Makefile to switch between gcc and clang.
- system: Should work on both Linux and Mac OS (tested on Ubuntu 12.10, 13.10, Max OS X 10.8 10.9)
- depended libraries: The system depends on zeromq, gflags, glogs, gtest, protobuf, zlib, snappy, eigen3 and optional mpi. We provide install.sh to build them from sources automatically.
Build parameter server:
cd src && make -j8
The -j
option specifies how many threads are used to
build the projects, you may change it to a more proper value.
The system can read both raw binary data and protobuf data. It also supports several text formats. In a text format, each example is presented as a line of plain text.
The format of one instance:
label;group_id feature[:value] feature[:value] ...;groud_id ...;...;
- label: +1/-1 for binary classification, 0,1,2,... for multiclass classification, a float value for regression. And certainly it can be empty.
- group_id: an integer identity of a feature group, each instance should contains at least one feature group.
- feature: for sparse data, it is an 64-bit integer presenting the feature id, while for dense data, it is a float feature value
- weight: only valid for non-binary sparse data, it is a float feature value.
The libsvm format uses the sparse format. It presents each instance as:
label feature_id:value feature_id:value ...
Check examples on RCV1.
One way to start the system is using mpirun
. To install mpi
on Ubuntu, run sudo apt-get install mpich
, or sudo port install mpich
for Mac OS. The system can be
also started via ssh
and resource manager such as yarn
without mpi (In progress).
mpirun -np 5 ./ps_mpi -interface lo0 -num_workers 2 -num_servers 2 -app ../config/block_prox_grad.config
The augments:
- -np: the number of processes created by
mpirun
. It should >= num_workers + num_servers + 1 (the scheduler). Use-hostfile hosts
to specify the machines. - -interface: the network interface, run
ifconfig
to find the available network interfaces - -num_workers: the number of worker nodes. Each one will get a part of training data
- -num_servers: the number of server nodes. Each one will get a part of the model.
- -app: the application configuration
Use ./ps_mpi --help
to see more arguments.
To run sparse logistic regression on RCV1 from scratch:
mkdir your_working_dir
cd your_working_dir
git clone git@github.com:mli/parameter_server.git .
git clone git@github.com:mli/parameter_server_third_party.git third_party
git clone git@github.com:mli/parameter_server_data.git data
third_party/install.sh
cd src && make -j8
mpirun -np 5 ./ps_mpi -interface lo0 -num_workers 2 -num_servers 2 -app ../config/block_prox_grad.config