This is a framework for the Map/Reduce computation model. Because C++ is a compiled language, distributing code across multiple machines is not straightforward; this framework solves it with shared libraries and polymorphism, so the user only needs to implement the map and reduce functions plus serialization for any custom types (all primitive types are provided out of the box). In addition, data is processed concurrently on each machine.
To use this utility you need the following third-party libraries:
- Libssh
- Boost (System, Filesystem, Program options)
On Ubuntu you can install them with the following command:
sudo apt install libssh-dev libboost-all-dev
A Dockerfile with these libraries preinstalled is also provided. You can build it with:
docker build . -t milishchuk/mapreduce
mkdir build
cd build
cmake ..
make
make install
This repo contains an example which you can use for better understanding. Implementation can be divided into the following phases:
If you need a specific type for your data, implement the KeyValueType interface with `void parse(const std::string&)` and `std::string to_string() const` methods, or use one of the provided primitives (char, int, double, long). Also implement a custom KeyValueTypeFactory for this type.
Implement Map and Reduce classes derived from the map_base and reduce_base interfaces.
Implement a function `std::shared_ptr<job_config> get_config()`, which will return a config with the map and reduce functions, and factories for key_in, key_out, value_in, value_out, and value_res. The following diagram illustrates the data flow:
```
                  map                               groupby                   reduce
key_in, value_in ==> key_out, value_out
key_in, value_in ==> key_out, value_out
key_in, value_in ==> key_out, value_out  ======>  key_out, [value_out]  =====>  key_out, value_res
key_in, value_in ==> key_out, value_out
key_in, value_in ==> key_out, value_out
```
Next, you need to build a shared library that exposes the `std::shared_ptr<job_config> get_config()` function.
You can use the blocking run_task_blocking function, or the non-blocking run_task, which returns a std::future. All input data for the map nodes is read from files on the corresponding computers.
For quick environment setup you can use the run_example.sh script, which launches 4 map nodes, a reduce node, and a master node to which the result is returned.
Roman Milishchuk @Midren