Distributed Map/Reduce framework

It is framework, which you can easily use for Map/Reduce model. As C++ is compiled language, so it is not easy to spread code among multiple computers. It is done by the use of shared library and polymorphism, so user need just to implement map, reduce functions and serialization of used types (all primitive types are available). Moreover, on each computer data is processed concurrently.

Dependencies

For using this utility you need to have next third-pary libraries

Libssh
Boost (System, Filesystem, Program options)

In ubuntu you can easily intall them using next command:

sudo apt install libssh-dev libboost-all-dev

Also, Dockerfile is provided with preinstalled libraries. You can build it with command:

docker build . -t milishchuk/mapreduce

Installing

mkdir build
cd build
cmake ..
make
make install

Usage

This repo contains example, which you can use for better understanding. We can divide implementing by next phases:

Custom KeyValueType

If you need specific type for your data, you should implement KeyValueType interface with void parse(const std::string&) and std::string to_string() const methods. Or you can use implemented primitives(char, int, double, long). Implement custom KeyValueTypeFactory for this type.

Map/Reduce

Implement Map and Reduce classes inherited by map_base and reduce_base interfaces.

JobConfig

Implement function std::shared_ptr<job_config> get_config(), which will return config with map,reduce functions, and Factories for key_in, key_out, value_in, value_out, value_res. For better understanding next diagram:

                 map                     groupby                      reduce
key_in, value_in ==> key_out, value_out  
key_in, value_in ==> key_out, value_out  
key_in, value_in ==> key_out, value_out  ======> key_out, [value_out] =====> key_out, value_res
key_in, value_in ==> key_out, value_out 
key_in, value_in ==> key_out, value_out

Shared library

Nextly, you need to make shared library with std::shared_ptr<job_config> get_config() function

Run

You can use blocking run_task_blocking function or non-blocking run_task, which will return std::future. All data to map nodes are read from files on corresponding computers.

Preview

For quick enviroment setup you can use run_example.sh script, which launchs 4 map nodes, reduce node and master node to which result will be returned.

Contributors

Roman Milishchuk @Midren

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
configurator		configurator
example		example
map_node		map_node
reduce_node		reduce_node
ssh		ssh
types		types
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
Dockerfile		Dockerfile
LICENSE		LICENSE
PVS-Studio.cmake		PVS-Studio.cmake
README.md		README.md
json_server.h		json_server.h
map_reduce.cpp		map_reduce.cpp
map_reduce.h		map_reduce.h
run_example.sh		run_example.sh
util.cpp		util.cpp
util.h		util.h

License

Midren/distributed_map_reduce

Folders and files

Latest commit

History

Repository files navigation

Distributed Map/Reduce framework

Table of contents

Dependencies

Installing

Usage

Custom KeyValueType

Map/Reduce

JobConfig

Shared library

Run

Preview

Contributors

About

Topics

Resources

License

Stars

Watchers

Forks

Languages