Experimenting with parallel file processing in Erlang

Parallel File Processing With Erlang

This tiny project is about processing line-oriented records where order does not matter. It tries to provide an adequate solution, written in Erlang, for distributing the workload across multiple CPU cores [1].
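Because the order of records does not matter, the input can be cut into independent byte ranges that each end on a line boundary, processed separately, and combined at the end. The following is a minimal sketch of that idea in Python, not the approach taken by parallel.erl; the function names (chunk_offsets, count_records, process_parallel) and the "work" of counting non-empty lines are assumptions made for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def chunk_offsets(path, n_chunks):
    """Split a file into roughly n_chunks byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = max(1, size // n_chunks)
    offsets = []
    start = 0
    with open(path, "rb") as f:
        while start < size:
            # Jump ahead by one step, then read to the end of the current line
            # so that no record is split between two chunks.
            f.seek(min(start + step, size))
            f.readline()
            end = min(f.tell(), size)
            offsets.append((start, end))
            start = end
    return offsets


def count_records(task):
    """Process one byte range; the placeholder work is counting non-empty lines."""
    path, start, end = task
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if line.strip():
                count += 1
    return count


def process_parallel(path, workers=4):
    """Hand each byte range to a worker process and sum the partial results."""
    tasks = [(path, start, end) for start, end in chunk_offsets(path, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_records, tasks))
```

Summing the partial counts works only because the operation is order-independent; a task that depends on record order would need a different way of combining results. On platforms that start worker processes by spawning, process_parallel should be called from a script guarded by `if __name__ == "__main__":`.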

Copyright and License

Copyright (c) 2011, Tobias Rodaebel

This software is released under the Apache License, Version 2.0. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Building and Running

To compile the Erlang programs, just enter:

$ make

You can now run them as follows, where PATH is the file to process:

$ erl -run serial start PATH

Benchmarking

The project includes a very simple Python script, mkdata.py, to generate test data. To generate 5*10^6 lines (~1.1 GB) of test data, enter the following command:

$ python mkdata.py 5000000 > test.txt
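The record format produced by mkdata.py is not described here, so the following is only a hypothetical stand-in for such a generator. It assumes random alphanumeric lines of 220 characters, a width chosen because 5*10^6 lines of about 221 bytes each come to roughly 1.1 GB; the names make_lines and write_lines are likewise assumptions.

```python
import random
import string
import sys


def make_lines(count, width=220, seed=None):
    """Yield `count` random alphanumeric lines of `width` characters (assumed format)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    for _ in range(count):
        yield "".join(rng.choices(alphabet, k=width))


def write_lines(count, out=sys.stdout, **kwargs):
    """Write the generated lines to `out`, one record per line."""
    for line in make_lines(count, **kwargs):
        out.write(line + "\n")
```

Passing a fixed seed makes the generated data reproducible, which helps when comparing runs on different machines.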

The results below (in seconds) come from running our programs on different hardware with the same test data. For the first series, the disk cache was flushed before each run by rebooting the machine.

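============  ========================  ==========================  ==================
Machine       Erlang R14B03 serial.erl  Erlang R14B03 parallel.erl  pypy 1.5 serial.py
============  ========================  ==========================  ==================
MBP           11.342                    7.821                       6.254
MBP (cached)  10.684                    1.124                       3.127
============  ========================  ==========================  ==================

  • MBP = MacBook Pro 2.3 GHz Intel Core i7 / SSD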

Conclusion

As of this writing, Erlang R14B03 seems to be relatively inefficient at ordinary file I/O. Buffering and parallel data processing help to achieve slightly better results, though. For anyone who wants to dive deeper into this matter, I recommend Jay Nelson's talk "Process-Striped Buffering with gen_stream" [2], which he gave at Erlang Factory 2011.

Footnotes

[1] See Tim Bray's Wide Finder Project.
[2] Jay Nelson on Process-Striped Buffering with gen_stream.