| Description: | a Map/Reduce framework for distributed computing |
-
Hi, I have configured a cluster with two computers, xiliu-fedora (master) and xiliu-public (worker). Both are running Fedora 11 and Python 2.6.3. However, I always get the following timeout errors, and on the worker node erl crashes and writes erl_crash.dump.
2009/10/08 19:20:36 master
map:0 added to waitlist
2009/10/08 19:20:36 xiliu-public
WARN: [map:0] Node failure: "Couldn't connect to xiliu-public (timeout). Node blacklisted temporarily."
2009/10/08 19:20:04 master
map:0 assigned to xiliu-public
2009/10/08 19:20:04 master
map:0 added to waitlist
2009/10/08 19:20:04 master
Map phase
2009/10/08 19:20:04 master
Starting job
2009/10/08 19:20:04 master
New job!
[xiliu@xiliu-public ~]$ tail -f erl_crash.dump
timeout
infinity
fun
'' '$end_of_table' 'nonode@nohost' '_' true
false
=end
^C
Comments
-
Results of external sort depend on the current locale
0 comments Created 2 months ago by tuulos
Labels: bug
External sort uses the LC* environment variables to determine sort order. To guarantee consistent results, LC_ALL=C should be set for sort.
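A minimal sketch of the suggested fix, assuming the external sort is the system `sort` binary run as a subprocess. The helper name `sort_lines` is hypothetical and not part of Disco; the only point is that LC_ALL=C is placed in the child's environment.

```python
import os
import subprocess


def sort_lines(lines):
    """Sort text lines with the system `sort`, forcing the C locale.

    `sort_lines` is a hypothetical helper used for illustration only.
    """
    env = dict(os.environ)
    env["LC_ALL"] = "C"  # byte-wise ordering, independent of the user's locale
    text = "".join(line + "\n" for line in lines)
    result = subprocess.run(
        ["sort"], input=text, env=env,
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(sort_lines(["b", "A", "a", "B"]))
```

With the C locale, upper-case ASCII letters sort before lower-case ones regardless of the locale the calling process runs in.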
Comments
-
Unable to run disco master and slave on same machine in 0.2.3
3 comments Created 3 months ago by cpennington
While experimenting with 0.2.3, I found I couldn't run both
disco master start
and
disco worker start
at the same time. In particular, when starting the worker, I got a message to the effect of "lighttpd is already running with worker config". However, the only lighttpd process running was that with the master config file.
I also noticed that the MASTER_DISCO_PORT config parameter has been removed in 0.2.3, which I suspect is related. Does the master automatically start a worker as well? Or is there some other way of running both a master and slave on the same node?
Comments
Yes, in 0.2.3 the master is automatically capable of being a worker (thus getting rid of DISCO_MASTER_PORT), if it is listed in your configuration. You should only need to start the master.
cpennington
Fri Sep 11 11:40:13 -0700 2009
Ah, that's good to know. I think I missed the implication in the Setting Up Disco docs that this was how it worked, because I was expecting it to work like it did in 0.2.2.
-
If you use the sample count_words.py code from the documentation and wrap the new job call in a loop (sleeping 30 seconds after each job completes), the second call always fails (at least on osx) with a socket.error: (54, 'Connection reset by peer').
If you don't use the loop and instead call the script manually several times, no such error occurs. It seems to me that disco is not closing connections properly after finishing a job.
Comments
-
disco using foo instead of foo.local on macosx
2 comments Created 3 months ago by adriaant
Labels: bug
On macosx, disco-worker gets a master_url in sys.argv[4] that, on a single-host setup, does not use localhost but rather the name part of 'foo.local' (i.e. 'foo'), the alternative host naming scheme. Since this name doesn't resolve, disco fails with "WARN [map:0] Failed to get http://foo:7000/disco/master/_disco_4441/2c/wordcount@1251260554//params: (2, 'Temporary failure in name resolution')"
To fix I hack node/disco-worker with:
master_url = master_url.replace('foo', 'localhost')
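A slightly more general version of that hack could be sketched as follows, in Python 3 rather than the Python 2 that Disco used at the time. This is only an illustration: `fix_master_url` and `resolvable` are hypothetical names, not Disco functions, and the idea of falling back to localhost whenever the host fails to resolve is an assumption that only makes sense for single-host setups.

```python
import socket
from urllib.parse import urlsplit, urlunsplit


def resolvable(host):
    """Return True if the host name can be resolved."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False


def fix_master_url(master_url, can_resolve=resolvable, fallback="localhost"):
    """Replace an unresolvable host in master_url with `fallback`.

    Hypothetical helper: the URL is returned unchanged when its host resolves.
    """
    parts = urlsplit(master_url)
    host = parts.hostname
    if host is None or can_resolve(host):
        return master_url
    netloc = fallback if parts.port is None else "%s:%d" % (fallback, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Unlike a plain string replace, this touches only the host part of the URL, so a job name that happens to contain 'foo' is left alone.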
Comments
-
number of waiting tasks is negative in the web ui when nr_maps < len(input)
0 comments Created 3 months ago by tuulos
Labels: bug
When nr_maps < len(input), the web ui shows a negative number of waiters in the job info box.
Comments
-
The current scheduler is suboptimal when the cluster is fully loaded. Whenever a task finishes on any node, the next task is scheduled on its place, regardless where its input data is located.
A node-specific task queue would solve this problem by pre-sorting tasks to nodes according to locations of input files.
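The idea can be sketched in a few lines of Python. This is not the Disco scheduler (which is written in Erlang); `build_node_queues` and `next_task` are hypothetical names, and the assumption that the host part of the input URL names the node holding the data is made only for this illustration.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def build_node_queues(tasks):
    """Pre-sort (task_id, input_url) pairs into one queue per node.

    Assumes the host part of input_url is the node that stores the input.
    """
    queues = defaultdict(deque)
    for task_id, input_url in tasks:
        queues[urlsplit(input_url).hostname].append(task_id)
    return queues


def next_task(queues, node):
    """Pick the next task for `node`, preferring tasks whose input is local."""
    own = queues.get(node)
    if own:
        return own.popleft()
    # Nothing local is left: take work from the longest remaining queue.
    busiest = max(queues.values(), key=len, default=None)
    if busiest:
        return busiest.popleft()
    return None
```

A node that frees a slot first drains its own queue and only takes remote work once no local input is left, which is the behaviour the current scheduler lacks under full load.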
Comments
-
External map doesn't seem to work with base64 encoded data
1 comment Created 4 months ago by tuulos
Labels: bug
-
disco_worker.erl doesn't rate limit erroneous output
1 comment Created 4 months ago by tuulos
Labels: bug
-
Job fails if a node dies when the job is running
0 comments Created 4 months ago by tuulos
Labels: bug
If a node dies (e.g. "ERROR: Worker crashed in map:51 @ disco-21: noconnection") when the job is running, the job isn't rescheduled properly to another node.
Comments
-
Rate limit is not applied for all messages
0 comments Created 4 months ago by tuulos
Labels: bug
archwild reported on IRC that a Java process can cause Erlang to consume 100% of CPU, if a misbehaving external process produces masses of output to stderr.
This should be easy to fix by applying rate limit to all types of messages, not just .
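For illustration, a rate limit that is applied to every message regardless of its type could look like the token bucket below. The real code lives in disco_worker.erl and is written in Erlang; this Python sketch, including the `RateLimiter` name and its parameters, is hypothetical and only shows the technique.

```python
import time


class RateLimiter:
    """Token bucket: allow at most `max_messages` per `per_seconds` on average."""

    def __init__(self, max_messages, per_seconds, clock=time.monotonic):
        self.capacity = float(max_messages)
        self.rate = max_messages / float(per_seconds)  # tokens refilled per second
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if one more message may pass; otherwise it should be dropped."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Calling `allow()` once per incoming line, whichever stream or message type it belongs to, keeps a flood of output from monopolising the process that reads it.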
Comments
-
Kill job doesn't always kill external processes
1 comment Created 4 months ago by tuulos
Labels: bug
archwild reported on the IRC channel that Java processes launched via the external interface sometimes don't die when the job is killed. They don't seem to die with pkill either.
Test Java with external interface.
Comments
This bug was probably related to the bug #21 (disco_worker didn't rate limit erroneous messages on stdout). The disco_worker gen_server was overwhelmed by messages from stdout and the kill message never got to the process.
Solved by applying rate limit to stdout (commit db76c80fc).
-
Automatic distribution of simple modules
0 comments Created 4 months ago by tuulos
Labels: feature
Disco can copy required files and libraries to nodes when using the external interface. It should be possible to use this feature with native Python code as well, so that simple modules etc. could be packaged in a job request and distributed automatically to nodes.
Comments
-
Automatic inference of required_modules
0 comments Created 4 months ago by tuulos
Labels: feature
Investigate whether the dis.dis() disassembler could be used to infer dependent modules automatically for the m/r functions. See
http://groups.google.com/group/disco-dev/browse_thread/thread/7e63e6da57bcd05b?hl=en
Comments
-
make-lighttpd-proxyconf.py parses comments
0 comments Created 5 months ago by jsurratt
Labels: bug
We have commented out lines of host names and IPs in our /etc/hosts that we'd rather not remove at this time. However, make-lighttpd-proxyconf.py still parses those old and invalid lines. Modifying the script to the code below fixes the problem.
Thanks!

    #!/usr/bin/python
    import os, re
    port = os.environ["DISCO_PORT"]
    print "proxy.server = ("
    r = re.compile("^(\d+\.\d+\.\d+\.\d+)\s+(.*)", re.MULTILINE)
    for x in re.finditer(r, file("/etc/hosts").read()):
        ip, host = x.groups()
        print '"/disco/node/%s/" => (("host" => "%s", "port" => %s)),' %\
            (host, ip, port)
    print ")"

Comments
-
error with deb-packages: /etc/disco/disco.conf not found
1 comment Created 6 months ago by tuulos
Labels: bug
Error while trying out the experimental deb-packages on a fresh Deb 5 VM. There is no /etc/disco/disco.conf found anywhere. See http://groups.google.com/group/disco-dev/browse_thread/thread/1ee969fb18aa61d3?hl=en
Comments
-
Job fails with "truncated input" if a remote input is processed for too long
1 comment Created 6 months ago by tuulos
Labels: bug
Lighttpd has the default setting
server.max-write-idle = 360
which closes the connection after 360 seconds of client inactivity. The default value is too low for many reduce tasks that keep many connections open simultaneously.
Comments
-
22:54 < neal> hi
22:55 < neal> when i try to run homedisco i get the following error
22:55 < neal> Traceback (most recent call last):
22:55 < neal> File "wordcount.py", line 1, in <module>
22:55 < neal> from homedisco import HomeDisco
22:55 < neal> File "/home/neal/school/seng474/frequent_itemsets/cfim/homedisco.py", line 4, in <module>
22:55 < neal> from disconode import disco_worker
22:55 < neal> File "/usr/local/lib/python2.6/dist-packages/disconode/disco_worker.py", line 504, in <module>
22:55 < neal> init()
22:55 < neal> File "/usr/local/lib/python2.6/dist-packages/disconode/disco_worker.py", line 64, in init
22:55 < neal> OOB_URL = ("http://%s/disco/ctrl/oob_get?" % this_master())
22:55 < neal> File "/usr/local/lib/python2.6/dist-packages/disconode/disco_worker.py", line 46, in this_master
22:55 < neal_> return sys.argv[4].split("/")[2]
Comments
-
Too many partitions lead to "IOError: Too many arguments"
1 comment Created 7 months ago by tuulos
Labels: bug, hold
'nr_reduces = 5000' fails due to the map task trying to merge 5000 files together. This may also happen with many input files, in which case nr_reduces = nr_maps / 2.
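One common way around this kind of limit is to merge in several passes, so that no single merge step handles more than a fixed number of inputs. The sketch below shows the technique on in-memory sorted lists standing in for the partition files; `merge_in_batches` and `batch_size` are hypothetical, and this is not how Disco itself merges partitions.

```python
import heapq


def merge_in_batches(sorted_runs, batch_size=100):
    """Merge many sorted runs, never merging more than batch_size at once."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    runs = [list(run) for run in sorted_runs]
    if not runs:
        return []
    while len(runs) > 1:
        # Each pass replaces every group of batch_size runs with one merged run.
        runs = [
            list(heapq.merge(*runs[i:i + batch_size]))
            for i in range(0, len(runs), batch_size)
        ]
    return runs[0]
```

With 5000 inputs and a batch size of 100, two passes are enough: 5000 runs become 50, and 50 become 1.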
Comments
-
#!/usr/bin/env python instead of #!/usr/bin/python
3 comments Created 7 months ago by tuulos
Labels: bug, hold
A few scripts (e.g. disco-worker) use #!/usr/bin/python instead of #!/usr/bin/env python. When the scripts are executed directly, non-system python installations cannot be used.
Comments
Does /usr/bin/env work on OS X? If someone can confirm that it works on Macs, I can make the change.
-
balls.png is missing from the git repository master/www/ directory
Comments
-
Separate webpages for status and configuration
2 comments Created 7 months ago by tuulos
Labels: bug, hold
Instead of having status and configuration on the same webpage, it would be useful to split them into two separate pages. The configuration page can then be password protected and accessible only by the cluster administrator.
Comments
-
Homedisco "can't access local input file" errors
0 comments Created 7 months ago by tuulos
Labels: bug
In the git source, the homedisco example (at the bottom of util/homedisco.py) fails to run with the following error:
sqs2 ~/src/disco: python util/homedisco.py
[09/01/02 00:47:19 none ()] Received a new map job!
[09/01/02 00:47:19 none ()] Done: 3 entries mapped in total
[09/01/02 00:47:19 none ()] 0 chunk://localhost/homedisco@1230878839/map-chunk-0
[09/01/02 00:47:19 none ()] Received a new reduce job!
**[09/01/02 00:47:19 none ()] Starting reduce
connect_input(fname=chunkfile://data/homedisco@1230878839/map-chunk-0)
Traceback (most recent call last):
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 39, in open_local
    f = file(fname)
IOError: [Errno 2] No such file or directory: 'data/homedisco@1230878839/map-chunk-0'
None
**[09/01/02 00:47:19 none (chunkfile://data/homedisco@1230878839/map-chunk-0)] Can't access a local input file: chunkfile://data/homedisco@1230878839/map-chunk-0
Traceback (most recent call last):
  File "util/homedisco.py", line 78, in <module>
    reduce = fun_reduce)
  File "util/homedisco.py", line 44, in new_job
    disco_worker.op_reduce(req)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 430, in op_reduce
    fun_reduce(red_in.iter(), red_out, red_params)
  File "util/homedisco.py", line 60, in fun_reduce
    for k, v in iter:
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 285, in multi_file_iterator
    sze, fd = connect_input(fname)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 131, in connect_input
    return open_local(input, local_file, is_chunk)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 50, in open_local
    % input, input)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 39, in open_local
    f = file(fname)
IOError: [Errno 2] No such file or directory: 'data/homedisco@1230878839/map-chunk-0'
It appears that the open_local path is incorrectly determining the filename from the chunkfile:// URI it is given. It does not prepend the value of the DISCO_ROOT environment variable as it should.
Result_iterator also tries to load the result from a relative path when it should be prepending DISCO_ROOT. It fails with this error if only the open_local issue is fixed:
[09/01/02 00:55:54 none ()] Received a new map job!
[09/01/02 00:55:54 none ()] Done: 3 entries mapped in total
[09/01/02 00:55:54 none ()] 0 chunk://localhost/homedisco@1230879354/map-chunk-0
[09/01/02 00:55:54 none ()] Received a new reduce job!
[09/01/02 00:55:54 none ()] Starting reduce
connect_input(fname=chunkfile://data/homedisco@1230879354/map-chunk-0)
[09/01/02 00:55:54 none ()] Reduce done: 3 entries reduced in total
[09/01/02 00:55:54 none ()] Reduce done
[09/01/02 00:55:54 none ()] 0 disco://localhost/homedisco@1230879354/reduce-disco-0
['file://data/homedisco@1230879354/reduce-disco-0']
Traceback (most recent call last):
  File "util/homedisco.py", line 80, in <module>
    for k, v in result_iterator(res):
  File "build/bdist.macosx-10.5-i386/egg/disco/core.py", line 261, in result_iterator
IOError: [Errno 2] No such file or directory: 'data/homedisco@1230879354/reduce-disco-0'
After applying this patch, the correct output is returned:
sqs2 ~/src/disco: python util/homedisco.py
[09/01/02 00:57:57 none ()] Received a new map job!
[09/01/02 00:57:57 none ()] Done: 3 entries mapped in total
[09/01/02 00:57:57 none ()] 0 chunk://localhost/homedisco@1230879477/map-chunk-0
[09/01/02 00:57:57 none ()] Received a new reduce job!
[09/01/02 00:57:57 none ()] Starting reduce
[09/01/02 00:57:57 none ()] Reduce done: 3 entries reduced in total
[09/01/02 00:57:57 none ()] Reduce done
[09/01/02 00:57:57 none ()] 0 disco://localhost/homedisco@1230879477/reduce-disco-0
KEY red:dog VALUE dog
KEY red:cat VALUE cat
KEY red:possum VALUE possum
The patch also fixes the problem for a custom HomeDisco job I wrote, but there's no test suite for me to determine whether it is correct in all cases. Specifically, it does not appear to introduce issues when running remote jobs (i.e., not through HomeDisco), but I can't guarantee anything. Also, there may be a better way of doing this. (I saw that the LOCAL_PATH env var exists, but it already has "/data" at the end, and the filenames we are appending to $DISCO_ROOT have "/data" at the beginning, so using LOCAL_PATH would result in an incorrect "/data/data".)
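The path handling described above can be sketched as follows. This is not the patch itself, which is not reproduced on this page; `uri_to_local_path` is a hypothetical helper and the /srv/disco root in the usage note is made up for the example.

```python
import posixpath


def uri_to_local_path(uri, disco_root):
    """Map a chunkfile:// or file:// style URI to a path under disco_root.

    Hypothetical helper: everything after '://' is treated as a path
    relative to disco_root, which is what the issue says is missing.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError("not a URI: %r" % (uri,))
    # rest already begins with "data/...", so the root must not end in /data.
    return posixpath.join(disco_root, rest.lstrip("/"))
```

For example, `uri_to_local_path("chunkfile://data/homedisco@1230878839/map-chunk-0", "/srv/disco")` gives `/srv/disco/data/homedisco@1230878839/map-chunk-0`, whereas joining onto a root that already ends in /data would produce the doubled /data/data mentioned above.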
Comments
-
Filter doesn’t work correctly on the job status page.
Comments
-
Tracebacks are formatted incorrectly on the status page
0 comments Created 7 months ago by tuulos
Labels: bug
In some cases tracebacks from failed tasks are formatted incorrectly on the status page.
Comments
-
One of my nodes did not have /etc/environment pointing to the bin directory where the "erl" command was installed. When that node was configured to accept jobs on the master, the entire job would die.
It would be much better if the master just redirected the task to another node.
Comments