| Description: | a Map/Reduce framework for distributed computing |
-
0 comments. Created 2 months ago by cpennington.
Disco 0.2.3 doesn't work with Amazon EC2 instructions (bug)
Disco 0.2.3 depends on erlang-base 13b, which Debian Lenny doesn't supply. The image "ami-e69d798f" is a Lenny install, so it doesn't have the required packages.
Furthermore, setup-instances.py overwrites /etc/apt/sources.list so that only the discoproject and lenny repositories are used, which reinforces the same issue.
Comments
-
0 comments. Created 2 months ago by adriaant.
Troubleshooting section not updated for 0.2.3 (bug)
http://discoproject.org/doc/start/troubleshoot.html#troubleshooting contains suggestions that don't work for 0.2.3. For example,
    erl -boot master/disco
cannot be done. I'm trying to figure out why I cannot start the master on a Linux box.
Comments
-
0 comments. Created about 1 month ago by tuulos.
Firewall issues should be mentioned in documentation (bug)
Comments
-
0 comments. Created 2 months ago by cpennington.
Disco 0.2.3 master can't connect to remote Amazon EC2 nodes (bug)
Running the Disco 0.2.3 source distribution on ami-2946a740 (Debian Squeeze AMI from http://alestic.com/), with one master node and one remote slave, I was unable to run a job successfully. I went through all of the steps on the Troubleshooting Disco page and found that the slave:start test failed with an {error, timeout} result. I tried a number of different flavors of node names (EC2 internal name, short names defined in /etc/hosts) and got the same result. I was able to run slave:start on the slave node to start a slave locally, but the same command from the master failed.
Comments
-
0 comments. Created about 1 month ago by tuulos.
Node config table doesn't work correctly with multiple lines (bug)
Adding new lines to the config table seems to drop old lines sometimes. Correspondingly, adding new nodes to the cluster on the fly doesn't always produce the expected results.
Comments
-
0 comments. Created about 1 month ago by tuulos.
Worker behaves badly if it runs out of file descriptors (bug)
You can easily make the worker run out of file descriptors, e.g. by setting nr_reduces = 1000. This can lead to, for example, spurious DNS failures, since no new sockets can be opened.
Disco should limit the number of fds used in a worker process, at least to make the error messages more descriptive.
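One way to make the failure more descriptive is to compare the number of descriptors a task expects to need against the process limit before it starts opening files. The following is a minimal sketch in modern Python, not Disco's worker code; the function name check_fd_budget, the reserve margin, and the soft_limit override are all hypothetical and exist only for illustration.

```python
import resource


def check_fd_budget(needed_fds, reserve=64, soft_limit=None):
    """Fail early, with a readable message, if needed_fds cannot be opened.

    reserve is an assumed safety margin for sockets, logs and stdio.
    soft_limit lets a caller (or a test) override the limit that would
    otherwise be read from the operating system.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None  # no limit configured, nothing to check
    available = soft_limit - reserve
    if needed_fds > available:
        raise RuntimeError(
            "task needs %d file descriptors but only %d are available "
            "(soft limit %d, %d reserved); lower nr_reduces or raise the limit"
            % (needed_fds, available, soft_limit, reserve))
    return available
```

A worker could call such a check with the number of reduce partitions it is about to open, so that the job fails with this message instead of an unrelated DNS or socket error.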
Comments
-
0 comments. Created about 1 month ago by tuulos.
Master should monitor memory consumption of jobs (bug)
If a job leaks memory or otherwise consumes vast amounts of memory, it can easily take nodes down. Master should monitor per-task/job/node memory consumption and kill rogue processes.
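The per-task part of such monitoring could look roughly like the sketch below. It assumes Linux, where /proc/<pid>/status reports resident memory in a VmRSS line, and it is written in Python for illustration even though the master itself is Erlang. The helper names and the idea of a fixed per-task limit are assumptions, not an existing Disco feature.

```python
def parse_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the resident set size of a live process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_rss_kb(f.read())


def find_rogue_tasks(rss_by_pid, limit_kb):
    """Return pids whose resident memory exceeds limit_kb, largest first."""
    over = [(rss, pid) for pid, rss in rss_by_pid.items()
            if rss is not None and rss > limit_kb]
    return [pid for _rss, pid in sorted(over, reverse=True)]
```

A supervisor would poll read_rss_kb for each worker pid it started and terminate whatever find_rogue_tasks returns; how often to poll and what limit to use are policy decisions left open here.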
Comments
-
0 comments. Created about 1 month ago by tuulos.
Job with thousands of reduces fails due to lack of file descriptors (bug)
Raising the number of reduces to thousands is problematic because it requires a huge number of open file descriptors per process.
Comments
-
0 comments. Created about 1 month ago by jflatow.
jobs should disappear from list immediately after deleted (bug)
Comments
-
It should be possible to combine various low-level readers, such as gzip decoding, with higher level custom readers. This requires that you can specify a list (stack) of readers / writers instead of just one. See
http://groups.google.com/group/disco-dev/browse_thread/thread/7e63e6da57bcd05b?hl=en
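The reader stack idea can be sketched as a list of callables where each reader consumes the output of the previous one. This is an illustration in modern Python, not Disco's actual reader interface; gzip_reader, kv_reader and apply_reader_stack are hypothetical names, and the tab-separated record format is assumed only for the example.

```python
import gzip
import io


def gzip_reader(stream):
    """Low-level reader: yield decompressed lines from a binary stream."""
    with gzip.GzipFile(fileobj=stream) as gz:
        for line in gz:
            yield line


def kv_reader(lines):
    """Higher-level reader: split each tab-separated line into (key, value)."""
    for line in lines:
        key, _sep, value = line.rstrip(b"\n").partition(b"\t")
        yield key, value


def apply_reader_stack(stream, readers):
    """Feed the output of each reader into the next one, in list order."""
    data = stream
    for reader in readers:
        data = reader(data)
    return data


# Usage: gzip decoding first, then the custom record reader on top of it.
raw = io.BytesIO(gzip.compress(b"a\t1\nb\t2\n"))
records = list(apply_reader_stack(raw, [gzip_reader, kv_reader]))
```

Because every reader is a generator, the stack stays lazy: no layer holds more than one record at a time.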
Comments
-
Merge Eran Sandler's patches to add support for Gzipped inputs. See
http://groups.google.com/group/disco-dev/browse_thread/thread/7e63e6da57bcd05b?hl=en
Comments
-
0 comments. Created 5 months ago by tuulos.
Reduce doesn't fail nicely if HTTP server down (bug)
When running tests, map tasks execute fine even if lighttpd is down on a node. However, reduce fails since it can't access the map results. At least with comm_httplib this results in a cryptic error message. It should instead say clearly what is wrong.
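A clearer failure could be produced by catching the low-level connection error where the map results are fetched and re-raising it with the node name attached. The sketch below uses Python 3's urllib rather than comm_httplib; fetch_map_result and MapResultUnavailable are hypothetical names, and the opener argument exists only so the fetch function can be swapped out.

```python
import urllib.request
from urllib.parse import urlparse


class MapResultUnavailable(Exception):
    """Raised when a reduce task cannot download a map result."""


def fetch_map_result(url, opener=urllib.request.urlopen, timeout=10):
    """Download one map result, turning connection errors into a clear message."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read()
    except OSError as exc:
        # URLError, socket timeouts and refused connections are OSError subclasses.
        host = urlparse(url).netloc
        raise MapResultUnavailable(
            "cannot fetch map results from %s (%s); "
            "is the HTTP server running on that node?" % (host, exc)) from exc
```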
Comments
-
0 comments. Created 3 months ago by tuulos.
Job fails due to a temporary glusterfs error (bug)
Job fails fatally due to a glusterfs error (timeout), even though the error should be recoverable.
Gluster logs (client):
[2009-08-12 15:32:09] E [client-protocol.c:437:client_ping_timer_expired] dx26-vol1: Server 172.16.1.27:9900 has not responded in the last 60 seconds, disconnecting.
Server:
[2009-08-12 18:44:41] E [server-protocol.c:3903:server_readv] vol1: invalid argument: state->fd
[2009-08-12 18:44:41] W [server-protocol.c:4210:server_fstat] server: fd - 1: unresolved fd
[2009-08-12 18:44:41] E [server-protocol.c:5763:server_finodelk] server: fd - 1: unresolved fd
Disco:
Worker failed. Last words:
Traceback (most recent call last):
  File "/usr/bin/disco-worker", line 69, in
    method(m)
  File "/var/lib/python-support/python2.5/disconode/disco_worker.py", line 392, in op_map
    run_map(job_input[0], partitions, map_params)
  File "/var/lib/python-support/python2.5/disconode/disco_worker.py", line 327, in run_map
    for entry in reader:
  File "/home/tuulos/src/disco/pydisco/disco/func.py", line 60, in netstr_reader
  File "/home/tuulos/src/disco/pydisco/disco/func.py", line 40, in read_netstr
IOError: [Errno 107] Transport endpoint is not connected
close failed: [Errno 107] Transport endpoint is not connected
Comments
-
0 comments. Created 2 months ago by adriaant.
Stopping master should stop epmd too (bug, hold)
    /usr/local/lib/erlang/erts-5.7.2/bin/epmd -daemon
This one caused me a headache as I wasn't able to restart the master. The epmd process was still running and prevented the master from starting up.
Comments
-
1 comment. Created 3 months ago by tuulos.
Failure in "Moving results to resultfs" is a fatal error (bug, hold)
If Glusterfs fails on a single node, it is detected only when results are being moved to resultfs, which leads to a fatal error. However, the error should be recoverable by executing the task on another node.
Comments
-
0 comments. Created 3 months ago by tuulos.
New parameter to limit the maximum number of cores used by a job (0.3, feature)
-
0 comments. Created about 1 month ago by tuulos.
Provide Mochiweb as a built-in default alternative to Lighttpd (0.3, feature)
Integrate MochiWeb as the default web server in Disco, to make setting up the system easier. Lighttpd should still be provided as a high-performance option (Lighty could be supervised by Disco).
Comments
-
0 comments. Created about 1 month ago by tuulos.
Sleep before re-trying a failed task (bug)
If a task fails due to e.g. a temporarily unavailable input, it should wait for some time before re-trying. Otherwise it can easily reach the maximum number of failed tasks in a few seconds.
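The requested behaviour amounts to a retry loop with a growing delay between attempts. The sketch below shows one common variant, exponential backoff with a cap, in Python. It is not how the Disco master schedules retries; the function name, the default limits and the injectable sleep argument are assumptions made for the example.

```python
import time


def run_with_backoff(task, max_failures=3, base_delay=1.0, max_delay=60.0,
                     sleep=time.sleep):
    """Call task() until it succeeds, sleeping longer after each IOError.

    The delay doubles after every failure (base_delay, 2*base_delay, ...)
    up to max_delay. The last error is re-raised once max_failures is hit.
    """
    failures = 0
    while True:
        try:
            return task()
        except IOError:
            failures += 1
            if failures >= max_failures:
                raise
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
```

With the defaults, a task whose input is unavailable for a couple of seconds waits 1 s and then 2 s before its third attempt, instead of exhausting its failure budget almost immediately.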
Comments
-
0 comments. Created about 1 month ago by tuulos.
Job records should be persistent in Disco master (0.3, feature)
Master should store job records persistently, so that it can be restarted without losing any records.
Comments
-
The bin/disco script could include a troubleshoot mode which executes steps at
http://discoproject.org/doc/start/troubleshoot.html
automatically and suggests how to fix any issues it detects.
Comments
-
disco-worker catches exceptions with the following code:
    try:
        run(method, mode, part, m)
    except comm.CommException, x:
        util.data_err("HTTP error: %s" % x)
    except IOError, x:
        util.data_err("IO error: %s" % x)
The calls to util.data_err fail because it requires 2 parameters.
Comments
-
0 comments. Created 26 days ago by tuulos.
Disco master can become overloaded if hundreds of jobs are running concurrently (bug)
If hundreds (>400) of jobs are running (and finishing) concurrently, gen_server:call(disco_server) calls can start to time out.
Comments











