Potential feature request (via cloud-haskell) #3

Open
hyperthunk opened this issue Feb 10, 2017 · 0 comments

Comments

@hyperthunk

[Imported from JIRA. Reported by davidsd (@davidsd) as CH-2 on 2012-08-17 18:10:24]
Hello,

I'm primarily interested in using Cloud Haskell for running computations on clusters that use job schedulers like Platform LSF and Oracle Grid Engine. The typical pattern is that I submit an array of N jobs to the scheduler, and the scheduler decides which machines to run them on, and at what time. As a user, some key features of this setup are:

Different processes in the job array are started at different times. This is typically because there are other users on the cluster, and the scheduler uses priorities and queues to determine what should be run when.

The number of processes running at any given time is almost always less than N. The simplest example: if I schedule 1000 jobs on a cluster with 100 machines, some of the jobs obviously have to run in sequence. More commonly, the cluster is busy and my jobs get interleaved with jobs from other users, reducing the effective number of available machines.

I have no control over which machines processes are started on. There's also no way to know which machine a process will be started on before the process actually starts.

Individual processes may be killed or suspended at any time. This is most common when they happen to be running on a machine for which another user has higher priority (enough to kick me off).

I'm wondering what would be involved in writing a Cloud Haskell backend for this type of environment. I've written some ad-hoc programs to deal with this sort of thing before. A typical situation: I have a function f that is very expensive to compute, and I would like to farm different calls to it out to different machines. The model I've used is roughly the following (a worker-side sketch appears after the list):

A single master which decides for which x's to compute f(x).
A bunch of workers, each equipped to compute f(x).
A process determines its own master/slave status based on its index in the job array (an environment variable). Job index 1 is the master, the rest are slaves.
When a slave starts up, it uses the job scheduler to find the IP address of the master, and sends a "ready" message to the master.
The master keeps a queue of available slaves which is updated whenever a "ready" message arrives, or whenever the result of a computation arrives.
The master also keeps a list of running slaves and what computations they're performing.
If a slave dies, it's discarded from the list of running slaves, and its computation is sent to the next available slave.
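
To make that concrete, here is a rough sketch of the worker side as I imagine it could look on top of the distributed-process API. The `Ready`/`Work`/`Result` message types, the environment-variable lookup, and the function `f` are all illustrative assumptions on my part, not part of any existing backend:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Control.Distributed.Process
import Control.Monad (forever)
import Data.Binary (Binary)
import GHC.Generics (Generic)
import System.Environment (lookupEnv)

-- Hypothetical message types between the master and its workers.
newtype Ready  = Ready ProcessId         deriving (Generic)
data    Work   = Work ProcessId Double   deriving (Generic) -- reply-to pid and x
data    Result = Result ProcessId Double deriving (Generic) -- worker pid and f(x)

instance Binary Ready
instance Binary Work
instance Binary Result

-- The scheduler exposes the job-array index as an environment variable
-- (LSF: LSB_JOBINDEX, Grid Engine: SGE_TASK_ID); index 1 plays the master.
jobIndex :: IO Int
jobIndex = maybe 1 read <$> lookupEnv "LSB_JOBINDEX"

-- Worker: announce readiness to the master, then serve Work messages.
-- How the worker learns the master's ProcessId is left open here; in
-- practice it could resolve a registered name on a NodeId advertised
-- via the shared filesystem (e.g. with whereisRemoteAsync).
worker :: ProcessId -> (Double -> Double) -> Process ()
worker masterPid f = do
  self <- getSelfPid
  send masterPid (Ready self)
  forever $ do
    Work replyTo x <- expect
    send replyTo (Result self (f x))
```
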
This is all to cope with the (somewhat frustrating) fact that the "cloud" is dynamic, and many of its properties are only known at runtime. The number of available slaves can grow or shrink during the course of the computation. From what I've read, it looks like Cloud Haskell prefers to assume that the size and topology of the cloud are static. Is this necessary? Any recommendations on writing a backend for the environment above?
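
For what it's worth, the master loop in my model doesn't seem to need a static view of the cloud at all. Something like the following sketch (reusing the `Ready`/`Work`/`Result` types from above; again this is just my assumption of how it could look, not existing backend code) handles workers that appear late or disappear mid-computation by monitoring them and requeueing lost work:

```haskell
import Control.Distributed.Process
import qualified Data.Map.Strict as Map

-- Master: hand each x to the next idle worker, monitor it, and requeue
-- the x if the worker dies (e.g. gets preempted) before replying.
-- Reuses the Ready/Work/Result types from the previous sketch.
master :: [Double] -> Process [Double]
master xs0 = do
  self <- getSelfPid
  go self xs0 [] Map.empty []
  where
    go self pending idle inFlight results
      -- nothing queued and nothing outstanding: we are done
      | null pending && Map.null inFlight = pure results
      -- work and an idle worker both available: dispatch
      | (x:xs) <- pending, (w:ws) <- idle = do
          _ <- monitor w
          send w (Work self x)
          go self xs ws (Map.insert w x inFlight) results
      -- otherwise wait for a worker to appear, reply, or die
      | otherwise = receiveWait
          [ match $ \(Ready w) ->
              go self pending (w : idle) inFlight results
          , match $ \(Result w y) ->
              go self pending (w : idle) (Map.delete w inFlight) (y : results)
          , match $ \(ProcessMonitorNotification _ w _) ->
              go self (maybe [] (:[]) (Map.lookup w inFlight) ++ pending)
                      (filter (/= w) idle)
                      (Map.delete w inFlight)
                      results
          ]
```

A real implementation would of course also need to tell idle workers to shut down when the work runs out, and to handle the master itself being killed, but hopefully this shows the shape of what I'm after.
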
