<?xml version="1.0" encoding="UTF-8"?>
<commit>
  <added type="array">
    <added>
      <filename>doc/py/core.rst</filename>
    </added>
    <added>
      <filename>doc/py/func.rst</filename>
    </added>
    <added>
      <filename>doc/py/util.rst</filename>
    </added>
  </added>
  <modified type="array">
    <modified>
      <diff>@@ -102,7 +102,7 @@ exactly *num-pairs* consequent key-value pairs as defined above.
 
 Inputs for the external map are read using the provided *map_reader*. The
 map reader may produce each input entry as a single string (like the
-default :func:`disco.map_line_reader` does) that is used as the value
+default :func:`disco.func.map_line_reader` does) that is used as the value
 in a key-value pair where the key is an empty string. Alternatively,
 the reader may return a pair of strings as a tuple, in which case both
 the key and the value are specified.
@@ -149,7 +149,7 @@ Parameters
 ''''''''''
 
 Any parameters for the external program must be specified in the
-*ext_params* parameter for :func:`disco.job`. If *ext_params* is specified
+*ext_params* parameter for :func:`disco.core.Job`. If *ext_params* is specified
 as a string, Disco will provide it as is for the external program in the
 standard input, before any key-value pairs. It is on the responsibility
 of the external program to read all bytes that belong to the parameter set
@@ -157,7 +157,7 @@ before starting to receive key-value pairs.
 
 As a special case, the standard C interface for Disco, as specified
 below, accepts a dictionary of string-string pairs as *ext_params*. The
-dictionary is then encoded by :func:`disco.job` using the *netstring*
+dictionary is then encoded by :func:`disco.core.Job` using the *netstring*
 module. The *netstring* format is extremely simple, consisting of consequent
 key-value pairs. An example how to parse parameters in this case can be
 found in the :cfunc:`read_parameters` function in *ext/disco.c*.
@@ -182,7 +182,7 @@ included in its supporting files. It must not write to any files on its
 host, to ensure integrity of the runtime environment.
 
 An external map or reduce task is specified by giving a dictionary, instead of a
-function, as the *fun_map* or *reduce* parameter in :func:`disco.job`. The
+function, as the *fun_map* or *reduce* parameter in :func:`disco.core.Job`. The
 dictionary contains at least a single key-value pair where key is the string
 *&quot;op&quot;* and the value the actual executable code. Here's an example::
 
@@ -196,9 +196,9 @@ file names (not paths) of the supporting files, such as *&quot;config.txt&quot;*
 above. The corresponding values must contain the contents of the
 supporting files as strings.
 
-A convenience function :func:`disco.external` is provided for constructing the
+A convenience function :func:`disco.util.external` is provided for constructing the
 dictionary that specifies an external task. Here's the same example as above but
-using :func:`disco.external`::
+using :func:`disco.util.external`::
 
         disco.job(&quot;disco://localhost:5000&quot;,
                   [&quot;disco://localhost/myjob/file1&quot;],</diff>
      <filename>doc/external.rst</filename>
    </modified>
    <modified>
      <diff>@@ -5,14 +5,14 @@ Glossary
 .. glossary::
 
    disco master
-        Master process that takes care of receiving Disco jobs, scheduling them 
-        and distributing tasks to the cluster. There may be many Disco masters
-        running in parallel, as long as they manage separate sets of resources
-        (CPUs).
+        Master process that takes care of receiving Disco jobs,
+        scheduling them and distributing tasks to the cluster. There
+        may be many Disco masters running in parallel, as long as they
+        manage separate sets of resources (CPUs).
    job
-        A sequence of the map :term:`task` and the reduce
-        :term:`task`. Made with a single call to the :func:`disco.job`
-        function.
+        A sequence of the map :term:`task` and the
+        reduce :term:`task`. Started by calling the
+        :meth:`disco.core.Disco.new_job` method.
 
    job functions
         Job functions are the functions that the user can specify in</diff>
      <filename>doc/glossary.rst</filename>
    </modified>
    <modified>
      <diff>@@ -36,59 +36,4 @@ raised if it is likely that the occurred error is temporary.
 Typically this function is used by map readers to signal a temporary failure
 in accessing an input file.
 
-.. function:: parse_dir(dir_url)
-
-Parses a directory URL, such as ``dir://nx02/test_simple@12243344`` to
-a list of direct URLs. In contrast to other functions in this module,
-this function is not used by the job functions, but might be useful for
-other programs that need to parse results returned by :func:`disco.job`,
-for instance.
-
-.. function:: netstr_reader(fd, size, fname)
-
-A map reader for Disco's internal key-value format. This reader can be
-used to read results produced by map and reduce functions. An alias for
-:func:`disco.chain_reader`.
-
-.. function:: re_reader(regexp, fd, size, fname[, output_tail])
-
-A map reader that uses an arbitrary regular expression to parse the input
-stream. The desired regular expression is specified in *regexp*. The reader
-works as follows:
-
- 1. X bytes is read from *fd* and appended to an internal buffer *buf*.
- 2. ``m = regexp.match(buf)`` is executed. 
- 3. If *buf* produces a match, ``m.groups()`` is yielded, which contains an
-    input entry for the map function. Step 2. is executed for the remaining
-    part of *buf*. If no match is made, go to step 1. 
- 4. If *fd* is exhausted before *size* bytes have been read, a data error is
-    raised, unless *size* is not specified.
- 5. When *fd* is exhausted but *buf* contains unmatched bytes, two modes are
-    available: If *output_tail = True*, the remaining *buf* is yielded as is.
-    Otherwise, which is the default case, a message is sent that warns about
-    trailing bytes and the remaining *buf* is discarded.
-
-Note that :func:`disco_worker.re_reader` fails if the input streams contains
-unmatched bytes between matched entries. Make sure that your *regexp* is
-constructed so that it covers all the bytes in the input stream.
-
-:func:`disco_worker.re_reader` provides an easy way to construct parsers for
-textual input streams. For instance, the following reader produces full HTML 
-documents as input entries::
-
-        def html_reader(fd, size, fname):
-                for x in re_reader(&quot;&lt;HTML&gt;(.*?)&lt;/HTML&gt;&quot;, fd, size, fname):
-                        yield x[0]
-
-
-The default :func:`disco.map_line_reader` is defined as follows::
-
-        def map_line_reader(fd, sze, fname):
-                for x in re_reader(&quot;(.*?)\n&quot;, fd, sze, fname, output_tail = True):
-                        yield x[0]
-
-Note that since *output_tail = True* in :func:`disco.map_line_reader`, an input
-file that lacks the final newline character is silently accepted.
-
-
 </diff>
      <filename>doc/py/disco_worker.rst</filename>
    </modified>
    <modified>
      <diff>@@ -14,12 +14,13 @@ unnecessarily long when the code is run through the cluster. Also, when
 an existing job appears to be slow or faulty, one could benefit from a
 good profiler or debugger.
 
-:mod:`homedisco` makes development of Disco jobs as easy as ordinary
-Python programs. It creates a job request using :func:`disco.job`
-similarly to a normal Disco request, but instead of sending it to the
-master, it instantiates a local :mod:`disco_worker` and passes the request
-to it. This allows local execution of exactly the same map and reduce
-tasks as you would run in the distributed environment.
+:mod:`homedisco` makes development of Disco jobs as easy
+as ordinary Python programs. It creates a job request using
+:meth:`disco.core.Disco.new_job` similarly to a normal Disco request,
+but instead of sending it to the master, it instantiates a local
+:mod:`disco_worker` and passes the request to it. This allows local
+execution of exactly the same map and reduce tasks as you would run in
+the distributed environment.
 
 As a result, you can treat your job functions as a normal Python
 program and use standard Python debuggers and profilers to analyze the
@@ -43,7 +44,6 @@ can test your map and reduce functions independently from each other
 and focus on edit-run-debug cycle with one task without running the
 other. 
 
-
 :mod:`homedisco` tasks may read any inputs, remote or local, as any
 other Disco job. However, results from a task are always written to a
 new directory that is automatically created under the directory where
@@ -86,26 +86,28 @@ file either locally or from an external source, as any Disco job.
 We need two separate :class:`homedisco.HomeDisco` environments: One for
 running the map task, *map_hd*, and one for the reduce, *reduce_hd*. Using
 these environments, we can call :meth:`homedisco.HomeDisco.job` that
-works exactly like :func:`disco.job`. Outputs of the map task are given
-as inputs to the reduce task. In the end, we print out the results using
-:func:`disco.result_iterator`.
+works exactly like :meth:`disco.core.Disco.new_job`. Outputs of the map
+task are given as inputs to the reduce task. In the end, we print out
+the results using :func:`disco.core.result_iterator`.
 
 Since :meth:`homedisco.HomeDisco.job` runs only single instance of
 the given task, the map task accepts only one input, in contrast to
-:func:`disco.job` that can take several. Similarly, if you have several
-partitions (i.e. *nr_reduces* is larger than one), only one of them
-will be processed by the reduce task, as specified by the *partition*
-parameter in :class:`homedisco.HomeDisco`. However, the reduce task may take
-several inputs in which case only data belonging to the specified partition
-will be used from the files, as long as they are saved in the ``chunk://``
-format --- usually Disco handles this issue correctly by itself.
-
-Note that the format of result files that are produced by the map task
-depends whether the map is used alone or whether it is followed by reduce. Thus
-if you want to read outputs of the map task with :func:`disco.result_iterator`,
-you must not specify *reduce* in :meth:`homedisco.HomeDisco.job`. However, if
-your map task is followed by reduce, as in the above example, you should specify
-the parameter *reduce* as usual.
+:meth:`disco.core.Disco.new_job` that can take several. Similarly,
+if you have several partitions (i.e. *nr_reduces* is larger than one),
+only one of them will be processed by the reduce task, as specified by
+the *partition* parameter in :class:`homedisco.HomeDisco`. However, the
+reduce task may take several inputs in which case only data belonging to
+the specified partition will be used from the files, as long as they are
+saved in the ``chunk://`` format --- usually Disco handles this issue
+correctly by itself.
+
+Note that the format of result files that are produced by the map
+task depends on whether the map is used alone or whether it is followed
+by reduce. Thus if you want to read outputs of the map task with
+:func:`disco.core.result_iterator`, you must not specify *reduce* in
+:meth:`homedisco.HomeDisco.job`. However, if your map task is followed
+by reduce, as in the above example, you should specify the parameter
+*reduce* as usual.
 
 Module contents
 ---------------</diff>
      <filename>doc/py/homedisco.rst</filename>
    </modified>
    <modified>
      <diff>@@ -2,12 +2,30 @@
 Disco API Reference
 ===================
 
+Disco client
+------------
+
+.. toctree::
+   :maxdepth: 1
+   
+   core
+   func
+   util
+
+Runtime environment for Disco jobs
+----------------------------------
+
 .. toctree::
    :maxdepth: 1
 
-   disco
-   discoapi
    disco_worker
+
+Utilities
+---------
+
+.. toctree::
+   :maxdepth: 1
+
    homedisco
 
 </diff>
      <filename>doc/py/index.rst</filename>
    </modified>
    <modified>
      <diff>@@ -4,12 +4,23 @@
 Tutorial
 ========
 
-This tutorial shows how to create and run a Disco job that counts words in a
-large text file. To start with, you need nothing but a single large text file.
-Let's call the file ``bigfile.txt``.
+This tutorial shows how to create and run a Disco job that counts
+words in a large text file. To start with, you need nothing but a
+single large text file.  Let's call the file ``bigfile.txt``. If
+you don't happen to have a suitable file at hand, you can
+download one from `http://discoproject.org/bigfile.txt
+&lt;http://discoproject.org/bigfile.txt&gt;`_.
 
-Prepare input data
-------------------
+Some steps are executed slightly differently on a local cluster (or on
+a single machine) and Amazon EC2. In these cases, you can find separate
+instructions for the two environments. Follow the one that applies to
+your case. Note that if you run the steps on the master node of your
+EC2 cluster, in contrast to a remote machine that communicates with the
+master node, you can follow the instructions for local clusters.
+
+
+1. Prepare input data
+---------------------
 
 Disco can distribute computation only if data is distributed as well. Thus
 our first step is to split ``bigfile.txt`` into small chunks. There is a
@@ -17,8 +28,8 @@ standard Unix command, ``split``, that can split a file into many pieces,
 which is exactly what we want. We need also a directory where the chunks
 are stored.  Let's call it ``bigtxt``::
 
-        % mkdir bigtxt
-        % split -l 100000 bigfile.txt bigtxt/bigtxt-
+        mkdir bigtxt
+        split -l 100000 bigfile.txt bigtxt/bigtxt-
 
 After running these lines, the directory ``bigtxt`` contains many files, named
 like ``bigtxt-aa``, ``bigtxt-ab`` etc. which each contain 100,000 lines (except
@@ -29,35 +40,67 @@ size smaller. The more chunks you have, the more processes you can run in
 parallel. However, since launching a new process is not free, you shouldn't make
 the chunks too small.
 
-Distribute chunks to the cluster
---------------------------------
+2. Distribute chunks to the cluster
+-----------------------------------
 
 In theory you could use the chunks in the ``bigtxt`` directory
 directly, but in practice it is a good idea to distribute the IO load
 to many separate servers.  Disco provides a utility script called
 ``distrfiles.py`` that distributes files from a directory to the cluster.
 
-Run it as follows::
+The script requires that the environment variable ``DISCO_ROOT``, which
+is the home directory of Disco, is specified. It is usually defined in
+``/etc/disco/disco.conf``, so you can export it to your shell by saying::
+
+        source /etc/disco/disco.conf
+
+By default, ``DISCO_ROOT=/srv/disco/``. The script will copy files to the
+``$DISCO_ROOT/data/bigtxt`` directory.
+
+Local cluster
+'''''''''''''
 
-        % python disco/util/distrfiles.py bigtxt /etc/nodes &gt; bigtxt.chunks
+Run the script as follows::
+
+        python disco/util/distrfiles.py bigtxt /etc/nodes &gt; bigtxt.chunks
 
 Here ``bigtxt`` refers to the directory that contains the files and
-``/etc/nodes`` is a file that lists available nodes in the cluster. The
-script copies files to the nodes randomly, and outputs location of
-each file to the standard output, which we capture here to the file
-``bigtxt.chunks``. Take a look at that file to get an idea how inputs
-are specified for Disco.
+``/etc/nodes`` is a file that lists available nodes in the cluster, one
+hostname per line. The script copies files to the nodes randomly, and
+outputs location of each file to the standard output, which we capture
+here to the file ``bigtxt.chunks``. Take a look at that file to get an
+idea how inputs are specified for Disco.
 
 If you want to repeat this command multiple times, e.g. for a new set of
 chunks, you need to run::
 
-        % REMOVE_FIRST=1 python disco/util/distrfiles.py bigtxt /etc/nodes &gt; bigtxt.chunks
+        REMOVE_FIRST=1 python disco/util/distrfiles.py bigtxt /etc/nodes &gt; bigtxt.chunks
 
 which first removes the target directory on each node before copying
 the chunks. 
 
-Write job functions
--------------------
+Amazon EC2
+''''''''''
+
+With Amazon EC2, we need to specify the ssh-key to the script, using the
+``SSH_KEY`` environment variable, so it can copy files to the nodes. Since
+the EC2's ssh-key is specific to the root user, we also need to set the
+user to root with the ``SSH_USER`` variable. The ``DISCO_ROOT`` variable should
+be set to its default value, ``/srv/disco``.
+
+Run the script as follows::
+        
+        DISCO_ROOT=/srv/disco SSH_KEY=your-key-file SSH_USER=root python disco/util/distrfiles.py bigtxt ec2-nodes &gt; bigtxt.chunks
+
+Here ``your-key-file`` should be the same keypair file that was
+used in :ref:`ec2setup`. The node list ``ec2-nodes`` is produced by the
+``setup-instances.py`` script. Similarly to the local clusters, you can
+use the ``REMOVE_FIRST`` flag if you run the script many times with the
+same dataset.
+
+
+3. Write job functions
+----------------------
 
 Next we need to write map and reduce functions to count the words in
 the chunks.
@@ -70,9 +113,9 @@ Start your favorite text editor and open a file called, say,
 
 Quite compact, eh? The map function always takes two parameters, here they
 are called *e* and *params*. The first parameter contains an input entry,
-which is by default a line of input. An input entry can be anything,
-as you can define a custom function that extracts them from an input
-stream --- see the parameter *map_reader* in :func:`disco.job` for more
+which is by default a line of input. An input entry can be anything, as
+you can define a custom function that extracts them from an input stream
+--- see the parameter *map_reader* in :func:`disco.core.Job` for more
 information. The second parameter, *params*, can be any object that you
 specify, in case that you need some additional input for your functions.
 
@@ -100,7 +143,7 @@ the map function, which belong to this reduce instance or partition.
 
 In this case, different words are randomly assigned to different reduce
 instances. Again, this is something that can be changed --- see the
-parameter *partition* in :func:`disco.job` for more information. However,
+parameter *partition* in :func:`disco.core.Job` for more information. However,
 as long as all occurrences of the same word go to the same reduce,
 we can be sure that the final counts are correct.
 
@@ -114,72 +157,110 @@ The third parameter *params* contains the same additional input as in
 the map function.
 
 That's it. Now we have written map and reduce functions for counting
-words in parallel!
+words in parallel.
 
-Run job
--------
+4. Run the job
+--------------
 
-Now the only thing missing is a command for running the job. In Disco,
-a single function call is needed to start a new job, which is aptly
-called :func:`disco.job`. There's a large number of parameters that you can
-use to tune your job but luckily only the first three of them are required.
+Now the only thing missing is a command for running the job. First,
+we establish a connection to the Disco master by instantiating a
+:class:`disco.core.Disco` object. After that, we can start the job by
+calling :meth:`disco.core.Disco.new_job`. There's a large number of
+parameters that you can use to specify your job but only three of them
+are required for a simple job like ours.
 
 In addition to starting the job, we want to print out the results as well.
-A function called :func:`disco.result_iterator` takes a list of addresses to
-the result files, that is returned by the :func:`disco.job` call, and iterates
-through all key-value pairs in the results.
+First, however, we have to wait until the job has finished. This is done with
+the :meth:`disco.core.Disco.wait` call, which returns the results of the job once
+it has finished. For convenience, the :meth:`disco.core.Disco.wait` method,
+as well as other methods related to a job, can be called through the
+:class:`disco.core.Job` object that is returned by :meth:`disco.core.Disco.new_job`.
+
+A function called :func:`disco.core.result_iterator` takes
+a list of addresses to the result files, that is returned by
+:meth:`disco.core.Disco.wait`, and iterates through all key-value pairs
+in the results.
 
 The following lines run the job and print out the results. Write them to the end
 of your file::
 
-        import disco, sys
-        results = disco.job(sys.argv[1], &quot;disco_tut&quot;, sys.argv[2:], fun_map, reduce = fun_reduce)
-        for word, total in disco.result_iterator(results):
+        import sys
+        from disco.core import Disco, result_iterator
+        
+        results = Disco(sys.argv[1]).new_job(
+                name = &quot;disco_tut&quot;,
+                input = sys.argv[2:],
+                map = fun_map,
+                reduce = fun_reduce).wait()
+        
+        for word, total in result_iterator(results):
                 print word, total
 
-Here we read the address of the Disco master and the input files from the
-command line. The map function is given as the third parameter, *fun_map*, and
-the reduce function as the keyword parameter *reduce = fun_reduce* for
-:func:`disco.job`.
+Here we read the address of the Disco master and the input files from
+the command line. Note how the map and reduce functions are provided to
+:meth:`disco.core.Disco.new_job` simply as normal keyword arguments *map*
+and *reduce*.
+
+Now comes the moment of truth. 
+
+Local cluster
+'''''''''''''
+        
+Run the script as follows::
+
+        python count_words.py disco://localhost `cat bigtxt.chunks` &gt; bigtxt.results
 
-Now comes the moment of truth. Run the script as follows::
+If you run the Disco master on a non-standard port, replace
+``disco://localhost`` with the correct address to the
+master. Alternatively, you can specify the ``DISCO_MASTER_PORT``
+environment variable, which specifies the port to the master.
 
-        % export PYTHONPATH=disco/py
-        % python count_words.py disco://localhost:5000 `cat bigtxt.chunks` &gt; bigtxt.results
+Amazon EC2
+''''''''''
+
+In contrast to a local cluster, :func:`disco.core.result_iterator`
+can't fetch the results directly from the EC2 nodes. For this reason, we must
+use the master node as a proxy. 
+
+Run the script as follows::
+        
+        DISCO_PROXY=disco://localhost python count_words.py disco://localhost `cat bigtxt.chunks` &gt; bigtxt.results
+
+Here we assume that there's an SSH tunnel from your local machine to the
+EC2 master, as started automatically by the ``setup-instances.py`` script.
+
+----
 
 If everything goes well, the script pauses for some time while the
 job executes. The inputs are read from the file ``bigtxt.chunks``
 which was created earlier. Finally the outputs are written to
 ``bigtxt.results``.  While the job is running, you can point your web
-browser at ``http://localhost:5000`` which lets you follow the progress
-of your job in real-time.
+browser at ``http://localhost:8989`` (or some other port where you run the
+Disco master) which lets you follow the progress of your job in real-time.
 
-Note that in your case the Disco master, specified here by
-``disco://localhost:5000``, might be running on a different address. If you
-can't find Disco at ``http://localhost:5000`` in your browser, consult
-your nearest sysadmin for the correct settings.
 
-Conclusion
+What next?
 ----------
 
 As you saw, creating a new Disco job is pretty straightforward. Next you could
 write functions for a bit more complex job, which could, for instance, count
 only words that are provided as a parameter to the map function.
 
-It is highly recommended that you take a look in :mod:`homedisco`. It
-is a simple replacement for :func:`disco.job` that lets you to debug,
+It is highly recommended that you take a look at :mod:`homedisco`. It is
+a simple replacement for :func:`disco.core.Job` that lets you debug,
 profile and test your Disco functions on your local machine, instead of
 running them in the cluster. It is an invaluable tool when developing
 new programs for Disco.
 
 You can also experiment with providing custom partitioning and reader
 functions. They are written in the same way as map and reduce functions.
-Just see some examples in the :mod:`disco` module. After that, you could
-try to chain many map/reduce jobs together, so that outputs of the previous
-job are used as the inputs for the next one --- in that case you need
-to use :func:`disco.chain_reader`.
+Just see some examples in the :mod:`disco.func` module. After that,
+you could try to chain many map/reduce jobs together, so that outputs
+of the previous job are used as the inputs for the next one --- in that
+case you need to use :func:`disco.func.chain_reader`.
 
 The best way to learn is to pick a problem or algorithm that you know
 well, and implement it with Disco. After all, Disco was designed to
 be as simple as possible so you can concentrate on your own problems,
 not on the framework.
+</diff>
      <filename>doc/start/tutorial.rst</filename>
    </modified>
  </modified>
  <removed type="array">
    <removed>
      <filename>doc/py/disco.rst</filename>
    </removed>
    <removed>
      <filename>doc/py/discoapi.rst</filename>
    </removed>
  </removed>
  <parents type="array">
    <parent>
      <id>fbd564da800431fbb5221f45a6359f14badaf534</id>
    </parent>
  </parents>
  <author>
    <name>Ville Tuulos</name>
    <email>tuulos@nxfront.nokiapaloalto.com</email>
  </author>
  <url>http://github.com/tuulos/disco/commit/6c38816feeb39f229bdf14110cdc1366936a13d3</url>
  <id>6c38816feeb39f229bdf14110cdc1366936a13d3</id>
  <committed-date>2008-09-01T21:26:21-07:00</committed-date>
  <authored-date>2008-09-01T21:26:21-07:00</authored-date>
  <message>Updated documentation for the 0.1 release</message>
  <tree>bea822bff7409fbe72d19d4491d1df671cda1ba4</tree>
  <committer>
    <name>Ville Tuulos</name>
    <email>tuulos@nxfront.nokiapaloalto.com</email>
  </committer>
</commit>
