Tweak comments and docs (#4)
* Some notes on wording

* Reword documentation
tkf committed Feb 4, 2021
1 parent f8cd6cb commit e4ca321
Showing 5 changed files with 71 additions and 52 deletions.
20 changes: 10 additions & 10 deletions src/docs/DepthFirstEx.md
@@ -15,18 +15,12 @@ julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, DepthFirstEx())

# Extended help

`DepthFirstEx` schedules chunks of size roughly equal to `basesize` in the
order that each chunk appears in the input collection. However, the base case
computation does not wait for all the tasks to be scheduled. This approach
performs better than a more naive approach where all the tasks are scheduled
at once before the reduction starts. `DepthFirstEx` is useful for reductions
that can terminate early (e.g., `findfirst`, `@floop` with `break`).
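
A sketch of such an early-terminating reduction (assuming, as with the
`Folds.sum` call above, that the executor can be passed as the last
argument):

```julia
julia> Folds.findfirst(x -> gcd(x, 42) == 42, 1:1000_000, DepthFirstEx())
42
```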

## More examples

@@ -41,3 +35,9 @@ julia> @floop DepthFirstEx() for x in 1:1000_000
acc
4642844
```

## Keyword Arguments
- `basesize`: The size of the base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
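
For instance (illustrative values; the defaults are often fine), both options
can be passed to the constructor:

```julia
julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, DepthFirstEx(basesize = 10_000, simd = true))
4642844
```
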
25 changes: 23 additions & 2 deletions src/docs/NondeterministicEx.md
@@ -1,7 +1,7 @@
NondeterministicEx(; [simd,] [basesize,] [ntasks,])

Pipelined batched reduction for parallelizing computations with
non-parallelizable collections (e.g., `Channel`, `Iterators.Stateful`).

This is a simple wrapper of
[`NondeterministicThreading`](https://juliafolds.github.io/Transducers.jl/dev/reference/manual/#Transducers.NondeterministicThreading)
@@ -20,6 +20,8 @@ julia> Folds.sum(partially_parallelizable(Iterators.Stateful(1:100)), Nondetermi
234462500
```

# Extended help

In the above example, we can run `gcd(y, 42)` (mapping), `for y in 1:10000x`
(flattening), and `+` for `sum` (reduction) in parallel even though the
iteration of `Iterators.Stateful(1:100)` is not parallelizable. Note that, as
@@ -42,3 +44,22 @@ julia> @floop NondeterministicEx() for x in Iterators.Stateful(1:100)
acc
234462500
```
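
The same pattern works for other non-parallelizable sources such as a
`Channel` (a sketch; the channel below just feeds the numbers `1:100` and is
not part of the original example):

```julia
julia> ch = Channel{Int}(100) do ch
           foreach(x -> put!(ch, x), 1:100)
       end;

julia> Folds.sum(x -> sum(y -> gcd(y, 42), 1:10000x), ch, NondeterministicEx())
234462500
```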

## Notes

"Nondeterministic" in the name indicates that the result of the reduction is
not deterministic (i.e., schedule-dependent) _if_ the reducing function is
only approximately associative (e.g., `+` on floats). For computations (e.g.,
`Folds.collect`) that uses strictly associative operations (e.g., "`vcat`"),
the result does not depend on the particular scheduling decision of Julia
runtime. To be more specific, this executor uses the scheduling that does not
produce deterministic divide-and-conquer "task" graph. Instead, the shape of
the graph is determined by the particular timing of each computation at
run-time.
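
For example (a sketch; whether two runs actually differ depends on run-time
scheduling), two floating-point sums over the same data are only guaranteed
to agree approximately, because `+` on floats is not exactly associative:

```julia
julia> a = Folds.sum(inv, Iterators.Stateful(1:100_000), NondeterministicEx());

julia> b = Folds.sum(inv, Iterators.Stateful(1:100_000), NondeterministicEx());

julia> a ≈ b  # `a == b` may or may not hold, depending on how batches were formed
true
```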

## Keyword Arguments
- `basesize`: The size of the base case.
- `ntasks`: The number of tasks used by this executor.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
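
For instance (illustrative values; the best settings depend on the workload),
the batch size and the number of pipeline tasks can be set at construction
time:

```julia
julia> ex = NondeterministicEx(basesize = 1, ntasks = 4);

julia> Folds.sum(x -> sum(y -> gcd(y, 42), 1:10000x), Iterators.Stateful(1:100), ex)
234462500
```
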
17 changes: 9 additions & 8 deletions src/docs/TaskPoolEx.md
@@ -36,14 +36,6 @@ assigning a task to a dedicated thread is stolen from ThreadPools.jl.
argument so that your library function can be used with any executor
including `TaskPoolEx`.

## More examples

```julia
@@ -57,3 +49,12 @@ julia> @floop TaskPoolEx() for x in 1:1000_000
acc
4642844
```

## Keyword Arguments
- `background = false`: If `background == true`, do not run tasks on
`threadid() == 1`.
- `ntasks`: The number of tasks to be used.
- `basesize`: The size of the base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
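
A sketch of keeping the primary thread free for other work (assuming Julia
was started with more than one thread; otherwise there is no thread left to
run the tasks):

```julia
julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, TaskPoolEx(background = true))
4642844
```
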
49 changes: 17 additions & 32 deletions src/docs/WorkStealingEx.md
@@ -1,7 +1,6 @@
WorkStealingEx(; [simd,] [basesize])

Work-stealing scheduling for parallel execution. Useful for load-balancing.

# Examples

@@ -16,27 +15,17 @@ julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, WorkStealingEx())
# Extended help

`WorkStealingEx` implements [work stealing
scheduler](https://en.wikipedia.org/wiki/Work_stealing) (in particular,
[continuation
stealing](https://en.wikipedia.org/wiki/Work_stealing#Child_stealing_vs._continuation_stealing))
for Transducers.jl and other JuliaFolds/*.jl packages. Worker tasks are
cached and re-used so that the number of Julia `Task`s used for a reduction
can be much smaller than `input_length ÷ basesize`. This has a positive
impact on computations that require load-balancing since it does not incur
the overhead of spawning tasks.

**NOTE:** `WorkStealingEx` is more complex and experimental than the default
multi-thread executor `ThreadedEx`.

## More examples

@@ -52,12 +41,8 @@ julia> @floop WorkStealingEx() for x in 1:1000_000
4642844
```
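
A sketch of the kind of load-imbalanced workload this executor targets
(`fib` is a deliberately slow helper defined here only for illustration, so
later inputs cost far more than earlier ones):

```julia
julia> fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2);

julia> Folds.sum(fib, 1:30, WorkStealingEx(basesize = 1))
2178308
```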

## Keyword Arguments
- `basesize`: The size of the base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
12 changes: 12 additions & 0 deletions src/trampoline.jl
@@ -1,3 +1,15 @@
# This is not the usual style of trampoline (thunk-returning functions) and
# maybe there's a better name (pop-call-append?). We do not transform the
# recursion fully to a loop for the work-stealing scheduler. Until the first
# base case (initial descent), the recursion is used as usual (so it still
# consumes O(log2 n) stack space). The required amount of stack space is
# comparable to other divide-and-conquer implementations. The idea/assumption
# is that it _might_ be a good idea to keep the recursion so that the compiler
# can infer types in many cases (TODO: check this). Once the call stack is
# constructed by the recursion, the continuations in the chain (cactus stack)
# are evaluated in a loop ("trampoline"). That said, it'd be interesting to
# see if the standard trampoline has some performance/implementation
# advantages.

and_finally(@nospecialize f) = listof(Function, f)

before(@nospecialize(f), chain::List{Function}) = Cons{Function}(f, chain)
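
For contrast, a minimal sketch of the "usual" thunk-returning trampoline
mentioned in the comment above (illustrative only; `trampoline`, `is_even`,
and `is_odd` are not part of this file):

```julia
# Each step either returns a final value or a zero-argument thunk for the
# next step; the driver loop keeps calling thunks, so no call stack grows.
function trampoline(step)
    while step isa Function
        step = step()
    end
    return step
end

is_even(n) = n == 0 ? true : () -> is_odd(n - 1)
is_odd(n) = n == 0 ? false : () -> is_even(n - 1)

trampoline(is_even(1_000_000))  # == true, without deep recursion
```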