Tweak comments and docs (#4)
* Some notes on wording

* Reword documentation
tkf committed Feb 4, 2021
1 parent f8cd6cb commit e4ca321
Showing 5 changed files with 71 additions and 52 deletions.
20 changes: 10 additions & 10 deletions src/docs/DepthFirstEx.md
@@ -15,18 +15,12 @@ julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, DepthFirstEx())

# Extended help

`DepthFirstEx` schedules chunks of size roughly equal to `basesize` in the
order that each chunk appears in the input collection. However, the base case
computation does not wait for all the tasks to be scheduled. This approach
performs better than a more naive approach where all the tasks are scheduled
at once before the reduction starts. `DepthFirstEx` is useful for reductions
that can terminate early (e.g., `findfirst`, `@floop` with `break`).
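
A sketch of such an early-terminating reduction (assuming, as with the
`Folds.sum` call above, that the executor can be passed as the last
argument):

```julia
julia> Folds.findfirst(x -> gcd(x, 42) == 42, 1:1000_000, DepthFirstEx())
42
```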

## More examples

@@ -41,3 +35,9 @@ julia> @floop DepthFirstEx() for x in 1:1000_000
acc
4642844
```

## Keyword Arguments
- `basesize`: The size of the base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
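
For instance (illustrative values; the defaults are often fine), both options
can be passed to the constructor:

```julia
julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, DepthFirstEx(basesize = 10_000, simd = true))
4642844
```
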
25 changes: 23 additions & 2 deletions src/docs/NondeterministicEx.md
@@ -1,7 +1,7 @@
NondeterministicEx(; [simd,] [basesize,] [ntasks,])

Pipelined batched reduction for parallelizing computations with
non-parallelizable collections (e.g., `Channel`, `Iterators.Stateful`).

This is a simple wrapper of
[`NondeterministicThreading`](https://juliafolds.github.io/Transducers.jl/dev/reference/manual/#Transducers.NondeterministicThreading)
@@ -20,6 +20,8 @@ julia> Folds.sum(partially_parallelizable(Iterators.Stateful(1:100)), Nondetermi
234462500
```

# Extended help

In the above example, we can run `gcd(y, 42)` (mapping), `for y in 1:10000x`
(flattening), and `+` for `sum` (reduction) in parallel even though the
iteration of `Iterators.Stateful(1:100)` is not parallelizable. Note that, as
@@ -42,3 +44,22 @@ julia> @floop NondeterministicEx() for x in Iterators.Stateful(1:100)
acc
234462500
```
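
The same pattern works for other non-parallelizable sources such as a
`Channel` (a sketch; the channel below just feeds the numbers `1:100` and is
not part of the original example):

```julia
julia> ch = Channel{Int}(100) do ch
           foreach(x -> put!(ch, x), 1:100)
       end;

julia> Folds.sum(x -> sum(y -> gcd(y, 42), 1:10000x), ch, NondeterministicEx())
234462500
```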

## Notes

"Nondeterministic" in the name indicates that the result of the reduction is
not deterministic (i.e., schedule-dependent) _if_ the reducing function is
only approximately associative (e.g., `+` on floats). For computations (e.g.,
`Folds.collect`) that uses strictly associative operations (e.g., "`vcat`"),
the result does not depend on the particular scheduling decision of Julia
runtime. To be more specific, this executor uses the scheduling that does not
produce deterministic divide-and-conquer "task" graph. Instead, the shape of
the graph is determined by the particular timing of each computation at
run-time.
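
For example (a sketch; whether two runs actually differ depends on run-time
scheduling), two floating-point sums over the same data are only guaranteed
to agree approximately, because `+` on floats is not exactly associative:

```julia
julia> a = Folds.sum(inv, Iterators.Stateful(1:100_000), NondeterministicEx());

julia> b = Folds.sum(inv, Iterators.Stateful(1:100_000), NondeterministicEx());

julia> a ≈ b  # `a == b` may or may not hold, depending on how batches were formed
true
```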

## Keyword Arguments
- `basesize`: The size of the base case.
- `ntasks`: The number of tasks used by this executor.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
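
For instance (illustrative values; the best settings depend on the workload),
the batch size and the number of pipeline tasks can be set at construction
time:

```julia
julia> ex = NondeterministicEx(basesize = 1, ntasks = 4);

julia> Folds.sum(x -> sum(y -> gcd(y, 42), 1:10000x), Iterators.Stateful(1:100), ex)
234462500
```
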
17 changes: 9 additions & 8 deletions src/docs/TaskPoolEx.md
@@ -36,14 +36,6 @@ assigning a task to a dedicated thread is stolen from ThreadPools.jl.
argument so that your library function can be used with any executor
including `TaskPoolEx`.

## More examples

```julia
@@ -57,3 +49,12 @@ julia> @floop TaskPoolEx() for x in 1:1000_000
acc
4642844
```

## Keyword Arguments
- `background = false`: If `background == true`, do not run tasks on
`threadid() == 1`.
- `ntasks`: The number of tasks to be used.
- `basesize`: The size of the base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
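
A sketch of keeping the primary thread free for other work (assuming Julia
was started with more than one thread; otherwise there is no thread left to
run the tasks):

```julia
julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, TaskPoolEx(background = true))
4642844
```
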
49 changes: 17 additions & 32 deletions src/docs/WorkStealingEx.md
@@ -1,7 +1,6 @@
WorkStealingEx(; [simd,] [basesize])

Work-stealing scheduling for parallel execution. Useful for load-balancing.

# Examples

@@ -16,27 +15,17 @@ julia> Folds.sum(i -> gcd(i, 42), 1:1000_000, WorkStealingEx())
# Extended help

`WorkStealingEx` implements [work stealing
scheduler](https://en.wikipedia.org/wiki/Work_stealing) (in particular,
[continuation
stealing](https://en.wikipedia.org/wiki/Work_stealing#Child_stealing_vs._continuation_stealing))
for Transducers.jl and other JuliaFolds/*.jl packages. Worker tasks are
cached and re-used so that the number of Julia `Task`s used for a reduction
can be much smaller than `input_length ÷ basesize`. This has a positive
impact on computations that require load-balancing since it does not incur
the overhead of spawning tasks.

**NOTE:** `WorkStealingEx` is more complex and experimental than the default
multi-thread executor `ThreadedEx`.

## More examples

@@ -52,12 +41,8 @@ julia> @floop WorkStealingEx() for x in 1:1000_000
4642844
```
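
A sketch of the kind of load-imbalanced workload this executor targets
(`fib` is a deliberately slow helper defined here only for illustration, so
later inputs cost far more than earlier ones):

```julia
julia> fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2);

julia> Folds.sum(fib, 1:30, WorkStealingEx(basesize = 1))
2178308
```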

## Keyword Arguments
- `basesize`: The size of the base case.
- `simd`: `false`, `true`, `:ivdep`, or `Val` of one of them. If
  `true`/`:ivdep`, the inner-most loop of each base case is annotated
  with `@simd`/`@simd ivdep`. A plain loop is used if `false` (default).
12 changes: 12 additions & 0 deletions src/trampoline.jl
@@ -1,3 +1,15 @@
# This is not the usual style of trampoline (thunk-returning functions) and
# maybe there's a better name (pop-call-append?). We do not transform the
# recursion fully to a loop for the work-stealing scheduler. Until the first
# base case (initial descent), the recursion is used as usual (so it still
# consumes O(log2 n) stack space). The required amount of stack space is
# comparable to other divide-and-conquer implementations. The idea/assumption
# is that it _might_ be a good idea to keep the recursion so that the compiler
# can infer types in many cases (TODO: check this). Once the call stack is
# constructed by the recursion, the continuations in the chain (cactus stack)
# are evaluated in a loop ("trampoline"). That said, it'd be interesting to
# see if the standard trampoline has some performance/implementation
# advantages.

and_finally(@nospecialize f) = listof(Function, f)

before(@nospecialize(f), chain::List{Function}) = Cons{Function}(f, chain)
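
For contrast, a minimal sketch of the "usual" thunk-returning trampoline
mentioned in the comment above (illustrative only; `trampoline`, `is_even`,
and `is_odd` are not part of this file):

```julia
# Each step either returns a final value or a zero-argument thunk for the
# next step; the driver loop keeps calling thunks, so no call stack grows.
function trampoline(step)
    while step isa Function
        step = step()
    end
    return step
end

is_even(n) = n == 0 ? true : () -> is_odd(n - 1)
is_odd(n) = n == 0 ? false : () -> is_even(n - 1)

trampoline(is_even(1_000_000))  # == true, without deep recursion
```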