From b0be7d8720077db98e282dfb040892d1c3b95758 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Fri, 22 Jan 2021 12:01:15 +0100 Subject: [PATCH 01/11] a readme --- README.md | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 104 insertions(+) diff --git a/README.md b/README.md index db5caae..679552c 100644 --- a/README.md +++ b/README.md @@ -7,3 +7,107 @@ This was originally developed for the separated-out lightweight distributed-processing framework that can be used with GigaSOM. +## Why and how? + +This provides a very simple, imperative and straightforward way to move your +data around a cluster of Julia processes created by the `Distributed` package, +and run computation on the distributed data pieces. The main aim of the package +is to avoid anything complicated, the first version used in GigaSOM had just +under 500 lines of relatively straightforward code with comments. + +The package provides a few very basic primitives that lightly wrap the +`Distributed` package functions `remotecall` and `fetch`. The most basic one is +`save_at`, which takes a worker ID, variable name and variable content, and +saves the content to the variable on the selected worker. `get_from` works the +same way, but takes the data back from the worker. + +You can thus send some random array to a few distributed workers: + +```julia +julia> using Distributed, DiDa + +julia> addprocs(2) +2-element Array{Int64,1}: + 2 + 3 + +julia> save_at(2, :x, randn(10,10)) +Future(2, 1, 4, nothing) +``` + +The `Future` returned from `save_at` is the normal Julia future from +`Distributed`, you can even `fetch` it to wait until the operation is really +done on the other side. Fetching the data is done the same way: + +```julia +julia> get_from(2,:x) +Future(2, 1, 15, nothing) + +julia> get_val_from(2,:x) # auto-fetch()ing variant +10×10 Array{Float64,2}: + -0.850788 0.946637 1.78006 … + -0.49596 0.497829 -2.03013 + … +``` + +All commands support full quoting, which allows you to easily distinguish +between code parts that are executed locally and remotely: + +```julia +julia> save_at(3, :x, randn(1000,1000)) # generates a matrix locally and sends it to the remote worker + +julia> save_at(3, :x, :(randn(1000,1000))) # generates a matrix right on the remote worker and saves it there + +julia> get_val_from(3, :x) # retrieves the generated matrix and fetches it +… + +julia> get_val_from(3, :(randn(1000,1000))) # generates the matrix on the worker and fetches the data +… +``` + +Notably, this is different from the approach taken by `DistributedArrays` and +similar packages -- all data manipulation is explicit, and any data type is +supported as long as it can be moved among workers by the `Distributed` +package. This helps with various highly non-array-ish data, such as large text +corpora and graphs. 
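Any value that `Distributed` can serialize moves the same way as an array does; a small illustrative sketch (the displayed `Future` contents are made up):

```julia
julia> save_at(2, :corpus, ["a short document", "another document"])
Future(2, 1, 8, nothing)

julia> get_val_from(2, :(length(corpus)))  # ask worker 2 about its copy
2
```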
+ +There are various goodies for easy work with matrix-style data, namely +scattering, gathering and running distributed algorithms: + +```julia +julia> x = randn(1000,3) +1000×3 Array{Float64,2}: + -0.992481 0.551064 1.67424 + -0.751304 -0.845055 0.105311 + -0.712687 0.165619 -0.469055 + ⋮ + +julia> dataset = scatter_array(:myDataset, x, workers()) # sends slices of the array to workers +Dinfo(:myDataset, [2, 3]) # a helper for holding the variable name and the used workers together + +julia> get_val_from(3, :(size(myDataset))) +(500, 3) # there's really only half of the data + +julia> dmapreduce(dataset, sum, +) # MapReduce-style sum of all data +-51.64369103751014 + +julia> dstat(dataset, [1,2,3]) # get means and sdevs in individual columns +([-0.030724038974465212, 0.007300925745200863, -0.028220577808245786], + [0.9917470012495775, 0.9975120525455358, 1.000243845434252]) + +julia> dmedian(dataset, [1,2,3]) # distributed iterative median in columns +3-element Array{Float64,1}: + 0.004742259615849834 + 0.039043266340824986 + -0.05367799062404967 + +julia> dtransform(dataset, x -> 2 .^ x) # exponentiate all data (medians should now be around 1) +Dinfo(:myDataset, [2, 3]) + +julia> gather_array(dataset) # download the data from workers to a single array +1000×3 Array{Float64,2}: + 0.502613 1.46517 3.1915 + 0.594066 0.55669 1.07573 + 0.610183 1.12165 0.722438 + ⋮ +``` From f858ae2b90cfc64bf5c775ccc7ce8a122b281546 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Fri, 22 Jan 2021 14:02:07 +0100 Subject: [PATCH 02/11] start the docs --- README.md | 18 ++++- docs/Project.toml | 3 + docs/make.jl | 22 ++++++ docs/src/functions.md | 29 ++++++++ docs/src/index.md | 29 ++++++++ docs/src/tutorial.md | 160 ++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 258 insertions(+), 3 deletions(-) create mode 100644 docs/Project.toml create mode 100644 docs/make.jl create mode 100644 docs/src/functions.md create mode 100644 docs/src/index.md create mode 100644 docs/src/tutorial.md diff --git a/README.md b/README.md index 679552c..a944f11 100644 --- a/README.md +++ b/README.md @@ -1,20 +1,24 @@ # DiDa.jl -Simple Distributed Data manipulation and processing routines for Julia. +Simple distributed data manipulation and processing routines for Julia. This was originally developed for [GigaSOM.jl](https://github.com/LCSB-BioCore/GigaSOM.jl), this package contains the separated-out lightweight distributed-processing framework that can be used with GigaSOM. -## Why and how? +## Why? This provides a very simple, imperative and straightforward way to move your data around a cluster of Julia processes created by the `Distributed` package, and run computation on the distributed data pieces. The main aim of the package -is to avoid anything complicated, the first version used in GigaSOM had just +is to avoid anything complicated-- the first version used in GigaSOM had just under 500 lines of relatively straightforward code with comments. +Most importantly, distributed processing should be simple and accessible. + +## Brief how-to + The package provides a few very basic primitives that lightly wrap the `Distributed` package functions `remotecall` and `fetch`.
The most basic one is `save_at`, which takes a worker ID, variable name and variable content, and @@ -31,6 +35,8 @@ julia> addprocs(2) 2 3 +julia> @everywhere using DiDa + julia> save_at(2, :x, randn(10,10)) Future(2, 1, 4, nothing) ``` @@ -111,3 +117,9 @@ julia> gather_array(dataset) # download the data from workers to a single array 0.610183 1.12165 0.722438 ⋮ ``` + +## What does the name `DiDa` mean? + +**Di**stributed **Da**ta. + +There is no consensus on how to pronounce the abbreviation. diff --git a/docs/Project.toml b/docs/Project.toml new file mode 100644 index 0000000..2742e84 --- /dev/null +++ b/docs/Project.toml @@ -0,0 +1,3 @@ +[deps] +Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" +DocumenterTools = "35a29f4d-8980-5a13-9543-d66fff28ecb8" diff --git a/docs/make.jl b/docs/make.jl new file mode 100644 index 0000000..2cc314f --- /dev/null +++ b/docs/make.jl @@ -0,0 +1,22 @@ +using Documenter, DiDa + +makedocs(modules = [DiDa], + clean = false, + format = Documenter.HTML(prettyurls = !("local" in ARGS)), + sitename = "DiDa.jl", + authors = "The developers of DiDa.jl", + linkcheck = !("skiplinks" in ARGS), + pages = [ + "Home" => "index.md", + "Tutorial" => "tutorial.md", + "Functions" => "functions.md", + ], +) + +deploydocs( + repo = "github.com/LCSB-BioCore/DiDa.jl.git", + target = "build", + branch = "gh-pages", + devbranch = "master", + versions = "stable" => "v^", +) diff --git a/docs/src/functions.md b/docs/src/functions.md new file mode 100644 index 0000000..f5cb708 --- /dev/null +++ b/docs/src/functions.md @@ -0,0 +1,29 @@ +# Functions + +## Data structures + +```@autodocs +Modules = [DiDa] +Pages = ["structs.jl"] +``` + +## Base functions + +```@autodocs +Modules = [DiDa] +Pages = ["base.jl"] +``` + +## Higher-level array operations + +```@autodocs +Modules = [DiDa] +Pages = ["tools.jl"] +``` + +## Input/Output + +```@autodocs +Modules = [DiDa] +Pages = ["io.jl"] +``` diff --git a/docs/src/index.md b/docs/src/index.md new file mode 100644 index 0000000..2247cf9 --- /dev/null +++ b/docs/src/index.md @@ -0,0 +1,29 @@ +# DiDa.jl - simple work with distributed data + +This package provides relatively simple distributed data manipulation and +processing routines for Julia. + +The design of the package and data manipulation approach is deliberately +"imperative" and "hands-on", to allow as much user influence on the actual way +the data are moved and stored in the cluster as possible. It uses the +`Distributed` package and its infrastructure of workers, and provides a few +very basic primitives that lightly wrap the `Distributed` package functions +`remotecall` and `fetch`. + +There are also various extra functions to easily run distributed data +transformations, MapReduce-style algorithms, store and load the data on +worker-local storage (e.g. to prevent memory exhaustion), and others. + +To start quickly, you can read the tutorial: + +```@contents +Pages=["tutorial.md"] +``` + +### Functions + +A full reference to all functions is given here: + +```@contents +Pages = ["functions.md"] +``` diff --git a/docs/src/tutorial.md b/docs/src/tutorial.md new file mode 100644 index 0000000..ca2e135 --- /dev/null +++ b/docs/src/tutorial.md @@ -0,0 +1,160 @@ + +# DiDa tutorial + +The primary purpose of this tutorial is to get a basic grasp of the main `DiDa` functions and methodology.
+ +For starting up, let's create a few distributed workers and import the package: + +```julia +julia> using Distributed, DiDa + +julia> addprocs(3) +3-element Array{Int64,1}: + 2 + 3 + 4 + +julia> @everywhere using DiDa +``` + +## Moving the data around + +In `DiDa`, the storage of distributed data is done in the "native" Julia way -- +the data is stored in normal named variables. Each node holds its own dataset +in a custom set of variables, and these are completely independent of each +other. + +There are two basic data-moving functions: + +- [`save_at`](@ref), which evaluates a given object on the remote worker, and + stores it in a variable. `save_at(3, :x, 123)` is roughly the same as if you + connected to the Julia session on the worker `3` and typed `x = 123`. +- [`get_from`](@ref), which evaluates a given object on the remote worker and + returns a `Future` that holds the evaluated result. To get the value of `x` + from worker `3`, you call `fetch(get_from(3, :x))`. + +Additionally, there is [`get_val_from`](@ref), which calls the `fetch` right +away. + +```julia +julia> save_at(3,:x,123) +Future(3, 1, 11, nothing) + +julia> get_val_from(3, :x) +123 + +julia> get_val_from(4, :x) +ERROR: On worker 4: +UndefVarError: x not defined +… +``` + +We use quoting to precisely distinguish what code is evaluated on the "leader" +worker that you use, and what code is evaluated on the workers. Basically, +everything quoted is going to get to the workers without any evaluation; +everything other is evaluated on the main node. + +For example, this picks up variable `x` from the node, which is named as `y` on +the main node: +```julia +julia> y=:x +:x + +julia> get_val_from(3, y) +123 +``` + +The quoting system can be easily exploited to tell the system that some +operations (e.g., heavy computations) are going to be executed on the remotes. + +For example, this code generates a huge random matrix locally and sends it to +the worker, which may not be desired (the data transfer takes precious time): + +```julia +julia> save_at(2, :mtx, randn(1000, 1000)) +``` + +On the remote, this may have been executed as something like `mtx = [0.384478, 0.763806, -0.885208, …]` . + +If you quote the parameter, it is not going to be evaluated on the main worker, but rather goes unevaluated and "packed" as an expression to the remote, which unpacks and evaluates it for itself. The data transfer is thus minimized: + +```julia +julia> save_at(2, :mtx, :(randn(1000,1000))) +``` + +On the remote, this is executed properly as `mtx = randn(1000, 1000)`. This is useful if handling large data -- you can easily load giant datasets to the workers without the risk that all data are loaded at your computer, likely causing an out-of-memory trouble. + +The same applies for receiving the data -- you can let some of the workers compute a very hard function and download it as follows: + +```julia +julia> get_val_from(2, :( computeAnswerToMeaningOfLife() )) +42 +``` + +If the expression in the previous case was not quoted, it would actually lead to the main worker computing the answer, sending it to worker `2`, and receiving back, which is likely not what we wanted. + +### Parallelization and synchronization + +Operations executed by `save_at` and `get_from` are asynchronous by default, which is good and bad, depending on the purpose. For example, results of hard-to-compute functions may not yet be saved at the time you need them.
Let's demonstrate that on a simulated-hard function: + +```julia +julia> save_at(2, :delayed, :(begin sleep(30); 42; end)) +Future(2, 1, 18, nothing) + +julia> get_val_from(2, :delayed) +ERROR: On worker 2: +UndefVarError: delayed not defined + +# …wait 30 seconds… + +julia> get_val_from(2, :delayed) +42 +``` + +The simplest way to prevent such data races is to `fetch` the future returned +from `save_at`, which correctly waits until the result is available. + +The synchronization is not performed by default because the non-syncronized behavior allows to very easily implement parallelism -- you can start multiple computations at once, and then wait for all of them to complete. + +For example, this code distributes the random data and synchronizes correctly, but is basically serial: +```julia +julia> @time for i in workers() + fetch(save_at(i, :x, :(randn(10000,10000)))) + end + 1.073267 seconds (346 allocations: 12.391 KiB) +``` + +By spawning the operations first and waiting for all of them later, you can make the code parallel, and usually a few times faster (depending on the number of workers): + +```julia +julia> @time fetch.([save_at(i, :x, :(randn(10000,10000))) for i in workers()]) + 0.403235 seconds (44.50 k allocations: 2.277 MiB) +3-element Array{Nothing,1}: +nothing +… +``` + +The same is applicable for retrieving the sub-results parallely. This example demonstrates that multiple workers do some work (in this case, wait actively) at the same time: + +```julia +julia> @time fetch.([get_from(i, :(begin sleep(1); myid(); end)) + for i in workers()]) + 1.027651 seconds (42.26 k allocations: 2.160 MiB) +3-element Array{Int64,1}: + 2 + 3 + 4 +``` + +Notably, you can even send individual `Future`s to other workers, allowing the +workers to synchronize and transfer the data among each other. This is +beneficial for implementing advanced parallel algoritms. + +### `Dinfo` handles + +## Transformations and reductions + +## Persisting the data + +## Miscellaneous functions + From cc51135d0662c88eca1a094a69ce50f596ee13d0 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Fri, 22 Jan 2021 17:13:59 +0100 Subject: [PATCH 03/11] more docs --- docs/src/tutorial.md | 115 ++++++++++++++++++++++++++++++++++++++----- 1 file changed, 104 insertions(+), 11 deletions(-) diff --git a/docs/src/tutorial.md b/docs/src/tutorial.md index ca2e135..04be29e 100644 --- a/docs/src/tutorial.md +++ b/docs/src/tutorial.md @@ -1,7 +1,8 @@ # DiDa tutorial -The primary purpose of this tutorial is to get a basic grasp of the main `DiDa` functions and methodology. +The primary purpose of this tutorial is to get a basic grasp of the main `DiDa` +functions and methodology. For starting up, let's create a few distributed workers and import the package: @@ -74,28 +75,49 @@ the worker, which may not be desired (the data transfer takes precious time): julia> save_at(2, :mtx, randn(1000, 1000)) ``` -On the remote, this may have been executed as something like `mtx = [0.384478, 0.763806, -0.885208, …]` . +On the remote, this may have been executed as something like +`mtx = [0.384478, 0.763806, -0.885208, …]` . -If you quote the parameter, it is not going to be evaluated on the main worker, but rather goes unevaluated and "packed" as an expression to the remote, which unpacks and evaluates it for itself. 
The data transfer is thus minimized: +If you quote the parameter, it is not going to be evaluated on the main worker, +but rather goes unevaluated and "packed" as an expression to the remote, which +unpacks and evaluates it for itself. The data transfer is thus minimized: ```julia julia> save_at(2, :mtx, :(randn(1000,1000))) ``` -On the remote, this is executed properly as `mtx = randn(1000, 1000)`. This is useful if handling large data -- you can easily load giant datasets to the workers without the risk that all data are loaded at your computer, likely causing an out-of-memory trouble. +On the remote, this is executed properly as `mtx = randn(1000, 1000)`. This is +useful if handling large data -- you can easily load giant datasets to the +workers without the risk that all data are loaded at your computer, likely +causing an out-of-memory trouble. -The same applies for receiving the data -- you can let some of the workers compute a very hard function and download it as follows: +The same applies for receiving the data -- you can let some of the workers +compute a very hard function and download it as follows: ```julia julia> get_val_from(2, :( computeAnswerToMeaningOfLife() )) 42 ``` -If the expression in the previous case was not quoted, it would actually lead to the main worker computing the answer, sending it to worker `2`, and receiving back, which is likely not what we wanted. +If the expression in the previous case was not quoted, it would actually lead +to the main worker computing the answer, sending it to worker `2`, and +receiving back, which is likely not what we wanted. + +Finally, it is very easy to work with multiple variables saved at a single +worker -- you just reference them in the expression: +```julia +julia> save_at(2,:x,123) +julia> save_at(2,:y,321) +julia> get_val_from(2, :(2*x+y)) +567 +``` ### Parallelization and synchronization -Operations executed by `save_at` and `get_from` are asynchronous by default, which is good and bad, depending on the purpose. For example, results of hard-to-compute functions may not yet be saved at the time you need them. Let's demonstrate that on a simulated-hard function: +Operations executed by `save_at` and `get_from` are asynchronous by default, +which is good and bad, depending on the purpose. For example, results of +hard-to-compute functions may not yet be saved at the time you need them. Let's +demonstrate that on a simulated-hard function: ```julia julia> save_at(2, :delayed, :(begin sleep(30); 42; end)) @@ -114,9 +136,12 @@ julia> get_val_from(2, :delayed) The simplest way to prevent such data races is to `fetch` the future returned from `save_at`, which correctly waits until the result is available. -The synchronization is not performed by default because the non-syncronized behavior allows to very easily implement parallelism -- you can start multiple computations at once, and then wait for all of them to complete. +The synchronization is not performed by default because the non-syncronized +behavior allows to very easily implement parallelism -- you can start multiple +computations at once, and then wait for all of them to complete. 
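If you just want to check whether a remote operation has finished, without blocking on it, the standard `Distributed` function `isready` can poll the returned `Future` -- a rough sketch (the timing and `Future` contents are illustrative):

```julia
julia> f = save_at(2, :delayed, :(begin sleep(30); 42; end))
Future(2, 1, 19, nothing)

julia> isready(f)   # still sleeping on worker 2
false

julia> fetch(f)     # blocks until the remote assignment completes

julia> isready(f)
true
```

Either way, `fetch` remains the primary synchronization tool, as the following examples show.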
-For example, this code distributes the random data and synchronizes correctly, but is basically serial: +For example, this code distributes the random data and synchronizes correctly, +but is basically serial: ```julia julia> @time for i in workers() fetch(save_at(i, :x, :(randn(10000,10000)))) @@ -124,7 +149,9 @@ julia> @time for i in workers() 1.073267 seconds (346 allocations: 12.391 KiB) ``` -By spawning the operations first and waiting for all of them later, you can make the code parallel, and usually a few times faster (depending on the number of workers): +By spawning the operations first and waiting for all of them later, you can +make the code parallel, and usually a few times faster (depending on the number +of workers): ```julia julia> @time fetch.([save_at(i, :x, :(randn(10000,10000))) for i in workers()]) @@ -134,7 +161,9 @@ nothing … ``` -The same is applicable for retrieving the sub-results parallely. This example demonstrates that multiple workers do some work (in this case, wait actively) at the same time: +The same is applicable for retrieving the sub-results parallely. This example +demonstrates that multiple workers do some work (in this case, wait actively) +at the same time: ```julia julia> @time fetch.([get_from(i, :(begin sleep(1); myid(); end)) @@ -152,6 +181,70 @@ beneficial for implementing advanced parallel algoritms. ### `Dinfo` handles +Remembering the remote variable names and worker numbers is extremely +impractical, especially if you manage multiple variables on various subsets of +all available workers at once. `DiDa` defines a small [`Dinfo`](@ref) data structure +that manages exactly that information for you. Many other functions are able to +work with `Dinfo` transparently, instead of the "raw" identifiers and worker +lists. + +For example, you can use [`scatter_array`](@ref) to automatically separate the +array-like dataset to roughly-same pieces scattered accross multiple workers, +and obtain the `Dinfo` object: +```julia +julia> dataset = scatter_array(:myData, randn(1000,3), workers()) +Dinfo(:myData, [2, 3, 4]) +``` + +You can check the size of the resulting slices on each worker (note the +`$(...)` syntax for un-quoting, i.e., inserting evaluated data into quoted +expressions): +```julia +julia> fetch.([get_from(w, :(size($(dataset.val)))) for w in dataset.workers]) +3-element Array{Tuple{Int64,Int64},1}: + (333, 3) + (333, 3) + (334, 3) +``` + +The `Dinfo` object is used e.g. by the statistical functions, such as +[`dstat`](@ref) (see below for more examples). `dstat` just computes means and +standard deviations in selected columns of the data: + +```julia +julia> dstat(dataset, [1,2]) +([-0.029108965193981328, 0.019687519297162222], # means + [0.9923669075507301, 0.9768313338000191]) # sdevs +``` + +There are three functions for basic dataset management using the `Dinfo`: + +- [`dcopy`](@ref) for duplicating the data objects on all related workers +- [`unscatter`](@ref) for removing the data from workers (and freeing the + memory) +- [`gather_array`](@ref) for collecting the array pieces from individual + workers and pasting them together (an opposite of `scatter_array`. 
+ +Continuing the previous example, we can copy the data, remove the originals, +and gather the copies: + +```julia +julia> dataset2 = dcopy(dataset, :backup) +Dinfo(:backup, [2, 3, 4]) + +julia> unscatter(dataset) + +julia> get_val_from(2, :myData) + # nothing + +julia> gather_array(dataset2) +1000×3 Array{Float64,2}: + 0.241102 0.62638 0.759203 + 0.981085 -1.01467 -0.495331 + -0.439979 -0.884943 -1.62218 + ⋮ +``` + ## Transformations and reductions ## Persisting the data From 99d061059f5e08985c246e0148722eb497f98372 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Fri, 22 Jan 2021 21:35:29 +0100 Subject: [PATCH 04/11] finish the docs tutorial --- docs/src/tutorial.md | 123 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 123 insertions(+) diff --git a/docs/src/tutorial.md b/docs/src/tutorial.md index 04be29e..bef6d9b 100644 --- a/docs/src/tutorial.md +++ b/docs/src/tutorial.md @@ -247,7 +247,130 @@ julia> gather_array(dataset2) ## Transformations and reductions +There are several simplified functions to run parallel transformations of the +data and reductions: + +- [`dtransform`](@ref) that processes all parts of dataset using a given + function and stores the result +- [`dexec`](@ref) that is similar to `dtransform` but used to execute a + function that works with the data using "side-effects" for increased + efficiency, such as in case of the small array-modifying operations. +- [`dmapreduce`](@ref) that applies (maps) a function to all data parts and + reduces (or folds) the results to one using another function +- [`dmap`](@ref) that executes a function over the distributed data parts, + distributing a vector of values as parameters for that function. + +For example, `dtransform` can be used to exponentiate the whole dataset: +```julia +julia> dataset = scatter_array(:myData, randn(1000,3), workers()) +julia> get_val_from(dataset.workers[1], :(myData[1,:])) +3-element Array{Float64,1}: + -1.0788051465727018 + -0.29710863020942757 + -2.5613834309426546 + +julia> dtransform(dataset, x -> 2 .^ x) +Dinfo(:myData, [2, 3, 4]) + +julia> get_val_from(dataset.workers[1], :(myData[1,:])) +3-element Array{Float64,1}: + 0.4734207525033287 + 0.8138819001228813 + 0.16941300928000705 +``` + +You may have noticed that `dtransform` returns a new `Dinfo` object. That is +safe to ignore, but you can use `dtransform` to put the result into another +distributed variable with the extra argument, in which case the returned +`Dinfo` wraps this new distributed variable. That is very useful for easily +creating new datasets with the same command on all workers. In the following +example, also note that the function does not need to be quoted. + +```julia +julia> anotherDataset = dtransform((), _ ->randn(100), workers(), :newData) +Dinfo(:newData, [2, 3, 4]) +``` + +The `dexec` function is handy if your transformation does not modify the whole +array, but leaves most of it untouched and rewriting it would be a waste of +resources. This multiplies the 5-th element of each distributed array by 42: + +```julia +julia> dexec(anotherDataset, arr -> arr[5] *= 42) +julia> gather_array(anotherDataset)[1:6] +6-element Array{Float64,1}: + 0.8270400003123709 + -0.10688512653581493 + -1.0462015551052068 + -1.2891453384843214 + 16.429315504503112 + 0.13958421716454797 + ⋮ +``` + +MapReduce is a handy primitive that is suitable for operations that can +compress the dataset slices into relatively small values that can be combined +efficiently. 
For example, this computes the sum of squares of the whole array: + +```julia +julia> dmapreduce(anotherDataset, x -> sum(x.^2), +) +8633.94032741762 +``` + +Finally, `dmap` is useful for passing each worker a specific value from the +given vector, which may be useful in cases when each worker is supposed to do +something different with the data, such as submitting them to a different +interface or saving them as a different file. The results are returned as a +vector. This example is rather simplistic: + +```julia +julia> dmap(Vector(1:length(workers())), + val -> "Worker number $(val) has ID $(myid())", + workers()) +3-element Array{String,1}: + "Worker number 1 has ID 2" + "Worker number 2 has ID 3" + "Worker number 3 has ID 4" +``` + ## Persisting the data +There some support for storing the loaded dataset on each worker's local +storage. This is quite beneficial for storing sub-results and various artifacts +for future use without wasting memory. + +The available functions are: +- [`dstore`](@ref), which saves the dataset to a disk, such as in +```julia +julia> dstore(anotherDataset) +``` + ...which, in this case, creates files `newData-1.slice` to `newData-3.slice` + that contain the respective parts of the dataset. The name can be modified + using the `files` parameter. +- [`dload`](@ref), which loads the data back, using the same arguments as `dstore` +- [`dunlink`](@ref), which removes the corresponding files. + +One possible use-case for this is a relatively easy way of data exchange +between nodes in a HPC environment, where the disk storage is usually a very +fast "scratch space" that is shared among all participants of a computation. + ## Miscellaneous functions +There are many extra functions that work on matrix data, as common in many +science areas (especially in flow cytometry data processing, where DiDa +originated): + +- [`dselect`](@ref) reduces a matrix to several selected columns +- [`dapply_cols`](@ref) transforms selected columns with a function +- [`dapply_rows`](@ref) does the same with rows +- [`dstat`](@ref) quickly computes mean and standard deviation in selected + columns (as shown above) +- [`dstat_buckets`](@ref) does the same for multiple data "groups" present in + the same matrix; the data groups are specified by an integer vector (this is + great e.g. for computing per-cluster statistics, given the integer vector + assigns data entries to clusters) +- [`dcount`](@ref) counts ocurrences of items in an integer vector, similar to e.g. R function `tabulate` +- [`dcount_buckets`](@ref) does the same per groups +- [`dscale`](@ref) scales the selected columns to mean 0 and standard deviation 1 +- [`dmedian`](@ref) computes a median in columns of the dataset (That is done by an approximative algorithm that works in time `O(n*iters)`, thus works even for really large datasets. Precision increases by roughly 1 bit per iteration, the default is 20 iterations.) 
+- [`dmedian_buckets`](@ref) computes the medians using the above method for multiple data groups From 323e4c1f4017c70b61fc00a57311b07ff4b13442 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Fri, 22 Jan 2021 22:07:40 +0100 Subject: [PATCH 05/11] line breaks --- docs/src/tutorial.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/docs/src/tutorial.md b/docs/src/tutorial.md index bef6d9b..9b0fd38 100644 --- a/docs/src/tutorial.md +++ b/docs/src/tutorial.md @@ -369,8 +369,14 @@ originated): the same matrix; the data groups are specified by an integer vector (this is great e.g. for computing per-cluster statistics, given the integer vector assigns data entries to clusters) -- [`dcount`](@ref) counts ocurrences of items in an integer vector, similar to e.g. R function `tabulate` +- [`dcount`](@ref) counts ocurrences of items in an integer vector, similar to + e.g. R function `tabulate` - [`dcount_buckets`](@ref) does the same per groups -- [`dscale`](@ref) scales the selected columns to mean 0 and standard deviation 1 -- [`dmedian`](@ref) computes a median in columns of the dataset (That is done by an approximative algorithm that works in time `O(n*iters)`, thus works even for really large datasets. Precision increases by roughly 1 bit per iteration, the default is 20 iterations.) -- [`dmedian_buckets`](@ref) computes the medians using the above method for multiple data groups +- [`dscale`](@ref) scales the selected columns to mean 0 and standard deviation + 1 +- [`dmedian`](@ref) computes a median in columns of the dataset (That is done + by an approximative algorithm that works in time `O(n*iters)`, thus works + even for really large datasets. Precision increases by roughly 1 bit per + iteration, the default is 20 iterations.) +- [`dmedian_buckets`](@ref) computes the medians using the above method for + multiple data groups From a83b455b2df5f63ff8d37a19b430466cd94467a8 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Sat, 23 Jan 2021 11:08:44 +0100 Subject: [PATCH 06/11] proofread the documentation, clear the english --- docs/src/tutorial.md | 242 ++++++++++++++++++++++++------------------- 1 file changed, 134 insertions(+), 108 deletions(-) diff --git a/docs/src/tutorial.md b/docs/src/tutorial.md index 9b0fd38..39527f7 100644 --- a/docs/src/tutorial.md +++ b/docs/src/tutorial.md @@ -21,21 +21,23 @@ julia> @everywhere using DiDa ## Moving the data around In `DiDa`, the storage of distributed data is done in the "native" Julia way -- -the data is stored in normal named variables. Each node holds its own dataset -in a custom set of variables, and these are completely independent of each -other. +the data is stored in normal named variables. Each node holds its own data in +an arbitrary set of variables as "plain data"; content of these variables is +completely independent among nodes. There are two basic data-moving functions: -- [`save_at`](@ref), which evaluates a given object on the remote worker, and - stores it in a variable. `save_at(3, :x, 123)` is roughly the same as if you - connected to the Julia session on the worker `3` and typed `x = 123`. +- [`save_at`](@ref), which evaluates a given expression on the remote worker, + and stores it in a variable. In particular, `save_at(3, :x, 123)` is roughly + the same as if you would manually connect to the Julia session on the worker + `3` and type `x = 123`. - [`get_from`](@ref), which evaluates a given object on the remote worker and returns a `Future` that holds the evaluated result. 
To get the value of `x` - from worker `3`, you call `fetch(get_from(3, :x))`. + from worker `3`, you may call `fetch(get_from(3, :x))` to fetch the + "contents" of that future. (Additionally, there is [`get_val_from`](@ref), + which calls the `fetch` for you.) -Additionally, there is [`get_val_from`](@ref), which calls the `fetch` right -away. +The use of these functions is quite straightforward: ```julia julia> save_at(3,:x,123) @@ -50,13 +52,14 @@ UndefValError: x not defined … ``` -We use quoting to precisely distinguish what code is evaluated on the "leader" -worker that you use, and what code is evaluated on the workers. Basically, -everything quoted is going to get to the workers without any evaluation; -everything other is evaluated on the main node. +`DiDa` uses *quoting* to allow you to precisely specify the parts of the code +that should be evaluated on the "main" Julia process (the one you interact +with), and the code that shold be evaluated on the remote workers. Basically, +all quoted code is going to get to the workers without any evaluation; all +other code is evaluated on the main node. -For example, this picks up variable `x` from the node, which is named as `y` on -the main node: +For example, this picks up the contents of variable `x` from the remote worker, +despite that actual symbol is named as `y` in the main process: ```julia julia> y=:x :x @@ -65,33 +68,36 @@ julia> get_val_from(3, y) 123 ``` -The quoting system can be easily exploited to tell the system that some -operations (e.g., heavy computations) are going to be executed on the remotes. +This system is used to easily specify that some particular operations (e.g., +heavy computations) are going to be executed on the remotes. -For example, this code generates a huge random matrix locally and sends it to -the worker, which may not be desired (the data transfer takes precious time): +To illustrate the difference between quoted and non-quoted code, the following +code generates a huge random matrix locally and sends it to the worker, which +may not be desired (the data transfer takes a lot of precious time): ```julia julia> save_at(2, :mtx, randn(1000, 1000)) ``` -On the remote, this may have been executed as something like +On the remote worker `2`, this will be executed as something like `mtx = [0.384478, 0.763806, -0.885208, …]` . If you quote the parameter, it is not going to be evaluated on the main worker, but rather goes unevaluated and "packed" as an expression to the remote, which -unpacks and evaluates it for itself. The data transfer is thus minimized: +unpacks and evaluates it by itself: ```julia julia> save_at(2, :mtx, :(randn(1000,1000))) ``` -On the remote, this is executed properly as `mtx = randn(1000, 1000)`. This is -useful if handling large data -- you can easily load giant datasets to the -workers without the risk that all data are loaded at your computer, likely -causing an out-of-memory trouble. +The data transfer is minimized to a few-byte expression `randn(1000,1000)`. On +the remote, this is executed properly as `mtx = randn(1000, 1000)`. -The same applies for receiving the data -- you can let some of the workers +This is useful for handling large data -- you can easily load giant datasets to +the workers without hauling all the data through your computer; very likely +also decreasing the risk of out-of-memory problems. 
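To get a rough feel for the amount of data saved, you can compare the two payload sizes locally with `Base.summarysize` (a sketch -- the exact byte counts will differ between Julia versions):

```julia
julia> Base.summarysize(randn(1000, 1000))     # the evaluated matrix: ~8 MB would travel
8000040

julia> Base.summarysize(:(randn(1000, 1000)))  # the quoted expression: a few hundred bytes
253
```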
+ +The same principle applies for receiving the data -- you can let some of the workers compute a very hard function and download it as follows: ```julia @@ -99,11 +105,11 @@ julia> get_val_from(2, :( computeAnswerToMeaningOfLife() )) 42 ``` -If the expression in the previous case was not quoted, it would actually lead -to the main worker computing the answer, sending it to worker `2`, and -receiving back, which is likely not what we wanted. +If the expression in the previous case was not quoted, it would actually cause +the main worker to compute the answer, send it to worker `2`, and receive it +back unchanged, which is likely not what we wanted. -Finally, it is very easy to work with multiple variables saved at a single +Finally, this way it is very easy to work with multiple variables saved at a single worker -- you just reference them in the expression: ```julia julia> save_at(2,:x,123) @@ -114,16 +120,17 @@ julia> get_val_from(2, :(2*x+y)) ### Parallelization and synchronization -Operations executed by `save_at` and `get_from` are asynchronous by default, -which is good and bad, depending on the purpose. For example, results of -hard-to-compute functions may not yet be saved at the time you need them. Let's -demonstrate that on a simulated-hard function: +Operations executed by `save_at` and `get_from` are *asynchronous* by default, +which may be both good and bad, depending on the situation. For example, when +using `save_at`, the results of hard-to-compute functions may not yet be saved +at the time you need them. Let's demonstrate that on a "simulated" slow +function: ```julia julia> save_at(2, :delayed, :(begin sleep(30); 42; end)) Future(2, 1, 18, nothing) -julia> get_val_from(2, :delayed) +julia> get_val_from(2, :delayed) # the computation is not finished yet, thus the variable is not assigned ERROR: On worker 2: UndefVarError: delayed not defined @@ -134,14 +141,17 @@ julia> get_val_from(2, :delayed) ``` The simplest way to prevent such data races is to `fetch` the future returned -from `save_at`, which correctly waits until the result is available. +from `save_at`, which correctly waits until the result is properly available on +the target worker. -The synchronization is not performed by default because the non-syncronized -behavior allows to very easily implement parallelism -- you can start multiple -computations at once, and then wait for all of them to complete. +This *synchronization is not performed by default*, because the non-syncronized +behavior allows you to very easily implement parallelism. In particular, you +may start multiple asynchronous computations at once, and then wait for all of +them to complete to make sure all results are available. Because the operations +run asynchronously, they are processed concurrently, thus faster. -For example, this code distributes the random data and synchronizes correctly, -but is basically serial: +To illustrate the difference, the following code distributes some random data +and then synchronizes correctly, but is essentially serial: ```julia julia> @time for i in workers() fetch(save_at(i, :x, :(randn(10000,10000)))) @@ -161,9 +171,8 @@ nothing … ``` -The same is applicable for retrieving the sub-results parallely. This example -demonstrates that multiple workers do some work (in this case, wait actively) -at the same time: +The same is applicable for retrieving the sub-results parallelly. 
This example +demonstrates that multiple workers can do some work at the same time: ```julia julia> @time fetch.([get_from(i, :(begin sleep(1); myid(); end)) @@ -181,12 +190,12 @@ beneficial for implementing advanced parallel algoritms. ### `Dinfo` handles -Remembering the remote variable names and worker numbers is extremely -impractical, especially if you manage multiple variables on various subsets of -all available workers at once. `DiDa` defines a small [`Dinfo`](@ref) data structure -that manages exactly that information for you. Many other functions are able to -work with `Dinfo` transparently, instead of the "raw" identifiers and worker -lists. +Remembering and managing the remote variable names and worker numbers is +extremely impractical, especially if you need to maintain multiple variables on +various subsets of all available workers at once. `DiDa` defines a small +[`Dinfo`](@ref) data structure that keeps that information for you. Many other +functions are able to work with `Dinfo` transparently, instead of the "raw" +symbols and worker lists. For example, you can use [`scatter_array`](@ref) to automatically separate the array-like dataset to roughly-same pieces scattered accross multiple workers, @@ -196,9 +205,11 @@ julia> dataset = scatter_array(:myData, randn(1000,3), workers()) Dinfo(:myData, [2, 3, 4]) ``` -You can check the size of the resulting slices on each worker (note the -`$(...)` syntax for un-quoting, i.e., inserting evaluated data into quoted -expressions): +`Dinfo` contains the necessary information about the "contents" of the +distributed dataset: The name of variable used to save it on workers, and IDs +of individual workers. The storage of the variables is otherwise same as with +the basic data-moving function -- you can e.g. manually check the size of the +resulting slices on each worker using `get_from`: ```julia julia> fetch.([get_from(w, :(size($(dataset.val)))) for w in dataset.workers]) 3-element Array{Tuple{Int64,Int64},1}: @@ -206,6 +217,8 @@ julia> fetch.([get_from(w, :(size($(dataset.val)))) for w in dataset.workers]) (333, 3) (334, 3) ``` +(Note the `$(...)` syntax for *un-quoting*, i.e., inserting evaluated data into +quoted expressions.) The `Dinfo` object is used e.g. by the statistical functions, such as [`dstat`](@ref) (see below for more examples). `dstat` just computes means and @@ -217,13 +230,14 @@ julia> dstat(dataset, [1,2]) [0.9923669075507301, 0.9768313338000191]) # sdevs ``` -There are three functions for basic dataset management using the `Dinfo`: +There are three functions for straightforward data management using the +`Dinfo`: - [`dcopy`](@ref) for duplicating the data objects on all related workers - [`unscatter`](@ref) for removing the data from workers (and freeing the memory) - [`gather_array`](@ref) for collecting the array pieces from individual - workers and pasting them together (an opposite of `scatter_array`. 
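As an aside: judging from how the handles print and from the `dataset.val` and `dataset.workers` accesses above, `Dinfo` is presumably just a tiny record, roughly of the following shape (the actual definition may differ):

```julia
struct Dinfo
    val::Symbol             # name of the remote variable that holds the data
    workers::Vector{Int64}  # IDs of the workers that hold a piece of it
end
```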
+ workers and pasting them together (an opposite of `scatter_array`) Continuing the previous example, we can copy the data, remove the originals, and gather the copies: @@ -247,18 +261,19 @@ julia> gather_array(dataset2) ## Transformations and reductions -There are several simplified functions to run parallel transformations of the -data and reductions: +There are several simplified functions to run parallel computation on the +distributed data: -- [`dtransform`](@ref) that processes all parts of dataset using a given +- [`dtransform`](@ref) processes all worker's parts of the data using a given function and stores the result -- [`dexec`](@ref) that is similar to `dtransform` but used to execute a - function that works with the data using "side-effects" for increased - efficiency, such as in case of the small array-modifying operations. -- [`dmapreduce`](@ref) that applies (maps) a function to all data parts and - reduces (or folds) the results to one using another function -- [`dmap`](@ref) that executes a function over the distributed data parts, - distributing a vector of values as parameters for that function. +- [`dmapreduce`](@ref) applies ("maps") a function to all data parts and + reduces ("folds") the intermediate results to a single result using another + function +- [`dexec`](@ref) is similar to `dtransform`, but expects a function that + modifies the data in-place (using "side-effects"), for increased efficiency + in cases such as very small array modifications +- [`dmap`](@ref) executes a function over the workers, also distributing a + vector of values as parameters for that function. For example, `dtransform` can be used to exponentiate the whole dataset: ```julia @@ -279,21 +294,24 @@ julia> get_val_from(dataset.workers[1], :(myData[1,:])) 0.16941300928000705 ``` -You may have noticed that `dtransform` returns a new `Dinfo` object. That is -safe to ignore, but you can use `dtransform` to put the result into another -distributed variable with the extra argument, in which case the returned -`Dinfo` wraps this new distributed variable. That is very useful for easily -creating new datasets with the same command on all workers. In the following -example, also note that the function does not need to be quoted. +You may have noticed that `dtransform` returns a new `Dinfo` object, which we +safely discard in this case. You can use `dtransform` to save the result into +another distributed variable (by supplying the new name in an extra argument), +in which case the returned `Dinfo` wraps this new distributed variable. That is +useful for easily generating new datasets on all workers, as in the following +example: ```julia julia> anotherDataset = dtransform((), _ ->randn(100), workers(), :newData) Dinfo(:newData, [2, 3, 4]) ``` +(Note that the function body does not need to be quoted.) + The `dexec` function is handy if your transformation does not modify the whole array, but leaves most of it untouched and rewriting it would be a waste of -resources. This multiplies the 5-th element of each distributed array by 42: +resources. This example multiplies the 5th element of each distributed array +part by 42: ```julia julia> dexec(anotherDataset, arr -> arr[5] *= 42) @@ -309,19 +327,20 @@ julia> gather_array(anotherDataset)[1:6] ``` MapReduce is a handy primitive that is suitable for operations that can -compress the dataset slices into relatively small values that can be combined -efficiently. 
For example, this computes the sum of squares of the whole array: +"compress" the dataset slices into relatively small pieces of data, which can +be combined efficiently. For example, this computes the sum of squares of the +whole array: ```julia julia> dmapreduce(anotherDataset, x -> sum(x.^2), +) 8633.94032741762 ``` -Finally, `dmap` is useful for passing each worker a specific value from the -given vector, which may be useful in cases when each worker is supposed to do -something different with the data, such as submitting them to a different -interface or saving them as a different file. The results are returned as a -vector. This example is rather simplistic: +Finally, `dmap` passes each worker a specific value from a given vector, which +may be useful in cases when each worker is supposed to do something slightly +different with the data (for example, submit them to a different interface or +save them as a different file). The results are returned as a vector. This +example is rather simplistic: ```julia julia> dmap(Vector(1:length(workers())), @@ -335,48 +354,55 @@ julia> dmap(Vector(1:length(workers())), ## Persisting the data -There some support for storing the loaded dataset on each worker's local -storage. This is quite beneficial for storing sub-results and various artifacts -for future use without wasting memory. +`DiDa` provides support for storing the loaded dataset in each worker's local +storage. This is quite beneficial for saving sub-results and various artifacts +of the computation process for later use, without unnecessarily wasting +main memory. -The available functions are: -- [`dstore`](@ref), which saves the dataset to a disk, such as in +The available functions are as follows: +- [`dstore`](@ref) saves the dataset to a disk, such as in ```julia julia> dstore(anotherDataset) ``` ...which, in this case, creates files `newData-1.slice` to `newData-3.slice` - that contain the respective parts of the dataset. The name can be modified - using the `files` parameter. -- [`dload`](@ref), which loads the data back, using the same arguments as `dstore` -- [`dunlink`](@ref), which removes the corresponding files. + that contain the respective parts of the dataset. The precise naming scheme + can be specified using the `files` parameter. +- [`dload`](@ref) loads the data back into memory +- [`dunlink`](@ref) removes the corresponding files from the storage -One possible use-case for this is a relatively easy way of data exchange -between nodes in a HPC environment, where the disk storage is usually a very -fast "scratch space" that is shared among all participants of a computation. +Apart from saving the data for later use, this provides a relatively easy way +of exchanging data between nodes in a HPC environment. There, the disk storage +is usually a very fast "scratch space" that is shared among all participants of +a computation, and can be used to "broadcast" or "shuffle" the data without any +significant overhead. ## Miscellaneous functions -There are many extra functions that work on matrix data, as common in many -science areas (especially in flow cytometry data processing, where DiDa -originated): +For convenience, `DiDa` also contains simple implementations of various common +utility operations for processing matrix data. 
These originated in +flow-cytometry use-cases (which is what `DiDa` was originally built for), but +are applicable in many other areas of data analysis: -- [`dselect`](@ref) reduces a matrix to several selected columns -- [`dapply_cols`](@ref) transforms selected columns with a function +- [`dselect`](@ref) reduces a matrix to several selected columns (in a + relatively usual scenario where the rows of the matrix are "events" and + columns represent "features", `dselect` discards the unwanted features) +- [`dapply_cols`](@ref) transforms selected columns with a given function - [`dapply_rows`](@ref) does the same with rows -- [`dstat`](@ref) quickly computes mean and standard deviation in selected - columns (as shown above) +- [`dstat`](@ref) quickly computes the mean and standard deviation in selected + columns (as demonstrated above) - [`dstat_buckets`](@ref) does the same for multiple data "groups" present in - the same matrix; the data groups are specified by an integer vector (this is - great e.g. for computing per-cluster statistics, given the integer vector - assigns data entries to clusters) -- [`dcount`](@ref) counts ocurrences of items in an integer vector, similar to - e.g. R function `tabulate` + the same matrix, the data groups are specified by a distributed integer + vector (This is useful e.g. for computing per-cluster statistics, in which + case the integer vector should assign individual data entries to clusters.) +- [`dcount`](@ref) counts the numbers of ocurrences of items in an integer + vector, similar to e.g. R function `tabulate` - [`dcount_buckets`](@ref) does the same per groups - [`dscale`](@ref) scales the selected columns to mean 0 and standard deviation 1 -- [`dmedian`](@ref) computes a median in columns of the dataset (That is done - by an approximative algorithm that works in time `O(n*iters)`, thus works - even for really large datasets. Precision increases by roughly 1 bit per - iteration, the default is 20 iterations.) -- [`dmedian_buckets`](@ref) computes the medians using the above method for +- [`dmedian`](@ref) computes a median of the selected columns of the dataset + (The computation is done using an approximative iterative algorithm in time + `O(n*iters)`, which scales even to really large datasets. The precision of + the result increases by roughly 1 bit per iteration, the default is 20 + iterations.) +- [`dmedian_buckets`](@ref) uses the above method to compute the medians for multiple data groups From 11185c23cc47cbc7daba41a28d5aed936844cb84 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Sat, 23 Jan 2021 11:11:05 +0100 Subject: [PATCH 07/11] spellcheck --- docs/src/tutorial.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/src/tutorial.md b/docs/src/tutorial.md index 39527f7..5764e54 100644 --- a/docs/src/tutorial.md +++ b/docs/src/tutorial.md @@ -54,7 +54,7 @@ UndefValError: x not defined `DiDa` uses *quoting* to allow you to precisely specify the parts of the code that should be evaluated on the "main" Julia process (the one you interact -with), and the code that shold be evaluated on the remote workers. Basically, +with), and the code that should be evaluated on the remote workers. Basically, all quoted code is going to get to the workers without any evaluation; all other code is evaluated on the main node. 
@@ -144,7 +144,7 @@ The simplest way to prevent such data races is to `fetch` the future returned from `save_at`, which correctly waits until the result is properly available on the target worker. -This *synchronization is not performed by default*, because the non-syncronized +This *synchronization is not performed by default*, because the non-synchronized behavior allows you to very easily implement parallelism. In particular, you may start multiple asynchronous computations at once, and then wait for all of them to complete to make sure all results are available. Because the operations @@ -171,7 +171,7 @@ nothing … ``` -The same is applicable for retrieving the sub-results parallelly. This example +The same is applicable for retrieving the sub-results in parallel. This example demonstrates that multiple workers can do some work at the same time: ```julia @@ -186,7 +186,7 @@ julia> @time fetch.([get_from(i, :(begin sleep(1); myid(); end)) Notably, you can even send individual `Future`s to other workers, allowing the workers to synchronize and transfer the data among each other. This is -beneficial for implementing advanced parallel algoritms. +beneficial for implementing advanced parallel algorithms. ### `Dinfo` handles @@ -198,7 +198,7 @@ functions are able to work with `Dinfo` transparently, instead of the "raw" symbols and worker lists. For example, you can use [`scatter_array`](@ref) to automatically separate the -array-like dataset to roughly-same pieces scattered accross multiple workers, +array-like dataset to roughly-same pieces scattered across multiple workers, and obtain the `Dinfo` object: ```julia julia> dataset = scatter_array(:myData, randn(1000,3), workers()) @@ -394,13 +394,13 @@ are applicable in many other areas of data analysis: the same matrix, the data groups are specified by a distributed integer vector (This is useful e.g. for computing per-cluster statistics, in which case the integer vector should assign individual data entries to clusters.) -- [`dcount`](@ref) counts the numbers of ocurrences of items in an integer +- [`dcount`](@ref) counts the numbers of occurrences of items in an integer vector, similar to e.g. R function `tabulate` - [`dcount_buckets`](@ref) does the same per groups - [`dscale`](@ref) scales the selected columns to mean 0 and standard deviation 1 - [`dmedian`](@ref) computes a median of the selected columns of the dataset - (The computation is done using an approximative iterative algorithm in time + (The computation is done using an approximate iterative algorithm in time `O(n*iters)`, which scales even to really large datasets. The precision of the result increases by roughly 1 bit per iteration, the default is 20 iterations.) From b19f7e8651a8ff13483cae78ab44d9995a32a4a1 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Sat, 23 Jan 2021 11:14:54 +0100 Subject: [PATCH 08/11] reformat the dstore docs --- docs/src/tutorial.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/src/tutorial.md b/docs/src/tutorial.md index 5764e54..4009942 100644 --- a/docs/src/tutorial.md +++ b/docs/src/tutorial.md @@ -361,17 +361,17 @@ main memory. The available functions are as follows: - [`dstore`](@ref) saves the dataset to a disk, such as in -```julia -julia> dstore(anotherDataset) -``` - ...which, in this case, creates files `newData-1.slice` to `newData-3.slice` - that contain the respective parts of the dataset. The precise naming scheme - can be specified using the `files` parameter. 
- [`dload`](@ref) loads the data back into memory (again using a `Dinfo` + parameter with dataset description to get the dataset name and the list of + relevant workers) - [`dunlink`](@ref) removes the corresponding files from the storage Apart from saving the data for later use, this provides a relatively easy way -of exchanging data between nodes in a HPC environment. There, the disk storage +to exchange the data among nodes in an HPC environment. There, the disk storage is usually a very fast "scratch space" that is shared among all participants of a computation, and can be used to "broadcast" or "shuffle" the data without any significant overhead. From 8b34e7c01c67e0bf5220ce7af2f61e6a9e0 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Sat, 23 Jan 2021 11:36:28 +0100 Subject: [PATCH 09/11] logo --- docs/src/assets/logo.svg | 126 +++++++++++++++++++++++++++++++++++++++ docs/src/index.md | 3 +- 2 files changed, 128 insertions(+), 1 deletion(-) create mode 100644 docs/src/assets/logo.svg diff --git a/docs/src/assets/logo.svg b/docs/src/assets/logo.svg new file mode 100644 index 0000000..9b05cb9 --- /dev/null +++ b/docs/src/assets/logo.svg @@ -0,0 +1,126 @@ + [126 lines of `image/svg+xml` logo markup] diff --git a/docs/src/index.md b/docs/src/index.md index 2247cf9..3431183 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,4 +1,5 @@ -# DiDa.jl - simple work with distributed data + +# DiDa.jl — simple work with distributed data This package provides relatively simple distributed data manipulation and processing routines for Julia. From 6f19f53fd995e460a82d852ebfea816b1c980f2 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Mon, 25 Jan 2021 10:17:39 +0100 Subject: [PATCH 10/11] update README --- README.md | 28 +++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index a944f11..d40711d 100644 --- a/README.md +++ b/README.md @@ -3,19 +3,29 @@ Simple distributed data manipulation and processing routines for Julia. This was originally developed for -[GigaSOM.jl](https://github.com/LCSB-BioCore/GigaSOM.jl), this package contains -the separated-out lightweight distributed-processing framework that can be used -with GigaSOM. +[`GigaSOM.jl`](https://github.com/LCSB-BioCore/GigaSOM.jl); the DiDa.jl package +contains the separated-out lightweight distributed-processing framework that +was used in `GigaSOM.jl`. ## Why? -This provides a very simple, imperative and straightforward way to move your -data around a cluster of Julia processes created by the `Distributed` package, +DiDa.jl provides a very simple, imperative and straightforward way to move your +data around a cluster of Julia processes created by the +[`Distributed`](https://docs.julialang.org/en/v1/stdlib/Distributed/) package, and run computation on the distributed data pieces. The main aim of the package -is to avoid anything complicated-- the first version used in GigaSOM had just -under 500 lines of relatively straightforward code with comments. +is to avoid anything complicated -- the first version used in +[GigaSOM](https://github.com/LCSB-BioCore/GigaSOM.jl) had just under 500 lines +of relatively straightforward code (including the doc-comments). -Most importantly, distributed processing should be simple and accessible. +Compared to the plain `Distributed` API, you get more straightforward data +manipulation primitives, some extra control over the precise place where code +is executed, and a few high-level functions. These include a distributed +version of `mapreduce`, a simpler work-alike of the +[DistributedArrays](https://github.com/JuliaParallel/DistributedArrays.jl) +functionality, and easy-to-use distributed dataset saving and loading. + +Most importantly, the main motivation behind the package is that +distributed processing should be simple and accessible.
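As a rough sketch of what this means in practice (illustrative only -- the `Distributed` calls are the standard ones, and the DiDa.jl functions are described below):

```julia
using Distributed

# plain Distributed: you juggle remotecall/fetch and anonymous Futures yourself
f = remotecall(randn, 2, 1000, 1000)  # run randn(1000, 1000) on worker 2
x = fetch(f)                          # hauls the whole matrix back

# DiDa.jl: the data stays on worker 2 under a name you chose
save_at(2, :x, :(randn(1000, 1000)))
get_val_from(2, :(sum(x)))            # fetch only the small result
```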
+is to avoid anything complicated-- the first version used in +[GigaSOM](https://github.com/LCSB-BioCore/GigaSOM.jl) had just under 500 lines +of relatively straightforward code (including the doc-comments). + +Compared to plain `Distributed` API, you get more straightforward data +manipulation primitives, some extra control over the precise place where code +is executed, and a few high-level functions. These include a distributed +version of `mapreduce`, simpler work-alike of the +[DistributedArrays](https://github.com/JuliaParallel/DistributedArrays.jl) +functionality, and easy-to-use distributed dataset saving and loading. + +Most importantly, the main motivation behind the package is that the +distributed processing should be simple and accessible. ## Brief how-to From 0c3671d3bc126f6dfd24c94e6314fcea61043f15 Mon Sep 17 00:00:00 2001 From: Mirek Kratochvil Date: Mon, 25 Jan 2021 10:25:46 +0100 Subject: [PATCH 11/11] autobuild docs with actions --- .github/workflows/docs.yml | 26 ++++++++++++++++++++++++++ docs/make.jl | 2 +- 2 files changed, 27 insertions(+), 1 deletion(-) create mode 100644 .github/workflows/docs.yml diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml new file mode 100644 index 0000000..5c48c91 --- /dev/null +++ b/.github/workflows/docs.yml @@ -0,0 +1,26 @@ +# ref: https://juliadocs.github.io/Documenter.jl/stable/man/hosting/#GitHub-Actions-1 +name: Documentation + +on: + push: + branches: + - develop + tags: '*' + pull_request: + release: + types: [published, created] + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + - uses: julia-actions/setup-julia@latest + with: + version: 1.5 + - name: Install dependencies + run: julia --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()' + - name: Build and deploy + env: + DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key + run: julia --project=docs/ docs/make.jl diff --git a/docs/make.jl b/docs/make.jl index 2cc314f..4b2d7bf 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -17,6 +17,6 @@ deploydocs( repo = "github.com/LCSB-BioCore/DiDa.jl.git", target = "build", branch = "gh-pages", - devbranch = "master", + devbranch = "develop", versions = "stable" => "v^", )