Merged
26 changes: 26 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,26 @@
# ref: https://juliadocs.github.io/Documenter.jl/stable/man/hosting/#GitHub-Actions-1
name: Documentation

on:
  push:
    branches:
      - develop
    tags: '*'
  pull_request:
  release:
    types: [published, created]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: julia-actions/setup-julia@latest
        with:
          version: '1.5' # quoted so YAML does not coerce the version to a float
      - name: Install dependencies
        run: julia --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
      - name: Build and deploy
        env:
          DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key
        run: julia --project=docs/ docs/make.jl
134 changes: 130 additions & 4 deletions README.md
@@ -1,9 +1,135 @@
# DiDa.jl

Simple Distributed Data manipulation and processing routines for Julia.
Simple distributed data manipulation and processing routines for Julia.

This was originally developed for
[GigaSOM.jl](https://github.com/LCSB-BioCore/GigaSOM.jl), this package contains
the separated-out lightweight distributed-processing framework that can be used
with GigaSOM.
[`GigaSOM.jl`](https://github.com/LCSB-BioCore/GigaSOM.jl); the DiDa.jl
package contains the separated-out lightweight distributed-processing
framework that was used in `GigaSOM.jl`.

## Why?

DiDa.jl provides a very simple, imperative and straightforward way to move your
data around a cluster of Julia processes created by the
[`Distributed`](https://docs.julialang.org/en/v1/stdlib/Distributed/) package,
and run computation on the distributed data pieces. The main aim of the package
is to avoid anything complicated; the first version used in
[GigaSOM](https://github.com/LCSB-BioCore/GigaSOM.jl) had just under 500 lines
of relatively straightforward code (including the doc-comments).

Compared to the plain `Distributed` API, you get more straightforward data
manipulation primitives, some extra control over the precise place where code
is executed, and a few high-level functions. These include a distributed
version of `mapreduce`, a simpler work-alike of the
[DistributedArrays](https://github.com/JuliaParallel/DistributedArrays.jl)
functionality, and easy-to-use distributed dataset saving and loading.

Most importantly, the main motivation behind the package is that distributed
processing should be simple and accessible.

## Brief how-to

The package provides a few very basic primitives that lightly wrap the
`Distributed` package functions `remotecall` and `fetch`. The most basic one is
`save_at`, which takes a worker ID, variable name and variable content, and
saves the content to the variable on the selected worker. `get_from` works the
same way, but takes the data back from the worker.

You can thus send some random array to a few distributed workers:

```julia
julia> using Distributed, DiDa

julia> addprocs(2)
2-element Array{Int64,1}:
2
3

julia> @everywhere using DiDa

julia> save_at(2, :x, randn(10,10))
Future(2, 1, 4, nothing)
```

The `Future` returned from `save_at` is a normal Julia future from
`Distributed`; you can even `fetch` it to wait until the operation has really
finished on the other side. Fetching the data is done the same way:

```julia
julia> get_from(2,:x)
Future(2, 1, 15, nothing)

julia> get_val_from(2,:x) # auto-fetch()ing variant
10×10 Array{Float64,2}:
-0.850788 0.946637 1.78006 …
-0.49596 0.497829 -2.03013
```
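
For example, you can block until a remote save has completed by fetching the
returned `Future` (a small sketch, assuming worker 2 exists as above):

```julia
julia> fetch(save_at(2, :y, :(randn(100)))) # returns only after the assignment finishes on worker 2
```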

All commands support full quoting, which allows you to easily distinguish
between code parts that are executed locally and remotely:

```julia
julia> save_at(3, :x, randn(1000,1000)) # generates a matrix locally and sends it to the remote worker

julia> save_at(3, :x, :(randn(1000,1000))) # generates a matrix right on the remote worker and saves it there

julia> get_val_from(3, :x) # retrieves the generated matrix and fetches it

julia> get_val_from(3, :(randn(1000,1000))) # generates the matrix on the worker and fetches the data
```

Notably, this is different from the approach taken by `DistributedArrays` and
similar packages -- all data manipulation is explicit, and any data type is
supported as long as it can be moved among workers by the `Distributed`
package. This helps with various highly non-array-ish data, such as large text
corpora and graphs.
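
As a hedged sketch of the non-array use case, the same primitives can move
and query plain `Dict`s (or any other serializable value); the data below and
the variable name `counts` are made up for illustration:

```julia
# assumes workers 2 and 3 exist (addprocs(2); @everywhere using DiDa)
save_at(2, :counts, Dict("cat" => 3, "dog" => 1))
save_at(3, :counts, Dict("cat" => 2, "fish" => 5))

# run a computation on each remote piece and combine the results locally
total = sum(get_val_from(w, :(sum(values(counts)))) for w in [2, 3])
```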

There are various goodies for easy work with matrix-style data, namely
scattering, gathering and running distributed algorithms:

```julia
julia> x = randn(1000,3)
1000×3 Array{Float64,2}:
-0.992481 0.551064 1.67424
-0.751304 -0.845055 0.105311
-0.712687 0.165619 -0.469055

julia> dataset = scatter_array(:myDataset, x, workers()) # sends slices of the array to workers
Dinfo(:myDataset, [2, 3]) # a helper for holding the variable name and the used workers together

julia> get_val_from(3, :(size(myDataset)))
(500, 3) # there's really only half of the data

julia> dmapreduce(dataset, sum, +) # MapReduce-style sum of all data
-51.64369103751014

julia> dstat(dataset, [1,2,3]) # get means and sdevs in individual columns
([-0.030724038974465212, 0.007300925745200863, -0.028220577808245786],
[0.9917470012495775, 0.9975120525455358, 1.000243845434252])

julia> dmedian(dataset, [1,2,3]) # distributed iterative median in columns
3-element Array{Float64,1}:
0.004742259615849834
0.039043266340824986
-0.05367799062404967

julia> dtransform(dataset, x -> 2 .^ x) # exponentiate all data (medians should now be around 1)
Dinfo(:myDataset, [2, 3])

julia> gather_array(dataset) # download the data from workers to a single local array
1000×3 Array{Float64,2}:
0.502613 1.46517 3.1915
0.594066 0.55669 1.07573
0.610183 1.12165 0.722438
```

## What does the name `DiDa` mean?

**Di**stributed **Da**ta.

There is no consensus on how to pronounce the abbreviation.
3 changes: 3 additions & 0 deletions docs/Project.toml
@@ -0,0 +1,3 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterTools = "35a29f4d-8980-5a13-9543-d66fff28ecb8"
22 changes: 22 additions & 0 deletions docs/make.jl
@@ -0,0 +1,22 @@
using Documenter, DiDa

makedocs(modules = [DiDa],
    clean = false,
    format = Documenter.HTML(prettyurls = !("local" in ARGS)),
    sitename = "DiDa.jl",
    authors = "The developers of DiDa.jl",
    linkcheck = !("skiplinks" in ARGS),
    pages = [
        "Home" => "index.md",
        "Tutorial" => "tutorial.md",
        "Functions" => "functions.md",
    ],
)

deploydocs(
    repo = "github.com/LCSB-BioCore/DiDa.jl.git",
    target = "build",
    branch = "gh-pages",
    devbranch = "develop",
    versions = "stable" => "v^",
)
126 changes: 126 additions & 0 deletions docs/src/assets/logo.svg
29 changes: 29 additions & 0 deletions docs/src/functions.md
@@ -0,0 +1,29 @@
# Functions

## Data structures

```@autodocs
Modules = [DiDa]
Pages = ["structs.jl"]
```

## Base functions

```@autodocs
Modules = [DiDa]
Pages = ["base.jl"]
```

## Higher-level array operations

```@autodocs
Modules = [DiDa]
Pages = ["tools.jl"]
```

## Input/Output

```@autodocs
Modules = [DiDa]
Pages = ["io.jl"]
```
30 changes: 30 additions & 0 deletions docs/src/index.md
@@ -0,0 +1,30 @@

# DiDa.jl — simple work with distributed data

This package provides relatively simple distributed data manipulation and
processing routines for Julia.

The design of the package and the data manipulation approach is deliberately
"imperative" and "hands-on", giving the user as much influence as possible
over the actual way the data are moved and stored in the cluster. It uses the
`Distributed` package and its infrastructure of workers, and provides a few
very basic primitives that lightly wrap the `Distributed` package functions
`remotecall` and `fetch`.

There are also various extra functions to easily run distributed data
transformations and MapReduce-style algorithms, and to store and load the
data on worker-local storage (e.g. to prevent memory exhaustion).
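
A minimal sketch of the workflow, using the functions demonstrated in the
package README (the dataset name `:ds` and sizes are arbitrary):

```julia
using Distributed, DiDa
addprocs(2)
@everywhere using DiDa

d = scatter_array(:ds, randn(100, 2), workers()) # distribute the rows
dtransform(d, x -> abs.(x))                      # transform the data in place on the workers
total = dmapreduce(d, sum, +)                    # MapReduce-style distributed sum
```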

To start quickly, you can read the tutorial:

```@contents
Pages=["tutorial.md"]
```

### Functions

A full reference to all functions is given here:

```@contents
Pages = ["functions.md"]
```