Merged
26 changes: 26 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,26 @@
# ref: https://juliadocs.github.io/Documenter.jl/stable/man/hosting/#GitHub-Actions-1
name: Documentation

on:
  push:
    branches:
      - develop
    tags: '*'
  pull_request:
  release:
    types: [published, created]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: julia-actions/setup-julia@latest
        with:
          version: '1.5' # quoted so YAML does not coerce the version to a float
      - name: Install dependencies
        run: julia --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
      - name: Build and deploy
        env:
          DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key
        run: julia --project=docs/ docs/make.jl
134 changes: 130 additions & 4 deletions README.md
@@ -1,9 +1,135 @@
# DiDa.jl

Simple Distributed Data manipulation and processing routines for Julia.
Simple distributed data manipulation and processing routines for Julia.

This was originally developed for
[GigaSOM.jl](https://github.com/LCSB-BioCore/GigaSOM.jl), this package contains
the separated-out lightweight distributed-processing framework that can be used
with GigaSOM.
[`GigaSOM.jl`](https://github.com/LCSB-BioCore/GigaSOM.jl); the DiDa.jl
package contains the separated-out lightweight distributed-processing
framework that was used in `GigaSOM.jl`.

## Why?

DiDa.jl provides a very simple, imperative and straightforward way to move your
data around a cluster of Julia processes created by the
[`Distributed`](https://docs.julialang.org/en/v1/stdlib/Distributed/) package,
and run computation on the distributed data pieces. The main aim of the package
is to avoid anything complicated; the first version used in
[GigaSOM](https://github.com/LCSB-BioCore/GigaSOM.jl) had just under 500 lines
of relatively straightforward code (including the doc-comments).

Compared to the plain `Distributed` API, you get more straightforward data
manipulation primitives, some extra control over the precise place where code
is executed, and a few high-level functions. These include a distributed
version of `mapreduce`, a simpler work-alike of the
[DistributedArrays](https://github.com/JuliaParallel/DistributedArrays.jl)
functionality, and easy-to-use distributed dataset saving and loading.

Most importantly, the main motivation behind the package is that distributed
processing should be simple and accessible.

## Brief how-to

The package provides a few very basic primitives that lightly wrap the
`Distributed` package functions `remotecall` and `fetch`. The most basic one is
`save_at`, which takes a worker ID, variable name and variable content, and
saves the content to the variable on the selected worker. `get_from` works the
same way, but takes the data back from the worker.

You can thus send some random array to a few distributed workers:

```julia
julia> using Distributed, DiDa

julia> addprocs(2)
2-element Array{Int64,1}:
2
3

julia> @everywhere using DiDa

julia> save_at(2, :x, randn(10,10))
Future(2, 1, 4, nothing)
```

The `Future` returned from `save_at` is a normal Julia future from
`Distributed`; you can even `fetch` it to wait until the operation has really
finished on the other side. Fetching the data is done the same way:

```julia
julia> get_from(2,:x)
Future(2, 1, 15, nothing)

julia> get_val_from(2,:x) # auto-fetch()ing variant
10×10 Array{Float64,2}:
-0.850788 0.946637 1.78006 …
-0.49596 0.497829 -2.03013
```
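
For example, you can block until a remote save has completed by fetching the
returned `Future` (a small sketch, assuming worker 2 exists as above):

```julia
julia> fetch(save_at(2, :y, :(randn(100)))) # returns only after the assignment finishes on worker 2
```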

All commands support full quoting, which allows you to easily distinguish
between code parts that are executed locally and remotely:

```julia
julia> save_at(3, :x, randn(1000,1000)) # generates a matrix locally and sends it to the remote worker

julia> save_at(3, :x, :(randn(1000,1000))) # generates a matrix right on the remote worker and saves it there

julia> get_val_from(3, :x) # retrieves the generated matrix and fetches it

julia> get_val_from(3, :(randn(1000,1000))) # generates the matrix on the worker and fetches the data
```

Notably, this is different from the approach taken by `DistributedArrays` and
similar packages -- all data manipulation is explicit, and any data type is
supported as long as it can be moved among workers by the `Distributed`
package. This helps with various highly non-array-ish data, such as large text
corpora and graphs.
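
As a hedged sketch of the non-array use case, the same primitives can move
and query plain `Dict`s (or any other serializable value); the data below and
the variable name `counts` are made up for illustration:

```julia
# assumes workers 2 and 3 exist (addprocs(2); @everywhere using DiDa)
save_at(2, :counts, Dict("cat" => 3, "dog" => 1))
save_at(3, :counts, Dict("cat" => 2, "fish" => 5))

# run a computation on each remote piece and combine the results locally
total = sum(get_val_from(w, :(sum(values(counts)))) for w in [2, 3])
```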

There are various goodies for easy work with matrix-style data, namely
scattering, gathering and running distributed algorithms:

```julia
julia> x = randn(1000,3)
1000×3 Array{Float64,2}:
-0.992481 0.551064 1.67424
-0.751304 -0.845055 0.105311
-0.712687 0.165619 -0.469055

julia> dataset = scatter_array(:myDataset, x, workers()) # sends slices of the array to workers
Dinfo(:myDataset, [2, 3]) # a helper for holding the variable name and the used workers together

julia> get_val_from(3, :(size(myDataset)))
(500, 3) # there's really only half of the data

julia> dmapreduce(dataset, sum, +) # MapReduce-style sum of all data
-51.64369103751014

julia> dstat(dataset, [1,2,3]) # get means and sdevs in individual columns
([-0.030724038974465212, 0.007300925745200863, -0.028220577808245786],
[0.9917470012495775, 0.9975120525455358, 1.000243845434252])

julia> dmedian(dataset, [1,2,3]) # distributed iterative median in columns
3-element Array{Float64,1}:
0.004742259615849834
0.039043266340824986
-0.05367799062404967

julia> dtransform(dataset, x -> 2 .^ x) # exponentiate all data (medians should now be around 1)
Dinfo(:myDataset, [2, 3])

julia> gather_array(dataset) # download the data from workers to a single local array
1000×3 Array{Float64,2}:
0.502613 1.46517 3.1915
0.594066 0.55669 1.07573
0.610183 1.12165 0.722438
```

## What does the name `DiDa` mean?

**Di**stributed **Da**ta.

There is no consensus on how to pronounce the abbreviation.
3 changes: 3 additions & 0 deletions docs/Project.toml
@@ -0,0 +1,3 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterTools = "35a29f4d-8980-5a13-9543-d66fff28ecb8"
22 changes: 22 additions & 0 deletions docs/make.jl
@@ -0,0 +1,22 @@
using Documenter, DiDa

makedocs(modules = [DiDa],
    clean = false,
    format = Documenter.HTML(prettyurls = !("local" in ARGS)),
    sitename = "DiDa.jl",
    authors = "The developers of DiDa.jl",
    linkcheck = !("skiplinks" in ARGS),
    pages = [
        "Home" => "index.md",
        "Tutorial" => "tutorial.md",
        "Functions" => "functions.md",
    ],
)

deploydocs(
    repo = "github.com/LCSB-BioCore/DiDa.jl.git",
    target = "build",
    branch = "gh-pages",
    devbranch = "develop",
    versions = "stable" => "v^",
)
126 changes: 126 additions & 0 deletions docs/src/assets/logo.svg
29 changes: 29 additions & 0 deletions docs/src/functions.md
@@ -0,0 +1,29 @@
# Functions

## Data structures

```@autodocs
Modules = [DiDa]
Pages = ["structs.jl"]
```

## Base functions

```@autodocs
Modules = [DiDa]
Pages = ["base.jl"]
```

## Higher-level array operations

```@autodocs
Modules = [DiDa]
Pages = ["tools.jl"]
```

## Input/Output

```@autodocs
Modules = [DiDa]
Pages = ["io.jl"]
```
30 changes: 30 additions & 0 deletions docs/src/index.md
@@ -0,0 +1,30 @@

# DiDa.jl — simple work with distributed data

This package provides relatively simple distributed data manipulation and
processing routines for Julia.

The design of the package and the data manipulation approach is deliberately
"imperative" and "hands-on", giving the user as much influence as possible
over the actual way the data are moved and stored in the cluster. It uses the
`Distributed` package and its infrastructure of workers, and provides a few
very basic primitives that lightly wrap the `Distributed` package functions
`remotecall` and `fetch`.

There are also various extra functions to easily run distributed data
transformations and MapReduce-style algorithms, and to store and load the
data on worker-local storage (e.g. to prevent memory exhaustion).
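
A minimal sketch of the workflow, using the functions demonstrated in the
package README (the dataset name `:ds` and sizes are arbitrary):

```julia
using Distributed, DiDa
addprocs(2)
@everywhere using DiDa

d = scatter_array(:ds, randn(100, 2), workers()) # distribute the rows
dtransform(d, x -> abs.(x))                      # transform the data in place on the workers
total = dmapreduce(d, sum, +)                    # MapReduce-style distributed sum
```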

To start quickly, you can read the tutorial:

```@contents
Pages=["tutorial.md"]
```

### Functions

A full reference to all functions is given here:

```@contents
Pages = ["functions.md"]
```