enable batch execution of workflows #1929

Open · rokroskar opened this issue Feb 4, 2021 · 6 comments

rokroskar (Member) commented Feb 4, 2021

One of the goals of renku workflows is to allow a user to develop a robust, working pipeline by iterating quickly on a dataset locally (laptop, interactive session) and then send that workflow to a more capable resource to run on a bigger dataset or with parameters that require extra compute power.

A significant goal of the workflow KG representation was to allow serialization of workflow information into other formats. At the moment only the Common Workflow Language (CWL) is supported, but the same methods used to create CWL files can be extended to other workflow languages. One limitation of CWL is that there doesn't seem to be good support for running these workflows on either Kubernetes or HPC systems.

The goal of this epic is to serve as a roadmap for the implementation of a) the supporting DevOps infrastructure and b) the required code changes for a simple PoC of batch/remote workflow execution.

General use-case

A simple use-case might look something like this:

# develop the workflow
renku run <step1> <dataset>
renku run <step2> <dataset>

# update the dataset
renku dataset update --all

# run update remotely to avoid expensive local calculation
renku update --remote=cloud --all

# or use rerun to specify different parameters
renku rerun --edit-inputs --remote=cloud

The last two steps are identical to what the user can do now, except that they would run in the Kubernetes cluster. The steps should be sent to the workflow engine as a DAG expressed in whatever workflow language the backend requires. Some steps might run in parallel. Once all the steps have completed, the changes should be pushed back to the repository, just as a user would do when running those commands locally.
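
To make the DAG hand-off concrete, here is a minimal Python sketch of a backend-agnostic representation that a converter could walk to emit CWL, an Argo manifest, or another target format; the class and field names are hypothetical, not the actual renku-python data model.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    name: str
    command: List[str]     # the command recorded by renku run
    inputs: List[str]
    outputs: List[str]

@dataclass
class WorkflowDAG:
    steps: List[Step] = field(default_factory=list)

    def dependencies(self, step: Step) -> List[Step]:
        # A step depends on every step that produces one of its inputs.
        return [s for s in self.steps
                if s is not step and set(s.outputs) & set(step.inputs)]

# The two recorded steps above, ready to be handed to a backend converter.
dag = WorkflowDAG(steps=[
    Step("step1", ["<step1>", "<dataset>"], inputs=["<dataset>"], outputs=["intermediate/"]),
    Step("step2", ["<step2>", "<dataset>"], inputs=["intermediate/"], outputs=["results/"]),
])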

An analogous flow might be envisioned from the web UI, where the page showing the overview of the project's assets could inform the user that some workflow outputs are out of date and offer the option to update them automatically.

Issues to consider

There are several issues to consider (in no particular order of importance):

  • serialization to a different workflow format/language/syntax
  • minimize data I/O - ideally, the data required for the calculations would be pulled once and shared between the steps - on Kubernetes this poses a potential difficulty for parallel steps because of issues around multi-attach volumes
  • ability to run steps in parallel and combine their results automatically before a dependent step (see the sketch after this list)
  • UX of configuring remote resources and setting defaults - for example, a default could be to run on the same cluster as the RenkuLab instance, but a user may choose to specify a custom resource using a standard interface (e.g. an HPC cluster via SSH and Slurm or LSF)
  • providing some feedback about the status of the remote execution, especially access to error logs
  • the remote workflow should run in a Docker container that includes all of the software dependencies specified in the project - we should consider making "batch" Docker images that don't include the entire Jupyter stack, to minimize the time it takes to launch containers
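
On the parallelism point, the grouping of independent steps could be derived from the dependency graph alone. A minimal Python sketch (the step names are hypothetical and no particular backend is assumed):

from typing import Dict, List, Set

def parallel_levels(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    # dependencies maps a step name to the names of the steps it depends on.
    # Steps in the same level have no unmet dependencies and can run in parallel.
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    levels: List[List[str]] = []
    while remaining:
        ready = [step for step, deps in remaining.items() if not deps]
        if not ready:
            raise ValueError("cycle detected in the workflow graph")
        levels.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels

# step1a and step1b are independent; step2 combines their outputs.
print(parallel_levels({"step1a": set(), "step1b": set(), "step2": {"step1a", "step1b"}}))
# -> [['step1a', 'step1b'], ['step2']]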

Building a PoC

The result of this epic should be a general architecture for running remote workflows and a PoC that implements that architecture for some subset of the above functionality using a specific workflow backend. One obvious choice for Kubernetes is Argo Workflows. Other potential options:

rokroskar added the Epic label Feb 4, 2021
Panaetius (Member) commented:

renku update --remote=cloud --all: would this terminate immediately or only once the whole workflow has finished? I can see good reasons for either approach (this also matters for providing some feedback about the status of the remote execution).

I'd like the solution to this to use pluggy (even if our implementations live in renku/core/) to make it easily extensible. For the config, I think it can go into renku.ini in a section specific to each backend. We should also support multiple backends of the same type (similar to kubeconfig contexts), as I'm sure there are users with access to multiple k8s clusters, for instance. The plugins should also provide a wizard to set up the cluster in a separate command (like renku workflow setup argo) that guides the user through configuring the cluster.
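
To illustrate the pluggy suggestion, here is a minimal sketch of what a backend hook specification and one implementation could look like; the project name, hook names, and the ArgoBackend class are all hypothetical, not the actual renku-python plugin interface.

import pluggy

hookspec = pluggy.HookspecMarker("renku_workflow")
hookimpl = pluggy.HookimplMarker("renku_workflow")

class WorkflowBackendSpec:
    @hookspec
    def backend_name(self):
        """Name used for the renku.ini section and the --remote option."""

    @hookspec
    def execute(self, dag, config):
        """Submit the DAG to the backend and return an execution id."""

class ArgoBackend:
    @hookimpl
    def backend_name(self):
        return "argo"

    @hookimpl
    def execute(self, dag, config):
        # Translate the DAG to an Argo Workflow manifest and submit it here.
        return "argo-run-42"

pm = pluggy.PluginManager("renku_workflow")
pm.add_hookspecs(WorkflowBackendSpec)
pm.register(ArgoBackend())
print(pm.hook.execute(dag={}, config={"cluster": "cloud"}))  # -> ['argo-run-42']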

Nextflow could also be a contender; it ticks many of the same boxes as Argo and Snakemake. Nextflow and Toil both seem popular in the bioinformatics space (maybe ask @ksanao).

I'm wondering if update and rerun are the right semantics for this. They definitely support the use case, but I don't know if they are exactly what a user wants. I'm thinking of a user workflow where you have a small dummy dataset for developing and a big real dataset for running in the cluster. You don't just develop on the dummy dataset and, once everything works, run it on the big dataset and call it done. Rather, you'd probably go back and forth between the small and big dataset over time, adding things as needed, extending the analysis, etc.

renku update is a bad fit for this, as you'd have to swap out the datasets all the time for things to work. renku rerun would work with --edit-inputs, but we don't have --edit-outputs yet, and once you rerun on the big dataset, a subsequent renku update would point to the big dataset, which might not be what the user wants.

I feel like we need a separate command whose semantics are basically "Execute a new workflow with these inputs and these outputs, but reuse the workflow template from this other execution". Though I have to think about what such a command would look like and how exactly to represent that in our metadata. This could also be a good candidate for the "Execute this workflow from some other repository in my repo with my data" use case.

rokroskar (Member, Author) commented:

Thanks @Panaetius, you're absolutely right; I forgot to add Nextflow to the list and will do so now.

Regarding the command semantics: yes, you're right, I was being a bit too myopic. We definitely need to support a different kind of command here; workflow execute?

renku run ... <output>
renku workflow create --name <workflow> <output>
renku workflow execute --remote=cloud <workflow> <parameter-list?>

Here we could allow for seamlessly using workflow templates from other projects (or even other instances), e.g.

renku workflow execute --remote=cloud http://renkulab.io/workflows/<id>

Running it without a parameter list would prompt you for whatever inputs need to be specified. This starts to bleed a bit into SwissDataScienceCenter/renku-python#1553 and probably other open issues.

Re: async or sync: ideally it would be possible to do this asynchronously, with waiting for completion as a special case. Since it's to be used from the UI, async needs to be supported, but starting with sync mode might be sufficient for the PoC.
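
A possible shape for that, sketched in Python with hypothetical helpers (submit and status are placeholders for whatever the backend plugin provides): asynchronous submission returns an execution id immediately, and synchronous mode is just polling on top of it.

import time

def submit(dag, remote):
    """Placeholder: send the DAG to the remote backend, return an execution id."""

def status(execution_id):
    """Placeholder: return 'running', 'succeeded', or 'failed'."""

def execute(dag, remote="cloud", wait=False, poll_seconds=30):
    execution_id = submit(dag, remote)
    if not wait:
        return execution_id          # async: hand the id back right away
    while status(execution_id) == "running":
        time.sleep(poll_seconds)     # sync: block until the backend finishes
    return execution_id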

m-alisafaee (Contributor) commented:

Thanks @rokroskar and @Panaetius! This looks pretty good!

ksanao commented Apr 30, 2021

RenkuLab use case for workflow execution on HPC: iSEE Dashboard Data

Context
This repository contains 3 workflows that fetch single-cell omics data from external sources:

  1. PBMC data from 10X Genomics,
  2. a dataset of 379 mouse brain cells from Tasic et al. 2016 (ReprocessedAllenData),
  3. single-cell tumor data GSE62944 from ExperimentHub.

Each workflow consists of a single task that runs Rscript on an Rmd input file (in the processing_scripts folder) and produces a processed data file in .rds format and, optionally, a configuration file in R format. A dataset is then created from the outputs of each workflow. The commands for these steps are listed in create_datasets_code.sh.

Problem
While the first two workflows run smoothly and produce the desired output, the third workflow, which fetches the larger single-cell tumor data GSE62944 from ExperimentHub, runs out of the resources available on the RenkuLab instance. Typical single-cell omics datasets are of this size or bigger (including the benchmarking datasets planned for the OMNIBENCHMARK system). There is a need to execute the workflows that fetch large omics data on more powerful HPC compute resources and to bring back the resulting processed files and the corresponding metadata.

Desired solution
Execute the workflow created with the command below on a remote compute resource. Expected outputs:

  • processed_data/sce-tcga-gse62944-isee.rds
  • processed_data/sce-tcga-gse62944-isee.h5
renku run --name "sce-tcga-gse62944-isee" --input "processing_scripts/process-tcga-gse62944.Rmd" Rscript -e "rmarkdown::render('processing_scripts/process-tcga-gse62944.Rmd')" 
