# Nextflow in Terra
<table align="left">
<td>
<a href="https://github.com/DataBiosphere/terra-axon-examples/blob/main/nextflow/nextflow_examples.ipynb">
<img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
    View on GitHub
</a>
</td>
<td>
<a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://github.com/DataBiosphere/terra-axon-examples/blob/main/nextflow/nextflow_examples.ipynb">
<img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
Open in a Terra notebook instance
</a>
</td>
</table>

## Introduction

### What is Nextflow?
Nextflow is an open-source workflow orchestrator that simplifies writing and deploying data-intensive computational pipelines on any infrastructure.  

More information and additional examples can be found in the [Nextflow documentation](https://nextflow.io/docs/latest/index.html).

### Objectives

This notebook is intended to demonstrate how you can use the Nextflow engine on Terra to execute and manage workflows. By running the cells in this notebook, you will be able to:
* Configure Nextflow to run on this cloud environment
* Run Nextflow workflows on actual datasets
* Check the status of submitted jobs
* Leverage the Google Cloud Life Sciences API to execute workflow tasks in parallel
* View the logs for each step of the workflows executed
* Clean up intermediate results to reduce cloud storage use

Note that all commands shown with the prefix `!` are shell commands, which you can also run from a command line such as a Jupyter terminal.

### Notebook setup <a id="workspace_setup"></a>
<div class="alert alert-block alert-info">
<b>Note:</b> This notebook assumes that you have run <a href="../workspace_setup.ipynb">workspace_setup.ipynb</a> in the parent directory to create necessary workspace resources. Please note that you can skip creating a BigQuery dataset as that won't be used in this notebook's examples.
</div>
Running the `workspace_setup.ipynb` notebook will create two Cloud Storage buckets for your workspace files with workspace reference names: 

 - ws_files   
 - ws_files_autodelete_after_two_weeks      
    
The code in this notebook will write output files to the "autodelete" bucket by default. Any file in this bucket will be automatically deleted two weeks after it is written. This alleviates the need for you to remember to clean up temporary and example files manually. If you want to write outputs to a durable location, simply change the assignment of the `BUCKET_REFERENCE` variable in the cell below and re-run the notebook. 

In [None]:
# Change this to "ws_files" to use the durable workspace bucket instead of the autodelete bucket.
BUCKET_REFERENCE = "ws_files_autodelete_after_two_weeks"

!terra resource resolve --name $BUCKET_REFERENCE || echo "Be sure to run workspace_setup.ipynb before this notebook."
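
The same reference can also be looked up from Python. The sketch below assumes that the workspace context exposes each resource as an environment variable named `TERRA_<reference name>`, which is the form used in the Nextflow configuration later in this notebook. `resolve_bucket_url` is a hypothetical helper, not part of the Terra CLI, and the variable may not be set in the notebook kernel itself, in which case the helper returns `None`.

```python
import os

def resolve_bucket_url(reference_name, env=None):
    """Return the value of TERRA_<reference_name>, or None if it is unset.

    Assumption: the workspace context exposes resources under this naming
    scheme. `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return env.get(f"TERRA_{reference_name}")

# Example with a stand-in environment (the bucket URL is made up).
example_env = {"TERRA_ws_files": "gs://example-bucket"}
print(resolve_bucket_url("ws_files", example_env))
```

Passing the environment as an argument keeps the helper easy to try out without touching the real workspace context.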

### Install `nf-core` tool

In order to interact with `nf-core` pipelines in this notebook,  you'll need to install the [`nf-core` companion tool](https://nf-co.re/tools), a command-line interface.

1. Run the first cell below to install the tool in your Terra cloud environment. 
2. Once installation has completed, navigate to the **Kernel** dropdown in the JupyterLab menu and select **Restart Kernel**. 
3. Once the kernel has restarted, run the second cell below to import the tool.

In [None]:
try:
    import nf_core
    print("nf-core is already installed")
except ImportError:
    print("Installing nf-core...")
    !pip install nf-core

In [None]:
try:
    import nf_core
    print("nf-core imported successfully")
except ImportError:
    print("Please restart the kernel before importing...")

### Environment variables

One advantage of running your workflows in Terra is the ability to inject parameters into your commands from the workspace context. You can pass parameter values such as your Google Cloud Storage bucket's URL or your service account's email address with convenient, readable names in CLI commands, sparing you the tedious and error-prone copying and pasting of values that would otherwise be required. (To view this workspace's environment variables and their corresponding values, you can run the command `terra app execute env | sort`.)

### Check environment variables

When running this notebook in a fresh cloud environment, you may find that your environment variables are not yet available in your workspace context. Run the cell below to check these variables. If an exception is raised, restart this cloud environment in the Terra UI.

In [None]:
from datetime import datetime
import os

def check_environment_variables():
    '''
    Check that the notebook VM's environment variables resolve to the 
    expected workspace context values.
    ***Remove once https://broadworkbench.atlassian.net/browse/PF-2215 is addressed.***
    '''
    if not (os.getenv('GOOGLE_CLOUD_PROJECT')
            and os.getenv('GOOGLE_SERVICE_ACCOUNT_EMAIL')
            and os.getenv('TERRA_USER_EMAIL')) :
        raise Exception('Restart your TVC Cloud Environment so that the TVC environment variables become available.')       
    else:
        print("Environment variables available as expected.")
check_environment_variables()
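
If the check above fails, it can help to know which variable is missing. The following is a minimal sketch of a variant that reports the missing names instead of raising. `missing_environment_variables` is a hypothetical helper introduced for illustration, and the example values are made up.

```python
import os

REQUIRED_VARIABLES = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_SERVICE_ACCOUNT_EMAIL",
    "TERRA_USER_EMAIL",
)

def missing_environment_variables(required=REQUIRED_VARIABLES, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Stand-in environment with one variable missing (values are made up).
example_env = {
    "GOOGLE_CLOUD_PROJECT": "example-project",
    "TERRA_USER_EMAIL": "user@example.com",
}
print(missing_environment_variables(env=example_env))
```

Calling `missing_environment_variables()` with no arguments inspects the real environment of the notebook kernel.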

Run the cell below to capture the <a href="https://cloud.google.com/life-sciences/docs/concepts/locations">geographic location of the physical resources</a> where your cloud environment exists. You will provide this value when configuring jobs to run via the Google Life Sciences API in the exercises that follow. 

In [None]:
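import subprocess

# Query the VM metadata server for this instance's zone, then trim it to a region name.
REGION=subprocess.run("curl -s 'http://metadata.google.internal/computeMetadata/v1/instance/zone' -H 'Metadata-Flavor: Google' \
  | sed -e 's#^.*/##' -e 's#-[a-z]$##'", stdout=subprocess.PIPE, shell=True)
# Strip surrounding whitespace so the value can be embedded in the Nextflow config.
REGION=str(REGION.stdout, 'utf-8').strip()
print(REGION)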
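
The two `sed` expressions in the cell above keep the text after the last `/` and then drop the trailing zone letter. As a rough illustration, the same transformation could be written in pure Python. `zone_path_to_region` is a hypothetical helper, and the example inputs are made up, following the `projects/<number>/zones/<zone>` form returned by the metadata server.

```python
def zone_path_to_region(zone_path):
    """Convert 'projects/<number>/zones/us-central1-a' to 'us-central1'."""
    zone = zone_path.strip().rsplit("/", 1)[-1]  # keep text after the last '/'
    region, _, suffix = zone.rpartition("-")     # split off the final '-x' part
    # Mirror the sed expression: only drop a single lowercase-letter suffix.
    if region and len(suffix) == 1 and suffix.isalpha() and suffix.islower():
        return region
    return zone

print(zone_path_to_region("projects/123456/zones/us-central1-a"))
```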

### Do I need to install `Nextflow` or other dependencies in my Terra workspace?
By default, Nextflow is installed in Terra workspaces, so you don't need to install anything to complete the exercises in this notebook.

## Example 1: Run a Nextflow workflow

In this exercise, you'll run a Nextflow workflow on sample human data. Running this example will typically incur less than $1 in cloud costs.

If desired, you can preview a fully-executed version of this notebook section by viewing [this notebook snapshot](https://terra-preprod-ui-terra.api.verily.com/workspaces/getting-started-with-workflows-workspace/resources/21ffa484-75b9-46ff-b70d-6ebeb2962eb2/notebook_snapshots/nextflow_examples_first_exercise_fully_executed.html). 

### Clone Nextflow RNASeq project

<div class="alert alert-block alert-success">
<b>Note:</b> 
To access private GitHub repositories from your Terra workspace, you'll need to <a href="https://et-docs-tests.googleplex.com/docs/how_to_guides/terra_ssh_key_guide/#set-up-your-tvc-ssh-key-with-github">set up an SSH key</a> to connect your GitHub and Terra accounts. This is not necessary for the exercises in this notebook because the repositories you will need are publicly available.
<b>Note:</b> Until <a href="https://verily.atlassian.net/browse/TERRA-358?atlOrigin=eyJpIjoiYWQ2YTNlYWQ5OWVhNDM2Zjk1OTYxMGQ4YzdkMzJjN2UiLCJwIjoiaiJ9">the Workflows Notebooks are migrated to a public repository</a>, you do need to have your Terra SSH key set up to run the exercises in this notebook.
</div>

Terra features built-in support for source control via [GitHub](https://github.com). Users who have set up the Terra SSH key can run `terra git` commands from their Terra workspaces without additional authentication.

In this exercise, you will use files from a public [GitHub repository](https://github.com/nextflow-io/rnaseq-nf.git) which contains a basic pipeline for quantification of genomic features from short read data and some data upon which to run the workflow.

Run the cell below to check whether the GitHub repo exists as a referenced resource in this workspace. **If your workspace isn't a clone of the [Getting Started with Workflows workspace](https://TODO),** the command below will add the referenced resource and clone it to your home directory. If you see an error message like:
```
fatal: destination path 'rnaseq-nf' already exists and is not an empty directory.
Git clone for https://github.com/nextflow-io/rnaseq-nf.git failed
```
that simply means your resource exists as we expect. *Do not* create another resource with a different name; just proceed to the following cell.

In [None]:
!terra resource resolve --name=rnaseq-nf || (terra resource add-ref git-repo \
    --name=rnaseq-nf \
    --description="Repository containing a Nextflow RNASeq pipeline and associated input data." \
    --repo-url='https://github.com/nextflow-io/rnaseq-nf.git')
!cd /home/jupyter/repos && terra git clone --resource=rnaseq-nf

### `Nextflow` configuration

#### Modify Nextflow configuration file

Next, run the cell below to inject the necessary parameters into the Nextflow configuration file and output its contents.  
In the `gls` entry, note the parameters set from the workspace context and from the values captured above: 
* `$TERRA_{BUCKET_REFERENCE}`,
* `{REGION}` (the Python variable set in the region cell above),
* `$GOOGLE_CLOUD_PROJECT` and 
* `$GOOGLE_SERVICE_ACCOUNT_EMAIL`. 

If you created the workspace bucket resource after you created this notebook instance (either via the [Terra workspace UI](https://terra-devel-ui-terra.api.verily.com/workspaces/emmarogge-workflows-ws) or Terra CLI commands), your bucket reference may not resolve correctly due to a stale workspace context cache (see [bug](https://broadworkbench.atlassian.net/browse/PF-2302)). To resolve the variable, run `terra resource list` to refresh the cached workspace context, then re-run the cell below.

Please note that `google.location` specifies the [Google location(s)](https://cloud.google.com/life-sciences/docs/concepts/locations) where the job executions are deployed to the Cloud Life Sciences API, whereas `google.region` specifies the [Google region(s)](https://cloud.google.com/compute/docs/regions-zones/) where the computation is executed on Compute Engine VMs.

In [None]:
import re

def configure_nextflow_for_terra():
    """
    Injects workspace context variables as parameters & configures Nextflow to use 
    Google's Cloud Life Sciences API to run the RNASeq pipeline.
    Provides Google Cloud URL for transcriptome and read files.
    """
    terra_gls_config = f"""gls {{
        params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
        params.reads = 'gs://rnaseq-nf/data/ggal/gut_{{1,2}}.fq'
        params.multiqc = 'gs://rnaseq-nf/multiqc'
        process.executor = 'google-lifesciences'
        process.container = 'nextflow/rnaseq-nf:latest'
        workDir = "${{TERRA_{BUCKET_REFERENCE}}}/nf"
        google.region  = "{REGION}"
        google.project = "$GOOGLE_CLOUD_PROJECT"
        google.lifeSciences.network = 'network'
        google.lifeSciences.subnetwork = 'subnetwork'
        google.lifeSciences.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
        }}"""
    
    # Replace boilerplate with Terra-specific config for the Google Cloud Life Sciences API.
    config_path = "/home/jupyter/repos/rnaseq-nf/nextflow.config"
    # Raw string so the regex escapes are not interpreted as string escapes.
    regex = r"(?s)gls(\s\{.*?\s\})(?=\s|$)"
    with open(config_path, "r") as config_file:
        data = config_file.read()
    result = re.sub(regex, terra_gls_config, data, count=1)

    if result:
        with open(config_path, "w") as config_file:
            config_file.write(result)

# Copy existing configuration file before we modify it.
!cp /home/jupyter/repos/rnaseq-nf/nextflow.config /home/jupyter/repos/rnaseq-nf/unmodified_nextflow.config

# Inject configuration.
configure_nextflow_for_terra()
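
To see what the regular expression in the cell above matches, it can be tried on a small in-memory configuration first. The sketch below uses a made-up `sample_config` and `replacement` (neither comes from the rnaseq-nf repository) and passes the replacement through a function so that it is inserted literally.

```python
import re

# Same pattern as in the cell above, written as a raw string.
GLS_BLOCK = r"(?s)gls(\s\{.*?\s\})(?=\s|$)"

# Made-up configuration, for illustration only.
sample_config = """profiles {
  standard {
    process.container = 'example/image:latest'
  }
  gls {
    params.reads = 'data/gut_{1,2}.fq'
    process.executor = 'local'
  }
  other {
    process.cpus = 2
  }
}
"""

replacement = """gls {
    process.executor = 'google-lifesciences'
  }"""

# Replace only the first `gls { ... }` block and leave the rest untouched.
updated = re.sub(GLS_BLOCK, lambda _match: replacement, sample_config, count=1)
print(updated)
```

The lazy match ends at the first `}` that is preceded by whitespace, so the `{1,2}` glob inside the block is skipped. A nested block inside `gls` would end the match early, which is worth keeping in mind if the upstream configuration changes.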

#### Inspect your Nextflow config

Run the cell below to validate your configuration for Nextflow in this workspace.  
In particular, inspect the output to ensure the parameters highlighted (your workspace project, GCS bucket, and service account email) have been resolved to the appropriate values.  
You should see something like below, with the appropriate values substituted for the placeholders in curly braces:
<code>
Setting the gcloud project to the workspace project
Updated property [core/project].
manifest {
   description = 'Proof of concept of a RNA-seq pipeline implemented with Nextflow'
   author = 'Paolo Di Tommaso'
   nextflowVersion = '>=20.07.0'
}
</code><code>
params {
   transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
   reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
   multiqc = 'gs://rnaseq-nf/multiqc'
}
</code><code>
process {
   executor = 'google-lifesciences'
   container = 'nextflow/rnaseq-nf:latest'
}
   workDir =</code><code style="background:yellow;color:black">'{MY_BUCKET}/nf'</code>
<code>
google {
   location = 'us-central1'
   region = 'us-central1'
   project = <code style="background:yellow;color:black">'{MY_PROJECT}'</code>
<code>
    lifeSciences {
        network = 'network'
        subnetwork = 'subnetwork'
        serviceAccountEmail = <code style="background:yellow;color:black">'{MY_PET_SA}'</code>
   }
    }</code>
Restoring the original gcloud project configuration: {MY_PROJECT}
Updated property [core/project].
</code>

In [None]:
!terra nextflow -c /home/jupyter/repos/rnaseq-nf/nextflow.config config /home/jupyter/repos/rnaseq-nf/main.nf -profile gls

### Launch RNASeq workflow

Run the cell below to use the [Terra CLI](https://github.com/DataBiosphere/terra-cli) to launch a sample Nextflow workflow for [an RNA sequencing pipeline](https://github.com/nf-core/rnaseq),   
which maps a collection of read-pairs to a given reference genome and outputs the respective transcript model.  
The workflow should take about 10 minutes to complete. Once your job has completed, you should see output like:
```
Done! Open the following report in your browser --> results/multiqc_report.html

Completed at: DD-Month-YYYY HH:MM:SS
Duration    : 10m 41s
CPU hours   : 0.2
Succeeded   : 4
```

In [None]:
!terra nextflow -c /home/jupyter/repos/rnaseq-nf/nextflow.config run /home/jupyter/repos/rnaseq-nf/main.nf -profile gls

### View pipeline runs

Run the cell below to print a history of the Nextflow pipelines you've executed. You should see something like:
```
TIMESTAMP               DURATION        RUN NAME                STATUS  REVISION ID     SESSION ID                                 COMMAND      
YYYY-MM-DD 15:27:03	10m 13s   	<RUN_NAME>      	OK    	386f5387c7 	fb368e91-ceab-4f8c-b17f-a575836e6d80	nextflow -c rnaseq-nf/nextflow.config run rnaseq-nf/main.nf -profile gls   
```

In [None]:
!nextflow log

### View execution details for a specific run

To view the tasks executed by your pipeline during a specific run, replace the `<RUN_NAME>` placeholder below with the run name of the RNASeq job from the output of the previous `nextflow log` command. Then run the cell below to see the tasks and their statuses.

In [None]:
!nextflow log <RUN_NAME> -f 'task_id,name,status,duration,cpus,container' | sort -n
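
Instead of copying the run name by hand, it could be read programmatically. This is a minimal sketch that assumes `nextflow log -q` prints one run name per line with the most recent run last. Both helper names are hypothetical, and the run names in the example are made up.

```python
import subprocess

def latest_run_name(quiet_log_output):
    """Return the last non-empty line of `nextflow log -q` output, or None."""
    names = [line.strip() for line in quiet_log_output.splitlines() if line.strip()]
    return names[-1] if names else None

def fetch_latest_run_name():
    """Run `nextflow log -q` in the current directory and pick the last run."""
    completed = subprocess.run(
        ["nextflow", "log", "-q"], capture_output=True, text=True, check=True
    )
    return latest_run_name(completed.stdout)

# Stand-in output with made-up run names.
print(latest_run_name("happy_curie\nsleepy_turing\n"))
```

`fetch_latest_run_name()` needs to be called from the directory in which the pipeline was launched, because that is where Nextflow keeps its run history.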

### View workflow results

The result of a successful run of the example workflow includes an HTML report from [MultiQC](https://multiqc.info/).
To view the results, you can navigate in the JupyterLab file browser to the `multiqc_report.html` file in the `results` directory.
The first time you open the report, it may include a message about JavaScript being disabled.
To resolve this, select the "Trust HTML" button as described in [this JupyterLab issue](https://github.com/jupyterlab/jupyterlab/issues/9738).
Alternatively, you may right click and open the MultiQC report in a new browser tab, which will not require clicking a button.

### Optional: Clean up intermediate results

If you won't need to reference the intermediate results produced by each workflow step in the future and don't intend to run a pipeline again with the `-resume` flag, it's good practice to remove the associated working directories and their contents from your workspace bucket. This reduces your storage cost and makes it easier to find results within your bucket when you do wish to inspect intermediate results from a particular run (e.g. for debugging or to analyze a particular step in the pipeline).   

Once you've performed cleanup for a specific run, the associated cached intermediate results have been deleted and you can no longer resume pipeline execution from those results. Therefore, it's important to perform a dry-run of the cleanup first to view the directories that will be removed and ensure you don't inadvertently clean up any checkpointing you wish to keep. 

Run the cell below to perform a dry-run for a pipeline run by replacing `<RUN_NAME>` with the run name of your specific pipeline execution. If you'd like to clean up all previous pipeline runs, replace `<RUN_NAME>` with `$(nextflow log -q)` in the cell below.

In [None]:
!nextflow clean -n <RUN_NAME>

If you're satisfied with deleting the pipeline working directories indicated above, replace `<RUN_NAME>` with the name of the desired pipeline execution in the cell below, then run the cell to permanently remove files associated with that run. 

If you'd prefer to clean up directories associated with all previous runs, you can replace `<RUN_NAME>` with `$(nextflow log -q)`.

In [None]:
!nextflow clean -f <RUN_NAME>
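
The dry-run-first habit described above can be made the default in a small wrapper. The following is a sketch only: `nextflow_clean_command` is a hypothetical helper that builds the argument list for the two commands shown above and refuses an unreplaced placeholder. It does not run anything by itself.

```python
def nextflow_clean_command(run_name, force=False):
    """Build the argument list for `nextflow clean`.

    Defaults to a dry run (-n); pass force=True to build the deleting form (-f).
    """
    if not run_name or run_name.startswith("<"):
        raise ValueError("Replace the <RUN_NAME> placeholder with a real run name.")
    return ["nextflow", "clean", "-f" if force else "-n", run_name]

# Made-up run name, for illustration only.
print(nextflow_clean_command("sleepy_turing"))
print(nextflow_clean_command("sleepy_turing", force=True))
```

The returned list can be passed to `subprocess.run` once the dry-run output has been reviewed.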

## Example 2: Run an `nf-core` workflow

Nextflow's `nf-core` is a curated collection of validated Nextflow pipelines, most of which can be run in almost any computing environment, including your Terra cloud environment. These pipelines are transparently versioned, and each version is validated by automated tests before release. In this exercise, you will perform RNASeq analysis by running the nf-core/rnaseq pipeline from the [`nf-core` collection](https://nf-co.re/pipelines). Running this example will typically cost no more than $1.

If desired, you can preview a fully-executed version of this notebook section by viewing [this notebook snapshot](https://terra-preprod-ui-terra.api.verily.com/workspaces/getting-started-with-workflows-workspace/resources/eebc2df6-fc9d-491e-874c-b87dbd3a68e1/notebook_snapshots/nextflow_examples_second_exercise_fully_executed.html). 

### View available pipelines

Dozens of commonly used bioinformatics pipelines are available via [nf-core](https://nf-co.re/pipelines). Now that you have installed `nf-core`'s companion tooling on this instance, run the cell below to view a list of available Nextflow pipelines in the `nf-core` collection. 

In [None]:
!nf-core list

### Modify nf-core configuration file

Next, run the cell below to inject the necessary parameters into the Nextflow configuration file and output its contents.  
In the `gls` entry, note the parameters set to workspace context variables: 
* `$TERRA_{BUCKET_REFERENCE}`, 
* `$GOOGLE_CLOUD_PROJECT` and 
* `$GOOGLE_SERVICE_ACCOUNT_EMAIL`. 

If you created the workspace bucket resource after you created this notebook instance (either via the [Terra workspace UI](https://terra-devel-ui-terra.api.verily.com/workspaces/emmarogge-workflows-ws) or Terra CLI commands), your bucket reference may not resolve correctly due to a stale workspace context cache (see [bug](https://broadworkbench.atlassian.net/browse/PF-2302)). To resolve the variable, run `terra resource list` to refresh the cached workspace context, then re-run the cell below.


In [None]:
import re, os

def configure_nf_core_for_terra():
    """
    Injects workspace context variables for parameters & configures Nextflow for nf_core RNASeq.
    """
    if not os.path.exists('/home/jupyter/nextflow.config'):
        os.mknod('/home/jupyter/nextflow.config')
        
    terra_nf_core_config = f"""
    profiles {{
        gls {{
            // Uncomment below line for debugging
            //process.echo = true
            // Do not change
            process.executor = 'google-lifesciences'
            process.container = 'nextflow/rnaseq-nf:latest'
            process.maxRetries = 5
            process.genome = "R64-1-1"
            process.errorStrategy = {{ task.exitStatus==14 ? 'retry' : 'terminate' }}
            workDir = "${{TERRA_ws_files_autodelete_after_two_weeks}}/nf_core/work"
            google.region  = "us-central1"
            google.project = "$GOOGLE_CLOUD_PROJECT"
            google.lifeSciences.network = 'network'
            google.lifeSciences.subnetwork = 'subnetwork'
            google.lifeSciences.bootDiskSize = '50 GB'
            google.lifeSciences.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
            email_on_fail = "$OWNER_EMAIL"
            storage {{
                parallelThreadCount = 1
            }}
        }}
    }}"""
    # The context manager closes the file, so no explicit close() is needed.
    with open("/home/jupyter/nextflow.config", "w") as config_file:
        config_file.write(terra_nf_core_config)
    print(terra_nf_core_config)

# Copy the existing configuration file, if it exists, before modifying it.
![ -f "/home/jupyter/nextflow.config" ] && cp /home/jupyter/nextflow.config /home/jupyter/unmodified_nextflow.config

# Inject configuration.
configure_nf_core_for_terra()
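
The doubled braces in the configuration template can be hard to read. Inside a Python f-string, `{{` and `}}` produce literal braces, while a single pair substitutes a Python value. The short sketch below shows how `${{TERRA_{...}}}` renders, using a stand-in variable name `bucket_reference`.

```python
bucket_reference = "ws_files_autodelete_after_two_weeks"

# `{{` and `}}` become literal braces; `{bucket_reference}` is substituted.
snippet = f"""gls {{
    workDir = "${{TERRA_{bucket_reference}}}/nf"
    google.project = "$GOOGLE_CLOUD_PROJECT"
}}"""
print(snippet)
```

The `$GOOGLE_CLOUD_PROJECT` text passes through Python unchanged, since `$` has no special meaning in an f-string. It is left for Nextflow to resolve when the configuration is loaded.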

### View expanded configuration

Run the cell below to output the Nextflow configuration's fully-expanded values to the file `expanded_nf_core.config`, then print the parameters and their values (injected from the Terra workspace context and by the function above) to the console.

In [None]:
!terra nextflow -c /home/jupyter/nextflow.config config nf-core/rnaseq -profile gls > expanded_nf_core.config
!grep 'workDir' expanded_nf_core.config
!grep -A 25 'google {'  expanded_nf_core.config
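
As an alternative to `grep`, the expanded configuration could be inspected from Python. This is a rough sketch that assumes values are printed as `key = 'value'` with single quotes, as in the expected output shown in Example 1. `quoted_values` is a hypothetical helper, and the sample text is made up rather than read from `expanded_nf_core.config`.

```python
import re

def quoted_values(config_text):
    """Collect `key = 'value'` pairs from expanded config text into a dict."""
    pairs = re.findall(r"^\s*(\w+)\s*=\s*'([^']*)'", config_text, flags=re.MULTILINE)
    return dict(pairs)

# Stand-in for the contents of expanded_nf_core.config (values are made up).
sample_expanded = """workDir = 'gs://example-bucket/nf_core/work'
google {
   region = 'us-central1'
   project = 'example-project'
}
"""
values = quoted_values(sample_expanded)
print(values["workDir"], values["project"])
```

Because keys are collected without their enclosing block, a key that appears in more than one block keeps only its last value. That is acceptable for a quick check but not for full parsing.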

### Get inputs for RNASeq

The input for RNASeq is raw FastQ sequencing data. Given sequencing data and a reference genome, the pipeline performs [a series of operations](https://nf-co.re/rnaseq#pipeline-summary), ultimately producing results including alignments, gene counts and a quality-control report. 
<a id="workspace_setup"></a>
<div class="alert alert-block alert-success">
<b>Note:</b> Please run <em>only one</em> of the following two cells, depending on which of these applies:
    <ul>
    <li>This notebook exists in a workspace that you cloned from the <a href="TODO_LINK_TO_WS">Getting Started with Nextflow workspace</a>, OR</li>
    <li>You imported this notebook into a new or existing personal workspace that is not a clone of the <a href="TODO_LINK_TO_WS">Getting Started with Nextflow workspace</a>.</li>
    </ul>
</div>

**If your workspace is a clone of the [Getting Started with Nextflow workspace](http://todo),** the referenced resource, `test-datasets`, is already present. You'll need to check out the branch containing the RNASeq test data by running the following cell.

In [None]:
!cd /home/jupyter/repos/test-datasets && git checkout rnaseq
!cd /home/jupyter/repos/test-datasets && cat samplesheet/samplesheet_minimal.csv || echo "Something's not quite right. Please ensure you've added the Git repo as referenced resource and checked out the RNASeq branch."

**If your workspace is NOT a clone of the [Getting Started with Nextflow workspace](http://todo),** run the cell below. It checks whether the nf-core test data is already present. If it is not, the cell adds the GitHub repository containing nf-core test data as a referenced resource, clones it into this cloud environment, and checks out the branch containing RNASeq-specific test datasets and sample sheets.

In [None]:
![[ -f /home/jupyter/repos/test-datasets/samplesheet/samplesheet_minimal.csv ]] && echo "Resource already exists" || (terra resource add-ref git-repo --name=test-datasets --repo-url=git@github.com:nf-core/test-datasets.git && cd /home/jupyter/repos && terra git clone --resource=test-datasets && cd test-datasets && git checkout rnaseq)
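
Before launching the pipeline, it can be useful to look at the sample sheet from Python. The sketch below assumes the column layout `sample,fastq_1,fastq_2,strandedness`, which is an assumption about the sample sheet format expected by this version of nf-core/rnaseq. The rows are invented for illustration and are not the contents of `samplesheet_minimal.csv`.

```python
import csv
import io

# Stand-in sample sheet; sample names and file names are hypothetical.
sample_csv = """sample,fastq_1,fastq_2,strandedness
WT_REP1,wt_rep1_1.fastq.gz,wt_rep1_2.fastq.gz,reverse
WT_REP1,wt_rep1_run2_1.fastq.gz,wt_rep1_run2_2.fastq.gz,reverse
KO_REP1,ko_rep1_1.fastq.gz,,reverse
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
# Unique sample names, and samples whose second read file is empty.
samples = sorted({row["sample"] for row in rows})
single_end = [row["sample"] for row in rows if not row["fastq_2"]]
print(samples, single_end)
```

To read the real file, replace `io.StringIO(sample_csv)` with an open file handle for the sample sheet path used in the run command.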

### Run the RNASeq pipeline from `nf-core`


Run the cell below to launch the nf-core RNASeq pipeline with Google Cloud Life Sciences API as the executor on a dataset of reads from seven [_S. cerevisiae_](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000146045.2/) samples. Each run should take about one hour to complete. 

The nf-core RNASeq pipeline produces a large volume of informative output.
When inspecting the output, note each time the executor you configured, the Google Life Sciences API, is invoked to execute the workflow steps that comprise RNASeq, from initial processing to alignment and creation of BAM files to creation of the MultiQC output files.

When the pipeline run has completed successfully, the last few lines of output should resemble:
```
Waiting for file transfers to complete (1 files)
-[nf-core/rnaseq] Pipeline completed successfully-
Completed at: DD-Month-YYYY HH:MM:SS
Duration    : 58m 56s
CPU hours   : <HOURS>
Succeeded   : <NUMBER_OF_TASKS>

Restoring the original gcloud project configuration: <GOOGLE-CLOUD-PROJECT-ID>
Updated property [core/project].
```

In [None]:
!terra nextflow -c /home/jupyter/nextflow.config run nf-core/rnaseq -profile gls \
-w $(terra resource resolve \
--name=ws_files_autodelete_after_two_weeks)/nf_core/wd -r 3.11.1 \
--outdir $(terra resource resolve --name=ws_files_autodelete_after_two_weeks)/nf_core/output \
--input /home/jupyter/repos/test-datasets/samplesheet/samplesheet_minimal.csv \
--genome 'R64-1-1'

### View pipeline runs

Run the cell below to print a history of the Nextflow pipelines you've executed. The latest entry should be something like:
```
TIMESTAMP               DURATION        RUN NAME                STATUS  REVISION ID     SESSION ID                                 COMMAND      
YYYY-MM-DD HH:MM:SS     1h 3m 34s      <run_name> OK      6e1e448f53      8c1afd3c-d318-4d7d-bb07-139ecf711193       nextflow -c /home/jupyter/nextflow.config run nf-core/rnaseq -profile google -w 'gs://terra-vdevel-genial-olive-3455-ws-files/nf_core/wd' -r 3.10.1 --minAssignedFrags 1 --outdir 'gs://terra-vdevel-genial-olive-3455-autodelete-after-two-weeks/nf_core/output' --input /home/jupyter/terra-solutions-mc-terra-testing/1000_genomes/test-datasets/samplesheet/samplesheet_minimal.csv --genome R64-1-1 --google-debug true --save-reference true     
```

In [None]:
!nextflow log
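
The session ID needed for resuming a run can be pulled out of the log programmatically. The following is a sketch under the assumption that `nextflow log` separates columns with tab characters and prints a header row. `parse_nextflow_log` is a hypothetical helper, and the timestamp, run name, and command in the sample are made up.

```python
def parse_nextflow_log(log_text):
    """Parse tab-separated `nextflow log` output into a list of dicts."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = [column.strip() for column in lines[0].split("\t")]
    runs = []
    for line in lines[1:]:
        fields = [field.strip() for field in line.split("\t")]
        runs.append(dict(zip(header, fields)))
    return runs

# Stand-in log output; only the column names follow the example above.
sample_log = (
    "TIMESTAMP\tDURATION\tRUN NAME\tSTATUS\tREVISION ID\tSESSION ID\tCOMMAND\n"
    "2023-01-01 10:00:00\t1h 3m 34s\tsleepy_turing\tOK\t6e1e448f53\t"
    "8c1afd3c-d318-4d7d-bb07-139ecf711193\tnextflow run nf-core/rnaseq -profile gls\n"
)
latest = parse_nextflow_log(sample_log)[-1]
print(latest["RUN NAME"], latest["SESSION ID"])
```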

### Optional: Resume pipeline execution from cached data

Nextflow pipelines can be run with the [`-resume` flag](https://www.nextflow.io/docs/latest/getstarted.html#modify-and-resume). This flag causes the pipeline run to rely on existing, cached intermediate results for some tasks; computation of new intermediate results is performed only where inputs or the pipeline have been modified. This feature allows greater efficiency and speed in pipeline development and debugging.

To resume pipeline execution for a specific run, provide the session ID of the desired run following the `-resume` flag in the cell below (the session ID is shown in the output of the `nextflow log` command in the previous cell). If no modifications have been made to the pipeline or its configuration, resuming a successful run should take about 20 minutes. 


<div class="alert alert-block alert-success">
<b>Note:</b> 
 The output directory in the command to resume workflow execution (below) differs from the output directory in the original command to launch the workflow. This is a necessary workaround due to a <a href="https://github.com/nextflow-io/nextflow/issues/1189">known Nextflow bug</a> in which an error is thrown when an output file of the same name already exists in the output directory (because the previous run executed successfully and published the MultiQC report to the expected output path). If the workflow run you are resuming did not succeed, you may change the output directory to the original destination in the command below.
</div>

In [None]:
!terra nextflow -c /home/jupyter/nextflow.config run nf-core/rnaseq -resume <SESSION_ID> -profile gls \
-w $(terra resource resolve --name=ws_files_autodelete_after_two_weeks)/nf_core/wd -r 3.11.1 \
--outdir $(terra resource resolve --name=ws_files_autodelete_after_two_weeks)/nf_core/output/resumed \
--input /home/jupyter/repos/test-datasets/samplesheet/samplesheet_minimal.csv \
--genome 'R64-1-1'

### View execution details for a specific run
To view the tasks executed by your pipeline during a specific run, replace the `<RUN_NAME>` placeholder below with the run name of the nf-core RNASeq job from the output of the previous `nextflow log` command. Then run the cell below to see the tasks and their statuses.

In [None]:
!nextflow log <RUN_NAME> -f 'task_id,name,status,duration,cpus,memory,container' | sort -n

### View workflow results

The result of a successful run of the example workflow includes an HTML report from [MultiQC](https://multiqc.info/).
To view the results, you'll need to obtain this report from the workspace bucket in the output directory provided in a previous step. If you haven't changed the default, you can run the cell below to copy the report to the home directory of this cloud environment.

The first time you open the report, it may include a message about JavaScript being disabled.
To resolve this, select the "Trust HTML" button as described in [this JupyterLab issue](https://github.com/jupyterlab/jupyterlab/issues/9738).
Alternatively, you may right click and open the MultiQC report in a new browser tab, which will not have the same issue.

In [None]:
!terra gsutil cp $(terra resource resolve --name=ws_files_autodelete_after_two_weeks)/nf_core/output/multiqc/star_salmon/multiqc_report.html /home/jupyter/nf_core_multiqc_report.html

### Optional: Clean up intermediate results

If you won't need to reference the intermediate results produced by each workflow step in the future and don't intend to run a pipeline again with the `-resume` flag, it's good practice to remove the associated working directories and their contents from your workspace bucket. This reduces your storage cost and makes it easier to find results within your bucket when you do wish to inspect intermediate results from a particular run (e.g. for debugging or to analyze a particular step in the pipeline).   

Once you've performed cleanup for a specific run, the associated cached intermediate results have been deleted and you can no longer resume pipeline execution from those results. Therefore, it's important to perform a dry-run of the cleanup first to view the directories that will be removed and ensure you don't inadvertently clean up any checkpointing you wish to keep. 

Run the cell below to perform a dry-run for a pipeline run by replacing `<RUN_NAME>` with the run name of your specific pipeline execution. If you'd like to clean up all previous pipeline runs, replace `<RUN_NAME>` with `$(nextflow log -q)` in the cell below.

In [None]:
!nextflow clean -n <RUN_NAME>

If you're satisfied with deleting the pipeline working directories indicated above, replace `<RUN_NAME>` with the name of the desired pipeline execution in the cell below, then run the cell to permanently remove files associated with that run. (If you'd prefer to clean up directories associated with all previous runs, you can replace `<RUN_NAME>` with `$(nextflow log -q)`.)

In [None]:
!nextflow clean -f <RUN_NAME>

## Provenance

Generate information about this notebook environment and the packages installed.

In [None]:
!date

Conda and pip installed packages:

In [None]:
!conda env export

JupyterLab extensions:

In [None]:
!jupyter labextension list

Number of cores:

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

Memory:

In [None]:
!grep "^MemTotal:" /proc/meminfo

---
Copyright 2022 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style   
license that can be found in the LICENSE file or at   
https://developers.google.com/open-source/licenses/bsd