# Cromwell in Terra

## Introduction

### What is Cromwell?
Cromwell is a Workflow Management System geared toward scientific workflows. 

More information and additional examples can be found in the [Cromwell documentation](https://cromwell.readthedocs.io/en/stable/).

### About this notebook

This notebook is intended to demonstrate how you can use the Cromwell engine on Terra to execute and manage workflows. Setup instructions are provided, along with examples of commands to submit workflows, check a job's status, list jobs, and examine output.

Note that all of the commands demonstrated here are shell commands, which you can also run from a command line, such as a Jupyter terminal.

### Notebook setup

#### Set up utility functions

In [None]:
def get_bucket_url_from_reference(bucket_reference):
    '''Resolves bucket URL from bucket reference in workspace.'''
    # The first line of the command output is taken as the bucket URL.
    BUCKET_CMD_OUTPUT = !terra resolve --name={bucket_reference}
    BUCKET = BUCKET_CMD_OUTPUT[0]
    return BUCKET
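
The helper above relies on IPython's `!` syntax, so it only works inside a notebook. As a rough sketch of the same lookup in plain Python (for example, in a script run from a Jupyter terminal), the version below calls the same `terra resolve` command through `subprocess`. The function name `resolve_bucket_url` and the `run` parameter are illustrative assumptions, not part of the Terra tooling.

```python
import subprocess


def resolve_bucket_url(bucket_reference, run=subprocess.run):
    """Return the first line printed by `terra resolve --name=<reference>`.

    `run` is injectable (hypothetical design choice) so the function can be
    exercised without the Terra CLI installed.
    """
    result = run(
        ["terra", "resolve", f"--name={bucket_reference}"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    lines = result.stdout.splitlines()
    if not lines:
        raise RuntimeError(f"No output when resolving '{bucket_reference}'")
    return lines[0].strip()
```

Unlike the notebook helper, this version raises a descriptive error when the command prints nothing, rather than failing with an `IndexError`.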

#### Workspace setup
<div class="alert alert-block alert-info">
<b>Note:</b> This notebook assumes that `workspace_setup.ipynb` and `cloud_env_setup.ipynb` in the parent directory have been run. 
</div>
    
`workspace_setup.ipynb` creates two Cloud Storage buckets for your workspace files with workspace reference names: 

 - ws_files   
 - ws_files_autodelete_after_two_weeks      
    
The code in this notebook will write output files to the "autodelete" bucket by default.  
    Any file in this bucket will be automatically deleted <b>two weeks</b> after it is written.  
    This alleviates the need for you to remember to clean up temporary and example files manually.  
    If you want to write outputs to a durable location, simply change the assignment of the `BUCKET_REFERENCE` variable in the cell below and re-run the notebook. 

In [None]:
# Change this to "ws_files" to use the durable workspace bucket instead of the autodelete bucket.
BUCKET_REFERENCE = "ws_files_autodelete_after_two_weeks"

In [None]:
MY_BUCKET = get_bucket_url_from_reference(BUCKET_REFERENCE)
print(f'Bucket ID: {MY_BUCKET}')

#### Cloud environment setup

The notebooks in this workspace create a few files on your cloud environment. For clarity, and to ease cleanup after
running the tutorials, the notebooks write by default to a well-defined location set by the
`CROMWELL_EXAMPLES_DIR` variable. You are free to change this location to suit your own use cases.

In [None]:
import os

CROMWELL_EXAMPLES_DIR=os.path.expanduser('~/terra-tutorials/cromwell')
CROMWELL_CONF=f'{CROMWELL_EXAMPLES_DIR}/cromwell.runmode.conf'

HELLO_WORLD_INPUTS_JSON=f'{CROMWELL_EXAMPLES_DIR}/hello_world.inputs.json'
SAMPLE_INPUTS_JSON=f'{CROMWELL_EXAMPLES_DIR}/sample.inputs.json'

RUNMODE_LOG=f'{CROMWELL_EXAMPLES_DIR}/cromwell.run.log'

!mkdir -p {CROMWELL_EXAMPLES_DIR}

print(f'Tutorial files will be written locally to {CROMWELL_EXAMPLES_DIR}')
print()
print(f'Cromwell configuration file will be written to {CROMWELL_CONF}')
print(f'Cromwell hello-world input JSON file will be written to {HELLO_WORLD_INPUTS_JSON}')
print(f'Cromwell runmode log file will be written to {RUNMODE_LOG}')
print(f'Cromwell samples input JSON file will be written to {SAMPLE_INPUTS_JSON}')

## `Cromwell` configuration

### Do I need to install `Cromwell` or other dependencies in my Terra workspace?

The Cromwell (Java) JAR file is installed on Terra cloud environments by default, so you don't need to install anything to complete the exercises in this notebook.

### Modes of execution

`Cromwell` has [two execution modes](https://cromwell.readthedocs.io/en/stable/Modes): [*run mode*](#run_mode) and [*server mode*](#server_mode). We will execute the same "Hello World" WDL workflow in both run mode and server mode to observe the differences.

#### Run mode<a id="run_mode"></a>

Run mode is most useful for executing a single instance of a workflow for development, testing, and demos. A job executed in run mode launches a single workflow from the command line. That command executes synchronously; it stays running until the workflow exits. 

#### Server mode<a id="server_mode"></a>

Server mode is suitable for production use and scaling up the number of concurrent jobs. `Cromwell` in server mode exposes a REST API endpoint that accepts requests for job submission, monitoring and control.
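
To make the REST interface concrete, here is a minimal sketch of querying a workflow's status directly over HTTP. It assumes the server listens on `localhost:8000`, the same address this notebook writes to the Cromshell configuration, and uses Cromwell's `/api/workflows/v1/{id}/status` endpoint. The helper names `workflow_url` and `get_workflow_status` are illustrative only; the rest of this notebook uses `cromshell` instead.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the Cromwell server is reachable at this host and port.
CROMWELL_HOST = "localhost:8000"


def workflow_url(workflow_id, action, host=CROMWELL_HOST):
    """Build the URL for a per-workflow endpoint such as 'status' or 'outputs'."""
    return f"http://{host}/api/workflows/v1/{quote(workflow_id, safe='')}/{action}"


def get_workflow_status(workflow_id, host=CROMWELL_HOST):
    """Fetch the status string (for example 'Running') for one workflow."""
    with urlopen(workflow_url(workflow_id, "status", host)) as response:
        return json.load(response)["status"]
```

`get_workflow_status` requires a running server, so call it only after a Cromwell server has been launched.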

## Run a simple `Cromwell` job

Run the cell below to output the contents of `helloWorld.wdl`, the WDL file for the first workflow we will run.

This workflow has no file input, but instead just accepts a string input parameter `name`.

In [None]:
!cat workflows/wdl/helloWorld.wdl

#### Provide inputs

WDL supports the specification of complex inputs, including (but not limited to)
`String`s, `Int`s, `File`s, and `Array`s. These complex inputs are provided in
a JSON file, frequently referred to as the `inputs.json`.

Run the cell below to create an input file `hello_world.inputs.json` which sets
the input `name` to your Terra user email.

In [None]:
import os
import json

# Get Terra user email from environment.
MY_USER_EMAIL = os.environ['TERRA_USER_EMAIL']

# Create an input file that sets the workflow's `name` input to your email.
data = {"hello_world.name": MY_USER_EMAIL}
with open(HELLO_WORLD_INPUTS_JSON, 'w') as json_file:
    json.dump(data, json_file, indent=2)

Let's look at the contents of `hello_world.inputs.json`:

In [None]:
!cat {HELLO_WORLD_INPUTS_JSON}
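
Cromwell expects each key in the inputs file to be qualified with the workflow name, as in `hello_world.name` above. The sketch below is one way to catch unqualified keys before submitting. The function name `unqualified_keys` is hypothetical, and the workflow name `hello_world` is assumed from the key used in the previous cell.

```python
def unqualified_keys(inputs, workflow_name):
    """Return the input keys that are not prefixed with '<workflow_name>.'."""
    prefix = f"{workflow_name}."
    return [key for key in inputs if not key.startswith(prefix)]


# Example with an in-memory dict; in the notebook you could instead load
# the file with json.load(open(HELLO_WORLD_INPUTS_JSON)).
example_inputs = {"hello_world.name": "someone@example.com", "name": "oops"}
problems = unqualified_keys(example_inputs, "hello_world")
print(problems)
```

An empty list means every key carries the expected workflow prefix.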

### Run a job in run mode

The following command will run the `helloWorld.wdl` example, with the inputs file that you just created. The logging output of this command will be written to a log file.

In [None]:
%%bash -s {HELLO_WORLD_INPUTS_JSON} {RUNMODE_LOG}

HELLO_WORLD_INPUTS_JSON="$1"
RUNMODE_LOG="$2"

java -jar $CROMWELL_JAR \
  run \
  workflows/wdl/helloWorld.wdl \
  --inputs "${HELLO_WORLD_INPUTS_JSON}" \
  &> "${RUNMODE_LOG}"


#### View results

Run the command in the cell below to view the final logging statements in 
`cromwell.run.log`. The message should include something similar to:

```
a3d613 [50891a9e]: Workflow myWorkflow complete. Final Outputs:
{
  "myWorkflow.myTask.out": "hello world"
}
[2022-11-03 16:57:30,98] [info] WorkflowManagerActor: Workflow actor for 50891a9e-6493-4c38-9e5e-37ff1da3d613 completed with status 'Succeeded'. The workflow will be removed from the workflow store.
[2022-11-03 16:57:36,24] [info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
  "outputs": {
    "myWorkflow.myTask.out": "hello world"
  },
  "id": "50891a9e-6493-4c38-9e5e-37ff1da3d613"
}
```

In [None]:
!tail --lines=50 {RUNMODE_LOG}
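
If you would rather work with the outputs programmatically than read the log tail, the final JSON block can be parsed out of the log text. This is a small sketch that assumes the log ends with a JSON object starting on a line containing only `{`, as in the sample above. The function name `last_json_object` is illustrative.

```python
import json


def last_json_object(log_text):
    """Return the last JSON object that starts on a line containing only '{'.

    Returns None if no such object can be parsed.
    """
    decoder = json.JSONDecoder()
    # Record the character offset at which each line starts.
    starts = []
    position = 0
    for line in log_text.splitlines(keepends=True):
        starts.append((position, line))
        position += len(line)
    # Walk backwards so the final summary block is found first.
    for offset, line in reversed(starts):
        if line.rstrip() == "{":
            try:
                parsed, _ = decoder.raw_decode(log_text, offset)
            except json.JSONDecodeError:
                continue
            return parsed
    return None
```

In the notebook, this could be called as `last_json_object(open(RUNMODE_LOG).read())` and the result's `"outputs"` entry inspected.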

### Run a job in server mode

To submit, monitor, and cancel workflow jobs using `Cromwell` in *server mode*, we recommend using the `cromshell` command line interface tool. For more information about `cromshell`, check out its [documentation](https://github.com/broadinstitute/cromshell/tree/cromshell_2.0).

<div class="alert alert-block alert-info">
<b>Note:</b>
You'll need to launch a Cromwell server before submitting any jobs in server mode by running the `cromwell_server_management.ipynb` notebook in this directory.
</div>

#### Submitting jobs with Cromshell

[Cromshell](https://github.com/broadinstitute/cromshell) is a script for submitting workflows to a Cromwell server and monitoring / querying their results. Cromshell is preinstalled on Terra cloud environments.

##### Configure the Cromshell host and port

Prior to use, Cromshell needs to know what host and port the Cromwell server is running on.

Run the cell below to write the Cromshell server configuration file.

In [None]:
%%bash

mkdir -p ~/.cromshell

echo 'localhost:8000' > ~/.cromshell/cromwell_server.config

##### Submit a job!

In [None]:
%%bash -s {HELLO_WORLD_INPUTS_JSON}

HELLO_WORLD_INPUTS_JSON="$1"

cromshell submit \
  workflows/wdl/helloWorld.wdl \
  "${HELLO_WORLD_INPUTS_JSON}"

#### Check workflow status

Run the cell below to check the status of the workflow. If no job ID is provided, `cromshell status` will return the status of the job most recently submitted. By default, Cromwell saves each workflow's execution directory under `cromwell-executions/<workflow_name>`, where `<workflow_name>` is the name of the workflow defined in the WDL file.

At first, the status of the job should be "Running". Run the cell below again after about 30 seconds; the status of the job should change to "Succeeded".

In [None]:
!cromshell status
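
Instead of re-running the cell by hand, the check can be repeated until the workflow reaches a terminal state. The sketch below shows one such polling loop. `wait_for_workflow` is a hypothetical helper, and `get_status` stands for any function you supply that returns the current status string (for example, one that queries the Cromwell server); Cromwell's terminal statuses are `Succeeded`, `Failed`, and `Aborted`.

```python
import time

TERMINAL_STATUSES = {"Succeeded", "Failed", "Aborted"}


def wait_for_workflow(get_status, interval_seconds=10, timeout_seconds=600,
                      sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within roughly
    timeout_seconds of waiting.
    """
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Workflow still '{status}' after {waited} seconds")
        sleep(interval_seconds)
        waited += interval_seconds
```

The `sleep` parameter exists only so the loop can be tested without real delays; in normal use, leave it at its default.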

#### View previous jobs

Optional: view a list of previous `Cromwell` jobs by running the cell below. The `-c` flag will produce outputs color-coded by status (green for success, red for failure, blue for currently running).

In [None]:
!cromshell list -c

## Set up and run a workflow to analyze a single sample

### Populate workflow configuration

Run the cell below to resolve the addresses of workspace resources and save them to Python variables for convenience.

In [None]:
CRAM_URLS = !terra resource resolve --name='cram-folder'
print(CRAM_URLS)
CRAM_RESOURCE_URL=CRAM_URLS[0]

REF_URLS = !terra resource resolve --name='ref-folder'
REF_RESOURCE_URL=REF_URLS[0]

Run the cell below to create an input file `sample.inputs.json`.

In [None]:
import json

# Workflow inputs, keyed by their fully qualified WDL names.
data = {
    "CramToBamFlow.CramToBamTask.InputCram": f'{CRAM_RESOURCE_URL}/NA12878.cram',
    "CramToBamFlow.CramToBamTask.RefDict": f'{REF_RESOURCE_URL}/Homo_sapiens_assembly38.dict',
    "CramToBamFlow.CramToBamTask.RefFasta": f'{REF_RESOURCE_URL}/Homo_sapiens_assembly38.fasta',
    "CramToBamFlow.CramToBamTask.RefIndex": f'{REF_RESOURCE_URL}/Homo_sapiens_assembly38.fasta.fai',
    "CramToBamFlow.CramToBamTask.SampleName": "NA12878",
    "CramToBamFlow.ValidateSamFile.preemptible_tries": "3",
    "CramToBamFlow.cram_to_bam_disk_size": "200",
    "CramToBamFlow.cram_to_bam_mem_size": "15 GB",
    "CramToBamFlow.validate_sam_file_disk_size": "200",
    "CramToBamFlow.validate_sam_file_mem_size": "3500 MB",
}

# Print each input for a quick visual check.
for key, value in data.items():
    print(f"{key}: {value}")

with open(SAMPLE_INPUTS_JSON, 'w') as json_file:
    json.dump(data, json_file)
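
The file paths above are built by appending a file name to a resolved folder URL. If a resolved URL happens to end with a trailing slash, plain string formatting would produce a double slash in the path. The sketch below is one hedged way to guard against that; `join_gs_path` is a hypothetical helper and the bucket names are made up for illustration.

```python
def join_gs_path(folder_url, *parts):
    """Join a folder URL and path parts with exactly one '/' between them."""
    pieces = [folder_url.rstrip("/")] + [part.strip("/") for part in parts]
    return "/".join(pieces)


# Both forms of the folder URL give the same result.
print(join_gs_path("gs://example-bucket/crams/", "NA12878.cram"))
print(join_gs_path("gs://example-bucket/crams", "NA12878.cram"))
```

Both calls print `gs://example-bucket/crams/NA12878.cram`.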

In [None]:
!cromshell submit workflows/wdl/cramToBam.wdl {SAMPLE_INPUTS_JSON}

#### Check workflow status


Run the cell below to check your job's status.<br>Your job should progress from 'Submitted' to 'Running' in about fifteen seconds.<br>After a few minutes, your job's status should progress to 'Succeeded'.

In [None]:
!cromshell status

#### View workflow logs


Run the cell below to view your job's logs. Logs may be incomplete until the job has finished, so re-run the cell once the job's status is 'Succeeded'.

In [None]:
!cromshell logs

## Provenance

In [None]:
!date

In [None]:
!conda env export

In [None]:
!jupyter labextension list