# Create a Hail cluster on Verily Workbench
<table align="left">

  <td>
    <a href="https://github.com/DataBiosphere/terra-axon-examples/blob/main/dataproc/create_hail_cluster.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

# Overview

This notebook demonstrates how to create Dataproc (managed Spark) clusters, with [Hail](https://hail.is/) installed, on [Verily Workbench](https://support.workbench.verily.com/docs/) **using the Workbench CLI**. (To create a cluster using the Workbench web UI instead, see [this guide](https://support.workbench.verily.com/docs/how_to_guides/dataproc/).)
It also discusses how to access and run jobs on the JupyterLab server on the cluster, and how to access debugging consoles such as the Spark console.

This notebook is most straightforward to run from a Workbench workspace [notebook cloud environment](https://support.workbench.verily.com/docs/how_to_guides/using_cloud_environments/), which has the Workbench CLI pre-installed.  You can use the default settings when you create the notebook environment.

You can also run these commands from other environments, like your local machine, if you [install and configure](https://support.workbench.verily.com/docs/getting_started/install_and_run/) the Workbench CLI. The example assumes that you have already created a Workbench [workspace](https://support.workbench.verily.com/docs/getting_started/web_ui/#creating-a-new-workspace). 

## Objective

In this tutorial you will learn how to run [Hail](https://hail.is/) via [Dataproc](https://cloud.google.com/dataproc/docs/concepts/overview) with [autoscaling](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#what_is_autoscaling) for resource management. The steps include:

1. Do some initial setup and configuration.
1. Create a Dataproc cluster with Hail installed.
1. Access JupyterLab and the Spark console running on the cluster.
1. Submit a script to the cluster for Hail to run in batch mode.
1. (Optional) Pause or delete the cluster.

## How to run this notebook

Run this notebook cell by cell to create and use a Dataproc cluster with Hail installed.

## Costs

This notebook takes less than a minute to run, which will typically cost less than $0.01 of compute time on your cloud environment. **This estimate does not include the [cost](https://cloud.google.com/dataproc/pricing) of running the Dataproc cluster.**

# Setup and Configuration

## Set some variables

In [None]:
from datetime import datetime
import os

Obtain the user name so that it can become part of the Hail cluster name. This is useful when people collaborate in Verily Workbench workspaces and want to differentiate their clusters from each other.

In [None]:
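# Derive a short, hyphenated user name from the Workbench user email.
USER = os.getenv('TERRA_USER_EMAIL')
if USER:
    USER = USER.split('@')[0].replace('.', '-')
else:
    print('TERRA_USER_EMAIL not defined; using USER')
    # Fall back to the shell user name, or a placeholder if that is unset too,
    # so that building the cluster name below does not fail on None.
    USER = os.getenv('USER') or 'unknown-user'
print(USER)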

Construct a cluster name from USER and date:

In [None]:
HAIL_CLUSTER_NAME = '-'.join(['hail', USER, datetime.now().strftime('%Y%m%d')])

print(HAIL_CLUSTER_NAME)
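
User names can contain characters that are not valid in a cluster name (uppercase letters or underscores, for example). The sketch below shows one way to normalize a name before using it. It assumes the Dataproc naming rule as we understand it: a lowercase letter first, then lowercase letters, digits, or hyphens, no trailing hyphen, and at most 52 characters. `sanitize_cluster_name` is a hypothetical helper, not part of the Workbench CLI, and Workbench may apply further restrictions.

```python
import re


def sanitize_cluster_name(raw_name, max_length=52):
    """Return a lowercase, hyphen-separated cluster name (hypothetical helper).

    Assumed rule: start with a lowercase letter, continue with lowercase
    letters, digits, or hyphens, and do not end with a hyphen.
    """
    name = raw_name.lower()
    # Replace every run of disallowed characters with a single hyphen.
    name = re.sub(r'[^a-z0-9-]+', '-', name)
    # Collapse repeated hyphens.
    name = re.sub(r'-{2,}', '-', name)
    # The name must start with a letter: drop leading digits and hyphens.
    name = re.sub(r'^[^a-z]+', '', name)
    # Truncate, then make sure the result does not end with a hyphen.
    name = name[:max_length].rstrip('-')
    if not name:
        raise ValueError(f'No valid cluster name can be derived from {raw_name!r}')
    return name


print(sanitize_cluster_name('hail-Jane_Doe-20230615'))
```

If you use a helper like this, apply it once when you build `HAIL_CLUSTER_NAME` so that every later command sees the same name.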

## Confirm your workspace config and create a GitHub repo reference

Next, run the following command to check that the CLI is set to use the Workbench *workspace* in which you would like to create the Dataproc cluster. 

If you are running this notebook _from_ a cloud environment in a Workbench workspace, you should be set.   
If you are running the notebook locally and would like to change the workspace, you can do so via `terra workspace set --id=<your-workspace-id>`.

In [None]:
!terra workspace describe

Then, create a [*referenced resource*](https://support.workbench.verily.com/docs/how_to_guides/add_repo_to_ws/) for the GitHub repo that holds the Dataproc example notebooks (the repo that this notebook comes from).
This allows the example notebooks to be automounted to the file system of the Dataproc cluster's main node, so that once the cluster is created, you can access them easily.

Note: If you are running this notebook from a Workbench workspace, this reference may already be defined. It is harmless to run this command more than once.

In [None]:
!terra resource resolve --name terra-axon-examples || terra resource add-ref git-repo --repo-url https://github.com/DataBiosphere/terra-axon-examples.git \
  --name terra-axon-examples

## Confirm your `gcloud` settings

As part of this example, you will use the [gcloud](https://cloud.google.com/sdk/docs/install) SDK to upload a cluster autoscaling policy.  **If you are running this example from a Workbench cloud environment, `gcloud` will already be installed and properly configured** to use your workspace's underlying Google Cloud project, and you don't need to take any action.

However, if you're running this example from a different environment, you may need to first run:

```sh
gcloud config set project <your-project-id>
```

You can find your workspace's project ID on the overview page for the workspace in the Workbench web UI, or from the command line via `terra workspace describe`.

Alternatively, you can pass the `--project=<your-project-id>` argument to individual `gcloud` commands.

## Define an autoscaling policy

Configure Dataproc [autoscaling](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling) to automatically and dynamically scale the number of worker VMs in Dataproc clusters to meet workload demands.

You will likely end up with several autoscaling policies, since different jobs run best with different numbers of primary workers (the workers that will not be preempted).

Here, we're creating the policy via the command line, but you can also do so [via the Google Cloud Console](https://support.workbench.verily.com/docs/how_to_guides/dataproc/#creating-an-autoscaling-policy-via-the-cloud-console).

In [None]:
%%writefile two_worker_autoscaling_policy.yaml

workerConfig:
  # Best practice: keep min and max values identical for primary workers
  # https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#avoid_scaling_primary_workers
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  maxInstances: 50
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
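
To get a feel for what `scaleUpFactor: 0.05` and `maxInstances: 50` mean in practice, here is a deliberately simplified model of a single scale-up decision. It assumes that the autoscaler multiplies the number of workers needed to clear pending YARN memory by the scale-up factor, rounds up, and never exceeds the secondary-worker maximum. The real autoscaler also takes available memory, the cooldown period, and graceful decommissioning into account, so treat this as an illustration rather than a description of Dataproc's implementation. The function name and its inputs are made up for this example.

```python
import math


def estimate_secondary_workers_added(workers_needed, current_secondary,
                                     scale_up_factor=0.05, max_secondary=50):
    """Estimate workers added in one cooldown period (simplified model)."""
    # Request a fraction of the shortfall, rounded up to a whole worker.
    # round(..., 9) guards against floating-point noise before math.ceil.
    requested = math.ceil(round(workers_needed * scale_up_factor, 9))
    # Never exceed secondaryWorkerConfig.maxInstances.
    headroom = max(max_secondary - current_secondary, 0)
    return min(requested, headroom)


# Shortfall of 100 workers, no secondary workers running yet.
print(estimate_secondary_workers_added(100, 0))
# Same shortfall, but the cluster is already close to the maximum.
print(estimate_secondary_workers_added(100, 48))
```

Under this model, a small scale-up factor makes the cluster grow in modest steps every cooldown period instead of jumping straight to the maximum.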

Import the autoscaling policy, if it does not yet exist:

In [None]:
!gcloud dataproc autoscaling-policies describe two_worker_autoscaling_policy --region=us-central1 || \
    gcloud dataproc autoscaling-policies import two_worker_autoscaling_policy \
        --source=two_worker_autoscaling_policy.yaml \
        --region=us-central1
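
If you need several policies that differ only in the number of primary workers, one option is to generate the YAML from Python instead of editing it by hand. The sketch below reuses the field values from the policy above; `render_autoscaling_policy` is a hypothetical helper, and only the worker counts are parameterized.

```python
def render_autoscaling_policy(primary_workers, max_secondary_workers=50):
    """Return autoscaling policy YAML with a fixed number of primary workers."""
    return (
        'workerConfig:\n'
        # Keep min and max identical so primary workers are never scaled.
        f'  minInstances: {primary_workers}\n'
        f'  maxInstances: {primary_workers}\n'
        'secondaryWorkerConfig:\n'
        f'  maxInstances: {max_secondary_workers}\n'
        'basicAlgorithm:\n'
        '  cooldownPeriod: 4m\n'
        '  yarnConfig:\n'
        '    scaleUpFactor: 0.05\n'
        '    scaleDownFactor: 1.0\n'
        '    gracefulDecommissionTimeout: 1h\n'
    )


print(render_autoscaling_policy(4))
```

You could write the returned text to a file such as `four_worker_autoscaling_policy.yaml` (an example name) and import it with the same `gcloud dataproc autoscaling-policies import` command, using a matching policy name.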

# Create a Dataproc cluster with Hail installed


Now, we'll run the command to create a cluster.  For this example, we're only setting a few parameters.  The `--software-framework=HAIL` parameter specifies the Hail installation. The `--idle-delete-ttl` parameter tells Dataproc to delete the cluster after it has been idle for the given period (in seconds). If this parameter is not included, the cluster will not auto-delete.

Run `terra resource create dataproc-cluster` to see more [options](https://support.workbench.verily.com/docs/commands/terra-resource-create-dataproc-cluster/).   
Many of the parameters of this command are passed through from the `gcloud dataproc` command. See the Dataproc [docs](https://cloud.google.com/dataproc/docs/) and [reference documentation](https://cloud.google.com/dataproc/docs/reference/rest/) for more detail on these parameters.


(You can also create a cluster via the Workbench web UI. To do so, see [this guide](https://support.workbench.verily.com/docs/how_to_guides/dataproc/).)


In [None]:
!terra resource create dataproc-cluster \
  --name={HAIL_CLUSTER_NAME} \
  --software-framework=HAIL \
  --num-workers 2 \
  --autoscaling-policy two_worker_autoscaling_policy --idle-delete-ttl 3600s
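
If you want to vary these settings from Python (for example, to change the idle timeout per run), the sketch below assembles the same command as a string without running it. It uses only the flags shown in the cell above; `build_create_cluster_command` is a hypothetical helper, and the default values are the ones from this notebook.

```python
import shlex


def build_create_cluster_command(cluster_name, num_workers=2,
                                 autoscaling_policy='two_worker_autoscaling_policy',
                                 idle_delete_ttl_seconds=3600):
    """Return the cluster-creation command as a shell-safe string."""
    args = [
        'terra', 'resource', 'create', 'dataproc-cluster',
        f'--name={cluster_name}',
        '--software-framework=HAIL',
        '--num-workers', str(num_workers),
        '--autoscaling-policy', autoscaling_policy,
        # The TTL is passed as a number of seconds with an "s" suffix.
        '--idle-delete-ttl', f'{idle_delete_ttl_seconds}s',
    ]
    # shlex.join quotes any argument that needs it (Python 3.8+).
    return shlex.join(args)


print(build_create_cluster_command('hail-demo-20230615'))
```

Printing the command first lets you review it before deciding to run it.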

The cluster will take a few minutes to start up.  If you like, you can view your new cloud environment in the [Workbench web UI](https://workbench.verily.com/workspaces), under the *Resources* tab, as [described here](https://support.workbench.verily.com/docs/how_to_guides/dataproc/).

## Access the JupyterLab server on the cluster as well as the debugging consoles

You can use the **JupyterLab URL** printed by the next cell to access JupyterLab running on the cluster. The same output lists the URLs of the debugging consoles, such as the Spark console.

You can also find the link to your new cluster's JupyterLab server by visiting the **Resources** tab for your workspace in the Workbench Web UI, as [described here](https://support.workbench.verily.com/docs/how_to_guides/dataproc/).

Alternatively, you can find the link directly in the Cloud Console if you like:
* Go to the Cloud Console -> Dataproc -> Clusters
* Select the cluster on which you want to run the notebook
* Click on tab 'WEB INTERFACES'
* Click on 'JupyterLab'

CPU utilization, memory utilization, and other performance metrics for the cluster are available on the Cloud Console. Click on the cluster name to see the plots of these metrics.

In [None]:
!gcloud dataproc clusters describe {HAIL_CLUSTER_NAME} --region=us-central1 \
  --format="yaml(config.endpointConfig.httpPorts)"
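
If you'd rather pull out a single URL programmatically, the sketch below parses the JSON form of the same `describe` output (obtained with `--format=json` instead of the `yaml(...)` projection). It assumes that the `httpPorts` map uses the key `JupyterLab`, which matches the name shown in the Cloud Console but is worth checking against your own output. `find_endpoint` is a hypothetical helper, and the sample data and URLs are placeholders.

```python
import json


def find_endpoint(describe_json, component='JupyterLab'):
    """Return the URL for one web interface, or None if it is not listed."""
    cluster = json.loads(describe_json)
    ports = (cluster.get('config', {})
                    .get('endpointConfig', {})
                    .get('httpPorts', {}))
    return ports.get(component)


# Placeholder data shaped like the relevant part of the describe output.
sample = json.dumps({
    'config': {
        'endpointConfig': {
            'httpPorts': {
                'JupyterLab': 'https://example.invalid/jupyter/lab/',
                'Spark History Server': 'https://example.invalid/sparkhistory/',
            }
        }
    }
})

print(find_endpoint(sample))
```

In a notebook, you could capture the real output with `out = !gcloud dataproc clusters describe {HAIL_CLUSTER_NAME} --region=us-central1 --format=json` and pass `'\n'.join(out)` to the function.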

# Use Hail on the cluster

## Use Hail interactively on the cluster's JupyterLab server

In the output of the "Access the JupyterLab server on the cluster as well as the debugging consoles" section above, click the **JupyterLab** link, or use one of the other methods described there to access the cluster's JupyterLab server (e.g., visit the Workbench web UI).   
You can also find this link under the **WEB INTERFACES** tab when you open the details for your cluster in the [Cloud Console](https://console.cloud.google.com/dataproc).

From the JupyterLab server, open the `/home/jupyter/repos/terra-axon-examples/dataproc/annotate_significant_gwas_results_with_gnomad.ipynb` notebook. 
This notebook is available on the cluster's JupyterLab server because we added the [terra-axon-examples repo](https://github.com/DataBiosphere/terra-axon-examples.git) as a reference to the workspace before creating the cluster (the `terra resource add-ref git-repo ...` command above).

In the notebook, you may want to try setting the `INTERVALS_TO_EXAMINE` constant to `['chr1-chr22']`, to run at scale.  This should cause the cluster's *autoscaling* to kick in.

As the cluster scales up, you can monitor it in the Cloud Console's Dataproc dashboard (e.g., see the YARN nodes panel):

<img src="https://github.com/DataBiosphere/terra-axon-examples/raw/main/dataproc/images/dataproc_dashboard.png" width="70%">

When the jobs are done, you can monitor the cluster as it scales back down:

<img src="https://github.com/DataBiosphere/terra-axon-examples/raw/main/dataproc/images/dataproc_dashboard2.png" width="70%">


## Submit a Hail batch job

The notebook [batch_job_submit.ipynb](https://github.com/DataBiosphere/terra-axon-examples/blob/main/dataproc/batch_job_submit.ipynb) (in the same repo directory as this notebook) walks you through submission of a batch job to the cluster. 

To run that notebook, you will need to know the name of your cluster.  

The batch job notebook does not need to be run _on_ the cluster's server; you can run it from a notebook cloud environment in your workspace.

In [None]:
# run this cell to be reminded of your cluster name
print(HAIL_CLUSTER_NAME)

Once the batch job is running (or after it has finished), you can view the [cluster dashboard](https://console.cloud.google.com/dataproc) (click a cluster to view its details) and the [job info](https://console.cloud.google.com/dataproc/jobs), including job logs, in the Google Cloud console.

# Stop or delete your cluster when you are finished

## Stopping a cluster

Verily Workbench does not currently support "autopause", so you may want to **stop** (pause) a cluster while you're not using it.  Alternatively, your cluster will automatically be deleted after `idle-delete-ttl` seconds of inactivity.  For the cluster created above, we set this value to 3600s, or one hour.
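
As a quick worked example of the timeout arithmetic, the sketch below converts a value such as `3600s` into a duration and adds it to the time the cluster was last active. It only handles the seconds form used in this notebook, and it assumes you know (or can estimate) when the cluster last did any work. Both helper names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_ttl(ttl):
    """Convert a duration such as '3600s' into a timedelta (seconds only)."""
    if not ttl.endswith('s') or not ttl[:-1].isdigit():
        raise ValueError(f'Expected a value like "3600s", got {ttl!r}')
    return timedelta(seconds=int(ttl[:-1]))


def earliest_deletion_time(last_activity, ttl='3600s'):
    """Return the earliest time at which an idle cluster could be deleted."""
    return last_activity + parse_ttl(ttl)


# A cluster that went idle at 09:30 could be deleted from 10:30 onwards.
print(earliest_deletion_time(datetime(2023, 6, 15, 9, 30)))
```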

You can stop the cluster via `gcloud` from the command line, or via the Workbench UI. See [this guide](https://support.workbench.verily.com/docs/how_to_guides/dataproc/) for details on how to do so from the web UI. 

You can run the next cell to stop the cluster from the command line.

<div class="alert alert-block alert-info">
<b>Note:</b> If autoscaling has been initiated, it may not be possible to <code>STOP</code> the cluster, only to <code>DELETE</code> it.<br/>   
    Even if your cluster has been stopped, it will still delete itself after the <code>idle-delete-ttl</code> period of inactivity.
</div>

You can also visit the [cluster dashboard](https://console.cloud.google.com/dataproc) in the Cloud Console to stop your cluster.  
* Go to the Cloud Console -> Dataproc -> Clusters
* Select the cluster that you want to stop
* Click on 'Stop'

In [None]:
# Uncomment this command to STOP your cluster
# !gcloud dataproc clusters stop {HAIL_CLUSTER_NAME} --region=us-central1

## Deleting a cluster

To delete a cluster, use the Workbench web UI or CLI (you cannot delete Workbench clusters via the Cloud Console).  See [this guide](https://support.workbench.verily.com/docs/how_to_guides/dataproc/) for details on how to do so from the Workbench web UI.  To delete a cluster using the CLI, uncomment and run the command below.


In [None]:
# Uncomment this command to DELETE your cluster
# !terra resource delete --name {HAIL_CLUSTER_NAME}

# Provenance

Generate information about this notebook environment and the packages installed.

In [None]:
!date

Conda and pip installed packages:

In [None]:
!conda env export

JupyterLab extensions:

In [None]:
!jupyter labextension list

Number of cores:

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

Memory:

In [None]:
!grep "^MemTotal:" /proc/meminfo
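
If you'd like the same core and memory figures as Python values (for example, to log them alongside results), the sketch below is one way to collect them. It assumes a Linux environment with `/proc/meminfo`, as in the shell commands above, and returns `None` for memory elsewhere. Both function names are hypothetical.

```python
import os


def parse_mem_total_kb(meminfo_text):
    """Return the MemTotal value in kB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith('MemTotal:'):
            # The line looks like "MemTotal:       16384256 kB".
            return int(line.split()[1])
    return None


def describe_machine(meminfo_path='/proc/meminfo'):
    """Collect the logical CPU count and total memory of this machine."""
    try:
        with open(meminfo_path) as f:
            mem_kb = parse_mem_total_kb(f.read())
    except OSError:
        mem_kb = None  # e.g. not running on Linux
    return {'cores': os.cpu_count(), 'mem_total_kb': mem_kb}


print(describe_machine())
```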

---
Copyright 2023 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style   
license that can be found in the LICENSE file or at   
https://developers.google.com/open-source/licenses/bsd