# Batch job submission

This notebook shows how to submit a Dataproc batch job. The job script is generated by converting the `annotate_significant_gwas_results_with_gnomad.ipynb` notebook to a Python script.

First, derive the staging bucket GCS URI from its known resource name (which is based on the workspace ID), using the `terra` CLI:

In [None]:
# Use jq's -r (raw output) flag so the ID is returned without surrounding JSON quotes.
ws_id_list = !terra workspace describe --format=JSON | jq -r '.id'
WORKSPACE_ID = ws_id_list[0]
print(WORKSPACE_ID)

In [None]:
STAGING_BUCKET_CMD_OUTPUT = !terra resolve --name=dataproc-staging-{WORKSPACE_ID}
STAGING_BUCKET = STAGING_BUCKET_CMD_OUTPUT[0]
print(STAGING_BUCKET)
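
Before going further, it can help to confirm that the resolved value looks like a GCS URI, because a failed `terra resolve` would otherwise surface only later, when the job runs. The following is a minimal sketch: `is_gcs_uri` is a hypothetical helper (not part of the `terra` CLI), the example values are made up, and the check covers only the shape of the string, not whether the bucket exists.

```python
def is_gcs_uri(value: str) -> bool:
    """Return True if value has the form gs://<bucket>[/<path>]."""
    prefix = "gs://"
    if not value.startswith(prefix):
        return False
    # The bucket name is everything between the prefix and the first slash.
    bucket = value[len(prefix):].split("/", 1)[0]
    return len(bucket) > 0


# Made-up values for illustration; in the notebook you would pass STAGING_BUCKET.
print(is_gcs_uri("gs://example-staging-bucket"))
print(is_gcs_uri("ERROR: resource not found"))
```

In the notebook, `assert is_gcs_uri(STAGING_BUCKET)` placed right after the cell above would stop execution early if the lookup did not return a bucket URI.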

Convert the `annotate_significant_gwas_results_with_gnomad.ipynb` notebook to a Python script:

In [None]:
!jupyter nbconvert --to script annotate_significant_gwas_results_with_gnomad.ipynb

Next, temporarily work around an issue with invoking `terra` CLI commands from the converted script.  
Remove the section of the script that derives `STAGING_BUCKET`, and replace each remaining reference to `STAGING_BUCKET` with the value derived above.

In [None]:
!cp annotate_significant_gwas_results_with_gnomad.py annotate_significant_gwas_results_with_gnomad_ORIG.py

In [None]:
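# Copy the original script line by line, dropping everything between the
# "### terra-cli begin" and "### terra-cli end" markers (markers included).
with open('annotate_significant_gwas_results_with_gnomad_ORIG.py', 'rt') as fin, \
     open('annotate_significant_gwas_results_with_gnomad.py', 'wt') as fout:
    excised_section = False
    for line in fin:
        if "### terra-cli begin" in line:
            excised_section = True
        elif "### terra-cli end" in line:
            excised_section = False
        elif not excised_section:
            # Substitute the bucket URI, as a quoted literal, for the variable name.
            fout.write(line.replace("STAGING_BUCKET", f"'{STAGING_BUCKET}'"))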
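
To see what this rewrite does without touching any files, the same logic can be applied to a few in-memory lines. This is only a sketch: `excise_and_substitute` is a hypothetical helper, and the sample script lines (including `resolve_bucket`) and the bucket name are invented for illustration.

```python
def excise_and_substitute(lines, bucket_uri):
    """Drop lines between the terra-cli markers (markers included) and
    replace the STAGING_BUCKET name with a quoted string literal."""
    out = []
    in_excised_section = False
    for line in lines:
        if "### terra-cli begin" in line:
            in_excised_section = True
        elif "### terra-cli end" in line:
            in_excised_section = False
        elif not in_excised_section:
            out.append(line.replace("STAGING_BUCKET", f"'{bucket_uri}'"))
    return out


# Invented sample input standing in for the converted script.
sample = [
    "import os\n",
    "### terra-cli begin\n",
    "STAGING_BUCKET = resolve_bucket()\n",
    "### terra-cli end\n",
    "tmp_dir = os.path.join(STAGING_BUCKET, 'tmp')\n",
]
result = excise_and_substitute(sample, "gs://example-staging-bucket")
print("".join(result), end="")
```

Note that every occurrence of the text `STAGING_BUCKET` outside the marked section is replaced, so this approach relies on all assignments to that variable sitting between the two markers.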

## Submit the batch job

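**Edit the following cell, replacing `<YOUR_CLUSTER_ID>` with the ID of your Dataproc cluster.** If your cluster is in a region other than `us-central1`, change the `--region` value as well.

Then run the cell to submit the batch job. You can monitor the running job via the cell output, or in the Cloud console at <https://console.cloud.google.com/dataproc/jobs>.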

In [None]:
!gcloud dataproc jobs submit pyspark --cluster <YOUR_CLUSTER_ID> --region us-central1 \
    annotate_significant_gwas_results_with_gnomad.py
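
If you would rather keep the cluster ID in a Python variable than edit the command text, the same submission can be assembled as an argument list. The following is a sketch under assumptions: `build_submit_command` is a hypothetical helper and `example-cluster` is a made-up cluster ID; the `gcloud` subcommand and flags are the ones used in the cell above.

```python
import shlex


def build_submit_command(cluster_id, region, script_path):
    """Build the gcloud argument list for submitting a PySpark job."""
    # Guard against running with the unedited placeholder or an empty ID.
    if not cluster_id or "<" in cluster_id:
        raise ValueError("Set cluster_id to the ID of your Dataproc cluster")
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster", cluster_id,
        "--region", region,
        script_path,
    ]


cmd = build_submit_command(
    "example-cluster",  # made-up ID; use your own cluster's ID
    "us-central1",
    "annotate_significant_gwas_results_with_gnomad.py",
)
# Print the command for review; to submit it, one option is
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems and lets the placeholder check fail fast before anything is sent to Dataproc.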

## Provenance

Generate information about this notebook environment and the packages installed.

In [None]:
!date


Conda and pip installed packages:

In [None]:
!conda env export


JupyterLab extensions:

In [None]:
!jupyter labextension list


Number of cores:

In [None]:
!grep ^processor /proc/cpuinfo | wc -l


Memory:

In [None]:
!grep "^MemTotal:" /proc/meminfo

---
Copyright 2023 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style  
license that can be found in the LICENSE file or at  
https://developers.google.com/open-source/licenses/bsd