# NVIDIA Clara Parabricks on Microsoft Azure 


This notebook presents **sample code** for running NVIDIA Clara Parabricks pipelines on Azure Machine Learning Studio and Ubuntu Virtual Machines on [Microsoft Azure](https://azure.microsoft.com/en-us/).

### What is NVIDIA Clara Parabricks Pipelines?*
"_Parabricks is a software suite for performing secondary analysis of next generation sequencing (NGS) DNA and RNA data. A major benefit of Parabricks is that it is designed to deliver results at blazing fast speeds and low cost. Parabricks can analyze whole human genomes in about **45 minutes**, compared to about 30 hours for **30x WGS data**. The best part is the output results exactly match the commonly used software. So, it’s fairly simple to verify the accuracy of the output._"

### Why use NVIDIA Clara Parabricks Pipelines?*

"_Under the hood, it achieves this performance through tight integration with GPUs, which excel at performing data parallel computation much more effectively than traditional CPU-based solutions. Parabricks was built from the ground up by GPU computing and Deep Learning experts who wanted to develop the fastest and most efficient possible implementation of common genomics algorithms used in secondary analysis._"

*You can learn more at https://developer.nvidia.com/clara-parabricks


**Recommended Virtual Machine configurations from Microsoft Azure**

- Standard_NC64AS_T4_V3 (64 cores, 448 GB RAM, 2816 GB disk); **Processing Unit:** GPU, 4 x NVIDIA Tesla T4
- Standard_NC24s_v3 (24 cores, 448 GB RAM, 1344 GB disk); **Processing Unit:** GPU, 4 x NVIDIA Tesla V100

For more information about NC series VMs on Azure, please visit [this link](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series)

**IMPORTANT INFORMATION**

Users need an **NVIDIA Clara Parabricks** license or trial license to run the pipelines. For more information, please visit https://www.nvidia.com/en-us/clara/genomics/

### Microsoft Azure Resources 

If you are new to Azure, see:
- [Microsoft Genomics](https://www.microsoft.com/en-us/genomics/)
- [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/)
- [Azure Virtual Machines](https://azure.microsoft.com/services/virtual-machines/)
- [Azure Linux Virtual Machines documentation](https://docs.microsoft.com/azure/virtual-machines/linux/)
- [Template reference](https://docs.microsoft.com/azure/templates/microsoft.compute/allversions)
- [Quickstart templates](https://azure.microsoft.com/resources/templates/?resourceType=Microsoft.Compute&pageNumber=1&sort=Popular)


# Create and manage Microsoft Azure Machine Learning Studio

This chapter uses the cloud notebook server in your workspace for an install-free and pre-configured experience. Use your own environment if you prefer to have control over your environment, packages and dependencies.

Follow along with this video or use the detailed steps below to clone and run the tutorial from your workspace.

For further details on creating an Azure ML workspace, please visit [this page](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace).


In [27]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.microsoft.com/en-us/videoplayer/embed/RE4mTUr" frameborder="0" allowfullscreen></iframe>')


## Getting Started with NVIDIA Clara Parabricks 

In [29]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/AQltyCwPgU0?start=0" title="YouTube video player" frameborder="0" allowfullscreen></iframe>')


### Install Dependencies for NVIDIA Parabricks Test Run

In [None]:
!sudo apt install -y nvidia-driver-460

In [None]:
!sudo reboot

In [None]:
!curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

In [None]:
!sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

In [None]:
!curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
!distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
!sudo apt-get update

In [None]:
!sudo apt-get install -y nvidia-container-runtime


In [None]:
!curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

In [None]:
!distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

In [None]:
!sudo apt-get update


In [None]:
!sudo apt-get install -y nvidia-docker2

In [None]:
!sudo systemctl restart docker

In [None]:
!sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

**`ATTENTION:` Please check the status of the NVIDIA drivers before running your Parabricks pipelines. You should see output similar to the following, reflecting your own GPU configuration.**

In [4]:
!sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
Thu Apr  8 23:33:35 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| B
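
The same driver check can be scripted so that later cells fail early when no driver is visible. The following is a minimal sketch, assuming the banner line format shown in the output above; `parse_smi_header` and `check_gpu_driver` are hypothetical helper names, not part of any NVIDIA tool.

```python
import re
import shutil
import subprocess

# Matches the banner line that nvidia-smi prints, as seen in the output above.
_HEADER = re.compile(
    r"NVIDIA-SMI\s+[\d.]+\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)

def parse_smi_header(text):
    """Return the driver and CUDA versions found in nvidia-smi output, or None."""
    match = _HEADER.search(text)
    if match is None:
        return None
    return {"driver": match.group("driver"), "cuda": match.group("cuda")}

def check_gpu_driver():
    """Run nvidia-smi on the host, if it is installed, and parse its banner."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return parse_smi_header(result.stdout)

# Banner line copied from the sample output above.
sample = "| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |"
print(parse_smi_header(sample))
```

A return value of `None` from `check_gpu_driver()` would suggest that the driver is not installed or not loaded, in which case the installation steps above are worth revisiting.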

#### Step 1: Untar the package. Users need to download this `.tar.gz` file from their NVIDIA account: https://developer.nvidia.com/clara-parabricks

In [None]:
!tar -xzf parabricks.tar.gz

#### PLEASE USE YOUR TERMINAL FOR RUNNING THE FOLLOWING CELL

#### Step 2 (Node Lock License): Run the installer.

In [None]:
!sudo ./parabricks/installer.py

#### Step 3: Verify your installation.

In [None]:
# This should display the parabricks version number:
! pbrun version
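
If the version command fails, one likely cause is that `pbrun` is not on the `PATH` of the notebook kernel. A small sketch of that check using only the Python standard library; `find_tool` is a hypothetical helper name.

```python
import shutil

def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)

pbrun_path = find_tool("pbrun")
if pbrun_path is None:
    print("pbrun not found on PATH; re-run the installer or check your PATH.")
else:
    print(f"pbrun found at {pbrun_path}")
```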

## Sample run: `fastq to bam` pipeline with Parabricks

### Prerequisites for downloading sample data: Download AzCopy

For convenience, consider adding the directory location of the AzCopy executable to your system path for ease of use. That way you can type azcopy from any directory on your system.

If you choose not to add the AzCopy directory to your path, you'll have to change directories to the location of your AzCopy executable and type azcopy or .\azcopy in Windows PowerShell command prompts.

As an owner of your Azure Storage account, you aren't automatically assigned permissions to access data. Before you can do anything meaningful with AzCopy, you need to decide how you'll provide authorization credentials to the storage service.
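
Inside a notebook, the path change described above can be made for the current kernel session only. This is a sketch under the assumption that the AzCopy binary sits in the notebook's working directory; `prepend_to_path` and `azcopy_dir` are hypothetical names, and the change does not persist after the kernel restarts.

```python
import os

def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH unless it is already listed."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]

# Assumed location; use the directory that actually holds the azcopy binary.
azcopy_dir = os.path.abspath(".")
prepend_to_path(azcopy_dir)
```

Shell commands started from the notebook inherit the kernel's environment, so after this call `azcopy` should resolve without the `./` prefix.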

In [None]:
!wget -O azcopy.tar.gz https://aka.ms/downloadazcopy-v10-linux
!tar -xzf azcopy.tar.gz --strip-components=1

### Download `hg38` reference genome

In [None]:
!./azcopy cp "https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D" "./mnt" --include-pattern "Homo_sapiens_assembly38.*" --recursive

### Download the 'Genome In a Bottle' Datasets from Azure Genomics Data Lake
Several public genomics datasets are available in the Azure Genomics Data Lake [here](https://azure.microsoft.com/en-us/services/open-datasets/catalog/genomics-data-lake/). The cells below create a blob service client linked to these open datasets and show how to download `Genome in a Bottle (GIAB)` [datasets](https://www.nist.gov/programs-projects/genome-bottle) from the Azure Genomics Data Lake:

**Install Azure Blob Storage SDK**

In [None]:
!pip install azure-storage-blob==2.1.0

In [None]:
from azure.storage.blob import BlockBlobService

# Connect to the GIAB open dataset storage account with its SAS token.
blob_service_client = BlockBlobService(account_name='datasetgiab', sas_token='sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=7qp%2BxGLGc%2BO2MIVzzDZY7GSqEwthyGnhXJ566KoH7As%3D')
# Download the R1 reads of the HG002 sample to /mnt.
blob_service_client.get_blob_to_path('dataset/data/AshkenazimTrio/HG002_NA24385_son/Illumina_PCRfree_downsampled', 'HG002_HiSeq30x_subsampled_R1.fastq.gz', '/mnt/HG002_HiSeq30x_subsampled_R1.fastq.gz')

In [None]:
from azure.storage.blob import BlockBlobService

# Connect to the GIAB open dataset storage account with its SAS token.
blob_service_client = BlockBlobService(account_name='datasetgiab', sas_token='sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=7qp%2BxGLGc%2BO2MIVzzDZY7GSqEwthyGnhXJ566KoH7As%3D')
# Download the R2 reads of the HG002 sample to /mnt.
blob_service_client.get_blob_to_path('dataset/data/AshkenazimTrio/HG002_NA24385_son/Illumina_PCRfree_downsampled', 'HG002_HiSeq30x_subsampled_R2.fastq.gz', '/mnt/HG002_HiSeq30x_subsampled_R2.fastq.gz')
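
Before starting a long alignment, it can help to confirm that both downloads are readable gzip files and that the two mate files line up. The sketch below uses only the standard library and inspects just the first records; `read_ids` and `mates_look_paired` are hypothetical helper names, and the check assumes the usual four-line FASTQ record layout.

```python
import gzip
from itertools import islice

def read_ids(path, max_records=1000):
    """Return read names from the first `max_records` records of a gzipped FASTQ."""
    ids = []
    with gzip.open(path, "rt") as handle:
        # Each FASTQ record spans four lines; the first is the '@' header.
        for i, line in enumerate(islice(handle, 4 * max_records)):
            if i % 4 != 0:
                continue
            if not line.startswith("@"):
                raise ValueError(f"{path}: line {i + 1} is not a FASTQ header")
            # Drop the '@', any comment field, and a trailing /1 or /2 mate tag.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            ids.append(name)
    return ids

def mates_look_paired(r1_path, r2_path, max_records=1000):
    """Check that the first records of R1 and R2 carry the same read names."""
    return read_ids(r1_path, max_records) == read_ids(r2_path, max_records)
```

Calling `mates_look_paired('/mnt/HG002_HiSeq30x_subsampled_R1.fastq.gz', '/mnt/HG002_HiSeq30x_subsampled_R2.fastq.gz')` would exercise it on the files downloaded above. A corrupt or truncated download typically raises an error while the file is being read.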

### `fastq to bam` pipeline submission to Parabricks client

In [None]:
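# FASTQ paths match the download cells above; point --ref at the location where AzCopy saved the reference.
!pbrun fq2bam --ref Homo_sapiens_assembly38.fasta --in-fq /mnt/HG002_HiSeq30x_subsampled_R1.fastq.gz /mnt/HG002_HiSeq30x_subsampled_R2.fastq.gz --out-bam HG002_HiSeq30x_subsampled.bam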
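
The same submission can be wrapped in Python so that missing inputs are reported before the pipeline starts. This is a sketch only: it uses the `--ref`, `--in-fq` and `--out-bam` options from the command above, and `build_fq2bam_cmd` and `run_fq2bam` are hypothetical helper names rather than part of Parabricks.

```python
import os
import subprocess

def build_fq2bam_cmd(ref, fq1, fq2, out_bam):
    """Assemble the pbrun fq2bam argument list used in the cell above."""
    return ["pbrun", "fq2bam", "--ref", ref, "--in-fq", fq1, fq2, "--out-bam", out_bam]

def run_fq2bam(ref, fq1, fq2, out_bam):
    """Check that every input exists, then launch pbrun and raise on failure."""
    missing = [p for p in (ref, fq1, fq2) if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    subprocess.run(build_fq2bam_cmd(ref, fq1, fq2, out_bam), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces, and `check=True` makes a non-zero exit from `pbrun` raise an exception in the notebook.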

### Notices

Third party software notices from [NVIDIA CLARA PARABRICKS](https://docs.nvidia.com/clara/parabricks/v3.5/text/software_notices.html)

### Support

For questions about this notebook, please send an e-mail to genomics@microsoft.com.

For other questions about NVIDIA Clara Parabricks, visit the [Developer forum of NVIDIA Clara Parabricks](https://forums.developer.nvidia.com/c/healthcare/parabricks/290).