# Hail on Azure using HDInsight
- Create an Azure Managed Identity
- Create an Azure Data Lake Storage Gen2 instance
- Assign the Managed Identity the "Storage Blob Data Owner" role on the storage instance
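The first three steps can also be scripted with the Azure CLI. A hedged sketch, assuming placeholder names (`my-hail-rg`, `hailidentity`, `mystorageacct`, `eastus`) that you would replace with your own:

```shell
# Sketch only: resource group, names, and region below are assumptions.
RG=my-hail-rg
az identity create --resource-group "$RG" --name hailidentity
# --hns enables the hierarchical namespace, i.e. Data Lake Storage Gen2.
az storage account create --resource-group "$RG" --name mystorageacct \
    --location eastus --kind StorageV2 --hns true
PRINCIPAL_ID=$(az identity show --resource-group "$RG" --name hailidentity \
    --query principalId --output tsv)
SCOPE=$(az storage account show --resource-group "$RG" --name mystorageacct \
    --query id --output tsv)
az role assignment create --assignee "$PRINCIPAL_ID" \
    --role "Storage Blob Data Owner" --scope "$SCOPE"
```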
- Create an HDInsight cluster of type "Spark" with Spark version 2.4
- Add a Script Action on the configuration tab with a link to a copy of `hail-install.sh`, applied to both Head and Worker nodes
- Wait for the cluster to reach a "Running" status
- Open the Apache Ambari Dashboard on the HDInsight Cluster
- Go to the `Spark2` page by clicking it on the left panel
- Open the Configs tab and expand the `Advanced livy2-env` section
- Append the following to the bottom of the configuration content box:

      export PYSPARK_PYTHON=/usr/bin/anaconda/envs/hail/bin/python3.7
      export PYSPARK3_PYTHON=/usr/bin/anaconda/envs/hail/bin/python3.7
      export PYSPARK_DRIVER_PYTHON=/usr/bin/anaconda/envs/hail/bin/python3.7
- Expand the `Advanced spark2-env` section
- Replace the last line exporting `PYSPARK_PYTHON` with the following line:

      export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/anaconda/envs/hail/bin/python3.7}
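The `${PYSPARK_PYTHON:-...}` form uses shell default-value expansion: if the variable is already set (for example by the Livy configuration above), that value wins; otherwise the fallback after `:-` is used. A quick illustration with a stand-in variable:

```shell
# ${VAR:-default} substitutes the default only when VAR is unset or empty.
unset DEMO_PYTHON
first=${DEMO_PYTHON:-/usr/bin/anaconda/envs/hail/bin/python3.7}
DEMO_PYTHON=/usr/local/bin/python3
second=${DEMO_PYTHON:-/usr/bin/anaconda/envs/hail/bin/python3.7}
echo "$first"    # the fallback path, since DEMO_PYTHON was unset
echo "$second"   # the pre-set value, since DEMO_PYTHON was set
```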
- Append the following to the end of the line starting with `export SPARK_DIST_CLASSPATH=...`:

      /usr/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/hail-all-spark.jar:
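Entries in `SPARK_DIST_CLASSPATH` are colon-separated, so the appended jar simply becomes one more classpath entry. A stand-alone illustration (the starting value here is an assumption for demonstration):

```shell
# Assumed existing value, for illustration only.
SPARK_DIST_CLASSPATH='/usr/hdp/current/hadoop-client/*'
# Appending the Hail jar as an additional colon-separated entry.
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/hail-all-spark.jar"
echo "$SPARK_DIST_CLASSPATH"
```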
- Expand the `Custom spark2-defaults` section
- Click `Add Property ...` at the bottom of the list
- Input `spark.kryo.registrator` for the `key` property
- Input `is.hail.kryo.HailKryoRegistrator` for the `value` property
- Select `Text` for the property type
- Click `Add` to append the property to the end of the list
- Repeat the previous five steps (from `Add Property ...` through `Add`) for each of the following key/value pairs:
| Key | Value |
|---|---|
| `spark.jars` | `/usr/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar` |
| `spark.driver.extraClassPath` | `/usr/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar` |
| `spark.yarn.dist.jars` | `/usr/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar` |
| `spark.executor.extraClassPath` | `./hail-all-spark.jar` |
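For reference, the same properties could be passed on the command line when starting an ad-hoc Spark session by hand; this is a sketch using the paths from the table above, not part of the Ambari setup itself:

```shell
# Sketch: equivalent --conf flags for a manually launched pyspark session.
HAIL_JAR=/usr/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/backend/hail-all-spark.jar
pyspark \
  --jars "$HAIL_JAR" \
  --conf spark.driver.extraClassPath="$HAIL_JAR" \
  --conf spark.yarn.dist.jars="$HAIL_JAR" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
```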
- Save the changes and restart the affected services. These changes require a restart of the Spark2 service; the Ambari UI will show a restart reminder. Click Restart to restart all affected services.
- To add the new virtual environment to Jupyter notebooks, go to the Azure dashboard for your HDInsight cluster
- Open the Script Actions tab
- Add a new custom Script Action, applied only to the Head nodes, that points the Jupyter notebook instance at the new hail environment, with a link to a copy of `set-jupyter-to-hail.sh`
- Wait for the script to complete
- You have now successfully configured the cluster to run Hail
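As a final check, you can confirm from an SSH session on a head node that the hail environment's interpreter can import the package; a minimal sketch, assuming the install path used throughout this guide:

```shell
# Should print the installed Hail version if the install Script Action succeeded.
/usr/bin/anaconda/envs/hail/bin/python3.7 -c 'import hail as hl; print(hl.version())'
```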