# Orchestrating Databricks workloads on AWS MWAA

## Introduction

Job orchestration is a fully integrated feature of Databricks. Customers can use the *Jobs API* or the UI to create, manage, and monitor jobs. Databricks orchestration supports jobs with a single task as well as jobs with multiple tasks.

In this notebook we will use the Jobs API together with AWS MWAA to orchestrate and monitor a DAG (Directed Acyclic Graph) containing Databricks-based tasks. To do so, we will create a simple DAG that connects to a Databricks cluster and executes a notebook. MWAA will in turn monitor this execution.

## 1. Create an API token in Databricks

Make sure you have access to a Databricks workspace and that a Databricks cluster has been configured inside this workspace.

In your Databricks account, select your username in the top right corner and then **User Settings**.

<p align="center">
    <img src="images/Databricks User Settings.png" width="700" height="400"/>
</p>

From the User Settings panel, select **Access tokens** and then **Generate new token**. Here, you can add a description for the token and set its lifetime (the number of days after which it expires). Once you have created a token, the following window will pop up:

<p align="center">
    <img src="images/Databricks Token.png" width="700" height="200"/>
</p>

Copy the token now and store it somewhere safe. Once you press **Done**, you won't be able to see its value again.
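
Before wiring the token into MWAA, it can be worth confirming that the workspace accepts it. The following is a minimal sketch, assuming the workspace exposes the Databricks REST endpoint `/api/2.0/clusters/list` and that the token is sent as a bearer token. The function names `build_clusters_list_request` and `token_works` are hypothetical helpers introduced for illustration only.

```python
import urllib.error
import urllib.request


def build_clusters_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the clusters/list endpoint."""
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    # The personal access token is passed as a bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def token_works(workspace_url: str, token: str) -> bool:
    """Return True if the workspace answers the request with HTTP 200."""
    request = build_clusters_list_request(workspace_url, token)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Raised for error statuses such as 401 or 403 (token rejected).
        return False


# Usage (placeholders must be replaced with your own values):
#   token_works("<DATABRICKS_WORKSPACE_URL>", "<token_from_previous_step>")
```

Network problems (for example, a wrong host name) are deliberately not caught, so that they surface as errors rather than being mistaken for a rejected token.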

## 2. Create the MWAA to Databricks connection

Open the Airflow UI from the MWAA environment. Navigate to **Admin** and then select **Connections**.

<p align="center">
    <img src="images/Airflow UI Connections.png" width="700" height="250"/>
</p>

From the list of connections, select **databricks_default** and click **Edit record**.
- In the **Host** column, paste the URL of your Databricks workspace. You can find it by opening the workspace in your browser and copying the URL from the address bar.
- In the **Extra** column, add the following JSON dictionary:

`{"token": "<token_from_previous_step>", "host": "<url_from_host_column>"}`


- In the **Connection Type** column you should select **Databricks** from the drop-down menu. If the connection type is missing, you will need to install additional packages using the *Airflow Provider Package*. Follow the **Create & upload requirements.txt file** section below.
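
Hand-typed JSON in the **Extra** column is easy to get wrong (single quotes, missing commas). As a small sketch, the value can be generated with Python's standard `json` module and pasted into the field. The helper name `build_extra` is hypothetical, and the placeholders are the same ones used above.

```python
import json


def build_extra(token: str, host: str) -> str:
    """Return the Extra field content as a valid JSON string."""
    return json.dumps({"token": token, "host": host})


# Replace the placeholders with your own token and workspace URL.
print(build_extra("<token_from_previous_step>", "<url_from_host_column>"))
```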

### Create & upload requirements.txt file

To obtain the Databricks connection type, we need to install the corresponding Python dependency in our MWAA environment by uploading a `requirements.txt` file to the MWAA-designated S3 bucket.

To check that the `requirements.txt` file is correct before uploading it to our MWAA environment, we will use a command line interface (CLI) utility that replicates an MWAA environment locally. The CLI builds a Docker image locally, which lets you run a local Airflow environment to develop and test DAGs, custom plugins, and dependencies before deploying them to the cloud.

We will use the following GitHub repository for that: https://github.com/aws/aws-mwaa-local-runner. 

1. First we will clone the repository:

`git clone https://github.com/aws/aws-mwaa-local-runner.git`

`cd aws-mwaa-local-runner`

2. Then we will build the Docker image using the following command (this will take a while):

`./mwaa-local-env build-image`

3. Once the image has been built, we can run a local Airflow environment that closely resembles the MWAA environment:

`./mwaa-local-env start`

This will take a while to set up, but once everything is ready to go, you should see the following message in your terminal:

<p align="center">
    <img src="images/Container Started.png" width="400" height="300"/>
</p>


4. Now you're ready to open the Apache Airflow UI at http://localhost:8080/. By default the username will be set to **admin** and the password to **test**.

<p align="center">
    <img src="images/Local UI.png" width="700" height="350"/>
</p>


5. To add the desired **Python dependencies**, navigate to `aws-mwaa-local-runner/requirements/`. This folder is where you will create your `requirements.txt` file.

<p align="center">
    <img src="images/Requirements Folder.png" width="600" height="300"/>
</p>

Inside `requirements.txt` you should add the following line to install the package needed for the **Databricks connection type**:

`apache-airflow[databricks]`

6. To test the `requirements.txt` without running Airflow, use the following command:

`./mwaa-local-env test-requirements`

If everything ran successfully, you should see the following output:

<p align="center">
    <img src="images/Requirements Installed.png" width="400" height="300"/>
</p>
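
As an extra check, you can confirm from a Python session in an environment where the requirements were installed that the Databricks provider module is importable. This is a minimal sketch: the helper `module_available` is hypothetical, and the module path `airflow.providers.databricks` is taken from the import used in the example DAG later in this notebook.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the named module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `airflow`) is not installed at all.
        return False


print(module_available("airflow.providers.databricks"))
```

If this prints `False`, the Databricks connection type will most likely be missing from the Airflow UI as well.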

Now you are ready to upload this requirements.txt file to your MWAA environment.

### Upload the requirements.txt file to the S3 bucket

Once you have uploaded the file to your MWAA S3 bucket, the bucket should look like this:

<p align="center">
    <img src="images/S3 bucket.png" width="600" height="400"/>
</p>
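
The upload itself can be done in the S3 console, or scripted. The sketch below assumes `boto3` is installed and AWS credentials are configured; it relies on the S3 client's `upload_file(Filename, Bucket, Key)` method. The helper name `upload_requirements`, the default local path, and the bucket placeholder are assumptions for illustration.

```python
def upload_requirements(s3_client, bucket: str,
                        local_path: str = "requirements/requirements.txt",
                        key: str = "requirements.txt") -> str:
    """Upload the local requirements file and return its S3 URI."""
    # Positional arguments map to Filename, Bucket, Key.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (assumes boto3 and valid AWS credentials):
#   import boto3
#   upload_requirements(boto3.client("s3"), "<MWAA_BUCKET_NAME>")
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object before touching a real bucket.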

### Specify the file path in the Requirements file field

Navigate to the MWAA console and select your **Environment**. Once you're on the environment page select **Edit**.

Under the **DAG code in Amazon S3** section, update the **Requirements file** field by selecting the path of the `requirements.txt` file you have just uploaded to the S3 bucket.
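
The same setting can also be changed programmatically instead of through the console. The following is a hedged sketch, assuming `boto3` is available and using the MWAA client's `update_environment` call with its `Name`, `RequirementsS3Path`, and `RequirementsS3ObjectVersion` parameters. The helper name `set_requirements_path` and the placeholder environment name are hypothetical.

```python
def set_requirements_path(mwaa_client, environment_name: str,
                          requirements_key: str = "requirements.txt",
                          object_version=None) -> dict:
    """Point an MWAA environment at a requirements file in its S3 bucket."""
    kwargs = {
        "Name": environment_name,
        # Path of the file relative to the environment's S3 bucket.
        "RequirementsS3Path": requirements_key,
    }
    if object_version is not None:
        # Only needed when a specific S3 object version should be pinned.
        kwargs["RequirementsS3ObjectVersion"] = object_version
    mwaa_client.update_environment(**kwargs)
    return kwargs


# Usage (assumes boto3 and valid AWS credentials):
#   import boto3
#   set_requirements_path(boto3.client("mwaa"), "<MWAA_ENVIRONMENT_NAME>")
```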

Now we are ready to put everything together. Navigate back to the Airflow UI of your environment and return to the **Connections** page.

Edit the `databricks_default` connection again; **Databricks** should now be available in the **Connection Type** column. The final connection should look like this:

<p align="center">
    <img src="images/Databricks Connection.png" width="700" height="500"/>
</p>

## 3. Create the Airflow DAG

Below you have an example DAG file that will run a Databricks Notebook on a specific schedule. Once you have created your DAG, you should upload it to the MWAA S3 bucket in the `dags` folder.

In [None]:
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator, DatabricksRunNowOperator
from datetime import datetime, timedelta 


# Parameters for DatabricksSubmitRunOperator: the notebook to execute
notebook_task = {
    'notebook_path': '<DATABRICKS_NOTEBOOK_PATH>',
}


# Parameters for DatabricksRunNowOperator (not used by the DAG below)
notebook_params = {
    "Variable": 5
}


default_args = {
    'owner': '<OWNER_NAME>',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': <NUMBER_DESIRED_RETRIES>,
    'retry_delay': timedelta(minutes=2)
}


with DAG('databricks_dag',
    # should be a datetime format
    start_date=<DESIRED_START_DATE>,
    # check out possible intervals, should be a string
    schedule_interval='<DESIRED_INTERVAL>',
    catchup=False,
    default_args=default_args
    ) as dag:


    opr_submit_run = DatabricksSubmitRunOperator(
        task_id='submit_run',
        # the connection we set-up previously
        databricks_conn_id='databricks_default',
        existing_cluster_id='<CLUSTER_ID>',
        notebook_task=notebook_task
    )
    opr_submit_run
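
The placeholders in the DAG above (`<DESIRED_START_DATE>`, `<DESIRED_INTERVAL>`, `<NUMBER_DESIRED_RETRIES>`) must be replaced with real Python values before the file will parse. As a sketch, one possible set of values is shown below; the specific date, the `@daily` preset, the owner name, and the retry count are illustrative assumptions, not recommendations from this notebook.

```python
from datetime import datetime, timedelta

# start_date must be a datetime object, not a string.
start_date = datetime(2023, 1, 1)

# schedule_interval is a string: an Airflow preset such as "@daily"
# or a cron expression such as "0 6 * * *".
schedule_interval = "@daily"

default_args = {
    "owner": "data-team",        # illustrative owner name
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,                # an integer, not a string
    "retry_delay": timedelta(minutes=2),
}
```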


Once uploaded to the `dags` folder, you will be able to see the new DAG in the Airflow UI on your MWAA environment, under paused DAGs. In order to manually trigger the DAG, you will first have to unpause it.

The DAG might fail if your notebook contains commands that mount the S3 bucket, because mounting should only be done once. To avoid this, either comment out those commands or add commands that unmount the S3 bucket in a cell at the end of the notebook.
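
A third option, not described above, is to make the mount step idempotent by checking the existing mounts first. The sketch below assumes it runs inside a Databricks notebook, where `dbutils.fs.mounts()` and `dbutils.fs.mount(source=..., mount_point=...)` are available. The helper name `ensure_mounted` and the placeholder values are hypothetical.

```python
def ensure_mounted(dbutils, source: str, mount_point: str) -> bool:
    """Mount `source` at `mount_point` unless it is already mounted.

    Returns True if a new mount was created, False if it already existed.
    """
    already_mounted = any(
        mount.mountPoint == mount_point for mount in dbutils.fs.mounts()
    )
    if already_mounted:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point)
    return True


# Usage inside a Databricks notebook (placeholders are yours to fill in):
#   ensure_mounted(dbutils, "s3a://<BUCKET_NAME>", "/mnt/<MOUNT_NAME>")
```

With this pattern the notebook can be triggered repeatedly by the DAG without commenting out or undoing the mount.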

## Conclusion
At this point, you should have a good understanding of:
- How to create an API token in Databricks
- How to create an MWAA to Databricks connection
- How to create a DAG that runs a Databricks Notebook