
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

<i18n value="51f698bd-880b-4a85-b187-9b96d8c2cf18"/>


# Lab: Orchestrating Jobs with Databricks

In this lab, you'll configure a multi-task job comprising:
* A notebook that lands a new batch of data in a storage directory
* A Delta Live Tables pipeline that processes this data through a series of tables
* A notebook that queries the gold table produced by this pipeline as well as various metrics output by DLT

## Learning Objectives
By the end of this lab, you should be able to:
* Schedule a notebook as a task in a Databricks Job
* Schedule a DLT pipeline as a task in a Databricks Job
* Configure linear dependencies between tasks using the Databricks Workflows UI

In [0]:
%run ../../Includes/Classroom-Setup-09.2.1L

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02"

Validating the locally installed datasets:
| listing local files...(7 seconds)
| completed (7 seconds total)

Creating & using the schema "munirsheikhcloudseekho_0lj9_da_dewd_jobs_lab_92"...(1 seconds)
Predefined tables in "munirsheikhcloudseekho_0lj9_da_dewd_jobs_lab_92":
| -none-

Predefined paths variables:
| DA.paths.working_dir:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/jobs_lab_92
| DA.paths.user_db:          dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/jobs_lab_92/database.db
| DA.paths.datasets:         dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02
| DA.paths.checkpoints:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/jobs_lab_92/_checkpoints
| DA.paths.stream_path:      dbfs:/mnt/dbacademy-users/mun

<i18n value="b7163714-376c-41fd-8e38-80a7247fa923"/>


## Land Initial Data
Seed the landing zone with some data before proceeding. You will re-run this command to land additional data later.

In [0]:
DA.data_factory.load()

Loading the file 01.json to the dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/jobs_lab_92/stream/01.json
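
The idea of landing one numbered batch per call can be sketched as follows. This is a minimal illustration that assumes nothing about how `DA.data_factory` is actually implemented: `BatchLander` is a hypothetical class, and it copies files between temporary local directories rather than DBFS.

```python
import shutil
import tempfile
from pathlib import Path


class BatchLander:
    """Hypothetical stand-in for a data factory: copies one numbered file per call."""

    def __init__(self, source_dir, landing_dir):
        self.source_dir = Path(source_dir)
        self.landing_dir = Path(landing_dir)
        self.landing_dir.mkdir(parents=True, exist_ok=True)
        self.next_batch = 1

    def load(self):
        """Copy the next batch file into the landing directory, if one exists."""
        name = f"{self.next_batch:02d}.json"
        source_file = self.source_dir / name
        if not source_file.exists():
            return None  # nothing left to land
        shutil.copy(source_file, self.landing_dir / name)
        self.next_batch += 1
        return name


# Demo against temporary local directories (an assumption made for portability).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    for batch in (1, 2):
        (source / f"{batch:02d}.json").write_text("{}")

    lander = BatchLander(source, Path(tmp) / "stream")
    landed = [lander.load(), lander.load(), lander.load()]
    print(landed)
```

Each call lands exactly one new file, which is why re-running the command later makes a fresh batch available to the pipeline.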


<i18n value="6bc33560-37f4-4f91-910d-669a1708ba66"/>


## Create and Configure a Pipeline

The pipeline we create here is nearly identical to the one in the previous unit.

We will use it as part of a scheduled job in this lesson.

Execute the following cell to print out the values that will be used during the following configuration steps.

In [0]:
DA.print_pipeline_config()

0,1
Pipeline Name:,
Target:,
Storage Location:,
Notebook Path:,
Datasets Path:,
Source:,
Policy:,


<i18n value="c8b235db-10cf-4a56-92d9-330b80da4f0f"/>


Steps:
1. Click the **Workflows** button on the sidebar.
1. Select the **Delta Live Tables** tab.
1. Click **Create Pipeline**.
1. Leave **Product Edition** as **Advanced**.
1. Fill in a **Pipeline Name** - because these names must be unique, we suggest using the **Pipeline Name** provided in the cell above.
1. For **Notebook Libraries**, use the navigator to locate and select the notebook specified above.
1. Under **Configuration**, add three configuration parameters:
   * Click **Add configuration**, set the "key" to **spark.master** and the "value" to **local[\*]**.
   * Click **Add configuration**, set the "key" to **datasets_path** and the "value" to the value provided in the cell above.
   * Click **Add configuration**, set the "key" to **source** and the "value" to the value provided in the cell above.
1. In the **Target** field, enter the database name provided in the cell above.<br/>
This should follow the pattern **`<name>_<hash>_da_dewd_jobs_lab_92`**
1. In the **Storage location** field, enter the path provided in the cell above.
1. For **Pipeline Mode**, select **Triggered**.
1. Uncheck the **Enable autoscaling** box.
1. Set the number of **`workers`** to **`0`** (zero).
1. Check the **Use Photon Acceleration** box.
1. For **Channel**, select **Current**.
1. For **Policy**, select the value provided in the cell above.

Finally, click **Create**.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: We won't be executing this pipeline directly, as it will be executed by our job later in this lesson,<br/>
but if you want to test it quickly, you can click the **Start** button now.
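
The UI choices in the steps above correspond roughly to a pipeline settings document. The following is a minimal sketch, assuming the field names of the Delta Live Tables pipeline settings JSON (`name`, `edition`, `libraries`, `configuration`, `target`, `storage`, `continuous`, `photon`, `channel`, `clusters`). `build_pipeline_settings` is a hypothetical helper, not part of the course library, and every value in angle brackets is a placeholder for what `DA.print_pipeline_config()` prints. The UI selects a policy by name; using `policy_id` here is an assumption.

```python
import json


def build_pipeline_settings(name, notebook_path, target, storage,
                            datasets_path, source, policy_id=None):
    """Assemble a settings dict mirroring the UI choices (hypothetical helper)."""
    # A fixed worker count of zero, with no autoscale block, mirrors
    # "uncheck Enable autoscaling" and "set workers to 0".
    cluster = {"label": "default", "num_workers": 0}
    if policy_id is not None:
        cluster["policy_id"] = policy_id  # assumed field name
    return {
        "name": name,
        "edition": "ADVANCED",
        "libraries": [{"notebook": {"path": notebook_path}}],
        "configuration": {
            "spark.master": "local[*]",
            "datasets_path": datasets_path,
            "source": source,
        },
        "target": target,
        "storage": storage,
        "continuous": False,  # Triggered mode
        "photon": True,
        "channel": "CURRENT",
        "clusters": [cluster],
    }


# Placeholder values; substitute what DA.print_pipeline_config() prints.
settings = build_pipeline_settings(
    name="<Pipeline Name>",
    notebook_path="<Notebook Path>",
    target="<Target>",
    storage="<Storage Location>",
    datasets_path="<Datasets Path>",
    source="<Source>",
    policy_id="<Policy ID>",
)
print(json.dumps(settings, indent=2))
```

Writing the settings out this way makes it easier to compare your UI configuration against the expected values field by field.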

In [0]:
# ANSWER
 
# This function is provided for students who do not 
# want to work through the exercise of creating the pipeline.
DA.create_pipeline()

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<command-4094000743659299> in <cell line: 5>()
      3 # This function is provided for students who do not
      4 # want to work through the exercise of creating the pipeline.
----> 5 DA.create_pipeline()

<command-4094000743658882> in create_pipeline(self)
      7     # We need to delete the existing pipline so that we can apply updates
      8     # because some attributes are not mutable after creation.
----> 9

In [0]:
DA.validate_pipeline_config()



<i18n value="f98768ac-cbcc-42a2-8c51-ffdc3778aa11"/>


## Schedule a Notebook Job

When using the Jobs UI to orchestrate a workload with multiple tasks, you'll always begin by scheduling a single task.

Before we start, run the following cell to get the values used in this step.

In [0]:
DA.print_job_config()



<i18n value="fab2a427-5d5a-4a82-8947-c809d815c2a3"/>


Here, we'll start by scheduling the next notebook.

Steps:
1. Click the **Workflows** button on the sidebar.
1. Select the **Jobs** tab.
1. Click the blue **Create Job** button.
1. Configure the task:
    1. Enter **Batch-Job** for the task name.
    1. For **Type**, select **Notebook**.
    1. For **Path**, select the **Batch Notebook Path** value provided in the cell above.
    1. From the **Cluster** dropdown, under **Existing All Purpose Clusters**, select your cluster.
    1. Click **Create**.
1. In the top-left of the screen, rename the job (not the task) from **`Batch-Job`** (the default value) to the **Job Name** value provided in the cell above.
1. Click the blue **Run now** button in the top right to start the job.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: When you select your all-purpose cluster, you will see a warning that the run will be billed as all-purpose compute. Production jobs should always be scheduled against new job clusters appropriately sized for the workload, because job compute is billed at a much lower rate.
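
For comparison with the UI steps above, a single-task job can be described as a small JSON document. This is a sketch that assumes the Jobs API 2.1 field names (`name`, `tasks`, `task_key`, `notebook_task`, `existing_cluster_id`). `notebook_task` below is a hypothetical helper, and the values in angle brackets are placeholders for what `DA.print_job_config()` prints and for your own cluster's ID.

```python
import json


def notebook_task(task_key, notebook_path, cluster_id):
    """Describe one notebook task that runs on an existing all-purpose cluster."""
    return {
        "task_key": task_key,
        "notebook_task": {"notebook_path": notebook_path},
        "existing_cluster_id": cluster_id,
    }


# Placeholder values; nothing is sent to a workspace here.
job = {
    "name": "<Job Name>",
    "tasks": [
        notebook_task("Batch-Job", "<Batch Notebook Path>", "<cluster-id>"),
    ],
}
print(json.dumps(job, indent=2))
```

Note that the job name and the task name are separate fields, which matches the UI step that renames the job but not the task.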

<i18n value="1ab345ce-dff4-4a99-ad45-209793ddc581"/>


## Schedule a DLT Pipeline as a Task

In this step, we'll add a DLT pipeline task that executes after the notebook task we configured in the previous step succeeds.

Steps:
1. At the top left of your screen, you'll see the **Runs** tab is currently selected; click the **Tasks** tab.
1. Click the large blue circle with a **+** at the center bottom of the screen to add a new task.
1. Configure the task:
    1. Enter **DLT** for the task name.
    1. For **Type**, select **Delta Live Tables pipeline**.
    1. For **Pipeline**, select the DLT pipeline you configured previously in this exercise.<br/>
    Note: The pipeline name will start with **DLT-Job-Lab-92** and will end with your email address.
    1. The **Depends on** field defaults to your previously defined task, **Batch-Job** - leave this value as-is.
    1. Click the blue **Create task** button.

You should now see a screen with two boxes and a downward arrow between them.

Your **`Batch-Job`** task will be at the top, leading into your **`DLT`** task.
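
The arrow between the two boxes reflects the **Depends on** setting. As a sketch, assuming the Jobs API 2.1 field names `pipeline_task`, `pipeline_id`, and `depends_on`, the DLT task could be described like this. `pipeline_task` is a hypothetical helper and `<pipeline-id>` is a placeholder.

```python
import json


def pipeline_task(task_key, pipeline_id, depends_on=()):
    """Describe a DLT pipeline task that runs after the listed upstream tasks."""
    return {
        "task_key": task_key,
        "pipeline_task": {"pipeline_id": pipeline_id},
        "depends_on": [{"task_key": upstream} for upstream in depends_on],
    }


# The DLT task waits for Batch-Job, matching the default "Depends on" value.
dlt_task = pipeline_task("DLT", "<pipeline-id>", depends_on=["Batch-Job"])
print(json.dumps(dlt_task, indent=2))
```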

<i18n value="dd4e16c5-1842-4642-8159-117cfc84d4b4"/>


## Schedule an Additional Notebook Task

An additional notebook has been provided which queries some of the DLT metrics and the gold table defined in the DLT pipeline. 

We'll add this as a final task in our job.

Steps:
1. Click the large blue circle with a **+** at the center bottom of the screen to add a new task.
1. Configure the task:
    1. Enter **Query-Results** for the task name.
    1. For **Type**, select **Notebook**.
    1. For **Path**, select the **Query Notebook Path** value printed earlier by `DA.print_job_config()`.
    1. From the **Cluster** dropdown, under **Existing All Purpose Clusters**, select your cluster.
    1. The **Depends on** field defaults to your previously defined task, **DLT** - leave this value as-is.
    1. Click the blue **Create task** button.
    
Click the blue **Run now** button in the top right of the screen to run this job.

From the **Runs** tab, you will be able to click on the start time for this run under the **Active runs** section and visually track task progress.

Once all your tasks have succeeded, review the contents of each task to confirm expected behavior.
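
The three tasks form a linear chain, so their run order follows directly from the dependencies. The sketch below derives that order with the standard library's `graphlib` (Python 3.9+) and checks that the graph is linear. It only illustrates the dependency logic and does not describe how the Databricks scheduler is implemented; `is_linear` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the tasks it depends on, as configured in the steps above.
depends_on = {
    "Batch-Job": [],
    "DLT": ["Batch-Job"],
    "Query-Results": ["DLT"],
}


def is_linear(graph):
    """Return True if every task has at most one upstream and one downstream task."""
    downstream_counts = {}
    for task, upstream in graph.items():
        if len(upstream) > 1:
            return False
        for parent in upstream:
            downstream_counts[parent] = downstream_counts.get(parent, 0) + 1
    return all(count <= 1 for count in downstream_counts.values())


# static_order() yields each task only after all of its upstream tasks.
run_order = list(TopologicalSorter(depends_on).static_order())
print(run_order)
print(is_linear(depends_on))
```

If a task were given two downstream tasks, the graph would fan out and `is_linear` would return `False`.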

In [0]:
# ANSWER

# This function is provided for students who do not 
# want to work through the exercise of creating the job.
DA.create_job()



In [0]:
DA.validate_job_config()



In [0]:
# ANSWER

# This function is provided to start the job and  
# block until it has completed, canceled or failed
DA.start_job()
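
Blocking until a run finishes usually comes down to polling its state. The following is a minimal sketch of that pattern, not the actual implementation of `DA.start_job()`. It assumes the run life-cycle state names used by the Jobs API (`PENDING`, `RUNNING`, `TERMINATED`, `SKIPPED`, `INTERNAL_ERROR`), and `wait_for_run` and its parameters are hypothetical. The state lookup is passed in as a function so the example runs without a workspace.

```python
import time

# Life-cycle states after which a run will not change again (assumed names).
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(get_life_cycle_state, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll until the run reaches a terminal state, then return that state."""
    waited = 0.0
    while True:
        state = get_life_cycle_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Run still in state {state} after {waited} seconds")
        time.sleep(poll_seconds)
        waited += poll_seconds


# Demo with a scripted sequence of states instead of a real API call.
fake_states = iter(["PENDING", "RUNNING", "RUNNING", "TERMINATED"])
final_state = wait_for_run(lambda: next(fake_states), poll_seconds=0.0)
print(final_state)
```

A terminal life-cycle state only says the run has stopped; whether it succeeded, failed, or was canceled is reported separately in the run's result state.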



&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>