
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

<i18n value="36722caa-e827-436b-8c45-3e85619fd2d0"/>


# Orchestrating Jobs with Databricks Workflows

New updates to the Databricks Jobs UI have added the ability to schedule multiple tasks as part of a job, allowing Databricks Jobs to fully handle orchestration for most production workloads.

Here, we'll start by reviewing the steps for scheduling a notebook task as a triggered standalone job, and then add a dependent task using a DLT pipeline. 

## Learning Objectives
By the end of this lesson, you should be able to:
* Schedule a notebook task in a Databricks Workflow Job
* Describe job scheduling options and differences between cluster types
* Review Job Runs to track progress and see results
* Schedule a DLT pipeline task in a Databricks Workflow Job
* Configure linear dependencies between tasks using the Databricks Workflows UI

In [0]:
%run ../../Includes/Classroom-Setup-09.1.1

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02"

Validating the locally installed datasets:
| listing local files...(8 seconds)
| completed (8 seconds total)

Creating & using the schema "munirsheikhcloudseekho_0lj9_da_dewd_jobs_demo_91"...(0 seconds)
Predefined tables in "munirsheikhcloudseekho_0lj9_da_dewd_jobs_demo_91":
| -none-

Predefined paths variables:
| DA.paths.working_dir:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/jobs_demo_91
| DA.paths.user_db:          dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/jobs_demo_91/database.db
| DA.paths.datasets:         dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02
| DA.paths.checkpoints:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/jobs_demo_91/_checkpoints
| DA.paths.stream_path:      dbfs:/mnt/dbacademy-user

<i18n value="f1dc94ee-1f34-40b1-b2ba-49de9801b0d1"/>


## Create and configure a pipeline
The pipeline we create here is nearly identical to the one in the previous unit.

We will use it as part of a scheduled job in this lesson.

Execute the following cell to print out the values that will be used during the following configuration steps.

In [0]:
DA.print_pipeline_config()

0,1
Pipeline Name:,
Target:,
Storage Location:,
Notebook Path:,
Datasets Path:,
Policy:,


<i18n value="b1f23965-ab36-40da-8907-e8f1fdc53aed"/>


## Create and configure a Pipeline

Steps:
1. Click the **Workflows** button on the sidebar.
1. Select the **Delta Live Tables** tab.
1. Click **Create Pipeline**.
1. Leave **Product Edition** as **Advanced**.
1. Fill in a **Pipeline Name** - because these names must be unique, we suggest using the **Pipeline Name** provided in the cell above.
1. For **Notebook Libraries**, use the navigator to locate and select the notebook specified above.
1. Under **Configuration**, add two configuration parameters:
   * Click **Add configuration**, set the "key" to **spark.master** and the "value" to **local[\*]**.
   * Click **Add configuration**, set the "key" to **datasets_path** and the "value" to the value provided in the cell above.
1. In the **Target** field, enter the database name provided in the cell above.<br/>
This should follow the pattern **`<name>_<hash>_da_dewd_jobs_demo_91`**
1. In the **Storage location** field, enter the path provided in the cell above.
1. For **Pipeline Mode**, select **Triggered**.
1. Uncheck the **Enable autoscaling** box.
1. Set the number of **`workers`** to **`0`** (zero).
1. Check the **Use Photon Acceleration** box.
1. For **Channel**, select **Current**
1. For **Policy**, select the value provided in the cell above.

Finally, click **Create**.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: We won't execute this pipeline directly, since our job will run it later in this lesson,<br/>
but if you want to test it quickly, you can click the **Start** button now.
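
The UI choices above correspond to a pipeline settings document. The following is a minimal sketch of that document as a Python dictionary, assuming the field names used by the Delta Live Tables pipelines REST API (`edition`, `libraries`, `configuration`, `target`, `storage`, `continuous`, `photon`, `channel`, `clusters`); verify them against the API reference for your workspace. The helper `build_pipeline_settings` and every value in angle brackets are hypothetical placeholders, not values from this course.

```python
import json


def build_pipeline_settings(name, notebook_path, target, storage,
                            datasets_path, policy_id=None):
    """Assemble a settings dict that mirrors the UI steps above (hypothetical helper)."""
    # Autoscaling off and zero workers, as selected in the UI.
    cluster = {"label": "default", "num_workers": 0}
    if policy_id is not None:
        cluster["policy_id"] = policy_id

    return {
        "name": name,
        "edition": "ADVANCED",                      # Product Edition
        "libraries": [{"notebook": {"path": notebook_path}}],
        "configuration": {                          # the two configuration parameters
            "spark.master": "local[*]",
            "datasets_path": datasets_path,
        },
        "target": target,                           # Target database
        "storage": storage,                         # Storage location
        "continuous": False,                        # Pipeline Mode: Triggered
        "photon": True,                             # Use Photon Acceleration
        "channel": "CURRENT",                       # Channel
        "clusters": [cluster],
    }


settings = build_pipeline_settings(
    name="<Pipeline Name>",
    notebook_path="<Notebook Path>",
    target="<Target>",
    storage="<Storage Location>",
    datasets_path="<Datasets Path>",
    policy_id="<Policy ID>",
)
print(json.dumps(settings, indent=2))
```

Keeping the settings in one dictionary makes it easy to compare what the UI produced (via the pipeline's JSON view) with what you intended to configure.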

In [0]:
# ANSWER

# This function is provided for students who do not 
# want to work through the exercise of creating the pipeline.
DA.create_pipeline()

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<command-4094000743659327> in <cell line: 5>()
      3 # This function is provided for students who do not
      4 # want to work through the exercise of creating the pipeline.
----> 5 DA.create_pipeline()

<command-4094000743658743> in create_pipeline(self)
      7     # We need to delete the existing pipline so that we can apply updates
      8     # because some attributes are not mutable after creation.
----> 9

In [0]:
DA.validate_pipeline_config()



<i18n value="ed9ed553-77e7-4ff2-a9dc-12466e30c994"/>



## Schedule a Notebook Job

When using the Jobs UI to orchestrate a workload with multiple tasks, you'll always begin by scheduling a single task.

Before we start, run the following cell to get the values used in this step.

In [0]:
DA.print_job_config_task_reset()



<i18n value="8c3c501e-0334-412a-91b3-bf250dfe8856"/>

Here, we'll start by scheduling the next notebook.

Steps:
1. Click the **Workflows** button on the sidebar.
1. Select the **Jobs** tab.
1. Click the **Create Job** button.
1. Configure the task:
    1. Enter **Reset** for the task name
    1. For **Type**, select **Notebook**
    1. For **Path**, select the **Reset Notebook Path** value provided in the cell above
    1. From the **Cluster** dropdown, under **Existing All Purpose Clusters**, select your cluster
    1. Click **Create**
1. In the top-left of the screen, rename the job (not the task) from **`Reset`** (the default value) to the **Job Name** value provided in the cell above.
1. Click the blue **Run now** button in the top right to start the job.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: When selecting your all-purpose cluster, you will get a warning about how this will be billed as all-purpose compute. Production jobs should always be scheduled against new job clusters appropriately sized for the workload, as this is billed at a much lower rate.
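
The difference between the two cluster choices in the note above can be sketched as a job definition. This is a minimal illustration, assuming the Jobs API 2.1 field names `tasks`, `task_key`, `notebook_task`, `existing_cluster_id`, and `new_cluster`; the helper `build_notebook_task` and all values in angle brackets are hypothetical placeholders.

```python
def build_notebook_task(task_key, notebook_path,
                        existing_cluster_id=None, new_cluster=None):
    """Build one notebook task that runs on exactly one kind of compute (hypothetical helper)."""
    if (existing_cluster_id is None) == (new_cluster is None):
        raise ValueError("Specify exactly one of existing_cluster_id or new_cluster")

    task = {
        "task_key": task_key,
        "notebook_task": {"notebook_path": notebook_path},
    }
    if existing_cluster_id is not None:
        # All-purpose cluster: convenient for this demo, billed as all-purpose compute.
        task["existing_cluster_id"] = existing_cluster_id
    else:
        # Job cluster: created for the run and sized for the workload.
        task["new_cluster"] = new_cluster
    return task


# What this lesson configures: the Reset task on your all-purpose cluster.
demo_job = {
    "name": "<Job Name>",
    "tasks": [
        build_notebook_task("Reset", "<Reset Notebook Path>",
                            existing_cluster_id="<your-cluster-id>")
    ],
}

# What a production job would typically use instead: a new job cluster.
production_task = build_notebook_task(
    "Reset",
    "<Reset Notebook Path>",
    new_cluster={
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        "num_workers": 2,
    },
)
```

Rejecting a task with both or neither compute option catches a common configuration mistake before the definition is ever submitted.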

In [0]:
# ANSWER

# This function is provided for students who do not 
# want to work through the exercise of creating the job.
DA.create_job_v1()



In [0]:
DA.validate_job_v1_config()



<i18n value="8ebdf7c7-4b4a-49a9-b9d4-25dff82ed169"/>


## Cron Scheduling of Databricks Jobs

Note that on the right-hand side of the Jobs UI, directly under the **Job Details** section, is a section labeled **Schedule**.

Click on the **Edit schedule** button to explore scheduling options.

Changing the **Schedule type** field from **Manual** to **Scheduled** will bring up a cron scheduling UI.

This UI provides extensive options for setting up chronological scheduling of your jobs. Settings configured in the UI can also be displayed in cron syntax (Databricks uses Quartz cron expressions), which you can edit directly when you need a schedule the UI cannot express.

At this time, we'll leave our job set to the **Manual (Paused)** scheduling type.
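
To make the cron syntax concrete, here is a minimal sketch that builds a Quartz cron expression for a daily run and wraps it in a schedule block. It assumes the Jobs API field names `quartz_cron_expression`, `timezone_id`, and `pause_status`; the helper `daily_quartz_cron` is hypothetical and covers only the "every day at a fixed time" case.

```python
def daily_quartz_cron(hour, minute=0):
    """Return a Quartz cron expression that fires every day at hour:minute (hypothetical helper)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Quartz field order: seconds, minutes, hours, day-of-month, month, day-of-week.
    # '?' means "no specific value" for day-of-week because day-of-month is '*'.
    return f"0 {minute} {hour} * * ?"


schedule = {
    "quartz_cron_expression": daily_quartz_cron(6, 30),
    "timezone_id": "UTC",
    "pause_status": "PAUSED",  # defined but inactive until unpaused
}
print(schedule["quartz_cron_expression"])  # 0 30 6 * * ?
```

Unlike classic five-field Unix cron, Quartz expressions start with a seconds field, which is an easy detail to miss when editing the expression by hand.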

<i18n value="50665a01-dd6c-4767-b8ef-56ee02dbd9db"/>


## Review Run

As currently configured, our single-notebook job provides the same functionality as the legacy Databricks Jobs UI, which only allowed a single notebook to be scheduled.

To review the job run:
1. Select the **Runs** tab in the top-left of the screen (you should currently be on the **Tasks** tab)
1. Find your job. If **the job is still running**, it will be under the **Active runs** section. If **the job finished running**, it will be under the **Completed runs** section
1. Open the Output details by clicking on the timestamp field under the **Start time** column
1. If **the job is still running**, you will see the active state of the notebook with a **Status** of **`Pending`** or **`Running`** in the right side panel. If **the job has completed**, you will see the full execution of the notebook with a **Status** of **`Succeeded`** or **`Failed`** in the right side panel
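
The statuses in the steps above can also be read programmatically. The sketch below assumes a run's `state` object carries `life_cycle_state` and `result_state` fields, as in the Jobs API; the mapping to the UI labels is a simplification of ours, and both helper functions are hypothetical.

```python
# Life cycle states after which a run no longer appears as active (assumed set).
FINISHED_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def run_section(state):
    """Return which section of the Runs tab a run would be listed under."""
    if state.get("life_cycle_state") in FINISHED_STATES:
        return "Completed runs"
    return "Active runs"


def run_status(state):
    """Map a run state dict to the status labels used in the steps above."""
    life_cycle = state.get("life_cycle_state")
    result = state.get("result_state")
    if life_cycle == "PENDING":
        return "Pending"
    if life_cycle in ("RUNNING", "TERMINATING"):
        return "Running"
    if life_cycle == "TERMINATED" and result == "SUCCESS":
        return "Succeeded"
    if life_cycle == "INTERNAL_ERROR" or result in ("FAILED", "TIMEDOUT"):
        return "Failed"
    # Anything else (for example a canceled run) is reported as-is.
    return result or life_cycle or "Unknown"


example = {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}
print(run_section(example), "/", run_status(example))  # Completed runs / Succeeded
```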
  
The notebook employs the magic command **`%run`** to call an additional notebook using a relative path. Note that while not covered in this course, <a href="https://docs.databricks.com/repos.html#work-with-non-notebook-files-in-a-databricks-repo" target="_blank">new functionality added to Databricks Repos allows loading Python modules using relative paths</a>.

The actual outcome of the scheduled notebook is to reset the environment for our new job and pipeline.

<i18n value="3dbff1a3-1c13-46f9-91c4-55aefb95be20"/>


## Schedule a DLT Pipeline as a Task

In this step, we'll add a DLT pipeline to execute after the success of the task we configured at the start of this lesson.

Steps:
1. At the top left of your screen, you'll see the **Runs** tab is currently selected; click the **Tasks** tab.
1. Click the large blue circle with a **+** at the center bottom of the screen to add a new task
1. Configure the task:
    1. Enter **DLT** for the task name
    1. For **Type**, select **Delta Live Tables pipeline**
    1. For **Pipeline**, select the DLT pipeline you configured previously in this exercise<br/>
    Note: The pipeline name will start with **DLT-Job-Demo-91** and will end with your email address.
    1. The **Depends on** field defaults to your previously defined task, **Reset** - leave this value as-is.
    1. Click the blue **Create task** button

You should now see a screen with 2 boxes and a downward arrow between them. 

Your **`Reset`** task will be at the top, leading into your **`DLT`** task. 

This visualization represents the dependencies between these tasks.

Click **Run now** to execute your job.

**NOTE**: You may need to wait a few minutes as infrastructure for your job and pipeline is deployed.
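
The linear dependency configured above can be sketched as a task list in which the **`DLT`** task names **`Reset`** in its `depends_on` field. This is a minimal illustration, assuming the Jobs API 2.1 field names `depends_on` and `pipeline_task`; the values in angle brackets are placeholders, and `execution_order` is a hypothetical helper that uses Python's standard `graphlib` module (Python 3.9+) to show the order the dependencies imply.

```python
from graphlib import TopologicalSorter

tasks = [
    {
        "task_key": "Reset",
        "notebook_task": {"notebook_path": "<Reset Notebook Path>"},
        "existing_cluster_id": "<your-cluster-id>",
    },
    {
        "task_key": "DLT",
        "depends_on": [{"task_key": "Reset"}],  # run only after Reset succeeds
        "pipeline_task": {"pipeline_id": "<pipeline-id>"},
    },
]


def execution_order(task_list):
    """Return task keys in an order that respects every depends_on edge."""
    # Map each task to the set of tasks it depends on (its predecessors).
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in task_list
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(tasks))  # ['Reset', 'DLT']
```

`TopologicalSorter` raises `graphlib.CycleError` if tasks depend on each other in a loop, which is the same constraint the task graph in the UI enforces visually.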

In [0]:
# ANSWER

# This function is provided for students who do not 
# want to work through the exercise of creating the job.
DA.create_job_v2()



In [0]:
DA.validate_job_v2_config()



In [0]:
# ANSWER

# This function is provided to start the job and
# block until it has completed, been canceled, or failed
DA.start_job()



<i18n value="4fecba69-f1cf-4413-8bc6-7b50d32b2456"/>


## Review Multi-Task Run Results

Select the **Runs** tab again, then select the most recent run under **Active runs** or **Completed runs**, depending on whether the job has finished.

The visualizations for tasks will update in real time to reflect which tasks are actively running, and will change colors if task failures occur. 

Clicking on a task box will render the scheduled notebook in the UI. 

You can think of this as an additional layer of orchestration on top of the previous Databricks Jobs UI. If you have workloads that schedule jobs with the CLI or REST API, note that <a href="https://docs.databricks.com/dev-tools/api/latest/jobs.html" target="_blank">the JSON structure used to configure jobs and retrieve their results has been updated along with the UI</a>.
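
To illustrate the kind of change in the JSON structure, here is a minimal sketch that wraps a legacy single-task job definition into the multi-task shape. It assumes the legacy format keeps `notebook_task` and the cluster fields at the top level of the job settings, while the multi-task format nests them inside a `tasks` array keyed by `task_key`. The helper `to_multitask` and the list of keys it moves are simplifications of ours, not an official migration tool.

```python
# Keys that describe a single task rather than the job as a whole (assumed, not exhaustive).
TASK_LEVEL_KEYS = ("notebook_task", "existing_cluster_id", "new_cluster", "libraries")


def to_multitask(legacy_settings, task_key):
    """Wrap a single-task job definition in a one-element tasks array."""
    task = {"task_key": task_key}
    job = {}
    for key, value in legacy_settings.items():
        if key in TASK_LEVEL_KEYS:
            task[key] = value   # moves down into the task
        else:
            job[key] = value    # stays at the job level (name, schedule, ...)
    job["tasks"] = [task]
    return job


legacy = {
    "name": "<Job Name>",
    "existing_cluster_id": "<your-cluster-id>",
    "notebook_task": {"notebook_path": "<Reset Notebook Path>"},
}
print(to_multitask(legacy, "Reset"))
```

Once a job is expressed as a `tasks` array, adding a second task with a dependency is a matter of appending another entry rather than creating a separate job.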

**NOTE**: At this time, DLT pipelines scheduled as tasks do not directly render results in the Runs GUI; instead, you will be directed back to the DLT Pipeline GUI for the scheduled Pipeline.

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>