
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

<i18n value="2eb97b71-b2ab-4b68-afdc-1663ec49e9d4"/>


# Lab: Migrating SQL Notebooks to Delta Live Tables

This notebook describes the overall structure for the lab exercise, configures the environment for the lab, provides simulated data streaming, and performs cleanup once you are done. A notebook like this is not typically needed in a production pipeline scenario.

## Learning Objectives
By the end of this lab, you should be able to:
* Convert existing data pipelines to Delta Live Tables

<i18n value="782da0e9-5fc2-4deb-b7a4-939af49e38ed"/>


## Datasets Used

This lab uses simplified, artificially generated medical data. The schemas of our two datasets are shown below. Note that we will manipulate these schemas during various steps.

#### Recordings
The main dataset consists of heart rate recordings from medical devices, delivered in JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |

#### PII
These data will later be joined with a static table of patient information stored in an external system to identify patients by name.

| Field | Type |
| --- | --- |
| mrn | long |
| name | string |
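
As a rough illustration of the two schemas and the join on **`mrn`**, the following sketch parses one JSON recording and looks up the patient name. The sample values and the names `Recording`, `parse_recording`, `pii_by_mrn`, and `identify` are made up for illustration; the lab itself performs this join inside the DLT pipeline, not in plain Python.

```python
import json
from dataclasses import dataclass


@dataclass
class Recording:
    device_id: int
    mrn: int        # "long" in the schema table; Python ints are unbounded
    time: float
    heartrate: float


def parse_recording(line: str) -> Recording:
    """Parse one JSON record and coerce each field to its documented type."""
    raw = json.loads(line)
    return Recording(
        device_id=int(raw["device_id"]),
        mrn=int(raw["mrn"]),
        time=float(raw["time"]),
        heartrate=float(raw["heartrate"]),
    )


# Hypothetical sample data, not taken from the lab datasets.
sample_line = '{"device_id": 7, "mrn": 1001, "time": 1593536303.5, "heartrate": 61.2}'
pii_by_mrn = {1001: "Patient A"}


def identify(rec: Recording, pii: dict) -> dict:
    """Join a recording to the static PII lookup on mrn (name is None if no match)."""
    return {"mrn": rec.mrn, "name": pii.get(rec.mrn), "heartrate": rec.heartrate}


enriched = identify(parse_recording(sample_line), pii_by_mrn)
print(enriched)
```

Keeping the PII lookup separate from the recordings mirrors the description above: the recordings arrive continuously, while the patient table is static.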

<i18n value="b691e21b-24a5-46bc-97d8-a43e9ae6e268"/>


## Getting Started

Begin by running the following cell to configure the lab environment.

In [0]:
%run ../../Includes/Classroom-Setup-08.2.1L

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02"

Validating the locally installed datasets:
| listing local files...(7 seconds)
| completed (7 seconds total)

Creating & using the schema "munirsheikhcloudseekho_0lj9_da_dewd_dlt_lab_82"...(0 seconds)
Loading the file 01.json to the dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/dlt_lab_82/stream/01.json
Predefined tables in "munirsheikhcloudseekho_0lj9_da_dewd_dlt_lab_82":
| -none-

Predefined paths variables:
| DA.paths.working_dir:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/dlt_lab_82
| DA.paths.user_db:          dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/dlt_lab_82/database.db
| DA.paths.datasets:         dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02
| DA.paths.checkpoints:      dbfs:/mnt/dbacademy-users/mun

<i18n value="c68290ac-56ad-4d6e-afec-b0a61c35386f"/>


## Land Initial Data
Seed the landing zone with more data before proceeding.

You will re-run this command to land additional data later.

In [0]:
DA.data_factory.load()

Loading the file 02.json to the dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/dlt_lab_82/stream/02.json
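
The implementation of the data factory is not shown in this notebook. As a hedged sketch of the idea, each call copies the next numbered JSON file from a source directory into the streaming landing directory. The class name `FileLander` and the directory layout below are assumptions for illustration, and the sketch uses local temporary directories rather than DBFS.

```python
import shutil
import tempfile
from pathlib import Path


class FileLander:
    """Hypothetical stand-in for a data factory: lands one file per call."""

    def __init__(self, source_dir: Path, landing_dir: Path):
        self.source_dir = source_dir
        self.landing_dir = landing_dir
        self.next_batch = 1

    def load(self) -> bool:
        """Copy the next numbered file; return False when none remain."""
        name = f"{self.next_batch:02d}.json"
        source = self.source_dir / name
        if not source.exists():
            return False
        shutil.copy(source, self.landing_dir / name)
        self.next_batch += 1
        return True


# Build a throwaway source directory holding three placeholder batches.
root = Path(tempfile.mkdtemp())
source_dir, landing_dir = root / "source", root / "stream"
source_dir.mkdir()
landing_dir.mkdir()
for i in (1, 2, 3):
    (source_dir / f"{i:02d}.json").write_text("{}\n")

lander = FileLander(source_dir, landing_dir)
lander.load()  # lands 01.json
lander.load()  # lands 02.json
landed = sorted(p.name for p in landing_dir.iterdir())
print(landed)
```

Landing one file per call is what lets you re-run the command later to simulate new data arriving between pipeline updates.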


<i18n value="7cb98302-06c2-4384-bdf7-2260cbf2662d"/>


Execute the following cell to print out values that will be used during the following configuration steps.

In [0]:
DA.print_pipeline_config()

| Setting | Value |
| --- | --- |
| Pipeline Name: | |
| Target: | |
| Storage Location: | |
| Notebook Path: | |
| Source: | |
| Datasets Path: | |
| Policy: | |


<i18n value="784d3bc4-5c4e-4ef8-ab56-3ebaa92238b0"/>


## Create and Configure a Pipeline

1. Click the **Workflows** button on the sidebar.
1. Select the **Delta Live Tables** tab.
1. Click **Create Pipeline**.
1. Leave **Product Edition** as **Advanced**.
1. Fill in a **Pipeline Name**. Because these names must be unique, we suggest using the **Pipeline Name** provided in the cell above.
1. For **Notebook Libraries**, use the navigator to locate and select the notebook specified above.
1. Under **Configuration**, add three configuration parameters:
   * Click **Add configuration**, set the "key" to **spark.master** and the "value" to **local[\*]**.
   * Click **Add configuration**, set the "key" to **datasets_path** and the "value" to the value provided in the cell above.
   * Click **Add configuration**, set the "key" to **source** and the "value" to the value provided in the cell above.
1. In the **Target** field, enter the database name provided in the cell above.<br/>
This should follow the pattern **`<name>_<hash>_da_dewd_dlt_lab_82`**
1. In the **Storage location** field, enter the path provided in the cell above.
1. For **Pipeline Mode**, select **Triggered**.
1. Uncheck the **Enable autoscaling** box.
1. Set the number of **`workers`** to **`0`** (zero).
1. Check the **Use Photon Acceleration** box.
1. For **Channel**, select **Current**.
1. For **Policy**, select the value provided in the cell above.

Finally, click **Create**.
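
For reference, the UI choices above correspond roughly to a pipeline settings document. The sketch below assembles one as a Python dictionary. The key names follow the Delta Live Tables pipeline settings JSON as best understood here and should be checked against the **JSON** view of your own pipeline; every value in angle brackets is a placeholder for a value printed by the cell above, and `build_settings` is a hypothetical helper, not part of the courseware.

```python
import json


def build_settings(name, target, storage, notebook_path,
                   source, datasets_path, policy_id):
    """Assemble settings mirroring the UI steps (key names are assumptions)."""
    return {
        "name": name,
        "edition": "ADVANCED",
        "channel": "CURRENT",
        "continuous": False,  # Pipeline Mode: Triggered
        "photon": True,       # Use Photon Acceleration
        "target": target,
        "storage": storage,
        "libraries": [{"notebook": {"path": notebook_path}}],
        "configuration": {
            "spark.master": "local[*]",
            "datasets_path": datasets_path,
            "source": source,
        },
        # A fixed worker count of zero instead of an autoscaling range.
        "clusters": [
            {"label": "default", "num_workers": 0, "policy_id": policy_id}
        ],
    }


# Placeholders only: substitute the values printed for your environment.
settings = build_settings(
    name="<pipeline name>",
    target="<target>",
    storage="<storage location>",
    notebook_path="<notebook path>",
    source="<source>",
    datasets_path="<datasets path>",
    policy_id="<policy>",
)
print(json.dumps(settings, indent=2))
```

Setting `spark.master` to `local[*]` together with zero workers keeps all processing on the driver node, which is why autoscaling is disabled in the steps above.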

In [0]:
# ANSWER

# This function is provided for those students that do not 
# want to work through the exercise of creating the pipeline.
DA.create_pipeline()

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<command-4094000743659699> in <cell line: 5>()
      3 # This function is provided for those students that do not
      4 # want to work through the exercise of creating the pipeline.
----> 5 DA.create_pipeline()

<command-4094000743658906> in create_pipeline(self)
      7     # We need to delete the existing pipline so that we can apply updates
      8     # because some attributes are not mutable after creation.
----> 9

In [0]:
DA.validate_pipeline_config()



<i18n value="3340e93d-1fad-4549-bf79-ec239b1d59d4"/>


## Open and Complete DLT Pipeline Notebook

You will perform your work in the companion notebook [DE 8.2.2L - Migrating a SQL Pipeline to DLT Lab]($./DE 8.2.2L - Migrating a SQL Pipeline to DLT Lab),<br/>
which you will ultimately deploy as a pipeline.

Open the notebook and, following the guidelines provided therein, fill in the cells where prompted to<br/>
implement a multi-hop architecture similar to the one we worked with in the previous section.

<i18n value="90a66079-16f8-4503-ab48-840cbdd07914"/>


## Run your Pipeline

Select **Development** mode, which accelerates the development lifecycle by reusing the same cluster across runs.<br/>
Development mode also turns off automatic retries when a job fails.

Click **Start** to begin the first update to your tables.

Delta Live Tables will automatically deploy all the necessary infrastructure and resolve the dependencies between all datasets.

**NOTE**: The first table update may take several minutes as relationships are resolved and infrastructure deploys.
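
To make the idea of dependency resolution concrete, here is a minimal sketch that orders datasets with a topological sort using Python's standard `graphlib` module (Python 3.9 or later). It does not reflect how Delta Live Tables is implemented internally. The graph is an assumption loosely based on the dataset names that appear in this lab (`recordings_bronze`, `recordings_parsed`); `pii` and `recordings_enriched` are hypothetical names chosen for illustration.

```python
from graphlib import TopologicalSorter

# Map each dataset to the set of datasets it reads from (assumed graph).
dependencies = {
    "recordings_bronze": set(),
    "pii": set(),
    "recordings_parsed": {"recordings_bronze"},
    "recordings_enriched": {"recordings_parsed", "pii"},
}

# static_order() yields every dataset after all of its upstream datasets.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The relative order of independent datasets (here, the two source tables) is not significant; only the upstream-before-downstream constraint matters.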

In [0]:
# ANSWER

# This function is provided to start the pipeline and
# block until it has completed, been cancelled, or failed
DA.start_pipeline()



<i18n value="d1797d22-692c-43ce-b146-1e0248e65da3"/>


## Troubleshooting Code in Development Mode

Don't despair if your pipeline fails the first time. Delta Live Tables is in active development, and error messages are improving all the time.

Because relationships between tables are mapped as a DAG, error messages will often indicate that a dataset isn't found.

Let's consider our DAG below:

<img src="https://files.training.databricks.com/images/dlt-dag.png">

If the error message **`Dataset not found: 'recordings_parsed'`** is raised, there may be several culprits:
1. The logic defining **`recordings_parsed`** is invalid
1. There is an error reading from **`recordings_bronze`**
1. A typo exists in either **`recordings_parsed`** or **`recordings_bronze`**

The safest way to identify the culprit is to iteratively add table/view definitions back into your DAG, starting from your initial ingestion tables. Comment out the later table/view definitions, then uncomment them one at a time between runs.
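
The procedure above can be sketched as a loop that enables one definition at a time, upstream first, and stops at the first failure. This is only a simulation: `run_pipeline` is a hypothetical stand-in for clicking **Start** with a given set of definitions uncommented, and both the graph and the choice of broken definition are made up for illustration.

```python
from graphlib import TopologicalSorter

# Assumed linear chain of datasets; each maps to what it reads from.
dependencies = {
    "recordings_bronze": set(),
    "recordings_parsed": {"recordings_bronze"},
    "recordings_enriched": {"recordings_parsed"},
}
broken = {"recordings_parsed"}  # pretend this definition contains invalid logic


def run_pipeline(enabled):
    """Hypothetical run: succeeds only if no enabled definition is broken."""
    return not (set(enabled) & broken)


def find_culprit(deps):
    """Enable definitions one by one, upstream first; return the first that fails."""
    enabled = []
    for name in TopologicalSorter(deps).static_order():
        enabled.append(name)
        if not run_pipeline(enabled):
            return name
    return None


culprit = find_culprit(dependencies)
print(culprit)
```

Because definitions are enabled in dependency order, the first run that fails points at the definition that was just added, rather than at a downstream dataset that merely reports it as missing.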

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>