
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

<i18n value="01f3c782-1973-4a69-812a-7f9721099941"/>



## End-to-End ETL in the Lakehouse

In this notebook, you will pull together concepts learned throughout the course to complete an example data pipeline.

The following is a non-exhaustive list of skills and tasks necessary to successfully complete this exercise:
* Using Databricks notebooks to write queries in SQL and Python
* Creating and modifying databases, tables, and views
* Using Auto Loader and Spark Structured Streaming for incremental data processing in a multi-hop architecture
* Using Delta Live Tables SQL syntax
* Configuring a Delta Live Tables pipeline for continuous processing
* Using Databricks Jobs to orchestrate tasks from notebooks stored in Repos
* Setting chronological scheduling for Databricks Jobs
* Defining queries in Databricks SQL
* Creating visualizations in Databricks SQL
* Defining Databricks SQL dashboards to review metrics and results

<i18n value="f9cf3bbc-aa6a-45c2-9d26-a3785e350e1f"/>


## Run Setup
Run the following cell to reset all the databases and directories associated with this lab.

In [0]:
%run ../../Includes/Classroom-Setup-12.2.1L

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02"

Validating the locally installed datasets:
| listing local files...(7 seconds)
| completed (7 seconds total)

Creating & using the schema "munirsheikhcloudseekho_0lj9_da_dewd_cap_12"...(1 seconds)
Predefined tables in "munirsheikhcloudseekho_0lj9_da_dewd_cap_12":
| -none-

Predefined paths variables:
| DA.paths.working_dir:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/cap_12
| DA.paths.user_db:          dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/cap_12/database.db
| DA.paths.datasets:         dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02
| DA.paths.checkpoints:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/cap_12/_checkpoints
| DA.paths.stream_path:      dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail

<i18n value="3fe92b6e-3e10-4771-8eef-8f4b060dd48f"/>


## Land Initial Data
Seed the landing zone with some data before proceeding.

In [0]:
DA.data_factory.load()

Loading the file 01.json to the dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/cap_12/stream/01.json
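
The implementation of the landing helper is not shown in this notebook. As a rough sketch only, a helper of this kind might copy the next numbered file from a source directory into the landing zone each time it is called. The class name `LandingFactory`, its arguments, and the `NN.json` naming are assumptions made for illustration (suggested by the `01.json` file name in the output above), and the sketch uses the local filesystem rather than DBFS.

```python
import shutil
import tempfile
from pathlib import Path


class LandingFactory:
    """Hypothetical stand-in for a data factory; not the course's DA.data_factory."""

    def __init__(self, source_dir, landing_dir, max_batch=12):
        self.source_dir = Path(source_dir)
        self.landing_dir = Path(landing_dir)
        self.max_batch = max_batch
        self.batch = 1  # number of the next batch to land

    def load(self):
        """Copy the next NN.json file into the landing directory, if any remain."""
        if self.batch > self.max_batch:
            return None  # nothing left to land
        name = f"{self.batch:02d}.json"
        self.landing_dir.mkdir(parents=True, exist_ok=True)
        target = self.landing_dir / name
        shutil.copy(self.source_dir / name, target)
        self.batch += 1
        return target


# Build a throwaway source directory with two placeholder files to land.
_root = Path(tempfile.mkdtemp())
_source = _root / "source"
_source.mkdir()
for i in (1, 2):
    (_source / f"{i:02d}.json").write_text("{}\n")

factory = LandingFactory(_source, _root / "stream", max_batch=2)
first = factory.load()   # lands 01.json
second = factory.load()  # lands 02.json
third = factory.load()   # None: both batches have already landed
```

Landing one file per call is what lets a scheduled job feed the pipeline a batch at a time.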


<i18n value="806818f8-e931-45ba-b86f-d65cdf76f215"/>


## Create and Configure a DLT Pipeline
**NOTE**: The main difference from the DLT instructions in previous labs is that here we will set up our pipeline for **Continuous** execution in **Production** mode.

In [0]:
DA.print_pipeline_config()

Pipeline Name:
Target:
Storage Location:
Notebook Path:
Datasets Path:
Source:
Policy:


<i18n value="e1663032-caa8-4b99-af1a-3ab27deaf130"/>


Steps:
1. Click the **Workflows** button on the sidebar.
1. Select the **Delta Live Tables** tab.
1. Click **Create Pipeline**.
1. Leave **Product Edition** as **Advanced**.
1. Fill in a **Pipeline Name**. Because these names must be unique, we suggest using the **Pipeline Name** provided in the cell above.
1. For **Notebook Libraries**, use the navigator to locate and select the notebook specified above.
1. Under **Configuration**, add three configuration parameters:
   * Click **Add configuration**, set the "key" to **spark.master** and the "value" to **local[\*]**.
   * Click **Add configuration**, set the "key" to **datasets_path** and the "value" to the value provided in the cell above.
   * Click **Add configuration**, set the "key" to **source** and the "value" to the value provided in the cell above.
1. In the **Target** field, enter the database name provided in the cell above.<br/>
This should follow the pattern **`<name>_<hash>_da_dewd_cap_12`**
1. In the **Storage location** field, enter the path provided in the cell above.
1. For **Pipeline Mode**, select **Continuous**
1. Uncheck the **Enable autoscaling** box.
1. Set the number of **`workers`** to **`0`** (zero).
1. Check the **Use Photon Acceleration** box.
1. For **Channel**, select **Current**
1. For **Policy**, select the value provided in the cell above.
1. Click **Create**.
1. After the UI updates, change from **Development** to **Production** mode

This should begin the deployment of infrastructure.
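
To see the UI choices in one place, the steps above correspond roughly to a pipeline settings document like the one sketched here. This is a hedged sketch, not the course's implementation: the field names follow the Delta Live Tables pipeline settings JSON as we understand it, every bracketed value is a placeholder for what `DA.print_pipeline_config()` prints, and `policy_id` assumes the policy is referenced by ID rather than by name. Compare it with your own pipeline's settings before relying on it.

```python
import json

# Placeholders: substitute the values printed by DA.print_pipeline_config().
pipeline_name = "<pipeline name>"
target_db = "<target database>"
storage_location = "<storage location>"
notebook_path = "<notebook path>"
datasets_path = "<datasets path>"
source = "<source>"
policy_id = "<policy id>"  # assumption: the policy is referenced by its ID

pipeline_settings = {
    "name": pipeline_name,
    "edition": "ADVANCED",      # Product Edition: Advanced
    "channel": "CURRENT",       # Channel: Current
    "continuous": True,         # Pipeline Mode: Continuous
    "development": False,       # Production mode
    "photon": True,             # Use Photon Acceleration
    "target": target_db,
    "storage": storage_location,
    "libraries": [{"notebook": {"path": notebook_path}}],
    "configuration": {
        "spark.master": "local[*]",
        "datasets_path": datasets_path,
        "source": source,
    },
    # Autoscaling disabled: a fixed worker count of zero.
    "clusters": [{"label": "default", "num_workers": 0, "policy_id": policy_id}],
}

settings_json = json.dumps(pipeline_settings, indent=2)
```

Keeping the settings in a dictionary makes it easy to check each UI step against a single field.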

In [0]:
# ANSWER
 
# This function is provided for students who do not 
# want to work through the exercise of creating the pipeline.
DA.create_pipeline()

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<command-4094000743659777> in <cell line: 5>()
      3 # This function is provided for students who do not
      4 # want to work through the exercise of creating the pipeline.
----> 5 DA.create_pipeline()

<command-4094000743658799> in create_pipeline(self)
      7     # We need to delete the existing pipline so that we can apply updates
      8     # because some attributes are not mutable after creation.
----> 9     [

In [0]:
DA.validate_pipeline_config()



<i18n value="6c8bd13c-938a-4283-b15a-bc1a598fb070"/>


## Schedule a Notebook Job

Our DLT pipeline is set up to process data as soon as it arrives.

We'll schedule a notebook to land a new batch of data every 2 minutes so we can see this functionality in action.

Before we start, run the following cell to get the values used in this step.

In [0]:
DA.print_job_config()



<i18n value="df989e07-97d4-4a34-9729-fad02399a908"/>


Steps:
1. Click the **Workflows** button on the sidebar
1. Select the **Jobs** tab.
1. Click the blue **Create Job** button
1. Configure the task:
    1. Enter **Land-Data** for the task name
    1. For **Type**, select **Notebook**
    1. For **Path**, select the **Notebook Path** value provided in the cell above
    1. From the **Cluster** dropdown, under **Existing All Purpose Clusters**, select your cluster
    1. Click **Create**
1. In the top-left of the screen, rename the job (not the task) from **`Land-Data`** (the default value) to the **Job Name** provided for you in the previous cell.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: When selecting your all-purpose cluster, you will get a warning that this will be billed as all-purpose compute. Production jobs should always be scheduled against new job clusters appropriately sized for the workload, because job clusters are billed at a much lower rate.
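
As a hedged sketch, the single-task job configured above corresponds roughly to a job settings document like the following. The field names follow the Databricks Jobs API 2.1 as we understand it. The bracketed values are placeholders for what `DA.print_job_config()` prints and for your own cluster's ID, which this notebook does not show.

```python
import json

# Placeholders: substitute the values printed by DA.print_job_config().
job_name = "<job name>"
notebook_path = "<notebook path>"
cluster_id = "<existing all-purpose cluster id>"  # assumption: looked up separately

job_settings = {
    "name": job_name,  # the job name, distinct from the task name below
    "tasks": [
        {
            "task_key": "Land-Data",
            "notebook_task": {"notebook_path": notebook_path},
            "existing_cluster_id": cluster_id,
        }
    ],
}

job_json = json.dumps(job_settings, indent=2)
```

A production job would typically replace `existing_cluster_id` with a new job cluster definition, in line with the note above.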

<i18n value="3994f3ee-e335-48c7-8770-64e1ef0dfab7"/>


## Set a Chronological Schedule for your Job
Steps:
1. Locate the **Schedule** section in the side panel on the right.
1. Click on the **Edit schedule** button to explore scheduling options.
1. Change the **Schedule type** field from **Manual (Paused)** to **Scheduled**, which will bring up a cron scheduling UI.
1. Set the schedule to update **Every 2**, **Minutes** from **0 minutes past the hour** 
1. Click **Save**

**NOTE**: If you wish, you can click **Run now** to trigger the first run, or wait for the next scheduled run (the next even-numbered minute) to confirm that your schedule is working.
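
An "every 2 minutes from 0 minutes past the hour" schedule can be written as a Quartz cron expression, which is the format Databricks Jobs schedules use. The sketch below shows that expression and a small helper that predicts upcoming run times. The helper `next_runs` and the `UTC` time zone are assumptions for illustration, and the helper only matches cron behavior when the interval divides 60 evenly.

```python
from datetime import datetime, timedelta

# Quartz cron fields: seconds minutes hours day-of-month month day-of-week.
# "0/2" in the minutes field means "every 2 minutes, starting at minute 0".
schedule = {
    "quartz_cron_expression": "0 0/2 * * * ?",
    "timezone_id": "UTC",  # assumption: choose your own time zone
    "pause_status": "UNPAUSED",
}


def next_runs(now, count=3, every_minutes=2):
    """Return the next `count` run times strictly after `now`.

    Assumes `every_minutes` divides 60, so the grid does not reset mid-hour.
    """
    base = now.replace(second=0, microsecond=0)
    # Minutes to add to reach the next multiple of `every_minutes`.
    step = every_minutes - (base.minute % every_minutes)
    first = base + timedelta(minutes=step)
    return [first + timedelta(minutes=every_minutes * i) for i in range(count)]


# Example: at 09:03:30 the next runs fall on 09:04, 09:06, and 09:08.
runs = next_runs(datetime(2022, 1, 1, 9, 3, 30))
```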

In [0]:
# ANSWER

# This function is provided for students who do not 
# want to work through the exercise of creating the job.
DA.create_job()



In [0]:
DA.validate_job_config()



In [0]:
# ANSWER

# This function is provided to start the job and
# block until it has completed, been canceled, or failed
DA.start_job()



<i18n value="30df4ffa-22b9-4e2c-b8d8-54aa09a8d4ed"/>


## Register DLT Event Metrics for Querying with DBSQL

The following cell prints out SQL statements to register the DLT event logs to your target database for querying in DBSQL.

Execute the output code with the DBSQL Query Editor to register these tables and views. 

Explore each and make note of the logged event metrics.

In [0]:
DA.generate_register_dlt_event_metrics_sql()
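
The output of this cell is not shown here. As a rough sketch of what such statements might look like, the helper below builds SQL that registers the event log as a table and defines a view over events that carry flow metrics. It assumes the event log is stored as a Delta table under `system/events` in the pipeline's storage location. The function name and the `dlt_events` and `dlt_metrics` object names are hypothetical and may differ from what the course helper prints.

```python
def register_event_log_sql(db_name, storage_location):
    """Build SQL statements that expose a DLT event log in `db_name` (sketch)."""
    # Assumption: the event log lives under <storage>/system/events.
    events_path = f"{storage_location.rstrip('/')}/system/events"
    return [
        f"CREATE TABLE IF NOT EXISTS {db_name}.dlt_events "
        f"USING DELTA LOCATION '{events_path}'",
        f"CREATE VIEW IF NOT EXISTS {db_name}.dlt_metrics AS "
        f"SELECT timestamp, origin.flow_name, details "
        f"FROM {db_name}.dlt_events "
        f"WHERE details:flow_progress.metrics IS NOT NULL",
    ]


# Placeholder arguments; use your own target database and storage location.
statements = register_event_log_sql("my_db", "dbfs:/tmp/storage/")
```

Each returned string is one statement to run in the DBSQL Query Editor.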



<i18n value="e035ddc7-4af9-4e9c-81f8-530e8db7c504"/>


## Define a Query on the Gold Table

The **daily_patient_avg** table is automatically updated each time a new batch of data is processed through the DLT pipeline. Each time a query is executed against this table, DBSQL checks whether a newer version exists and then materializes results from the newest available version.

Run the following cell to print out a query with your database name. Save this as a DBSQL query.

In [0]:
DA.generate_daily_patient_avg()
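
To make the grain of the gold table concrete, the sketch below computes a daily average heart rate per patient in plain Python. The output columns (`name`, `date`, `avg_heartrate`) match the ones used for the visualization in this lab. The input field names and the sample readings are assumptions made up for illustration, and the real table is produced by the DLT pipeline rather than by code like this.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative readings only; raw field names and values are assumptions.
readings = [
    {"name": "Patient A", "time": "2022-01-01T08:00:00", "heartrate": 60.0},
    {"name": "Patient A", "time": "2022-01-01T20:00:00", "heartrate": 70.0},
    {"name": "Patient A", "time": "2022-01-02T08:00:00", "heartrate": 80.0},
    {"name": "Patient B", "time": "2022-01-01T09:00:00", "heartrate": 90.0},
]


def daily_patient_avg(rows):
    """Average heart rate per (name, date), one output row per pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (name, date) -> [sum, count]
    for row in rows:
        day = datetime.fromisoformat(row["time"]).date()
        acc = totals[(row["name"], day)]
        acc[0] += row["heartrate"]
        acc[1] += 1
    return [
        {"name": name, "date": day, "avg_heartrate": total / count}
        for (name, day), (total, count) in sorted(totals.items())
    ]


daily_rows = daily_patient_avg(readings)
```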



<i18n value="679db36c-b257-4248-b2fe-56b85099d0b9"/>


## Add a Line Plot Visualization

To track trends in patient averages over time, create a line plot and add it to a new dashboard.

Create a line plot with the following settings:
* **X Column**: **`date`**
* **Y Column**: **`avg_heartrate`**
* **Group By**: **`name`**

Add this visualization to a dashboard.

<i18n value="7351e179-68f8-4091-a6ee-647974f010ce"/>


## Track Data Processing Progress

The code below extracts the **`flow_name`**, **`timestamp`**, and **`num_output_rows`** from the DLT event logs.

Save this query in DBSQL, then define a bar plot visualization that shows:
* **X Column**: **`timestamp`**
* **Y Column**: **`num_output_rows`**
* **Group By**: **`flow_name`**

Add your visualization to your dashboard.

In [0]:
DA.generate_visualization_query()
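
The query itself is printed by the cell above and is not reproduced here. As a hedged illustration of the extraction it performs, the sketch below pulls the same three fields out of event records in plain Python. It assumes each event has a `timestamp`, an `origin` struct with a `flow_name`, and a `details` JSON string whose `flow_progress.metrics` object holds `num_output_rows`. The sample events and the flow name `example_flow` are invented for illustration.

```python
import json

# Invented sample events shaped like the assumed event log layout.
sample_events = [
    {
        "timestamp": "2022-01-01T00:02:05Z",
        "origin": {"flow_name": "example_flow"},
        "details": json.dumps(
            {"flow_progress": {"metrics": {"num_output_rows": 120}}}
        ),
    },
    {
        # An event without flow metrics, which should be skipped.
        "timestamp": "2022-01-01T00:02:00Z",
        "origin": {"flow_name": "example_flow"},
        "details": json.dumps({"flow_progress": {"status": "STARTING"}}),
    },
]


def extract_flow_metrics(events):
    """Return flow_name, timestamp, and num_output_rows for events with metrics."""
    rows = []
    for event in events:
        details = json.loads(event["details"])
        metrics = (details.get("flow_progress") or {}).get("metrics") or {}
        if "num_output_rows" not in metrics:
            continue  # not a flow-progress event with row counts
        rows.append(
            {
                "flow_name": event["origin"]["flow_name"],
                "timestamp": event["timestamp"],
                "num_output_rows": metrics["num_output_rows"],
            }
        )
    return rows


flow_rows = extract_flow_metrics(sample_events)
```

Grouping these rows by `flow_name` and plotting `num_output_rows` against `timestamp` gives the bar plot described above.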



<i18n value="5f94b102-d42e-40f1-8253-c14cbf86d717"/>


## Refresh your Dashboard and Track Results

The **Land-Data** notebook scheduled with Jobs above has 12 batches of data, each representing a month of recordings for our small sampling of patients. As configured per our instructions, it should take just over 20 minutes for all of these batches of data to be triggered and processed (we scheduled the Databricks Job to run every 2 minutes, and batches of data will process through our pipeline very quickly after initial ingestion).

Refresh your dashboard and review your visualizations to see how many batches of data have been processed. (If you followed the instructions as outlined here, there should be 12 distinct flow updates tracked by your DLT metrics.) If not all source data has been processed yet, you can go back to the Databricks Jobs UI and manually trigger additional batches.
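
The "just over 20 minutes" estimate can be checked with a little arithmetic. The sketch below assumes that the first batch (`01.json`) was already landed in the **Land Initial Data** step and that each scheduled job run lands exactly one further batch. Both are assumptions drawn from the steps in this notebook, not confirmed behavior of the **Land-Data** notebook.

```python
total_batches = 12     # months of recordings, per the text above
already_landed = 1     # assumption: 01.json was landed before the job started
interval_minutes = 2   # the job runs every 2 minutes

remaining_runs = total_batches - already_landed
minutes_needed = remaining_runs * interval_minutes  # 11 runs * 2 minutes
```

Under these assumptions the remaining 11 batches take 22 minutes to land.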

<i18n value="b61bf387-2c1b-4ae6-8968-c4189beb477f"/>



With everything configured, you can now continue to the final part of your lab in the notebook [DE 12.2.4L - Final Steps]($./DE 12.2.4L - Final Steps)

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>