<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Incremental Data Ingestion with Auto Loader

Incremental ETL matters because it lets us process only the data that has arrived since the last ingestion. Processing only new data avoids redundant work and helps enterprises scale data pipelines reliably.

The first step for any successful data lakehouse implementation is ingesting into a Delta Lake table from cloud storage. 

Historically, ingesting files from a data lake into a database has been a complicated process.

Databricks Auto Loader provides an easy-to-use mechanism for incrementally and efficiently processing new data files as they arrive in cloud file storage. In this notebook, you'll see Auto Loader in action.

Due to the benefits and scalability that Auto Loader delivers, Databricks recommends its use as general **best practice** when ingesting data from cloud object storage.

## Learning Objectives
By the end of this lesson, you should be able to:
* Execute Auto Loader code to incrementally ingest data from cloud storage to Delta Lake
* Describe what happens when a new file arrives in a directory configured for Auto Loader
* Query a table fed by a streaming Auto Loader query

## Dataset Used
This demo uses simplified, artificially generated medical data representing heart rate recordings delivered in JSON format.

| Field | Type |
| --- | --- |
| device_id | int |
| mrn | long |
| time | double |
| heartrate | double |
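
To make the schema concrete, here is a minimal sketch that parses one record and casts each field to the type listed above. The sample values and the `parse_recording` helper are hypothetical, invented only for illustration; they are not taken from the course dataset.

```python
import json

# Hypothetical sample record; the values are made up for illustration.
sample_line = '{"device_id": 23, "mrn": 40580129, "time": 1601830800.0, "heartrate": 72.5}'

def parse_recording(line):
    """Parse one JSON line and cast each field to its documented type."""
    raw = json.loads(line)
    return {
        "device_id": int(raw["device_id"]),    # int
        "mrn": int(raw["mrn"]),                # long (Python ints are unbounded)
        "time": float(raw["time"]),            # double
        "heartrate": float(raw["heartrate"]),  # double
    }

record = parse_recording(sample_line)
print(record)
```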

## Getting Started

Run the following cell to reset the demo and configure required variables and helper functions.

In [0]:
%run ../Includes/Classroom-Setup-6.1

Python interpreter will be restarted.

Creating the database "dbacademy_chiraggoel_kpmg_com_dewd_6_1"

Loading the file 01.json to the tracker dataset

Predefined Paths:
  DA.paths.working_dir: dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.1
  DA.paths.user_db:     dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.1/6_1.db
  DA.paths.checkpoints: dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.1/checkpoints

Predefined tables in dbacademy_chiraggoel_kpmg_com_dewd_6_1:
  -none-

Setup completed in 4 seconds


## Using Auto Loader

In the cell below, a function is defined to demonstrate using Databricks Auto Loader with the PySpark API. This code includes both a Structured Streaming read and write.

The following notebook will provide a more robust overview of Structured Streaming. If you wish to learn more about Auto Loader options, refer to the <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" target="_blank">documentation</a>.

Note that when using Auto Loader with automatic <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html" target="_blank">schema inference and evolution</a>, the four arguments shown here should allow ingestion of most datasets. These arguments are explained below.

| argument | what it is | how it's used |
| --- | --- | --- |
| **`data_source`** | The directory of the source data | Auto Loader will detect new files as they arrive in this location and queue them for ingestion; passed to the **`.load()`** method |
| **`source_format`** | The format of the source data |  While the format for all Auto Loader queries will be **`cloudFiles`**, the format of the source data should always be specified for the **`cloudFiles.format`** option |
| **`table_name`** | The name of the target table | Spark Structured Streaming supports writing directly to Delta Lake tables by passing a table name as a string to the **`.table()`** method. Note that you can either append to an existing table or create a new table |
| **`checkpoint_directory`** | The location for storing metadata about the stream | This argument is passed to the **`checkpointLocation`** and **`cloudFiles.schemaLocation`** options. Checkpoints keep track of streaming progress, while the schema location tracks updates to the fields in the source dataset |

**NOTE**: The code below has been streamlined to demonstrate Auto Loader functionality. We'll see in later lessons that additional transformations can be applied to source data before saving them to Delta Lake.

In [0]:
def autoload_to_table(data_source, source_format, table_name, checkpoint_directory):
    query = (spark.readStream
                  .format("cloudFiles")
                  .option("cloudFiles.format", source_format)
                  .option("cloudFiles.schemaLocation", checkpoint_directory)
                  .load(data_source)
                  .writeStream
                  .option("checkpointLocation", checkpoint_directory)
                  .option("mergeSchema", "true")
                  .table(table_name))
    return query

In the following cell, we use the previously defined function and some path variables defined in the setup script to begin an Auto Loader stream.

Here, we're reading from a source directory of JSON files.

In [0]:
query = autoload_to_table(data_source = f"{DA.paths.working_dir}/tracker",
                          source_format = "json",
                          table_name = "target_table",
                          checkpoint_directory = f"{DA.paths.checkpoints}/target_table")

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-2841292000074669> in <module>
----> 1 query = autoload_to_table(data_source = f"{DA.paths.working_dir}/tracker",
      2                           source_format = "json",
      3                           table_name = "target_table",
      4                           checkpoint_directory = f"{DA.paths.checkpoints}/target_table")

<command-2841292000074667> in autoload_to_table(data_source, source_format, table_name, checkpoint_directory)
      1 def autoload_to_table(data_source, source_format,

Because Auto Loader uses Spark Structured Streaming to load data incrementally, the code above doesn't appear to finish executing.

We can think of this as a **continuously active query**. This means that as soon as new data arrives in our data source, it will be processed through our logic and loaded into our target table. We'll explore this in just a second.
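
One way to confirm that such a query is still running is to inspect the handle returned when the stream starts. The sketch below assumes `query` is the `StreamingQuery` object returned by `autoload_to_table`; `summarize_stream` is a hypothetical helper name, and it reads only the `isActive`, `status`, and `lastProgress` attributes, treating the latter two as dictionaries as PySpark has traditionally exposed them.

```python
def summarize_stream(query):
    """Return a small dict describing a StreamingQuery-like object.

    Hypothetical helper: reads only isActive, status, and lastProgress.
    """
    progress = query.lastProgress  # None until the first micro-batch completes
    return {
        "active": query.isActive,
        "message": query.status["message"],
        "last_batch_id": None if progress is None else progress["batchId"],
    }

# In the notebook, after starting the stream:
# print(summarize_stream(query))
```

Calling `query.stop()` ends the stream; until then it keeps picking up new files.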

## Helper Function for Streaming Lessons

Our notebook-based lessons combine streaming functions with batch and streaming queries against the results of those operations. These notebooks are for instructional purposes and intended for interactive, cell-by-cell execution. This pattern is not intended for production.

Below, we define a helper function that prevents our notebook from executing the next cell just long enough to ensure data has been written out by a given streaming query. This code should not be necessary in a production job.

In [0]:
def block_until_stream_is_ready(query, min_batches=2):
    import time
    while len(query.recentProgress) < min_batches:
        time.sleep(5)  # Wait 5 seconds between checks

    print(f"The stream has processed {len(query.recentProgress)} batches")

block_until_stream_is_ready(query)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<command-2841292000074672> in <module>
      6     print(f"The stream has processed {len(query.recentProgress)} batches")
      7 
----> 8 block_until_stream_is_ready(query)

NameError: name 'query' is not defined
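
The helper in this section waits indefinitely if the stream never reports progress. As a possible variant (not part of the course material), the sketch below also gives up when the stream is no longer active or when a timeout elapses. The function name and the `timeout_seconds` and `poll_seconds` parameters are hypothetical.

```python
import time

def block_until_stream_is_ready_or_timeout(query, min_batches=2,
                                           timeout_seconds=120, poll_seconds=5):
    """Wait for min_batches progress updates; stop early on failure or timeout.

    Returns True if the stream reached min_batches, otherwise False.
    """
    deadline = time.monotonic() + timeout_seconds
    while len(query.recentProgress) < min_batches:
        if not query.isActive:            # stream failed or was stopped
            return False
        if time.monotonic() >= deadline:  # waited long enough
            return False
        time.sleep(poll_seconds)
    return True
```

Returning a boolean instead of printing lets the calling cell decide whether to continue or to surface the failure.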

## Query Target Table

Once data has been ingested to Delta Lake with Auto Loader, users can interact with it the same way they would any table.

In [0]:
%sql
SELECT * FROM target_table

Note that the **`_rescued_data`** column is added by Auto Loader automatically to capture any data that might be malformed and not fit into the table otherwise.

While Auto Loader captured the field names for our data correctly, note that it encoded all fields as **`STRING`** type. Because JSON is a text-based format, this is the safest and most permissive type, ensuring that the least amount of data is dropped or ignored at ingestion due to type mismatch.

In [0]:
%sql
DESCRIBE TABLE target_table
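
If typed columns are preferred over the all-**`STRING`** schema described above, Auto Loader offers the **`cloudFiles.inferColumnTypes`** and **`cloudFiles.schemaHints`** options. The sketch below only assembles an options dictionary. `build_autoloader_options` is a hypothetical helper, the schema location path is a placeholder, and the hint string simply restates the types from the dataset table. Check the Auto Loader documentation for the exact behavior on your runtime before relying on it.

```python
def build_autoloader_options(source_format, schema_location,
                             schema_hints=None, infer_column_types=False):
    """Assemble reader options for an Auto Loader stream (hypothetical helper)."""
    options = {
        "cloudFiles.format": source_format,
        "cloudFiles.schemaLocation": schema_location,
    }
    if infer_column_types:
        # Ask Auto Loader to sample the data and infer non-string types.
        options["cloudFiles.inferColumnTypes"] = "true"
    if schema_hints:
        # Pin the types of specific columns while still inferring the rest.
        options["cloudFiles.schemaHints"] = schema_hints
    return options

# Types restated from the dataset table (long is written as BIGINT in SQL DDL).
hints = "device_id INT, mrn BIGINT, time DOUBLE, heartrate DOUBLE"
# Placeholder schema location, used only for this illustration.
options = build_autoloader_options("json", "/tmp/example/checkpoints", schema_hints=hints)

# In the notebook, these could be applied with:
# spark.readStream.format("cloudFiles").options(**options).load(data_source)
```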

Use the cell below to define a temporary view that summarizes the recordings in our target table.

We'll use this view below to demonstrate how new data is automatically ingested with Auto Loader.

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW device_counts AS
  SELECT device_id, count(*) total_recordings
  FROM target_table
  GROUP BY device_id;
  
SELECT * FROM device_counts
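
For readers who prefer the DataFrame API, the same view could be registered from Python. This is a sketch that assumes the notebook's `spark` session is passed in; `create_device_counts_view` is a hypothetical helper name.

```python
def create_device_counts_view(spark, table_name="target_table",
                              view_name="device_counts"):
    """Register a temp view with one row per device_id and its record count."""
    counts = (spark.table(table_name)
                   .groupBy("device_id")
                   .count()  # adds a column named "count"
                   .withColumnRenamed("count", "total_recordings"))
    counts.createOrReplaceTempView(view_name)
    return counts

# In the notebook: display(create_device_counts_view(spark))
```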

## Land New Data

As mentioned previously, Auto Loader is configured to incrementally process files from a directory in cloud object storage into a Delta Lake table.

We have configured and are currently executing a query to process JSON files from the source directory (**`{DA.paths.working_dir}/tracker`**, passed to our function as **`data_source`**) into a table named **`target_table`**. Let's review the contents of this directory.

In [0]:
files = dbutils.fs.ls(f"{DA.paths.working_dir}/tracker")
display(files)

| path | name | size | modificationTime |
| --- | --- | --- | --- |
| dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.1/tracker/01.json | 01.json | 506710 | 1653227041000 |


At present, you should see a single JSON file listed in this location.

The method in the cell below was configured in our setup script to allow us to model an external system writing data to this directory. Each time you execute the cell below, a new file will land in the source directory.

In [0]:
DA.data_factory.load()

Loading the file 02.json to the tracker dataset


List the contents of the source directory again using the cell below. You should see one additional JSON file for each time you ran the previous cell.

In [0]:
files = dbutils.fs.ls(f"{DA.paths.working_dir}/tracker")
display(files)

| path | name | size | modificationTime |
| --- | --- | --- | --- |
| dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.1/tracker/01.json | 01.json | 506710 | 1653227041000 |
| dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.1/tracker/02.json | 02.json | 711589 | 1653227235000 |


## Tracking Ingestion Progress

Historically, many systems have been configured to either reprocess all records in a source directory to calculate current results or require data engineers to implement custom logic to identify new data that's arrived since the last time a table was updated.
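
To illustrate the kind of custom bookkeeping described above, here is a toy sketch in plain Python that remembers which file names it has already seen and returns only the new ones. It is a conceptual illustration only, not how Auto Loader is implemented; `find_new_files`, the local temporary directory, and the `seen_files.json` state file are all hypothetical.

```python
import json
import os
import tempfile

def find_new_files(source_dir, state_path):
    """Return files in source_dir not seen before, and record them in state_path."""
    seen = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            seen = set(json.load(f))
    current = sorted(os.listdir(source_dir))
    new_files = [name for name in current if name not in seen]
    with open(state_path, "w") as f:
        json.dump(sorted(seen | set(new_files)), f)
    return new_files

# Simulate files landing in a temporary local directory.
workdir = tempfile.mkdtemp()
source_dir = os.path.join(workdir, "tracker")
os.makedirs(source_dir)
state_path = os.path.join(workdir, "seen_files.json")

open(os.path.join(source_dir, "01.json"), "w").close()
first_pass = find_new_files(source_dir, state_path)   # ['01.json']

open(os.path.join(source_dir, "02.json"), "w").close()
second_pass = find_new_files(source_dir, state_path)  # ['02.json']

print(first_pass, second_pass)
```

Even this small version has to persist state and handle the first run separately, which hints at why hand-written tracking logic becomes hard to maintain at scale.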

With Auto Loader, your table has already been updated.

Run the query below to confirm that new data has been ingested.

In [0]:
%sql
SELECT * FROM device_counts

The Auto Loader query we configured earlier automatically detects and processes records from the source directory into the target table. There is a slight delay as records are ingested, but an Auto Loader query executing with default streaming configuration should update results in near real time.

The query below shows the table history. A new table version should be indicated for each **`STREAMING UPDATE`**. These update events coincide with new batches of data arriving at the source.

In [0]:
%sql
DESCRIBE HISTORY target_table
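
To focus on the versions written by the stream, the history can also be filtered from Python. This is a sketch that assumes the notebook's `spark` session is passed in; `streaming_updates` is a hypothetical helper name, and the selected columns are a subset of those returned by **`DESCRIBE HISTORY`**.

```python
def streaming_updates(spark, table_name="target_table"):
    """Return Delta history rows written by streaming micro-batches."""
    history = spark.sql(f"DESCRIBE HISTORY {table_name}")
    return (history
            .filter("operation = 'STREAMING UPDATE'")
            .select("version", "timestamp", "operation", "operationMetrics"))

# In the notebook: display(streaming_updates(spark))
```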

## Clean Up
Feel free to continue landing new data and exploring the table results with the cells above.

When you're finished, run the following cell to stop all active streams and remove created resources before continuing.

In [0]:
DA.cleanup()

Dropping the database "dbacademy_chiraggoel_kpmg_com_dewd_6_1"
Removing the working directory "dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.1"


&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>