
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

## Auto Load Data to Multiplex Bronze

Our chief architect has decided that rather than connecting directly to Kafka, we'll land raw records as JSON files in cloud object storage and ingest them with Auto Loader. We'll build a multiplex table that ingests and stores the entire history of this incremental feed. The initial table will store data from all of our topics and have the following schema.

| Field | Type |
| --- | --- |
| key | BINARY |
| value | BINARY |
| topic | STRING |
| partition | LONG |
| offset | LONG |
| timestamp | LONG |
| date | DATE |
| week_part | STRING |
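
As a rough illustration of the multiplex idea, the sketch below models bronze rows in plain Python and shows how a downstream consumer might pull out a single topic and decode its payload. The `BronzeRecord` class, the `records_for_topic` helper, the topic names, and the sample payloads are all hypothetical; only the field names come from the schema above.

```python
import datetime as dt
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BronzeRecord:
    """One row of the multiplex bronze table (fields match the schema above)."""
    key: bytes
    value: bytes
    topic: str
    partition: int
    offset: int
    timestamp: int
    date: dt.date
    week_part: str


def records_for_topic(records, topic):
    """Demultiplex: keep one topic's rows and decode each JSON payload."""
    return [json.loads(r.value.decode("utf-8")) for r in records if r.topic == topic]


# Hypothetical sample rows; topic names and payloads are made up for illustration.
day = dt.date(2019, 12, 1)
bronze = [
    BronzeRecord(b"1", b'{"user_id": 1, "steps": 120}', "workouts", 0, 0, 1575158400000, day, "2019-48"),
    BronzeRecord(b"2", b'{"device_id": 7, "bpm": 64}', "heart_rate", 0, 0, 1575158460000, day, "2019-48"),
    BronzeRecord(b"3", b'{"user_id": 2, "steps": 300}', "workouts", 0, 1, 1575158520000, day, "2019-48"),
]

workouts = records_for_topic(bronze, "workouts")
print(workouts)
```

Every topic shares the same outer columns, so one table and one ingestion job can hold them all; topic-specific parsing is deferred to later layers.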

<img src="https://files.training.databricks.com/images/ade/ADE_arch_bronze.png" width="60%" />

**NOTE**: Details on additional configurations for connecting to Kafka are available [here](https://docs.databricks.com/spark/latest/structured-streaming/kafka.html).


## Learning Objectives

By the end of this lesson, you should be able to:
- Describe a multiplex design
- Apply Auto Loader to incrementally process records
- Configure trigger intervals
- Use `trigger once` logic to execute triggered incremental loading of data

The following cell declares the paths needed throughout this notebook.

In [0]:
%run ../Includes/bronze-setup

The `Paths` variable will be declared in each notebook for easy file management.

**NOTE**: All records are being stored on the DBFS root for this training example. Setting up separate databases and storage accounts for different layers of data is preferred in both development and production.

In [0]:
Paths

## Examine Source Data

Data files are being written to the path specified by the variable below.

Use the following cell to examine the schema in the source data and determine if anything needs to be changed as it's being ingested.

In [0]:
# TODO
Paths.sourceDaily

## Prepare Data to Join with Date Lookup Table
The initialization script has loaded a `date_lookup` table. This table has a number of pre-computed date values. Additional fields indicating holidays or financial quarters are often added to a table like this for later data enrichment.

Pre-computing and storing these values is especially important because we want to partition our data by year and week, using the string pattern `YYYY-WW`. While Spark has both `year` and `weekofyear` functions built in, the `weekofyear` function may not provide expected behavior for dates falling in the last week of December or the [first week of January](https://spark.apache.org/docs/2.3.0/api/sql/#weekofyear), as it defines week 1 as the first week with more than 3 days.

While this edge case is a quirk of how Spark numbers weeks, a `date_lookup` table shared across the organization helps ensure that data is consistently enriched with date-related details.
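
To make the edge case concrete, here is a small sketch in plain Python. It relies on `date.isocalendar()`, which follows the same ISO 8601 week rule described above (week 1 is the first week with more than 3 days). The helper names are hypothetical, and how `date_lookup` actually derives `week_part` is not shown in this notebook.

```python
from datetime import date


def naive_week_part(d):
    """Calendar year plus ISO week number, like combining year() with weekofyear()."""
    return f"{d.year}-{d.isocalendar()[1]:02d}"


def iso_week_part(d):
    """ISO week-numbering year plus ISO week number, which always agree."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-{iso_week:02d}"


for d in (date(2018, 12, 31), date(2021, 1, 1)):
    print(d, naive_week_part(d), iso_week_part(d))
# 2018-12-31 2018-01 2019-01
# 2021-01-01 2021-53 2020-53
```

The naive pattern files the last day of 2018 under week 1 of 2018 and the first day of 2021 under week 53 of 2021, which is the kind of mislabeled partition a shared lookup table avoids.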

In [0]:
%sql

DESCRIBE date_lookup

The `date_lookup` table is very small (here it only includes date info for 3 years). Manually caching the subset of this table we'll be using makes sure it's readily available in memory, although the Delta Cache will also cache and reuse this data automatically.

The table we're implementing requires an accurate `week_part` for each date.

The cell below loads and caches the `date` and `week_part` fields.

In [0]:
dateLookup = spark.table("date_lookup").select("date", "week_part")
dateLookup.cache().count()

Working with the JSON data stored in the `Paths.sourceDaily` location, transform the `timestamp` column as necessary so that it can be joined with the `date` column.

In [0]:
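# TODO
from pyspark.sql import functions as F  # harmless if the setup script already imported F

jsonDF = spark.read.json(Paths.sourceDaily)

# Broadcast the small lookup table; the left join keeps every source record
joinedDF = (jsonDF.join(F.broadcast(dateLookup),
#     <INSERT-MATCHING-CONDITION>,
    "left"))

display(joinedDF)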
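
The matching condition depends on how `timestamp` is encoded. The sketch below assumes it holds epoch milliseconds (typical for Kafka timestamps, but not confirmed by this notebook) and shows the conversion and left-join lookup in plain Python. The `epoch_ms_to_date` helper and the sample lookup row are hypothetical.

```python
from datetime import date, datetime, timezone


def epoch_ms_to_date(ts_ms):
    """Convert epoch milliseconds to a UTC calendar date."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()


# Hypothetical lookup row standing in for date_lookup (date -> week_part).
lookup = {date(2019, 12, 1): "2019-48"}

ts_ms = 1575158400000  # 2019-12-01T00:00:00Z
event_date = epoch_ms_to_date(ts_ms)

# Left-join semantics: a date missing from the lookup yields None
# instead of dropping the record.
week_part = lookup.get(event_date)
print(event_date, week_part)
```

Under the same assumption, one way to express this in PySpark would be to compare `F.to_date((F.col("timestamp") / 1000).cast("timestamp"))` with `F.col("date")`.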

## Define Triggered Incremental Auto Loading to Multiplex Bronze Table

Below is starter code for a function to incrementally process data from the source directory to the bronze table, creating the table during the initial write.

Fill in the missing code to:
- Define the schema
- Configure Auto Loader to use the JSON format and specified schema
- Perform a broadcast join with the `date_lookup` table
- Partition the data by the `topic` and `week_part` fields

In [0]:
# TODO
def process_bronze():
#     schema = "<FILL-IN>"
     
    (spark.readStream
#         <FILL-IN>
        .load(Paths.sourceDaily)
#         .join(<FILL-IN>)
        .writeStream
        .option("checkpointLocation", Paths.bronzeCheckpoint)
#         .partitionBy(<FILL-IN>)
        .option("path", Paths.bronzeTable)
        .trigger(once=True)
        .table("bronze")
        .awaitTermination())
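
For reference, the read side of such a function could look roughly like the sketch below. It is one possible approach, not the official solution. It assumes the raw JSON carries the six Kafka-style fields from the schema table, that `date` and `week_part` come from the join, and that `timestamp` is in epoch milliseconds. `SOURCE_FIELDS`, `SOURCE_SCHEMA`, and `read_bronze_source` are hypothetical names.

```python
# Fields expected in the raw JSON files. Assumption: `date` and `week_part`
# are not present in the source and are added by the join with date_lookup.
SOURCE_FIELDS = [
    ("key", "BINARY"),
    ("value", "BINARY"),
    ("topic", "STRING"),
    ("partition", "LONG"),
    ("offset", "LONG"),
    ("timestamp", "LONG"),
]

# Spark accepts a DDL-formatted string wherever a schema is expected.
SOURCE_SCHEMA = ", ".join(f"{name} {dtype}" for name, dtype in SOURCE_FIELDS)


def read_bronze_source(spark, source_path, date_lookup):
    """Return a streaming DataFrame of raw records enriched with week_part."""
    # Imported lazily so the schema string above can be built without PySpark.
    from pyspark.sql import functions as F

    # Assumption: `timestamp` holds epoch milliseconds.
    event_date = F.to_date((F.col("timestamp") / 1000).cast("timestamp"))
    return (spark.readStream
            .format("cloudFiles")                 # Auto Loader source
            .option("cloudFiles.format", "json")  # format of the landed files
            .schema(SOURCE_SCHEMA)
            .load(source_path)
            .join(F.broadcast(date_lookup), event_date == F.col("date"), "left"))


print(SOURCE_SCHEMA)
```

The returned DataFrame would then be written as in the starter code, with `.partitionBy("topic", "week_part")` added before the write so that files are laid out by topic and week.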

Run the cell below to process an incremental batch of data.

In [0]:
process_bronze()

Review the count of processed records.

In [0]:
%sql
SELECT COUNT(*) FROM bronze

Preview the data to ensure records are being ingested correctly.

In [0]:
%sql
SELECT * FROM bronze

The `Raw.arrival()` call below uses a helper class to land new data in the source directory.

Executing the following cell should successfully process a new batch.

In [0]:
Raw.arrival()
process_bronze()
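
Why does the second call pick up only the new data? Conceptually, the checkpoint remembers which files have already been ingested, and a trigger-once run processes whatever is new and then stops. The toy model below mimics that behavior in plain Python; it is not how Auto Loader is implemented, and the function, file names, and rows are hypothetical.

```python
def process_new_files(source_files, checkpoint, table):
    """Append rows from files not yet recorded in the checkpoint, then stop."""
    new_files = [name for name in sorted(source_files) if name not in checkpoint]
    for name in new_files:
        table.extend(source_files[name])
        checkpoint.add(name)  # remember the file so it is never ingested twice
    return len(new_files)


# Hypothetical file names and rows, for illustration only.
source = {"batch-01.json": ["r1", "r2"]}
checkpoint, bronze = set(), []

process_new_files(source, checkpoint, bronze)  # first run ingests batch-01
source["batch-02.json"] = ["r3"]               # new data arrives
process_new_files(source, checkpoint, bronze)  # second run ingests only batch-02
process_new_files(source, checkpoint, bronze)  # nothing new, so nothing is added
print(bronze)
```

Because progress lives in the checkpoint rather than in the function, rerunning the load is safe: records that were already ingested are not duplicated.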

Confirm the count is now higher.

In [0]:
%sql
SELECT COUNT(*) FROM bronze

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>