<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Using Auto Loader and Structured Streaming with Spark SQL

## Learning Objectives
By the end of this lab, you should be able to:
* Ingest data using Auto Loader
* Aggregate streaming data
* Stream data to a Delta table

## Setup
Run the following script to set up the necessary variables and clear out past runs of this notebook. Re-executing this cell lets you start the lab over.

In [0]:
%run ../Includes/Classroom-Setup-6.3L

Python interpreter will be restarted.

Creating the database "dbacademy_chiraggoel_kpmg_com_dewd_6_3l"

Predefined Paths:
  DA.paths.working_dir: dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.3l
  DA.paths.user_db:     dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.3l/6_3l.db
  DA.paths.checkpoints: dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/6.3l/_checkpoints

Predefined tables in dbacademy_chiraggoel_kpmg_com_dewd_6_3l:
  -none-

Setup completed in 2 seconds


## Configure Streaming Read

This lab uses a collection of customer-related CSV files stored on DBFS under */databricks-datasets/retail-org/customers/*.

Read this data with <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" target="_blank">Auto Loader</a>, relying on its schema inference (use **`customers_checkpoint_path`** to store the schema info). Then create a streaming temporary view called **`customers_raw_temp`**.
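The options involved can be collected in one place before they are handed to the reader. The sketch below is only an illustration: **`auto_loader_options`** is a hypothetical helper (not part of the lab or of Databricks), and the path is a placeholder. It also sets **`cloudFiles.inferColumnTypes`**, which is left off here; with that option disabled, Auto Loader infers every CSV column as a string, which is consistent with the all-string schema this lab checks for.

```python
def auto_loader_options(file_format, schema_location, infer_column_types=False):
    """Build the option map for a cloudFiles (Auto Loader) stream reader.

    Hypothetical helper for illustration; only the option keys are real.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Spark reader options are strings, so the flag becomes "true"/"false".
        "cloudFiles.inferColumnTypes": str(infer_column_types).lower(),
    }


# Placeholder path, assumed for this sketch only.
opts = auto_loader_options("csv", "/tmp/checkpoints/customers")

# On a Databricks cluster the map could be applied like this:
#   spark.readStream.format("cloudFiles").options(**opts).load(source_path)
```

Keeping the options in a dictionary makes it easy to print or compare them when a stream does not behave as expected.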

In [0]:
customers_checkpoint_path = f"{DA.paths.checkpoints}/customers"

(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", customers_checkpoint_path)
      .load("/databricks-datasets/retail-org/customers/")
      .createOrReplaceTempView("customers_raw_temp"))

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-2841292000074646> in <module>
      1 customers_checkpoint_path = f"{DA.paths.checkpoints}/customers"
      2 
----> 3 (spark.readStream
      4       .format("cloudFiles")
      5       .option("cloudFiles.format", "csv")

/databricks/spark/python/pyspark/sql/streaming.py in load(self, path, format, schema, **options)
    450                 raise ValueError("If the path is provided for stream, it needs to be a " +
    451

In [0]:
from pyspark.sql import Row
assert Row(tableName="customers_raw_temp", isTemporary=True) in spark.sql("show tables").select("tableName", "isTemporary").collect(), "Table not present or not temporary"
assert spark.table("customers_raw_temp").dtypes ==  [('customer_id', 'string'),
 ('tax_id', 'string'),
 ('tax_code', 'string'),
 ('customer_name', 'string'),
 ('state', 'string'),
 ('city', 'string'),
 ('postcode', 'string'),
 ('street', 'string'),
 ('number', 'string'),
 ('unit', 'string'),
 ('region', 'string'),
 ('district', 'string'),
 ('lon', 'string'),
 ('lat', 'string'),
 ('ship_to_address', 'string'),
 ('valid_from', 'string'),
 ('valid_to', 'string'),
 ('units_purchased', 'string'),
 ('loyalty_segment', 'string'),
 ('_rescued_data', 'string')], "Incorrect Schema"
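When the schema assertion fails, the message "Incorrect Schema" does not say which column is at fault. One possible way to narrow it down is sketched below. **`diff_dtypes`** is a hypothetical helper, not part of the lab, and the sample lists are made up for the example. Unlike the assertion above, it ignores column order.

```python
def diff_dtypes(actual, expected):
    """Return readable differences between two lists of (name, type) pairs."""
    actual_map, expected_map = dict(actual), dict(expected)
    problems = []
    # Columns the check expects: report missing ones and type mismatches.
    for name, dtype in expected_map.items():
        if name not in actual_map:
            problems.append(f"missing column: {name}")
        elif actual_map[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual_map[name]}")
    # Columns present in the data that the check does not expect.
    for name in actual_map:
        if name not in expected_map:
            problems.append(f"unexpected column: {name}")
    return problems


# Made-up sample data for the sketch.
expected = [("customer_id", "string"), ("state", "string")]
actual = [("customer_id", "int"), ("city", "string")]
report = diff_dtypes(actual, expected)
```

In the notebook, the first argument could be **`spark.table("customers_raw_temp").dtypes`** and the second the expected list from the assertion.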

## Define a streaming aggregation

Using CTAS-style syntax (**`CREATE OR REPLACE TEMPORARY VIEW ... AS SELECT`**), define a new streaming view called **`customer_count_by_state_temp`** that counts the number of customers per **`state`** in a field called **`customer_count`**.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW customer_count_by_state_temp AS
SELECT
  state,
  count(customer_id) AS customer_count
FROM customers_raw_temp
GROUP BY state

In [0]:
assert Row(tableName="customer_count_by_state_temp", isTemporary=True) in spark.sql("show tables").select("tableName", "isTemporary").collect(), "Table not present or not temporary"
assert spark.table("customer_count_by_state_temp").dtypes == [('state', 'string'), ('customer_count', 'bigint')], "Incorrect Schema"
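Because **`customer_count_by_state_temp`** is defined over a streaming view, its counts are running totals that grow as new micro-batches arrive. The plain-Python sketch below imitates that behavior for intuition only. It is not how Spark implements stateful aggregation, and **`update_counts`** and the sample batches are invented for the example.

```python
from collections import Counter


def update_counts(running, batch):
    """Fold one micro-batch of (customer_id, state) rows into running counts."""
    for customer_id, state in batch:
        # count(customer_id) skips NULL ids, so a None id adds 0 to its group.
        running[state] += 1 if customer_id is not None else 0
    return running


running = Counter()
batch_1 = [("c1", "CA"), ("c2", "NY"), (None, "CA")]
batch_2 = [("c3", "CA"), ("c4", "TX")]

update_counts(running, batch_1)
after_first = dict(running)   # full result table after the first batch

update_counts(running, batch_2)
after_second = dict(running)  # full result table after the second batch
```

Each snapshot holds the whole result table rather than only the new rows, which mirrors what a streaming write in **`complete`** output mode rewrites on every trigger.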

## Write aggregated data to a Delta table

Stream data from the **`customer_count_by_state_temp`** view to a Delta table called **`customer_count_by_state`**.

In [0]:
customers_count_checkpoint_path = f"{DA.paths.checkpoints}/customers_count"

query = (spark.table("customer_count_by_state_temp")
              .writeStream
              .format("delta")
              .option("checkpointLocation", customers_count_checkpoint_path)
              .outputMode("complete")
              .table("customer_count_by_state"))

In [0]:
DA.block_until_stream_is_ready(query)
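**`DA.block_until_stream_is_ready`** is a helper supplied by the course setup script, and its implementation is not shown in this notebook. A rough idea of what such a helper might do is sketched below, assuming it polls the query until at least one progress report exists. **`wait_for_progress`** and its parameters are hypothetical; only **`recentProgress`** and **`isActive`** are attributes of a PySpark **`StreamingQuery`**.

```python
import time


def wait_for_progress(query, min_batches=1, timeout_s=60.0, poll_s=0.5):
    """Block until the query has reported at least `min_batches` progress entries.

    Hypothetical sketch; the course helper may work differently.
    """
    deadline = time.monotonic() + timeout_s
    while len(query.recentProgress) < min_batches:
        if not query.isActive:
            raise RuntimeError("stream stopped before reporting progress")
        if time.monotonic() >= deadline:
            raise TimeoutError("stream did not report progress in time")
        time.sleep(poll_s)
    return len(query.recentProgress)
```

A progress entry shows that a trigger has completed, not that any rows were written, so the assertions that follow are still needed.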

In [0]:
assert Row(tableName="customer_count_by_state", isTemporary=False) in spark.sql("show tables").select("tableName", "isTemporary").collect(), "Table not present or is temporary"
assert spark.table("customer_count_by_state").dtypes == [('state', 'string'), ('customer_count', 'bigint')], "Incorrect Schema"

## Query the results

Query the **`customer_count_by_state`** table (this will not be a streaming query). Plot the results as a bar graph, and then again using the map plot.

In [0]:
%sql
SELECT * FROM customer_count_by_state

## Wrapping Up

Run the following cell to remove the database and all data associated with this lab.

In [0]:
DA.cleanup()

By completing this lab, you should now feel comfortable:
* Using PySpark to configure Auto Loader for incremental data ingestion
* Using Spark SQL to aggregate streaming data
* Streaming data to a Delta table

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>