
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

## Using Delta

Moovio, the fitness tracker company, is migrating non-Delta workloads to Delta Lake. You have access to files holding current data, which you can experiment with while learning how to work with Delta Lake. Creating a Delta table is as easy as adding the `USING DELTA` clause to a `CREATE TABLE` statement. Get started by reading through the following cells and running the corresponding queries to create and modify Delta tables.

## Getting started
Run the cell below to set up your classroom environment.

In [0]:
%run "../Includes/Classroom-Setup"

## Create table

To start, we will create a table of the raw data we've been provided. We're using only a small sample of the available data, so this set is limited to 5 devices over the course of one month. The raw files are in the `.json` format.

In [0]:
%sql
DROP TABLE IF EXISTS health_tracker_data_2020_01;              

CREATE TABLE health_tracker_data_2020_01                        
USING json                                             
OPTIONS (
  path "dbfs:/mnt/training/healthcare/tracker/raw.json/health_tracker_data_2020_1.json",
  inferSchema "true"
  );

## Preview data

Before we do anything else, let's quickly inspect the data by viewing a sample.

In [0]:
%sql
SELECT * FROM health_tracker_data_2020_01 TABLESAMPLE (5 ROWS)

month,value
2020-01,"List(0, 101.3720506847, Deborah Powell, 1.5778368E9)"
2020-01,"List(0, 98.7361247811, Deborah Powell, 1.5778404E9)"
2020-01,"List(0, 99.9724260825, Deborah Powell, 1.577844E9)"
2020-01,"List(0, 99.8718710286, Deborah Powell, 1.5778476E9)"
2020-01,"List(0, 98.3444848503, Deborah Powell, 1.5778512E9)"


## Create Delta table

What we have so far is a Bronze-level table: we can display the raw data, but it is not easily queryable. Our next step is to create a cleaned Silver table, which may later feed several business-aggregate Gold-level tables. In this step, we'll focus on creating an easily queryable table that includes most or all of the data, with all columns accurately typed and object properties unpacked into individual columns.


Recall that a Delta table consists of three things: 

1. The Delta files (in object storage)
1. The Delta [Transaction Log](https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html) saved with the Delta files in object storage. 
1. The Delta table registered in the [Metastore](https://docs.databricks.com/data/metastores/index.html#metastores) 

Run the cell below to create a new table with `USING DELTA`. This step registers your table in the metastore, writes the data as Delta files, and creates the transaction log, which holds a record of every transaction performed on this table.

In [0]:
%sql
CREATE OR REPLACE TABLE health_tracker_silver 
USING DELTA
PARTITIONED BY (p_device_id)
LOCATION "/health_tracker/silver" AS (
SELECT
  value.name,
  value.heartrate,
  CAST(FROM_UNIXTIME(value.time) AS timestamp) AS time,
  CAST(FROM_UNIXTIME(value.time) AS DATE) AS dte,
  value.device_id p_device_id
FROM
  health_tracker_data_2020_01
)
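The `SELECT` above unpacks the nested `value` struct and types the time fields. As a rough illustration of that per-record transformation, here is a minimal pure-Python sketch. The dictionary layout and the `to_silver` helper are hypothetical; the sample values are taken from the first preview row, assuming the struct fields appear in the order `device_id`, `heartrate`, `name`, `time`.

```python
from datetime import datetime, timezone

# Hypothetical raw record shaped like one row of health_tracker_data_2020_01.
# Field names come from the SELECT; values come from the first preview row.
raw = {
    "month": "2020-01",
    "value": {
        "device_id": 0,
        "heartrate": 101.3720506847,
        "name": "Deborah Powell",
        "time": 1.5778368e9,  # seconds since the Unix epoch
    },
}


def to_silver(record):
    """Flatten the nested `value` struct and type the time fields."""
    value = record["value"]
    # Assumption: interpret the epoch in UTC. In Spark, FROM_UNIXTIME
    # follows the session time zone instead.
    ts = datetime.fromtimestamp(value["time"], tz=timezone.utc)
    return {
        "name": value["name"],
        "heartrate": value["heartrate"],
        "time": ts,
        "dte": ts.date(),
        "p_device_id": value["device_id"],
    }


row = to_silver(raw)
print(row["time"].isoformat(), row["dte"], row["p_device_id"])
```

Renaming `device_id` to `p_device_id` mirrors the SQL alias, which marks the column as the partition key.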


Great! You have created your first Delta table! Run the `DESCRIBE DETAIL` command to view table details. You can see that the table format is `delta` and that it is stored in the location you specified.

In [0]:
%sql
DESCRIBE DETAIL health_tracker_silver

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,7e3918c2-46d7-441b-b7f8-a01536e2b90b,default.health_tracker_silver,,dbfs:/health_tracker/silver,2020-12-09T22:51:17.111+0000,2020-12-09T22:51:30.000+0000,List(p_device_id),5,57098,Map(),1,2


## Read in new data

Recall that we created that table with just one month of data. Now let's see how we can add new data to that table. 

Run the cell below to read in the new raw file.

In [0]:
%sql
DROP TABLE IF EXISTS health_tracker_data_2020_02;              

CREATE TABLE health_tracker_data_2020_02                        
USING json                                             
OPTIONS (
  path "dbfs:/mnt/training/healthcare/tracker/raw.json/health_tracker_data_2020_2.json",
  inferSchema "true"
  );

## Append files

We can append the next month of records to the existing table using the `INSERT INTO` command. We will transform the new data to match the existing schema.

In [0]:
%sql
INSERT INTO
  health_tracker_silver
SELECT
  value.name,
  value.heartrate,
  CAST(FROM_UNIXTIME(value.time) AS timestamp) AS time,
  CAST(FROM_UNIXTIME(value.time) AS DATE) AS dte,
  value.device_id p_device_id
FROM
  health_tracker_data_2020_02

## Time Travel: Count records in the previous table

Let's count the records to verify that the append went as expected. First, we can write a query to show the count before we appended new records. Delta Lake can query an earlier version of a Delta table using a feature known as [time travel](https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel). 

We demonstrate querying the data as of version 0, the version written when the table was first created from the January data.

**`5 devices * 24 hours * 31 days`** **`=`** **`3720 records`**

In [0]:
%sql
SELECT COUNT(*) FROM health_tracker_silver VERSION AS OF 0

count(1)
3720
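To build intuition for why a version number is enough to recover an earlier state, the following is a toy sketch of a versioned table, not Delta Lake's actual implementation. The `ToyVersionedTable` class and its row data are hypothetical; real Delta commits reference data files and can also remove them, whereas this model only appends.

```python
class ToyVersionedTable:
    """Append-only table whose commit log doubles as its version history."""

    def __init__(self):
        self._commits = []  # index in this list == table version

    def append(self, rows):
        """Record a new commit and return its version number."""
        self._commits.append(list(rows))
        return len(self._commits) - 1

    def read(self, version=None):
        """Replay commits 0..version (latest if None) and return the rows."""
        if version is None:
            version = len(self._commits) - 1
        if not 0 <= version < len(self._commits):
            raise ValueError(f"unknown version {version}")
        return [row for commit in self._commits[: version + 1] for row in commit]


table = ToyVersionedTable()
v0 = table.append([("2020-01", 1), ("2020-01", 2), ("2020-01", 3)])  # initial load
v1 = table.append([("2020-02", 1), ("2020-02", 2)])                  # later append

print(len(table.read(version=0)), len(table.read()))
```

Reading version 0 ignores the later commit, which is the same idea as `VERSION AS OF 0` returning the pre-append count.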


## Count records in our current table

Now, let's count the records to see if our new data was appended as expected. Note that this data is from February 2020, which had 29 days because 2020 was a leap year. We are still working with 5 devices, with heartrate readings occurring once an hour.

**`5 devices * 24 hours * 29 days`**   **`+`**   **`3720`** **`=`** **`7200 records`**
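The expected counts can be double-checked with a few lines of Python. This is only a sanity-check sketch; the `expected_records` helper is hypothetical, and it assumes every device reports exactly one reading per hour for the whole month.

```python
import calendar

DEVICES = 5
READINGS_PER_DAY = 24  # one reading per hour


def expected_records(year, month):
    """Records expected for one full month of hourly readings."""
    days_in_month = calendar.monthrange(year, month)[1]
    return DEVICES * READINGS_PER_DAY * days_in_month


january = expected_records(2020, 1)
february = expected_records(2020, 2)  # 29 days in a leap year
print(january, february, january + february)
```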

In [0]:
%sql
SELECT COUNT(*) FROM health_tracker_silver 

count(1)
7128


## Find missing records by device

Let's see if we can identify which device(s) are missing records. 

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The absence of records from the last few days of the month illustrates a phenomenon that often occurs in production data pipelines: **late-arriving data**. This can create problems in some of the other data storage and management models we talked about. If our analytics run on stale or incomplete data, we may draw incorrect conclusions or make bad predictions. Delta Lake allows us to process data as it arrives and is prepared to handle late-arriving data.

In [0]:
%sql
SELECT p_device_id, COUNT(*) FROM health_tracker_silver GROUP BY p_device_id

p_device_id,count(1)
0,1440
1,1440
3,1440
2,1440
4,1368
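The per-device counts can be compared against the expected total to quantify the gap. The sketch below copies the counts from the output above and assumes one reading per hour across January and February 2020; the variable names are illustrative only.

```python
# Per-device counts copied from the query output above.
observed = {0: 1440, 1: 1440, 2: 1440, 3: 1440, 4: 1368}

READINGS_PER_DAY = 24
EXPECTED_PER_DEVICE = READINGS_PER_DAY * (31 + 29)  # January + February 2020

# Keep only the devices that fall short of the expected count.
shortfall = {
    device: EXPECTED_PER_DEVICE - count
    for device, count in observed.items()
    if count < EXPECTED_PER_DEVICE
}

for device, missing in shortfall.items():
    days = missing / READINGS_PER_DAY
    print(f"device {device}: {missing} records short ({days:g} days)")
```

Under these assumptions only device 4 falls short, by the equivalent of three full days of readings.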


## Plot Records
We can run a query and use visualization tools to find out which dates or times are missing. For this query, it helps to compare two devices, even though only one of them is missing data. Run the cell below to query the table, then click the chart icon to plot the data.

To set up your graph: 
* Click `Plot Options`
* Drag `dte` into the Keys dialog 
* Drag `p_device_id` into the Series Groupings dialog
* Drag `heartrate` into the Values dialog
* Choose `COUNT` as your Aggregation type (the dropdown in the lower left corner)
* Select "Bar Chart" as your display type.
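The chart makes gaps visible by eye; the same check can be done programmatically by comparing the dates that have readings against the full calendar range. This is a minimal sketch with hypothetical input: the `missing_dates` helper and the `observed` set are made up for illustration and are not read from the table.

```python
from datetime import date, timedelta


def missing_dates(observed, start, end):
    """Return the dates in [start, end] that have no readings, in order."""
    all_days = (start + timedelta(days=n) for n in range((end - start).days + 1))
    return [day for day in all_days if day not in observed]


# Hypothetical input: distinct `dte` values seen for one device in February 2020,
# with readings on the 1st through the 26th only.
observed = {date(2020, 2, d) for d in range(1, 27)}

gaps = missing_dates(observed, date(2020, 2, 1), date(2020, 2, 29))
print([d.isoformat() for d in gaps])
```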

In [0]:
%sql
SELECT * FROM health_tracker_silver WHERE p_device_id IN (3,4)

name,heartrate,time,dte,p_device_id
Minh Nguyen,54.5276763038,2020-01-01T00:00:00.000+0000,2020-01-01,3
Minh Nguyen,55.3566724529,2020-01-01T01:00:00.000+0000,2020-01-01,3
Minh Nguyen,55.1554433144,2020-01-01T02:00:00.000+0000,2020-01-01,3
Minh Nguyen,56.379849212,2020-01-01T03:00:00.000+0000,2020-01-01,3
Minh Nguyen,55.9843632946,2020-01-01T04:00:00.000+0000,2020-01-01,3
Minh Nguyen,55.1160133688,2020-01-01T05:00:00.000+0000,2020-01-01,3
Minh Nguyen,56.552175579,2020-01-01T06:00:00.000+0000,2020-01-01,3
Minh Nguyen,55.0979882698,2020-01-01T07:00:00.000+0000,2020-01-01,3
Minh Nguyen,91.8804648283,2020-01-01T08:00:00.000+0000,2020-01-01,3
Minh Nguyen,92.9697436381,2020-01-01T09:00:00.000+0000,2020-01-01,3



## Find Broken Readings 

It's always useful to check for errant readings. Think about this scenario: is there any reading that would seem impossible?

Since this is heart rate data, we should expect that everyone who is using the tracker has a heartbeat. Let's check the data to see if we've got any values that might point to a faulty reading.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We use a temporary view so that we can access this data again later.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW broken_readings
AS (
  SELECT COUNT(*) as broken_readings_count, dte 
  FROM health_tracker_silver
  WHERE heartrate < 0
  GROUP BY dte
  ORDER BY dte
)
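The view's logic (filter on negative heart rates, then count per day) can be sketched in plain Python to make the grouping explicit. The `readings` list below is hypothetical sample data, not rows from `health_tracker_silver`.

```python
from collections import Counter
from datetime import date

# Hypothetical (dte, heartrate) pairs; real rows would come from health_tracker_silver.
readings = [
    (date(2020, 1, 1), 54.5),
    (date(2020, 1, 1), -12.3),
    (date(2020, 1, 2), -4.0),
    (date(2020, 1, 2), 88.1),
    (date(2020, 1, 2), -7.9),
]

# Equivalent of: WHERE heartrate < 0 ... GROUP BY dte
broken_by_day = Counter(dte for dte, heartrate in readings if heartrate < 0)

# Equivalent of: ORDER BY dte
for dte in sorted(broken_by_day):
    print(dte, broken_by_day[dte])
```

A negative heart rate is physically impossible, so it is a convenient marker for faulty readings; a zero or implausibly high value would need a different predicate.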

## View broken readings

Run the cell and then create a visualization that will help us get a sense of how many broken readings exist and how they are spread across the data.

To visualize this view: 
* Run the cell
* Click the chart icon
* Choose `dte` as the Key and `broken_readings_count` as Values.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You should notice that most days have at least one broken reading and that some days have more than one.

In [0]:
%sql
SELECT * FROM broken_readings;

broken_readings_count,dte
1,2020-01-01
3,2020-01-02
3,2020-01-05
3,2020-01-06
1,2020-01-08
1,2020-01-12
1,2020-01-15
2,2020-01-16
3,2020-01-17
4,2020-01-18


## Clean-up
Run the next cell to clean up your classroom environment.

In [0]:
%run ../Includes/Classroom-Cleanup

Great job! You're officially working with Delta Lake! In the next reading, you'll continue your work to repair the broken data and missing values we discovered here.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>