d-sandbox

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Lab 4 - Delta Lab
## Module 8 Assignment
In this lab, you will continue your work on behalf of Moovio, the fitness tracker company. You will be working with a new set of files that you must move into a "gold-level" table. You will need to modify and repair records, create new columns, and merge late-arriving data.

In [0]:
%run ../Includes/Classroom-Setup

## Exercise 1: Create a table

**Summary:** Create a table from `json` files. 

Use this path to access the data: <br>
`"dbfs:/mnt/training/healthcare/tracker/raw.json/"`

Steps to complete: 
* Create a table named `health_tracker_data_2020`
* Use optional fields to indicate the path you're reading from and epress that the schema should be inferred.

In [0]:
%sql
DROP TABLE IF EXISTS health_tracker_data_2020;

CREATE TABLE health_tracker_data_2020
USING json
OPTIONS (
  path "dbfs:/mnt/training/healthcare/tracker/raw.json/",
  inferSchema "true");

## Exercise 2: Preview the data

**Summary:**  View a sample of the data in the table. 

Steps to complete: 
* Query the table with `SELECT *` to see all columns
* Sample 5 rows from the table

In [0]:
%sql
SELECT * FROM health_tracker_data_2020
TABLESAMPLE (5 ROWS)

month,value
2020-05,"List(0, 54.7922842229, Deborah Powell, 1.5882912E9)"
2020-05,"List(0, 56.1916535912, Deborah Powell, 1.5882948E9)"
2020-05,"List(0, 56.491746118, Deborah Powell, 1.5882984E9)"
2020-05,"List(0, 55.9563823115, Deborah Powell, 1.588302E9)"
2020-05,"List(0, 56.1483078922, Deborah Powell, 1.5883056E9)"


## Exercise 3: Count Records
**Summary:** Write a query to find the total number of records

Steps to complete: 
* Count the number of records in the table

**Answer the corresponding question in Coursera**

In [0]:
%sql
SELECT COUNT(*) FROM health_tracker_data_2020

count(1)
18168


## Exercise 4: Create a Silver Delta table
**Summary:** Create a Delta table that transforms and restructures your table

Steps to complete: 
* Drop the existing `month` column
* Isolate each property of the object in the `value` column to its own column
* Cast time as timestamp **and** as a date
* Partition by `device_id`
* Use Delta to write the table

In [0]:
%sql
CREATE OR REPLACE TABLE health_tracker_silver
USING DELTA
PARTITIONED BY (device_id)
LOCATION "dbfs:/health_tracker/silver" AS (
SELECT
  value.name,
  value.heartrate,
  CAST(FROM_UNIXTIME(value.time) AS timestamp) AS time,
  CAST(FROM_UNIXTIME(VALUE.TIME) AS date) AS dte,  
  value.device_id

FROM health_tracker_data_2020
)

## Exercise 5: Register table to the metastore
**Summary:** Register your Silver table to the Metastore
Steps to complete: 
* Be sure you can run the cell more than once without throwing an error
* Write to the location: `/health_tracker/silver`

In [0]:
%sql
DROP TABLE IF EXISTS health_tracker_silver;
CREATE TABLE health_tracker_silver
USING DELTA
LOCATION "/health_tracker/silver"

## Exercise 6: Check the number of records
**Summary:** Check to see if all devices are reporting the same number of records

Steps to complete: 
* Write a query that counts the number of records for each device
* Include your partitioned device id column and the count of those records

**Answer the corresponding question in Coursera**

In [0]:
%sql
SELECT COUNT(*), device_id 
FROM health_tracker_silver
GROUP BY device_id

count(1),device_id
3648,0
3648,1
3648,3
3648,2
3576,4


## Exercise 7: Plot records
**Summary:** Attempt to visually assess which dates may be missing records

Steps to complete: 
* Write a query that will return records from one devices that is **not** missing records as well as the device that seems to be missing records
* Plot the results to visually inspect the data
* Identify dates that are missing records

**Answer the corresponding question in Coursera**

In [0]:
%sql
SELECT * FROM health_tracker_silver 
WHERE device_id IN (1,4) 

name,heartrate,time,dte,device_id
Kristin Vasser,67.0409201203,2020-05-01T00:00:00.000+0000,2020-05-01,1
Kristin Vasser,65.7129616583,2020-05-01T01:00:00.000+0000,2020-05-01,1
Kristin Vasser,66.4715664581,2020-05-01T02:00:00.000+0000,2020-05-01,1
Kristin Vasser,66.4433111984,2020-05-01T03:00:00.000+0000,2020-05-01,1
Kristin Vasser,66.5503786953,2020-05-01T04:00:00.000+0000,2020-05-01,1
Kristin Vasser,65.8812904671,2020-05-01T05:00:00.000+0000,2020-05-01,1
Kristin Vasser,111.9467759173,2020-05-01T06:00:00.000+0000,2020-05-01,1
Kristin Vasser,110.0762725209,2020-05-01T07:00:00.000+0000,2020-05-01,1
Kristin Vasser,109.8338818032,2020-05-01T08:00:00.000+0000,2020-05-01,1
Kristin Vasser,111.1323803749,2020-05-01T09:00:00.000+0000,2020-05-01,1


## Exercise 8: Check for Broken Readings
**Summary:** Check to see if your data contains records that would indicate a device has misreported data
Steps to complete: 
* Create a view that contains all records reporting a negative heartrate
* Plot/view that data to see which days include broken readings

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW broken_readings
AS (
  SELECT COUNT(*) AS count_broken_readings, dte
  FROM health_tracker_silver
  WHERE heartrate < 0
  GROUP BY dte
  ORDER BY dte);
  
SELECT * FROM broken_readings

count_broken_readings,dte
1,2020-01-01
3,2020-01-02
3,2020-01-05
3,2020-01-06
1,2020-01-08
1,2020-01-12
1,2020-01-15
2,2020-01-16
3,2020-01-17
4,2020-01-18


## Exercise 9: Repair records
**Summary:** Create a view that contains interpolated values for broken readings

Steps to complete: 
* Create a temporary view that will hold all the records you want to update. 
* Transform the data such that all broken readings (where heartrate is reported as less than zero) are interpolated as the mean of the the data points immediately surrounding the broken reading. 
* After you write the view, count the number of records in it. 

**Answer the corresponding question in Coursera**

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW updates
AS (
  SELECT  
  name,
  (prev + next)/2 AS heartrate,
  time, 
  dte,
  device_id
  FROM (
    SELECT *,
      LAG(heartrate) OVER (PARTITION BY device_id, dte ORDER BY device_id, dte) AS prev,
      LEAD(heartrate) OVER (PARTITION BY device_id, dte ORDER BY device_id, dte) AS next
    FROM health_tracker_silver
    )
WHERE heartrate < 0
);

SELECT COUNT(*) FROM updates

count(1)
182


## Exercise 10: Read late-arriving data
**Summary:** Read in new late-arriving data

Steps to complete: 
* Create a new table that contains the late arriving data at this path: `"dbfs:/mnt/training/healthcare/tracker/raw-late.json"`
* Count the records <br/>

**Answer the corresponding question in Coursera**

In [0]:
%sql
DROP TABLE IF EXISTS health_tracker_late;
CREATE TABLE health_tracker_late
USING json
OPTIONS ( path "dbfs:/mnt/training/healthcare/tracker/raw-late.json",
inferSchema "true");

SELECT COUNT(*) FROM health_tracker_late

count(1)
72


## Exercise 11: Prepare inserts
**Summary:** Prepare your new, late-arriving data for insertion into the Silver table

Steps to complete: 
* Create a temporary view that holds the new late-arriving data
* Apply transformations to the data so that the schema matches our existing Silver table

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW inserts
AS (
  SELECT
    value.name,
    value.heartrate,
    CAST(FROM_UNIXTIME(value.time) AS timestamp) AS time,
    CAST(FROM_UNIXTIME(value.time) AS DATE) AS dte,
    value.device_id
  FROM health_tracker_late
  )

## Exercise 12: Prepare upserts
**Summary:** Prepare a view to upsert to our Silver table

Steps to complete: 
* Create a temporary view that is the `UNION` of the views that hold data you want to insert and data you want to update
* Count the records

**Answer the corresponding question in Coursera**

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW upserts
AS (
    SELECT * FROM updates
    UNION ALL
    SELECT * FROM inserts
);


SELECT COUNT(*) FROM upserts

count(1)
254


## Exercise 13: Perform upserts

**Summary:** Merge the upserts into your Silver table

Steps to complete: 
* Merge data on the time and device id columns from your Silver table and your upserts table
* Use `MATCH`conditions to decide whether to apply an update or an insert

In [0]:
%sql
MERGE INTO health_tracker_silver
USING upserts
ON health_tracker_silver.time = upserts.time 
AND health_tracker_silver.device_id = upserts.device_id
WHEN MATCHED THEN
  UPDATE SET
  health_tracker_silver.heartrate = upserts.heartrate
WHEN NOT MATCHED THEN
  INSERT (name, heartrate, time, dte, device_id)
  VALUES (name, heartrate, time, dte, device_id)

## Exercise 14: Write to gold
**Summary:** Create a Gold level table that holds aggregated data

Steps to complete: 
* Create a Gold-level Delta table
* Aggregate heartrate to display the average and standard deviation for each device. 
* Count the number of records

In [0]:
%sql
DROP TABLE IF EXISTS health_tracker_gold;

CREATE TABLE health_tracker_gold
USING DELTA
LOCATION "/health_tracker/gold"
AS SELECT
  device_id,
  AVG(heartrate) AS avgHealthrate,
  STD(heartrate) AS stdHeartrate
FROM health_tracker_silver
GROUP BY device_id;

SELECT * FROM health_tracker_gold 

device_id,avgHealthrate,stdHeartrate
1,82.52207869094194,23.375608427849777
0,85.08801733820111,27.41188410264521
3,84.19046229191886,24.61892380223707
4,84.5435245360994,25.57932106284926
2,82.7775300853638,25.54242733866919


## Cleanup
Run the following cell to clean up your workspace.

In [0]:
%sql
-- %run .Includes/Classroom-Cleanup


-sandbox
&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>