
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Lab 3 - Sharing Insights
## Module 6 Assignment

In this lab, we will explore a small mock data set from a group of data centers. You'll see that it is similar to the data you have been working with, but it contains a few new columns and is structured slightly differently to test your skills with hierarchical data manipulation.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this assignment you will: <br/>

* Apply higher-order functions to array data
* Apply advanced aggregation and summary techniques to process data
* Present data in an interactive dashboard or static file 

As you work through the following tasks, you will be prompted to enter selected answers in Coursera. Find the quiz associated with this lab to enter your answers. 

Run the cell below to prepare this workspace for the lab.

In [0]:
%run ../Includes/Classroom-Setup

### Exercise 1: Create a table

**Summary:** Create a table. 

Use this path to access the data: `/mnt/training/iot-devices/data-centers/energy.json`

Steps to complete: 
* Write a `CREATE TABLE` statement for the data located at the endpoint listed above
* Use json as the file format

In [0]:
%sql
DROP TABLE IF EXISTS energy;
CREATE TABLE energy
USING json
OPTIONS (
PATH "/mnt/training/iot-devices/data-centers/energy.json"
)
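
To get a feel for the hierarchical shape of this data before querying it, here is a minimal plain-Python sketch (not Spark code). The record is a hypothetical JSON object rebuilt from the first row shown in the Exercise 2 output; the actual layout of the file on disk is an assumption.

```python
import json

# Hypothetical record rebuilt from the first sampled row in Exercise 2;
# the real file's exact layout is an assumption.
line = (
    '{"battery_level": [3, 3, 2], "co2_level": [1343, 1595, 1405], '
    '"device_id": 0, "device_type": "sensor-istick", '
    '"signal": [24, 24, 25], "temps": [22, 23, 21, 23], '
    '"timestamp": "2019/08/02 15:00:00"}'
)

record = json.loads(line)

# Array-valued fields become Python lists; scalar fields stay scalars.
array_columns = sorted(k for k, v in record.items() if isinstance(v, list))
print(array_columns)

# The timestamp is still a plain string at this point.
print(type(record["timestamp"]).__name__)
```

Four of the seven fields hold arrays, which is why the later exercises rely on higher-order functions rather than ordinary column expressions.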


### Exercise 2: Sample the table

**Summary:** Sample the table to get a closer look at a few rows

Steps to complete: 
* Write a query that allows you to see a few rows of the data

In [0]:
%sql
SELECT * FROM energy
TABLESAMPLE (3 ROWS) 

battery_level,co2_level,device_id,device_type,signal,temps,timestamp
"List(3, 3, 2)","List(1343, 1595, 1405)",0,sensor-istick,"List(24, 24, 25)","List(22, 23, 21, 23)",2019/08/02 15:00:00
"List(1, 1, 2)","List(1213, 1346, 1247)",1,sensor-inest,"List(22, 24, 24)","List(22, 37, 34, 39)",2019/07/01 03:00:00
"List(3, 1, 3)","List(1261, 1216, 1258)",2,sensor-ipad,"List(25, 24, 28)","List(32, 33, 40, 44)",2019/06/03 11:00:00


### Exercise 3: Create view

**Summary:** Create a temporary view that displays the timestamp column as a timestamp. 

Steps to complete: 
* Create a temporary view named `DCDevices`
* Convert the `timestamp` column to a timestamp type. Refer to the [Datetime patterns](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html#) documentation for the formatting information. 
* (Optional) Rename columns to use camelCase

In [0]:
%sql
CREATE 
OR REPLACE TEMPORARY VIEW DCDevices AS
SELECT
  battery_level AS batteryLevel,
  co2_level AS co2Level,
  device_id AS deviceId,
  device_type AS deviceType,
  signal,
  temps,
  CAST(unix_timestamp(timestamp, "yyyy/MM/dd HH:mm:ss") AS timestamp) timeStamp
FROM energy;

SELECT * FROM DCDevices LIMIT 3

batteryLevel,co2Level,deviceId,deviceType,signal,temps,timeStamp
"List(3, 3, 2)","List(1343, 1595, 1405)",0,sensor-istick,"List(24, 24, 25)","List(22, 23, 21, 23)",2019-08-02T15:00:00.000+0000
"List(1, 1, 2)","List(1213, 1346, 1247)",1,sensor-inest,"List(22, 24, 24)","List(22, 37, 34, 39)",2019-07-01T03:00:00.000+0000
"List(3, 1, 3)","List(1261, 1216, 1258)",2,sensor-ipad,"List(25, 24, 28)","List(32, 33, 40, 44)",2019-06-03T11:00:00.000+0000


The output below confirms that all columns now use camelCase names and that the `timeStamp` column has the timestamp type.

In [0]:
%sql
DESCRIBE DCDevices

col_name,data_type,comment
batteryLevel,array,
co2Level,array,
deviceId,bigint,
deviceType,string,
signal,array,
temps,array,
timeStamp,timestamp,
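
The string-to-timestamp conversion can be illustrated outside Spark as well. The following is a minimal plain-Python sketch, assuming the same `year/month/day hour:minute:second` layout seen in the sample rows; the helper name `parse_device_time` is hypothetical.

```python
from datetime import datetime


def parse_device_time(raw: str) -> datetime:
    """Parse a reading time such as '2019/08/02 15:00:00'."""
    # %H parses a 24-hour clock value (00-23).
    return datetime.strptime(raw, "%Y/%m/%d %H:%M:%S")


parsed = parse_device_time("2019/08/02 15:00:00")
print(parsed.isoformat())  # 2019-08-02T15:00:00
```

Note that Python's `strptime` directives differ from Spark's datetime pattern letters, so the format string is not interchangeable between the two.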


### Exercise 4: Flag records with defective batteries

**Summary:** When a battery is malfunctioning, it can report negative battery levels. Create a new boolean column `needService` that shows whether a device needs service.  

Steps to complete: 
* Write a query that shows which devices have malfunctioning batteries
* Include columns `batteryLevel`, `deviceId`, and `needService`
* Order the results by `deviceId`, and then `batteryLevel`
* **Answer the corresponding question in Coursera**
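
The `needService` flag asks whether *any* element of an array satisfies a condition. As a rough illustration of that idea, here is a plain-Python sketch (not Spark code, and ignoring SQL `NULL` handling). The helper name `needs_service` and the sample readings are made up for this example.

```python
def needs_service(battery_levels):
    """Return True if at least one reading is negative."""
    return any(level < 0 for level in battery_levels)


# Illustrative readings: one device with a negative level, two healthy ones.
readings = [[-4, -1, 0], [3, 3, 2], [1, 1, 2]]
flags = [needs_service(r) for r in readings]
print(flags)  # [True, False, False]
```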

In [0]:
%sql
SELECT 
  batteryLevel,
  deviceId,
  EXISTS(batteryLevel, bl -> bl < 0 ) needService
FROM DCDevices
ORDER BY deviceId, batteryLevel
LIMIT 3

batteryLevel,deviceId,needService
"List(-4, -1, -1)",0,True
"List(-4, -1, -1)",0,True
"List(-4, -1, 0)",0,True


### Exercise 5: Display high CO<sub>2</sub> levels

**Summary:** Create a new column to display only CO<sub>2</sub> levels that exceed 1400 ppm. 

Steps to complete: 
* Include columns `deviceId`, `deviceType`, `highCO2`, `time`
* The column `highCO2` should contain an array of CO<sub>2</sub> readings over 1400
* Show only records that contain `highCO2` values
* Order by `deviceId`, and then `highCO2`

**Answer the corresponding question in Coursera**

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You may need to use a subquery to write this in a single query statement.

In [0]:
%sql
SELECT 
  deviceId, 
  deviceType,
  highCO2,
  time
FROM
  (SELECT 
    deviceId, 
    deviceType, 
    FILTER(co2Level, co2 -> co2 > 1400) highCO2, 
    timeStamp AS time
   FROM DCDevices)
WHERE SIZE(highCO2) > 0
ORDER BY deviceId, highCO2
LIMIT 10

deviceId,deviceType,highCO2,time
0,sensor-ipad,List(1401),2019-06-29T10:00:00.000+0000
0,sensor-igauge,List(1401),2019-06-21T13:00:00.000+0000
0,sensor-istick,List(1401),2019-08-16T21:00:00.000+0000
0,sensor-ipad,List(1401),2019-06-01T22:00:00.000+0000
0,sensor-istick,List(1401),2019-07-01T16:00:00.000+0000
0,sensor-ipad,List(1401),2019-08-12T12:00:00.000+0000
0,sensor-istick,List(1401),2019-08-02T20:00:00.000+0000
0,sensor-istick,"List(1401, 1443)",2019-08-15T21:00:00.000+0000
0,sensor-ipad,"List(1401, 1539, 1698)",2019-09-09T19:00:00.000+0000
0,sensor-igauge,List(1402),2019-08-08T21:00:00.000+0000
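
The filter-then-keep-non-empty logic can be sketched in plain Python (not Spark code). The rows reuse the CO<sub>2</sub> arrays from the three sample rows in Exercise 2; the variable names are chosen for this example only.

```python
THRESHOLD = 1400  # ppm, as stated in the exercise

rows = [
    {"deviceId": 0, "co2Level": [1343, 1595, 1405]},
    {"deviceId": 1, "co2Level": [1213, 1346, 1247]},
    {"deviceId": 2, "co2Level": [1261, 1216, 1258]},
]

high = []
for row in rows:
    # Keep only the readings above the threshold (the role FILTER plays).
    high_co2 = [c for c in row["co2Level"] if c > THRESHOLD]
    # Drop records with no high readings (the role of SIZE(...) > 0).
    if len(high_co2) > 0:
        high.append({"deviceId": row["deviceId"], "highCO2": high_co2})

high.sort(key=lambda r: (r["deviceId"], r["highCO2"]))
print(high)  # [{'deviceId': 0, 'highCO2': [1595, 1405]}]
```

Python compares lists element by element, so `[1401]` sorts before `[1401, 1443]`, which sorts before `[1402]`. The row order in the output above is consistent with the same kind of comparison.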


### Exercise 6: Create a partitioned table

**Summary:** Create a new table partitioned by `deviceId`

Steps to complete: 
* Include all columns
* Create the table using Parquet
* Rename the partitioned column `p_deviceId`
* Run a `SELECT *`  to view your table. 

**Answer the corresponding question in Coursera**

In [0]:
%sql
DROP TABLE IF EXISTS dc;
CREATE TABLE dc
USING PARQUET                          
PARTITIONED BY (p_deviceId) 
AS 
  SELECT 
  batteryLevel,
  co2Level,
  deviceId AS p_deviceId,
  deviceType,
  signal,
  TRANSFORM (temps, t -> CAST(t AS int)) temps,
  timeStamp
  FROM DCDevices;
  
SELECT * FROM dc
LIMIT 3

batteryLevel,co2Level,deviceType,signal,temps,timeStamp,p_deviceId
"List(1, 1, 1)","List(1446, 1421, 1420)",sensor-igauge,"List(14, 14, 14)","List(33, 26, 27, 25)",2019-07-29T18:00:00.000+0000,0
"List(6, 7, 6)","List(1173, 1202, 1402)",sensor-istick,"List(27, 26, 26)","List(28, 30, 35, 34)",2019-06-10T03:00:00.000+0000,0
"List(2, 0, 2)","List(1189, 1121, 1071)",sensor-igauge,"List(15, 14, 14)","List(20, 24, 25, 22)",2019-07-30T02:00:00.000+0000,0


In [0]:
%sql
SHOW PARTITIONS dc

partition
p_deviceId=0
p_deviceId=1
p_deviceId=10
p_deviceId=11
p_deviceId=12
p_deviceId=13
p_deviceId=14
p_deviceId=15
p_deviceId=16
p_deviceId=17
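
The partition listing above jumps from `p_deviceId=1` to `p_deviceId=10`, which is what sorting the partition names as text would produce. The plain-Python sketch below reproduces that ordering. It assumes 20 device ids (0 to 19), which is a guess; the real number of devices is not shown above.

```python
# Assumption: device ids run from 0 to 19; the actual count is not shown above.
names = [f"p_deviceId={i}" for i in range(20)]

# Sorting as text puts "10" right after "1".
first_ten = sorted(names)[:10]
print(first_ten)

# Sorting on the numeric part gives the order a reader might expect instead.
numeric_first = sorted(names, key=lambda n: int(n.split("=")[1]))[:3]
print(numeric_first)  # ['p_deviceId=0', 'p_deviceId=1', 'p_deviceId=2']
```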


### Exercise 7: Visualize average temperatures

In [0]:
%sql
SELECT
  REDUCE(temps, 0, (acc, t) -> acc + t, acc -> (acc div size(temps))) AS avgT
FROM dc;

avgT
27
31
22
27
40
28
24
26
20
26
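
The sum-then-divide pattern can be sketched in plain Python (not Spark code): fold the array into a running total, then apply a finishing step that divides by the array length. The helper name `avg_temp` is hypothetical, and the inputs are the `temps` arrays from the three sample rows in Exercise 2.

```python
from functools import reduce


def avg_temp(temps):
    """Integer average of a non-empty list of temperatures."""
    # Merge step: add each reading to the running total, starting from 0.
    total = reduce(lambda acc, t: acc + t, temps, 0)
    # Finish step: integer division by the number of readings.
    # For non-negative totals, Python's // matches truncating division.
    return total // len(temps)


samples = [[22, 23, 21, 23], [22, 37, 34, 39], [32, 33, 40, 44]]
averages = [avg_temp(t) for t in samples]
print(averages)  # [22, 33, 37]
```

This sketch raises `ZeroDivisionError` for an empty list, so it assumes every record carries at least one reading.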


### Exercise 8: Create a widget

Here we create a dropdown widget for selecting a `deviceId`.

In [0]:
%sql
CREATE WIDGET DROPDOWN choseId DEFAULT  "0" CHOICES 
SELECT DISTINCT p_deviceId
FROM dc

### Exercise 9: Use the widget in a query

Let's see which device types exist for the device id selected in the widget.

In [0]:
%sql
SELECT 
  p_deviceId,
  deviceType
FROM dc
  WHERE p_deviceId = getArgument("choseId")
GROUP BY p_deviceId, deviceType


p_deviceId,deviceType
5,sensor-igauge
5,sensor-istick
5,sensor-ipad
5,sensor-inest
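
Grouping by every selected column with no aggregate simply removes duplicate rows. The plain-Python sketch below (not Spark code) shows the same filter-and-deduplicate idea. The rows are made-up examples, and treating the widget value as a string is an assumption about how the selection is passed to the query.

```python
# Made-up (deviceId, deviceType) pairs, for illustration only.
rows = [
    (5, "sensor-igauge"),
    (5, "sensor-istick"),
    (5, "sensor-igauge"),
    (3, "sensor-ipad"),
    (5, "sensor-ipad"),
]

chosen_id = "5"  # assumption: the widget selection arrives as a string

# Filter on the chosen id, then collapse duplicates with a set.
matches = sorted({(d, t) for d, t in rows if d == int(chosen_id)})
print(matches)
# [(5, 'sensor-igauge'), (5, 'sensor-ipad'), (5, 'sensor-istick')]
```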


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>