# IoT Devices

Define a Scala case class that will map to a Scala Dataset: _DeviceIoTData_

In [None]:
case class DeviceIoTData(battery_level: Long, c02_level: Long,
    cca2: String, cca3: String, cn: String, device_id: Long,
    device_name: String, humidity: Long, ip: String, latitude: Double,
    lcd: String, longitude: Double, scale: String, temp: Long, timestamp: Long)

Define a Scala case class that will map to a Scala Dataset: _DeviceTempByCountry_

In [None]:
case class DeviceTempByCountry(temp: Long, device_name: String, device_id: Long, cca3: String)

Read the JSON file with device information

1. The DataFrameReader returns a DataFrame, which as[DeviceIoTData] converts to a Dataset[DeviceIoTData]
2. ds is a Dataset whose elements map to the Scala case class _DeviceIoTData_

In [None]:
val ds = spark.read.json("iot_devices.json").as[DeviceIoTData]
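Schema inference makes Spark scan the JSON to guess column types. As a minimal sketch of an alternative, the schema can be derived from the case class with `Encoders.product` and passed to the reader. The temporary file and its single record are hypothetical sample values, not taken from iot_devices.json.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{Encoders, SparkSession}

// Same fields as the DeviceIoTData case class defined earlier.
case class DeviceIoTData(battery_level: Long, c02_level: Long,
    cca2: String, cca3: String, cn: String, device_id: Long,
    device_name: String, humidity: Long, ip: String, latitude: Double,
    lcd: String, longitude: Double, scale: String, temp: Long, timestamp: Long)

// In the notebook the existing `spark` session can be used instead of building one.
val spark = SparkSession.builder().master("local[*]").appName("iot-schema-sketch").getOrCreate()
import spark.implicits._

// Derive the StructType from the case class rather than inferring it from the data.
val schema = Encoders.product[DeviceIoTData].schema

// Hypothetical one-record sample written to a temporary file (assumption for illustration).
val sample = Files.createTempFile("iot_sample", ".json")
val record = """{"device_id":1,"device_name":"sample-device","ip":"192.0.2.1","cca2":"US","cca3":"USA","cn":"United States","latitude":38.0,"longitude":-97.0,"scale":"Celsius","temp":34,"humidity":51,"battery_level":8,"c02_level":868,"lcd":"green","timestamp":1458444054093}"""
Files.write(sample, record.getBytes(StandardCharsets.UTF_8))

// Read with the explicit schema, then convert to a typed Dataset.
val typedDs = spark.read.schema(schema).json(sample.toUri.toString).as[DeviceIoTData]
typedDs.printSchema()
```

With an explicit schema, a field that is missing from a record is read as null, which fails when a primitive field such as `temp: Long` is decoded, so this approach suits data known to be complete.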

The schema maps to each field and type in the Scala case class

In [None]:
ds.printSchema

In [None]:
ds.show(5, false)

Use the Dataset API to filter on temperature and humidity. With a typed lambda, fields are accessed with object.field syntax on the JVM object, similar to accessing JavaBean fields. This syntax is not only readable but also type-safe at compile time.

For example, comparing d.temp > "30" inside the lambda produces a compile error.

In [None]:
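// Typed lambda alternative, checked at compile time:
// val filterTempDS = ds.filter(d => d.temp > 30 && d.humidity > 70)
// Same filter written with untyped Column expressions:
val filterTempDS = ds.filter($"temp" > 30 && $"humidity" > 70)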

In [None]:
filterTempDS.show(5, false)
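The typed lambda form of this filter can be sketched on a small in-memory Dataset. The `Reading` case class and its three sample rows are hypothetical and exist only to keep the example self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, reduced record type used only for this sketch.
case class Reading(device_name: String, temp: Long, humidity: Long)

// In the notebook the existing `spark` session can be used instead of building one.
val spark = SparkSession.builder().master("local[*]").appName("typed-filter-sketch").getOrCreate()
import spark.implicits._

val readings = Seq(
  Reading("sensor-a", 34L, 80L),
  Reading("sensor-b", 31L, 60L),
  Reading("sensor-c", 22L, 90L)
).toDS()

// Typed filter: d is a Reading, so field names and types are checked by the compiler.
val hotAndHumid = readings.filter(d => d.temp > 30 && d.humidity > 70)

// The next line would not compile, because a Long cannot be compared with a String:
// readings.filter(d => d.temp > "30")

hotAndHumid.show(false)
```

The Column-expression form, by contrast, is only checked when Spark analyzes the query, so a misspelled column name surfaces at run time rather than at compile time.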

Use a more complicated query on the original Dataset[DeviceIoTData]. When a typed query maps each object to a tuple with a lambda expression, Spark names the resulting columns _1, _2, and so on, because it has no field names for them. Those columns can then be renamed and cast to the case class DeviceTempByCountry. The cell below filters with a Column expression instead, so the original column names survive and the withColumnRenamed calls have no effect. The final select() returns a DataFrame.

In [23]:
val dsTemp = ds
  .filter($"temp" > 25)
  .withColumnRenamed("_1", "temp")
  .withColumnRenamed("_2", "device_name")
  .withColumnRenamed("_3", "device_id")
  .withColumnRenamed("_4", "cca3").as[DeviceTempByCountry]
  .select($"temp", $"device_name", $"device_id", $"cca3")

dsTemp.show(10)

+----+--------------------+---------+----+
|temp|         device_name|device_id|cca3|
+----+--------------------+---------+----+
|  34|meter-gauge-1xbYRYcj|        1| USA|
|  28|   sensor-pad-4mzWkz|        4| USA|
|  27|sensor-pad-6al7RT...|        6| USA|
|  27|sensor-pad-8xUD6p...|        8| JPN|
|  26|sensor-pad-10Bsyw...|       10| USA|
|  31|meter-gauge-17zb8...|       17| USA|
|  31|sensor-pad-18XULN9Xv|       18| CHN|
|  29|meter-gauge-19eg1...|       19| USA|
|  30|  device-mac-21sjz5h|       21| AUT|
|  28|sensor-pad-24Pytz...|       24| CAN|
+----+--------------------+---------+----+
only showing top 10 rows



dsTemp: org.apache.spark.sql.DataFrame = [temp: bigint, device_name: string ... 2 more fields]
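As a sketch of the lambda-based variant, mapping each object to a tuple produces the `_1` to `_4` column names, which can then be renamed and cast to a case class. `SampleDevice`, `SampleTempByCountry`, and the sample rows are hypothetical stand-ins for DeviceIoTData and DeviceTempByCountry.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical reduced input type and target type, used only for this sketch.
case class SampleDevice(temp: Long, device_name: String, device_id: Long, cca3: String, humidity: Long)
case class SampleTempByCountry(temp: Long, device_name: String, device_id: Long, cca3: String)

// In the notebook the existing `spark` session can be used instead of building one.
val spark = SparkSession.builder().master("local[*]").appName("typed-map-sketch").getOrCreate()
import spark.implicits._

val devices = Seq(
  SampleDevice(34L, "device-a", 1L, "USA", 51L),
  SampleDevice(20L, "device-b", 2L, "JPN", 70L),
  SampleDevice(28L, "device-c", 3L, "CAN", 32L)
).toDS()

// Typed filter and map: the result is a Dataset of tuples with columns _1, _2, _3, _4.
val tuples = devices
  .filter(d => d.temp > 25)
  .map(d => (d.temp, d.device_name, d.device_id, d.cca3))

// Rename the tuple columns, then cast to the target case class.
val typed: Dataset[SampleTempByCountry] = tuples
  .toDF("temp", "device_name", "device_id", "cca3")
  .as[SampleTempByCountry]

typed.show(false)
```

Because the chain ends with `as[...]` rather than `select(...)`, the result stays a typed Dataset, and `typed.first()` returns a `SampleTempByCountry` object instead of a generic Row.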


This query returns a Dataset[Row]: there is no corresponding case class to convert to, so generic Row objects are returned.

In [25]:
ds.select($"temp", $"device_name", $"device_id", $"humidity", $"cca3", $"cn").where("temp > 25").show(5, false)

+----+---------------------+---------+--------+----+-------------+
|temp|device_name          |device_id|humidity|cca3|cn           |
+----+---------------------+---------+--------+----+-------------+
|34  |meter-gauge-1xbYRYcj |1        |51      |USA |United States|
|28  |sensor-pad-4mzWkz    |4        |32      |USA |United States|
|27  |sensor-pad-6al7RTAobR|6        |51      |USA |United States|
|27  |sensor-pad-8xUD6pzsQI|8        |35      |JPN |Japan        |
|26  |sensor-pad-10BsywSYUF|10       |56      |USA |United States|
+----+---------------------+---------+--------+----+-------------+
only showing top 5 rows



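Use the first() method to peek at the first row of dsTemp. Because dsTemp was built with a final select(), it is a DataFrame, so first() returns a generic Row rather than a DeviceTempByCountry object.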

In [26]:
val device = dsTemp.first()

device: org.apache.spark.sql.Row = [34,meter-gauge-1xbYRYcj,1,USA]


**Q-1) How to detect failing devices with battery levels below a threshold?**

Note: devices with a battery level below 8 are potential candidates

In [27]:
ds.select($"battery_level", $"c02_level", $"device_name").where($"battery_level" < 8).sort($"c02_level").show(5, false)

+-------------+---------+----------------------------+
|battery_level|c02_level|device_name                 |
+-------------+---------+----------------------------+
|4            |800      |meter-gauge-113569ucj1L     |
|4            |800      |device-mac-111327FkK365     |
|6            |800      |sensor-pad-118976n1jeLj     |
|7            |800      |sensor-pad-194806J7sArv     |
|6            |800      |meter-gauge-19390710h1TvoZMt|
+-------------+---------+----------------------------+
only showing top 5 rows



**Q-2) How to identify offending countries with high levels of C02 emissions?**

Note: Any C02 level above 1300 marks a potential violator of C02 emissions

Keep only readings with a c02_level greater than 1300, average by country, and sort in descending order of average c02_level. Note that this high-level, domain-specific language API reads like a SQL query

In [29]:
val newDS = ds
  .filter($"c02_level" > 1300)
  .groupBy($"cn")
  .avg()
  .sort($"avg(c02_level)".desc)

newDS.show(10, false)

+------------------------------+------------------+--------------+--------------+-------------+-------------+------------------+---------+------------------+
|cn                            |avg(battery_level)|avg(c02_level)|avg(device_id)|avg(humidity)|avg(latitude)|avg(longitude)    |avg(temp)|avg(timestamp)    |
+------------------------------+------------------+--------------+--------------+-------------+-------------+------------------+---------+------------------+
|Solomon Islands               |3.0               |1588.0        |187433.0      |40.0         |-9.43        |159.95            |21.0     |1.458444060894E12 |
|Federated States of Micronesia|3.0               |1573.0        |78806.0       |55.0         |6.92         |158.25            |13.0     |1.45844405755E12  |
|Rwanda                        |2.5               |1560.5        |102085.0      |44.0         |-2.0         |30.0              |21.5     |1.458444058393E12 |
|British Indian Ocean Territory|7.0               |1

newDS: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cn: string, avg(battery_level): double ... 7 more fields]
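Calling `avg()` with no arguments averages every numeric column, including identifiers and timestamps, as the output above shows. A minimal sketch that aggregates only the C02 level uses `agg` with a single `avg`. The `Co2Reading` case class and its sample values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical reduced record type and sample values, used only for this sketch.
case class Co2Reading(cn: String, c02_level: Long)

// In the notebook the existing `spark` session can be used instead of building one.
val spark = SparkSession.builder().master("local[*]").appName("co2-agg-sketch").getOrCreate()
import spark.implicits._

val co2 = Seq(
  Co2Reading("Country A", 1500L),
  Co2Reading("Country A", 1400L),
  Co2Reading("Country B", 1350L),
  Co2Reading("Country B", 900L)   // dropped by the filter below
).toDS()

// Average only c02_level per country and give the result column an explicit name.
val avgCo2 = co2
  .filter($"c02_level" > 1300)
  .groupBy($"cn")
  .agg(avg($"c02_level").as("avg_c02_level"))
  .sort($"avg_c02_level".desc)

avgCo2.show(false)
```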


**Q-3) Can we sort and group countries by average temperature, C02, and humidity?**

In [31]:
ds.filter($"temp" > 25 and $"humidity" > 75)
  .select("temp", "humidity", "cn")
  .groupBy($"cn")
  .avg()
  .sort($"avg(temp)".desc, $"avg(humidity)".desc)
  .show(10, false)

+----------------------+---------+-------------+
|cn                    |avg(temp)|avg(humidity)|
+----------------------+---------+-------------+
|Monaco                |34.0     |91.0         |
|Anguilla              |34.0     |83.0         |
|British Virgin Islands|34.0     |81.0         |
|Turkmenistan          |34.0     |80.0         |
|Suriname              |34.0     |79.0         |
|Gibraltar             |34.0     |78.0         |
|Liechtenstein         |34.0     |76.0         |
|Vanuatu               |33.5     |84.0         |
|Cameroon              |33.0     |91.0         |
|Fiji                  |33.0     |78.0         |
+----------------------+---------+-------------+
only showing top 10 rows
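To include the C02 average as well and give the aggregate columns readable names, each `avg` can be aliased inside `agg`. This is a sketch on hypothetical data: `ClimateReading`, the sample rows, and the alias names are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical reduced record type and sample values, used only for this sketch.
case class ClimateReading(cn: String, temp: Long, humidity: Long, c02_level: Long)

// In the notebook the existing `spark` session can be used instead of building one.
val spark = SparkSession.builder().master("local[*]").appName("climate-agg-sketch").getOrCreate()
import spark.implicits._

val climate = Seq(
  ClimateReading("Fiji", 33L, 78L, 1200L),
  ClimateReading("Fiji", 31L, 80L, 1000L),
  ClimateReading("Monaco", 34L, 91L, 900L),
  ClimateReading("Monaco", 20L, 95L, 1500L)   // dropped: temp is not above 25
).toDS()

// One named average per measure, sorted by temperature and then humidity.
val byCountry = climate
  .filter($"temp" > 25 && $"humidity" > 75)
  .groupBy($"cn")
  .agg(
    avg($"temp").as("avg_temp"),
    avg($"c02_level").as("avg_c02_level"),
    avg($"humidity").as("avg_humidity"))
  .sort($"avg_temp".desc, $"avg_humidity".desc)

byCountry.show(false)
```

Aliasing with `Column.as` renames the individual column, whereas `Dataset.as(alias)` names the whole Dataset and leaves column names unchanged.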



**Q-4) Can we compute min, max values for temperature, C02, and humidity?**

In [32]:
import org.apache.spark.sql.functions._ 

ds.select(min("temp"), max("temp"), min("humidity"), max("humidity"), min("c02_level"), max("c02_level"), min("battery_level"), max("battery_level")).show(10)

+---------+---------+-------------+-------------+--------------+--------------+------------------+------------------+
|min(temp)|max(temp)|min(humidity)|max(humidity)|min(c02_level)|max(c02_level)|min(battery_level)|max(battery_level)|
+---------+---------+-------------+-------------+--------------+--------------+------------------+------------------+
|       10|       34|           25|           99|           800|          1599|                 0|                 9|
+---------+---------+-------------+-------------+--------------+--------------+------------------+------------------+



import org.apache.spark.sql.functions._
