# Chapter 3

We will work with Datasets, which are supported only in Java and Scala.

## Creating Datasets

In [1]:
val ds = spark.read
.json("C:\\Users\\nora.hafidi\\Desktop\\Big Data\\iot_devices.json")
ds.show(2)

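SyntaxError: invalid syntax (Temp/ipykernel_18108/2750111062.py, line 1)

This cell was run with a Python kernel (note `ipykernel` in the message), which rejects the Scala code; the same read succeeds further down under a Scala kernel.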

If we want to instantiate a domain-specific object as a Dataset, we can do so by defining a case class in Scala.

In [4]:
case class DeviceIoTData (battery_level: Long, c02_level: Long,
cca2: String, cca3: String, cn: String, device_id: Long,
device_name: String, humidity: Long, ip: String, latitude: Double,
lcd: String, longitude: Double, scale:String, temp: Long,
timestamp: Long)

defined class DeviceIoTData


In [5]:
val ds = spark.read
.json("C:\\Users\\nora.hafidi\\Desktop\\Big Data\\iot_devices.json")
.as[DeviceIoTData] // the case class provides the schema

ds: org.apache.spark.sql.Dataset[DeviceIoTData] = [battery_level: bigint, c02_level: bigint ... 13 more fields]


In [6]:
ds.show(5, false)

+-------------+---------+----+----+-------------+---------+---------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|battery_level|c02_level|cca2|cca3|cn           |device_id|device_name          |humidity|ip           |latitude|lcd   |longitude|scale  |temp|timestamp    |
+-------------+---------+----+----+-------------+---------+---------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|8            |868      |US  |USA |United States|1        |meter-gauge-1xbYRYcj |51      |68.161.225.1 |38.0    |green |-97.0    |Celsius|34  |1458444054093|
|7            |1473     |NO  |NOR |Norway       |2        |sensor-pad-2n2Pea    |70      |213.161.254.1|62.47   |red   |6.15     |Celsius|11  |1458444054119|
|2            |1556     |IT  |ITA |Italy        |3        |device-mac-36TWSKiT  |44      |88.36.5.1    |42.83   |red   |12.83    |Celsius|19  |1458444054120|
|6            |1080     |US  |USA |United States|4  

In [7]:
ds.columns

res3: Array[String] = Array(battery_level, c02_level, cca2, cca3, cn, device_id, device_name, humidity, ip, latitude, lcd, longitude, scale, temp, timestamp)


## Dataset Operations

### Accessing a row

In [2]:
import org.apache.spark.sql.Row
val row = Row(350, true, "Learning Spark 2E", null)
println(row.getInt(0))
println(row.getBoolean(1))
println(row.getString(2))

350
true
Learning Spark 2E


import org.apache.spark.sql.Row
row: org.apache.spark.sql.Row = [350,true,Learning Spark 2E,null]
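The row above ends with a `null` field, and calling a primitive getter such as `getInt` on a null field raises an exception. A minimal sketch of safer access, reusing the same row: `isNullAt`, `getAs` and `get` are standard `Row` methods, while the variable names are only illustrative.

```scala
import org.apache.spark.sql.Row

// Same row as in the cell above; the fourth field is null.
val row = Row(350, true, "Learning Spark 2E", null)

// Check for null before calling a typed getter on a field.
val fourthIsNull = row.isNullAt(3)

// getAs[T] is a generic alternative to getInt / getString.
val pages = row.getAs[Int](0)
val title = row.getAs[String](2)

// Wrapping a possibly-null field in Option avoids handling null directly.
val lastField: Option[Any] = Option(row.get(3))

println(s"fields=${row.length}, pages=$pages, title=$title, fourthIsNull=$fourthIsNull, last=$lastField")
```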


### Queries

#### 1.

In [17]:
val filterTempDS = ds
.select("battery_level", "c02_level", "cca2", "cn", "device_id")
.filter(col("temp") > 30 && col("humidity") > 70) // where() can be used instead of filter()
.show(5, false)

// Example from the book. With Datasets, the filter is written as a native Scala (or Java) lambda
"""val filterTempDS = ds
.filter(d => {d.temp > 30 && d.humidity > 70})
filterTempDS.show(5, false)"""

+-------------+---------+----+-------------+---------+
|battery_level|c02_level|cca2|cn           |device_id|
+-------------+---------+----+-------------+---------+
|0            |1466     |US  |United States|17       |
|9            |986      |FR  |France       |48       |
|8            |1436     |US  |United States|54       |
|4            |1090     |US  |United States|63       |
|4            |1072     |PH  |Philippines  |81       |
+-------------+---------+----+-------------+---------+
only showing top 5 rows



filterTempDS: Unit = ()
res8: String =
.filter({d => {d.temp > 30 && d.humidity > 70})
filterTempDS.show(5, false)
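Because the cell above ends its chain with `.show(...)`, `filterTempDS` is bound to `Unit` rather than to a Dataset, as the output confirms. The following is a minimal sketch of the typed variant that keeps the Dataset and displays it in a separate step. The `Reading` case class and the sample rows are made up for illustration and stand in for `DeviceIoTData` and `iot_devices.json`; in a notebook, `getOrCreate()` simply reuses the existing session.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, reduced record type: only the fields this sketch needs.
case class Reading(device_id: Long, cn: String, temp: Long, humidity: Long)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("typed-filter-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up sample rows standing in for the IoT file.
val readings = Seq(
  Reading(1L, "United States", 34L, 51L),
  Reading(2L, "Norway", 11L, 70L),
  Reading(3L, "France", 33L, 85L)
).toDS()

// Typed filter: the lambda receives a Reading, so field names are checked at compile time.
val hotAndHumid = readings.filter(d => d.temp > 30 && d.humidity > 70)

// show() is called separately, so hotAndHumid stays a Dataset[Reading].
hotAndHumid.show(5, false)
```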


#### 2.

In [22]:
case class DeviceTempByCountry(temp: Long, device_name: String, device_id: Long, cca3: String)

defined class DeviceTempByCountry


In [25]:
val dsTemp = ds
 .select($"temp", $"device_name", $"device_id", $"cca3")
 .where("temp > 25")
 .as[DeviceTempByCountry]

dsTemp.show(10,false)

+----+----------------------+---------+----+
|temp|device_name           |device_id|cca3|
+----+----------------------+---------+----+
|34  |meter-gauge-1xbYRYcj  |1        |USA |
|28  |sensor-pad-4mzWkz     |4        |USA |
|27  |sensor-pad-6al7RTAobR |6        |USA |
|27  |sensor-pad-8xUD6pzsQI |8        |JPN |
|26  |sensor-pad-10BsywSYUF |10       |USA |
|31  |meter-gauge-17zb8Fghhl|17       |USA |
|31  |sensor-pad-18XULN9Xv  |18       |CHN |
|29  |meter-gauge-19eg1BpfCO|19       |USA |
|30  |device-mac-21sjz5h    |21       |AUT |
|28  |sensor-pad-24PytzD00Cp|24       |CAN |
+----+----------------------+---------+----+
only showing top 10 rows



dsTemp: org.apache.spark.sql.Dataset[DeviceTempByCountry] = [temp: bigint, device_name: string ... 2 more fields]


In [19]:
val dsTemp2 = ds
 .filter(d => {d.temp > 25})
 .map(d => (d.temp, d.device_name, d.device_id, d.cca3))
 .toDF("temp", "device_name", "device_id", "cca3")
 .as[DeviceTempByCountry]
dsTemp2.show(5, false)


<console>: 17: error: Unable to find encoder for type (Long, String, Long, String). An implicit Encoder[(Long, String, Long, String)] is needed to store (Long, String, Long, String) instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.

Semantically, select() plays the same role as map() in the query above: both pick out fields from each record and produce equivalent results.
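The encoder error shown above is what Spark reports when the implicit encoders are not in scope; as the message itself says, they come from importing `spark.implicits._`. A minimal sketch of a working typed version follows, under the assumption of a small made-up input: the `Device` case class and its rows are hypothetical, and `DeviceTempByCountry` is repeated so the block stands alone.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, reduced input type, plus the target type from the text.
case class Device(temp: Long, device_name: String, device_id: Long, cca3: String, humidity: Long)
case class DeviceTempByCountry(temp: Long, device_name: String, device_id: Long, cca3: String)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("encoder-sketch")
  .getOrCreate()
// Brings the implicit Encoders for tuples and case classes into scope.
import spark.implicits._

// Made-up sample rows.
val devices = Seq(
  Device(34L, "meter-gauge-1xbYRYcj", 1L, "USA", 51L),
  Device(11L, "sensor-pad-2n2Pea", 2L, "NOR", 70L)
).toDS()

// Mapping straight to the case class avoids the tuple + toDF + as round trip.
val dsTemp2 = devices
  .filter(d => d.temp > 25)
  .map(d => DeviceTempByCountry(d.temp, d.device_name, d.device_id, d.cca3))

dsTemp2.show(5, false)
```

Mapping to a tuple and then calling `toDF(...).as[DeviceTempByCountry]`, as in the original cell, should also work once the import is present; constructing the case class directly is just shorter.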

#### Viewing only the first row

In [None]:
val device = dsTemp.first()
println(device)

## Exercises on the IoT dataset

1. Detect failing devices with battery levels below a threshold.
2. Identify offending countries with high levels of CO2 emissions.
3. Compute the min and max values for temperature, battery level, CO2, and humidity.
4. Sort and group by average temperature, CO2, humidity, and country.



In [35]:
ds.count

res18: Long = 198164


In [28]:
ds.columns

res17: Array[String] = Array(battery_level, c02_level, cca2, cca3, cn, device_id, device_name, humidity, ip, latitude, lcd, longitude, scale, temp, timestamp)


1. Detect failing devices with battery levels below a threshold.

In [43]:
val defect = ds
.select("device_id", "device_name", "battery_level")
.where(col("battery_level") < 2)
.show(10)

+---------+--------------------+-------------+
|device_id|         device_name|battery_level|
+---------+--------------------+-------------+
|        8|sensor-pad-8xUD6p...|            0|
|       12|sensor-pad-12Y2kIm0o|            0|
|       14|sensor-pad-14QL93...|            1|
|       17|meter-gauge-17zb8...|            0|
|       36|sensor-pad-36VQv8...|            1|
|       44| sensor-pad-448DeWGL|            0|
|       77|meter-gauge-77IKW...|            1|
|       80|sensor-pad-80TY4d...|            0|
|       84|sensor-pad-84jla9J5O|            1|
|       85| therm-stick-85NcuaO|            1|
+---------+--------------------+-------------+
only showing top 10 rows



defect: Unit = ()
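In the cell above the threshold is hard-coded and `defect` ends up as `Unit` because of the trailing `.show`. One way to make the threshold explicit is sketched below; `lowBattery` is a hypothetical helper and the sample rows are made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("low-battery-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up sample rows: (device_id, device_name, battery_level).
val devices = Seq(
  (8L, "sensor-a", 0L),
  (9L, "sensor-b", 7L),
  (14L, "sensor-c", 1L)
).toDF("device_id", "device_name", "battery_level")

// Hypothetical helper: returns the devices whose battery level is below the threshold.
def lowBattery(df: DataFrame, threshold: Long): DataFrame =
  df.select("device_id", "device_name", "battery_level")
    .where(col("battery_level") < threshold)
    .orderBy("battery_level", "device_id")

// The result is kept as a DataFrame; show() is a separate step.
val defect = lowBattery(devices, 2L)
defect.show(10, false)
```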


2. Identify offending countries with high levels of CO2 emissions.

In [65]:
val emi_c02_total = ds
.select("cca3", "c02_level")
.where(col("c02_level") > 1500)
.count

emi_c02_total: Long = 24614


In [1]:
val emi_c02 = ds
.select("cca3", "c02_level")
.where(col("c02_level") > 1500)
.show(10)

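SyntaxError: invalid syntax (<ipython-input-1-d75ff8bbc3c6>, line 1)

As with the first cell of the chapter, this error comes from running the Scala code under a Python kernel.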

In [72]:
val emi_c02_paises_count = ds
.select("cca3")
.where(col("c02_level") > 1500)
.groupBy("cca3")
.agg(count("cca3").alias("Total"))
emi_c02_paises_count.count

emi_c02_paises_count: org.apache.spark.sql.DataFrame = [cca3: string, Total: bigint]
res30: Long = 170


In [73]:
val emi_c02_paises = ds
.select("cca3")
.where(col("c02_level") > 1500)
.groupBy("cca3")
.agg(count("cca3").alias("Total"))
emi_c02_paises.show(170)

+----+-----+
|cca3|Total|
+----+-----+
| HTI|    4|
| PSE|    5|
| POL|  330|
| LVA|   43|
| BRB|    7|
| JAM|    1|
| ZMB|    1|
| BRA|  436|
| ARM|    5|
| MOZ|    3|
| JOR|    9|
| CUB|    3|
| FRA|  688|
| ABW|    2|
| BRN|    2|
| FSM|    1|
| COD|    3|
| URY|   12|
| BOL|   21|
| LBY|    1|
| ATG|    8|
| ITA|  375|
| UKR|  190|
| GHA|    4|
| CMR|    5|
| VIR|    5|
| SEN|    4|
| GTM|    3|
| IOT|    1|
| HRV|   33|
| VCT|    2|
| QAT|    8|
| BHS|    2|
| GBR|  816|
| GMB|    2|
| PRY|    4|
| ARE|   17|
| FRO|    3|
| CRI|   14|
| BMU|    9|
| NPL|    4|
| UGA|    4|
| VUT|    1|
| AZE|   11|
| AUS|  391|
| MLI|    1|
| MLT|    8|
| KNA|    4|
| MEX|  149|
| BGD|   23|
| PNG|    2|
| AFG|    4|
| DMA|    3|
| BLR|   16|
| MNG|    6|
| SVK|   41|
| HUN|  115|
| TKM|    1|
| NZL|   54|
| THA|  128|
| NOR|  195|
| IRQ|    1|
| VEN|   17|
| FIN|   80|
| BWA|    2|
| SAU|   11|
| BFA|    3|
| ALB|    2|
| TGO|    1|
| BHR|    8|
| NIC|    7|
| BIH|   15|
| KWT|   20|
| FJI|    1|

emi_c02_paises: org.apache.spark.sql.DataFrame = [cca3: string, Total: bigint]
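The per-country counts above come back in no particular order, which makes the worst offenders hard to spot. A minimal sketch that sorts them in descending order, using made-up rows and the same 1500 cut-off as the cells above (`topOffenders` is an illustrative name):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, desc}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("co2-ranking-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up sample rows: (cca3, c02_level).
val readings = Seq(
  ("USA", 1550L), ("USA", 1580L), ("FRA", 1510L), ("NOR", 900L), ("USA", 1000L)
).toDF("cca3", "c02_level")

// Count high-CO2 readings per country and list the largest counts first.
val topOffenders = readings
  .where(col("c02_level") > 1500)
  .groupBy("cca3")
  .agg(count("cca3").alias("Total"))
  .orderBy(desc("Total"))

topOffenders.show(10, false)
```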


3. Compute the min and max values for temperature, battery level, CO2, and humidity.

In [83]:
val maximo_minimo = ds
.select(
  max("temp"), min("temp"),
  max("battery_level"), min("battery_level"),
  max("c02_level"), min("c02_level"),
  max("humidity"), min("humidity"))
.show

+---------+---------+------------------+------------------+--------------+--------------+-------------+-------------+
|max(temp)|min(temp)|max(battery_level)|min(battery_level)|max(c02_level)|min(c02_level)|max(humidity)|min(humidity)|
+---------+---------+------------------+------------------+--------------+--------------+-------------+-------------+
|       34|       10|                 9|                 0|          1599|           800|           99|           25|
+---------+---------+------------------+------------------+--------------+--------------+-------------+-------------+



maximo_minimo: Unit = ()


4. Sort and group by average temperature, CO2, humidity, and country.

In [94]:
val ord_prom = ds
.select("*")
.groupBy("cca3", "c02_level", "humidity")
.agg(avg("temp").alias("Temperatura media"))
.orderBy("c02_level", "humidity", "Temperatura media")
.show()

+----+---------+--------+-----------------+
|cca3|c02_level|humidity|Temperatura media|
+----+---------+--------+-----------------+
| MEX|      800|      25|             11.0|
| FRA|      800|      25|             22.0|
| USA|      800|      25|             26.0|
| KOR|      800|      26|             15.0|
| KOR|      800|      27|             13.0|
| USA|      800|      27|             30.0|
| CHN|      800|      28|             13.0|
| UKR|      800|      28|             14.0|
| CHN|      800|      29|             21.0|
| USA|      800|      29|             29.5|
| ESP|      800|      30|             31.0|
| KGZ|      800|      31|             11.0|
| KOR|      800|      32|             10.0|
| DEU|      800|      32|             12.0|
| USA|      800|      32|             25.0|
| CAN|      800|      32|             31.0|
| USA|      800|      33|             26.0|
| USA|      800|      34|             14.0|
| FRA|      800|      34|             17.0|
| MEX|      800|      35|       

ord_prom: Unit = ()
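The cell above groups by country, CO2 level and humidity and averages only the temperature. The exercise can arguably also be read as "group by country and average each measurement". A sketch of that alternative reading, with made-up rows and illustrative column aliases:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, desc}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("avg-by-country-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up sample rows: (cca3, temp, c02_level, humidity).
val readings = Seq(
  ("USA", 30L, 1000L, 50L),
  ("USA", 20L, 1200L, 70L),
  ("NOR", 10L, 900L, 80L)
).toDF("cca3", "temp", "c02_level", "humidity")

// One row per country, with the average of each measurement, hottest countries first.
val avgByCountry = readings
  .groupBy("cca3")
  .agg(
    avg("temp").alias("avg_temp"),
    avg("c02_level").alias("avg_c02"),
    avg("humidity").alias("avg_humidity"))
  .orderBy(desc("avg_temp"))

avgByCountry.show(false)
```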
