# Storage Technologies

## Topic 6. Apache Spark. Spark Streaming (1/2)
This notebook contains the example code from the module handbook.

We use the jupyter/all-spark-notebook container:
```
docker run --name spark-stack -p 10000:8888 -p 4040:4040 jupyter/all-spark-notebook
```

We run the cells with the Scala kernel: Spylon-kernel

Before starting, open a listening socket on port 9999 inside the container:
```
docker exec -it spark-stack nc -l 9999
```
Do not kill this process. The shell will stay blocked while netcat is listening.

(acg)

### 2.1 Structured Streaming

In [1]:
import org.apache.spark.sql.Row 
import spark.implicits._

Intitializing Scala interpreter ...

Spark Web UI available at http://16a66b851a51:4041
SparkContext available as 'sc' (version = 3.2.0, master = local[*], app id = local-1636154005433)
SparkSession available as 'spark'


import org.apache.spark.sql.Row
import spark.implicits._


In [3]:
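// Streaming source: every line received on the socket becomes a row with a single `value` column
val lineas = spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load()

// true for a streaming DataFrame, false for a static one
lineas.isStreaming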

lineas: org.apache.spark.sql.DataFrame = [value: string]
res1: Boolean = true


In [4]:
lineas.printSchema

root
 |-- value: string (nullable = true)



In [5]:
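// Split each line into words; the resulting Dataset[String] keeps the column name `value`
val palabras = lineas.as[String].flatMap(_.split(" "))

// Running count of each distinct word
val numPalabras = palabras.groupBy("value").count()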


palabras: org.apache.spark.sql.Dataset[String] = [value: string]
numPalabras: org.apache.spark.sql.DataFrame = [value: string, count: bigint]


The following query keeps listening (and refreshing its results) for 30 seconds (30000 ms).

Remember to go to the shell where you launched netcat (nc) and type a few words:

```
(master_big_data) acg@MSI ~ $ docker exec -it spark-stack nc -l 9999
Hola
Hola otra vez
donde esta la otra casa
ve a casa

```

In [19]:
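// "update" mode prints only the rows whose count changed in each micro-batch
val query = numPalabras.writeStream
                       .outputMode("update")
                       .format("console")
                       .start()

// Block this cell for at most 30000 ms; returns false if the query is still running afterwards
query.awaitTermination(30000)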


-------------------------------------------
Batch: 0
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
+-----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
| Hola|    1|
+-----+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
| otra|    1|
|  vez|    1|
| Hola|    2|
+-----+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
| otra|    2|
| esta|    1|
|donde|    1|
| casa|    1|
|   la|    1|
+-----+-----+

-------------------------------------------
Batch: 4
-------------------------------------------
+-----+-----+
|value|count|
+-----+-----+
|   ve|    1|
| casa|    2|
|    a|    1|
+-----+-----+



query: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@3893d39b
res11: Boolean = false
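
The batches above can be read as increments to a running word count. The following is a minimal sketch in plain Scala (no Spark involved) that imitates what `update` mode prints. It assumes, based on the output shown, that each line typed in netcat arrived in its own micro-batch. The object and method names (`UpdateModeSketch`, `updates`) are hypothetical and only serve this illustration.

```scala
// Plain Scala sketch: simulates the rows that "update" output mode would print.
object UpdateModeSketch {
  // Assumption: one typed line per micro-batch (taken from the netcat session above).
  val batches: Seq[Seq[String]] = Seq(
    Seq("Hola"),
    Seq("Hola otra vez"),
    Seq("donde esta la otra casa"),
    Seq("ve a casa")
  )

  // For every batch, returns only the (word -> running count) pairs that changed.
  def updates(batches: Seq[Seq[String]]): Seq[Map[String, Long]] =
    batches.foldLeft((Map.empty[String, Long], Vector.empty[Map[String, Long]])) {
      case ((state, emitted), lines) =>
        // Count the words of this batch alone
        val delta = lines.flatMap(_.split(" "))
                         .groupBy(identity)
                         .map { case (word, ws) => word -> ws.size.toLong }
        // Add the previous running count to obtain the updated rows
        val changed = delta.map { case (word, n) => word -> (state.getOrElse(word, 0L) + n) }
        (state ++ changed, emitted :+ changed)
    }._2
}
```

Calling `UpdateModeSketch.updates(UpdateModeSketch.batches)` yields one map per typed line. Words that did not appear in a batch are absent from its map, just as they are absent from the console output of that batch.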


In [20]:
query.stop()

In [21]:
query.isActive

res7: Boolean = false


### 2.2 Time windows and watermarks

In [9]:
import java.sql.Timestamp 
import org.apache.spark.sql.Row 
import spark.implicits._

// includeTimestamp adds a `timestamp` column with the arrival time of each line
val lineas = spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .option("includeTimestamp", true)
                  .load()


import java.sql.Timestamp
import org.apache.spark.sql.Row
import spark.implicits._
lineas: org.apache.spark.sql.DataFrame = [value: string, timestamp: timestamp]


In [10]:
// Split each line into words and pair every word with the timestamp of its line
val palabras = lineas.as[(String, Timestamp)].flatMap(line =>
                                                        line._1.split(" ").map(word => (word, line._2))
                                                      ).toDF("palabra", "timestamp")


palabras: org.apache.spark.sql.DataFrame = [palabra: string, timestamp: timestamp]
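
As a rough illustration of the `flatMap` above, here is the same per-line transformation written in plain Scala on a single `(String, Timestamp)` pair. The object and method names (`WordTimestampSketch`, `explode`) are hypothetical, chosen only for this sketch.

```scala
import java.sql.Timestamp

// Plain Scala sketch of the per-line logic used inside the streaming flatMap.
object WordTimestampSketch {
  // Every word of the line inherits the timestamp of the whole line.
  def explode(line: (String, Timestamp)): Seq[(String, Timestamp)] =
    line._1.split(" ").toSeq.map(word => (word, line._2))
}
```

A line with two words therefore produces two rows that share one timestamp, which is what lets the later window aggregation place each word in time.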


In [11]:
// 10-second windows that slide every 5 seconds, so each word falls into two overlapping windows
val windowedCounts = palabras.groupBy(window($"timestamp", "10 seconds", "5 seconds"), $"palabra")
                             .count()


windowedCounts: org.apache.spark.sql.DataFrame = [window: struct<start: timestamp, end: timestamp>, palabra: string ... 1 more field]
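
To see why a single word shows up in two rows of the windowed output, the sketch below computes in plain Scala which sliding windows contain a given event time. It assumes timestamps expressed as non-negative epoch seconds and windows aligned to multiples of the slide interval; it is an illustration of the idea, not Spark's internal code. `SlidingWindowSketch` and `windowsFor` are hypothetical names.

```scala
// Plain Scala sketch: which [start, end) windows contain a given event time?
object SlidingWindowSketch {
  // ts, windowSec and slideSec are in seconds; ts is assumed to be non-negative.
  def windowsFor(ts: Long, windowSec: Long = 10L, slideSec: Long = 5L): List[(Long, Long)] = {
    val lastStart = ts - (ts % slideSec)           // latest window start that is <= ts
    Iterator.iterate(lastStart)(_ - slideSec)      // walk back one slide at a time
      .takeWhile(start => start + windowSec > ts)  // keep windows that still cover ts
      .map(start => (start, start + windowSec))
      .toList
      .reverse                                     // earliest window first
  }
}
```

With a 10-second window and a 5-second slide, `SlidingWindowSketch.windowsFor(37L)` returns the windows starting at 30 and 35, which mirrors the two rows per word in the console output.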


After running the next block (which keeps running), go back to the netcat window and type some words to send them to the query as a stream:
```
(master_big_data) acg@MSI ~ $ docker exec -it spark-stack nc -l 9999
Hola
Como estas
donde estan las llaves
estan en el fondo del mar
```

In [28]:
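// truncate = false prints the full window bounds instead of cutting long cells
val query = windowedCounts.writeStream
                          .outputMode("update")
                          .option("truncate", false)
                          .format("console")
                          .start()

// Keep the cell blocked for at most 60000 ms while results are printed
query.awaitTermination(60000)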

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------+-----+
|window|palabra|count|
+------+-------+-----+
+------+-------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-------+-----+
|window                                    |palabra|count|
+------------------------------------------+-------+-----+
|{2021-11-05 23:03:30, 2021-11-05 23:03:40}|Hola   |1    |
|{2021-11-05 23:03:35, 2021-11-05 23:03:45}|Hola   |1    |
+------------------------------------------+-------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+-------+-----+
|window                                    |palabra|count|
+------------------------------------------+-------+-----+
|{2021-11-05 23:03:35, 2021-11-05 23:03:45}|Como   |1    |
|{2021-11-05 23:03:30, 20

query: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@7ea2977f
res15: Boolean = false


#### Watermarking

In [12]:
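// Tolerate data arriving up to 15 seconds late (by event time) before old window state can be discarded
val windowedCounts = palabras.withWatermark("timestamp", "15 seconds")
                             .groupBy(window($"timestamp", "10 seconds", "5 seconds"),$"palabra")
                             .count()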


windowedCounts: org.apache.spark.sql.DataFrame = [window: struct<start: timestamp, end: timestamp>, palabra: string ... 1 more field]
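
The arithmetic behind the 15-second watermark can be sketched in plain Scala. This is a simplified model, assuming epoch-second timestamps: the watermark trails the largest event time seen so far by the configured delay, and a window stops receiving updates once its end is at or before the watermark. The real engine recomputes the watermark between micro-batches, so the exact moment a window is dropped may differ. `WatermarkSketch` and its members are hypothetical names.

```scala
// Plain Scala sketch of the watermark arithmetic (simplified model, not Spark's code).
object WatermarkSketch {
  val delaySec: Long = 15L  // matches withWatermark("timestamp", "15 seconds")

  // The watermark trails the maximum event time observed so far.
  def watermark(maxEventTimeSec: Long): Long = maxEventTimeSec - delaySec

  // A window whose end is at or before the watermark no longer accepts late data.
  def windowIsFinal(windowEndSec: Long, maxEventTimeSec: Long): Boolean =
    windowEndSec <= watermark(maxEventTimeSec)
}
```

For example, once an event with time 100 has been seen, the watermark is 85: a window ending at 80 is final, while a window ending at 90 can still be updated by late words.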


Again, remember to type words into the socket when you run the next block:
```
(master_big_data) acg@MSI ~ $ docker exec -it spark-stack nc -l 9999
aaa
bbb
ccc
bbb

```

In [14]:
val query = windowedCounts.writeStream
                          .outputMode("update")
                          .option("truncate", false)
                          .format("console")
                          .start() 

query.awaitTermination(1000)


query: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@3f639206
res4: Boolean = false


-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------+-----+
|window|palabra|count|
+------+-------+-----+
+------+-------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-------+-----+
|window                                    |palabra|count|
+------------------------------------------+-------+-----+
|{2021-11-05 23:15:10, 2021-11-05 23:15:20}|ccc    |1    |
|{2021-11-05 23:15:15, 2021-11-05 23:15:25}|aaa    |1    |
|{2021-11-05 23:15:15, 2021-11-05 23:15:25}|ccc    |1    |
|{2021-11-05 23:15:15, 2021-11-05 23:15:25}|bbb    |1    |
|{2021-11-05 23:15:10, 2021-11-05 23:15:20}|bbb    |1    |
|{2021-11-05 23:15:10, 2021-11-05 23:15:20}|aaa    |1    |
+------------------------------------------+-------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------