# Storage Technologies

## Unit 5. Apache Spark. Spark SQL

This notebook contains the example code from the module handbook.

We use the jupyter/all-spark-notebook container:
```
docker run --name spark-stack -p 10000:8888 -p 4040:4040 jupyter/all-spark-notebook
```

The cells are run with the Scala kernel, spylon-kernel.

(acg)

### 4.1 RDDs

In [2]:
val cadenas = Array("master","bigdata","spark","data", "master", "uemc", "uemc","master","bigdata")
val cadenasRDD = sc.parallelize(cadenas, 3);

cadenas: Array[String] = Array(master, bigdata, spark, data, master, uemc, uemc, master, bigdata)
cadenasRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:26
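
The second argument of `parallelize` sets the number of partitions. As a minimal sketch (the names `words`, `numPartitions` and `perPartition` are illustrative and not part of the handbook), `getNumPartitions` and `glom` can be used to inspect how the elements are spread across them:

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one (as in the notebook); otherwise starts a local one.
val spark = SparkSession.builder.appName("partitionsSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(
  Seq("master", "bigdata", "spark", "data", "master", "uemc", "uemc", "master", "bigdata"), 3)

val numPartitions = words.getNumPartitions
// glom() turns every partition into a single array, so its length is the partition size.
val perPartition = words.glom().map(_.length).collect()
println(s"$numPartitions partitions with sizes ${perPartition.mkString(", ")}")
```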


In [3]:
val file = sc.textFile("./car_samples.csv", 6);

file: org.apache.spark.rdd.RDD[String] = ./car_samples.csv MapPartitionsRDD[2] at textFile at <console>:24


In [4]:
val fileNotFound = sc.textFile("/home/vmuser/fileNotFound", 6);

fileNotFound: org.apache.spark.rdd.RDD[String] = /home/vmuser/fileNotFound MapPartitionsRDD[4] at textFile at <console>:24


In [5]:
fileNotFound.collect()

org.apache.hadoop.mapred.InvalidInputException:  Input path does not exist: file:/home/vmuser/fileNotFound
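
The two cells above show that `textFile` is lazy: the missing path is only detected when an action such as `collect()` runs. A hedged sketch of how the failure could be handled with `scala.util.Try` (the names `missing` and `result` are illustrative):

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("lazySketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Defining the RDD does not touch the file system, so no error is raised here.
val missing = sc.textFile("/home/vmuser/fileNotFound")

// The action forces Spark to resolve the path; the exception surfaces at this point.
val result = Try(missing.count())
result match {
  case Success(n) => println(s"Read $n lines")
  case Failure(e) => println(s"The action failed with ${e.getClass.getSimpleName}")
}
```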

### 4.2 Transformations and actions

In [6]:
val mayusculaRDD = cadenasRDD.map(p => p.toUpperCase())
mayusculaRDD.collect()

mayusculaRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at map at <console>:24
res1: Array[String] = Array(MASTER, BIGDATA, SPARK, DATA, MASTER, UEMC, UEMC, MASTER, BIGDATA)


In [7]:
val filtrado = mayusculaRDD.filter(p => p.contains("DATA")) 
filtrado.collect()


filtrado: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at filter at <console>:24
res2: Array[String] = Array(BIGDATA, DATA, BIGDATA)


In [8]:
val mayuscLength = cadenasRDD.flatMap(p => List(p.toUpperCase(), p.length))
mayuscLength.collect()


mayuscLength: org.apache.spark.rdd.RDD[Any] = MapPartitionsRDD[7] at flatMap at <console>:24
res3: Array[Any] = Array(MASTER, 6, BIGDATA, 7, SPARK, 5, DATA, 4, MASTER, 6, UEMC, 4, UEMC, 4, MASTER, 6, BIGDATA, 7)
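
Because the list passed to `flatMap` mixes a `String` and an `Int`, the result above is typed as `RDD[Any]`. If keeping the types matters, one possible alternative (a sketch, not the handbook's approach) is to `map` each word to a tuple instead:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("typedPairsSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("master", "spark"))

// Each element becomes a (String, Int) pair, so the RDD keeps a precise element type.
val upperWithLength = words.map(p => (p.toUpperCase, p.length))
val collected = upperWithLength.collect()
collected.foreach(println)
```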


In [9]:
val cadenasBoth = cadenasRDD.union(mayusculaRDD)
cadenasBoth.collect()


cadenasBoth: org.apache.spark.rdd.RDD[String] = UnionRDD[8] at union at <console>:25
res4: Array[String] = Array(master, bigdata, spark, data, master, uemc, uemc, master, bigdata, MASTER, BIGDATA, SPARK, DATA, MASTER, UEMC, UEMC, MASTER, BIGDATA)


In [11]:
val distinct = cadenasRDD.distinct(1)
distinct.collect()

distinct: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at distinct at <console>:25
res6: Array[String] = Array(spark, master, data, bigdata, uemc)


In [14]:
val pair = cadenasRDD.map(p => (p,1)) 
pair.collect()

pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:24
res7: Array[(String, Int)] = Array((master,1), (bigdata,1), (spark,1), (data,1), (master,1), (uemc,1), (uemc,1), (master,1), (bigdata,1))


In [15]:
val group = pair.groupByKey() 
group.collect()

group: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[16] at groupByKey at <console>:24
res8: Array[(String, Iterable[Int])] = Array((master,CompactBuffer(1, 1, 1)), (uemc,CompactBuffer(1, 1)), (spark,CompactBuffer(1)), (data,CompactBuffer(1)), (bigdata,CompactBuffer(1, 1)))


In [16]:
val suma = pair.reduceByKey(_ + _) 
suma.collect()

suma: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[17] at reduceByKey at <console>:24
res9: Array[(String, Int)] = Array((master,3), (uemc,2), (spark,1), (data,1), (bigdata,2))
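
The two previous cells reach the same counts by different routes: `groupByKey` collects every value per key, whereas `reduceByKey` already combines values inside each partition before the shuffle. A small sketch that checks both give the same result (the names `viaGroup` and `viaReduce` are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("wordCountSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(
  Seq("master", "bigdata", "spark", "data", "master", "uemc", "uemc", "master", "bigdata"), 3)
val pairs = words.map(p => (p, 1))

// Group all the 1s per key, then sum them.
val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap
// Sum the 1s per key directly.
val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

println(viaGroup == viaReduce)
```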


In [18]:
val pair = cadenasRDD.map(p => (p,1))

val sort = pair.sortByKey(false) 
sort.collect()


pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[18] at map at <console>:25
sort: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at sortByKey at <console>:27
res10: Array[(String, Int)] = Array((uemc,1), (uemc,1), (spark,1), (master,1), (master,1), (master,1), (data,1), (bigdata,1), (bigdata,1))


#### Actions

In [28]:
val numeros = Array(7,8,3,6,2,1,1)
val numerosRDD = sc.parallelize(numeros, 3);

numeros: Array[Int] = Array(7, 8, 3, 6, 2, 1, 1)
numerosRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[25] at parallelize at <console>:26


In [29]:
numerosRDD.reduce(_+_)

res17: Int = 28


In [30]:
numerosRDD.count() + cadenasRDD.count()

res18: Long = 16


In [32]:
cadenasRDD.first()

res20: String = master


In [33]:
cadenasRDD.take(4)

res21: Array[String] = Array(master, bigdata, spark, data)


In [34]:
numerosRDD.max()

res22: Int = 8


In [36]:
val pair = cadenasRDD.map(p => (p,1)) 
pair.countByKey()

pair: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:25
res23: scala.collection.Map[String,Long] = Map(bigdata -> 2, master -> 3, data -> 1, spark -> 1, uemc -> 2)


In [37]:
cadenasRDD.foreach(p => println("La palabra es " + p))

La palabra es data
La palabra es master
La palabra es uemc
La palabra es master
La palabra es bigdata
La palabra es uemc
La palabra es master
La palabra es bigdata
La palabra es spark
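
The lines above do not follow the order of the original array because `foreach` runs on each partition independently; on a cluster the `println` output would also end up in the executors' logs rather than in the notebook. A sketch of one way to print in order on the driver, assuming the data is small enough to be collected:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("orderedPrintSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("master", "bigdata", "spark", "data"), 3)

// collect() brings the elements back to the driver in partition order,
// so the local foreach prints them in the original sequence.
val onDriver = words.collect()
onDriver.foreach(p => println("La palabra es " + p))
```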


### 4.3 Datasets

This block is best run in the Spark shell of the container:
```
docker exec -it spark-stack spark-shell
```

In [38]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("miApp").master("local").getOrCreate()


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@37b3c514


Remember to copy the strangersCharacters.txt file into the container:
```
docker cp strangersCharacters.txt spark-stack:/home/jovyan/strangersCharacters.txt
```

In [41]:
!cat strangersCharacters.txt

Lobezno,125,hombre
Catwoman,32,mujer
Batman,43,hombre
ScarletWitch,28,mujer



In [42]:
val sc = spark.sparkContext

sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@c7fc64f


In [43]:
val lineas = sc.textFile("strangersCharacters.txt") 

lineas: org.apache.spark.rdd.RDD[String] = strangersCharacters.txt MapPartitionsRDD[30] at textFile at <console>:25


In [45]:
import org.apache.spark.sql.Row 

import spark.implicits._


import org.apache.spark.sql.Row
import spark.implicits._


In [46]:
import org.apache.spark.sql.Row 

import spark.implicits._

val partes = lineas.map(_.split(",")) 

partes.collect()


import org.apache.spark.sql.Row
import spark.implicits._
partes: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[31] at map at <console>:34
res25: Array[Array[String]] = Array(Array(Lobezno, 125, hombre), Array(Catwoman, 32, mujer), Array(Batman, 43, hombre), Array(ScarletWitch, 28, mujer))


The following block does not work in the notebook container; run it in the Spark shell instead.

In [49]:
//case class Personaje(nombre: String, edad: Long, sexo: String)
//val personajes = partes.map(atr => Personaje(atr(0), atr(1).trim.toLong, atr(2))).toDS()
//personajes.select($"nombre").first
//personajes.show()
//personajes.foreach(p => println("El personaje " + p.nombre + " tiene " + p.edad + " años"))
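
The commented block above builds a typed Dataset from a case class. As a hedged alternative sketch, the same rows can be represented as tuples, which avoids defining a case class at all. The sample lines are inlined here instead of being read from the file, the names `rawLines`, `ds` and `df` are illustrative, and whether this variant also runs in the notebook kernel is an assumption that has not been verified:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("tupleDatasetSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Two sample lines in the same format as strangersCharacters.txt.
val rawLines = Seq("Lobezno,125,hombre", "Catwoman,32,mujer")

// Parse each line into a (String, Long, String) tuple.
val tuples = rawLines.map(_.split(",")).map(a => (a(0), a(1).trim.toLong, a(2)))

val ds = tuples.toDS()                      // Dataset[(String, Long, String)], columns _1, _2, _3
val df = ds.toDF("nombre", "edad", "sexo")  // same data with readable column names
df.show()
```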

### 4.4 DataFrames

In [50]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("miApp").master("local").getOrCreate()


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@37b3c514


In [55]:
import org.apache.spark.sql.types._
val schema = new StructType()
      .add("Nombre",StringType,true)
      .add("Edad",IntegerType,true)
      .add("Genero",StringType,true)

import org.apache.spark.sql.types._
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Nombre,StringType,true), StructField(Edad,IntegerType,true), StructField(Genero,StringType,true))


In [70]:
val characters = spark.read.format("csv").option("header", "false").schema(schema).load("strangersCharacters.txt")

characters: org.apache.spark.sql.DataFrame = [Nombre: string, Edad: int ... 1 more field]


In [71]:
characters.show()

+------------+----+------+
|      Nombre|Edad|Genero|
+------------+----+------+
|     Lobezno| 125|hombre|
|    Catwoman|  32| mujer|
|      Batman|  43|hombre|
|ScarletWitch|  28| mujer|
+------------+----+------+



In [72]:
characters.columns

res38: Array[String] = Array(Nombre, Edad, Genero)


In [73]:
characters.select("nombre","edad").show()

+------------+----+
|      nombre|edad|
+------------+----+
|     Lobezno| 125|
|    Catwoman|  32|
|      Batman|  43|
|ScarletWitch|  28|
+------------+----+



In [74]:
val youngs = characters.filter($"edad" < 30) 
youngs.show()

+------------+----+------+
|      Nombre|Edad|Genero|
+------------+----+------+
|ScarletWitch|  28| mujer|
+------------+----+------+



youngs: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Nombre: string, Edad: int ... 1 more field]
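
Note that the cells above refer to the columns as `nombre` and `edad` although the schema declares `Nombre` and `Edad`; this works because Spark SQL resolves column names case-insensitively by default. The filter can also be written in several equivalent ways. A small sketch with inlined sample rows (the names `people`, `byCol`, `byExpr` and `byDollar` are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("filterSketch").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Lobezno", 125, "hombre"), ("ScarletWitch", 28, "mujer"))
  .toDF("Nombre", "Edad", "Genero")

val byCol = people.filter(col("edad") < 30)   // Column built with col()
val byExpr = people.filter("edad < 30")       // SQL expression string
val byDollar = people.where($"Edad" < 30)     // $-interpolator; where is an alias of filter

byCol.show()
```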


In [75]:
characters.first()

res41: org.apache.spark.sql.Row = [Lobezno,125,hombre]


In [76]:
val tresPrimeros = characters.head(3) 
tresPrimeros(1)

tresPrimeros: Array[org.apache.spark.sql.Row] = Array([Lobezno,125,hombre], [Catwoman,32,mujer], [Batman,43,hombre])
res42: org.apache.spark.sql.Row = [Catwoman,32,mujer]


In [77]:
characters.count()

res43: Long = 4


In [78]:
characters.groupBy("edad").count().show()

+----+-----+
|edad|count|
+----+-----+
|  28|    1|
|  43|    1|
| 125|    1|
|  32|    1|
+----+-----+



In [79]:
characters.describe().show()

+-------+------------+----------------+------+
|summary|      Nombre|            Edad|Genero|
+-------+------------+----------------+------+
|  count|           4|               4|     4|
|   mean|        null|            57.0|  null|
| stddev|        null|45.7748111228581|  null|
|    min|      Batman|              28|hombre|
|    max|ScarletWitch|             125| mujer|
+-------+------------+----------------+------+



### 4.5 Views

In [80]:
characters.createOrReplaceTempView("personajes")

In [82]:
val mujeres = spark.sql("select nombre,edad from personajes where Genero = 'mujer'" )
mujeres.show()

+------------+----+
|      nombre|edad|
+------------+----+
|    Catwoman|  32|
|ScarletWitch|  28|
+------------+----+



mujeres: org.apache.spark.sql.DataFrame = [nombre: string, edad: int]


In [83]:
val spark2 = spark.newSession()

spark2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@72363a46


In [84]:
val t = spark2.sql("select * from personajes") 

org.apache.spark.sql.AnalysisException:  Table or view not found: personajes; line 1 pos 14;

In [85]:
characters.createOrReplaceGlobalTempView("personajes_global")

In [88]:
val t = spark2.sql("select nombre,edad from global_temp.personajes_global where Genero = 'mujer'")
t.show()

+------------+----+
|      nombre|edad|
+------------+----+
|    Catwoman|  32|
|ScarletWitch|  28|
+------------+----+



t: org.apache.spark.sql.DataFrame = [nombre: string, edad: int]
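
The cells above show that a temporary view belongs to the session that created it, while a global temporary view is shared across sessions through the `global_temp` database. A compact sketch of the same check, with the failing query wrapped in `Try` (the view names `people_tmp` and `people_global` are made up for this example):

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("viewScopeSketch").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Catwoman", 32, "mujer"), ("Batman", 43, "hombre"))
  .toDF("Nombre", "Edad", "Genero")

people.createOrReplaceTempView("people_tmp")           // visible only in this session
people.createOrReplaceGlobalTempView("people_global")  // visible in every session of the application

val otherSession = spark.newSession()
val tempVisible = Try(otherSession.sql("select * from people_tmp")).isSuccess
val globalVisible = Try(otherSession.sql("select * from global_temp.people_global")).isSuccess

println(s"temp view visible: $tempVisible, global temp view visible: $globalVisible")
```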


In [91]:
val media_edad = spark2.sql("select genero, mean(edad) from global_temp.personajes_global group by genero")
media_edad.show()


+------+----------+
|genero|mean(edad)|
+------+----------+
|hombre|      84.0|
| mujer|      30.0|
+------+----------+



media_edad: org.apache.spark.sql.DataFrame = [genero: string, mean(edad): double]
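
The same aggregation can be expressed with the DataFrame API instead of a SQL string. A minimal sketch with the four sample rows inlined (the column alias `mean_age` and the names `people` and `meanAge` are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("meanAgeSketch").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(
  ("Lobezno", 125, "hombre"),
  ("Catwoman", 32, "mujer"),
  ("Batman", 43, "hombre"),
  ("ScarletWitch", 28, "mujer")
).toDF("Nombre", "Edad", "Genero")

// groupBy + agg is the DataFrame counterpart of "group by genero" with mean(edad).
val meanAge = people.groupBy("Genero").agg(avg("Edad").alias("mean_age"))
meanAge.show()

// Collect into a Map so the values can be looked up by key.
val meanByGender = meanAge.collect().map(r => r.getString(0) -> r.getDouble(1)).toMap
```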
