# Learning Spark - Chapter 2 (Python)
## Getting Started

Import the necessary libraries.
Since we are using Python, import the SparkSession and related functions from the PySpark module.

In [1]:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

Build a SparkSession using the SparkSession builder API.
getOrCreate() returns the existing session if there is one; otherwise it creates a new instance. All sessions in a JVM share a single SparkContext.

In [2]:
spark = (SparkSession
.builder
.appName("PythonMnMCount")
.getOrCreate())

### Transformations, Actions, and Lazy Evaluation

In [3]:
strings = spark.read.text("./SPARK_README.md")
filtered = strings.filter(strings.value.contains("Spark"))
filtered.count()

17
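In the cell above, `filter()` is a transformation and only describes the work, while `count()` is the action that makes Spark run it. The following is a plain-Python analogy, not Spark code: it uses Python's lazy built-in `filter` and a made-up list of lines to show that nothing is inspected until the result is consumed.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
# The lines below are made up for illustration.
lines = [
    "Apache Spark is fast",
    "It runs on the JVM",
    "Spark SQL handles tables",
    "See the docs",
]

evaluated = []  # records which lines were actually inspected


def contains_spark(line):
    evaluated.append(line)
    return "Spark" in line


# "Transformation": builds a lazy pipeline and inspects nothing yet.
filtered = filter(contains_spark, lines)
inspected_before = len(evaluated)

# "Action": consuming the pipeline forces the work to happen.
matches = sum(1 for _ in filtered)
inspected_after = len(evaluated)

print(inspected_before, matches, inspected_after)
```

Spark's own mechanism is different (it records a query plan and optimizes it before running), but the observable behavior is similar: defining `filtered` costs nothing until an action asks for a result.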

## Counting M&Ms for the Cookie Monster

Set the path to the M&M dataset file (a local Windows path in this case).

In [4]:
mnm_file = "C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/mnm_dataset.csv"

Read the file into a Spark DataFrame using the CSV format by inferring the schema and specifying that the file contains a header, which provides column names for comma-separated fields.

In [5]:
mnm_df = (spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load(mnm_file))
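To make the two options concrete, here is a rough plain-Python sketch of what `header` and `inferSchema` mean, using the standard `csv` module on a small in-memory sample. The sample text and the `infer` helper are assumptions for illustration; Spark's real schema inference covers more types and needs an extra pass over the data.

```python
import csv
import io

# Small in-memory sample in the same layout as the M&M file (values are illustrative).
sample = "State,Color,Count\nTX,Red,20\nNV,Blue,66\n"

# header=true: the first line supplies the column names.
reader = csv.DictReader(io.StringIO(sample))


def infer(value):
    """Hypothetical helper: treat a value as an integer if it parses, else keep the string."""
    try:
        return int(value)
    except ValueError:
        return value


# inferSchema=true: values get a type based on their content instead of staying strings.
records = [{name: infer(value) for name, value in row.items()} for row in reader]

print(records)
```

Without inference every field, including `Count`, would stay a string, and numeric aggregates such as max or avg would not behave as expected.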

In [6]:
mnm_df.show()

+-----+------+-----+
|State| Color|Count|
+-----+------+-----+
|   TX|   Red|   20|
|   NV|  Blue|   66|
|   CO|  Blue|   79|
|   OR|  Blue|   71|
|   WA|Yellow|   93|
|   WY|  Blue|   16|
|   CA|Yellow|   53|
|   WA| Green|   60|
|   OR| Green|   71|
|   TX| Green|   68|
|   NV| Green|   59|
|   AZ| Brown|   95|
|   WA|Yellow|   20|
|   AZ|  Blue|   75|
|   OR| Brown|   72|
|   NV|   Red|   98|
|   WY|Orange|   45|
|   CO|  Blue|   52|
|   TX| Brown|   94|
|   CO|   Red|   82|
+-----+------+-----+
only showing top 20 rows



We use the high-level DataFrame API; note that we don't use RDDs at all. Because each DataFrame transformation returns a new DataFrame, we can chain the calls.
1. Select the fields "State", "Color", and "Count" from the DataFrame
2. Since we want a count for each state and M&M color, group with groupBy() on State and Color
3. Aggregate each group with count(), which counts the rows in that group, and name the result "Total"
4. orderBy() "Total" in descending order

In [7]:
count_mnm_df = (mnm_df
.select("State", "Color", "Count")
.groupBy("State", "Color")
.agg(count("Count").alias("Total"))
.orderBy("Total", ascending=False))
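As a sanity check on what this query computes, here is a plain-Python sketch of the same group, count, and sort steps on a handful of hypothetical rows. It also shows, for contrast, a per-group sum of the Count column, since `count()` counts rows rather than adding up their values.

```python
from collections import Counter

# Hypothetical sample rows (State, Color, Count); not taken from the dataset.
rows = [
    ("TX", "Red", 20),
    ("NV", "Blue", 66),
    ("TX", "Red", 35),
    ("CA", "Yellow", 53),
    ("TX", "Red", 12),
    ("NV", "Blue", 40),
]

# groupBy("State", "Color") + count("Count"): number of rows in each group.
rows_per_group = Counter((state, color) for state, color, _ in rows)

# For contrast: the sum of the Count column in each group.
sum_per_group = Counter()
for state, color, n in rows:
    sum_per_group[(state, color)] += n

# orderBy("Total", ascending=False): largest groups first.
ordered = sorted(rows_per_group.items(), key=lambda item: item[1], reverse=True)

for (state, color), total in ordered:
    print(state, color, total, sum_per_group[(state, color)])
```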

Show the resulting aggregation for all states and colors: the total for each color in each state.
Note that show() is an action, so it triggers execution of the query defined above.

In [8]:
count_mnm_df.show(n=60, truncate=False)
print("Total Rows = %d" % (count_mnm_df.count()))

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|WA   |Green |1779 |
|OR   |Orange|1743 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
|TX   |Yellow|1703 |
|NV   |Green |1698 |
|AZ   |Brown |1698 |
|WY   |Green |1695 |
|CO   |Blue  |1695 |
|NM   |Red   |1690 |
|AZ   |Orange|1689 |
|NM   |Yellow|1688 |
|NM   |Brown |1687 |
|UT   |Orange|1684 |
|NM   |Green |1682 |
|UT   |Red   |1680 |
|AZ   |Green |1676 |
|NV   |Yellow|1675 |
|NV   |Blue  |1673 |
|WA   |Red   |1671 |
|WY   |Red   |1670 |
|WA   |Brown |1669 |
|NM   |Orange|1665 |
|WY   |Blue  |1664 |
|WA   |Yellow|1663 |
|WA   |Orange|1658 |
|CA   |Orange|1657 |
|NV   |Brown |1657 |
|CA   |Red   |1656 |
|CO   |Brown |1656 |
|UT   |Blue  |1655 |
|AZ   |Yellow|1654 |
|TX   |Orange|1652 |
|AZ   |Red   |1648 |
|OR   |Blue  |1646 |
|UT   |Yellow|1645 |
|OR   |Red   |1645 |
|CO   |Orange|1642 |
|TX   |Brown 

The code above aggregated and counted across all states. What if we only want to see the data for a single state, e.g., CA?
1. Select all rows in the DataFrame
2. Keep only the rows where State is CA
3. groupBy() State and Color as we did above
4. Aggregate the counts for each color
5. orderBy() in descending order

The next cell finds the aggregate counts for California by filtering.

In [9]:
ca_count_mnm_df = (mnm_df
.select("State", "Color", "Count")
.where(mnm_df.State == "CA")
.groupBy("State", "Color")
.agg(count("Count").alias("Total"))
.orderBy("Total", ascending=False))

Show the resulting aggregation for California.
As above, show() is an action that will trigger the execution of the entire computation.

In [10]:
ca_count_mnm_df.show(n=10, truncate=False)

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|CA   |Green |1723 |
|CA   |Brown |1718 |
|CA   |Orange|1657 |
|CA   |Red   |1656 |
|CA   |Blue  |1603 |
+-----+------+-----+



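### Extra exercises
a. Download Don Quixote: https://gist.github.com/jsdario/6d6c69398cb0c73111e49f1218960f79

Apply not only count (to get the number of lines) and show, but also try the different overloads of the show method (with and without truncate, with and without the number of rows, etc.), as well as the head, take, and first methods (what are the differences between these three?).

In PySpark, first() returns a single Row, head() without an argument does the same, and both head(n) and take(n) return a list of up to n Row objects.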

In [11]:
ruta_quijote = "C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/JM Jupyter/el_quijote.txt"

In [12]:
quijote_df = spark.read.text(ruta_quijote)

In [13]:
quijote_df.first()

Row(value='DON QUIJOTE DE LA MANCHA')

In [14]:
quijote_df.take(6)

[Row(value='DON QUIJOTE DE LA MANCHA'),
 Row(value='Miguel de Cervantes Saavedra'),
 Row(value=''),
 Row(value='PRIMERA PARTE'),
 Row(value='CAPÍTULO 1: Que trata de la condición y ejercicio del famoso hidalgo D. Quijote de la Mancha'),
 Row(value='En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor. Una olla de algo más vaca que carnero, salpicón las más noches, duelos y quebrantos los sábados, lentejas los viernes, algún palomino de añadidura los domingos, consumían las tres partes de su hacienda. El resto della concluían sayo de velarte, calzas de velludo para las fiestas con sus pantuflos de lo mismo, los días de entre semana se honraba con su vellori de lo más fino. Tenía en su casa una ama que pasaba de los cuarenta, y una sobrina que no llegaba a los veinte, y un mozo de campo y plaza, que así ensillaba el rocín como tomaba la podadera. Fr

In [15]:
quijote_df.head(10)

[Row(value='DON QUIJOTE DE LA MANCHA'),
 Row(value='Miguel de Cervantes Saavedra'),
 Row(value=''),
 Row(value='PRIMERA PARTE'),
 Row(value='CAPÍTULO 1: Que trata de la condición y ejercicio del famoso hidalgo D. Quijote de la Mancha'),
 Row(value='En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor. Una olla de algo más vaca que carnero, salpicón las más noches, duelos y quebrantos los sábados, lentejas los viernes, algún palomino de añadidura los domingos, consumían las tres partes de su hacienda. El resto della concluían sayo de velarte, calzas de velludo para las fiestas con sus pantuflos de lo mismo, los días de entre semana se honraba con su vellori de lo más fino. Tenía en su casa una ama que pasaba de los cuarenta, y una sobrina que no llegaba a los veinte, y un mozo de campo y plaza, que así ensillaba el rocín como tomaba la podadera. Fr

In [16]:
quijote_df.show(n = 10)

+--------------------+
|               value|
+--------------------+
|DON QUIJOTE DE LA...|
|Miguel de Cervant...|
|                    |
|       PRIMERA PARTE|
|CAPÍTULO 1: Que ...|
|En un lugar de la...|
|Tuvo muchas veces...|
|En resolución, e...|
|historia más cie...|
|Decía él, que e...|
+--------------------+
only showing top 10 rows



In [17]:
quijote_df.show(n = 10, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [18]:
quijote_df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [19]:
quijote_df.show()

+--------------------+
|               value|
+--------------------+
|DON QUIJOTE DE LA...|
|Miguel de Cervant...|
|                    |
|       PRIMERA PARTE|
|CAPÍTULO 1: Que ...|
|En un lugar de la...|
|Tuvo muchas veces...|
|En resolución, e...|
|historia más cie...|
|Decía él, que e...|
|En efecto, remata...|
|Imaginábase el p...|
|linaje y patria, ...|
|Limpias, pues, su...|
|Capítulo 2: Que ...|
|Hechas, pues, est...|
|Estos pensamiento...|
|Con estos iba ens...|
|Autores hay que d...|
|muertos de hambre...|
+--------------------+
only showing top 20 rows



b. From the M&M exercise, apply:

i. Other aggregation operations such as max, with a different ordering (descending).

In [21]:
from pyspark.sql import functions as F

In [22]:
prueba1_mnm_df = (mnm_df
                  .select("State", "Color", "Count")
                  .where((mnm_df.State == "CA") | (mnm_df.State == "WY"))
                  .groupBy("State", "Color")
                  .agg(F.max(mnm_df.Count))
                  .orderBy("Color"))

In [23]:
prueba1_mnm_df.show()

+-----+------+----------+
|State| Color|max(Count)|
+-----+------+----------+
|   WY|  Blue|       100|
|   CA|  Blue|       100|
|   WY| Brown|       100|
|   CA| Brown|       100|
|   WY| Green|       100|
|   CA| Green|       100|
|   CA|Orange|       100|
|   WY|Orange|       100|
|   WY|   Red|       100|
|   CA|   Red|       100|
|   CA|Yellow|       100|
|   WY|Yellow|       100|
+-----+------+----------+



In [25]:
from pyspark.sql.functions import col

In [26]:
max_mnm_df = (mnm_df
.select("State", "Color", "Count")
.groupBy("State")
.agg(F.max(col("Count"))))

In [27]:
max_mnm_df.show()

+-----+----------+
|State|max(Count)|
+-----+----------+
|   AZ|       100|
|   OR|       100|
|   WY|       100|
|   NV|       100|
|   CA|       100|
|   WA|       100|
|   NM|       100|
|   TX|       100|
|   CO|       100|
|   UT|       100|
+-----+----------+



ii. Do an exercise like the CA "where" filter from the book, but with more states (e.g., NV, TX, CA, CO).

In [28]:
count_mnm_df = (mnm_df
.select("State", "Color", "Count")
.where(mnm_df.State.isin("CA", "WY", "NV", "TX", "CO"))
.groupBy("State", "Color")
.agg(count("Count").alias("Total"))
.orderBy("State"))
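The filter in this cell keeps rows whose State is one of several values. As a small plain-Python sketch on hypothetical rows, a chain of equality tests joined with "or" selects exactly the same rows as a single membership test; in PySpark the membership form can be written with `Column.isin()`.

```python
# Hypothetical sample rows (State, Color, Count); not taken from the dataset.
rows = [
    ("CA", "Red", 10),
    ("OR", "Blue", 20),
    ("TX", "Green", 30),
    ("UT", "Red", 40),
    ("WY", "Blue", 50),
]

# Chained comparisons, one per state.
chained = [
    r for r in rows
    if r[0] == "CA" or r[0] == "WY" or r[0] == "NV" or r[0] == "TX" or r[0] == "CO"
]

# The same condition as a single membership test.
wanted = {"CA", "WY", "NV", "TX", "CO"}
by_membership = [r for r in rows if r[0] in wanted]

print(chained == by_membership, [r[0] for r in by_membership])
```

The membership form tends to be easier to extend, since adding a state means adding one value rather than another comparison.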

In [29]:
count_mnm_df.show(50)

+-----+------+-----+
|State| Color|Total|
+-----+------+-----+
|   CA|   Red| 1656|
|   CA|Yellow| 1807|
|   CA|  Blue| 1603|
|   CA| Brown| 1718|
|   CA| Green| 1723|
|   CA|Orange| 1657|
|   CO|Yellow| 1721|
|   CO| Green| 1713|
|   CO|  Blue| 1695|
|   CO|   Red| 1624|
|   CO| Brown| 1656|
|   CO|Orange| 1642|
|   NV|  Blue| 1673|
|   NV| Brown| 1657|
|   NV|   Red| 1610|
|   NV|Yellow| 1675|
|   NV|Orange| 1712|
|   NV| Green| 1698|
|   TX|   Red| 1725|
|   TX|  Blue| 1614|
|   TX| Brown| 1641|
|   TX|Yellow| 1703|
|   TX| Green| 1737|
|   TX|Orange| 1652|
|   WY| Green| 1695|
|   WY|Orange| 1595|
|   WY|   Red| 1670|
|   WY| Brown| 1532|
|   WY|Yellow| 1626|
|   WY|  Blue| 1664|
+-----+------+-----+



iii. Do an exercise that computes Max, Min, Avg, and Count in a single operation.

Check the API documentation, where you will find this example:

ds.agg(max($"age"), avg($"salary")) 

ds.groupBy().agg(max($"age"), avg($"salary")) 

NOTE: $ is an alias for col() (Scala syntax).

In [30]:
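all_mnm_df = (mnm_df
              .select("State", "Color", "Count")
              .groupBy("State", "Color")
              # alias() is applied to the column inside max()/min(), not to the aggregate,
              # so the output columns are named max(Count AS `Maximo`) and min(Count AS `Minimo`).
              .agg(F.max(col("Count").alias("Maximo")), F.min(col("Count").alias("Minimo")), F.avg(col("Count")).alias("Avg"))
              .orderBy("Color"))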

In [31]:
all_mnm_df.show(truncate = False)

+-----+-----+----------------------+----------------------+------------------+
|State|Color|max(Count AS `Maximo`)|min(Count AS `Minimo`)|Avg               |
+-----+-----+----------------------+----------------------+------------------+
|NV   |Blue |100                   |10                    |53.797369994022716|
|WY   |Blue |100                   |10                    |54.68870192307692 |
|UT   |Blue |100                   |10                    |54.366767371601206|
|TX   |Blue |100                   |10                    |54.811648079306075|
|CA   |Blue |100                   |10                    |55.59762944479102 |
|CO   |Blue |100                   |10                    |55.11032448377581 |
|OR   |Blue |100                   |10                    |54.99756986634265 |
|WA   |Blue |100                   |10                    |55.314461538461536|
|AZ   |Blue |100                   |10                    |54.99449877750611 |
|NM   |Blue |100                   |10              
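The exercise asks for Max, Min, Avg, and Count together. As a plain-Python sketch of what a single aggregation with all four computes per (State, Color) group, the following uses hypothetical rows and the output names Maximo, Minimo, Avg, and Total. In PySpark, giving the columns those names would mean calling alias() on the aggregate itself, e.g. `F.max("Count").alias("Maximo")`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows (State, Color, Count); not taken from the dataset.
rows = [
    ("CA", "Blue", 10),
    ("CA", "Blue", 100),
    ("CA", "Blue", 55),
    ("WY", "Blue", 20),
    ("WY", "Blue", 80),
]

# Collect the Count values of each (State, Color) group.
groups = defaultdict(list)
for state, color, n in rows:
    groups[(state, color)].append(n)

# Compute all four aggregates for each group in one step.
summary = {
    key: {
        "Maximo": max(values),
        "Minimo": min(values),
        "Avg": mean(values),
        "Total": len(values),
    }
    for key, values in groups.items()
}

for key, stats in summary.items():
    print(key, stats)
```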

iv. Also do exercises in SQL by creating a temporary view.

In [32]:
all_mnm_df.createOrReplaceTempView("all")

In [33]:
spark.sql("""select * from all where State = 'CA'""").show()

+-----+------+----------------------+----------------------+------------------+
|State| Color|max(Count AS `Maximo`)|min(Count AS `Minimo`)|               Avg|
+-----+------+----------------------+----------------------+------------------+
|   CA|  Blue|                   100|                    10| 55.59762944479102|
|   CA| Brown|                   100|                    10|55.740395809080326|
|   CA| Green|                   100|                    10|54.268717353453276|
|   CA|Orange|                   100|                    10|54.502715751357876|
|   CA|   Red|                   100|                    10| 55.26992753623188|
|   CA|Yellow|                   100|                    10|  55.8693967902601|
+-----+------+----------------------+----------------------+------------------+



Stop the SparkSession

In [34]:
spark.stop()