# Introduction to Spark SQL in Python

1. PySpark SQL.

2. Using SQL window functions for natural language processing.

3. Caching, logging, and the Spark UI.

4. Text classification.

Apache Spark is a computing framework for processing big data.

Spark SQL is a component of Apache Spark that works with tabular data.

Window functions are an advanced SQL feature that takes Spark to a new level of usefulness.

You will use Spark SQL to analyze time series.

You will extract the most common word sequences from a text document.

You will create feature sets from natural language text and use them to predict the last word of a sentence with logistic regression.

Spark combines the power of distributed computing with the ease of use of Python and SQL.

The course uses a natural language text dataset that is easy to understand.

Sentences are sequences of words.

Window functions are well suited to manipulating sequence data.

The same techniques taught here can be applied to sequences of song identifiers, video identifiers, or podcast identifiers.

The exercises include discovering frequent word sequences and converting word sequences into machine learning feature sets for training a text classifier.
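To make "frequent word sequences" concrete before any Spark code appears, here is a minimal plain-Python sketch of the idea: slide a fixed-size window over a sequence of words and count what falls inside it. The helper name `most_common_sequences` and the sample sentence are illustrative assumptions, not part of the course material, and this is not how Spark computes it.

```python
from collections import Counter

def most_common_sequences(text, n=3, top=3):
    """Count every run of n consecutive words and return the most frequent runs."""
    words = text.lower().split()
    # Slide a window of n words over the sequence, one position at a time.
    windows = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(windows).most_common(top)

# Hypothetical sample text, chosen only so that some sequences repeat.
sample = "the cat sat on the mat and the cat sat on the chair"
for sequence, count in most_common_sequences(sample):
    print(count, " ".join(sequence))
```

A SQL window function plays the same role as the list comprehension here: it gives each row access to its neighbours in an ordered sequence.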


In [32]:
import findspark
findspark.init()

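from pyspark.sql import SparkSession
# Note: the wildcard import shadows Python built-ins such as min, max and sum
# with their Spark column equivalents. It already includes broadcast and lead.
from pyspark.sql.functions import *
from pyspark.sql.types import *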

spark = SparkSession.builder.getOrCreate()

In [33]:
spark.sparkContext

In [56]:
df = spark.read.csv("Datos/trainsched.txt", header=True)
df.show()

+--------+-------------+-----+--------+
|train_id|      station| time|diff_min|
+--------+-------------+-----+--------+
|     217|       Gilroy|6:06a|     9.0|
|     217|   San Martin|6:15a|     6.0|
|     217|  Morgan Hill|6:21a|    15.0|
|     217| Blossom Hill|6:36a|     6.0|
|     217|      Capitol|6:42a|     8.0|
|     217|       Tamien|6:50a|     9.0|
|     217|     San Jose|6:59a|    null|
|     324|San Francisco|7:59a|     4.0|
|     324|  22nd Street|8:03a|    13.0|
|     324|     Millbrae|8:16a|     8.0|
|     324|    Hillsdale|8:24a|     7.0|
|     324| Redwood City|8:31a|     6.0|
|     324|    Palo Alto|8:37a|    28.0|
|     324|     San Jose|9:05a|    null|
+--------+-------------+-----+--------+



In [57]:
df.createOrReplaceTempView("schedule")

#### Using the SQL Language

In [58]:
spark.sql("DESCRIBE schedule").show()

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|train_id|   string|   null|
| station|   string|   null|
|    time|   string|   null|
|diff_min|   string|   null|
+--------+---------+-------+



In [60]:
spark.sql("SELECT * FROM schedule WHERE station LIKE '%San Jose%'").show()

+--------+--------+-----+--------+
|train_id| station| time|diff_min|
+--------+--------+-----+--------+
|     217|San Jose|6:59a|    null|
|     324|San Jose|9:05a|    null|
+--------+--------+-----+--------+



#### Window Functions

In [62]:
query = """
SELECT train_id, station, time, diff_min,
SUM(diff_min) OVER (PARTITION BY train_id ORDER BY time) AS running_total
FROM schedule
"""

spark.sql(query).show()

+--------+-------------+-----+--------+-------------+
|train_id|      station| time|diff_min|running_total|
+--------+-------------+-----+--------+-------------+
|     217|       Gilroy|6:06a|     9.0|          9.0|
|     217|   San Martin|6:15a|     6.0|         15.0|
|     217|  Morgan Hill|6:21a|    15.0|         30.0|
|     217| Blossom Hill|6:36a|     6.0|         36.0|
|     217|      Capitol|6:42a|     8.0|         44.0|
|     217|       Tamien|6:50a|     9.0|         53.0|
|     217|     San Jose|6:59a|    null|         53.0|
|     324|San Francisco|7:59a|     4.0|          4.0|
|     324|  22nd Street|8:03a|    13.0|         17.0|
|     324|     Millbrae|8:16a|     8.0|         25.0|
|     324|    Hillsdale|8:24a|     7.0|         32.0|
|     324| Redwood City|8:31a|     6.0|         38.0|
|     324|    Palo Alto|8:37a|    28.0|         66.0|
|     324|     San Jose|9:05a|    null|         66.0|
+--------+-------------+-----+--------+-------------+
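As a rough illustration of what `SUM(diff_min) OVER (PARTITION BY train_id ORDER BY time)` computes, the following plain-Python sketch reproduces the `running_total` column without Spark. The function name `running_total` and the use of `None` for SQL null are assumptions made for this example. The rows are copied from the schedule shown above and are assumed to be already sorted by train and time.

```python
from itertools import groupby
from operator import itemgetter

# (train_id, station, time, diff_min) rows from the schedule table,
# already sorted by train_id and time; None stands in for SQL null.
schedule = [
    (217, "Gilroy", "6:06a", 9.0),
    (217, "San Martin", "6:15a", 6.0),
    (217, "Morgan Hill", "6:21a", 15.0),
    (217, "Blossom Hill", "6:36a", 6.0),
    (217, "Capitol", "6:42a", 8.0),
    (217, "Tamien", "6:50a", 9.0),
    (217, "San Jose", "6:59a", None),
    (324, "San Francisco", "7:59a", 4.0),
    (324, "22nd Street", "8:03a", 13.0),
    (324, "Millbrae", "8:16a", 8.0),
    (324, "Hillsdale", "8:24a", 7.0),
    (324, "Redwood City", "8:31a", 6.0),
    (324, "Palo Alto", "8:37a", 28.0),
    (324, "San Jose", "9:05a", None),
]

def running_total(rows):
    """Append a per-train cumulative sum of diff_min to each row."""
    result = []
    # PARTITION BY train_id: restart the sum for every train.
    for train_id, group in groupby(rows, key=itemgetter(0)):
        total = None
        for _, station, time, diff_min in group:
            # Like SQL SUM, skip nulls; the total stays null until a value is seen.
            if diff_min is not None:
                total = diff_min if total is None else total + diff_min
            result.append((train_id, station, time, diff_min, total))
    return result

for row in running_total(schedule):
    print(row)
```

The null in each train's last row leaves the total unchanged, which is why San Jose repeats the previous row's value in the Spark output.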



In [64]:
query = """
SELECT 
ROW_NUMBER() OVER (ORDER BY time) AS row,
train_id, 
station, 
time, 
LEAD(time,1) OVER (ORDER BY time) AS time_next 
FROM schedule
"""
spark.sql(query).show()

# Give the number of the bad row as an integer:
# row 7 (San Jose, train 217) takes its time_next from train 324
bad_row = 7

# Provide the missing clause, with SQL keywords in upper case
clause = 'PARTITION BY train_id'

+---+--------+-------------+-----+---------+
|row|train_id|      station| time|time_next|
+---+--------+-------------+-----+---------+
|  1|     217|       Gilroy|6:06a|    6:15a|
|  2|     217|   San Martin|6:15a|    6:21a|
|  3|     217|  Morgan Hill|6:21a|    6:36a|
|  4|     217| Blossom Hill|6:36a|    6:42a|
|  5|     217|      Capitol|6:42a|    6:50a|
|  6|     217|       Tamien|6:50a|    6:59a|
|  7|     217|     San Jose|6:59a|    7:59a|
|  8|     324|San Francisco|7:59a|    8:03a|
|  9|     324|  22nd Street|8:03a|    8:16a|
| 10|     324|     Millbrae|8:16a|    8:24a|
| 11|     324|    Hillsdale|8:24a|    8:31a|
| 12|     324| Redwood City|8:31a|    8:37a|
| 13|     324|    Palo Alto|8:37a|    9:05a|
| 14|     324|     San Jose|9:05a|     null|
+---+--------+-------------+-----+---------+
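The effect of the missing `PARTITION BY train_id` clause can be sketched in plain Python. The helper `lead_values` below is a hypothetical stand-in for SQL `LEAD`, not a Spark function, and the four rows are the ones on either side of the boundary between the two trains in the output above.

```python
from itertools import groupby
from operator import itemgetter

# (train_id, station, time) rows around the boundary between the two trains.
rows = [
    (217, "Tamien", "6:50a"),
    (217, "San Jose", "6:59a"),
    (324, "San Francisco", "7:59a"),
    (324, "22nd Street", "8:03a"),
]

def lead_values(values, offset=1):
    """For each position, return the value `offset` rows ahead, or None past the end."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

# No PARTITION BY: a single window spans both trains.
unpartitioned = lead_values([time for _, _, time in rows])

# PARTITION BY train_id: LEAD is evaluated separately inside each train.
partitioned = []
for _, group in groupby(rows, key=itemgetter(0)):
    partitioned.extend(lead_values([time for _, _, time in group]))

for (train_id, station, time), plain, part in zip(rows, unpartitioned, partitioned):
    print(train_id, station, time, plain, part)
```

In the unpartitioned result the last stop of train 217 borrows the first departure of train 324. With the partition, that row gets `None`, which corresponds to SQL null.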



#### Aggregate Functions

In [65]:
# Give the identical result in each command
spark.sql('SELECT train_id, MIN(time) AS start FROM schedule GROUP BY train_id').show()
df.groupBy('train_id').agg({'time':'min'}).withColumnRenamed('min(time)', 'start').show()

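# Print the second column of the result
spark.sql('SELECT train_id, MIN(time), MAX(time) FROM schedule GROUP BY train_id').show()
# The dict literal repeats the key 'time', so Python keeps only the last entry
# ('max') and the dot-notation result has no min(time) column
result = df.groupBy('train_id').agg({'time':'min', 'time':'max'})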
result.show()
print(result.columns[1])

+--------+-----+
|train_id|start|
+--------+-----+
|     217|6:06a|
|     324|7:59a|
+--------+-----+

+--------+-----+
|train_id|start|
+--------+-----+
|     217|6:06a|
|     324|7:59a|
+--------+-----+

+--------+---------+---------+
|train_id|min(time)|max(time)|
+--------+---------+---------+
|     217|    6:06a|    6:59a|
|     324|    7:59a|    9:05a|
+--------+---------+---------+

+--------+---------+
|train_id|max(time)|
+--------+---------+
|     217|    6:59a|
|     324|    9:05a|
+--------+---------+

max(time)
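The dot-notation result above has only a `max(time)` column because of how Python handles dict literals, not because of anything Spark does. A short check in plain Python (the variable name `agg_spec` is only illustrative):

```python
# A dict literal with a repeated key keeps only the last value for that key,
# so Spark never sees the 'min' request.
agg_spec = {'time': 'min', 'time': 'max'}

print(agg_spec)
print(len(agg_spec))
```

One way to request both aggregates in dot notation is to pass column expressions instead of a dict, for example `agg(min('time'), max('time'))` with `min` and `max` taken from `pyspark.sql.functions`.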


In [66]:
# Write a SQL query giving a result identical to dot_df
query = "SELECT train_id, MIN(time) AS start, MAX(time) AS end FROM schedule GROUP BY train_id"
sql_df = spark.sql(query)
sql_df.show()

+--------+-----+-----+
|train_id|start|  end|
+--------+-----+-----+
|     217|6:06a|6:59a|
|     324|7:59a|9:05a|
+--------+-----+-----+
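`MIN(time)` and `MAX(time)` operate here on a string column, so they compare text rather than clock times. That gives the right answer for this schedule because every time has a single-digit hour and ends in `a`. The sketch below shows how it could go wrong with other values. The helper `to_minutes`, the reading of `a`/`p` as AM/PM, and the two extra times `10:15a` and `1:05p` are assumptions for illustration and do not appear in the dataset.

```python
def to_minutes(label):
    """Convert a label such as '6:06a' or '1:05p' to minutes after midnight."""
    clock, suffix = label[:-1], label[-1]
    hours, minutes = (int(part) for part in clock.split(":"))
    hours %= 12  # treat 12:xx as 0:xx before applying the suffix
    if suffix == "p":
        hours += 12
    return hours * 60 + minutes

# '6:06a' comes from the schedule; the other two values are hypothetical.
times = ["6:06a", "10:15a", "1:05p"]

earliest_as_text = min(times)                    # character-by-character comparison
earliest_by_clock = min(times, key=to_minutes)   # chronological comparison

print(earliest_as_text, earliest_by_clock)
```

The same caveat applies to `ORDER BY time` inside a window specification. Converting the column to a numeric or timestamp type first would avoid relying on the text ordering.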



In [70]:
from pyspark.sql import Window
from pyspark.sql.functions import lead

# Dot-notation equivalent of
# LEAD(time, 1) OVER (PARTITION BY train_id ORDER BY time) AS time_next
dot_df = df.withColumn(
    'time_next',
    lead('time', 1).over(Window.partitionBy('train_id').orderBy('time')))

dot_df.show()

+--------+-------------+-----+--------+---------+
|train_id|      station| time|diff_min|time_next|
+--------+-------------+-----+--------+---------+
|     217|       Gilroy|6:06a|     9.0|    6:15a|
|     217|   San Martin|6:15a|     6.0|    6:21a|
|     217|  Morgan Hill|6:21a|    15.0|    6:36a|
|     217| Blossom Hill|6:36a|     6.0|    6:42a|
|     217|      Capitol|6:42a|     8.0|    6:50a|
|     217|       Tamien|6:50a|     9.0|    6:59a|
|     217|     San Jose|6:59a|    null|     null|
|     324|San Francisco|7:59a|     4.0|    8:03a|
|     324|  22nd Street|8:03a|    13.0|    8:16a|
|     324|     Millbrae|8:16a|     8.0|    8:24a|
|     324|    Hillsdale|8:24a|     7.0|    8:31a|
|     324| Redwood City|8:31a|     6.0|    8:37a|
|     324|    Palo Alto|8:37a|    28.0|    9:05a|
|     324|     San Jose|9:05a|    null|     null|
+--------+-------------+-----+--------+---------+

