## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) E-commerce Sales Lab
1. Building streaming DataFrames
1. Displaying streaming queries
1. Writing streaming results
1. Monitoring streaming queries

In [2]:
# Only needed when running locally
import findspark

findspark.init()
findspark.find()

'E:\\LibreriasPython\\spark-3.1.2-bin-hadoop2.7\\python\\pyspark'

In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ManagementStream').getOrCreate()

In [23]:
schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"

eventsPath = "../data/events.parquet" # path to events (a streaming file source expects a directory, not a single file)


dfStream = (spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .parquet(eventsPath)
)

### 1. Building streaming DataFrames

- Verify that the DataFrame is processed as a stream using `isStreaming`

- Create emailTrafficDF by:
  - Filtering the DataFrame on (traffic_source == 'email')
  - Using the withColumn method to add a "mobile" column defined as col("device").isin(["iOS", "Android"])
  - Selecting "user_id", "event_timestamp", "mobile"

In [24]:
# Check whether the DataFrame is streaming
dfStream.isStreaming

True

In [25]:
# Transform the streaming DataFrame
from pyspark.sql import functions as F

dfStreamFilter = (dfStream
  .filter(F.col("traffic_source") == "email")
  .withColumn("mobile", F.col("device").isin(["iOS", "Android"]))
  .select("user_id", "event_timestamp", "mobile")
)


### 2. Displaying streaming queries

In [26]:
userhome = "../data"
checkpointPath = userhome + "/email_traffic/checkpoint"
outputPath = userhome + "/email_traffic/output"

devicesQuery = (dfStreamFilter.writeStream
  .outputMode("append")
  .format("parquet")
  .queryName("email_traffic_p")
  .trigger(processingTime="1 second")
  .option("checkpointLocation", checkpointPath)
  .start(outputPath)
)

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Monitoring streaming queries

- Show the `id` of the streaming query
- Show the status of the streaming query
- Stop the streaming query

In [27]:
# ID
devicesQuery.id

'e68f6c15-3a7c-413f-935a-e2a6b12d239a'

In [28]:
# status
devicesQuery.status

{'message': "Terminated with exception: Option 'basePath' must be a directory",
 'isDataAvailable': False,
 'isTriggerActive': False}
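
The status above reports a query that has already terminated. A message of this kind most likely means the streaming source path points to a single file rather than a directory. As a minimal sketch, the status dictionary can be classified with a small helper. The function name `classify_status` and its three labels are hypothetical, not part of the PySpark API. The helper only inspects the dictionary keys shown in the output above.

```python
def classify_status(status):
    """Classify a StreamingQuery.status dictionary (hypothetical helper).

    Returns "failed" when the message reports a termination with an
    exception, "active" when a trigger is running or data is available,
    and "idle" otherwise.
    """
    message = status.get("message", "")
    if message.startswith("Terminated with exception"):
        return "failed"
    if status.get("isTriggerActive") or status.get("isDataAvailable"):
        return "active"
    return "idle"


# Status dictionary copied from the cell output above
failed_status = {
    "message": "Terminated with exception: Option 'basePath' must be a directory",
    "isDataAvailable": False,
    "isTriggerActive": False,
}
print(classify_status(failed_status))
```

In the notebook this would be called as `classify_status(devicesQuery.status)`. When the result is "failed", `devicesQuery.exception()` should give the underlying error.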

In [29]:
# stop (the parentheses are required; without them the method is only referenced, not called)
devicesQuery.stop()

In [30]:
devicesQuery.id

'e68f6c15-3a7c-413f-935a-e2a6b12d239a'

In [31]:
df_parquet_describe = spark.read.parquet(outputPath)
# df_parquet_describe.show()

AnalysisException: Unable to infer schema for Parquet at . It must be specified manually
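
Spark typically raises this `AnalysisException` when the target directory contains no Parquet files. That would be consistent with the query having terminated before writing any output. A minimal sketch of a guard that checks for output files before reading is shown below. It assumes the output path is on the local filesystem, and the helper name `has_parquet_files` is hypothetical.

```python
from pathlib import Path


def has_parquet_files(output_dir):
    """Return True if output_dir exists and contains at least one .parquet file."""
    root = Path(output_dir)
    if not root.is_dir():
        return False
    # Search recursively, since partitioned output is written to subdirectories
    return any(p.is_file() for p in root.rglob("*.parquet"))
```

In the notebook, the read could then be guarded with `if has_parquet_files(outputPath): spark.read.parquet(outputPath).show()`.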

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Coupon Sales Lab

Process and append streaming data on transactions that use coupons.

1. Read the data stream
1. Filter for transactions with coupon codes
1. Write the streaming query results to Parquet
1. Monitor the streaming query
1. Stop the streaming query

##### Classes
- [DataStreamReader](http://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html)
- [DataStreamWriter](http://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html)
- [StreamingQuery](http://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/streaming/StreamingQuery.html)

In [0]:
schema = "order_id BIGINT, email STRING, transaction_timestamp BIGINT, total_item_quantity BIGINT, purchase_revenue_in_usd DOUBLE, unique_items BIGINT, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>"

### 1. Read the streaming data
- Use the schema stored in **`schema`**
- Set the stream to process 1 file per trigger
- Read from the Parquet files stored in **`salesPath`**

Assign the resulting DataFrame to **`df`**.
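
One possible shape for this step, mirroring the reader used in the first lab, is sketched below. The function name `read_sales_stream` and its parameter names are hypothetical. The sketch assumes `spark` is an active SparkSession and `sales_path` is a directory of Parquet files.

```python
def read_sales_stream(spark, schema, sales_path):
    """Build a streaming DataFrame over the Parquet files in sales_path."""
    return (spark.readStream
        .schema(schema)                    # file sources need an explicit schema by default
        .option("maxFilesPerTrigger", 1)   # process at most one file per micro-batch
        .parquet(sales_path)
    )
```

It would be used as `df = read_sales_stream(spark, schema, salesPath)`.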

In [0]:
# TODO
salesPath = '' # path

df = (spark.FILL_IN
)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert df.isStreaming
assert df.columns == ['order_id', 'email', 'transaction_timestamp', 'total_item_quantity', 'purchase_revenue_in_usd', 'unique_items', 'items']

### 2. Filter for coupon transactions
- Explode the **`items`** field in **`df`**
- Filter for records where **`items.coupon`** is not null

Assign the resulting DataFrame to **`couponSalesDF`**.
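
The two steps above can be sketched with SQL expression strings, which avoids extra imports. This is only one way to do it, and `pyspark.sql.functions.explode` with `withColumn` would work equally well. The function name `explode_coupon_items` is hypothetical, and the sketch assumes the input DataFrame has an array column named `items`.

```python
def explode_coupon_items(df):
    """Explode the items array and keep only rows whose item has a coupon."""
    # Keep every other column as-is and replace items with one row per element
    other_columns = [c for c in df.columns if c != "items"]
    exploded = df.selectExpr(*other_columns, "explode(items) AS items")
    # After exploding, items is a struct, so its coupon field can be filtered directly
    return exploded.filter("items.coupon IS NOT NULL")
```

It would be used as `couponSalesDF = explode_coupon_items(df)`. Note that `selectExpr` places the exploded `items` column last, which matches the column order of the schema defined above.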

In [0]:
# TODO
couponSalesDF = (df.FILL_IN
)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
schemaStr = str(couponSalesDF.schema)
assert "StructField(items,StructType(List(StructField(coupon" in schemaStr, "items column was not exploded"

### 3. Write the streaming query results to Parquet
- Configure the streaming query to write in "append" mode
- Set the query name to "coupon_sales"
- Set a trigger interval of 1 second
- Set the checkpoint location to **`couponsCheckpointPath`**
- Set the output path to **`couponsOutputPath`**

Assign the resulting streaming query to **`couponSalesQuery`**.
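
The configuration listed above follows the same pattern as the writer in the first lab. A minimal sketch is shown below. The function name `start_parquet_query` and its parameter names are hypothetical, and `df` is assumed to be a streaming DataFrame.

```python
def start_parquet_query(df, name, checkpoint_path, output_path):
    """Start a streaming query that appends df to Parquet files in output_path."""
    return (df.writeStream
        .outputMode("append")                           # only new rows are written
        .format("parquet")
        .queryName(name)
        .trigger(processingTime="1 second")             # micro-batch interval
        .option("checkpointLocation", checkpoint_path)  # progress is tracked here
        .start(output_path)
    )
```

It would be used as `couponSalesQuery = start_parquet_query(couponSalesDF, "coupon_sales", couponsCheckpointPath, couponsOutputPath)`.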

In [0]:
# TODO
couponsCheckpointPath = workingDir + "/coupon-sales/checkpoint"
couponsOutputPath = workingDir + "/coupon-sales/output"

couponSalesQuery = (couponSalesDF.FILL_IN
)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
untilStreamIsReady("coupon_sales")
assert couponSalesQuery.isActive
assert len(dbutils.fs.ls(couponsOutputPath)) > 0
assert len(dbutils.fs.ls(couponsCheckpointPath)) > 0
assert "coupon_sales" in couponSalesQuery.lastProgress["name"]

### 4. Monitor the streaming query
- Get the ID of the streaming query
- Get the status of the streaming query
In [0]:
# TODO
queryID = couponSalesQuery.FILL_IN

In [0]:
# TODO
queryStatus = couponSalesQuery.FILL_IN

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert type(queryID) == str
assert list(queryStatus.keys()) == ['message', 'isDataAvailable', 'isTriggerActive']

### 5. Stop the streaming query
- Stop the streaming query

In [0]:
# TODO
couponSalesQuery.FILL_IN

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert not couponSalesQuery.isActive

### 6. Verify the records written to the Parquet files

In [0]:
# TODO

In [0]:
display(spark.read.parquet(couponsOutputPath))