# Class - Distributed Computing

## PySpark Streaming Hands-on
#### Marcelo Medel Vergara - Diplomado Data Engineer USACH


### Input Sources

Structured Streaming can read data from several kinds of sources:

- Apache Kafka 0.10 or later - https://kafka.apache.org/
- Files in a distributed file system
    - HDFS - https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
    - Amazon S3 - https://aws.amazon.com/es/s3/
- A local socket, for testing purposes - https://docs.python.org/3/library/socket.html
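The notebook below only exercises the file and socket sources. For completeness, here is a minimal sketch of reading from Kafka. The function name `read_kafka_stream` and its parameters are hypothetical, and it assumes the session was started with the `spark-sql-kafka` connector package available.

```python
def read_kafka_stream(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with one string column, `value`.

    `spark` is an existing SparkSession; `bootstrap_servers` and `topic`
    are placeholders for your own Kafka cluster and topic.
    """
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    # Kafka delivers key and value as binary; cast the value to a string
    # so it can be parsed later (for example with from_json).
    return raw.selectExpr("CAST(value AS STRING) AS value")
```

The result has the same single-column shape as the socket source used later in this notebook, so the same JSON parsing step would apply.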

### Sinks 

A sink determines where and how the results of the streaming computation are stored.

- **File Sink**: writes the results to files (CSV, Parquet, JSON).

- **Console Sink**: prints the results to the PySpark console output.

- **Kafka Sink**: sends the results to Kafka.

- **JDBC Sink**: writes the data to a table in a JDBC-compatible database. Structured Streaming has no built-in JDBC streaming format, so this is usually done with `foreachBatch` and the batch JDBC writer.
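As an illustration of two of these sinks, the sketch below starts a Parquet file sink and a JDBC write through `foreachBatch`. The function names, paths, JDBC URL, table name and connection properties are all hypothetical placeholders. The file sink works in append mode and needs a checkpoint location.

```python
def start_parquet_sink(stream_df, output_dir, checkpoint_dir):
    """Write a streaming DataFrame to Parquet files in append mode."""
    return (
        stream_df.writeStream
        .format("parquet")
        .option("path", output_dir)
        .option("checkpointLocation", checkpoint_dir)
        .outputMode("append")
        .start()
    )


def start_jdbc_sink(stream_df, jdbc_url, table, properties, checkpoint_dir):
    """Write every micro-batch to a JDBC table using the batch writer."""

    def write_batch(batch_df, batch_id):
        # Inside foreachBatch, batch_df is a regular (non-streaming) DataFrame,
        # so the ordinary DataFrameWriter.jdbc call is available.
        batch_df.write.jdbc(
            url=jdbc_url, table=table, mode="append", properties=properties
        )

    return (
        stream_df.writeStream
        .foreachBatch(write_batch)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Both functions return the `StreamingQuery` handle, which can later be used to stop the query.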



In [5]:
import findspark
findspark.find()

'/Users/marcelomedel/opt/anaconda3/envs/pyspark/lib/python3.10/site-packages/pyspark'

In [1]:
import findspark
findspark.find()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming").getOrCreate()

spark.active()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/16 20:40:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
data = spark.read.json("data/streaming/streaming.json")
data.printSchema()
data.show()

                                                                                

root
 |-- conn_country: string (nullable = true)
 |-- episode_name: string (nullable = true)
 |-- episode_show_name: string (nullable = true)
 |-- incognito_mode: boolean (nullable = true)
 |-- ip_addr_decrypted: string (nullable = true)
 |-- master_metadata_album_album_name: string (nullable = true)
 |-- master_metadata_album_artist_name: string (nullable = true)
 |-- master_metadata_track_name: string (nullable = true)
 |-- ms_played: long (nullable = true)
 |-- offline: boolean (nullable = true)
 |-- offline_timestamp: long (nullable = true)
 |-- platform: string (nullable = true)
 |-- reason_end: string (nullable = true)
 |-- reason_start: string (nullable = true)
 |-- shuffle: boolean (nullable = true)
 |-- skipped: string (nullable = true)
 |-- spotify_episode_uri: string (nullable = true)
 |-- spotify_track_uri: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- user_agent_decrypted: string (nullable = true)
 |-- username: string (nullable = true)

+------------+---

In [6]:
from pyspark.sql.types import StructType, StructField, StringType, LongType, BooleanType, TimestampType

schema = StructType([
    StructField("conn_country", StringType(), True),
    StructField("episode_name", StringType(), True),
    StructField("episode_show_name", StringType(), True),
    StructField("incognito_mode", BooleanType(), True),
    StructField("ip_addr_decrypted", StringType(), True),
    StructField("master_metadata_album_album_name", StringType(), True),
    StructField("master_metadata_album_artist_name", StringType(), True),
    StructField("master_metadata_track_name", StringType(), True),
    StructField("ms_played", LongType(), True),
    StructField("offline", BooleanType(), True),
    StructField("offline_timestamp", LongType(), True),
    StructField("platform", StringType(), True),
    StructField("reason_end", StringType(), True),
    StructField("reason_start", StringType(), True),
    StructField("shuffle", BooleanType(), True),
    StructField("skipped", BooleanType(), True),
    StructField("spotify_episode_uri", StringType(), True),
    StructField("spotify_track_uri", StringType(), True),
    StructField("ts", TimestampType(), True),
    StructField("user_agent_decrypted", StringType(), True),
    StructField("username", StringType(), True)
])

data_stream = spark.readStream.schema(schema).json("data/streaming/")
data_stream.printSchema()

root
 |-- conn_country: string (nullable = true)
 |-- episode_name: string (nullable = true)
 |-- episode_show_name: string (nullable = true)
 |-- incognito_mode: boolean (nullable = true)
 |-- ip_addr_decrypted: string (nullable = true)
 |-- master_metadata_album_album_name: string (nullable = true)
 |-- master_metadata_album_artist_name: string (nullable = true)
 |-- master_metadata_track_name: string (nullable = true)
 |-- ms_played: long (nullable = true)
 |-- offline: boolean (nullable = true)
 |-- offline_timestamp: long (nullable = true)
 |-- platform: string (nullable = true)
 |-- reason_end: string (nullable = true)
 |-- reason_start: string (nullable = true)
 |-- shuffle: boolean (nullable = true)
 |-- skipped: boolean (nullable = true)
 |-- spotify_episode_uri: string (nullable = true)
 |-- spotify_track_uri: string (nullable = true)
 |-- ts: timestamp (nullable = true)
 |-- user_agent_decrypted: string (nullable = true)
 |-- username: string (nullable = true)



In [7]:
data.isStreaming

False

In [8]:
data_stream.isStreaming

True

In [9]:
data_stream.writeStream.format("console").start()

24/10/16 20:52:00 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-f7967856-7175-4874-8459-de33fb900b8c. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 20:52:00 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


<pyspark.sql.streaming.query.StreamingQuery at 0x10d3c16c0>

-------------------------------------------
Batch: 0
-------------------------------------------
+------------+------------+-----------------+--------------+-----------------+--------------------------------+---------------------------------+--------------------------+---------+-------+-----------------+--------------------+----------+------------+-------+-------+-------------------+--------------------+-------------------+--------------------+-----------+
|conn_country|episode_name|episode_show_name|incognito_mode|ip_addr_decrypted|master_metadata_album_album_name|master_metadata_album_artist_name|master_metadata_track_name|ms_played|offline|offline_timestamp|            platform|reason_end|reason_start|shuffle|skipped|spotify_episode_uri|   spotify_track_uri|                 ts|user_agent_decrypted|   username|
+------------+------------+-----------------+--------------+-----------------+--------------------------------+---------------------------------+--------------------------+---

In [10]:
data_stream.writeStream.format("memory").queryName("tabla").start()

24/10/16 20:52:54 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-84e551e7-61e1-443a-b2ff-cceb448be78f. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 20:52:54 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


<pyspark.sql.streaming.query.StreamingQuery at 0x11463a800>

In [11]:
spark.sql("select * from tabla").show()

+------------+------------+-----------------+--------------+-----------------+--------------------------------+---------------------------------+--------------------------+---------+-------+-----------------+--------------------+----------+------------+-------+-------+-------------------+--------------------+-------------------+--------------------+-----------+
|conn_country|episode_name|episode_show_name|incognito_mode|ip_addr_decrypted|master_metadata_album_album_name|master_metadata_album_artist_name|master_metadata_track_name|ms_played|offline|offline_timestamp|            platform|reason_end|reason_start|shuffle|skipped|spotify_episode_uri|   spotify_track_uri|                 ts|user_agent_decrypted|   username|
+------------+------------+-----------------+--------------+-----------------+--------------------------------+---------------------------------+--------------------------+---------+-------+-----------------+--------------------+----------+------------+-------+-------+---

In [13]:
from pyspark.sql.functions import col, expr, round

data_stream_zip = data_stream.select(
    col("master_metadata_album_album_name").alias("album"),
    col("master_metadata_album_artist_name").alias("artist"),
    col("master_metadata_track_name").alias("track"),
    round(expr("ms_played/1000/60"),1).alias("min_played"),
    col("reason_start"),
    col("ts")
)
data_stream_zip.writeStream.format("memory").queryName("tabla2").start()

24/10/16 20:57:50 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-6f648cc0-c2fd-491b-9298-3fd6e42402d1. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 20:57:50 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


<pyspark.sql.streaming.query.StreamingQuery at 0x11480edd0>

In [14]:
spark.sql("select * from tabla2").show()

+--------------------+------------+--------------------+----------+------------+-------------------+
|               album|      artist|               track|min_played|reason_start|                 ts|
+--------------------+------------+--------------------+----------+------------+-------------------+
|          Audioslave|  Audioslave| Show Me How to Live|       4.6|   trackdone|2017-01-28 11:35:07|
|          Audioslave|  Audioslave|            Gasoline|       4.7|   trackdone|2017-01-28 11:40:01|
|          Audioslave|  Audioslave|        What You Are|       4.2|   trackdone|2017-01-28 11:44:10|
|          Audioslave|  Audioslave|        Like a Stone|       4.9|   trackdone|2017-01-28 11:49:04|
|          Audioslave|  Audioslave|          Set It Off|       4.4|   trackdone|2017-01-28 11:53:51|
|          Audioslave|  Audioslave|   Shadow on the Sun|       5.7|   trackdone|2017-01-28 12:00:02|
|          Audioslave|  Audioslave|    I Am the Highway|       5.6|   trackdone|2017-01-28 
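Every `start()` call above launches a query that keeps running in the background until it is stopped. A small helper for cleaning up between experiments might look like this; `stop_all_queries` is a hypothetical name, and it relies only on `spark.streams.active` and `StreamingQuery.stop()`.

```python
def stop_all_queries(spark):
    """Stop every active streaming query and return how many were stopped."""
    stopped = 0
    for query in spark.streams.active:
        # Each item is a StreamingQuery handle like the ones returned by start().
        query.stop()
        stopped += 1
    return stopped
```

Calling `stop_all_queries(spark)` before starting a new set of queries avoids accumulating idle streams in the same session.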

### Output modes

- **Complete mode**:
    - Sends the entire computed result to the sink on every trigger.
    - Useful for stateful results that change over time, such as running aggregations.
    - Useful when the sink does not support row-level updates.
- **Update mode**:
    - Sends only the rows that changed since the last write to the sink.
    - The sink must support row-level updates.
    - If the query contains no aggregations, it is equivalent to *append* mode.
- **Append mode**:
    - Only new rows are sent to the sink.
    - Guarantees that each row is output once and only once.
    - The sink must be fault tolerant.
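To make the difference between complete and update mode concrete, here is a toy simulation in plain Python. It does not use Spark. It keeps a running count per key, similar to `groupBy("reason_start").count()`, and records what each mode would emit after every micro-batch. The function name and the sample values are illustrative only.

```python
from collections import Counter


def simulate_output_modes(batches):
    """Return, per micro-batch, what complete and update mode would emit."""
    state = Counter()
    results = []
    for batch in batches:
        previous = dict(state)
        state.update(batch)  # fold the new rows into the running counts
        complete = dict(state)  # complete mode: the whole result table
        # update mode: only the keys whose count changed in this batch
        update = {k: v for k, v in state.items() if previous.get(k) != v}
        results.append({"complete": complete, "update": update})
    return results


outputs = simulate_output_modes([
    ["trackdone", "fwdbtn"],
    ["trackdone", "clickrow"],
])
for batch_id, out in enumerate(outputs):
    print(batch_id, out)
```

In the second batch, complete mode re-emits all three keys, while update mode emits only `trackdone` and `clickrow`, the two counts that changed.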

In [15]:
data_stream_zip.groupBy("reason_start").count().writeStream.format("console").outputMode("complete").start()

24/10/16 21:02:18 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-0b2cddd7-c080-48eb-a8a4-7b47a1ee5c78. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 21:02:18 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


<pyspark.sql.streaming.query.StreamingQuery at 0x11480e290>



CodeCache: size=131072Kb used=37907Kb max_used=37908Kb free=93164Kb
 bounds [0x00000001069e8000, 0x0000000108f28000, 0x000000010e9e8000]
 total_blobs=13920 nmethods=12912 adapters=919
 compilation: disabled (not enough contiguous free space left)


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------------+-----+
|reason_start|count|
+------------+-----+
|     appload|  137|
|     backbtn|   78|
|   trackdone|10540|
|     playbtn|    2|
|      fwdbtn| 1940|
|  trackerror|   71|
|    clickrow| 3359|
|      remote|   34|
+------------+-----+



In [26]:
from pyspark.sql.functions import count, approx_count_distinct, expr, sum, first, last #, count_distinct

data_stream.groupBy("platform").agg(
    count("*").alias("count"),
    approx_count_distinct("master_metadata_track_name").alias("qd_tracks"),
    #count_distinct("master_metadata_track_name").alias("qd_tracks2"),
    sum(round(expr("ms_played/1000/60"),1)).alias("min_played"),
    first("ts").alias("first_ts"),
    last("ts").alias("last_ts")
# Alternative: write to the memory sink as table "tabla3" so it can be queried with spark.sql
#).writeStream.format("memory").outputMode("complete").queryName("tabla3").start()
).writeStream.format("console").outputMode("complete").start()

24/10/16 21:14:30 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-65ada7ee-6831-433f-a2a4-9e242a97875b. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 21:14:30 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


<pyspark.sql.streaming.query.StreamingQuery at 0x114639fc0>

                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+-----+---------+------------------+-------------------+-------------------+
|            platform|count|qd_tracks|        min_played|           first_ts|            last_ts|
+--------------------+-----+---------+------------------+-------------------+-------------------+
|iOS 11.1.1 (iPhon...|  288|      259| 1170.499999999999|2017-11-13 19:00:18|2017-11-20 22:24:01|
|iOS 10.1.1 (iPad4,1)|   12|       11|              18.0|2017-02-12 13:13:33|2017-02-12 13:25:44|
|iOS 10.3.1 (iPhon...| 1025|      852|3225.1999999999903|2017-04-20 08:25:15|2017-05-28 15:42:29|
|iOS 11.2.1 (iPhon...|  190|      178| 833.0000000000002|2017-12-20 09:08:39|2017-12-25 15:05:50|
|OS X 10.12.6 [x86 8]|    9|        8|              42.1|2017-09-05 22:48:11|2017-10-23 11:39:19|
|iOS 11.1.2 (iPhon...| 1186|      932|3802.2999999999925|2017-11-21 08:58:48|2017-12-20 02:26:33|
|OS X 10.12.5 [x86 8]

In [23]:
data_stream.isStreaming

True

In [21]:
spark.sql("select * from tabla3").show()

+--------------------+-----+---------+------------------+-------------------+-------------------+
|            platform|count|qd_tracks|        min_played|           first_ts|            last_ts|
+--------------------+-----+---------+------------------+-------------------+-------------------+
|iOS 11.1.1 (iPhon...|  288|      259| 1170.499999999999|2017-11-13 19:00:18|2017-11-20 22:24:01|
|iOS 10.1.1 (iPad4,1)|   12|       11|              18.0|2017-02-12 13:13:33|2017-02-12 13:25:44|
|iOS 10.3.1 (iPhon...| 1025|      852|3225.1999999999903|2017-04-20 08:25:15|2017-05-28 15:42:29|
|iOS 11.2.1 (iPhon...|  190|      178| 833.0000000000002|2017-12-20 09:08:39|2017-12-25 15:05:50|
|OS X 10.12.6 [x86 8]|    9|        8|              42.1|2017-09-05 22:48:11|2017-10-23 11:39:19|
|iOS 11.1.2 (iPhon...| 1186|      932|3802.2999999999925|2017-11-21 08:58:48|2017-12-20 02:26:33|
|OS X 10.12.5 [x86 8]|    1|        1|               0.2|2017-08-11 21:54:47|2017-08-11 21:54:47|
|Windows 10 (10.0...

### Triggers

Triggers in Spark Structured Streaming control when data is sent to the sink. By default, a new micro-batch starts as soon as the previous one finishes processing. Triggers help avoid overwhelming the sink with too many updates and help control the size of output files. Two types of trigger are covered here:

- **Processing Time Trigger**: You specify an interval (e.g. "10 seconds") and Spark starts micro-batches at multiples of that interval. If a micro-batch takes longer than the interval, Spark logs a "Current batch is falling behind" warning and, according to the Spark documentation, starts the next micro-batch as soon as the previous one completes rather than waiting for the next interval boundary.
- **Once Trigger**: Runs the streaming job a single time. It is useful in development (to test an application against one set of data) and in production (to run jobs manually at a low frequency, saving resources).
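Both trigger types are set through `DataStreamWriter.trigger`. The sketch below wraps that choice in a hypothetical helper, `start_with_trigger`. The console sink and append mode are assumptions made for the example; an aggregation query would normally use `complete` or `update` instead.

```python
def start_with_trigger(stream_df, once=False, interval="10 seconds"):
    """Start a console query with either a once or a processing-time trigger."""
    writer = stream_df.writeStream.format("console").outputMode("append")
    if once:
        # Process the data available now in a single batch, then stop.
        writer = writer.trigger(once=True)
    else:
        # Start a micro-batch at every multiple of `interval`.
        writer = writer.trigger(processingTime=interval)
    return writer.start()
```

Recent Spark releases also accept `trigger(availableNow=True)`, which plays a similar role to the once trigger but may split the work into several batches.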


In [31]:
from pyspark.sql.functions import date_trunc

(
    data_stream
    .groupBy(date_trunc("day", "ts").alias("day"))
    .agg(
        count("*").alias("count"),
        approx_count_distinct("master_metadata_track_name").alias("qd_tracks"),
        sum(expr("ms_played/1000/60")).alias("sum_min_played"),
        first("ts").alias("first_ts"),
        last("ts").alias("last_ts"),
    )
    .sort("day")
    .writeStream
    .trigger(processingTime="5 seconds")
    .format("console")
    .outputMode("complete")
    .start()
)

24/10/16 21:27:55 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-f860cc71-2baa-4f77-b105-e8bcd96ee6ec. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 21:27:55 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


<pyspark.sql.streaming.query.StreamingQuery at 0x1149c6d40>

                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+-----+---------+------------------+-------------------+-------------------+
|                day|count|qd_tracks|    sum_min_played|           first_ts|            last_ts|
+-------------------+-----+---------+------------------+-------------------+-------------------+
|2017-01-28 00:00:00|   82|       76| 290.1960166666667|2017-01-28 11:35:07|2017-01-28 22:38:53|
|2017-01-29 00:00:00|   50|       38| 123.4223833333333|2017-01-29 04:28:12|2017-01-29 17:14:49|
|2017-01-30 00:00:00|   33|       32|108.40068333333333|2017-01-30 08:44:16|2017-01-30 16:07:14|
|2017-01-31 00:00:00|   57|       35|208.20551666666663|2017-01-31 08:39:29|2017-01-31 18:26:37|
|2017-02-01 00:00:00|   80|       70|260.59501666666665|2017-02-01 00:58:17|2017-02-01 23:06:27|
|2017-02-02 00:00:00|  116|       98|321.07038333333327|2017-02-02 09:05:14|2017-02-02 22:49:46|
|2017-02-03 00:00:00|   30|   

24/10/16 21:28:05 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 9370 milliseconds


## Streaming from socket

Use this code to simulate sending messages through a socket -->
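A minimal sketch of such a sender, not the original script, could look like this. It assumes one JSON record per line, since Spark's socket source connects as a client and reads UTF-8 text lines. The function name `serve_json_lines` and its parameters are hypothetical; the default host and port match the `localhost:9999` used in the cells below.

```python
import json
import socket
import time


def serve_json_lines(records, host="localhost", port=9999, delay=1.0):
    """Wait for one client and send it each record as a JSON line."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        # Allow quick restarts of the script on the same port.
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _addr = server.accept()  # blocks until the reader connects
        with conn:
            for record in records:
                line = json.dumps(record) + "\n"
                conn.sendall(line.encode("utf-8"))
                time.sleep(delay)  # pace the messages to mimic a live stream
```

Run it in a separate terminal or process before starting the streaming query, passing a list of dictionaries whose keys match the schema defined below. The connection closes once all records have been sent.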


In [32]:
import findspark
findspark.find()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-socket").getOrCreate()

spark.active()

24/10/16 21:34:39 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [33]:
from pyspark.sql.types import StructType, StructField, StringType, LongType, BooleanType, TimestampType

schema = StructType([
    StructField("conn_country", StringType(), True),
    StructField("episode_name", StringType(), True),
    StructField("episode_show_name", StringType(), True),
    StructField("incognito_mode", BooleanType(), True),
    StructField("ip_addr_decrypted", StringType(), True),
    StructField("master_metadata_album_album_name", StringType(), True),
    StructField("master_metadata_album_artist_name", StringType(), True),
    StructField("master_metadata_track_name", StringType(), True),
    StructField("ms_played", LongType(), True),
    StructField("offline", BooleanType(), True),
    StructField("offline_timestamp", LongType(), True),
    StructField("platform", StringType(), True),
    StructField("reason_end", StringType(), True),
    StructField("reason_start", StringType(), True),
    StructField("shuffle", BooleanType(), True),
    StructField("skipped", BooleanType(), True),
    StructField("spotify_episode_uri", StringType(), True),
    StructField("spotify_track_uri", StringType(), True),
    StructField("ts", TimestampType(), True),
    StructField("user_agent_decrypted", StringType(), True),
    StructField("username", StringType(), True)
])

raw_stream = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)
        

24/10/16 21:35:13 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.


In [34]:
raw_stream.printSchema()

root
 |-- value: string (nullable = true)



In [36]:
from pyspark.sql.functions import from_json, col

parsed_stream = raw_stream.select(from_json(col("value"), schema).alias("data")).select("data.*")

In [45]:
from pyspark.sql.functions import count, approx_count_distinct, sum, avg, date_trunc, expr

query = (
    parsed_stream
    .groupBy(date_trunc("day", "ts").alias("day"), "master_metadata_album_artist_name")
    .agg(
        count("master_metadata_track_name").alias("q_tracks"),
        approx_count_distinct("master_metadata_track_name").alias("qd_tracks"),
        sum(expr("ms_played/1000/60")).alias("sum_min_played"),
        avg(expr("ms_played/1000/60")).alias("avg_min_played"),
    )
    .writeStream
    .format("console")
    .outputMode("complete")
    .start()
)

# query.awaitTermination()

24/10/16 21:50:57 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-b5736bbc-f274-4d14-a1d3-4ff2bbe253f4. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 21:50:57 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+---+---------------------------------+--------+---------+--------------+--------------+
|day|master_metadata_album_artist_name|q_tracks|qd_tracks|sum_min_played|avg_min_played|
+---+---------------------------------+--------+---------+--------------+--------------+
+---+---------------------------------+--------+---------+--------------+--------------+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|2017-01-28 00:00:00|                       Audioslave|       4|        4|18.342216666666666|4.585554166666666|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|   sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|2017-01-28 00:00:00|                       Audioslave|       7|        7|34.03398333333333|4.861997619047619|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|2017-01-28 00:00:00|                     Public Enemy|       1|        1| 4.158683333333333|4.158683333333333|
|2017-01-28 00:00:00|                       Audioslave|       8|        8|35.007466666666666|4.375933333333333|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+



                                                                                

-------------------------------------------
Batch: 4
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|2017-01-28 00:00:00|                     Public Enemy|       4|        4|18.998683333333332|4.749670833333333|
|2017-01-28 00:00:00|                       Audioslave|       8|        8|35.007466666666666|4.375933333333333|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+



                                                                                

-------------------------------------------
Batch: 5
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|    avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+------------------+
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|25.276733333333333|3.6109619047619046|
|2017-01-28 00:00:00|                      The Prodigy|       1|        1| 4.333333333333333| 4.333333333333333|
|2017-01-28 00:00:00|                       Audioslave|       8|        8|35.007466666666666| 4.375933333333333|
+-------------------+---------------------------------+--------+---------+------------------+------------------+



                                                                                

-------------------------------------------
Batch: 6
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.276733333333333|  3.6109619047619046|
|2017-01-28 00:00:00|                           N.W.A.|       1|        1|  3.8494666666666664|  3.8494666666666664|
|2017-01-28 00:00:00|                            Adele|       1|        1|0.027466666666666664|0.027466666666666664|
|2017-01-28 00:00:00|                      The Prodigy|       1|        1|   4.333333333333333|   4.333333333333333|
|2017-01-28 00:00:00|                       Ana Tijoux|       1|        1|0.01701666

                                                                                

-------------------------------------------
Batch: 7
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.276733333333333|  3.6109619047619046|
|2017-01-28 00:00:00|                             Moby|       1|        1|0.012766666666666667|0.012766666666666667|
|2017-01-28 00:00:00|                           N.W.A.|       1|        1|  3.8494666666666664|  3.8494666666666664|
|2017-01-28 00:00:00|                            Adele|       1|        1|0.027466666666666664|0.027466666666666664|
|2017-01-28 00:00:00|                   Massive Attack|       1|        1|          

                                                                                

-------------------------------------------
Batch: 8
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.276733333333333|  3.6109619047619046|
|2017-01-28 00:00:00|                             Moby|       1|        1|0.012766666666666667|0.012766666666666667|
|2017-01-28 00:00:00|                           N.W.A.|       1|        1|  3.8494666666666664|  3.8494666666666664|
|2017-01-28 00:00:00|                            Adele|       1|        1|0.027466666666666664|0.027466666666666664|
|2017-01-28 00:00:00|                   Massive Attack|       2|        1|          

                                                                                

-------------------------------------------
Batch: 9
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.276733333333333|  3.6109619047619046|
|2017-01-28 00:00:00|                             Moby|       1|        1|0.012766666666666667|0.012766666666666667|
|2017-01-28 00:00:00|                           N.W.A.|       1|        1|  3.8494666666666664|  3.8494666666666664|
|2017-01-28 00:00:00|                   Gustavo Cerati|       1|        1|          

                                                                                

-------------------------------------------
Batch: 10
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                        Morrissey|       1|        1|  3.5877666666666665|  3.5877666666666665|
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.276733333333333|  3.6109619047619046|
|2017-01-28 00:00:00|                    Amy Winehouse|       1|        1|  4.2824333333333335|  4.2824333333333335|
|2017-01-28 00:00:00|                             Moby|       1|        1|0.0127666

                                                                                

-------------------------------------------
Batch: 11
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                    The Gathering|       1|        1| 0.01933333333333333| 0.01933333333333333|
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                        Morrissey|       2|        2|             3.60595|            1.802975|
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.276733333333333|  3.6109619047619046|
|2017-01-28 00:00:00|                    Amy Winehouse|       1|        1|  4.28243

                                                                                

-------------------------------------------
Batch: 12
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                    The Gathering|       2|        2|             0.03945|            0.019725|
|2017-01-28 00:00:00|                            Björk|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                        Morrissey|       2|        2|             3.60595|            1.802975|
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.2767

                                                                                

-------------------------------------------
Batch: 13
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                    The Gathering|       2|        2|             0.03945|            0.019725|
|2017-01-28 00:00:00|                            Björk|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                        Morrissey|       2|        2|             3.60595|            1.802975|
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.2767

                                                                                

-------------------------------------------
Batch: 14
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                    The Gathering|       2|        2|             0.03945|            0.019725|
|2017-01-28 00:00:00|                            Björk|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                        Morrissey|       2|        2|             3.60595|            1.802975|
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.2767

                                                                                

-------------------------------------------
Batch: 15
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                       Pink Floyd|       1|        1|  2.0478333333333336|  2.0478333333333336|
|2017-01-28 00:00:00|                    The Gathering|       2|        2|             0.03945|            0.019725|
|2017-01-28 00:00:00|                            Björk|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                  The Cranberries|       1|        1|0.0421666

24/10/16 21:51:47 WARN TextSocketMicroBatchStream: Stream closed by localhost:9999
                                                                                

-------------------------------------------
Batch: 16
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                       Pink Floyd|       1|        1|  2.0478333333333336|  2.0478333333333336|
|2017-01-28 00:00:00|                    The Gathering|       2|        2|             0.03945|            0.019725|
|2017-01-28 00:00:00|                            Björk|       2|        2|  3.9366666666666665|  1.9683333333333333|
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                  The Cranberries|       1|        1|0.0421666

                                                                                

-------------------------------------------
Batch: 17
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                       Pink Floyd|       1|        1|  2.0478333333333336|  2.0478333333333336|
|2017-01-28 00:00:00|                    The Gathering|       2|        2|             0.03945|            0.019725|
|2017-01-28 00:00:00|                            Björk|       2|        2|  3.9366666666666665|  1.9683333333333333|
|2017-01-28 00:00:00|                          Placebo|       1|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                  The Cranberries|       1|        1|0.0421666

In [43]:
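# Note: `sum` here is Spark's column function and shadows Python's built-in `sum`.
from pyspark.sql.functions import count, approx_count_distinct, sum, avg, date_trunc, expr

# Per day and artist: number of plays, approximate number of distinct tracks,
# and total / average minutes played (ms_played / 1000 / 60).
query = (
    parsed_stream
    .groupBy(date_trunc("day", "ts").alias("day"), "master_metadata_album_artist_name")
    .agg(
        count("master_metadata_track_name").alias("q_tracks"),
        approx_count_distinct("master_metadata_track_name").alias("qd_tracks"),
        sum(expr("ms_played/1000/60")).alias("sum_min_played"),
        avg(expr("ms_played/1000/60")).alias("avg_min_played"),
    )
    .writeStream
    .format("console")
    # "update" prints only the groups whose aggregates changed in each micro-batch
    .outputMode("update")
    .start()
)

# query.awaitTermination()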

24/10/16 21:48:34 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-5423a5d3-279e-4e20-8c49-8439664affd9. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 21:48:34 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+---+---------------------------------+--------+---------+--------------+--------------+
|day|master_metadata_album_artist_name|q_tracks|qd_tracks|sum_min_played|avg_min_played|
+---+---------------------------------+--------+---------+--------------+--------------+
+---+---------------------------------+--------+---------+--------------+--------------+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|2017-01-28 00:00:00|                       Audioslave|       4|        4|18.342216666666666|4.585554166666666|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|   sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|2017-01-28 00:00:00|                       Audioslave|       7|        7|34.03398333333333|4.861997619047619|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|2017-01-28 00:00:00|                     Public Enemy|       2|        2|           8.86935|         4.434675|
|2017-01-28 00:00:00|                       Audioslave|       8|        8|35.007466666666666|4.375933333333333|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+



                                                                                

-------------------------------------------
Batch: 4
-------------------------------------------
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|   sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|2017-01-28 00:00:00|                     Public Enemy|       5|        5|20.62466666666667|4.124933333333334|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+



                                                                                

-------------------------------------------
Batch: 5
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|2017-01-28 00:00:00|                     Public Enemy|       6|        6|24.392000000000003|4.065333333333334|
|2017-01-28 00:00:00|                      The Prodigy|       1|        1| 4.333333333333333|4.333333333333333|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+



                                                                                

-------------------------------------------
Batch: 6
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                     Public Enemy|       7|        7|  25.276733333333336|   3.610961904761905|
|2017-01-28 00:00:00|                           N.W.A.|       1|        1|  3.8494666666666664|  3.8494666666666664|
|2017-01-28 00:00:00|                            Adele|       1|        1|0.027466666666666664|0.027466666666666664|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+



                                                                                

-------------------------------------------
Batch: 7
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                             Moby|       1|        1|0.012766666666666667|0.012766666666666667|
|2017-01-28 00:00:00|                     Depeche Mode|       1|        1|0.014316666666666667|0.014316666666666667|
|2017-01-28 00:00:00|                       Ana Tijoux|       1|        1|0.017016666666666666|0.017016666666666666|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+



                                                                                

-------------------------------------------
Batch: 8
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                   Massive Attack|       2|        1|                 0.0|                 0.0|
|2017-01-28 00:00:00|                     Depeche Mode|       2|        1|0.014316666666666667|0.007158333333333333|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+



                                                                                

-------------------------------------------
Batch: 9
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------+--------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|sum_min_played|avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------+--------------+
|2017-01-28 00:00:00|                          Placebo|       1|        1|           0.0|           0.0|
|2017-01-28 00:00:00|                   Gustavo Cerati|       1|        1|        0.0503|        0.0503|
|2017-01-28 00:00:00|                       The Police|       1|        1|           0.0|           0.0|
+-------------------+---------------------------------+--------+---------+--------------+--------------+



                                                                                

-------------------------------------------
Batch: 10
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                    Amy Winehouse|       1|        1|  4.2824333333333335|  4.2824333333333335|
|2017-01-28 00:00:00|                               U2|       1|        1|0.005416666666666667|0.005416666666666667|
|2017-01-28 00:00:00|                    Kings of Leon|       1|        1|  3.8459999999999996|  3.8459999999999996|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+



                                                                                

-------------------------------------------
Batch: 11
-------------------------------------------
+-------------------+---------------------------------+--------+---------+-------------------+-------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|     sum_min_played|     avg_min_played|
+-------------------+---------------------------------+--------+---------+-------------------+-------------------+
|2017-01-28 00:00:00|                        Morrissey|       2|        2|            3.60595|           1.802975|
|2017-01-28 00:00:00|                       Blackfield|       1|        1|0.41828333333333334|0.41828333333333334|
+-------------------+---------------------------------+--------+---------+-------------------+-------------------+



                                                                                

-------------------------------------------
Batch: 12
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------+--------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|sum_min_played|avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------+--------------+
|2017-01-28 00:00:00|                    The Gathering|       2|        2|       0.03945|      0.019725|
|2017-01-28 00:00:00|                        Katatonia|       1|        1|           0.0|           0.0|
+-------------------+---------------------------------+--------+---------+--------------+--------------+



                                                                                

-------------------------------------------
Batch: 13
-------------------------------------------
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|   sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+
|2017-01-28 00:00:00|                            Björk|       1|        1|              0.0|              0.0|
|2017-01-28 00:00:00|                      David Bowie|       1|        1|4.807766666666667|4.807766666666667|
+-------------------+---------------------------------+--------+---------+-----------------+-----------------+



                                                                                

-------------------------------------------
Batch: 14
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|    avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+------------------+
|2017-01-28 00:00:00|                             Moby|       2|        2|4.0323166666666665|2.0161583333333333|
|2017-01-28 00:00:00|                     Depeche Mode|       4|        3| 8.410083333333333| 2.102520833333333|
+-------------------+---------------------------------+--------+---------+------------------+------------------+



                                                                                

-------------------------------------------
Batch: 15
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                       Pink Floyd|       1|        1|  2.0478333333333336|  2.0478333333333336|
|2017-01-28 00:00:00|                  The Cranberries|       1|        1|0.042166666666666665|0.042166666666666665|
|2017-01-28 00:00:00|                 Jay-Jay Johanson|       1|        1|   6.012216666666666|   6.012216666666666|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+



                                                                                

-------------------------------------------
Batch: 16
-------------------------------------------
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|      sum_min_played|      avg_min_played|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+
|2017-01-28 00:00:00|                            Björk|       2|        2|  3.9366666666666665|  1.9683333333333333|
|2017-01-28 00:00:00|                   Porcupine Tree|       1|        1|   5.940433333333333|   5.940433333333333|
|2017-01-28 00:00:00|                         Coldplay|       1|        1|0.036366666666666665|0.036366666666666665|
+-------------------+---------------------------------+--------+---------+--------------------+--------------------+



24/10/16 21:49:24 WARN TextSocketMicroBatchStream: Stream closed by localhost:9999
                                                                                

-------------------------------------------
Batch: 17
-------------------------------------------
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|                day|master_metadata_album_artist_name|q_tracks|qd_tracks|    sum_min_played|   avg_min_played|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
|2017-01-28 00:00:00|                  John McLaughlin|       1|        1| 5.004916666666667|5.004916666666667|
|2017-01-28 00:00:00|                      Soda Stereo|       2|        2|13.276433333333333|6.638216666666667|
+-------------------+---------------------------------+--------+---------+------------------+-----------------+
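
The batches above print only the artists whose aggregates changed in that micro-batch, which is what `outputMode("update")` asks for; `"complete"` would instead reprint the whole result table on every trigger. The following is a minimal pure-Python sketch of that difference, not Spark's implementation. `run_batches` and the sample `batches` are hypothetical, chosen so that the running counts match the `q_tracks` values printed for Batches 1 to 4.

```python
from collections import Counter


def run_batches(batches, output_mode="update"):
    """Yield, per micro-batch, the counts a console sink would show.

    Illustration only: it models a running count per key, not Spark's state store.
    """
    state = Counter()  # running aggregate kept across micro-batches
    for batch in batches:
        changed = Counter(batch)  # keys touched by this micro-batch
        state.update(changed)     # fold the new rows into the running counts
        if output_mode == "update":
            # only the keys whose aggregate changed in this batch
            yield {key: state[key] for key in changed}
        elif output_mode == "complete":
            # the full result table, changed or not
            yield dict(state)
        else:
            raise ValueError(f"unsupported output mode: {output_mode}")


# Hypothetical input: one artist name per played track, grouped by micro-batch.
batches = [
    ["Audioslave"] * 4,
    ["Audioslave"] * 3,
    ["Public Enemy"] * 2 + ["Audioslave"],
    ["Public Enemy"] * 3,
]

for mode in ("update", "complete"):
    for number, rows in enumerate(run_batches(batches, mode), start=1):
        print(mode, "batch", number, rows)
```

In `update` mode the fourth batch yields only `Public Enemy`, while in `complete` mode it also carries the unchanged `Audioslave` row.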



In [48]:
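# Read newline-delimited text from a local socket (for testing only, see the warning below)
socket_stream = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .option("includeTimestamp", True)  # adds a `timestamp` column next to `value`
    .load()
)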

24/10/16 21:55:18 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.
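
The notebook does not show what is listening on `localhost:9999`; a common choice for this kind of test is `nc -lk 9999`. As an alternative, the sketch below is a minimal standard-library server that accepts a single client (here, Spark's socket source) and sends it newline-terminated UTF-8 lines. The name `serve_lines`, its parameters, and the file name in the usage note are assumptions for illustration and are not part of the original notebook.

```python
import socket
import time


def serve_lines(lines, host="localhost", port=9999, delay=1.0, ready=None):
    """Accept one client and send it each line, newline-terminated.

    `ready` is an optional threading.Event that is set once the server is
    listening, so a caller can wait before connecting.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        # allow quick restarts on the same port during testing
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        if ready is not None:
            ready.set()
        conn, _addr = server.accept()  # blocks until a client connects
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
                time.sleep(delay)  # spread the lines over several micro-batches
    # leaving the `with` blocks closes the connection and the server socket
```

One way to use it is to run it in a separate process or thread before starting the streaming query, for example `serve_lines(open("quijote.txt", encoding="utf-8").read().splitlines())`, where `quijote.txt` is a hypothetical file. Once all lines are sent the connection closes, which would be consistent with the `Stream closed by localhost:9999` warning that appears in the outputs.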


In [49]:
socket_stream.printSchema()

root
 |-- value: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)



In [53]:
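from pyspark.sql.functions import col, explode, split

# Split each incoming line on single spaces into an array of words.
# Punctuation and case are kept, so "vizcaíno" and "vizcaíno," count as different words.
words_data = socket_stream.select(split("value", " ").alias("words"))

# One row per word, then a running count per distinct word
word_count = (
    words_data
    .select(explode("words").alias("word"))
    .groupBy(col("word"))
    .count()
)

# query = words_data.writeStream.format("console").start()
query = word_count.writeStream.outputMode("update").format("console").start()

# query.awaitTermination()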

24/10/16 21:58:15 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/yc/93pqpkn533n6nlff1n38pb2c0000gn/T/temporary-3b28e9f6-f2a5-4bfc-843f-e1629b729ac4. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/10/16 21:58:15 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|      don|    2|
|      del|    1|
|   Mancha|    1|
|   salida|    1|
|   tierra|    1|
|       de|    4|
|    trata|    2|
|     hizo|    1|
|       el|    1|
|       su|    1|
|ejercicio|    1|
|   famoso|    1|
|condición|    1|
|        y|    1|
|      que|    1|
|  hidalgo|    1|
|      Que|    1|
|ingenioso|    1|
|       la|    3|
|  Quijote|    2|
+---------+-----+
only showing top 20 rows



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|       en|    1|
|      don|    3|
|   cuando|    1|
|  nuestro|    1|
|       De|    1|
|     tuvo|    1|
|   cuenta|    1|
|  armarse|    1|
|       de|    5|
| graciosa|    1|
|    venta|    1|
|   manera|    1|
|  sucedió|    1|
|caballero|    2|
|    Donde|    1|
|      que|    3|
|    salió|    1|
|        a|    1|
|       le|    1|
|       se|    1|
+---------+-----+
only showing top 20 rows



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+----------+-----+
|      word|count|
+----------+-----+
| narración|    1|
|   nuestro|    2|
|  prosigue|    1|
|        de|    7|
|        el|    3|
| caballero|    3|
|         y|    3|
| desgracia|    1|
|    donoso|    1|
|     Donde|    2|
|       que|    4|
|escrutinio|    1|
|    grande|    1|
|        se|    2|
|      cura|    1|
|        la|    7|
|       Del|    1|
+----------+-----+



                                                                                

-------------------------------------------
Batch: 4
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|       en|    2|
|      don|    4|
|  nuestro|    4|
|  segunda|    1|
|       De|    2|
| librería|    1|
|   Mancha|    2|
|  barbero|    1|
|   salida|    2|
|       de|   10|
|caballero|    4|
|  hidalgo|    2|
| hicieron|    1|
|ingenioso|    2|
|     buen|    1|
|       la|   10|
|  Quijote|    4|
+---------+-----+



                                                                                

-------------------------------------------
Batch: 5
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|         en|    3|
|        don|    5|
|      jamás|    1|
|     felice|    1|
|     suceso|    1|
|       tuvo|    2|
|    molinos|    1|
|  imaginada|    1|
|     dignos|    1|
|         de|   13|
|        con|    1|
|         el|    4|
| espantable|    1|
|   aventura|    1|
|          y|    4|
|        que|    5|
|recordación|    1|
|    viento,|    1|
|    sucesos|    1|
|       buen|    2|
+-----------+-----+
only showing top 20 rows



                                                                                

-------------------------------------------
Batch: 6
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|      don|    6|
| vizcaíno|    1|
|vizcaíno,|    1|
| gallardo|    1|
|      fin|    1|
| concluye|    1|
|       De|    3|
| valiente|    1|
|      con|    2|
|       el|    7|
|  batalla|    1|
|        y|    7|
|    Donde|    3|
|      que|    7|
|    avino|    1|
| manchego|    1|
| tuvieron|    1|
|estupenda|    1|
|        a|    3|
|       le|    2|
+---------+-----+
only showing top 20 rows



                                                                                

-------------------------------------------
Batch: 7
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|       en|    4|
|      una|    1|
|      don|    7|
|      del|    2|
|       De|    4|
| cabreros|    1|
|      vio|    1|
|       de|   14|
|  peligro|    1|
|    turba|    1|
|      con|    4|
|  sucedió|    2|
|     unos|    1|
|      que|    9|
|yangüeses|    1|
|        a|    4|
|       le|    3|
|       se|    4|
|  Quijote|    7|
|       lo|    3|
+---------+-----+



                                                                                

-------------------------------------------
Batch: 8
-------------------------------------------
+--------+-----+
|    word|count|
+--------+-----+
|     don|    8|
| pastora|    1|
|Marcela,|    1|
| estaban|    1|
|     fin|    2|
|      De|    5|
|  cuento|    1|
| cabrero|    1|
|      de|   15|
|     con|    6|
|   contó|    1|
|      al|    1|
|     que|   11|
|   Donde|    4|
|      un|    1|
|       a|    5|
|      se|    5|
|      la|   13|
|     los|    2|
| Quijote|    8|
+--------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 9
-------------------------------------------
+------------+-----+
|        word|count|
+------------+-----+
|      versos|    1|
|desesperados|    1|
|         del|    3|
|       ponen|    1|
|   esperados|    1|
|     pastor,|    1|
|         con|    7|
|     difunto|    1|
|       Donde|    5|
|          no|    1|
|     sucesos|    3|
|          se|    6|
|         los|    3|
|       otros|    3|
+------------+-----+

-------------------------------------------
Batch: 10
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|         en|    6|
|        don|    9|
|desgraciada|    1|
|      topar|    1|
|         De|    6|
|     cuenta|    2|
|        con|    8|
|      venta|    2|
|   aventura|    2|
|    sucedió|    3|
|         al|    2|
|       unos|    2|
|       topó|    1|
|      Donde|    6|
|        que|   13|
|    hidalgo|    3|
|  yangüeses|    2|
|  ingenioso|    3|
|         le|    4|
|         se|    8|
+-----------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 11
-------------------------------------------
+------------+-----+
|        word|count|
+------------+-----+
|          en|    7|
|         don|   10|
|    trabajos|    1|
|innumerables|    1|
|          él|    1|
|   prosiguen|    1|
|       bravo|    1|
|   imaginaba|    1|
|          el|    8|
|       venta|    3|
|          su|    2|
|           y|    8|
|    escudero|    1|
|         que|   15|
|       Donde|    7|
|      Sancho|    1|
|         ser|    1|
|          se|    9|
|        buen|    3|
|       Panza|    1|
+------------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 12
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|     que,|    1|
|    señor|    1|
|  razones|    1|
|      Don|    1|
|     mal,|    1|
|    otras|    1|
|      era|    1|
|      las|    1|
|      por|    1|
|     pasó|    1|
|    pensó|    1|
|   dignas|    1|
|       de|   16|
|      con|   10|
|       su|    4|
|aventuras|    1|
| Quijote,|    1|
|      que|   17|
|    Donde|    8|
|  cuentan|    1|
+---------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 13
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|   pasaba|    1|
|  muerto,|    1|
|       De|    7|
|  razones|    2|
|      las|    2|
|       de|   17|
| contadas|    1|
|      con|   13|
|       su|    5|
| aventura|    3|
|  sucedió|    4|
|        y|    9|
|      que|   19|
|   Sancho|    3|
|       un|    2|
|   cuerpo|    1|
|discretas|    1|
|       le|    5|
|       la|   17|
|    otros|    4|
+---------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 14
-------------------------------------------
+---------------+-----+
|           word|count|
+---------------+-----+
|             en|    8|
|          jamás|    2|
|           oída|    1|
|        acabada|    1|
|            fue|    1|
|             De|    8|
|             ni|    1|
|acontecimientos|    1|
|             de|   18|
|        peligro|    2|
|         mundo,|    1|
|            con|   14|
|             el|    9|
|        famosos|    1|
|           poco|    1|
|       aventura|    4|
|         famoso|    2|
|          vista|    1|
|      caballero|    5|
|            que|   21|
+---------------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 15
-------------------------------------------
+----------+-----+
|      word|count|
+----------+-----+
|       don|   11|
|      rica|    1|
|       del|    4|
|   nuestro|    5|
|invencible|    1|
|    Mancha|    3|
|     otras|    2|
| sucedidas|    1|
| Mambrino,|    1|
|        de|   21|
|     trata|    3|
|       con|   15|
|        el|   10|
|     yelmo|    1|
|  aventura|    5|
|         y|   10|
|     cosas|    1|
|       Que|    2|
|      alta|    1|
|  ganancia|    1|
+----------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 16
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|        don|   12|
|       que,|    2|
|        mal|    1|
|         De|    9|
|      donde|    1|
|     grado,|    1|
|         de|   22|
|         su|    6|
|  caballero|    6|
|   llevaban|    1|
|  quisieran|    1|
|        que|   22|
|desdichados|    1|
|     muchos|    1|
|         no|    2|
|   libertad|    1|
|          a|    7|
|         la|   22|
|        los|    5|
|    Quijote|   12|
+-----------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 17
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|       en|   10|
|      una|    2|
|      don|   13|
|     esta|    1|
|      fue|    2|
|       ir|    1|
|       De|   10|
|    raras|    1|
|      las|    3|
|aconteció|    1|
|  Morena,|    1|
|       de|   23|
|aventuras|    2|
|   Sierra|    1|
|   famoso|    3|
|       al|    3|
|      que|   25|
|       le|    6|
|  Quijote|   13|
|       lo|    6|
+---------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 18
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|verdadera|    1|
|       en|   11|
| prosigue|    2|
| historia|    1|
|      las|    4|
|       de|   25|
|    trata|    4|
|   Morena|    2|
|   Sierra|    3|
| aventura|    6|
|    Donde|    9|
|      que|   26|
| estrañas|    1|
|    cosas|    2|
|      Que|    3|
|  cuentan|    2|
|       se|   12|
|       la|   24|
+---------+-----+

-------------------------------------------
Batch: 19
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|Beltenebros|    1|
| penitencia|    1|
|  imitación|    1|
|    Mancha,|    1|
|        las|    5|
|  prosiguen|    2|
|  enamorado|    1|
|         de|   29|
|   valiente|    2|
|       hizo|    3|
|         al|    4|
|  caballero|    7|
|          y|   11|
|        que|   28|
|      Donde|   10|
|          a|    8|
|         se|   13|
| sucedieron|    1|
|    finezas|    1|
|         la|   27|
+-----------+-----+

-------------------------------------------
Batch: 20
-------------------------------------------
+--------+-----+
|    word|count|
+--------+-----+
|      en|   13|
|     don|   14|
|    esta|    2|
| cuenten|    1|
|      De|   11|
|   otras|    3|
|salieron|    1|
|  dignas|    2|
|      de|   30|
|  Morena|    3|
|     con|   17|
|      el|   12|
|      su|    7|
|barbero,|    1|
|  Sierra|    4|
|       y|   12|
|     que|   29|
|   cosas|    3|
|  grande|    2|
|      se|   14|
+--------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 21
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|       en|   14|
|   sierra|    1|
| historia|    2|
|    nueva|    1|
|agradable|    1|
|  barbero|    2|
|       de|   31|
|    trata|    5|
| aventura|    7|
|  sucedió|    5|
|       al|    5|
|        y|   14|
|      que|   30|
|      Que|    4|
|     cura|    3|
|       la|   29|
|    mesma|    1|
+---------+-----+

-------------------------------------------
Batch: 22
-------------------------------------------
+----------+-----+
|      word|count|
+----------+-----+
|       del|    5|
| artificio|    1|
|pasatiempo|    1|
|   hermosa|    1|
|discreción|    1|
|     otras|    4|
|  gracioso|    1|
|      tuvo|    3|
|     mucho|    1|
|        de|   34|
|     trata|    7|
|       con|   18|
|         y|   16|
|       que|   31|
|     cosas|    4|
|       Que|    6|
|     gusto|    1|
|        se|   15|
|     orden|    1|
|        la|   31|
+----------+-----+
only showing top 20 rows

24/10/16 21:59:28 WARN TextSocketMicroBatchStream: Stream closed by localhost:9999

-------------------------------------------
Batch: 23
-------------------------------------------
+-------------+-----+
|         word|count|
+-------------+-----+
|           en|   16|
|       puesto|    1|
|          don|   15|
|   penitencia|    2|
|      nuestro|    6|
|           De|   12|
|razonamientos|    1|
|        entre|    1|
|        había|    1|
|    enamorado|    2|
|        sacar|    1|
|           de|   35|
|     sabrosos|    1|
|    caballero|    8|
|          que|   33|
|   asperísima|    1|
|            a|    9|
|           se|   16|
|           la|   32|
|      pasaron|    2|
+-------------+-----+
only showing top 20 rows

-------------------------------------------
Batch: 24
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|       en|   17|
|     toda|    1|
|       de|   36|
|    trata|    8|
|      con|   19|
|    venta|    4|
|       su|    8|
|  sucedió|    6|
|        y|   17|
|   Panza,|    1|
|      que|   34|
|      Que|    7|
|   Sancho|    4|
|  sucesos|    4|
|        a|   10|
|escudero,|    1|
|       la|   34|
|    otros|    5|
|  Quijote|   15|
|       lo|    7|
+---------+-----+
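
The batches above count `De` and `de`, `Que` and `que`, and `vizcaíno` and `vizcaíno,` as separate words, because the text is split on whitespace without any normalization. The following is a minimal sketch, in plain Python rather than Spark, of one way to normalize tokens before counting. The helper names `normalize` and `count_words` and the two sample lines are hypothetical, built only from tokens that appear in the output. In a PySpark job, a similar effect could probably be obtained by applying `lower` and `regexp_replace` from `pyspark.sql.functions` to the word column before the `groupBy`.

```python
import re
from collections import Counter

# Leading or trailing runs of non-word characters (Unicode-aware, so "í" and "ü" are kept).
_EDGE_PUNCT = re.compile(r"^\W+|\W+$")


def normalize(token: str) -> str:
    """Lowercase a token and strip punctuation attached to its edges."""
    return _EDGE_PUNCT.sub("", token.lower())


def count_words(lines):
    """Count normalized words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that were only punctuation
                counts[word] += 1
    return counts


# Hypothetical sample lines assembled from tokens seen in the console output.
sample = ["De vizcaíno, que,", "de vizcaíno Que"]
counts = count_words(sample)
print(counts)
```

With this normalization each of the three word forms is counted once per occurrence regardless of case or trailing comma, so the streaming totals would merge rows that the output above keeps apart.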

