# Streaming Data simulation

For the Streaming data demo, we will use the file we created (Load data notebook) with the last 30 days from our dataset. Then we will split the file by day to create some batches to be fed into the streaming sink.

Then we will use Spark Streaming to read and process the stream (some simple aggregations) and we will save the results to disk for analysis.

In [1]:
import os, pandas as pd
import numpy as np
import datetime as dt
import time

from pyspark.context import SparkContext, SparkConf
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

## Load Data and Prepare Batches

In [2]:
#create spark context
conf = SparkConf().setAppName('final_project').setMaster('local[*]') 
sc = SparkContext.getOrCreate(conf)
spark = SparkSession(sc)

23/01/19 07:58:01 WARN Utils: Your hostname, Andreass-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.0.10 instead (on interface en0)
23/01/19 07:58:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/01/19 07:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
last_month_df=spark.read.json('data/last30_days_with_locs.json')

                                                                                

In [4]:
last_month_df.printSchema()
last_month_df.persist()
last_month_df.count()

root
 |-- cts: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- number_comments: long (nullable = true)
 |-- numbr_likes: long (nullable = true)
 |-- post_id: string (nullable = true)
 |-- post_type: long (nullable = true)
 |-- profile_id: long (nullable = true)
 |-- sid: long (nullable = true)
 |-- sid_profile: long (nullable = true)



                                                                                

211432

In [5]:
#Preprocessing

last_month_df=last_month_df.select('cts','location.*','numbr_likes','number_comments')\
                            .withColumn('cts',to_timestamp('cts'))\
                            .withColumn('date',to_date('cts'))\
                            .drop('id')

In [6]:
last_month_df.show(10)

+-------------------+--------------------+-------+--------------------+-----------+---------------+----------+
|                cts|                city|country|                name|numbr_likes|number_comments|      date|
+-------------------+--------------------+-------+--------------------+-----------+---------------+----------+
|2019-08-02 19:19:49|     Berlin, Germany|     DE|KW Institute for ...|        211|             14|2019-08-02|
|2019-08-12 16:12:21|  Richmond, Virginia|     US|   New York Deli RVA|         45|              0|2019-08-12|
|2019-07-31 18:33:35|   Trondheim, Norway|     NO|   Lerkendal Stadion|          7|              0|2019-07-31|
|2019-08-06 10:52:12|           Guildford|     GB|       The BOILEROOM|          4|              1|2019-08-06|
|2019-08-02 10:07:37|Cambridge, Cambri...|     GB|Pembroke College ...|         21|              0|2019-08-02|
|2019-08-01 11:17:22|Cambridge, Cambri...|     GB|Pembroke College ...|         28|              0|2019-08-01|
|

In [7]:
#Get distinct dates
days =last_month_df.select('date').distinct().collect()
days= sorted([row.date for row in days])
for day in days:
    print(str(day))

[Stage 5:>                                                          (0 + 8) / 8]

2019-07-31
2019-08-01
2019-08-02
2019-08-03
2019-08-04
2019-08-05
2019-08-06
2019-08-07
2019-08-08
2019-08-09
2019-08-10
2019-08-11
2019-08-12
2019-08-13
2019-08-14
2019-08-15
2019-08-16
2019-08-17
2019-08-18
2019-08-19
2019-08-20
2019-08-21
2019-08-22
2019-08-23
2019-08-24
2019-08-25
2019-08-26
2019-08-27
2019-08-28
2019-08-29
2019-08-30


                                                                                

In [8]:
#Save a CSV file for each day's posts
for day in days:
    df_day = last_month_df.filter(col('date') == day)
    
    df_day.coalesce(1).write.mode('append').option('header','true').csv('data/streaming_data')
    
#Now we can unpersist this DF since we won't be needing it anymore
last_month_df.unpersist()

                                                                                

DataFrame[cts: timestamp, city: string, country: string, name: string, numbr_likes: bigint, number_comments: bigint, date: date]

In [9]:
#we need to define a schema for the streaming data

schema=StructType([StructField('cts', TimestampType(), True),
                   StructField('city', StringType(), True),
                   StructField('country', StringType(), True),
                   StructField('id', StringType(), True),
                   StructField('name', StringType(), True),
                   StructField('numbr_likes', LongType(), True),
                   StructField('number_comments', LongType(), True),
                   StructField('date', DateType(), True)])

## Streaming

In [10]:
#Create the stream that will be triggered by 1 csv file each time

streaming = (spark.readStream.schema(schema)
            .option('maxFilesPerTrigger',1).csv('data/streaming_data/'))

In [11]:
#Define the transformation that will happen on the Result Table

stream_stats=streaming.groupBy('country') \
                        .agg(count("country").alias("total_posts"), \
                             sum("numbr_likes").alias("total_likes"),\
                             round(avg("numbr_likes"),2).alias("avg_likes"))\
                        .filter(col('country').like('__'))\
                        .sort(desc('total_posts'))


In [12]:
def foreach_batch_function(df, epoch_id):
    #write batchDF
    df.coalesce(1).write.csv('data/streaming_stats/results_at_day_'+str(epoch_id),header=True)

In [13]:
#Start streaming, and save the result table after each batch on disk (we can do this because we have a short stream)

query = (stream_stats.writeStream
         .queryName('streaming_stats')
         .foreachBatch(foreach_batch_function)
         .outputMode('complete')
         .start())

23/01/19 07:58:47 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/1m/xl6dk1vd6bs09zz59qz454xr0000gn/T/temporary-01f51b88-18a2-48af-b1fa-ec8a76abbd86. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/01/19 07:58:47 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


                                                                                

In [14]:
query.isActive

True

In [15]:
query.stop()

## View the Results

In [24]:
all_df=pd.DataFrame(columns=['country', 'total_posts', 'total_likes', 'avg_likes','day'])

for i in range(31):
    print('Results at day',i)
    d='data/streaming_stats/results_at_day_'+str(i)
    file=sorted(os.listdir(d))[-1]
    df=pd.read_csv(d+'/'+file)
    display(df.head(5))
    
    #lets also concatenate all the dfs into 1 for visualization
    df['day']=dt.date(2019, 7, 31)+dt.timedelta(days=i)
    all_df=pd.concat([all_df,df])

Results at day 0


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,4307,39981,9.28
1,GB,2249,13498,6.0
2,US,1711,40707,23.81
3,IT,493,4079,8.27
4,ES,415,3766,9.07


Results at day 1


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,8882,66916,7.54
1,GB,4752,24912,5.25
2,US,4416,72750,16.48
3,IT,989,8385,8.48
4,ES,918,9196,10.02


Results at day 2


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,13345,96621,7.24
1,US,7265,95459,13.15
2,GB,7105,35742,5.03
3,IT,1522,12723,8.36
4,ES,1355,13151,9.71


Results at day 3


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,17590,126631,7.2
1,US,9985,132295,13.26
2,GB,9898,46890,4.74
3,IT,2034,17236,8.47
4,ES,1760,19068,10.83


Results at day 4


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,21894,148757,6.8
1,US,13004,153692,11.83
2,GB,12891,58714,4.56
3,IT,2566,28490,11.1
4,ES,2252,23039,10.23


Results at day 5


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,26338,182914,6.95
1,US,15938,181568,11.4
2,GB,15135,69613,4.6
3,IT,3074,33289,10.83
4,ES,2704,28595,10.58


Results at day 6


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,30535,214693,7.03
1,US,18483,205054,11.1
2,GB,17066,79553,4.66
3,IT,3573,37258,10.43
4,ES,3117,33028,10.6


Results at day 7


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,34454,241114,7.0
1,US,20798,222132,10.69
2,GB,18922,88932,4.7
3,IT,4009,39534,9.86
4,ES,3461,40777,11.78


Results at day 8


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,37959,271374,7.15
1,US,22859,238267,10.43
2,GB,20764,96799,4.66
3,IT,4489,42116,9.38
4,ES,3795,42741,11.26


Results at day 9


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,41313,295023,7.14
1,US,25074,257790,10.29
2,GB,22406,105746,4.72
3,IT,4875,46322,9.5
4,ES,4141,46668,11.27


Results at day 10


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,44644,314673,7.05
1,US,27211,272864,10.03
2,GB,24006,111651,4.65
3,IT,5219,48883,9.37
4,ES,4427,48781,11.02


Results at day 11


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,47672,335937,7.05
1,US,29472,284965,9.67
2,GB,25692,117529,4.58
3,IT,5653,50979,9.02
4,ES,4744,51329,10.82


Results at day 12


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,50585,360819,7.14
1,US,31306,300151,9.59
2,GB,26954,121705,4.52
3,IT,6011,53553,8.92
4,ES,5032,53420,10.62


Results at day 13


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,53259,383754,7.21
1,US,32810,311396,9.49
2,GB,28004,125642,4.49
3,IT,6354,55183,8.69
4,ES,5351,56327,10.53


Results at day 14


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,55417,399002,7.2
1,US,34033,321813,9.46
2,GB,28722,128415,4.47
3,IT,6642,57274,8.63
4,ES,5593,58141,10.4


Results at day 15


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,57059,412187,7.23
1,US,34960,328945,9.41
2,GB,29123,130043,4.47
3,IT,6867,58827,8.57
4,ES,5756,59653,10.36


Results at day 16


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,57932,417733,7.22
1,US,35449,333108,9.4
2,GB,29363,130952,4.46
3,IT,7021,59901,8.54
4,ES,5848,60656,10.37


Results at day 17


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58462,419496,7.18
1,US,35778,334502,9.35
2,GB,29505,131204,4.45
3,IT,7103,60340,8.5
4,ES,5900,61271,10.38


Results at day 18


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58586,420709,7.19
1,US,35895,334793,9.33
2,GB,29553,131306,4.45
3,IT,7111,60427,8.5
4,ES,5902,61275,10.38


Results at day 19


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35905,334928,9.33
2,GB,29554,131307,4.45
3,IT,7111,60427,8.5
4,ES,5902,61275,10.38


Results at day 20


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35913,334997,9.33
2,GB,29555,131312,4.45
3,IT,7111,60427,8.5
4,ES,5902,61275,10.38


Results at day 21


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35922,335145,9.33
2,GB,29555,131312,4.45
3,IT,7111,60427,8.5
4,ES,5902,61275,10.38


Results at day 22


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35927,335200,9.33
2,GB,29555,131312,4.45
3,IT,7111,60427,8.5
4,ES,5903,61319,10.39


Results at day 23


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35937,335436,9.34
2,GB,29556,131313,4.44
3,IT,7111,60427,8.5
4,ES,5903,61319,10.39


Results at day 24


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35947,335747,9.34
2,GB,29556,131313,4.44
3,IT,7111,60427,8.5
4,ES,5903,61319,10.39


Results at day 25


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35956,335848,9.34
2,GB,29557,131314,4.44
3,IT,7111,60427,8.5
4,ES,5903,61319,10.39


Results at day 26


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35960,335850,9.34
2,GB,29557,131314,4.44
3,IT,7111,60427,8.5
4,ES,5904,61325,10.39


Results at day 27


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35969,335917,9.34
2,GB,29557,131314,4.44
3,IT,7111,60427,8.5
4,ES,5904,61325,10.39


Results at day 28


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35976,336085,9.35
2,GB,29558,131316,4.44
3,IT,7111,60427,8.5
4,ES,5904,61325,10.39


Results at day 29


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35982,336191,9.35
2,GB,29560,131322,4.44
3,IT,7111,60427,8.5
4,ES,5904,61325,10.39


Results at day 30


Unnamed: 0,country,total_posts,total_likes,avg_likes
0,RU,58588,420710,7.19
1,US,35983,336192,9.35
2,GB,29560,131322,4.44
3,IT,7111,60427,8.5
4,ES,5904,61325,10.39


In [25]:
all_df.to_csv('data/streaming_stats/concat_df.csv',index_label='index')

By taking a look at the tables, we can see how the metrics are being updated after each batch/day. Russia is again on top in terms of posts, but the average number of likes is higher in Spain. Even though the avg_likes seem low, we should consider that this is the latest days of the dataset and the posts were published for shorter periods. In a real time scenario, the avg_likes would be low as well.

This streaming process can be generalized on live data generated in real time. However, in this short demo I only used daily batches of only the last 30 days.


More insights and visualizations can be found on a report I prepared on Google Looker at:  
https://datastudio.google.com/reporting/e647d5ac-e2e2-437f-ac48-cb63d82fe382/page/p_zbw36zck2c