Update step_by_step.md
emiliocimino committed Jul 25, 2023
1 parent d1ccd9a commit 9f3a122
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/step_by_step.md
@@ -288,7 +288,7 @@ ssc.start()
ssc.awaitTermination()
```

- First of all, we are inside the try-catch block. Here it is necessary to start the [Spark Streaming Context](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.html). This is done through the *Prime* function in the connector's library. This function requires three arguments: the Spark context, the duration (in seconds) of the receiving window, and a [Storage Level])https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html= in which data are stored during execution. Check the link for the list of possible storage levels and their meaning. Since PySpark Streaming is based on the concept of DStreams (Discretized Streams), which are continuous RDDs of the same type caught and processed within a time window, we set up the connector to process those data every x seconds. The *Prime* function returns the DStream channel of NGSIEvents and the streaming context itself, created inside the connector. These two objects are used in later steps.
- First of all, we are inside the try-catch block. Here it is necessary to start the [Spark Streaming Context](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.html). This is done through the *Prime* function in the connector's library. This function requires three arguments: the Spark context, the duration (in seconds) of the receiving window, and a [Storage Level](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html) in which data are stored during execution. Check the link for the list of possible storage levels and their meaning. Since PySpark Streaming is based on the concept of DStreams (Discretized Streams), which are continuous RDDs of the same type caught and processed within a time window, we set up the connector to process those data every x seconds. The *Prime* function returns the DStream channel of NGSIEvents and the streaming context itself, created inside the connector. These two objects are used in later steps.
- Since we don't know how many events are caught per second or how many entities each NGSIEvent contains, we flatten the DStream channel into a list of the entities captured in the time window defined before. In this way we are able to capture each entity update, passing from an "Event Context" to the Entity itself.
- For each RDD contained in the result of the *flatMap* function (so, for each entity) we sink our data. PySpark Streaming is *lazy*, meaning that all operations are performed only when data are sunk. Two ways to sink data are the *foreachRDD* function and *pprint*: the first maps RDDs to the workers and performs operations on them, while *pprint* only sinks data by printing them to the terminal. Each RDD is then processed by *foreachPartition*, which requires a callback function, in this case named *InjectAndPredict* (shown in the following step).
- Once the flow of data is defined (it can be even more complex), the streaming context is started and will process data until termination; a minimal sketch of the whole pipeline is given after this list.
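
Below is a minimal sketch that pieces these steps together. It is only an illustration under assumptions: the 5-second window, the `MEMORY_AND_DISK` storage level, the `entities` attribute of NGSIEvent, the placeholder *InjectAndPredict* callback, and the import path for *Prime* are not taken from the connector's documentation and may differ in your installation, so check the connector's own examples for the exact names and paths.

```python
# Minimal sketch of the streaming flow described above (assumptions are marked).
from pyspark import SparkContext, StorageLevel

# Hypothetical import path: adjust to wherever the connector's library exposes Prime.
from connector_lib import Prime


def InjectAndPredict(partition):
    # Placeholder for the callback shown in the following step of the guide;
    # foreachPartition hands it an iterator over the entities of one partition.
    for entity in partition:
        print(entity)


sc = SparkContext(appName="NGSIEventProcessing")

try:
    # Prime builds the StreamingContext inside the connector and returns the
    # DStream of NGSIEvents plus the context itself: its arguments are the
    # Spark context, the window duration in seconds, and a StorageLevel.
    event_stream, ssc = Prime(sc, 5, StorageLevel.MEMORY_AND_DISK)

    # Flatten each NGSIEvent into the entities it carries, so every entity
    # update becomes one element of the stream (attribute name assumed).
    entities = event_stream.flatMap(lambda event: event.entities)

    # Sink the data: ship every partition of every RDD to the callback.
    entities.foreachRDD(lambda rdd: rdd.foreachPartition(InjectAndPredict))

    ssc.start()
    ssc.awaitTermination()
except Exception as err:
    print("Streaming pipeline stopped: {}".format(err))
```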
