#### About

> Stream processing

Stream processing is a method of processing continuous streams of data in real time or near real time as they are generated. This involves processing and analyzing data while it is still in motion, rather than waiting for the data to be stored in a database or data warehouse. Data streams can come from a variety of sources, such as sensors, social media, or IoT devices, and contain large amounts of data that need to be processed quickly to gain insights, detect anomalies, or trigger alerts.

Stream processing technology makes it possible to handle these high-speed data streams in real time by breaking the data into small chunks (often called micro-batches) and processing each chunk as soon as it arrives. This lets businesses make decisions and act on insights while the data is still fresh.
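
The chunking idea can be sketched in plain Python, without Spark. This is only an illustration: the names `micro_batches` and `process` are hypothetical, and the batches here are cut by item count, whereas Spark's streaming API cuts them by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most batch_size items from stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


def process(batch: List[int]) -> int:
    """Toy per-batch computation: sum the values in the batch."""
    return sum(batch)


# A finite stand-in for an unbounded source.
readings = range(1, 11)
results = [process(batch) for batch in micro_batches(readings, batch_size=4)]
print(results)  # [10, 26, 19]
```

Each batch is processed as soon as it is complete, so results become available before the whole stream has been read.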

For example, stream processing can be used to monitor stock market data in real time, identify trading patterns, and send alerts to traders when certain conditions are met. It can also be used to analyze social media feeds to identify trends and sentiment around a particular product or event, or to process IoT sensor data and spot early signs of equipment failure.
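
As a rough illustration of the alerting use case, the sketch below flags a price that exceeds the average of the previous few prices by a given fraction. The function name `price_alerts`, the window size, the threshold, and the sample prices are all made up for this example; a real system would read from a live feed.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def price_alerts(
    prices: Iterable[float], window: int, threshold: float
) -> Iterator[Tuple[int, float, float]]:
    """Yield (index, price, average) when price exceeds the average of the
    previous `window` prices by more than `threshold` (a fraction)."""
    recent: Deque[float] = deque(maxlen=window)
    for index, price in enumerate(prices):
        # Only compare once a full window of earlier prices is available.
        if len(recent) == window:
            average = sum(recent) / window
            if price > average * (1 + threshold):
                yield index, price, average
        recent.append(price)


# Hypothetical tick data: one spike at position 4.
ticks = [100.0, 101.0, 99.0, 100.0, 112.0, 101.0]
alerts = list(price_alerts(ticks, window=3, threshold=0.05))
print(alerts)  # [(4, 112.0, 100.0)]
```

Because the function is a generator, each alert is emitted as soon as the triggering price is seen rather than after the whole series has been consumed.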



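We will use PySpark's DStream-based streaming API (`pyspark.streaming`), which reads data from a source, processes it using Spark's RDD (Resilient Distributed Dataset) operations, and writes the results to an output sink (such as HDFS, the console, or a database). Typical sources include Kafka, Flume, and HDFS, although which connectors are available depends on the Spark version. The example below needs no external source: it feeds an in-memory RDD into the stream through a queue.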

In [None]:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext


In [None]:
# spark session init
spark = SparkSession.builder \
    .appName("Stream Processing Example") \
    .master("local[2]") \
    .getOrCreate()

ssc = StreamingContext(spark.sparkContext, 1) # Batch interval of 1 second

In [None]:
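# simulating a stream of data by generating random numbers and converting them to a PySpark RDD
import random

def generate_random_data(n):
    # yield n random integers; the generator must be finite,
    # because parallelize() consumes the whole iterable before building the RDD
    for _ in range(n):
        yield random.randint(1, 100)

rdd = spark.sparkContext.parallelize(list(generate_random_data(100)))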

In [None]:
# creating a DStream (Discretized Stream) from a queue of RDDs; each RDD in the queue is consumed as one batch
dstream = ssc.queueStream([rdd])


In [None]:
# performing a transformation on the DStream:
# keep only numbers greater than 50 (values of 50 or less are dropped)
filtered_dstream = dstream.filter(lambda num: num > 50)

In [None]:
filtered_dstream.pprint()


In [None]:
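# start the streaming context, let it run briefly, then shut it down
ssc.start()
ssc.awaitTerminationOrTimeout(10)  # wait up to 10 seconds instead of blocking forever
ssc.stop(stopSparkContext=True, stopGraceFully=True)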