# Lesson 37 - Introduction to Streaming

## Stream Processing

Up to this point, we have worked only with static datasets that are read and processed once. This data processing paradigm, known as **batch processing**, is not sufficient for tasks where data arrives continuously and must be acted on in real time.

**Stream Processing** refers to the real-time processing of data that is being received continuously from a (theoretically) never-ending streaming source. There are many important applications of stream processing. These include, but are not limited to:

- Monitoring credit card transactions to perform real-time fraud detection.
- Monitoring activity on a website to manage server loads. 
- Recording and reacting to readings received from Internet of Things (IoT) devices.
- Monitoring social media for trends or for suspicious activity. 
- Scheduling services such as ride sharing and food delivery.
- Providing real-time analytics for market data.

## Continuous and Micro-Batch Processing

One design decision that needs to be considered when developing a stream processing system is the choice between **continuous processing** and **micro-batch processing**. 

- When using **continuous processing**, the system processes each new piece of information as soon as it is received. 
- Under **micro-batch processing**, the system allows small batches of data to accumulate over a short period of time. Batches are then processed periodically. 
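The difference between the two strategies can be illustrated with a small, Spark-free sketch. The event list, the one-second interval, and the function names below are all made up for illustration; a real system would read from a live source rather than a list.

```python
from itertools import groupby

# Hypothetical events: (arrival time in seconds, payload), ordered by arrival time.
EVENTS = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]


def continuous(events):
    """Yield every event on its own, as soon as it arrives."""
    for _, payload in events:
        yield [payload]


def micro_batch(events, interval):
    """Yield lists of payloads whose arrival times fall in the same interval."""
    # groupby only merges adjacent items, which is fine because events are time-ordered.
    for _, group in groupby(events, key=lambda e: int(e[0] // interval)):
        yield [payload for _, payload in group]


continuous_calls = list(continuous(EVENTS))
batched_calls = list(micro_batch(EVENTS, interval=1.0))

print(len(continuous_calls), "processing calls under continuous processing")
print(len(batched_calls), "processing calls under micro-batch processing")
```

With these five events, continuous processing triggers five processing calls, while one-second micro-batches trigger only three.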

Two concepts related to these two processing strategies are **latency** and **throughput**. 

- **Latency** refers to the delay between when data is received and when it is acted on. 
- **Throughput** refers to the rate at which data is processed. 

Continuous processing offers the lowest possible latency since data is processed as soon as it is received. However, continuous processing systems typically have a lower maximum throughput than micro-batch systems. This is because each round of processing usually incurs some fixed overhead, regardless of how many records it handles. Micro-batch processing runs the processing step less often and spreads that overhead across every record in a batch, which raises throughput at the cost of the extra latency incurred while each batch accumulates.
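To make the trade-off concrete, here is a rough cost model, not a measurement of any real system. It assumes every processing call costs a fixed overhead plus a constant amount of work per record; the 10 ms overhead, 1 ms per-record cost, batch size of 100, and 1-second batch interval are invented numbers.

```python
def max_throughput(batch_size, overhead_s, per_record_s):
    """Records per second when each call costs a fixed overhead plus per-record work."""
    return batch_size / (overhead_s + batch_size * per_record_s)


def worst_case_latency(wait_s, batch_size, overhead_s, per_record_s):
    """Longest delay for a record: time spent waiting for the batch, then processing it."""
    return wait_s + overhead_s + batch_size * per_record_s


OVERHEAD_S = 0.010    # assumed fixed cost per processing call
PER_RECORD_S = 0.001  # assumed cost of the actual work per record

# Continuous processing modeled as batches of one record with no waiting.
continuous_tp = max_throughput(1, OVERHEAD_S, PER_RECORD_S)
continuous_lat = worst_case_latency(0.0, 1, OVERHEAD_S, PER_RECORD_S)

# Micro-batch processing: 100 records collected over a 1-second interval.
micro_tp = max_throughput(100, OVERHEAD_S, PER_RECORD_S)
micro_lat = worst_case_latency(1.0, 100, OVERHEAD_S, PER_RECORD_S)

print(f"continuous:  {continuous_tp:7.1f} records/s, latency up to {continuous_lat:.3f} s")
print(f"micro-batch: {micro_tp:7.1f} records/s, latency up to {micro_lat:.3f} s")
```

Under these assumptions, micro-batching raises the throughput ceiling by roughly a factor of ten (about 909 versus about 91 records per second), while the worst-case latency grows from about 0.011 s to about 1.11 s.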

## Stream Processing in Spark

Spark provides two streaming APIs: the **DStream** API and the **Structured Streaming** API.

- The **DStream API** is RDD-based and has been part of Spark since its early releases (it first appeared in Spark 0.7). It supports micro-batch processing, but does not support continuous processing.
- The **Structured Streaming** API is built on top of the DataFrame API and was introduced in Spark 2.0. It has supported micro-batch processing since its introduction. Continuous processing was introduced as an experimental feature in Spark 2.3.

We will focus only on the Structured Streaming API in this course. The DStream API is still used by many companies, especially in legacy applications, but Structured Streaming is generally preferred today. This is because it leverages the more advanced tools available for working with DataFrames and can also take advantage of the under-the-hood query optimizations that Spark provides for DataFrames.
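As a preview, the following is a minimal sketch of how the two processing modes might be selected in Structured Streaming through the `trigger` setting of a streaming write. It assumes PySpark is installed and uses Spark's built-in `rate` test source and `console` sink. The helper names `trigger_options` and `run_rate_stream`, the mode labels, and all interval values are choices made for this sketch, not part of Spark.

```python
def trigger_options(mode, interval):
    """Build keyword arguments for DataStreamWriter.trigger().

    mode is "micro-batch" or "continuous" (labels chosen for this sketch).
    """
    if mode == "micro-batch":
        # interval = how often a new micro-batch is started
        return {"processingTime": interval}
    if mode == "continuous":
        # interval = how often progress is checkpointed, not a batch size
        return {"continuous": interval}
    raise ValueError(f"unknown mode: {mode!r}")


def run_rate_stream(mode="micro-batch", interval="2 seconds", run_for_s=10):
    """Print rows from Spark's built-in test source for a few seconds."""
    # Imported here so trigger_options() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("streaming-intro-sketch")
             .getOrCreate())

    # The "rate" source generates (timestamp, value) rows for testing.
    stream = (spark.readStream
              .format("rate")
              .option("rowsPerSecond", 5)
              .load())

    # Only a simple projection is used, since the experimental continuous
    # mode supports a restricted set of map-like operations.
    query = (stream.select("timestamp", "value")
             .writeStream
             .format("console")
             .outputMode("append")
             .trigger(**trigger_options(mode, interval))
             .start())

    query.awaitTermination(run_for_s)  # wait up to run_for_s seconds
    query.stop()
    spark.stop()
```

Calling `run_rate_stream("micro-batch")` should print a new batch of rows to the console every two seconds, while `run_rate_stream("continuous", "1 second")` requests the experimental continuous mode instead.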