# Batch Processing and Streaming

> Data processing is the act of transforming data, for example by cleaning it or by computing more meaningful information from it.

As mentioned earlier, there are two main approaches to data processing: batch and streaming.

- _Batch processing_ is where data is aggregated, before being processed all at once in bulk
- _Streaming (or stream processing)_ is where data is processed as soon as it is ingested into the system
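
The difference between the two approaches can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample readings are made up for this example, and a real system would read from storage or a message queue rather than from a list.

```python
from typing import Iterable, Iterator, Sequence


def batch_average(readings: Sequence[float]) -> float:
    """Batch: wait until all readings are collected, then process them in one go."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as soon as each reading arrives."""
    count = 0
    total = 0.0
    for reading in readings:
        count += 1
        total += reading
        yield total / count  # an up-to-date result after every record


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sample data
print(batch_average(readings))            # a single result once everything is in
print(list(streaming_average(readings)))  # one result per incoming record
```

The batch function needs the whole dataset before it can answer, whereas the streaming function always has a current answer but has to keep running state (`count` and `total`) between records.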

### Other types of processing

- _Micro-batch processing_ is where streamed data is batched into small groups to achieve some of the performance advantages of batch processing. It is somewhere between batch processing and streaming
- _Real Real-time processing_ (yes, there are supposed to be two "real"s there) is where a system has a requirement to process data in not just _near real-time_, but _real real-time_. In streaming, data is processed as soon as possible, but the processing may not be complete quickly enough to use predictions in real time. With real real-time processing, the results of processing are required right now, not just as soon as possible. For example, in a selfie filter there can't be any noticeable delay.
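
As a rough sketch of micro-batching, the snippet below groups a stream into small fixed-size batches. Grouping by record count is an assumption made to keep the example short; real systems often group by time window instead. The `micro_batches` helper is hypothetical and not part of any library.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming stream into lists of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


events = range(1, 8)  # stand-in for records arriving one at a time
for batch in micro_batches(events, batch_size=3):
    print(batch, "-> sum:", sum(batch))
```

Each small batch is processed in bulk, so the per-record overhead is lower than in pure streaming, at the cost of waiting for a batch to fill.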

Let's take a closer look at batch processing and streaming.

## Batch data processing 

Batch processing is widely used in organisations, as many of the legacy systems were built upon this philosophy of data engineering where:
- Data is collected over time
- The collected data is sent for processing either at regular intervals, when a certain criterion is met, or when triggered manually
- Datasets could be huge (terabytes or petabytes in size) and processing this data can be time-consuming. Hence, it's meant for information that isn't very time-sensitive.

### When does it make sense to do batch processing?

- You already have all of the data in storage
- Results are not time-sensitive
- Data migrations are required from one storage system to another (such as from on-premises hardware to cloud-based storage)

### When does batch processing not make sense?
- When results are required instantly or in (near) real-time 
- When data is constantly flowing in, and operations depend on up-to-date results being shown as they arrive (for instance, Google Maps)

### Example use-cases of batch processing

#### Pricing products for the next quarter
If you want to set the prices of 1 million new products for the upcoming season, it makes little sense to price them one by one as they are added to the system. Firstly, it would be inefficient to spin compute resources up and down for each item. Secondly, you would not be able to price each item in the _context_ of all the other items, which is an important factor. Thirdly, streaming adds no benefit, because the prices will not be used until the new season launches, at which point they will all be released at once.

#### Performing some historical analysis
You may want to ask questions about a huge amount of data that has been collected in storage over the lifetime of your organisation: "What is our most popular product?", "What is the most common way we acquire users?", "What is the average spend of men, based in London, aged between 20 and 25?".
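
To make the last question concrete, here is a minimal sketch of that kind of historical query in plain Python. The records, field names and the `average_spend` helper are all hypothetical; at real scale the same filter-then-aggregate logic would run on a batch engine over data in storage.

```python
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical customer records; a real job would read these from storage.
customers: List[Dict] = [
    {"gender": "M", "city": "London", "age": 22, "spend": 120.0},
    {"gender": "M", "city": "London", "age": 24, "spend": 80.0},
    {"gender": "F", "city": "London", "age": 23, "spend": 200.0},
    {"gender": "M", "city": "Leeds", "age": 21, "spend": 50.0},
    {"gender": "M", "city": "London", "age": 31, "spend": 300.0},
]


def average_spend(records: List[Dict], gender: str, city: str,
                  min_age: int, max_age: int) -> Optional[float]:
    """Filter the full dataset, then aggregate the matching rows."""
    matching = [
        r["spend"]
        for r in records
        if r["gender"] == gender
        and r["city"] == city
        and min_age <= r["age"] <= max_age
    ]
    return mean(matching) if matching else None


print(average_spend(customers, "M", "London", 20, 25))
```

The query needs every stored record before it can give a correct answer, which is why this kind of analysis is a natural fit for batch processing.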

Continuing with our example from the Emirates Airlines check-in system, below is a high-level diagram demonstrating the batch processing components within that system:

<p align="center">

<img src= "images/emirates-batch2.png" width=1000>

</p>

__In the above example, the data flow would be as follows__:

1. XML data is ingested in batches into the HDFS raw-layer data lake every 30 minutes (this is called the _incremental data_)
2. The XML data is accumulated throughout the day, and saved in folders arranged by day and sub-folders by time of arrival 
3. At the end of the work day, the newly collected incremental data and the previously stored historical (full) dataset are processed together using a Spark Scala application. This application is a batch job that runs once daily.
4. The Spark Scala application performs basic data integrity and quality checks (such as checking for duplicates), and converts all available files to Parquet, a columnar file format often used in big data environments. The process can take several hours to complete.
5. The transformed Parquet files are then loaded into a new HDFS intermediate storage location called the _decomposed layer_. This layer is used as a staging layer to perform more advanced transformations.
6. The Parquet files stored in the decomposed layer are then further processed by a more detailed Spark Scala application that implements more complex transformations as per the business requirements (for example, linking check-in event data to a particular passenger). This is another batch job that runs once per day and requires several hours to complete.
7. The final cleaned and processed data is stored in a separate HDFS location called the _modelled layer_, which is the layer that contains the final dataset that can be used by the various stakeholders.
8. Data in the modelled layer is then exposed to the approved stakeholders who will query and analyse that data
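
The layered flow above can be sketched in plain Python as two jobs that run one after the other. This is a simplified illustration, not the actual Spark Scala application: the records, field names and passenger names are invented, the XML parsing and Parquet conversion are left out, and in-memory lists stand in for the HDFS layers.

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical raw-layer records, already parsed from XML for brevity.
historical: List[Record] = [
    {"event_id": "e1", "passenger_id": "p1", "status": "checked_in"},
]
incremental: List[Record] = [
    {"event_id": "e2", "passenger_id": "p2", "status": "checked_in"},
    {"event_id": "e2", "passenger_id": "p2", "status": "checked_in"},  # duplicate
    {"event_id": "e3", "passenger_id": "p1", "status": "boarded"},
]
passengers = {"p1": "A. Khan", "p2": "B. Lopez"}  # hypothetical lookup table


def to_decomposed_layer(raw: List[Record]) -> List[Record]:
    """First daily job: basic quality checks (here, only de-duplication)."""
    seen = set()
    clean = []
    for record in raw:
        if record["event_id"] not in seen:
            seen.add(record["event_id"])
            clean.append(record)
    return clean


def to_modelled_layer(decomposed: List[Record], names: Dict[str, str]) -> List[Record]:
    """Second daily job: link each check-in event to a passenger."""
    return [{**r, "passenger_name": names[r["passenger_id"]]} for r in decomposed]


# Steps 3-7: process incremental plus historical data, layer by layer.
decomposed = to_decomposed_layer(historical + incremental)
modelled = to_modelled_layer(decomposed, passengers)
for row in modelled:
    print(row)
```

Keeping the intermediate (decomposed) result separate from the final (modelled) one means the second job can be re-run or changed without repeating the quality checks.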

## Batch data processing tools

### Hadoop and HDFS

<p></p>
<p align="left">
<img src= "images/hdfs2.png" width=100>
</p>

> Apache Hadoop is among the most popular tools in the big data industry. It is an open-source framework, maintained by the Apache Software Foundation, that is designed to run on clusters of commodity hardware and is used for big data storage, processing, and analysis.

Hadoop was one of the first widely adopted big data frameworks. It enables parallel data processing across multiple machines, which together are called a _distributed cluster_. The tool was extremely popular for big data processing for many years, although Spark has since been replacing it for many workloads.

#### Hadoop consists of three main parts
- The Hadoop Distributed File System (HDFS), which is the data storage layer
- MapReduce, which handles data processing
- YARN (Yet Another Resource Negotiator), which is the software layer designed for resource management
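
To give a feel for the MapReduce programming model, here is a single-process word count written in plain Python. It only imitates the three phases (map, shuffle, reduce) and does not use Hadoop's API; on a real cluster each phase would run in parallel across many machines. All function names are chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(counts: List[int]) -> int:
    """Reduce: combine all values for one key into a single result."""
    return sum(counts)


lines = ["big data big cluster", "data lake"]  # hypothetical input split
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = {word: reduce_phase(counts) for word, counts in shuffle(pairs).items()}
print(word_counts)
```

Because each map call looks at one line and each reduce call looks at one key, the work can be divided between machines without them needing to coordinate during a phase.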

#### Features of Hadoop
- It can authenticate with an HTTP proxy server
- Supports the POSIX-style file system with extended attributes
- Offers a robust ecosystem for analytics capable of meeting most needs
- Makes data processing faster and more flexible

#### Main Limitations
- Hadoop itself lacks real-time processing capabilities
- MapReduce writes intermediate results to disk rather than keeping them in memory, which makes multi-step jobs much slower than in-memory computation

#### Example Hadoop use cases
- Building and running applications that rely on analytics to assess risks and create investment models
- Cleaning and transforming the entire dataset of customers in an airline company
- Creating a data analytics infrastructure for improving customer experience
- Predictive maintenance of IoT devices and other infrastructure machines

For more details, [check out Hadoop's homepage](https://hadoop.apache.org/)


### Apache Hive

<p></p>
<p align="left">
<img src= "images/hive.png" width=100>
</p>


> Apache Hive is an open-source data warehouse system for reading, writing and managing large datasets that are stored directly in either HDFS or other big data systems like HBase.

#### Features of Hive:
-   Hive enables SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements for querying and analysing data
-   It is designed to make MapReduce programming easier because you don’t have to write lengthy Java code. Instead, you can write queries more simply in HQL, and Hive can then create the map and reduce functions automatically under the hood.
-   Hive also includes the Hive _metastore_, which enables developers to apply a table structure to large amounts of unstructured data. Once a table is created, details defining the columns, rows, data types, etc., are all stored in the metastore and become part of the Hive architecture.
    -   Other tools such as Apache Spark and Apache Pig can then access the metadata in the metastore
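
The appeal of a SQL-like layer can be seen by comparing a declarative query with the hand-written grouping logic it replaces. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in, since running Hive needs a cluster; the query uses only basic constructs (`SELECT`, `COUNT(*)`, `GROUP BY`, `ORDER BY`) that HQL also supports. The table and column names are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical check-in rows: (airport, passenger_id).
rows = [("DXB", "p1"), ("LHR", "p2"), ("DXB", "p3")]

# Declarative version: describe the result and let the engine plan the work.
conn = sqlite3.connect(":memory:")  # stand-in engine, not Hive
conn.execute("CREATE TABLE checkins (airport TEXT, passenger_id TEXT)")
conn.executemany("INSERT INTO checkins VALUES (?, ?)", rows)
query = "SELECT airport, COUNT(*) AS n FROM checkins GROUP BY airport ORDER BY airport"
sql_result = conn.execute(query).fetchall()
conn.close()

# Hand-written version: the grouping and counting a developer would
# otherwise have to code explicitly as map and reduce steps.
manual_result = sorted(Counter(airport for airport, _ in rows).items())

print(sql_result)
print(manual_result)
```

Both versions produce the same counts per airport, but the query states only what is wanted and leaves the execution strategy to the engine.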

#### Hive vs SQL
- Hive uses a query language very similar to that of traditional SQL databases. However, Hive is based on Apache Hadoop and uses MapReduce operations, resulting in key differences:
    -   Hadoop is intended for long full-table scans and, because Hive is based on Hadoop, its queries have a higher latency than those of a traditional relational database, making Hive less appropriate for applications that need very fast response times. Why are the queries slower?
        - Network latency between distributed machines which store the data
        - Hive and Hadoop do not cache data in memory like relational databases do
- Hive is better suited than relational databases (RDBMS) for big data warehousing tasks. This is because:
    -   Hive can scale up or down as needed using the power of distributed computing
    -   Hive is better at handling complicated data than RDBMSs (which are designed to handle less complicated data faster)
    -   Hive's schema can be more flexible than in relational databases (which require a fixed schema)
    -   Hive can support a wider variety of input data types 

For more details, [check Hive's homepage](https://hive.apache.org/)

### Apache Spark

<p></p>
<p align="left">
<img src= "images/spark.png" width=100>
</p>


> Apache Spark is an open-source unified analytics engine for large-scale data processing 

Spark is often considered to be the successor to Hadoop, as it addresses several of Hadoop's drawbacks. For instance, Spark supports both batch and real-time data processing. It also supports in-memory computation, which often yields results faster than Hadoop because fewer reads from and writes to disk are needed. Spark is also a more versatile and flexible tool for big data crunching, capable of working with an array of data stores such as Apache Cassandra, OpenStack, and HDFS.

#### Spark features:
- Fast processing for both batch and streaming data
- Easy to use and provides support for multiple programming languages, including Python
- Supports complex analytics as it comes with several built-in components
- Flexibility to connect to various big data tools
- In-memory computation makes Spark much faster than Hadoop for many use cases

#### Spark use cases:
- With Spark, data can be cleaned and aggregated continuously before being transferred to data stores
- Combines live data with static data, enabling powerful real-time analysis
- Detects unusual behaviour quickly so that it can be addressed, helping to mitigate potentially serious threats
- Its interactive analysis capabilities are used for processing and interactively visualising complex data sets

#### Spark Components
<p></p>
<p align="center">
<img src= "images/spark-ecosystem.png" width=600>
</p>


Spark consists of 5 main components, 4 of which can be used for batch processing:
1. Spark Core:
    - This is the main, general purpose data processing engine for Spark
    - It provides the in-memory computing model that enables code execution
    - Supports a wide variety of APIs for different programming languages like Python and Scala
<p></p>

2. Spark SQL: 
    - Is one of the main components of Spark and is designed to handle structured data
    - Uses a concept called _DataFrames_, an abstraction used to store and organise data into rows and columns. For those who are familiar with Python, they are similar to pandas DataFrames.
    - Enables running SQL queries on top of the data stored in dataframes, which makes it easier than coding for various data transformation tasks like joining and aggregations

<p></p>

3. Spark MLlib:
    - Stands for Machine Learning Library 
    - This component is for performing common machine learning tasks such as customer segmentation, predictive intelligence, and sentiment analysis
<p></p>
    
4. Spark GraphX: 
    - This component is mainly used to store and analyse data that is in graph format
    - Graph data consists of entities and the relationships between them. One example is social media data regarding "friends of friends" of a particular user.
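
The "friends of friends" example can be sketched with a plain Python dictionary acting as the graph. This does not use GraphX, which is a distributed library with its own API; the user names and the `friends_of_friends` helper are invented for illustration.

```python
from typing import Dict, Set

# Hypothetical friendship graph: each user maps to their direct friends.
friends: Dict[str, Set[str]] = {
    "ana": {"ben", "cal"},
    "ben": {"ana", "dee"},
    "cal": {"ana", "dee", "eli"},
    "dee": {"ben", "cal"},
    "eli": {"cal"},
}


def friends_of_friends(graph: Dict[str, Set[str]], user: str) -> Set[str]:
    """Return users exactly two hops away: friends of friends who are
    neither the user nor already a direct friend."""
    direct = graph.get(user, set())
    two_hops: Set[str] = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {user}


print(sorted(friends_of_friends(friends, "ana")))
```

On a small graph this fits in memory on one machine. A graph engine becomes useful when the relationships number in the billions and the traversal has to be spread across a cluster.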
    
For more details, [check Spark's homepage](https://spark.apache.org/)

### Comparing the Batch Processing Tools

<table>
    <thead>
        <tr>
             <th style="width:auto;text-align:center">Tool</th>
             <th style="width:auto;text-align:center">Maturity</th>
		     <th style="width:auto;text-align:center">Key Features</th>
             <th style="width:auto;text-align:center">Limitations</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Hadoop</th>
            <td>High, becoming outdated</td>
            <td>Strong for batch processing</td>
            <td>Does not natively support streaming data. Also, difficult to code with MapReduce</td>
        </tr>
        <tr>
            <th>Hive</th>
            <td>High, becoming outdated</td>
            <td>Easy-to-use HQL queries which are similar to SQL. Uses HDFS as the data storage layer</td>
            <td>Works best once a table structure has been applied to the data. Higher query latency than a relational database</td>
        </tr>  
		<tr>
            <th>Spark</th>
            <td>Still maturing</td>
            <td>Faster in-memory computation. Supports both batch and real-time processing. Easier to code than Hadoop MapReduce</td>
            <td>Does not have its own file management/storage system. Requires more RAM than Hadoop</td>
        </tr>
    </tbody>
 </table>

## Stream Processing

> Streaming, or stream data processing, involves processing data as soon as it is ingested into the software system

The main characteristics of stream data processing include:
- Data is processed individually as the records arrive (not in big batches)
- Data is processed as soon as possible, which means that it's useful for applications where live updates are key such as progressively updating metrics, reports, and summary statistics in response to every data record that becomes available
- Modern data processing has progressed from batch processing of data towards working with stream processing, although streaming is not appropriate for every use-case

Here's an analogy: Historically, you would have to download the whole movie before you could watch it (batch processing). Nowadays you can stream the movie by downloading small pieces at a time (streaming).

### When does it make sense to do stream processing?
- When the data ingestion source provides data in streaming mode (one data point at a time)
- When no historical data analysis is required
- When the data is time-sensitive, for example in an application like Google Maps, which requires constant location data

### When does stream processing not make sense?
- When any type of historical analysis is required
- When the data is not time-sensitive
- When complex transformations are required before deriving insights from the data

### Example use-cases of stream processing

#### Stock Trading
Most trading platforms require analysing data in real-time to execute all orders as soon as they are entered and to provide second-by-second recommendations on which stocks to buy, sell or hold.

#### Multiplayer Gaming
When playing online multiplayer games, the system must be able to handle all communication and interactions in real-time as they occur.

#### Location Applications
Any platform that requires location information, such as Uber or Google Maps, must be able to support stream data processing to provide accurate and immediate information to its users.

A more specific example of a stream data processing system can be seen in the image below. This system is one of Pinterest's main data engineering pipelines, designed to provide real-time metrics and alerts regarding the state of the system:

<p align="center">

<img src= "images/pintrest-streaming.png" width=1000>

</p>

The data flow is as follows:

1. Stream data is created by the API and mobile application
2. The produced data is ingested into Kafka as it arrives in real-time
3. Spark Streaming consumes the data from Kafka and processes all incoming records as they arrive
4. The processed metrics data is then stored into a relational database (MemSQL)
5. The database feeds a Grafana dashboard which provides real-time alerts to users
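
The same flow can be imitated in a few lines of plain Python to show where each step sits. This is a toy sketch, not Pinterest's implementation: a `deque` stands in for Kafka, a dictionary stands in for the metrics database, a list stands in for dashboard alerts, and the event fields and alert threshold are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List

ALERT_THRESHOLD = 0.5  # hypothetical error-rate threshold

# Steps 1-2: events are produced and appended to a queue (stand-in for Kafka).
queue = deque([
    {"endpoint": "/pins", "ok": True},
    {"endpoint": "/pins", "ok": False},
    {"endpoint": "/pins", "ok": False},
    {"endpoint": "/boards", "ok": True},
])

metrics: Dict[str, Dict[str, int]] = {}  # step 4: stand-in for the metrics database
alerts: List[str] = []                   # step 5: stand-in for dashboard alerts


def process(event: dict) -> None:
    """Step 3: update the metrics for one record as soon as it is consumed."""
    stats = metrics.setdefault(event["endpoint"], {"total": 0, "errors": 0})
    stats["total"] += 1
    if not event["ok"]:
        stats["errors"] += 1
    if stats["errors"] / stats["total"] > ALERT_THRESHOLD:
        alerts.append(f"high error rate on {event['endpoint']}")


# The consumer loop: handle each record individually, in arrival order.
while queue:
    process(queue.popleft())

print(metrics)
print(alerts)
```

Note that the metrics are correct after every single record, so an alert can fire the moment the threshold is crossed rather than at the end of a nightly job.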

### Streaming data processing tools

There is a wide variety of modern tools designed to handle stream data processing. Most of these tools provide broadly similar functionality, although they may differ on some of the technical details. For instance, some tools can handle both batch and streaming data, while others are designed specifically to process streaming data. It's worth mentioning that, of the tools covered here, Apache Spark is among the most widely adopted by global companies.

### Spark Streaming
<p align="left">
<img src= "images/spark-streaming.jpg" width=100>
</p>

> Spark Streaming is an extension of the core Spark API that allows users to process data in real-time from various sources including (but not limited to) Kafka and Flume 

The processed data can be pushed out to various file systems, databases, and live dashboards. Because it is built on the core Spark API, Spark Streaming also integrates with other Spark components such as MLlib and Spark SQL.

Spark Streaming is different from other systems which have a processing engine designed only for streaming, or have similar batch and streaming APIs but compile internally to different engines. Spark’s single execution engine and unified programming model for batch and streaming lead to some unique benefits over other traditional streaming systems.

#### Features of Spark Streaming (over Hadoop)
- More user-friendly than Hadoop and supports more programming languages
- Faster, in-memory data processing (a feature that isn't present in Hadoop which uses disks for data storage)
- Ability to handle both batch and streaming data simultaneously
- Provides ready-to-use application-specific libraries (such as MLlib, GraphX and SQL)

#### Use Cases for Spark Streaming
- Global companies use Spark Streaming to perform sentiment analysis on data ingested from Facebook, Twitter and customer reviews on websites
- Uber uses Spark Streaming as part of its data pipeline to ingest and process event data generated from the mobile application for real-time telemetry analysis
- eBay leverages Spark Streaming as part of its platform to deliver real-time targeted product recommendations and offers to customers 


For more details, [check Spark Streaming's homepage](https://spark.apache.org/docs/latest/streaming-programming-guide.html)

### Apache Storm

<p align="left">
<img src= "images/storm.png" width=100>
</p>

> Apache Storm is an open-source tool for big data used for processing streaming data

Storm is a fault-tolerant data processing system. Its distributed design enables it to process data in real-time. Moreover, it can be used with many programming languages and also supports JSON-based protocols. Storm is easily scalable and user-friendly. It was the first popular streaming platform that was compatible with Hadoop. Although it was popular some time ago, its popularity has declined since Apache Spark was released.

#### Features of Storm
- High throughput as it can process up to one million 100-byte messages per second per node
- Can scale easily as it uses parallel data processing across a cluster of distributed machines
- Self-healing network. In case of node failure, the system automatically restarts and transfers work to another node
- Relatively easy to set up and configure compared to Hadoop

#### Use Cases for Storm
- Real-time analytics (Twitter uses Storm to manage some of the analytics around tweets)
- Fraud detection for online transactions (such as credit card payments)
- Machine learning applications that process a continuous flow of data for real-time predictions (real-time product recommendations)

For more details, [check Storm's homepage](https://storm.apache.org/)

### Apache Flink

<p align="left">
<img src= "images/flink.png" width=100>
</p>


> Flink is another open-source distributed processing framework for big data that can manage both real-time and batch data processing

Flink is designed to run on commodity (inexpensive) hardware, and can perform fast computations in-memory at massive scale. Flink can run in a stand-alone mode, or it can integrate with popular big data frameworks like Hadoop. 

#### Features of Flink
- Fault-tolerant infrastructure that does not have a single point of failure
- High scalability due to its distributed architecture
- Low latency and high throughput
- Supports a wide array of connectors to third-party data sources and targets, such as Elasticsearch, Kinesis, Kafka, and JDBC database systems
- Flink’s Table API and SQL interface can perform data enrichment and transformation tasks and support user-defined functions
- Its Gelly library offers building blocks and algorithms for high-performance, large-scale graph analytics on data batches

#### Use Cases for Flink
- Large telecom companies (such as [Bouygues Telecom](https://2016.flink-forward.org/kb_sessions/a-brief-history-of-time-with-apache-flink-real-time-monitoring-and-analysis-with-flink-kafka-hb/)) use Flink to monitor the network and quality in real-time
- Uber uses Flink as part of their open-source AthenaX analytics platform (more details can be found [here](https://eng.uber.com/athenax/))
- Alibaba is using Flink as part of its online search engine. More details on this use case can be found [here](https://www.ververica.com/blog/blink-flink-alibaba-search)


For more details, [check Flink's homepage](https://flink.apache.org/)


### Apache Samza

<p align="left">
<img src= "images/samza.png" width=100>
</p>

> Apache Samza is another distributed data processing framework for streaming data, originally developed by LinkedIn

It enables building applications that process data in real-time from multiple sources. It can run independently or within a Hadoop YARN cluster and can also integrate with other tools such as Elasticsearch and AWS Kinesis.

#### Features of Samza 
- Samza offers extremely low latency and high throughput to analyze data in real-time
- Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are lost
- Samza is partitioned, distributed and is easy to scale up or down as required
- Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets it run with other messaging systems and execution environments
- It has flexible deployment options which allow its applications to run on-premises, in the cloud, or in containerised environments
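
The ordering guarantee mentioned above applies within a partition, not across the whole stream. The sketch below illustrates the idea in plain Python: messages with the same key always land in the same partition, so their relative order is preserved there. The CRC32-based `partition_for` function is only a stand-in for whatever partitioner the messaging system actually uses, and the keys and events are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 2  # hypothetical partition count


def partition_for(key: str) -> int:
    """Map a key to a partition deterministically (stand-in partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Hypothetical keyed messages, in the order they were produced.
messages: List[Tuple[str, str]] = [
    ("user-a", "login"),
    ("user-b", "login"),
    ("user-a", "view"),
    ("user-a", "logout"),
    ("user-b", "logout"),
]

# Each partition is an append-only log, so order within it is preserved.
partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
for key, value in messages:
    partitions[partition_for(key)].append((key, value))

for partition_id, log in sorted(partitions.items()):
    print(partition_id, log)
```

A consumer reading one partition therefore sees every event for a given user in the order it was written, which is what makes per-key stateful processing safe to parallelise.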

#### Use Cases for Samza

- eBay uses Samza to power low-latency fraud prevention in real time. You can read more about this use case [here](https://samza.apache.org/case-studies/ebay)
- Slack leverages Samza to provide near real-time alerting and to process billions of events such as metrics and logs data. More on this use case [here](https://samza.apache.org/case-studies/slack)
- LinkedIn uses Samza as part of its new email and notification platform. You can read more about this use case [here](https://samza.apache.org/case-studies/linkedin)

For more details on the tool itself, check Samza's [homepage](https://samza.apache.org/)

### Comparing the Stream Processing Tools

<table>
    <thead>
        <tr>
             <th style="width:auto;text-align:center">Tool</th>
             <th style="width:auto;text-align:center">Latency</th>
		     <th style="width:auto;text-align:center">Processing Approach</th>
             <th style="width:auto;text-align:center">Key Features</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Spark Streaming</th>
            <td>Seconds</td>
            <td>Micro-batch</td>
            <td>Currently most popular tool for streaming. Also has a mature support community</td>
        </tr>
        <tr>
            <th>Storm</th>
            <td>Sub-second</td>
            <td>True record-by-record streaming</td>
            <td>Supports several processing guarantees (exactly once, at least once, and at most once)</td>
        </tr>  
		<tr>
            <th>Flink</th>
            <td>Sub-second</td>
            <td>True record-by-record streaming</td>
            <td>Can also support batch data processing</td>
        </tr>
        <tr>
            <th>Samza</th>
            <td>Sub-second</td>
            <td>True record-by-record streaming</td>
            <td>Easily integrates with Kafka and Hadoop</td>
        </tr>  
    </tbody>
 </table>

## Key Takeaways

- Data processing is a critical component of an enterprise's technology foundation as it involves converting raw data into more meaningful information that businesses can use to help enhance decision making
- Batch data processing is the more traditional approach organisations have used for data processing. Batch is leveraged for handling large datasets that contain historical data, while stream data processing focuses on processing small amounts of newly created data in real-time.
- Hadoop, Hive, and Spark Core are the most common tools organisations leverage for big data batch processing at scale. This is because they can handle both structured and unstructured data and can address a wide variety of use cases.
- For streaming data, Spark Streaming, Storm, Flink, and Samza are the popular data processing frameworks used in industry. Each tool has its strengths, although Spark is quickly becoming the dominant technology.