# <font color='green'>Data Engineering Zoom Camp - Detailed Week 5 Notes</font>

# <font color='green'><a id='the_destination_1'>1) Batch Processing</a></font>

There are two key approaches to processing data:

- Batch processing
- Stream processing (sometimes called real-time processing)

In some circles, you’ll hear the first talked about as being the old way of doing things and the second as the more modern approach. The same sort of language is used when comparing monolithic apps to microservices or on-premise solutions to the cloud.

In reality, things aren’t quite that simple in this case or in those other cases mentioned. Stream processing isn’t so much a replacement for batch processing as it is a different approach, and it’s not without its challenges.

### <font color='green'>What is Batch Processing?</font>

Batch processing is a term used to describe collecting, modifying, or exporting multiple data records at a regular cadence, with downtime in between batches. Because large amounts of data can be processed all at once in these batches, it can be a very efficient approach and is best suited to frequent, repetitive tasks. It is the most common form of data processing and fits many businesses' data needs.

Many businesses face increasingly complicated and diverse data challenges due to the sheer magnitude of data available. Batch processing has increased in sophistication, and is often used in conjunction with other processing techniques for modern analysis. While batch processing used to be by far the most common method of data processing, real-time or near real-time stream processing has recently proven to be a worthy competitor. Because traditional batch systems run overnight to process data accumulated during the day, there is naturally a gap between the real world and what the data describes. Advanced Batch Processing partially solves this issue, but even the most advanced systems cannot compete with stream processing for real-time continuous data.


### <font color='green'>Process Flow</font>

The general flow of batch processing can be broken down into the following steps:

- `Data acquisition`: This involves obtaining data from various sources, such as databases, flat files, or web services.

- `Data preparation`: This involves cleaning, filtering, and transforming data to make it ready for processing.

- `Batch scheduling`: This involves scheduling batch processing jobs using a batch scheduling tool, which automates the execution of the jobs.

- `Batch processing`: This involves executing the batch processing jobs, which can include tasks such as data integration, data transformation, and data analysis.

- `Error handling and recovery`: This involves detecting and handling errors that may occur during batch processing, such as missing or invalid data.

- `Reporting and analysis`: This involves generating reports and analyzing the processed data using business intelligence and analytics tools.

- `Archiving and storage`: This involves archiving and storing the processed data for future use or reference.
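
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a production pipeline: the CSV content, the column names (`order_id`, `region`, `amount`), and the function names are all hypothetical, and scheduling, reporting, and archiving are left out.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input standing in for a flat-file source.
RAW_CSV = """order_id,region,amount
1,north,10.50
2,south,7.25
3,north,abc
4,south,2.75
"""

def acquire(raw_text):
    """Data acquisition: read records from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def prepare(records):
    """Data preparation and error handling: keep valid rows, set aside bad ones."""
    clean, rejected = [], []
    for row in records:
        try:
            clean.append({"region": row["region"], "amount": float(row["amount"])})
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
    return clean, rejected

def process(clean_rows):
    """Batch processing: aggregate the whole batch in one pass."""
    totals = defaultdict(float)
    for row in clean_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

def run_batch(raw_text):
    """Run acquisition, preparation, and processing for one batch."""
    records = acquire(raw_text)
    clean, rejected = prepare(records)
    return {"totals": process(clean), "rejected": len(rejected)}

result = run_batch(RAW_CSV)
print(result)
```

The row with the invalid amount (`abc`) is counted as rejected instead of stopping the whole batch, which is one simple way to approach the error handling step.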


### <font color='green'>Tech Stack</font>

Batch processing is a technique that can be used with a wide range of technologies and tools, depending on the specific requirements and constraints of the application. Some of the common technologies used for batch processing are:

`Batch scheduling tools`: These tools are used to schedule and automate batch processing jobs. Some popular batch scheduling tools include Control-M, IBM Tivoli Workload Scheduler, and Autosys.

`Data integration tools`: These tools are used to extract, transform, and load (ETL) data from various sources into a target system. Some popular data integration tools include Informatica, Talend, and SSIS.

`Scripting languages`: Scripting languages like Python, Perl, and shell scripts are often used to write the code for batch processing tasks such as data transformation, file handling, and error handling.

`Relational database management systems (RDBMS)`: RDBMS such as Oracle, MySQL, and SQL Server are commonly used for storing and processing large volumes of data.

`Big data technologies`: Big data technologies like Apache Hadoop, Spark, and Hive are used for processing large volumes of unstructured or semi-structured data.

`Workflow automation tools`: Workflow automation tools like Apache Airflow, Luigi, and Azkaban are used for automating the workflow of batch processing jobs.

`Business intelligence and analytics tools`: Business intelligence and analytics tools like Tableau, QlikView, and Power BI are used for analyzing and visualizing the processed data.

### <font color='green'>Advantages</font>

<b>1) Efficiency</b>

Batch processing allows a company to process data when computing or other resources are available. For example, a common schedule is to process data overnight when the database and servers aren't being used by employees. If data isn't frequently updated, one can simply change the batch processing schedule to make it less frequent as well.
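
As a rough illustration of such a schedule, the sketch below computes when the next overnight run would start and what a less frequent cadence would look like. The 02:00 start time and the function names are assumptions made for the example; in practice a scheduler such as cron or Airflow would handle this.

```python
from datetime import datetime, timedelta

def next_nightly_run(now, run_hour=2):
    """Return the next start time at run_hour:00 that is strictly after now."""
    candidate = now.replace(hour=run_hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

def upcoming_runs(first_run, interval_days, count):
    """List `count` run times spaced interval_days apart, starting at first_run."""
    return [first_run + timedelta(days=interval_days * i) for i in range(count)]

first = next_nightly_run(datetime(2024, 1, 15, 14, 30))
print(first)                                    # nightly run after a working day
print(upcoming_runs(first, interval_days=7, count=3))  # weekly cadence instead
```

Changing `interval_days` is the code equivalent of making the batch schedule less frequent when the data is not updated often.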

<b>2) Simplicity</b>

Compared to stream processing, batch processing is usually less complex and doesn't require special hardware or system support for incoming data. Batch processing systems typically require less maintenance than stream processing.

<b>3) Processing Speed</b>

Because batch processing handles large amounts of data in a single run, it shortens overall processing time and delivers data that companies can use in a timely fashion.


### <font color='green'>Disadvantages</font>

<b>1) Processing delays</b>

Results are only available once a batch has run, so data or transactions that arrive between runs wait until the next one. With large volumes, this delay may impact the overall performance of the system.

<b>2) Limited real-time processing</b>

Batch processing is limited to processing data or transactions in a batch mode, which may not be suitable for applications that require real-time processing.

<b>3) Security</b>

Batch processing may pose security risks, as large volumes of data or transactions are processed at once, making it easier for cyber attackers to access sensitive information.


### <font color='green'>Applications</font>

Batch processing is a widely used technique for processing large volumes of data or transactions in various industries. Here are some real-world examples of batch processing:

`Banking and Finance`: In banking and finance, batch processing is used to process large volumes of financial transactions, such as clearing and settlement of trades, reconciling account balances, and generating financial reports. These tasks are typically run overnight, and the results are made available to users the following morning.

`Retail and E-commerce`: In retail and e-commerce, batch processing is used to update inventory levels, process customer orders, and generate reports. For example, at the end of the day, a retailer may run a batch process to update the inventory levels in their system based on the sales that were made during the day.

`Healthcare`: In healthcare, batch processing is used for tasks such as claims processing, billing, and patient record updates. For example, a health insurer may run a batch process at the end of the day to process claims submitted by healthcare providers during the day.

`Manufacturing`: In manufacturing, batch processing is used to manage production runs of batches of products. For example, a food manufacturer may run a batch process to produce a specific quantity of a product, with each batch consisting of a set number of units.

`Marketing`: In marketing, batch processing is used to manage large volumes of customer data, such as contact information and purchase history. For example, a company may run a batch process to update their marketing database with the latest customer information, allowing them to target specific customers with personalized marketing campaigns.

Overall, batch processing is a common technique used in various industries to process large volumes of data or transactions efficiently and effectively.


### <font color='green'>Advanced Batch Processing</font>

Traditionally, batch processing was usually configured to run sequentially. Each job was processed one after another on a single machine. The need for more sophistication led to the rise of concurrent and parallel batch processing.

<b>Concurrent Batch Processing</b>

Concurrent batch processing typically refers to jobs whose batches partially overlap in time. This overlap means that some piece of the data is always being analyzed at any given time. Concurrent batch processing gives the illusion of parallelism without requiring more than a single CPU core. Due to this concurrent "multi-threading" behavior, the architecture for concurrent batch processing must be designed with fault tolerance in mind. Because batches are not run strictly one after another, a single batch failure could cause a domino effect on other batches if the architecture is poorly configured.
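
A minimal sketch of this overlapping behavior, using Python's `asyncio` on a single thread. The batch names and values are made up for the example, and `asyncio.sleep(0)` is only a stand-in for the moments where a real job would wait on I/O.

```python
import asyncio

async def process_batch(name, records, log):
    """Process one batch, handing control back at each simulated I/O wait."""
    log.append(f"start {name}")
    total = 0
    for value in records:
        await asyncio.sleep(0)  # stand-in for waiting on a database or file
        total += value
    log.append(f"end {name}")
    return name, total

async def run_concurrently(batches):
    """Start every batch at once and collect results in input order."""
    log = []
    tasks = [process_batch(name, records, log) for name, records in batches.items()]
    # return_exceptions=True returns a failed batch's exception as its result
    # instead of raising it, so the other batches' results are still collected.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results, log

batches = {"batch_a": [1, 2, 3], "batch_b": [10, 20, 30]}
results, log = asyncio.run(run_concurrently(batches))
print(results)
print(log)
```

The log shows the second batch starting before the first one ends, even though only one CPU core is doing the work.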

<b>Parallel Batch Processing</b>

Parallel batch processing takes a similar approach to concurrent batch processing; however, instead of overlapping parts of batches over time, entire batches are scheduled in parallel. By taking advantage of relatively cheap modern multicore machines, parallel batch processing can multitask effectively.
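
Here is a small sketch of whole batches running side by side on separate CPU cores with `concurrent.futures`. The function names and the sample batches are assumptions for illustration. The `executor_cls` parameter is only there so that the same function can be tried with threads for comparison.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def summarize_batch(batch):
    """CPU-bound work on one whole batch: return (record count, sum)."""
    return len(batch), sum(batch)

def run_in_parallel(batches, executor_cls=ProcessPoolExecutor, max_workers=2):
    """Hand each batch to a pool of workers so whole batches run side by side."""
    with executor_cls(max_workers=max_workers) as pool:
        # map returns results in the same order as the input batches.
        return list(pool.map(summarize_batch, batches))

if __name__ == "__main__":
    # The guard is needed because worker processes may re-import this module.
    batches = [list(range(1_000)), list(range(1_000, 2_000)), list(range(2_000, 3_000))]
    print(run_in_parallel(batches))
```

With `ProcessPoolExecutor`, each batch is processed in its own worker process, so multiple cores can be used at the same time. Passing `ThreadPoolExecutor` instead gives concurrency within one process rather than true parallelism for CPU-bound work.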

<b>Modern Batch Processing</b>

Modern batch processing often combines concurrent and parallel batch processing, an approach also called parallel concurrent batch processing. It relies on finding the right balance of parameter tunings to optimize how each CPU core handles multiple tasks and how each worker system handles a single task; when properly configured, it is a state-of-the-art solution. Institutions that require greater stability and security, such as those in the financial sector, most commonly use parallel concurrent batch processing. For the most important data, multiple redundant batches are often run so that even if one batch fails, the others can cover for the failure.

As mentioned earlier, live data streaming has traditionally been a challenge for batch processing. While attempts have been made to use concurrent and parallel batch processing methods to analyze "microbatches" stacked on top of each other on extremely powerful machines, the use case for complex architectures like this is niche. For the majority of live data cases, stream processing is still preferred. The main business use case for batch processing here is when the quantity of data to be analyzed is so large that stream processing is not a viable option.
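
To make the "microbatch" idea concrete, here is a minimal sketch that cuts an incoming stream into small fixed-size batches and processes each one as it fills up. The helper name `micro_batches` and the sample events are hypothetical; real systems usually cut batches by time window as well as by size.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size records from any iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

events = range(1, 8)  # hypothetical stream of 7 events
sums = [sum(batch) for batch in micro_batches(events, batch_size=3)]
print(sums)  # [6, 15, 7]
```

Each small batch is processed shortly after its records arrive, which narrows the delay of a nightly run without processing every record individually.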

# <font color='green'><a id='the_destination_2'>2) Apache Spark</a></font>

Apache Spark is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing. You’ll find it used by organizations in many industries, including FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. Apache Spark has become one of the most popular big data distributed processing frameworks, with 365,000 meetup members in 2017.

Spark is designed to be fast, scalable, and easy to use, and it provides a range of features that make it well-suited for big data processing, machine learning, and real-time stream processing. Some of the key features of Spark include:

`In-memory processing`: Spark processes data in-memory, which allows for faster processing times than traditional disk-based processing systems.

`Distributed computing`: Spark is designed to run on a cluster of machines, allowing it to process large amounts of data in parallel across multiple nodes.

`Data processing APIs`: Spark provides a range of APIs for processing data, including SQL, Streaming, Machine Learning, and Graph processing APIs.

`Fault tolerance`: Spark is designed to be fault-tolerant, meaning that it can recover from failures in the cluster without losing data.

`Community support`: Spark has a large and active community of users and developers who contribute to the development and maintenance of the system.
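
The distributed computing idea can be pictured with a plain Python analogy. Note that this is not the Spark API, just a sketch of the general pattern: split the data into partitions, do the same work on each partition independently, then merge the partial results. The function names and sample lines are made up for the example. On a real cluster, each partition would be handled by a different worker.

```python
from collections import Counter
from functools import reduce

def partition(records, num_partitions):
    """Split records into num_partitions roughly equal slices."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def count_words(lines):
    """Per-partition work: count words using only this partition's lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge(left, right):
    """Combine two partial results into one."""
    return left + right

lines = ["spark is fast", "spark is distributed", "batch is simple", "spark scales"]
partitions = partition(lines, num_partitions=2)
# On a cluster, each of these calls would run as a separate task on a worker.
partial_counts = [count_words(part) for part in partitions]
total = reduce(merge, partial_counts, Counter())
print(total.most_common(2))
```

Because each partition is processed independently, a lost partial result can be recomputed from its partition alone, which hints at how a framework can recover from a failed worker.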


### <font color='green'>When to use Spark?</font>

Spark is a versatile tool for processing big data and can be used in a wide range of applications. Here are some scenarios where Spark is particularly well-suited:

<b>1) Processing large volumes of data</b> 
    
Spark is designed to process large volumes of data quickly and efficiently. If you have large datasets that are too big to fit into memory on a single machine, Spark can help you distribute the processing across a cluster of machines, allowing you to process the data faster.

<b>2) Real-time stream processing</b> 

Spark Streaming is a component of Spark that allows you to process real-time data streams. If you need to process data in real-time, Spark Streaming provides a scalable and fault-tolerant platform for doing so.

<b>3) Machine learning</b> 

Spark's Machine Learning Library (MLlib) provides a range of algorithms for building machine learning models. If you need to train machine learning models on large datasets, Spark can help you distribute the processing across a cluster of machines, allowing you to train models faster.

<b>4) Graph processing</b> 

Spark provides a Graph Processing API (GraphX) that allows you to process large-scale graphs. If you need to perform graph analysis on large datasets, Spark can help you distribute the processing across a cluster of machines, allowing you to process the graph faster.

<b>5) Ad-hoc data analysis</b>

Spark provides an SQL API (Spark SQL) that allows you to run SQL queries on large datasets. If you need to perform ad-hoc data analysis on large datasets, Spark SQL can help you do so quickly and efficiently.

Overall, Spark is well-suited for applications that involve processing large volumes of data, real-time stream processing, machine learning, graph processing, and ad-hoc data analysis. If you have data processing needs in any of these areas, Spark may be a good choice for your application.

![image.png](attachment:image.png)

### <font color='green'>Spark and PySpark Installation</font>

Follow the instructions from here - 

https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_5_batch_processing/setup/windows.md


# <font color='green'><a id='the_destination_3'>3) Remaining Notes</a></font>

The remaining notes can be found as individual code files along with explanations here - https://github.com/Balajirvp/DE-Zoomcamp/tree/main/Week%205/Code

# <font color='green'><a id='the_destination_4'>4) References</a></font>

https://dataengineering.wiki/Concepts/Batch+Data+Processing \
https://www.montecarlodata.com/blog-stream-vs-batch-processing/