# Spark
<p align=center><a href=https://spark.apache.org/><img src=images/spark-logo-trademark.png width=300></a></p>

> <font size=+1>Spark is a unified engine for large-scale distributed data processing on computer clusters</font>

It was originally written in the [__Scala__](https://www.scala-lang.org/) programming language, and its open source project is available on [GitHub](https://github.com/apache/spark).

Spark provides in-memory storage for intermediate computations, and it is designed considering four key points:

1. Speed: Spark's framework takes advantage of current hardware improvements. It builds its computations as DAGs (directed acyclic graphs) and uses query optimizers, which allow it to run multiple tasks in parallel.

2. Ease of use: Spark offers a simple programming model so that high-level data structures (DataFrames for example) are handled using familiar languages.

3. Modularity: Spark operations can be expressed in different programming languages (Python, Java, Scala, and R), and its unified libraries all run under a single engine. That means that you can write a single Spark application that does all the tasks you need.

4. Extensibility: Spark focuses on computation rather than storage, and as such, the developers made it compatible with a myriad of sources. You can see the connections in the following image:

<p align=center><img src=images/Spark_Connections.png width=400></p>


One of the main uses of Spark is parallelizing computations while hiding all the complexity of distribution. That way data engineers can focus on high-level operations, such as ETL.

However, Spark also integrates many other tools, such as Spark MLlib, which offers a set of ML algorithms for building model pipelines.

Thus, Spark is not just a Data Engineering tool. Some popular use cases of Spark are:

- Processing large datasets distributed across a cluster
- Performing queries to explore and visualize datasets
- Implementing end-to-end data pipelines from myriad sources
- Analyzing Graph datasets

In this notebook we will explore the theory behind Spark, so that the operations you perform in the next lessons make more sense!


## Supported language frontends

Official APIs are provided for different languages:
- [PySpark](https://spark.apache.org/docs/latest/api/python/) - as the name suggests, the Python frontend for Spark
- [Java API](https://spark.apache.org/docs/latest/api/java/index.html) - available because Scala runs on the JVM and is highly interoperable with Java
- [SparkR](https://spark.apache.org/docs/latest/sparkr.html) - [R language](https://www.r-project.org/) front-end for statistics-oriented code

> __We will use PySpark to interact with the Spark engine__

<br>

## High level libraries

High level libraries are provided on top of `Spark`, namely:
- [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) - Query language for data processing
- [MLlib](https://spark.apache.org/docs/latest/ml-guide.html) - Machine Learning on Spark computing engine
- [GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html) - graph related operations
- [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) - streaming related operations

> __In this notebook we will focus on core Spark functionalities__

> Other functionalities can be used on the same engine, __please refer to the documentation if you need a specific part in your workflow__

# Cluster mode overview

Now you know that Spark is a distributed data processing engine whose components work together on a cluster. Before we dive in, let's see in more detail what the engine consists of and how to choose a cluster manager.

Spark applications usually run on __clusters__ and consist of:

- Spark Driver, which in turn contains the SparkSession
- Cluster Manager
- Spark Executor

From the image below, you can observe that a Spark application consists of a driver program that orchestrates parallel operations on a cluster. The application holds the information about the session, which the driver uses to access the executors (inside the nodes) and the cluster manager.

<p align=center><img src=images/pyspark-driver-executor.png width=500></p>


Let's look at these components in more detail.

## Spark Driver

The Spark Driver instantiates a `SparkSession` and is responsible for:

- Communicating with the __Cluster Manager__
- Requesting resources from the __Cluster Manager__ and allocating them to the Executors
- Orchestrating and scheduling the Spark operations
- Communicating with the executors, once the resources are allocated, to 'tell' them the schedule:
    - __Sends code to the executors__, one of:
        - Python files (in case of PySpark)
        - JAR files for Scala/Java code
    - __Sends tasks to the executors__; each task is a __single unit of work sent to a single executor__

> The application code is submitted via the `spark-submit` script
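To make the driver's role more concrete, here is a minimal sketch in plain Python (no Spark involved). A thread pool stands in for the executors, and the hypothetical `run_job` function plays the driver: it splits the data into partitions, sends one task per partition, and gathers the partial results. The function names and the sum-of-squares workload are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """The 'task': a single unit of work applied to one partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3):
    """Toy 'driver': schedule one task per partition and gather the results."""
    partitions = split_into_partitions(data, num_executors)
    # The thread pool is only a stand-in for executors on a real cluster.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return partitions, partial_results, sum(partial_results)


partitions, partial_results, total = run_job(list(range(10)))
print(partitions)       # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(partial_results)  # [14, 77, 194]
print(total)            # 285
```

In a real application the executors are separate JVM processes on other machines, so the code and the tasks travel over the network rather than between threads.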

## SparkSession

SparkSession is a unified conduit for Spark operations and data. Through this conduit, you can:

- Create JVM (Java Virtual Machine) runtime parameters for the executors
- Define DataFrames and Datasets
- Read from data sources
- Send SQL queries

SparkSession is one of the core components of a Spark application, and its high-level API is available in a variety of programming languages.

<font size=3>In Spark 1.x, SparkContext was the entry point. In newer versions you can still see SparkContext, because SparkSession keeps backward compatibility with code that uses it</font>

## Cluster Manager

> <font size=+1>A cluster manager is a program that allocates resources to our application(s)</font>

Cluster manager is responsible for:
- Handling requests from a driver for resources

There are a few available options, the most important of which are:
- Local - runs everything on a single machine (__non-distributed!__)
- [Standalone](https://spark.apache.org/docs/latest/spark-standalone.html) - the simple cluster manager bundled with Spark, the "default" choice
- [Apache Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html) - a general-purpose cluster manager, useful for __more generic workloads__
- [Hadoop YARN](https://spark.apache.org/docs/latest/running-on-yarn.html) - the Hadoop cluster manager, suited to Hadoop-oriented operations (e.g. map-reduce)
- [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html) - container-first, auto-scalable workloads
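In practice, the cluster manager is selected through the master URL passed to Spark. The sketch below is plain Python and only classifies a URL by its prefix; the prefixes follow the master URL formats listed in the Spark documentation, while the helper name `cluster_manager_for` and the hosts and ports are hypothetical.

```python
def cluster_manager_for(master_url):
    """Return the cluster manager implied by a Spark master URL."""
    if master_url == "local" or master_url.startswith("local["):
        return "Local"
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url.startswith("mesos://"):
        return "Apache Mesos"
    if master_url == "yarn":
        return "Hadoop YARN"
    if master_url.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"Unrecognised master URL: {master_url!r}")


# Hypothetical hosts and ports, for illustration only.
examples = [
    "local[*]",
    "spark://master-host:7077",
    "mesos://mesos-host:5050",
    "yarn",
    "k8s://https://k8s-host:6443",
]
for url in examples:
    print(f"{url:30} -> {cluster_manager_for(url)}")
```

The application code stays the same whichever manager is chosen; only the master URL (and the deployment around it) changes.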


## Local

> Use it when you are developing and testing your app

As it is the simplest one, we can verify everything works correctly using a single machine

## Standalone

> Small clusters RUNNING ONLY SPARK APPLICATIONS

This one doesn't require additional software, but it has the following drawbacks:
- We cannot run other workloads on it (e.g. monitoring)
- It is Spark first, i.e. built to serve Spark applications only
- __Runs main and child processes of Spark on each node__, hence it has an additional overhead

## Mesos

> __Larger/production clusters with GENERAL capabilities__

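- More generic than YARN
- Good option for non-containerized workloads
- Mesos support has been deprecated since Spark 3.2, so check the documentation for your Spark version before picking it for a new project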

## YARN (Hadoop 2.0)

> __Larger/production clusters with GENERAL capabilities BETTER AT RUNNING HADOOP SPECIFIC OPERATIONS__

Other than that, it is quite similar to Mesos.

<p align=center><img src=images/spark-standalone-hadoop.png width=400></p>

## Kubernetes

> __Workloads which can autoscale (create more/less instances based on workload) and containerized__

Kubernetes (also known as `k8s`) has become a go-to choice for the following reasons:
- We can containerize most applications
- Because of that, our deployment is streamlined and less error-prone (a different OS no longer means different behaviour)
- __Autoscaling__ - create more worker nodes if needed
- __Available as a service on many clouds__ ([Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks/), [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine) or [Microsoft's Azure Kubernetes Service](https://azure.microsoft.com/en-us/services/kubernetes-service/))

This approach scales well:
- across different regions if needed
- across different cloud providers if needed
- down to smaller teams (for which running their own Kubernetes cluster is too costly) via out-of-the-box cloud solutions

## Nomad

> __Mixed containerized and non-containerized workloads across a large number of clusters__

Similar to `k8s` but:
- No autoscaling out of the box (needs additional software for that)
- Smaller community support
- __Easier to use than Kubernetes__
- __Less popular than Kubernetes__

## Executor

> __Processes which run computations and store data__

Data can be stored in a few different ways, which we will talk about later (see `Data Locality` below).

Things to note:
- __Each application gets its own executors on a node__
- __There might be multiple executors on a single node__
- Because of this, applications are isolated from each other (each executor runs in a separate JVM)
- __DATA CANNOT BE EASILY SHARED BETWEEN SPARK APPLICATIONS__ (we need to save the data in some widely available storage, like Kubernetes volumes, for other apps to use)

## Useful things to note

> __See [glossary](https://spark.apache.org/docs/latest/cluster-overview.html) for a quick reminder of all of the concepts__

- __A job is a parallel computation made up of multiple tasks__ distributed across the cluster, spawned by an action such as `collect`
- __The driver should be close to the workers__ (or most of them) as it orchestrates the whole workload (ideally in the same local network)

# HDFS (Hadoop Distributed FileSystem)

<p><a href='https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction'><img src=images/hdfs.png width=200></a></p>

Different cluster managers can handle data differently, so data could (theoretically) be:
- shared across cluster
- shared across parts of the cluster
- kept local for each node
- kept local and pulled/moved around

> __This would affect computation speeds tremendously!__ 

Apache has an answer to that: __HDFS__
<p align=center><img src=images/hdfsarchitecture.png width=500></p>


The Hadoop Distributed File System (HDFS) provides high-throughput access to application data and is suitable for applications that have large data sets. Combined with a processing model named MapReduce, this file system lets us avoid the aforementioned problems.

# MapReduce

> __Processing layer over distributed filesystem designed for processing large volumes of data in parallel by dividing work into a set of independent tasks__

The idea works as follows:
- User submits a __job__ (usually a large amount of work) which consists of:
    - Execution of __Mapper__
    - Execution of __Reducer__
- The __job__ is split into tasks (done in parallel by worker nodes)
- These are sent to child processes on the cluster
- __Individual Mapper and Reducer executions are done on each node__
- Each task returns an output, which is later aggregated to give the final result

> __MapReduce operates on lists!__

This means:
- Inputs to our functions are lists
- Procedures output lists
- __Functional programming approach__ (data is immutable)
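The list-in, list-out idea can be sketched with Python's built-in `map` and `functools.reduce`. This is only an illustration of the functional style, not Hadoop code: the word list is made up, and the reduce step returns a new dictionary on every call instead of mutating its accumulator.

```python
from functools import reduce

words = ["spark", "hadoop", "spark", "hdfs", "spark"]

# Map: every input record becomes a (key, value) pair; the input list is untouched.
pairs = list(map(lambda word: (word, 1), words))


def merge_count(counts, pair):
    """Reduce step: build a NEW dict rather than mutating the accumulator."""
    key, value = pair
    return {**counts, key: counts.get(key, 0) + value}


counts = reduce(merge_count, pairs, {})
print(pairs)   # [('spark', 1), ('hadoop', 1), ('spark', 1), ('hdfs', 1), ('spark', 1)]
print(counts)  # {'spark': 3, 'hadoop': 1, 'hdfs': 1}
```

Because no step modifies its input, any step can be re-run on the same data and give the same answer, which is what makes retrying failed tasks safe.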


## Task Attempt

> Each node can attempt to perform a task (Task In Progress a.k.a. TIP status) __but the attempt may fail for various reasons__

If a task attempt fails:
- Hadoop reschedules the task on another node
- This can happen multiple times (__up to `4` attempts by default__)
- After that, the whole job fails
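The retry behaviour can be mimicked in a few lines of plain Python. This is a toy model rather than Hadoop's scheduler: the node names, the `flaky_task` and `always_fails` functions, and the round-robin choice of the next node are all assumptions for illustration. Only the limit of `4` attempts comes from the text above.

```python
MAX_ATTEMPTS = 4  # default number of attempts mentioned above


class JobFailed(Exception):
    """Raised when a task has used up all of its attempts."""


def run_with_retries(task, nodes, max_attempts=MAX_ATTEMPTS):
    """Try the task on successive nodes; give up after max_attempts failures."""
    failures = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # naive choice of the next node
        try:
            return task(node), attempt + 1
        except RuntimeError as error:
            failures.append((node, str(error)))
    raise JobFailed(f"task failed {max_attempts} times: {failures}")


def flaky_task(node):
    """Hypothetical task that only succeeds on 'node-3'."""
    if node != "node-3":
        raise RuntimeError(f"{node} is unavailable")
    return f"done on {node}"


def always_fails(node):
    """Hypothetical task that fails everywhere."""
    raise RuntimeError(f"{node} crashed")


result, attempts = run_with_retries(flaky_task, ["node-1", "node-2", "node-3"])
print(result, attempts)  # done on node-3 3

try:
    run_with_retries(always_fails, ["node-1", "node-2"])
    job_failed = False
except JobFailed:
    job_failed = True
print(job_failed)  # True
```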



## High level flow

<p align=center><img src=images/map_reduce_counting.jpg width=600></p>

Let's see how we obtain results step by step by extending the diagram above:
1. Our input data (usually saved in HDFS), in this case text
2. `InputFormat` defines:
    - __How to split data__
    - __How to read them__
    - __Creates `InputSplit`s__
3. __`InputSplit`s represent data processed by each `Mapper`__:
    - One `map` task for each split
    - `InputSplit` is divided into separate records
    - And these records are processed by `map` operation
4. __`RecordReader`__ communicates with `InputSplit` to:
    - Transform the split into readable format for mapper (`(key, value)` pairs)
5. __`Mapper`__ - processes `(key, value)` pair from `RecordReader` and:
    - Generates new `(key, value)` pairs
    - __Does it by our specified logic__ (in this case counting word occurrences)
    - Outputs values to local disk creating __temporary results__ (__THESE ARE NOT SAVED TO HDFS!__)
6. __`Combiner`__ (a.k.a. `mini-reducer`) - takes temporary values and:
    - Aggregates them locally on the mapper node (e.g. sums the counts of a repeated word)
    - This is done in order to minimize data transfers over the network
7. __`Partitioner`__ (__USED ONLY FOR MULTIPLE `Reducer`s__):
    - Takes output from `combiner`
    - __`key` is used to choose the partition__ (in our case a specific word)
    - __Records having the same `key` always land in the same partition__
    - __AIMS FOR APPROXIMATELY THE SAME LOAD FOR EACH `Reducer`__ (assuming the keys are spread evenly)
8. __Shuffling and Sorting__ - data is sent via the network to the Reducer nodes:
    - __Each `Reducer` might get multiple partitions__
    - Each partition is sorted so that equal keys form __a consecutive block of data__
9. __`Reducer`__ - takes the grouped values from the previous step and:
    - Runs __user defined reduction operation__ on each key and its list of temporary values
    - In our case it sums the counts for each word
    - __Stores output on HDFS__ via `RecordWriter`
    - We can modify it (e.g. in Java) via specifying custom `OutputFormat` ([documentation](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputFormat.html))
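The steps above can be simulated end to end in plain Python. This is a single-process sketch, not Hadoop code: the function names mirror the components for readability only, the input text is made up, the record key is a line number (Hadoop's text input uses a byte offset), and the CRC32-based partitioner is an assumed stand-in for a hash partitioner.

```python
import zlib
from collections import defaultdict


def record_reader(split):
    """Turn one split into (key, value) records: (line number, line text)."""
    return list(enumerate(split))


def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combiner(pairs):
    """Mini-reducer: sum counts locally before anything crosses the network."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def partitioner(key, num_reducers):
    """Pick a reducer from the key alone, so equal keys always meet."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum all the partial counts collected for one word."""
    return key, sum(values)


def word_count(splits, num_reducers=2):
    # Map side: one map task per split, followed by its combiner.
    map_outputs = []
    for split in splits:
        pairs = [pair
                 for key, line in record_reader(split)
                 for pair in mapper(key, line)]
        map_outputs.append(combiner(pairs))

    # Shuffle: route every pair to the reducer chosen by the partitioner.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for word, count in output:
            reducer_inputs[partitioner(word, num_reducers)][word].append(count)

    # Sort and reduce: each reducer walks its keys in sorted order.
    results = {}
    for grouped in reducer_inputs:
        for word in sorted(grouped):
            key, total = reducer(word, grouped[word])
            results[key] = total
    return results


# Hypothetical input: three splits of one line each.
splits = [["deer bear river"], ["car car river"], ["deer car bear"]]
counts = word_count(splits)
print(sorted(counts.items()))  # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Changing `num_reducers` changes which reducer handles each word, but never the final counts, because every occurrence of a word is routed by the same key.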

## Map Reduce FAQ

There might be a few misconceptions, so let's clear them up:

> Why does shuffle happen?

It happens because chunks of data are moved across the network. They might:
- arrive at different times
- __come from any node in the cluster__

Hence they are unorganized on disk __and that's why we have to sort them afterwards__

> How many mappings are run on one node?

Usually around `10`-`100` parallel map tasks __per node__ are run. For very CPU-light tasks, up to `300` is reasonable

> Why does sorting happen twice?

__This is done only for multiple `Reducers`__, because:
- A single partition lands on a single `Reducer`, which keeps network congestion smaller
- There might be multiple "same" partitions (based on `key`) coming from different mappers
- Multiple partitions might be processed by one node

Let's look at an example of the last point:
1. There are `3` `A` and `3` `B` partitions in total in the cluster
2. Each partition lies on a different mapper node
3. Intermediate results are sent to `Reducers`

We might obtain the following (__already sorted by mappers!__) data scheme: `ABABAB`. Each piece is sorted on its own, but their concatenation is not, so we have to sort (merge) them once again.
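The second sort is really a merge of runs that are already sorted. A small sketch with Python's `heapq.merge`, assuming three hypothetical mappers that each emit one `A` record followed by one `B` record:

```python
import heapq

# Each inner list is one mapper's output, already sorted by key.
mapper_outputs = [
    [("A", 1), ("B", 1)],
    [("A", 1), ("B", 1)],
    [("A", 1), ("B", 1)],
]

# Simply concatenating the sorted runs does NOT give a sorted sequence.
concatenated = [pair for run in mapper_outputs for pair in run]

# Merging the sorted runs restores a global order, grouping equal keys together.
merged = list(heapq.merge(*mapper_outputs, key=lambda pair: pair[0]))

print("".join(key for key, _ in concatenated))  # ABABAB
print("".join(key for key, _ in merged))        # AAABBB
```

After the merge, all `A` records sit next to each other, so the reducer can process one key at a time as a consecutive block.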

> Can `Reducer` run when some of the `mappers` did not finish?

__No__, the reduce function itself cannot start, as a late mapper output might mean the "reduce" operation would need to be recalculated. It runs only after "aggregating" the outputs of all mappers

# Spark vs Hadoop

Hadoop itself was mentioned a few times. What is it, and what's the difference between it and Spark?

> __Hadoop consists of HDFS, MapReduce computational layer and YARN (Hadoop cluster manager)__

Hadoop is written in `Java` and was released in `2006`; it provides multiple front-end languages to interact with as well. When compared to `Spark`:

- Hadoop is a large-scale batch-processing framework
- __HADOOP DOES NOT PROVIDE REAL-TIME CAPABILITIES FOR CALCULATIONS__ because:
    - Writing to disk all the time is too slow
    - This is solved by `Spark`
- __`Spark` DOES ITS COMPUTATIONS IN-MEMORY WHEREVER POSSIBLE__:
    - It does not write intermediate results from nodes to disk (__or at least not for the part of the data which fits in RAM__)
    - Because of this, `Spark` can be up to about `100` times faster on in-memory workloads
- __Spark can use multiple cluster managers__
- __Spark is more of a "high-level" tool__ which uses various concepts from `Hadoop` and applies an abstraction layer over them
- __Hadoop does not provide specific functionalities__ that Spark has (e.g. `MLlib`)
- __Parts of Hadoop can be used together with Apache Spark__ (e.g. `HDFS` for storage and `YARN` as the cluster manager)

Essentially we have all of the pieces of Hadoop in place (at least in theory).

# Data Locality

> __Data Locality is the process of moving COMPUTATIONS closer to DATA__ (so they are run locally a.k.a. "per-node")

In general, if `data` and the code that operates on it reside close to each other, the whole computation will be fast.
When they are apart, one has to move to the other. Serialized code is usually much smaller than the data, hence __computation has to be moved towards data__.

There are a few possibilities when it comes to data locality in Spark (__ordered by best to worst__):
1. `PROCESS_LOCAL` - __code is in the same `JVM` as data__ 
2. `NODE_LOCAL` - __data on the same node__, for example:
    - HDFS on the same node
    - Another executor on the same node
    - __Data has to travel between processes__
3. `NO_PREF` - data has no preference where it is located because:
    - It does not matter for computation
    - __Example:__ shared volumes in `k8s`
4. `RACK_LOCAL` - data on the same rack of servers, __data has to be sent through a single switch in the network__
5. `ANY` - data is elsewhere on the network, __not in the same rack__

__When `Spark` schedules computations, it does so w.r.t. data locality__, which means:
- `Spark` checks whether the best node to process the data is available
- If not, `Spark` waits for the busy `CPU` with the best data locality to finish its computation __but only for a short while__
- If it does not finish within a predefined `timeout`, __Spark moves the data to the next free `CPU`__
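The waiting behaviour can be modelled with a short plain-Python sketch. This is a deliberately simplified toy, not Spark's scheduler: the function `pick_locality` and its `free_at` input are hypothetical, and a single `wait` value is applied to every level. The default of 3 seconds matches the documented default of Spark's `spark.locality.wait` setting.

```python
LOCALITY_LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY"]


def pick_locality(free_at, wait=3.0):
    """Return (level, launch_time) for one task.

    free_at maps a locality level to the time (in seconds) at which a slot
    at that level becomes free. Levels missing from free_at have no
    candidate slot and are skipped without waiting.
    """
    now = 0.0
    for level in LOCALITY_LEVELS:
        if level not in free_at:
            continue
        slot_time = free_at[level]
        if slot_time <= now + wait:
            # The slot frees up before the timeout: launch at this level.
            return level, max(now, slot_time)
        # Timeout expired: fall back to the next (worse) locality level.
        now += wait
    raise RuntimeError("no slot available at any locality level")


# The best slot frees up in 2s, within the 3s timeout: wait for it.
print(pick_locality({"PROCESS_LOCAL": 2.0, "ANY": 0.0}))             # ('PROCESS_LOCAL', 2.0)

# The best slot is busy for 10s: give up after 3s and run anywhere.
print(pick_locality({"PROCESS_LOCAL": 10.0, "ANY": 0.0}))            # ('ANY', 3.0)

# A longer timeout trades a later start for better locality.
print(pick_locality({"PROCESS_LOCAL": 10.0, "ANY": 0.0}, wait=10.0)) # ('PROCESS_LOCAL', 10.0)
```

The last two calls show the trade-off: a short timeout starts the task sooner but far from its data, while a longer one keeps the task local at the cost of idle waiting.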


> <font size=+1>One can control this timeout via the `spark.locality.wait` settings, which we will see later</font>

<br>

> <font size=+1>YOU SHOULD INCREASE TIMEOUT IF YOU SEE POOR DATA LOCALITY WITH DEFAULT SETTINGS!</font>

<br>

> <font size=+1>Timeouts should be tied to how long your jobs run on the cluster</font>

- Check out [Apache Myriad Project](https://incubator.apache.org/projects/myriad.html) - what might be its benefits for cluster management in PySpark?
- What is a secondary `NameNode` and what is its purpose in Hadoop's FileSystem?
- Check out how to work with Hadoop's FileSystem via command line using this series of tutorials ([1](https://data-flair.training/blogs/top-hadoop-hdfs-commands-tutorial/), [2](https://data-flair.training/blogs/hadoop-hdfs-commands/) and [3](https://data-flair.training/blogs/hdfs-hadoop-commands/))
- Why is `Partitioner` __NOT__ used in Hadoop MapReduce processing layer with single `Reducer`?

- What is the SIMR approach included in the second graphic in this notebook? Check out [this article](https://databricks.com/blog/2014/01/21/spark-and-hadoop.html)
- What is RAID and how does Erasure Coding work in Hadoop? Check [this tutorial](https://data-flair.training/blogs/hadoop-hdfs-erasure-coding/)
- What are `BackupNode`s and `CheckpointNode`s in HDFS?