In distributed systems, programs should not rely on resources created by previous executions.

Idempotent programs do not rely on existing state before starting, so they can run multiple times without changing the result.

The code below is an example of a program that relies on a previous state:

`ledgerBalance = getLedgerBalanceFromLastRun()`

`ledgerBalance = ledgerBalance + getLedgerBalanceSinceLastRun()`


To avoid relying on the prior state, write the computation like this:

`ledgerBalance = addAllTransactions()`
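To make the contrast concrete, here is a minimal plain-Python sketch. The transaction amounts, the `saved_state` dictionary, and both function names are hypothetical stand-ins for the pseudocode above; none of them come from a Spark API.

```python
# Hypothetical ledger transactions (illustrative values only)
transactions = [100, -20, 50]

# Stand-in for state left behind by a previous execution
saved_state = {"ledger_balance": 0}


def update_balance_from_last_run(new_transactions):
    """Non-idempotent: the result depends on state from earlier runs."""
    saved_state["ledger_balance"] += sum(new_transactions)
    return saved_state["ledger_balance"]


def add_all_transactions(all_transactions):
    """Idempotent: the result depends only on the input."""
    return sum(all_transactions)


# Re-running the stateful version with the same input double counts
first_stateful = update_balance_from_last_run(transactions)
second_stateful = update_balance_from_last_run(transactions)

# Re-running the idempotent version always gives the same answer
first_idempotent = add_all_transactions(transactions)
second_idempotent = add_all_transactions(transactions)

print(first_stateful, second_stateful)      # 130 260
print(first_idempotent, second_idempotent)  # 130 130
```

If a job is retried after a failure, only the second style is safe to run again without first cleaning up leftover state.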

Idempotent code allows data to be processed in parallel. The same code can be called in different threads, on different nodes and servers, for each block of data. Because each run has no reliance on a prior execution, there is no problem splitting up the processing.
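As a rough illustration of splitting the work, the sketch below processes independent blocks with a thread pool and then combines the partial results. Threads on one machine stand in for nodes in a cluster, and the data, block size, and `sum_block` function are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transaction amounts, split into independent blocks
transactions = list(range(1, 101))
block_size = 25
blocks = [transactions[i:i + block_size]
          for i in range(0, len(transactions), block_size)]


def sum_block(block):
    """Depends only on its own block, so any worker can run it."""
    return sum(block)


# Each block is handled by whichever worker thread is free
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_block, blocks))

# Combining the partial results matches a single sequential pass
total = sum(partial_sums)
print(partial_sums)  # [325, 950, 1575, 2200]
print(total)         # 5050
```

Because `sum_block` reads nothing but its argument, the blocks can be processed in any order, or retried individually, without affecting the total.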

In Spark code, data is not read into a regular list or array, because the data can be too large to fit in memory. Instead, special datasets such as *Resilient Distributed Datasets (RDDs)* and *DataFrames* are used. These datasets act like a SQL query cursor because they do not hold all the data in memory at once.

These datasets give a Spark job access to the shared resources of a cluster in a controlled way that is managed outside the Spark job.

In [None]:
# Instead of doing something like this

with open("invoices.txt", "r") as textFile:
    # invoiceList could occupy gigabytes of memory
    invoiceList = textFile.readlines()

print(invoiceList)

# Do something like this instead

from pyspark.sql import SparkSession

# Reuse the active session, or create one when running outside a notebook
spark = SparkSession.builder.getOrCreate()

invoiceDataFrame = spark.read.text("invoices.txt")

# Leverage Spark DataFrames to handle large datasets
invoiceDataFrame.show(n=10)
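The cursor-like behavior can be loosely imitated in plain Python by iterating over a file instead of loading it whole. This is only an analogy under stated assumptions: the temporary file, its contents, and the `first_lines` helper are hypothetical, and Spark's own mechanism is distributed and considerably more involved.

```python
import os
import tempfile

# Write a small hypothetical invoices file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"invoice-{i}\n" for i in range(1, 1001)))
    path = tmp.name


def first_lines(file_path, n):
    """Yield the first n lines without loading the rest of the file."""
    with open(file_path, "r") as handle:
        for index, line in enumerate(handle):
            if index >= n:
                break
            yield line.rstrip("\n")


# Only ten lines are kept, loosely like show(n=10)
preview = list(first_lines(path, 10))
print(preview[0], preview[-1])  # invoice-1 invoice-10

os.remove(path)
```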

#### Directed Acyclic Graph (DAG)

Similar to how cyclists carry their own water, every Spark program makes a copy of its input data and never changes the original parent data. Because Spark doesn't change or mutate the input data, the data is known as immutable. But what happens when you have lots of function calls in your program?

In Spark, you handle many function calls by chaining them together, with each call accomplishing a small chunk of the work.
It may appear in your code that every step will run sequentially. However, Spark may run the steps in a different way if it finds a more efficient execution plan.

Spark uses a programming concept called **lazy evaluation.** Before Spark does anything with the data in your program, it first builds step-by-step directions of what functions and data it will need.
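The idea can be sketched with a toy class in plain Python. `LazyPipeline` is a hypothetical illustration, not a Spark API: chained calls only record steps, and the work runs when `collect()` is called, in the same spirit as Spark transformations waiting for an action.

```python
class LazyPipeline:
    """Toy illustration of lazy evaluation; not a Spark API."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Record the step and return a new pipeline; nothing runs yet
        return LazyPipeline(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        # Same idea: the parent pipeline is left unchanged
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # The "action": only now is the recorded plan executed
        rows = iter(self._source)
        for kind, func in self._steps:
            rows = map(func, rows) if kind == "map" else filter(func, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)  # track when the function really runs
    return x * 2


source = (1, 2, 3, 4, 5)
plan = LazyPipeline(source).map(double).filter(lambda x: x > 4)

calls_before_action = len(calls)  # 0: building the plan ran nothing
result = plan.collect()           # [6, 8, 10]
calls_after_action = len(calls)   # 5: the work happened at collect()

print(calls_before_action, result, calls_after_action)
```

Each chained call returns a new pipeline and leaves `source` untouched, which mirrors the immutability described earlier.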

In Spark, and in other similar computational processes, this set of directions is called a Directed Acyclic Graph (DAG).

A DAG is a program's path of execution: the steps flow in one direction and never loop back, which avoids explicit repetition.

For example, if a specific file is read more than once in your code, Spark may be able to plan the work so that the file is read only once. Spark builds the DAG from your code, and checks if it can procrastinate, waiting until the last possible moment to get the data.
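A small DAG can be modeled with Python's standard `graphlib` module. The step names below are hypothetical and this is not how Spark represents its plan internally; the sketch only shows that a step shared by two later steps appears once, and that steps are ordered so dependencies come first.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each step maps to the set of steps it depends on
dag = {
    "read_invoices": set(),
    "filter_paid": {"read_invoices"},
    "filter_unpaid": {"read_invoices"},
    "join_results": {"filter_paid", "filter_unpaid"},
}

# static_order() yields every step once, dependencies first,
# and raises CycleError if the graph contains a cycle
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The two filter steps may appear in either order, since neither depends on the other; that freedom is the kind of room a planner has to rearrange or parallelize work.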

A cycling team rotates their position in shifts to preserve energy. In Spark, these shifts are called **stages.** 


As you watch the output of your Spark code, you will see output similar to this:

`[Stage 19:> ======>                                         (0 + 1) / 1]`

[Spark Overview](https://spark.apache.org/docs/latest/)

[Cluster Mode](https://spark.apache.org/docs/3.0.2/cluster-overview.html)

Python and SQL code used for data retrieval is run through a query optimizer. Spark converts the optimized query into an execution plan (a DAG). The code in the DAG generates RDDs.

In early versions of Spark (before 1.3), RDDs were the API used to program Spark. Version 1.3 introduced the DataFrame API, and version 2.0 unified the DataFrame and Dataset APIs.

Data sources can be databases or CSV, JSON, or text files. Spark converts this data into RDDs for efficient processing. RDDs are fault-tolerant, cacheable, and partitioned.

Working with RDDs directly can be important when we need a lower level of abstraction.

[RDD programming](https://spark.apache.org/docs/latest/rdd-programming-guide.html)

[RDD vs Dataframes vs Datasets](https://www.databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)