# Workbook for Spark: The Definitive Guide
<hr>

## Chapter 1. What Is Apache Spark?

***
![image.png](attachment:image.png)
***

###  Apache Spark’s Philosophy
* __Apache Spark__ — A unified computing engine and set of libraries for big data.
    * __Unified__

      Spark is designed to support a wide range of data analytics tasks, from simple data loading and SQL queries to machine learning and streaming computation, over the same computing engine and with a consistent set of APIs.

    * __Computing engine__

      Spark handles loading data from storage systems and performing computation on it; it does not treat permanent storage as an end in itself.

    * __Libraries__

      Spark’s final component is its libraries, which build on its design as a unified engine to provide a unified API for common data analysis tasks.

## Chapter 2. A Gentle Introduction to Spark

### Spark Applications

Spark Applications consist of:
  * A driver process.  
  * A set of executor processes.

 The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things:
     
   * Maintaining information about the Spark Application
   * Responding to a user’s program or input
   * Analyzing, distributing, and scheduling work across the executors

An executor is responsible for only two things:

   * Executing code assigned to it by the driver.
   * Reporting the state of the computation on that executor back to the driver node.

The cluster manager controls physical machines and allocates resources to Spark Applications.

![image.png](attachment:image.png)

### Spark’s Language APIs
 Spark presents some core “concepts” in every language; these
 concepts are then translated into Spark code that runs on the cluster of machines.
 If you use just the Structured APIs, you can expect all languages to have similar performance characteristics.

***
![image.png](attachment:image.png)
***

### Spark’s APIs

 Spark has two fundamental sets of APIs: the low-level “unstructured” APIs, and the higher-level structured APIs.

####  Starting Spark: The SparkSession

You control your Spark Application through a driver process called the SparkSession. The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application.

In [None]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder\
        .appName("spark_the_definitive_guide")\
        .getOrCreate()

In [2]:
spark

### DataFrames

* A DataFrame is the most common Structured API and simply represents a table of data with rows and columns.
<br/><br/>
*  The list that defines the columns and the types within those columns is called the __schema__. 
<br/><br/>
* A spreadsheet sits on one computer in one specific location, whereas a Spark DataFrame can span thousands of computers.
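
To make the idea of a schema concrete, here is a minimal plain-Python sketch. It is a toy model, not Spark's implementation: in PySpark a schema is a `StructType` made of `StructField` entries, whereas the names `schema`, `rows`, and `conforms` below are hypothetical and exist only for illustration.

```python
# Toy model only: a schema as an ordered list of (column name, Python type).
schema = [("name", str), ("age", int)]

# Each row is a tuple with one value per column, in schema order.
rows = [("Alice", 34), ("Bob", 45)]


def conforms(row, schema):
    """Return True if the row has one value per column, each of the declared type."""
    return len(row) == len(schema) and all(
        isinstance(value, col_type) for value, (_, col_type) in zip(row, schema)
    )


column_names = [name for name, _ in schema]
all_rows_valid = all(conforms(row, schema) for row in rows)
print(column_names, all_rows_valid)
```

The schema is what lets every row be interpreted the same way, no matter which machine holds it.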

***
![image.png](attachment:image.png)
***

### Partitions

A partition is a collection of rows that sit on one physical machine in your cluster. If you have one partition, Spark will have a parallelism of only one, even if you have thousands of executors. If you have many partitions but only one executor, Spark will still have a parallelism of only one because there is only one computation resource.
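
The rule in the paragraph above can be written as a one-line function. This is a simplified sketch that, like the text, treats each executor as a single unit of compute; `effective_parallelism` and its parameter names are hypothetical, not part of any Spark API.

```python
def effective_parallelism(num_partitions: int, num_executors: int) -> int:
    """Upper bound on work running at once in this simplified model.

    Work is split by partition and each executor handles one partition at a
    time, so the smaller of the two numbers is the limit.
    """
    return min(num_partitions, num_executors)


print(effective_parallelism(1, 1000))  # one partition, many executors -> 1
print(effective_parallelism(200, 1))   # many partitions, one executor -> 1
print(effective_parallelism(200, 8))   # limited by the 8 executors -> 8
```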

### Transformations

1. In Spark, the core data structures are immutable, meaning they cannot be changed after they’re created. To “change” a DataFrame, you need to instruct Spark how you would like to modify it to do what you want. These instructions are called transformations.
<br/><br/>
2. There are two types of transformations: those that specify narrow dependencies, and those that specify wide dependencies.
<br/><br/>
3. Transformations consisting of narrow dependencies (we’ll call them narrow transformations) are those for which each input partition will contribute to only one output partition.
<br/><br/>
4. A wide dependency (or wide transformation) has input partitions contributing to many output partitions. This is often referred to as a shuffle, in which Spark exchanges partitions across the cluster.
<br/><br/>
5. With narrow transformations, Spark will automatically perform an operation called pipelining, meaning that if we specify multiple filters on DataFrames, they’ll all be performed in-memory. The same cannot be said for shuffles: when Spark performs a shuffle, it writes the results to disk.
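
The difference between narrow and wide dependencies can be illustrated with a small plain-Python toy, where each "partition" is just a list. This is a sketch under stated assumptions, not how Spark is implemented; `narrow_filter` and `wide_repartition_by_key` are hypothetical helper names.

```python
from collections import defaultdict

# Toy data: three "partitions", each a plain list of integers.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def narrow_filter(parts, predicate):
    """Narrow: output partition i is built only from input partition i."""
    return [[x for x in part if predicate(x)] for part in parts]


def wide_repartition_by_key(parts, key, num_out):
    """Wide: rows are routed by key, so one input partition can feed many outputs."""
    out = [[] for _ in range(num_out)]
    sources = defaultdict(set)  # output index -> input partitions that fed it
    for i, part in enumerate(parts):
        for x in part:
            target = key(x) % num_out
            out[target].append(x)
            sources[target].add(i)
    return out, dict(sources)


evens = narrow_filter(partitions, lambda x: x % 2 == 0)
shuffled, sources = wide_repartition_by_key(partitions, key=lambda x: x, num_out=2)

print(evens)     # each output partition came from exactly one input partition
print(shuffled)  # rows regrouped by key across partitions
print(sources)   # every output partition drew from all three inputs
```

In the narrow case no data has to leave its partition, which is what makes pipelining possible; in the wide case every output depends on every input, so data has to be exchanged.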

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Lazy Evaluation
1. Transformations are simply ways of specifying a series of data manipulations. This leads us to a topic called lazy evaluation.
<br/><br/>
2. Lazy evaluation means that Spark will wait until the very last moment to execute the graph of computation instructions.
<br/><br/>
3. By waiting until the last minute to execute the code, Spark compiles a plan from your raw DataFrame transformations into a streamlined physical plan that will run as efficiently as possible across the cluster.
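
Lazy evaluation can be imitated in a few lines of plain Python. The `LazyPlan` class below is a hypothetical toy, not a Spark class: it only records the requested steps and runs them all in a single pass when a result is finally asked for. Spark additionally optimizes the recorded plan, which this sketch does not attempt.

```python
class LazyPlan:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def filter(self, predicate):
        # Returns a new plan; the original plan is left unchanged.
        return LazyPlan(self._data, self._steps + (("filter", predicate),))

    def map(self, func):
        return LazyPlan(self._data, self._steps + (("map", func),))

    def collect(self):
        # Only here does any work happen: all recorded steps run in one pass.
        result = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "filter":
                    if not fn(item):
                        keep = False
                        break
                else:
                    item = fn(item)
            if keep:
                result.append(item)
        return result


calls = []  # records each time the predicate actually runs


def is_even(n):
    calls.append(n)
    return n % 2 == 0


plan = LazyPlan(range(6)).filter(is_even).map(lambda n: n * 10)
calls_before = len(calls)  # still 0: building the plan executed nothing
result = plan.collect()    # triggers the whole pipeline
print(calls_before, result)
```

Because the predicate is not called until `collect()`, the toy has the full list of steps available before doing any work, which is the property that lets a real engine rearrange and combine them.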

### Actions
1. Transformations allow us to build up our logical transformation plan. To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations.
<br/><br/>
 There are three kinds of actions:
   1. Actions to view data in the console
   2. Actions to collect data to native objects in the respective language
   3. Actions to write to output data sources

***
![image.png](attachment:image.png)
***

***
![image.png](attachment:image.png)
***

***
![image.png](attachment:image.png)
***

***
![image.png](attachment:image.png)
***

***
![image.png](attachment:image.png)
***

***
![image.png](attachment:image.png)
***