# 1. Introduction

Spark is a tool for data processing. It can be used for data storage, but that is not its main capability.

Spark is a great tool for Big Data because it processes data across a cluster. Data in Spark is partitioned, which lets the processing scale out over many nodes. Another great feature of this framework is fault tolerance: data can be duplicated across the cluster, and lost partitions can be recomputed, so the data stays available even if one node of the cluster fails during processing (dies or something XP).

Looking further into partitioning: it is the process of separating data into groups (partitions). Each partition is processed by its own task on an executor, so partitions can be processed in parallel. This is also a great way to reduce reprocessing of data.
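To make the idea of partitioning concrete, here is a minimal plain-Python sketch (not Spark code). The helper `partition_by_key` is hypothetical and assigns each record to a partition by hashing its key, which is similar in spirit to hash partitioning; Spark's own partitioners are more elaborate.

```python
from collections import defaultdict


def partition_by_key(records, num_partitions):
    """Hypothetical helper: group (key, value) records into partitions by key."""
    partitions = defaultdict(list)
    for key, value in records:
        # Small integer keys hash to themselves in Python, so this is deterministic here.
        partitions[hash(key) % num_partitions].append((key, value))
    return dict(partitions)


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
partitions = partition_by_key(records, num_partitions=2)

# Records with the same key always land in the same partition,
# so each partition can be processed independently of the others.
for index, rows in sorted(partitions.items()):
    print(index, rows)
```

With two partitions, the even keys end up in one group and the odd keys in the other.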

We will learn PySpark, the Python API for Spark and one of the languages Spark supports.

## 1.1. Components

* Machine Learning (MLlib)
* SQL (Spark SQL)
* Stream processing (Spark Streaming)
* Graph processing (GraphX)

## 1.2. Structure

* **Driver**: initializes the SparkSession and requests computational resources from the Cluster Manager; transforms operations into DAGs (Directed Acyclic Graphs); distributes tasks across the executors.

* **Manager**: manages the cluster's resources. There are four cluster managers: the built-in standalone manager, YARN, Mesos and Kubernetes.

* **Executor**: runs on each worker node, executing tasks.

## 1.3. Transformations and Actions

The DataFrame is the basic unit of Spark. DataFrames are immutable, a characteristic that brings fault tolerance.
When we run a process in Spark, we have two basic kinds of operation: Transformations and Actions.

Transformations generate a new DataFrame. The processing of a transformation only occurs once an Action happens (Lazy Evaluation).

![Lazy Evaluation](images/Lazy.png "Lazy Evaluation")
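Lazy evaluation can be imitated with Python generators. The sketch below is an analogy in plain Python, not Spark code: the functions `double` and `keep_large` and the `log` list are made up for illustration. They play the role of transformations, while `list(...)` plays the role of the action that finally triggers the work.

```python
log = []


def double(values):
    """A 'transformation': describes work but does nothing until consumed."""
    for v in values:
        log.append(f"double({v})")
        yield v * 2


def keep_large(values, threshold):
    """Another 'transformation', chained lazily on top of the first."""
    for v in values:
        log.append(f"filter({v})")
        if v > threshold:
            yield v


pipeline = keep_large(double([1, 2, 3]), threshold=2)
calls_before_action = len(log)  # nothing has run yet

result = list(pipeline)  # the 'action': forces the whole pipeline to run
calls_after_action = len(log)

print(calls_before_action, result, calls_after_action)
```

Building the pipeline records no calls at all; the calls only appear in `log` once the result is requested.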

We have two main types of transformations: Narrow and Wide. They indicate whether the transformation uses data from the same partition (Narrow) or from different partitions (Wide).
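The difference can be sketched in plain Python, under the assumption that each inner list below stands for one partition (the data is made up, and this is an illustration rather than how Spark is implemented). Doubling each value touches one partition at a time, while summing by key needs records from every partition.

```python
from collections import defaultdict

# Two partitions of (key, value) records, as they might sit on two executors.
partitions = [
    [("pen", 9), ("apple", 12)],
    [("pen", 13), ("apple", 1)],
]

# Narrow: each output partition depends on exactly one input partition.
narrow = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide: summing by key needs records from every partition, so the data
# must be regrouped (a "shuffle") before the result can be computed.
shuffled = defaultdict(list)
for part in partitions:
    for k, v in part:
        shuffled[k].append(v)
wide = {k: sum(vs) for k, vs in shuffled.items()}

print(narrow)
print(wide)
```

The regrouping step is what makes Wide transformations more expensive than Narrow ones.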

## 1.4. Components

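* **Job**: the work triggered by a single Action
* **Stage**: a group of tasks within a job that can run without moving data between partitions; a Wide transformation starts a new stage
* **Task**: the smallest unit of work, which processes one partition on one executor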

![Spark Components](images/Contents.png "Spark Components")

## 1.5. Big Data Formats

Modern data formats are open: any capable framework can read them, because the format is decoupled from the reading tool. Throughout this course, we will use **parquet** files. They are binary, compressed files. Moreover, they support schemas and can be clustered and partitioned.

## 1.6. Installation and initial configuration

To install Spark, go to the Spark website and copy the download link. Then download the archive in the terminal by passing the link to the **wget** command, and extract it. After this, move the extracted folder into the **/opt** folder and add the required environment variables to the **~/.bashrc** file.

Once the installation and variables are set up, open the following URL in the browser to validate: **http://localhost:8080/** (this is the web UI of the standalone master, so it only responds once the master has been started).

Finally, to start Spark from the terminal, run the following commands:

* start-master.sh
* /opt/spark/sbin/start-slave.sh spark://localhost:7077 (in Spark 3.1 and later, this script is named start-worker.sh)

Now, one can access the Spark shell (Python) via the **pyspark** command. For this course, you must also install **numpy** and **pandas**.

*P.S.: For PySpark only, just use **pip install pyspark** ʕ•ᴥ•ʔ*

![Spark Successful Install](images/install_success.png "Spark Successful Installation")

For more in-depth reading, see: https://www.bmc.com/blogs/jupyter-notebooks-apache-spark/

# 2. Data Structures

The Spark framework works with three data structures: **RDDs (Resilient Distributed Datasets)**, **Datasets**, and **DataFrames**.

RDDs are the most basic structure that Spark can process, and they are normally:

* A low-level, basic data structure
* Complex and wordy - one might need a lot of code to process an RDD
* Not optimized by Spark

DataFrames and Datasets are easier to manipulate. We already know their tabular structure; however, Datasets are not available in PySpark.

## 2.1. RDD 

One can create an **RDD** in the shell by calling the method **sc.parallelize**, which takes a list as its argument.
Ex: ***nums = sc.parallelize([1,2,3,4,5,6,7,8,9,10])***

Another way to create an RDD is to initialize a PySpark session explicitly in code. The next cell has such code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName('SparkByExamples.com').getOrCreate()

# Create an RDD with parallelize
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd = spark.sparkContext.parallelize(data)
```

This object has a variety of methods, such as **take**, **top**, **collect**, **sum**, **mean**, **stdev**, etc.

Moreover, using Python lambda functions, one can apply transformations, such as **filter** and **map**, on the **RDD** and then perform an action to get the result.

```python
print(f'Take method: {rdd.take(5)}')

print(f'Count method: {rdd.count()}')

print(f'Standard deviation method: {rdd.stdev()}')

# Filter using a lambda function (a transformation; nothing runs yet)
rdd_filtered = rdd.filter(lambda x: x > 4)
# collect() is an action that brings all the data to the driver.
# Not a great idea for actual big data ｡◕‿◕｡
print(f'Collecting filtered RDD: {rdd_filtered.collect()}')

# Map using a lambda function
rdd_mapped = rdd.map(lambda x: x * 3)
print(f'Collecting mapped RDD: {rdd_mapped.collect()}')
```

```text
Take method: [1, 2, 3, 4, 5]
Count method: 12
Standard deviation method: 3.452052529534663
Collecting filtered RDD: [5, 6, 7, 8, 9, 10, 11, 12]
Collecting mapped RDD: [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36]
```
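A note on the standard deviation above: **stdev** returns the population standard deviation (dividing by n), and the RDD API offers a separate **sampleStdev** method for the sample version (dividing by n - 1). The check below is a plain-Python sketch that uses only the standard library `statistics` module, not Spark, to show the two values for the same numbers.

```python
import statistics

data = list(range(1, 13))  # the same values as the RDD above

population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

print(f"population: {population_sd:.4f}")
print(f"sample:     {sample_sd:.4f}")
```

The population value agrees with the figure printed by **stdev**, while the sample value is slightly larger.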


Two **RDDs** can be combined with methods similar to mathematical set operations. We can perform **union**, **intersection**, **subtract**, **cartesian**, etc.

```python
data2 = [10, 11, 12, 13, 14, 15]
rdd2 = spark.sparkContext.parallelize(data2)

union = rdd.union(rdd2)
print(f'Union of both RDDs: {union.collect()}')

inter = rdd.intersection(rdd2)
print(f'Intersection of both RDDs: {inter.collect()}')

sub = rdd.subtract(rdd2)
print(f'Subtraction of both RDDs: {sub.collect()}')

cartesian_prod = rdd.cartesian(rdd2)
print(f'Cartesian product of both RDDs: {cartesian_prod.collect()}')
```

```text
Union of both RDDs: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 10, 11, 12, 13, 14, 15]
Intersection of both RDDs: [10, 11, 12]
Subtraction of both RDDs: [8, 1, 9, 2, 3, 4, 5, 6, 7]
Cartesian product of both RDDs: [(1, 10), (2, 10), (3, 10), (1, 11), (2, 11), (3, 11), (1, 12), (2, 12), (3, 12), (1, 13), (2, 13), (3, 13), (1, 14), (2, 14), (3, 14), (1, 15), (2, 15), (3, 15), (4, 10), (5, 10), (6, 10), (4, 11), (5, 11), (6, 11), (4, 12), (5, 12), (6, 12), (4, 13), (5, 13), (6, 13), (4, 14), (5, 14), (6, 14), (4, 15), (5, 15), (6, 15), (7, 10), (8, 10), (9, 10), (7, 11), (8, 11), (9, 11), (7, 12), (8, 12), (9, 12), (7, 13), (8, 13), (9, 13), (7, 14), (8, 14), (9, 14), (7, 15), (8, 15), (9, 15), (10, 10), (11, 10), (12, 10), (10, 11), (11, 11), (12, 11), (10, 12), (11, 12), (12, 12), (10, 13), (11, 13), (12, 13), (10, 14), (11, 14), (12, 14), (10, 15), (11, 15), (12, 15)]
```


Now let us take a look at an example. Say the previous cartesian product is the sales registry of a given store. The first entry of each tuple is the customer code, and the second entry is the number of cucumbers they bought in our store. (◕‿◕✿)

We can extract **keys** and **values** from our registry, and create different **RDDs** to store them.

```python
cucumbaLTDA_registry = cartesian_prod

customers = cucumbaLTDA_registry.keys().distinct()
print(f'Distinct customers that bought cucumbers on our store: {customers.collect()}')

total_value = cucumbaLTDA_registry.values().sum()
print(f'Total number of cucumber bought: {total_value}.\nOH YEAH! That\'s a lot of CUCUMBAS! (づ￣ ³￣)づ')

# This repeat-visit count is not meaningful because we created our registry via a Cartesian product (ಥ﹏ಥ)
print(f'Count how many times each customer passed by our store: {cucumbaLTDA_registry.countByKey()}')
```

```text
Distinct customers that bought cucumbers on our store: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Total number of cucumber bought: 900.
OH YEAH! That's a lot of CUCUMBAS! (づ￣ ³￣)づ
Count how many times each customer passed by our store: defaultdict(<class 'int'>, {1: 6, 2: 6, 3: 6, 4: 6, 5: 6, 6: 6, 7: 6, 8: 6, 9: 6, 10: 6, 11: 6, 12: 6})
```
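To see what **countByKey** reports on less artificial data, here is a minimal plain-Python sketch (no Spark involved). The `registry` list is made-up illustration data, and `collections.Counter` stands in for the per-key counting that countByKey performs.

```python
from collections import Counter

# Hypothetical registry: (customer code, cucumbers bought), one tuple per visit.
registry = [(1, 3), (2, 5), (1, 2), (3, 1), (1, 4)]

# Count how many records share each key, which is what countByKey reports.
visits = Counter(customer for customer, _ in registry)

print(dict(visits))
```

Here customer 1 shows up three times and the others once each, so the count actually distinguishes regular customers from occasional ones.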


Let's say that we also have a registry of debts for some customers. We may **join** the two RDDs to get the purchase registry alongside the number of unpaid cucumbers for each customer.

```python
debts = [(1, 3), (5, 10), (10, 2)]
unpaid_cucumbers = spark.sparkContext.parallelize(debts)

comp_registry = cucumbaLTDA_registry.join(unpaid_cucumbers)
print(f'The complete registry obtained by the join of our sales registry and our debts is:\n {comp_registry.collect()}')

purchase_by_no_debt = cucumbaLTDA_registry.subtractByKey(unpaid_cucumbers)
print(f'Purchases made by clients without debts: {purchase_by_no_debt.collect()}')
```

```text
The complete registry obtained by the join of our sales registry and our debts is:
 [(1, (10, 3)), (1, (11, 3)), (1, (12, 3)), (1, (13, 3)), (1, (14, 3)), (1, (15, 3)), (5, (10, 10)), (5, (11, 10)), (5, (12, 10)), (5, (13, 10)), (5, (14, 10)), (5, (15, 10)), (10, (10, 2)), (10, (11, 2)), (10, (12, 2)), (10, (13, 2)), (10, (14, 2)), (10, (15, 2))]
Purchases made by clients without debts: [(2, 10), (2, 11), (2, 12), (2, 13), (2, 14), (2, 15), (3, 10), (3, 11), (3, 12), (3, 13), (3, 14), (3, 15), (4, 10), (4, 11), (4, 12), (4, 13), (4, 14), (4, 15), (6, 10), (6, 11), (6, 12), (6, 13), (6, 14), (6, 15), (7, 10), (7, 11), (7, 12), (7, 13), (7, 14), (7, 15), (8, 10), (8, 11), (8, 12), (8, 13), (8, 14), (8, 15), (9, 10), (9, 11), (9, 12), (9, 13), (9, 14), (9, 15), (11, 10), (11, 11), (11, 12), (11, 13), (11, 14), (11, 15), (12, 10), (12, 11), (12, 12), (12, 13), (12, 14), (12, 15)]
```
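The shape of these results is easier to read on a small example. The sketch below mimics the semantics of **join** and **subtractByKey** in plain Python (it is not Spark code and says nothing about how Spark executes them). The `sales` and `debts` lists are made up, and the sketch assumes at most one debt entry per customer.

```python
sales = [(1, 10), (1, 11), (2, 10), (5, 12)]  # (customer, cucumbers bought)
debts = [(1, 3), (5, 10)]                     # (customer, unpaid cucumbers)

debt_by_customer = dict(debts)  # assumes at most one debt entry per customer

# Inner join: keep sales whose key appears in debts, pairing the two values.
joined = [(k, (v, debt_by_customer[k])) for k, v in sales if k in debt_by_customer]

# subtractByKey: keep sales whose key does NOT appear in debts.
without_debt = [(k, v) for k, v in sales if k not in debt_by_customer]

print(joined)
print(without_debt)
```

Each joined record has the form (key, (value from the left side, value from the right side)), which is the same nesting seen in the complete registry above.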


## 2.2. DataFrame

**DataFrames** are the data structure that we will focus on in this course. Some bullet points that are of great importance about **DataFrames**:

* Tabular structure. Excel-like data ☜(ﾟヮﾟ☜)
* Immutable
* Have known schemas
* Preserved lineage. Transformations are recorded step by step
* Columns may have different data types
* Common methods such as **group by**, **order by** and **filter**
* ***Extremely optimized on Spark***

Without further ado, let's create a **DataFrame** and play with it. As you may see, we create a **DF** by passing the data and its schema as arguments (Spark can also infer the schema, but stating it explicitly avoids surprises).

```python
data3 = [('John', 15), ('Mary', 14), ('James', 12)]

schema = "Name STRING, Age INT"  # Model of a schema accepted by Spark

df = spark.createDataFrame(data3, schema)  # Here we pass the data and its schema

df.show()
df.show(1)
```

```text
+-----+---+
| Name|Age|
+-----+---+
| John| 15|
| Mary| 14|
|James| 12|
+-----+---+

+----+---+
|Name|Age|
+----+---+
|John| 15|
+----+---+
only showing top 1 row
```



```python
schema2 = "Product STRING, Quantity INT"
data4 = [('Pen', 9), ('Pineappple', 22), ('Apple', 12), ('Pen', 13)]

df2 = spark.createDataFrame(data4, schema2)

df2.show()
```

```text
+----------+--------+
|   Product|Quantity|
+----------+--------+
|       Pen|       9|
|Pineappple|      22|
|     Apple|      12|
|       Pen|      13|
+----------+--------+
```



The new **DF** contains data on purchased products. One may need to see the total sum of a particular product, for instance. To do so, we will use the methods **groupBy** and **agg** (*aggregate*).

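```python
# Import the functions module under an alias so that sum does not shadow Python's built-in sum
from pyspark.sql import functions as F

df2.groupBy("Product").agg(F.sum("Quantity")).show()
```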

```text
+----------+-------------+
|   Product|sum(Quantity)|
+----------+-------------+
|       Pen|           22|
|Pineappple|           22|
|     Apple|           12|
+----------+-------------+
```



We can use the method **select** to choose columns, or even add an expression as a new column.

```python
from pyspark.sql.functions import expr

df2.select("Product", "Quantity", expr("Quantity - 4"), expr("Quantity * 0.5")).show()
```

```text
+----------+--------+--------------+----------------+
|   Product|Quantity|(Quantity - 4)|(Quantity * 0.5)|
+----------+--------+--------------+----------------+
|       Pen|       9|             5|             4.5|
|Pineappple|      22|            18|            11.0|
|     Apple|      12|             8|             6.0|
|       Pen|      13|             9|             6.5|
+----------+--------+--------------+----------------+
```

