
<img src='static/spark.png' width="30%" height="30%" />

### Introduction to Spark

Apache Spark is a scalable, general-purpose distributed computing platform designed to process large volumes of data. Spark extends the MapReduce model by performing computations efficiently in RAM. At its core is the compute engine, which is responsible for scheduling, distributing, and monitoring applications. An application consists of a series of tasks that are distributed across processing units called workers, which execute the distributed code. Workers reside on the physical machines that make up the cluster, and a single machine can host one or several workers.


<img src='static/driver.png' width="30%" height="30%" />

Each Spark application consists of a driver program, the main program in which both distributed and local data structures and operations are defined. The flow and logic of the application reside in the driver program. The driver accesses the cluster through a SparkContext object, which represents the connection to the cluster and is therefore the object used to create the first distributed data structures. The SparkContext is assigned to the driver by the master when the application is submitted.

In [1]:
import pyspark
sc = pyspark.SparkContext('local[*]')
sc

### Resilient distributed datasets

Resilient distributed datasets (RDDs) are distributed data structures, each composed of an immutable collection of objects. The collection is divided into subsets called partitions, and each partition is hosted on a cluster worker. Only the distributed operations provided by the Spark core can be performed on this collection, so all the work that the driver program defines on these data structures is automatically parallelized and applied on the processing units that hold the partitions of the RDD. These collections can be stored in RAM or on disk, and may contain primitive Java, Scala, and Python data types, user-defined classes, or classes from the Spark core itself.

<img src='static/rdd.png' width="50%" height="50%"/>

<center>*RDD split into 5 partitions, some of which reside on the same worker*</center>

Some considerations about this data structure are:
* RDDs are automatically rebuilt when a machine or processing unit fails.
* RDDs can be created by parallelizing a local data structure (lists, arrays, matrices, etc.) through the SparkContext with the `parallelize` method, or by applying distributed transformation operations to another RDD (or to one of its subclasses).
* Because they are resilient, RDDs are immutable: they cannot be modified once created. Transforming an RDD actually creates a new one.

**Example**: Creating an RDD from a list of numbers through the SparkContext:

In [4]:
miFirstRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25])
miFirstRDD

ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:175

Once the elements of an RDD are distributed across the cluster, they cannot be accessed sequentially, because the RDD acts as a generic pointer to an abstract collection of homogeneous elements. An instruction such as `element = rdd[i]` is therefore meaningless in this context. However, some RDD functions bring elements of the RDD into the memory of the local machine:

* **`collect()`**: This function gathers all the objects of the RDD that are distributed across the cluster and stores them in the driver's local memory as a sequential list. Using it indiscriminately in distributed environments with large data volumes can crash the machine where the driver runs, because the data may far exceed that machine's physical capacity.

* **`take(n)`** (with a small `n`): This function collects the first `n` elements of the distributed dataset, scanning one partition at a time, and likewise places them in local memory. Which elements come first depends on how the data is partitioned and ordered, so when that order is not deterministic (for example, after a shuffle) you may get a different set each time you run it.

**Example**: Bringing RDD elements into local memory.

In [6]:
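# glom() groups the elements of each partition into a list,
# so the output shows how the RDD is split across partitions.
miFirstRDD.glom().collect()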

[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9],
 [10, 11, 12],
 [13, 14, 15],
 [16, 17, 18],
 [19, 20, 21],
 [22, 23, 24, 25]]

In [5]:
print('Collect: '+str(miFirstRDD.collect()))
print('Take: '+str(miFirstRDD.take(5)))

Collect: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Take: [1, 2, 3, 4, 5]


**Example:** Creating an RDD from a text file.

In [7]:
numbersRDD = sc.textFile('data/amounts.txt')
print('Take: '+str(numbersRDD.take(5)))

Take: ['594.38', '281.37', '144.85', '20.99', '797.40']
