# PySpark on Mac

To install PySpark, I followed this [tutorial](https://medium.com/@GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b). It is a complete guide to configuring the PySpark driver environment variables by adding lines to your ~/.bash_profile file.

In [1]:
# import SparkContext
from pyspark import SparkContext

### Resilient Distributed Datasets (RDDs)

RDDs are the backbone of Apache Spark. They speed up computation because the dataset is parallelized: it is split into chunks (partitions), optionally by key, and those chunks are distributed across executor nodes and processed in parallel.  

Transformations on the dataset are lazy: they only run when an action is called, which lets Spark optimize the execution.
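To get a feel for what "lazy" means here, the sketch below uses a plain-Python analogy rather than Spark itself: the built-in `map()` plays the role of a transformation and `list()` plays the role of an action. The `scale` function and the `calls` list are hypothetical names introduced only for this illustration, and Spark's actual scheduling is considerably more involved.

```python
# Plain-Python analogy of lazy evaluation (not Spark code).
calls = []  # records every time the "transformation" really runs

def scale(x):
    calls.append(x)
    return 2.0 * x - 1.0

data = [0.0, 0.25, 0.5, 1.0]

# "Transformation": map() returns a lazy iterator, so scale() has not run yet.
lazy = map(scale, data)
calls_before_action = len(calls)

# "Action": consuming the iterator forces the computation.
result = list(lazy)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

Nothing is computed until the result is actually requested, which is the same idea behind Spark deferring transformations until an action such as `count()` is called.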

Let's try an example of RDDs:

In [2]:
sc = SparkContext.getOrCreate()

One million 2D points are generated at random. Each coordinate is drawn uniformly from [0, 1), then multiplied by 2 and shifted by -1 so that it lies in [-1, 1). We then compute the mean and standard deviation of each coordinate across all points.

In [3]:
import numpy as np

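TOTAL = 1000000
# Build the points on the driver, distribute them as an RDD, and cache it
# because it is used by more than one action below.
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for _ in range(TOTAL)]).cache()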
print("Number of random points:", dots.count())

stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())

Number of random points: 1000000
Mean: [-4.10199644e-04 -4.56263556e-05]
stdev: [0.5775736  0.57721487]
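
As a sanity check on the output above, the values can be compared with the theoretical moments of a uniform distribution on [a, b): mean (a + b) / 2 and standard deviation (b - a) / sqrt(12). The sketch below is a minimal check that uses only the standard library and a smaller local sample instead of Spark; the fixed seed and the sample size are arbitrary assumptions made for this illustration.

```python
import math
import random

# Theoretical moments of a uniform distribution on [a, b)
a, b = -1.0, 1.0
expected_mean = (a + b) / 2
expected_stdev = (b - a) / math.sqrt(12)

# Small local sample built the same way as each coordinate above
rng = random.Random(0)  # arbitrary seed, only for reproducibility
sample = [2.0 * rng.random() - 1.0 for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)
# Population standard deviation (divides by n)
sample_stdev = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / len(sample))

print("Expected mean: ", expected_mean)
print("Expected stdev:", round(expected_stdev, 5))
print("Sample mean:   ", sample_mean)
print("Sample stdev:  ", sample_stdev)
```

The expected standard deviation is 1/sqrt(3), about 0.57735, which agrees with the values Spark reported for both coordinates, and the reported means are close to the expected 0.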
