# PySpark on Mac

To install PySpark, I followed this [tutorial](https://medium.com/@GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b), which gives a complete guide to configuring and updating the PySpark driver environment variables by adding lines to your `~/.bash_profile` file.

In [1]:
# import SparkContext
from pyspark import SparkContext

### Resilient Distributed Datasets (RDDs)

RDDs are the backbone of Apache Spark. They speed up calculations because the dataset is parallelized: it is split into chunks that are distributed across executor nodes (for key-value data, the split can be based on keys).

Transformations on the dataset are lazy: they are only executed when an action is called, which lets Spark optimize the execution.
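To make the "nothing runs until an action is taken" idea concrete, here is a minimal plain-Python analogy. It is not Spark code: the built-in `map` stands in for a transformation, `list(...)` stands in for an action, and the `double` helper and `calls` list are hypothetical names used only for this illustration.

```python
# Plain-Python analogy for lazy evaluation (an illustration, not Spark itself).
calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return 2 * x


data = [1, 2, 3, 4]

# "Transformation": map() only builds a lazy iterator; double() has not run yet.
lazy = map(double, data)
calls_before_action = len(calls)

# "Action": consuming the iterator forces the computation.
result = list(lazy)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

In Spark the same pattern appears when a call such as `rdd.map(...)` returns immediately and the work is only done once something like `collect()` or `count()` is called.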

Let's try an example of RDDs:

In [2]:
sc = SparkContext.getOrCreate()

One million random 2D dots are generated. Each coordinate is multiplied by 2.0 and then 1.0 is subtracted, which maps it from [0, 1) to [-1, 1). Then we calculate the mean and standard deviation of each coordinate across all dots.

In [3]:
import numpy as np

TOTAL = 1000000
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()

In [4]:
# take the first 10 elements of dots:
dots.take(10)

[array([0.02294497, 0.14497392]),
 array([ 0.43679125, -0.27209092]),
 array([ 0.50293121, -0.00947994]),
 array([-0.04634862, -0.28718432]),
 array([-0.44772028,  0.67731428]),
 array([ 0.36578485, -0.47904216]),
 array([ 0.46645309, -0.19562448]),
 array([-0.52289423,  0.72305018]),
 array([0.34470048, 0.50470049]),
 array([ 0.50013807, -0.66653415])]

In [5]:
# count the elements in dots:
dots.count()

1000000

In [6]:
# inspect the first element
dots.first()

array([0.02294497, 0.14497392])

In [8]:
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())

Mean: [0.00040623 0.00047935]
stdev: [0.57703526 0.57736937]
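As a sanity check on these numbers, the expected values can be worked out by hand, assuming each coordinate is $X = 2U - 1$ with $U$ uniform on $[0, 1)$, which is what `np.random.random` draws:

```latex
\begin{aligned}
\mathbb{E}[X] &= 2\,\mathbb{E}[U] - 1 = 2 \cdot \tfrac{1}{2} - 1 = 0, \\
\operatorname{Var}(X) &= 2^2 \operatorname{Var}(U) = 4 \cdot \tfrac{1}{12} = \tfrac{1}{3}, \\
\sigma_X &= \tfrac{1}{\sqrt{3}} \approx 0.5774 .
\end{aligned}
```

The computed means (close to 0) and standard deviations (close to 0.577) agree with these values.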


### RDD Data Transformations


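What kind of transformations can we do? Mapping, filtering, joining, and transcoding are operations that transform the values in the dataset. The examples below also use actions such as `count()`, `countByKey()`, `collect()`, and `first()`, which return results to the driver.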

In [24]:
sc.parallelize(['a', 'b', 'c', 1, 1.1]).count()

5

In [26]:
rdd = sc.parallelize([('flat white', 1), ('latte', 4), ('pour over', 1), ('flat white', 3)]) 
sorted(rdd.countByKey().items())

[('flat white', 2), ('latte', 1), ('pour over', 1)]
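Note that `countByKey()` counts how many records share each key and ignores the values, which is why `'flat white'` maps to 2 rather than 4. The plain-Python sketch below (not Spark code; the `counts` and `totals` names are only for illustration) contrasts that count with a per-key sum of the values, which in Spark would be obtained with something like `reduceByKey` and an addition function.

```python
from collections import Counter, defaultdict

pairs = [('flat white', 1), ('latte', 4), ('pour over', 1), ('flat white', 3)]

# Number of records per key; the values are ignored (what countByKey reports).
counts = Counter(key for key, _ in pairs)

# Sum of the values per key (a different aggregation, shown for contrast).
totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value

print(sorted(counts.items()))
print(sorted(totals.items()))
```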

In [28]:
sorted(sc.parallelize(['a', 'b', 'c', 'd', 'e', 'a', 'b']).distinct().collect())

['a', 'b', 'c', 'd', 'e']

In [32]:
rdd = sc.parallelize(['flat white', 'capuccino', 'latte', 'tea', 'matcha'])
rdd.map(lambda x: 'cup of ' + x).collect()

['cup of flat white',
 'cup of capuccino',
 'cup of latte',
 'cup of tea',
 'cup of matcha']

In [33]:
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.filter(lambda x: x % 2 == 0).collect()

[2, 4]

In [34]:
rdd = sc.parallelize(['flat white', 'capuccino', 'latte', 'tea', 'matcha'])
rdd.first()

'flat white'

### Data Frames

In [36]:
from pyspark.sql import SparkSession

sc_dataframe = SparkSession.builder.appName("pysparkDataframes").getOrCreate()

In [37]:
dfStormEvents2019 = sc_dataframe.read.csv('../pyspark/data/StormEvents_details-ftp_v1.0_d2019_c20200317.csv.gz')

In [38]:
dfStormEvents2019.show(2)

+---------------+---------+----------+-------------+-------+--------+----------+--------+-----+----------+----+----------+-----------+-------+-------+-------+----+------------------+-----------+------------------+---------------+-----------------+-------------+---------------+---------------+------------+---------------+---------+--------------+-----------+--------+-----------+----------+---------+-------------+------------------+-----------------+-----------------+-----------+-------------+--------------+---------+-----------+------------+---------+---------+-------+--------+--------------------+--------------------+-----------+
|            _c0|      _c1|       _c2|          _c3|    _c4|     _c5|       _c6|     _c7|  _c8|       _c9|_c10|      _c11|       _c12|   _c13|   _c14|   _c15|_c16|              _c17|       _c18|              _c19|           _c20|             _c21|         _c22|           _c23|           _c24|        _c25|           _c26|     _c27|          _c28|       _c29|   

In [39]:
dfStormEvents2019.count()

67338

In [40]:
dfStormEvents2019.distinct().count()

67338

In [41]:
dfStormEvents2019.describe().show()

+-------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------+-----------------+------+---------+--------------------+-----+------------------+--------------------+-----+------------------+------+------------------+-------------------+--------------------+--------------------+--------------------+---------------+------------+---------------+------------------+-----+-------------------+-----------------+-----------+-----------------+------------------+----+----+------------------+------+------------------+-------------+----------------+------------------+-----+----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-----------+
|summary|               _c0|               _c1|               _c2|               _c3|               _c4|               _c5|               _c6|               _c7|    _c8|            