# RDD object
---

## Create SparkContext and SparkSession

In [1]:
from pyspark import SparkContext
sc = SparkContext(master = 'local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

## Create an RDD object

In [2]:
mtcars = sc.textFile('data/mtcars.csv', use_unicode=True)
mtcars.take(5)

[',mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb',
 'Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4',
 'Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4',
 'Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1',
 'Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1']

## Map functions
There are several map functions that operate on RDD objects:

* map()
* mapValues()
* flatMap()
* flatMapValues()

### `map()`
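`map()` applies a function to each element of the RDD and returns a new RDD of the results. In the cell below, each line is split on commas, and `filter()` then drops the header row.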

In [3]:
mtcars_map = mtcars.map(lambda x: x.split(',')).\
    filter(lambda x: x[1] != 'mpg')
mtcars_map.take(2)

[['Mazda RX4',
  '21',
  '6',
  '160',
  '110',
  '3.9',
  '2.62',
  '16.46',
  '0',
  '1',
  '4',
  '4'],
 ['Mazda RX4 Wag',
  '21',
  '6',
  '160',
  '110',
  '3.9',
  '2.875',
  '17.02',
  '0',
  '1',
  '4',
  '4']]

### `mapValues()`

`mapValues()` works on a pair RDD, in which each element is a **tuple**: the first item is the key and the second item is the value. `mapValues()` applies a function to the value of each element and keeps the original keys.

#### Create a tuple RDD

In [4]:
mtcars_tuple = mtcars_map.map(lambda x: (x[0], x[1:]))
mtcars_tuple.take(2)

[('Mazda RX4',
  ['21', '6', '160', '110', '3.9', '2.62', '16.46', '0', '1', '4', '4']),
 ('Mazda RX4 Wag',
  ['21', '6', '160', '110', '3.9', '2.875', '17.02', '0', '1', '4', '4'])]

Note that the `x` below refers only to the value of each tuple in the RDD; it does not include the key. Python's built-in `map()` function returns a map object instead of a list, so the code **`[*map(float, x)]`** unpacks the map object into a list.

In [5]:
mtcars_tuple.mapValues(lambda x: [*map(float, x)]).take(2)

[('Mazda RX4',
  [21.0, 6.0, 160.0, 110.0, 3.9, 2.62, 16.46, 0.0, 1.0, 4.0, 4.0]),
 ('Mazda RX4 Wag',
  [21.0, 6.0, 160.0, 110.0, 3.9, 2.875, 17.02, 0.0, 1.0, 4.0, 4.0])]
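To make the key/value behaviour concrete, here is a minimal plain-Python sketch of what `mapValues()` does to a list of pairs. It is only an analogy: `map_values` is a hypothetical helper written for this illustration, not part of PySpark, and it ignores partitioning and lazy evaluation.

```python
def map_values(pairs, func):
    """Apply func to the value of each (key, value) pair; keys stay untouched."""
    return [(key, func(value)) for key, value in pairs]

# Two records in the same shape as mtcars_tuple (value lists shortened).
pairs = [
    ('Mazda RX4', ['21', '6', '160']),
    ('Mazda RX4 Wag', ['21', '6', '160']),
]

# The function only ever sees the value list, never the key.
converted = map_values(pairs, lambda x: [*map(float, x)])
print(converted)
```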

### `flatMap()`

`flatMap()` applies a function to each element of the RDD and then flattens the results. In other words, each item in the result returned for an element becomes a new row.

The following script converts the data into an RDD of two-element tuples: the first item is the car model, and the second is a single observation from that car model.

In [6]:
mtcars_map.flatMap(lambda x: [(x[0], number) for number in x[1:]]).take(20)

[('Mazda RX4', '21'),
 ('Mazda RX4', '6'),
 ('Mazda RX4', '160'),
 ('Mazda RX4', '110'),
 ('Mazda RX4', '3.9'),
 ('Mazda RX4', '2.62'),
 ('Mazda RX4', '16.46'),
 ('Mazda RX4', '0'),
 ('Mazda RX4', '1'),
 ('Mazda RX4', '4'),
 ('Mazda RX4', '4'),
 ('Mazda RX4 Wag', '21'),
 ('Mazda RX4 Wag', '6'),
 ('Mazda RX4 Wag', '160'),
 ('Mazda RX4 Wag', '110'),
 ('Mazda RX4 Wag', '3.9'),
 ('Mazda RX4 Wag', '2.875'),
 ('Mazda RX4 Wag', '17.02'),
 ('Mazda RX4 Wag', '0'),
 ('Mazda RX4 Wag', '1')]
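The difference between `map()` and `flatMap()` can be sketched in plain Python. The helpers `local_map` and `local_flat_map` below are hypothetical stand-ins written for this illustration (they are not PySpark functions), and the two input rows are shortened versions of the rows above.

```python
from itertools import chain

def local_map(rows, func):
    """One output element per input element."""
    return [func(row) for row in rows]

def local_flat_map(rows, func):
    """Apply func, then concatenate the returned iterables into one flat list."""
    return list(chain.from_iterable(func(row) for row in rows))

rows = [['Mazda RX4', '21', '6'], ['Datsun 710', '22.8', '4']]
pair_up = lambda x: [(x[0], number) for number in x[1:]]

mapped = local_map(rows, pair_up)          # 2 elements, each a list of pairs
flattened = local_flat_map(rows, pair_up)  # 4 elements, each a single pair
print(len(mapped), len(flattened))
```

With `map()` the nested lists are kept as they are; with `flatMap()` each inner pair becomes its own row.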

### `flatMapValues()`

`flatMapValues()` operates on a key-value pair RDD and flattens the values without changing the keys.

In [7]:
mtcars_tuple.flatMapValues(lambda x: x).take(20)

[('Mazda RX4', '21'),
 ('Mazda RX4', '6'),
 ('Mazda RX4', '160'),
 ('Mazda RX4', '110'),
 ('Mazda RX4', '3.9'),
 ('Mazda RX4', '2.62'),
 ('Mazda RX4', '16.46'),
 ('Mazda RX4', '0'),
 ('Mazda RX4', '1'),
 ('Mazda RX4', '4'),
 ('Mazda RX4', '4'),
 ('Mazda RX4 Wag', '21'),
 ('Mazda RX4 Wag', '6'),
 ('Mazda RX4 Wag', '160'),
 ('Mazda RX4 Wag', '110'),
 ('Mazda RX4 Wag', '3.9'),
 ('Mazda RX4 Wag', '2.875'),
 ('Mazda RX4 Wag', '17.02'),
 ('Mazda RX4 Wag', '0'),
 ('Mazda RX4 Wag', '1')]
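As a rough plain-Python analogy, `flatMapValues()` pairs the key with every item the function returns for the value. The helper `flat_map_values` is hypothetical and exists only for this sketch; it is not part of PySpark.

```python
def flat_map_values(pairs, func):
    """Emit one (key, item) pair for every item in func(value)."""
    return [(key, item) for key, value in pairs for item in func(value)]

# Shortened records in the same shape as mtcars_tuple.
pairs = [
    ('Mazda RX4', ['21', '6']),
    ('Mazda RX4 Wag', ['21', '6', '160']),
]

# The identity function simply spreads each value list over its key.
flat = flat_map_values(pairs, lambda x: x)
print(flat)
```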

## Aggregate functions
Two aggregate functions:

* `aggregate()`
* `aggregateByKey()`

### `aggregate(zeroValue, seqOp, combOp)`

* **zeroValue** is the initial value of the accumulator, which acts like a data container. Its structure should match the structure of the value returned by the seqOp function.
* **seqOp** is a function that takes two arguments: the first is the current accumulator (initially the zeroValue) and the second is an element from the RDD. The accumulator is replaced by the returned value on every call.
* **combOp** is a function that takes two arguments: the final accumulator from one partition and the final accumulator from another partition. It merges them into one.
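The roles of the three arguments can be sketched in plain Python by treating a list of lists as the partitions. This is an illustration under stated assumptions, not Spark's implementation: `local_aggregate` is a hypothetical helper, and the partition layout is made up.

```python
from functools import reduce

def local_aggregate(partitions, zero_value, seq_op, comb_op):
    # Step 1: fold each partition on its own, starting from zero_value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Step 2: merge the per-partition results with comb_op.
    return reduce(comb_op, partials, zero_value)

# Two pretend partitions of a small numeric RDD.
partitions = [[1, 2, 3], [4, 5]]

# The accumulator is (running sum, running count).
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

total, count = local_aggregate(partitions, (0, 0), seq_op, comb_op)
print(total, count)  # 15 5
```

Note that the zero value is used once per partition and once more when merging, so it should be neutral for the operation (0 for a sum).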

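The example below calculates the sums of squared deviations from the mean (the total sum of squares) for mpg and disp, starting from the `mtcars_map` RDD created earlier.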

In [8]:
mtcars_map.take(2)

[['Mazda RX4',
  '21',
  '6',
  '160',
  '110',
  '3.9',
  '2.62',
  '16.46',
  '0',
  '1',
  '4',
  '4'],
 ['Mazda RX4 Wag',
  '21',
  '6',
  '160',
  '110',
  '3.9',
  '2.875',
  '17.02',
  '0',
  '1',
  '4',
  '4']]

### Calculate Total Sum of Squares (TSS) with `aggregate()`
#### Calculate the averages of mpg and disp

In [9]:
mtcars_vars = mtcars_tuple.mapValues(lambda x: [*map(float, x)]).map(lambda x: x[1])
mtcars_vars.take(5)

[[21.0, 6.0, 160.0, 110.0, 3.9, 2.62, 16.46, 0.0, 1.0, 4.0, 4.0],
 [21.0, 6.0, 160.0, 110.0, 3.9, 2.875, 17.02, 0.0, 1.0, 4.0, 4.0],
 [22.8, 4.0, 108.0, 93.0, 3.85, 2.32, 18.61, 1.0, 1.0, 4.0, 1.0],
 [21.4, 6.0, 258.0, 110.0, 3.08, 3.215, 19.44, 1.0, 0.0, 3.0, 1.0],
 [18.7, 8.0, 360.0, 175.0, 3.15, 3.44, 17.02, 0.0, 0.0, 3.0, 2.0]]

In [10]:
mpg_mean = mtcars_vars.map(lambda x: x[0]).mean()
disp_mean = mtcars_vars.map(lambda x: x[2]).mean()
print('mpg mean = ', mpg_mean, '; '
     'disp mean = ', disp_mean)

mpg mean =  20.090625000000003 ; disp mean =  230.721875


#### Use `aggregate()` function

In [11]:
# define zeroValue
zero_value = (0, 0) # we need two sums of squares (mpg and disp), so the initial value has two elements
# define seqOp: z is the accumulator, x is one row of mtcars_vars (x[0] = mpg, x[2] = disp)
seqOp = lambda z, x: (z[0] + (x[0] - mpg_mean)**2, z[1] + (x[2] - disp_mean)**2)
# define combOp: px and py are the accumulators of two partitions
combOp = lambda px, py: ( px[0] + py[0], px[1] + py[1] )

# run aggregate()
mtcars_vars.aggregate(zero_value, seqOp, combOp)

(1126.0471874999998, 476184.7946875)

**The operator that connects the parameters `z` and `x` in the `seqOp` function determines how values are aggregated within one partition. The operator that connects the parameters `px` and `py` in the `combOp` function determines how values are aggregated across partitions.**

The `+` operator in both **`seqOp`** and **`combOp`** indicates that it is a cumulative sum calculation.
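To see how the choice of operator changes the result, the sketch below swaps `+` for `max()` and finds the largest mpg and disp instead of a sum. It runs in plain Python on four rows (shortened to mpg, cyl, disp) split into two made-up partitions; `local_aggregate` is a hypothetical stand-in for `aggregate()`, and the `-inf` zero value is an assumption chosen so that it is neutral for `max()`.

```python
from functools import reduce

def local_aggregate(partitions, zero_value, seq_op, comb_op):
    # Fold each partition separately, then merge the partial results.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, partials, zero_value)

# First four rows of mtcars, keeping only mpg, cyl and disp.
rows = [
    [21.0, 6.0, 160.0],
    [21.0, 6.0, 160.0],
    [22.8, 4.0, 108.0],
    [21.4, 6.0, 258.0],
]
partitions = [rows[:2], rows[2:]]

# -inf never wins a max() comparison, so it is a safe starting value.
zero_value = (float('-inf'), float('-inf'))
seqOp = lambda z, x: (max(z[0], x[0]), max(z[1], x[2]))
combOp = lambda px, py: (max(px[0], py[0]), max(px[1], py[1]))

max_mpg, max_disp = local_aggregate(partitions, zero_value, seqOp, combOp)
print(max_mpg, max_disp)
```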

### `aggregateByKey(zeroValue, seqOp, combOp)`

This function is similar to `aggregate()`. While `aggregate()` combines all elements into a single result, `aggregateByKey()` aggregates the values separately for each key and returns one result per key.
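A plain-Python sketch of the per-key behaviour follows. It is an analogy only: `local_aggregate_by_key` is a hypothetical helper rather than a PySpark function, the keys and numbers are toy data, and real Spark also shuffles records between partitions, which is not modelled here.

```python
def local_aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    # Step 1: inside each partition, keep one accumulator per key.
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        partials.append(acc)
    # Step 2: merge accumulators that share a key across partitions.
    merged = {}
    for acc in partials:
        for key, value in acc.items():
            merged[key] = comb_op(merged[key], value) if key in merged else value
    return merged

# Toy pair data: key 'a' appears in both partitions, key 'b' in one.
partitions = [
    [('a', [1, 2]), ('b', [3, 4])],
    [('a', [5, 6])],
]

# Sum of squares of the first two items of each value list.
seq_op = lambda x, y: (x[0] + y[0]**2, x[1] + y[1]**2)
comb_op = lambda x, y: (x[0] + y[0], x[1] + y[1])

result = local_aggregate_by_key(partitions, (0, 0), seq_op, comb_op)
print(result)
```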

#### Import data

In [12]:
iris_rdd = sc.textFile('data/iris.csv', use_unicode=True)
iris_rdd.take(2)

['sepal_length,sepal_width,petal_length,petal_width,species',
 '5.1,3.5,1.4,0.2,setosa']

#### Transform data to a tuple RDD

In [13]:
iris_rdd_2 = iris_rdd.map(lambda x: x.split(',')).\
    filter(lambda x: x[0] != 'sepal_length').\
    map(lambda x: (x[-1], [*map(float, x[:-1])]))
iris_rdd_2.take(5)

[('setosa', [5.1, 3.5, 1.4, 0.2]),
 ('setosa', [4.9, 3.0, 1.4, 0.2]),
 ('setosa', [4.7, 3.2, 1.3, 0.2]),
 ('setosa', [4.6, 3.1, 1.5, 0.2]),
 ('setosa', [5.0, 3.6, 1.4, 0.2])]

#### Define initial values, seqOp and combOp

In [14]:
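# accumulator: (sum of squared sepal_length, sum of squared sepal_width)
zero_value = (0, 0)
# x is the accumulator; y is one value list, [sepal_length, sepal_width, ...]
seqOp = (lambda x, y: (x[0] + (y[0])**2, x[1] + (y[1])**2))
# merge the accumulators of two partitions
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))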

#### Implement `aggregateByKey()`

In [15]:
iris_rdd_2.aggregateByKey(zero_value, seqOp, combOp).collect()

[('setosa', (1259.0899999999997, 591.2500000000002)),
 ('versicolor', (1774.8600000000006, 388.4700000000001)),
 ('virginica', (2189.9000000000005, 447.33))]

#### R results

```
ddply(iris, .(Species), summarise, 
      sst_sepal_length=sum((Sepal.Length)^2),
      sst_sepal_width=sum((Sepal.Width)^2))
```

```
     Species sst_sepal_length sst_sepal_width
1     setosa          1259.09          594.60
2 versicolor          1774.86          388.47
3  virginica          2189.90          447.33
```

The R results match the PySpark output except for the setosa sepal width (591.25 in PySpark versus 594.60 in R), which suggests that `data/iris.csv` differs slightly from R's built-in `iris` data set.