# Lesson 11 - Pair RDDs

## Prepare Environment

We will begin the lesson by importing some packages and creating `SparkSession` and `SparkContext` objects.

In [0]:
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

In [0]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Introduction to Pair RDDs

A **pair RDD** is an RDD whose elements are lists or tuples containing two elements. The first element of each tuple is referred to as the **key** of the tuple and the second element is referred to as the **value**. Either tuple element is allowed to be of any data type, including lists. We will see an example later in this lesson of a pair RDD in which the value element is a list.

The "key/value" terminology will likely remind you of Python dictionaries, but aside from sharing some terminology, pair RDDs don't share much in common with dictionaries. 

Pair RDDs are actually not that different from any other RDD. What makes a pair RDD special is that Spark provides a few methods that can be applied only to pair RDDs. These methods can simplify certain tasks that you might naturally wish to perform on an RDD with this structure.

To illustrate the use of these methods, we will create a small pair RDD containing fictional sales data. We will create our RDD using the data stored in the list below. Suppose that each element in this list describes an order placed during the last week with a company that sells widgets. Each element provides the name of the salesperson who is responsible for the order, as well as the number of widgets purchased.

In [0]:
sales_list = [
    'Alex 26', 'Beth 16', 'Chad 24', 'Emma 15',
    'Beth 19', 'Alex 32', 'Beth 26', 'Chad 14',
    'Emma 16', 'Drew 17', 'Beth 16', 'Drew 23',
    'Drew 18', 'Drew 21', 'Beth 11', 'Emma 32'
]

sales_rdd = sc.parallelize(sales_list)

To simulate a situation similar to what we would encounter when reading a data file with the `textFile()` method, our records are currently stored as strings. We will use `map()` and `split()` to create a pair RDD.

In [0]:
def process_row(row):
    tokens = row.split(' ')
    return (tokens[0], int(tokens[1]))

pairs_rdd = sales_rdd.map(process_row)

for row in pairs_rdd.take(5):
    print(row)

## The `countByKey()` Action
- returns a dictionary

The `countByKey()` action groups together elements with the same key, and performs a count of elements with each key. It returns a `dict` containing the results. In the example below, we will use `countByKey()` to determine the number of orders credited to each salesperson.

In [0]:
count_by_employee = pairs_rdd.countByKey()

for k, v in count_by_employee.items():
    print(k, v)
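
To make the result of `countByKey()` concrete, the sketch below reproduces the same counts in plain Python with `collections.Counter`. It is only a local analogue that runs without Spark, not a description of how Spark computes the result. The names `pairs` and `local_counts` are introduced here for illustration.

```python
from collections import Counter

# The same (name, units) pairs that the lesson stores in pairs_rdd.
pairs = [
    ('Alex', 26), ('Beth', 16), ('Chad', 24), ('Emma', 15),
    ('Beth', 19), ('Alex', 32), ('Beth', 26), ('Chad', 14),
    ('Emma', 16), ('Drew', 17), ('Beth', 16), ('Drew', 23),
    ('Drew', 18), ('Drew', 21), ('Beth', 11), ('Emma', 32),
]

# Count how often each key occurs; the values play no role in the count.
local_counts = dict(Counter(key for key, _ in pairs))

for k, v in local_counts.items():
    print(k, v)
```

Because the result is an ordinary dictionary held on the driver, this action is best suited to RDDs with a modest number of distinct keys.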

## The `reduceByKey()` Transformation

The `reduceByKey()` transformation is similar to the `reduce()` action, except that it first groups together pair RDD elements based on their key and then performs the desired aggregation on values within each of these key-groups separately. The parameters of the argument function `f` provided to `reduceByKey()` are assumed to be `value` elements from the pair RDD tuples, and not the tuples themselves. 

Unlike `reduce()`, the `reduceByKey()` method is a transformation. It returns a pair RDD with one element for each `key`.

In the example below, we will determine the total number of widgets sold by each salesperson during the last week.

In [0]:
total_by_employee = pairs_rdd.reduceByKey(lambda x, y : x + y)

for row in total_by_employee.collect():
    print(row)
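
The grouping-then-folding behavior described above can be sketched in plain Python. The helper `reduce_by_key` below is hypothetical and exists only for this illustration. It processes the pairs one at a time on a single machine, whereas Spark combines values within and across partitions, which is why the function passed to `reduceByKey()` should be associative and commutative.

```python
# The same (name, units) pairs that the lesson stores in pairs_rdd.
pairs = [
    ('Alex', 26), ('Beth', 16), ('Chad', 24), ('Emma', 15),
    ('Beth', 19), ('Alex', 32), ('Beth', 26), ('Chad', 14),
    ('Emma', 16), ('Drew', 17), ('Beth', 16), ('Drew', 23),
    ('Drew', 18), ('Drew', 21), ('Beth', 11), ('Emma', 32),
]

def reduce_by_key(pairs, f):
    """Hypothetical single-machine stand-in for reduceByKey()."""
    result = {}
    for key, value in pairs:
        # f only ever receives two values, never the (key, value) tuples.
        result[key] = f(result[key], value) if key in result else value
    return list(result.items())

totals = reduce_by_key(pairs, lambda x, y: x + y)

for row in totals:
    print(row)
```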

## The `sortByKey()` Transformation

The `sortByKey()` transformation performs a sort of the RDD elements based on the key entries. It sorts into ascending order by default, but it has an `ascending` parameter that can be set to `False` to sort in descending order. 

We will sort the elements of `total_by_employee` by salesperson name. Because the cell below sets `ascending=False`, the names appear in reverse alphabetical order.

In [0]:
sorted_by_name = total_by_employee.sortByKey(ascending=False)

for row in sorted_by_name.collect():
    print(row)

## The `sortBy()` Transformation

More general sorting options are provided by the `sortBy()` transformation, which was discussed in a previous lesson. Since every pair RDD is still an RDD, we can use standard RDD methods on pair RDDs as well. 

In the cell below, we will use `sortBy()` to sort the elements of `total_by_employee` in decreasing order of sales.

In [0]:
sorted_by_sales = total_by_employee.sortBy(lambda x : x[1], ascending=False)

for row in sorted_by_sales.collect():
    print(row)
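
As a rough local analogue of the two sorts above, the sketch below uses Python's built-in `sorted()`. Sorting on `kv[0]` plays the role of `sortByKey()`, and sorting on `kv[1]` plays the role of `sortBy(lambda x : x[1])`. The list `totals` holds the per-salesperson sums of the values in `sales_list`; the variable names are chosen here for illustration.

```python
# Total units per salesperson, summed from sales_list.
totals = [('Alex', 58), ('Beth', 88), ('Chad', 38), ('Emma', 63), ('Drew', 79)]

# Analogue of sortByKey(ascending=False): order by the key (the name).
by_name_desc = sorted(totals, key=lambda kv: kv[0], reverse=True)

# Analogue of sortBy(lambda x : x[1], ascending=False): order by the value.
by_sales_desc = sorted(totals, key=lambda kv: kv[1], reverse=True)

print(by_name_desc)
print(by_sales_desc)
```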

## The `mapValues()` Transformation

The `mapValues()` transformation allows us to apply a function to the value entries within the tuples of a pair RDD without affecting the keys. It does not provide any new functionality over `map()`, but allows us to write somewhat cleaner code. As with `reduceByKey()`, the parameter for the argument function `f` is assumed to be a `value` element selected from a pair RDD tuple, and not an actual tuple.

Assume that the widgets have a unit price of 137. We will multiply each of the sales numbers by 137 to determine the total revenue generated by each salesperson during the last week. We will see how to do this in two different ways. We will first use `map()` and will then see how to use `mapValues()` to accomplish this task.

In [0]:
revenue_rdd = total_by_employee.map(lambda x : (x[0], 137 * x[1]))

for row in revenue_rdd.collect():
    print(row)


Notice that when using `map()`, the argument function is assumed to accept pair tuples as its input and we have to specify that the value returned is also expected to be a tuple. Compare this approach with the `mapValues()` approach shown below.

In [0]:
revenue_rdd = total_by_employee.mapValues(lambda x : 137 * x)

for row in revenue_rdd.collect():
    print(row)
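
The equivalence of the two approaches can be checked locally with a small sketch. The helper `map_values` and the constant `UNIT_PRICE` are hypothetical names used only here, and `totals` holds the per-salesperson sums of the values in `sales_list`.

```python
# Total units per salesperson, summed from sales_list.
totals = [('Alex', 58), ('Beth', 88), ('Chad', 38), ('Emma', 63), ('Drew', 79)]
UNIT_PRICE = 137

# map()-style: the function receives the whole tuple and must rebuild it.
via_map = [(kv[0], UNIT_PRICE * kv[1]) for kv in totals]

def map_values(pairs, f):
    """Hypothetical stand-in for mapValues(): f sees only the value."""
    return [(key, f(value)) for key, value in pairs]

# mapValues()-style: the key is carried along untouched.
via_map_values = map_values(totals, lambda x: UNIT_PRICE * x)

print(via_map == via_map_values)
for row in via_map_values:
    print(row)
```

One practical difference noted in Spark's documentation is that `mapValues()` retains the original RDD's partitioning, which a general `map()` cannot promise because it is free to change the keys.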

## Finding Average Sales

In the cells below, we will apply what we have covered in this lesson to calculate the average number of units sold per order by each employee. The first four lines of code in the final cell of this section work as follows:

1. This produces a pair RDD named `step1` with elements of the form: `(name, (units, 1))`
2. This produces a pair RDD named `step2` with one element for each salesperson and with elements of the form: `(name, (total_units, count_of_sales))`
3. This produces a pair RDD named `avg_sale` with elements of the form: `(name, total_units/count_of_sales)`
4. This sorts the previous pair RDD by the average number of units sold per order. 

To confirm that the structure of the RDDs created is as described above, I encourage you to add a few lines of code to display the elements of each of these RDDs.

In [0]:
for row in pairs_rdd.take(5):
    print(row)

In [0]:
step1 = pairs_rdd.mapValues(lambda x: (x, 1))
for row in step1.take(5):
    print(row)

In [0]:
# Add the units and the counts component-wise for each salesperson.
step2 = step1.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

In [0]:
step1 = pairs_rdd.mapValues(lambda x : (x, 1))
step2 = step1.reduceByKey(lambda x, y : (x[0] + y[0], x[1] + y[1]))
avg_sale = step2.mapValues(lambda x : x[0] / x[1])
avg_sale = avg_sale.sortBy(lambda x : x[1], ascending=False)

for row in avg_sale.collect():
    print(row)
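
It may not be obvious why step 1 pairs every value with a `1` instead of averaging directly inside the reduce step. The plain-Python sketch below uses Beth's five orders from `sales_list` to show the difference. Averaging two values at a time gives the wrong answer because that operation is not associative, while carrying `(total, count)` pairs and dividing once at the end gives the true mean. The variable names are chosen here for illustration.

```python
from functools import reduce

beth_units = [16, 19, 26, 16, 11]  # Beth's orders from sales_list

# Tempting but wrong: average two values at a time.
naive = reduce(lambda x, y: (x + y) / 2, beth_units)

# The lesson's approach: sum (units, 1) pairs, then divide once.
total, count = reduce(
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
    [(units, 1) for units in beth_units],
)
correct = total / count

print(naive)    # 14.9375
print(correct)  # 17.6
```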

## Iris Dataset

In this example, we will look at the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). This dataset contains information about 150 flowers from three different iris species: setosa, versicolor, and virginica. For each flower, we are provided with the species of the flower, as well as measurements for certain leaf-like structures on that flower. Specifically, we are provided with the length and width of the sepals, and the length and width of the petals for each flower. 

In the cell below, we load the data set and view the first 5 rows.

In [0]:
iris_raw = sc.textFile('/FileStore/tables/iris.txt')

for row in iris_raw.take(5):
    print(row)

Our first task will be to process the dataset. We will filter out the header row and will process each remaining line by tokenizing the string and coercing each value into the appropriate datatype. We will structure our resulting RDD as a pair RDD in which the key indicates the flower species and the value contains a list of sepal and petal measurements.

In [0]:
header = iris_raw.take(1)[0].split('\t')

def process_row(row):
    tokens = row.split('\t')
    sl, sw = float(tokens[0]), float(tokens[1])
    pl, pw = float(tokens[2]), float(tokens[3])
    species = tokens[4]
    return (species, [sl, sw, pl, pw])

iris = (iris_raw
        .filter(lambda x : 'Species' not in x)
        .map(process_row))

for row in iris.take(5):
    print(row)
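
The filter-then-parse pattern used above can be tried without Spark on a few sample lines. The header text and the two data lines below are hypothetical stand-ins for the contents of `iris.txt`, which is assumed here to be tab-separated with a header that contains the word `Species`. The function name `parse_iris_line` is also introduced only for this sketch.

```python
def parse_iris_line(line):
    """Turn one tab-separated line into (species, [sl, sw, pl, pw])."""
    tokens = line.split('\t')
    measurements = [float(t) for t in tokens[:4]]
    return (tokens[4], measurements)

# Hypothetical sample lines; the real file's header names may differ.
sample_lines = [
    'sepal_length\tsepal_width\tpetal_length\tpetal_width\tSpecies',
    '5.1\t3.5\t1.4\t0.2\tsetosa',
    '7.0\t3.2\t4.7\t1.4\tversicolor',
]

# Drop the header first, then parse the remaining lines.
records = [parse_iris_line(line) for line in sample_lines
           if 'Species' not in line]

for row in records:
    print(row)
```

Filtering on the substring `'Species'` works only as long as no data line contains that word, so it is worth checking against the actual file.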

We will now use pair RDD methods to determine the number of flowers of each species, as well as the average for each of the four measures within each species.

In [0]:
iris_means = (
    iris
    .mapValues(lambda x : x + [1])
    .reduceByKey(lambda x, y : [a+b for a,b in zip(x,y)])
    .mapValues(lambda x : [x[-1]] + [round(a/x[-1],2) for a in x[:-1]])
)

for row in iris_means.collect():
    print(row)
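
The three steps in the cell above are compact, so the sketch below walks through them in plain Python on three made-up records. The measurements are hypothetical and chosen to keep the arithmetic easy to follow. Appending `1` to each list lets a single element-wise sum accumulate both the measurement totals and the count, and the final step moves the count to the front and divides each total by it.

```python
# Hypothetical (species, [sl, sw, pl, pw]) records.
records = [
    ('setosa', [5.0, 3.0, 1.0, 0.5]),
    ('setosa', [5.5, 3.5, 1.5, 0.5]),
    ('virginica', [6.0, 3.0, 5.0, 2.0]),
]

# Step 1: append a counter of 1 to every measurement list.
with_count = [(key, values + [1]) for key, values in records]

# Step 2: element-wise sum per key (the local analogue of reduceByKey).
sums = {}
for key, values in with_count:
    if key in sums:
        sums[key] = [a + b for a, b in zip(sums[key], values)]
    else:
        sums[key] = values

# Step 3: put the count first, then each total divided by the count.
means = {
    key: [v[-1]] + [round(a / v[-1], 2) for a in v[:-1]]
    for key, v in sums.items()
}

for row in means.items():
    print(row)
```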

In the cell below, we use bar charts to visually display the results from the previous cell.

In [0]:
iris_list = sorted(iris_means.collect())
# Take the labels from the results so that each bar matches its species.
species = [item[0] for item in iris_list]

plt.figure(figsize=[6,4])
for i in range(1, 5):
    means = [item[1][i] for item in iris_list]    
    plt.subplot(2,2, i)
    plt.bar(species, means, color=['lightcoral', 'steelblue', 'lightgreen'], edgecolor='k')
    plt.title(f'Mean {header[i-1]}')
    
plt.tight_layout()
plt.show()
    