# Lesson 09 - Filter, sortBy, and Reduce

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Introduction and Setup

In this lesson, we will introduce three RDD methods: `filter()`, `sortBy()`, and `reduce()`.

* **`filter()`** is a transformation that returns a new RDD containing only those elements of the original RDD that satisfy a given condition. 
* **`sortBy()`** is a transformation that returns a new RDD in which the elements of the original RDD have been sorted in some fashion. 
* **`reduce()`** is an action that aggregates the elements of an RDD according to a supplied binary operation.

We will start by creating a few RDDs to help us illustrate the use of these methods.

In [0]:
#------------------------------------------
# Numerical RDD
#------------------------------------------
num_rdd = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])

#------------------------------------------
# RDD of US President Names
#------------------------------------------
pres_rdd = sc.parallelize([
    'George Washington', 'John Adams', 'Thomas Jefferson', 
    'James Madison', 'John Quincy Adams', 'Andrew Jackson'
])

## The `filter()` Transformation

Like the `map()` and `flatMap()` transformations, the **`filter()`** transformation accepts a function `f` as a parameter. The function passed to `filter()` should take elements of the source RDD as inputs and should return a boolean value for each such element. The `filter()` transformation returns a new RDD that contains only the elements of the source RDD for which `f` returns `True`. 

In the next cell, we apply `filter()` to `num_rdd`, keeping only those elements that are less than 5.

In [0]:
#-----------------------------------------
# Example: Basic numerical filter
#-----------------------------------------

print(num_rdd.collect())

small_rdd = num_rdd.filter(lambda x : x < 5)
print(small_rdd.collect())

In [0]:
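# map() returns one boolean per element of the source RDD.
print(num_rdd.map(lambda x: x<5).collect())
# filter() keeps only the elements for which the function returns True.
print(num_rdd.filter(lambda x: x<5).collect())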

We can use the logical operators **`and`**, **`or`**, and **`not`** to build more complex filters.

In [0]:
#-----------------------------------------
# Example: Filtering on two criteria
#-----------------------------------------

mid_rdd = num_rdd.filter(lambda x : (x > 2) and (x < 5))
print(mid_rdd.collect())

The **`in`** operator is useful for applying filters to RDDs containing string elements.

In [0]:
print(pres_rdd.collect())

In [0]:
#---------------------------
# Example: Filtering text
#---------------------------

adams_rdd = pres_rdd.filter(lambda x : 'Adam' in x)
print(adams_rdd.collect())

Note that the comparison performed by **`in`** is case sensitive.

In [0]:
#---------------------------
# Example: Filtering text
#---------------------------

ad_rdd = pres_rdd.filter(lambda x : 'ad' in x)
print(ad_rdd.collect())

If we would like string comparisons to ignore capitalization, we can convert the strings involved to lowercase.

In [0]:
#---------------------------
# Example: Filtering text
#---------------------------

ad_rdd = pres_rdd.filter(lambda x : 'ad' in x.lower())
print(ad_rdd.collect())

Recall that we used `map()` to process the `diamonds.txt` data file in the last lesson. When doing so, we had to process the first line containing the header information differently from the rest of the file. We will now revisit this task. This time, we will simply filter the header out of the RDD before processing the other records.

In [0]:
diamonds_pre = sc.textFile('/FileStore/tables/diamonds.txt')

header_info = diamonds_pre.take(1)[0].split('\t')
print(header_info)

In [0]:
#--------------------------------------------
# Example: Processing diamonds.txt data file
#--------------------------------------------

diamonds_pre = sc.textFile('/FileStore/tables/diamonds.txt')

header_info = diamonds_pre.take(1)[0].split('\t')

def process_row(row):
    items = row.split('\t') 
    return [float(items[0]), items[1], items[2], items[3], 
            float(items[4]), float(items[5]), int(items[6]), 
            float(items[7]), float(items[8]), float(items[9])]

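# Wrapping the chain in parentheses lets us place each method call
# on its own line, which makes chained transformations easier to read.
# Without the enclosing parentheses, the line breaks would be a syntax error.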
diamonds = (
    diamonds_pre
    .filter(lambda x : 'carat' not in x)
    .map(process_row)
)
            
for row in diamonds.take(5):
    print(row)
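The substring test used above drops every line that contains the text `carat`, which works as long as no data row contains that text. A stricter variant compares each line against the header line itself. The sketch below illustrates the idea in plain Python: the sample lines are made up for illustration, and a list comprehension stands in for `filter()`. With an RDD, the equivalent calls would be `diamonds_pre.first()` and `diamonds_pre.filter(lambda x : x != header)`.

```python
# Hypothetical sample lines standing in for the contents of a tab-separated file.
lines = [
    'carat\tcut\tcolor',
    '0.23\tIdeal\tE',
    '0.21\tPremium\tE',
]

# The header is the first line. With an RDD this would be diamonds_pre.first().
header = lines[0]

# Keep every line that is not identical to the header line.
# With an RDD: diamonds_pre.filter(lambda x : x != header)
data_lines = [x for x in lines if x != header]

for line in data_lines:
    print(line.split('\t'))
```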

We can use `filter()` together with `count()` to determine the number of elements within an RDD that satisfy a certain condition. In the cell below, we will count the number of diamonds with each of the five levels of `cut`. As a reminder, these levels are `Fair`, `Good`, `Very Good`, `Premium`, and `Ideal`.

In [0]:
#----------------------------------------------------
# Example: Using filter to perform conditional count
#----------------------------------------------------

cut_levels = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

for cl in cut_levels:
    n = diamonds.filter(lambda x : x[1] == cl).count()
    print(f'Number of {cl} Diamonds: {n}') 

## The `sortBy()` Transformation

The `sortBy()` transformation allows us to sort the contents of an RDD. This method accepts a function as a parameter. The function will be applied to all RDD elements, and the RDD will be sorted based on these results. If we want to sort RDD elements according to their actual values (and not some derived value) then the lambda function should just return the input provided to it.

In [0]:
#-------------------------------------
# Example: Sorting a Numerical RDD.
#-------------------------------------

print(num_rdd.sortBy(lambda x : x).collect())

In [0]:
print(pres_rdd.sortBy(lambda x: x).collect()) # strings are sorted alphabetically

In the next example, we will use the `len()` function to sort an RDD of strings according to the number of characters contained within those strings.

In [0]:
#----------------------------------------------
# Example: Sorting an RDD of Strings By Length
#----------------------------------------------

print(pres_rdd.sortBy(len).collect())
print(pres_rdd.map(len).collect())

The `sortBy()` method sorts into ascending order by default, but it has an `ascending` parameter that can be set to `False` to sort in descending order.

In [0]:
#----------------------------------------------
# Example: Sorting in descending order
#----------------------------------------------

print(pres_rdd.sortBy(len, ascending=False).collect())
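The function passed to `sortBy()` can be any function that returns a sortable value, not only a built-in such as `len`. As a small sketch, the helper `last_name()` below (a hypothetical name introduced here for illustration) extracts the last word of each name. The example uses Python's built-in `sorted()` as a local stand-in so that it runs without a cluster. With the RDD, the corresponding call would be `pres_rdd.sortBy(last_name)`.

```python
names = [
    'George Washington', 'John Adams', 'Thomas Jefferson',
    'James Madison', 'John Quincy Adams', 'Andrew Jackson'
]

def last_name(full_name):
    # Hypothetical helper: the final whitespace-separated word of the name.
    return full_name.split()[-1]

# sorted() plays the role of sortBy() here; the key function is the same.
by_last_name = sorted(names, key=last_name)
print(by_last_name)
```

In plain Python, names that share a key (the two presidents named Adams) keep their original relative order.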

As one final example of `sortBy()`, we will sort the records in the `diamonds` RDD in decreasing order by price. We will then display the information for the 5 most expensive diamonds in the dataset.

In [0]:
#----------------------------------------------
# Example: Sorting an RDD of Lists
#----------------------------------------------

diamonds_sorted = diamonds.sortBy(lambda x : x[6], ascending=False)

for row in diamonds_sorted.take(5):
    print(row)


## The `reduce()` Action

- `reduce()` is used to perform aggregations on numbers, such as sums, means, counts, and least common multiples.
- With strings, it can be used to concatenate the elements or to return the longest or shortest string in the RDD.

### The Function Takes Two Arguments

The `reduce()` action allows us to aggregate the contents of an RDD by repeatedly applying a binary operation to the elements. The operation in question should be represented by a function `f` that is to be passed to `reduce()` as an argument. The argument function `f` should accept two parameters, which we will refer to as `x` and `y` for the sake of discussion. The second parameter `y` should represent an element of the RDD and the first parameter `x` should represent a "running total" containing the aggregated value of all previously considered elements of the RDD.

To understand the behavior of `reduce()`, consider an RDD created as follows: `myRDD = sc.parallelize([7, 3, 2, 5])`. If the elements are processed in order within a single partition, then the command `myRDD.reduce(f)` returns the value `f(f(f(7, 3), 2), 5)`. In other words, `reduce()` applies the binary function to the first two elements of the RDD, then applies the function to the result of that calculation and the third element, and so on. In each case, the function is applied to the previous result and a new value. 
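The nesting described above can be traced in plain Python. The sketch below uses `functools.reduce` as a local stand-in for the RDD method (it processes a list strictly from left to right) and records every pair of arguments the function receives. The name `traced_add` is introduced here only for illustration.

```python
from functools import reduce

calls = []  # every (x, y) pair passed to the function, in order

def traced_add(x, y):
    # x is the running total so far, y is the next element.
    calls.append((x, y))
    return x + y

result = reduce(traced_add, [7, 3, 2, 5])

print(calls)   # [(7, 3), (10, 2), (12, 5)]
print(result)  # 17
```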


We typically require the function `f` to be both associative and commutative. If `f` has these properties, then the result of the `reduce()` function does not depend on the order in which the elements are processed. If `f` fails to be either associative or commutative, then the order of the elements will likely alter the final result. When working on a cluster, we won't always be able to control or predict the order in which the elements are processed. 
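To see why this matters, the following plain-Python sketch imitates a cluster in a deliberately simplified way: each inner list plays the role of a partition, each partition is reduced on its own, and the partial results are then reduced together. The function `simulated_reduce` is a hypothetical helper written for this illustration and is not part of Spark.

```python
from functools import reduce
from operator import add, sub

def simulated_reduce(f, partitions):
    # Reduce each non-empty "partition" separately, then combine the partial results.
    partials = [reduce(f, part) for part in partitions if part]
    return reduce(f, partials)

one_partition = [[7, 3, 2, 5]]
two_partitions = [[7, 3], [2, 5]]

# Addition is associative and commutative: the split does not matter.
add_results = (simulated_reduce(add, one_partition),
               simulated_reduce(add, two_partitions))
print(add_results)  # (17, 17)

# Subtraction is neither: the split changes the answer.
sub_results = (simulated_reduce(sub, one_partition),
               simulated_reduce(sub, two_partitions))
print(sub_results)  # (-3, 7)
```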

- **Associative**: `(a*b)*c = a*(b*c)`. Addition is associative; subtraction is not.
- **Commutative**: `a*b = b*a`.

The simplest application of `reduce()` is to calculate the sum or product of a collection of numbers stored in an RDD. 

In [0]:
print(num_rdd.collect())

In [0]:
#----------------------------------------------
# Example: Using reduce to find sum and product
#----------------------------------------------

# sum
total = num_rdd.reduce(lambda x, y : x + y)
print(total)

# product
prod = num_rdd.reduce(lambda x, y : x * y)
print(prod)

In the next example, we will use `reduce()` to locate the longest element in an RDD of strings.

In [0]:
#----------------------------------------------
# Example: Using reduce to find longest string
#----------------------------------------------

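def comp_len(x, y):
    # Return the longer of the two strings.
    if len(x) < len(y):
        return y
    if len(y) < len(x):
        return x
    # Equal lengths: break the tie alphabetically so that the result
    # does not depend on the order of the arguments.
    if x <= y:
        return y
    return x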
  
longest_elt = pres_rdd.reduce(comp_len)
print(longest_elt)

In the next example, we will use `reduce()` to calculate the least common multiple of an RDD of integer values. This calculation will require the use of the `math` package.

In [0]:
#--------------------------
# Example: Calculating LCM (least common multiple)
#--------------------------

import math

def lcm(x, y):
    # Integer division keeps the result exact, even for very large values.
    return x * y // math.gcd(x, y)
  
result = num_rdd.reduce(lcm)
print(result)

### Example: Sum of Squared Errors

Calculating the sum of the squares of the elements in a set is an extremely common calculation in machine learning applications. 

To provide further insight into the behavior of `reduce()`, we will consider two attempts at using `reduce()` to calculate the sum of squares for an RDD. These approaches might seem valid at first glance, but neither will produce the correct answer. Review these solutions to see if you can determine why they fail to yield the correct value of 29.

In [0]:
#----------------------------------------------------
# Example: Calculating Sum of Squares (incorrectly)
#----------------------------------------------------

my_rdd = sc.parallelize([2,3,4])

attempt_1 = my_rdd.reduce(lambda x, y : x**2 + y**2)
print(attempt_1)

attempt_2 = my_rdd.reduce(lambda x, y : x + y**2)
print(attempt_2)

See if you can write some code to correctly calculate the sum of squares for the RDD above.