In [15]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PySparkLearning').getOrCreate()

In [24]:
data = [("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),("B", 60)]
inputRDD = spark.sparkContext.parallelize(data)
  
listRdd = spark.sparkContext.parallelize([1,2,3,4,5,3,2])

### aggregate() 
`aggregate()` aggregates the elements of each partition, and then combines the results of all the partitions into a single value.

Aggregate Function definition in PySpark

**aggregate(zeroValue, seqOp, combOp)**

`zeroValue` is the initial value of the accumulated result in each partition for the `seqOp` operator, and also the initial value when the `combOp` operator combines the results from different partitions. It should normally be the neutral element of the operation (e.g. an empty list for list concatenation or 0 for summation). In short, it is used in both operations.

`seqOp` is the operator used to accumulate results within a partition. It is the operation you want to apply to the RDD records: it runs once for every record in a partition, in every partition, taking the running result and the next element.

`combOp` is the operator used to combine results from different partitions, i.e. it defines how the resulting objects (one for every partition) should be merged.

Overall, `seqOp` aggregates the elements within each partition, and `combOp` merges the `seqOp` results of all the partitions. Both operations start from the same initial value, `zeroValue`.

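To make the roles of the three arguments concrete, here is a plain-Python sketch of what `aggregate()` computes. It does not use Spark: `simulate_aggregate` is a hypothetical helper, and the partitions are written out by hand as nested lists (an assumption; Spark decides the real split). It also suggests why `zeroValue` should be the neutral element: a non-neutral value is added once per partition and once more in the final combine.

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Hypothetical stand-in for RDD.aggregate() over hand-made partitions."""
    # Step 1: fold every partition on its own, starting from zero_value.
    local_results = [reduce(seq_op, part, zero_value) for part in partitions]
    # Step 2: merge the per-partition results, again starting from zero_value.
    return reduce(comb_op, local_results, zero_value)

# Assumed split of [1, 2, 3, 4, 5, 3, 2] into three partitions.
partitions = [[1, 2], [3, 4], [5, 3, 2]]
add = lambda acc, x: acc + x

neutral = simulate_aggregate(partitions, 0, add, add)
non_neutral = simulate_aggregate(partitions, 1, add, add)

print(neutral)      # 20
print(non_neutral)  # 24: the extra 1 is counted 3 times (partitions) + 1 (combine)
```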
In [17]:
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
agg=listRdd.aggregate(0, seqOp, combOp)
print(agg) 
# output 20

20


In [21]:
seqOp2 = (lambda x, y: (x[0] + y, x[1] + 1))
combOp2 = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
agg2=listRdd.aggregate((0, 0), seqOp2, combOp2)
print(agg2) 
# output (20,7)

(20, 7)


In [20]:
#  One more example with explanation
list_RDD = spark.sparkContext.parallelize([1,2,3,4], 2)
seqOp = (lambda local_result, list_element: (local_result[0] + list_element, local_result[1] + 1) )
combOp = (lambda some_local_result, another_local_result: (some_local_result[0] + another_local_result[0], some_local_result[1] + another_local_result[1]) )

In [19]:
list_RDD.aggregate( (0, 0), seqOp, combOp)

(10, 4)

As you can see, I gave descriptive names to my variables, but let me explain it further:

The first partition has the sublist [1, 2]. We will apply the `seqOp` to each element of that list and this will produce a local result, a pair of (sum, length), that will reflect the result locally, only in that first partition.

So, let's start: local_result gets initialized to the zeroValue parameter we provided the aggregate() with, i.e. (0, 0) and list_element is the first element of the list, i.e. 1. As a result this is what happens:

```
0 + 1 = 1   (sum:    local_result[0] + list_element)
0 + 1 = 1   (length: local_result[1] + 1)
```
Now `local_result` is (1, 1), which means that so far, for the 1st partition, after processing only the first element, the sum is 1 and the length is 1. Notice that `local_result` gets updated from (0, 0) to (1, 1). The second element, 2, is processed next:
```
1 + 2 = 3
1 + 1 = 2
```
and now `local_result` is (3, 2), which is the final result for the 1st partition, since there are no other elements in its sublist.

Doing the same for the 2nd partition, which holds the sublist [3, 4], we get (7, 2).

Now we apply `combOp` to the local results to form the final, global result, adding them element by element: (3, 2) + (7, 2) = (10, 4).

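The walkthrough above can be replayed in plain Python. This is only a sketch: it assumes the two partitions are [1, 2] and [3, 4] as described, and uses `itertools.accumulate` (Python 3.8+ for the `initial` argument) to expose every intermediate `local_result`. The names `trace_partition`, `first` and `second` are illustrative and not part of any Spark API.

```python
from itertools import accumulate

seqOp = lambda local_result, list_element: (local_result[0] + list_element, local_result[1] + 1)
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])
zero_value = (0, 0)

def trace_partition(elements):
    """Return every state of local_result, starting with zero_value."""
    return list(accumulate(elements, seqOp, initial=zero_value))

first = trace_partition([1, 2])
second = trace_partition([3, 4])
print(first)   # [(0, 0), (1, 1), (3, 2)]
print(second)  # [(0, 0), (3, 1), (7, 2)]

# Merge the last state of each partition, starting again from zero_value.
total, count = combOp(combOp(zero_value, first[-1]), second[-1])
print((total, count))   # (10, 4)
print(total / count)    # 2.5, the mean of [1, 2, 3, 4]
```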
Example described in 'figure':


<img src="../Reference%20Images/rdd-aggregate-functionality.png">

**treeAggregate()** 

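– Aggregates the elements of this RDD in a multi-level tree pattern. It takes the same arguments as `aggregate()` plus an optional `depth` (default 2), and for the example below it returns the same result; the difference is that partial results are merged in intermediate stages instead of all at once on the driver.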

In [26]:
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
agg2=listRdd.treeAggregate(0,seqOp, combOp)
print(agg2)

20

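To illustrate what "multi-level tree pattern" means, the following plain-Python sketch merges per-partition results pairwise, level by level, instead of folding them all in one pass. It is a simplified illustration, not Spark's actual implementation (which groups partitions according to its `depth` parameter); `tree_combine` and the partial sums are hypothetical.

```python
def tree_combine(partials, comb_op):
    """Merge partial results pairwise until one value remains."""
    level = list(partials)
    levels = [level]
    while len(level) > 1:
        merged = [comb_op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])   # odd one out is carried to the next level
        level = merged
        levels.append(level)
    return level[0], levels

# Assumed per-partition sums of [1, 2, 3, 4, 5, 3, 2] split over four partitions.
partials = [3, 7, 8, 2]
result, levels = tree_combine(partials, lambda x, y: x + y)
print(levels)  # [[3, 7, 8, 2], [10, 10], [20]]
print(result)  # 20
```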

**fold()**

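– Aggregates the elements of each partition, and then the results for all the partitions, using a single associative function and a neutral `zeroValue`. It behaves like `aggregate()` with the same function used as both `seqOp` and `combOp`.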

In [28]:
from operator import add
foldRes=listRdd.fold(0, add)
print(foldRes)

20


**reduce()**

– Reduces the elements of the dataset using the specified commutative and associative binary operator. Unlike `fold()`, it takes no initial value.

In [30]:
from operator import add
redRes=listRdd.reduce(add)
print(redRes)

20

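The operator passed to `reduce()` needs to be commutative and associative, because each partition is reduced on its own and the partial results are then merged. The plain-Python sketch below imitates that two-step process with a hypothetical `simulate_reduce` helper and hand-made partitions (both are assumptions for illustration, not Spark code): addition is insensitive to how the data is split, subtraction is not.

```python
from functools import reduce
from operator import add, sub

def simulate_reduce(partitions, op):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

one_partition = [[1, 2, 3, 4, 5, 3, 2]]
two_partitions = [[1, 2, 3], [4, 5, 3, 2]]

print(simulate_reduce(one_partition, add), simulate_reduce(two_partitions, add))  # 20 20
print(simulate_reduce(one_partition, sub), simulate_reduce(two_partitions, sub))  # -18 2
```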

**treeReduce()**

– Reduces the elements of this RDD in a multi-level tree pattern.

In [31]:
# treeReduce is used like reduce and gives the same result here
from operator import add
redRes=listRdd.treeReduce(add)
print(redRes)

20


**collect()**

– Returns the complete dataset to the driver as a Python list. Use it only when the dataset is small enough to fit in the driver's memory.

In [32]:
data = listRdd.collect()
print(data)


[1, 2, 3, 4, 5, 3, 2]


`count()` – Return the count of elements in the dataset.

`countApprox()` – Return an approximate count of the elements in the dataset. It takes a timeout in milliseconds and returns a possibly incomplete result if the job has not finished by then.

`countApproxDistinct()` – Return an approximate number of distinct elements in the dataset.

In [34]:
# count, countApprox, countApproxDistinct
print("Count : "+str(listRdd.count()))
print("countApprox : "+str(listRdd.countApprox(1200)))
print("CountApproxDistinct : "+str(listRdd.countApproxDistinct()))

Count : 7
countApprox : 7
CountApproxDistinct : 5


`countByValue()` – Return a dictionary (a `defaultdict` in PySpark, `Map[T, Long]` in Scala) whose keys are the unique values in the dataset and whose values are the number of times each one occurs.

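`countByValueApprox()` – Same as `countByValue()` but returns an approximate result. It belongs to the Scala/Java RDD API; PySpark's RDD does not appear to expose it, so it is not called below.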

In [39]:
print("countByValue :  "+str(listRdd.countByValue()))

countByValue :  defaultdict(<class 'int'>, {1: 1, 2: 2, 3: 2, 4: 1, 5: 1})

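For intuition, `countByValue()` gives the same counts that `collections.Counter` would give on the collected list. The sketch below is plain Python, not Spark, and reuses the values of `listRdd` as a literal list.

```python
from collections import Counter

values = [1, 2, 3, 4, 5, 3, 2]   # same elements as listRdd
counts = Counter(values)

print(dict(counts))            # {1: 1, 2: 2, 3: 2, 4: 1, 5: 1}
print(counts.most_common(2))   # [(2, 2), (3, 2)]
```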

`first()` – Return the first element in the dataset.

In [40]:
print("first :  "+str(listRdd.first()))
print("first :  "+str(inputRDD.first()))

first :  1
first :  ('Z', 1)


`top()` – Return the top n elements from the dataset, sorted in descending order.

Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

In [44]:
print("top : "+str(listRdd.top(2)))
print("top : "+str(inputRDD.top(2)))

top : [5, 4]
top : [('Z', 1), ('C', 40)]

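The second result may look surprising: `('Z', 1)` ranks first because tuples are compared element by element, so the letter decides before the number is even looked at. The plain-Python sketch below reproduces this with `heapq.nlargest` and shows the effect of a `key` function. `top()` in PySpark accepts an optional `key` argument as well, shown here only in a comment.

```python
import heapq

data = [("Z", 1), ("A", 20), ("B", 30), ("C", 40), ("B", 30), ("B", 60)]

# Default ordering compares tuples lexicographically: letter first, then number.
by_tuple = heapq.nlargest(2, data)
print(by_tuple)   # [('Z', 1), ('C', 40)]

# Ranking by the numeric field instead.
# PySpark counterpart (not run here): inputRDD.top(2, key=lambda kv: kv[1])
by_value = heapq.nlargest(2, data, key=lambda kv: kv[1])
print(by_value)   # [('B', 60), ('C', 40)]
```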

`min()` – Return the minimum value from the dataset.

`max()` – Return the maximum value from the dataset.

In [43]:
print("min :  "+str(listRdd.min()))
print("min :  "+str(inputRDD.min()))

print("max :  "+str(listRdd.max()))
print("max :  "+str(inputRDD.max()))

min :  1
min :  ('A', 20)
max :  5
max :  ('Z', 1)


`take()` – Return the first num elements of the dataset.

`takeOrdered()` – Return the first num elements of the dataset in ascending order (i.e. the smallest ones), or as ordered by an optional key function. It is the opposite of the `top()` action.

Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

`takeSample()` – Return a fixed-size random sample of the dataset as a list.

Note: Use this method only when the resulting array is small, as all the data is loaded into the driver’s memory.

In [49]:
print("take : "+str(listRdd.take(2)))
print("takeOrdered : "+ str(listRdd.takeOrdered(2)))


take : [1, 2]
takeOrdered : [1, 2]

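With two elements, `take` and `takeOrdered` happen to return the same list, which hides the difference between them. The plain-Python sketch below mimics the three actions on the same values with three elements. It is not Spark code: the corresponding RDD calls are given in comments, and the seed value 42 is arbitrary.

```python
import heapq
import random

values = [1, 2, 3, 4, 5, 3, 2]   # same elements as listRdd

# listRdd.take(3): the first three elements in their stored order.
taken = values[:3]
print(taken)      # [1, 2, 3]

# listRdd.takeOrdered(3): the three smallest elements, ascending.
ordered = heapq.nsmallest(3, values)
print(ordered)    # [1, 2, 2]

# listRdd.takeOrdered(3, key=lambda x: -x): the three largest, like top(3).
largest = heapq.nsmallest(3, values, key=lambda x: -x)
print(largest)    # [5, 4, 3]

# listRdd.takeSample(False, 3, seed=42): three elements picked at random
# without replacement (the exact elements depend on the sampler).
sample = random.Random(42).sample(values, 3)
print(len(sample))
```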