## How do I make an RDD?

RDDs can be created from stable storage or by transforming other RDDs. Run the cells below to create RDDs from files on the local drive. All data files can be downloaded from https://www.cse.ust.hk/msbd5003/data/ (for example, https://www.cse.ust.hk/msbd5003/data/fruits.txt).

In [21]:
# Read data from local file system:
print(sc.version)

fruits = sc.textFile('fruits.txt')
yellowThings = sc.textFile('yellowthings.txt')

3.3.1


In [2]:
print(fruits.collect())
print(yellowThings.collect())

print(fruits.getNumPartitions())

print(fruits.glom().collect())
print(yellowThings.glom().collect())

['apple', 'banana', 'canary melon', 'grap', 'lemon', 'orange', 'pineapple', 'strawberry']
['banana', 'bee', 'butter', 'canary melon', 'gold', 'lemon', 'pineapple', 'sunflower']
2


[['apple', 'banana', 'canary melon', 'grap', 'lemon'], ['orange', 'pineapple', 'strawberry']]
[['banana', 'bee', 'butter', 'canary melon', 'gold'], ['lemon', 'pineapple', 'sunflower']]

In [None]:
# Read data from HDFS :
fruits = sc.textFile('hdfs://url:9000/pathname/fruits.txt')
fruits.collect()

----------

##  RDD operations

In [None]:
fruitsReversed = fruits.map(lambda fruit: fruit[::-1])

In [None]:
fruitsReversed.cache()
# try changing the file and re-execute with and without cache
fruitsReversed.collect()
# What happens when you uncomment the first line and run the whole program again with cache()?

In [None]:
# filter
k = 5
shortFruits = fruits.filter(lambda fruit: len(fruit) <= k)
print(shortFruits.collect())

In [None]:
# flatMap: map is like list.append (one output per input); flatMap is like list.extend (each result is flattened into the output)
characters = fruits.flatMap(lambda fruit: list(fruit))
print(characters.collect())
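
The difference between `map` and `flatMap` can be mimicked in plain Python. The sketch below is only an illustration: `fruits_local` is an assumed in-memory list standing in for the RDD, and no Spark call is involved.

```python
from itertools import chain

# Stand-in for the fruits RDD (assumption: a plain Python list, no Spark involved).
fruits_local = ['apple', 'grap']

# map-like: exactly one output element per input element (here, one list per fruit).
mapped = [list(fruit) for fruit in fruits_local]

# flatMap-like: each per-element result is flattened into the output sequence.
flat_mapped = list(chain.from_iterable(list(fruit) for fruit in fruits_local))

print(mapped)
print(flat_mapped)
```

The first result is a list of lists, while the second is one flat list of characters.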

In [None]:
# union
fruitsAndYellowThings = fruits.union(yellowThings)
print(fruitsAndYellowThings.collect())

In [None]:
# intersection
yellowFruits = fruits.intersection(yellowThings)
print(yellowFruits.collect())

In [None]:
print(fruits.subtract(yellowThings).collect())

In [None]:
# distinct
distinctFruitsAndYellowThings = fruitsAndYellowThings.distinct()
print(distinctFruitsAndYellowThings.collect())

In [None]:
#Cartesian product
print(fruits.cartesian(yellowThings).collect())

In [None]:
print(fruits.collect())
print(fruits.glom().collect())

In [22]:
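# zip
# zip() pairs up the elements of two RDDs position by position and returns a new RDD of tuples.
# Both RDDs must have the same number of partitions and the same number of elements in each partition.
#newyellowThings = yellowThings.repartition(3)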
print(fruits.zip(yellowThings).collect())

#a new RDD of tuples containing positional index information
print(fruits.zipWithIndex().collect())

# Items in the kth partition will get ids k, n+k, 2*n+k, …, where n is the number of partitions. 
# This is more efficient since each partition is processed independently(useful when considering partitions)
print(fruits.zipWithUniqueId().collect())

[('apple', 'banana'), ('banana', 'bee'), ('canary melon', 'butter'), ('grap', 'canary melon'), ('lemon', 'gold'), ('orange', 'lemon'), ('pineapple', 'pineapple'), ('strawberry', 'sunflower')]
[('apple', 0), ('banana', 1), ('canary melon', 2), ('grap', 3), ('lemon', 4), ('orange', 5), ('pineapple', 6), ('strawberry', 7)]
[('apple', 0), ('banana', 2), ('canary melon', 4), ('grap', 6), ('lemon', 8), ('orange', 1), ('pineapple', 3), ('strawberry', 5)]
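
The id rule quoted in the comment above (items in partition k get ids k, n+k, 2*n+k, ...) can be checked by hand. The following is a plain-Python sketch of that rule, not Spark's implementation; `zip_with_unique_id` is a hypothetical helper, and the partition layout is copied from the `glom()` output shown earlier.

```python
def zip_with_unique_id(partitions):
    """Pair each item with the id k + i*n, where k is the partition index,
    i is the item's position inside its partition, and n is the partition count."""
    n = len(partitions)
    return [(item, k + i * n)
            for k, part in enumerate(partitions)
            for i, item in enumerate(part)]

# Partition layout copied from the glom() output earlier in this notebook.
fruit_partitions = [['apple', 'banana', 'canary melon', 'grap', 'lemon'],
                    ['orange', 'pineapple', 'strawberry']]

unique_ids = zip_with_unique_id(fruit_partitions)
print(unique_ids)
```

Each id depends only on the partition index and the position within the partition, which is why no partition needs to know the size of any other.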


### RDD actions

In [None]:
# collect: returns all elements of the RDD to the driver as a Python list
fruitsArray = fruits.collect()
yellowThingsArray = yellowThings.collect()
print(fruitsArray)

In [None]:
# count
numFruits = fruits.count()
print(numFruits)

In [None]:
# take
first3Fruits = fruits.take(3)
print(first3Fruits)

In [None]:
# Tip: Don't use count() when you don't need to return the exact number of rows
# use take() or isEmpty()
print(fruits.isEmpty())

In [None]:
print(fruits.map(lambda fruit: len(fruit)).sum())
print(fruits.map(lambda fruit: len(fruit)).reduce(lambda x, y: x+y))

In [None]:
# The reduce function must be associative and commutative; otherwise the result depends on how the data is partitioned.
#rdd = sc.parallelize([1, 2, 1, 3, 4, 5, 2], 4)
rdd = sc.parallelize([1, 2, 3], 2)
print(rdd.glom().collect())
print(rdd.reduce(lambda x, y: 2*x+y))
# This function is not associative: Spark first reduces within each partition, then combines the per-partition results from left to right.
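
To see why the partitioning matters, the two-level reduction can be imitated in plain Python. This is a sketch under assumptions: `partitioned_reduce` is a hypothetical helper, and the split `[[1], [2, 3]]` is assumed to be the layout that `glom()` reports for the RDD above.

```python
from functools import reduce

def op(x, y):
    # Neither associative nor commutative.
    return 2 * x + y

def partitioned_reduce(partitions, f):
    # Reduce inside each non-empty partition, then reduce the partial results.
    partials = [reduce(f, part) for part in partitions if part]
    return reduce(f, partials)

one_partition = partitioned_reduce([[1, 2, 3]], op)     # 2*(2*1 + 2) + 3
two_partitions = partitioned_reduce([[1], [2, 3]], op)  # 2*1 + (2*2 + 3)

print(one_partition, two_partitions)
```

The same data and the same function give 11 with one partition and 9 with two, which is the behaviour the comment warns about.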

In [None]:
# reduce
letterSet = fruits.map(lambda fruit: set(fruit)).reduce(lambda x, y: x.union(y))
print(letterSet)

In [None]:
# treeReduce 
# Data are combined partially on a small set of executors before they are sent to the driver, 
# which dramatically reduces the load the driver has to deal with. 

letterSet = fruits.map(lambda fruit: set(fruit)).treeReduce(lambda x, y: x.union(y))
print(letterSet)

In [None]:
letterSet = fruits.flatMap(lambda fruit: list(fruit)).distinct().collect()
#list("shi jia")->['s','h','i'," ","j",'i','a']
print(letterSet)

In [None]:
# fold:Aggregate the elements of each partition, and then the results for 
# all the partitions, using a given associative function and a neutral “zero value.”
# op(t1, t2) is allowed to modify t1(empty set here) and return it as its result value to avoid object allocation;
# however, it should not modify t2.
print(fruits.glom().collect())
letterSet = fruits.map(lambda fruit: set(fruit)).fold(set(), lambda x, y: x.union(y))
print(letterSet)

In [83]:
# reducing an empty rdd is not allowed, but fold is OK.
r = sc.parallelize([])
#r.reduce(lambda x, y: x+y)
r.fold(0, lambda x, y: x+y)

0
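
The reason `fold` tolerates an empty RDD can be seen in a plain-Python imitation of its two-level structure. This sketch is an approximation of the semantics described above, with a hypothetical `fold` helper operating on lists of partitions.

```python
from functools import reduce
from operator import add

def fold(partitions, zero, op):
    # Fold each partition starting from zero, then fold the partial
    # results, again starting from zero.
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

empty_result = fold([], 0, add)            # nothing to combine: zero is returned
neutral = fold([[1], [2, 3]], 0, add)      # ordinary sum
non_neutral = fold([[1], [2, 3]], 1, add)  # the zero value is added three times

print(empty_result, neutral, non_neutral)
```

With a zero value that is not neutral, the result grows with the number of partitions, so the zero value should always be the identity of the operator.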

In [86]:
# aggregate / treeAggregate can return a result whose type differs from the element type of the RDD.
# aggregate: aggregate the elements of each partition, and then the results for all the partitions,
# using the given combine functions and a neutral "zero value".
# seqOp: an operator used to accumulate results within a partition
# combOp: an associative operator used to combine results from different partitions
def f(x, y):
    # x is the accumulator (initially the zero value, an empty set here); y is an element of the RDD
    x.add(y)
    return x

letterSet = fruits.flatMap(lambda fruit: list(fruit)).treeAggregate(set(), f, lambda x, y: x.union(y))
print(letterSet)

# It avoids object allocation.
# This is the most efficient way for solving this problem.

{'o', 'r', 'a', 'i', 'p', 'g', 'c', ' ', 'l', 'y', 'e', 'w', 'n', 'b', 'm', 't', 's'}
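
The roles of `seqOp` and `combOp` can be illustrated without Spark. The sketch below is a simplified model, not the library's code: `aggregate` and `add_letter` are hypothetical helpers, and each partition is given its own copy of the zero value so that the in-place `add` is safe.

```python
import copy
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    # Each partition starts from its own copy of zero, so seq_op may mutate it.
    partials = [reduce(seq_op, part, copy.deepcopy(zero)) for part in partitions]
    return reduce(comb_op, partials, copy.deepcopy(zero))

def add_letter(acc, ch):
    acc.add(ch)  # mutate the accumulator instead of allocating a new set
    return acc

# Elements are characters, but the result is a set.
letters = aggregate([list('apple'), list('grap')], set(),
                    add_letter, lambda a, b: a | b)

# Elements are ints, but the result is a (sum, count) pair.
total, count = aggregate([[1, 2], [3, 4, 5]], (0, 0),
                         lambda acc, x: (acc[0] + x, acc[1] + 1),
                         lambda a, b: (a[0] + b[0], a[1] + b[1]))

print(sorted(letters), total / count)
```

The second call shows the typical use of a different result type: computing a mean in one pass by carrying a running sum and count.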


In [88]:
# foreach is an action, map is a transformation.
fruits.foreach(lambda x: print('I have a', x))

I have a apple
I have a banana
I have a canary melon
I have a grap
I have a lemon
I have a orange
I have a pineapple
I have a strawberry


In [89]:
print(fruits.map(lambda x: 'I have a ' + x).collect())

['I have a apple', 'I have a banana', 'I have a canary melon', 'I have a grap', 'I have a lemon', 'I have a orange', 'I have a pineapple', 'I have a strawberry']


### Closure

In [20]:
# The closure consists of the variables and methods that must be visible to
# the executor for it to perform its computations on the RDD.
# Each executor works on its own copy of the closure, so changes made inside the function stay on the executor;
# results have to be returned by the function or passed through a shared variable.
counter = 0
rdd = sc.parallelize(range(10), 3)

print(rdd.glom().collect())

# Wrong: Don't do this!!
def increment_counter(x):
    global counter
    counter += x
    print(x, counter)

print(rdd.collect())
rdd.foreach(increment_counter)

print(counter)
print(rdd.sum())

[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0
45


0 0
1 1
2 3
6 6
7 13
8 21
9 30
3 3
4 7
5 12
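
The output above can be reproduced with a small simulation. This is a plain-Python model under a stated assumption: every task receives its own copy of the captured variables, represented here by a dictionary; `run_task` is a hypothetical helper, not a Spark API.

```python
import copy

def run_task(partition, closure):
    # Each task receives its own copy of the captured variables.
    local = copy.deepcopy(closure)
    for x in partition:
        local['counter'] += x
    return local['counter']

driver_closure = {'counter': 0}
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

task_counters = [run_task(p, driver_closure) for p in partitions]

print(task_counters)              # final counter seen inside each task
print(driver_closure['counter'])  # the driver's copy is untouched
print(sum(task_counters))         # returning values recovers the total
```

The per-task counters end at 3, 12 and 30, matching the last line printed for each partition, while the driver still sees 0.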


In [19]:
rdd = sc.parallelize(range(10))
# Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the
# driver program is allowed to read its value, through the value attribute. Updates from the workers are
# propagated automatically to the driver program.
accum = sc.accumulator(0)

def g(x):
    global accum
    accum += x

a = rdd.foreach(g)

print(accum.value)

45


In [103]:
rdd = sc.parallelize(range(10))
accum = sc.accumulator(0)

def g(x):
    global accum
    accum += x
    return x * x

a = rdd.map(g)
print(accum.value)  # 0, since map is a transformation, not an action
# print(a.reduce(lambda x, y: x+y))
#a.cache()
tmp = a.count()
print(accum.value)  # 45: count is an action, so the accumulator has now been updated
print(rdd.reduce(lambda x, y: x+y))  # 45

tmp = a.count()  # without cache(), Spark computes rdd.map(g) a second time
print(accum.value)  # 90
print(rdd.reduce(lambda x, y: x+y))  # 45


0
45
45
90
45
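
The double counting comes from lazy re-evaluation, which a toy model makes visible. `LazyMap` below is a hypothetical stand-in for a mapped RDD, and a dictionary plays the role of the accumulator; it is a sketch of the idea, not of Spark's internals.

```python
class LazyMap:
    """Toy stand-in for a mapped RDD: nothing runs until count() is called."""

    def __init__(self, data, func):
        self.data = data
        self.func = func
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True
        return self

    def count(self):
        if self._cached is not None:
            return len(self._cached)          # reuse the stored result
        result = [self.func(x) for x in self.data]  # side effects run here
        if self._use_cache:
            self._cached = result
        return len(result)

def make_g(totals, key):
    def g(x):
        totals[key] += x  # side effect, playing the role of accum += x
        return x * x
    return g

totals = {'uncached': 0, 'cached': 0}

a = LazyMap(range(10), make_g(totals, 'uncached'))
a.count()
a.count()  # the map function runs a second time

b = LazyMap(range(10), make_g(totals, 'cached')).cache()
b.count()
b.count()  # served from the stored result

print(totals)
```

Caching hides the problem in this model, but in Spark a cached partition can still be evicted and recomputed, so it is not a guarantee.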


In [104]:
n = 100
rdd = sc.parallelize(range(1000000*n) , n)
accum = sc.accumulator(0)

def g(x):
    global accum
    accum += x
    return x

a = rdd.map(g)
tmp = a.count()
print(accum.value)


# correct answer: 4999999950000000
# Spark will only update the accumulator from the successful task,and the failed tasks are completely ignored.
# So the accumulator is computed correctly even if some tasks fail



4999999950000000


                                                                                

In [3]:
n = 100
rdd1 = sc.parallelize(range(1000000*n) , n)
rdd2 = sc.parallelize(range(1000000*n) , n)
rdd = rdd1.zip(rdd2)
accum = sc.accumulator(0)

def g(x):
    global accum
    accum += x[0]
    return x

a = rdd.map(g)
b = a.reduceByKey(lambda x, y: x + y)  # this causes a shuffle; reduceByKey needs a function of two values, which the built-in sum is not
tmp = b.count()
print(accum.value)

# Because shuffle output is stored locally, if a node goes down, that shuffle output is gone. 
# Spark goes back to the stage that generated the shuffle output, looks at which tasks need
# to be rerun, and re-executes them on one of the nodes that is still alive.
# This results in the accumulator being an over-count

# Summary: It's OK to use accumulators in an action (e.g., foreach) but not in a transformation
# Or better: Avoid using them at all.



4999999950000000


                                                                                

### Closure and Persistence

In [10]:
# Linear-time selection (quickselect)
# Find the element with exactly k smaller elements, i.e. the (k+1)-th smallest
data = [34, 67, 21, 56, 47, 89, 12, 44, 74, 43, 26]
A = sc.parallelize(data,2)
k = 4

while True:
    x = A.first()
    A1 = A.filter(lambda z: z < x)
    A2 = A.filter(lambda z: z > x)
    #A1.cache()  # option 1: cache these; otherwise the filters are re-evaluated lazily after x has changed (filter is a transformation, not an action)
    #A2.cache()
    mid = A1.count()
    if mid == k:
        print(x)
        break
    if k < mid:
        A = A1
    else:
        A = A2
        k = k - mid - 1
    A.cache()  # option 2: cache only A

43


In [6]:
sorted(data)

[12, 21, 26, 34, 43, 44, 47, 56, 67, 74, 89]
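
For comparison, the same selection loop can be written in plain Python, where lists are evaluated eagerly and no caching question arises. This is a sketch that assumes distinct values, as in `data`; `quickselect` is a hypothetical helper name.

```python
def quickselect(values, k):
    """Return the element that has exactly k smaller elements (distinct values assumed)."""
    items = list(values)
    while True:
        pivot = items[0]
        smaller = [z for z in items if z < pivot]
        larger = [z for z in items if z > pivot]
        if len(smaller) == k:
            return pivot
        if k < len(smaller):
            items = smaller
        else:
            items = larger
            k = k - len(smaller) - 1

data = [34, 67, 21, 56, 47, 89, 12, 44, 74, 43, 26]
result = quickselect(data, 4)
print(result, sorted(data)[4])
```

The result agrees with index 4 of the sorted list, which confirms that k counts the number of smaller elements.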

In [11]:
#similar to the one in the quiz
A = sc.parallelize(range(10)) 

x = 5
B = A.filter(lambda z: z < x)
#B.cache()  # if B is cached, A is evaluated only once
print(B.count())
x = 3
print(B.count())  # a transformation only sets up the pipeline; an action pushes the data through it, so the current x (3) is used

5
3
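
Part of this behaviour is ordinary Python: a lambda looks up free variables when it is called, not when it is defined. The sketch below shows that rule in isolation, without Spark; binding the value through a default argument is one common way to freeze it.

```python
def late_binding_demo():
    values = list(range(10))

    x = 5
    late = lambda z: z < x                # x is looked up at call time
    early = lambda z, bound=x: z < bound  # 5 is frozen at definition time

    before = sum(1 for z in values if late(z))
    x = 3
    after = sum(1 for z in values if late(z))
    frozen = sum(1 for z in values if early(z))
    return before, after, frozen

counts = late_binding_demo()
print(counts)
```

The late-bound predicate follows the change of `x` (5 matches, then 3), while the default-argument version keeps counting against 5.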


In [13]:
A = sc.parallelize(range(10))

x = 5
B = A.filter(lambda z: z < x)
B.cache()  # the result is stored the first time B is computed (with x fixed at its value at that time, 5)
# In the Scala version of Spark, this always returns [0, 1, 2], whether or not B is cached.
# In PySpark, however, cache() appears to resolve all variables used in the transformation (filter in this example)
# when the RDD is cached, rather than when an action triggers the transformation.
# This inconsistency between the Python and Scala versions of Spark does not appear to be documented.
# If anyone can find more information, please share it with us.
print(B.take(10))
#print(B.collect())

x = 3
print(B.take(10))
print(B.collect())
# B.collect() doesn't always re-collect data - bad design!
# Always use take() instead of collect()

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]


In [20]:
# From the official spark examples.
from random import random
from operator import add

partitions = 100
n = 100000 * partitions
# Sample points uniformly from the square [-1, 1) x [-1, 1) and test whether they fall inside the unit circle
def f(_: int) -> float:
    x = random() * 2 - 1  # random() returns a float in [0, 1)
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
# Note: the random generator starts from the same state in every partition
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)  # count/n approximates pi/4
print("Pi is roughly %f" % (4.0 * count / n))



Pi is roughly 3.140680


                                                                                

In [18]:
from random import random
# Seeding the generator in the driver program has no effect on the workers:
# all the computation runs on the workers, and the driver's generator state is not updated by it.
a = sc.parallelize(range(0,10),2)
print(a.map(lambda _: random()).glom().collect())

# Why are the random numbers the same in all partitions? Every worker starts from the same
# random generator state, so each one produces the same sequence.

[[0.744948514497711, 0.8954111195391162, 0.2218497023630468, 0.022007612366036433, 0.8910718215566176], [0.744948514497711, 0.8954111195391162, 0.2218497023630468, 0.022007612366036433, 0.8910718215566176]]


In [22]:
random()

0.8954111195391162
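
The effect can be reproduced with the standard `random` module alone. In this sketch, two simulated workers start from one shared generator state (an assumption modelling what the output above suggests), and a second variant seeds each worker with its index; the helper names and seed values are hypothetical.

```python
import random

def worker_numbers(state, count=3):
    # Each simulated worker starts from the same inherited generator state.
    rng = random.Random()
    rng.setstate(state)
    return [rng.random() for _ in range(count)]

def seeded_numbers(index, base_seed, count=3):
    # Each simulated worker derives its own seed from its partition index.
    rng = random.Random(base_seed + index)
    return [rng.random() for _ in range(count)]

inherited_state = random.Random(12345).getstate()
same = [worker_numbers(inherited_state) for _ in range(2)]
different = [seeded_numbers(i, 2024) for i in range(2)]

print(same[0] == same[1])            # identical sequences
print(different[0] == different[1])  # distinct sequences
```

Identical starting states give identical sequences, so a per-partition seed is needed whenever the partitions must draw independent numbers.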

### mapPartitions and mapPartitionsWithIndex

In [23]:
# mapPartitions: Return a new RDD by applying a function to each partition of this RDD.
# map: one element in, one element out | mapPartitions: one partition in, one partition out
def f(pa):  # pa is an iterator over the elements of one partition
    l = []
    for x in pa:
        l.append(x[::-1])
    return l
        
fruitsReversed = fruits.mapPartitions(f)
print(fruitsReversed.collect())

['elppa', 'ananab', 'nolem yranac', 'parg', 'nomel', 'egnaro', 'elppaenip', 'yrrebwarts']


In [25]:
# It's more efficient to use yield
# pa is an iterator over the elements of the partition
def f(pa):
    for x in pa:
        yield x[::-1]  # yield turns f into a generator: each reversed element is handed to the caller as soon as it is produced,
        # so the whole partition never has to be collected into an intermediate list.

fruitsReversed = fruits.mapPartitions(f)
print(fruitsReversed.collect())

#  It provides a facility to do heavy initializations (for example Database connection) once for each partition
# instead of doing it on every element in the RDD.

['elppa', 'ananab', 'nolem yranac', 'parg', 'nomel', 'egnaro', 'elppaenip', 'yrrebwarts']
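
The difference between returning a list and yielding can be observed by logging which input elements are actually read. The sketch below uses plain Python iterators; `tracked`, `reverse_lazy` and `reverse_eager` are hypothetical helpers introduced only for this illustration.

```python
from itertools import islice

def tracked(items, log):
    # Record every element that is pulled from the input.
    for item in items:
        log.append(item)
        yield item

def reverse_lazy(pa):
    for x in pa:
        yield x[::-1]

def reverse_eager(pa):
    return [x[::-1] for x in pa]

partition = ['apple', 'banana', 'canary melon', 'grap', 'lemon']
lazy_log, eager_log = [], []

# Ask each version for its first two results only.
first_lazy = list(islice(reverse_lazy(tracked(partition, lazy_log)), 2))
first_eager = list(islice(reverse_eager(tracked(partition, eager_log)), 2))

print(first_lazy, len(lazy_log))
print(first_eager, len(eager_log))
```

Both versions return the same two strings, but the generator reads only two input elements while the list version reads all five.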


In [26]:
# Can also do some transformation on the partition level

def f(pa):
    return sorted(pa, reverse = True)

print(fruits.glom().collect())
print(fruits.mapPartitions(f).collect())#sort on each partition

[['apple', 'banana', 'canary melon', 'grap', 'lemon'], ['orange', 'pineapple', 'strawberry']]
['lemon', 'grap', 'canary melon', 'banana', 'apple', 'strawberry', 'pineapple', 'orange']


In [28]:
# mapPartitionsWithIndex
#i is the partition index
def f(i, pa):# do partition specific things
    for x in pa:
        yield str(i) + x

fruitsIndexed = fruits.mapPartitionsWithIndex(f)
print(fruitsIndexed.collect())

# These two functions will be useful for advanced algorithm design (will see later)

['0apple', '0banana', '0canary melon', '0grap', '0lemon', '1orange', '1pineapple', '1strawberry']


In [33]:
# Correct version for computing Pi
from random import random, seed
from time import time

partitions = 1000
n = 100000 * partitions

s = time()

def f(index, it):
    seed(index + s)  # a different seed in each partition, so each partition generates different random numbers
    for i in it:
        x = random() * 2 - 1
        y = random() * 2 - 1
        yield 1 if x ** 2 + y ** 2 <= 1 else 0

count = sc.parallelize(range(1, n + 1), partitions).mapPartitionsWithIndex(f).sum()

print("Pi is roughly", 4.0 * count / n)



Pi is roughly 3.14170008




In [34]:
# Check that random numbers are different
from random import random, seed
from time import time

s = time()

def f(index, it):
    seed(index + s)
    for i in it:
        yield random()

print(sc.parallelize(range(10), 2).mapPartitionsWithIndex(f).glom().collect())

[[0.1103785573901771, 0.8930514571020657, 0.9095138951659408, 0.48617438764010323, 0.7785035676844477], [0.178013983301885, 0.8953354693764833, 0.2642794179289347, 0.14804273449930028, 0.1521417802028162]]


### Key-Value Pairs

In [35]:
# groupByKey works only on RDDs of (key, value) tuples
groupFruitsByLength = fruits.map(lambda fruit: (len(fruit), fruit)).groupByKey()
print(groupFruitsByLength.take(10))
for x in groupFruitsByLength.take(1)[0][1]:
    print(x)  # prints the fruits in the first group (length 6 here)

[(6, <pyspark.resultiterable.ResultIterable object at 0x7fb755b6a6a0>), (12, <pyspark.resultiterable.ResultIterable object at 0x7fb755b47310>), (4, <pyspark.resultiterable.ResultIterable object at 0x7fb755b47d60>), (10, <pyspark.resultiterable.ResultIterable object at 0x7fb755ca69d0>), (5, <pyspark.resultiterable.ResultIterable object at 0x7fb755ca6370>), (9, <pyspark.resultiterable.ResultIterable object at 0x7fb755ca62b0>)]
banana
orange


In [36]:
# count the number of fruits by length: mimics MapReduce
# mapValues: applies a function to the value of each (key, value) pair, leaving the key unchanged
print(fruits.map(lambda fruit: (len(fruit), 1)).groupByKey().mapValues(sum).collect())

[(6, 2), (12, 1), (4, 1), (10, 1), (5, 2), (9, 1)]


In [37]:
# reduceByKey: this is more efficient
# reduceByKey computes local sums for each key in each partition and combines those local sums
# into larger sums after shuffling (local combining happens before the shuffle).

numFruitsByLength = fruits.map(lambda fruit: (len(fruit), 1)).reduceByKey(lambda x, y: x + y)
print(numFruitsByLength.take(10))

# aggregateByKey and foldByKey are also available,
# but there is no treeAggregateByKey (the per-key aggregation happens within the RDD and does not involve the driver).

[(6, 2), (12, 1), (4, 1), (10, 1), (5, 2), (9, 1)]
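
The saving from local combining can be counted on the fruit data itself. The following plain-Python sketch models the shuffle as a list of records; both helper functions are hypothetical, and the partition layout is the one printed by `glom()` earlier.

```python
from collections import Counter

def group_then_sum(partitions):
    # groupByKey style: every (key, 1) record crosses the shuffle.
    shuffled = [(len(fruit), 1) for part in partitions for fruit in part]
    totals = Counter()
    for key, one in shuffled:
        totals[key] += one
    return dict(totals), len(shuffled)

def combine_then_sum(partitions):
    # reduceByKey style: one partial sum per key per partition crosses the shuffle.
    partials = [Counter(len(fruit) for fruit in part) for part in partitions]
    shuffled = [(key, n) for partial in partials for key, n in partial.items()]
    totals = Counter()
    for key, n in shuffled:
        totals[key] += n
    return dict(totals), len(shuffled)

fruit_partitions = [['apple', 'banana', 'canary melon', 'grap', 'lemon'],
                    ['orange', 'pineapple', 'strawberry']]

grouped, grouped_records = group_then_sum(fruit_partitions)
combined, combined_records = combine_then_sum(fruit_partitions)

print(grouped_records, combined_records)
```

On eight fruits the difference is only one record, but it grows with the number of repeated keys per partition.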


In [15]:
from operator import add

lines = sc.textFile('course.txt')
counts = lines.flatMap(lambda x: x.split()) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add).filter(lambda x: x[1]>=3)
print(counts.take(20))

[('of', 3), ('data', 4), ('and', 3)]


In [43]:
# collectAsMap: returns the key-value pairs to the driver as a dictionary
print(counts.collectAsMap()['data'])  # looks up the count for one key; not efficient,
# because the whole RDD is returned to the driver

4


In [13]:
print(counts.sortBy(lambda x: x[1], False).take(1)[0])

('data', 4)


In [42]:
counts.lookup('data')
# This scans the whole RDD, unless there is a partitioner (to be discussed later)

[4]

### Join vs. Broadcast Variables

In [None]:
# Join simple example

products = sc.parallelize([(1, "Apple"),(1,"apple"), (2, "Orange"), (3, "TV"), (5, "Computer")])
#trans = sc.parallelize([(1, 134, "OK"), (3, 34, "OK"), (5, 162, "Error"), (1, 135, "OK"), (2, 53, "OK"), (1, 45, "OK")])
trans = sc.parallelize([(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")), (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))])

print(products.join(trans).take(20))  # join on the key (1, 2, 3, ...); each result is nested: (key, (value from 1st RDD, value from 2nd RDD))

In [None]:
products = {1: "Apple", 2: "Orange", 3: "TV", 4: "PC", 5: "Computer"}
trans = sc.parallelize([(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")), (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))])

broadcasted_products = sc.broadcast(products)  # e.g. broadcasted_products.value[1] == "Apple"
# a broadcast variable is shipped to every executor, so all tasks can read the shared value
results = trans.map(lambda x: (x[0], broadcasted_products.value[x[0]], x[1]))  # (key, product name, transaction)
#  results = trans.map(lambda x: (x[0], products[x[0]], x[1]))
print(results.take(20))


In [None]:
# Compare with cogroup (more efficient)

products = sc.parallelize([(1, "Apple"),(1,"apple"), (2, "Orange"), (3, "TV"), (4, "PC"), (5, "Computer")])
trans = sc.parallelize([(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")), (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))])

for x,y in products.cogroup(trans).collect():
    print(x, tuple(map(list, y)))
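
The relationship between `cogroup` and `join` can be sketched in plain Python: cogroup collects the values of both sides per key, and an inner join is the per-key cross product of those two lists. The helpers below are hypothetical and operate on ordinary lists of pairs, not on RDDs.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right):
    # For every key, collect the values from the left and from the right side.
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)

def join(left, right):
    # Inner join: per-key cross product of the two cogrouped value lists.
    return [(key, pair)
            for key, (lefts, rights) in cogroup(left, right).items()
            for pair in product(lefts, rights)]

products_local = [(1, "Apple"), (1, "apple"), (2, "Orange"), (3, "TV"), (4, "PC"), (5, "Computer")]
trans_local = [(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")),
               (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))]

grouped = cogroup(products_local, trans_local)
joined = join(products_local, trans_local)

print(grouped[1])
print(grouped[4])
print(len(joined))
```

Key 4 keeps its product with an empty transaction list in the cogroup result, whereas it disappears from the inner join.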

### K-means clustering

In [3]:
import numpy as np

def parseVector(line):
    return np.array([float(x) for x in line.split()])

def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/kmeans_data.txt
lines = sc.textFile('kmeans_data.txt', 5)  


# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/kmeans_bigdata.txt
# lines = sc.textFile('../data/kmeans_bigdata.txt', 5)  
# lines is an RDD of strings
#print(lines.collect())
K = 3
convergeDist = 0.01  # floating-point values should not be tested for exact equality, so compare against a tolerance instead
# terminate the algorithm when the total distance from old centers to new centers is less than this value

data = lines.map(parseVector).cache()  # data is an RDD of arrays; cache it because it is reused in every iteration

# takeSample(withReplacement: bool, num: int, seed: Optional[int] = None)
# Return a fixed-size sampled subset of this RDD.
kCenters = data.takeSample(False, K, 1)  # initial centers as a list of arrays; the last argument (1) is the sampling seed
tempDist = 1.0  # total distance from old centers to new centers

while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kCenters), (p, 1)))
    # for each point in data, find its closest center
    # closest is an RDD of tuples (index of closest center, (point, 1))
    #print("nothing",closest.collect())
        
    pointStats = closest.reduceByKey(lambda p1, p2: (p1[0] + p2[0], p1[1] + p2[1]))  # sum of points and sum of counts
    
    # pointStats is an RDD of tuples (index of center,
    # (array of sums of coordinates (NumPy adds element-wise), total number of points assigned))
    # aggregateByKey would be more efficient
    #print("test",pointStats.collect())
    
    newCenters = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    # compute the new centers
    
    tempDist = sum(np.sum((kCenters[i] - p) ** 2) for (i, p) in newCenters)
    # compute the total distance from old centers to new centers
    
    for (i, p) in newCenters:
        kCenters[i] = p
        
print("Final centers: ", kCenters)
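
One iteration of the loop above can be traced by hand on a tiny data set. The sketch below is a plain-Python version that uses tuples instead of NumPy arrays; the four points and two starting centers are made up for illustration, and `closest` and `kmeans_step` are hypothetical helpers.

```python
def closest(p, centers):
    # Index of the center with the smallest squared distance to p.
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

def kmeans_step(points, centers):
    # Accumulate (sum of coordinates, count) per center, like the reduceByKey step.
    stats = {}
    for p in points:
        i = closest(p, centers)
        total, count = stats.get(i, ((0.0,) * len(p), 0))
        stats[i] = (tuple(a + b for a, b in zip(total, p)), count + 1)
    # Centers that received no points keep their old position.
    new_centers = list(centers)
    for i, (total, count) in stats.items():
        new_centers[i] = tuple(a / count for a in total)
    return new_centers

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]

new_centers = kmeans_step(points, centers)
print(new_centers)
```

Each center moves to the mean of the points assigned to it; repeating the step until the centers stop moving gives the loop shown above.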



### PageRank

In [5]:
import re
from operator import add

def computeContribs(urls, rank):
    # Calculates URL contributions to the rank of other URLs.
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)

def parseNeighbors(urls):
    # Parses a urls pair string into a urls pair.
    parts = urls.split(' ')
    return parts[0], parts[1]

# Loads in input file. It should be in format of:
#     URL         neighbor URL
#     URL         neighbor URL
#     URL         neighbor URL
#     ...

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/*
lines = sc.textFile("pagerank_data.txt", 2)
# lines = sc.textFile("../data/dblp.in", 5)

numOfIterations = 10

# Loads all URLs from input file and initialize their neighbors. 
links = lines.map(lambda urls: parseNeighbors(urls)) \
             .groupByKey()
links.take(5)

[('1', <pyspark.resultiterable.ResultIterable at 0x7fe253cbf250>),
 ('4', <pyspark.resultiterable.ResultIterable at 0x7fe253cbffa0>),
 ('2', <pyspark.resultiterable.ResultIterable at 0x7fe253ce1400>),
 ('3', <pyspark.resultiterable.ResultIterable at 0x7fe253ce1430>)]

In [12]:
import re
from operator import add

def computeContribs(urls, rank):
    #each destination in urls equally share the page rank
    # Calculates URL contributions to the rank of other URLs.
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)  # each yield emits one (url, contribution) pair into the resulting sequence

def parseNeighbors(urls):
    # Parses a urls pair string into a urls pair.
    parts = urls.split(' ')
    return parts[0], parts[1]

# Loads in input file. It should be in format of:
#     URL(source)         neighbor URL(destination)
#     URL         neighbor URL
#     URL         neighbor URL
#     ...

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/*
lines = sc.textFile("pagerank_data.txt", 2)
# lines = sc.textFile("../data/dblp.in", 5)

numOfIterations = 10

# Loads all URLs from input file and initialize their neighbors. 
links = lines.map(lambda urls: parseNeighbors(urls)) \
             .groupByKey()

# Loads all URLs with other URL(s) link to from input file 
# and initialize ranks of them to one.
ranks = links.mapValues(lambda neighbors: 1.0)  # mapValues transforms only the value of each (key, value) pair

# Calculates and updates URL ranks continuously using PageRank algorithm.
for iteration in range(numOfIterations):
    # Calculates URL contributions to the rank of other URLs.
    contribs = links.join(ranks)  \
                    .flatMap(lambda url_urls_rank:
                             computeContribs(url_urls_rank[1][0],
                                             url_urls_rank[1][1]))
    # After the join, each element in the RDD is of the form
    # (url, (list of neighbor urls, rank))
    # join: Return an RDD containing all pairs of elements with matching keys in self and other.
    
    # contribs has the form [(u1, 1/3), (u2, 1/3), ...]; key: destination url, value: contribution to that destination
    # Re-calculates URL ranks based on neighbor contributions.
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)
    # ranks = contribs.reduceByKey(add).map(lambda t: (t[0], t[1] * 0.85 + 0.15))

print(ranks.top(5, lambda x: x[1]))  # the only action; the for loop merely builds the lineage graph


[('1', 1.2981882732854677), ('4', 0.9999999999999998), ('3', 0.9999999999999998), ('2', 0.7018117267145316)]
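
The update rule can be followed step by step in plain Python. The sketch below mirrors the join, flatMap and reduceByKey steps with dictionaries; the three-page graph is a made-up example (not the contents of `pagerank_data.txt`), and `pagerank` is a hypothetical helper.

```python
def pagerank(edges, iterations=10, damping=0.85):
    # links: source url -> list of destination urls (the groupByKey step).
    links = {}
    for src, dst in edges:
        links.setdefault(src, []).append(dst)

    ranks = {src: 1.0 for src in links}
    for _ in range(iterations):
        contribs = {}
        for src, neighbors in links.items():
            if src not in ranks:  # the join keeps only urls that have a rank
                continue
            share = ranks[src] / len(neighbors)
            for dst in neighbors:
                contribs[dst] = contribs.get(dst, 0.0) + share
        # Same formula as rank * 0.85 + 0.15 in the cell above.
        ranks = {url: damping * total + (1 - damping)
                 for url, total in contribs.items()}
    return ranks

# Hypothetical toy graph: every page has at least one in-link and one out-link.
toy_edges = [('1', '2'), ('1', '3'), ('2', '3'), ('3', '1')]
toy_ranks = pagerank(toy_edges)

print(sorted(toy_ranks.items(), key=lambda kv: -kv[1]))
```

Because every page in this toy graph has both incoming and outgoing links, the ranks keep summing to the number of pages, and page 3 always outranks page 2 since it receives everything page 2 receives plus page 2's own rank.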
