#### **This is the first version of PageRank on MapReduce.**

The Internet is stored as a big matrix ***M*** (size n × n): the column-normalized adjacency matrix,
where each column represents a webpage and its non-zero entries mark the pages it links to.
Break M into k vertical stripes M = [M1 M2 . . . Mk] so that each Mj fits on one machine. Initialize ***q***, the PageRank vector, with every entry equal to 1/n, where n is the number of pages.
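* ***Mapper:***  for stripe j, emit for each row r: key = r ∈ [n] ; value = (row r of Mj) ∗ qj, where qj holds the entries of ***q*** for the columns in Mj.
* ***Reducer:***  adds the values for each key r to get (M ∗ q)[r], then outputs the updated rank q[r] ← β ∗ (M ∗ q)[r] + (1 − β)/n.

Typically β = 0.85.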
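
The mapper and reducer described above can be sketched without Spark to check the arithmetic. This is only an illustration in plain Python: the function name `pagerank_step_striped` and the way columns are grouped into `k` contiguous stripes are assumptions made here, not part of the notebook.

```python
def pagerank_step_striped(M, q, k, beta=0.85):
    """One PageRank update computed stripe by stripe (plain-Python sketch)."""
    n = len(M)
    # Split the column indices 0..n-1 into k contiguous vertical stripes.
    bounds = [(s * n) // k for s in range(k + 1)]
    stripes = [range(bounds[s], bounds[s + 1]) for s in range(k)]

    # "Map": every stripe emits (row index, partial dot product) pairs.
    emitted = []
    for cols in stripes:
        for r in range(n):
            partial = sum(M[r][c] * q[c] for c in cols)
            emitted.append((r, partial))

    # "Reduce": add the partial products per row, then apply the damping term.
    totals = [0.0] * n
    for r, value in emitted:
        totals[r] += value
    return [beta * t + (1 - beta) / n for t in totals]


# Same example matrix and uniform start vector as in the cells below.
M = [[0, 0.5, 0, 0],
     [0.3, 0, 1, 0.5],
     [0.3, 0, 0, 0.5],
     [0.3, 0.5, 0, 0]]
q = [0.25] * 4
q_next = pagerank_step_striped(M, q, k=2)
print(q_next)
```

Because the reducer only adds partial products that share a row index, the result does not depend on how many stripes are used, up to floating-point rounding.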

In [1]:
import numpy as np
from pyspark import SparkContext

In [2]:
#Set up the Spark context
sc = SparkContext()


In [3]:
#Take matrix M from the example in the PDF file
#Note: the first column uses 0.3 (presumably a rounded 1/3), so it sums to 0.9 rather than 1
M=[[0, 0.5, 0, 0],
   [0.3, 0, 1, 0.5],
   [0.3, 0, 0, 0.5],
   [0.3, 0.5, 0, 0]]

n=len(M)

#Beta is usually set to 0.85
beta=0.85


In [4]:
#Convert M to an RDD with one element per column of M (each column is one vertical stripe)
M_rdd = sc.parallelize(np.transpose(M))
N=M_rdd.count()

#Initialize q uniformly: every page starts with rank 1/N
q = [1/N] * N

print(M_rdd.collect())
print(q)

[array([0. , 0.3, 0.3, 0.3]), array([0.5, 0. , 0. , 0.5]), array([0., 1., 0., 0.]), array([0. , 0.5, 0.5, 0. ])]
[0.25, 0.25, 0.25, 0.25]



In [5]:
def mapper(column, j, q):
    #column is column j of M; emit (row number, M[i][j]*q[j]) for every row i
    m=[]
    for i in range(len(column)):
        m.append(((i+1),column[i]*q[j]))
    return m

In [6]:
#Pair each column with its index j, then map it to its partial products
M_mapped=M_rdd.zipWithIndex().flatMap(lambda x: mapper(x[0],x[1],q))

#Reduce the mapped output by adding the values for each row
M_reduced = M_mapped.reduceByKey(lambda x, y: x + y)

#Update the values of q using the map-reduce output
q_updated = M_reduced.map(lambda row: (row[0], row[1] * beta + (1 - beta) / n))

#Return the updated values of q
q_updated.collect()


[(2, 0.42000000000000004),
 (4, 0.20750000000000002),
 (1, 0.14375),
 (3, 0.20750000000000002)]

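Here the updated q is returned as a list of tuples: the first value is the number of a specific node (one of n), and the second value is the probability of being at that node. The tuples are not sorted by node number. The values sum to 0.97875 rather than 1 because the first column of M uses 0.3 instead of 1/3.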
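
The cell above performs a single update of q. A minimal sketch of repeating that update until q stops changing is given below in plain Python, without Spark. The names `pagerank_step` and `pagerank`, the tolerance `tol`, the cap `max_iter`, and the matrix `M_exact` (the example matrix with 1/3 in place of 0.3, so every column sums to 1) are assumptions made for illustration.

```python
def pagerank_step(M, q, beta=0.85):
    """One update: q[r] <- beta * (M q)[r] + (1 - beta) / n."""
    n = len(M)
    return [beta * sum(M[r][c] * q[c] for c in range(n)) + (1 - beta) / n
            for r in range(n)]


def pagerank(M, beta=0.85, tol=1e-10, max_iter=1000):
    """Repeat the update until the L1 change in q drops below tol."""
    n = len(M)
    q = [1 / n] * n  # uniform start, as in the notebook
    for _ in range(max_iter):
        q_new = pagerank_step(M, q, beta)
        if sum(abs(a - b) for a, b in zip(q_new, q)) < tol:
            return q_new
        q = q_new
    return q


# Assumed variant of the example matrix with an exactly normalized first column.
M_exact = [[0, 0.5, 0, 0],
           [1 / 3, 0, 1, 0.5],
           [1 / 3, 0, 0, 0.5],
           [1 / 3, 0.5, 0, 0]]
ranks = pagerank(M_exact)
for node, rank in enumerate(ranks, start=1):
    print(node, rank)
```

With exactly normalized columns each update keeps the entries of q summing to 1, so the result can be read directly as a probability vector.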

In [7]:
#Stop the Spark context
sc.stop()