# MDA Final Project - Group33

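Because my computer could not run enough iterations when computing the PPR score with PySpark, I redid the computation with NetworkX to obtain acceptable results. First, import the required packages.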

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import random

## 1. Load the dataset

Read the dataset with pandas.

In [5]:
df = pd.read_csv('CA-GrQc.txt', sep='\t', header=None) # open the dataset
df.columns = ['node_1', 'node_2']                 # assign the headers of each column
df['node_1'] = df['node_1'].astype(int)           # change data type
df['node_2'] = df['node_2'].astype(int)

df=df.convert_dtypes()
df.head()

| | node_1 | node_2 |
|---|---|---|
| 0 | 3466 | 937 |
| 1 | 3466 | 5233 |
| 2 | 3466 | 8579 |
| 3 | 3466 | 10310 |
| 4 | 3466 | 15931 |


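We do not want self-edges, so they must be excluded. An edge that appears in both directions, $(i, j)$ and $(j, i)$, is also kept only once.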

In [7]:
# Build a list of edges. Each edge is stored as a tuple.
edge_tuple_list = []
seen = set()  # undirected edges already added, stored as (smaller, larger)
for u, v in zip(df.node_1, df.node_2):
    if u == v:  # skip self-edges
        continue
    key = (min(u, v), max(u, v))  # (i, j) and (j, i) are the same undirected edge
    if key in seen:  # drop (j, i) if (i, j) is already in the list
        continue
    seen.add(key)
    edge_tuple_list.append((u, v))

NetworkX can build a graph directly from a list of edges.

In [8]:
G = nx.Graph() # construct the graph
G.add_edges_from(edge_tuple_list)

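## 2. Define the required data structures
To store the values $d_u, q_u, r_u$ for each vertex, I use a class to represent a vertex. The class also defines the first step of the push operation, namely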
$$r_u+(1-\beta)q_u\to r_u$$
$$\frac{1}{2}\beta q_u\to q_u$$

In [28]:
beta = 0.8

In [44]:
class vertex():
    def __init__(self, num, neighbors, du, qu, ru):
        self.num = num
        self.neighbors = neighbors
        self.du = du
        self.qu = qu
        self.ru = ru
    def push_step1(self):
        self.ru = self.ru + (1-beta)*self.qu
        self.qu = 0.5*beta*self.qu
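
To make the push step concrete, the following sketch applies it to a single vertex and then hands the remaining half of $\beta q_u$ to its neighbors. The helper `push`, the dictionaries `q`, `r`, and `adj`, and the three-vertex toy graph are assumptions for illustration and are not part of the notebook. The point is only that one push leaves the total $\sum_u (r_u + q_u)$ unchanged.

```python
beta = 0.8  # same value as in the notebook

def push(u, q, r, adj):
    """Apply one lazy-random-walk push at vertex u (illustrative helper)."""
    qu = q[u]                      # residual of u before the push
    du = len(adj[u])               # degree of u
    r[u] += (1 - beta) * qu        # r_u + (1 - beta) q_u -> r_u
    q[u] = 0.5 * beta * qu         # (1/2) beta q_u -> q_u
    for v in adj[u]:               # the other half of beta * q_u goes to the neighbors
        q[v] += 0.5 * beta * qu / du

# Hypothetical toy graph: vertex 0 is linked to vertices 1 and 2.
adj = {0: [1, 2], 1: [0], 2: [0]}
q = {0: 1.0, 1: 0.0, 2: 0.0}       # all residual mass starts on the seed
r = {0: 0.0, 1: 0.0, 2: 0.0}

push(0, q, r, adj)
total = sum(q.values()) + sum(r.values())
```

Note that the neighbor update reads `qu`, the residual saved before `q[u]` is overwritten. Using the already-halved value instead would make the total shrink with every push.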

Create all the vertex objects.

In [45]:
vertex_list = [n for n in G] # get the list of all vertices in the network
vertex_class = []
for i in vertex_list:
    if i == 9572: # seed node
        qu = 1
    else:
        qu = 0
    v = vertex(i, [n for n in G.neighbors(i)], G.degree[i], qu, 0) # call the constructor
    vertex_class.append(v)

## 3. Implementing the approximate PPR algorithm

The loop condition is $\max_{u\in V}\frac{q_u}{d_u}\geq\epsilon$, so first compute $q_u/d_u$ for every vertex.

In [46]:
epsilon = 1e-7
ratio_list = [v.qu/v.du for v in vertex_class]  # qu/du for all vertices

Next comes the main body of the algorithm. Since MapReduce is not used here, the code can follow the pseudocode in the lecture notes exactly.

In [47]:
while max(ratio_list) >= epsilon: # loop condition
    # Find all the vertices with qu/du >= epsilon
    over = [v for v in vertex_class if v.qu/v.du >= epsilon]

    push_vertex = random.choice(over) # randomly choose a vertex whose qu/du >= epsilon
    old_qu = push_vertex.qu           # keep q_u from before the push; the neighbor update needs it
    push_vertex.push_step1()          # assign new values of ru and qu (lazy random walk)

    for w in vertex_class:
        if w.num in push_vertex.neighbors: # w is a neighbor of the pushed vertex
            w.qu += 0.5*beta*old_qu/push_vertex.du # update q_v for every neighbor v

    ratio_list = [v.qu/v.du for v in vertex_class] # update the ratio list qu/du
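
As a quick sanity check of the procedure, the sketch below runs the same push loop on a small hypothetical path graph using plain dictionaries. The function `approximate_ppr`, its parameters, and the toy graph are assumptions made for illustration. Unlike the notebook, it also fixes the random seed so that the run is repeatable. When the loop ends, every vertex satisfies $q_u/d_u < \epsilon$, and the scores plus the leftover residuals still sum to 1.

```python
import random

def approximate_ppr(adj, seed, beta=0.8, epsilon=1e-4, rng=None):
    """Approximate PPR by repeated pushes (illustrative sketch, not the notebook's code)."""
    rng = rng or random.Random(0)          # fixed seed: an assumption, for repeatability
    q = {u: 0.0 for u in adj}              # residuals q_u
    r = {u: 0.0 for u in adj}              # PPR score estimates r_u
    q[seed] = 1.0
    while True:
        over = [u for u in adj if q[u] / len(adj[u]) >= epsilon]
        if not over:                       # max_u q_u/d_u < epsilon: stop
            break
        u = rng.choice(over)
        qu, du = q[u], len(adj[u])
        r[u] += (1 - beta) * qu
        q[u] = 0.5 * beta * qu
        for v in adj[u]:
            q[v] += 0.5 * beta * qu / du   # uses q_u from before the push
    return r, q

# Hypothetical path graph 0 - 1 - 2 - 3, seeded at vertex 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
r, q = approximate_ppr(adj, seed=0)
```

On this toy graph the seed receives the largest score and the scores fall off with distance from it, which is the qualitative behavior expected of a personalized ranking.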

Save every vertex and its PPR score $r_u$ to output.txt.

In [49]:
output = [[v.num, v.ru] for v in vertex_class]

In [50]:
with open('output.txt', 'w') as f:
    for item in output:
        f.write("%s\t%s\n" % (str(item[0]), str(item[1])))
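
Once the scores are written, a natural follow-up is to read the file back and list the highest-scoring vertices. The sketch below assumes the tab-separated `vertex<TAB>score` layout produced above. The helper name `top_k_scores`, the demo file name, and the sample values are hypothetical.

```python
import os
import tempfile

def top_k_scores(path, k=3):
    """Return the k (vertex, score) pairs with the largest score (illustrative helper)."""
    rows = []
    with open(path) as f:
        for line in f:
            num, score = line.rstrip("\n").split("\t")
            rows.append((int(num), float(score)))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]

# Demo on made-up values written in the same format as output.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_output.txt")
with open(demo_path, "w") as f:
    for num, score in [(1, 0.05), (2, 0.45), (3, 0.15), (4, 0.35)]:
        f.write("%s\t%s\n" % (str(num), str(score)))

top = top_k_scores(demo_path, k=2)
```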