# Week 4
# PageRank Examples

In class, we saw an example network and formulated the following system of equations for the ranks $y$, $a$, and $m$ of its three pages:

$y = y/2 + a/2$

$a = y/2 + m$

$m = a/2$

The usual method for solving a system of linear equations is Gaussian elimination, but its cost grows as $O(n^3)$ for an $n \times n$ matrix, which is too expensive when the matrix is large. As an alternative, let's consider the power iteration method.
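For a system this small, the exact answer is easy to check directly. The sketch below is not part of the class example: it writes the three equations as $(M - I)r = 0$, adds the assumption that the ranks sum to 1, and solves the stacked system with `np.linalg.lstsq`. The names `A`, `b`, and `r_exact` are introduced here only for illustration.

```python
import numpy as np

# Coefficient matrix of the three rank equations (one row per equation).
M = np.array([[1/2, 1/2, 0],
              [1/2, 0, 1],
              [0, 1/2, 0]])

# Stack (M - I) r = 0 with the normalization y + a + m = 1 (an added assumption;
# without it the system only fixes r up to a scale factor).
A = np.vstack([M - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])

# Least squares recovers the exact solution because the stacked system is consistent.
r_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r_exact)
```

Solving by hand gives the same result: the first equation implies $y = a$, the third gives $m = a/2$, and the normalization then yields $y = a = 0.4$, $m = 0.2$.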

In [1]:
import numpy as np

In [2]:
# Explore the power iteration method.

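# Construct the PageRank matrix: row i holds the coefficients of equation i,
# and column j holds the shares of page j's rank passed to each page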
M = np.array([[1/2, 1/2, 0],
              [1/2, 0, 1],
              [0, 1/2, 0]])

In [3]:
# Initialize vector r
r = np.array([[1/3],
             [1/3],
             [1/3]])

In [4]:
# Iterate once
r = M.dot(r)
print(r)

[[0.33333333]
 [0.5       ]
 [0.16666667]]


In [5]:
# Iterate 100 times, printing every 10th iterate
for i in range(100):
    r = M.dot(r)
    if i % 10 == 0:
        print(r)

[[0.41666667]
 [0.33333333]
 [0.25      ]]
[[0.40323893]
 [0.39152018]
 [0.20524089]]
[[0.40038904]
 [0.39898149]
 [0.20062947]]
[[0.40004673]
 [0.39987767]
 [0.20007561]]
[[0.40000561]
 [0.39998531]
 [0.20000908]]
[[0.40000067]
 [0.39999824]
 [0.20000109]]
[[0.40000008]
 [0.39999979]
 [0.20000013]]
[[0.40000001]
 [0.39999997]
 [0.20000002]]
[[0.4]
 [0.4]
 [0.2]]
[[0.4]
 [0.4]
 [0.2]]


In [6]:
# Verify that r = M * r
print(r)
print(M.dot(r))

[[0.4]
 [0.4]
 [0.2]]
[[0.4]
 [0.4]
 [0.2]]
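The cells above run a fixed number of iterations. As a sketch of an alternative, the loop can stop as soon as successive iterates stop changing. The function name `power_iteration`, the L1 stopping rule, and the values of `tol` and `max_iter` below are assumptions chosen for illustration, not part of the class example.

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iter=1000):
    """Repeat r <- M r until the L1 change between iterates falls below tol."""
    n = M.shape[0]
    r = np.full((n, 1), 1 / n)          # uniform starting vector
    for k in range(1, max_iter + 1):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next, k            # converged after k iterations
        r = r_next
    return r, max_iter                  # gave up without meeting tol

M = np.array([[1/2, 1/2, 0],
              [1/2, 0, 1],
              [0, 1/2, 0]])

r_conv, n_iters = power_iteration(M)
print(r_conv.ravel(), n_iters)
```

A tolerance-based stop avoids guessing the iteration count in advance, at the cost of one extra vector comparison per step.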


## Issues with the Power Iteration

In [7]:
# Spider trap: the third page links only to itself, so it gradually absorbs all the rank
M = np.array([[1/2, 1/2, 0],
              [1/2, 0, 0],
              [0, 1/2, 1]])

# Initialize vector r
r = np.array([[1/3],
             [1/3],
             [1/3]])

# Power iteration
for i in range(40):
    r = M.dot(r)
    if i % 4 == 0:
        print(r)

[[0.33333333]
 [0.16666667]
 [0.5       ]]
[[0.13541667]
 [0.08333333]
 [0.78125   ]]
[[0.05794271]
 [0.03580729]
 [0.90625   ]]
[[0.02482096]
 [0.01534017]
 [0.95983887]]
[[0.01063283]
 [0.00657145]
 [0.98279572]]
[[0.00455491]
 [0.00281509]
 [0.99263   ]]
[[0.00195124]
 [0.00120593]
 [0.99684283]]
[[8.35873807e-04]
 [5.16598423e-04]
 [9.98647528e-01]]
[[3.58072769e-04]
 [2.21301142e-04]
 [9.99420626e-01]]
[[1.53391704e-04]
 [9.48012870e-05]
 [9.99751807e-01]]
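A common remedy for spider traps is random teleportation: at each step the surfer follows a link with probability $\beta$ and jumps to a uniformly random page with probability $1 - \beta$. The sketch below applies this to the spider-trap matrix. The function name `pagerank_with_teleport` and the value $\beta = 0.85$ are assumptions made for illustration.

```python
import numpy as np

def pagerank_with_teleport(M, beta=0.85, num_iters=100):
    """Power iteration on r <- beta * M r + (1 - beta) / n."""
    n = M.shape[0]
    r = np.full((n, 1), 1 / n)
    for _ in range(num_iters):
        # Follow a link with probability beta, otherwise jump to a random page.
        r = beta * (M @ r) + (1 - beta) / n
    return r

# Same spider-trap matrix as in the cell above.
M_trap = np.array([[1/2, 1/2, 0],
                   [1/2, 0, 0],
                   [0, 1/2, 1]])

r_trap = pagerank_with_teleport(M_trap)
print(r_trap)
```

With teleportation the trap page still ranks highest, but the other two pages keep a non-negligible share instead of decaying to zero.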


In [8]:
# Dead end: the third page has no out-links (its column is all zeros), so rank leaks out of the system
M = np.array([[1/2, 1/2, 0],
              [1/2, 0, 0],
              [0, 1/2, 0]])

# Initialize vector r
r = np.array([[1/3],
             [1/3],
             [1/3]])

# Power iteration
for i in range(40):
    r = M.dot(r)
    if i % 4 == 0:
        print(r)

[[0.33333333]
 [0.16666667]
 [0.16666667]]
[[0.13541667]
 [0.08333333]
 [0.05208333]]
[[0.05794271]
 [0.03580729]
 [0.02213542]]
[[0.02482096]
 [0.01534017]
 [0.00948079]]
[[0.01063283]
 [0.00657145]
 [0.00406138]]
[[0.00455491]
 [0.00281509]
 [0.00173982]]
[[0.00195124]
 [0.00120593]
 [0.00074531]]
[[0.00083587]
 [0.0005166 ]
 [0.00031928]]
[[0.00035807]
 [0.0002213 ]
 [0.00013677]]
[[1.53391704e-04]
 [9.48012870e-05]
 [5.85904175e-05]]
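In the dead-end run, the column of the third page sums to zero, so the total rank shrinks at every step. One possible fix, sketched below, is to treat a dead end as if it linked to every page with equal probability. The helper name `patch_dead_ends` is hypothetical, and this is only one of several ways to handle dead ends.

```python
import numpy as np

def patch_dead_ends(M):
    """Return a copy of M in which every all-zero column is replaced by 1/n."""
    M = M.astype(float).copy()
    n = M.shape[0]
    dead = np.isclose(M.sum(axis=0), 0.0)   # columns with no out-links
    M[:, dead] = 1 / n
    return M

# Same dead-end matrix as in the cell above.
M_dead = np.array([[1/2, 1/2, 0],
                   [1/2, 0, 0],
                   [0, 1/2, 0]])

M_fixed = patch_dead_ends(M_dead)

r_fixed = np.full((3, 1), 1 / 3)
for _ in range(100):
    r_fixed = M_fixed @ r_fixed

print(M_fixed.sum(axis=0))   # every column now sums to 1
print(r_fixed.sum())         # total rank is preserved
```

Because every column of the patched matrix sums to 1, the iteration no longer loses rank from one step to the next.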


## Calculate PageRank with NetworkX

For this Colab we will be using [NetworkX](https://networkx.github.io), a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

*Reference:* [Stanford CS246](https://web.stanford.edu/class/cs246/)

In [9]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



The dataset we will analyze is a snapshot of the Web Graph centered around [stanford.edu](https://stanford.edu), collected in 2002. Nodes represent pages from Stanford University (stanford.edu) and directed edges represent hyperlinks between them. [[More Info]](http://snap.stanford.edu/data/web-Stanford.html)

In [10]:
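# Google Drive file ID of the dataset (named file_id to avoid shadowing the built-in id)
file_id = '1EoolSK32_U74I4FeLox88iuUB_SUUYsI'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('web-Stanford.txt')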

In [11]:
# This notebook uses nx.info and nx.pagerank_scipy, which were removed in NetworkX 3.0,
# so pin a 2.x release
!pip install --upgrade networkx==2.8.8



In [12]:
import networkx as nx

G = nx.read_edgelist('web-Stanford.txt', create_using=nx.DiGraph)

In [13]:
list(G.edges)[:10]

[('1', '6548'),
 ('1', '15409'),
 ('6548', '57031'),
 ('15409', '13102'),
 ('57031', '6548'),
 ('57031', '59749'),
 ('13102', '15409'),
 ('13102', '19974'),
 ('2', '17794'),
 ('2', '25202')]
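The node IDs above are strings because `read_edgelist` keeps labels as text unless told otherwise. As a small sketch, the `nodetype` parameter converts them while parsing. The example uses `nx.parse_edgelist` on a few edges copied from the output above so that it runs without the dataset file. The comment line in the sample is an assumed header.

```python
import networkx as nx

# A few edges from the output above; the first line is an assumed comment header.
sample_lines = [
    "# FromNodeId ToNodeId",
    "1 6548",
    "1 15409",
    "6548 57031",
]

# nodetype=int converts each label from str to int while parsing.
G_sample = nx.parse_edgelist(
    sample_lines, comments="#", nodetype=int, create_using=nx.DiGraph
)
print(list(G_sample.edges))
```

`nx.read_edgelist` accepts the same `nodetype` argument, so integer IDs can be requested when loading the full file.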

In [14]:
# Extract the largest weakly connected component
largest_cc = G.subgraph(max(nx.weakly_connected_components(G), key=len))
print(nx.info(largest_cc))


DiGraph with 255265 nodes and 2234572 edges
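To make the component extraction concrete, here is a minimal sketch on a toy graph. The edges are invented for illustration. A weakly connected component ignores edge direction, so nodes 1, 2, and 3 form one component and nodes 4 and 5 form another.

```python
import networkx as nx

# Toy directed graph with two weakly connected components (illustrative data).
G_toy = nx.DiGraph([(1, 2), (3, 2), (4, 5)])

# Pick the component with the most nodes and take the induced subgraph.
largest_nodes = max(nx.weakly_connected_components(G_toy), key=len)
largest_toy = G_toy.subgraph(largest_nodes)

print(sorted(largest_toy.nodes))
print(largest_toy.number_of_nodes(), largest_toy.number_of_edges())
```

`number_of_nodes()` and `number_of_edges()` report the same counts as the summary line above.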


Compute the PageRank vector using the default parameters of NetworkX's [`pagerank`](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html#networkx.algorithms.link_analysis.pagerank_alg.pagerank) function:

In [15]:
import time
start = time.time()
pr = nx.pagerank(largest_cc)
end = time.time()
print("Time cost:", (end - start))

Time cost: 14.698321342468262
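The defaults used in the call above are `alpha=0.85`, `max_iter=100`, and `tol=1e-06`. The sketch below passes them explicitly on the three-page example from the first section, rebuilt as a directed graph. The node labels `"y"`, `"a"`, and `"m"` are assumptions chosen to match the equation names.

```python
import networkx as nx

# Edges read off the columns of the first matrix M:
# y links to y and a, a links to y and m, m links to a.
G_small = nx.DiGraph([("y", "y"), ("y", "a"),
                      ("a", "y"), ("a", "m"),
                      ("m", "a")])

# Same call as above, with the default parameters written out.
pr_small = nx.pagerank(G_small, alpha=0.85, max_iter=100, tol=1e-06)
print(pr_small)
```

Because `alpha` is below 1, these scores include teleportation and will not exactly match the values obtained from plain power iteration.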


In [16]:
# Top 10 pages
top_pages = sorted(pr.items(), key=lambda item: item[1], reverse=True)[:10]
for page in top_pages:
    print(page)

('89073', 0.011051585882434985)
('226411', 0.010694113250567905)
('241454', 0.009829260884719991)
('134832', 0.00650923773721211)
('69358', 0.003753708143672675)
('67756', 0.003543473943866138)
('105607', 0.0032305919516859047)
('225872', 0.0031736850016296342)
('234704', 0.0031708863624340614)
('186750', 0.00314345200380852)


In [17]:
nx.__version__

'2.8.8'

In [18]:
# pagerank_scipy is deprecated in NetworkX 2.x and removed in 3.0
start = time.time()
pr_scipy = nx.pagerank_scipy(largest_cc)
end = time.time()
print("Time cost:", (end - start))

Time cost: 11.925999641418457


In [19]:
# Top 10 pages
top_pages_scipy = sorted(pr_scipy.items(), key=lambda item: item[1], reverse=True)[:10]
for page in top_pages_scipy:
    print(page)

('89073', 0.011051585882434985)
('226411', 0.010694113250567905)
('241454', 0.009829260884719991)
('134832', 0.00650923773721211)
('69358', 0.003753708143672675)
('67756', 0.003543473943866138)
('105607', 0.0032305919516859047)
('225872', 0.0031736850016296342)
('234704', 0.0031708863624340614)
('186750', 0.00314345200380852)


In [20]:
start = time.time()
# pr_numpy = nx.pagerank_numpy(largest_cc)
pr_numpy = nx.pagerank(largest_cc)
end = time.time()
print("Time cost:", (end - start))

Time cost: 12.528838157653809
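The timing cells above repeat the same start/end pattern. As a sketch, it can be wrapped in a small helper. The name `timed` is hypothetical, and `time.perf_counter` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be e.g. timed(nx.pagerank, largest_cc).
total, seconds = timed(sum, range(1_000_000))
print("Time cost:", seconds)
```

A single run is a rough measurement; timings such as the ones above can vary from run to run.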
