# Page Rank exercise

## Introduction

For this exercise we use the `hollins.dat` file provided with the project.
The first line of the file contains two numbers: the number of websites (6012, indexed from 1 to 6012) and the number of successor/predecessor relations (23875).
The next 6012 lines each give a website's index followed by its link.
The last 23875 lines each describe one relation: column 1 holds the index of the predecessor and column 2 the index of the successor.

## Pre-process data

First, we load the `hollins.dat` file and build a dictionary that maps the index of each website to its link.
We retrieve the following data from the file:
- The size of the stochastic matrix M.
- The number of relations.
- The index and link of each website, used to create a Python dictionary.
- The relations between websites (predecessors/successors).

In [374]:
# Load file (the context manager closes it automatically)
filepath = "hollins.dat"
with open(filepath, "r") as f:
    # Get size of matrix and total number of relations
    matrix_size, total_relations = f.readline().strip().split(" ")
    matrix_size = int(matrix_size)
    total_relations = int(total_relations)

    # Create dictionary from website index to website link
    dictionary_index_link = {}
    for _ in range(matrix_size):
        line = f.readline().strip().split(' ')
        dictionary_index_link[line[0]] = line[1]

    # Store all relations in a list of [predecessor, successor] pairs
    relations_tab = []
    for _ in range(total_relations):
        relations_tab.append(f.readline().strip().split(' '))

# Print infos
print("Size of Matrix (number of website) : {}".format(matrix_size))
print("Total number of relations : {}".format(total_relations))

Size of Matrix (number of website) : 6012
Total number of relations : 23875


## Page rank simple resolution

First, we compute PageRank in its simplest form, without teleportation and without handling dead ends.

To do so, we first need a function that returns the number of successors of a website.

The second part of the `hollins.dat` file is organised as follows:

| left column | right column |
|-------------|--------------|
| 3           | 5            |
| 1           | 2            |
| 8           | 199          |

For example, the first row means that 5 is a successor of 3, and 3 is a predecessor of 5. The other rows read the same way.

So, to count the number of successors of a website, we count how many times its index appears in the left column.

In [375]:
def get_successors_number(relations_tab, index):
    return sum(i[0] == index for i in relations_tab)


print(f"The website with index 1 has {get_successors_number(relations_tab, '1')} successors")

The website with index 1 has 24 successors
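
Counting successors by scanning the whole relation list once per website gets slow when the function is called many times. As a minimal sketch, assuming relations are stored as `[predecessor, successor]` pairs of strings like in `relations_tab`, a single pass with `collections.Counter` gives the successor count of every website at once. The name `count_successors` and the toy data are hypothetical and only serve the illustration.

```python
from collections import Counter


def count_successors(relations):
    """Map each predecessor index to its number of outgoing links (one pass)."""
    return Counter(relation[0] for relation in relations)


# Toy relations in the same [predecessor, successor] format as relations_tab
toy_relations = [["3", "5"], ["1", "2"], ["1", "3"], ["8", "199"]]
out_degree = count_successors(toy_relations)

print(out_degree["1"])  # website 1 appears twice in the left column
print(out_degree["5"])  # a Counter returns 0 for a website with no successors
```

A lookup in the resulting counter is then a constant-time operation, which matters when the count is needed for each of the 23875 relations.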


Now we can create our M matrix:

In [376]:
import numpy as np


def create_matrix_M(matrix_size, relations_tab):
    # Create 2D array sized (matrix_size*matrix_size) full of 0
    M = [[0 for _ in range(matrix_size)] for _ in range(matrix_size)]

    # Add the relations in the matrix
    for i in relations_tab:
        nbr_successors = get_successors_number(relations_tab, i[0])
        if nbr_successors != 0:
            website, successor = int(i[0]), int(i[1])
            M[successor - 1][website - 1] = 1 / nbr_successors

    # Convert to numpy array
    return np.array(M)


M = create_matrix_M(matrix_size, relations_tab)
print(M)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.04166667 0.         0.         ... 0.         0.         0.        ]
 [0.04166667 0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
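
To see what M looks like on a graph small enough to check by hand, here is a sketch on a hypothetical 3-page graph. It assumes integer indices starting at 1 and no duplicate links; `build_transition_matrix` is an illustrative name, not part of the exercise code. Each column holds the outgoing probabilities of one page, so a column with at least one link should sum to 1.

```python
from collections import Counter

import numpy as np


def build_transition_matrix(n, relations):
    """Build the n x n column-stochastic matrix from (predecessor, successor) pairs."""
    out_degree = Counter(src for src, _ in relations)
    M = np.zeros((n, n))
    for src, dst in relations:
        # Column = source page, row = destination page (indices start at 1)
        M[dst - 1, src - 1] = 1 / out_degree[src]
    return M


# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
toy_relations = [(1, 2), (1, 3), (2, 3), (3, 1)]
M_toy = build_transition_matrix(3, toy_relations)

print(M_toy)
print(M_toy.sum(axis=0))  # every page has a successor, so each column sums to 1
```

The same column-sum check can be applied to the full matrix: columns that sum to 0 correspond to websites without successors.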


The vector r0:

In [377]:
def create_vector_r0(matrix_size):
    return np.array([1 / matrix_size for _ in range(matrix_size)]).transpose()


r0 = create_vector_r0(matrix_size)
print(r0)

[0.00016633 0.00016633 0.00016633 ... 0.00016633 0.00016633 0.00016633]


Matrix method resolution:

In [378]:
def calculate_page_rank(M, r0, epsilon):
    num_iteration = 0
    do_loop = True
    rk1 = np.dot(M, r0)
    print(f"iteration r{num_iteration} = " + np.array2string(rk1, precision=2, separator=',', suppress_small=True))
    while do_loop:
        num_iteration += 1
        rk0 = rk1
        rk1 = np.dot(M, rk1)
        print(f"Iteration r{num_iteration} = " + np.array2string(rk1, precision=2, separator=',', suppress_small=True))
        do_loop = not (np.linalg.norm((rk1 - rk0), ord=1) < epsilon)
    return rk1


epsilon = 0.1
pagerank_result = calculate_page_rank(M, r0, epsilon)

iteration r0 = [0.  ,0.03,0.  ,...,0.  ,0.  ,0.  ]
Iteration r1 = [0.  ,0.01,0.  ,...,0.  ,0.  ,0.  ]
Iteration r2 = [0.  ,0.01,0.  ,...,0.  ,0.  ,0.  ]
Iteration r3 = [0.  ,0.01,0.  ,...,0.  ,0.  ,0.  ]


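To check the result, we sum all the PageRank values. The sum equals 0.999, which is very close to one, so the computation looks right. The small gap may come from floating-point rounding; dead ends can also contribute, because a website without successors has an all-zero column in M and lets part of the rank leak out at each iteration.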
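
The sum check is sensitive to dead ends. The following sketch uses a hypothetical 3-page chain in which page 3 has no successor, and records the sum of the rank vector after each multiplication. The matrix `M_dead` is an assumption made for the illustration and is unrelated to the Hollins data.

```python
import numpy as np

# Hypothetical chain: 1 -> 2, 2 -> 3, and page 3 is a dead end (zero column)
M_dead = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

r = np.full(3, 1 / 3)
sums = []
for _ in range(3):
    r = M_dead @ r
    sums.append(r.sum())

print(sums)  # the total rank shrinks at every step: about 2/3, then 1/3, then 0
```

On a column-stochastic matrix without dead ends the sum would stay at 1, so a sum noticeably below 1 is a hint that rank is leaking.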

## Spider-trap and teleport

To prevent the spider-trap issue, we now implement teleportation.
To do so, we create two functions: one that builds the T matrix, and another that performs the teleport operation to produce our new M matrix.

In [379]:
def create_matrix_T(matrix_size):
    return np.array([[1 / matrix_size for _ in range(matrix_size)] for _ in range(matrix_size)])


T = create_matrix_T(matrix_size)
# Sanity check: each column of T should sum to 1
print(T.sum(axis=0))

[1. 1. 1. ... 1. 1. 1.]


Resolution with teleport:

In [380]:
def teleport_operation(M, T, beta):
    return M * beta + T * (1 - beta)

beta = 0.8
M = create_matrix_M(matrix_size, relations_tab)
M = teleport_operation(M, T, beta)
pagerank_result = calculate_page_rank(M, r0, epsilon)

iteration r0 = [0.  ,0.02,0.  ,...,0.  ,0.  ,0.  ]
Iteration r1 = [0.  ,0.01,0.  ,...,0.  ,0.  ,0.  ]
Iteration r2 = [0.  ,0.01,0.  ,...,0.  ,0.  ,0.  ]
Iteration r3 = [0.  ,0.01,0.  ,...,0.  ,0.  ,0.  ]
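
Building a dense T matrix costs memory proportional to the square of the number of websites. Since every entry of T equals 1/n, the product T·r is the constant vector sum(r)/n, so one teleport iteration can be written without T. This is a sketch under that observation; `teleport_step` and the toy matrix are hypothetical names used only here.

```python
import numpy as np


def teleport_step(M, r, beta):
    """One iteration of r <- (beta * M + (1 - beta) * T) r without building T."""
    n = M.shape[0]
    # T r is the constant vector sum(r) / n, added to every component by broadcasting
    return beta * (M @ r) + (1 - beta) * r.sum() / n


# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
M_toy = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])
T_toy = np.full((3, 3), 1 / 3)
beta_toy = 0.8
r_toy = np.array([0.5, 0.3, 0.2])

dense_result = (beta_toy * M_toy + (1 - beta_toy) * T_toy) @ r_toy
light_result = teleport_step(M_toy, r_toy, beta_toy)

print(dense_result)
print(light_result)  # same vector as the dense computation
```

Both computations give the same vector, which makes the second form a drop-in replacement inside an iteration loop when memory is a concern.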


## Dead-ends resolution

To prevent the dead-end issue, we remove every website that has no successors, since those are the ones that cause dead ends. This requires rebuilding the dictionary and the complete website/successor relation array.

In [386]:
# Build the list of website indices that have at least one successor (not dead ends)
website_to_keep_list = []
for i in range(1, matrix_size + 1):
    if get_successors_number(relations_tab, str(i)) != 0:
        website_to_keep_list.append(i)

# Rebuild the dictionary, keeping only the websites in website_to_keep_list
# and assigning them new consecutive indices starting at 0
dictionary_index_link_new = {}
new_index = 0
with open(filepath, "r") as f:
    f.readline()  # Skip the header line
    for _ in range(matrix_size):
        index, website = f.readline().strip().split(' ')
        if int(index) in website_to_keep_list:
            dictionary_index_link_new[new_index] = website
            new_index += 1

# Update the matrix size
matrix_size = len(website_to_keep_list)

print(len(website_to_keep_list))
print(len(dictionary_index_link_new))


605
605
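
Removing websites without successors can turn some of the remaining websites into new dead ends, because their only links may have pointed to the removed ones. A common way to handle this is to repeat the removal until nothing changes. The sketch below illustrates that idea on a hypothetical 4-page graph; `remove_dead_ends` is an illustrative helper that assumes integer indices and `(predecessor, successor)` tuples, and it is not the method used in the cell above.

```python
def remove_dead_ends(nodes, relations):
    """Repeatedly drop nodes without successors, along with the links touching them."""
    nodes = set(nodes)
    relations = list(relations)
    while True:
        # Keep only the links whose two endpoints are still present
        relations = [(s, d) for s, d in relations if s in nodes and d in nodes]
        sources = {s for s, _ in relations}
        dead_ends = nodes - sources
        if not dead_ends:
            return nodes, relations
        nodes -= dead_ends


# Hypothetical graph: 1 -> 2, 2 -> 1, 2 -> 3, 3 -> 4, and page 4 is a dead end
toy_nodes = [1, 2, 3, 4]
toy_relations = [(1, 2), (2, 1), (2, 3), (3, 4)]

kept_nodes, kept_relations = remove_dead_ends(toy_nodes, toy_relations)
print(kept_nodes)      # removing 4 makes 3 a dead end, so only 1 and 2 remain
print(kept_relations)
```

After this step the remaining websites would still need to be re-indexed consecutively, and the relations translated to the new indices, before a new M matrix can be built.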
