# PageRank exercise

## Introduction

For this exercise we use the `hollins.dat` file provided with the project.
The first line of the file contains two numbers: the number of websites (6012), indexed from 1 to 6012, and the number of successor/predecessor relations (23875).
The next 6012 lines list each website's index and its link.
The last 23875 lines describe the relations: column 1 holds the index of the predecessor and column 2 the index of its successor.

## Pre-process data

First we load the `hollins.dat` file and build a dictionary that maps each website's index to its link.
We retrieve the following data from the file:
- The size of the stochastic matrix M.
- The number of relations.
- The index and link of each website, used to create a Python dictionary.
- The relations between websites (predecessors/successors).

In [379]:
# Choose the dataset to load. The outputs shown in this notebook come from
# the small dead-end test dataset.
# filepath = "hollins.dat"
filepath = "myDataset-deadends.txt"
# filepath = "myDataset-spidertrap.txt"

# The with-block closes the file automatically, even if parsing fails
with open(filepath, "r") as f:
    # Get size of matrix and total number of relations
    matrix_size, total_relations = f.readline().strip().split(" ")
    matrix_size = int(matrix_size)
    total_relations = int(total_relations)

    # Create dictionary from website index (as a string) to website link
    dictionary_index_link = {}
    for _ in range(matrix_size):
        line = f.readline().strip().split(' ')
        dictionary_index_link[line[0]] = line[1]

    # Store all relations as [predecessor, successor] pairs of strings
    relations_tab = []
    for _ in range(total_relations):
        relations_tab.append(f.readline().strip().split(' '))

# Print infos
print("Size of Matrix (number of website) : {}".format(matrix_size))
print("Total number of relations : {}".format(total_relations))
print(relations_tab)

Size of Matrix (number of website) : 5
Total number of relations : 7
[['1', '2'], ['2', '3'], ['3', '4'], ['3', '5'], ['5', '1'], ['5', '2'], ['5', '4']]


## Simple PageRank resolution

First we compute PageRank in its simplest form, without teleport and without dead-end handling.

To do so, we first need a function that returns the number of successors of a website.

The second part of the `hollins.dat` file is organised as follows:

| left column (predecessor) | right column (successor) |
|---------------------------|--------------------------|
| 3                         | 5                        |
| 1                         | 2                        |
| 8                         | 199                      |

Example:
- 5 is a successor of 3.
- 3 is a predecessor of 5.
- The same reading applies to the other rows.

So, to count the successors of a website, we count how many times its index appears in the left column.

In [380]:
def get_successors_number(relations_tab, index):
    return sum(i[0] == index for i in relations_tab)


print(f"The website with index 1 has {get_successors_number(relations_tab, '1')} successors")

The website with index 1 has 1 successors
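
Because `get_successors_number` rescans the whole relation list on every call, counting can become slow on `hollins.dat` with its 23875 relations. As a minimal sketch (the names `count_successors` and `example_relations` are hypothetical and not part of the original notebook), the number of successors of every website can be computed in a single pass with `collections.Counter`:

```python
from collections import Counter


def count_successors(relations):
    """Map each website index to its number of successors in one pass."""
    # The left column of each relation is the predecessor, so counting
    # left-column occurrences gives the number of successors per website.
    return Counter(website for website, _ in relations)


# Relations of the small example dataset printed earlier in this notebook
example_relations = [['1', '2'], ['2', '3'], ['3', '4'], ['3', '5'],
                     ['5', '1'], ['5', '2'], ['5', '4']]
successor_counts = count_successors(example_relations)
print(successor_counts['1'])
print(successor_counts['4'])  # a Counter returns 0 for keys it never saw
```

A lookup in the resulting counter could then replace the repeated calls to `get_successors_number`, at the cost of building the counter once up front.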


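Now we can create our matrix M. The entry in row j, column i is 1 / (number of successors of i) if website i links to website j, and 0 otherwise: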

In [381]:
import numpy as np


def create_matrix_M(matrix_size, relations_tab):
    # Create 2D array sized (matrix_size*matrix_size) full of 0
    M = [[0 for _ in range(matrix_size)] for _ in range(matrix_size)]

    # Add the relations in the matrix
    for i in relations_tab:
        nbr_successors = get_successors_number(relations_tab, i[0])
        if nbr_successors != 0:
            website, successor = int(i[0]), int(i[1])
            M[successor - 1][website - 1] = 1 / nbr_successors
    # Convert to numpy array
    return np.array(M)


M = create_matrix_M(matrix_size, relations_tab)
print(M)

[[0.         0.         0.         0.         0.33333333]
 [1.         0.         0.         0.         0.33333333]
 [0.         1.         0.         0.         0.        ]
 [0.         0.         0.5        0.         0.33333333]
 [0.         0.         0.5        0.         0.        ]]
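
A quick sanity check on `M` is to sum each column: a column that sums to 1 belongs to a website that passes on all of its rank, while a column that sums to 0 reveals a website without successors. The sketch below copies the matrix printed above by hand (the names `M_example`, `column_sums` and `dead_ends` are illustrative only) and performs that check:

```python
import numpy as np

# The matrix printed above for the 5-website example dataset
M_example = np.array([
    [0, 0, 0,   0, 1 / 3],
    [1, 0, 0,   0, 1 / 3],
    [0, 1, 0,   0, 0],
    [0, 0, 0.5, 0, 1 / 3],
    [0, 0, 0.5, 0, 0],
])

# Summing along axis 0 gives one total per column, i.e. per website
column_sums = M_example.sum(axis=0)
print(column_sums)

# Columns summing to 0 belong to websites with no successors (dead ends);
# indexes are shifted by 1 because websites are numbered from 1
dead_ends = [int(j) + 1 for j in np.flatnonzero(np.isclose(column_sums, 0))]
print(dead_ends)
```

Here column 4 contains only zeros, so website 4 is a dead end in this dataset.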


The initial vector r0 gives every website the same rank, 1 / matrix_size:

In [382]:
def create_vector_r0(matrix_size):
    return np.array([1 / matrix_size for _ in range(matrix_size)]).transpose()


r0 = create_vector_r0(matrix_size)
print(r0)

[0.2 0.2 0.2 0.2 0.2]


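Resolution with the matrix method: we multiply the rank vector by M repeatedly until the change between two iterations falls below epsilon: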

In [383]:
def calculate_page_rank(M, r0, epsilon):
    num_iteration = 0
    do_loop = True
    rk1 = np.dot(M, r0)
    print(f"iteration r{num_iteration} = " + np.array2string(rk1, precision=2, separator=',', suppress_small=True))
    while do_loop:
        num_iteration += 1
        rk0 = rk1
        rk1 = np.dot(M, rk1)
        print(f"Iteration r{num_iteration} = " + np.array2string(rk1, precision=2, separator=',', suppress_small=True))
        do_loop = not (np.linalg.norm((rk1 - rk0), ord=1) < epsilon)
    return rk1


epsilon = 0.1
pagerank_result = calculate_page_rank(M, r0, epsilon)
print(sum(pagerank_result))

iteration r0 = [0.07,0.27,0.2 ,0.17,0.1 ]
Iteration r1 = [0.03,0.1 ,0.27,0.13,0.1 ]
Iteration r2 = [0.03,0.07,0.1 ,0.17,0.13]
Iteration r3 = [0.04,0.08,0.07,0.09,0.05]
Iteration r4 = [0.02,0.06,0.08,0.05,0.03]
Iteration r5 = [0.01,0.03,0.06,0.05,0.04]
0.1888888888888889


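To check the result, we sum all the PageRank values. When every column of M sums to 1, this sum stays equal to 1 up to floating-point rounding. In the run shown above, however, the sum is only about 0.19: the loaded dataset contains a dead end (website 4 has no successors, so column 4 of M is all zeros), and part of the rank is lost at every iteration.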
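
The loop in `calculate_page_rank` can be summarised by the update rule and stopping criterion below. This is a sketch in the notation of this notebook, with $N$ standing for `matrix_size` and $\varepsilon$ for `epsilon`; the superscript $(k)$ for the iteration count is notation introduced here.

$$
r^{(k+1)} = M\, r^{(k)}, \qquad r^{(0)} = r_0 = \left(\tfrac{1}{N}, \dots, \tfrac{1}{N}\right)^{\top}, \qquad \text{stop when } \lVert r^{(k+1)} - r^{(k)} \rVert_1 < \varepsilon
$$

The sum of the ranks after one step is

$$
\sum_i r^{(k+1)}_i = \sum_j \Big( \sum_i M_{ij} \Big)\, r^{(k)}_j ,
$$

so the total is preserved exactly when every column sum $\sum_i M_{ij}$ equals 1, and it shrinks as soon as one column sums to less than 1.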

## Spider-trap and teleport

To prevent the spider-trap issue, we implement teleport.
To do so we create two functions: one that builds the matrix T, and one that performs the teleport operation to produce our new matrix M.

In [384]:
def create_matrix_T(matrix_size):
    return np.array([[1 / matrix_size for _ in range(matrix_size)] for _ in range(matrix_size)])


T = create_matrix_T(matrix_size)

Resolution with teleport:

In [385]:
def teleport_operation(M, T, beta):
    return M * beta + T * (1 - beta)


beta = 0.8
M = create_matrix_M(matrix_size, relations_tab)
M = teleport_operation(M, T, beta)
pagerank_result = calculate_page_rank(M, r0, epsilon)

iteration r0 = [0.09,0.25,0.2 ,0.17,0.12]
Iteration r1 = [0.07,0.14,0.24,0.15,0.11]
Iteration r2 = [0.06,0.11,0.14,0.15,0.12]
Iteration r3 = [0.06,0.1 ,0.11,0.11,0.08]
Iteration r4 = [0.04,0.08,0.1 ,0.08,0.06]
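
To see what the teleport operation does to the matrix, the sketch below applies the same formula, `beta * M + (1 - beta) * T`, to the 5-website example and sums each column. The variable names are illustrative; the matrix values and the value 0.8 for beta are the ones used above.

```python
import numpy as np

beta_example = 0.8
n = 5

# M for the 5-website example dataset (column 4 contains only zeros)
M_example = np.array([
    [0, 0, 0,   0, 1 / 3],
    [1, 0, 0,   0, 1 / 3],
    [0, 1, 0,   0, 0],
    [0, 0, 0.5, 0, 1 / 3],
    [0, 0, 0.5, 0, 0],
])

# T: every entry equals 1 / n, so each column of T sums to 1
T_example = np.full((n, n), 1 / n)

# Same formula as teleport_operation
A_example = beta_example * M_example + (1 - beta_example) * T_example

# One total per column of the teleport matrix
teleport_column_sums = A_example.sum(axis=0)
print(teleport_column_sums)
```

Columns that summed to 1 still sum to 1, but the column of website 4 only sums to `1 - beta`. Teleport alone therefore does not stop rank from leaking through a dead end, which is consistent with the shrinking values in the iterations above.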


## Dead-ends resolution

To prevent the dead-end issue, we delete every website that has no successors. This requires rebuilding both the dictionary and the complete website/successor relation array.

Before creating M, T and r0, we therefore preprocess the data: we delete the dead-end websites and renumber the remaining ones.

In [386]:
# Split the website indexes into those with successors (kept) and dead ends (deleted)
website_to_keep_list = []
website_to_delete_list = []
for i in range(1, matrix_size + 1):
    if get_successors_number(relations_tab, str(i)) != 0:
        website_to_keep_list.append(i)
    else:
        website_to_delete_list.append(i)

# Create new dictionary from first dictionary
dictionary_index_link_new = {}
index = 1
for i in website_to_keep_list:
    dictionary_index_link_new[str(index)] = dictionary_index_link[str(i)]
    index += 1

# Create the new relation tab, keeping only relations between kept websites
relations_tab_new = []
for i in relations_tab:
    if int(i[0]) in website_to_keep_list and int(i[1]) in website_to_keep_list:
        # Compute the new indexes: each index shifts down by the number of
        # deleted websites that had a smaller index
        sum_below_index_website = sum(index_deleted < int(i[0]) for index_deleted in website_to_delete_list)
        sum_below_index_successor = sum(index_deleted < int(i[1]) for index_deleted in website_to_delete_list)
        new_website_successor_indexes = [int(i[0]) - sum_below_index_website, int(i[1]) - sum_below_index_successor]
        relations_tab_new.append(new_website_successor_indexes)

matrix_size = len(website_to_keep_list)


Now we can create our new matrix and calculate the new values without dead ends.

In [387]:
beta = 0.8
epsilon = 0.1

M = create_matrix_M(matrix_size, relations_tab_new)

T = create_matrix_T(matrix_size)

r0 = create_vector_r0(matrix_size)

M = teleport_operation(M, T, beta)

pagerank_result = calculate_page_rank(M, r0, epsilon)

iteration r0 = [0.15,0.35,0.25,0.25]
Iteration r1 = [0.15,0.27,0.33,0.25]
Iteration r2 = [0.15,0.27,0.27,0.31]
Iteration r3 = [0.18,0.3 ,0.27,0.26]
Iteration r4 = [0.16,0.3 ,0.29,0.26]
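
One caveat about the preprocessing above: it removes dead ends in a single pass, but deleting a website can leave one of its predecessors without successors, which creates a new dead end. The sketch below is one possible way to handle that, not the method used in this notebook: it repeats the removal until no dead end remains, then renumbers the surviving websites from 1. The function name `remove_dead_ends` and its return values are assumptions made for illustration.

```python
def remove_dead_ends(num_websites, relations):
    """Repeatedly drop websites without successors, then renumber from 1.

    `relations` is a list of (website, successor) pairs (ints or numeric
    strings). Returns (kept, new_relations): `kept` lists the surviving
    original indexes, `new_relations` uses the new 1-based indexes.
    """
    remaining = set(range(1, num_websites + 1))
    edges = [(int(a), int(b)) for a, b in relations]
    while True:
        # Websites that still have at least one successor
        with_successors = {a for a, _ in edges}
        dead = remaining - with_successors
        if not dead:
            break
        remaining -= dead
        # Drop every relation that touches a removed website
        edges = [(a, b) for a, b in edges if a in remaining and b in remaining]
    # Renumber survivors in increasing order of their original index
    kept = sorted(remaining)
    new_index = {old: new for new, old in enumerate(kept, start=1)}
    new_relations = [[new_index[a], new_index[b]] for a, b in edges]
    return kept, new_relations


# Relations of the small example dataset used in this notebook
example_relations = [['1', '2'], ['2', '3'], ['3', '4'], ['3', '5'],
                     ['5', '1'], ['5', '2'], ['5', '4']]
kept, new_relations = remove_dead_ends(5, example_relations)
print(kept)
print(new_relations)
```

On the example dataset this keeps the same websites as the single pass, because removing website 4 does not create a new dead end. On a chain such as 1 → 2 → 3, the repeated removal would end up deleting every website.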


In [388]:
print(dictionary_index_link['4'])
print(dictionary_index_link['5'])

print(dictionary_index_link_new['4'])

D
E
E


In [389]:
print(website_to_keep_list)

for i in range(matrix_size, 0, -1):
    print(f"{i}")

[1, 2, 3, 5]
4
3
2
1


