# PageRank exercise

## Introduction

For this exercise we use the `hollins.dat` file provided with the project.
The first line of the file holds two numbers: the number of websites (6012, indexed from 1 to 6012) and the number of successor/predecessor relations (23875).
The next 6012 lines each give a website's index and its link.
The last 23875 lines each describe one relation: column 1 is the predecessor and column 2 is the successor, both given by index.

## Pre-process data

First we load the `hollins.dat` file and build a dictionary mapping each website's index to its link.
We retrieve the following data from the file:
- The size of the stochastic matrix M.
- The number of relations.
- The index and link of each website, stored in a Python dictionary.
- The relations between websites (predecessors/successors).

In [92]:
"""
Retrieve data size from file.
"""

# Load file (a small 5-page test dataset is used here;
# switch to "hollins.dat" for the full exercise)
with open("myDataset.txt", "r") as f:
    # Get size of matrix and total number of relations
    matrix_size, total_relations = f.readline().split()
    matrix_size = int(matrix_size)
    total_relations = int(total_relations)

    # Create dictionary from website index (kept as a string) to website link.
    dictionary_index_link = {}
    for _ in range(matrix_size):
        line = f.readline().split()
        dictionary_index_link[line[0]] = line[1]

    # Store all relations as [predecessor, successor] pairs of string indices
    relations_tab = []
    for _ in range(total_relations):
        relations_tab.append(f.readline().split())

# Print infos
print("Size of Matrix (number of website) : {}".format(matrix_size))
print("Total number of relations : {}".format(total_relations))
print(dictionary_index_link)
print(relations_tab)

Size of Matrix (number of website) : 5
Total number of relations : 8
{'1': 'A', '2': 'B', '3': 'C', '4': 'D', '5': 'E'}
[['1', '2'], ['2', '3'], ['3', '4'], ['3', '5'], ['4', '1'], ['5', '1'], ['5', '2'], ['5', '4']]


## Simple PageRank resolution

First, we compute PageRank in its simplest form, without teleportation or dead-end handling.

To do so, we first need a function that returns the number of successors of a website.

The second part of the `hollins.dat` file is organised as follows:

| left column | right column |
|-------------|--------------|
| 3           | 5            |
| 1           | 2            |
| 8           | 199          |

Example:
- 5 is the successor of 3.
- 3 is the predecessor of 5.
- etc.

So, to count the successors of a website, we count how many times its index appears in the left column.

In [98]:
def get_successors_number(relations_tab, index):
    return sum(i[0] == index for i in relations_tab)


print(f"The website with index 1 has {get_successors_number(relations_tab, '1')} successors")

The website with index 1 has 1 successors
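
`get_successors_number` rescans the whole relation list on every call, which may become slow on the full `hollins.dat` file. As a possible alternative, the sketch below counts the successors of every website in a single pass with `collections.Counter`. The names `count_all_successors` and `example_relations` are illustrative assumptions, not part of the original notebook; the example data is the 5-page relation list printed earlier.

```python
from collections import Counter

# Same [predecessor, successor] layout as relations_tab,
# copied from the 5-page dataset output shown earlier.
example_relations = [['1', '2'], ['2', '3'], ['3', '4'], ['3', '5'],
                     ['4', '1'], ['5', '1'], ['5', '2'], ['5', '4']]


def count_all_successors(relations):
    """Return a Counter mapping each website index to its number of successors."""
    # A website's successor count is how often it appears in the left column.
    return Counter(source for source, _ in relations)


successor_counts = count_all_successors(example_relations)
print(successor_counts['5'])  # 3: website 5 links to 1, 2 and 4
print(successor_counts['9'])  # 0: an index never seen on the left has no successors
```

A `Counter` returns 0 for missing keys, so websites without any outgoing link need no special case.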


Now we can create our M matrix:

In [101]:
import numpy as np


def create_matrix_M(matrix_size, relations_tab):
    # Create 2D array sized (matrix_size*matrix_size) full of 0
    M = [[0 for _ in range(matrix_size)] for _ in range(matrix_size)]

    # Add the relations in the matrix
    for i in relations_tab:
        nbr_successors = get_successors_number(relations_tab, i[0])
        if nbr_successors != 0:
            website, successor = int(i[0]), int(i[1])
            M[successor-1][website-1] = 1 / nbr_successors

    # Convert to numpy array
    return np.array(M)


M = create_matrix_M(matrix_size, relations_tab)

We can inspect a few entries of M, then the whole matrix:

In [102]:
print(M[0][1])
print(M[1][0])
print(M)

0.0
1.0
[[0.         0.         0.         1.         0.33333333]
 [1.         0.         0.         0.         0.33333333]
 [0.         1.         0.         0.         0.        ]
 [0.         0.         0.5        0.         0.33333333]
 [0.         0.         0.5        0.         0.        ]]
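
A quick sanity check on M is that every column sums to 1, since column j spreads the rank of website j evenly over its successors (a column of zeros would indicate a dead end). The sketch below applies this check to the 5-page matrix printed above; `column_sums_ok` and `M_example` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Matrix copied from the printed output above (5-page example).
M_example = np.array([
    [0, 0, 0,   1, 1/3],
    [1, 0, 0,   0, 1/3],
    [0, 1, 0,   0, 0],
    [0, 0, 0.5, 0, 1/3],
    [0, 0, 0.5, 0, 0],
])


def column_sums_ok(matrix, tol=1e-9):
    """True if every column sums to 1 (regular website) or 0 (dead end)."""
    sums = matrix.sum(axis=0)
    is_one = np.isclose(sums, 1.0, atol=tol)
    is_zero = np.isclose(sums, 0.0, atol=tol)
    return bool(np.all(is_one | is_zero))


print(M_example.sum(axis=0))      # one sum per column, each close to 1
print(column_sums_ok(M_example))  # True
```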


In [96]:
import numpy as np


def calculate_pagerank(M, r0):
    # Stop when the L1 distance between two successive iterations is below epsilon
    epsilon = 0.1
    num_iteration = 1
    do_loop = True

    # First iteration: r1 = M . r0
    rk1 = np.dot(M, r0)
    result = np.array2string(rk1, precision=2, separator=',', suppress_small=True)
    print("Iteration value r{}".format(num_iteration) + " = " + result)

    while do_loop:
        num_iteration += 1
        rk0 = rk1
        rk1 = np.dot(M, rk1)
        result = np.array2string(rk1, precision=2, separator=',', suppress_small=True)
        print(
            "Iteration value r{}".format(num_iteration) + " = " + result)
        do_loop = not (np.linalg.norm((rk1 - rk0), ord=1) < epsilon)

    return rk1


def display_websites(dictionary_links, result_page_rank, matrix_size):
    # i is the website real id; the dictionary keys are strings, hence str(i)
    sorted_index_website_array = [(dictionary_links[str(i)], result_page_rank[i - 1]) for i in range(1, matrix_size + 1)]
    sorted_index_website_array.sort(key=take_pagerank_result, reverse=True)
    # Show the 10 best-ranked websites
    print(sorted_index_website_array[:10])


def take_pagerank_result(elem):
    return elem[1]


def create_spider_trap_matrix(matrix_size):
    matrix = []
    for _ in range(matrix_size):
        matrix.append([(float(1) / matrix_size) for _ in range(matrix_size)])
    return np.array(matrix)


def create_matrix_M_with_spidertrap(M, spider_matrix, beta):
    return M * beta + spider_matrix * (1 - beta)


def create_matrix_r(matrix_size):
    r0 = []
    for i in range(matrix_size):
        r0.append(float(1) / matrix_size)
    return np.array(r0).transpose()

We can now create the stochastic matrix M and the initial rank vector r0.

In [97]:
"""
Create matrix M.
"""
M = create_matrix_M(matrix_size, relations_tab)

"""
Create the initial rank vector r0
"""
r0 = create_matrix_r(matrix_size)

Calculation of the PageRank and display.

In [None]:
"""
Apply algorithm to calculate the solution.
"""
print("\nPageRank with spider trap, teleport not implemented")
pageRank = calculate_pagerank(M, r0)
display_websites(dictionary_index_link, pageRank, matrix_size)
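
The calculation performed by `calculate_pagerank` is a power iteration: starting from the uniform vector, the rank vector is repeatedly multiplied by M until the L1 distance between two successive vectors falls below a threshold. The following is a minimal, self-contained sketch of that idea on the 5-page matrix, not a replacement for the function above. The name `power_iteration`, the tighter default `epsilon`, and the `max_iterations` safety cap are assumptions added for this example.

```python
import numpy as np


def power_iteration(matrix, epsilon=1e-6, max_iterations=1000):
    """Iterate r <- matrix @ r from the uniform vector.

    Returns the last rank vector and the number of iterations performed.
    """
    size = matrix.shape[0]
    rank = np.full(size, 1.0 / size)
    for iteration in range(1, max_iterations + 1):
        new_rank = matrix @ rank
        # L1 distance between two successive iterations
        if np.linalg.norm(new_rank - rank, ord=1) < epsilon:
            return new_rank, iteration
        rank = new_rank
    # Not converged within the cap: return the last vector anyway
    return rank, max_iterations


# 5-page matrix from the example dataset (columns sum to 1).
M_example = np.array([
    [0, 0, 0,   1, 1/3],
    [1, 0, 0,   0, 1/3],
    [0, 1, 0,   0, 0],
    [0, 0, 0.5, 0, 1/3],
    [0, 0, 0.5, 0, 0],
])

ranks, iterations = power_iteration(M_example)
print(ranks.round(3), iterations)
```

Because every column of this matrix sums to 1, the entries of the rank vector keep summing to 1 at each step, so the result can be read as a probability distribution over the websites.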

## Question 2

Spider traps can be countered with teleportation, using a matrix whose entries all have the same value, where $N$ is the number of websites (the size of M):
$$\frac{1}{N}$$

In [None]:
# Question 2
"""
Build the teleport matrix (named spider_trap_matrix here).
It is combined with M in the next step.
"""
beta = 0.8
spider_trap_matrix = create_spider_trap_matrix(matrix_size)


After building `spider_trap_matrix`, we combine it with our matrix M.
The operation is:  
$$ \beta \, M + (1 - \beta) \cdot \text{spider\_matrix} $$  
$\beta$ is usually chosen between 0.8 and 0.9.  
In our exercise we choose $\beta = 0.8$.
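
To illustrate the formula, the sketch below blends a small matrix containing a spider trap with the uniform teleport matrix and checks that the result is still column-stochastic. The 3-page graph and the name `add_teleport` are hypothetical, chosen only for this example.

```python
import numpy as np


def add_teleport(matrix, beta=0.8):
    """Blend a column-stochastic matrix with the uniform 1/N teleport matrix."""
    size = matrix.shape[0]
    teleport = np.full((size, size), 1.0 / size)
    return beta * matrix + (1 - beta) * teleport


# Hypothetical 3-page graph: page 3 links only to itself (a spider trap).
trap = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 1.0],
])

blended = add_teleport(trap, beta=0.8)
print(blended.sum(axis=0))  # each column still sums to 1
print(blended.min() > 0)    # True: every page can now be reached from every page
```

Since both matrices have columns summing to 1 and the weights $\beta$ and $1 - \beta$ sum to 1, the blended matrix keeps the same property, while the trap page can no longer retain all the rank.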

In [None]:
M_plus_spider = create_matrix_M_with_spidertrap(M, spider_trap_matrix, beta)

Now that we have the matrix M with teleportation, which guards against spider traps, we can compute our PageRank.

In [None]:
pageRank = calculate_pagerank(M_plus_spider, r0)
display_websites(dictionary_index_link, pageRank, matrix_size)

## Question 3

To handle dead ends, we delete every website that has no successors or whose only successor is itself.  
To do so, we first build a list of the websites that are dead ends.

In [None]:
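# Question 3
"""
Dead ends
"""

# Split the relations into two lists of integer indices.
# Convention used in the cells below: column1 holds the successors,
# column2 holds the websites they come from.
column1 = [int(successor) for _, successor in relations_tab]
column2 = [int(website) for website, _ in relations_tab]

"""
List all the websites that have no successors or whose only successor is itself
"""
website_to_delele_list = []
for website in range(1, matrix_size + 1):
    successors = [s for s, w in zip(column1, column2) if w == website]
    if len(successors) == 0:
        website_to_delele_list.append(website)
    elif len(successors) == 1 and successors[0] == website:
        website_to_delele_list.append(website)

print("Websites that are dead ends : ")
print(website_to_delele_list)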
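
Removing a dead end can turn one of its predecessors into a new dead end, so a single pass may not be enough. The sketch below repeats the pruning until no dead end remains. It is an illustration under stated assumptions: `find_dead_ends` is a hypothetical helper, and it only treats websites with no outgoing link as dead ends (a website linking only to itself is kept here, unlike in the cell above).

```python
def find_dead_ends(relations, page_count):
    """Return the set of websites removed by repeatedly pruning dead ends.

    relations: iterable of [predecessor, successor] pairs (strings or ints).
    page_count: number of websites, indexed from 1 to page_count.
    """
    links = {(int(src), int(dst)) for src, dst in relations}
    alive = set(range(1, page_count + 1))
    removed = set()
    while True:
        # Websites that still link to at least one surviving website
        has_successor = {src for src, dst in links
                         if src in alive and dst in alive}
        dead = alive - has_successor
        if not dead:
            return removed
        removed |= dead
        alive -= dead


# Hypothetical 4-page chain: 1 <-> 2, 2 -> 3, 3 -> 4, and 4 has no successor.
chain = [['1', '2'], ['2', '1'], ['2', '3'], ['3', '4']]
print(find_dead_ends(chain, 4))  # {3, 4}: removing 4 makes 3 a dead end too
```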

To handle dead ends, we delete every reference to a website that is a dead end:
every relation line that mentions one of these websites is dropped.

In [None]:
"""
recreate the list of website/successors with only the website that have successors
"""
new_column1 = []
new_column2 = []

for successor, website in zip(column1, column2):
    if website not in website_to_delele_list and successor not in website_to_delele_list:
        new_column1.append(successor)
        new_column2.append(website)

new_matrixSize = matrix_size - len(website_to_delele_list)

Then we shift the IDs of the remaining websites and rebuild the dictionary that links each website to its new ID.

In [None]:
"""
Once some websites are deleted, we need to shift all the website ids that are greater than a deleted one.
Deleted ids are processed from highest to lowest so that earlier shifts do not invalidate the remaining ids.
"""

totalRelations = len(new_column1)

for website_to_delete in sorted(website_to_delele_list, reverse=True):
    for i in range(0, totalRelations):
        if new_column1[i] > website_to_delete:
            new_column1[i] -= 1
        if new_column2[i] > website_to_delete:
            new_column2[i] -= 1

"""
Rebuild the dictionary of website index/links with the new ids
"""

kept_websites = [i for i in range(1, matrix_size + 1) if i not in website_to_delele_list]
dictionary_links = {str(new_id): dictionary_index_link[str(old_id)]
                    for new_id, old_id in enumerate(kept_websites, start=1)}
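
An alternative to shifting ids one deletion at a time is to compute the old-to-new mapping once and apply it everywhere. The sketch below shows the idea; `build_reindex_map` is a hypothetical helper and the removed ids are made up for the example.

```python
def build_reindex_map(page_count, removed):
    """Map each surviving old id to a compact new id starting at 1."""
    kept = [page for page in range(1, page_count + 1) if page not in removed]
    return {old: new for new, old in enumerate(kept, start=1)}


# Hypothetical case: 5 websites, ids 2 and 4 were removed.
old_to_new = build_reindex_map(5, removed={2, 4})
print(old_to_new)  # {1: 1, 3: 2, 5: 3}

# Applying the mapping to a surviving relation 3 -> 5:
print(old_to_new[3], old_to_new[5])  # 2 3
```

The same dictionary can translate both the relation columns and the index/link dictionary, which avoids keeping the two shifts in sync by hand.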

Now that we have all our data, we can redo the previous steps:
- creation of matrix M
- creation of the teleport (spider) matrix
- ...

In [None]:
"""
Rebuild the relations in the [website, successor] layout expected by create_matrix_M
"""
new_relations_tab = [[str(website), str(successor)] for successor, website in zip(new_column1, new_column2)]

"""
Create the new matrix M.
"""
M = create_matrix_M(new_matrixSize, new_relations_tab)

"""
Create the new spider matrix
"""
beta = 0.8
spider_trap_matrix = create_spider_trap_matrix(new_matrixSize)

"""
Calculate the new matrix M with teleport
"""
M_plus_spider = create_matrix_M_with_spidertrap(M, spider_trap_matrix, beta)

"""
Create matrix r
"""
r0 = create_matrix_r(new_matrixSize)

"""
Apply algorithm to calculate the solution.
"""
print("\nPageRank with teleport, dead ends removed")
pageRank = calculate_pagerank(M_plus_spider, r0)
display_websites(dictionary_links, pageRank, new_matrixSize)
