# Page Rank exercise

## introduction

For this exercise we'll use the `hollins.dat` file provided with the project.
The first line of the file indicate the number of websites (6012) from 1 to 6012, the second number (23875) is the number of relations successor/predecessor.
The following 6012 are all the website link with their index.
The last 23875 are in column 1 the predecessors and in column 2 the successors of the websites represented by their index.

First thing to do we load the `hollins.dat` file and create dictionary between the index of the website to its link.

In [2]:
# Here we retrive the datas from the file. The size of the stochastic matrix M and the number of successors.

"""
Retrieve data size from file.
"""
dictionary_links = {}
f = open("hollins.dat", "r")
matrixSize, totalRelations = f.readline().split(" ")
matrixSize = int(matrixSize)
totalRelations = int(totalRelations)
print("Size of Matrix : {}".format(matrixSize))
print("Number of total relations : {}".format(totalRelations))

Size of Matrix : 6012
Number of total relations : 23875


In [1]:
import numpy as np

def calculate_pagerank(M, r0):
    epsilon = 0.1
    num_iteration = 0
    do_loop = True

    rk1 = np.dot(M, r0)
    result = np.array2string(rk1, precision=2, separator=',', suppress_small=True)
    print("L'itération a pour valeur r{}".format(num_iteration) + " = " + result)

    while do_loop:
        num_iteration += 1
        rk0 = rk1
        rk1 = np.dot(M, rk1)
        result = np.array2string(rk1, precision=2, separator=',', suppress_small=True)
        print(
            "L'itération a pour valeur r{}".format(num_iteration) + " = " + result)
        do_loop = not (np.linalg.norm((rk1 - rk0), ord=1) < epsilon)

    return rk1


def display_websites(dictionary_links, result_page_rank, matrix_size):
    # i is the website real id
    sorted_index_website_array = [(dictionary_links[i], result_page_rank[i-1]) for i in range(1, matrix_size + 1)]
    sorted_index_website_array.sort(key=take_pagerank_result, reverse=True)
    print(sorted_index_website_array[:10])



def take_pagerank_result(elem):
    return elem[1]


def create_matrix_m(matrix_size, websites_nbr_successors_list, column1, column2):
    # Create 2D array of matrix_size*matrix_size
    M = []
    for _ in range(matrix_size):
        M.append([0 for _ in range(matrix_size)])
    for successor, website in zip(column1, column2):
        if websites_nbr_successors_list[website]:
            website_nbr_successors = websites_nbr_successors_list[website]
        else:
            website_nbr_successors=0
        # Array index begins at 0 so we need to substract 1 from the real index of the website/successor
        if website_nbr_successors != 0:
            M[successor - 1][website - 1] = (float(1) / website_nbr_successors)
        else:
            M[successor - 1][website - 1] = 0
    return np.array(M)


def create_spider_trap_matrix(matrix_size):
    matrix = []
    for _ in range(matrix_size):
        matrix.append([(float(1)/matrix_size) for _ in range(matrix_size)])
    return np.array(matrix)

def create_matrix_M_with_spidertrap(M, spider_matrix, beta):
    return M * beta + spider_matrix * (1 -beta)

def create_matrix_r(matrix_size):
    r0 = []
    for i in range(matrix_size):
        r0.append(float(1) / matrix_size)
    return np.array(r0).transpose()

## Question 1 - a)



We also create a dictionary to know the relation between an id of a webstie and its name.

In [3]:
"""
Create dictionary from number to website link.
"""
for i in range(1, matrixSize + 1):
    dictionary_links[i] = f.readline().split()[1]

## Question 1 - b)

Here we create two arrays to represent the successors in the left column (column1) and the website in the right column (column2).
For example :

In the example below, the website 5 has a successor that is the website 3.

| column1 (right)  | column2 (right) |
|---|---|
| 3  | 5  |
| 1  | 2  |
| 8  | 199  |

In [4]:
"""
Create two arrays for each column of the relations.
"""
column1 = []
column2 = []
for i in range(totalRelations):
    line = f.readline().split(" ")
    column1.append(int(line[0].strip()))
    column2.append(int(line[1].strip()))

Then we create a array that indicate the number of successors a website has (websites_nbr_successors_list).

In [5]:
"""
Count the number of successors a website has
"""
websites_nbr_successors_list = {}
for website in range(1, matrixSize + 1):
    websites_nbr_successors_list[website]=column2.count(website)

We can now create our stochastic matrix with the r matrix.

In [6]:
"""
Create matrix M.
"""
M = create_matrix_m(matrixSize, websites_nbr_successors_list, column1, column2)

"""
Create matrix r
"""
r0 = create_matrix_r(matrixSize)

Calculation of the PageRank and display.

In [7]:
"""
Apply algorithm to calculate the solution.
"""
print("\nPageRank with spider trap, teleport not implemented")
pageRank = calculate_pagerank(M, r0)
display_websites(dictionary_links, pageRank, matrixSize)


PageRank with spider trap, teleport not implemented
L'itération a pour valeur r0 = [0.,0.,0.,...,0.,0.,0.]
L'itération a pour valeur r1 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r2 = [0.02,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r3 = [0.04,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r4 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r5 = [0.02,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r6 = [0.02,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r7 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r8 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r9 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
[('http://www1.hollins.edu/registrar/body.htm', 0.02596292021224375), ('http://www1.hollins.edu/', 0.012038830265826344), ('http://www1.hollins.edu/registrar/studfacinfo.htm', 0.010544061170831728), ('http://www1.hollins.edu/registrar/Maj-Min%20Years.htm', 0.008000435769581319), ('http://

## Question 2

The Spider Trap can be prevented by implementing teleport with a matrix that only has the value in each location.
$$\frac{1}{numberOfRelations}$$

In [8]:
# Question 2
"""
Implementing spider trap.
Recreating M matrix.
"""
beta = 0.8
spider_trap_matrix = create_spider_trap_matrix(matrixSize)


After calculating the spider_trap_matrix we need to do an operation between our matrix M and the spider_matrix.
The operation is :  
$$ M * \beta + spider_matrix * (1 - \beta)$$  
Beta is choosen between 0.8 and 0.9.  
In our exercice we choose \beta = 0.8

In [9]:
M_plus_spider = create_matrix_M_with_spidertrap(M, spider_trap_matrix, beta)

Now that we have calculated our matrix M with the teleportation to prevent from spider trap we can calculate our PageRank.

In [10]:
pageRank = calculate_pagerank(M_plus_spider, r0)
display_websites(dictionary_links, pageRank, matrixSize)

L'itération a pour valeur r0 = [0.,0.,0.,...,0.,0.,0.]
L'itération a pour valeur r1 = [0.,0.,0.,...,0.,0.,0.]
L'itération a pour valeur r2 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r3 = [0.02,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r4 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r5 = [0.01,0.  ,0.  ,...,0.  ,0.  ,0.  ]
[('http://www1.hollins.edu/registrar/body.htm', 0.014554678207504917), ('http://www1.hollins.edu/', 0.01274272810850791), ('http://www1.hollins.edu/registrar/Maj-Min%20Years.htm', 0.010102191669712613), ('http://www1.hollins.edu/Registrar/body.htm', 0.009370289790500058), ('http://www1.hollins.edu/registrar/studfacinfo.htm', 0.008736212935789417), ('http://www1.hollins.edu/Registrar/Maj-Min%20Years.htm', 0.007621776620837354), ('http://www1.hollins.edu/homepages/saloweyca/Roanoke%20College_files/outline.htm', 0.006899444909569466), ('http://www1.hollins.edu/Registrar/studfacinfo.htm', 0.006453309716832427), ('http:

## Question 3

In the case of a dead ends, we need to delete all websites that have no successors or its only successor is itself.  
To do so we create a list, to know which website have a dead ends.

In [11]:
# Question 3
"""
Dead ends
"""

"""
List all the website that have no successors or has only one successors that is itself
"""
website_to_delele_list = []
for website in range(1, matrixSize + 1):
    number_sucessors = column2.count(website)
    if number_sucessors==0:
        website_to_delele_list.append(website)
    elif number_sucessors==1 and column1[website -1 ]==website:
        website_to_delele_list.append(website)

print("Websites that have dead ends : ")
print(website_to_delele_list)

Websites that have dead ends : 
[1, 51]


To prevent dead ends we are deleting all reference to a website that has dead ends.
We are deleting all lines where a reference to one of these wesbite is made.

In [12]:
"""
recreate the list of website/successors with only the website that have successors
"""
new_column1 = []
new_column2 = []

for successor, website in zip(column1, column2):
    if website not in website_to_delele_list and successor not in website_to_delele_list:
        new_column1.append(successor)
        new_column2.append(website)

new_matrixSize = matrixSize-len(website_to_delele_list)

Then we shift all ID of the websites and we re-create a dictionary to re-create a link between the website and its new ID.

In [13]:
"""
Once some websites are deleted, we need to shift all the website id that are greater than the deleted one 
"""

totalRelations = len(new_column1)

for website_to_delete in website_to_delele_list:
    for i in range(0, totalRelations):
        if new_column1[i] > website_to_delete:
            new_column1[i]-=1
        if new_column2[i] > website_to_delete:
            new_column2[i]-=1

"""
Rearange the dictionary of website index/links 
"""

for website_to_delete in website_to_delele_list:
    for i in range(1, totalRelations +1):
        if i >= website_to_delete and str(i) in dictionary_links:
            dictionary_links[str(i)]=dictionary_links[str(i+1)]

Now that we have all ours data. We can redo the previous step :
- creation of matrix M
- creation of spider matrix 
- ...

In [14]:
"""
Recreate the list of number of successor a website have
"""
new_websites_nbr_successors_list = {}
for website in range(1, new_matrixSize + 1):
    new_websites_nbr_successors_list[website]=new_column2.count(website)

"""
Create the new matrix M.
"""
M = create_matrix_m(new_matrixSize, new_websites_nbr_successors_list, new_column1, new_column2)

"""
Create the new spide matrix 
"""
beta = 0.8
spider_trap_matrix = create_spider_trap_matrix(new_matrixSize)

"""
Calulate the new matrix M with teleport
"""
M_plus_spider = create_matrix_M_with_spidertrap(M, spider_trap_matrix, beta)

"""
Create matrix r
"""
r0 = create_matrix_r(new_matrixSize)

"""
Apply algorithm to calculate the solution.
"""
print("\nPageRank with spider trap, teleport not implemented")
pageRank = calculate_pagerank(M_plus_spider, r0)
display_websites(dictionary_links, pageRank, new_matrixSize)



PageRank with spider trap, teleport not implemented
L'itération a pour valeur r0 = [0.,0.,0.,...,0.,0.,0.]
L'itération a pour valeur r1 = [0.  ,0.  ,0.01,...,0.  ,0.  ,0.  ]
L'itération a pour valeur r2 = [0.,0.,0.,...,0.,0.,0.]
L'itération a pour valeur r3 = [0.,0.,0.,...,0.,0.,0.]
L'itération a pour valeur r4 = [0.,0.,0.,...,0.,0.,0.]
L'itération a pour valeur r5 = [0.,0.,0.,...,0.,0.,0.]
[('http://www.hollins.edu/alumnae/products/chairs/h-chairs.htm', 0.014540261900985956), ('http://www1.hollins.edu/registrar/forms.htm', 0.010060444808438667), ('http://www1.hollins.edu/library/illform.htm', 0.009356482351050533), ('http://www1.hollins.edu/docs/comptech/checkin/info.htm', 0.008719700398634175), ('http://www1.hollins.edu/Registrar/InternationalStudies.htm', 0.007579546712721587), ('http://www1.hollins.edu/homepages/saloweyca/ancpaint_files/outline.htm', 0.006840843697111947), ('http://www1.hollins.edu/Docs/academics/DivisionI/Classical%20Studies/saloweyca/clas%20260/default.html', 0.