# Assignment 04
## Web Search
## CSCI E-108
### Luciano Carvalho

> **Instructions:** For this assignment you will complete the exercises shown. Most exercises involve creating and executing some Python code. Additionally, most exercises have questions for you to answer. You can answer questions by creating a Markdown cell and writing your answer. If you are not familiar with Markdown, you can find a brief tutorial [here](https://www.markdownguide.org/cheat-sheet/).     

In this assignment you will gain some experience and insight into how web search algorithms work. Specifically, you will implement versions of three algorithms, simple PageRank, damped PageRank, and the HITS algorithm. All three of these algorithms use a **directed graph model** of the web.   

The small data examples and coding methods used here are not directly scalable to web sized problems. Rather, the point is for you to understand the basic characteristics of these web search algorithms. Web scale searching requires massive resources not readily available to most people.

## Simple PageRank Example

To get a feeling for the basics of the PageRank algorithm you will create and test simple code.

As a first step, execute the code in the cell below to import the packages you will need for the exercises.

In [1]:
import numpy as np

### Directed Graph of Web Pages

We will start with a simple example. Figure 1 shows a set of web pages and their hyperlinks. This is a **directed graph** with the **pages as nodes** and the **hyperlinks as the directed edges**. This graph is **complete**. Every page is accessible from any other page, possibly with visits to intermediate nodes required.  

<img src="../images/Web1.png" alt="Drawing" style="width:500px; height:400px"/>
<center>Figure 1: A small set of web pages</center>

The directed edges of the graph define the association between the nodes. For the **association matrix**, a directed edge, or hyperlink, runs from a node's column to the terminal node's row. The association is binary. The directed edge either exists or it does not.
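
As a minimal sketch of this convention, the snippet below builds an association matrix from an edge list for a hypothetical three-page graph. The `edges` list, the 0-based page numbering, and the name `A_example` are assumptions made only for illustration; they are not taken from Figure 1.

```python
import numpy as np

# Hypothetical 3-page graph given as (source, destination) pairs, 0-indexed.
# Illustration only; these edges are not the ones shown in Figure 1.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]

n = 3
A_example = np.zeros((n, n), dtype=int)
for src, dst in edges:
    # Column = page the link starts from, row = page the link points to
    A_example[dst, src] = 1

print(A_example)
print("Out degree (column sums):", A_example.sum(axis=0))
print("In degree (row sums):", A_example.sum(axis=1))
```

With this layout, reading down a column lists the pages a given page links to, and reading across a row lists the pages that link to it.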

> **Exercise 04-1:** In the cell below you will create the association matrix and the initial page probability vector. Do the following:  
> 1. Create the association matrix, $A$, using [numpy.array](https://numpy.org/doc/stable/reference/generated/numpy.array.html). This matrix is constructed with a 1 where a page in a column has a directed edge linking to another page in a row, and 0 elsewhere. In matrix notation, the element $a_{i,j}$ indicates the presence or absence of a directed edge from node $n_j$ to node $n_i$.   
> 2. Print the shape of your association matrix as a check.
> 3. Print the in degree and out degree of each node in your association matrix, using [numpy.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html). Set the argument `axis` to 1 to sum across rows and 0 to sum down columns.

In [8]:
# Association matrix based on the provided graph
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# Print the shape of the association matrix
print("Shape of the association matrix:", A.shape)

# Calculate the in-degree and out-degree of each node.
# Element a[i, j] marks an edge from node j to node i, so row sums
# give the in-degree and column sums give the out-degree.
in_degree = np.sum(A, axis=1)
out_degree = np.sum(A, axis=0)

print("In-Degree of each node:", in_degree)
print("Out-Degree of each node:", out_degree)


Shape of the association matrix: (5, 5)
In-Degree of each node: [3 3 1 2 2]
Out-Degree of each node: [2 2 2 3 2]


> Are the out degree and in degree you computed from the association matrix consistent with the graph in Figure 1?    
> **End of exercise.**

> **Answer:** Yes, the out-degree and in-degree values I computed from the association matrix are consistent with the graph in Figure 1. The in-degrees and out-degrees match the links shown in the graph.

### Apply Simple Page Rank

The normalized transition probability matrix, $M$, is then computed from the association matrix, $A$:

$$M = A D^{-1}$$

Where, $D^{-1}$ is the inverse of a matrix with the out degree values on the diagonal and zeros elsewhere.  

You can see from the foregoing that $M$ distributes the influence of the page by the inverse of the out degree. In other words, the influence is inversely weighted by the number of pages each page links to.
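
To make the normalization concrete, here is a small sketch on a hypothetical three-page association matrix (an assumption for illustration, not the matrix of Figure 1). Each column of `A_small` is scaled by the inverse of its out degree, so every column of the result should sum to 1.0.

```python
import numpy as np

# Hypothetical association matrix: column j marks the pages that page j links to
A_small = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
])

out_degree = A_small.sum(axis=0)      # column sums
D_inv = np.diag(1.0 / out_degree)     # safe here only because no out degree is 0
M_small = A_small @ D_inv             # M = A D^{-1}

print(M_small)
print("Column sums:", M_small.sum(axis=0))
```

The plain division works only because every page in this toy graph has at least one outbound link; a graph with a dead end needs the guarded division described in the exercise that follows.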

> **Exercise 04-2:** You will now compute the normalized transition matrix, $M$. To do so create a function called `norm_association` with the association matrix as the argument. Do the following:
> 1. Create your function `norm_association` which will do the following:  
>    - Compute the sum of the columns of the association matrix using `numpy.sum` with the `axis=0` argument to sum along columns.
>    - Compute the inverse of the column sums as a vector. Be sure to avoid zero divides, which will occur in subsequent exercises. Use the `where` argument of [numpy.divide](https://numpy.org/doc/stable/reference/generated/numpy.divide.html) to do so. If the column sum is 0 the inverse is set to 0.0.   
>    - Create a square diagonal matrix from the inverse column sums using [numpy.diag](https://numpy.org/doc/stable/reference/generated/numpy.diag.html) to form the inverse out degree diagonal matrix.
>    - Finally, return the matrix product of the association matrix and the (diagonal) inverse out degree matrix using [numpy.matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html).  
> 2. Save and print the normalized transition matrix.  
> 3. Compute and print the column sums of the normalized transition matrix to ensure they all add to 1.0.
> Execute the code you have created and examine the results.

In [9]:
def norm_association(A):
    '''Function to normalize the association matrix by out degree.
    The function accounts for cases where the column sum is 0'''

    # 1.1 Compute the sum of the columns of the association matrix
    col_sums = np.sum(A, axis=0)

    # 1.2 Compute the inverse of the column sums, avoiding divide by zero
    inv_col_sums = np.divide(1, col_sums, where=col_sums!=0, out=np.zeros_like(col_sums, dtype=float))

    # 1.3 Create a square diagonal matrix from the inverse column sums
    D_inv = np.diag(inv_col_sums)

    # 1.4 Return the matrix product of the association matrix and the inverse out degree matrix
    M = np.matmul(A, D_inv)

    return M

# Code to execute the function and check the column sums

# Define the association matrix A
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# 2. Save and print the normalized transition matrix
M = norm_association(A)
print("Normalized transition matrix:\n", M)

# 3. Compute and print the column sums of the normalized transition matrix
col_sums_M = np.sum(M, axis=0)
print("Column sums of the normalized transition matrix:", col_sums_M)


Normalized transition matrix:
 [[0.         0.5        0.5        0.33333333 0.        ]
 [0.5        0.         0.         0.33333333 0.5       ]
 [0.         0.         0.         0.33333333 0.        ]
 [0.         0.5        0.         0.         0.5       ]
 [0.5        0.         0.5        0.         0.        ]]
Column sums of the normalized transition matrix: [1. 1. 1. 1. 1.]


> Provide short answers to the following questions:     
> 1. Does the number of non-zero values in each column match the out degree for the corresponding node?     
> 2. Are all the column sums 1.0, and why is this required for a transition probability matrix?    
> **End of exercise.**

> **Answers:**

> 1. Yes, the number of non-zero values in each column matches the out degree for the corresponding node.

> 2. Yes, all the column sums are 1.0. This is required because in a transition probability matrix, each column represents the distribution of probability from one node to all other nodes. The sum of probabilities for each node must equal 1 to ensure that the total probability is conserved and properly distributed.


### Computing the Simple Page Rank

With the transition probability matrix, $M$, computed, it is time to investigate the convergence of the PageRank algorithm. You can think of the PageRank algorithm as a series of transitions of a Markov Chain. Given the transition probability matrix, $M$, the update, or single Markov transition, of the page probabilities, $p_i$, is computed:

$$p_i = M p_{i-1}$$

The Markov chain can be executed for a great many transitions. The result of $n$ transitions, starting from an initial set of page probabilities, $p_0$, can be written:  

$$p_n = M^n p_{0}$$

At convergence the page probabilities, $p_n$, approach a constant or **steady state** value. The values of this steady state probability vector are the PageRank of the web pages.  
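
The relation $p_n = M^n p_0$ can be checked numerically. The sketch below uses a hypothetical three-page transition matrix (an assumption for illustration) and compares repeated single transitions with one multiplication by $M^n$, computed with `numpy.linalg.matrix_power`.

```python
import numpy as np

# Hypothetical column-normalized transition matrix for a 3-page graph
M_small = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
p0 = np.full(3, 1.0 / 3.0)   # uniform initial page probabilities

# Apply the single-step update n_steps times
n_steps = 20
p = p0.copy()
for _ in range(n_steps):
    p = M_small @ p

# Apply M^n in one shot
p_direct = np.linalg.matrix_power(M_small, n_steps) @ p0

print("Step by step:", p)
print("Using M^n:   ", p_direct)
print("Same result: ", np.allclose(p, p_direct))
```

Both routes give the same vector; the step-by-step form is the one used in practice because it never has to form $M^n$ explicitly.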

> **Exercise 04-3:** You will now create and execute code with the goal of getting a feel for how the page probabilities change for a single transition of a Markov process. To accomplish this task you will create a function called `transition` with arguments of the normalized transition probability matrix and the vector of page probabilities. Specifically you will:   
> 1. Complete the function `transition` which uses [numpy.dot](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) to compute the product of the transition matrix and the page probability vector.  
> 2. Define a state vector p, with uniform initial probabilities. Print this initial state vector.
> 3. Execute the `transition` function on the normalized transition probability matrix and vector of initial page probabilities you have created, saving the result to a new variable name. Print the result.
> 4. Print the Euclidean (L2) norm of the difference between the initial page probabilities and the updated page probabilities.
> 5. Print the sum of the page probabilities computed with `transition`.

In [11]:
def transition(transition_probs, probs):
    '''Function to compute the probabilities resulting from a
    single transition of a Markov process'''

    # 1. compute the product of the transition matrix and the page prob vector
    return np.dot(transition_probs, probs)

# Compute probabilities after first state transition and print the summaries

# Define the association matrix A
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# Normalize the association matrix
M = norm_association(A)

# 2. Define a state vector p, with uniform initial probabilities
p0 = np.full(M.shape[0], 1.0 / M.shape[0])
print("Initial state vector p0:", p0)

# 3. Execute the transition function
p1 = transition(M, p0)
print("Updated state vector p1:", p1)

# 4. Print the Euclidean (L2) norm of the difference
l2_norm_diff = np.linalg.norm(p1 - p0)
print("Euclidean (L2) norm of the difference:", l2_norm_diff)

# 5. Print the sum of the page probabilities computed with transition
sum_p1 = np.sum(p1)
print("Sum of the page probabilities:", sum_p1)


Initial state vector p0: [0.2 0.2 0.2 0.2 0.2]
Updated state vector p1: [0.26666667 0.26666667 0.06666667 0.2        0.2       ]
Euclidean (L2) norm of the difference: 0.16329931618554527
Sum of the page probabilities: 1.0


> Provide short answers to the following questions:   
> 1. Is the sum of the page probabilities equal to 1.0 as it should be?       
> 2. Considering in degree of the pages, are the relative changes in the page probabilities what you would expect and why?    
> **End of exercise.**

> **Answers:**

> 1. Yes, the sum of the page probabilities is equal to 1.0, as expected when every column of the transition probability matrix sums to 1.0.

> 2. Yes, the relative changes in the page probabilities are what I would expect. Pages with higher in-degrees, like Page 1 and Page 2, see an increase in their probabilities because they receive more links, indicating they are more important or influential in the network. Pages with lower in-degrees, like Page 3, have a lower probability, reflecting their lower influence.

> **Exercise 04-4:** You will continue with computing transitions of the Markov chain. Use the `transition` function with the normalized transition probability matrix and the page probability vector computed from the first transition as arguments. Your code must do the following:  
> 1. Compute and print the resulting page probabilities of the second transition.
> 2. Compute and print the Euclidean (L2) norm of the difference between the page probabilities before and after the transition.
> 3. Compute and print the sum of the page probabilities.
> 4. Display the page probabilities.

In [12]:
# Compute the second transition of the Markov chain

# 1. Compute and print the resulting page probabilities of the second transition
p2 = transition(M, p1)
print("Updated state vector p2:", p2)  # Bullet 1

# 2. Compute and print the Euclidean (L2) norm of the difference
l2_norm_diff_2 = np.linalg.norm(p2 - p1)
print("Euclidean (L2) norm of the difference (second transition):", l2_norm_diff_2)  # Bullet 2

# 3. Compute and print the sum of the page probabilities
sum_p2 = np.sum(p2)
print("Sum of the page probabilities (second transition):", sum_p2)  # Bullet 3

# 4. Display the page probabilities
print("Page probabilities after second transition:", p2)  # Bullet 4


Updated state vector p2: [0.23333333 0.3        0.06666667 0.23333333 0.16666667]
Euclidean (L2) norm of the difference (second transition): 0.0666666666666667
Sum of the page probabilities (second transition): 1.0000000000000002
Page probabilities after second transition: [0.23333333 0.3        0.06666667 0.23333333 0.16666667]


> Note the difference between the Euclidean norms of the differences for the first and second transition calculations. Does the change in this difference from one step to the next indicate the algorithm is converging to the steady state probabilities?  
> **End of exercise.**

> **Answer:** Yes, the change in the Euclidean norms of the differences between the first and second transition calculations indicates that the algorithm is converging to the steady state probabilities. The Euclidean norm of the difference decreased from approximately 0.1633 (first transition) to 0.0667 (second transition), showing that the changes in the page probabilities are becoming smaller with each transition. This decreasing trend suggests that the page probabilities are stabilizing and approaching their steady state values.

> **Exercise 04-5:** The question now is how does this simplified version of page rank converge with more iterations? To find out, do the following:   
> 1. Create a function `pagerank1` having the following arguments, `the normalized transition matrix`, the `initial page probabilities` and a `convergence threshold value of 0.01`, which does the following:  
>    - Initialize a euclidean distance norm variable to 1.0 and the resulting page probabilities to a vector of 0.0 values of length equal to the dimension of the transition matrix.   
>    - Set a loop counter to 1.  
>    - Use a `while` loop that continues while the Euclidean distance norm is greater than the threshold value AND the loop counter is less than 50. Inside this loop do the following:  
>      1. Update the page probabilities using the `transition` function you created.
>      2. Compute the Euclidean norm of the difference between the previous and the updated page probabilities following the transition.   
>      3. Print the value of the loop counter and the Euclidean norm of the difference.
>      4. Copy the updated page probability vector into the input page probability vector.  
>      5. Increment the loop counter by 1.
>    - Return the page probabilities at convergence.  
> 2. Execute your `pagerank1` function using the transition matrix and initial page probability vector.

In [13]:
# 1. Define the PageRank function with a convergence threshold
def pagerank1(M, in_probs, threshold=0.01):
    # Initialize variables
    euclidean_dist = 1.0
    page_probabilities = np.array([0.0] * len(M))
    i = 1

    # Loop until convergence or max iterations
    while euclidean_dist > threshold and i < 50:
        # 1.1. Update the page probabilities
        new_probs = transition(M, in_probs)

        # 1.2. Compute the Euclidean norm of the difference
        euclidean_dist = np.linalg.norm(new_probs - in_probs)

        # 1.3. Print the value of the loop counter and the Euclidean norm of the difference
        print(f"Iteration {i}: Euclidean norm of the difference: {euclidean_dist}")

        # 1.4. Copy the updated page probability vector into the input page probability vector
        in_probs = new_probs

        # 1.5. Increment the loop counter
        i += 1

    # Return the page probabilities at convergence
    return in_probs

# Compute probabilities after a larger number of state transitions
# Define the association matrix A
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# Normalize the association matrix
M = norm_association(A)

# Define the initial state vector p with uniform initial probabilities
p0 = np.full(M.shape[0], 1.0 / M.shape[0])

# 2. Execute the pagerank1 function
print('Final state probabilities: ' + str(pagerank1(M, p0)))


Iteration 1: Euclidean norm of the difference: 0.16329931618554527
Iteration 2: Euclidean norm of the difference: 0.0666666666666667
Iteration 3: Euclidean norm of the difference: 0.04082482904638634
Iteration 4: Euclidean norm of the difference: 0.0285989726138528
Iteration 5: Euclidean norm of the difference: 0.014829275350043464
Iteration 6: Euclidean norm of the difference: 0.006843400694025186
Final state probabilities: [0.25300926 0.28472222 0.07546296 0.22523148 0.16157407]


> Provide short answers for the following questions:  
> 1. Judged from the rate of decline of the Euclidean distances, does the algorithm appear to converge rapidly and why?
> 2.  Does the rank order of the computed page probabilities make sense given the relative degree of the pages of the directed graph?
> **End of exercise.**

> **Answers:**

> 1. Yes, the algorithm appears to converge rapidly. The Euclidean distances between iterations decrease significantly with each step, indicating that the page probabilities are quickly stabilizing to their steady state values.

> 2. Yes, the rank order of the computed page probabilities makes sense. Pages with higher in-degrees, which receive more links, have higher PageRank values, reflecting their greater importance or influence in the graph.

### Page Rank by Eigendecomposition

Consider the relationship:     

$$p_i = M p_{i-1}$$      

At convergence $p_i = p_{i-1}$, which suggests an eigenvalue-eigenvector problem for some eigenvalue, $\lambda$:    

$$\lambda p^* = M p^*$$  

For the transition probability matrix with normalized columns, the largest eigenvalue has a magnitude 1.0. The eigenvector associated with this eigenvalue is the PageRank vector. To demonstrate this point, the code in the cell below does the following:     
1. Compute the eigendecomposition using [numpy.linalg.eig](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html).     
2. Print the magnitude of the eigenvalues.  
3. Get the eigenvector associated with the first, largest, eigenvalue.     
4. Normalize the eigenvector so the values sum to 1.0 and display the results.    

In [14]:
# 1. Compute the eigendecomposition of the transition matrix M
eigenvalues, eigenvectors = np.linalg.eig(M)

# 2. Print the magnitude of the eigenvalues
print("Magnitude of the eigenvalues:", np.abs(eigenvalues))

# 3. Get the eigenvector associated with the first, largest, eigenvalue
principal_eigenvector = eigenvectors[:, np.argmax(np.abs(eigenvalues))]

# 4. Normalize the eigenvector so the values sum to 1.0
principal_eigenvector = principal_eigenvector / np.sum(principal_eigenvector)

# Display the results
print("PageRank vector from eigendecomposition:", principal_eigenvector.real)


Magnitude of the eigenvalues: [1.00000000e+00 6.04992664e-01 5.86819216e-01 5.86819216e-01
 3.95038058e-17]
PageRank vector from eigendecomposition: [0.25373134 0.28358209 0.07462687 0.2238806  0.1641791 ]


> **Exercise 04-6:** Do the PageRanks computed by the two methods agree to the precision expected with the iterative method?    
> **End of exercise.**

> **Answer:**
>>
>> Iterative Method Final State Probabilities:
>>
>> [0.25300926, 0.28472222, 0.07546296, 0.22523148, 0.16157407]
>>
>> Eigendecomposition PageRank Vector:
>>
>> [0.25373134, 0.28358209, 0.07462687, 0.2238806, 0.1641791]
>
> The two methods agree to about two decimal places. The largest element-wise difference is about 0.0026 (Page 5), which is below the 0.01 convergence threshold used by the iterative method. The gap comes from stopping the iteration early rather than from floating-point error; a smaller threshold would bring the iterative result closer to the eigenvector.
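
One way to make this comparison concrete is to measure the gap between the two vectors directly. The sketch below copies both vectors from the cell outputs above; the variable names `p_iterative` and `p_eigen` are chosen here for illustration only.

```python
import numpy as np

# Values copied from the printed outputs of the two methods
p_iterative = np.array([0.25300926, 0.28472222, 0.07546296, 0.22523148, 0.16157407])
p_eigen = np.array([0.25373134, 0.28358209, 0.07462687, 0.2238806, 0.1641791])

diff = p_iterative - p_eigen
max_abs_diff = np.max(np.abs(diff))
l2_diff = np.linalg.norm(diff)

print("Largest element-wise difference:", max_abs_diff)
print("Euclidean norm of the difference:", l2_diff)
print("Below the 0.01 threshold:", l2_diff < 0.01)
```

Note that the threshold in `pagerank1` is applied to the change between successive iterates, not to the distance from the steady state, so it only gives a rough indication of the accuracy to expect.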

## A More Complicated Example   

You will now work with a more complicated example. The graph of 6 web pages, shown in Figure 2, is no longer complete. The out degree of page 6 is 0. A random surfer transitioning to page 6 will have no escape, a **spider trap**!

<img src="../images/Web2.png" alt="Drawing" style="width:500px; height:500px"/>
<center>Figure 2: A small set of web pages with a dead end</center>

> **Exercise 04-7:** You will now create both the normalized transition matrix and the initial page probability vector for the graph of Figure 2. Do the following:  
> 1. Create the association matrix and save it to a named variable, `A_deadend`. You will need this association matrix later.
> 2. Normalize the association matrix using your `norm_association` function. Name your transition matrix `M_deadend`. Print the result.
> 3. Create a vector containing the uniformly distributed initial probability values. Save and print the result.   

In [15]:
# 1. Create the association matrix for the graph in Figure 2
A_deadend = np.array([
    [0, 1, 1, 0, 0, 1],  # Page 1
    [1, 0, 0, 1, 0, 0],  # Page 2
    [0, 1, 0, 1, 1, 0],  # Page 3
    [1, 1, 0, 0, 1, 0],  # Page 4
    [0, 0, 1, 1, 0, 0],  # Page 5
    [0, 0, 0, 0, 0, 0]   # Page 6 (dead end)
])

# 2. Normalize the association matrix using the norm_association function
M_deadend = norm_association(A_deadend)
print("Normalized transition matrix M_deadend:\n", M_deadend)

# 3. Create a vector containing the uniformly distributed initial probability values
p0_deadend = np.full(M_deadend.shape[0], 1.0 / M_deadend.shape[0])
print("Initial state vector p0_deadend:", p0_deadend)


Normalized transition matrix M_deadend:
 [[0.         0.33333333 0.5        0.         0.         1.        ]
 [0.5        0.         0.         0.33333333 0.         0.        ]
 [0.         0.33333333 0.         0.33333333 0.5        0.        ]
 [0.5        0.33333333 0.         0.         0.5        0.        ]
 [0.         0.         0.5        0.33333333 0.         0.        ]
 [0.         0.         0.         0.         0.         0.        ]]
Initial state vector p0_deadend: [0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]


> Examine your results. Are the 0 values for the transition probabilities of page 6 consistent with the graph of these pages, and why?      
> **End of exercise.**     

> **Answers:** Only partly. With the convention of Exercise 04-1, where $a_{i,j}$ marks a directed edge from node $n_j$ to node $n_i$, a page with no outbound links should have an all-zero column. In the matrices printed above it is the row for Page 6 that contains all 0s, while the column for Page 6 has a 1.0 in its first row. As entered, the matrix therefore describes a page that receives no links and sends all of its probability to Page 1, rather than a dead end, so the entries for Page 6 should be rechecked against the figure.

### Apply Simple PageRank Algorithm

> **Exercise 04-8:** What happens if you apply the simplified PageRank algorithm to the pages on a graph that is not complete, like the one shown in Figure 2? To find out, execute your `pagerank1` function with arguments `M_deadend`, `p_deadend` and `threshold=0.00001`. The smaller threshold value is to ensure convergence.

In [16]:
# Execute the pagerank1 function with M_deadend, p0_deadend and threshold=0.00001
final_state_probabilities = pagerank1(M_deadend, p0_deadend, threshold=0.00001)
print('Final state probabilities: ' + str(final_state_probabilities))


Iteration 1: Euclidean norm of the difference: 0.22906142364542562
Iteration 2: Euclidean norm of the difference: 0.19289506162962652
Iteration 3: Euclidean norm of the difference: 0.10088863659679724
Iteration 4: Euclidean norm of the difference: 0.03412165422176333
Iteration 5: Euclidean norm of the difference: 0.02590731238341437
Iteration 6: Euclidean norm of the difference: 0.01671295239807624
Iteration 7: Euclidean norm of the difference: 0.008603922335551545
Iteration 8: Euclidean norm of the difference: 0.004451250160228181
Iteration 9: Euclidean norm of the difference: 0.002525129392336522
Iteration 10: Euclidean norm of the difference: 0.0015294848163096952
Iteration 11: Euclidean norm of the difference: 0.0009792169050741875
Iteration 12: Euclidean norm of the difference: 0.0006394541209994019
Iteration 13: Euclidean norm of the difference: 0.00040849467529339994
Iteration 14: Euclidean norm of the difference: 0.0002541321991486564
Iteration 15: Euclidean norm of the differe

> Now, create and execute code to compute and display the eigenvalues and eigenvectors of `M_deadend`. Then display the magnitudes of the eigenvalues, and the normalized values of the first eigenvector (column 0).    

In [17]:
# Compute the eigendecomposition of the transition matrix M_deadend
eigenvalues_deadend, eigenvectors_deadend = np.linalg.eig(M_deadend)

# Print the magnitude of the eigenvalues
print("Magnitude of the eigenvalues:", np.abs(eigenvalues_deadend))

# Get the eigenvector associated with the first, largest, eigenvalue
principal_eigenvector_deadend = eigenvectors_deadend[:, np.argmax(np.abs(eigenvalues_deadend))]

# Normalize the eigenvector so the values sum to 1.0
principal_eigenvector_deadend = principal_eigenvector_deadend / np.sum(principal_eigenvector_deadend)

# Display the results
print("Normalized PageRank vector from eigendecomposition:", principal_eigenvector_deadend.real)


Magnitude of the eigenvalues: [1.         0.21999481 0.44952418 0.44952418 0.62485456 0.        ]
Normalized PageRank vector from eigendecomposition: [0.17073171 0.16463415 0.23170732 0.23780488 0.19512195 0.        ]


> Answer the following questions:  
> 1. Examine the page probabilities computed with the iterative methods. Do these PageRank values sum to 1.0 and why is this outcome a problem?       
> 2. Examine the eigenvalues of the `M_deadend` matrix. What problem can you see with these eigenvalues?
> 3. Notice the PageRank values have only 0 values. What does this tell you about the convergence of a random surfer on this graph?             
> **End of exercise.**

In [18]:
print("Sum of the final state probabilities:", np.sum(final_state_probabilities))

Sum of the final state probabilities: 0.9999999999999993


> **Answers:**

> 1. The sum printed above is 0.9999999999999993, which equals 1.0 up to floating-point rounding error. PageRank values are supposed to represent a probability distribution, so they must sum to 1. A sum noticeably different from 1.0 would indicate that the random surfer model is not properly normalized.

> 2. The problem is that the largest eigenvalue is 1.0, but there is also an eigenvalue of 0. This zero eigenvalue indicates that there are dead ends or spider traps in the graph, where some pages (like Page 6) have no outbound links. This can cause issues with the convergence of the PageRank algorithm, as the probability "leaks" out of the system at these dead ends.

> 3. The PageRank vector from eigendecomposition shows a 0 value for Page 6. This indicates that a random surfer will get "stuck" at Page 6 if they reach it, because there are no outbound links from this page. This results in the probability accumulating at Page 6 and not being distributed back into the system. This behavior highlights the problem of dead ends in the graph and shows that the PageRank algorithm needs a mechanism (like teleportation or damping) to handle such cases and ensure proper convergence.
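
The leak of probability at a dead end can be seen on a small hypothetical graph (an assumption for illustration, not the six-page graph used in this exercise). In the sketch below page 3 has no outbound links, so its column in `M_leak` is all zeros, and the total probability shrinks with every transition.

```python
import numpy as np

# Hypothetical 3-page graph: page 1 -> {2, 3}, page 2 -> {1, 3}, page 3 -> nothing.
# Column j holds the transition probabilities out of page j.
M_leak = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

p = np.full(3, 1.0 / 3.0)
totals = []
for step in range(1, 6):
    p = M_leak @ p
    totals.append(p.sum())
    print(f"Step {step}: total probability = {p.sum():.4f}")

eig_magnitudes = np.abs(np.linalg.eigvals(M_leak))
print("Eigenvalue magnitudes:", eig_magnitudes)
```

In this toy case every eigenvalue has magnitude below 1.0, so repeated transitions drive all page probabilities toward 0 instead of toward a usable steady state.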

### Damped PageRank Algorithm

It is clear from the results of the foregoing exercise that the simple PageRank algorithm does not converge to a usable set of page probabilities when faced with a graph that is not complete. Fortunately, there is a simple fix: add a damping term. You can think of the damping term as allowing a random surfer to make an arbitrary transition or jump with some small probability. These random jumps help the random surfer to better explore the graph and to escape from spider traps. The jump probabilities from states, $p_i$, are a function of the damping factor $d$:

$$Jump\ Probability = \frac{(1-d)}{n}$$

Where $n$ is the dimension of the transition probability matrix.

The updated page probabilities, $p_i$, are then computed with the damped PageRank algorithm as:   

$$p_{i} = d * M p_{i-1} + \frac{(1-d)}{n}$$

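Where $M$ is the transition probability matrix, $p_{i-1}$ are the page probability values before the transition, and the jump probability $\frac{(1-d)}{n}$ is added to every element of the vector.   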
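
As a minimal sketch of this update, the code below applies the damped transition repeatedly to a hypothetical three-page transition matrix. The helper name `damped_step` and the matrix `M_small` are assumptions made for illustration; they are not part of the assignment code.

```python
import numpy as np

def damped_step(M, p, d=0.85):
    """One damped transition: d * M p plus a uniform jump term (hypothetical helper)."""
    n = M.shape[0]
    # The scalar jump probability is broadcast onto every element of the vector
    return d * (M @ p) + (1.0 - d) / n

# Hypothetical column-normalized matrix in which every page has outbound links
M_small = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

p = np.full(3, 1.0 / 3.0)
for _ in range(50):
    p = damped_step(M_small, p)

print("Damped page probabilities:", p)
print("Sum:", p.sum())
```

When every column of $M$ sums to 1.0, the update keeps the total at 1.0, since the damped part contributes $d$ and the $n$ jump terms contribute $1-d$. If $M$ has an all-zero column, the damped part loses some probability at each step and the total can fall below 1.0.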

> **Exercise 04-9:** To implement the PageRank algorithm with a damping factor do the following:  
> 1. Create a `transition_damped` function with arguments, the transition probability matrix, the initial page probabilities, and the damping factor, $d=0.85$, which does the following:  
>   - Compute the updated page probabilities by computing the inner (dot) product of the transition probability matrix with the page probabilities and then multiplying by the damping factor, `d`.    
>   - Compute the jump probabilities vector of length the dimension of the transition matrix. Note: the jump probabilities are constant, so you can create code that only computes them once if you so choose.      
>   - Return the sum of the damped page probabilities and the jump probabilities.  
> 2. Create a `pagerank_damped` function. This function is identical to the `pagerank1` function you already created except that it uses the `transition_damped` function in place of the `transition` function.  
> 3. Call your `pagerank_damped` function using arguments of `M_deadend`, `p_deadend` and `threshold=0.0001` and display the final PageRank vector.
> 4. Compute and display the sum of the values in the PageRank vector.

In [19]:

# 1. Create the transition_damped function:
def transition_damped(transition_probs, probs, d=0.85):
    '''Function to compute the probabilities resulting from a
    single transition of a Markov process including a damping
    factor to deal with dead ends'''
    # Compute the updated page probabilities by computing the inner (dot) product
    # of the transition probability matrix with the page probabilities and then
    # multiplying by the damping factor, d.
    damped_probs = d * np.dot(transition_probs, probs)

    # Compute the jump probabilities vector of length the dimension of the transition matrix
    n = len(transition_probs)
    jump_probs = (1 - d) / n

    # Return the sum of the damped page probabilities and the jump probabilities
    return damped_probs + jump_probs

# 2. Create the pagerank_damped function:
def pagerank_damped(M, in_probs, d=0.85, threshold=0.01):
    ## Function for the PageRank algorithm using the damped transition algorithm
    euclidean_dist = 1.0
    page_probabilities = np.array([0.0] * len(M))
    i = 1

    # Loop until convergence or max iterations
    while euclidean_dist > threshold and i < 50:
        new_probs = transition_damped(M, in_probs, d)
        euclidean_dist = np.linalg.norm(new_probs - in_probs)
        print(f"Iteration {i}: Euclidean norm of the difference: {euclidean_dist}")
        in_probs = new_probs
        i += 1

    # Return the page probabilities at convergence
    return in_probs

# Compute probabilities after a larger number of state transitions
# Define the association matrix A_deadend
A_deadend = np.array([
    [0, 1, 1, 0, 0, 1],  # Page 1
    [1, 0, 0, 1, 0, 0],  # Page 2
    [0, 1, 0, 1, 1, 0],  # Page 3
    [1, 1, 0, 0, 1, 0],  # Page 4
    [0, 0, 1, 1, 0, 0],  # Page 5
    [0, 0, 0, 0, 0, 0]   # Page 6 (dead end)
])

# Normalize the association matrix
M_deadend = norm_association(A_deadend)
print("Normalized transition matrix M_deadend:\n", M_deadend)

# Create a vector containing the uniformly distributed initial probability values
p0_deadend = np.full(M_deadend.shape[0], 1.0 / M_deadend.shape[0])
print("Initial state vector p0_deadend:", p0_deadend)

# 3. Execute the pagerank_damped function
damped_rank = pagerank_damped(M_deadend, p0_deadend, threshold=0.0001)
print(f"Sum of the PageRank values = {np.sum(damped_rank)}")
print('Final state probabilities: ' + str(damped_rank))


Normalized transition matrix M_deadend:
 [[0.         0.33333333 0.5        0.         0.         1.        ]
 [0.5        0.         0.         0.33333333 0.         0.        ]
 [0.         0.33333333 0.         0.33333333 0.5        0.        ]
 [0.5        0.33333333 0.         0.         0.5        0.        ]
 [0.         0.         0.5        0.33333333 0.         0.        ]
 [0.         0.         0.         0.         0.         0.        ]]
Initial state vector p0_deadend: [0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]
Iteration 1: Euclidean norm of the difference: 0.19470221009861177
Iteration 2: Euclidean norm of the difference: 0.13936668202740518
Iteration 3: Euclidean norm of the difference: 0.06195823395000812
Iteration 4: Euclidean norm of the difference: 0.0178117167640994
Iteration 5: Euclidean norm of the difference: 0.011495212137118046
Iteration 6: Euclidean norm of the difference: 0.006303281901598147
Iteration 7: Euclidean norm of the diffe

> Provide short answers to the following questions:   
> 1. Examine the final page probabilities. Does the rank of these page probabilities make sense given the in degree of the pages of this graph?   
> 2. Why is it reasonable that the sum of the PageRanks is $< 1.0$?

> **Answers:**

> 1. Yes, the rank of the final page probabilities makes sense given the in-degree of the pages. Pages with higher in-degrees tend to have higher PageRank values, as they receive more links from other pages. For example, Page 4 has the highest PageRank (0.22781628) and is one of the pages with three incoming links, the largest in-degree in this graph.

> 2. The sum of the PageRanks is 0.9999999999999996, which differs from 1.0 only by floating-point rounding. Every column of `M_deadend` as printed above sums to 1, so each damped update preserves the total probability: $d \cdot 1 + (1 - d) = 1$. In theory the sum is therefore exactly 1.0, and the very small deviation comes from the limits of floating-point arithmetic. This discrepancy is generally acceptable in practical applications.
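
Whether the PageRank vector sums to 1 depends on whether every column of the transition matrix sums to 1. The minimal sketch below applies the same damped update to two small, hypothetical 3-page matrices (illustrative assumptions, not the graph from this assignment): one with columns that all sum to 1, and one with an all-zero column, which is a true dead end under the column-to-row convention. The helper name `damped_step` is also hypothetical.

```python
import numpy as np

def damped_step(M, p, d=0.85):
    """One damped PageRank update: d * M p + (1 - d) / n."""
    n = M.shape[0]
    return d * M.dot(p) + (1.0 - d) / n

# Hypothetical 3-page graph in which every column sums to 1.
M_stochastic = np.array([[0.0, 0.5, 1.0],
                         [0.5, 0.0, 0.0],
                         [0.5, 0.5, 0.0]])

# Same graph, but page 3 has no outbound links (all-zero column).
M_leaky = M_stochastic.copy()
M_leaky[:, 2] = 0.0

p_stochastic = np.full(3, 1.0 / 3.0)
p_leaky = np.full(3, 1.0 / 3.0)
for _ in range(50):
    p_stochastic = damped_step(M_stochastic, p_stochastic)
    p_leaky = damped_step(M_leaky, p_leaky)

sum_stochastic = p_stochastic.sum()
sum_leaky = p_leaky.sum()
print("Sum with column-stochastic matrix:", sum_stochastic)
print("Sum with a dead-end column:", sum_leaky)
```

In the first case the total stays at 1 up to rounding. In the second, the probability mass sitting on the dead-end page is only partly replaced by the jump term at each step, so the total settles below 1.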

> Next you will examine some properties of the damped matrix, $M$. You will do so by the following steps:    
> 4. Create a Numpy array of $M$ including the damping, with a damping factor, $d = 0.85$. Display this matrix.  
> 5. Compute and display the column sums of the damped matrix.  

In [21]:
d = 0.85
n = M_deadend.shape[0]
damped_matrix = d * M_deadend + (1 - d) / n
print("Damped matrix M with d=0.85:\n", damped_matrix)

column_sums = np.sum(damped_matrix, axis=0)
print("Column sums of the damped matrix:", column_sums)


Damped matrix M with d=0.85:
 [[0.025      0.30833333 0.45       0.025      0.025      0.875     ]
 [0.45       0.025      0.025      0.30833333 0.025      0.025     ]
 [0.025      0.30833333 0.025      0.30833333 0.45       0.025     ]
 [0.45       0.30833333 0.025      0.025      0.45       0.025     ]
 [0.025      0.025      0.45       0.30833333 0.025      0.025     ]
 [0.025      0.025      0.025      0.025      0.025      0.025     ]]
Column sums of the damped matrix: [1. 1. 1. 1. 1. 1.]


> Examine your results and answer these questions:   
> 3. How does this array allow the random surfer to 'teleport' (or transition) from any page to any other page even when there is no directed edge?   
> 4. Why are the column sums reasonable despite the obvious deviation from axiomatic probability theory?      
> **End of exercise.**

> **Answers:**

> 3. The damped matrix adds a uniform jump probability, (1 - d) / n, to every element of the transition matrix after it is scaled by d. Every entry is therefore at least 0.025, including entries where no directed edge exists. This lets the random surfer 'teleport' from any page to any other page in the graph, regardless of whether a direct edge exists.

> 4. The column sums of the damped matrix are 1.0 because each column represents a probability distribution over all pages. The damping adjustment ensures that the total probability is conserved within the system, maintaining a valid probability distribution. Despite the addition of teleportation probabilities, the sum of probabilities from any given page remains 1.0, which is consistent with the requirements of a Markov process. This ensures the model's mathematical integrity while addressing practical issues like dead ends and spider traps in the graph.
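
For a probability vector $p$ that sums to 1, multiplying by the damped matrix gives the same result as the update $d M p + (1 - d)/n$, because the constant term contributes $(1 - d)/n$ times the sum of $p$ to each element. The following sketch checks this numerically. It assumes that dividing each column of `A_deadend` by its column sum reproduces the printed `M_deadend`; this stands in for `norm_association`, whose implementation is not shown in this section. The names `A`, `M`, `G`, `via_matrix` and `via_update` are illustrative.

```python
import numpy as np

# Association matrix copied from the A_deadend definition above.
A = np.array([
    [0, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
], dtype=float)

# Assumed normalization: divide each column by its sum.
# (Every column of this particular matrix has a nonzero sum.)
M = A / A.sum(axis=0)

d = 0.85
n = M.shape[0]
G = d * M + (1.0 - d) / n   # damped matrix, as in the cell above

p = np.full(n, 1.0 / n)     # uniform starting probabilities
via_matrix = G.dot(p)
via_update = d * M.dot(p) + (1.0 - d) / n

print("Same result:", np.allclose(via_matrix, via_update))
print("Column sums:", G.sum(axis=0))
```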


## Hubs, Authorities, and the HITS Algorithm  

The hubs and authorities model is an alternative to PageRank. Rather than using a single metric to rank the importance of web pages, the **HITS** algorithm iteratively updates the **hub score** and **authority score** for each of the pages.

The HITS algorithm updates the authority and hub scores iteratively. The authority score of a page is the sum of the hub scores of the pages that link to it. This is computed as the matrix product of the association matrix and the hubness vector:
$$a = \beta A h$$

The hub score of a page is the sum of the authority scores of the pages it links to. The hub score (hubness) is computed as the matrix product of the transpose of the association matrix and the authority vector:
$$h = \alpha A^T a$$

The algorithm alternates between updates to $a$ and $h$ until convergence. To ensure convergence, $a$ and $h$ must be normalized to unit Euclidean norm at each iteration. The constants $\alpha$ and $\beta$ can therefore be set to 1.0 and effectively ignored.       
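
Substituting one update into the other gives $a \propto A A^T a$ and $h \propto A^T A h$, so at convergence the authority and hub vectors are principal eigenvectors of $A A^T$ and $A^T A$ respectively. The minimal sketch below checks this on a small, hypothetical 3-page association matrix (an assumption for illustration, not a graph from this assignment) by comparing the normalized iteration with `numpy.linalg.eigh`.

```python
import numpy as np

# Hypothetical 3-page association matrix (an edge runs from column to row).
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 0, 0]], dtype=float)

# Alternating, normalized HITS updates.
a = np.ones(3)
h = np.ones(3)
for _ in range(100):
    a = A.dot(h)
    a = a / np.linalg.norm(a)
    h = A.T.dot(a)
    h = h / np.linalg.norm(h)

# Principal eigenvectors of the symmetric matrices A A^T and A^T A.
# np.linalg.eigh returns eigenvalues in ascending order, so the last
# column belongs to the largest eigenvalue. The sign of an eigenvector
# is arbitrary, so absolute values are taken before comparing.
_, vecs_auth = np.linalg.eigh(A.dot(A.T))
_, vecs_hub = np.linalg.eigh(A.T.dot(A))
a_eig = np.abs(vecs_auth[:, -1])
h_eig = np.abs(vecs_hub[:, -1])

print("Authority matches eigenvector:", np.allclose(a, a_eig))
print("Hub matches eigenvector:", np.allclose(h, h_eig))
```

Taking absolute values is safe here only because the principal eigenvector of this example has entries of a single sign. That is an assumption of the sketch rather than a general property of every association matrix.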

> **Exercise 04-10:** To understand the HITS algorithm you will now create and test code for this algorithm. Follow these steps:  
> 1. Create a function called `HITS` with arguments of the association matrix, initial hub vector, initial authority vector, and the number of iterations of the algorithm to run. This function does the following inside a loop over the number of iterations:  
>    1. Updates the authority vector using the association matrix and the hub vector as arguments to the `transition` function.
>    2. Normalizes the authority vector by using `numpy.divide` with arguments of the updated authority vector and its L2 norm, computed with [numpy.linalg.norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).  
>    3. Updates the hub vector using the transpose of the association matrix and the authority vector as arguments to the `transition` function.
>    4. Normalizes the hub vector by using `numpy.divide` with arguments of the updated hub vector and its L2 norm, computed with `numpy.linalg.norm`.  
> 2. The function returns both the hub and authority vectors.
> 3. Initialize an initial hub and authority vector of length the dimension of the association matrix with uniformly distributed values of $\frac{1.0}{dimension(association\ matrix)}$.  
> 4. Display the resulting hub and authority vectors.  
> 5. Execute your function using the association matrix for the 6-page network and the initial hub and authority vectors as arguments.

In [23]:
def HITS(association, hub, authority, iters=100):
    '''HITS algorithm implementation'''
    for _ in range(iters):
        # 1.1. Update the authority vector using the association matrix and the hub vector
        authority = np.dot(association, hub)

        # 1.2. Normalize the authority vector
        authority = np.divide(authority, np.linalg.norm(authority))

        # 1.3. Update the hub vector using the transpose of the association matrix and the authority vector
        hub = np.dot(association.T, authority)

        # 1.4. Normalize the hub vector
        hub = np.divide(hub, np.linalg.norm(hub))

    # 2. Return both the hub and authority vectors
    return hub, authority

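# 3. Initialize hub and authority vectors of length the dimension of the association matrix.
# Uniform values of 1.0 are used here rather than 1.0/dimension(association matrix).
# Because HITS normalizes both vectors at every iteration, any positive uniform
# starting value gives the same final scores.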
hub_start = np.ones(A_deadend.shape[0])
auth_start = np.ones(A_deadend.shape[0])

# 4. Display the resulting hub and authority vectors
print("Initial hub vector:", hub_start)
print("Initial authority vector:", auth_start)

# 5. Execute the HITS function using the association matrix for the 6-page network
# and the initial hub and authority vectors as arguments
hubness, authority = HITS(A_deadend, hub_start, auth_start)
print(f"The hub ranks = {hubness}")
print(f"The authority ranks = {authority}")

Initial hub vector: [1. 1. 1. 1. 1. 1.]
Initial authority vector: [1. 1. 1. 1. 1. 1.]
The hub ranks = [0.33649127 0.60496025 0.27303459 0.47123038 0.44601115 0.1589491 ]
The authority ranks = [0.40598191 0.31623732 0.59596894 0.54321619 0.29139291 0.        ]


> Examine your results and answer the following questions:
> 1. Which three of the pages have the highest hub scores? Considering the graph of the pages, is this ordering consistent?  
> 2. Notice the last value of the hub scores. Is this value expected given the graph of the pages?
> 3. Which three of the pages have the highest authority scores? Given the in degree of the pages, is this ranking consistent?  
> 4. Compare the ranking of the pages based on authority to that found with damped PageRank. Are these results consistent?
> **End of exercise.**

> **Answers:**

> 1. The three pages with the highest hub scores are Page 2 (0.60496025), Page 4 (0.47123038), and Page 5 (0.44601115). Yes, this ordering is consistent with the graph because these pages link to several other pages with high authority, making them strong hubs.

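> 2. The last value of the hub scores, for Page 6, is 0.1589491, the lowest of all pages but not zero. In `A_deadend` as defined above, column 6 contains a single 1 in row 1, so under the column-to-row convention Page 6 has exactly one outbound link, to Page 1, and its hub score comes entirely from Page 1's authority. With only one outbound link it is a poor hub, so the low value is expected. Row 6 is all zeros, meaning no page links to Page 6, which is why its authority score is 0.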

> 3. The three pages with the highest authority scores are Page 3 (0.59596894), Page 4 (0.54321619), and Page 1 (0.40598191). Yes, this ranking is consistent with the in-degree of the pages: each of these pages has three inbound links, the most of any page in the graph, making them strong authorities.

> 4. The ranking based on authority scores is Page 3, Page 4, and Page 1. The ranking based on damped PageRank is Page 4, Page 3, and Page 1. These results are largely consistent. Both methods identify Page 3 and Page 4 as the top authorities, though the exact order differs slightly. This is expected because both algorithms rank pages based on their connectivity, but they use slightly different methods to calculate the importance of each page.
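
The rankings quoted in these answers can be read off the score vectors programmatically. The short sketch below uses the hub and authority values copied from the HITS output printed above. The helper `top_pages` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Scores copied from the HITS output printed above.
hub_scores = np.array([0.33649127, 0.60496025, 0.27303459,
                       0.47123038, 0.44601115, 0.1589491])
authority_scores = np.array([0.40598191, 0.31623732, 0.59596894,
                             0.54321619, 0.29139291, 0.0])

def top_pages(scores, k=3):
    """Return the 1-based page numbers of the k largest scores, highest first."""
    order = np.argsort(scores)[::-1]
    return [int(i) + 1 for i in order[:k]]

top_hubs = top_pages(hub_scores)
top_authorities = top_pages(authority_scores)
print("Top hubs:", top_hubs)                # [2, 4, 5]
print("Top authorities:", top_authorities)  # [3, 4, 1]
```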


#### Copyright 2021, 2022, 2023, 2024 Stephen F Elston. All rights reserved.