# Assignment 04
## Web Search
## CSCI E-108
### Luciano Carvalho

> **Instructions:** For this assignment you will complete the exercises shown. Most exercises involve creating and executing some Python code. Additionally, most exercises have questions for you to answer. You can answer questions by creating a Markdown cell and writing your answer. If you are not familiar with Markdown, you can find a brief tutorial [here](https://www.markdownguide.org/cheat-sheet/).     

In this assignment you will gain some experience and insight into how web search algorithms work. Specifically, you will implement versions of three algorithms: simple PageRank, damped PageRank, and the HITS algorithm. All three of these algorithms use a **directed graph model** of the web.   

The small data examples and coding methods used here are not directly scalable to web sized problems. Rather, the point is for you to understand the basic characteristics of these web search algorithms. Web scale searching requires massive resources not readily available to most people.

## Simple PageRank Example

To get a feeling for the basics of the PageRank algorithm you will create and test simple code.

As a first step, execute the code in the cell below to import the packages you will need for the exercises.

In [1]:
import numpy as np

### Directed Graph of Web Pages

We will start with a simple example. Figure 1 shows a set of web pages and their hyperlinks. This is a **directed graph** with the **pages as nodes** and the **hyperlinks as the directed edges**. This graph is **complete** in the sense used here: every page is accessible from any other page, possibly with visits to intermediate nodes required (in graph theory this property is called *strongly connected*).  

<img src="../images/Web1.png" alt="Drawing" style="width:500px; height:400px"/>
<center>Figure 1: A small set of web pages</center>

The directed edges of the graph define the association between the nodes. For the **association matrix**, a directed edge, or hyperlink, runs from a node's column to the terminal node's row. The association is binary. The directed edge either exists or it does not.
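As a minimal sketch of this column-to-row convention, the snippet below builds an association matrix from an edge list. The 3-page graph, the `edges` list, and the name `A_toy` are hypothetical and are not taken from Figure 1.

```python
import numpy as np

# Hypothetical 3-page graph, given as (source, destination) pairs with 0-based ids.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n_pages = 3

# a[i, j] = 1 when page j links to page i (column = source, row = destination).
A_toy = np.zeros((n_pages, n_pages), dtype=int)
for src, dst in edges:
    A_toy[dst, src] = 1

print(A_toy)
print("In degree (row sums):    ", A_toy.sum(axis=1))
print("Out degree (column sums):", A_toy.sum(axis=0))
```

With this layout the row sums count incoming links and the column sums count outgoing links.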

> **Exercise 04-1:** In the cell below you will create the association matrix and the initial page probability vector. Do the following:  
> 1. Create the association matrix, $A$, using [numpy.array](https://numpy.org/doc/stable/reference/generated/numpy.array.html). This matrix is constructed with a 1 where a page in a column has a directed edge linking to another page in a row, and 0 elsewhere. In matrix notation, the element $a_{i,j}$ indicates the presence or absence of a directed edge from node $n_j$ to node $n_i$.   
> 2. Print the shape of your association matrix as a check.
> 3. Print the in degree and out degree of each node in your association matrix, using [numpy.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html). Set the argument `axis` to 1 to sum across rows and 0 to sum down columns.

In [8]:
# Association matrix based on the provided graph
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# Print the shape of the association matrix
print("Shape of the association matrix:", A.shape)

# Calculate the in-degree and out-degree of each node.
# Columns are source pages and rows are destination pages, so row sums
# (axis=1) give the in degree and column sums (axis=0) give the out degree.
in_degree = np.sum(A, axis=1)
out_degree = np.sum(A, axis=0)

print("In-Degree of each node:", in_degree)
print("Out-Degree of each node:", out_degree)


Shape of the association matrix: (5, 5)
In-Degree of each node: [3 3 1 2 2]
Out-Degree of each node: [2 2 2 3 2]


> Are the out degree and in degree you computed from the association matrix consistent with the graph in Figure 1?    
> **End of exercise.**

> **Answer:** Yes, the out-degree and in-degree values I computed from the association matrix are consistent with the graph in Figure 1. The in-degrees and out-degrees match the links shown in the graph.

### Apply Simple Page Rank

The normalized transition probability matrix, $M$, is then computed from the association matrix, $A$:

$$M = A D^{-1}$$

Where, $D^{-1}$ is the inverse of a matrix with the out degree values on the diagonal and zeros elsewhere.  

You can see from the foregoing that $M$ distributes the influence of the page by the inverse of the out degree. In other words, the influence is inversely weighted by the number of pages each page links to.
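A small worked example may help make $M = A D^{-1}$ concrete. The sketch below uses a hypothetical 3-page association matrix (`A_toy`, not the matrix of Figure 1) and assumes every page has at least one outgoing link, so no zero-division guard is needed.

```python
import numpy as np

# Hypothetical 3-page association matrix (column = source page, row = destination).
A_toy = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
])

out_degree = A_toy.sum(axis=0)       # column sums: [2, 1, 1]
D_inv = np.diag(1.0 / out_degree)    # safe here only because no out degree is 0
M_toy = A_toy @ D_inv                # M = A D^{-1}

# Right-multiplying by a diagonal matrix scales each column, so this is the
# same as dividing every column of A by its own sum.
assert np.allclose(M_toy, A_toy / out_degree)

print(M_toy)
print("Column sums:", M_toy.sum(axis=0))
```

Page 1 links to two pages, so each of its links carries a weight of 0.5, while pages 2 and 3 each pass their full weight along a single link.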

> **Exercise 04-2:** You will now compute the normalized transition matrix, $M$. To do so create a function called `norm_association` with the association matrix as the argument. Do the following:
> 1. Create your function `norm_association` which will do the following:  
>    - Compute the sum of the columns of the association matrix using `numpy.sum` with the `axis=0` argument to sum along columns.
>    - Compute the inverse of the column sums as a vector. Be sure to avoid zero divides, which will occur in subsequent exercises. Use the `where` argument of [numpy.divide](https://numpy.org/doc/stable/reference/generated/numpy.divide.html) to do so. If the column sum is 0 the inverse is set to 0.0.   
>    - Create a square diagonal matrix from the inverse column sums using [numpy.diag](https://numpy.org/doc/stable/reference/generated/numpy.diag.html) to form the inverse out degree diagonal matrix.
>    - Finally, return the matrix product of the association matrix and the (diagonal) inverse out degree matrix using [numpy.matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html).  
> 2. Save and print the normalized transition matrix.  
> 3. Compute and print the column sums of the normalized transition matrix to ensure they all add to 1.0.
> Execute the code you have created and examine the results.

In [9]:
def norm_association(A):
    '''Function to normalize the association matrix by out degree.
    The function accounts for cases where the column sum is 0'''

    # 1.1 Compute the sum of the columns of the association matrix
    col_sums = np.sum(A, axis=0)

    # 1.2 Compute the inverse of the column sums, avoiding divide by zero
    inv_col_sums = np.divide(1, col_sums, where=col_sums!=0, out=np.zeros_like(col_sums, dtype=float))

    # 1.3 Create a square diagonal matrix from the inverse column sums
    D_inv = np.diag(inv_col_sums)

    # 1.4 Return the matrix product of the association matrix and the inverse out degree matrix
    M = np.matmul(A, D_inv)

    return M

# Code to execute the function and check the column sums

# Define the association matrix A
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# 2. Save and print the normalized transition matrix
M = norm_association(A)
print("Normalized transition matrix:\n", M)

# 3. Compute and print the column sums of the normalized transition matrix
col_sums_M = np.sum(M, axis=0)
print("Column sums of the normalized transition matrix:", col_sums_M)


Normalized transition matrix:
 [[0.         0.5        0.5        0.33333333 0.        ]
 [0.5        0.         0.         0.33333333 0.5       ]
 [0.         0.         0.         0.33333333 0.        ]
 [0.         0.5        0.         0.         0.5       ]
 [0.5        0.         0.5        0.         0.        ]]
Column sums of the normalized transition matrix: [1. 1. 1. 1. 1.]


> Provide short answers to the following questions:     
> 1. Do the number of non-zero values in each column match the out degree for the corresponding node?     
> 2. Are all the column sums 1.0, and why is this required for a transition probability matrix?    
> **End of exercise.**

> **Answers:**

> 1. Yes, the number of non-zero values in each column matches the out degree for the corresponding node.

> 2. Yes, all the column sums are 1.0. This is required because in a transition probability matrix, each column represents the distribution of probability from one node to all other nodes. The probabilities leaving each node must sum to 1 so that the total probability is conserved at every transition.


### Computing the Simple Page Rank

With the transition probability matrix, $M$, computed, it is time to investigate the convergence of the PageRank algorithm. You can think of the PageRank algorithm as a series of transitions of a Markov chain. Given the transition probability matrix, $M$, the update, or single Markov transition, of the page probabilities, $p_i$, is computed:

$$p_i = M p_{i-1}$$

The Markov chain can be executed for a great many transitions. The result of $n$ transitions, starting from an initial set of page probabilities, $p_0$, can be written:  

$$p_n = M^n p_{0}$$

At convergence the page probabilities, $p_n$, approach a constant or **steady state** value. The values of this steady state probability vector are the PageRanks of the web pages.  
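The equivalence between repeated single transitions and the closed form $p_n = M^n p_0$ can be checked numerically. The sketch below assumes a hypothetical 3-page column-stochastic matrix `M_toy`; the choice of 20 transitions is arbitrary.

```python
import numpy as np

# Hypothetical column-stochastic matrix for a 3-page graph.
M_toy = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
p0 = np.full(3, 1.0 / 3.0)

# Apply the single-step update p_i = M p_{i-1} twenty times.
p = p0.copy()
for _ in range(20):
    p = M_toy @ p

# The same result from the closed form p_n = M^n p_0.
p_closed = np.linalg.matrix_power(M_toy, 20) @ p0
assert np.allclose(p, p_closed)

print("Probabilities after 20 transitions:", p)
print("Sum of probabilities:", p.sum())
```

For this toy matrix the steady state solves $p = M p$, which gives $(0.4, 0.2, 0.4)$; the iterated vector should be close to these values, and its sum stays at 1 because every column of `M_toy` sums to 1.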

> **Exercise 04-3:** You will now create and execute code with the goal of getting a feel for how the page probabilities change for a single transition of a Markov process. To accomplish this task you will create a function called `transition` with arguments of the normalized transition probability matrix and the vector of page probabilities. Specifically you will:   
> 1. Complete the function `transition` which uses [numpy.dot](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) to compute the product of the transition matrix and the page probability vector.  
> 2. Define a state vector p, with uniform initial probabilities. Print this initial state vector.
> 3. Execute the `transition` function on the normalized transition probability matrix and vector of initial page probabilities you have created, saving the result to a new variable name. Print the result.
> 4. Print the Euclidean (L2) norm of the difference between the initial page probabilities and the updated page probabilities.
> 5. Print the sum of the page probabilities computed with `transition`.

In [11]:
def transition(transition_probs, probs):
    '''Function to compute the probabilities resulting from a
    single transition of a Markov process'''

    # 1. compute the product of the transition matrix and the page prob vector
    return np.dot(transition_probs, probs)

# Compute probabilities after first state transition and print the summaries

# Define the association matrix A
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# Normalize the association matrix
M = norm_association(A)

# 2. Define a state vector p, with uniform initial probabilities
p0 = np.full(M.shape[0], 1.0 / M.shape[0])
print("Initial state vector p0:", p0)

# 3. Execute the transition function
p1 = transition(M, p0)
print("Updated state vector p1:", p1)

# 4. Print the Euclidean (L2) norm of the difference
l2_norm_diff = np.linalg.norm(p1 - p0)
print("Euclidean (L2) norm of the difference:", l2_norm_diff)

# 5. Print the sum of the page probabilities computed with transition
sum_p1 = np.sum(p1)
print("Sum of the page probabilities:", sum_p1)


Initial state vector p0: [0.2 0.2 0.2 0.2 0.2]
Updated state vector p1: [0.26666667 0.26666667 0.06666667 0.2        0.2       ]
Euclidean (L2) norm of the difference: 0.16329931618554527
Sum of the page probabilities: 1.0


> Provide short answers to the following questions:   
> 1. Is the sum of the page probabilities equal to 1.0 as it should be?       
> 2. Considering the in degree of the pages, are the relative changes in the page probabilities what you would expect and why?    
> **End of exercise.**

> **Answers:**

> 1. Yes, the sum of the page probabilities is equal to 1.0, which is expected because every column of the transition probability matrix sums to 1.0.

> 2. Yes, the relative changes in the page probabilities are what I would expect. Pages with higher in-degrees, like Page 1 and Page 2, see an increase in their probabilities because they receive more links, indicating they are more important or influential in the network. Pages with lower in-degrees, like Page 3, have a lower probability, reflecting their lower influence.

> **Exercise 04-4:** You will continue with computing transitions of the Markov chain. Use the `transition` function with the normalized transition probability matrix and the page probability vector computed from the first transition as arguments. Your code must do the following:  
> 1. Compute and print the resulting page probabilities of the second transition.
> 2. Compute and print the Euclidean (L2) norm of the difference between the page probabilities before and after the transition.
> 3. Compute and print the sum of the page probabilities.
> 4. Display the page probabilities.

In [12]:
# Compute the second transition of the Markov chain

# 1. Compute and print the resulting page probabilities of the second transition
p2 = transition(M, p1)
print("Updated state vector p2:", p2)  # Bullet 1

# 2. Compute and print the Euclidean (L2) norm of the difference
l2_norm_diff_2 = np.linalg.norm(p2 - p1)
print("Euclidean (L2) norm of the difference (second transition):", l2_norm_diff_2)  # Bullet 2

# 3. Compute and print the sum of the page probabilities
sum_p2 = np.sum(p2)
print("Sum of the page probabilities (second transition):", sum_p2)  # Bullet 3

# 4. Display the page probabilities
print("Page probabilities after second transition:", p2)  # Bullet 4


Updated state vector p2: [0.23333333 0.3        0.06666667 0.23333333 0.16666667]
Euclidean (L2) norm of the difference (second transition): 0.0666666666666667
Sum of the page probabilities (second transition): 1.0000000000000002
Page probabilities after second transition: [0.23333333 0.3        0.06666667 0.23333333 0.16666667]


> Note the difference between the Euclidean norms of the differences for the first and second transition calculations. Does the change in this difference from one step to the next indicate the algorithm is converging to the steady state probabilities?  
> **End of exercise.**

> **Answer:** Yes, the change in the Euclidean norms of the differences between the first and second transition calculations indicates that the algorithm is converging to the steady state probabilities. The Euclidean norm of the difference decreased from approximately 0.1633 (first transition) to 0.0667 (second transition), showing that the changes in the page probabilities are becoming smaller with each transition. This decreasing trend suggests that the page probabilities are stabilizing and approaching their steady state values.

> **Exercise 04-5:** The question now is how does this simplified version of page rank converge with more iterations? To find out, do the following:   
> 1. Create a function `pagerank1` with the following arguments: the normalized transition matrix, the initial page probabilities, and a convergence threshold with a default value of 0.01. The function does the following:  
>    - Initialize a Euclidean distance norm variable to 1.0 and the resulting page probabilities to a vector of 0.0 values of length equal to the dimension of the transition matrix.   
>    - Set a loop counter to 1.  
>    - Use a `while` loop that continues as long as the Euclidean distance norm is greater than the threshold value AND the loop counter is less than 50. Inside this loop do the following:  
>      1. Update the page probabilities using the `transition` function you created.
>      2. Compute the Euclidean norm of the difference between the previous and the updated page probabilities following the transition.   
>      3. Print the value of the loop counter and the Euclidean norm of the difference.
>      4. Copy the updated page probability vector into the input page probability vector.  
>      5. Increment the loop counter by 1.
>    - Return the page probabilities at convergence.  
> 2. Execute your `pagerank1` function using the transition matrix and initial page probability vector.

In [13]:
# 1. Define the PageRank function with a convergence threshold
def pagerank1(M, in_probs, threshold=0.01):
    # Initialize variables
    euclidean_dist = 1.0
    page_probabilities = np.array([0.0] * len(M))
    i = 1

    # Loop until convergence or max iterations
    while euclidean_dist > threshold and i < 50:
        # 1.1. Update the page probabilities
        new_probs = transition(M, in_probs)

        # 1.2. Compute the Euclidean norm of the difference
        euclidean_dist = np.linalg.norm(new_probs - in_probs)

        # 1.3. Print the value of the loop counter and the Euclidean norm of the difference
        print(f"Iteration {i}: Euclidean norm of the difference: {euclidean_dist}")

        # 1.4. Copy the updated page probability vector into the input page probability vector
        in_probs = new_probs

        # 1.5. Increment the loop counter
        i += 1

    # Return the page probabilities at convergence
    return in_probs

# Compute probabilities after a larger number of state transitions
# Define the association matrix A
A = np.array([
    [0, 1, 1, 1, 0],  # Page 1
    [1, 0, 0, 1, 1],  # Page 2
    [0, 0, 0, 1, 0],  # Page 3
    [0, 1, 0, 0, 1],  # Page 4
    [1, 0, 1, 0, 0]   # Page 5
])

# Normalize the association matrix
M = norm_association(A)

# Define the initial state vector p with uniform initial probabilities
p0 = np.full(M.shape[0], 1.0 / M.shape[0])

# 2. Execute the pagerank1 function
print('Final state probabilities: ' + str(pagerank1(M, p0)))


Iteration 1: Euclidean norm of the difference: 0.16329931618554527
Iteration 2: Euclidean norm of the difference: 0.0666666666666667
Iteration 3: Euclidean norm of the difference: 0.04082482904638634
Iteration 4: Euclidean norm of the difference: 0.0285989726138528
Iteration 5: Euclidean norm of the difference: 0.014829275350043464
Iteration 6: Euclidean norm of the difference: 0.006843400694025186
Final state probabilities: [0.25300926 0.28472222 0.07546296 0.22523148 0.16157407]


> Provide short answers for the following questions:  
> 1. Judging from the rate of decline of the Euclidean distances, does the algorithm appear to converge rapidly and why?
> 2. Does the rank order of the computed page probabilities make sense given the relative degree of the pages of the directed graph?
> **End of exercise.**

> **Answers:**

> 1. Yes, the algorithm appears to converge rapidly. The Euclidean distances between iterations decrease significantly with each step, indicating that the page probabilities are quickly stabilizing to their steady state values.

> 2. Yes, the rank order of the computed page probabilities makes sense. Pages with higher in-degrees, which receive more links, have higher PageRank values, reflecting their greater importance or influence in the graph.

### Page Rank by Eigendecomposition

Consider the relationship:     

$$p_i = M p_{i-1}$$      

At convergence $p_i = p_{i-1}$, which suggests an eigenvalue-eigenvector problem for some eigenvalue, $\lambda$:    

$$\lambda p^* = M p^*$$  

For the transition probability matrix with normalized columns, the largest eigenvalue has a magnitude 1.0. The eigenvector associated with this eigenvalue is the PageRank vector. To demonstrate this point, the code in the cell below does the following:     
1. Compute the eigendecomposition using [numpy.linalg.eig](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html).     
2. Print the magnitude of the eigenvalues.  
3. Get the eigenvector associated with the first, largest, eigenvalue.     
4. Normalize the eigenvector so the values sum to 1.0 and display the results.    

In [14]:
# 1. Compute the eigendecomposition of the transition matrix M
eigenvalues, eigenvectors = np.linalg.eig(M)

# 2. Print the magnitude of the eigenvalues
print("Magnitude of the eigenvalues:", np.abs(eigenvalues))

# 3. Get the eigenvector associated with the first, largest, eigenvalue
principal_eigenvector = eigenvectors[:, np.argmax(np.abs(eigenvalues))]

# 4. Normalize the eigenvector so the values sum to 1.0
principal_eigenvector = principal_eigenvector / np.sum(principal_eigenvector)

# Display the results
print("PageRank vector from eigendecomposition:", principal_eigenvector.real)


Magnitude of the eigenvalues: [1.00000000e+00 6.04992664e-01 5.86819216e-01 5.86819216e-01
 3.95038058e-17]
PageRank vector from eigendecomposition: [0.25373134 0.28358209 0.07462687 0.2238806  0.1641791 ]


> **Exercise 04-6:** Do the PageRanks computed by the two methods agree to the precision expected with the iterative method?    
> **End of exercise.**

> **Answer:**
>>
>> Iterative Method Final State Probabilities:
>>
>> [0.25300926, 0.28472222, 0.07546296, 0.22523148, 0.16157407]
>>
>> Eigendecomposition PageRank Vector:
>>
>> [0.25373134, 0.28358209, 0.07462687, 0.2238806, 0.1641791]
>
> The values from both methods agree to two or three decimal places; the largest difference is about 0.003 (Page 5). This is the agreement expected from the iterative method, which stops as soon as the change between successive iterations falls below the 0.01 threshold, so the remaining gap reflects the convergence criterion rather than floating-point error. Overall, both methods give consistent PageRank values.

## A More Complicated Example   

You will now work with a more complicated example. The graph of 6 web pages, shown in Figure 2, is no longer complete. The out degree of page 6 is 0, so the page is a dead end. A random surfer transitioning to page 6 will have no escape, a **spider trap**!

<img src="../images/Web2.png" alt="Drawing" style="width:500px; height:500px"/>
<center>Figure 2: A small set of web pages with a dead end</center>
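To see what a page with out degree 0 does to the simple update $p_i = M p_{i-1}$, consider the sketch below. The 3-page matrix `M_leaky` is hypothetical (it is not the graph in the figure); its third column is all zeros because that page has no outgoing links.

```python
import numpy as np

# Hypothetical 3-page graph in which page 3 (index 2) has no outgoing links,
# so its column in the transition matrix is all zeros.
M_leaky = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
p = np.full(3, 1.0 / 3.0)

for step in range(1, 6):
    p = M_leaky @ p
    print(f"step {step}: p = {p}, total probability = {p.sum():.4f}")
```

Whatever probability reaches the dead-end page is dropped at the next transition, so the total probability shrinks at every step instead of staying at 1.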

> **Exercise 04-7:** You will now create both the normalized transition matrix and the initial page probability vector for the graph of Figure 2. Do the following:  
> 1. Create the association matrix and save it to a named variable, `A_deadend`. You will need this association matrix for later exercises.
> 2. Normalize the association matrix using your `norm_association` function. Name your transition matrix `M_deadend`. Print the result.
> 3. Create a vector containing the uniformly distributed initial probability values. Save the result as `p_deadend` and print it.   

In [None]:
## Create the Association Matrix
## Put your code below



## Normalize the association matrix
## Put your code below




## Create the equal probability starting values
## Put your code below




> Examine your results. Are the 0 values for the transition probabilities of page 6 consistent with the graph of these pages, and why?      
> **End of exercise.**     

> **Answers:**        

### Apply Simple PageRank Algorithm

> **Exercise 04-8:** What happens if you apply the simplified PageRank algorithm to the pages on a graph that is not complete, like the one shown in Figure 2? To find out, execute your `pagerank1` function with arguments `M_deadend`, `p_deadend` and `threshold=0.00001`. The smaller threshold value is to ensure convergence.

In [None]:
## Put your code below



> Now, create and execute code to compute and display the eigenvalues and eigenvectors of `M_deadend`. Then display the magnitudes of the eigenvalues, and the normalized values of the first eigenvector (column 0).    

In [None]:
## Put your code below




> Answer the following questions:  
> 1. Examine the page probabilities computed with the iterative methods. Do these PageRank values sum to 1.0 and why is this outcome a problem?       
> 2. Examine the eigenvalues of the `M_deadend` matrix. What problem can you see with these eigenvalues?
> 3. Notice the PageRank values have only 0 values. What does this tell you about the convergence of a random surfer on this graph?             
> **End of exercise.**

> **Answers:**   
> 1.           
> 2.              
> 3.          

### Damped PageRank Algorithm

It is clear from the results of the foregoing exercise that the simple PageRank algorithm does not converge to a usable set of page probabilities when faced with a graph that is not complete. Fortunately, there is a simple fix: add a damping term. You can think of the damping term as allowing a random surfer to make an arbitrary transition or jump with some small probability. These random jumps help the random surfer to better explore the graph and to escape from spider traps. The jump probability for each page is a function of the damping factor $d$:

$$Jump\ Probability = \frac{(1-d)}{n}$$

Where $n$ is the dimension of the transition probability matrix.

The updated page probabilities, $p_i$, are then computed with the damped PageRank algorithm as:   

$$p_{i} = d * M p_{i-1} + \frac{(1-d)}{n}$$

Where $M$ is the transition probability matrix, $p_{i-1}$ is the vector of page probabilities from the previous iteration, and the jump probability is added to every element of the result.   
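The damped update can be sketched on a toy problem. The function name `damped_step` and the 3-page matrix `M_leaky` (whose third page is a dead end) are hypothetical; 100 iterations is an arbitrary count that is more than enough for this small example.

```python
import numpy as np

def damped_step(M, p, d=0.85):
    """One damped update: d * M p plus a uniform jump term (1 - d) / n."""
    n = M.shape[0]
    return d * (M @ p) + (1.0 - d) / n

# Hypothetical 3-page transition matrix; page 3 (index 2) is a dead end.
M_leaky = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
p = np.full(3, 1.0 / 3.0)

for _ in range(100):
    p = damped_step(M_leaky, p)

print("Damped page probabilities:", p)
print("Sum of probabilities:", p.sum())
```

Unlike the undamped iteration, the values settle at a nonzero vector, and the dead-end page ends up with the largest value because both other pages link to it. The sum stays below 1 because the probability that reaches the dead end is still lost at each step.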

> **Exercise 04-9:** To implement the PageRank algorithm with a damping factor do the following:  
> 1. Create a `transition_damped` function with arguments, the transition probability matrix, the initial page probabilities, and the damping factor, $d=0.85$, which does the following:  
>   - Compute the updated page probabilities by computing the inner (dot) product of the transition probability matrix with the page probabilities and then multiplying by the damping factor, `d`.    
>   - Compute the jump probabilities vector of length the dimension of the transition matrix. Note: the jump probabilities are constant, so you can create code that only computes them once if you so choose.      
>   - Return the sum of the damped page probabilities and the jump probabilities.  
> 2. Create a `pagerank_damped` function. This function is identical to the `pagerank1` function you already created except that it uses the `transition_damped` function in place of the `transition` function.  
> 3. Call your `pagerank_damped` function using arguments of `M_deadend`, `p_deadend` and `threshold=0.0001` and display the final PageRank vector. Then compute and display the sum of the values in the PageRank vector.

In [None]:
## Add a damping factor to the transition
def transition_damped(transition_probs, probs, d=0.85):
    '''Function to compute the probabilities resulting from a
    single transition of a Markov process including a damping
    factor to deal with dead ends'''
    ## Put your code below



def pagerank_damped(M, in_probs, d=0.85, threshold = 0.01):
    ## function for the PageRank algorithm using the damped transition algorithm
    ## Put your code below



    return page_probabilities

## Execute your function
damped_rank = pagerank_damped(M_deadend,p_deadend,threshold=0.0001)
print(f"Sum of the PageRank values = {sum(damped_rank)}")
print('Final state probabilities: ' + str(damped_rank))

> Provide short answers to the following questions:   
> 1. Examine the final page probabilities. Does the rank of these page probabilities make sense given the in degree of the pages of this graph?   
> 2. Why is it reasonable that the sum of the PageRanks is $< 1.0$?

> **Answers:**     
> 1.            
> 2.             

> Next you will examine some properties of the damped matrix, $M$. You will do so by the following steps:    
> 4. Create a NumPy array of $M$ including the damping, with a damping factor, $d = 0.85$. Display this matrix.  
> 5. Compute and display the column sums of the damped matrix.  
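One way to picture a damped matrix is to add the jump probability $(1-d)/n$ to every element of $d M$. The sketch below does this for a hypothetical 3-page matrix `M_leaky` with a dead end; the name `M_damped` is an assumption made for illustration.

```python
import numpy as np

d = 0.85

# Hypothetical 3-page transition matrix; page 3 (index 2) is a dead end.
M_leaky = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
n = M_leaky.shape[0]

# Every element receives the same jump probability (1 - d) / n.
M_damped = d * M_leaky + (1.0 - d) / n

print(M_damped)
print("Column sums:", M_damped.sum(axis=0))
```

Every element of `M_damped` is positive, so each page can be reached from every other page in one step. Applied to a vector whose values sum to 1, `M_damped` reproduces the damped update; the column for the dead-end page sums to only $1 - d$ rather than 1.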

In [None]:
## Put your code below










> Examine your results and answer these questions:   
> 3. How does this array allow the random surfer to 'teleport' (or transition) from any page to any other page even when there is no directed edge?   
> 4. Why are the column sums reasonable despite the obvious deviation from axiomatic probability theory?      
> **End of exercise.**

> **Answers:**    
> 3.       
> 4.        

## Hubs, Authorities, and the HITS Algorithm  

The hubs and authorities model is an alternative to PageRank. Rather than using a single metric to rank the importance of web pages, the **HITS** algorithm iteratively updates the **hub score** and **authority score** for each of the pages.

The HITS algorithm updates the authority and hub scores iteratively. The authority score of a page is the sum of the hub scores of the pages that link to it. This is computed by the matrix product of the association matrix and the hubness vector:
$$a = \beta A h$$

The hub score of a page is the sum of the authority scores of the pages it links to. The hub score (hubness) is computed by the matrix product of the transpose of the association matrix and the authority scores:
$$h = \alpha A^T a$$

The algorithm iterates between updates to $a$ and $h$ until convergence. To ensure convergence, you must normalize $a$ and $h$ to have unit Euclidean norm at each iteration. Because of this normalization, $\alpha$ and $\beta$ can be set to a value of 1.0 and effectively ignored.       
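The alternating updates can be sketched on a toy graph. The 3-page association matrix `A_toy` is hypothetical, the scale factors are taken as 1.0, and 50 iterations is an arbitrary count; the sketch also assumes neither score vector collapses to all zeros, which would make the normalization divide by zero.

```python
import numpy as np

# Hypothetical 3-page association matrix (column = source page, row = destination).
A_toy = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
], dtype=float)

hub = np.full(3, 1.0 / 3.0)
auth = np.full(3, 1.0 / 3.0)

for _ in range(50):
    auth = A_toy @ hub                   # authority: sum of hub scores linking in
    auth = auth / np.linalg.norm(auth)   # rescale to unit Euclidean norm
    hub = A_toy.T @ auth                 # hub: sum of authority scores linked to
    hub = hub / np.linalg.norm(hub)

print("hub scores:      ", hub)
print("authority scores:", auth)
```

In this toy graph the page with two incoming links (index 2) gets the highest authority score, and the page that links to two pages (index 0) gets the highest hub score.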

> **Exercise 04-10:** To understand the HITS algorithm you will now create and test code for this algorithm. Follow these steps:  
> 1. Create a function called `HITS` with arguments of the association matrix, initial hub vector, initial authority vector, and the number of iterations of the algorithm to run. This function does the following inside a loop over the number of iterations:  
>    1. Updates the authority vector using the association matrix and the hub vector as arguments to the `transition` function.
>    2. Normalizes the authority vector by using `numpy.divide` with arguments of the updated authority vector and its L2 norm, computed with [numpy.linalg.norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).  
>    3. Updates the hub vector using the transpose of the association matrix and the authority vector as arguments to the `transition` function.
>    4. Normalizes the hub vector by using `numpy.divide` with arguments of the updated hub vector and its L2 norm, computed with `numpy.linalg.norm`.  
> 2. The function returns both the hub and authority vectors
> 3. Initialize an initial hub and authority vector of length the dimension of the association matrix with uniformly distributed values of $\frac{1.0}{dimension(association\ matrix)}$.  
> 4. Execute your function using the association matrix for the 6-page network and the initial hub and authority vectors as arguments.
> 5. Display the resulting hub and authority vectors.  

In [None]:
def HITS(association, hub, authority, iters=100):
    ## Put your code below




    return hub, authority

## Compute the initial hub and authority vectors
## Put your code below



## Execute your function
hubness, authority = HITS(A_deadend, hub_start, auth_start)
print(f"The hub ranks = {hubness}")
print(f"The authority ranks = {authority}")

> Examine your results and answer the following questions:
> 1. Which three of the pages have the highest hub scores? Considering the graph of the pages, is this ordering consistent?  
> 2. Notice the last value of the hub scores. Is this value expected given the graph of the pages?
> 3. Which three of the pages have the highest authority? Given the in degree of the pages, is this ranking consistent?  
> 4. Compare the ranking of the pages based on authority with that found with damped PageRank. Are these results consistent?
> **End of exercise.**

> **Answers:**     
> 1.       
> 2.       
> 3.          
> 4.         

#### Copyright 2021, 2022, 2023, 2024 Stephen F Elston. All rights reserved.