# <a name="homework5">Python Lab 5: Due Tuesday, April 2 at 11:59PM</a>

---

## <a name="HW1Inst">Instructions</a>

---

1. <font color="tomato">Either click the COPY TO DRIVE button or use the menu File/Save/Save a Copy to save your own version of the notebook in your own folder in Google Drive.</font>

  - If you do not have a Google account, you will need to create a Google account in order to create your own individual copy of this notebook.
  - By default, the notebook will be saved in a folder named **Colab Notebooks** in your drive.
  - Feel free to name and store the notebook wherever you like!

2. <font color="mediumseagreen">After you have copied the notebook into your Drive, you can begin editing and saving your work.</font>


**Upload your completed assignment into Canvas as a Google Colab (or Jupyter) Notebook with file extension `.ipynb`.**

Upload your file into the Python Lab 5 assignment in Canvas before <font color="dodgerblue">**11:59PM on Tuesday, April 2.**</font>

**You must provide supporting work for all of your answers.**

  - <font color="dodgerblue">That work will include using Python code cells.</font>
    - **Feel free to insert as much Python code as you like.**
    - **Code with incomplete or missing explanations of why the output is useful may not receive full credit.**
  - <font color="dodgerblue">Your work will also involve typing matrices in text cells with LaTeX.</font>


## <a name="policy">Important Academic Policies</a>

---

- **Be your own work.** Though you may collaborate with others, everyone is responsible for writing up the work in their own way using their own methods. Plagiarism of any form is not tolerated.
- **Be complete.** You must provide all work and/or explanations needed to find the solution. Answers with insufficient or incomplete supporting work may lose credit.
- **Adhere to the Code of Academic Honesty.**
- **Be clearly written (and legible if written).** Your solution to a problem must be clear and written in complete sentences. You may lose credit for work that is unclear or hard to follow.

## <font color="dodgerblue">Let me know if you need help, and **GOOD LUCK!!!**</font>

----

# <a name="HW5before">Before Working on This Assignment</a>

---

Google is the [world's most used search engine](https://www.statista.com/statistics/216573/worldwide-market-share-of-search-engines/). In 1998, Google founders Larry Page and Sergey Brin published ["The Anatomy of a Large-Scale Hypertextual Web Search Engine"](http://infolab.stanford.edu/pub/papers/google.pdf), in which they describe an early version of Google's PageRank algorithm for ranking webpages. The PageRank algorithm was the core of Google's original search engine, and it is highly effective at identifying quality webpages. Google's PageRank algorithm essentially measures the quality of a webpage by considering the following:

- How many other webpages link to that webpage?
- How highly ranked are the webpages that link to it? Incoming links are weighted based on the ranking of the source webpage.

In this lab, we will use directed graphs, Markov chains, eigenvalues and eigenvectors to explore Google's PageRank algorithm. Before beginning this assignment, please look through the Colab Notebook with guided examples and more information about Google's PageRank algorithm.

- <font color="dodgerblue">**First work through the guided example in the notebook [Google-PageRank-Background.ipynb](https://colab.research.google.com/drive/12Jxh3uttt7DHutpfUbuttMuugDAX0mMn?usp=sharing).**</font>
- After working through that example, you are ready to answer the questions below!


# <a name="import">Importing Required Packages and Functions</a>

---

In the code cell below, we import the Python package [NetworkX](https://networkx.org/). NetworkX can be used to analyze the structure and dynamics of networks. We also define a `draw_graph` function that we will use to draw directed graphs that illustrate networks.

- <font color="dodgerblue">**Be sure you run the code cell below each time you work on this Lab in order to access functions in the NetworkX package.**</font>

In [None]:
import networkx as nx  # create and analyze networks
import random  # module to set seed for randomization
import numpy as np

# function to construct graph representing a network
def draw_graph(G):
    nx.draw_circular(G, node_size=1000, with_labels=True, node_color='dodgerblue', connectionstyle='arc3, rad = 0.4')

# <a name="create-net">Creating a Network</a>

---

The Internet can be represented as a directed graph.

- Each node in the graph corresponds to a different webpage.
- Edges are links from one page to another.

In the code cell below, you will generate a random graph that you will use to explore the PageRank algorithm in the questions that follow.

- Your graph will have `n=20` nodes. Each node represents a different webpage.
- Each node has `k=5` out-links.
- The number of in-links varies and is controlled by the parameter `alpha=0.75`.

<br>

<font color="dodgerblue">**Be sure you run the code cell below each time you work on this Lab in order to define graph `G`.**</font>

In [None]:
random.seed(121)  # set seed for randomization to fix output
G = nx.random_k_out_graph(n=20, k=5, alpha=0.75)

draw_graph(G)
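
As a quick sanity check on what `random_k_out_graph` produces, the sketch below builds a much smaller graph and counts the links at each node. The sizes `n=6`, `k=2`, the `seed=0` argument, and the names `toy`, `out_degrees`, and `in_degrees` are illustrative assumptions and are not part of the lab.

```python
import networkx as nx

# Small illustrative graph (hypothetical sizes, not the lab's n=20, k=5)
toy = nx.random_k_out_graph(n=6, k=2, alpha=0.75, seed=0)

# Each node is given exactly k out-links; the in-link counts can differ
out_degrees = dict(toy.out_degree())
in_degrees = dict(toy.in_degree())

print("out-links per node:", out_degrees)
print("in-links per node: ", in_degrees)
```

Note that `random_k_out_graph` returns a multigraph, so two nodes may be joined by more than one link and a node may link to itself. An adjacency matrix built from such a graph can therefore contain entries larger than 1.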

# <a name="HWq1">Question 1: Create the Adjacency Matrix</a>

---



## <a name="HWq1a">Question 1a</a>

---

Use the `to_numpy_array` function in the NetworkX package to generate the adjacency matrix corresponding to the network of webpages in the graph `G` created and sketched above.
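
To see how rows and columns of an adjacency matrix line up with links, here is a minimal sketch on a hand-built graph with three pages. The graph `toy`, its four edges, and the name `A` are made up for illustration and are unrelated to the lab's graph `G`.

```python
import networkx as nx

# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
toy = nx.DiGraph()
toy.add_edges_from([(0, 1), (0, 2), (1, 2), (2, 0)])

# Row i, column j is 1 when page i links to page j, and 0 otherwise
A = nx.to_numpy_array(toy)
print(A)

# Page 0 links to page 1, but page 1 does not link back to page 0
print(A[0, 1], A[1, 0])
```

In this convention the source page indexes the row and the destination page indexes the column, which is why the matrix is generally not symmetric for a directed graph.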

<br>  

### Solution to Question 1a

---

Edit and run the code cell below.

- Be sure you have already worked through the example in [Google-PageRank-Background.ipynb](https://colab.research.google.com/drive/12Jxh3uttt7DHutpfUbuttMuugDAX0mMn?usp=sharing) before beginning this assignment!

<br>  

In [None]:
# replace ?? using the to_numpy_array command
M = ??

## <a name="HWq1b">Question 1b</a>

---

Looking at the graph `G` plotted above, give an entry in the adjacency matrix `M` that will have a value equal to 1.

- In the space below, explain how you determined your answer.
  - For example, "Entry `M[35, 88]` will be equal to 1 since ..."
- Then complete and run the code cell below to check your answer.

### Solution to Question 1b

---

<br>  

Entry `M[??, ??]` will be equal to 1 since ??

<br>  
<br>  


In [None]:
# complete and run to check your answer above
M[??, ??]

# <a name="HW5q2">Question 2: Normalize the Adjacency Matrix</a>

---

Normalize the adjacency matrix relative to the total in each row.

- Entry `M[i, j]` gives the proportion of all out-links from node `i` that go to node `j`.
- The sum of the entries in each row should be 1.
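
The idea of normalizing by row totals can be sketched on a small matrix. The 3-by-3 matrix `A` and the names `row_totals` and `A_norm` below are illustrative assumptions, and this is only one of several ways to carry out the division in NumPy.

```python
import numpy as np

# Hypothetical adjacency matrix of a three-page web
A = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# Total number of out-links for each page, kept as a column so that
# broadcasting divides every entry by the total of its own row
row_totals = A.sum(axis=1, keepdims=True)
A_norm = A / row_totals

print(A_norm)
print(A_norm.sum(axis=1))  # each row total
```

Page 0 has two out-links, so each of them receives weight 0.5, while pages 1 and 2 have a single out-link that keeps weight 1.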

<br>  

## Solution to Question 2

---

- Hint: Use the `M.sum()` method.

<br>  
<br>  

In [None]:
# normalize adjacency matrix relative to row totals
M = ??

# <a name="HW5q3">Question 3: The Transition Matrix</a>

---

Recall a Markov chain consists of:

- State vectors $\mathbf{x}_0$, $\mathbf{x}_1$, $\mathbf{x}_2$, $\ldots$, and
- A transition (or stochastic) matrix $P$ where we organize the probabilities of going from one state to the next.


If we denote the transition matrix `P`:
- `P[i, j]` is the probability of going from node `j` to node `i`.
- The entries of each column of `P` add up to 1.
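
The difference between the row convention of the normalized adjacency matrix and the column convention of a transition matrix can be illustrated on a toy example. The matrix `A_norm` and the name `P_toy` are hypothetical, and taking the transpose is shown here as one way to switch conventions under the assumption that rows of `A_norm` already sum to 1.

```python
import numpy as np

# Hypothetical row-normalized adjacency matrix: row i holds the
# proportions of out-links from page i
A_norm = np.array([[0.0, 0.5, 0.5],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])

# Swapping rows and columns puts the source page in the column index
P_toy = A_norm.T

print(P_toy)
print(P_toy.sum(axis=0))  # each column total
print(P_toy[1, 0])        # probability of moving from page 0 to page 1
```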



## <a name="HW5q3">Question 3a</a>

---

Use the code cell below to convert the normalized adjacency matrix `M` you created in the previous code cell to the transition matrix that is stored as `P`.

### Solution to Question 3a

---

Edit and run the code cell below.

<br>  
<br>  

In [None]:
# Use normalized adjacency matrix M to create transition matrix P
P = ??

## <a name="HWq3b">Question 3b</a>

---

Run the code cell below to print entry `P[7, 8]` to the screen. In the space below, interpret the practical meaning of the output in terms of webpages and links.


### Solution to Question 3b

---

Interpret the practical meaning of the output in one or two sentences. The output should not equal 0.

<br>  
<br>  
<br>  


In [None]:
# run code cell and interpret output in space above
P[7, 8]

# <a name="HW5q4">Question 4: Making a Short Term Prediction</a>

---

Consider an initial state where people online are evenly divided between the 20 webpages (nodes in our graph) that is given by the column vector

$$\mathbf{x}_0 = \begin{bmatrix} 0.05 \\ 0.05 \\ \vdots \\ 0.05 \end{bmatrix}.$$

In the code cell below, the initial state vector is created and stored in `x0`. Each person clicks on one of the outgoing links on their current page, and everyone is redistributed among the 20 webpages.
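
One step of a Markov chain is a single matrix-vector product, $\mathbf{x}_1 = P \mathbf{x}_0$. The sketch below carries this out for a hypothetical three-page web; the matrix `P_toy` and the vectors `x0_toy` and `x1_toy` are illustrative names and values, not the lab's data.

```python
import numpy as np

# Hypothetical column-stochastic transition matrix for three pages
P_toy = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])

# Users start evenly divided among the three pages
x0_toy = np.full(3, 1 / 3)

# After one click, the new distribution is the product P x0
x1_toy = P_toy @ x0_toy

print(x1_toy)
print(x1_toy.sum())  # proportions still account for every user
```

Here one third of the users end on page 0, one sixth on page 1, and one half on page 2, so the total proportion remains 1.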


<font color="dodgerblue">Complete the code cell below to predict what proportion of Internet users are located at which webpage after one iteration.</font>

<br>

## Solution to Question 4

---


<br>  
<br>  

In [None]:
# initially all users equally divided between 20 webpages
x0 = np.full(??, ??) # create a vector with 20 entries, each set equal to 1/20

# compute the state vector after one iteration
x1 = ??
x1

# <a name="HW5q5">Question 5: Making a Long Term Prediction</a>

---

Using the same initial state `x0` created in the previous question (where people online are evenly divided between the 20 webpages), predict the proportion of users that will be at each webpage after 100 iterations.
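
Since each iteration multiplies by the transition matrix once more, the state after $k$ iterations is $\mathbf{x}_k = P^k \mathbf{x}_0$. The sketch below shows two equivalent ways to compute this on a hypothetical three-page web; `P_toy`, `x0_toy`, `x100_power`, and `x100_loop` are illustrative names.

```python
import numpy as np

# Hypothetical column-stochastic transition matrix for three pages
P_toy = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])
x0_toy = np.full(3, 1 / 3)

# Option 1: raise P to the 100th power, then multiply by x0
x100_power = np.linalg.matrix_power(P_toy, 100) @ x0_toy

# Option 2: apply P one step at a time, 100 times
x100_loop = x0_toy.copy()
for _ in range(100):
    x100_loop = P_toy @ x100_loop

print(x100_power)
print(x100_loop)
```

For this toy chain both approaches settle near the proportions 0.4, 0.2, and 0.4, which suggests the distribution has stopped changing from one step to the next.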

<br>  

## Solution to Question 5

---

<font color="dodgerblue">Complete the code cell below to predict what proportion of Internet users are located at which webpage (node) after 100 iterations.</font> Store the result to a vector named `ranks_markov`.

<br>  
<br>  

In [None]:
ranks_markov = ??  # find state vector after 100 iterations
ranks_markov  # print vector to screen

# <a name="HW5q6">Question 6: An Eigenvector Approach</a>

---

A vector $\mathbf{q}$ is a steady state (or equilibrium) vector if it satisfies the matrix equation

$$P \mathbf{q} = \mathbf{q}.$$

Recall that a steady state vector $\mathbf{q}$ is an eigenvector corresponding to the eigenvalue $\lambda = 1$.
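
The sketch below shows one possible way to pull out an eigenvector for $\lambda = 1$ with `np.linalg.eig` on a hypothetical three-page transition matrix. The names `P_toy`, `idx`, `v`, and `q_toy` are assumptions for illustration, and selecting the eigenvalue closest to 1 is just one reasonable way to locate the right column.

```python
import numpy as np

# Hypothetical column-stochastic transition matrix for three pages
P_toy = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])

# Column k of `eigenvectors` pairs with entry k of `eigenvalues`
eigenvalues, eigenvectors = np.linalg.eig(P_toy)

# Locate the eigenvalue closest to 1 and keep the real part of its
# eigenvector (other eigenvalues of this matrix are complex, so NumPy
# returns complex arrays)
idx = np.argmin(np.abs(eigenvalues - 1))
v = np.real(eigenvectors[:, idx])

# Rescale so the entries add up to 1; this also removes any sign flip
q_toy = v / v.sum()

print(eigenvalues)
print(q_toy)
```

Eigenvectors are only determined up to a scalar multiple, so the raw output of `np.linalg.eig` may have negative entries or entries that do not sum to 1 until it is rescaled.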


## <a name="HW5q6a">Question 6a</a>

---

In the code cell below, extract the <font color="dodgerblue">**real part**</font> of the eigenvector corresponding to $\lambda = 1$ and store the result to a vector named `eig_one` that is printed to the screen.

<br>  

### Solution to Question 6a

---

<br>  
<br>  

In [None]:
# enter lines of code to answer
# you likely want to enter a series of commands
# rather than do it all in one line


eig_one = ??
eig_one

## <a name="HW5q6b">Question 6b</a>

---

Normalize the eigenvector corresponding to $\lambda = 1$ so the values of the eigenvector add up to 1. Store the result to a vector named `ranks_eig`.

- Note: Your answer should be consistent with your answer to [Question 5](#HW5q5). If they are not consistent, be sure to check your work before moving on!

<br>  

### Solution to Question 6b

---

<br>  
<br>  


In [None]:
ranks_eig = ??  # normalized steady state vector
ranks_eig  # print steady state vector

# <a name="HW5question7">Question 7: Constructing the Google Matrix</a>

---

The PageRank algorithm accounts for the ability of somebody surfing around online to jump from one page to any other page. The original PageRank algorithm does so as follows:

1. If our graph has $N$ distinct nodes, we define an $N \times N$ matrix we denote `R` to account for randomly jumping from one node to another at each step.
  - If we assume the jumping is random, then at each stage the probability of randomly jumping to any other page is $\frac{1}{N}$.

$$R = \begin{bmatrix} \frac{1}{N} & \frac{1}{N} & \ldots & \frac{1}{N} \\
\frac{1}{N} & \frac{1}{N} & \ldots & \frac{1}{N} \\
\vdots & \vdots & & \vdots \\
\frac{1}{N} & \frac{1}{N} & \ldots & \frac{1}{N} \end{bmatrix}$$

2. Choose a <font color="dodgerblue">**damping factor,** $\alpha$</font>, to account for the theory that a person who is randomly surfing online eventually stops clicking on random links.

  - The damping factor $\alpha$ is the probability, at any step, that a person will continue randomly clicking on links.
  - <font color="dodgerblue">In their original paper, the founders of Google used a damping factor $\alpha=0.85$.</font>
  - Other studies have tested other damping factors, but $\alpha=0.85$ remains the most commonly used value.


3. If we use a damping factor of $\alpha$, the probability of randomly selecting one of the links on the current page is $\alpha$. Otherwise, with probability $1 - \alpha$, a person is equally likely to jump to any other random page where they again begin randomly clicking on links. Implementing this model, we have

$$\color{dodgerblue}{{\large \boxed{G = \alpha P + (1 - \alpha) R}}}$$

  - The resulting matrix $G$ is called a <font color="dodgerblue">**Google matrix**</font>.
  - The matrix $P$ is the transition matrix from the Markov chain model we found earlier.
  - The matrix $R$ accounts for the ability to randomly jump to any page.
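
To make the formula concrete, here is a minimal sketch that assembles a Google matrix for a hypothetical three-page web. The matrix `P_toy` and the names `N`, `R_toy`, `alpha_toy`, and `GM_toy` are illustrative assumptions and do not use the lab's 20-page network.

```python
import numpy as np

# Hypothetical column-stochastic transition matrix for three pages
P_toy = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])
N = P_toy.shape[0]

# Random-jump matrix: every entry equals 1/N
R_toy = np.full((N, N), 1 / N)

# Blend link-following with random jumps using the damping factor
alpha_toy = 0.85
GM_toy = alpha_toy * P_toy + (1 - alpha_toy) * R_toy

print(GM_toy)
print(GM_toy.sum(axis=0))  # each column total
```

Because both `P_toy` and `R_toy` have columns that sum to 1, so does their weighted average, and every entry of `GM_toy` is strictly positive even where `P_toy` has a zero.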

<br>  

<font color="dodgerblue">Complete and run the code cell below to create and store the Google matrix corresponding to our network to a matrix named `GM`.</font>

<br>  

## Solution to Question 7

---

Replace each `??` in the code cell below and run the code cell to answer the question.

<br>  
<br>  


In [None]:
R = np.full((??, ??), ??)  # create 20 by 20 matrix with each entry equal to 1/20
alpha = 0.85  # we set the damping factor to 0.85
GM = ??  # enter matrix formula to compute Google matrix GM

# <a name="HW5q8">Question 8: Steady State Vector of Google Matrix</a>

---

If we denote the Google matrix `GM`, then a vector $\mathbf{q}$ is a steady state (or equilibrium) vector if it satisfies the matrix equation

$$GM \mathbf{q} = \mathbf{q}.$$

Recall that a steady state vector $\mathbf{q}$ is an eigenvector corresponding to the eigenvalue $\lambda = 1$ for the Google matrix `GM`.

## <a name="HW5q8a">Question 8a</a>

---

In the code cell below, extract the <font color="dodgerblue">**real part**</font> of the eigenvector of `GM` corresponding to $\lambda = 1$ and store the result to a vector named `google_one` that is printed to the screen.

<br>  

### Solution to Question 8a

---

<br>  
<br>  

In [None]:
# enter lines of code to answer
# you likely want to enter a series of commands
# rather than do it all in one line


google_one = ??
google_one

## <a name="HW5q8b">Question 8b</a>

---

Normalize the eigenvector of `GM` corresponding to $\lambda = 1$ so the values of the eigenvector add up to 1. Store the result to a vector named `ranks_google`.

<br>  

### Solution to Question 8b

---

<br>  
<br>  


In [None]:
ranks_google = ??  # normalized steady state vector
ranks_google  # print steady state vector

# <a name="HW5q9">Question 9: Checking the Results</a>

---

NetworkX has a built-in `pagerank()` function that implements the original PageRank algorithm to compute an "importance" ranking for each page.

- Run the code cell below to see how the PageRank algorithm ranks the pages in our original graph.
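
As a rough illustration of what this comparison looks like, the sketch below runs `nx.pagerank` on a hypothetical three-page graph and compares it with the normalized eigenvector of that graph's Google matrix. The graph `toy` and all variable names are assumptions for illustration. Note that `nx.pagerank` uses an iterative method with a stopping tolerance, so the two results should agree only approximately.

```python
import networkx as nx
import numpy as np

# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
toy = nx.DiGraph()
toy.add_edges_from([(0, 1), (0, 2), (1, 2), (2, 0)])
nodes = sorted(toy.nodes())

# Built-in PageRank returns a dictionary keyed by node
pr = nx.pagerank(toy, alpha=0.85)
pr_vector = np.array([pr[node] for node in nodes])

# The same ranking from the Google matrix of the toy graph
A = nx.to_numpy_array(toy, nodelist=nodes)
P_toy = (A / A.sum(axis=1, keepdims=True)).T
N = len(nodes)
GM_toy = 0.85 * P_toy + 0.15 * np.full((N, N), 1 / N)
eigenvalues, eigenvectors = np.linalg.eig(GM_toy)
v = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1))])
q_toy = v / v.sum()

print(pr_vector)
print(q_toy)
```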


## Solution to Question 9

---

Nothing to edit in the code cell below. Run the code cell to check your answer from [Question 8b](#HW5q8b).

<br>  
<br>  


In [None]:
ranks_pr = nx.pagerank(G)
ranks_pr

# <a name="HW5q10">Question 10: Summarizing the Results</a>

---

Run the code cell below to print the results of each method to a table.

<br>  

## Solution to Question 10

---

Nothing to edit or comment on. Run the code cell and check to make sure the outputs seem reasonable. If something seems off, double check your work!

<br>  
<br>  


In [None]:
import pandas as pd

s1 = pd.Series(ranks_pr)
s2 = pd.Series(ranks_eig)
s3 = pd.Series(ranks_google)

df = pd.DataFrame(dict(PageRank=s1, Markov=s2, GoogleMatrix=s3))
round(df, 6)

# <a name="HW5q11">Question 11: Most Relevant Pages</a>

---

Which webpage is ranked as the most relevant by PageRank? Which is the second most relevant by PageRank? Which is the third most relevant by PageRank?


## Solution to Question 11

---

1. The most relevant webpage is node <mark>??</mark>.
2. The second most relevant webpage is node <mark>??</mark>.
3. The third most relevant webpage is node <mark>??</mark>.

Run the code cell below to look at the graph and verify these rankings make sense. You do not need to explain why this makes sense.

<br>  
<br>  

In [None]:
draw_graph(G)