# PageRank

In this problem, you have to implement the PageRank algorithm to rank papers based on their references.

PageRank is a link analysis algorithm that assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, in order to measure its relative importance within the set. The algorithm can be applied to any collection of entities that cite or reference one another.
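
For reference, the damped PageRank score is commonly written in the form below. The symbols $d$, $N$, $M(p_i)$, and $L(p_j)$ are notation introduced here for illustration and do not come from the assignment text:

$$
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
$$

Here $N$ is the number of documents, $d$ is the damping factor (0.85 is a common choice), $M(p_i)$ is the set of documents that link to $p_i$, and $L(p_j)$ is the number of outgoing links of $p_j$. Documents with no outgoing links need separate handling, since $L(p_j) = 0$ for them.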

It's recommended to have a look at the following sources to get more familiar with PageRank:

[https://en.wikipedia.org/wiki/PageRank](https://en.wikipedia.org/wiki/PageRank)

[https://towardsdatascience.com/pagerank-3c568a7d2332](https://towardsdatascience.com/pagerank-3c568a7d2332)

## Dataset
You are given a number of papers in the Computer Science field, crawled from [Semantic Scholar](https://www.semanticscholar.org/). Each paper has an ID, a title, references, etc. You can download the dataset from here. For your convenience, we have kept only a limited number of references for each paper.


## Hint
Each paper is a node in the graph. Paper A links to paper B if and only if B is in A's references. (Equivalently, there is a directed edge from A to B.) Note that some papers may not have any incoming or outgoing edges. Don't forget to include such papers as well.
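
The hint above can be sketched roughly as follows. This is a minimal illustration on toy records, not the reference solution: the field names `id` and `references`, the helper names `build_graph` and `in_degrees`, and the choice to drop references that point outside the dataset are all assumptions, so check them against the actual columns of `clean_data.json`.

```python
def build_graph(papers, id_key="id", refs_key="references"):
    """Return (nodes, edges) for a directed citation graph.

    id_key and refs_key are assumed field names (hypothetical).
    Assumption: references to papers outside the dataset are dropped.
    """
    papers = list(papers)
    # Every paper is a node, even if it has no incoming or outgoing edge.
    nodes = {p[id_key] for p in papers}
    edges = set()  # a set collapses duplicate references into one edge
    for p in papers:
        src = p[id_key]
        for dst in p.get(refs_key) or []:  # tolerate a missing or None list
            if dst in nodes:
                edges.add((src, dst))  # edge A -> B: B is in A's references
    return nodes, edges


def in_degrees(nodes, edges):
    """Count incoming edges per node; isolated nodes get 0."""
    indeg = {n: 0 for n in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    return indeg


toy_papers = [
    {"id": "A", "references": ["B", "C"]},
    {"id": "B", "references": ["C"]},
    {"id": "C", "references": []},
    {"id": "D", "references": ["C", "X"]},  # "X" is not in the dataset
    {"id": "E", "references": None},        # isolated paper
]
nodes, edges = build_graph(toy_papers)
indeg = in_degrees(nodes, edges)
most_cited = max(indeg, key=indeg.get)
print(len(nodes), len(edges), most_cited)  # 5 4 C
```

With the notebook's DataFrame, passing `df.to_dict('records')` to `build_graph` may work, provided the key arguments match the real column names.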

### Using libraries such as networkx is not allowed (except in the last part). You have to implement PageRank from scratch. Feel free to add cells when needed.


## Download data, import dependencies

In [None]:
# Download resources https://drive.google.com/drive/folders/1GvUc06eKX2Knf3JP5RCvIJjUkIjE2fTu?usp=share_link
!mkdir -p resources
%cd ./resources
!gdown 1C9l4uWzABZomkZQdAMxETz-b12PcOzBv # clean_data.json
%cd ..

/content/resources
Downloading...
From: https://drive.google.com/uc?id=1C9l4uWzABZomkZQdAMxETz-b12PcOzBv
To: /content/resources/clean_data.json
100% 23.3M/23.3M [00:00<00:00, 54.5MB/s]
/content


In [None]:
import pandas as pd
from tqdm import tqdm
import numpy as np
import json
import networkx as nx
import requests
from time import sleep
tqdm.pandas()

In [None]:
df = pd.read_json('resources/clean_data.json')
print(df.shape)
df.head()

## PageRank
Implement PageRank from scratch!


*   Don't forget to include the damping factor in your implementation.
*   Report the number of nodes and the number of edges of the constructed graph.
*   Identify the node with the maximum number of incoming edges. Which paper corresponds to this node?
*   Report the 10 most important papers according to PageRank.
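
One possible shape for a from-scratch implementation is power iteration, sketched below in plain Python on a toy graph. The function names `pagerank` and `top_k`, the default parameters, and the handling of dangling nodes (spreading their score uniformly over all nodes) are assumptions made for illustration; other conventions exist and will give slightly different scores.

```python
def pagerank(nodes, edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; returns {node: score} summing to 1.

    Assumes edges holds unique (source, target) pairs over the given nodes.
    Dangling nodes (no outgoing edges) share their score uniformly with
    all nodes, which is one common convention, not the only one.
    """
    nodes = list(nodes)
    n = len(nodes)
    if n == 0:
        return {}
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        # Teleportation term plus the evenly spread dangling mass.
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:  # stop once the L1 change is negligible
            break
    return rank


def top_k(rank, k=10):
    """Return the k highest-scoring (node, score) pairs."""
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:k]


toy_nodes = ["A", "B", "C", "D"]
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
scores = pagerank(toy_nodes, toy_edges)
for node, score in top_k(scores, k=3):
    print(node, round(score, 4))
```

In this toy graph, `D` has no incoming edges, so it only ever receives the teleportation share `(1 - damping) / n`, while `C` collects score from three papers and ranks first.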

Look at the most important paper according to PageRank. Recall that the papers cover CS and NLP topics. What is your opinion about this paper? :)



## Networkx
Compute PageRank with networkx. Report the same items as in the previous part and compare the results with your implementation. Explain any differences.
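
A rough sketch of the networkx side is given below, again on a toy graph. The helper names `networkx_pagerank` and `max_abs_difference` are hypothetical. The calls `nx.DiGraph`, `add_nodes_from`, `add_edges_from`, `in_degree`, and `nx.pagerank` (with its `alpha`, `tol`, and `max_iter` parameters) are part of networkx; depending on the installed version, `nx.pagerank` may also require SciPy.

```python
import networkx as nx


def networkx_pagerank(nodes, edges, damping=0.85):
    """Build a DiGraph and return (graph, scores) from nx.pagerank."""
    graph = nx.DiGraph()
    graph.add_nodes_from(nodes)  # keeps isolated papers in the graph
    graph.add_edges_from(edges)
    return graph, nx.pagerank(graph, alpha=damping)


def max_abs_difference(scores_a, scores_b):
    """Largest per-node gap between two score dictionaries."""
    return max(abs(scores_a[v] - scores_b[v]) for v in scores_a)


toy_nodes = ["A", "B", "C", "D"]
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
graph, nx_scores = networkx_pagerank(toy_nodes, toy_edges)

print(graph.number_of_nodes(), graph.number_of_edges())
# in_degree() yields (node, degree) pairs
print(max(graph.in_degree(), key=lambda pair: pair[1]))

# The stopping tolerance is one source of small differences: rerun with a
# much tighter tolerance and measure how far the default result is from it.
tight_scores = nx.pagerank(graph, alpha=0.85, tol=1e-12, max_iter=1000)
print(max_abs_difference(nx_scores, tight_scores))
```

The same `max_abs_difference` helper can be applied to your own scores and the networkx scores. Gaps usually come from the convergence tolerance, the treatment of dangling nodes, or differences in how the graph was built (for example, duplicate edges or references outside the dataset).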

## Utils
Below is the main function we used to get the papers, in case you were wondering. You may want to use it to get more information about the papers.

In [None]:
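# Fields are comma-separated. For more information, see https://api.semanticscholar.org/api-docs/graph
def request_papers_by_id(IDs, fields='title,url,year,fieldsOfStudy,citationCount,referenceCount'):
    """Fetch the requested fields for each paper ID from the Semantic Scholar Graph API."""
    papers = []
    for paper_id in tqdm(IDs):  # renamed from `id` to avoid shadowing the built-in
        response = requests.get(
            f'https://api.semanticscholar.org/graph/v1/paper/{paper_id}?fields={fields}',
            timeout=30,  # do not hang forever on a stalled connection
        )
        # No status check here: error payloads are appended as returned.
        papers.append(response.json())
        # sleep(3.1)  # uncomment to pause between requests if you get rate-limited
    return papers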