This repo is my implementation of the PageRank Algorithm.
Currently, I took the liberty to crawl the VJTI website. One can change the code in index.js
and easily crawl other websites too.
_ | _ |
---|---|
Name | Ameya S. Daddikar |
College I.D. | 161070015 |
Department | Computer Engineering |
Year & Programme | Final Year B.Tech. |
npm install #install dependencies
node index.js # make sure that the node version is at least v10.15.1
Package Name | Usage |
---|---|
async-mutex | Safely handle async HTTP crawler requests by locking the _linkQueue object. |
jsdom | Used by src/linkcrawler.js to extract links from the retrieved HTML page. |
uuid | Generate unique ID for each PageRanker instance to avoid event-name colisions. |
- Makes HTTP requests. Finds next links by searching for the
href
attribute of the<a></a>
tag. - All URLs sanitzied using the
url.parse
andurl.format
of the URL module. - Crawler checks for
content-type
. Sends empty outBound links array for non-HTML contents.
- The async actions of PageRanker are maintained using 4 events viz.
append
(notifies when next set of links are ready to be crawled)save
(called after all pages have been crawled and the links are ready to be saved)compute
(calls after the links object is instantiated and the PageRank can be applied)release
(calls after thecompute
is done. Releases the eventHandlers and calls thecallback()
)
- Tries to find of a cached version of crawler's result exists. If not, then it starts the
append
event which calls the crawler. - Maintains
_linkQueue
which is made thread safe using async-mutex. - The
_handlePageVisit()
is called until the_linkQueue
is empty. - Saves the crawled results to a new JSON file for furture use (because crawling can take 2-5 mins for a regular webiste).
- Filters out visited sites and sites with hashPaths to save additional network calls.
index.js
code with comments explaining the various parameters required to instantiate and use the PageRanker class.
Snippet from _handleCompute()
function inside src/pageranker.js
that shows the algorithm steps applied. I've not used Matrices in my approach as I was unaware of that method. In hindsight, the current forEach is more time-consuming than the Matrix multiplication approach.
Start of output of the program. The pageranker tries to find the fie, however being unable to find it, the pageranker starts crawling the website instead.
Shows output after the crawling is done and the PageRank is calculated.
- Currently the crawler supports HTTP only. HTTPS routes forced to use HTTP protocol instead.
- Crawler will not work on websites relying on DOM manipulation. (e.g.: React, Angular, VueJS, etc.)