A Python implementation of the PageRank algorithm to evaluate the relative importance of pages in a corpus of HTML files.
- Parse a directory of HTML pages to build a link graph (corpus).
- Compute PageRank using:
  - Sampling (random surfer simulation).
  - Iterative update until convergence.
- Handle dangling pages (pages without outgoing links).
- Output normalized PageRank scores for each page.
- Configurable damping factor and sample size.
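The first feature above, building the link graph from a directory of HTML pages, could look roughly like the sketch below. It is an illustration using only the standard library, not necessarily how `pagerank.py` does it: the names `LinkCollector` and `crawl` are hypothetical, and the choices to ignore self-links and links pointing outside the corpus are assumptions.

```python
import os
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def crawl(directory):
    """Map each .html filename in `directory` to the set of corpus pages it links to."""
    pages = {}
    for filename in os.listdir(directory):
        if not filename.endswith(".html"):
            continue
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            collector = LinkCollector()
            collector.feed(f.read())
        # Assumption: a page linking to itself does not count as an outgoing link.
        pages[filename] = collector.links - {filename}
    # Assumption: links to anything outside the corpus are dropped.
    return {
        name: {link for link in links if link in pages}
        for name, links in pages.items()
    }
```

A page that maps to an empty set in the result is a dangling page.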
Clone the repository and run the script, passing a directory of HTML files as the argument.
Windows: py pagerank.py corpus
macOS/Linux: python3 pagerank.py corpus
$ python3 pagerank.py corpus0
PageRank Results from Sampling (n = 10000)
1.html: 0.2200
2.html: 0.3900
3.html: 0.3900
PageRank Results from Iteration
1.html: 0.2187
2.html: 0.3906
3.html: 0.3906
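The iteration results come from repeatedly applying the standard PageRank update, PR(p) = (1 - d) / N + d * sum(PR(q) / NumLinks(q)) over every page q that links to p, until the scores stop changing. The sketch below shows one way to write that loop. The name `iterate_pagerank`, the `tolerance` threshold of 1e-6, and the treatment of a dangling page as linking to every page are assumptions, not details confirmed by this README.

```python
def iterate_pagerank(corpus, damping_factor=0.85, tolerance=1e-6):
    """Return {page: rank} by iterating the PageRank update until convergence.

    `corpus` maps each page name to the set of pages it links to.
    """
    pages = list(corpus)
    n = len(pages)
    # Assumption: a dangling page is treated as linking to every page.
    links = {p: (corpus[p] or set(pages)) for p in pages}
    ranks = dict.fromkeys(pages, 1 / n)  # start from a uniform distribution

    while True:
        new_ranks = {}
        for p in pages:
            # Rank flowing into p from every page q that links to it.
            incoming = sum(
                ranks[q] / len(links[q]) for q in pages if p in links[q]
            )
            new_ranks[p] = (1 - damping_factor) / n + damping_factor * incoming
        # Stop once no score moved by more than the tolerance.
        if max(abs(new_ranks[p] - ranks[p]) for p in pages) < tolerance:
            return new_ranks
        ranks = new_ranks
```

Because every page, dangling or not, distributes its full rank on each pass, the scores stay normalized and sum to 1 without a separate rescaling step.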
- Default damping factor: 0.85
- Default sample size: 10,000
- Pure Python implementation; no external dependencies required
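To show how the two defaults above are used, here is a minimal sketch of the random surfer simulation with only the standard library. The names `transition_model` and `sample_pagerank` and the `seed` parameter are hypothetical, and treating a dangling page as linking to every page is an assumption about how such pages are handled.

```python
import random

DAMPING = 0.85   # default damping factor listed above
SAMPLES = 10000  # default sample size listed above


def transition_model(corpus, page, damping_factor):
    """Return {page: probability} for the surfer's next step from `page`."""
    pages = list(corpus)
    # Assumption: a dangling page is treated as linking to every page.
    links = corpus[page] or set(pages)
    # With probability 1 - d the surfer jumps to a page chosen uniformly at random.
    base = (1 - damping_factor) / len(pages)
    return {
        p: base + (damping_factor / len(links) if p in links else 0.0)
        for p in pages
    }


def sample_pagerank(corpus, damping_factor=DAMPING, n=SAMPLES, seed=None):
    """Estimate PageRank as the fraction of n samples spent on each page."""
    rng = random.Random(seed)
    pages = list(corpus)
    counts = dict.fromkeys(pages, 0)
    page = rng.choice(pages)  # the first sample is a uniformly random page
    for _ in range(n):
        counts[page] += 1
        model = transition_model(corpus, page, damping_factor)
        page = rng.choices(pages, weights=[model[p] for p in pages])[0]
    return {p: counts[p] / n for p in pages}
```

Since the estimate is a count divided by `n`, the scores are normalized by construction. A larger sample size reduces the random variation between runs.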