Simple heuristic for measuring web page similarity (& data set)

Page Compare

This is a simple toolset for measuring the similarity of web pages.

Quick Start

We have included a dataset to play with in /data, but you can also generate your own dataset using the tools provided here. (If you want to use the provided data, skip this step.)

Define the sites you want to scrape in a JSON file (see the included sites.json as an example). Now you can run the scraper:

$ python
Usage: <splash url> <sites JSON> <run number> <output path>
$ python http://localhost:8050 sites.json 1 data

The <run number> argument is appended to each filename to help keep track of multiple scrapes. For example, you might scrape today and again tomorrow to compare the two resulting versions of each page. The scraper populates the data directory with an *.html scrape and a *.png thumbnail for each site.
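As a rough illustration of the scraping step, the sketch below builds a Splash request URL and the output filenames for one site. It assumes Splash's `render.json` endpoint with the `html` and `png` options; the helper names `splash_render_url` and `output_paths` are hypothetical, and the scraper in this repository may call Splash differently.

```python
from pathlib import Path
from urllib.parse import urlencode


def splash_render_url(splash_url, site_url):
    """Build a Splash render.json URL asking for both HTML and a PNG screenshot."""
    query = urlencode({"url": site_url, "html": 1, "png": 1})
    return f"{splash_url.rstrip('/')}/render.json?{query}"


def output_paths(output_dir, site_name, run_number):
    """Return the (html, png) paths for one scrape, named <site>-<run>.*"""
    stem = f"{site_name}-{run_number}"
    out = Path(output_dir)
    return out / f"{stem}.html", out / f"{stem}.png"


if __name__ == "__main__":
    # The site name and URL here are placeholders, not entries from sites.json.
    print(splash_render_url("http://localhost:8050", "http://www.cnn.com"))
    print(output_paths("data", "cnn", 1))
```

The `<site>-<run>` naming mirrors the files listed in the sample data, such as data/cnn-1.html.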

Next, run a pairwise comparison of all *.html files in your dataset. (You can use the included data directory to get started.)

$ python3 data
data/about-1.html (1/186)
data/about-2.html (2/186)
data/amex-1.html (3/186)
data/amex-2.html (4/186)
data/answers-1.html (5/186)
data/answers-2.html (6/186)
data/aol-1.html (7/186)
data/apple-1.html (8/186)
data/apple-2.html (9/186)
data/archlinux-1.html (10/186)

(This step is O(N^2) over the number of sites, so it can be quite slow.)
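To get a feel for the quadratic growth, the following sketch enumerates unordered pairs of files. It assumes each pair is compared once, which may not match the comparison script exactly; `unordered_pairs` is a hypothetical helper.

```python
from itertools import combinations
from math import comb


def unordered_pairs(paths):
    """Return every unordered pair of paths, each pair listed once."""
    return list(combinations(sorted(paths), 2))


if __name__ == "__main__":
    sample = ["data/cnn-1.html", "data/cnn-2.html", "data/comcast-1.html"]
    for path1, path2 in unordered_pairs(sample):
        print(path1, path2)
    # Number of unordered pairs for the 186 files in the sample run.
    print(comb(186, 2))
```

Under that assumption, the 186 files in the sample run produce 17,205 comparisons.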

When it's done, it will output a file called compare-tags.json that contains pairwise similarity values for each pair of .html files in the data directory:

    {
        "path1": "data/cnn-1.html",
        "path2": "data/cnn-2.html",
        "similarity": 66.2429723783916
    },
    {
        "path1": "data/cnn-1.html",
        "path2": "data/comcast-1.html",
        "similarity": 2.2954091816367264
    },
    {
        "path1": "data/cnn-1.html",
        "path2": "data/comcast-2.html",
        "similarity": 1.226215644820296
    },

The similarity values are real numbers between 0.0 and 100.0, inclusive, where 0.0 indicates no similarity and 100.0 indicates identical page structures. The excerpt above shows that the two CNN scrapes have a similarity score of about 66, while comparing CNN to various versions of Comcast's site yields very low similarity.
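One way to produce a structural score on this 0 to 100 scale is sketched below. It is an assumption for illustration, not necessarily the heuristic this toolset implements: it reduces each page to its sequence of opening tags and compares the two sequences with `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a, html_b):
    """Return a score in [0, 100] based only on tag order, ignoring text."""
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return 100.0 * matcher.ratio()


if __name__ == "__main__":
    monday = "<html><body><h1>News</h1><p>Story A</p></body></html>"
    tuesday = "<html><body><h1>News</h1><p>Story B</p></body></html>"
    print(structural_similarity(monday, tuesday))
```

Because only tag names are compared, two scrapes whose text changed but whose markup did not still score 100.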

What threshold should you choose for determining similarity? The script can identify an optimal similarity threshold:

$ python compare-tags.json
Maximum f1 0.944 at threshold=35 tp=84 fp=4 fn=6 prec=0.955 rec=0.933

In our sample dataset, a similarity threshold of 35 maximizes the F1 score, which balances precision and recall. That is, if two pages have a similarity score greater than 35, then we conclude that they are in fact the same page (although with some content slightly changed).
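A threshold search of this kind can be sketched as a simple sweep. The example assumes that two files are "the same page" when they share a site name, using the `<site>-<run>.html` naming seen in the data directory; that labeling rule, and the helpers `site_name`, `f1_at`, and `best_threshold`, are assumptions rather than the script's actual logic.

```python
import os
import re


def site_name(path):
    """'data/cnn-1.html' -> 'cnn' (assumes <site>-<run>.html naming)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"-\d+$", "", stem)


def f1_at(records, threshold):
    """F1 score when pairs with similarity > threshold are called 'same page'."""
    tp = fp = fn = 0
    for rec in records:
        same = site_name(rec["path1"]) == site_name(rec["path2"])
        predicted = rec["similarity"] > threshold
        if predicted and same:
            tp += 1
        elif predicted:
            fp += 1
        elif same:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(records, candidates=range(0, 101)):
    """Return the lowest candidate threshold that achieves the maximum F1."""
    return max(candidates, key=lambda t: f1_at(records, t))


if __name__ == "__main__":
    records = [
        {"path1": "data/cnn-1.html", "path2": "data/cnn-2.html", "similarity": 66.2429723783916},
        {"path1": "data/cnn-1.html", "path2": "data/comcast-1.html", "similarity": 2.2954091816367264},
        {"path1": "data/cnn-1.html", "path2": "data/comcast-2.html", "similarity": 1.226215644820296},
    ]
    t = best_threshold(records)
    print(t, f1_at(records, t))
```

With only three records the sweep is not meaningful; it is shown here on the excerpt above purely to keep the example self-contained.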

Finally, you can construct a graph of the related sites using

$ python3 compare-tags.json data/ >
$ neato -O -Tpng

This will produce a PNG image. If you're using the sample dataset, it will look something like this:

similarity graph

Nodes are connected if they are very similar. You can easily see that even though many of the thumbnails are slightly different, the heuristic has successfully recognized similar pages in many instances.
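The graph step can be approximated by writing one undirected edge for every pair above the threshold, in Graphviz DOT syntax that `neato` can render. This is a minimal sketch with a hypothetical `to_dot` helper; it omits the thumbnails that the real graph displays.

```python
def to_dot(records, threshold=35):
    """Return DOT text with an edge for each pair whose similarity exceeds threshold."""
    lines = ["graph similarity {"]
    for rec in records:
        if rec["similarity"] > threshold:
            lines.append(f'  "{rec["path1"]}" -- "{rec["path2"]}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    records = [
        {"path1": "data/cnn-1.html", "path2": "data/cnn-2.html", "similarity": 66.2429723783916},
        {"path1": "data/cnn-1.html", "path2": "data/comcast-1.html", "similarity": 2.2954091816367264},
    ]
    print(to_dot(records))
```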


Similarity Threshold

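Evaluating with different values for the similarity threshold. The thresholds below appear to be expressed as fractions of the 0 to 100 scale, so 0.25 corresponds to a similarity score of 25: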

With similarity > 0.25:

tp 88 fp 14 fn 2
precision 0.86 recall 0.98 f1 0.92

With similarity > 0.33:

tp 85 fp 8 fn 5
precision 0.91 recall 0.94 f1 0.93

With similarity > 0.50:

tp 77 fp 0 fn 13
precision 1.0 recall 0.86 f1 0.92

The threshold-finding script described in Quick Start reports threshold > 35 as the ideal setting.
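For reference, the figures above follow from the standard definitions of precision, recall, and F1. Worked through for the first row (tp 88, fp 14, fn 2):

```latex
\begin{align*}
\text{precision} &= \frac{tp}{tp + fp} = \frac{88}{88 + 14} \approx 0.86 \\
\text{recall}    &= \frac{tp}{tp + fn} = \frac{88}{88 + 2} \approx 0.98 \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \approx 0.92
\end{align*}
```

Raising the threshold trades recall for precision: false positives fall from 14 to 0 while false negatives rise from 2 to 13.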

Next Steps

Error analysis shows that one false positive comes from comparing a pair of sites that look very different, but that both have a simple page with a long list (<select>) of countries. These lists create a lot of similarity and dominate the result because there are relatively few other elements on the page.

  • Compute histograms of elements?
  • Edit distance of title?
  • Maybe elements can be weighted by their depth in the tree? Or prune elements at a certain depth?
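As one possible reading of the histogram idea, the sketch below compares per-tag counts and optionally caps each count so that a long run of identical elements (such as hundreds of `<option>` tags) cannot dominate the score. The cap and the min/max overlap measure are assumptions chosen for illustration, not something this project has implemented.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each opening tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_histogram(html, cap=None):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    if cap is None:
        return counter.counts
    # Limit each tag's count so repeated elements contribute at most `cap`.
    return Counter({tag: min(n, cap) for tag, n in counter.counts.items()})


def histogram_similarity(html_a, html_b, cap=None):
    """Score in [0, 100]: sum of per-tag minimums over sum of per-tag maximums."""
    a, b = tag_histogram(html_a, cap), tag_histogram(html_b, cap)
    union = sum((a | b).values())
    if union == 0:
        return 100.0
    return 100.0 * sum((a & b).values()) / union


if __name__ == "__main__":
    options = "<option>x</option>" * 200
    page_a = f"<html><body><h1>t</h1><p>t</p><select>{options}</select></body></html>"
    page_b = f"<html><body><div><table><tr><td><select>{options}</select></td></tr></table></div></body></html>"
    print(histogram_similarity(page_a, page_b))
    print(histogram_similarity(page_a, page_b, cap=5))
```

In this toy case the uncapped score is high because the shared country list swamps everything else, while the capped score drops noticeably.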



