Tool for detecting (price) changes on webpages by Statistics Netherlands.
Note: As of June 2022, this repo is no longer updated to reflect changes in the internal CBS production version of the tool. We keep this version here "as is" for whoever wants to use it or borrow from it, but note that security updates are no longer applied.
The RobotTool is an interactive tool for price analysts.
The analyst defines a number of products and their locations on the web. During data collection, the tool checks the products on the websites. If nothing changed, the product status turns green and the last known price for that product is added to the database. If a change was detected, the product status turns red. The analyst checks all products with status red:
- If the price was unchanged, the analyst keeps the price.
- If the price was changed, the analyst corrects the price.
- If the product was not found, usually because of a website change, the analyst redefines the product configuration.
The analyst typically repeats this process on a regular basis, weekly or monthly, for example in the process of compiling the HICP.
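The green/red decision described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not the tool's actual code: the function name `checkProduct`, the object fields, and the idea of comparing the scraped text with the last accepted text are all hypothetical.

```javascript
// Hypothetical sketch: names and data shapes are assumptions,
// not taken from the RobotTool source code.

// previous:       { content: string, price: number }, the last accepted observation
// scrapedContent: string, or null when the product location was not found
function checkProduct(previous, scrapedContent) {
  if (scrapedContent !== null && scrapedContent === previous.content) {
    // Nothing changed: status green, carry the last known price forward.
    return { status: "green", price: previous.price, reason: "unchanged" };
  }
  // Changed or not found: status red, the analyst has to decide what to do.
  return {
    status: "red",
    price: null,
    reason: scrapedContent === null ? "not found" : "changed",
  };
}

const last = { content: "EUR 499.00", price: 499.0 };
const unchanged = checkProduct(last, "EUR 499.00");
const changed = checkProduct(last, "EUR 459.00");
const missing = checkProduct(last, null);
console.log(unchanged.status, changed.status, missing.status);
```

In this sketch a red result never carries a price: the analyst keeps the old price, corrects it, or redefines the product, matching the three cases listed above.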
- Install Node.js and a modern browser (Firefox, Chrome)
- Download the latest release, unzip it, and run from the command line in the extracted folder:
$ npm install
Run from the command line:
$ npm start
A webserver will be started and a browser window will pop up.
From release 4.0.1 onwards, an example database is preloaded. This database shows several ways to retrieve data from two fake webshops:
- ABC_Bikes: this site contains static prices
- Cheap_Bikes: this site contains dynamic prices: some of them change when the page is retrieved again.
When you press the Get new priceinfo button in the upper right corner, the tool collects prices from the fake webshops.
You can then manually initialize the price from the 'price' field in the panel that pops up after pressing one of the red boxes.
After retrieving new data with the Get new priceinfo button, the changes become visible. For further interaction we refer to the user guide below.
This is a screenshot of the tool after some runs:
Versions prior to 4.0.1: load the example database from the file Example_1_bikes.csv in the folder ImportExport via the Edit -> Import configuration button in the products panel (top left).
The RobotTool user guide contains a more detailed description of the functionality of the tool, the import and export of configuration and prices, working with XPaths, and configuring the tool.
It is available in two versions in the /Doc folder of this repo.
In addition, there is a poster (PDF).
- This tool uses a headless version of your browser (usually Firefox). After you exit the tool, the headless browser process keeps running in the background until you explicitly stop it (using the task manager) or restart your computer.
Questions, suggestions, ideas are welcome:
- Add an item to the issue tracker (you need a GitHub account).
- Send us a pull request if you have an improvement you think is valuable to all.
- Send an e-mail to o.tenbosch <at> cbs.nl.
The ideas and concepts behind webscraping for official statistics are described in the following publications:
- Web scraping meets survey design: combining forces, O. ten Bosch et al., BIGSURV18 Conference, 2 September 2018
- On the use of Internet data for the Dutch CPI, R. Griffioen and O. ten Bosch, Meeting of the Group of Experts on Consumer Price Indices (2016)
- On the use of internet robots for official statistics, O. ten Bosch and D. Windmeijer, UNECE MSIS 2014
- Automated data collection from web sources for official statistics: First experiences, R. Hoekstra, O. ten Bosch, F. Harteveld, Statistical Journal of the IAOS, Volume 28, Number 3-4 / 2012, p. 99-111, March 2013
This tool is provided under an EUPL license on an 'as is' basis and without warranties of any kind (see license file).
Credits go to Dick Windmeijer, the original developer of this tool, and to many price analysts from the price department of Statistics Netherlands. Early versions of this tool were partly subsidized by a grant from Eurostat. Older versions are still available on our research server.