Replace scrapername everywhere with your scraper's name eg. worldbank Replace ScraperName everywhere with your scraper's name eg. World Bank Look for xxx and ... and replace add text accordingly.
Scrapers can be tested locally in a Python virtualenv which can be set up by running ./setup.sh and run with ./run.sh.
Scrapers can either be installed on GitHub Actions or Jenkins. In either case, they can be set up to run on a schedule. For scripts that run for more than 6 hours and/or for which resuming failed runs in important, Jenkins must be used.
Uses the .github/workflows/python-package.yml file. Set up the environment variables to be passed to your script in that file mapping them from secrets configured in the GitHub repository's UI. You can also set up the cron schedule. (The files listed under Jenkins below are not used.)
docker-compose.yml, docker-requirements.yml, Dockerfile, run-dev.sh, run-once-dev.sh and run_env are only needed for Jenkins. (.github/workflows/python-package.yml is not used.)
It is highly recommended to have a README.md that details what your script does, from where it obtains data, how to run it, what resources it uses (eg. CPU, memory, disk, network) and what environment variables are needed. An example is shown below the heading "Text for Scraper's README.md below"
It is very helpful to comment your code appropriately especially things that you think might confuse someone looking at your code for the first time.
For a full scraper following this template for GitHub Actions see: OAD
For a scraper that also creates datasets disaggregated by indicator (not just country) and reads metadata from a Google spreadsheet exported as csv, see: IDMC
Text for Scraper's README.md below
Collector for ScraperName's Datasets
For the script to run, you will need to either pass in your HDX API key as a parameter or have a file called .hdx_configuration.yml in your home directory containing your HDX key eg.
hdx_key: "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" hdx_read_only: false hdx_site: test
You will also need to pass in your user agent as a parameter or pass a parameter user_agent_config_yaml specifying where your user agent file is located. It should be of the form:
If you have many user agents, you can create a file of this form, put its location in user_agent_config_yaml and specify the lookup in user_agent_lookup:
myscraper: user_agent: MY_USER_AGENT myscraper2: user_agent: MY_USER_AGENT2
Note for HDX scrapers: there is a universal .useragents.yml file you should use.
Alternatively, you can set up environment variables eg. for production runs: USER_AGENT, HDX_KEY, HDX_SITE, BASIC_AUTH, EXTRA_PARAMS, TEMP_DIR, LOG_FILE_ONLY