EPU index

At the Applied Data Mining research group at the University of Antwerp, a classifier was developed to classify news articles as Economic Policy Uncertainty (EPU) related or not. The EPU index is the number of EPU-related articles per day divided by the number of news journals that were crawled. To update the EPU index daily, a number of scrapers were developed that scrape Belgian (Flemish) news journals every day. The resulting EPU index data is publicly available here.
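The index itself is a simple ratio computed per day. A minimal sketch (the function below is illustrative, not the project's actual code):

```python
def daily_epu_index(epu_article_count: int, journals_crawled: int) -> float:
    """EPU index for one day: the number of EPU-related articles divided
    by the number of news journals that were crawled that day."""
    if journals_crawled == 0:
        raise ValueError("no journals were crawled on this day")
    return epu_article_count / journals_crawled


# Example: 14 EPU-related articles found across 8 crawled journals.
print(daily_epu_index(14, 8))  # 1.75
```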

The application consists of 3 parts: the web scrapers, a web application and a front end.

  • Scrapers: 8 scrapers were developed using Python's Scrapy framework. Scrapy is well documented here, and a tutorial will guide you through its main concepts. The crawlers that do the actual crawling work are called spiders; the spiders that were developed are documented here. A minimal spider sketch is shown after this list.
  • Web application: The web application is where all articles and their EPU classification scores are stored. It is developed using Django, and its most important part is the models. The data in the web application is served to the front end using Django's REST framework.
  • Front end: The front end contains purely HTML and JavaScript and uses the C3 and d3-cloud libraries. The data needed to generate the charts is fetched from the web application's REST endpoints.
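For illustration, a spider has roughly the following shape. This is a minimal sketch, not one of the project's eight spiders: the spider name, URL, and CSS selectors are placeholders.

```python
import scrapy


class ExampleJournalSpider(scrapy.Spider):
    """Illustrative spider: follows article links on a journal's overview
    page and yields the fields stored per article."""

    name = "example_journal"                          # placeholder name
    start_urls = ["https://www.example.be/economie"]  # placeholder URL

    def parse(self, response):
        # Follow every article link found on the overview page.
        for href in response.css("article a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Yield the raw article; scoring against the model file happens
        # on the scraping side before articles reach the web application.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "text": " ".join(response.css("p::text").getall()),
        }
```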

Installation

Check out the installation documentation.

Configuration

The application allows for some configurable parameters. Most notably:

  • Journal authentication settings: these should be set in the crawling settings file. See the crawlers documentation for more information about those settings.
  • Period and term to scrape: these can also be found in the crawling settings file.
  • Model file: this file should contain a comma-separated list of words and their weights, used to score an article. It should include the header word,weight (an example of the format is shown after this list). This setting in the crawling settings file points to the model file. Since the model file is used by the scraper, only newly scraped articles are affected when a new file is used.
  • EPU score cutoff: this cutoff defines at which score an article is considered positive. You can alter it in the models file, but note that you will then have to re-run the custom Django command calculate_daily_epu for all dates already in the database.
  • Stopwords: the stopwords are defined as a tuple in the models file. To generate such a tuple from a text file, you can use the stand-alone script stopwords_to_tuple.py and paste the result into the models file (an example tuple is shown after this list).
  • Email notifications: the application checks every day whether the scrapers are still working. This is done by checking whether no articles have been returned for a number of consecutive days. That cutoff is defined here, and the email recipients should be added as the ALERT_EMAIL_TO setting in the same file. A number of other settings regarding the email alerts (host, port, etc.) are set in production only.
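Two of the files mentioned above are easy to illustrate. The model file is a plain comma-separated file with a word,weight header; the words and weights below are made up for the example, not taken from the actual model:

```
word,weight
onzekerheid,0.8
economie,0.3
```

The stopwords are a regular Python tuple. The output of stopwords_to_tuple.py looks along these lines (the variable name and words are illustrative; the actual name in the models file may differ):

```python
# Illustrative output of stopwords_to_tuple.py; paste the real result
# into the models file.
STOPWORDS = ("de", "het", "een", "en", "van")
```

Finally, since calculate_daily_epu is a custom Django management command, it is run through manage.py in the usual way (check the command's own help for any date arguments):

```
python manage.py calculate_daily_epu
```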