Data Collection System (DCS)

This project contains a collection of Java components used to fetch, convert and clean data on contracts and projects funded by development aid.

DCS is based on the data collection system architecture of the Digiwhist project. For more details about the design of this architecture, please see Digiwhist Workpackage 2.8 (PDF).

REQUIREMENTS

  • PostgreSQL 9.4 or higher
  • RabbitMQ 3.6
  • Java 8
  • Maven

ARCHITECTURE

DCS is organised as a series of Maven projects. These can be built using the mvn compile command.

Data processing stages

  • Raw - downloading of raw (HTML, XML, etc.) files from the internet
  • Parsed - conversion of unstructured data to structured format (all values in text format)
  • Clean - conversion of text values into proper data types, standardizing of enumeration values etc.
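
To make the difference between the Parsed and Clean stages concrete, here is a minimal sketch of what a cleaning step might do. It is not taken from the DCS code base: the class name, the method names, the `dd/MM/yyyy` date pattern and the `ProcedureType` values are all assumptions chosen for illustration.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical illustration of the Parsed -> Clean step; none of these names come from DCS.
final class CleanSketch {
    // Hypothetical standardized enumeration for a procedure type field.
    enum ProcedureType { OPEN, RESTRICTED, OTHER }

    // Assumed source date format; real sources will differ.
    private static final DateTimeFormatter DAY_MONTH_YEAR =
            DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // The parsed stage keeps dates as text; cleaning turns them into LocalDate.
    static LocalDate cleanDate(String text) {
        return LocalDate.parse(text.trim(), DAY_MONTH_YEAR);
    }

    // Strips thousands separators and surrounding whitespace before building a BigDecimal.
    static BigDecimal cleanAmount(String text) {
        return new BigDecimal(text.replace(",", "").trim());
    }

    // Maps free-text variants onto a single enumeration value.
    static ProcedureType cleanProcedureType(String text) {
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.contains("open")) {
            return ProcedureType.OPEN;
        }
        if (normalized.contains("restricted")) {
            return ProcedureType.RESTRICTED;
        }
        return ProcedureType.OTHER;
    }
}
```

In this sketch the parsed record would carry values such as "01/03/2016" or "1,234.50" as plain strings, and the clean record would carry the typed results.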

Workers

Each of the data stages described above is processed by a standalone program called a worker:

  • Crawler - crawls a website or FTP server, or reads from an API, and passes information about what should be downloaded to the Downloader. In some specific cases the Crawler also serves as a Downloader.
  • Downloader - reads the information passed by a Crawler, then downloads the data and stores it in a DB. Tells the Parser which records can be parsed.
  • Parser - creates structured data from unstructured data. Tells cleaner which records can be cleaned.
  • Cleaner - does the cleaning job and tells matcher which records can be matched.

Worker names are derived from the package structure of DCS. The names of the workers that process the World Bank contracts source are:

  • eu.dfid.worker.wb.raw.WBContractCrawler
  • eu.dfid.worker.wb.raw.WBContractDownloader
  • eu.dfid.worker.wb.raw.WBContractParser
  • eu.dfid.worker.wb.raw.WBContractCleaner

Storage

Each tender record has its own copy at each stage of data processing. These copies are stored in separate DB tables. The table names are:

  • raw_data
  • parsed_tender
  • clean_tender

Project-level data are stored in the tables:

  • raw_data
  • parsed_project
  • clean_project

The create script is located at dfid-dataaccess\src\main\resources\migrations\001_base.sql

Each table row will contain meta-data about the tender, along with a blob of structured JSON.
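
The relationship between processing stages and tables can be summarised in a small lookup. The following is only a sketch: the `DataStage` enum and its methods are hypothetical and do not appear in DCS; only the table names are taken from the lists above.

```java
import java.util.Optional;

// Sketch: maps each processing stage to the table names listed above.
// The enum and its methods are hypothetical; only the table names come from this README.
enum DataStage {
    RAW("raw_data", "raw_data"),
    PARSED("parsed_tender", "parsed_project"),
    CLEAN("clean_tender", "clean_project");

    private final String tenderTable;
    private final String projectTable;

    DataStage(String tenderTable, String projectTable) {
        this.tenderTable = tenderTable;
        this.projectTable = projectTable;
    }

    // Table holding the tender copy for this stage.
    String tenderTable() {
        return tenderTable;
    }

    // Table holding the project copy for this stage.
    String projectTable() {
        return projectTable;
    }

    // The stage that follows this one, or empty after CLEAN.
    Optional<DataStage> next() {
        int nextIndex = ordinal() + 1;
        DataStage[] all = values();
        return nextIndex < all.length ? Optional.of(all[nextIndex]) : Optional.empty();
    }
}
```

Note that tender and project data share the raw_data table and only diverge from the parsed stage onwards.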

Communication

DCS uses the RabbitMQ messaging system for communication between workers. Each time a record is processed at a given stage of data processing, the responsible worker publishes a message containing the ID of the tender record to be processed at the next stage. The next-stage worker uses this message to retrieve the right record.
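
The ID-passing pattern described above can be sketched without a broker. The example below deliberately swaps RabbitMQ for in-memory queues and the DB tables for maps, so that it runs on its own; it does not use the RabbitMQ client API and is not the DCS implementation. All class and method names are hypothetical, and reusing one ID across stages is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-memory stand-in for the messaging flow; a real deployment would use RabbitMQ queues.
final class PipelineSketch {
    // One queue per downstream worker; each message carries only a record ID.
    private final BlockingQueue<String> toParser = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> toCleaner = new LinkedBlockingQueue<>();

    // Maps standing in for the raw_data and parsed_tender tables.
    private final Map<String, String> rawData = new HashMap<>();
    private final Map<String, String> parsedTender = new HashMap<>();

    // Downloader side: store the raw record, then announce its ID to the parser.
    void download(String id, String rawContent) {
        rawData.put(id, rawContent);
        toParser.add(id);
    }

    // Parser side: take the next ID, look the record up, store the parsed copy,
    // and announce the ID to the cleaner. Returns false when no message is waiting.
    boolean parseNext() {
        String id = toParser.poll();
        if (id == null) {
            return false;
        }
        String raw = rawData.get(id);
        parsedTender.put(id, raw.trim()); // placeholder for real parsing
        toCleaner.add(id);
        return true;
    }

    // Cleaner side: the next ID to clean, or null when nothing is waiting.
    String nextIdForCleaner() {
        return toCleaner.poll();
    }
}
```

The point of the design is that messages stay small: only the ID travels through the queue, while the record itself is read from storage by the worker that receives the message.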
