Download articles and legal documents from public procurement sources:
- Tender and contract data of European government bodies from OpenOpps, via its API or an Amazon S3 bucket (credentials are required).
- Legislative texts via JRC-Acquis dataset.
- Public procurement notices via TED dataset.
Then index them into Solr to run complex queries and visualize the results through Banana.
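As a rough illustration of what querying the index can look like, the sketch below wraps Solr's CoreAdmin `STATUS` action and the standard `/select` handler in a few shell functions. It assumes a Solr instance on `localhost:8983`. The core name `documents` and the function names are placeholders chosen for this example, not names defined by this project. Use `solr_list_cores` to find the core that is actually configured.

```shell
#!/bin/sh
# Assumptions: Solr listens on localhost:8983; the core name "documents"
# is a placeholder for illustration, not taken from this project.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
SOLR_CORE="${SOLR_CORE:-documents}"

# Print the URL of the core's standard /select query handler.
solr_select_url() {
  printf '%s/%s/select\n' "$SOLR_URL" "$SOLR_CORE"
}

# List the cores Solr hosts (CoreAdmin STATUS), to find the real core name.
solr_list_cores() {
  curl -s "$SOLR_URL/admin/cores?action=STATUS&wt=json"
}

# Run a query ($1, default: match all documents) and print the first 5 hits as JSON.
solr_query() {
  curl -s -G "$(solr_select_url)" \
    --data-urlencode "q=${1:-*:*}" \
    --data-urlencode "rows=5" \
    --data-urlencode "wt=json"
}
```

For example, after sourcing the file, `SOLR_CORE=<your-core> solr_query '*:*'` would return the first five indexed documents. Field-specific queries depend on the schema, which is not described here.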
- Install Docker and Docker-Compose
- Clone this repo:
  git clone https://github.com/TBFY/harvester.git
- Move into the src/test/docker directory.
- Run Solr and Banana by:
  docker-compose up -d
- You should be able to monitor the progress by:
  docker-compose logs -f
- A Solr Admin site should be available at: http://localhost:8983/solr
- Rename the configuration file src/test/resources/credentials.properties.sample to src/test/resources/credentials.properties (if you have credentials, update its content).
- Download and extract TED articles from ftp://guest:guest@ted.europa.eu/daily-packages/ and save them at input/ted
- Move into the base directory and run our harvester by:
  ./test TEDHarvester
- A dashboard with results should be available at: http://localhost:8983/solr/banana
Take a look at all our harvesters here: src/test/java/harvest/
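The setup steps can be collected into one script, sketched below. By default it only prints the commands (dry run); set `DRY_RUN=0` to execute them. The `run` and `quick_start` helpers and the `DRY_RUN` variable are names introduced for this sketch. It copies the sample credentials file rather than renaming it, so the sample stays in place, and it assumes the clone lands in a directory named `harvester`.

```shell
#!/bin/sh
# Sketch of the quick-start steps as one script.
# Prints each command by default; run with DRY_RUN=0 to execute them.
set -eu

# Print the command in dry-run mode, otherwise execute it in the current shell.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

quick_start() {
  run git clone https://github.com/TBFY/harvester.git
  run cd harvester
  # Start Solr and Banana in the background.
  run cd src/test/docker
  run docker-compose up -d
  run cd ../../..
  # Copy (instead of rename) so the sample file is kept.
  run cp src/test/resources/credentials.properties.sample src/test/resources/credentials.properties
  # Extracted TED daily packages are expected in this directory.
  run mkdir -p input/ted
  run ./test TEDHarvester
}

quick_start
```

The script does not download the TED packages. In a real run, the harvester has nothing to process until the extracted articles are placed in `input/ted`, so you may prefer to run the last command by hand after that step.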
Step 1. Add the JitPack repository to your build file
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Step 2. Add the dependency
<dependency>
    <groupId>com.github.TBFY</groupId>
    <artifactId>harvester</artifactId>
    <version>last-stable-release-version</version>
</dependency>
Please take a look at our contributing guidelines if you're interested in helping!