DOAJ Harvester

An external module that harvests metadata from third parties and adds it to the DOAJ via the API.

Installation

Clone the project:

git clone https://github.com/doaj/harvester.git

Get all the submodules:

cd myapp
git submodule init
git submodule update

This will initialise and clone the esprit and magnificent octopus libraries

Then get the submodules for Magnificent Octopus

cd myapp/magnificent-octopus
git submodule init
git submodule update

Create your virtualenv and activate it

virtualenv /path/to/venv
source /path/to/venv/bin/activate

Install esprit and magnificent octopus (in that order)

cd myapp/esprit
pip install -e .

cd myapp/magnificent-octopus
pip install -e .

Create your local config

cd myapp
touch local.cfg

Then you can override any config values that you need to

To start the application, you'll also need to install it into the virtualenv just this first time

cd myapp
pip install -e .

Then, start your app with

python service/web.py

If you want to specify your own root config file, you can use

APP_CONFIG=path/to/rootcfg.py python service/web.py

Running the Harvester

First, configure the runner: in local.cfg, list the accounts and their API keys in the API_KEYS property:

API_KEYS = {
    "11112222" : "kjsdhjf98qr82oif293pJUWEF",
    "22223333" : "OIWEFWOIJEFIWLKSDLKFJIOWE"
}
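Since a malformed API_KEYS entry would only show up once the runner is under way, it may help to sanity-check the mapping first. The following is a minimal sketch, not part of the harvester: `check_api_keys` is a hypothetical helper, and it only assumes that API_KEYS is a dict of account id strings to API key strings, as in the example above.

```python
def check_api_keys(api_keys):
    """Return a list of problems found; an empty list means the mapping looks usable.

    Hypothetical helper for illustration only; the harvester does not ship it.
    """
    if not isinstance(api_keys, dict):
        return ["API_KEYS must be a dict of account id -> API key"]
    problems = []
    if not api_keys:
        problems.append("API_KEYS is empty")
    for account_id, key in api_keys.items():
        # Both the account id and its key are expected to be non-blank strings
        if not isinstance(account_id, str) or not account_id.strip():
            problems.append("account id %r is not a non-empty string" % (account_id,))
        if not isinstance(key, str) or not key.strip():
            problems.append("API key for account %r is missing or blank" % (account_id,))
    return problems


# Same shape as the example configuration above
API_KEYS = {
    "11112222": "kjsdhjf98qr82oif293pJUWEF",
    "22223333": "OIWEFWOIJEFIWLKSDLKFJIOWE",
}

print(check_api_keys(API_KEYS))  # prints []
```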

You also need to set a suitable initial harvest date. If it is too far in the past you will waste time (and EPMC resources); if it is too recent you will miss things you want. The default is the start of December 2015 - you will probably want a much earlier date than that initially.

INITIAL_HARVEST_DATE = "2015-12-01T00:00:00Z"
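If you want to derive an earlier starting point from the default rather than typing it by hand, something like the following would work. This is a sketch under assumptions: it takes the timestamp format from the example value above, the helper names are hypothetical, and the five-year offset is an arbitrary illustration rather than a recommendation.

```python
from datetime import datetime, timezone

# Format inferred from the example value "2015-12-01T00:00:00Z"
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_harvest_date(value):
    """Parse a harvest date string into a timezone-aware UTC datetime."""
    return datetime.strptime(value, DATE_FORMAT).replace(tzinfo=timezone.utc)


def format_harvest_date(dt):
    """Format a datetime back into the string form used in local.cfg."""
    return dt.astimezone(timezone.utc).strftime(DATE_FORMAT)


default = parse_harvest_date("2015-12-01T00:00:00Z")

# Move the start five years earlier (arbitrary choice for illustration)
earlier = default.replace(year=default.year - 5)

print(format_harvest_date(earlier))  # prints 2010-12-01T00:00:00Z
```

The printed value can then be pasted into local.cfg as INITIAL_HARVEST_DATE.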

Then you just need to start the harvester with:

python service/runner.py

This runs through each account ID in API_KEYS. For each one it retrieves the ISSNs from the DOAJ, requests the articles for each of those ISSNs from EPMC, and sends them to the DOAJ (authenticating with that account's API key).
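The flow of a run can be outlined roughly as below. This is an illustrative sketch only, not the code in service/runner.py: `run_harvest` and its three callable parameters are hypothetical names, and the stubs stand in for the real DOAJ and EPMC requests so the example runs without network access.

```python
def run_harvest(api_keys, get_issns, get_articles, send_article):
    """Outline of one harvester run (hypothetical structure).

    api_keys     -- dict of account id -> API key, as in API_KEYS
    get_issns    -- callable(account_id) returning that account's ISSNs
    get_articles -- callable(issn) returning articles for that ISSN
    send_article -- callable(article, api_key) delivering one article
    Returns the number of articles sent.
    """
    sent = 0
    for account_id, api_key in api_keys.items():
        for issn in get_issns(account_id):
            for article in get_articles(issn):
                # Each article is delivered with the owning account's key
                send_article(article, api_key)
                sent += 1
    return sent


# Stub data and functions, invented for this example
fake_issns = {"11112222": ["1234-5678"], "22223333": ["2345-6789", "3456-7890"]}
fake_articles = {
    "1234-5678": ["article A"],
    "2345-6789": ["article B", "article C"],
    "3456-7890": [],
}
received = []

count = run_harvest(
    {"11112222": "key-one", "22223333": "key-two"},
    get_issns=lambda account_id: fake_issns[account_id],
    get_articles=lambda issn: fake_articles[issn],
    send_article=lambda article, api_key: received.append((article, api_key)),
)

print(count)  # prints 3
```

Passing the three steps in as callables keeps the loop itself free of network code, which makes the ordering (account, then ISSN, then article) easy to see and to test.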

Rotating the logs

The harvester creates a fair amount of log output, so it's best to rotate and archive old logs using logrotate. Symlink the config from deploy/logrotate/doaj-harvester into the logrotate directory at /etc/logrotate.d/ and it will be picked up on the next cron.daily run of logrotate. For example:

sudo ln -s /full/path/to/deploy/logrotate/doaj-harvester /etc/logrotate.d/doaj-harvester
