RCGraph

Manage the Rich Context knowledge graph.

Installation

First, there are two options for creating an environment.

Option 1: use virtualenv to create a virtual environment with the local Python 3.x as the target binary.
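If the `venv` directory doesn't exist yet, create it first (this assumes `python3` is on your path; the directory name `venv` matches the activation command below):

```shell
# Create a virtual environment in ./venv using the local Python 3.x
python3 -m venv venv
```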

Then activate the virtual environment and upgrade its packaging tools:

source venv/bin/activate
pip install setuptools --upgrade

Option 2: use conda -- see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

Second, clone the repo:

git clone https://github.com/Coleridge-Initiative/RCGraph.git

Third, change into the directory and initialize the local Git configuration for the required submodules:

cd RCGraph
git submodule init
git submodule update
git config status.submodulesummary 1

Given that foundation, load the dependencies:

pip install -r requirements.txt

Fourth, set up the local rc.cfg configuration file and run the unit tests (see below) to confirm that this project has been installed and configured properly.

Submodules

Ontology definitions used for the KG are linked into this project as a submodule.

Git repos exist for almost every entity type in the KG, also linked as submodules.

The RCLC leaderboard competition is also linked as a submodule, since it consumes corpus updates from this repo.

Updates

To update the submodules to the latest HEAD commit on their master branches, run:

git submodule foreach 'git fetch && git merge origin/master'

Then stage the updated submodule references and commit.

For more info about how to use Git submodules, see the Git documentation.

Workflow

Initial Steps

  • update datasets.json -- datasets are the foundation for the KG
  • add a new partition of publication metadata for each data ingest

Step 1: Graph Consistency Tests

To perform these tests:

nose2 -v --pretty-assert

Then file GitHub issues in the affected submodules' repos for any failed tests.

Step 2: Gather the DOIs, etc.

Use title search across the scholarly infrastructure APIs to identify a DOI and other metadata for each publication.

python run_step2.py

Results are organized in partitions within the bucket_stage subdirectory, using the same partition names from the preceding workflow steps, to make errors easier to trace and troubleshoot.

See the misses_step2.txt file which reports the title of each publication that failed every API lookup.
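The title search described above can be sketched as follows, using Crossref's public REST API as one example of a scholarly infrastructure API. This is an illustration of the general technique, not the actual run_step2.py implementation:

```python
# Sketch of a title-based DOI lookup against the Crossref REST API.
# Illustrative only -- not the actual logic inside run_step2.py.
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"

def build_title_query(title, rows=1):
    """Construct the Crossref works query URL for a publication title."""
    params = urllib.parse.urlencode({"query.title": title, "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"

def lookup_doi(title):
    """Return the best-match DOI for a title, or None on a miss."""
    with urllib.request.urlopen(build_title_query(title), timeout=30) as resp:
        items = json.load(resp)["message"]["items"]
    return items[0].get("DOI") if items else None
```

A publication whose title fails every such lookup would end up reported in misses_step2.txt.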

Step 3: Gather the PDFs, etc.

Use publication lookup with DOIs across the scholarly infrastructure APIs to identify open access PDFs, journals, authors, keywords, etc.

python run_step3.py

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

See the misses_step3.txt file which reports the title of each publication that failed every API lookup.

Step 4: Reconcile Journal Entities

This is a manual step.

Scan results from calls to scholarly infrastructure APIs, then apply business logic to reconcile the journal for each publication with the journals.json entity listing.

python run_step4.py

Disputed entity definitions are written to standard output, and suggested additions are written to a new update_journals.json file.

The person running this step must review each suggestion, then determine whether to add the suggested journals to the journals.json entities file -- or make other changes to previously described journal entities. For example, sometimes the metadata returned from discovery APIs has errors and would cause data quality issues within the KG.

Some good tools for manually checking journal metadata via ISSNs include ISSN.org, Crossref, and NCBI. For example, use the ISSN "1531-3204" to look up journal metadata on any of those services.

Often there will be outdated/invalidated ISSNs or low-info-content defaults (e.g., substituting SSRN) included in API results, which could derail our KG development.
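One quick local sanity check that catches malformed ISSNs before any manual lookup is the ISO 3297 check digit: weights 8 down to 2 over the first seven digits, mod 11, with "X" standing for a check value of 10. This is a general-purpose validator, not part of the RCGraph workflow scripts:

```python
# Validate an ISSN's check digit per ISO 3297.
def issn_is_valid(issn: str) -> bool:
    digits = issn.replace("-", "").upper()
    if len(digits) != 8 or not digits[:7].isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected
```

For example, the ISSN "1531-3204" mentioned above passes this check, while a single-digit transcription error fails it.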

Journal names get used later in the workflow to construct UUIDs for publications, prior to generating the public corpus. This step performs consistency tests and filtering of the API metadata, to avoid data quality issues later.
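The exact UUID scheme used later by gen_ttl.py isn't documented here, but a name-based (deterministic) UUID such as Python's uuid.uuid5 illustrates why consistent journal names matter: the same inputs always map to the same identifier, while any variation in the name produces a completely different one. The namespace and key format below are illustrative assumptions:

```python
import uuid

# Hypothetical namespace for illustration -- the real scheme in
# gen_ttl.py may differ.
RC_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL,
                          "https://github.com/Coleridge-Initiative/RCGraph")

def publication_uuid(journal_name: str, title: str) -> uuid.UUID:
    """Deterministic UUID from journal name + title (illustrative only)."""
    return uuid.uuid5(RC_NAMESPACE, f"{journal_name}|{title}")
```

Two runs over the same reconciled journal name yield identical UUIDs; an unreconciled variant spelling of the journal would silently split one publication into two.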

See the misses_step4.txt file which reports the title of each publication that doesn't have a journal.

Caveats:

  • If you don't understand what this step does, don't run it
  • Do not make manual edits to the journals.json file

Step N: Reconcile Author Lists

This is a manual step.

Scan results from calls to scholarly infrastructure APIs, then apply business logic to reconcile (disambiguate) the author lists for each publication with the authors.json entity listing.

python run_author.py

Lists of authors are parsed from metadata in the bucket_stage then disambiguated.
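The actual disambiguation logic is more involved (it feeds the self-supervised training set described below); as a minimal sketch of the kind of name normalization such matching typically starts from -- the function and its key format are illustrative assumptions, not run_author.py's API:

```python
# Crude author-name key: ASCII-fold, reorder "Surname, Given",
# lowercase, strip punctuation. Illustrative only.
import re
import unicodedata

def author_key(name: str) -> str:
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    if "," in folded:  # "Surname, Given" -> "Given Surname"
        surname, _, given = folded.partition(",")
        folded = f"{given} {surname}"
    tokens = re.findall(r"[a-z]+", folded.lower())
    return " ".join(tokens)
```

Under this scheme "Pérez, José" and "José Pérez" collapse to the same key, so the two metadata variants can be matched to one author entity.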

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

This stage produces two files:

  • authors.json -- list of known authors
  • auth_train.tsv -- training set for self-supervised model

See the misses_author.txt file, which reports the title of each publication that doesn't have any authors.

Caveats:

  • Do not make manual edits to authors.json or auth_train.tsv

Step N: Finalize Metadata Corrections

This workflow step finalizes the metadata corrections for each publication, including selection of a URL, open access PDF, etc., along with any manual overrides.

python run_final.py

Results are organized in partitions in the bucket_final subdirectory, using the same partition names from the previous workflow step.

See the misses_final.txt file which reports the title of each publication that failed every API lookup.

Step N: Generate Corpus Update

This workflow step generates uuid values (late binding) for both publications and datasets, then serializes the full output as TTL in tmp.ttl and as JSON-LD in tmp.jsonld for a corpus update:

python gen_ttl.py

Afterwards, move the generated tmp.* files into the RCLC repo and rename them:

mv tmp.* rclc
cd rclc
mv tmp.ttl corpus.ttl
mv tmp.jsonld corpus.jsonld

Then to publish the corpus:

  1. commit and create a new tagged release
  2. run bin/download_resources.py to download PDFs
  3. extract text from PDFs
  4. upload to the public S3 bucket and write manifest