This repo hosts the Rich Context leaderboard competition, including the corpus and current SOTA results for the required tasks.
Tracking Progress in Rich Context

The Coleridge Initiative at NYU has been researching Rich Context to enhance search and discovery of datasets used in scientific research – see the Background Info section for more details. Partnering with experts throughout academia and industry, NYU-CI has worked to leverage closely adjacent fields: NLP/NLU, knowledge graphs, recommender systems, scholarly infrastructure, data mining from scientific literature, dataset discovery, linked data, open vocabularies, metadata management, data governance, and so on. Leaderboards are published here on GitHub to track state-of-the-art (SOTA) progress among the top results.

Leaderboard 1

Entity Linking for Datasets in Publications

The first challenge is to identify the datasets used in research publications, initially focused on the problem of entity linking. Research papers generally mention the datasets they've used, although there are limited formal means to describe that metadata in a machine-readable way. The goal here is to predict a set of dataset IDs for each publication. The dataset IDs within the corpus represent the set of all possible datasets that may appear.
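To make the task concrete, here is a toy sketch of what a submission's predictions might look like: a mapping from publication IDs to predicted dataset IDs, where every prediction must come from the corpus's known set of dataset IDs. All identifiers here are made up for illustration; see the corpus files for the real ID scheme.

```python
# Hypothetical prediction structure: publication ID -> set of dataset IDs.
# The IDs below are invented placeholders, not real corpus entries.
predictions = {
    "publication-123": {"dataset-007", "dataset-042"},
    "publication-456": {"dataset-007"},
}

# The corpus defines the closed set of all dataset IDs that may appear,
# so every predicted ID should be a member of that set.
known_datasets = {"dataset-007", "dataset-042", "dataset-099"}
assert all(ids <= known_datasets for ids in predictions.values())
```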

Identifying dataset mentions typically requires:

  • extracting text from an open access PDF
  • some NLP parsing of the text
  • feature engineering (e.g., attention to where text is located in a paper)
  • modeling to identify up to 5 datasets per publication
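The steps above can be sketched end-to-end in a few lines. This is a deliberately naive illustration, not the pipeline used in this repo: sentence splitting stands in for real NLP parsing, a position-based weight stands in for feature engineering, and exact string matching stands in for a trained model.

```python
import re
from collections import Counter

def parse_sentences(text):
    """Step 2 (sketch): naive sentence split; a real pipeline would use
    a proper NLP parser on text extracted from the PDF (step 1)."""
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def score_mentions(sentences, known_datasets):
    """Steps 3-4 (sketch): count mentions of each known dataset name,
    weighting earlier sentences higher -- a toy 'location in the paper'
    feature standing in for real feature engineering."""
    scores = Counter()
    for pos, sent in enumerate(sentences):
        weight = 2.0 if pos < len(sentences) // 2 else 1.0
        for ds in known_datasets:
            if ds.lower() in sent.lower():
                scores[ds] += weight
    return scores

def predict_datasets(text, known_datasets, limit=5):
    """Return up to `limit` dataset IDs, highest-scoring first, matching
    the 'up to 5 datasets per publication' constraint."""
    scores = score_mentions(parse_sentences(text), known_datasets)
    return [ds for ds, _ in scores.most_common(limit)]
```

For example, `predict_datasets("We analyzed the ADNI cohort. Results from ADNI were compared.", {"ADNI", "NHANES"})` returns `["ADNI"]`.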

See Evaluating Models for Entity Linking with Datasets for details about how the Top5uptoD leaderboard metric is calculated.
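One plausible reading of a "top 5, up to D" precision score is sketched below: for a publication with D true datasets, score only the top min(5, D) ranked predictions. The authoritative definition is in the linked evaluation notes; treat this as an assumption, not the official scoring code.

```python
def top5_upto_d_precision(ranked_predictions, true_datasets):
    """Toy 'top-5 up to D' precision: evaluate the top min(5, D) ranked
    predictions against the D true dataset IDs. This is an assumed
    reading of the metric, not the official implementation."""
    d = len(true_datasets)
    taken = ranked_predictions[: min(5, d)]
    if not taken:
        return 0.0
    hits = sum(1 for ds in taken if ds in true_datasets)
    return hits / len(taken)
```

For example, with true datasets `{"a", "c"}` (D = 2), only the top 2 of `["a", "b", "c"]` are scored, giving 0.5.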

Current SOTA

| source | precision | entry | code | paper | corpus | submitted | notes |
|--------|-----------|-------|------|-------|--------|-----------|-------|
| LARC @philipskokoh | 0.7836 | ipynb | repo | RCC_1 | v0.1.5 | 2019-09-26 | RCLC baseline experiment using the RCC_1 approach |
| KAIST @HaritzPuerto | 0.6319 | ipynb | repo | RCC_1 | v0.1.5 | 2019-11-01 | model trained on a different dataset using DocumentQA and Ultra-Fine Entity Typing -- NB: this approach is able to identify new datasets |


Use of open source and open standards is especially important to further the cause of effective, reproducible research. We're hosting this competition to focus on the research challenges of specific machine learning use cases encountered within Rich Context – see the Workflow Stages section.

If you have any questions about the Rich Context leaderboard competition – and especially if you identify any problems in the corpus (e.g., data quality, incorrect metadata, broken links, etc.) – please use the GitHub issues for this repo and pull requests to report, discuss, and resolve them.
