Evaluation framework for reference matching algorithms

This code allows for:

generating artificial datasets for reference matching
applying reference matching approaches
evaluating reference matching approaches

Before running the scripts, add the code directory path to the environment variable PYTHONPATH.

Polite/Plus support

Most scripts call Crossref REST API under the hood. To use the polite pool or Plus Service when calling the API, simply copy the file .crapi_key_example to ~/.crapi_key, and fill in the settings in the file.

Dataset generation

A dataset is generated in several steps:

Sample extraction

Drawing the sample of documents from the system.

dataset/draw_sample.py [-h] [-v] -s SIZE [-f FILTER] [-q QUERY] -o OUTPUT

Arguments:

-h print help
-v verbose output
-s sample size
-f filter to apply before drawing the sample
-q query to apply before drawing the sample
-o output file path

Sample extension

Extending the sample by adding similar documents.

dataset/extend_sample.py [-h] [-v] -s SAMPLE -e EXTEND -o OUTPUT

Arguments:

-h print help
-v verbose output
-s sample file path
-e number of similar documets to add for each sample document
-o output file path

Generating the dataset

Generating reference strings and the final dataset.

dataset/generate_dataset.py [-h] [-v] [-l STYLES] [-a ATTRIBUTES] [-n] -s SAMPLE -o OUTPUT

Arguments:

-h print help
-v verbose output
-l comma-separated list of citation styles
-a comma-separated list of document attributes to keep (useful if we want to calculate the results along the values of those attributes)
-n set the target DOIs to null
-s sample file path
-o output file path

Reference matching

Match the references from the dataset to target DOIs.

matching/match_references.py [-h] [-v] -d DATASET -o OUTPUT

Arguments:

-h print help
-v verbose output
-d dataset file path
-o output file path

The matching algorithm is set by specifying the matcher object in matching/match_config.py. The matcher object should implement two methods:

description(): returns a string
match(reference_string): returns a tuple (matched DOI, matching score)

Evaluation

Evaluate the reference matching. The results are printed to stdout.

evaluation/evaluate.py [-h] [-v] -d DATASET [-s SPLITATTR] [-o OUTPUTDOC] [-p OUTPUTSPLIT]

Arguments:

-h print help
-v verbose output
-d dataset file path (matched)
-s attribute along which the results should be split
-o output file path for document-level results
-p output file path for split attribute results

Additional scripts

Datasets merging

Merge two or more samples or references datasets into a single file.

dataset/merge_datasets.py [-h] [-v] -d DATASETS -o OUTPUT

Arguments:

-h print help
-v verbose output
-d comma-separated list of dataset file paths
-o output file path

Links export

Export the information about a sample of existing and missing links in the current system.

dataset/export_existing_and_new_links.py [-h] [-v] -s SAMPLE [-n REFERENCES] [-p] -o OUTPUT

Arguments:

-h print help
-v verbose output
-s sample file path
-n max number of output references
-p export parsed references instead of strings
-o output file path

Name		Name	Last commit message	Last commit date
Latest commit History 143 Commits
analyses		analyses
data		data
dataset		dataset
evaluation		evaluation
examples		examples
journal_abbreviations		journal_abbreviations
matching		matching
monitoring		monitoring
utils		utils
.crapi_key_example		.crapi_key_example
.gitignore		.gitignore
CONDUCT.md		CONDUCT.md
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
config.py		config.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evaluation framework for reference matching algorithms

Polite/Plus support

Dataset generation

Sample extraction

Sample extension

Generating the dataset

Reference matching

Evaluation

Additional scripts

Datasets merging

Links export

About

Releases

Packages

Languages

License

CrossRef/reference-matching-evaluation

Folders and files

Latest commit

History

Repository files navigation

Evaluation framework for reference matching algorithms

Polite/Plus support

Dataset generation

Sample extraction

Sample extension

Generating the dataset

Reference matching

Evaluation

Additional scripts

Datasets merging

Links export

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages