Part of: pyOpenSci/software-submission#267 (comment)
If a DOI can be parsed from an item, we look up its metadata from Crossref and return. That seems OK.
The rest of what happens in the identifier is a pretty big mystery. I would expect it to just query every available metadata source whose API supports the kind of query we have (e.g. sure, you might not try to treat everything like a GitHub URL, but if you haven't identified a work so far, you might as well check OpenAIRE).
Instead there is a ton of ad-hoc, in-place logic for determining where to check. Most of the time, querying the other sources is not even attempted because the conditions are so specific. E.g. the Zenodo DOI fragment has to be present in the query string before we check Zenodo (`OneCite/onecite/pipeline.py`, line 531 in 12b1dea):

    zenodo_pattern = r'10\.5281/zenodo\.(\d+)'

But that's a valid DOI, so we should already have gotten its metadata in the DOI step! For some reason the figshare code is also in the zenodo method??? And to check OpenAIRE we need to literally have the word "dissertation" or "phd thesis" in our query???
`OneCite/onecite/pipeline.py`, lines 631 to 634 in 12b1dea:

    thesis_keywords = [
        'phd thesis', 'ph.d. thesis', 'doctoral thesis', 'dissertation',
        'master thesis', "master's thesis", 'msc thesis', 'm.s. thesis'
    ]

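To show how narrow that gate is, here is a minimal sketch. It assumes the keywords are applied as a case-insensitive substring check on the query; I have not confirmed that this is exactly how `pipeline.py` uses the list, and `looks_like_thesis` is a hypothetical name.

```python
# Keyword list copied from the snippet above.
THESIS_KEYWORDS = [
    'phd thesis', 'ph.d. thesis', 'doctoral thesis', 'dissertation',
    'master thesis', "master's thesis", 'msc thesis', 'm.s. thesis',
]


def looks_like_thesis(query: str) -> bool:
    """Assumed gate: true only if a keyword appears verbatim in the query."""
    lowered = query.lower()
    return any(keyword in lowered for keyword in THESIS_KEYWORDS)
```

Under that assumption, a reference written catalogue-style as "Thesis (Ph.D.), Example University" contains none of the keywords, so OpenAIRE would never be asked about it.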
Why do we then try to guess the university with a regex when we should be receiving that from the metadata provider???
`OneCite/onecite/pipeline.py`, lines 671 to 674 in 12b1dea:

    university_patterns = [
        r'(?:PhD|Ph\.D\.|Master|Doctoral|Dissertation).*?([A-Z][^.]*?University[^.]*?)\.?\s*$',
        r'([A-Z][^.]*?University[^.]*?)\.?\s*$',
    ]

The whole identifier class needs to be restructured around a sensible query logic. As it stands, the identifier seems to pretty much always just query Crossref, find a good-enough match, and continue from there. The other sources are not attempted, however hard I try to give examples that should be found elsewhere.
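Roughly the shape I have in mind, as a minimal sketch and not a drop-in replacement: resolve by DOI when one is present, otherwise ask every source and keep the best-scoring match. Everything here is hypothetical: the `Candidate` type, the adapter signatures, the `identify` function, and the `min_score` threshold of 0.8 are illustration only, not existing OneCite API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Generic DOI shape. A Zenodo DOI like 10.5281/zenodo.1234567 matches it too,
# so it needs no special-case pattern of its own.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


@dataclass
class Candidate:
    """One possible match. All fields are hypothetical, for illustration."""
    source: str    # name of the provider that returned the record
    score: float   # adapter-reported match confidence in [0, 1]
    record: dict   # whatever metadata the provider returned


# Hypothetical adapter signatures, one callable per provider.
DoiResolver = Callable[[str], Optional[Candidate]]
Searcher = Callable[[str], Optional[Candidate]]


def extract_doi(query: str) -> Optional[str]:
    """Return the first DOI-shaped substring, minus trailing punctuation."""
    match = DOI_PATTERN.search(query)
    return match.group(0).rstrip('.,;') if match else None


def identify(
    query: str,
    doi_resolvers: Sequence[DoiResolver],
    searchers: Sequence[Searcher],
    min_score: float = 0.8,  # arbitrary cutoff, chosen for the sketch
) -> Optional[Candidate]:
    """Resolve by DOI if possible, else ask every searcher and keep the best."""
    doi = extract_doi(query)
    if doi is not None:
        for resolve in doi_resolvers:
            found = resolve(doi)
            if found is not None:
                return found

    # No DOI, or nobody could resolve it: query every source, not just the first.
    candidates = [c for search in searchers if (c := search(query)) is not None]
    best = max(candidates, key=lambda c: c.score, default=None)
    return best if best is not None and best.score >= min_score else None
```

The point of the design is that a mediocre Crossref hit no longer short-circuits the search: every searcher gets a turn, and which sources apply is decided by what each adapter can handle, not by keywords in the query string.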