[pyos][design] Identification logic is weak, difficult to follow, and inconsistent. #23

@sneakers-the-rat

Part of: pyOpenSci/software-submission#267 (comment)

If a DOI can be parsed from an item, we look up the metadata from Crossref and return. That seems OK.

The rest of what happens in the identifier is a pretty big mystery. I would expect it to just query every available metadata source whose API supports that kind of query (e.g. sure, you might not try to treat everything like a GitHub URL, but if you haven't identified a work so far, you might as well check OpenAIRE).
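To make the expectation concrete, here is a minimal sketch of a fallback chain. It is not OneCite's code: `identify`, the `Lookup` type, and the `(name, lookup)` pairs are all hypothetical names, and each lookup is assumed to return a metadata dict or `None`.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical shape: a lookup takes the raw query and returns
# a metadata dict, or None when the source has no match.
Lookup = Callable[[str], Optional[dict]]


def identify(query: str, sources: Iterable[Tuple[str, Lookup]]) -> Optional[dict]:
    """Try every source in order and return the first record found."""
    for name, lookup in sources:
        try:
            record = lookup(query)
        except Exception:
            # Assumption for this sketch: a failing source should not
            # stop the remaining sources from being tried.
            continue
        if record:
            return {"source": name, **record}
    return None
```

The point is that no source is skipped because of keywords in the query; a source only gets skipped if it genuinely cannot handle that kind of query.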

Instead there is a ton of ad-hoc, in-place logic for determining where to check. Most of the time querying the sources is not even attempted because the logic is so specific. E.g. we have to have the Zenodo DOI fragment present in the query string to check Zenodo:

zenodo_pattern = r'10\.5281/zenodo\.(\d+)'

But that's a valid DOI, so we should have already gotten the metadata from Crossref! For some reason the figshare code is also in the Zenodo method??? And to check OpenAIRE we need to literally have the word "dissertation" or "phd thesis" in our query???

OneCite/onecite/pipeline.py

Lines 631 to 634 in 12b1dea

thesis_keywords = [
    'phd thesis', 'ph.d. thesis', 'doctoral thesis', 'dissertation',
    'master thesis', "master's thesis", 'msc thesis', 'm.s. thesis'
]
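Going back to the Zenodo gate for a second: a quick sketch of why it is redundant for detection. `DOI_PATTERN` and `extract_doi` are hypothetical (a loose DOI regex I am assuming for illustration, not the one in the pipeline); `zenodo_pattern` is the one quoted above. Whether a given registry actually holds a record for the DOI is a separate question this does not address.

```python
import re
from typing import Optional

# Assumed, deliberately loose DOI shape: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

# Pattern quoted from the pipeline.
zenodo_pattern = r'10\.5281/zenodo\.(\d+)'


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-looking substring, minus trailing punctuation."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    return match.group(0).rstrip('.,;)')


doi = extract_doi("Archived at https://doi.org/10.5281/zenodo.1234567.")
# The generic extractor already finds the Zenodo DOI, so the general
# DOI path sees it before any Zenodo-specific check would.
is_zenodo = doi is not None and re.search(zenodo_pattern, doi) is not None
```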
Why do we then try and guess the university pattern when we should be receiving that from the metadata provider???

OneCite/onecite/pipeline.py

Lines 671 to 674 in 12b1dea

university_patterns = [
    r'(?:PhD|Ph\.D\.|Master|Doctoral|Dissertation).*?([A-Z][^.]*?University[^.]*?)\.?\s*$',
    r'([A-Z][^.]*?University[^.]*?)\.?\s*$',
]
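What I would expect instead, roughly: take the institution from the provider's record and only fall back to guessing from the raw string when the record has nothing. This is a sketch, not the pipeline's code. `thesis_institution` is a hypothetical helper, and the `"institution"` and `"publisher"` keys are assumed field names, not any provider's actual schema.

```python
import re
from typing import Optional

# Patterns quoted from the pipeline, used here only as a last resort.
university_patterns = [
    r'(?:PhD|Ph\.D\.|Master|Doctoral|Dissertation).*?([A-Z][^.]*?University[^.]*?)\.?\s*$',
    r'([A-Z][^.]*?University[^.]*?)\.?\s*$',
]


def thesis_institution(record: dict, raw_text: str) -> Optional[str]:
    """Prefer provider metadata; guess from the raw text only if it is missing."""
    # Assumed field names: adjust to whatever the provider really returns.
    for key in ("institution", "publisher"):
        value = record.get(key)
        if value:
            return value
    # Fallback: the regex guess, applied to the user's raw reference string.
    for pattern in university_patterns:
        match = re.search(pattern, raw_text)
        if match:
            return match.group(1).strip()
    return None
```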

The whole identifier class needs to be restructured around a sensible query logic. As it stands, the identifier seems to pretty much always just query Crossref, find a good-enough match, and continue from there. The other sources are not attempted, however hard I try to give examples that should be found elsewhere.
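One possible shape for that, as a sketch under my own assumptions: collect candidates from every source and pick the best one, instead of stopping at the first good-enough Crossref hit. `Candidate`, `best_candidate`, the search callables, and the 0.8 threshold are all hypothetical; title similarity via `difflib` is just a stand-in for whatever scoring the project prefers.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Mapping, Optional


@dataclass
class Candidate:
    source: str
    title: str
    score: float


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two titles, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_candidate(
    query: str,
    sources: Mapping[str, Callable[[str], List[str]]],
    threshold: float = 0.8,  # assumed cutoff, not a tuned value
) -> Optional[Candidate]:
    """Score titles from all sources and return the best one above threshold."""
    candidates = [
        Candidate(name, title, title_similarity(query, title))
        for name, search in sources.items()
        for title in search(query)
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    return best if best.score >= threshold else None
```

With this shape, a weak Crossref match can no longer shadow an exact match that only exists in another source.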

Labels: bug