[pyos][design] Identification logic is weak, difficult to follow, and inconsistent.

Part of: https://github.com/pyOpenSci/software-submission/issues/267#issuecomment-3886731422

If a DOI can be parsed from an item, then we look up the metadata from Crossref and return. That seems OK.

The rest of what happens in the identifier is a pretty big mystery. I would expect it to just query every available metadata source whose API supports the kind of query at hand. Sure, you might not try to treat everything like a GitHub URL, but if you haven't identified a work so far, you might as well check OpenAIRE.
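To make that expectation concrete, here is a minimal sketch of "ask every source that could plausibly answer, and stop when one identifies the work". Everything in it is hypothetical: `Source`, `applies`, `lookup`, and `identify` are illustration-only names that do not exist in OneCite, and the lambdas are stubs standing in for real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Source:
    """Hypothetical description of one metadata provider."""
    name: str
    applies: Callable[[str], bool]            # cheap check: is it worth asking?
    lookup: Callable[[str], Optional[dict]]   # metadata, or None if not found


def identify(query: str, sources: list[Source]) -> Optional[dict]:
    """Ask every applicable source in order until one identifies the work."""
    for source in sources:
        if not source.applies(query):
            continue
        record = source.lookup(query)
        if record is not None:
            return {"source": source.name, **record}
    return None


# Stub lookups stand in for real API calls (assumption, not OneCite code).
sources = [
    Source("github", lambda q: "github.com/" in q, lambda q: {"url": q}),
    Source("crossref", lambda q: True, lambda q: None),  # pretend: no match
    Source("openaire", lambda q: True, lambda q: {"title": q}),
]

result = identify("Some thesis title", sources)
```

The point of the `applies` check is that it only rules out sources that cannot answer at all (a free-text title is not a GitHub URL). It is not used to guess the document type from keywords in the query.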

Instead there is a ton of ad-hoc, in-place logic for determining where to check. Most of the time, querying the sources is not even attempted because the conditions are so specific:

- We have to have the Zenodo DOI fragment present in the query string to check Zenodo ( https://github.com/HzaCode/OneCite/blob/12b1dea45a2b7ddcc60b1abc6cc29984b1aefbc8/onecite/pipeline.py#L531 ), but that's a valid DOI, so we should have already gotten the metadata from Crossref!
- For some reason the Figshare code is *also* in the Zenodo method?
- To check OpenAIRE we need to literally have the word "dissertation" or "phd thesis" in our query? https://github.com/HzaCode/OneCite/blob/12b1dea45a2b7ddcc60b1abc6cc29984b1aefbc8/onecite/pipeline.py#L631-L634
- Why do we then try to guess the university with a regex pattern when we should be receiving it from the metadata provider? https://github.com/HzaCode/OneCite/blob/12b1dea45a2b7ddcc60b1abc6cc29984b1aefbc8/onecite/pipeline.py#L671-L674

The whole identifier class needs to be restructured around sensible query logic. As it stands, the identifier seems to pretty much always just query Crossref, find a good-enough match, and continue from there. The other sources are not attempted, no matter how hard I try to give examples that should be found elsewhere.
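One possible shape for that restructuring, sketched under assumptions: rather than accepting the first good-enough Crossref hit, collect candidates from every source and keep the best-scoring one. The names `best_match` and `title_similarity`, the `{"title": ...}` record shape, and the `0.8` threshold are all hypothetical. `difflib.SequenceMatcher` is a stand-in for whatever scoring the project actually uses.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

# Hypothetical shape: a source maps a query to a list of candidate records.
Search = Callable[[str], list[dict]]


def title_similarity(query: str, title: str) -> float:
    """Case-insensitive similarity in [0, 1] between query and candidate title."""
    return SequenceMatcher(None, query.lower(), title.lower()).ratio()


def best_match(query: str, sources: dict[str, Search],
               threshold: float = 0.8) -> Optional[dict]:
    """Query every source, then return the single best candidate overall."""
    candidates = []
    for name, search in sources.items():
        for record in search(query):
            score = title_similarity(query, record.get("title", ""))
            candidates.append({"source": name, "score": score, **record})
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c["score"])
    return best if best["score"] >= threshold else None


# Stub searches with made-up titles stand in for real API calls.
sources = {
    "crossref": lambda q: [{"title": "Deep learning for citation parsing: a survey"}],
    "openaire": lambda q: [{"title": "Deep learning for citation parsing"}],
}

match = best_match("Deep learning for citation parsing", sources)
```

With these stubs the OpenAIRE record wins because its title matches the query exactly, even though Crossref also returned a plausible hit. That is the behavior the current first-match logic cannot produce.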

[pyos][design] Identification logic is weak, difficult to follow, and inconsistent. #23

	thesis_keywords = [
	    'phd thesis', 'ph.d. thesis', 'doctoral thesis', 'dissertation',
	    'master thesis', "master's thesis", 'msc thesis', 'm.s. thesis'
	]

	university_patterns = [
	    r'(?:PhD|Ph\.D\.|Master|Doctoral|Dissertation).*?([A-Z][^.]*?University[^.]*?)\.?\s*$',
	    r'([A-Z][^.]*?University[^.]*?)\.?\s*$',
	]
