Crossref import with DOI URI behaves incorrectly #9561

Closed
alanorth opened this issue May 9, 2024 · 4 comments · Fixed by #9582
Labels
bug tools: import-sources Related to "Live Import" Sources feature, allowing import of content via external APIs.

@alanorth
Contributor

alanorth commented May 9, 2024

Describe the bug
In DSpace 7.6.1 and current DSpace 8.0-SNAPSHOT at least, if you try to import an item using a DOI using its URI form from Crossref, you get millions of results.

To Reproduce
Steps to reproduce the behavior:

  1. Log into DSpace
  2. Go to MyDSpace
  3. Use the dropdown menu for importing metadata from an external source
  4. Choose Crossref and enter a DOI in URI format such as https://doi.org/10.1108/CAER-03-2020-0040

Expected behavior
DSpace should show exactly one result for the DOI.

Crossref's API supports retrieving the DOI in various formats, so I'm not sure what is going on. See:

Related work
#9385

@alanorth alanorth added bug needs triage New issue needs triage and/or scheduling tools: import-sources Related to "Live Import" Sources feature, allowing import of content via external APIs. labels May 9, 2024
@tdonohue tdonohue added help wanted Needs a volunteer to claim to move forward and removed needs triage New issue needs triage and/or scheduling labels May 9, 2024
@floriangantner
Contributor

@alanorth The current implementation

uriBuilder.addParameter("query", query.getParameterAsClass("query", String.class));
performs a free-text search via the query parameter rather than accessing the DOI resource directly.

So the effective queries issued by the search are, for example:

Reproduced on today's sandbox:

Screenshot 2024-05-13 at 13-21-28 DSpace Repository Import metadata from an external source

Searching by title is also possible.

Screenshot 2024-05-13 at 13-19-58 DSpace Repository Import metadata from an external source
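
The difference between a free-text search and direct access to the DOI resource can be sketched as follows. This is only an illustration, not DSpace code: the class and method names are hypothetical, and it assumes the public Crossref REST API base URL https://api.crossref.org/works and a DOI made up of URI-safe characters.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of DSpace: it only builds the two request URLs.
final class CrossrefUrlSketch {
    // Assumed base URL of the Crossref REST API "works" route.
    static final String WORKS = "https://api.crossref.org/works";

    // Free-text search: the whole input, DOI URI or not, is just a search term,
    // so Crossref may return a very large number of loosely matching works.
    static URI freeTextSearch(String input) {
        return URI.create(WORKS + "?query=" + URLEncoder.encode(input, StandardCharsets.UTF_8));
    }

    // Direct lookup: the bare DOI is part of the path, so the request addresses one work.
    // Assumes the DOI contains only characters that are legal in a URI path.
    static URI directLookup(String bareDoi) {
        return URI.create(WORKS + "/" + bareDoi);
    }
}
```

With the DOI from this issue, `freeTextSearch("https://doi.org/10.1108/CAER-03-2020-0040")` yields a `?query=` URL with the whole URI percent-encoded as a search term, while `directLookup("10.1108/CAER-03-2020-0040")` addresses the single work.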

@alanorth
Contributor Author

@floriangantner Ah, I see! It's hard to imagine this free-text search being useful. Returning more than a single page of results, let alone millions, is a terrible user experience. Unless there's some way to make the free-text search more useful, I would say we should make this explicitly use DOIs, because that returns an exact match and is more likely the workflow that submitters will be using (at least at our institute, where our submitters are cataloging a journal article authored by one of our scientists).

@hutattedonmyarm
Contributor

hutattedonmyarm commented May 15, 2024

I'd blame DoiCheck:

return DoiCheck.isDoi(id) ? "filter=doi:" + id : StringUtils.EMPTY;

The Crossref import explicitly checks whether a DOI is given and only searches by query if no DOI is provided:

public Collection<ImportRecord> getRecords(String query, int start, int count) throws MetadataSourceException {
    String id = getID(query.toString());
    return StringUtils.isNotBlank(id) ? retry(new SearchByIdCallable(id))
                                      : retry(new SearchByQueryCallable(query, count, start));
}

However, isDoi recognizes only a very limited set of prefixes:

private static final List<String> DOI_PREFIXES = Arrays.asList("http://dx.doi.org/", "https://dx.doi.org/");

so https://doi.org/10.1108/CAER-03-2020-0040 is sent as a query parameter rather than treated as an ID.

We ran into the same problem a while ago and added a few more valid prefixes, some of them quite dated:

private static final List<String> DOI_PREFIXES = Arrays.asList(
            "http://dx.doi.org/",
            "https://dx.doi.org/",
            "http://www-dx.doi.org/",
            "https://www-dx.doi.org/",
            "http://doi.org/",
            "https://doi.org/",
            "dx.doi.org/",
            "www.dx.doi.org/",
            "doi:");
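
A minimal, self-contained sketch of how such a prefix list could be used to recognize a DOI is shown below. It is not the actual DoiCheck implementation: the class name, the case-insensitive prefix matching, and the loose bare-DOI pattern are assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a DOI check; not the DSpace DoiCheck class.
final class DoiPrefixSketch {
    // Prefix list taken from the comment above.
    private static final List<String> DOI_PREFIXES = List.of(
        "http://dx.doi.org/", "https://dx.doi.org/",
        "http://www-dx.doi.org/", "https://www-dx.doi.org/",
        "http://doi.org/", "https://doi.org/",
        "dx.doi.org/", "www.dx.doi.org/", "doi:");

    // Assumed loose shape of a bare DOI: "10.", a numeric registrant code, "/", a suffix.
    private static final Pattern BARE_DOI = Pattern.compile("^10\\.\\d{4,9}/\\S+$");

    // Removes the first matching prefix (ignoring case); returns the trimmed input otherwise.
    static String stripPrefix(String value) {
        String trimmed = value.trim();
        for (String prefix : DOI_PREFIXES) {
            if (trimmed.regionMatches(true, 0, prefix, 0, prefix.length())) {
                return trimmed.substring(prefix.length());
            }
        }
        return trimmed;
    }

    // True when the input, with any known prefix removed, looks like a bare DOI.
    static boolean isDoi(String value) {
        return BARE_DOI.matcher(stripPrefix(value)).matches();
    }
}
```

Under these assumptions, `isDoi("https://doi.org/10.1108/CAER-03-2020-0040")` is true, so the import would take the search-by-ID path, while a free-text title is still routed to the query search.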

@alanorth
Contributor Author

alanorth commented May 15, 2024

@hutattedonmyarm Wow, yes, this is very simple and obviously correct. While there doesn't seem to be a single canonical form for DOIs, I think the https://doi.org/10.xxxx/xxxxx URI format has become more common in the last few years. In our repository I have begun normalizing all DOIs to that format upon deposit. As one data point, Crossref has recommended this format since 2017.

Could you submit a patch with your additions?
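
The normalization upon deposit mentioned above could look roughly like the following sketch. It is an illustration only, not code from our repository or from DSpace: the class name is hypothetical, and the regular expression is an assumed, deliberately loose pattern for locating a DOI inside the input.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical normalizer: rewrites any recognizable DOI form to https://doi.org/<doi>.
final class DoiNormalizeSketch {
    // Assumed loose pattern: "10.", a numeric registrant code, "/", then a non-blank suffix.
    private static final Pattern DOI_IN_TEXT = Pattern.compile("10\\.\\d{4,9}/\\S+");

    // Returns the DOI as an https://doi.org/ URI, or empty if no DOI-like text is found.
    static Optional<String> toDoiUri(String value) {
        Matcher m = DOI_IN_TEXT.matcher(value.trim());
        return m.find() ? Optional.of("https://doi.org/" + m.group()) : Optional.empty();
    }
}
```

Locating the DOI by pattern instead of by prefix means that legacy resolver forms such as http://dx.doi.org/ and the doi: scheme end up in the same URI form.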

@tdonohue tdonohue added this to the 7.6.2 milestone May 15, 2024
@tdonohue tdonohue removed the help wanted Needs a volunteer to claim to move forward label May 15, 2024