@@ -42,6 +42,7 @@
import org.jabref.model.entry.field.InternalField;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.identifier.ArXivIdentifier;
import org.jabref.logic.importer.util.UrlIdentifierParser;
import org.jabref.model.entry.identifier.DOI;
import org.jabref.model.entry.types.StandardEntryType;
import org.jabref.model.paging.Page;
@@ -339,7 +340,7 @@ public Page<BibEntry> performSearchPaged(BaseQueryNode queryNode, int pageNumber
public Optional<BibEntry> performSearchById(String identifier) throws FetcherException {
CompletableFuture<Optional<BibEntry>> arXivBibEntryPromise = arXiv.asyncPerformSearchById(identifier);
if (this.doiFetcher != null) {
inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, ArXivIdentifier.parse(identifier));
inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, UrlIdentifierParser.parseArXiv(identifier));
}
Comment on lines 340 to 344
2. arXiv PDF URL fails 🐞 Bug · Correctness

ArXivFetcher.performSearchById still calls arXiv.asyncPerformSearchById(identifier) with the raw input, so https://arxiv.org/pdf/<id>.pdf fails because ArXivIdentifier.parse rejects the trailing .pdf. The new UrlIdentifierParser.parseArXiv result is only used for the DOI-infusion optimization, not for the actual arXiv lookup, so the feature doesn’t work end-to-end for PDF URLs.
Agent Prompt
### Issue description
`ArXivFetcher.performSearchById` still passes the raw user input into `arXiv.asyncPerformSearchById(...)`. For inputs like `https://arxiv.org/pdf/2203.02155.pdf`, `ArXivIdentifier.parse(...)` rejects the `.pdf` suffix, so the actual fetch returns empty even though `UrlIdentifierParser.parseArXiv` can normalize this URL.

### Issue Context
`UrlIdentifierParser.parseArXiv(identifier)` is currently only used to accelerate DOI infusion, not to normalize the identifier used in the actual arXiv fetch.

### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java[339-346]

### Suggested fix
1. Parse once at the start of `performSearchById`:
   - `Optional<ArXivIdentifier> parsed = UrlIdentifierParser.parseArXiv(identifier);`
2. If `parsed.isEmpty()`, return `Optional.empty()` (or keep existing behavior, but avoid calling the API with a non-parseable URL).
3. Call `arXiv.asyncPerformSearchById(parsed.get().asString())` instead of using `identifier`.
4. Pass `parsed` into `inplaceAsyncInfuseArXivWithDoi(...)` so both the fetch and the optimization use the same normalized value.
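The suggested steps can be sketched without the JabRef classes by reusing the PR's own `ARXIV_URL_PATTERN`. `normalize` below is a hypothetical stand-in for `UrlIdentifierParser.parseArXiv` (the real method returns `Optional<ArXivIdentifier>`); it shows why parsing once up front fixes the PDF-URL case:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArXivNormalizeDemo {

    // Same pattern as the PR's UrlIdentifierParser.ARXIV_URL_PATTERN.
    private static final Pattern ARXIV_URL_PATTERN =
            Pattern.compile("https?://arxiv\\.org/(?:abs|pdf)/([\\w.\\-]+?)(?:\\.pdf)?$");

    // Hypothetical stand-in for UrlIdentifierParser.parseArXiv: strips the URL
    // wrapper and the ".pdf" suffix. The real code would additionally validate
    // the result via ArXivIdentifier.parse.
    static Optional<String> normalize(String input) {
        Matcher matcher = ARXIV_URL_PATTERN.matcher(input.trim());
        if (matcher.find()) {
            return Optional.of(matcher.group(1));
        }
        return Optional.of(input.trim());
    }

    public static void main(String[] args) {
        // The fetch should use this normalized value, not the raw input:
        System.out.println(normalize("https://arxiv.org/pdf/2203.02155.pdf").get()); // 2203.02155
        System.out.println(normalize("https://arxiv.org/abs/2203.02155").get());     // 2203.02155
        System.out.println(normalize("2203.02155").get());                           // 2203.02155
    }
}
```

With a single up-front parse, both `arXiv.asyncPerformSearchById(...)` and `inplaceAsyncInfuseArXivWithDoi(...)` would receive the same normalized identifier.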


return arXivBibEntryPromise.join();
}
@@ -31,6 +31,7 @@
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.identifier.DOI;
import org.jabref.logic.importer.util.UrlIdentifierParser;
import org.jabref.model.entry.types.StandardEntryType;
import org.jabref.model.util.DummyFileUpdateMonitor;
import org.jabref.model.util.OptionalUtil;
@@ -88,7 +89,7 @@ public Optional<HelpFile> getHelpPage() {
private void doAPILimiting(String identifier) {
// Without a generic API Rate Limiter implemented on the project, use Guava's RateLimiter for avoiding
// API throttling when multiple threads are working, especially during DOI Content Negotiations
Optional<DOI> doi = DOI.parse(identifier);
Optional<DOI> doi = UrlIdentifierParser.parseDOI(identifier);

try {
Optional<String> agency;
@@ -121,7 +122,7 @@ protected CompletableFuture<Optional<BibEntry>> asyncPerformSearchById(String id

@Override
public Optional<BibEntry> performSearchById(String identifier) throws FetcherException {
DOI doi = DOI.parse(identifier)
DOI doi = UrlIdentifierParser.parseDOI(identifier)
.orElseThrow(() -> new FetcherException(Localization.lang("Invalid DOI: '%0'.", identifier)));
qodo-free-for-open-source-projects[bot] marked this conversation as resolved.

URL doiURL;
@@ -141,7 +142,7 @@ public Optional<BibEntry> performSearchById(String identifier) throws FetcherExc
throw new FetcherException("Invalid URL", e);
}
if (agency.isPresent() && "medra".equalsIgnoreCase(agency.get())) {
return new Medra().performSearchById(identifier);
return new Medra().performSearchById(doi.asString());
}

URLDownload download = getUrlDownload(doiURL);
@@ -0,0 +1,59 @@
package org.jabref.logic.importer.util;

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jabref.model.entry.identifier.ArXivIdentifier;
import org.jabref.model.entry.identifier.DOI;

/**
* Parses identifiers from URLs and plain text.
* Extracts DOI, arXiv ID, etc. from various URL formats.
*/
public class UrlIdentifierParser {

private static final Pattern DOI_URL_PATTERN =
Pattern.compile("https?://(?:dx\\.)?doi\\.org/(.+)");

private static final Pattern DOI_ACM_PATTERN =
Pattern.compile("https?://dl\\.acm\\.org/doi/(?:abs/)?(.+)");

private static final Pattern ARXIV_URL_PATTERN =
Pattern.compile("https?://arxiv\\.org/(?:abs|pdf)/([\\w.\\-]+?)(?:\\.pdf)?$");

public static Optional<DOI> parseDOI(String input) {
if (input == null || input.isBlank()) {
return Optional.empty();
}

String trimmedInput = input.trim();

Matcher doiUrlMatcher = DOI_URL_PATTERN.matcher(trimmedInput);
if (doiUrlMatcher.find()) {
return DOI.parse(doiUrlMatcher.group(1));
}

Matcher acmMatcher = DOI_ACM_PATTERN.matcher(trimmedInput);
if (acmMatcher.find()) {
return DOI.parse(acmMatcher.group(1));
}
Comment on lines +19 to +40

4. ACM DOI regex rejects PDF URLs 🐞 Bug · Correctness

UrlIdentifierParser.parseDOI short-circuits on DOI_ACM_PATTERN and captures everything after /doi/, so URLs like https://dl.acm.org/doi/pdf/10.... are turned into pdf/10.... and then rejected by DOI.parse. This is a regression because DOI.parse is already able to extract a DOI embedded later in an arbitrary https URL.
Agent Prompt
### Issue description
`UrlIdentifierParser.parseDOI` uses `DOI_ACM_PATTERN = https?://dl.acm.org/doi/(?:abs/)?(.+)` and returns `DOI.parse(acmMatcher.group(1))`. For common ACM URLs such as `/doi/pdf/10.1145/...`, this extracts `pdf/10.1145/...` and causes parsing to fail.

### Issue Context
`DOI.parse(...)` is already designed to handle many URL forms by allowing an arbitrary `https?://...` prefix before the DOI group.

### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[16-43]

### Suggested fix options
Option A (simplest/robust):
- Remove the special-case ACM/doi.org regexes and just `return DOI.parse(trimmedInput);` (or use `DOI.findInText(trimmedInput)` first if you want to safely ignore query/fragment junk).

Option B (keep special-cases):
- Tighten the ACM pattern to capture only a DOI starting with `10.` and stop at query/fragment:
  - e.g., `https?://dl\\.acm\\.org/doi/(?:abs/|pdf/|full/)?(10\\.[^\\s?#]+)`
- Use `matches()` (or anchor with `^...$`) instead of `find()` so you don’t accidentally capture trailing unrelated text.
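Option B can be sketched as a self-contained regex demo. The tightened pattern below is an assumption: it is anchored at the start of the URL, and the capture is limited to a DOI beginning with `10.` and stopped before whitespace, `?`, or `#`, so `/doi/pdf/...` links and trailing query strings are both handled:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcmDoiPatternDemo {

    // Tightened candidate for DOI_ACM_PATTERN: the capture must start with
    // "10." and stops before whitespace, a query ("?"), or a fragment ("#").
    private static final Pattern TIGHTENED_ACM_PATTERN =
            Pattern.compile("^https?://dl\\.acm\\.org/doi/(?:abs/|pdf/|full/)?(10\\.[^\\s?#]+)");

    static Optional<String> extractDoi(String url) {
        // "^" anchors the match at the start of the input; the capture group
        // still stops cleanly before any trailing query string or fragment.
        Matcher matcher = TIGHTENED_ACM_PATTERN.matcher(url.trim());
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractDoi("https://dl.acm.org/doi/pdf/10.1145/3544548.3580995"));           // Optional[10.1145/3544548.3580995]
        System.out.println(extractDoi("https://dl.acm.org/doi/10.1145/3544548.3580995?download=true")); // Optional[10.1145/3544548.3580995]
        System.out.println(extractDoi("https://example.com/doi/pdf/10.1145/x"));                        // Optional.empty
    }
}
```

Note the anchored `find()` here instead of `matches()`: a full-string `matches()` would reject otherwise-valid URLs carrying a query string, while the `10.`-prefixed capture already prevents accidentally grabbing `pdf/` or other path segments.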



return DOI.parse(trimmedInput);
}

public static Optional<ArXivIdentifier> parseArXiv(String input) {
if (input == null || input.isBlank()) {
return Optional.empty();
}

String trimmedInput = input.trim();

Matcher arxivMatcher = ARXIV_URL_PATTERN.matcher(trimmedInput);
if (arxivMatcher.find()) {
return ArXivIdentifier.parse(arxivMatcher.group(1));
}

return ArXivIdentifier.parse(trimmedInput);
}
}
@@ -0,0 +1,103 @@
package org.jabref.logic.importer.util;

import org.jabref.model.entry.identifier.ArXivIdentifier;
import org.jabref.model.entry.identifier.DOI;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

class UrlIdentifierParserTest {

@Test
void parseDOIFromPlainDOI() {
String input = "10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromDoiOrgURL() {
String input = "https://doi.org/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromDxDoiOrgURL() {
String input = "https://dx.doi.org/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromHTTPURL() {
String input = "http://doi.org/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromACMDigitalLibrary() {
String input = "https://dl.acm.org/doi/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromACMAbsURL() {
String input = "https://dl.acm.org/doi/abs/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIReturnsEmptyForNull() {
assertFalse(UrlIdentifierParser.parseDOI(null).isPresent());
}

@Test
void parseDOIReturnsEmptyForEmptyString() {
assertFalse(UrlIdentifierParser.parseDOI("").isPresent());
}

@Test
void parseDOIReturnsEmptyForInvalidURL() {
assertFalse(UrlIdentifierParser.parseDOI("https://example.com").isPresent());
}

@Test
void parseArXivFromPlainID() {
String input = "2203.02155";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivFromAbsURL() {
String input = "https://arxiv.org/abs/2203.02155";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivFromPDFURL() {
String input = "https://arxiv.org/pdf/2203.02155.pdf";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivFromHTTPURL() {
String input = "http://arxiv.org/abs/2203.02155";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivReturnsEmptyForNull() {
assertFalse(UrlIdentifierParser.parseArXiv(null).isPresent());
}

@Test
void parseArXivReturnsEmptyForInvalidURL() {
assertFalse(UrlIdentifierParser.parseArXiv("https://example.com").isPresent());
}

@Test
void parseArXivHandlesOldIDFormat() {
String input = "https://arxiv.org/abs/math.GT/0309136";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}
Comment on lines +13 to +102
1. Tests only assert presence 📘 Rule violation · Correctness

UrlIdentifierParserTest uses assertTrue(optional.isPresent()) / assertFalse(optional.isPresent()) instead of asserting the exact parsed DOI/arXiv value. This weakens test precision and can allow incorrect-but-present parsing results to pass.
Agent Prompt
## Issue description
The new tests only assert `Optional.isPresent()` / `isPresent()`-negation, which is a weak predicate and does not verify that the parser extracted the correct DOI/arXiv identifier.

## Issue Context
Per test compliance, assertions should compare against the full expected value/structure (e.g., `assertEquals(expectedOptional, actualOptional)`).

## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[13-102]
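The stronger assertion style can be illustrated with a hypothetical String-based stand-in for `parseDOI` (in the real JUnit test, `assertEquals` would compare `Optional<DOI>` values built via `DOI.parse(...)`):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictAssertionDemo {

    private static final Pattern DOI_URL_PATTERN =
            Pattern.compile("https?://(?:dx\\.)?doi\\.org/(.+)");

    // Hypothetical stand-in for UrlIdentifierParser.parseDOI that returns the
    // raw DOI string instead of a DOI object.
    static Optional<String> parseDoi(String input) {
        Matcher matcher = DOI_URL_PATTERN.matcher(input.trim());
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> actual = parseDoi("https://doi.org/10.1145/3544548.3580995");

        // Weak (what the PR's tests do): passes even if the wrong DOI is extracted.
        boolean weak = actual.isPresent();

        // Strong (what the review asks for): compares the full expected value,
        // mirroring assertEquals(Optional.of(expected), actual) in JUnit.
        boolean strong = actual.equals(Optional.of("10.1145/3544548.3580995"));

        System.out.println(weak + " " + strong); // true true
    }
}
```

Comparing the whole `Optional` also covers the empty cases: `assertEquals(Optional.empty(), parseDoi("https://example.com"))` fails loudly if the parser starts returning a garbage value.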


}