[DSpace-CRIS] Metadata extraction from PDF files via GROBID#11852

Open
vins01-4science wants to merge 3 commits into
DSpace:main from
4Science:task/main/DURACOM-443

@vins01-4science vins01-4science commented Feb 3, 2026

Contributor

References

Description

This PR introduces GROBID integration for automated metadata extraction from PDF files during the submission process.

Instructions for Reviewers

List of changes in this PR:

  • GROBID Integration: Added GROBID metadata extraction service for parsing PDF documents and extracting bibliographic metadata during submission
    • New GrobidImportMetadataSourceServiceImpl service for TEI XML parsing
    • New TEIAuthorMetadataContributor and TEIDateMetadataContributor classes for metadata mapping and transformation (useful for other TEI documents too)
    • REST endpoint integration via ExtractMetadataStep and DataProcessingStep
    • Configuration in grobid-integration.xml and grobid.cfg
  • Test Improvements: Added integration tests (GrobidMetadataExtractionIT) and unit tests for GROBID extraction
  • Submission Form Updates: Updated submission form configurations to support GROBID extraction

Note: Previous versions of this PR included JAXB-generated Java models for TEI. This is no longer the case, and the TEI XML is treated simply as a DOM Document.

GROBID Metadata Extraction Testing

To test this feature, you need a GROBID endpoint.
You can start one locally using the Docker image provided by GROBID, or you can use one of the publicly hosted GROBID services (several exist and can be found with a web search).

Main configuration

The main configuration has been placed inside the grobid-integration.xml file, and the mappings have been placed inside this metadata-map:

    <util:map id="grobidMetadataFieldMap"
              key-type="org.dspace.importer.external.metadatamapping.MetadataFieldConfig"
              value-type="org.dspace.importer.external.metadatamapping.contributor.MetadataContributor">
        <entry key-ref="grobid.title"             value-ref="grobidTitleContrib"/>
        <entry key-ref="grobid.date.issued"        value-ref="grobidDateContrib"/>
        <entry key-ref="grobid.contributor.author" value-ref="grobidAuthorContrib"/>
        <entry key-ref="grobid.source"             value-ref="grobidJournalContrib"/>
        <entry key-ref="grobid.issn"               value-ref="grobidIssnContrib"/>
        <entry key-ref="grobid.identifier.doi"     value-ref="grobidDoiContrib"/>
        <entry key-ref="grobid.identifier.isbn"    value-ref="grobidIsbnContrib"/>
        <entry key-ref="grobid.description.abstract" value-ref="grobidAbstractContrib"/>
        <entry key-ref="grobid.language"           value-ref="grobidLanguageContrib"/>
        <entry key-ref="grobid.subject"            value-ref="grobidSubjectContrib"/>
    </util:map>

Prerequisites

  • A running GROBID service (default: http://localhost:8070)
    • You can run GROBID via Docker: docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.2-full
  • A PDF file with embedded academic article metadata

Configuration Steps

  1. Enable GROBID in local.cfg:
    grobid.service.url = http://localhost:8070/
  2. Configure Submission Process (in item-submission.xml):
    Add the GROBID metadata extraction step to your submission process:
   <step-definition id="grobidmetadata" mandatory="true">
       <heading>submit.progressbar.describe.stepone</heading>
       <processing-class>org.dspace.app.rest.submit.step.DescribeStep</processing-class>
       <type>submission-form</type>
   </step-definition>
   
   <step-definition id="extractionstep">
       <heading>submit.progressbar.ExtractMetadataStep</heading>
       <processing-class>org.dspace.app.rest.submit.step.ExtractMetadataStep</processing-class>
       <type>extract</type>
   </step-definition>

Then add these steps to your submission process:

    <submission-process name="traditional">
       <step id="collection"/>
       <step id="grobidmetadata"/>
       <step id="upload"/>
       <step id="extractionstep"/>
       <!-- other steps -->
   </submission-process>

Then add the matching form definition to submission-forms.xml:

   <form name="grobidmetadata">
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>title</dc-element>
         <dc-qualifier></dc-qualifier>
         <repeatable>false</repeatable>
         <label>Title</label>
         <input-type>onebox</input-type>
         <required>Field required</required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>contributor</dc-element>
         <dc-qualifier>author</dc-qualifier>
         <label>Author</label>
         <input-type>name</input-type>
         <repeatable>false</repeatable>
         <required>You must enter at least the author.</required>
         <hint>Enter the names of the authors of this item in the form Lastname, Firstname [e.g. Smith, Josh or Smith, J].</hint>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>source</dc-element>
         <dc-qualifier></dc-qualifier>
         <repeatable>false</repeatable>
         <label>Journal</label>
         <input-type>onebox</input-type>
         <hint>Enter the Journal of the publication</hint>
         <required></required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>date</dc-element>
         <dc-qualifier>issued</dc-qualifier>
         <repeatable>false</repeatable>
         <label>Date of Issue</label>
         <style>col-sm-4</style>
         <input-type>date</input-type>
         <hint>Please give the date of previous publication or public distribution.
           You can leave out the day and/or month if they aren't
           applicable.</hint>
         <required>You must enter at least the year.</required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>identifier</dc-element>
         <dc-qualifier>issn</dc-qualifier>
         <repeatable>true</repeatable>
         <label>ISSN</label>
         <input-type>onebox</input-type>
         <hint>Enter the ISSN of the journal.</hint>
         <required></required>
       </field>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>identifier</dc-element>
         <dc-qualifier>isbn</dc-qualifier>
         <repeatable>true</repeatable>
         <label>ISBN of Book</label>
         <input-type>onebox</input-type>
         <hint>Enter the ISBN of the book in which this chapter appears.</hint>
         <required></required>
       </field>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>identifier</dc-element>
         <dc-qualifier>doi</dc-qualifier>
         <repeatable>true</repeatable>
         <label>DOI</label>
         <input-type>onebox</input-type>
         <hint>Enter the DOI of the publication.</hint>
         <required></required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>description</dc-element>
         <dc-qualifier>abstract</dc-qualifier>
         <repeatable>false</repeatable>
         <label>Abstract</label>
         <input-type>textarea</input-type>
         <hint>Enter the abstract of the item. </hint>
         <required></required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>subject</dc-element>
         <dc-qualifier></dc-qualifier>
         <repeatable>true</repeatable>
         <label>Subject</label>
         <input-type>onebox</input-type>
         <hint>Subject field that can be associated with an authority providing lookup</hint>
         <required></required>
       </field>
     </row>
   </form>
  3. Verify Spring Configuration:
    Ensure grobid-integration.xml is loaded (check dspace/config/spring/api/ directory)

Test Case 1: PDF Metadata Extraction

  1. Start a new submission in a collection using the configured submission process
  2. Upload a PDF file of an academic article during the upload step
    • Sample PDFs for testing are included in:
      • dspace-server-webapp/src/test/resources/org/dspace/app/rest/simple-article.pdf
      • dspace-api/src/test/java/org/dspace/submit/extraction/grobid/grobid.pdf
  3. Verify that after upload, the system automatically extracts one or more of the following metadata fields:
    • Title (dc.title)
    • Authors (dc.contributor.author)
    • Publication Date (dc.date.issued)
    • ISSN (dc.identifier.issn)
    • ISBN (dc.identifier.isbn)
    • DOI (dc.identifier.doi)
    • Abstract (dc.description.abstract)
    • Keywords (dc.subject)
  4. Navigate to the metadata form step - verify extracted metadata is pre-populated
  5. Complete the submission and verify the final item has all extracted metadata

Test Case 2: Metadata Not Overwritten

  1. Create a submission and manually enter metadata before uploading a PDF
  2. Upload a PDF - verify that manually entered metadata is NOT overwritten by extracted values
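The behaviour this test case checks can be pictured with a minimal sketch. It assumes metadata is held as a plain map from field name to a list of values; the class name `NonOverwritingMerge` and the method `merge` are hypothetical illustrations and are not taken from this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NonOverwritingMerge {

    /**
     * Hypothetical helper: copies extracted values only into fields that
     * currently have no value, so manual entries are never replaced.
     */
    public static Map<String, List<String>> merge(
            Map<String, List<String>> existing,
            Map<String, List<String>> extracted) {
        Map<String, List<String>> merged = new LinkedHashMap<>(existing);
        for (Map.Entry<String, List<String>> entry : extracted.entrySet()) {
            List<String> current = merged.get(entry.getKey());
            // Only fill fields the submitter has left empty.
            if (current == null || current.isEmpty()) {
                merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            }
        }
        return merged;
    }
}
```

With a manually entered dc.title and an extracted dc.title plus dc.identifier.doi, this sketch keeps the manual title and adds only the DOI.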

Test Case 3: GROBID Service Unavailable

  1. Stop the GROBID service
  2. Try uploading a PDF - verify graceful handling (submission continues without extraction, no errors blocking submission)

Running Automated Tests

Run GROBID-specific integration tests

mvn install -DskipIntegrationTests=false -Dit.test=GrobidMetadataExtractionIT -DfailIfNoTests=false

Run unit tests for GROBID service

mvn test -DskipUnitTests=false -Dtest=GrobidImportMetadataSourceServiceImplTest -DfailIfNoTests=false

Checklist

  • My PR is created against the main branch of code (unless it is a backport or is fixing an issue specific to an older branch).
  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & integration tests). Exceptions may be made if previously agreed upon.
  • My PR passes Checkstyle validation based on the Code Style Guide.
  • My PR includes Javadoc for all new (or modified) public methods and classes. It also includes Javadoc for large or complex private methods.
  • My PR passes all tests and includes new/updated Unit or Integration Tests based on the Code Testing Guide.
  • My PR includes details on how to test it. I've provided clear instructions to reviewers on how to successfully test this fix or feature.
  • If my PR includes new libraries/dependencies (in any pom.xml), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • If my PR modifies REST API endpoints, I've opened a separate REST Contract PR related to this change.
  • If my PR includes new configurations, I've provided basic technical documentation in the PR itself.
  • If my PR fixes an issue ticket, I've linked them together.

@tdonohue tdonohue added new feature component: submission Related to configurable submission system DSpace-CRIS merger This ticket/PR relates to the merger of DSpace-CRIS into DSpace. labels Feb 3, 2026
@tdonohue tdonohue moved this to 🙋 Needs Reviewers Assigned in DSpace 10.0 Release Feb 3, 2026
@tdonohue tdonohue added this to the 10.0 milestone Feb 3, 2026
@tdonohue tdonohue changed the title [MERGER] Metadata extraction from PDF files via GROBID [DSpace-CRIS] Metadata extraction from PDF files via GROBID Feb 3, 2026
@github-actions github-actions Bot added the merge conflict PR has a merge conflict that needs resolution label Feb 5, 2026
@github-actions

github-actions Bot commented Feb 5, 2026


Hi @vins01-4science,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

@github-actions github-actions Bot removed the merge conflict PR has a merge conflict that needs resolution label Feb 12, 2026
@vins01-4science vins01-4science marked this pull request as ready for review February 20, 2026 08:52
@damian-joz damian-joz requested a review from Dawnkai February 26, 2026 12:38
@tdonohue tdonohue self-requested a review February 26, 2026 15:16
@tdonohue tdonohue moved this from 🙋 Needs Reviewers Assigned to 👀 Under Review in DSpace 10.0 Release Feb 26, 2026
@nwoodward nwoodward self-requested a review March 12, 2026 17:47
@github-actions github-actions Bot added the merge conflict PR has a merge conflict that needs resolution label Mar 13, 2026
@github-actions


Hi @vins01-4science,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

@Dawnkai

Dawnkai commented Mar 19, 2026


I had a look at the code and tested this PR. Overall, the integration is functional and follows the test cases provided by Vincenzo. However, given the scale of the contribution (115+ files), it would be nice if someone else also took a look at the code (if possible).

There are a few places, like response.getEntity().getContent(), where a null response or entity could trigger a NullPointerException, but considering that tests pass and a potential connection error does not crash the submission form (though it does dump the whole stack trace in the logs), I think that's okay.
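For illustration, a null-safe variant of that call chain could look like the sketch below. The `Response` and `Body` interfaces are hypothetical stand-ins that only mirror the `getEntity().getContent()` shape; they are not the HTTP client types used in the PR, and `contentOf` is an invented name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;

public class SafeResponseContent {

    /** Hypothetical stand-in for an HTTP response; not the real client API. */
    public interface Response {
        Body getEntity();
    }

    /** Hypothetical stand-in for a response entity; not the real client API. */
    public interface Body {
        InputStream getContent() throws IOException;
    }

    /**
     * Returns the response stream if every link in the chain is present,
     * otherwise an empty Optional, so the caller can skip extraction
     * instead of hitting a NullPointerException.
     */
    public static Optional<InputStream> contentOf(Response response) {
        if (response == null || response.getEntity() == null) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(response.getEntity().getContent());
        } catch (IOException e) {
            // Treat an unreadable body like a missing one.
            return Optional.empty();
        }
    }
}
```

A caller could then log a single warning line when the Optional is empty, rather than letting the exception reach the logs as a full trace.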

Here is my feedback:

  1. Formatting - many files (like ItemImportServiceImpl) contain only indentation or whitespace changes. This needlessly inflates the PR size and has already caused merge conflicts during local testing when I tried to resolve them. I recommend reverting the changes where no functional logic was modified to keep the PR focused and maintainable and to prevent existing PRs from suddenly triggering merge conflicts (as some of those files are used by many other classes).

  2. Missing docs - while most new classes are annotated, GrobidImportMetadataSecurityServiceImpl lacks Javadoc. This class is arguably the most important one in this PR, so detailed documentation would help in future maintenance (especially the handling of "analytic", "monogr" and "series").

  3. The ConsolidateHeaderEnum enum - I can see it defines three options, but only CONSOLIDATE_AND_INJECT_METADATA is used. It would probably be a good idea to explicitly note in the code that other modes are untested/unsupported.

  4. I would also consider treating GROBID as a separate module instead of adding it to dspace.cfg for the sake of potential future improvements.

And speaking of potential future improvements:

  1. The current implementation disregards submission form constraints. For example, if dc.contributor.author is set as non-repeatable in the input forms, the integration still injects multiple values. This "breaks" the submission form for the end-user by preventing them from depositing the publication unless they manually remove additional authors. I assume this will also apply to regex constraints.

  2. I tried edge-case testing this PR. It worked with a PDF file with just one letter, but when I tried a 65-page PDF, the integration injected hundreds of repeating metadata entries. This resulted in the browser freezing and rendering the record inaccessible, even after the file had been processed (the submission form never loads). I recommend implementing a configurable limit on the number of values extracted per field.

  3. Currently, the integration can overwrite manual entries if the user hasn't clicked "Save" yet. We should consider a warning or some sort of way to preserve manually entered but unsaved values in the submission form.
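As a rough sketch of what points 1 and 2 could mean in practice, the snippet below caps extracted values per field and keeps a single value for non-repeatable fields. It assumes the extracted metadata is available as a map from field name to values; `ExtractedValueLimiter`, `constrain`, and the `maxPerField` parameter are hypothetical names, and a real implementation would read the repeatable flag from the submission form configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExtractedValueLimiter {

    /**
     * Hypothetical helper: removes exact duplicates, keeps one value for
     * non-repeatable fields, and at most maxPerField values otherwise.
     */
    public static Map<String, List<String>> constrain(
            Map<String, List<String>> extracted,
            Set<String> nonRepeatableFields,
            int maxPerField) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : extracted.entrySet()) {
            // Drop exact duplicates while keeping first-seen order.
            List<String> distinct = new ArrayList<>(new LinkedHashSet<>(entry.getValue()));
            int cap = nonRepeatableFields.contains(entry.getKey()) ? 1 : maxPerField;
            int keep = Math.max(0, Math.min(cap, distinct.size()));
            result.put(entry.getKey(), new ArrayList<>(distinct.subList(0, keep)));
        }
        return result;
    }
}
```

Which value to keep for a non-repeatable field (the first one here) is a design choice; keeping the first matches document order, which is usually the first-listed author.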

In my opinion, this feature really should have documentation. Given the complexity of the TEI mapping, a brief guide on how to add or map new fields would be highly beneficial, especially for new developers. The topic of extracting metadata from PDFs in the DSpace ecosystem is gaining traction, so I assume people will be interested in this feature. Without this documentation, future updates to GROBID versions could cause the integration to work incorrectly (or not work at all), and fixing these issues without documentation might be significantly time-consuming for the community, especially since they would have to edit around 100 files added by this PR.

The feature works well for standard academic papers, but the lack of adherence to DSpace submission rules and the potential for browser-side DoS with large files should be addressed in the near future.

@tdonohue

Member

@vins01-4science : I haven't had a chance to give this a full review, but I do have one initial question:

  • Is there a reason why we need to add all this JAXB generated code directly into our "core" DSpace/DSpace repository? In the past, we've tried to keep JAXB generated code in a separate module (e.g. https://github.com/DSpace/orcid-jaxb-api) in order to make it easier to maintain separate from the core code. That way we can avoid the extra auto-generated code being directly in our main codebase.

Overall, it looks like the vast majority of this PR is JAXB-generated code which doesn't necessarily need to follow all our normal DSpace code policies. However, by putting this code directly in our main codebase we would have to apply all the same policies to the code (including JavaDocs, test coverage, etc)

I'd strongly recommend we consider moving the JAXB code to a separate DSpace/grobid-api repository (or similar) where we can document how this code was generated & regenerate it easily. It also will allow us to make an exception for this code to allow it to not have the normal required code coverage (i.e. we won't need to require tests).

@vins01-4science

vins01-4science commented Mar 26, 2026

Contributor Author

> @vins01-4science : I haven't had a chance to give this a full review, but I do have one initial question:
>
> * Is there a reason why we need to add all this JAXB generated code directly into our "core" DSpace/DSpace repository? In the past, we've tried to keep JAXB generated code in a separate module (e.g. https://github.com/DSpace/orcid-jaxb-api) in order to make it easier to maintain _separate_ from the core code. That way we can avoid the extra auto-generated code being directly in our main codebase.
>
> Overall, it looks like the vast majority of this PR is JAXB-generated code which doesn't necessarily need to follow all our normal DSpace code policies. However, by putting this code directly in our main codebase we would have to apply all the same policies to the code (including JavaDocs, test coverage, etc)
>
> I'd strongly recommend we consider moving the JAXB code to a separate DSpace/grobid-api repository (or similar) where we can document how this code was generated & regenerate it easily. It also will allow us to make an exception for this code to allow it to not have the normal required code coverage (i.e. we won't need to require tests).

As I said in the PR description, the JAXB-generated code comes from an older version of the XSD provided by GROBID. As of now, the classes can no longer be generated with JAXB because of the high complexity of the schema. (There are also some open issues on the GitHub repo that refer to this problem.)

So we cannot easily move them out to be generated at runtime from the latest schema.

@tdonohue

Member

Thanks for clarifying @vins01-4science why all the JAXB-generated code is in this PR. That makes sense to me then to add it into "core" DSpace since it's no longer possible to use JAXB to regenerate this code.

That all said, I'm still concerned that this PR significantly decreases our code coverage (because of all that JAXB generated code). I'd feel a lot better if we could create automated tests (either Unit or Integration) for that JAXB-generated code, especially since we will now need to maintain that code as-is (i.e. cannot regenerate it via JAXB in the future).

As I said in today's Dev Meeting, I'm not sure whether we'll get this specific PR into 10.0, because it's not as high priority as the other CRIS-related PRs.

But, my main feedback is that I think we need to find ways to increase the code coverage here, as it will make all this new code easier to maintain over time.

@tdonohue tdonohue removed this from the 10.0 milestone Apr 2, 2026
@tdonohue tdonohue moved this to 🙋 Needs Reviewers Assigned in DSpace 11.0 Release Apr 2, 2026
@kshepherd kshepherd self-assigned this Jun 26, 2026
@kshepherd kshepherd force-pushed the task/main/DURACOM-443 branch from af5f9c8 to 0c14cfd Compare June 26, 2026 17:06
@github-actions github-actions Bot removed the merge conflict PR has a merge conflict that needs resolution label Jun 26, 2026
@kshepherd

kshepherd commented Jun 26, 2026

Member

@tdonohue @Dawnkai @nwoodward this is ready for re-review. It has been completely refactored so as not to rely on the JAXB-generated modelling and document parsing, instead using the existing tools we have for other external sources like XPath-based metadata mapping directly against the TEI XML document.
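To give a feel for XPath-based mapping against a TEI DOM document, here is a minimal standalone sketch using only the JDK's XML APIs. It is not the PR's contributor code: the class `TeiXPathSketch`, the method `values`, and the sample XPath are illustrative assumptions, and `local-name()` is used here only to sidestep namespace prefix setup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TeiXPathSketch {

    /** Hypothetical helper: evaluates an XPath against TEI XML and returns non-empty text values. */
    public static List<String> values(String teiXml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations to avoid XXE when parsing remote output.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(teiXml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath on TEI document", e);
        }
    }
}
```

For example, `values(tei, "//*[local-name()='titleStmt']/*[local-name()='title']")` would return the title text from a TEI header that contains a `titleStmt/title` element.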

The property configs have been moved to grobid.cfg, and the URL key has been changed slightly to grobid.service.url to match the other configs in there (the pooled HTTP client expects the same prefix).

@kshepherd kshepherd force-pushed the task/main/DURACOM-443 branch from 0c14cfd to 73458ed Compare June 27, 2026 10:03
ref: DURACOM-443

# Conflicts:
#	dspace-api/src/test/data/dspaceFolder/config/item-submission.xml
#	dspace-api/src/test/data/dspaceFolder/config/submission-forms.xml
#	dspace-server-webapp/src/test/java/org/dspace/app/rest/SubmissionFormsControllerIT.java
@kshepherd kshepherd force-pushed the task/main/DURACOM-443 branch from 73458ed to ff8bb99 Compare June 27, 2026 10:29

Labels

component: submission Related to configurable submission system DSpace-CRIS merger This ticket/PR relates to the merger of DSpace-CRIS into DSpace. new feature

Projects

Status: 🙋 Needs Reviewers Assigned

4 participants