[DSpace-CRIS] Metadata extraction from PDF files via GROBID by vins01-4science · Pull Request #11852 · DSpace/DSpace

vins01-4science · 2026-02-03T17:53:10Z

References

Fixes [DSpace-CRIS] Metadata extraction from PDF files via GROBID #11750 (GROBID integration feature)

Description

This PR introduces GROBID integration for automated metadata extraction from PDF files during the submission process.

Instructions for Reviewers

List of changes in this PR:

- GROBID Integration: Added GROBID metadata extraction service for parsing PDF documents and extracting bibliographic metadata during submission
  - New GrobidImportMetadataSourceServiceImpl service for TEI XML parsing
  - New TEIAuthorMetadataContributor and TEIDateMetadataContributor classes for metadata mapping and transformation (useful for other TEI documents too)
  - REST endpoint integration via ExtractMetadataStep and DataProcessingStep
  - Configuration in grobid-integration.xml and grobid.cfg
- Test Improvements: Added integration tests (GrobidMetadataExtractionIT) and unit tests for GROBID extraction
- Submission Form Updates: Updated submission form configurations to support GROBID extraction

Note: Previous versions of this PR included JAXB-generated Java models for TEI. This is no longer the case, and the TEI XML is treated simply as a DOM Document.

GROBID Metadata Extraction Testing

To test this feature, you need a GROBID endpoint.
You can start one locally using the Docker image provided by GROBID, or use one of the publicly available online GROBID services.

Main configuration

The main configuration has been placed inside the grobid-integration.xml file, and the mappings have been placed inside this metadata-map:

    <util:map id="grobidMetadataFieldMap"
              key-type="org.dspace.importer.external.metadatamapping.MetadataFieldConfig"
              value-type="org.dspace.importer.external.metadatamapping.contributor.MetadataContributor">
        <entry key-ref="grobid.title"             value-ref="grobidTitleContrib"/>
        <entry key-ref="grobid.date.issued"        value-ref="grobidDateContrib"/>
        <entry key-ref="grobid.contributor.author" value-ref="grobidAuthorContrib"/>
        <entry key-ref="grobid.source"             value-ref="grobidJournalContrib"/>
        <entry key-ref="grobid.issn"               value-ref="grobidIssnContrib"/>
        <entry key-ref="grobid.identifier.doi"     value-ref="grobidDoiContrib"/>
        <entry key-ref="grobid.identifier.isbn"    value-ref="grobidIsbnContrib"/>
        <entry key-ref="grobid.description.abstract" value-ref="grobidAbstractContrib"/>
        <entry key-ref="grobid.language"           value-ref="grobidLanguageContrib"/>
        <entry key-ref="grobid.subject"            value-ref="grobidSubjectContrib"/>
    </util:map>
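
Conceptually, each entry in this map pairs a target metadata field with a contributor that pulls a value out of the TEI document returned by GROBID. The Python sketch below is not the PR's Java implementation; it only illustrates the idea with a plain table of XPath expressions. The sample TEI, the `FIELD_XPATHS` table and the `extract_fields` helper are hypothetical, and the XPath expressions are assumptions rather than the ones defined in grobid-integration.xml.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, trimmed-down TEI header used only for this illustration.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Article</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <idno type="DOI">10.1000/example</idno>
          </analytic>
          <monogr>
            <title level="j">Journal of Examples</title>
            <idno type="ISSN">1234-5678</idno>
          </monogr>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
"""

# Assumed field-to-XPath table; the real mappings live in the Spring config.
FIELD_XPATHS = {
    "dc.title": ".//tei:titleStmt/tei:title",
    "dc.source": ".//tei:monogr/tei:title",
    "dc.identifier.doi": ".//tei:idno[@type='DOI']",
    "dc.identifier.issn": ".//tei:idno[@type='ISSN']",
}


def extract_fields(tei_xml, field_xpaths=FIELD_XPATHS):
    """Return {field: [values]} for every XPath that matches non-empty text."""
    root = ET.fromstring(tei_xml)
    result = {}
    for field, xpath in field_xpaths.items():
        values = [(el.text or "").strip() for el in root.findall(xpath, TEI_NS)]
        values = [v for v in values if v]
        if values:
            result[field] = values
    return result
```

Calling `extract_fields(SAMPLE_TEI)` yields one list of values per matched field; fields whose XPath matches nothing are simply absent from the result.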

Prerequisites

A running GROBID service (default: http://localhost:8070)
- You can run GROBID via Docker: docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.2-full
A PDF file with embedded academic article metadata
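
Before configuring DSpace, it can help to confirm that the GROBID service is reachable. The sketch below assumes the service exposes GROBID's `/api/isalive` health endpoint at the default base URL given above; the helper names `isalive_url` and `grobid_is_alive` are hypothetical and not part of this PR.

```python
import urllib.error
import urllib.request


def isalive_url(base_url):
    """Build the health-check URL, tolerating a trailing slash on the base."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if the health endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, HTTP error, etc.
        return False
```

A call such as `grobid_is_alive("http://localhost:8070/")` returning `False` suggests the Docker container is not running or the port mapping differs from the default.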

Configuration Steps

Enable GROBID in local.cfg:
    grobid.service.url = http://localhost:8070/
Configure Submission Process (in item-submission.xml):
Add the GROBID metadata extraction step to your submission process:

   <step-definition id="grobidmetadata" mandatory="true">
       <heading>submit.progressbar.describe.stepone</heading>
       <processing-class>org.dspace.app.rest.submit.step.DescribeStep</processing-class>
       <type>submission-form</type>
   </step-definition>
   
   <step-definition id="extractionstep">
       <heading>submit.progressbar.ExtractMetadataStep</heading>
       <processing-class>org.dspace.app.rest.submit.step.ExtractMetadataStep</processing-class>
       <type>extract</type>
   </step-definition>

Then add these steps to your submission process:

   <submission-process name="traditional">
       <step id="collection"/>
       <step id="grobidmetadata"/>
       <step id="upload"/>
       <step id="extractionstep"/>
       <!-- other steps -->
   </submission-process>

Then place the correct form definition (submission-forms.xml):

   <form name="grobidmetadata">
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>title</dc-element>
         <dc-qualifier></dc-qualifier>
         <repeatable>false</repeatable>
         <label>Title</label>
         <input-type>onebox</input-type>
         <required>Field required</required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>contributor</dc-element>
         <dc-qualifier>author</dc-qualifier>
         <label>Author</label>
         <input-type>name</input-type>
         <repeatable>false</repeatable>
         <required>You must enter at least the author.</required>
         <hint>Enter the names of the authors of this item in the form Lastname, Firstname [i.e. Smith, Josh or Smith, J].</hint>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>source</dc-element>
         <dc-qualifier></dc-qualifier>
         <repeatable>false</repeatable>
         <label>Journal</label>
         <input-type>onebox</input-type>
         <hint>Enter the Journal of the publication</hint>
         <required></required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>date</dc-element>
         <dc-qualifier>issued</dc-qualifier>
         <repeatable>false</repeatable>
         <label>Date of Issue</label>
         <style>col-sm-4</style>
         <input-type>date</input-type>
         <hint>Please give the date of previous publication or public distribution.
           You can leave out the day and/or month if they aren't
           applicable.</hint>
         <required>You must enter at least the year.</required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>identifier</dc-element>
         <dc-qualifier>issn</dc-qualifier>
         <repeatable>true</repeatable>
         <label>ISSN</label>
         <input-type>onebox</input-type>
         <hint>Enter the ISSN of the publication.</hint>
         <required></required>
       </field>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>identifier</dc-element>
         <dc-qualifier>isbn</dc-qualifier>
         <repeatable>true</repeatable>
         <label>ISBN of Book</label>
         <input-type>onebox</input-type>
         <hint>Enter the ISBN of the book in which this chapter appears.</hint>
         <required></required>
       </field>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>identifier</dc-element>
         <dc-qualifier>doi</dc-qualifier>
         <repeatable>true</repeatable>
         <label>DOI</label>
         <input-type>onebox</input-type>
         <hint>Enter the DOI of the publication.</hint>
         <required></required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>description</dc-element>
         <dc-qualifier>abstract</dc-qualifier>
         <repeatable>false</repeatable>
         <label>Abstract</label>
         <input-type>textarea</input-type>
         <hint>Enter the abstract of the item. </hint>
         <required></required>
       </field>
     </row>
     <row>
       <field>
         <dc-schema>dc</dc-schema>
         <dc-element>subject</dc-element>
         <dc-qualifier></dc-qualifier>
         <repeatable>true</repeatable>
         <label>Subject</label>
         <input-type>onebox</input-type>
         <hint>Subject field that can be associated with an authority providing lookup</hint>
         <required></required>
       </field>
     </row>
   </form>

Verify Spring Configuration:
Ensure grobid-integration.xml is loaded (check dspace/config/spring/api/ directory)

Test Case 1: PDF Metadata Extraction

Start a new submission in a collection using the configured submission process
Upload a PDF file of an academic article during the upload step
- Sample PDFs for testing are included in:
  - dspace-server-webapp/src/test/resources/org/dspace/app/rest/simple-article.pdf
  - dspace-api/src/test/java/org/dspace/submit/extraction/grobid/grobid.pdf
Verify that after upload, the system automatically extracts one or more of the following metadata fields:
- Title (dc.title)
- Authors (dc.contributor.author)
- Publication Date (dc.date.issued)
- ISSN (dc.identifier.issn)
- ISBN (dc.identifier.isbn)
- DOI (dc.identifier.doi)
- Abstract (dc.description.abstract)
- Keywords (dc.subject)
Navigate to the metadata form step - verify extracted metadata is pre-populated
Complete the submission and verify the final item has all extracted metadata

Test Case 2: Metadata Not Overwritten

Create a submission and manually enter metadata before uploading a PDF
Upload a PDF - verify that manually entered metadata is NOT overwritten by extracted values
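
The behaviour this test case checks can be summarised as "extracted values only fill fields that are still empty". The following Python sketch illustrates that rule under the assumption that metadata is held as a mapping from field name to a list of values; `merge_preserving_manual` is a hypothetical helper, not the PR's actual logic.

```python
def merge_preserving_manual(existing, extracted):
    """Fill only empty or missing fields; never replace values already entered."""
    merged = {field: list(values) for field, values in existing.items()}
    for field, values in extracted.items():
        if not merged.get(field):
            merged[field] = list(values)
    return merged
```

For example, with a manually entered `dc.title` and an extracted `dc.title` plus `dc.identifier.doi`, the merged result keeps the manual title and gains only the DOI.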

Test Case 3: GROBID Service Unavailable

Stop the GROBID service
Try uploading a PDF - verify graceful handling (submission continues without extraction, no errors blocking submission)
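
"Graceful handling" here means that a failed call to GROBID yields no extracted metadata rather than an error that blocks the submission. A minimal Python sketch of that pattern follows; the `extract_or_skip` wrapper and the `extractor` callable are hypothetical stand-ins for whatever client performs the request, and the choice to log a one-line warning is an assumption.

```python
import logging

log = logging.getLogger("grobid-sketch")


def extract_or_skip(extractor, pdf_bytes):
    """Run the extractor; on I/O failure log one warning and return no metadata."""
    try:
        result = extractor(pdf_bytes)
    except OSError as exc:
        # Covers refused connections, timeouts and urllib's URLError.
        log.warning("Metadata extraction skipped: %s", exc)
        return {}
    # Treat a missing response body the same as "nothing extracted".
    return result or {}
```

Returning an empty mapping lets the caller continue with the normal submission flow without special-casing the outage.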

Running Automated Tests

Run GROBID-specific integration tests

    mvn install -DskipIntegrationTests=false -Dit.test=GrobidMetadataExtractionIT -DfailIfNoTests=false

Run unit tests for GROBID service

    mvn test -DskipUnitTests=false -Dtest=GrobidImportMetadataSourceServiceImplTest -DfailIfNoTests=false

Checklist

github-actions · 2026-02-05T20:48:00Z

Hi @vins01-4science,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

github-actions · 2026-03-13T17:36:38Z

Hi @vins01-4science,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

Dawnkai · 2026-03-19T14:19:47Z

I had a look at the code and tested this PR. Overall, the integration is functional and follows the test cases provided by Vincenzo. However, given the scale of the contribution (115+ files), it would be nice if someone else also took a look at the code (if possible).

There are a few places, like response.getEntity().getContent(), where a null response or entity could trigger a NullPointerException, but considering that tests pass and a potential connection error does not crash the submission form (though it does dump the whole traceback in the logs), I think that's okay.

Here is my feedback:

Formatting - many files (like ItemImportServiceImpl) contain only indentation or whitespace changes. This needlessly inflates the PR size and has already caused merge conflicts during local testing when I tried to resolve them. I recommend reverting the changes where no functional logic was modified to keep the PR focused and maintainable and to prevent existing PRs from suddenly triggering merge conflicts (as some of those files are used by many other classes).
Missing docs - while most new classes are annotated, GrobidImportMetadataSecurityServiceImpl lacks docstrings. This class is arguably the most important one in this PR, so detailed documentation would help with future maintenance (especially the handling of "analytic", "monogr" and "series").
The ConsolidateHeaderEnum enum - I can see it defines three options, but only CONSOLIDATE_AND_INJECT_METADATA is used. It would probably be a good idea to explicitly note in the code that other modes are untested/unsupported.
I would also consider treating GROBID as a separate module instead of adding it to dspace.cfg for the sake of potential future improvements.
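
For readers unfamiliar with the TEI terms mentioned above: within a `biblStruct`, `analytic` describes the item itself (for example an article or chapter), `monogr` describes the publication that contains it (a journal or book), and `series` describes the series that publication belongs to. The Python sketch below reads titles and authors at those levels from a hypothetical sample; it is an illustration of the TEI structure only and makes no claim about how the class under review handles them.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical biblStruct used only for this illustration.
SAMPLE_BIBL = """
<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <title level="a">A Chapter About Examples</title>
    <author>
      <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
    </author>
  </analytic>
  <monogr>
    <title level="m">The Book of Examples</title>
  </monogr>
  <series>
    <title level="s">Example Studies</title>
  </series>
</biblStruct>
"""


def titles_by_level(bibl_xml):
    """Return the first title found under analytic, monogr and series."""
    root = ET.fromstring(bibl_xml)
    out = {}
    for level in ("analytic", "monogr", "series"):
        title = root.findtext(f"tei:{level}/tei:title", default=None, namespaces=TEI_NS)
        if title and title.strip():
            out[level] = title.strip()
    return out


def analytic_authors(bibl_xml):
    """Return item-level authors formatted as 'Lastname, Firstname'."""
    root = ET.fromstring(bibl_xml)
    names = []
    for pers in root.findall("tei:analytic/tei:author/tei:persName", TEI_NS):
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS).strip()
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS).strip()
        if surname:
            names.append(f"{surname}, {forename}" if forename else surname)
    return names
```

Which level feeds which DSpace field (for instance whether a `monogr` title becomes the journal in dc.source) is a mapping decision that the requested documentation would need to spell out.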

And speaking of potential future improvements:

The current implementation disregards submission form constraints. For example, if dc.contributor.author is set as non-repeatable in the input forms, the integration still injects multiple values. This "breaks" the submission form for the end-user by preventing them from depositing the publication unless they manually remove additional authors. I assume this will also apply to regex constraints.
I tried edge-case testing this PR. It worked with a PDF file with just one letter, but when I tried a 65-page PDF, the integration injected hundreds of repeating metadata entries. This resulted in the browser freezing and rendering the record inaccessible, even after the file had been processed (the submission form never loads). I recommend implementing a configurable limit on the number of values extracted per field.
Currently, the integration can overwrite manual entries if the user hasn't clicked "Save" yet. We should consider a warning or some sort of way to preserve manually entered but unsaved values in the submission form.
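
The first two suggestions (honouring the form's repeatable setting and capping the number of values per field) could be combined in a single post-processing step. The Python sketch below is one possible shape for it, not a proposal for the actual DSpace API: `apply_form_constraints`, the `repeatable` mapping and the `max_values` default are all hypothetical, and treating unknown fields as non-repeatable is an assumption.

```python
def apply_form_constraints(extracted, repeatable, max_values=50):
    """Deduplicate values, keep one value for non-repeatable fields, cap the rest."""
    constrained = {}
    for field, values in extracted.items():
        # dict.fromkeys removes duplicates while preserving first-seen order.
        unique = list(dict.fromkeys(values))
        limit = max_values if repeatable.get(field, False) else 1
        constrained[field] = unique[:limit]
    return constrained
```

Silently keeping only the first author for a non-repeatable field is a design choice; an implementation might instead surface a warning so the submitter knows values were dropped.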

In my opinion, this feature really should have documentation. Given the complexity of the TEI mapping, a brief guide on how to add or map new fields would be highly beneficial, especially for new developers. The topic of extracting metadata from PDFs in the DSpace ecosystem is gaining traction, so I assume people will be interested in this feature. Without this documentation, future updates to GROBID versions could cause the integration to work incorrectly (or not work at all), and fixing these issues without documentation might be very time-consuming for the community, especially since they would have to edit around 100 files added by this PR.

The feature works well for standard academic papers, but the lack of adherence to DSpace submission rules and the potential for browser-side DoS with large files should be addressed in the near future.

tdonohue · 2026-03-25T22:01:36Z

@vins01-4science : I haven't had a chance to give this a full review, but I do have one initial question:

Is there a reason why we need to add all this JAXB generated code directly into our "core" DSpace/DSpace repository? In the past, we've tried to keep JAXB generated code in a separate module (e.g. https://github.com/DSpace/orcid-jaxb-api) in order to make it easier to maintain separate from the core code. That way we can avoid the extra auto-generated code being directly in our main codebase.

Overall, it looks like the vast majority of this PR is JAXB-generated code which doesn't necessarily need to follow all our normal DSpace code policies. However, by putting this code directly in our main codebase we would have to apply all the same policies to the code (including JavaDocs, test coverage, etc)

I'd strongly recommend we consider moving the JAXB code to a separate DSpace/grobid-api repository (or similar) where we can document how this code was generated & regenerate it easily. It also will allow us to make an exception for this code to allow it to not have the normal required code coverage (i.e. we won't need to require tests).

vins01-4science · 2026-03-26T14:58:28Z

> @vins01-4science : I haven't had a chance to give this a full review, but I do have one initial question:
>
> * Is there a reason why we need to add all this JAXB generated code directly into our "core" DSpace/DSpace repository? In the past, we've tried to keep JAXB generated code in a separate module (e.g. https://github.com/DSpace/orcid-jaxb-api) in order to make it easier to maintain _separate_ from the core code. That way we can avoid the extra auto-generated code being directly in our main codebase.
>
> Overall, it looks like the vast majority of this PR is JAXB-generated code which doesn't necessarily need to follow all our normal DSpace code policies. However, by putting this code directly in our main codebase we would have to apply all the same policies to the code (including JavaDocs, test coverage, etc)
>
> I'd strongly recommend we consider moving the JAXB code to a separate DSpace/grobid-api repository (or similar) where we can document how this code was generated & regenerate it easily. It also will allow us to make an exception for this code to allow it to not have the normal required code coverage (i.e. we won't need to require tests).

As I said in the PR description, the JAXB-generated code comes from an older version of the XSD provided by GROBID. The classes can no longer be generated with JAXB from the current schema because of its complexity. (There are also open issues on the GitHub repository that reference this problem.)

So we cannot easily move those classes out to be generated at runtime from the latest schema.

tdonohue · 2026-03-26T16:47:03Z

Thanks for clarifying @vins01-4science why all the JAXB-generated code is in this PR. That makes sense to me then to add it into "core" DSpace since it's no longer possible to use JAXB to regenerate this code.

That all said, I'm still concerned that this PR significantly decreases our code coverage (because of all that JAXB generated code). I'd feel a lot better if we could create automated tests (either Unit or Integration) for that JAXB-generated code, especially since we will now need to maintain that code as-is (i.e. cannot regenerate it via JAXB in the future).

As I said in today's Dev Meeting, I'm not sure whether we'll get this specific PR into 10.0, because it's not as high priority as the other CRIS-related PRs.

But, my main feedback is that I think we need to find ways to increase the code coverage here, as it will make all this new code easier to maintain over time.

kshepherd · 2026-06-26T17:09:12Z

@tdonohue @Dawnkai @nwoodward this is ready for re-review. It has been completely refactored so as not to rely on the JAXB-generated modelling and document parsing, instead using the existing tools we have for other external sources like XPath-based metadata mapping directly against the TEI XML document.

The property configs have been moved to grobid.cfg, and the URL key has changed slightly to grobid.service.url to match the other configs in there (the HTTP pooled client expects the same prefix).

ref: DURACOM-443

Conflicts:
- dspace-api/src/test/data/dspaceFolder/config/item-submission.xml
- dspace-api/src/test/data/dspaceFolder/config/submission-forms.xml
- dspace-server-webapp/src/test/java/org/dspace/app/rest/SubmissionFormsControllerIT.java

github-actions Bot assigned vins01-4science Feb 3, 2026

tdonohue added the new feature, component: submission, and DSpace-CRIS merger labels Feb 3, 2026

tdonohue added this to Merger of DSpace-CRIS into DSpace and DSpace 10.0 Release Feb 3, 2026

github-project-automation Bot moved this to 📋 To Do in Merger of DSpace-CRIS into DSpace Feb 3, 2026

tdonohue moved this to 🙋 Needs Reviewers Assigned in DSpace 10.0 Release Feb 3, 2026

tdonohue removed this from Merger of DSpace-CRIS into DSpace Feb 3, 2026

tdonohue added this to the 10.0 milestone Feb 3, 2026

tdonohue changed the title ~~[MERGER] Metadata extraction from PDF files via GROBID~~ [DSpace-CRIS] Metadata extraction from PDF files via GROBID Feb 3, 2026

github-actions Bot added the merge conflict label Feb 5, 2026

vins01-4science force-pushed the task/main/DURACOM-443 branch from 016b843 to e15015b Compare February 12, 2026 10:41

github-actions Bot removed the merge conflict label Feb 12, 2026

vins01-4science marked this pull request as ready for review February 20, 2026 08:52

damian-joz requested a review from Dawnkai February 26, 2026 12:38

tdonohue self-requested a review February 26, 2026 15:16

tdonohue moved this from 🙋 Needs Reviewers Assigned to 👀 Under Review in DSpace 10.0 Release Feb 26, 2026

nwoodward self-requested a review March 12, 2026 17:47

github-actions Bot added the merge conflict label Mar 13, 2026

tdonohue removed this from the 10.0 milestone Apr 2, 2026

tdonohue added this to DSpace 11.0 Release Apr 2, 2026

tdonohue removed this from DSpace 10.0 Release Apr 2, 2026

tdonohue moved this to 🙋 Needs Reviewers Assigned in DSpace 11.0 Release Apr 2, 2026

kshepherd self-assigned this Jun 26, 2026

kshepherd force-pushed the task/main/DURACOM-443 branch from af5f9c8 to 0c14cfd Compare June 26, 2026 17:06

github-actions Bot removed the merge conflict label Jun 26, 2026

kshepherd force-pushed the task/main/DURACOM-443 branch from 0c14cfd to 73458ed Compare June 27, 2026 10:03

kshepherd added 3 commits June 27, 2026 12:28

[DURACOM-443] Refactored GROBID client and metadata import impl

971df63

ref: DURACOM-443

[DURACOM-443] DataProcessingStep improvements (GROBID)

ff8bb99

ref: DURACOM-443

kshepherd force-pushed the task/main/DURACOM-443 branch from 73458ed to ff8bb99 Compare June 27, 2026 10:29

vins01-4science wants to merge 3 commits into DSpace:main from 4Science:task/main/DURACOM-443

vins01-4science commented Feb 3, 2026 • edited by kshepherd

4 participants
