[DSpace-CRIS] Metadata extraction from PDF files via GROBID #11852
vins01-4science wants to merge 3 commits into
Force-pushed: 016b843 to e15015b
I had a look at the code and tested this PR. Overall, the integration is functional and follows the test cases provided by Vincenzo. However, given the scale of the contribution (115+ files), it would be good if someone else could also review the code, if possible. There are a few places, like
Here is my feedback:
And speaking of potential future improvements:
In my opinion, this feature really should have documentation. Given the complexity of the TEI mapping, a brief guide on how to add or map new fields would be highly beneficial, especially for new developers. Extracting metadata from PDFs is gaining traction in the DSpace ecosystem, so I assume people will be interested in this feature. Without documentation, future GROBID version updates could cause the integration to work incorrectly (or not at all), and fixing such issues could be very time-consuming for the community, especially since they would have to edit around 100 files added by this PR.
The feature works well for standard academic papers, but the lack of adherence to DSpace submission rules and the potential for browser-side DoS with large files should be addressed in the near future.
@vins01-4science : I haven't had a chance to give this a full review, but I do have one initial question:
Overall, it looks like the vast majority of this PR is JAXB-generated code, which doesn't necessarily need to follow all our normal DSpace code policies. However, by putting this code directly in our main codebase, we would have to apply all the same policies to it (including JavaDocs, test coverage, etc.). I'd strongly recommend we consider moving the JAXB code to a separate
As I said in the PR description, the JAXB-generated code comes from an older version of the XSD provided by GROBID. The classes can no longer be generated with JAXB because of the high complexity of the schema (there are also some open issues on the GitHub repo that reference this problem). So we cannot easily move them out of the codebase and generate them at runtime from the latest schema.
Thanks for clarifying, @vins01-4science, why all the JAXB-generated code is in this PR. It makes sense to me, then, to add it into "core" DSpace, since it's no longer possible to use JAXB to regenerate this code. All that said, I'm still concerned that this PR significantly decreases our code coverage (because of all that JAXB-generated code). I'd feel a lot better if we could create automated tests (either unit or integration) for that JAXB-generated code, especially since we will now need to maintain that code as-is (i.e. we cannot regenerate it via JAXB in the future).
As I said in today's Dev Meeting, I'm not sure whether we'll get this specific PR into 10.0, because it's not as high a priority as the other CRIS-related PRs. But my main feedback is that I think we need to find ways to increase the code coverage here, as it will make all this new code easier to maintain over time.
Force-pushed: af5f9c8 to 0c14cfd
@tdonohue @Dawnkai @nwoodward this is ready for re-review. It has been completely refactored so that it no longer relies on the JAXB-generated model and document parsing. Instead, it uses the existing tools we have for other external sources, such as XPath-based metadata mapping applied directly to the TEI XML document. The property configs are moved to
Force-pushed: 0c14cfd to 73458ed
ref: DURACOM-443
ref: DURACOM-443
# Conflicts:
# dspace-api/src/test/data/dspaceFolder/config/item-submission.xml
# dspace-api/src/test/data/dspaceFolder/config/submission-forms.xml
# dspace-server-webapp/src/test/java/org/dspace/app/rest/SubmissionFormsControllerIT.java
ref: DURACOM-443
Force-pushed: 73458ed to ff8bb99
References
Description
This PR introduces GROBID integration for automated metadata extraction from PDF files during the submission process.
Instructions for Reviewers
List of changes in this PR:
- `GrobidImportMetadataSourceServiceImpl` service for TEI XML parsing
- `TEIAuthorMetadataContributor` and `TEIDateMetadataContributor` classes for metadata mapping and transformation (useful for other TEI documents too)
- `ExtractMetadataStep` and `DataProcessingStep`
- `grobid-integration.xml` and `grobid.cfg`
- Integration tests (`GrobidMetadataExtractionIT`) and unit tests for GROBID extraction

Note: Previous versions of this PR included JAXB-generated Java models for TEI. This is no longer the case, and the TEI XML is treated simply as a DOM Document.
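To illustrate what treating the TEI XML as a plain DOM Document with XPath-based mapping can look like, here is a minimal sketch using only the JDK. It is not the code in this PR: the class name `TeiXPathSketch`, the trimmed sample document, and the XPath expression are assumptions made for illustration, and real GROBID output is considerably richer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class TeiXPathSketch {

    // Hypothetical, heavily trimmed TEI header used only for this example.
    static final String SAMPLE_TEI =
        "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\">"
      + "<teiHeader><fileDesc><titleStmt>"
      + "<title level=\"a\" type=\"main\">A Sample Article</title>"
      + "</titleStmt></fileDesc></teiHeader></TEI>";

    /** Parses the XML string into a namespace-aware DOM Document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Reject DOCTYPE declarations as a guard against XXE.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newDocumentBuilder()
                      .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Evaluates an XPath expression and returns the trimmed text of every matching node. */
    static List<String> values(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE_TEI);
        // local-name() avoids having to register a NamespaceContext for the TEI namespace.
        String titleXPath = "//*[local-name()='titleStmt']/*[local-name()='title']";
        System.out.println(values(doc, titleXPath));
    }
}
```

Matching on `local-name()` keeps the sketch short; a production mapping would more likely bind a prefix to the TEI namespace so that expressions stay readable.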
GROBID Metadata Extraction Testing
To test this feature, you need a GROBID endpoint.
You can start one locally using the Docker image provided by GROBID, or use one of the publicly available online services.
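Before running the test cases, it can help to confirm that the endpoint is reachable. The sketch below is not part of this PR. It assumes GROBID's `api/isalive` health-check path, and the class name `GrobidEndpointCheck` and the 5-second timeouts are illustrative choices.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class GrobidEndpointCheck {

    /** Resolves the health-check path against the base URL, with or without a trailing slash. */
    static URI isAliveUri(String baseUrl) {
        String normalized = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return URI.create(normalized).resolve("api/isalive");
    }

    /** Returns true only when the service answers HTTP 200; any I/O failure counts as "down". */
    static boolean isAlive(String baseUrl) {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        HttpRequest request = HttpRequest.newBuilder(isAliveUri(baseUrl))
            .timeout(Duration.ofSeconds(5))
            .GET()
            .build();
        try {
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Default matches a GROBID instance listening locally on port 8070.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8070/";
        System.out.println("GROBID reachable: " + isAlive(baseUrl));
    }
}
```

Pass a different base URL as the first argument to check a remote service instead of a local one.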
Main configuration
The main configuration has been placed inside the `grobid-integration.xml` file, and the mappings have been placed inside this metadata-map:
Prerequisites
    docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.2-full
Configuration Steps
- `local.cfg`:
    grobid.service.url = http://localhost:8070/
- (`item-submission.xml`): Add the GROBID metadata extraction step to your submission process:
Then add these steps to your submission process:
Then place the correct form definition (`submission-forms.xml`):
- Ensure `grobid-integration.xml` is loaded (check the `dspace/config/spring/api/` directory)
Test Case 1: PDF Metadata Extraction
- `dspace-server-webapp/src/test/resources/org/dspace/app/rest/simple-article.pdf`
- `dspace-api/src/test/java/org/dspace/submit/extraction/grobid/grobid.pdf`
- `dc.title`
- `dc.contributor.author`
- `dc.date.issued`
- `dc.identifier.issn`
- `dc.identifier.isbn`
- `dc.identifier.doi`
- `dc.description.abstract`
- `dc.subject`
Test Case 2: Metadata Not Overwritten
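The behaviour this test case targets can be pictured with a small sketch. This is an assumption about the intended semantics (values already present on the submission win over extracted ones), not the logic implemented in this PR, and `NonOverwritingMerge` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NonOverwritingMerge {

    /**
     * Returns a copy of {@code existing} in which only fields that are absent or empty
     * are filled from {@code extracted}. Fields that already hold values are left untouched.
     */
    static Map<String, List<String>> merge(Map<String, List<String>> existing,
                                           Map<String, List<String>> extracted) {
        Map<String, List<String>> merged = new LinkedHashMap<>(existing);
        extracted.forEach((field, values) -> {
            List<String> current = merged.get(field);
            if (current == null || current.isEmpty()) {
                merged.put(field, List.copyOf(values));
            }
        });
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative values only: a title typed by the submitter and two extracted fields.
        Map<String, List<String>> existing =
            Map.of("dc.title", List.of("Title typed by the submitter"));
        Map<String, List<String>> extracted = Map.of(
            "dc.title", List.of("Title found in the PDF"),
            "dc.identifier.doi", List.of("10.1234/example"));
        System.out.println(merge(existing, extracted));
    }
}
```

In this sketch the submitter's `dc.title` survives and only `dc.identifier.doi` is added from the extracted values.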
Test Case 3: GROBID Service Unavailable
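One plausible expectation for this case is that the submission continues without extracted metadata when GROBID cannot be reached. The sketch below only illustrates that fallback pattern under this assumption. The `Extractor` interface and the class name `GracefulExtraction` are hypothetical stand-ins and do not appear in this PR.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

final class GracefulExtraction {

    /** Hypothetical stand-in for whatever call actually reaches the GROBID service. */
    @FunctionalInterface
    interface Extractor {
        Map<String, List<String>> extract() throws IOException;
    }

    /** Runs the extractor and falls back to an empty result if the service cannot be reached. */
    static Map<String, List<String>> extractOrEmpty(Extractor extractor) {
        try {
            return extractor.extract();
        } catch (IOException e) {
            // Report the failure and carry on, so the submission itself is not blocked.
            System.err.println("GROBID extraction skipped: " + e.getMessage());
            return Map.of();
        }
    }

    public static void main(String[] args) {
        // Simulate an unreachable service.
        Map<String, List<String>> result = extractOrEmpty(() -> {
            throw new IOException("connection refused");
        });
        System.out.println("Extracted fields: " + result.keySet());
    }
}
```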
Running Automated Tests
Run GROBID-specific integration tests
    mvn install -DskipIntegrationTests=false -Dit.test=GrobidMetadataExtractionIT -DfailIfNoTests=false
Run unit tests for GROBID service
    mvn test -DskipUnitTests=false -Dtest=GrobidImportMetadataSourceServiceImplTest -DfailIfNoTests=false
Checklist
- `main` branch of code (unless it is a backport or is fixing an issue specific to an older branch).
- `pom.xml`), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.