Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DANS - Exporters in external jars #9175

Merged

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Nov 18, 2022

What this PR does / why we need it: DANS is interested in being able to add exporters and to replace the existing ones with customized/improved versions. This PR extends the service provider interface code in the ExporterService to dynamically load classes from Jar files places outside the exploded war file. It is therefore a proof-of-concept for this approach.

Which issue(s) this PR closes:

Related to #7050

Special notes for your reviewer:

The overall design was discussed in Tech Hour. Aside from the ability to dynamically use external exporter jars, the main change is to the Exporter interface and the introduction of a ExportDataProviderInterface that remove dependencies Exporters had on internal Dataverse classes (DatasetVersion, FileMetadata, DataFile, etc.) This was done by providing Exporters with the standard JSON/ORE/Schema.org/DataCite XML export contents along with a new JSON representation of file variable level metadata. All of the existing Exporters were rewritten to use these JSON/JSON-LD inputs. (Nominally the JSON and ORE exports are 'complete' and are usable for new exporters. The DataCite and Schema exporters were and still are just wrappers of content created deeper in the Dataverse code (and reused there for other purposes.))

Another ~minor change was to add the Locale to the Exporter.getDisplayName() method. (Exporters originally retrieved the Locale using the BundleUtil class which calls PrimeFaces FacesContext.) This change avoids these dependencies (but does not allow external Exporters to use other BundleUtil methods and thereby share Dataverse's existing i18n properties files - that could be done later if desired.)

With those changes in place, I went ahead to create maven modules for an exporters SPI jar file that includes the two interfaces above and the ExportException class. That jar is now used in the main code and in an example 'MyJSON' exporter (that does nothing but wrapper the original JSON export with

{
    "name": "dataverse_json",
    "inputJson": <original export>
}

Having schema in place for the JSON/ORE/Variable-level metadata or creating new types of transfer objects would be useful going forward. Presumably their design/devolpment will be driven by the SPA work and changes it requires. In the meantime, this PR should make it significantly easier to develop/deploy exporters.

The one aspect of this work I'm least sure of is the best way to handle the maven modules. Right now, the jar has a 0.0.1 version, the interface jar and the example MyJSON exporter are both under the ./modules dir, the code installs the new jar as a local resource rather than publishing it, etc. All changes welcome if there are ideas of how to version, where code should go in the repo, if the interface jar and/or example should be separate repos, etc. If someone knows how to do that and we have the appropriate credentials to do so, publishing the jar would be a better long-term option. Until then, doing mvn clean package for the ./modules/exportSPI module and then building the main war as usual works. (The auto-build currently fails due to missing this step/the interface jar only being local.)

Suggestions on how to test this:
There are three things to test:

  • Regression testing - that existing exporters work as before - much of which is covered by tests already
  • That dropping a new exporter in place works with two subcases - adding a new one and replacing an existing one - this requires defining the dataverse.exporters.directory mpconfig/jvm option, putting the new jar there, restarting the server, clearing the cached exports (for the replace case), and verifying that the new exporter appears in the list of options, the export file can be downloaded, etc.
  • Building a new exporter works - this is basically demonstrated by the 'MyJSON' example above - just running the maven build for it proves this.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Only by replacing/adding new exporters

Is there a release notes update needed for this change?:
Yes - for this and other documentation, I'd like to get through an initial review that might change naming, locations, build instructions, etc. before nailing things down.
Additional documentation:

Older notes:

The PR does raise issues though. The ServiceLoader mechanism looks for files "META-INF/services/edu.harvard.iq.dataverse.export.spi.Exporter" that list the available instances of Exporter that should be loaded. Thus, to successfully load a class, it has to be in one of the META-INF list files and it has to be on the classpath, not necessarily in the same Jar file as the META-INF file in which it is listed. In considering where to put the existing exporter class files - if they stay in the war but the META-INF is moved to a separate Jar, the could be overridden (can't use the same class name but another class could provide the same export (defined by the displayname and provider name). Conversely if the META-INF file is in the war, ServiceLoader doesn't allow skipping it (as far as I can tell), so it would not be possible to override the internal classes. That said, even the current exporters could move to an external Jar (we might need to add some error handling if people removed the exporters that do double duty (DataCite, Schema.org, OAI_ORE, ...).

I've also explored not using the ServiceLoader - the main difference with not using it would be that the META-INF file wouldn't be needed - we could check all files in the jar to see if they are Exporters or provide a different lookup mechanism, i.e. some setting/microprofile config list.

Since these are techno-policy issues, I think it would be useful to discuss the best approaches in a Tech Hour (and through discussion in this PR, slack, etc.) before I remove 'Draft' and send something like this forward.

Also note - ~this should be usable for ingest, archivers, full-text search, etc., so it would be nice if we had a common pattern across all of those.

Also also note - the PR doesn't currently remove any Exporter class files from the war, etc. I've been testing by manually creating Jars and removing the META-INF file and/or the classes from the exported war to show that it works. Who/how/where Exporters get built/reviewed/tested/packaged is also a good topic.

@coveralls
Copy link

coveralls commented Nov 18, 2022

Coverage Status

Coverage: 20.335% (-0.001%) from 20.336% when pulling 4322d50 on GlobalDataverseCommunityConsortium:DANS-external_exporters into e2eb6d9 on IQSS:develop.

@poikilotherm
Copy link
Contributor

poikilotherm commented Nov 21, 2022

Thank you @qqmyers for your extensive work on this! I'm glad there is a chance it will get picked up so someday we have #7050 resolved.

As I wrote on Slack, it seems to be totally fine not to use a convenient wrapper as PF4J, but rely on standard SPI mechanics like we see them in action for MicroProfile Config API or Jakarta EE. I also completely agree we need to discuss these topics at Tech Hour.

Here's a thought: IMHO we (whoever that may be) shall create a proper "Dataverse Java API" package, preferably as a module under /modules, to create a solid ground to build on. The content would be inspired from the mentioned frameworks:

  • Interfaces to provide implementations from a package: ExporterProvider, MinterProvider, and so on (similar to ConfigSourceProvider)
  • Interfaces expressing the contract to be used in the main code: Exporter, Minter, and so on (similar to ConfigSource)
  • Data Transfer Objects (DTOs) to abstract our data model Entity classes and not expose these internal implementations. Something like modelmapper might be handy to do the conversion. Maybe moving our codebase to Java 17 is a win here, enabling use of records for this.

It might be fine to do develop this API package iteratively, adding more as we go.

IMHO this technical discussion is very much related two other topics:

  1. The UI reworks in process. Providing DTOs would help in abstracting away our internal data model, which simplifies REST API stabilization. Also, using the class loading approach, it should be possible to add more JAX-RS Resources and Providers to extend the REST API from an external JAR. Shipping extensible UI functionality, bundling a React addon plus the backing API classes, would be in reach.
  2. DataCite metadata issues like Support flexible DataCite resourceType metadata in a Dataset #7077, Feature Request/Idea: Standardize standard license configuration #8512 and many others, as we need a completely renewed exporter/registrar for this. As this is crucial to support things like software publications, I can add funded developer time from me to do enabling groundwork. (Will add a HERMES label to this PR to express interest and collab options).

@qqmyers qqmyers added the GDCC: DANS related to GDCC work for DANS label Nov 21, 2022
@poikilotherm poikilotherm added the HERMES related to @hermes-hmc work on Dataverse code label Nov 22, 2022
@johannes-darms
Copy link
Contributor

1.) Is is possible to delegate the classloading to the application server instead of writing our own implementation? If this is not possible I would suggest to move the generic part to a dedicated class as is is likely used by more modules that load custom implementation at runtime (e.g. PID providers).

2.) Do we plan to add libraries to ease and automate the generation of mainifest files? Something like google auto?

still uses a ServiceLoader to find exporters but now keep a Map based on
providerName keys. (Possible that this isn't needed if the serviceLoader
always orders returned classes by class loader, but it simplifies the
code to find an exporter by providerName in several methods).
@qqmyers
Copy link
Member Author

qqmyers commented Dec 16, 2022

Just updated to prioritize external exporters that register with the same 'providerName', allowing replacement of the internally provider export format(s). This was ~per Tech Hour Discussion and uses the standard ServiceLoader to load classes combined with a Map that is only populated with the 'winning' Exporters.

@johannes-darms - we are using @autoservice already.

FWIW: I'm planning to meet with @poikilotherm next week to think through how to make a separate SPI API jar/artifact that would be all an external Exporter project would have to depend on. Hoping to do that generically so other SPI interfaces can be added as we open those up.

poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Dec 20, 2022
This introduces a new submodule of the Dataverse codebase to ship an Extension API to the community.
poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Dec 20, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Dec 20, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Dec 20, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Dec 20, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Dec 20, 2022
…QSS#9175

- Change main export method signature (avoids internal model class DatasetVersion in Extension API)
- Migrate imports
- Cleanup imports order

Note: this commit introduces a BREAKING change, as a few exporters relied on the DatasetVersion parameter!
This will need to be addressed.
@poikilotherm poikilotherm mentioned this pull request Dec 20, 2022
9 tasks
The packages built and installed in the step would not be cached
automatically. So we call the save action manually, making the
built packages available for subsequent runs which do not built the
packages again by calling the parent module. (Like the container image builds)
We cannot use the same cache key from restore to
save our build artifacts into.

This is an attempt to rely on the restore action matching
for names to pick this up when we append a unique id.
- Delete all the purge cache things, seems to be impossible (permissions?)
- Remove the manual cache setup and let setup-java handle it
- Switch to Eclipse Temurin as well - AdoptJDK is dead
- Avoid not having dependent submodules around
- Build them with the container profile which might change some dependencies
- Renamed the IT test group for ct profile from "testcontainers" to "end2end"
  as we will be using "testcontainers" tag also for non-container image builds
  and running integration tests.
- Make the profile skip integration tests by default (so we don't run them with
  the Maven install target)
Add configuration to SPI module to skip deployment (so we don't run into errors, as this would fail)
@poikilotherm
Copy link
Contributor

/push-image

Test the new build also including submodules.

@github-actions
Copy link

📦 Could not push preview image 😞. See log for details.

@poikilotherm
Copy link
Contributor

/push-image

Try push image again thx

@github-actions
Copy link

📦 Could not push preview image 😞. See log for details.

@poikilotherm
Copy link
Contributor

Stupid me - obviously won't work... Needs to be pushed to develop first as command picks up the version from there, not the PR.

@kcondon
Copy link
Contributor

kcondon commented May 30, 2023

Issues found:

  1. On a published dataset, choosing DDI metadata export from UI fails with no server log error but this in UI:
Screen Shot 2023-05-30 at 9 33 58 AM

Related: api export for ddi fails with no server log error but command line: ""
oai_ddi format does work via api.

Update: This appears related to files since it works with no files but fails with one txt file.

  1. On a published dataset, choosing DDI HTML Codebook fails with blank page, no server log error.
    Update: seems related to files since it works with no files but fails with one text file.

@kcondon kcondon merged commit 7df17dc into IQSS:develop May 30, 2023
kcondon added a commit that referenced this pull request Jun 21, 2023
kcondon added a commit that referenced this pull request Jul 5, 2023
…-external_exporters

Remove temporary additions to json export from #9175
@pdurbin pdurbin added this to the 5.14 milestone Jul 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GDCC: DANS related to GDCC work for DANS HERMES related to @hermes-hmc work on Dataverse code Size: 10 A percentage of a sprint. 7 hours.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants