Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TDL/7493 Make BagGenerator threads configurable #8606

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Apr 12, 2022

What this PR does / why we need it: Improves scaling of Archiving and allows it to use more or fewer threads depending on machine size

Which issue(s) this PR closes:

Closes #8602

Special notes for your reviewer:

Suggestions on how to test this: Configure archiving (e.g. with LocalSubmitToArchiveCommand), change the :BagGeneratorThreads setting and compare the load and/or watch the log to see how many files are being zipped in parallel.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: multiple Bag/Archiving changes coming - perhaps one overall release note?

Additional documentation: guides updated

qqmyers and others added 30 commits December 23, 2020 11:38
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
checks to see if space exists
IQSS/8422-default_to_allow_dataset_level_choice_of_metadatalanguage
TDL/7493-batch_archiving_API_call

Conflicts:
	modules/dataverse-parent/pom.xml
@qqmyers qqmyers added the GDCC: TDL supported by Texas Digital Library label Apr 12, 2022
@qqmyers qqmyers changed the title Tdl/7493 make bag generate threads configurable only TDL/7493 Make BagGenerator threads configurable Apr 12, 2022
@coveralls
Copy link

coveralls commented Apr 12, 2022

Coverage Status

Coverage increased (+0.002%) to 18.999% when pulling 32f3a58 on TexasDigitalLibrary:TDL/7493-make_BagGenerate_threads_configurable_only into a5f703e on IQSS:develop.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good. It seems like all the ParallelScatterZipCreator code is already in place and what this pull request is doing is reducing the default number of threads from 8 to 2 and making it configurable.

I didn't actually run the code but I navigated around it a bit.

I found a few small things we should probably fix up so I'm hitting "request changes" for now. Details are in the review.

doc/sphinx-guides/source/developers/workflows.rst Outdated Show resolved Hide resolved
@@ -180,7 +180,7 @@ archiver

A step that sends an archival copy of a Dataset Version to a configured archiver, e.g. the DuraCloud interface of Chronopolis. See the `DuraCloud/Chronopolis Integration documentation <http://guides.dataverse.org/en/latest/admin/integrations.html#id15>`_ for further detail.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to fix this "#id15" link while were in here. (It links to the table of contents rather than the actual content below the toc.) We could add a custom anchor and use :ref: to link to it.

doc/sphinx-guides/source/installation/config.rst Outdated Show resolved Hide resolved
doc/sphinx-guides/source/installation/config.rst Outdated Show resolved Hide resolved
doc/sphinx-guides/source/installation/config.rst Outdated Show resolved Hide resolved
@@ -115,6 +115,7 @@ public class BagGenerator {
private boolean usetemp = false;

private int numConnections = 8;
public static final String BAG_GENERATOR_THREADS = ":BagGeneratorThreads";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a String here, can we keep all the database settings centralized in the Key enum at SettingsServiceBean?

I tried this locally and the following seems to work:

public static final String BAG_GENERATOR_THREADS = SettingsServiceBean.Key.BagGeneratorThreads.toString();

(I had to import SettingsServiceBean.)

I see that DuraCloudSubmitToArchiveCommand also has strings like :DuraCloudPort, :DuraCloudHost, and :DuraCloudContext hard-coded, but we could centralize those later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've purposely avoided that so far with the hope that at some point these archivers will be packaged separately. Right now they are loaded via reflection (you specify the classname in a property) and there should be no references to the specific Archiver classes or their properties in the main codebase. (As you note though we do have their properties in the master list in the guides so far.)
( I hope it isn't too much work for these classes and their dependencies to be pulled out into separate jars but it isn't something I know how to do off-hand - perhaps @poikilotherm?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I guess I was under the impression that since these archiver classes/commands are in the main code base that they have to be.

It would be nice to figure out how to extract them to a different git repo. It would be a great success story of modularity in Dataverse.

Once they are truly extracted from the main code base, I suppose their documentation should be extracted too, just like we do with external tools.

It sounds like we're not there yet. That's fine. Thanks for the cleanup you did.


A Dataverse installation can be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to the `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_
A Dataverse installation can be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to the `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_, to a local file system, or to `Google Cloud Storage<https://cloud.google.com/storage>`_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is failing with:

/github/workspace/doc/sphinx-guides/source/admin/integrations.rst:165:Unknown target name: "google cloud storagehttps://cloud.google.com/storage".

@@ -115,6 +115,7 @@ public class BagGenerator {
private boolean usetemp = false;

private int numConnections = 8;
public static final String BAG_GENERATOR_THREADS = ":BagGeneratorThreads";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I guess I was under the impression that since these archiver classes/commands are in the main code base that they have to be.

It would be nice to figure out how to extract them to a different git repo. It would be a great success story of modularity in Dataverse.

Once they are truly extracted from the main code base, I suppose their documentation should be extracted too, just like we do with external tools.

It sounds like we're not there yet. That's fine. Thanks for the cleanup you did.

doc/sphinx-guides/source/developers/workflows.rst Outdated Show resolved Hide resolved
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we fix make html ( TexasDigitalLibrary#2 ) I'm ready to move this to QA. (I did also throw a wording suggestion in this review.)

I haven't tested this at all but the changes make sense and are pretty minor. Again, we're reducing the default number of threads from 8 to 2 and making it configurable.

doc/sphinx-guides/source/admin/integrations.rst Outdated Show resolved Hide resolved
doc/sphinx-guides/source/developers/workflows.rst Outdated Show resolved Hide resolved
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Again, I haven't tested it but the changes make sense.

@kcondon kcondon self-assigned this Apr 14, 2022
@kcondon kcondon merged commit bb598fc into IQSS:develop Apr 14, 2022
@pdurbin pdurbin added this to the 5.11 milestone May 4, 2022
@qqmyers qqmyers mentioned this pull request Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GDCC: TDL supported by Texas Digital Library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make BagGenerator parallelism configurable
4 participants