
Add Harvesting Client name to the Metadata Source facet #10464

Merged
merged 4 commits into from Apr 10, 2024

Conversation


@jp-tosca jp-tosca commented Apr 4, 2024

What this PR does / why we need it:

This PR changes the "Metadata Source" facet: instead of lumping all harvested datasets into a single "Harvested" category, it groups them by the nickname of the harvesting client.

Which issue(s) this PR closes:

Closes #10298

Special notes for your reviewer:

We may need to update this to use the new value once #10217 is done.

Suggestions on how to test this:

Create multiple harvesting clients; you should then be able to see the change in the facet.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:


Before:

(screenshot)

After:

(screenshot)

@jp-tosca jp-tosca added Type: Feature a feature request Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels Apr 4, 2024

@qqmyers (Member) left a comment:

Looks good. Asked for a release note. The tests haven't completed yet so I'm not sure if there's any Harvesting/search related test that needs to be updated. (Can one be added - do we already have harvested test datasets where we could search on this field?)

@@ -897,7 +897,8 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long

     if (dataset.isHarvested()) {
         solrInputDocument.addField(SearchFields.IS_HARVESTED, true);
-        solrInputDocument.addField(SearchFields.METADATA_SOURCE, HARVESTED);
+        solrInputDocument.addField(SearchFields.METADATA_SOURCE,
A Member commented on this diff:

Looks good. Probably need a release note about this feature that also notes that (async/background) reindexing is needed to populate the facet.

@jp-tosca (Contributor, Author) replied:

Hi @qqmyers 👋🏼

Thanks! I'm looking at another PR right now, but I will add the release note ASAP. Regarding the tests, I'm not sure; I will look into this as well.

Best,
Juan

@jp-tosca (Contributor, Author) commented:

Note added and test condition added to the harvesting test to search by the new collection name. 😃

A Contributor commented:

Putting what I said during standup in writing:
Rather than using the HarvestingClient's nickname for this facet, we should probably use the name of the local collection into which the client is harvesting.
The clear advantages of doing that:

  1. This mirrors what we are doing for the local datasets: solrInputDocument.addField(SearchFields.METADATA_SOURCE, rootDataverse.getName());
  2. The name of the collection is likely to be more descriptive/better-looking to a human user
  3. While both the client nickname and the name of the local collection are chosen by the local admin, it is far easier to change the latter; the former is not editable at all. With the current implementation, if a production instance admin realizes that they named the harvesting client oai_3 and that's what will show up in the facet, the only way to address it would be to delete the client (and all the content associated with it), re-create it with a better-looking nickname, and re-harvest.
  4. This will make it unnecessary to add an extra field with a descriptive label to the client class (as was mentioned during standup).
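For illustration, the two indexing options being weighed above could be sketched like this. The `HarvestingClient` record and its accessors are hypothetical stand-ins for the real Dataverse classes, and a plain `Map` stands in for the Solr input document; this is not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataSourceFacetSketch {
    // Hypothetical stand-in for the real HarvestingClient entity.
    record HarvestingClient(String nickname, String localCollectionName) {}

    static final String METADATA_SOURCE = "metadataSource";

    // Option implemented in this PR: facet on the client's nickname.
    static Map<String, Object> indexByNickname(HarvestingClient client) {
        Map<String, Object> solrDoc = new HashMap<>();
        solrDoc.put("isHarvested", true);
        solrDoc.put(METADATA_SOURCE, client.nickname());
        return solrDoc;
    }

    // Alternative raised in review: facet on the local collection's name,
    // mirroring how local datasets index rootDataverse.getName().
    static Map<String, Object> indexByCollectionName(HarvestingClient client) {
        Map<String, Object> solrDoc = new HashMap<>();
        solrDoc.put("isHarvested", true);
        solrDoc.put(METADATA_SOURCE, client.localCollectionName());
        return solrDoc;
    }

    public static void main(String[] args) {
        HarvestingClient client = new HarvestingClient("oai_3", "Partner Archive");
        // The nickname is fixed at client creation; the collection name can be renamed later.
        System.out.println(indexByNickname(client).get(METADATA_SOURCE));       // oai_3
        System.out.println(indexByCollectionName(client).get(METADATA_SOURCE)); // Partner Archive
    }
}
```

The trade-off point 3 makes is visible here: whatever string lands in `metadataSource` is what users see in the facet, so indexing a renameable collection name is more forgiving than indexing an immutable nickname.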

@jp-tosca (Contributor, Author) replied:

What would happen if someone has multiple clients harvesting into the same collection? 🤔 Should we consider this scenario? Also, @DS-INRA made some comments on the issue a few minutes ago confirming that the name is what they need, which, as I understand it, is what the other PR associated with this is about.

A Contributor replied:

It is possible to harvest into the same collection, yes. But then one could argue that if the local admin wants datasets from different OAI archives (or sets/clients) to appear as the same collection to their users, they may actually prefer to have them under the same facet too...
All that said, I agree that it makes sense to implement it the way the original requestor wants it to work. But let's make sure everybody is on the same page; let me ask some follow-up questions in the issue.


@jp-tosca jp-tosca requested a review from qqmyers April 6, 2024 00:33
@qqmyers (Member) left a comment:

Looks good!


landreev commented Apr 9, 2024

(I'm generally ready to merge this)


pdurbin commented Apr 9, 2024

I just chatted with @jp-tosca about it.

We may have talked earlier about having a setting to revert to the old behavior, but I think it's fine that it's absent. Onwards and upwards.

One downside is that you won't be able to see a single number for how many harvested datasets there are (unless there's only one client). Oh well.

Maybe it will encourage more harvesting, once you can easily tell where stuff is being harvested from.

Overall, seems like a good change, and @DS-INRA approves, which is the most important thing. 😄

landreev (Contributor) commented:

@pdurbin Since we're still going to show the total number of local datasets, it should still be fairly clear roughly how many harvested datasets there are, because math: the sum of the "datasets" facet counts minus the local total; even if we are not showing the exact number.
IMO, implemented like this, it will be useful to users. But if anyone complains, we can revisit it and add a quick setting for reverting to the old behavior.

@jp-tosca (Contributor, Author) commented:

We are having a meeting today at 11:00 EST with @DS-INRA; I will post here if there are any updates.

@@ -272,6 +270,12 @@ private void harvestingClientRun(boolean allowHarvestingMissingCVV) throws Inte

     } while (i<maxWait);

     System.out.println("Waited " + i + " seconds for the harvest to complete.");

+    Response searchHarvestedDatasets = UtilIT.search("metadataSource:" + nickName, normalUserAPIKey);
A Contributor commented on this line:

Let's keep an eye on it; it will not surprise me if it occasionally needs an extra second here for the indexing of all the harvested datasets to complete.
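One common way to harden a test against this kind of indexing lag is to poll the search endpoint until the expected count appears or a deadline passes. A minimal sketch (not the actual UtilIT code; `countFor` is a hypothetical stand-in for the real search-and-count call):

```java
import java.util.function.IntSupplier;

public class AwaitIndexingSketch {
    // Polls countFor until it reports expected, sleeping between attempts;
    // gives up after maxAttempts and returns whether the count ever matched.
    static boolean awaitCount(IntSupplier countFor, int expected,
                              int maxAttempts, long sleepMillis) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (countFor.getAsInt() == expected) {
                return true;
            }
            Thread.sleep(sleepMillis); // give the indexer a moment to catch up
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate datasets gradually becoming visible to search: 5, 6, 7, 8...
        int[] visible = {5};
        IntSupplier count = () -> visible[0]++;
        System.out.println(awaitCount(count, 8, 10, 50L)); // true once the count reaches 8
    }
}
```

A single fixed sleep either wastes time or still flakes under load; retrying with a bounded deadline keeps the fast case fast and the slow case passing.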

A Contributor commented:

@jp-tosca @stevenwinship
Hah! This appears to be exactly what's happening now, since I merged this PR yesterday.
The new, weird failures: `data.total_count doesn't match. Expected: <8> Actual: <5>`

A Contributor replied:

JP, this was a bad call on my part - I should've asked you to run Jenkins 3 times in a row on the PR before merging it.

@landreev landreev merged commit 54dddf0 into develop Apr 10, 2024
4 of 5 checks passed
@landreev landreev deleted the 10298-update-harvesting-source-facet branch April 10, 2024 17:56
@landreev landreev removed their assignment Apr 10, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10298-update-harvesting-source-facet
ghcr.io/gdcc/configbaker:10298-update-harvesting-source-facet

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

@pdurbin pdurbin added this to the 6.3 milestone Apr 10, 2024

DS-INRA commented Apr 11, 2024

👏 Thanks for the PR!

Successfully merging this pull request may close these issues.

Add Harvesting Source to search facets
5 participants