
Add Harvesting Client name to the Metadata Source facet #10464

Merged
merged 4 commits into from Apr 10, 2024

Conversation


@jp-tosca jp-tosca commented Apr 4, 2024

What this PR does / why we need it:

This PR changes the "Metadata Source" facet: instead of lumping all harvested datasets into a single "Harvested" category, it groups them by the nickname of the harvesting client.

Which issue(s) this PR closes:

Closes #10298

Special notes for your reviewer:

We may need to update this to use the new value once #10217 is done.

Suggestions on how to test this:

Create multiple harvesting clients; you should then be able to see the change in the facet.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:


Before:

(screenshot)

After:

(screenshot)

@jp-tosca jp-tosca added Type: Feature a feature request Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels Apr 4, 2024

@qqmyers (Member) left a comment:

Looks good. Asked for a release note. The tests haven't completed yet so I'm not sure if there's any Harvesting/search related test that needs to be updated. (Can one be added - do we already have harvested test datasets where we could search on this field?)

@@ -897,7 +897,8 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long

     if (dataset.isHarvested()) {
         solrInputDocument.addField(SearchFields.IS_HARVESTED, true);
-        solrInputDocument.addField(SearchFields.METADATA_SOURCE, HARVESTED);
+        solrInputDocument.addField(SearchFields.METADATA_SOURCE,
A Member commented on this diff:

Looks good. Probably need a release note about this feature that also notes that (async/background) reindexing is needed to populate the facet.

@jp-tosca (Contributor, Author) replied:

Hi @qqmyers 👋🏼

Thanks! I'm looking at another PR right now, but I will add the release note ASAP. Regarding the tests, I'm not sure; I will look into this as well.

Best,
Juan

@jp-tosca (Contributor, Author) commented:

Note added and test condition added to the harvesting test to search by the new collection name. 😃

A Contributor commented:

Putting what I said during standup in writing:
Rather than using the HarvestingClient's nickname for this facet, we should probably use the name of the local collection into which the client is harvesting.
The clear advantages of doing that:

  1. This mirrors what we are doing for the local datasets: solrInputDocument.addField(SearchFields.METADATA_SOURCE, rootDataverse.getName());
  2. The name of the collection is likely to be more descriptive/better-looking to a human user
  3. While both the client nickname and the name of the local collection are chosen by the local admin, it is far easier to change the latter; the former is not editable at all. With the current implementation, if a production instance admin realizes that they named the harvesting client oai_3 and that's what will show up in the facet, the only way to address it would be to delete the client (and all the content associated with it), re-create it with a better-looking nickname, and re-harvest.
  4. This will make it unnecessary to add an extra field with a descriptive label to the client class (as was mentioned during standup).
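For illustration, the two indexing options being weighed above could be sketched like this. The `HarvestingClient` record and its accessors are hypothetical stand-ins for the real Dataverse classes, and a plain `Map` stands in for the Solr input document; this is not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataSourceFacetSketch {
    // Hypothetical stand-in for the real HarvestingClient entity.
    record HarvestingClient(String nickname, String localCollectionName) {}

    static final String METADATA_SOURCE = "metadataSource";

    // Option implemented in this PR: facet on the client's nickname.
    static Map<String, Object> indexByNickname(HarvestingClient client) {
        Map<String, Object> solrDoc = new HashMap<>();
        solrDoc.put("isHarvested", true);
        solrDoc.put(METADATA_SOURCE, client.nickname());
        return solrDoc;
    }

    // Alternative raised in review: facet on the local collection's name,
    // mirroring how local datasets index rootDataverse.getName().
    static Map<String, Object> indexByCollectionName(HarvestingClient client) {
        Map<String, Object> solrDoc = new HashMap<>();
        solrDoc.put("isHarvested", true);
        solrDoc.put(METADATA_SOURCE, client.localCollectionName());
        return solrDoc;
    }

    public static void main(String[] args) {
        HarvestingClient client = new HarvestingClient("oai_3", "Partner Archive");
        // The nickname is fixed at client creation; the collection name can be renamed later.
        System.out.println(indexByNickname(client).get(METADATA_SOURCE));       // oai_3
        System.out.println(indexByCollectionName(client).get(METADATA_SOURCE)); // Partner Archive
    }
}
```

The trade-off point 3 makes is visible here: whatever string lands in `metadataSource` is what users see in the facet, so indexing a renameable collection name is more forgiving than indexing an immutable nickname.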

@jp-tosca (Contributor, Author) replied:

What would happen if someone has multiple clients harvesting into the same collection? 🤔 Should we consider this scenario? Also, @DS-INRA made some comments on the issue a few minutes ago confirming that the name is what they need, which, as I understand it, is what the other PR associated with this is about.

A Contributor replied:

It is possible to harvest into the same collection, yes. But then one could argue that if the local admin wants datasets from different OAI archives (or sets/clients) to appear as the same collection to their users, they may actually prefer to have them under the same facet too...
All that said, I agree that it makes sense to implement it the way the original requestor wants it to work. But let's make sure everybody is on the same page; let me ask some follow-up questions in the issue.


@jp-tosca jp-tosca requested a review from qqmyers April 6, 2024 00:33
@qqmyers (Member) left a comment:

Looks good!


landreev commented Apr 9, 2024

(I'm generally ready to merge this)


pdurbin commented Apr 9, 2024

I just chatted with @jp-tosca about it.

We may have talked earlier about having a setting to revert to the old behavior, but I think it's fine that it's absent. Onwards and upwards.

One downside is that you won't be able to see a single number for how many harvested datasets there are (unless there's only one client). Oh well.

Maybe it will encourage more harvesting, once you can easily tell where stuff is being harvested from.

Overall, seems like a good change, and @DS-INRA approves, which is the most important thing. 😄

landreev (Contributor) commented:

@pdurbin Since we're still going to show the total number of local datasets, it should still be fairly clear roughly how many harvested datasets there are, because math: the sum of the "datasets" facet counts minus the local total; even if we are not showing the exact number.
IMO, implemented like this, it will be useful to users. But if anyone complains, we can revisit it and add a quick setting for reverting to the old behavior.

@jp-tosca (Contributor, Author) commented:

We are having a meeting today at 11:00 EST with @DS-INRA; I will post here if there are any updates.

@@ -272,6 +270,12 @@ private void harvestingClientRun(boolean allowHarvestingMissingCVV) throws Inte

     } while (i<maxWait);

     System.out.println("Waited " + i + " seconds for the harvest to complete.");

+    Response searchHarvestedDatasets = UtilIT.search("metadataSource:" + nickName, normalUserAPIKey);
A Contributor commented on this line:

Let's keep an eye on it; it will not surprise me if it occasionally needs an extra second here for the indexing of all the harvested datasets to complete.
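One common way to harden a test against this kind of indexing lag is to poll the search endpoint until the expected count appears or a deadline passes. A minimal sketch (not the actual UtilIT code; `countFor` is a hypothetical stand-in for the real search-and-count call):

```java
import java.util.function.IntSupplier;

public class AwaitIndexingSketch {
    // Polls countFor until it reports expected, sleeping between attempts;
    // gives up after maxAttempts and returns whether the count ever matched.
    static boolean awaitCount(IntSupplier countFor, int expected,
                              int maxAttempts, long sleepMillis) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (countFor.getAsInt() == expected) {
                return true;
            }
            Thread.sleep(sleepMillis); // give the indexer a moment to catch up
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate datasets gradually becoming visible to search: 5, 6, 7, 8...
        int[] visible = {5};
        IntSupplier count = () -> visible[0]++;
        System.out.println(awaitCount(count, 8, 10, 50L)); // true once the count reaches 8
    }
}
```

A single fixed sleep either wastes time or still flakes under load; retrying with a bounded deadline keeps the fast case fast and the slow case passing.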

A Contributor commented:

@jp-tosca @stevenwinship
Hah! This appears to be exactly what's happening now, since I merged this PR yesterday.
The new, weird failures: `data.total_count doesn't match. Expected: <8> Actual: <5>`

A Contributor replied:

JP, this was a bad call on my part - I should've asked you to run Jenkins 3 times in a row on the PR before merging it.

@landreev landreev merged commit 54dddf0 into develop Apr 10, 2024
4 of 5 checks passed
@landreev landreev deleted the 10298-update-harvesting-source-facet branch April 10, 2024 17:56
@landreev landreev removed their assignment Apr 10, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10298-update-harvesting-source-facet
ghcr.io/gdcc/configbaker:10298-update-harvesting-source-facet

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

@pdurbin pdurbin added this to the 6.3 milestone Apr 10, 2024

DS-INRA commented Apr 11, 2024

👏 Thanks for the PR!

Successfully merging this pull request may close these issues.

Add Harvesting Source to search facets
5 participants