
Re-harvesting datasets from Roper Center #204

Open · jggautier opened this issue Dec 3, 2022 · 15 comments
Assignees: landreev
Labels: Feature: Harvesting · NIH GREI (General work related to any of the NIH GREI aims) · Size: 3 (A percentage of a sprint) · Status: Needs Input (Applied to issues in need of input from someone currently unavailable)

Comments

jggautier (Collaborator) commented Dec 3, 2022

Managers of the Roper Center for Public Opinion Research emailed to let us know that they now make their dataset metadata available over OAI-PMH. See https://help.hmdc.harvard.edu/Ticket/Display.html?id=330637.

But when I tested harvesting the records, Demo Dataverse wasn't able to harvest any:

[Screenshot: harvesting client in Demo Dataverse showing no records harvested]

When I created the harvesting client, for the "Archive Type" I selected "Generic OAI Archive".
[Screenshot: Create Harvesting Client form with "Generic OAI Archive" selected as the Archive Type]

I tried the "Roper Archive" option, too, but that didn't work either.

I let the folks at Roper know that the Dataverse development team is working on improving how Dataverse harvests using OAI-PMH, and that once those improvements made it to the Harvard repository and Demo Dataverse, I would try to harvest again.

I also asked them what we should do about the stale records in the Harvard repository (https://dataverse.harvard.edu/dataverse/roper) whose links lead to error pages. Similar to the stale ICPSR records (#63), people who find these Roper datasets and realize that the links don't work could still go to Roper's website (or even try a general search engine) and search there by the dataset's title.

So maybe we could leave them there until we're able to re-harvest them using OAI-PMH.

Someone at Roper asked in the email thread if, in the meantime, we're able to make the links redirect to the dataset pages:

It looks like the links were directed at our legacy server which has been replaced. From what I can see, the links are going through a resolver on your end, so for example
https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.4/GBSSLT62-CQ266 will end up at
https://ropercenter.cornell.edu//CFIDE/cf/action/catalog/abstract.cfm?archno=GBSSLT62-CQ266 (old URL)

Can your resolver point to a different URL prefix?
https://ropercenter.cornell.edu/ipoll/study/GBSSLT62-CQ266 (new URL)

Or maybe we could remove them sooner (e.g. using the destroy dataset API endpoint)?
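
For reference, a destroy call for a single record would look roughly like the sketch below, with the path taken from the Dataverse Native API's superuser-only destroy endpoint. The server, token, and persistent ID are placeholders (the handle is the stale Roper example quoted above); check the API guide for the deployed version before running anything like this.

```python
import urllib.request

SERVER_URL = "https://dataverse.harvard.edu"        # placeholder: whichever installation holds the records
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder superuser API token
PERSISTENT_ID = "hdl:1902.4/GBSSLT62-CQ266"         # example: the stale Roper handle quoted above

# DELETE .../api/datasets/:persistentId/destroy/?persistentId=... removes the
# dataset record entirely (superuser only); irreversible, so try a test box first.
url = f"{SERVER_URL}/api/datasets/:persistentId/destroy/?persistentId={PERSISTENT_ID}"
req = urllib.request.Request(url, method="DELETE", headers={"X-Dataverse-key": API_TOKEN})
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```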

So for now I plan to:

  • Ask what we should do about the records in https://dataverse.harvard.edu/dataverse/roper
    • Can the resolver point to a different URL prefix, so that the links point to the dataset pages instead of error pages?
    • If not, do we just keep the dead links up since the titles might still help people find the datasets on the Roper site?
    • Should we delete those records (such as by using the destroy API endpoint)?
  • Wait for OAI-PMH harvesting improvements to be applied to Harvard's repository and Demo Dataverse and try again to harvest from Roper.

Definition of done:
We're able to harvest the metadata from all datasets in Roper's OAI-PMH feed, and we've removed the stale records in https://dataverse.harvard.edu/dataverse/roper.

jggautier (Collaborator, author) commented Feb 24, 2023

Folks at the Roper Center let me know that they changed things on their end, so the links that Harvard Dataverse has for Roper datasets (in the collection at https://dataverse.harvard.edu/dataverse/roper) now lead to the datasets instead of to the error pages they led to earlier this year.

I haven't tried again to use Roper Center's OAI-PMH feed to harvest. I'll try again today or next week and report here and in the email thread with the folks from Roper.

landreev (Collaborator) commented:

This is cool! TBH, I gave up on the Roper records we have in the database a while ago; I just assumed they were useless. They are most likely way out of date, even if some are resolving now.
Ideally, we do want to drop that collection and reharvest everything, if they have a functioning OAI interface. But there is no guarantee we will be able to process their records on the first try (so, there's a chance that if we try that, we'll end up with fewer useful records than we have now...). So what we should do is probably start a harvest of their holdings on one of the test boxes - dataverse-internal maybe? - and see how that works.

jggautier (Collaborator, author) commented:

Ah okay. I'm not able to start a harvest on one of the test boxes. I was going to use Demo Dataverse, but I won't now.

It sounds better to me if someone else uses a test box. It's more likely that whoever can do that will also be more capable of figuring out what went wrong if something goes wrong.

And thinking more about it, it's probably better for anyone who continues to work on this to wait until @sbarbosadataverse can prioritize this in the "Harvard Dataverse Repository Instance" column of the Dataverse Global backlog.

landreev (Collaborator) commented:

OK, I'll do that.

jggautier (Collaborator, author) commented:

Gene Wang from the Roper Center's been following up regularly about this. I can let him know we haven't looked into this more yet. But I think it would be helpful if we could say when we could try harvesting from them, even if it's not right away. Is that possible?

landreev (Collaborator) commented:

OK, I'll do it (an experimental harvest) this week, maybe even today. dataverse-internal is really not a good server for that (it's being used for testing PRs and needs to be restarted constantly), but I'm thinking of trying it on the perf cluster. Will post any updates here.

landreev (Collaborator) commented Mar 23, 2023

Just deleting all the old, stale Roper records from the prod. database is going to be a little non-trivial. I've experimented with that a bit this week on the perf cluster (using a copy of the prod. db there). If you do it the supported way, through the harvesting clients panel, our application attempts to delete all the records at once, and that's a bulky transaction with 20K+ datasets. I'd like to avoid having to delete them one by one, so I'm figuring that part out.
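
For context only, the "one by one" route being weighed here would amount to scripting the superuser-only destroy endpoint over a list of persistent IDs. A rough, hypothetical sketch (placeholder server, token, and input file; not a recommendation, and not the supported harvesting-clients path):

```python
import time
import urllib.request

SERVER_URL = "https://demo.dataverse.org"   # placeholder; a test box, not prod
API_TOKEN = "superuser-token-placeholder"
PID_FILE = "roper_pids.txt"                 # hypothetical file: one persistent ID per line
PAUSE_EVERY = 100                           # pause after every N deletions
PAUSE_SECONDS = 5

def destroy(pid: str) -> int:
    """Issue the superuser-only destroy call for one persistent ID."""
    url = f"{SERVER_URL}/api/datasets/:persistentId/destroy/?persistentId={pid}"
    req = urllib.request.Request(url, method="DELETE",
                                 headers={"X-Dataverse-key": API_TOKEN})
    with urllib.request.urlopen(req) as resp:
        return resp.status

with open(PID_FILE) as f:
    pids = [line.strip() for line in f if line.strip()]

for i, pid in enumerate(pids, start=1):
    print(f"{i}/{len(pids)} {pid}: HTTP {destroy(pid)}")
    if i % PAUSE_EVERY == 0:
        time.sleep(PAUSE_SECONDS)           # brief pause to keep the load modest
```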
Their OAI server is not working properly as of today, I'm talking to an engineer at Roper via the linked RT.

cmbz (Collaborator) commented Dec 19, 2023

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

cmbz (Collaborator) commented Dec 19, 2023

2023/12/19: Roper's OAI does not implement the OAI Dublin Core. Unclear on their approach. @landreev will contact them to follow up, and determine next steps.

cmbz added the Size: 10 (A percentage of a sprint) label on Dec 19, 2023
landreev self-assigned this on Jan 17, 2024
landreev (Collaborator) commented:

I don't have much in terms of a status update. I haven't been able to re-test their OAI server because it's been down or broken for the past few days. I.e. all of these calls are returning a 500:

https://api.ropercenter.org/prod/api/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc
https://api.ropercenter.org/prod/api/oai2?verb=ListRecords&metadataPrefix=oai_dc
https://api.ropercenter.org/prod/api/oai2?verb=GetRecord&identifier=10.25940/ROPER-31095120&metadataPrefix=oai_dc

On the other hand,

https://api.ropercenter.org/prod/api/oai2?verb=ListSets
https://api.ropercenter.org/prod/api/oai2?verb=ListMetadataFormats

are working; so their OAI server is still there - just not working properly.
I really wanted to retest the output of their GetRecord and ListRecords implementation before I reach out to them again. The main problem with their records, the last time I tried harvesting from them, was that instead of the standard tags <oai_dc:dc ...> ... </oai_dc:dc>, as required by the OAI-PMH standard, their records were formatted with <ns2>...</ns2> (?). I would prefer to check if that was still the case, before reaching out and asking about it.
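
Once their OAI server is responding again, one quick way to re-check that is to fetch the GetRecord URL above and look at the qualified name of the element inside <metadata>. A minimal sketch:

```python
import urllib.request
import xml.etree.ElementTree as ET

URL = ("https://api.ropercenter.org/prod/api/oai2?verb=GetRecord"
       "&identifier=10.25940/ROPER-31095120&metadataPrefix=oai_dc")
OAI = "{http://www.openarchives.org/OAI/2.0/}"

with urllib.request.urlopen(URL, timeout=60) as resp:
    root = ET.fromstring(resp.read())

metadata = root.find(f"{OAI}GetRecord/{OAI}record/{OAI}metadata")
if metadata is None or len(metadata) == 0:
    print("No <metadata> payload (the response may be an OAI error).")
else:
    # ElementTree reports qualified names as {namespace-uri}localname; a
    # spec-compliant oai_dc record should show
    # {http://www.openarchives.org/OAI/2.0/oai_dc/}dc here.
    print("metadata payload root:", metadata[0].tag)
```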

I'm a little self-conscious following up on that RT ticket (330637), since it's so old and since we (I) have dropped the ball on it before. But if their OAI doesn't come back to life miraculously in the next couple of days, I'll reach out and ask.

jggautier (Collaborator, author) commented:

Email Debt Forgiveness Day is on Feb 29 😛

(I'm kidding of course!)

cmbz added the Size: 3 (A percentage of a sprint) label and removed the Size: 10 (A percentage of a sprint) label on Feb 15, 2024
cmbz (Collaborator) commented Feb 15, 2024

Resized to 3 during sprint kickoff

landreev (Collaborator) commented:

Their OAI service was not showing any intent to "fix itself", so I finally emailed them via RT - hoping that the people on the other end of the ticket are still employed, and willing to talk to me.

landreev (Collaborator) commented:

Resumed communication with the developer(s) on the Roper side. Hopefully we'll nail it this time around.

Happy International Email Debt Forgiveness Day!

landreev added the Status: Needs Input (Applied to issues in need of input from someone currently unavailable) label on Mar 14, 2024
cmbz (Collaborator) commented Jul 11, 2024

2024/07/10

Projects: On Hold ⌛

3 participants