Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix intermittent IT test failures for HarvestingClientsIT #10449

Merged
merged 12 commits into from
Apr 12, 2024
Original file line number Diff line number Diff line change
Expand Up @@ -179,7 +179,6 @@ public void testHarvestingClientRun_AllowHarvestingMissingCVV_True() throws Int
}

private void harvestingClientRun(boolean allowHarvestingMissingCVV) throws InterruptedException {
int expectedNumberOfSetsHarvested = allowHarvestingMissingCVV ? DATASETS_IN_CONTROL_SET : DATASETS_IN_CONTROL_SET - 1;

// This test will create a client and attempt to perform an actual
// harvest and validate the resulting harvested content.
Expand Down Expand Up @@ -264,7 +263,11 @@ private void harvestingClientRun(boolean allowHarvestingMissingCVV) throws Inte
assertEquals(harvestTimeStamp, responseJsonPath.getString("data.lastNonEmpty"));

// d) Confirm that the correct number of datasets have been harvested:
assertEquals(expectedNumberOfSetsHarvested, responseJsonPath.getInt("data.lastDatasetsHarvested"));
if (allowHarvestingMissingCVV) {
assertEquals(DATASETS_IN_CONTROL_SET, responseJsonPath.getInt("data.lastDatasetsHarvested"));
} else {
assertTrue(responseJsonPath.getInt("data.lastDatasetsHarvested") < DATASETS_IN_CONTROL_SET);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so the issue seemed to be that it (sometimes?) expected this to be 2 less than the control set, rather than one? @stevenwinship just to understand do we know why it would sometimes want 1 less and sometimes 2 less? (this change, of course, does fix it for either case, just curious what exactly is going on).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure but it looks like a timing issue. It may be harvesting asynchronously. It may be failing before finishing loading the last good one. Locally I never see the issue so it's hard to test.

@landreev Do you know if files are harvested asynchronously?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Harvesting is definitely asynchronous. But if I remember correctly, the code in that test was specifically checking/waiting for the harvest to be marked as completed before verifying the number of records harvested, so you would expect that to take the asynchronous nature of it out of the picture.

I would prefer to figure out why that failure to harvest one of the records is happening. The test failing intermittently is not such a big deal, so there's no rush to fix it. OTOH, replacing the check with just a "less than" leaves it open to ignoring more serious issues, IMO?

(and I'd like to know if it is an actual failure - as opposed to the last record still being processed at the time of the check; I don't think the latter is a possibility, because, again, I don't think the results of a harvesting run are written in the database before the run is completed - but I really need to take a closer look at the code before I can make educated guesses.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @landreev. I'd like to really understand what's happening during the failure. It's difficult because I can't get it to fail locally. It only fails in the pipeline, and not every time :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do run into timing issues that only happen during Jenkins runs. Investigating such is time-consuming. Basically, you wait for it to happen again, then go on Jenkins and look for anything meaningful in the full console output from the test, AND in the server.log from the instance that was executing the server side of the exchange. The latter is saved in the workspace directory, but only for the most recent Jenkins build, I believe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, if it is a timing issue, there is a chance that it would manifest itself locally - if you run the entire suite of tests, just like Jenkins does, rather than just executing the select tests. But, either way, things like this are always time-consuming.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had that same failure (6 out of 7 records) on my own dev. instance. This was outside the restassured test, I just recreated the client and ran the harvest in the UI. I'll take a look at the errors.
Screen Shot 2024-04-09 at 8 22 02 AM

Copy link
Contributor

@landreev landreev Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, in my case it was not the same intermittent error we've been seeing w/ Jenkins. I just had that 7th dataset already harvested (via the "single dataset set" from demo), so it was skipped. So, nothing intermittent or random about it.
Still, this may be what is happening w/ Jenkins too. Since we are harvesting the same 7 datasets in both tests, it is conceivable that when they are deleted between tests, one is still there by the time the next harvesting run tries to import it again.

In other words, it's not the asynchronous harvesting, it's the deleteClient() also being executed asynchronously that may be the problem.
If this is the case, simply adding a few seconds of sleep to the cleanup() method would fix it.

Also, if this failure to import is indeed on account of the dataset still being in the database, the error message in server.log to look for would be this:
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.api.imports.ImportException (The dataset with the global id doi:xxxx/yy/zzzzz already exists, in the dataverse ...., skipping.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been running the test locally with Dataverse in Docker. Testing most of the IT tests and this test is not failing even with the assert set back to expecting 7. I haven't seen only 6 harvested

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be impossible to reproduce outside of Jenkins, for all we know.

This is an example of a Jenkins run where this test failed and no other test run has happened since; i.e., the server.log from the failed run is still there: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/view/change-requests/job/PR-10241/4/execution/node/3/ws/target/server.log

There is no "ImportException ... dataset already exists" message there, like on my instance. But, on a second thought, that would not be the case, since your tests are harvesting into the same collection (meaning, if the dataset still exists, Dataverse attempts to delete, then re-import it). There is indeed a ton of OptimisticLock exceptions right around the time of the second harvesting run that appear to be about a failure to delete. You can tell by the messages about the field missing from CVV that it is the second test run that is failing. No obvious smoking guns in the console output.

Don't have much time to look into this any deeper. (I don't think this issue is worth the time I've already spent looking into it, tbh). But, what I mentioned earlier - adding a few sec. of sleep to the cleanup method - may be something worth trying.

}

// ok, it looks like the harvest has completed successfully.
break;
Expand Down