
Fix intermittent IT test failures for HarvestingClientsIT #10449

Merged
12 commits merged into develop on Apr 12, 2024

Conversation


@stevenwinship (Contributor) commented Mar 29, 2024

Description: Intermittent failures occur during IT test runs, pointing to HarvestingClientsIT.

Closes: #10438

How to test this: Observe other PRs and see if any fail on this test

@stevenwinship self-assigned this Mar 29, 2024
@stevenwinship linked an issue Mar 29, 2024 that may be closed by this pull request
@stevenwinship added the Size: 3 (a percentage of a sprint; 2.1 hours) and Size: 0.5 (a percentage of a sprint; 0.35 hours) labels and removed the Size: 0.5 label Mar 29, 2024


@stevenwinship removed their assignment Apr 1, 2024
if (allowHarvestingMissingCVV) {
assertEquals(DATASETS_IN_CONTROL_SET, responseJsonPath.getInt("data.lastDatasetsHarvested"));
} else {
assertTrue(responseJsonPath.getInt("data.lastDatasetsHarvested") < DATASETS_IN_CONTROL_SET);
Contributor

OK, so the issue seemed to be that it (sometimes?) expected this to be 2 less than the control set, rather than 1? @stevenwinship, just to understand: do we know why it would sometimes want 1 less and sometimes 2 less? (This change, of course, does fix it for either case; I'm just curious what exactly is going on.)

Contributor Author

I'm not 100% sure, but it looks like a timing issue. It may be harvesting asynchronously, and it may be failing before finishing loading the last good one. Locally I never see the issue, so it's hard to test.

@landreev Do you know if files are harvested asynchronously?

Contributor

Harvesting is definitely asynchronous. But if I remember correctly, the code in that test was specifically checking/waiting for the harvest to be marked as completed before verifying the number of records harvested, so you would expect that to take the asynchronous nature of it out of the picture.

I would prefer to figure out why that failure to harvest one of the records is happening. The test failing intermittently is not such a big deal, so there's no rush to fix it. OTOH, replacing the check with just a "less than" leaves it open to ignoring more serious issues, IMO?

(and I'd like to know if it is an actual failure - as opposed to the last record still being processed at the time of the check; I don't think the latter is a possibility, because, again, I don't think the results of a harvesting run are written in the database before the run is completed - but I really need to take a closer look at the code before I can make educated guesses.)
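For reference, here is a minimal sketch (not the test's actual code) of the wait-for-completion pattern being described, written in the RestAssured style the suite uses. The endpoint path, the data.status field, and the "inProgress" value are assumptions made for illustration; only data.lastDatasetsHarvested comes from the diff above.

import static io.restassured.RestAssured.given;
import io.restassured.response.Response;

// Poll the harvesting client until its run is no longer in progress, then let the
// caller verify data.lastDatasetsHarvested against the control set.
static void waitForHarvestToFinish(String nickName, String apiToken) throws InterruptedException {
    int maxWaitSeconds = 20; // bounded wait, so a hung harvest fails the test instead of blocking it
    for (int i = 0; i < maxWaitSeconds; i++) {
        Response response = given()
                .header("X-Dataverse-key", apiToken)                  // superuser API token
                .get("/api/harvest/clients/" + nickName);             // assumed client status endpoint
        String status = response.jsonPath().getString("data.status"); // assumed status field name
        if (!"inProgress".equals(status)) {
            return; // the run has finished (or failed); counts can be checked now
        }
        Thread.sleep(1000L); // poll once per second
    }
    throw new IllegalStateException("Harvest for client " + nickName + " did not finish in time");
}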

Contributor Author

Thanks @landreev. I'd like to really understand what's happening during the failure. It's difficult because I can't get it to fail locally. It only fails in the pipeline, and not every time :(

Contributor

We do run into timing issues that only happen during Jenkins runs. Investigating such is time-consuming. Basically, you wait for it to happen again, then go on Jenkins and look for anything meaningful in the full console output from the test, AND in the server.log from the instance that was executing the server side of the exchange. The latter is saved in the workspace directory, but only for the most recent Jenkins build, I believe.

Contributor

(also, if it is a timing issue, there is a chance that it would manifest itself locally - if you run the entire suite of tests, just like Jenkins does, rather than just executing the select tests. But, either way, things like this are always time-consuming.)

Contributor

I just had that same failure (6 out of 7 records) on my own dev instance. This was outside the RestAssured test; I just recreated the client and ran the harvest in the UI. I'll take a look at the errors.
[Screenshot: Screen Shot 2024-04-09 at 8:22:02 AM]

@landreev (Contributor) Apr 9, 2024

Unfortunately, in my case it was not the same intermittent error we've been seeing w/ Jenkins. I just had that 7th dataset already harvested (via the "single dataset set" from demo), so it was skipped. So, nothing intermittent or random about it.
Still, this may be what is happening w/ Jenkins too. Since we are harvesting the same 7 datasets in both tests, it is conceivable that when they are deleted between tests, one is still there by the time the next harvesting run tries to import it again.

In other words, it's not the asynchronous harvesting, it's the deleteClient() also being executed asynchronously that may be the problem.
If this is the case, simply adding a few seconds of sleep to the cleanup() method would fix it.

Also, if this failure to import is indeed on account of the dataset still being in the database, the error message in server.log to look for would be this:
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.api.imports.ImportException (The dataset with the global id doi:xxxx/yy/zzzzz already exists, in the dataverse ...., skipping.)
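A minimal sketch of the suggested fix, assuming the suite's JUnit 5 lifecycle annotations; the cleanup() shape and the deleteClient() helper are hypothetical names used for illustration, and only the idea of adding a few seconds of sleep comes from this comment.

import org.junit.jupiter.api.AfterEach;

@AfterEach
public void cleanup() throws InterruptedException {
    deleteClient(nickName); // hypothetical helper that deletes the harvesting client and its datasets
    Thread.sleep(3000L);    // a few seconds of slack so the asynchronous server-side delete can complete
}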

Contributor Author

I've been running the test locally with Dataverse in Docker, testing most of the IT tests, and this test is not failing even with the assert set back to expecting 7. I haven't seen only 6 harvested.

Contributor

It may be impossible to reproduce outside of Jenkins, for all we know.

This is an example of a Jenkins run where this test failed and no other test run has happened since; i.e., the server.log from the failed run is still there: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/view/change-requests/job/PR-10241/4/execution/node/3/ws/target/server.log

There is no "ImportException ... dataset already exists" message there, like on my instance. But, on second thought, that would not be the case, since your tests are harvesting into the same collection (meaning, if the dataset still exists, Dataverse attempts to delete, then re-import it). There are indeed a ton of OptimisticLock exceptions right around the time of the second harvesting run that appear to be about a failure to delete. You can tell by the messages about the field missing from CVV that it is the second test run that is failing. No obvious smoking guns in the console output.

Don't have much time to look into this any deeper. (I don't think this issue is worth the time I've already spent looking into it, tbh). But, what I mentioned earlier - adding a few sec. of sleep to the cleanup method - may be something worth trying.

@scolapasta self-assigned this Apr 1, 2024


@stevenwinship (Contributor Author)

Added back the assert for exact harvest count and added an extra second to the wait.

@landreev (Contributor) commented Apr 9, 2024

When/if Jenkins comes back to life, if the tests complete w/out failing 3 times in a row with this extra second added, I'll vote for merging it quickly; if it keeps happening, we'll revisit it then.

@landreev (Contributor)

I started a build by hand just now.

@landreev (Contributor)

@stevenwinship Could you please sync the branch up with develop? That appears to be the reason the last Jenkins run bombed.


@landreev (Contributor)

One successful run earlier today (build 14). Build 15: a really weird failure - the other harvesting test, testHarvestingClientRun_AllowHarvestingMissingCVV_True, harvested 4 datasets instead of 8. (??)

@stevenwinship self-assigned this Apr 11, 2024
@landreev (Contributor)

Don has fixed the Payara distribution issue. But it's still Jenkins roulette, with the harvesting tests failing half of the time; somehow it's been made worse so far. So no, we are not merging this PR just yet. If I have time, I'll look into it some more.

@landreev (Contributor)

OK, one mystery solved - I just left a comment in #10464, the PR I merged yesterday; the new errors are the result of the extra check added, which is performed before the async indexing gets a chance to finish.

@landreev (Contributor)

also, just saw that you have fixed it already.


@landreev (Contributor)

> also, just saw that you have fixed it already.

OK, you have fixed the cleanup part - but the search check in line 278 may still bomb for the same async. reason - ?
... I'll shut up and let you handle this.


@stevenwinship removed their assignment Apr 12, 2024
@landreev self-assigned this Apr 12, 2024
globalIdList.add(jsonPath.getString("data.items["+idx+"].global_id"));
}
// verify count after collecting global ids
assertEquals(expectedNumberOfSetsHarvested, jsonPath.getInt("data.total_count"));
Contributor

@stevenwinship
The new intermittent error, the one introduced in #10464 the other day, is still here - the last Jenkins build has just failed again on this line with expected: <8> but was: <2>. Just a matter of calling the search engine while the new datasets may still be getting indexed.

Not your fault, strictly speaking, but since we are already working on this test, may as well. Let me just add a second of sleep here and merge it; I'm sure that's all it needs.
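A minimal sketch of that idea: poll the search API for a bounded time instead of asserting immediately, since the newly harvested datasets may still be sitting in the indexing queue. The jsonPath, query, and expectedNumberOfSetsHarvested variables echo the snippets in this thread; the /api/search call and the loop shape are assumptions.

int i = 0;
int maxWait = 20;
while (jsonPath.getInt("data.total_count") != expectedNumberOfSetsHarvested && i++ < maxWait) {
    Thread.sleep(1000L); // give asynchronous indexing another second to catch up
    jsonPath = given().get("/api/search?q=" + query).jsonPath(); // re-run the search (assumed request shape)
}
// Verify the count only once the expected total shows up or the retries are exhausted.
assertEquals(expectedNumberOfSetsHarvested, jsonPath.getInt("data.total_count"));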


int i = 0;
int maxWait = 20;
String query = "dsPersistentId:" + globalIdList.stream().map(s -> "\""+s+"\"").collect(Collectors.joining(","));
Contributor

Why was this necessary, btw - what was the problem with "metadataSource:" + nickName?

Doesn't matter really, as long as it works. But I can't help but feel like there's too much going on in this test class, vs. its useful value. (Strictly speaking, there is no need to harvest the first 7 datasets twice in a row; that second set could just contain the 1 dataset with the CVV, maybe?)
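For comparison, a sketch of the two query constructions being discussed; both string patterns appear in this thread, and nickName and globalIdList are the variables from the test.

import java.util.stream.Collectors;

String byClient = "metadataSource:" + nickName; // query by the harvesting client, as suggested above
String byIds = "dsPersistentId:" + globalIdList.stream()
        .map(s -> "\"" + s + "\"")
        .collect(Collectors.joining(","));      // query for exactly the datasets collected earlier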

Contributor

I'm still mildly curious why it was necessary to search by individual dataset ids here.

@landreev (Contributor)

Anyway, I just want to see it pass a few times in a row and merge it without further ado.


📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10438-fix-intermittent-harvesting-tests
ghcr.io/gdcc/configbaker:10438-fix-intermittent-harvesting-tests

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@landreev (Contributor)

There were more annoying unrelated Jenkins failures. But the 2 harvesting tests have objectively passed three times in a row. Merging.

@landreev merged commit 0b404c8 into develop Apr 12, 2024
18 of 19 checks passed
@pdurbin added this to the 6.3 milestone Apr 16, 2024
@landreev removed their assignment May 2, 2024
Labels
Size: 3 (a percentage of a sprint; 2.1 hours)
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

Fix intermittent failures in Harvesting IT tests
4 participants