-
Notifications
You must be signed in to change notification settings - Fork 477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix intermittent IT test failures for HarvestingClientsIT #10449
Merged
Merged
Changes from 5 commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
4fe9053
add order to test
stevenwinship 30356fd
add order to test
stevenwinship 642b659
fix test
stevenwinship 2ed51c2
fix test
stevenwinship 38d565f
fix test
stevenwinship 627bd58
fixing test
stevenwinship 2669bcc
fixing test
stevenwinship df9ce9d
Merge branch 'develop' into 10438-fix-intermittent-harvesting-tests
stevenwinship e22501d
adding better cleanup between tests
stevenwinship 7db469d
refine cleanup
stevenwinship f4f57d9
refine cleanup
stevenwinship 03aa898
An extra sec. for indexing
landreev File filter
Filter by extension
Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so the issue seemed to be that it (sometimes?) expected this to be 2 less than the control set, rather than one? @stevenwinship just to understand do we know why it would sometimes want 1 less and sometimes 2 less? (this change, of course, does fix it for either case, just curious what exactly is going on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure but it looks like a timing issue. It may be harvesting asynchronously. It may be failing before finishing loading the last good one. Locally I never see the issue so it's hard to test.
@landreev Do you know if files are harvested asynchronously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Harvesting is definitely asynchronous. But if I remember correctly, the code in that test was specifically checking/waiting for the harvest to be marked as completed before verifying the number of records harvested, so you would expect that to take the asynchronous nature of it out of the picture.
I would prefer to figure out why that failure to harvest one of the records is happening. The test failing intermittently is not such a big deal, so there's no rush to fix it. OTOH, replacing the check with just a "less than" leaves it open to ignoring more serious issues, IMO?
(and I'd like to know if it is an actual failure - as opposed to the last record still being processed at the time of the check; I don't think the latter is a possibility, because, again, I don't think the results of a harvesting run are written in the database before the run is completed - but I really need to take a closer look at the code before I can make educated guesses.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @landreev. I'd like to really understand what's happening during the failure. It's difficult because I can't get it to fail locally. It only fails in the pipeline, and not every time :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do run into timing issues that only happen during Jenkins runs. Investigating such is time-consuming. Basically, you wait for it to happen again, then go on Jenkins and look for anything meaningful in the full console output from the test, AND in the server.log from the instance that was executing the server side of the exchange. The latter is saved in the workspace directory, but only for the most recent Jenkins build, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also, if it is a timing issue, there is a chance that it would manifest itself locally - if you run the entire suite of tests, just like Jenkins does, rather than just executing the select tests. But, either way, things like this are always time-consuming.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just had that same failure (6 out of 7 records) on my own dev. instance. This was outside the restassured test, I just recreated the client and ran the harvest in the UI. I'll take a look at the errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, in my case it was not the same intermittent error we've been seeing w/ Jenkins. I just had that 7th dataset already harvested (via the "single dataset set" from demo), so it was skipped. So, nothing intermittent or random about it.
Still, this may be what is happening w/ Jenkins too. Since we are harvesting the same 7 datasets in both tests, it is conceivable that when they are deleted between tests, one is still there by the time the next harvesting run tries to import it again.
In other words, it's not the asynchronous harvesting, it's the
deleteClient()
also being executed asynchronously that may be the problem.If this is the case, simply adding a few seconds of sleep to the
cleanup()
method would fix it.Also, if this failure to import is indeed on account of the dataset still being in the database, the error message in server.log to look for would be this:
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.api.imports.ImportException (The dataset with the global id doi:xxxx/yy/zzzzz already exists, in the dataverse ...., skipping.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been running the test locally with Dataverse in Docker. Testing most of the IT tests and this test is not failing even with the assert set back to expecting 7. I haven't seen only 6 harvested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be impossible to reproduce outside of Jenkins, for all we know.
This is an example of a Jenkins run where this test failed and no other test run has happened since; i.e., the server.log from the failed run is still there: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/view/change-requests/job/PR-10241/4/execution/node/3/ws/target/server.log
There is no "ImportException ... dataset already exists" message there, like on my instance. But, on a second thought, that would not be the case, since your tests are harvesting into the same collection (meaning, if the dataset still exists, Dataverse attempts to delete, then re-import it). There is indeed a ton of OptimisticLock exceptions right around the time of the second harvesting run that appear to be about a failure to delete. You can tell by the messages about the field missing from CVV that it is the second test run that is failing. No obvious smoking guns in the console output.
Don't have much time to look into this any deeper. (I don't think this issue is worth the time I've already spent looking into it, tbh). But, what I mentioned earlier - adding a few sec. of sleep to the cleanup method - may be something worth trying.