Intermittent network failures attempting repeated downloads of schemas/schematrons #739

Closed
jordanpadams opened this issue Oct 24, 2023 · 11 comments · Fixed by #742
@jordanpadams
Member

jordanpadams commented Oct 24, 2023

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I run `mvn test`, it sometimes works, sometimes fails saying it cannot download a schema or schematron, and other times it just hangs...

The Atmospheres node also noted that validate starts running and then just hangs, appearing to not be doing anything.

🕵️ Expected behavior

I expected it would download and test successfully.

📜 To Reproduce

  1. Run `mvn test` (may need to try it a few times)

🖥 Environment Info

macOS (me)
Linux (ATM Node)

📚 Version of Software Used

v3.2.0
v3.4.0-SNAPSHOT

🩺 Test Data / Additional context

Any large test data sets.

🦄 Related requirements

⚙️ Engineering Details

  • DSIO ticket created to investigate CloudWatch and any blocking happening there

  • For development, let's verify our local caching is actually working and that we are not going back to the JPL website over and over again (see the caching sketch below)
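
As a rough illustration of what that local-cache check could lead to, here is a minimal sketch, assuming schemas/schematrons get resolved through a `javax.xml.transform.URIResolver`; `CachingUriResolver`, the cache directory layout, and the hash-based key are hypothetical, not validate's actual implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical resolver that caches remote schemas/schematrons on local disk. */
public class CachingUriResolver implements URIResolver {

  private final Path cacheDir;

  public CachingUriResolver(Path cacheDir) throws IOException {
    this.cacheDir = Files.createDirectories(cacheDir);
  }

  @Override
  public Source resolve(String href, String base) throws TransformerException {
    try {
      URI uri = (base == null || base.isEmpty())
          ? URI.create(href)
          : URI.create(base).resolve(href);
      // Only cache remote resources; pass everything else straight through.
      if (!"http".equals(uri.getScheme()) && !"https".equals(uri.getScheme())) {
        return new StreamSource(uri.toString());
      }
      // Cache key: a filesystem-safe encoding of the full URL.
      Path cached = cacheDir.resolve(Integer.toHexString(uri.toString().hashCode()) + ".xml");
      if (!Files.exists(cached)) {
        // First request for this URL: download it once and keep it on disk.
        try (InputStream in = uri.toURL().openStream()) {
          Files.copy(in, cached);
        }
      }
      // Keep the original systemId so relative references inside the schema still resolve.
      return new StreamSource(Files.newInputStream(cached), uri.toString());
    } catch (IOException e) {
      throw new TransformerException("Failed to resolve " + href, e);
    }
  }
}
```

The cache in this sketch lives on disk rather than in memory so it could survive separate CLI invocations of validate, not just a single process.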

@al-niessner
Contributor

@jordanpadams

The way Cucumber works, it goes back for a download on every test: validate is run from the CLI each and every time, and there is no Java wrapper around it. Hence, the cache exists for each individual test, not the entire test suite.

@jordanpadams
Member Author

@al-niessner The SAs and I dug into this, and it does not look like anything abnormal is happening on the WAF side of the house blocking GitHub or my laptop builds (I was seeing these errors). It has become pretty evident that something specific to the Saxon upgrade is causing these intermittent failures. Not sure if they changed something in the way they are caching the loaded data, or how they are handling http vs. https, or ??? But looking back through the GitHub Actions history and doing some testing, it is pretty clear it's a Saxon upgrade thing:

[Screenshot: 2023-10-26 at 7:34:26 PM]

If nothing else, I think we need to add some exception handling where source.getSystemId() is being called to track this issue down, and then maybe we just do a retry or two? Or if it is an http vs. https thing, maybe we need to automatically detect and fix those URLs? Not sure.
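
For illustration only, a minimal sketch of that retry-plus-https idea, assuming the download ultimately goes through a plain URL stream; `SchemaDownload`, its helper names, and the pds.nasa.gov prefix check are hypothetical, not validate's actual code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

/** Hypothetical helper: prefer https and retry transient download failures. */
public final class SchemaDownload {

  private SchemaDownload() {}

  /** Rewrite plain-http PDS URLs to https before opening them (assumed prefix). */
  static String normalize(String systemId) {
    if (systemId != null && systemId.startsWith("http://pds.nasa.gov/")) {
      return "https://" + systemId.substring("http://".length());
    }
    return systemId;
  }

  /** Open the stream, retrying a couple of times on IOException with simple backoff. */
  static InputStream openWithRetry(String systemId, int maxAttempts) throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return new URL(normalize(systemId)).openStream();
      } catch (IOException e) {
        last = e; // remember the failure and try again
        try {
          Thread.sleep(1000L * attempt);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while retrying " + systemId, ie);
        }
      }
    }
    throw last; // all attempts failed (assumes maxAttempts >= 1)
  }
}
```

Wrapping the call site around source.getSystemId() with logging plus something like `openWithRetry` would at least surface which URL is failing and whether a second attempt succeeds.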

@jordanpadams
Member Author

I added a stacktrace to this branch, and you can see some more interesting information about when/how it is failing: https://github.com/NASA-PDS/validate/actions/runs/6662161511/job/18106097361#step:5:22250

@jordanpadams
Member Author

I know we experienced some issues in the past with trying to clear the JAXB cache with different versions of the PDS4 schemas (see the comments in validate.feature regarding `Move github87 as it is interfering with github292 tests`)

@al-niessner
Contributor

@jordanpadams

Now this is funny. I am at run 6 with no failures from `mvn test` of validate. I am using my phone's 5G, and it is not particularly fast. Seems slow networks are good networks.

@jordanpadams
Member Author

@al-niessner I just ran it once and it failed... JPL VPN maybe?

@al-niessner
Contributor

@jordanpadams

No, I am trying to decide why it is intermittent. The failure rate changing with network speed is consistent with a race condition somewhere, and Saxon looks likely. Maybe they will have 12.4 out with it fixed sooner than I can find it.
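
If it really is a race in the resolution path, a possible stopgap is to serialize resolutions so only one download is in flight at a time. Sketch only, assuming the resolver path goes through `javax.xml.transform.URIResolver`; `SerializedUriResolver` is a hypothetical name:

```java
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;

/** Hypothetical workaround: funnel all resolutions through one lock. */
public class SerializedUriResolver implements URIResolver {

  private final URIResolver delegate;
  private final Object lock = new Object();

  public SerializedUriResolver(URIResolver delegate) {
    this.delegate = delegate;
  }

  @Override
  public Source resolve(String href, String base) throws TransformerException {
    // If the intermittent failure is a race inside the resolver path,
    // forcing one resolution at a time should make it disappear.
    synchronized (lock) {
      return delegate.resolve(href, base);
    }
  }
}
```

If the hangs and failures stop under this kind of serialization, that would be decent evidence the problem is concurrency inside the upgraded Saxon rather than the network itself.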

@jordanpadams
Member Author

@al-niessner Copy that. A race condition makes sense here. Very annoying.

I am ok with reverting for now, and leaving a PR out there for the changes you have made, and we can come back to it if/when they fix it?

@al-niessner
Contributor

@jordanpadams

Why revert? I am on the trail of this and it is not too difficult to work around. Besides, I do not have built-in mirrors to see behind me. Move forward or stand still.

@jordanpadams
Member Author

@al-niessner sounds good. As long as we fix the errors, I am good. Just wanted to make sure you weren't banging your head too hard on the table trying to get this to work.

@miguelp1986

TestRail Test ID: T8681186
