Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase rtol in thermoToYaml for ppcle64/aarch64 systems #1271

Closed
wants to merge 1 commit into from

Conversation

mefuller
Copy link
Contributor

@mefuller mefuller commented May 2, 2022

Changes proposed in this pull request

rtol for thermoToYaml increased by 20% to allow tests to pass on ppcle64 and aarch64 builds

Problem was identified while building v2.6.0 for Fedora Linux 35+; no issue was present in v2.6.0b2:
Relevant logs from v2.6.0 builds:
https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-rawhide-aarch64/04357916-cantera/builder-live.log.gz
https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-rawhide-ppc64le/04357916-cantera/builder-live.log.gz

Checklist

  • The pull request includes a clear description of this code change
  • Commit messages have short titles and reference relevant issues
  • Build passes (scons build & scons test) and unit tests address code coverage
  • Style & formatting of contributed code follows contributing guidelines
  • The pull request is ready for review

@mefuller mefuller changed the title increase rtol in thermoToYaml for ppcle64/aarch64 systems Increase rtol in thermoToYaml for ppcle64/aarch64 systems May 2, 2022
@bryanwweber
Copy link
Member

Can we bisect to find out why this happened? There haven't been a huge number of changes, so I'm surprised this is needed.

@mefuller
Copy link
Contributor Author

mefuller commented May 2, 2022

It'll take me a few days to get to it, but sure

@speth
Copy link
Member

speth commented May 2, 2022

Can you link to the page with the successful runs for 2.6.0b2? I'm curious if there are any differences in the library or compiler versions that could be responsible, besides changes to Cantera.

Are you running these builds only when we tag a version, or are these run on every update to the main branch? It's a little hard to be proactive about these issues and make an efficient release if we only get notice about them after the fact, and we don't have builders like these available through the GitHub actions system.

@mefuller
Copy link
Contributor Author

mefuller commented May 3, 2022

The last four builds at https://koji.fedoraproject.org/koji/packageinfo?packageID=35082 (v2.6.0-18) are the successful builds on all supported architectures for the four current Fedoras 34-36, rawhide(37)

@speth I will try to investigate a CI process with the Fedora builld system(s) to improve support for alternate architectures

@ischoegl
Copy link
Member

ischoegl commented May 3, 2022

So it looks like things started to break with Cantera-2.6.0-14 in mid April?

@mefuller
Copy link
Contributor Author

mefuller commented May 3, 2022

@ischoegl v2.6.0b2 broke my build routines ("spec file"), but that is separate from the current test failures on alternate 64 bit arches.
I updated the spec to fix those issues separately, which led to v2.6.0-18.

The current failures are not listed on koji, only on copr (first links, not latter in response to @speth) as getting burned last time led to me not formally submitting to Fedora yet.

@speth
Copy link
Member

speth commented May 3, 2022

There are definitely differences in the system package versions between these builds, at least for FC37. At the very least, gcc-c++ went from 12.0.1-0.15 to 12.0.1-0.17, and glibc-devel went from 2.35.9000-12 to 2.35.9000-15.

@mefuller - can you point to the COPR logs for FC35 and FC36, which you also indicated were failing. Is there a link to a more general status page that would provide access to additional information?

@speth
Copy link
Member

speth commented May 3, 2022

Looking through the logs makes it pretty clear why this started failing:

[ RUN      ] ThermoYamlRoundTrip.BinarySolutionTabulated
test/thermo/thermoToYaml.cpp:393: Failure
The difference between v1[k] and v2[k] is 2.4286128663675299e-16, which exceeds 1e-20 + rtol*fabs(v1[k]), where
v1[k] evaluates to 0.020432776617953851,
v2[k] evaluates to 0.020432776617954094, and
1e-20 + rtol*fabs(v1[k]) evaluates to 2.0433776617953852e-16.
0
test/thermo/thermoToYaml.cpp:393: Failure
The difference between v1[k] and v2[k] is 2.0816681711721685e-16, which exceeds 1e-20 + rtol*fabs(v1[k]), where
v1[k] evaluates to 0.018983716075156774,
v2[k] evaluates to 0.018983716075156566, and
1e-20 + rtol*fabs(v1[k]) evaluates to 1.8984716075156774e-16.
1
[  FAILED  ] ThermoYamlRoundTrip.BinarySolutionTabulated (16 ms)

PR #1072, which was merged just after the 2.6.0b2 tag was specifically about changing the calculation of the partial molar volumes (v1 and v2) in the error message in the BinarySolutionTabulatedThermo class. I'm not sure why this calculation is less repeatable on these particular platforms than all of the others that we test, though.

@mefuller
Copy link
Contributor Author

mefuller commented May 3, 2022

@speth logs for complete matrix of Fedora versions and architectures (except s390x) are at https://copr.fedorainfracloud.org/coprs/fuller/cantera-test/build/4357916/

FWIW, the modification I propose here to rtol does fix this issue of failing tests: https://copr.fedorainfracloud.org/coprs/fuller/cantera-test/build/4358146/
(Ignore failed F34 builds - this is issue #1269 - and also rawhide/s390x as this is a totally separate issue)

@speth
Copy link
Member

speth commented May 3, 2022

Also, I'm not super excited by the idea of just arbitrarily relaxing the tolerance on this test. On the two platforms that I have easy access to (macOS/arm64 and Ubuntu/x86_64), I can actually tighten these tolerances by quite a bit, since they ought to differ only by a modest roundoff error. Specifically, I can use a tolerance of 2e-15 for the entire ThermoYamlRoundTrip test suite, and go down all the way to 2e-16 for the specific BinarySolutionTabulated test.

I think it would be worth trying to understand why these platforms can only meet tolerances of 1.2e-14. Unfortunately, investigating this seems like it will be quite difficult unless someone has full access to machines with these architectures.

@mefuller
Copy link
Contributor Author

mefuller commented May 3, 2022

@speth @bryanwweber @ischoegl I am really sorry about kicking the hornet's nest here - I started a draft PR intending for it to be a draft while I investigated this, including establishing what the relaxation would need to be to get tests passing and also seeing if there was anything else out there. I didn't realize it would grab any attention right away.

From the perspective of Fedora packaging and (realistic) end-use cases, I don't believe (temporarily) excluding these architectures is a big deal, and I can do that. I can keep building against them routinely to check for improvements, since that's currently my technical level (I know big-endian and little-endian have to do with Gulliver's Travels, less what the practical implications for precision might be, etc.).

I'm content to follow all of you ("upstream") on this one - I was just trying to propose a solution/work-around concurrent with the problem to be less of a pest, not more.

@speth
Copy link
Member

speth commented May 3, 2022

@mefuller No worries, and apologies for the flurry of activity here. I think we're just very eager to make sure that the new release is working everywhere that we expect it to.

@bryanwweber
Copy link
Member

@mefuller I agree with @speth 100%, we're happy to know about the issues, and don't worry at all about the flurry of activity here. Packages are available via pip for those architectures, so disabling them for now shouldn't have much of an impact on end-users.

@ischoegl
Copy link
Member

ischoegl commented May 3, 2022

I likewise agree!

@ischoegl
Copy link
Member

What is the current status on this?

@mefuller mefuller marked this pull request as ready for review September 25, 2022 17:29
@mefuller
Copy link
Contributor Author

Just rebased
I've made this change as a patch in the Fedora package in order to get it to pass tests there

@ischoegl
Copy link
Member

Personally, I’m not opposed to adopting this tweak; perhaps you could be a little more descriptive in the commit comment?

@ischoegl
Copy link
Member

Personally, I’m not opposed to adopting this tweak; perhaps you could be a little more descriptive in the commit comment?

@mefuller ... is this PR still relevant or can it be closed?

@mefuller mefuller closed this Mar 22, 2023
@mefuller
Copy link
Contributor Author

For Fedora builds, I just patch it. I don't think this needs to be upstreamed especially as I haven't reevaluated whether it's still necessary at this time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants