Builds/Tests now failing on aarch64, ppc64le architectures (Fedora 36=Rawhide) #1174
Comments
Hi @mefuller … thanks for reporting. One thing that would help narrow down the offending commit would be to know when the last known successful build was triggered (ideally, the commit hash). |
Four days ago we were in good shape: https://koji.fedoraproject.org/koji/tasks?owner=fuller&state=all and https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3163517/ I'm looking, but not finding a corresponding commit hash. |
Ok. #1089 is the likely culprit for this then (sigh). It was merged 4 days ago and the last build likely passed just hours before that merge. |
If it helps, I would be willing to work with you and @bryanwweber (and anyone else) on setting up automated builds for testing with Fedora/EL and multiple architectures - I believe I can provide you with URLs to add as webhooks to trigger builds when you push to main (I have not tested this yet). |
Regarding the build failure, it almost looks like this is due to some upstream issue, as it is triggered for an Regarding the other issues, these happen to be in a part unaffected by recent changes and mainly look like issues related to machine precision. Still curious that this happens all of a sudden. |
Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present. Would it be acceptable to modify the tests such that there's more leniency? |
That would explain this!
I think changing offending lines to |
A compiler segfault that seems to have something to do with including one of our dependencies header files is definitely an upstream issue, not a problem that we have any chance of fixing. I agree that changing those failing comparisons to |
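As a rough illustration of what a more lenient comparison means in practice, here is a minimal Python sketch. It is not the Cantera test code (the failing test is part of the C++ suite); the helper name `nearly_equal` and the tolerance values are assumptions chosen for the example, and `math.nextafter` requires Python 3.9 or later.

```python
import math

def nearly_equal(a, b, rtol=1e-12, atol=1e-300):
    """Return True when a and b agree to within a relative tolerance.

    rtol and atol are illustrative defaults, not values from the Cantera tests.
    """
    return math.isclose(a, b, rel_tol=rtol, abs_tol=atol)

# Two results that differ only in the last bit, as can happen across architectures.
x = 0.6
y = math.nextafter(x, 1.0)  # the next representable double above x

print(x == y)              # exact comparison rejects the pair
print(nearly_equal(x, y))  # tolerance-based comparison accepts it
```

The relative tolerance scales with the magnitude of the values, which is usually what is wanted when the compared quantities span many orders of magnitude; the small absolute tolerance only matters for values very close to zero.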
@mefuller Thanks for volunteering! I'd been thinking about how to add a Fedora job to our CI here on GitHub Actions. You can specify a container in which the job should run, so I think it should be possible to add a job that pulls a Fedora container from Quay and runs the build and tests inside that. I've been working on other things lately, but it's on my to-do list. If you want to try to figure it out, you can edit |
@bryanwweber I will definitely take a look. |
Given that the failures identified here are related to architectures other than x86_64, I wonder if the most useful thing would actually be to trigger these builds elsewhere -- I don't think GitHub Actions currently provides runners on architectures other than |
ok, I took care of the original test errors and now have a few more to deal with - I'll work on a larger PR aimed at getting things working across architectures |
This is true, but it can use emulated architectures, as is done for the PyPI packages. That said, if COPR provides the resources, it'd probably be worth having architectures other than x86_64 running over there. I still think it'd be worth having a Fedora build on our GH actions here though. |
I need to ask for a bit more help:
I don't see any useful output. Am I looking in the wrong place and/or are there options I should pass to the tests to get more out? |
@mefuller ... could you run tests with the |
On second look. The existing log already points to
meaning that generated output contains some problematic characters. Tracking this down would likely involve extracting the |
@ischoegl thanks - I feel pretty dumb now for not seeing all that output above where I was looking. |
No worries. Fwiw, I just retracted a PR as I didn't realize that the change would force a complete rebuild of Cantera after each commit 😢 ... hindsight (sigh) |
I guess today's a move fast and break things kind of day. I ran the verbose tests: https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-rawhide-s390x/03196717-cantera/builder-live.log.gz (just in case anyone else wants to take a peek) |
This error:
just looks like an internal problem with the |
I have filed a bug report with eigen at https://gitlab.com/libeigen/eigen/-/issues/2422 |
I opened a ticket regarding this issue: https://sourceforge.net/p/ruamel-yaml/tickets/417/ |
@mefuller If an issue, it is in ruamel-yaml-clib. The 16385 (2^14) is the input buffer size ( https://sourceforge.net/p/ruamel-yaml-clib/code/ci/default/tree/yaml_private.h#l57 ) so maybe this is some issue reading past the buffer only showing up on 390. I assume you compile ruamel.yaml.clib yourself (as I don't provide any wheels for that architecture), so maybe you can patch a larger number in there. |
@AvdN I've opened a bug report to have the buffer patch tested: https://bugzilla.redhat.com/show_bug.cgi?id=2042422 |
A Red Hat ticket has also been opened regarding the Eigen / ppc64le build failure: https://bugzilla.redhat.com/show_bug.cgi?id=2042432 |
I think that last bug should be filed against GCC, not Eigen - an error in Eigen should at worst result in the compiler reporting an error of some sort, not segfaulting. |
Problem description
All builds on the ppc64le architecture with F36/Rawhide now fail. This was not the case four days ago (see https://copr.fedorainfracloud.org/coprs/fuller/Cantera/builds/).
Concurrently, the
kinetics: KineticsAddSpecies3.add_species_sequential
test now fails on "successful" builds on F34/35 for ppc64le architectures (and also for aarch64, i686 and s390x on all three Fedoras 34, 35, and 36/Rawhide, but not x86_64). This may not be a new problem as I only just added testing automation to the build automation for all architectures. (Yes, I know, I should have done that earlier.)
Steps to reproduce
scons build && scons test
Behavior
Error message:
Build failure (ppc64le:F36):
Test failure (all other affected systems):
System information
Attachments
Additional context
While the build processes are not failing on F34 and F35, one test pertaining to kinetics is failing on both.
I suspect that these problems are related.
The test failures look like excessive precision being requested - or is this a truncation/rounding error?
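To illustrate the rounding side of that question: floating-point addition is not associative, so the same expression evaluated in a different order can differ in the last bit, and a compiler may also emit fused multiply-add instructions on some architectures, which changes intermediate rounding. A minimal Python sketch, with values chosen only for illustration and not taken from the failing test:

```python
# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)      # the two groupings do not compare equal
print(abs(left - right))  # the gap is on the order of 1e-16
```

A test that demands agreement to the last bit would fail on a difference of this size even though both results are equally valid roundings of the same quantity.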
Logs
ppc64le/Rawhide - failed build
ppc64le/F35 - failed test
ppc64le/F34 - failed test
aarch64/Rawhide
aarch64/F35
aarch64/F34
Additional information and build logs at: