Builds/Tests now failing on aarch64, ppc64le architectures (Fedora 36=Rawhide) #1174

Closed
mefuller opened this issue Jan 18, 2022 · 26 comments · Fixed by #1178

Comments

@mefuller
Contributor

Problem description

All builds on the ppc64le architecture with F36/Rawhide now fail. This was not the case four days ago (see https://copr.fedorainfracloud.org/coprs/fuller/Cantera/builds/).

Concurrently, the kinetics: KineticsAddSpecies3.add_species_sequential test now fails on "successful" builds on F34/35 for ppc64le architectures (and also for aarch64, i686 and s390x on all three Fedoras 34, 35, and 36/Rawhide, but not x86_64). This may not be a new problem as I only just added testing automation to the build automation for all architectures. (Yes, I know, I should have done that earlier)

Steps to reproduce

scons build && scons test

Behavior

Error message:
Build failure (ppc64le:F36):

In file included from /usr/include/eigen3/Eigen/Core:210,
                 from /usr/include/eigen3/Eigen/SparseCore:11,
                 from /usr/include/eigen3/Eigen/Sparse:26,
                 from include/cantera/numerics/eigen_sparse.h:10,
                 from include/cantera/kinetics/StoichManager.h:11,
                 from include/cantera/kinetics/Kinetics.h:14,
                 from src/base/Solution.cpp:12:
/usr/include/eigen3/Eigen/src/Core/arch/AltiVec/PacketMath.h:78:8: internal compiler error: Segmentation fault
   78 | static _EIGEN_DECLARE_CONST_FAST_Packet4i(ZERO, 0); //{ 0, 0, 0, 0,}
      |        ^

Test failure (all other affected systems):

[----------] 2 tests from KineticsAddSpecies3
[ RUN      ] KineticsAddSpecies3.add_species_sequential
test/kinetics/kineticsFromScratch3.cpp:371: Failure
Expected equality of these values:
  k_ref[i]
    Which is: 591054161.41004908
  k[i]
    Which is: 591054161.41004813
i = 0; N = 4
test/kinetics/kineticsFromScratch3.cpp:377: Failure
Expected equality of these values:
  k_ref[i]
    Which is: 150.5822178080069
  k[i]
    Which is: 150.58221780800667
i = 0; N = 4
test/kinetics/kineticsFromScratch3.cpp:384: Failure
Expected equality of these values:
  w_ref[iref]
    Which is: 150.58221780800866
  w[i]
    Which is: 150.58221780800844
sp = O; N = 4
[  FAILED  ] KineticsAddSpecies3.add_species_sequential (5 ms)

System information

  • Cantera version: fcff592
  • OS: Fedora Linux 36 (Rawhide), Fedora 35, Fedora 34
  • Python/MATLAB/other software versions: Python 3.9, 3.10 (see logs)

Attachments

Additional context
While the build processes are not failing on F34 and F35, one kinetics test fails on both.
I suspect that these problems are related.
The test failures look like excessive precision being requested, or is this a truncation/rounding error?

Logs
ppc64le/Rawhide - failed build
ppc64le/F35 - failed test
ppc64le/F34 - failed test

aarch64/Rawhide
aarch64/F35
aarch64/F34

Additional information and build logs at:

  1. https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3192999/
  2. https://koji.fedoraproject.org/koji/taskinfo?taskID=81397194
@ischoegl
Member

Hi @mefuller … thanks for reporting. To narrow down the offending commit, it would help to know when the last successful build was triggered (ideally, at which commit hash).

@mefuller
Contributor Author

Four days ago we were in good shape: https://koji.fedoraproject.org/koji/tasks?owner=fuller&state=all and https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3163517/

I'm looking, but not finding a corresponding commit hash.
The good news is, the builds on COPR pull the main branch of the official Cantera repo at the time they are run.

@ischoegl
Member

Ok. #1089 is the likely culprit for this then (sigh). It was merged 4 days ago and the last build likely passed just hours before that merge.

@mefuller
Contributor Author

If it helps, I would be willing to work with you and @bryanwweber (and anyone else) on setting up automated builds for testing with Fedora/EL and multiple architectures. I believe I can provide you with URLs to add as webhooks to trigger builds when you push to main (I have not tested this yet).

@ischoegl
Member

ischoegl commented Jan 18, 2022

Regarding the build failure, it almost looks like this is due to some upstream issue, as it is triggered for an #include statement. You're using system Eigen (3.4.0), whereas the last successful build used 3.3.9. So I am not sure that this has to do with recent changes in Cantera. (#1089 heavily relies on Eigen's sparse matrices, which made me think of this for a moment.)

Regarding the other issues, these happen to be in a part unaffected by recent changes and mainly look like issues related to machine precision. Still curious that this happens all of a sudden.

@mefuller
Contributor Author

Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present. Would it be acceptable to modify the tests such that there's more leniency?
I'd like to retain the current structure where my builds are marked as failed if the tests fail.

@ischoegl
Member

ischoegl commented Jan 18, 2022

> Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present.

That would explain this!

> Would it be acceptable to modify the tests such that there's more leniency?

I think changing offending lines to ASSERT_NEAR may be appropriate in this case.

@speth
Member

speth commented Jan 18, 2022

A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.

I agree that changing those failing comparisons to ASSERT_NEAR would probably be fine, although I would keep the tolerance fairly tight, as the differences should just be the result of a little bit of accumulated rounding error.
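
To give a sense of how small the reported differences are, here is a minimal Python sketch (not the GoogleTest change itself) that compares the value pairs from the failing log using a relative tolerance. The tolerance of `1e-12` and the helper name `relative_difference` are assumptions chosen for illustration; the tolerance used in an actual fix may differ.

```python
import math

# Value pairs copied from the failing KineticsAddSpecies3 log above.
PAIRS = [
    (591054161.41004908, 591054161.41004813),
    (150.5822178080069, 150.58221780800667),
    (150.58221780800866, 150.58221780800844),
]

# Hypothetical tolerance: loose enough for accumulated rounding error,
# tight enough that a real regression would still be caught.
REL_TOL = 1e-12


def relative_difference(ref: float, value: float) -> float:
    """Return |ref - value| scaled by the larger of the two magnitudes."""
    return abs(ref - value) / max(abs(ref), abs(value))


for ref, value in PAIRS:
    exact = ref == value
    near = math.isclose(ref, value, rel_tol=REL_TOL)
    print(f"exact={exact} near={near} rel_diff={relative_difference(ref, value):.1e}")
```

Note that GoogleTest's `ASSERT_NEAR` takes an absolute error bound rather than a relative one, so a bound used there would need to be scaled to the magnitude of the values being compared.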

@bryanwweber
Member

@mefuller Thanks for volunteering! I'd been thinking about how to add a Fedora job to our CI here on GitHub Actions. You can specify a container in which the job should run, so I think it should be possible to add a job that pulls a Fedora container from Quay and runs the build and tests inside that. I've been working on other things lately, but it's on my to-do list. If you want to try to figure it out, you can edit .github/workflows/main.yml to add a new job. Thanks!

@mefuller
Contributor Author

@bryanwweber I will definitely take a look.
It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable.

@speth
Member

speth commented Jan 18, 2022

> It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable

Given that the failures being identified here are related to architectures other than x86_64, I wonder if the most useful thing would actually be to trigger these builds elsewhere. I don't think GitHub Actions currently provides runners on architectures other than x86_64.

@mefuller
Contributor Author

ok, I took care of the original test errors and now have a few more to deal with - I'll work on a larger PR aimed at getting things working across architectures

@bryanwweber
Member

> I don't think GitHub Actions currently provides runners on architectures other than x86_64.

This is true, but it can use emulated architectures, as is done for the PyPI packages. That said, if COPR provides the resources, it'd probably be worth having architectures other than x86_64 running over there. I still think it'd be worth having a Fedora build on our GH actions here though.

@mefuller
Contributor Author

I need to ask for a bit more help:
On s390x architecture, I get the following block of errors:

----------------------------- Captured stdout call -----------------------------
Solution saved to file /builddir/build/BUILD/cantera-reduce_precision/test/work/python/impingingjet1.yaml as solution 'solution'.
- generated xml file: /builddir/build/BUILD/cantera-reduce_precision/test/work/pytest.xml -
=========================== short test summary info ============================
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:246: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:316: pandas is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:327: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:387: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:373: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:562: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:513: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:521: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:517: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_kinetics.py:142: scipy is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:791: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:1350: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_reaction.py:346: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_reactor.py:1490: Integration of sensitivity ODEs is unreliable
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_lean
  reason: 
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_stoichiometric
  reason: 
XFAIL ../../build/python/cantera/test/test_equilibrium.py::EquilExtraElements::test_element_potential
  reason: 
XFAIL ../../build/python/cantera/test/test_mixture.py::TestMixture::test_equilibrate2
  reason: 
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_load_thermo_models
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_restore_thermo_models
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits1
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits2
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_simple
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_surface
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_extra
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_sri_zero
= 6 failed, 1384 passed, 14 skipped, 4 xfailed, 11 warnings, 2 errors in 66.66s (0:01:06) =

e.g. https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-35-s390x/03195208-cantera/builder-live.log.gz

I don't see any useful output. Am I looking in the wrong place and/or are there options I should pass to the tests to get more out?

@ischoegl
Member

@mefuller ... could you run tests with the SCons flag verbose_tests=y?

@ischoegl
Member

On second look, the existing log already points to

>   ???
E   ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E     in "/builddir/build/BUILD/cantera-reduce_precision/test/work/python/gri30_extra-from-ck.yaml", position 16384

ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError

meaning that the generated output contains some problematic characters. Tracking this down would likely involve extracting gri30_extra-from-ck.yaml from your build environment.

@mefuller
Contributor Author

@ischoegl thanks - I feel pretty dumb now for not seeing all that output above where I was looking.
I'll see what I can do.

@ischoegl
Member

No worries. Fwiw, I just retracted a PR as I didn't realize that the change would force a complete rebuild of Cantera after each commit 😢 ... hindsight (sigh)

@mefuller
Contributor Author

I guess today's a move fast and break things kind of day.

I ran the verbose tests: https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-rawhide-s390x/03196717-cantera/builder-live.log.gz (just in case anyone else wants to take a peek)

@speth
Member

speth commented Jan 18, 2022

This error:

_____________ ERROR at setup of TestModels.test_load_thermo_models _____________

cls = <class 'cantera.test.test_composite.TestModels'>

    @classmethod
    def setUpClass(cls):
        utilities.CanteraTest.setUpClass()
        cls.yml_file = cls.test_data_path / "thermo-models.yaml"
>       cls.yml = utilities.load_yaml(cls.yml_file)

../../build/python/cantera/test/test_composite.py:18: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../build/python/cantera/test/utilities.py:35: in load_yaml
    return yaml_.load(stream)
/usr/lib/python3.10/site-packages/ruamel/yaml/main.py:341: in load
    return constructor.get_single_data()
/usr/lib/python3.10/site-packages/ruamel/yaml/constructor.py:111: in get_single_data
    node = self.composer.get_single_node()
ruamel.yaml.clib/_ruamel_yaml.pyx:701: in _ruamel_yaml.CParser.get_single_node
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E     in "/builddir/build/BUILD/cantera-reduce_precision/build/python/cantera/test/data/thermo-models.yaml", position 16384

ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError

just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.
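
One way to double-check a file pulled from the build environment is to scan it for the bytes the reader complains about. The following is a simplified sketch under stated assumptions: the function name `find_control_bytes` is hypothetical, and the check only covers bytes below 0x20 (it does not implement the full YAML printable-character rule).

```python
from pathlib import Path


def find_control_bytes(path: Path) -> list[tuple[int, int]]:
    """Return (offset, byte) pairs for control bytes found in the file.

    Tab, line feed and carriage return are treated as allowed; every
    other byte below 0x20, including NUL, is reported.
    """
    allowed = {0x09, 0x0A, 0x0D}
    data = path.read_bytes()
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in allowed]
```

Calling `find_control_bytes(Path("thermo-models.yaml"))` on a clean file should return an empty list. If the file on disk is clean but the parser still reports a NUL at a fixed position, that points at the reader rather than the data.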

@mefuller
Contributor Author

> A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.

I have filed a bug report with eigen at https://gitlab.com/libeigen/eigen/-/issues/2422

@mefuller
Contributor Author

> just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.

I opened a ticket regarding this issue: https://sourceforge.net/p/ruamel-yaml/tickets/417/

@AvdN

AvdN commented Jan 19, 2022

@mefuller If this is an issue, it is in ruamel-yaml-clib. The 16384 (2^14) is the input buffer size ( https://sourceforge.net/p/ruamel-yaml-clib/code/ci/default/tree/yaml_private.h#l57 ), so maybe this is some issue with reading past the buffer that only shows up on s390x.

I assume you compile ruamel.yaml.clib yourself (as I don't provide any wheels for that architecture), so maybe you can patch a larger number in there.
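
The link between the reported position and the buffer size can be checked with a little arithmetic. This is only an illustrative sketch: it assumes the reported position is a 0-based byte offset and that input is consumed in fixed-size chunks, and the helper name `chunk_and_offset` is hypothetical.

```python
# Input buffer size cited above for ruamel-yaml-clib (2^14 bytes).
BUFFER_SIZE = 2 ** 14


def chunk_and_offset(position: int, buffer_size: int = BUFFER_SIZE) -> tuple[int, int]:
    """Split a 0-based byte position into (chunk index, offset within chunk)."""
    return divmod(position, buffer_size)


# Position reported by the ReaderError in the logs above.
chunk, offset = chunk_and_offset(16384)
print(chunk, offset)  # -> 1 0
```

Under those assumptions, the failing position is the very first byte after one full buffer, which is consistent with a problem at the buffer boundary rather than in the file contents.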

@mefuller
Contributor Author

@AvdN I've opened a bug report to have the buffer patch tested: https://bugzilla.redhat.com/show_bug.cgi?id=2042422

@mefuller
Contributor Author

A Red Hat ticket has also been opened regarding the Eigen / ppc64le build failure: https://bugzilla.redhat.com/show_bug.cgi?id=2042432

@speth
Member

speth commented Jan 19, 2022

I think that last bug should be filed against GCC, not Eigen - an error in Eigen should at worst result in the compiler reporting an error of some sort, not segfaulting.
