Builds/Tests now failing on aarch64, ppc64le architectures (Fedora 36=Rawhide) #1174

mefuller · 2022-01-18T11:40:09Z

Problem description

All builds on the ppc64le architecture with F36/Rawhide now fail. This was not the case four days ago (see https://copr.fedorainfracloud.org/coprs/fuller/Cantera/builds/).

Concurrently, the kinetics test KineticsAddSpecies3.add_species_sequential now fails on "successful" builds on F34/35 for ppc64le (and also for aarch64, i686 and s390x on all three of Fedora 34, 35, and 36/Rawhide, but not x86_64). This may not be a new problem, as I only just added test runs to the build automation for all architectures. (Yes, I know, I should have done that earlier.)

Steps to reproduce

scons build && scons test

Behavior

Error message:
Build failure (ppc64le:F36):

In file included from /usr/include/eigen3/Eigen/Core:210,
                 from /usr/include/eigen3/Eigen/SparseCore:11,
                 from /usr/include/eigen3/Eigen/Sparse:26,
                 from include/cantera/numerics/eigen_sparse.h:10,
                 from include/cantera/kinetics/StoichManager.h:11,
                 from include/cantera/kinetics/Kinetics.h:14,
                 from src/base/Solution.cpp:12:
/usr/include/eigen3/Eigen/src/Core/arch/AltiVec/PacketMath.h:78:8: internal compiler error: Segmentation fault
   78 | static _EIGEN_DECLARE_CONST_FAST_Packet4i(ZERO, 0); //{ 0, 0, 0, 0,}
      |        ^

Test failure (all other affected systems):

[----------] 2 tests from KineticsAddSpecies3
[ RUN      ] KineticsAddSpecies3.add_species_sequential
test/kinetics/kineticsFromScratch3.cpp:371: Failure
Expected equality of these values:
  k_ref[i]
    Which is: 591054161.41004908
  k[i]
    Which is: 591054161.41004813
i = 0; N = 4
test/kinetics/kineticsFromScratch3.cpp:377: Failure
Expected equality of these values:
  k_ref[i]
    Which is: 150.5822178080069
  k[i]
    Which is: 150.58221780800667
i = 0; N = 4
test/kinetics/kineticsFromScratch3.cpp:384: Failure
Expected equality of these values:
  w_ref[iref]
    Which is: 150.58221780800866
  w[i]
    Which is: 150.58221780800844
sp = O; N = 4
[  FAILED  ] KineticsAddSpecies3.add_species_sequential (5 ms)

System information

Cantera version: fcff592
OS: Fedora Linux 36 (Rawhide), Fedora 35, Fedora 34
Python/MATLAB/other software versions: Python 3.9, 3.10 (see logs)

Attachments

Additional context
While the builds themselves do not fail on F34 and F35, one kinetics test fails on both.
I suspect that these problems are related.
The test failures look like excessive precision being requested; or is this a truncation/rounding error?

Logs
ppc64le/Rawhide - failed build
ppc64le/F35 - failed test
ppc64le/F34 - failed test

aarch64/Rawhide
aarch64/F35
aarch64/F34

Additional information and build logs at:

ischoegl · 2022-01-18T12:34:35Z

Hi @mefuller … thanks for reporting. One thing that would help narrow down the offending commit is knowing when the last successful build was triggered (ideally, at which commit hash).

mefuller · 2022-01-18T12:59:33Z

Four days ago we were in good shape: https://koji.fedoraproject.org/koji/tasks?owner=fuller&state=all and https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3163517/

I'm looking, but not finding a corresponding commit hash.
The good news is, the builds on COPR pull the main branch of the official Cantera repo at the time they are run.

ischoegl · 2022-01-18T13:15:35Z

Ok. #1089 is the likely culprit for this then (sigh). It was merged 4 days ago and the last build likely passed just hours before that merge.

mefuller · 2022-01-18T13:35:05Z

If it helps, I would be willing to work with you and @bryanwweber (and anyone else) on setting up automated builds for testing with Fedora/EL and multiple architectures - I believe I can provide you with URLs to add as webhooks to trigger builds when you push to main (I have not tested this yet).

ischoegl · 2022-01-18T13:57:30Z

Regarding the build failure, it almost looks like this is due to some upstream issue, as it is triggered for an #include statement. You're using system Eigen (3.4.0), whereas the last successful build used 3.3.9. So I am not sure that this has to do with recent changes in Cantera. (#1089 heavily relies on Eigen's sparse matrices, which made me think of this for a moment.)

Regarding the other issues, these happen to be in a part unaffected by recent changes and mainly look like issues related to machine precision. Still curious that this happens all of a sudden.
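For context, here is a minimal Python sketch of how the same arithmetic can differ in the last bit or so when the order of operations changes. This is one plausible source of architecture-dependent results, though nothing in this thread confirms it is the cause here. The numbers are illustrative and are not taken from Cantera.

```python
import math

# Floating-point addition is not associative: regrouping the same terms
# changes where rounding happens. The numbers are illustrative only.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)                              # exact comparison of the two groupings
print(abs(left - right))                          # size of the discrepancy
print(math.isclose(left, right, rel_tol=1e-12))   # comparison with a relative tolerance
```

An exact-equality check distinguishes the two groupings, while a tolerance-based comparison treats them as the same value.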

mefuller · 2022-01-18T14:02:32Z

Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present. Would it be acceptable to modify the tests such that there's more leniency?
I'd like to retain the current structure where my builds are marked as failed if the tests fail.

ischoegl · 2022-01-18T14:08:30Z

> Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present.

That would explain this!

> Would it be acceptable to modify the tests such that there's more leniency?

I think changing offending lines to ASSERT_NEAR may be appropriate in this case.

speth · 2022-01-18T14:31:21Z

A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.

I agree that changing those failing comparisons to ASSERT_NEAR would probably be fine, although I would keep the tolerance fairly tight, as the differences should just be the result of a little bit of accumulated rounding error.
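To put numbers on "fairly tight", the sketch below computes the relative differences for the three value pairs printed in the failing test output above. It is written in Python for convenience. `REL_TOL` is a hypothetical value chosen for illustration, not the tolerance adopted in Cantera. Note that googletest's ASSERT_NEAR takes an absolute error bound, so a relative tolerance has to be scaled by the magnitude of the reference value.

```python
import math

# (reference, computed) pairs copied from the failing test output.
pairs = [
    (591054161.41004908, 591054161.41004813),
    (150.5822178080069, 150.58221780800667),
    (150.58221780800866, 150.58221780800844),
]

REL_TOL = 1e-13  # hypothetical tolerance, for illustration only


def relative_difference(ref, value):
    """Return |ref - value| scaled by |ref|."""
    return abs(ref - value) / abs(ref)


for ref, value in pairs:
    rel = relative_difference(ref, value)
    # Absolute bound equivalent to REL_TOL at this magnitude,
    # which is the form an absolute-error assertion would need.
    abs_bound = REL_TOL * abs(ref)
    ok = math.isclose(ref, value, rel_tol=REL_TOL)
    print(f"relative difference {rel:.2e}; absolute bound {abs_bound:.2e}; within tolerance: {ok}")
```

Because the two rate constants differ in magnitude by more than six orders, a single absolute bound would be either too loose for the small value or too strict for the large one, which is why the bound is scaled per value.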

bryanwweber · 2022-01-18T14:40:55Z

@mefuller Thanks for volunteering! I'd been thinking about how to add a Fedora job to our CI here on GitHub Actions. You can specify a container in which the job should run, so I think it should be possible to add a job that pulls a Fedora container from Quay and runs the build and tests inside that. I've been working on other things lately, but it's on my to-do list. If you want to try to figure it out, you can edit .github/workflows/main.yml to add a new job. Thanks!

mefuller · 2022-01-18T15:31:32Z

@bryanwweber I will definitely take a look.
It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable.

speth · 2022-01-18T15:48:54Z

> It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable.

Given that the failures that are being identified here are related to architectures other than x86_64, I wonder if the most useful thing would actually be to trigger these builds elsewhere -- I don't think GitHub Actions currently provides runners on architectures other than x86_64.

mefuller · 2022-01-18T16:04:07Z

ok, I took care of the original test errors and now have a few more to deal with - I'll work on a larger PR aimed at getting things working across architectures

bryanwweber · 2022-01-18T17:09:04Z

> I don't think GitHub Actions currently provides runners on architectures other than x86_64.

This is true, but it can use emulated architectures, as is done for the PyPI packages. That said, if COPR provides the resources, it'd probably be worth having architectures other than x86_64 running over there. I still think it'd be worth having a Fedora build on our GH actions here though.

mefuller · 2022-01-18T18:36:31Z

I need to ask for a bit more help:
On s390x architecture, I get the following block of errors:

----------------------------- Captured stdout call -----------------------------
Solution saved to file /builddir/build/BUILD/cantera-reduce_precision/test/work/python/impingingjet1.yaml as solution 'solution'.
- generated xml file: /builddir/build/BUILD/cantera-reduce_precision/test/work/pytest.xml -
=========================== short test summary info ============================
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:246: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:316: pandas is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:327: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:387: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:373: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:562: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:513: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:521: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:517: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_kinetics.py:142: scipy is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:791: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:1350: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_reaction.py:346: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_reactor.py:1490: Integration of sensitivity ODEs is unreliable
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_lean
  reason: 
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_stoichiometric
  reason: 
XFAIL ../../build/python/cantera/test/test_equilibrium.py::EquilExtraElements::test_element_potential
  reason: 
XFAIL ../../build/python/cantera/test/test_mixture.py::TestMixture::test_equilibrate2
  reason: 
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_load_thermo_models
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_restore_thermo_models
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits1
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits2
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_simple
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_surface
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_extra
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_sri_zero
= 6 failed, 1384 passed, 14 skipped, 4 xfailed, 11 warnings, 2 errors in 66.66s (0:01:06) =

e.g. https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-35-s390x/03195208-cantera/builder-live.log.gz

I don't see any useful output. Am I looking in the wrong place and/or are there options I should pass to the tests to get more out?

ischoegl · 2022-01-18T18:39:34Z

@mefuller ... could you run tests with the SCons flag verbose_tests=y?

ischoegl · 2022-01-18T18:45:49Z

On second look, the existing log already points to

>   ???
E   ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E     in "/builddir/build/BUILD/cantera-reduce_precision/test/work/python/gri30_extra-from-ck.yaml", position 16384

ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError

meaning that the generated output contains some problematic characters. Tracking this down would likely involve extracting gri30_extra-from-ck.yaml from your build environment.

mefuller · 2022-01-18T18:54:23Z

@ischoegl thanks - I feel pretty dumb now for not seeing all that output above where I was looking.
I'll see what I can do.

ischoegl · 2022-01-18T19:02:51Z

No worries. Fwiw, I just retracted a PR as I didn't realize that the change would force a complete rebuild of Cantera after each commit 😢 ... hindsight (sigh)

mefuller · 2022-01-18T19:31:04Z

I guess today's a move fast and break things kind of day.

I ran the verbose tests: https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-rawhide-s390x/03196717-cantera/builder-live.log.gz (just in case anyone else wants to take a peek)

speth · 2022-01-18T20:05:08Z

This error:

_____________ ERROR at setup of TestModels.test_load_thermo_models _____________

cls = <class 'cantera.test.test_composite.TestModels'>

    @classmethod
    def setUpClass(cls):
        utilities.CanteraTest.setUpClass()
        cls.yml_file = cls.test_data_path / "thermo-models.yaml"
>       cls.yml = utilities.load_yaml(cls.yml_file)

../../build/python/cantera/test/test_composite.py:18: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../build/python/cantera/test/utilities.py:35: in load_yaml
    return yaml_.load(stream)
/usr/lib/python3.10/site-packages/ruamel/yaml/main.py:341: in load
    return constructor.get_single_data()
/usr/lib/python3.10/site-packages/ruamel/yaml/constructor.py:111: in get_single_data
    node = self.composer.get_single_node()
ruamel.yaml.clib/_ruamel_yaml.pyx:701: in _ruamel_yaml.CParser.get_single_node
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E     in "/builddir/build/BUILD/cantera-reduce_precision/build/python/cantera/test/data/thermo-models.yaml", position 16384

ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError

just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.
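A statement like this can be checked directly on any copy of the file. The following is a minimal Python sketch, not part of Cantera's test suite; the helper name `find_suspect_bytes` is hypothetical. It reports the offsets of NUL bytes and other control characters, leaving out tab, line feed and carriage return.

```python
from pathlib import Path

# Control bytes that are normal in text files: tab, line feed, carriage return.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def find_suspect_bytes(path):
    """Return (offset, byte) pairs for control bytes that should not occur in YAML text."""
    data = Path(path).read_bytes()
    return [
        (offset, byte)
        for offset, byte in enumerate(data)
        if (byte < 0x20 and byte not in ALLOWED_CONTROL) or byte == 0x7F
    ]
```

An empty result for the copy of thermo-models.yaml inside the build environment would suggest the file on disk is intact and the NUL character arises while the parser reads it.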

mefuller · 2022-01-19T12:07:45Z

> A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.

I have filed a bug report with Eigen at https://gitlab.com/libeigen/eigen/-/issues/2422

mefuller · 2022-01-19T12:22:14Z

> just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.

I opened a ticket regarding this issue: https://sourceforge.net/p/ruamel-yaml/tickets/417/

AvdN · 2022-01-19T13:35:32Z

@mefuller If an issue, it is in ruamel-yaml-clib. The 16384 (2^14) is the input buffer size ( https://sourceforge.net/p/ruamel-yaml-clib/code/ci/default/tree/yaml_private.h#l57 ) so maybe this is some issue reading past the buffer only showing up on s390x.

I assume you compile ruamel.yaml.clib yourself (as I don't provide any wheels for that architecture), so maybe you can patch a larger number in there.
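One inexpensive way to probe this hypothesis before patching is to check whether every failing input exceeds the buffer size, since an error at position 16384 would not be expected for a shorter file. A minimal Python sketch follows; `BUFFER_SIZE` is taken from the position in the error message, and the helper name `files_larger_than_buffer` is hypothetical.

```python
from pathlib import Path

BUFFER_SIZE = 2 ** 14  # 16384, the position reported in the ReaderError


def files_larger_than_buffer(directory, pattern="*.yaml", size=BUFFER_SIZE):
    """Return sorted names of files matching `pattern` whose size exceeds `size` bytes."""
    return sorted(
        p.name
        for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_size > size
    )
```

If the files named in the failing tests all appear in this list while smaller YAML files load without errors, that would be consistent with a problem at the buffer boundary.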

mefuller · 2022-01-19T13:56:04Z

@AvdN I've opened a bug report to have the buffer patch tested: https://bugzilla.redhat.com/show_bug.cgi?id=2042422

mefuller · 2022-01-19T14:11:24Z

A Red Hat ticket has also been opened regarding the Eigen / ppc64le build failure: https://bugzilla.redhat.com/show_bug.cgi?id=2042432

speth · 2022-01-19T14:45:18Z

I think that last bug should be filed against GCC, not Eigen - an error in Eigen should at worst result in the compiler reporting an error of some sort, not segfaulting.

ischoegl mentioned this issue Jan 18, 2022: Include explicit git commit hash in config.h #1175 (Closed)

mefuller mentioned this issue Jan 19, 2022: Reduce precision in select tests to pass on alternate architectures #1178 (Merged)

speth closed this as completed in #1178 Jan 19, 2022
