Continuous integration and docker auto-build #70

Status: Open · Labels: enhancement (New feature or request)
zhucaoxiang opened this issue Jun 25, 2020 · 5 comments

zhucaoxiang (Collaborator) commented Jun 25, 2020

In the branch ci, I have implemented some initial tests for continuous integration (CI). Basically, each push and pull request will trigger an automatic build and regression test. This could be done with various tools; here, I am using GitHub Actions and Docker.

It seems that GitHub Actions cannot cache numerical packages. To avoid installing all the required packages on every run, I created a Docker image for compiling (docker pull zhucaoxiang/stellopt:compile). When triggered, the workflow compiles the latest code and runs the tests inside this image. An example log can be found here. It takes about 5 minutes to compile the code.
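For context, a compile image along these lines could be built from a Dockerfile roughly like the one below. This is only a sketch: the base image and package list are illustrative guesses, not the actual contents of zhucaoxiang/stellopt:compile.

# Hypothetical sketch of a compile image; the real zhucaoxiang/stellopt:compile
# may use a different base image, tool chain, and package set.
FROM ubuntu:20.04

# Non-interactive apt so the image build does not hang on prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Compilers plus the numerical libraries a STELLOPT build typically needs
# (MPI, BLAS/LAPACK/ScaLAPACK, NetCDF with Fortran bindings, HDF5).
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential gfortran make python3 \
        libopenmpi-dev openmpi-bin \
        libblas-dev liblapack-dev libscalapack-openmpi-dev \
        libnetcdf-dev libnetcdff-dev libhdf5-dev \
    && rm -rf /var/lib/apt/lists/*

The workflow that runs inside this image is: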

name: build-test
# Trigger on every push and pull request (see the discussion below).
on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    # Run all steps inside the pre-built compile image.
    container: zhucaoxiang/stellopt:compile
    env:
      MACHINE: docker
      STELLOPT_PATH: ${{ github.workspace }}
      # Open MPI refuses to run as root by default; the container runs as root.
      OMPI_ALLOW_RUN_AS_ROOT: 1
      OMPI_ALLOW_RUN_AS_ROOT_CONFIRM: 1
    steps:
      - name: checkout sources
        uses: actions/checkout@master
      - name: compile
        run: |
          cd ${STELLOPT_PATH}
          ./build_all 2>&1 | tee log.build
      - name: test
        run: |
          cd ${STELLOPT_PATH}/BENCHMARKS
          make test_vmec_QAS

Here are the things that I would like to discuss.

  • Are these the right tools for CI? Does anyone have other experience?
  • Decide the appropriate triggers. Should we run the test for every push on every branch? (A sketch of a restricted trigger follows this list.) Fortunately, we have unlimited build time.
  • Complete the regression test. Right now, I have only implemented one simple test comparing the QAS VMEC case. We should determine a list of regression tests that are representative and fast. We also have to update the check script.
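If we decide against testing every push on every branch, the trigger could be restricted with branch filters. A minimal sketch (the branch names are placeholders, not a decision):

# Hypothetical restricted trigger: build pushes only on the main branches,
# but still test every pull request.
on:
  push:
    branches: [master, develop]
  pull_request: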
zhucaoxiang (Collaborator, Author) commented

I got a side issue when comparing the VMEC outputs. Because of errors in using the dynamic library on docker (see the log), I am using a python script to compare the wout file with a reference one (see compare_VMEC.py). The reference wout file was produced on the PPPL cluster with CentOS 6 and GCC in the develop branch. When comparing it with the newly produced wout file in the ci branch, the comparison passed on the PPPL cluster, but failed at others.
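For reference, the comparison logic is roughly the following. This is a simplified sketch, not the actual compare_VMEC.py; it assumes the numpy and netCDF4 packages and checks the maximum relative difference of each numeric variable against a single tolerance.

# Simplified sketch of the wout comparison; the real compare_VMEC.py may
# normalize and report differences differently.
import numpy as np
from netCDF4 import Dataset

TOL = 1.0e-12

def compare_wout(new_file, ref_file, tol=TOL):
    new, ref = Dataset(new_file), Dataset(ref_file)
    match = True
    for name in ref.variables:
        var = ref.variables[name]
        if not np.issubdtype(var.dtype, np.number):
            continue  # skip string metadata such as version stamps
        a = np.asarray(var[...], dtype=float)
        b = np.asarray(new.variables[name][...], dtype=float)
        # Maximum relative difference, guarding against division by zero.
        diff = np.max(np.abs(a - b) / np.maximum(np.abs(a), 1.0))
        if diff > tol:
            print('UNMATCHED:', name, ', diff= {:.5E}'.format(diff))
            match = False
    return match

if __name__ == '__main__':
    assert compare_wout('wout_QAS.nc', 'wout_QAS_ref.nc'), \
        'Differences in some elements are larger than the tolerance.'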

Below is the comparison between the file produced on the Eddy cluster and the reference one.

Compare VMEC outputs in wout_QAS.nc and wout_QAS_ref.nc with tolerance  1.00000E-12
======================================
UNMATCHED: b0 , diff= 1.67533E-12
UNMATCHED: rbtor , diff= 1.18616E-12
UNMATCHED: ctor , diff= 2.18279E-09
UNMATCHED: raxis_cc , diff= 2.38362E-12
UNMATCHED: zaxis_cs , diff= 1.79086E-12
UNMATCHED: iotaf , diff= 2.08501E-10
UNMATCHED: q_factor , diff= 1.10495E-09
UNMATCHED: chi , diff= 1.19103E-11
UNMATCHED: chipf , diff= 1.03639E-10
UNMATCHED: jcuru , diff= 9.02457E-04
UNMATCHED: jcurv , diff= 1.23804E-07
UNMATCHED: iotas , diff= 2.08630E-10
UNMATCHED: beta_vol , diff= 1.31091E-12
UNMATCHED: bvco , diff= 5.70251E-12
UNMATCHED: vp , diff= 3.70526E-12
UNMATCHED: specw , diff= 2.00514E-08
UNMATCHED: over_r , diff= 1.30525E-11
UNMATCHED: jdotb , diff= 1.74535E-03
UNMATCHED: bdotb , diff= 1.32924E-10
UNMATCHED: bdotgradv , diff= 5.55480E-11
UNMATCHED: DMerc , diff= 1.07121E-06
UNMATCHED: DShear , diff= 3.17038E-09
UNMATCHED: DWell , diff= 1.33956E-08
UNMATCHED: DCurr , diff= 1.45721E-08
UNMATCHED: DGeod , diff= 1.06125E-06
UNMATCHED: equif , diff= 6.51166E-08
UNMATCHED: rmnc , diff= 3.89256E-10
UNMATCHED: zmns , diff= 4.38073E-10
UNMATCHED: lmns , diff= 4.76656E-09
UNMATCHED: gmnc , diff= 1.37951E-09
UNMATCHED: bmnc , diff= 4.92215E-10
UNMATCHED: bsubumnc , diff= 5.46979E-10
UNMATCHED: bsubvmnc , diff= 6.05400E-10
UNMATCHED: bsubsmns , diff= 1.54010E-09
UNMATCHED: currumnc , diff= 1.31922E-02
UNMATCHED: currvmnc , diff= 1.80030E-03
UNMATCHED: bsupumnc , diff= 8.59530E-09
UNMATCHED: bsupvmnc , diff= 5.19838E-10
  comparison passed: False 
======================================
Traceback (most recent call last):
  File "./compare_VMEC.py", line 41, in <module>
    assert match, 'Differences in some elements are larger than the tolerence.'
AssertionError: Differences in some elements are larger than the tolerence.

The jcuru, jdotb, currumnc, and currvmnc terms show significant differences. A possible reason is the difference in platform and/or compilers, but I am still surprised to see the numbers differ so much across platforms. @lazersos Any thoughts?

This actually complicates the regression testing.

krystophny (Collaborator) commented

@zhucaoxiang this is very interesting. From my (also not large) experience with CI testing, your approach looks fine at first sight. GitHub Actions is the new standard way of doing this; we also use it for pyccel and other projects now, and we plan to convert our older Travis CI builds to it.

About the results: one idea would be to switch off all optimization and see if there are still differences. Compiler-optimized SIMD floating-point code introduces differences in the last digits that may be amplified if some computation is numerically unstable. Still, the differences look a bit large; in the end this may point to a bug in the code.
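With gfortran, for example, a comparison build could switch off optimization and value-changing floating-point transformations with flags along these lines (a sketch; where these flags go in STELLOPT's makefiles is left open):

# Disable optimization and fused multiply-add contraction, which is a common
# source of last-digit differences between platforms.
FFLAGS = -O0 -ffp-contract=off -fno-fast-math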

landreman (Collaborator) commented Jun 25, 2020

It is a great idea to set up CI for stellopt - thanks for working on this!

About your remark that "GitHub Actions cannot cache numerical packages": one thing that may work for you here is to use apt-get to install pre-built packages. An example can be seen here, and the list of available packages can be found here. Using apt-get may make it easier to add packages (no need to rebuild a container), but very new versions of some software may not be available.
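Such a step might look roughly like the following in a workflow running directly on ubuntu-latest instead of a custom container (a sketch; the package list is an illustrative guess at STELLOPT's dependencies):

# Hypothetical install step using Ubuntu's pre-built packages.
- name: install dependencies
  run: |
    sudo apt-get update
    sudo apt-get install -y gfortran openmpi-bin libopenmpi-dev \
      libblas-dev liblapack-dev libscalapack-openmpi-dev \
      libnetcdf-dev libnetcdff-dev libhdf5-dev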

Back to the VMEC issue: another reason the J-related quantities may have larger errors is that J is a derivative of B, and numerical derivatives magnify noise. For some combination of this reason and the 1/mu0 factor, when VMEC is run for vacuum fields it often gives jdotb ~ O(1) in SI units. This value is small compared to typical tokamak currents (MA), so it is not necessarily a bug. Since this jdotb ~ 1 is 100% numerical error, I am not too surprised or worried by a 1e-3 difference in jdotb between different computers.

zhucaoxiang (Collaborator, Author) commented

@landreman Thanks for the replies. I think you are right about why the J terms show such a significant difference. I am trying to compare against a debug version built with the -O0 option, as @krystophny suggested, but am stuck on some bugs. I guess we can skip the J terms or set a different tolerance for them (see the sketch below).
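One way to do the latter in the comparison script would be a per-variable tolerance table, for example (a sketch building on the hypothetical comparison code above; the values are placeholders, not agreed-upon tolerances):

# Hypothetical per-variable tolerances: loosen the noisy current-related
# quantities, keep the tight default for everything else.
DEFAULT_TOL = 1.0e-12
SPECIAL_TOL = {
    'jcuru': 1.0e-2, 'jcurv': 1.0e-2, 'jdotb': 1.0e-2,
    'currumnc': 1.0e-1, 'currvmnc': 1.0e-1,
}

def tolerance(name):
    return SPECIAL_TOL.get(name, DEFAULT_TOL)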

As for using GitHub Actions to install the numerical packages: I did this for FOCUS. It works, but my concern is that STELLOPT needs so many libraries that installing them all would probably take over 10 minutes each time, and as far as I know there is no easy way to use cached packages. Building a container actually takes less than 1 minute. We could implement both at the testing stage.
