Skip to content

Conversation

hyeoksu-lee
Copy link
Contributor

@hyeoksu-lee hyeoksu-lee commented Oct 6, 2025

User description

Description

This PR refactors time_stepper routines by integrating s_1st_order_tvd_rk, s_2nd_order_tvd_rk, s_3rd_order_tvd_rk, s_strang_splitting into a single subroutine s_tvd_rk.

Type of change

  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

test suite passed on MacBook M4 Pro, NCSA Delta GPU (A100), Frontier GPU

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensure the new features performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that they can be identified in profiles
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Enhancement


Description

  • Consolidate four separate TVD RK time-stepping subroutines into single unified s_tvd_rk

  • Replace conditional calls with unified interface accepting nstage parameter

  • Add Runge-Kutta coefficient matrix for different time-stepping orders

  • Integrate adaptive time-stepping for bubble dynamics into unified routine


Diagram Walkthrough

flowchart LR
  A["Four separate TVD RK subroutines"] --> B["Single s_tvd_rk subroutine"]
  C["Conditional time_stepper calls"] --> D["Unified interface with nstage parameter"]
  E["RK coefficient matrix"] --> B
  F["Adaptive dt integration"] --> B
Loading

File Walkthrough

Relevant files
Enhancement
m_start_up.fpp
Simplify time-stepper invocation logic                                     

src/simulation/m_start_up.fpp

  • Replace conditional calls to separate TVD RK subroutines with single
    s_tvd_rk call
  • Pass time_stepper parameter to unified routine
+2/-8     
m_time_steppers.fpp
Consolidate TVD RK time-stepping implementations                 

src/simulation/m_time_steppers.fpp

  • Remove four separate TVD RK subroutines (s_1st_order_tvd_rk,
    s_2nd_order_tvd_rk, s_3rd_order_tvd_rk, s_strang_splitting)
  • Add unified s_tvd_rk subroutine with stage loop and RK coefficient
    matrix
  • Initialize RK coefficients for different time-stepping orders in setup
  • Integrate adaptive time-stepping directly into unified routine
+99/-673

@hyeoksu-lee hyeoksu-lee requested a review from a team as a code owner October 6, 2025 21:01
Copy link

qodo-merge-pro bot commented Oct 6, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The previous logic used Strang splitting when time_stepper==3 and adap_dt==true; the new unified call only triggers for time_stepper in {1,2,3} without checking adap_dt, potentially skipping Strang splitting behavior entirely.

if (any(time_stepper == (/1, 2, 3/))) then
    call s_tvd_rk(t_step, time_avg, time_stepper)
end if
Coeff Array Bounds

rk_coef is allocated as (time_stepper,3) but indexed with s=1..nstage; if nstage>time_stepper (e.g., 2nd/3rd order), indexing rk_coef(s,*) may go out of bounds unless nstage==time_stepper is guaranteed.

@:ALLOCATE (rk_coef(time_stepper, 3))
if (time_stepper == 1) then
    rk_coef(1, 1) = 1._wp
    rk_coef(1, 2) = 0._wp
    rk_coef(1, 3) = 1._wp
else if (time_stepper == 2) then
    rk_coef(1, 1) = 1._wp
    rk_coef(1, 2) = 0._wp
    rk_coef(1, 3) = 1._wp
    rk_coef(2, 1) = 0.5_wp
    rk_coef(2, 2) = 0.5_wp
    rk_coef(2, 3) = 0.5_wp
else if (time_stepper == 3) then
    rk_coef(1, 1) = 1._wp
    rk_coef(1, 2) = 0._wp
    rk_coef(1, 3) = 1._wp
    rk_coef(2, 1) = 1._wp/4._wp
    rk_coef(2, 2) = 3._wp/4._wp
    rk_coef(2, 3) = 1._wp/4._wp
    rk_coef(3, 1) = 2._wp/3._wp
    rk_coef(3, 2) = 1._wp/3._wp
    rk_coef(3, 3) = 2._wp/3._wp
end if
State Storage Logic

The use of stor=1 for 1st order and 2 otherwise, with copying only when s==1 and time_stepper/=1, may not mirror prior GPU memory paths/dest handling; verify that q_cons_ts and pb/mv staging matches original semantics across compilers.

        if (any(time_stepper == (/1, 2, 3/))) then
            ! Time stepper index
            if (time_stepper == 1) then
                stor = 1
            else
                stor = 2
            end if

            @:ALLOCATE (rk_coef(time_stepper, 3))
            if (time_stepper == 1) then
                rk_coef(1, 1) = 1._wp
                rk_coef(1, 2) = 0._wp
                rk_coef(1, 3) = 1._wp
            else if (time_stepper == 2) then
                rk_coef(1, 1) = 1._wp
                rk_coef(1, 2) = 0._wp
                rk_coef(1, 3) = 1._wp
                rk_coef(2, 1) = 0.5_wp
                rk_coef(2, 2) = 0.5_wp
                rk_coef(2, 3) = 0.5_wp
            else if (time_stepper == 3) then
                rk_coef(1, 1) = 1._wp
                rk_coef(1, 2) = 0._wp
                rk_coef(1, 3) = 1._wp
                rk_coef(2, 1) = 1._wp/4._wp
                rk_coef(2, 2) = 3._wp/4._wp
                rk_coef(2, 3) = 1._wp/4._wp
                rk_coef(3, 1) = 2._wp/3._wp
                rk_coef(3, 2) = 1._wp/3._wp
                rk_coef(3, 3) = 2._wp/3._wp
            end if
        end if

    end subroutine s_initialize_time_steppers_module

    impure subroutine s_tvd_rk(t_step, time_avg, nstage)
#ifdef _CRAYFTN
        !DIR$ OPTIMIZE (-haggress)
#endif
        integer, intent(in) :: t_step
        real(wp), intent(inout) :: time_avg
        integer, intent(in) :: nstage

        integer :: i, j, k, l, q, s !< Generic loop iterator
        real(wp) :: start, finish
        integer :: dest

        call cpu_time(start)
        call nvtxStartRange("TIMESTEP")

        ! Adaptive dt: initial stage
        if (adap_dt) call s_adaptive_dt_bubble(1)

        do s = 1, nstage
            call s_compute_rhs(q_cons_ts(1)%vf, q_T_sf, q_prim_vf, bc_type, rhs_vf, pb_ts(1)%sf, rhs_pb, mv_ts(1)%sf, rhs_mv, t_step, time_avg, s)

            if (s == 1 .and. run_time_info) then
                if (igr) then
                    call s_write_run_time_information(q_cons_ts(1)%vf, t_step)
                else
                    call s_write_run_time_information(q_prim_vf, t_step)
                end if
            end if

            if (probe_wrt) then
                call s_time_step_cycling(t_step)
            end if

            if (cfl_dt) then
                if (mytime >= t_stop) return
            else
                if (t_step == t_step_stop) return
            end if

            if (bubbles_lagrange .and. .not. adap_dt) call s_update_lagrange_tdv_rk(stage=s)

            $:GPU_PARALLEL_LOOP(collapse=4)
            do i = 1, sys_size
                do l = 0, p
                    do k = 0, n
                        do j = 0, m
                            if (s == 1 .and. time_stepper /= 1) then
                                q_cons_ts(stor)%vf(i)%sf(j, k, l) = &
                                    q_cons_ts(1)%vf(i)%sf(j, k, l)
                            end if
                            q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                                rk_coef(s, 1)*q_cons_ts(1)%vf(i)%sf(j, k, l) &
                                + rk_coef(s, 2)*q_cons_ts(stor)%vf(i)%sf(j, k, l) &
                                + rk_coef(s, 3)*dt*rhs_vf(i)%sf(j, k, l)
                        end do
                    end do
                end do
            end do

            !Evolve pb and mv for non-polytropic qbmm
            if (qbmm .and. (.not. polytropic)) then
                $:GPU_PARALLEL_LOOP(collapse=5)
                do i = 1, nb
                    do l = 0, p
                        do k = 0, n
                            do j = 0, m
                                do q = 1, nnode
                                    if (s == 1 .and. time_stepper /= 1) then
                                        pb_ts(stor)%sf(j, k, l, q, i) = &
                                            pb_ts(1)%sf(j, k, l, q, i)
                                        mv_ts(stor)%sf(j, k, l, q, i) = &
                                            mv_ts(1)%sf(j, k, l, q, i)
                                    end if
                                    pb_ts(1)%sf(j, k, l, q, i) = &
                                        rk_coef(s, 1)*pb_ts(1)%sf(j, k, l, q, i) &
                                        + rk_coef(s, 2)*pb_ts(stor)%sf(j, k, l, q, i) &
                                        + rk_coef(s, 3)*dt*rhs_pb(j, k, l, q, i)
                                    mv_ts(1)%sf(j, k, l, q, i) = &
                                        rk_coef(s, 1)*mv_ts(1)%sf(j, k, l, q, i) &
                                        + rk_coef(s, 2)*mv_ts(stor)%sf(j, k, l, q, i) &
                                        + rk_coef(s, 3)*dt*rhs_mv(j, k, l, q, i)

@sbryngelson
Copy link
Member

I love a good red line PR! Will review soon

@hyeoksu-lee
Copy link
Contributor Author

Some of test suite failed (with the error close to tolerance, though), so set this PR as draft for now

@hyeoksu-lee hyeoksu-lee marked this pull request as draft October 6, 2025 21:25
@sbryngelson
Copy link
Member

small errors when refactoring the time steppers was also seen by @danieljvickers... so long as i'm convinced the code is right we might need a new goldenfile that satiates all the time steppers (or figure out the offending case and set looser tolerance).

@hyeoksu-lee hyeoksu-lee marked this pull request as ready for review October 6, 2025 21:29
Copy link

qodo-merge-pro bot commented Oct 6, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The unified call removes the previous branching that selected Strang splitting when time_stepper == 3 .and. adap_dt. Now it calls s_tvd_rk(time_stepper) only if 1..3, but there is no handling for adap_dt here, potentially changing behavior for adaptive-DT third-order runs.

if (any(time_stepper == (/1, 2, 3/))) then
    call s_tvd_rk(t_step, time_avg, time_stepper)
end if
Coeff Init

rk_coef is allocated with shape (time_stepper,3) each init, but accesses rk_coef(s, :) inside s_tvd_rk for s=1..nstage. Ensure nstage == time_stepper always; otherwise out-of-bounds. Consider allocating (nstage,3) or validating inputs.

@:ALLOCATE (rk_coef(time_stepper, 3))
if (time_stepper == 1) then
    rk_coef(1, 1) = 1._wp
    rk_coef(1, 2) = 0._wp
    rk_coef(1, 3) = 1._wp
else if (time_stepper == 2) then
    rk_coef(1, 1) = 1._wp
    rk_coef(1, 2) = 0._wp
    rk_coef(1, 3) = 1._wp
    rk_coef(2, 1) = 0.5_wp
    rk_coef(2, 2) = 0.5_wp
    rk_coef(2, 3) = 0.5_wp
else if (time_stepper == 3) then
    rk_coef(1, 1) = 1._wp
    rk_coef(1, 2) = 0._wp
    rk_coef(1, 3) = 1._wp
    rk_coef(2, 1) = 1._wp/4._wp
    rk_coef(2, 2) = 3._wp/4._wp
    rk_coef(2, 3) = 1._wp/4._wp
    rk_coef(3, 1) = 2._wp/3._wp
    rk_coef(3, 2) = 1._wp/3._wp
    rk_coef(3, 3) = 2._wp/3._wp
end if
State Storage

The storage index stor is set to 1 for 1st order and 2 for others, and stage-1 copies reference q_cons_ts(stor) and pb_ts(stor), mv_ts(stor). Verify pools are sized and initialized for stor==2; also confirm device annotations (GPU_DECLARE) include rk_coef and any needed buffers for consistent GPU behavior.

if (any(time_stepper == (/1, 2, 3/))) then
    ! Time stepper index
    if (time_stepper == 1) then
        stor = 1
    else
        stor = 2
    end if

    @:ALLOCATE (rk_coef(time_stepper, 3))
    if (time_stepper == 1) then
        rk_coef(1, 1) = 1._wp
        rk_coef(1, 2) = 0._wp
        rk_coef(1, 3) = 1._wp
    else if (time_stepper == 2) then
        rk_coef(1, 1) = 1._wp
        rk_coef(1, 2) = 0._wp
        rk_coef(1, 3) = 1._wp
        rk_coef(2, 1) = 0.5_wp
        rk_coef(2, 2) = 0.5_wp
        rk_coef(2, 3) = 0.5_wp
    else if (time_stepper == 3) then
        rk_coef(1, 1) = 1._wp
        rk_coef(1, 2) = 0._wp
        rk_coef(1, 3) = 1._wp
        rk_coef(2, 1) = 1._wp/4._wp
        rk_coef(2, 2) = 3._wp/4._wp
        rk_coef(2, 3) = 1._wp/4._wp
        rk_coef(3, 1) = 2._wp/3._wp
        rk_coef(3, 2) = 1._wp/3._wp
        rk_coef(3, 3) = 2._wp/3._wp

@hyeoksu-lee
Copy link
Contributor Author

Oh I see.. I put it back to ready for review. Please review the change and let me know if you have any question or concerns!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level Suggestion

To improve readability and maintainability, define the Runge-Kutta coefficients as named constants or parameters instead of hardcoding them in a large if-else block. [High-level, importance: 6]

Solution Walkthrough:

Before:

subroutine s_initialize_time_steppers_module
    ...
    if (any(time_stepper == (/1, 2, 3/))) then
        @:ALLOCATE (rk_coef(time_stepper, 3))
        if (time_stepper == 1) then
            rk_coef(1, 1) = 1._wp
            rk_coef(1, 2) = 0._wp
            rk_coef(1, 3) = 1._wp
        else if (time_stepper == 2) then
            rk_coef(1, 1) = 1._wp; rk_coef(1, 2) = 0._wp; rk_coef(1, 3) = 1._wp
            rk_coef(2, 1) = 0.5_wp; rk_coef(2, 2) = 0.5_wp; rk_coef(2, 3) = 0.5_wp
        else if (time_stepper == 3) then
            ...
        end if
    end if
end subroutine

After:

module m_time_steppers
    ...
    real(wp), parameter :: RK1_COEFS(1,3) = reshape((/ 1.0_wp, 0.0_wp, 1.0_wp /), (/1,3/))
    real(wp), parameter :: RK2_COEFS(2,3) = reshape((/ 1.0_wp, 0.0_wp, 1.0_wp, &
                                                      0.5_wp, 0.5_wp, 0.5_wp /), (/2,3/))
    ...
contains
    subroutine s_initialize_time_steppers_module
        ...
        if (time_stepper == 1) then
            rk_coef = RK1_COEFS
        else if (time_stepper == 2) then
            rk_coef = RK2_COEFS
        ...
        end if
    end subroutine

hyeoksu-lee and others added 2 commits October 6, 2025 14:50
Co-authored-by: qodo-merge-pro[bot] <151058649+qodo-merge-pro[bot]@users.noreply.github.com>
Co-authored-by: qodo-merge-pro[bot] <151058649+qodo-merge-pro[bot]@users.noreply.github.com>
@danieljvickers
Copy link
Member

Oh yeah, I tried to do something similar a few months ago, but kept getting weird compiler-dependent bugs that I couldn't reproduce locally. I like this PR. Looks like it is reducing a lot of the redundant code in the time_stepper. Seems odd that you are failing so many tests by so little. I would be interested to also see how far off of the golden files you are for the tests that pass, and see if they are similar.

@sbryngelson
Copy link
Member

@hyeoksu-lee status on this? was working i guess, but now everything is failing?

@hyeoksu-lee
Copy link
Contributor Author

@sbryngelson All failing CPU tests are due to error of the order of 1e-10, which is tolerance level. I am looking into the code to see if there is any systematic issue. And I am waiting for @wilfonba to review body force input for RK2.

In the meantime, I need to fix GPU bug, which seems not a simple tolerance-type issue.

@hyeoksu-lee
Copy link
Contributor Author

hyeoksu-lee commented Oct 10, 2025

@sbryngelson @danieljvickers I found the source of the slightly large error in failing cases.

The original time-stepper code was updating the conservative variables, for example, by

q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                            (q_cons_ts(1)%vf(i)%sf(j, k, l) &
                             + 2._wp*q_cons_ts(2)%vf(i)%sf(j, k, l) &
                             + 2._wp*dt*rhs_vf(i)%sf(j, k, l))/3._wp

Previously, I changed this to

q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                            rk_coef(3, 1)*q_cons_ts(1)%vf(i)%sf(j, k, l) &
                             + rk_coef(3, 2)*q_cons_ts(2)%vf(i)%sf(j, k, l) &
                             + rk_coef(3, 3)*dt*rhs_vf(i)%sf(j, k, l)

where rk_coef(3,1) = 1._wp/3._wp, rk_coef(3,2) = 2._wp/3._wp, and rk_coef(3,3) = 2._wp/3._wp. However, in this way, 2._wp/3._wp = 0.66666666666666663 (on my MacBook M4 Pro), so it introduces a very small, but critical, error that is not present in the original code. I don't think it is good for our code to be super sensitive to the order of arithmetic operations, but it is what's happening.

Accordingly, I changed the code to

q_cons_ts(1)%vf(i)%sf(j, k, l) = &
                            (rk_coef(3, 1)*q_cons_ts(1)%vf(i)%sf(j, k, l) &
                             + rk_coef(3, 2)*q_cons_ts(2)%vf(i)%sf(j, k, l) &
                             + rk_coef(3, 3)*dt*rhs_vf(i)%sf(j, k, l))/rk_coef(3,4)

where rk_coef(3,1) = 1._wp, rk_coef(3,2) = 2._wp, rk_coef(3,3) = 2._wp, and rk_coef(3,4) = 3._wp, so that the order of original operations are preserved.

@hyeoksu-lee
Copy link
Contributor Author

hyeoksu-lee commented Oct 10, 2025

@sbryngelson MacOS runners fail to build MFC due to cantera version compatibility. But I haven't changed anything about installations, and test suite passed on my MacBook M4 Pro. Do you have any idea?

ERROR: Ignored the following versions that require a different python version: 2.6.0.post1 Requires-Python >=3.7,<3.11; 3.1.0 Requires-Python >=3.8,<3.14; 3.1.0a4 Requires-Python >=3.8,<3.13; 3.1.0b1 Requires-Python >=3.8,<3.14
INFO: pip is looking at multiple versions of mfc to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement cantera==3.1.0 (from mfc) (from versions: 2.6.0a3, 2.6.0a4, 2.6.0b2, 2.6.0, 3.0.0b1, 3.0.0, 3.0.1)
ERROR: No matching distribution found for cantera==3.1.0
mfc: ERROR > (venv) Installation failed.
mfc: (venv) Exiting the Python virtual environment.

@danieljvickers
Copy link
Member

@hyeoksu-lee Look at this commit related to the cantera issue: 9dc62dd

@hyeoksu-lee
Copy link
Contributor Author

@danieljvickers Thanks! I updated test.yml as you did.

Copy link

codecov bot commented Oct 10, 2025

Codecov Report

❌ Patch coverage is 43.13725% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.82%. Comparing base (c86fdd9) to head (1787d0c).

Files with missing lines Patch % Lines
src/simulation/m_time_steppers.fpp 42.85% 19 Missing and 9 partials ⚠️
src/simulation/m_start_up.fpp 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1012      +/-   ##
==========================================
+ Coverage   41.75%   41.82%   +0.07%     
==========================================
  Files          70       70              
  Lines       20126    19921     -205     
  Branches     2504     2490      -14     
==========================================
- Hits         8403     8332      -71     
+ Misses      10180    10043     -137     
- Partials     1543     1546       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@wilfonba
Copy link
Collaborator

Is this verified to give the correct answer with unified memory?

@sbryngelson
Copy link
Member

We have no tests with unified memory...

@wilfonba
Copy link
Collaborator

wilfonba commented Oct 10, 2025

Well, anyone can run ./mfc.sh test --unified. It should pass on AMD. Probably won't pass on NVIDIA.

@hyeoksu-lee
Copy link
Contributor Author

@wilfonba Can I simply run ./mfc.sh test --unified on Frontier?

@wilfonba
Copy link
Collaborator

./mfc.sh test -j 8 --unified -- -c frontier after loading modules should do the trick. You'll need a compute node

@hyeoksu-lee
Copy link
Contributor Author

@wilfonba ./mfc.sh test -j 8 --unified -- -c frontier passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants