# CPU-based Nek5000 with in-situ image generation on different numbers of CPU cores
Using the integer selector labeled “Frequency”, users can specify how often the in-situ image generation should be executed. The integer range slider labeled “# of cores” allows users to define the range of total CPU cores used for evaluating both the synchronous in-situ approach and the asynchronous in-situ approach with their best-performing configurations.

The left plot displays the execution time of the synchronous and asynchronous in-situ approaches. The blue line represents the synchronous execution time, while the red crosses indicate the best performance of the asynchronous approach. For each red marker, the number of CPU cores allocated to Nek5000 in the asynchronous approach is annotated above it.

The right plot shows the parallel efficiency of both the Nek5000 simulation and the image generation. The purple line represents the parallel efficiency of Nek5000, and the green line represents the parallel efficiency of the image generation.
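As a point of reference for reading the right plot, the following is a minimal sketch of how a parallel efficiency value can be computed. It assumes the common strong-scaling definition relative to a reference run; the notebook does not state which definition `plot_sync_async` uses, and the function name and timings here are hypothetical.

```python
def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Strong-scaling efficiency of a run on n cores relative to a reference run.

    Assumed definition (not taken from the notebook):
        E = (t_ref * n_ref) / (t_n * n)
    E == 1.0 means perfect scaling; lower values mean cores are used less effectively.
    """
    if min(t_ref, n_ref, t_n, n) <= 0:
        raise ValueError("times and core counts must be positive")
    return (t_ref * n_ref) / (t_n * n)


# Made-up timings, for illustration only: doubling the cores from 72 to 144
# reduces the run time from 100 s to 60 s.
efficiency = parallel_efficiency(t_ref=100.0, n_ref=72, t_n=60.0, n=144)
print(f"parallel efficiency: {efficiency:.2%}")
```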

In [None]:
import ipywidgets as widgets
from ipywidgets import interact
from interactive_func import plot_sync_async
x_widget=widgets.IntRangeSlider(
    value=[864, 1728],
    min=0,
    max=2000,
    step=72,
    description='# of cores:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d',
)

fre_opt=widgets.BoundedIntText(
    value=2,
    min=1,
    max=100,
    step=1,
    description='Frequency: 1/',
    disabled=False
)
interact(plot_sync_async, fre=fre_opt, nodes=x_widget)

# CPU-based Nek5000 with image generation
The slider labeled "total_core" sets the total number of CPU cores, and the numerical text box labeled "Frequency" sets how often the in-situ image generation is executed. Four toggle buttons select the approach: "Synchronous", "Asynchronous", "Hybrid", or "Auto". Only the synchronous and asynchronous approaches apply to this case, so nothing is plotted when the "Hybrid" button is chosen. When the "Auto" button is chosen, the performance of the synchronous approach is compared with the best performance of the asynchronous approach, and the better of the two is plotted.

When in-situ image generation is executed every 50 simulation steps using 1728 Raven CPU cores, the synchronous approach is preferred because the total computation cost of image generation is much lower than that of the Nek5000 simulation. The total execution time (red crosses) in Quadrants I and II is the sum of the Nek5000 simulation time (purple dash-dot line) and the image generation time (green dash-dot line).

When in-situ image generation is executed every two steps and Nek5000 with in-situ image generation runs on 1728 Raven CPU cores, the asynchronous approach is preferred. The best configuration assigns 1326 CPU cores to Nek5000 and the remaining cores to in-situ image generation. The total execution time (red crosses) in Quadrants I and II is the sum of the image generation time (green dash-dot line) and the communication time between the Nek5000 cores and the image generation cores (yellow dash-dot line).
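The comparison behind the "Auto" button can be sketched as follows. This is a simplified illustration, not the model implemented in `interactive_single`: the timing functions are hypothetical placeholders, and the asynchronous elapsed time is assumed to be set by the slower of the two sides (simulation versus image generation plus communication).

```python
def sync_time(total_cores, sim_time, insitu_time):
    # Synchronous: simulation and in-situ task use all cores, one after the other.
    return sim_time(total_cores) + insitu_time(total_cores)


def best_async(total_cores, sim_time, insitu_time, comm_time, step=1):
    # Asynchronous: split the cores and try every split in increments of `step`.
    # Assumption: the slower side determines the elapsed time.
    best = None
    for app_cores in range(step, total_cores, step):
        insitu_cores = total_cores - app_cores
        elapsed = max(sim_time(app_cores), insitu_time(insitu_cores) + comm_time)
        if best is None or elapsed < best[1]:
            best = (app_cores, elapsed)
    return best


def auto_choice(total_cores, sim_time, insitu_time, comm_time, step=1):
    # Compare the synchronous time with the best asynchronous split.
    t_sync = sync_time(total_cores, sim_time, insitu_time)
    app_cores, t_async = best_async(total_cores, sim_time, insitu_time, comm_time, step)
    if t_sync <= t_async:
        return ("Synchronous", total_cores, t_sync)
    return ("Asynchronous", app_cores, t_async)


# Hypothetical timing models: the simulation scales ideally,
# while the in-situ task takes a fixed time regardless of core count.
method, app_cores, elapsed = auto_choice(
    total_cores=100,
    sim_time=lambda cores: 1000.0 / cores,
    insitu_time=lambda cores: 8.0,
    comm_time=0.1,
)
print(method, app_cores, round(elapsed, 2))
```

With measured or fitted timing curves in place of the toy lambdas, the same sweep yields the annotated core counts of the kind shown in the plots.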

In [None]:
import ipywidgets as widgets
from ipywidgets import interact
from interactive_single import interactive_insitu
log_button = widgets.ToggleButtons(
    options=['Synchronous', 'Asynchronous', 'Hybrid', 'Auto'],
    description='Approaches:',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltips=['sync', 'async', 'hybrid', 'auto'],
#     icons=['check'] * 3
)

y_widget = widgets.IntSlider(min=72, max=1728, step=72, value=864, description = "total_core")
x_widget = widgets.IntSlider(min=1, max=1728, step=1, value=864, description = " ")

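def update_x_range(*args):
    # Keep the "app_core" slider consistent with the selected approach.
    if log_button.value == "Asynchronous":
        # Leave at least one core for in-situ image generation.
        x_widget.max = y_widget.value - 1
        x_widget.description = "app_core"
        x_widget.value = y_widget.value // 2  # integer division keeps the value an int
    elif log_button.value == "Synchronous":
        # Raise the upper bound first so the new value is not clamped to a stale maximum.
        x_widget.max = y_widget.max
        x_widget.value = y_widget.value
        x_widget.description = " "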

y_widget.observe(update_x_range, 'value')
log_button.observe(update_x_range, 'value')
fre_opt=widgets.BoundedIntText(
    value=2,
    min=1,
    max=100,
    step=1,
    description='Frequency: 1/',
    disabled=False
)
interact(interactive_insitu, total_core=y_widget, app_core=x_widget, freq=fre_opt, method=log_button)

# GPU-accelerated NEKO with lossy data compression
We use the slider with the name "gpu" to choose the total number of GPUs and the slider with the name "cores/gpu" to choose the number of CPU cores sharing one GPU. The numerical text box with the name "Frequency" allows users to set how often the in-situ task is executed. The four toggle buttons select the "Synchronous", "Asynchronous", "Hybrid", or "Auto" approach. When the "Synchronous", "Asynchronous", or "Hybrid" button is chosen, the performance and efficiency predictions of that approach are plotted. When the "Auto" button is chosen, the synchronous or asynchronous approach is plotted after comparing the performance of the synchronous approach and the best performance of the asynchronous approach. In this case, only the synchronous and hybrid approaches are possible, so nothing is plotted when the "Asynchronous" button is chosen.

We predict the preferred in-situ approach for the GPU-accelerated NEKO with lossy and lossless data compression every ten simulation steps. 
As the case is rather small, we execute it on two Raven GPU nodes (eight GPUs in total). For each GPU, a different number of CPU cores can be chosen for the CPU parts of NEKO; for each core, a separate copy of the simulation data needs to be kept on the GPU. Because of these memory requirements, no more than six cores per GPU can be used, and our experiments showed the best performance with three cores per GPU.
As the lossy data compression is deeply coupled with the simulation, we always keep it synchronous. 

Synchronous data compression is preferred when we use eight GPUs and six cores on each GPU for NEKO simulation. In Quadrants I and II, the total execution time (red crosses) is the sum of NEKO simulation time (purple dash-dot lines) and lossy and lossless data compression time (green dash-dot lines). In Quadrants III and IV, the parallel efficiencies of NEKO simulation (purple star) and data compression (green star) are both better than $80\%$. With this configuration, both NEKO simulation and data compression are efficient enough.

With the best-performing setup (eight GPUs and three cores per GPU), our model suggests using the asynchronous in-situ method for lossless compression. Because the lossy compression stays synchronous, the result is a hybrid in-situ method, which is the preferred approach in this situation. For the hybrid approach, both the synchronous and asynchronous parts of the in-situ tasks are plotted in Quadrants I and IV. Our model predicts the best efficiency when 30 cores are used for the lossless compression: with this setup, the time spent in lossless compression is about the same as the time of the simulation steps plus the synchronous lossy compression. The total execution time (red crosses) is the sum of lossless compression on CPUs (green dash-dot line) and communication (yellow dashed line), shown in Quadrants I and II. Thus, we achieve a good asynchronous overlap with good resource usage. In Quadrants III and IV, the parallel efficiencies of NEKO simulation (purple star), synchronous lossy data compression on GPUs (green star), and asynchronous lossless data compression (green circle) are all better than $80\%$.
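The balancing argument above can be illustrated with a small sketch. The core budget follows the slider limits in the cell below (18 CPU cores per GPU, four GPUs per node), while the timing model and the helper names are hypothetical and stand in for the predictions made by `interactive_double`.

```python
CORES_PER_GPU = 18   # upper bound of the "cores / gpu" slider in the cell below
GPUS_PER_NODE = 4    # two nodes provide eight GPUs in total


def insitu_core_candidates(gpus, cores_per_gpu):
    # CPU cores not used by NEKO are available for the asynchronous in-situ task.
    # Candidates advance in steps of one core per node, mirroring the slider step.
    nodes = gpus // GPUS_PER_NODE
    max_cores = gpus * (CORES_PER_GPU - cores_per_gpu)
    return list(range(nodes, max_cores + 1, nodes))


def smallest_balanced(candidates, async_time, sync_side_time):
    # Return the smallest core count whose asynchronous task finishes within the
    # time of the simulation steps plus the synchronous lossy compression.
    for cores in candidates:
        if async_time(cores) <= sync_side_time:
            return cores
    return None  # no candidate hides the asynchronous task completely


candidates = insitu_core_candidates(gpus=8, cores_per_gpu=3)

# Made-up timings, for illustration only.
chosen = smallest_balanced(
    candidates,
    async_time=lambda cores: 90.0 / cores,  # hypothetical ideal scaling
    sync_side_time=4.0,                     # hypothetical simulation + lossy time
)
print(candidates[0], candidates[-1], chosen)
```

Picking the smallest core count that still overlaps fully is one way to keep the in-situ cores busy; using more cores than that would leave them idle while the simulation catches up.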

In [None]:
import ipywidgets as widgets
from ipywidgets import interact
from interactive_double import interactive_insitu
log_button = widgets.ToggleButtons(
    options=['Synchronous', 'Asynchronous', 'Hybrid','Auto'],
    description='Approaches:',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltips=['sync', 'async', 'hybrid', 'auto'],
#     icons=['check'] * 3
)

y_widget = widgets.IntSlider(min=4, max=28, step=4, value=8, description = "gpu")
x_widget = widgets.IntSlider(min=1, max=18, step=1, value=3, description = "cores / gpu")
z_widget = widgets.IntSlider(min=1, max=504, step=1, value=8, description = "in-situ core")
def update_x_range(*args):
    if log_button.value == "Hybrid" : 
        z_widget.max = y_widget.value * (18 - x_widget.value)
        x_widget.max = 17
        z_widget.min = y_widget.value / 4
        z_widget.step = y_widget.value / 4
        z_widget.description = "in-situ core"
    elif log_button.value == "Synchronous" : 
        z_widget.value = y_widget.value * x_widget.value
        z_widget.max = y_widget.value * 18 
        x_widget.max = 18
        z_widget.min = 1
        z_widget.step = 1
        z_widget.description = "in-situ core"
    elif log_button.value == "Auto" : 
        z_widget.max = y_widget.value * (18 - x_widget.value)
        x_widget.max = 17
        z_widget.min = y_widget.value / 4
        z_widget.step = y_widget.value / 4
        z_widget.value = 9
        z_widget.description = " "
        
y_widget.observe(update_x_range, 'value')
x_widget.observe(update_x_range, 'value')
log_button.observe(update_x_range, 'value')
fre_opt=widgets.BoundedIntText(
    value=10,
    min=1,
    max=100,
    step=1,
    description='Frequency: 1/',
    disabled=False
)
interact(interactive_insitu, gpu=y_widget, ppg=x_widget, core=z_widget, freq=fre_opt, method=log_button)

# GPU-accelerated QE with lossless data compression
We use the slider with the name "gpu" to choose the total number of GPUs, the slider with the name "ranks/gpu" to choose the number of MPI ranks accessing one GPU, and the slider with the name "threads/rank" to choose the number of threads on each MPI rank. The numerical text box with the name "Frequency" enables users to specify the interval at which the in-situ task is executed. The four toggle buttons are the same as in the NEKO case with data compression.

Given the relatively small scale of this case, the execution is confined to a single GPU node that has four GPUs. QE uses MPI to allow a different number of MPI ranks for the CPU-parts of QE and OpenMP to allow a different number of CPU cores per MPI rank. For each MPI rank a separate copy of the simulation data needs to be kept on the GPU. Our experimental results indicate that the most effective performance is achieved with three CPU cores per MPI rank and three ranks per GPU.
We also execute the case on two GPU nodes (eight GPUs in total). 
Asynchronous data compression is preferred when we use eight GPUs, three MPI ranks per GPU, and two threads per MPI rank for the QE simulation. Our model predicts the best efficiency when 24 cores are used for the lossless compression: with this setup, the time spent in lossless compression is about the same as the time of the simulation steps. The total execution time (red crosses) is the sum of QE simulation on GPUs (purple dash-dot line) and communication (yellow dashed line), shown in Quadrants I and II. Thus, we achieve a good asynchronous overlap with good resource usage. In Quadrant III, the parallel efficiency of the QE simulation on GPUs (purple star) is better than $80\%$. In Quadrant IV, the parallel efficiency of asynchronous lossless data compression (green star) is below $20\%$ because of the relatively small computational cost of compression.

With the other setup (six MPI ranks per GPU), our model suggests using the synchronous in-situ method for lossless compression. 
In Quadrants I and II, the total execution time (red crosses) is the sum of QE simulation time (purple dash-dot lines) and lossless data compression time (green dash-dot lines). In Quadrant III, the parallel efficiency of the QE simulation on GPUs (purple star) exceeds $80\%$, demonstrating good scalability. Conversely, in Quadrant IV, the parallel efficiency of synchronous lossless data compression (green star) is still below $20\%$, primarily due to the relatively low computational cost associated with the compression task.
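The CPU core budget implied by the slider limits in the cell below can be worked through with a short sketch. The helper name is hypothetical; the limit of 18 CPU cores per GPU and the expression for the free cores are taken from the slider ranges and the `update_x_range` callback in that cell.

```python
def qe_core_budget(gpus, ranks_per_gpu, threads_per_rank, cores_per_gpu=18):
    # Each MPI rank runs `threads_per_rank` OpenMP threads, one per CPU core.
    used_per_gpu = ranks_per_gpu * threads_per_rank
    if used_per_gpu > cores_per_gpu:
        raise ValueError("ranks * threads exceeds the CPU cores available per GPU")
    return {
        "qe_cores": gpus * used_per_gpu,
        # Cores left over for asynchronous lossless compression.
        "free_cores": gpus * (cores_per_gpu - used_per_gpu),
    }


# The two setups discussed above, both with eight GPUs and two threads per rank.
print(qe_core_budget(gpus=8, ranks_per_gpu=3, threads_per_rank=2))
print(qe_core_budget(gpus=8, ranks_per_gpu=6, threads_per_rank=2))
```

With three ranks per GPU, the 24 cores predicted for lossless compression fit well within the free cores; doubling the ranks per GPU halves that headroom.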



In [None]:
import ipywidgets as widgets
from ipywidgets import interact
from interactive_qe import interactive_insitu
log_button = widgets.ToggleButtons(
    options=['Synchronous', 'Asynchronous', 'Hybrid','Auto'],
    description='Approaches:',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltips=['sync', 'async', 'hybrid', 'auto'],
#     icons=['check'] * 3
)

y_widget = widgets.IntSlider(min=4, max=28, step=4, value=8, description = "gpu")
x_widget = widgets.IntSlider(min=1, max=18, step=1, value=3, description = "ranks / gpu")
x2_widget = widgets.IntSlider(min=1, max=18, step=1, value=2, description = "threads / rank")
z_widget = widgets.IntSlider(min=1, max=504, step=1, value=8, description = "in-situ core")
def update_x_range(*args):
    if log_button.value == "Asynchronous" : 
        z_widget.max = y_widget.value * (18 - x_widget.value*x2_widget.value)
        x_widget.max = 18/x2_widget.value -1
        z_widget.min = y_widget.value / 4
        z_widget.step = y_widget.value / 4
        z_widget.description = "in-situ core"
    elif log_button.value == "Synchronous" : 
        z_widget.value = y_widget.value * x_widget.value
        z_widget.max = y_widget.value * 18 
        x_widget.max = 18
        z_widget.min = 1
        z_widget.step = 1
        z_widget.description = "in-situ core"
    elif log_button.value == "Auto" : 
        #z_widget.max = y_widget.value * (18 - x_widget.value*x2_widget.value)
        x_widget.max = 17
        x2_widget.max = 18 / x_widget.value
        z_widget.min = y_widget.value / 4
        z_widget.step = y_widget.value / 4
        z_widget.description = " "
        
y_widget.observe(update_x_range, 'value')
x_widget.observe(update_x_range, 'value')
x2_widget.observe(update_x_range, 'value')
log_button.observe(update_x_range, 'value')
fre_opt=widgets.BoundedIntText(
    value=10,
    min=1,
    max=1000,
    step=1,
    description='Frequency: 1/',
    disabled=False
)
interact(interactive_insitu, gpu=y_widget, ppg=x_widget, r=x2_widget, core=z_widget, freq=fre_opt, method=log_button)