# Kernel Level Profiling

We start by compiling and executing our application to make sure that its results still match our expectations.

In [None]:
!nvc++ -O3 -march=native -std=c++17 -mp=gpu -target=gpu ../src/stencil-2d/stencil-2d-omp-target-v1.cpp -o ../build/stencil-2d-omp-target-v1

In [None]:
!../build/stencil-2d-omp-target-v1 double 8192 8192 2 256

## Nsight Compute CLI

Next, we profile the application using the CLI of Nsight Compute: `ncu`.
Command line arguments are:
* `-o ...`: sets the target output profile file (equivalent to `nsys`)
* `--force-overwrite`: replaces the profile file if it already exists (unlike with `nsys`, the flag takes no `=true` value)

All command line arguments are listed in the [documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#profile).

Without an output file, results are printed to the command line.

Note that we also decrease the number of iterations since by default *every kernel* is profiled.

In [None]:
!ncu ../build/stencil-2d-omp-target-v1 double 8192 8192 2 2

We can further limit the scope of profiled kernels with
* `--launch-skip n` or `-s n`: skips the first `n` kernels encountered
* `--launch-count n` or `-c n`: limits profiling to the first `n` applicable kernels
* `--kernel-name name` or `-k name`: limits profiling to kernels with the name `name`
  * can also be used with a regular expression, e.g. `-k regex:".*stencil2d.*"`
  * Nsight Compute also supports kernel renaming
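
The combination of `-s` and `-c` can be pictured as a slice over the sequence of kernel launches. The following is a plain Python illustration and not part of `ncu`; the launch names and the helper `profiled_launches` are hypothetical, and no name filter is assumed.

```python
# Hypothetical sequence of kernel launches issued by an application.
launches = [f"stencil2d#{i}" for i in range(6)]


def profiled_launches(launches, skip=0, count=None):
    """Mimic `ncu -s skip -c count` for an unfiltered launch sequence."""
    end = None if count is None else skip + count
    return launches[skip:end]


# `-s 2 -c 1` skips two launches and profiles only the third one.
print(profiled_launches(launches, skip=2, count=1))
```

With `-s 2 -c 1`, as used in the next cell, only the third launch is profiled, which keeps the profiling overhead small.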

In [None]:
!ncu -s 2 -c 1 ../build/stencil-2d-omp-target-v1 double 8192 8192 2 2

### Potential Output

```bash
  nvkernel__Z9stencil2dIdEvPKT_PS0_mm_F2L5_14 (64, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         7.24
    SM Frequency                    Ghz         1.30
    Elapsed Cycles                cycle      9309966
    Memory Throughput                 %        43.05
    DRAM Throughput                   %        21.77
    Duration                         ms         7.13
    L1/TEX Cache Throughput           %        58.12
    L2 Cache Throughput               %        35.90
    SM Active Cycles              cycle   6889985.05
    Compute (SM) Throughput           %        17.17
    ----------------------- ----------- ------------

    OPT   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full
          waves across all SMs. Look at Launch Statistics for more details.
```

```bash
    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                     64
    Registers Per Thread             register/thread              48
    Shared Memory Configuration Size           Kbyte           16.38
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              84
    Threads                                   thread            8192
    Uses Green Context                                             0
    Waves Per SM                                                0.08
    -------------------------------- --------------- ---------------

    OPT   Est. Speedup: 23.81%
          The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 84
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the
          Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
          description for more details on launch configurations.
```

```bash
    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           10
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           12
    Theoretical Active Warps per SM        warp           40
    Theoretical Occupancy                     %        83.33
    Achieved Occupancy                        %         8.33
    Achieved Active Warps Per SM           warp         4.00
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 90.01%
          The difference between calculated theoretical (83.3%) and measured achieved occupancy (8.3%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
```

```bash
    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles       cycle  11247889.33
    Total DRAM Elapsed Cycles        cycle    620130304
    Average L1 Active Cycles         cycle   6889985.05
    Total L1 Elapsed Cycles          cycle    781393884
    Average L2 Active Cycles         cycle   8747615.29
    Total L2 Elapsed Cycles          cycle    421197072
    Average SM Active Cycles         cycle   6889985.05
    Total SM Elapsed Cycles          cycle    781393884
    Average SMSP Active Cycles       cycle   6883894.44
    Total SMSP Elapsed Cycles        cycle   3125575536
    -------------------------- ----------- ------------

    OPT   Est. Speedup: 19.18%
          One or more SMs have a much lower number of active cycles than the average number of active cycles. Maximum
          instance value is 25.89% above the average, while the minimum instance value is 100.00% below the average.
    ----- --------------------------------------------------------------------------------------------------------------
    OPT   Est. Speedup: 19.19%
          One or more SMSPs have a much lower number of active cycles than the average number of active cycles. Maximum
          instance value is 25.94% above the average, while the minimum instance value is 100.00% below the average.
    ----- --------------------------------------------------------------------------------------------------------------
    OPT   Est. Speedup: 19.18%
          One or more L1 Slices have a much lower number of active cycles than the average number of active cycles.
          Maximum instance value is 25.89% above the average, while the minimum instance value is 100.00% below the
          average.
```

### Exercise

Have a look at the speed of light (SOL) output and try to pinpoint performance issues, as well as their main contributor.

### Solution

Nsight Compute rightly reports that none of the usual performance limiters (FLOPS, MEM, ...) is fully utilized, which indicates an underperforming application.
The tool also gives developers hints about potentially useful optimizations or reasons for the lack of performance.
In this case these hints are useful (which might not always be the case):
* The occupancy is very low, which is a result of launching only a fraction of a wave.
* This stems from a low number of CUDA blocks, lower even than the number of SMs.
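
The reported `Waves Per SM` value can be reproduced by hand from the Launch Statistics and Occupancy sections. The sketch below assumes that one wave is the number of blocks the whole GPU can hold at once, i.e. the number of SMs times the tightest per-SM block limit. This matches the numbers shown here, but it is not taken from the tool's documentation.

```python
# Values copied from the Launch Statistics and Occupancy sections above.
grid_size = 64          # number of blocks in the launch
num_sms = 84            # "# SMs"
block_limits = {"SM": 16, "Registers": 10, "Shared Mem": 16, "Warps": 12}

# The tightest limit determines how many blocks fit on one SM at a time.
blocks_per_sm = min(block_limits.values())

# Assumption: one wave is the number of blocks the GPU holds concurrently.
blocks_per_wave = num_sms * blocks_per_sm
waves_per_sm = grid_size / blocks_per_wave

print(f"blocks per wave: {blocks_per_wave}")
print(f"waves per SM:    {waves_per_sm:.2f}")
```

With only 64 blocks for a device that could hold 840 at once, most of the hardware stays idle.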

## Stencil Code Optimization 2

A first attempt to increase the number of blocks is to reduce the block size.
OpenMP supports this via the `num_threads` clause.
The updated code is available at [stencil-2d-omp-target-v2](../src/stencil-2d/stencil-2d-omp-target-v2.cpp), and can be compiled, executed and profiled with the following cells.

In [None]:
!nvc++ -O3 -march=native -std=c++17 -mp=gpu -target=gpu ../src/stencil-2d/stencil-2d-omp-target-v2.cpp -o ../build/stencil-2d-omp-target-v2

In [None]:
!../build/stencil-2d-omp-target-v2 double 8192 8192 2 256

In [None]:
!ncu -s 2 -c 1 ../build/stencil-2d-omp-target-v2 double 8192 8192 2 2

### Potential Output

```bash
  nvkernel__Z9stencil2dIdEvPKT_PS0_mm_F2L5_14 (128, 1, 1)x(64, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         7.24
    SM Frequency                    Ghz         1.30
    Elapsed Cycles                cycle      9322846
    Memory Throughput                 %        43.03
    DRAM Throughput                   %        21.74
    Duration                         ms         7.14
    L1/TEX Cache Throughput           %        51.57
    L2 Cache Throughput               %        35.91
    SM Active Cycles              cycle   7762959.54
    Compute (SM) Throughput           %        17.17
    ----------------------- ----------- ------------

    OPT   This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full
          waves across all SMs. Look at Launch Statistics for more details.
```

```bash
    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    64
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                    128
    Registers Per Thread             register/thread              48
    Shared Memory Configuration Size           Kbyte           16.38
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              84
    Threads                                   thread            8192
    Uses Green Context                                             0
    Waves Per SM                                                0.10
    -------------------------------- --------------- ---------------

    OPT   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
          achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
          hardware busy.
```

```bash
    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           20
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           24
    Theoretical Active Warps per SM        warp           32
    Theoretical Occupancy                     %        66.67
    Achieved Occupancy                        %         6.71
    Achieved Active Warps Per SM           warp         3.22
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 89.93%
          The difference between calculated theoretical (66.7%) and measured achieved occupancy (6.7%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
    ----- --------------------------------------------------------------------------------------------------------------
    OPT   Est. Local Speedup: 33.33%
          The 8.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the
          hardware maximum of 12. This kernel's theoretical occupancy (66.7%) is limited by the number of blocks that
          can fit on the SM. This kernel's theoretical occupancy (66.7%) is limited by the required amount of shared
          memory.
```

```bash
    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles       cycle  11249669.33
    Total DRAM Elapsed Cycles        cycle    620989440
    Average L1 Active Cycles         cycle   7762959.54
    Total L1 Elapsed Cycles          cycle    781549776
    Average L2 Active Cycles         cycle   8760143.08
    Total L2 Elapsed Cycles          cycle    421780848
    Average SM Active Cycles         cycle   7762959.54
    Total SM Elapsed Cycles          cycle    781549776
    Average SMSP Active Cycles       cycle   6252614.98
    Total SMSP Elapsed Cycles        cycle   3126199104
    -------------------------- ----------- ------------

    OPT   Est. Speedup: 13.81%
          One or more SMs have a much lower number of active cycles than the average number of active cycles. Maximum
          instance value is 16.55% above the average, while the minimum instance value is 19.57% below the average.
    ----- --------------------------------------------------------------------------------------------------------------
    OPT   Est. Speedup: 22.03%
          One or more SMSPs have a much lower number of active cycles than the average number of active cycles. Maximum
          instance value is 32.78% above the average, while the minimum instance value is 100.00% below the average.
    ----- --------------------------------------------------------------------------------------------------------------
    OPT   Est. Speedup: 13.81%
          One or more L1 Slices have a much lower number of active cycles than the average number of active cycles.
          Maximum instance value is 16.55% above the average, while the minimum instance value is 19.57% below the
          average.
```

### Interpretation

While our application now utilizes all SMs, the other part of the previous problem has not been fixed: there are still too few (full) waves and the occupancy remains low.
As such, the performance is still comparatively low.
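
A quick back-of-the-envelope check shows why halving the block size cannot help on its own. The sketch below uses the grid and block sizes from the two profiles. The limit of 48 warps per SM is not printed directly; it is inferred from the report that 40 theoretical warps correspond to 83.33% occupancy.

```python
# Values copied from the v1 and v2 profiles above.
WARP_SIZE = 32
num_sms = 84
max_warps_per_sm = 48   # inferred: 40 theoretical warps = 83.33 %

launches = {
    "v1": {"grid_size": 64, "block_size": 128},
    "v2": {"grid_size": 128, "block_size": 64},
}

for name, cfg in launches.items():
    threads = cfg["grid_size"] * cfg["block_size"]
    warps = threads // WARP_SIZE
    # Upper bound on device-wide fill if warps were spread perfectly.
    fill = warps / (num_sms * max_warps_per_sm)
    print(f"{name}: {threads} threads, {warps} warps, fill {fill:.1%}")
```

Both versions launch the same 8192 threads, so the device is filled to only a few percent in either case, no matter how the threads are grouped into blocks.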

## Stencil Code Optimization 3

To increase the number of threads we need to generate more parallelism.
Luckily, simply associating more loops with our OpenMP construct serves exactly that purpose.
The updated code is available at [stencil-2d-omp-target-v3](../src/stencil-2d/stencil-2d-omp-target-v3.cpp), and can be compiled, executed and profiled with the following cells.

In [None]:
!nvc++ -O3 -march=native -std=c++17 -mp=gpu -target=gpu ../src/stencil-2d/stencil-2d-omp-target-v3.cpp -o ../build/stencil-2d-omp-target-v3

In [None]:
!../build/stencil-2d-omp-target-v3 double 8192 8192 2 256

In [None]:
!ncu -s 2 -c 1 ../build/stencil-2d-omp-target-v3 double 8192 8192 2 2

### Potential Output

```bash
  nvkernel__Z9stencil2dIdEvPKT_PS0_mm_F2L5_14 (524033, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         7.24
    SM Frequency                    Ghz         1.30
    Elapsed Cycles                cycle      2128981
    Memory Throughput                 %        94.55
    DRAM Throughput                   %        94.55
    Duration                         ms         1.63
    L1/TEX Cache Throughput           %        37.26
    L2 Cache Throughput               %        72.76
    SM Active Cycles              cycle   2126581.36
    Compute (SM) Throughput           %        75.00
    ----------------------- ----------- ------------

    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
          further improve performance, work will likely need to be shifted from the most utilized to another unit.
          Start by analyzing DRAM in the Memory Workload Analysis section.
```

```bash
    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                 524033
    Registers Per Thread             register/thread              46
    Shared Memory Configuration Size           Kbyte           16.38
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              84
    Threads                                   thread        67076224
    Uses Green Context                                             0
    Waves Per SM                                              623.85
    -------------------------------- --------------- ---------------
```

```bash
    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           10
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           12
    Theoretical Active Warps per SM        warp           40
    Theoretical Occupancy                     %        83.33
    Achieved Occupancy                        %        73.26
    Achieved Active Warps Per SM           warp        35.17
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 12.08%
          The difference between calculated theoretical (83.3%) and measured achieved occupancy (73.3%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
```

```bash
    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles       cycle  11177070.67
    Total DRAM Elapsed Cycles        cycle    141852672
    Average L1 Active Cycles         cycle   2126581.36
    Total L1 Elapsed Cycles          cycle    178860480
    Average L2 Active Cycles         cycle   2001744.83
    Total L2 Elapsed Cycles          cycle     96347664
    Average SM Active Cycles         cycle   2126581.36
    Total SM Elapsed Cycles          cycle    178860480
    Average SMSP Active Cycles       cycle   2126465.98
    Total SMSP Elapsed Cycles        cycle    715441920
    -------------------------- ----------- ------------
```

### Exercise

Investigate the output for the optimized version.
Has the previous issue been addressed?
What is the main performance limiter now?

### Solution

That already looks much better: DRAM throughput is now at about 95% of its peak and has become the main bottleneck.
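
The numbers in this profile can be cross-checked by hand. The sketch below rests on several assumptions that are not confirmed by the source code: the kernel updates only the interior of the 8192 x 8192 grid, each thread handles one loop iteration, every grid point is read and written exactly once per sweep, and the GPU is an A40 with a peak bandwidth of 696 GB/s.

```python
import math

# Assumptions: interior of an 8192 x 8192 grid, one thread per iteration.
nx = ny = 8192
block_size = 128
interior = (nx - 2) * (ny - 2)

blocks_outer_only = math.ceil((ny - 2) / block_size)   # outer loop only
blocks_collapsed = math.ceil(interior / block_size)    # both loops

# Minimum DRAM traffic: read and write every double once per sweep.
bytes_moved = 2 * nx * ny * 8
duration_s = 1.63e-3            # "Duration" from the profile above
bandwidth_gbs = bytes_moved / duration_s / 1e9
peak_gbs = 696.0                # assumed A40 peak, not taken from the profile

print(blocks_outer_only, blocks_collapsed)
print(f"{bandwidth_gbs:.0f} GB/s, {bandwidth_gbs / peak_gbs:.0%} of assumed peak")
```

Under these assumptions the two block counts agree with the grid sizes reported for the first and the third version, and the estimated bandwidth lands close to the reported DRAM throughput.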

## Further Nsight Compute Options

The default configuration of Nsight Compute captures only a small subset of the metrics available.
To alter this, different **sections** and **sets** are available, in addition to specifying custom metric lists.

### Sections

The available sections can be queried with

In [None]:
!ncu --list-sections > ../profiles/sections.txt

In [None]:
!cat ../profiles/sections.txt

and used with the `--section` parameter

In [None]:
!ncu -s 2 -c 1 --section=SpeedOfLight ../build/stencil-2d-omp-target-v3 double 8192 8192 2 2

### Sets

In many cases, multiple sections are required concurrently.
Sets provide an interface for the most relevant combinations.
As before, they can be queried directly from `ncu` and then used with the corresponding command line argument.

In [None]:
!ncu --list-sets > ../profiles/sets.txt

In [None]:
!cat ../profiles/sets.txt

In [None]:
!ncu -s 2 -c 1 --set=roofline ../build/stencil-2d-omp-target-v3 double 8192 8192 2 2

If further information is necessary, the section definitions included in the Nsight Compute distribution can be helpful.
After locating them, they can simply be read.

In [None]:
!ls -l $NVHPC_ROOT/Linux_x86_64/23.7/profilers/Nsight_Compute/sections/

In [None]:
!cat $NVHPC_ROOT/Linux_x86_64/23.7/profilers/Nsight_Compute/sections/SpeedOfLight.section

### Metrics

In some cases, directly accessing very specific metrics can be helpful, for instance for automatic benchmarking or to keep profiling overheads low.

Additional material is available on the [metrics structure](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-structure), as well as a [list of key metrics](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison).

In [None]:
!ncu --list-metrics > ../profiles/metrics.txt

In [None]:
!cat ../profiles/metrics.txt

Relevant metrics include (but are not limited to)
* `sm__warps_active.avg.pct_of_peak_sustained_active` displays the achieved occupancy.
* `dram__bytes_read` and `dram__bytes_write` correspond to the bytes read from and written to DRAM. This can be
    extended with `.sum` to obtain the total volume and `.sum.per_second` to obtain the bandwidth.
* `smsp__sass_thread_inst_executed_op_{dadd, dmul, dfma}_pred_on.sum` represents the total executed additions,
    multiplications and fused multiply-adds in double precision. The total number of FLOPs in double precision can be
    computed with `dadd + dmul + 2 * dfma`.
* `lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_{ld, st}.sum` can be used to query the L2
    cache load and store volumes.

In [None]:
!ncu -s 2 -c 1 --metrics \
sm__warps_active.avg.pct_of_peak_sustained_active,\
dram__bytes_read.sum,dram__bytes_write.sum,\
dram__bytes_read.sum.per_second,dram__bytes_write.sum.per_second,\
smsp__sass_thread_inst_executed_op_dadd_pred_on.sum,smsp__sass_thread_inst_executed_op_dmul_pred_on.sum,smsp__sass_thread_inst_executed_op_dfma_pred_on.sum,\
smsp__sass_thread_inst_executed_op_dadd_pred_on.sum.per_second,smsp__sass_thread_inst_executed_op_dmul_pred_on.sum.per_second,smsp__sass_thread_inst_executed_op_dfma_pred_on.sum.per_second \
../build/stencil-2d-omp-target-v3 double 8192 8192 2 2
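
Once the counters have been collected, they can be combined in a few lines of Python. This is a minimal sketch of the FLOP formula given in the list above; the counter values are hypothetical placeholders rather than measured results, and the helper `dp_flops` is introduced only for illustration.

```python
def dp_flops(dadd, dmul, dfma):
    """Total double-precision FLOPs; an FMA counts as two operations."""
    return dadd + dmul + 2 * dfma


# Hypothetical counter values for one kernel launch, for illustration only.
counters = {
    "dadd": 3_000_000,
    "dmul": 1_000_000,
    "dfma": 2_000_000,
    "dram_bytes_read": 540_000_000,
    "dram_bytes_write": 540_000_000,
}

flops = dp_flops(counters["dadd"], counters["dmul"], counters["dfma"])
dram_bytes = counters["dram_bytes_read"] + counters["dram_bytes_write"]

# Arithmetic intensity relates computation to DRAM traffic (FLOP per byte).
intensity = flops / dram_bytes
print(f"{flops} FLOPs, {intensity:.4f} FLOP/byte")
```

A low FLOP-per-byte ratio like this one is typical of a kernel limited by memory bandwidth rather than by compute.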

For counting metrics (e.g. bytes transferred) the structure is usually `metric.sum`.
For corresponding rates this can be extended to `metric.sum.per_second`.
This can also be recomputed in terms of percentage of theoretical peak performance with `metric.sum.pct_of_peak_sustained_elapsed`.
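
Since these names follow a fixed pattern, long `--metrics` lists can be assembled programmatically. The helper `metric_variants` below is a hypothetical convenience function and not part of Nsight Compute. Whether a given suffix exists for a particular counter should be checked with the query options described next.

```python
def metric_variants(base):
    """Return the total, rate, and percent-of-peak variants of a counter."""
    return [
        f"{base}.sum",
        f"{base}.sum.per_second",
        f"{base}.sum.pct_of_peak_sustained_elapsed",
    ]


bases = ["dram__bytes_read", "dram__bytes_write"]
metrics_arg = ",".join(m for b in bases for m in metric_variants(b))

# The resulting string can be passed to `ncu --metrics`.
print(metrics_arg)
```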

Available metrics can be queried with (example for A100)

In [None]:
!ncu --query-metrics --chip ga100 > ../profiles/metrics.ga100.txt

and narrowed down further, e.g. to the available suffixed variants of a single base metric

In [None]:
!ncu --query-metrics-mode suffix --metrics sm__inst_executed --chip ga100

A small collection of useful metrics is provided in [metrics](metrics.md).

## Nsight Compute GUI

While the command line interface for Nsight Compute is quite powerful, in many cases a more structured and explorable way of presenting the obtained data is advantageous.
In this case, the Nsight Compute GUI can be a helpful tool.
We follow this pattern:
* Profiling of our application with suitable sections/sets remotely.
* Downloading the profile data.
* Opening it locally.

In [None]:
!nvc++ -O3 -march=native -std=c++17 -mp=gpu -target=gpu ../src/stencil-2d/stencil-2d-omp-target-v3.cpp -o ../build/stencil-2d-omp-target-v3

In [None]:
!../build/stencil-2d-omp-target-v3 double 8192 8192 2 256

In [None]:
!ncu --set=full -o ../profiles/stencil-2d-omp-target-v3 --force-overwrite ../build/stencil-2d-omp-target-v3 double 8192 8192 2 2

### Exercise

Take a moment to explore the different sections and options in the GUI.
You might notice that the memory statistics section looks interesting.
Is there anything that catches your eye?

### Solution

The memory statistics section reveals that data volumes from cache are quite high.
This is an artefact of the thread distribution employed by OpenMP.
To remedy this behavior, using (spatial) blocking techniques can be one option.
Another is switching to CUDA which naturally allows having a 2D thread decomposition.
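
The benefit of a 2D decomposition can be estimated with a simple footprint count. The sketch below assumes a 5-point stencil and compares two hypothetical block shapes of 128 threads each. Neither the stencil shape nor the block shapes are taken from the actual source code.

```python
def points_read_per_output(tile_x, tile_y):
    """Distinct grid points a tile reads per output point (5-point stencil)."""
    # Tile interior plus one halo layer on each side, corners excluded.
    footprint = tile_x * tile_y + 2 * tile_x + 2 * tile_y
    return footprint / (tile_x * tile_y)


# Hypothetical thread-block shapes with 128 threads each.
for shape in [(128, 1), (16, 8)]:
    print(shape, round(points_read_per_output(*shape), 3))
```

A one-dimensional strip has to fetch roughly three points per update, whereas a more compact tile gets much closer to one. This is the effect that spatial blocking or a 2D thread decomposition aims for.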

## Stencil Code Optimization 4

The updated code is available at [stencil-2d-cuda-v4](../src/stencil-2d/stencil-2d-cuda-v4.cu), and can be compiled, executed and profiled with the following cells.

In [None]:
!nvc++ -O3 -march=native -std=c++17 ../src/stencil-2d/stencil-2d-cuda-v4.cu -o ../build/stencil-2d-cuda-v4

In [None]:
!../build/stencil-2d-cuda-v4 double 8192 8192 2 256

In [None]:
!ncu --set=full -o ../profiles/stencil-2d-cuda-v4 --force-overwrite ../build/stencil-2d-cuda-v4 double 8192 8192 2 2

### Exercise

Compare this and the previous version in the GUI.
You can also set a baseline to compare both in one view.
How did the memory statistics section change?

## Stencil Code Optimization 5

Our cache transfer volumes now look a lot better.
One last optimization idea to maximize cache reuse is to map thread blocks forward and backward in alternating fashion.
This aims at promoting data reuse - data that was *written last* is now *read first*.
Unfortunately, examining the effect of this optimization is difficult with Nsight Compute since kernel profiling usually entails flushing all caches.
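
The alternating traversal can be sketched in a few lines. The following is a simplified illustration that assumes the direction flips on every sweep over the grid. The function `tile_for_block` and its parameters are hypothetical, and the actual implementation in the source file may differ.

```python
def tile_for_block(block_id, num_blocks, sweep):
    """Map a block to a tile, reversing the order on every other sweep."""
    if sweep % 2 == 0:
        return block_id                   # forward traversal
    return num_blocks - 1 - block_id      # backward traversal


num_blocks = 4
for sweep in range(2):
    order = [tile_for_block(b, num_blocks, sweep) for b in range(num_blocks)]
    print(f"sweep {sweep}: {order}")
```

If blocks are scheduled roughly in order of their index, the tiles touched at the end of one sweep are the first ones touched in the next, so their data may still reside in cache.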

This last version of our stencil application is available at [stencil-2d-cuda-v5](../src/stencil-2d/stencil-2d-cuda-v5.cu), and can be compiled, executed and profiled with the following cells.

In [None]:
!nvc++ -O3 -march=native -std=c++17 ../src/stencil-2d/stencil-2d-cuda-v5.cu -o ../build/stencil-2d-cuda-v5

In [None]:
!../build/stencil-2d-cuda-v5 double 8192 8192 2 256

For GPUs with a small L2 cache, like the A40, performance improvements may be negligible for this problem size.
The cell below executes both versions for a smaller problem size. Can you identify performance improvements for specific sizes?

In [None]:
!../build/stencil-2d-cuda-v4 double 1024 1024 2 256
!../build/stencil-2d-cuda-v5 double 1024 1024 2 256

In [None]:
!ncu --set=full -o ../profiles/stencil-2d-cuda-v5 --force-overwrite ../build/stencil-2d-cuda-v5 double 8192 8192 2 2

Further optimization of this application will most likely not be successful, since the main bottleneck, the DRAM bandwidth, is already well utilized for the problem sizes considered.
Additional performance improvements now require shifting this bottleneck.
This could be done by simply switching to another GPU that features a higher bandwidth, or by applying algorithmic changes such as temporal blocking.

## Next Step

After having learnt about various aspects of GPU performance engineering and optimization, it is now time to apply these skills to a more complicated application.
Head over to the [Conjugate Gradient](./conjugate-gradient.ipynb) notebook to get started.