<h1 style="color:#65AE11;">Kernel Launches in Non-Default Streams</h1>

In this section you will learn to launch kernels in non-default streams.

<h2 style="color:#65AE11;">Objectives</h2>

By the time you complete this section you will:

* Know how to create non-default streams
* Be able to launch kernels in non-default streams
* Know how to observe operations in non-default streams in Nsight Systems
* Know how to destroy non-default streams

<h2 style="color:#65AE11;">Non-Default Stream Creation</h2>

To create a new non-default stream, pass `cudaStreamCreate` a `cudaStream_t` pointer:

```c
cudaStream_t stream;
cudaStreamCreate(&stream);
```
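`cudaStreamCreate` returns a `cudaError_t`, so the call can be checked rather than assumed to succeed. The following is a minimal sketch; the helper name `create_stream_checked` is hypothetical and is not part of the workshop code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the workshop code): creates a non-default
// stream and reports a failure instead of ignoring the return value.
// Returns true when the stream was created.
bool create_stream_checked(cudaStream_t *stream)
{
    cudaError_t err = cudaStreamCreate(stream);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaStreamCreate failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

A caller would declare `cudaStream_t stream;` and pass `&stream`, exactly as in the snippet above.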

<h2 style="color:#65AE11;">Launching a Kernel in a Non-Default Stream</h2>

To launch a kernel in a non-default stream, pass the stream as the 4th launch configuration argument. The 3rd launch configuration argument sets the number of bytes of dynamically allocated shared memory per block. Because launch configuration arguments are positional, you must supply the 3rd argument in order to reach the 4th; pass `0`, its default value, when the kernel does not use dynamically allocated shared memory:

```c
cudaStream_t stream;
cudaStreamCreate(&stream);

kernel<<<grid, blocks, 0, stream>>>();
```
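To make the roles of the 3rd and 4th launch configuration arguments concrete, here is a sketch in which the 3rd argument is not `0`. The kernel `reversed_indices` and the host function `reversed_indices_in_stream` are hypothetical names invented for this illustration. The sketch assumes a single block, so `n` must be between 1 and 1024, and it cleans up with `cudaStreamDestroy`, which is covered in the next part of this section:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its index in dynamically allocated
// shared memory, then writes out the value stored by its mirror thread.
__global__ void reversed_indices(int *out, int n)
{
    extern __shared__ int tile[];  // size comes from the 3rd launch argument
    int i = threadIdx.x;
    if (i < n) tile[i] = i;
    __syncthreads();
    if (i < n) out[i] = tile[n - 1 - i];
}

// Fills host_out[0..n-1] with n-1, n-2, ..., 0. Returns 0 on success.
int reversed_indices_in_stream(int *host_out, int n)
{
    if (n < 1 || n > 1024) return 1;  // single-block sketch only

    size_t bytes = n * sizeof(int);
    int *dev_out = nullptr;
    cudaStream_t stream;

    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) return 1;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(dev_out); return 1; }

    // 3rd argument: bytes of dynamic shared memory; 4th argument: the stream.
    reversed_indices<<<1, n, bytes, stream>>>(dev_out, n);

    cudaError_t err = cudaGetLastError();                          // launch errors
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);   // wait for the kernel
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream);
    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : 1;
}
```

When a kernel declares no `extern __shared__` array, the 3rd argument stays `0` and only the 4th argument changes relative to a default-stream launch.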

<h2 style="color:#65AE11;">Non-Default Stream Destruction</h2>

Destroy non-default streams when you are done with them by passing a non-default stream identifier to `cudaStreamDestroy`:

```c
cudaStream_t stream;
cudaStreamCreate(&stream);

kernel<<<grid, blocks, 0, stream>>>();

cudaStreamDestroy(stream);
```
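Kernel launches are asynchronous with respect to the host, and `cudaStreamDestroy` returns immediately even if work is still pending in the stream (the stream's resources are released once that work completes). If the host needs the kernel's results, it should therefore wait on the stream explicitly, for example with `cudaStreamSynchronize`. The following sketch shows the full create, launch, synchronize, destroy sequence; the names `fill_value` and `fill_in_stream` are hypothetical, and `n` is assumed to be positive:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `data`.
__global__ void fill_value(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Returns the number of elements that do not equal `value` after the kernel
// has run in a non-default stream, or -1 if any step failed.
int fill_in_stream(int n, int value)
{
    size_t bytes = n * sizeof(int);
    int *dev = nullptr;
    cudaStream_t stream;

    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(dev); return -1; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fill_value<<<blocks, threads, 0, stream>>>(dev, n, value);

    cudaError_t err = cudaGetLastError();                          // launch errors
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);   // kernel finished
    cudaStreamDestroy(stream);                                     // done with the stream

    int mismatches = -1;
    int *host = (int *)malloc(bytes);
    if (err == cudaSuccess && host != nullptr &&
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
        mismatches = 0;
        for (int i = 0; i < n; ++i)
            if (host[i] != value) ++mismatches;
    }

    free(host);
    cudaFree(dev);
    return mismatches;
}
```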

<h2 style="color:#65AE11;">Exercise: Launch Kernel in Non-Default Stream</h2>

Open and refactor [*06_Kernels_in_Streams/baseline_cipher/baseline.cu*](baseline_cipher/baseline.cu) to launch the `decrypt_gpu` kernel (around line 65) in a non-default stream.

Generate a report file for the refactored application by opening a JupyterLab terminal and running `make profile` from within the *06_Kernels_in_Streams/baseline_cipher* directory (see the [*Makefile*](baseline_cipher/Makefile) there for details).

Open the report file in Nsight Systems. If you've closed the Nsight Systems tab, you can reopen it by following the instructions in [*Nsight Systems Setup*](../04_Nsight_Systems_Setup/Nsight_Systems_Setup.ipynb). As a reminder, the password is `nvidia`.

If you were successful, the Nsight Systems visual timeline will now present information about streams, and it will show that the kernel launch occurred in a non-default stream, as in the screenshot below.

If you get stuck, please refer to [06_Kernels_in_Streams/baseline_cipher/baseline_solution.cu](../06_Kernels_in_Streams/baseline_cipher/baseline_solution.cu).

In [1]:
%%bash
# Print the current directory to verify the starting point
pwd

cd baseline_cipher

module load cuda/12.6

make profile

./baseline




/gpfs/home/scortinhal/CHPS0904/task/06_Kernels_in_Streams
nsys profile --stats=true --force-overwrite=true -o baseline-report ./baseline
TIMING: 37.884 ms (allocate memory)
TIMING: 433124 ms (encrypt data on CPU)
TIMING: 1.46874 ms (copy data from CPU to GPU)
TIMING: 64.5848 ms (decrypt data on GPU)
TIMING: 3.47162 ms (copy data from GPU to CPU)
TIMING: 74.7554 ms (total time on GPU)
STATUS: test passed
TIMING: 56.2705 ms (checking result on CPU)
TIMING: 11.8645 ms (free memory)
Collecting data...
Generating '/tmp/nsys-report-31e5.qdstrm'


SKIPPED: /gpfs/home/scortinhal/CHPS0904/task/06_Kernels_in_Streams/baseline_cipher/baseline-report.sqlite does not contain NV Tools Extension (NVTX) data.


[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)    Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  -----------  --------  ---------  -----------  ----------------------
     99.9     433516480672       4337  99957685.2  100116096.0      1920  176350112    4891513.3  poll                  
      0.1        336261024        768    437839.9      29504.0      1024   35197376    2249517.3  ioctl                 
      0.0         24810304          4   6202576.0      32016.0      7968   24738304   12357164.9  fread                 
      0.0         20854912          1  20854912.0   20854912.0  20854912   20854912          0.0  pthread_cond_wait     
      0.0         15854112          9   1761568.0    1468544.0   1385760    2858208     523644.8  fflush                
      0.0          7348096         37    198597.2      14848.0      6720

![kernel_in_stream](images/kernel_in_stream.png)

In [4]:
%%bash
# Print the current directory to verify the starting point
pwd

cd kernel_baseline_cipher

module load cuda/12.6

make profile

./baseline

/gpfs/home/scortinhal/CHPS0904/task/06_Kernels_in_Streams
nsys profile --stats=true --force-overwrite=true -o baseline-report ./baseline
TIMING: 40.3016 ms (allocate memory)
TIMING: 433275 ms (encrypt data on CPU)
TIMING: 1.46723 ms (copy data from CPU to GPU)
TIMING: 35.321 ms (decrypt data on GPU)
TIMING: 3.47434 ms (copy data from GPU to CPU)
TIMING: 44.8525 ms (total time on GPU)
STATUS: test passed
TIMING: 56.906 ms (checking result on CPU)
TIMING: 10.1205 ms (free memory)
Collecting data...
Generating '/tmp/nsys-report-6d62.qdstrm'


SKIPPED: /gpfs/home/scortinhal/CHPS0904/task/06_Kernels_in_Streams/kernel_baseline_cipher/baseline-report.sqlite does not contain NV Tools Extension (NVTX) data.


[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)    Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  -----------  --------  ---------  -----------  ----------------------
     99.9     433740902272       4339  99963333.1  100116768.0      2016  173509312    4887850.3  poll                  
      0.1        393957440        770    511633.0      27824.0      1024   37691968    2715625.2  ioctl                 
      0.0         15389824          9   1709980.4    1509312.0   1309984    2614368     488999.1  fflush                
      0.0          8830976          1   8830976.0    8830976.0   8830976    8830976          0.0  pthread_cond_broadcast
      0.0          6689920         37    180808.6      15744.0      5728    6063872     994059.3  mmap                  
      0.0          1084736          2    542368.0     542368.0    519904

<h2 style="color:#65AE11;">Next</h2>

Now that you can launch kernels in non-default streams, the next section will show you how to perform memory transfers in non-default streams.

Please continue to the next section: [*Memcpy in Streams*](../07_Memcpy_in_Streams/Memcpy_in_Streams.ipynb).

<h2 style="color:#65AE11;">Optional Further Study</h2>

The following are for students with time and interest to do additional study on topics related to this workshop.

* In scenarios where a single kernel is unable to saturate the device, you might consider using streams to [launch multiple kernels simultaneously](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#concurrent-kernel-execution).
* For full coverage of CUDA stream management functions, see [Stream Management](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html) in the CUDA Runtime API docs.
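
As a starting point for the first item above, the following sketch launches one small kernel in each of two non-default streams. Kernels in different streams are permitted to overlap, but whether they actually do depends on the device and on how much of it each kernel occupies, so this is an illustration rather than a guarantee of concurrency. The names `fill_value` and `fill_in_two_streams` are hypothetical, and `n` is assumed to be positive:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `data`.
__global__ void fill_value(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Fills one array per stream (array s gets the value s + 1) and verifies the
// results. Returns 0 when everything succeeded, nonzero otherwise.
int fill_in_two_streams(int n)
{
    const int kStreams = 2;
    cudaStream_t streams[kStreams] = {nullptr, nullptr};
    int *dev[kStreams] = {nullptr, nullptr};
    size_t bytes = n * sizeof(int);
    int failures = 0;

    for (int s = 0; s < kStreams; ++s) {
        failures += (cudaStreamCreate(&streams[s]) != cudaSuccess);
        failures += (cudaMalloc(&dev[s], bytes) != cudaSuccess);
    }

    if (failures == 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;

        // Both launches return immediately; the two streams are independent.
        for (int s = 0; s < kStreams; ++s)
            fill_value<<<blocks, threads, 0, streams[s]>>>(dev[s], n, s + 1);
        failures += (cudaGetLastError() != cudaSuccess);

        // Wait for each stream before reading its results on the host.
        std::vector<int> host(n);
        for (int s = 0; s < kStreams; ++s) {
            failures += (cudaStreamSynchronize(streams[s]) != cudaSuccess);
            failures += (cudaMemcpy(host.data(), dev[s], bytes,
                                    cudaMemcpyDeviceToHost) != cudaSuccess);
            for (int i = 0; i < n; ++i)
                failures += (host[i] != s + 1);
        }
    }

    for (int s = 0; s < kStreams; ++s) {
        if (streams[s] != nullptr) cudaStreamDestroy(streams[s]);
        cudaFree(dev[s]);  // cudaFree(nullptr) is a no-op
    }
    return failures;
}
```

Profiling a program that calls this function in Nsight Systems should show each kernel in its own stream row on the timeline, which is the easiest way to see whether the two kernels overlapped.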