Merged
27 commits
6a7ffa7
HOST NVSHMEM EXAMPLE
Oct 13, 2021
f7b0db0
boundary conditions is not working correctly yet
Oct 13, 2021
cbc3d0b
Fixed rounding error due to not correctly waiting for the compute ker…
Oct 14, 2021
2b0f180
First draft of the MPI Overalp version to be tested
Oct 19, 2021
6789073
Tested and working properly, also added Instructions.md and copy.mk
Oct 19, 2021
08b72c8
first draft of NCCL version, needs to be tested for correctness
Oct 27, 2021
495b91f
Merge branch 'main' of https://github.com/simongdg/tutorial-multi-gpu…
Oct 27, 2021
f599f89
NCCL and host-side NVSHMEM with overlap first version, both need to b…
Oct 27, 2021
bceb8fa
6-H Pull request edits have been implemented, 8-H NCCL is also ready …
Oct 28, 2021
2ad3188
Update 8-H_NCCL_NVSHMEM/NCCL/Instructions.md
simongdg Oct 28, 2021
f3aa824
Update 8-H_NCCL_NVSHMEM/NCCL/copy.mk
simongdg Oct 28, 2021
65f4a4a
Update 8-H_NCCL_NVSHMEM/NCCL/jacobi.cpp
simongdg Oct 28, 2021
87fa989
Update 6-H_Overlap_Communication_and_Computation_MPI/jacobi.cpp
simongdg Oct 28, 2021
b29b6c4
Update 8-H_NCCL_NVSHMEM/NCCL/jacobi.cpp
simongdg Oct 28, 2021
8371ce4
Addressing pull request comments on TODOs spacing and instructions fo…
Oct 28, 2021
c8a21dc
Merge branch 'main' of https://github.com/simongdg/tutorial-multi-gpu…
Oct 28, 2021
734da80
NCCL warmup has been added as a TODO, and added to the instructions
Oct 29, 2021
78a3f92
Update 6-H_Overlap_Communication_and_Computation_MPI/jacobi.cpp
simongdg Oct 29, 2021
2764eed
Update 6-H_Overlap_Communication_and_Computation_MPI/jacobi.cpp
simongdg Oct 29, 2021
c96a255
Update 6-H_Overlap_Communication_and_Computation_MPI/jacobi.cpp
simongdg Oct 29, 2021
ce10df8
Update 6-H_Overlap_Communication_and_Computation_MPI/jacobi.cpp
simongdg Oct 29, 2021
2a4de59
Update 8-H_NCCL_NVSHMEM/NCCL/jacobi.cpp
simongdg Oct 29, 2021
d691b6c
Update 8-H_NCCL_NVSHMEM/NCCL/jacobi.cpp
simongdg Oct 29, 2021
21238b8
Update 8-H_NCCL_NVSHMEM/NCCL/jacobi.cpp
simongdg Oct 29, 2021
af0dc3d
Update 8-H_NCCL_NVSHMEM/NCCL/jacobi.cpp
simongdg Oct 29, 2021
5361260
Fixed indentation on issue on 6-H and added jacobi.cpp to the copy.mk…
Oct 29, 2021
62bef40
Update 8-H_NCCL_NVSHMEM/NCCL/Instructions.md
simongdg Oct 29, 2021
46 changes: 46 additions & 0 deletions 6-H_Overlap_Communication_and_Computation_MPI/Instructions.md
@@ -0,0 +1,46 @@
# SC21 Tutorial: Efficient Distributed GPU Programming for Exascale

- Time: Sunday, 14 November 2021 8AM - 5PM CST
- Location: *online*
- Program Link: https://sc21.supercomputing.org/presentation/?id=tut138&sess=sess188


## Hands-On 6: Overlap Communication and Computation with MPI

### Task 0: Profile the non-Overlap MPI-CUDA version of the code using Nsight Systems to discover areas of possible compute/communication overlap

#### Description
The purpose of this task is to use the Nsight Systems profiler to profile the starting point of this hands-on, the non-overlapping MPI variant of the Jacobi solver. The objective is to become familiar with navigating the GUI and to identify possible areas where computation and communication can overlap.

- STEPS TO BE ADDED HERE
Review comment (Collaborator):
Note: I'm writing up some draft steps to be included here, but I think it makes sense to merge your PR first so that we can build on top of it.


### Task 1: Overlap Communication and Computation using high priority streams and hide launch time for halo processing kernels

#### Description

The purpose of this task is to overlap computation and communication based on the profiling done in the previous task. The starting point of this task is the non-overlapping MPI variant of the Jacobi solver. You need to work on `TODOs` in `jacobi.cu`:

- Initialize a priority range to be used by the CUDA streams
- Create new top and bottom CUDA streams and corresponding CUDA events
- Initialize all streams using priorities
- Modify the original jacobi kernel launch to not compute the top and bottom regions
- Launch additional jacobi kernels for the top and bottom regions using the high-priority streams
- Wait on both top and bottom streams when calculating the norm
- Synchronize top and bottom streams before applying the periodic boundary conditions using MPI
- Destroy the additional CUDA streams and events before ending the application
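
The domain split behind the kernel-launch steps above can be sketched in plain C++, without any CUDA or MPI calls. This is only an illustration under stated assumptions: the names `RowRange`, `RowSplit`, `split_rows`, `Neighbors`, and `periodic_neighbors` are hypothetical and do not come from `jacobi.cu`, and the sketch assumes each rank owns a contiguous, half-open range of rows `[iy_start, iy_end)` with at least two rows. The single top and bottom rows are the ones a neighbor needs, so they would go to the high-priority streams, while the interior rows go to the original compute stream.

```cpp
#include <cassert>

// Half-open range of grid rows [begin, end) handled by one kernel launch.
struct RowRange {
    int begin;
    int end;
};

// Hypothetical helper type: the three launches one rank would issue.
struct RowSplit {
    RowRange top;       // first owned row, sent to the top neighbor
    RowRange interior;  // rows that no neighbor depends on
    RowRange bottom;    // last owned row, sent to the bottom neighbor
};

// Split the rows [iy_start, iy_end) owned by one rank.
// Assumption: the rank owns at least two rows.
inline RowSplit split_rows(int iy_start, int iy_end) {
    assert(iy_end - iy_start >= 2);
    return RowSplit{{iy_start, iy_start + 1},
                    {iy_start + 1, iy_end - 1},
                    {iy_end - 1, iy_end}};
}

// Ranks that receive the boundary rows when the domain is periodic.
struct Neighbors {
    int top;
    int bottom;
};

// Periodic wrap-around: rank 0 has rank size-1 as its top neighbor,
// and rank size-1 has rank 0 as its bottom neighbor.
inline Neighbors periodic_neighbors(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    return Neighbors{(rank + size - 1) % size, (rank + 1) % size};
}
```

With this split, the interior launch is usually the largest piece of work, so the two one-row boundary launches can finish early on the high-priority streams and their MPI exchange can proceed while the interior is still being computed.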

Compile with

``` {.bash}
make
```

Submit your compiled application to the batch system with

``` {.bash}
make run
```

Study the performance by inspecting the profile generated with
`make profile`. For `make run` and `make profile`, the environment variable `NP` sets the number of processes (default: 1), for example `NP=4 make run`.

41 changes: 41 additions & 0 deletions 6-H_Overlap_Communication_and_Computation_MPI/Makefile
@@ -0,0 +1,41 @@
# Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
NP ?= 1
NVCC=nvcc
MPICXX=mpicxx
JSC_SUBMIT_CMD ?= srun --gres=gpu:4 --ntasks-per-node 4
CUDA_HOME ?= /usr/local/cuda
GENCODE_SM30 := -gencode arch=compute_30,code=sm_30
GENCODE_SM35 := -gencode arch=compute_35,code=sm_35
GENCODE_SM37 := -gencode arch=compute_37,code=sm_37
GENCODE_SM50 := -gencode arch=compute_50,code=sm_50
GENCODE_SM52 := -gencode arch=compute_52,code=sm_52
GENCODE_SM60 := -gencode arch=compute_60,code=sm_60
GENCODE_SM70 := -gencode arch=compute_70,code=sm_70
GENCODE_SM80 := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
GENCODE_FLAGS := $(GENCODE_SM70) $(GENCODE_SM80)
ifdef DISABLE_CUB
NVCC_FLAGS = -Xptxas --optimize-float-atomics
else
NVCC_FLAGS = -DHAVE_CUB
endif
NVCC_FLAGS += -lineinfo $(GENCODE_FLAGS) -std=c++14
MPICXX_FLAGS = -DUSE_NVTX -I$(CUDA_HOME)/include -std=c++14
LD_FLAGS = -L$(CUDA_HOME)/lib64 -lcudart -lnvToolsExt
jacobi: Makefile jacobi.cpp jacobi_kernels.o
	$(MPICXX) $(MPICXX_FLAGS) jacobi.cpp jacobi_kernels.o $(LD_FLAGS) -o jacobi

jacobi_kernels.o: Makefile jacobi_kernels.cu
	$(NVCC) $(NVCC_FLAGS) jacobi_kernels.cu -c

.PHONY: clean
clean:
	rm -f jacobi jacobi_kernels.o *.nsys-rep jacobi.*.compute-sanitizer.log

sanitize: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) compute-sanitizer --log-file jacobi.%q{SLURM_PROCID}.compute-sanitizer.log ./jacobi -niter 10

run: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) ./jacobi

profile: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) nsys profile --trace=mpi,cuda,nvtx -o jacobi.%q{SLURM_PROCID} ./jacobi -niter 10
40 changes: 40 additions & 0 deletions 6-H_Overlap_Communication_and_Computation_MPI/copy.mk
@@ -0,0 +1,40 @@
#!/usr/bin/make -f
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
TASKDIR = ../../tasks/6-H_Overlap_Communication_and_Computation_MPI/
SOLUTIONDIR = ../../solutions/6-H_Overlap_Communication_and_Computation_MPI

PROCESSFILES = jacobi.cu
COPYFILES = Makefile Instructions.ipynb Instructions.md


TASKPROCCESFILES = $(addprefix $(TASKDIR)/,$(PROCESSFILES))
TASKCOPYFILES = $(addprefix $(TASKDIR)/,$(COPYFILES))
SOLUTIONPROCCESFILES = $(addprefix $(SOLUTIONDIR)/,$(PROCESSFILES))
SOLUTIONCOPYFILES = $(addprefix $(SOLUTIONDIR)/,$(COPYFILES))

.PHONY: all task
all: task
task: ${TASKPROCCESFILES} ${TASKCOPYFILES} ${SOLUTIONPROCCESFILES} ${SOLUTIONCOPYFILES}


${TASKPROCCESFILES}: $(PROCESSFILES)
	mkdir -p $(TASKDIR)/
	cppp -USOLUTION $(notdir $@) $@

${SOLUTIONPROCCESFILES}: $(PROCESSFILES)
	mkdir -p $(SOLUTIONDIR)/
	cppp -DSOLUTION $(notdir $@) $@


${TASKCOPYFILES}: $(COPYFILES)
	mkdir -p $(TASKDIR)/
	cp $(notdir $@) $@

${SOLUTIONCOPYFILES}: $(COPYFILES)
	mkdir -p $(SOLUTIONDIR)/
	cp $(notdir $@) $@

%.ipynb: %.md
	pandoc $< -o $@
	# add metadata so this is seen as python
	jq -s '.[0] * .[1]' $@ ../template.json | sponge $@