In [47]:
# Please execute (Shift+Return) this cell every time you run the notebook. Don't edit it.
%load_ext autoreload
%autoreload 2
from notebook import * 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Amdahl's Law

## The starting example -- sorting

In [None]:
render_code("./gpusort/main.cu", show="main")

## Which part of my program is the most time-critical?

In [48]:
! cd gpusort; make clean; make
! echo "File on H.D.D.; Sorting on CPU"
! cd ./gpusort; echo "ET,FileInput,CPU_Kernel,GPU_Kernel,Host2GPU,GPU2Host" > sort.csv; source ./run_CPU 2>> sort.csv

rm -f	*.o hybridsort hybridsort_cpu
/usr/local/cuda/bin/nvcc -DTIMER -O3 -w   -DCPU -DHAVE_LINUX_PERF_EVENT_H -DREADING_FROM_BINARY         main.cu -o hybridsort_cpu
/usr/local/cuda/bin/nvcc -DTIMER -O3 -w   -DHAVE_LINUX_PERF_EVENT_H -DREADING_FROM_BINARY         main.cu -o hybridsort
File on H.D.D.; Sorting on CPU
Sorting list of 134217728 floats
FileInput 1.973035 seconds
Sorting on CPU...done.
Total CPU execution time: 14.929325 seconds


In [None]:
display_df_mono(render_csv("./gpusort/sort.csv", columns=["ET","FileInput","CPU_Kernel"]))

In [None]:
! lscpu

### Use gprof to figure out the timing breakdown

In [None]:
! cd gpusort; make clean; make EXTRA_FLAGS=-pg 
! cd ./gpusort; source ./run_CPU

In [None]:
! cd gpusort; gprof ./hybridsort_cpu ./gmon.out

## Amdahl's Law -- optimization is a moving target

In [49]:
render_code("./gpusort/main.cu", lang="c++", show="bitonic_sort")

In [50]:
! nvidia-smi -a



Timestamp                                 : Tue Oct 10 15:45:23 2023
Driver Version                            : 535.54.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 4070
    Product Brand                         : GeForce
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
 

In [51]:
! cd gpusort; make clean; make
# ! ssh htseng@azelf "source ./courses/CS203/demo/amdahlslaw/gpusort/run_CPU"
! echo "File on H.D.D.; Sorting on GPU"
! cd gpusort; source ./run 2>> sort.csv

rm -f	*.o hybridsort hybridsort_cpu
/usr/local/cuda/bin/nvcc -DTIMER -O3 -w   -DCPU -DHAVE_LINUX_PERF_EVENT_H -DREADING_FROM_BINARY         main.cu -o hybridsort_cpu
/usr/local/cuda/bin/nvcc -DTIMER -O3 -w   -DHAVE_LINUX_PERF_EVENT_H -DREADING_FROM_BINARY         main.cu -o hybridsort
File on H.D.D.; Sorting on GPU
Sorting list of 134217728 floats
FileInput 1.829851 seconds
Sorting on GPU...GPU iterations: 1
Total GPU Sort execution time: 0.766892 seconds
    - Upload		: 0.042830 seconds
    - Download		: 0.146355 seconds
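
The two runs above give enough numbers to try Amdahl's Law by hand. The sketch below is only an illustration, and it rests on two assumptions that the output does not confirm: that the end-to-end time of the CPU run is `FileInput` plus the reported CPU execution time, and that the reported GPU sort time already includes the upload and download steps. The function and variable names are made up for this example.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the run time gets `local_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# Timings copied from the CPU and GPU runs above (seconds).
file_input = 1.973035   # FileInput, CPU run
cpu_sort = 14.929325    # Total CPU execution time
gpu_sort = 0.766892     # Total GPU Sort execution time

# Assumption: the whole CPU run is file input followed by the sort.
total_cpu_run = file_input + cpu_sort
sort_fraction = cpu_sort / total_cpu_run   # share of time the GPU can help with
sort_speedup = cpu_sort / gpu_sort         # how much faster the sort itself became

overall = amdahl_speedup(sort_fraction, sort_speedup)
limit = 1.0 / (1.0 - sort_fraction)        # bound with an infinitely fast sort

print(f"fraction spent sorting : {sort_fraction:.3f}")
print(f"speedup of the sort    : {sort_speedup:.2f}x")
print(f"overall speedup        : {overall:.2f}x")
print(f"best possible speedup  : {limit:.2f}x")
```

Under these assumptions the overall speedup is far smaller than the speedup of the sort alone, because the file input is untouched by the GPU and becomes the dominant cost.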


In [None]:
display_df_mono(render_csv("./gpusort/sort.csv"))

In [None]:
! echo "File on S.S.D.; Sorting on GPU"
! cd gpusort; source ./run_SSD 2>> sort.csv

In [None]:
display_df_mono(render_csv("./gpusort/sort.csv"))

## Amdahl's Law on parallel programming

In [None]:
! cd vmul; make clean; make
! echo "THREADS,CPUTIME,HOST2GPU,GPUTIME,GPU2HOST" > ./vmul/vmul.csv
! echo "CPU-based vmul"
! time ./vmul/vmul 33554432 1 0 30 2>> ./vmul/vmul.csv
! echo "GPU-based vmul"
### i is the number of iterations each thread performs
### -- the larger the number, the less parallelism
! for i in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192; do time ./vmul/vmul 33554432 $i 1 30 2>> ./vmul/vmul.csv ; done

In [None]:
df = render_csv("./vmul/vmul.csv")
df["TOTAL"] = df["CPUTIME"] + df["HOST2GPU"] + df["GPUTIME"] + df["GPU2HOST"]
df = df.sort_values(by=["THREADS"], ascending=True)
display_df_mono(df)
plotPE(df=df, lines=True, what=[ ('THREADS', "TOTAL"), ('THREADS', "GPUTIME")], columns=2)
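
For comparison with the measured curve, here is a minimal sketch of what Amdahl's Law predicts when only part of a program can be spread across parallel workers. The parallel fraction of 0.9 is a hypothetical value chosen for illustration, not something measured from `vmul`, and `parallel_speedup` is a helper written for this example.

```python
def parallel_speedup(parallel_fraction, workers):
    """Amdahl's Law when the parallel part is split evenly over `workers` units."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# Hypothetical fraction for illustration only; not measured from vmul.
PARALLEL_FRACTION = 0.9

for workers in (1, 2, 4, 8, 16, 64, 256, 1024, 8192):
    speedup = parallel_speedup(PARALLEL_FRACTION, workers)
    print(f"{workers:>5} workers -> speedup {speedup:.2f}x")

# The serial part never shrinks, so the speedup can never exceed this bound.
print(f"upper bound: {1.0 / (1.0 - PARALLEL_FRACTION):.2f}x")
```

The speedup flattens out as workers are added because the serial part stays the same size. Fixed costs such as the host-to-GPU and GPU-to-host transfers play a similar role in the `TOTAL` column.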

# Choose the "right" metrics

## Throughput and Latency

### GPU performance

Let's compare the performance of matrix multiplications of different sizes on the GPU to get a feeling for the difference between "throughput" and "latency".

In [54]:
! cd ./metrics; ./cudamm 16 1

Data Type Size: 4
Time elapsed on matrix multiplication of 16x16 . 16x16 on GPU: 0.084192 ms.


Throughput: 0.10 GFLOPS



In [55]:
! cd ./metrics; ./cudamm 32 1

Data Type Size: 4
Time elapsed on matrix multiplication of 32x32 . 32x32 on GPU: 0.080416 ms.


Throughput: 0.81 GFLOPS



In [57]:
! cd ./metrics; ./cudamm 64 1

Data Type Size: 4
Time elapsed on matrix multiplication of 64x64 . 64x64 on GPU: 0.082752 ms.


Throughput: 6.34 GFLOPS



What do you find regarding the "latencies" of these three cases?

What do you find regarding the "throughput" of these three cases?
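
One way to connect the two numbers is to recompute the throughput from the elapsed time. The sketch below assumes that an n x n matrix multiplication is counted as 2·n³ floating-point operations (one multiply and one add per inner-loop step). The source of `cudamm` is not shown here, so that count is an assumption, although it does reproduce the three reported figures. `matmul_gflops` is a helper written for this example.

```python
def matmul_gflops(n, elapsed_ms):
    """Throughput in GFLOPS for an n x n matrix multiply taking `elapsed_ms` ms."""
    flops = 2 * n**3                    # assumed operation count: 2 * n^3
    seconds = elapsed_ms * 1e-3
    return flops / seconds / 1e9

# (matrix size, elapsed time in ms) copied from the three runs above.
runs = [(16, 0.084192), (32, 0.080416), (64, 0.082752)]

for n, ms in runs:
    print(f"{n:>3}x{n:<3} latency {ms:.6f} ms   throughput {matmul_gflops(n, ms):.2f} GFLOPS")
```

Throughput here is work divided by latency, so the same elapsed time can correspond to very different throughput depending on how much work was done in it.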

In [None]:
### SSD vs. HDD

You may use hdparm (it needs root permission to execute). The /dev/sda on this machine is a SATA SSD with around 450-500 MB/sec of bandwidth. The /dev/md0 is a RAID array of two HDDs in a RAID-0 configuration that also achieves 450-500 MB/sec of bandwidth. Let's examine the bandwidth using the following command.

In [None]:
from IPython.display import IFrame
IFrame("https://hub.escalab.org:8000/user/htseng/terminals/1", width="100%", height="400")

Now, let's revisit the optimized gpusort on this machine with different array sizes.

In [None]:
! echo "Configuration,Size,ET,FileInput,CPU_Kernel,GPU_Kernel,Host2GPU,GPU2Host" > sort_small.csv
! echo "File on H.D.D"
! cd gpusort; source ./run_small 512 2>> ../sort_small.csv
! echo "File on S.S.D"
! cd gpusort; source ./run_small_SSD 512 2>> ../sort_small.csv
! echo "File on H.D.D"
! cd gpusort; source ./run_small 32768 2>> ../sort_small.csv
! echo "File on S.S.D"
! cd gpusort; source ./run_small_SSD 32768 2>> ../sort_small.csv
! echo "File on H.D.D"
! cd gpusort; source ./run_small 262144 2>> ../sort_small.csv
! echo "File on S.S.D"
! cd gpusort; source ./run_small_SSD 262144 2>> ../sort_small.csv
display_df_mono(render_csv("sort_small.csv"))

What can we observe here?

## FLOPs

In [58]:
! cd metrics; make
! cd ./metrics; ./cpumm 2048 512

make: Nothing to be done for 'all'.
Data type size: 4

Time: 4465.740 ms

Throughput: 3.85 GFLOPS



In [59]:
! cd ./metrics; ./cudamm 128 1

Data Type Size: 4
Time elapsed on matrix multiplication of 128x128 . 128x128 on GPU: 0.122976 ms.


Throughput: 34.11 GFLOPS



In [60]:
! cd ./metrics; ./cudamm 256 1

Data Type Size: 4
Time elapsed on matrix multiplication of 256x256 . 256x256 on GPU: 0.240960 ms.


Throughput: 139.25 GFLOPS



In [61]:
! cd ./metrics; ./cudamm 2048 1

Data Type Size: 4
Time elapsed on matrix multiplication of 2048x2048 . 2048x2048 on GPU: 15.069920 ms.


Throughput: 1140.01 GFLOPS



In [62]:
! cd ./metrics; ./cudamm 4096 1

Data Type Size: 4
Time elapsed on matrix multiplication of 4096x4096 . 4096x4096 on GPU: 93.704163 ms.


Throughput: 1466.73 GFLOPS



In [63]:
! cd metrics; ./cudamm 8192 1

Data Type Size: 4
Time elapsed on matrix multiplication of 8192x8192 . 8192x8192 on GPU: 675.555420 ms.


Throughput: 1627.57 GFLOPS



In [64]:
! cd metrics; ./cudamm_double 2048 1

Data Type Size: 8
Time elapsed on matrix multiplication of 2048x2048 . 2048x2048 on GPU: 54.336319 ms.


Throughput: 316.18 GFLOPS



In [None]:
! cd metrics;  ./cudamm_double 2048 0

In [None]:
! cd metrics; ./cpumm_double 2048 512

In [None]:
! cd metrics;  ./cudamm 2048 0