In [1]:
# Please execute/shift-return this cell every time you run the notebook. Don't edit it.
%load_ext autoreload
%autoreload 2
from notebook import * 

### Matrix tiling algorithm with Transposition/Registers/Rectangular/Prefetch



## Prefetch

x86 provides prefetch instructions. As a programmer, you can insert ```_mm_prefetch``` calls in x86 programs to prefetch data in software. The gcc compiler also has a flag, ```-fprefetch-loop-arrays```, that automatically inserts software prefetch instructions.
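As a minimal sketch of what a manual prefetch can look like (this is not taken from the course code): `_mm_prefetch` is declared in `<xmmintrin.h>` and takes a `const char*` address plus a locality hint such as `_MM_HINT_T0`. The names `PREFETCH_T0`, `sum_with_prefetch`, and `kDistance`, as well as the distance of 16 elements, are assumptions chosen for illustration. A prefetch is only a hint, so the result is the same with or without it.

```cpp
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
// Hypothetical wrapper macro: hint that the line at addr will be read soon.
#define PREFETCH_T0(addr) _mm_prefetch(reinterpret_cast<const char*>(addr), _MM_HINT_T0)
#else
// On non-x86 targets this sketch simply does nothing.
#define PREFETCH_T0(addr) ((void)0)
#endif

// Hypothetical helper: sums an array while hinting the cache about
// data that will be needed a few iterations from now.
double sum_with_prefetch(const std::vector<double>& data) {
    // Assumed prefetch distance in elements; the best value is machine dependent.
    constexpr std::size_t kDistance = 16;
    double total = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + kDistance < data.size()) {
            // Request the cache line that holds data[i + kDistance].
            PREFETCH_T0(&data[i + kDistance]);
        }
        total += data[i];
    }
    return total;
}
```

Calling `sum_with_prefetch` on a vector returns the plain sum of its elements; only the timing may differ, and whether it improves depends on the machine and on how well the hardware prefetcher already handles the access pattern.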

### Using prefetch in matrix transpose code

The following example is a highly optimized matrix transpose. In this example, we try to prefetch the next row.

In [2]:
render_code("./prefetch/transpose.cpp", lang="c++", show=["//START", "//END"])
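To make the "prefetch the next row" idea concrete, here is a simplified sketch. It is not the course's `transpose.cpp`: it is a plain scalar transpose with none of the SSE/AVX optimizations, and the names `PREFETCH_T0`, `transpose_prefetch_next_row`, and `kLine` are assumptions for illustration. It does reuse the `ENABLE_PREFETCH` macro that the build passes with `-DENABLE_PREFETCH`, and it assumes 64-byte cache lines.

```cpp
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#define PREFETCH_T0(addr) _mm_prefetch(reinterpret_cast<const char*>(addr), _MM_HINT_T0)
#else
#define PREFETCH_T0(addr) ((void)0)  // no-op on non-x86 targets
#endif

// Hypothetical sketch: out-of-place transpose of an n x n row-major matrix.
// While row i is being read, the cache is hinted about row i + 1.
void transpose_prefetch_next_row(const std::vector<double>& in,
                                 std::vector<double>& out,
                                 std::size_t n) {
    // Elements per cache line, assuming 64-byte lines.
    constexpr std::size_t kLine = 64 / sizeof(double);
    (void)kLine;  // unused when ENABLE_PREFETCH is not defined
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
#ifdef ENABLE_PREFETCH
            // Once per cache line, request the matching part of the next row.
            if (i + 1 < n && j % kLine == 0) {
                PREFETCH_T0(&in[(i + 1) * n + j]);
            }
#endif
            out[j * n + i] = in[i * n + j];
        }
    }
}
```

Compiling the same file with and without `-DENABLE_PREFETCH` gives two binaries that produce identical output, so any timing difference comes from the prefetch hints alone.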

Now, let's take a look at what's happening!

In [3]:
! cd prefetch; make clean; make
# ! echo "Without prefetch -- the baseline"; ssh htseng@celebi "lscpu | grep Model; cd courses/CS203/demo/memory/prefetch/; ./transpose"
! echo "Without prefetch -- the baseline"
! lscpu | grep Model
! ./prefetch/transpose
! echo "With prefetch"
! ./prefetch/transpose_prefetch

rm -f blockmm_sse blockmm blockmm_sse_prefetch transpose transpose_prefetch
g++ -msse4.1 -mavx -O3 transpose.cpp -o transpose 
g++ -msse4.1 -mavx -O3 -DENABLE_PREFETCH transpose.cpp -o transpose_prefetch 
Without prefetch -- the baseline
Model name:                              Intel(R) Core(TM) i7-14700K
Model:                                   183
bytes = 1073741824
Starting Data Transpose...   Done
Time: 0.13262 seconds
With prefetch
bytes = 1073741824
Starting Data Transpose...   Done
Time: 0.106772 seconds


Let's try a different machine now.

In [4]:
! ssh htseng@xerneas "cd /nfshome/htseng/courses/CSE142/demo/matrix_mul/; make -C ./prefetch clean; make -C ./prefetch ; lscpu | grep Model"
! echo "Without prefetch -- the baseline"; ssh htseng@xerneas  "/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch/transpose"
! echo "With prefetch";  ssh htseng@xerneas  "/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch/transpose_prefetch"

make: Entering directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
rm -f blockmm_sse blockmm blockmm_sse_prefetch transpose transpose_prefetch
make: Leaving directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
make: Entering directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
g++ -msse4.1 -mavx -O3 transpose.cpp -o transpose 
g++ -msse4.1 -mavx -O3 -DENABLE_PREFETCH transpose.cpp -o transpose_prefetch 
make: Leaving directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
Model name:                           AMD Ryzen 9 5950X 16-Core Processor
Model:                                33
Without prefetch -- the baseline
bytes = 1073741824
Starting Data Transpose...   Done
Time: 0.10477 seconds
With prefetch
bytes = 1073741824
Starting Data Transpose...   Done
Time: 0.098977 seconds


In [5]:
! ssh htseng@blissey "cd /nfshome/htseng/courses/CSE142/demo/matrix_mul/; make -C ./prefetch clean; make -C ./prefetch ; lscpu | grep Model"
! echo "Without prefetch -- the baseline"; ssh htseng@blissey  "/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch/transpose"
! echo "With prefetch";  ssh htseng@blissey  "/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch/transpose_prefetch"

make: Entering directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
rm -f blockmm_sse blockmm blockmm_sse_prefetch transpose transpose_prefetch
make: Leaving directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
make: Entering directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
g++ -msse4.1 -mavx -O3 transpose.cpp -o transpose 
g++ -msse4.1 -mavx -O3 -DENABLE_PREFETCH transpose.cpp -o transpose_prefetch 
make: Leaving directory '/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch'
Model name:                           AMD Ryzen 7 5700X 8-Core Processor
Model:                                33
Without prefetch -- the baseline
bytes = 1073741824
Starting Data Transpose...   Done
Time: 0.10448 seconds
With prefetch
bytes = 1073741824
Starting Data Transpose...   Done
Time: 0.098491 seconds


In [6]:
! ssh htseng@eevee "cd /nfshome/htseng/courses/CSE142/demo/matrix_mul/; make -C ./prefetch clean; make -C ./prefetch ; lscpu | grep Model"
! echo "Without prefetch -- the baseline"; ssh htseng@eevee  "/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch/transpose"
! echo "With prefetch";  ssh htseng@eevee  "/nfshome/htseng/courses/CSE142/demo/matrix_mul/prefetch/transpose_prefetch"




-- It doesn't always work!
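To put numbers on how much the benefit varies, the speedup can be computed from the timings recorded above as the baseline time divided by the time with prefetch. These are single runs, so the ratios are only indicative.

$$
\text{speedup} = \frac{T_{\text{baseline}}}{T_{\text{prefetch}}}
$$

$$
\begin{aligned}
\text{Intel Core i7-14700K:}\quad & 0.13262 / 0.106772 \approx 1.24 \\
\text{AMD Ryzen 9 5950X:}\quad & 0.10477 / 0.098977 \approx 1.06 \\
\text{AMD Ryzen 7 5700X:}\quad & 0.10448 / 0.098491 \approx 1.06
\end{aligned}
$$

The same source code gains about 24% on one machine and about 6% on the other two.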

In [None]:
render_code("matrix_mul/blockmm_interchange.c", show=["//START","//END"])
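As a rough sketch of what a tiled matrix multiplication with interchanged inner loops can look like: this is an assumption about the technique, not the contents of `blockmm_interchange.c`, and the function name `blockmm_interchange_sketch` and its parameters are hypothetical. The matrices are processed in `tile` x `tile` blocks, and within a block the loops run in i, k, j order so that the innermost loop walks both `b` and `c` along a row.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: c += a * b for n x n row-major matrices,
// computed block by block. tile must be at least 1.
void blockmm_interchange_sketch(const std::vector<double>& a,
                                const std::vector<double>& b,
                                std::vector<double>& c,
                                std::size_t n, std::size_t tile) {
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t kk = 0; kk < n; kk += tile) {
            for (std::size_t jj = 0; jj < n; jj += tile) {
                // Clamp block bounds so n need not be a multiple of tile.
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        // a[i][k] is reused for the whole inner loop.
                        const double a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

With `c` initialized to zero, the function leaves the ordinary matrix product in `c` for any tile size of 1 or more; the tile size only changes the order in which the partial products are accumulated.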

In [None]:
! cd matrix_mul; rm -f blockmm_interchange; make blockmm_interchange; echo "size,tile_size,IC,Cycles,CPI,CT_ns,ET_s,DL1_miss_rate,DL1_misses,DL1_accesses" > blockmm_interchange.csv
! ./matrix_mul/blockmm_interchange 2048 8 >> ./matrix_mul/blockmm_interchange.csv 
! ./matrix_mul/blockmm_interchange 2048 16 >> ./matrix_mul/blockmm_interchange.csv 
! ./matrix_mul/blockmm_interchange 2048 32 >> ./matrix_mul/blockmm_interchange.csv 
! ./matrix_mul/blockmm_interchange 2048 64 >> ./matrix_mul/blockmm_interchange.csv
! ./matrix_mul/blockmm_interchange 2048 128 >> ./matrix_mul/blockmm_interchange.csv
! ./matrix_mul/blockmm_interchange 2048 256 >> ./matrix_mul/blockmm_interchange.csv
display_df_mono(render_csv("matrix_mul/blockmm.csv"))
display_df_mono(render_csv("matrix_mul/blockmm_interchange.csv"))


gcc -O4 -DHAVE_LINUX_PERF_EVENT_H blockmm_interchange.c perfstats.c -o blockmm_interchange
blockmm_interchange.c: In function ‘main’:
   48 |   printf("%d,%lu,",ARRAY_SIZE,tile_size);
      |              ~~^              ~~~~~~~~~
      |                |              |
      |                |              int
      |                long unsigned int
      |              %u
10521102336.000000,10521102336.000000,10521102336.000000,10521102336.000000,10521102336.000000,