<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/my_colab_gpu_topk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# task
The input is a large data file of 8.5 million entries; each entry is an integer id vector of up to 128 dimensions (called a doc), with id values in the range 0-50000.
A query is likewise an integer id vector of up to 128 dimensions. The data set may be partitioned (spread out) for optimization.

```shell
# Generate test data (ids sorted in ascending order). By default: one docs file with 10 documents, one per line, and 10 query files
make gen
```
Find the top-k (k=100) docs by average intersection score between the query and each doc. The intersection score is defined as follows:
each pair with query[i] == doc[j] (0<=i<query_size, 0<=j<doc_size) counts as one intersection, and the score is the number of intersections between query and doc divided by max(query_size, doc_size).
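
To make the scoring rule concrete, here is a minimal Python sketch. It is not the project's C++/CUDA code. The names `intersection_score` and `topk_docs` are hypothetical. It assumes both vectors are sorted ascending without duplicate ids, and that ties go to the lower doc index (the task statement does not specify a tie rule).

```python
import heapq

def intersection_score(query, doc):
    """Score = |query ∩ doc| / max(len(query), len(doc)); inputs sorted ascending."""
    i = j = hits = 0
    # Two-pointer merge over the two sorted id lists.
    while i < len(query) and j < len(doc):
        if query[i] == doc[j]:
            hits += 1
            i += 1
            j += 1
        elif query[i] < doc[j]:
            i += 1
        else:
            j += 1
    denom = max(len(query), len(doc)) or 1  # avoid 0/0 when both are empty
    return hits / denom

def topk_docs(query, docs, k=100):
    """Indices of the k highest-scoring docs (assumed tie rule: lower index first)."""
    scored = ((intersection_score(query, d), -idx) for idx, d in enumerate(docs))
    return [-neg_idx for _, neg_idx in heapq.nlargest(k, scored)]

docs = [[0, 1, 3], [1, 2, 3, 4], [4, 5, 6], [2, 7]]
print(topk_docs([1, 2, 3], docs, k=2))
```

`heapq.nlargest` keeps only k candidates in memory, which matches the heap-based CPU top-k idea listed under the optimizations.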

```shell
./bin/query_doc_scoring <doc_file_name> <query_file_name> <output_filename>
```

# optimize
note: this only optimizes the stand-alone instance; a distributed map/reduce (fan-out/fan-in) architecture would schedule those instances.

0. GPU device round-robin (RR) balancing per user request
1. concurrency (CPU thread pool) + parallelism (CPU OpenMP + GPU warp threads): CPU (baseline) -> CPU thread concurrency -> CPU + GPU -> CPU thread concurrency/parallelism + GPU stream concurrency/warp thread parallelism => distributed
2. find or filter: use a hashmap/bitmap (Bloom filter) in CPU/GPU global memory or GPU shared memory
3. top-k sort: heap sort (partial_sort) on the CPU -> bitonic/radix sort on the GPU for parallel top-k, then reduce the top-k results to the CPU
4. search: needs an index (list (IVF, skip list), tree, graph), i.e. an ordered structure/model
5. SIMD: CPU instruction sets (Intel SSE, AVX2, AVX-512, etc.)
6. sequential I/O stream pipeline: read the query/docs files (batch per thread, multibyte_split parallel accelerators), write the result file
7. resource pool
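
Items 2 and 3 can be illustrated on the CPU with a small Python sketch. This is an assumption-laden illustration, not the notebook's kernels: the names `build_bitmap`, `partial_topk` and `topk_chunked` are hypothetical, the "bitmap" uses one byte per id for simplicity, docs are assumed to hold no duplicate ids, and each chunk stands in for the work one thread block or stream would do before the final reduce.

```python
import heapq

MAX_ID = 50000  # id value range from the task statement: 0-50000

def build_bitmap(query):
    """One byte per possible id; 1 marks ids present in the query."""
    bitmap = bytearray(MAX_ID + 1)
    for qid in query:
        bitmap[qid] = 1
    return bitmap

def score_with_bitmap(bitmap, query_size, doc):
    """Count doc ids found in the bitmap, then normalize by the larger size."""
    hits = sum(bitmap[d] for d in doc)
    return hits / (max(query_size, len(doc)) or 1)

def partial_topk(bitmap, query_size, docs, base, k):
    """Top-k of one chunk as (score, -doc_index); `base` is the chunk's first doc index."""
    scored = ((score_with_bitmap(bitmap, query_size, d), -(base + i))
              for i, d in enumerate(docs))
    return heapq.nlargest(k, scored)

def topk_chunked(query, docs, k=100, chunk=2):
    """Compute a partial top-k per chunk, then reduce the partials to one top-k."""
    bitmap = build_bitmap(query)
    partials = []
    for start in range(0, len(docs), chunk):
        partials.extend(partial_topk(bitmap, len(query), docs[start:start + chunk], start, k))
    return [-neg for _, neg in heapq.nlargest(k, partials)]

docs = [[0, 1, 3], [1, 2, 3, 4], [4, 5, 6], [2, 7]]
print(topk_chunked([1, 2, 3], docs, k=2, chunk=2))
```

The bitmap turns each membership test into a single array lookup, and the per-chunk partial results keep the final reduce small (at most k entries per chunk).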

# reference
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
- https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
- https://docs.nvidia.com/cuda/cuda-runtime-api/index.html
- https://docs.nvidia.com/cuda/thrust/index.html
- https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
- https://nvlabs.github.io/cub/index.html
- https://stotko.github.io/stdgpu/api/memory.html
- https://www.youtube.com/watch?v=cOBtkPsgkus
- **https://www.youtube.com/watch?v=Na9_2G6niMw**
- https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Many_core_computing_with_CUDA.pdf
- [Exploring Performance Portability for Accelerators via High-level Parallel Patterns](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=4Ab_NBkAAAAJ&citation_for_view=4Ab_NBkAAAAJ:hqOjcs7Dif8C), [PPT](https://pdfs.semanticscholar.org/b34a/f7c4739d622379fa31a1e88155335061c1b1.pdf)

- https://zhuanlan.zhihu.com/p/52344300
- https://passlab.github.io/OpenMPProgrammingBook/cover.html

- https://developer.nvidia.com/blog/maximizing-performance-with-massively-parallel-hash-maps-on-gpus/

- https://github.com/rapidsai/raft/blob/branch-23.12/docs/source/vector_search_tutorial.md


## view paper
1. [Fast Segmented Sort on GPUs.](https://raw.github.com/weedge/learn/main/gpu/Fast%20Segmented%20Sort%20on%20GPUs.pdf)
2. [Efficient Top-K query processing on massively parallel hardware](https://raw.githubusercontent.com/weedge/learn/main/gpu/Efficient%20Top-K%20Query%20Processing%20on%20Massively%20Parallel%20Hardware.pdf)
3. [stdgpu: Efficient STL-like Data Structures on the GPU](https://www.researchgate.net/publication/335233070_stdgpu_Efficient_STL-like_Data_Structures_on_the_GPU)
4. [Parallel Top-K Algorithms on GPU: A Comprehensive Study and New Methods](https://sc23.supercomputing.org/presentation/?id=pap294&sess=sess156)

## view code
1. https://github.com/rapidsai/cudf/pull/8702 , https://github.com/rapidsai/cudf/blob/branch-23.12/cpp/tests/io/text/multibyte_split_test.cpp
2. https://github.com/vtsynergy/bb_segsort (k/v), https://github.com/Funatiq/bb_segsort (k,k/v)
3. https://github.com/anilshanbhag/gpu-topk
4. https://github.com/heavyai/heavydb/blob/master/QueryEngine/TopKSort.cu
5. https://github.com/rapidsai/raft/blob/branch-23.12/cpp/include/raft/neighbors/detail/cagra/topk_for_cagra/topk_core.cuh
6. https://github.com/rapidsai/raft/blob/branch-23.12/cpp/include/raft/matrix/select_k.cuh , https://github.com/rapidsai/raft/blob/branch-23.12/cpp/test/matrix/select_k.cuh

## run baseline

In [None]:
!python --version

Python 3.10.12


In [239]:
!!lsblk

['NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS',
 'loop0     7:0    0   170G  0 loop ',
 'sda       8:0    0   180G  0 disk ',
 '├─sda1    8:1    0 175.8G  0 part /opt/bin/.nvidia',
 '│                                 /etc/hosts',
 '│                                 /etc/hostname',
 '│                                 /etc/resolv.conf',
 '│                                 /usr/lib64-nvidia',
 '├─sda2    8:2    0    16M  0 part ',
 '├─sda3    8:3    0     2G  0 part ',
 '├─sda4    8:4    0    16M  0 part ',
 '├─sda5    8:5    0     2G  0 part ',
 '├─sda6    8:6    0   512B  0 part ',
 '├─sda7    8:7    0   512B  0 part ',
 '├─sda8    8:8    0    16M  0 part ',
 '├─sda9    8:9    0   512B  0 part ',
 '├─sda10   8:10   0   512B  0 part ',
 '├─sda11   8:11   0     8M  0 part ',
 '└─sda12   8:12   0    32M  0 part ']

In [None]:
!nvcc -h

In [None]:
!nvidia-smi

Fri Nov 10 04:03:27 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!nvidia-smi -q



Timestamp                                 : Fri Nov 10 04:03:40 2023
Driver Version                            : 525.105.17
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:00:04.0
    Product Name                          : NVIDIA A100-SXM4-40GB
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322120078438
    GPU UUID                           

In [None]:
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb
!apt update
!apt install ./nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb
!apt --fix-broken install


In [None]:
!wget "https://bj.bcebos.com/v1/ai-studio-online/9805dd2d2e8e472693efac637628e16b9f9c5be0fe30438bb4a80de3b386781a?responseContentDisposition=attachment%3B%20filename%3DSTI2_1017.zip&authorization=bce-auth-v1%2F5cfe9a5e1454405eb2a975c43eace6ec%2F2023-10-18T12%3A42%3A27Z%2F-1%2F%2F6b5388dcd9013bc9b340bb1806476afa938ce0c65f2f595e1a75f529e90e4187" -O STI2_1017.zip

In [None]:
!rm -rf STI2 && unzip STI2_1017.zip && mv STI2\ 2 STI2

In [None]:
!sh STI2/build.sh

In [None]:
!STI2/bin/query_doc_scoring STI2/translate/docs.txt STI2/translate/querys ./res_gpu_baseline.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!nvcc STI2/src/main.cpp STI2/src/topk.cu -o STI2/bin/query_doc_scoring_gpu  \
	-ISTI2/src \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-g


In [None]:
!STI2/bin/query_doc_scoring_gpu STI2/translate/docs.txt STI2/translate/querys ./res_3.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff res_3.txt STI2/translate/res/result.txt

1c1
< 3175
---
> 2990


In [None]:
!nvprof --print-gpu-trace STI2/bin/query_doc_scoring_gpu STI2/translate/docs.txt STI2/translate/querys ./res.txt

In [None]:
!ncu --set full --call-stack --nvtx -o report_gpu STI2/bin/query_doc_scoring_gpu STI2/translate/docs.txt STI2/translate/querys ./res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!nvcc STI2/src/main.cpp topk/topk_query_stream.cu -o STI2/bin/query_doc_scoring_gpu_stream  \
	-ISTI2/src \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-g

In [None]:
!STI2/bin/query_doc_scoring_gpu_stream STI2/translate/docs.txt STI2/translate/querys ./res_gpu_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff ./res_gpu_stream.txt STI2/translate/res/result.txt

1c1
< 2850
---
> 2990


In [None]:
!nvprof --print-gpu-trace STI2/bin/query_doc_scoring_gpu_stream STI2/translate/docs.txt STI2/translate/querys ./res_gpu_stream.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

## run topk

In [None]:
!make -C topk/ BUILD_TYPE=Release

In [None]:
!topk/bin/query_doc_scoring_cpu STI2/translate/docs.txt STI2/translate/querys ./cpu_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_res.txt STI2/translate/res/result.txt

1c1
< 87230
---
> 2990


In [None]:
!topk/bin/query_doc_scoring_cpu_concurrency STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_concurency_res.txt STI2/translate/res/result.txt

1c1
< 14206
---
> 2990


In [None]:
!make -C topk/ build_cpu_gpu BUILD_TYPE=Release NVCCFLAGS="-std=c++11"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./topk.cu -o ./bin/query_doc_scoring_cpu_gpu  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_gpu_res.txt STI2/translate/res/result.txt

1c1
< 2504
---
> 2990


In [None]:
!nvprof --print-gpu-trace topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res_1.txt

In [None]:
!nsys profile  -o a100_report_cpu_gpu.nsys-rep topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res_1.txt


In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_cpu_gpu topk/bin/query_doc_scoring_cpu_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_gpu_res_1.txt

In [None]:
!make -C topk/ build_cpu_concurrency_gpu BUILD_TYPE=Release NVCCFLAGS="-std=c++11"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./topk.cu -o ./bin/query_doc_scoring_cpu_concurrency_gpu  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-DCPU_CONCURRENCY \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [None]:
!diff cpu_concurency_gpu_res.txt STI2/translate/res/result.txt

1c1
< 2915
---
> 2990


In [None]:
!nvprof --print-gpu-trace topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt

In [None]:
!nsys profile  -o a100_report_cpu_concurrency_gpu.nsys-rep topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt


In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_cpu_concurrency_gpu topk/bin/query_doc_scoring_cpu_concurrency_gpu STI2/translate/docs.txt STI2/translate/querys ./cpu_concurency_gpu_res.txt

## insertion sort topk

In [None]:
!nvcc sum.cu -o sum

In [None]:
!./sum

Init input source[N]
CPU time: 317.27
GPU time: 11.21
Result: Error
GPU_result: 119571172;
CPU_result: 450029111;


In [None]:
!nvcc topk.cu -o topk

In [None]:
!./topk

Init source data...........
Complete init source data.....
GPU Run **************
GPU Complete!!!
CPU RUN***************
CPU Complete!!!!!CPU top1: 2147483611; GPU top1: 2147483611;
CPU top2: 2147483578; GPU top2: 2147483578;
CPU top3: 2147483526; GPU top3: 2147483526;
CPU top4: 2147483514; GPU top4: 2147483514;
CPU top5: 2147483491; GPU top5: 2147483491;
CPU top6: 2147483482; GPU top6: 2147483482;
CPU top7: 2147483417; GPU top7: 2147483417;
CPU top8: 2147483385; GPU top8: 2147483385;
CPU top9: 2147483327; GPU top9: 2147483327;
CPU top10: 2147483297; GPU top10: 2147483297;
CPU top11: 2147483267; GPU top11: 2147483267;
CPU top12: 2147483227; GPU top12: 2147483227;
CPU top13: 2147483204; GPU top13: 2147483204;
CPU top14: 2147483188; GPU top14: 2147483188;
CPU top15: 2147483183; GPU top15: 2147483183;
CPU top16: 2147483170; GPU top16: 2147483170;
CPU top17: 2147483156; GPU top17: 2147483156;
CPU top18: 2147483141; GPU top18: 2147483141;
CPU top19: 2147483140; GPU top19: 2147483140;
CPU to
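
The sources of `sum.cu` and `topk.cu` are not shown here, so the following is only a CPU-side Python sketch of what an insertion-sort top-k generally looks like, not the kernel used above. The name `insertion_topk` is hypothetical.

```python
def insertion_topk(values, k):
    """Keep the k largest values seen so far in a descending list."""
    if k <= 0:
        return []
    top = []
    for v in values:
        if len(top) == k and v <= top[-1]:
            continue  # not larger than the current k-th value, skip
        # Append, then shift smaller entries right to open the insertion slot.
        pos = len(top)
        top.append(v)
        while pos > 0 and top[pos - 1] < v:
            top[pos] = top[pos - 1]
            pos -= 1
        top[pos] = v
        if len(top) > k:
            top.pop()  # drop the value that fell out of the top k
    return top

print(insertion_topk([5, 1, 9, 3, 7], 3))
```

Each element costs at most k shifts, so this suits small k; for larger k, a heap or a GPU radix/bitonic approach (as listed in the optimizations) is usually preferred.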

## sample test

In [None]:
!make -C topk/ build_example_readfile_cpu BUILD_TYPE=Release CXXFLAGS="-std=c++11"

make: Entering directory '/content/topk'
nvcc -o bin/example_readfile_cpu example_readfile.cpp -DFMT_HEADER_ONLY \
	-I./ \
	-std=c++11 \
	-O3 \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/example_readfile_cpu STI2/translate/docs.txt line

docs_size:7853051 doc_lens_size:7853051
read file cost 33616 ms 


In [None]:
!topk/bin/example_readfile_cpu STI2/translate/docs.txt buffer

readcnt: 7 fread size: 3287461913
docs_size:7853051 doc_lens_size:7853051
read file cost 41724 ms 


In [None]:
!cd topk && nvcc ./stream.cu -o ./bin/stream && ./bin/stream

Number of device(s): 1
Device 0
    Name:                    Tesla T4
    Glocbal memory:          15101.8 MB
    Shared memory per block: 48 KB
    Warp size:               32
    Max thread per block:    1024
    Thread dimension limits: 1024 x 1024 x 64
    Max grid size:           2147483647 x 65535 x 65535
    Compute capability:      7.5
 
Generating 7680 x 4320 BRGA8888 image, data size: 132710400
 
Computing results using CPU.
 
    Whole process took 497.971ms.
 
Computing results using GPU, default stream.
 
    Move data to GPU.
        Data transfer took 12.0095ms.
        Performance is 11.0504GB/s.
    Convert 8-bit BGRA to 8-bit YUV.
        Processing of 8K image took 1.70637ms.
        Performance is 77.7736GB/s.
    Move data to CPU.
        Data transfer took 8.13226ms.
        Performance is 12.2393GB/s.
    Whole process took 21.8481ms.
    Compare CPU and GPU results ...
        Results are the same.
 
Computing results using GPU, using 16 streams.
 
    Creating 

In [None]:
!make -C topk/ build_cpu_gpu_doc_stream BUILD_TYPE=Release NVCCFLAGS="-std=c++11"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./topk_doc_stream.cu -o ./bin/query_doc_scoring_cpu_gpu_doc_stream  \
	-I./ \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-std=c++11 \
	-O3 \
	-DGPU \
	-g
make: Leaving directory '/content/topk'


In [None]:
!topk/bin/query_doc_scoring_cpu_gpu_doc_stream STI2/translate/docs.txt STI2/translate/querys ./res_gpu_doc_stream.txt

In [None]:
!diff ./res_gpu_doc_stream.txt STI2/translate/res/result.txt

# rapidsai - cudf
Use cudf's chunked `multibyte_split` and strings split for GPU-accelerated file parsing.

1. https://github.com/rapidsai/cudf/blob/branch-23.12/CONTRIBUTING.md#build-cudf-from-source
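
The chunked-read idea can be sketched on the CPU in Python. This is only an analogue under stated assumptions, not cudf's `multibyte_split` API: the function name `read_docs_chunked` is hypothetical, the file is assumed to hold one comma-separated id list per line, and the default chunk size mirrors the `chunk size: 268435456` printed by the `chunk` runs in this notebook. The key step is carrying the incomplete last line of a chunk over to the next chunk.

```python
import io

def read_docs_chunked(stream, chunk_size=1 << 28):
    """Parse one comma-separated id list per line, reading fixed-size chunks."""
    docs, tail = [], b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk).split(b"\n")
        tail = lines.pop()  # possibly incomplete last line; carry it forward
        docs.extend([int(x) for x in line.split(b",")]
                    for line in lines if line.strip())
    if tail.strip():  # file did not end with a newline
        docs.append([int(x) for x in tail.split(b",")])
    return docs

data = b"0, 1, 3\n1, 2, 3, 4\n4, 5, 6, 5\n7, 2\n"
print(read_docs_chunked(io.BytesIO(data), chunk_size=5))
```

The tiny `chunk_size=5` in the usage line forces every line to straddle a chunk boundary, which exercises the carry-over logic.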

In [None]:
!pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11

In [None]:
!git clone https://github.com/rapidsai/cudf.git

Cloning into 'cudf'...
remote: Enumerating objects: 352286, done.
remote: Counting objects: 100% (591/591), done.
remote: Compressing objects: 100% (367/367), done.
remote: Total 352286 (delta 295), reused 421 (delta 222), pack-reused 351695
Receiving objects: 100% (352286/352286), 131.07 MiB | 16.46 MiB/s, done.
Resolving deltas: 100% (260900/260900), done.


In [None]:
!cd cudf && ./build.sh --help

In [None]:
!cd cudf && ./build.sh libcudf

In [None]:
!ls /include/
!ls /lib


In [None]:
!git clone https://github.com/gabime/spdlog.git

Cloning into 'spdlog'...
remote: Enumerating objects: 27412, done.
remote: Counting objects: 100% (3852/3852), done.
remote: Compressing objects: 100% (377/377), done.
remote: Total 27412 (delta 3633), reused 3518 (delta 3462), pack-reused 23560
Receiving objects: 100% (27412/27412), 40.91 MiB | 12.04 MiB/s, done.
Resolving deltas: 100% (18468/18468), done.


In [None]:
!cd spdlog && cmake -B build -S . && make -C build -j

In [None]:
!cp -r ./spdlog/include/spdlog/fmt/bundled /include/spdlog/fmt/

In [None]:
!ls /lib/libarrow*

/lib/libarrow_acero.so		 /lib/libarrow_dataset.so	    /lib/libarrow.so
/lib/libarrow_acero.so.1400	 /lib/libarrow_dataset.so.1400	    /lib/libarrow.so.1400
/lib/libarrow_acero.so.1400.1.0  /lib/libarrow_dataset.so.1400.1.0  /lib/libarrow.so.1400.1.0


In [None]:
!tar -zcvf libcudf.tar.gz /include /lib/libcudf.so /lib/libarrow*

In [None]:
!tar -zxvf libcudf.tar.gz

In [None]:
!cd topk && nvcc example_readfile.cpp readfile.cu -o readfile -O3 --std=c++17 \
  -I./ -I/include -L/lib -lcudf -L/usr/local/cuda/lib64 -lcudart -lcuda -DGPU -DFMT_HEADER_ONLY --expt-relaxed-constexpr

In [None]:
!cat topk/data.txt

0, 1, 3
1, 2, 3, 4
4, 5, 6, 5
7, 2

In [None]:
!topk/readfile topk/data.txt chunk

file size: 34
chunk size: 268435456
 fread size: 34
 buffer: 0, 1, 3
1, 2, 3, 4
4, 5, 6, 5
7, 2

tid:0 docid:0 s:0 e:3 sub_view_size:3

tid:1 docid:1 s:3 e:7 sub_view_size:4

tid:2 docid:2 s:7 e:11 sub_view_size:4

tid:3 docid:3 s:11 e:13 sub_view_size:2
0,1,4,7,1,2,5,2,3,3,6,4,5,readcnt: 1
doccnt: 4
docs_size:0 doc_lens_size:0
read file cost 1183 ms 


In [None]:
!topk/readfile STI2/translate/docs.txt line

docs_size:7853051 doc_lens_size:7853051
read file cost 34274 ms 


In [None]:
!topk/readfile STI2/translate/docs.txt buffer

readcnt: 7 fread size: 3287461913
docs_size:7853051 doc_lens_size:7853051
read file cost 42369 ms 


In [228]:
!make -C topk/ build_example_readfile_gpu BUILD_TYPE=Release NVCCFLAGS="-std=c++17 --expt-relaxed-constexpr"

make: Entering directory '/content/topk'
nvcc -o bin/example_readfile_gpu example_readfile.cpp readfile.cu -DGPU -DFMT_HEADER_ONLY \
	-I./ \
	-std=c++17 --expt-relaxed-constexpr \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-L/lib -lcudf -I/include  \
	-O3 \
	-g
make: Leaving directory '/content/topk'


In [229]:
!topk/bin/example_readfile_gpu STI2/translate/docs.txt chunk

chunk size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 268435456
 fread size: 66239130
readcnt: 13
doccnt: 7853052
docs_size:7853052 doc_lens_size:7853052
read file cost 9271 ms 


In [225]:
!make -C topk/ build_cpu_gpu_readfile BUILD_TYPE=Release NVCCFLAGS="-std=c++17 --expt-relaxed-constexpr"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./readfile.cu ./topk.cu -o ./bin/query_doc_scoring_cpu_gpu_readfile \
	-I./ \
	-std=c++17 --expt-relaxed-constexpr \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-L/lib -lcudf -I/include  \
	-O3 \
	-DGPU -DFMT_HEADER_ONLY -DPIO \
	-g
make: Leaving directory '/content/topk'


In [226]:
!topk/bin/query_doc_scoring_cpu_gpu_readfile STI2/translate/docs.txt STI2/translate/querys ./res_cpu_gpu_readfile.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [227]:
!diff res_cpu_gpu_readfile.txt STI2/translate/res/result.txt

1c1
< 2393
---
> 2990


In [None]:
!nsys profile  -o a100_report_cpu_gpu_readfile.nsys-rep \
  topk/bin/query_doc_scoring_cpu_gpu_readfile STI2/translate/docs.txt STI2/translate/querys ./res_cpu_gpu_readfile.txt


In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_cpu_gpu_readfile \
  topk/bin/query_doc_scoring_cpu_gpu_readfile STI2/translate/docs.txt STI2/translate/querys ./res_cpu_gpu_readfile.txt

### gpu_readfile -> vec docs -> gpu_cpu_topk

1. file read cost drops from 34274 ms (line by line) to 9196 ms (GPU chunked multibyte_split), a reduction of (34274-9196)/34274 = **73.17%**
2. total cost is reduced by (35551 - 11589)/35551 = **67.40%**

---



In [238]:
!make -C topk/ build_gpu_cudf_strings BUILD_TYPE=Release NVCCFLAGS="-std=c++17 --expt-relaxed-constexpr"

make: Entering directory '/content/topk'
mkdir -p bin
nvcc ./main.cpp ./readfile.cu ./topk_doc_cudf_strings.cu -o ./bin/query_doc_scoring_gpu_cudf_strings \
	-I./ \
	-std=c++17 --expt-relaxed-constexpr \
	-L/usr/local/cuda/lib64 -lcudart -lcuda \
	-L/lib -lcudf -I/include  \
	-O3 \
	-DFMT_HEADER_ONLY -DGPU -DPIO_TOPK \
	-g
make: Leaving directory '/content/topk'


In [241]:
!topk/bin/query_doc_scoring_gpu_cudf_strings STI2/translate/docs.txt STI2/translate/querys ./res_gpu_cudf_strings.txt

start get topk
query1.txt:10, 11, 16, 17, 42, 60, 22524, 22546, 22590, 22784, 23212, 23427, 23485, 23525, 23554, 24129, 24133, 24645, 24804, 24875, 25129, 25242, 25502, 25705, 25994, 26000, 26045, 26046, 26077, 26114, 26247, 26338, 26407, 27263, 27468, 27513, 28100, 40111, 40228, 40388, 41700, 45156, 45946, 46367, 47181, 47460, 47672
query2.txt:10, 16, 18, 21, 22, 23, 30, 42, 43, 44, 45, 54, 22497, 22512, 22524, 22533, 22535, 22608, 22624, 22790, 22828, 22836, 22885, 23188, 23381, 23409, 23558, 24103, 24197, 24250, 24496, 24918, 24974, 24987, 25179, 25317, 25827, 25994, 25996, 26009, 26015, 26023, 26030, 26050, 26052, 26082, 26096, 26205, 26247, 27399, 27475, 40029, 40300, 40416, 40504, 40696, 40837, 41166, 41172, 41336, 41407, 41516, 43247, 43309, 44547, 44795, 45101, 48828
query3.txt:11, 12, 13, 14, 21, 22, 23, 33, 42, 53, 61, 1380, 1545, 1546, 1557, 1560, 1566, 1569, 1583, 1646, 1759, 1762, 1787, 1794, 1877, 1882, 1892, 2069, 2120, 2146, 2368, 2670, 2888, 3022, 3327, 3335, 22460, 22

In [242]:
!diff res_gpu_cudf_strings.txt STI2/translate/res/result.txt

1c1
< 0
---
> 2990


In [None]:
!nsys profile  -o a100_report_gpu_cudf_strings.nsys-rep \
  topk/bin/query_doc_scoring_gpu_cudf_strings STI2/translate/docs.txt STI2/translate/querys ./res_gpu_cudf_strings.txt

In [None]:
!ncu --set full --call-stack --nvtx -o a100_report_gpu_cudf_strings \
  topk/bin/query_doc_scoring_gpu_cudf_strings STI2/translate/docs.txt STI2/translate/querys ./res_gpu_cudf_strings.txt

### gpu_readfile -> gpu_chunk_topk -> gpu_cpu_topk

1. read the file in chunks and pipeline each chunk into top-k ranking on the GPU
2. total cost is reduced by (35551 - 7021)/35551 = **80.25%** compared with the `gpu baseline`
3. total cost is reduced by (11589 - 7021)/11589 = **39.42%** compared with `gpu read file chunk to cpu vec docs then load to gpu rank topk`
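
The reduction percentages quoted in this notebook can be re-derived with a few lines of Python. The helper name `reduction_pct` is hypothetical; the millisecond figures are the ones stated in the two summary lists.

```python
def reduction_pct(before_ms, after_ms):
    """Relative cost reduction in percent, rounded to two decimals."""
    return round((before_ms - after_ms) / before_ms * 100, 2)

print(reduction_pct(34274, 9196))   # file read: line by line vs GPU chunked split
print(reduction_pct(35551, 11589))  # total: baseline vs GPU read file + vec docs
print(reduction_pct(35551, 7021))   # total: baseline vs GPU chunk top-k pipeline
print(reduction_pct(11589, 7021))   # total: vec docs path vs chunk top-k pipeline
```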

---




### (gpu_readfile -> gpu_chunk_topk -> gpu_cpu_topk) + stream pool + rmm (todo)

# rapidsai - RAFT

Use select-k -> sort -> top-k with GPU acceleration.

1. https://github.com/rapidsai/raft/blob/branch-23.12/docs/source/build.md
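
The select-then-sort idea can be sketched in plain Python. This is not RAFT's `select_k` API or its algorithm. As a stand-in it uses a histogram (bucket) selection over integer hit counts, assuming counts lie in 0..128 because vectors have at most 128 dimensions. The name `select_k_then_sort` and the lower-index tie rule are hypothetical.

```python
def select_k_then_sort(hits, k, max_hits=128):
    """Indices of the k largest hit counts: bucket-select a threshold, then sort survivors."""
    hist = [0] * (max_hits + 1)
    for h in hits:
        hist[h] += 1
    # Walk buckets from high to low until at least k items are covered.
    covered, threshold = 0, 0
    for value in range(max_hits, -1, -1):
        covered += hist[value]
        if covered >= k:
            threshold = value
            break
    # Select: keep only candidates at or above the threshold.
    candidates = [i for i, h in enumerate(hits) if h >= threshold]
    # Sort: order the small candidate set; ties go to the lower index.
    candidates.sort(key=lambda i: (-hits[i], i))
    return candidates[:k]

print(select_k_then_sort([3, 0, 7, 7, 1, 5], k=3))
```

Selection discards most of the input in a linear pass, so the sort only touches roughly k elements instead of the whole score array.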

In [None]:
!apt install ninja-build

In [None]:
!git clone https://github.com/rapidsai/raft.git

In [None]:
!cd raft && ./build.sh --help

In [None]:
!ls /lib

In [None]:
!cd raft && ./build.sh libraft --compile-lib

In [None]:
!tar -zcvf libraft.tar.gz /content/raft/cpp/build/install

# profiling

In [252]:
!tar zcvf a100_gpu_topk_ncu_nsys_profile.tar.gz ./a100*
!ls -gh a100*

./a100_report_cpu_concurrency_gpu.ncu-rep
./a100_report_cpu_concurrency_gpu.nsys-rep
./a100_report_cpu_gpu.ncu-rep
./a100_report_cpu_gpu.nsys-rep
./a100_report_cpu_gpu_readfile.ncu-rep
./a100_report_cpu_gpu_readfile.nsys-rep
./a100_report_gpu_cudf_strings.ncu-rep
./a100_report_gpu_cudf_strings.nsys-rep
-rw-r--r-- 1 root  66M Nov 10 15:03 a100_gpu_topk_ncu_nsys_profile.tar.gz
-rw-r--r-- 1 root  29M Nov 10 07:23 a100_report_cpu_concurrency_gpu.ncu-rep
-rw-rw-r-- 1 root  11M Nov 10 07:21 a100_report_cpu_concurrency_gpu.nsys-rep
-rw-r--r-- 1 root 2.6M Nov 10 07:16 a100_report_cpu_gpu.ncu-rep
-rw-rw-r-- 1 root 5.8M Nov 10 07:15 a100_report_cpu_gpu.nsys-rep
-rw-r--r-- 1 root 229M Nov 10 14:23 a100_report_cpu_gpu_readfile.ncu-rep
-rw-rw-r-- 1 root 583K Nov 10 14:11 a100_report_cpu_gpu_readfile.nsys-rep
-rw-r--r-- 1 root 253M Nov 10 14:37 a100_report_gpu_cudf_strings.ncu-rep
-rw-rw-r-- 1 root 621K Nov 10 14:37 a100_report_gpu_cudf_strings.nsys-rep
