In [1]:
!nvidia-smi
!nvcc --version

Mon Jun 23 06:08:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
import os, zipfile, glob, pathlib, shutil

# Pick the first *.zip in /content (sorted, so the choice is deterministic)
zip_candidates = sorted(glob.glob('/content/*.zip'))
if zip_candidates:
    zip_path = zip_candidates[0]
else:
    raise FileNotFoundError("No .zip file found; upload CUDA-Lab02-2025.zip to /content")

print("Using:", zip_path)

extract_root = '/content/cuda_lab02'
if os.path.exists(extract_root):
    shutil.rmtree(extract_root)

with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall(extract_root)

print("Extracted to:", extract_root)
!which tree || (apt-get update -qq && apt-get install -y tree)
!tree -L 2 -F {extract_root} | head -n 40


Using: /content/CUDA-Lab02-2025.zip
Extracted to: /content/cuda_lab02
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 47.9 kB of archives.
After this operation, 116 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tree amd64 2.0.2-1 [47.9 kB]
Fetched 47.9 kB in 0s (180 kB/s)
Selecting previously unselected package tree.
(Reading database ... 126319 files and directories currently installed.)
Preparing to unpack .../tree_2.0.2-1_amd64.deb ...
Unpacking tree (2.0.2-1) ...
Setting up tree (2.0.2-1) ...
Processing triggers for man-db (2.10.2-1) ...
/content/cuda_lab02/
├── 1 

In [4]:
!pip install nvcc4jupyter

Collecting nvcc4jupyter
  Downloading nvcc4jupyter-1.2.1-py3-none-any.whl.metadata (5.1 kB)
Downloading nvcc4jupyter-1.2.1-py3-none-any.whl (10 kB)
Installing collected packages: nvcc4jupyter
Successfully installed nvcc4jupyter-1.2.1


In [5]:
%load_ext nvcc4jupyter

Detected platform "Colab". Running its setup...
Source files will be saved in "/tmp/tmpjr8pw64j".


# 1. Reduction: global vs. shared memory

We compare two variants:

* **Global memory**: files `reduction_global.*`
* **Shared memory**: files `reduction_shared.*`

Each program:

1. generates a random vector of `N = 1 << 24` elements (≈ 16.8 M),
2. runs `test_iter = 30` reductions on the GPU,
3. prints the average time per iteration and the bandwidth.

> Source: *folder `1 Reduction/…` in the archive*.
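
The bandwidth figure is presumably the size of the input divided by the time per iteration. The sketch below assumes 4-byte floats and 1 GB = 10^9 bytes; both are inferred from the printed numbers rather than read from the lab sources, and the function name is hypothetical.

```python
def effective_bandwidth_gbs(n_elements: int, time_ms: float,
                            bytes_per_element: int = 4) -> float:
    """Input bytes divided by elapsed time, in GB/s (assuming 1 GB = 1e9 bytes)."""
    total_bytes = n_elements * bytes_per_element
    seconds = time_ms * 1e-3
    return total_bytes / seconds / 1e9


N = 1 << 24  # vector length used by the lab programs

# Times per iteration as printed by the two reduction programs in this notebook.
for label, time_ms in [("global", 14.355), ("shared", 1.590)]:
    print(f"{label}: {effective_bandwidth_gbs(N, time_ms):.3f} GB/s")
```

Because the printed times are rounded to three decimals, this reproduces the reported bandwidths only to within rounding.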


In [6]:
%%bash
set -e
cd "/content/cuda_lab02/1 Reduction"

ARCH="sm_75"        # for the Tesla T4; check `!nvidia-smi` and change if needed

echo "== Compilation =="
nvcc -O3 -arch=${ARCH} -I. reduction_global.cpp  reduction_global_kernel.cu  -o reduction_global
nvcc -O3 -arch=${ARCH} -I. reduction_shared.cpp  reduction_shared_kernel.cu -o reduction_shared

echo
echo "== Global memory reduction =="
./reduction_global

echo
echo "== Shared memory reduction =="
./reduction_shared

== Compilation ==

== Global memory reduction ==
Time= 14.355 msec, bandwidth= 4.674908 GB/s
host: 0.996007, device 0.996007

== Shared memory reduction ==
Time= 1.590 msec, bandwidth= 42.203384 GB/s
host: 0.996007, device 0.996007


> **Comment:** The shared-memory variant is about 9× faster (1.590 ms vs. 14.355 ms,
> i.e. 42.2 GB/s vs. 4.7 GB/s). Its intermediate sums stay in on-chip shared memory,
> so the repeated global-memory reads and writes at every level of the tree
> reduction disappear. In the usual shared-memory scheme, global memory is touched
> only to load the input once and to store each block's partial result.
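
To make the difference in memory traffic concrete, here is a rough counting model for one block. It is a sketch under stated assumptions, not the lab's kernels: it assumes a halving tree in which every active thread reads two operands and writes one result, and it ignores caches and latency. The function names and the block size of 256 are hypothetical.

```python
def global_accesses_tree_in_global(block_size: int) -> int:
    """Tree reduction performed directly on global memory:
    each active thread does 2 reads + 1 write per step."""
    accesses = 0
    stride = block_size // 2
    while stride >= 1:
        accesses += 3 * stride  # `stride` threads are active in this step
        stride //= 2
    return accesses


def global_accesses_tree_in_shared(block_size: int) -> int:
    """Same tree staged in shared memory: one global read per element
    to fill the tile, plus one global write for the block's result."""
    return block_size + 1


block = 256  # hypothetical block size
print(global_accesses_tree_in_global(block), global_accesses_tree_in_shared(block))
```

The model only shows the direction of the effect; the measured speedup also depends on latency and caching, which it does not capture.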


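# 2. Warp divergence: *sequential* vs. *interleaved*

We implement the reduction from section 1 in two ways that differ in **how thread
indices are mapped to the elements being added**:

* `reduction_kernel_sequential.cu`
* `reduction_kernel_interleaving.cu`

With interleaved addressing the active threads are typically scattered across every
warp, so warps diverge at most steps; with sequential addressing the active threads
are contiguous and whole warps are either busy or idle. We expect a measurable but
moderate difference, because much of the run time goes to memory traffic rather
than branching.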
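
The effect of the index mapping on divergence can be counted on the CPU. The sketch below assumes the textbook conditions `tid % (2 * stride) == 0` (interleaved) and `tid < stride` (sequential) and a block of 256 threads; these are assumptions, since the lab kernels are not shown here. A warp counts as divergent when some, but not all, of its 32 threads are active.

```python
WARP_SIZE = 32


def divergent_warps(is_active, block_size: int) -> int:
    """Count warps in which some, but not all, threads are active."""
    count = 0
    for start in range(0, block_size, WARP_SIZE):
        lanes = [is_active(tid) for tid in range(start, start + WARP_SIZE)]
        if any(lanes) and not all(lanes):
            count += 1
    return count


def interleaved_total(block_size: int) -> int:
    """Sum of divergent warps over all steps, stride = 1, 2, 4, ..."""
    total, stride = 0, 1
    while stride < block_size:
        total += divergent_warps(lambda tid: tid % (2 * stride) == 0, block_size)
        stride *= 2
    return total


def sequential_total(block_size: int) -> int:
    """Sum of divergent warps over all steps, stride = block_size/2, ..., 1."""
    total, stride = 0, block_size // 2
    while stride >= 1:
        total += divergent_warps(lambda tid: tid < stride, block_size)
        stride //= 2
    return total


print(interleaved_total(256), sequential_total(256))
```

Under these assumptions the sequential mapping only diverges in the last few steps, once fewer than 32 threads remain active.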


In [7]:
%%bash
set -e
cd "/content/cuda_lab02/2 Warp Divergence"

ARCH="sm_75"   # Tesla T4; check `!nvidia-smi` and change if needed

echo "== Compiling the SEQUENTIAL variant =="
nvcc -O3 -arch=${ARCH} -I. -I.. \
     reduction.cpp \
     reduction_kernel_sequential.cu \
     -o reduction_div_seq

echo
echo "== Compiling the INTERLEAVING variant =="
nvcc -O3 -arch=${ARCH} -I. -I.. \
     reduction.cpp \
     reduction_kernel_interleaving.cu \
     -o reduction_div_int

== Compiling the SEQUENTIAL variant ==

== Compiling the INTERLEAVING variant ==


In [8]:
%%bash
cd "/content/cuda_lab02/2 Warp Divergence"

echo "== Running SEQUENTIAL =="
./reduction_div_seq

echo
echo "== Running INTERLEAVING =="
./reduction_div_int

== Running SEQUENTIAL ==
Time= 2.478 msec, bandwidth= 27.081863 GB/s
host: 0.996007, device 0.996007

== Running INTERLEAVING ==
Time= 3.012 msec, bandwidth= 22.280499 GB/s
host: 0.996007, device 0.996007


> **Observation:** The sequential variant is about 18% faster (2.478 ms vs. 3.012 ms).
> The gap is moderate rather than dramatic: divergence adds work in the interleaved
> kernel, but it does not dominate the run time, much of which goes to fetching
> data from memory.


In [10]:
%%bash
cd "/content/cuda_lab02/2 Warp Divergence"

echo "== Profiling the SEQUENTIAL variant =="
nvprof ./reduction_div_seq

echo
echo "== Profiling the INTERLEAVING variant =="
nvprof ./reduction_div_int

== Profiling the SEQUENTIAL variant ==
Time= 2.575 msec, bandwidth= 26.059669 GB/s
host: 0.996007, device 0.996007

== Profiling the INTERLEAVING variant ==
Time= 3.060 msec, bandwidth= 21.932436 GB/s
host: 0.996007, device 0.996007


==3578== NVPROF is profiling process 3578, command: ./reduction_div_seq
==3578== Profiling application: ./reduction_div_seq
==3578== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   57.92%  17.678ms         1  17.678ms  17.678ms  17.678ms  [CUDA memcpy HtoD]
                   33.22%  10.140ms        16  633.73us  4.7680us  1.6770ms  reduction_kernel_2(float*, float*, unsigned int)
                    8.85%  2.7009ms         5  540.17us  539.61us  540.79us  [CUDA memcpy DtoD]
                    0.01%  2.1120us         1  2.1120us  2.1120us  2.1120us  [CUDA memcpy DtoH]
      API calls:   84.96%  189.75ms         2  94.873ms  146.00us  189.60ms  cudaMalloc
                    8.05%  17.985ms         7  2.5693ms  8.0630us  17.889ms  cudaMemcpy
                    5.68%  12.686ms         1  12.686ms  12.686ms  12.686ms  cudaDeviceSynchronize
                    1.11%  2.4688ms         2  1.2344ms  229.16us  2.2397ms  c

# 3. Loop unrolling: cooperative groups vs. warp primitives

The reduction is performed in two ways:
* **CG**: kernel `reduction_cg_kernel.cu`, synchronization via *cooperative groups*,
* **WP**: kernel `reduction_wp_kernel.cu`, shuffle / warp primitives.

We compare run times **with** and **without** `#pragma unroll`.
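
For readers unfamiliar with warp shuffles, the following CPU emulation shows the data movement behind a `__shfl_down_sync`-style warp sum. It is an illustrative sketch, not the code in `reduction_wp_kernel.cu`; the helper names are hypothetical. The loop has a fixed trip count of five steps for a 32-lane warp, which is the kind of loop `#pragma unroll` is normally applied to.

```python
WARP_SIZE = 32


def shfl_down(values, offset):
    """Emulate a full-warp shuffle-down: lane i reads lane i + offset;
    a lane whose source would be out of range keeps its own value."""
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Sum 32 lane values with offsets 16, 8, 4, 2, 1."""
    assert len(values) == WARP_SIZE
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(lanes, offset)
        lanes = [own + other for own, other in zip(lanes, shifted)]
        offset //= 2
    return lanes[0]  # only lane 0 is guaranteed to hold the complete sum


print(warp_reduce_sum(list(range(WARP_SIZE))))
```

No shared memory or explicit barrier appears in this pattern, which is why the warp-primitive variant is often treated as the lightweight alternative to block-wide synchronization.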

In [11]:
%%bash
set -e
cd "/content/cuda_lab02/3 Loop Unrolling"

ARCH="sm_75"
CFLAGS="-O3 -arch=${ARCH} -I. -I.."

echo "== Compiling with UNROLL (CG) =="
nvcc $CFLAGS reduction.cpp reduction_cg_kernel.cu -o red_cg_unroll
echo
echo "== Compiling with UNROLL (WP) =="
nvcc $CFLAGS reduction.cpp reduction_wp_kernel.cu -o red_wp_unroll

== Compiling with UNROLL (CG) ==

== Compiling with UNROLL (WP) ==


In [12]:
%%bash
# create temporary versions with unrolling disabled
set -e
cd "/content/cuda_lab02/3 Loop Unrolling"

# -E gives '+' its "one or more" meaning; the trailing '.*' also replaces any existing unroll count
sed -E 's/#pragma[[:space:]]+unroll.*/#pragma unroll 1/' reduction_cg_kernel.cu > tmp_cg_nounroll.cu
sed -E 's/#pragma[[:space:]]+unroll.*/#pragma unroll 1/' reduction_wp_kernel.cu > tmp_wp_nounroll.cu

ARCH="sm_75"
CFLAGS="-O3 -arch=${ARCH} -I. -I.."

echo "== Compiling WITHOUT UNROLL (CG) =="
nvcc $CFLAGS reduction.cpp tmp_cg_nounroll.cu -o red_cg_nounroll
echo
echo "== Compiling WITHOUT UNROLL (WP) =="
nvcc $CFLAGS reduction.cpp tmp_wp_nounroll.cu -o red_wp_nounroll

== Compiling WITHOUT UNROLL (CG) ==

== Compiling WITHOUT UNROLL (WP) ==
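
Before trusting the comparison, it helps to confirm that the pragma rewrite actually changes the source. The sketch below applies the same idea in Python to an in-memory string; the function name and the sample snippet are hypothetical, and `#pragma unroll 1` is used because it tells the compiler not to unroll the loop.

```python
import re

# Matches `#pragma unroll` with an optional count, up to the end of the line.
_UNROLL = re.compile(r"#pragma[ \t]+unroll\b.*")


def disable_unroll(source: str) -> str:
    """Rewrite every `#pragma unroll[ N]` line to `#pragma unroll 1`."""
    return _UNROLL.sub("#pragma unroll 1", source)


sample = (
    "#pragma unroll\n"
    "for (int i = 0; i < 4; ++i) sum += v[i];\n"
    "#pragma unroll 8\n"
    "for (int j = 0; j < 8; ++j) sum += w[j];\n"
)
patched = disable_unroll(sample)
print(patched != sample, patched.count("#pragma unroll 1"))
```

If the patched text equals the original, the pattern matched nothing and the two binaries would be identical.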


In [13]:
%%bash
cd "/content/cuda_lab02/3 Loop Unrolling"

echo "🔹 CG – UNROLL ON"
nvprof --print-gpu-summary ./red_cg_unroll
echo
echo "🔹 CG – UNROLL OFF"
nvprof --print-gpu-summary ./red_cg_nounroll

echo
echo "🔹 WP – UNROLL ON"
nvprof --print-gpu-summary ./red_wp_unroll
echo
echo "🔹 WP – UNROLL OFF"
nvprof --print-gpu-summary ./red_wp_nounroll

🔹 CG – UNROLL ON
Time= 0.814 msec, bandwidth= 82.446358 GB/s
host: 0.996007, device 0.996007

🔹 CG – UNROLL OFF
Time= 0.814 msec, bandwidth= 82.457504 GB/s
host: 0.996007, device 0.996007

🔹 WP – UNROLL ON
Time= 0.808 msec, bandwidth= 83.048325 GB/s
host: 0.996007, device 0.996007

🔹 WP – UNROLL OFF
Time= 0.802 msec, bandwidth= 83.656021 GB/s
host: 0.996007, device 0.996007


==5072== NVPROF is profiling process 5072, command: ./red_cg_unroll
==5072== Profiling application: ./red_cg_unroll
==5072== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   56.76%  54.059ms       100  540.59us  539.87us  541.47us  [CUDA memcpy DtoD]
                   28.30%  26.950ms       202  133.42us  4.5430us  265.25us  reduction_kernel(float*, float*, unsigned int)
                   14.94%  14.228ms         1  14.228ms  14.228ms  14.228ms  [CUDA memcpy HtoD]
                    0.00%  2.2400us         1  2.2400us  2.2400us  2.2400us  [CUDA memcpy DtoH]
==5087== NVPROF is profiling process 5087, command: ./red_cg_nounroll
==5087== Profiling application: ./red_cg_nounroll
==5087== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   56.76%  54.057ms       100  540.57us  539.80us  541.12us  [CUDA memcpy DtoD]
                   28.28%  26.935ms   

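**Observation:**  
All four builds take 0.80-0.81 ms per iteration, so neither the choice between CG and WP nor `#pragma unroll` changes the result measurably, which is consistent with the assignment. The profile of `red_cg_unroll` shows where the time goes: the 100 device-to-device copies average 540.59 µs each, while the reduction kernel accounts for 26.950 ms over 202 launches, roughly 0.27 ms per iteration. The measurement is therefore dominated by memory traffic, not by loop control. Because the unrolled and non-unrolled timings match this closely, it is worth confirming with `diff` that the `tmp_*_nounroll.cu` files really differ from the originals before concluding that unrolling has no effect.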

# 4. Atomic operations: three approaches

The `4 Atomic Operations` directory contains three versions of the reduction:

| Variant                | Kernel file                     | Description |
|------------------------|---------------------------------|-------------|
| **Simple**             | `reduction_kernel.cu`           | Threads add their elements directly to an accumulator in global memory with `atomicAdd` (the profiler reports this kernel as `atomic_reduction_kernel`). |
| **Block-level atomic** | `reduction_blk_atmc_kernel.cu`  | Threads of a block accumulate with `atomicAdd` into a counter in shared memory; only the block's result goes to global memory. |
| **Warp-level atomic**  | `reduction_wrp_atmc_kernel.cu`  | Reduction within warps first (shuffle), then a block-level atomic; this reduces the number of atomic operations on global memory. |

> The goal is to compare the execution time of each strategy.

In [14]:
%%bash
set -e
cd "/content/cuda_lab02/4 Atomic Operations"

ARCH="sm_75"          # Tesla T4; change if Colab assigns a different GPU
CFLAGS="-O3 -arch=${ARCH} -I. -I.."

echo "== Compiling SIMPLE =="
nvcc $CFLAGS reduction.cpp reduction_kernel.cu            -o red_atm_simple

echo
echo "== Compiling BLOCK-LEVEL ATOMIC =="
nvcc $CFLAGS reduction.cpp reduction_blk_atmc_kernel.cu   -o red_atm_blk

echo
echo "== Compiling WARP-LEVEL ATOMIC =="
nvcc $CFLAGS reduction.cpp reduction_wrp_atmc_kernel.cu   -o red_atm_wrp

== Compiling SIMPLE ==

== Compiling BLOCK-LEVEL ATOMIC ==

== Compiling WARP-LEVEL ATOMIC ==


In [16]:
%%bash
cd "/content/cuda_lab02/4 Atomic Operations"

echo "🔹 SIMPLE (global atomics or host-side merge)"
nvprof --print-gpu-summary ./red_atm_simple

echo
echo "🔹 BLOCK-LEVEL ATOMIC (shared-mem → global)"
nvprof --print-gpu-summary ./red_atm_blk

echo
echo "🔹 WARP-LEVEL ATOMIC (shuffle + block atomic)"
nvprof --print-gpu-summary ./red_atm_wrp

🔹 SIMPLE (global atomics or host-side merge)
Time= 35.231 msec, bandwidth= 1.904827 GB/s
host: 0.996007, device 0.995974

🔹 BLOCK-LEVEL ATOMIC (shared-mem → global)
Time= 0.806 msec, bandwidth= 83.226501 GB/s
host: 0.996007, device 0.996007

🔹 WARP-LEVEL ATOMIC (shuffle + block atomic)
Time= 0.806 msec, bandwidth= 83.285385 GB/s
host: 0.996007, device 0.996007


==6349== NVPROF is profiling process 6349, command: ./red_atm_simple
==6349== Profiling application: ./red_atm_simple
==6349== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   98.04%  3.46852s       101  34.342ms  33.223ms  58.871ms  atomic_reduction_kernel(float*, float*, int)
                    1.54%  54.408ms       100  544.08us  538.08us  544.95us  [CUDA memcpy DtoD]
                    0.42%  14.856ms         1  14.856ms  14.856ms  14.856ms  [CUDA memcpy HtoD]
                    0.00%  1.6630us         1  1.6630us  1.6630us  1.6630us  [CUDA memcpy DtoH]
==6382== NVPROF is profiling process 6382, command: ./red_atm_blk
==6382== Profiling application: ./red_atm_blk
==6382== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   57.40%  54.449ms       100  544.49us  544.06us  544.99us  [CUDA memcpy DtoD]
                   27.46%  26.046ms       101 

**Conclusions:**  
* The **Simple** variant performs the most atomic operations on global memory, so it is by far the slowest: 35.231 ms per iteration, about 44× slower than the other two. Its device result (0.995974) also differs slightly from the host value (0.996007), presumably because of floating-point rounding in a different summation order.  
* **Block-level atomic** limits the number of global atomics to one per block and runs in 0.806 ms.  
* **Warp-level atomic** additionally performs a partial reduction with `__shfl_down_sync`, so the block issues only **one** `atomicAdd`. In principle this is the most efficient variant, but it also measures 0.806 ms, the same as the block-level variant.

Reducing the number of atomic operations on global memory is key when many blocks accumulate into the same counter.
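
A back-of-the-envelope count shows how much contention each granularity removes. The sketch assumes one thread per input element, 32-thread warps, and 256-thread blocks; the launch configuration of the lab kernels is not shown in this notebook, so these numbers and the function name are hypothetical.

```python
import math


def global_atomics(n_elements: int, granularity: str,
                   warp_size: int = 32, block_size: int = 256) -> int:
    """Number of atomicAdd calls that hit the single global accumulator."""
    if granularity == "thread":   # every thread adds its own element
        return n_elements
    if granularity == "warp":     # one add per warp after a warp-level reduction
        return math.ceil(n_elements / warp_size)
    if granularity == "block":    # one add per block after a block-level reduction
        return math.ceil(n_elements / block_size)
    raise ValueError(f"unknown granularity: {granularity}")


N = 1 << 24
for g in ("thread", "warp", "block"):
    print(f"{g:>6}: {global_atomics(N, g):>10,d} global atomics")
```

Since all of these operations target the same address, they serialize, which is why cutting their count matters far more than the arithmetic itself.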

# 5. Histogram: shared memory vs. global atomics

Task: for a single long vector (`ARRAY_SIZE = 65536`), measure the run time of two histogram implementations
with a **small** (`BIN_COUNT = 16`) and a **large** (`BIN_COUNT = 1024`) number of bins.

Methods in `histo.cu`:

| Variant | Compilation | Short description |
|---------|-------------|-------------------|
| **shared**  | `-D<MACRO>=1` | Each block keeps its own bin array in shared memory and, after the reduction, copies it to global memory. |
| **atomic**  | `-D<MACRO_ATOMIC>=1` | Every thread increments the counters in global memory directly (`atomicAdd`). |
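
The two strategies can be compared on the CPU to see that they compute the same counts while touching the global array very differently. This is a sequential sketch, not the code in `histo.cu`: binning by `value % bin_count`, the block size of 256, and all function names are assumptions made for illustration.

```python
def histogram_direct(data, bin_count):
    """Every element updates the global bins (atomicAdd on global memory on a GPU)."""
    bins = [0] * bin_count
    for value in data:
        bins[value % bin_count] += 1
    return bins


def histogram_privatized(data, bin_count, block_size=256):
    """Each block fills a private copy (shared memory on a GPU), then merges it."""
    global_bins = [0] * bin_count
    for start in range(0, len(data), block_size):
        local_bins = [0] * bin_count
        for value in data[start:start + block_size]:
            local_bins[value % bin_count] += 1
        for b in range(bin_count):  # merge: one global update per bin per block
            global_bins[b] += local_bins[b]
    return global_bins


data = [(i * 7919) % 1000 for i in range(65536)]  # deterministic test input
for bins in (16, 1024):
    assert histogram_direct(data, bins) == histogram_privatized(data, bins)
print(sum(histogram_direct(data, 16)))
```

Note the trade-off in the merge step: with 16 bins it costs 16 global updates per block, with 1024 bins it costs 1024, which is why the bin count can shift the balance between the two approaches.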

In [21]:
%%bash
set -e
cd "/content/cuda_lab02/5 Histogram"

ARCH="sm_75"
CFLAGS="-O3 -arch=${ARCH} -I. -I.."

echo "== Looking for macros in histo.cu =="
SHARED_MACRO=$(grep -Eo 'USE_SHARED|SHARED[_A-Z]*|HISTO_SHARED' histo.cu  | head -n1 || true)
ATOMIC_MACRO=$(grep -Eo 'USE_ATOMIC|GLOBAL_ATOMIC|HISTO_ATOMIC' histo.cu  | head -n1 || true)

# Fallback: if the source uses no recognizable macro, define a placeholder.
# A macro that histo.cu never tests has no effect on the build.
[ -z "$SHARED_MACRO" ] && SHARED_MACRO="MY_SHARED"
[ -z "$ATOMIC_MACRO" ] && ATOMIC_MACRO="MY_ATOMIC"

echo "  » shared  macro  : $SHARED_MACRO"
echo "  » atomic  macro  : $ATOMIC_MACRO"
echo

echo "== Compiling the shared variant =="
nvcc $CFLAGS -D${SHARED_MACRO}=1    histo.cu -o histo_shared

echo
echo "== Compiling the atomic variant =="
nvcc $CFLAGS -D${ATOMIC_MACRO}=1   histo.cu -o histo_atomic

== Looking for macros in histo.cu ==
  » shared  macro  : MY_SHARED
  » atomic  macro  : MY_ATOMIC

== Compiling the shared variant ==

== Compiling the atomic variant ==


In [26]:
%%bash
cd "/content/cuda_lab02/5 Histogram"

ARRAY=65536
BINS=16

echo "### BIN_COUNT = $BINS  (shared) ###"
nvprof --print-gpu-summary ./histo_shared  --bins $BINS --size $ARRAY

echo
echo "### BIN_COUNT = $BINS  (atomic) ###"
nvprof --print-gpu-summary ./histo_atomic --bins $BINS --size $ARRAY

### BIN_COUNT = 16  (shared) ###
Using device 0:
Tesla T4; global mem: -1351548928B; compute v7.5; clock: 1590000 kHz
Running naive histo
bin 0: count 1
bin 1: count 2
bin 2: count 1
bin 3: count 2
bin 4: count 1
bin 5: count 2
bin 6: count 1
bin 7: count 2
bin 8: count 1
bin 9: count 2
bin 10: count 1
bin 11: count 2
bin 12: count 1
bin 13: count 2
bin 14: count 1
bin 15: count 2

### BIN_COUNT = 16  (atomic) ###
Using device 0:
Tesla T4; global mem: -1351548928B; compute v7.5; clock: 1590000 kHz
Running naive histo
bin 0: count 1
bin 1: count 2
bin 2: count 1
bin 3: count 3
bin 4: count 1
bin 5: count 3
bin 6: count 1
bin 7: count 2
bin 8: count 1
bin 9: count 2
bin 10: count 1
bin 11: count 2
bin 12: count 1
bin 13: count 2
bin 14: count 2
bin 15: count 2


==8656== NVPROF is profiling process 8656, command: ./histo_shared --bins 16 --size 65536
==8656== Profiling application: ./histo_shared --bins 16 --size 65536
==8656== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   66.29%  24.800us         2  12.400us     704ns  24.096us  [CUDA memcpy HtoD]
                   27.97%  10.464us         1  10.464us  10.464us  10.464us  naive_histo(int*, int const *, int)
                    5.73%  2.1450us         1  2.1450us  2.1450us  2.1450us  [CUDA memcpy DtoH]
==8667== NVPROF is profiling process 8667, command: ./histo_atomic --bins 16 --size 65536
==8667== Profiling application: ./histo_atomic --bins 16 --size 65536
==8667== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.35%  27.455us         2  13.727us     703ns  26.752us  [CUDA memcpy HtoD]
                   26.85%  10.944us         1  10.944us  10.9

In [27]:
%%bash
cd "/content/cuda_lab02/5 Histogram"

ARRAY=65536
BINS=1024

echo "### BIN_COUNT = $BINS  (shared) ###"
nvprof --print-gpu-summary ./histo_shared  --bins $BINS --size $ARRAY

echo
echo "### BIN_COUNT = $BINS  (atomic) ###"
nvprof --print-gpu-summary ./histo_atomic --bins $BINS --size $ARRAY

### BIN_COUNT = 1024  (shared) ###
Using device 0:
Tesla T4; global mem: -1351548928B; compute v7.5; clock: 1590000 kHz
Running naive histo
bin 0: count 1
bin 1: count 2
bin 2: count 1
bin 3: count 3
bin 4: count 1
bin 5: count 2
bin 6: count 1
bin 7: count 2
bin 8: count 1
bin 9: count 2
bin 10: count 1
bin 11: count 3
bin 12: count 1
bin 13: count 2
bin 14: count 2
bin 15: count 2

### BIN_COUNT = 1024  (atomic) ###
Using device 0:
Tesla T4; global mem: -1351548928B; compute v7.5; clock: 1590000 kHz
Running naive histo
bin 0: count 1
bin 1: count 1
bin 2: count 1
bin 3: count 3
bin 4: count 1
bin 5: count 2
bin 6: count 1
bin 7: count 3
bin 8: count 1
bin 9: count 2
bin 10: count 1
bin 11: count 3
bin 12: count 1
bin 13: count 2
bin 14: count 1
bin 15: count 2


==8693== NVPROF is profiling process 8693, command: ./histo_shared --bins 1024 --size 65536
==8693== Profiling application: ./histo_shared --bins 1024 --size 65536
==8693== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   65.99%  24.831us         2  12.415us     704ns  24.127us  [CUDA memcpy HtoD]
                   28.40%  10.688us         1  10.688us  10.688us  10.688us  naive_histo(int*, int const *, int)
                    5.61%  2.1120us         1  2.1120us  2.1120us  2.1120us  [CUDA memcpy DtoH]
==8708== NVPROF is profiling process 8708, command: ./histo_atomic --bins 1024 --size 65536
==8708== Profiling application: ./histo_atomic --bins 1024 --size 65536
==8708== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.15%  25.247us         2  12.623us     704ns  24.543us  [CUDA memcpy HtoD]
                   27.15%  10.208us         1  10.208

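| BIN_COUNT | Kernel        | **shared** build \[µs] | **atomic** build \[µs] | Difference           |
|-----------|---------------|-----------------------:|-----------------------:|----------------------|
| 16        | `naive_histo` | **10.464**             | **10.944**             | shared ≈ 4 % faster  |
| 1024      | `naive_histo` | **10.688**             | **10.208**             | atomic ≈ 4 % faster  |

The kernel times are taken from the `nvprof` summaries above. Both builds report `Running naive histo` and profile the same `naive_histo` kernel, so differences of a few percent are within run-to-run noise rather than evidence for either strategy. The 1024-bin runs also print only 16 bins, so it is worth checking that `--bins` is actually honored.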
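
Kernel times like the ones in the table can be pulled out of an `nvprof --print-gpu-summary` listing with a small parser instead of being copied by hand. The sketch below assumes the row layout shown in this notebook (percentage, total time, call count, average, minimum, maximum, name); the function name is hypothetical.

```python
import re

_UNIT_TO_US = {"ns": 1e-3, "us": 1.0, "ms": 1e3, "s": 1e6}

_ROW = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)%\s+"
    r"(?P<total>\d+(?:\.\d+)?)(?P<unit>ns|us|ms|s)\s+"
    r"(?P<calls>\d+)\s+\S+\s+\S+\s+\S+\s+"   # calls, then avg / min / max
    r"(?P<name>.+?)\s*$"
)


def parse_summary_row(line: str):
    """Return (name, total_time_in_us, calls), or None if the line is not a data row."""
    match = _ROW.search(line)
    if match is None:
        return None
    total_us = float(match.group("total")) * _UNIT_TO_US[match.group("unit")]
    return match.group("name"), total_us, int(match.group("calls"))


row = ("                   27.97%  10.464us         1  10.464us  10.464us"
       "  10.464us  naive_histo(int*, int const *, int)")
print(parse_summary_row(row))
```

Header lines and rows that were cut off mid-line do not match the pattern and yield `None`, so they can be skipped safely.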