# CUDA Lab 03 – Matrix Multiplication on the GPU  
**Environment:** CUDA 12.5 · NVIDIA Tesla T4 · Google Colab  

We run **two** variants of matrix multiplication:

| Variant | Source file | Optimizations |
|---------|-------------|---------------|
| 1. *Simplest* | `1. Simplest_version/matrixmul.cu` | None (each thread computes one element) |
| 2. *CUDA Samples* | `2. CUDA_Samples_version/matrixMul.cu` | Tiling + shared memory |

The **cuBLAS**-based version is skipped, as the assignment specifies.
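
To make the difference between the two variants concrete, here is a minimal CPU-side sketch in plain Python. It is not the lab's CUDA code: `matmul_naive`, `matmul_tiled` and the `tile` parameter are hypothetical names used only for illustration. The first function mirrors the "one thread per output element" idea; the second computes the same product tile by tile, which is the access pattern that a shared-memory kernel exploits.

```python
# CPU-side illustration only; the lab's kernels are CUDA code and are not reproduced here.

def matmul_naive(a, b):
    """One independent dot product per (i, j), like one GPU thread per element."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, accumulated tile by tile.

    On a GPU, each (ti, tj) pair would map to a thread block, and the
    tile x tile sub-blocks of a and b visited in the tk loop would be
    staged in shared memory and reused by every thread of the block.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ti in range(0, n, tile):
        for tj in range(0, n, tile):
            for tk in range(0, n, tile):
                for i in range(ti, min(ti + tile, n)):
                    for j in range(tj, min(tj + tile, n)):
                        acc = c[i][j]
                        for k in range(tk, min(tk + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
assert matmul_tiled(a, b) == matmul_naive(a, b)
```

Both functions perform the same 2·N³ arithmetic operations; tiling only changes the order of memory accesses so that each loaded value is reused several times.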

In [1]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-a1f9335a-766a-40b3-a276-2834b0348348)


In [11]:
%%bash
# ── Clone CUDA Samples (only once) ─────────────────────────────
set -e
if [ ! -d cuda-samples ]; then
  echo "Cloning cuda-samples repo…"
  git clone --depth 1 https://github.com/NVIDIA/cuda-samples.git
else
  echo "cuda-samples already present."
fi

cuda-samples already present.


In [12]:
%%bash
# ── Unzip lab archive ─────────────────────────────────────────
set -e
ZIP="CUDA-Lab03-Matrix Multiplication.zip"
if [ ! -f "$ZIP" ]; then
  echo "❌  $ZIP not found – upload it via the Files pane and re-run."
  exit 1
fi

echo "Unzipping $ZIP …"
unzip -o "$ZIP" -d lab03 >/dev/null
echo "OK"

Unzipping CUDA-Lab03-Matrix Multiplication.zip …
OK


In [13]:
%%bash
# ── Build Simplest for chosen sizes ───────────────────────────
set -e

SIMPL_DIR="lab03/CUDA-Lab03-Matrix Multiplication/1. Simplest_version"
SRC="${SIMPL_DIR}/matrixmul.cu"
if [ ! -f "$SRC" ]; then
  echo "❌  Source $SRC not found."
  exit 1
fi

SIZES=(256 512 1024 2048)
for N in "${SIZES[@]}"; do
  TMP="${SIMPL_DIR}/tmp_${N}.cu"
  OUT="simple_mm_${N}"

  # inject N
  sed -E "s/int[[:space:]]+N[[:space:]]*=[[:space:]]*[0-9]+;/int N = ${N};/" "$SRC" > "$TMP"

  echo "→ Building Simplest N=${N}"
  nvcc -std=c++17 -arch=sm_75 -O3 -I "$SIMPL_DIR" "$TMP" -o "$OUT"
  rm "$TMP"
done
echo "✅ Simplest build finished."

→ Building Simplest N=256
→ Building Simplest N=512
→ Building Simplest N=1024
→ Building Simplest N=2048
✅ Simplest build finished.
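
The `sed` call in the build cell silently leaves the file unchanged when the pattern does not match, in which case every binary would be built with the original size. As a hedged sketch, the same substitution can be expressed in Python with a check that it actually happened. `inject_n` is a hypothetical helper, and the sample source line is an assumption about what `matrixmul.cu` contains. Unlike the `sed` expression, which replaces only the first match on each line, `subn` replaces every match.

```python
import re

# Same pattern as the sed expression: "int N = <digits>;" with flexible spacing.
N_PATTERN = re.compile(r"int\s+N\s*=\s*[0-9]+;")


def inject_n(source: str, n: int) -> str:
    """Return source with every `int N = <digits>;` definition set to n."""
    patched, count = N_PATTERN.subn(f"int N = {n};", source)
    if count == 0:
        # Fail loudly instead of building with an unchanged matrix size.
        raise ValueError("no 'int N = <number>;' definition found")
    return patched


# Hypothetical source line, for illustration only.
print(inject_n("const int N = 128;\n", 512), end="")
```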


In [14]:
%%bash
# ── Build CUDA Samples variant ───────────────────────────────
set -e

SAMPLE_DIR="lab03/CUDA-Lab03-Matrix Multiplication/2. CUDA_Samples_version"
SRC="${SAMPLE_DIR}/matrixMul.cu"
if [ ! -f "$SRC" ]; then
  echo "❌  $SRC not found."
  exit 1
fi

# find helper_cuda.h
# dirname "" prints ".", so test the find result before taking its directory
HELPER=$(find cuda-samples -type f -name helper_cuda.h | head -n1)
if [ -z "$HELPER" ]; then
  echo "❌  helper_cuda.h not found under cuda-samples."
  exit 1
fi
INC_DIR=$(dirname "$HELPER")
echo "Using helper headers from $INC_DIR"

echo "→ Building CUDA Samples variant"
nvcc -std=c++17 -arch=sm_75 -O3 -I "$INC_DIR" "$SRC" -o sample_mm
echo "✅ CUDA Samples build finished."

Using helper headers from cuda-samples/Common
→ Building CUDA Samples variant
✅ CUDA Samples build finished.


In [15]:
# ── Benchmark both binaries ──────────────────────────────────
import subprocess, time, re, pandas as pd

sizes = [256, 512, 1024, 2048]
records = []

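def wall_time(cmd):
    """Run cmd; return (combined stdout/stderr, elapsed seconds).

    The elapsed time covers the whole process: startup, CUDA context
    creation, host<->device copies and the kernel itself.
    """
    t0 = time.perf_counter()
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    elapsed = time.perf_counter() - t0
    if proc.returncode != 0:
        # A failed run would otherwise be timed as if it were a valid result.
        print(f"⚠ {cmd[0]} exited with code {proc.returncode}")
    return proc.stdout, elapsed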

for N in sizes:
    ops = 2 * (N**3)   # 2·N³ floating-point ops

    # Simplest
    out_s, t_s = wall_time([f"./simple_mm_{N}"])

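    # Samples (iterations=1 to keep the runs comparable)
    # The helper_string.h parser from cuda-samples reads options as -name=value,
    # so each dimension is passed in that form. --iterations is assumed to be
    # handled by the lab's copy of matrixMul.cu.
    sample_cmd = ["./sample_mm",
                  "--iterations=1",
                  f"-wA={N}", f"-hA={N}",
                  f"-wB={N}", f"-hB={N}"]
    out_c, wall_c = wall_time(sample_cmd)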

    # try to parse kernel time
    m = re.search(r"Time=\s*([0-9.]+)\s*msec", out_c)
    kernel_c = float(m.group(1))/1000 if m else None

    records.append({
        "N": N,
        "Simplest_wall_s": round(t_s,4),
        "Simplest_GFLOPS": round(ops/(t_s*1e9),2),
        "Samples_wall_s": round(wall_c,4),
        "Samples_GFLOPS": round(ops/(wall_c*1e9),2),
        "Samples_kernel_s": round(kernel_c,4) if kernel_c else None,
        "Samples_kernel_GFLOPS": round(ops/(kernel_c*1e9),2) if kernel_c else None
    })

df = pd.DataFrame(records).set_index("N")
display(df)

print("\nLegend:")
print("  *_wall_*   – wall-clock time of the whole process (startup + CUDA init + host↔device copies + kernel)")
print("  *_kernel_* – kernel time only (if the sample printed Time=… msec)")

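| N | Simplest_wall_s | Simplest_GFLOPS | Samples_wall_s | Samples_GFLOPS | Samples_kernel_s | Samples_kernel_GFLOPS |
|---|-----------------|-----------------|----------------|----------------|------------------|-----------------------|
| 256 | 0.185 | 0.18 | 0.1559 | 0.22 | | |
| 512 | 0.433 | 0.62 | 0.243 | 1.1 | | |
| 1024 | 0.9423 | 2.28 | 0.1032 | 20.81 | | |
| 2048 | 17.3535 | 0.99 | 0.1099 | 156.38 | | |

The `Samples_kernel_*` columns are empty, meaning the benchmark obtained no usable kernel time from the sample's output. The Samples figures in this run are therefore whole-process wall-clock values only and should be read with caution.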
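
The GFLOPS columns follow from the 2·N³ operation count used in the benchmark cell, divided by the elapsed time. A minimal sketch of that arithmetic, with `gflops` as a hypothetical helper name, reproduces one of the table entries:

```python
def gflops(n: int, seconds: float) -> float:
    """Throughput of an n x n matrix product that took `seconds` to run."""
    ops = 2 * n**3              # one multiply and one add per inner-loop step
    return ops / (seconds * 1e9)


# Simplest variant, N = 1024, wall time taken from the table.
print(round(gflops(1024, 0.9423), 2))   # → 2.28
```

Because the time in the denominator is wall-clock time for the whole process, these values are a lower bound on what the kernel itself achieves.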



Legend:
  *_wall_*   – wall-clock time of the whole process (startup + CUDA init + host↔device copies + kernel)
  *_kernel_* – kernel time only (if the sample printed Time=… msec)
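
The kernel-time columns depend on finding a `Time=… msec` fragment in the sample's output. The sketch below isolates that parsing step so it can be checked on its own. `kernel_seconds` is a hypothetical helper, and `sample_line` is an illustrative string in the style of the CUDA Samples performance line, not output captured in this lab.

```python
import re

# Same expression as in the benchmark cell.
TIME_RE = re.compile(r"Time=\s*([0-9.]+)\s*msec")


def kernel_seconds(output: str):
    """Return the reported kernel time in seconds, or None if it is absent."""
    m = TIME_RE.search(output)
    return float(m.group(1)) / 1000 if m else None


# Illustrative line only (assumed format, made-up numbers).
sample_line = "Performance= 350.00 GFlop/s, Time= 0.375 msec, Size= 131072000 Ops"
print(kernel_seconds(sample_line))
print(kernel_seconds("some error message"))   # no match, so None
```

Printing the raw output whenever this function returns `None` is a quick way to see why a run produced no kernel time.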
