## 0. Setup — RAPIDS environment

In this notebook we will use NVIDIA RAPIDS to compare
CPU vs GPU execution on the same analytical workload.

We rely on the official RAPIDS Colab utilities to install
the correct CUDA-compatible libraries.

⚠️ If no GPU is available, you can still read this notebook,
but do not run the GPU cells.
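One way to decide up front whether the GPU cells are safe to run is to ask the NVIDIA driver tools whether a device is visible. The sketch below assumes that `nvidia-smi` is on the `PATH` whenever a working driver is installed; the helper name `nvidia_gpu_visible` and the flag `GPU_AVAILABLE` are hypothetical and are not part of the original notebook.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # driver tools are not installed
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


# Hypothetical flag that later cells could check before touching cudf.
GPU_AVAILABLE = nvidia_gpu_visible()
print("GPU available:", GPU_AVAILABLE)
```

If the flag is `False`, skip the cells in section 4 and read them as reference only.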


In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
  import pynvml
Traceback (most recent call last):
  File "/home/dalpenyes/Música/HUGBD/state-of-the-art-tools-for-big-data/exercises/session_2/nvidia/rapidsai-csp-utils/colab/pip-install.py", line 18, in <module>
    pynvml.nvmlInit()
  File "/home/dalpenyes/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pynvml.py", line 2882, in nvmlInit
    nvmlInitWithFlags(0)
  File "/home/dalpenyes/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pynvml.py", line 2872, in nvmlInitWithFlags
    _nvmlCheckReturn(ret)
  File "/home/dalpenyes/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pynvml.py", line 1076, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_DriverNotLoaded: Driver Not Loaded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dalpenyes/Música/HUGBD/state-of-the-art-tools-for-big-data/exercises/session_2

In [2]:
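# Caution: plain "cudf" on PyPI is a placeholder stub, not the GPU library; RAPIDS wheels are CUDA-specific (e.g. cudf-cu12).
%pip install pandas numpy cudf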

Collecting pandas
  Downloading pandas-2.3.3-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.8 MB)
Collecting numpy
  Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting cudf
  Downloading cudf-0.6.1.post1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting tzdata>=2022.7
  Downloading tzdata-2025.3-py2.py3-none-any.whl (348 kB)
Collecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)

In [1]:
import pandas as pd
import numpy as np
import cudf
import time

ModuleNotFoundError: No module named 'pandas'

## 1. Dataset generator

This cell controls the size and shape of the dataset.

Change the parameters and re-run the notebook to observe
how CPU and GPU performance changes.


In [None]:
# ---- Playground parameters ----
N_ROWS = 2_000_000        # try: 500_000 | 2_000_000 | 10_000_000
N_CATEGORIES = 5          # try: 5 | 20 | 100
RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)

categories = [f"C{i}" for i in range(N_CATEGORIES)]

df_cpu = pd.DataFrame({
    "id": np.arange(N_ROWS),
    "category": np.random.choice(categories, size=N_ROWS),
    "value": np.random.rand(N_ROWS) * 1000,
})

df_cpu.head()


## 2. Experiment parameters

These parameters must be the same for CPU and GPU.
Change them to explore performance trade-offs.


In [None]:
VALUE_THRESHOLD = 500     # try: 100 | 500 | 900
AGG_FUNCTION = "sum"      # try: "sum" | "mean" | "count"
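Because `value` is drawn uniformly from [0, 1000) in the dataset generator, `VALUE_THRESHOLD` directly sets how many rows survive the filter: roughly `1 - VALUE_THRESHOLD / 1000` of them. The small sketch below works through that arithmetic for the suggested thresholds; the helper name `expected_selectivity` is hypothetical and the estimate only holds under the uniform assumption.

```python
def expected_selectivity(threshold: float, max_value: float = 1000.0) -> float:
    """Expected fraction of rows with value > threshold.

    Assumes values are uniform in [0, max_value), as in the generator cell.
    """
    fraction = 1.0 - threshold / max_value
    return min(1.0, max(0.0, fraction))  # clamp thresholds outside the range


for threshold in (100, 500, 900):
    share = expected_selectivity(threshold)
    print(f"VALUE_THRESHOLD={threshold}: about {share:.0%} of rows pass the filter")
```

A low threshold therefore leaves more rows for the groupby to process, while a high one makes the filter dominate the work.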

## 3. CPU baseline

This is the CPU reference implementation.
You will compare GPU execution against this result.


In [None]:
# TODO:
# - apply the filter using VALUE_THRESHOLD
# - apply the aggregation defined by AGG_FUNCTION
# - measure execution time


In [None]:
start = time.perf_counter()

if AGG_FUNCTION == "sum":
    cpu_result = (
        df_cpu[df_cpu["value"] > VALUE_THRESHOLD]
          .groupby("category")["value"]
          .sum()
    )
elif AGG_FUNCTION == "mean":
    cpu_result = (
        df_cpu[df_cpu["value"] > VALUE_THRESHOLD]
          .groupby("category")["value"]
          .mean()
    )
elif AGG_FUNCTION == "count":
    cpu_result = (
        df_cpu[df_cpu["value"] > VALUE_THRESHOLD]
          .groupby("category")["value"]
          .count()
    )
else:
    raise ValueError("Unsupported AGG_FUNCTION")

cpu_time = time.perf_counter() - start

cpu_time, cpu_result.head()
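A single `perf_counter` measurement can be noisy, especially for short runs. As a hedged alternative, the sketch below repeats a callable several times and reports the best and median wall-clock time. The helper `time_repeated` is hypothetical, and the stand-in workload only exists so the example runs on its own; in the notebook you would pass a function that wraps the filter and groupby expression.

```python
import statistics
import time


def time_repeated(fn, repeats: int = 5):
    """Call fn() `repeats` times and return (best, median) seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations), statistics.median(durations)


# Stand-in workload (not the notebook's query) to keep the sketch self-contained.
best, median = time_repeated(lambda: sum(i * i for i in range(100_000)))
print(f"best={best:.4f}s  median={median:.4f}s")
```

Reporting the best time reduces the influence of background activity; the median shows what a typical run looks like.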


## 4. GPU workload

Reproduce EXACTLY the same logic on GPU.

Only the execution backend should change.


In [None]:
# Copy the data to GPU memory before starting the timer, so only the query is measured.
df_gpu = cudf.from_pandas(df_cpu)

start = time.perf_counter()

if AGG_FUNCTION == "sum":
    gpu_result = (
        df_gpu[df_gpu["value"] > VALUE_THRESHOLD]
          .groupby("category")["value"]
          .sum()
    )
elif AGG_FUNCTION == "mean":
    gpu_result = (
        df_gpu[df_gpu["value"] > VALUE_THRESHOLD]
          .groupby("category")["value"]
          .mean()
    )
elif AGG_FUNCTION == "count":
    gpu_result = (
        df_gpu[df_gpu["value"] > VALUE_THRESHOLD]
          .groupby("category")["value"]
          .count()
    )
else:
    raise ValueError("Unsupported AGG_FUNCTION")

gpu_time = time.perf_counter() - start

gpu_time, gpu_result.head()
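Since only the backend should change, the filter and aggregation can also be written once and called with either DataFrame. The sketch below assumes that cuDF accepts the same indexing and `groupby(...).agg(<name>)` calls as pandas for this workload; the function name `filter_and_aggregate` is hypothetical, and the demo runs on pandas only so it works without a GPU.

```python
import pandas as pd

SUPPORTED_AGGS = ("sum", "mean", "count")


def filter_and_aggregate(df, threshold, agg):
    """Keep rows with value > threshold, then aggregate value per category.

    `df` can be a pandas DataFrame or (assumption) a cuDF DataFrame.
    """
    if agg not in SUPPORTED_AGGS:
        raise ValueError(f"Unsupported AGG_FUNCTION: {agg!r}")
    filtered = df[df["value"] > threshold]
    return filtered.groupby("category")["value"].agg(agg)


# Tiny pandas-only demo with hand-written rows.
demo = pd.DataFrame({
    "category": ["C0", "C1", "C0", "C1"],
    "value": [100.0, 600.0, 700.0, 800.0],
})
demo_result = filter_and_aggregate(demo, threshold=500, agg="sum")
print(demo_result)
```

With a GPU available, the same call would take `df_gpu` instead of `demo`, which keeps the two code paths from drifting apart.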


## 5. CPU vs GPU comparison

Compare execution times and validate results.


In [None]:
pd.DataFrame({
    "backend": ["CPU", "GPU"],
    "time_seconds": [cpu_time, gpu_time]
})
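The two timings are easier to interpret as a single ratio. The sketch below computes a speedup factor; the helper name `speedup` is hypothetical, and the numbers in the example call are placeholders chosen for easy arithmetic, not measurements from this notebook.

```python
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Return how many times faster the GPU run was (values < 1 mean slower)."""
    if cpu_seconds <= 0 or gpu_seconds <= 0:
        raise ValueError("timings must be positive")
    return cpu_seconds / gpu_seconds


# Placeholder timings, for illustration only.
example = speedup(cpu_seconds=0.5, gpu_seconds=0.125)
print(f"speedup: {example:.1f}x")
```

In the notebook, the call would be `speedup(cpu_time, gpu_time)`.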


In [None]:
cpu_result.sort_index(), gpu_result.to_pandas().sort_index()
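Comparing the two outputs by eye does not scale, and floating-point sums can differ in the last digits when the reduction order changes between backends. A tolerance-based check is one option. The sketch below uses `pandas.testing.assert_series_equal` on two pandas Series; the helper name `results_match` and the tolerance value are assumptions, and the sample Series are hand-written stand-ins for `cpu_result` and `gpu_result.to_pandas()`.

```python
import pandas as pd


def results_match(cpu_series, gpu_series_as_pandas, rtol=1e-9):
    """Return True if both Series agree per category within a relative tolerance."""
    left = cpu_series.sort_index()
    right = gpu_series_as_pandas.sort_index()
    try:
        pd.testing.assert_series_equal(
            left,
            right,
            check_exact=False,   # allow small floating-point differences
            rtol=rtol,
            check_dtype=False,   # compare values, not integer/float widths
            check_names=False,   # ignore Series and index names
        )
    except AssertionError:
        return False
    return True


# Hand-written stand-ins for the CPU and GPU results.
cpu_like = pd.Series([700.0, 1400.0], index=["C0", "C1"])
gpu_like = pd.Series([1400.0 + 1e-10, 700.0], index=["C1", "C0"])
print(results_match(cpu_like, gpu_like))
```

In the notebook, the call would be `results_match(cpu_result, gpu_result.to_pandas())`.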


## 6. Experiments to try

Try changing ONE parameter at a time:

- Increase N_ROWS
- Increase N_CATEGORIES
- Change VALUE_THRESHOLD
- Change AGG_FUNCTION
- Run CPU first vs GPU first

Observe:
- when GPU wins
- when GPU loses
- when results are similar
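To change one parameter at a time without re-running cells by hand, the experiment can be wrapped in a small loop. The sketch below sweeps `N_ROWS` on the CPU only, so it runs without a GPU. The helpers `make_dataset` and `time_query` are hypothetical, and the generator uses `np.random.default_rng` rather than the notebook's `np.random.seed`, so the rows differ from `df_cpu` even with the same seed.

```python
import time

import numpy as np
import pandas as pd


def make_dataset(n_rows, n_categories, seed=42):
    """Build a DataFrame with the same columns as the notebook's generator."""
    rng = np.random.default_rng(seed)
    categories = [f"C{i}" for i in range(n_categories)]
    return pd.DataFrame({
        "id": np.arange(n_rows),
        "category": rng.choice(categories, size=n_rows),
        "value": rng.random(n_rows) * 1000,
    })


def time_query(df, threshold=500, agg="sum"):
    """Time one filter + groupby run and return the elapsed seconds."""
    start = time.perf_counter()
    df[df["value"] > threshold].groupby("category")["value"].agg(agg)
    return time.perf_counter() - start


rows = []
for n_rows in (10_000, 100_000, 1_000_000):
    df = make_dataset(n_rows, n_categories=5)
    rows.append({"n_rows": n_rows, "cpu_seconds": time_query(df)})
    # With a GPU, time_query(cudf.from_pandas(df)) could be stored as "gpu_seconds".

sweep = pd.DataFrame(rows)
print(sweep)
```

The same loop structure applies to `N_CATEGORIES`, `VALUE_THRESHOLD`, and `AGG_FUNCTION`: hold the others fixed and vary one.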
