In [1]:
!nvidia-smi

Tue Jul 30 21:18:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install cupy-cuda12x numba
!pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com


# CuPy as a drop-in replacement for NumPy

This example uses [CuPy](https://github.com/cupy/cupy), a GPU array library built on NVIDIA CUDA that mirrors much of the NumPy API. Because the interfaces match, many array programs can be moved to the GPU by changing only the import. Here we create a random array on the GPU, apply an elementwise operation to it, and transfer part of the result back to the host (CPU).

In [3]:
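import cupy as cp
# import numpy as cp  # CPU version: also replace cp.asnumpy(...) with the bare array

# Create a random array on the GPU
x = cp.random.rand(1000000)

# Perform an elementwise operation (executed on the GPU)
y = cp.sin(x)

# Transfer the first 10 results back to the host (CPU) and print them
print(cp.asnumpy(y[:10]))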

[0.79131525 0.7764646  0.13682646 0.72502146 0.44126537 0.6260933
 0.07038901 0.51977258 0.48913717 0.22888138]
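
Whether the cell above runs on the GPU depends only on which module is bound to `cp`. The following is a minimal sketch of one way to make that choice explicit and fall back to NumPy when CuPy or a CUDA device is unavailable. The names `xp` and `to_host` are illustrative assumptions, not part of either library.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption: treat "no visible CUDA device" the same as "CuPy missing".
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    xp = cp
except Exception:
    cp = None
    xp = np  # CPU fallback with the same array API


def to_host(arr):
    """Return a NumPy array on the host, whichever backend produced `arr`."""
    if cp is None:
        return np.asarray(arr)
    return cp.asnumpy(arr)


x = xp.random.rand(1_000_000)   # lives on the GPU when xp is CuPy
y = xp.sin(x)                   # elementwise, on the same device as x
host = to_host(y[:10])          # explicit device-to-host copy (no-op on CPU)

print("backend:", xp.__name__)
print(host)
```

Writing array code against `xp` keeps the GPU path and the CPU path identical except for the final transfer.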


# Numba: a JIT compiler that translates Python code into machine code

In this example, we use the [Numba](https://github.com/numba/numba) library to speed up a Python function that sums all elements of a 1D NumPy array. Numba is a just-in-time compiler that translates a subset of Python and NumPy code into fast machine code. The `@jit` decorator with `nopython=True` tells Numba to compile the whole function to machine code without falling back to the Python interpreter, which is where most of the speedup for explicit numerical loops comes from. The code applies this technique to a simple function that iterates through a 1D array and accumulates the sum of its elements.

In [5]:
from numba import jit
import numpy as np

@jit(nopython=True)
def sum_1d_array(arr):
    result = 0.0
    for value in arr:
        result += value
    return result

# Set a random seed for reproducibility
np.random.seed(42)

arr = np.random.rand(10000)
result = sum_1d_array(arr)
print(result)

4941.595576842995
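
Numba compiles a function the first time it is called with a given argument type, so the first call includes compilation time and later calls reuse the compiled code. The sketch below makes that visible by timing two consecutive calls. The fallback `njit` defined in the `except` branch is a hypothetical stand-in so the snippet still runs where Numba is not installed; in that case both calls run as ordinary Python and the timings will be similar.

```python
import time
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):
        # Hypothetical stand-in: return the function unchanged.
        return func


@njit
def sum_1d_array(arr):
    total = 0.0
    for value in arr:
        total += value
    return total


arr = np.random.default_rng(42).random(10_000)

t0 = time.perf_counter()
first = sum_1d_array(arr)    # with Numba: compile, then run
t1 = time.perf_counter()
second = sum_1d_array(arr)   # with Numba: run the cached machine code
t2 = time.perf_counter()

print(f"first call:  {t1 - t0:.6f} s")
print(f"second call: {t2 - t1:.6f} s")
print("matches np.sum:", np.isclose(first, arr.sum()))
```

For a fair benchmark, call the function once before timing it so that compilation is excluded.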


# Exercise: sum 2D array



In [None]:
from numba import jit
import numpy as np

# Implement the 2D sum function.
# Note: you can use arr.shape, which returns a tuple to determine the size of the ndarray.
@jit(nopython=True)
def sum_2d_array(arr):
    # TODO: replace `pass` with your implementation
    pass


# Set a random seed for reproducibility
np.random.seed(42)

arr = np.random.rand(1000, 1000)
result = sum_2d_array(arr)
print(result)

# Expected result: 500334.4861757136

In [11]:
# Solution

from numba import jit
import numpy as np

@jit(nopython=True)
def sum_2d_array(arr):
    m, n = arr.shape
    result = 0.0
    for i in range(m):
        for j in range(n):
            result += arr[i, j]
    return result

# Set a random seed for reproducibility
np.random.seed(42)

arr = np.random.rand(1000, 1000)
result = sum_2d_array(arr)
print(result)

# Expected result: 500334.4861757136

500334.4861757136


In [14]:
import time

def sum_2d_array_pure_python(arr):
    m, n = arr.shape
    result = 0.0
    for i in range(m):
        for j in range(n):
            result += arr[i, j]
    return result

# Timing the Numba function. It was already called (and therefore compiled)
# in the previous cell, so this measures only the compiled code.
start_time = time.perf_counter()
result_numba = sum_2d_array(arr)
end_time = time.perf_counter()
print(f"Numba execution time: {end_time - start_time} seconds")

# Timing the pure Python function
start_time = time.perf_counter()
result_pure_python = sum_2d_array_pure_python(arr)
end_time = time.perf_counter()
print(f"Pure Python execution time: {end_time - start_time} seconds")


Numba execution time: 0.001617431640625 seconds
Pure Python execution time: 0.18297934532165527 seconds
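
A single wall-clock measurement like the ones above can be noisy. A common alternative is to repeat the call several times and keep the fastest run. The sketch below uses only the standard-library `timeit` module; `best_time` and `python_sum` are hypothetical helper names, and the same helper could be pointed at `sum_2d_array` and `sum_2d_array_pure_python`.

```python
import timeit


def best_time(func, *args, repeats=5):
    """Return the fastest of `repeats` single-call timings, in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeats, number=1))


def python_sum(values):
    # Plain Python accumulation loop, used here only as something to time.
    total = 0.0
    for v in values:
        total += v
    return total


data = [float(i) for i in range(100_000)]
elapsed = best_time(python_sum, data)
print(f"best of 5: {elapsed:.6f} s")
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the reported time.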


# Example: Computing Pairwise Distances on the GPU

In this example, we compute the pairwise Euclidean distance matrix between the rows of an input matrix on the GPU. The implementation uses Numba's CUDA support (`numba.cuda`): a kernel assigns one GPU thread to each output entry `(i, j)`, so all distances can be computed in parallel, which can speed up the computation considerably for large inputs.
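
To make the target concrete: entry `(i, j)` of the result is the Euclidean distance between row `i` and row `j`. The following is a minimal CPU sketch using NumPy broadcasting, which can serve as a reference when checking a GPU result on small inputs. The function name `cpu_dist_matrix` is an assumption made for this illustration, and the approach needs O(n² · d) memory, so it is not suited to large matrices.

```python
import numpy as np


def cpu_dist_matrix(mat):
    """Pairwise Euclidean distances between the rows of `mat` (CPU reference)."""
    mat = np.asarray(mat)
    # diff[i, j, k] = mat[i, k] - mat[j, k]
    diff = mat[:, None, :] - mat[None, :, :]
    return np.sqrt((diff * diff).sum(axis=-1))


mat = np.array([[0, 1], [1, 0], [2, 2]], dtype=np.float32)
reference = cpu_dist_matrix(mat)
print(reference)
```

Comparing a GPU result against such a reference with `np.allclose` is a simple way to catch indexing mistakes in a kernel.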

In [15]:
import numpy as np

# Define the simplified gpu_dist_matrix function
import math
from numba import cuda

def gpu_dist_matrix(mat, USE_64=True):
    """Compute the Euclidean distance between each pair of rows of the input matrix on the GPU."""

    np_type = np.float64 if USE_64 else np.float32

    @cuda.jit
    def distance_matrix(mat, out):
        i, j = cuda.grid(2)
        if i < mat.shape[0] and j < mat.shape[0]:
            d = 0.0
            for k in range(mat.shape[1]):
                tmp = mat[i, k] - mat[j, k]
                d += tmp * tmp
            out[i, j] = math.sqrt(d)

    rows = mat.shape[0]
    block_dim = (16, 16)
    grid_dim = ((rows + block_dim[0] - 1) // block_dim[0], (rows + block_dim[1] - 1) // block_dim[1])

    mat_device = cuda.to_device(np.asarray(mat, dtype=np_type))
    out_device = cuda.device_array((rows, rows), dtype=np_type)

    distance_matrix[grid_dim, block_dim](mat_device, out_device)

    return out_device.copy_to_host()


# Create a sample input matrix
mat = np.array([[0, 1], [1, 0], [2, 2]], dtype=np.float32)

# Compute the pairwise distance matrix
dist_matrix = gpu_dist_matrix(mat, USE_64=False)

# Print the result
print("Input Matrix:\n", mat)
print("Pairwise Distance Matrix:\n", dist_matrix)




Input Matrix:
 [[0. 1.]
 [1. 0.]
 [2. 2.]]
Pairwise Distance Matrix:
 [[0.        1.4142135 2.236068 ]
 [1.4142135 0.        2.236068 ]
 [2.236068  2.236068  0.       ]]
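
The launch configuration in the cell above rounds the number of blocks up so that the grid covers every output entry, and the bounds check inside the kernel discards the surplus threads. The sketch below isolates that ceiling-division arithmetic in plain Python; `grid_dims` is a hypothetical helper name, and the 16×16 block size is taken from the example.

```python
def grid_dims(rows, cols, block_dim=(16, 16)):
    """Number of blocks per axis needed to cover a rows x cols output."""
    blocks_x = (rows + block_dim[0] - 1) // block_dim[0]  # ceil(rows / block_x)
    blocks_y = (cols + block_dim[1] - 1) // block_dim[1]  # ceil(cols / block_y)
    return blocks_x, blocks_y


for n in (3, 16, 17, 1000):
    gx, gy = grid_dims(n, n)
    launched = gx * 16 * gy * 16   # total threads started
    needed = n * n                 # threads that pass the bounds check
    print(f"n={n:5d}  grid=({gx}, {gy})  launched={launched}  needed={needed}")
```

For the 3×3 example a single block of 256 threads is launched and only 9 of them write a result, which is why such a tiny input shows no benefit from the GPU.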


# cuDF to accelerate dataframe operations using the GPU

In this example, we will explore the use of [cuDF](https://github.com/rapidsai/cudf), a GPU DataFrame library, to perform data manipulation operations on a DataFrame. We will start by creating a DataFrame on the CPU using pandas, then transfer it to the GPU using cuDF. On the GPU, we will perform a series of operations, including column addition, conditional column creation, filtering, and grouping with aggregation. Finally, we will transfer the results back to the CPU and print the outputs.
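
As a cross-check on what the filter, conditional column, and grouped aggregation compute, the sketch below reproduces a few of the aggregates in plain Python on the same small table used in the following cell. It assumes the standard deviation is the sample standard deviation (divisor n − 1), which is what `statistics.stdev` computes; the variable names are illustrative only.

```python
from collections import defaultdict
from statistics import mean, stdev

a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5]
b = [10, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
c = [100, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700]

# Conditional column: e equals c where a > 3, and 0 elsewhere.
e = [ci * (ai > 3) for ai, ci in zip(a, c)]

# Filter rows with a > 2, then collect the values of each group.
groups = defaultdict(lambda: {"b": [], "e": []})
for ai, bi, ei in zip(a, b, e):
    if ai > 2:
        groups[ai]["b"].append(bi)
        groups[ai]["e"].append(ei)

summary = {
    key: {
        "b_sum": sum(g["b"]),
        "b_mean": mean(g["b"]),
        "e_mean": mean(g["e"]),
        "e_std": stdev(g["e"]),  # sample standard deviation (n - 1)
    }
    for key, g in sorted(groups.items())
}

for key, row in summary.items():
    print(key, row)
```

The dataframe version expresses the same steps as column operations, which is what allows them to run in parallel on the GPU.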

In [16]:
import cudf
import pandas as pd

# Create a DataFrame on the CPU
pdf = pd.DataFrame({
    'a': [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5],
    'b': [10, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'c': [100, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700]
})

# Transfer it to the GPU
gdf = cudf.DataFrame.from_pandas(pdf)

# Perform operations on the GPU

# Add two columns
gdf['d'] = gdf['a'] + gdf['b']

# Create a new column based on a condition
gdf['e'] = gdf['c'] * (gdf['a'] > 3)

# Filter rows based on a condition
filtered_gdf = gdf[gdf['a'] > 2]

# Group by column 'a' and aggregate
grouped_gdf = filtered_gdf.groupby('a').agg({
    'b': ['sum', 'mean'],
    'c': ['min', 'max'],
    'd': ['sum', 'count'],
    'e': ['mean', 'std']
})

# Flatten the MultiIndex columns
grouped_gdf.columns = ['_'.join(col) for col in grouped_gdf.columns]

# Transfer the result back to the CPU
result_pdf = grouped_gdf.to_pandas()

# Print the results
print("Original DataFrame:")
print(pdf)

print("\nDataFrame after operations on GPU:")
print(gdf)

print("\nFiltered DataFrame:")
print(filtered_gdf)

print("\nGrouped and Aggregated DataFrame:")
print(result_pdf)

# Expected Results with Explanations:

# Original DataFrame:
#     a   b    c
# 0   1  10  100
# 1   2  20  200
# 2   2  25  250
# 3   3  30  300
# 4   3  35  350
# 5   3  40  400
# 6   4  45  450
# 7   4  50  500
# 8   5  55  550
# 9   5  60  600
# 10  5  65  650
# 11  5  70  700

# DataFrame after operations on GPU:
#     a   b    c   d    e
# 0   1  10  100  11    0  (a + b = 11, c * (a > 3) = 0)
# 1   2  20  200  22    0  (a + b = 22, c * (a > 3) = 0)
# 2   2  25  250  27    0  (a + b = 27, c * (a > 3) = 0)
# 3   3  30  300  33    0  (a + b = 33, c * (a > 3) = 0)
# 4   3  35  350  38    0  (a + b = 38, c * (a > 3) = 0)
# 5   3  40  400  43    0  (a + b = 43, c * (a > 3) = 0)
# 6   4  45  450  49  450  (a + b = 49, c * (a > 3) = 450)
# 7   4  50  500  54  500  (a + b = 54, c * (a > 3) = 500)
# 8   5  55  550  60  550  (a + b = 60, c * (a > 3) = 550)
# 9   5  60  600  65  600  (a + b = 65, c * (a > 3) = 600)
# 10  5  65  650  70  650  (a + b = 70, c * (a > 3) = 650)
# 11  5  70  700  75  700  (a + b = 75, c * (a > 3) = 700)

# Filtered DataFrame:
#     a   b    c   d    e
# 3   3  30  300  33    0
# 4   3  35  350  38    0
# 5   3  40  400  43    0
# 6   4  45  450  49  450
# 7   4  50  500  54  500
# 8   5  55  550  60  550
# 9   5  60  600  65  600
# 10  5  65  650  70  650
# 11  5  70  700  75  700

# Grouped and Aggregated DataFrame:
#    b_sum  b_mean  c_min  c_max  d_sum  d_count  e_mean  e_std
# a
# 3    105    35.0    300    400    114        3     0.0    0.0
# 4     95    47.5    450    500    103        2   475.0   35.4
# 5    250    62.5    550    700    270        4   625.0   64.5


Original DataFrame:
    a   b    c
0   1  10  100
1   2  20  200
2   2  25  250
3   3  30  300
4   3  35  350
5   3  40  400
6   4  45  450
7   4  50  500
8   5  55  550
9   5  60  600
10  5  65  650
11  5  70  700

DataFrame after operations on GPU:
    a   b    c   d    e
0   1  10  100  11    0
1   2  20  200  22    0
2   2  25  250  27    0
3   3  30  300  33    0
4   3  35  350  38    0
5   3  40  400  43    0
6   4  45  450  49  450
7   4  50  500  54  500
8   5  55  550  60  550
9   5  60  600  65  600
10  5  65  650  70  650
11  5  70  700  75  700

Filtered DataFrame:
    a   b    c   d    e
3   3  30  300  33    0
4   3  35  350  38    0
5   3  40  400  43    0
6   4  45  450  49  450
7   4  50  500  54  500
8   5  55  550  60  550
9   5  60  600  65  600
10  5  65  650  70  650
11  5  70  700  75  700

Grouped and Aggregated DataFrame:
   b_sum  b_mean  c_min  c_max  d_sum  d_count  e_mean      e_std
a                                                                
4     95 