We're going to illustrate some practical differences between CPU and GPU operations using Python and PyTorch. First, we'll import some modules.

In [35]:
import torch
import time

Next, we'll make sure that we have a GPU. If you're running this notebook in Google Colab and don't have one, click "Runtime" at the top of the notebook, then "Change runtime type", and select "T4 GPU".

In [36]:
import torch

def get_best_device():
    # 1. CUDA (NVIDIA) or ROCm (AMD GPUs on Linux).
    #    ROCm builds of PyTorch expose the GPU through torch.cuda,
    #    so this single check covers both.
    if torch.cuda.is_available():
        return torch.device("cuda")

    # 2. MPS (Apple Silicon / Apple GPU)
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")

    # 3. CPU fallback
    return torch.device("cpu")


device = get_best_device()
print(f"Using device: {device}")


Using device: mps


Now, we're going to create a function that we'll use for timing our experiments.

In [37]:

def calculate_and_time(func, *args, **kwargs):
    """
    Runs func once on the CPU and once on the GPU (if one is available)
    and returns the wall-clock time of each run in seconds.

    func must accept a `device` keyword argument. The measured time covers
    everything func does, including creating its input tensors.
    """
    times = {}

    # ---- CPU ----
    device = torch.device("cpu")
    start = time.perf_counter()                # high-resolution clock for short intervals
    _ = func(*args, **kwargs, device=device)
    times["cpu"] = time.perf_counter() - start

    # ---- CUDA GPU ----
    if torch.cuda.is_available():
        device = torch.device("cuda")
        torch.cuda.synchronize()               # finish pending GPU work before starting the clock
        start = time.perf_counter()
        _ = func(*args, **kwargs, device=device)
        torch.cuda.synchronize()               # GPU calls are asynchronous: wait before stopping the clock
        times["gpu"] = time.perf_counter() - start
        times["gpu_type"] = "cuda"
        return times

    # ---- MPS (Apple Metal) ----
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = torch.device("mps")
        torch.mps.synchronize()
        start = time.perf_counter()
        _ = func(*args, **kwargs, device=device)
        torch.mps.synchronize()
        times["gpu"] = time.perf_counter() - start
        times["gpu_type"] = "mps"
        return times

    # ---- No GPU ----
    times["gpu"] = None
    times["gpu_type"] = None
    return times
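
A single measurement like the ones taken above can be noisy, and the first call often pays one-time startup costs. As a minimal sketch (not part of the original notebook), the helper below makes one untimed warm-up call and then reports the median of several timed runs. The names `time_median`, `repeats`, and `sync` are hypothetical; `sync` stands in for a device synchronization function such as `torch.cuda.synchronize`.

```python
import statistics
import time


def time_median(func, repeats=5, sync=None):
    """Return the median wall-clock time (seconds) of `repeats` calls to func.

    sync: optional zero-argument callable that blocks until the device is idle
          (hypothetical hook; e.g. pass torch.cuda.synchronize on a CUDA GPU).
    """
    def wait():
        if sync is not None:
            sync()

    func()  # warm-up call, not timed
    wait()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        wait()  # include any queued device work in the sample
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Usage with a plain-Python workload (no GPU needed)
elapsed = time_median(lambda: sum(i * i for i in range(10_000)))
print(f"median time: {elapsed:.6f} s")
```

The median is used rather than the mean so that one unusually slow run does not skew the result.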

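We'll be performing dot products on vectors of various sizes. Specifically, we'll be computing the dot product between two vectors `A` and `B`, each a 1-D tensor with `N` elements. `N` will be one of `[10, 50, 100, 500, 1000, 2000, 5000]`. The list below includes `10` twice: the first run tends to absorb one-time startup costs, so it acts as a warm-up for the measurements that follow.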

In [38]:
# Varying data sizes
data_sizes = [10, 10, 50, 100, 500, 1000, 2000, 5000]

Let's create three functions that'll each perform the dot product. The first naively uses a `for` loop to perform the computation; the second uses "vectorization" to accelerate it; the third calls the built-in `torch.dot` function, which gives us a baseline to compare our custom functions against. All three use a helper function to create the random tensors for the computation.

In [39]:
def get_random_tensor(size, device):
    return torch.randn(size, device=device)


def dot_product_for_loop(size, device):
    """
    Calculates the dot product of two PyTorch tensors using a for loop.

    Args:
        size: The size of the tensors.
        device: The device (CPU or GPU) to perform the computation on.

    Returns:
        The dot product as a 0-dimensional tensor.
    """
    # Create random tensors
    a = get_random_tensor(size, device)
    b = get_random_tensor(size, device)

    # Initialize dot product
    dot_product = 0

    # Iterate through elements and accumulate dot product
    for i in range(a.shape[0]):
        dot_product += a[i] * b[i]

    return dot_product


def dot_product_vectorized(size, device):
    """
    Calculates the dot product of two PyTorch tensors using vectorization.

    Args:
        size: The size of the tensors.
        device: The device (CPU or GPU) to perform the computation on.

    Returns:
        The dot product as a 0-dimensional tensor.
    """
    # Create random tensors
    a = get_random_tensor(size, device)
    b = get_random_tensor(size, device)

    # Calculate dot product using element-wise multiplication and sum
    dot_product = (a * b).sum()

    return dot_product


def dot_product_torch(size, device):
    """
    Calculates the dot product of two PyTorch tensors using torch.dot.

    Args:
        size: The size of the tensors.
        device: The device (CPU or GPU) to perform the computation on.

    Returns:
        The dot product as a 0-dimensional tensor.
    """
    # Create random tensors
    a = get_random_tensor(size, device)
    b = get_random_tensor(size, device)

    # Calculate dot product using torch.dot
    dot_product = torch.dot(a, b)

    return dot_product
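
Each function above draws its own random inputs, so their return values can't be compared with one another. To check that the three approaches compute the same quantity, a small sketch like the following applies them to one shared pair of vectors on the CPU. The seed, the vector length of 1000, and the use of `float64` (chosen here to keep rounding differences negligible) are arbitrary choices for illustration, not settings from the notebook.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the check is repeatable
a = torch.randn(1000, dtype=torch.float64)
b = torch.randn(1000, dtype=torch.float64)

# 1. Explicit loop: accumulate one product at a time
loop_result = torch.zeros((), dtype=torch.float64)
for i in range(a.shape[0]):
    loop_result = loop_result + a[i] * b[i]

# 2. Vectorized: element-wise multiply, then sum
vectorized_result = (a * b).sum()

# 3. Built-in dot product
builtin_result = torch.dot(a, b)

# Summation order differs between the approaches, so compare with a tolerance
results_agree = bool(
    torch.allclose(loop_result, vectorized_result)
    and torch.allclose(vectorized_result, builtin_result)
)
print(f"All three approaches agree: {results_agree}")
```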


For each of our tensor sizes, let's use the functions we just defined to compute the dot product and time them on the CPU and GPU.

In [None]:
# Perform calculations and measure execution time for CPU and GPU
cpu_times_for_loop = []
gpu_times_for_loop = []
cpu_times_vectorized = []
gpu_times_vectorized = []
cpu_times_torch_dot = []
gpu_times_torch_dot = []

for size in data_sizes:
    times = calculate_and_time(dot_product_for_loop, size)
    cpu_times_for_loop.append(times["cpu"])
    gpu_times_for_loop.append(times["gpu"])

    times = calculate_and_time(dot_product_vectorized, size)
    cpu_times_vectorized.append(times["cpu"])
    gpu_times_vectorized.append(times["gpu"])

    times = calculate_and_time(dot_product_torch, size)
    cpu_times_torch_dot.append(times["cpu"])
    gpu_times_torch_dot.append(times["gpu"])

Now let's format and print the timing results.

In [None]:
# Print timing results
def format_time(t):
    # Format a time in seconds, or "N/A" when no GPU run was made
    return f"{t:.6f}" if t is not None else "N/A"

results = [
    ("For Loop", cpu_times_for_loop, gpu_times_for_loop),
    ("Vectorized", cpu_times_vectorized, gpu_times_vectorized),
    ("Torch Dot", cpu_times_torch_dot, gpu_times_torch_dot),
]

for n, (title, cpu_times, gpu_times) in enumerate(results):
    print(("\n" if n else "") + title)
    print("Data Size\tCPU Time (s)\tGPU Time (s)")
    for size, cpu_t, gpu_t in zip(data_sizes, cpu_times, gpu_times):
        print(f"{size}\t\t{cpu_t:.6f}\t\t{format_time(gpu_t)}")

For Loop
Data Size	CPU Time (s)	GPU Time (s)
10		0.001137		N/A
10		0.000077		N/A
50		0.000313		N/A
100		0.000451		N/A
500		0.002362		N/A
1000		0.004707		N/A
2000		0.006913		N/A
5000		0.011325		N/A

Vectorized
Data Size	CPU Time (s)	GPU Time (s)
10		0.000295		N/A
10		0.000014		N/A
50		0.000024		N/A
100		0.000017		N/A
500		0.000048		N/A
1000		0.000088		N/A
2000		0.000070		N/A
5000		0.000108		N/A

Torch Dot
Data Size	CPU Time (s)	GPU Time (s)
10		0.000032		N/A
10		0.000009		N/A
50		0.000013		N/A
100		0.000011		N/A
500		0.000028		N/A
1000		0.000037		N/A
2000		0.000067		N/A
5000		0.000074		N/A
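
To put the printed CPU timings in perspective, the short sketch below computes how many times faster the vectorized and `torch.dot` versions were than the for loop at each size. The numbers are copied from the output above, using the second of the two rows for size 10; the variable names are illustrative.

```python
# CPU times in seconds, copied from the printed results
sizes = [10, 50, 100, 500, 1000, 2000, 5000]
for_loop = [0.000077, 0.000313, 0.000451, 0.002362, 0.004707, 0.006913, 0.011325]
vectorized = [0.000014, 0.000024, 0.000017, 0.000048, 0.000088, 0.000070, 0.000108]
torch_dot = [0.000009, 0.000013, 0.000011, 0.000028, 0.000037, 0.000067, 0.000074]

# Map each size to (speedup of vectorized, speedup of torch.dot) over the for loop
speedups = {}
print("Data Size\tVectorized\ttorch.dot")
for n, t_loop, t_vec, t_dot in zip(sizes, for_loop, vectorized, torch_dot):
    speedups[n] = (t_loop / t_vec, t_loop / t_dot)
    print(f"{n}\t\t{speedups[n][0]:.1f}x\t\t{speedups[n][1]:.1f}x")
```

At 5000 elements the for loop comes out roughly 100 times slower than the vectorized version in this run. These are single measurements, so the ratios are indicative rather than precise.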


#### For Loop
**CPU Time:** As the data size (vector length) increases, the CPU time grows roughly in proportion. This is expected, since the for loop visits each element of the vectors in turn, so the number of operations grows with the vector size. The first row for size 10 is slower than the second, most likely because the first call absorbs one-time startup costs.

**GPU Time:** The run printed above produced no GPU timings (the GPU column reads `N/A`), so the numbers to compare here come from your own run. When a GPU is available, expect the for loop to be slower on the GPU than on the CPU at small sizes. The input tensors are created directly on the device, so the cost is not in copying the inputs; it comes from launching a separate tiny GPU operation for every element and from synchronizing at the end. Because the loop processes one element at a time, it gives the GPU almost no parallel work, so a larger `N` is unlikely to close the gap on its own.

#### Vectorized
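**CPU Time:** The CPU times grow far more slowly than the vector size: going from 10 to 5000 elements, a 500-fold increase, raises the time from about 0.000014 s to about 0.000108 s in the run above. This highlights the efficiency of vectorization, as it leverages low-level, optimized operations that are less sensitive to vector size. The small increases in CPU time with larger data sizes may be due to the additional memory and computation required. Measurements this short are also noisy; for example, 2000 elements came out faster than 1000.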

**GPU Time:** As with the for loop, expect the GPU to be slower than the CPU at small sizes, where the fixed cost of launching GPU operations and synchronizing outweighs any parallel speedup. Unlike the for loop, the vectorized version hands the whole vector to the GPU in a couple of operations, so the GPU's parallelism can pay off as `N` grows, and the vectorized GPU time should fall well below the for-loop GPU time at size 5000. The run printed above has no GPU timings, so it can't confirm whether the GPU also beats the CPU at that size; with vectors this small, the crossover may lie beyond the sizes tested here.


### Key Takeaways
- Vectorization is essential for efficiency: Vectorized operations are significantly faster than for loops, especially when dealing with large datasets. At 5000 elements, the vectorized version was roughly 100 times faster than the for loop on the CPU in the run above.
- GPU vs. CPU: Every GPU operation carries a fixed overhead (launching the operation, synchronizing, and any data transfer), which can make the GPU slower for smaller tasks. For large tasks, the advantages of the GPU's parallel processing become apparent.
- PyTorch Considerations: The built-in `torch.dot` function was the fastest option on the CPU at every size in the run above. It computes the dot product in a single operation, whereas `(a * b).sum()` first builds an intermediate tensor and then reduces it, so `torch.dot` is likely to be the best choice on the GPU as well.
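
The trade-off between fixed overhead and per-element cost can be made concrete with a simple linear cost model, time = overhead + N × cost per element. The sketch below is purely illustrative: all four constants are made-up values, not measurements from this notebook, chosen so that the hypothetical GPU has a higher fixed overhead but a lower per-element cost.

```python
def modeled_time(n, overhead, per_element):
    """Linear cost model: fixed overhead plus a cost per element (seconds)."""
    return overhead + n * per_element


# Hypothetical constants for illustration only (NOT measured values)
CPU_OVERHEAD, CPU_PER_ELEMENT = 5e-6, 2e-9
GPU_OVERHEAD, GPU_PER_ELEMENT = 5e-5, 1e-10

# Size at which both modeled times are equal:
# cpu_overhead + n * cpu_cost == gpu_overhead + n * gpu_cost
crossover = (GPU_OVERHEAD - CPU_OVERHEAD) / (CPU_PER_ELEMENT - GPU_PER_ELEMENT)
print(f"Modeled crossover size: about {crossover:,.0f} elements")

for n in [1_000, 10_000, 100_000, 1_000_000]:
    cpu_t = modeled_time(n, CPU_OVERHEAD, CPU_PER_ELEMENT)
    gpu_t = modeled_time(n, GPU_OVERHEAD, GPU_PER_ELEMENT)
    faster = "CPU" if cpu_t < gpu_t else "GPU"
    print(f"N={n:>9,}: CPU {cpu_t:.2e} s, GPU {gpu_t:.2e} s -> {faster} faster")
```

Below the crossover size the fixed overhead dominates and the modeled CPU wins; above it the lower per-element cost lets the modeled GPU pull ahead. Real hardware will have different constants, so the crossover has to be found by measurement.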
