In [14]:
from numba import cuda
import numpy as np
from timeit import default_timer as timer

# Function to perform basic arithmetic operations on CPU
def cpu_calculator(a, b, operation):
    if operation == 'add':
        return a + b
    elif operation == 'subtract':
        return a - b
    elif operation == 'multiply':
        return a * b
    elif operation == 'divide':
        if b != 0:
            return a / b
        else:
            return "Error: Division by zero"
    else:
        return "Error: Invalid operation"

# Function to perform basic arithmetic operations on GPU
@cuda.jit
def gpu_calculator(a, b, operation, result):
    idx = cuda.grid(1)  # Get the unique thread index
    if idx == 0:  # Only one calculation, so we use the first thread
        if operation == 0:  # Add
            result[0] = a + b
        elif operation == 1:  # Subtract
            result[0] = a - b
        elif operation == 2:  # Multiply
            result[0] = a * b
        elif operation == 3:  # Divide
            if b != 0:
                result[0] = a / b
            else:
                result[0] = np.nan  # Not a number for error handling

if __name__ == "__main__":
    # Get two numbers and an operation from user input
    a = float(input("Enter the first number: "))  # First number input
    b = float(input("Enter the second number: "))  # Second number input
    operation = input("Enter operation (add, subtract, multiply, divide): ")  # Operation input

    # Measure execution time for CPU calculation
    start = timer()  
    cpu_result = cpu_calculator(a, b, operation)  # Perform calculation on CPU
    cpu_time = timer() - start  # Calculate time taken for CPU

    # Print results for CPU calculation
    print(f"CPU Result: {cpu_result}")
    print(f"Time taken on CPU: {cpu_time:.6f} seconds")  

    # Prepare inputs for the GPU calculation
    # These are host-side NumPy arrays; Numba copies array arguments to the device at launch and back afterward
    a_device = np.array([a], dtype=np.float64)  # Single-element host array for the first input
    b_device = np.array([b], dtype=np.float64)  # Single-element host array for the second input
    result_device = np.zeros(1, dtype=np.float64)  # Host array; the kernel's result is copied back into it

    # Map operations to integers
    operations_map = {'add': 0, 'subtract': 1, 'multiply': 2, 'divide': 3}
    operation_code = operations_map.get(operation)

    if operation_code is None:
        print("Error: Invalid operation entered for GPU.")
    else:
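        # Measure execution time for GPU calculation
        # Note: the first launch of a @cuda.jit kernel also triggers JIT compilation,
        # so this timing includes compile time as well as the launch and the result copy
        start = timer()
        # Launch the GPU kernel with one block of one thread; a and b are passed as scalars
        gpu_calculator[1, 1](a_device[0], b_device[0], operation_code, result_device)
        cuda.synchronize()  # Wait for the GPU to finish
        gpu_time = timer() - start  # Calculate time taken for GPU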

        # Print results for GPU calculation
        gpu_result = result_device[0]  # Read the result, already copied back into the host array
        if np.isnan(gpu_result):
            print("GPU Result: Error: Division by zero")  
        else:
            print(f"GPU Result: {gpu_result}")
        print(f"Time taken on GPU: {gpu_time:.6f} seconds")  

Enter the first number:  2
Enter the second number:  3
Enter operation (add, subtract, multiply, divide):  add

CPU Result: 5.0
Time taken on CPU: 0.000003 seconds

GPU Result: 5.0
Time taken on GPU: 0.534545 seconds

# Why GPU time > CPU time?


### 1- Overhead of Data Transfer:

#####    GPU computations usually carry overhead from transferring data between CPU (host) memory and GPU (device) memory. In this example the calculation involves just two numbers, so the cost of copying the result array to the GPU and back can outweigh any benefit from the GPU's parallel processing.

### 2- Kernel Launch Overhead:

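#####    Launching a GPU kernel involves fixed overhead: setting up the execution configuration (grid and block sizes), compiling the kernel on its first call (Numba compiles `@cuda.jit` kernels just in time), and scheduling the work on the GPU. Because the timed region in the code above contains the kernel's first launch, compilation is probably the largest part of the 0.53 seconds measured. For very simple tasks, this overhead takes far longer than performing the calculation on the CPU.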
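
One way to check how much of the measured time belongs to the first launch is to time several consecutive launches of the same kernel. The sketch below is illustrative only: `time_calls`, `add_kernel`, and `launch` are hypothetical names that do not appear in the code above, and the GPU part runs only when Numba reports a usable CUDA device.

```python
from timeit import default_timer as timer

import numpy as np

try:
    from numba import cuda
    GPU_OK = cuda.is_available()
except Exception:  # Numba not installed, or the CUDA driver is unusable
    cuda = None
    GPU_OK = False


def time_calls(fn, repeats=3):
    """Return the wall-clock duration of each of `repeats` consecutive calls."""
    durations = []
    for _ in range(repeats):
        start = timer()
        fn()
        durations.append(timer() - start)
    return durations


if GPU_OK:
    @cuda.jit
    def add_kernel(a, b, result):  # hypothetical single-thread kernel
        if cuda.grid(1) == 0:
            result[0] = a + b

    # Allocate the output on the device so no host array is copied per launch
    result = cuda.device_array(1, dtype=np.float64)

    def launch():
        add_kernel[1, 1](2.0, 3.0, result)
        cuda.synchronize()  # wait for the kernel before stopping the clock

    first, *later = time_calls(launch, repeats=4)
    print(f"First launch (includes JIT compilation): {first:.6f} s")
    print(f"Fastest later launch: {min(later):.6f} s")
else:
    print("No CUDA GPU available; skipping the timing run.")
```

If the later launches are far faster than the first, most of the original measurement was one-time setup rather than the arithmetic itself.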

### 3- Underutilization of GPU Resources:

#####    Launching a single thread, as `gpu_calculator[1, 1]` does, leaves almost all of the GPU idle, and Numba typically flags launches this small with a low-occupancy performance warning. GPUs are optimized for highly parallel workloads, so a kernel with only a few threads achieves low occupancy and gains nothing from the hardware's parallelism.

### 4- Nature of the Calculation:

#####    Arithmetic on just two numbers is a very lightweight task. The CPU is optimized for such small calculations and handles them with minimal latency (about 3 microseconds in the run above). For a more substantial workload, such as element-wise operations on large arrays, the GPU typically shows its strength.
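
For contrast, here is a minimal sketch of a workload that suits the GPU better: adding two large arrays with one thread per element. The names `launch_config` and `add_arrays`, the array size, and the choice of 256 threads per block are assumptions made for illustration and are not taken from the code above.

```python
import math

import numpy as np

try:
    from numba import cuda
    GPU_OK = cuda.is_available()
except Exception:  # Numba not installed, or the CUDA driver is unusable
    cuda = None
    GPU_OK = False


def launch_config(n, threads_per_block=256):
    """Return (blocks, threads_per_block) such that blocks * threads >= n."""
    blocks = math.ceil(n / threads_per_block)
    return blocks, threads_per_block


if GPU_OK:
    @cuda.jit
    def add_arrays(x, y, out):  # hypothetical element-wise kernel
        i = cuda.grid(1)
        if i < out.size:  # threads past the end of the array do nothing
            out[i] = x[i] + y[i]

    n = 1_000_000
    rng = np.random.default_rng(0)
    x = rng.random(n)
    y = rng.random(n)

    # Copy the inputs to the device once, and allocate the output there
    d_x = cuda.to_device(x)
    d_y = cuda.to_device(y)
    d_out = cuda.device_array_like(d_x)

    blocks, threads = launch_config(n)
    add_arrays[blocks, threads](d_x, d_y, d_out)
    cuda.synchronize()

    # Copy the result back and compare it with the NumPy result
    assert np.allclose(d_out.copy_to_host(), x + y)
    print(f"Launched {blocks} blocks of {threads} threads for {n} elements")
else:
    print("No CUDA GPU available; skipping the array example.")
```

Here thousands of blocks keep the GPU busy, so the fixed launch cost is spread across a million additions instead of one.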

### 5- Latency in Execution:

#####    For small tasks, the fixed latency of sending data to the GPU, launching the computation, and retrieving the results is higher than the time needed to perform the computation directly on the CPU. That cost is paid on every launch, regardless of how little work the kernel does.
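
The trade-off described in the points above can be sketched with a simple cost model: the CPU pays a small cost per item, while the GPU pays a fixed overhead plus a smaller cost per item. All numbers below are made-up placeholders rather than measurements, and the function names are hypothetical.

```python
import math


def cpu_time_ns(n, cpu_ns_per_item):
    """Modelled CPU time: proportional to the number of items."""
    return n * cpu_ns_per_item


def gpu_time_ns(n, overhead_ns, gpu_ns_per_item):
    """Modelled GPU time: fixed launch/transfer overhead plus per-item work."""
    return overhead_ns + n * gpu_ns_per_item


def break_even_items(overhead_ns, cpu_ns_per_item, gpu_ns_per_item):
    """Smallest n at which the modelled GPU time is no larger than the CPU time.

    Returns None when the GPU is not faster per item, since the fixed
    overhead can then never be recovered.
    """
    if gpu_ns_per_item >= cpu_ns_per_item:
        return None
    return math.ceil(overhead_ns / (cpu_ns_per_item - gpu_ns_per_item))


# Placeholder figures (assumptions, not measurements):
# 100 microseconds of fixed overhead, 1 ns per item on CPU, 0.1 ns on GPU
n_star = break_even_items(100_000, 1.0, 0.1)
print(f"Modelled break-even size: {n_star} items")
print(f"CPU model at n=2: {cpu_time_ns(2, 1.0)} ns")
print(f"GPU model at n=2: {gpu_time_ns(2, 100_000, 0.1)} ns")
```

With two numbers, as in the calculator above, the model's GPU time is almost entirely overhead. The GPU wins only once the input is large enough to amortize it.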

