# Pandas (cuDF)

In this notebook we will benchmark pandas against cuDF, NVIDIA's GPU DataFrame library. We will load a large CSV file and run a few common operations on it in three ways: with plain `pandas` on the CPU, with `cudf` directly on the GPU, and with `pandas` accelerated through the `cudf.pandas` extension. We will then compare the timings of each approach.

### Dataset information
![dataset_information](../public/dataset_information.png)

In [1]:
import time
import pandas as pd
import cudf

csv_file = "../data/concat.csv"

### Time to load a large CSV

In [2]:
# Measure Pandas load time (CPU)
start_time = time.time()
df_pandas = pd.read_csv(csv_file)
pandas_time = time.time() - start_time
print(f"Pandas load time: {pandas_time:.4f} seconds")

# Measure cuDF load time (GPU)
start_time = time.time()
df_cudf = cudf.read_csv(csv_file)
cudf_time = time.time() - start_time
print(f"cuDF (GPU) load time: {cudf_time:.4f} seconds")

Pandas load time: 15.6437 seconds
cuDF (GPU) load time: 2.8815 seconds
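
To put the two load times above on a common scale, the speedup can be expressed as the ratio of the CPU time to the GPU time. The sketch below only plugs in the timings printed above; the variable names are illustrative, and the figures will differ on other hardware or datasets.

```python
# Load times copied from the output above (seconds)
pandas_load_seconds = 15.6437
cudf_load_seconds = 2.8815

# Speedup = CPU time / GPU time
load_speedup = pandas_load_seconds / cudf_load_seconds
print(f"cuDF loads the CSV about {load_speedup:.1f}x faster than pandas")
```

For this run the ratio comes out at roughly 5.4x.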


In [3]:
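# Load the cudf.pandas extension so that pandas calls run on the GPU where possible.
# Note: pandas was already imported in cell 1; the cuDF documentation suggests
# loading the extension at the start of the notebook, before pandas is imported.
%load_ext cudf.pandas
import pandas as pd

# Measure load time for pandas with the cuDF extension enabled
start_time = time.time()
df_pandas_load = pd.read_csv(csv_file)
pandas_load_time = time.time() - start_time
print(f"Pandas load time: {pandas_load_time:.4f} seconds")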

Pandas load time: 2.8536 seconds
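
Each load time above comes from a single run timed with `time.time()`, so it may be influenced by one-off effects such as file caching or GPU initialization. As a possible refinement (not part of the original notebook), a small helper could repeat each call and report the median. The names `time_call` and `repeats` are hypothetical and introduced here only for illustration.

```python
import statistics
import time


def time_call(func, *args, repeats=5):
    """Run func(*args) `repeats` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs anywhere; in this notebook the call
# could be time_call(pd.read_csv, csv_file) instead.
elapsed = time_call(sum, range(1_000_000), repeats=3)
print(f"Median time: {elapsed:.4f} seconds")
```

Using the median rather than a single measurement makes the comparison less sensitive to an unusually slow first run.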


### Time to perform operations on the dataset

In [4]:
transformations = {
    "Filter Rows": lambda df: df[df[df.columns[0]] > df[df.columns[0]].median()],
    "Sort Values": lambda df: df.sort_values(by=df.columns[0]),
    "GroupBy Mean": lambda df: df.groupby(df.columns[1]).mean(numeric_only=True),  # Ensure only numeric columns are included
    "Add Column": lambda df: df.assign(new_col=df[df.columns[0]] * 2),
}

In [5]:
# Apply transformations and time them
for name, func in transformations.items():
    # Pandas
    start_time = time.time()
    _ = func(df_pandas)
    pandas_time = time.time() - start_time
    print(f"{name} - Pandas time: {pandas_time:.4f} seconds")

    # cuDF
    start_time = time.time()
    _ = func(df_cudf)
    cudf_time = time.time() - start_time
    print(f"{name} - cuDF time: {cudf_time:.4f} seconds")

    # Pandas + cuDF extension
    start_time = time.time()
    _ = func(df_pandas_load)
    pandas_ext_time = time.time() - start_time
    print(f"{name} - Pandas (cuDF extension) time: {pandas_ext_time:.4f} seconds")
    
    print("\n-----------------------------------\n")


Filter Rows - Pandas time: 0.2585 seconds
Filter Rows - cuDF time: 0.0475 seconds
Filter Rows - Pandas (cuDF extension) time: 0.0313 seconds

-----------------------------------

Sort Values - Pandas time: 1.9335 seconds
Sort Values - cuDF time: 0.0684 seconds
Sort Values - Pandas (cuDF extension) time: 0.3990 seconds

-----------------------------------

GroupBy Mean - Pandas time: 3.8155 seconds
GroupBy Mean - cuDF time: 3.4842 seconds
GroupBy Mean - Pandas (cuDF extension) time: 3.6203 seconds

-----------------------------------

Add Column - Pandas time: 0.5798 seconds
Add Column - cuDF time: 0.0049 seconds
Add Column - Pandas (cuDF extension) time: 0.0030 seconds

-----------------------------------
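
The raw timings above are easier to compare as speedups relative to plain pandas. The following sketch copies the numbers from the output above and divides the pandas time by each accelerated time; the `timings` and `speedups` dictionaries are illustrative names, not part of the original notebook.

```python
# Timings copied from the output above, in seconds:
# (pandas, cuDF, pandas with the cuDF extension)
timings = {
    "Filter Rows": (0.2585, 0.0475, 0.0313),
    "Sort Values": (1.9335, 0.0684, 0.3990),
    "GroupBy Mean": (3.8155, 3.4842, 3.6203),
    "Add Column": (0.5798, 0.0049, 0.0030),
}

speedups = {}
for name, (pandas_s, cudf_s, ext_s) in timings.items():
    # Speedup relative to plain pandas
    speedups[name] = (pandas_s / cudf_s, pandas_s / ext_s)

print(f"{'Operation':<14}{'cuDF':>10}{'cudf.pandas':>14}")
for name, (cudf_x, ext_x) in speedups.items():
    print(f"{name:<14}{cudf_x:>9.1f}x{ext_x:>13.1f}x")
```

On these numbers, Add Column and Sort Values gain the most from the GPU, while GroupBy Mean shows almost no gain (about 1.1x in both accelerated modes).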

