# Benchmarking Pandas, CUDF, and Polars

## Introduction

Choosing the right tools for data manipulation in your day-to-day work as a quant can significantly affect your workflow's efficiency. In this project, we compare three popular libraries: **Pandas**, **CUDF**, and **Polars**. We benchmark two common operations, `groupby` and `map`, on a synthetic dataset of 10 million rows and 20 columns to see which library performs best.

## Installation Steps

### Check GPU Availability

First, make sure you have a GPU available in your Colab environment. Run this command:

```python
!nvidia-smi
```

Output from the Colab session used for this benchmark:

```text
Sun May 19 04:23:15 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P0              29W /  70W |   1665MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

### Install RAPIDS in Google Colab or Notebook

Clone the RAPIDS Colab utilities and run the install script:

```python
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable
```

## Benchmark Code

The script below builds a random dataset of 10 million rows in each library and times `groupby` and `map`:

```python
import time
import pandas as pd
import cudf
import numpy as np
import polars as pl

# Create a large DataFrame with 10 million rows and 20 columns
data_size = 10**7
num_columns = 20
data = np.random.rand(data_size, num_columns)
columns = [f'col{i}' for i in range(num_columns)]

# Create Pandas DataFrame
pdf = pd.DataFrame(data, columns=columns)
pdf['group'] = np.random.randint(0, 100, size=data_size)  # Add a column for groupby

# Convert to CUDF DataFrame
gdf = cudf.DataFrame.from_pandas(pdf)

# Create Polars DataFrame
pl_df = pl.DataFrame(data, schema=columns)
pl_df = pl_df.with_columns([
    # Reuse the Pandas group column so all three frames hold identical data
    pl.Series('group', pdf['group'].to_numpy())
])

# Define functions for various operations

def pandas_groupby(df):
    return df.groupby('group').sum()

def cudf_groupby(df):
    return df.groupby('group').sum()

def polars_groupby(df):
    # Newer Polars releases name this method `group_by`; `groupby` is deprecated
    return df.group_by('group').agg(pl.all().sum())

def pandas_map(df):
    return df['col0'].map(lambda x: x * 2)

def cudf_map(df):
    return df['col0'].apply(lambda x: x * 2)

def polars_map(df):
    return df.with_columns((pl.col('col0') * 2).alias('mapped_col0'))

# Benchmark Groupby

# Measure time for Pandas groupby
start_time = time.time()
pandas_groupby(pdf)
pandas_groupby_time = time.time() - start_time

# Measure time for CUDF groupby
start_time = time.time()
cudf_groupby(gdf)
cudf_groupby_time = time.time() - start_time

# Measure time for Polars groupby
start_time = time.time()
polars_groupby(pl_df)
polars_groupby_time = time.time() - start_time

print(f"Pandas groupby time: {pandas_groupby_time} seconds")
print(f"CUDF groupby time: {cudf_groupby_time} seconds")
print(f"Polars groupby time: {polars_groupby_time} seconds")

# Benchmark Map

# Measure time for Pandas map
start_time = time.time()
pandas_map(pdf)
pandas_map_time = time.time() - start_time

# Measure time for CUDF map
start_time = time.time()
cudf_map(gdf)
cudf_map_time = time.time() - start_time

# Measure time for Polars map
start_time = time.time()
polars_map(pl_df)
polars_map_time = time.time() - start_time

print(f"Pandas map time: {pandas_map_time} seconds")
print(f"CUDF map time: {cudf_map_time} seconds")
print(f"Polars map time: {polars_map_time} seconds")

# Repeat the measurements for more accurate results

def average_time(func, df, repeats=5):
    """Return the mean wall-clock time in seconds of `repeats` calls to func(df)."""
    times = []
    for _ in range(repeats):
        start_time = time.time()
        func(df)
        times.append(time.time() - start_time)
    return np.mean(times)

print(f"Average Pandas groupby time: {average_time(pandas_groupby, pdf)} seconds")
print(f"Average CUDF groupby time: {average_time(cudf_groupby, gdf)} seconds")
print(f"Average Polars groupby time: {average_time(polars_groupby, pl_df)} seconds")

print(f"Average Pandas map time: {average_time(pandas_map, pdf)} seconds")
print(f"Average CUDF map time: {average_time(cudf_map, gdf)} seconds")
print(f"Average Polars map time: {average_time(polars_map, pl_df)} seconds")


```

Output of the single runs:

```text
Pandas groupby time: 0.5358314514160156 seconds
CUDF groupby time: 0.07317042350769043 seconds
Polars groupby time: 0.7813115119934082 seconds
Pandas map time: 2.6636924743652344 seconds
CUDF map time: 0.7424323558807373 seconds
Polars map time: 0.05660867691040039 seconds
```

Output of the repeated runs:

```text
Average Pandas groupby time: 0.5473709583282471 seconds
Average CUDF groupby time: 0.06942968368530274 seconds
Average Polars groupby time: 1.1076735019683839 seconds
Average Pandas map time: 2.900290775299072 seconds
Average CUDF map time: 0.005760478973388672 seconds
Average Polars map time: 0.050056076049804686 seconds
```


## Benchmarking Results

Average wall-clock time over 5 runs, in seconds:

| Operation | Pandas Time (s) | CUDF Time (s) | Polars Time (s) |
|-----------|------------------|---------------|-----------------|
| Groupby   | 0.5474           | 0.0694        | 1.1077          |
| Map       | 2.9003           | 0.0058        | 0.0501          |
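
To make the table easier to compare, the averages can be expressed as speedups relative to Pandas. The sketch below simply divides the Pandas time by each library's time, using the numbers from the table; the `timings` dictionary and the `speedup_vs_pandas` helper are illustrative names, not part of the benchmark script.

```python
# Average times in seconds, copied from the results table above.
timings = {
    "groupby": {"Pandas": 0.5474, "CUDF": 0.0694, "Polars": 1.1077},
    "map": {"Pandas": 2.9003, "CUDF": 0.0058, "Polars": 0.0501},
}


def speedup_vs_pandas(op_timings):
    """Return {library: pandas_time / library_time}; values above 1 mean faster than Pandas."""
    baseline = op_timings["Pandas"]
    return {lib: baseline / t for lib, t in op_timings.items()}


for op, op_timings in timings.items():
    for lib, factor in speedup_vs_pandas(op_timings).items():
        print(f"{op:8s} {lib:7s} {factor:8.2f}x")
```

A factor below 1 (Polars on `groupby`, about 0.49) means the library was slower than Pandas in this run.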


## Observations

### CUDF: The Power of GPU Acceleration
CUDF was the fastest library for `groupby`, averaging 0.069 seconds, about 8 times faster than Pandas (0.547 seconds). Its `map` average of 0.006 seconds was also the lowest of the three, but the first `map` call took 0.74 seconds. That gap is consistent with a one-time warm-up cost, such as compiling the lambda for the GPU, that the repeated runs no longer pay. Both results show the advantage of GPU acceleration for large-scale data operations once that start-up cost is amortized.
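
Because a first call can include one-time start-up work, it helps to time a warm-up call separately from the repeated runs. The following is a minimal sketch using only the standard library; `benchmark`, `warmup`, and `repeats` are hypothetical names, and with the functions defined earlier it would be called as, for example, `benchmark(cudf_map, gdf)`.

```python
import statistics
import time


def benchmark(func, *args, warmup=1, repeats=5):
    """Time func(*args), reporting warm-up calls separately from steady-state runs."""
    warmup_times = []
    for _ in range(warmup):
        start = time.perf_counter()
        func(*args)
        warmup_times.append(time.perf_counter() - start)

    run_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        run_times.append(time.perf_counter() - start)

    return {
        "warmup": warmup_times,
        "mean": statistics.mean(run_times),
        "min": min(run_times),
        "runs": run_times,
    }


# Stand-in workload so the sketch runs without a GPU.
result = benchmark(sum, range(100_000))
print(f"warm-up: {result['warmup'][0]:.6f} s, mean of 5: {result['mean']:.6f} s")
```

Reporting the minimum alongside the mean is a design choice: the minimum is less sensitive to background noise on a shared machine such as a Colab instance.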

### Polars: Efficient and Versatile
Polars showed impressive results in the `map` operation, completing it in 0.05 seconds on average, roughly 58 times faster than Pandas. Note that the Polars version is a vectorized column expression (`pl.col('col0') * 2`) rather than a per-element Python function. In `groupby`, however, Polars was the slowest of the three in this run, averaging 1.11 seconds against 0.55 seconds for Pandas and 0.07 seconds for CUDF.

### Pandas: Reliable but Slower
Pandas, the most familiar and widely used of the three, was the slowest in `map` (2.90 seconds on average) but faster than Polars in `groupby` (0.55 versus 1.11 seconds). The slow `map` is expected: `Series.map` with a Python lambda calls the function once per element in the interpreter, so 10 million rows mean 10 million Python calls. Pandas' ease of use and extensive ecosystem still make it a go-to choice for many data scientists.
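
For comparison, the same doubling can be written in Pandas as a vectorized expression, which is closer to what the Polars version does. The sketch below uses a smaller, illustrative frame (1 million rows, a size chosen here and not taken from the benchmark) and checks that both forms give the same values; no timings are claimed for it.

```python
import time

import numpy as np
import pandas as pd

# Smaller than the benchmark's 10 million rows so the sketch runs quickly.
n_rows = 10**6
df = pd.DataFrame({"col0": np.random.rand(n_rows)})

start = time.perf_counter()
mapped = df["col0"].map(lambda x: x * 2)  # one Python call per element
map_time = time.perf_counter() - start

start = time.perf_counter()
vectorized = df["col0"] * 2  # one operation over the whole column
vectorized_time = time.perf_counter() - start

# Both approaches should produce the same values.
assert np.allclose(mapped.to_numpy(), vectorized.to_numpy())
print(f"map: {map_time:.4f} s, vectorized: {vectorized_time:.4f} s")
```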

## Conclusion
The benchmark shows that for large datasets and performance-critical applications, GPU-accelerated libraries like CUDF can offer substantial benefits: CUDF had the lowest average time in both operations. Polars was far faster than Pandas for the column-wise `map` but slower for `groupby` in this run, so its advantage depends on the operation. It remains a valuable tool for data manipulation when no GPU is available.


## Connect with Me on LinkedIn

[![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue)](https://www.linkedin.com/in/akjha002/)

I invite you to connect with me on LinkedIn. Follow my profile for insights, updates, and professional networking opportunities in financial mathematics and quantitative analysis. Click the badge above or follow this link: [Amit](https://www.linkedin.com/in/akjha002/).

Let's connect and grow together in the world of Quant!
