# Chapter 7: Dataframes with cuDF

<img src="images/chapter-07/RAPIDS-logo-purple.png" style="width:600px;"/>

cuDF is a DataFrame library for GPU-accelerated computing with Python. It provides a pandas-like API that will be familiar to data engineers and data scientists, so they can accelerate their workflows without going into the details of CUDA programming.

cuDF is part of the NVIDIA RAPIDS suite of GPU-accelerated data science and AI libraries, whose APIs match the most popular open-source data tools. These libraries act as near drop-in replacements for the popular scientific computing libraries detailed in Chapter 3: Python on the GPU.

You can use cuDF to manipulate large datasets using the computational power of GPUs. It offers a familiar, pandas-like interface, and for large datasets that fit in GPU memory it typically runs common operations much faster than pandas.



## DataFrames Basics

cuDF primarily acts upon the DataFrame data structure.  A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.
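
As a minimal sketch of this idea, the snippet below builds a small DataFrame whose columns hold strings, integers, floating point values, and categorical data, then prints each column's dtype. The column names and values are made up for illustration, and the `try`/`except` fallback to pandas is an assumption added so the sketch also runs on a machine without a GPU.

```python
# Use cuDF when it can be imported; otherwise fall back to pandas so the
# sketch still runs on a CPU-only machine (an assumption for illustration).
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

# Hypothetical sample data: one column per data type.
people = xdf.DataFrame({
    "name": ["ada", "grace", "alan"],   # strings
    "age": [36, 45, 41],                # integers
    "score": [9.5, 8.25, 7.0],          # floating point values
})

# Add a categorical column built from strings.
people["team"] = xdf.Series(["red", "blue", "red"]).astype("category")

print(people.dtypes)   # one dtype per column
print(people.shape)    # (number of rows, number of columns)
```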

<img src="images/chapter-07/dataframe.png" style="width:600px;"/>



## cuDF Basics

cuDF is designed for ease of use.  Python scientific computing developers will find that cuDF is comparable to Pandas in many ways, but it's important to keep in mind that there are some key differences as well.

**Performance:**
- cuDF:
Leverages the parallel processing power of GPUs, making it significantly faster for large datasets (gigabytes to terabytes) and computationally intensive operations like joins, aggregations, and sorting.
- Pandas:
Runs on CPUs, limiting its performance for large datasets and complex operations.

**Hardware Requirements:**
- cuDF: Requires an NVIDIA GPU and the RAPIDS software suite.
- Pandas: Works on any system with a CPU.

**Functionality:**
- cuDF:
Supports most Pandas functionality, including data structures like Series and DataFrames, as well as common operations. However, certain features might differ slightly, and some Pandas functions may not be implemented or may have different behavior.
- Pandas:
Offers a wider range of functions and features, including advanced indexing and time series manipulation.

**Compatibility:**
- cuDF:
Can be integrated with other RAPIDS libraries for GPU-accelerated data science workflows.
- Pandas:
Works seamlessly with the broader Python ecosystem, including NumPy, Scikit-learn, and Matplotlib.
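
Because much of the wider Python ecosystem expects CPU data, one common pattern is to do the heavy lifting in cuDF and then copy the result back to pandas or NumPy. The sketch below assumes that workflow; `to_host` is a hypothetical helper, the data is made up, and the pandas fallback is only there so the snippet runs without a GPU.

```python
import numpy as np

# Use cuDF when available; fall back to pandas otherwise (illustrative only).
try:
    import cudf as xdf
except Exception:
    import pandas as xdf


def to_host(df):
    """Return a pandas object; cuDF objects are copied off the GPU."""
    return df.to_pandas() if hasattr(df, "to_pandas") else df


gdf = xdf.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
gdf["ratio"] = gdf["y"] / gdf["x"]   # computed on the GPU when cuDF is in use

pdf = to_host(gdf)                    # pandas DataFrame in CPU memory
ratios = pdf["ratio"].to_numpy()      # NumPy array for CPU-only libraries
print(ratios)
```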

**Pandas Accelerator Mode:**
cuDF provides a `cudf.pandas` mode, which enables users to leverage GPU acceleration with minimal code changes. It acts as a proxy: operations run on the GPU with cuDF where they are supported, and automatically fall back to Pandas on the CPU where they are not.

### Key differences to remember:
- Column Names: cuDF doesn't support duplicate column names, unlike Pandas.
- Data Types: While cuDF supports most Pandas data types, there might be differences in handling certain types like strings and categoricals.
- Indexing: cuDF might handle indexing and multi-index operations differently compared to Pandas.

### Choosing the Right Library:

For small datasets or CPU-bound tasks: Pandas is a good choice due to its wider functionality and compatibility.

For large datasets and GPU-accelerated computations: cuDF offers significant performance improvements, especially for data-intensive operations.

### cuDF vs. cudf.pandas
You may notice that the cuDF library includes a `cudf.pandas` module, which can be confusing when importing and using cuDF. Both cuDF and `cudf.pandas` are part of RAPIDS and are designed to accelerate data science workflows on GPUs, but there are key differences to consider. Most importantly, cuDF executes on the GPU, while `cudf.pandas` may fall back to running Pandas on the CPU for operations that cuDF does not support.

**cuDF:**
- Core library:  It's a GPU DataFrame library, offering a subset of the Pandas API optimized for GPU execution.
- Direct access: Use it directly when you need full control over GPU-specific features and operations.
- Performance: Can offer superior performance for supported operations due to direct GPU optimization.
- API compatibility: Not 100% compatible with Pandas, so some Pandas functions may not be available or behave differently.
  
**cudf.pandas:**
- Pandas accelerator:  A layer on top of cuDF that enables GPU acceleration for your existing Pandas code.
- Seamless transition:  Use it to accelerate your Pandas code without significant modifications.
- Automatic fallback:  If a particular operation isn't supported by cuDF, it automatically falls back to the CPU-based Pandas implementation.
- API compatibility:  Aims for 100% compatibility with the Pandas API, providing a drop-in replacement for most workflows.

**When to use each:**
- cuDF:
If you need maximum performance and are comfortable working with a slightly different API, or need to leverage GPU-specific features.
- cudf.pandas:
If you want to accelerate your existing Pandas code with minimal changes and rely on the full Pandas API.
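
A minimal sketch of the accelerator in a plain Python script is shown below. It uses `cudf.pandas.install()`, which needs to run before pandas is imported, so it is best tried in a fresh interpreter; in a notebook the equivalent is `%load_ext cudf.pandas` in the first cell. The `accelerated` flag and the `try`/`except` guard are assumptions added for illustration, so that the code still runs with plain pandas when cuDF is not installed.

```python
# Try to enable the pandas accelerator before pandas is imported.
try:
    import cudf.pandas
    cudf.pandas.install()
    accelerated = True
except Exception:
    accelerated = False   # cuDF unavailable: continue with plain pandas

import pandas as pd  # ordinary pandas code from here on

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
totals = df.sum()     # column sums; uses the GPU when the accelerator is active

print("accelerated:", accelerated)
print(totals)
```

The pandas code itself is identical in both cases, which is the point of the accelerator: only the setup lines at the top differ.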

<img src="images/chapter-07/inference-data-analytics-featured.jpg" style="width:600px;"/>

## The latest in cuDF integration: Polars GPU engine
Polars is one of the fastest growing Python libraries for data scientists and engineers, and was designed from the ground up for high-performance data processing on a single machine. It uses advanced query optimizations to reduce unnecessary data movement and processing, allowing data scientists to smoothly handle workloads of hundreds of millions of rows on a single machine. Polars bridges the gap where single-threaded solutions are too slow and distributed systems add unnecessary complexity, offering an appealing “medium-scale” data processing solution.

cuDF provides an in-memory, GPU-accelerated execution engine for Python users of the Polars Lazy API. The engine supports most of the core expressions and data types as well as a growing set of more advanced dataframe manipulations and data file formats. 

When using the GPU engine, Polars will convert expressions into an optimized query plan and determine whether the plan is supported on the GPU. If it is not, the execution will transparently fall back to the standard Polars engine and run on the CPU.

## Links to Handy References

cuDF Documentation: https://docs.rapids.ai/api/cudf/stable/ 

cuDF User Guide: https://docs.rapids.ai/api/cudf/stable/user_guide/ 

Pandas Documentation: https://pandas.pydata.org/docs/ 

Pandas API Reference: https://pandas.pydata.org/docs/reference/index.html 

Differences between cuDF and Pandas: https://docs.rapids.ai/api/cudf/stable/user_guide/pandas-comparison/ 

Data Exploration with cuDF: https://developer.nvidia.com/blog/accelerated-data-analytics-speed-up-data-exploration-with-rapids-cudf/ 

Polars GPU Engine Powered by RAPIDS cuDF Now Available in Open Beta: https://developer.nvidia.com/blog/polars-gpu-engine-powered-by-rapids-cudf-now-available-in-open-beta/

NVIDIA CUDA-X Now Accelerates the Polars Data Processing Library: https://developer.nvidia.com/blog/nvidia-cuda-x-now-accelerates-the-polars-data-processing-library/

Polars Docs: https://docs.pola.rs/ 


## Coding Guide

### Installation 
Please use the RAPIDS Installation Guide for installation instructions appropriate to your hardware and Python environment: https://docs.rapids.ai/install/ 

For the sake of our examples, we are using pip below:

In [None]:
!pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu12==24.8.*

# Examples:

## Create a cuDF Dataframe

In [None]:
import cudf
import numpy as np

numRows = 1000000
# Create a DataFrame with cuDF
data = {
    'A': np.random.rand(numRows),
    'B': np.random.rand(numRows),
    'C': np.random.rand(numRows)
}
gdf = cudf.DataFrame(data)

# Display the first few rows
print(gdf.head())


## Explore the DataFrame

**Shape:**

In [None]:
gdf.shape

As you can see, the first value corresponds to the number of rows we have, while the second indicates the number of columns we created.

Get a more comprehensive view of the dataframe, including column names, dtypes, and memory usage, using the `.info()` method!

In [None]:
gdf.info()

## Filtering Data

Filtering all rows where column 'A' is greater than 0.5:

In [None]:
# Keep only the rows where column 'A' is greater than 0.5
filtered_gdf = gdf[gdf['A'] > 0.5]
num_removed = numRows - filtered_gdf.shape[0]
print(f"As you can tell from the shape of the new filtered dataframe, the number of rows reduced from {numRows} to {filtered_gdf.shape[0]}. That's {num_removed} rows with 'A' values of 0.5 or less that we've filtered out!")
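
Conditions can also be combined. The sketch below uses a small made-up DataFrame rather than the random one above, and keeps rows where 'A' is greater than 0.5 and 'B' is less than 0.5 by combining boolean masks with `&`, as in pandas. The pandas fallback is an assumption so the snippet runs without a GPU.

```python
# Use cuDF when available; fall back to pandas otherwise (illustrative only).
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

small = xdf.DataFrame({
    "A": [0.1, 0.6, 0.9, 0.7],
    "B": [0.3, 0.8, 0.2, 0.4],
})

# Each comparison yields a boolean mask. The parentheses are required
# because & binds more tightly than > and <.
mask = (small["A"] > 0.5) & (small["B"] < 0.5)
both = small[mask]

print(both)
print(len(both))  # number of rows that satisfy both conditions
```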

## Grouping & Aggregating

Creating a new dataframe with categories:

In [None]:
data = {
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50]
}
gdf = cudf.DataFrame(data)
gdf

Group by category and calculate the mean:

In [None]:
grouped = gdf.groupby('Category')['Value'].mean().reset_index()
print(grouped)
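
Several aggregations can be computed in one pass by handing `agg` a list of function names. The sketch below rebuilds the same small dataset under a new name (`cat_df`, chosen here for illustration) and sorts the result by group key, since the cuDF documentation notes that group order is not guaranteed by default. The pandas fallback is an assumption so the snippet runs without a GPU.

```python
# Use cuDF when available; fall back to pandas otherwise (illustrative only).
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

cat_df = xdf.DataFrame({
    "Category": ["A", "B", "A", "B", "A"],
    "Value": [10, 20, 30, 40, 50],
})

# Mean, sum, and count per category; sort_index() makes the order explicit.
summary = (
    cat_df.groupby("Category")["Value"]
    .agg(["mean", "sum", "count"])
    .sort_index()
)

# Copy to pandas for display (already pandas under the fallback).
summary_host = summary.to_pandas() if hasattr(summary, "to_pandas") else summary
print(summary_host)
```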

## Using cudf vs. cudf.pandas

In [None]:
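import cudf
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Using cuDF directly: copy the pandas DataFrame to the GPU
gdf = cudf.DataFrame.from_pandas(df)
result = gdf.sum()
print(result)

# Using cudf.pandas: the accelerator is not a separate DataFrame class.
# Enable it BEFORE pandas is imported, for example by running
# `%load_ext cudf.pandas` in the first cell of a fresh kernel, or by
# launching a script with `python -m cudf.pandas script.py`.
# After that, ordinary pandas code is accelerated without changes:
#
#   %load_ext cudf.pandas
#   import pandas as pd
#   xpd_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
#   result = xpd_df.sum()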