In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Accelerated Data Analytics with Google Cloud and NVIDIA

| Authors |
| --- |
| Jeff Nelson, Will Hill |

## Overview

In this codelab, you will learn how to accelerate your data analytics workflows on large datasets using NVIDIA GPUs and open-source libraries on Google Cloud. You will start by optimizing your infrastructure and then explore how to apply GPU acceleration with zero code changes.

You will focus on `pandas`, a popular data manipulation library, and learn how to accelerate it using NVIDIA's `cuDF` library. The best part is you can get this GPU acceleration without changing your existing `pandas` and `scikit-learn` code.

### Objectives

* Understand Colab Enterprise on Google Cloud.
* Customize a Colab runtime environment with specific GPU, CPU, and memory configurations.
* Accelerate `pandas` with zero code changes using NVIDIA `cuDF`.
* Profile your code to identify and optimize performance bottlenecks.


### Services and Costs

This tutorial uses the following billable components of Google Cloud:

*   **Colab Enterprise**: [Pricing](https://cloud.google.com/colab/pricing)

*   **Google Cloud Storage**: [Pricing](https://cloud.google.com/storage/pricing)


You can use the [Pricing Calculator](https://cloud.google.com/products/calculator) to generate a cost estimate based on your projected usage.

## Setup

**Note:** This notebook is intended to run in Colab Enterprise with a [GPU-enabled runtime](https://docs.cloud.google.com/colab/docs/default-runtimes-with-gpus). If you run this notebook elsewhere, you must ensure you have a compatible NVIDIA GPU and have installed the necessary RAPIDS libraries (`cuDF`, `cuML`). Please refer to the [RAPIDS installation guide](https://docs.rapids.ai/install) for environment-specific instructions.



---


## Prepare the NYC taxi dataset

This codelab uses the [NYC Taxi & Limousine Commission (TLC) Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

The dataset contains individual trip records from yellow taxis in New York City, and includes fields like:
*   Pick-up and drop-off dates, times, and locations
*   Trip distances
*   Itemized fare amounts
*   Passenger counts

### Download the data

Next, download the trip data for all of 2024. The data is stored in the Parquet file format.

The following code block performs these steps:
1.  Defines the range of data to download.
2.  Creates a local directory named `nyc_taxi_data` to store the files.
3.  Loops through each month, downloads the corresponding Parquet file if it doesn't already exist, and saves it to the directory.

In [None]:
from tqdm import tqdm
import requests
import time
import os

YEAR = 2024
DATA_DIR = "nyc_taxi_data"

os.makedirs(DATA_DIR, exist_ok=True)
print(f"Checking/Downloading files for {YEAR}...")


for month in tqdm(range(1, 13), unit="file"):

    # Define standardized filename for both local path and URL
    file_name = f"yellow_tripdata_{YEAR}-{month:02d}.parquet"
    local_path = os.path.join(DATA_DIR, file_name)
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file_name}"

    if not os.path.exists(local_path):
        try:
            with requests.get(url, stream=True) as response:
                response.raise_for_status()
                with open(local_path, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
            time.sleep(1)
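        except requests.exceptions.RequestException as e:
            # Covers HTTP errors as well as connections dropped mid-download
            print(f"\nSkipping {file_name}: {e}")
            # Remove any partially written file so a rerun retries it
            if os.path.exists(local_path):
                os.remove(local_path)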

print("\nDownload complete.")
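
Because the download loop skips files that fail, it can help to confirm that every month actually arrived. The following is a minimal sketch, assuming the same file naming scheme as the download cell; the helper name `missing_months` is hypothetical and not part of the original codelab.

```python
import os


def missing_months(data_dir: str, year: int) -> list[int]:
    """Return the months (1-12) whose Parquet file is absent or empty."""
    missing = []
    for month in range(1, 13):
        path = os.path.join(data_dir, f"yellow_tripdata_{year}-{month:02d}.parquet")
        # Treat zero-byte files as missing; they suggest an interrupted download.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(month)
    return missing


# An empty list means all twelve files are present and non-empty.
print(missing_months("nyc_taxi_data", 2024))
```

If the list is not empty, rerunning the download cell retries only the files that are missing.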



---


## Explore the taxi trip data

Now that you've downloaded the dataset, it's time to perform an initial exploratory data analysis (EDA). The goal of EDA is to understand the data's structure, find anomalies, and uncover potential patterns.

### Load a single month of data

Begin by loading a single month's worth of data. This provides a large enough sample (over 3 million rows) to be meaningful while keeping memory usage manageable for interactive analysis.

In [None]:
import pandas as pd

# Load the last month of the downloaded data
df = pd.read_parquet("nyc_taxi_data/yellow_tripdata_2024-12.parquet")
df.head()

### Get summary statistics

Use the `.describe()` method to generate high-level summary statistics for the numerical columns. This is a great first step to spot potential data quality issues, such as unexpected minimum or maximum values.

In [None]:
df.describe().round(2)

### Investigate data quality

The output from `.describe()` immediately reveals an issue. Notice that the `min` value for `tpep_pickup_datetime` and `tpep_dropoff_datetime` is in the year 2008, which doesn't make sense for a 2024 dataset.

This is why you should always inspect your data before analyzing it. You can investigate further by sorting the DataFrame to find the exact rows that contain these outlier dates.

In [None]:
# Sort by the pickup datetime to see the oldest records
df.sort_values("tpep_pickup_datetime").head()
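
Once you have found the outlier rows, one possible follow-up is to keep only trips whose pickup time falls inside the month the file is supposed to cover. The sketch below is an illustration rather than part of the original pipeline: the helper name `keep_pickups_in_month` is hypothetical, and the small `toy` DataFrame contains made-up rows so the example runs without the downloaded data.

```python
import pandas as pd


def keep_pickups_in_month(df: pd.DataFrame, year: int, month: int) -> pd.DataFrame:
    """Keep rows whose pickup time lies in the half-open window [month start, next month start)."""
    start = pd.Timestamp(year=year, month=month, day=1)
    end = start + pd.offsets.MonthBegin(1)
    in_range = (df["tpep_pickup_datetime"] >= start) & (df["tpep_pickup_datetime"] < end)
    return df.loc[in_range]


# Made-up rows for illustration: two fall outside December 2024.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2008-12-31 23:05:00",
        "2024-12-01 00:00:00",
        "2024-12-31 23:59:59",
        "2025-01-01 00:00:00",
    ]),
    "trip_distance": [1.2, 3.4, 0.8, 2.0],
})

cleaned = keep_pickups_in_month(toy, 2024, 12)
print(len(toy) - len(cleaned), "rows dropped")
```

On the real data, you would call the helper with `df` in place of `toy`. Whether to drop or repair such rows depends on your analysis.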

### Visualize data distributions

Next, create histograms of the numerical columns to visualize their distributions. This helps you understand the spread and skew of features like `trip_distance` and `fare_amount`. The `.hist()` method is a quick way to plot histograms for all numerical columns in a DataFrame.

In [None]:
_ = df.hist(figsize=(20, 20))

Finally, generate a scatter matrix to visualize the relationships between a few key columns. Because plotting millions of points is slow and can obscure patterns, use `.sample()` to create the plot from a random sample of 100,000 rows.

In [None]:
_ = pd.plotting.scatter_matrix(
    df[['passenger_count', 'trip_distance', 'tip_amount', 'total_amount']].sample(100_000),
    diagonal="kde",
    figsize=(15, 15)
)
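
A random sample changes on every run, which makes plots hard to compare, and `.sample()` raises an error when asked for more rows than the DataFrame contains. The following sketch shows one way to handle both; the helper name `sample_rows` and the made-up `toy` DataFrame are assumptions for illustration only.

```python
import pandas as pd


def sample_rows(df: pd.DataFrame, n: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Return up to n rows, chosen reproducibly for a given seed."""
    # Cap n at the frame's length so small frames do not raise ValueError.
    return df.sample(n=min(n, len(df)), random_state=seed)


toy = pd.DataFrame({"trip_distance": [0.5, 1.0, 2.5, 4.0, 7.5]})

first = sample_rows(toy, n=3, seed=42)
second = sample_rows(toy, n=3, seed=42)
print(first.index.equals(second.index))  # True: same seed, same rows
print(len(sample_rows(toy, n=100)))      # 5: capped at the frame's length
```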



---


## Why use the Parquet file format?

The NYC taxi dataset is provided in [Apache Parquet](https://parquet.apache.org/) format. This is a deliberate choice made for large-scale analytics. Parquet offers several advantages over file types like CSV:

*   **Efficient and Fast:** As a columnar format, Parquet is highly efficient to store and read. It supports modern compression methods that result in smaller file sizes and significantly faster I/O, especially on GPUs.
*   **Preserves the Schema:** Parquet stores data types in the file's metadata. You never have to guess data types when you read the file.
*   **Enables Selective Reading:** The columnar structure allows you to read only the specific columns you need for an analysis. This can dramatically reduce the amount of data you have to load into memory.

### Inspect metadata without loading the full dataset

While you can't view a Parquet file in a standard text editor, you can easily inspect its schema and metadata without loading any data into memory. This is useful for quickly understanding the structure of a file.

In [None]:
import pyarrow as pa
from pyarrow.parquet import ParquetFile

pf = ParquetFile('nyc_taxi_data/yellow_tripdata_2024-12.parquet')

# Print the schema
print("File Schema:")
print(pf.schema)

# Print the file metadata
print("\nFile Metadata:")
print(pf.metadata)

### Read only the columns you need

Imagine you only need a handful of columns, such as passenger count, trip distance, tip amount, and total amount. With Parquet, you can load just those columns, which is much faster and more memory-efficient than loading the entire DataFrame.

In [None]:
import pandas as pd

# Read only four specific columns from the Parquet file
df_subset = pd.read_parquet(
    'nyc_taxi_data/yellow_tripdata_2024-12.parquet',
    columns=['passenger_count', 'trip_distance', 'tip_amount', 'total_amount']
)

df_subset.head()


---


## Accelerate pandas with NVIDIA cuDF

[NVIDIA CUDA for DataFrames (cuDF)](https://docs.rapids.ai/api/cudf/stable/) is an open-source, GPU-accelerated DataFrame library. cuDF lets you perform common data operations such as filtering, joining, and grouping on the GPU with massive parallelism.

The key feature you use in this codelab is the `cudf.pandas` accelerator mode. When you enable it, your standard `pandas` code is automatically redirected to use GPU-powered `cuDF` kernels under the hood, all without requiring you to change your code.

### Enable GPU acceleration

To use NVIDIA cuDF in a Colab Enterprise notebook, you load its magic extension before you import `pandas`.

First, inspect the standard `pandas` library. Notice the output shows the path to the default `pandas` installation.

In [None]:
import pandas as pd
pd # Note the output for the standard pandas library

Now, load the `cudf.pandas` extension and import `pandas` again. The output for the `pd` module changes, which confirms that the GPU-accelerated version is now active.

In [None]:
%load_ext cudf.pandas
import pandas as pd
pd # Note the new output, indicating cudf.pandas is active

### Restart the kernel

Names bound before the extension was loaded, such as an earlier `import pandas as pd`, can still point at standard `pandas`. Restart the kernel to get a clean environment for the next steps. A restart clears everything that was loaded, including `cudf.pandas`, so later sections load the extension again before importing `pandas`. When using "Run All", execution stops here, and you need to continue manually after the restart.

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

**Note:** If you see a warning like `UserWarning: Failed to dlopen libcuda.so.1...` and the extension fails to load, check that your runtime has a GPU attached and try again.

### Other ways to enable `cudf.pandas`

While the magic command (`%load_ext`) is the easiest method in a notebook, you can also enable the accelerator in other environments:

*   **In Python scripts:** Add `import cudf.pandas` and `cudf.pandas.install()` before your `pandas` import.
*   **From the command line:** Run your script with `python -m cudf.pandas your_script.py`.
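
For the script case, the sketch below shows one way to order the imports so that the same file also runs on a machine without cuDF. The wrapper name `enable_cudf_pandas` is hypothetical; only `cudf.pandas.install()` comes from the cuDF API. The sketch handles the case where cuDF is not installed, and does not attempt to handle a machine that has cuDF installed but no usable GPU.

```python
def enable_cudf_pandas() -> bool:
    """Try to install the cudf.pandas accelerator; return True on success."""
    try:
        import cudf.pandas
    except ImportError:
        # cuDF is not installed: continue with standard CPU pandas.
        return False
    cudf.pandas.install()
    return True


ACCELERATED = enable_cudf_pandas()

# Import pandas only after the accelerator has had a chance to install itself.
import pandas as pd

print("cudf.pandas active" if ACCELERATED else "running on CPU pandas")
```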



---


## Compare CPU vs. GPU performance

Now for the most important part: comparing the performance of standard `pandas` on a CPU with `cudf.pandas` on a GPU.

To ensure a fair baseline for the CPU, first restart the Colab runtime. This clears any GPU accelerator you enabled in the previous sections. You can restart the runtime by running the following cell or by selecting **Restart session** from the **Runtime** menu.

In [None]:
import IPython

print("Restarting the kernel to ensure a clean CPU baseline...")
print("If you are using 'Run All', execution will stop here. Please continue manually after the restart.")

IPython.Application.instance().kernel.do_shutdown(True)

### Define the analytics pipeline

Now that the environment is clean, define the benchmarking function. It runs the exact same pipeline (loading, sorting, and summarizing) with whichever `pandas` module you pass to it.

In [None]:
import time
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def run_analytics_pipeline(pd_module):
    """Loads, sorts, and summarizes data using the provided pandas module."""
    timings = {}

    # 1. Load all 2024 Parquet files from the directory
    t0 = time.time()
    df = pd_module.concat(
        [pd_module.read_parquet(f) for f in glob.glob("nyc_taxi_data/*_2024*.parquet")],
        ignore_index=True
    )
    timings["load"] = time.time() - t0

    # 2. Sort the data by multiple columns
    t0 = time.time()
    df = df.sort_values(
        ['tpep_pickup_datetime', 'trip_distance', 'passenger_count'],
        ascending=[False, True, False]
    )
    timings["sort"] = time.time() - t0

    # 3. Perform a groupby and aggregation
    t0 = time.time()
    df['tpep_pickup_datetime'] = pd_module.to_datetime(df['tpep_pickup_datetime'])
    _ = (
        df.loc[df.tpep_pickup_datetime > '2024-11-01']
          .groupby(['VendorID', 'tpep_pickup_datetime'])
          [['passenger_count', 'fare_amount']]
          .agg(['min', 'mean', 'max'])
    )
    timings["summarize"] = time.time() - t0

    return timings
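
A single wall-clock measurement can be noisy, and the first run of any workload may include one-time start-up costs. One common way to get steadier numbers is to repeat the call and compare the first run against the median or best run. The sketch below illustrates that idea; the helper `time_repeated` is hypothetical, and it times a stand-in workload (sorting a range) rather than the taxi pipeline.

```python
import time
from statistics import median


def time_repeated(fn, repeats: int = 3) -> dict:
    """Call fn() `repeats` times and summarize wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {"first": durations[0], "median": median(durations), "best": min(durations)}


# Stand-in workload so the example runs without the downloaded data.
stats = time_repeated(lambda: sorted(range(100_000), reverse=True))
print({name: round(seconds, 4) for name, seconds in stats.items()})
```

To apply it to the benchmark, you could pass something like `lambda: run_analytics_pipeline(pd)` as `fn`, keeping in mind that each repeat reloads all twelve files.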

### Run the comparison

First, run the pipeline using standard `pandas` on the CPU. Then, enable `cudf.pandas` and run it again on the GPU.

In [None]:
# --- Run on CPU ---
print("Running analytics pipeline on CPU...")
# Ensure we are using standard pandas
import pandas as pd
assert "cudf" not in str(pd), "Error: cuDF is still active. Please restart the kernel."

cpu_times = run_analytics_pipeline(pd)
print(f"CPU times: {cpu_times}")

# --- Run on GPU ---
print("\nEnabling cudf.pandas and running on GPU...")
# Load the extension
%load_ext cudf.pandas
import pandas as gpu_pd

gpu_times = run_analytics_pipeline(gpu_pd)
print(f"GPU times: {gpu_times}")

### Visualize the results

Finally, visualize the difference. The following code calculates the overall speedup across all stages and plots the CPU and GPU runtimes for each stage side by side.

In [None]:
# Create a DataFrame for plotting
results_df = pd.DataFrame([cpu_times, gpu_times], index=["CPU", "GPU"]).T
total_cpu_time = results_df['CPU'].sum()
total_gpu_time = results_df['GPU'].sum()
speedup = total_cpu_time / total_gpu_time

print("--- Performance Results ---")
print(results_df)
print(f"\nTotal CPU Time: {total_cpu_time:.2f} seconds")
print(f"Total GPU Time: {total_gpu_time:.2f} seconds")
print(f"Overall Speedup: {speedup:.2f}x")

# Plot the results
fig, ax = plt.subplots(figsize=(10, 6))
results_df.plot(kind='bar', ax=ax, color={"CPU": "tab:blue", "GPU": "#76B900"})

ax.set_ylabel("Time (seconds)")
ax.set_title(f"CPU vs. GPU Runtimes (Overall Speedup: {speedup:.2f}x)", fontsize=14)
ax.tick_params(axis='x', rotation=0)

# Add numerical labels to the bars
for container in ax.containers:
    ax.bar_label(container, fmt="%.2f", padding=3)

plt.tight_layout()
plt.show()
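
The plotting cell reports a single overall ratio. A per-stage breakdown can show which of the load, sort, and summarize steps benefits most. The sketch below assumes two timing dictionaries shaped like the ones returned by `run_analytics_pipeline`; the helper name `per_stage_speedup` is hypothetical, and the numbers are placeholders for illustration, not measured results.

```python
def per_stage_speedup(cpu: dict, gpu: dict) -> dict:
    """Return the CPU/GPU time ratio for each stage present in both dicts."""
    return {
        stage: cpu[stage] / gpu[stage]
        for stage in cpu
        # Skip stages without a GPU timing, or with a zero timing, to avoid errors.
        if stage in gpu and gpu[stage] > 0
    }


# Placeholder numbers for illustration only, not measured results.
example_cpu = {"load": 8.0, "sort": 12.0, "summarize": 3.0}
example_gpu = {"load": 2.0, "sort": 1.5, "summarize": 0.5}

for stage, ratio in per_stage_speedup(example_cpu, example_gpu).items():
    print(f"{stage}: {ratio:.1f}x")
```

With your own results, call `per_stage_speedup(cpu_times, gpu_times)`.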



---


## Profile your code to find bottlenecks

Even with GPU acceleration, some `pandas` operations might fall back to the CPU if they are not yet supported by `cuDF`. These "CPU fallbacks" can become performance bottlenecks.

To help you identify these areas, [`cudf.pandas`](https://docs.rapids.ai/api/cudf/stable/cudf_pandas/usage/) includes two built-in profilers. You can use them to see exactly which parts of your code are running on the GPU and which are falling back to the CPU.

*   `%%cudf.pandas.profile`: Use this for a high-level, function-by-function summary of your code. It's best for getting a quick overview of which operations are running on which device.
*   `%%cudf.pandas.line_profile`: Use this for a detailed, line-by-line analysis. It's the best tool for pinpointing the exact lines in your code that are causing a fallback to the CPU.

Use these profilers as "cell magics" at the top of a notebook cell.

### Function-level profiling with `%%cudf.pandas.profile`

First, run the function-level profiler on the same analytics pipeline from the previous section. The output shows a table of every function called, the device it ran on (GPU or CPU), and how many times it was called.


#### Re-enable `cudf.pandas` and imports for profiling

Since the kernel was restarted for the CPU vs. GPU comparison, `cudf.pandas` needs to be re-enabled before running the profilers. This cell also imports `glob` which is used in the profiling examples. A dummy `pd.DataFrame` creation ensures `cudf.pandas` is fully active.

In [None]:
%load_ext cudf.pandas
import glob
import pandas as pd

# Ensure cudf.pandas is active before profiling
pd.DataFrame({"a": [1]})

After ensuring `cudf.pandas` is active, you can run a profile.

In [None]:
%%cudf.pandas.profile

df = pd.concat([pd.read_parquet(f) for f in glob.glob("nyc_taxi_data/*2024-1*.parquet")], ignore_index=True)

df = df.sort_values(['tpep_pickup_datetime', 'trip_distance', 'passenger_count'], ascending=[False, True, False])

summary = (
    df
        .loc[(df.tpep_pickup_datetime > '2024-11-01')]
        .groupby(['VendorID','tpep_pickup_datetime'])
        [['passenger_count', 'fare_amount']]
        .agg(['min', 'mean', 'max'])
)

### Line-by-line profiling with `%%cudf.pandas.line_profile`

Next, run the line-level profiler. This gives you a much more granular view, showing the portion of time each line of code spent executing on the GPU versus the CPU. This is the most effective way to find specific bottlenecks to optimize.

In [None]:
%%cudf.pandas.line_profile

df = pd.concat([pd.read_parquet(f) for f in glob.glob("nyc_taxi_data/*2024-1*.parquet")], ignore_index=True)

df = df.sort_values(['tpep_pickup_datetime', 'trip_distance', 'passenger_count'], ascending=[False, True, False])

summary = (
    df
        .loc[(df.tpep_pickup_datetime > '2024-11-01')]
        .groupby(['VendorID','tpep_pickup_datetime'])
        [['passenger_count', 'fare_amount']]
        .agg(['min', 'mean', 'max'])
)



---


## Integrate with Google Cloud Storage

[Google Cloud Storage (GCS)](https://docs.cloud.google.com/storage/docs/introduction) is a scalable and durable object storage service. When you use Colab Enterprise, GCS is a great place to store your datasets, model checkpoints, and other artifacts.

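Your Colab Enterprise runtime has the necessary permissions to read and write data directly to GCS buckets. With `cudf.pandas` enabled, the Parquet encoding and decoding in these operations can run on the GPU, while the transfer to and from GCS still depends on network throughput.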

### Create a GCS bucket

First, create a new GCS bucket. GCS bucket names must be globally unique, so append part of a UUID to the name.

In [None]:
from google.cloud import storage
import uuid

unique_suffix = uuid.uuid4().hex[:12]
bucket_name = f'nyc-taxi-codelab-{unique_suffix}'
# Create one client and reuse it for both the project ID and the bucket
client = storage.Client()
project_id = client.project

try:
    bucket = client.create_bucket(bucket_name)
    print(f"Successfully created bucket: gs://{bucket.name}")
except Exception as e:
    print(f"Bucket creation failed. You may already own it or the name is taken: {e}")
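
Bucket creation also fails if the name breaks the GCS naming rules, so it can be useful to check the name locally before calling the API. The sketch below checks only a subset of those rules (3 to 63 characters; lowercase letters, digits, hyphens, and underscores; starting and ending with a letter or digit). The helper name `make_bucket_name` is hypothetical, and the check is an approximation rather than the full rule set.

```python
import re
import uuid

# Subset of the GCS naming rules; dots and reserved words are not handled here.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def make_bucket_name(prefix: str, suffix_len: int = 12) -> str:
    """Append a random hex suffix to prefix and validate the resulting name."""
    name = f"{prefix}-{uuid.uuid4().hex[:suffix_len]}"
    if not _BUCKET_NAME.fullmatch(name):
        raise ValueError(f"Invalid bucket name: {name!r}")
    return name


print(make_bucket_name("nyc-taxi-codelab"))
```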

### Write data directly to GCS

Now, save a DataFrame directly to your new GCS bucket. If the `df` variable isn't available from the previous sections, the code first loads a single month of data.

In [None]:
%%cudf.pandas.line_profile

# Ensure the DataFrame exists before saving to GCS
if 'df' not in locals():
    print("DataFrame not found, loading a sample file...")
    df = pd.read_parquet('nyc_taxi_data/yellow_tripdata_2024-12.parquet')

print(f"Writing data to gs://{bucket_name}/nyc_taxi_data.parquet...")
df.to_parquet(f"gs://{bucket_name}/nyc_taxi_data.parquet", index=False)
print("Write operation complete.")

### Verify the file in GCS

You can verify the data is in GCS by visiting the bucket. The following code creates a clickable link.

In [None]:
import IPython
from IPython.display import Markdown

# Construct the URL and display it as a clickable link
gcs_url = f"https://console.cloud.google.com/storage/browser/{bucket_name}?project={project_id}"
Markdown(f'**[Click here to view your GCS bucket in the Google Cloud Console]({gcs_url})**')

### Read data directly from GCS

Finally, read data directly from a GCS path into a DataFrame. With `cudf.pandas` enabled, Parquet decoding can run on the GPU, which helps when you load large datasets from cloud storage.

In [None]:
%%cudf.pandas.line_profile

print(f"Reading data from gs://{bucket_name}/nyc_taxi_data.parquet...")
df_from_gcs = pd.read_parquet(f"gs://{bucket_name}/nyc_taxi_data.parquet")

df_from_gcs.head()



---


## Clean Up

To avoid incurring unexpected charges to your Google Cloud account, you need to clean up the resources you created.

### Delete resources

The following code permanently deletes the GCS bucket and the locally downloaded NYC taxi dataset.

In [None]:

# Permanently delete the GCS bucket
print(f"Deleting GCS bucket: gs://{bucket_name}...")
!gsutil rm -r -f gs://{bucket_name}
print("Bucket deleted.")

# Remove NYC taxi dataset on the Colab runtime
print("Deleting local 'nyc_taxi_data' directory...")
!rm -rf nyc_taxi_data
print("Local files deleted.")
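
If you run this notebook somewhere that shell commands are not available, the local cleanup can be done in plain Python instead. This is a minimal sketch of an alternative to the `rm -rf` line; the helper name `remove_local_data` is hypothetical, and it covers only the local directory, not the GCS bucket.

```python
import shutil
from pathlib import Path


def remove_local_data(data_dir: str = "nyc_taxi_data") -> bool:
    """Delete data_dir and its contents; return False if it does not exist."""
    path = Path(data_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


print("removed" if remove_local_data() else "nothing to remove")
```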

### Shut down your Colab runtime

* In the Google Cloud console, go to the Colab Enterprise **Runtimes** page.
* In the **Region** menu, select the region that contains your runtime.
* Select the runtime you want to delete.
* Click **Delete**.
* Click **Confirm**.

### Delete your Notebook

* In the Google Cloud console, go to the Colab Enterprise **My Notebooks** page.
* In the **Region** menu, select the region that contains your notebook.
* Select the notebook you want to delete.
* Click **Delete**.
* Click **Confirm**.



---


## Recap

Congratulations! You've successfully accelerated a pandas analytics workflow using NVIDIA cuDF on Colab Enterprise. You learned how to configure GPU-enabled runtimes, enable `cudf.pandas` for zero-code-change acceleration, profile code for bottlenecks, and integrate with Google Cloud Storage.

### Reference docs

* [Colab Enterprise Documentation](https://cloud.google.com/colab/docs/introduction)
* [NVIDIA cuDF Documentation](https://docs.rapids.ai/api/cudf/stable/)