# Accelerating large string data processing with cuDF pandas accelerator mode (cudf.pandas)
<a href="https://github.com/rapidsai/cudf">cuDF</a> is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a pandas-style DataFrame API.

cuDF now provides a <a href="https://rapids.ai/cudf-pandas/">pandas accelerator mode</a> (`cudf.pandas`), allowing you to bring accelerated computing to your pandas workflows without requiring any code change.

This notebook demonstrates how the cuDF pandas accelerator mode can speed up processing of datasets with large string fields (4 GB+) by simply adding a `%load_ext` command. This feature was introduced in the RAPIDS 24.08 release.

**Authors:** Allison Ding, Mitesh Patel <br>
**Date:** October 3, 2025

# ⚠️ Verify your setup

First, we'll verify that you are running with an NVIDIA GPU.

In [7]:
!nvidia-smi  # this should display information about available GPUs

Fri Oct  3 23:16:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09              Driver Version: 580.82.09      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GB10                    Off |   0000000F:01:00.0 Off |                  N/A |
| N/A   44C    P0             10W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------
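The same check can be scripted from Python, which helps when the notebook runs unattended. This is only a minimal sketch: the helper name `gpu_visible` is hypothetical (not part of any RAPIDS API), and it only reports whether the `nvidia-smi` executable is on the `PATH` and exits cleanly, not whether cuDF supports the device.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is installed and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # driver tools are missing or not on PATH
    try:
        result = subprocess.run(
            [exe, "-L"],  # -L lists the GPUs the driver can see
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("NVIDIA GPU visible:", gpu_visible())
```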

# Download the data

## Overview
The data we'll be working with summarizes job postings data that a developer working at a job listing firm might analyze to understand posting trends.

We'll need to download a curated copy of this [Kaggle dataset](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=job_summary.csv) directly through the Kaggle API.

**Data License and Terms** <br>
As this dataset originates from Kaggle, it is governed by that dataset's license and terms of use, which is the Open Data Commons Attribution License (ODC-By) v1.0. Review it here: https://opendatacommons.org/licenses/by/1-0/index.html.

**Are there restrictions on how I can use this data?** <br>
For each dataset a user elects to use, the user is responsible for checking whether the dataset license is fit for the intended purpose.

## Get the Data
First, [follow these instructions from Kaggle to download or update your Kaggle API token so you can access the dataset](https://www.kaggle.com/discussions/general/74235).

Once the token is generated, make sure the **kaggle.json** file is in the same folder as the notebook.

Next, run the code below, which should take 1-2 minutes:

In [8]:
!pip install kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
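Before downloading, it can help to confirm that the token file is in place and locked down. The sketch below is an illustration only: `check_kaggle_token` is a hypothetical helper, and it assumes the usual `kaggle.json` layout with `username` and `key` fields plus POSIX file permissions.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path: Path) -> list[str]:
    """Return a list of problems found with a Kaggle token file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if field not in data:
            problems.append(f"missing field: {field}")
    # Any group/other permission bit means the file is looser than chmod 600.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append("token is accessible to other users; run chmod 600")
    return problems


print(check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json"))
```

An empty list means the file exists, parses, and is private to the current user.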



In [18]:
# Download the dataset through the Kaggle API
!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024
# Unzip the archive to access its contents
!unzip 1-3m-linkedin-jobs-and-skills-2024.zip
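A quick way to confirm the extraction worked is to check for the three CSV files that the rest of this notebook reads. The helper name `missing_files` is hypothetical, and the sketch assumes the archive was unpacked into the current working directory.

```python
from pathlib import Path

# File names taken from the read_csv calls later in this notebook.
EXPECTED_FILES = (
    "job_summary.csv",
    "job_skills.csv",
    "linkedin_job_postings.csv",
)


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


print("Missing files:", missing_files("."))
```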

# Analysis with cuDF Pandas

The magic command `%load_ext cudf.pandas` enables GPU acceleration for pandas data processing in a Jupyter notebook, allowing most pandas operations to automatically execute on NVIDIA GPUs for improved performance. 

With this extension loaded before importing pandas, your code can use standard pandas syntax while gaining the benefits of GPU speedup, automatically falling back to CPU execution for operations not supported on the GPU. This provides a seamless way to accelerate existing pandas workflows with zero code changes, especially for large data analytics tasks or machine learning preprocessing.

In [1]:
%load_ext cudf.pandas

In [2]:
import pandas as pd
import numpy as np
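Outside a notebook the `%load_ext` magic is not available. The accelerator can instead be switched on by running a script with `python -m cudf.pandas script.py`, or by calling `cudf.pandas.install()` before pandas is imported. The sketch below wraps the second option in a hypothetical helper, `enable_cudf_pandas`, which falls back to plain pandas when cuDF is missing or fails to initialize. Treating every initialization error as "not available" is an assumption made here for brevity.

```python
import importlib.util


def enable_cudf_pandas() -> bool:
    """Try to activate cudf.pandas; return True only if activation succeeded."""
    if importlib.util.find_spec("cudf") is None:
        return False  # cuDF is not installed in this environment
    try:
        import cudf.pandas

        cudf.pandas.install()
    except Exception:
        # Assumption: any failure (for example, no usable GPU) means
        # we continue with regular CPU pandas.
        return False
    return True


accelerated = enable_cudf_pandas()
print("cudf.pandas active:", accelerated)
# Import pandas only after this point so the accelerator can take effect.
```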

We'll run a piece of code to get a feel for what GPU acceleration brings to pandas workflows.

In [3]:
import time
start_time = time.time()

In [2]:
%time job_summary_df = pd.read_csv("job_summary.csv", dtype="str")
size_gb = job_summary_df.memory_usage(deep=True).sum() / (1024**3)
print("Dataset Size (in GB):", round(size_gb, 2))

CPU times: user 185 ms, sys: 2.08 s, total: 2.27 s
Wall time: 2.95 s
Dataset Size (in GB): 4.76


The same dataset takes around 1.5 minutes (about 90 seconds) to load with pandas. Against the 2.95 s wall time above, that's roughly a **30x speedup** with no changes to the code!
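The speedup is simply the ratio of the two load times. As a worked check using the figures quoted here (about 1.5 minutes for pandas, 2.95 s wall time with the accelerator), with `speedup` as an illustrative helper name:

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


pandas_load = 1.5 * 60  # about 1.5 minutes, as quoted in the text
cudf_load = 2.95        # wall time reported by %time above

print(f"{speedup(pandas_load, cudf_load):.1f}x")  # prints 30.5x
```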

Let's load the remaining two datasets as well:

In [4]:
%%time
job_skills_df = pd.read_csv("job_skills.csv", dtype="str")
job_postings_df = pd.read_csv("linkedin_job_postings.csv", dtype="str")

CPU times: user 45.3 ms, sys: 199 ms, total: 244 ms
Wall time: 354 ms
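All three files are read with `dtype="str"`, which keeps every column as text instead of letting pandas infer numeric types. The toy example below uses made-up data (the `job_id` column is hypothetical and not part of the dataset) to show the difference on plain pandas.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for a real file.
csv_text = "job_id,got_summary\n007,t\n042,f\n"

as_str = pd.read_csv(io.StringIO(csv_text), dtype="str")
inferred = pd.read_csv(io.StringIO(csv_text))

print(as_str["job_id"].tolist())    # ['007', '042'] - leading zeros preserved
print(inferred["job_id"].tolist())  # [7, 42] - parsed as integers
```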


In [38]:
%%time
job_summary_df['summary_length'] = job_summary_df['job_summary'].str.len()
job_summary_df['summary_length'].head()

CPU times: user 4.46 ms, sys: 3.1 ms, total: 7.56 ms
Wall time: 46.3 ms


0     957
1    3816
2    5314
3    2774
4    2749
Name: summary_length, dtype: int32

That was lightning fast! We went from around 10+ seconds (with pandas) to tens of milliseconds.
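For reference, `.str.len()` returns the number of characters in each string, not the number of bytes, which matters for non-ASCII text like some of the company names in this dataset. A small sketch on made-up strings, run here with plain pandas:

```python
import pandas as pd

texts = pd.Series(["GPU", "data frame", "", "鴻海"])

# One length per row; multi-byte characters count as a single character.
lengths = texts.str.len()
print(lengths.tolist())  # [3, 10, 0, 2]
```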

In [39]:
%%time
df_merged = pd.merge(job_postings_df, job_summary_df, how="left", on="job_link")
df_merged.head()

CPU times: user 39.8 ms, sys: 30 ms, total: 69.8 ms
Wall time: 211 ms


Unnamed: 0,job_link,last_processed_time,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_summary,summary_length
0,https://www.linkedin.com/jobs/view/account-exe...,2024-01-21 07:12:29.00256+00,t,t,f,Account Executive - Dispensing (NorCal/Norther...,BD,"San Diego, CA",2024-01-15,Coronado,United States,Color Maker,Mid senior,Onsite,Responsibilities\nJob Description Summary\nJob...,4602.0
1,https://www.linkedin.com/jobs/view/registered-...,2024-01-21 07:39:58.88137+00,t,t,f,Registered Nurse - RN Care Manager,Trinity Health MI,"Norton Shores, MI",2024-01-14,Grand Haven,United States,Director Nursing Service,Mid senior,Onsite,Employment Type:\nFull time\nShift:\nDescripti...,2950.0
2,https://www.linkedin.com/jobs/view/restaurant-...,2024-01-21 07:40:00.251126+00,t,t,f,RESTAURANT SUPERVISOR - THE FORKLIFT,Wasatch Adaptive Sports,"Sandy, UT",2024-01-14,Tooele,United States,Stand-In,Mid senior,Onsite,Job Details\nDescription\nWhat You'll Do\nAs a...,4571.0
3,https://www.linkedin.com/jobs/view/independent...,2024-01-21 07:40:00.308133+00,t,t,f,Independent Real Estate Agent,Howard Hanna | Rand Realty,"Englewood Cliffs, NJ",2024-01-16,Pinehurst,United States,Real-Estate Clerk,Mid senior,Onsite,Who We Are\nRand Realty is a family-owned brok...,3944.0
4,https://www.linkedin.com/jobs/view/group-unit-...,2024-01-19 09:45:09.215838+00,f,f,f,Group/Unit Supervisor (Systems Support Manager...,"IRS, Office of Chief Counsel","Chamblee, GA",2024-01-17,Gadsden,United States,Supervisor Travel-Information Center,Mid senior,Onsite,,
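Two details in this output follow from `how="left"`: every posting is kept even when it has no matching summary (the last row above has an empty `job_summary`), and `summary_length` now displays as a float such as `4602.0`, because a missing value forces the integer column to a floating-point dtype. The toy frames below are invented for illustration and show the same behavior in plain pandas.

```python
import pandas as pd

postings = pd.DataFrame(
    {"job_link": ["a", "b", "c"], "job_title": ["Nurse", "Clerk", "Agent"]}
)
summaries = pd.DataFrame({"job_link": ["a", "b"], "summary_length": [957, 3816]})

# A left join keeps all rows of `postings`; "c" has no match in `summaries`.
merged = pd.merge(postings, summaries, how="left", on="job_link")
n_missing = int(merged["summary_length"].isna().sum())

print(merged)
print("rows:", len(merged), "| missing summaries:", n_missing)
print("dtype:", merged["summary_length"].dtype)
```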


In [40]:
%%time
df_merged.groupby(['company',"job_title"]).agg({
    "summary_length":"mean"}).sort_values(by='summary_length', ascending = False).fillna(0)

CPU times: user 33.2 ms, sys: 17.3 ms, total: 50.6 ms
Wall time: 120 ms


company,job_title,summary_length
ClickJobs.io,Adolescent Behavioral Health Therapist - Substance Use Specialty (Entry Senior Level) Psychiatry,23748.0
Mt. San Antonio College,"Chief, Police and Campus Safety",22998.0
CareerBeacon,Airside/Groundside Project Manager [Halifax International Airport Authority],22938.0
Tacoma Community College,Anthropology Professor - Part-time,22790.0
"IRS, Office of Chief Counsel",Program Analyst (12-Month Roster),22774.0
...,...,...
鴻海精密工業股份有限公司,HR Specialist - Payroll & Benefit,0.0
鴻海精密工業股份有限公司,Material Planner,0.0
鴻海精密工業股份有限公司,RFQ Specialist,0.0
鴻海精密工業股份有限公司,Supply Chain Program Manager,0.0


With pandas this took around 5 seconds; here it finishes in about 120 ms. This is in line with the speedups on the other operations!
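The rows ending in `0.0` at the bottom of the result come from the chain itself: groups whose postings have no summary get a mean of `NaN`, `sort_values` places `NaN` last, and `fillna(0)` then turns those entries into zeros. A minimal sketch with invented data, run on plain pandas:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "company": ["A", "A", "B", "C"],
        "job_title": ["x", "x", "y", "z"],
        "summary_length": [100.0, 300.0, float("nan"), 50.0],
    }
)

result = (
    df.groupby(["company", "job_title"])
    .agg({"summary_length": "mean"})                  # (B, y) has only NaN -> NaN
    .sort_values(by="summary_length", ascending=False)  # NaN rows sort last
    .fillna(0)                                        # NaN becomes 0.0
)
print(result)
```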

In [41]:
%%time
# Group by job_title and job_location, and calculate the mean of summary_length
grouped_df = df_merged.groupby(['job_title', 'job_location']).agg({'summary_length': 'mean'})

# Reset the index so job_title and job_location become regular columns
grouped_df = grouped_df.reset_index()

# Sort by job_title, job_location, and summary_length in descending order
sorted_df = grouped_df.sort_values(by=['job_title', 'job_location','summary_length'],
                                   ascending=False).reset_index(drop=True).fillna(0)
sorted_df

CPU times: user 13.7 ms, sys: 20.3 ms, total: 34 ms
Wall time: 156 ms


Unnamed: 0,job_title,job_location,summary_length
0,"🔥Nurse Manager, Patient Services - Operating Room","Lake George, NY",7342.0
1,🔥Behavioral Health RN 3 12s,"Glens Falls, NY",2787.0
2,🔥 Surgical Technologist - Evenings,"Lake George, NY",2920.0
3,🔥 Physician Practice Clinical Lead RN,"Saratoga Springs, NY",2945.0
4,🔥 Physican Practice LPN - Green,"Lake George, NY",2969.0
...,...,...,...
1104106,"""Attorney"" (Gov Appt/Non-Merit) Jobs","Kentucky, United States",2427.0
1104107,"""Accountant""","Shavano Park, TX",1497.0
1104108,"""Accountant""","Basking Ridge, NJ",1073.0
1104109,"""Accountant""","Austin, TX",1993.0


The acceleration is consistently 10x+ for complex aggregations and sorting that involve multiple columns.

In [4]:
end_time = time.time()
execution_time = end_time - start_time
print(execution_time)

5.182934522628784
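To compare individual steps rather than the whole run, each operation can be wrapped in a small timer. The context manager below is a hypothetical helper, not part of pandas or cuDF. It uses `time.perf_counter`, a monotonic clock that is better suited to measuring short intervals than `time.time`.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in workload; in the notebook this would be a read_csv, merge, or groupby.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1000:.2f} ms")
```

Running the same labeled blocks once with and once without the accelerator gives per-step timings that can be compared directly.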


# Summary

With cudf.pandas, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and enjoy the incredible speedups.

To learn more about cudf.pandas, we encourage you to visit https://rapids.ai/cudf-pandas.