<a href="https://colab.research.google.com/github/MEOWcanCODE/Zotrday/blob/main/Spotify_rapids_cudf_pandas_accelerator_mode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 10 Minutes to RAPIDS cuDF's pandas accelerator mode (cudf.pandas)

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a pandas-style DataFrame API.

cuDF now provides a pandas accelerator mode (`cudf.pandas`), which brings accelerated computing to your pandas workflows without requiring any code changes.

This notebook is a short introduction to `cudf.pandas`.

# ⚠️ Verify your setup

First, we'll verify that you are running with an NVIDIA GPU.

In [None]:
!nvidia-smi  # this should display information about available GPUs

Mon Jul 15 07:00:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8              13W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

With our GPU-enabled Colab runtime active, we're ready to go. cuDF is available by default in the GPU-enabled runtime.

If you're interested in installing on other platforms, please visit https://rapids.ai/#quick-start to learn more.

In [None]:
import cudf  # this should work without any errors


stdout:



stderr:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/usr/local/lib/python3.10/dist-packages/numba/cuda/cudadrv/driver.py", line 295, in __getattr__
    raise CudaSupportError("Error at driver init: \n%s:" %
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: 

CUDA driver library cannot be found.
If you are sure that a CUDA driver is installed,
try setting environment variable NUMBA_CUDA_DRIVER
with the file path of the CUDA driver shared library.
:


Not patching Numba


ImportError: 
================================================================
Failed to import CuPy.

If you installed CuPy via wheels (cupy-cudaXXX or cupy-rocm-X-X), make sure that the package matches with the version of CUDA or ROCm installed.

On Linux, you may need to set LD_LIBRARY_PATH environment variable depending on how you installed CUDA/ROCm.
On Windows, try setting CUDA_PATH environment variable.

Check the Installation Guide for details:
  https://docs.cupy.dev/en/latest/install.html

Original error:
  ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
================================================================


We'll also install `plotly-express` for visualizing data.

### Environment Note
If you're not running this notebook on Colab, you may need to reload the webpage for the `plotly.express` visualizations to work correctly.


In [None]:
!pip install plotly-express

Collecting plotly-express
  Downloading plotly_express-0.4.1-py2.py3-none-any.whl (2.9 kB)
Installing collected packages: plotly-express
Successfully installed plotly-express-0.4.1


# Data Extraction


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("Most Streamed Spotify Songs 2024.csv", encoding='unicode_escape')

print(df.count())

Track                         4600
Album Name                    4600
Artist                        4595
Release Date                  4600
ISRC                          4600
All Time Rank                 4600
Track Score                   4600
Spotify Streams               4487
Spotify Playlist Count        4530
Spotify Playlist Reach        4528
Spotify Popularity            3796
YouTube Views                 4292
YouTube Likes                 4285
TikTok Posts                  3427
TikTok Likes                  3620
TikTok Views                  3619
YouTube Playlist Reach        3591
Apple Music Playlist Count    4039
AirPlay Spins                 4102
SiriusXM Spins                2477
Deezer Playlist Count         3679
Deezer Playlist Reach         3672
Amazon Playlist Count         3545
Pandora Streams               3494
Pandora Track Stations        3332
Soundcloud Streams            1267
Shazam Counts                 4023
TIDAL Popularity                 0
Explicit Track      

## Data Cleaning

In [None]:
# drop duplicates
df_no_dup = df.drop_duplicates()
print("No Dupes:", df_no_dup.shape)

'''
['Track', 'Album Name', 'Artist', 'Release Date', 'ISRC',
       'All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
       'Deezer Playlist Count', 'Deezer Playlist Reach',
       'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
       'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity',
       'Explicit Track', 'Date']

['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'tidal_popularity',
       'explicit_track', 'date']
'''
accepted = ['Track', 'Album Name', 'Artist', 'All Time Rank', 'Spotify Streams', 'Explicit Track']

print(df_no_dup.columns)

# make columns lower case and no space
columns = df_no_dup.columns
for i in range(len(columns)):
    old_name = columns[i]

    if old_name == "Release Date":
        df_no_dup['date'] = pd.to_datetime(df_no_dup['Release Date']) # parse M/D/Y release-date strings into datetimes
        df_no_dup = df_no_dup.drop(columns[i], axis=1)

    elif old_name not in accepted: # drop unnecessary columns
        df_no_dup = df_no_dup.drop(columns[i], axis=1)

    else:
        new_name = old_name.lower()
        new_name = new_name.translate(str.maketrans(" ,", "_-")) # change space to underscores and commas to dashes
        df_no_dup = df_no_dup.rename(columns={old_name: new_name})

df_no_dup.dropna(inplace=True)
print(df_no_dup.head)

No Dupes: (4598, 29)
Index(['Track', 'Album Name', 'Artist', 'Release Date', 'ISRC',
       'All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
       'Deezer Playlist Count', 'Deezer Playlist Reach',
       'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
       'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity',
       'Explicit Track'],
      dtype='object')
<bound method NDFrame.head of                                 track                        album_name  \
0                 MILLION DOLLAR BABY      Million Dollar Baby - Single   
1                         Not Like Us                       Not Like Us   
2          i like the way you kiss me        I like the way you kiss me   
3   
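The cleaning loop above drops unwanted columns, parses the release date, and normalizes the remaining column names one column at a time. The same steps can be sketched without a loop, as below. The tiny `raw` frame and its values are made up for illustration and stand in for the Spotify CSV, and only three of the accepted columns are included to keep the example short.

```python
import pandas as pd

# Hypothetical stand-in for the Spotify CSV (values are illustrative only).
raw = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song B"],
    "Album Name": ["Album A", "Album B", "Album B"],
    "Release Date": ["4/26/2024", "5/4/2024", "5/4/2024"],
    "ISRC": ["X1", "X2", "X2"],
    "Spotify Streams": ["390,470,936", None, None],
})
accepted = ["Track", "Album Name", "Spotify Streams"]

# Drop duplicate rows, then keep only the accepted columns plus the date.
cleaned = raw.drop_duplicates()[accepted + ["Release Date"]].copy()

# Replace the M/D/Y string column with a parsed datetime column named "date".
cleaned["date"] = pd.to_datetime(cleaned.pop("Release Date"), format="%m/%d/%Y")

# Lower-case the names; map spaces to underscores and commas to dashes.
cleaned.columns = cleaned.columns.str.lower().str.translate(str.maketrans(" ,", "_-"))

cleaned = cleaned.dropna()
print(cleaned.head())
```

Note the call `head()` with parentheses: printing `df.head` without them shows the bound method rather than the first rows, which is what the output above displays.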

## Filtering Non-Ascii Characters


In [None]:
def is_ascii(string):
    try:
        int(string)  # values that parse as integers pass the filter
        return True
    except (ValueError, TypeError):
        if isinstance(string, pd.Timestamp):  # datetimes pass the filter
            return True
        # keep only strings made of printable ASCII characters (codes 32-126)
        return all(32 <= ord(c) <= 126 for c in string)

mask = df_no_dup.applymap(is_ascii).all(axis=1)
df_ascii = df_no_dup[mask]
print(df_ascii.head)

<bound method NDFrame.head of                                          track  \
0                          MILLION DOLLAR BABY   
1                                  Not Like Us   
2                   i like the way you kiss me   
3                                      Flowers   
4                                      Houdini   
...                                        ...   
4593  Jaragandi (From "Game Changer") (Telugu)   
4595                         For the Last Time   
4596                          Dil Meri Na Sune   
4597                     Grace (feat. 42 Dugg)   
4598                       Nashe Si Chadh Gayi   

                                    album_name          artist all_time_rank  \
0                 Million Dollar Baby - Single   Tommy Richman             1   
1                                  Not Like Us  Kendrick Lamar             2   
2                   I like the way you kiss me         Artemas             3   
3                             Flowers - Single   
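The filter above calls a Python function on every cell through `applymap` (deprecated since pandas 2.1 in favor of `DataFrame.map`). As a hedged alternative, the sketch below checks only the text columns with a regular expression that matches the printable ASCII range from space to `~`. The sample frame and the explicit `text_cols` list are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the second track contains non-ASCII characters.
songs = pd.DataFrame({
    "track": ["Song A", "Canci\u00f3n B", "Song C"],
    "artist": ["Artist X", "Artist Y", "Artist Z"],
    "all_time_rank": [1, 2, 3],
})

text_cols = ["track", "artist"]  # columns assumed to hold strings

# One boolean column per text column: True where the whole value is printable ASCII.
checks = pd.concat(
    [songs[col].str.fullmatch(r"[ -~]*") for col in text_cols], axis=1
)

# Keep only rows where every text column passed the check.
ascii_only = songs[checks.all(axis=1)]
print(ascii_only)
```

Working column by column with the `.str` accessor avoids a Python-level call per cell, which is also the style of operation that an accelerated backend is more likely to handle.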

## Let's time it!

Loading and processing this data took a little time. Let's measure how long these pipelines take in pandas:

In [None]:
%%time

df = pd.read_csv("Most Streamed Spotify Songs 2024.csv", encoding='unicode_escape')

# drop duplicates
df_no_dup = df.drop_duplicates()

accepted = ['Track', 'Album Name', 'Artist', 'All Time Rank', 'Spotify Streams', 'Explicit Track']

# make columns lower case and no space
columns = df_no_dup.columns
for i in range(len(columns)):
    old_name = columns[i]

    if old_name == "Release Date":
        df_no_dup['date'] = pd.to_datetime(df_no_dup['Release Date']) # parse M/D/Y release-date strings into datetimes
        df_no_dup = df_no_dup.drop(columns[i], axis=1)

    elif old_name not in accepted: # drop unnecessary columns
        df_no_dup = df_no_dup.drop(columns[i], axis=1)

    else:
        new_name = old_name.lower()
        new_name = new_name.translate(str.maketrans(" ,", "_-")) # change space to underscores and commas to dashes
        df_no_dup = df_no_dup.rename(columns={old_name: new_name})

df_no_dup.dropna(inplace=True)

CPU times: user 92.4 ms, sys: 7.92 ms, total: 100 ms
Wall time: 100 ms
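`%%time` reports a single run, and at the 100 ms scale one measurement can be noisy. Outside a notebook, or when you want repeated measurements, a small helper built on `time.perf_counter` is one option. The sketch below assumes a stand-in workload; `best_of` and `toy_pipeline` are hypothetical names, not part of pandas or cuDF.

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def toy_pipeline():
    # Hypothetical stand-in for the load-and-clean pipeline above.
    return sorted(str(i) for i in range(50_000))


elapsed = best_of(toy_pipeline)
print(f"best of 5: {elapsed * 1000:.1f} ms")
```

Taking the minimum over several runs reduces the influence of one-off effects such as a cold file cache.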


In [None]:
%%time

def is_ascii(string):
    try:
        int(string)  # values that parse as integers pass the filter
        return True
    except (ValueError, TypeError):
        if isinstance(string, pd.Timestamp):  # datetimes pass the filter
            return True
        # keep only strings made of printable ASCII characters (codes 32-126)
        return all(32 <= ord(c) <= 126 for c in string)

mask = df_no_dup.applymap(is_ascii).all(axis=1)
df_ascii = df_no_dup[mask]
print(df_ascii.head)

<bound method NDFrame.head of                                          track  \
0                          MILLION DOLLAR BABY   
1                                  Not Like Us   
2                   i like the way you kiss me   
3                                      Flowers   
4                                      Houdini   
...                                        ...   
4593  Jaragandi (From "Game Changer") (Telugu)   
4595                         For the Last Time   
4596                          Dil Meri Na Sune   
4597                     Grace (feat. 42 Dugg)   
4598                       Nashe Si Chadh Gayi   

                                    album_name          artist all_time_rank  \
0                 Million Dollar Baby - Single   Tommy Richman             1   
1                                  Not Like Us  Kendrick Lamar             2   
2                   I like the way you kiss me         Artemas             3   
3                             Flowers - Single   

# Using cudf.pandas

Now, let's re-run the pandas code above with the `cudf.pandas` extension loaded.

Typically, you should load the `cudf.pandas` extension as the first step in your notebook, before importing any modules. Here, we explicitly restart the kernel to simulate that behavior.

In [None]:
get_ipython().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

In [None]:
%load_ext cudf.pandas
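The `%load_ext` magic only works in IPython and Jupyter. In a plain Python script, the cuDF documentation describes two alternatives: running the script with `python -m cudf.pandas script.py`, or calling `cudf.pandas.install()` before pandas is imported. The sketch below wraps the second option in a hypothetical helper, `enable_accelerator`, that falls back to plain pandas when cuDF or a GPU driver is unavailable.

```python
def enable_accelerator():
    """Try to install the cudf.pandas accelerator; return True on success."""
    try:
        import cudf.pandas
        cudf.pandas.install()  # must run before pandas is first imported
        return True
    except Exception:
        # cuDF is missing or no usable GPU driver was found: use plain pandas.
        return False


accelerated = enable_accelerator()

import pandas as pd  # imported only after the accelerator attempt

print("cudf.pandas active:", accelerated)
print(pd.DataFrame({"a": [1, 2, 3]}).sum())
```

The ordering matters for the same reason the notebook restarts the kernel: pandas must not already be imported when the accelerator is installed.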

In [None]:
%%time

import pandas as pd

df = pd.read_csv("Most Streamed Spotify Songs 2024.csv", encoding='unicode_escape')

# drop duplicates
df_no_dup = df.drop_duplicates()

'''
['Track', 'Album Name', 'Artist', 'Release Date', 'ISRC',
       'All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
       'Deezer Playlist Count', 'Deezer Playlist Reach',
       'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
       'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity',
       'Explicit Track', 'Date']

['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'tidal_popularity',
       'explicit_track', 'date']
'''
accepted = ['Track', 'Album Name', 'Artist', 'All Time Rank', 'Spotify Streams', 'Explicit Track']

# make columns lower case and no space
columns = df_no_dup.columns
for i in range(len(columns)):
    old_name = columns[i]

    if old_name == "Release Date":
        df_no_dup['date'] = pd.to_datetime(df_no_dup['Release Date']) # parse M/D/Y release-date strings into datetimes
        df_no_dup = df_no_dup.drop(columns[i], axis=1)

    elif old_name not in accepted: # drop unnecessary columns
        df_no_dup = df_no_dup.drop(columns[i], axis=1)

    else:
        new_name = old_name.lower()
        new_name = new_name.translate(str.maketrans(" ,", "_-")) # change space to underscores and commas to dashes
        df_no_dup = df_no_dup.rename(columns={old_name: new_name})

df_no_dup.dropna(inplace=True)
print(df_no_dup.head)

<cudf.pandas.fast_slow_proxy._MethodProxy object at 0x790b81eef1f0>
CPU times: user 768 ms, sys: 146 ms, total: 914 ms
Wall time: 1.11 s


In [None]:
%%time

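import string

def is_ascii(s):
    try:
        int(s)  # values that parse as integers pass the filter
        return True
    except (ValueError, TypeError):  # non-numeric strings and timestamps land here
        if isinstance(s, pd.Timestamp):  # datetimes pass the filter
            return True
        return all(c in string.printable for c in s)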

mask = df_no_dup.applymap(is_ascii).all(axis=1)
df_ascii = df_no_dup[mask]
print(df_ascii.head)



NameError: name 'ord' is not defined

In [None]:
%%time

weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()

CPU times: user 261 ms, sys: 63.9 ms, total: 325 ms
Wall time: 341 ms


issue_weekday
Sunday        462992
Saturday     1108385
Monday       2488563
Wednesday    2760088
Tuesday      2809949
Friday       2891679
Thursday     2913951
Name: Summons Number, dtype: int64

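On large datasets this can be much faster: operations that took 5-20 seconds can potentially finish in just milliseconds without changing any code. On a small dataset such as the roughly 4,600-row Spotify file, fixed GPU overhead can dominate: the cleaning cell above ran in about 100 ms with plain pandas and about 1.11 s with `cudf.pandas` loaded.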

# Understanding Performance

`cudf.pandas` provides profiling utilities to help you better understand performance. With these tools, you can identify which parts of your code ran on the GPU and which parts ran on the CPU.

They're accessible in the `cudf.pandas` namespace since the `cudf.pandas` extension was loaded above with `%load_ext cudf.pandas`.

#### Colab Note
If you're running in Colab, the first time you use the profiler it may take 10+ seconds due to Colab's debugger interacting with the built-in Python function [sys.settrace](https://docs.python.org/3/library/sys.html#sys.settrace) that we use for profiling. For demo purposes, this isn't an issue. Just run the cell again.

## Profiling Functionality

We can generate a per-function and per-line profile:

In [None]:
%%cudf.pandas.profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis, numeric_only=True)
    axis = 1

counts = small_df.groupby("a").b.count()

In [None]:
%%cudf.pandas.line_profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis, numeric_only=True)
    axis = 1

counts = small_df.groupby("a").b.count()

## Behind the scenes: What's going on here?

When you load `cudf.pandas`, pandas types like `Series` and `DataFrame` are replaced by proxy objects that dispatch operations to cuDF when possible. We can verify that `cudf.pandas` is active by looking at our `pd` variable:

In [None]:
pd

<module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>

As a result, all pandas functions, methods, and created objects are proxies:

In [None]:
type(pd.read_csv)

Operations supported by cuDF will be **very** fast:

In [None]:
%%time
df.count(axis=0)

CPU times: user 2.9 ms, sys: 0 ns, total: 2.9 ms
Wall time: 2.82 ms


Registration State       15435607
Violation Description    15117819
Vehicle Body Type        15402365
Issue Date               15435607
Summons Number           15435607
issue_weekday            15435607
dtype: int64

Operations not supported by cuDF will be slower, as they fall back to pandas (copying data between the CPU and GPU under the hood as needed). For example, cuDF does not currently support `axis=1` for the `count` method, so the following operation runs on the CPU and is noticeably slower than the previous one.

In [None]:
%%time
df.count(axis=1) # This will use pandas, because cuDF doesn't support axis=1 for the .count() method

CPU times: user 5.99 s, sys: 2 s, total: 7.99 s
Wall time: 7.87 s


0           5
1           5
2           5
3           5
4           5
           ..
15435602    6
15435603    6
15435604    6
15435605    6
15435606    6
Length: 15435607, dtype: int64
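The fast-path and fallback behavior shown in the two cells above can be illustrated with a toy proxy in plain Python. This is only a sketch of the general dispatch idea and is not how `cudf.pandas` is implemented; every class name here is hypothetical.

```python
class ToyFastBackend:
    """Stands in for the GPU library: fast, but supports fewer options."""

    def count(self, rows, axis=0):
        if axis != 0:
            raise NotImplementedError("toy fast backend only supports axis=0")
        return len(rows)


class ToySlowBackend:
    """Stands in for the CPU library: supports every option."""

    def count(self, rows, axis=0):
        if axis == 0:
            return len(rows)
        return [len(row) for row in rows]


class ToyProxy:
    """Try the fast backend first and fall back to the slow one."""

    def __init__(self, fast, slow):
        self._fast = fast
        self._slow = slow
        self.last_backend = None

    def __getattr__(self, name):
        # Only called for attributes not found normally, i.e. backend methods.
        def dispatch(*args, **kwargs):
            try:
                result = getattr(self._fast, name)(*args, **kwargs)
                self.last_backend = "fast"
            except NotImplementedError:
                result = getattr(self._slow, name)(*args, **kwargs)
                self.last_backend = "slow"
            return result

        return dispatch


proxy = ToyProxy(ToyFastBackend(), ToySlowBackend())
rows = [[1, 2], [3, 4], [5, 6]]

column_count = proxy.count(rows)       # handled by the fast backend
first_backend = proxy.last_backend
row_counts = proxy.count(rows, axis=1)  # falls back to the slow backend
second_backend = proxy.last_backend
print(column_count, first_backend, row_counts, second_backend)
```

The caller writes the same code in both cases; only the time taken differs, which mirrors the `count(axis=0)` versus `count(axis=1)` timings above.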

But the story doesn't end here. We often need to mix our own code with third-party libraries that other people have written. Many of these libraries accept pandas objects as inputs.

# Using third-party libraries with cudf.pandas

You can pass pandas objects to third-party libraries when using `cudf.pandas`, just like you would when using regular pandas.

Below, we show an example of using [plotly-express](https://plotly.com/python/plotly-express/) to visualize the data we've been processing:

## Which states have more pickup trucks relative to other vehicles?

In [None]:
import plotly.express as px

df = df.rename(columns={
    "Registration State": "reg_state",
    "Vehicle Body Type": "vehicle_type",
})

# vehicle counts per state:
counts = df.groupby("reg_state").size().sort_index()
# vehicles with type "PICK" (Pickup Truck)
pickup_counts = df.where(df["vehicle_type"] == "PICK").groupby("reg_state").size()
# percentage of pickup trucks by state:
pickup_frac = ((pickup_counts / counts) * 100).rename("% Pickup Trucks")
del pickup_frac["MB"]  # (Manitoba is a huge outlier!)

# plot the results:
pickup_frac = pickup_frac.reset_index()
px.choropleth(pickup_frac, locations="reg_state", color="% Pickup Trucks", locationmode="USA-states", scope="usa")
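The percentage calculation in the cell above relies on pandas index alignment: dividing two grouped Series matches them by state, and states with no pickup rows become `NaN`. The sketch below reproduces that step on a few made-up rows; the column names follow the cell above, and the values are illustrative only.

```python
import pandas as pd

# Hypothetical rows standing in for the parking-violations data.
tickets = pd.DataFrame({
    "reg_state": ["NY", "NY", "NJ", "NJ", "NJ", "CT"],
    "vehicle_type": ["PICK", "SUBN", "PICK", "PICK", "SDN", "SDN"],
})

# Vehicle counts per state.
counts = tickets.groupby("reg_state").size()

# Counts of rows whose type is "PICK", per state.
pickup_counts = tickets[tickets["vehicle_type"] == "PICK"].groupby("reg_state").size()

# Division aligns on the state index; CT has no pickups, so it becomes NaN.
pickup_pct = (pickup_counts / counts * 100).rename("% Pickup Trucks")
print(pickup_pct)
```

Here a boolean filter replaces `df.where(...)`; both approaches leave only the matching rows in the grouped counts.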

## Beyond just passing data: **Accelerating** third-party code

Being able to pass these proxy objects to libraries like Plotly is great, but the benefits don't end there.

When you enable `cudf.pandas`, pandas operations running **inside the third-party library's functions** will also benefit from GPU acceleration where possible!

Below, you can see an image illustrating how `cudf.pandas` can accelerate the pandas backend in Ibis, a library that provides a unified DataFrame API to various backends. We ran this example on a system with an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU.


By loading the `cudf.pandas` extension, pandas operations within Ibis can use the GPU with zero code change. It just works.

![ibis](https://drive.google.com/uc?id=1uOJq2JtbgVb7tb8qw8a2gG3JRBo72t_H)

# Conclusion

With `cudf.pandas`, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and run your existing code on a GPU!

To learn more, we encourage you to visit [rapids.ai/cudf-pandas](https://rapids.ai/cudf-pandas).