# RAPIDS

![](https://rapids.ai/images/RAPIDS-logo.png)

The RAPIDS data science framework is a collection of libraries for running end-to-end data science pipelines entirely on the GPU. The interface is designed to look and feel familiar to anyone working in Python, but it uses optimized NVIDIA® CUDA® primitives and high-bandwidth GPU memory under the hood. Below are some links to help you get started with each of the individual RAPIDS libraries.

# cuDF

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. It also provides a pandas-like API that will be familiar to data engineers and data scientists, so they can accelerate their workflows without going into the details of CUDA programming.

In [9]:
import cupy as cp
import pandas as pd
import cudf

In [15]:
# cp.random.seed(12)

# Series

In [11]:
s = cudf.Series([1, 2, 3, None, 4])

In [16]:
s

0       1
1       2
2       3
3    <NA>
4       4
dtype: int64

# DataFrame

Creating a cudf.DataFrame by specifying values for each column.

### Looking at head

In [32]:
df = cudf.DataFrame(
    {
        "a": list(range(20)),
        "b": list(range(70,50,-1)),
    }
)

In [33]:
df.head()

Unnamed: 0,a,b
0,0,70
1,1,69
2,2,68
3,3,67
4,4,66


Sorting by values.

In [34]:
df.sort_values(by="b")

Unnamed: 0,a,b
19,19,51
18,18,52
17,17,53
16,16,54
15,15,55
14,14,56
13,13,57
12,12,58
11,11,59
10,10,60


### Selecting a Column

Selecting a single column yields a cudf.Series (equivalent to df.a).

In [35]:
df["a"]

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
Name: a, dtype: int64

### Selecting Rows by Label

Selecting rows from index 2 to index 5 from columns ‘a’ and ‘b’.


In [37]:
df.loc[2:5, ["a", "b"]]

Unnamed: 0,a,b
2,2,68
3,3,67
4,4,66
5,5,65


### Selecting Rows by Position

Selecting via integers and integer slices, like numpy/pandas.

In [38]:
df.iloc[0]

a     0
b    70
Name: 0, dtype: int64

In [39]:
df.iloc[0:3, 0:2]

Unnamed: 0,a,b
0,0,70
1,1,69
2,2,68
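
One difference between the label-based and position-based selections above is easy to miss: a label slice such as `df.loc[2:5]` includes both endpoints, while a positional slice such as `df.iloc[0:3]` excludes the stop position. The following is a plain-Python sketch of that convention, not cuDF code; the names `loc_like` and `iloc_like` are made up for illustration.

```python
# Plain-Python illustration of the two slicing conventions (no GPU needed).
index_labels = list(range(20))       # same default index as df above
b_values = list(range(70, 50, -1))   # same values as column "b"

# Label-based slice, as in df.loc[2:5]: both endpoints are included.
loc_like = [b for label, b in zip(index_labels, b_values) if 2 <= label <= 5]

# Position-based slice, as in df.iloc[0:3]: the stop position is excluded.
iloc_like = b_values[0:3]

print(loc_like)   # [68, 67, 66, 65]
print(iloc_like)  # [70, 69, 68]
```

This matches the four rows returned by `df.loc[2:5, ["a", "b"]]` and the three rows returned by `df.iloc[0:3, 0:2]`.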


You can also select elements of a DataFrame or Series with direct index access.

In [40]:
df[3:5]

Unnamed: 0,a,b
3,3,67
4,4,66


### Boolean Indexing

Selecting rows in a DataFrame or Series by direct Boolean indexing.


In [41]:
df[df.b > 15]

Unnamed: 0,a,b
0,0,70
1,1,69
2,2,68
3,3,67
4,4,66
5,5,65
6,6,64
7,7,63
8,8,62
9,9,61


Selecting values from a DataFrame where a Boolean condition is met, via the query API.


In [43]:
df.query("b == 55")

Unnamed: 0,a,b
15,15,55


Using the isin method for filtering.


In [44]:
df[df.a.isin([0, 5])]

Unnamed: 0,a,b
0,0,70
5,5,65


### MultiIndex

cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically automatically produces a DataFrame with a MultiIndex.

In [45]:
arrays = [["a", "a", "b", "b"], [1, 2, 3, 4]]
tuples = list(zip(*arrays))
idx = cudf.MultiIndex.from_tuples(tuples)
idx

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 3),
            ('b', 4)],
           )
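
The `zip(*arrays)` step above pairs the i-th entry of each level into one tuple per row. A minimal plain-Python sketch of what those tuples look like and how a full tuple identifies one row; the `rows` dictionary and its `row-N` payloads are hypothetical stand-ins, not part of cuDF.

```python
arrays = [["a", "a", "b", "b"], [1, 2, 3, 4]]

# One tuple per row: (outer level, inner level).
tuples = list(zip(*arrays))
print(tuples)  # [('a', 1), ('a', 2), ('b', 3), ('b', 4)]

# Hypothetical row payloads keyed by the full tuple, similar in spirit
# to looking up one row of a MultiIndex-ed frame by its complete key.
rows = {key: f"row-{i}" for i, key in enumerate(tuples)}
print(rows[("b", 3)])  # row-2

# Selecting on the outer level only keeps every key that starts with "b".
outer_b = [key for key in tuples if key[0] == "b"]
print(outer_b)  # [('b', 3), ('b', 4)]
```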

This index can back either axis of a DataFrame.

In [46]:
gdf1 = cudf.DataFrame(
    {"first": cp.random.rand(4), "second": cp.random.rand(4)}
)
gdf1.index = idx
gdf1

Unnamed: 0,Unnamed: 1,first,second
a,1,0.082654,0.967955
a,2,0.399417,0.441425
b,3,0.784297,0.793582
b,4,0.070303,0.271711


In [47]:
gdf2 = cudf.DataFrame(
    {"first": cp.random.rand(4), "second": cp.random.rand(4)}
).T
gdf2.columns = idx
gdf2

Unnamed: 0_level_0,a,a,b,b
Unnamed: 0_level_1,1,2,3,4
first,0.343382,0.0037,0.20043,0.581614
second,0.907812,0.101512,0.24179,0.22418


Accessing values of a DataFrame with a MultiIndex, first with .loc:

In [49]:
gdf1.loc[("b", 3)]

first     0.784297
second    0.793582
Name: ('b', 3), dtype: float64

And with .iloc:


In [50]:
gdf1.iloc[0:2]

Unnamed: 0,Unnamed: 1,first,second
a,1,0.082654,0.967955
a,2,0.399417,0.441425


### Missing Data

Missing data can be replaced by using the fillna method.


In [51]:
s.fillna(999)

0      1
1      2
2      3
3    999
4      4
dtype: int64

In [52]:
s

0       1
1       2
2       3
3    <NA>
4       4
dtype: int64

In [53]:
s.fillna(999).head(n=3)

0    1
1    2
2    3
dtype: int64
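
As cells In [51] and In [52] show, `fillna` returns a new Series and leaves `s` unchanged. A plain-Python sketch of that behaviour, with `None` standing in for `<NA>`; the helper below is an illustration written for this note, not cuDF's implementation.

```python
def fillna(values, fill_value):
    """Return a new list with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]

s_values = [1, 2, 3, None, 4]   # same contents as the Series s
filled = fillna(s_values, 999)

print(filled)    # [1, 2, 3, 999, 4]
print(s_values)  # [1, 2, 3, None, 4]  (the original is untouched)
```

To keep the filled values, assign the result back, for example `s = s.fillna(999)`.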

### Stats

Calculating descriptive statistics for a Series.


In [54]:
s.mean()

2.5

In [55]:
s.var()

1.666666666666666
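
The two results above follow from skipping the missing value and, for the variance, dividing by n - 1 (the sample variance) rather than n. A sketch using Python's standard `statistics` module to reproduce the numbers; it illustrates the arithmetic only and says nothing about how cuDF computes them on the GPU.

```python
import statistics

s_values = [1, 2, 3, None, 4]                    # same contents as s
valid = [v for v in s_values if v is not None]   # nulls are skipped

mean = statistics.mean(valid)                 # (1 + 2 + 3 + 4) / 4
sample_var = statistics.variance(valid)       # divides by n - 1 = 3
population_var = statistics.pvariance(valid)  # divides by n = 4

print(mean)            # 2.5
print(sample_var)      # about 1.667, matching s.var()
print(population_var)  # 1.25, which is not what s.var() returned
```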

### Apply

Applying functions to a Series.

In [59]:
df.head()

Unnamed: 0,a,b
0,0,70
1,1,69
2,2,68
3,3,67
4,4,66


In [56]:
def add_ten(num):
    return num + 10

In [57]:
df["a"].apply(add_ten)

0     10
1     11
2     12
3     13
4     14
5     15
6     16
7     17
8     18
9     19
10    20
11    21
12    22
13    23
14    24
15    25
16    26
17    27
18    28
19    29
Name: a, dtype: int64

### Histogramming

Counting the number of occurrences of each unique value in a column.


In [61]:
df.a.value_counts()

15    1
6     1
1     1
14    1
2     1
5     1
11    1
7     1
17    1
13    1
8     1
16    1
0     1
10    1
4     1
9     1
19    1
18    1
3     1
12    1
Name: a, dtype: int32
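
Every value of column `a` is unique, so each count above is 1 and the row order carries no information. The idea is easier to see on data with repeats. A plain-Python sketch with `collections.Counter` on a made-up list of letters:

```python
from collections import Counter

# Hypothetical data with repeated values.
grades = ["a", "b", "b", "a", "a", "e"]

counts = Counter(grades)

# most_common() lists values from most to least frequent.
print(counts.most_common())  # [('a', 3), ('b', 2), ('e', 1)]
```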

### String Methods

Like pandas, cuDF provides string processing methods in the str attribute of Series.

In [62]:
s = cudf.Series(["A", "B", "C", "Aaba", "Baca", None, "CABA", "dog", "cat"])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: object

Beyond simple manipulation, we can also match strings using regular expressions.


In [63]:
s.str.match("^[aAc].+")

0    False
1    False
2    False
3     True
4    False
5     <NA>
6    False
7    False
8     True
dtype: bool
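
The pattern `^[aAc].+` requires a first character of `a`, `A`, or `c` followed by at least one more character, which is why the single-letter `"A"` does not match. The sketch below reproduces the result with Python's `re` module, using `None` for `<NA>`. cuDF uses its own regex engine, so treat this only as an illustration of this particular pattern.

```python
import re

values = ["A", "B", "C", "Aaba", "Baca", None, "CABA", "dog", "cat"]
pattern = re.compile(r"^[aAc].+")

# Missing values stay missing; everything else becomes True or False.
matches = [None if v is None else pattern.match(v) is not None for v in values]

print(matches)
# [False, False, False, True, False, None, False, False, True]
```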

### Concat

Concatenating Series and DataFrames row-wise.


In [64]:
s = cudf.Series([1, 2, 3, None, 5])
cudf.concat([s, s])

0       1
1       2
2       3
3    <NA>
4       5
0       1
1       2
2       3
3    <NA>
4       5
dtype: int64

### Join

Performing SQL-style merges. Note that the row order of the input DataFrames is not maintained, but it may be restored after the merge by sorting by the index.

In [68]:
df_a = cudf.DataFrame()
df_a["key"] = ["a", "b", "c", "d", "e"]
df_a["vals_a"] = [float(i + 10) for i in range(5)]

In [69]:
df_a

Unnamed: 0,key,vals_a
0,a,10.0
1,b,11.0
2,c,12.0
3,d,13.0
4,e,14.0


In [70]:
df_b = cudf.DataFrame()
df_b["key"] = ["a", "c", "e"]
df_b["vals_b"] = [float(i + 100) for i in range(3)]

In [71]:
df_b

Unnamed: 0,key,vals_b
0,a,100.0
1,c,101.0
2,e,102.0


In [67]:
merged = df_a.merge(df_b, on=["key"], how="left")
merged

Unnamed: 0,key,vals_a,vals_b
0,a,10.0,100.0
1,c,12.0,101.0
2,e,14.0,102.0
3,b,11.0,
4,d,13.0,
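
A left merge keeps every row of the left table and fills in the right-hand value where the key matches, leaving it missing otherwise. A plain-Python sketch of that rule, assuming the keys in the right table are unique; the tuple layout and the dictionary lookup are choices made for this illustration only.

```python
# (key, vals_a) rows of df_a and a key -> vals_b lookup for df_b.
left = [("a", 10.0), ("b", 11.0), ("c", 12.0), ("d", 13.0), ("e", 14.0)]
right = {"a": 100.0, "c": 101.0, "e": 102.0}

# Every left row survives; unmatched keys get None for vals_b.
merged_rows = [(key, val_a, right.get(key)) for key, val_a in left]

for row in merged_rows:
    print(row)
# ('a', 10.0, 100.0)
# ('b', 11.0, None)
# ('c', 12.0, 101.0)
# ('d', 13.0, None)
# ('e', 14.0, 102.0)
```

The sketch preserves the left-hand order, whereas the cuDF output above lists the matched keys first. Both contain the same five rows.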


### Grouping

Like pandas, cuDF supports the split-apply-combine groupby paradigm.

Grouping and then applying the sum function to the grouped data.



In [73]:
df["agg_col1"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df["agg_col2"] = [1 if x % 3 == 0 else 0 for x in range(len(df))]

In [74]:
df

Unnamed: 0,a,b,agg_col1,agg_col2
0,0,70,1,1
1,1,69,0,0
2,2,68,1,0
3,3,67,0,1
4,4,66,1,0
5,5,65,0,0
6,6,64,1,1
7,7,63,0,0
8,8,62,1,0
9,9,61,0,1


In [75]:
df.groupby("agg_col1").sum()

Unnamed: 0_level_0,a,b,agg_col2
agg_col1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,90,610,4
0,100,600,3
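
The three steps of split-apply-combine can be spelled out in plain Python to check the sums above for columns `a` and `b`. This is a sketch of the idea, not of cuDF's GPU implementation; the `groups` and `sums` names are made up here.

```python
from collections import defaultdict

a = list(range(20))
b = list(range(70, 50, -1))
agg_col1 = [1 if x % 2 == 0 else 0 for x in range(20)]

# Split: collect the (a, b) pairs that share a group key.
groups = defaultdict(list)
for key, a_val, b_val in zip(agg_col1, a, b):
    groups[key].append((a_val, b_val))

# Apply and combine: reduce each group to one row of column sums.
sums = {
    key: (sum(r[0] for r in rows), sum(r[1] for r in rows))
    for key, rows in groups.items()
}

print(sums)  # {1: (90, 610), 0: (100, 600)}
```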


Grouping hierarchically then applying the sum function to grouped data.


In [76]:
df.groupby(["agg_col1", "agg_col2"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
agg_col1,agg_col2,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,54,366
0,0,73,417
1,1,36,244
0,1,27,183


Grouping and applying statistical functions to specific columns, using agg.


In [78]:
df.groupby("agg_col1").agg({"a": "max", "b": "mean"})

Unnamed: 0_level_0,a,b
agg_col1,Unnamed: 1_level_1,Unnamed: 2_level_1
1,18,61.0
0,19,60.0


### Transpose

Transposing a dataframe, using either the transpose method or T property. Currently, all columns must have the same type.

In [79]:
sample = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
sample

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6


In [80]:
sample.transpose()

Unnamed: 0,0,1,2
a,1,2,3
b,4,5,6


# Time Series
DataFrames support datetime-typed columns, which allow users to interact with and filter data based on specific timestamps.

In [81]:
import datetime as dt

date_df = cudf.DataFrame()
date_df["date"] = pd.date_range("11/20/2018", periods=72, freq="D")
date_df["value"] = cp.random.sample(len(date_df))

date_df

Unnamed: 0,date,value
0,2018-11-20,0.986051
1,2018-11-21,0.232034
2,2018-11-22,0.397617
3,2018-11-23,0.103839
4,2018-11-24,0.948364
...,...,...
67,2019-01-26,0.944597
68,2019-01-27,0.520279
69,2019-01-28,0.841959
70,2019-01-29,0.547023


In [82]:
search_date = dt.datetime.strptime("2018-11-23", "%Y-%m-%d")
date_df.query("date <= @search_date")

Unnamed: 0,date,value
0,2018-11-20,0.986051
1,2018-11-21,0.232034
2,2018-11-22,0.397617
3,2018-11-23,0.103839
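
The query above keeps every row whose date is on or before `search_date`. A plain-Python sketch of the same comparison with the standard `datetime` module, rebuilding the 72 daily timestamps by hand rather than with `pd.date_range`:

```python
import datetime as dt

# 72 consecutive days starting 2018-11-20, as in date_df["date"].
start = dt.datetime(2018, 11, 20)
dates = [start + dt.timedelta(days=i) for i in range(72)]

search_date = dt.datetime.strptime("2018-11-23", "%Y-%m-%d")

# Same condition as the query string: date <= search_date.
selected = [d for d in dates if d <= search_date]

print([d.date().isoformat() for d in selected])
# ['2018-11-20', '2018-11-21', '2018-11-22', '2018-11-23']
```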


### Categoricals

DataFrames support categorical columns.


In [84]:
gdf = cudf.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "grade": ["a", "b", "b", "a", "a", "e"]}
)

gdf

Unnamed: 0,id,grade
0,1,a
1,2,b
2,3,b
3,4,a
4,5,a
5,6,e


In [85]:
gdf["grade"] = gdf["grade"].astype("category")
gdf

Unnamed: 0,id,grade
0,1,a
1,2,b
2,3,b
3,4,a
4,5,a
5,6,e


Accessing the categories of a column.


In [86]:
gdf.grade.cat.categories

StringIndex(['a' 'b' 'e'], dtype='object')

Accessing the underlying code values of each categorical observation.


In [87]:
gdf.grade.cat.codes

0    0
1    1
2    1
3    0
4    0
5    2
dtype: uint8
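
A categorical column stores each distinct value once (the categories) and replaces every observation with a small integer code pointing into that list. A plain-Python sketch that reproduces the categories and codes shown above, assuming categories are ordered by sorting the distinct values, which is what the output suggests for this example:

```python
grades = ["a", "b", "b", "a", "a", "e"]

# Distinct values, sorted: these play the role of cat.categories.
categories = sorted(set(grades))

# Map each category to its position, then encode every observation.
code_of = {cat: code for code, cat in enumerate(categories)}
codes = [code_of[g] for g in grades]

# Decoding the codes gives back the original values.
decoded = [categories[c] for c in codes]

print(categories)  # ['a', 'b', 'e']
print(codes)       # [0, 1, 1, 0, 0, 2]
```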

# Converting to Pandas
Converting a cuDF DataFrame to a pandas DataFrame.


In [89]:
df.head()

Unnamed: 0,a,b,agg_col1,agg_col2
0,0,70,1,1
1,1,69,0,0
2,2,68,1,0
3,3,67,0,1
4,4,66,1,0


In [90]:
type(df)

cudf.core.dataframe.DataFrame

In [91]:
df.head().to_pandas()

Unnamed: 0,a,b,agg_col1,agg_col2
0,0,70,1,1
1,1,69,0,0
2,2,68,1,0
3,3,67,0,1
4,4,66,1,0


With a Dask-cuDF DataFrame, calling .head() similarly returns a local cuDF DataFrame holding the first few entries, which can then be converted to pandas in the same way.

# Converting to Numpy

Converting a cuDF DataFrame to a numpy ndarray.


In [92]:
df.to_numpy()

array([[ 0, 70,  1,  1],
       [ 1, 69,  0,  0],
       [ 2, 68,  1,  0],
       [ 3, 67,  0,  1],
       [ 4, 66,  1,  0],
       [ 5, 65,  0,  0],
       [ 6, 64,  1,  1],
       [ 7, 63,  0,  0],
       [ 8, 62,  1,  0],
       [ 9, 61,  0,  1],
       [10, 60,  1,  0],
       [11, 59,  0,  0],
       [12, 58,  1,  1],
       [13, 57,  0,  0],
       [14, 56,  1,  0],
       [15, 55,  0,  1],
       [16, 54,  1,  0],
       [17, 53,  0,  0],
       [18, 52,  1,  1],
       [19, 51,  0,  0]])

Converting a cuDF Series to a numpy ndarray.


In [93]:
df["a"].to_numpy()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

### Converting to Arrow

Converting a cuDF or Dask-cuDF DataFrame to a PyArrow Table.


In [94]:
df.to_arrow()

pyarrow.Table
a: int64
b: int64
agg_col1: int64
agg_col2: int64
----
a: [[0,1,2,3,4,...,15,16,17,18,19]]
b: [[70,69,68,67,66,...,55,54,53,52,51]]
agg_col1: [[1,0,1,0,1,...,0,1,0,1,0]]
agg_col2: [[1,0,0,1,0,...,1,0,0,1,0]]

# Reading/Writing CSV Files

Writing to a CSV file.


In [98]:
df.to_csv("foo.csv", index=False)

Reading from a CSV file.

In [99]:
df = cudf.read_csv("foo.csv")

In [100]:
df.head()

Unnamed: 0,a,b,agg_col1,agg_col2
0,0,70,1,1
1,1,69,0,0
2,2,68,1,0
3,3,67,0,1
4,4,66,1,0


### Reading/Writing Parquet Files
Writing to parquet files with cuDF’s GPU-accelerated parquet writer.


In [101]:
df.to_parquet("temp_parquet")

Reading parquet files with cuDF’s GPU-accelerated parquet reader.


In [104]:
df = cudf.read_parquet("temp_parquet")

In [105]:
df

Unnamed: 0,a,b,agg_col1,agg_col2
0,0,70,1,1
1,1,69,0,0
2,2,68,1,0
3,3,67,0,1
4,4,66,1,0
5,5,65,0,0
6,6,64,1,1
7,7,63,0,0
8,8,62,1,0
9,9,61,0,1
