Intro to cuDF
=======================

Welcome to the first cuDF tutorial notebook! This is a short introduction to cuDF, partly modeled after 10 Minutes to Pandas and geared primarily toward new users. cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. We'll only cover some of cuDF's functionality, but by the end of this tutorial we hope you'll feel confident creating and analyzing GPU DataFrames.

We'll start by introducing the pandas library, and then quickly move on to cuDF.

In [2]:
import os
import numpy as np # array processing
import math
np.random.seed(12) # set seed to ensure reproducibility

<a id="pandas"></a>
## Pandas

Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data - as the name suggests - comes in a structured form, often represented by a table or CSV. We'll focus the majority of these tutorials on working with structured data.

There exist many tools in the Python ecosystem for working with structured, tabular data but few are as widely used as Pandas. Pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing and many more. 

For more information on Pandas, check out the excellent documentation: http://pandas.pydata.org/pandas-docs/stable/

Below we show how to create a Pandas DataFrame, an internal object for representing tabular data.

In [4]:
import pandas as pd; print('Pandas Version:', pd.__version__)


# here we create a Pandas DataFrame with
# two columns named "key" and "value"
df = pd.DataFrame() # create DataFrame
df['key'] = [0, 0, 2, 2, 3] # puts values in a column
df['value'] = [float(i + 10) for i in range(5)]
print(df)

Pandas Version: 1.2.5
   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


We can perform many operations on this data. For example, let's say we wanted to sum all values in the `value` column. We could accomplish this using the following syntax:

In [3]:
aggregation = df['value'].sum()
print(aggregation)

60.0


In [4]:
df['value'].mean()

12.0

Pandas rule of thumb:
<br>
For every GB (or MB) of data, have at least 5 to 10 times as much memory available to avoid running out of memory.
<br>
This is due to how Pandas' internal structure (the block manager) handles data.
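
As a rough illustration of the rule of thumb above, the arithmetic can be written out directly. The helper name `recommended_memory_gb` and its parameters are hypothetical, and the factors 5 and 10 are simply the bounds quoted above, not measured values.

```python
def recommended_memory_gb(dataset_gb, low_factor=5, high_factor=10):
    """Return the (low, high) range of available memory, in GB, suggested
    by the 5-10x rule of thumb for a dataset of the given size."""
    if dataset_gb < 0:
        raise ValueError("dataset size must be non-negative")
    return dataset_gb * low_factor, dataset_gb * high_factor


low, high = recommended_memory_gb(2)
print(f"A 2 GB dataset suggests {low}-{high} GB of available memory")
```

Treat the result as a planning estimate only; actual memory use depends on the column types and the operations performed.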

<a id="cudf"></a>
## cuDF

Pandas is fantastic for working with small datasets that fit into your system's memory and workflows that are not computationally intense. However, datasets are growing larger and data scientists are working with increasingly complex workloads - the need for accelerated computing is increasing rapidly.

cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing Pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide.

Below, we show how to create a cuDF DataFrame.

In [9]:
import cudf; print('cuDF Version:', cudf.__version__)


# here we create a cuDF DataFrame with
# two columns named "key" and "value"
df = cudf.DataFrame()
df['key'] = [0, 0, 2, 2, 3]
df['value'] = [float(i + 10) for i in range(5)]
df

cuDF Version: 21.08.03


   key  value
0    0   10.0
1    0   11.0
2    2   12.0
3    2   13.0
4    3   14.0


In [6]:
# not a pandas df
type(df)

cudf.core.dataframe.DataFrame

As before, we can take this cuDF DataFrame and perform a `sum` operation over the `value` column. The key difference is that any operations we perform using cuDF use the GPU instead of the CPU.

In [7]:
aggregation = df['value'].sum()
print(aggregation)

60.0


Note how the syntax for both creating and manipulating a cuDF DataFrame is identical to the syntax for creating and manipulating Pandas DataFrames: the cuDF API is based on the Pandas API. This design choice minimizes the cognitive burden of switching from a CPU-based workflow to a GPU-based workflow and allows data scientists to focus on solving problems while benefiting from the speed of a GPU!
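
One way to see the shared API in practice is to write a function that takes the DataFrame library itself as an argument. The sketch below is hedged: the helper `total_value` is hypothetical, and only the pandas path is executed here so that it runs without a GPU. Passing the `cudf` module instead is assumed to behave the same way, based on the cells above.

```python
import pandas as pd


def total_value(xdf):
    """Build the small key/value table with the given DataFrame library
    (any module exposing a pandas-style `DataFrame`) and sum `value`."""
    df = xdf.DataFrame()
    df['key'] = [0, 0, 2, 2, 3]
    df['value'] = [float(i + 10) for i in range(5)]
    return float(df['value'].sum())


print(total_value(pd))  # CPU path
# On a machine with a GPU and cuDF installed, `import cudf` followed by
# `total_value(cudf)` is expected to give the same result.
```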

# DataFrame Basics with cuDF

In the following tutorial, you'll get a chance to familiarize yourself with cuDF. For those of you with experience using pandas, this should look nearly identical.

Along the way you'll notice small exercises. These exercises are designed to help you get a feel for writing the code yourself, but if you get stuck, you can take a look at the solutions.

Portions of this were borrowed from the 10 Minutes to cuDF guide.

Object Creation
---------------

Creating a `cudf.Series`.

In [8]:
# a Series is an individual column of data
# it supports NULL (missing) values
s = cudf.Series([1,2,3,None,4])
print(s)

0       1
1       2
2       3
3    <NA>
4       4
dtype: int64


Creating a `cudf.DataFrame` by specifying values for each column.

In [9]:
df = cudf.DataFrame({
'a': list(range(20)),
'b': list(reversed(range(20))),
'c': list(range(20))})
print(df)

     a   b   c
0    0  19   0
1    1  18   1
2    2  17   2
3    3  16   3
4    4  15   4
5    5  14   5
6    6  13   6
7    7  12   7
8    8  11   8
9    9  10   9
10  10   9  10
11  11   8  11
12  12   7  12
13  13   6  13
14  14   5  14
15  15   4  15
16  16   3  16
17  17   2  17
18  18   1  18
19  19   0  19


Creating a `cudf.DataFrame` from a `pd.DataFrame`.

In [10]:
# an existing pandas df
pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})

# use from_pandas() to create cudf equiv
gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf)

   a     b
0  0   0.1
1  1   0.2
2  2  <NA>
3  3   0.3


Viewing Data
-------------

Viewing the top rows of a GPU dataframe.

In [11]:
print(df.head(2)) 

   a   b  c
0  0  19  0
1  1  18  1


Sorting by values.

In [12]:
print(df.sort_values(by='b')) # sort by column b, ascending by default

     a   b   c
19  19   0  19
18  18   1  18
17  17   2  17
16  16   3  16
15  15   4  15
14  14   5  14
13  13   6  13
12  12   7  12
11  11   8  11
10  10   9  10
9    9  10   9
8    8  11   8
7    7  12   7
6    6  13   6
5    5  14   5
4    4  15   4
3    3  16   3
2    2  17   2
1    1  18   1
0    0  19   0


Selection
------------

## Getting

Selecting a single column, which initially yields a `cudf.Series` (equivalent to `df.a`).

In [13]:
# same as
df.__getitem__?

In [14]:
print(df['a'])

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
Name: a, dtype: int64


## Selection by Label

Selecting rows from index 2 to index 5 from columns `a` and `b`.

In [15]:
print(df.loc[2:5, ['a', 'b']]) # grab specific rows and columns by name

   a   b
2  2  17
3  3  16
4  4  15
5  5  14


## Selection by Position

Selecting via integers and integer slices, like numpy/pandas.

In [16]:
df.iloc[0] # first row, returned as a Series

a     0
b    19
c     0
Name: 0, dtype: int64

In [17]:
df.iloc[0:3, 0:2] # first 3 rows and first 2 columns

   a   b
0  0  19
1  1  18
2  2  17


You can also select elements of a `DataFrame` or `Series` with direct index access.

In [18]:
df[3:5]

   a   b  c
3  3  16  3
4  4  15  4


In [19]:
s[3:5]

3    <NA>
4       4
dtype: int64

## Exercise 1

Try to select only the rows at index `4` and `9` from `df`.

<details><summary><b>Solution</b></summary>
   <pre>
    <br>print(df.iloc[[4,9]])
   </pre>
</details>

In [20]:
# pass the row positions as a list
df.iloc[[4,9]]

   a   b  c
4  4  15  4
9  9  10  9


## Boolean Indexing

Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing.

In [21]:
df.b > 15

0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
Name: b, dtype: bool

In [22]:
# create filters
print(df[df.b > 15])

   a   b  c
0  0  19  0
1  1  18  1
2  2  17  2
3  3  16  3


Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API.

In [23]:
print(df.query("b == 3"))  

     a  b   c
16  16  3  16


In [24]:
val = 3
df.query("b == @val")

     a  b   c
16  16  3  16


You can also pass local variables to cuDF queries, via the `local_dict` keyword or `@` operator.

In [25]:
cudf_comparator = 3
print(df.query("b == @cudf_comparator"))

     a  b   c
16  16  3  16


Supported comparison operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`.
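
To make the relationship between `query` and Boolean indexing concrete, here is a small sketch. It uses pandas so that it runs without a GPU; the cells above suggest the cuDF calls look the same. The helper name `rows_where_b_equals` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': list(range(20)),
                   'b': list(reversed(range(20))),
                   'c': list(range(20))})


def rows_where_b_equals(frame, target):
    """Select rows with a query string that references a local variable."""
    return frame.query("b == @target")


by_query = rows_where_b_equals(df, 3)
by_mask = df[df['b'] == 3]   # the same filter written as a Boolean mask

print(by_query)
print(by_query.equals(by_mask))
```

Both forms select the same rows; `query` tends to read more cleanly for long conditions, while a Boolean mask makes each step explicit.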

## Exercise 2

Try to select only the rows from `df` where the value in column `b` is greater than the value in column `c` + 6.

<details><summary><b>Solution</b></summary>
   <pre>
    <br>print(df.query("b > c + 6"))
   </pre>
</details>

In [26]:
# better to be explicit
df.loc[df.b > (df.c + 6)]

   a   b  c
0  0  19  0
1  1  18  1
2  2  17  2
3  3  16  3
4  4  15  4
5  5  14  5
6  6  13  6


Missing Data
------------

Missing data can be replaced by using the `fillna` method.

In [27]:
print(s.fillna(999))

0      1
1      2
2      3
3    999
4      4
dtype: int64
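
Before filling, it is often useful to count how many values are missing and to confirm that `fillna` returns a new object rather than changing the original. The sketch below uses pandas with its nullable `"Int64"` dtype as a CPU stand-in for the cuDF Series above; that substitution is an assumption made so the example runs without a GPU.

```python
import pandas as pd

# "Int64" (capital I) is pandas' nullable integer dtype, used here so the
# missing entry is kept as <NA> instead of forcing a float column.
s = pd.Series([1, 2, 3, None, 4], dtype="Int64")

filled = s.fillna(999)

print(int(s.isna().sum()))   # number of missing values in the original
print(filled.tolist())       # the filled copy; `s` itself is unchanged
```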


Operations
------------

## Stats

Calculating descriptive statistics for a `Series`.

In [28]:
# create a series from 0 to 9
s = cudf.Series(np.arange(10)).astype(np.float32)

In [29]:
print(s.mean(), s.var(), s.std(), s.kurtosis(), s.skew())

4.5 9.166666666666668 3.0276503540974917 -1.200000000000001 0.0


In [30]:
# summary statistics
print(df.describe())

              a         b         c
count  20.00000  20.00000  20.00000
mean    9.50000   9.50000   9.50000
std     5.91608   5.91608   5.91608
min     0.00000   0.00000   0.00000
25%     4.75000   4.75000   4.75000
50%     9.50000   9.50000   9.50000
75%    14.25000  14.25000  14.25000
max    19.00000  19.00000  19.00000


## Applymap

Applying functions to a `Series`.

In [31]:
def add_ten(num):
    return num + 10

print(s.applymap(add_ten))

# cuDF only supports custom functions on columns that aren't strings

0    10.0
1    11.0
2    12.0
3    13.0
4    14.0
5    15.0
6    16.0
7    17.0
8    18.0
9    19.0
dtype: float64


## String Methods

Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the [cuDF](https://docs.rapids.ai/api/cudf/nightly/) and [nvStrings](https://docs.rapids.ai/api/nvstrings/nightly/) API documentation for more information.

In [32]:
s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])
# lowercase strings
print(s.str.lower())

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: object


## Exercise 3

Try to convert all the strings to uppercase. Take a look at the nvStrings API documentation linked above if you need some help.

<details><summary><b>Solution</b></summary>
   <pre>
    <br>print(s.str.upper())
   </pre>
</details>

In [33]:
# s.str is the string accessor
s.str.upper()

0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: object

## Concat

Concatenating `Series` and `DataFrames` row-wise.

In [34]:
s = cudf.Series([1, 2, 3, None, 5])
# concatenate row-wise; the original index labels are kept, so they repeat
print(cudf.concat([s, s]))

0       1
1       2
2       3
3    <NA>
4       5
0       1
1       2
2       3
3    <NA>
4       5
dtype: int64


In [35]:
# concat on different axis (column)
print(cudf.concat([s, s], axis=1))

      0     1
0     1     1
1     2     2
2     3     3
3  <NA>  <NA>
4     5     5


## Append

Appending values from another `Series` or array-like object.

In [36]:
print(s.append(s))

0       1
1       2
2       3
3    <NA>
4       5
0       1
1       2
2       3
3    <NA>
4       5
dtype: int64


## Join

Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index.

In [37]:
df_a = cudf.DataFrame()
df_a['key'] = ['a', 'b', 'c', 'd', 'e']
df_a['vals_a'] = [float(i + 10) for i in range(5)]

df_b = cudf.DataFrame()
df_b['key'] = ['a', 'c', 'e']
df_b['vals_b'] = [float(i+100) for i in range(3)]

merged = df_a.merge(df_b, on=['key'], how='left')
print(merged)

# by default, parallel joins won't preserve order

  key  vals_a vals_b
0   a    10.0  100.0
1   c    12.0  101.0
2   e    14.0  102.0
3   b    11.0   <NA>
4   d    13.0   <NA>


## Exercise 4

Using the DataFrames we created above, try to do an `inner` join using `merge`.

<details><summary><b>Solution</b></summary>
   <pre>
    <br>print(df_a.merge(df_b, on=['key'], how='inner'))
   </pre>
</details>

In [38]:
df_a.merge(df_b, on=['key'], how='inner')

  key  vals_a  vals_b
0   a    10.0   100.0
1   c    12.0   101.0
2   e    14.0   102.0


In [39]:
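# use left_on/right_on when the key columns have different names
# (here both are named 'key', so the result matches the cell above)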
df_a.merge(df_b, how='inner', left_on=['key'], right_on=['key'])

  key  vals_a  vals_b
0   a    10.0   100.0
1   c    12.0   101.0
2   e    14.0   102.0


## Grouping

Like pandas, cuDF supports the [Split-Apply-Combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) groupby paradigm.

In [40]:
df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]

Grouping and then applying the `sum` function to the grouped data.

In [41]:
df.groupby('agg_col1').sum()

            a    b    c  agg_col2
agg_col1
1          90  100   90         4
0         100   90  100         3


Grouping and applying statistical functions to specific columns, using `agg`.

In [42]:
# different aggregation for different columns
df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'})

           a     b    c
agg_col1
1         18  10.0   90
0         19   9.0  100


In [43]:
# multiple aggregations for single column by using a list
df.groupby('agg_col1').agg({'a':'max', 'b':['mean', 'count'], 'c':'sum'})

           a     b          c
         max  mean count  sum
agg_col1
1         18  10.0    10   90
0         19   9.0    10  100


## Exercise 5

We can also group by multiple columns at once, which we call grouping hierarchically. Try to group `df` by `agg_col1` and `agg_col2` and calculate the mean of column `a` and minimum of column `b`.

<details><summary><b>Solution</b></summary>
   <pre>
    <br>df.groupby(['agg_col1', 'agg_col2']).agg({'a':'mean', 'b':'min'})
   </pre>
</details>

In [44]:
# group by multiple columns
# (this cell applies max/mean/sum rather than the mean/min asked for in the exercise)
df.groupby(['agg_col1', 'agg_col2']).agg({'a':'max', 'b':'mean', 'c':'sum'})

                    a          b   c
agg_col1 agg_col2
1        0         16  10.000000  54
0        0         19   8.571429  73
1        1         18  10.000000  36
0        1         15  10.000000  27


Time Series
------------


A `DataFrame` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps.

In [45]:
date_df = cudf.DataFrame()
date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')
date_df['value'] = np.random.sample(len(date_df))
print(date_df.head())

        date     value
0 2018-11-20  0.154163
1 2018-11-21  0.740050
2 2018-11-22  0.263315
3 2018-11-23  0.533739
4 2018-11-24  0.014575


In [46]:
# datetime64[ns] - nanosecond precision
date_df.dtypes

date     datetime64[ns]
value           float64
dtype: object

## Exercise 6

Try to use `query` to filter `date_df` to only those rows with a date before `2018-11-23`. This is a bit trickier than the prior exercises. As a hint, you'll want to explore the `to_datetime` function from the `pandas` library.

<details><summary><b>Solution</b></summary>
   <pre>
    <br>
    search_date = pd.to_datetime('2018-11-23')
    date_df.query("date &lt; @search_date")
            </br>
   </pre>
</details>

In [47]:
# create datetime object
time_filter = pd.Timestamp('2018-11-23')
date_df.query("date < @time_filter")

# these don't work here because they compare the datetime column against plain strings
# date_df.loc[date_df.date < "2018-11-23"]
# date_df.query("date < '2018-11-23'")

        date     value
0 2018-11-20  0.154163
1 2018-11-21  0.740050
2 2018-11-22  0.263315


You can also interact with datetime columns to extract things like the day, hour, minute, and more.

In [48]:
date_df['minute'] = date_df.date.dt.minute # second, hour
print(date_df.head())

        date     value  minute
0 2018-11-20  0.154163       0
1 2018-11-21  0.740050       0
2 2018-11-22  0.263315       0
3 2018-11-23  0.533739       0
4 2018-11-24  0.014575       0


Converting Data Representation
--------------------------------

## CuPy

CuPy is a GPU array library with a NumPy-compatible API.

Combining cuDF and CuPy is particularly useful (it's the GPU equivalent of combining pandas and NumPy), so we've put together a Getting Started [guide](https://docs.rapids.ai/api/cudf/nightly/10min-cudf-cupy.html).

For now, note that you can convert a DataFrame or a Series to a CuPy array with `.values`. 

cuDF and CuPy interact without making any copies.

In [49]:
# pandas to np array
# cuDF to CuPy array
df.values

array([[ 0, 19,  0,  1,  1],
       [ 1, 18,  1,  0,  0],
       [ 2, 17,  2,  1,  0],
       [ 3, 16,  3,  0,  1],
       [ 4, 15,  4,  1,  0],
       [ 5, 14,  5,  0,  0],
       [ 6, 13,  6,  1,  1],
       [ 7, 12,  7,  0,  0],
       [ 8, 11,  8,  1,  0],
       [ 9, 10,  9,  0,  1],
       [10,  9, 10,  1,  0],
       [11,  8, 11,  0,  0],
       [12,  7, 12,  1,  1],
       [13,  6, 13,  0,  0],
       [14,  5, 14,  1,  0],
       [15,  4, 15,  0,  1],
       [16,  3, 16,  1,  0],
       [17,  2, 17,  0,  0],
       [18,  1, 18,  1,  1],
       [19,  0, 19,  0,  0]])

## Pandas

Converting a cuDF `DataFrame` to a pandas `DataFrame`.

In [50]:
df.head().to_pandas()

   a   b  c  agg_col1  agg_col2
0  0  19  0         1         1
1  1  18  1         0         0
2  2  17  2         1         0
3  3  16  3         0         1
4  4  15  4         1         0


## Numpy

Converting a cuDF `DataFrame` to a numpy `ndarray`.

In [51]:
df.as_matrix()

array([[ 0, 19,  0,  1,  1],
       [ 1, 18,  1,  0,  0],
       [ 2, 17,  2,  1,  0],
       [ 3, 16,  3,  0,  1],
       [ 4, 15,  4,  1,  0],
       [ 5, 14,  5,  0,  0],
       [ 6, 13,  6,  1,  1],
       [ 7, 12,  7,  0,  0],
       [ 8, 11,  8,  1,  0],
       [ 9, 10,  9,  0,  1],
       [10,  9, 10,  1,  0],
       [11,  8, 11,  0,  0],
       [12,  7, 12,  1,  1],
       [13,  6, 13,  0,  0],
       [14,  5, 14,  1,  0],
       [15,  4, 15,  0,  1],
       [16,  3, 16,  1,  0],
       [17,  2, 17,  0,  0],
       [18,  1, 18,  1,  1],
       [19,  0, 19,  0,  0]])

Converting a cuDF `Series` to a numpy `ndarray`.

In [52]:
df['a'].to_array()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

Getting Data In/Out
------------------------


## CSV

Writing to a CSV file, using a GPU-accelerated CSV writer.

In [53]:
# create the output directory if it does not already exist
os.makedirs('example_output', exist_ok=True)

df.to_csv('example_output/foo.csv', index=False)

Reading from a CSV file.

In [54]:
df = cudf.read_csv('example_output/foo.csv')
print(df)

     a   b   c  agg_col1  agg_col2
0    0  19   0         1         1
1    1  18   1         0         0
2    2  17   2         1         0
3    3  16   3         0         1
4    4  15   4         1         0
5    5  14   5         0         0
6    6  13   6         1         1
7    7  12   7         0         0
8    8  11   8         1         0
9    9  10   9         0         1
10  10   9  10         1         0
11  11   8  11         0         0
12  12   7  12         1         1
13  13   6  13         0         0
14  14   5  14         1         0
15  15   4  15         0         1
16  16   3  16         1         0
17  17   2  17         0         0
18  18   1  18         1         1
19  19   0  19         0         0


That's it! You've got the basics of cuDF down! Let's talk a little bit about the computational performance of cuDF and GPUs.

# Performance

One of the primary reasons to use cuDF over pandas is performance. For some workflows, the GPU can be **much** faster than the CPU. Let's illustrate this by starting with a small example: creating a DataFrame and calculating the sum of a column.

In [55]:
a = np.random.rand(10000000) # 10 million values

In [56]:
pdf = pd.DataFrame({'a': a})
cdf = cudf.DataFrame({'a': a})

In [57]:
%%timeit
pdf['a'].sum()

18.2 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [58]:
%%timeit
cdf['a'].sum()

The slowest run took 4.26 times longer than the fastest. This could mean that an intermediate result is being cached.
16.9 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In this run the two timings are close (about 18 ms for pandas and about 17 ms for cuDF). This is a pretty small example, though.

### A More Realistic Example: Sensor Data Analytics

To get a more realistic sense of how powerful cuDF and GPUs can be, let's imagine you had a fleet of sensors that collect data every millisecond. These sensors could be measuring pressure, temperature, or something else entirely.

Let's imagine we want to analyze one day's worth of sensor data. We'll assign random values for the sensor `value` to use for this example. We'll start by creating the data, and generating some datetime related features like we learned about in the above tutorial.

In [6]:
%%time

# timeseries
date_df = pd.DataFrame()
date_df['date'] = pd.date_range(start='2019-07-05', end='2019-07-06', freq='min')
date_df['value'] = np.random.sample(len(date_df))

date_df['hour'] = date_df.date.dt.hour
date_df['minute'] = date_df.date.dt.minute

print(date_df.shape)
date_df.head()

(1441, 4)
CPU times: user 13.9 ms, sys: 1.48 ms, total: 15.4 ms
Wall time: 10.3 ms


                 date     value  hour  minute
0 2019-07-05 00:00:00  0.154163     0       0
1 2019-07-05 00:01:00  0.740050     0       1
2 2019-07-05 00:02:00  0.263315     0       2
3 2019-07-05 00:03:00  0.533739     0       3
4 2019-07-05 00:04:00  0.014575     0       4


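At the one-minute frequency used in the cell above, creating the data is quick (about 10 ms for 1,441 rows); millisecond-resolution data is far larger and takes correspondingly longer to create. Let's do our analysis. From our sensor data, we want to get the maximum sensor value for each minute. Since we don't want to combine values in the same minute of different hours, we'll need to do a hierarchical groupby.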

In [7]:
%time results = date_df.groupby(['hour', 'minute']).agg({'value':'max'})
results.head()

CPU times: user 13.2 ms, sys: 1.14 ms, total: 14.3 ms
Wall time: 16.6 ms


                value
hour minute
0    0       0.499229
     1       0.740050
     2       0.263315
     3       0.533739
     4       0.014575


On 1,441 rows this takes only about 17 ms, but imagine millisecond-resolution data from a whole fleet of sensors. Time would become a serious concern.

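Let's try this in cuDF now, using the GPU DataFrame. We'll follow the same steps as above, this time generating the data at millisecond frequency (`freq='ms'`) and adding a `second` column to the groupby.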

In [10]:
%%time

date_df = cudf.DataFrame()
date_df['date'] = pd.date_range(start='2019-07-05', end='2019-07-06', freq='ms')
date_df['value'] = np.random.sample(len(date_df))

date_df['hour'] = date_df.date.dt.hour
date_df['minute'] = date_df.date.dt.minute
date_df['second'] = date_df.date.dt.second

print(date_df.shape)
print(date_df.head())

(86400001, 5)
                     date     value  hour  minute  second
0 2019-07-05 00:00:00.000  0.145682     0       0       0
1 2019-07-05 00:00:00.001  0.760025     0       0       0
2 2019-07-05 00:00:00.002  0.680434     0       0       0
3 2019-07-05 00:00:00.003  0.110188     0       0       0
4 2019-07-05 00:00:00.004  0.793416     0       0       0
CPU times: user 2.06 s, sys: 1.42 s, total: 3.48 s
Wall time: 5.41 s


In [11]:
%time results = date_df.groupby(['hour', 'minute', 'second']).agg({'value':'max'})
results.head()

MemoryError: std::bad_alloc: CUDA error at: /home/qwertygp/miniconda3/envs/rapids-21.08/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory



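In the run shown here, the cuDF groupby ran out of GPU memory on the 86,400,001-row DataFrame, so there is no timing to compare. On a GPU with enough free memory, or with a shorter date range, the same code can complete; results will vary with hardware, but GPU acceleration can make a significant difference for workloads of this size.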
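
A back-of-the-envelope estimate of the millisecond-frequency DataFrame's size may help when choosing a date range that fits on a given GPU. The row count comes from the shape printed above, and the 8-byte widths for `date` and `value` follow from the `datetime64[ns]` and `float64` dtypes shown earlier. The 2-byte width for the derived `hour`, `minute`, and `second` columns is an assumption, and the helper name `footprint_gib` is hypothetical.

```python
ROWS = 86_400_001  # shape reported for the millisecond-frequency DataFrame

# Assumed bytes per value for each column: an 8-byte timestamp, an 8-byte
# float, and (as an assumption) 2-byte integers for hour/minute/second.
ASSUMED_BYTES = {'date': 8, 'value': 8, 'hour': 2, 'minute': 2, 'second': 2}


def footprint_gib(rows, bytes_per_column):
    """Estimate raw column storage in GiB, ignoring null masks and overhead."""
    return rows * sum(bytes_per_column.values()) / 2**30


print(f"{footprint_gib(ROWS, ASSUMED_BYTES):.2f} GiB")
```

This counts only the stored columns. A groupby needs additional working memory on top, so the estimate is best read as a lower bound.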

## Exercise 7

Play around with some more pandas and cuDF operations and compare the performance between them. What operations can you find that give the highest performance ratio? 

You can start a cell with `%%time` to time the cell, or with `%%timeit`, which runs the cell multiple times and gives an average. `%%timeit` gives a more accurate benchmark but takes longer to run.
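
Outside a notebook, the standard-library `timeit` module offers a comparable measurement. The sketch below is a minimal illustration: it times a plain Python `sum` as a stand-in workload, and the loop counts (5 loops, 7 runs) are arbitrary choices rather than the values `%%timeit` would pick.

```python
import timeit

values = list(range(1_000_000))

# Run the statement 5 times per measurement and take 7 measurements,
# loosely mirroring how %%timeit repeats a cell.
measurements = timeit.repeat(lambda: sum(values), number=5, repeat=7)

per_loop = [m / 5 for m in measurements]   # seconds per single call
mean = sum(per_loop) / len(per_loop)
print(f"{mean * 1e3:.2f} ms per loop (mean of {len(per_loop)} runs)")
```

When benchmarking, it is worth discarding or separately noting the first run, since the `%%timeit` output above warns that the slowest run can take several times longer than the fastest.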