<span style="color:red">**_This notebook is for execution on GPU nodes only!_**</span>

# Data Analytics with cuDF

You will begin your accelerated data science training with an introduction to [cuDF](https://github.com/rapidsai/cudf), the RAPIDS API that enables you to create and manipulate GPU-accelerated dataframes. cuDF implements a very similar interface to Pandas so that Python data scientists can use it with very little ramp-up. Throughout this notebook, we provide Pandas counterparts to the cuDF operations you perform, to build your intuition about how much faster cuDF can be, even for seemingly simple operations.

## Objectives

By the time you complete this notebook you will be able to:

- Read and write data to and from disk with cuDF
- Perform basic data exploration and cleaning operations with cuDF

## Imports

Here we import cuDF and CuPy for GPU-accelerated dataframes and math operations, plus the CPU libraries Pandas and NumPy on which they are based and which we will use for performance comparisons:

In [1]:
import os
import cudf
import cupy as cp

import pandas as pd
import numpy as np

In [2]:
cudf.set_allocator("managed")

## Reading and Writing Data

### Reading Data

Using [cuDF](https://github.com/rapidsai/cudf), the RAPIDS API providing a GPU-accelerated dataframe, we can read data from [a variety of formats](https://rapidsai.github.io/projects/cudf/en/0.10.0/api.html#module-cudf.io.csv), including csv, json, parquet, feather, orc, and Pandas dataframes, among others.

To begin with, we will be reading the dataset of flights again. Here we read this data from local csv files directly into GPU memory:

In [3]:
from glob import glob
filenames = sorted(glob(os.path.join('data', 'nycflights', '*.csv')))

In [4]:
filenames

['data/nycflights/1990.csv',
 'data/nycflights/1991.csv',
 'data/nycflights/1992.csv',
 'data/nycflights/1993.csv',
 'data/nycflights/1994.csv',
 'data/nycflights/1995.csv',
 'data/nycflights/1996.csv',
 'data/nycflights/1997.csv',
 'data/nycflights/1998.csv',
 'data/nycflights/1999.csv']

In [5]:
filepath = glob("./data/nycflights/*.csv")

Just like in Pandas, we need to use `concat` when reading multiple files into one dataframe:

#### cuDF

In [6]:
cdf = cudf.concat((cudf.read_csv(f) for f in filepath), ignore_index=True)

In [7]:
cdf.dtypes

Year                   int64
Month                  int64
DayofMonth             int64
DayOfWeek              int64
DepTime              float64
CRSDepTime             int64
ArrTime              float64
CRSArrTime             int64
UniqueCarrier         object
FlightNum              int64
TailNum               object
ActualElapsedTime    float64
CRSElapsedTime       float64
AirTime              float64
ArrDelay             float64
DepDelay             float64
Origin                object
Dest                  object
Distance             float64
TaxiIn                 int64
TaxiOut                int64
Cancelled              int64
Diverted               int64
dtype: object

In [8]:
cdf.head(5)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
0,1990,1,1,1,1621.0,1540,1747.0,1701,US,33,...,,46.0,41.0,EWR,PIT,319.0,,,0,0
1,1990,1,2,2,1547.0,1540,1700.0,1701,US,33,...,,-1.0,7.0,EWR,PIT,319.0,,,0,0
2,1990,1,3,3,1546.0,1540,1710.0,1701,US,33,...,,9.0,6.0,EWR,PIT,319.0,,,0,0
3,1990,1,4,4,1542.0,1540,1710.0,1701,US,33,...,,9.0,2.0,EWR,PIT,319.0,,,0,0
4,1990,1,5,5,1549.0,1540,1706.0,1701,US,33,...,,5.0,9.0,EWR,PIT,319.0,,,0,0


In [9]:
cdf.shape

(2611892, 23)

Here for comparison we read the same data into a Pandas dataframe:

#### pandas

In [10]:
pdf = pd.concat((pd.read_csv(f) for f in filepath),ignore_index=True)
cdf.shape == pdf.shape

True

In [11]:
len(pdf)

2611892

In [12]:
len(cdf)

2611892
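
The multi-file pattern above can be tried on a small scale without the flight data. The sketch below writes two tiny, made-up CSV files to a temporary directory and concatenates them. It uses Pandas so that it runs without a GPU, and the file names and values are purely illustrative. `ignore_index=True` gives the combined frame a single 0..n-1 index instead of repeating each file's own row labels.

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Hypothetical toy data: two small "yearly" files, not the real nycflights set.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"Year": [1990, 1990], "DepDelay": [41.0, 7.0]}).to_csv(
    os.path.join(tmpdir, "1990.csv"), index=False
)
pd.DataFrame({"Year": [1991], "DepDelay": [3.0]}).to_csv(
    os.path.join(tmpdir, "1991.csv"), index=False
)

# sorted() makes the file order (and therefore the row order) deterministic.
toy_files = sorted(glob(os.path.join(tmpdir, "*.csv")))

# Read each file and stack the results; ignore_index renumbers rows 0..n-1.
toy = pd.concat((pd.read_csv(f) for f in toy_files), ignore_index=True)
print(toy)
```

In the notebook itself, `filepath` comes from an unsorted `glob` call, which is presumably why the last rows shown by `tail` are from 1993 rather than 1999.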

Because of the sophisticated GPU memory management behind the scenes in cuDF, the first data load into a fresh RAPIDS memory environment is sometimes substantially slower than subsequent loads. The RAPIDS Memory Manager is preparing additional memory to accommodate the array of data science operations that you may be interested in using on the data, rather than allocating and deallocating the memory repeatedly throughout your workflow.

### Writing to File

cuDF also provides methods for writing data to files. Here we create a new dataframe specifically containing the year 1990 and then write it to `nyc_1990.csv`, before doing the same with Pandas for comparison.

#### cuDF

In [13]:
nyc_1990 = cdf.loc[cdf["Year"] == 1990]
print(f"{nyc_1990.shape[0]} flights departed from one of New York's airports in 1990")

271539 flights departed from one of New York's airports in 1990


In [14]:
nyc_1990.to_csv("nyc_1990.csv")

#### pandas

In [15]:
nyc_1990_pd = pdf.loc[pdf["Year"] == 1990]

In [16]:
nyc_1990_pd.to_csv("nyc_1990_pd.csv")
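
By default, `to_csv` in Pandas also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Below is a minimal sketch on made-up data, using Pandas and in-memory buffers so it runs on CPU without touching disk. That cuDF's `to_csv` accepts the same `index` argument is an assumption here, not something this notebook demonstrates.

```python
import io

import pandas as pd

# Hypothetical toy frame, not the real flight data.
toy = pd.DataFrame({"Year": [1990, 1990], "Origin": ["EWR", "JFK"]})

# Default: the index is written as an extra, unnamed leading column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```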

## Exercise: Initial Data Exploration

Now that we have some data loaded, let's do some initial exploration.

Use the `head` method and the `dtypes` and `columns` attributes on `cdf`, as well as the `value_counts` method on individual `cdf` columns, to orient yourself to the data. If you're interested, use the `%time` magic command to compare performance against the same operations on the Pandas `pdf`.

You can create additional interactive cells by clicking the `+` button above, or by switching to command mode with `Esc` and using the keyboard shortcuts `a` (for new cell above) and `b` (for new cell below).

In [17]:
# Begin your initial exploration here. Create more cells as needed.
cdf.head(3)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
0,1990,1,1,1,1621.0,1540,1747.0,1701,US,33,...,,46.0,41.0,EWR,PIT,319.0,,,0,0
1,1990,1,2,2,1547.0,1540,1700.0,1701,US,33,...,,-1.0,7.0,EWR,PIT,319.0,,,0,0
2,1990,1,3,3,1546.0,1540,1710.0,1701,US,33,...,,9.0,6.0,EWR,PIT,319.0,,,0,0


In [18]:
cdf.tail(5)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
2611887,1993,12,27,1,713.0,700,821.0,800,CO,1671,...,,21.0,13.0,EWR,BWI,169.0,,,0,0
2611888,1993,12,28,2,657.0,700,807.0,800,CO,1671,...,,7.0,-3.0,EWR,BWI,169.0,,,0,0
2611889,1993,12,29,3,700.0,700,815.0,800,CO,1671,...,,15.0,0.0,EWR,BWI,169.0,,,0,0
2611890,1993,12,30,4,703.0,700,802.0,800,CO,1671,...,,2.0,3.0,EWR,BWI,169.0,,,0,0
2611891,1993,12,31,5,656.0,700,805.0,800,CO,1671,...,,5.0,-4.0,EWR,BWI,169.0,,,0,0


In [19]:
cdf.columns

Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
       'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
       'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
       'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
       'Cancelled', 'Diverted'],
      dtype='object')

In [20]:
cdf.dtypes

Year                   int64
Month                  int64
DayofMonth             int64
DayOfWeek              int64
DepTime              float64
CRSDepTime             int64
ArrTime              float64
CRSArrTime             int64
UniqueCarrier         object
FlightNum              int64
TailNum               object
ActualElapsedTime    float64
CRSElapsedTime       float64
AirTime              float64
ArrDelay             float64
DepDelay             float64
Origin                object
Dest                  object
Distance             float64
TaxiIn                 int64
TaxiOut                int64
Cancelled              int64
Diverted               int64
dtype: object
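
The exercise also mentions `value_counts`, which the cells above do not show. Here is a minimal sketch of what it returns, on a made-up column and using Pandas so it runs without a GPU. The same call on a `cdf` column is expected to behave analogously.

```python
import pandas as pd

# Hypothetical toy column of origin airport codes.
toy = pd.DataFrame({"Origin": ["EWR", "JFK", "EWR", "LGA", "EWR", "JFK"]})

# value_counts tallies each distinct value, most frequent first.
counts = toy["Origin"].value_counts()
print(counts)
```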

## Basic Operations with cuDF

Except for being much more performant with large data sets, cuDF looks and feels a lot like Pandas. In this section we highlight a few very simple operations. When performing data operations on cuDF dataframes, column operations are typically much more performant than row-wise operations.

### Converting Data Types

We will sometimes need to convert data types. Here we convert the `Cancelled` column from `int64` to `bool`, comparing performance with Pandas:

#### cuDF

In [21]:
%time cdf["Cancelled"] = cdf["Cancelled"].astype("bool")

CPU times: user 2.14 ms, sys: 30 µs, total: 2.17 ms
Wall time: 1.53 ms


#### pandas

In [22]:
%time pdf["Cancelled"] = pdf["Cancelled"].astype("bool")

CPU times: user 13.5 ms, sys: 13.1 ms, total: 26.6 ms
Wall time: 25.6 ms
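
As a small illustration of what the conversion does to the values, the sketch below casts a made-up 0/1 column to `bool` with Pandas. Zero becomes `False`, any non-zero value becomes `True`, and each element shrinks from 8 bytes to 1.

```python
import pandas as pd

# Hypothetical toy flags; dtype is set explicitly to mirror the int64 column.
flags = pd.Series([0, 1, 0, 0], name="Cancelled", dtype="int64")

as_bool = flags.astype("bool")
print(as_bool.tolist())

# Per-element storage before and after the cast.
print(flags.dtype.itemsize, as_bool.dtype.itemsize)
```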


### Column-Wise Aggregations

Similarly, column-wise aggregations take advantage of the GPU's architecture and RAPIDS' memory format.
Let's compute the mean departure delay of non-cancelled flights.

#### cuDF

In [23]:
%time cdf[~cdf.Cancelled]["DepDelay"].mean()

CPU times: user 29.7 ms, sys: 31.6 ms, total: 61.3 ms
Wall time: 60 ms


9.206602541321965

#### pandas

In [24]:
%time pdf[~pdf.Cancelled]["DepDelay"].mean()

CPU times: user 142 ms, sys: 31.8 ms, total: 174 ms
Wall time: 172 ms


9.206602541321965
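
The expression above combines two steps: a boolean mask inverted with `~`, and a column mean. Here is a minimal sketch on made-up values, using Pandas so it runs on CPU. Note that Pandas skips missing values in `mean` by default, so the row with a missing delay does not affect the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one cancelled flight and one missing delay.
toy = pd.DataFrame({
    "Cancelled": [False, True, False, False],
    "DepDelay": [10.0, np.nan, 2.0, np.nan],
})

# ~ flips the mask, keeping only rows where Cancelled is False.
kept = toy[~toy["Cancelled"]]

# mean() ignores NaN by default, so only 10.0 and 2.0 contribute.
mean_delay = kept["DepDelay"].mean()
print(mean_delay)
```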

### String Operations

Although strings are not a datatype traditionally associated with GPUs, cuDF supports powerful accelerated string operations.

#### cuDF

Here, we convert the destination airport codes to title case (first letter uppercase, the rest lowercase):

In [25]:
%time cdf["Dest"] = cdf["Dest"].str.title()

CPU times: user 5.01 ms, sys: 700 µs, total: 5.71 ms
Wall time: 4.48 ms


In [26]:
cdf.head(5)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
0,1990,1,1,1,1621.0,1540,1747.0,1701,US,33,...,,46.0,41.0,EWR,Pit,319.0,,,False,0
1,1990,1,2,2,1547.0,1540,1700.0,1701,US,33,...,,-1.0,7.0,EWR,Pit,319.0,,,False,0
2,1990,1,3,3,1546.0,1540,1710.0,1701,US,33,...,,9.0,6.0,EWR,Pit,319.0,,,False,0
3,1990,1,4,4,1542.0,1540,1710.0,1701,US,33,...,,9.0,2.0,EWR,Pit,319.0,,,False,0
4,1990,1,5,5,1549.0,1540,1706.0,1701,US,33,...,,5.0,9.0,EWR,Pit,319.0,,,False,0


#### pandas

In [27]:
%time pdf["Dest"] = pdf["Dest"].str.title()

CPU times: user 694 ms, sys: 36.4 ms, total: 731 ms
Wall time: 728 ms


In [28]:
pdf.head(5)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
0,1990,1,1,1,1621.0,1540,1747.0,1701,US,33,...,,46.0,41.0,EWR,Pit,319.0,,,False,0
1,1990,1,2,2,1547.0,1540,1700.0,1701,US,33,...,,-1.0,7.0,EWR,Pit,319.0,,,False,0
2,1990,1,3,3,1546.0,1540,1710.0,1701,US,33,...,,9.0,6.0,EWR,Pit,319.0,,,False,0
3,1990,1,4,4,1542.0,1540,1710.0,1701,US,33,...,,9.0,2.0,EWR,Pit,319.0,,,False,0
4,1990,1,5,5,1549.0,1540,1706.0,1701,US,33,...,,5.0,9.0,EWR,Pit,319.0,,,False,0
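
For reference, `str.title()` capitalizes the first letter of each word and lowercases the rest. Here is a small sketch on made-up strings, using Pandas so it runs without a GPU:

```python
import pandas as pd

# Hypothetical toy values; the last one shows the per-word behavior.
codes = pd.Series(["PIT", "ORD", "new york"])

titled = codes.str.title()
print(titled.tolist())
```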


### Data Selections with `loc` and `iloc`

cuDF also supports the core data subsetting tools `loc` (label-based locator) and `iloc` (integer-based locator).

Our data's labels happen to be incrementing integers, so `loc` and `iloc` look similar here. As with Pandas, however, a `loc` slice includes both endpoints, whereas an `iloc` slice is half-open and omits the final position.

In [29]:
cdf.loc[100:105]

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
100,1990,1,15,1,1108.0,1110,1240.0,1243,US,49,...,,-3.0,-2.0,LGA,Cle,418.0,,,False,0
101,1990,1,16,2,1115.0,1110,1242.0,1243,US,49,...,,-1.0,5.0,LGA,Cle,418.0,,,False,0
102,1990,1,18,4,1136.0,1110,1310.0,1243,US,49,...,,27.0,26.0,LGA,Cle,418.0,,,False,0
103,1990,1,19,5,1106.0,1110,1238.0,1243,US,49,...,,-5.0,-4.0,LGA,Cle,418.0,,,False,0
104,1990,1,20,6,1127.0,1110,1313.0,1243,US,49,...,,30.0,17.0,LGA,Cle,418.0,,,False,0
105,1990,1,21,7,1111.0,1110,1244.0,1243,US,49,...,,1.0,1.0,LGA,Cle,418.0,,,False,0


In [30]:
cdf.iloc[100:105]

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
100,1990,1,15,1,1108.0,1110,1240.0,1243,US,49,...,,-3.0,-2.0,LGA,Cle,418.0,,,False,0
101,1990,1,16,2,1115.0,1110,1242.0,1243,US,49,...,,-1.0,5.0,LGA,Cle,418.0,,,False,0
102,1990,1,18,4,1136.0,1110,1310.0,1243,US,49,...,,27.0,26.0,LGA,Cle,418.0,,,False,0
103,1990,1,19,5,1106.0,1110,1238.0,1243,US,49,...,,-5.0,-4.0,LGA,Cle,418.0,,,False,0
104,1990,1,20,6,1127.0,1110,1313.0,1243,US,49,...,,30.0,17.0,LGA,Cle,418.0,,,False,0
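
The difference is easier to see when labels and positions do not coincide. The sketch below uses a made-up frame whose index starts at 10, with Pandas so it runs on CPU:

```python
import pandas as pd

# Hypothetical toy frame: labels 10..14 sit at positions 0..4.
toy = pd.DataFrame(
    {"DepDelay": [5.0, 9.0, 2.0, 7.0, 1.0]},
    index=[10, 11, 12, 13, 14],
)

# Label-based slice: both endpoints (11 and 13) are included.
by_label = toy.loc[11:13]

# Position-based slice: half-open, so positions 1 and 2 only.
by_position = toy.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```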


We can use `loc` with boolean selections:

#### cuDF

In [31]:
%time origin_j = cdf.loc[cdf["Origin"].str.startswith("J")]
origin_j.head()

CPU times: user 10 ms, sys: 7 ms, total: 17 ms
Wall time: 15.7 ms


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
235,1990,1,1,1,1845.0,1755,2010.0,1942,US,105,...,,28.0,50.0,JFK,Pit,340.0,,,False,0
236,1990,1,2,2,1838.0,1755,2000.0,1942,US,105,...,,18.0,43.0,JFK,Pit,340.0,,,False,0
237,1990,1,3,3,1832.0,1755,2001.0,1942,US,105,...,,19.0,37.0,JFK,Pit,340.0,,,False,0
238,1990,1,4,4,1845.0,1755,2033.0,1942,US,105,...,,51.0,50.0,JFK,Pit,340.0,,,False,0
239,1990,1,5,5,1810.0,1755,1941.0,1942,US,105,...,,-1.0,15.0,JFK,Pit,340.0,,,False,0


#### pandas

In [32]:
%time origin_j_pd = pdf.loc[pdf["Origin"].str.startswith("J")]

CPU times: user 767 ms, sys: 8.02 ms, total: 775 ms
Wall time: 773 ms


### Combining with NumPy/CuPy Methods

We can combine cuDF methods with NumPy methods, just like Pandas. Here we use `np.logical_and` for element-wise boolean selection.

#### cuDF

In [33]:
%time dest_od = cdf.loc[np.logical_and(cdf["Dest"].str.startswith("O"), cdf["Dest"].str.endswith("d"))]
dest_od.head()

CPU times: user 13.5 ms, sys: 6.09 ms, total: 19.6 ms
Wall time: 18.4 ms


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
7400,1990,1,1,1,1747.0,1730,1915.0,1936,PA (1),2027,...,,-21.0,17.0,JFK,Ord,740.0,,,False,0
7401,1990,1,2,2,1732.0,1730,1907.0,1936,PA (1),2027,...,,-29.0,2.0,JFK,Ord,740.0,,,False,0
7402,1990,1,3,3,1730.0,1730,1927.0,1936,PA (1),2027,...,,-9.0,0.0,JFK,Ord,740.0,,,False,0
7403,1990,1,4,4,1840.0,1730,2045.0,1936,PA (1),2027,...,,69.0,70.0,JFK,Ord,740.0,,,False,0
7404,1990,1,5,5,1730.0,1730,1925.0,1936,PA (1),2027,...,,-11.0,0.0,JFK,Ord,740.0,,,False,0


For better performance at scale, we can use CuPy instead of NumPy, so that the element-wise boolean `logical_and` operation runs on the GPU. The first call to a CuPy function is typically slower than its NumPy equivalent, because CuPy just-in-time compiles the kernel the first time a function is used in a Python process, which takes a bit of time. CuPy functions therefore pay off mainly when they are called repeatedly in your code.

In [36]:
%time dest_od = cdf.loc[cp.logical_and(cdf["Dest"].str.startswith("O"), cdf["Dest"].str.endswith("d"))]
dest_od.head()

CPU times: user 8.14 ms, sys: 6.19 ms, total: 14.3 ms
Wall time: 13.2 ms


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
7400,1990,1,1,1,1747.0,1730,1915.0,1936,PA (1),2027,...,,-21.0,17.0,JFK,Ord,740.0,,,False,0
7401,1990,1,2,2,1732.0,1730,1907.0,1936,PA (1),2027,...,,-29.0,2.0,JFK,Ord,740.0,,,False,0
7402,1990,1,3,3,1730.0,1730,1927.0,1936,PA (1),2027,...,,-9.0,0.0,JFK,Ord,740.0,,,False,0
7403,1990,1,4,4,1840.0,1730,2045.0,1936,PA (1),2027,...,,69.0,70.0,JFK,Ord,740.0,,,False,0
7404,1990,1,5,5,1730.0,1730,1925.0,1936,PA (1),2027,...,,-11.0,0.0,JFK,Ord,740.0,,,False,0
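
One way to observe a warm-up effect like this is to time several consecutive calls rather than a single one. The helper below is a hypothetical sketch (`time_calls` is not part of any library) and uses NumPy so that it runs without a GPU. With NumPy, no first-call compilation cost is expected. When timing GPU code, the function being timed should also wait for the device to finish, because GPU work is launched asynchronously.

```python
import time

import numpy as np


def time_calls(fn, repeats=5):
    """Return the wall-clock duration in seconds of each consecutive call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Made-up boolean arrays standing in for the two string-match masks.
rng = np.random.default_rng(0)
a = rng.random(1_000_000) > 0.5
b = rng.random(1_000_000) > 0.5

durations = time_calls(lambda: np.logical_and(a, b))
print(f"first call: {durations[0]:.6f} s")
print(f"later calls (mean): {sum(durations[1:]) / len(durations[1:]):.6f} s")
```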


#### pandas

In [37]:
%time dest_od = pdf.loc[np.logical_and(pdf["Dest"].str.startswith("O"), pdf["Dest"].str.endswith("d"))]

CPU times: user 1.39 s, sys: 11.3 ms, total: 1.4 s
Wall time: 1.4 s
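
For two boolean columns, `np.logical_and(a, b)` yields the same mask as the `&` operator. Here is a minimal sketch on made-up destination codes, using Pandas so it runs on CPU:

```python
import numpy as np
import pandas as pd

# Hypothetical toy destinations; only "Ord" starts with "O" and ends with "d".
dest = pd.Series(["Ord", "Pit", "Oak", "Bed"])

starts_o = dest.str.startswith("O")
ends_d = dest.str.endswith("d")

# Element-wise AND, written two equivalent ways.
mask_np = np.logical_and(starts_o, ends_d)
mask_op = starts_o & ends_d

print(dest[mask_np].tolist())
```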


## Exercise: Basic Data Cleaning

For this exercise we ask you to perform two simple data cleaning tasks using several of the techniques described above:

1. Modifying the data type of a column
2. Transforming string data into our desired format

### 1. Modify `dtypes`

Examine the `dtypes` of `cdf` and convert the "Diverted" data type to boolean.

In [44]:
# examine data types

In [45]:
# convert data type

Solution

In [None]:
cdf["Diverted"] = cdf["Diverted"].astype("bool")
cdf.dtypes

### 2. Title Case the Origins

As it stands, all of the origins are UPPERCASE:

In [44]:
cdf["Origin"].head()

0    EWR
1    EWR
2    EWR
3    EWR
4    EWR
Name: Origin, dtype: object

Convert them to title case as we have already done with the `Dest` column. Use the `%time` magic.

Solution

In [47]:
%time cdf["Origin"] = cdf["Origin"].str.title()

CPU times: user 3.83 ms, sys: 3.38 ms, total: 7.2 ms
Wall time: 5.56 ms


Now, check the mean arrival delay of all non-diverted flights.

Solution

In [53]:
cdf[~cdf.Diverted]["ArrDelay"].mean()

8.012850468211875

## Grouping and Sorting

### Group operations

Group operations with cuDF work the same way as in Pandas.

#### cuDF

In [40]:
%%time
departure_delays = cdf["DepDelay"].groupby(cdf["Origin"])
avg_departure_delay = departure_delays.mean()
avg_departure_delay

CPU times: user 8.07 ms, sys: 10.1 ms, total: 18.1 ms
Wall time: 17.4 ms


Origin
JFK    10.351299
EWR    10.295469
LGA     7.431142
Name: DepDelay, dtype: float64

#### pandas

In [41]:
%%time
departure_delays = pdf["DepDelay"].groupby(pdf["Origin"])
avg_departure_delay = departure_delays.mean()
avg_departure_delay

CPU times: user 117 ms, sys: 10.7 ms, total: 128 ms
Wall time: 128 ms


Origin
EWR    10.295469
JFK    10.351299
LGA     7.431142
Name: DepDelay, dtype: float64
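
The two outputs above contain the same means but list the groups in a different order. Below is a small sketch of the group-and-aggregate pattern on made-up data, using Pandas so it runs without a GPU. Calling `sort_index()` on the result is one way to make outputs from different libraries directly comparable.

```python
import pandas as pd

# Hypothetical toy data, not the real flight records.
toy = pd.DataFrame({
    "Origin": ["JFK", "EWR", "JFK", "LGA", "EWR"],
    "DepDelay": [10.0, 4.0, 20.0, 7.0, 6.0],
})

# Same pattern as the notebook: group one column by another, then average.
avg = toy["DepDelay"].groupby(toy["Origin"]).mean()

# Equivalent spelling that groups the frame by column name.
avg_by_name = toy.groupby("Origin")["DepDelay"].mean()

# Sorting by the group key gives a stable order for comparison.
avg_sorted = avg.sort_index()
print(avg_sorted)
```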

### Sorting

Sorting is also very similar to Pandas, though cuDF does not support in-place sorting.

#### cuDF

In [42]:
%time cdf_dest = cdf["Dest"].sort_values()
print(cdf_dest[:3])
print(cdf_dest[-3:])

CPU times: user 12 ms, sys: 13.8 ms, total: 25.8 ms
Wall time: 25.2 ms
33199      Abe
2096407    Abe
2096408    Abe
Name: Dest, dtype: object
2602769    Tys
2602770    Tys
2602771    Tys
Name: Dest, dtype: object


#### pandas

In [43]:
%time pdf_dest = pdf["Dest"].sort_values()
print(pdf_dest[:3])
print(pdf_dest[-3:])

CPU times: user 1.89 s, sys: 11 ms, total: 1.9 s
Wall time: 1.89 s
2096419    Abe
2096416    Abe
2096417    Abe
Name: Dest, dtype: object
2144899    Tys
2144905    Tys
2123753    Tys
Name: Dest, dtype: object
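
Note that the cuDF and Pandas results above show the same sorted values but different index labels. When many rows share a value, the order among those ties depends on the sorting algorithm. The sketch below, on made-up data with Pandas, shows that `sort_values` returns a new object and that a stable algorithm (`kind="mergesort"`) keeps ties in their original order.

```python
import pandas as pd

# Hypothetical toy destinations with one tie ("Abe" at labels 1 and 3).
dest = pd.Series(["Tys", "Abe", "Pit", "Abe"])

# A stable sort preserves the original relative order of equal values.
sorted_dest = dest.sort_values(kind="mergesort")

print(sorted_dest)
print(dest.tolist())  # the original Series is left unchanged
```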


## Exercise: Lowest Arrival Delay

For this exercise you will need to use both `groupby` and `sort_values`.

We would like to know which destinations are associated with the lowest average arrival delay.

In [45]:
# Your turn:


Solution:

In [56]:
mean_arrival_delays = cdf["ArrDelay"].groupby(cdf["Dest"]).mean().sort_values()
mean_arrival_delays  # Tucson (Tus), Arizona has the lowest mean arrival delay.

Tus           -32.5
Bhm    -5.151785714
Myr    -4.819672131
Oma    -4.802281369
Sav    -2.234317343
           ...     
Isp     27.71428571
Ewr     34.58333333
Ict     35.31578947
Jfk     114.1666667
Stx            <NA>
Name: ArrDelay, Length: 99, dtype: float64
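
A mean alone does not show how many flights it is based on, so it can be useful to compute a count alongside it. The sketch below does this on made-up data with Pandas so it runs on CPU. The values are chosen only for illustration and say nothing about the real flight counts. That the same `agg` call works unchanged on `cdf` is an assumption.

```python
import pandas as pd

# Hypothetical toy data, not the real flight records.
toy = pd.DataFrame({
    "Dest": ["Tus", "Tus", "Bhm", "Bhm", "Bhm", "Pit"],
    "ArrDelay": [-40.0, -25.0, -6.0, -5.0, -4.0, 12.0],
})

# Mean and number of flights per destination, lowest mean first.
summary = (
    toy.groupby("Dest")["ArrDelay"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
print(summary)
```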

Please Restart the Kernel:

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)