<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://developer.nvidia.com/sites/default/files/akamai/embedded/images/EDU/DLI%20Asset%20-%20Logo.jpg" width="400" height="186" /></a></center>

# Speed Up DataFrame Operations w/ RAPIDS cuDF

## Welcome
A **DataFrame** is a 2-dimensional data structure used to represent data in a tabular format, like a spreadsheet or SQL table. Popularized in Python by the Python Data Analysis ([pandas](https://pandas.pydata.org/docs/)) library, DataFrames have become very popular for their familiar representation and a robust set of features that are intuitive and expressive. 

Raw data often needs to be manipulated before it can be used for further purposes such as generating **Business Intelligence**, creating **Dashboard Visualization**, or training **Machine Learning** models. These preprocessing steps can include **filtering**, **merging**, **grouping**, and **aggregating**. 

Below is a typical data processing pipeline: 
![flow](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/flow.png?raw=true)

According to [studies](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=29f71b266f63), data preparation accounts for ~80% of analysts' work. This could be due in part to the rapid increase in the size of data as well as the iterative nature of analytics. 

Recognizing this potential bottleneck, NVIDIA created [**cuDF**](https://docs.rapids.ai/api/cudf/stable/), a library that uses GPU hardware and software to perform data manipulation tasks in parallel, **saving valuable time and resources**. The cuDF library is part of the larger [**RAPIDS**](https://rapids.ai/) data science framework, which allows **end-to-end analytics pipelines** to execute entirely on GPUs. One focus of cuDF and its companion suite of open source software libraries is to provide syntax similar to that of their CPU counterparts, **making them easy to adopt**. 

This notebook is intended to demonstrate speedup in data processing by moving common DataFrame operations to the GPU with minimal changes to existing code. 

### Environment Sanity Check

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a RAPIDS-supported GPU such as a Tesla T4, P4, or P100.

In [None]:
!nvidia-smi

### Setup
Because RAPIDS cuDF isn't readily available in this Google Colab environment, it needs to be installed with the following steps: 
1. Update gcc in Colab
2. Install Conda
3. Install the current stable version of the RAPIDS libraries
4. Copy the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.


In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

![pause](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/pause.png?raw=true)

Were you assigned a compatible GPU? 
If not, click the _Runtime_ dropdown at the top of the page, then _Factory Reset Runtime_ to get another assignment.

In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

![pause](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/pause.png?raw=true)

Don't run the next cell until the Colab session has crashed. 

In [None]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

![pause](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/pause.png?raw=true)

Don't run the next cell until the Colab session has crashed. 

In [None]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

Next we can begin the installation of RAPIDS in this Colab environment. This step can take up to ~15 minutes. Execute the _next cell_ and check out this article on [**10 Minutes to cuDF and Dask-cuDF**](https://docs.rapids.ai/api/cudf/stable/10min.html) while the installation completes. 

In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
!python rapidsai-csp-utils/colab/install_rapids.py stable core
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

## Interactive Exercise

In [52]:
import numpy as np # for generating sample data

import pandas as df
# import cudf as df
import time # for clocking process times
import matplotlib.pyplot as plt # for visualizing results

class Timer: # creating a Timer helper class to measure execution time
  def __enter__(self):
    self.start=time.perf_counter()
    return self
  def __exit__(self, *args):
    self.end=time.perf_counter()
    self.interval=self.end-self.start

### Loading Sample Data
We start our demonstration by generating two 2-dimensional arrays of random integers, each configured at a sizeable 1 million rows by 50 columns. They are then converted to DataFrames using ```pandas.DataFrame()``` or ```cudf.DataFrame()```:

In [53]:
rows=1000000
columns=50

In [None]:
def load_data(): 
  data_a=np.random.randint(0, 100, (rows, columns))
  data_b=np.random.randint(0, 100, (rows, columns))
  dataframe_a=df.DataFrame(data_a, columns=[f'a_{i}' for i in range(columns)])
  dataframe_b=df.DataFrame(data_b, columns=[f'b_{i}' for i in range(columns)])
  return dataframe_a, dataframe_b

with Timer() as process_time: 
  dataframe_a, dataframe_b=load_data()

print(f'The loading process took {process_time.interval:.2f} seconds')
display(dataframe_a.tail(5))
display(dataframe_b.tail(5))

![check](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true)

We created two DataFrames, _dataframe_a_ and _dataframe_b_, that are 1000000 rows by 50 columns each, with columns named a_0, a_1, ..., a_49 and b_0, b_1, ..., b_49 respectively. 
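
To get a rough sense of how large each array is in memory, the sketch below multiplies the number of values by the size of one value. It assumes the integer dtype that `np.random.randint` returns by default on the current platform (commonly 8 bytes, though this can differ), and it reads that size from a small 10-row stand-in array rather than allocating the full data.

```python
import numpy as np

rows, columns = 1_000_000, 50

# Small stand-in array: same default dtype as the full-size data, but cheap to create
sample = np.random.randint(0, 100, (10, columns))
bytes_per_value = sample.dtype.itemsize

# Raw size of one full array, ignoring any DataFrame overhead
estimated_bytes = rows * columns * bytes_per_value
print(f'Roughly {estimated_bytes / 1e6:.0f} MB per array ({sample.dtype})')
```

This estimate matters on the GPU path, since the data has to fit in GPU memory.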

### Merging Data
Sometimes data comes from multiple sources and needs to be merged into one DataFrame with ```DataFrame.merge()```. For example, a typical retail data storage infrastructure may include a customer table and separate transaction and product tables. Merging the data allows the correct details to be included in a single DataFrame to get the insight needed. 

In [None]:
def merge_data(left_df, right_df):
  combined_df=df.merge(left_df, right_df, left_index=True, right_index=True)
  return combined_df

with Timer() as process_time: 
  combined_df=merge_data(dataframe_a, dataframe_b)

print(f'The merging process took {process_time.interval:.2f} seconds')
display(combined_df.head())

![check](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true)

We merged two DataFrames, _dataframe_a_ and _dataframe_b_ on their _index_ into one larger DataFrame that is 1000000 rows by 100 columns (a_0, a_1, ..., b_48, b_49). 
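
The index-based merge is easier to see on a tiny example. The sketch below uses pandas with made-up values and deliberately offsets the right-hand index, to show that the default inner join keeps only the index labels present in both inputs. In the notebook's data the two indexes are identical, so every row is kept.

```python
import pandas as pd

# Hypothetical toy tables; the right-hand index is shifted by one on purpose
left = pd.DataFrame({'a_0': [1, 2, 3]})                        # index 0, 1, 2
right = pd.DataFrame({'b_0': [10, 20, 30]}, index=[1, 2, 3])   # index 1, 2, 3

merged = pd.merge(left, right, left_index=True, right_index=True)

# Only index labels 1 and 2 appear in both tables
print(merged)
```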

### Summarize
Exploring data begins with **descriptive statistics**, which often involves finding the **central tendency** and **dispersion**. They are a quick way to summarize distributions. Measures of central tendency include the mean, median, and mode - they are used to describe the center of a set of data values. Measures of dispersion include variance and standard deviation - they are used to describe the degree to which data is spread around the center. We can quickly compute simple descriptive statistics with the ```DataFrame.describe()``` method. 

In [None]:
def summarize(dataframe):
  summary_df=dataframe.describe()
  return summary_df

with Timer() as process_time: 
  summary_df=summarize(combined_df)

print(f'The summarizing process took {process_time.interval:.2f} seconds')
display(summary_df)

![check](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true)

Since this is a sample data set of uniformly distributed random integers from 0 to 99, we see that each of the columns/features (a_0, a_1, ..., b_48, b_49) has 1000000 values with an average of ~50 and a standard deviation of ~29.
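
These values can be checked against what theory predicts for uniformly distributed integers from 0 to 99. The sketch below uses pandas on a smaller 100,000-row sample (an arbitrary size chosen to keep it fast) and compares the output of `describe()` with the closed-form mean and standard deviation of a discrete uniform distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
sample = pd.DataFrame(rng.integers(0, 100, (100_000, 2)), columns=['a_0', 'a_1'])

summary = sample.describe()

# Discrete uniform on 0..99: mean = (0 + 99) / 2, variance = (100**2 - 1) / 12
expected_mean = (0 + 99) / 2
expected_std = ((100**2 - 1) / 12) ** 0.5

print(summary.loc[['mean', 'std']])
print(f'expected mean {expected_mean:.1f}, expected std {expected_std:.2f}')
```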

### Correlation - Exploring Relationships
We might be interested in finding relationships/dependencies between two or more variables through their correlation with ```DataFrame.corr()```. Correlation is a number between -1 and 1 that describes the strength of the linear association between two variables. A correlation of 1 between two variables means that they change together in the same direction, while a correlation of -1 means that they change together in opposite directions. 

In [None]:
def correlation(dataframe): 
  corr_df=dataframe.corr()
  return corr_df

with Timer() as process_time: 
  corr_df=correlation(combined_df)

print(f'The correlation process took {process_time.interval:.2f} seconds')
display(corr_df.head())

![check](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true)

The resulting correlation matrix shows that each column/feature (a_0, a_1, ..., b_48, b_49) has a perfect correlation (1) with itself and is not correlated (~0) with any other column, as expected for independently generated random data. 
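
To see values other than 1 and ~0, the sketch below builds a small pandas DataFrame with hypothetical columns that are deliberately constructed from one another: one moves with `x`, one moves against it, and one is generated independently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.integers(0, 100, 10_000)

frame = pd.DataFrame({
    'x': x,
    'same_direction': 2 * x + 1,      # increases whenever x increases
    'opposite_direction': -x,         # decreases whenever x increases
    'independent': rng.integers(0, 100, 10_000),  # unrelated to x
})

corr = frame.corr()
print(corr.round(2))
```

The `x` row should read close to 1 for `same_direction`, close to -1 for `opposite_direction`, and close to 0 for `independent`.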

### Grouping
We can compare subsets of the data to explore the significance of categories and classes with the ```DataFrame.groupby()``` method. We can even group continuous data values into a smaller number of bins with ```pandas.cut()``` or ```cudf.cut()``` to simplify our analysis. A grouping is usually followed by an aggregation such as mean or count. For example, we can group our rows into 5 equal-width bins based on their sequential index. 

In [None]:
def groupby_summarize(dataframe):
  dataframe['group']=dataframe.index
  dataframe['group']=df.cut(dataframe['group'], 5)
  group_describe_df=dataframe.groupby('group').mean().reset_index(drop=True)
  return group_describe_df

with Timer() as process_time: 
  group_describe_df=groupby_summarize(combined_df)

print(f'The grouping process took {process_time.interval:.2f} seconds')
display(group_describe_df)

![check](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true)

The resulting DataFrame shows that each group maintains an average of ~50 for each column/feature (a_0, a_1, ..., b_48, b_49) as expected for this sample data. 
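
The binning step can be traced by hand on a small example. The sketch below uses pandas with ten made-up values and two bins instead of five. Passing `labels=False` is a choice made here for readability: it returns integer bin numbers rather than interval labels, unlike the notebook cell above.

```python
import pandas as pd

frame = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Bin the sequential index (0..9) into 2 equal-width bins: rows 0-4 and rows 5-9
frame['group'] = frame.index
frame['group'] = pd.cut(frame['group'], 2, labels=False)

# Aggregate each bin: mean of 1..5 and mean of 6..10
means = frame.groupby('group')['value'].mean()
print(means)
```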

### Putting it together
We can measure the total elapsed time for this sample data processing workflow. 

In [None]:
def pipeline():
  performance={}

  with Timer() as process_time: 
    dataframe_a, dataframe_b=load_data()
  performance['load data']=process_time.interval

  with Timer() as process_time: 
    combined_df=merge_data(dataframe_a, dataframe_b)
  performance['merge data']=process_time.interval
  
  with Timer() as process_time: 
    summarize(combined_df)
  performance['summarize']=process_time.interval
  
  with Timer() as process_time: 
    correlation(combined_df)
  performance['correlation']=process_time.interval
  
  with Timer() as process_time: 
    groupby_summarize(combined_df)
  performance['groupby & summarize']=process_time.interval
  
  # Label the bar with the active backend so CPU and GPU runs can be told apart
  if df.__name__=='cudf': 
    df.DataFrame([performance], index=['gpu']).to_pandas().plot(kind='bar', stacked=True)
  else: 
    df.DataFrame([performance], index=['cpu']).plot(kind='bar', stacked=True)

  return None

### Timing the Pipeline on CPU

In [None]:
import pandas as df
pipeline()

### Switching to GPU
Traditionally, these tasks are done (as we did) using the popular [**pandas**](https://pandas.pydata.org/) library, which runs most operations on a single CPU core. NVIDIA's [**cuDF**](https://docs.rapids.ai/api/cudf/stable/) library was built with users in mind - by offering nearly identical syntax to its CPU counterpart, it lets developers take advantage of its capabilities with only a few changes to their existing code. 

In [61]:
import cudf as df

**That's it!** cuDF uses nearly identical syntax to the familiar pandas API. **Brilliant!** It's worth noting that some features are unique to each library, but conveniently there is a lot of overlap. 
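
Outside this notebook, the same aliasing trick can be used to write code that runs with or without a GPU. The following is a minimal sketch, not part of the original workflow: it tries cuDF first and falls back to pandas. The `backend` variable is a hypothetical name introduced here, and catching only `ImportError` assumes the sole failure mode is a missing cuDF install (a machine with cuDF but no usable GPU may raise a different error).

```python
try:
    import cudf as df      # GPU path, used when cuDF is installed
    backend = 'gpu'
except ImportError:
    import pandas as df    # CPU fallback with the same DataFrame syntax
    backend = 'cpu'

# The code below is identical for both libraries
frame = df.DataFrame({'x': [1, 2, 3]})
total = int(frame['x'].sum())

print(f'Running on {backend}: sum of x is {total}')
```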

In [None]:
pipeline()

### Comparing Results
In a trial run, **cuDF** completed the data processing tasks nearly 10x faster than **pandas**. The expectation is that the speedup will be even more significant as the size of the data grows. Feel free to give it a try by modifying the dimensions of the data above. 

![result](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/result.png?raw=true)

## Conclusion
Congratulations on completing the notebook! Want to learn more about cuDF and the rest of the RAPIDS framework? Check out the follow-up to this course, [Accelerating End-to-End Data Science Workflows](https://courses.nvidia.com/courses/course-v1:DLI+S-DS-01+V1/about), or our other online courses at [NVIDIA DLI](https://www.nvidia.com/en-us/training/online/).