![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)  <img src='images/data_wrangling.png' align='right' width=100>

# *Practicum AI Data*: RAPIDS - cuDF

This exercise is inspired by [Nvidia's online deep learning courses](https://courses.nvidia.com/courses/course-v1:DLI+S-DS-01+V1/).
***

In this notebook, we will start to explore [RAPIDS](https://rapids.ai/), a data science framework that includes a collection of libraries for running end-to-end data science pipelines entirely on the GPU. <img src='images/RAPIDS-logo-purple.png' align='right' width=200>The interface is designed to have a familiar look and feel for Python users, but it uses optimized NVIDIA® CUDA® primitives and high-bandwidth GPU memory under the hood.

![The Rapids pipeline](images/rapids_arrow.png)

In this notebook, we will just focus on the [cuDF](https://docs.rapids.ai/api/cudf/stable/) library and compare the performance with pandas.

## Objectives

By the end of this notebook, you will be able to:

1. Perform data reading and writing operations using cuDF.
2. Compare the performance of basic data manipulations between cuDF and pandas.

## 1. cuDF 

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

**Why use cuDF?**

* Enhanced Performance: By utilizing GPUs, cuDF can process data at a much faster rate compared to traditional CPU-based frameworks, enabling quicker analysis and exploration of large datasets.

* Seamless Transition: cuDF's pandas-like API allows users familiar with pandas to easily switch to cuDF without extensive relearning, saving time and effort.

* Scalability: cuDF's ability to efficiently handle large datasets enables users to work with massive amounts of data that would otherwise be challenging or impossible to process with traditional tools, empowering data scientists and analysts to tackle more complex tasks.

**The features of cuDF**

* GPU Acceleration: cuDF leverages the power of GPUs to accelerate data processing, offering significant performance improvements compared to traditional CPU-based data processing frameworks.

* Familiar API: cuDF provides an API that closely resembles pandas, making it easy for pandas users to transition to cuDF and leverage their existing knowledge and codebase.

* Efficient Handling of Large Datasets: cuDF keeps data in GPU memory in a columnar layout, which suits datasets with many millions of rows. A cuDF dataframe lives on a single GPU; for data that exceeds the memory of one GPU, RAPIDS provides Dask-cuDF to spread the work across several GPUs.
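
As a minimal sketch of the "familiar API" point, the snippet below builds a tiny dataframe and runs a filter and an aggregation. The column names and values are invented for illustration, and the pandas fallback is only there so the cell also runs on a machine without a GPU.

```python
# Use cuDF when it is installed; otherwise fall back to pandas.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical toy data, not taken from the crops dataset.
toy = xdf.DataFrame({
    "crop": ["OATS", "CORN", "OATS", "WHEAT"],
    "acres": [100, 250, 50, 400],
})

# The same calls are used with either library.
oats = toy[toy["crop"] == "OATS"]
total_oat_acres = int(oats["acres"].sum())
print(xdf.__name__, total_oat_acres)
```

Only the import line changes between the two libraries here; the rest of the code is shared.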

## 2. cuDF vs pandas

**Available GPU Accelerators**

To obtain details about the available GPUs in your environment, their current memory usage, and any active processes utilizing them, please execute the following cells.

In [1]:
# Check GPU availability

from numba import cuda

cuda_available = cuda.is_available()
print("GPU Available:", cuda_available)

GPU Available: True


Running the `nvidia-smi` command provides detailed information about the GPUs, including utilization, memory usage, temperature, power consumption, and more.

In [2]:
!nvidia-smi

Wed May 31 11:57:03 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   51C    P0   311W / 400W |   3004MiB / 81251MiB |     89%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   26C    P0    61W / 400W |      0MiB / 81251MiB |      0%      Default |
|       

**Time commands**

In Jupyter, *magic commands* are recognizable by a leading `%` (line magics, which apply to a single line) or `%%` (cell magics, which apply to an entire cell). The timing magics print summary information about how long code takes to run.

To compare the performance of pandas and cuDF, we use `%time` to measure the execution time of a single line and `%%time` to measure the execution time of an entire cell.

In [3]:
from time import sleep

%time sleep(2) # %time only times one line
sleep(1)

CPU times: user 57 µs, sys: 1.08 ms, total: 1.14 ms
Wall time: 2 s


In [6]:
%%time 
# %%time will time the entire cell

sleep(1)
sleep(2)

CPU times: user 831 µs, sys: 65 µs, total: 896 µs
Wall time: 3 s
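
The `%time` and `%%time` magics only exist inside IPython and Jupyter. In a plain Python script, a rough equivalent of the "Wall time" figure can be obtained with the standard library's `time.perf_counter`. The helper name `wall_time` below is hypothetical, and this sketch reports wall-clock time only, not the CPU times that the magics also print.

```python
from time import perf_counter, sleep


def wall_time(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    return result, perf_counter() - start


_, elapsed = wall_time(sleep, 0.2)
print(f"Wall time: {elapsed:.2f} s")
```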


### 2.1 Reading and Writing Data

With cuDF, the GPU-accelerated dataframe library in RAPIDS, we can read data from various formats, including CSV, JSON, Parquet, Feather, and ORC, and we can also convert existing pandas dataframes.
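
Moving a dataframe between the CPU and the GPU can be sketched as follows. `cudf.from_pandas` and `DataFrame.to_pandas` are the conversion calls; the toy data is made up, and the GPU branch is skipped when cuDF is not installed.

```python
import pandas as pd

try:
    import cudf
except ImportError:
    cudf = None  # no GPU stack available; only the pandas part runs

# Hypothetical toy data.
pdf = pd.DataFrame({"state": ["FLORIDA", "IOWA"], "year": [2007, 1985]})

if cudf is not None:
    gdf_small = cudf.from_pandas(pdf)   # host memory -> GPU memory
    round_trip = gdf_small.to_pandas()  # GPU memory -> host memory
else:
    round_trip = pdf.copy()

print(round_trip)
```

Each conversion copies the data between host and GPU memory, so it is usually worth keeping a dataframe on the GPU for as many steps as possible.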

#### Reading Data

In this notebook, we will focus on working with a substantial dataset consisting of nearly 20 million records. This dataset contains comprehensive information about different crops cultivated in the United States, including acreage, production, yield, and other relevant statistics. The dataset is sourced from the United States Department of Agriculture (USDA) National Agricultural Statistics Service, which diligently maintains agricultural data. For more information on obtaining this dataset, please refer to our [03.3_getting_data.ipynb](03.3_getting_data.ipynb) notebook.

We will directly read the data from a local csv file into GPU memory, utilizing the capabilities of cuDF.

In [7]:
# Import necessary libraries

import cudf
import pandas as pd

In [8]:
%time gdf = cudf.read_csv('data/qs_crops.csv')
gdf.shape

CPU times: user 1.67 s, sys: 3.45 s, total: 5.12 s
Wall time: 7.43 s


(20430138, 39)

In [9]:
gdf.dtypes

SOURCE_DESC              object
SECTOR_DESC              object
GROUP_DESC               object
COMMODITY_DESC           object
CLASS_DESC               object
PRODN_PRACTICE_DESC      object
UTIL_PRACTICE_DESC       object
STATISTICCAT_DESC        object
UNIT_DESC                object
SHORT_DESC               object
DOMAIN_DESC              object
DOMAINCAT_DESC           object
AGG_LEVEL_DESC           object
STATE_ANSI                int64
STATE_FIPS_CODE           int64
STATE_ALPHA              object
STATE_NAME               object
ASD_CODE                  int64
ASD_DESC                 object
COUNTY_ANSI               int64
COUNTY_CODE               int64
COUNTY_NAME              object
REGION_DESC              object
ZIP_5                     int64
WATERSHED_CODE            int64
WATERSHED_DESC           object
CONGR_DISTRICT_CODE        int8
COUNTRY_CODE              int64
COUNTRY_NAME             object
LOCATION_DESC            object
YEAR                      int64
FREQ_DES

For the purpose of comparison, we will now read the same dataset into a Pandas dataframe.

In [10]:
%time df = pd.read_csv('data/qs_crops.csv', low_memory=False)
gdf.shape == df.shape

CPU times: user 1min 39s, sys: 17.5 s, total: 1min 56s
Wall time: 1min 56s


True

Throughout this notebook, **gdf** refers to a GPU dataframe (a cuDF dataframe held in GPU memory) and **df** refers to a CPU dataframe (pandas), so that the two can be compared directly.

#### Writing to File

In addition to reading data, cuDF also offers methods for writing data to files. We will create a new dataframe specifically containing crops data for the state of Florida and then write it to a file named *florida_crops.csv* using cuDF. We will also perform the same operation using Pandas for comparison purposes.

**cuDF**

In [11]:
%time florida_crops = gdf.loc[gdf['STATE_NAME'] == 'FLORIDA']
print(f'{florida_crops.shape[0]} rows crop data')

CPU times: user 19.7 ms, sys: 26.5 ms, total: 46.2 ms
Wall time: 45 ms
267542 rows crop data


In [15]:
%time florida_crops.to_csv('data/florida_crops.csv')

CPU times: user 19.7 ms, sys: 104 ms, total: 124 ms
Wall time: 199 ms


**pandas**

In [16]:
%time florida_crops_pd = df.loc[df['STATE_NAME'] == 'FLORIDA']
print(f'{florida_crops_pd.shape[0]} rows crop data')

CPU times: user 1.2 s, sys: 96.1 ms, total: 1.3 s
Wall time: 1.29 s
267542 rows crop data


In [17]:
%time florida_crops_pd.to_csv('data/florida_crops_pd.csv')

CPU times: user 2.9 s, sys: 63 ms, total: 2.96 s
Wall time: 3.09 s
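
By default, `to_csv` also writes the row index as an extra first column. If that column is not wanted when the file is read back, `index=False` leaves it out. A minimal sketch with made-up rows and a temporary file (the file name `small.csv` is hypothetical):

```python
import os
import tempfile

try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical rows standing in for the Florida subset.
small = xdf.DataFrame({
    "STATE_NAME": ["FLORIDA", "FLORIDA"],
    "COMMODITY_DESC": ["ORANGES", "OATS"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small.csv")
    small.to_csv(path, index=False)  # do not write the index column
    restored = xdf.read_csv(path)

print(list(restored.columns))
```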


### 2.2 Converting Data Types

Aside from its superior performance with large datasets, cuDF closely resembles the syntax and functionality of Pandas. In this section, we will showcase a few simple operations to highlight the similarities. It's worth noting that, in cuDF, column operations are generally more efficient than row-wise operations.

Sometimes we need to convert integer values to floats. In the following example, we convert the 'COUNTY_CODE' column from int64 (see the `gdf.dtypes` output above) to float32 in cuDF, and then compare the performance of the same operation in Pandas.

**cuDF**

In [18]:
%time gdf['COUNTY_CODE'] = gdf['COUNTY_CODE'].astype('float32')

CPU times: user 548 µs, sys: 2.75 ms, total: 3.3 ms
Wall time: 2.64 ms


**pandas**

In [19]:
%time df['COUNTY_CODE'] = df['COUNTY_CODE'].astype('float32')

CPU times: user 68.4 ms, sys: 367 ms, total: 435 ms
Wall time: 434 ms
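
One reason to change a column's type is memory: a 32-bit column needs half the bytes of a 64-bit one. The sketch below measures this on synthetic data; the column name is borrowed from the dataset, but the values are not.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# 1000 synthetic integer codes, stored explicitly as int64.
codes = xdf.DataFrame({"COUNTY_CODE": list(range(1000))}).astype("int64")

before = int(codes["COUNTY_CODE"].memory_usage(index=False))
codes["COUNTY_CODE"] = codes["COUNTY_CODE"].astype("float32")
after = int(codes["COUNTY_CODE"].memory_usage(index=False))

print(codes["COUNTY_CODE"].dtype, before, after)
```

Keep in mind that float32 represents integers exactly only up to 2^24, so this kind of downcast is safe for small codes but not for arbitrary large identifiers.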


### 2.3 Column-Wise Aggregations

Column-wise aggregations leverage the architecture of the GPU and the memory format of RAPIDS to achieve efficient computations.

**cuDF**

In [21]:
%%time

# Convert the data type of the "VALUE" column from object to float

import numpy as np

gdf_value = []

numpy_array=gdf['VALUE'].to_numpy()

for value in numpy_array:
    if value is None:
        value = 0
    try:
        gdf_value.append(float(value))
    except ValueError:
        gdf_value.append(np.nan)
        
gdf_value = cudf.DataFrame(gdf_value)

CPU times: user 10.7 s, sys: 535 ms, total: 11.3 s
Wall time: 11.2 s


In [22]:
%time gdf_value.mean()

CPU times: user 4.33 ms, sys: 1.01 ms, total: 5.34 ms
Wall time: 4.63 ms


0    72.457203
dtype: float64

**pandas**

In [23]:
%%time

# Convert the data type of the "VALUE" column from object to float

import numpy as np

df_value = []

numpy_array=df['VALUE'].to_numpy()

for value in numpy_array:
    if value is None:
        value = 0
    try:
        df_value.append(float(value))
    except ValueError:
        df_value.append(np.nan)
        
df_value = pd.DataFrame(df_value)

CPU times: user 11 s, sys: 495 ms, total: 11.5 s
Wall time: 11.4 s


In [24]:
%time df_value.mean()

CPU times: user 124 ms, sys: 80.9 ms, total: 205 ms
Wall time: 204 ms


0    72.457228
dtype: float64
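
The conversion loops above run element by element in Python on the CPU for both libraries, which is why the two conversions take about the same time. A vectorized conversion is one possible alternative. The sketch below uses `to_numeric` with `errors='coerce'`, which turns strings that cannot be parsed into missing values. It is an assumption that this matches what the loop is meant to do (the loop maps `None` to 0, which this sketch does not), and the sample strings are invented rather than taken from the `VALUE` column.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Invented sample: two numbers and one non-numeric code.
raw = xdf.Series(["3", "(D)", "43.5"])

# Strings that cannot be parsed become missing values.
values = xdf.to_numeric(raw, errors="coerce")

# mean() skips missing values by default.
mean_value = float(values.mean())
print(mean_value)
```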

### 2.4 Data Subsetting 

cuDF also provides support for two core data subsetting tools: `loc` (label-based locator) and `iloc` (integer-based locator).

In our dataset, the labels happen to be incrementing integers, so `loc` and `iloc` look very similar. As in Pandas, a `loc` slice is label-based and includes both endpoints, while an `iloc` slice is position-based and half-open, excluding the final position. This is why `loc[200:205]` below returns six rows and `iloc[200:205]` returns five.

In [25]:
gdf.loc[200:205]

Unnamed: 0,SOURCE_DESC,SECTOR_DESC,GROUP_DESC,COMMODITY_DESC,CLASS_DESC,PRODN_PRACTICE_DESC,UTIL_PRACTICE_DESC,STATISTICCAT_DESC,UNIT_DESC,SHORT_DESC,...,LOCATION_DESC,YEAR,FREQ_DESC,BEGIN_CODE,END_CODE,REFERENCE_PERIOD_DESC,WEEK_ENDING,LOAD_TIME,VALUE,CV_%
200,CENSUS,CROPS,FIELD CROPS,HAY,ALFALFA,IRRIGATED,ALL UTILIZATION PRACTICES,AREA HARVESTED,OPERATIONS,"HAY, ALFALFA, IRRIGATED - OPERATIONS WITH AREA...",...,"MICHIGAN, CENTRAL, GLADWIN",2007,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,3,
201,SURVEY,CROPS,FIELD CROPS,SOYBEANS,ALL CLASSES,NON-IRRIGATED,ALL UTILIZATION PRACTICES,PRODUCTION,BU,"SOYBEANS, NON-IRRIGATED - PRODUCTION, MEASURED...",...,NEBRASKA,1988,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,44054000,
202,CENSUS,CROPS,FIELD CROPS,WHEAT,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,AREA HARVESTED,OPERATIONS,WHEAT - OPERATIONS WITH AREA HARVESTED,...,"VIRGINIA, CENTRAL, APPOMATTOX",2007,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,2,
203,SURVEY,CROPS,FIELD CROPS,WHEAT,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,BU / ACRE,"WHEAT - YIELD, MEASURED IN BU / ACRE",...,"IOWA, NORTHWEST, OSCEOLA",1985,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,43,
204,SURVEY,CROPS,FIELD CROPS,TOBACCO,FIRE-CURED VA BELT (TYPE 21),ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,LB / ACRE,"TOBACCO, FIRE-CURED VA BELT (TYPE 21) - YIELD,...",...,"VIRGINIA, CENTRAL, NELSON",1944,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,940,
205,SURVEY,CROPS,FIELD CROPS,TOBACCO,FLUE-CURED NC BORD & SC BELT (TYPE 13),ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,AREA HARVESTED,ACRES,"TOBACCO, FLUE-CURED NC BORD & SC BELT (TYPE 13...",...,NORTH CAROLINA,1940,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,58000,


In [26]:
gdf.iloc[200:205]

Unnamed: 0,SOURCE_DESC,SECTOR_DESC,GROUP_DESC,COMMODITY_DESC,CLASS_DESC,PRODN_PRACTICE_DESC,UTIL_PRACTICE_DESC,STATISTICCAT_DESC,UNIT_DESC,SHORT_DESC,...,LOCATION_DESC,YEAR,FREQ_DESC,BEGIN_CODE,END_CODE,REFERENCE_PERIOD_DESC,WEEK_ENDING,LOAD_TIME,VALUE,CV_%
200,CENSUS,CROPS,FIELD CROPS,HAY,ALFALFA,IRRIGATED,ALL UTILIZATION PRACTICES,AREA HARVESTED,OPERATIONS,"HAY, ALFALFA, IRRIGATED - OPERATIONS WITH AREA...",...,"MICHIGAN, CENTRAL, GLADWIN",2007,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,3,
201,SURVEY,CROPS,FIELD CROPS,SOYBEANS,ALL CLASSES,NON-IRRIGATED,ALL UTILIZATION PRACTICES,PRODUCTION,BU,"SOYBEANS, NON-IRRIGATED - PRODUCTION, MEASURED...",...,NEBRASKA,1988,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,44054000,
202,CENSUS,CROPS,FIELD CROPS,WHEAT,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,AREA HARVESTED,OPERATIONS,WHEAT - OPERATIONS WITH AREA HARVESTED,...,"VIRGINIA, CENTRAL, APPOMATTOX",2007,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,2,
203,SURVEY,CROPS,FIELD CROPS,WHEAT,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,BU / ACRE,"WHEAT - YIELD, MEASURED IN BU / ACRE",...,"IOWA, NORTHWEST, OSCEOLA",1985,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,43,
204,SURVEY,CROPS,FIELD CROPS,TOBACCO,FIRE-CURED VA BELT (TYPE 21),ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,LB / ACRE,"TOBACCO, FIRE-CURED VA BELT (TYPE 21) - YIELD,...",...,"VIRGINIA, CENTRAL, NELSON",1944,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,940,
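
The difference between the two slices is easier to see on a tiny frame. This is a minimal sketch on synthetic data with default integer labels; the pandas fallback is only there so the cell runs without a GPU.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

frame = xdf.DataFrame({"n": list(range(10))})  # default labels 0..9

by_label = frame.loc[2:5]      # labels 2, 3, 4, 5 (end included)
by_position = frame.iloc[2:5]  # positions 2, 3, 4 (end excluded)

print(len(by_label), len(by_position))
```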


The `loc` function also allows us to select specific rows or columns from a cuDF dataframe based on boolean conditions. By specifying a boolean array or series as the index to the `loc` function, we can filter the dataframe and retrieve the rows or columns that satisfy the given conditions.

**cuDF**

In [27]:
%time crop_names = gdf.loc[gdf['COMMODITY_DESC'].str.startswith('O')]
crop_names.head()

CPU times: user 15 ms, sys: 47 ms, total: 61.9 ms
Wall time: 60.7 ms


Unnamed: 0,SOURCE_DESC,SECTOR_DESC,GROUP_DESC,COMMODITY_DESC,CLASS_DESC,PRODN_PRACTICE_DESC,UTIL_PRACTICE_DESC,STATISTICCAT_DESC,UNIT_DESC,SHORT_DESC,...,LOCATION_DESC,YEAR,FREQ_DESC,BEGIN_CODE,END_CODE,REFERENCE_PERIOD_DESC,WEEK_ENDING,LOAD_TIME,VALUE,CV_%
12,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PRODUCTION,BU,"OATS - PRODUCTION, MEASURED IN BU",...,"ILLINOIS, EAST SOUTHEAST, EFFINGHAM",1965,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,117500.0,
30,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,IRRIGATED,ALL UTILIZATION PRACTICES,AREA HARVESTED,ACRES,"OATS, IRRIGATED - ACRES HARVESTED",...,"MONTANA, CENTRAL, GOLDEN VALLEY",1969,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,100.0,
39,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ON FARM,STOCKS,BU,"OATS, ON FARM - STOCKS, MEASURED IN BU",...,NEW MEXICO,1966,POINT IN TIME,12,12,FIRST OF DEC,,2012-01-01 00:00:00,67000.0,
40,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PRICE RECEIVED,$ / BU,"OATS - PRICE RECEIVED, MEASURED IN $ / BU",...,SOUTH DAKOTA,1980,MONTHLY,4,4,APR,,2012-01-01 00:00:00,1.26,
64,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,BU / ACRE,"OATS - YIELD, MEASURED IN BU / ACRE",...,"IDAHO, EAST, BANNOCK",1999,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,60.0,


**pandas**

In [28]:
%time crop_names_pd = df.loc[df['COMMODITY_DESC'].str.startswith('O')]
crop_names_pd.head()

CPU times: user 5.65 s, sys: 604 ms, total: 6.25 s
Wall time: 6.23 s


Unnamed: 0,SOURCE_DESC,SECTOR_DESC,GROUP_DESC,COMMODITY_DESC,CLASS_DESC,PRODN_PRACTICE_DESC,UTIL_PRACTICE_DESC,STATISTICCAT_DESC,UNIT_DESC,SHORT_DESC,...,LOCATION_DESC,YEAR,FREQ_DESC,BEGIN_CODE,END_CODE,REFERENCE_PERIOD_DESC,WEEK_ENDING,LOAD_TIME,VALUE,CV_%
12,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PRODUCTION,BU,"OATS - PRODUCTION, MEASURED IN BU",...,"ILLINOIS, EAST SOUTHEAST, EFFINGHAM",1965,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,117500.0,
30,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,IRRIGATED,ALL UTILIZATION PRACTICES,AREA HARVESTED,ACRES,"OATS, IRRIGATED - ACRES HARVESTED",...,"MONTANA, CENTRAL, GOLDEN VALLEY",1969,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,100.0,
39,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ON FARM,STOCKS,BU,"OATS, ON FARM - STOCKS, MEASURED IN BU",...,NEW MEXICO,1966,POINT IN TIME,12,12,FIRST OF DEC,,2012-01-01 00:00:00,67000.0,
40,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PRICE RECEIVED,$ / BU,"OATS - PRICE RECEIVED, MEASURED IN $ / BU",...,SOUTH DAKOTA,1980,MONTHLY,4,4,APR,,2012-01-01 00:00:00,1.26,
64,SURVEY,CROPS,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,BU / ACRE,"OATS - YIELD, MEASURED IN BU / ACRE",...,"IDAHO, EAST, BANNOCK",1999,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,60.0,
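
Boolean conditions can also be combined before they are passed to `loc`. The sketch below joins two masks with `&`; each comparison needs its own parentheses because `&` binds more tightly than `>=`. The rows are invented, and only the column names come from the dataset.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

toy = xdf.DataFrame({
    "COMMODITY_DESC": ["OATS", "ORANGES", "CORN", "OATS"],
    "YEAR": [1965, 2007, 1991, 1999],
})

# Crops whose name starts with "O" AND that were recorded in 1990 or later.
mask = toy["COMMODITY_DESC"].str.startswith("O") & (toy["YEAR"] >= 1990)
recent_o_crops = toy.loc[mask]

print(recent_o_crops)
```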


### 2.5 Grouping and Sorting

#### Grouping

Grouping with cuDF follows the same principles as in Pandas.

Once the data is grouped, you can access a specific group by passing its name to the `get_group()` method. For instance, the cells below select the group for the state 'FLORIDA'.

**cuDF**

In [30]:
%time states_groups = gdf[['STATE_NAME', 'COMMODITY_DESC']].groupby(['STATE_NAME'])
florida = states_groups.get_group('FLORIDA')
florida[:5]

CPU times: user 552 µs, sys: 0 ns, total: 552 µs
Wall time: 556 µs


Unnamed: 0,STATE_NAME,COMMODITY_DESC
189,FLORIDA,OATS
244,FLORIDA,GRAPEFRUIT
270,FLORIDA,ORANGES
302,FLORIDA,FRUIT & TREE NUT TOTALS
378,FLORIDA,GRASSES


**pandas**

In [32]:
%time states_groups_pd = df[['STATE_NAME', 'COMMODITY_DESC']].groupby(['STATE_NAME'])
florida = states_groups_pd.get_group('FLORIDA')
florida[:5]

CPU times: user 99.5 ms, sys: 99.2 ms, total: 199 ms
Wall time: 198 ms


Unnamed: 0,STATE_NAME,COMMODITY_DESC
189,FLORIDA,OATS
244,FLORIDA,GRAPEFRUIT
270,FLORIDA,ORANGES
302,FLORIDA,FRUIT & TREE NUT TOTALS
378,FLORIDA,GRASSES
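
Grouping is most often followed by an aggregation. As a sketch under the assumption that a numeric column is available, the example below counts and averages an invented `VALUE` column per state. The result is sorted by group key explicitly, because the order of groups is not guaranteed to be the same in the two libraries.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Invented rows; only the column names mirror the dataset.
toy = xdf.DataFrame({
    "STATE_NAME": ["FLORIDA", "FLORIDA", "IOWA", "IOWA", "IOWA"],
    "VALUE": [10.0, 30.0, 1.0, 2.0, 3.0],
})

# One row per state, one column per aggregation.
summary = toy.groupby("STATE_NAME")["VALUE"].agg(["count", "mean"])
summary = summary.sort_index()

print(summary)
```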


#### Sorting

Sorting in cuDF is quite similar to Pandas, although cuDF does not support in-place sorting.

To sort a cuDF dataframe, you can use the `sort_values()` function, which takes one or more columns as arguments and returns a new sorted dataframe.

**cuDF**

In [33]:
%time gdf_names = gdf['STATE_NAME'].sort_values()
print(gdf_names[:5]) 
print(gdf_names[1000000:1000005])

CPU times: user 39.1 ms, sys: 31.9 ms, total: 71 ms
Wall time: 69.9 ms
127    ALABAMA
135    ALABAMA
173    ALABAMA
273    ALABAMA
343    ALABAMA
Name: STATE_NAME, dtype: object
2907995    CALIFORNIA
2907997    CALIFORNIA
2908026    CALIFORNIA
2908066    CALIFORNIA
2908069    CALIFORNIA
Name: STATE_NAME, dtype: object


**pandas**

In [34]:
%time df_names = df['STATE_NAME'].sort_values()
print(df_names[:5]) 
print(df_names[1000000:1000005])

CPU times: user 19.2 s, sys: 182 ms, total: 19.4 s
Wall time: 19.4 s
7964121     ALABAMA
19889613    ALABAMA
1099912     ALABAMA
7346492     ALABAMA
7346462     ALABAMA
Name: STATE_NAME, dtype: object
9545973     CALIFORNIA
6546940     CALIFORNIA
18959403    CALIFORNIA
14621686    CALIFORNIA
1811194     CALIFORNIA
Name: STATE_NAME, dtype: object
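
The index labels in the two outputs above differ because rows with equal keys are not necessarily returned in the same order by both libraries. Adding a second sort key makes the order better defined. The sketch below sorts invented rows by state and then by year in descending order, and assigns the result back instead of sorting in place.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

toy = xdf.DataFrame({
    "STATE_NAME": ["IOWA", "ALABAMA", "IOWA", "ALABAMA"],
    "YEAR": [1985, 2007, 1999, 1944],
})

# sort_values returns a new object, so assign the result back.
toy = toy.sort_values(["STATE_NAME", "YEAR"], ascending=[True, False])
toy = toy.reset_index(drop=True)

print(toy)
```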


### 2.6 Other useful methods

#### String Operations

Although strings are not a data type traditionally associated with GPUs, cuDF offers accelerated string operations through the same `.str` accessor used in Pandas. These operations run in parallel on the GPU, so string manipulations and transformations on large columns can be much faster than on the CPU. The example below converts the `COMMODITY_DESC` column to title case with both libraries.

**cuDF**

In [35]:
%time gdf['COMMODITY_DESC'] = gdf['COMMODITY_DESC'].str.title()

CPU times: user 7.23 ms, sys: 7.67 ms, total: 14.9 ms
Wall time: 14 ms


In [36]:
gdf.head()

Unnamed: 0,SOURCE_DESC,SECTOR_DESC,GROUP_DESC,COMMODITY_DESC,CLASS_DESC,PRODN_PRACTICE_DESC,UTIL_PRACTICE_DESC,STATISTICCAT_DESC,UNIT_DESC,SHORT_DESC,...,LOCATION_DESC,YEAR,FREQ_DESC,BEGIN_CODE,END_CODE,REFERENCE_PERIOD_DESC,WEEK_ENDING,LOAD_TIME,VALUE,CV_%
0,SURVEY,CROPS,FIELD CROPS,Soybeans,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,BU / ACRE,"SOYBEANS - YIELD, MEASURED IN BU / ACRE",...,"MICHIGAN, SOUTHWEST, CASS",1972,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,23.1,
1,SURVEY,CROPS,FIELD CROPS,Soybeans,ALL CLASSES,ALL PRODUCTION PRACTICES,ON FARM,STOCKS,BU,"SOYBEANS, ON FARM - STOCKS, MEASURED IN BU",...,TENNESSEE,1965,POINT IN TIME,12,12,FIRST OF DEC,,2012-01-01 00:00:00,2236000.0,
2,SURVEY,CROPS,FIELD CROPS,Sugarbeets,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,SUCROSE,PCT,"SUGARBEETS - SUCROSE, MEASURED IN PCT",...,"OHIO, NORTHWEST, PUTNAM",1983,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,16.26,
3,SURVEY,CROPS,FIELD CROPS,Hay,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PRODUCTION,TONS,"HAY - PRODUCTION, MEASURED IN TONS",...,"MISSOURI, NORTHWEST, ANDREW",1992,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,49500.0,
4,SURVEY,CROPS,FIELD CROPS,Corn,ALL CLASSES,ALL PRODUCTION PRACTICES,SILAGE,PRODUCTION,TONS,"CORN, SILAGE - PRODUCTION, MEASURED IN TONS",...,"NEW YORK, CENTRAL, CORTLAND",1991,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,184200.0,


**pandas**

In [37]:
%time df['COMMODITY_DESC'] = df['COMMODITY_DESC'].str.title()

CPU times: user 5.8 s, sys: 1.42 s, total: 7.22 s
Wall time: 7.19 s


In [38]:
df.head()

Unnamed: 0,SOURCE_DESC,SECTOR_DESC,GROUP_DESC,COMMODITY_DESC,CLASS_DESC,PRODN_PRACTICE_DESC,UTIL_PRACTICE_DESC,STATISTICCAT_DESC,UNIT_DESC,SHORT_DESC,...,LOCATION_DESC,YEAR,FREQ_DESC,BEGIN_CODE,END_CODE,REFERENCE_PERIOD_DESC,WEEK_ENDING,LOAD_TIME,VALUE,CV_%
0,SURVEY,CROPS,FIELD CROPS,Soybeans,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,YIELD,BU / ACRE,"SOYBEANS - YIELD, MEASURED IN BU / ACRE",...,"MICHIGAN, SOUTHWEST, CASS",1972,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,23.1,
1,SURVEY,CROPS,FIELD CROPS,Soybeans,ALL CLASSES,ALL PRODUCTION PRACTICES,ON FARM,STOCKS,BU,"SOYBEANS, ON FARM - STOCKS, MEASURED IN BU",...,TENNESSEE,1965,POINT IN TIME,12,12,FIRST OF DEC,,2012-01-01 00:00:00,2236000.0,
2,SURVEY,CROPS,FIELD CROPS,Sugarbeets,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,SUCROSE,PCT,"SUGARBEETS - SUCROSE, MEASURED IN PCT",...,"OHIO, NORTHWEST, PUTNAM",1983,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,16.26,
3,SURVEY,CROPS,FIELD CROPS,Hay,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PRODUCTION,TONS,"HAY - PRODUCTION, MEASURED IN TONS",...,"MISSOURI, NORTHWEST, ANDREW",1992,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,49500.0,
4,SURVEY,CROPS,FIELD CROPS,Corn,ALL CLASSES,ALL PRODUCTION PRACTICES,SILAGE,PRODUCTION,TONS,"CORN, SILAGE - PRODUCTION, MEASURED IN TONS",...,"NEW YORK, CENTRAL, CORTLAND",1991,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,184200.0,
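
Beyond `title()`, the `.str` accessor offers other common operations. The sketch below shows lowercasing, substring matching, and string length on a few invented crop names.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

names = xdf.Series(["SWEET CORN", "OATS", "WILD RICE"])

lowered = names.str.lower()         # lower-case copy of each string
has_space = names.str.contains(" ")  # True where the name has two or more words
lengths = names.str.len()           # number of characters per string

print(lowered)
print(int(has_space.sum()), "names contain a space")
print(lengths)
```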


#### The `unique` method

The `unique` method returns the distinct values in a column. cuDF performs this scan in parallel on the GPU, which is particularly useful when you need the distinct values of a large column for further analysis or data processing.

Note in the outputs below that cuDF returns a Series, while Pandas returns a NumPy array.

**cuDF**

In [39]:
%time gdf['COMMODITY_DESC'].unique()

CPU times: user 40.6 ms, sys: 36.4 ms, total: 77 ms
Wall time: 76 ms


0                    Alcohol Coproducts
1                               Almonds
2                              Amaranth
3                                Apples
4                              Apricots
                     ...               
262                          Watercress
263                               Wheat
264                           Wild Rice
265    Woody Ornamentals & Vines, Other
266                                Yams
Name: COMMODITY_DESC, Length: 267, dtype: object

**pandas**

In [40]:
%time df['COMMODITY_DESC'].unique()

CPU times: user 1.01 s, sys: 49.4 ms, total: 1.06 s
Wall time: 1.05 s


array(['Soybeans', 'Sugarbeets', 'Hay', 'Corn', 'Wheat', 'Sunflower',
       'Oats', 'Cotton', 'Rye', 'Crop Totals', 'Hay & Haylage',
       'Sugarcane', 'Beans', 'Sorghum', 'Bedding Plants, Annual',
       'Potatoes', 'Tobacco', 'Barley', 'Greens', 'Horticulture Totals',
       'Tomatoes', 'Vegetable Totals', 'Pumpkins',
       'Bedding Plants, Herbaceous Perennial', 'Blackberries', 'Squash',
       'Pecans', 'Pastureland', 'Rice', 'Orchards', 'Berry Totals',
       'Fruit & Tree Nut Totals', 'Fruit & Nut Plants', 'Grapes',
       'Apples', 'Cut Christmas Trees & Short Term Woody Crops',
       'Peanuts', 'Sweet Corn', 'Grain Storage Capacity',
       'Flowering Plants, Potted', 'Turnips', 'Tangerines',
       'Nursery Totals', 'Sweet Potatoes', 'Flaxseed', 'Blueberries',
       'Cut Christmas Trees', 'Soil', 'Grain', 'Foliage Plants', 'Mint',
       'Beets', 'Deciduous Shade Trees', 'Grapefruit',
       'Vegetables, Other', 'Fruit & Tree Nuts, Other', 'Peas',
       'Nectarines', 'Tr
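
Because the two libraries return different container types for `unique()`, and the order of the values differs in the outputs above, comparing results directly can be awkward. One possible approach, sketched here on invented values, is to count distinct values with `nunique()` and to build a sorted Series of distinct values with `drop_duplicates()`.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

crops = xdf.Series(["Corn", "Oats", "Corn", "Hay", "Oats"])

# Number of distinct values.
n_distinct = int(crops.nunique())

# Distinct values as a sorted Series with a fresh 0..n-1 index.
distinct_sorted = crops.drop_duplicates().sort_values().reset_index(drop=True)

print(n_distinct)
print(distinct_sorted)
```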

#### The `value_counts` method

The `value_counts()` method is a valuable function that returns an object with the counts of unique values in a column of a DataFrame. This function is particularly useful when you want to determine the frequency or occurrence of different values within a specific column.

To identify the top 10 crops based on their occurrence count, you can use the following code snippet:

**cuDF**

In [41]:
%time gdf['COMMODITY_DESC'].value_counts()[:10]

CPU times: user 17.7 ms, sys: 10.7 ms, total: 28.3 ms
Wall time: 27.3 ms


Wheat               3971067
Corn                2103901
Hay                 1673872
Soybeans            1168626
Barley               848615
Oats                 817520
Sorghum              752455
Cotton               710160
Soil                 452440
Vegetable Totals     402757
Name: COMMODITY_DESC, dtype: int32

**pandas**

In [42]:
%time df['COMMODITY_DESC'].value_counts()[:10]

CPU times: user 1.57 s, sys: 0 ns, total: 1.57 s
Wall time: 1.56 s


Wheat               3971067
Corn                2103901
Hay                 1673872
Soybeans            1168626
Barley               848615
Oats                 817520
Sorghum              752455
Cotton               710160
Soil                 452440
Vegetable Totals     402757
Name: COMMODITY_DESC, dtype: int64
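
Two small variations are often handy: `head(n)` as an alternative to slicing, and `normalize=True` to get shares instead of raw counts. A minimal sketch on invented values:

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

crops = xdf.Series(["Wheat", "Corn", "Wheat", "Hay", "Wheat", "Corn"])

# The two most frequent values, most frequent first.
top2 = crops.value_counts().head(2)

# Fraction of rows per value instead of counts.
shares = crops.value_counts(normalize=True)

print(top2)
print(shares)
```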

## Conclusion

This notebook compares the performance and features of two data manipulation libraries: cuDF (GPU-accelerated) and pandas (CPU-based). cuDF excels in speed and scalability with large datasets, thanks to GPU acceleration. pandas offers comprehensive functionalities and broad community support.

Choosing between them depends on specific needs. cuDF is ideal for high-performance computing on large datasets, while pandas is favored for its extensive functionality and ease of use. Users can make informed decisions on leveraging cuDF's speed or pandas' comprehensive features for data manipulation and analysis tasks by considering factors like dataset size and complexity.

For additional functions and features of cuDF, you can visit the [official website](https://docs.rapids.ai/api/cudf/stable/) and refer to the documentation for detailed information.

***

## Bonus Questions

We will use the `%load` command to load the content of a specified file into the cell, primarily for the purpose of incorporating solutions once you have finished the exercises.

### Q1: Modify Data Type

Analyze the data types of the **gdf** DataFrame and convert any 64-bit data types to their 32-bit counterparts.

* Retrieve the data types of each column in the gdf DataFrame and store them in the **gdf_dtypes** variable. Iterate over each column and check if the data type is either 'int64' or 'float64'.

* If a column has a 64-bit data type, use the `astype()` method to convert it to the corresponding 32-bit data type.

In [None]:
# Code it

**Solution**

In [112]:
%load solutions/03.3_modify_dtypes

SOURCE_DESC               object
SECTOR_DESC               object
GROUP_DESC                object
COMMODITY_DESC            object
CLASS_DESC                object
PRODN_PRACTICE_DESC       object
UTIL_PRACTICE_DESC        object
STATISTICCAT_DESC         object
UNIT_DESC                 object
SHORT_DESC                object
DOMAIN_DESC               object
DOMAINCAT_DESC            object
AGG_LEVEL_DESC            object
STATE_ANSI                 int32
STATE_FIPS_CODE            int64
STATE_ALPHA               object
STATE_NAME                object
ASD_CODE                   int32
ASD_DESC                  object
COUNTY_ANSI                int64
COUNTY_CODE              float32
COUNTY_NAME               object
REGION_DESC               object
ZIP_5                      int64
WATERSHED_CODE             int64
WATERSHED_DESC            object
CONGR_DISTRICT_CODE         int8
COUNTRY_CODE               int64
COUNTRY_NAME              object
LOCATION_DESC             object
YEAR      

### Q2: Oranges states

In this exercise, the objective is to identify the states involved in the cultivation of the commodity 'Oranges'. 

* To achieve this, use the `loc` indexer to filter the rows where the 'COMMODITY_DESC' column is equal to 'Oranges', and keep the 'COMMODITY_DESC' and 'STATE_NAME' columns to form the oranges DataFrame.

* By accessing the 'STATE_NAME' column of the oranges DataFrame and applying the `unique()` method, we obtain an array of unique state names that represent the states engaged in the cultivation of oranges.

In [None]:
# Code it

**Solution**

In [106]:
%load solutions/03.3_oranges_states

0                      ALABAMA
1               AMERICAN SAMOA
2                      ARIZONA
3                   CALIFORNIA
4                      FLORIDA
5                      GEORGIA
6                         GUAM
7                       HAWAII
8                    LOUISIANA
9                  MISSISSIPPI
10                    NEW YORK
11    NORTHERN MARIANA ISLANDS
12                 PUERTO RICO
13              SOUTH CAROLINA
14                       TEXAS
15                    US TOTAL
16    VIRGIN ISLANDS OF THE US
Name: STATE_NAME, dtype: object

### Q3: Top five crops

To identify the top 5 crops in Florida, you will utilize the `groupby` and `value_counts` methods. 

* By grouping the data by 'STATE_NAME' and retrieving the group for Florida using `get_group`, you can count the occurrences of each crop in Florida. 

* Then, you can use the `value_counts()` method combined with slicing to select the top 5 crops based on their counts in Florida, like `value_counts()[:5]`.

In [None]:
# Code it

**Solution**

In [93]:
%load solutions/03.3_top_five_crops

STATE_NAME  COMMODITY_DESC  
FLORIDA     Soil                15180
            Corn                15142
            Peanuts             14592
            Oranges             10671
            Vegetable Totals    10166
dtype: int64