# Introduction to RAPIDS

RAPIDS is a collection of GPU-accelerated libraries for running end-to-end data science pipelines entirely on the GPU. It is designed to make effective use of GPU compute through optimized NVIDIA CUDA® primitives and high-bandwidth GPU memory. The primary goal of RAPIDS is to accelerate the individual stages of a typical data science workflow, and thereby the complete end-to-end workflow, from data preparation to machine learning.

Read [this Medium article](https://medium.com/future-vision/what-is-rapids-ai-7e552d80a1d2) to understand how RAPIDS works.

If you have worked with pandas and NumPy before, most of this tutorial will seem very familiar. If you haven't, do not worry: this is a great place to start!

## SETUP

**Note that pandas is a data analysis and manipulation tool built on top of the Python programming language to perform various tasks (e.g., loading, joining, aggregating, and filtering data). cuDF is a GPU DataFrame library that provides similar functionality, accelerated on the GPU.**

Before we dive in, please make sure to check out the official documentation [here](https://docs.rapids.ai/api) to get an overall idea. Additionally, refer to the [cheatsheet](https://rapids.ai/assets/files/cheatsheet.pdf) for a crisp and clear representation of the functionalities provided by RAPIDS.

In [1]:
import cudf

In [2]:

import pandas as pd

## SECTION 1: CUDF BASICS
### cuDF DataFrame
First, we will look at how to create dataframes in cuDF. You can build a dataframe in multiple ways, as shown in the official documentation. Let us begin by initializing an empty dataframe object.

In [3]:
gdf = cudf.DataFrame()

Now that we have a cudf.DataFrame object, we will populate it with values. Let us add values by defining them column by column.

In [4]:
#creates a column named 'index' with the values 0, 1, 2, 3, 4
gdf['index'] = [0, 1, 2, 3, 4]

#creates a column named 'value' with the values 10, 20, 30, 40, 50
gdf['value'] = [10, 20, 30, 40, 50]

#displays the current cudf dataframe
gdf

Unnamed: 0,index,value
0,0,10
1,1,20
2,2,30
3,3,40
4,4,50
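
The column-by-column construction above can also be written in one step by passing a dictionary that maps column names to values. This is a minimal sketch: the name gdf_from_dict is illustrative, and the pandas fallback is an assumption added only so the snippet also runs on a machine without a GPU.

```python
try:
    import cudf
except ImportError:
    # Assumption: without a GPU stack, pandas stands in; it accepts the same call.
    import pandas as cudf

# Keys become column names, values become the column contents.
gdf_from_dict = cudf.DataFrame({
    'index': [0, 1, 2, 3, 4],
    'value': [10, 20, 30, 40, 50],
})

print(gdf_from_dict.shape)  # (number of rows, number of columns)
```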


We can also build the dataframe from a list of rows, with each row given as a tuple.

In [10]:
#the first parameter is the data and the second parameter is the name of the columns
df = cudf.DataFrame([
    (5, 60),
    (6, 70),
    (7, 80),
],
columns = ['index', 'value'])
df

Unnamed: 0,index,value
0,5,60
1,6,70
2,7,80


### 1. Concat DataFrames
Now that we have created two dataframes in different ways, notice that they share the same structure: the number of columns and their names are the same. Let us combine these two dataframes into one.

In [24]:
#1
#Concatenate the dataframes so that df is appended to gdf, and display the result

# Concatenate the DataFrames with ignore_index=True
combined_gdf = cudf.concat([gdf, df], ignore_index=True)

# Display the combined cuDF DataFrame
combined_gdf

Unnamed: 0,index,value
0,0,10
1,1,20
2,2,30
3,3,40
4,4,50
5,5,60
6,6,70
7,7,80
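
The ignore_index=True argument is what produces the continuous 0 to 7 row labels shown above. The sketch below contrasts the two behaviours on toy data; the frame names are illustrative, and the pandas fallback is an assumption added so the snippet runs without a GPU.

```python
try:
    import cudf
except ImportError:
    # Assumption: pandas stands in when cuDF is unavailable; concat has the same arguments.
    import pandas as cudf

top = cudf.DataFrame({'value': [10, 20]})
bottom = cudf.DataFrame({'value': [30]})

# Without ignore_index, each frame keeps its own row labels, so label 0 repeats.
kept = cudf.concat([top, bottom])

# With ignore_index=True, the result is relabelled 0, 1, 2.
renumbered = cudf.concat([top, bottom], ignore_index=True)

print(kept)
print(renumbered)
```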


### 2. Summary Statistics
cuDF dataframes have built-in methods for summarising the data they hold, e.g., sum, count, etc. Let us compute some statistics for our dataframe: find the mean and the standard deviation of the 'value' column.

In [19]:
#2
#Use the in-built mean and standard deviation functions of the dataframe and display their values

# Calculate the mean and standard deviation of the values column
mean_values = combined_gdf['value'].mean()
std_values = combined_gdf['value'].std()

# Display the mean and standard deviation
print("Mean:", mean_values)
print("Standard Deviation:", std_values)

Mean: 45.0
Standard Deviation: 24.49489742783178
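
The printed standard deviation can be checked by hand. The sketch below uses only the Python standard library on the eight values of the combined dataframe and compares the sample (divisor n - 1) and population (divisor n) definitions.

```python
import math
import statistics

values = [10, 20, 30, 40, 50, 60, 70, 80]

mean = statistics.mean(values)
sample_std = statistics.stdev(values)       # divisor n - 1
population_std = statistics.pstdev(values)  # divisor n

# The squared deviations from the mean sum to 4200,
# so the sample variance is 4200 / 7 = 600 and the population variance is 4200 / 8 = 525.
print(mean, sample_std, population_std)
```

The sample figure agrees with the value printed above, which suggests that the std method uses the n - 1 divisor by default, as pandas does.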


### 3. User-Defined Functions on Columns
You can alter the values of a column by applying a user-defined function directly to its values. Let us add 10 to every element in our 'value' column.

In [32]:
#3
#Define a function that returns value + 10

#Refer to Series.map (used below) to see how to apply the function and display results

def add_ten(x):
    return x + 10

combined_gdf['value'] = combined_gdf['value'].map(add_ten)

# Display the updated DataFrame
print(combined_gdf)

   index  value
0      0     20
1      1     30
2      2     40
3      3     50
4      4     60
5      5     70
6      6     80
7      7     90
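
For simple arithmetic like this, a user-defined function is not strictly needed, because arithmetic on a column is applied element-wise. A minimal sketch on toy data follows; the frame name is illustrative, and the pandas fallback is an assumption added so the snippet runs without a GPU.

```python
try:
    import cudf
except ImportError:
    # Assumption: pandas stands in when cuDF is unavailable.
    import pandas as cudf

frame = cudf.DataFrame({'value': [10, 20, 30]})

# Adding a scalar to a column adds it to every element.
frame['value'] = frame['value'] + 10

print(frame)
```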


## SECTION 2: CUDF using Netflix Movie Dataset
Now that we have a basic understanding of how to work with a cuDF DataFrame, let us create one from a real dataset. We will use the dataset from [here](https://www.kaggle.com/shivamb/netflix-shows) to get hands-on with cuDF.

### Reading a CSV file
Import the netflix_titles.csv dataset into a cuDF dataframe.

In [36]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [38]:
from google.colab import files
uploaded = files.upload()

Saving netflix_titles.csv to netflix_titles.csv


In [39]:
ls

[0m[01;34mdrive[0m/  netflix_titles.csv  [01;34msample_data[0m/


In [40]:
gdf = cudf.read_csv('netflix_titles.csv')

### Converting a Pandas DataFrame
Alternatively, you could also read the data using Pandas and convert the dataframe to support cuDF functionalities.

In [41]:
#creates a pandas dataframe
pdf = pd.read_csv('netflix_titles.csv')

#creates cudf dataframe from pandas dataframe
gdf = cudf.DataFrame.from_pandas(pdf)

#display dataframe
gdf

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


Let us now delve into some questions on the dataset itself!

### 1. Dropping columns
This dataset has many missing values, primarily in the director and cast columns. Therefore, we will drop these two columns from our dataframe.

In [None]:
#1
#Display gdf after dropping to verify that the columns have been dropped

# Drop the 'director' and 'cast' columns
gdf.drop(columns=['director', 'cast'], inplace=True)


In [46]:
gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


### 2. Missing values
The dataset needs to be cleaned first. Several records contain NA values that add no value, so we can choose to drop them. Create a clean dataframe with no NA values.

In [48]:
#2
#Display clean_gdf after dropping to verify that the NA values have been dropped

# Drop rows containing NA values
clean_gdf = gdf.dropna()


In [49]:
clean_gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
4,s5,TV Show,Kota Factory,India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
7,s8,Movie,Sankofa,"United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
...,...,...,...,...,...,...,...,...,...,...
8801,s8802,Movie,Zinzana,"United Arab Emirates, Jordan","March 9, 2016",2015,TV-MA,96 min,"Dramas, International Movies, Thrillers",Recovering alcoholic Talal wakes up inside a s...
8802,s8803,Movie,Zodiac,United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8804,s8805,Movie,Zombieland,United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [52]:
clean_gdf = clean_gdf.reset_index(drop=True)
clean_gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s5,TV Show,Kota Factory,India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
3,s8,Movie,Sankofa,"United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
4,s9,TV Show,The Great British Baking Show,United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
...,...,...,...,...,...,...,...,...,...,...
7956,s8802,Movie,Zinzana,"United Arab Emirates, Jordan","March 9, 2016",2015,TV-MA,96 min,"Dramas, International Movies, Thrillers",Recovering alcoholic Talal wakes up inside a s...
7957,s8803,Movie,Zodiac,United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
7958,s8805,Movie,Zombieland,United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
7959,s8806,Movie,Zoom,United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
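
Before dropping rows, it can help to see how many values are missing in each column, and to drop rows only when specific columns are missing. The sketch below uses toy data with illustrative column values; the pandas fallback is an assumption added so the snippet runs without a GPU.

```python
try:
    import cudf
except ImportError:
    # Assumption: pandas stands in when cuDF is unavailable.
    import pandas as cudf

frame = cudf.DataFrame({
    'title': ['A', 'B', 'C'],
    'country': ['India', None, 'Japan'],
    'rating': ['PG', 'R', None],
})

# Number of missing values in each column.
missing_per_column = frame.isna().sum()

# Drop a row only if 'country' is missing.
only_country_required = frame.dropna(subset=['country'])

# Drop a row if any column is missing (the behaviour used above).
fully_clean = frame.dropna()

print(missing_per_column)
print(len(only_country_required), len(fully_clean))
```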


### 3. Querying DataFrame
Find the shows that were released in the year 2011.

In [53]:
#3
#Display all records released in 2011 using a query

query_result = clean_gdf.query('release_year == 2011')

print(query_result)

     show_id     type                                     title  \
33       s57    Movie  Naruto Shippuden the Movie: Blood Prison   
91      s144    Movie                             Green Lantern   
146     s211    Movie                                Ragini MMS   
150     s217    Movie                          Shor In the City   
151     s218    Movie                         The Dirty Picture   
...      ...      ...                                       ...   
7824   s8664    Movie                            Unruly Friends   
7855   s8697    Movie                                 War Horse   
7894   s8737  TV Show                             Who's the One   
7928   s8772    Movie                          Yaara O Dildaara   
7946   s8792    Movie                               Young Adult   

                   country          date_added  release_year rating  duration  \
33                   Japan  September 15, 2021          2011  TV-14   102 min   
91           United States   Sept

In [54]:
num_records_query = len(query_result)

print("Number of records in the query result:", num_records_query)

Number of records in the query result: 179
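
The same filter can be expressed with a boolean mask instead of a query string. The sketch below shows both forms on toy data; the frame and its values are illustrative, and the pandas fallback is an assumption added so the snippet runs without a GPU.

```python
try:
    import cudf
except ImportError:
    # Assumption: pandas stands in when cuDF is unavailable.
    import pandas as cudf

frame = cudf.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'release_year': [2011, 2015, 2011, 2020],
})

# Filter with a query string, as in the cell above.
by_query = frame.query('release_year == 2011')

# Filter with a boolean mask: the comparison yields one True/False per row.
by_mask = frame[frame['release_year'] == 2011]

print(len(by_query), len(by_mask))
```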


### 4. Unique values
Find the number of different types of ratings, e.g., R, PG, etc.

In [55]:
#4
#Print the number of ratings

rating_counts = clean_gdf['rating'].value_counts()

print("Counts of different ratings:")
print(rating_counts)

Counts of different ratings:
rating
TV-MA       2929
TV-14       1927
R            788
TV-PG        771
PG-13        482
PG           281
TV-Y7        235
TV-Y         227
TV-G         190
NR            79
G             41
TV-Y7-FV       5
UR             3
NC-17          3
Name: count, dtype: int64
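
The output above lists one row per rating, so the number of distinct ratings is the number of rows (14 here). If only that number is needed, it can be obtained directly. A minimal sketch on toy data, with the pandas fallback as an assumption so the snippet runs without a GPU:

```python
try:
    import cudf
except ImportError:
    # Assumption: pandas stands in when cuDF is unavailable.
    import pandas as cudf

ratings = cudf.Series(['R', 'PG', 'R', 'TV-MA', 'PG'])

# nunique returns the number of distinct values.
print(ratings.nunique())

# Equivalent here: the length of the value_counts result.
print(len(ratings.value_counts()))
```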


### 5. Sort values
Sort the dataframe according to the year the record was released (latest first).

In [56]:
#5
#Refer to sort_values function, which takes the target column name and the sorting mode

sorted_clean_gdf = clean_gdf.sort_values(by='release_year', ascending=False)

sorted_clean_gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
1,s2,TV Show,Blood & Water,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s5,TV Show,Kota Factory,India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
4,s9,TV Show,The Great British Baking Show,United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
5,s10,Movie,The Starling,United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
6,s13,Movie,Je Suis Karl,"Germany, Czech Republic","September 23, 2021",2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...
...,...,...,...,...,...,...,...,...,...,...
7821,s8661,Movie,Undercover: How to Operate Behind Enemy Lines,United States,"March 31, 2017",1943,TV-PG,61 min,"Classic Movies, Documentaries",This World War II-era training film dramatizes...
7897,s8740,Movie,Why We Fight: The Battle of Russia,United States,"March 31, 2017",1943,TV-PG,82 min,Documentaries,This installment of Frank Capra's acclaimed do...
7920,s8764,Movie,WWII: Report from the Aleutians,United States,"March 31, 2017",1943,TV-PG,45 min,Documentaries,Filmmaker John Huston narrates this Oscar-nomi...
6986,s7791,Movie,Prelude to War,United States,"March 31, 2017",1942,TV-14,52 min,"Classic Movies, Documentaries",Frank Capra's documentary chronicles the rise ...


### 6a. Count values
Find the number of movies and shows that are available using <I>value_counts</I>.

In [57]:
#6a
#Refer to value_counts()

type_counts = clean_gdf['type'].value_counts()

# Display the counts of movies and shows
print("Counts of movies and shows:")
print(type_counts)

Counts of movies and shows:
type
Movie      5687
TV Show    2274
Name: count, dtype: int64


### 6b. GroupBy
Alternatively, you can also find the number of movies and shows using a GroupBy.

In [58]:
#6b
#Refer to GroupBy and size

type_counts_grouped = clean_gdf.groupby('type').size()

# Display the counts of movies and shows
print("Counts of movies and shows:")
print(type_counts_grouped)

Counts of movies and shows:
type
TV Show    2274
Movie      5687
dtype: int64
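
The two outputs above hold the same counts but list the rows in a different order. Sorting both results by their index makes them directly comparable. A minimal sketch on toy data, with the pandas fallback as an assumption so the snippet runs without a GPU:

```python
try:
    import cudf
except ImportError:
    # Assumption: pandas stands in when cuDF is unavailable.
    import pandas as cudf

frame = cudf.DataFrame({'type': ['Movie', 'TV Show', 'Movie']})

# Sort both results by label so the rows line up regardless of the default ordering.
by_value_counts = frame['type'].value_counts().sort_index()
by_groupby = frame.groupby('type').size().sort_index()

print(by_value_counts)
print(by_groupby)
```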


### 7. Bonus: One-Hot Encoding
Now that you have looked at a few functionalities provided by RAPIDS, let us go a step further. Many machine learning applications use one-hot encoding to convert categorical, non-numerical data into numerical values that a model can use. Each value is represented as a vector of 0s and 1s, with a single 1 marking the category it belongs to. Here, let us encode the type of each title, i.e., a movie or a TV show, as a one-hot encoding.

<B>Note:</B> cuDF contains a one-hot-encoding function that you can use.


In [102]:
#convert column of dataframe from series to an array
record_title = gdf['title'].to_array()
record_type = gdf['type'].to_array()

#create pandas df with corresponding attributes
movie_df = pd.DataFrame({'record_title': record_title, 'record_type': record_type})
movie_df.record_type = movie_df.record_type.astype('category')

#7
#step 1: Convert the dataframe to cudf
movie_cudf = cudf.DataFrame.from_pandas(movie_df)

#step 2: Create a column called record_codes with the numerically encoded values
movie_cudf['record_codes'] = movie_cudf['record_type'].cat.codes

#step 3: Identify the unique codes
unique_codes = movie_cudf['record_codes'].unique()

#step 4: One-hot encode the record_type column with get_dummies
encoded_df = cudf.get_dummies(movie_cudf['record_type'])

encoded_df

AttributeError: 'Series' object has no attribute 'to_array'

The cell above fails because the cuDF Series in this environment has no to_array method. The next cell converts the columns with to_pandas() instead.

In [100]:

record_title = gdf['title'].to_pandas()
record_type = gdf['type'].to_pandas()

# Create pandas DataFrame with corresponding attributes
movie_df = pd.DataFrame({'record_title': record_title, 'record_type': record_type})
movie_df['record_type'] = movie_df['record_type'].astype('category')

# Convert pandas DataFrame to cudf DataFrame
movie_cudf = cudf.DataFrame.from_pandas(movie_df)

# Create a column called 'record_codes' with numerically encoded values
movie_cudf['record_codes'] = movie_cudf['record_type'].cat.codes

# Identify the unique codes
unique_codes = movie_cudf['record_codes'].unique()

# Create an encoded DataFrame representing the type of each record
encoded_df = cudf.get_dummies(movie_cudf['record_type'])

# Display the encoded DataFrame
print(encoded_df)

      Movie  TV Show
0      True    False
1     False     True
2     False     True
3     False     True
4     False     True
...     ...      ...
8802   True    False
8803  False     True
8804   True    False
8805   True    False
8806   True    False

[8807 rows x 2 columns]
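
To make the two encodings concrete, the sketch below reproduces them conceptually in plain Python on four illustrative records. It is not how cuDF implements cat.codes or get_dummies; it only shows what the resulting values mean.

```python
record_types = ['Movie', 'TV Show', 'TV Show', 'Movie']

# The distinct categories, in sorted order.
categories = sorted(set(record_types))

# Numeric codes: the position of each record's category in the category list.
codes = [categories.index(t) for t in record_types]

# One-hot rows: one True/False flag per category, with exactly one True per record.
one_hot = [[t == c for c in categories] for t in record_types]

print(categories)
print(codes)
for row in one_hot:
    print(row)
```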
