# Introduction to RAPIDS

The RAPIDS data science framework is a GPU-accelerated collection of libraries for executing end-to-end data science pipelines entirely on the GPU. It is designed to make effective use of the computational capabilities of GPUs through optimized NVIDIA CUDA® primitives and high-bandwidth GPU memory. The primary objective of RAPIDS is to accelerate the individual parts of a typical data science workflow, and thereby the complete end-to-end workflow in Data Preparation and Machine Learning.

Read through [this](https://medium.com/future-vision/what-is-rapids-ai-7e552d80a1d2) Medium article to understand how RAPIDS works.
<br><br>
If you have worked with pandas and NumPy before, most of this tutorial will seem very familiar to you. If you haven't, do not worry. This is a great place to start!

## SETUP

**Note that pandas is a data analysis and manipulation tool built on top of the Python programming language to perform various tasks (e.g., loading, joining, aggregating, and filtering data). cuDF is a GPU DataFrame library that provides similar functionality with massive acceleration.**

In [None]:
import cudf
import pandas as pd

Before we dive in, please make sure to check out the official documentation [here](https://docs.rapids.ai/api) to get an overall idea. Additionally, refer to the [cheatsheet](https://rapids.ai/assets/files/cheatsheet.pdf) for a crisp and clear representation of the functionalities provided by RAPIDS.

## SECTION 1: CUDF BASICS
### cuDF DataFrame
First, we will look at how to create dataframes in cuDF. You can build a dataframe in multiple ways, as shown in the official documentation. Let us first initialize an empty dataframe object.

In [None]:
gdf = cudf.DataFrame()

Now that we have a `cudf.DataFrame` object, we will fill it with values. Let us add values by defining them column by column.

In [None]:
#creates a column named 'index' with the values 0, 1, 2, 3, 4
gdf['index'] = [0, 1, 2, 3, 4]

#creates a column named 'value' with the values 10, 20, 30, 40, 50
gdf['value'] = [10, 20, 30, 40, 50]

#displays the current cudf dataframe
gdf

Unnamed: 0,index,value
0,0,10
1,1,20
2,2,30
3,3,40
4,4,50
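
As one of the other construction routes, the same dataframe can be built in a single call from a dictionary that maps column names to values. The following is a minimal sketch, not part of the original exercise; the alias `xdf` and the pandas fallback are assumptions added so that the snippet also runs on a machine without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

# Build the two-column dataframe in one call from a dict of column -> values
gdf_from_dict = xdf.DataFrame({
    'index': [0, 1, 2, 3, 4],
    'value': [10, 20, 30, 40, 50],
})

# Inspect the number of rows and columns
print(gdf_from_dict.shape)
```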


We can also build the dataframe from a list of rows, with each row given as a tuple.

In [None]:
#the first parameter is the data and the second parameter is the name of the columns
df = cudf.DataFrame([
    (5, 60),
    (6, 70),
    (7, 80),
],
columns = ['index', 'value'])
df

Unnamed: 0,index,value
0,5,60
1,6,70
2,7,80


### 1. Concat DataFrames
Now that we have created two dataframes in different ways, notice that they are similarly structured: the number of columns and their names are the same. Let us combine these two dataframes into one.

In [None]:
#1
#Concat the dataframes such that df is appended to gdf and display gdf
gdf = cudf.concat([gdf, df])
gdf

Unnamed: 0,index,value
0,0,10
1,1,20
2,2,30
3,3,40
4,4,50
0,5,60
1,6,70
2,7,80
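
Notice that the row labels in the output repeat (0, 1, 2 appear twice), because `concat` keeps each dataframe's own index by default. If unique row labels are wanted, passing `ignore_index=True` renumbers the rows. A minimal sketch, assuming the same toy values as above; the alias `xdf` and the pandas fallback are assumptions for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

first = xdf.DataFrame({'index': [0, 1, 2, 3, 4], 'value': [10, 20, 30, 40, 50]})
second = xdf.DataFrame({'index': [5, 6, 7], 'value': [60, 70, 80]})

# ignore_index=True discards the original row labels and renumbers from 0
combined = xdf.concat([first, second], ignore_index=True)

# The largest row label is now len(combined) - 1
print(combined.index.max())
```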


### 2. Summary Statistics
cuDF dataframes have easily callable built-in methods to summarise the data in your dataframe, e.g., sum, count, etc. Let us compute some statistics of our dataframe: find the mean and the standard deviation of the 'value' column.

In [None]:
#2
#Use the in-built mean and standard deviation functions of the dataframe and display their values
print(gdf["value"].mean())
print(gdf["value"].std())

45.0
24.49489742783178
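
The standard deviation printed above is the sample standard deviation (it divides by n - 1), which is the default in both pandas and cuDF. The sketch below contrasts it with the population version via the `ddof` argument; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

# Same eight values as the 'value' column after the concat step
values = xdf.Series([10, 20, 30, 40, 50, 60, 70, 80])

sample_std = values.std()            # default ddof=1: divides by n - 1
population_std = values.std(ddof=0)  # ddof=0: divides by n

print(values.mean(), sample_std, population_std)
```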


### 3. User-Defined Functions on Columns
You can alter the values of a column by applying a user-defined function directly to its values. Let us add 10 to all the elements in our 'value' column.

In [None]:
#3
#Define a function that returns value + 10
def add_ten(x):
    return x + 10

#Refer to Series.apply to see how to apply the function and display results
gdf['value'] = gdf['value'].apply(add_ten)
gdf

Unnamed: 0,index,value
0,0,20
1,1,30
2,2,40
3,3,50
4,4,60
0,5,70
1,6,80
2,7,90
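
For simple arithmetic like this, a user-defined function is not strictly required: the same result can be obtained with a vectorized expression on the column. The sketch below compares both routes on toy values; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

values = xdf.Series([10, 20, 30, 40, 50])

def add_ten(x):
    # Called element-wise by Series.apply
    return x + 10

applied = values.apply(add_ten)  # user-defined function route
vectorized = values + 10         # plain column arithmetic

# Check that both routes produce identical columns
print(bool((applied == vectorized).all()))
```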


## SECTION 2: CUDF using Netflix Movie Dataset
Now that we have a basic understanding of how to work with a cuDF DataFrame, let us create one from a dataset. We will be using the dataset from [here](https://www.kaggle.com/shivamb/netflix-shows) to get hands-on with cuDF.

### Reading a CSV file
Import the netflix_titles.csv dataset into a cuDF dataframe.

In [None]:
gdf = cudf.read_csv('netflix_titles.csv')
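
If only some columns are needed, `read_csv` can load just those with the `usecols` argument. The sketch below uses a tiny made-up CSV held in memory rather than the real file, so its rows are illustrative only; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
from io import StringIO

try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

# Hypothetical two-row CSV standing in for netflix_titles.csv
csv_text = (
    "show_id,type,title,release_year\n"
    "s1,Movie,A,2020\n"
    "s2,TV Show,B,2021\n"
)

# Read only the columns we care about
subset = xdf.read_csv(StringIO(csv_text), usecols=['title', 'release_year'])
print(list(subset.columns))
```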

### Converting a Pandas DataFrame
Alternatively, you can read the data using pandas and then convert the pandas dataframe into a cuDF dataframe.

In [None]:
#creates a pandas dataframe
pdf = pd.read_csv('netflix_titles.csv')

#creates cudf dataframe from pandas dataframe
gdf = cudf.DataFrame.from_pandas(pdf)

#display dataframe
gdf

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


Let us now delve into some questions on the dataset itself!

### 1. Dropping columns
This dataset has a lot of missing values, primarily in the columns director and cast. Therefore, we will drop these two columns from our dataframe.

In [None]:
#1
#Display gdf after dropping to verify that the columns have been dropped
gdf = gdf.drop(labels = ['director', 'cast'], axis = 1)
gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
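
The `labels=..., axis=1` form used above has an equivalent and slightly more readable spelling, `columns=...`. A minimal sketch on a small hand-made frame (the rows are illustrative, not the full dataset); the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

# Small illustrative frame with the two columns we want to remove
shows = xdf.DataFrame({
    'title': ['Zodiac', 'Zoom'],
    'director': ['David Fincher', 'Peter Hewitt'],
    'cast': ['Mark Ruffalo', 'Tim Allen'],
    'release_year': [2007, 2006],
})

# columns=[...] is equivalent to labels=[...], axis=1
trimmed = shows.drop(columns=['director', 'cast'])
print(list(trimmed.columns))
```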


### 2. Missing values
The dataset needs to be cleaned first. There are several NA values in the data that add no value, so we can choose to drop these records. Create a clean dataframe with no NA values.

In [None]:
#2
#Display gdf after dropping to verify that the NA values have been dropped
gdf = gdf.dropna()
gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
4,s5,TV Show,Kota Factory,India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
7,s8,Movie,Sankofa,"United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
...,...,...,...,...,...,...,...,...,...,...
8801,s8802,Movie,Zinzana,"United Arab Emirates, Jordan","March 9, 2016",2015,TV-MA,96 min,"Dramas, International Movies, Thrillers",Recovering alcoholic Talal wakes up inside a s...
8802,s8803,Movie,Zodiac,United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8804,s8805,Movie,Zombieland,United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
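
Before dropping records, it can help to see how many values are missing in each column, and to drop rows based only on the columns that matter. The sketch below shows both on a made-up three-row frame; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

# Made-up rows with one missing country and one missing rating
shows = xdf.DataFrame({
    'title': ['A', 'B', 'C'],
    'country': ['India', None, 'Japan'],
    'rating': ['PG', 'R', None],
})

# Number of missing values per column
missing_per_column = shows.isna().sum()

# Drop rows with a missing value in any column
no_na_anywhere = shows.dropna()

# Drop rows only when 'country' is missing
has_country = shows.dropna(subset=['country'])

print(len(no_na_anywhere), len(has_country))
```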


### 3. Querying DataFrame
Find the shows that were released in the year 2011.

In [None]:
#3
#Display all records released in 2011 using a query
gdf.query("release_year == 2011")

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
56,s57,Movie,Naruto Shippuden the Movie: Blood Prison,Japan,"September 15, 2021",2011,TV-14,102 min,"Action & Adventure, Anime Features, Internatio...",Mistakenly accused of an attack on the Fourth ...
143,s144,Movie,Green Lantern,United States,"September 1, 2021",2011,PG-13,114 min,"Action & Adventure, Sci-Fi & Fantasy",Test pilot Hal Jordan harnesses glowing new po...
210,s211,Movie,Ragini MMS,India,"August 27, 2021",2011,TV-MA,93 min,"Horror Movies, International Movies",A couple out to have a sensuous weekend at a h...
216,s217,Movie,Shor In the City,India,"August 27, 2021",2011,TV-14,106 min,"Comedies, Dramas, Independent Movies",When three small-time Mumbai crooks steal a ba...
217,s218,Movie,The Dirty Picture,India,"August 27, 2021",2011,TV-14,145 min,"Comedies, Dramas, International Movies",After running away from home in search of movi...
...,...,...,...,...,...,...,...,...,...,...
8663,s8664,Movie,Unruly Friends,Egypt,"June 20, 2019",2011,TV-14,83 min,"International Movies, Thrillers",A young woman discovers that familial and psyc...
8696,s8697,Movie,War Horse,"United States, India","May 6, 2019",2011,PG-13,147 min,Dramas,"During World War I, the bond between a young E..."
8736,s8737,TV Show,Who's the One,Taiwan,"January 1, 2017",2011,TV-14,1 Season,"International TV Shows, Romantic TV Shows, TV ...",A doctor performs plastic surgery on a fat man...
8771,s8772,Movie,Yaara O Dildaara,India,"November 1, 2017",2011,TV-14,132 min,"Dramas, International Movies, Music & Musicals",The patriarch of a wealthy family with one ind...
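
The same filter can also be written as a boolean mask, which some readers find easier to compose with other conditions. A minimal sketch on made-up rows; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

shows = xdf.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'release_year': [2011, 2009, 2011, 2020],
})

# Filter with a query string
by_query = shows.query("release_year == 2011")

# Filter with a boolean mask built from the column
by_mask = shows[shows['release_year'] == 2011]

print(len(by_query), len(by_mask))
```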


### 4. Unique values
Find the number of different types of ratings, e.g., R, PG, etc.

In [None]:
#4
#Print the number of ratings
gdf['rating'].nunique()

14
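
Note that `nunique` ignores missing values by default; passing `dropna=False` counts them as one additional distinct value. A minimal sketch on made-up ratings; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

# Made-up ratings column with one missing entry
ratings = xdf.Series(['R', 'PG', 'R', None])

distinct_without_na = ratings.nunique()          # missing values are ignored
distinct_with_na = ratings.nunique(dropna=False) # missing counts as a value

print(distinct_without_na, distinct_with_na)
```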

### 5. Sort values
Sort the dataframe according to the year the record was released (latest first).

In [None]:
#5
#Refer to sort_values function, which takes the target column name and the sorting mode
gdf = gdf.sort_values('release_year', ascending = False)
gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
1,s2,TV Show,Blood & Water,South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
4,s5,TV Show,Kota Factory,India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
8,s9,TV Show,The Great British Baking Show,United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
12,s13,Movie,Je Suis Karl,"Germany, Czech Republic","September 23, 2021",2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...
...,...,...,...,...,...,...,...,...,...,...
8660,s8661,Movie,Undercover: How to Operate Behind Enemy Lines,United States,"March 31, 2017",1943,TV-PG,61 min,"Classic Movies, Documentaries",This World War II-era training film dramatizes...
8739,s8740,Movie,Why We Fight: The Battle of Russia,United States,"March 31, 2017",1943,TV-PG,82 min,Documentaries,This installment of Frank Capra's acclaimed do...
8763,s8764,Movie,WWII: Report from the Aleutians,United States,"March 31, 2017",1943,TV-PG,45 min,Documentaries,Filmmaker John Huston narrates this Oscar-nomi...
7790,s7791,Movie,Prelude to War,United States,"March 31, 2017",1942,TV-14,52 min,"Classic Movies, Documentaries",Frank Capra's documentary chronicles the rise ...
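
Many records share the same release year, so a second sort key can make the order deterministic. The sketch below sorts by year (latest first) and then by title (alphabetical) on made-up rows; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

shows = xdf.DataFrame({
    'title': ['B', 'A', 'C'],
    'release_year': [2021, 2021, 1943],
})

# One ascending flag per sort key: year descending, title ascending
ordered = shows.sort_values(['release_year', 'title'], ascending=[False, True])
print(ordered)
```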


### 6a. Count values
Find the number of movies and shows that are available using <I>value_counts</I>.

In [None]:
#6a
#Refer to value_counts()
gdf.value_counts(subset = ['type'])

type   
Movie      5687
TV Show    2274
Name: count, dtype: int64

### 6b. GroupBy
Alternatively, you can also find the number of movies and shows using a GroupBy.

In [None]:
#6b
#Refer to GroupBy and size
gdf.groupby(["type"]).size()

type
TV Show    2274
Movie      5687
dtype: int64
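
The two outputs above hold the same counts but list the groups in a different order: `value_counts` sorts by count (largest first), while the order of `groupby(...).size()` should not be relied upon. Looking counts up by label avoids depending on the order. A minimal sketch on made-up rows; the alias `xdf` and the pandas fallback are assumptions added for portability.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

shows = xdf.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'type': ['Movie', 'TV Show', 'Movie', 'Movie'],
})

by_value_counts = shows['type'].value_counts()
by_groupby = shows.groupby('type').size()

# Look up counts by label instead of by position
print(by_value_counts['Movie'], by_groupby['Movie'])
```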

### 7. Bonus: One-Hot Encoding
Now that you have looked at a few functionalities provided by RAPIDS, let us go a step further. Many Machine Learning applications use one-hot encoding to convert categorical, non-numerical data into numerical values that a model can use. These encodings represent each category as a vector of 0s and 1s, with a single 1 marking the category a record belongs to. Here, let us try to encode the type of the title, i.e., a movie or a TV show, into one-hot encodings.

<B>Note: cuDF contains a one-hot-encoding function that you can use.</B>


In [None]:
#convert columns of the dataframe from cuDF series to host (pandas) series
#(Series.to_array() raises AttributeError in recent cuDF versions, so to_pandas() is used instead)
record_title = gdf['title'].to_pandas()
record_type = gdf['type'].to_pandas()

#create pandas df with corresponding attributes
movie_df = pd.DataFrame({'record_title': record_title, 'record_type': record_type})
movie_df.record_type = movie_df.record_type.astype('category')

#7
#step 1: Convert the dataframe to cudf
movie_gdf = cudf.from_pandas(movie_df)

#step 2: Create a column called record_codes with the numerically encoded values
movie_gdf['record_codes'] = movie_gdf['record_type'].cat.codes

#step 3: Identify the unique codes
codes = movie_gdf['record_codes'].unique()
print(codes)

#step 4: Create an encoded dataframe representing the type of each record and display it
encoded_gdf = cudf.get_dummies(movie_gdf, columns=['record_type'])
encoded_gdf
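
To see what one-hot encoding produces in isolation, the sketch below applies `get_dummies` to a made-up three-row frame. Each category of the 'type' column becomes its own indicator column, named with the original column name as a prefix. The alias `xdf` and the pandas fallback are assumptions added for portability, and the indicator dtype (boolean or integer) can differ between libraries and versions.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except Exception:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries

shows = xdf.DataFrame({
    'title': ['A', 'B', 'C'],
    'type': ['Movie', 'TV Show', 'Movie'],
})

# Replace the 'type' column with one indicator column per category
encoded = xdf.get_dummies(shows, columns=['type'])
print(list(encoded.columns))
```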