# Welcome to Google Colab

Colab is essentially Google's way of hosting a [Jupyter notebook](https://jupyter.org/), a very popular tool among data scientists.

It allows us to write code, documentation, and output visuals all in one place.

To be able to edit the code in this workshop, please make a copy for yourself:

`File > Save a copy in Drive`

This should open a new tab with your own copy of this notebook. It can take a minute to load.

Colab also gives you some options for running complicated computations such as training deep learning models. To access those options:

`Runtime > Change runtime type`, then select `GPU`, `TPU`, or `None`.

We don't need to change anything for this workshop, but it's a great resource if you start learning deep learning and don't have a powerful GPU at home.
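
If you want to confirm which runtime you ended up with, a rough check like the one below can help. It is only a sketch: `gpu_available` is a hypothetical helper name, and it assumes that a GPU runtime exposes the `nvidia-smi` command-line tool.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools found on this machine
    # "nvidia-smi -L" prints one line per attached GPU
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU attached:", gpu_available())
```

On a runtime without a GPU this should simply print `False` rather than raise an error.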

This is a text cell. 

You can add a new text cell by clicking `+ Text` above. 

It does not highlight wrong spelling. I apologize for any typos!

# Before we get started

Welcome to the Intro to Rapids crash course for the Galvanize Datathon. 

### What is Rapids AI? 

The RAPIDS suite of software libraries, built on CUDA-X AI, gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar dataframe API that integrates with a variety of machine learning algorithms for end-to-end pipeline accelerations without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger datasets.

In short: the RAPIDS suite allows data scientists to do their entire workflow in a GPU environment, and facilitates distributed computing to train models on big data more quickly and effectively.

Read more about Rapids AI and its capabilities [here](https://rapids.ai/about.html). 

More Resources: Look through these on your own after the lecture. 

#### General Rapids
- [Rapids docs](https://docs.rapids.ai/)
- [Official Rapids getting started notebooks](https://github.com/rapidsai-community/notebooks-contrib/tree/main/getting_started_materials/intro_tutorials_and_guides)
#### cuML demos and guides
- [Example cuML notebook: Training and Evaluating Machine Learning Models in cuML](https://github.com/rapidsai/cuml/blob/main/docs/source/estimator_intro.ipynb)
- [More cuML demos](https://github.com/rapidsai/cuml/tree/main/notebooks)
- [Community tutorials and guides](https://github.com/rapidsai-community/notebooks-contrib/tree/main/community_tutorials_and_guides)
#### Time series analytics and machine learning with Rapids
- [Medium article on time series with rapids](https://medium.com/rapids-ai/arima-forecast-large-time-series-datasets-with-rapids-cuml-18428a00d02e)
- [Free book on forecasting](https://otexts.com/fpp2/seasonal-arima.html)

[Linear regression: predicting bikeshare rentals](https://github.com/rapidsai-community/notebooks-contrib/tree/main/the_archive/archived_rapids_blog_notebooks/regression). This example could be useful if adapted to time series.


# cuDF and cuML Examples #

Now you can run code! 

What follows are basic examples where all processing takes place on the GPU.

## Pandas 
Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data - as the name suggests - comes in a structured form, often represented by a table or CSV. We'll focus the majority of these tutorials on working with structured data.

There exist many tools in the Python ecosystem for working with structured, tabular data but few are as widely used as Pandas. Pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing and many more.

For more information on Pandas, check out the excellent documentation: http://pandas.pydata.org/pandas-docs/stable/

Below we show how to create a Pandas DataFrame, an internal object for representing tabular data.

In [1]:
import pandas as pd

# URL of the tips dataset (a CSV file hosted on GitHub)
url = "https://github.com/plotly/datasets/raw/master/tips.csv"
# read_csv() accepts a URL directly, so no separate download step is needed
df = pd.read_csv(url)
# tip as a percentage of the total bill
df['tip_percentage'] = df['tip'] / df['total_bill'] * 100

df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,16.054159
2,21.01,3.50,Male,No,Sun,Dinner,3,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,14.680765
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.392697
240,27.18,2.00,Female,Yes,Sat,Dinner,2,7.358352
241,22.67,2.00,Male,Yes,Sat,Dinner,2,8.822232
242,17.82,1.75,Male,No,Sat,Dinner,2,9.820426


# [cuDF](https://github.com/rapidsai/cudf)

Pandas is fantastic for working with small datasets that fit into your system's memory. However, datasets are growing larger and data scientists are working with increasingly complex workloads - the need for accelerated compute arises.

cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing Pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide.

Below, we show how to create a cuDF DataFrame.

In [4]:
import cudf
# Note: cudf can take a moment to import on Colab; your cell is running normally

# cuDF can read a CSV just like pandas (url was defined in the pandas cell above)
tips_df = cudf.read_csv(url)
# feature engineering works the same way as in pandas
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100

# let's display the average tip percentage by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

size
6    15.622920
1    21.729202
4    14.594901
3    15.215685
2    16.571919
5    14.149549
Name: tip_percentage, dtype: float64


In [5]:
# It looks just like a pandas DataFrame! how convenient!
tips_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,16.054159
2,21.01,3.50,Male,No,Sun,Dinner,3,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,14.680765
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.392697
240,27.18,2.00,Female,Yes,Sat,Dinner,2,7.358352
241,22.67,2.00,Male,Yes,Sat,Dinner,2,8.822232
242,17.82,1.75,Male,No,Sat,Dinner,2,9.820426


### Making a series with cuDF

There are two main data structures in cuDF: a Series object and a DataFrame object. Multiple Series objects are used as columns for a DataFrame. We'll first explore the Series class and build upon that foundation to later introduce how to work with objects of type DataFrame.

We can create a Series object using the cudf.Series class.

In [6]:
# A cuDF Series stores its values with a single data type (dtype).
# The most common dtypes are int8, int32, int64, float32, and float64.

column = cudf.Series(range(0,20))

# cuDF mirrors many pandas Series methods, so we can use .describe() to get summary statistics
column.describe()

count    20.00000
mean      9.50000
std       5.91608
min       0.00000
25%       4.75000
50%       9.50000
75%      14.25000
max      19.00000
dtype: float64

In [7]:
# Here's another example: we can create a new Series with a different
# index by using the set_index method.
# (pandas Series has no set_index method, so this call is specific to cuDF)

new_column = column.set_index(range(20,40)) 
print(new_column)

20     0
21     1
22     2
23     3
24     4
25     5
26     6
27     7
28     8
29     9
30    10
31    11
32    12
33    13
34    14
35    15
36    16
37    17
38    18
39    19
dtype: int64
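
As a small sketch of the dtypes listed in the comment above, we can request a dtype when building a Series, and we can also supply the index directly in the constructor. The alias `xdf` is our own convention, not part of RAPIDS: it points at cuDF when it is installed and falls back to pandas otherwise, on the assumption that both accept these constructor arguments.

```python
import numpy as np

try:
    import cudf as xdf  # GPU Series when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback; same constructor arguments

# request a specific dtype instead of the default int64 / float64
small_ints = xdf.Series([1, 2, 3], dtype='int8')
floats = xdf.Series([1, 2, 3], dtype='float32')
print(small_ints.dtype, floats.dtype)

# pass index= to the constructor to label the values 20, 21, 22
reindexed = xdf.Series(np.arange(3), index=[20, 21, 22])
print(reindexed)
```

Smaller dtypes such as int8 use less memory per value, which can matter when a dataset has to fit in GPU memory.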


### Creating a cuDF DataFrame using lists
There are several ways to create a cuDF DataFrame. The easiest of these is to instantiate an empty cuDF DataFrame and then use Python list objects or NumPy arrays to create columns. Below, we create an empty cuDF DataFrame.

In [8]:
df = cudf.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [9]:
import numpy as np


# here we create two columns named "key" and "value" and add them to the dataframe
df['key'] = [0, 1, 2, 3, 4]
df['value'] = np.arange(10, 15)
print(df)

   key  value
0    0     10
1    1     11
2    2     12
3    3     13
4    4     14


### Creating a cuDF DataFrame using a list of tuples or a dictionary

Another way we can create a cuDF DataFrame is by providing a mapping of column names to column values, either via a list of tuples or by using a dictionary. The example below uses a dictionary: each key is the name of a column - for example, id or timestamp - and each value is a list of Python objects or a NumPy array. Note that we don't have to constrain the data stored in our cuDF DataFrames to common data types like integers or floats - we can use more exotic data types such as datetimes or strings. We'll investigate how such data types behave on the GPU a bit later.

In [10]:
from datetime import datetime, timedelta

# some fake data
ids = np.arange(5)
t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
timestamps = [(t0+ timedelta(seconds=x)) for x in range(5)]
timestamps_np = np.array(timestamps, dtype='datetime64')

In [11]:
# creating it from the dictionary format
df = cudf.DataFrame({'id': ids, 'timestamp': timestamps_np})
print(df)

   id           timestamp
0   0 2018-10-07 12:00:00
1   1 2018-10-07 12:00:01
2   2 2018-10-07 12:00:02
3   3 2018-10-07 12:00:03
4   4 2018-10-07 12:00:04


### Creating a cuDF DataFrame from a Pandas DataFrame
Pandas DataFrames are first-class citizens within cuDF - this means that we can create a cuDF DataFrame from a Pandas DataFrame and vice versa.

In [12]:
pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})

# We can use the cudf.from_pandas or cudf.DataFrame.from_pandas functions 
# to create a cuDF DataFrame from a Pandas DataFrame.
df = cudf.from_pandas(pandas_df)
type(df)

cudf.core.dataframe.DataFrame
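
The "vice versa" direction can be sketched with the `to_pandas` method, which copies a cuDF DataFrame back to host memory. So that the cell still runs on a machine without RAPIDS, this sketch assumes a fallback of our own: when `cudf` cannot be imported, it just copies the Pandas DataFrame.

```python
import pandas as pd

try:
    import cudf
except ImportError:
    cudf = None  # no GPU stack available; use the CPU-only fallback below

pandas_df = pd.DataFrame({'a': [0, 1, 2], 'b': [0.0, None, 0.2]})

if cudf is not None:
    gpu_df = cudf.from_pandas(pandas_df)  # host (CPU) -> device (GPU)
    round_trip = gpu_df.to_pandas()       # device (GPU) -> host (CPU)
else:
    round_trip = pandas_df.copy()

print(type(round_trip))
print(round_trip)
```

Each conversion copies the data between CPU and GPU memory, so it is usually best to convert once and keep the work on the GPU.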

### Data types
We can inspect the data types of the columns of a cuDF DataFrame using the dtypes attribute.

We can also modify the data types of the columns of a cuDF DataFrame by passing in a cuDF Series with a modified data type. Be warned that silent errors may be introduced from nonsensical type conversions - for example, changing a float to an integer or vice versa.

In [13]:
df = cudf.DataFrame()

df['key'] = [0, 1, 2, 3, 4]
df['value'] = np.arange(10, 15)
df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   key     5 non-null      int64
 1   value   5 non-null      int64
dtypes: int64(2)
memory usage: 80.0 bytes


In [14]:
# we can cast the column value types using .astype()
df['key'] = df['key'].astype(np.float32)
df['value'] = df['value'].astype(np.int32)
print(df.dtypes)

key      float32
value      int32
dtype: object
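
To make the warning about silent errors concrete, here is a minimal sketch of a float-to-integer cast. The `xdf` alias is our own convention (cuDF if available, otherwise pandas), and the sketch assumes both libraries drop the fractional part instead of rounding.

```python
try:
    import cudf as xdf  # GPU Series when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback; astype behaves the same way here

floats = xdf.Series([1.9, -1.9, 2.5])
# no warning is raised: the fractional part is dropped, not rounded
ints = floats.astype('int32')
print(ints)
```

If rounding is what you want, call `.round()` on the float Series before casting.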


## cuDF API

The cuDF API is pleasantly simple and mirrors the Pandas API as closely as possible. In this section, we will explore the cuDF API and show how to perform common data manipulation operations.

### Selecting Rows or Columns
We can select rows from a cuDF DataFrame using slicing syntax.

In [18]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32)})

In [19]:
# slicing example
print(df[0:5])


     a      b
0  0.0  100.0
1  1.0   99.0
2  2.0   98.0
3  3.0   97.0
4  4.0   96.0


In [20]:
# selecting a column
print(df['a'])


0      0.0
1      1.0
2      2.0
3      3.0
4      4.0
      ... 
95    95.0
96    96.0
97    97.0
98    98.0
99    99.0
Name: a, Length: 100, dtype: float32


In [21]:
# label-based selection with .loc, as in pandas (the end label 5 is included)
print(df.loc[0:5, ['a']])


     a
0  0.0
1  1.0
2  2.0
3  3.0
4  4.0
5  5.0
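
Rows can also be selected by condition rather than by position. The sketch below filters with a boolean mask, a pandas idiom that we assume carries over to cuDF unchanged; `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
import numpy as np

try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same indexing syntax

df = xdf.DataFrame({'a': np.arange(0, 100).astype(np.float32),
                    'b': np.arange(100, 0, -1).astype(np.float32)})

# combine conditions with & (and) or | (or); wrap each one in parentheses
mask = (df['a'] > 95) & (df['b'] < 4)
filtered = df[mask]
print(filtered)
```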


### Defining New Columns
We often want to define new columns from existing columns.

In [22]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

In [23]:
# as we saw before, we can add a column by assigning to df['column_name']
df['d'] = np.arange(200, 300).astype(np.float32)
df

Unnamed: 0,a,b,c,d
0,0.0,100.0,100.0,200.0
1,1.0,99.0,101.0,201.0
2,2.0,98.0,102.0,202.0
3,3.0,97.0,103.0,203.0
4,4.0,96.0,104.0,204.0
...,...,...,...,...
95,95.0,5.0,195.0,295.0
96,96.0,4.0,196.0,296.0
97,97.0,3.0,197.0,297.0
98,98.0,2.0,198.0,298.0
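
New columns can also be computed from existing ones with ordinary arithmetic, which is applied element by element. This is a minimal sketch: the column names `total` and `ratio` are made up for illustration, and `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
import numpy as np

try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same column arithmetic

df = xdf.DataFrame({'a': np.arange(0, 5).astype(np.float32),
                    'b': np.arange(5, 0, -1).astype(np.float32)})

# each new column is derived row by row from columns 'a' and 'b'
df['total'] = df['a'] + df['b']
df['ratio'] = df['a'] / df['b']
print(df)
```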


## cuDF DataFrame methods

### Dropping Columns
Alternatively, we may want to remove columns from our DataFrame. We can do so using the drop method. Note that with inplace=True, as in the example below, this method removes the column in place - meaning that the DataFrame we act on will be modified.

In [24]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

In [25]:
# same as in pandas!
df.drop(['b'], axis=1, inplace = True)
print(df)

       a      c
0    0.0  100.0
1    1.0  101.0
2    2.0  102.0
3    3.0  103.0
4    4.0  104.0
..   ...    ...
95  95.0  195.0
96  96.0  196.0
97  97.0  197.0
98  98.0  198.0
99  99.0  199.0

[100 rows x 2 columns]


### Sorting Data
Data is often not sorted before we start to work with it. Sorting data is very useful for optimizing operations like joins and aggregations, especially when the data is distributed.

We can sort data in cuDF using the sort_values method and passing in which column we want to sort by.

In [26]:

df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

   a  b  c    d
0  0  1  0  100
1  0  1  1   99
2  0  0  2   98
3  0  0  3   97
4  0  0  4   96


In [27]:
# use sort values to sort by a specific column label
print(df.sort_values('d').head())


    a  b   c  d
99  3  1  99  1
98  3  0  98  2
97  3  1  97  3
96  3  0  96  4
95  3  1  95  5
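
Sorting by more than one column follows the pandas pattern: pass a list of column names and, optionally, a matching list of sort directions. This sketch assumes cuDF accepts a list for `ascending` just as pandas does; `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same sort_values signature

df = xdf.DataFrame({'a': [1, 0, 1, 0],
                    'b': [10, 20, 30, 40]})

# sort by 'a' ascending, then break ties by 'b' descending
ordered = df.sort_values(['a', 'b'], ascending=[True, False])
print(ordered)
```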


### Concatenations

In everyday data science, we typically work with multiple sources of data and wish to combine these data into a single more meaningful representation. These operations are often called concatenations and joins. We can concatenate two or more dataframes together row-wise or column-wise by passing in a list of the dataframes to be concatenated into the cudf.concat function and specifying the axis along which to concatenate these dataframes.

If we want to concatenate the dataframes row-wise, we can specify axis=0. To concatenate column-wise, we can specify axis=1.

In [28]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})

In [29]:
# use .concat() to combine two cudf dataframes with the same columns
df = cudf.concat([df1, df2], axis=0)
df

Unnamed: 0,a,b,c,d
0,0,0,0,100
1,0,1,1,99
2,0,1,2,98
3,0,1,3,97
4,0,0,4,96
...,...,...,...,...
95,3,0,95,5
96,3,0,96,4
97,3,1,97,3
98,3,0,98,2
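
The row-wise result above keeps the original index of each input, so the labels 0 to 99 appear twice. The sketch below shows two variations on small frames: `ignore_index=True` to renumber the rows, and `axis=1` to place the frames side by side. `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same concat function

left = xdf.DataFrame({'a': [0, 1, 2]})
right = xdf.DataFrame({'b': [10, 11, 12]})

# row-wise: stack the rows and renumber the index 0..5
stacked = xdf.concat([left, left], axis=0, ignore_index=True)
print(stacked)

# column-wise: align on the index and place the columns next to each other
side_by_side = xdf.concat([left, right], axis=1)
print(side_by_side)
```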


### Joins / Merges
Multiple dataframes can be joined together using a single (or multiple) column(s). There are two syntaxes for performing joins:

- One can use the DataFrame.merge method and pass in another dataframe to join, or
- One can use the cudf.merge function and pass in which dataframes to join.

Both syntaxes also accept a list of column names through an additional keyword argument, on, which specifies which columns the dataframes should be joined on. If this keyword is not specified, cuDF will by default join using column names that appear in both dataframes.

In [30]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'e': np.arange(0, 100).astype(np.int32), 
                      'f': np.arange(100, 0, -1).astype(np.int32)})

In [31]:
# use merge to join two dataframes; with no 'on' argument, the columns
# they share ('a' and 'b') are used as the join keys
df = df1.merge(df2)
print(df.head())

   a  b   c    d  e    f
0  0  1   0  100  0  100
1  0  1  16   84  0  100
2  0  1   8   92  0  100
3  0  1  24   76  0  100
4  0  0   1   99  3   97


In [32]:
# alternate path to same result
df = cudf.merge(df1, df2)
print(df.head())

   a  b   c    d  e    f
0  0  1   0  100  0  100
1  0  1  16   84  0  100
2  0  1   8   92  0  100
3  0  1  24   76  0  100
4  0  0   1   99  3   97


In [33]:
# use the 'on' keyword to choose the join key(s), as in a SQL join
# the parameter 'on' lets you choose which column or index level names to join on.
#  These must be found in both DataFrames. If on is None and not merging on indexes
#  then this defaults to the intersection of the columns in both DataFrames.
df = df1.merge(df2, on=['a'])
print(df.head())

   a  b_x   c    d  b_y  e    f
0  0    1   0  100    1  0  100
1  0    1  16   84    1  0  100
2  0    1   8   92    1  0  100
3  0    1  24   76    1  0  100
4  0    0   1   99    1  0  100


In [34]:
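# same join using the function form; cuDF does not guarantee the row order of a
# merge result, so the first rows shown can differ from the previous cell
df = cudf.merge(df1, df2, on=['a'])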
print(df.head())

   a  b_x   c   d  b_y   e   f
0  2    1  64  36    1  64  36
1  3    1  80  20    1  96   4
2  2    1  72  28    1  64  36
3  3    0  88  12    1  96   4
4  2    0  65  35    1  64  36


In [35]:
df = cudf.merge(df1, df2, on=['a', 'b'])
print(df.head())

   a  b   c    d  e    f
0  0  1   0  100  0  100
1  0  1  16   84  0  100
2  0  1   8   92  0  100
3  0  1  24   76  0  100
4  0  0   1   99  3   97
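
The effect of `on` is easier to see on a tiny example. In this sketch the frames and the column name `key` are made up, and the result is sorted afterwards so the printed order is stable. `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same merge signature

left = xdf.DataFrame({'key': [0, 1, 2], 'b': [1, 1, 0], 'c': [10, 11, 12]})
right = xdf.DataFrame({'key': [0, 1, 3], 'b': [1, 0, 0], 'e': [100, 101, 103]})

# join on 'key' only: 'b' exists in both frames, so it gets _x / _y suffixes
on_key = left.merge(right, on=['key']).sort_values('key')
print(on_key)

# no 'on' argument: every shared column ('key' and 'b') must match
shared = left.merge(right)
print(shared)
```

Only the keys 0 and 1 appear in both frames, so the first join returns two rows; the second join additionally requires `b` to match, which leaves a single row.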


### Groupbys

A useful operation when working with datasets is to group the data using a specific key and aggregate the values mapping to those keys. For example, we might want to aggregate multiple temperature measurements taken during a day from a specific sensor and average those measurements to find the average daily temperature at a specific geolocation.

cuDF allows us to perform such an operation using the groupby method. This will create a GroupBy object that we can operate on using aggregation functions such as sum, var, or complex aggregation functions defined by the user.

We can also specify multiple columns to group on by passing a list of column names to the groupby method.

In [36]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

   a  b  c    d
0  0  0  0  100
1  0  1  1   99
2  0  0  2   98
3  0  1  3   97
4  0  0  4   96


In [37]:
# let's group by the 0-3 numbers in the 'a' column
grouped_df = df.groupby('a')
# we can then apply aggregate functions to make insights about these groups.
aggregation = grouped_df.sum()
print(aggregation)

    b     c     d
a                
0  14   300  2200
2  15  1550   950
3  12  2175   325
1  13   925  1575
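
Grouping on several columns works by passing a list of names. In this sketch, `as_index=False` keeps the group keys as ordinary columns and the result is sorted so the rows print in a predictable order; both choices are ours, for readability. `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same groupby syntax

df = xdf.DataFrame({'a': [0, 0, 1, 1, 1],
                    'b': [0, 1, 0, 0, 1],
                    'c': [1, 2, 3, 4, 5]})

# one output row per distinct (a, b) pair, with 'c' summed within each pair
totals = df.groupby(['a', 'b'], as_index=False).sum().sort_values(['a', 'b'])
print(totals)
```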


### Statistical Operations
There are several statistical operations we can use to aggregate our data in meaningful ways. These can be applied to both Series and DataFrame objects.

In [38]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

In [39]:
# Aggregate functions can be applied on a column by column basis
# examples include sum(), mean(), median(), min(), and max()
df['a'].sum()

150

In [40]:
df['d'].max()

100
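
The same reductions can be called on a whole DataFrame, in which case they return one value per column. A minimal sketch with made-up values; `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same reductions

df = xdf.DataFrame({'a': [1, 2, 3, 4],
                    'b': [10.0, 20.0, 30.0, 40.0]})

# DataFrame-level reduction: a Series holding one mean per column
column_means = df.mean()
print(column_means)

# Series-level reduction: a single number (sample standard deviation)
b_std = df['b'].std()
print(b_std)
```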

### Histogramming
We can access the value counts of a column using the value_counts method. Note that this is typically used with columns representing discrete data i.e. integers, strings, categoricals, etc. We may not be as interested in the value counts of numerical data e.g. how often the value 2.1 appears. The results of the value_counts method can be used with Python plotting libraries like Matplotlib or Seaborn to generate visualizations such as histograms.

In [41]:
result = df['a'].value_counts()
print(result)

0    25
2    25
3    25
1    25
Name: a, dtype: int32


## Missing Data

Sometimes data is not as clean as we would like it - often there are wrong values or values that are missing entirely. cuDF DataFrames can represent missing values using the Python None keyword.

In [42]:

df = cudf.DataFrame({'a': [0, None, 2, 3, 4, 5, 6, 7, 8, None, 10],
                     'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 
                     'c': [0.0, 0.1, None, None, 0.4, 0.5, None, 0.7, 0.8, 0.9, 1.0]})
print(df)

       a     b     c
0      0   0.0   0.0
1   <NA>   0.1   0.1
2      2   0.2  <NA>
3      3  <NA>  <NA>
4      4   0.4   0.4
5      5   0.5   0.5
6      6   0.6  <NA>
7      7   0.7   0.7
8      8   0.8   0.8
9   <NA>   0.9   0.9
10    10   1.0   1.0


In [43]:
# the .fillna() method from pandas carries over to cuDF
df['c'] = df['c'].fillna(999)
print(df)

       a     b      c
0      0   0.0    0.0
1   <NA>   0.1    0.1
2      2   0.2  999.0
3      3  <NA>  999.0
4      4   0.4    0.4
5      5   0.5    0.5
6      6   0.6  999.0
7      7   0.7    0.7
8      8   0.8    0.8
9   <NA>   0.9    0.9
10    10   1.0    1.0
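
Before filling values it is often useful to count how many are missing, and sometimes dropping incomplete rows is the simpler option. The sketch below uses `isnull` and `dropna`, two pandas methods that we assume cuDF mirrors; `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same missing-data methods

df = xdf.DataFrame({'a': [0, None, 2],
                    'b': [0.0, 0.1, None]})

# number of missing values in each column
null_counts = df.isnull().sum()
print(null_counts)

# keep only the rows that have no missing values at all
complete_rows = df.dropna()
print(complete_rows)
```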


### One Hot Encoding
Data scientists often work with discrete data such as integers or categories. This data can be represented using a One Hot Encoding format, in which each category gets its own column holding 1.0 where the row belongs to that category and 0.0 otherwise.

cuDF allows us to convert discrete data to a One Hot Encoding format using the one_hot_encoding method. We can pass this method the column name to convert, a prefix to prepend to each newly created column, and the categories of data to create new columns for. We can pass in all the categories in the discrete data or a subset - cuDF will flexibly handle both and only create new columns for the categories specified.

In [44]:
categories = [0, 1, 2, 3]
df = cudf.DataFrame({'a': np.repeat(categories, 25).astype(np.int32), 
                     'b': np.arange(0, 100).astype(np.int32), 
                     'c': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

   a  b    c
0  0  0  100
1  0  1   99
2  0  2   98
3  0  3   97
4  0  4   96


In [45]:
result = df.one_hot_encoding('a', prefix='a_', cats=categories)
print(result.head())
print(result.tail())

   a  b    c  a__0  a__1  a__2  a__3
0  0  0  100   1.0   0.0   0.0   0.0
1  0  1   99   1.0   0.0   0.0   0.0
2  0  2   98   1.0   0.0   0.0   0.0
3  0  3   97   1.0   0.0   0.0   0.0
4  0  4   96   1.0   0.0   0.0   0.0
    a   b  c  a__0  a__1  a__2  a__3
95  3  95  5   0.0   0.0   0.0   1.0
96  3  96  4   0.0   0.0   0.0   1.0
97  3  97  3   0.0   0.0   0.0   1.0
98  3  98  2   0.0   0.0   0.0   1.0
99  3  99  1   0.0   0.0   0.0   1.0


In [46]:
# you can change which values are one hot encoded using the cats parameter
result = df.one_hot_encoding('a', prefix='a_', cats=[0, 1, 2])
print(result.head())
print(result.tail())

   a  b    c  a__0  a__1  a__2
0  0  0  100   1.0   0.0   0.0
1  0  1   99   1.0   0.0   0.0
2  0  2   98   1.0   0.0   0.0
3  0  3   97   1.0   0.0   0.0
4  0  4   96   1.0   0.0   0.0
    a   b  c  a__0  a__1  a__2
95  3  95  5   0.0   0.0   0.0
96  3  96  4   0.0   0.0   0.0
97  3  97  3   0.0   0.0   0.0
98  3  98  2   0.0   0.0   0.0
99  3  99  1   0.0   0.0   0.0
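
The `one_hot_encoding` method may not exist in every cuDF release (we believe newer versions dropped it). If it is missing in yours, the pandas-style `get_dummies` function is a possible substitute. This is a sketch under that assumption: unlike `one_hot_encoding`, it replaces column `a` with the indicator columns and may use a different dtype for them. `xdf` is our own alias that falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback; get_dummies exists in both

df = xdf.DataFrame({'a': [0, 1, 2, 1],
                    'b': [10, 11, 12, 13]})

# one indicator column per distinct value of 'a', named a_0, a_1, a_2
encoded = xdf.get_dummies(df, columns=['a'], prefix='a')
print(encoded)
```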


# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib