<a href="https://colab.research.google.com/github/PadmarajBhat/Rapids.AI/blob/master/rapids_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4.

In [1]:
!nvidia-smi

Sat Jul 20 02:11:53 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    18W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [2]:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)
print(device_name)
if device_name != b'Tesla T4':
  raise Exception("""
    Unfortunately this instance does not have a T4 GPU.
    
    Please make sure you've configured Colab to request a GPU instance type.
    
    Sometimes Colab allocates a Tesla K80 instead of a T4.

    If you get a K80 GPU, try Runtime -> Reset all runtimes...
  """)
else:
  print('Woo! You got the right kind of GPU!')

b'Tesla T4'
Woo! You got the right kind of GPU!
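
The check above compares the device name against a bytes literal. Some pynvml releases return a `str` from `nvmlDeviceGetName` instead of `bytes`, in which case that comparison would reject a real T4. The sketch below shows one way to make the comparison tolerant of both. The helper names `normalize_device_name` and `is_t4` are hypothetical, and the example runs without a GPU because it only handles the name value.

```python
def normalize_device_name(name):
    """Return the device name as str, whether it arrived as bytes or str."""
    return name.decode("utf-8") if isinstance(name, bytes) else name


def is_t4(name):
    """True when the (bytes or str) device name is exactly 'Tesla T4'."""
    return normalize_device_name(name) == "Tesla T4"


# Both representations of the same name are accepted; a K80 is rejected.
print(is_t4(b"Tesla T4"), is_t4("Tesla T4"), is_t4(b"Tesla K80"))
# → True True False
```

In the cell above, `if not is_t4(device_name):` could replace the direct comparison with `b'Tesla T4'`.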


# Setup #

1. Install the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
2. Install RAPIDS libraries
3. Set necessary environment variables
4. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions

In [3]:
# install miniconda
!wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
!bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

# install RAPIDS packages
!conda install -q -y --prefix /usr/local -c conda-forge \
  -c rapidsai-nightly/label/cuda10.0 -c nvidia/label/cuda10.0 \
  cudf cuml

!bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local
!conda install -q -y --prefix /usr/local -c nvidia -c rapidsai \
  -c numba -c conda-forge -c defaults nvstrings=0.8 python=3.6 cudatoolkit=10.0

!bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local
!conda install  -q -y --prefix /usr/local dask

# set environment vars
import sys, os, shutil
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

# copy .so files to current working dir
for fn in ['libcudf.so', 'librmm.so']:
  shutil.copy('/usr/local/lib/'+fn, os.getcwd())

--2019-07-20 02:12:33--  https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.200.79, 104.18.201.79, 2606:4700::6812:c94f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.200.79|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58468498 (56M) [application/x-sh]
Saving to: ‘Miniconda3-4.5.4-Linux-x86_64.sh’


2019-07-20 02:12:34 (113 MB/s) - ‘Miniconda3-4.5.4-Linux-x86_64.sh’ saved [58468498/58468498]

PREFIX=/usr/local
installing: python-3.6.5-hc3d631a_2 ...
Python 3.6.5 :: Anaconda, Inc.
installing: ca-certificates-2018.03.07-0 ...
installing: conda-env-2.6.0-h36134e3_1 ...
installing: libgcc-ng-7.2.0-hdf63c60_3 ...
installing: libstdcxx-ng-7.2.0-hdf63c60_3 ...
installing: libffi-3.2.1-hd88cf55_4 ...
installing: ncurses-6.1-hf484d3e_0 ...
installing: openssl-1.0.2o-h20670df_0 ...
installing: tk-8.6.7-hc745277_3 ...
installing: xz-5.2.4-h14c3975_4 ...
installing: yaml-0.1.7-
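
Step 4 of the setup copies the RAPIDS `.so` files with `shutil.copy`, which raises an error if a file is absent, for example when the conda install did not finish. Below is a minimal sketch of a guarded copy that reports missing files instead. The function name `copy_shared_libs` is hypothetical, and the demonstration uses temporary directories with an empty placeholder file rather than the real `/usr/local/lib`.

```python
import os
import shutil
import tempfile


def copy_shared_libs(src_dir, names, dest_dir):
    """Copy each named file from src_dir to dest_dir; return the names not found."""
    missing = []
    for name in names:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.copy(src, dest_dir)
        else:
            missing.append(name)
    return missing


# Demonstration: only libcudf.so exists in the (temporary) source directory.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dest:
    open(os.path.join(src, "libcudf.so"), "w").close()  # empty placeholder file
    demo_missing = copy_shared_libs(src, ["libcudf.so", "librmm.so"], dest)
    demo_copied = sorted(os.listdir(dest))

print(demo_missing, demo_copied)
# → ['librmm.so'] ['libcudf.so']
```

In the notebook this would be called as `copy_shared_libs('/usr/local/lib', ['libcudf.so', 'librmm.so'], os.getcwd())`; a non-empty return value suggests the install step should be re-run.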

# cuDF and cuML Examples #

Now you can run code! 

What follows are basic examples where all processing takes place on the GPU.

# [cuDF](https://github.com/rapidsai/cudf) #

Load a dataset into a GPU memory resident DataFrame and perform a basic calculation.

Everything from CSV parsing to computing a grouped average is done on the GPU.

_Note_: You must import nvstrings and nvcategory before cudf, else you'll get errors.

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
!ls -l "/content/drive/My Drive/Colab Notebooks/TrainData_PA.csv"

-rw------- 1 root root 5493629 Jul 16 23:19 '/content/drive/My Drive/Colab Notebooks/TrainData_PA.csv'


In [0]:
import nvstrings, nvcategory, cudf
import io, requests
tips_df = cudf.read_csv("/content/drive/My Drive/Colab Notebooks/TrainData_PA.csv")
tips_df.head()

<cudf.DataFrame ncols=40 nrows=5 >

In [0]:
print(tips_df.head(), tips_df.shape, tips_df.columns)

   county        city  zipcode                            address  state  rent            latitude ...  HomePrice
0    None     WEXFORD            266 Clematis Dr Allegheny County     PA  2400             40.6182 ...     158051
1    None   WHITEHALL                2310 N 1st Ave Lehigh County     PA   995           40.649906 ...     158051
2    None   WHITEHALL           3338 St Stephens Ln Lehigh County     PA  1740           40.646282 ...     158051
3    None  WAYNESBORO                97 W Main St Franklin County     PA   675  39.756992000000004 ...     158051
4    None  QUAKERTOWN                 200 E Broad St Bucks County     PA  1300  40.441176999999996 ...     158051
[32 more columns] (18203, 40) Index(['county', 'city', 'zipcode', 'address', 'state', 'rent', 'latitude',
       'longitude', 'cemetery_dist_miles', 'nationalhighway_miles',
       'railline_miles', 'starbucks_miles', 'walmart_miles', 'hospital_miles',
       'physician_dist_miles', 'dentist_dist_miles', 'opt_dist_

In [0]:
print(tips_df[tips_df.columns[:-1]])

    county         city  zipcode                            address  state  rent            latitude ...  Crime_Rate
0    None      WEXFORD            266 Clematis Dr Allegheny County     PA  2400             40.6182 ...         2.4
1    None    WHITEHALL                2310 N 1st Ave Lehigh County     PA   995           40.649906 ...         2.4
2    None    WHITEHALL           3338 St Stephens Ln Lehigh County     PA  1740           40.646282 ...         2.4
3    None   WAYNESBORO                97 W Main St Franklin County     PA   675  39.756992000000004 ...         2.4
4    None   QUAKERTOWN                 200 E Broad St Bucks County     PA  1300  40.441176999999996 ...         2.4
5    None   WAYNESBORO           407 Viewpoint Way Franklin County     PA  1025  39.766594000000005 ...         2.4
6    None   WAYNESBORO           403 Viewpoint Way Franklin County     PA  1025  39.766580000000005 ...         2.4
7    None   WAYNESBORO                240 Crown Ct Franklin County     

In [0]:
print(tips_df.groupby('city').HomePrice.mean().reset_index())

              city           HomePrice
0        ABINGTON   165602.2857142857
1        AIRVILLE            158051.0
2           AKRON  170204.57142857142
3  ALBRIGHTSVILLE  111965.84615384616
4        ALBURTIS  194317.18181818182
5       ALIQUIPPA   90321.06666666667
6       ALLENTOWN  144091.08250825084
7    ALLISON PARK  185304.86956521738
8         ALTOONA  141058.81818181818
9          AMBLER           259464.06
[664 more rows]
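
To make the grouped average above concrete, here is a CPU-only sketch in plain Python of what `groupby('city').HomePrice.mean()` computes: one mean per distinct key. The function `grouped_mean` is hypothetical and the sample rows are invented for illustration; they are not taken from `TrainData_PA.csv`.

```python
from collections import defaultdict


def grouped_mean(rows, key, value):
    """Return {group: mean of `value`} for an iterable of dict rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample rows (made-up cities and prices).
sample_rows = [
    {"city": "A", "HomePrice": 100.0},
    {"city": "A", "HomePrice": 200.0},
    {"city": "B", "HomePrice": 50.0},
]

print(grouped_mean(sample_rows, "city", "HomePrice"))
# → {'A': 150.0, 'B': 50.0}
```

cuDF produces the same kind of result, but performs the aggregation on the GPU.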


In [0]:
from cuml import SGD

sgd = SGD(eta0=0.1)

result_sgd = sgd.fit(tips_df[tips_df.columns[:-1]], tips_df[tips_df.columns[-1]])

ValueError: ignored

In [0]:

import numpy as np

# The fit above raised a ValueError, presumably because of the non-numeric columns.
# Keep only the float64 columns and replace missing values with 0.
tips_numeric = tips_df.select_dtypes(include=np.float64).fillna(0)
print(tips_numeric.dtypes)
# Features: every column except the last; target: the last column.
result_sgd = sgd.fit(tips_numeric[tips_numeric.columns[:-1]], tips_numeric[tips_numeric.columns[-1]])

zipcode                  float64
latitude                 float64
longitude                float64
cemetery_dist_miles      float64
nationalhighway_miles    float64
railline_miles           float64
starbucks_miles          float64
walmart_miles            float64
hospital_miles           float64
physician_dist_miles     float64
dentist_dist_miles       float64
opt_dist_miles           float64
vet_dist_miles           float64
farmers_miles            float64
time                     float64
lotsize                  float64
Census_MedianIncome      float64
CollegeGrads             float64
WhiteCollar              float64
Schools                  float64
Unemployment             float64
EmploymentDiversity      float64
Census_Vacancy           float64
Crime_Rate               float64
dtype: object


Let us look at the conversions to and from pandas. It is not yet clear whether the pandas DataFrame lives in local memory or in distributed memory: for example, what happens if the data is too large for a single cluster node?

In [0]:
pdf = tips_numeric.to_pandas()

In [0]:
cudf.from_pandas(pdf)

<cudf.DataFrame ncols=25 nrows=18203 >

Apparently there is one more DataFrame type, the Dask DataFrame, intended for distributed DataFrame computing. That answers the previous question. But how do we configure it?

In [44]:
import dask.dataframe as dd
import numpy as np
tips_df = dd.read_csv("/content/drive/My Drive/Colab Notebooks/TrainData_PA.csv", assume_missing=True)
tips_df.head()

Unnamed: 0,county,city,zipcode,address,state,rent,latitude,longitude,cemetery_dist_miles,nationalhighway_miles,railline_miles,starbucks_miles,walmart_miles,hospital_miles,physician_dist_miles,dentist_dist_miles,opt_dist_miles,vet_dist_miles,farmers_miles,time,bed,bath,halfbath,sqft,property_type,garage,yearbuilt,pool,fireplace,patio,lotsize,Census_MedianIncome,CollegeGrads,WhiteCollar,Schools,Unemployment,EmploymentDiversity,Census_Vacancy,Crime_Rate,HomePrice
0,,WEXFORD,,266 Clematis Dr Allegheny County,PA,2400.0,40.6182,-80.0776,1.019586,0.206222,0.629888,1.348776,3.326397,1.584675,0.229126,0.472933,0.651244,7.323725,1.094678,2016.25,3.0,2.0,1.0,2000.0,Condo,1.0,2008.0,0.0,1.0,0.0,4086.388045,54476.09,21.0,66.57,48.3,5.1,3.48,3.42,2.4,158051.0
1,,WHITEHALL,,2310 N 1st Ave Lehigh County,PA,995.0,40.649906,-75.47894,1.019586,0.206222,0.629888,1.348776,3.326397,1.584675,0.229126,0.472933,0.651244,7.323725,1.094678,2016.25,2.0,1.0,1.0,1100.0,Condo,0.0,1935.0,0.0,0.0,0.0,2247.513425,54476.09,21.0,66.57,48.3,5.1,3.48,3.42,2.4,158051.0
2,,WHITEHALL,,3338 St Stephens Ln Lehigh County,PA,1740.0,40.646282,-75.510056,1.019586,0.206222,0.629888,1.348776,3.326397,1.584675,0.229126,0.472933,0.651244,7.323725,1.094678,2015.75,3.0,2.0,1.0,1522.0,Condo,0.0,2006.0,0.0,1.0,1.0,3109.741302,54476.09,21.0,66.57,48.3,5.1,3.48,3.42,2.4,158051.0
3,,WAYNESBORO,,97 W Main St Franklin County,PA,675.0,39.756992,-77.579704,1.019586,0.206222,0.629888,1.348776,3.326397,1.584675,0.229126,0.472933,0.651244,7.323725,1.094678,2016.25,3.0,1.0,1.0,1150.0,Condo,0.0,1960.0,0.0,0.0,0.0,2349.673126,54476.09,21.0,66.57,48.3,5.1,3.48,3.42,2.4,158051.0
4,,QUAKERTOWN,,200 E Broad St Bucks County,PA,1300.0,40.441177,-75.33254,1.019586,0.206222,0.629888,1.348776,3.326397,1.584675,0.229126,0.472933,0.651244,7.323725,1.094678,2016.25,3.0,2.0,1.0,1000.0,SFR,0.0,1960.0,0.0,0.0,0.0,2043.194023,54476.09,21.0,66.57,48.3,5.1,3.48,3.42,2.4,158051.0


See https://docs.dask.org/en/latest/install.html for installation instructions and other documentation.

In [23]:
tips_df.select_dtypes(include=np.number).dtypes

zipcode                  float64
rent                     float64
latitude                 float64
longitude                float64
cemetery_dist_miles      float64
nationalhighway_miles    float64
railline_miles           float64
starbucks_miles          float64
walmart_miles            float64
hospital_miles           float64
physician_dist_miles     float64
dentist_dist_miles       float64
opt_dist_miles           float64
vet_dist_miles           float64
farmers_miles            float64
time                     float64
bed                      float64
bath                     float64
halfbath                 float64
sqft                     float64
garage                   float64
yearbuilt                float64
pool                     float64
fireplace                float64
patio                    float64
lotsize                  float64
Census_MedianIncome      float64
CollegeGrads             float64
WhiteCollar              float64
Schools                  float64
Unemployme

In [24]:
tips_df['zipcode']=tips_df.zipcode.astype(np.int64)
tips_df.select_dtypes(include=np.number).dtypes

zipcode                    int64
rent                     float64
latitude                 float64
longitude                float64
cemetery_dist_miles      float64
nationalhighway_miles    float64
railline_miles           float64
starbucks_miles          float64
walmart_miles            float64
hospital_miles           float64
physician_dist_miles     float64
dentist_dist_miles       float64
opt_dist_miles           float64
vet_dist_miles           float64
farmers_miles            float64
time                     float64
bed                      float64
bath                     float64
halfbath                 float64
sqft                     float64
garage                   float64
yearbuilt                float64
pool                     float64
fireplace                float64
patio                    float64
lotsize                  float64
Census_MedianIncome      float64
CollegeGrads             float64
WhiteCollar              float64
Schools                  float64
Unemployme

In [25]:
tips_df.npartitions

1

In [26]:
tips_df.divisions

(None, None)
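
A single partition is expected for a file this small. As far as I understand, `dd.read_csv` splits each file into byte blocks of size `blocksize`, so the partition count is roughly the file size divided by the blocksize, rounded up. The sketch below is only that arithmetic, not Dask's actual implementation. The function `estimate_partitions` is hypothetical, and the 64 MB figure is an assumption about the upper bound of Dask's default blocksize.

```python
import math


def estimate_partitions(file_size_bytes, blocksize_bytes):
    """Rough partition count for one file: ceil(size / blocksize), at least 1."""
    return max(1, math.ceil(file_size_bytes / blocksize_bytes))


file_size = 5493629  # bytes, from the `ls -l` output earlier in this notebook

# Assumed default-sized block (64 MB): the whole file fits in one block.
print(estimate_partitions(file_size, 64_000_000))  # → 1

# A smaller, explicitly chosen block (1 MB) would give several partitions.
print(estimate_partitions(file_size, 1_000_000))   # → 6
```

Passing a smaller `blocksize` to `dd.read_csv` is one way to get more than one partition from this file.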

### Example of delayed function:

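Dask evaluates lazily: the cast of `zipcode` to integer was only recorded when the earlier cell ran, not executed against the data read by `read_csv`. Any failure in that cast therefore surfaces later, when a result is actually computed, which is likely what the `ValueError`s below show.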
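
Deferred execution of this kind can be sketched in plain Python. The toy class below is not Dask's API; `LazyColumn` and the sample values are hypothetical. It records an `astype` call without running it, so a cast that cannot succeed (a NaN to an integer) raises only when `compute()` is called.

```python
class LazyColumn:
    """Toy stand-in for a lazy column: records operations, runs them in compute()."""

    def __init__(self, values, ops=()):
        self.values = values
        self.ops = tuple(ops)

    def astype(self, cast):
        # Only remember the cast; nothing is applied to the data yet.
        return LazyColumn(self.values, self.ops + (cast,))

    def compute(self):
        # Apply every recorded operation to every value, in order.
        result = list(self.values)
        for op in self.ops:
            result = [op(v) for v in result]
        return result


# Hypothetical column with one missing value.
zipcode = LazyColumn([15090.0, float("nan")])
as_int = zipcode.astype(int)  # no error here: the cast has not run

try:
    as_int.compute()  # the cast runs now and fails on the NaN
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)
# → ValueError
```

The same pattern would explain why a line such as `tips_df.zipcode.astype(np.int64)` can appear to succeed and then fail in a later, unrelated-looking cell.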

In [34]:
print(tips_df.shape)
print(list(tips_df[tips_df.yearbuilt.isna()]["yearbuilt"]))
tips_df['yearbuilt']=tips_df.yearbuilt.fillna(0.0)
tips_df.head()
tips_df = tips_df.set_index("yearbuilt")

(Delayed('int-ee25b1e8-8473-43ed-a32f-1309a7404e72'), 40)


ValueError: ignored

In [36]:
list(tips_df.yearbuilt)

ValueError: ignored

In [37]:
tips_df.yearbuilt.head()

ValueError: ignored

In [39]:
tips_df.HomePrice.head()

ValueError: ignored

In [43]:
print(tips_df.info())

<class 'dask.dataframe.core.DataFrame'>
Columns: 40 entries, county to HomePrice
dtypes: object(5), float64(34), int64(1)None


# [cuML](https://github.com/rapidsai/cuml) #

This snippet loads a 

As above, all calculations are performed on the GPU.

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-extended