# Sklearn vs Cuml RAPIDS Cudf vs Pandas <a class="anchor" id="tea"></a>

<a href="https://www.linkedin.com/in/ouassim-adnane/">Ouassim Adnane</a> 03 July 2020

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F848555%2F76dca114607de3d4e8886b55452c9d0b%2Frpvssk.jpg?generation=1593708010327774&alt=media">

# Overview 

### In this quick notebook, the goal is to see how much faster Rapids cuDF and cuML, the potential future replacements for Pandas and Sklearn, perform. I've augmented the Titanic dataset by a factor of 8192 to be able to compare the performance.

<h2 style="color:red">If you enjoyed this work or found it helpful, an upvote would be very much appreciated :-)</h2>

# Table of Contents  <a class="anchor" id="toc"></a>

<div style="background: #f9f9f9 none repeat scroll 0 0;border: 1px solid #aaa;display: table;font-size: 95%;margin-bottom: 1em;padding: 20px;width: 400px;">
<h3>Contents</h3>
<ul style="font-weight: 700;text-align: left;list-style: outside none none !important;">
<li style="list-style: outside none none !important;"><a href="#sr">1- SetUp Rapids</a></li>
<li style="list-style: outside none none !important;"><a href="#dp">2- Data Preparation</a></li>
<li style="list-style: outside none none !important;"><a href="#pa">3- Pandas Apply (CPU) vs Cudf Apply row (GPU)</a></li>
<li style="list-style: outside none none !important;"><a href="#sx">4- SpeedUp Xgboost with GPU</a></li>
<li style="list-style: outside none none !important;"><a href="#skr">5- Sklearn(Cpu) vs Rapids Cuml(Gpu)</a></li>
<li style="list-style: outside none none !important;"><a href="#rsa">6- Rapids Supported Algorithms</a></li>

</ul>
</div>

# Tutorials <a class="anchor" id="tu"></a>
<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;hight:20px;float:left" >Back to the table of contents</a>

List of some Rapids tutorials

<div>
    <ul style="  list-style-type: none;width: 800px;">
    <a href="https://www.youtube.com/playlist?list=PLshVYqgOF84fpDG-Zk5ucwdIRksI-4B1S" style="text-decoration:none;color:black" target="_blank">
    <li style="float: left;margin: 0 15px 0 0;font: 200 12px/1.5 Georgia, Times New Roman, serif;padding: 5px;overflow: auto;">
        <img src="https://www.mirrorreview.com/wp-content/uploads/2018/10/NVIDIA-announced-a-GPU-acceleration-platform.jpg" style="float: left;margin: 0 15px 0 0;width:200px;hight:300px">
      <p style="font: bold 20px/1.5 Helvetica, Verdana, sans-serif;">Rapids Youtube Playlist</p>
      <p style="font: 200 12px/1.5 Georgia, Times New Roman, serif;">I've compiled a list of YouTube videos that explain RAPIDS (Open-Source GPU-Acceleration)</p>
    </li>
      </a>
      <hr style="width:100%;text-align:left;margin-left:0">  
    <a href="https://docs.rapids.ai/" style="text-decoration:none;color:black" target="_blank">
    <li style="float: left;margin: 0 15px 0 0;font: 200 12px/1.5 Georgia, Times New Roman, serif;padding: 5px;overflow: auto;">
        <img src="https://www.xaasjournal.com/wp-content/uploads/2020/05/IT_Documentation_Files_Digitization.jpg" style="float: left;margin: 0px 15px 0 0;width:200px;height:100px">
      <p style="font: bold 20px/1.5 Helvetica, Verdana, sans-serif;">Rapids Documentation</p>
      <p style="font: 200 12px/1.5 Georgia, Times New Roman, serif;">This site serves as a collection of all the documentation for RAPIDS. Whether you’re new to RAPIDS, looking to contribute, or are a part of the RAPIDS team, the docs here will help guide you
</p>
    </li>
      </a>
              <hr style="width:100%;text-align:left;margin-left:0">  
    <a href="https://docs.rapids.ai/api/cudf/stable/10min.html" style="text-decoration:none;color:black" target="_blank">
    <li style="float: left;margin: 0 15px 0 0;font: 200 12px/1.5 Georgia, Times New Roman, serif;padding: 5px;overflow: auto;">
        <img src="https://www.xaasjournal.com/wp-content/uploads/2020/05/IT_Documentation_Files_Digitization.jpg" style="float: left;margin: 0px 15px 0 0;width:200px;height:100px">
      <p style="font: bold 20px/1.5 Helvetica, Verdana, sans-serif;">10 Minutes to cuDF and Dask-cuDF</p>
      <p style="font: 200 12px/1.5 Georgia, Times New Roman, serif;">Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.
</p>
    </li>
      </a>
</ul>        
</div>

# SetUp Rapids <a class="anchor" id="sr"></a>
<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;hight:20px;float:left" >Back to the table of contents</a>

In [None]:
import sys
!cp ../input/rapids/rapids.0.13.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.6/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.6"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/


### Imports 

In [None]:
import numpy as np 
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")
from math import cos, sin, asin, sqrt, pi
from tqdm import tqdm 
import time
tqdm.pandas()
!pip install swifter 2>/dev/null 1>/dev/null
import swifter 
import xgboost as xgb

#pandas 
import pandas as pd 

#Sklearn models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Lasso
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier


#Rapids 
import cudf
from cuml import LogisticRegression as cLogisticRegression
from cuml.neighbors import KNeighborsClassifier as cKNeighborsClassifier
from cuml import SVC as cSVC
from cuml.linear_model import Lasso as cLasso
from cuml.manifold import TSNE as cTSNE
from cuml import DBSCAN as cDBSCAN
from cuml.decomposition import PCA as cPCA
from cuml.ensemble import RandomForestClassifier as cRandomForestClassifier



# Data Preparation <a class="anchor" id="dp"></a>
<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;hight:20px;float:left" >Back to the table of contents</a>

In [None]:
data = pd.read_csv('../input/titanic/train.csv')

In [None]:
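# Light preprocessing for the models
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']
x = data[features].copy()  # copy so the assignments below do not write to a view of `data`
x['Age'] = x['Age'].fillna(x['Age'].median())  # fill missing ages with the median age
x['Embarked'] = x['Embarked'].fillna(x['Embarked'].value_counts().index[0])  # fill with the most frequent port
LE = LabelEncoder()
x['Sex'] = LE.fit_transform(x['Sex'])  # encode categories as integers
x['Embarked'] = LE.fit_transform(x['Embarked'])  # refit the same encoder on Embarked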


I've augmented the Titanic dataset by a factor of 8192 (13 doublings, since 2^13 = 8192) to be able to compare the performance.

In [None]:
for i in range(13):
    x = pd.concat([x,x])

In [None]:
len(x)
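As a rough sketch of what the doubling loop does, the snippet below repeats a tiny stand-in DataFrame (`small` is a hypothetical three-row frame, not the Titanic data) and checks that 13 doublings give the same result as a single `pd.concat` over 8192 copies.

```python
import pandas as pd

# Hypothetical three-row stand-in for the preprocessed Titanic frame.
small = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Survived": [0, 1, 1]})

# Doubling 13 times, as in the loop above: 2**13 = 8192 copies.
doubled = small
for _ in range(13):
    doubled = pd.concat([doubled, doubled])
doubled = doubled.reset_index(drop=True)

# The same augmentation in one call; ignore_index=True builds a fresh 0..n-1 index.
repeated = pd.concat([small] * 2**13, ignore_index=True)

print(len(doubled), len(repeated))
```

Both frames hold the original rows repeated 8192 times in the same order, so either approach can be used to build the benchmark data.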

In [None]:
x.reset_index(drop=True, inplace=True)  # drop=True so the old index is not added as an extra feature column

Convert the pandas DataFrame to a cuDF DataFrame

In [None]:
x_cudf = cudf.from_pandas(x)

# Pandas Apply (CPU) vs Cudf Apply row (GPU) <a class="anchor" id="pa"></a>
<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;hight:20px;float:left" >Back to the table of contents</a>

Create a test function just to compare the performance

In [None]:
def test_function(Fare):
    out=0
    for i in range(10):
        out +=(sin(Fare/2)**2 + cos(Fare) * cos(Fare) * sin(Fare/2)**2)*i
    return out 

Pandas apply on a single core

In [None]:
start_time = time.time()
x["test"]=x.Fare.progress_apply(test_function)
print("%s seconds " % round((time.time() - start_time),2))

Trying to use all available CPU cores with swifter: https://github.com/jmcarpenter2/swifter

In [None]:
start_time = time.time()
x["test"]=x.Fare.swifter.apply(test_function)
print("%s seconds " % round((time.time() - start_time),2))

Not a bad timing. Now the same computation on the GPU with cuDF apply_rows:

In [None]:
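# cuDF apply_rows kernel: receives the input column and the output column as arrays
def test_function(Fare,test):
    for i,x in enumerate(Fare):
        out = 0.0
        for j in range(10):
            out += (sin(x/2)**2 + cos(x) * cos(x) * sin(x/2)**2)*j
        test[i] = out  # assign once, so the result does not depend on the buffer's initial contents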

In [None]:
start_time = time.time()
x_cudf = x_cudf.apply_rows(test_function,
                   incols=['Fare'],
                   outcols=dict(test=np.float64),
                   kwargs=dict())
print("%s seconds " % round((time.time() - start_time),2))

### The difference in speed is clear. The downside is that apply_rows and apply_chunks have a different syntax than pandas apply, and apply_rows still has no support for string columns. Overall, though, the gain in preprocessing time is huge.

# SpeedUp Xgboost with GPU <a class="anchor" id="sx"></a>
<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;hight:20px;float:left" >Back to the table of contents</a>

In [None]:
dtrain = xgb.DMatrix(x.drop(["Survived"],axis=1),label=x["Survived"])

In [None]:
num_round = 100
print("Training with CPU ...")
param = {}
param['tree_method'] = 'hist'
tmp = time.time()
xgb.train(param, dtrain, num_round)
cpu_time = time.time() - tmp
print("CPU Training Time: %s seconds" % (str(cpu_time)))

In [None]:
print("Training with Single GPU ...")
param = {}
param['tree_method'] = 'gpu_hist'
tmp = time.time()

xgb.train(param, dtrain, num_round)
gpu_time = time.time() - tmp
print("GPU Training Time: %s seconds" % (str(gpu_time)))

In [None]:
dtrain = xgb.DMatrix(x_cudf.drop(["Survived"],axis=1),x_cudf["Survived"])

In [None]:
print("Training with Single GPU (cuDF input) ...")
param = {}
param['tree_method'] = 'gpu_hist'
tmp = time.time()

xgb.train(param, dtrain, num_round)
gpu_time = time.time() - tmp
print("GPU Training Time: %s seconds" % (str(gpu_time)))

# Sklearn(Cpu) vs Rapids Cuml(Gpu) <a class="anchor" id="skr"></a>
<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;hight:20px;float:left" >Back to the table of contents</a>

### Sklearn Models 

In [None]:
tmp = time.time()
LogisticRegression().fit(X=x.drop(["Survived"],axis=1),y=x["Survived"])
cpu_time = time.time() - tmp
print("LogisticRegression Time: %s seconds" % (str(round(cpu_time,3))))

tmp = time.time()
KNeighborsClassifier().fit(X=x.drop(["Survived"],axis=1)[:1000000],y=x["Survived"][:1000000])
cpu_time = time.time() - tmp
print("KNeighbors Time: %s seconds" % (str(round(cpu_time,3))))


tmp = time.time()
SVC().fit(X=x.drop(["Survived"],axis=1)[:50000],y=x["Survived"][:50000])
cpu_time = time.time() - tmp
print("SVM Training Time: %s seconds" % (str(round(cpu_time,3))))


tmp = time.time()
Lasso().fit(X=x.drop(["Survived"],axis=1),y=x["Survived"])
cpu_time = time.time() - tmp
print("Lasso Training Time: %s seconds" % (str(round(cpu_time,3))))


tmp = time.time()
TSNE(n_components=2).fit(x.drop(["Survived"],axis=1)[:10000])
cpu_time = time.time() - tmp
print("TSNE Training Time: %s seconds" % (str(round(cpu_time,3))))


tmp = time.time()
DBSCAN(eps=0.6, min_samples=2).fit(x.drop(["Survived"],axis=1)[:100000])
cpu_time = time.time() - tmp
print("DBScan Training Time: %s seconds" % (str(round(cpu_time,3))))


tmp = time.time()
PCA(n_components=2).fit(x.drop(["Survived"],axis=1)[:100000])
cpu_time = time.time() - tmp
print("PCA Training Time: %s seconds" % (str(round(cpu_time,3))))
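The timing pattern above repeats the same three lines for every model. One possible way to tidy it, sketched here under the assumption that wall-clock time is all that is needed, is a small context manager. The names `timed` and `timings` are hypothetical helpers, not part of Sklearn or Rapids.

```python
import time
from contextlib import contextmanager

# Hypothetical registry of measured durations, keyed by label.
timings = {}


@contextmanager
def timed(label, results=timings):
    # Measure the wall-clock time of the enclosed block.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[label] = elapsed
        print("%s Time: %s seconds" % (label, round(elapsed, 3)))


# Stand-in workload; in the notebook this would be a model's .fit(...) call.
with timed("Dummy workload"):
    total = sum(i * i for i in range(100_000))
```

Keeping the durations in a dictionary makes it easy to compare CPU and GPU runs afterwards, for example by dividing one stored time by the other.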

### cuml Models 

In [None]:
import gc
gc.collect()

In [None]:
x_cudf["Survived"] = x_cudf["Survived"].astype(np.float64)

tmp = time.time()
cLogisticRegression().fit(X=x_cudf.drop(["Survived"],axis=1),y=x_cudf["Survived"])
gpu_time = time.time() - tmp
print("LogisticRegression Time: %s seconds" % (str(round(gpu_time,3))))



tmp = time.time()
cKNeighborsClassifier().fit(X=x_cudf.drop(["Survived"],axis=1)[:1000000],y=x_cudf["Survived"][:1000000])
gpu_time = time.time() - tmp
print("KNeighbors Time: %s seconds" % (str(round(gpu_time,3))))



tmp = time.time()
cSVC().fit(X=x_cudf.drop(["Survived"],axis=1)[:50000],y=x_cudf["Survived"][:50000])
gpu_time = time.time() - tmp
print("SVM Training Time: %s seconds" % (str(round(gpu_time,3))))


tmp = time.time()
cLasso().fit(X=x_cudf.drop(["Survived"],axis=1),y=x_cudf["Survived"])
gpu_time = time.time() - tmp
print("Lasso Training Time: %s seconds" % (str(round(gpu_time,3))))
gc.collect()

tmp = time.time()
cTSNE(n_components=2).fit(X=x_cudf.drop(["Survived"],axis=1)[:10000])
gpu_time = time.time() - tmp
print("TSNE Training Time: %s seconds" % (str(round(gpu_time,3))))



tmp = time.time()
cDBSCAN(eps=0.6, min_samples=2).fit(X=x_cudf.drop(["Survived"],axis=1)[:100000])
gpu_time = time.time() - tmp
print("DbScan Training Time: %s seconds" % (str(round(gpu_time,3))))


tmp = time.time()
cPCA(n_components=2).fit(X=x_cudf.drop(["Survived"],axis=1)[:100000])
gpu_time = time.time() - tmp
print("PCA Training Time: %s seconds" % (str(round(gpu_time,3))))

### Random Forest Cpu Sklearn 

In [None]:
tmp = time.time()
RandomForestClassifier(n_estimators = 150, max_depth=13).fit(X=x.drop(["Survived"],axis=1)[:1000000],y=x["Survived"][:1000000])
cpu_time = time.time() - tmp
print("Random Forest Time: %s seconds" % (str(round(cpu_time,3))))

### Random Forest Gpu Rapids 

In [None]:
tmp = time.time()
model = cRandomForestClassifier(n_estimators = 150, max_depth=13)
model.fit(X=x_cudf.drop(["Survived"],axis=1)[:1000000],y=x_cudf["Survived"].astype("int32")[:1000000])
gpu_time = time.time() - tmp
print("Random Forest Training Time: %s seconds" % (str(round(gpu_time,3))))

<h3>This is just my opinion after a few days of using Rapids</h3>

<h4>In my opinion, Rapids cuML will overtake Sklearn in just a few years. Sklearn still has more algorithms, but I expect Rapids to catch up.</h4>
<h4> 
cuDF is good for reducing preprocessing time, but it is still not as easy to use as pandas and does not support working on strings in apply_rows. Yes, I know much of the preprocessing can be done another way, with replace and other functions, but we are familiar with pandas apply, so it would be nice if string support were added to apply_rows. Other than that, Rapids is a game-changer: it saves a ton of time when processing large datasets.</h4>

# Rapids Supported Algorithms  <a class="anchor" id="rsa"></a>
<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;hight:20px;float:left" >Back to the table of contents</a>

<b>Classification / Regression </b>
<ul>
<li>Linear Regression</li>
<li>Logistic Regression</li>
<li>Ridge Regression</li>
<li>Lasso Regression</li>
<li>ElasticNet Regression</li>
<li>Mini Batch SGD Classifier</li>
<li>Mini Batch SGD Regressor</li>
<li>Stochastic Gradient Descent</li>
<li>Random Forest</li>
<li>Forest Inferencing</li>
<li>Coordinate Descent</li>
<li>Quasi-Newton</li>
<li>Support Vector Machines</li>
<li>Nearest Neighbors Classification</li>
<li>Nearest Neighbors Regression</li></ul>


<b>Clustering</b><ul>
<li>K-Means Clustering</li>
<li>DBSCAN</li>
</ul>


<b>Dimensionality Reduction and Manifold Learning</b>
<ul>
<li>Principal Component Analysis</li>
<li>Truncated SVD</li>
<li>UMAP</li>
<li>Random Projections</li>
<li>TSNE</li></ul>


<b>Time Series</b>
<ul>
<li>HoltWinters</li>
<li>ARIMA</li>
</ul>    
    

<a href="#tea"><img  src="https://za.heytv.org/wp-content/uploads/2019/08/AGF-l79DYZtk_pSyfWgIP3D-3yi8YN6ZeWO0E8tyLgs800-c-k-c0xffffffff-no-rj-mo.jpeg" style="height: 300px"/></a>