[Home Page](../../START_HERE.ipynb)

[Previous Notebook](Challenge.ipynb)

[1](Challenge.ipynb)
[2]


# Solution
Scikit-Learn is an incredibly powerful toolkit that allows data scientists to quickly build models from their data, and it is one of the most common and useful tools in the Python data science ecosystem. cuML is the RAPIDS library that implements similar machine learning algorithms using CUDA to run on GPUs, with an API that mirrors scikit-learn's as closely as possible.

This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. These data were used to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).

Here is the dataset link: https://www.kaggle.com/crawford/gene-expression

## Here is the list of exercises and modules to work on in the lab:

- Convert the serial pandas computations to cuDF operations.
- Use cuML to accelerate the machine learning models.
- Experiment with Dask to create a cluster, distribute the data, and scale the operations.

You will start writing code from <a href='#ex'>here</a>, but make sure you execute the data processing blocks to understand the dataset.



The first step is to download the dataset from the link above and place it in the data directory (`host/data`) used by this tutorial. Then we import the necessary libraries.

In [1]:
# Core Python, numerical, and plotting libraries
import sys
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Scikit-learn (CPU) utilities and models
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
from sklearn import preprocessing, linear_model, model_selection, datasets
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.ensemble import RandomForestClassifier as sklRF
from sklearn.linear_model import LinearRegression, Ridge, SGDClassifier
from sklearn.linear_model import LinearRegression as skLinearRegression
from sklearn.linear_model import LogisticRegression as skLogistic
from sklearn.linear_model import SGDRegressor as skSGD

# RAPIDS (GPU) libraries and cuML models
# Note: the unaliased names ElasticNet, Lasso, and LogisticRegression refer to the cuML classes
import cudf
import cupy
from cuml import make_regression, LogisticRegression
from cuml.linear_model import ElasticNet, Lasso
from cuml.linear_model import LinearRegression as cuLinearRegression
from cuml.linear_model import MBSGDClassifier as cumlMBSGDClassifier
from cuml.solvers import SGD as cumlSGD
from cuml.neighbors import KNeighborsClassifier as KNeighborsC
from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics.regression import r2_score

# Dask for multi-GPU execution
import dask_cudf
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from cuml.dask.common import utils as dask_utils
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF

NumPy Version: 1.19.2
Scikit-Learn Version: 0.23.1


We'll read the dataframe into y from the csv file, view its dimensions and observe the first 5 rows of the dataframe.

In [2]:
%%time
y = pd.read_csv('../../data/actual.csv')
print(y.shape)
y.head()

(72, 2)
CPU times: user 2.11 ms, sys: 2.15 ms, total: 4.27 ms
Wall time: 3.95 ms


Unnamed: 0,patient,cancer
0,1,ALL
1,2,ALL
2,3,ALL
3,4,ALL
4,5,ALL


Let's convert our target variable categories to numbers.

In [3]:
y['cancer'].value_counts()
# Recode label to numeric
y = y.replace({'ALL':0,'AML':1})
labels = ['ALL', 'AML'] # for plotting convenience later on
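
The recoding step can be made slightly more defensive. The sketch below works on a small made-up frame rather than `actual.csv`: it applies `Series.map` to the `cancer` column only and checks that no label was left unmapped. The toy data and the `label_map` name are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for actual.csv (values are made up for illustration)
toy = pd.DataFrame({"patient": [1, 2, 3, 4], "cancer": ["ALL", "AML", "ALL", "AML"]})

# Class balance before recoding
print(toy["cancer"].value_counts())

# Map only the label column; labels missing from the mapping would become NaN
label_map = {"ALL": 0, "AML": 1}
toy["cancer"] = toy["cancer"].map(label_map)

# Fail loudly if any label was not covered by the mapping
assert toy["cancer"].notna().all(), "unmapped label found"
toy["cancer"] = toy["cancer"].astype("int64")
print(toy)
```

Mapping a single column avoids accidentally replacing the strings `ALL` or `AML` if they ever appeared in another column.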

Read the training and test data provided in the challenge from the data folder. View their dimensions.

In [4]:
# Import training data
df_train = pd.read_csv('../../data/data_set_ALL_AML_train.csv')
print(df_train.shape)

# Import testing data
df_test = pd.read_csv('../../data/data_set_ALL_AML_independent.csv')
print(df_test.shape)

(7129, 78)
(7129, 70)


Observe the first few rows of the train dataframe and the data format.

In [5]:
df_train.head()

Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,88,A,283,A,309,A,12,A,...,193,A,312,A,230,P,330,A,337,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-295,A,-264,A,-376,A,-419,A,...,-51,A,-139,A,-367,A,-188,A,-407,A


Observe the first few rows of the test dataframe and the data format.

In [6]:
df_test.head()

Unnamed: 0,Gene Description,Gene Accession Number,39,call,40,call.1,42,call.2,47,call.3,...,65,call.29,66,call.30,63,call.31,64,call.32,62,call.33
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-342,A,-87,A,22,A,-243,A,...,-62,A,-58,A,-161,A,-48,A,-176,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-200,A,-248,A,-153,A,-218,A,...,-198,A,-217,A,-215,A,-531,A,-284,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,41,A,262,A,17,A,-163,A,...,-5,A,63,A,-46,A,-124,A,-81,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,328,A,295,A,276,A,182,A,...,141,A,95,A,146,A,431,A,9,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-224,A,-226,A,-211,A,-289,A,...,-256,A,-191,A,-172,A,-496,A,-294,A


As we can see, the dataset contains categorical values, but only in the columns whose names start with "call". We won't use these columns, so we remove them.

In [7]:
# Remove "call" columns from training and testing data
train_to_keep = [col for col in df_train.columns if "call" not in col]
test_to_keep = [col for col in df_test.columns if "call" not in col]

X_train_tr = df_train[train_to_keep]
X_test_tr = df_test[test_to_keep]
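
As a quick illustration of the column filter, here is the same list comprehension on a toy frame. The column names mimic the dataset, but the values are made up.

```python
import pandas as pd

# Toy frame with the same column pattern as the real data (values are made up)
toy = pd.DataFrame({
    "Gene Description": ["g1", "g2"],
    "Gene Accession Number": ["A_at", "B_at"],
    "1": [-214, -153], "call": ["A", "A"],
    "2": [-139, -73], "call.1": ["A", "P"],
})

# Keep every column whose name does not contain the substring "call"
keep = [col for col in toy.columns if "call" not in col]
filtered = toy[keep]
print(filtered.columns.tolist())
```

Note that `"call" not in col` is a substring test, so it would also drop a hypothetical column such as `recall`; `col.startswith("call")` is a stricter alternative if that matters for your data.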

Rename the columns and reindex for formatting purposes and ease in reading the data.

In [8]:
train_columns_titles = ['Gene Description', 'Gene Accession Number', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', 
       '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']

X_train_tr = X_train_tr.reindex(columns=train_columns_titles)

In [9]:
test_columns_titles = ['Gene Description', 'Gene Accession Number','39', '40', '41', '42', '43', '44', '45', '46',
       '47', '48', '49', '50', '51', '52', '53',  '54', '55', '56', '57', '58', '59',
       '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72']

X_test_tr = X_test_tr.reindex(columns=test_columns_titles)

We will take the transpose of the dataframe so that each row is a patient and each column is a gene.

In [10]:
X_train = X_train_tr.T
X_test = X_test_tr.T

print(X_train.shape) 
X_train.head()

(40, 7129)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
Gene Description,AFFX-BioB-5_at (endogenous control),AFFX-BioB-M_at (endogenous control),AFFX-BioB-3_at (endogenous control),AFFX-BioC-5_at (endogenous control),AFFX-BioC-3_at (endogenous control),AFFX-BioDn-5_at (endogenous control),AFFX-BioDn-3_at (endogenous control),AFFX-CreX-5_at (endogenous control),AFFX-CreX-3_at (endogenous control),AFFX-BioB-5_st (endogenous control),...,Transcription factor Stat5b (stat5b) mRNA,Breast epithelial antigen BA46 mRNA,GB DEF = Calcium/calmodulin-dependent protein ...,TUBULIN ALPHA-4 CHAIN,CYP4B1 Cytochrome P450; subfamily IVB; polypep...,PTGER3 Prostaglandin E receptor 3 (subtype EP3...,HMG2 High-mobility group (nonhistone chromosom...,RB1 Retinoblastoma 1 (including osteosarcoma),GB DEF = Glycophorin Sta (type A) exons 3 and ...,GB DEF = mRNA (clone 1A7)
Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41


Next we clean the data: promote the gene accession numbers to column names, remove the two descriptive rows, and convert the values to numeric types.

In [11]:
# Clean up the column names for training data
X_train.columns = X_train.iloc[1]
X_train = X_train.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)

# Clean up the column names for Testing data
X_test.columns = X_test.iloc[1]
X_test = X_test.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)

print(X_train.shape)
print(X_test.shape)
X_train.head()

(38, 7129)
(34, 7129)


Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
4,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
5,-106,-125,-76,168,-230,-284,4,-122,70,252,...,156,649,57,504,-26,250,314,14,56,-25
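
The transpose-and-clean steps above can be followed on a tiny example. This is a sketch on made-up data with two genes and three patients; the expression values are stored as strings here to show why `pd.to_numeric` is applied.

```python
import pandas as pd

# Toy gene-by-patient table (two genes, three patients; values are made up)
genes = pd.DataFrame({
    "Gene Description": ["desc A", "desc B"],
    "Gene Accession Number": ["A_at", "B_at"],
    "1": ["-214", "-153"],
    "2": ["-139", "-73"],
    "3": ["-76", "-49"],
})

patients = genes.T                        # rows become patients, plus two descriptive rows
patients.columns = patients.iloc[1]       # use the accession numbers as column names
patients = patients.drop(["Gene Description", "Gene Accession Number"])
patients = patients.apply(pd.to_numeric)  # object columns become integer columns

print(patients.shape)
print(patients)
```

After the transpose every column has dtype `object`, because each one mixes text and numbers, which is why the explicit numeric conversion is needed.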


We have 38 patients as rows in the training set and the other 34 as rows in the testing set, each with 7129 gene expression features. But we haven't yet associated the target labels with the right patients. Recall that all the labels are stored in a single dataframe. Let's split the labels so that patients and labels match up across the training and testing dataframes: the first 38 patients' cancer types go with the training set, and the rest go with the testing set.

In [12]:
X_train = X_train.reset_index(drop=True)
y_train = y[y.patient <= 38].reset_index(drop=True)

# Subset the rest for testing
X_test = X_test.reset_index(drop=True)
y_test = y[y.patient > 38].reset_index(drop=True)
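
Because `reset_index(drop=True)` discards the patient IDs, the split above relies on the feature rows and the label rows already being in the same patient order. The following sketch shows one way to make that alignment explicit on toy data; the `features` and `labels` frames are made up for illustration.

```python
import pandas as pd

# Toy feature matrix indexed by patient ID (strings, like the transposed column names)
features = pd.DataFrame({"A_at": [5, 7, 9]}, index=["3", "1", "2"])
# Toy label table
labels = pd.DataFrame({"patient": [1, 2, 3], "cancer": [0, 1, 0]})

# Put both tables in patient order before dropping the IDs
features.index = features.index.astype(int)
features = features.sort_index()
labels = labels.sort_values("patient")

# Verify that row i of the features and row i of the labels are the same patient
assert (features.index == labels["patient"].to_numpy()).all()

X = features.reset_index(drop=True)
y = labels["cancer"].reset_index(drop=True)
print(X.join(y))
```

In the notebook itself, the earlier `reindex` calls put the patient columns in ascending order, which is what makes the positional match work.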

Generate descriptive statistics to analyse the data further.

In [13]:
X_train.describe()

Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
count,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,...,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0
mean,-120.868421,-150.526316,-17.157895,181.394737,-276.552632,-439.210526,-43.578947,-201.184211,99.052632,112.131579,...,178.763158,750.842105,8.815789,399.131579,-20.052632,869.052632,335.842105,19.210526,504.394737,-29.210526
std,109.555656,75.734507,117.686144,117.468004,111.004431,135.458412,219.482393,90.838989,83.178397,211.815597,...,84.82683,298.008392,77.108507,469.579868,42.346031,482.366461,209.826766,31.158841,728.744405,30.851132
min,-476.0,-327.0,-307.0,-36.0,-541.0,-790.0,-479.0,-463.0,-82.0,-215.0,...,30.0,224.0,-178.0,36.0,-112.0,195.0,41.0,-50.0,-2.0,-94.0
25%,-138.75,-205.0,-83.25,81.25,-374.25,-547.0,-169.0,-239.25,36.0,-47.0,...,120.0,575.5,-42.75,174.5,-48.0,595.25,232.75,8.0,136.0,-42.75
50%,-106.5,-141.5,-43.5,200.0,-263.0,-426.5,-33.5,-185.5,99.5,70.5,...,174.5,700.0,10.5,266.0,-18.0,744.5,308.5,20.0,243.5,-26.0
75%,-68.25,-94.75,47.25,279.25,-188.75,-344.75,79.0,-144.75,152.25,242.75,...,231.75,969.5,57.0,451.75,9.25,1112.0,389.5,30.25,487.25,-11.5
max,17.0,-20.0,265.0,392.0,-51.0,-155.0,419.0,-24.0,283.0,561.0,...,356.0,1653.0,218.0,2527.0,52.0,2315.0,1109.0,115.0,3193.0,36.0


Clearly there is some variation in the scales across the different features. Many machine learning models work much better with data that's on the same scale, so let's create a scaled version of the dataset.

In [14]:
X_train_fl = X_train.astype(np.float64)
X_test_fl = X_test.astype(np.float64)

# Apply the same scaling to both datasets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_fl)
X_test = scaler.transform(X_test_fl) # note that we transform rather than fit_transform
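
To see what the scaler computes, here is a NumPy-only sketch of standardization on random toy matrices (not the gene expression data). It uses the population standard deviation (`ddof=0`), which matches what `StandardScaler` uses; the array shapes are chosen only to echo the 38/34 split.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(38, 4))  # toy "training" matrix
test = rng.normal(loc=100.0, scale=20.0, size=(34, 4))   # toy "test" matrix

# Statistics come from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse mu and sigma; do not refit on the test data

print(train_scaled.mean(axis=0).round(6))  # approximately 0 for every feature
print(train_scaled.std(axis=0).round(6))   # approximately 1 for every feature
```

Reusing the training statistics on the test set keeps information about the test data from leaking into preprocessing, which is why the notebook calls `transform` rather than `fit_transform` on `X_test_fl`.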

<a id='ex2'></a><br>
Convert the scaled NumPy arrays and the pandas label dataframe to cuDF dataframes for the cuML tasks that follow.

In [15]:
%%time
X_cudf_train = cudf.DataFrame(X_train)
X_cudf_test = cudf.DataFrame(X_test)

y_cudf_train = cudf.DataFrame(y_train)
#y_cudf_test = cudf.Series(y_test.values)

CPU times: user 1.58 s, sys: 435 ms, total: 2.02 s
Wall time: 2.03 s
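
The conversion works because cuDF mirrors much of the pandas API. The sketch below illustrates that idea on a tiny array. As an assumption made so the example runs without a GPU, it falls back to pandas when `cudf` cannot be imported; the alias `xdf` is hypothetical.

```python
import numpy as np
import pandas as pd

try:
    import cudf as xdf  # GPU DataFrame library (needs a CUDA-capable GPU)
except Exception:
    xdf = pd            # assumption: fall back to pandas so the sketch still runs

arr = np.arange(6, dtype=np.float64).reshape(3, 2)

# Both libraries accept a pandas DataFrame in their DataFrame constructor
frame = xdf.DataFrame(pd.DataFrame(arr, columns=["g0", "g1"]))

# The same method calls work on either backend
col_means = frame.mean()
print(type(frame).__module__)
print(col_means)
```

On a machine with RAPIDS installed the printed module name starts with `cudf`; otherwise it starts with `pandas`.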


The cells below show how to fit and score a model on this dataset, starting with the scikit-learn-style workflow for ElasticNet. Note that ElasticNet is a regression model, so `score` returns the coefficient of determination (R²) rather than a classification accuracy. Observe which dataframe is used and how the model is created.

## ElasticNet
### Scikit-learn model

#### Fit

In [16]:
%%time
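# Note: `ElasticNet` was imported from cuml.linear_model, so this cell fits the cuML
# estimator on NumPy input; use sklearn.linear_model.ElasticNet for a CPU-only baseline.
regr = ElasticNet()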
regr.fit(X_train, y_train.iloc[:,1])

CPU times: user 364 ms, sys: 77.4 ms, total: 441 ms
Wall time: 440 ms


ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, selection='cyclic', handle=<cuml.raft.common.handle.Handle object at 0x7f13bc25ab10>, output_type='numpy', verbose=4)

#### Evaluate

In [17]:
%%time
X_test = X_test.astype(np.float64)
y_test = y_test.astype(np.float64)
print(regr.score(X_test,y_test.iloc[:,1]))

-0.06174317372378324
CPU times: user 1.91 ms, sys: 1.97 ms, total: 3.88 ms
Wall time: 3.11 ms
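
The negative score deserves a short explanation. For a regressor, `score` is R², which drops below zero when the predictions are worse than simply predicting the mean of the test labels. One plausible reading (an assumption, not something verified in this notebook) is that the default regularization shrinks the coefficients so strongly that the model predicts roughly the training mean. The sketch below computes R² by hand on made-up labels to show how such a constant prediction scores.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up 0/1 labels, not the leukemia data
y_train_toy = np.array([0, 0, 0, 1])  # training mean is 0.25
y_test_toy = np.array([0, 1, 0, 1])   # test mean is 0.5

# A model with all coefficients at zero predicts the training mean everywhere
constant_pred = np.full(y_test_toy.shape, y_train_toy.mean())
r2_constant = r2_score_manual(y_test_toy, constant_pred)
r2_perfect = r2_score_manual(y_test_toy, y_test_toy)

print(r2_constant)  # negative: worse than predicting the test mean
print(r2_perfect)
```

Because the training and test class balances differ, a constant prediction based on the training mean lands slightly below zero, which is the same qualitative pattern as the score reported above.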


<a id='ex3'></a><br>

### CuML model

#### Fit

In [18]:
%%time
enet = ElasticNet()

enet.fit(X_cudf_train, y_train.iloc[:,1])

CPU times: user 696 ms, sys: 21 ms, total: 717 ms
Wall time: 721 ms


ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, selection='cyclic', handle=<cuml.raft.common.handle.Handle object at 0x7f13bc25a9b0>, output_type='cudf', verbose=4)

#### Evaluate

In [19]:
%%time
X_cudf_test = X_cudf_test.astype(np.float64)
print(enet.score(X_cudf_test, y_test.iloc[:,1]))

-0.06174317372378324
CPU times: user 1.12 s, sys: 602 µs, total: 1.12 s
Wall time: 1.12 s


# Logistic Regression

## Scikit-learn

### Fit

In [20]:
%%time
clf = skLogistic()
clf.fit(X_train, y_train.iloc[:,1])


CPU times: user 2.25 s, sys: 4.74 s, total: 6.99 s
Wall time: 119 ms


LogisticRegression()

### Evaluate

In [21]:
%%time
print(clf.score(X_test, y_test.iloc[:,1]))

0.8235294117647058
CPU times: user 57.7 ms, sys: 84.8 ms, total: 143 ms
Wall time: 2.23 ms


## CuML

### Fit

In [22]:
%%time
reg = LogisticRegression()
reg.fit(X_cudf_train,y_cudf_train.iloc[:,1])

CPU times: user 2.34 s, sys: 4.21 s, total: 6.54 s
Wall time: 408 ms


LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=4, l1_ratio=None, solver='qn', handle=<cuml.raft.common.handle.Handle object at 0x7f13b388cd70>, output_type='cudf')

### Evaluate

In [23]:
%%time
print(reg.score(X_cudf_test, y_test.iloc[:,1]))

0.8529411554336548
CPU times: user 756 ms, sys: 1.18 ms, total: 757 ms
Wall time: 757 ms


# Nearest Neighbours Classifier

## Scikit-learn

### Fit

In [24]:
%%time
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train.iloc[:,1])

CPU times: user 6.71 ms, sys: 677 µs, total: 7.39 ms
Wall time: 6.47 ms


KNeighborsClassifier(n_neighbors=3)

### Evaluate

In [25]:
%%time
print(neigh.score(X_test, y_test.iloc[:,1]))

0.7058823529411765
CPU times: user 17.1 ms, sys: 2.11 ms, total: 19.2 ms
Wall time: 18.6 ms


## CuML

### Fit

In [26]:
%%time
# Note: n_neighbors=10 here, whereas the scikit-learn model above uses n_neighbors=3
knn = KNeighborsC(n_neighbors=10)
knn.fit(X_cudf_train, y_cudf_train.iloc[:,1])

CPU times: user 1.24 s, sys: 14 ms, total: 1.26 s
Wall time: 1.26 s


KNeighborsClassifier(weights='uniform')

### Evaluate

In [27]:
%%time
print(knn.score(X_cudf_test, y_test.iloc[:,1]))

0.6470588445663452
CPU times: user 1.24 s, sys: 133 ms, total: 1.38 s
Wall time: 1.38 s


# Dask Integration

We will now train a Random Forest classifier using cuML and Dask.

## Start Dask cluster

In [28]:
# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1)
c = Client(cluster)

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization

## Define Parameters

In addition to the number of examples, random forest fitting performance depends heavily on the number of columns in a dataset and (especially) on the maximum depth to which trees are allowed to grow. Lower `max_depth` values can greatly speed up fitting, though going too low may reduce accuracy.

In [29]:
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
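
To get a feel for why `max_depth` matters so much, the sketch below prints the upper bound on the number of nodes in a single binary tree as the depth limit grows. It assumes the convention that the root sits at depth 0; libraries may count depth slightly differently, and real trees on a 38-sample dataset stop growing long before reaching this bound.

```python
def max_nodes(depth: int) -> int:
    """Upper bound on the nodes in a binary tree whose root is at depth 0."""
    return 2 ** (depth + 1) - 1

for depth in (4, 8, 12, 16):
    print(f"max_depth={depth:>2}: up to {max_nodes(depth):,} nodes per tree")
```

The bound roughly doubles with every extra level, so on large datasets each additional unit of depth can noticeably increase fitting time.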

## Distribute data to worker GPUs

In [30]:
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

In [31]:
n_partitions = n_workers

def distribute(X, y):
    # First convert to cudf (with real data, you would likely load in cuDF format to start)
    X_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X))
    y_cudf = cudf.Series(y)

    # Partition with Dask
    # In this case, each worker will train on 1/n_partitions fraction of the data
    X_dask = dask_cudf.from_cudf(X_cudf, npartitions=n_partitions)
    y_dask = dask_cudf.from_cudf(y_cudf, npartitions=n_partitions)

    # Persist to cache the data in active memory
    X_dask, y_dask = \
      dask_utils.persist_across_workers(c, [X_dask, y_dask], workers=workers)
    
    return X_dask, y_dask

X_train_dask, y_train_dask = distribute(X_train, y_train.iloc[:,1])
X_test_dask, y_test_dask = distribute(X_test, y_test.iloc[:,1])
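
The comment in `distribute` says that each worker trains on a `1/n_partitions` fraction of the data. The helper below is a hypothetical, plain-Python illustration of how many rows that leaves per worker; the exact partition boundaries chosen by `dask_cudf` may differ.

```python
def partition_sizes(n_rows: int, n_partitions: int) -> list:
    """Split n_rows into n_partitions nearly equal contiguous chunks."""
    base, extra = divmod(n_rows, n_partitions)
    return [base + (1 if i < extra else 0) for i in range(n_partitions)]

# 38 training patients spread over different worker counts
for n_workers_toy in (1, 2, 4):
    print(n_workers_toy, "workers ->", partition_sizes(38, n_workers_toy))
```

With only 38 training rows, two workers would each see about 19 patients, so the trees built on each worker have very little data to learn from. That may be one reason the distributed model scores differently from a forest trained on all rows at once.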

## Create the Scikit-learn model

Scikit-learn has no multi-GPU random forest, so as a baseline we train its `RandomForestClassifier` on all available CPU cores (`n_jobs=-1`).

In [38]:
%%time

# Use all available CPU cores
skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)
skl_model.fit(X_train, y_train.iloc[:,1])

CPU times: user 2.39 s, sys: 757 ms, total: 3.14 s
Wall time: 2.17 s


RandomForestClassifier(max_depth=12, n_estimators=1000, n_jobs=-1)


## Train the distributed cuML model

In [39]:
%%time

cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)

wait(cuml_model.rfs) # Allow asynchronous training tasks to finish

CPU times: user 66.5 ms, sys: 9.68 ms, total: 76.2 ms
Wall time: 1.06 s


DoneAndNotDoneFutures(done={<Future: finished, type: cuml.RandomForestClassifier, key: _construct_rf-2e8840d7-b3db-48fa-ab39-009b433976e5>, <Future: finished, type: cuml.RandomForestClassifier, key: _construct_rf-b2094e50-7d2c-49d9-9f28-444e287e85d2>}, not_done=set())

## Predict and check accuracy

In [40]:
skl_y_pred = skl_model.predict(X_test)
cuml_y_pred = cuml_model.predict(X_test_dask).compute().to_array()

# Due to randomness in the algorithm, you may see slight variation in accuracies
print("SKLearn accuracy:  ", accuracy_score(y_test.iloc[:,1], skl_y_pred))
print("CuML accuracy:     ", accuracy_score(y_test_dask.compute().to_array(), cuml_y_pred))

SKLearn accuracy:   0.7941176470588235
CuML accuracy:      0.5882352941176471
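
With only 34 test patients, accuracy moves in coarse steps, so it helps to translate the printed accuracies back into counts of correctly classified patients. The sketch below does that for the two values printed above.

```python
n_test = 34  # number of test patients in this notebook

reported = {
    "scikit-learn random forest": 0.7941176470588235,
    "Dask cuML random forest": 0.5882352941176471,
}

# Convert each accuracy into a whole number of correct predictions
counts = {name: round(acc * n_test) for name, acc in reported.items()}
for name, correct in counts.items():
    print(f"{name}: {correct}/{n_test} correct")

step = 1 / n_test
print(f"one patient changes accuracy by about {step:.3f}")
```

The gap between the two models corresponds to seven patients, and any single patient shifts the accuracy by roughly three percentage points, so small differences between runs should not be over-interpreted.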


<a id='ex4'></a><br>

# CONCLUSION

Let's compare the performance of our solution!

| Algorithm | Scikit-learn score | Scikit-learn fit time (wall) | cuML score | cuML fit time (wall) |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| ElasticNet (R²) | -0.06 | 440 ms | -0.06 | 721 ms |
| Logistic Regression (accuracy) | 0.82 | 119 ms | 0.85 | 408 ms |
| Nearest Neighbours Classifier (accuracy) | 0.71 | 6.47 ms | 0.65 | 1.26 s |
| Random Forests Classifier (accuracy) | 0.79 | 2.17 s | 0.59 (Dask-cuML) | 1.06 s (Dask-cuML) |

Write down your observations and compare the cuML and scikit-learn scores. For the linear models they should be approximately equal; the nearest-neighbours runs use different `n_neighbors` values (3 and 10) and random forests are randomized, so those scores can differ more. We hope that you found this exercise exciting and beneficial in understanding RAPIDS better. Share your highest accuracy and try to use the unique features of RAPIDS to accelerate your data science pipelines. Don't restrict yourself to the concepts explained so far; use the documentation to apply more models and functions and achieve the best results.

For most models in this sample, the cuML implementation does not reduce the computation time and is often slower, because the dataset is too small for GPU acceleration to pay off. As the dataset grows, the GPU implementations are expected to show much larger gains. The Dask implementation roughly halves the random forest fit time (1.06 s versus 2.17 s) and can use multiple GPUs to carry out the calculations. Small differences in score between the two libraries likely come from differences in solvers, numerical precision, and randomness.
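
When comparing timings yourself, a single `%%time` measurement can be dominated by one-time setup costs. The helper below is a minimal, library-agnostic sketch: `time_call` is a hypothetical name, and it does not handle GPU-specific concerns such as asynchronous execution. It runs a warm-up call and then reports the best of several repeats.

```python
import time

def time_call(fn, *args, repeats=3, warmup=1):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    for _ in range(warmup):
        fn(*args)  # absorb one-time setup costs before measuring
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Example: time sorting a reversed list of 100,000 integers
elapsed = time_call(sorted, list(range(100_000, 0, -1)))
print(f"best of 3: {elapsed * 1e3:.2f} ms")
```

The same pattern can wrap a model's `fit` call, for example `time_call(model.fit, X, y)`, to compare CPU and GPU implementations on equal footing.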



# References



<p xmlns:dct="http://purl.org/dc/terms/">
  <a rel="license"
     href="http://creativecommons.org/publicdomain/zero/1.0/">
    <center><img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0"  /></center>
  </a>
 
</p>


- The dataset is licensed under a CC0: Public Domain license.

- T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286:531-537 (1999). Published: 1999.10.14.


## Licensing
  
This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0).
