This notebook walks through a basic example of using the GPU-accelerated estimators from [RAPIDS](https://rapids.ai/) cuML and [DMLC/XGBoost](https://github.com/dmlc/xgboost) with TPOT for classification tasks. You must have access to an NVIDIA GPU and have cuML installed in your environment. Running this notebook without cuML installed causes TPOT to raise a `ValueError` telling you to install cuML.

It is intended to show how the `TPOT cuML` configuration can substantially reduce search time on medium-sized and larger datasets.
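Before running the rest of the notebook, it can help to confirm that cuML is importable. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and not part of TPOT or cuML. It only checks that the package is installed, not that a usable GPU is present.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# cuML is imported under the module name "cuml".
if has_module("cuml"):
    print("cuML found: the 'TPOT cuML' configuration should be usable.")
else:
    print("cuML not found: install RAPIDS cuML before using 'TPOT cuML'.")
```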

## Downloading Data

This example uses the Higgs Boson [dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) from the UCI Machine Learning Repository.

In [1]:
import os

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from tpot import TPOTClassifier

In [14]:
# This is a 2.7 GB file.
# Please make sure you have enough disk space before running this cell.
# The download is skipped if the file already exists.

if not os.path.isfile("HIGGS.csv.gz"):
    !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
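The `!wget` line above relies on IPython shell syntax. Outside a notebook, the same "download only if missing" logic could be sketched with the standard library as follows. The function name `download_if_missing` is hypothetical; the URL is the one used in the cell above.

```python
import os
import urllib.request

HIGGS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"


def download_if_missing(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.isfile(dest):
        return False
    urllib.request.urlretrieve(url, dest)
    return True


# Usage (commented out because the file is about 2.7 GB):
# download_if_missing(HIGGS_URL, "HIGGS.csv.gz")
```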

In [15]:
# This function is borrowed from https://github.com/NVIDIA/gbm-bench/blob/master/datasets.py
# Thanks!

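def prepare_higgs(dataset_folder, nrows=None):
    # dataset_folder is accepted but unused here; the file is read from
    # the current working directory.
    higgs = pd.read_csv("HIGGS.csv.gz", nrows=nrows)
    # Column 0 holds the class label; the remaining columns are the features.
    X = higgs.iloc[:, 1:].to_numpy(dtype=np.float32)
    y = higgs.iloc[:, 0].to_numpy(dtype=np.int64)
    # Stratified 80/20 train/test split with a fixed seed.
    return train_test_split(X, y, stratify=y, random_state=77, test_size=0.2)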

## Running TPOTClassifier

In the interest of time, we'll use only the first 500,000 rows of this file, which is more than enough for this example.

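With the example configuration below (10 generations, a population size of 10, and two-fold cross-validation), the search with the `TPOT cuML` configuration finished in about 18 minutes of wall time in the run recorded in this notebook, versus about 5 hours 17 minutes for the default configuration. That is a speedup of roughly 17x.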

Such speedups also mean you can run larger evolutionary searches while **still** getting results sooner.
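To get a feel for the size of this search, the sketch below computes how many pipelines TPOT evaluates for these settings. It assumes the formula given in TPOT's documentation (`population_size + generations * offspring_size`, with `offspring_size` defaulting to `population_size`). The helper name is hypothetical, and the fit count is only an upper-bound estimate, since duplicate or failed pipelines may not be fit for every fold.

```python
from typing import Optional


def total_pipeline_evaluations(
    generations: int,
    population_size: int,
    offspring_size: Optional[int] = None,
) -> int:
    """Number of pipelines evaluated over a whole TPOT search."""
    if offspring_size is None:
        # Assumption: offspring_size defaults to population_size.
        offspring_size = population_size
    return population_size + generations * offspring_size


n_pipelines = total_pipeline_evaluations(generations=10, population_size=10)
n_fits = n_pipelines * 2  # two-fold cross-validation: up to two fits per pipeline
print(n_pipelines, n_fits)  # 110 220
```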

In [17]:
NROWS = 500_000
X_train, X_test, y_train, y_test = prepare_higgs("./", nrows=NROWS)

Note that for cuML to work correctly, you must set `n_jobs=1` (the default setting).
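One way to keep this constraint from being forgotten is to derive `n_jobs` from the choice of configuration. The helper below is a hypothetical sketch, not part of TPOT; it only builds a dictionary of keyword arguments and assumes that `config_dict=None` selects TPOT's default configuration.

```python
def tpot_search_kwargs(use_gpu: bool) -> dict:
    """Return config_dict / n_jobs settings for a GPU or CPU TPOT search."""
    if use_gpu:
        # cuML requires n_jobs=1.
        return {"config_dict": "TPOT cuML", "n_jobs": 1}
    # CPU search: default configuration, use all available cores.
    return {"config_dict": None, "n_jobs": -1}


print(tpot_search_kwargs(use_gpu=True))
```

The result could then be unpacked into the constructor, for example `TPOTClassifier(generations=10, population_size=10, **tpot_search_kwargs(use_gpu=True))`.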

In [9]:
%%time

# cuML TPOT setup
SEED = 12
GENERATIONS = 10
POP_SIZE = 10
CV = 2

tpot = TPOTClassifier(
    generations=GENERATIONS,
    population_size=POP_SIZE,
    random_state=SEED,
    config_dict="TPOT cuML",
    n_jobs=1,  # cuML requires n_jobs=1, the default
    cv=CV,
    verbosity=2,
)

tpot.fit(X_train, y_train)

Generation 1 - Current best internal CV score: 0.7103025000000001
Generation 2 - Current best internal CV score: 0.71385
Generation 3 - Current best internal CV score: 0.725755
Generation 4 - Current best internal CV score: 0.7299725
Generation 5 - Current best internal CV score: 0.7299725
Generation 6 - Current best internal CV score: 0.7299725
Generation 7 - Current best internal CV score: 0.7309975
Generation 8 - Current best internal CV score: 0.7309975
Generation 9 - Current best internal CV score: 0.7309975
Generation 10 - Current best internal CV score: 0.7309975
Best pipeline: XGBClassifier(ZeroCount(input_matrix), alpha=1, learning_rate=0.1, max_depth=6, min_child_weight=13, n_estimators=100, nthread=1, subsample=0.8500000000000001, tree_method=gpu_hist)
CPU times: user 4min 59s, sys: 13min 27s, total: 18min 27s
Wall time: 18min 29s


TPOTClassifier(config_dict='TPOT cuML', cv=2, generations=10,
               log_file=<ipykernel.iostream.OutStream object at 0x7f282044a7d0>,
               population_size=10, random_state=12, verbosity=2)

In [11]:
%%time

preds = tpot.predict(X_test)
print(accuracy_score(y_test, preds))

0.7308499813079834
CPU times: user 565 ms, sys: 5.52 ms, total: 570 ms
Wall time: 569 ms


In [12]:
%%time

# Default TPOT setup with same params
tpot = TPOTClassifier(
    generations=GENERATIONS,
    population_size=POP_SIZE,
    random_state=SEED,
    n_jobs=-1,
    cv=CV,
    verbosity=2,
)

tpot.fit(X_train, y_train)

Generation 1 - Current best internal CV score: 0.7184675
Generation 2 - Current best internal CV score: 0.7184675
Generation 3 - Current best internal CV score: 0.7198
Generation 4 - Current best internal CV score: 0.7210825000000001
Generation 5 - Current best internal CV score: 0.7222999999999999
Generation 6 - Current best internal CV score: 0.7222999999999999
Generation 7 - Current best internal CV score: 0.7270125000000001
Generation 8 - Current best internal CV score: 0.73546
Generation 9 - Current best internal CV score: 0.73546
Generation 10 - Current best internal CV score: 0.735545
Best pipeline: XGBClassifier(OneHotEncoder(input_matrix, minimum_fraction=0.2, sparse=False, threshold=10), learning_rate=0.1, max_depth=9, min_child_weight=19, n_estimators=100, nthread=1, subsample=1.0)
CPU times: user 10min, sys: 1min 8s, total: 11min 9s
Wall time: 5h 17min 28s


TPOTClassifier(cv=2, generations=10,
               log_file=<ipykernel.iostream.OutStream object at 0x7f282044a7d0>,
               n_jobs=-1, population_size=10, random_state=12, verbosity=2)

In [14]:
%%time

preds = tpot.predict(X_test)
print(accuracy_score(y_test, preds))

0.7378900051116943
CPU times: user 968 ms, sys: 0 ns, total: 968 ms
Wall time: 967 ms
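The two runs can be compared directly from the `%%time` output and accuracy scores recorded above. The sketch below is illustrative only; the parser `wall_time_seconds` is a hypothetical helper that handles just the `h`, `min`, and `s` units appearing in the two fit timings.

```python
import re

_UNIT_SECONDS = {"h": 3600, "min": 60, "s": 1}


def wall_time_seconds(text: str) -> int:
    """Convert a wall-time string such as '5h 17min 28s' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(h|min|s)\b", text):
        total += int(value) * _UNIT_SECONDS[unit]
    return total


gpu_fit = wall_time_seconds("18min 29s")      # TPOT cuML configuration
cpu_fit = wall_time_seconds("5h 17min 28s")   # default configuration, n_jobs=-1
speedup = cpu_fit / gpu_fit

# Test-set accuracies copied from the outputs above.
gpu_accuracy = 0.7308499813079834
cpu_accuracy = 0.7378900051116943
accuracy_gap = cpu_accuracy - gpu_accuracy

print(f"fit speedup: {speedup:.1f}x")
print(f"accuracy gap: {accuracy_gap:.4f}")
```

In this particular run, the default configuration ends about 0.7 percentage points higher in test accuracy, while the `TPOT cuML` search takes roughly one seventeenth of the wall time.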
