This notebook walks through a basic example of using the GPU-accelerated estimators from [RAPIDS](https://rapids.ai/) cuML and [DMLC/XGBoost](https://github.com/dmlc/xgboost) with TPOT for classification tasks. You need access to an NVIDIA GPU and must have cuML installed in your environment. If cuML is missing, TPOT raises a `ValueError` telling you to install it.
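Before running the cells below, it can help to confirm that cuML is importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `cuml_available` is hypothetical (it is not part of TPOT or cuML), and the check says nothing about whether a usable GPU is present.

```python
import importlib.util


def cuml_available() -> bool:
    """Return True if a top-level `cuml` package can be located (without importing it)."""
    return importlib.util.find_spec("cuml") is not None


if cuml_available():
    print("cuML found.")
else:
    # Per the note above, TPOT is expected to raise a ValueError in this case.
    print("cuML not found; install it before using the 'TPOT cuML' configuration.")
```

Using `find_spec` avoids importing cuML itself, so the check stays cheap and does not initialize the GPU.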

In [4]:
from tpot import TPOTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [5]:
NSAMPLES = 50000
NFEATURES = 20
SEED = 12

# For cuML with TPOT, you must use CPU data (such as NumPy arrays)
X, y = make_classification(
    n_samples=NSAMPLES,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    n_redundant=0,
    class_sep=0.55,
    n_classes=2,
    random_state=SEED,
)

X = X.astype("float32")

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=SEED)

Note that for cuML to work correctly, you must set `n_jobs=1` (the default setting).
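One way to make this requirement harder to violate by accident is to validate the settings before constructing the estimator. The sketch below is an assumption-laden illustration: `check_cuml_settings` is a hypothetical helper, not a TPOT function, and it only encodes the rule stated in the note above.

```python
def check_cuml_settings(config_dict, n_jobs=1):
    """Raise early if the cuML configuration is combined with n_jobs != 1.

    Hypothetical helper for illustration; it is not part of TPOT.
    """
    if config_dict == "TPOT cuML" and n_jobs != 1:
        raise ValueError(
            f"config_dict='TPOT cuML' expects n_jobs=1, got n_jobs={n_jobs}"
        )
    return {"config_dict": config_dict, "n_jobs": n_jobs}


# Settings that follow the note above pass through unchanged.
settings = check_cuml_settings("TPOT cuML", n_jobs=1)
print(settings)
```

The returned dictionary could then be unpacked into the estimator call, for example `TPOTClassifier(**settings, ...)`.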

In [6]:
# TPOT setup
GENERATIONS = 5
POP_SIZE = 100
CV = 5

tpot = TPOTClassifier(
    generations=GENERATIONS,
    population_size=POP_SIZE,
    random_state=SEED,
    config_dict="TPOT cuML",
    n_jobs=1,  # cuML requires n_jobs=1, the default
    cv=CV,
    verbosity=2,
)

tpot.fit(X_train, y_train)

preds = tpot.predict(X_test)
print(accuracy_score(y_test, preds))

Generation 1 - Current best internal CV score: 0.9695733333333334
Generation 2 - Current best internal CV score: 0.9695733333333334
Generation 3 - Current best internal CV score: 0.9695733333333334
Generation 4 - Current best internal CV score: 0.9705333333333334
Generation 5 - Current best internal CV score: 0.9705333333333334
Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=20, weights=uniform)
0.97704


In [7]:
tpot.export('tpot_classification_cuml_pipeline.py')
print(tpot.export())

import numpy as np
import pandas as pd
from cuml.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=12)

# Average CV score on the training set was: 0.9705333333333334
exported_pipeline = KNeighborsClassifier(n_neighbors=20, weights="uniform")
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 12)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
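The exported script stops once `results` is computed and does not score the predictions. As a rough sketch of what scoring involves, the snippet below computes accuracy as the fraction of matching labels. The `accuracy` helper and the toy labels are illustrative assumptions, not output from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score empty inputs")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Illustrative toy labels only, not results from this notebook.
toy_true = [0, 1, 1, 0]
toy_pred = [0, 1, 0, 0]
print(accuracy(toy_true, toy_pred))  # 3 of 4 labels match -> 0.75
```

In the exported script, `accuracy_score(testing_target, results)` from `sklearn.metrics` (as used in the TPOT cell earlier) should give the same quantity, provided `results` is a host-side array or Series.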

