TPOT automates the most tedious part of machine learning: it uses genetic programming to explore thousands of candidate pipelines and find the best one for your data.

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

In [2]:
# !pip install tpot

# Implementation

In [3]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

In [4]:
digits = load_digits()

In [5]:
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.80, test_size=0.20)

In [6]:
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)

Generation 1 - Current best internal CV score: 0.952679295731436
Generation 2 - Current best internal CV score: 0.962477488379542
Generation 3 - Current best internal CV score: 0.962477488379542
Generation 4 - Current best internal CV score: 0.9742552743129256
Generation 5 - Current best internal CV score: 0.9784297532054452

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.05, min_samples_leaf=1, min_samples_split=5, n_estimators=100)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=5,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=20,
        random_state=None, scoring=None, subsample=1.0, template=None,
        use_dask=False, verbosity=2, warm_start=False)

In [7]:
print(tpot.score(X_test, y_test))

0.9638888888888889
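
The best pipeline reported above is a single random forest, so it can be rebuilt directly in scikit-learn without rerunning the search. The following is a minimal sketch of that idea, not the code TPOT itself exports. The hyperparameters are copied from the "Best pipeline" line; `random_state=42` is an assumption added here for reproducibility, so the accuracy will not match the figure above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    train_size=0.80, test_size=0.20,
    random_state=42,  # assumption: the original split was not seeded
)

# Hyperparameters taken from the "Best pipeline" line printed by TPOT
best = RandomForestClassifier(
    bootstrap=False,
    criterion="gini",
    max_features=0.05,
    min_samples_leaf=1,
    min_samples_split=5,
    n_estimators=100,
    random_state=42,  # assumption: added for reproducibility
)
best.fit(X_train, y_train)

accuracy = best.score(X_test, y_test)
print(accuracy)
```

From here the estimator is an ordinary scikit-learn object, so you can tune it further or drop it into a larger pipeline.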


# Regression

TPOT can also optimize pipelines for regression problems. Below is a minimal working example using the Boston housing prices data set.

In [9]:
from tpot import TPOTRegressor
# load_boston was removed in scikit-learn 1.2; this import needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split


In [10]:
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)
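
On scikit-learn 1.2 and later the Boston loader is no longer available, so the split has to come from another data set. A minimal sketch, assuming the bundled diabetes data (`load_diabetes`) as a stand-in; this swap and the `random_state=42` seed are choices made here, not part of the original notebook. The variable names are kept the same so the next cell runs unchanged, but the scores printed further down were produced on the Boston data and will not match.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Stand-in regression data set that ships with scikit-learn (no download needed)
diabetes = load_diabetes()

X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target,
    train_size=0.75, test_size=0.25,
    random_state=42,  # assumption: added for reproducibility
)
print(X_train.shape, X_test.shape)
```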

In [11]:
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

Generation 1 - Current best internal CV score: -18.88543497531915
Generation 2 - Current best internal CV score: -18.05126013833063
Generation 3 - Current best internal CV score: -15.626505378336498
Generation 4 - Current best internal CV score: -14.428749992714245
Generation 5 - Current best internal CV score: -14.428749992714245

Best pipeline: RandomForestRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), bootstrap=True, max_features=0.6000000000000001, min_samples_leaf=2, min_samples_split=9, n_estimators=100)
-5.607737587860179
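
The scores above are negative because, assuming the default scoring described in TPOT's documentation, `TPOTRegressor` reports negative mean squared error, so that "higher is better" holds for every metric. A minimal sketch of what that number means, using a hand-written helper (`neg_mean_squared_error` here is an illustrative function, not an import from TPOT or scikit-learn) and made-up toy values:

```python
import math

def neg_mean_squared_error(y_true, y_pred):
    """Negative MSE: values closer to zero indicate a better fit."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return -sum(squared_errors) / len(squared_errors)

# Toy values, made up for illustration: squared errors are 1, 0 and 4
toy_score = neg_mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(toy_score)

# Converting the test score printed above into a root mean squared error
test_score = -5.607737587860179
rmse = math.sqrt(-test_score)
print(round(rmse, 2))  # 2.37
```

Flipping the sign and taking the square root gives an error in the same units as the target, which is usually easier to interpret than the raw score.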
