
FeatureUnion + StackingEstimator causes input data to be duplicated for the rest of the model, increasing computational load and complexity. #1242

Open
perib opened this issue Mar 22, 2022 · 1 comment


perib commented Mar 22, 2022

TPOT uses FeatureUnion to combine the outputs of multiple operators. However, it is possible for TPOT to place two StackingEstimators within a single FeatureUnion block. This causes TPOT to pass two identical copies of the dataset into the next operator.

Context of the issue

This increases computational load and memory usage, especially for large datasets, with no benefit. The duplicated, perfectly correlated features may also hurt the performance of certain models.

Process to reproduce the issue

  1. User creates TPOT instance
  2. User calls TPOT fit() function with training data
  3. TPOT will generate a pipeline as described.

To demonstrate the issue, below is code using a pipeline that was found by TPOT.

from sklearn.pipeline import FeatureUnion, Pipeline
from tpot.builtins import StackingEstimator, ZeroCount
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.decomposition import PCA
import numpy as np

p = Pipeline([
    ('featureunion', FeatureUnion(transformer_list=[
        ('stackingestimator-1', StackingEstimator(
            estimator=RandomForestRegressor(max_features=0.45,
                                            min_samples_leaf=9,
                                            min_samples_split=4))),
        ('stackingestimator-2', StackingEstimator(
            estimator=ExtraTreesRegressor(max_features=0.7500000000000001,
                                          min_samples_leaf=20,
                                          min_samples_split=18))),
    ])),
    ('stackingestimator-1', StackingEstimator(
        estimator=SGDRegressor(alpha=0.01, eta0=1.0, fit_intercept=False,
                               l1_ratio=0.0, loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('pca', PCA(iterated_power=3, svd_solver='randomized')),
    ('stackingestimator-2', StackingEstimator(
        estimator=SGDRegressor(alpha=0.001, fit_intercept=False, l1_ratio=0.0,
                               loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('zerocount', ZeroCount()),
    ('sgdregressor', SGDRegressor(alpha=0.001, fit_intercept=False, l1_ratio=0.5,
                                  learning_rate='constant', loss='huber',
                                  penalty='elasticnet', power_t=0.1)),
])

X = np.random.rand(5, 10)
y = np.random.rand(5)

p.fit(X, y)

xx = np.arange(10).reshape(1, -1)  # one sample with features 0..9
print("Input data ", xx)
print("After FeatureUnion ", p.steps[0][1].transform(xx))

Expected result

The data should appear only once in the FeatureUnion's output:

[Estimator 1 predictions, Estimator 2 predictions, X]

[0.44, .45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Current result

[Estimator 1 predictions, X, Estimator 2 predictions, X]

[0.44, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, .45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Possible fix

Here is my idea off the top of my head:
Limit FeatureUnion to selectors, transformers, and at most one classifier or regressor, so that only one copy of the data exists. When more than one classifier or regressor is used, replace the FeatureUnion with sklearn's StackingClassifier or StackingRegressor. These estimators similarly let multiple models pass along their predictions, but forward only a single copy of the dataset.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html
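To illustrate the proposed replacement, here is a sketch (using plain sklearn, not TPOT's actual pipeline-generation code) of the same two-model ensemble expressed with StackingRegressor. With `passthrough=True`, the final estimator receives each base model's predictions plus a single copy of the original features:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import SGDRegressor

X = np.random.rand(50, 10)
y = np.random.rand(50)

stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=10, random_state=0)),
                ('et', ExtraTreesRegressor(n_estimators=10, random_state=0))],
    final_estimator=SGDRegressor(),
    passthrough=True,  # final estimator sees [rf preds, et preds, X]: one copy of X
)
stack.fit(X, y)

# 2 prediction columns + 10 original features = 12 columns,
# versus 2 * (10 + 1) = 22 with two StackingEstimators in a FeatureUnion.
print(stack.transform(X).shape)
```

Note that `passthrough=False` (the default) would drop the original features entirely, so `passthrough=True` is what matches StackingEstimator's append-predictions behavior.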

perib commented May 17, 2022

I wanted to add another data-replication issue.

The FunctionTransformer module can also be configured to copy its input unchanged into the next layer. I have generated another pipeline in which several FeatureUnions are stacked with multiple FunctionTransformers, leading to multiple copies of the data.
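A minimal sketch of this second issue: an identity FunctionTransformer inside a FeatureUnion forwards a full copy of its input, so placing several of them side by side multiplies the data (this uses plain sklearn, not a TPOT-generated pipeline):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.arange(6).reshape(2, 3)

# Two identity FunctionTransformers: each passes X through unchanged,
# so the union emits two full copies of every feature.
union = FeatureUnion([('copy-1', FunctionTransformer()),
                      ('copy-2', FunctionTransformer())])

print(union.fit_transform(X))
# [[0 1 2 0 1 2]
#  [3 4 5 3 4 5]]
```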
