
FeatureUnion + StackingEstimator causes input data to be duplicated for the rest of the model, increasing computational load and complexity. #1242

Open
perib opened this issue Mar 22, 2022 · 1 comment


perib commented Mar 22, 2022

TPOT uses FeatureUnion to combine the outputs of multiple operators. However, it is possible for TPOT to place two StackingEstimators within a single FeatureUnion block. This causes TPOT to pass two identical copies of the dataset into the next operator.

Context of the issue

This increases computational load and memory usage, especially for large datasets, with no benefit. The duplicated, perfectly correlated features may also hurt the performance of certain models.

Process to reproduce the issue

  1. User creates TPOT instance
  2. User calls TPOT fit() function with training data
  3. TPOT will generate a pipeline as described.

To demonstrate the issue, below is code using a pipeline that was found by TPOT.

from sklearn.pipeline import FeatureUnion, Pipeline
from tpot.builtins import StackingEstimator, ZeroCount
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.decomposition import PCA
import numpy as np

p = Pipeline([
    ('featureunion', FeatureUnion(transformer_list=[
        ('stackingestimator-1', StackingEstimator(
            estimator=RandomForestRegressor(max_features=0.45,
                                            min_samples_leaf=9,
                                            min_samples_split=4))),
        ('stackingestimator-2', StackingEstimator(
            estimator=ExtraTreesRegressor(max_features=0.7500000000000001,
                                          min_samples_leaf=20,
                                          min_samples_split=18))),
    ])),
    ('stackingestimator-1', StackingEstimator(
        estimator=SGDRegressor(alpha=0.01, eta0=1.0, fit_intercept=False,
                               l1_ratio=0.0, loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('pca', PCA(iterated_power=3, svd_solver='randomized')),
    ('stackingestimator-2', StackingEstimator(
        estimator=SGDRegressor(alpha=0.001, fit_intercept=False, l1_ratio=0.0,
                               loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('zerocount', ZeroCount()),
    ('sgdregressor', SGDRegressor(alpha=0.001, fit_intercept=False, l1_ratio=0.5,
                                  learning_rate='constant', loss='huber',
                                  penalty='elasticnet', power_t=0.1)),
])

X = np.random.rand(5, 10)
y = np.random.rand(5)

p.fit(X, y)

xx = np.arange(10).reshape(1, -1)  # one sample with features 0..9
print("Input data ", xx)
print("After FeatureUnion ", p.steps[0][1].transform(xx))

Expected result

The data should appear only once in the FeatureUnion's output:

[Estimator 1 predictions, Estimator 2 predictions, X]

[0.44, .45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Current result

[Estimator 1 predictions, X, Estimator 2 predictions, X]

[0.44, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, .45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Possible fix

Here is my idea off the top of my head:
Limit FeatureUnion to selectors, transformers, and at most one classifier or regressor, so that only one copy of the data exists. When more than one classifier or regressor is used, replace the FeatureUnion with sklearn's StackingClassifier or StackingRegressor. These estimators similarly let multiple models pass along their predictions, but forward only a single copy of the dataset.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html
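To illustrate the proposed replacement, here is a sketch (using plain sklearn, not TPOT's actual pipeline-generation code) of the same two-model ensemble expressed with StackingRegressor. With `passthrough=True`, the final estimator receives each base model's predictions plus a single copy of the original features:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import SGDRegressor

X = np.random.rand(50, 10)
y = np.random.rand(50)

stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=10, random_state=0)),
                ('et', ExtraTreesRegressor(n_estimators=10, random_state=0))],
    final_estimator=SGDRegressor(),
    passthrough=True,  # final estimator sees [rf preds, et preds, X]: one copy of X
)
stack.fit(X, y)

# 2 prediction columns + 10 original features = 12 columns,
# versus 2 * (10 + 1) = 22 with two StackingEstimators in a FeatureUnion.
print(stack.transform(X).shape)
```

Note that `passthrough=False` (the default) would drop the original features entirely, so `passthrough=True` is what matches StackingEstimator's append-predictions behavior.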

perib commented May 17, 2022

I wanted to add another data-replication issue.

The FunctionTransformer module can also be configured to copy its input unchanged into the next layer. I have generated another pipeline in which several FeatureUnions are stacked with multiple FunctionTransformers, leading to multiple copies of the data.
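A minimal sketch of this second issue: an identity FunctionTransformer inside a FeatureUnion forwards a full copy of its input, so placing several of them side by side multiplies the data (this uses plain sklearn, not a TPOT-generated pipeline):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.arange(6).reshape(2, 3)

# Two identity FunctionTransformers: each passes X through unchanged,
# so the union emits two full copies of every feature.
union = FeatureUnion([('copy-1', FunctionTransformer()),
                      ('copy-2', FunctionTransformer())])

print(union.fit_transform(X))
# [[0 1 2 0 1 2]
#  [3 4 5 3 4 5]]
```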
