Some bugs in the generated code with feature selection and scaler #69

Closed
kadarakos opened this issue Jan 2, 2016 · 2 comments

@kadarakos
Contributor

I ran a couple of experiments on MNIST and found that the code generation is a bit buggy at the moment. In the first example, the only operator that appears in the generated code is SelectPercentile:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


# Use Scikit-learn's SelectPercentile for feature selection
training_features = result2.loc[training_indices].drop('class', axis=1)
training_class_vals = result2.loc[training_indices, 'class'].values

if len(training_features.columns.values) == 0:
result3 = result2.copy()
else:
selector = SelectPercentile(f_classif, percentile=100)
selector.fit(training_features.values, training_class_vals)
mask = selector.get_support(True)
mask_cols = list(training_features.iloc[:, mask].columns) + ['class']
result3 = result2[mask_cols]
  • No indentation
  • result2 is not defined
  • optimized_pipeline_ contains _select_percentile, svc, _standard_scaler, but svc and
    _standard_scaler don't appear in the generated code
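
For comparison, here is a rough sketch of how the feature-selection block above might read with its indentation restored. The function name select_percentile_step and its input_df parameter are hypothetical: input_df stands in for the undefined result2, and which variable the generator actually meant there is an assumption. The demo data is synthetic, and this is not the change made in the PR mentioned later in the thread.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif


def select_percentile_step(input_df, training_indices, percentile=100):
    """Indented version of the generated block.

    input_df is a hypothetical stand-in for the undefined `result2`.
    """
    training_features = input_df.loc[training_indices].drop('class', axis=1)
    training_class_vals = input_df.loc[training_indices, 'class'].values

    # Nothing to select from: pass the data through unchanged
    if len(training_features.columns.values) == 0:
        return input_df.copy()

    selector = SelectPercentile(f_classif, percentile=percentile)
    selector.fit(training_features.values, training_class_vals)
    mask = selector.get_support(True)  # integer indices of the kept columns
    mask_cols = list(training_features.iloc[:, mask].columns) + ['class']
    return input_df[mask_cols]


# Synthetic demo data (not MNIST): 20 rows, 3 features, 2 balanced classes
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.rand(20, 3), columns=['a', 'b', 'c'])
demo['class'] = [0, 1] * 10
result = select_percentile_step(demo, training_indices=list(range(15)))
```

With percentile=100, as in the generated code, every feature column is kept, so the step only becomes selective for lower percentiles.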

Another example with RobustScaler:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


# Use Scikit-learn's RobustScaler to scale the features
training_features = result3.loc[training_indices].drop('class', axis=1)
result4 = result3.copy()

if len(training_features.columns.values) > 0:
scaler = RobustScaler()
scaler.fit(training_features.values.astype(np.float64))
scaled_features = scaler.transform(result4.drop('class', axis=1).values.astype(np.float64))

for col_num, column in enumerate(result4.drop('class', axis=1).columns.values):
    result4.loc[:, column] = scaled_features[:, col_num]
  • No indentation
  • result3 is not defined
  • optimized_pipeline_ contains _robust_scaler, svc, svc, _select_percentile, but svc, svc and
    _select_percentile don't appear in the generated code
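
The first two kinds of problem listed above (missing indentation and names that are never defined) could in principle be caught automatically before the generated code is handed to a user. The following is a minimal sketch using only the standard library; check_generated_code is a hypothetical helper, not part of TPOT, and it only handles flat scripts like the ones above (it ignores function definitions and their arguments).

```python
import ast
import builtins


def check_generated_code(source):
    """Return a list of problems found in a generated script (hypothetical helper)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return ['{}: {} (line {})'.format(type(err).__name__, err.msg, err.lineno)]

    defined = set(dir(builtins))
    loaded = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # `import a.b` binds `a`; `import a as x` binds `x`
            for alias in node.names:
                defined.add((alias.asname or alias.name).split('.')[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                loaded.add(node.id)

    # Order-insensitive check: flags names that are read but never bound anywhere
    return ['name never defined: {}'.format(name) for name in sorted(loaded - defined)]


# Two tiny scripts that mimic the reported problems
unindented = "if True:\nx = 1\n"
undefined = "import pandas as pd\nresult3 = result2.copy()\n"

print(check_generated_code(unindented))
print(check_generated_code(undefined))
```

The check ignores ordering, so it would miss a name that is used before a later assignment. It is meant as a cheap smoke test rather than a replacement for actually running the exported pipeline.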
@rhiever
Contributor

rhiever commented Jan 2, 2016

👍 I noticed some of these bugs when reviewing the code the other day as well. Will look to address these soon.

@kadarakos
Contributor Author

I fixed the problem with a very minor bugfix in PR #68
