EaSearchCV - CV splitting is failing with Xarray inputs currently #204
Here are some notes I took regarding cross validation, thinking about how cross validation could work with xarray/xarray_filters data structures. Currently this is a typical pipeline:

```python
pipe = Pipeline([
    ('sampler', Sampler(max_time_steps=max_time_steps)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1)),
])
```

In pseudocode, this is what cross validating a pipeline like this looks like:

```python
for n in range(num_samples):
    dset = pipe.steps[0][1].fit_transform(**sample_args)
    for name, step in pipe.steps[1:-1]:
        dset = step.fit_transform(dset)
    return pipe.steps[-1][1].fit(dset)
```

Currently sklearn cross validation iterators would support:
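For context (a minimal sketch, not from this issue): sklearn's CV splitters yield integer row indices and subset `X` and `y` by row, which assumes a 2D `(n_samples, n_features)` matrix. That is why they work on numpy/dask arrays but not directly on a Dataset of multi-dimensional `DataArray`s:

```python
import numpy as np
from sklearn.model_selection import KFold

# sklearn CV splitters produce integer row indices and then subset
# X and y by row -- this assumes a 2D (n_samples, n_features) matrix.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # With 10 samples and 5 splits: 8 train rows / 2 test rows per fold
    assert X_train.shape == (8, 2) and X_test.shape == (2, 2)
```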
Nested cross validation idea - cross validating at the input samples level (e.g. the filenames that make up an xarray data structure in Sampler), as well as cross validation splitting of the input matrix:

```python
def make_sample(n):
    return load_array('big_file_{}.nc'.format(n))

xarray_pipe = Pipeline([
    ('sampler', Sampler(func=make_sample)),
    ('time', Differencing(layers=FEATURE_LAYERS)),
    ('flatten', Flatten()),
    ('soil_phys', AddSoilPhysicalChemical()),
    ('drop_null', DropNaRows()),
    ('get_y', GetY(SOIL_MOISTURE)),
    ('None', None),
])

numpy_or_dask_pipe = Pipeline([
    ('scaler', ChooseWithPreproc(trans_if=log_trans_only_positive)),
    ('pca', ChooseWithPreproc()),
    ('estimator', linear_model.LinearRegression(n_jobs=-1)),
])
```
```python
def nested_cross_val(outer_num_samples, inner_num_samples):
    for n in range(outer_num_samples):
        # This is the outer cross validation, e.g.
        # over the file names or dates that
        # determine a sample read from file
        # (as in the NLDAS ML work, for example).
        sample = xarray_pipe.steps[0][1].fit_transform(n)
        for name, step in xarray_pipe.steps[1:]:
            sample = step.fit_transform(sample)
        X, y = sample
        for k in range(inner_num_samples):  # e.g. KFold
            # test / train split X, y, then
            # run the steps in numpy_or_dask_pipe.
            # This is the "inner cross validation"
```

Nested cross validation inside evolutionary search (pseudocode):

```python
def ea_search_cv(outer_cv, inner_cv):
    pop = initialize()
    for generation in range(ngen):
        # Each generation in evo algo
        for model in pop:
            # Each member of population:
            # do the outer / inner cross validations
            nested_cross_val(outer_cv, inner_cv)
        scores = ...  # accumulate two-layer cross validation scores
        # The EA search chooses the
        # best parameters based on cross validation scores
        pop = select_new_population(scores)
    return pop
```
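A runnable toy version of the two-layer loop above, with numpy stand-ins for the xarray sampling stage (`make_toy_sample` is hypothetical, not part of the project):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def make_toy_sample(n, rng):
    # Stand-in for Sampler/make_sample: one "file" -> (X, y)
    X = rng.normal(size=(30, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=30)
    return X, y

def nested_cross_val(outer_num_samples, inner_splits=3, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for n in range(outer_num_samples):          # outer CV: over samples/files
        X, y = make_toy_sample(n, rng)
        for train, test in KFold(inner_splits).split(X):  # inner CV: e.g. KFold
            model = LinearRegression().fit(X[train], y[train])
            scores.append(model.score(X[test], y[test]))
    return scores

scores = nested_cross_val(outer_num_samples=4)
assert len(scores) == 4 * 3  # one score per (outer sample, inner fold)
```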
In EaSearchCV (PR #192), cross validation fails when given a MLDataset / Dataset in a Pipeline or estimator, because the cross validation logic in dask-searchcv and/or scikit-learn subsets rows of a dask array/dataframe or numpy array.

Things to consider: a param to control transformers for viz. These transformers may be transforming a dataset with 1 or more `DataArray`s.

Ideas: We want to allow both use styles, but cross validation can only be supported on the portion of a pipeline that has a 2D features matrix, e.g. a numpy or dask array, or a MLDataset/Dataset with a single `features` DataArray whose `.values` array is given to the cross validation tools. For example, in a Pipeline one could call `to_features` on the MLDataset of 4-D arrays to convert it to a `features` 2D DataArray. Hyperparameterization should be supported for any step, but cross validation only for the steps (3 through 6) that use a typical feature matrix.
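The `to_features` idea, ensuring that CV only ever sees a 2D `(n_samples, n_features)` array, can be sketched with plain numpy (this `to_features` is a hypothetical illustration, not the library's implementation):

```python
import numpy as np
from sklearn.model_selection import KFold

def to_features(arr_4d):
    # Hypothetical sketch: collapse all non-sample dims of a 4-D array
    # into columns so rows can be subset by CV splitters.
    n_samples = arr_4d.shape[0]
    return arr_4d.reshape(n_samples, -1)

arr = np.zeros((12, 3, 4, 5))  # e.g. (time, layer, lat, lon)
X = to_features(arr)           # -> (12, 60) 2D features matrix
assert X.shape == (12, 60)

# Now ordinary row-wise CV splitting works on X
train, test = next(iter(KFold(n_splits=4).split(X)))
assert len(train) == 9 and len(test) == 3
```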