
Allow passing fixed transformers to evaluations #367

Closed
PierreGtch opened this issue May 5, 2023 · 16 comments

@PierreGtch
Collaborator

Hi @sylvchev,
I think it would be very convenient to allow passing "fixed" sklearn transformer(s) (i.e. with a transform method but no fit method) to the evaluations. A pipeline starting with such a fixed transformer currently applies the transformation k times (once per cross-validation fold) when it is evaluated; applying it once is enough if the transformer does not need to be trained.
If we evaluate multiple pipelines all starting with the same transformer, the time gain can be even greater.

concrete use-case

I use pre-trained (and frozen) neural networks for feature extraction (with skorch). The expensive part of the evaluation is the feature extraction. Training a classifier on the extracted features is relatively fast.

implementation

Implementing this in BaseParadigm.get_data would require only minor changes to the different evaluations.
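
For illustration, a minimal sketch of the idea; the fixed_transformer parameter and the loading helper are hypothetical, not the current MOABB API:

# Hypothetical sketch of the proposal; parameter and helper names are illustrative only.
class BaseParadigm:
    def get_data(self, dataset, subjects, fixed_transformer=None):
        # existing loading/epoching logic, unchanged (placeholder call)
        X, labels, metadata = self._load_data(dataset, subjects)
        if fixed_transformer is not None:
            # transform() only, applied exactly once; the transformer is never fitted here
            X = fixed_transformer.transform(X)
        return X, labels, metadata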

What do you think?

@bruAristimunha
Collaborator

Hi @PierreGtch!

You don't need to change the evaluation inside MOABB. You will need to build a function or transformation that uses a pre-trained deep learning model to extract this new feature space and use this representation for the classification step. The scikit-learn library is flexible enough for this.

A possible path is to follow this tutorial: https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py

And translate this for deep learning, something like:

from sklearn.preprocessing import FunctionTransformer

pre_trained_torch_model = ...  # your frozen, pre-trained model

def feature_from_deep_learning(X, model):
    # forward the data through the frozen model to get the feature representation
    return model(X)

transformation_step = FunctionTransformer(func=feature_from_deep_learning,
                                          kw_args={"model": pre_trained_torch_model})

Maybe you will need some trick or another to indicate that the model is already "fitted", and pay attention to how you load the model to ensure there is no data leak.

transformation_step.__sklearn_is_fitted__ = lambda: True  # scikit-learn calls __sklearn_is_fitted__, so it must be a callable returning a bool
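
For illustration, a sketch of how that step could be dropped into a standard pipeline; the tensor conversion, torch.no_grad(), flattening, and the logistic-regression classifier are assumptions about a typical PyTorch setup, and the Flatten module is only a stand-in for a real frozen network:

import torch
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def feature_from_deep_learning(X, model):
    # forward the data through the frozen model without tracking gradients
    with torch.no_grad():
        features = model(torch.as_tensor(X, dtype=torch.float32))
    return features.numpy().reshape(len(X), -1)

pre_trained_torch_model = torch.nn.Flatten().eval()  # stand-in for a real frozen network
transformation_step = FunctionTransformer(
    func=feature_from_deep_learning,
    kw_args={"model": pre_trained_torch_model},
)
pipeline = make_pipeline(transformation_step, LogisticRegression())
# pipeline.fit(X_train, y_train) only trains the classifier; the feature step has no trainable state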

@bruAristimunha
Collaborator

As Igor and I are facing a similar issue, I asked his opinion on the subject.

@PierreGtch
Collaborator Author

PierreGtch commented May 5, 2023

Hi @bruAristimunha,
Yes, I agree it's not strictly needed; the goal is rather to improve the speed of the evaluation procedure.
My problem is not using a neural network in a sklearn pipeline (I use skorch for that), and this is not specific to neural networks: the feature would be useful for any pipeline that starts with transformations that don't need to be trained.

For example, in the WithinSessionEvaluation, by default we do a 5-fold cross-validation. In this case, the same transformation is currently applied 5 times to each session. What I propose is to first apply the fixed transformation and then perform the 5-fold CV on the transformed data.
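
For illustration, a minimal sketch of this order of operations in plain scikit-learn, with toy data and a trivial fit-free transformer standing in for an expensive feature extractor (this is not the MOABB evaluation code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 8, 32))          # toy "epochs": 100 trials, 8 channels, 32 samples
y = rng.integers(0, 2, size=100)               # toy binary labels
fixed_transformer = FunctionTransformer(lambda X: X.reshape(len(X), -1))

# apply the expensive, fit-free transformation a single time...
X_features = fixed_transformer.transform(X_raw)
# ...then run the 5-fold cross-validation on the already-transformed data
scores = cross_val_score(LogisticRegression(max_iter=1000), X_features, y, cv=5)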

(I can do the implementation of course)

@bruAristimunha
Collaborator

If I understand correctly, and you are using skorch, you just need to extract the trained PyTorch model with pre_trained_torch_model = clf.module or pre_trained_torch_model = clf.module_ and use it inside a transformation function like in the example above.
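
For illustration, a sketch of that extraction with skorch, assuming clf is an already-fitted NeuralNetClassifier:

pre_trained_torch_model = clf.module_        # trained PyTorch module held by the fitted skorch classifier
pre_trained_torch_model.eval()               # freeze dropout/batch-norm behaviour
for param in pre_trained_torch_model.parameters():
    param.requires_grad_(False)              # make sure the feature extractor stays frozen
# pre_trained_torch_model can then be passed to a FunctionTransformer as in the example above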

Now, suppose you want to apply a more agnostic transformation to the dataset before the split. In that case, I think it is not the evaluation that would need to change, but the paradigm, together with the pre-processing functions (e.g., resample).

It seems to me that we agree but in different words. Maybe it would help to discuss a concrete example. Could you put together a small code example?

@PierreGtch
Collaborator Author

PierreGtch commented May 5, 2023

Yes, using a PyTorch model in a transformation function was not a problem; it was just an example.

Here are examples:

Currently, we have to do this:

from sklearn.pipeline import make_pipeline
from moabb.evaluations import WithinSessionEvaluation
from moabb.paradigms import LeftRightImagery

transformer = ...
classifier = ...
pipelines = {'transformer+classifier': make_pipeline(transformer, classifier)}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines)

What I propose could be implemented in two different ways:
Option 1:

pipelines = {'classifier': classifier}
paradigm = LeftRightImagery(transformer=transformer)
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines)

Option 2:
2.a. Where we could either use the same transformer for all the classifiers:

pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformers=transformer)

2.b. or a different one for each:

pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
transformers = {'classifier1': transformer1, 'classifier2': transformer2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformers=transformers)

@bruAristimunha
Collaborator

I like the first option, what is your preference @sylvchev?

@sylvchev
Member

sylvchev commented May 8, 2023

Hi @PierreGtch
This is an excellent idea. In most cases the feature extraction is fast, but in some cases it can take quite long to compute. Having the scikit-learn transformers separated from the classifier to speed up the computation is a good thing and could be quite useful.
I prefer option 2, as for me the transformer argument is linked with the pipelines, not with the paradigm. For now, the arguments of the paradigm class are mostly related to raw signal processing (frequency filter, crop, electrode selection, etc.). Also, the possibility to have multiple classifier/transformer pairs is interesting, even if it adds some complexity.
Another thing that could really speed up the processing time is to cache the transformer results. The user should be properly warned that it could take a lot of disk space (on the order of the dataset size). This could save a lot of time and CPU, especially when rerunning experiments with different classifiers.
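
For illustration, a minimal sketch of such disk caching with joblib.Memory (the cache directory and function names are arbitrary, and this is not the mechanism MOABB eventually adopted):

from joblib import Memory

# cache transformer outputs on disk; beware, this can use disk space on the order of the dataset size
memory = Memory(location="./transform_cache", verbose=0)

@memory.cache(ignore=["transformer"])
def cached_transform(transformer_name, transformer, X):
    # the transformer object is ignored for hashing; transformer_name identifies the cache entry instead
    return transformer.transform(X)

# X_features = cached_transform("frozen_net", transformer, X)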

@PierreGtch
Collaborator Author

PierreGtch commented May 9, 2023

Hi @sylvchev, I also prefer option 2; I think it is simpler for users to understand.

What do you mean by cache the transformer results? On disk or in memory? I was assuming we should cache them in memory while they are used within each call to process. If you meant on disk, I'm interested, what do you have in mind?

Also, I just had another question: do you think we should also introduce a parameter like transformer_suffix to append to the pipeline names in the results? This could be useful when you want to test all the combinations of a list of transformers and a list of classifiers; you could just do:

result1 = eval.process(pipelines, transformer1, transformer_suffix="_transformer1")  
result2 = eval.process(pipelines, transformer2, transformer_suffix="_transformer2")  

But when transformers is a dict, I'm not sure what should be the behaviour of transformer_suffix. Ignored? Expect also a dict? Expect None?

Without a transformer_suffix parameter, the user can still do such an evaluation but has to change the keys of the pipelines in between calls.


An option 3 could be to still pass the transformers as a dict to process, but their keys would not have to match the classifiers' keys. Then, the results could be computed for every possible transformer/classifier pair. In the results, the pipeline name of a transformer/classifier pair would be something like f"{key_transformer}_{key_pipeline}". What do you think?


Also, a potential risk is that some unsupervised algorithms use the data passed to transform for training, i.e. all the folds of the CV. The sklearn algorithms (like t-SNE) would require a call to fit before transform can be called, so they would raise an error. However, nothing prevents users from defining their own transformer that trains an unsupervised algorithm during the call to transform.

@PierreGtch
Collaborator Author

I think I now prefer option 3: this way, the pipelines and transformers parameters would have relatively similar behaviours and not depend too much on each other. Also, there wouldn't be multiple cases like with option 1.

What do you think @bruAristimunha @sylvchev ?

PierreGtch added a commit to PierreGtch/moabb that referenced this issue May 23, 2023
@sylvchev
Member

What do you mean by cache the transformer results? On disk or in memory? I was assuming we should cache them in memory while they are used within each call to process. If you meant on disk, I'm interested, what do you have in mind?

I was thinking about disk cache. This could be useful when there are computationally intensive preprocessing/transformation of the dataset. I used this disk caching in a previous project and it worked quite well to speed up computation.

Also, I just had another question: do you think we should also introduce a parameter like transformer_suffix to append to the pipeline names in the results? This could be useful when you want to test all the combinations of a list of transformers and a list of classifiers; you could just do:
result1 = eval.process(pipelines, transformer1, transformer_suffix="_transformer1")
result2 = eval.process(pipelines, transformer2, transformer_suffix="_transformer2")
But when transformers is a dict, I'm not sure what should be the behaviour of transformer_suffix. Ignored? Expect also a dict? Expect None?

I think adding a suffix to the pipeline name that depends on the applied transformer is a good idea. It could be simpler to extract it directly from the transformer dict key rather than adding a specific argument to the process function. In that case, I think we should constrain the transformer argument to always be a dict.

Without a transformer_suffix parameter, the user can still do such an evaluation but has to change the keys of the pipelines in between calls.

I'm not sure I understand why it is necessary to change the keys between calls. The same transformer applied to the same pipeline should give the same pipeline+suffix name, shouldn't it?

An option 3 could be to still pass the transformers as a dict to process, but their keys would not have to match the classifiers' keys. Then, the results could be computed for every possible transformer/classifier pair. In the results, the pipeline name of a transformer/classifier pair would be something like f"{key_transformer}_{key_pipeline}". What do you think?

Yes, I also prefer this 3rd option.

Also, a potential risk is that some unsupervised algorithms use the data passed to transform for training, i.e. all the folds of the CV. The sklearn algorithms (like t-SNE) would require a call to fit before transform can be called, so they would raise an error. However, nothing prevents users from defining their own transformer that trains an unsupervised algorithm during the call to transform.

The leakage of information is a major risk. I think we could ensure that point by checking that the transformer estimator or pipeline is purely unsupervised and does not use the label information, and raise a warning if it does.

@PierreGtch
Collaborator Author

Hi @sylvchev
I started to implement option 3 here because I need it for the Bruxelles meeting. I'm quite busy preparing for the meeting right now but will work on this PR afterwards. Currently, I have only implemented the within-session evaluation code, and it looks something like this:

for subject in dataset.subject_list:
    # load the raw (un-transformed) data once per subject
    X_no_tf, labels, metadata = paradigm.get_data(dataset, [subject])
    for name_tf, transformer in transformers.items():
        # apply each fixed transformer a single time per subject
        X = transformer.transform(X_no_tf)
        for session in metadata.session.unique():
            ix = metadata.session == session
            for name_pipe, pipeline in pipelines.items():
                ...
                pipeline.fit(X[ix], labels[ix])
                ...
                results.add(name=name_tf + " + " + name_pipe, ...)

Do you have any preliminary comments?

I was thinking about disk cache. This could be useful when there are computationally intensive preprocessing/transformation of the dataset. I used this disk caching in a previous project and it worked quite well to speed up computation.

Yes, it would be great to have such a caching mechanism automatically taken care of by MOABB! I created a new issue for that: #385.

The leakage of information is a major risk. I think we could ensure that point by checking that the transformer estimator or pipeline are purely unsupervised and do not use the label information, and raise a warning if it is the case.

Even unsupervised algorithms can be problematic if they train on the (unlabelled) data we ask them to transform. For example: in a cross-session evaluation, we will probably pass the data from all the sessions simultaneously to the transformer. If this transformer uses all the sessions to train an unsupervised algorithm, it breaks the train/test separation...
We need transformers that don't train at all on the data we ask them to transform.
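
For illustration, a sketch of a wrapper that would make this constraint explicit (the FrozenTransformer name and behaviour are hypothetical, not an actual MOABB feature):

import warnings
from sklearn.base import BaseEstimator, TransformerMixin

class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Wrap a pre-fitted transformer and refuse to (re)train it on the evaluation data."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # deliberately a no-op: the wrapped transformer must already be trained elsewhere
        warnings.warn("FrozenTransformer.fit() does nothing; the wrapped transformer is not retrained.")
        return self

    def transform(self, X):
        return self.transformer.transform(X)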

@PierreGtch
Collaborator Author

Now we have PR #408.
If we implement #429 (easy), it becomes super easy to implement option 2 in the case where all the pipelines use the same transformer!

The API would be:

pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformer=transformer)

@bruAristimunha
Collaborator

Closes with #408

@PierreGtch
Collaborator Author

This is not completely implemented in #408. Currently, we can only pass a fixed transformer to dataset.get_data() and to paradigm.get_data() but not yet to evaluation.process(). But we are not far!

@bruAristimunha bruAristimunha reopened this Aug 1, 2023
@bruAristimunha
Collaborator

Oh sorry >.<

@PierreGtch PierreGtch added this to the 0.6.0 milestone Aug 28, 2023
@PierreGtch
Collaborator Author

Closed by #372 which implements option 2.a. (see above).
