
Allow passing fixed transformers to evaluations #367

Closed
PierreGtch opened this issue May 5, 2023 · 16 comments

@PierreGtch
Collaborator

Hi @sylvchev,
I think it would be very convenient to allow passing "fixed" sklearn transformer(s) (i.e. with a transform method but no fit method) to the evaluations. A pipeline starting with such a fixed transformer currently applies the transformation k times (once per cross-validation fold) when it is evaluated; applying it once is enough if the transformer does not need to be trained.
If we evaluate multiple pipelines all starting with the same transformer, the time gain can be even greater.

concrete use-case

I use pre-trained (and frozen) neural networks for feature extraction (with skorch). The expensive part of the evaluation is the feature extraction. Training a classifier on the extracted features is relatively fast.

implementation

Implementing this in BaseParadigm.get_data would require only minor changes to the different evaluations.
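
For illustration, a minimal sketch of the idea; the fixed_transformer parameter and the loading helper are hypothetical, not the current MOABB API:

# Hypothetical sketch of the proposal; parameter and helper names are illustrative only.
class BaseParadigm:
    def get_data(self, dataset, subjects, fixed_transformer=None):
        # existing loading/epoching logic, unchanged (placeholder call)
        X, labels, metadata = self._load_data(dataset, subjects)
        if fixed_transformer is not None:
            # transform() only, applied exactly once; the transformer is never fitted here
            X = fixed_transformer.transform(X)
        return X, labels, metadata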

What do you think?

@bruAristimunha
Collaborator

Hi @PierreGtch!

You don't need to change the evaluation inside MOABB. You will need to build a function or transformation that uses a pre-trained deep learning model to extract this new feature space and use this representation for the classification step. The scikit-learn library is flexible enough for this.

A possible path is to follow this tutorial: https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py

And translate this for deep learning, something like:

from sklearn.preprocessing import FunctionTransformer

pre_trained_torch_model = ...  # your frozen, pre-trained model

def feature_from_deep_learning(X, model):
    # forward the data through the frozen model to get the feature representation
    return model(X)

transformation_step = FunctionTransformer(func=feature_from_deep_learning,
                                          kw_args={"model": pre_trained_torch_model})

Maybe you will need some trick or another to indicate that the model is already "fitted", and pay attention to how you load the model to ensure there is no data leak.

transformation_step.__sklearn_is_fitted__ = lambda: True  # scikit-learn calls __sklearn_is_fitted__, so it must be a callable returning a bool
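
For illustration, a sketch of how that step could be dropped into a standard pipeline; the tensor conversion, torch.no_grad(), flattening, and the logistic-regression classifier are assumptions about a typical PyTorch setup, and the Flatten module is only a stand-in for a real frozen network:

import torch
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def feature_from_deep_learning(X, model):
    # forward the data through the frozen model without tracking gradients
    with torch.no_grad():
        features = model(torch.as_tensor(X, dtype=torch.float32))
    return features.numpy().reshape(len(X), -1)

pre_trained_torch_model = torch.nn.Flatten().eval()  # stand-in for a real frozen network
transformation_step = FunctionTransformer(
    func=feature_from_deep_learning,
    kw_args={"model": pre_trained_torch_model},
)
pipeline = make_pipeline(transformation_step, LogisticRegression())
# pipeline.fit(X_train, y_train) only trains the classifier; the feature step has no trainable state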

@bruAristimunha
Collaborator

As Igor and I are facing a similar issue, I asked his opinion on the subject.

@PierreGtch
Collaborator Author

PierreGtch commented May 5, 2023

Hi @bruAristimunha,
Yes, I agree it's not strictly needed; the goal is rather to improve the speed of the evaluation procedure.
My problem is not using a neural network in a sklearn pipeline (I use skorch for that), and this is not specific to neural networks: the feature would be useful for any pipeline that starts with transformations that don't need to be trained.

For example, in the WithinSessionEvaluation, by default we do a 5-fold cross-validation. In this case, the same transformation is currently applied 5 times to each session. What I propose is to first apply the fixed transformation and then perform the 5-fold CV on the transformed data.
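
For illustration, a minimal sketch of this order of operations in plain scikit-learn, with toy data and a trivial fit-free transformer standing in for an expensive feature extractor (this is not the MOABB evaluation code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 8, 32))          # toy "epochs": 100 trials, 8 channels, 32 samples
y = rng.integers(0, 2, size=100)               # toy binary labels
fixed_transformer = FunctionTransformer(lambda X: X.reshape(len(X), -1))

# apply the expensive, fit-free transformation a single time...
X_features = fixed_transformer.transform(X_raw)
# ...then run the 5-fold cross-validation on the already-transformed data
scores = cross_val_score(LogisticRegression(max_iter=1000), X_features, y, cv=5)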

(I can do the implementation of course)

@bruAristimunha
Collaborator

If I understand correctly, and you are using skorch, you just need to extract the trained PyTorch model with pre_trained_torch_model = clf.module or pre_trained_torch_model = clf.module_ and use it inside a transformation function like in the example above.
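
For illustration, a sketch of that extraction with skorch, assuming clf is an already-fitted NeuralNetClassifier:

pre_trained_torch_model = clf.module_        # trained PyTorch module held by the fitted skorch classifier
pre_trained_torch_model.eval()               # freeze dropout/batch-norm behaviour
for param in pre_trained_torch_model.parameters():
    param.requires_grad_(False)              # make sure the feature extractor stays frozen
# pre_trained_torch_model can then be passed to a FunctionTransformer as in the example above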

Now, suppose you want to apply a more agnostic transformation to the dataset before the split. In that case, I think it is not the evaluation that would need to change, but the paradigm, together with the pre-processing functions (e.g., resample).

It seems to me that we agree but in different words. Maybe it would help to discuss a concrete example. Could you put together a small code example?

@PierreGtch
Collaborator Author

PierreGtch commented May 5, 2023

Yes, using a PyTorch model in a transformation function was not a problem; it was just an example.

Here are examples:

Currently, we have to do this:

from sklearn.pipeline import make_pipeline
from moabb.evaluations import WithinSessionEvaluation
from moabb.paradigms import LeftRightImagery

transformer = ...
classifier = ...
pipelines = {'transformer+classifier': make_pipeline(transformer, classifier)}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines)

What I propose could be implemented in two different ways:
Option 1:

pipelines = {'classifier': classifier}
paradigm = LeftRightImagery(transformer=transformer)
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines)

Option 2:
2.a. Where we could either use the same transformer for all the classifiers:

pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformers=transformer)

2.b. or a different one for each:

pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
transformers = {'classifier1': transformer1, 'classifier2': transformer2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformers=transformers)

@bruAristimunha
Collaborator

I like the first option, what is your preference @sylvchev?

@sylvchev
Member

sylvchev commented May 8, 2023

Hi @PierreGtch
This is an excellent idea. In most cases the feature extraction is fast, but in some cases it can take quite long to compute. Having the scikit-learn transformers separated from the classifier to speed up the computation is a good thing and could be quite useful.
I prefer option 2, as for me the transformer argument is linked with the pipelines, not with the paradigm. For now, the arguments of the paradigm class are mostly related to raw signal processing (frequency filter, crop, electrode selection, etc.). Also, the possibility to have multiple classifier/transformer pairs is interesting, even if it adds some complexity.
Another thing that could really speed up the processing time is to cache the transformer results. The user should be properly warned that it could take a lot of disk space (on the order of the dataset size). This could save a lot of time and CPU, especially when rerunning experiments with different classifiers.
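
For illustration, a minimal sketch of such disk caching with joblib.Memory (the cache directory and function names are arbitrary, and this is not the mechanism MOABB eventually adopted):

from joblib import Memory

# cache transformer outputs on disk; beware, this can use disk space on the order of the dataset size
memory = Memory(location="./transform_cache", verbose=0)

@memory.cache(ignore=["transformer"])
def cached_transform(transformer_name, transformer, X):
    # the transformer object is ignored for hashing; transformer_name identifies the cache entry instead
    return transformer.transform(X)

# X_features = cached_transform("frozen_net", transformer, X)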

@PierreGtch
Collaborator Author

PierreGtch commented May 9, 2023

Hi @sylvchev, I also prefer option 2; I think it is simpler for users to understand.

What do you mean by cache the transformer results? On disk or in memory? I was assuming we should cache them in memory while they are used within each call to process. If you meant on disk, I'm interested, what do you have in mind?

Also, I just had another question: do you think we should also introduce a parameter like transformer_suffix to append to the pipeline names in the results? This could be useful when you want to test all the combinations of a list of transformers and a list of classifiers; you could just do:

result1 = eval.process(pipelines, transformer1, transformer_suffix="_transformer1")  
result2 = eval.process(pipelines, transformer2, transformer_suffix="_transformer2")  

But when transformers is a dict, I'm not sure what should be the behaviour of transformer_suffix. Ignored? Expect also a dict? Expect None?

Without a transformer_suffix parameter, the user can still do such an evaluation but has to change the keys of the pipelines in between calls.


An option 3 could be to still pass the transformers as a dict to process, but their keys would not have to match the classifiers' keys. Then, the results could be computed for every possible transformer/classifier pair. In the results, the pipeline name of a transformer/classifier pair would be something like f"{key_transformer}_{key_pipeline}". What do you think?


Also, a potential risk is that some unsupervised algorithms use the data passed to transform for training, i.e. all the folds of the CV. The sklearn algorithms (like t-SNE) would require a call to fit before transform can be called, so they would raise an error. However, nothing prevents users from defining their own transformer that trains an unsupervised algorithm during the call to transform.

@PierreGtch
Collaborator Author

I think I now prefer option 3: this way, the pipelines and transformers parameters would have relatively similar behaviours and not depend too much on each other. Also, there wouldn't be multiple cases like with option 1.

What do you think @bruAristimunha @sylvchev ?

PierreGtch added a commit to PierreGtch/moabb that referenced this issue May 23, 2023
@sylvchev
Member

What do you mean by cache the transformer results? On disk or in memory? I was assuming we should cache them in memory while they are used within each call to process. If you meant on disk, I'm interested, what do you have in mind?

I was thinking about disk cache. This could be useful when there are computationally intensive preprocessing/transformation of the dataset. I used this disk caching in a previous project and it worked quite well to speed up computation.

Also, I just had another question: do you think we should also introduce a parameter like transformer_suffix to append to the pipeline names in the results? This could be useful when you want to test all the combinations of a list of transformers and a list of classifiers; you could just do:
result1 = eval.process(pipelines, transformer1, transformer_suffix="_transformer1")
result2 = eval.process(pipelines, transformer2, transformer_suffix="_transformer2")
But when transformers is a dict, I'm not sure what should be the behaviour of transformer_suffix. Ignored? Expect also a dict? Expect None?

I think adding a suffix to the pipeline name that depends on the applied transformer is a good idea. It could be simpler to extract it directly from the transformer dict key rather than adding a specific argument to the process function. In that case, I think we should constrain the transformer argument to always be a dict.

Without a transformer_suffix parameter, the user can still do such an evaluation but has to change the keys of the pipelines in between calls.

I'm not sure I understand why it is necessary to change the keys between calls. The same transformer applied to the same pipeline should give the same pipeline+suffix name, shouldn't it?

An option 3 could be to still pass the transformers as a dict to process, but their keys would not have to match the classifiers' keys. Then, the results could be computed for every possible transformer/classifier pair. In the results, the pipeline name of a transformer/classifier pair would be something like f"{key_transformer}_{key_pipeline}". What do you think?

Yes, I also prefer this 3rd option.

Also, a potential risk is that some unsupervised algorithms use the data passed to transform for training, i.e. all the folds of the CV. The sklearn algorithms (like t-SNE) would require a call to fit before transform can be called, so they would raise an error. However, nothing prevents users from defining their own transformer that trains an unsupervised algorithm during the call to transform.

The leakage of information is a major risk. I think we could ensure that point by checking that the transformer estimator or pipeline is purely unsupervised and does not use the label information, and raise a warning if it does.

@PierreGtch
Collaborator Author

Hi @sylvchev
I started to implement option 3 here because I need it for the Bruxelles meeting. I'm quite busy preparing for the meeting right now but will work on this PR afterwards. Currently, I have only implemented the within-session evaluation code, and it looks something like this:

for subject in dataset.subject_list:
    # load the raw (un-transformed) data once per subject
    X_no_tf, labels, metadata = paradigm.get_data(dataset, [subject])
    for name_tf, transformer in transformers.items():
        # apply each fixed transformer a single time per subject
        X = transformer.transform(X_no_tf)
        for session in metadata.session.unique():
            ix = metadata.session == session
            for name_pipe, pipeline in pipelines.items():
                ...
                pipeline.fit(X[ix], labels[ix])
                ...
                results.add(name=name_tf + " + " + name_pipe, ...)

Do you have any preliminary comments?

I was thinking about disk cache. This could be useful when there are computationally intensive preprocessing/transformation of the dataset. I used this disk caching in a previous project and it worked quite well to speed up computation.

Yes, it would be great to have such a caching mechanism automatically taken care of by MOABB! I created a new issue for that: #385.

The leakage of information is a major risk. I think we could ensure that point by checking that the transformer estimator or pipeline are purely unsupervised and do not use the label information, and raise a warning if it is the case.

Even unsupervised algorithms can be problematic if they train on the (unlabelled) data we ask them to transform. For example: in a cross-session evaluation, we will probably pass the data from all the sessions simultaneously to the transformer. If this transformer uses all the sessions to train an unsupervised algorithm, it breaks the train/test separation...
We need transformers that don't train at all on the data we ask them to transform.
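
For illustration, a sketch of a wrapper that would make this constraint explicit (the FrozenTransformer name and behaviour are hypothetical, not an actual MOABB feature):

import warnings
from sklearn.base import BaseEstimator, TransformerMixin

class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Wrap a pre-fitted transformer and refuse to (re)train it on the evaluation data."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # deliberately a no-op: the wrapped transformer must already be trained elsewhere
        warnings.warn("FrozenTransformer.fit() does nothing; the wrapped transformer is not retrained.")
        return self

    def transform(self, X):
        return self.transformer.transform(X)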

@PierreGtch
Collaborator Author

Now we have PR #408.
If we implement #429 (easy), it becomes super easy to implement option 2 in the case where all the pipelines use the same transformer!

The API would be:

pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformer=transformer)

@bruAristimunha
Collaborator

Closes with #408

@PierreGtch
Collaborator Author

This is not completely implemented in #408. Currently, we can only pass a fixed transformer to dataset.get_data() and to paradigm.get_data() but not yet to evaluation.process(). But we are not far!

@bruAristimunha bruAristimunha reopened this Aug 1, 2023
@bruAristimunha
Collaborator

Oh sorry >.<

@PierreGtch PierreGtch added this to the 0.6.0 milestone Aug 28, 2023
@PierreGtch
Collaborator Author

Closed by #372 which implements option 2.a. (see above).
