
drop-in replacements for cross_val_predict and cross_val_score etc #18

Open

jrasero opened this issue Feb 25, 2022 · 14 comments

jrasero (Contributor) commented Feb 25, 2022

Pradeep,

Could something like this be of interest for the library?

The idea would be to create a class that encapsulates fit and predict, handling both the deconfounding and the use of the estimator.

Below is a skeleton example. This would only deconfound the input data.

cross_val_predict and cross_val_score functions could be implemented as well.

from sklearn.base import BaseEstimator, clone

class SklearnWrapper(BaseEstimator):

    def __init__(self, deconfounder, estimator):
        self.deconfounder = deconfounder
        self.estimator = estimator

    def fit(self, input_data, target_data, confounders, sample_weight=None):

        # clone the input arguments so repeated fits start from a fresh state
        deconfounder = clone(self.deconfounder)
        estimator = clone(self.estimator)

        # Deconfound the input data
        deconf_input = deconfounder.fit_transform(input_data, confounders)
        self.deconfounder_ = deconfounder

        # Fit the estimator on the deconfounded input data
        estimator.fit(deconf_input, target_data, sample_weight=sample_weight)
        self.estimator_ = estimator

        return self

    def predict(self, input_data, confounders):

        # Deconfound with the already-fitted deconfounder, then predict
        deconf_input = self.deconfounder_.transform(input_data, confounders)
        return self.estimator_.predict(deconf_input)
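For concreteness, here is a hedged, self-contained demo of how such a wrapper could be used. LinearResidualizer is a hypothetical stand-in for a deconfounder from the library (e.g. Residualize), and the wrapper body is a compact copy of the skeleton above:

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import Ridge

class LinearResidualizer(BaseEstimator):
    """Hypothetical deconfounder: removes the linear effect of confounds."""

    def fit(self, X, confounders):
        C = np.column_stack([np.ones(len(confounders)), confounders])
        self.coef_, *_ = np.linalg.lstsq(C, X, rcond=None)
        return self

    def transform(self, X, confounders):
        C = np.column_stack([np.ones(len(confounders)), confounders])
        return X - C @ self.coef_

    def fit_transform(self, X, confounders):
        return self.fit(X, confounders).transform(X, confounders)

class SklearnWrapper(BaseEstimator):
    """Compact copy of the wrapper sketched above, for a runnable demo."""

    def __init__(self, deconfounder, estimator):
        self.deconfounder = deconfounder
        self.estimator = estimator

    def fit(self, X, y, confounders, sample_weight=None):
        self.deconfounder_ = clone(self.deconfounder).fit(X, confounders)
        X_deconf = self.deconfounder_.transform(X, confounders)
        self.estimator_ = clone(self.estimator).fit(
            X_deconf, y, sample_weight=sample_weight)
        return self

    def predict(self, X, confounders):
        return self.estimator_.predict(
            self.deconfounder_.transform(X, confounders))

rng = np.random.default_rng(0)
conf = rng.normal(size=(100, 1))              # one confounding variable
X = rng.normal(size=(100, 3)) + conf          # features contaminated by it
y = X[:, 0] + 0.1 * rng.normal(size=100)

model = SklearnWrapper(LinearResidualizer(), Ridge()).fit(X, y, conf)
preds = model.predict(X, conf)
```

Inheriting from BaseEstimator makes the wrapper itself clonable, which matters later if it is passed into cross-validation utilities.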

raamana (Owner) commented Feb 26, 2022

Great suggestion Javier, let me think. This may make it syntactically more convenient or easier for new users, especially regarding example 1 here: https://raamana.github.io/confounds/usage.html

The important question is how we handle more complicated use cases, e.g. advanced cross-validation. If you can quickly sketch an outline of how such a thing could be adapted for cross_val_predict and cross_val_score, it would be easier to decide. It's a tradeoff between how much we encapsulate (plug-and-play) and how much we stick to a few small modular blocks.

jrasero (Contributor, Author) commented Feb 28, 2022

Sure.

I think we could more or less "borrow" (i.e. copy) the original sklearn implementations of cross_val_predict and cross_val_score and just rewrite the auxiliary functions _fit_and_score and _fit_and_predict that they use in each fold iteration.

Below is a skeleton example for cross_val_predict. Note that it takes the same arguments as the original sklearn implementation, plus an argument for the confounders. I made this new argument keyword-only rather than positional like X and y; I think this avoids passing the confounders as X or y by mistake.

def cross_val_predict(
    estimator,
    X,
    y=None,
    *,
    confounds,  # mandatory, and (IMHO) better passed by keyword only
    groups=None,
    cv=None,
    n_jobs=None,
    verbose=0,
    fit_params=None,
    pre_dispatch="2*n_jobs",
    method="predict",
):
    # ... all initial input checks would go here ...

    # ... some original code from sklearn, including computing `splits`
    #     from the cv object ...

    parallel = Parallel(n_jobs=n_jobs, verbose=verbose,
                        pre_dispatch=pre_dispatch)

    # Here we call our implementation of _fit_and_predict. The "deconf"
    # prefix signals that we also deconfound the data before fitting.
    predictions = parallel(
        delayed(_deconf_fit_and_predict)(
            clone(estimator), X, y, confounds,
            train, test, verbose, fit_params, method
        )
        for train, test in splits
    )

    # ... prepare predictions for output ...

    return predictions


def _deconf_fit_and_predict(estimator,
                            X,
                            y,
                            C,  # confounders
                            train,
                            test,
                            verbose,
                            fit_params,
                            method):

    # Split into training and test sets
    X_train, y_train, C_train = X[train], y[train], C[train]
    X_test, C_test = X[test], C[test]

    # N.B. estimator should be our sklearn wrapper; we should enforce this
    # during the initial checks of cross_val_predict
    estimator.fit(X_train, y_train, confounders=C_train)
    predictions = estimator.predict(X_test, confounders=C_test)

    # ... probably some post-processing here ...

    return predictions
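To make the per-fold logic above concrete, here is a minimal, sequential (non-parallel) version of the same idea, with the deconfounding inlined as a simple linear residualizer (a hypothetical stand-in for the library's deconfounders; all names are illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def deconf_cross_val_predict(estimator, X, y, *, confounds, cv=5):
    """Out-of-fold predictions, deconfounding fitted on each training fold."""
    preds = np.empty(len(y), dtype=float)
    for train, test in KFold(n_splits=cv).split(X):
        # Fit the residualizer on the training fold only, to avoid leakage
        C_train = np.column_stack([np.ones(len(train)), confounds[train]])
        C_test = np.column_stack([np.ones(len(test)), confounds[test]])
        beta, *_ = np.linalg.lstsq(C_train, X[train], rcond=None)
        est = clone(estimator).fit(X[train] - C_train @ beta, y[train])
        preds[test] = est.predict(X[test] - C_test @ beta)
    return preds

rng = np.random.default_rng(0)
conf = rng.normal(size=(80, 1))
X = rng.normal(size=(80, 3)) + conf
y = X[:, 0] + 0.1 * rng.normal(size=80)

oof = deconf_cross_val_predict(LinearRegression(), X, y, confounds=conf, cv=4)
```

The key point the skeleton encodes is the same here: the deconfounder is fit only on the training indices of each fold and then applied to the held-out fold.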
    

raamana (Owner) commented Mar 7, 2022

This is a great idea, Javier! Instead of leaving it to the users to do the right thing, we can provide the necessary wrappers for the most common use cases and do the right thing for them. Please go ahead, with one suggestion: name these clearly differently from the sklearn counterparts to avoid any confusion, something like deconfounded_cross_validation.

jrasero (Contributor, Author) commented Mar 7, 2022

Sure, that makes sense.

I will start working on this this week.

jrasero (Contributor, Author) commented Mar 11, 2022

Ok, so after thinking about all of this these past few days, I think we need to implement at least the following objects:

  • A deconfounded estimator. We could name it "DeconfEstimator". This could in principle cover any task (regression/classification), and its main methods would be fit, fit_predict and predict.

  • A deconfounded transformer? We could name it "DeconfTransformer". Its main methods would be fit, fit_transform and transform.

  • A deconfounded cross_val_predict. We could name it "deconfounded_cv_predict"; the implementation would be based on the example above.

  • A deconfounded cross_val_score. We could name it "deconfounded_cv_score". Following the same rationale as sklearn, this would just call a function that we would create under the name "deconfounded_cross_validate" (sklearn's equivalent is "cross_validate").

  • A deconfounded optimization object, something like a "DeconfGridSearchCV".

All but the last should be simple to implement. As for the last one, it's a shame, but I don't see a quick way of leveraging the original scikit-learn GridSearchCV class to save coding (e.g. by inheriting from it), so maybe the best option is to copy as much code as possible from the original GridSearchCV class and adapt it to our purposes.

Finally, I anticipate a case that may give us some trouble in the future. In principle, the estimator passed to "DeconfEstimator" could be a pipeline object: the data would first be deconfounded and then go through the pipeline. The problem arises when the pipeline contains an imputation step to deal with NaNs in the data. In that case, unless confounds allows omitting the NaNs when deconfounding, I guess it will raise an error. Anyway, we can come back to this case in the future.

I'll start working on this. Follow-up here (https://github.com/jrasero/confounds/tree/sklearn_wrapper)
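As a rough, hedged illustration of what the "DeconfGridSearchCV" adaptation would have to do (not library code; all names are hypothetical, and the deconfounding is inlined as a simple linear residualizer), a sequential grid search could loop over ParameterGrid and score each candidate with deconfounded cross-validation:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, ParameterGrid

def deconf_grid_search(estimator, param_grid, X, y, confounds, cv=3):
    """Pick the parameter set with the best mean CV score on deconfounded data."""
    best_score, best_params = -np.inf, None
    for params in ParameterGrid(param_grid):
        fold_scores = []
        for train, test in KFold(n_splits=cv).split(X):
            # Residualize within the fold, training indices only
            C_tr = np.column_stack([np.ones(len(train)), confounds[train]])
            C_te = np.column_stack([np.ones(len(test)), confounds[test]])
            beta, *_ = np.linalg.lstsq(C_tr, X[train], rcond=None)
            est = clone(estimator).set_params(**params)
            est.fit(X[train] - C_tr @ beta, y[train])
            fold_scores.append(est.score(X[test] - C_te @ beta, y[test]))
        mean_score = float(np.mean(fold_scores))
        if mean_score > best_score:
            best_score, best_params = mean_score, params
    return best_params, best_score

rng = rng = np.random.default_rng(1)
conf = rng.normal(size=(90, 1))
X = rng.normal(size=(90, 4)) + conf
y = X[:, 0] + 0.1 * rng.normal(size=90)

best_params, best_score = deconf_grid_search(
    Ridge(), {"alpha": [0.1, 1.0, 10.0]}, X, y, conf)
```

The real implementation would of course need the refitting, parallelism and cv_results_ bookkeeping of GridSearchCV, which is why copying and adapting its code may be unavoidable.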

raamana (Owner) commented Mar 11, 2022

Thanks Javier. Most of our existing classes like Residualize() are already supposed to be sklearn estimators to the extent possible (what you refer to as a deconfounded estimator). I spent a lot of time trying to get them to pass sklearn's tests, but I realized their test suite has many issues and fundamental limitations, so let's not waste time on that. I don't follow the need for a Transformer, as our existing deconfounders are already transformers, right? I can see a clear need for cross_val_predict and GridSearchCV, but perhaps I am missing something, so let's discuss this in more detail before you invest too much time into it.

jrasero (Contributor, Author) commented Mar 11, 2022

Yes, yes, the deconfounders will always be transformers, but I was talking about the part that comes after them. For example, a PCA would be a transformer, and it has a different API from a classifier or regressor object.

But I guess implementing this kind of object could be secondary for now. I agree with you that cross_val_predict and GridSearchCV are the most important pieces right now.

raamana (Owner) commented Mar 11, 2022

Perhaps we could consider allowing users to pass an sklearn pipeline object (for preprocessing) prior to deconfounding and before the prediction estimator is applied. Let's start with the simpler case and, based on how it turns out, slowly add more useful features. We certainly don't want to recreate all of sklearn, and we want to do things that they can't or won't do.
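A hedged sketch of that ordering (hypothetical data and choices, with a simple linear residualizer standing in for the library's deconfounders): a user-supplied preprocessing Pipeline runs first, then deconfounding, then the prediction estimator:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
conf = rng.normal(size=(120, 1))          # confounding variable
X = rng.normal(size=(120, 5)) + conf      # features contaminated by it
y = rng.normal(size=120)

# Step 1: user-supplied preprocessing pipeline runs first
pre = make_pipeline(StandardScaler(), PCA(n_components=3))
X_pre = pre.fit_transform(X)

# Step 2: deconfound the preprocessed features (linear residualization
# as a stand-in for a deconfounder from the confounds library)
C = np.column_stack([np.ones(len(conf)), conf])
beta, *_ = np.linalg.lstsq(C, X_pre, rcond=None)
X_deconf = X_pre - C @ beta

# Step 3: fit the prediction estimator on the deconfounded features
preds = Ridge().fit(X_deconf, y).predict(X_deconf)
```

In a real cross-validation setting each of these three steps would have to be fit on the training fold only, as in the earlier sketches.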

raamana (Owner) commented Jun 9, 2022

Hi @jrasero, would you be participating in the OHBM BrainHack virtually? Let me know. I was thinking of picking up a few of the pending ideas/issues here and working on them during the hackathon/conference.

raamana changed the title from "sklearn wrapper" to "drop-in replacements for cross_val_predict and cross_val_score etc" on Jun 16, 2022
raamana (Owner) commented Jun 16, 2022

Hi @jrasero, let me know when you have some time today, so we can discuss where you are at and how we can get this finished during the hackathon.

jrasero (Contributor, Author) commented Jun 16, 2022

Hey @raamana, I am free now. Let me reach you via email to maybe set up a quick Zoom meeting?

jrasero (Contributor, Author) commented Jun 16, 2022

Here is the branch I created several months ago for this issue, and its status as of today:

https://github.com/jrasero/confounds/blob/sklearn_wrapper/confounds/sklearn.py

Once finished, I'll open the pull request.

jrasero (Contributor, Author) commented Jun 17, 2022

Ok @raamana, I finally implemented a few things today: a DeconfEstimator class, which first deconfounds the data and then runs a passed estimator; a deconfounded_cv_predict function to get predictions in a cross-validation scheme that includes deconfounding; and deconfounded_cv_score, the same but yielding the performance scores.

I've also added a few tests for these functionalities.

Please take a look and let me know if these look OK to you. I can open a pull request for all of this if you want.

raamana (Owner) commented Jun 17, 2022

Fantastic. Please send a PR when you are ready, Javi!
