
better validation of inputs to Deconfounders #20

Open · raamana opened this issue Apr 29, 2022 · 13 comments

@raamana (Owner) commented Apr 29, 2022

Issue #19 reminds me of how some users can get confused, given that the code leaves the second argument to .fit() and .transform() optional with y=None. The only reason we have y=None is to try to follow sklearn conventions and to pass their estimator checks, but given that we can't pass them anyway, we should tighten the signatures and make it an error not to supply the second [necessary] input argument.

cc @jrasero @jameschapman19

@jrasero (Contributor) commented May 2, 2022

I totally agree.

Maybe we could drop the None default value for y in both functions, and then add an if condition that raises an error if a second argument is not passed, something like this:

if y is None:
    raise ValueError("A second argument, corresponding to the confounders, should be passed")

This, plus a clearer explanation in the documentation.
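A runnable sketch of that idea (hypothetical names; the real fit in confounds does more validation than this):

```python
import numpy as np

class ResidualizeSketch:
    """Hypothetical sketch: make the confounds argument mandatory in fit()."""

    def fit(self, X, y=None):
        # raise early if the confounds were not supplied
        if y is None:
            raise ValueError("A second argument, corresponding to the "
                             "confounders, should be passed")
        X = np.asarray(X)
        self.n_features_ = X.shape[1]
        return self

res = ResidualizeSketch()
res.fit(np.ones((5, 3)), np.ones((5, 1)))   # OK: confounds supplied
print(res.n_features_)                       # 3

try:
    res.fit(np.ones((5, 3)))                 # no confounds -> ValueError
except ValueError as err:
    print(err)
```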

@raamana (Owner, Author) commented May 2, 2022

In fact, we should not even call it y :), and call it C instead.

Also, if we stop caring about sklearn, we could even get rid of the additional layers of calls ._fit() and ._transform(), although there might be other considerations we need to worry about (the ability to use the sklearn pipeline infrastructure, etc.). I need to think about this.

@jrasero (Contributor) commented May 2, 2022

100% agreed. That would avoid misinterpretations.

Also, if we are going to stop caring about sklearn (and this is just a matter of personal preference), I would require X and C to be passed by keyword, e.g. by defining fit(self, *, X, C). That way the user always knows what they are passing to the functions. Right now arguments can be passed positionally, so someone could confuse the order of X and C. I think this may even happen to experienced sklearn users, since the order of X and y in both fit and transform there follows a clear logic (X -> y).
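A minimal illustration of the keyword-only syntax (a toy class, not the actual confounds code):

```python
class Demo:
    # the bare * makes every parameter after it keyword-only
    def fit(self, *, X, C):
        return len(X), len(C)

d = Demo()
print(d.fit(X=[1, 2, 3], C=[4, 5, 6]))   # (3, 3): keyword call works

try:
    d.fit([1, 2, 3], [4, 5, 6])          # positional call is rejected
except TypeError as err:
    print(err)
```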

@raamana (Owner, Author) commented May 3, 2022

Good idea, I think that should be possible; let me double-check how to enforce it (or please share the details if you already know them). btw, would you want to take a crack at this? :)

@jrasero (Contributor) commented May 9, 2022

Sure Pradeep. This is what I was referring to. Let's take the example for Residualize:

import numpy as np
from sklearn.base import clone
from sklearn.utils.validation import (check_array, check_consistent_length,
                                      check_is_fitted)

# BaseDeconfound and get_model come from the confounds package itself
class Residualize(BaseDeconfound):
    """
    Deconfounding estimator class that residualizes the input features by
    subtracting the contributions from the confound variables

    Example methods: Linear, Kernel Ridge, Gaussian Process Regression etc
    """


    def __init__(self, model='linear'):
        """Constructor"""

        super().__init__(name='Residualize')

        self.model = model


    def fit(self,
            *,
            X,  # variable names chosen to correspond to sklearn when possible
            C,  # C holds the confound variables here, not the target y!
            ):
        """
        Fits the residualizing model (estimates the contributions of the
        confounding variables C to the given [training] feature set X).

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The training input samples.
        C : ndarray
            Array of covariates, shape (n_samples, n_covariates)
            This does not refer to target as is typical in scikit-learn.

        Returns
        -------
        self : object
            Returns self
        """

        return self._fit(X, C)  # which itself must return self


    def _fit(self, in_features, confounds=None):
        """Actual fit method"""

        in_features = check_array(in_features)
        confounds = check_array(confounds, ensure_2d=False)

        # turning it into 2D, in case if its just a column
        if confounds.ndim == 1:
            confounds = confounds[:, np.newaxis]

        try:
            check_consistent_length(in_features, confounds)
        except Exception as exc:
            raise ValueError('X (features) and C (confounds) must have '
                             'the same number of rows/samplets!') from exc

        self.n_features_ = in_features.shape[1]

        regr_model = clone(get_model(self.model))
        regr_model.fit(confounds, in_features)
        self.model_ = regr_model

        return self


    def transform(self, *, X, C):
        """
        Transforms the given feature set by residualizing the [test] features
        by subtracting the contributions of their confounding variables.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The test input samples.
        C : ndarray
            Array of covariates, shape (n_samples, n_covariates)
            This does not refer to target as is typical in scikit-learn.

        Returns
        -------
        X_residual : ndarray, shape (n_samples, n_features)
            The residualized [test] features.
        """

        return self._transform(X, C)


    def _transform(self, test_features, test_confounds):
        """Actual deconfounding of the test features"""

        check_is_fitted(self, ('model_', 'n_features_'))
        test_features = check_array(test_features, accept_sparse=True)

        if test_features.shape[1] != self.n_features_:
            raise ValueError('number of features must be {}. Given {}'
                             ''.format(self.n_features_, test_features.shape[1]))

        if test_confounds is None:  # during estimator checks
            return test_features  # do nothing

        test_confounds = check_array(test_confounds, ensure_2d=False)
        if test_confounds.ndim == 1:
            test_confounds = test_confounds[:, np.newaxis]
        check_consistent_length(test_features, test_confounds)

        # test features as can be explained/predicted by their covariates
        test_feat_predicted = self.model_.predict(test_confounds)
        residuals = test_features - test_feat_predicted

        return residuals

As you can see, I changed the name of y to C, and added a bare asterisk between self and X/C in both fit and transform. Then, if you try to run things like this:

res = Residualize()
res.fit(X, C)

it will give you an error, because you now have to pass the arguments by keyword, not by position. That is, you have to write the following for it to work:

res = Residualize()
res.fit(X=X, C=C)

It's wordier, but less error-prone in my opinion, as the user explicitly states what they are passing to the functions.

@raamana (Owner, Author) commented May 9, 2022

thanks Javier - can you link me to some docs on how this * trick works?

@jrasero (Contributor) commented May 9, 2022

@raamana (Owner, Author) commented May 9, 2022

I see, thanks. But wouldn't that allow weird behaviour like .transform(random_arg1, random_arg2, X=X, C=C)? I know people won't do it, but you see my point?

@jrasero (Contributor) commented May 10, 2022

No, that's not possible, Pradeep. Users will still be able to pass only two arguments, in this case X and C. Other weird scenarios like the one you've posted would give an error.
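Concretely, with a keyword-only signature any extra positional arguments are rejected by the interpreter (a toy sketch to illustrate, not the actual confounds code):

```python
class Demo:
    def transform(self, *, X, C):
        return X

d = Demo()
try:
    # extra positional arguments alongside the keywords raise a TypeError
    d.transform("junk1", "junk2", X=[1], C=[2])
except TypeError as err:
    print(err)

print(d.transform(X=[1], C=[2]))   # only the keyword form is accepted
```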

I know that using keyword-only arguments might not be standard, but I find it clearer, particularly in our case, where the ordering of the input data and covariates is not as self-evident as in the usual sklearn scenario (input data/target).

Anyway, it was just a suggestion, and it can easily be changed at any point in the future if wanted, for example if we see that users find the behaviour confusing. We need test users :-)

@raamana (Owner, Author) commented May 10, 2022

I like the idea; I just want to make sure there will be no side effects, etc.

@jrasero (Contributor) commented Jun 17, 2022

Pradeep, I've made a little progress on this issue today, but there's something we need to decide before I carry on pushing my changes.

If we finally decide to raise an error when the second input argument to fit and transform is not supplied, as you originally proposed, then we should remove/comment out the test_estimator_API function in the tests module; otherwise the tests won't pass, because check_estimator raises an error.

If we do this, we could also remove lines like these in base.py, which were included specifically to pass the mentioned test:

if test_confounds is None:  # during estimator checks
    return test_features  # do nothing


@raamana (Owner, Author) commented Jun 18, 2022

Sounds good - we can do whatever is needed to achieve our objectives. Shall we discuss this when we discuss the other PR #21 too?

@jrasero (Contributor) commented Jun 18, 2022

Sounds good!
