Functional api/identify effect #640

Merged
merged 8 commits into from Sep 27, 2022

Conversation

andresmor-ms
Collaborator

First in a series of PRs to add a functional API, as described in: https://github.com/py-why/dowhy/wiki/API-proposal-for-v1

  • Refactor identify_effect to have a functional API
    • Created BackdoorIdentifier class and extracted the logic from CausalIdentifier to be just a Protocol
    • Refactor the identify_effect method of BackdoorIdentifier and IDIdentifier to take the graph as parameter
    • Moved constants into enums for easier type checking
    • Backwards compatible with previous CausalModel API
    • Added a notebook demonstrating that the CausalModel API and the new API behave the same way

Signed-off-by: Andres Morales <andresmor@microsoft.com>
Signed-off-by: Andres Morales <andresmor@microsoft.com>
Signed-off-by: Andres Morales <andresmor@microsoft.com>
darthtrevino
darthtrevino previously approved these changes Sep 19, 2022
dowhy/causal_identifier/id_identifier.py (outdated, resolved)
dowhy/causal_identifier/identify_effect.py (resolved)
Member

@emrekiciman emrekiciman left a comment

Thanks @andresmor-ms! Overall, this looks great! One change is that I don't think that frontdoor and IV identification should be in a class called backdoor identifier. I see that used to be called CausalIdentifier. I agree that that name is too generic. Let's either rename the backdoor class to something in-between (will think about what might be appropriate); or track an issue to refactor frontdoor and IV out into their own classes.

dowhy/causal_identifier/backdoor_identifier.py (outdated, resolved)
dowhy/causal_identifier/backdoor_identifier.py (outdated, resolved)
dowhy/causal_identifier/backdoor_identifier.py (outdated, resolved)
dowhy/causal_identifier/backdoor_identifier.py (outdated, resolved)
Signed-off-by: Andres Morales <andresmor@microsoft.com>
Member

@amit-sharma amit-sharma left a comment

This is a great start. Thanks for adding this @andresmor-ms.

I've added a few comments in my initial review.

docs/source/example_notebooks/functional_api.ipynb (outdated, resolved)
docs/source/example_notebooks/functional_api.ipynb (outdated, resolved)
docs/source/example_notebooks/functional_api.ipynb (outdated, resolved)
docs/source/example_notebooks/functional_api.ipynb (outdated, resolved)
dowhy/causal_estimators/two_stage_regression_estimator.py (outdated, resolved)
dowhy/causal_identifier/identify_effect.py (resolved)
Signed-off-by: Andres Morales <andresmor@microsoft.com>
Signed-off-by: Andres Morales <andresmor@microsoft.com>
emrekiciman
emrekiciman previously approved these changes Sep 21, 2022
Member

@amit-sharma amit-sharma left a comment

LGTM. I just added a few comments to improve readability of the notebook.

Post that, we can merge this in.

docs/source/example_notebooks/functional_api.ipynb (outdated, resolved)
docs/source/example_notebooks/functional_api.ipynb (outdated, resolved)
docs/source/example_notebooks/functional_api.ipynb (outdated, resolved)
Member

@petergtz petergtz left a comment

@andresmor-ms My apologies for being a bit late in the game here. I missed this one.

I have a very general question: What is the idea behind having classes for identifiers as opposed to, say, a simple function? E.g. auto_identify_effect(...)

Typically, we want to use objects when we plan to give them a longer lifecycle, e.g. when we want to pass them around and the consumer is not required to know what kind of identifier it is (IDIdentifier or AutoIdentifier). But if our typical usage is

identifier = AutoIdentifier(...)
identifier.identify_effect()

this can be simplified by providing a single call. As an added benefit, a lot less bookkeeping of variables is necessary (in the implementation of the class).

So yea, I'm curious about the usage patterns you have in mind here.

@petergtz
Member

I can see now that the identifiers are actually used in CausalModel (which I missed previously). So it seems we need those classes for backwards-compatibility.

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

@amit-sharma That's probably not just a comment for @andresmor-ms, but also a more general question. What do you think?

@andresmor-ms
Collaborator Author

@andresmor-ms My apologies for being a bit late in the game here. I missed this one.

I have a very general question: What is the idea behind having classes for identifiers as opposed to, say, a simple function? E.g. auto_identify_effect(...)

Typically, we want to use objects when we plan to give them a longer lifecycle, e.g. when we want to pass them around and the consumer is not required to know what kind of identifier it is (IDIdentifier or AutoIdentifier). But if our typical usage is

identifier = AutoIdentifier(...)
identifier.identify_effect()

this can be simplified by providing a single call. As an added benefit, a lot less bookkeeping of variables is necessary (in the implementation of the class).

So yea, I'm curious about the usage patterns you have in mind here.

No worries.

The intention of not refactoring this as functions (and keeping the classes) is mainly to keep backwards compatibility with the object-oriented API. I agree that having functions like auto_identify_effect(...) would simplify the code and the API. However, some parts of the code reference a CausalIdentifier object (for example, the CausalModel keeps track of the identifier that was used). I could dig into what the usage is and check whether it can be easily removed as part of this refactor.

As for usage patterns, I believe what we want to achieve is a single point where you can configure everything (the identify_effect(...) function). In the future this function would have more logic and defaults to guide users who don't know which parameters should be set. The general usage would be:

import dowhy

dowhy.identify_effect(
    graph=graph,
    treatment=treatment_name,
    outcome=outcome_name,
    method=AutoIdentifier(...),
)

Just as a note, per my understanding, what Peter suggests is to use it like this:

import dowhy

dowhy.auto_identify_effect(
    graph=graph,
    treatment=treatment_name,
    outcome=outcome_name,
    estimand_type=...,
    adjustment_method=...,
)

(We could still have a general identify_effect(...) function with defaults for users that need guidance or don't know which one to use)

@amit-sharma, @emrekiciman, I'd also like to know your opinion on this.

@emrekiciman
Member

Backwards compatibility is important for a while -- we haven't decided how long but I would think at least 6-12 months.

Yes, general usage would be per below. For the unfamiliar user, the identify_effect function could call AutoIdentifier as a default. I.e., like this:

import dowhy

dowhy.identify_effect(
    graph=graph,
    treatment=treatment_name,
    outcome=outcome_name
)

Conceptually, I see the identify_effect() function signature developing into a function that accepts arguments that are related to the causal question being asked: treatment, outcome, whether we care about direct/indirect effects, CATE, etc. In contrast, the specific identification methods (AutoIdentifier, IDIdentifier, ...) may also accept configuration arguments (how hard to search, whether to return a maximal or minimal conditioning set) that are more about the identification method rather than the causal question being asked. I expect we'll have more PRs to address this, though. Right now, would be good to keep this PR focused on the functional refactoring?
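
A minimal sketch of the split described above, using toy stand-ins rather than DoWhy's actual classes: the entry point takes only arguments about the causal question and falls back to a default method, while the method object carries its own configuration. `ToyAutoIdentifier`, its `max_search_depth` option, the parent-list graph encoding, and the dictionary return value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class Identifier(Protocol):
    """Anything exposing identify_effect(graph, treatment, outcome)."""

    def identify_effect(self, graph: dict, treatment: str, outcome: str) -> dict: ...


@dataclass
class ToyAutoIdentifier:
    # Method configuration: how the search runs, not what is being asked.
    max_search_depth: int = 3  # hypothetical option, unused by the toy rule below

    def identify_effect(self, graph: dict, treatment: str, outcome: str) -> dict:
        # Toy rule (not a real backdoor search): adjust for the parents
        # that treatment and outcome have in common.
        shared = set(graph.get(treatment, [])) & set(graph.get(outcome, []))
        return {
            "treatment": treatment,
            "outcome": outcome,
            "adjustment_set": sorted(shared),
            "method": type(self).__name__,
        }


def identify_effect(graph: dict, treatment: str, outcome: str,
                    method: Optional[Identifier] = None) -> dict:
    """Entry point: question arguments only; the method defaults for newcomers."""
    if method is None:
        method = ToyAutoIdentifier()
    return method.identify_effect(graph, treatment, outcome)


# Graph encoded as node -> list of parents (an assumption of this sketch).
graph = {"T": ["Z"], "Y": ["T", "Z"], "Z": []}
default_result = identify_effect(graph, treatment="T", outcome="Y")
custom_result = identify_effect(graph, "T", "Y", method=ToyAutoIdentifier(max_search_depth=1))
```

Under this layout, adding a question-level argument (say, direct versus indirect effects) would change only the entry point, while a new search option would change only the method class.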

@emrekiciman
Member

@petergtz said:

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

Peter, I think deprecating the classes is where we want to go. I'd probably hold off on marking them as deprecated at least until we complete the functional refactoring of the whole API. I might also suggest not marking them as deprecated until the new API is no longer marked "experimental". I.e., I don't think we want a user to have to choose between using "deprecated" vs "experimental" APIs :D

Emre

@petergtz
Member

I.e., I don't think we want a user to have to choose between using "deprecated" vs "experimental" APIs :D

@emrekiciman Yes, that's certainly true :-). Also agree with the rest of what you said. It seems like what I'd like to see is that we don't want new library features built heavily on things we plan to deprecate. And that's sometimes hard to keep in mind when plans are made, but their execution takes several months. So comments here would help.

I believe we agreed, we would implement the "legacy" API using the new functional API. That's why I suggested to introduce auto_identify_effect and id_identify_effect and then have the corresponding classes just call these.

Then we could add a comment on the classes saying that "we plan to deprecate them" and "new code should directly depend on corresponding_function_xyz".

I'm fine with the global function identify_effect when it provides convenient defaults, e.g. for the effect identifier. That will already provide value.

@andresmor-ms
Collaborator Author

@petergtz said:

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

Peter, I think deprecating the classes is where we want to go. I'd probably hold off on marking them as deprecated at least until we complete the functional refactoring of the whole API. I might also suggest not marking them as deprecated until the new API is no longer marked "experimental". I.e., I don't think we want a user to have to choose between using "deprecated" vs "experimental" APIs :D

Emre

This is what I'm thinking to move forward:

  1. Refactor the classes to just call a function auto_identify_effect or id_identify_effect (as Peter suggested)
  2. Keep the identify_effect() function as the entry point, with defaults for beginners.
  3. Do not deprecate the old API yet, as the new one is not complete; add docs to the classes to warn that they will be deprecated in the future
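
Steps 1 and 3 of this plan can be sketched as follows. This is an illustration only, not the code merged in this PR: the names come from the discussion, but the signatures, the toy identification rule, the parent-list graph encoding, and the dictionary return value are assumptions.

```python
def auto_identify_effect(graph: dict, treatment: str, outcome: str) -> dict:
    """Functional entry point that holds the actual logic (toy placeholder here)."""
    # Toy rule: adjust for parents shared by treatment and outcome.
    shared = set(graph.get(treatment, [])) & set(graph.get(outcome, []))
    return {"treatment": treatment, "outcome": outcome, "adjustment_set": sorted(shared)}


class AutoIdentifier:
    """Legacy wrapper kept for backwards compatibility.

    Planned for deprecation once the functional API is complete.
    New code should call auto_identify_effect directly.
    """

    def identify_effect(self, graph: dict, treatment: str, outcome: str) -> dict:
        # No logic lives here: the class only delegates to the function.
        return auto_identify_effect(graph, treatment, outcome)


# Graph encoded as node -> list of parents (an assumption of this sketch).
example_graph = {"T": ["Z"], "Y": ["T", "Z"], "Z": []}
via_class = AutoIdentifier().identify_effect(example_graph, "T", "Y")
via_function = auto_identify_effect(example_graph, "T", "Y")
```

Because the class holds no logic of its own, both call paths return the same result, and the warning lives in the docstring rather than in a runtime deprecation notice.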

What do you think? @emrekiciman @petergtz @amit-sharma

@petergtz
Member

What do you think?

Sounds great! Thanks, really appreciate your effort here, @andresmor-ms.

Signed-off-by: Andres Morales <andresmor@microsoft.com>
petergtz
petergtz previously approved these changes Sep 23, 2022
@amit-sharma
Member

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

I'd like to offer a different perspective. I agree on removing CausalModel class and moving to functional top-level user API because the CausalModel class was just a book-keeping class. For the AutoIdentifier/IDIdentifier classes, here's some reasons to keep them as classes (the same arguments may also apply to estimation classes CausalEstimator and CausalRefuter classes):

  1. Having identifier methods as classes can make it easier for newcomers to add a new identification method. They would simply need to inherit from CausalIdentifier and then implement one method, identify_effect. All the other common methods will be available for free. So far, this class structure has worked really well for external contributors, who have found it easy to come up with PRs for new identification or refutation methods. A nice side-effect is that all the new experimental code is safely wrapped in a class, with low chances of affecting other functionality.
  2. Having a common type for all identifier methods is useful for type-checking the top-level API, dowhy.identify_effect(method=...). Here users can provide any class that subclasses CausalIdentifier. This also helps in setting up basic minimum hygiene (e.g., all identifiers are expected to take in minimum standard parameters).
  3. For the identifier method specifically, it helps us separate the initialization parameters of an identifier from actually calling it. This is helpful when a user wants to use the same identifier repeatedly for multiple effects in a graph (for different treatments and outcomes).
  4. Thinking ahead for estimation where we also have classes for each estimator, it helps us setup a minimum protocol for adding a new estimator (e.g., implementing fit, predict and do methods in a class). Since sklearn methods are class-based, it allows people to easily extend existing sklearn models for causal inference, by simply adding a do method.

More generally, I am thinking about an ideal state where we provide a stable, user-facing API for causal tasks. At the same time, we make it super simple for people to add new underlying methods that support the stable API. In that direction, I feel having underlying methods as classes could be a useful abstraction.
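
Points 2 and 3 above can be illustrated with a small sketch: a shared type for type-checking, plus an identifier that is configured once and then reused for several treatments on the same graph. The Protocol name, the `unobserved` option, the toy identification rule, and the parent-list graph encoding are hypothetical and not part of DoWhy.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class CausalIdentifierLike(Protocol):
    """Common type that a top-level identify_effect(method=...) could accept."""

    def identify_effect(self, graph: dict, treatment: str, outcome: str) -> List[str]: ...


@dataclass
class ToyIdentifier:
    # Initialization parameters: set once, independent of any single query.
    unobserved: frozenset = frozenset()

    def identify_effect(self, graph: dict, treatment: str, outcome: str) -> List[str]:
        # Toy rule: shared parents of treatment and outcome, minus unobserved nodes.
        shared = set(graph.get(treatment, [])) & set(graph.get(outcome, []))
        return sorted(shared - self.unobserved)


# Graph encoded as node -> list of parents (an assumption of this sketch).
graph = {"T1": ["Z", "U"], "T2": ["Z"], "Y": ["T1", "T2", "Z", "U"]}

# Configure once, then reuse for several treatments.
identifier = ToyIdentifier(unobserved=frozenset({"U"}))
results = {t: identifier.identify_effect(graph, t, "Y") for t in ("T1", "T2")}
matches_protocol = isinstance(identifier, CausalIdentifierLike)
```

Note that `ToyIdentifier` satisfies the common type structurally, without inheriting from it.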

@petergtz
Member

  1. Having identifier methods as classes can make it easier for newcomers to add a new identification method. They would simply need to inherit from CausalIdentifier and then implement one method, identify_effect. All the other common methods will be available for free. So far, this class structure has worked really well for external contributors, who have found it easy to come up with PRs for new identification or refutation methods. A nice side-effect is that all the new experimental code is safely wrapped in a class, with low chances of affecting other functionality.

@amit-sharma I'd like to respectfully disagree with the motivation to have a class simply for code re-use. Using inheritance for code re-use has fallen out of favor in software design for multiple reasons, which have been explained in many other places. The Rust book also describes this in its chapter about OOP. Python allows it, but I wouldn't use the feature unnecessarily. And I think what applies to us is that requiring contributors to inherit from CausalIdentifier in order to re-use those methods is actually more of a burden than just being able to call the plain functions (in the way that @andresmor-ms has implemented it in the latest revision of this PR).

  2. Having a common type for all identifier methods is useful for type-checking the top-level API, dowhy.identify_effect(method=...). Here users can provide any class that subclasses CausalIdentifier. This also helps in setting up basic minimum hygiene (e.g., all identifiers are expected to take in minimum standard parameters).

I think we discussed introducing type hints for type-checking. That way we wouldn't require a wrapper for this task. Note that I'm not necessarily opposed to the global identify_effect function when it provides a simplified API with helpful defaults over the specialized functions.

  3. For the identifier method specifically, it helps us separate the initialization parameters of an identifier from actually calling it. This is helpful when a user wants to use the same identifier repeatedly for multiple effects in a graph (for different treatments and outcomes).

That's indeed a scenario I could see for certain use cases. That's why I asked if we expect this usage pattern in this comment. Note though: we'd have to change the API of those classes nonetheless: right now, treatment_name and outcome_name are part of initialization, not of identify_effect.

  4. Thinking ahead for estimation where we also have classes for each estimator, it helps us setup a minimum protocol for adding a new estimator (e.g., implementing fit, predict and do methods in a class). Since sklearn methods are class-based, it allows people to easily extend existing sklearn models for causal inference, by simply adding a do method.

Agree with estimators being classes. The usage is slightly different from identifiers though, and we take advantage of polymorphism (as opposed to code re-use): by having an interface/Protocol for estimators we can opaquely pass an estimator we got from fitting the estimand to the estimate_effect function without knowing the actual implementation of the estimator.
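
The polymorphism argument can be sketched like this, assuming a hypothetical `Estimator` Protocol with `fit` and `effect` methods; neither the Protocol nor the two toy estimators correspond to DoWhy's actual estimator classes.

```python
from typing import Protocol, Sequence


class Estimator(Protocol):
    def fit(self, treatment: Sequence[float], outcome: Sequence[float]) -> "Estimator": ...
    def effect(self) -> float: ...


class DifferenceInMeans:
    def fit(self, treatment, outcome):
        treated = [y for t, y in zip(treatment, outcome) if t == 1]
        control = [y for t, y in zip(treatment, outcome) if t == 0]
        self._effect = sum(treated) / len(treated) - sum(control) / len(control)
        return self

    def effect(self):
        return self._effect


class RegressionSlope:
    def fit(self, treatment, outcome):
        n = len(treatment)
        mean_t, mean_y = sum(treatment) / n, sum(outcome) / n
        cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(treatment, outcome))
        var = sum((t - mean_t) ** 2 for t in treatment)
        self._effect = cov / var  # ordinary least squares slope of outcome on treatment
        return self

    def effect(self):
        return self._effect


def estimate_effect(estimator: Estimator) -> float:
    """The consumer only relies on the Protocol, not on the concrete class."""
    return estimator.effect()


treatment = [1, 1, 0, 0]
outcome = [5.0, 7.0, 2.0, 4.0]
effects = [estimate_effect(cls().fit(treatment, outcome))
           for cls in (DifferenceInMeans, RegressionSlope)]
```

Neither estimator inherits from a base class. `estimate_effect` accepts both because they share the same method signatures, which is the interface-based use of classes described above.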

@emrekiciman
Member

I don't feel strongly about classes vs global functions, so won't say much about that.

Looking at the current code, I do have a question about how we can be clear in the code about the official and supported API, vs. internal functions. For example, let's say we want identify_effect, auto_identify_effect and id_identify_effect as top-level functions that users can call directly. But I'm looking at the code and also seeing top-level functions for identify_nde_effect, find_valid_adjustment_sets, build_backdoor_estimands_dict, identify_mediation_second_stage_confounders, construct_frontdoor_estimand, etc.

So, this is maybe more of a python question. What's the way we keep (at least most of) these other functions private, hidden, or otherwise clearly outside of the supported API?

@petergtz
Member

So, this is maybe more of a python question. What's the way we keep (at least most of) these other functions private, hidden, or otherwise clearly outside of the supported API?

@emrekiciman https://peps.python.org/pep-0008/#public-and-internal-interfaces provides some guidelines here:

Documented interfaces are considered public, unless the documentation explicitly declares them to be provisional or internal interfaces exempt from the usual backwards compatibility guarantees. All undocumented interfaces should be assumed to be internal.

Also:

To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute. Setting __all__ to an empty list indicates that the module has no public API.

And finally:

internal interfaces (packages, modules, classes, functions, attributes or other names) should still be prefixed with a single leading underscore.

The latter recommendation is often a bit impractical in my experience: when you prefix a module or function with _, style checkers or linters often flag this when you import it from another module. The control mechanisms are simply not precise enough in Python. Which is why I'd go with the first 2 recommendations.
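
As a small illustration of what `__all__` does and does not enforce, the sketch below builds a throwaway module in memory. The module name `toy_identifier` and its functions are hypothetical, chosen only to mirror the kinds of names discussed here.

```python
import sys
import types

# Source of a hypothetical module; the names are illustrative, not DoWhy's layout.
MODULE_SOURCE = '''
__all__ = ["identify_effect"]

def identify_effect():
    return find_adjustment_sets()

def find_adjustment_sets():
    # Public-looking name, but intentionally left out of __all__.
    return ["Z"]
'''

# Register the module so that normal import statements can find it.
toy = types.ModuleType("toy_identifier")
exec(MODULE_SOURCE, toy.__dict__)
sys.modules["toy_identifier"] = toy

# A star import only pulls in the names listed in __all__ ...
namespace = {}
exec("from toy_identifier import *", namespace)
star_imported = sorted(name for name in namespace if not name.startswith("__"))

# ... but __all__ is advisory: an explicit import of other names still works.
from toy_identifier import find_adjustment_sets
explicit_result = find_adjustment_sets()
```

So `__all__` documents the supported surface and shapes star imports and introspection, but it does not hide the remaining functions from callers who import them by name.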

@andresmor-ms
Collaborator Author

andresmor-ms commented Sep 26, 2022

The control mechanisms are simply not precise enough in Python. Which is why I'd go with the first 2 recommendations.

I added the __all__ for the causal_identifier package:

__all__ = [
    "AutoIdentifier",
    "auto_identify_effect",
    "id_identify_effect",
    "BackdoorAdjustment",
    "EstimandType",
    "IdentifiedEstimand",
    "IDIdentifier",
    "identify_effect",
]

I didn't add it to the dowhy package because I wasn't sure if we wanted all of those there. For the dowhy package, I'd suggest that we add the following, which are part of the new functional API:

__all__ = [
    "auto_identify_effect",
    "id_identify_effect",
    "EstimandType",
    "IdentifiedEstimand",
    "identify_effect",
]

Another question is whether we want to remove the names that we offer in the dowhy package from the dowhy.causal_identifier package, so that the "official" import to get the functions would be:

import dowhy
dowhy.auto_identify_effect(...)

and not:

import dowhy.causal_identifier
dowhy.causal_identifier.identify_effect(...)

Or if we want both of them to be supported imports?

amit-sharma
amit-sharma previously approved these changes Sep 26, 2022
darthtrevino
darthtrevino previously approved these changes Sep 26, 2022
dowhy/causal_identifier/id_identifier.py (outdated, resolved)
Signed-off-by: Andres Morales <andresmor@microsoft.com>
Member

@emrekiciman emrekiciman left a comment

This has been a robust discussion. Given that the PR has been approved previously by everyone before @andresmor-ms added type hints, I'll go ahead and accept and merge.

Let's leave additional points (the remaining __all__ public interface declarations, further refinement of functional vs class interfaces, demarcation of deprecated and experimental functionality, ...) as future PRs.

@emrekiciman emrekiciman merged commit db953a6 into main Sep 27, 2022
@emrekiciman emrekiciman deleted the functional_api/identify_effect branch September 27, 2022 01:38
@petergtz petergtz added the enhancement New feature or request label Nov 10, 2022
@Padarn
Contributor

Padarn commented Nov 20, 2022

Sorry to ask a question on an old MR but I'm trying to understand some of the design direction. I'm having a bit of trouble understanding:

I can see now that the identifiers are actually used in CausalModel (which I missed previously). So it seems we need those classes for backwards-compatibility.

It's not clear to me why these couldn't be replaced by functions. Which code are you trying to maintain backwards compatibility with: users' code or internal code?

@amit-sharma
Member

Hey @Padarn, we only want to maintain backwards compatibility with the user API, so identifiers have already been replaced by functions. You can take a look at the main branch, e.g. the main identify_effect function.

@Padarn
Contributor

Padarn commented Nov 22, 2022

Ohh that is much more clear now. Thanks for the clarification @amit-sharma.

Labels
enhancement New feature or request

7 participants