Functional api/identify effect #640

andresmor-ms · 2022-09-16T21:35:24Z

First in a series of PRs to add a functional API, according to: https://github.com/py-why/dowhy/wiki/API-proposal-for-v1

Refactor identify_effect to have a functional API
- Created BackdoorIdentifier class and extracted the logic from CausalIdentifier to be just a Protocol
- Refactored the identify_effect method of BackdoorIdentifier and IDIdentifier to take the graph as a parameter
- Moved constants into enums for easier type checking
- Backwards compatible with previous CausalModel API
- Added a notebook demonstrating that the CausalModel API and the new API behave the same way

Signed-off-by: Andres Morales <andresmor@microsoft.com>

dowhy/causal_identifier/id_identifier.py

dowhy/causal_identifier/identify_effect.py

emrekiciman

Thanks @andresmor-ms! Overall, this looks great! One change is that I don't think that frontdoor and IV identification should be in a class called backdoor identifier. I see that used to be called CausalIdentifier. I agree that that name is too generic. Let's either rename the backdoor class to something in-between (will think about what might be appropriate); or track an issue to refactor frontdoor and IV out into their own classes.

dowhy/causal_identifier/backdoor_identifier.py

Signed-off-by: Andres Morales <andresmor@microsoft.com>

amit-sharma

This is a great start. Thanks for adding this @andresmor-ms.

I've added a few comments in my initial review.

docs/source/example_notebooks/functional_api.ipynb

docs/source/example_notebooks/dowhy_efficient_backdoor_example.ipynb

dowhy/causal_estimators/two_stage_regression_estimator.py

dowhy/causal_identifier/identify_effect.py

Signed-off-by: Andres Morales <andresmor@microsoft.com>

amit-sharma

LGTM. I just added a few comments to improve readability of the notebook.

Post that, we can merge this in.

docs/source/example_notebooks/functional_api.ipynb

dowhy/causal_identifier/identify_effect.py

petergtz

@andresmor-ms My apologies for being a bit late in the game here. I missed this one.

I have a very general question: What is the idea behind having classes for identifiers as opposed to, say, a simple function? E.g. auto_identify_effect(...)

Typically, we want to use objects when we plan to give them a longer lifecycle. E.g. we want to pass them around where the consumer is not required to know what kind of identifier it is (IDIdentifier or AutoIdentifier). But if our typical usage is

identifier = AutoIdentifier(...)
identifier.identify_effect()

this can be simplified by providing a single call. As an added benefit, a lot less bookkeeping of variables is necessary (in the implementation of the class).

So yea, I'm curious about the usage patterns you have in mind here.

dowhy/causal_identifier/id_identifier.py

petergtz · 2022-09-22T15:27:27Z

I can see now that the identifiers are actually used in CausalModel (which I missed previously). So it seems we need those classes for backwards-compatibility.

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

@amit-sharma That's probably not just a comment for @andresmor-ms, but also a more general question. What do you think?

andresmor-ms · 2022-09-22T15:39:41Z

@andresmor-ms My apologies for being a bit late in the game here. I missed this one.

I have a very general question: What is the idea behind having classes for identifiers as opposed to, say, a simple function? E.g. auto_identify_effect(...)

Typically, we want to use objects when we plan to give them a longer lifecycle. E.g. we want to pass them around where the consumer is not required to know what kind of identifier it is (IDIdentifier or AutoIdentifier). But if our typical usage is

identifier = AutoIdentifier(...)
identifier.identify_effect()

this can be simplified by providing a single call. As an added benefit, a lot less bookkeeping of variables is necessary (in the implementation of the class).

So yea, I'm curious about the usage patterns you have in mind here.

No worries.

The intention of keeping the classes rather than refactoring them into functions is mainly to preserve backwards compatibility with the object-oriented API. I agree that having functions like auto_identify_effect(...) would simplify the code and the API. However, some parts of the code reference a CausalIdentifier object (for example, CausalModel keeps track of the identifier that was used). I could dig into what that usage is and check whether it can be easily removed as part of this refactor.

As for usage patterns, I believe what we want to achieve is a single point where you can configure everything (the identify_effect(...) function). In the future this function would have more logic and defaults to guide users who don't know which parameters should be set. The general usage would be:

import dowhy

dowhy.identify_effect(
    graph=graph,
    treatment=treatment_name,
    outcome=outcome_name,
    method=AutoIdentifier(...),
)

Just as a note, per my understanding what Peter suggests is to use it like this:

import dowhy

dowhy.auto_identify_effect(
    graph=graph,
    treatment=treatment_name,
    outcome=outcome_name,
    estimand_type=...,
    adjustment_method=...,
)

(We could still have a general identify_effect(...) function with defaults for users that need guidance or don't know which one to use)

@amit-sharma, @emrekiciman, I'd also like to know your opinion on this.

emrekiciman · 2022-09-22T15:51:16Z

Backwards compatibility is important for a while -- we haven't decided how long but I would think at least 6-12 months.

Yes, general usage would be per below. For the unfamiliar user, the identify_effect function could call AutoIdentifier as a default. I.e., like this:

import dowhy

dowhy.identify_effect(
    graph=graph,
    treatment=treatment_name,
    outcome=outcome_name
)

Conceptually, I see the identify_effect() function signature developing into a function that accepts arguments that are related to the causal question being asked: treatment, outcome, whether we care about direct/indirect effects, CATE, etc. In contrast, the specific identification methods (AutoIdentifier, IDIdentifier, ...) may also accept configuration arguments (how hard to search, whether to return a maximal or minimal conditioning set) that are more about the identification method than the causal question being asked. I expect we'll have more PRs to address this, though. Right now, it would be good to keep this PR focused on the functional refactoring.
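
A minimal sketch of the split described above, under stated assumptions: question-level arguments (graph, treatment, outcome) go to the top-level function, method-level configuration lives on the identifier object, and a default identifier is used when none is passed. All names here (Graph, AutoIdentifierSketch, minimal_adjustment_set) are hypothetical stand-ins rather than DoWhy's actual classes, and the "common parents" rule is a toy placeholder, not a real identification algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol

# Hypothetical stand-in: a graph as a mapping from each node to its parents.
Graph = Dict[str, List[str]]


class Identifier(Protocol):
    """Anything with a matching identify_effect method qualifies."""

    def identify_effect(self, graph: Graph, treatment: str, outcome: str) -> dict: ...


@dataclass
class AutoIdentifierSketch:
    """Illustrative only; not DoWhy's AutoIdentifier."""

    # Method-level configuration: about *how* to identify, not *what* to ask.
    minimal_adjustment_set: bool = True

    def identify_effect(self, graph: Graph, treatment: str, outcome: str) -> dict:
        # Toy rule (assumption): adjust for common parents of treatment and outcome.
        common = sorted(set(graph.get(treatment, [])) & set(graph.get(outcome, [])))
        return {
            "treatment": treatment,
            "outcome": outcome,
            "adjustment_set": common,
            "minimal": self.minimal_adjustment_set,
        }


def identify_effect(
    graph: Graph,
    treatment: str,
    outcome: str,
    method: Optional[Identifier] = None,
) -> dict:
    """Question-level entry point; falls back to a default identifier."""
    if method is None:
        method = AutoIdentifierSketch()
    return method.identify_effect(graph, treatment, outcome)


graph = {"Z": [], "T": ["Z"], "Y": ["T", "Z"]}
default_result = identify_effect(graph, treatment="T", outcome="Y")
custom_result = identify_effect(
    graph, treatment="T", outcome="Y",
    method=AutoIdentifierSketch(minimal_adjustment_set=False),
)
```

In this shape a newcomer can call identify_effect with only the causal question, while an advanced user passes a configured identifier through method.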

emrekiciman · 2022-09-22T16:25:02Z

@petergtz said:

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

Peter, I think deprecating the classes is where we want to go. I'd probably hold off on marking them as deprecated at least until we complete the functional refactoring of the whole API. I might also suggest not marking them as deprecated until the new API is no longer marked "experimental". I.e., I don't think we want a user to have to choose between using "deprecated" vs "experimental" APIs :D

Emre

petergtz · 2022-09-22T17:59:51Z

I.e., I don't think we want a user to have to choose between using "deprecated" vs "experimental" APIs :D

@emrekiciman Yes, that's certainly true :-). Also agree with the rest of what you said. It seems like what I'd like to see is that we don't want new library features built heavily on things we plan to deprecate. And that's sometimes hard to keep in mind when plans are made, but their execution takes several months. So comments here would help.

I believe we agreed, we would implement the "legacy" API using the new functional API. That's why I suggested to introduce auto_identify_effect and id_identify_effect and then have the corresponding classes just call these.

Then we could add a comment on the classes saying that "we plan to deprecate them" and "new code should directly depend on corresponding_function_xyz".

I'm fine with the global function identify_effect when it provides convenient defaults for e.g. the effect identifier. That will already provide value.

andresmor-ms · 2022-09-22T18:29:09Z

@petergtz said:

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

Peter, I think deprecating the classes is where we want to go. I'd probably hold off on marking them as deprecated at least until we complete the functional refactoring of the whole API. I might also suggest not marking them as deprecated until the new API is no longer marked "experimental". I.e., I don't think we want a user to have to choose between using "deprecated" vs "experimental" APIs :D

Emre

This is what I'm thinking to move forward:

- Refactor the classes to just call a function auto_identify_effect or id_identify_effect (as Peter suggested)
- Keep the identify_effect() function as the entry point with defaults for beginners
- Do not deprecate the old API yet, as the new one is not complete; add docs to the classes to warn that they will be deprecated in the future
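
The first and third points could look roughly like the sketch below, assuming a hypothetical plain function and a hypothetical legacy wrapper; neither is the code in this PR. The logic lives in the function, the class only forwards to it, and the docstring carries the "planned for deprecation" note without emitting any runtime warning yet.

```python
from typing import Dict, List

# Hypothetical stand-in: a graph as a mapping from each node to its parents.
Graph = Dict[str, List[str]]


def auto_identify_effect_sketch(graph: Graph, treatment: str, outcome: str) -> List[str]:
    """Toy identification logic kept in a plain function (illustrative rule only)."""
    return sorted(set(graph.get(treatment, [])) & set(graph.get(outcome, [])))


class LegacyAutoIdentifier:
    """Object-oriented wrapper kept for backwards compatibility (illustrative).

    Note: this class is planned for deprecation once the functional API is
    complete. New code should call auto_identify_effect_sketch directly.
    """

    def __init__(self, graph: Graph, treatment: str, outcome: str) -> None:
        self._graph = graph
        self._treatment = treatment
        self._outcome = outcome

    def identify_effect(self) -> List[str]:
        # No logic of its own: delegate to the function.
        return auto_identify_effect_sketch(self._graph, self._treatment, self._outcome)


graph = {"Z": [], "T": ["Z"], "Y": ["T", "Z"]}
via_function = auto_identify_effect_sketch(graph, "T", "Y")
via_class = LegacyAutoIdentifier(graph, "T", "Y").identify_effect()
```

Because the class holds no logic of its own, both entry points stay in sync by construction.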

What do you think? @emrekiciman @petergtz @amit-sharma

petergtz · 2022-09-22T18:32:59Z

What do you think?

Sounds great! Thanks, really appreciate your effort here, @andresmor-ms.

Signed-off-by: Andres Morales <andresmor@microsoft.com>

amit-sharma · 2022-09-23T07:39:33Z

If this is true, I'm wondering: should AutoIdentifier simply delegate to a new auto_identify_effect function and IDIdentifier to a new id_identify_effect function, and then both classes are marked as deprecated. Otherwise, I see the risk that the direction is not clear and new code builds again on these classes instead of the functions.

I'd like to offer a different perspective. I agree on removing CausalModel class and moving to functional top-level user API because the CausalModel class was just a book-keeping class. For the AutoIdentifier/IDIdentifier classes, here's some reasons to keep them as classes (the same arguments may also apply to estimation classes CausalEstimator and CausalRefuter classes):

- Having identifier methods as classes can make it easier for newcomers to add a new identification method. They would simply need to inherit from CausalIdentifier and then implement one method, identify_effect. All the other common methods will be available for free. So far, this class structure has worked really well for external contributors, who have found it easy to come up with PRs for new identification or refutation methods. A nice side-effect is that all the new experimental code is safely wrapped in a class, with low chances of affecting other functionality.
- Having a common type for all identifier methods is useful for type-checking the top-level API, dowhy.identify_effect(method=...). Here users can provide any class that subclasses CausalIdentifier. This also helps in setting up basic minimum hygiene (e.g., all identifiers are expected to take in minimum standard parameters).
- For the identifier method specifically, it helps us separate the initialization parameters of an identifier from actually calling it. This is helpful when a user wants to use the same identifier repeatedly for multiple effects in a graph (for different treatments and outcomes).
- Thinking ahead for estimation where we also have classes for each estimator, it helps us setup a minimum protocol for adding a new estimator (e.g., implementing fit, predict and do methods in a class). Since sklearn methods are class-based, it allows people to easily extend existing sklearn models for causal inference, by simply adding a do method.

More generally, I am thinking about an ideal state where we provide a stable, user-facing API for causal tasks. At the same time, we make it super simple for people to add new underlying methods that support the stable API. In that direction, I feel having underlying methods as classes could be a useful abstraction.

petergtz · 2022-09-23T14:06:40Z

Having identifier methods as classes can make it easier for newcomers to add a new identification method. They would simply need to inherit from CausalIdentifier and then implement one method, identify_effect. All the other common methods will be available for free. So far, this class structure has worked really well for external contributors, who have found it easy to come up with PRs for new identification or refutation methods. A nice side-effect is that all the new experimental code is safely wrapped in a class, with low chances of affecting other functionality.

@amit-sharma I'd like to respectfully disagree with the motivation to have a class purely for code re-use. Using inheritance for code re-use has fallen out of favor in software design for multiple reasons, which have been explained in many other places. The Rust book also describes this in its chapter about OOP. Python allows it, but I wouldn't use the feature unnecessarily. And I think what applies to us is that requiring contributors to inherit from CausalIdentifier in order to re-use those methods is actually more of a burden than just being able to call the plain functions (in the way that @andresmor-ms has implemented it in the latest revision of this PR).

Having a common type for all identifier methods is useful for type-checking the top-level API, dowhy.identify_effect(method=...). Here users can provide any class that subclasses CausalIdentifier. This also helps in setting up basic minimum hygiene (e.g., all identifiers are expected to take in minimum standard parameters).

I think we discussed introducing type hints for type-checking. That way we wouldn't require a wrapper for this task. Note that I'm not necessarily opposed to the global identify_effect function when it provides a simplified API with helpful defaults over the specialized functions.

For the identifier method specifically, it helps us separate the initialization parameters of an identifier from actually calling it. This is helpful when a user wants to use the same identifier repeatedly for multiple effects in a graph (for different treatments and outcomes).

That's indeed a scenario I could see for certain use cases. That's why I asked if we expect this usage pattern in this comment. Note though: we'd have to change the API of those classes nonetheless: right now, treatment_name and outcome_name are part of initialization, not of identify_effect.
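
A rough sketch of that "configure once, call many times" pattern, assuming the API change mentioned above (treatment and outcome passed at call time rather than at initialization). ReusableIdentifier, max_adjustment_size, and the toy "common parents" rule are hypothetical and only illustrate the shape of the usage.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in: a graph as a mapping from each node to its parents.
Graph = Dict[str, List[str]]


@dataclass(frozen=True)
class ReusableIdentifier:
    """Holds only method configuration; the causal question arrives per call."""

    max_adjustment_size: int = 2  # hypothetical configuration knob

    def identify_effect(self, graph: Graph, treatment: str, outcome: str) -> List[str]:
        # Toy rule (assumption): common parents of treatment and outcome,
        # truncated to the configured maximum size.
        common = sorted(set(graph.get(treatment, [])) & set(graph.get(outcome, [])))
        return common[: self.max_adjustment_size]


graph = {
    "Z": [],
    "W": [],
    "T1": ["Z"],
    "T2": ["Z", "W"],
    "Y": ["T1", "T2", "Z", "W"],
}

# Configure once, then reuse the same object for several treatments.
identifier = ReusableIdentifier(max_adjustment_size=1)
results = {
    (treatment, "Y"): identifier.identify_effect(graph, treatment, "Y")
    for treatment in ("T1", "T2")
}
```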

Thinking ahead for estimation where we also have classes for each estimator, it helps us setup a minimum protocol for adding a new estimator (e.g., implementing fit, predict and do methods in a class). Since sklearn methods are class-based, it allows people to easily extend existing sklearn models for causal inference, by simply adding a do method.

Agree with estimators being classes. The usage is slightly different from identifiers though, and we take advantage of polymorphism (as opposed to code re-use): by having an interface/Protocol for estimators, we can opaquely pass an estimator we got from fitting the estimand to the estimate_effect function without knowing the actual implementation of the estimator.
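
To illustrate the polymorphism point, here is a small sketch using typing.Protocol. The Estimator protocol, the two estimator classes, and this estimate_effect signature are hypothetical and much simpler than anything in DoWhy; the point is only that the caller depends on the interface, and no shared base class is needed.

```python
import statistics
from typing import Protocol, Sequence


class Estimator(Protocol):
    """Structural interface: any object with this method is an Estimator."""

    def estimate_effect(self, treated: Sequence[float], control: Sequence[float]) -> float: ...


class MeanDifferenceEstimator:
    def estimate_effect(self, treated: Sequence[float], control: Sequence[float]) -> float:
        return statistics.mean(treated) - statistics.mean(control)


class MedianDifferenceEstimator:
    def estimate_effect(self, treated: Sequence[float], control: Sequence[float]) -> float:
        return statistics.median(treated) - statistics.median(control)


def estimate_effect(estimator: Estimator, treated: Sequence[float], control: Sequence[float]) -> float:
    # The caller never inspects which concrete estimator it was given.
    return estimator.estimate_effect(treated, control)


treated = [2.0, 4.0, 9.0]
control = [1.0, 2.0, 3.0]
mean_effect = estimate_effect(MeanDifferenceEstimator(), treated, control)
median_effect = estimate_effect(MedianDifferenceEstimator(), treated, control)
```

Neither estimator inherits from anything; a static type checker would still accept both wherever an Estimator is expected.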

emrekiciman · 2022-09-23T22:47:49Z

I don't feel strongly about classes vs global functions, so won't say much about that.

Looking at the current code, I do have a question about how we can be clear in the code about the official and supported API, vs. internal functions. For example, let's say we want identify_effect, auto_identify_effect and id_identify_effect as top-level functions that users can call directly. But I'm looking at the code and also seeing top-level functions for identify_nde_effect, find_valid_adjustment_sets, build_backdoor_estimands_dict, identify_mediation_second_stage_confounders, construct_frontdoor_estimand, etc.

So, this is maybe more of a python question. What's the way we keep (at least most of) these other functions private, hidden, or otherwise clearly outside of the supported API?

petergtz · 2022-09-26T12:28:26Z

So, this is maybe more of a python question. What's the way we keep (at least most of) these other functions private, hidden, or otherwise clearly outside of the supported API?

@emrekiciman https://peps.python.org/pep-0008/#public-and-internal-interfaces provides some guidelines here:

Documented interfaces are considered public, unless the documentation explicitly declares them to be provisional or internal interfaces exempt from the usual backwards compatibility guarantees. All undocumented interfaces should be assumed to be internal.

Also:

To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute. Setting __all__ to an empty list indicates that the module has no public API.

And finally:

internal interfaces (packages, modules, classes, functions, attributes or other names) should still be prefixed with a single leading underscore.

The latter recommendation is often a bit impractical in my experience: when you prefix a module or function with _, style checkers or linters often flag this when you import it from another module. The control mechanisms are simply not precise enough in Python. Which is why I'd go with the first 2 recommendations.
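
As a small illustration of what `__all__` does and does not do, the sketch below builds a throwaway in-memory module (the name demo_identifier and its toy functions are made up for the example) and performs a star import from it. Only the names listed in `__all__` are exported by the star import, but the other functions remain reachable by explicit attribute access, so `__all__` is a declaration of the public API rather than an enforcement mechanism.

```python
import sys
import types

# Source of a hypothetical module with one public and two internal functions.
source = '''
__all__ = ["identify_effect"]

def identify_effect():
    return "public"

def find_valid_adjustment_sets():
    return "internal helper, not listed in __all__"

def _private_helper():
    return "internal by naming convention"
'''

# Build the module in memory and register it so that it can be imported.
module = types.ModuleType("demo_identifier")
exec(source, module.__dict__)
sys.modules["demo_identifier"] = module

# Star-import into a fresh namespace and record which names arrived.
namespace = {}
exec("from demo_identifier import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))

# Explicit access still works: __all__ does not hide anything.
helper_result = module.find_valid_adjustment_sets()
```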

andresmor-ms · 2022-09-26T15:00:31Z

The control mechanisms are simply not precise enough in Python. Which is why I'd go with the first 2 recommendations.

I added the __all__ for the causal_identifier package:

__all__ = [
    "AutoIdentifier",
    "auto_identify_effect",
    "id_identify_effect",
    "BackdoorAdjustment",
    "EstimandType",
    "IdentifiedEstimand",
    "IDIdentifier",
    "identify_effect",
]

I didn't add it to the dowhy package because I wasn't sure whether we wanted all of those there. For the dowhy package, I'd suggest adding only the names that are part of the new functional API:

__all__ = [
    "auto_identify_effect",
    "id_identify_effect",
    "EstimandType",
    "IdentifiedEstimand",
    "identify_effect",
]

Another question is whether we want to remove the names we offer on the dowhy package from the dowhy.causal_identifier package, so that the "official" import to get the functions would be:

import dowhy
dowhy.auto_identify_effect(...)

and not:

import dowhy.causal_identifier
dowhy.causal_identifier.identify_effect(...)

Or if we want both of them to be supported imports?

dowhy/causal_identifier/id_identifier.py

Signed-off-by: Andres Morales <andresmor@microsoft.com>

emrekiciman

This has been a robust discussion. Given that the PR has been approved previously by everyone before @andresmor-ms added type hints, I'll go ahead and accept and merge.

Let's leave additional points (the remaining __all__ public interface declarations, further refinement of functional vs class interfaces, demarcation of deprecated and experimental functionality, ...) as future PRs.

Padarn · 2022-11-20T12:56:40Z

Sorry to ask a question on an old MR but I'm trying to understand some of the design direction. I'm having a bit of trouble understanding:

I can see now that the identifiers are actually used in CausalModel (which I missed previously). So it seems we need those classes for backwards-compatibility.

It's not clear to me why these couldn't be replaced by functions? Which code are you trying to maintain backwards compatibility with - users or internally?

amit-sharma · 2022-11-21T04:22:26Z

Hey @Padarn, we only want to maintain backwards compatibility with the user API, so identifiers have already been replaced by functions. You can take a look at the main branch, e.g., the main identify_effect function.

Padarn · 2022-11-22T00:33:34Z

Ohh that is much more clear now. Thanks for the clarification @amit-sharma.

andresmor-ms requested review from darthtrevino, amit-sharma and emrekiciman September 16, 2022 21:35

andresmor-ms added 3 commits September 16, 2022 15:48

Functional API: Refactor identify_effect

64a5044

Signed-off-by: Andres Morales <andresmor@microsoft.com>

addressing some early comments by Amit

c3768d9

Signed-off-by: Andres Morales <andresmor@microsoft.com>

run format tool

6d1a464

Signed-off-by: Andres Morales <andresmor@microsoft.com>

andresmor-ms force-pushed the functional_api/identify_effect branch from dd4b31d to 6d1a464 September 16, 2022 21:49

darthtrevino previously approved these changes Sep 19, 2022

dowhy/causal_identifier/id_identifier.py (outdated)

dowhy/causal_identifier/identify_effect.py

emrekiciman requested changes Sep 20, 2022

Address PR comments

5494d1d

Signed-off-by: Andres Morales <andresmor@microsoft.com>

andresmor-ms dismissed darthtrevino’s stale review via 5494d1d September 20, 2022 22:05

amit-sharma reviewed Sep 21, 2022

andresmor-ms added 2 commits September 21, 2022 12:43

Imports as top level functions rename defaultidentifier

7d75891

Signed-off-by: Andres Morales <andresmor@microsoft.com>

Clear output of notebook

9c27b3a

Signed-off-by: Andres Morales <andresmor@microsoft.com>

emrekiciman mentioned this pull request Sep 21, 2022

Add support for non-singleton treatment/outcome sets to "has_directed_path" method #654

Open

andresmor-ms marked this pull request as ready for review September 21, 2022 21:52

emrekiciman previously approved these changes Sep 21, 2022

amit-sharma reviewed Sep 22, 2022

petergtz reviewed Sep 22, 2022

dowhy/causal_identifier/identify_effect.py (outdated)

petergtz reviewed Sep 22, 2022

dowhy/causal_identifier/id_identifier.py (outdated)

Refactor identify_effect to functions, notebook updates

f664ce4

Signed-off-by: Andres Morales <andresmor@microsoft.com>

andresmor-ms dismissed emrekiciman’s stale review via f664ce4 September 22, 2022 20:42

petergtz previously approved these changes Sep 23, 2022

amit-sharma previously approved these changes Sep 26, 2022

darthtrevino previously approved these changes Sep 26, 2022

dowhy/causal_identifier/id_identifier.py (outdated)

Add type hints

c59924f

Signed-off-by: Andres Morales <andresmor@microsoft.com>

andresmor-ms dismissed stale reviews from darthtrevino, amit-sharma, and petergtz via c59924f September 26, 2022 19:40

emrekiciman approved these changes Sep 27, 2022

emrekiciman merged commit db953a6 into main Sep 27, 2022

emrekiciman deleted the functional_api/identify_effect branch September 27, 2022 01:38

petergtz added the enhancement label Nov 10, 2022
