Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design: Should type aliases be used for Ibis types? #16

Open
OlivierBinette opened this issue Dec 16, 2023 · 2 comments
Open

Design: Should type aliases be used for Ibis types? #16

OlivierBinette opened this issue Dec 16, 2023 · 2 comments

Comments

@OlivierBinette
Copy link
Contributor

Note: this issue is not super important, but I think it's something to think about to help clarify some aspects of the code.

Most Mismo modules rely on Ibis types from ibis.expr.types. This introduces a direct dependency on this module throughout the code, while requiring users to refer to this external dependency to understand the code.

Would it be preferable for Mismo to have its own types module that only imports what it needs? This would provide a few advantages:

  1. It would constrain dependencies to Ibis types to a single module, making it easier to adapt to changes to Ibis, and maybe making it a bit easier to mock Ibis types in tests.
  2. It would make it possible to document ibis types used throughout the package, without needing users to refer to an external dependency.
  3. It would provide a place where to extend Ibis types and document these extensions.

For example, Mismo cluster metrics currently assume that Table objects representing membership vectors will have the two columns "record_id" and "label". This could be formalized by defining a Table subtype that can be documented in a single place, as follows:

class LabelingTable(Table):  # Could also be called "Clustering" or "MembershipVector"
    record_id: Column
    label: Column

This isn't technically duck typing (but does it matter, given that runtime Python doesn't generally care about nominal types?), and we're a bit liberal with our use of class attributes to describe what we want to be class attributes. But we can document the meaning and then use table.record_id and table.label in code functions that expect a membership vector as input.

Additionally, we could have Mismo submodules each have a types submodule, where important interface definitions are placed and documented. For instance, all metrics function have a shared interface, expecting two LabelingTable objects as input. This could be defined in metrics.types to make this a stable interface definition, and so that this interface definition only has to be documented in one place (instead of having to re-explain each metrics function individually. This could be done as follows:

class LinkagesComparisonFunction(Protocol):  # Or some better name
    def __call__(self, labels_true: LabelingTable, labels_pred: LabelingTable, **kwargs) -> float:
        ...

In the metrics, module, things can be kept as they are or, alternatively, we could define things as follows:

class SKlearnLinkagesComparisonFunction(LinkagesComparisonFunction):
    def __init__(self, sklearn_metric):
        self._sklearn_metric = sklearn_metric
        update_wrapper(self, sklearn_metric)  # This might have to be configured properly

    def __call__(self, labels_true: LabelingTable, labels_pred: LabelingTable, **kwargs) -> float:
        labels_true, labels_pred = _to_numpy_labels(labels_true, labels_pred)
            return metric(labels_true, labels_pred, **kwargs)

adjusted_mutual_info_score = SKlearnLinkagesComparisonFunction(_metrics.adjusted_mutual_info_score)
calinski_harabasz_score = SKlearnLinkagesComparisonFunction(_metrics.calinski_harabasz_score)
completeness_score = SKlearnLinkagesComparisonFunction(_metrics.completeness_score)
contingency_matrix = SKlearnLinkagesComparisonFunction(_metrics.contingency_matrix)
pair_confusion_matrix = SKlearnLinkagesComparisonFunction(_metrics.pair_confusion_matrix)
# Etc.

These changes shouldn't affect users, as they can continue to use duck typing to extend as they want. They don't need to know about types object and can grab metrics function directly.

@NickCrews
Copy link
Owner

Good thoughts here, appreciate them a lot. I thought about this similarly, and I think in fact if you look through the git history I even used to have something similar.

I think the LinkagesComparisonFunction protocol totally makes sense and we should do it. For typechecking, IDE support, documentation for humans, and reducing repetitiveness of docstrings we have to write. I would lean towards all of these types getting put into one module, or at least near each other, as opposed to spread throughout the codebase, so you can get a better idea of how all of the different components interact with each other better. Public API should probably be mismo.types.LinkagesComparisonFunction etc, perhaps the actual implementation is in _private.py files.

For the LabelingTable case, I thought about this too, but I don't want to go there:

  1. I don't want to have to maintain a type mapping of mismo types to ibis types. There are inevitably be inconsistencies there. It also is more mental overhead for users, because I don't think there is a way for us to reduce the complexity enough with that abstraction: they are going to need to understand ibis types AS WELL as mismo types, instead of just ibis types. Ibis documentation is decent, and if its not good enough then its less work for us to upstream improvements there than it would be to re-implement and re-document ourselves I think.
  2. I'm curious what you mean by mock Ibis types in tests. Do you have an example of that? re: insulation from ibis types changing, I think if they change upstream then I just want that to get reflected here immediately. Again, users are gonna need to know ibis anyways so having two different versions of the same thing isn't a good idea.
  3. The problem with returning a concrete LabelingTable class is that then as soon as the user does something like .mutate() on it, they get a normal ibis.Table result, which is both confusing and really limits the usefulness. But, I think that this could be useful for documentation, typing, and IDE support. So our type annotations can use them, and our docs can maybe even use them as an example.

Thoughts?

@OlivierBinette
Copy link
Contributor Author

For LabelingTable, I think the key thing is that it is meant to be a type, and not something that ever can be instanciated or subclassed to something else than a type. The LabelingTable class would only be used for type hints and for documentation purposes.

Unfortunately I don't know if it's possible to define this strictly as a type (or maybe there's a Python trick I don't know?) The way that I define LabelingTable as a subclass of Table for the purposes of duck typing is a bit... not great?

Regarding modules, I would put shared types at the top level, but then keep submodule-specific types either within the relevant submodules. Each (sub)module could have a types.py file containing types and interface definitions. Since this project uses duck typing, types don't need to have an implementation most of the time.

Note that I haven't yet worked on a project that made widespread use of types and duck typing (rather than abstract classes), so my idea is kind of experimental. It could be tried out in a single submodule before deciding if it's a good idea.

For the other questions:

  1. That makes sense. I think with duck typing users wouldn't need to know that there's a difference. E.g. we could define Table as an alias to Ibis.expr.types.Table, and then use Table imported from a main types module throughout. I think the main advantage is that this would make clear what subset of Ibis types is used by Mismo, and this could help manage the Ibis dependency in the future (e.g., quickly check if an Ibis update impacts Mismo at all). It could even be made clear what subset of Ibis Table functionality is needed by the code, helping Mismo be more easily portable to other frameworks.
  2. This is also a bit experimental in my mind. If we make clear what are expectations for input types, then we could more easily create "fake" or "mock" tables that match the required type, but that behave in unexpected ways, in order to check how errors are handled in the code. For instance, the Hypothesis python package is used for property-based testing and can (I think?) rely on type definitions to automatically generate tests. Not saying Hypothesis should be used, it's just the general idea that having a well-defined interface can help define and test edge cases.
  3. The LabelingTable would never be instanciated, just used as a type for type hints, and used for documentation purposes. It's not great to have this be a subclass of an Ibis table. Ideally it'd be a Protocol or type object with clear documentation, but not sure how to achieve that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants