Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not all types could be used in IntervalDict #29

Closed
demalexx opened this issue May 9, 2020 · 9 comments
Closed

Not all types could be used in IntervalDict #29

demalexx opened this issue May 9, 2020 · 9 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@demalexx
Copy link

demalexx commented May 9, 2020

if value == v:

I use IntervalDict to store pandas DataFrame. Because of this line #255 setting element to IntervalDict fails - DataFrame can't be compared to just any other object.

from pandas import DataFrame

d = portion.IntervalDict()
d[portion.singleton(1)] = DataFrame()
d[portion.singleton(2)] = DataFrame()

exit()
File "df.py", line 89, in <module>
    d[portion.singleton(2)] = DataFrame()
  File "venv/lib/python3.8/site-packages/portion/dict.py", line 255, in __setitem__
    if value == v:
  File "venv/lib/python3.8/site-packages/pandas/core/generic.py", line 1478, in __nonzero__
    raise ValueError(
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

As a workaround I assign tuple (id(dataframe), dataframe), which prevents comparing DataFrame. But that's awkward.

Thanks!

@AlexandreDecan
Copy link
Owner

portion expects == to return a Boolean value, as stated by Python (see https://docs.python.org/3.8/reference/expressions.html#comparisons : "Comparisons yield boolean values: True or False").

The problem is that DataFrame cannot be compared the "usual way" (i.e. a == b where a and b are DataFrame does not evaluate to a Boolean value, but rather to a DataFrame).

I cannot rely on is instead of == since it could lead to awkward results, e.g.:

import portion as P                                                     
d = P.IntervalDict()                                                    
d[1] = []                                                               
d[2] = []                                                               
d.find([])                                                              

... with the current implementation relying on ==, this returns [1] | [2] as expected, where an implementation relying on is would produce () (i.e. empty interval).

@demalexx
Copy link
Author

demalexx commented May 9, 2020

Could you avoid comparing dict's values at all?

@AlexandreDecan
Copy link
Owner

Not easily. I need to compare values so I can merge intervals associated to the same value, and limit the space required to store the IntervalDict. It is also required for .find (but I already have workarounds in mind for this method).

The thing is that if you have [0,3] for object x and you add [1,5] for an equal y, it should result in [0,5] for x, and the only way to do this is to compare the "first" x with y.

While I understand that it's a shame we cannot store dataframes along with intervals, I would say that the problem comes from pandas. I understand they wanted to override the default behaviour of == for their own purpose (and I really appreciate this behaviour when I manipulate dataframes), but unfortunately, that means that any library relying on what would be expected from == cannot work correctly :(

One possibility would be to allow the creation of an IntervalDict for which the equality function can be provided by the user as a parameter. However, this has many implications since we heavily relied on the "default" equality (both implicitly and explicitly). At this point, I'm not in favour of implementing this since it's even more tricky than the workaround you have :-) I'm currently looking for alternatives (e.g. wrapping values into a wrapper whose equality function can be specified when an IntervalDict is created. This would minimize the amount of changes required in IntervalDict, but add an extra layer on top of the values, and extra steps to wrap/unwrap values when they are read and written :-/).

@demalexx
Copy link
Author

demalexx commented May 9, 2020

The thing is that if you have [0,3] for object x and you add [1,5] for an equal y, it should result in [0,5] for x, and the only way to do this is to compare the "first" x with y.

Another approach to solving this - remove that contract :)
I.e. builtin dict doesn't do that optimisation for you, why portion should?
In described case let it be [0] with x and [1, 5] with y, even if x == y. Then dev could write own code to merge intervals, or you could have it as helper method.

I also could suggest to have some parameter in IntervalDict constructor, which determines if intervals should be merged on the fly.

@AlexandreDecan
Copy link
Owner

I would like to keep this contract, since it means that all intervals related to a given value are merged together. Consider for example the following IntervalDict d:

d[P.closed(0, 1] = 'a'
d[P.closed(1, 2)] = 'a'

If we ask for d[P.open(-P.inf, P.inf)], we get P.closed(0,2] -> 'a'. It would be "surprising" to have two distinct keys for a single value. If we don't have this, then an IntervalDict has little value compared to a list of tuples (key, value) :-/

I'm currently more and more considering adding an eq parameter to the constructor, so one can pass an arbitrary function to compare values. This will default to ==, but can be set to lambda a,b: a.equals(b) in your case. However, before doing so, I would like to have more use-cases where such a parameter could be useful (i.e., are there any other cases apart pandas' dataframes that require it?).

@AlexandreDecan AlexandreDecan added enhancement New feature or request help wanted Extra attention is needed labels May 9, 2020
@demalexx
Copy link
Author

demalexx commented May 9, 2020

I wouldn't say it's surprising :) It depends on what problem you're solving with this IntervalDict. Anyway you should be ready that single d[<interval>] could return collection with intervals and values.

eq params looks like working solution. I then could just set eq=lambda a, b: False and will get IntervalDict that doesn't merge intervals ;)

Re other use cases - these must be some specific 3rd-party libs that overrides __eq__.

@AlexandreDecan
Copy link
Owner

If there was no alternative solution, I would implement eq' directly. But it is still possible to wrap/unwrap objects to make them comparable. I'm a bit relunctant to implement eqif it's only forpandas(ornumpy`, merely for the same reasons).

I guess there are other libraries that do no follow Python's expectations for __eq__, but I'm not aware of them ;-)

What I am also afraid of is that it opens the door to other similar changes. For example, one could argue that it would be nice to have the possibility to override/define the comparison functions (__gt__, __ge__, etc.) for values used in Interval. Although I agree it would be "nice" to have that possibility (and to be honest, it was implemented during the process leading up to 2.0.0, but removed because of the "long-term" implications and the complexity of the code), I do believe it's not the responsability of portion to define (or to allow to define) how objects should be compared, especially given there are other ways to do this (e.g. wrap/unwrap bounds).

Of course,I'm not saying that I'm not going to do anything about this issue :-) I just want to take the time to think carefully about the possibilities and alternatives, as well as the long term implications on the code (esp. since IntervalDict is likely to be completely rewritten at some point to achieve decent performances, likely on top of a kind of balanced binary tree).

@demalexx
Copy link
Author

No hurry :) Still it's great library that I'm using now on small project. Thanks!

@AlexandreDecan
Copy link
Owner

I discussed with some friends about this issue, since I couldn't decide what's the best way to address it. We concluded that it's probably better to stick with the current implementation, and to suggest to wrap/unwrap objects on-the-fly when they do not comply with Python's expected behaviour for __eq__ rather than having a dedicated piece of code to deal with those (limited) cases in portion.

It is very very likely that we will release a new implementation of IntervalDict at some point (manly to achieve better performances for this class). We'll consider this issue at that time ;-)

Considering the use of dataframes in the context of intervals, we're currently looking at the possibility to implement our Interval as a special extension type for pandas (see https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types). This is another way to address your use case, since it will allow Intervalto be use as indexes in a serie/dataframe. This is going to be a bit challenging (esp. to achieve high performance - something that is expected in the context of datafarmes) but that's a feature I would like to have in portion ;-) No ETA so far, btw... :)

@AlexandreDecan AlexandreDecan added long term Will be addressed later and removed long term Will be addressed later labels Jun 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants