Not all types could be used in IntervalDict #29

demalexx · 2020-05-09T10:10:07Z

Line 255 in f87365a

if value == v:

I use IntervalDict to store pandas DataFrame. Because of this line #255 setting element to IntervalDict fails - DataFrame can't be compared to just any other object.

from pandas import DataFrame

d = portion.IntervalDict()
d[portion.singleton(1)] = DataFrame()
d[portion.singleton(2)] = DataFrame()

exit()

File "df.py", line 89, in <module>
    d[portion.singleton(2)] = DataFrame()
  File "venv/lib/python3.8/site-packages/portion/dict.py", line 255, in __setitem__
    if value == v:
  File "venv/lib/python3.8/site-packages/pandas/core/generic.py", line 1478, in __nonzero__
    raise ValueError(
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

As a workaround I assign tuple (id(dataframe), dataframe), which prevents comparing DataFrame. But that's awkward.

Thanks!

The text was updated successfully, but these errors were encountered:

AlexandreDecan · 2020-05-09T10:22:15Z

portion expects == to return a Boolean value, as stated by Python (see https://docs.python.org/3.8/reference/expressions.html#comparisons : "Comparisons yield boolean values: True or False").

The problem is that DataFrame cannot be compared the "usual way" (i.e. a == b where a and b are DataFrame does not evaluate to a Boolean value, but rather to a DataFrame).

I cannot rely on is instead of == since it could lead to awkward results, e.g.:

import portion as P                                                     
d = P.IntervalDict()                                                    
d[1] = []                                                               
d[2] = []                                                               
d.find([])

... with the current implementation relying on ==, this returns [1] | [2] as expected, where an implementation relying on is would produce () (i.e. empty interval).

demalexx · 2020-05-09T11:24:28Z

Could you avoid comparing dict's values at all?

AlexandreDecan · 2020-05-09T12:40:41Z

Not easily. I need to compare values so I can merge intervals associated to the same value, and limit the space required to store the IntervalDict. It is also required for .find (but I already have workarounds in mind for this method).

The thing is that if you have [0,3] for object x and you add [1,5] for an equal y, it should result in [0,5] for x, and the only way to do this is to compare the "first" x with y.

While I understand that it's a shame we cannot store dataframes along with intervals, I would say that the problem comes from pandas. I understand they wanted to override the default behaviour of == for their own purpose (and I really appreciate this behaviour when I manipulate dataframes), but unfortunately, that means that any library relying on what would be expected from == cannot work correctly :(

One possibility would be to allow the creation of an IntervalDict for which the equality function can be provided by the user as a parameter. However, this has many implications since we heavily relied on the "default" equality (both implicitly and explicitly). At this point, I'm not in favour of implementing this since it's even more tricky than the workaround you have :-) I'm currently looking for alternatives (e.g. wrapping values into a wrapper whose equality function can be specified when an IntervalDict is created. This would minimize the amount of changes required in IntervalDict, but add an extra layer on top of the values, and extra steps to wrap/unwrap values when they are read and written :-/).

demalexx · 2020-05-09T13:03:27Z

The thing is that if you have [0,3] for object x and you add [1,5] for an equal y, it should result in [0,5] for x, and the only way to do this is to compare the "first" x with y.

Another approach to solving this - remove that contract :)
I.e. builtin dict doesn't do that optimisation for you, why portion should?
In described case let it be [0] with x and [1, 5] with y, even if x == y. Then dev could write own code to merge intervals, or you could have it as helper method.

I also could suggest to have some parameter in IntervalDict constructor, which determines if intervals should be merged on the fly.

AlexandreDecan · 2020-05-09T14:24:18Z

I would like to keep this contract, since it means that all intervals related to a given value are merged together. Consider for example the following IntervalDict d:

d[P.closed(0, 1] = 'a'
d[P.closed(1, 2)] = 'a'

If we ask for d[P.open(-P.inf, P.inf)], we get P.closed(0,2] -> 'a'. It would be "surprising" to have two distinct keys for a single value. If we don't have this, then an IntervalDict has little value compared to a list of tuples (key, value) :-/

I'm currently more and more considering adding an eq parameter to the constructor, so one can pass an arbitrary function to compare values. This will default to ==, but can be set to lambda a,b: a.equals(b) in your case. However, before doing so, I would like to have more use-cases where such a parameter could be useful (i.e., are there any other cases apart pandas' dataframes that require it?).

demalexx · 2020-05-09T16:01:52Z

I wouldn't say it's surprising :) It depends on what problem you're solving with this IntervalDict. Anyway you should be ready that single d[<interval>] could return collection with intervals and values.

eq params looks like working solution. I then could just set eq=lambda a, b: False and will get IntervalDict that doesn't merge intervals ;)

Re other use cases - these must be some specific 3rd-party libs that overrides __eq__.

AlexandreDecan · 2020-05-10T08:08:50Z

If there was no alternative solution, I would implement eq' directly. But it is still possible to wrap/unwrap objects to make them comparable. I'm a bit relunctant to implement eqif it's only forpandas(ornumpy`, merely for the same reasons).

I guess there are other libraries that do no follow Python's expectations for __eq__, but I'm not aware of them ;-)

What I am also afraid of is that it opens the door to other similar changes. For example, one could argue that it would be nice to have the possibility to override/define the comparison functions (__gt__, __ge__, etc.) for values used in Interval. Although I agree it would be "nice" to have that possibility (and to be honest, it was implemented during the process leading up to 2.0.0, but removed because of the "long-term" implications and the complexity of the code), I do believe it's not the responsability of portion to define (or to allow to define) how objects should be compared, especially given there are other ways to do this (e.g. wrap/unwrap bounds).

Of course,I'm not saying that I'm not going to do anything about this issue :-) I just want to take the time to think carefully about the possibilities and alternatives, as well as the long term implications on the code (esp. since IntervalDict is likely to be completely rewritten at some point to achieve decent performances, likely on top of a kind of balanced binary tree).

demalexx · 2020-05-10T10:32:18Z

No hurry :) Still it's great library that I'm using now on small project. Thanks!

AlexandreDecan · 2020-05-20T19:19:26Z

I discussed with some friends about this issue, since I couldn't decide what's the best way to address it. We concluded that it's probably better to stick with the current implementation, and to suggest to wrap/unwrap objects on-the-fly when they do not comply with Python's expected behaviour for __eq__ rather than having a dedicated piece of code to deal with those (limited) cases in portion.

It is very very likely that we will release a new implementation of IntervalDict at some point (manly to achieve better performances for this class). We'll consider this issue at that time ;-)

Considering the use of dataframes in the context of intervals, we're currently looking at the possibility to implement our Interval as a special extension type for pandas (see https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types). This is another way to address your use case, since it will allow Intervalto be use as indexes in a serie/dataframe. This is going to be a bit challenging (esp. to achieve high performance - something that is expected in the context of datafarmes) but that's a feature I would like to have in portion ;-) No ETA so far, btw... :)

AlexandreDecan added enhancement New feature or request help wanted Extra attention is needed labels May 9, 2020

AlexandreDecan closed this as completed May 20, 2020

This was referenced May 31, 2020

Feature Request - tolerance (in datetime or any other float) #30

Closed

Add typing info #27

Open

AlexandreDecan added long term Will be addressed later and removed long term Will be addressed later labels Jun 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not all types could be used in IntervalDict #29

Not all types could be used in IntervalDict #29

demalexx commented May 9, 2020

AlexandreDecan commented May 9, 2020

demalexx commented May 9, 2020

AlexandreDecan commented May 9, 2020

demalexx commented May 9, 2020

AlexandreDecan commented May 9, 2020

demalexx commented May 9, 2020

AlexandreDecan commented May 10, 2020

demalexx commented May 10, 2020

AlexandreDecan commented May 20, 2020

Not all types could be used in IntervalDict #29

Not all types could be used in IntervalDict #29

Comments

demalexx commented May 9, 2020

AlexandreDecan commented May 9, 2020

demalexx commented May 9, 2020

AlexandreDecan commented May 9, 2020

demalexx commented May 9, 2020

AlexandreDecan commented May 9, 2020

demalexx commented May 9, 2020

AlexandreDecan commented May 10, 2020

demalexx commented May 10, 2020

AlexandreDecan commented May 20, 2020