Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add __format__ method to Interval (improvement) #76

Closed
adm271828 opened this issue Jan 28, 2023 · 16 comments
Closed

Add __format__ method to Interval (improvement) #76

adm271828 opened this issue Jan 28, 2023 · 16 comments
Labels
enhancement New feature or request Information This issue is mostly informative

Comments

@adm271828
Copy link

Hi Alexandre,

I would suggest to add a __format__ method to Interval, such as (or something equivalent, idea is easy to grasp):

  def __format__(self, spec):
    return P.to_string(self, lambda v: ('{:'+spec+'}').format(v))

Best regards,

Antoine

@adm271828
Copy link
Author

That's something I already considered but I don't remember exactly why I decided not to implement it. :-)

Oh I remember : that's because format can also be used to define some padding/alignment and there's no easy way to apply this to infinities, leading to unexpected string rendering and "inconsistencies".

Yes, but I'm pretty sure we can figure out something.

For the record, my first motivation to subclass Interval was to add such a method, because when working with datetime intervals formatting quickly yields to unreadable output.

Antoine

@AlexandreDecan AlexandreDecan added enhancement New feature or request help wanted Extra attention is needed labels Jan 28, 2023
@AlexandreDecan
Copy link
Owner

If we can parse format specifier then we could try to apply it to infinities as well. But I don't see how currently :-) help is welcome on this.

@adm271828
Copy link
Author

adm271828 commented Jan 29, 2023

Not sure there is a unique / perfect solution. Probably some design choice has to be made.

Not sure either that it is mandatory to try to implement all formating directives (e.g. alignment). I see them as indications subject to best effort, when it make sense, that can even be dropped. For instance if a string is to long for a length indication, it is not truncated. Another example is datetime formating that simply doesn't support padding or alignment directives.

So, how the spec argument of __format__ is used basically depends on the object it is invoked on and the use case motivation.

My motivation with datetime is that having (especially non atomic) intervals with the default datetime representation is just unreadable.

def __format__(self, spec):
    return P.to_string(self, lambda v: ('{:'+spec+'}').format(v))

The previous proposal is simple and fullfills this purpose.

If one wants to deal with aligments in a consistent way, one must find a way to split the spec in subpieces that would be applied over interval elements. What about using a separator? For instance:

f"{i:|4d|4s|^15s|}"

where | is the separator that breaks the spec in 3 parts, first parts applies to values that are not infinities, second part to infinities, and last part to overall result.

For instance:

  def __format__(self, spec):
    # FIXME: handle null spec
    # FIXME: handle exception because __format__ might not be defined for v
    parts = spec[1:].split(spec[0])
    pinf = '+inf'.__format__(parts[1]) if len(parts) > 1 else '+pinf'
    ninf = '-inf'.__format__(parts[1]) if len(parts) > 1 else '-pinf'
    result = P.to_string(self,
                         lambda v: v.__format__(parts[0]),
                         pinf=pinf,
                         ninf=ninf)
    
    if len(parts) > 2:
      result = result.__format__(parts[2])
    
    return result

Here, the separator is chosen by the user (first character after the semicolon). Another choice would be to use a predefined separator in which case it should not appear after the first semicolon. This would be slightly more readable, The drawback is that it can not be part of the spec for sub parts.

Best regards,

Antoine

@AlexandreDecan
Copy link
Owner

I don't really like the idea of using a separator. Formatting a string is already something sometimes complicated (because one has to remember the specifiers) and if we're going to support __format__ for intervals, then we should try to stick as much as possible to its "usual" behaviour for other, built-in, objects.

I like the simplicity of your first proposal:

def __format__(self, spec):
    return P.to_string(self, lambda v: ('{:'+spec+'}').format(v))

What would be nice is to be allow to extract all parts of spec that could be applied on infinities (that is, mostly padding/alignment) and to apply them on infinities. I don't know if there's some parser available in the Python standard library for doing so. If there's one, then it might be easy to implement something along this line.

@AlexandreDecan
Copy link
Owner

This could be a starting point:
https://peps.python.org/pep-3101/#user-defined-formatting (I still need to read it)

@adm271828
Copy link
Author

What would be nice is to be allow to extract all parts of spec that could be applied on infinities (that is, mostly padding/alignment) and to apply them on infinities. I don't know if there's some parser available in the Python standard library for doing so. If there's one, then it might be easy to implement something along this line.

Yes, but it seems to be impossible to have a spec that could apply at the same time to strings (infinities) and objects (of unknown type) in the interval. See datetime again: there is no way to specify a padding nor alignment (extra char in the spec are simply printed as is). What we are looking for is a "union" spec that would apply to arbitrary types... Seems complicated and beyond the scope of your library I think. Maybe the PEP you're pointing at...

I can' say I like the separator idea... quick and dirty solution. I might use the separator in my subclasses as a convenience (I didn't so far).

Maybe the reasonable choice is to do nothing (as you did) or simply provide the first proposal I made (with a warning about the limits en the doc).

Best regards,

Antoine

@AlexandreDecan
Copy link
Owner

I'll have a look at how format and its specifier is expected to be provided, and see if I can come up with a solution that works on all bounds (including infinities). If not, then I think it's indeed safer to do nothing (I prefer to avoid a "partial" solution, since in any case, to_string can be used anyway to cover all cases).

@adm271828
Copy link
Author

adm271828 commented Jan 29, 2023

I'm affraid there is no way to have a universal specifier, as specifiers (and how they are interpreted) depends on objet they are used with.

Of course to_string can be used, but the format protocol and the syntactic sugar it provides is much more compact/easy.

Anyway the idea of separator that came up because of this thread will solve (even if not perfect) some cases I have in a much more compact way. I will implement it in some of my subclasses of intervals!

Best regards.

@AlexandreDecan
Copy link
Owner

I'm affraid there is no way to have a universal specifier, as specifiers (and how they are interpreted) depends on objet they are used with.

Ideed, but the PEP defines a kind of "minimal set of specifiers" ([[fill]align][sign][#][0][minimumwidth][.precision][type]). Not everything makes sense for infinities, but it would be nice to be able to extract what makes sense for them.

Notice that I discussed this morning with another Pythonista, and he told me something interesting: none of the object containers in Python (e.g. list, dict, etc.) actually supports defining the format of nested objects through __format__. If we do so for interval, we somehow bring something unexpected since, for example, one would use __format__ to align the whole interval to the right of a 20-characters long string, while another one would use it to align each bound to the right of a 20-characters long string. The latter is what we initially define as the behaviour of __format__ while the former is what is usually expected from it. That may be confusing, and he even suggested to apply the former behaviour since we have a to_string function that can be used to apply the latter.

@adm271828
Copy link
Author

adm271828 commented Jan 31, 2023

This could be a starting point:
https://peps.python.org/pep-3101/#user-defined-formatting (I still need to read it)

Isn't the PEP you're referring to the one that introduced Formater class and already is in the language (python 2.6 and 3.0)?

Since meaning of format specifier depends on the object it is applied to, I really can't see how a unique specifier could be processed in order to apply some (relevant) parts of it to infinities (str), and other parts to be applied to unknown objects.

Consider some align requirements and intervals with datetimes: how is it possible to guess that you must delete the align spec when applying to a value, and keep for infinities. You can only do it if you have the knowledge of what format specifier is supported by non infinities.

Here is a refined idea with separator, taking advantage of # char as "alternate" specifier:

  • a simple specifier applies to upper/lower bound values (but not infinities). The underlying assumption is that the format will be understood by all values: this will be the case - probably the most common one - where all values have the same type. This is my initial proposal.
  • the alternate form start with #, followed by a separator (user chosen), followed by multiple separated format string that applies (in this order) to upper/lower bound values, then infinities, then the whole interval representation. For instance: f"{i:#/6.2f/^6/^24}". Values are expected to be floats, infinities are centered among 6 chars, and the whole interval centered again.

Simple, powerfull enough for most use cases... Much more compact and handy than using to_string for most cases also. Adopted for my subclasses!

Some examples:

>>> i=closed(1,2)
>>> j=closed(1,inf)
>>> k=closed(datetime(2000,1,1), datetime.now())
>>> i
[1,2]
>>> j
[1,+inf)
>>> k
[datetime.datetime(2000, 1, 1, 0, 0),datetime.datetime(2023, 1, 30, 17, 55, 55, 608774)]
>>> f"{i:08d}"
'[00000001,00000002]'
>>> f"{j:08d}"
'[00000001,+inf)'
>>> f"{i:#/08d//^30}"
'     [00000001,00000002]      '
>>> f"{j:#/08d/^8/^30}"
'     [00000001,  +inf  )      '
>>> f"{k:#/ %Y-%m-%d //^30}"
' [ 2000-01-01 , 2023-01-30 ]  '
>>> f"{i:#/#010x}"
'[0x00000001,0x00000002]'

Here is the code that would do it (not heavily tested). Anybody shall feel free to reuse it:

class I(P.Interval):

  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)

  # Can owerwrite on a per class (or even instance) basis
  DISJ_STR = " | "
  SEP_STR = ","
  LEFT_OPEN_STR = "("
  LEFT_CLOSED_STR = "["
  RIGHT_OPEN_STR = ")"
  RIGHT_CLOSED_STR = "]"
  PINF_STR = "+inf"         # "+oo"
  NINF_STR = "-inf"         # "-oo"
  EMPTY_STR = "()"
  DEFAULT_CONV = str

  def to_str(self, 
             conv: Callable = repr, 
             *,
             disj: Optional[str] = None,
             sep: Optional[str] = None,
             left_open: Optional[str] = None,
             left_closed: Optional[str] = None,
             right_open: Optional[str] = None,
             right_closed: Optional[str] = None,
             pinf: Optional[str] = None,
             ninf: Optional[str] = None, 
             empty: Optional[str] = None) -> str:
    """
    """
    left_open = self.LEFT_OPEN_STR if left_open is None else left_open
    right_open = self.RIGHT_OPEN_STR if right_open is None else right_open

    if self.empty:
      return (empty if empty is not None
                else self.EMPTY_STR if self.EMPTY_STR
                else left_open + right_open)

    left_closed = self.LEFT_CLOSED_STR if left_closed is None else left_closed
    right_closed = self.RIGHT_CLOSED_STR if right_closed is None else right_closed
    sep = self.SEP_STR if sep is None else sep
    pinf = self.PINF_STR if pinf is None else pinf
    ninf = self.NINF_STR if ninf is None else ninf

    result = []

    for i in self._intervals:
      if i.lower == i.upper:
        result.append(left_closed + conv(i.lower) + right_closed)
      else:
        result.append(
          (left_open if i.left == P.OPEN else left_closed)
          + (ninf if i.lower == -P.inf else conv(i.lower))
          + sep
          + (pinf if i.upper == P.inf else conv(i.upper))
          + (right_open if i.right == P.OPEN else right_closed)
        )

    return (disj if disj is not None else self.DISJ_STR).join(result)


  def __format__(self, spec: str) -> str:
    """
    Default object formatter.

    The format specifier is applied to lower and upper bound values,
    provided they are not infinities. Since their is not known controled
    by the Interval class, it is up to the user to provide a compatible
    format specifier.

    An alternate specifier form (starting with '#') can be used in order
    to control how infinities and the overall result are formated.
    
    The '#' opening char must be followed by a (user chosen) separator.
    Every string between separators is then a format specifier that apply
    to the lower and upper bound values (first specifier), infinities
    (second specifier applicable to str) and the overall result (third
    specifier, applicable to str). Second and third specifiers are
    optional.

    Examples:
      >>> i=closed(1,2)
      >>> i
      [1,2]
      >>> j=closed(1,inf)
      >>> j
      [1,+inf)
      >>> k=closed(datetime(2000,1,1), datetime.now())
      >>> k
      [datetime.datetime(2000, 1, 1, 0, 0),datetime.datetime(2023, 1, 30, 
      17, 55, 55, 608774)]
      >>> f"{i:08d}"
      '[00000001,00000002]'
      >>> f"{j:08d}"
      '[00000001,+inf)'
      >>> f"{i:#/08d/^8/^30}"
      '     [00000001,00000002]      '
      >>> f"{j:#/08d/^8/^30}"
      '     [00000001,  +inf  )      '
      >>> f"{k:#/ %Y-%m-%d //^30}"
      ' [ 2000-01-01 , 2023-01-30 ]  '
      >>> f"{i:#/#010x}"
      '[0x00000001,0x00000002]'
    """
    pinf = self.PINF_STR
    ninf = self.NINF_STR

    # Format specifier applies to values, but not to infinities
    if not spec.startswith('#'):
      spec = '{:'+spec+'}'
      return self.to_str(conv = lambda v: spec.format(v))
      
    if len(spec) == 1:
      raise ValueError("Invalid format specifier (missing separator)")

    # Alternate form: multiple format specifier that apply to values,
    # infinities and the whole result. First char after '#' is a separator
    # used to split spec into sub specs. For instance '#/8.2f/^8'
    subspec = spec[2:].split(spec[1])

    if subspec[0]:
      if subspec[0][0] == '!':
        value_fmt = '{'+subspec[0]+'}'
      else:
        value_fmt = '{:'+subspec[0]+'}'

      conv = lambda v: value_fmt.format(v)
    else:
      conv = repr

    if len(subspec) > 1 and subspec[1]:
      pinf = ('{:'+subspec[1]+'}').format(pinf)
      ninf = ('{:'+subspec[1]+'}').format(ninf)

    result = self.to_str(conv=conv, ninf=ninf, pinf=pinf)

    if len(subspec) > 2 and subspec[2]:
      return ('{:'+subspec[2]+'}').format(result)

    return result

  
  def __str__(self):
    return self.to_str(conv=self.DEFAULT_CONV)


  def __repr__(self):
    # EQUIV TO return self.to_str(conv=repr) but without flexibility
    # and all constants hardcoded. Faster.
    if self.empty:
      return "()"

    return " | ".join(
      "["+repr(i.lower)+"]" if i.lower == i.upper
      else (("(" if i.left == P.OPEN else "[")
            + repr(i.lower)
            + ","
            + repr(i.upper)
            + (")" if i.right == P.OPEN else "]"))
        for i in self._intervals
    )

@adm271828
Copy link
Author

adm271828 commented Jan 31, 2023

And a few speed tests...

# Create Interval with a given number of atomic intervals
def make_interval(cls=P.Interval, n=10):
  return cls(*(P.closed(k, k+1) for k in range(0, 2*n, 2)))

# Original class
i0=make_interval(n=0)
i10=make_interval(n=10)
i100=make_interval(n=100)
i1000=make_interval(n=1000)
i10000=make_interval(n=10000)

# Derived class
j0=make_interval(cls=I,n=0)
j10=make_interval(cls=I,n=10)
j100=make_interval(cls=I,n=100)
j1000=make_interval(cls=I,n=1000)
j10000=make_interval(cls=I,n=10000)

Comparing repr implementation

>>> %timeit repr(i0)
... 192 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit repr(j0)
... 189 ns ± 3.33 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

>>> %timeit repr(i10)
... 8.31 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit repr(j10)
... 6.07 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> %timeit repr(i100)
... 77.7 µs ± 519 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit repr(j100)
... 53 µs ± 311 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit repr(i10000)
... 8.07 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit repr(j10000)
... 5.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • repr is a little bit faster: 30 % speedup
  • probably because it doesn't call format with '{}'.

Comparing to_string implementation

>>> %timeit P.to_string(i0)
... 560 ns ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
>>> %timeit P.to_string(i10)
... 55.8 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit P.to_string(i100)
... 535 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit P.to_string(i1000)
... 5.4 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit P.to_string(i10000)
... 57.8 ms ± 4.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit j0.to_str()
... 418 ns ± 3.87 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
>>> %timeit j10.to_str()
... 10.8 µs ± 471 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit j100.to_str()
... 98.9 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit j1000.to_str()
... 1.01 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit j10000.to_str()
... 9.82 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • orignal to_string implementation is way slower then repr: about 7 time slower
  • to_str method is much faster than to_string function: 5 to 6 time faster
  • it is worth to duplicate repr code, even if to_str method could be called instead, as it faster with hardcoded strings

Best regards,

Antoine

@AlexandreDecan AlexandreDecan added Information This issue is mostly informative and removed help wanted Extra attention is needed labels Jan 31, 2023
@AlexandreDecan
Copy link
Owner

Thanks for sharing the snippets and for these benchmarks. That's interesting, even if I'm not surprised that calling a method is faster than calling the to_string function that cannot benefit from Interval internals to speed up its task.

You said:

Since meaning of format specifier depends on the object it is applied to, I really can't see how a unique specifier could be processed in order to apply some (relevant) parts of it to infinities (str), and other parts to be applied to unknown objects. Consider some align requirements and intervals with datetimes: how is it possible to guess that you must delete the align spec when applying to a value, and keep for infinities. You can only do it if you have the knowledge of what format specifier is supported by non infinities.

I agree :) My point was rather to try to extract from a specifier what can be applied to infinities, and to use it as-is for the object. This is: use the specifier as-is for the bounds that are not infinities, and extract what can be applied to infinities (e.g. padding, alignment, etc.) and apply it on infinities. Anyway, given the potential confusion that could arise because of __format__ as I was told, I'm a bit reluctant to implement it that way :-)

I've added the "information" label to this issue since I believe it will be helpful for people looking something similar. I'll definitely link to this issue in the future "Recipes" section of the documentation :-)

@AlexandreDecan
Copy link
Owner

I've changed our __repr__ and to_string so to avoid the use of format and to concatenate strings instead (even if the speed up is small, it's still worth to have it ;-)). I credit you for this in the changelog since without your benchmarks, I couldn't have spotted that one ;-)

@AlexandreDecan AlexandreDecan closed this as not planned Won't fix, can't repro, duplicate, stale Jan 31, 2023
@adm271828
Copy link
Author

adm271828 commented Feb 12, 2023

I've added the "information" label to this issue since I believe it will be helpful for people looking something similar. I'll definitely link to this issue in the future "Recipes" section of the documentation :-)

Hi Alexandre,

Glad you found it usefull.

Something puzzles me however. You find it usefull to get rid of unnecessary format calls: they yield a 1.3 speed improvement, and you changed your code.

But you discard as almost irrelevant a 7 times (!) speedud just saying it was expected. Of course it was expected, but 7 times speedup...! where there is absolutely no benefit for the user to stick to the "functionnal style" (simply have def to_string(i): return i->to_str() if you want it to be called without having a method call). Interesting case of confirmation bias.

Best regards,

Antoine

@AlexandreDecan
Copy link
Owner

Hi,

The speed-up for __repr__ comes for free while the changes you suggest for to_string/to_str requires additional code (and contracts) in Interval, so that to_string can implicitly use the internals of Interval. Given the difference is at most a few ms even for very large intervals (and a few ns for the most usual cases), I haven't made this change (and I know that the difference is even lower for __repr__, but as I said, the difference comes for free).

You said "interesting case of confirmation bias". I don't know if that's the case or not, but I always try to weigh the pros and cons of every decision I make about portion and its API. These decisions are not necessarily the best ones, but I am the one who will continue to maintain portion, to add new features, to fix bugs, etc. that is: the one that will have to live with the decisions and their consequences. For the current case, status-quo is the decision I have made and that I am ready to assume :)

@adm271828
Copy link
Author

These decisions are not necessarily the best ones, but I am the one who will continue to maintain portion, to add new features, to fix bugs, etc. that is: the one that will have to live with the decisions and their consequences.

Sure, Alexandre: your code, your design, your choices, your decisions... And I would even say, and I truly mean it (trust me!), the best decisions overall are the one you make :-)

But yet... you have 25 lines of code in io.py to implement to_string and 16 lines of code in interval.py to implement __repr__. What they do is exactly the same thing, because __repr__ is basically a call to to_string with default arguments. Just keep one and you'll get less code to test, less code to maintain, DRY. I'm not even speaking of coding style nor speed. This is a library maintainer point of view.

Now as a library user, the implicit contract is that a call to to_string (with default arguments) should be undistinguishable from a call to __repr__ (well in fact, to_string should probably be mapped to __str__ also, but it's another matter). The point is they are not (see issue #77 that has been deleted) : I wouldn't even have noticed there was two implementations if I didn't step into it and started to look into the code to understand why I got an exception. Still not speaking of coding style nor speed.

Now, it is obvious that the difference between the 2 is that one of them involves many unnecessary object instanciations (hence being far more slower) while the other do not. And that any two of them could be a call to the other one (or, better, they both should call a unique method). So, there is not need to change the API. And the speedup also comes for free.

Having said that, I don't see any benefit to keep 2 implementations, and to keep the slowest one.

There is probably one and I am missing something. But what? This is just curiosity.

And still, your code, your decision!

And thanks for it!

Best regards,

Antoine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Information This issue is mostly informative
Projects
None yet
Development

No branches or pull requests

2 participants