Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load db Interval column as ISO 8601 duration string rather than Python timedelta #141

Open
Flix6x opened this issue Aug 18, 2023 · 3 comments
Assignees

Comments

@Flix6x
Copy link
Collaborator

Flix6x commented Aug 18, 2023

When initializing a DBSensor with an ISO 8601 duration string as its event_resolution, this is correctly interpreted in Postgres. However, when loading the sensor from the database to a Python object, the event resolution becomes a Python timedelta, which cannot accurately represent nominal durations such as a calendar day, a month or a year. This is problematic for daily, monthly and yearly sensors, and also why timely-beliefs currently still works best for hourly or more fine-grained data.

For example:

DBSensor(... event_resolution="P1M")

Becomes 1 mon in Postgres, which then becomes timedelta(days=30) in Python.

At first, we'd just like to make it possible to let Python code access the original ISO 8601 representation of the event resolution, from what is stored in the database. Here are my ideas:

1) Postgres intervalStyle setting

Although Postgres's intervalStyle does a great job at formatting the stored Interval as an ISO string, unfortunately this setting is only scoped to a database connection and not meant to let SQLAlchemy consistently load the Interval as an ISO string.

SET intervalstyle = 'iso_8601';
SELECT make_interval(0, 1, 0, 0, 0, 0, 0);

which yields "P1M". Given that the intervalStyle cannot be set globally, this functionality is not trivial to use for our purpose.

Update: it is actually possible to set the default for a given database, using:

ALTER DATABASE dbname SET intervalstyle = 'iso_8601';

Here's a possible migration we could consider:

def upgrade():
    # Use the appropriate SQL syntax for your PostgreSQL version
    op.execute("ALTER DATABASE dbname SET intervalstyle = 'iso_8601'")


def downgrade():
    # If needed, provide a downgrade step
    op.execute("ALTER DATABASE dbname SET intervalstyle = 'postgres'")

However, we then run into this error:

psycopg2.NotSupportedError: iso_8601 intervalstyle currently not supported

2) String rather than Interval column

This does not allow the column to be used for db-level computation, unless it is cast to an Interval before doing computations. This idea still needs a tech spike.

3) Cast to string and convert upon loading

My current idea is to give the Sensor object an extra property, something like event_resolution_string, get the string representation of the event_resolution from the database, and deal with the different possible intervalStyles upon loading.

Tech spike:

class SensorDBMixin(Sensor):
    ...
    event_resolution_str = Column(String(), Computed("cast(event_resolution as varchar(50))"))

class DBSensor(Base, SensorDBMixin):
    ...
    @property
    def event_resolution_string(self):
        iso_str = self.postgres_interval_to_iso8601(self.event_resolution_str)
        return iso_str

def postgres_interval_to_iso8601(interval_str):
    """https://database.guide/how-to-set-the-interval-output-format-in-postgresql/"""

    # intervalStyle = iso_8601
    if interval_str[0] == "P":
        return interval_str

    # intervalStyle = sql_standard
    if re.search(' ', interval_str) and not re.search('[a-zA-Z]', interval_str):
        raise NotImplementedError("intervalStyle = sql_standard")

    # intervalStyle = postgres_verbose
    components = ["year", "mon", "day", "hour", "min", "sec"]
    values = {comp: 0 for comp in components}

    for comp in components:
        match = re.search(fr"(\d+)\s+{comp}", interval_str)
        if match:
            values[comp] = int(match.group(1))

    # intervalStyle = postgres
    time_match = re.search(r"(\d+):(\d+):(\d+)", interval_str)
    if time_match:
        values["hour"], values["minute"], values["second"] = map(int, time_match.groups())

    # Build ISO 8601 string in short form, leaving out zero components (e.g. just P1D instead of P0Y0M1DT0H0M0S)
    iso_duration = "P"
    iso_duration += f"{values['year']}Y" if values['year'] else ""
    iso_duration += f"{values['mon']}M" if values['mon'] else ""
    iso_duration += f"{values['day']}D" if values['day'] else ""
    iso_duration += "T" if any(values[comp] for comp in ["hour", "min", "sec"]) else ""
    iso_duration += f"{values['hour']}H" if values['hour'] else ""
    iso_duration += f"{values['min']}M" if values['min'] else ""
    iso_duration += f"{values['sec']}S" if values['sec'] else ""

    return iso_duration

And some ideas for test cases:

postgres_interval = "+1-2 +25 +5:06:07"  # IntervalStyle = sql_standard
postgres_interval = "1 year 0 months 3 days 4:05:06"  # IntervalStyle = postgres
postgres_interval = "1 year 2 mons 25 days 5 hours 6 mins 7 secs"  # IntervalStyle = postgres_verbose
postgres_interval = "11 mons -1 days ago"  # IntervalStyle = postgres_verbose
postgres_interval = "P1Y2M25DT5H6M7S"  # IntervalStyle = iso_8601
iso8601_duration = postgres_interval_to_iso8601(postgres_interval)

We should also test updating the sensor resolution, given this potential issue.

@Flix6x Flix6x self-assigned this Aug 18, 2023
@create-issue-branch
Copy link

@victorgarcia98
Copy link
Contributor

Cool stuff! 😄

Following your steps, I came up with a way to avoid creating a new column.

Tech spike:

from sqlalchemy import types,func
from sqlalchemy.orm import column_property, declared_attr
from pandas.tseries.frequencies import to_offset

[...]

class SensorDBMixin(Sensor):
    [...]
    # rename event_resolution to _event_resolution
    _event_resolution = Column("event_resolution", Interval(), nullable=False, default=timedelta(hours=0))
    [...]  

    @declared_attr
    def event_resolution_str(self):
       """ Create a column property that casts event_resolution to String  """
        return column_property(func.cast(self.event_resolution, String))

    @hybrid_property
    def event_resolution(self):
        """Apply conversion from string to panda timedelta. TODO: use `postgres_interval_to_iso8601`"""

        if "year" in self.event_resolution_str:
            return to_offset("1Y")

        t = event_resolution_str.split(":")

        return timedelta(hours=int(t[0]), minutes=int(t[1]), seconds=int(t[2]))
    
    
    @event_resolution.expression
    def event_resolution(cls):
        """Map event_resolution expression to _event_resolution field. This is important to support things like:
        session.execute(select(Sensor.id, Sensor.event_resolution))
        """
        return cls._event_resolution
    
    @event_resolution.setter
    def event_resolution(self, value):
        """Map the setter of event_resolution to _event_resolution"""
        self._event_resolution = value

I tested the following:

>>> from flexmeasures.data.models.time_series import Sensor
>>> from sqlalchemy import cast, String, select
>>> session = app.db.session
>>> 
>>> s = Sensor.query.get(1)
>>> s.event_resolution
<YearEnd: month=12>
>>> s.event_resolution_str
'1 year'
>>> s.event_resolution = "P1Y"
>>> session.commit()
>>> s = Sensor.query.get(1)
>>> s.event_resolution
<YearEnd: month=12>
>>> print(select(Sensor.event_resolution))
SELECT sensor.event_resolution 
FROM sensor
>>> session.execute(select(Sensor.event_resolution))
<sqlalchemy.engine.result.ChunkedIteratorResult object at 0x7ff1509bf190>
>>> session.execute(select(cast(Sensor._event_resolution, String))).all()
[('00:15:00',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('1 year',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('1 year',)]
>>> session.execute(select(cast(Sensor.event_resolution, String))).all()
[('00:15:00',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('1 year',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('00:15:00',), ('1 year',)]

@victorgarcia98
Copy link
Contributor

victorgarcia98 commented Aug 22, 2023

This morning, @Flix6x and I had a meeting to discuss further that type to describe the event resolution.

For the record:

  • Adding DateOffsets is something people has reported: BUG: add between DateOffsets raises error pandas-dev/pandas#36590
  • Resampling works as expected (in DST transitions) with Offset Aliases but not with pd.Timedelta or datetime.timedelta. dateutil.relativedelta doesn't work.
  • Operations in dateutil.relativedelta work in general.
  • @Flix6x proposed to limit the possible event_resolution via API/CLI/UI to a set of controlled resolutions (e.g. P1Y, P1M, P1D, PT1H, PT15M, PT5M, PT1M).

If DateOffsets supported sum (currently it supports subtraction), I think it would be a good candidate to replace native timedelta.

TODOs:

  • Test if Postgres Interval type behaves properly for 1 day, 1 month and 1 year, specially checking that 1 day != 24h in DST transition days.
  • Check what happens when adding 1 day to a datetime. Does the result start at 00:00 next day or it results in the time 24H from the timestamp provided?

Feel free to add anything to this 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants