Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression with new database connection handling/transaction isolation #1555

Open
omad opened this issue Feb 21, 2024 · 1 comment
Open

Regression with new database connection handling/transaction isolation #1555

omad opened this issue Feb 21, 2024 · 1 comment
Labels

Comments

@omad
Copy link
Member

omad commented Feb 21, 2024

Background

Digital Earth Australia provides a Python Environment on the NCI which includes ODC. It's currently running the very old version 1.8.8 under Python 3.8.13 and we want to upgrade to the latest version. We've created a new environment using ODC core version 1.8.17 running under Python 3.10.8, but in testing we've been getting errors which weren't occuring before.

This issue was reported and discussed internally in Jira Issue DSNS-304, and included code to reproduce the error.

Error

The problem was reported as an unhandled PostgreSQL exception being thrown in a production script, which used to work cleanly.

Original code and stack trace from error

When executing the following script with the new module load dea/20231204

#!/usr/bin/env python3

import datacube

message = """Will this work? Will this cause the new dea module to error out?"""

def main(product="ga_ls8c_ard_3"):
    dc = datacube.Datacube(app="gen-list")

    for l1_dataset in dc.index.datasets.search(product=product):
        with open('io.txt', 'a') as file:
            file.write(message + '\n')

if __name__ == "__main__":
    main(product="ga_s2am_ard_3")

Stack Trace

Traceback (most recent call last):
  File "/g/data/u46/users/dsg547/sandbox/ard_pipeline_support/tickets/DSNS-304/example/./odc_call.py", line 18, in <module>
    main(product="ga_s2am_ard_3")
  File "/g/data/u46/users/dsg547/sandbox/ard_pipeline_support/tickets/DSNS-304/example/./odc_call.py", line 14, in main
    for l1_dataset in dc.index.datasets.search(product=product):
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 548, in search
    for product, datasets in self._do_search_by_product(query,
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 691, in _do_search_by_product
    with self._db_connection() as connection:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_transaction.py", line 60, in _db_connection
    with self._index._active_connection(transaction=transaction) as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/index.py", line 157, in _active_connection
    with self._db._connect() as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/drivers/postgres/_connections.py", line 231, in _connect
    connection.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1251, in close
    conn.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1166, in close
    self._checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1008, in _checkin
    _finalize_fairy(
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 542, in checkin
    finalizer(connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 677, in _reset_characteristics
    characteristic.reset_characteristic(self, dbapi_connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/characteristics.py", line 50, in reset_characteristic
    dialect.reset_isolation_level(dbapi_conn)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in reset_isolation_level
    self.set_isolation_level(dbapi_conn, self.default_isolation_level)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 851, in set_isolation_level
    connection.set_isolation_level(level)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

I believe @Ariana-B did some investigation resulting in #1525 , especially in relation to the connection.set_isolation_level() changes, but it hasn't yet been resolved.

New Findings

I've run some more tests today, and have more of an idea of what the issue is.

  • The network configuration at the NCI drops idle TCP connections after 5 minutes. This is extremely short, the normal default is 2 hours.
  • The above code first loads 4Gb of database results into memory (about 3 minutes), then slowly writes lines to disk IO while iterating over DB results (more than 10 minutes for the ~500k results).
  • Under ODC 1.8.8, the script then exits cleanly.
  • Under ODC 1.8.17, something attempts to use the db connection to change the isolation level. That fails because the network connection was idle for more than 5 minutes.

I'm not sure whether the change is ODC or SQLAlchemy or psycopg2, but I suspect it's ODC.

I'm about to write up a proposed improvement in a new issue, to allow passing arbitrary connection parameters to libpq, which will allow working around the TCP idle timeout issue here.

@SpacemanPaul
Copy link
Contributor

As discussed, any connection parameter that can be included in the database URL can already be user-specfied in 1.9 thanks to the configuration layer changes from EP10.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants