Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrading to 2.7.11 breaks older sqlite databases #8417

Closed
4 tasks done
hateyouinfinity opened this issue Feb 6, 2023 · 1 comment
Closed
4 tasks done

Upgrading to 2.7.11 breaks older sqlite databases #8417

hateyouinfinity opened this issue Feb 6, 2023 · 1 comment
Labels
bug Something isn't working great writeup This is a wonderful example of our standards

Comments

@hateyouinfinity
Copy link
Contributor

hateyouinfinity commented Feb 6, 2023

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

After upgrading to 2.7.11 orion started crashlooping, logs showed sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) duplicate column name: has_data.
Rolled back the migrations, downgraded to 2.7.10, tried to investigate.

Alembic upgrade never terminates (see Error).
The culpable revision is f92143d30c25, introduced in PR #8164.
It prepares flow_run_state and task_run_state tables for the subsequent migration of artifact data (in f92143d30c26)
The relevant part is that it calculates and sets the (boolean) value for has_data column based on the value of data column.
data can either be a NULL, a string 'null', or a string containing a json value.
The logic in populate_flow_has_data_in_batches is, I believe, intended to set has_data to 1 if data contains a json value and to 0 otherwise.
The current logic does not achieve that goal.

Problems

  1. Comparing anything with a NULL via an (in)equality operator results in a NULL (see 2.)
  2. Since the two parts of the condition are exclusive, disjunction leads to the condition evaluating to 1 for data='null' (same would happen for data=NULL if not for problem 1)
Goal data IS NOT NULL data != 'null' data IS NOT NULL or data != 'null'
NULL 0 0 NULL NULL
'null' 0 1 0 1
JSON 1 1 1 1

Problem 2 results in false positives, migrating data from every row to artifact table. This affects runs created on every version, though I'm not sure what the impact is besides bloat.
Problem 1 results in the upgrade never terminating if >500 rows with data=NULL are present in a table.
Seems like in versions>=2.6.0 data defaults to 'null', so this should only affect databases with runs created on <=2.5.0.

Solution

  1. Use IS NOT instead of != (again, see 2.)
  2. Use and instead of or
Goal data IS NOT NULL data IS NOT 'null' data IS NOT NULL or data IS NOT 'null' data IS NOT NULL and data IS NOT 'null'
NULL 0 0 1 1 0
'null' 0 1 0 1 0
JSON 1 1 1 1 1

Workaround

Stamp database with f92143d30c25, downgrade to bb38729c471a, replace two lines in revision f92143d30c25, run migrations.

Reproduction

SQL:

import sqlite3
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

stmt = """
        SELECT (PLACEHOLDER IS NOT NULL),
        (PLACEHOLDER != 'null'), 
        (PLACEHOLDER IS NOT 'null'), 
        (PLACEHOLDER IS NOT NULL or PLACEHOLDER != 'null'), 
        (PLACEHOLDER IS NOT NULL and PLACEHOLDER != 'null'), 
        (PLACEHOLDER IS NOT NULL or PLACEHOLDER IS NOT 'null'), 
        (PLACEHOLDER IS NOT NULL and PLACEHOLDER IS NOT 'null')
"""
cur.execute(stmt.replace("PLACEHOLDER", "NULL"))
print(cur.fetchall())
cur.execute(stmt.replace("PLACEHOLDER", "'null'"))
print(cur.fetchall())
cur.execute(stmt.replace("PLACEHOLDER", "'JSON'"))
print(cur.fetchall())

Complete:

  1. pip install prefect==2.4.2
  2. export PREFECT_ORION_DATABASE_CONNECTION_URL=sqlite+aiosqlite:///test.db
  3. export PREFECT_ORION_DATABASE_ECHO=True
  4. Start orion, start an agent, upload a deployment, run it at least once
  5. pip install prefect==2.7.11
  6. You need to have more rows with data=NULL in flow_run_state than batch_size used here, lower the value manually.
  7. Start orion, migrations will never terminate and the server will never start. Restarting the server will leave database in a broken state since has_data has already been added, but migrations terminated prematurely.

Error

2023-02-06 13:14:13,734 INFO sqlalchemy.engine.Engine BEGIN (implicit; DBAPI should not BEGIN due to autocommit mode)
13:14:13.734 | INFO    | sqlalchemy.engine.Engine - BEGIN (implicit; DBAPI should not BEGIN due to autocommit mode)
2023-02-06 13:14:13,735 INFO sqlalchemy.engine.Engine 
            UPDATE flow_run_state
            SET has_data = (data IS NOT NULL or data != 'null')
            WHERE flow_run_state.id in (SELECT id FROM flow_run_state WHERE (has_data IS NULL) LIMIT 500);
        
13:14:13.735 | INFO    | sqlalchemy.engine.Engine - 
            UPDATE flow_run_state
            SET has_data = (data IS NOT NULL or data != 'null')
            WHERE flow_run_state.id in (SELECT id FROM flow_run_state WHERE (has_data IS NULL) LIMIT 500);
        
2023-02-06 13:14:13,735 INFO sqlalchemy.engine.Engine [generated in 0.00069s] {}

<...>

2023-02-06 13:17:51,928 INFO sqlalchemy.engine.Engine 
            UPDATE flow_run_state
            SET has_data = (data IS NOT NULL or data != 'null')
            WHERE flow_run_state.id in (SELECT id FROM flow_run_state WHERE (has_data IS NULL) LIMIT 500);
        
13:17:51.928 | INFO    | sqlalchemy.engine.Engine - 
            UPDATE flow_run_state
            SET has_data = (data IS NOT NULL or data != 'null')
            WHERE flow_run_state.id in (SELECT id FROM flow_run_state WHERE (has_data IS NULL) LIMIT 500);
        
2023-02-06 13:17:51,928 INFO sqlalchemy.engine.Engine [cached since 218.2s ago] {}
13:17:51.928 | INFO    | sqlalchemy.engine.Engine - [cached since 218.2s ago] {}

Versions

Version:             2.7.11
API version:         0.8.4
Python version:      3.10.7
Git commit:          6b27b476
Built:               Thu, Feb 2, 2023 7:22 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         hosted

Additional context

No response

@hateyouinfinity hateyouinfinity added bug Something isn't working status:triage labels Feb 6, 2023
@zanieb
Copy link
Contributor

zanieb commented Feb 6, 2023

Thank you for the very thorough issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working great writeup This is a wonderful example of our standards
Projects
None yet
Development

No branches or pull requests

2 participants