SELECT with concurrent writes sometimes returns corrupt data #7237

Closed
nuno-faria opened this issue Feb 2, 2022 · 3 comments
Labels
bug GDK Kernel

@nuno-faria nuno-faria commented Feb 2, 2022

Describe the bug
When performing a SELECT with concurrent writes, there is a rare chance that the SELECT returns corrupt data.
This was detected in the following use case:

  1. A client acquires a timestamp (current_ts) that will be used to retrieve data from a table Test;

  2. It uses current_ts to get the data with the maximum ts such that ts <= current_ts, grouping by id. For instance, in the following example, considering that current_ts = 7, it would return rows 3 and 4.

    #  id  val  ts
    1  u1  100  0
    2  u1  60   5
    3  u1  52   6
    4  u2  100  0
    5  u2  73   10

    To achieve that, the following query is used:

     SELECT id, val, ts
     FROM test
     WHERE (id, ts) IN (
         SELECT id, max(t.ts)
         FROM test t
         WHERE t.ts <= current_ts
         GROUP BY id
     )

    Since the table is always pre-populated with ts = 0 and new inserts use existing ids, the query should always return the same number of rows.

  3. The client then selects a random item and inserts a new row with a different val and the next current_ts.
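
The three steps above can be sketched in a few lines of Python. This is only an illustration of the intended semantics, not the reproduction script: it assumes SQLite (via the standard `sqlite3` module) as a self-contained stand-in for MonetDB, it runs single-threaded, and the helper name `snapshot` and the inserted value `88` are made up for the example. It does not reproduce the bug; it only shows the invariant that the row count stays constant.

```python
import sqlite3

# Assumption: SQLite stands in for MonetDB so the sketch is self-contained.
SNAPSHOT_QUERY = """
    SELECT id, val, ts
    FROM test
    WHERE (id, ts) IN (
        SELECT id, max(t.ts)
        FROM test t
        WHERE t.ts <= ?
        GROUP BY id
    )
    ORDER BY id
"""


def snapshot(conn, current_ts):
    """Return the newest version of every id that is visible at current_ts."""
    return conn.execute(SNAPSHOT_QUERY, (current_ts,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, val INTEGER, ts INTEGER)")
# The five example rows from step 2.
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("u1", 100, 0), ("u1", 60, 5), ("u1", 52, 6), ("u2", 100, 0), ("u2", 73, 10)],
)

# Step 2: read the snapshot at current_ts = 7 (rows 3 and 4 of the example).
rows_at_7 = snapshot(conn, 7)

# Step 3: write a new version of an existing id at the next timestamp.
conn.execute("INSERT INTO test VALUES (?, ?, ?)", ("u2", 88, 8))
rows_at_8 = snapshot(conn, 8)

# One row per id, before and after the insert.
print(rows_at_7)
print(rows_at_8)
```

Because every id already has a row with ts = 0 and inserts only reuse existing ids, the snapshot always contains exactly one row per id, whatever current_ts is.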

The problem is that the query in step 2 sometimes does not return all rows. In addition, some of the returned rows are corrupted (e.g. id set to an empty string). Below is a short excerpt from a result in which the query returned 898 rows instead of 4000:

id    val    ts
''    100.0  0
u770  100.0  0
''    100.0  0
u771  100.0  0

However, immediately re-executing the same query with the same current_ts on the same connection returns the correct data.
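
A client can detect both symptoms (too few rows, empty ids) with a small check on the fetched result. The following is a minimal sketch under assumptions: the function name `find_problems` is hypothetical, rows are taken to be `(id, val, ts)` tuples, and it is not the check used in the reproduction script.

```python
def find_problems(rows, expected_count):
    """Return a list of human-readable problems found in a query result.

    rows is a list of (id, val, ts) tuples; expected_count is the number of
    distinct ids the table was pre-populated with.
    """
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, got {len(rows)}")
    # An id that is None or an empty string counts as corrupted.
    empty_ids = sum(1 for row in rows if not row[0])
    if empty_ids:
        problems.append(f"{empty_ids} rows with an empty id")
    return problems


# The four rows quoted in the report, checked against the expected 4000.
sample = [("", 100.0, 0), ("u770", 100.0, 0), ("", 100.0, 0), ("u771", 100.0, 0)]
problems = find_problems(sample, 4000)
print(problems)
```

When the list is non-empty, the client can log the rows and re-run the query with the same current_ts to compare the two results.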

To Reproduce
This gist contains the code to (hopefully) reproduce this error.
The method Test.execute contains the relevant code. If the error occurs, it prints ERROR to the terminal and logs the data retrieved, as well as the result of retrying that query, to the file error.txt. I also leave here a real example of an error.txt file.
Unfortunately, I am not able to narrow the problem down more than this.

Expected behavior
Always return the same number of rows with the correct values.

Software versions

  • MonetDB v11.43.6 Jan22 branch (most recent commit)
  • Ubuntu 20.04 LTS
  • unixODBC 2.3.6, MonetDB ODBC driver installed from source, just like the server
  • Self-installed and compiled

Additional context
Seems to never happen without concurrent inserts. Additionally, there are no error messages in the logs.

@PedroTadim PedroTadim commented Feb 3, 2022

I can try the script as well. Please run the server with debug flags, e.g. --debug=10, so that the BAT properties are checked after each MAL operator call. This way we may be able to narrow down the issue more easily.

@nuno-faria nuno-faria commented Feb 3, 2022

With the --debug=10 flag, the server reported these two lines when the error occurred:

#2022-02-03 09:02:22: DFLOWworker1021: BATassertProps: !ERROR: Assertion `strcmp(b->theap->filename, filename) == 0' failed
#2022-02-03 09:02:22: DFLOWworker1021: BATassertProps: !ERROR: Assertion `strcmp(b->theap->filename, filename) == 0' failed
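
When scanning a longer server log for such assertion failures, the lines can be split into fields with a regular expression. This is a sketch only: the pattern is inferred from the two lines quoted above and may not cover other MonetDB log formats, and the names `LOG_LINE` and `parse_error_line` are hypothetical.

```python
import re

# Assumed layout, inferred from the quoted lines:
#   #<date> <time>: <thread>: <function>: !ERROR: <message>
LOG_LINE = re.compile(
    r"^#(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"(?P<thread>\S+): (?P<function>\S+): !ERROR: (?P<message>.*)$"
)


def parse_error_line(line):
    """Return the fields of an !ERROR log line as a dict, or None if no match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


line = (
    "#2022-02-03 09:02:22: DFLOWworker1021: BATassertProps: !ERROR: "
    "Assertion `strcmp(b->theap->filename, filename) == 0' failed"
)
parsed = parse_error_line(line)
print(parsed)
```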

monetdb-team pushed a commit that referenced this issue Feb 4, 2022

@sjoerdmullender sjoerdmullender commented Feb 4, 2022

It took some time to find the problem, but the reproduction script was priceless. Thanks for that.

@sjoerdmullender sjoerdmullender added bug GDK Kernel labels Feb 4, 2022
@PedroTadim PedroTadim added this to the NEXTRELEASE milestone Feb 4, 2022
@sjoerdmullender sjoerdmullender removed this from the NEXTRELEASE milestone Feb 14, 2022
@sjoerdmullender sjoerdmullender added this to the Jan2022-SP1 milestone Feb 14, 2022
@sjoerdmullender sjoerdmullender changed the title from "SELECT with concurrent writes rarely returns corrupt data" to "SELECT with concurrent writes sometimes returns corrupt data" Mar 9, 2022