Describe the bug
When performing a SELECT with concurrent writes, there is a rare chance that the SELECT returns corrupt data.
This was detected in the following use case:
1. A client acquires a timestamp (`current_ts`) that will be used to retrieve data from a table `Test`.
2. It uses `current_ts` to get, for each `id`, the row with the maximum `ts` such that `ts <= current_ts`. For instance, in the following example, with `current_ts = 7`, the query would return rows 3 and 4.
| # | id | val | ts |
|---|----|-----|----|
| 1 | u1 | 100 | 0 |
| 2 | u1 | 60 | 5 |
| 3 | u1 | 52 | 6 |
| 4 | u2 | 100 | 0 |
| 5 | u2 | 73 | 10 |
To achieve that, the following query is used:
```sql
SELECT id, val, ts
FROM test
WHERE (id, ts) IN (
    SELECT id, max(t.ts)
    FROM test t
    WHERE t.ts <= current_ts
    GROUP BY id
)
```
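For concreteness, the query can be run against the example table above. Here is a self-contained sketch using SQLite (the report itself concerns MonetDB, but the row-value `IN` subquery behaves the same way in SQLite ≥ 3.15; the column types are assumptions):

```python
import sqlite3

# Build the example table from above (column types are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, val REAL, ts INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("u1", 100, 0), ("u1", 60, 5), ("u1", 52, 6),
     ("u2", 100, 0), ("u2", 73, 10)],
)

current_ts = 7
rows = conn.execute(
    """
    SELECT id, val, ts
    FROM test
    WHERE (id, ts) IN (
        SELECT id, max(t.ts)
        FROM test t
        WHERE t.ts <= ?
        GROUP BY id
    )
    ORDER BY id
    """,
    (current_ts,),
).fetchall()
print(rows)  # [('u1', 52.0, 6), ('u2', 100.0, 0)] -- rows 3 and 4
```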
Since the table is always pre-populated with `ts = 0` and new inserts reuse existing ids, the query should always return the same number of rows.
The client then selects a random item and inserts a new row with a different `val` and the next `current_ts`.
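The read/insert loop described above should never change the query's row count. A minimal single-threaded sketch of the workload (again using SQLite purely for illustration; the four ids and value range are made up):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, val REAL, ts INTEGER)")
# Pre-populate every id at ts = 0, as described above (4 ids are an assumption).
ids = [f"u{i}" for i in range(4)]
conn.executemany("INSERT INTO test VALUES (?, 100, 0)", [(i,) for i in ids])

QUERY = """
    SELECT id, val, ts
    FROM test
    WHERE (id, ts) IN (
        SELECT id, max(t.ts)
        FROM test t
        WHERE t.ts <= ?
        GROUP BY id
    )
"""

counts = set()
for current_ts in range(1, 50):
    rows = conn.execute(QUERY, (current_ts,)).fetchall()
    counts.add(len(rows))
    # Select a random item and insert a new row for its id at this timestamp,
    # so the next reader (with the next current_ts) sees the new version.
    victim = random.choice(rows)
    conn.execute("INSERT INTO test VALUES (?, ?, ?)",
                 (victim[0], random.uniform(0, 100), current_ts))
print(counts)  # {4} -- one row per id, every time
```

Single-threaded, the row count is invariant; the bug only manifests with concurrent writers.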
The problem is that sometimes the query in step 2 does not return all rows. In addition, some of the rows are corrupted (e.g. `id` set to an empty string). Here is a small example where the query returned 898 rows instead of 4000:
| id | val | ts |
|----|-----|----|
| '' | 100.0 | 0 |
| u770 | 100.0 | 0 |
| '' | 100.0 | 0 |
| u771 | 100.0 | 0 |
However, immediately re-executing the same query with the same `current_ts` on the same connection returns the correct data.
To Reproduce
This gist contains the code to (hopefully) reproduce this error.
The method `Test.execute` contains the relevant code. If the error occurs, it prints `ERROR` to the terminal and logs the retrieved data, as well as the result of retrying that query, to the file `error.txt`. I also leave here a real example of an `error.txt` file:
Unfortunately, I am not able to narrow the problem down further than this.
Expected behavior
Always return the same number of rows with the correct values.
I can try the script as well. Please run the server with debug flags, e.g. `--debug=10`, so that after each MAL operator call the BAT properties are checked. That way we may be able to narrow down the issue more easily.
sjoerdmullender changed the title from *SELECT with concurrent writes rarely returns corrupt data* to *SELECT with concurrent writes sometimes returns corrupt data* on Mar 9, 2022.
Software versions
Additional context
The problem seems never to occur without concurrent inserts. Additionally, there are no error messages in the logs.