[WIP] Optimize postgresql storage get all fixes 1507 #1615
Conversation
Did some benchmarking first. See https://gist.github.com/peterbe/c4b6bb31e2b957d85d14932f5200f08a
```python
if len(retrieved) == 0:
    return [], 0
```
There might be some further work to do here.

This early exit allows us to skip the COUNT query, since the SELECT query yielded 0 results anyway.

The reason I didn't use the exact same filter for the COUNT and the SELECT was this test:
https://github.com/peterbe/kinto/blob/51c7bf9ec9eb1f89d7e7d3a015429f2bc525551d/kinto/core/storage/testing.py#L1217-L1218
(there might be more like this!)

That test does a `get_all(include_deleted=True, ...)` and expects the number of records to be different from the total count. Isn't that strange? If you pass `include_deleted=True`, why should the total count always filter out the deleted records?

If that test is wrong, I can rewrite the two queries so that the SELECT and COUNT use the exact same WHERE clause. Then we can do the COUNT first (slightly faster since it's just an integer) and, if it's 0, skip the SELECT and just return an empty list.
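A minimal sketch of that proposed flow (COUNT first with the shared WHERE clause, early exit on zero) could look like this. sqlite3 stands in for PostgreSQL here, and the table and column names are illustrative, not Kinto's actual schema:

```python
import sqlite3

# Toy schema standing in for the real "records" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT, parent_id TEXT, data TEXT)")
conn.execute("INSERT INTO records VALUES ('r1', 'abc', '{}')")

def get_all(conn, parent_id):
    # Both queries share the exact same WHERE clause.
    where = "parent_id = ?"
    count, = conn.execute(
        f"SELECT COUNT(*) FROM records WHERE {where}", (parent_id,)
    ).fetchone()
    if count == 0:
        # Early exit: the SELECT can't return anything, so skip it.
        return [], 0
    rows = conn.execute(
        f"SELECT id, data FROM records WHERE {where}", (parent_id,)
    ).fetchall()
    return rows, count

print(get_all(conn, "abc"))   # ([('r1', '{}')], 1)
print(get_all(conn, "nope"))  # ([], 0)
```

Note this only pays off when empty results are common; otherwise the COUNT is pure overhead before every SELECT.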
The variables are not named correctly in the tests; instead it could be:

```python
records_and_tombstones, records_count = self.storage.get_all(parent_id='abc',
                                                             collection_id='c',
                                                             include_deleted=True)
self.assertEqual(records_count, 0)
self.assertEqual(len(records_and_tombstones), 2)
```

The returned count is the number of records, excluding tombstones. Tombstones are returned when `?_since` is in the querystring; they are not real records.

The parameter `include_deleted` is not named correctly; it should be `with_tombstones`. We identified the issue a while ago: #710
I see you've made some drastic changes to this query. Most are probably fine, but separating the
I was first tempted to think that the count reentrancy would be acceptable, but it's used to paginate in the resource code `collection_get()`. I wonder what consequences there could be, but maybe there are ways to be smarter there... Related: #1170
```python
select_query = """
SELECT {distinct} id, as_epoch(last_modified) AS last_modified, data
FROM records
WHERE {pagination_rules}
```
µ-nit: indentation of WHERE 😊
In #1616 (comment) I discovered something. This PR uses a
To be clear: in my last commit, I removed the
@glasserc When you talked about reentrancy, does that refer to the network overhead of having to send the query twice to the Postgres server? Especially since its latency is slightly scary. Note that the psycopg connection is kept open by the Python process, so the reentrancy doesn't have to pay a "connection fee" more than once. Or are you referring to the risk of getting inconsistent numbers between the two queries? I.e.
Also, in terms of benchmarking performance, see https://gist.github.com/peterbe/c4b6bb31e2b957d85d14932f5200f08a#gistcomment-2571339
Perhaps I'm trying to prove something that isn't all that important or disputed, but I updated the HTTP GET benchmark by seeing how it behaves under concurrent load.
```python
    placeholders.update(**holders)
else:
    safeholders['pagination_rules'] = ''
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: superfluous, since `safeholders` is a defaultdict ;)
CHANGELOG.rst (outdated)

```
- Nothing changed yet.

**Internal changes**

- Refactor of ``kinto.core.storage.Storage.get_all`` method to issue two separate,
```
nit: it only applies to PostgreSQL ;)
And also, I think it would make sense to mention the performance gain!
Yes, I'm pretty sure it's this one! Ethan has worked a lot on race conditions / TOCTOU stuff ;)
Ah! Yeah, it's a real concern. I stand by the fact that this is now significantly less of a risk. What I mean by that is that since the

In https://gist.github.com/peterbe/c4b6bb31e2b957d85d14932f5200f08a I was able to show that I can finish the SELECT in 1.59ms on a 183,244-record database. And what proves that it's an index-level read (apart from looking at the EXPLAIN ANALYZE) is that that number is 1.51ms for an even larger database. (Why it's not exactly the same, I don't know.)

Anyway, there is a real way to prevent the

I haven't attempted to use this
Now that #1616 is clarified, it would be good to add a test to this PR that captures the subtle bug that the old query had when using a wildcard
It happens to be that way sometimes, but in general this query may not be a pure index-level read, and it may need to read the records table.
Another, simpler way is to use a different query. For example, the previous query doesn't have this problem. That's why I asked you to benchmark with and without that change. Even if the risk is small, it's a new risk and I would like to know what the pros and cons are.
Criss-cross github issue madness. #1616 (comment)
So my change here comes with a danger: you CAN end up with

But I haven't made any benchmarks with/without the technique discussed in https://gist.github.com/peterbe/546e782b5ae51afb22ab0eed9b3ddab1
No, I don't want you to test using

I want you to test with/without two queries, because it's possible to write a single query that does not suffer from the race condition that you automatically have if you write two queries, and if the performance penalty for that isn't too bad, then I would prefer that.

Yes, with this PR, certain queries are faster. (It turns out that others are slower.) But you have made a lot of different changes to the queries in order to get this result. I would like to look at each change individually rather than the whole package.
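The race condition inherent in two separate queries can be illustrated with a toy example. This is not Kinto code; sqlite3 stands in for PostgreSQL, and the single-threaded DELETE merely simulates a concurrent writer landing between the COUNT and the SELECT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT)")
conn.executemany("INSERT INTO records VALUES (?)", [("a",), ("b",)])

# Query 1: the COUNT.
count, = conn.execute("SELECT COUNT(*) FROM records").fetchone()

# ...simulate another client deleting a record between the two queries...
conn.execute("DELETE FROM records WHERE id = 'b'")

# Query 2: the SELECT now sees fewer rows than the count we just took.
rows = conn.execute("SELECT id FROM records").fetchall()

print(count, len(rows))  # 2 1 -- the total no longer matches the rows
```

A single query (or, in PostgreSQL, both queries inside one REPEATABLE READ transaction reading a consistent snapshot) cannot disagree with itself this way.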
Now I understand your point @glasserc

```sql
select COUNT(*) OVER (), id, as_epoch(last_modified) AS last_modified,
       MD5(jsonb_pretty(data)) AS data
from records
WHERE parent_id = '/buckets/build-hub/collections/releases'
  AND collection_id = 'record'
ORDER BY last_modified DESC
LIMIT 10;
```

Compared to these two:

```sql
select count(*)
from records
WHERE parent_id = '/buckets/build-hub/collections/releases'
  AND collection_id = 'record';

select id, as_epoch(last_modified) AS last_modified,
       MD5(jsonb_pretty(data)) AS data
from records
WHERE parent_id = '/buckets/build-hub/collections/releases'
  AND collection_id = 'record'
ORDER BY last_modified DESC
LIMIT 10;
```

If I break these up and run all three 10 times and record the best possible time with my 183,240-row database:
The difference is that it's roughly 10x slower to do the

To sanity check, I compared it with the original query (the one still in

Lastly, the

Also, the biggest problem with the

Next, anybody got a really good window function to do the same that I can use?
By the way, I tried to write a CTE query, but after a lot of trial and error the only thing I could come up with was one just like the original, and I was back to square one where it takes several seconds to execute. :(
Great, thanks! From your notes in #1507, the

One obvious difference between the queries you're benchmarking and the ones in your PR is that the ones in the PR have filters associated with them. Between this and the fact that your broken-down queries are doing a table scan, it might be good to see what happens if you run the query on a collection that makes up a smaller proportion of the database.

I didn't know about windowing functions. The query I came up with is much closer to the original query:
I took a copy of my database and deleted a bunch of random records. Now it only has 30,000 records. Half of them I changed the
Now, let's compare:

```sql
WITH collection_filtered AS (
    SELECT id, last_modified, data, deleted
    FROM records
    WHERE parent_id = '/buckets/build-hub/collections/funny'
      AND collection_id = 'record'
    ORDER BY last_modified DESC
),
total_filtered AS (
    SELECT COUNT(*) AS count_total
    FROM collection_filtered
    WHERE NOT deleted
)
SELECT count_total,
       a.id, as_epoch(a.last_modified) AS last_modified,
       a.data
FROM collection_filtered as a,
     total_filtered
LIMIT 10;
```

Against...

```sql
select count(*)
from records
WHERE parent_id = '/buckets/build-hub/collections/funny'
  AND collection_id = 'record';
```

and

```sql
select id, as_epoch(last_modified) AS last_modified,
       data
from records
WHERE parent_id = '/buckets/build-hub/collections/funny'
  AND collection_id = 'record'
ORDER BY last_modified DESC
LIMIT 10;
```

Results:
So the combined query is still much slower: 16x slower. I ran each query 100 times, which makes the mean roughly equal to the median execution time.
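The "run each query N times, keep the best and the median" methodology could be sketched like this. sqlite3 and the table contents are stand-ins; the real numbers would of course come from PostgreSQL:

```python
import sqlite3
import statistics
import time

# Toy table standing in for the real records table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, last_modified INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(i, i) for i in range(10_000)])

def bench(sql, runs=100):
    """Run the query `runs` times; return (best, median) in milliseconds."""
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append((time.perf_counter() - t0) * 1000)
    return min(timings), statistics.median(timings)

best, median = bench(
    "SELECT id FROM records ORDER BY last_modified DESC LIMIT 10"
)
print(f"best={best:.3f}ms median={median:.3f}ms")
```

Reporting both best and median guards against one-off outliers (cold caches, checkpoints) skewing the comparison.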
50% is still quite a large slice of your database. I bet it's still doing a table scan. (Some EXPLAIN ANALYZE output would verify it for sure.) How about if it's only 10% or 1% of the total records? How about one of the records from the Buildhub docs, using the Kinto records API queries instead of Elasticsearch?
This time, I try with a parent_id that only matches 10 items:

Same three queries as above, but instead of

Results:
So in this case, the split-up SELECT & COUNT is 10% slower.
All three

The order is:

All three use an index scan.
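Checking for an index scan can be automated along these lines. This uses sqlite3's EXPLAIN QUERY PLAN as a runnable stand-in for PostgreSQL's EXPLAIN ANALYZE, with a made-up table and index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (parent_id TEXT, last_modified INTEGER)")
# Composite index matching the WHERE + ORDER BY below.
conn.execute("CREATE INDEX idx_parent ON records (parent_id, last_modified)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT last_modified FROM records WHERE parent_id = 'p' "
    "ORDER BY last_modified DESC LIMIT 10"
).fetchall()

# The last column of each plan row is the human-readable detail string,
# e.g. "SEARCH records USING COVERING INDEX idx_parent (parent_id=?)".
detail = " ".join(row[-1] for row in plan)
print(detail)
print("uses index:", "INDEX" in detail)
```

On PostgreSQL the equivalent check would be grepping EXPLAIN ANALYZE output for "Index Scan" (or "Index Only Scan") rather than "Seq Scan".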
In Buildhub we have this thing where it uses

```python
import time
t0 = time.time()
new_records_batches = [client.get_records(
    _since=previous_run_etag,
    pages=float('inf')
)]
t1 = time.time()
print(
    "TOOK",
    t1 - t0,
    "seconds to paginate fetch",
    len(new_records_batches[0]),
    "(in {} batches of 10,000)".format(int(len(new_records_batches[0]) / 10_000))
)
raise Exception
```

The output becomes:
Then I literally copied the
To sanity check, I watched my /usr/local/var/log/postgres.log, where I have it log every single query. I also saved each massive list of dicts to disk to be able to compare them:
However! They were not identical. So I parsed them and analyzed them a bit more in detail. If I convert each list to a dict keyed by the
So the old (
I'm not even sure where to begin debugging this! Or if now's the time to debug it at all. One culprit might be the rounding error of rounding the last_modified up to the nearest second. Note that each two pairs of

At the end of the day, the new code works just as well as the old code, but it's 13 times faster.
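The comparison described above (keying each dump by record id and diffing the last_modified values) could be sketched like this; the helper name and the sample dumps are hypothetical:

```python
def diff_by_id(old_records, new_records):
    """Return {id: (old_last_modified, new_last_modified)} for every
    record whose last_modified differs between the two dumps."""
    old = {r["id"]: r for r in old_records}
    new = {r["id"]: r for r in new_records}
    assert old.keys() == new.keys(), "the dumps contain different record ids"
    return {
        rid: (old[rid]["last_modified"], new[rid]["last_modified"])
        for rid in old
        if old[rid]["last_modified"] != new[rid]["last_modified"]
    }

# Tiny made-up dumps to show the shape of the output.
old_dump = [{"id": "a", "last_modified": 1000},
            {"id": "b", "last_modified": 2000}]
new_dump = [{"id": "a", "last_modified": 1000},
            {"id": "b", "last_modified": 2001}]
print(diff_by_id(old_dump, new_dump))  # {'b': (2000, 2001)}
```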
OMG, this is dreadful... It's unrelated to this PR indeed, which has to assume that epoch timestamps are unique. Let's create a bug around that; we should at least have a DB constraint to avoid this situation.
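The suggested constraint could look something like this sketch; sqlite3 stands in for PostgreSQL, and in real Kinto this would presumably be a unique index added in a schema migration rather than inline DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE clause forbids two records in the same collection from
# sharing a last_modified timestamp; table shape is illustrative only.
conn.execute("""
    CREATE TABLE records (
        id TEXT,
        parent_id TEXT,
        collection_id TEXT,
        last_modified INTEGER,
        UNIQUE (parent_id, collection_id, last_modified)
    )
""")
conn.execute("INSERT INTO records VALUES ('a', 'p', 'c', 1000)")
try:
    # Same collection, same timestamp: the constraint rejects it.
    conn.execute("INSERT INTO records VALUES ('b', 'p', 'c', 1000)")
except sqlite3.IntegrityError as exc:
    print("duplicate timestamp rejected:", exc)
```

With such a constraint in place, the two-query approach could at least rely on epoch timestamps being unique within a collection.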
We just talked about this in Vidyo -- for those following along at home:
If I understood correctly, @peterbe said he was going to step away from this PR, so (assuming it's high priority) maybe I will take a swing at coming up with a query that is better than what we have without slowing down other queries.
There are many aspects that made this issue more important recently:
I don't think that's a fair evaluation. When there were only 10 records to be matched, the sum of the SELECT + COUNT queries was 10% slower. Also, it was a difference between 1.48ms vs. 0.15ms + 1.55ms. That might just be noise. There is admittedly a tiny overhead of having to send two distinct queries on the open connection.

The proposed solution works for tiny queries. Paying 1.7ms for a count and a set of records is cheap. And it scales well beyond 10 matching records.

Also, in terms of priorities, I think we're missing a subtle point: if a READ query takes 15 seconds (not unrealistic at all at the moment on databases like Buildhub), the Postgres server is potentially distracted and hogging resources away from other queries (read or write) that need to be dealt with. Perhaps that's why we saw, in New Relic, these monster times for queries that, on paper, should be fast.

My perspective on Kinto is that of a single instance that I'm familiar with. That's why I'm hesitant to speak to the risk evaluation of this in totality. Hence my now slightly withdrawn involvement.
In fact, your proposed solution can be up to 100% slower because of what I wrote in #1507 (comment).
In general, yes, I agree, slow queries are bad. Note that what you are worrying about here is CPU usage. However, your benchmarks actually measure latency.
So can we close this now that #1622 has landed? I think so.
Fixes #1507