Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Optimize postgresql storage get all fixes 1507 #1615

Closed
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
5 changes: 4 additions & 1 deletion CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,10 @@ This document describes changes between each past release.
9.1.0 (unreleased)
------------------

- Nothing changed yet.
**Internal changes**

- Refactor of ``kinto.core.storage.Storage.get_all`` method to issue two separate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it only applies to PostgreSQL ;)

And also, I think it would make sense to mention the performance gain!

simpler, queries to get the total count and the records. (#1507)


9.0.0 (2018-04-25)
Expand Down
2 changes: 1 addition & 1 deletion CONTRIBUTORS.rst
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,7 @@ Contributors
* Oron Gola <oron.golar@gmail.com>
* Palash Nigam <npalash25@gmail.com>
* PeriGK <per.gkolias@gmail.com>
* Peter Bengtsson <mail@peterbe.com>
* realsumit <sumitsarinofficial@gmail.com>
* Rektide <rektide@voodoowarez.com>
* Rémy Hubscher <rhubscher@mozilla.com>
Expand All @@ -83,4 +84,3 @@ Contributors
* Vitor Falcao <vitor.falcaor@gmail.com>
* Wil Clouser <wclouser@mozilla.com>
* Yann Klis <yann.klis@gmail.com>

56 changes: 26 additions & 30 deletions kinto/core/storage/postgresql/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -587,31 +587,24 @@ def get_all(self, collection_id, parent_id, filters=None, sorting=None,
modified_field=DEFAULT_MODIFIED_FIELD,
deleted_field=DEFAULT_DELETED_FIELD,
auth=None):
query = """
WITH collection_filtered AS (
SELECT id, last_modified, data, deleted
FROM records
WHERE {parent_id_filter}
AND collection_id = :collection_id
{conditions_deleted}
{conditions_filter}
),
total_filtered AS (
SELECT COUNT(id) AS count_total
FROM collection_filtered
WHERE NOT deleted
),
paginated_records AS (
SELECT DISTINCT id
FROM collection_filtered
{pagination_rules}
)
SELECT count_total,
a.id, as_epoch(a.last_modified) AS last_modified, a.data
FROM paginated_records AS p JOIN collection_filtered AS a ON (a.id = p.id),
total_filtered
{sorting}
LIMIT :pagination_limit;
count_query = """
SELECT COUNT(id) AS count_total
FROM records
WHERE {parent_id_filter}
AND collection_id = :collection_id
AND NOT deleted
{conditions_filter};
"""
select_query = """
SELECT id, as_epoch(last_modified) AS last_modified, data
FROM records
WHERE {pagination_rules}
{parent_id_filter}
AND collection_id = :collection_id
{conditions_deleted}
{conditions_filter}
{sorting}
LIMIT :pagination_limit;
"""

# Unsafe strings escaped by PostgreSQL
Expand Down Expand Up @@ -647,21 +640,24 @@ def get_all(self, collection_id, parent_id, filters=None, sorting=None,
if pagination_rules:
sql, holders = self._format_pagination(pagination_rules, id_field,
modified_field)
safeholders['pagination_rules'] = 'WHERE {}'.format(sql)
safeholders['pagination_rules'] = '{} AND'.format(sql)
placeholders.update(**holders)
else:
safeholders['pagination_rules'] = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: superfluous since safeholders is a default dict ;)


# Limit the number of results (pagination).
limit = min(self._max_fetch_size, limit) if limit else self._max_fetch_size
placeholders['pagination_limit'] = limit

with self.client.connect(readonly=True) as conn:
result = conn.execute(query.format_map(safeholders), placeholders)
result = conn.execute(select_query.format_map(safeholders), placeholders)
retrieved = result.fetchmany(self._max_fetch_size)

if len(retrieved) == 0:
return [], 0
if len(retrieved) == 0:
return [], 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some further work to do here.
This early exit allows us to being able to skip the COUNT query since the SELECT query yielded 0 results anyway.

The reason for why I didn't use the exact filter for the COUNT and the SELECT was because of this test:
https://github.com/peterbe/kinto/blob/51c7bf9ec9eb1f89d7e7d3a015429f2bc525551d/kinto/core/storage/testing.py#L1217-L1218
(there might be more like this!)
In that test it does a get_all(include_deleted=True, ...) and it expects the number of records to be different from the total count.
Isn't that strange? If you include_deleted=True why should the total count be always filter out the deleted records??

If that test is wrong, I can rewrite the two queries so that the SELECT and COUNT uses the exact same WHERE clause. Then, if we do that we can do the COUNT first (slightly faster since it's just an integer) and if it's 0 we can skip the SELECT and just return an empty list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variables are not named correctly in the tests, instead it could be:

records_and_tombstones, records_count = self.storage.get_all(parent_id='abc',
                                                             collection_id='c',
                                                             include_deleted=True)
self.assertEqual(records_count, 0)
self.assertEqual(len(records_and_tombstones), 2)

The returned count is the number of records, excluding tombstones. Tombstones are returned when ?_since is in querystring, they are not real records....

The parameter include_deleted is not named correctly, it should be with_tombstones. We identified the issue a while ago #710


count_total = retrieved[0]['count_total']
result_count = conn.execute(count_query.format_map(safeholders), placeholders)
count_total, = result_count.fetchone()

records = []
for result in retrieved:
Expand Down
12 changes: 6 additions & 6 deletions kinto/core/storage/testing.py
Original file line number Diff line number Diff line change
Expand Up @@ -1211,13 +1211,13 @@ def test_delete_all_can_delete_by_parent_id_with_tombstones(self):
self.assertEqual(count, 1)
self.assertEqual(len(records), 1)

records, count = self.storage.get_all(parent_id='abc',
collection_id='c',
include_deleted=True)
records_and_tombstones, count = self.storage.get_all(parent_id='abc',
collection_id='c',
include_deleted=True)
self.assertEqual(count, 0)
self.assertEqual(len(records), 2)
self.assertTrue(records[0]['deleted'])
self.assertTrue(records[1]['deleted'])
self.assertEqual(len(records_and_tombstones), 2)
self.assertTrue(records_and_tombstones[0]['deleted'])
self.assertTrue(records_and_tombstones[1]['deleted'])

def test_delete_all_can_delete_partially(self):
self.create_record({'foo': 'po'})
Expand Down