New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

devkat: DB upsert performance improvements. #2464

Merged
merged 8 commits into from Jan 27, 2018

Conversation

Projects
None yet
5 participants
@sebastienvercammen
Member

sebastienvercammen commented Jan 26, 2018

Description

  • Changes our bulk upsert code to use INSERT ... ON DUPLICATE KEY UPDATE instead of relying on peewee, which only supports REPLACE INTO.
  • Also leverages the cursor's .executemany, allowing the DB to properly prepare for a batched query.

Long-term, we can look at increasing the batch step (currently 250) and analyze the performance with executemany.

Motivation and Context

REPLACE INTO first deletes the old entry before inserting the new one. This causes several issues:

  • Unnecessary overhead.
  • The auto-increment ID for the row doesn't stay the same, it gets a new one when inserted. #2440 is affected by this.
  • Table indexes have to be updated each time.
  • Strongly affects DB lock wait times and eventually causes deadlocks once it reaches its limit.

How Has This Been Tested?

Tested by devkat's grapefruit patrons as well as a 2-week testing period by our NAG patrons, per our usual patron-only release/testing schedule.

Feedback:

stock RM:
Upserted to ScannedLocation, 7957 records (upsert queue remaining: 0) in 1.981721 seconds.

new:
Upserted to ScannedLocation, 7957 records (upsert queue remaining: 0) in 0.909359 seconds.

[9:04 PM] -censored-: Jesus H Crust that upsert commit really helped. Changed from 16 db threads and still getting db-queues and eventually locks to 1 db-thread and smooth sailing.

Types of changes

  • Enhancement

Checklist:

  • My code follows the code style of this project.
@captbunzo

Tested on about 100 hives for several weeks. No problems found and performance improvement was observed. Thanks for the hard work on this!

query_string = ('INSERT INTO {table} ({fields}) VALUES'
+ ' ({placeholders}) ON DUPLICATE KEY UPDATE'
+ ' {assignments}')

This comment has been minimized.

@friscoMad

friscoMad Jan 26, 2018

Contributor

Maybe we can benchmark this part as maybe it is not needed but all that meta and peewee classes sounds more or less expensive calls to get always the same result per class so maybe we can cache this string and as we do a lot of inserts it can help.
Also this part maybe fits better in is own function for clarity

This comment has been minimized.

@sebastienvercammen

sebastienvercammen Jan 26, 2018

Member

@friscoMad I agree that we could cache some things per model. I'm not sure how expensive it is exactly, I primarily benchmarked the DB changes. Definitely something we can look into while reviewing.

I expect the calls won't be the most expensive part, since most is done on a handful of small lists containing the field names. Also keep in mind that fields are determined by the InsertQuery since it uses its models to know which are required.

But I'm not sure that separating it into its own function is the best idea: it's purely to build our query, and it would be irrelevant in any scope outside of the upsert.

return
# We used to support SQLite and it has a default max 999 parameters,
# so we need to limit how many rows we insert for it.
step = 250

This comment has been minimized.

@friscoMad

friscoMad Jan 26, 2018

Contributor

Not sure but maybe we can remove the steps also, I did a small research and it seems that mysql has a 10K parameter limit, we are using 2x each value, but even then it is 5K params and I am not sure what queries can insert so many elements (steps maybe)

This comment has been minimized.

@sebastienvercammen

sebastienvercammen Jan 26, 2018

Member

@friscoMad Per the provider docs:

Parameter limits:

Oracle: 64000
MySQL: 65535
PostgreSQL: 34464
Sqlite: 999

We can definitely increase it, that's why I added it as a TODO in the description of the PR.

@friscoMad

This comment has been minimized.

Contributor

friscoMad commented Jan 27, 2018

Just as information query creation step does take between 0.0002 and 0.0004 seconds, not big deal.

@jaake77

Works. Zoom zoom. Thanks for being you.

sebastienvercammen added some commits Dec 27, 2017

Support peewee defaults.
Fix reference to meta defaults.

Fix field defaults.

Fix row reference.

Fix double escaping.
Use class for fields instead of row.
Reference field values.

Use proper references for peewee.

Make sure all columns are added.

Fix KeyError.

Fix defaults keys.

Fix order of values.

Sort rows.
Translate input fields.
Removed unnecessary escaping.

Added extra comments.
Use InsertQuery to model the query.
Pythonic first item of iterator.

Convert callable default to static.
@friscoMad

Tested again really fast and working

@sebastienvercammen sebastienvercammen merged commit 5225b59 into RocketMap:develop Jan 27, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@sebastienvercammen sebastienvercammen deleted the sebastienvercammen:devkat_upsert branch Jan 27, 2018

@sebastienvercammen sebastienvercammen referenced this pull request Jan 27, 2018

Merged

Potential EX gym detection #2440

6 of 7 tasks complete

tomballgithub added a commit to tomballgithub/RocketMap that referenced this pull request Feb 9, 2018

devkat: DB upsert performance improvements. (RocketMap#2464)
* Test base class to query conversion.

Fix variable name.

* Support peewee defaults.

Fix reference to meta defaults.

Fix field defaults.

Fix row reference.

Fix double escaping.

* Use class for fields instead of row.

Reference field values.

Use proper references for peewee.

Make sure all columns are added.

Fix KeyError.

Fix defaults keys.

Fix order of values.

Sort rows.

* Support peewee column name changes.

* Translate input fields.

Removed unnecessary escaping.

Added extra comments.

* Use InsertQuery to model the query.

Pythonic first item of iterator.

Convert callable default to static.

* Only call callables when needed.

* Double step size.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment