
Database vacuuming either not running or unoptimized. #1128

Closed
smathermather opened this issue Jan 26, 2022 · 11 comments · Fixed by #1256
Labels
good first issue · possible software fault

Comments

@smathermather
Contributor


How did you install WebODM (docker, installer, etc.)?

Seriously dockerized here.

What's your browser and operating system? (Copy/paste the output of https://www.whatismybrowser.com/)

Not relevant.

What is the problem?

Autovacuum is either not set or untuned in WebODM

How can we reproduce this? (What steps trigger the problem? What parameters are you using for processing? Include screenshots. If you are having issues processing a dataset, you must include a copy of your dataset uploaded on Dropbox, Google Drive or https://dronedb.app)

See this: https://community.opendronemap.org/t/what-is-eating-up-my-hard-disk-part-dieux-pssst-spoiler-its-the-database-not-being-vacuumed/10330

smathermather added the good first issue and possible software fault labels Jan 26, 2022
@smathermather
Contributor Author

I'm guessing we need to touch autovacuum_vacuum_cost_limit and/or autovacuum_vacuum_cost_delay in https://github.com/OpenDroneMap/WebODM/blob/master/db/Dockerfile, maybe around line 58.
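
For reference, a quick way to confirm what the running db container is actually using (and whether autovacuum is enabled at all) is a standard pg_settings query; a diagnostic sketch:

-- Inspect the autovacuum configuration currently in effect
SELECT name, setting, unit, source
FROM pg_settings
WHERE name LIKE 'autovacuum%'
ORDER BY name;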

@pierotofy
Member

I didn't even know this was a thing (why doesn't postgres do this automatically?), so I'll go with whatever you suggest.

@smathermather
Contributor Author

smathermather commented Jan 26, 2022

why doesn't postgres do this automatically

It's that old adage: configuration over convention. Wait…

https://github.com/OpenDroneMap/WebODM/blob/master/db/Dockerfile maybe around line 58.

If this is the right place, I'll take a stab at a pull request.

@Saijin-Naib
Contributor

I'm going to defer to you, Stephen. I don't understand PGSQL enough to do anything other than guess. My familiarity is just with GeoPackage/SQLite on smaller deployments than we'd see ODM scale to.

@OKaluza
Contributor

OKaluza commented Oct 21, 2022

Hi, I've just spent a lot of time looking into this issue and what causes it. It's hard to pin down; I've dug into the internals of PostgreSQL more than I wanted to, and I have heaps of debugging info and further details I can share, but here are my conclusions so far:

  • The autovacuum is running, and the settings should be OK. We may be able to tweak them to get around this, but I think it's better to fix the underlying cause.
  • The bloat is in a pg_toast_#### table related to app_task. This holds the old versions of large values when a record is updated; the vacuum should clean them up, but clearly this isn't always happening:
SELECT relname AS "table_name",
       pg_size_pretty(pg_table_size(C.oid)) AS "table_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema') AND relkind IN ('r')
ORDER BY pg_table_size(C.oid) DESC
LIMIT 1;

 table_name | table_size 
------------+------------
 app_task   | 34 GB
(1 row)

SELECT relname, pg_total_relation_size(oid)
FROM pg_class
ORDER BY 2 DESC;

                             relname                             | pg_total_relation_size 
-----------------------------------------------------------------+------------------------
 app_task                                                        |            36399669248
 pg_toast_18121                                                  |            36398899200
 pg_toast_18121_index                                            |              515096576
 django_session                                                  |               45457408

SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
LIMIT 2;

           relation            |  size  
-------------------------------+--------
 pg_toast.pg_toast_18121       | 33 GB
 pg_toast.pg_toast_18121_index | 531 MB
(2 rows)
  • I noticed the vacuum job often gets stuck (trying to get a lock?) for long periods. This probably happens later on, once the issue starts to snowball. I had to kill this job to get things to continue, but even that wasn't enough to fix things:
SELECT pid, datname, usename, state, backend_xmin, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

  pid   |  datname   | usename  | state  | backend_xmin |                          query                           
--------+------------+----------+--------+--------------+----------------------------------------------------------
 219389 | webodm_dev | postgres | active |     10020481 | autovacuum: VACUUM pg_toast.pg_toast_18121
  • The only way to get things back to a sane state was to stop the apps and manually run a VACUUM FULL, and this only works if there are no jobs running.
  • The above query, which shows long-running / active queries, also showed that the UPDATE to app_task issued from the web app would occasionally take quite a while to execute.
  • When a job is running, updates are very frequent. That should be fine, but if you query the database for the biggest space-hogging column, it is console_output in app_task: this is the biggest data consumer, and the column is updated every cycle! (A sketch of this kind of per-column size query follows this list.)
                 table_name                  |     column_name      | pg_size_pretty 
---------------------------------------------+----------------------+----------------
 app_task                                    | console_output       | 276 MB
 django_session                              | session_data         | 16 MB
 app_imageupload                             | image                | 16 MB
 django_session                              | session_key          | 4289 kB
 spatial_ref_sys                             | srtext               | 3352 kB
 app_imageupload                             | task_id              | 3152 kB
 django_session                              | expire_date          | 1040 kB
 app_imageupload                             | id                   | 788 kB
 spatial_ref_sys                             | proj4text            | 614 kB
 app_task                                    | available_assets     | 59 kB
  • As a long-running job progresses, console_output only gets bigger. Every time it's updated, the old version is left behind in the TOAST table; the vacuum can't keep up, and eventually everything grinds to a halt as the database updates start fighting with the autovacuum for locks. We eventually either finish the job or run out of disk space, slowing down all other queries to the db along the way.
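
For reference, a query along these lines produces per-column size estimates like the table above (a sketch, not necessarily the exact query that was run; pg_column_size is standard PostgreSQL, and the table/column names come from the output above):

-- Estimate the stored size of the console_output column in app_task
SELECT 'app_task'       AS table_name,
       'console_output' AS column_name,
       pg_size_pretty(sum(pg_column_size(console_output))) AS total_size
FROM app_task;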

Possible actions if I'm coming to the right conclusions:

  • Update Postgres and see if it improves things; PostgreSQL 9.6+ will probably help: https://stackoverflow.com/a/40457585
  • Try tweaking the autovacuum settings so it can keep up with the rapidly updating console output (a per-table sketch follows this list)
  • Change the way the console output is updated, e.g. reducing the frequency, checking whether it has changed before committing to the db, storing it in another related table (one row per line, perhaps), or just keeping it as a separate file instead?
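
One way to try the settings tweak without touching the global configuration would be per-table storage parameters, which can also target the table's TOAST relation. A sketch only, with illustrative (untested) values:

-- Make autovacuum trigger much earlier on app_task and its TOAST table
ALTER TABLE app_task SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay = 0,
    toast.autovacuum_vacuum_scale_factor = 0.01
);

Parameters set this way are stored with the table, so they override postgresql.conf for app_task only and survive container restarts.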

@smathermather
Contributor Author

smathermather commented Oct 21, 2022

If I understand your analysis, do we need to be performing a transaction for the console output as it is produced? This would explain what I see with locked logs on long-running processes all the time: the logs keep advancing in NodeODM, but there are no updates in WebODM. I don't know why we would want to wrap those updates in a transaction in WebODM until the log is complete (either a processing failure or a success).

@OKaluza
Contributor

OKaluza commented Oct 21, 2022

Each console/status update is still a single transaction; they just sometimes take a little longer than they should when the log gets big and there are many happening at once. They are quite frequent and involve changing a very large text field. The autovacuum, on the other hand, is a single long transaction that seems to get stuck/frozen and needs killing in some situations.

I could be wrong, but it seems like it's actually just a pattern that doesn't work well with PostgreSQL's default autovacuum settings, namely updating a very large text field this frequently, especially when there are multiple jobs running with very long log outputs.
I'm going to try two things: tweaking some of the autovacuum settings to see if that works better, and hacking it to just write the console output to a file and skip the db, to see whether the issue keeps occurring and to narrow down whether this is actually the problem.
Here is where it gets the console output: https://github.com/OpenDroneMap/WebODM/blob/master/app/models/task.py#L760 and here is where it saves the task: https://github.com/OpenDroneMap/WebODM/blob/master/app/models/task.py#L826
I'm not sure whether there is any logic in Django models to prevent writing on .save() if nothing has changed.
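
For what it's worth, while testing those tweaks, whether the autovacuum worker is actually making progress on the TOAST table can be watched through pg_stat_progress_vacuum (available from PostgreSQL 9.6 onward); a monitoring sketch:

-- Poll this while a job is running to watch vacuum progress on the bloated table
SELECT pid, relid::regclass AS relation, phase,
       heap_blks_vacuumed, heap_blks_total
FROM pg_stat_progress_vacuum;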

@pierotofy
Member

I think this page explains most of the issue: https://www.postgresql.org/docs/9.0/routine-vacuuming.html

@OKaluza
Contributor

OKaluza commented Oct 21, 2022

There may be tweaks to autovacuum aggressive enough to prevent it from getting into this territory, which is the state I keep finding our db in:

Tip: Plain VACUUM may not be satisfactory when a table contains large numbers of dead row versions as a result of massive update or delete activity. If you have such a table and you need to reclaim the excess disk space it occupies, you will need to use VACUUM FULL, or alternatively CLUSTER or one of the table-rewriting variants of ALTER TABLE. These commands rewrite an entire new copy of the table and build new indexes for it. All these options require exclusive lock. Note that they also temporarily use extra disk space approximately equal to the size of the table, since the old copies of the table and indexes can't be released until the new ones are complete.

I've still got a known problem job I can run, so I'll try some autovacuum settings changes and let it run overnight.
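
A quick way to see whether a table is drifting back into that territory is the dead-tuple counters in pg_stat_all_tables; and once no jobs are running, a VACUUM FULL on app_task rewrites the table and its TOAST data to reclaim the space. A sketch, with the lock and temporary disk space caveats from the tip above:

-- How many dead rows has (auto)vacuum left behind, and when did it last run?
SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC
LIMIT 5;

-- Last resort, only with no jobs running:
-- VACUUM FULL app_task;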

@OKaluza
Contributor

OKaluza commented Oct 22, 2022

I've tried some very aggressive autovacuum settings as a test, taken from this post: https://dba.stackexchange.com/a/21649
From my init container:

sed -i "s/#autovacuum_max_workers = 3/autovacuum_max_workers = 6/" postgresql.conf
sed -i "s/#autovacuum_naptime = 1min/autovacuum_naptime = 15s/" postgresql.conf
sed -i "s/#autovacuum_vacuum_threshold = 50/autovacuum_vacuum_threshold = 25/" postgresql.conf
sed -i "s/#autovacuum_vacuum_scale_factor = 0.2/autovacuum_vacuum_scale_factor = 0.1/" postgresql.conf
sed -i "s/#autovacuum_analyze_threshold = 50/autovacuum_analyze_threshold = 10/" postgresql.conf
sed -i "s/#autovacuum_analyze_scale_factor = 0.1/autovacuum_analyze_scale_factor = 0.05/" postgresql.conf
sed -i "s/#autovacuum_analyze_threshold = 50/autovacuum_analyze_threshold = 10/" postgresql.conf
sed -i "s/#autovacuum_vacuum_cost_delay = 20ms/autovacuum_vacuum_cost_delay = 10ms/" postgresql.conf
sed -i "s/#autovacuum_vacuum_cost_limit = -1/autovacuum_vacuum_cost_limit = 1000/" postgresql.conf

This seems to be working to keep the database size trimmed for now with a single job running; I'll see how it goes after running longer and adding some more jobs. For our use case we'll often have many users and jobs running on the same WebODM instance, and this could easily rear its head again if we don't fix the underlying cause.
The log file for just this job is now over 60 MB, and it's being written to the db every few seconds. That loads up the server unnecessarily; if we have 10+ jobs running it's going to impact performance, and the autovacuum might not keep up again.
I've written a patch, which I'll apply to our fork, that only fetches the log for a running job when the /output API is called (i.e. when the console view is enabled, bypassing the db) and only updates the field in the database when the status changes. I'll do some more testing on this, but initial tests showed it works well. I'll make a pull request for the autovacuum settings (and the console output hack, if there's any interest).

@pierotofy
Member

pierotofy commented Oct 22, 2022

A PR for the autovac settings would be fantastic. Thanks for sharing your findings!

OKaluza added a commit to AuScalableDroneCloud/WebODM that referenced this issue Oct 24, 2022
Taken from here https://dba.stackexchange.com/a/21649 to fix OpenDroneMap#1128.
This could be overkill, but it has fixed the issue in my testing; it could impact performance for concurrent users of the database.