Originally from: [tweet](https://twitter.com/samokhvalov/status/1711237471743480060), [LinkedIn post](...).

---

# How to benchmark

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Benchmarking is a huge topic. Here, we cover the minimal steps required for a Postgres benchmark to be informative,
useful, and correct.

In this article, we assume that the following principles are followed:

1. **NOT A SHARED ENV:** The whole machine is solely under our use (nobody else is using it), and we aim to study the
   behavior of Postgres as a whole, with all its components (as opposed to microbenchmarks such as studying a particular
   query via `EXPLAIN` or focusing on underlying components such as disk and filesystem performance).

2. **HIGH QUALITY:** We aim to be honest (no "benchmarketing" goals), transparent (all details are shared), precise, and
   to fix all mistakes when/if they happen. When it makes sense, each benchmark run should be long enough to take into
   account factors like the colder state of caches or various fluctuations. There should also be multiple runs of each
   type of benchmark (usually at least 4-5) to ensure that runs are reproducible. It makes sense to learn from existing
   experience: Brendan Gregg's excellent book ["Systems
   Performance"](https://brendangregg.com/blog/2020-07-15/systems-performance-2nd-edition.html) has a chapter about
   benchmarks; it may also be useful to learn from other fields (physics, etc.) to understand the principles and
   methodologies of [successful experiments](https://en.wikipedia.org/wiki/Experiment).

3. **READY TO LEARN:** We have enough expertise to understand the bottlenecks and limitations, or we are ready to use
   other people's help, redoing benchmarks if needed.

## Benchmark structure

A benchmark is a kind of database experiment in which, in the general case, we use multiple sessions to the DBMS and
study the behavior of the system as a whole or of its particular components (e.g., buffer pool, checkpointer,
replication).

Each benchmark run should have a well-defined structure. In general, it contains two big parts:

1. **INPUT:** everything we have or define before conducting the experiment – where we run the benchmark, how the system
   was configured, what DB and workload we use, what change we aim to study (to compare the behavior before and after
   the change).
2. **OUTPUT:** various observability data such as logs, errors observed, statistics, etc.



Each part should be well-documented so anyone can reproduce the experiment, understand the main metrics (latency,
throughput, etc.), understand the bottlenecks, and conduct additional, modified experiments if needed.

The description of all aspects of database benchmarking could take a whole book – here I provide only basic
recommendations that can help you avoid mistakes and improve the general quality of your benchmarks.

Of course, some of these things can be omitted if needed. But in the general case, it is recommended to automate
documentation and artifact collection for all experiments, so it is easy to study the details later. You can find
some good examples of benchmarks performed for specific purposes
[here](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues)
(e.g., to study pathological subtransaction behavior or to measure the benefits of enabling `wal_compression`).

## INPUT: Environment

- Make sure you're using machines of a proper size, and don't run on a laptop (unless absolutely necessary). AWS Spot or
  GCP Preemptible instances, used during short periods of time, are extremely affordable and super helpful for
  experimenting. For example, spot instances for VMs with 128 Intel or AMD vCPUs and 256-1024 GiB of RAM have hourly
  prices as low as $5-10 ([good comparison tool](https://instances.vantage.sh)), billed by the second – this enables very
  cost-efficient experiments on large machines.
- When you get VMs, quickly check them with microbenchmark tools such as fio and sysbench to ensure that CPU, RAM, disk,
  and network work as advertised (see the sketch after this list). If in doubt, compare to other VMs of the same type
  and choose accordingly.
- Document the VM tech spec, disk, and filesystem (these two are extremely important for databases!), OS choices, and
  non-default settings.
- Document the Postgres version used and additional extensions, if any (some of them can have an "observer effect" even
  if they are just installed).
- Document non-default Postgres settings used.
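
A minimal sketch of such a quick check (assuming a Linux VM with `fio` and `sysbench` installed; paths, sizes, and
durations are just examples):

```bash
# CPU: events/sec should be similar across same-type VMs
sysbench cpu --threads=16 --time=30 run

# Disk: random 8 KiB reads (Postgres block size) against the future DB disk.
# Run it on a scratch directory, not on real data.
fio --name=randread --directory=/mnt/pgdata-scratch --size=10G \
  --rw=randread --bs=8k --direct=1 --ioengine=libaio --iodepth=64 \
  --runtime=60 --time_based --group_reporting
```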

## INPUT: Database

- Document what schema and data you use. Ideally, in a fully reproducible form (SQL / dump); see the sketch below.
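
For instance, a reproducible synthetic dataset can be created with pgbench (a sketch; the database name and scale factor
are arbitrary examples):

```bash
createdb benchdb
pgbench -i -s 1000 benchdb  # standard pgbench schema, roughly 15 GiB at this scale

# alternatively, capture an existing schema so others can recreate it
pg_dump --schema-only --no-owner benchdb > schema.sql
```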

## INPUT: Workload

- Document all the aspects of the workload you've used – ideally, in a fully reproducible form (SQL, pgbench, sysbench,
  etc. details).
- Understand the type of your benchmark and the kind of load testing you're aiming for: is it edge-case load testing
  (stress testing), where you aim to go "full speed", or regular load testing, where you try to simulate a real-life
  situation in which, for example, CPU usage is normally far below 100%? Note that by default, pgbench tends to give you
  "stress testing" (not limiting the TPS – to limit it, use the option `-R`); see the examples after this list.
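
To illustrate the difference (a sketch; the client count, duration, and rate are arbitrary):

```bash
# stress test: 50 clients going "full speed" for 30 minutes, progress reported every 10 s
pgbench -c 50 -j 50 -T 1800 -P 10 benchdb

# regular load test: the same clients, but throttled to ~1000 TPS overall
pgbench -c 50 -j 50 -T 1800 -P 10 -R 1000 benchdb
```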

## INPUT: Delta

There may be various types of "deltas" (the subject of our study that defines the difference between runs). Here are
just some examples:

- different Postgres minor or major versions
- different OS versions
- different settings such as `work_mem` (see the sketch after this list)
- different hardware
- varying scale: different numbers of clients working with the database or different table sizes
- different filesystems
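
For a settings delta such as `work_mem`, each series of runs might be preceded by something like this (a sketch; the
values are arbitrary):

```bash
# series A
psql benchdb -c "alter system set work_mem = '4MB'" -c "select pg_reload_conf()"
# ... runs ...

# series B
psql benchdb -c "alter system set work_mem = '64MB'" -c "select pg_reload_conf()"
# ... runs ...
```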

It is not recommended to consider schema changes or changes in SQL queries as a "delta" because:

- such workload changes usually happen at a very high pace
- full-fledged benchmarking is very expensive
- it is possible to study schema and query changes in shared environments, focusing on IO metrics (BUFFERS!) and
  achieving a high level of time and cost efficiency (see [@Database_Lab](https://twitter.com/Database_Lab)); see also
  the example below
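
For example, the effect of adding an index can be assessed by comparing buffer numbers rather than timing (a sketch; the
table and column are hypothetical):

```bash
# before the change: note the shared hit/read counts in the plan
psql -c "explain (analyze, buffers) select * from t1 where col1 = 123"

psql -c "create index concurrently on t1 (col1)"

# after the change: compare the buffer numbers, not only the timing
psql -c "explain (analyze, buffers) select * from t1 where col1 = 123"
```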

## OUTPUT: collect artifacts

It is worth collecting a lot of various artifacts and making sure they will not be lost (e.g., upload them to object
storage).

- Before each run, reset all statistics using `pg_stat_reset()`, `pg_stat_reset_shared(..)`, other standard
  `pg_stat_reset_***()` functions ([docs](https://postgresql.org/docs/current/monitoring-stats.html)),
  `pg_stat_statements_reset()`, `pg_stat_kcache_reset()`, and so on; see the sketch after this list.
- After each run, dump all `pg_stat_***` views in CSV format.
- Collect all Postgres logs and any other related logs (e.g., pgBouncer's, Patroni's, syslog).
- While Postgres, pgBouncer, or any other configs are "input", it makes sense to create a snapshot of all actually
  observed configuration values (e.g., `select * from pg_settings;`) and consider this data as artifacts as well.
- Collect the query analysis information: snapshots of `pg_stat_statements`, `pg_stat_kcache`, `pg_wait_sampling`
  / [pgsentinel](https://github.com/pgsentinel/pgsentinel), etc.
- Extract all information about errors from (a) logs and (b) `pg_stat_database` (`xact_rollback`) and similar places via
  SQL, and consider this a separate, important type of artifact for analysis. Consider using a small extension called
  [logerrors](https://github.com/munakoiso/logerrors) that registers all error codes and exposes them via SQL.
- If monitoring is used, collect charts from there. For experiments specifically, it may be convenient to use
  [Netdata](https://netdata.cloud) since it's really easy to install on a fresh machine, and it has dashboard
  export/import functions (unfortunately, they are client-side, hence manual actions are always needed; but I
  personally find them very convenient when conducting DB experiments).
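
A minimal sketch of the reset and dump steps around a single run (the database name, file paths, and the set of views
are just examples; `pg_stat_statements` is assumed to be installed):

```bash
mkdir -p artifacts

# before the run: reset statistics
psql benchdb -c "select pg_stat_reset()" \
             -c "select pg_stat_reset_shared('bgwriter')" \
             -c "select pg_stat_statements_reset()"

# ... the benchmark run itself ...

# after the run: dump stats views and the actually observed settings as CSV artifacts
for view in pg_stat_database pg_stat_bgwriter pg_stat_statements pg_settings; do
  psql benchdb -c "copy (select * from $view) to stdout with csv header" \
    > "artifacts/${view}.csv"
done
```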

## Analysis

Some tips (far from complete):

1. Always check errors. It's not uncommon to perform a benchmark run, jump to some conclusion, and only later realize
   that the error count was too high, making the run not useful.
2. Understand where the bottleneck is. Very often, we are saturated, say, on disk IO, and think we are observing the
   behavior of our database system, while we are actually observing the behavior of, say, cloud disk throttling or
   filesystem limitations instead. In such cases, we need to think about how to tune our input to avoid such bottlenecks
   and perform useful experiments.
3. In some cases, vice versa, it is very desirable to reach some kind of saturation – for example, if we study the speed
   of `pg_dump` or `pg_restore`, we may want to observe our disk system saturated, and we tune the input (e.g., how
   exactly we run `pg_dump` – how many parallel workers we use, whether compression is involved, whether the network is
   involved, etc.) so that the desired saturation is indeed reached, and we can demonstrate it.
4. Understand the main metrics you're going to compare between runs – latencies, throughput numbers, query analysis
   metrics (those from `pg_stat_statements`, wait event analysis), and so on; see the sketch after this list.
5. Develop a good format for the summary and follow this format. It can include a short description of the various
   input parts, including the workload delta, and the main comparison metrics. Store the summaries for all runs in this
   well-structured form.
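
As an example of the query-level metrics mentioned in tip 4, a snapshot like this can be compared between runs (a
sketch; the column names assume Postgres 13+ with `pg_stat_statements` installed):

```bash
psql benchdb -c "
  select
    queryid,
    calls,
    round(total_exec_time::numeric, 2) as total_exec_ms,
    round(mean_exec_time::numeric, 2) as mean_exec_ms,
    shared_blks_hit + shared_blks_read as shared_blks_accessed
  from pg_stat_statements
  order by total_exec_time desc
  limit 20"
```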