Originally from: [tweet](https://twitter.com/samokhvalov/status/1711237471743480060), [LinkedIn post](...).

---

# How to benchmark

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Benchmarking is a huge topic. Here, we cover the minimal steps required for a Postgres benchmark to be informative,
useful, and correct.

In this article, we assume that the following principles are followed:

1. **NOT A SHARED ENV:** The whole machine is under our sole use (nobody else is using it); we aim to study the
   behavior of Postgres as a whole, with all its components (vs. microbenchmarks such as studying a particular query
   using `EXPLAIN`, or focusing on underlying components such as disk and filesystem performance).

2. **HIGH QUALITY:** We aim to be honest (no "benchmarketing" goals), transparent (all details are shared), precise,
   and to fix all mistakes when/if they happen. When it makes sense, each benchmark run should be long enough to take
   into account factors like the colder state of caches or various fluctuations. There should also be multiple runs of
   each type of benchmark (usually, at least 4-5), to ensure that runs are reproducible. It makes sense to learn from
   existing experience: Brendan Gregg's excellent book ["Systems
   Performance"](https://brendangregg.com/blog/2020-07-15/systems-performance-2nd-edition.html) has a chapter about
   benchmarks; it may also be useful to learn from other fields (physics, etc.), to understand the principles and
   methodologies of [successful experiments](https://en.wikipedia.org/wiki/Experiment).

3. **READY TO LEARN:** We have enough expertise to understand the bottlenecks and limitations, or we are ready to use
   other people's help, re-making benchmarks if needed.

## Benchmark structure

A benchmark is a kind of database experiment in which, in the general case, we use multiple sessions to the DBMS and
study the behavior of the system as a whole or of its particular components (e.g., buffer pool, checkpointer,
replication).

Each benchmark run should have a well-defined structure. In general, it contains two big parts:

1. **INPUT:** everything we have or define before conducting the experiment – where we run the benchmark, how the
   system was configured, what DB and workload we use, and what change we aim to study (to compare the behavior before
   and after the change).
2. **OUTPUT:** various observability data such as logs, errors observed, statistics, etc.

![Database Benchmark](files/0013_db_benchmark.png)

Each part should be well-documented so anyone can reproduce the experiment, understand the main metrics (latency,
throughput, etc.), understand bottlenecks, and conduct additional, modified experiments if needed.

The description of all aspects of database benchmarking could take a whole book – here I provide only basic
recommendations that can help you avoid mistakes and improve the general quality of your benchmarks.

Of course, some of these things can be omitted, if needed. But in the general case, it is recommended to automate
documentation and artifact collection for all experiments, so it is easy to study the details later. You can find
some good examples of benchmarks performed for specific purposes (e.g., to study pathological subtransaction behavior
or to measure the benefits of enabling `wal_compression`)
[here](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues).

## INPUT: Environment

- Make sure you're using machines of a proper size; don't run benchmarks on a laptop (unless absolutely necessary).
  AWS Spot or GCP Preemptible instances, used for short periods of time, are extremely affordable and super helpful for
  experimenting. For example, spot instances for VMs with 128 Intel or AMD vCPUs and 256-1024 GiB of RAM have an hourly
  price as low as $5-10 ([good comparison tool](https://instances.vantage.sh)), billed by the second – this enables
  very cost-efficient experiments on large machines.
- When you get the VMs, quickly check them with microbenchmark tools such as fio and sysbench to ensure that CPU, RAM,
  disk, and network work as advertised (see the sketch after this list). If in doubt, compare with other VMs of the
  same type and choose.
- Document the VM tech spec, disk, and filesystem (these two are extremely important for databases!), OS choices, and
  non-default settings.
- Document the Postgres version used and additional extensions, if any (some of them can have an "observer effect"
  even if they are just installed).
- Document non-default Postgres settings used.
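
A minimal sketch of such a sanity check (the tool names are real, but the exact parameters and the test file path here
are just placeholders – adjust them to your CPU and disk):

```bash
# CPU: short single test (thread count matched to vCPU count)
sysbench cpu --threads=$(nproc) --time=30 run

# Memory bandwidth
sysbench memory --threads=$(nproc) --time=30 run

# Disk: random 8 KiB reads, bypassing the page cache (8 KiB matches the default Postgres block size)
fio --name=randread --filename=/var/lib/postgresql/fio.tmp --size=10G \
  --rw=randread --bs=8k --direct=1 --ioengine=libaio --iodepth=32 \
  --numjobs=4 --runtime=60 --time_based --group_reporting
```

Save the output of these runs as part of the experiment's documentation, so the numbers can be compared between
machines.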

## INPUT: Database

- Document what schema and data you use – ideally, in a fully reproducible form (SQL / dump); see the sketch below.
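
For example, a schema-only dump plus the exact command used to generate or load the data is usually enough to make the
database part reproducible (the database name here is a placeholder):

```bash
# Schema in reproducible SQL form
pg_dump --schema-only --no-owner --no-privileges -d benchdb > schema.sql

# For synthetic data, record the generation command instead of the data itself, e.g.:
pgbench -i -s 1000 benchdb   # pgbench scale factor 1000, roughly 15 GiB of data
```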

## INPUT: Workload

- Document all the aspects of the workload you've used – ideally, in a fully reproducible form (SQL, pgbench, sysbench,
  etc. details).
- Understand the type of your benchmark – the kind of load testing you're aiming for: is it edge-case load testing
  (stress testing), where you aim to go "full speed", or regular load testing, in which you try to simulate a real-life
  situation where, for example, CPU usage is normally far below 100%? Note that by default, pgbench gives you "stress
  testing" (it does not limit TPS – to limit it, use the `-R` option); see the sketch after this list.
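
A minimal sketch of both modes with pgbench (the database name and durations are placeholders – real runs should
usually be longer):

```bash
# Stress test: unlimited rate, 16 clients, 16 threads, 30 minutes, progress report every 10 s
pgbench -c 16 -j 16 -T 1800 -P 10 benchdb

# Regular load test: same clients, but capped at 500 transactions per second
pgbench -c 16 -j 16 -T 1800 -P 10 -R 500 benchdb
```

With `-R`, pgbench also reports schedule lag, which helps to see whether the system keeps up with the target rate.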

## INPUT: Delta

There may be various types of "deltas" (the subject of our study that defines the difference between runs). Here are
just some examples:

- different Postgres minor or major versions
- different OS versions
- different settings such as `work_mem`
- different hardware
- varying scale: different numbers of clients working with the database, or different table sizes
- different filesystems

It is not recommended to consider schema changes or changes in SQL queries as a "delta" because:

- such workload changes usually happen at a very high pace
- full-fledged benchmarking is very expensive
- it is possible to study schema and query changes in shared environments, focusing on IO metrics (BUFFERS!) and
  achieving a high level of time and cost efficiency (see [@Database_Lab](https://twitter.com/Database_Lab))

## OUTPUT: collect artifacts

It is worth collecting a lot of various artifacts and making sure they will not be lost (e.g., upload them to object
storage).

- Before each run, reset all statistics, using `pg_stat_reset()`, `pg_stat_reset_shared(..)`, the other standard
  `pg_stat_reset_***()` functions ([docs](https://postgresql.org/docs/current/monitoring-stats.html)),
  `pg_stat_statements_reset()`, `pg_stat_kcache_reset()`, and so on.
- After each run, dump all `pg_stat_***` views in CSV format (see the sketch after this list).
- Collect all Postgres logs and any other related logs (e.g., pgBouncer's, Patroni's, syslog).
- While Postgres, pgBouncer, or any other configs are "input", it makes sense to create a snapshot of all actual
  observed configuration values (e.g., `select * from pg_settings;`) and consider this data as artifacts as well.
- Collect the query analysis information: snapshots of `pg_stat_statements`, `pg_stat_kcache`, `pg_wait_sampling`
  / [pgsentinel](https://github.com/pgsentinel/pgsentinel), etc.
- Extract all information about errors from (a) the logs and (b) `pg_stat_database` (`xact_rollback`) and similar
  places via SQL, and consider this a separate, important type of artifact for analysis. Consider using a small
  extension called [logerrors](https://github.com/munakoiso/logerrors) that registers all error codes and exposes them
  via SQL.
- If monitoring is used, collect charts from there. For experiments in particular, it may be convenient to use
  [Netdata](https://netdata.cloud) since it's really easy to install on a fresh machine, and it has dashboard
  export/import functions (unfortunately, they are client-side, hence manual actions are always needed; but I
  personally find them very convenient when conducting DB experiments).
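
A minimal sketch of the reset and dump steps (the database name, the list of views, and the target directory are just
examples; the extension reset functions are available only if the corresponding extensions are installed):

```bash
# Before the run: reset statistics
psql -Xq benchdb <<'SQL'
select pg_stat_reset();
select pg_stat_reset_shared('bgwriter');
select pg_stat_statements_reset();  -- requires pg_stat_statements
SQL

# After the run: dump stats views (and the settings snapshot) to CSV
mkdir -p artifacts
for v in pg_stat_database pg_stat_bgwriter pg_stat_user_tables pg_stat_user_indexes pg_stat_statements pg_settings; do
  psql -Xq benchdb -c "\copy (select * from $v) to 'artifacts/$v.csv' csv header"
done
```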

## Analysis

Some tips (far from complete):

1. Always check errors (see the sketch after this list). It's not uncommon to run a benchmark, jump to some
   conclusion, and only later realize that the error count was too high, making the run useless.
2. Understand where the bottleneck is. Very often, we are saturated, say, on disk IO, and think we are observing the
   behavior of our database system, while we are actually observing the behavior of, say, cloud disk throttling or
   filesystem limitations instead. In such cases, we need to think about how to tune our input to avoid such
   bottlenecks and perform useful experiments.
3. In some cases, it is, vice versa, very desirable to reach some kind of saturation – for example, if we study the
   speed of `pg_dump` or `pg_restore`, we may want to observe our disk system saturated, and we tune the input (e.g.,
   how exactly we run `pg_dump` – how many parallel workers we use, whether compression is involved, whether the
   network is involved, etc.) so that the desired saturation is indeed reached, and we can demonstrate it.
4. Understand the main metrics you're going to compare between runs – latencies, throughput numbers, query analysis
   metrics (those from `pg_stat_statements`, wait event analysis), and so on.
5. Develop a good format for the summary and follow this format. It can include a short description of the various
   input parts, including the workload delta, and the main comparison metrics. Store the summaries for all runs in
   this well-structured form.
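
A minimal sketch of an error check after a run (the database name, log location, and what counts as "too many errors"
depend on your setup):

```bash
# Rollback counters from cumulative statistics
psql -Xq benchdb -c "select datname, xact_commit, xact_rollback from pg_stat_database where datname = 'benchdb'"

# Count distinct error messages in the Postgres log (GNU grep/sed assumed)
grep -E 'ERROR|FATAL|PANIC' /var/log/postgresql/postgresql.log \
  | sed 's/.*\(ERROR\|FATAL\|PANIC\)/\1/' | sort | uniq -c | sort -rn | head -20
```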
