Disclaimer: This client was designed with the sole purpose of comparing PostgreSQL's performance with PotionDB when executing a subset of TPC-H's queries. As such, it does not implement the complete set of TPC-H queries, and it may be difficult to use this client for purposes other than benchmarking with TPC-H. If you need a general purpose client for Postgres, please consider other alternatives.
A Go client for PostgreSQL that implements a part of TPC-H's benchmark. It is assumed the reader is familiar with TPC-H's specification. Some of the code (namely table representations) is shared with PotionDB's TPC-H Client.
This client relies on the Bun library for communcation with PostgreSQL. The Pg_ivm library is used on both client and server side in order to support incrementally maintained materialized views.
- Features
- TPC-H Dataset generation
- PostgreSQL setup
- PostgreSQL TPC-H Go Client setup
- Executing the client
- Client configuration
- Full support of TPC-H's dataset: all tables have internal representations as Go's structs;
- Full support of dataset updates, as defined in the TPC-H's specification;
- Support with and without materialized views for the following TPC-H queries: 3, 5, 11, 14, 15 and 18.
- Support for loading TPC-H's dataset and views into PostgreSQL;
- Customizable client: batching, raw/view queries, % of updates, etc.
Note: If you have already used PotionDB's TPC-H Client, you can re-use the TPC-H dataset generated for that client, and skip to the next section.
In order to benchmark PostgreSQL with TPC-H, it is first necessary to generate a dataset. This can be summarized into the following steps:
- Download TPC-H tools and generate the dataset;
- (Optionally) Modify the dataset to give locality to the data (For the motivation and explanation on this, please check here).
Note: For VLDB reviewers, step 2 is obligatory in order to reproduce the results reported in the paper.
First, download the TPC-H tools here. This will be necessary to generate the TPC-H dataset.
Afterwards, you will need to build the tool. Refer to the README file included with the tool for instructions, as well as the TPC-H specification.
For the parameters of the makefile, use the following:
DATABASE=ORACLE
WORKLOAD=TPCH
To generate the dataset and the set of updates to be used, run, respectively:
dbgen -s 1
dbgen -s 1 -u 1000
Note: If testing PostgreSQL with modest hardware, you may want to test with a lower -s
(for example, 0.1).
While the tool works for values under 1, said values are not officially supported by TPC-H.
Finally, organize the dataset and updates using the following folder structure:
tpch_data\
tables\$sSF\
upds\$sSF\
headers\
Where $s
corresponds to the value used for -s
when generating the dataset.
For example, for -s 1
:
tpch_data\
tables\1SF\
upds\1SF\
headers\
Put all generated tables and updates, respectively, under tables\$sSF
and upds\$sSF
.
Fill the headers
folder with the contents of the tpch_headers folder in this repository.
Use the TPC-H Locality tool that we have made, available here.
After the new dataset and updates are generated by the TPC-H Locality tool, replace the contents of the folders tables
and upds
with the new data.
The new data will be inside a folder named _mod
that is inside both tables
and upds
.
This section explains how to setup PostgreSQL with pg_ivm extension for running TPC-H benchmarks with this client. For other purposes, you can either refer to the official documentation for an installer, build from source or use the official PostgreSQL docker image.
This tutorial assumes a linux-based operating system. It also assumes that you have sudo access to the system. We cover two options: setting up PostgreSQL in a RAMDisk, and a standard instalation of PostgreSQL. Both options will include the installation of pg_ivm extension and setting up key settings for PostgreSQL.
VLDB reviewers must use the RAMDisk installation in order to be able to replicate the results published in our paper.
First, prepare a folder to hold all the required files.
We'll refer to such folder as the baseFolder
.
Then, download PostgreSQL v16.2's source code here and pg_ivm 1.8 here. Extract them into folders named, respectively, Postgresql/postgresql-16.2
and pg_ivm-1.8
. Afterwards, download the scripts in the scriptsPostgreSQL folder of this repository.
Run the following command to set up a RAMDisk and install Postgres on it:
scriptsPostgreSQL/postgres_install_ramdisk.sh $base_folder $ramdisk_folder $ramdisk_size $username
Where:
$base_folder
is the folder above thePostgresql
folder;$ramdisk_folder
is the desired mount point for the RAMDisk;$ramdisk_size
is the desired size for the RAMDisk, in the format of number+suffix. For example, 64G stands for 64GB;$username
stands for the name of the current user of the system. PostgreSQL will use this username to ensure it can write to the RAMDisk.
Example:
scriptsPostgreSQL/postgres_install_ramdisk.sh . /media/ramdisk 64G myuser
After running the command above, PostgreSQL will be running and listening to port 5432.
Note for VLDB reviewers: use 64G
as $ramdisk_size
in order to more accurately replicate our test environment.
Proceed as in Option 1, up until before running the script.
Then, run the following script to install PostgreSQL and pg_ivm:
scriptsPostgreSQL/postgres_install.sh $base_folder $destination_folder $username
Where:
$base_folder
is the folder above thePostgresql
folder;$destination_folder
is the desired instalation location for PostgreSQL (e.g., /usr/local). A new folder will be creased insidedestination_folder
;$username
stands for the name of the current user of the system. PostgreSQL will use this username to ensure it can write in$destination_folder
.
Example:
scriptsPostgreSQL/postgres_install.sh . /usr/local myuser
After running the command above, PostgreSQL will be running and listening to port 5432.
To setup the initial data for the TPC-H benchmark, run the following command:
scriptsPostgreSQL/setupDB.sh $psql_location $script_folder $table_folder
Where:
$psql_location
is the location of the psql executable inside your Postgres instalation. By default, this executable is inramdisk_folder
/pgsql/bin;$script_folder
is the location for thescriptsPostgreSQL
folder. The folder should be included in the path;$table_folder
is the location of the TPC-H tables. By default, this should point totpch_data
/tables/1SF. Adjust the SF to the actual value used for SF during the TPC-H dataset generation.
The command above can also be used to reset PostgreSQL back to its initial state for TPC-H, as it drops any existing TPC-H tables, views and indexes before loading the base data.
Note for VLDB reviewers: use 1SF for SF and do not change any setting in postgres_install_ramdisk.sh
, in order to more accurately replicate our test environment.
PostgreSQL can be stopped with the following command:
/media/ramdisk/pgsql/bin/pg_ctl stop -D /media/ramdisk/pgsql/data
To erase the RAMDisk, run the command above followed by:
sudo umount /media/ramdisk
If you want to change some parameters of PostgreSQL, you can either use the psql
tool included with PostgreSQL (at /media/ramdisk/pgsql/bin/psql
) or edit the postgres_install_ramdisk.sh
script.
Start by installing Docker: https://docs.docker.com/engine/install/
To obtain the pre-compiled Docker image of PostgreSQL TPC-H Go Client, do:
docker pull andrerj/postgresclient
First, prepare a folder to hold all the required files.
We'll refer to such folder as the baseFolder
.
Afterwards, go inside the baseFolder
and clone the following repositories:
git clone https://github.com/AndreRijo/potionDB.git potionDB
git clone https://github.com/AndreRijo/tpch-data-processor.git tpch_data_processor
git clone https://github.com/AndreRijo/postgres-tpch-go-lib.git postgresTpchGoLib
git clone https://github.com/AndreRijo/postgres-tpch-client postgresTPCHClient
Make sure you have Docker installed in your system. If not, please check here for instructions on installing Docker.
Afterwards, from the baseFolder
, build the Docker image:
docker build -f postgresTPCHClient/Dockerfile . -t postgresclient
In the Executing the client with Docker section, for all commands replace andrerj/postgresclient
with postgresclient
.
Proceed as in Option 2 up to (exclusive) the instalation of Docker.
This client should work with any version of Go with support for modules and generics. However, we recommend using Go 1.20.4. or newer.
To check if you have Go installed and what is its version, run the following command:
go version
For installing or upgrading your Go instalation, please check here.
Afterwards, inside baseFolder/postgresTPCHClient/src
, run:
go run ./main/main.go --data_folder=$path_to_data --config=$path_to_config --test_name=$path_to-results
Where:
$path_to_data
is the path to the folder tpch_data$path_to_config
is the path to the folder with the configuration file intended to be used;path_to_results
is the path to the folder where you want to store the results of the benchmarking.
In the three parameters above, both relative and full paths are acceptable.
For details on how to configure the client, please check Client configuration section.
(For executing the client without Docker, please check the previous subsection)
Start the client with the following command:
docker run -it --name postgrestpch1 -v "$your_path_to_data/tpch_data/:/go/data" -v "$your_path_to_configs/:/go/configs/extern/" -v "your_path_to_results/:/go/results" -e CONFIG="/go/configs/extern/$YOUR_CONFIG_FOLDER/" andrerj/postgresclient
Where:
$your_path_to_data
is the full path to the folder tpch_data created earlier;your_path_to_configs
is the path to the folder with your own/modified configuration files. If using one of the default ones, this-v
can be removed and you should setCONFIG=/go/configs/path_to_the_choosen_config/
;your_config_folder
the relative (using as base $your_path_to_configs) path to the folder with the intended configuration file. For example, if your config.cfg is at folder/home/yourname/myconfigs/mytpch
,$your_path_to_data
would behome/yourname/myconfigs
andyour_config_folder
would bemytpch
;your_path_to_results
the full path to the folder where you want to store the statistics of executing the TPC-H benchmark.
Note: Depending on your operating system and Docker installation, you may need to set up your Docker to allow sharing of folders.
An example of usage with a custom configuration file:
docker run -it --name postgrestpch1 -v "home/myname/tpch_data:/go/data/" -v "$home/myname/myPostgresTpchClientConfigs/:/go/configs/extern/" -v "home/myname/postgresTpch_results/:/go/results/" -e CONFIG="/go/configs/extern/myconfig/" andrerj/postgresclient
Or with a pre-made configuration file:
docker run -it --name postgrestpch1 -v "home/myname/tpch_data:/go/data/" -v "home/myname/postgresTpch_results/:/go/results/" -e CONFIG="/go/configs/default/1SF/ALL/" andrerj/tpchclient
You can use the command above if you're testing PostgreSQL in your local machine. If PostgreSQL is running on a different machine from the client, you can adapt configs/cluster/1SF/ALL/config.cfg.
The client will start by setting up the views in PostgreSQL. After the views are ready, a pre-defined number of clients will start querying the server. Please check Client configuration on how to configure the client's options.
A full list of all options for configuration files is available on pgConfigs.go, method loadConfigsFile(). It is also possible to set most of those settings by using Docker's environment variables: check pgConfigs.go, method loadFlags() and start.sh
Always use one of the existing configuration files as a base: the configurations in configs/cluster and configs/default are good starting points.
You will likely want to change the settings named ip
, queryClients
, queryWait
, queryDuration
, nReadsTxn
and updRate
.
Some of the most relevant, available settings:
ip
: ip:port of PostgreSQL server;user
: user used to access PostgreSQL. Can be either the user specified in PostgreSQL installation or another user added to PostgreSQL;scale
: corresponds to the value used for-s
when the dataset was generated withdbgen
;doDataLoad (true/false)
: defines if the client should do the initial loading of the database or not. It is preferable for PostgreSQL to load the data itself as loading through the client is very slow;doViewload
anddoIndexload (true/false on both)
: defines if the client should, respectively, load the views or indexes to PostgreSQL. Note that queries are very slow without the views. When doViewload is true, the value of doIndexload is irrelevant (as then the indexes no longer help);doQueries
anddoUpdates (true/false on both)
: defines if the client should do queries and/or updates. It is recommended to leave these both astrue
and control the update-query rate withupdRate
;updRate (0-1)
: defines the percentage of operations that should be updates. Setting to 0 or 1 corresponds to only doing, respectively, queries or updates;startUpdFile
andfinishUpdFile
: if running multiple instances of the client, you will need to adjust this to ensure each instance has a non-overlaping subset of update files;queryClients
: number of connections to use to execute queries and/or updates. You will want to vary this in order to grasp how PostgreSQL scales;queryWait (ms)
: time for the clients to wait before starting to execute the TPC-H benchmark. The client will subtract from the waiting time any time it took preparing itself and/or doing the initial loading of the database. This is useful for coordinating multiple instances of the client. This time will, unavoidably, depend on your system andscale
value: try first running a single client instance only to measure the time it takes for the client and PostgreSQL to be ready. Remember that clients read TPC-H data in order to do updates;queryDuration (ms)
: the time for which TPC-H's benchmark is run for, afterqueryWait
ends;id
: the name of the client. When running multiple client instances, it is important to set a differentid
for each client, in order to avoid conflicts when statistics files are produced in a shared folder environment;statsLocation
: the location to save the statistics files produced by the client.statisticsInterval (ms)
: time between which statistics are recorded. For most use cases it is fine to leave this unchanged (5000ms).tpchAddRate (0-1)
: when executing updates, defines the percentage of updates that are new orders, compared to deletion of orders. TPC-H's specification suggests to set this to 0.5, however in a real-world scenario for an e-commerce application this is heavily skewed towards 1 (mostly new with very rare deletions).nReadsTxn
: number of operations (either read or updates) to execute per transaction. A higher value may lead to a higher throughput, at the sacrifice of possibly higher latency per transaction.