Analyzing the Impact of Cardinality Estimates on Query Optimization

1. Experimental Setup

Experimental was conducted in the following environment:

Machine: 2x Intel(R) Xeon(R) CPU E5-2660 v4 (28 CPU cores and 256GB memory, and HDD storage ).
OS: Ubuntu 22.04 LTS, Python 3.10.6.
Docker v20.10.12 .
RStudio 2022.12.0 .

2. Dataset and Queries

2.1 Internet Movie Data Base (IMDB)

We used the Internet Movie Data Base (IMDB) dataset. The original data is publicly available in txt files (ftp://ftp.fu-berlin.de/pub/misc/movies/database/), and we utilized the open-source imdbpy package to convert the txt files to CSV files. The 3.6GB snapshot is from May 2013, and the dataset contains 21 CSV files, i.e., 21 relations in total.

2.2 Join Order Benchmark (JOB)

Our experiments used Join Order Benchmark (JOB) queries. JOB includes 113 queries in total, comprising 33 query families with equijoins. The queries within each family differ only in selection predicates. The join sizes range from 2 to 17, join predicates from 4 to 28, and tables from 2 to 17.

3. PostgreSQL

We employed a modified version of PostgreSQL 14.2(https://github.com/waltercai/pqo-opensource) that allows injecting estimates for subqueries during query runtime [2]. To optimize PostgreSQL performance, we made the following changes:

shared_buffers = 128GB
work_mem = 128GB
max_wal_size = 128GB
effective_cache_size = 128GB
geqo_threshold = 18
from_collapse_limit = 1 (default 8, to disable subquery reordering)
join_collapse_limit = 1	(default 8, to disable join reordering)
max_parallel_workers_per_gather = 0-5 (default 2)
max_parallel_workers = 0-5 (default 8)

The last two variables were only adjusted during thread experiments. For other experiments, we kept the default values. The modified version of the postgresql.conf file can be found in scripts folder.

4. Scripts

We provide python scripts to replicate -

$Simpli^2$ join ordering algorithm.
Scripts for generating query from join order.
QuickPick algorithm.
Cost calculation using Leis et al. in-memory cost function.
Runtime collection for different settings. Includes -
a) Indexed/ non-indexed.
b) Join operator selection.
c) Estimate selection.
d) Number of threads.

usages of this scripts is below

4.1. $Simpli^2$ join order and query generation

non-indexed

python3 Simpli2/Simpli2.py

indexed

python3 Simpli2/Simpli2_idx.py

Above two scripts will help to generate join orders for $Simpli^2$ in both non-indexed and indexed settings. Below script will take those join order as input and generate sql queries.

python3 sql_generator.py

4.2. Cost and Runtime collection

Below script will collect cost and runtime.

python3 pg_cost_calc.py

This script has configurable parameters which are described below -

_fromLimit = from collapse limit (1/8)
_joinLimit = join collapse limit (1/8)
_nlFlag = nested loop join (on/off)
_hjFlag = hash join (on/off)
_mjFlag = marge join (on/off)
_timeOut = query timeout (in seconds)
_estFlag = estimates availability (True/False)
_estFolder = location of estimates (folder location)
_db = database name
_inFolder = query location
_index = index availability (True/ False)
_threads = number of threads (0 - 4)

Functions : 
# cost
def runQueriesCost(_fromLimit, _joinLimit, _nlFlag, _hjFlag, _mjFlag, _timeOut, _estFlag, _estFolder, _db, _inFolder, _index, _threads)

# runtime
def runQueriesTime(_fromLimit, _joinLimit, _nlFlag, _hjFlag, _mjFlag, _timeOut, _estFlag, _estFolder, _db, _inFolder, _index, _threads)

# runtime and cost
def runQueriesCostTime(_fromLimit, _joinLimit, _nlFlag, _hjFlag, _mjFlag, _timeOut, _estFlag, _estFolder, _db, _inFolder, _index, _threads)

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
Estimates		Estimates
Queries		Queries
plans		plans
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Estimates

Estimates

Queries

Queries

plans

plans

scripts

scripts

README.md

README.md

Repository files navigation

Analyzing the Impact of Cardinality Estimates on Query Optimization

1. Experimental Setup

2. Dataset and Queries

2.1 Internet Movie Data Base (IMDB)

2.2 Join Order Benchmark (JOB)

3. PostgreSQL

4. Scripts

4.1. $Simpli^2$ join order and query generation

4.2. Cost and Runtime collection

About

Releases

Packages

Languages

Asoke26/Estimates-and-No-Estimates

Folders and files

Latest commit

History

Repository files navigation

Analyzing the Impact of Cardinality Estimates on Query Optimization

1. Experimental Setup

2. Dataset and Queries

2.1 Internet Movie Data Base (IMDB)

2.2 Join Order Benchmark (JOB)

3. PostgreSQL

4. Scripts

4.1. $Simpli^2$ join order and query generation

4.2. Cost and Runtime collection

About

Resources

Stars

Watchers

Forks

Languages