IMDB / JOB Workload

This repository contains a Vagrant machine that automatically downloads and imports the IMDB dataset from the paper "How Good Are Query Optimizers, Really?". Note that, upon provisioning, the VM will download more than 1.2GB of data.

Provisioning creates a VM running Arch Linux, upgrades it, installs the latest version of Postgres, configures the VM to use 16GB of RAM (12GB of it for the Postgres shared_buffers) and 4 CPU cores, creates a 100GB disk image to hold the data, and, finally, downloads and loads the dataset archive. It could break at any moment.

Note: if you would just like to download a Postgres pg_dump of the IMDB dataset, you can get it here:

To use, first install the persistent storage Vagrant plugin:

vagrant plugin install vagrant-persistent-storage

Next, modify vagrant/Vagrantfile to set the path where the VDI containing the database should be stored:

config.persistent_storage.enabled = true
config.persistent_storage.location = "/PATH/TO/STORAGE/LOCATION.vdi"
config.persistent_storage.size = 100000
config.persistent_storage.mountname = 'pg'
config.persistent_storage.filesystem = 'ext4'
config.persistent_storage.mountpoint = '/media/data'
config.persistent_storage.volgroupname = 'myvolgroup'
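
Before provisioning, it may be worth confirming that the target location has room for the disk image. The following is a minimal sketch, not part of this repository: `has_free_mb` is a hypothetical helper name, and the sketch assumes the `size` value above is in megabytes, which is consistent with the 100GB figure mentioned earlier.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEEDED_MB megabytes available.
has_free_mb() {
    dir=$1
    needed_mb=$2
    # df -P prints one line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space in kilobytes.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$((avail_kb / 1024))" -ge "$needed_mb" ]
}

# Example (placeholder path, matching the Vagrantfile setting above):
# has_free_mb /PATH/TO/STORAGE 100000 || echo "not enough free space" >&2
```

Pass the directory that will contain the VDI rather than the VDI path itself, since the file does not exist until the VM is provisioned.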

Then, start up the VM:

cd vagrant
vagrant up
cd ..

You can ignore the last few warnings (about /home/vagrant). Note that this VM runs an open Postgres server with a single user, imdb, and no password. Do not leave it running on a network you don't trust (or without your own firewall).

To connect to the database from your host machine:

psql -U imdb -h localhost
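
The same connection settings can also be supplied through libpq's standard environment variables, so that later psql invocations need no flags. This is a minimal sketch; the port 5432 is an assumption (the Postgres default), so adjust it if your Vagrantfile forwards a different port.

```shell
# libpq reads these variables when the matching command-line flags are absent.
export PGHOST=localhost
export PGUSER=imdb
# Assumption: the VM forwards the default Postgres port; change if needed.
export PGPORT=5432

# With the variables set, a bare invocation connects as above:
# psql
```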

To run one of the JOB queries:

psql -U imdb -h localhost < job/1a.sql
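
To run every query in a directory rather than a single file, a small loop around the same command is enough. This is a rough sketch under stated assumptions: `run_job_queries` and the `PSQL` override variable are hypothetical names introduced here, and the timing is coarse wall-clock seconds that include connection overhead, so it is not a substitute for proper benchmarking.

```shell
#!/bin/sh
# Hypothetical override: set PSQL to another command for a dry run.
PSQL=${PSQL:-psql -U imdb -h localhost}

# Hypothetical helper: feed every .sql file in DIR to psql and print the
# query name with the whole seconds it took.
run_job_queries() {
    dir=$1
    for query in "$dir"/*.sql; do
        [ -e "$query" ] || continue   # skip if the glob matched nothing
        start=$(date +%s)             # epoch seconds (GNU and BSD date)
        # $PSQL is intentionally unquoted so its flags split into arguments.
        # The message fires only when psql itself exits non-zero,
        # for example when it cannot connect.
        $PSQL < "$query" > /dev/null || echo "failed: $query" >&2
        end=$(date +%s)
        echo "$(basename "$query" .sql) $((end - start))s"
    done
}

# Example: run_job_queries job
```

Query results are discarded here; redirect to a file instead of /dev/null if you want to keep them.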


If you use the JOB dataset, please cite the original authors (no affiliation):

If you use this VM or our prepared dataset, please cite our paper as well:

If you use the CEB datasets, please cite the Flow Loss paper:

Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2021. Flow-loss: learning cardinality estimates that matter. Proc. VLDB Endow. 14, 11 (July 2021), 2019–2032.

If you use the JOB extended queries, please cite the Neo paper:

Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: a learned query optimizer. Proc. VLDB Endow. 12, 11 (July 2019), 1705–1718.

