SIGMOD Artifacts for "Uncertainty Annotated Databases - A Lightweight and General Approach for Approximating Certain Answers"
The GProM system is written in C. For the experiments we used PostgreSQL as a backend for storage. TODO
GProM is available at https://github.com/IITDBGroup/gprom. For the experiments, please use the TODO branch. GProM acts as a client for a relational database. UA-DB creation and querying are available through an extension of SQL.
- Repository: https://github.com/IITDBGroup/gprom
- Programming Language: C, Python
- Additional Programming Language info: the testing scripts require Python 3.6.9
- Compiler Info: gcc-11.0.0
- Required libraries/packages:
- gnuplot (5.2)
- pg8000
- GProM dependencies
- python (3.6.9)
We used several open real world datasets in the experiments; see below for links. We cleaned these datasets by replacing null values with constants using missing value imputation as implemented in Mimir. In the output, every row with at least one replaced value was marked as uncertain. The UA-DBs generated in this way were used in the experiments. In addition, we use two synthetic datasets. One is PDBench (TODO link), a probabilistic version of the TPC-H datasets, and ...
- Real world datasets:
- Data generators:
  - PDBench is a probabilistic version of the TPC-H data generator. We used a fork (fixing some compilation bugs) hosted here: https://github.com/IITDBGroup/pdbench (an optional clone command is shown after this list)
  - For the real world dataset with access control annotations, we generate the data in the Python testing script.
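If you want to build or inspect the generator manually, the fork can be cloned as shown below. This step is optional: the Python testing script described later in these instructions generates the PDBench data automatically.

git clone https://github.com/IITDBGroup/pdbench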
All runtime experiments were executed on a server with the following specs:
Element | Description |
---|---|
CPU | 2 x AMD Opteron(tm) Processor 4238, 3.3GHz |
Caches (per CPU) | L1 (288KiB), L2 (6 MiB), L3 (6MiB) |
Memory | 128GB (DDR3 1333MHz) |
RAID Controller | LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05), 512MB cache |
RAID Config | 4 x 1TB, configured as RAID 5 |
Disks | 4 x 1TB 7.2K RPM Near-Line SAS 6Gbps (DELL CONSTELLATION ES.3) |
A Docker image with all requirements installed is available on Docker Hub. To install Docker, please refer to the Docker website (an account is required to download/use Docker):
https://www.docker.com/get-started
After docker is installed, you can pull the docker image with:
docker pull iitdbgroup/uadbreproduce
The image entry point is set to /bin/bash, so to start a container with a command-line shell:
docker run -ti iitdbgroup/uadbreproduce
(Some systems may need sudo for docker operations)
The main test script is in the default entry folder. All test results are saved to the /results folder. To access the results on the local machine, mount a local folder onto the container's results folder when starting the container:
docker run -ti -v /your/local/directory:/reproducibility/results iitdbgroup/uadbreproduce
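For example, a complete reproduction run that keeps the results on the host could look like the following, assuming /your/local/directory exists on the host (gen.py is the main test script described later in these instructions):

docker run -ti -v /your/local/directory:/reproducibility/results iitdbgroup/uadbreproduce
python3 gen.py

The second command is entered in the container shell opened by the first command.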
Please follow these instructions to install the system and datasets for reproducibility.
GProM requires TODO. For example, on Ubuntu you can install the prerequisites with a command along the following lines (the package list shown here is an assumption and may be incomplete; see the GProM repository for the exact requirements):
sudo apt-get install autoconf automake libtool flex bison libreadline-dev libsqlite3-dev libpq-dev
Please clone the git repository ...
git clone git@github.com:IITDBGroup/gprom ...
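After cloning, a typical autotools-style build sequence is sketched below; this is only a sketch, and the exact steps and configure options documented in the GProM repository may differ:

./autogen.sh
./configure
make
sudo make install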
As mentioned before, the testing scripts are written in Python. We recommend creating a Python 3 virtual environment. There are several ways to do that; here we illustrate one. First enter the directory into which you cloned the gprom git repository.
- TODO
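A minimal sketch of one way to set up the environment, assuming a standard Python 3 venv; the environment name is arbitrary, and pg8000 is the PostgreSQL driver listed under the required packages above:

python3 -m venv uadb-env
source uadb-env/bin/activate
pip install pg8000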
Our Python script will generate PDBench data when needed by the experiments. The data will be saved into a folder named s[sval]x[xval], where sval is the scale factor and xval is the uncertainty factor.
By default the Python testing script will connect to the database 'postgres' on port 5432. The connection can be configured in the /config/config.ini file.
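For illustration only, a connection section in /config/config.ini might look roughly like the following; the section and key names here are hypothetical placeholders, and the actual names are defined by the testing script:

[postgres]          ; hypothetical section name
host = 127.0.0.1    ; database host
port = 5432         ; default port used by the script
user = postgres     ; database user
password = postgres ; database password
database = postgres ; default database used by the script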
We provide a unified Python script (gen.py) to run all experiments. To run all default experiments:
python3 gen.py
The script records which tests have been completed, so rerunning it will not repeat finished tests. To restart the tests from the beginning, use the -r option:
python3 gen.py -r
The script also allows executing a single experiment/step by specifying it with -s:
-s command values:
0 - prepare sqlite database data.
1 - pdbench data generation for all test cases.
2 - perform pdbench test varying the amount of uncertainty.
3 - perform pdbench test varying the data size.
4 - perform pdbench result size test.
5 - perform incompleteness projection test.
6 - perform utility test.
7 - perform realQuery test.
For example, to run only the pdbench result size test, starting from scratch and regenerating the data with pdbench:
python3 gen.py -r -s 4
All experiment results will be saved to the /result folder.
To use different parameter settings, modify the 's' list and the 'x' list in gen.py, where 's' is the list of scale factors to be tested and 'x' is the list of uncertainty factors to be tested.
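For illustration, the two lists in gen.py have roughly the following shape; the concrete values below are placeholders, not the defaults used for the experiments in the paper:

# list of scale factors to test (placeholder values)
s = [1, 5, 10]
# list of uncertainty factors to test (placeholder values)
x = [2, 5, 10]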
You can use the gprom system included in the container to run queries with UA-DB semantics over the provided datasets.
To run a query with uncertain semantics, the whole query should be wrapped in:
TUPLE UNCERTAIN (
...
);
Unless instructed otherwise, GProM expects inputs to be UA-DBs. However, GProM can also interpret other types of uncertain data models and translate them into UA-DBs as part of queries. To inform GProM that an input table should be interpreted as a certain type of uncertain relation, you specify the type after the table access in the FROM clause. Currently, we support tuple-independent probabilistic databases (TIPs) and x-relations.
A TIP stores for each row its marginal probability. The existence of tuples in the database is assumed to be mutually independent. That is, the set of possible worlds represented by a TIP database consists of all subsets of the TIP database. To use a TIP table in GProM, the table should have an additional attribute storing the tuple probabilities, and the table access in the FROM clause should be followed by IS TIP (prob), where prob is the name of the attribute storing tuple probabilities. For instance, consider this table R (attribute p stores probabilities):
| name | age | p |
|-------|-----|----|
| Peter | 34 | 0.9 |
| Alice | 19 | 0.6 |
| Bob | 23 | 1.0 |
An example query over this table:
TUPLE UNCERTAIN (
SELECT * FROM R IS TIP(p)
);
An X-table consists of x-tuples, which are sets of tuples called alternatives, each associated with a probability. X-tables are a specific type of block-independent probabilistic database. The alternatives of an x-tuple are disjoint events, while alternatives from different x-tuples are independent events. GProM expects an X-table to have two additional attributes: one that stores probabilities for alternatives and one that stores a unique identifier for each x-tuple. For instance, consider the following table where we are uncertain about Peter's age, and Alice may or may not be in the table.
| name | age | x-id | p |
|-------|-----|------|----|
| Peter | 34 | 1 | 0.4 |
| Peter | 35 | 1 | 0.3 |
| Peter | 36 | 1 | 0.3 |
| Alice | 19 | 2 | 0.6 |
| Bob | 23 | 3 | 1.0 |
To use an X-table in GProM, you have to specify the names of the attributes storing probabilities and x-tuple identifiers. For instance, for the table above:
TUPLE UNCERTAIN (
SELECT * FROM R IS XTABLE(x-id,p)
);
To run queries over the provided datasets, start gprom like this (you need to unzip the database if this was not already done by the testing script):
gprom -backend sqlite -db /reproducibility/dbs/incomp.db
For example, to query the buffalo table:
TUPLE UNCERTAIN (SELECT "lat" FROM buffalo IS UADB);