diff --git a/README.md b/README.md
index 68fce73fe..211f252a0 100644
--- a/README.md
+++ b/README.md
@@ -46,46 +46,7 @@ A brief introduction to the tool and its use cases can be found [here](https://m
 
 ## Console
 
-Usage examples:
-1) Discover all exact functional dependencies in a table stored in a comma-separated file with a header row. In this example the default FD discovery algorithm (HyFD) is used.
-
-```sh
-python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True
-```
-
-```text
-[Course Classroom] -> Professor
-[Classroom Semester] -> Professor
-[Classroom Semester] -> Course
-[Professor] -> Course
-[Professor Semester] -> Classroom
-[Course Semester] -> Classroom
-[Course Semester] -> Professor
-```
-
-2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used.
-
-```sh
-python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1
-```
-
-```text
-[Id] -> ProductName
-[Id] -> Price
-[ProductName] -> Price
-```
-
-3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
-
-```sh
-python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5
-```
-
-```text
-True
-```
-
-For more information consult documentation and help files.
+For information about the console interface check the [repository](https://github.com/Desbordante/desbordante-cli).
 
 ## Python bindings
 
@@ -250,17 +211,6 @@ $ pip install desbordante
 However, as Desbordante core uses C++, additional requirements on the machine are imposed.
 Therefore this installation option may not work for everyone. Currently, only manylinux2014 (Ubuntu 20.04+, or any other linux distribution with gcc 10+) is supported. If the above does not work for you consider building from sources.
 
-## CLI installation
-
-**NOTE**: Only Python 3.11+ is supported for CLI
-
-Сlone the repository, change the current directory to the project directory and run the following commands:
-
-```sh
-pip install -r cli/requirements.txt
-python3 cli/cli.py --help
-```
-
 ## Build instructions
 
 ### Ubuntu
diff --git a/README_CONSOLE.md b/README_CONSOLE.md
new file mode 100644
index 000000000..979c4c444
--- /dev/null
+++ b/README_CONSOLE.md
@@ -0,0 +1,106 @@
+

+
+

+
+---
+
+# Desbordante: high-performance data profiler (console interface)
+
+## What is it?
+
+[**Desbordante**](https://github.com/Desbordante/desbordante-core) is a high-performance data profiler oriented towards exploratory data analysis. This is the repository for the Desbordante console interface, which is published as a separate [package](https://pypi.org/project/desbordante-cli/). This package depends on the [desbordante package](https://pypi.org/project/desbordante/), which contains the C++ code for pattern discovery and validation. As a result, depending on the algorithm and dataset, the runtimes may be cut by 2-10 times compared to alternative tools.
+
+## Table of Contents
+
+- [Main Features](#main-features)
+- [Usage Examples](#usage-examples)
+- [Installation](#installation)
+- [Contacts and Q&A](#contacts-and-qa)
+
+# Main Features
+
+[**Desbordante**](https://github.com/Desbordante/desbordante-core) is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms.
+
+The **Discovery** task is designed to identify all instances of a specified pattern *type* in a given dataset.
+
+The **Validation** task is different: it is designed to check whether a specified pattern *instance* is present in a given dataset. This task not only returns True or False, but it also explains why the instance does not hold (e.g. it can list table rows with conflicting values).
+
+The currently supported data patterns are:
+* Functional dependency variants:
+  - Exact functional dependencies (discovery and validation)
+  - Approximate functional dependencies, with g1 metric (discovery and validation)
+  - Probabilistic functional dependencies, with PerTuple and PerValue metrics (discovery)
+* Graph functional dependencies (validation)
+* Conditional functional dependencies (discovery)
+* Inclusion dependencies (discovery)
+* Order dependencies:
+  - set-based axiomatization (discovery)
+  - list-based axiomatization (discovery)
+* Metric functional dependencies (validation)
+* Fuzzy algebraic constraints (discovery)
+* Unique column combinations:
+  - Exact unique column combination (discovery and validation)
+  - Approximate unique column combination, with g1 metric (discovery and validation)
+* Association rules (discovery)
+
+For more information about the supported patterns check the main [repo](https://github.com/Desbordante/desbordante-core).
+
+## Usage examples
+
+Usage examples:
+1) Discover all exact functional dependencies in a table stored in a comma-separated file with a header row. In this example the default FD discovery algorithm (HyFD) is used.
+
+```sh
+python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True
+```
+
+```text
+[Course Classroom] -> Professor
+[Classroom Semester] -> Professor
+[Classroom Semester] -> Course
+[Professor] -> Course
+[Professor Semester] -> Classroom
+[Course Semester] -> Classroom
+[Course Semester] -> Professor
+```
+
+2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used.
+
+```sh
+python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1
+```
+
+```text
+[Id] -> ProductName
+[Id] -> Price
+[ProductName] -> Price
+```
+
+3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
+
+```sh
+python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5
+```
+
+```text
+True
+```
+
+For more information check the --help option:
+
+```sh
+desbordante --help
+```
+
+## Installation
+
+The source code is currently hosted on GitHub at https://github.com/Desbordante/desbordante-console. In order for this to run, you first have to install the latest version of the main Desbordante [package](https://pypi.org/project/desbordante/).
+
+**NOTE**: Only Python 3.11+ is supported for CLI
+
+Run the following commands:
+
+```sh
+pip install -r cli/requirements.txt
+python3 cli/cli.py --help
+```