Desbordante supports the **discovery** and **verification** of both exact INDs and approximate INDs:
  1. Exact INDs: every value in the LHS set must also occur in the RHS set.
  2. Approximate INDs (AINDs): a controlled share of violations is allowed, quantified by an error metric.

For discovery tasks, users can specify an error threshold, and Desbordante will return all AINDs with an error value equal to or less than the specified threshold.
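As a rough illustration of how the threshold acts, the sketch below filters a few candidates by their error. The candidate names and error values are hypothetical and chosen only for illustration; this is not Desbordante's internal logic, only the "equal to or less than" rule stated above.

```python
# Hypothetical candidates mapped to hypothetical error values.
candidate_errors = {
    "A.x -> B.y": 0.0,   # an exact IND has error 0
    "A.z -> B.w": 0.3,
    "C.u -> B.y": 0.45,
}


def within_threshold(errors, threshold):
    """Keep candidates whose error does not exceed the threshold (inclusive)."""
    return {cand: err for cand, err in errors.items() if err <= threshold}


# The comparison is inclusive, so an error equal to the threshold is kept.
print(within_threshold(candidate_errors, 0.3))
# With a threshold of 0 only exact INDs remain.
print(within_threshold(candidate_errors, 0.0))
```

Because exact INDs have an error of `0`, they pass any non-negative threshold.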

The error metric used for AINDs is an adaptation of `g3`, originally designed for approximate functional dependencies (FDs).

For more information, consider:

```
"Unary and n-ary inclusion dependency discovery in relational databases",
        Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit.
```

# What is AIND?

In Desbordante, an approximate inclusion dependency (AIND) is an
inclusion dependency (IND) that is allowed to be violated, with the extent
of the violation measured by an error metric.

This metric calculates the proportion of distinct values in the
dependent set (LHS) that must be removed to satisfy the dependency on the
referenced set (RHS) completely.

The metric lies within the `[0, 1]` range:
- A value of `0` means the IND holds exactly (no violations exist).
- A value closer to `1` indicates a significant proportion of LHS values violate the dependency.
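The definition above can be sketched in a few lines of plain Python. This is a minimal illustration for single columns, not Desbordante's implementation: the function name `aind_error` and the toy columns are assumptions, and treating an empty LHS as error `0` is a choice made here for convenience.

```python
def aind_error(lhs_values, rhs_values):
    """Fraction of distinct LHS values that do not occur in the RHS."""
    lhs_distinct = set(lhs_values)
    if not lhs_distinct:
        # Assumption: an empty LHS is treated as trivially included.
        return 0.0
    missing = lhs_distinct - set(rhs_values)
    return len(missing) / len(lhs_distinct)


# Hypothetical toy columns.
lhs = [1, 2, 2, 3, 4]
rhs = [1, 2, 3, 5, 6]

# The LHS has 4 distinct values; only the value 4 is missing from the RHS.
print(aind_error(lhs, rhs))     # 0.25
# Every LHS value occurs in the RHS, so the IND holds exactly.
print(aind_error([1, 2], rhs))  # 0.0
```

Duplicates in the LHS do not change the result, because the metric counts distinct values rather than rows.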

# Demonstration of how to discover AINDs

# Install dependencies

In [None]:
!pip install desbordante==2.3.2
!pip install pandas
!pip install tabulate

Collecting desbordante
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: desbordante
Successfully installed desbordante-2.3.2


# Import desbordante and pandas

In [None]:
import desbordante
import pandas as pd
from tabulate import tabulate

# Get sample datasets

In [None]:
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ind_datasets/employee.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ind_datasets/project_assignments.csv

# Examine the data

The datasets under consideration for this example are `employee` and `project_assignments`.

In [None]:
pd.read_csv("employee.csv", header=[0])

Unnamed: 0,id,name,department,location
0,101,Alice Cooper,Marketing,New York
1,102,Bob Johnson,Engineering,San Francisco
2,103,Charlie Brown,HR,Chicago
3,104,Dana White,Sales,Los Angeles
4,105,Eva Black,Marketing,Boston
5,106,Frank Green,Engineering,Austin


In [None]:
pd.read_csv("project_assignments.csv", header=[0])

Unnamed: 0,id,employee_name,title,deadline
0,P001,Alice Cooper,Website Redesign,2024-12-01
1,P002,Bob Johnson,App Development,2024-12-15
2,P003,Charley Brown,HR Policy Update,2024-12-20
3,P006,Frank Green,Infrastructure Upgrade,2025-02-05


Let's find all AINDs whose error does not exceed `0.3`.

In [None]:
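# Create the Mind algorithm instance used for discovery.
algo = desbordante.ind.algorithms.Mind()

# Each entry is (path, separator, has_header).
TABLES = [
    ("employee.csv", ",", True),
    ("project_assignments.csv", ",", True),
]

algo.load_data(tables=TABLES)
# Return every IND whose error is at most 0.3.
algo.execute(error=0.3)

for ind in algo.get_inds():
    print("IND:", ind)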

IND: (project_assignments.csv, [employee_name]) -> (employee.csv, [name]) with error threshold = 0.25


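Only a single AIND was found. Its error of `0.25` comes from a typo in the `employee_name` column of `project_assignments.csv`: `Charley Brown` has no match in the `name` column of `employee.csv`, where the name is spelled `Charlie Brown`.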
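The reported error can be checked by hand against the tables displayed earlier. The sketch below uses plain Python sets on values copied from those tables; the variable names are illustrative and this is not a Desbordante API.

```python
# Values copied from the `name` column of employee.csv shown above.
employee_names = [
    "Alice Cooper", "Bob Johnson", "Charlie Brown",
    "Dana White", "Eva Black", "Frank Green",
]
# Values copied from the `employee_name` column of project_assignments.csv.
assigned_names = ["Alice Cooper", "Bob Johnson", "Charley Brown", "Frank Green"]

# Distinct LHS values that have no counterpart in the RHS.
lhs_distinct = set(assigned_names)
violations = sorted(lhs_distinct - set(employee_names))
error = len(violations) / len(lhs_distinct)

print(violations)  # ['Charley Brown']
print(error)       # 0.25
```

One of the four distinct LHS values is missing from the RHS, which matches the `0.25` printed by the algorithm.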

For automatically detecting violating clusters, you can create a pipeline using the AIND verifier in combination with a mining algorithm.