# What are IND and AIND?

In Desbordante, an approximate inclusion dependency (AIND) is an
inclusion dependency (IND) paired with an error metric that measures how
strongly the data violates it.

The metric is the proportion of distinct values in the dependent set (LHS)
that must be removed for the dependency on the referenced set (RHS) to
hold exactly.

The metric lies within the `[0, 1]` range:
- A value of `0` means the IND holds exactly (no violations exist).
- A value closer to `1` indicates a significant proportion of LHS values violate the dependency.
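
To make the definition concrete, here is a minimal sketch of the metric for a single pair of columns. It is plain Python and does not use Desbordante; the function name `aind_error` and the toy values are hypothetical, chosen only for illustration.

```python
def aind_error(lhs_values, rhs_values):
    """Share of distinct LHS values that are missing from the RHS."""
    lhs_distinct = set(lhs_values)
    if not lhs_distinct:
        return 0.0  # assumption: an empty LHS is treated as trivially included
    missing = lhs_distinct - set(rhs_values)
    return len(missing) / len(lhs_distinct)


# Toy data (hypothetical, not taken from the datasets used later).
exact = aind_error(["a", "b", "b", "c"], ["a", "b", "c", "d"])
approximate = aind_error(["a", "b", "x", "y"], ["a", "b", "c"])
print(exact)        # 0.0: every distinct LHS value appears in the RHS
print(approximate)  # 0.5: two of the four distinct LHS values are missing
```

Note that duplicates do not affect the result, because the metric counts distinct values rather than rows.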

## What can you do with it?

Desbordante supports the **discovery** and **verification** of both exact and approximate INDs:
  1. Exact INDs: every value in the LHS set must match a value in the RHS set.
  2. Approximate INDs: a controlled number of violations is allowed, quantified by the error metric.

For `discovery` tasks, users can specify an error threshold, and Desbordante will return all AINDs with an error value equal to or less than the specified threshold.
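
The idea of threshold-based discovery can be sketched with a naive pairwise check over columns. This is only an illustration under simplifying assumptions (unary dependencies, brute-force comparison), not how Desbordante implements discovery; `discover_unary_ainds`, `max_error`, and the toy tables are hypothetical names.

```python
from itertools import permutations


def discover_unary_ainds(columns, max_error):
    """Return (lhs, rhs, error) for every ordered column pair within the threshold.

    columns: dict mapping a (table, column) name to its list of values.
    """
    distinct = {name: set(values) for name, values in columns.items()}
    found = []
    for lhs, rhs in permutations(distinct, 2):
        if not distinct[lhs]:
            continue  # skip empty columns in this sketch
        error = len(distinct[lhs] - distinct[rhs]) / len(distinct[lhs])
        if error <= max_error:
            found.append((lhs, rhs, error))
    return found


# Toy data (hypothetical).
columns = {
    ("orders", "customer"): ["c1", "c2", "c2", "c9"],
    ("customers", "id"): ["c1", "c2", "c3", "c4", "c5"],
}
for lhs, rhs, error in discover_unary_ainds(columns, max_error=0.4):
    print(lhs, "->", rhs, f"error={error:.2f}")
```

With a threshold of `0.4`, only `orders.customer -> customers.id` is reported (one of its three distinct values is missing); with a threshold of `0`, nothing is reported, since neither direction holds exactly.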

For `verification` tasks, users can specify an AIND, and Desbordante will calculate its error value and identify clusters of violating values.
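
Verification can be sketched in the same spirit. The snippet below assumes that a "cluster" groups the LHS rows sharing one violating value; that reading, the function name `verify_unary_aind`, and the toy data are assumptions made for illustration and are not taken from the Desbordante API.

```python
from collections import defaultdict


def verify_unary_aind(lhs_values, rhs_values):
    """Return the error and the violating LHS rows grouped by value."""
    rhs_distinct = set(rhs_values)
    lhs_distinct = set(lhs_values)
    clusters = defaultdict(list)
    for row, value in enumerate(lhs_values):
        if value not in rhs_distinct:
            clusters[value].append(row)  # rows with the same violating value
    # Each cluster corresponds to exactly one distinct violating value.
    error = len(clusters) / len(lhs_distinct) if lhs_distinct else 0.0
    return error, dict(clusters)


# Toy data (hypothetical).
error, clusters = verify_unary_aind(
    ["c1", "c9", "c2", "c9", "c7"],
    ["c1", "c2", "c3"],
)
print(error)     # 0.5
print(clusters)  # {'c9': [1, 3], 'c7': [4]}
```

Grouping row indices by value shows both how large the violation is and where in the table it would have to be repaired.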

The error metric used for AINDs is an adaptation of `g3`, originally designed for approximate functional dependencies (FDs).

For more information, consider:

```
"Unary and n-ary inclusion dependency discovery in relational databases",
        Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit.
```

# Demonstration

We will show how you can discover and verify both exact and approximate inclusion dependencies.

# Install Python dependencies

In [1]:
!pip install desbordante==2.3.2
!pip install pandas
!pip install tabulate

Collecting desbordante==2.3.2
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: desbordante
Successfully installed desbordante-2.3.2


# Import Python modules

In [17]:
import desbordante
import pandas as pd
from tabulate import tabulate
import textwrap

# Get sample datasets

In [None]:
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ind_datasets/course.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ind_datasets/department.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ind_datasets/instructor.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ind_datasets/student.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ind_datasets/teaches.csv

## Exact IND discovery

Consider the following data tables.

In [8]:
pd.read_csv("course.csv")

Unnamed: 0,Course ID,Title,Department name
0,IT-1,Computer Science,Institute of Information Technology
1,MM-3,Algebra,Mathematics and Mechanics Faculty
2,H-1,History,Institute of History
3,FL-2,English,Faculty of Foreign Languages
4,IT-2,Programming,Institute of Information Technology
5,S-5,Philosophy,Faculty of Sociology
6,P-2,Physics,Faculty of Physics
7,C-8,Chemistry,Institute of Chemistry


In [9]:
pd.read_csv("department.csv")

Unnamed: 0,Department name,Building
0,Institute of Information Technology,5 Academic av.
1,Mathematics and Mechanics Faculty,3 Academic av.
2,Institute of History,29A University st.
3,Faculty of Foreign Languages,10 Science sq.
4,Faculty of Sociology,29C University st.
5,Faculty of Physics,10 Academic av.
6,Institute of Chemistry,11 Academic av.
7,Graduate School of Managemment,49 Science sq.


In [10]:
pd.read_csv("instructor.csv")

Unnamed: 0,ID,Name,Department name,Salary
0,in1089,Prof. Jones,Mathematics and Mechanics Faculty,$12000
1,in6723,Dr. Powers,Faculty of Sociology,$8000
2,in5555,Larry Thompson,Graduate School of Managemment,$5000
3,in8930,Prof. Burgess,Faculty of Sociology,$11500
4,in4520,David Stewart,Institute of Chemistry,$5200
5,in6577,Dr. Holloway,Mathematics and Mechanics Faculty,$9000
6,in9910,Dr. Rose,Institute of History,$8500


In [11]:
pd.read_csv("student.csv")

Unnamed: 0,ID,Name,Department name
0,st104726,Darlene Johnson,Institute of Chemistry
1,st967925,Alice Green,Mathematics and Mechanics Faculty
2,st760375,Olga Jones,Graduate School of Managemment
3,st779090,Felix Brown,Faculty of Sociology
4,st299471,Angela Ramirez,Faculty of Sociology
5,st887788,Debbie Lewis,Graduate School of Managemment
6,st679973,Evelyn Obrien,Mathematics and Mechanics Faculty
7,st897856,Melissa Smith,Institute of Information Technology


In [12]:
pd.read_csv("teaches.csv")

Unnamed: 0,Instructor ID,Course ID,Year,Semester
0,in1089,MM-3,2,Fall
1,in6723,S-5,1,Spring
2,in8930,S-5,3,Fall
3,in4520,C-8,2,Fall
4,in6577,MM-3,1,Fall


Let's discover the exact INDs that hold across the given tables.

`->` means "is included in"

In [15]:
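# Each table is described by a tuple: (path, separator, has_header).
TABLE_NAMES = ['course', 'department', 'instructor', 'student', 'teaches']
TABLES = [(f'{table_name}.csv', ',', True) for table_name in TABLE_NAMES]

# Run the default exact IND discovery algorithm over all tables at once.
algo = desbordante.ind.algorithms.Default()
algo.load_data(tables=TABLES)
algo.execute()
inds = algo.get_inds()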

In [16]:
for ind in inds:
    print(ind)

(course.csv, [Department name]) -> (department.csv, [Department name])
(instructor.csv, [Department name]) -> (department.csv, [Department name])
(student.csv, [Department name]) -> (department.csv, [Department name])
(teaches.csv, [Instructor ID]) -> (instructor.csv, [ID])
(teaches.csv, [Course ID]) -> (course.csv, [Course ID])
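
As a sanity check, one of the reported dependencies can be confirmed by hand with plain Python sets. The values below are copied from the `teaches.csv` and `instructor.csv` tables shown above; the variable names are only illustrative.

```python
# Column values copied from the tables displayed earlier.
teaches_instructor_id = ["in1089", "in6723", "in8930", "in4520", "in6577"]
instructor_id = ["in1089", "in6723", "in5555", "in8930",
                 "in4520", "in6577", "in9910"]

# (teaches.csv, [Instructor ID]) -> (instructor.csv, [ID]) holds exactly:
holds = set(teaches_instructor_id) <= set(instructor_id)
print(holds)  # True

# The reverse direction does not hold: some instructors teach no course.
missing = sorted(set(instructor_id) - set(teaches_instructor_id))
reverse_error = len(missing) / len(set(instructor_id))
print(missing)  # ['in5555', 'in9910']
print(f"{reverse_error:.3f}")
```

The reverse direction misses two of the seven distinct instructor IDs, which is why it does not appear among the exact INDs.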
