First, take a look at [Functional Dependencies Mining](/examples/notebooks/Functional_Dependencies_Mining.ipynb), if you haven't already.

# What is an Approximate Functional Dependency?

In Desbordante, an approximate functional dependency ($AFD$) is
any functional dependency ($FD$) that employs an error metric and does not go by a name of its own (unlike, e.g., *soft functional dependencies*).

This metric measures the extent to which the data violates a given exact $FD$ and lies within the `[0, 1]` range (the lower the value, the fewer violations
are found in the data).

For the discovery task, a user can specify a threshold, and Desbordante
will find all $AFDs$ whose error, according to the selected metric, is less than or equal to that threshold.

# What we have to offer

Currently, Desbordante supports several AFD discovery algorithms:

- `Tane` with the following metrics: `g1`, `pdep`, `tau`, `mu+`, `rho`
- `Pyro` (faster than `Tane`) with `g1` metric.

You can utilize these metrics in the following ways:

- Discovery: `g1`, `pdep`, `tau`, `mu+`, `rho`
- Validation: `g1`.

For more information, see:
1. [Measuring Approximate Functional Dependencies: A Comparative Study by M. Parciak et al.](https://www.researchgate.net/publication/377895992_Measuring_Approximate_Functional_Dependencies_a_Comparative_Study)
2. [Efficient Discovery of Approximate Dependencies by S. Kruse and F. Naumann.](https://dl.acm.org/doi/10.14778/3192965.3192968)
3. [TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies by Y. Huhtala et al.](https://doi.org/10.1093/comjnl/42.2.100)

## Demonstration

Now, we are going to demonstrate how to verify $AFDs$ and $FDs$.

First, install Python dependencies, import the modules and load the dataset.

In [1]:
!pip install desbordante==2.3.2
!pip install pandas

Collecting desbordante==2.3.2
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: desbordante
Successfully installed desbordante-2.3.2


In [2]:
import desbordante as db
import pandas as pd

In [None]:
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/duplicates_short.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/DnD.csv

### Verify functional dependencies

Let's look at the `duplicates_short.csv` table.

In [None]:
data = pd.read_csv("duplicates_short.csv")
data

Now we verify whether `[id]` $\rightarrow$ `[name]` FD holds.

In [None]:
def print_clusters(verifier, data, lhs, rhs):
    print(f"Number of clusters violating FD: {verifier.get_num_error_clusters()}")
    for i, highlight in enumerate(verifier.get_highlights(), start=1):
        print(f"#{i} cluster:")
        for el in highlight.cluster:
            print(f"\t{el}: {data[data.columns[lhs]][el]} -> {data[data.columns[rhs]][el]}")

        print(f"Most frequent rhs value proportion: {highlight.most_frequent_rhs_value_proportion}")
        print(f"Num distinct rhs values: {highlight.num_distinct_rhs_values}")


def print_results_for_fd(verifier, data, lhs, rhs):
    if verifier.fd_holds():
        print("FD holds")
    else:
        print("FD does not hold")
        print_clusters(verifier, data, lhs, rhs)

In [None]:
algo = db.afd_verification.algorithms.Default()
algo.load_data(table=("duplicates_short.csv", ",", True))
# [id] -> [name]: column 0 is the LHS, column 1 is the RHS
algo.execute(lhs_indices=[0], rhs_indices=[1])
print_results_for_fd(algo, data, 0, 1)

Now check whether `[name]` $\rightarrow$ `[credit_score]` FD holds.

In [None]:
algo.execute(lhs_indices=[1], rhs_indices=[2])
print_results_for_fd(algo, data, 1, 2)

We learned that in this case the specified FD does not hold and there are two clusters of rows that contain values that prevent our FD from holding.

First, what is a **cluster**? A **cluster** (with respect to a fixed FD) is a collection of rows that share the same left-hand side values but differ in their right-hand side values.

Now, let us take a closer look at them.

In the first cluster, three values are `0` and a single one is `NaN`.
This suggests that this single entry with the `NaN` value is a result of a mistake by someone who is not familiar with the table population policy. Therefore, it should probably be changed to `0`.

Now let's take a look at the second cluster.
There are two entries: `27` and `28`.
In this case, it is probably a typo, since the `7` and `8` keys are located next to each other on the keyboard.

Having analyzed these clusters, we can conclude that our FD does not hold due to typos in the data.

Therefore, by eliminating them, we can get this FD to hold (and make our dataset error-free).

### AFD verification example

Now let's look at the `DnD.csv` table.

In [7]:
data = pd.read_csv("DnD.csv", header=[0])
data

| | Creature | Strength | HaveMagic |
|---|---|---|---|
| 0 | Ogre | 9 | False |
| 1 | Ogre | 6 | False |
| 2 | Elf | 6 | True |
| 3 | Elf | 6 | True |
| 4 | Elf | 1 | True |
| 5 | Dwarf | 9 | False |
| 6 | Dwarf | 6 | False |


In [10]:
def print_clusters(verifier, data, lhs, rhs):
    print(f"Number of clusters violating FD: {verifier.get_num_error_clusters()}")
    for i, highlight in enumerate(verifier.get_highlights(), start=1):
        print(f"#{i} cluster: ")
        for el in highlight.cluster:
            print(f"\t{el}: {data[data.columns[lhs]][el]} -> {data[data.columns[rhs]][el]}")

        print(f"Most frequent rhs value proportion: {highlight.most_frequent_rhs_value_proportion}")
        print(f"Num distinct rhs values: {highlight.num_distinct_rhs_values}\n")

def print_results_for_fd(verifier, data, lhs, rhs):
    if verifier.fd_holds():
        print("FD holds")
    else:
        print("FD does not hold")
        print_clusters(verifier, data, lhs, rhs)

algo = db.afd_verification.algorithms.Default()
algo.load_data(table=data)
algo.execute(lhs_indices=[0], rhs_indices=[1])

In [12]:
def print_results_for_afd(verifier, error):
    # An AFD holds when its error does not exceed the threshold.
    if verifier.get_error() <= error:
        print("AFD with this error threshold holds")
    else:
        print("AFD with this error threshold does not hold")
        print(f"But the same AFD with error threshold = {verifier.get_error()} holds")

Checking whether `[Creature]` $\rightarrow$ `[Strength]` AFD holds (error threshold = 0.5).

In [13]:
print_results_for_afd(algo, 0.5)

AFD with this error threshold holds


Checking whether `[Creature]` $\rightarrow$ `[Strength]` AFD holds (error threshold = 0.1).


In [14]:
print_results_for_afd(algo, 0.1)

AFD with this error threshold does not hold
But the same AFD with error threshold = 0.19047619047619047 holds
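
To see where the reported error value comes from, here is a minimal pure-Python sketch of the `g1` idea: the share of ordered row pairs that agree on the left-hand side but disagree on the right-hand side. The helper name `g1_error` and the normalization by `n * (n - 1)` are assumptions chosen so that the result matches the value printed above; this is not Desbordante's own implementation.

```python
from collections import Counter, defaultdict


def g1_error(rows, lhs, rhs):
    """Share of ordered row pairs that agree on `lhs` but differ on `rhs`.

    Hypothetical helper for illustration; normalizing by n * (n - 1)
    is an assumption inferred from the value reported in this notebook.
    """
    n = len(rows)
    if n < 2:
        return 0.0

    # For every LHS value, count how often each RHS value occurs.
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1

    violating_pairs = 0
    for rhs_counts in groups.values():
        size = sum(rhs_counts.values())
        # All ordered pairs inside the group minus the pairs that agree on the RHS.
        violating_pairs += size * size - sum(c * c for c in rhs_counts.values())

    return violating_pairs / (n * (n - 1))


# The (Creature, Strength) columns of DnD.csv as shown above.
dnd = [
    ("Ogre", 9), ("Ogre", 6),
    ("Elf", 6), ("Elf", 6), ("Elf", 1),
    ("Dwarf", 9), ("Dwarf", 6),
]

# 8 violating ordered pairs out of 7 * 6 = 42.
error = g1_error(dnd, lhs=0, rhs=1)
print(error)
```

Under these assumptions, the Ogre and Dwarf groups contribute 2 violating ordered pairs each and the Elf group contributes 4, which gives 8 / 42.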


Similarly to the FD verification primitive, the AFD one can provide a user with clusters.

In [15]:
print_clusters(algo, data, 0, 1)

Number of clusters violating FD: 3
#1 cluster: 
	2: Elf -> 6
	3: Elf -> 6
	4: Elf -> 1
Most frequent rhs value proportion: 0.6666666666666666
Num distinct rhs values: 2

#2 cluster: 
	0: Ogre -> 9
	1: Ogre -> 6
Most frequent rhs value proportion: 0.5
Num distinct rhs values: 2

#3 cluster: 
	5: Dwarf -> 9
	6: Dwarf -> 6
Most frequent rhs value proportion: 0.5
Num distinct rhs values: 2
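
The per-cluster statistics above can also be recomputed by hand. The sketch below is a plain-Python illustration, not Desbordante's implementation: `violating_clusters` is a hypothetical helper that groups row indices by their left-hand side value and keeps only the groups containing more than one distinct right-hand side value. The order of the clusters it returns follows the order of first appearance in the table and may differ from the order Desbordante reports.

```python
from collections import Counter, defaultdict


def violating_clusters(rows, lhs, rhs):
    """Return clusters of row indices that share `lhs` but differ on `rhs`.

    Hypothetical helper for illustration only.
    """
    # Group row indices by their LHS value.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[lhs]].append(index)

    clusters = []
    for indices in groups.values():
        rhs_counts = Counter(rows[i][rhs] for i in indices)
        # A group violates the FD only if it contains several RHS values.
        if len(rhs_counts) > 1:
            clusters.append({
                "rows": indices,
                "num_distinct_rhs_values": len(rhs_counts),
                "most_frequent_rhs_value_proportion":
                    max(rhs_counts.values()) / len(indices),
            })
    return clusters


# The (Creature, Strength) columns of DnD.csv as shown above.
dnd = [
    ("Ogre", 9), ("Ogre", 6),
    ("Elf", 6), ("Elf", 6), ("Elf", 1),
    ("Dwarf", 9), ("Dwarf", 6),
]

clusters = violating_clusters(dnd, lhs=0, rhs=1)
for cluster in clusters:
    print(cluster)
```

For the Elf cluster, two of the three rows carry the most frequent value `6`, hence the proportion of 2/3; the Ogre and Dwarf clusters each split evenly, hence 0.5.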

