# What is SFD?

A soft functional dependency (SFD) is another type of relaxed functional dependency (FD), that is, an FD that tolerates some degree of error in the data. It has the form X->Y, where X and Y are single attributes of a table. SFDs were first introduced in

```
"CORDS: automatic discovery of correlations and soft functional dependencies",
        by Ihab Ilyas et al.
```

They are also known as approximate FDs (AFDs) with the **$\rho$** metric in

```
"Measuring Approximate Functional Dependencies: a Comparative Study",
        by Marcel Parciak et al.
```
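
For intuition, the CORDS paper defines the strength of a soft FD X->Y as the number of distinct X values divided by the number of distinct (X, Y) pairs, so a strength of 1 means the dependency holds exactly. Below is a minimal pandas sketch of that ratio. The function name `sfd_strength` and the toy table are illustrative assumptions and are not part of the Desbordante API.

```python
import pandas as pd


def sfd_strength(table: pd.DataFrame, lhs: str, rhs: str) -> float:
    """Ratio of distinct LHS values to distinct (LHS, RHS) pairs."""
    distinct_lhs = table[lhs].nunique()
    distinct_pairs = len(table[[lhs, rhs]].drop_duplicates())
    return distinct_lhs / distinct_pairs


# Hypothetical toy table, used only for illustration
toy = pd.DataFrame({
    "X": [1, 1, 2, 2],
    "Y": ["a", "a", "b", "c"],  # X = 2 maps to two different Y values
    "Z": ["p", "p", "q", "q"],  # Z is fully determined by X
})

strength_xy = sfd_strength(toy, "X", "Y")  # 2 distinct X values / 3 distinct pairs
strength_xz = sfd_strength(toy, "X", "Z")  # 2 distinct X values / 2 distinct pairs
print(strength_xy, strength_xz)
```

The closer the ratio is to 1, the fewer X values are paired with more than one Y value.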

Using the CORDS algorithm to discover SFDs is very similar to using plain FD discovery algorithms, which are discussed in the `Functional Dependencies Mining` example.

Therefore, in this example we will try to construct and describe a dataset on which discovery of SFDs is meaningful.

# Install Python dependencies

In [None]:
!pip install desbordante==2.3.2
!pip install pandas

Collecting desbordante
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: desbordante
Successfully installed desbordante-2.3.2


# Import Python modules

In [None]:
import desbordante as db
import pandas as pd
import numpy as np

# Explore SFD

First, let's define a helper function that we will use throughout this example.

In [None]:
def mine_sfd():
    """Run CORDS on the global `df` using the global parameter constants and print the results."""
    algo = db.sfd.algorithms.Default()
    algo.load_data(table=df)
    algo.execute(min_sfd_strength=MIN_SFD_STRENGTH_MEASURE,
                 delta=DELTA,
                 max_false_positive_probability=MAX_FALSE_POSITIVE_PROBABILITY,
                 only_sfd=ONLY_SFD,
                 min_cardinality=MIN_CARDINALITY,
                 max_amount_of_categories=MAX_AMOUNT_OF_CATEGORIES,
                 min_skew_threshold=MIN_SKEW_THRESHOLD,
                 min_structural_zeroes_amount=MIN_STRUCTURAL_ZEROES_AMOUNT,
                 max_different_values_proportion=MAX_DIFF_VALS_PROPORTION)
    fds = algo.get_fds()
    cors = algo.get_correlations()
    if len(fds):
        print("Soft functional dependencies:")
        for fd in fds:
            print(fd)
    else:
        print("No sfd")
    if len(cors):
        print("Correlations:")
        for cor in cors:
            print(cor)
    else:
        print("No correlations")

First of all, let us construct a synthetic dataset consisting of 10000 rows and 5 columns:

In [None]:
n_rows = 10000
np.random.seed(65)  # Set the random seed for reproducibility

# Generate attributes
A = np.random.randint(1, 100, n_rows)  # Initial column
B = 2 * A + np.random.choice([0, 1], n_rows, p=[0.95, 0.05])
C = A + B + np.random.choice([0, 27], n_rows, p=[0.85, 0.15])
D = 3 * C + np.random.choice([0, 5, 17, 34], n_rows,
                              p=[0.6, 0.1, 0.1, 0.2])
E = np.random.randint(1, 3, n_rows)  # Independent column
# Create a DataFrame
df = pd.DataFrame({
    'A': A,
    'B': B,
    'C': C,
    'D': D,
    'E': E
})

df

Unnamed: 0,A,B,C,D,E
0,47,94,141,423,2
1,64,128,192,576,2
2,41,82,123,369,1
3,86,172,285,860,2
4,72,144,216,648,1
...,...,...,...,...,...
9995,22,44,66,215,2
9996,32,64,96,322,1
9997,34,68,129,387,1
9998,88,176,264,792,2


Our dataset contains 5 columns: `A`, `B`, `C`, `D`, `E`.

- `A` is a column of random integers in the range `[1, 100)`.
- `B` is generated as `2*A` with a 5% chance of deviation by 1.
- `C` is generated as `A+B` with a 15% chance of deviation by 27.
- `D` is generated as `3*C` with a 10% chance of deviation by 5, a 10% chance of deviation by 17, and a 20% chance of deviation by 34.
- `E` is a column of random integers in the range `[1, 2]`.
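
As a quick sanity check, the deviation rates listed above can be measured directly on the generated arrays. The sketch below regenerates the columns with the same seed so that it is self-contained. The variable names `rate_b`, `rate_c` and `rate_d` are illustrative, and the measured shares should be close to the nominal 5%, 15% and 40% (10% + 10% + 20%) without matching them exactly.

```python
import numpy as np

n_rows = 10000
np.random.seed(65)  # same seed as in the notebook

A = np.random.randint(1, 100, n_rows)
B = 2 * A + np.random.choice([0, 1], n_rows, p=[0.95, 0.05])
C = A + B + np.random.choice([0, 27], n_rows, p=[0.85, 0.15])
D = 3 * C + np.random.choice([0, 5, 17, 34], n_rows, p=[0.6, 0.1, 0.1, 0.2])

# Share of rows in which the added noise term is non-zero
rate_b = np.mean(B != 2 * A)
rate_c = np.mean(C != A + B)
rate_d = np.mean(D != 3 * C)

print(f"B deviates from 2*A in {rate_b:.1%} of rows")
print(f"C deviates from A+B in {rate_c:.1%} of rows")
print(f"D deviates from 3*C in {rate_d:.1%} of rows")
```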

Here are the first 10 rows of our dataset:

In [None]:
df[:10]

Unnamed: 0,A,B,C,D,E
0,47,94,141,423,2
1,64,128,192,576,2
2,41,82,123,369,1
3,86,172,285,860,2
4,72,144,216,648,1
5,57,115,199,602,2
6,94,188,282,846,2
7,7,14,21,63,2
8,2,4,6,18,2
9,98,196,294,882,2


As you can see, our dataset is constructed in such a way that exact FDs almost hold, meaning that only some tuples violate them.

We expect the following SFDs to hold:
 - [A] $\rightarrow$ B
 - [A] $\rightarrow$ C
 - [A] $\rightarrow$ D
 - [B] $\rightarrow$ C
 - [B] $\rightarrow$ D
 - [C] $\rightarrow$ D

The core parameters of the CORDS algorithm are:
- `only_sfd`: a boolean flag indicating whether to mine only SFDs or correlations as well.
- `min_cardinality`: `(1 - min_cardinality)*n_rows` denotes the minimum number of distinct values a column must have to be considered a soft key.
- `min_sfd_strength`: `(1 - min_sfd_strength)` denotes the minimum strength an SFD must have to be included in the result.
- `max_false_positive_probability`: `(1 - max_false_positive_probability)` denotes the maximum acceptable probability of a false-positive correlation test result.
- `max_amount_of_categories`: denotes the maximum number of categories allowed for the chi-squared test.

In [None]:
ONLY_SFD = False
MIN_CARDINALITY = 0.1
MAX_DIFF_VALS_PROPORTION = 0.99
MIN_SFD_STRENGTH_MEASURE = 0.1
MIN_SKEW_THRESHOLD = 0.5
MIN_STRUCTURAL_ZEROES_AMOUNT = 3e-01
MAX_FALSE_POSITIVE_PROBABILITY = 1e-06
DELTA = 0.11
MAX_AMOUNT_OF_CATEGORIES = 100

There are other parameters besides those listed above. For more detailed descriptions, we recommend looking into the original paper and `/src/core/config/descriptions.h`.
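
To make the `(1 - parameter)` convention concrete, the sketch below derives the effective thresholds from the values used in this notebook. It takes the parameter descriptions above at face value and has not been checked against the Desbordante source. The names `soft_key_min_distinct`, `sfd_strength_threshold` and `relaxed_threshold` are illustrative.

```python
n_rows = 10000

MIN_CARDINALITY = 0.1
MIN_SFD_STRENGTH_MEASURE = 0.1

# Minimum number of distinct values for a column to count as a soft key
soft_key_min_distinct = (1 - MIN_CARDINALITY) * n_rows

# Minimum strength an SFD needs to be reported
sfd_strength_threshold = 1 - MIN_SFD_STRENGTH_MEASURE

# A larger parameter value relaxes the threshold, e.g. 0.3 gives 0.7
relaxed_threshold = 1 - 0.3

print(f"soft key needs at least {soft_key_min_distinct:.0f} distinct values")
print(f"SFD strength threshold: {sfd_strength_threshold:.2f}")
print(f"relaxed SFD strength threshold: {relaxed_threshold:.2f}")
```

Under this reading, no column of the synthetic dataset comes close to the soft-key limit, since each column holds at most a few hundred distinct values.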

Due to the random nature of the algorithm we might have to run it a few times to get the full picture:

In [None]:
for i in range(1, 11):
    print('-' * 15)
    print("Iteration:", i)
    mine_sfd()
print('-' * 15, '\n')

---------------
Iteration: 1
Soft functional dependencies:
[B] -> A
[D] -> B
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
---------------
Iteration: 2
Soft functional dependencies:
[B] -> A
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
D ~ B
---------------
Iteration: 3
Soft functional dependencies:
[B] -> A
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
D ~ B
---------------
Iteration: 4
Soft functional dependencies:
[B] -> A
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
D ~ B
---------------
Iteration: 5
Soft functional dependencies:
[B] -> A
[D] -> B
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
---------------
Iteration: 6
Soft functional dependencies:
[B] -> A
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
D ~ B
---------------
Iteration: 7
Soft functional dependencies:
[B] -> A
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
D ~ B
---------------
Iteration: 8
Soft functional dependencies:
[B] -> A
[D] -> C
Correlations:
C ~ A
D ~ A
C ~ B
D ~ B
---------------
Iteration: 9
Soft functional dependencies:
[B] -> A
[D] ->

As you can see, on some iterations our expected soft functional dependencies are mined as correlations. However, if we relax the SFD strength threshold to 0.7 (that is, set `min_sfd_strength` to `0.3`), we will see the expected output.
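
When the result varies between runs, it can be easier to tally how often each dependency is reported than to compare the printed iterations by eye. The following is a minimal sketch of such a tally. `tally_runs` and `fake_miner` are hypothetical helpers, and `fake_miner` only replays two fixed outputs instead of calling Desbordante.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_runs(run_once: Callable[[], Iterable[str]], n_runs: int = 10) -> Counter:
    """Count in how many of n_runs each reported dependency appears."""
    counts: Counter = Counter()
    for _ in range(n_runs):
        # set() ensures a dependency is counted at most once per run
        counts.update(set(run_once()))
    return counts


# Stand-in for a real mining call (assumption): alternates between two outputs
_fake_outputs = iter([
    ["[B] -> A", "[D] -> B", "[D] -> C"],
    ["[B] -> A", "[D] -> C"],
] * 5)


def fake_miner() -> list:
    return next(_fake_outputs)


summary = tally_runs(fake_miner, n_runs=10)
for dependency, hits in summary.most_common():
    print(f"{dependency}: reported in {hits} of 10 runs")
```

In this notebook, `run_once` could be a variant of `mine_sfd` that returns `[str(fd) for fd in algo.get_fds()]` instead of printing the dependencies.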

In [None]:
MIN_SFD_STRENGTH_MEASURE = 0.3

for i in range(1, 11):
    print('-' * 15)
    print("Iteration:", i)
    mine_sfd()
print('-' * 15)

---------------
Iteration: 1
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D] -> B
[D] -> C
No correlations
---------------
Iteration: 2
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D] -> B
[D] -> C
No correlations
---------------
Iteration: 3
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D] -> B
[D] -> C
No correlations
---------------
Iteration: 4
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D] -> B
[D] -> C
No correlations
---------------
Iteration: 5
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D] -> B
[D] -> C
No correlations
---------------
Iteration: 6
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D] -> B
[D] -> C
No correlations
---------------
Iteration: 7
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D] -> B
[D] -> C
No correlations
---------------
Iteration: 8
Soft functional dependencies:
[B] -> A
[C] -> A
[D] -> A
[C] -> B
[D

You may also notice that the left-hand sides (LHS) and right-hand sides (RHS) of our SFDs are reversed compared to what we expected. This happens because the algorithm considers the column with the higher cardinality to be the LHS.
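
The effect of cardinality on direction can be seen by comparing the two orientations of the `A`/`B` pair on the full table. The sketch below uses the strength ratio from the CORDS paper (distinct LHS values divided by distinct value pairs) and regenerates the two columns with the same seed. CORDS itself is sampling-based, so the values Desbordante computes internally may differ from these full-table numbers.

```python
import numpy as np
import pandas as pd

n_rows = 10000
np.random.seed(65)  # same seed as in the notebook

A = np.random.randint(1, 100, n_rows)
B = 2 * A + np.random.choice([0, 1], n_rows, p=[0.95, 0.05])
pair = pd.DataFrame({"A": A, "B": B})

distinct_a = pair["A"].nunique()
distinct_b = pair["B"].nunique()
distinct_ab = len(pair.drop_duplicates())  # distinct (A, B) combinations

strength_a_to_b = distinct_a / distinct_ab
strength_b_to_a = distinct_b / distinct_ab

print(f"distinct A: {distinct_a}, distinct B: {distinct_b}")
print(f"strength of A -> B: {strength_a_to_b:.2f}")
print(f"strength of B -> A: {strength_b_to_a:.2f}")
```

Since `B` is either `2*A` or `2*A + 1`, `A` can always be recovered as `B // 2`, so `B -> A` holds exactly, while each value of `A` typically maps to two values of `B`.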