# Functional Dependencies

Functional dependencies (FDs) play a crucial role in database design and data analysis. They help identify relationships between attributes, enabling:

- **Database Normalization** - Eliminating redundancy and update anomalies by decomposing tables into well-structured relations.
- **Schema Optimization** - Improving query performance through efficient table design.
- **Data Quality Analysis** - Detecting inconsistencies and constraints in datasets.

Let $r$ be a relation, and let $X$ and $Y$ be arbitrary subsets of the attribute set of $r$.

We say that $Y$ is functionally dependent on $X$, denoted as $X \rightarrow Y$, if and only if every value of $X$ in $r$ is associated with exactly one value of $Y$ in $r$.

$$X \rightarrow Y \iff \forall t_1, t_2 \in r: \; t_1[X] = t_2[X] \implies t_1[Y] = t_2[Y]$$

- $t_1, t_2$: tuples in relation $r$;
- $t[X]$: the projection of tuple $t$ onto attribute set $X$.
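
To make the definition concrete, here is a minimal sketch that translates it literally into plain Python: it compares every pair of tuples and reports a violation as soon as two tuples agree on $X$ but differ on $Y$. The function name `fd_holds` and the four sample rows are illustrative assumptions, not part of Desbordante.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds in rows (a list of dicts).

    Hypothetical helper: a direct, quadratic-time reading of the definition.
    """
    for t1, t2 in combinations(rows, 2):
        same_lhs = all(t1[a] == t2[a] for a in lhs)
        same_rhs = all(t1[a] == t2[a] for a in rhs)
        if same_lhs and not same_rhs:
            return False  # t1[X] == t2[X] but t1[Y] != t2[Y]
    return True


# A small hand-made relation used only for illustration.
rows = [
    {"Course": "Math", "Professor": "Dr. Smith", "Semester": "Fall"},
    {"Course": "Math", "Professor": "Dr. Smith", "Semester": "Spring"},
    {"Course": "Physics", "Professor": "Dr. Green", "Semester": "Fall"},
    {"Course": "Physics", "Professor": "Dr. Gray", "Semester": "Spring"},
]

print(fd_holds(rows, ["Professor"], ["Course"]))  # True
print(fd_holds(rows, ["Course"], ["Professor"]))  # False: Physics has two professors
```

The pairwise check is easy to read but scales quadratically with the number of rows, so it is only meant to illustrate the definition.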

Let us show how Desbordante helps you discover functional dependencies in a dataset.

# Install Python libraries

In [10]:
!pip install desbordante==2.3.2
!pip install pandas



# Import desbordante and pandas

In [11]:
import desbordante as db
import pandas as pd

# Get sample datasets

In [12]:
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/university_fd.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/duplicates_short.csv

# Load the data

In [13]:
pd.read_csv("university_fd.csv")

Unnamed: 0,Course,Classroom,Professor,Semester
0,Math,512,Dr. Smith,Fall
1,Physics,406,Dr. Green,Fall
2,English,208,Prof. Turner,Fall
3,History,209,Prof. Davis,Fall
4,Math,512,Dr. Smith,Spring
5,Physics,503,Dr. Gray,Spring
6,English,116,Prof. Turner,Spring
7,Biology,209,Prof. Light,Spring


# Discover functional dependencies

With Desbordante, discovering all functional dependencies that hold in a dataset takes only a few lines of code.

In [14]:
algo = db.fd.algorithms.Default()
algo.load_data(table=("university_fd.csv", ',', True))
algo.execute()
print('FDs:')
for fd in algo.get_fds():
    print(fd)

FDs:
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor


## Verify exact Functional Dependencies

First, let's look at the `duplicates_short.csv` table.

In [20]:
data = pd.read_csv("duplicates_short.csv")
data

Unnamed: 0,id,name,credit_score,city,email,phone,country
0,26,Björn Smith,25.0,Pilington,Björn.Smith650@virtex.rum,25,RI
1,11859,Mary Doe,0.0,Lumdum,Mary.Doe-5926@ferser.edu,0,EU
2,1,Mary Doe,0.0,Lumdum,Mary.Doe0@muli.ry,4,EU
3,56,Emily Honjo,55.0,Kustruma,Emily.Honjo3080@ferser.edu,55,GZ
4,30,Björn Tarski,29.0,Lumdum,Björn.Tarski870@ferser.edu,29,PR
5,17788,Mary Doe,0.0,Kustruma,Mary.Doe35099692@virtex.rum,0,EU
6,5930,Mary Doe,,Lumdum,Mary.Doe-5926@ferser.edu,0,EU
7,58,Lisa Smith,57.0,Syndye,Lisa.Smith3306@virtex.rum,57,CM
8,29,Björn Shiramine,28.0,Syndye,Björn.Shiramine812@virtex.rum,28,EU
9,28,Björn Wolf,27.0,,Björn.Wolf756@virtex.rum,27,AI


Now we verify whether the `[id]` $\rightarrow$ `[credit_score]` FD holds (column indices 0 and 2 in the CSV file).

In [36]:
def print_clusters(verifier, data, lhs, rhs):
    print(f"Number of clusters violating FD: {verifier.get_num_error_clusters()}")
    for i, highlight in enumerate(verifier.get_highlights(), start=1):
        print(f"#{i} cluster:")
        for el in highlight.cluster:
            print(f"\t{el}: {data[data.columns[lhs]][el]} -> {data[data.columns[rhs]][el]}")

        print(f"Most frequent rhs value proportion: {highlight.most_frequent_rhs_value_proportion}")
        print(f"Num distinct rhs values: {highlight.num_distinct_rhs_values}")


def print_results_for_fd(verifier, data, lhs, rhs):
    if verifier.fd_holds():
        print("FD holds")
    else:
        print("FD does not hold")
        print_clusters(verifier, data, lhs, rhs)

In [37]:
algo = db.afd_verification.algorithms.Default()
algo.load_data(table=("duplicates_short.csv", ",", True))
algo.execute(lhs_indices=[0], rhs_indices=[2])
print_results_for_fd(algo, data, 0, 2)

FD holds


Next, we verify whether the `[name]` $\rightarrow$ `[credit_score]` FD holds.

In [38]:
algo.execute(lhs_indices=[1], rhs_indices=[2])
print_results_for_fd(algo, data, 1, 2)

FD does not hold
Number of clusters violating FD: 2
#1 cluster:
	1: Mary Doe -> 0.0
	2: Mary Doe -> 0.0
	5: Mary Doe -> 0.0
	6: Mary Doe -> nan
Most frequent rhs value proportion: 0.75
Num distinct rhs values: 2
#2 cluster:
	9: Björn Wolf -> 27.0
	11: Björn Wolf -> 28.0
	14: Björn Wolf -> 27.0
Most frequent rhs value proportion: 0.6666666666666666
Num distinct rhs values: 2


We learned that the specified FD does not hold: two clusters of rows contain values that violate it.

A cluster (with respect to a fixed FD) is a collection of rows that share the same left-hand side value but differ in their right-hand side values.
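
To illustrate what a cluster is, the following sketch groups rows by their left-hand side value and keeps only the groups with more than one distinct right-hand side value. It is not Desbordante's implementation: `error_clusters` is a hypothetical helper, the dictionary keys merely mirror the highlight attribute names used above, and the missing credit score is represented as `None`. The row indices and values are taken from the output and table shown earlier.

```python
from collections import Counter, defaultdict


def error_clusters(triples):
    """Group (row_index, lhs_value, rhs_value) triples into violating clusters."""
    groups = defaultdict(list)
    for idx, lhs, rhs in triples:
        groups[lhs].append((idx, rhs))

    clusters = []
    for lhs, members in groups.items():
        counts = Counter(rhs for _, rhs in members)
        if len(counts) > 1:  # same lhs, different rhs: the FD is violated here
            top_count = counts.most_common(1)[0][1]
            clusters.append({
                "lhs": lhs,
                "rows": [idx for idx, _ in members],
                "num_distinct_rhs_values": len(counts),
                "most_frequent_rhs_value_proportion": top_count / len(members),
            })
    return clusters


# (row index, name, credit_score); None stands for the missing value.
triples = [
    (0, "Björn Smith", 25.0),
    (1, "Mary Doe", 0.0),
    (2, "Mary Doe", 0.0),
    (3, "Emily Honjo", 55.0),
    (5, "Mary Doe", 0.0),
    (6, "Mary Doe", None),
    (9, "Björn Wolf", 27.0),
    (11, "Björn Wolf", 28.0),
    (14, "Björn Wolf", 27.0),
]

for cluster in error_clusters(triples):
    print(cluster)
# Mary Doe: rows [1, 2, 5, 6], 2 distinct values, proportion 0.75
# Björn Wolf: rows [9, 11, 14], 2 distinct values, proportion 2/3
```

Names that map to a single credit score, such as Björn Smith, form no cluster because they do not violate the FD.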


Take a closer look at them.

In the first cluster, three values are `0` and one is `NaN`.
This suggests that the `NaN` entry is a mistake, possibly made by someone unfamiliar with how the table is supposed to be populated. Therefore, it should probably be changed to `0`.

Now let's take a look at the second cluster.
It contains two distinct values: `27` (twice) and `28` (once).
The `28` is probably a typo, since the `7` and `8` keys are next to each other on the keyboard.

Having analyzed these clusters, we can conclude that our FD does not hold because of data-entry errors: a missing value and a likely typo.

Therefore, by correcting them, we can get this FD to hold and remove these errors from the dataset.
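
One possible way to apply such a correction is a majority vote: within each group of rows sharing the same name, every credit score is replaced by the most frequent one. The sketch below is an assumption about how one might do this, not a Desbordante feature; `repair_by_majority` and `fd_holds` are hypothetical helpers, and `None` stands for the missing value. A majority vote is only reasonable when the analyst has confirmed, as above, that the minority values are errors.

```python
from collections import Counter, defaultdict


def repair_by_majority(pairs):
    """Replace each rhs value with the most frequent rhs value for its lhs."""
    groups = defaultdict(list)
    for lhs, rhs in pairs:
        groups[lhs].append(rhs)
    # On ties, Counter.most_common keeps the value encountered first.
    majority = {lhs: Counter(values).most_common(1)[0][0]
                for lhs, values in groups.items()}
    return [(lhs, majority[lhs]) for lhs, _ in pairs]


def fd_holds(pairs):
    """True if every lhs value is paired with exactly one rhs value."""
    seen = {}
    return all(seen.setdefault(lhs, rhs) == rhs for lhs, rhs in pairs)


# (name, credit_score) pairs from the two violating clusters.
pairs = [
    ("Mary Doe", 0.0),
    ("Mary Doe", 0.0),
    ("Mary Doe", 0.0),
    ("Mary Doe", None),
    ("Björn Wolf", 27.0),
    ("Björn Wolf", 28.0),
    ("Björn Wolf", 27.0),
]

print(fd_holds(pairs))     # False
repaired = repair_by_majority(pairs)
print(fd_holds(repaired))  # True
```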