# Approximate Unique Column Combinations

This example illustrates the usage of approximate Unique Column Combinations (AUCC).

- An exact UCC declares that some columns uniquely identify every tuple in a table.
- An approximate UCC declares that some columns uniquely identify every tuple in a table, **but allows a certain degree of violation**.

For more information consult:

```
A Hybrid Approach for Efficient Unique Column Combination Discovery,
        by T. Papenbrock and F. Naumann.
Efficient Discovery of Approximate Dependencies,
        by S. Kruse and F. Naumann.
```

## Install dependencies

In [None]:
!pip install desbordante==2.3.2
!pip install pandas



## Import Python libraries

In [None]:
import desbordante as db
import pandas as pd

## Approximate UCC

Let's discover approximate UCCs.

In [15]:
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ucc_datasets/aucc.csv
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/main/examples/datasets/ucc_datasets/aucc_correct.csv

The following table contains records about employees.

In [16]:
pd.read_csv("aucc.csv")

Unnamed: 0,Name,Grade,Salary,Work_experience
0,Mark,7,1150,12
1,Joyce,2,1100,5
2,Harry,3,1000,7
3,Grace,4,900,12
4,Harry,4,1000,5
5,Samuel,1,900,9
6,Nancy,2,1000,3


Again, we need to select a column that will serve as a unique key (ID).
This time though, the AUCC mining algorithm with an error threshold will be used.

The smaller the threshold, the fewer violations (repeated values) are allowed in column combinations.
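
To make the threshold more concrete, here is a minimal sketch that computes one plausible error measure with pandas: the share of ordered pairs of distinct rows that agree on the chosen columns. This pair-based definition is our assumption and is not confirmed from the library's source. The helper `ucc_error` is hypothetical, and the table is recreated inline (treating the leading unnamed column of the printed table as the row index) so that the snippet runs without the downloaded file.

```python
import pandas as pd

# The employee table from this example, recreated inline.
df = pd.DataFrame({
    "Name": ["Mark", "Joyce", "Harry", "Grace", "Harry", "Samuel", "Nancy"],
    "Grade": [7, 2, 3, 4, 4, 1, 2],
    "Salary": [1150, 1100, 1000, 900, 1000, 900, 1000],
    "Work_experience": [12, 5, 7, 12, 5, 9, 3],
})

def ucc_error(frame, columns):
    """Hypothetical helper. Assumed measure: the share of ordered pairs
    of distinct rows that have equal values in `columns`."""
    n = len(frame)
    if n < 2:
        return 0.0
    # A group of k rows with equal values contributes k * (k - 1) ordered pairs.
    sizes = frame.groupby(list(columns)).size()
    violating_pairs = int((sizes * (sizes - 1)).sum())
    return violating_pairs / (n * (n - 1))

for column in df.columns:
    print(f"{column}: {ucc_error(df, [column]):.4f}")
```

Under this assumption, a column combination passes a threshold when its error does not exceed it. For this table the sketch gives roughly 0.048 for `Name`, 0.095 for `Grade` and `Work_experience`, and 0.190 for `Salary`, which would be consistent with the results reported for the thresholds used in this example.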

Let's run the AUCC mining algorithm with the threshold set to `0`.
Setting the threshold to `0` means mining exact UCCs (without violations).

In [17]:
algo = db.ucc.algorithms.PyroUCC()
algo.load_data(table=("aucc.csv", ',', True))
algo.execute(error=0)
uccs = algo.get_uccs()

In [18]:
for ucc in uccs:
    print(ucc.to_long_string())

[Name Grade]
[Name Work_experience]
[Grade Work_experience]
[Grade Salary]
[Salary Work_experience]


And again, there are no unary UCCs, so there is no single column that can define a key.
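
As a cross-check, the exact result can be reproduced by brute force with plain pandas. This is only a sketch for a tiny table and is not how the mining algorithm works. The table is recreated inline, assuming the leading unnamed column of the printed table is the row index.

```python
from itertools import combinations

import pandas as pd

# The employee table from this example, recreated inline.
df = pd.DataFrame({
    "Name": ["Mark", "Joyce", "Harry", "Grace", "Harry", "Samuel", "Nancy"],
    "Grade": [7, 2, 3, 4, 4, 1, 2],
    "Salary": [1150, 1100, 1000, 900, 1000, 900, 1000],
    "Work_experience": [12, 5, 7, 12, 5, 9, 3],
})

# Single columns whose values are all distinct.
unary = [column for column in df.columns if df[column].is_unique]

# Column pairs in which no combination of values is repeated.
binary = [
    pair
    for pair in combinations(df.columns, 2)
    if not df.duplicated(subset=list(pair)).any()
]

print("Unique single columns:", unary)
print("Unique column pairs:", binary)
```

Since no single column is unique here, every unique pair found this way is also minimal. The only pair that fails is (`Name`, `Salary`), because both `Harry` rows share the same salary.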

Let's run the algorithm with a larger threshold (= `0.1`).

In [19]:
algo = db.ucc.algorithms.PyroUCC()
algo.load_data(table=("aucc.csv", ',', True))
algo.execute(error=0.1)
auccs = algo.get_uccs()

In [20]:
for aucc in auccs:
    print(aucc.to_long_string())

[Name]
[Grade]
[Work_experience]


Now three of the four columns are considered unique, which is more permissive than we want. Let's test a smaller threshold (= `0.05`).

In [21]:
algo = db.ucc.algorithms.PyroUCC()
algo.load_data(table=("aucc.csv", ',', True))
algo.execute(error=0.05)
auccs = algo.get_uccs()

In [22]:
for aucc in auccs:
    print(aucc.to_long_string())

[Name]
[Grade Salary]
[Grade Work_experience]
[Salary Work_experience]


Among the single-column UCCs, `Name` requires the smallest threshold to be "unique".

This means that `Name` has fewer violations than the other columns.

Let's look at the table again, paying special attention to the `Name` column.

In [43]:
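df = pd.read_csv("aucc.csv")

def color_cells(x):
    # Start with an empty style for every cell.
    styles = pd.DataFrame('', index=x.index, columns=x.columns)
    # Highlight the two repeated `Name` cells (rows 2 and 4) in red.
    styles.iloc[2, 0] = 'color:red;font-weight:bold'
    styles.iloc[4, 0] = 'color:red;font-weight:bold'
    return styles

df.style.apply(color_cells, axis=None)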

Unnamed: 0,Name,Grade,Salary,Work_experience
0,Mark,7,1150,12
1,Joyce,2,1100,5
2,Harry,3,1000,7
3,Grace,4,900,12
4,Harry,4,1000,5
5,Samuel,1,900,9
6,Nancy,2,1000,3


There are two `Harry` entries. They have different work experience, so they are two different employees. The shared name is most likely an error or oversight in the data.
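
Rows like these can also be located programmatically instead of by eye. The sketch below uses `DataFrame.duplicated` with `keep=False`, which marks every row whose `Name` occurs more than once. The table is recreated inline, assuming the leading unnamed column of the printed table is the row index.

```python
import pandas as pd

# The employee table from this example, recreated inline.
df = pd.DataFrame({
    "Name": ["Mark", "Joyce", "Harry", "Grace", "Harry", "Samuel", "Nancy"],
    "Grade": [7, 2, 3, 4, 4, 1, 2],
    "Salary": [1150, 1100, 1000, 900, 1000, 900, 1000],
    "Work_experience": [12, 5, 7, 12, 5, 9, 3],
})

# keep=False flags all occurrences of a repeated name, not only the later ones.
violations = df[df.duplicated(subset="Name", keep=False)]
print(violations)
```

For this table the selection contains exactly the rows with index 2 and 4.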

If we represented their records using unique names, the `Name` AUCC would hold with threshold = `0`, and `Name` could be used as a key.
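
One possible way to produce such unique names is sketched below. It is only an illustration: we do not know how the corrected file was actually prepared, and the `_1`, `_2` suffix scheme is chosen to mirror the names shown in the corrected table. The table is recreated inline, assuming the leading unnamed column of the printed table is the row index.

```python
import pandas as pd

# The employee table from this example, recreated inline.
df = pd.DataFrame({
    "Name": ["Mark", "Joyce", "Harry", "Grace", "Harry", "Samuel", "Nancy"],
    "Grade": [7, 2, 3, 4, 4, 1, 2],
    "Salary": [1150, 1100, 1000, 900, 1000, 900, 1000],
    "Work_experience": [12, 5, 7, 12, 5, 9, 3],
})

# Rows whose name occurs more than once.
dup_mask = df["Name"].duplicated(keep=False)

# Running number of each occurrence within its name group: 1, 2, ...
suffix = (df.groupby("Name").cumcount() + 1).astype(str)

# Append the running number only to the repeated names.
fixed = df.copy()
fixed.loc[dup_mask, "Name"] = df.loc[dup_mask, "Name"] + "_" + suffix[dup_mask]

print(fixed["Name"].tolist())
print(fixed["Name"].is_unique)
```

In practice a rename like this should be backed by domain knowledge, since repeated names may also indicate genuine duplicate records rather than distinct people.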

In [46]:
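df = pd.read_csv("aucc_correct.csv")

def color_cells(x):
    # Start with an empty style for every cell.
    styles = pd.DataFrame('', index=x.index, columns=x.columns)
    # Highlight the two corrected `Name` cells (rows 2 and 4) in green.
    styles.iloc[2, 0] = 'color:green;font-weight:bold'
    styles.iloc[4, 0] = 'color:green;font-weight:bold'
    return styles

df.style.apply(color_cells, axis=None)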

Unnamed: 0,Name,Grade,Salary,Work_experience
0,Mark,7,1150,12
1,Joyce,2,1100,5
2,Harry_1,3,1000,7
3,Grace,4,900,12
4,Harry_2,4,1000,5
5,Samuel,1,900,9
6,Nancy,2,1000,3


Let's run the algorithm once more with threshold = `0`.

In [48]:
algo = db.ucc.algorithms.PyroUCC()
algo.load_data(table=("aucc_correct.csv", ',', True))
algo.execute(error=0)
uccs = algo.get_uccs()

In [49]:
for ucc in uccs:
    print(ucc.to_long_string())

[Name]
[Grade Salary]
[Grade Work_experience]
[Salary Work_experience]


Now we can use `Name` as a key.