**Matching dependencies** are able to capture a huge number of hidden patterns in your data. Let's try them with [Desbordante](https://github.com/Desbordante/desbordante-core)!

# Install necessary dependencies

First, let's install and import the necessary libraries:

In [None]:
!pip install desbordante==2.3.2

Collecting desbordante==2.3.2
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: desbordante
Successfully installed desbordante-2.3.2


The Desbordante library will be used to discover matching dependencies, and the pandas library will be used to display the data:

In [None]:
import desbordante
import pandas as pd

Let's download example data:

In [None]:
!wget https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/animals_beverages.csv
!wget https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/carrier_merger.csv

--2025-03-20 16:56:34--  https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/animals_beverages.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115 [text/plain]
Saving to: ‘animals_beverages.csv’


2025-03-20 16:56:34 (1.84 MB/s) - ‘animals_beverages.csv’ saved [115/115]

--2025-03-20 16:56:34--  https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/carrier_merger.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 412 [text/plain]
Saving to

# Preliminary: similarity measures

If you are already familiar with similarity measures, you can skip this section.

---

To understand matching dependencies better, let's first talk about similarity measures. A similarity measure is a way to describe how different two values are. More formally, it is a function that takes two values and returns a number between 0.0 and 1.0. Intuitively, a similarity of 1.0 means that the values are equal, and the smaller the similarity, the more the values differ. Let's look at some examples of similarity measures.

**Levenshtein similarity measure**

[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between two strings is the minimum number of single-character substitutions, deletions or insertions needed to turn the first string into the second. The Levenshtein similarity measure is simply the Levenshtein distance normalized by the following formula:

$sim_{id}(a,b)=1.0 - \frac{dist(a,b)}{max(|a|,|b|)}$.

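The similarity obtained by this formula is always between 0.0 and 1.0. When both strings are empty the fraction is undefined; the implementation below returns 1.0 in that case.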

Let's use the Levenshtein Python library to calculate the Levenshtein distance:

In [None]:
!pip install Levenshtein

Collecting Levenshtein
  Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein)
  Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.27.1 rapidfuzz-3.12.2


Here we use the above formula for calculating Levenshtein similarity:

In [None]:
import Levenshtein as lev

def LevenshteinSimilarity(str1, str2):
    max_dist = max(len(str1), len(str2))
    if max_dist != 0:
        dist = lev.distance(str1, str2)
        return (max_dist - dist) / max_dist
    return 1.0

print(LevenshteinSimilarity("hello", "hello"))
print(LevenshteinSimilarity("hello", "hallo"))
print(LevenshteinSimilarity("hello", "world"))

1.0
0.8
0.2
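
If installing an extra package is not an option, the distance can also be computed with the classic dynamic-programming recurrence. The following is a minimal sketch, not the implementation used by the Levenshtein library; the names `levenshtein_distance` and `levenshtein_similarity` are our own.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as equal
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("hello", "hallo"))
print(levenshtein_similarity("hello", "hallo"))
```

The sketch keeps only two rows of the distance table in memory, which is enough because each cell depends only on the current and the previous row.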


**Equality similarity measure**

The equality similarity measure is simple: it returns 1.0 if the values are equal and 0.0 otherwise. Let's implement this similarity measure as a function:

In [None]:
def EqualitySimilarity(str1, str2):
  return 1.0 if str1 == str2 else 0.0

print(EqualitySimilarity("hello", "hello"))
print(EqualitySimilarity("hello", "hallo"))
print(EqualitySimilarity("hello", "world"))

1.0
0.0
0.0


**Jaccard similarity measure**

[Jaccard similarity measure](https://en.wikipedia.org/wiki/Jaccard_index) is defined on two sets as the size of their intersection divided by the size of their union:

$J(A,B)=\frac{|A \cap B|}{|A \cup B|}=\frac{|A \cap B|}{|A|+|B|-|A \cap B|}$.

Let's implement a function for calculating this similarity measure for string values. The function treats strings as sets of symbols:

In [None]:
def SymbolJaccardSimilarity(str1, str2):
    symbols1 = set(str1)
    symbols2 = set(str2)
    intersection_size = len(symbols1 & symbols2)
    union_size = len(symbols1) + len(symbols2) - intersection_size
    # Two empty strings have an empty union; treat them as equal.
    if union_size == 0:
        return 1.0
    return intersection_size / union_size

print(SymbolJaccardSimilarity("hello", "hello"))
print(SymbolJaccardSimilarity("hello", "hallo"))
print(SymbolJaccardSimilarity("hello", "world"))

1.0
0.6
0.2857142857142857
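
As a worked check of the last result: treated as sets of symbols, "hello" is $\{h,e,l,o\}$ and "world" is $\{w,o,r,l,d\}$. They share the two symbols $l$ and $o$, so

$J=\frac{2}{4+5-2}=\frac{2}{7}\approx 0.2857$.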


Another way to implement the Jaccard similarity measure for strings is to treat them as sets of words:

In [None]:
def WordJaccardSimilarity(str1, str2):
    words1 = set(str1.split())
    words2 = set(str2.split())
    intersection_size = len(words1 & words2)
    union_size = len(words1) + len(words2) - intersection_size
    # Two strings without any words have an empty union; treat them as equal.
    if union_size == 0:
        return 1.0
    return intersection_size / union_size

print(WordJaccardSimilarity("hello world", "hello world"))
print(WordJaccardSimilarity("hello hello", "hallo hello"))
print(WordJaccardSimilarity("hello", "world"))

1.0
0.5
0.0


Let's try these similarity measures:

In [None]:
str1, str2 = input("Enter two strings: ").split()

print(f"Similarities between '{str1}' and '{str2}':")
print("Levenshtein similarity:", LevenshteinSimilarity(str1, str2))
print("Equality similarity:", EqualitySimilarity(str1, str2))
print("Symbol Jaccard similarity:", SymbolJaccardSimilarity(str1, str2))
print("Word Jaccard similarity:", WordJaccardSimilarity(str1, str2))

Enter two strings: hello hallo
Similarities between 'hello' and 'hallo':
Levenshtein similarity: 0.8
Equality similarity: 0.0
Symbol Jaccard similarity: 0.6
Word Jaccard similarity: 0.0


There are many other similarity measures; the ones presented here are probably among the most commonly used. Now we are ready for matching dependency discovery!

# First example

Let's look at the first dataset:

In [None]:
first_dataset=pd.read_csv('animals_beverages.csv')
first_dataset

Unnamed: 0,name,zoo,animal,diet
0,Simba,berlin,lion,meat
1,Clarence,london,lion,mead
2,Baloo,berlin,bear,fish
3,Pooh,london,beer,fish


It's a small example dataset that contains information about animals.

Now, let's find matching dependencies using Desbordante!

First, we define which columns are going to be compared and which similarity measure determines how similar their values are. The HyMD algorithm then finds the decision boundaries of a set of MDs that is sufficient to infer all MDs that hold on the data and satisfy some requirements (interestingness criteria).

In this example, we compare the values of every column with themselves using the Levenshtein similarity measure:

In [None]:
Levenshtein = desbordante.md.column_matches.Levenshtein

algo = desbordante.md.algorithms.HyMD()
algo.load_data(left_table=first_dataset)
column_matches = [Levenshtein(i, i) for i in range(len(first_dataset.columns))]
algo.execute(column_matches=column_matches)
mds = algo.get_mds()
print('Found MDs:')
print(*(f'{i + 1} {md}' for i, md in enumerate(mds)), sep='\n')

Found MDs:
1 [ levenshtein(diet, diet)>=0.75 ] -> levenshtein(animal, animal)>=0.75
2 [ levenshtein(animal, animal)>=0.75 ] -> levenshtein(diet, diet)>=0.75


The HyMD algorithm found two matching dependencies (MDs)! These MDs can also be displayed in short form, showing only non-zero decision boundaries:

In [None]:
print(*map(lambda md: md.to_short_string(), mds), sep='\n')

[,,,0.75]->2@0.75
[,,0.75,]->3@0.75
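
The short form can be decoded by hand. Judging by the output above, the bracketed list holds one LHS decision boundary per column match (an empty slot means zero), and `2@0.75` gives the index of the RHS column match and its boundary. The helper below is a small sketch built on that reading; `parse_short_md` is a hypothetical name and not part of the Desbordante API.

```python
def parse_short_md(short: str):
    """Split a string like '[,,,0.75]->2@0.75' into its LHS and RHS parts."""
    lhs_text, rhs_text = short.split("->")
    # Drop the surrounding brackets; empty slots stand for a zero boundary and are skipped.
    slots = lhs_text.strip()[1:-1].split(",")
    lhs = {index: float(value) for index, value in enumerate(slots) if value.strip()}
    rhs_index, rhs_boundary = rhs_text.split("@")
    return lhs, (int(rhs_index), float(rhs_boundary))


print(parse_short_md("[,,,0.75]->2@0.75"))
print(parse_short_md("[,,0.75,]->3@0.75"))
```

With the columns name, zoo, animal and diet numbered from 0, the first string reads as "diet determines animal", which agrees with the long form printed earlier.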


The first MD, "[ levenshtein(diet, diet)>=0.75 ] -> levenshtein(animal, animal)>=0.75", means the following.

For any two tuples of the table, if the Levenshtein similarity between their values in column "diet" is at least 0.75, then the Levenshtein similarity between their values in column "animal" is also at least 0.75.

As we can see, there are two pairs of distinct tuples that satisfy the left-hand side (LHS) of the MD:

In [None]:
print(LevenshteinSimilarity("meat","mead"))
print(LevenshteinSimilarity("fish","fish"))

0.75
1.0


In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  df1.iloc[0,3]='color:green;font-weight:bold'
  df1.iloc[1,3]='color:green;font-weight:bold'
  df1.iloc[2,3]='color:blue;font-weight:bold'
  df1.iloc[3,3]='color:blue;font-weight:bold'
  return df1

first_dataset.style.apply(color_cells,axis=None)

Unnamed: 0,name,zoo,animal,diet
0,Simba,berlin,lion,meat
1,Clarence,london,lion,mead
2,Baloo,berlin,bear,fish
3,Pooh,london,beer,fish


Let's look at their values in column "animal":

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  for j in range(2,4):
    df1.iloc[0,j]='color:green;font-weight:bold'
    df1.iloc[1,j]='color:green;font-weight:bold'
    df1.iloc[2,j]='color:blue;font-weight:bold'
    df1.iloc[3,j]='color:blue;font-weight:bold'
  return df1

first_dataset.style.apply(color_cells,axis=None)

Unnamed: 0,name,zoo,animal,diet
0,Simba,berlin,lion,meat
1,Clarence,london,lion,mead
2,Baloo,berlin,bear,fish
3,Pooh,london,beer,fish


It is easy to see that the similarity between these values is at least 0.75:

In [None]:
print(LevenshteinSimilarity("lion","lion"))
print(LevenshteinSimilarity("bear","beer"))

1.0
0.75


Thus, the first MD holds in the table. The second MD is the converse of the first one and holds in the table for similar reasons.
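
The manual check above can also be done by brute force over all pairs of distinct tuples. This is only an illustrative sketch and not how HyMD works internally; `lev_sim` and `md_violations` are hypothetical helper names, and the rows are copied from the table above.

```python
from itertools import combinations


def lev_sim(a, b):
    # Normalized Levenshtein similarity computed with a row-by-row dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


animals = [
    {"name": "Simba", "zoo": "berlin", "animal": "lion", "diet": "meat"},
    {"name": "Clarence", "zoo": "london", "animal": "lion", "diet": "mead"},
    {"name": "Baloo", "zoo": "berlin", "animal": "bear", "diet": "fish"},
    {"name": "Pooh", "zoo": "london", "animal": "beer", "diet": "fish"},
]


def md_violations(rows, lhs_col, rhs_col, boundary=0.75):
    """Return the pairs that satisfy the LHS but break the RHS."""
    return [
        (r1["name"], r2["name"])
        for r1, r2 in combinations(rows, 2)
        if lev_sim(r1[lhs_col], r2[lhs_col]) >= boundary
        and lev_sim(r1[rhs_col], r2[rhs_col]) < boundary
    ]


print(md_violations(animals, "diet", "animal"))  # first MD
print(md_violations(animals, "animal", "diet"))  # second MD
print(md_violations(animals, "zoo", "animal"))   # a rule that does not hold
```

An empty list means that no pair contradicts the dependency. The last call shows what a violated rule looks like: animals from the same zoo are not necessarily similar.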

# Column matches

Matching dependencies are defined on **column matches** instead of single columns. A column match has the following form: *measure(column1, column2)*. It means that the similarity measure *measure* is used to compare values from *column1* with values from *column2*. For example, *levenshtein(diet, diet)* means that the Levenshtein similarity measure is used to compare values from column *diet* with each other. In the above example both columns of every column match were the same, but it is also possible to define a column match on two different columns.

```python
Levenshtein = desbordante.md.column_matches.Levenshtein
column_matches = [Levenshtein(i, i) for i in range(len(first_dataset.columns))]
```

These two lines from the above example create the column matches on which MDs are discovered. Desbordante supports column matches with the Levenshtein similarity measure directly. Here, every column match compares a column with itself. Let's move on to the second example to see more complex column matches and similarity measures!

# Second example

Let's have a look at the second dataset:

In [None]:
second_dataset=pd.read_csv('carrier_merger.csv')
second_dataset

Unnamed: 0,id,Source,From,To,Distance (km)
0,1,ac1,Saint-Petersburg,Helsinki,315
1,2,ac2,St-Petersburg,Helsinki,301
2,3,ac2,Moscow,St-Petersburg,650
3,4,ac2,Moscow,St-Petersburg,638
4,5,ac1,Moscow,Saint-Petersburg,670
5,6,ac1,Moscow,Yekaterinburg,1417
6,7,ac2,Trondheim,Copenhagen,877
7,8,ac1,Copenhagen,Trondheim,877
8,9,ac2,Dobfany,Helsinki,1396
9,10,ac2,St-Petersburg,Kostroma,659


The dataset is the result of merging data from two air carriers (ac1 and ac2).

Firstly, let's define column matches for MD discovery:

In [None]:
Equality = desbordante.md.column_matches.Equality
Custom = desbordante.md.column_matches.Custom
Jaccard = desbordante.md.column_matches.Jaccard

algo = desbordante.md.algorithms.HyMD()
algo.load_data(left_table=second_dataset)

max_distance = max(second_dataset['Distance (km)'])

column_matches = [
        Equality('id', 'id'),
        Equality('Source', 'Source'),
        Custom(SymbolJaccardSimilarity, 'From', 'From', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(SymbolJaccardSimilarity, 'To', 'To', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(SymbolJaccardSimilarity, 'To', 'From', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(SymbolJaccardSimilarity, 'From', 'To', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(lambda d1, d2: 1 - abs(int(d1) - int(d2)) / max_distance, 'Distance (km)',
               'Distance (km)', symmetrical=True, equality_is_max=True,
               measure_name='normalized_distance')]

As we can see:
1.   IDs and sources are considered similar only if they are equal.
2.   Departure ("From" column) and arrival ("To" column) city names are compared to themselves ("From" to "From", "To" to "To") and to each other ("To" to "From", "From" to "To") using the Jaccard similarity measure.
3.   Distances are compared using the normalized difference 1 - |dist1 - dist2| / max_distance, where max_distance is the maximum value in the column.
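
To picture what a column match on two different columns compares, here is a minimal pure-Python sketch for *jaccard(From, To)* on four of the rows shown above. It is not Desbordante code; `symbol_jaccard` and `similar_pairs` are hypothetical helper names, and the boundary of 0.75 is chosen only for illustration.

```python
def symbol_jaccard(a: str, b: str) -> float:
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Four rows copied from the dataset above.
routes = [
    {"id": 1, "From": "Saint-Petersburg", "To": "Helsinki"},
    {"id": 3, "From": "Moscow", "To": "St-Petersburg"},
    {"id": 7, "From": "Trondheim", "To": "Copenhagen"},
    {"id": 8, "From": "Copenhagen", "To": "Trondheim"},
]


def similar_pairs(rows, left_col, right_col, boundary):
    """Compare left_col of one record with right_col of another record."""
    result = []
    for left in rows:
        for right in rows:
            sim = symbol_jaccard(left[left_col], right[right_col])
            if sim >= boundary:
                result.append((left["id"], right["id"], round(sim, 6)))
    return result


print(similar_pairs(routes, "From", "To", 0.75))
```

The pairs found are exactly the return flights: record 1 departs from where record 3 arrives (despite the different spelling), and records 7 and 8 are each other's reverse route.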

Now, let's run the HyMD algorithm:

In [None]:
algo.execute(column_matches=column_matches)
mds = algo.get_mds()
print('Found MDs:')
print(*(f'{i + 1} {md}' for i, md in enumerate(mds)), sep='\n')

Found MDs:
1 [ jaccard(To, To)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
2 [ jaccard(From, From)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
3 [ jaccard(From, From)>=0.769231 | jaccard(To, To)>=0.769231 ] -> normalized_distance(Distance (km), Distance (km))>=0.977417
4 [ jaccard(From, From)>=0.769231 | jaccard(To, To)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=0.99012
5 [ jaccard(From, From)>=1 | normalized_distance(Distance (km), Distance (km))>=0.99012 ] -> equality(Source, Source)>=1
6 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> equality(Source, Source)>=1
7 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=0.991531
8 [ equality(Source, Source)>=1 | jaccard(From, From)>=0.769231 | jaccard(To, To)>=0.769231 ] -> normalized_distance(Distance (km), Distance (km))>=0.991531
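
The decision boundaries in this output are not arbitrary: each one appears to coincide with a similarity value that actually occurs between two records. The sketch below recomputes three of them from distances in the displayed rows, assuming that 1417 is the maximum of the column, as it is among those rows.

```python
max_distance = 1417  # largest "Distance (km)" value in the rows displayed above


def normalized_distance(d1, d2):
    # Same formula as the lambda passed to Custom in the cell above.
    return 1 - abs(d1 - d2) / max_distance


# Pairs of distances taken from the displayed rows.
for d1, d2 in [(650, 638), (315, 301), (670, 638)]:
    print(d1, d2, round(normalized_distance(d1, d2), 6))
```

The three printed values match the boundaries 0.991531, 0.99012 and 0.977417 in the MDs above. In the same way, 0.769231 is 10/13, the symbol-based Jaccard similarity of "Saint-Petersburg" and "St-Petersburg".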


# Custom similarity measures

In this example we used column matches based on our own similarity measures:

```python
Custom(SymbolJaccardSimilarity, 'From', 'From', symmetrical=True, equality_is_max=True,
               measure_name='jaccard')
```
As can be seen, to create a custom column match we pass the function that computes the similarity measure, the names of the columns whose values we want to compare, and some properties of the measure: whether the function is symmetrical (`symmetrical`) and whether it returns its maximum value for equal arguments (`equality_is_max`). The `measure_name` argument sets the name displayed in the results.

Similarity measures that are built into Desbordante can therefore be passed in two ways, either through the dedicated column match classes or as custom functions:
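
Before declaring these properties for your own measure, it can be worth testing them on a few sample values. The sketch below is an empirical check, not a proof, and it reflects our reading of the two flags; `check_properties`, `length_ratio` and `prefix_share` are hypothetical names used only for illustration.

```python
from itertools import product


def check_properties(measure, samples):
    """Return (symmetrical, equality_is_max) as observed on the sample values."""
    pairs = list(product(samples, repeat=2))
    symmetrical = all(measure(a, b) == measure(b, a) for a, b in pairs)
    equality_is_max = all(measure(a, a) >= measure(a, b) for a, b in pairs)
    return symmetrical, equality_is_max


def length_ratio(a, b):
    # Symmetric: shorter length divided by longer length.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def prefix_share(a, b):
    # Asymmetric on purpose: fraction of `a` covered by its common prefix with `b`.
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    return 1.0 if not a else common / len(a)


samples = ["Moscow", "Mos", "Helsinki"]
print(check_properties(length_ratio, samples))
print(check_properties(prefix_share, samples))
```

For `prefix_share` the first flag comes out as `False`, because "Mos" is fully covered by "Moscow" while "Moscow" is only half covered by "Mos".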

In [None]:
Equality = desbordante.md.column_matches.Equality
Custom = desbordante.md.column_matches.Custom
Jaccard = desbordante.md.column_matches.Jaccard

algo = desbordante.md.algorithms.HyMD()
algo.load_data(left_table=second_dataset)

max_distance = max(second_dataset['Distance (km)'])


column_matches1 = [
        Equality('id', 'id'),
        Equality('Source', 'Source'),
        Jaccard('From', 'From'),
        Jaccard('To', 'To'),
        Jaccard('To', 'From'),
        Jaccard('From', 'To'),
        Custom(lambda d1, d2: 1 - abs(int(d1) - int(d2)) / max_distance, 'Distance (km)',
               'Distance (km)', symmetrical=True, equality_is_max=True,
               measure_name='normalized_distance')]

column_matches2 = [
        Custom(EqualitySimilarity, 'id', 'id', symmetrical=True, equality_is_max=True,
               measure_name='equality'),
        Custom(EqualitySimilarity, 'Source', 'Source', symmetrical=True, equality_is_max=True,
               measure_name='equality'),
        Custom(WordJaccardSimilarity, 'From', 'From', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(WordJaccardSimilarity, 'To', 'To', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(WordJaccardSimilarity, 'To', 'From', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(WordJaccardSimilarity, 'From', 'To', symmetrical=True, equality_is_max=True,
               measure_name='jaccard'),
        Custom(lambda d1, d2: 1 - abs(int(d1) - int(d2)) / max_distance, 'Distance (km)',
               'Distance (km)', symmetrical=True, equality_is_max=True,
               measure_name='normalized_distance')]

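As we can see, both variants produce identical output. Note that the custom variant uses the word-based `WordJaccardSimilarity` rather than the symbol-based function from the second example, which is why the MDs differ from those found earlier: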

In [None]:
algo.execute(column_matches=column_matches1)
mds = algo.get_mds()
print('Found MDs using supported measures:')
print(*(f'{i + 1} {md}' for i, md in enumerate(mds)), sep='\n')
print()
algo.execute(column_matches=column_matches2)
mds = algo.get_mds()
print('Found MDs using custom measures:')
print(*(f'{i + 1} {md}' for i, md in enumerate(mds)), sep='\n')

Found MDs using supported measures:
1 [ jaccard(To, To)>=1 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
2 [ jaccard(From, From)>=1 | normalized_distance(Distance (km), Distance (km))>=0.99012 ] -> equality(Source, Source)>=1
3 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> equality(Source, Source)>=1
4 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=0.991531

Found MDs using custom measures:
1 [ jaccard(To, To)>=1 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
2 [ jaccard(From, From)>=1 | normalized_distance(Distance (km), Distance (km))>=0.99012 ] -> equality(Source, Source)>=1
3 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> equality(Source, Source)>=1
4 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=0.991531


# Support of MDs

Now, let's look at the results of the second example more thoroughly:

In [None]:
algo.execute(column_matches=column_matches)
mds = algo.get_mds()
print('Found MDs:')
print(*(f'{i + 1} {md}' for i, md in enumerate(mds)), sep='\n')

Found MDs:
1 [ jaccard(To, To)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
2 [ jaccard(From, From)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
3 [ jaccard(From, From)>=0.769231 | jaccard(To, To)>=0.769231 ] -> normalized_distance(Distance (km), Distance (km))>=0.977417
4 [ jaccard(From, From)>=0.769231 | jaccard(To, To)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=0.99012
5 [ jaccard(From, From)>=1 | normalized_distance(Distance (km), Distance (km))>=0.99012 ] -> equality(Source, Source)>=1
6 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> equality(Source, Source)>=1
7 [ jaccard(From, From)>=1 | jaccard(To, To)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=0.991531
8 [ equality(Source, Source)>=1 | jaccard(From, From)>=0.769231 | jaccard(To, To)>=0.769231 ] -> normalized_distance(Distance (km), Distance (km))>=0.991531


Let's also look again at the dataset:

In [None]:
second_dataset

Unnamed: 0,id,Source,From,To,Distance (km)
0,1,ac1,Saint-Petersburg,Helsinki,315
1,2,ac2,St-Petersburg,Helsinki,301
2,3,ac2,Moscow,St-Petersburg,650
3,4,ac2,Moscow,St-Petersburg,638
4,5,ac1,Moscow,Saint-Petersburg,670
5,6,ac1,Moscow,Yekaterinburg,1417
6,7,ac2,Trondheim,Copenhagen,877
7,8,ac1,Copenhagen,Trondheim,877
8,9,ac2,Dobfany,Helsinki,1396
9,10,ac2,St-Petersburg,Kostroma,659


Clearly, ID determines every other attribute, so, for example, the following MD holds in the table: "[ equality(id, id)>=1 ] -> equality(Source, Source)>=1". However, no dependency in the above results reflects that.

In the same manner, one would expect similar names of departure and arrival cities to indicate similar distances. There is indeed such a dependency, namely dependency 3, which matches the "To" and "From" values to themselves. However, it would also make sense to have a dependency that matches a "To" value to a "From" value or the other way around, and yet no such dependency appears in the result.

This is because they do not satisfy an interestingness criterion: their support is too low. "Support" here means the number of record pairs with similar values, i.e. pairs that satisfy the LHS. By default, when there is only one source table, the minimum support is set to one more than its number of records. As the support of these dependencies is lower than that, they are pruned.

For example, the support of the MD "[ equality(id, id)>=1 ] -> equality(Source, Source)>=1" equals the number of records (12), because each record matches only itself on "id". This is less than the default minimum support (13), so the dependency is considered uninteresting.
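
Support can be counted by brute force on a small excerpt. The sketch below uses only the `id` and `Source` values of the first five rows displayed above, and it assumes that pairs are ordered and that a record paired with itself counts, which is consistent with the support of the `id` dependency being equal to the number of records. The function `support` is a hypothetical helper, not a Desbordante call.

```python
from itertools import product

# (id, Source) of the first five rows displayed above.
records = [(1, "ac1"), (2, "ac2"), (3, "ac2"), (4, "ac2"), (5, "ac1")]


def support(rows, lhs_is_satisfied):
    """Count ordered record pairs (self-pairs included) that satisfy the LHS."""
    return sum(1 for left, right in product(rows, repeat=2) if lhs_is_satisfied(left, right))


# One more than the number of records, as described above.
default_min_support = len(records) + 1

id_support = support(records, lambda left, right: left[0] == right[0])
source_support = support(records, lambda left, right: left[1] == right[1])

print(id_support, id_support >= default_min_support)
print(source_support, source_support >= default_min_support)
```

On this excerpt an LHS on `id` is satisfied only by the five self-pairs and falls below the threshold of six, while an LHS on `Source` is satisfied by thirteen pairs and passes it.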

Let's decrease the minimum support from 13 to 6:

In [None]:
algo.execute(column_matches=column_matches, min_support=6)
mds = algo.get_mds()
print('Found MDs:')
print(*(f'{i + 1} {md}' for i, md in enumerate(mds)), sep='\n')

Found MDs:
1 [ equality(id, id)>=1 ] -> equality(Source, Source)>=1
2 [ equality(id, id)>=1 ] -> jaccard(From, From)>=1
3 [ equality(id, id)>=1 ] -> jaccard(To, To)>=1
4 [ equality(id, id)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=1
5 [ jaccard(To, From)>=0.769231 | jaccard(From, To)>=0.769231 ] -> normalized_distance(Distance (km), Distance (km))>=0.985886
6 [ jaccard(To, From)>=1 | jaccard(From, To)>=1 ] -> normalized_distance(Distance (km), Distance (km))>=0.991531
7 [ jaccard(To, To)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
8 [ jaccard(From, From)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=0.991531 ] -> equality(Source, Source)>=1
9 [ jaccard(From, From)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=1 ] -> equality(id, id)>=1
10 [ jaccard(From, From)>=0.769231 | normalized_distance(Distance (km), Distance (km))>=1 ] -> jaccard(To, To)>=1
11 [ jaccard(From, From)>=0.

Now these dependencies are present: they appear at the top of the list (MDs 1 to 6).
However, there are also several dependencies that "do not make sense", like "the departure city and closeness in distance determine the arrival city". These only hold because the dataset being inspected does not happen to contain a counterexample.
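
How fragile such a dependency is can be shown with a simplified variant of it that uses plain equality instead of the similarity measures: "equal departure city and equal distance imply equal arrival city". In the sketch below the first four flights are copied from the dataset, while the appended record is hypothetical and invented only to act as a counterexample.

```python
from itertools import combinations

# (From, To, Distance) of some rows displayed above.
flights = [
    ("Moscow", "St-Petersburg", 650),
    ("Moscow", "St-Petersburg", 638),
    ("Moscow", "Yekaterinburg", 1417),
    ("St-Petersburg", "Kostroma", 659),
]


def violations(rows):
    """Pairs with equal departure city and equal distance but different arrival city."""
    return [(a, b) for a, b in combinations(rows, 2)
            if a[0] == b[0] and a[2] == b[2] and a[1] != b[1]]


before = violations(flights)
print(before)  # nothing contradicts the rule in the observed rows

# Hypothetical extra record, not part of the dataset.
flights.append(("Moscow", "Minsk", 650))
after = violations(flights)
print(after)
```

A single additional record is enough to break the rule, which is why dependencies with little support should be treated with caution.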

We can also increase the minimum support requirement. This can help us find the dependencies that are more reliable, with more examples supporting them:

In [None]:
algo.execute(column_matches=column_matches, min_support=round(len(second_dataset) * 1.5))
mds = algo.get_mds()
print('Found MDs:')
print(*(f'{i + 1} {md}' for i, md in enumerate(mds)), sep='\n')

Found MDs:
1 [ jaccard(From, From)>=0.769231 | jaccard(To, To)>=0.769231 ] -> normalized_distance(Distance (km), Distance (km))>=0.977417


# Conclusion

If you are reading this, then you have learnt about matching dependencies. Congratulations!

We have explored data and found that flights with similar departure and arrival cities have similar distances. We have also learnt about different similarity measures and how to use them for discovery of matching dependencies. Now, for each type of data in a column you can choose the similarity measure that suits your needs the most.


If you wish to find these patterns in your data, now you know how to do it 🙂
You can also learn more about the other pattern types supported by [Desbordante](https://github.com/Desbordante/desbordante-core).