**Differential dependencies** may seem complicated, but they are in fact easy to understand. Let's try them out with [Desbordante](https://github.com/Desbordante/desbordante-core)!

# Install necessary dependencies

First, let's install the necessary library:

In [None]:
!pip install desbordante==2.3.2

Collecting desbordante==2.3.2
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)

The Desbordante library will be used to discover differential dependencies, and the pandas library will be used to display the data:

In [None]:
import desbordante
import pandas as pd

Let's download example data:

In [None]:
!wget https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/flights_dd.csv
!wget https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/flights_dd_dif_table.csv

--2025-03-20 16:55:07--  https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/flights_dd.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 758 [text/plain]
Saving to: ‘flights_dd.csv’


2025-03-20 16:55:07 (14.0 MB/s) - ‘flights_dd.csv’ saved [758/758]

--2025-03-20 16:55:07--  https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/flights_dd_dif_table.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131 [text/plain]
Saving to: ‘flights_dd_d

# Explore data

Let's have a look at the dataset:

In [None]:
dataset = pd.read_csv('flights_dd.csv')
dataset

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,SU 35,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,64
1,FV 6015,2024-03-06,Saint Petersburg (LED),Moscow (VKO),624,63
2,FV 6027,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,66
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62
5,S7 1009,2024-03-01,Moscow (DME),Saint Petersburg (LED),664,66
6,S7 1010,2024-03-02,Saint Petersburg (LED),Moscow (DME),664,70
7,B2 978,2024-03-07,Moscow (SVO),Minsk (MSQ),641,58
8,DP 967,2024-03-07,Moscow (VKO),Minsk (MSQ),622,73
9,B2 981,2024-03-08,Minsk (MSQ),Moscow (VKO),622,61


Each row describes one flight: its number, date, departure and arrival airports, distance and duration. Let's look at the second table:

In [None]:
dif_table = pd.read_csv('flights_dd_dif_table.csv')
dif_table

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,-----,-----,[0;0],[0;0],[0;50],[0;15]
1,-----,-----,[0;3],[0;3],------,------


It specifies ranges for some of the columns. This table is supplementary; its meaning is explained later.

# Find differential dependencies

Now, let's find differential dependencies using Desbordante:

In [None]:
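# Create the SPLIT algorithm instance for DD discovery
algo = desbordante.dd.algorithms.Split()
# Load the flights table as the data to analyse
algo.load_data(table=dataset)
# Run discovery; the difference table restricts the search space
algo.execute(difference_table=dif_table)
# Retrieve and print the discovered differential dependencies
dds = algo.get_dds()
for dd in dds:
  print(dd)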

Departure [0, 0] ; Arrival [0, 0] -> Distance [0, 50]
Distance [0, 50] -> Duration [0, 15]
Departure [0, 3] ; Arrival [0, 3] -> Duration [0, 15]


The SPLIT algorithm found three differential dependencies (DDs)!

# First DD explanation

The first DD, "Departure [0, 0] ; Arrival [0, 0] -> Distance [0, 50]", means the following.

For any two tuples of the table if

a) the distance between them on the column "Departure" is between 0 and 0 (i.e. they are equal), and

b) the distance between them on the column "Arrival" is between 0 and 0 (i.e. they are equal),

then the distance between them on the column "Distance" is between 0 and 50.
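
The definition above can be checked by brute force over all tuple pairs. The following is a minimal sketch in plain Python that does not use Desbordante; `rows` simply restates the Departure, Arrival and Distance columns of the flights table, and the helper name `lhs_pairs` is hypothetical.

```python
from itertools import combinations

# (Departure, Arrival, Distance) copied from the flights table, in row order.
rows = [
    ("Saint Petersburg (LED)", "Moscow (SVO)", 598),
    ("Saint Petersburg (LED)", "Moscow (VKO)", 624),
    ("Saint Petersburg (LED)", "Moscow (SVO)", 598),
    ("Moscow (VKO)", "Saint Petersburg (LED)", 624),
    ("Moscow (SVO)", "Saint Petersburg (LED)", 598),
    ("Moscow (DME)", "Saint Petersburg (LED)", 664),
    ("Saint Petersburg (LED)", "Moscow (DME)", 664),
    ("Moscow (SVO)", "Minsk (MSQ)", 641),
    ("Moscow (VKO)", "Minsk (MSQ)", 622),
    ("Minsk (MSQ)", "Moscow (VKO)", 622),
]

def lhs_pairs(rows):
    """Return index pairs whose Departure and Arrival values are identical.

    A string distance in [0, 0] means that the two values are equal.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rows), 2)
        if a[0] == b[0] and a[1] == b[1]
    ]

pairs = lhs_pairs(rows)
# The DD holds if every pair satisfying the LHS differs by at most 50 on Distance.
dd_holds = all(abs(rows[i][2] - rows[j][2]) <= 50 for i, j in pairs)
print(pairs, dd_holds)  # -> [(0, 2)] True
```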

The only tuple pair that satisfies both of the constraints on the left-hand side (LHS) is (0,2):

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  df1.iloc[0,2]='color:green;font-weight:bold'
  df1.iloc[2,2]='color:green;font-weight:bold'
  df1.iloc[0,3]='color:green;font-weight:bold'
  df1.iloc[2,3]='color:green;font-weight:bold'
  return df1

dataset.style.apply(color_cells,axis=None)

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,SU 35,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,64
1,FV 6015,2024-03-06,Saint Petersburg (LED),Moscow (VKO),624,63
2,FV 6027,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,66
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62
5,S7 1009,2024-03-01,Moscow (DME),Saint Petersburg (LED),664,66
6,S7 1010,2024-03-02,Saint Petersburg (LED),Moscow (DME),664,70
7,B2 978,2024-03-07,Moscow (SVO),Minsk (MSQ),641,58
8,DP 967,2024-03-07,Moscow (VKO),Minsk (MSQ),622,73
9,B2 981,2024-03-08,Minsk (MSQ),Moscow (VKO),622,61


As we can see, this is the only tuple pair in which both the departure and arrival airports are the same. Now let's consider the values of this tuple pair on the column "Distance":

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  df1.iloc[0,2]='color:green;font-weight:bold'
  df1.iloc[2,2]='color:green;font-weight:bold'
  df1.iloc[0,3]='color:green;font-weight:bold'
  df1.iloc[2,3]='color:green;font-weight:bold'
  df1.iloc[0,4]='color:red;font-weight:bold'
  df1.iloc[2,4]='color:red;font-weight:bold'
  return df1

dataset.style.apply(color_cells,axis=None)

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,SU 35,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,64
1,FV 6015,2024-03-06,Saint Petersburg (LED),Moscow (VKO),624,63
2,FV 6027,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,66
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62
5,S7 1009,2024-03-01,Moscow (DME),Saint Petersburg (LED),664,66
6,S7 1010,2024-03-02,Saint Petersburg (LED),Moscow (DME),664,70
7,B2 978,2024-03-07,Moscow (SVO),Minsk (MSQ),641,58
8,DP 967,2024-03-07,Moscow (VKO),Minsk (MSQ),622,73
9,B2 981,2024-03-08,Minsk (MSQ),Moscow (VKO),622,61


Both flights cover 598 km, so the difference on the column "Distance" is 0, which lies between 0 and 50. Since this is the only tuple pair that satisfies the LHS, the DD
"Departure [0, 0] ; Arrival [0, 0] -> Distance [0, 50]" holds in the table.

# Second DD explanation

Now let's move to the second DD: "Distance [0, 50] -> Duration [0, 15]". This DD means the following: for any pair of tuples, if the distance between them on the column "Distance" is between 0 and 50, then the distance between them on the column "Duration" is between 0 and 15. In other words, if two flights have similar distances, then they last for a similar time.
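
A DD is violated only by a tuple pair that satisfies the left-hand side and breaks the right-hand side. The sketch below, written in plain Python without Desbordante, searches for such pairs; the lists restate the Distance and Duration columns of the flights table, and the helper name `violations` is hypothetical.

```python
from itertools import combinations

# Distance and Duration columns of the flights table, in row order.
distance = [598, 624, 598, 624, 598, 664, 664, 641, 622, 622]
duration = [64, 63, 66, 58, 62, 66, 70, 58, 73, 61]

def violations(lhs, rhs, lhs_max, rhs_max):
    """Return index pairs that satisfy the LHS constraint but break the RHS one."""
    return [
        (i, j)
        for i, j in combinations(range(len(lhs)), 2)
        if abs(lhs[i] - lhs[j]) <= lhs_max and abs(rhs[i] - rhs[j]) > rhs_max
    ]

# Distance [0, 50] -> Duration [0, 15]: no violating pair exists.
print(violations(distance, duration, lhs_max=50, rhs_max=15))  # -> []

# A tighter, hypothetical right-hand side fails: for example, flights 3 and 8
# differ by only 2 on Distance but by 15 on Duration.
print(violations(distance, duration, lhs_max=50, rhs_max=14))
```

The second call is only there to show what a violation looks like; the bound of 14 is not part of the difference table used in this example.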

As the table shows, most flights have similar distances, differing by at most 50 kilometers. Here we highlight all records that are suitable for the first record (i.e. the flights whose distance is similar to that of the first flight):

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  for i in range(10):
    if i!=5 and i!=6:
      df1.iloc[i,4]='color:green;font-weight:bold'
      df1.iloc[i,5]='color:red;font-weight:bold'
  return df1

dataset.style.apply(color_cells,axis=None)

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,SU 35,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,64
1,FV 6015,2024-03-06,Saint Petersburg (LED),Moscow (VKO),624,63
2,FV 6027,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,66
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62
5,S7 1009,2024-03-01,Moscow (DME),Saint Petersburg (LED),664,66
6,S7 1010,2024-03-02,Saint Petersburg (LED),Moscow (DME),664,70
7,B2 978,2024-03-07,Moscow (SVO),Minsk (MSQ),641,58
8,DP 967,2024-03-07,Moscow (VKO),Minsk (MSQ),622,73
9,B2 981,2024-03-08,Minsk (MSQ),Moscow (VKO),622,61


Moreover, the durations of all ten flights (rows 0 to 9) lie between 58 and 73 minutes, so any two of them differ by at most 15 minutes. Therefore, the second DD also holds in the table. If you are not convinced, you can check any tuple pair here:

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  df1.iloc[0,4]='color:green;font-weight:bold'
  df1.iloc[1,4]='color:green;font-weight:bold'
  df1.iloc[0,5]='color:red;font-weight:bold'
  df1.iloc[1,5]='color:red;font-weight:bold'
  return df1

x,y=map(int,input("Enter a pair of tuples: ").split())
while True:
  if x==y:
    print("Tuples should have different numbers")
  elif x>=dataset.shape[0] or y>=dataset.shape[0]:
    print("Number of a tuple should be less than the number of rows (",dataset.shape[0],")")
  elif x<0 or y<0:
    print("Tuples shouldn't have negative numbers")
  else:
    break
  x,y=map(int,input("Enter a pair of tuples: ").split())

left_dist=abs(dataset["Distance"][x]-dataset["Distance"][y])
print(f"Difference between tuples {x} and {y} on column 'Distance':",left_dist)
if left_dist<=50:
  print("This tuple pair satisfies LHS")
else:
  print("This tuple pair doesn't satisfy LHS")

right_dist=abs(dataset["Duration"][x]-dataset["Duration"][y])
print(f"Difference between tuples {x} and {y} on column 'Duration':",right_dist)
if right_dist<=15:
  print("This tuple pair satisfies RHS")
else:
  print("This tuple pair doesn't satisfy RHS")
print()

if left_dist<=50 and right_dist>15:
  print("This tuple pair doesn't satisfy DD, thus the DD doesn't hold in the table :(")
else:
  if left_dist>50 or right_dist>15:
    print("Notice: this is NOT a contradiction with the DD!")
    print("A tuple pair doesn't satisfy DD if and only if")
    print("a) it satisfies LHS")
    print("b) it doesn't satisfy RHS")
    print()
  print("Therefore, this tuple pair satisfies DD")

tuple_pair=dataset.iloc[[x,y],:]
tuple_pair.style.apply(color_cells,axis=None)

Enter a pair of tuples: 3 4
Difference between tuples 3 and 4 on column 'Distance': 26
This tuple pair satisfies LHS
Difference between tuples 3 and 4 on column 'Duration': 4
This tuple pair satisfies RHS

Therefore, this tuple pair satisfies DD


Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62


# Third DD explanation

Now consider the third DD: "Departure [0, 3] ; Arrival [0, 3] -> Duration [0, 15]". It means that for any two tuples from the table, if

a) the distance between them on the column "Departure" is between 0 and 3, and

b) on the column "Arrival" the distance is between 0 and 3,

then the distance on the column "Duration" is between 0 and 15.

The distance between two strings is their [edit distance](https://en.wikipedia.org/wiki/Edit_distance) (the minimum number of characters that need to be substituted,
deleted or inserted in order to turn the first string into the second).
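
To get a feel for this metric, here is a minimal sketch of the standard Levenshtein distance in plain Python. It is an illustration only: it assumes that the string distance used in this example behaves like the textbook definition, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("Moscow (SVO)", "Moscow (VKO)"))  # -> 2
print(edit_distance("Moscow (SVO)", "Moscow (DME)"))  # -> 3
print(edit_distance("Moscow (SVO)", "Minsk (MSQ)"))   # well above 3
```

Under this definition, the three Moscow airports are within distance 3 of each other, while values for different cities are further apart.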

The distance constraint "Departure [0, 3]" means that we consider only those tuple pairs whose values on the column "Departure" are close enough. In this case, we aim to consider airports located in the same city. For example, tuple pairs (0,1) and (3,4) satisfy this constraint.

Tuples 0 and 1 have the same departure airport:

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  df1.iloc[0,2]='color:green;font-weight:bold'
  df1.iloc[1,2]='color:green;font-weight:bold'
  return df1

tuple_pair=dataset.iloc[[0,1],:]
tuple_pair.style.apply(color_cells,axis=None)

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,SU 35,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,64
1,FV 6015,2024-03-06,Saint Petersburg (LED),Moscow (VKO),624,63


Tuples 3 and 4 depart from the same city but from different airports:

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  df1.iloc[0,2]='color:green;font-weight:bold'
  df1.iloc[1,2]='color:green;font-weight:bold'
  return df1

tuple_pair=dataset.iloc[[3,4],:]
tuple_pair.style.apply(color_cells,axis=None)

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62


For the distance constraint "Arrival [0, 3]" the situation is similar.

Here are the tuple pairs that satisfy both of the constraints on the left-hand side:

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  for i in range(10):
    if i in [0,1,2,6]:
      df1.iloc[i,2]='color:green;font-weight:bold'
      df1.iloc[i,3]='color:green;font-weight:bold'
    elif i in [3,4,5]:
      df1.iloc[i,2]='color:orange;font-weight:bold'
      df1.iloc[i,3]='color:orange;font-weight:bold'
    elif i in [7,8]:
      df1.iloc[i,2]='color:blue;font-weight:bold'
      df1.iloc[i,3]='color:blue;font-weight:bold'
  return df1

dataset.style.apply(color_cells,axis=None)

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,SU 35,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,64
1,FV 6015,2024-03-06,Saint Petersburg (LED),Moscow (VKO),624,63
2,FV 6027,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,66
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62
5,S7 1009,2024-03-01,Moscow (DME),Saint Petersburg (LED),664,66
6,S7 1010,2024-03-02,Saint Petersburg (LED),Moscow (DME),664,70
7,B2 978,2024-03-07,Moscow (SVO),Minsk (MSQ),641,58
8,DP 967,2024-03-07,Moscow (VKO),Minsk (MSQ),622,73
9,B2 981,2024-03-08,Minsk (MSQ),Moscow (VKO),622,61


Now let's consider the values of these tuple pairs on the column "Duration":

In [None]:
def color_cells(x):
  df1=pd.DataFrame('',index=x.index,columns=x.columns)
  for i in range(10):
    if i in [0,1,2,6]:
      df1.iloc[i,2]='color:green;font-weight:bold'
      df1.iloc[i,3]='color:green;font-weight:bold'
      df1.iloc[i,5]='color:green;font-weight:bold'
    elif i in [3,4,5]:
      df1.iloc[i,2]='color:orange;font-weight:bold'
      df1.iloc[i,3]='color:orange;font-weight:bold'
      df1.iloc[i,5]='color:orange;font-weight:bold'
    elif i in [7,8]:
      df1.iloc[i,2]='color:blue;font-weight:bold'
      df1.iloc[i,3]='color:blue;font-weight:bold'
      df1.iloc[i,5]='color:blue;font-weight:bold'
  return df1

dataset.style.apply(color_cells,axis=None)

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,SU 35,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,64
1,FV 6015,2024-03-06,Saint Petersburg (LED),Moscow (VKO),624,63
2,FV 6027,2024-03-06,Saint Petersburg (LED),Moscow (SVO),598,66
3,FV 6024,2024-03-03,Moscow (VKO),Saint Petersburg (LED),624,58
4,SU 6,2024-03-06,Moscow (SVO),Saint Petersburg (LED),598,62
5,S7 1009,2024-03-01,Moscow (DME),Saint Petersburg (LED),664,66
6,S7 1010,2024-03-02,Saint Petersburg (LED),Moscow (DME),664,70
7,B2 978,2024-03-07,Moscow (SVO),Minsk (MSQ),641,58
8,DP 967,2024-03-07,Moscow (VKO),Minsk (MSQ),622,73
9,B2 981,2024-03-08,Minsk (MSQ),Moscow (VKO),622,61


For every highlighted tuple pair, the durations differ by at most 15 minutes. Therefore,
the DD "Departure [0, 3] ; Arrival [0, 3] -> Duration [0, 15]" holds in the table.
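
The same conclusion can be reproduced with a small generic checker that pairs each column with its own metric. This is a sketch in plain Python under the assumption that strings are compared with the standard Levenshtein distance and numbers with the absolute difference; it does not use Desbordante, and all names in it (`within`, `flights`, `lhs_matches`, and so on) are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def numeric_distance(a, b):
    """Absolute difference between two numbers."""
    return abs(a - b)

# Departure, Arrival and Duration columns of the flights table, in row order.
LED, SVO, VKO, DME, MSQ = (
    "Saint Petersburg (LED)", "Moscow (SVO)", "Moscow (VKO)",
    "Moscow (DME)", "Minsk (MSQ)",
)
flights = [
    {"Departure": LED, "Arrival": SVO, "Duration": 64},
    {"Departure": LED, "Arrival": VKO, "Duration": 63},
    {"Departure": LED, "Arrival": SVO, "Duration": 66},
    {"Departure": VKO, "Arrival": LED, "Duration": 58},
    {"Departure": SVO, "Arrival": LED, "Duration": 62},
    {"Departure": DME, "Arrival": LED, "Duration": 66},
    {"Departure": LED, "Arrival": DME, "Duration": 70},
    {"Departure": SVO, "Arrival": MSQ, "Duration": 58},
    {"Departure": VKO, "Arrival": MSQ, "Duration": 73},
    {"Departure": MSQ, "Arrival": VKO, "Duration": 61},
]

def within(t1, t2, constraints):
    """True if the pair meets every (column, metric, lo, hi) constraint."""
    return all(lo <= metric(t1[col], t2[col]) <= hi
               for col, metric, lo, hi in constraints)

lhs = [("Departure", edit_distance, 0, 3), ("Arrival", edit_distance, 0, 3)]
rhs = [("Duration", numeric_distance, 0, 15)]

# Pairs that satisfy the LHS, and those among them that break the RHS.
lhs_matches = [(i, j) for i, j in combinations(range(len(flights)), 2)
               if within(flights[i], flights[j], lhs)]
violating = [p for p in lhs_matches
             if not within(flights[p[0]], flights[p[1]], rhs)]

print(lhs_matches)
print("DD holds:", not violating)
```

The pairs in `lhs_matches` all come from the three highlighted groups (rows 0, 1, 2, 6; rows 3, 4, 5; rows 7, 8), and none of them breaks the constraint on "Duration".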

# Difference table

The most important parameter of the SPLIT algorithm for DD discovery is the difference table. Here is the difference table that was used in this example:

In [None]:
dif_table

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,-----,-----,[0;0],[0;0],[0;50],[0;15]
1,-----,-----,[0;3],[0;3],------,------


The difference table defines the search space for DDs: the algorithm searches only for DDs built from the distance constraints listed in the difference table. Accordingly, every distance constraint used in the discovered DDs appears in the difference table.

In [None]:
for dd in dds:
  print(dd)

Departure [0, 0] ; Arrival [0, 0] -> Distance [0, 50]
Distance [0, 50] -> Duration [0, 15]
Departure [0, 3] ; Arrival [0, 3] -> Duration [0, 15]
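
To make the search space concrete, the cells of the difference table can be read as a list of candidate constraints per column. The sketch below is only an illustration in plain Python and is not how Desbordante parses the table; it assumes that each cell is either an integer range written as `[lo;hi]` or a placeholder of dashes, and the names `parse_cell`, `dif_cells` and `search_space` are hypothetical.

```python
import re

def parse_cell(cell: str):
    """Parse '[lo;hi]' into (lo, hi); return None for a placeholder of dashes."""
    match = re.fullmatch(r"\[\s*(\d+)\s*;\s*(\d+)\s*\]", cell.strip())
    return (int(match.group(1)), int(match.group(2))) if match else None

# The difference table from this example, restated column by column.
dif_cells = {
    "Flight Number": ["-----", "-----"],
    "Date": ["-----", "-----"],
    "Departure": ["[0;0]", "[0;3]"],
    "Arrival": ["[0;0]", "[0;3]"],
    "Distance": ["[0;50]", "------"],
    "Duration": ["[0;15]", "------"],
}

# Keep only the real constraints; columns without any stay empty.
search_space = {
    column: [c for c in map(parse_cell, cells) if c is not None]
    for column, cells in dif_cells.items()
}
for column, constraints in search_space.items():
    print(column, constraints)
```

Columns such as "Flight Number" and "Date" end up with no constraints, which matches the fact that they do not appear in any of the discovered DDs.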


The number of constraints for each column can be different. The difference table can be accepted by the algorithm only in the format stated above. Please note that different difference tables fed into the algorithm result in different sections of the search space being explored and, thus, yield different results.

For example, let's change our difference table:

In [None]:
dif_table.iloc[1,3]="-----"
dif_table

Unnamed: 0,Flight Number,Date,Departure,Arrival,Distance,Duration
0,-----,-----,[0;0],[0;0],[0;50],[0;15]
1,-----,-----,[0;3],-----,------,------


We have removed the constraint [0;3] from the column "Arrival". Let's execute the algorithm again with the new difference table:

In [None]:
algo.execute(difference_table=dif_table)
dds = algo.get_dds()
for dd in dds:
  print(dd)

Departure [0, 0] ; Arrival [0, 0] -> Distance [0, 50]
Distance [0, 50] -> Duration [0, 15]
Departure [0, 3] ; Arrival [0, 0] -> Duration [0, 15]


Note that the distance constraint in the third DD has changed from "Arrival [0, 3]" to "Arrival [0, 0]".
This happened because the constraint "Arrival [0, 3]" is no longer in the search space.

# Conclusion

If you are reading this, then you have learnt about differential dependencies. Not so difficult, after all, right?

We have explored data and found interesting patterns there:


1.   Flights between the same airports have similar distances
2.   Flights with similar distances last for a similar time
3.   Flights between the same cities also last for a similar time, regardless of which airport in each city is used

If you wish to find these patterns in your data, now you know how to do it 🙂
Also, you can learn more about other pattern types presented in [Desbordante](https://github.com/Desbordante/desbordante-core).
