# **Algebraic Constraints (AC)**
This example is dedicated to fuzzy algebraic constraints (AC). The definition and algorithm are based on the article "B-HUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data" by Paul G. Brown and Peter J. Haas, presented at VLDB in 2003.

Let's start with what an AC is. To avoid going too deep, we give a simple definition without formalization: an AC describes the results of applying a binary operation to the values of two table columns, with the results grouped into meaningful intervals.
Let's illustrate this with an example.


# Installing and importing Python libraries
First, we need to install the required Python libraries.

In [None]:
!pip install desbordante==2.3.2
!pip install pandas
!pip install tabulate

Collecting desbordante==2.3.2
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: desbordante
Successfully installed desbordante-2.3.2


Then import all required libraries.

In [None]:
import desbordante
import pandas
import operator
from tabulate import tabulate

# Fixing the constants

As default parameters we will use the following values: binary operation "-", weight 0.1, fuzziness 0.2, p_fuzz 0.85, bumps_limit 0, iterations_limit 4, AC_seed 11.

We will explain what each parameter means later; for now we simply set them as follows.

In [None]:
HEADER = 0
SEPARATOR = ','
P_FUZZ = 0.85
FUZZINESS = 0.2
BUMPS_LIMIT = 0
WEIGHT = 0.1
BIN_OPERATION = '-'
AC_SEED = 11
ITERATIONS_LIMIT = 4
OPERATIONS = {
    '+': (operator.add, 'Sum'),
    '-': (operator.sub, 'Difference'),
    '*': (operator.mul, 'Product'),
    '/': (operator.truediv, 'Ratio'),
}

# Getting sample datasets

In [None]:
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/86085bbe5fb310795a8da70c63978f07b5d96a3b/examples/datasets/player_stats.csv

Load the first dataset.

In [None]:
table_for_example = pandas.read_csv('player_stats.csv',
                                    sep=SEPARATOR, header=HEADER)

A helper function for printing tables.

In [None]:
def print_table(table_, headers_ = None, title_ = None):
    if title_ is not None:
        print(title_)

    print(tabulate(table_, headers=headers_, tablefmt='pipe', showindex=False, stralign='center'))
    print()

In [None]:
print_table(table_for_example, headers_=['id', 'Strength', 'Agility'])

|   id |   Strength |   Agility |
|-----:|-----------:|----------:|
|    0 |          3 |         1 |
|    1 |          4 |         1 |
|    2 |          1 |         3 |
|    3 |          2 |         2 |
|    4 |          1 |         4 |
|    5 |         10 |        12 |
|    6 |         14 |        10 |
|    7 |          1 |        23 |
|    8 |          6 |        16 |



# Discovering algebraic constraints

Let's run the algorithm, applying the binary operation "+" to the Strength and Agility columns, and observe the results.

In [None]:
operation, operation_name = OPERATIONS['+']

algo = desbordante.ac.algorithms.Default()

df_without_id = table_for_example[['Strength', 'Agility']]

algo.load_data(table=df_without_id)

algo.execute(p_fuzz=P_FUZZ, fuzziness=FUZZINESS, bumps_limit=BUMPS_LIMIT, weight=WEIGHT,
        bin_operation='+', ac_seed=AC_SEED, iterations_limit=ITERATIONS_LIMIT)

Printing the results.

In [None]:
ac_ranges = algo.get_ac_ranges()
for ac_range in ac_ranges:
    l_col = df_without_id.columns[ac_range.column_indices[0]]
    r_col = df_without_id.columns[ac_range.column_indices[1]]
    print(f'Discovered ranges for ({l_col} + {r_col}) are:')
    print(ac_range.ranges)

Discovered ranges for (Strength + Agility) are:
[(4.0, 5.0), (22.0, 24.0)]


As shown, the sum of Strength and Agility for every row falls within either the (4, 5) or the (22, 24) range.
This pattern may emerge because player characters with similar combined attribute values likely belong to the same tier.
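
To build some intuition for where such ranges come from, here is a minimal sketch that groups the sums from the table above into intervals by splitting wherever two neighbouring values are far apart. This is a naive gap-based grouping, not the B-HUNT algorithm; the function name `group_into_intervals` and the `max_gap` threshold are hypothetical and chosen only for illustration.

```python
def group_into_intervals(values, max_gap):
    """Group values into closed intervals, starting a new interval
    whenever the gap between sorted neighbours exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    intervals = []
    start = prev = ordered[0]
    for value in ordered[1:]:
        if value - prev > max_gap:
            # The gap is too large: close the current interval.
            intervals.append((start, prev))
            start = value
        prev = value
    intervals.append((start, prev))
    return intervals


# Strength and Agility values copied from the table above.
strength = [3, 4, 1, 2, 1, 10, 14, 1, 6]
agility = [1, 1, 3, 2, 4, 12, 10, 23, 16]

sums = [float(s + a) for s, a in zip(strength, agility)]
intervals = group_into_intervals(sums, max_gap=3)  # max_gap is an arbitrary choice
print(intervals)
```

On this small table the naive grouping happens to give the same two intervals as the algorithm, but in general the two approaches need not agree.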

# Discovering exceptions

The algorithm also finds the rows that violate this rule, that is, rows whose result falls outside the discovered ranges.
Let's see how to get them.

In [None]:
ac_exceptions = algo.get_ac_exceptions()
print()
print('Rows in which the result of the chosen operation (+) is outside of discovered ranges:')
for ac_exception in ac_exceptions:
    id_range, column1, column2 = table_for_example.iloc[ac_exception.row_index]
    print(f'id: {id_range}')
    print(f'Agility: {column2}')
    print(f'Strength: {column1}')
    print(f'{operation_name}: {operation(column1, column2)}')

if len(ac_exceptions) == 0:
    print('None')


Rows in which the result of the chosen operation (+) is outside of discovered ranges:
None


As you can see, in this case no rows fall outside the discovered ranges.
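
The notion of an exception can also be checked by hand. The sketch below is independent of Desbordante: the helper `find_exceptions` is hypothetical, and it assumes that the discovered ranges are closed intervals. It applies the operation to each row and reports the rows whose result lies outside every range. The extra row `(5, 5)` is made up purely to show what an exception would look like.

```python
import operator


def find_exceptions(rows, ranges, operation):
    """Return (row_index, value) pairs whose operation result lies
    outside every range; ranges are treated as closed intervals."""
    exceptions = []
    for index, (left, right) in enumerate(rows):
        value = operation(left, right)
        if not any(low <= value <= high for low, high in ranges):
            exceptions.append((index, value))
    return exceptions


# (Strength, Agility) pairs copied from the table above.
rows = [(3, 1), (4, 1), (1, 3), (2, 2), (1, 4),
        (10, 12), (14, 10), (1, 23), (6, 16)]
ranges = [(4.0, 5.0), (22.0, 24.0)]

no_exceptions = find_exceptions(rows, ranges, operator.add)
# Hypothetical extra row whose sum (10) falls between the two ranges.
with_outlier = find_exceptions(rows + [(5, 5)], ranges, operator.add)

print(no_exceptions)
print(with_outlier)
```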

# Summing everything up

For further convenience, let's collect all this code into one function.

In [None]:
def run_ac_mining(table_, columns_, p_fuzz_=P_FUZZ, fuzziness_=FUZZINESS,
         bumps_limit_=BUMPS_LIMIT, weight_=WEIGHT, bin_operation_=BIN_OPERATION,
         ac_seed_=AC_SEED, iterations_limit_=ITERATIONS_LIMIT):
    operation, operation_name = OPERATIONS[bin_operation_]

    algo = desbordante.ac.algorithms.Default()

    df_without_id = table_[columns_]

    algo.load_data(table=df_without_id)

    algo.execute(p_fuzz=p_fuzz_, fuzziness=fuzziness_, bumps_limit=bumps_limit_, weight=weight_,
            bin_operation=bin_operation_, ac_seed=ac_seed_, iterations_limit=iterations_limit_)
    print()
    ac_ranges = algo.get_ac_ranges()
    for ac_range in ac_ranges:
        l_col = df_without_id.columns[ac_range.column_indices[0]]
        r_col = df_without_id.columns[ac_range.column_indices[1]]
        print('Discovered ranges ' +
              f'for ({l_col} {bin_operation_} {r_col}) are:')
        print(ac_range.ranges)

    ac_exceptions = algo.get_ac_exceptions()
    print()
    print(f'Rows in which the result of the chosen operation ({bin_operation_}) is ' +
          'outside of discovered ranges:')
    for ac_exception in ac_exceptions:
        id_range, column1, column2 = table_.iloc[ac_exception.row_index]
        print(f'id: {id_range}')
        print(f'{columns_[1]}: {column2}')
        print(f'{columns_[0]}: {column1}')
        print(f'{operation_name}: {operation(column1, column2)}\n')

    if len(ac_exceptions) == 0:
        print('None')

# Algorithm parameters

To run the algorithm, you must configure the parameters below.
The binary arithmetic operation can be one of four options:
```
 "+"
 "-"
 "*"
 "/"
```

Furthermore, the AC mining algorithm provides the following parameters
for configuring execution:


*   **weight**

    Weight accepts values in the range (0, 1]. Values closer to 1 force the algorithm to produce fewer, larger intervals (up to a single interval covering all values). Values closer to 0 force the algorithm to produce more, smaller intervals.
*   **fuzziness and p_fuzz**

    Fuzziness belongs to the (0, 1) range, while p_fuzz belongs to the [0, 1] range. Together these parameters control precision and the number of rows considered. Fuzziness values closer to 0 and p_fuzz values closer to 1 force the algorithm to include more rows (higher accuracy). Fuzziness values closer to 1 and p_fuzz values closer to 0 force the algorithm to include fewer rows (higher chance of skipping rows and lower precision, but faster execution).
*   **bumps_limit**
    
    Bumps_limit accepts only natural numbers from the range [1, inf) and limits the number of intervals for every column pair. To set bumps_limit to inf (no limit), use the value 0.
*   **iterations_limit**
    
    Iterations_limit accepts only natural numbers. Lower values (close to 1) reduce accuracy because the algorithm performs fewer iterations.
*   **AC_seed**
    
    AC_seed accepts only natural numbers. B-HUNT is a randomized algorithm, and AC_seed is its seed (passed as `ac_seed` in the code above). Fixing this parameter ensures reproducible results, which are necessary for verifying results when testing the algorithm. We also fix it in this example for demonstration purposes; otherwise, we might obtain a different number of intervals with different boundaries, which would no longer match the text accompanying our output.
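
The ranges listed above can be checked before calling the algorithm. The following is a minimal sketch of such a check; `validate_ac_params` is a hypothetical helper that is not part of Desbordante, and it simply encodes the ranges described in this section. It assumes that "natural number" means an integer of at least 1 for iterations_limit and a non-negative integer for the seed.

```python
def validate_ac_params(weight, fuzziness, p_fuzz, bumps_limit,
                       iterations_limit, ac_seed):
    """Return a list of human-readable problems; an empty list means
    all values are within the ranges described in the text."""
    errors = []
    if not 0 < weight <= 1:
        errors.append('weight must be in (0, 1]')
    if not 0 < fuzziness < 1:
        errors.append('fuzziness must be in (0, 1)')
    if not 0 <= p_fuzz <= 1:
        errors.append('p_fuzz must be in [0, 1]')
    # 0 is the special value meaning "no limit on the number of intervals".
    if not (isinstance(bumps_limit, int) and bumps_limit >= 0):
        errors.append('bumps_limit must be a natural number, or 0 for no limit')
    if not (isinstance(iterations_limit, int) and iterations_limit >= 1):
        errors.append('iterations_limit must be a natural number')
    # Assumption: any non-negative integer is accepted as a seed.
    if not (isinstance(ac_seed, int) and ac_seed >= 0):
        errors.append('ac_seed must be a non-negative integer')
    return errors


# The default values used throughout this example pass the check.
default_errors = validate_ac_params(weight=0.1, fuzziness=0.2, p_fuzz=0.85,
                                    bumps_limit=0, iterations_limit=4, ac_seed=11)
# A weight of 0 is outside (0, 1] and is reported.
bad_weight_errors = validate_ac_params(weight=0, fuzziness=0.2, p_fuzz=0.85,
                                       bumps_limit=0, iterations_limit=4, ac_seed=11)
print(default_errors)
print(bad_weight_errors)
```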

# Example #1

Let's proceed to a visual example. We will use this dataset:

In [None]:
!wget -q https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/cargo_march.csv
TABLE = 'cargo_march.csv'

As a reminder, the default parameter values are:
binary operation "-", weight 0.1, fuzziness 0.2, p_fuzz 0.85, bumps_limit 0,
iterations_limit 4, AC_seed 11.

Let's see the result of the algorithm with these parameters.


In [None]:
table_for_example = pandas.read_csv(TABLE, sep=SEPARATOR, header=HEADER)
run_ac_mining(table_for_example, ['Delivery date', 'Dispatch date'])


Discovered ranges for (Delivery date - Dispatch date) are:
[(2.0, 7.0), (15.0, 22.0)]

Rows in which the result of the chosen operation (-) is outside of discovered ranges:
id: 7
Dispatch date: 1
Delivery date: 30
Difference: 29

id: 26
Dispatch date: 7
Delivery date: 18
Difference: 11

id: 30
Dispatch date: 11
Delivery date: 22
Difference: 11



# Explanation of discovered data in example #1

You can see that the algorithm creates two intervals for the binary operation "-": (2, 7) and
(15, 22). This means that the difference between the delivery date and the dispatch date
falls within one of these intervals for every row except three, where the difference lies
outside the discovered ranges. From this, we can infer that:

**Packages for some addresses are typically delivered within 7 days.**

**Packages for some addresses take up to 22 days.**


Why these two intervals? To answer this question, more context is needed; that is, we should
look into the underlying data. We can imagine several reasons for this result, such as: 1)
nearby addresses versus far addresses; 2) air shipping versus regular shipping.

There are three parcels that fall outside of these delivery intervals. Why? This is a point
for further investigation, which requires additional context. There are many possible reasons
for this: 1) on these dates there was a workers' strike in some regions; or 2) an incorrect
address was specified, which increased the delivery time; or 3) it is just a typo in the table.


# Example #2

Now let's reduce the value of the weight parameter to 0.05.


In [None]:
table_for_example = pandas.read_csv(TABLE, sep=SEPARATOR, header=HEADER)
run_ac_mining(table_for_example, ['Delivery date', 'Dispatch date'], weight_=0.05)


Discovered ranges for (Delivery date - Dispatch date) are:
[(2.0, 7.0), (11.0, 11.0), (15.0, 22.0), (29.0, 29.0)]

Rows in which the result of the chosen operation (-) is outside of discovered ranges:
None


# Explanation of discovered data in example #2

You can see that the number of intervals increases, and there is no longer any data outside of
the discovered ranges.
However, with this number of intervals, it is difficult to draw immediate conclusions about the
delivery date. In this case, a detailed analysis of other attributes might enable more meaningful
predictions for delivery times. For example, it may be a good idea to partition data by the region
attribute (or by month/quarter) and consider each partition individually.
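
The partitioning idea can be sketched as follows. The records and the `region` attribute below are made up for illustration and are not taken from `cargo_march.csv`; the helper `delay_span_by_group` is likewise hypothetical. It groups rows by an attribute and reports the smallest and largest delivery delay in each group, which gives a rough idea of whether each partition has a single tight interval.

```python
from collections import defaultdict


def delay_span_by_group(records, group_key):
    """Map each group to the (min, max) of Delivery date - Dispatch date."""
    delays = defaultdict(list)
    for record in records:
        delay = record['Delivery date'] - record['Dispatch date']
        delays[record[group_key]].append(delay)
    return {group: (min(values), max(values))
            for group, values in delays.items()}


# Hypothetical rows; a real dataset would need an actual region column.
records = [
    {'region': 'near', 'Dispatch date': 1, 'Delivery date': 4},
    {'region': 'near', 'Dispatch date': 3, 'Delivery date': 10},
    {'region': 'far', 'Dispatch date': 2, 'Delivery date': 18},
    {'region': 'far', 'Dispatch date': 5, 'Delivery date': 27},
    {'region': 'near', 'Dispatch date': 6, 'Delivery date': 8},
]

spans = delay_span_by_group(records, 'region')
print(spans)
```

Each partition could then be passed to `run_ac_mining` separately, so that the intervals discovered for one region are not mixed with those of another.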

Another option is to try to find a parameter combination that results in a smaller number of
intervals. Also remember that the algorithm is randomized (unless you run it with the exact
settings) and can skip some rows, so you can also try changing the seed.

Finally, cleaning up the data by removing duplicate and incomplete rows might also help.
Thus, the quantity and quality of the intervals are the user's responsibility. It may take several
attempts to achieve something interesting. Experiment!