# Non-Derivable Itemsets

The goal is to write a program that decides whether an itemset is derivable. The program takes two input files:

FILE1: A list of itemsets with their support values, one per line, in the format "itemset - support". See the file: itemsets.txt

FILE2: A list of itemsets (one per line) whose support bounds are to be derived. See the file: ndi.txt

For each itemset in FILE2, the program should output:

itemset: [l,u] derivable/non-derivable

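where l and u are the lower and upper bounds on the support. An itemset is derivable when l = u, since its support is then fully determined by the supports of its subsets.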
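As a minimal sketch of where such bounds come from, consider a 2-itemset. The supports below are made-up toy values chosen for illustration; they are not taken from itemsets.txt. Each subset Y of the itemset X yields one inclusion-exclusion rule, which is an upper bound when |X \ Y| is odd and a lower bound when it is even.

```python
# Toy supports, made up for illustration (not taken from itemsets.txt).
supports = {
    (): 10,    # support of the empty itemset = number of transactions
    (1,): 7,
    (2,): 6,
}

# Deduction rules for X = {1, 2}:
#   Y = {}     (|X - Y| = 2, even) -> lower bound: supp(1) + supp(2) - supp({})
#   Y = {1}    (|X - Y| = 1, odd)  -> upper bound: supp(1)
#   Y = {2}    (|X - Y| = 1, odd)  -> upper bound: supp(2)
#   Y = {1, 2} (|X - Y| = 0, even) -> lower bound: 0
lower = max(0, supports[(1,)] + supports[(2,)] - supports[()])
upper = min(supports[(1,)], supports[(2,)])

status = "derivable" if lower == upper else "non-derivable"
print(f"1 2: [{lower},{upper}] {status}")
```

With these toy values the bounds are [3,6], so the pair is non-derivable: its support cannot be pinned down from its subsets alone.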

In [1]:
def nditem(file1='data/itemsets.txt', file2='data/ndi.txt'):
    """
    Decide, for each itemset in file2, whether it is derivable from the supports in file1.
    * Args:
        - file1: Itemsets with their support values, one per line, in the format "itemset - support".
          The first line must give the support of the empty itemset (the number of transactions).
        - file2: Itemsets (one per line) whose support bounds are to be derived
    * Output:
        Prints "itemset: [l, u] derivable/non-derivable" for each itemset in file2,
        where l and u are the lower and upper bounds on the support. Nothing is returned.
    """
    import numpy as np
    from itertools import combinations

    def IE(Y, X):
        """
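        Return the inclusion-exclusion sum for Y, a subset of itemset X:
        the signed sum of supports[J] over all J with Y <= J < X (proper subsets of X).
        A term is added when |X - J| is odd and subtracted when it is even.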
        """

        Y = np.array(Y)
        X = np.array(X)
        assert np.setdiff1d(Y, X).shape[0] == 0

        Z = np.setdiff1d(X, Y)
        z_len = Z.shape[0]
        ie = 0

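        # Enumerate every W = Y + subset, where subset is a proper subset of Z,
        # so W ranges over Y <= W < X (X itself is excluded).
        for i in range(z_len):
            for subset in combinations(Z, i):
                W = tuple(np.union1d(Y, np.array(subset)))
                # |X - W| = z_len - i; the term is added when this is odd
                if (z_len - i + 1) % 2 == 0:
                    ie += supports[W]
                else:
                    ie -= supports[W]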

        return ie


    def support_bounds(itemset):
        """
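        Return (lower_bound, upper_bound) for the support of itemset.
        IE(Y, itemset) is an upper bound when |itemset - Y| is odd and a lower bound when it is even.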
        """

        upper_bound = np.inf
        lower_bound = 0

        n = len(itemset)

        for i in range(n + 1):
            for subset in combinations(itemset, i):
                ie = IE(subset, itemset)
                if (n - i) % 2 == 1:
                    # Odd |itemset - subset|: ie bounds the support from above
                    upper_bound = min(upper_bound, ie)
                else:
                    # Even |itemset - subset|: ie bounds the support from below
                    lower_bound = max(lower_bound, ie)

        return lower_bound, upper_bound
    
    
    # Read in file1
    with open(file1, 'r') as f1:
        lines = f1.read().splitlines()
        tid_num = int(lines[0].split(' - ')[-1])  # Support of the empty itemset = number of transactions (first line of file1)
        lines = lines[1:]


    # Save the data in file1
    supports = dict()
    supports[()] = tid_num

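    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Split on the last '-' only, matching the "itemset - support" format
        itemset_text, support_text = line.rsplit('-', 1)
        support = int(support_text.strip())
        # Sort the items so keys match the sorted tuples built by np.union1d in IE
        itemset = tuple(sorted(int(item) for item in itemset_text.split()))
        supports[itemset] = support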


    # Read in file2
    with open(file2, 'r') as f2:
        lines = f2.read().splitlines()

    itemsets = [np.array(line.split()).astype(int) for line in lines]
    
    # Print out the result
    for itemset in itemsets:
        lb, ub = support_bounds(itemset)
        print(f"{' '.join(itemset.astype(str))}: [{lb}, {ub}] {'derivable' if lb == ub else 'non-derivable'}")

In [2]:
nditem()

29 34 40 52 62: [2888, 2888] derivable
7 29: [3061, 3076] non-derivable
29 48 58: [2997, 2997] derivable
7 29 36 40 52 58 60: [2890, 2890] derivable
5 40 52 60: [2893, 2893] derivable
7 36 40 58: [2952, 2952] derivable
36 40 52 58 60 66: [2888, 2888] derivable
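
One way to sanity-check results like these is to compare the bounds against a brute-force count on a small transaction database. The sketch below is a plain-Python reimplementation of the same inclusion-exclusion rules, not the cell above; the `transactions` list and the helper names `support` and `bounds` are hypothetical and made up for illustration.

```python
from itertools import combinations

# Hypothetical toy database, made up for illustration.
transactions = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1}, {2}]


def support(itemset):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def bounds(itemset):
    """Inclusion-exclusion bounds on supp(itemset) from its proper subsets."""
    items = tuple(sorted(itemset))
    lower, upper = 0, float("inf")
    for k in range(len(items) + 1):
        for y in combinations(items, k):
            rest = [i for i in items if i not in y]
            total = 0
            # Sum over every J = y + extra with y <= J < items (proper subsets only).
            for size in range(len(rest)):
                for extra in combinations(rest, size):
                    # |items - J| = len(rest) - size; add the term when this is odd
                    sign = 1 if (len(rest) - size) % 2 == 1 else -1
                    total += sign * support(y + extra)
            if len(rest) % 2 == 1:
                upper = min(upper, total)   # odd |items - y|: upper bound
            else:
                lower = max(lower, total)   # even |items - y|: lower bound
    return lower, upper


lo, hi = bounds((1, 2, 3))
true_support = support((1, 2, 3))
# The true support must always fall inside the derived interval.
assert lo <= true_support <= hi
print(f"1 2 3: [{lo}, {hi}] true support = {true_support}")
```

If the assertion fails for any itemset, the bound computation has a sign or parity error.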
