<a href="https://colab.research.google.com/github/BlackCurrantDS/DBSE_Project/blob/main/Part1_a)Generate_Rules_Apriori.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook takes the input transaction data and runs the Apriori algorithm with minimum support and minimum confidence as arguments. It generates the frequent itemset files and the association rules.

How to run:


Step 1:

In cell 1, set the folder where the output files should be created and the path to the input file, e.g.
- /content/breast_train_transactions.txt

Step 2:

Set minsup and minconf in cell 2.

Output:

This generates files with frequent itemsets and rules, with suffixes 0, 1, 2, etc., which are the input to Part 2.

e.g.

/content/freqitemsets.tmp.0
/content/freqitemsets.tmp.1
/content/miner.tmp.rules.0
/content/miner.tmp.rules.1

Consolidated frequent itemset file:
/content/miner.tmp.itemsets
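
Judging from `load_from_file` further down, the consolidated file starts with the transaction count on its first line, followed by one `items:count` line per frequent itemset. A minimal sketch of reading that layout, using an in-memory sample instead of a real file (the helper name `read_itemsets`, the item names and the counts are made up for illustration):

```python
import io

def read_itemsets(handle):
    """Parse the itemset layout: first line = number of transactions,
    then one 'item1,item2:count' line per frequent itemset."""
    n_transactions = int(handle.readline())
    itemsets = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # rsplit tolerates ':' inside item names; the notebook's loader uses split(':')
        key, count = line.rsplit(':', 1)
        itemsets[key.strip()] = int(count)
    return n_transactions, itemsets

# Hypothetical file content, for illustration only.
sample = io.StringIO("4\nbread:3\nmilk:2\nbread,milk:2\n")
n_transactions, itemsets = read_itemsets(sample)
support_bread_milk = itemsets["bread,milk"] / n_transactions
```

Dividing a count by the first-line value gives the relative support of that itemset.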


In [46]:
Filepath = "/content/drive/MyDrive/Adult_DataSet1202/Apriori1202/"
inputfilepath = "/content/drive/MyDrive/Adult_DataSet1202/adult_train_transactions.csv"

In [47]:
minsup, minconf, itemset_max_size = 0.01, 0.8, -1  # itemset_max_size = -1: no limit on itemset size
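
`RuleMiner.generate_frequent_itemsets` below multiplies `min_sup` by the number of transactions, so the relative threshold becomes an absolute transaction count. A small sketch of that conversion; the data-set size of 30000 is an assumed figure, not the size of the Adult data set, and `is_frequent` is a hypothetical helper:

```python
minsup, minconf, itemset_max_size = 0.01, 0.8, -1

n_transactions = 30000  # assumed size, for illustration only
min_count = minsup * n_transactions  # absolute support threshold (about 300)

def is_frequent(n_containing_transactions):
    """An itemset is kept when it appears in at least min_count transactions."""
    return n_containing_transactions >= min_count
```

Lowering `minsup` lowers this count and usually increases the number of itemsets and the running time.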

In [48]:
class RuleMiner(object):    
    '''
    This class generates frequent itemsets and association rules, which are
    used to build a naive belief system from the most confident association rules.
    '''

    def __init__(self, filter_name, train_data_set):
        
        self.nthreads = 4
        self.files_info = ARMFiles(Filepath)
        
        self.filter_name = filter_name
        self.data_set = train_data_set #taking input the dataset


    '''
    Generate association rules and select K patterns with highest confidence.
    '''    
    def generate_itemsets_and_rules(self, arm_params):
        self.generate_frequent_itemsets(arm_params)
        self.generate_association_rules(arm_params)
        #self.extract_features_4_all_rules()

    '''
    Generate frequent itemsets from data-set
    '''
    def generate_frequent_itemsets(self, arm_params):
        
        print ('generating frequent item-sets...')
        apriori = Apriori(self.data_set)
        apriori.generate_frequent_itemsets_vw(arm_params.min_sup * self.data_set.size(), 
                                              self.nthreads, 
                                              arm_params.itemset_max_size, 
                                              self.files_info.itemset_tmp_file)
        
    '''
    Generate association rules from data-set. 
    This method must be called after generate_frequent_itemsets(...) is called
    '''
    def generate_association_rules(self, arm_params):
        freq_itemsets_dict = self.load_frequent_itemsets_as_dict()
        
        print ('generating rules ....')
        itemset_formatter = getattr(ItemsetFormatter, self.filter_name)
        rule_formatter = getattr(RuleFormatter, self.filter_name)
        rule_generator = Generator(freq_itemsets_dict, 
                                   arm_params.min_conf, 
                                   itemset_formatter, 
                                   rule_formatter, 
                                   self.nthreads)
        rule_generator.execute(self.files_info.rules_tmp_file)


    '''
    Load generated frequent itemsets from file. 
    This method must be called after generate_frequent_itemsets is called
    '''
    def load_frequent_itemsets_as_dict(self):
        freq_itemset_dict = ItemsetDictionary(0)
        freq_itemset_dict.load_from_file(self.files_info.itemset_tmp_file)
        return freq_itemset_dict
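
`arm_params` is not defined in this notebook. The methods above only read `min_sup`, `min_conf` and `itemset_max_size` from it, so any object with those three attributes should work. A hypothetical stand-in (the name `ArmParams` is an assumption, not the project's class):

```python
from collections import namedtuple

# Hypothetical container; only the three attribute names come from RuleMiner.
ArmParams = namedtuple("ArmParams", ["min_sup", "min_conf", "itemset_max_size"])

arm_params = ArmParams(min_sup=0.01, min_conf=0.8, itemset_max_size=-1)

# -1 disables the size limit in Apriori's main loop (the `end == -1` check).
unlimited = arm_params.itemset_max_size == -1
```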

Helper classes

In [49]:
def string_2_itemset(key):
    if key == '':
        return []
    else: 
        return key.split(',')

def itemset_2_string(itemset):
    return ",".join(itemset)
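
The two helpers are inverses of each other as long as no item name contains a comma. A short round-trip sketch, with the helpers repeated so the block runs on its own (the item names are made up):

```python
def string_2_itemset(key):
    # an empty key stands for the empty itemset
    return [] if key == '' else key.split(',')

def itemset_2_string(itemset):
    return ",".join(itemset)

itemset = string_2_itemset("age=young,sex=male")
key = itemset_2_string(itemset)
empty = string_2_itemset("")
```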

In [50]:
import json

class HashItem:
    
    def __init__(self, item):
        self.last_item = item 
        self.tids = []
    
    def add_tid(self, tid):
        self.tids.append(tid)
        
    def add_tids(self, tids):
        self.tids.extend(tids)
    
    def size(self):
        return len(self.tids)
    
    def serialize(self):
        return json.dumps((self.last_item, self.tids))
    
    def deserialize(self, json_string):
        result = json.loads(json_string)
        self.last_item = result[0]
        self.tids = result[1]
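
`serialize` stores the pair `(last_item, tids)` as JSON. A quick sketch of what that looks like with the standard `json` module alone (the item name and transaction ids are made up):

```python
import json

last_item, tids = "milk", [0, 3, 7]
payload = json.dumps((last_item, tids))  # the tuple is written as a JSON array
restored = json.loads(payload)           # it comes back as a list, not a tuple
```

Because the result is a list, `deserialize` reads the two fields by position.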
        

In [51]:
class HashItemCollection:
    def __init__(self):
        self.train_data = []
    
    def __iter__(self):
        return iter(self.train_data)
    
    def get_item(self, index):
        return self.train_data[index]
    
    def get_items_from(self, index):
        return self.train_data[index : ]
    
    def size(self):
        return len(self.train_data)
    
    def is_contain(self, item):
        for current_item in self.train_data:
            if current_item.last_item == item : 
                return True
        return False
        
    def sort(self):
        self.train_data.sort(key=lambda x: x.last_item, reverse=False)
    
    def add_item(self, hash_item):
        self.train_data.append(hash_item)
        
    def find_item(self, item):
        left = 0
        right = len(self.train_data) - 1
        while left <= right:
            pivot = (left + right) // 2
            if self.train_data[pivot].last_item == item: 
                return pivot
            if self.train_data[pivot].last_item < item:
                left = pivot + 1
            else:
                right = pivot - 1 
        return -1
        
    def add_tid(self, item, tid):
        index = self.find_item(item)
        if index == -1:
            hash_item = HashItem(item)
            hash_item.add_tid(tid)
            
            index = len(self.train_data) - 1
            self.train_data.append(hash_item)
            
            while index >= 0:
                if self.train_data[index].last_item > item:
                    self.train_data[index + 1] = self.train_data[index]
                    index -= 1
                else:
                    break
            self.train_data[index + 1] = hash_item        
        else:
            self.train_data[index].add_tid(tid)
    
    def serialize(self):
        temp = []
        for item in self.train_data:
            temp.append(item.serialize())
        return json.dumps(temp)

    def deserialize(self, json_string):
        self.train_data = []
        
        temp = json.loads(json_string)
        for item_string in temp:
            item = HashItem(None)
            item.deserialize(item_string)
            self.train_data.append(item)
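
`add_tid` keeps the collection sorted by `last_item`: it looks the item up with a binary search and, if it is missing, shifts larger entries to make room. The same idea can be sketched with the standard `bisect` module on plain lists. This is an alternative illustration, not the notebook's implementation, and the `entries`/`keys` layout is an assumption:

```python
import bisect

def add_tid(entries, keys, item, tid):
    """entries: list of [item, tids] sorted by item; keys: parallel sorted list of items."""
    pos = bisect.bisect_left(keys, item)
    if pos < len(keys) and keys[pos] == item:
        entries[pos][1].append(tid)         # item already known: record the transaction id
    else:
        keys.insert(pos, item)              # new item: insert at its sorted position
        entries.insert(pos, [item, [tid]])

entries, keys = [], []
transactions = [["milk", "bread"], ["bread"], ["eggs", "milk"]]
for tid, transaction in enumerate(transactions):
    for item in transaction:
        add_tid(entries, keys, item, tid)
```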
        

In [52]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class ItemsetDictionary(object):
    

    def __init__(self, ntransactions = 0):
        self.itemsets = {}
        self.ntransactions = ntransactions
            
    def size(self):
        return len(self.itemsets)
    
    def exists(self, itemset_key):
        return itemset_key in self.itemsets
    
    def add_itemset(self, itemset_key, amount):
        self.itemsets[itemset_key] = amount
        
    def clear(self):
        self.itemsets.clear()
    
    def convert_2_indexes(self):
        k = 0
        dict_items_indexes = {}
        for item_name, _ in self.itemsets.items():
            dict_items_indexes[item_name] = k
            k += 1
        return dict_items_indexes
            
    def get_names(self):
        return self.itemsets.keys()
        
    def get_frequency(self, itemset_key):
        if self.exists(itemset_key):
            return self.itemsets[itemset_key]
        return 0
        
    def getConfidence(self, rule):
        left = self.get_frequency(rule.lhs_string())
        both = self.get_frequency(rule.rule_itemset_2_string())
        if left == 0: return 0
        return both/left
    
    def get_frequency_combo(self, rule):
        left = self.get_frequency(rule.lhs_string())
        right =self.get_frequency(rule.rhs_string())
        both = self.get_frequency(rule.rule_itemset_2_string())
        
        return left, right, both
    
    def get_support(self, itemset_key):     
        return self.get_frequency(itemset_key)/self.ntransactions
       
    def split(self, nchunks):
        itemsets_names = self.itemsets.keys()
        nitemsets = len(itemsets_names)
        
        print ('Number of frequent item-sets: ' + str(nitemsets))
        itemset_chunks = [[] for _ in range(nchunks)]
        size_of_chunk = (int)(nitemsets/nchunks) + 1
                    
        index = 0
        counter = 0
        
        for itemset_key in itemsets_names:
            if counter < size_of_chunk:
                itemset_chunks[index].append(string_2_itemset(itemset_key))
                counter += 1
            elif counter == size_of_chunk:
                index += 1
                itemset_chunks[index].append(string_2_itemset(itemset_key))
                counter = 1  
                  
        return itemset_chunks
    
    def save_2_file(self, file_name, write_mode = 'a', write_support = False):
        with open(file_name, write_mode) as text_file:
            for key, value in self.itemsets.items():
                t = value
                if write_support == True:
                    t = value/self.ntransactions
                text_file.write(key + ':' + str(t))
                text_file.write('\n')
            
    def load_from_file(self,file_name):
        self.itemsets.clear()
        
        with open(file_name, "r") as text_file:
            self.ntransactions = int(text_file.readline())
            for line in text_file:
                #print (line)
                subStrings = line.split(':')
                itemset_key = subStrings[0].strip()
                frequency = int(subStrings[1].strip())
                
                self.itemsets[itemset_key] = frequency
                
    def _complement_condition(self, r1, r2):
        merged_itemset = merge_itemsets(r1.left_items, 
                                        r2.left_items)
        
        s = self.get_frequency(itemset_2_string(merged_itemset))
        sl = self.get_frequency(r1.lhs_string())
        sr = self.get_frequency(r2.lhs_string())
    
        #if s > 0: return True
        return max(s/sl, s/sr)
     
        
    '''
    Check whether two rules contradict each other, based on the matching function.
    r1, r2: dictionaries of the form {'r': rule, 'f': feature vector}
    contrast_params: contains the thresholds and the sizes of the LHS and RHS features
    '''
    def is_contrast(self, r1, r2, contrast_params):
        
        n = contrast_params.n_lhs_features
        a = cosine_similarity(np.reshape(r1['f'][n:], (1, -1)),
                              np.reshape(r2['f'][n:], (1, -1)))[0,0]
        if a > contrast_params.delta2: return (False, 0, 0)
        
        b = cosine_similarity(np.reshape(r1['f'][:n], (1, -1)), 
                              np.reshape(r2['f'][:n], (1, -1)))[0,0]
        if b <= contrast_params.delta1: return (False, 0, 0)
        
        t = self._complement_condition(r1['r'], r2['r'])
        if t > contrast_params.share_threshold:
            return (True, b, t)
        return (False, 0, 0)
    
    
    def is_inner_contrast(self, group, contrast_params):
        #print('check inner')
        both_condition = self.find_pottential_contrast_locs(group, group, contrast_params)
        if both_condition is None: return False 
        
        for i in range(len(both_condition[0])):
            x = both_condition[0][i]
            y = both_condition[1][i]
            if x >= y: continue
            t = self._complement_condition(group['r'][x], group['r'][y])
            if t > contrast_params.share_threshold: return True 
            
        return False

        
        
    def find_pottential_contrast_locs(self, group1, group2, contrast_params):
        rhs_sim = cosine_similarity(group1['rhs'], group2['rhs']) 
        rhs_condition = (rhs_sim > contrast_params.delta2).astype(int) 
        if np.all(rhs_condition > 0): return None
    
        
        lhs_sim = cosine_similarity(group1['lhs'], group2['lhs'])
        lhs_condition = (lhs_sim <= contrast_params.delta1).astype(int)
        if np.all(lhs_condition > 0): return None
        
        locs = np.where(lhs_condition + rhs_condition <= 0)
        return locs 
        
    def is_outer_contrast(self, group1, group2, contrast_params):
        #print('check outer')
        both_condition = self.find_pottential_contrast_locs(group1, group2, contrast_params)
        if both_condition is None: return False 
        
        for i in range(len(both_condition[0])):
            x = both_condition[0][i]
            y = both_condition[1][i]
            t = self._complement_condition(group1['r'][x], group2['r'][y])
            if t > contrast_params.share_threshold: return True 
            
        return False
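
`get_support` and `getConfidence` reduce to two ratios: support is the itemset count divided by the number of transactions, and confidence is the count of the whole rule itemset divided by the count of its left-hand side. A minimal sketch on a plain dictionary (the counts are made up, and the function names are hypothetical):

```python
counts = {"bread": 3, "milk": 2, "bread,milk": 2}  # made-up counts
n_transactions = 4

def support(key):
    return counts.get(key, 0) / n_transactions

def confidence(lhs_key, rule_key):
    """count(LHS and RHS together) / count(LHS); 0 when the LHS was never seen."""
    left = counts.get(lhs_key, 0)
    return 0 if left == 0 else counts.get(rule_key, 0) / left

conf_bread_to_milk = confidence("bread", "bread,milk")
conf_milk_to_bread = confidence("milk", "bread,milk")
```

With `minconf = 0.8`, only the rule milk -> bread would pass in this toy example.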
    

In [53]:
class HashTable:
    def __init__(self):
        self.table = {}
        
    def size(self):
        return len(self.table)
    
    def is_empty(self):
        return len(self.table) == 0
    
    def is_contain(self, key, last_item):
        return (key in self.table) and (self.table[key].is_contain(last_item))
    
    def get_items(self):
        return self.table.items()
    
    # insert a new key into the table
    def insert_key(self, key):
        self.table[key] = HashItemCollection()
    
    def insert(self, key, value):
        self.table[key] = value
            
    # remove a key from the table
    def remove_item(self, key):
        self.table.pop(key, None)
        
    # insert a new transaction id into a specific item-set
    def add_tid(self, key, item, tid):
        self.table[key].add_tid(item, tid)
        
    # insert an item-set and its transaction ids
    def add_item(self, key, hash_item):
        self.table[key].add_item(hash_item)
    
    # get all item-sets in the hash table
    def generate_itemset_dictionary(self):
        collection = ItemsetDictionary()
        for key, hash_item_collection in self.table.items():
            for hash_item in hash_item_collection:
                new_key = ''
                if key == '': 
                    new_key = hash_item.last_item
                else:
                    new_key = key + ',' + hash_item.last_item
                collection.add_itemset(new_key, hash_item.size())
        return collection
    
    def generate_itemset_dictionary_vw(self, output_file, write_mode):
        count = 0
        # 'with' closes the file even if a write fails
        with open(output_file, write_mode) as file_writer:
            for key, hash_item_collection in self.table.items():
                for hash_item in hash_item_collection:
                    if key == '': 
                        new_key = hash_item.last_item
                    else:
                        new_key = key + ',' + hash_item.last_item
                    file_writer.write(new_key + ':' + str(hash_item.size()))
                    file_writer.write('\n')
                    count += 1
        return count
    
    # get the number of item-sets that share the same first K - 1 items
    def count_itemsets(self, key):
        return self.table[key].size()
    
    # get frequent item-set
    def generate_frequent_itemsets(self, minsup):
        L = HashTable()
        for key, hash_item_collection in self.table.items():
            L.insert_key(key)
            for hash_item in hash_item_collection:
                if hash_item.size() >= minsup:
                    L.add_item(key, hash_item)
            if L.count_itemsets(key) == 0:
                L.remove_item(key)
        return L
                 
    def sort(self):
        for hash_item_collection in self.table.values():
            hash_item_collection.sort()

    # used to merge the results produced by separate processes
    def append(self, other_hash_table):
   
        for key, hash_item_collection in other_hash_table.get_items():
            self.table[key] = hash_item_collection

    def clear(self):
        self.table.clear()
        
    def split(self, n):
        number_of_keys = self.size()
        if number_of_keys < n:
            return [self]
        
        number_for_each_part = (int)(number_of_keys/n) + 1
        counter = 0
        sub_hash_tables = []
        sub_hash_table = HashTable()
        
        for key, hash_item_collection in self.get_items():
            if counter < number_for_each_part:
                sub_hash_table.insert(key, hash_item_collection)
            elif counter == number_for_each_part:
                sub_hash_tables.append(sub_hash_table)
                sub_hash_table = HashTable()
                sub_hash_table.insert(key, hash_item_collection)
                counter = 0
            counter += 1
        sub_hash_tables.append(sub_hash_table)
        return sub_hash_tables     
    
    def serialize(self, file_name):
        with open(file_name, "w") as text_file:
            #json.dump(self.table, text_file)
            k = 0
            for key, value in self.table.items():
                if k > 0:
                    text_file.write('\n')
                text_file.write(key)
                text_file.write('\n')
                text_file.write(value.serialize())
                k += 1
            
    def deserialize(self, file_name, reset_table = True):
        if reset_table == True:
            self.table = {}
        with open(file_name, "r") as text_file:
            k = 0
            collection_key = None
            for line in text_file:
                if k % 2 == 0:
                    collection_key = line.strip()
                else:
                    collection = HashItemCollection()
                    collection.deserialize(line.strip())
                    self.table[collection_key] = collection
                k = k + 1
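
`HashTable` groups itemsets by their first K - 1 items: the key is the comma-joined prefix, and each entry under it holds the last item plus the ids of the transactions containing the whole itemset. A sketch of the two core operations on nested dictionaries (a simplified layout chosen for illustration, not the notebook's classes; the data is made up):

```python
def to_itemset_counts(table):
    """Flatten {prefix: {last_item: tids}} into {itemset_key: count}."""
    counts = {}
    for prefix, last_items in table.items():
        for last_item, tids in last_items.items():
            key = last_item if prefix == '' else prefix + ',' + last_item
            counts[key] = len(tids)
    return counts

def keep_frequent(table, min_count):
    """Drop itemsets below min_count, and prefixes left without any itemset."""
    result = {}
    for prefix, last_items in table.items():
        kept = {item: tids for item, tids in last_items.items() if len(tids) >= min_count}
        if kept:
            result[prefix] = kept
    return result

# 1-itemsets live under the empty prefix.
level_1 = {'': {"bread": [0, 1, 2], "eggs": [2], "milk": [0, 2]}}
# 2-itemsets are keyed by their first item.
level_2 = {"bread": {"eggs": [2], "milk": [0, 2]}, "eggs": {"milk": [2]}}

counts_1 = to_itemset_counts(keep_frequent(level_1, 2))
counts_2 = to_itemset_counts(keep_frequent(level_2, 2))
```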

In [54]:
from multiprocessing import Process
from multiprocessing.managers import BaseManager

class Apriori:
    def __init__(self, train_data_set):
        self.tmp_folder = Filepath
        self.freq_itemsets_tmp_file = self.tmp_folder + 'freqitemsets.tmp'
        self.itemsets_tmp_file = self.tmp_folder + 'itemsetscandidates.tmp'
        self.freq_k_item_sets_tmp_file = self.tmp_folder + 'freq_k_itemsets.tmp'
        self.data_set = train_data_set
        self.L1 = None
        
       
        
    
    def generate_L1(self, min_sup):
        C_1 = HashTable()
        itemset_key = ''
        C_1.insert_key(itemset_key)
    
        n = self.data_set.size()
        print ('size of data-set: ' + str(n))
        
        for tid in range(n):
            transaction = self.data_set.get_transaction(tid)
            for item in transaction:
                C_1.add_tid(itemset_key, item, tid)
            
        print ('get frequent item sets with 1 item')
        self.L1 = C_1.generate_frequent_itemsets(min_sup)
      
    @staticmethod
    def generate_Lk(min_sup, L_k1, C_k, k):
        print('generate candidates with ' + str(k) + ' items')
        for key, hash_item_collection in L_k1.get_items():
            for index in range(hash_item_collection.size() - 1):
                
                index_th_item = hash_item_collection.get_item(index)
                new_key = ''
                if key == '':
                    new_key = index_th_item.last_item
                else:
                    new_key = key +',' + index_th_item.last_item
                new_hash_collection = HashItemCollection()
                
                # keep only the extensions whose shared transaction ids reach min_sup
                for item in hash_item_collection.get_items_from(index + 1):
                    new_item = HashItem(item.last_item)
                    inter_items = set(index_th_item.tids).intersection(item.tids)      
                    if len(inter_items) >= min_sup:  
                        new_item.add_tids(list(inter_items))
                        new_hash_collection.add_item(new_item)
                        
                if new_hash_collection.size() > 0:        
                    C_k.insert(new_key,  new_hash_collection) 

    def generate_frequent_itemsets(self, min_sup, nthreads, end, output_file, write_support = False):
        
        '''
        Step 1: Generate frequent item-sets with 1 item and write to file
        '''
        nTransactions = self.data_set.size()
        with open(output_file, 'w') as text_file:
            text_file.write(str(nTransactions))
            text_file.write('\n')
        
        
        self.generate_L1(min_sup)
        freq_itemsets_dict = self.L1.generate_itemset_dictionary()
        freq_itemsets_dict.ntransactions = nTransactions
        freq_itemsets_dict.save_2_file(output_file, 'a', write_support)
        freq_itemsets_dict.clear()
        
        '''
        Step 2: Generate frequent item-sets with more than 1 item and append to the file
        '''
        k = 2    
        L_k1 = self.L1
        
        while not L_k1.is_empty() and (end == -1 or k <= end):
            
            print('extracting item-sets with ' + str(k) + ' items ....')
            
            '''
            Divide data into many parts and create processes to generate frequent item-sets
            '''
            L_k = HashTable()
            chunks = L_k1.split(nthreads)
            processes = []
            
            C_ks = []
            BaseManager.register("AprioriHash", HashTable)
            manager = BaseManager()
            manager.start()
            
            index = 0
            for L_k_1_chunk in chunks:
                # one shared result table per chunk, so C_ks[index] always exists
                C_ks.append(manager.AprioriHash())
                process_i = Process(target = Apriori.generate_Lk, 
                                    args=(min_sup, L_k_1_chunk,C_ks[index], k))
                processes.append(process_i)
                index += 1
            
            # start every process first, then wait for all of them to complete
            for process_i in processes:
                process_i.start()
            for process_i in processes:
                process_i.join()
             
            '''
            Merge the results returned by the processes
            '''
            for new_C_k in C_ks:
                L_k.append(new_C_k)
            L_k1.clear()
            L_k1 = L_k
    
            '''
            Append frequent item-sets with k items to file
            '''
            freq_itemsets_dict = L_k1.generate_itemset_dictionary()
            
            print ('Writing frequent itemset to file ' + str(freq_itemsets_dict.size()))
            freq_itemsets_dict.ntransactions = nTransactions
            freq_itemsets_dict.save_2_file(output_file, 'a', write_support)
            freq_itemsets_dict.clear()
            
            k += 1
            
        print ('stop at k = ' + str(k))
     
    @staticmethod
    def generate_Lk_vw(min_sup, L_k1, C_k_file, k):
        print('generate candidates with ' + str(k) + ' items')
        file_writer = open(C_k_file, 'w') 
        for key, hash_item_collection in L_k1.get_items():
            for index in range(hash_item_collection.size() - 1):
                
                index_th_item = hash_item_collection.get_item(index)
                new_key = ''
                if key == '':
                    new_key = index_th_item.last_item
                else:
                    new_key = key +',' + index_th_item.last_item
                new_hash_collection = HashItemCollection()
                
                # keep only the extensions whose shared transaction ids reach min_sup
                for item in hash_item_collection.get_items_from(index + 1):
                    new_item = HashItem(item.last_item)
                    inter_items = set(index_th_item.tids).intersection(item.tids)      
                    if len(inter_items) >= min_sup:  
                        new_item.add_tids(list(inter_items))
                        new_hash_collection.add_item(new_item)
                        
                if new_hash_collection.size() > 0:  
                    file_writer.write(new_key)
                    file_writer.write('\n')
                    file_writer.write(new_hash_collection.serialize())      
                    file_writer.write('\n')
        file_writer.close()

    def generate_frequent_itemsets_vw(self, min_sup, nThreads, end, output_file):
        
        '''
        Step 1: Generate frequent item-sets with 1 item and write to file
        '''
        ntransactions = self.data_set.size()
        with open(output_file, 'w') as text_file:
            text_file.write(str(ntransactions))
            text_file.write('\n')
        
        
        self.generate_L1(min_sup)
        self.L1.generate_itemset_dictionary_vw(output_file, 'a')
        
        '''
        Step 2: Generate frequent item-sets with more than 1 item and append to the file
        '''
        k = 2    
        L_k1 = self.L1
        
        while not L_k1.is_empty() and (end == -1 or k <= end):
            
            print('extracting item-sets with ' + str(k) + ' items ....')
            
            '''
            Divide data into many parts and create processes to generate frequent item-sets
            '''
            chunks = L_k1.split(nThreads)
            L_k1 = None
            processes = []
            
            index = 0
            for L_k_1_chunk in chunks:
                chunk_output_file = self.freq_itemsets_tmp_file +'.'+ str(index)
                process_i = Process(target = Apriori.generate_Lk_vw, 
                                    args=(min_sup, L_k_1_chunk,chunk_output_file, k))
                processes.append(process_i)
                index += 1
            
            # start every process first, then wait for all of them to complete
            for process_i in processes:
                process_i.start()
            for process_i in processes:
                process_i.join()
             
            '''
            Merge the results returned by the processes
            '''
            L_k1 = HashTable()
            for index in range(len(chunks)):
                chunk_input_file = self.freq_itemsets_tmp_file +'.'+ str(index)
                L_k1.deserialize(chunk_input_file, False)
            
            '''
            Append frequent item-sets with k items to file
            '''
            print ('Writing frequent itemset to file....')
            x = L_k1.generate_itemset_dictionary_vw(output_file, 'a')
            print ('#item-sets: ' + str(x))
            k += 1
            
        print ('stop at k = ' + str(k))

    def get_item_interaction_matrix(self):
        self.generate_L1(0)
        items_dict = self.L1.generate_itemset_dictionary()
        items_dict.ntransactions = self.data_set.size()
        
        nItems = items_dict.size()
        dict_item_indexes = items_dict.convert_2_indexes()
            
        A = np.zeros((nItems, nItems))
        for transaction in self.data_set:
            indexes = []
            for item_name in transaction:
                indexes.append(dict_item_indexes[item_name])
            for i in range(len(indexes)):
                for j in range(i+1, len(indexes)):
                    A[indexes[i], indexes[j]] += 1
                    A[indexes[j], indexes[i]] += 1
        return dict_item_indexes, A
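
The core of `generate_Lk` and `generate_Lk_vw` is a join on transaction-id lists: two itemsets that share a prefix are combined, and the new itemset's transactions are the intersection of the two id lists. A single-process sketch of that step on plain dictionaries (the layout and the name `next_level` are assumptions made for illustration; the data is made up, and the ids are sorted here only to make the output deterministic):

```python
def next_level(level, min_count):
    """level: {prefix: [(last_item, tids), ...]}, each list sorted by last_item.
    Returns the same structure for itemsets that are one item longer."""
    result = {}
    for prefix, entries in level.items():
        for i, (item_i, tids_i) in enumerate(entries[:-1]):
            new_prefix = item_i if prefix == '' else prefix + ',' + item_i
            extended = []
            for item_j, tids_j in entries[i + 1:]:
                shared = sorted(set(tids_i).intersection(tids_j))
                if len(shared) >= min_count:   # support check on the intersection
                    extended.append((item_j, shared))
            if extended:
                result[new_prefix] = extended
    return result

L1 = {'': [("bread", [0, 1, 2]), ("eggs", [1, 2]), ("milk", [0, 2])]}
L2 = next_level(L1, 1)
L3 = next_level(L2, 1)
L4 = next_level(L3, 1)  # empty: the main loop stops here
```

Since every prefix group is processed independently, the groups can be split into chunks and handed to separate processes, which is what the `split` call in the main loop prepares.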
    
        
        

In [55]:
class ItemsetFormatter(object):
   
    @staticmethod
    def mydefault(itemset):
        return True
    
    @staticmethod
    def mass(itemset):
        for item in itemset:
            if not item.isdigit():
                return True
        return False
    
    @staticmethod
    def tcr(itemset):
        for item in itemset:
            if item == 'CD4' or item == 'CD8':
                return True
        return False
    
    @staticmethod
    def rna(itemset):
        for item in itemset:
            if 'rna_' in item:
                return True
        return False
        
    @staticmethod
    def ank3(itemset):
        for item in itemset:
            if item == 'CASE' or item == 'HEALTHY':
                return True
        return False
    
    @staticmethod
    def spect(itemset):
        for item in itemset:
            if 'class@' in item:
                return True
        return False
    
    @staticmethod
    def kdd(itemset):
        for item in itemset:
            if 'c_' in item:
                return True
        return False
    
    @staticmethod
    def tcrm(itemset):
        a_count = 0
        b_count = 0
        for item in itemset:
            if 'b_' in item:
                b_count += 1
            if 'a_' in item:
                a_count += 1
        return (a_count > 0 and b_count > 0)

    @staticmethod
    def ppi(itemset):
        a_count = 0
        b_count = 0
        for item in itemset:
            if 'h@' in item:
                b_count += 1
            if 'v@' in item:
                a_count += 1
        return (a_count > 0 and b_count > 0)

    @staticmethod
    def splice(itemset):
        for item in itemset:
            if item == 'EI' or item == 'IE' or item == 'N@':
                return True
        return False
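
`RuleMiner` picks one of these filters by name with `getattr(ItemsetFormatter, self.filter_name)`, so `filter_name` must match a method name exactly. A small sketch of that lookup on a cut-down class (`DemoItemsetFilter` is a hypothetical stand-in, and the item names are made up):

```python
class DemoItemsetFilter:
    """Hypothetical stand-in for ItemsetFormatter with two filters."""

    @staticmethod
    def mydefault(itemset):
        return True

    @staticmethod
    def spect(itemset):
        # accept itemsets that contain at least one class-label item
        return any('class@' in item for item in itemset)

filter_name = "spect"
itemset_filter = getattr(DemoItemsetFilter, filter_name)  # look the method up by name

with_label = itemset_filter(["age=young", "class@1"])
without_label = itemset_filter(["age=young", "sex=male"])
```

An unknown name raises `AttributeError` at lookup time, before any mining starts.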

In [56]:
class RuleFormatter(object):
    
    @staticmethod
    def mydefaultLeft(item):
        return True
    
    @staticmethod
    def mydefaultRight(item):
        return True
    
    @staticmethod
    def mydefault(rule):
        #return True
        return len(rule.right_items) == 1  # alternative: <= 2
    
    
    @staticmethod
    def massLeft(item):
        return item.isdigit()
    
    @staticmethod
    def massRight(item):
        return not item.isdigit()
    
    @staticmethod
    def mass(rule):
        return rule.lhs_string().isdigit() and (not rule.rhs_string().isdigit())
    
    @staticmethod
    def rna(rule):
        condition = (len(rule.right_items) == 1)
        condition &= ('rna_' in rule.rhs_string())
        condition &=  ('rna_' not in rule.lhs_string())
        return condition
    
    @staticmethod
    def tcrLeft(item):
        return item != 'CD4' and item != 'CD8'
    
    @staticmethod
    def tcrRight(item):
        return item == 'CD4' or item == 'CD8'    
    
    @staticmethod
    def tcr(rule):
        left_key = rule.lhs_string()
        right_key = rule.rhs_string()
        return ('CD4' not in left_key) and ('CD8' not in left_key) and (right_key == 'CD4' or right_key == 'CD8')
    
    @staticmethod
    def ank3Left(item):
        return item != 'CASE' and item != 'HEALTHY'
    
    @staticmethod
    def ank3Right(item):
        return item == 'CASE' or item == 'HEALTHY'
        
    @staticmethod
    def ank3(rule):
        left_key = rule.lhs_string()
        right_key = rule.rhs_string()
        return ('CASE' not in left_key) and ('HEALTHY' not in left_key) and (right_key == 'CASE' or right_key == 'HEALTHY')
    
    @staticmethod
    def spectLeft(item):
        return 'class@' not in item
    
    @staticmethod
    def spectRight(item):
        return 'class@' in item
        
    @staticmethod
    def spect(rule):
        flag = True
        for item in rule.right_items:
            if 'class@' not in item:
                flag = False
                break
        left_key = rule.lhs_string()
        return ('class@' not in left_key) and flag
    
    @staticmethod
    def kddLeft(item):
        return 'c_' not in item
    
    @staticmethod
    def kddRight(item):
        return 'c_' in item
        
    @staticmethod
    def kdd(rule):
        left_key = rule.lhs_string()
        right_key = rule.rhs_string()
        return ('c_' not in left_key) and (len(rule.right_items) == 1 and 'c_' in right_key)
    
    @staticmethod
    def tcrmLeft(item):
        return True
    
    @staticmethod
    def tcrmRight(item):
        return True
        
    @staticmethod
    def tcrm(rule):
        a_count1 = 0
        b_count1 = 0
        for item in rule.left_items:
            if 'b_' in item:
                b_count1 += 1
            if 'a_' in item:
                a_count1 += 1
        if a_count1 > 0 and b_count1 > 0: return False
        
        a_count2 = 0
        b_count2 = 0
        for item in rule.right_items:
            if 'b_' in item:
                b_count2 += 1
            if 'a_' in item:
                a_count2 += 1
        if a_count2 > 0 and b_count2 > 0: return False
        
        return (a_count1 > 0 and b_count2 > 0) or (b_count1 > 0 and a_count2 > 0)
    
    @staticmethod
    def ppiLeft(item):
        return True
    
    @staticmethod
    def ppiRight(item):
        return True
        
    @staticmethod
    def ppi(rule):
        a_count1 = 0
        b_count1 = 0
        for item in rule.left_items:
            if 'h@' in item:
                b_count1 += 1
            if 'v@' in item:
                a_count1 += 1
        if a_count1 > 0 and b_count1 > 0: return False
        
        a_count2 = 0
        b_count2 = 0
        for item in rule.right_items:
            if 'h@' in item:
                b_count2 += 1
            if 'v@' in item:
                a_count2 += 1
        if a_count2 > 0 and b_count2 > 0: return False
        
        return (a_count1 > 0 and b_count2 > 0) or (b_count1 > 0 and a_count2 > 0)
    
    
    @staticmethod
    def spliceLeft(item):
        return item != 'EI' and item != 'IE' and item != 'N@'
    
    @staticmethod
    def spliceRight(item):
        return item == 'D_0' or item == 'D_1' or item == 'N@'
        
    @staticmethod
    def splice(rule):
        left_key = rule.lhs_string()
        right_key = rule.rhs_string()
        return ('EI' not in left_key) and ('IE' not in left_key) and ('N@' not in left_key) and (right_key == 'EI' or right_key == 'IE' or right_key == 'N@')
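
The dataset-specific filters above share one pattern: the `...Left`/`...Right` helpers decide on which side a single item may appear, and the rule-level check accepts a rule only when the class items sit on the right-hand side. A minimal sketch of the `spect` check, assuming plain lists in place of `rule.left_items` and `rule.right_items` and checking item by item rather than on the joined string (the helper name and the item names are hypothetical):

```python
def is_class_rule(left_items, right_items, marker='class@'):
    """Accept a rule only if every RHS item is a class item and no LHS item is."""
    rhs_ok = all(marker in item for item in right_items)
    lhs_ok = all(marker not in item for item in left_items)
    return lhs_ok and rhs_ok

# Hypothetical items, for illustration only.
print(is_class_rule(['age@young', 'sex@male'], ['class@1']))   # True
print(is_class_rule(['class@1'], ['age@young']))               # False
print(is_class_rule(['age@young', 'class@1'], ['class@0']))    # False
```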
        

In [57]:
class AssociationRule:
    def __init__(self, left, right):
        self.left_items = left
        self.right_items = right
        self.scores = []
        
    def length(self):
        return len(self.left_items) + len(self.right_items)
     
    def score(self, index):
        return self.scores[index]
    
    def lhs_string(self):
        return itemset_2_string(self.left_items)
        
    def rhs_string(self):
        return itemset_2_string(self.right_items)
    
    def serialize(self):
        left_key = self.lhs_string()
        right_key = self.rhs_string()
        return left_key + ">" + right_key
    
    @staticmethod        
    def string_2_rule(s):
        subStrings = s.split(">")
        left = string_2_itemset(subStrings[0].strip())
        right = string_2_itemset(subStrings[1].strip())
        #print("AssociationRule(left, right",AssociationRule(left, right))
        return AssociationRule(left, right)

    def append_score(self, score):
        self.scores.append(score)
        
    def get_itemset(self):
        itemset = []
        itemset.extend(self.left_items)
        itemset.extend(self.right_items)
        itemset.sort()
        return itemset
        
        
    def rule_itemset_2_string(self):
        itemset = self.get_itemset()
        return itemset_2_string(itemset)
    
    def compute_basic_probs(self,frequent_itemsets, nTransactions):  
        
        left = frequent_itemsets[self.lhs_string()]
        right = frequent_itemsets[self.rhs_string()]
        
        both = frequent_itemsets[self.rule_itemset_2_string()]
        
        vector = {}
        
        ''' 1. P(A)'''
        p_A = left/nTransactions
        vector['A'] = p_A
        
        ''' 2. P(B)'''
        p_B = right/nTransactions
        vector['B'] = p_B
        
        ''' 3. P(~A)'''
        p_not_A = 1 - p_A
        vector['~A'] = p_not_A
        
        ''' 4. P(~B)'''
        p_not_B = 1 - p_B
        vector['~B'] = p_not_B
        
        ''' 5. P(AB) '''
        p_A_and_B = both/nTransactions
        vector['AB'] = p_A_and_B
        
        ''' 6. P(~AB)'''
        p_not_A_and_B = (right - both)/nTransactions
        vector['~AB'] = p_not_A_and_B
        
        ''' 7. P(A~B)'''
        p_A_and_not_B = (left - both)/nTransactions
        vector['A~B'] = p_A_and_not_B
        
        ''' 8. P(~A~B)'''
        p_not_A_and_not_B = 1 - (left + right - both)/nTransactions
        vector['~A~B'] = p_not_A_and_not_B 
        
        '''
        9. P(A|B)
        '''
        p_A_if_B = p_A_and_B / p_B
        vector['A|B'] = p_A_if_B
        
        '''
        10. P(~A|~B)
        '''
        p_not_A_if_not_B = p_not_A_and_not_B / p_not_B
        vector['~A|~B'] = p_not_A_if_not_B
        
        '''
        11. P(A|~B)
        '''
        p_A_if_not_B = p_A_and_not_B/p_not_B
        vector['A|~B'] = p_A_if_not_B
        
        '''
        12. p(~A|B)
        '''
        p_not_A_if_B = p_not_A_and_B / p_B
        vector['~A|B'] = p_not_A_if_B
        
        '''
        13. P(B|A)
        '''
        p_B_if_A = p_A_and_B / p_A
        vector['B|A'] = p_B_if_A
        
        '''
        14. P(~B|~A)
        '''
        p_not_B_if_not_A = p_not_A_and_not_B / p_not_A
        vector['~B|~A'] = p_not_B_if_not_A
        
        '''
        15. P(B|~A)
        '''
        p_B_if_not_A = p_not_A_and_B/p_not_A
        vector['B|~A'] = p_B_if_not_A
        
        '''
        16. p(~B|A)
        '''
        p_not_B_if_A = p_A_and_not_B / p_A
        vector['~B|A'] = p_not_B_if_A
        
        return vector
    
    def is_redundant_(self, bits, k, itemset, freq_itemset_dict): 
        '''
        All items assigned --> check whether items_1 implies any remaining item with confidence 1
        '''
        if k >= len(itemset):
            items_1 = []
            items_2 = []
            for index in range(len(bits)):
                if bits[index] == True:
                    items_1.append(itemset[index])
                else:
                    items_2.append(itemset[index])
            for item in items_2:
                rule = AssociationRule(items_1, [item])
                confidence = freq_itemset_dict.getConfidence(rule)
                if confidence == 1: return True
            return False 
      
        value_domain = [True, False]
        for value in value_domain:
            bits[k] = value
            checker = self.is_redundant_(bits, k+1, itemset, freq_itemset_dict)
            if checker == True: return True
            bits[k] = True    
        return False
    
    '''
    Check whether the rule is redundant: on the LHS or on the RHS, some of the
    items imply another item of the same side with confidence 1.
    '''
    def is_redundant(self, freq_itemset_dict):
        bits = [True for _ in self.left_items]
        checker = self.is_redundant_(bits, 0, self.left_items, freq_itemset_dict)
        if checker == True: return True
        
        bits =  [True for _ in self.right_items]
        return self.is_redundant_(bits, 0, self.right_items, freq_itemset_dict)
    
    '''
    Check whether an item-set satisfies one side of the rule (LHS by default),
    i.e. contains every item of that side.
    '''
    def satisfy_rule(self, itemset, is_lhs = True):
        condition = self.left_items
        if is_lhs == False: condition = self.right_items
        if len(condition) > len(itemset) or len(itemset) == 0:
            return False
        for item in condition:
            if item not in itemset:
                return False
        return True
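
All sixteen entries returned by `compute_basic_probs` follow from three counts (transactions containing the LHS, the RHS, and both) plus the total number of transactions. A small worked sketch with made-up counts, showing a handful of the entries; `basic_probs` is an illustrative helper, not part of the notebook:

```python
def basic_probs(left, right, both, n):
    """Derive a few rule probabilities from raw transaction counts."""
    return {
        'A': left / n,                          # P(A): support of the LHS
        'B': right / n,                         # P(B): support of the RHS
        'AB': both / n,                         # P(AB): support of the whole rule
        'A~B': (left - both) / n,               # LHS present, RHS absent
        '~A~B': 1 - (left + right - both) / n,  # neither side present
        'B|A': both / left,                     # confidence of A -> B
        'A|B': both / right,                    # confidence of B -> A
    }

# Made-up counts: 100 transactions, 40 contain A, 50 contain B, 30 contain both.
for name, value in basic_probs(40, 50, 30, 100).items():
    print(name, round(value, 4))
```

Like the method above, the sketch divides by the LHS and RHS counts, so it assumes both are non-zero. The entries conditioned on `~A` or `~B` in the original additionally assume that P(A) and P(B) are below 1.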
    

In [58]:
class RulesCollection(object):

    def __init__(self):
        self.rules = []
        
        
    def size(self):
        return len(self.rules)
        
    def add(self, r):
        self.rules.append(r)
        
    def clear(self):
        self.rules.clear()
        
    def save(self, file_name, is_append):
        mode = 'w'
        if is_append == True:
            mode = 'a'
        with open(file_name, mode) as text_file:
            for rule in self.rules:
                text_file.write(rule.serialize())
                text_file.write('\n')
                
    def load_from_file(self, file_name):    
        with open(file_name, "r") as text_file:
            for line in text_file:
                rule = AssociationRule.string_2_rule(line)
                self.rules.append(rule)
        
    def remove_redundancy(self, freq_itemset_dict):
        new_rules = []
        for r in self.rules:
            if r.is_redundant(freq_itemset_dict):
                continue
            new_rules.append(r)
        self.rules = new_rules 
                
class RulesDictionary():
    
    def __init__(self):
        self.rules = {}
                    
    def load_from_file(self, file_name):
        with open(file_name, "r") as text_file:
            for line in text_file:
                rule = AssociationRule.string_2_rule(line)
                self.rules[line.strip()] = rule
    
    def get_rules(self):
        return list(self.rules.values())
    
    def rule_2_string(self):
        return list(self.rules.keys())
    
    def clear(self):
        self.rules.clear()
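
`RulesCollection.save` writes one rule per line in the `lhs>rhs` form produced by `serialize`, and `load_from_file` parses each line back with `string_2_rule`. The round trip can be sketched as follows. The helpers `itemset_2_string` and `string_2_itemset` are defined elsewhere in the notebook, so the comma-joined stand-ins below are an assumption, as are the function names and the item names:

```python
def to_line(left_items, right_items, sep=','):
    """Serialize a rule as 'lhs>rhs' (the item separator is an assumption)."""
    return sep.join(left_items) + '>' + sep.join(right_items)

def from_line(line, sep=','):
    """Parse an 'lhs>rhs' line back into two item lists."""
    left, right = line.split('>')
    return left.strip().split(sep), right.strip().split(sep)

line = to_line(['age@young', 'sex@male'], ['class@1'])
print(line)              # age@young,sex@male>class@1
print(from_line(line))   # (['age@young', 'sex@male'], ['class@1'])
```

Because `>` separates the two sides, item names must not contain `>` themselves, otherwise a saved rule cannot be parsed back correctly.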

In [59]:
class Generator:
    
    def __init__(self, freq_itemset_dict, 
                 min_conf, 
                 itemset_formatter, 
                 rule_formatter, 
                 nThreads):
        self.itemset_formatter = itemset_formatter
        self.rule_formatter = rule_formatter
        
        self.nthreads = nThreads
        self.freq_itemset_dict = freq_itemset_dict
        
        self.min_conf = min_conf
    
    @staticmethod
    def string_2_rule_and_support(s):
        subStrings = s.split('#')
        rule = AssociationRule.string_2_rule(subStrings[0].strip())
        v = json.loads(subStrings[1].strip())
        return rule, v
    
    @staticmethod
    def rule_and_support_2_string(rule, p):
        return rule.serialize() + '#' + json.dumps(p)
                
    '''
    Generate association rules for one item-set
    '''
    def subsets(self, bits, item_set, k, rule_collection, total_freq): 
        '''
        Run out of items --> create rule and check format criterion
        '''
        if k >= len(item_set):
            left = []
            right = []
                    
            for index in range(len(bits)):
                if bits[index] == True:
                    left.append(item_set[index])
                else:
                    right.append(item_set[index])
                                      
            if (len(left) > 0 and len(right) > 0):
                rule = AssociationRule(left, right)
                if (self.rule_formatter == None or self.rule_formatter(rule) == True):
                    rule_collection.add(rule)
            
            return 
      
        value_domain = [True, False]
        '''
        Try the k-th item on the LHS (True) first, then on the RHS (False)
        '''
        
        for value in value_domain:
            bits[k] = value
               
            if (value == False):
                left_itemset = []
                for index in range(len(bits)):
                    if bits[index] == True:
                        left_itemset.append(item_set[index])
                        
                left_value = self.freq_itemset_dict.get_frequency(itemset_2_string(left_itemset))
                confident = 0
                if left_value > 0: confident = total_freq/left_value
                
                if confident < self.min_conf:
                    bits[k] = True
                    continue
                self.subsets(bits, item_set, k+1, rule_collection, total_freq)
            else:
                self.subsets(bits, item_set, k+1, rule_collection, total_freq)
                
            bits[k] = True
    '''
    Generate association rules for a set of item-sets and write results to a file
    '''
    def generate_rules(self, freq_itemsets_collection, output_file_name):
        total_rules = 0
        remaining_rules = 0
        k = 0
        rule_collection = RulesCollection()
        with open(output_file_name, 'w') as _:
            print ('clear old file...')
            
        for itemset in freq_itemsets_collection:
            '''
            Check item-set first if it can generate a rule
            '''
            if len(itemset) == 1:
                continue
     
         
            if self.itemset_formatter is not None and \
            self.itemset_formatter(itemset) == False:
                continue
            
            '''
            Write generated rule_collection into file
            '''
            k += 1
            if k % 200 == 0:
                print ('writing some rule_collection to file: ' + str(k))
                total_rules += rule_collection.size()
                rule_collection.remove_redundancy(self.freq_itemset_dict)
                rule_collection.save(output_file_name, True)
                remaining_rules += rule_collection.size()
                rule_collection.clear()
            
            '''
            Generating association rule_collection.
            '''
            total_freq = self.freq_itemset_dict.get_frequency(itemset_2_string(itemset))
            bits = [True] * len(itemset)
            self.subsets(bits, itemset, 0, rule_collection, total_freq)
                    
        print ('writing last rule_collection to file: ' + str(k))
        total_rules += rule_collection.size()
        rule_collection.remove_redundancy(self.freq_itemset_dict)
        rule_collection.save(output_file_name, True)
        remaining_rules += rule_collection.size()
        rule_collection.clear()
        
        print ('Finish for sub frequent item-sets!!!')
        print ('Number of redundant rules ' + str(total_rules - remaining_rules) + '/' + str(total_rules))
                  
    '''
    Generate association rules for whole data-set
    '''  
    def execute(self, output_file_name):
        
        itemset_chunks = self.freq_itemset_dict.split(self.nthreads)
        
        processes = []
        for index in range(self.nthreads):
            file_name = output_file_name + '.' + str(index)
            process_i = Process(target=self.generate_rules, 
                                args=(itemset_chunks[index], file_name))
            processes.append(process_i)
            
            
        for process_i in processes:
            process_i.start()
            
        # wait for all processes to complete
        for process_i in processes:
            process_i.join()
            
        print ('Finish generating rules!!!!')    
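
`subsets` enumerates every split of a frequent item-set into a non-empty LHS and RHS, and keeps a split only when confidence = freq(item-set) / freq(LHS) reaches `min_conf`. The same idea in a compact sketch, using `itertools.combinations` in place of the recursive bit vector and without the early pruning; the function name, the `support` dictionary and its counts are made up for illustration:

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Return (lhs, rhs, confidence) for each confident split of one item-set."""
    total = support[frozenset(itemset)]
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs_count = support.get(frozenset(lhs), 0)
            conf = total / lhs_count if lhs_count > 0 else 0.0
            if conf >= min_conf:
                rhs = tuple(sorted(set(itemset) - set(lhs)))
                rules.append((lhs, rhs, conf))
    return rules

# Made-up counts: 'a' occurs 4 times, 'b' 5 times, both together 4 times.
support = {frozenset(['a']): 4, frozenset(['b']): 5, frozenset(['a', 'b']): 4}
print(rules_from_itemset(['a', 'b'], support, min_conf=0.9))
# [(('a',), ('b',), 1.0)]
```

The pruning in `subsets` relies on the fact that moving an item from the LHS to the RHS can only raise the LHS frequency and therefore lower the confidence, so a branch that already falls below `min_conf` does not need to be explored further.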
            
            

Arguments

In [60]:
class ARMParams(object):
    '''
    Parameters for association rule mining: minimum support,
    minimum confidence and maximum item-set size.
    '''

    def __init__(self, minsup, minconf, itemset_max_size=-1):
        '''
        Constructor
        '''
        self.min_sup = minsup 
        self.min_conf = minconf
        self.itemset_max_size = itemset_max_size

In [61]:
class ARMFiles(object):
    '''
    Paths of the temporary files (item-sets, rules, features) used by the miner.
    '''

    def __init__(self, default_folder = Filepath):
        '''
        Constructor
        '''
        self.temp_folder = default_folder
        
        self.itemset_tmp_file = self.temp_folder + 'miner.tmp.itemsets'
        self.rules_tmp_file = self.temp_folder + 'miner.tmp.rules'
      
  
        
        self.feature_tmp_file = self.temp_folder +'miner.tmp.features'
        self.non_redundant_rule_tmp_file = self.temp_folder +'miner.tmp.non_redundant_rules'
        self.non_redundant_rule_feature_tmp_file = self.temp_folder + 'miner.tmp.non_redundant_rules.features'
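
`ARMFiles` builds its paths by plain string concatenation, so the folder passed in (`Filepath` here) needs to end with a path separator. `Generator.execute` then appends `.0`, `.1`, and so on to the output file name, one per worker process, which gives names of the form listed in the notebook's introduction. A small sketch of that naming, where `chunk_file_names` is an illustrative helper and not part of the notebook:

```python
def chunk_file_names(folder, n_chunks, base_name='miner.tmp.rules'):
    """Build one output path per worker, mirroring the string concatenation above."""
    base = folder + base_name          # folder must already end with '/'
    return [base + '.' + str(index) for index in range(n_chunks)]

print(chunk_file_names('/content/', 2))
# ['/content/miner.tmp.rules.0', '/content/miner.tmp.rules.1']
```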
        

Getting data

In [62]:
class DataSet:
    def __init__(self):
        self.current = 0
        self.train_data = []
        self.data_labels = []
        
    
    def __iter__(self):
        return iter(self.train_data)
                
    def size(self):
        return len(self.train_data)
    
    def get_transaction(self, index):
        return self.train_data[index]
    
    def clear(self):
        self.train_data.clear()
        
    def add_transaction(self, t):
        return self.train_data.append(t)

    '''
    Load a data set from a file. The input file must be formatted as CSV (comma-separated).
    class_index is the column holding the class label for labelled data sets (-1 if there is none).
    '''
    def load(self, file_path, class_index = -1, has_header = False):
        self.train_data = []
        if class_index != -1: self.data_labels = []
        
        with open(file_path, "r") as text_in_file:
            if has_header == True:
                text_in_file.readline()
                
            for line in text_in_file:
                #print("dataset script line", line)
                transaction = [x.strip() for x in line.split(',')]
                transaction = list(filter(None, transaction))
                #print("datset script transaction" , transaction)
                
                if (class_index != -1):
                    self.data_labels.append(transaction[class_index])
                    del transaction[class_index]
                
                self.train_data.append(list(set(transaction)))
        print("Loading done")
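
`DataSet.load` turns each CSV line into one transaction: fields are stripped, empty fields are dropped, and duplicate items are removed. A self-contained sketch of that parsing step on an in-memory sample; the function name and the item names are hypothetical, and the result is sorted here only to make the printed output reproducible (the original keeps the arbitrary order of `set`):

```python
import io

def parse_transactions(text_stream):
    """Parse comma-separated lines into de-duplicated item lists."""
    transactions = []
    for line in text_stream:
        items = [x.strip() for x in line.split(',')]
        items = [x for x in items if x]           # drop empty fields
        transactions.append(sorted(set(items)))   # remove duplicate items
    return transactions

sample = io.StringIO("age@young, sex@male,,class@high\nage@old,age@old,class@low\n")
print(parse_transactions(sample))
# [['age@young', 'class@high', 'sex@male'], ['age@old', 'class@low']]
```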

Getting data as input

In [63]:
train_data_set = DataSet()

In [64]:
train_data_set.load(inputfilepath, -1)

Loading done


In [65]:
rule_miner = RuleMiner('spect', train_data_set)

Give arguments

In [66]:
arm_params = ARMParams(minsup, minconf, itemset_max_size)
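
`minsup` is given as a fraction of the data set. Assuming an item-set counts as frequent when its count is at least `minsup` times the number of transactions (the exact comparison is implemented elsewhere in the notebook), the value used here translates into the following absolute threshold for the 32561 transactions reported in the run output:

```python
import math

n_transactions = 32561    # data-set size reported in the run output
minsup = 0.01             # value set at the top of the notebook

# Smallest transaction count that reaches the relative support threshold.
min_count = math.ceil(minsup * n_transactions)
print(min_count)          # 326
```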

Generates frequent item sets and rules

In [67]:
rule_miner.generate_itemsets_and_rules(arm_params)

generating frequent item-sets...
size of data-set: 32561
get frequent item sets with 1 item
extracting item-sets with 2 items ....
generate candidates with 2 items
Writing frequent itemset to file....
#item-sets: 1402
extracting item-sets with 3 items ....
generate candidates with 3 items
generate candidates with 3 items
generate candidates with 3 items
generate candidates with 3 items
Writing frequent itemset to file....
#item-sets: 8060
extracting item-sets with 4 items ....
generate candidates with 4 items
generate candidates with 4 items
generate candidates with 4 items
generate candidates with 4 items
Writing frequent itemset to file....
#item-sets: 25154
extracting item-sets with 5 items ....
generate candidates with 5 items
generate candidates with 5 items
generate candidates with 5 items
generate candidates with 5 items
Writing frequent itemset to file....
#item-sets: 48709
extracting item-sets with 6 items ....
generate candidates with 6 items
generate candidates with 6 items
