#                                        Practical Work in AI - updated version 

## Working with paper "Computing Optimal Decision Sets with SAT" "4.1 Iterative SAT Model" (Yu et al., 2020)

by Nina Braunmiller k11923286

## Table of contents
0. Imports<br>
<br>
1.  Function implementation<br>
1.a. Create Decision Set Classifier<br>
1.b. Functions for working with data sets<br>
<br>
2. Working with data set example from "Computing Optimal Decision Sets with SAT" page 4<br>
<br>
3. Two data sets from the internet <br>
3.a. binary data set<br>
3.b. recruitment data<br>
<br>
4. Working closer with the Mushroom data set<br>
4.a. Convert the discrete data set into a binary one<br>
4.b. Try out the classifiers for full data set<br>
4.c. Use classifiers for reduced data set<br>
4.d. Trying out if our decision set classifier can effectively learn from RIPPER's used features<br>
<br>
5. Final words

Short remark: The aim of the notebook is to implement the decision set classifier (see 1.a.). It is in short described by 'dsc' or 'opt' like in the underlying paper.<br>
You also have to install the KISSAT solver (https://github.com/arminbiere/kissat).

## 0. Imports

In [1]:
import pyeda

from pyeda.inter import *
import numpy as np
import re
import random

import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import os

from prettytable import PrettyTable 

import sklearn
from sklearn.model_selection import StratifiedKFold # train_test_split

import sys # for extending recursion depth

# pip install wittgenstein
import wittgenstein as lw # for using RIPPER

from collections import Counter # to count elements in a string

import os # to find the path of our used solver kissat

from sklearn.utils import shuffle # to shuffle a data array 

In [2]:
# Extend recursion depth
sys.setrecursionlimit(80000) # 10000 is an example, try with different values

In [3]:
# Getting the used versions of imports:
import matplotlib, prettytable 
import_list = [sys, pyeda, np, re, pd, matplotlib, prettytable, sklearn, lw] # sys gives us the python version
my_versions = ['3.7.3','0.28.0','1.19.2','2.2.1','1.1.4','3.3.2','3.0.0','0.23.2','0.3.2']
for ele, my_version in zip(import_list, my_versions):
    try:
        v = ele.__version__
        print
    except:
        try:
            v = ele.version
            
        except:
            v = 'cant say version'
    print(ele, ': \nyour version ', v, '\noriginally used version ', my_version, '\n')
    

<module 'sys' (built-in)> : 
your version  3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] 
originally used version  3.7.3 

<module 'pyeda' from '/home/c/anaconda3/lib/python3.7/site-packages/pyeda/__init__.py'> : 
your version  0.28.0 
originally used version  0.28.0 

<module 'numpy' from '/home/c/anaconda3/lib/python3.7/site-packages/numpy/__init__.py'> : 
your version  1.19.2 
originally used version  1.19.2 

<module 're' from '/home/c/anaconda3/lib/python3.7/re.py'> : 
your version  2.2.1 
originally used version  2.2.1 

<module 'pandas' from '/home/c/anaconda3/lib/python3.7/site-packages/pandas/__init__.py'> : 
your version  1.1.4 
originally used version  1.1.4 

<module 'matplotlib' from '/home/c/anaconda3/lib/python3.7/site-packages/matplotlib/__init__.py'> : 
your version  3.3.2 
originally used version  3.3.2 

<module 'prettytable' from '/home/c/anaconda3/lib/python3.7/site-packages/prettytable/__init__.py'> : 
your version  3.0.0 
originally used version  3.0.0 

<m

## 1.  Function implementation

### 1.a. Create Decision Set Classifier

In [4]:
random.seed(0) 

class DecisionSetClassifier:
    '''
    learns a decision/rule set which will then used to classify binary test data labels,
    implemented after the paper "Computing Optimal Decision Sets with SAT" by Yu, Ignatiev, Stuckey, Bodic, 2020
    chapter of the relating paper: "4.1 Iterative SAT Model" from where we implement the constraints (1)-(6);
    paper describes that classifier also as "opt".
    This classifier works for binary data sets (features and labels have to be binary)  
    
    general idea:
    a chain of nodes are a rule iff it ends with a node containing the label feature. 
    a decision/rule set can contain several rules. The order of these rules doesn't matter.
    a sample is classified by majority vote. When we have a tie the relating sample gets a class prediction of
    the most common class within the train set.
    
    '''
    
    def __init__(self, number_nodes):
        # user needs to give the number of nodes which are part of the rules (decision set size):
        self.number_nodes = number_nodes
        
        # Independent of user input:
        self.most_common_class = None
        
        self.s_feat = None
        self.s_mask = None
        
        self.t_array = None
        self.t_array_for_prediction = None
        
        self.prediction_array = None
        
        self.data_shape = None # is needed for ruleset visualization
        
        self.list_number_samples_per_rule = [] # the index refers to the current rule in the decision set;
        # here we look how many test set samples are fulfilled by the single rules

        self.falsely_classified_list = [] # indices belong to the single rules of decision set;
        # how many test set samples fulfill the rule conditions but are misclassified (locally, not globally)
    
    def reset(self):
        ''' call this method when we want to create a new instance without information left from the old instance '''
        
        self.most_common_class = None
        
        self.s_feat = None
        self.s_mask = None
        
        self.t_array = None
        self.t_array_for_prediction = None
        
        self.prediction_array = None
        
        self.data_shape = None # is needed for ruleset visualization
        
        self.list_number_samples_per_rule = [] # the index refers to the current rule in the decision set;
        # here we look how many test set samples are fulfilled by the single rules

        self.falsely_classified_list = [] # indices belong to the single rules of decision set;
        # how many test set samples fulfill the rule conditions but are misclassified (locally, not globally)
        
    def fit(self, data, path_to_kissat_solver = '/home/c/kissat/build/kissat'):
        '''data includes targets (last column) here!;
        data is training data;
        with the fit method we want to learn rules from given training set, the method contains all needed 
        constraints 
        
        For this method we worked with the package pyeda which can convert strings into expressions such that 
        the string elements are usable to work with SAT, e. g. transforming into DIMACS.
        '''
        try: 
        
            self.data_shape = data.shape # is needed for ruleset visualization

            ####################################################
            # PREPARATIONS OF VARIABLES FOR CLASSIFIER

            # get most common class in train set which is later needed for prediction:
            y_train = data[:,-1] 
            bins_counts = np.bincount(y_train)
            self.most_common_class = np.argmax(bins_counts)


            # initialize variables later needed:
            number_features = data.shape[1]-1 # class is no feature here
            number_examples = data.shape[0]

            # create the s_jr variables:
            matrix = exprvars('s',self.number_nodes,number_features+1) # +1 because we want nodes for the class


            ####################################################
            # CONSTRAINT 1
            # we only have one feature per node

            cond1_str = '('

            for ind, single_nodes in enumerate(matrix[:-1]): # [:-1] because of condtion 2 we already know the which 
                # feature will be at node 11. Therefore, we ignore node 11 here.
                single_state_all_features = '('
                single_state_all_features_list = []
                for ind2,s in enumerate(single_nodes):
                    if ind2 != 0:
                        single_state_all_features += ' | '
                    single_state_all_features += str(s)
                    single_state_all_features_list.append(s)
                single_state_all_features += ')'

                string_excluding = ''
                for ind3, ele3 in enumerate(single_state_all_features_list):
                    if ind3 < len(single_state_all_features_list)-1:
                        for ind4, ele4 in enumerate(single_state_all_features_list[ind3+1:]):
                            string_excluding += f' & (~{ele3} | ~{ele4})'
                            
                string_one_node = '(' + single_state_all_features + string_excluding + ')'
                cond1_str += string_one_node
                
                if ind < len(matrix[:-1]) -1:
                    cond1_str += ' & '
                    
            cond1_str += ')'

            cond1_func = expr(cond1_str)
            print('CONSTRAINT 1 done')

            ####################################################
            # CONSTRAINT 2
            # the last node is always a leaf

            cond2_last_node = exprvar('s',(self.number_nodes-1,number_features+1-1)) # -1 because we start at index 0
            print('CONSTRAINT 2 done')

            ####################################################
            # CONSTRAINT 3
            # at the first node all samples are true

            cond3_expr = exprvars('v',number_examples,1) # 0 stands for 0-th node
            cond3_str = ''
            for ele in cond3_expr:
                if cond3_str == '':
                    cond3_str += str(ele[0])
                else:
                    cond3_str += f'&{ele[0]}'
            cond3_func = expr(cond3_str)
            print('CONSTRAINT 3 done')

            ####################################################
            # CONSTRAINT 4 + 5 + 6 (have two loops in common)

            # 4: a sample is valid iff the current node is the start of the next rule or the feature value 
            # of that sample matches with the feature truth value of the previous node

            # 5: when a sample is valid at a leaf, it has the same label as the leaf node

            # 6: for every sample we have a leaf literal where the sample is valid

            # constraints 4, 5, 6 share same loop.

            cond4_str = ''
            cond5_str = ''
            cond6_str = '('

            for ind_ex in range(number_examples):

                if ind_ex != 0:
                    cond6_str += ')&('

                for ind_node in range(self.number_nodes):

                    # feat_str part of constraint 4:
                    feat_str = '('
                    for ind_feat in range(number_features): 
                        if feat_str != '(':
                            feat_str += '|'
                        feat_str += f'(s[{ind_node},{ind_feat}] & Equal(t[{ind_node}], {data[ind_ex,ind_feat]}))'

                    feat_str += ')'
                    if ind_node < self.number_nodes-1: # loop stops for condition 4 one earlier than for 5,6
                        if cond4_str != '':
                            cond4_str += '&'

                        cond4_str += f'(v[{ind_ex},{ind_node+1}] <=> s[{ind_node},{number_features}] | (v[{ind_ex},{ind_node}] & {feat_str}))'

                    if cond5_str != '':
                        cond5_str += '&' 
                    cond5_str += f'((s[{ind_node},{number_features}] & v[{ind_ex},{ind_node}]) => Equal(t[{ind_node}], {data[ind_ex,-1]}))'

                    if ind_node != 0:
                        cond6_str += '|'
                    cond6_str += f'(s[{ind_node},{number_features}] & v[{ind_ex},{ind_node}])'

            cond6_str += ')'    

            cond4_func = expr(cond4_str)
            print('CONSTRAINT 4 done')

            cond5_func = expr(cond5_str)  
            print('CONSTRAINT 5 done')

            cond6_func = expr(cond6_str)
            print('CONSTRAINT 6 done')

            ####################################################
            # PUTTING THE CONSTRAINTS TOGETHER AND GET A MODEL WITH HELP OF TSEITIN
            sat_str = cond1_str + '&' + str(cond2_last_node) + '&' + cond3_str + '&' + cond4_str + '&' + cond5_str + '&' + cond6_str
            print('final string done')
            sat_func = expr(sat_str)
            print('expressing string done')

            sat_func_tsei = sat_func.tseitin() # to_cnf() also possible but far more complex, too long, is
            # exponential sized
            print('tseitin transformation done')

            # Convert tseitin formula into the DIMACS format which is needed to feed the kissat solver which 
            # isn't included in our used package pyeda:
            sat_func_tsei_dimacs = pyeda.boolalg.expr.expr2dimacscnf(sat_func_tsei)[1] 
            # [0]: we get the variable mapping

            print('dimacs conversion done')



            # Store the DIMACS formula in a file:
            file = open('my_cnf.cnf', 'w')
            file.write(f'{sat_func_tsei_dimacs}')
            file.close()



            # With jupyter notebook we have the opportunity to run command lines within the notebook by simply 
            # using:
            # ! command line
            # Let's use the command line here to solve the DIMACS format with the kissat solver which was
            # downloaded from github (https://github.com/arminbiere/kissat): 

            path_including_kissat = path_to_kissat_solver #os.path.relpath("kissat") # getting path to kissat solver when using same directory

            out_with_linereading = ! '{path_including_kissat}' my_cnf.cnf #/home/c/kissat/build/add4.cnf

            in1 = out_with_linereading.index('c ---- [ result ] ------------------------------------------------------------')
            in2 = out_with_linereading.index('c ---- [ profiling ] ---------------------------------------------------------')

            result_list = out_with_linereading[in1+1:in2]
            sol_string = ' '.join(result_list)
            list_literals = re.findall(r'-?\d+', sol_string) # find all digits independent of sign

            if list_literals == []: # we have no literals because there is no model
                return None

            list_literals.remove('0') # the 0 only marks the end of clauses in DIMACS

            # Let's have a look at our mapping of expression symbols to natural numbers 
            # (change happend when formula was converted into DIMACS):
            dimacs_mapping_dict = pyeda.boolalg.expr.expr2dimacscnf(sat_func_tsei)[0]         


            ####################################################
            # When using PICOSAT
            '''
            # When we don't use the KISSAT solver than we can make use of the in pyeda implemented PICOSAT solver.
            # This solver doesn't need the DIMACS format. However, it is slower than the KISSAT solver.:
            sat_model = sat_func_tsei.satisfy_one() # uses PICOSAT
            if sat_model is None: # finding no model with PICOSAT
                return None

            # Find the variables in the model which are used/true:

            s_array = np.array([0]*self.number_nodes, dtype=str)
            t_array = np.array([0]*self.number_nodes)

            for k,v in sat_model.items(): 
                if v==1 and ('aux' not in str(k)) and ('v' not in str(k)) and ('t' not in str(k)):
                    node_index = int((re.findall(r'\d+', str(k)))[0])
                    used_feat_ = (re.findall(r'\d+', str(k)))[1]
                    s_array[node_index] = used_feat_
                        
                if v==1 and ('aux' not in str(k)) and ('v' not in str(k)) and ('s' not in str(k)):
                    node_index = int((re.findall(r'\d+', str(k)))[0])
                    t_array[node_index] = 1
                    
            self.s_feat = s_array
            self.t_array = t_array
            '''
            ####################################################

            print('finding model done')



            ####################################################
            # PREPARE s_ARRAY AND t_ARRAY
            # Here we want to collect all used features in order of the decision set. So, we collect from s_jr
            # the r which stands for the feature index.
            # Each node j can only carry one feature. This we want to filter.

            # When using KISSAT instead of PICOSAT we get a slightly different model output description.
            # Get all variables which are positive/used in our KISSAT model:

            
            s_array = np.array([0]*self.number_nodes, dtype=object)
            t_array = np.array([0]*self.number_nodes)
            
            for single_literal in list_literals:

                paper_variable = dimacs_mapping_dict[int(single_literal)] # paper_variable means the 
                # symbols used in the paper: s_jr, t_j, v_ij. 
                # v_ij describes the fitting of example i at node j. Ignore them for rule formulation.
                
                # Get all variables which appear in our final model of the formula:         
                if ('~' not in str(paper_variable)) and ('aux' not in str(paper_variable)) and ('v' not in str(paper_variable)):
                    
                    # get all used s_jr variables our model and nothing else:
                    if ('t' not in str(paper_variable)):
                                                
                        node_index = int((re.findall(r'\d+', str(paper_variable)))[0]) 
                        # get all node indices j of s_jr
                        
                        used_feat_ = (re.findall(r'\d+', str(paper_variable)))[1]
                        
                        # get all used features r of s_jr
                        
                        s_array[node_index] = used_feat_
                        
                    # get all positive t_j variables our model and nothing else:
                    if ('s' not in str(paper_variable)):
                        
                        node_index = int((re.findall(r'\d+', str(paper_variable)))[0])
                        # get all node indices j of t_j
                        
                        t_array[node_index] = 1
                        
            self.t_array = t_array # collection of t_j ordered by node indices j
            # at each node j where feature value = 1 we have in this array also a 1, else it is 0
            
            self.s_feat = s_array.astype(str) # collection of all used features r in order following node indices j 
            # retrieved from s_jr
            
            # Get indices of leaf nodes.
            # That are the nodes in which the label feature is used instead of an other feature:
            self.s_mask = np.where(self.s_feat==f'{data.shape[1]-1}') # data.shape[1]-1 gives our class

            # modify t_array for creating the prediction array later:
            self.t_array_for_prediction = np.copy(self.t_array) # without np.copy() self.t_array would become
            # exactly the same object as self.t_array_for_prediction. Therefore, it would be modified when
            # self.t_array_for_prediction would be modified.

            self.t_array_for_prediction[np.where(self.t_array_for_prediction == 0)] = -1
            # We need an extra array here because we will make majority vote for sample classifiaction later.


            ####################################################

            # DONE, we found a model. Therefore, 1 instead of None returned:
            return 1
        
        except: # too high complexity to use code above
            return -1 
    
    def predict(self, data, y_test=None): # here3
        ''' data without labels, y_test needed for our evaluation of rules.
        You can ignore it for samples for which the true label is unknown.'''
        try:
            y_test = np.copy(y_test) # else our global variable beyond the class y_test changes too with
            # changes within this method
        except: 
            pass
        
        old_ind = 0

        number_test_samples = data.shape[0]
        self.prediction_array = np.zeros((number_test_samples, ), dtype='int')
        
        for ind in self.s_mask[0]: # look at the single rules of our decision set, self.s_mask splits into single rules
                        
            # get the used features of one rule without the label 
            # because the label feature's index isn't needed at all:
            s_feat_without_class = np.array(self.s_feat[old_ind:ind+1][:-1], dtype='int')
           
            data_small = data[:,s_feat_without_class] # only look at the features which were used by the rule
  
            # Now bring t_j into the game:
            t_array_filtered_without_class = self.t_array[old_ind:ind+1][:-1] # getting the truth values 
            # for the used features of the current rule
            
            t_array_filtered_without_class = np.repeat(t_array_filtered_without_class.reshape(1,(len(t_array_filtered_without_class))), number_test_samples, axis=0)
            # duplicate the line because below we will compare this array with the example array line by line
           
            
            # Have a look at the filtered examples:
            ex_fit_rule = np.all(data_small == t_array_filtered_without_class, axis=1) # compare sample by sample
            # (row by row) if the single example fulfills the truth values which are given by the current rule
            
            if np.any(ex_fit_rule): # at least one sample fits to the current rule
                
                predicted_class_weight = self.t_array_for_prediction[old_ind:ind+1][-1] 
                # thats the truth value for the class of the current rule {-1/negative_class,1/positive_class}

                # getting the examples which are part of the current class:
                ind_of_fitting_ex = np.where(ex_fit_rule)
                
                self.prediction_array[ind_of_fitting_ex] += predicted_class_weight
                
                # this class weight is either 1 (class 1) or -1 (class 0). 
                # When look at all rules and every time when a rule fits to an example we get its prediction.
                # However, this prediction doesn't have always be the same. 
                # By majority vote will be determined what the final prediction is. 
                # Therefore, we can simply add the weight every time a rule fits.
                # When after the last rule the prediction is 0 the we choose the most common class as prediction.
                # final prediction > 0 predicts 1, whereas prediction < 0 predicts 0.
                
                ########################################################################################
                # only needed for evaluation purposes. no value for this method def predict.
                if y_test is None: 
                    print('Be aware that we need y_test to make use of the table visualization of the method ruleset_performance.')
                else:
                    # count how many samples fit to the current rule:
                    number_fitting_to_rule = len(ind_of_fitting_ex[0])
                    self.list_number_samples_per_rule.append(number_fitting_to_rule)

                    # count how many rules which fulfill rule conditions are misclassifications:
                    y_test[np.where(y_test==0)] = -1
                    number_falsely_classified = len(np.where(y_test[ind_of_fitting_ex] != predicted_class_weight)[0])
                    self.falsely_classified_list.append(number_falsely_classified)
                ########################################################################################
                
            else:
                print('no sample for our rule fitting')
                self.list_number_samples_per_rule.append(0)
                self.falsely_classified_list.append(0)
                
            old_ind = ind + 1 

        # convert prediction array to the classes we want to predict. Final prediction {0,1}:
        self.prediction_array[np.where(self.prediction_array>0)] = 1
        # sample was more often classified as positive than negative.

        self.prediction_array[np.where(self.prediction_array==0)] = self.most_common_class 
        # a tie. Sample got the same amount of positive and negative label predictions or the sample doesn't
        # fit to any rule. Then we predict the most common class in our training set.

        self.prediction_array[np.where(self.prediction_array<0)] = 0
        # sample was more often classified as negative than positive. For this purpose we needed the
        # self.t_array_for_prediction which also contains negative values for the negative class 0.

        return self.prediction_array
   

    def score(self, y_test):
        ''' method predict has to be called before! 
        returns the accuracy'''
        if self.prediction_array is None:
            raise Exception('First call class.predict(data, y_test) !')
        else:
            correctly_predicted = len(np.where(self.prediction_array == y_test)[0])
            acc = correctly_predicted / len(y_test) 
        return acc

    
    def ruleset_(self):
        '''visualization decision set. usable after usage of method fit.
        This method is also automatically used by method ruleset_performance.
        
        Aim: create a visualization of the found rules. For this destination simply use the already found out 
        '''
        
        if self.s_feat is None:
            raise Exception('Call first class.fit(data)')
            
        # self.s_feat, self.t_array, self.s_mask which were found in the fit method:
        s_feat = np.copy(self.s_feat)
       
        t_array = np.copy(self.t_array)
        s_mask = np.copy(self.s_mask) # gives us the split of single rules by indices

        # Replacing truth boolean values 0 and 1 by '' and '¬' for easier reading of ruleset later:
        t_array = t_array.astype(str)
        t_array[np.where(t_array=='1')] = ''
        t_array[np.where(t_array=='0')] = '¬'

        s_plus_t = np.char.add(t_array, s_feat) # s_feat combined with t_array. Needed below
        
        # Concatenate the used features by using 'and'/'^':
        s_feat = np.char.add(s_feat, ' ∧ ') # with np.char.add() can we merge two strings elementwise in an array together

        # Replace the class feature indices by the word 'class':
        
        s_feat = s_feat.astype(object) # needed to put 'class|' into array, when we would have dtype=str then
        # 'class|' would be a too long element to put in (because longer than other elements)
        s_feat[s_mask] = 'class|'
        
        s_feat = s_feat.astype(str) # for upcoming operations the array has to be of type str

        # Adding t_array to s_feat:
        s_feat = np.char.add(t_array, s_feat)
       
        # Feature directly in front of 'class' needs implication instead of conjunction:
        s_feat[s_mask[0]-1] = np.char.add(s_plus_t[s_mask[0]-1], ' → ')
        
        rule_string = ''.join(s_feat.tolist()) # we want all rules as string without commas.
        
        return rule_string
    
    def ruleset_performance(self):
        '''
        This method can only be called after the methods fit and predict!
        '''
        
        if self.list_number_samples_per_rule == []:
            raise Exception('First call class.predict(data, y_test) !')
        
        rule_string = self.ruleset_()
        list_rule_set = rule_string.split('|')[:-1]
        
        table = PrettyTable()
        
        # define the column names of our table:
        columns = ['rules', '# test samples fulfilling rule conditions','# mistakenly fitting test samples', '% mistaken fits']
        # Important: the column '% mistaken fits' doesn't talk about misclassifications but fitting of 
        # single samples to a certain rule although the ground truth label is different. 
        # Nevertheless, the sample can get the right prediction through majority vote.
        
        # add values to each of our columns:
        table.add_column(columns[0], list_rule_set) # the rule visualization
        table.add_column(columns[1], self.list_number_samples_per_rule)
        table.add_column(columns[2], self.falsely_classified_list)
        
        # calculation of % of fitting samples with false ground truth label:
        c = np.array(self.list_number_samples_per_rule)
        percent_misclass = np.array(self.falsely_classified_list)/ c
        percent_misclass[np.where(c==0)] = 0
                
        table.add_column(columns[3], percent_misclass )

        
        return table

### 1.b. Functions for working with data sets

In [5]:
# Create table to compare different classifiers:
def tableizer(current_dataset_name:str , *row_values_lists):
    ''' Create a table comparing our classifier with RIPPER. 
    argument row_values_lists has to be at least one list containing [info whether full data set used:str,
    accuracy score our classifier, accuracy score RIPPER],
    e. g. tableizer('data set jobs' ,['data set with shortend shape', 0.5, 0.1])'''
    
    table = PrettyTable(['data set forms','our decsision set classifier', 'RIPPER'])
    table.title = 'best accuracy scores for data set ' + current_dataset_name
     
    for ind, ele in enumerate(row_values_lists):
        table.add_row(ele)
        
    return table

In [6]:
# Removing duplicates from data:
def remove_duplicates(data_array):
    ''' data_array has to be with y-column!
    Remove all duplicates but also samples with same feature values but different labels (like it is done
    in the underlying paper)'''
    
    # Remove ambiguous examples:
    df = pd.DataFrame(data_array)
    df1 = df.drop_duplicates() # remove all examples which are duplicates from each other (always one kept)
    
    df2 = df1.drop_duplicates(subset=[i for i in range((data_array[:,:-1]).shape[1])],keep=False)
    # remove of examples with same features but different classes
    data_array = df2.to_numpy(dtype='int')
    
    return data_array

In [7]:
# Use sklearn to implement stratified cross validation:
def stratified_cross_validation(dataset, clf_ripper_dsc, number_folds=5,path_to_kissat_solver='/home/c/kissat/build/kissat'):
    ''' number_folds=5 as used in the paper 
    '''
    print('DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT')
    
    random.seed(0)

    data = dataset.astype(int)

    data = shuffle(data ,random_state=0) # bring examples into random order 

    X = data[:,:-1]
    y = data[:,-1]

    
    skt = StratifiedKFold(n_splits=number_folds) # 5 folds were used by the paper

    score_list = []
    rule_list = []

    for train_index, test_index in skt.split(X,y): 
        
        clf = clf_ripper_dsc #lw.RIPPER() # Or irep_clf = lw.IREP() to build a model using IREP     
        
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        if 'DecisionSetClassifier' in str(clf_ripper_dsc): 
            
            data_train = np.concatenate((X_train, y_train.reshape(len(y_train),1)),axis=1) 
            # looks like the original data without the test set
            # needed for the fit method that our array also has y lables included
            
            fitter = clf.fit(data_train, path_to_kissat_solver)

            if fitter is None:
                score = None
                rule_set = None
                
            elif fitter == -1: # code couldn't be executed successfully
                score = fitter
                rule_set = None
                
            else:

                clf.predict(X_test, y_test)
                score = clf.score(y_test) 
                
                rule_set = clf.ruleset_performance() # only working for our classifier, not ripper

        else: # we use RIPPER
            
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            
            # getting the rule set
            # or use: print(ripper_clf.out_model())
            # retrieved from: https://pypi.org/project/wittgenstein/
            rule_set = clf.ruleset_ # ^ as 'and', V as 'or'  
            
        print('Reached score:', score)
        score_list.append(score)    
                

        print('found ruleset:')        
        print(rule_set) 
               
        rule_list.append(str(rule_set))
        
        try: # our dsc used
            clf.reset() # such that with upcoming instance we have no old information left
        except:
            pass
        
    print(f'reached scores with {number_folds} folds:', score_list)   
    
    score_list.append(None) # so we can remove None below without error message when originally no None in list
    score_list = set(score_list) # with help of the set() all None can be remove below
    score_list.remove(None) # We get score None when our classifier can't find a model
    
    if len(score_list) > 0:
        highest_score = max(score_list)
    else:
        highest_score = None
    
    
    return highest_score, score_list, rule_list

# 2. Working with data set example from "Computing Optimal Decision Sets with SAT" page 4 


In [8]:
# data from underlying paper:
data_array = np.array([[1,0,1,0,0],
                    [1,0,0,1,0],
                    [0,0,1,0,1],
                    [1,1,0,0,0],
                    [0,0,0,1,1],
                    [1,1,1,1,0],
                    [0,1,1,0,0],
                    [0,0,1,1,1],
                    ])

### Let's try out whether our decision set classifier finds the same rules as in the paper

In [9]:
data_array_opt = np.copy(data_array)

X = data_array_opt[:,:-1]
y = data_array_opt[:,-1]

print(data_array_opt)

dsc = DecisionSetClassifier(number_nodes = 7)

fitter = dsc.fit(data_array_opt)

dsc.ruleset_()

dsc.predict(X,y)
print('To predict the y labels we use the train data here because we only want to test if our descision set classifier is working.')

score_dsc = dsc.score(y)
print('reached score:', score_dsc)

dsc.ruleset_performance()

[[1 0 1 0 0]
 [1 0 0 1 0]
 [0 0 1 0 1]
 [1 1 0 0 0]
 [0 0 0 1 1]
 [1 1 1 1 0]
 [0 1 1 0 0]
 [0 0 1 1 1]]
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
To predict the y labels we use the train data here because we only want to test if our descision set classifier is working.
reached score: 1.0


rules,# test samples fulfilling rule conditions,# mistakenly fitting test samples,% mistaken fits
0 → ¬class,4,0,0.0
¬0 ∧ ¬1 → class,3,0,0.0
1 → ¬class,3,0,0.0


As you can see in the column "rules" it for sure does.

### Trying out RIPPER

In [10]:

data_array_rip = data_array.astype(int)


X = data_array_rip[:,:-1]
y = data_array_rip[:,-1]

ripper_clf = lw.RIPPER() # Or irep_clf = lw.IREP() to build a model using IREP

ripper_clf.fit(X, y)

score_ripper = ripper_clf.score(X, y)


print(score_ripper, 'score')
print('found rule set: ')
ripper_rule_set = ripper_clf.ruleset_
# getting the rule set
# or use: print(ripper_clf.out_model())
# retrieved from: https://pypi.org/project/wittgenstein/
print(ripper_rule_set) # ^ as 'and', V as 'or'


0.875 score
found rule set: 
[[0=0]]


In [11]:
tableizer('from underlying paper' ,[f'full data set of shape {data_array.shape}', score_dsc, score_ripper])

data set forms,our decsision set classifier,RIPPER
"full data set of shape (8, 5)",1.0,0.875


# 3. Two data sets from the internet

## 3.a. binary data set

In [12]:
data_dir = Path("datasets") / Path("research_gate_question") / Path("SampleData.csv")

In [13]:
data_df = pd.read_csv(data_dir, sep=',', header=None)

In [14]:
data_df.shape

(250, 751)

In [15]:
data_df.columns = [f'F{i}' for i in range(data_df.shape[1])]
data_df['class'] = data_df['F0'].replace(['BAD','GOOD'], [0,1])
del data_df['F0']
data_df

Unnamed: 0,F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,...,F742,F743,F744,F745,F746,F747,F748,F749,F750,class
0,1,1,0,1,1,1,1,0,1,0,...,0,0,1,0,1,1,1,1,0,1
1,1,1,0,1,1,1,1,0,1,0,...,0,0,1,0,1,1,1,1,0,1
2,1,1,0,1,1,1,1,1,0,0,...,0,0,0,0,1,0,1,1,0,1
3,1,1,0,1,1,1,1,1,0,0,...,0,0,0,0,1,0,1,1,0,1
4,1,1,0,1,1,1,1,1,0,0,...,0,0,0,0,1,0,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,1,0,0,0,1,0,1,1,0,1,...,0,0,1,0,1,1,1,0,1,1
246,1,0,0,0,1,0,1,1,0,1,...,0,0,1,0,1,1,1,0,1,1
247,1,0,0,0,1,0,1,1,0,1,...,0,0,1,0,1,1,1,0,1,1
248,1,0,0,0,1,0,1,1,0,1,...,0,0,1,0,1,1,1,0,1,1


In [16]:
data_df['class'].value_counts()

1    176
0     74
Name: class, dtype: int64

In [17]:
data_array_binary = data_df.to_numpy()
data_array_binary = remove_duplicates(data_array_binary)
data_array_binary.shape

(250, 751)

### Let's get started with stratisfied cross validation
This is suitable for data set with uneven label distribution

#### Using full data set

In [18]:
# highest_score_binary_dsc, _, __ = stratified_cross_validation(data_array_binary, DecisionSetClassifier(number_nodes = 14), number_folds=5)
# problem: jupyter notebook tells us that kernel dies
highest_score_binary_dsc = 'computation too complex'

In [19]:
highest_score_binary_ripper, _, __ = stratified_cross_validation(data_array_binary, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 0.52
found ruleset:
[[182=0] V [233=1^21=1]]
Reached score: 0.62
found ruleset:
[[458=0^384=0^158=1]]
Reached score: 0.56
found ruleset:
[[182=0]]
Reached score: 0.48
found ruleset:
[[396=0]]
Reached score: 0.6
found ruleset:
[[182=0]]
reached scores with 5 folds: [0.52, 0.62, 0.56, 0.48, 0.6]


In [20]:
list_tableizer_binary1 = [f'full data set of shape {data_array_binary.shape}; #folds=5', highest_score_binary_dsc, highest_score_binary_ripper]

**Interpretation:**<br>
-1: means that the problem was too complex for our decision set classifier. We can extend the recursion depth such that our classifier can make use of more computational ressources. This can be varied under: sys.setrecursionlimit(number)<br><br>
None: means that our decision set classifier didn't find a model for our given data. A reason could be when training with small data sets and stratified cross validation, that there is indeed no pattern for which our classifier could find for a certain amount of rule nodes. We could increase the number of rule nodes 'number_nodes' (danger of too high complexity) and the data set.

#### Using shortened data set

In [21]:
print('From shape', data_array_binary.shape)
data_array_binary_shortend = data_array_binary[:,740:]
data_array_binary_shortend = remove_duplicates(data_array_binary_shortend)
print('To shape', data_array_binary_shortend.shape)

From shape (250, 751)
To shape (59, 11)


In [22]:
number_nodes = 28
highest_score_binary_shortened_dsc,_,__ = stratified_cross_validation(data_array_binary_shortend, DecisionSetClassifier(number_nodes), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation don

In [23]:
highest_score_binary_shortened_ripper,_,__ = stratified_cross_validation(data_array_binary_shortend, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 0.5833333333333334
found ruleset:
[[0=0^6=1^7=1] V [8=1^2=0]]
Reached score: 0.6666666666666666
found ruleset:
[[0=0^7=1^6=1]]
Reached score: 0.5
found ruleset:
[[8=1]]
Reached score: 0.5
found ruleset:
[[8=1^1=0] V [9=1] V [6=0^3=1]]
Reached score: 0.6363636363636364
found ruleset:
[[6=1^0=0^7=1]]
reached scores with 5 folds: [0.5833333333333334, 0.6666666666666666, 0.5, 0.5, 0.6363636363636364]


In [24]:
list_tableizer_binary2 = [f'shortened data set of shape {data_array_binary_shortend.shape}; #folds=5', str(highest_score_binary_shortened_dsc)+f' with {number_nodes} nodes', highest_score_binary_shortened_ripper]

Summarize results in table:

In [25]:
tableizer('binary from internet', list_tableizer_binary1, list_tableizer_binary2)

data set forms,our decsision set classifier,RIPPER
"full data set of shape (250, 751); #folds=5",computation too complex,0.62
"shortened data set of shape (59, 11); #folds=5",None with 28 nodes,0.6666666666666666


**Interpretation:**<br>
-1: means that the problem was too complex for our decision set classifier. We can extend the recursion depth such that our classifier can make use of more computational ressources. This can be varied under: sys.setrecursionlimit(number)<br><br>
None: means that our decision set classifier didn't find a model for our given data. A reason could be when training with small data sets and stratified cross validation, that there is indeed no pattern for which our classifier could find for a certain amount of rule nodes. We could increase the number of rule nodes 'number_nodes' (danger of too high complexity) and the data set.

## 3.b. recruitment data
This data is about job success and salary level. Originally it isn't binary. We only keep binary features.

In [26]:
data_dir = Path("datasets") / Path("recuitment") / Path("Placement_Data_Full_Class.csv")

In [27]:
data_df = pd.read_csv(data_dir, sep=',')
data_df.shape

(215, 15)

In [28]:
data_df

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.00,Others,91.00,Others,Commerce,58.00,Sci&Tech,No,55.0,Mkt&HR,58.80,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.00,Central,68.00,Central,Arts,64.00,Comm&Mgmt,No,75.0,Mkt&Fin,57.80,Placed,250000.0
3,4,M,56.00,Central,52.00,Central,Science,52.00,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.80,Central,73.60,Central,Commerce,73.30,Comm&Mgmt,No,96.8,Mkt&Fin,55.50,Placed,425000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,211,M,80.60,Others,82.00,Others,Commerce,77.60,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,Placed,400000.0
211,212,M,58.00,Others,60.00,Others,Science,72.00,Sci&Tech,No,74.0,Mkt&Fin,53.62,Placed,275000.0
212,213,M,67.00,Others,67.00,Others,Commerce,73.00,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,Placed,295000.0
213,214,F,74.00,Others,66.00,Others,Commerce,58.00,Comm&Mgmt,No,70.0,Mkt&HR,60.23,Placed,204000.0


In [29]:
# Delete non binary features:
del data_df['sl_no']

del data_df['ssc_p']
del data_df['hsc_p']
del data_df['degree_p']
del data_df['etest_p']
del data_df['mba_p']

del data_df['salary']

In [30]:
# Let's code the binary feature into numeric one:
data_df['gender'] = data_df['gender'].replace('M',0)
data_df['gender'] = data_df['gender'].replace('F',1)

data_df['ssc_b'] = data_df['ssc_b'].replace('Central',1)
data_df['ssc_b'] = data_df['ssc_b'].replace('Others',0)

data_df['hsc_s'] = data_df['hsc_s'].replace('Commerce',1)
data_df['hsc_s'] = data_df['hsc_s'].replace('Science',0)
values = [0,1]
data_df = data_df[data_df.hsc_s.isin(values) == True]


data_df['degree_t'] = data_df['degree_t'].replace('Comm&Mgmt',1)
data_df['degree_t'] = data_df['degree_t'].replace('Sci&Tech',0)
data_df = data_df[data_df.degree_t.isin(values) == True]

data_df['specialisation'] = data_df['specialisation'].replace('Mkt&Fin',1)
data_df['specialisation'] = data_df['specialisation'].replace('Mkt&HR',0)

data_df['status'] = data_df['status'].replace('Placed',1)
data_df['status'] = data_df['status'].replace('Not Placed',0)

data_df['workex'] = data_df['workex'].replace('Yes',1)
data_df['workex'] = data_df['workex'].replace('No',0)

data_df['hsc_b'] = data_df['hsc_b'].replace('Others',1)
data_df['hsc_b'] = data_df['hsc_b'].replace('Central',0)

data_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


Unnamed: 0,gender,ssc_b,hsc_b,hsc_s,degree_t,workex,specialisation,status
0,0,0,1,1,0,0,0,1
1,0,1,1,0,0,1,1,1
3,0,1,0,0,0,0,0,0
4,0,1,0,1,1,0,1,1
5,0,0,1,0,0,1,1,0
...,...,...,...,...,...,...,...,...
210,0,0,1,1,1,0,1,1
211,0,0,1,0,0,0,1,1
212,0,0,1,1,1,1,1,1
213,1,0,1,1,1,0,0,1


In [31]:
data_array_job = data_df.to_numpy(dtype='int')
data_array_job = remove_duplicates(data_array_job)
data_array_job.shape

(41, 8)

### Using full data set

In [32]:
number_nodes = 29
highest_score_job_dsc,_,__ = stratified_cross_validation(data_array_job, DecisionSetClassifier(number_nodes), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation don

In [33]:
highest_score_job_ripper,_,__ = stratified_cross_validation(data_array_job, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 0.2222222222222222
found ruleset:
[[1=0^3=0]]
Reached score: 0.75
found ruleset:
[[3=0^6=1]]
Reached score: 0.5
found ruleset:
[[5=1] V [0=1]]
Reached score: 0.5
found ruleset:
[[5=1]]
Reached score: 0.5
found ruleset:
[[6=1^5=1]]
reached scores with 5 folds: [0.2222222222222222, 0.75, 0.5, 0.5, 0.5]


In [34]:
list_tableizer_job1 = [f'full data set of shape {data_array_job.shape}; #folds=5', str(highest_score_job_dsc)+f' with {number_nodes} nodes', highest_score_job_ripper]

### Using shortened data set

In [35]:
data_array_job_shortened = data_array_job[:,2:]
data_array_job_shortened = remove_duplicates(data_array_job_shortened)
data_array_job_shortened.shape

(18, 6)

In [36]:
number_nodes = 20 
highest_score_job_shortened_dsc,_,__ = stratified_cross_validation(data_array_job_shortened, DecisionSetClassifier(number_nodes), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
no sample for our rule fitting
Reached score: 0.5
found ruleset:
+-----------------------+-------------------------------------------+-----------------------------------+-----------------+
|         rules         | # test samples fulfilling rule conditions | # mistakenly fitting test samples | % mistaken fits |
+-----------------------+-------------------------------------------+-----------------------



CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
no sample for our rule fitting
Reached score: 0.6666666666666666
found ruleset:
+----------------------+-------------------------------------------+-----------------------------------+-----------------+
|        rules         | # test samples fulfilling rule conditions | # mistakenly fitting test samples | % mistaken fits |
+----------------------+-------------------------------------------+-----------------------------------+-----------------+
| ¬4 ∧ ¬0 ∧ 2 → class  |                     0                     |                 0                 |       0.0 

In [37]:
highest_score_job_shortened_ripper,_,__ = stratified_cross_validation(data_array_job_shortened, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 0.75
found ruleset:
[[3=1]]
Reached score: 0.75
found ruleset:
[[3=1]]
Reached score: 0.75
found ruleset:
[[2=1]]
Reached score: 0.3333333333333333
found ruleset:
[[1=0]]
Reached score: 0.6666666666666666
found ruleset:
[[1=0]]
reached scores with 5 folds: [0.75, 0.75, 0.75, 0.3333333333333333, 0.6666666666666666]


In [38]:
list_tableizer_job2 = [f'shortened data set of shape {data_array_job_shortened.shape}; #folds=5', str(highest_score_job_shortened_dsc)+f' with {number_nodes} nodes', highest_score_job_shortened_ripper]

Sum up results in a table:

In [39]:
tableizer('recuritment data set' ,list_tableizer_job1,list_tableizer_job2)

data set forms,our decsision set classifier,RIPPER
"full data set of shape (41, 8); #folds=5",None with 29 nodes,0.75
"shortened data set of shape (18, 6); #folds=5",0.6666666666666666 with 20 nodes,0.75


**Interpretation:**<br>
-1: means that the problem was too complex for our decision set classifier. We can extend the recursion depth such that our classifier can make use of more computational ressources. This can be varied under: sys.setrecursionlimit(number)<br><br>
None: means that our decision set classifier didn't find a model for our given data. A reason could be when training with small data sets and stratified cross validation, that there is indeed no pattern for which our classifier could find for a certain amount of rule nodes. We could increase the number of rule nodes 'number_nodes' (danger of too high complexity) and the data set.

# 4. Working closer with the Mushroom data set

## 4.a. Convert the discrete data set into a binary one

In [40]:
data_dir = Path("datasets") / Path("mushrooms") / Path("agaricus-lepiota.data")
data_df = pd.read_csv(data_dir, sep=',', header=None)
data_df.shape
data_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


In [41]:
data_df['class'] = data_df[0].replace(['p','e'], [0,1]) # create a new column which is called 'class' and
# below added as last column
del data_df[0] # delete first column which orignially was the label column


data_df.columns = [i for i in range(data_df.shape[1])] # rename columns by indexing
data_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,x,s,n,t,p,f,c,n,k,e,...,w,w,p,w,o,p,k,s,u,0
1,x,s,y,t,a,f,c,b,k,e,...,w,w,p,w,o,p,n,n,g,1
2,b,s,w,t,l,f,c,b,n,e,...,w,w,p,w,o,p,n,n,m,1
3,x,y,w,t,p,f,c,n,n,e,...,w,w,p,w,o,p,k,s,u,0
4,x,s,g,f,n,f,w,b,k,t,...,w,w,p,w,o,e,n,a,g,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,k,s,n,f,n,a,c,b,y,e,...,o,o,p,o,o,p,b,c,l,1
8120,x,s,n,f,n,a,c,b,y,e,...,o,o,p,n,o,p,b,v,l,1
8121,f,s,n,f,n,a,c,b,n,e,...,o,o,p,o,o,p,b,c,l,1
8122,k,y,n,f,y,f,c,n,b,t,...,w,w,p,w,o,e,w,v,l,0


In [42]:
# Getting all discrete feature attributes per column/feature:

count_one_hot_features = 0
nested_list = []

for i in range(data_df.shape[1]-1): # -1 because we ignore the last column which describes the classes.
    count_one_hot_features += len(set(data_df[i].values))
    nested_list.append(list(set(data_df[i].values)))
print('Number of total binary features:', count_one_hot_features) # column size of our one hot vector
print('Discrete features attributes for each column/feature:')
nested_list

Number of total binary features: 117
Discrete features attributes for each column/feature:


[['f', 'k', 'x', 'c', 's', 'b'],
 ['y', 's', 'g', 'f'],
 ['p', 'c', 'b', 'n', 'g', 'r', 'w', 'u', 'y', 'e'],
 ['f', 't'],
 ['p', 'f', 'a', 'c', 'n', 's', 'l', 'y', 'm'],
 ['f', 'a'],
 ['w', 'c'],
 ['b', 'n'],
 ['o', 'p', 'k', 'n', 'g', 'r', 'y', 'w', 'h', 'u', 'b', 'e'],
 ['e', 't'],
 ['?', 'c', 'r', 'b', 'e'],
 ['y', 's', 'f', 'k'],
 ['y', 's', 'f', 'k'],
 ['o', 'p', 'c', 'n', 'g', 'y', 'w', 'b', 'e'],
 ['o', 'p', 'c', 'n', 'g', 'y', 'w', 'b', 'e'],
 ['p'],
 ['o', 'w', 'y', 'n'],
 ['o', 'n', 't'],
 ['p', 'f', 'n', 'l', 'e'],
 ['o', 'k', 'b', 'n', 'r', 'w', 'h', 'u', 'y'],
 ['a', 'c', 'n', 's', 'v', 'y'],
 ['p', 'd', 'g', 'w', 'l', 'u', 'm']]

In [43]:
# Convert a binary feature data array which will be used for our decision set classifier:

mushroom_arr = data_df.to_numpy()

# create one-hot-vector representing the mushroom data set:
one_hot_arr = np.zeros((mushroom_arr.shape[0],count_one_hot_features),dtype=int)

start_ind = 0

for col_ind, single_list in enumerate(nested_list):

    for ind2, single_ele in enumerate(single_list):
        
        row_coordinates = np.where(mushroom_arr[:,col_ind] == single_ele)[0]
        
        col_mask = np.array([start_ind+ind2]*len(row_coordinates))
        mask = (row_coordinates, col_mask)
        one_hot_arr[mask] = 1
        
    start_ind = start_ind + len(single_list)
    
# Add the label column to the binary data array
data_array_mush = np.append(one_hot_arr, mushroom_arr[:,-1].reshape(len(mushroom_arr[:,-1]),1), axis=1)
data_array_mush.shape

(8124, 118)

## 4.b. Try out the classifiers for full data set

In [44]:
#number_nodes = 25
#highest_score_mush_dsc,_,__ = stratified_cross_validation(data_array_mush, DecisionSetClassifier(number_nodes), number_folds=5)
# problem too big --> kernel dies
highest_score_mush_dsc = 'problem too complex'

In [45]:
highest_score_mush_ripper,score_list_4b,rule_list_4b = stratified_cross_validation(data_array_mush, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 1.0
found ruleset:
[[26=1^49=0] V [90=1^54=0^22=0] V [26=1^99=0^35=1] V [33=1^25=0^105=0] V [91=1]]
Reached score: 1.0
found ruleset:
[[26=1^49=0] V [90=1^54=0^22=0] V [26=1^108=0^33=0] V [33=1^25=0^105=0] V [26=1^62=1] V [89=1^99=0]]
Reached score: 1.0
found ruleset:
[[26=1^49=0] V [90=1^54=0^22=0] V [26=1^108=0^33=0] V [33=1^25=0^105=0] V [26=1^62=1] V [89=1^99=0]]
Reached score: 1.0
found ruleset:
[[26=1^49=0] V [90=1^54=0^22=0] V [26=1^99=0^60=0^16=0] V [33=1^49=0] V [26=1^109=1] V [91=1]]
Reached score: 1.0
found ruleset:
[[26=1^49=0] V [90=1^54=0^22=0] V [26=1^99=0^60=0^16=0] V [33=1^49=0] V [26=1^109=1] V [91=1]]
reached scores with 5 folds: [1.0, 1.0, 1.0, 1.0, 1.0]


In [46]:
list_tableizer_mush1 = [f'full data set of shape {data_array_mush.shape}; #folds=5', str(highest_score_mush_dsc)+f' with {number_nodes} nodes', highest_score_mush_ripper]

## 4.c. Use classifiers for reduced data set

In [47]:
data_array_short_mush = data_array_mush[:, 100:]
data_array_short_mush = remove_duplicates(data_array_short_mush)
data_array_short_mush.shape

(33, 18)

In [75]:
number_nodes = 18 + 7 # here1
highest_score_mush_short_dsc,_,__ = stratified_cross_validation(data_array_short_mush, DecisionSetClassifier(number_nodes), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
Reached score: None
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
no sample for our rule fitting
Reached score: 0.7142857142857143
found ruleset:
+----------------------------------------+-------------------------------------------+-----------------------------------+-----------------+
|                 rules                  | # test samples fulfilling rule conditions | # mistakenly fitting test samples | % mistaken fits |
+----------------------------------------+-



CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
no sample for our rule fitting
Reached score: 0.5714285714285714
found ruleset:
+----------------------------+-------------------------------------------+-----------------------------------+-----------------+
|           rules            | # test samples fulfilling rule conditions | # mistakenly fitting test samples | % mistaken fits |
+----------------------------+-------------------------------------------+-----------------------------------+-----------------+
|   7 ∧ ¬12 ∧ ¬16 → ¬class   |                     1                     |                 1                 |       1.0       |
|         1 → ¬class         |                     1                     |                 0                 |       0.0       |
|    ¬14 ∧ 0 ∧ ¬8 → class

In [49]:
highest_score_mush_short_ripper,ripper_score_list_short_mush,ripper_rule_list_short_mush = stratified_cross_validation(data_array_short_mush, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 0.2857142857142857
found ruleset:
[[0=1]]
Reached score: 0.7142857142857143
found ruleset:
[[8=0^1=0] V [14=1]]
Reached score: 0.5714285714285714
found ruleset:
[[1=0^7=0^8=0]]
Reached score: 0.5
found ruleset:
[[1=0^9=1] V [6=1]]
Reached score: 0.6666666666666666
found ruleset:
[[1=0^9=1] V [1=0^11=0]]
reached scores with 5 folds: [0.2857142857142857, 0.7142857142857143, 0.5714285714285714, 0.5, 0.6666666666666666]


In [76]:
list_tableizer_mush2 = [f'shortened data set of shape {data_array_short_mush.shape}; #folds=5', str(highest_score_mush_short_dsc)+f' with {number_nodes} nodes', highest_score_mush_short_ripper]

## 4.d. Trying out if our decision set classifier can effectively learn from RIPPER's used features

### ... by directly using features:

In [51]:
# Found ruleset of RIPPER for data_array_mush:
index_highest_score = np.argmax(score_list_4b)
ripper_best_ruleset = rule_list_4b[index_highest_score]
ripper_best_ruleset

'[[26=1^49=0] V [90=1^54=0^22=0] V [26=1^99=0^35=1] V [33=1^25=0^105=0] V [91=1]]'

In [52]:
# Separate found rule nodes in ruleset by 'F':
ripper_best_ruleset = ripper_best_ruleset.replace('[','F' )
ripper_best_ruleset = ripper_best_ruleset.replace('^','F' )
ripper_best_ruleset = ripper_best_ruleset[1:]
ripper_best_ruleset

'F26=1F49=0] V F90=1F54=0F22=0] V F26=1F99=0F35=1] V F33=1F25=0F105=0] V F91=1]]'

In [53]:
# find all features that were used:

# Find all used features by finding out the index area between 'F' and '=':
indices_of_binary_feat = [(m.start(0)) for m in re.finditer('F', ripper_best_ruleset)]
indices_of_binary_feat2 = [(m.start(0)) for m in re.finditer('=', ripper_best_ruleset)]

binary_feat_list = []

for start, end in zip(indices_of_binary_feat,indices_of_binary_feat2):

    binary_feat_list.append(int(ripper_best_ruleset[start+1:end]))
    
used_feat = list(set(binary_feat_list))

# Count 'V'/the number of rules in the best ruleset:
count = Counter(ripper_best_ruleset)
number_rules = count['V']

number_nodes = (len(binary_feat_list) + number_rules) + 2
# we need to add number_rules because each rule contains
# +2 because then our classifier finds model (see below)

# a class node which is not listed in the given ruleset 
used_feat

[33, 99, 35, 90, 105, 49, 54, 22, 25, 26, 91]

Let's only consider these features:

In [54]:
# Convert data array to smaller one which only contains the features which were considered by RIPPER

print('Original shape of mushroom data set:', data_array_mush.shape)

# for indexing over data array:
mask = np.array(used_feat)
mask = np.append(mask, data_array_mush.shape[1]-1) # y-values/labels become part of the mask because we want to
# get data array with them

data_array_binary_mush = data_array_mush[:,mask]
data_array_binary_mush = remove_duplicates(data_array_binary_mush)
print('Shape after only considering features used by best RIPPER:', data_array_binary_mush.shape)

Original shape of mushroom data set: (8124, 118)
Shape after only considering features used by best RIPPER: (23, 12)


Let's use the decision set classifier as stated above:

In [65]:
number_nodes = number_nodes + 6
highest_score_binary_mush_dsc,_,__ = stratified_cross_validation(data_array_binary_mush, DecisionSetClassifier(number_nodes), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
no sample for our rule fitting
no sample for our rule fitting
no sample for our rule fitting
Reached score: 0.2
found ruleset:
+-----------------------------------------------+-------------------------------------------+-----------------------------------+-----------------+
|                     rules                     | # test samples fulfilling rule conditions | # mistakenly fitting test samples | % mistaken fits |
+-----------------------------------------------+-------------------------------------------+-----------------------------------+-----------------+
| ¬5 ∧ ¬5 ∧ ¬9 ∧ ¬2 ∧ ¬2 ∧ ¬5 ∧ ¬2 ∧ ¬4 → class |                     1   



CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
no sample for our rule fitting
no sample for our rule fitting
no sample for our rule fitting
Reached score: 0.6
found ruleset:
+---------------------------------+-------------------------------------------+-----------------------------------+--------------------+
|              rules              | # test samples fulfilling rule conditions | # mistakenly fitting test samples |  % mistaken fits   |
+---------------------------------+-------------------------------------------+-----------------------------------+--------------------+
|            10 → class           |                     0                     |                 0                 |        0.0         |
|          ¬3 ∧ 0 → class         |                     1                  

In [56]:
highest_score_binary_mush_ripper,ripper_score_list_binary_mush,ripper_rule_list_binary_mush = stratified_cross_validation(data_array_binary_mush, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 0.4
found ruleset:
[[3=1^9=1]]
Reached score: 0.4
found ruleset:
[[9=1]]
Reached score: 0.6
found ruleset:
[[2=1]]
Reached score: 1.0
found ruleset:
[[3=1^6=0]]
Reached score: 0.5
found ruleset:
[[2=1^1=0^3=1]]
reached scores with 5 folds: [0.4, 0.4, 0.6, 1.0, 0.5]


In [66]:
list_tableizer_mush3 = [f'shortened binary data set of shape {data_array_binary_mush.shape}; #folds=5', str(highest_score_binary_mush_dsc)+f' with {number_nodes} nodes', highest_score_binary_mush_ripper]

RIPPER reaches for the complete mushroom dataset of shape (8124, 118) an accuracy of 1 with kfold split of 2. When we ONLY use the found features for our decision set classifier it reaches only an accuracy of 0.42857142857142855. 

### ... by finding out which discrete features where used (learning from whole disrecte features):

This approach will be more complicated for our decision set classifier than above because above one attribute is converted into one feature. However, here we have discrete features which can be handled by the RIPPER but our decision set classifier will get only the discrete features transformed to boolean values.That means our classifier has more options in learning something.

In [58]:
# Convert the objects of the mushroom data array into numeric discrete values:

discrete_mushroom_arr2 = np.copy(mushroom_arr)
discrete_mushroom_arr2.astype(str)

discrete_mushroom_arr = np.copy(mushroom_arr)
for column_index, column in enumerate(nested_list):
    for index, attribute in enumerate(column):
        mask = np.where(mushroom_arr[:,column_index] == attribute)
        discrete_mushroom_arr[mask[0], column_index] = index
        
discrete_mushroom_arr

array([[2, 1, 3, ..., 3, 5, 0],
       [2, 1, 8, ..., 2, 2, 1],
       [5, 1, 6, ..., 2, 6, 1],
       ...,
       [0, 1, 3, ..., 1, 4, 1],
       [1, 0, 3, ..., 4, 4, 0],
       [2, 1, 3, ..., 1, 4, 1]], dtype=object)

In [59]:
# Now RIPPER shall learn NOT from binary data set but from discrete one:
highest_score_dis_mush_ripper,ripper_score_list_dis_mush,ripper_rule_list_dis_mush = stratified_cross_validation(discrete_mushroom_arr, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 1.0
found ruleset:
[[4=4^9=1] V [18=0^7=0^9=0^17=0] V [4=4^10=0^7=0] V [4=4^3=0^11=1] V [6=0^9=1] V [4=4^21=0] V [4=4^11=2] V [11=0^2=3]]
Reached score: 1.0
found ruleset:
[[4=4^9=1] V [18=0^7=0^9=0^17=0] V [4=4^10=0^7=0] V [4=4^3=0^11=1] V [6=0^9=1] V [4=4^19=5^10=3^3=0] V [17=2^19=5]]
Reached score: 1.0
found ruleset:
[[4=4^9=1] V [18=0^10=1] V [4=4^10=0^7=0] V [18=0^12=0] V [4=4^10=4] V [6=0^9=1] V [4=4^12=2] V [4=4^21=0] V [4=4^14=3^6=0]]
Reached score: 1.0
found ruleset:
[[4=4^9=1] V [18=0^7=0^9=0^17=0] V [4=4^10=0^7=0] V [4=4^3=0^11=1] V [6=0^9=1] V [4=4^21=0] V [4=4^14=3^10=3]]
Reached score: 1.0
found ruleset:
[[4=4^9=1] V [18=0^7=0^9=0^17=0] V [4=4^10=0^7=0] V [4=4^3=0^11=1] V [6=0^9=1] V [4=4^21=0] V [4=4^11=2] V [11=0^2=3]]
reached scores with 5 folds: [1.0, 1.0, 1.0, 1.0, 1.0]


Now we can see all used features to end up with a positive class prediction, the important attribute values and number of nodes.<br>
Take the best performing rule set:

In [60]:
index_highest_score = np.argmax(ripper_score_list_dis_mush)
ripper_best_ruleset = ripper_rule_list_dis_mush[index_highest_score]
ripper_best_ruleset

'[[4=4^9=1] V [18=0^7=0^9=0^17=0] V [4=4^10=0^7=0] V [4=4^3=0^11=1] V [6=0^9=1] V [4=4^21=0] V [4=4^11=2] V [11=0^2=3]]'

In [61]:
ripper_best_ruleset = ripper_best_ruleset.replace('[','F' )
ripper_best_ruleset = ripper_best_ruleset.replace('^','F' )
ripper_best_ruleset = ripper_best_ruleset[1:]
ripper_best_ruleset

'F4=4F9=1] V F18=0F7=0F9=0F17=0] V F4=4F10=0F7=0] V F4=4F3=0F11=1] V F6=0F9=1] V F4=4F21=0] V F4=4F11=2] V F11=0F2=3]]'

In [62]:
# find all features that were used:

indices_of_discrete_feat = [(m.start(0)) for m in re.finditer('F', ripper_best_ruleset)]
indices_of_discrete_feat2 = [(m.start(0)) for m in re.finditer('=', ripper_best_ruleset)]

discrete_feat_list = []

for start, end in zip(indices_of_discrete_feat,indices_of_discrete_feat2):

    discrete_feat_list.append(int(ripper_best_ruleset[start+1:end]))

used_feat = list(set(discrete_feat_list))

# Count 'V'/the number of rules in the best ruleset:
count = Counter(ripper_best_ruleset)
number_rules = count['V']

number_nodes = len(discrete_feat_list) + number_rules
used_feat

[2, 3, 4, 6, 7, 9, 10, 11, 17, 18, 21]

Half of the features where used. <br>
Let's come to our decision set classifier. Transfer the used features into one hot encoding for binary usage:

In [63]:
# Only find out the feature attributes for the used discrete features by RIPPER:
# Reason: Our decision set classifier shall later only look at that used features but can only do this
# for binary ones. Therefore, we transfer discrete features into binary features

count_one_hot_features = 0
nested_list_updated = []
for index in used_feat:
    used = nested_list[index] # nested list with attribute values of each feature which where used by RIPPER
    count_one_hot_features += len(used)
    nested_list_updated.append(used)


# create one-hot-vector representing the mushroom data set:

one_hot_arr = np.zeros((mushroom_arr.shape[0], count_one_hot_features),dtype=int)

start_ind = 0
for col_ind, single_list in zip(used_feat, nested_list_updated):
    for ind2, single_ele in enumerate(single_list):
        row_coordinates = np.where(mushroom_arr[:,col_ind] == single_ele)[0]
        
        col_mask = np.array([start_ind+ind2]*len(row_coordinates))
        mask = (row_coordinates, col_mask)
        one_hot_arr[mask] = 1
    start_ind = start_ind + len(single_list)
    
# Create the data set with selected features:
data_array_with_y_discrete_mush = np.append(one_hot_arr, mushroom_arr[:,-1].reshape(len(mushroom_arr[:,-1]),1), axis=1)
data_array_discrete_mush = remove_duplicates(data_array_with_y_discrete_mush)

print('Original mushroom data set shape:', mushroom_arr.shape)
print('Shape only considering the discrete features used by RIPPER:', data_array_discrete_mush.shape)

Original mushroom data set shape: (8124, 23)
Shape only considering the discrete features used by RIPPER: (130, 52)


Let's use the decision set classifier as stated above:

In [67]:
number_nodes = number_nodes
highest_score_dis_mush_dsc,_,__ = stratified_cross_validation(data_array_discrete_mush, DecisionSetClassifier(number_nodes), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
Reached score: -1
found ruleset:
None
CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
no sample for our rule fitting
Reached score: 0.9615384615384616
found ruleset:
+-----------------------------+-------------------------------------------+-----------------------------------+---------------------+
|            rules            | # test samples fulfilling rule conditions | # mistakenly fitting test samples |   % mistaken fits   |
+-----------------------------+-------------------------------------------+-----------------------------------+---------------------+



CONSTRAINT 1 done
CONSTRAINT 2 done
CONSTRAINT 3 done
CONSTRAINT 4 done
CONSTRAINT 5 done
CONSTRAINT 6 done
final string done
expressing string done
tseitin transformation done
dimacs conversion done
finding model done
no sample for our rule fitting
Reached score: 0.9615384615384616
found ruleset:
+------------------------------------------------+-------------------------------------------+-----------------------------------+--------------------+
|                     rules                      | # test samples fulfilling rule conditions | # mistakenly fitting test samples |  % mistaken fits   |
+------------------------------------------------+-------------------------------------------+-----------------------------------+--------------------+
|                  12 → ¬class                   |                     0                     |                 0                 |        0.0         |
|                ¬25 ∧ 21 → class                |                     2                     

In [68]:
highest_score_dis_mush_ripper,_,__ = stratified_cross_validation(data_array_discrete_mush, lw.RIPPER(), number_folds=5)

DO NOT FORGET TO CHANGE THE ARGUMENT path_to_kissat_solver. This leads us to the storage place of KISSAT
Reached score: 0.9615384615384616
found ruleset:
[[16=1^21=1] V [10=0^14=1] V [16=1^44=1] V [16=1^47=1] V [18=1] V [16=1^10=1]]
Reached score: 0.8846153846153846
found ruleset:
[[16=1^30=0^8=0] V [10=0^13=0^36=1] V [16=1^46=0^8=0^50=0]]
Reached score: 0.8461538461538461
found ruleset:
[[16=1^21=1] V [10=0^30=0^12=0] V [16=1] V [14=1] V [18=1]]
Reached score: 0.8846153846153846
found ruleset:
[[23=1^30=0] V [16=1^46=0^30=1^50=0] V [18=1] V [14=1]]
Reached score: 1.0
found ruleset:
[[16=1^10=1] V [10=0^49=0^46=0] V [18=1] V [14=1]]
reached scores with 5 folds: [0.9615384615384616, 0.8846153846153846, 0.8461538461538461, 0.8846153846153846, 1.0]


In [69]:
list_tableizer_mush4 = [f'shortened discrete data set of shape {data_array_discrete_mush.shape}; #folds=5', str(highest_score_dis_mush_dsc)+f' with {number_nodes} nodes', highest_score_dis_mush_ripper]

Final table for the mushroom data set. Having a look how good each classifier performed (accuracy) on each condition (data set manipulation):

In [77]:
tableizer('mushroom data set' ,list_tableizer_mush1,list_tableizer_mush2,list_tableizer_mush3,list_tableizer_mush4)

data set forms,our decsision set classifier,RIPPER
"full data set of shape (8124, 118); #folds=5",problem too complex with 20 nodes,1.0
"shortened data set of shape (33, 18); #folds=5",0.7142857142857143 with 25 nodes,0.7142857142857143
"shortened binary data set of shape (23, 12); #folds=5",0.6 with 33 nodes,1.0
"shortened discrete data set of shape (130, 52); #folds=5",1.0 with 33 nodes,1.0


**Interpretation:**<br>
-1: means that the problem was too complex for our decision set classifier. We can extend the recursion depth such that our classifier can make use of more computational ressources. This can be varied under: sys.setrecursionlimit(number)<br><br>
None: means that our decision set classifier didn't find a model for our given data. A reason could be when training with small data sets and stratified cross validation, that there is indeed no pattern for which our classifier could find for a certain amount of rule nodes. We could increase the number of rule nodes 'number_nodes' and the data set.

## 5. Final words

RIPPER outperforms our implemented decision set classifier 'opt' for the given data sets. Furthermore, it is much faster. One more problem is the scalability of our opt. It struggles with handling high data amounts. Nevertheless, the loops (which lead to heavy and long computations for the final SAT solver) are unavoidable according to one of the authors. <br>
However, the opt was only the starting classifier in the underlying paper "Computing Optimal Decision Sets with SAT" (Yu et al., 2020). It would be interesting to see the performance of further going classifiers from the paper which also include soft constraints. These constraints allow us a to lower the accuracy to less than 1 on the training set such that the model wouldn't need a 100 % fit to training data.<br>
In conclusion, this work gave me a good insight in working with a decision set concept of which I had never heard before. It is an interesting alternative apporach of working with data in the field of AI.