# Exploring Random Forest and Fidex rule generation for obesity risk classification

**Introduction:**

Welcome to HES-Xplain, our interactive platform designed to facilitate the use of explainable artificial intelligence (XAI) techniques. In this use case, we dive into obesity risk classification and showcase another application of explainability techniques.

This notebook is an alternative to [`Exploring Dimlp and Fidex rule generation for breast cancer classification`](TODO). It follows a similar structure but uses a different dataset and training model to show the versatility of our explainability tools. In addition, we cover how to pre-process a dataset that a model cannot use as-is and convert it into an exploitable one.

**Objectives:**
    1. Observe a different use case where XAI can be used.
    2. Understand how to pre-process data.
    3. Understand how to use Random Forests and Fidex.
    4. Showcase the versatility of HES-Xplain using a different dataset and training model.
    5. Provide practical insights into applying Random Forests and Fidex to obesity risk classifiers through an interactive notebook.
    6. Foster a community of XAI enthusiasts and practitioners.

**Outline:**

    1. Dataset and Problem Statement.
    2. Load and pre-process the dataset.
    3. Train the Model.
    4. Local rules generation - Fidex
    5. Global ruleSet generation - FidexGlo
    6. Conclusion.

Through this use case, we aim to empower users to grasp the potential of Random Forests and Fidex as tools for transparent and interpretable classification. With HES-Xplain, we make XAI accessible, helping users build trust in their models and make informed decisions.

# Dataset and Problem Statement
The dataset we'll be working with is the [obesity or CVD risk](https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster/data) dataset, available on [Kaggle](https://www.kaggle.com). It comprises 2111 records of anonymized data pertaining to South American individuals and their dietary habits. In this notebook, our focus is on tackling another prevalent medical challenge: classifying the risk of obesity based on various factors. These factors, drawn from the dataset, are outlined below with their original names:

| **Full name**                             | **Used label** |                                                        **Values/Ranges**                                                       | **Description**                                                                     |
|-------------------------------------------|:--------------:|:------------------------------------------------------------------------------------------------------------------------------:|-------------------------------------------------------------------------------------|
| Gender                                    |     Gender     |                                                          Male, Female                                                          | Person's biological gender                                                          |
| Age                                       |       Age      |                                                             [14:61]                                                            | Person's age in years                                                               |
| Height                                    |     Height     |                                                           [1.45:1.98]                                                          | Person's height in meters                                                           |
| Weight                                    |     Weight     |                                                            [39:173]                                                            | Person's weight in kilograms                                                        |
| Family history with overweight            |      FHWO      |                                                             yes, no                                                            | Whether the person has at least one sibling who suffers or suffered from overweight |
| Frequent consumption of high-caloric food |      FAVC      |                                                             yes, no                                                            | Whether the person is frequently consuming high-caloric food                        |
| Frequency of consumption of vegetables    |      FCVC      |                                                              [1:3]                                                             | Leveled frequency of consumption of vegetables                                      |
| Number of main meals                      |       NCP      |                                                              [1:4]                                                             | Person's number of main meals during a day                                          |
| Consumption of food between meals         |      CAEC      |                                                no, sometimes, frequently, always                                               | Person's consumption of food between main meals frequency per day                   |
| Smoker or not                             |      SMOKE     |                                                             yes, no                                                            | Whether the person smokes                                                           |
| Consumption of water daily                |      CH2O      |                                                              [1:3]                                                             | Numeric representation of water consumption frequency per day                       |
| Calorie consumption monitoring            |       SCC      |                                                             yes, no                                                            | Whether the person monitors their daily calorie intake                              |
| Physical activity frequency               |       FAF      |                                                              [0:3]                                                             | Numeric representation of physical activity frequency per week                      |
| Time using technology devices             |       TUE      |                                                              [0:2]                                                             | Numeric representation of electronic devices use frequency per day                  |
| Consumption of alcohol                    |      CALC      |                                                no, sometimes, frequently, always                                               | Frequency of alcohol consumption                                                    |
| Transportation used                       |     MTRANS     |                                   Public_Transportation, Automobile, Bike, Motorbike, Walking                                  | Medium usually used to transit                                                      |
| Obesity level deducted                    |       OLD      | Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III | Obesity level observed according to the interpretation of the person's BMI          |

In our case, we want to train a random forest model that classifies the obesity level (`OLD`) from all the other features. To do so, we need to modify the original dataset slightly so that several features become digestible by the model.

# Load and pre-process the dataset
To kick things off, we'll begin by simplifying the names of the columns and taking a look at the CSV file containing the raw data:

In [37]:
import pandas as pd

# opt in to pandas' future replace() behavior to silence the downcasting FutureWarning
pd.set_option('future.no_silent_downcasting', True)

dataset = pd.read_csv("data/OCDDataset/ObesityDataSet.csv")

# reducing labels names size
dataset.rename(
    columns={
        "family_history_with_overweight": "FHWO",
        "NObeyesdad": "OLD",
    },
    inplace=True
)

# shuffle the entire dataset
dataset = dataset.sample(frac=1)

nrows = int(dataset.shape[0] * 1) # fraction of the dataset to keep (1 keeps all of it)

dataset = dataset.iloc[:nrows,:]

dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
2070,Female,24.693108,1.667383,112.982549,yes,yes,3.0,3.0,Sometimes,no,2.887909,no,0.312923,0.210997,Sometimes,Public_Transportation,Obesity_Type_III
1946,Female,23.69484,1.637524,113.90506,yes,yes,3.0,3.0,Sometimes,no,2.495961,no,0.189831,0.652289,Sometimes,Public_Transportation,Obesity_Type_III
794,Female,17.451085,1.6,65.0,yes,yes,3.0,2.449723,Sometimes,no,2.0,yes,0.479592,1.720642,Sometimes,Public_Transportation,Overweight_Level_I
1741,Male,28.255199,1.816547,120.699119,yes,yes,2.997951,3.0,Sometimes,no,2.715856,no,0.739881,0.972054,Sometimes,Automobile,Obesity_Type_II
1795,Male,20.068432,1.657132,105.580491,yes,yes,2.724121,1.437959,Sometimes,no,1.590418,no,0.029603,1.122118,no,Public_Transportation,Obesity_Type_II


You can observe a sample of the dataset. To make the dataset more compatible with machine learning, we'll start by converting features that have "yes" or "no" values into their boolean representation:

In [38]:
# map "yes"/"no" values to 1/0
strToBinDict = {"yes": 1, "no": 0}
dataset["FHWO"] = dataset["FHWO"].replace(strToBinDict).astype("int8")
dataset["FAVC"] = dataset["FAVC"].replace(strToBinDict).astype("int8")
dataset["SMOKE"] = dataset["SMOKE"].replace(strToBinDict).astype("int8")
dataset["SCC"] = dataset["SCC"].replace(strToBinDict).astype("int8")
dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
2070,Female,24.693108,1.667383,112.982549,1,1,3.0,3.0,Sometimes,0,2.887909,0,0.312923,0.210997,Sometimes,Public_Transportation,Obesity_Type_III
1946,Female,23.69484,1.637524,113.90506,1,1,3.0,3.0,Sometimes,0,2.495961,0,0.189831,0.652289,Sometimes,Public_Transportation,Obesity_Type_III
794,Female,17.451085,1.6,65.0,1,1,3.0,2.449723,Sometimes,0,2.0,1,0.479592,1.720642,Sometimes,Public_Transportation,Overweight_Level_I
1741,Male,28.255199,1.816547,120.699119,1,1,2.997951,3.0,Sometimes,0,2.715856,0,0.739881,0.972054,Sometimes,Automobile,Obesity_Type_II
1795,Male,20.068432,1.657132,105.580491,1,1,2.724121,1.437959,Sometimes,0,1.590418,0,0.029603,1.122118,no,Public_Transportation,Obesity_Type_II


We will convert the `CAEC` and `CALC` columns, which contain the values "Always," "Frequently," "Sometimes," and "no," into a numerical scale from 0.00 to 1.00 based on their frequency. Here's the conversion table:

| **Adjective** | **Conversion Value** |
|---------------|:--------------------:|
| Always        |         1.0          |
| Frequently    |         0.66         |
| Sometimes     |         0.33         |
| no            |         0.0          |

We'll apply a similar procedure as before to achieve this:

In [39]:
adjToValDict = {"Always": 1.0, "Frequently": 0.66, "Sometimes": 0.33, "no": 0.0}
dataset["CAEC"] = dataset["CAEC"].replace(adjToValDict).astype('float64')
dataset["CALC"] = dataset["CALC"].replace(adjToValDict).astype('float64')
dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
2070,Female,24.693108,1.667383,112.982549,1,1,3.0,3.0,0.33,0,2.887909,0,0.312923,0.210997,0.33,Public_Transportation,Obesity_Type_III
1946,Female,23.69484,1.637524,113.90506,1,1,3.0,3.0,0.33,0,2.495961,0,0.189831,0.652289,0.33,Public_Transportation,Obesity_Type_III
794,Female,17.451085,1.6,65.0,1,1,3.0,2.449723,0.33,0,2.0,1,0.479592,1.720642,0.33,Public_Transportation,Overweight_Level_I
1741,Male,28.255199,1.816547,120.699119,1,1,2.997951,3.0,0.33,0,2.715856,0,0.739881,0.972054,0.33,Automobile,Obesity_Type_II
1795,Male,20.068432,1.657132,105.580491,1,1,2.724121,1.437959,0.33,0,1.590418,0,0.029603,1.122118,0.0,Public_Transportation,Obesity_Type_II


We'll address three additional columns named Gender, MTRANS, and OLD, which currently contain non-numerical values. These values represent individual options and cannot be quantified using a scale like before. Instead, we'll encode them using a technique called "one hot encoding." This technique will assign a binary value to each option, representing its presence or absence. Let's proceed with applying one hot encoding:

In [40]:
genderCols = pd.get_dummies(dataset["Gender"], prefix="Gender",prefix_sep='_', dtype='int8')
mtransCols = pd.get_dummies(dataset["MTRANS"], prefix="MTRANS",prefix_sep='_', dtype='int8')
oldCols = pd.get_dummies(dataset["OLD"], prefix="OLD",prefix_sep='_', dtype='int8')
dataset = pd.concat([genderCols, dataset.iloc[:,:16], mtransCols,  dataset.iloc[:,16:], oldCols], axis=1)
dataset.drop(["Gender", "MTRANS", "OLD"], axis=1, inplace=True)
dataset.head()

Unnamed: 0,Gender_Female,Gender_Male,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,...,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking,OLD_Insufficient_Weight,OLD_Normal_Weight,OLD_Obesity_Type_I,OLD_Obesity_Type_II,OLD_Obesity_Type_III,OLD_Overweight_Level_I,OLD_Overweight_Level_II
2070,1,0,24.693108,1.667383,112.982549,1,1,3.0,3.0,0.33,...,0,1,0,0,0,0,0,1,0,0
1946,1,0,23.69484,1.637524,113.90506,1,1,3.0,3.0,0.33,...,0,1,0,0,0,0,0,1,0,0
794,1,0,17.451085,1.6,65.0,1,1,3.0,2.449723,0.33,...,0,1,0,0,0,0,0,0,1,0
1741,0,1,28.255199,1.816547,120.699119,1,1,2.997951,3.0,0.33,...,0,0,0,0,0,0,1,0,0,0
1795,0,1,20.068432,1.657132,105.580491,1,1,2.724121,1.437959,0.33,...,0,1,0,0,0,0,1,0,0,0
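
To make the idea concrete, here is a minimal plain-Python sketch of what one-hot encoding does, applied to a handful of made-up `MTRANS` values. The `one_hot` helper is hypothetical and only mimics the general behavior of `pd.get_dummies` (one 0/1 column per distinct option, with columns in sorted order); it is not how pandas implements it.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) for a one-hot encoding of `values`."""
    # One column per distinct option, in sorted order.
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    # Each row holds a single 1, in the column matching its original value.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


# Made-up sample values, for illustration only.
transport = ["Walking", "Bike", "Automobile", "Bike"]
columns, rows = one_hot(transport, "MTRANS")

print(columns)
for row in rows:
    print(row)
```

Every row sums to 1, which is a handy sanity check on the encoded `Gender_`, `MTRANS_` and `OLD_` columns as well.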


Now that our dataset is prepared to be used, let's ensure the data's integrity by verifying some information, starting with a general overview of our columns:

In [41]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2111 entries, 2070 to 1751
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Gender_Female                 2111 non-null   int8   
 1   Gender_Male                   2111 non-null   int8   
 2   Age                           2111 non-null   float64
 3   Height                        2111 non-null   float64
 4   Weight                        2111 non-null   float64
 5   FHWO                          2111 non-null   int8   
 6   FAVC                          2111 non-null   int8   
 7   FCVC                          2111 non-null   float64
 8   NCP                           2111 non-null   float64
 9   CAEC                          2111 non-null   float64
 10  SMOKE                         2111 non-null   int8   
 11  CH2O                          2111 non-null   float64
 12  SCC                           2111 non-null   int8   
 13  FAF  

Next, let's verify that there are no missing values in our dataset:

In [42]:
dataset.isnull().sum()

Gender_Female                   0
Gender_Male                     0
Age                             0
Height                          0
Weight                          0
FHWO                            0
FAVC                            0
FCVC                            0
NCP                             0
CAEC                            0
SMOKE                           0
CH2O                            0
SCC                             0
FAF                             0
TUE                             0
CALC                            0
MTRANS_Automobile               0
MTRANS_Bike                     0
MTRANS_Motorbike                0
MTRANS_Public_Transportation    0
MTRANS_Walking                  0
OLD_Insufficient_Weight         0
OLD_Normal_Weight               0
OLD_Obesity_Type_I              0
OLD_Obesity_Type_II             0
OLD_Obesity_Type_III            0
OLD_Overweight_Level_I          0
OLD_Overweight_Level_II         0
dtype: int64

Now, we'll ensure that there are no duplicated records in our dataset:

In [43]:
dataset = dataset.drop_duplicates()
dataset.duplicated().sum()

0

Lastly, we need to split our dataset into two distinct datasets, one for training and the other for testing, and then write them to separate files so that the model can use them.

In [44]:
nRecords = dataset.shape[0]
trainSplit = int(0.75 * nRecords)

trainds = dataset.iloc[:trainSplit, :]
testds = dataset.iloc[trainSplit:, :]

print(
f"""
Total number of records:\t{nRecords}
Number of records for training:\t{trainds.shape[0]}
Number of records for testing:\t{testds.shape[0]}
"""
)


Total number of records:	2087
Number of records for training:	1565
Number of records for testing:	522



In [45]:
# We are writing .txt files to comply with our random forest algorithm
path = "data/OCDDataset/"
trainDataFile = "train_dataset.txt"
testDataFile = "test_dataset.txt"

trainds.to_csv(path + trainDataFile, header=False, index=False)
testds.to_csv(path + testDataFile, header=False, index=False)

Now that our data is ready and checked for any missing or duplicate values, we're all set to move on to the next step. In the next section, we'll use our prepared dataset to train our Random Forest model.

# Train the Model

Let's dig into the main part of our task: training our model. To do so, we're going to use a type of model called random forests. Random forests are a type of machine learning model that works by creating a multitude of decision trees during training. Each tree independently makes predictions, and the final prediction is determined by averaging the predictions of all the trees (for regression tasks) or taking a majority vote (for classification tasks). 
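
The majority vote can be sketched in a few lines. The three "trees" below are hand-written, hypothetical stand-ins for trained decision trees, and their thresholds are made up; the sketch only illustrates how individual predictions are combined, not how a forest is trained.

```python
from collections import Counter


# Hypothetical stand-ins for trained decision trees (made-up thresholds).
def tree_a(sample):
    return "Obesity_Type_I" if sample["Weight"] > 90 else "Normal_Weight"


def tree_b(sample):
    return "Obesity_Type_I" if sample["FAF"] < 1.0 else "Normal_Weight"


def tree_c(sample):
    return "Obesity_Type_I" if sample["FAVC"] == 1 else "Normal_Weight"


def forest_predict(trees, sample):
    """Let every tree vote and return the most frequent class with the tally."""
    votes = Counter(tree(sample) for tree in trees)
    label, _ = votes.most_common(1)[0]
    return label, dict(votes)


sample = {"Weight": 95.0, "FAF": 2.0, "FAVC": 1}
label, votes = forest_predict([tree_a, tree_b, tree_c], sample)
print(label, votes)
```

Here two of the three trees vote for `Obesity_Type_I`, so that class wins even though one tree disagrees.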

In this case, we are going to use a Python program called [randForestsTrn](https://github.com/HES-XPLAIN/dimlpfidex/blob/main/trainings/randForestsTrn.py). Let's begin with importing the script and printing its help message to observe every option available:

In [46]:
from trainings.randForestsTrn import randForestsTrn as randomForest

randomForest("--help")


Usage: 
--train_data_file <str> --test_data_file <str> --nb_attributes <int [1,inf[> --nb_classes <int [1,inf[> [-h, --help] [--json_config_file <str>] [--root_folder <str>] [--train_class_file <str>] [--train_pred_outfile <str>] [--test_class_file <str>] [--test_pred_outfile <str>] [--console_file <str>] [--stats_file <str>] [--rules_outfile <str>] [--n_estimators <int [1,inf[>] [--criterion <{gini, entropy, log_loss}>] [--max_depth <int [1,inf[>] [--min_samples_split <int [2,inf[ U float]0,1.0]>] [--min_samples_leaf <int [1,inf[ U float]0,1[>] [--min_weight_fraction_leaf <float [0,0.5]>] [--max_features <{sqrt, log2, all, float ]0,1[, int [1,inf[}>] [--max_leaf_nodes <int [2,inf[>] [--min_impurity_decrease <float [0,inf[>] [--bootstrap <bool>] [--oob_score <bool>] [--n_jobs <int>] [--seed <{int [0,inf[}>] [--verbose <int [0,inf[>] [--warm_start <bool>] [--class_weight <{balanced, dict}>] [--ccp_alpha <float [0,inf[>] [--max_samples <int [1,inf[ U float]0,1.0]>]

This is a parser for 

-1

The output shows the many options available. For now, we focus only on the required parameters (plus `--root_folder`, for convenience). We already have the train and test data files, generated in the previous section, so we just need the number of attributes and the number of classes. Because the original class column `OLD` has been one-hot encoded, the number of classes is the number of columns carrying the `OLD_` prefix, and the remaining columns are the attributes.

In [47]:
labels = list(dataset.columns)

nclasses = sum(1 for label in labels if label.startswith("OLD_"))
nattributes = len(labels) - nclasses
attributesFile = "attributes_file.txt" 


with open(path+attributesFile, 'w') as af:
    for label in dataset.columns:
        af.write(label+'\n')

print(f"# attributes:\t{nattributes}\n# classes:\t{nclasses}")

# attributes:	21
# classes:	7


So now, let's gather the elements we need to run the model:

| **Parameter name** | **Input**           |
|--------------------|:-------------------:|
| --root_folder      | data/OCDDataset/    |
| --train_data_file  | train_dataset.txt   |
| --test_data_file   | test_dataset.txt    |
| --nb_attributes    | 21                  |
| --nb_classes       | 7                   |

We can now run our random forest model with these parameters and leave the remaining options at their defaults.

In [48]:
args = f"--root_folder {path} --train_data_file {trainDataFile} --test_data_file {testDataFile} --nb_attributes {nattributes} --nb_classes {nclasses}"
randomForest(args)

Parameters list:
 - root_folder                                                   data/OCDDataset/
 - train_data_file                                               data/OCDDataset/train_dataset.txt
 - train_pred_outfile                                            data/OCDDataset/predTrain.out
 - test_data_file                                                data/OCDDataset/test_dataset.txt
 - test_pred_outfile                                             data/OCDDataset/predTest.out
 - stats_file                                                    data/OCDDataset/stats.txt
 - nb_attributes                                                 21
 - nb_classes                                                    7
 - rules_outfile                                                 data/OCDDataset/RF_rules.rls
 - n_estimators                                                  100
 - criterion                                                     gini
 - min_samples_split                                    

0

The algorithm finished and generated the file given by `rules_outfile` (`RF_rules.rls` by default, as shown in the parameters list). It contains all the rules generated by each tree of the random forest. Let's look at some of the rules from the first tree:

In [49]:
nlinesToVisualize = 20
rulesFile = "RF_rules.rls" # if you used the --rules_outfile option when running randomForest(), please don't forget to adapt this according to your input
lines = ""

with open(path + rulesFile) as f:
    for _ in range(nlinesToVisualize):
        lines += next(f) 

print(lines)

-------------------
Tree 1
-------------------
Rule 1: X0<=0.5 X9<=0.4950000196695328 X2<=22.768614768981934 X8<=3.0478315353393555 X9<=0.16500000655651093 X2<=21.518756866455078 -> class 5 Covering: [0, 0, 0, 0, 0, 1, 0]
Rule 2: X0<=0.5 X9<=0.4950000196695328 X2<=22.768614768981934 X8<=3.0478315353393555 X9<=0.16500000655651093 X2>21.518756866455078 X2<=22.009114265441895 -> class 1 Covering: [0, 1, 0, 0, 0, 0, 0]
Rule 3: X0<=0.5 X9<=0.4950000196695328 X2<=22.768614768981934 X8<=3.0478315353393555 X9<=0.16500000655651093 X2>21.518756866455078 X2>22.009114265441895 -> class 5 Covering: [0, 0, 0, 0, 0, 1, 0]
Rule 4: X0<=0.5 X9<=0.4950000196695328 X2<=22.768614768981934 X8<=3.0478315353393555 X9>0.16500000655651093 X19<=0.5 X11<=2.5 X2<=17.5 X15<=0.16500000655651093 -> class 0 Covering: [1, 0, 0, 0, 0, 0, 0]
Rule 5: X0<=0.5 X9<=0.4950000196695328 X2<=22.768614768981934 X8<=3.0478315353393555 X9>0.16500000655651093 X19<=0.5 X11<=2.5 X2<=17.5 X15>0.16500000655651093 X7<=2.5 -> class 2 Cove

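Each block starts with the tree's number, followed by one rule per leaf of that tree. A rule is the conjunction of the split conditions met on the path from the root to the leaf: `X<i>` refers to the attribute at index `i` (counting from 0, which appears to follow the column order of our dataset, so `X2` would be `Age` and `X9` would be `CAEC`), and the value after `<=` or `>` is the split threshold. Thresholds such as `0.495` and `0.165` fall halfway between two of the values we assigned to `CAEC` and `CALC`. After the arrow comes the index of the predicted class, and `Covering` appears to describe, for each of the 7 classes, the training samples that reach the leaf.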
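
The rules printed above refer to attributes by position (`X0`, `X2`, ...), which makes them hard to read. As a rough sketch, and assuming these indices follow the column order written to `attributes_file.txt` (an assumption, though the thresholds are consistent with it), a hypothetical `readable_rule` helper can substitute the attribute names and round the thresholds:

```python
import re

# Attribute names in dataset column order (assumed to match the X<i> indices).
ATTRIBUTES = [
    "Gender_Female", "Gender_Male", "Age", "Height", "Weight", "FHWO", "FAVC",
    "FCVC", "NCP", "CAEC", "SMOKE", "CH2O", "SCC", "FAF", "TUE", "CALC",
    "MTRANS_Automobile", "MTRANS_Bike", "MTRANS_Motorbike",
    "MTRANS_Public_Transportation", "MTRANS_Walking",
]

# Matches antecedents such as "X9<=0.495" or "X2>21.5".
ANTECEDENT = re.compile(r"X(\d+)(<=|>)([-\d.eE+]+)")


def readable_rule(rule_line, attributes=ATTRIBUTES, digits=3):
    """Replace X<i> with attribute names and round thresholds (hypothetical helper)."""
    body, _, conclusion = rule_line.partition("->")
    parts = [
        f"{attributes[int(index)]}{op}{round(float(threshold), digits)}"
        for index, op, threshold in ANTECEDENT.findall(body)
    ]
    # Keep only the predicted class, dropping the trailing "Covering: [...]".
    predicted = conclusion.split("Covering")[0].strip()
    return " ".join(parts) + " -> " + predicted


rule_1 = (
    "Rule 1: X0<=0.5 X9<=0.4950000196695328 X2<=22.768614768981934 "
    "X8<=3.0478315353393555 X9<=0.16500000655651093 X2<=21.518756866455078 "
    "-> class 5 Covering: [0, 0, 0, 0, 0, 1, 0]"
)
decoded = readable_rule(rule_1)
print(decoded)
```

Read this way, the first rule only involves gender, snacking frequency (`CAEC`), age and the number of main meals (`NCP`).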

Now that we have trained our model and generated its set of rules, let's use Fidex to derive explanations from them.

# Local rules generation - Fidex

Now we can generate some local rules to explain the model's results. We can start by launching Fidex on one test sample. This generates a rule explaining that sample locally: it is local because the algorithm searches for a rule for this one sample only.

Fidex is located in the fidex module. Let's take a look at its parameters:

In [50]:
from dimlpfidex.fidex import fidex

status = fidex("--help")


---------------------------------------------------------------------

The arguments can be specified in the command or in a json configuration file with --json_config_file your_config_file.json.

----------------------------

Required parameters:

--train_data_file <str>       Train data file
--train_pred_file <str>       Train prediction file
--train_class_file <str>      Train true class file, not mandatory if classes are specified in train data file
--test_data_file <str>        Test sample(s) data file with data, prediction(if no --test_pred_file) and true class(if no --test_class_file)
--weights_file <str>          Weights file (not mandatory if a rules file is given with --rules_file)
--rules_file <str>            Rules file to be converted to hyperlocus (not mandatory if a weights file is given with --weights_file)
--rules_outfile <str>         Rule(s) output file. If a .json filename is given, rules are saved in a special json format
--nb_attributes <int [1,inf[> Number of at

Let's have a closer look at the Fidex help output. Firstly, we can observe that there are `required parameters`. 

- `--train_data_file`: a file containing features from the training portion of the dataset
- `--train_pred_file`: a file containing predictions from the training portion of the dataset
- `--train_class_file`: a file containing classes from the training portion of the dataset
- `--test_data_file`: a file containing features from the testing portion of the dataset
- `--weights_file`: a file containing weights from a model training (in our case, we don't need it because we already have a `rules file` from the RF training)
- `--rules_file`: a file containing the rules generated by a model training 
- `--rules_outfile`: a file name that will contain the output of the Fidex algorithm
- `--nb_attributes`: the number of attributes present in the dataset
- `--nb_classes`: the number of classes present in the dataset

There are also optional arguments that we are going to use:
- `--root_folder`: path defining the root directory where every other path specified in other arguments begins
- `--attributes_file`: a file containing the names of all attributes and classes
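
Each of these parameters is passed to Fidex as a `--name value` pair inside a single argument string. As a minimal sketch, a hypothetical `build_args` helper (not part of dimlpfidex) can assemble that string from a dictionary; the values below are the file names and counts used in this notebook.

```python
def build_args(options):
    """Join a {name: value} mapping into a '--name value ...' string.

    Hypothetical helper: it does no quoting, so values must not contain spaces.
    """
    return " ".join(f"--{name} {value}" for name, value in options.items())


fidex_options = {
    "root_folder": "data/OCDDataset/",
    "train_data_file": "train_dataset.txt",
    "train_pred_file": "predTrain.out",
    "test_data_file": "test_sample.txt",
    "rules_file": "RF_rules.rls",
    "rules_outfile": "fidex_rules.rls",
    "nb_attributes": 21,
    "attributes_file": "attributes_file.txt",
    "nb_classes": 7,
}

fidex_args = build_args(fidex_options)
print(fidex_args)
```

Keeping the options in a dictionary makes it easy to reuse most of them when calling other tools that share the same parameters.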

The steps completed so far give us everything we need to run the Fidex program. Let's try it:

In [51]:
localRuleOutFileName = "fidex_rules.rls"
trainClassesFile = "train_classes.txt"
trainPredsFile = "predTrain.out" # generated by the RF
testPredsFile = "predTest.out" # generated by the RF
testSampleFile = "test_sample.txt"

sampleSelected = 0
assert sampleSelected < testds.shape[0]  # the index must exist in the test set


# extract a sample to generate local rule
testPreds = pd.read_csv(path+testPredsFile, sep=" ", header=None, index_col=None).iloc[:, :nclasses]
sampleData = testds.iloc[sampleSelected, :nattributes].to_list()
samplePred = testPreds.iloc[sampleSelected, :].to_list()
sampleClasses = testds.iloc[sampleSelected, nattributes:].to_list()


with open(path+testSampleFile, 'w') as f:
    f.write(" ".join(str(x) for x in sampleData) + '\n')
    f.write(" ".join(str(x) for x in samplePred) + '\n')
    f.write(" ".join(str(x) for x in sampleClasses) + '\n')

args = f"""--root_folder {path} 
        --train_data_file {trainDataFile} 
        --train_pred_file {trainPredsFile} 
        --test_data_file {testSampleFile}  
        --rules_file {rulesFile} 
        --rules_outfile {localRuleOutFileName} 
        --nb_attributes {nattributes} 
        --attributes_file {attributesFile} 
        --nb_classes {nclasses}"""

status = fidex(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - test_data_file                                                       data/OCDDataset/test_sample.txt
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - rules_outfile                                                        data/OCDDataset/fidex_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     21
 - nb_classes                                                                                         7
 - nb_quant_levels                             

We ran the Fidex algorithm on the selected test sample, which generated an output file. Let's look at its content:

In [52]:
lines = ""
with open(path+localRuleOutFileName, 'r') as f:
    for line in f:
        lines += line

print(lines)

No decision threshold is used.

Rule for sample 0 :

Age>=37.5 Height<1.553105 -> OLD_Obesity_Type_I
   Train Covering size : 9
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.987
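
As we read this output, the covering size is the number of training samples that satisfy the rule, and the fidelity is the share of those samples for which the model predicts the rule's class. The sketch below illustrates both on a few made-up records; `parse_rule` and `covers` are hypothetical helpers, and the parsing only handles the `>=` and `<` operators seen in this rule.

```python
import operator
import re

OPS = {">=": operator.ge, "<": operator.lt}
ANTECEDENT = re.compile(r"(\w+)(>=|<)([-\d.eE+]+)")


def parse_rule(text):
    """Split 'A>=1 B<2 -> CLASS' into a list of conditions and the target class."""
    body, _, target = text.partition("->")
    conditions = [(name, OPS[op], float(value))
                  for name, op, value in ANTECEDENT.findall(body)]
    return conditions, target.strip()


def covers(conditions, sample):
    """A rule covers a sample when every condition holds."""
    return all(op(sample[name], value) for name, op, value in conditions)


conditions, target = parse_rule("Age>=37.5 Height<1.553105 -> OLD_Obesity_Type_I")

# Made-up records: (attributes, class predicted by the model).
train = [
    ({"Age": 41.0, "Height": 1.50}, "OLD_Obesity_Type_I"),
    ({"Age": 39.2, "Height": 1.55}, "OLD_Obesity_Type_I"),
    ({"Age": 22.0, "Height": 1.50}, "OLD_Normal_Weight"),
    ({"Age": 45.0, "Height": 1.70}, "OLD_Overweight_Level_II"),
]

covered = [pred for sample, pred in train if covers(conditions, sample)]
covering_size = len(covered)
fidelity = sum(pred == target for pred in covered) / covering_size
print(covering_size, fidelity)
```

On the real training set, the rule above covers 9 samples with a fidelity of 1, meaning the model predicts `OLD_Obesity_Type_I` for all of them.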




# Global ruleSet generation - FidexGlo


In [54]:
from dimlpfidex.fidex import fidexGloRules

heuristic = 1
nthreads = 8
globalRulesOutfile = "fidexGloRules_rules.rls"
# --normalization_file normalization_stats.txt (do I need this ?)

args = f" --root_folder {path} --nb_threads {nthreads} --train_data_file {trainDataFile} --train_pred_file {trainPredsFile} --rules_file {rulesFile} --attributes_file {attributesFile} --nb_attributes {nattributes} --nb_classes {nclasses} --heuristic {heuristic} --global_rules_outfile {globalRulesOutfile}"
status = fidexGloRules(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - global_rules_outfile                                         data/OCDDataset/fidexGloRules_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     21
 - nb_classes                                                                                         7
 - nb_quant_levels                                                                                   50
 - heuristic                                   

# Conclusion