# Exploring Random Forest and Fidex rule generation for obesity risk classification

**Introduction:**

Welcome to HES-Xplain, our interactive platform designed to facilitate explainable artificial intelligence (XAI) techniques. In this use case, we dive into obesity risk classification and showcase another application example of explainability techniques.

This notebook is an alternative to the [`Exploring Dimlp and Fidex rule generation for breast cancer classification`](TODO). It follows a similar structure but uses a different dataset and training model to show the versatility of our explainability tools. In addition, we will cover how to pre-process a dataset that a model cannot use as-is and convert it into a usable one.

**Objectives:**

    1. Observe a different use case where XAI can be used
    2. Understand how to pre-process data 
    3. Understand how to use Random Forests and Fidex.
    4. Showcase the versatility of HES-Xplain using a different dataset and training model.
    5. Provide practical insights into applying Random Forests and Fidex to obesity risk classifiers through an interactive notebook.
    6. Foster a community of XAI enthusiasts and practitioners.

**Outline:**

    1. Dataset and Problem Statement.
    2. Load and pre-process the dataset.
    3. Train the Model.
    4. Local rules generation - Fidex
    5. Global ruleSet generation - FidexGlo
    6. Conclusion.

Through this use case, we aim to show the users the potential of Random Forests and Fidex as tools for transparent and interpretable classification. With HES-Xplain, we make XAI accessible, helping users build trust in their models and make informed decisions.

# Dataset and Problem Statement
The dataset we'll be working with is the [obesity or CVD risk](https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster/data) dataset, available on [Kaggle](https://www.kaggle.com). It comprises 2111 records of anonymized data concerning South American individuals and their dietary habits. In this notebook, our focus is on another medical challenge: classifying the risk of obesity based on various factors. These factors, drawn from the dataset, are outlined below with their original names:

| **Full name**                             | **Used label** |                                                        **Values/Ranges**                                                       | **Description**                                                                     |
|-------------------------------------------|:--------------:|:------------------------------------------------------------------------------------------------------------------------------:|-------------------------------------------------------------------------------------|
| Gender                                    |     Gender     |                                                          Male, Female                                                          | Person's biological gender                                                          |
| Age                                       |       Age      |                                                             [14:61]                                                            | Person's age in years                                                               |
| Height                                    |     Height     |                                                           [1.45:1.98]                                                          | Person's height in meters                                                           |
| Weight                                    |     Weight     |                                                            [39:173]                                                            | Person's weight in kilograms                                                        |
| Family history with overweight            |      FHWO      |                                                             yes, no                                                            | Whether the person has at least one sibling that suffers or suffered from overweight |
| Frequent consumption of high-caloric food |      FAVC      |                                                             yes, no                                                            | Whether the person is frequently consuming high-caloric food                        |
| Frequency of consumption of vegetables    |      FCVC      |                                                              [1:3]                                                             | Leveled frequency of consumption of vegetables                                      |
| Number of main meals                      |       NCP      |                                                              [1:4]                                                             | Person's number of main meals during a day                                          |
| Consumption of food between meals         |      CAEC      |                                                no, sometimes, frequently, always                                               | Person's consumption of food between main meals frequency per day                   |
| Smoker or not                             |      SMOKE     |                                                             yes, no                                                            | Whether the person smokes                                                           |
| Consumption of water daily                |      CH2O      |                                                              [1:3]                                                             | Numeric representation of water consumption frequency per day                       |
| Calories consumption monitoring           |       SCC      |                                                             yes, no                                                            | Whether the person is monitoring their daily calorie intake                         |
| Physical activity frequency               |       FAF      |                                                              [0:3]                                                             | Numeric representation of physical activity frequency per week                      |
| Time using technology devices             |       TUE      |                                                              [0:2]                                                             | Numeric representation of electronic devices use frequency per day                  |
| Consumption of alcohol                    |      CALC      |                                                no, sometimes, frequently, always                                               | Frequency of alcohol consumption                                                    |
| Transportation used                       |     MTRANS     |                                   Public_Transportation, Automobile, Bike, Motorbike, Walking                                  | Medium usually used to transit                                                      |
| Obesity level deducted                    |       OLD      | Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III | Obesity level observed according to the interpretation of the person's BMI          |

In our case, we want to train a random forest model to predict the obesity level (`OLD`) from all the other features. To do so, we first need to convert several features of the original dataset into a numerical form that the model can process.
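
The `OLD` label is described above as an interpretation of the person's BMI. As a reminder, the body mass index is the weight in kilograms divided by the square of the height in meters. The helper below (`bmi` is our own name, not part of the dataset or of HES-Xplain) is a minimal sketch of that formula:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Example values: 75 kg and 1.68 m (one of the records previewed in this notebook)
print(round(bmi(75.0, 1.68), 2))  # 26.57
```

Since `Weight` and `Height` are both among the features, the model has direct access to the two quantities the label is derived from. This is worth keeping in mind when reading the rules generated later.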

# Load and pre-process the dataset
To kick things off, we'll begin by simplifying the names of the columns and taking a look at the CSV file containing the raw data:

In [1]:
import pandas as pd
from dimlpfidex.fidex import fidex, fidexGloRules, fidexGloStats
from trainings.randForestsTrn import randForestsTrn as randomForest

# opt in to the future pandas behaviour so that replace() does not warn about silent downcasting
pd.set_option("future.no_silent_downcasting", True)


# utility function to preview a file entirely or only the first `nlines` lines
def previewFile(filepath, nlines=-1):
    lines = ""

    with open(filepath, "r") as f:
        if nlines == -1:
            for line in f:
                lines += line
        else:
            for _ in range(nlines):
                try:
                    lines += next(f)
                except StopIteration:
                    break
    print(lines)



dataset = pd.read_csv("data/OCDDataset/ObesityDataSet.csv")

# shorten the longest column names
dataset.rename(
    columns={
        "family_history_with_overweight": "FHWO",
        "NObeyesdad": "OLD",
    },
    inplace=True,
)

# shuffle the entire dataset, then keep only the first 10% of the records
# so that the rest of the notebook runs quickly
dataset = dataset.sample(frac=1)
nrows = int(dataset.shape[0] * 0.1)
dataset = dataset.iloc[:nrows, :]

dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
1704,Male,22.851721,1.853373,121.737836,yes,yes,2.922511,2.692889,Sometimes,no,1.809531,no,0.478595,0.0,Sometimes,Public_Transportation,Obesity_Type_II
574,Female,19.833682,1.699464,49.676046,no,yes,1.270448,3.731212,Frequently,no,1.876915,no,2.0,1.0,Sometimes,Public_Transportation,Insufficient_Weight
143,Female,34.0,1.68,75.0,no,yes,3.0,1.0,Sometimes,no,1.0,no,0.0,0.0,Sometimes,Automobile,Overweight_Level_I
1346,Female,18.166318,1.649553,82.323954,yes,yes,2.864776,3.0,Sometimes,no,1.876915,no,0.631565,0.186414,no,Public_Transportation,Obesity_Type_I
707,Female,16.910997,1.74823,49.928447,no,yes,2.494451,3.56544,Sometimes,no,1.491268,no,1.951027,0.956204,Sometimes,Public_Transportation,Insufficient_Weight


The output above shows a sample of the dataset. To make the dataset more compatible with machine learning, we'll start by converting features that have "yes" or "no" values into a binary (1/0) representation:

In [2]:
# map "yes"/"no" strings to 1/0 and store them as small integers
strToBinDict = {"yes": 1, "no": 0}
dataset["FHWO"] = dataset["FHWO"].replace(strToBinDict).astype("int8")
dataset["FAVC"] = dataset["FAVC"].replace(strToBinDict).astype("int8")
dataset["SMOKE"] = dataset["SMOKE"].replace(strToBinDict).astype("int8")
dataset["SCC"] = dataset["SCC"].replace(strToBinDict).astype("int8")
dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
1704,Male,22.851721,1.853373,121.737836,1,1,2.922511,2.692889,Sometimes,0,1.809531,0,0.478595,0.0,Sometimes,Public_Transportation,Obesity_Type_II
574,Female,19.833682,1.699464,49.676046,0,1,1.270448,3.731212,Frequently,0,1.876915,0,2.0,1.0,Sometimes,Public_Transportation,Insufficient_Weight
143,Female,34.0,1.68,75.0,0,1,3.0,1.0,Sometimes,0,1.0,0,0.0,0.0,Sometimes,Automobile,Overweight_Level_I
1346,Female,18.166318,1.649553,82.323954,1,1,2.864776,3.0,Sometimes,0,1.876915,0,0.631565,0.186414,no,Public_Transportation,Obesity_Type_I
707,Female,16.910997,1.74823,49.928447,0,1,2.494451,3.56544,Sometimes,0,1.491268,0,1.951027,0.956204,Sometimes,Public_Transportation,Insufficient_Weight


We will convert the `CAEC` and `CALC` columns, which contain the values "Always," "Frequently," "Sometimes," and "no," into a numerical scale from 0.00 to 1.00 based on their frequency. Here's the conversion table:

| **Adjective** | **Conversion Value** |
|---------------|:--------------------:|
| Always        |         1.0          |
| Frequently    |         0.66         |
| Sometimes     |         0.33         |
| no            |         0.0          |

We'll apply a similar procedure as before to achieve this:

In [3]:
adjToValDict = {"Always": 1.0, "Frequently": 0.66, "Sometimes": 0.33, "no": 0.0}
dataset["CAEC"] = dataset["CAEC"].replace(adjToValDict).astype('float64')
dataset["CALC"] = dataset["CALC"].replace(adjToValDict).astype('float64')
dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
1704,Male,22.851721,1.853373,121.737836,1,1,2.922511,2.692889,0.33,0,1.809531,0,0.478595,0.0,0.33,Public_Transportation,Obesity_Type_II
574,Female,19.833682,1.699464,49.676046,0,1,1.270448,3.731212,0.66,0,1.876915,0,2.0,1.0,0.33,Public_Transportation,Insufficient_Weight
143,Female,34.0,1.68,75.0,0,1,3.0,1.0,0.33,0,1.0,0,0.0,0.0,0.33,Automobile,Overweight_Level_I
1346,Female,18.166318,1.649553,82.323954,1,1,2.864776,3.0,0.33,0,1.876915,0,0.631565,0.186414,0.0,Public_Transportation,Obesity_Type_I
707,Female,16.910997,1.74823,49.928447,0,1,2.494451,3.56544,0.33,0,1.491268,0,1.951027,0.956204,0.33,Public_Transportation,Insufficient_Weight


We still have three columns, `Gender`, `MTRANS`, and `OLD`, that contain non-numerical values. These values represent distinct options and cannot be placed on a scale as before. Instead, we'll encode them using a technique called "one-hot encoding". This technique creates one binary column per option, representing its presence or absence. Let's proceed with applying one-hot encoding:

In [4]:
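# one-hot encode the three categorical columns
genderCols = pd.get_dummies(dataset["Gender"], prefix="Gender", prefix_sep='_', dtype='int8')
mtransCols = pd.get_dummies(dataset["MTRANS"], prefix="MTRANS", prefix_sep='_', dtype='int8')
oldCols = pd.get_dummies(dataset["OLD"], prefix="OLD", prefix_sep='_', dtype='int8')
# place the gender columns first, the transport columns after the first 16 columns and the class columns last
dataset = pd.concat([genderCols, dataset.iloc[:, :16], mtransCols, dataset.iloc[:, 16:], oldCols], axis=1)
# drop the original categorical columns, now replaced by their one-hot versions
dataset.drop(["Gender", "MTRANS", "OLD"], axis=1, inplace=True)
dataset.head()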
dataset.head()

Unnamed: 0,Gender_Female,Gender_Male,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,...,MTRANS_Bike,MTRANS_Public_Transportation,MTRANS_Walking,OLD_Insufficient_Weight,OLD_Normal_Weight,OLD_Obesity_Type_I,OLD_Obesity_Type_II,OLD_Obesity_Type_III,OLD_Overweight_Level_I,OLD_Overweight_Level_II
1704,0,1,22.851721,1.853373,121.737836,1,1,2.922511,2.692889,0.33,...,0,1,0,0,0,0,1,0,0,0
574,1,0,19.833682,1.699464,49.676046,0,1,1.270448,3.731212,0.66,...,0,1,0,1,0,0,0,0,0,0
143,1,0,34.0,1.68,75.0,0,1,3.0,1.0,0.33,...,0,0,0,0,0,0,0,0,1,0
1346,1,0,18.166318,1.649553,82.323954,1,1,2.864776,3.0,0.33,...,0,1,0,0,0,1,0,0,0,0
707,1,0,16.910997,1.74823,49.928447,0,1,2.494451,3.56544,0.33,...,0,1,0,1,0,0,0,0,0,0
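
To see what one-hot encoding does in isolation, here is a plain-Python sketch of the idea. The `oneHot` helper is hypothetical and only mimics the behaviour we rely on from `pd.get_dummies` in the cell above (one column per category found in the data, named with a prefix):

```python
def oneHot(values, prefix):
    """Return the generated column names and one binary row per input value."""
    # one column per distinct category actually present in the data
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return columns, rows


columns, rows = oneHot(["Public_Transportation", "Automobile", "Public_Transportation"], "MTRANS")
print(columns)  # ['MTRANS_Automobile', 'MTRANS_Public_Transportation']
print(rows)     # [[0, 1], [1, 0], [0, 1]]
```

Note that a category that is absent from the data gets no column. This is why the number of columns can change with the sampled portion of the dataset: in the run shown here, no `MTRANS_Motorbike` column was created.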


Now that our dataset is ready to use, let's check the data's integrity by verifying some information, starting with a general overview of our columns:

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 211 entries, 1704 to 695
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Gender_Female                 211 non-null    int8   
 1   Gender_Male                   211 non-null    int8   
 2   Age                           211 non-null    float64
 3   Height                        211 non-null    float64
 4   Weight                        211 non-null    float64
 5   FHWO                          211 non-null    int8   
 6   FAVC                          211 non-null    int8   
 7   FCVC                          211 non-null    float64
 8   NCP                           211 non-null    float64
 9   CAEC                          211 non-null    float64
 10  SMOKE                         211 non-null    int8   
 11  CH2O                          211 non-null    float64
 12  SCC                           211 non-null    int8   
 13  FAF    

Next, let's verify that there are no missing values in our dataset:

In [6]:
dataset.isnull().sum()

Gender_Female                   0
Gender_Male                     0
Age                             0
Height                          0
Weight                          0
FHWO                            0
FAVC                            0
FCVC                            0
NCP                             0
CAEC                            0
SMOKE                           0
CH2O                            0
SCC                             0
FAF                             0
TUE                             0
CALC                            0
MTRANS_Automobile               0
MTRANS_Bike                     0
MTRANS_Public_Transportation    0
MTRANS_Walking                  0
OLD_Insufficient_Weight         0
OLD_Normal_Weight               0
OLD_Obesity_Type_I              0
OLD_Obesity_Type_II             0
OLD_Obesity_Type_III            0
OLD_Overweight_Level_I          0
OLD_Overweight_Level_II         0
dtype: int64

Now, we'll ensure that there are no duplicated records in our dataset:

In [7]:
dataset = dataset.drop_duplicates()
dataset.duplicated().sum()

0

Lastly, we need to split our dataset into two distinct datasets, one for training and the other for testing. We then write them to separate files so that the model can use them.

In [8]:
nRecords = dataset.shape[0]
trainSplit = int(0.75 * nRecords)

trainds = dataset.iloc[:trainSplit, :]
testds = dataset.iloc[trainSplit:, :]

print(
f"""
Total number of records:\t{nRecords}
Number of records for training:\t{trainds.shape[0]}
Number of records for testing:\t{testds.shape[0]}
"""
)


Total number of records:	209
Number of records for training:	156
Number of records for testing:	53



In [9]:
# We are writing TXT format to comply with our random forest algorithm
rootDir = "data/OCDDataset/"
trainDataFile = "train_dataset.txt"
testDataFile = "test_dataset.txt"

trainds.to_csv(rootDir + trainDataFile, header=False, index=False)
testds.to_csv(rootDir + testDataFile, header=False, index=False)
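
Because the shuffle at the beginning of this notebook is not seeded, the records that end up in each file change from run to run. With pandas, `DataFrame.sample` accepts a `random_state` parameter for this purpose. The sketch below shows the same shuffle-then-split logic in plain Python with a fixed seed; `shuffleAndSplit` and the seed value are our own illustrative choices:

```python
import random


def shuffleAndSplit(records, trainFraction=0.75, seed=42):
    """Shuffle the records with a fixed seed, then split them into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # a seeded generator makes the order reproducible
    split = int(trainFraction * len(shuffled))
    return shuffled[:split], shuffled[split:]


train, test = shuffleAndSplit(range(209))
print(len(train), len(test))  # 156 53
```

Calling the function again with the same seed returns exactly the same split, which makes results easier to compare between runs.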

Now that our data is ready and checked for any missing or duplicate values, we're all set to move on to the next step. In the next section, we'll use our prepared dataset to train our Random Forest model.

# Train the Model

Let's dig into the main part of our task: training our model. To do so, we're going to use a type of model called random forests. Random forests are a type of machine learning model that works by creating a multitude of decision trees during training. Each tree independently makes predictions, and the final prediction is determined by averaging the predictions of all the trees (for regression tasks) or taking a majority vote (for classification tasks). 
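
The majority vote mentioned above can be pictured with a few lines of Python. This is a toy illustration of the voting idea only, with made-up tree outputs, and not a description of how `randForestsTrn` combines its trees internally:

```python
from collections import Counter


def majorityVote(treePredictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(treePredictions)
    # most_common(1) returns a list with the single (class, count) pair having the highest count
    return counts.most_common(1)[0][0]


# made-up predictions of five trees for one sample
print(majorityVote([1, 1, 5, 1, 2]))  # 1
```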

In this case, we are going to use a Python program called [randForestsTrn](https://github.com/HES-XPLAIN/dimlpfidex/blob/main/trainings/randForestsTrn.py). It was already imported in the first cell under the name `randomForest`, so let's print its help message to see every available option:

In [10]:
randomForest("--help")

Usage: 
--train_data_file <str> --test_data_file <str> --nb_attributes <int [1,inf[> --nb_classes <int [1,inf[> [-h, --help] [--json_config_file <str>] [--root_folder <str>] [--train_class_file <str>] [--train_pred_outfile <str>] [--test_class_file <str>] [--test_pred_outfile <str>] [--console_file <str>] [--stats_file <str>] [--rules_outfile <str>] [--n_estimators <int [1,inf[>] [--criterion <{gini, entropy, log_loss}>] [--max_depth <int [1,inf[>] [--min_samples_split <int [2,inf[ U float]0,1.0]>] [--min_samples_leaf <int [1,inf[ U float]0,1[>] [--min_weight_fraction_leaf <float [0,0.5]>] [--max_features <{sqrt, log2, all, float ]0,1[, int [1,inf[}>] [--max_leaf_nodes <int [2,inf[>] [--min_impurity_decrease <float [0,inf[>] [--bootstrap <bool>] [--oob_score <bool>] [--n_jobs <int>] [--seed <{int [0,inf[}>] [--verbose <int [0,inf[>] [--warm_start <bool>] [--class_weight <{balanced, dict}>] [--ccp_alpha <float [0,inf[>] [--max_samples <int [1,inf[ U float]0,1.0]>]

This is a parser for 

-1

The output lists the available options. Of these, we'll focus only on the required parameters (plus `--root_folder`, for convenience). We already have the train and test data files, generated in the previous section. We still need the number of attributes and the number of classes. The original class column is `OLD`, and since it has been one-hot encoded, the number of classes is the number of columns with the `OLD_` prefix.

In [11]:
labels = list(dataset.columns)

nclasses = sum(1 for label in labels if label.startswith("OLD_"))
nattributes = len(labels) - nclasses
attributesFile = "attributes_file.txt" 


with open(rootDir+attributesFile, 'w') as af:
    for label in dataset.columns:
        af.write(label+'\n')

print(f"# attributes:\t{nattributes}\n# classes:\t{nclasses}")

# attributes:	20
# classes:	7


So now, let's gather the elements we have to run the model:

| **Parameter name** | **Input**           |
|--------------------|:-------------------:|
| --root_folder      | data/OCDDataset/    |
| --train_data_file  | train_dataset.txt   |
| --test_data_file   | test_dataset.txt    |
| --nb_attributes    | 20\*                |
| --nb_classes       | 7\*                 |

<br>

> *\*these values depend on the portion of the dataset used and therefore can vary.*

We can now run our random forest model with these parameters and leave the rest of the options at their default values.

In [12]:
args = f"""
        --root_folder {rootDir} 
        --train_data_file {trainDataFile} 
        --test_data_file {testDataFile} 
        --nb_attributes {nattributes} 
        --nb_classes {nclasses}
        """

randomForest(args)

Parameters list:
 - root_folder                                                   data/OCDDataset/
 - train_data_file                                               data/OCDDataset/train_dataset.txt
 - train_pred_outfile                                            data/OCDDataset/predTrain.out
 - test_data_file                                                data/OCDDataset/test_dataset.txt
 - test_pred_outfile                                             data/OCDDataset/predTest.out
 - stats_file                                                    data/OCDDataset/stats.txt
 - nb_attributes                                                 20
 - nb_classes                                                    7
 - rules_outfile                                                 data/OCDDataset/RF_rules.rls
 - n_estimators                                                  100
 - criterion                                                     gini
 - min_samples_split                                    

0

The algorithm finished and generated the `rules_outfile` file. It contains all the rules generated by each tree of the random forest. Let's look at some rules from the first tree:

In [13]:
rulesFile = "RF_rules.rls"  # adapt this if you passed a custom --rules_outfile to randomForest()
previewFile(rootDir+rulesFile, 100)

-------------------
Tree 1
-------------------
Rule 1: X5<=0.5 X0<=0.5 X15<=0.16500000655651093 -> class 0 Covering: [1, 0, 0, 0, 0, 0, 0]
Rule 2: X5<=0.5 X0<=0.5 X15>0.16500000655651093 X9<=0.16500000655651093 X13<=0.31559500098228455 -> class 1 Covering: [0, 1, 0, 0, 0, 0, 0]
Rule 3: X5<=0.5 X0<=0.5 X15>0.16500000655651093 X9<=0.16500000655651093 X13>0.31559500098228455 X11<=2.004379987716675 -> class 1 Covering: [0, 1, 0, 0, 0, 0, 0]
Rule 4: X5<=0.5 X0<=0.5 X15>0.16500000655651093 X9<=0.16500000655651093 X13>0.31559500098228455 X11>2.004379987716675 -> class 5 Covering: [0, 0, 0, 0, 0, 1, 0]
Rule 5: X5<=0.5 X0<=0.5 X15>0.16500000655651093 X9>0.16500000655651093 X4<=80.0 -> class 1 Covering: [0, 1, 0, 0, 0, 0, 0]
Rule 6: X5<=0.5 X0<=0.5 X15>0.16500000655651093 X9>0.16500000655651093 X4>80.0 X2<=31.0 -> class 2 Covering: [0, 0, 1, 0, 0, 0, 0]
Rule 7: X5<=0.5 X0<=0.5 X15>0.16500000655651093 X9>0.16500000655651093 X4>80.0 X2>31.0 -> class 6 Covering: [0, 0, 0, 0, 0, 0, 1]
Rule 8: X5<=0.

The output shows a portion of the newly generated file, which contains all the rules generated by each tree of the random forest. Within a single tree, the rules look very similar to one another. This is expected: each rule corresponds to a path from the root of the tree to one of its leaves, so rules from the same tree share the splits closest to the root.
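
The rules in this file refer to attributes as `X0`, `X1`, and so on. To make them easier to read, the sketch below rewrites a rule with attribute names. It assumes that `X<i>` is the zero-based position of the attribute in the data file, which is consistent with the thresholds shown above (for instance `X4<=80.0` for `Weight`) but is our own reading of the format. `readableRule` is a hypothetical helper, not part of the library:

```python
import re

# one antecedent looks like X5<=0.5 or X15>0.165
ANTECEDENT = re.compile(r"X(\d+)(<=|>=|<|>)(\S+)")


def readableRule(line, attributeNames):
    """Return the antecedents with attribute names and the predicted class index."""
    antecedentPart, consequentPart = line.split(" -> ")
    parts = []
    for index, operator, threshold in ANTECEDENT.findall(antecedentPart):
        # assumption: X<i> is the zero-based column index of the attribute
        parts.append(f"{attributeNames[int(index)]}{operator}{float(threshold):.4g}")
    predictedClass = int(re.search(r"class (\d+)", consequentPart).group(1))
    return " ".join(parts), predictedClass


names = ["Gender_Female", "Gender_Male", "Age", "Height", "Weight", "FHWO", "FAVC", "FCVC",
         "NCP", "CAEC", "SMOKE", "CH2O", "SCC", "FAF", "TUE", "CALC"]
line = "Rule 1: X5<=0.5 X0<=0.5 X15<=0.16500000655651093 -> class 0 Covering: [1, 0, 0, 0, 0, 0, 0]"
print(readableRule(line, names))  # ('FHWO<=0.5 Gender_Female<=0.5 CALC<=0.165', 0)
```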

Now that we have trained the model and generated its set of rules, let's use Fidex to explain its predictions.

# Local rules generation - Fidex

Now we can generate some `local` rules to explain the model's results. We can start by launching Fidex on one test sample. This will generate a rule explaining the sample locally. It is `local` because the algorithm searches for a rule that explains a single sample.

Fidex is located in the fidex module. First of all, let's take a look at its parameters:

In [14]:
status = fidex("--help")


---------------------------------------------------------------------

The arguments can be specified in the command or in a json configuration file with --json_config_file your_config_file.json.

----------------------------

Required parameters:

--train_data_file <str>       Train data file
--train_pred_file <str>       Train prediction file
--train_class_file <str>      Train true class file, not mandatory if classes are specified in train data file
--test_data_file <str>        Test sample(s) data file with data, prediction(if no --test_pred_file) and true class(if no --test_class_file)
--weights_file <str>          Weights file (not mandatory if a rules file is given with --rules_file)
--rules_file <str>            Rules file to be converted to hyperlocus (not mandatory if a weights file is given with --weights_file)
--rules_outfile <str>         Rule(s) output file. If a .json filename is given, rules are saved in a special json format
--nb_attributes <int [1,inf[> Number of at

Let's have a closer look at the Fidex help output. We can observe that there are `required parameters`. Let's go through them:

- `--train_data_file`: a file containing features from the training portion of the dataset
- `--train_pred_file`: a file containing predictions from the training portion of the dataset
- `--train_class_file`: a file containing classes from the training portion of the dataset
- `--test_data_file`: a file containing samples to be used when generating a local rule
- `--weights_file`: a file containing weights from a model training (in our case, we don't need it because we already have a `rules file` from the RF training)
- `--rules_file`: a file containing the rules generated by a model training 
- `--rules_outfile`: a file name that will contain the output of the Fidex algorithm
- `--nb_attributes`: the number of attributes present in the dataset
- `--nb_classes`: the number of classes present in the dataset

There are also optional arguments that we are going to use:
- `--root_folder`: path defining the root directory where every other path specified in other arguments begins
- `--attributes_file`: a file containing all attributes and class names
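
In this notebook, arguments are passed to the programs as a single string of `--name value` pairs. When the list of parameters grows, it can be convenient to build that string from a dictionary. The `buildArgs` helper below is hypothetical (it is not part of the library), and it assumes that the programs only care about whitespace-separated tokens, as the multi-line strings used in this notebook suggest:

```python
def buildArgs(options):
    """Turn {"name": value} pairs into a "--name value" argument string."""
    # options set to None are skipped so that the program falls back to its default
    return " ".join(f"--{name} {value}" for name, value in options.items() if value is not None)


exampleArgs = buildArgs({
    "root_folder": "data/OCDDataset/",
    "nb_attributes": 20,
    "nb_classes": 7,
    "attributes_file": None,
})
print(exampleArgs)  # --root_folder data/OCDDataset/ --nb_attributes 20 --nb_classes 7
```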

The steps completed so far give us everything we need to run Fidex. To see what happens, we launch it on a single sample. Beforehand, we save that test sample to a file together with its predictions and classes:

In [15]:
localRuleOutFileName = "fidex_rules.rls"
trainClassesFile = "train_classes.txt"
trainPredsFile = "predTrain.out" # generated by the RF
testPredsFile = "predTest.out" # generated by the RF
testSampleFile = "test_sample.txt"

sampleSelected = 0
assert sampleSelected < testds.shape[0]  # the sample index must exist in the test set


# extract a sample to generate local rule
testPreds = pd.read_csv(rootDir+testPredsFile, sep=" ", header=None, index_col=None).iloc[:, :nclasses]
sampleData = testds.iloc[sampleSelected, :nattributes].to_list()
samplePred = testPreds.iloc[sampleSelected, :].to_list()
sampleClasses = testds.iloc[sampleSelected, nattributes:].to_list()

# write the sample, classes and predictions in the testSampleFile file (file writing format must be respected)
with open(rootDir+testSampleFile, 'w') as f:
    f.write(" ".join(str(x) for x in sampleData) + '\n')
    f.write(" ".join(str(x) for x in samplePred) + '\n')
    f.write(" ".join(str(x) for x in sampleClasses) + '\n')

args = f"""
        --root_folder {rootDir} 
        --train_data_file {trainDataFile} 
        --train_pred_file {trainPredsFile} 
        --test_data_file {testSampleFile}  
        --rules_file {rulesFile} 
        --rules_outfile {localRuleOutFileName} 
        --nb_attributes {nattributes} 
        --attributes_file {attributesFile} 
        --nb_classes {nclasses}
        """

status = fidex(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - test_data_file                                                       data/OCDDataset/test_sample.txt
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - rules_outfile                                                        data/OCDDataset/fidex_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     20
 - nb_classes                                                                                         7
 - nb_quant_levels                             

The output of the algorithm shows a walkthrough of the process in the terminal. At the end of it, you can observe the generated rule. Let's have a closer look at it by previewing the freshly written rule file:

In [16]:
previewFile(rootDir+localRuleOutFileName, 20)

No decision threshold is used.

Rule for sample 0 :

CH2O<1.000272 FCVC<2.274445 TUE>=0.001614 -> OLD_Normal_Weight
   Train Covering size : 2
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.72




A rule is composed of several properties:
- The index of the sample from which the rule has been generated
- The rule itself, composed of one or more antecedents
- The number of samples, in the training dataset, covered by the rule
- The fidelity of the rule according to the model's predictions
- The accuracy of the rule
- The confidence of the rule, based on the model's prediction values
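
To make the covering size, fidelity and accuracy more concrete, here is a small sketch on made-up data. It assumes the usual reading of these metrics: the covering size counts the training samples that satisfy every antecedent, the fidelity is the share of covered samples for which the model predicts the rule's class, and the accuracy is the share of covered samples whose true class is the rule's class. The function names are ours, and the exact definitions used by Fidex may differ in details:

```python
def covers(antecedents, sample):
    """Check that a sample satisfies every (attributeIndex, operator, threshold) antecedent."""
    for index, operator, threshold in antecedents:
        value = sample[index]
        if operator == "<" and not value < threshold:
            return False
        if operator == ">=" and not value >= threshold:
            return False
    return True


def ruleMetrics(antecedents, ruleClass, samples, modelClasses, trueClasses):
    """Compute covering size, fidelity and accuracy of a rule under the assumptions above."""
    covered = [i for i, sample in enumerate(samples) if covers(antecedents, sample)]
    if not covered:
        return {"covering": 0, "fidelity": None, "accuracy": None}
    fidelity = sum(modelClasses[i] == ruleClass for i in covered) / len(covered)
    accuracy = sum(trueClasses[i] == ruleClass for i in covered) / len(covered)
    return {"covering": len(covered), "fidelity": fidelity, "accuracy": accuracy}


# made-up data: a single attribute (weight), model predictions and true classes
samples = [[40.0], [42.0], [60.0], [43.0]]
modelClasses = [0, 0, 1, 0]
trueClasses = [0, 1, 1, 0]
metrics = ruleMetrics([(0, "<", 43.563086)], 0, samples, modelClasses, trueClasses)
print(metrics)
```

On this toy data, the rule covers three samples. The model predicts the rule's class for all of them, but only two of them truly belong to that class.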

Let's try to run Fidex again, but this time, with all the samples (please take note that this process can take some time and highly varies depending on the dataset size):

In [17]:
args = f"""
        --root_folder {rootDir} 
        --train_data_file {trainDataFile} 
        --train_pred_file {trainPredsFile} 
        --test_data_file {testDataFile}  
        --test_pred_file {testPredsFile}
        --rules_file {rulesFile} 
        --rules_outfile {localRuleOutFileName} 
        --nb_attributes {nattributes} 
        --attributes_file {attributesFile} 
        --nb_classes {nclasses}
        """

status = fidex(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - test_data_file                                                      data/OCDDataset/test_dataset.txt
 - test_pred_file                                                          data/OCDDataset/predTest.out
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - rules_outfile                                                        data/OCDDataset/fidex_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     20
 - nb_classes                                  

The Fidex algorithm generated a rule file. Let's observe what is inside:

In [18]:
fidexRulesOutfile = "fidex_rules.rls"
previewFile(rootDir+fidexRulesOutfile, 20)

No decision threshold is used.

Rule for sample 0 :

CH2O<1.000272 FCVC<2.274445 CALC>=0.165 -> OLD_Normal_Weight
   Train Covering size : 3
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.7275

-------------------------------------------------

Rule for sample 1 :

Weight<43.563086 -> OLD_Insufficient_Weight
   Train Covering size : 7
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.8425
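
If you want to work with these rules programmatically, the text layout shown above is simple enough to parse. The following is a minimal sketch, assuming the file always follows the layout seen in the preview (one line containing `->` per rule, followed by `Train <name> : <value>` lines). The helper `parse_rules` is hypothetical and is not part of the HES-Xplain tools.

```python
import re

# Matches lines such as "   Train Covering size : 3"
METRIC_RE = re.compile(r"^\s*Train (.+?) : ([0-9.eE+-]+)\s*$")


def parse_rules(text):
    """Parse rule text into a list of dicts (hypothetical helper)."""
    rules = []
    for line in text.splitlines():
        if "->" in line:
            left, target = line.split("->", 1)
            # Drop an optional "Rule N:" prefix, as used in global rule files
            left = re.sub(r"^\s*Rule \d+:\s*", "", left)
            rules.append(
                {"antecedents": left.split(), "class": target.strip(), "metrics": {}}
            )
        else:
            match = METRIC_RE.match(line)
            if match and rules:
                # Attach the statistic to the most recently seen rule
                rules[-1]["metrics"][match.group(1)] = float(match.group(2))
    return rules


# Excerpt copied from the preview above
SAMPLE = """Rule for sample 0 :

CH2O<1.000272 FCVC<2.274445 CALC>=0.165 -> OLD_Normal_Weight
   Train Covering size : 3
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.7275

-------------------------------------------------

Rule for sample 1 :

Weight<43.563086 -> OLD_Insufficient_Weight
   Train Covering size : 7
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.8425
"""

rules = parse_rules(SAMPLE)
for rule in rules:
    print(rule["class"], rule["antecedents"], rule["metrics"])
```

To parse the real file instead of the excerpt, you could read it with `open(rootDir + fidexRulesOutfile).read()` and pass the result to `parse_rules`.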





# Global ruleSet generation - FidexGlo
We have seen how to compute a rule that explains the model's decision for a specific sample with the Fidex algorithm. But how can we get a general set of rules that characterizes the whole training dataset? The `fidexGloRules` (Fidex global rules) algorithm makes this possible.

A global ruleset is a collection of rules that explains the model's decision for each sample in the training portion of the dataset. Let's have a look at the `fidexGloRules` arguments:

In [19]:
status = fidexGloRules("--help")


---------------------------------------------------------------------

The arguments can be specified in the command or in a json configuration file with --json_config_file your_config_file.json.

----------------------------

Required parameters:

--train_data_file <str>       Train data file
--train_pred_file <str>       Train prediction file
--train_class_file <str>      Train true class file, not mandatory if classes are specified in train data file
--weights_file <str>          Weights file (not mandatory if a rules file is given with --rules_file)
--rules_file <str>            Rules file to be converted to hyperlocus (not mandatory if a weights file is given with --weights_file)
--global_rules_outfile <str>  Rules output file. If a .json filename is given, rules are saved in a special json format>
--heuristic <int [1,3]>       Heuristic 1: optimal fidexGlo, 2: fast fidexGlo 3: very fast fidexGlo
--nb_attributes <int [1,inf[> Number of attributes in dataset
--nb_classes <int [2,i

While the required parameters are very similar to those of the `Fidex` algorithm, many optional arguments let you customize the algorithm's behavior. Here are some parameters worth knowing:

- `--heuristic`: selects how the algorithm is run (1: optimal, 2: fast, 3: very fast). Faster heuristics reduce execution time but can affect the quality of the resulting ruleset.
- `--nb_threads`: number of threads used to run the algorithm in parallel, which speeds up execution.
- `--min_covering`: minimum number of samples a rule must cover.
- `--max_failed_attempts`: maximum number of failed attempts allowed when generating a rule.
- `--min_fidelity`: minimum fidelity allowed when generating a rule.
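
As a sketch of how these parameters could be combined, the snippet below builds an argument string from a dictionary. The helper `build_args` is hypothetical, and the values chosen for `min_covering`, `max_failed_attempts` and `min_fidelity` are purely illustrative assumptions: check the accepted ranges and defaults with `fidexGloRules("--help")` before using them.

```python
def build_args(params):
    """Turn a dict of parameter names and values into an argument string."""
    return " ".join(f"--{name} {value}" for name, value in params.items())


# File names and sizes taken from the parameter lists printed in this notebook
params = {
    "root_folder": "data/OCDDataset/",
    "train_data_file": "train_dataset.txt",
    "train_pred_file": "predTrain.out",
    "rules_file": "RF_rules.rls",
    "attributes_file": "attributes_file.txt",
    "nb_attributes": 20,
    "nb_classes": 7,
    "heuristic": 1,
    "nb_threads": 8,
    "global_rules_outfile": "fidexGloRules_rules.rls",
    # Illustrative values only (assumptions, not recommended settings)
    "min_covering": 2,
    "max_failed_attempts": 30,
    "min_fidelity": 1.0,
}

args = build_args(params)
print(args)

# In the notebook, the string would then be passed to the algorithm:
# status = fidexGloRules(args)
```

Keeping the parameters in a dictionary makes it easy to change one value and rerun the algorithm to compare the resulting rulesets.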

In [20]:
heuristic = 1
nthreads = 8
globalRulesOutfile = "fidexGloRules_rules.rls"

args = f"""
        --root_folder {rootDir} 
        --nb_threads {nthreads} 
        --train_data_file {trainDataFile} 
        --train_pred_file {trainPredsFile} 
        --rules_file {rulesFile} 
        --attributes_file {attributesFile} 
        --nb_attributes {nattributes} 
        --nb_classes {nclasses} 
        --heuristic {heuristic} 
        --global_rules_outfile {globalRulesOutfile}
        """

status = fidexGloRules(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - global_rules_outfile                                         data/OCDDataset/fidexGloRules_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     20
 - nb_classes                                                                                         7
 - nb_quant_levels                                                                                   50
 - heuristic                                   

The algorithm generated a file that we're going to partially observe:

In [21]:
previewFile(rootDir+globalRulesOutfile, 26)

Number of rules : 42, mean sample covering number per rule : 4.785714, mean number of antecedents per rule : 2.52381
No decision threshold is used.

Rule 1: FCVC>=2.99922 Weight>=100.922882 Gender_Female>=0.5 -> OLD_Obesity_Type_III
   Train Covering size : 24
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.967083

Rule 2: Weight<54.541988 Height>=1.600442 -> OLD_Insufficient_Weight
   Train Covering size : 15
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.871333

Rule 3: Weight>=112.144943 Age>=24.69316 -> OLD_Obesity_Type_II
   Train Covering size : 15
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.948667

Rule 4: Weight>=99.072113 TUE>=1.022324 -> OLD_Obesity_Type_II
   Train Covering size : 9
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.906667



> *The algorithm result is subject to randomness as it uses random processes to compute. Results may differ between executions.*

You can see that the rules are ordered by covering size: the first rule covers the most training samples and is therefore the one that best describes the training portion of the dataset.

Here is an example rule that we are going to analyze (because of the randomness, it may not match the output of your own execution):

```md
Rule 1: Weight>=101.258713 Gender_Female>=0.5 -> OLD_Obesity_Type_III
   Train Covering size : 25
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.9852
```

This rule means the model is about **98.5% confident** that you have `Type III obesity` if your `weight` is greater than or equal to `~101.3 kg` and you are, biologically speaking, `female`. The rule covers 25 training samples, has **100% fidelity** with respect to the model's predictions, and is **100% accurate** on the training portion of the dataset.
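
To see what "covering" a sample means in practice, here is a minimal sketch that checks whether a given sample satisfies every antecedent of the rule above. The helpers `parse_rule` and `rule_covers` are hypothetical, the sample values are invented, and the sketch assumes that antecedents only use the `>=` and `<` operators, as in the rules shown in this notebook.

```python
import operator
import re

# An antecedent looks like "Weight>=101.258713" or "CH2O<1.000272"
ANTECEDENT_RE = re.compile(r"^(\w+)(>=|<)([-+0-9.eE]+)$")
OPS = {">=": operator.ge, "<": operator.lt}


def parse_rule(rule):
    """Split a rule string into (attribute, comparison, threshold) triples and a class."""
    left, target = rule.split("->", 1)
    antecedents = []
    for token in left.split():
        match = ANTECEDENT_RE.match(token)
        if match is None:
            raise ValueError(f"unsupported antecedent: {token!r}")
        name, op, threshold = match.groups()
        antecedents.append((name, OPS[op], float(threshold)))
    return antecedents, target.strip()


def rule_covers(rule, sample):
    """Return True if the sample satisfies every antecedent of the rule."""
    antecedents, _ = parse_rule(rule)
    return all(compare(sample[name], threshold) for name, compare, threshold in antecedents)


rule = "Weight>=101.258713 Gender_Female>=0.5 -> OLD_Obesity_Type_III"

# Invented samples, restricted to the attributes used by the rule
heavy_female = {"Weight": 110.0, "Gender_Female": 1.0}
lighter_female = {"Weight": 95.0, "Gender_Female": 1.0}
heavy_male = {"Weight": 110.0, "Gender_Female": 0.0}

print(rule_covers(rule, heavy_female))    # True: both antecedents hold
print(rule_covers(rule, lighter_female))  # False: weight is below the threshold
print(rule_covers(rule, heavy_male))      # False: Gender_Female is below 0.5
```

Counting the training samples for which `rule_covers` returns `True` would give the rule's covering size.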

To get statistics on the test portion of the dataset, let's run `fidexGloStats`, beginning with an overview of its arguments:

In [22]:
status = fidexGloStats("--help")


---------------------------------------------------------------------

The arguments can be specified in the command or in a json configuration file with --json_config_file your_config_file.json.

----------------------------

Required parameters:

--test_data_file <str>        Test data file
--test_pred_file <str>        Test prediction file
--test_class_file <str>       Test true class file, not mandatory if classes are specified in test data file
--global_rules_file <str>     Ruleset input file
--nb_attributes <int [1,inf[> Number of attributes in dataset
--nb_classes <int [2,inf[>    Number of classes in dataset

----------------------------

Optional parameters: 

--json_config_file <str>      JSON file to configure all parameters. If used, this must be the sole argument and must specify the file's relative path
--root_folder <str>           Folder based on main folder dimlpfidex(default folder) containg all used files and where generated files will be saved. If a file name is sp

As you can see, the required arguments are much the same as in the previous executions. The only new one is `--global_rules_file`, which takes the global rules file on which to compute the statistics. Let's try it:

In [27]:
args = f"""
        --root_folder {rootDir}
        --test_data_file {testDataFile}
        --test_pred_file {testPredsFile}
        --global_rules_file {globalRulesOutfile}
        --nb_attributes {nattributes}
        --nb_classes {nclasses}
        --attributes_file {attributesFile}
        """

status = fidexGloStats(args)

Parameters list:
 - test_data_file                                                      data/OCDDataset/test_dataset.txt
 - test_pred_file                                                          data/OCDDataset/predTest.out
 - global_rules_file                                            data/OCDDataset/fidexGloRules_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     20
 - nb_classes                                                                                         7
 - positive_class_index                                                                              -1
End of Parameters list.

Importing files...

Data imported.

Compute statistics...

Global statistics of the rule set : 
Number of rules : 42, mean sam

# Conclusion