# Exploring Random Forest and Fidex rule generation for obesity risk classification

**Introduction:**

Welcome to HES-Xplain, our interactive platform designed to make explainable artificial intelligence (XAI) techniques easier to use. In this use case, we dive into obesity risk classification as a further application of explainability techniques.

This notebook is an alternative to the [`Exploring Dimlp and Fidex rule generation for breast cancer classification`](TODO). It follows a similar structure but uses a different dataset and training model to show the versatility of our explainability tools. In addition, we cover how to pre-process a dataset that a model cannot use directly and convert it into a usable one.

**Objectives:**

    1. Observe a different use case where XAI can be used.
    2. Understand how to pre-process data.
    3. Understand how to use Random Forests and Fidex.
    4. Showcase the versatility of HES-Xplain using a different dataset and training model.
    5. Provide practical insights into applying Random Forests and Fidex to obesity risk classifiers through an interactive notebook.
    6. Foster a community of XAI enthusiasts and practitioners.

**Outline:**

    1. Dataset and Problem Statement.
    2. Load and pre-process the dataset.
    3. Train the Model.
    4. Local rules generation - Fidex.
    5. Global ruleSet generation - FidexGlo.
    6. Conclusion.
    7. References.

Through this use case, we aim to show users the potential of Random Forests and Fidex as tools for transparent and interpretable classification. With HES-Xplain, we make XAI accessible, helping users build trust in their models and make informed decisions.

---
## Google Colab Setup

This section prepares the notebook for use with Google Colaboratory. If applicable, change the following variable to True:


In [1]:
# Colab compatibility
use_colab = False

In [2]:
if use_colab:
    # ensure the directory is empty
    !rm -rf * .config

    # install codebase from GitHub
    !git clone --no-checkout https://github.com/HES-XPLAIN/notebooks.git --depth=1 .
    !git config core.sparseCheckout true
    !git sparse-checkout set --cone
    !git sparse-checkout add use_case_dimlpfidex
    !git sparse-checkout reapply
    !git checkout main

    # adjust folder structure
    !mv use_case_dimlpfidex/* .
    !rm -rf use_case_dimlpfidex/

# Dataset and Problem Statement
The dataset we'll be working with is called [obesity or CVD risk](https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster/data) and is accessible on [Kaggle](https://www.kaggle.com). It comprises 2111 records of anonymized data concerning South American individuals and their dietary habits. In this notebook, our focus is on another medical challenge: classifying the risk of obesity based on various factors. These factors, drawn from the dataset, are outlined below with their full names and the labels used in this notebook:

| **Full name**                             | **Used label** |                                                        **Values/Ranges**                                                       | **Description**                                                                     |
|-------------------------------------------|:--------------:|:------------------------------------------------------------------------------------------------------------------------------:|-------------------------------------------------------------------------------------|
| Gender                                    |     Gender     |                                                          Male, Female                                                          | Person's biological gender                                                          |
| Age                                       |       Age      |                                                             [14:61]                                                            | Person's age in years                                                               |
| Height                                    |     Height     |                                                           [1.45:1.98]                                                          | Person's height in meters                                                           |
| Weight                                    |     Weight     |                                                            [39:173]                                                            | Person's weight in kilograms                                                        |
| Family history with overweight            |      FHWO      |                                                             yes, no                                                            | Whether at least one family member suffers or has suffered from overweight          |
| Frequent consumption of high-caloric food |      FAVC      |                                                             yes, no                                                            | Whether the person is frequently consuming high-caloric food                        |
| Frequency of consumption of vegetables    |      FCVC      |                                                              [1:3]                                                             | Leveled frequency of consumption of vegetables                                      |
| Number of main meals                      |       NCP      |                                                              [1:4]                                                             | Person's number of main meals during a day                                          |
| Consumption of food between meals         |      CAEC      |                                                no, sometimes, frequently, always                                               | How often the person eats between main meals                                        |
| Smoker or not                             |      SMOKE     |                                                             yes, no                                                            | Whether the person smokes                                                           |
| Consumption of water daily                |      CH2O      |                                                              [1:3]                                                             | Numeric representation of water consumption frequency per day                       |
| Calories consumption monitoring           |       SCC      |                                                             yes, no                                                            | Whether the person monitors their daily calorie intake                              |
| Physical activity frequency               |       FAF      |                                                              [0:3]                                                             | Numeric representation of physical activity frequency per week                      |
| Time using technology devices             |       TUE      |                                                              [0:2]                                                             | Numeric representation of electronic devices use frequency per day                  |
| Consumption of alcohol                    |      CALC      |                                                no, sometimes, frequently, always                                               | Frequency of alcohol consumption                                                    |
| Transportation used                       |     MTRANS     |                                   Public_Transportation, Automobile, Bike, Motorbike, Walking                                  | Medium usually used to transit                                                      |
| Obesity level deduced                     |       OLD      | Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III | Obesity level observed according to the interpretation of the person's BMI          |

Our goal is to train a random forest model to classify the obesity level based on the other features. To achieve this, we will need to modify the original dataset to convert several features into a format that is suitable for modeling.

# Load and pre-process the dataset
Here we start by simplifying the names of the columns and taking a look at the CSV file containing the raw data:

>**`Pandas` version must be higher than 2.0 and less than 3.0**

In [3]:
import pandas as pd
from dimlpfidex.fidex import fidex, fidexGloRules, fidexGloStats
from trainings.randForestsTrn import randForestsTrn as randomForest

# utility function to preview a file entirely or only the first `nlines` lines
def previewFile(filepath, nlines=-1):
    lines = ""

    with open(filepath, "r") as f:
        if nlines == -1:
            for line in f:
                lines += line
        else:
            for _ in range(nlines):
                try:
                    lines += next(f)
                except StopIteration:
                    break
    print(lines)

dataset = pd.read_csv("data/OCDDataset/ObesityDataSet.csv")

# shorten the longest column names
dataset.rename(
    columns={
        "family_history_with_overweight": "FHWO",
        "NObeyesdad": "OLD",
    },
    inplace=True,
)

# shuffle the entire dataset, then keep only the first 10% of the records
dataset = dataset.sample(frac=1)
nrows = int(dataset.shape[0] * 0.1)
dataset = dataset.iloc[:nrows, :]

dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
303,Female,16.0,1.57,49.0,no,yes,2.0,4.0,Always,no,2.0,no,0.0,1.0,Sometimes,Public_Transportation,Normal_Weight
639,Male,18.0,1.721854,52.514302,yes,yes,2.33998,3.0,Sometimes,no,2.0,no,0.027433,1.884138,Sometimes,Public_Transportation,Insufficient_Weight
20,Male,22.0,1.65,80.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,3.0,2.0,no,Walking,Overweight_Level_II
1717,Male,25.822348,1.766626,114.187096,yes,yes,2.075321,3.0,Sometimes,no,2.144838,no,1.501754,0.276319,Sometimes,Public_Transportation,Obesity_Type_II
1168,Male,22.649792,1.685045,81.022119,yes,yes,2.0,2.756622,Sometimes,no,1.606076,no,0.432973,1.749586,no,Public_Transportation,Overweight_Level_II
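
The shuffle above is not seeded, so each run keeps a different 10% of the records and the outputs shown in this notebook will vary from run to run. The sketch below shows one way to make the subsampling reproducible. The toy frame, the helper name `shuffled_subset` and the seed value `42` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the obesity dataset (hypothetical values).
toy = pd.DataFrame({"Age": range(20), "Weight": range(50, 70)})

def shuffled_subset(df, frac=0.1, seed=42):
    """Shuffle all rows with a fixed seed, then keep the first `frac` of them."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_rows = int(shuffled.shape[0] * frac)
    return shuffled.iloc[:n_rows, :]

subset_a = shuffled_subset(toy)
subset_b = shuffled_subset(toy)

# Two calls with the same seed select exactly the same records.
print(subset_a.index.tolist() == subset_b.index.tolist())
```

Passing `random_state` to `DataFrame.sample` is enough here because the shuffle is the only random step in the pre-processing.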


To make the dataset more compatible with machine learning, we'll start by converting features that have "yes" or "no" values into their binary (0/1) representation:

In [4]:
strToBinDict = {"yes": 1, "no": 0}

dataset["FHWO"] = dataset["FHWO"].replace(strToBinDict).astype("int8")
dataset["FAVC"] = dataset["FAVC"].replace(strToBinDict).astype("int8")
dataset["SMOKE"] = dataset["SMOKE"].replace(strToBinDict).astype("int8")
dataset["SCC"] = dataset["SCC"].replace(strToBinDict).astype("int8")

dataset.head()


Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
303,Female,16.0,1.57,49.0,0,1,2.0,4.0,Always,0,2.0,0,0.0,1.0,Sometimes,Public_Transportation,Normal_Weight
639,Male,18.0,1.721854,52.514302,1,1,2.33998,3.0,Sometimes,0,2.0,0,0.027433,1.884138,Sometimes,Public_Transportation,Insufficient_Weight
20,Male,22.0,1.65,80.0,1,0,2.0,3.0,Sometimes,0,2.0,0,3.0,2.0,no,Walking,Overweight_Level_II
1717,Male,25.822348,1.766626,114.187096,1,1,2.075321,3.0,Sometimes,0,2.144838,0,1.501754,0.276319,Sometimes,Public_Transportation,Obesity_Type_II
1168,Male,22.649792,1.685045,81.022119,1,1,2.0,2.756622,Sometimes,0,1.606076,0,0.432973,1.749586,no,Public_Transportation,Overweight_Level_II
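
As an alternative sketch, the same conversion can be written with `Series.map`, which turns any value missing from the dictionary into `NaN` and therefore makes unexpected entries easy to detect. The helper `encode_yes_no` and the toy data are hypothetical and only illustrate the idea; recent pandas 2.x releases may also emit a downcasting `FutureWarning` for the `replace` form, which `map` avoids.

```python
import pandas as pd

# Toy stand-in with two of the yes/no columns (hypothetical values).
toy = pd.DataFrame({"SMOKE": ["yes", "no", "no"], "SCC": ["no", "no", "yes"]})
yes_no = {"yes": 1, "no": 0}

def encode_yes_no(series, mapping):
    """Map yes/no strings to 1/0 and fail loudly on unexpected values."""
    encoded = series.map(mapping)
    if encoded.isna().any():
        bad = series[encoded.isna()].unique().tolist()
        raise ValueError(f"Unmapped values in {series.name}: {bad}")
    return encoded.astype("int8")

for col in ["SMOKE", "SCC"]:
    toy[col] = encode_yes_no(toy[col], yes_no)

print(toy)
```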


We will convert the `CAEC` and `CALC` columns, which contain the values "Always," "Frequently," "Sometimes," and "no," into a numerical scale from 0.00 to 1.00 based on their frequency. Here's the conversion table:

| **Adjective** | **Conversion Value** |
|---------------|:--------------------:|
| Always        |         1.00         |
| Frequently    |         0.66         |
| Sometimes     |         0.33         |
| no            |         0.00         |

We'll apply the same procedure as before to perform the conversion:

In [5]:
adjToValDict = {"Always": 1.0, "Frequently": 0.66, "Sometimes": 0.33, "no": 0.0}

dataset["CAEC"] = dataset["CAEC"].replace(adjToValDict).astype('float64')
dataset["CALC"] = dataset["CALC"].replace(adjToValDict).astype('float64')

dataset.head()


Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
303,Female,16.0,1.57,49.0,0,1,2.0,4.0,1.0,0,2.0,0,0.0,1.0,0.33,Public_Transportation,Normal_Weight
639,Male,18.0,1.721854,52.514302,1,1,2.33998,3.0,0.33,0,2.0,0,0.027433,1.884138,0.33,Public_Transportation,Insufficient_Weight
20,Male,22.0,1.65,80.0,1,0,2.0,3.0,0.33,0,2.0,0,3.0,2.0,0.0,Walking,Overweight_Level_II
1717,Male,25.822348,1.766626,114.187096,1,1,2.075321,3.0,0.33,0,2.144838,0,1.501754,0.276319,0.33,Public_Transportation,Obesity_Type_II
1168,Male,22.649792,1.685045,81.022119,1,1,2.0,2.756622,0.33,0,1.606076,0,0.432973,1.749586,0.0,Public_Transportation,Overweight_Level_II
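
The notebook does not state how the four conversion values were chosen. One plausible reading, shown here as an assumption, is that the levels are evenly spaced on [0, 1] and truncated to two decimals. The short sketch below reproduces the table under that assumption.

```python
import math

# Levels ordered from least to most frequent.
levels = ["no", "Sometimes", "Frequently", "Always"]

# Evenly spaced positions in [0, 1], truncated (not rounded) to two decimals.
scale = {
    level: math.floor(i / (len(levels) - 1) * 100) / 100
    for i, level in enumerate(levels)
}

print(scale)
```

Rounding instead of truncating would give 0.67 for "Frequently"; either choice preserves the ordering of the levels, which is what matters for the tree-based model used later.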


We'll address three additional columns named `Gender`, `MTRANS`, and `OLD`, which currently contain non-numerical values. These values represent individual options and cannot be quantified using a scale like before. Instead, we'll encode them using a technique called "one-hot encoding". This technique will assign a binary value to each option, representing its presence or absence. 

Let's proceed with applying one-hot encoding:

In [6]:
genderCols = pd.get_dummies(dataset["Gender"], prefix="Gender",prefix_sep='_', dtype='int8')
mtransCols = pd.get_dummies(dataset["MTRANS"], prefix="MTRANS",prefix_sep='_', dtype='int8')
oldCols = pd.get_dummies(dataset["OLD"], prefix="OLD",prefix_sep='_', dtype='int8')
dataset = pd.concat([genderCols, dataset.iloc[:,:16], mtransCols,  dataset.iloc[:,16:], oldCols], axis=1)
dataset.drop(["Gender", "MTRANS", "OLD"], axis=1, inplace=True)

dataset.head()

Unnamed: 0,Gender_Female,Gender_Male,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,...,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking,OLD_Insufficient_Weight,OLD_Normal_Weight,OLD_Obesity_Type_I,OLD_Obesity_Type_II,OLD_Obesity_Type_III,OLD_Overweight_Level_I,OLD_Overweight_Level_II
303,1,0,16.0,1.57,49.0,0,1,2.0,4.0,1.0,...,0,1,0,0,1,0,0,0,0,0
639,0,1,18.0,1.721854,52.514302,1,1,2.33998,3.0,0.33,...,0,1,0,1,0,0,0,0,0,0
20,0,1,22.0,1.65,80.0,1,0,2.0,3.0,0.33,...,0,0,1,0,0,0,0,0,0,1
1717,0,1,25.822348,1.766626,114.187096,1,1,2.075321,3.0,0.33,...,0,1,0,0,0,0,1,0,0,0
1168,0,1,22.649792,1.685045,81.022119,1,1,2.0,2.756622,0.33,...,0,1,0,0,0,0,0,0,0,1
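
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy `MTRANS` column (the values are illustrative). It checks that every record ends up with exactly one indicator set, and that the original label can be recovered from the encoded columns.

```python
import pandas as pd

# Toy stand-in for the MTRANS column (hypothetical values).
toy = pd.DataFrame({"MTRANS": ["Walking", "Bike", "Walking", "Automobile"]})
one_hot = pd.get_dummies(toy["MTRANS"], prefix="MTRANS", prefix_sep="_", dtype="int8")

# Exactly one indicator column is set for every record.
rows_are_one_hot = bool((one_hot.sum(axis=1) == 1).all())

# The original label is the name of the column holding the 1.
recovered = one_hot.idxmax(axis=1).str.removeprefix("MTRANS_")

print(one_hot)
print(rows_are_one_hot, recovered.tolist())
```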


Now that our dataset is ready to use, let's ensure the data's integrity by verifying some information, starting with a general overview of our columns:

In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 211 entries, 303 to 469
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Gender_Female                 211 non-null    int8   
 1   Gender_Male                   211 non-null    int8   
 2   Age                           211 non-null    float64
 3   Height                        211 non-null    float64
 4   Weight                        211 non-null    float64
 5   FHWO                          211 non-null    int8   
 6   FAVC                          211 non-null    int8   
 7   FCVC                          211 non-null    float64
 8   NCP                           211 non-null    float64
 9   CAEC                          211 non-null    float64
 10  SMOKE                         211 non-null    int8   
 11  CH2O                          211 non-null    float64
 12  SCC                           211 non-null    int8   
 13  FAF     

Next, we check that there are no missing values in our dataset:

In [8]:
dataset.isnull().sum()

Gender_Female                   0
Gender_Male                     0
Age                             0
Height                          0
Weight                          0
FHWO                            0
FAVC                            0
FCVC                            0
NCP                             0
CAEC                            0
SMOKE                           0
CH2O                            0
SCC                             0
FAF                             0
TUE                             0
CALC                            0
MTRANS_Automobile               0
MTRANS_Motorbike                0
MTRANS_Public_Transportation    0
MTRANS_Walking                  0
OLD_Insufficient_Weight         0
OLD_Normal_Weight               0
OLD_Obesity_Type_I              0
OLD_Obesity_Type_II             0
OLD_Obesity_Type_III            0
OLD_Overweight_Level_I          0
OLD_Overweight_Level_II         0
dtype: int64

Now, we'll drop any duplicated records and confirm that none remain in our dataset:

In [9]:
dataset = dataset.drop_duplicates()
dataset.duplicated().sum()

0

Lastly, we split our dataset into two distinct subsets, 75% for training and 25% for testing, and write each one to its own file so that the model can use them:

In [10]:
nRecords = dataset.shape[0]
trainSplit = int(0.75 * nRecords)

trainds = dataset.iloc[:trainSplit, :]
testds = dataset.iloc[trainSplit:, :]

# write comma-separated TXT files, without header or index, to comply with our random forest algorithm
rootDir = "data/OCDDataset/"
trainDataFile = "train_dataset.txt"
testDataFile = "test_dataset.txt"

trainds.to_csv(rootDir + trainDataFile, header=False, index=False)
testds.to_csv(rootDir + testDataFile, header=False, index=False)

print(
f"""
Total number of records:\t{nRecords}
Number of records for training:\t{trainds.shape[0]}
Number of records for testing:\t{testds.shape[0]}
"""
)


Total number of records:	210
Number of records for training:	157
Number of records for testing:	53
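
The sizes printed above follow directly from the truncating `int()` in the split. A small sketch of the arithmetic (the helper name `split_sizes` is hypothetical):

```python
def split_sizes(n_records, train_ratio=0.75):
    """Return (train, test) sizes for a positional split like the one above."""
    n_train = int(train_ratio * n_records)  # int() truncates toward zero
    return n_train, n_records - n_train

# 0.75 * 210 = 157.5, truncated to 157, which leaves 53 records for testing.
print(split_sizes(210))
```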



Now that our data is ready and checked for any missing or duplicate values, we're all set to move on to the next step. In the upcoming chapter, we'll use our prepared dataset to train our Random Forest model.

# Train the Model
Here is the main part of our task: training the model. To do so, we're going to use a type of model called random forests.

In this case, we are going to use our Python program called [randomForest](https://hes-xplain.github.io/documentation/algorithms/training-methods/randforeststrn/). Let's begin by printing the program's help message to see every available option:

In [11]:
status = randomForest("--help")

Usage: 
--train_data_file <str> --test_data_file <str> --nb_attributes <int [1,inf[> --nb_classes <int [1,inf[> [-h, --help] [--json_config_file <str>] [--root_folder <str>] [--train_class_file <str>] [--test_class_file <str>] [--train_pred_outfile <str>] [--test_pred_outfile <str>] [--stats_file <str>] [--console_file <str>] [--rules_outfile <str>] [--n_estimators <int [1,inf[>] [--criterion <{gini, entropy, log_loss}>] [--max_depth <int [1,inf[>] [--min_samples_split <int [2,inf[ U float]0,1.0]>] [--min_samples_leaf <int [1,inf[ U float]0,1[>] [--min_weight_fraction_leaf <float [0,0.5]>] [--max_features <{sqrt, log2, all, float ]0,1[, int [1,inf[}>] [--max_leaf_nodes <int [2,inf[>] [--min_impurity_decrease <float [0,inf[>] [--bootstrap <bool>] [--oob_score <bool>] [--n_jobs <int>] [--seed <{int [0,inf[}>] [--verbose <int [0,inf[>] [--warm_start <bool>] [--class_weight <{balanced, balanced_subsample, dict}>] [--ccp_alpha <float [0,inf[>] [--max_samples <int [1,inf[ U float]0,1.0]>]

T

The output reveals various options. Among these, we'll focus on the required parameters (and the `--root_folder` for convenience). Since we've already generated the train and test data files in the previous chapter, our next step is to determine the number of attributes and classes. As the original class is denoted by `OLD`, post one-hot encoding, we need to count the number of labels prefixed with `OLD_`.

In [12]:
labels = list(dataset.columns)

nclasses = sum(1 for label in labels if label.startswith("OLD_"))
nattributes = len(labels) - nclasses
attributesFile = "attributes_file.txt" 


with open(rootDir+attributesFile, 'w') as f:
    for label in dataset.columns:
        f.write(label+'\n')

print(f"# attributes:\t{nattributes}\n# classes:\t{nclasses}")

# attributes:	20
# classes:	7
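
Because the data files are written without a header, the column order matters: in this notebook the class columns come after all attribute columns. The sketch below counts attributes and classes from column names and also verifies that ordering. The helper `count_attributes_and_classes` and the toy column list are illustrative assumptions.

```python
def count_attributes_and_classes(columns, class_prefix="OLD_"):
    """Count attribute and class columns, requiring classes to be grouped at the end."""
    is_class = [name.startswith(class_prefix) for name in columns]
    n_classes = sum(is_class)
    n_attributes = len(columns) - n_classes
    # No class column may appear among the leading attribute positions.
    if any(is_class[:n_attributes]):
        raise ValueError("class columns are not grouped at the end")
    return n_attributes, n_classes

cols = ["Age", "Weight", "MTRANS_Walking", "OLD_Normal_Weight", "OLD_Obesity_Type_I"]
print(count_attributes_and_classes(cols))
```

Applied to the 27 columns of the prepared dataset, the same logic yields the 20 attributes and 7 classes reported above.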


So now, let's gather the elements we have to run the model:

| **Parameter name** | **Input**           |
|--------------------|:-------------------:|
| --root_folder      | data/OCDDataset/    |
| --train_data_file  | train_dataset.txt   |
| --test_data_file   | test_dataset.txt    |
| --nb_attributes    | 20                  |
| --nb_classes       | 7                   |

<br>
With these parameters in place, we can proceed to run our random forest model, allowing the remaining options to be determined by their default settings.

In [13]:
args = f"""
        --root_folder {rootDir} 
        --train_data_file {trainDataFile} 
        --test_data_file {testDataFile} 
        --nb_attributes {nattributes} 
        --nb_classes {nclasses}
        """

randomForest(args)

Parameters list:
 - root_folder                                                   data/OCDDataset/
 - train_data_file                                               data/OCDDataset/train_dataset.txt
 - test_data_file                                                data/OCDDataset/test_dataset.txt
 - nb_attributes                                                 20
 - nb_classes                                                    7
 - train_pred_outfile                                            data/OCDDataset/predTrain.out
 - test_pred_outfile                                             data/OCDDataset/predTest.out
 - stats_file                                                    data/OCDDataset/stats.txt
 - rules_outfile                                                 data/OCDDataset/RF_rules.rls
 - n_estimators                                                  100
 - criterion                                                     gini
 - min_samples_split                                    

0

The algorithm finished and generated the file given by `rules_outfile`. It contains all the rules generated by each tree of the random forest. Let's look at some of the rules from the first tree:

In [14]:
rulesFile = "RF_rules.rls"
previewFile(rootDir+rulesFile, 100)

-------------------
Tree 1
-------------------
Rule 1: X0<=0.5 X8<=3.0478315353393555 X8<=2.9513829946517944 X7<=2.7991230487823486 X4<=105.48030090332031 X4<=74.0 X15<=0.16500000655651093 -> class 0 Covering: [1, 0, 0, 0, 0, 0, 0]
Rule 2: X0<=0.5 X8<=3.0478315353393555 X8<=2.9513829946517944 X7<=2.7991230487823486 X4<=105.48030090332031 X4<=74.0 X15>0.16500000655651093 -> class 5 Covering: [0, 0, 0, 0, 0, 1, 0]
Rule 3: X0<=0.5 X8<=3.0478315353393555 X8<=2.9513829946517944 X7<=2.7991230487823486 X4<=105.48030090332031 X4>74.0 X2<=22.877116203308105 X3<=1.7013825178146362 -> class 6 Covering: [0, 0, 0, 0, 0, 0, 1]
Rule 4: X0<=0.5 X8<=3.0478315353393555 X8<=2.9513829946517944 X7<=2.7991230487823486 X4<=105.48030090332031 X4>74.0 X2<=22.877116203308105 X3>1.7013825178146362 X4<=94.02547836303711 -> class 5 Covering: [0, 0, 0, 0, 0, 1, 0]
Rule 5: X0<=0.5 X8<=3.0478315353393555 X8<=2.9513829946517944 X7<=2.7991230487823486 X4<=105.48030090332031 X4>74.0 X2<=22.877116203308105 X3>1.701382517

The output displays a segment of the newly generated file, showcasing rules generated by each tree of the random forest algorithm. Note that the rules within a single tree look very similar. This is expected: each rule describes one path from the root of the tree to a leaf, so rules leading to neighbouring leaves share most of their antecedents and differ only in the last few splits.
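
To work with these rules programmatically, a line can be split into its antecedents, predicted class and covering vector. The parser below is a sketch based only on the line format visible in the preview above (`Rule N: Xi<=v ... -> class c Covering: [...]`); the names `parse_rf_rule`, `RULE_RE` and `ANTECEDENT_RE` are hypothetical, and the real file may contain variants this sketch does not handle.

```python
import re

# Patterns inferred from the preview above, not from a format specification.
RULE_RE = re.compile(r"^Rule (\d+): (.*?) -> class (\d+) Covering: \[(.*)\]$")
ANTECEDENT_RE = re.compile(r"^X(\d+)(<=|>)(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$")

def parse_rf_rule(line):
    """Split one rule line into id, antecedents, class and covering vector."""
    match = RULE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized rule line: {line!r}")
    rule_id, body, cls, covering = match.groups()
    antecedents = []
    for token in body.split():
        m = ANTECEDENT_RE.match(token)
        if m is None:
            raise ValueError(f"Unrecognized antecedent: {token!r}")
        # (attribute index, comparison operator, threshold)
        antecedents.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return {
        "id": int(rule_id),
        "antecedents": antecedents,
        "class": int(cls),
        "covering": [int(v) for v in covering.split(",")],
    }

line = "Rule 1: X0<=0.5 X4<=74.0 X15>0.165 -> class 0 Covering: [1, 0, 0, 0, 0, 0, 0]"
rule = parse_rf_rule(line)
print(rule)
```

The attribute indices (`X0`, `X4`, ...) appear to follow the column order of the training file, so they can be mapped back to names with the attributes file written earlier.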

With our set of rules generated and ready to go, let's move on to the next chapter and explore the Fidex algorithm to find local rules for given samples.

# Local rules generation - Fidex

Now we can generate local rules to explain the model's results. We can start with launching [Fidex](https://hes-xplain.github.io/documentation/algorithms/fidex/fidex/) on one test sample. This will generate a rule explaining the sample locally. It is called local because the algorithm searches a rule only for one sample.

First of all, let's take a look at Fidex's arguments:

In [15]:
status = fidex("--help")


---------------------------------------------------------------------

The arguments can be specified in the command or in a json configuration file with --json_config_file your_config_file.json.

----------------------------

Required parameters:

--train_data_file <str>       Path to the file containing the train portion of the dataset
--train_class_file <str>      Path to the file containing the train true classes of the dataset, not mandatory if classes are specified in train data file
--train_pred_file <str>       Path to the file containing predictions on the train portion of the dataset
--test_data_file <str>        Path to the file containing the test sample(s) data, prediction (if no --test_pred_file) and true class(if no --test_class_file)
--test_pred_file <str>        Path to the file containing predictions on the test portion of the dataset
--weights_file <str>          Path to the file containing the trained weights of the model (not mandatory if a rules file is given wit

Let's have a closer look at the Fidex help output. It lists a set of **required parameters**, which we go through here:

- `--train_data_file`: a file containing features from the training portion of the dataset
- `--train_pred_file`: a file containing predictions from the training portion of the dataset
- `--train_class_file`: a file containing classes from the training portion of the dataset
- `--test_data_file`: a file containing samples to be used when generating a local rule
- `--weights_file`: a file containing weights from a model training (in our case, we don't need it because we already have a `rules file` from the RF training)
- `--rules_file`: a file containing the rules generated by a model training 
- `--rules_outfile`: a file name that will contain the output of the Fidex algorithm
- `--nb_attributes`: the number of attributes present in the dataset
- `--nb_classes`: the number of classes present in the dataset

There are also optional arguments that we are going to use:
- `--root_folder`: path defining the root directory where every other path specified in other arguments begins
- `--attributes_file`: a file containing all attributes and class names

With all the previous steps completed, we can now run the Fidex program. To see what happens, we first launch it on a single sample. To do so, we save the chosen test sample in a file together with its predictions and its true classes:

In [16]:
localRuleOutFileName = "fidex_rules.rls"
trainClassesFile = "train_classes.txt"
trainPredsFile = "predTrain.out" # generated by the RF
testPredsFile = "predTest.out" # generated by the RF
testSampleFile = "test_sample.txt"

sampleSelected = 0
assert sampleSelected < testds.shape[0]  # the index must exist in the test set


# extract a sample to generate local rule
testPreds = pd.read_csv(rootDir+testPredsFile, sep=" ", header=None, index_col=None).iloc[:, :nclasses]
sampleData = testds.iloc[sampleSelected, :nattributes].to_list()
samplePred = testPreds.iloc[sampleSelected, :].to_list()
sampleClasses = testds.iloc[sampleSelected, nattributes:].to_list()

# write the sample, classes and predictions in the testSampleFile file (file writing format must be respected)
with open(rootDir+testSampleFile, 'w') as f:
    f.write(" ".join(str(x) for x in sampleData) + '\n')
    f.write(" ".join(str(x) for x in samplePred) + '\n')
    f.write(" ".join(str(x) for x in sampleClasses) + '\n')

args = f"""
        --root_folder {rootDir} 
        --train_data_file {trainDataFile} 
        --train_pred_file {trainPredsFile} 
        --test_data_file {testSampleFile}  
        --rules_file {rulesFile} 
        --rules_outfile {localRuleOutFileName} 
        --nb_attributes {nattributes} 
        --attributes_file {attributesFile} 
        --nb_classes {nclasses}
        """

status = fidex(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - test_data_file                                                       data/OCDDataset/test_sample.txt
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - rules_outfile                                                        data/OCDDataset/fidex_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     20
 - nb_classes                                                                                         7
 - nb_quant_levels                             

The output of the algorithm shows us, in the terminal, a walkthrough of the process. At the end of it, you can observe the generated rule. Let's have a closer look at it by extracting the freshly written rule file: 

In [17]:
previewFile(rootDir+localRuleOutFileName, 20)

No decision threshold is used.

Rule for sample 0 :

NCP<1.784425 TUE>=1.373035 TUE<1.785434 -> OLD_Obesity_Type_I
   Train Covering size : 2
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.703333




The output displays a preview of a rule generated by Fidex. Each rule includes various properties:
- The index of the sample from which the rule has been generated
- The rule itself, composed of one or more antecedents and the predicted class
- The number of samples, in the training dataset, covered by the rule
- The fidelity of the rule, i.e. how well it agrees with the model's predictions
- The accuracy of the rule, measured against the true classes
- The confidence of the rule, derived from the model's prediction values

These rules provide insights into the model's predictions for each sample, helping to explain its decision-making process.
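
To illustrate how such properties can be computed, the sketch below evaluates a rule on a tiny hand-made training set. It assumes that covering is the number of samples satisfying every antecedent, that fidelity is the share of covered samples whose model prediction equals the rule's class, and that accuracy is the share whose true class equals it. This is a reading of the list above, not Fidex's actual implementation, and `rule_metrics` and all data values are hypothetical.

```python
import operator

# Comparison operators as they appear in the Fidex rules shown above.
OPS = {"<": operator.lt, ">=": operator.ge}

def rule_metrics(antecedents, rule_class, samples, model_preds, true_classes):
    """Compute covering size, fidelity and accuracy of one rule (assumed definitions)."""
    covered = [
        i for i, sample in enumerate(samples)
        if all(OPS[op](sample[attr], thr) for attr, op, thr in antecedents)
    ]
    n = len(covered)
    if n == 0:
        return {"covering": 0, "fidelity": None, "accuracy": None}
    fidelity = sum(model_preds[i] == rule_class for i in covered) / n
    accuracy = sum(true_classes[i] == rule_class for i in covered) / n
    return {"covering": n, "fidelity": fidelity, "accuracy": accuracy}

samples = [
    {"Weight": 110.0, "Age": 35.0},
    {"Weight": 105.0, "Age": 40.0},
    {"Weight": 60.0, "Age": 36.0},
    {"Weight": 120.0, "Age": 20.0},
]
model_preds = ["Obesity_Type_II", "Obesity_Type_II", "Normal_Weight", "Obesity_Type_III"]
true_classes = ["Obesity_Type_II", "Obesity_Type_I", "Normal_Weight", "Obesity_Type_III"]

rule = [("Weight", ">=", 102.0), ("Age", ">=", 33.0)]
metrics = rule_metrics(rule, "Obesity_Type_II", samples, model_preds, true_classes)
print(metrics)
```

In this toy case the rule covers the first two samples: the model predicts the rule's class for both (fidelity 1.0), but only one of them truly belongs to it (accuracy 0.5). A rule can therefore explain the model faithfully even where the model itself is wrong.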

Next, we'll rerun Fidex with all test samples to generate a comprehensive set of rules for further analysis. Please note that this process may take some time depending on the dataset size.

In [18]:
args = f"""
        --root_folder {rootDir} 
        --train_data_file {trainDataFile} 
        --train_pred_file {trainPredsFile} 
        --test_data_file {testDataFile}  
        --test_pred_file {testPredsFile}
        --rules_file {rulesFile} 
        --rules_outfile {localRuleOutFileName} 
        --nb_attributes {nattributes} 
        --attributes_file {attributesFile} 
        --nb_classes {nclasses}
        """

status = fidex(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - test_data_file                                                      data/OCDDataset/test_dataset.txt
 - test_pred_file                                                          data/OCDDataset/predTest.out
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - rules_outfile                                                        data/OCDDataset/fidex_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     20
 - nb_classes                                  

The Fidex algorithm generated a rule file. Let's look at what it contains:

In [19]:
fidexRulesOutfile = "fidex_rules.rls"
previewFile(rootDir+fidexRulesOutfile, 100) 

No decision threshold is used.

Rule for sample 0 :

NCP<1.784425 TUE>=1.373035 Age>=17 -> OLD_Obesity_Type_I
   Train Covering size : 4
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.722

-------------------------------------------------

Rule for sample 1 :

NCP>=3.993853 Gender_Female>=0.5 -> OLD_Normal_Weight
   Train Covering size : 3
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.6625

-------------------------------------------------

Rule for sample 2 :

Weight>=102.261555 Age>=33.095692 -> OLD_Obesity_Type_II
   Train Covering size : 2
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.806667

-------------------------------------------------

Rule for sample 3 :

Weight>=102.261555 Age>=33.095692 -> OLD_Obesity_Type_II
   Train Covering size : 2
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.84

-------------------------------------------------

Rule for sample 4 :

CALC>=0.33 NCP>=1.000771 -> OLD

As you can see, the output is very similar to the single-sample run. The only difference is the number of rules generated, which is proportional to the number of samples.

Running Fidex with all samples provides a comprehensive set of rules adapted to every sample given. These rules are useful for understanding how different factors influence the model's predictions for various samples.
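If you want to analyze these local rules programmatically, the text format shown above is simple enough to parse. The following is a minimal sketch, not part of the dimlpfidex API: the helper names `parse_rule_line` and `parse_rules` are hypothetical, the input is a short excerpt copied from the preview above, and the parser assumes antecedents only use the `>=` and `<` operators seen in that output.

```python
import re

# Excerpt of the rule file preview shown above.
rules_text = """\
Rule for sample 0 :

NCP<1.784425 TUE>=1.373035 Age>=17 -> OLD_Obesity_Type_I
   Train Covering size : 4
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.722

-------------------------------------------------

Rule for sample 1 :

NCP>=3.993853 Gender_Female>=0.5 -> OLD_Normal_Weight
   Train Covering size : 3
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.6625
"""

# One antecedent looks like "<attribute><operator><threshold>".
ANTECEDENT = re.compile(r"^(\w+)(>=|<)(-?\d+(?:\.\d+)?)$")


def parse_rule_line(line):
    """Split 'A>=1 B<2 -> Class' into a list of antecedents and a class name."""
    left, target = line.split("->")
    antecedents = []
    for token in left.split():
        match = ANTECEDENT.match(token)
        if match is None:
            raise ValueError(f"Unrecognized antecedent: {token!r}")
        attribute, op, threshold = match.groups()
        antecedents.append((attribute, op, float(threshold)))
    return antecedents, target.strip()


def parse_rules(text):
    """Collect every rule in the text together with its sample index."""
    rules, sample_id = [], None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.match(r"Rule for sample (\d+)", line)
        if header:
            sample_id = int(header.group(1))
        elif "->" in line:
            antecedents, target = parse_rule_line(line)
            rules.append(
                {"sample": sample_id, "antecedents": antecedents, "class": target}
            )
    return rules


rules = parse_rules(rules_text)
for rule in rules:
    print(rule["sample"], rule["class"], rule["antecedents"])
```

Once the rules are in this form, it becomes easy to count, for example, how often each attribute appears across all local explanations.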

In the next chapter, we will move on to global ruleSet generation using FidexGloRules. This will help us understand the overall behavior of the model by generating a set of global rules.

# Global ruleSet generation - FidexGlo
We have seen how to compute a rule that explains the decision of the model for a specific sample with the Fidex algorithm. But how can we get a general set of rules that characterizes the whole training dataset? The [FidexGloRules](https://hes-xplain.github.io/documentation/algorithms/fidex/fidexglorules) algorithm makes this possible.

A global ruleset is a collection of rules that explains the model's decision for each sample in the training portion of the dataset. Let's have a look at the fidexGloRules arguments:

In [20]:
status = fidexGloRules("--help")


---------------------------------------------------------------------

The arguments can be specified in the command or in a json configuration file with --json_config_file your_config_file.json.

----------------------------

Required parameters:

--train_data_file <str>       Path to the file containing the train portion of the dataset
--train_class_file <str>      Path to the file containing the train true classes of the dataset, not mandatory if classes are specified in train data file
--train_pred_file <str>       Path to the file containing predictions on the train portion of the dataset
--weights_file <str>          Path to the file containing the trained weights of the model (not mandatory if a rules file is given with --rules_file)
--rules_file <str>            Path to the file containing the trained rules to be converted to hyperlocus (not mandatory if a weights file is given with --weights_file)
--global_rules_outfile <str>  Path to the file where the output rule(s) will be

While the required parameters are very similar to those of the `Fidex` algorithm, there are also many optional arguments that you can use to customize the behavior of the algorithm. Let's have a look at some of them:

- `--heuristic`: selects one of several ways to run the algorithm. These aim to increase execution speed, but they also affect the quality of the results.
- `--nb_threads`: number of threads used to run the algorithm; using more threads speeds up the process.
- `--min_covering`: minimal number of samples a rule must cover
- `--max_failed_attempts`: maximum failed attempts allowed when generating a rule
- `--min_fidelity`: minimal fidelity allowed when generating a rule

In [21]:
heuristic = 1
nthreads = 2
globalRulesOutfile = "fidexGloRules_rules.rls"

args = f"""
        --root_folder {rootDir} 
        --nb_threads {nthreads} 
        --train_data_file {trainDataFile} 
        --train_pred_file {trainPredsFile} 
        --rules_file {rulesFile} 
        --attributes_file {attributesFile} 
        --nb_attributes {nattributes} 
        --nb_classes {nclasses} 
        --heuristic {heuristic} 
        --global_rules_outfile {globalRulesOutfile}
        """

status = fidexGloRules(args)

Parameters list:
 - train_data_file                                                    data/OCDDataset/train_dataset.txt
 - train_pred_file                                                        data/OCDDataset/predTrain.out
 - rules_file                                                              data/OCDDataset/RF_rules.rls
 - global_rules_outfile                                         data/OCDDataset/fidexGloRules_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - nb_attributes                                                                                     20
 - nb_classes                                                                                         7
 - nb_quant_levels                                                                                   50
 - heuristic                                   

The algorithm generated a file that we're going to partially observe:

In [22]:
previewFile(rootDir+globalRulesOutfile)

Number of rules : 53, mean sample covering number per rule : 3.924528, mean number of antecedents per rule : 2.679245
No decision threshold is used.

Rule 1: Weight>=113.07732 Gender_Male>=0.5 -> OLD_Obesity_Type_II
   Train Covering size : 18
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.924444

Rule 2: Weight>=102.000061 Gender_Female>=0.5 -> OLD_Obesity_Type_III
   Train Covering size : 15
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.936667

Rule 3: Weight<53.543285 Height>=1.650095 -> OLD_Insufficient_Weight
   Train Covering size : 12
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.854167

Rule 4: Weight<61.931969 Height>=1.700091 -> OLD_Insufficient_Weight
   Train Covering size : 11
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.862727

Rule 5: FCVC<2.104439 Age<21.978546 Age>=20.497272 Weight<78.764286 Weight>=53.04331 -> OLD_Overweight_Level_I
   Train Covering size : 9
   Train Fidelity : 1


> *The algorithm result is subject to randomness as it uses random processes to compute. Results may differ between executions.*

You can observe that the rules are ordered by their covering size. The first rule is the one that covers the most samples in the training portion of the dataset.

Here is an example rule that we are going to analyze (it may not match your own execution because of randomness):

```md
Rule 1: Weight>=101.258713 Gender_Female>=0.5 -> OLD_Obesity_Type_III
   Train Covering size : 25
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.9852
```

This rule means that the model is about **98.5% confident** that a person could suffer from `Obesity Type III` if their weight is at least `~101.3 kg` and they are, biologically speaking, female. The rule also has **100% fidelity** to the model's predictions and is **100% accurate** on the training portion of the dataset.
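To see what "covered by a rule" means in practice, here is a small sketch that checks the example rule against a few made-up samples. The samples and the helper `rule_covers` are hypothetical; a sample is considered covered only when every antecedent of the rule holds.

```python
import operator

# Comparison operators that appear in the rules shown in this notebook.
OPS = {">=": operator.ge, "<": operator.lt}

# The example rule analyzed above.
rule = {
    "antecedents": [("Weight", ">=", 101.258713), ("Gender_Female", ">=", 0.5)],
    "class": "OLD_Obesity_Type_III",
}


def rule_covers(rule, sample):
    """Return True when the sample satisfies every antecedent of the rule."""
    return all(
        OPS[op](sample[attribute], threshold)
        for attribute, op, threshold in rule["antecedents"]
    )


# Hypothetical samples, restricted to the two attributes the rule uses.
samples = [
    {"Weight": 110.0, "Gender_Female": 1.0},  # heavy enough and female
    {"Weight": 110.0, "Gender_Female": 0.0},  # heavy enough but not female
    {"Weight": 95.0, "Gender_Female": 1.0},   # female but below the threshold
]

coverage = [rule_covers(rule, sample) for sample in samples]
print(coverage)  # [True, False, False]
```

Only the first sample activates the rule, so it is the only one for which the rule would predict `OLD_Obesity_Type_III`.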

To get statistics on the test portion of the dataset, let's execute the [fidexGloStats](https://hes-xplain.github.io/documentation/algorithms/fidex/fidexglostats) algorithm. We begin with an overview of the program's arguments:

In [23]:
status = fidexGloStats("--help")


---------------------------------------------------------------------

The arguments can be specified in the command or in a json configuration file with --json_config_file your_config_file.json.

----------------------------

Required parameters:

--test_data_file <str>        Path to the file containing the test portion of the dataset
--test_class_file <str>       Path to the file containing the test true classes of the dataset, not mandatory if classes are specified in test data file
--test_pred_file <str>        Path to the file containing predictions on the test portion of the dataset
--global_rules_file <str>     Path to the file containing the global rules obtained with fidexGloRules algorithm.
--nb_attributes <int [1,inf[> Number of attributes in the dataset
--nb_classes <int [2,inf[>    Number of classes in the dataset

----------------------------

Optional parameters: 

-h --help                     Show this help message and exit
--json_config_file <str>      Path to the J

As you can observe, the required arguments are much the same as in the previous executions. The only new one is `--global_rules_file`, which takes the global rules file on which the statistics are computed. Let's try this:

In [24]:
statsOutfileName = "RF_stats.txt"

args = f"""
        --root_folder {rootDir}
        --test_data_file {testDataFile}
        --test_pred_file {testPredsFile}
        --global_rules_file {globalRulesOutfile}
        --nb_attributes {nattributes}
        --nb_classes {nclasses}
        --attributes_file {attributesFile}
        --stats_file {statsOutfileName}
        """

status = fidexGloStats(args)

Parameters list:
 - test_data_file                                                      data/OCDDataset/test_dataset.txt
 - test_pred_file                                                          data/OCDDataset/predTest.out
 - global_rules_file                                            data/OCDDataset/fidexGloRules_rules.rls
 - root_folder                                                                         data/OCDDataset/
 - attributes_file                                                  data/OCDDataset/attributes_file.txt
 - stats_file                                                              data/OCDDataset/RF_stats.txt
 - nb_attributes                                                                                     20
 - nb_classes                                                                                         7
 - positive_class_index                                                                              -1
End of Parameters list.

Importing files...

Da

The execution of the algorithm generated a file that we named `RF_stats.txt`, which contains much the same information as the program output. Let's have a look inside the generated file:

In [25]:
previewFile(rootDir+statsOutfileName)

Global statistics of the rule set : 
Number of rules : 53, mean sample covering number per rule : 3.924528, mean number of antecedents per rule : 2.679245

Statistics with a test set of 53 samples :

No decision threshold is used.
No positive index class is used.
The global rule fidelity rate is : 0.830189
The global rule accuracy is : 0.792453
The explainability rate (when we can find one or more rules, either correct ones or activated ones which all agree on the same class) is : 0.905660
The default rule rate (when we can't find any rule activated for a sample) is : 0.018868
The mean number of correct(fidel) activated rules per sample is : 1.018868
The mean number of wrong(not fidel) activated rules per sample is : 0.886792
The model test accuracy is : 0.811321
The model test accuracy when rules and model agree is : 0.886364
The model test accuracy when activated rules and model agree is : 0.883721



The output of the program shows various metrics. Let's have a look at them individually:

- `Global statistics`: Several values expressing general information about the ruleset.
- `Decision threshold`: Value above which a class is considered true. In this case, the output states that `no decision threshold is used`.
- `Positive index class`: Index of the class considered as the positive one. It only applies when a decision threshold is used, which is not the case here.
- `Global rule fidelity rate`: Proportion of samples for which the ruleset's prediction matches the model's prediction.
- `Global rule accuracy`: Proportion of correct predictions made by the ruleset.
- `Explainability rate`: Proportion of the samples that could be explained by one or more rules.
- `Default rule rate`: Proportion of samples that could not be explained by a rule offered by the ruleset.
- `Mean number of correct activated rules`: Average number of correct rules activated per sample.
- `Mean number of wrong activated rules`: Average number of incorrect rules activated per sample.
- `Model test accuracy`: Accuracy of the model on the test dataset.
- `Model test accuracy when rules agree`: Accuracy of the model on test samples where the ruleset and model predictions agree.
- `Model test accuracy when activated rules agree`: Accuracy when at least one activated rule agrees with the model's prediction.

With this program, you can have a general overview of the quality of the ruleset.
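Because the test set here contains only 53 samples, each rate can be converted back into a number of samples, which is often easier to reason about. The sketch below does this arithmetic for the values reported above; the variable names are illustrative, and the counts are simply the nearest integers consistent with the printed rates.

```python
n_test = 53  # number of test samples reported in the statistics

# Rates copied from the fidexGloStats output above.
rates = {
    "fidelity": 0.830189,
    "accuracy": 0.792453,
    "explainability": 0.905660,
    "default_rule": 0.018868,
    "model_accuracy": 0.811321,
}

# Convert each rate to the nearest whole number of samples.
counts = {name: round(rate * n_test) for name, rate in rates.items()}

for name, count in counts.items():
    print(f"{name}: {count}/{n_test} samples")
```

Read this way, the ruleset agrees with the model on 44 of the 53 test samples, and a single sample activates no rule at all and falls back to the default rule.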

With the generation of local and global rules using the Fidex algorithms, we have a clearer view of how our model makes predictions. These rules help us understand the model's decisions, making it more transparent. Now, let's wrap up our findings and discuss the importance of explainable AI in the final chapter.

# Conclusion

In this notebook, we explored explainable AI using Random Forests and the Fidex family of algorithms. We prepared our dataset, trained a Random Forest model, and examined the generated rules. We used `Fidex` to create local rules for individual sample explanations and `FidexGloRules` to generate a global ruleset for the entire training dataset. Finally, we evaluated the ruleset with `FidexGloStats`, providing insights into the model's accuracy, fidelity, and explainability.

This process demonstrated how explainable AI techniques can clarify complex models, making them more transparent and trustworthy. By understanding our model's decision-making process, we can ensure better, more reliable outcomes in various applications. Using Random Forests with Fidex offers a balanced approach to building interpretable and effective AI models.

# References

HES-XPLAIN: [website](https://hes-xplain.github.io/), [GitHub page](https://github.com/HES-XPLAIN)

Dataset: [source](https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster), [author](https://www.kaggle.com/aravindpcoder)

Dimlpfidex: [GitHub repository](https://github.com/HES-XPLAIN/dimlpfidex), [documentation](https://hes-xplain.github.io/documentation/overview/)

Algorithms: [randomForest](https://hes-xplain.github.io/documentation/algorithms/training-methods/randforeststrn/), [Fidex](https://hes-xplain.github.io/documentation/algorithms/fidex/fidex/), [FidexGloRules](https://hes-xplain.github.io/documentation/algorithms/fidex/fidexglorules), [FidexGloStats](https://hes-xplain.github.io/documentation/algorithms/fidex/fidexglostats)
