# IMPTOX 

## Random Forest modelling and Rules extraction with FIDEX


This example assumes a fully numerical dataset without any missing values.

This means that all values:
- TODO COMPLETE CONDITIONS
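
A minimal sketch of how the two requirements stated above (numerical values only, no missing values) could be checked before exporting the data. The helper `find_invalid_cells` is hypothetical and not part of dimlpfidex; it assumes the data is available as rows of strings, as `csv.reader` would produce.

```python
import math


def find_invalid_cells(rows):
    """Return (row, column, value) for every cell that is not a finite number."""
    invalid = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            try:
                number = float(value)
            except (TypeError, ValueError):
                # Empty strings and text such as "n/a" end up here
                invalid.append((r, c, value))
                continue
            if not math.isfinite(number):  # rejects NaN and +/-inf
                invalid.append((r, c, value))
    return invalid


# Hypothetical rows: the second and third contain problematic cells
rows = [["1", "85", "26.6"], ["8", "", "23.3"], ["1", "nan", "n/a"]]
print(find_invalid_cells(rows))
```

An empty result means every cell parsed as a finite number.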


Note: Run this notebook from the root of the repository.


In [1]:
import numpy as np
import pandas as pd
import pathlib as pl
import matplotlib.pyplot as plt
import os
import pickle
import kagglehub

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

from dimlpfidex.fidex import fidex, fidexGloRules, fidexGloStats

# We'll try with this: https://hes-xplain.github.io/documentation/dimlpfidex/training-methods/randforeststrn/#arguments-list

# Example notebook here: https://colab.research.google.com/github/HES-XPLAIN/notebooks/blob/main/use_case_dimlpfidex/obesityCvdRisk.ipynb#scrollTo=9chG9JhJB0i3

from trainings import randForestsTrn # Part of DIMLPFIDEX



In [None]:
# Make rf directory 
os.makedirs('./notebooks/rf', exist_ok=True)
root_dir = pl.Path("./notebooks/rf")

In [3]:
# Launch the server from the root (IMPTOX_XAI) directory
!pwd

/mnt/c/Users/TSchowing/Desktop/repositories/IMPTOX_XAI


In [4]:
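# Download the latest version of the dataset and load it into a DataFrame. Replace this with your own dataset.
path = kagglehub.dataset_download("gargmanas/pima-indians-diabetes")
print("Path to dataset files:", path)

# Caution: read_csv treats the first line of the file as a header by default.
# If your CSV has no header row, pass header=None so the first record is not consumed as column names.
df = pd.read_csv(os.path.join(path, 'pima-indians-diabetes.csv'))
df.columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age','OUT']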

df

Path to dataset files: /home/twg/.cache/kagglehub/datasets/gargmanas/pima-indians-diabetes/versions/1


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,OUT
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0
...,...,...,...,...,...,...,...,...,...
762,10,101,76,48,180,32.9,0.171,63,0
763,2,122,70,27,0,36.8,0.340,27,0
764,5,121,72,23,112,26.2,0.245,30,0
765,1,126,60,0,0,30.1,0.349,47,1


In [5]:
# Markdown sample for reporting
#df.head(8).to_markdown()

Now that we have a dataset in DataFrame format whose last column, named "OUT", is the output, we can save the files and define our global parameters. Adjust these values to match your data.

In [6]:
NB_FEATURES = 8 # Number of input columns; equivalently len(df.columns) - 1
NB_CLASSES = 2 # Number of distinct classes in the output column, not the number of output columns

# Shuffle and split the dataset into training and testing sets
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42, shuffle=True, stratify=df['OUT'])

# Full dataset to csv
train_data.to_csv(root_dir.joinpath("train_dataset_full.csv"), header=False, index=False)
test_data.to_csv(root_dir.joinpath("test_dataset_full.csv"), header=False, index=False)


# Save the features names in a txt file
features = df.columns[:-1]
features_filename = "attributes.txt"
with open(root_dir.joinpath(features_filename), "w") as file:
    for feature in features:
        file.write(f"{feature}\n")


# Save the dataset in txt format for the FIDEX algorithm
# - Train and test data, without output column
train_data.iloc[:, :-1].to_csv(root_dir.joinpath("train_dataset.txt"), header=False, index=False)
test_data.iloc[:, :-1].to_csv(root_dir.joinpath("test_dataset.txt"), header=False, index=False)

# - Train and test data, output column only
train_data.iloc[:, -1:].to_csv(root_dir.joinpath("train_class.txt"), header=False, index=False)
test_data.iloc[:, -1:].to_csv(root_dir.joinpath("test_class.txt"), header=False, index=False)
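
A quick sanity check on exported files can catch a mismatch between the data and the declared number of features before training. The sketch below is an illustration only: `check_export` is a hypothetical helper, and it is demonstrated on two throwaway files rather than on the files written above.

```python
import csv
import tempfile
from pathlib import Path


def check_export(data_file, class_file, nb_features):
    """True if every data row has nb_features values and both files have the same number of rows."""
    with open(data_file, newline="") as f:
        data_rows = list(csv.reader(f))
    with open(class_file, newline="") as f:
        class_rows = list(csv.reader(f))
    bad_rows = [i for i, row in enumerate(data_rows) if len(row) != nb_features]
    return len(data_rows) == len(class_rows) and not bad_rows


# Demo on two small temporary files with 3 features per row
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "data.txt").write_text("1,85,66\n8,183,64\n")
    (tmp / "class.txt").write_text("0\n1\n")
    ok = check_export(tmp / "data.txt", tmp / "class.txt", nb_features=3)
    mismatch = check_export(tmp / "data.txt", tmp / "class.txt", nb_features=8)
print(ok, mismatch)
```

In this notebook the same check would be called with `root_dir.joinpath("train_dataset.txt")`, `root_dir.joinpath("train_class.txt")` and `NB_FEATURES`.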



# HES-XPLAIN training methods

Here we use `randForestsTrn`, one of the [training methods](https://hes-xplain.github.io/documentation/dimlpfidex/training-methods/overview/) available alongside the FIDEX algorithm for generating interpretable decisions.

If you prefer, you can generate a configuration file using the [DIMLP Fidex GUI](https://hes-xplain.github.io/documentation/dimlpfidex/gui/#installation-guide). Here, however, we pass a few basic parameters directly for readability and simplicity. To list all parameters and their options, you can execute the following code:

```python
randForestsTrn(
"""--help"""
)
```

Then we can run the training with the following parameters:

- `--root_folder`: `notebooks/rf`, the folder in which the files are read and written. The path is relative to the root of this repository, which is where the Jupyter server should be launched.
- `--train_data_file`: `train_dataset_full.csv`. Several input layouts are supported; here we provide a single file containing both the features and the output.
- `--test_data_file`: `test_dataset_full.csv`. As above, a single file containing both the features and the output.
- `--nb_attributes`: `8`, the number of variables/features, excluding the output.
- `--nb_classes`: `2`, the number of classes present in the output column. In our example there are two classes (0/1).
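
When the parameter list grows, it can be easier to keep the values in a dictionary and build the argument string from it. The sketch below uses a hypothetical helper, `build_args`, that is not part of dimlpfidex; it assumes the functions accept the newline-separated `--name value` format used in this notebook.

```python
def build_args(params):
    """Format a dict as newline-separated '--name value' pairs."""
    return "\n".join(f"--{name} {value}" for name, value in params.items())


args = build_args({
    "root_folder": "notebooks/rf",
    "train_data_file": "train_dataset_full.csv",
    "test_data_file": "test_dataset_full.csv",
    "nb_attributes": 8,
    "nb_classes": 2,
})
print(args)
```

The resulting string could then be passed as `randForestsTrn(args)`.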

In [7]:
randForestsTrn(
f"""
--root_folder {root_dir}
--train_data_file train_dataset_full.csv
--test_data_file test_dataset_full.csv
--nb_attributes {NB_FEATURES}
--nb_classes {NB_CLASSES}"""
)

Parameters list:
 - root_folder                                                   notebooks/rf
 - train_data_file                                               notebooks/rf/train_dataset_full.csv
 - test_data_file                                                notebooks/rf/test_dataset_full.csv
 - nb_attributes                                                 8
 - nb_classes                                                    2
 - train_pred_outfile                                            notebooks/rf/predTrain.out
 - test_pred_outfile                                             notebooks/rf/predTest.out
 - stats_file                                                    notebooks/rf/stats.txt
 - rules_outfile                                                 notebooks/rf/RF_rules.rls
 - n_estimators                                                  100
 - criterion                                                     gini
 - min_samples_split                                             2
 -

0

Several files specific to this training method have been created.

In [8]:
!ls ./notebooks/rf/

FidexGlo_RF_global_rules_out.rls  predTrain.out		 train_class.txt
Fidex_RF_rules_out.rls		  stats.txt		 train_dataset.txt
RF_rules.rls			  test_class.txt	 train_dataset_full.csv
attributes.txt			  test_dataset.txt
predTest.out			  test_dataset_full.csv


The files produced give access to the training and testing accuracies, as well as the predicted probabilities for class 0 and class 1.

In [9]:
!cat ./notebooks/rf/stats.txt

Training accuracy : 100%.
Testing accuracy : 69.480519%.
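
The gap between the training and testing accuracies above may indicate that the forest overfits the training data. A minimal sketch for reading both values programmatically, assuming `stats.txt` always uses the two-line format shown above (the helper `parse_accuracies` is hypothetical):

```python
import re

ACCURACY = re.compile(r"^(Training|Testing) accuracy\s*:\s*([0-9]+(?:\.[0-9]+)?)%", re.MULTILINE)


def parse_accuracies(text):
    """Return {'Training': ..., 'Testing': ...} with accuracies in percent."""
    return {name: float(value) for name, value in ACCURACY.findall(text)}


# Content copied from the stats.txt output shown above
stats_text = "Training accuracy : 100%.\nTesting accuracy : 69.480519%.\n"
acc = parse_accuracies(stats_text)
gap = acc["Training"] - acc["Testing"]
print(acc, f"gap = {gap:.2f} points")
```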

In [17]:
!cat ./notebooks/rf/predTest.out | head -n 10

0.3 0.7 
0.71 0.29 
0.89 0.11 
0.19 0.81 
0.96 0.04 
0.51 0.49 
0.59 0.41 
0.11 0.89 
1.0 0.0 
0.24 0.76 
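
Each line of the prediction file can be turned into a predicted class by taking the position of the highest probability. The sketch below assumes that the first column is the probability of class 0 and the second that of class 1, and that no decision threshold is applied; `predicted_classes` is a hypothetical helper.

```python
def predicted_classes(pred_text):
    """Return the index of the highest probability on each non-empty line."""
    classes = []
    for line in pred_text.splitlines():
        if not line.strip():
            continue
        probs = [float(p) for p in line.split()]
        classes.append(probs.index(max(probs)))
    return classes


# The ten lines of predTest.out shown above
sample = """0.3 0.7
0.71 0.29
0.89 0.11
0.19 0.81
0.96 0.04
0.51 0.49
0.59 0.41
0.11 0.89
1.0 0.0
0.24 0.76
"""
preds = predicted_classes(sample)
print(preds)
```

Comparing this list with the matching lines of `test_class.txt` gives the per-sample correctness behind the testing accuracy.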


# FIDEX

We can generate local rules to explain the model's results on one test sample, or on all of them, depending on the size of the test set.


```python
fidex("--help")
```
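
To explain a single prediction, one option is to give FIDEX files that contain only that sample. The sketch below uses a hypothetical helper, `extract_line`, to copy one row out of a text file. Whether `fidex` accepts one-row test and prediction files in exactly this way is an assumption that this notebook does not verify.

```python
import tempfile
from pathlib import Path


def extract_line(src, dst, index):
    """Copy the line at position `index` (0-based) from src into a new file dst."""
    lines = Path(src).read_text().splitlines()
    Path(dst).write_text(lines[index] + "\n")
    return lines[index]


# Demo on a throwaway file with three rows
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test_dataset.txt"
    dst = Path(tmp) / "test_sample_1.txt"
    src.write_text("1,85,66\n8,183,64\n1,89,66\n")
    picked = extract_line(src, dst, index=1)
    written = dst.read_text()
print(picked)
```

The same index would have to be used on the prediction file so that the sample and its prediction stay aligned.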

In [10]:
# Set all parameters (file names are relative to root_dir)
train_data_file = "train_dataset.txt"
train_class_file = "train_class.txt"
train_pred_file = "predTrain.out"
test_data_file = "test_dataset.txt"
test_pred_file = "predTest.out"
rules_file = "RF_rules.rls"
rules_outfile = "Fidex_RF_rules_out.rls"
attributes_file = "attributes.txt"
nb_attributes = NB_FEATURES
nb_classes = NB_CLASSES

In [11]:
# Run the FIDEX algorithm
fidex(f"""
--root_folder {root_dir}
--train_data_file {train_data_file}
--train_class_file {train_class_file}
--test_data_file {test_data_file}
--train_pred_file {train_pred_file}
--test_pred_file {test_pred_file}
--rules_file {rules_file}
--rules_outfile {rules_outfile}
--attributes_file {features_filename}
--nb_attributes {nb_attributes}
--nb_classes {nb_classes}
""")

Parameters list:
 - train_data_file                                                       notebooks/rf/train_dataset.txt
 - train_pred_file                                                           notebooks/rf/predTrain.out
 - train_class_file                                                        notebooks/rf/train_class.txt
 - test_data_file                                                         notebooks/rf/test_dataset.txt
 - test_pred_file                                                             notebooks/rf/predTest.out
 - rules_file                                                                 notebooks/rf/RF_rules.rls
 - rules_outfile                                                    notebooks/rf/Fidex_RF_rules_out.rls
 - root_folder                                                                             notebooks/rf
 - attributes_file                                                          notebooks/rf/attributes.txt
 - nb_attributes                               

0

                                                                       0.000000
 - min_fidelity                                                                                1.000000
 - lowest_min_fidelity                                                                         0.750000
 - covering_strategy                                                                                  1
End of Parameters list.

Import files...

Import time = 0.08738 sec
Files imported

----------------------------------------------

Creation of hyperspace...
Hyperspace created

Computation of rule for sample 0 : 

Initial fidelity : 0.347471
Final fidelity : 1

Extracted rule :
Glucose>=158.5 Glucose<160.5 -> class 1
   Train Covering size : 2
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.833333

Result found after 2 iterations.
-------------------------------------------------
Computation of rule for sample 1 : 

Initial fidelity : 0.652529
Final fidelity : 1

Extracted rule :


The output file contains one rule for each test sample:

In [12]:
!head -n 20 ./notebooks/rf/Fidex_RF_rules_out.rls

   Train Confidence : 0.948

Result found after 2 iterations.
-------------------------------------------------
Computation of rule for sample 58 : 

Initial fidelity : 0.652529
Final fidelity : 1

Extracted rule :
Age<28.5 Glucose<149.5 DiabetesPedigreeFunction<0.5015 DiabetesPedigreeFunction>=0.32 DiabetesPedigreeFunction<0.335 -> class 0
   Train Covering size : 3
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.8775

Result found after 5 iterations.
-------------------------------------------------
Computation of rule for sample 59 : 

Initial fidelity : 0.652529
Final fidelity : 1

Extracted rule :
BMI<26.45 BMI>=25.95 -> class 0
   Train Covering size : 8
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.962222

Result found after 2 iterations.
-------------------------------------------------
Computation of rule for sample 60 : 

Initial fidelity : 0.652529
Final fidelity : 1

Extracted rule :
Glucose<102.5 SkinThickness<19.5 Age>=36.5 -> class 0


No decision threshold is used.

Rule for sample 0 :

Glucose>=158.5 Glucose<160.5 -> class 1
   Train Covering size : 2
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.833333

-------------------------------------------------

Rule for sample 1 :

Glucose<89.5 BloodPressure>=82.5 -> class 0
   Train Covering size : 4
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.904
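
A rule line can be parsed and checked against a sample with a few lines of Python. This is a sketch under assumptions: it handles only the `>=` and `<` operators that appear in the output above, it expects attribute names without spaces, and the sample values are made up for illustration. `parse_rule` and `rule_covers` are hypothetical helpers, not part of dimlpfidex.

```python
import operator
import re

OPS = {">=": operator.ge, "<": operator.lt}
ANTECEDENT = re.compile(r"^(\w+)(>=|<)(-?\d+(?:\.\d+)?)$")


def parse_rule(line):
    """Split 'A>=1 B<2 -> class 1' into ([(attribute, operator, threshold), ...], class)."""
    left, right = line.split("->")
    antecedents = []
    for token in left.split():
        name, op, threshold = ANTECEDENT.match(token).groups()
        antecedents.append((name, op, float(threshold)))
    return antecedents, int(right.split()[-1])


def rule_covers(antecedents, sample):
    """True if the sample satisfies every antecedent of the rule."""
    return all(OPS[op](sample[name], threshold) for name, op, threshold in antecedents)


# Rule extracted for sample 1 above
antecedents, predicted = parse_rule("Glucose<89.5 BloodPressure>=82.5 -> class 0")
inside = rule_covers(antecedents, {"Glucose": 85, "BloodPressure": 90})   # hypothetical sample
outside = rule_covers(antecedents, {"Glucose": 85, "BloodPressure": 66})  # hypothetical sample
print(antecedents, predicted, inside, outside)
```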



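# Global rules with FidexGlo

Instead of one rule per test sample, `fidexGloRules` builds a global rule set from the training data and the model's training predictions.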

In [13]:
global_rules_out_file = "FidexGlo_RF_global_rules_out.rls"

In [14]:
fidexGloRules(f"""
--root_folder {root_dir}
--train_data_file {train_data_file}
--train_class_file {train_class_file}
--train_pred_file {train_pred_file}
--rules_file {rules_file}
--global_rules_outfile {global_rules_out_file}
--heuristic 1
--attributes_file {features_filename}
--nb_attributes {nb_attributes}
--nb_classes {nb_classes}
""")

Parameters list:


0

 - train_data_file                                                       notebooks/rf/train_dataset.txt
 - train_pred_file                                                           notebooks/rf/predTrain.out
 - train_class_file                                                        notebooks/rf/train_class.txt
 - rules_file                                                                 notebooks/rf/RF_rules.rls
 - global_rules_outfile                                   notebooks/rf/FidexGlo_RF_global_rules_out.rls
 - root_folder                                                                             notebooks/rf
 - attributes_file                                                          notebooks/rf/attributes.txt
 - nb_attributes                                                                                      8
 - nb_classes                                                                                         2
 - nb_quant_levels                                              

## Rules and sample covering

Here we can see how many training samples each rule covers.

In [15]:
!head -n 33 ./notebooks/rf/FidexGlo_RF_global_rules_out.rls

Number of rules : 151, mean sample covering number per rule : 11.112583, mean number of antecedents per rule : 3.099338
No decision threshold is used.

Rule 1: Glucose<104.5 BMI<28.8 -> class 0
   Train Covering size : 81
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.978148

Rule 2: Glucose<99.5 SkinThickness<21.5 -> class 0
   Train Covering size : 76
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.963816

Rule 3: Age<22.5 Glucose<111.5 -> class 0
   Train Covering size : 61
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.97541

Rule 4: BMI<26.45 Glucose<124.5 BMI>=22.95 -> class 0
   Train Covering size : 57
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.972281

Rule 5: Glucose<107.5 Age<23.5 DiabetesPedigreeFunction<0.6565 Insulin<100.5 -> class 0
   Train Covering size : 51
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.979412
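
To work with these statistics programmatically, the rule blocks can be parsed into dictionaries. The sketch below assumes the file keeps the layout shown above (a `Rule N:` line followed by indented `Train ...` metrics); `parse_global_rules` is a hypothetical helper, not part of dimlpfidex.

```python
import re

RULE_HEADER = re.compile(r"^Rule (\d+): (.+) -> class (\d+)\s*$")
METRIC = re.compile(r"^\s*Train ([A-Za-z ]+?)\s*:\s*([0-9.]+)\s*$")


def parse_global_rules(text):
    """Return one dict per rule with its id, antecedents, class and training metrics."""
    rules = []
    for line in text.splitlines():
        header = RULE_HEADER.match(line)
        if header:
            rules.append({"id": int(header.group(1)),
                          "rule": header.group(2),
                          "class": int(header.group(3))})
            continue
        metric = METRIC.match(line)
        if metric and rules:
            # e.g. "Covering size" becomes the key "covering size"
            rules[-1][metric.group(1).lower()] = float(metric.group(2))
    return rules


# First two rules copied from the output above
text = """Rule 1: Glucose<104.5 BMI<28.8 -> class 0
   Train Covering size : 81
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.978148

Rule 2: Glucose<99.5 SkinThickness<21.5 -> class 0
   Train Covering size : 76
   Train Fidelity : 1
   Train Accuracy : 1
   Train Confidence : 0.963816
"""
rules = parse_global_rules(text)
for r in rules:
    print(r["id"], r["rule"], r["covering size"], r["confidence"])
```

Note that several rules can cover the same sample, so summing the covering sizes does not give the number of distinct samples covered.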



# Conclusion


TODO