# Exploring Random Forest and Fidex rule generation for obesity risk classification

**Introduction:**

Welcome to HES-Xplain, our interactive platform designed to facilitate the use of explainable artificial intelligence (XAI) techniques. In this use case, we dive into obesity risk classification and showcase another application of explainability techniques.

This notebook is an alternative to [`Exploring Dimlp and Fidex rule generation for breast cancer classification`](TODO). It follows a similar structure but uses a different dataset and training model to show the versatility of our explainability tools. In addition, we will cover how to pre-process a dataset that a model cannot use in its raw form and convert it into a usable one.

**Objectives:**
    1. Observe a different use case where XAI can be used.
    2. Understand how to pre-process data.
    3. Understand how to use Random Forests and Fidex.
    4. Showcase the versatility of HES-Xplain using a different dataset and training model.
    5. Provide practical insights into applying Random Forests and Fidex to obesity risk classifiers through an interactive notebook.
    6. Foster a community of XAI enthusiasts and practitioners.

**Outline:**

    1. Dataset and Problem Statement.
    2. Load and pre-process the dataset.
    3. Train the Model.
    4. Local rules generation - Fidex.
    5. Global ruleset generation - FidexGlo.
    6. Conclusion.
    7. References.

Through this use case, we aim to empower users to grasp the potential of Random Forests and Fidex as tools for transparent and interpretable classification. With HES-Xplain, we make XAI accessible, helping users build trust in their models and make informed decisions.

# Dataset and Problem Statement
The dataset we are going to use is named [obesity or CVD risk](https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster/data) and is available on [Kaggle](https://www.kaggle.com). It contains 2111 records of anonymized data concerning South American people and their food consumption habits. In this notebook, we will use this dataset to explore another common classification problem in the medical field: estimating the obesity risk a person presents according to multiple factors. These factors are present in the dataset and are described below (names are taken from the original dataset):

| **Full name**                             | **Used label** |                                                        **Values/Ranges**                                                       | **Description**                                                                     |
|-------------------------------------------|:--------------:|:------------------------------------------------------------------------------------------------------------------------------:|-------------------------------------------------------------------------------------|
| Gender                                    |     Gender     |                                                          Male, Female                                                          | Person's biological gender                                                          |
| Age                                       |       Age      |                                                             [14:61]                                                            | Person's age in years                                                               |
| Height                                    |     Height     |                                                           [1.45:1.98]                                                          | Person's height in meters                                                           |
| Weight                                    |     Weight     |                                                            [39:173]                                                            | Person's weight in kilograms                                                        |
| Family history with overweight            |      FHWO      |                                                             yes, no                                                            | Whether at least one family member suffers or has suffered from overweight          |
| Frequent consumption of high-caloric food |      FAVC      |                                                             yes, no                                                            | Whether the person frequently consumes high-caloric food                            |
| Frequency of consumption of vegetables    |      FCVC      |                                                              [1:3]                                                             | Leveled frequency of consumption of vegetables                                      |
| Number of main meals                      |       NCP      |                                                              [1:4]                                                             | Person's number of main meals during a day                                          |
| Consumption of food between meals         |      CAEC      |                                                no, Sometimes, Frequently, Always                                               | Frequency of food consumption between main meals                                    |
| Smoker or not                             |      SMOKE     |                                                             yes, no                                                            | Whether the person smokes                                                           |
| Consumption of water daily                |      CH2O      |                                                              [1:3]                                                             | Numeric representation of water consumption frequency per day                       |
| Calories consumption monitoring           |       SCC      |                                                             yes, no                                                            | Whether the person monitors their daily calorie intake                              |
| Physical activity frequency               |       FAF      |                                                              [0:3]                                                             | Numeric representation of physical activity frequency per week                      |
| Time using technology devices             |       TUE      |                                                              [0:2]                                                             | Numeric representation of electronic devices use frequency per day                  |
| Consumption of alcohol                    |      CALC      |                                                no, Sometimes, Frequently, Always                                               | Frequency of alcohol consumption                                                    |
| Transportation used                       |     MTRANS     |                                   Public_Transportation, Automobile, Bike, Motorbike, Walking                                  | Usual means of transportation                                                       |
| Obesity level deduced                     |       OLD      | Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III | Obesity level observed according to the interpretation of the person's BMI          |

In our case, we want to train a random forest model to classify the deduced obesity level (`OLD`) from all the other features. To do so, we need to slightly modify the original dataset so that several features become digestible by the model.
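
The `OLD` label is described in the table as an interpretation of the person's BMI. As a point of reference, the sketch below computes the standard body mass index (weight in kilograms divided by the square of height in meters) for a person 1.62 m tall weighing 64 kg, values taken from the first record of the dataset. The helper name `body_mass_index` is ours, for illustration only; it is not part of the dataset or of HES-Xplain.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return the body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Height and Weight of the first record of the dataset.
bmi = body_mass_index(weight_kg=64.0, height_m=1.62)
print(f"BMI: {bmi:.1f}")  # prints "BMI: 24.4"
```

That record is labeled `Normal_Weight`, which is consistent with the commonly used normal range of 18.5 to 24.9. Note that the model is not given the BMI directly; it only sees `Height` and `Weight` among the other features.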

# Load and pre-process the dataset
Let's start by loading the CSV file containing the raw data, renaming the lengthy column labels, and looking at the first rows:

In [9]:
import pandas as pd

dataset = pd.read_csv("data/OCDDataset/ObesityDataSet.csv")
dataset.rename(
    columns={
        "family_history_with_overweight": "FHWO",
        "NObeyesdad": "OLD",
    },
    inplace=True
)
dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


You can observe a sample of the dataset above. Now, we need to convert some data to be more machine-learning friendly. To do so, let's begin by converting the features with "yes" or "no" values to their boolean representation (True and False, equivalent to 1s and 0s):

In [10]:
# Convert the "yes"/"no" columns to booleans.
strToBinDict = {"yes": True, "no": False}
for col in ["FHWO", "FAVC", "SMOKE", "SCC"]:
    # Assign the result back: an in-place replace on a selected column
    # may not update the DataFrame in recent pandas versions.
    dataset[col] = dataset[col].map(strToBinDict)
dataset.head()

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
0,Female,21.0,1.62,64.0,True,False,2.0,3.0,Sometimes,False,2.0,False,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,True,False,3.0,3.0,Sometimes,True,3.0,True,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,True,False,2.0,3.0,Sometimes,False,2.0,False,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,False,False,3.0,3.0,Sometimes,False,2.0,False,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,False,False,2.0,1.0,Sometimes,False,2.0,False,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II
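
One caveat of a dictionary-based conversion like the one above is that a value missing from the dictionary (a typo such as "Yes", for instance) can slip through unnoticed. The following sketch, run on a small hand-made column rather than on the real dataset, shows one possible guard. The helper name `map_strict` is hypothetical and is not part of pandas or HES-Xplain.

```python
import pandas as pd


def map_strict(column: pd.Series, mapping: dict) -> pd.Series:
    """Map every value of `column` through `mapping`, failing on unknown values."""
    unknown = set(column.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped values in {column.name!r}: {sorted(unknown)}")
    return column.map(mapping)


yes_no = {"yes": True, "no": False}
smoke = pd.Series(["no", "yes", "no"], name="SMOKE")  # toy data
converted = map_strict(smoke, yes_no)
print(converted.tolist())  # prints [False, True, False]
```

Failing early keeps an unexpected category from silently reaching the model as a string or a missing value.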


The `CAEC` and `CALC` columns contain values that are not digestible as they are and must be converted. As you can see, the values are:
- Always
- Frequently
- Sometimes
- no

These adjectives quantify a frequency, which means that we can translate them to a numerical scale from 0 to 1. Let's convert them like so:

| **Adjective** | **Conversion value** |
|---------------|:-----------:|
| Always        | 1.0       |
| Frequently    | 0.66      |
| Sometimes     | 0.33      |
| no            | 0.0       |

The procedure is very similar to the binary translation made earlier:

In [11]:
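# Translate the frequency adjectives to a numerical scale from 0 to 1.
adjToValDict = {"Always": 1.0, "Frequently": 0.66, "Sometimes": 0.33, "no": 0.0}
for col in ["CAEC", "CALC"]:
    # Assign the result back instead of replacing in place on a selected column.
    dataset[col] = dataset[col].map(adjToValDict)
dataset.head()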

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,OLD
0,Female,21.0,1.62,64.0,True,False,2.0,3.0,0.33,False,2.0,False,0.0,1.0,0.0,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,True,False,3.0,3.0,0.33,True,3.0,True,3.0,0.0,0.33,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,True,False,2.0,3.0,0.33,False,2.0,False,2.0,1.0,0.66,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,False,False,3.0,3.0,0.33,False,2.0,False,2.0,0.0,0.66,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,False,False,2.0,1.0,0.33,False,2.0,False,0.0,0.0,0.33,Public_Transportation,Overweight_Level_II
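
The conversion values 0.33 and 0.66 are approximately one third and two thirds, that is, the four adjectives spread evenly between 0 and 1. As a sketch of how such a scale could be derived from the ordered list of levels instead of being typed by hand, consider the helper below. The name `ordinal_scale` is an assumption made for this example.

```python
def ordinal_scale(levels: list[str], ndigits: int = 2) -> dict[str, float]:
    """Spread ordered levels evenly over [0, 1], lowest level first."""
    if len(levels) < 2:
        raise ValueError("need at least two levels")
    step = 1 / (len(levels) - 1)
    return {level: round(i * step, ndigits) for i, level in enumerate(levels)}


scale = ordinal_scale(["no", "Sometimes", "Frequently", "Always"])
print(scale)
# prints {'no': 0.0, 'Sometimes': 0.33, 'Frequently': 0.67, 'Always': 1.0}
```

Rounding yields 0.67 where the notebook uses 0.66. For a tree-based model such as a random forest this difference should not matter, since splits depend on the ordering of the values rather than on their exact spacing.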


Now we must deal with the three remaining columns that still contain non-numerical values: `Gender`, `MTRANS` and `OLD`. Their values do not quantify anything; each one simply represents a distinct option. This means that we cannot apply a scale like the one before. Instead, we are going to encode them using a technique named one-hot encoding, which creates one boolean column per possible value.

In [12]:
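# One-hot encode each categorical column (one boolean column per category).
genderCols = pd.get_dummies(dataset["Gender"], prefix="Gender", prefix_sep="_")
mtransCols = pd.get_dummies(dataset["MTRANS"], prefix="MTRANS", prefix_sep="_")
oldCols = pd.get_dummies(dataset["OLD"], prefix="OLD", prefix_sep="_")
# Columns 0-15 run from Gender to MTRANS and column 16 is OLD, so each group
# of encoded columns is placed next to its source column.
dataset = pd.concat([genderCols, dataset.iloc[:, :16], mtransCols, dataset.iloc[:, 16:], oldCols], axis=1)
# Remove the original categorical columns now that they are encoded.
dataset.drop(["Gender", "MTRANS", "OLD"], axis=1, inplace=True)
dataset.head()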

Unnamed: 0,Gender_Female,Gender_Male,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,...,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking,OLD_Insufficient_Weight,OLD_Normal_Weight,OLD_Obesity_Type_I,OLD_Obesity_Type_II,OLD_Obesity_Type_III,OLD_Overweight_Level_I,OLD_Overweight_Level_II
0,True,False,21.0,1.62,64.0,True,False,2.0,3.0,0.33,...,False,True,False,False,True,False,False,False,False,False
1,True,False,21.0,1.52,56.0,True,False,3.0,3.0,0.33,...,False,True,False,False,True,False,False,False,False,False
2,False,True,23.0,1.8,77.0,True,False,2.0,3.0,0.33,...,False,True,False,False,True,False,False,False,False,False
3,False,True,27.0,1.8,87.0,False,False,3.0,3.0,0.33,...,False,False,True,False,False,False,False,False,True,False
4,False,True,22.0,1.78,89.8,False,False,2.0,1.0,0.33,...,False,True,False,False,False,False,False,False,False,True
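
To make the encoding more concrete, here is a minimal sketch on a toy `MTRANS` column (hand-made values, not real records). It shows which columns `pd.get_dummies` creates and how the original label can be recovered from them, which can be handy when reading rules expressed on the encoded columns.

```python
import pandas as pd

toy = pd.DataFrame({"MTRANS": ["Walking", "Bike", "Walking", "Automobile"]})

# One indicator column per distinct category, named "<prefix>_<category>".
encoded = pd.get_dummies(toy["MTRANS"], prefix="MTRANS", prefix_sep="_")
print(list(encoded.columns))
# prints ['MTRANS_Automobile', 'MTRANS_Bike', 'MTRANS_Walking']

# Exactly one indicator is set per row, so the original label is the name of
# the column holding the row maximum, minus the prefix.
decoded = encoded.astype(int).idxmax(axis=1).str[len("MTRANS_"):]
print(decoded.tolist())
# prints ['Walking', 'Bike', 'Walking', 'Automobile']
```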


Now our dataset is ready to be exploited. Let's verify some of the data, just to be sure of its sanity:

In [None]:
# TODO describe etc...
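
The cell above is still a placeholder. As a minimal sketch of the kind of checks that could go there, the example below reports missing values, non-numerical columns, and whether each one-hot group has exactly one active column per row. It runs on a small stand-in frame so that it is self-contained; the helper name `sanity_report` is hypothetical, and in the notebook it would be called on `dataset` with the prefixes `Gender`, `MTRANS` and `OLD`.

```python
import pandas as pd

# Small stand-in for the pre-processed dataset (a few illustrative columns only).
sample = pd.DataFrame({
    "Gender_Female": [True, False],
    "Gender_Male": [False, True],
    "Age": [21.0, 23.0],
    "CAEC": [0.33, 0.33],
    "OLD_Normal_Weight": [True, False],
    "OLD_Overweight_Level_I": [False, True],
})


def sanity_report(df: pd.DataFrame, one_hot_prefixes: list[str]) -> dict:
    """Collect a few basic integrity checks on a pre-processed frame."""
    report = {
        "missing_values": int(df.isna().sum().sum()),
        # Boolean columns count as numeric for pandas.
        "non_numeric_columns": [
            col for col in df.columns
            if not pd.api.types.is_numeric_dtype(df[col])
        ],
    }
    for prefix in one_hot_prefixes:
        group = df.filter(regex=f"^{prefix}_")
        # A valid one-hot group has exactly one active column per row.
        report[f"{prefix}_is_one_hot"] = bool((group.sum(axis=1) == 1).all())
    return report


report = sanity_report(sample, ["Gender", "OLD"])
print(report)
print(sample.describe())  # summary statistics of the numerical columns
```

Comparing the minimum and maximum values reported by `describe()` with the ranges listed in the feature table is another quick way to spot conversion mistakes.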

# Train the Model