<!-- Author: Moein E. Samadi <moein.samadi@rwth-aachen.de> -->

# Usage example of NoiseCut

Here, we present a usage example of `NoiseCut` within the context of a binary classification task. To illustrate this, we employ a synthetic dataset that has been generated following the guidelines outlined in the `Generation_of_synthetic_data.ipynb` notebook.

In [1]:
import pandas as pd

from noisecut.model.noisecut_coder import Metric
from noisecut.model.noisecut_model import NoiseCut
from noisecut.tree_structured.data_manipulator import DataManipulator

### 1. Set training and test sets

Assign `X` as the features and `Y` as the labels.

In [2]:
input_file = "../data/7D_synthetic_data_manual"

data = pd.read_csv(
    input_file,
    delimiter="    ",
    header=None,
    skiprows=1,
    engine="python",
)
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]

To randomly sample the training and test sets, you can use the built-in `split_data` function of the `DataManipulator` class.
If you work with a synthetic dataset, as in this example, you can also add noise to the labels with the `get_noisy_data` function of the `DataManipulator` class.

In [3]:
Training_set_size = 50  # Percentage of the data used for training
Noise_intensity = 5  # Percentage of labels toggled from 0 to 1, or vice versa

manipulator = DataManipulator()
x_noisy, y_noisy = manipulator.get_noisy_data(
    X,
    Y,
    percentage_noise=Noise_intensity,
)
x_train, y_train, x_test, y_test = manipulator.split_data(
    x_noisy,
    y_noisy,
    percentage_training_data=Training_set_size,
)
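
To make the effect of the noise percentage concrete, the following standalone sketch toggles a given percentage of binary labels at random. It is not the implementation of `get_noisy_data`; the helper name `flip_labels` and its `seed` parameter are hypothetical and exist only for this illustration.

```python
import random


def flip_labels(labels, percentage_noise, seed=0):
    """Return a copy of `labels` with `percentage_noise` percent toggled.

    Hypothetical helper for illustration only; not part of NoiseCut.
    """
    rng = random.Random(seed)
    n_flip = round(len(labels) * percentage_noise / 100)
    # Pick distinct positions so that exactly n_flip labels change.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


clean = [0, 1] * 50  # 100 synthetic binary labels
noisy = flip_labels(clean, percentage_noise=5)
n_changed = sum(c != n for c, n in zip(clean, noisy))
print(n_changed)  # 5 of the 100 labels were toggled
```

With a noise level of 5 percent and 100 labels, exactly five labels differ from the clean version, whichever positions are drawn.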

### 2. Fitting the model

To fit the training set to the hybrid model, use the `NoiseCut` class. To instantiate an object of this class, provide an array `n_input_each_box` that describes the tree structure of the hybrid model. The first element of `n_input_each_box` is the number of input features of the first black box in the first layer, which is `3` for the synthetic data generated in the `Generation_of_synthetic_data.ipynb` notebook. The second element is the number of input features of the second first-layer black box, which is `2` here, and so on for the remaining elements.

The model can then be fitted with the `fit` function of the `NoiseCut` class.

In [4]:
mdl = NoiseCut(n_input_each_box=[3, 2, 2])
mdl.fit(x_train, y_train)
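
As a rough illustration of what `n_input_each_box=[3, 2, 2]` encodes, the sketch below derives which feature columns would feed each first-layer black box. It assumes that features are assigned to boxes as consecutive column blocks in the order given, and that the second-layer black box receives one binary input per first-layer box. Both are assumptions made for this example, and the variable names are hypothetical.

```python
from itertools import accumulate

n_input_each_box = [3, 2, 2]

# Assumption: box i receives the next n_input_each_box[i] consecutive columns.
ends = list(accumulate(n_input_each_box))
starts = [0] + ends[:-1]
columns_per_box = [list(range(s, e)) for s, e in zip(starts, ends)]

n_features = sum(n_input_each_box)
# A binary function of k binary inputs has a truth table with 2**k entries.
table_sizes = [2**k for k in n_input_each_box]
# Assumption: the second-layer box takes one input per first-layer box.
second_layer_table_size = 2 ** len(n_input_each_box)

print(columns_per_box)  # [[0, 1, 2], [3, 4], [5, 6]]
print(n_features, table_sizes, second_layer_table_size)  # 7 [8, 4, 4] 8
```

The sum of the elements must equal the number of feature columns, which is seven for the synthetic dataset used here.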

### 3. Evaluation

The performance of the NoiseCut algorithm can be evaluated on the test set by passing it to the `predict` function of the `NoiseCut` class.

To assess the predictions, use the `set_confusion_matrix` function of the `Metric` class. It builds the confusion matrix and returns the accuracy, recall, precision, and F1 score of the predicted labels for the test set.

In [5]:
y_predicted = mdl.predict(x_test)

accuracy, recall, precision, F1 = Metric.set_confusion_matrix(
    y_test, y_predicted
)

print(
    "accuracy = {a:3.3f}, recall = {r:3.3f}, precision = {p:3.3f}, "
    "F1 = {f:3.3f}".format(a=accuracy, r=recall, p=precision, f=F1)
)

accuracy = 0.922, recall = 0.902, precision = 0.974, F1 = 0.937
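
For reference, the four scores follow from the confusion-matrix counts by their standard definitions. The sketch below computes them in plain Python on a small made-up example; the function name `binary_scores` is hypothetical, and this is not the code of the `Metric` class.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, recall, precision and F1 from binary labels (illustrative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, recall, precision, f1


# Made-up labels: 4 true positives, 2 true negatives, 1 FP, 1 FN.
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]
acc, rec, prec, f1 = binary_scores(y_true_demo, y_pred_demo)
print(f"accuracy = {acc:.3f}, recall = {rec:.3f}, "
      f"precision = {prec:.3f}, F1 = {f1:.3f}")
# accuracy = 0.750, recall = 0.800, precision = 0.800, F1 = 0.800
```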


### 4. Predictions as probability

The hybrid model can also return the probability that the label is `1` for any binary input, using the `predict_probability_of_being_1` function of the `NoiseCut` class.
You can pass a single binary input or several inputs as an array of shape (n_samples, n_features). For several inputs, you receive an array of shape (n_samples,) with one probability per input, in the same order.

In [6]:
y_pred_proba = mdl.predict_probability_of_being_1([0, 0, 0, 0, 0, 0, 0])
print(f"Prediction probability for a binary input: {y_pred_proba}")

y_pred_proba = mdl.predict_probability_of_being_1(
    [[0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0, 1]]
)
print(f"Prediction probability for two binary inputs: {y_pred_proba}")

Prediction probability for a binary input: 1.0
Prediction probability for two binary inputs: [1.         0.93333333]


The `predict_probability_of_being_1` function can be applied to the complete test set in order to obtain the predicted probabilities. With these probabilities at hand, it becomes possible to calculate the area under the ROC curve.

In [7]:
from sklearn import metrics  # noqa: E402

y_pred_proba = mdl.predict_probability_of_being_1(x_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test.astype(int), y_pred_proba)
print("AUC-ROC=", metrics.auc(fpr, tpr))

AUC-ROC= 0.9178154825026511
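
One way to read the AUC value: it equals the probability that a randomly chosen positive sample receives a higher predicted probability than a randomly chosen negative one, with ties counted as one half. The sketch below computes this pairwise form in plain Python on made-up scores. The function name `auc_by_pairs` is hypothetical, and the quadratic loop is meant for illustration rather than for large test sets.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities of being 1.
labels_demo = [1, 1, 0, 1, 0, 0]
scores_demo = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc_demo = auc_by_pairs(labels_demo, scores_demo)
print(round(auc_demo, 3))  # 0.889, i.e. 8 of 9 pairs are ordered correctly
```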


### 5. Retrieved functions of the black boxes

After fitting the model, the predicted binary function of a first-layer black box can be retrieved by calling `get_binary_function_of_box` of the `NoiseCut` class. You have to pass the ID of the first-layer black box, which is a number in the range `[0, n_box-1]`.
The predicted binary function of the second-layer black box can be retrieved by calling `get_binary_function_black_box` of the `NoiseCut` class. It takes no input, as there is only one second-layer black box.

In [8]:
func_0 = mdl.get_binary_function_of_box(0)
func_1 = mdl.get_binary_function_of_box(1)
func_2 = mdl.get_binary_function_of_box(2)
func_bb = mdl.get_binary_function_black_box()
func_0

array([False, False,  True,  True,  True, False,  True, False])
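
The returned array has `2**3 = 8` entries, one per combination of the three binary inputs of box `0`. The sketch below lists it as a truth table. Which input combination each index corresponds to is not stated in this example, so the ordering used here (the index read as a binary number with the first input as the most significant bit) is an assumption that should be checked against the NoiseCut documentation.

```python
from itertools import product

# Values copied from the output above.
func_0_values = [False, False, True, True, True, False, True, False]
n_inputs = 3

# Assumption: index i corresponds to the bits of i, first input = MSB.
# product([0, 1], repeat=3) yields (0,0,0), (0,0,1), ..., (1,1,1) in that order.
truth_table = {
    bits: int(func_0_values[index])
    for index, bits in enumerate(product([0, 1], repeat=n_inputs))
}

for bits, output in truth_table.items():
    print(bits, "->", output)
```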