## Example Usage of NoiseCut Model

To use `noisecut` in a project:

### 1. Import NoiseCut Package

In [1]:
import pandas as pd
from noisecut.tree_structured.data_manipulator import DataManipulator

from noisecut.model.noisecut_coder import Metric
from noisecut.model.noisecut_model import NoiseCut

### 2. Set Train and Test Data

Assign `X` and `Y` of the dataset.

In [2]:
input_file = "10D_example"

data = pd.read_csv(
    input_file, delimiter="    ", header=None, skiprows=1, engine="python"
)
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]

To select the training and test datasets randomly, you can use the built-in `split_data` function of the `DataManipulator` class.
If you work with a synthetic dataset (as in this example), you can also make it noisy with the `get_noisy_data` function of the `DataManipulator` class.

In [3]:
manipulator = DataManipulator()
x_noisy, y_noisy = manipulator.get_noisy_data(X, Y, percentage_noise=0)
x_train, y_train, x_test, y_test = manipulator.split_data(
    x_noisy, y_noisy, percentage_training_data=20
)
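
As a rough illustration of what these two steps amount to, the sketch below flips a given percentage of binary labels and then splits the samples at random. It assumes that "noise" means flipped labels; the helper names `flip_labels` and `random_split` are hypothetical, and the code is not the `DataManipulator` implementation.

```python
import random


def flip_labels(y, percentage_noise, seed=0):
    """Return a copy of y with `percentage_noise` percent of labels flipped."""
    rng = random.Random(seed)
    n_flip = round(len(y) * percentage_noise / 100)
    flipped = list(y)
    # Pick distinct positions so that exactly n_flip labels change.
    for i in rng.sample(range(len(y)), n_flip):
        flipped[i] = 1 - flipped[i]
    return flipped


def random_split(x, y, percentage_training_data, seed=0):
    """Shuffle the sample indices and split them into training and test parts."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    rng.shuffle(indices)
    n_train = round(len(x) * percentage_training_data / 100)
    train, test = indices[:n_train], indices[n_train:]
    return (
        [x[i] for i in train],
        [y[i] for i in train],
        [x[i] for i in test],
        [y[i] for i in test],
    )


# Toy data (hypothetical): 20 samples with two binary features, label = XOR.
x_toy = [[i % 2, (i // 2) % 2] for i in range(20)]
y_toy = [row[0] ^ row[1] for row in x_toy]

y_toy_noisy = flip_labels(y_toy, percentage_noise=10)
x_tr, y_tr, x_te, y_te = random_split(x_toy, y_toy_noisy, 20)
print(len(x_tr), len(x_te))  # 4 16
```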

### 3. Fit Model

To fit the dataset to the NoiseCut model, use the `NoiseCut` class. To instantiate it, provide an array `n_input_each_box` that describes the tree structure of the hybrid model. The first element of `n_input_each_box` is the number of input features of the first first-layer black box, which is `2` in the example below; the second element is the number of input features of the second first-layer black box, which is `3`; and so on.

The model can then be fitted with the `fit` function of the `NoiseCut` class.

In [4]:
mdl = NoiseCut(n_input_each_box=[2, 3, 1, 4])
mdl.fit(x_train, y_train)
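
To make the meaning of `n_input_each_box` concrete, the sketch below computes which feature columns would feed each first-layer box. It assumes the features are assigned to the boxes contiguously, in column order, which is what the generator output later in this document suggests; `box_feature_ranges` is a hypothetical helper, not part of `noisecut`.

```python
from itertools import accumulate


def box_feature_ranges(n_input_each_box):
    """Return the (start, stop) column range of each first-layer box."""
    stops = list(accumulate(n_input_each_box))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


ranges = box_feature_ranges([2, 3, 1, 4])
print(ranges)  # [(0, 2), (2, 5), (5, 6), (6, 10)]
```

Under this assumption, the ten columns of `X` are consumed from left to right, so the sum of `n_input_each_box` has to equal the number of features.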

### 4. Predict Test Data and Examine Performance

The test dataset can be predicted by passing the test data to the `predict` function of the `NoiseCut` class.

To examine the performance of the model, you can use the built-in `set_confusion_matrix` function of the `Metric` class, which builds the confusion matrix and computes the accuracy, recall, precision, and F1 of the predicted output for the test dataset.

In [5]:
y_predicted = mdl.predict(x_test)

accuracy, recall, precision, F1 = Metric.set_confusion_matrix(
    y_test, y_predicted
)

print(
    "accuracy = {a:3.3f}, recall = {r:3.3f}, precision = {p:3.3f}, "
    "F1 = {f:3.3f}".format(a=accuracy, r=recall, p=precision, f=F1)
)

accuracy = 1.000, recall = 1.000, precision = 1.000, F1 = 1.000
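
For reference, the four metrics can be derived from the confusion matrix as in the sketch below. This uses the standard definitions for binary labels; `binary_metrics` is a hypothetical helper and is not the `Metric` implementation.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class never occurs.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, recall, precision, f1


metrics = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print("accuracy = {:.3f}, recall = {:.3f}, precision = {:.3f}, F1 = {:.3f}".format(*metrics))
```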


### 5. Useful Information about the Model

#### 5.1. Functions of the Black Boxes

After fitting the model, the predicted binary function of each first-layer black box can be obtained by calling `get_binary_function_of_box` of the `NoiseCut` class. It takes the ID of the first-layer black box as input, which is a number in the range `[0, n_box-1]`.
Moreover, the predicted binary function of the second-layer black box can be obtained by calling `get_binary_function_black_box` of the `NoiseCut` class. It needs no input, as there is only one second-layer black box.

In [6]:
func_0 = mdl.get_binary_function_of_box(0)
func_1 = mdl.get_binary_function_of_box(1)
func_2 = mdl.get_binary_function_of_box(2)
func_3 = mdl.get_binary_function_of_box(3)
func_bb = mdl.get_binary_function_black_box()

#### 5.2. Uncertainty Measure

To get further insight, `set_uncertainty_measure` reports, for each binary input to the 2nd-layer black box (as computed by the fitted functions of the 1st-layer black boxes), the number of training samples with that input whose output is 0 and the number whose output is 1, together with the resulting 'probability of being 1'. The printed result can also be written to a .csv file if you provide a path as input to the `set_uncertainty_measure` function.

Note: It can be used as a tool for estimating the uncertainty of the model's predictions.


In [7]:
mdl.set_uncertainty_measure(file_path_result="result_noisecut")

Binary input black box, number of 0, number of 1
[0 0 0 0], 10, 0, 0.0
[1 0 0 0], 0, 3, 1.0
[0 1 0 0], 17, 0, 0.0
[1 1 0 0], 5, 0, 0.0
[0 0 1 0], 9, 0, 0.0
[1 0 1 0], 0, 4, 1.0
[0 1 1 0], 12, 0, 0.0
[1 1 1 0], 10, 0, 0.0
[0 0 0 1], 0, 22, 1.0
[1 0 0 1], 8, 0, 0.0
[0 1 0 1], 29, 0, 0.0
[1 1 0 1], 0, 8, 1.0
[0 0 1 1], 16, 0, 0.0
[1 0 1 1], 6, 0, 0.0
[0 1 1 1], 0, 37, 1.0
[1 1 1 1], 8, 0, 0.0
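
The last column of the printed table is consistent with the fraction of training samples whose output is 1. A minimal sketch of that computation, assuming this reading is correct, is shown here; `probability_of_one` is a hypothetical helper and the last entry of `counts` is made-up noisy data, not a value from the table.

```python
def probability_of_one(n_zero, n_one):
    """Fraction of samples with output 1, or None if no sample was seen."""
    total = n_zero + n_one
    return n_one / total if total else None


# (number of 0, number of 1) per binary input to the 2nd-layer black box.
counts = {
    (0, 0, 0, 0): (10, 0),  # from the table
    (1, 0, 0, 0): (0, 3),   # from the table
    (0, 1, 1, 1): (0, 37),  # from the table
    (1, 1, 1, 1): (6, 2),   # hypothetical noisy counts for illustration
}
probabilities = {
    pattern: probability_of_one(n_zero, n_one)
    for pattern, (n_zero, n_one) in counts.items()
}
for pattern, probability in probabilities.items():
    print(pattern, probability)
```

With noise-free data every probability is exactly 0 or 1; with noisy labels, values in between indicate how uncertain the prediction for that pattern is.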


#### 5.3. Pseudo-Boolean Function of the 2nd-layer Black Box

You can also get the pseudo-Boolean function of the 2nd-layer black box by calling the `predict_pseudo_boolean_func_coef` function of the `NoiseCut` class. `X1`, `X2`, `X3`, and `X4` in the printed output represent the binary outputs of the first, second, third, and fourth 1st-layer black boxes, which are the inputs to the 2nd-layer black box.

In [8]:
func_coef = mdl.predict_pseudo_boolean_func_coef()

The Boolean Function is:
0.00 + 1.00 X1 + 0.00 X2 + 0.00 X3 + 1.00 X4 + -1.00 X1 X2 + 0.00 X1 X3 + -2.00 X1 X4 + -0.00 X2 X3 + -1.00 X2 X4 + -1.00 X3 X4 + -0.00 X1 X2 X3 + 3.00 X1 X2 X4 + 1.00 X1 X3 X4 + 2.00 X2 X3 X4 + -3.00 X1 X2 X3 X4 + 
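
To see how such a function is read, the sketch below evaluates the printed polynomial at a binary input. The non-zero coefficients are copied from the output above; the `terms` dictionary layout and the `evaluate` helper are hypothetical and do not reflect the format returned by `predict_pseudo_boolean_func_coef`. It assumes the positions of the input map to `X1` to `X4` in order.

```python
from math import prod

# Non-zero terms of the printed function: variable indices -> coefficient.
terms = {
    (1,): 1.0,
    (4,): 1.0,
    (1, 2): -1.0,
    (1, 4): -2.0,
    (2, 4): -1.0,
    (3, 4): -1.0,
    (1, 2, 4): 3.0,
    (1, 3, 4): 1.0,
    (2, 3, 4): 2.0,
    (1, 2, 3, 4): -3.0,
}


def evaluate(terms, x):
    """Evaluate the multilinear polynomial at a binary tuple x = (X1, ..., Xn)."""
    return sum(
        coefficient * prod(x[i - 1] for i in indices)
        for indices, coefficient in terms.items()
    )


print(evaluate(terms, (1, 0, 0, 0)))  # 1.0
print(evaluate(terms, (1, 1, 0, 1)))  # 1.0
print(evaluate(terms, (1, 1, 1, 1)))  # 0.0
```

Evaluated over all 16 inputs, this polynomial returns 1 exactly for the patterns whose probability is `1.0` in the uncertainty table of section 5.2.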


#### 5.4. Scoring

To predict a score for any binary input to the model, you can use the `predict_score` function of the `NoiseCut` class. Apart from the `x` input, which is the set of binary input data you want to score, you need to provide a second array called `vector_n_score`, which defines how the probability of being 1 is mapped to a score. The minimum possible score is 1, and the maximum depends on the length of the `vector_n_score` array. For instance, `vector_n_score = np.array([0, 0.2, 0.4, 0.6, 0.8, 1])` means that a probability between 0 and 0.2 gets score 1, a probability between 0.2 and 0.4 gets score 2, and so on. `vector_n_score` must be sorted, its first element must be 0, and its last element must be 1.

In [9]:
score_test = mdl.predict_score(
    x=x_test, vector_n_score=[0, 0.2, 0.4, 0.6, 0.8, 1]
)
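
The mapping from probability to score can be pictured as a simple binning step, as in the sketch below. `probability_to_score` is a hypothetical helper; in particular, how a probability that falls exactly on an edge is treated is an assumption of this sketch and may differ from what `predict_score` does.

```python
from bisect import bisect_right


def probability_to_score(probability, vector_n_score):
    """Map a probability in [0, 1] to a score in 1..len(vector_n_score) - 1."""
    edges = list(vector_n_score)
    if edges != sorted(edges) or edges[0] != 0 or edges[-1] != 1:
        raise ValueError("vector_n_score must be sorted, start at 0 and end at 1")
    # Only the inner edges separate bins. A value exactly on an inner edge
    # goes to the upper bin (an assumption of this sketch).
    return bisect_right(edges[1:-1], probability) + 1


edges = [0, 0.2, 0.4, 0.6, 0.8, 1]
example_scores = [probability_to_score(p, edges) for p in (0.0, 0.1, 0.5, 1.0)]
print(example_scores)  # [1, 1, 3, 5]
```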

#### 5.5. Probability of Being One for Each Score

To get the probability of being one for each score on the test data, you can use the `predict_mortality_of_each_score` function. The length of the first returned array indicates the maximum score: its first element is the probability (in percent) that test data with `score=1` is one, its second element is the probability that test data with `score=2` is one, and so on. If the value returned for a specific score is `-1`, there is no data with this score in the binary input data `x_test`.

In [10]:
output = mdl.predict_mortality_of_each_score(
    x_test=x_test,
    y_test=y_test,
    vector_n_score=[0, 0.2, 0.4, 0.6, 0.8, 1],
    print_mortality=True,
)
print(f"The output is: {output}")

Score=1 is 0.00%
Score=2: There is not any data which matches this score!
Score=3: There is not any data which matches this score!
Score=4: There is not any data which matches this score!
Score=5 is 100.00%
The output is: (array([  0.,  -1.,  -1.,  -1., 100.]), array([568,   0,   0,   0,   0]), array([  0,   0,   0,   0, 252]))
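
The first array of the output can be understood as a per-score percentage of ones. The sketch below reproduces that idea on toy data; `percentage_of_ones_per_score` is a hypothetical helper, and the toy `scores` and labels are made up rather than taken from the test set.

```python
def percentage_of_ones_per_score(scores, y, max_score):
    """For each score 1..max_score, return the percentage of samples with y == 1.

    A score with no samples gets -1, mirroring the placeholder in the output.
    """
    result = []
    for score in range(1, max_score + 1):
        labels = [label for s, label in zip(scores, y) if s == score]
        result.append(100 * sum(labels) / len(labels) if labels else -1)
    return result


toy_scores = [1, 1, 1, 5, 5]
toy_labels = [0, 0, 0, 1, 1]
per_score = percentage_of_ones_per_score(toy_scores, toy_labels, max_score=5)
print(per_score)  # [0.0, -1, -1, -1, 100.0]
```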


#### 5.6. Probability of Being One

To predict the probability of being one for any binary input to the model, you can use the `predict_probability_of_being_1` function of the `NoiseCut` class. You can pass a single binary input or several inputs as an array of shape (n_samples, n_features). If you pass more than one binary input, you receive an array of shape (n_samples,) with one probability per input, in the same order.

In [11]:
result = mdl.predict_probability_of_being_1([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(f"Result of the single binary input: {result}")

result = mdl.predict_probability_of_being_1(
    [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]]
)
print(f"Result of two binary input: {result}")

Result of the single binary input: 0.0
Result of two binary input: [0. 0.]


## Generation of Synthetic Dataset

You can generate a tree-structured dataset with an arbitrary number of black boxes in the 1st layer and a single black box in the 2nd layer. The function of each black box can be set randomly or manually.

### Random Generation of Tree-structured Dataset

To build a randomly generated, tree-structured dataset, you can use the `SampleGenerator` class. To instantiate an object of this class, you need to pass an array that gives the number of input features of each 1st-layer black box: the first element is the number of input features of the first 1st-layer black box, the second element is that of the second, and so on.
The length of the array is also the number of 1st-layer black boxes, which is `4` in the example below. If you set `allowance_rand=True`, all the functions are set randomly when the object is instantiated.

In [12]:
from noisecut.tree_structured.sample_generator import SampleGenerator  # noqa: E402

gen_dataset = SampleGenerator([2, 3, 1, 4], allowance_rand=True)

To build the dataset of the randomly generated model, you can call the `get_complete_data_set` function of the `SampleGenerator` class.

In [13]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set()
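
Conceptually, a complete dataset contains every possible binary input together with the output of the tree-structured model. The sketch below shows one way to enumerate it. The function names and the two-box toy model are hypothetical, and the table indexing (first bit varies fastest) is an assumption inferred from the printed truth tables further down, not a documented guarantee of `SampleGenerator`.

```python
from itertools import product


def table_index(bits):
    """Index into a function table, with the first bit varying fastest."""
    return sum(bit << position for position, bit in enumerate(bits))


def complete_data_set(n_input_each_box, box_functions, black_box_function):
    """Enumerate every binary input and compute the tree-structured output."""
    n_features = sum(n_input_each_box)
    x_all, y_all = [], []
    for x in product((0, 1), repeat=n_features):
        start, box_outputs = 0, []
        # Each first-layer box reads its own contiguous slice of features.
        for n_input, function in zip(n_input_each_box, box_functions):
            box_outputs.append(function[table_index(x[start:start + n_input])])
            start += n_input
        x_all.append(x)
        # The second-layer black box combines the outputs of all boxes.
        y_all.append(black_box_function[table_index(box_outputs)])
    return x_all, y_all


# Hypothetical model: box 0 is AND of two features, box 1 passes its single
# feature through, and the second-layer black box is XOR of both outputs.
x_all, y_all = complete_data_set([2, 1], [[0, 0, 0, 1], [0, 1]], [0, 1, 1, 0])
print(len(x_all))  # 8
```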

If you call the `get_complete_data_set` function with a file name as input, the result is also stored in a file with that name.

In [14]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set(
    file_name="10D_example"
)
y_gen_dataset

array([ True,  True,  True, ...,  True,  True,  True])

The randomly set binary function of each first-layer black box can be obtained by calling the `get_binary_function_of_box` function of the `SampleGenerator` class. It takes the ID of the first-layer black box as input, which is a number in the range `[0, n_box-1]`.
Moreover, the randomly set binary function of the second-layer black box can be obtained by calling `get_binary_function_black_box` of the `SampleGenerator` class. It needs no input, as there is only one second-layer black box.

In [15]:
func_0 = gen_dataset.get_binary_function_of_box(0)
func_1 = gen_dataset.get_binary_function_of_box(1)
func_2 = gen_dataset.get_binary_function_of_box(2)
func_3 = gen_dataset.get_binary_function_of_box(3)
func_bb = gen_dataset.get_binary_function_black_box()
gen_dataset.print_binary_function_model()

Function Box1
([feature_1, feature_2]: Binary Output) ->
([0 0]: 1), ([1 0]: 1), ([0 1]: 0), ([1 1]: 0)
Function Box2
([feature_3, feature_4, feature_5]: Binary Output) ->
([0 0 0]: 0), ([1 0 0]: 1), ([0 1 0]: 0), ([1 1 0]: 0), ([0 0 1]: 0), ([1 0 1]: 0), ([0 1 1]: 1), ([1 1 1]: 1)
Function Box3
([feature_6]: Binary Output) ->
([0]: 0), ([1]: 1)
Function Box4
([feature_7, feature_8, feature_9, feature_10]: Binary Output) ->
([0 0 0 0]: 1), ([1 0 0 0]: 0), ([0 1 0 0]: 0), ([1 1 0 0]: 0), ([0 0 1 0]: 0), ([1 0 1 0]: 1), ([0 1 1 0]: 1), ([1 1 1 0]: 1), ([0 0 0 1]: 0), ([1 0 0 1]: 1), ([0 1 0 1]: 0), ([1 1 0 1]: 0), ([0 0 1 1]: 0), ([1 0 1 1]: 1), ([0 1 1 1]: 1), ([1 1 1 1]: 1)
Function Black Box
([Output_box_1, Output_box_2, Output_box_3, Output_box_4]: Binary Output) ->
([0 0 0 0]: 1), ([1 0 0 0]: 0), ([0 1 0 0]: 0), ([1 1 0 0]: 1), ([0 0 1 0]: 1), ([1 0 1 0]: 1), ([0 1 1 0]: 1), ([1 1 1 0]: 0), ([0 0 0 1]: 1), ([1 0 0 1]: 1), ([0 1 0 1]: 0), ([1 1 0 1]: 1), ([0 0 1 1]: 1), ([1 0 1 1]: 0

### Generation of Tree-structured dataset by Setting Functions Manually

As with random generation, after importing the `SampleGenerator` class, you need to instantiate an object of the class.

In [16]:
from noisecut.tree_structured.sample_generator import SampleGenerator  # noqa: E402

gen_dataset = SampleGenerator([2, 3, 1])

To set the functions manually, you can use the `set_binary_function_of_box` function of the `SampleGenerator` class. Its inputs are the ID of the 1st-layer black box and the binary function of the box. The function of the 2nd-layer black box is set with `set_binary_function_black_box`.

In [17]:
gen_dataset.set_binary_function_of_box(0, [0, 0, 1, 0])
gen_dataset.set_binary_function_of_box(1, [0, 0, 1, 1, 1, 1, 1, 0])
gen_dataset.set_binary_function_of_box(2, [0, 1])
gen_dataset.set_binary_function_black_box([1, 1, 0, 0, 1, 1, 1, 1])
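
When writing a function by hand, it helps to know which list position belongs to which binary input. The sketch below expands a function list into a truth table, assuming the same ordering as the output of `print_binary_function_model` shown earlier, where the first feature varies fastest. `truth_table` is a hypothetical helper, and the ordering is an assumption that you may want to verify against your installed version.

```python
def truth_table(function):
    """Pair each table entry with its binary input, first feature fastest."""
    # A table of length 2**n describes a function of n binary inputs.
    n_input = (len(function) - 1).bit_length()
    rows = []
    for index, output in enumerate(function):
        bits = [(index >> position) & 1 for position in range(n_input)]
        rows.append((bits, output))
    return rows


table = truth_table([0, 0, 1, 0])
for bits, output in table:
    print(bits, "->", output)
```

Under this assumption, the list `[0, 0, 1, 0]` describes a box that outputs 1 only when its first feature is 0 and its second feature is 1.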

After setting the functions of all black boxes, you can check whether your dataset has functionality by calling the `has_synthetic_example_functionality` function of the `SampleGenerator` class. If it does not, you can change the functions of the black boxes and check again.

In [18]:
gen_dataset.has_synthetic_example_functionality()

False

In [19]:
gen_dataset.set_binary_function_of_box(0, [0, 0, 1, 1])
gen_dataset.set_binary_function_of_box(1, [0, 0, 1, 1, 1, 1, 1, 0])
gen_dataset.set_binary_function_of_box(2, [0, 1])
gen_dataset.set_binary_function_black_box([0, 1, 0, 0, 1, 1, 0, 1])

gen_dataset.has_synthetic_example_functionality()

True

You can also get the complete dataset in the same manner as explained in the previous part.

In [20]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set()
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set(
    file_name="6D_example"
)