<!-- Author: Moein E. Samadi <moein.samadi@rwth-aachen.de> -->

# Synthetic data generation

We generated synthetic data sets to benchmark the binary classification performance of NoiseCut against other machine learning classifiers. 
The synthetic data sets were created such that the information flow from binary-represented input data $\mathbf{x} \in \{0,1\}^n$ to binary outputs (labels) $y \in \{0,1\}$ conforms to a tree-structured network, as illustrated in the figure below:

<p style="text-align: center">
  <img src="../artwork/Synthetic_data_structure.png" width="500" height="350" align="center">
</p>

Figure 1: A schematic representation of the information flow from binary-represented input data to binary labels. This procedure was used to generate the synthetic data.

Figure 1 illustrates a simple example of the labeling procedure in the synthetic data sets. We assumed a tree-structured network $\mathcal{F}: \{0,1\}^7 \longmapsto  \{0,1\}$ mapping binary variables $\mathbf{x}$ to binary labels $y$:

$$
    y  = \mathcal{F}(\mathbf{x}) \;\;,\;\; \mathbf{x} \in \{0,1\}^7 \;\;,\;\; y \in \{0,1\}.
$$

In the network of Figure 1, there are three first-layer boxes $\mathrm{F_1}: \{0,1\}^3 \longmapsto  \{0,1\}$, $\mathrm{F_2}: \{0,1\}^2 \longmapsto  \{0,1\}$, and $\mathrm{F_3}: \{0,1\}^2 \longmapsto  \{0,1\}$ that separately perform computations on disjoint subsets of the input features. The input/output functions of the first-layer boxes in Figure 1 are:

$$
    \mathrm{F_1}: \{000, 100, 010, 110, 001, 101, 011, 111\} \longmapsto \{0, 0, 1, 1, 1, 0, 1, 0\}, \\
    \mathrm{F_2}: \{00, 10, 01, 11\} \longmapsto \{1, 0, 1, 1\}, \\
    \mathrm{F_3}: \{00, 10, 01, 11\} \longmapsto \{1, 1, 0, 0\}.
$$

For instance, when we feed $\mathbf{x}^\prime = \{0, 1, 0, 1, 0, 0, 1\}$ into the network, the three first-layer boxes receive $010$, $10$, and $01$. According to the tables above, they return $\{1, 0, 0\}$, which is then forwarded to the output-box $\mathrm{F_O}: \{0,1\}^3 \longmapsto  \{0,1\}$ with the following input/output function:
$$
    \mathrm{F_O}: \{000, 100, 010, 110, 001, 101, 011, 111\} \longmapsto \{1, 0, 0, 0, 0, 1, 1, 0\}.
$$
Finally, the output-box returns the generated label, here $y^\prime=0$, for the entered input $\mathbf{x}^\prime$ to the network.
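
The labeling procedure of Figure 1 can be sketched in a few lines of plain Python. This is only an illustration, not NoiseCut code: the helper names `lookup` and `network_label` are hypothetical, and the sketch assumes that the first bit of each listed pattern (`000, 100, 010, ...`) belongs to the first input of a box, so that the first input varies fastest.

```python
# Truth tables of Figure 1, listed in the order 000, 100, 010, 110, ...
# Assumption: the first listed bit belongs to the first input of a box,
# so an input (b0, b1, b2) selects entry b0 + 2*b1 + 4*b2.
F1 = [0, 0, 1, 1, 1, 0, 1, 0]
F2 = [1, 0, 1, 1]
F3 = [1, 1, 0, 0]
FO = [1, 0, 0, 0, 0, 1, 1, 0]


def lookup(table, bits):
    """Return the table entry selected by a sequence of input bits."""
    index = sum(bit << position for position, bit in enumerate(bits))
    return table[index]


def network_label(x):
    """Evaluate the tree-structured network of Figure 1 on a 7-bit input."""
    first_layer = (
        lookup(F1, x[0:3]),  # box 1 reads features 1-3
        lookup(F2, x[3:5]),  # box 2 reads features 4-5
        lookup(F3, x[5:7]),  # box 3 reads features 6-7
    )
    return lookup(FO, first_layer)


x_prime = (0, 1, 0, 1, 0, 0, 1)
print(network_label(x_prime))  # prints 0
```

Storing each box as a lookup table keeps the sketch close to the notation above: a box with $k$ inputs is fully described by $2^k$ output bits.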

## Generating tree-structured data through randomly defined functions
NoiseCut can generate tree-structured synthetic data with an arbitrary number of first-layer boxes and one output-box. The function of each black box can be assigned randomly or set manually.

To generate a tree-structured synthetic dataset whose black boxes have randomly assigned functions, use the `SampleGenerator` class.

To instantiate an object of this class, pass an array giving the number of input features of each first-layer black box. The first element is the number of input features of the first black box, the second element is that of the second black box, and so on. The length of the array therefore equals the number of first-layer black boxes, which is `3` in the example below. If you set `allowance_rand=True`, all functions are assigned randomly when the object is instantiated.

In [1]:
from noisecut.tree_structured.sample_generator import SampleGenerator

gen_dataset = SampleGenerator([3, 2, 2], allowance_rand=True)
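
To make the meaning of the array `[3, 2, 2]` concrete, the following sketch derives the number of boxes, the total number of features, and the feature range read by each box. It does not use NoiseCut; the variable names are hypothetical, and it assumes that features are assigned to the boxes consecutively (features 1-3, then 4-5, then 6-7).

```python
from itertools import accumulate

n_input_each_box = [3, 2, 2]  # same array as passed to SampleGenerator above

n_boxes = len(n_input_each_box)     # number of first-layer black boxes
n_features = sum(n_input_each_box)  # total number of binary input features

# Assumed layout: consecutive feature ranges, one per first-layer box.
stops = list(accumulate(n_input_each_box))
starts = [0] + stops[:-1]
feature_slices = [slice(a, b) for a, b in zip(starts, stops)]

x = [0, 1, 0, 1, 0, 0, 1]
parts = [x[s] for s in feature_slices]
print(n_boxes, n_features, parts)  # 3 7 [[0, 1, 0], [1, 0], [0, 1]]
```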

To construct the dataset for the randomly generated model, call the `get_complete_data_set` method of the `SampleGenerator` object.

In [2]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set()

If you pass a path through the `file_name` argument of `get_complete_data_set`, the dataset is also stored in a file with that name at the given location.

In [3]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set(
    file_name="../data/7D_synthetic_data_random"
)
print("Generated binary labels:", "\n", y_gen_dataset.astype(int))

Generated binary labels: 
 [1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0
 1 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1
 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0
 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1]
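
The output above contains 128 labels, one for each of the $2^7$ possible inputs. The sketch below enumerates such a complete binary input space with the standard library only. The row order, with the first feature varying fastest, is an assumption made for illustration and is not taken from the NoiseCut documentation.

```python
from itertools import product

n_features = 7

# product() varies the last position fastest, so each tuple is reversed to
# make the first feature vary fastest (assumed row order).
rows = [bits[::-1] for bits in product((0, 1), repeat=n_features)]

print(len(rows))  # 128
print(rows[1])    # (1, 0, 0, 0, 0, 0, 0)
```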


The randomly assigned binary function of a first-layer black box can be retrieved with the `get_binary_function_of_box` method of the `SampleGenerator` class. It takes the ID of the first-layer black box, an integer in the range `[0, n_box-1]`.
The randomly assigned binary function of the output-box can be retrieved with the `get_binary_function_black_box` method. It takes no argument, as there is only one output-box in the network.

In [4]:
func_0 = gen_dataset.get_binary_function_of_box(0)
func_1 = gen_dataset.get_binary_function_of_box(1)
func_2 = gen_dataset.get_binary_function_of_box(2)
func_bb = gen_dataset.get_binary_function_black_box()
print("The function of the output-box:", "\n", func_bb)

The function of the output-box: 
 [False  True  True False  True  True False  True]
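
A function returned this way is a flat array with one entry per input pattern. The following sketch pairs each entry with its pattern, assuming the same order as the $\mathrm{F_O}$ table earlier in this document (`000, 100, 010, ...`, first input varying fastest). The helper `truth_table` is hypothetical, and the values are copied from the output above into a plain Python list.

```python
# Output-box function as printed above, written as a plain Python list.
func_bb = [False, True, True, False, True, True, False, True]


def truth_table(func, n_inputs):
    """Pair each entry of `func` with its input pattern (first input fastest)."""
    table = []
    for index, value in enumerate(func):
        bits = tuple((index >> k) & 1 for k in range(n_inputs))
        table.append((bits, int(value)))
    return table


for bits, value in truth_table(func_bb, 3):
    print(bits, "->", value)
```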


You can also print the functions of all first-layer black boxes together with the function of the output-box in one step by calling `gen_dataset.print_binary_function_model()`.

In [5]:
gen_dataset.print_binary_function_model()

Function Box1
([feature_1, feature_2, feature_3]: Binary Output) ->
([0 0 0]: 1), ([1 0 0]: 1), ([0 1 0]: 1), ([1 1 0]: 0), ([0 0 1]: 1), ([1 0 1]: 0), ([0 1 1]: 1), ([1 1 1]: 1)
Function Box2
([feature_4, feature_5]: Binary Output) ->
([0 0]: 1), ([1 0]: 1), ([0 1]: 0), ([1 1]: 0)
Function Box3
([feature_6, feature_7]: Binary Output) ->
([0 0]: 1), ([1 0]: 0), ([0 1]: 1), ([1 1]: 0)
Function Black Box
([Output_box_1, Output_box_2, Output_box_3]: Binary Output) ->
([0 0 0]: 0), ([1 0 0]: 1), ([0 1 0]: 1), ([1 1 0]: 0), ([0 0 1]: 1), ([1 0 1]: 1), ([0 1 1]: 0), ([1 1 1]: 1)
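
As a consistency check, the labels can be recomputed from the tables printed above without NoiseCut. The sketch below copies the four tables by hand and assumes that rows are ordered with the first feature varying fastest. Under these assumptions, the result appears to agree with the `Generated binary labels` output shown earlier. The helper names are hypothetical.

```python
from itertools import product

# Truth tables copied from the printed model above (first feature fastest).
box_tables = [
    [1, 1, 1, 0, 1, 0, 1, 1],  # Function Box1, features 1-3
    [1, 1, 0, 0],              # Function Box2, features 4-5
    [1, 0, 1, 0],              # Function Box3, features 6-7
]
output_table = [0, 1, 1, 0, 1, 1, 0, 1]  # Function Black Box
n_inputs = [3, 2, 2]


def lookup(table, bits):
    """Select a table entry; the first bit is the least significant one."""
    return table[sum(bit << k for k, bit in enumerate(bits))]


def label(x):
    """Label of one 7-bit input under the printed model."""
    outputs, start = [], 0
    for table, width in zip(box_tables, n_inputs):
        outputs.append(lookup(table, x[start:start + width]))
        start += width
    return lookup(output_table, outputs)


# Assumed row order: all 2**7 inputs with the first feature varying fastest.
rows = [bits[::-1] for bits in product((0, 1), repeat=7)]
labels = [label(x) for x in rows]
print(labels[:16])  # [1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1]
```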


## Generating tree-structured data by setting functions manually

As in the random case, first import the `SampleGenerator` class and instantiate an object of it, this time with `allowance_rand=False` so that no function is assigned automatically.

In [6]:
from noisecut.tree_structured.sample_generator import SampleGenerator

gen_dataset = SampleGenerator([3, 2, 2], allowance_rand=False)

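To set the functions manually, use the `set_binary_function_of_box` method of the `SampleGenerator` class. Its arguments are the ID of the first-layer black box and the desired binary function of that box. The function of the output-box is set with `set_binary_function_black_box`. In the example below, the three first-layer boxes receive the functions $\mathrm{F_1}$, $\mathrm{F_2}$, and $\mathrm{F_3}$ of Figure 1. Note that the output-box function passed here, `[0, 0, 1, 1, 1, 0, 1, 0]`, differs from the $\mathrm{F_O}$ table listed earlier; to reproduce Figure 1 exactly, pass `[1, 0, 0, 0, 0, 1, 1, 0]` instead.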

In [7]:
gen_dataset.set_binary_function_of_box(0, [0, 0, 1, 1, 1, 0, 1, 0])
gen_dataset.set_binary_function_of_box(1, [1, 0, 1, 1])
gen_dataset.set_binary_function_of_box(2, [1, 1, 0, 0])
gen_dataset.set_binary_function_black_box([0, 0, 1, 1, 1, 0, 1, 0])
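
When tables are typed by hand, a size mismatch is easy to introduce: a box with $k$ inputs needs exactly $2^k$ entries, and the output-box needs $2^{m}$ entries for $m$ first-layer boxes. The following pre-check is a hypothetical helper, not a NoiseCut function, and only sketches how such a validation could look.

```python
def validate_tables(n_input_each_box, box_tables, output_table):
    """Hypothetical pre-check (not a NoiseCut function) for manually set tables."""
    expected = [2**n for n in n_input_each_box]
    expected.append(2 ** len(n_input_each_box))  # output-box: one input per box
    tables = list(box_tables) + [output_table]
    if len(tables) != len(expected):
        raise ValueError("one table per first-layer box plus one output table")
    for table, size in zip(tables, expected):
        if len(table) != size:
            raise ValueError(f"expected {size} entries, got {len(table)}")
        if any(value not in (0, 1) for value in table):
            raise ValueError("table entries must be 0 or 1")
    return True


ok = validate_tables(
    [3, 2, 2],
    [[0, 0, 1, 1, 1, 0, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0]],
    [0, 0, 1, 1, 1, 0, 1, 0],
)
print(ok)  # True
```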

After setting all functions of the black boxes, you can check that the network contains no redundant black box by calling the `has_synthetic_example_functionality` method of the `SampleGenerator` class. If it returns `False`, change the functions of the black boxes and check again. This test helps you create a non-reducible tree-structured dataset in which every black box contributes to the labels.

In [8]:
gen_dataset.has_synthetic_example_functionality()

True
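
The exact criterion behind `has_synthetic_example_functionality` is not described here. As a rough illustration of what a redundancy check could test, the sketch below treats a first-layer box as idle if its function is constant or if the output-box ignores its output. This is an assumption for illustration, not the library's implementation, and both helper names are hypothetical.

```python
def depends_on_input(table, n_inputs, k):
    """True if flipping input k changes the output for at least one pattern."""
    step = 1 << k  # index offset caused by flipping input k
    return any(table[i] != table[i ^ step] for i in range(2**n_inputs))


def has_no_idle_box(box_tables, output_table):
    """Hypothetical check: each box is non-constant and used by the output-box."""
    n_boxes = len(box_tables)
    for k, table in enumerate(box_tables):
        if len(set(table)) < 2:  # constant box carries no information
            return False
        if not depends_on_input(output_table, n_boxes, k):
            return False  # output-box never reacts to this box
    return True


# Tables set manually in the example above.
boxes = [[0, 0, 1, 1, 1, 0, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0]]
output = [0, 0, 1, 1, 1, 0, 1, 0]
print(has_no_idle_box(boxes, output))  # True
```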

You can also retrieve and store the complete dataset in the same way as described in the previous part.

In [9]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set()
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set(
    file_name="../data/7D_synthetic_data_manual"
)