# Generator Example #3 - Data Driven
In this notebook we present a continuation of example #2, where time series were created.<br>
Here we instead show the functionality of our automatic data-driven method.<br>
This is done with tests on a mock artificial dataset.

In [1]:
from lib import Generator, MULTI_VARIABLES, IMPUTERS
import pandas as pd
import numpy as np
import random
import string

## Create synthetic dataset
To explore the usage of the data generator, we create a synthetic dataset containing categorical and numeric columns, each with a varying amount of missing values.

In [2]:
# 1200 random integers in [1, 10], reshaped into 200 rows x 6 columns
data = [random.randint(1,10) for _ in range(1200)]
data = pd.DataFrame(np.array(data).reshape(200,-1))

# Overwrite columns 0 and 2 with random letters so that they hold categorical values
letters = [random.choice(string.ascii_letters[:15]) for _ in range(data.shape[0])]
data[0] = letters
letters = [random.choice(string.ascii_letters[:10]) for _ in range(data.shape[0])]
data[2] = letters

# In every column, set a random 10% to 50% of the rows to missing
for column in data.columns:
    missing_percentage = random.randint(10,50)
    missing_values = random.sample(list(range(data.shape[0])), data.shape[0]*missing_percentage//100)
    data.loc[missing_values, column] = np.nan

data

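| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | k | 7.0 | e | 10.0 | NaN | NaN |
| 1 | NaN | NaN | b | 8.0 | 8.0 | 9.0 |
| 2 | NaN | 10.0 | NaN | 5.0 | 7.0 | 4.0 |
| 3 | o | 8.0 | NaN | 6.0 | 4.0 | 3.0 |
| 4 | d | NaN | b | 3.0 | 8.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | NaN | 6.0 | c | 10.0 | 3.0 | 1.0 |
| 196 | NaN | 2.0 | NaN | NaN | 7.0 | 5.0 |
| 197 | g | 8.0 | a | 2.0 | 8.0 | 4.0 |
| 198 | o | 5.0 | NaN | 1.0 | 5.0 | 1.0 |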
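
As a quick sanity check on a dataset built this way, the share of missing values per column can be computed directly with pandas. The snippet below is a minimal sketch on a small stand-in frame (`toy` is a hypothetical name, not part of the notebook); the same expression could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "a": ["x", np.nan, "y", np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```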


## Generator setup
The generator works by training a set of imputation models on the passed data, using them to completely fill in the dataset, and then randomly sampling and imputing values from the data.<br>
Both creating and training the models and imputing the values can be lengthy, so these steps are not run at initialization; instead, the user triggers them with the "setup" call.

In [3]:
gen = Generator()
gen.add_multi_variable(MULTI_VARIABLES.DataDriven.value, name="", data=data, imputer=IMPUTERS.SamplingImputer.value, as_categorical=["0","3"])

Training classification models.


100%|██████████| 60/60 [00:00<00:00, 87.95it/s]


Training regression models.


100%|██████████| 30/30 [00:02<00:00, 10.15it/s]


Filling missing values. Column 1 of 6


100%|██████████| 38/38 [00:00<00:00, 188.62it/s]


Filling missing values. Column 2 of 6


100%|██████████| 32/32 [00:01<00:00, 30.10it/s]


Filling missing values. Column 3 of 6


100%|██████████| 66/66 [00:00<00:00, 201.14it/s]


Filling missing values. Column 4 of 6


100%|██████████| 20/20 [00:00<00:00, 194.69it/s]


Filling missing values. Column 5 of 6


100%|██████████| 44/44 [00:01<00:00, 30.66it/s]


Filling missing values. Column 6 of 6


100%|██████████| 62/62 [00:01<00:00, 32.67it/s]
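
To make the "fill in, then sample" idea from this section more concrete, here is a minimal sketch under simplifying assumptions. It is not the library's implementation: `sampling_fill` is a hypothetical helper that fills each missing cell with a value drawn at random from the observed cells of the same column, whereas the generator trains classification and regression models for this step.

```python
import numpy as np
import pandas as pd

def sampling_fill(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    """Fill each column's missing cells with values drawn from its observed cells."""
    filled = df.copy()
    for col in filled.columns:
        mask = filled[col].isna()
        observed = filled.loc[~mask, col].to_numpy()
        if mask.any() and len(observed) > 0:
            filled.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return filled

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cat": ["a", np.nan, "b", "a"],
    "num": [1.0, np.nan, 3.0, np.nan],
})

# Step 1: complete the dataset.
full = sampling_fill(toy, rng)

# Step 2: draw new rows from the completed data, with replacement.
new_rows = full.sample(n=6, replace=True, random_state=0).reset_index(drop=True)
```

The input frame is copied rather than modified, which mirrors the notebook's requirement that the original data stays untouched.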


## Setup Verification
First, we need to make sure that the generator setup completed successfully.<br>
We do this by first checking that the original data was correctly stored and not altered in the setup process.<br>
Then, we look at the imputed data to see whether the process yielded the desired results and left no missing values.

In [4]:
print(gen._multi_variables[0]._data_driven_generator._full_data.isna().any().any())
gen._multi_variables[0]._data_driven_generator._full_data

False


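| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | k | 7.00000 | e | 10.0 | 5.403846 | 5.681159 |
| 1 | e | 5.60119 | b | 8.0 | 8.000000 | 9.000000 |
| 2 | a | 10.00000 | a | 5.0 | 7.000000 | 4.000000 |
| 3 | o | 8.00000 | h | 6.0 | 4.000000 | 3.000000 |
| 4 | d | 5.60119 | b | 3.0 | 8.000000 | 10.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | a | 6.00000 | c | 10.0 | 3.000000 | 1.000000 |
| 196 | a | 2.00000 | c | 5.0 | 7.000000 | 5.000000 |
| 197 | g | 8.00000 | a | 2.0 | 8.000000 | 4.000000 |
| 198 | o | 5.00000 | g | 1.0 | 5.000000 | 1.000000 |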
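
Beyond checking for missing values, it can be useful to confirm that imputation only touched cells that were missing in the first place. The following is a minimal sketch on toy frames; `observed_cells_preserved` is a hypothetical helper, not part of the library.

```python
import numpy as np
import pandas as pd

def observed_cells_preserved(original: pd.DataFrame, filled: pd.DataFrame) -> bool:
    """True if every non-missing cell of `original` has the same value in `filled`."""
    observed = original.notna()
    # Compare only where the original had a value; missing cells count as matching.
    same = (original == filled) | ~observed
    return bool(same.all().all())

original = pd.DataFrame({"c": ["k", np.nan], "n": [7.0, np.nan]})
filled_ok = pd.DataFrame({"c": ["k", "e"], "n": [7.0, 5.6]})
filled_bad = pd.DataFrame({"c": ["z", "e"], "n": [7.0, 5.6]})

ok_result = observed_cells_preserved(original, filled_ok)
bad_result = observed_cells_preserved(original, filled_bad)
```

Here `filled_bad` changes an observed value (`k` became `z`), so the check fails for it.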


## Generate
The core purpose of the data generator is to produce new entries, in larger amounts and with more diverse combinations.<br>
Any number of entries can be produced, with the only cost being the memory and time used by the computation, which makes the generator suitable for data streaming purposes.

In [5]:
x = gen.generate(20)

Filling missing values. Column 1 of 6


100%|██████████| 6/6 [00:00<00:00, 171.89it/s]

Filling missing values. Column 2 of 6



100%|██████████| 10/10 [00:00<00:00, 24.52it/s]


Filling missing values. Column 3 of 6


100%|██████████| 13/13 [00:00<00:00, 206.91it/s]


Filling missing values. Column 4 of 6


100%|██████████| 10/10 [00:00<00:00, 192.82it/s]


Filling missing values. Column 5 of 6


100%|██████████| 12/12 [00:00<00:00, 25.55it/s]


Filling missing values. Column 6 of 6


100%|██████████| 9/9 [00:00<00:00, 25.14it/s]
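
For the streaming use case mentioned in this section, batches can be produced lazily with a Python generator function. The sketch below assumes only that a callable returns a DataFrame of `n` rows, as `gen.generate(20)` does above; `stream_batches` and `fake_generate` are hypothetical names introduced so the example runs on its own.

```python
from typing import Callable, Iterator

import pandas as pd

def stream_batches(generate: Callable[[int], pd.DataFrame],
                   batch_size: int, n_batches: int) -> Iterator[pd.DataFrame]:
    """Yield `n_batches` frames of `batch_size` rows each, one at a time."""
    for _ in range(n_batches):
        yield generate(batch_size)

# Stand-in for `gen.generate`, used only to keep this sketch self-contained.
def fake_generate(n: int) -> pd.DataFrame:
    return pd.DataFrame({"value": range(n)})

# Only one batch is held in memory at a time while iterating.
total_rows = sum(len(batch) for batch in stream_batches(fake_generate, 20, 3))
```

In the notebook, `stream_batches(gen.generate, 20, 3)` would be the analogous call, assuming `generate` can be invoked repeatedly.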


## Data integrity
Here we check that the original stored data was not modified during the setup and generation processes.

In [6]:
gen._multi_variables[0]._data_driven_generator.data

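| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | k | 7.0 | e | 10.0 | NaN | NaN |
| 1 | NaN | NaN | b | 8.0 | 8.0 | 9.0 |
| 2 | NaN | 10.0 | NaN | 5.0 | 7.0 | 4.0 |
| 3 | o | 8.0 | NaN | 6.0 | 4.0 | 3.0 |
| 4 | d | NaN | b | 3.0 | 8.0 | 10.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | NaN | 6.0 | c | 10.0 | 3.0 | 1.0 |
| 196 | NaN | 2.0 | NaN | NaN | 7.0 | 5.0 |
| 197 | g | 8.0 | a | 2.0 | 8.0 | 4.0 |
| 198 | o | 5.0 | NaN | 1.0 | 5.0 | 1.0 |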
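
Visual inspection can miss small changes, so a programmatic comparison against a snapshot taken beforehand is a useful complement. This is a minimal sketch on a toy frame; the names `original` and `snapshot` are illustrative. `DataFrame.equals` treats missing values in the same position as equal, which suits data that contains NaNs.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"c": ["k", np.nan, "o"], "n": [7.0, np.nan, 8.0]})

# Deep copy taken before any processing, kept as the reference.
snapshot = original.copy(deep=True)

# A non-mutating operation: fillna returns a new frame by default.
processed = original.fillna({"n": 0.0})

# NaNs in matching positions compare as equal under equals().
unchanged = original.equals(snapshot)

# Simulate an accidental in-place modification on a separate copy.
tampered = original.copy(deep=True)
tampered.loc[0, "n"] = 1.0
detected = not tampered.equals(snapshot)
```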


## Data verification
Below, we first verify that data generation completed successfully, leaving no missing values.<br>
Then we look at a few of the generated entries to check that they are consistent with existing values and expectations.

In [7]:
print(x.isna().any().any())
x

False


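| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | h | 5.60119 | a | 1.0 | 6.475962 | 9.0 |
| 1 | a | 6.720238 | d | 10.0 | 6.475962 | 5.681159 |
| 2 | o | 6.720238 | c | 9.0 | 6.475962 | 4.85112 |
| 3 | g | 5.0 | h | 10.0 | 6.475962 | 4.85112 |
| 4 | a | 6.720238 | a | 6.0 | 5.403846 | 4.85112 |
| 5 | e | 6.720238 | h | 10.0 | 10.0 | 7.0 |
| 6 | a | 6.0 | b | 1.0 | 6.475962 | 4.85112 |
| 7 | f | 3.0 | f | 4.0 | 6.475962 | 2.0 |
| 8 | c | 9.0 | c | 1.0 | 3.0 | 4.85112 |
| 9 | f | 6.720238 | b | 1.0 | 8.0 | 8.0 |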


Because the generation process is intentionally random, some generated entries may be repeated. The number of repeats grows as the complexity of the original data decreases and as the number of generated rows increases.<br>
Below we check whether the number of repeated entries is within acceptable limits.

In [8]:
x.groupby(x.columns.tolist(),as_index=False).size()

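| | 0 | 1 | 2 | 3 | 4 | 5 | size |
|---|---|---|---|---|---|---|---|
| 0 | a | 5.60119 | a | 1.0 | 6.475962 | 4.85112 | 1 |
| 1 | a | 6.0 | b | 1.0 | 6.475962 | 4.85112 | 1 |
| 2 | a | 6.720238 | a | 6.0 | 5.403846 | 4.85112 | 1 |
| 3 | a | 6.720238 | b | 1.0 | 3.0 | 2.0 | 1 |
| 4 | a | 6.720238 | d | 10.0 | 6.475962 | 5.681159 | 1 |
| 5 | c | 6.720238 | b | 1.0 | 6.475962 | 8.0 | 1 |
| 6 | c | 9.0 | c | 1.0 | 3.0 | 4.85112 | 1 |
| 7 | e | 6.720238 | h | 10.0 | 10.0 | 7.0 | 1 |
| 8 | f | 3.0 | f | 4.0 | 6.475962 | 2.0 | 1 |
| 9 | f | 5.0 | h | 2.0 | 8.0 | 3.0 | 1 |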
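
To turn the repeat check into a single number, the share of repeated rows can be computed with `DataFrame.duplicated`. The snippet below is a minimal sketch on a toy frame; `generated` is an illustrative name, and what counts as an acceptable share is left to the user.

```python
import pandas as pd

generated = pd.DataFrame({
    "c": ["a", "a", "b", "a"],
    "n": [1.0, 1.0, 2.0, 1.0],
})

# duplicated() flags every repeat of an earlier row; first occurrences are not flagged.
n_repeats = int(generated.duplicated().sum())
repeat_share = n_repeats / len(generated)
print(f"{n_repeats} repeated rows ({repeat_share:.0%})")
```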
