# Generator Example #1 - Tabular data
This notebook covers the basic functionality of the data generator.<br>
It addresses only tabular data generation; series data and data-driven generation are covered in other notebooks.

In [10]:
from lib import Generator, VARIABLES, TRANSFORMATIONS
from sklearn.tree import DecisionTreeClassifier

## Base variables
The generator's most basic feature is adding independent base variables.<br>
Below we show this feature, and also that several variables of the same type can be created.<br>
The name is optional: if none is passed, the generator picks one automatically (here, "auto_7").<br>

In [11]:
gen = Generator()
gen.add_variable(VARIABLES.Constant.value, name="constant", value=5)
gen.add_variable(VARIABLES.Uniform.value, name="uniform", min=0, max=10)
gen.add_variable(VARIABLES.Normal.value, name="normal", mean=5, std=2)
gen.add_variable(VARIABLES.Normal.value, name="normal_2", mean=5, std=2)
gen.add_variable(VARIABLES.Exponential.value, name="exponential", beta=3)
gen.add_variable(VARIABLES.Poisson.value, name="poisson", lam=3)
gen.add_variable(VARIABLES.Poisson.value, lam=3)
gen.add_variable(VARIABLES.Sample.value, name="sample", values=["Yes","No","Maybe"], probabilities=[0.2,0.1,0.7])

In [12]:
gen.generate(10)

Unnamed: 0,constant,uniform,normal,normal_2,exponential,poisson,auto_7,sample
0,5,3.132914,5.017,2.128387,0.546729,3,2,Yes
1,5,2.2207,2.829246,6.740029,3.478759,4,3,Maybe
2,5,8.540973,2.969422,3.992184,6.180465,4,2,Yes
3,5,7.836664,4.184073,3.455307,1.522372,4,4,Yes
4,5,9.156455,2.904892,3.311794,0.825703,2,2,No
5,5,0.554596,6.280299,2.999607,0.440202,2,2,Maybe
6,5,9.328911,3.890888,4.640889,0.395671,4,2,Maybe
7,5,9.052074,5.04567,3.624051,3.725204,8,1,No
8,5,2.542271,3.648921,3.184974,0.034147,4,5,Maybe
9,5,2.883739,4.075995,7.789668,5.740704,2,3,Maybe
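
To make the column types above more concrete, the sketch below draws the same kinds of independent columns with Python's standard `random` module. It does not reproduce the generator's internals: `draw_row` and `poisson` are hypothetical helpers, and reading `beta` as the mean of the exponential distribution is an assumption.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def poisson(lam, rng):
    """Draw one Poisson sample with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_row(rng):
    """Draw one row of independent columns (hypothetical helper)."""
    return {
        "constant": 5,
        "uniform": rng.uniform(0, 10),
        "normal": rng.gauss(5, 2),
        # Assumption: beta=3 is the mean, so the rate is 1 / 3.
        "exponential": rng.expovariate(1 / 3),
        "poisson": poisson(3, rng),
        "sample": rng.choices(["Yes", "No", "Maybe"], weights=[0.2, 0.1, 0.7])[0],
    }


rows = [draw_row(rng) for _ in range(10)]
```

Each column is drawn without looking at any other column, which is what "independent" means here.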


## Removing variables
Removing a variable is even easier.<br>
Below we remove two variables that we no longer want.

In [13]:
gen.pop_variable("normal_2")
gen.pop_variable("auto_7")
gen.generate(10)

Unnamed: 0,constant,uniform,normal,exponential,poisson,sample
0,5,0.486264,5.282401,6.632216,2,Yes
1,5,0.505276,7.97948,0.108667,3,Maybe
2,5,5.447581,2.561254,2.938542,2,Maybe
3,5,6.929992,4.788081,1.755355,4,Maybe
4,5,8.657441,7.400359,3.543844,4,Maybe
5,5,2.910855,5.454713,1.564577,3,Yes
6,5,9.499768,2.421695,3.964084,0,Maybe
7,5,3.751179,5.052738,1.553121,3,Maybe
8,5,3.117421,2.419661,3.058973,2,Maybe
9,5,0.176951,2.380099,2.36106,2,Maybe


## Basic transformations
Each transformation is added to the pipeline, and transformations run in the order in which they were added.<br>
Some very basic transformations are available for reshaping the data.<br>
Duplicating one or more columns is straightforward.<br>
Columns can also be removed with a "Drop" transformation, which accepts several columns at once, instead of popping variables one by one.<br>

In [14]:
gen.add_transformation(
    TRANSFORMATIONS.Duplicate.value,
    name = "duplicate",
    in_columns = ["sample","normal"],
    out_columns = ["sample_2","normal_2"],
)
gen.add_transformation(TRANSFORMATIONS.Drop.value, name="drop", in_columns=["uniform"])
gen.generate(10)

Unnamed: 0,constant,normal,exponential,poisson,sample,sample_2,normal_2
0,5,2.65482,0.112419,3,Maybe,Maybe,2.65482
1,5,4.310016,3.567602,2,Maybe,Maybe,4.310016
2,5,4.06962,4.826209,4,Yes,Yes,4.06962
3,5,8.244661,4.5839,4,Maybe,Maybe,8.244661
4,5,5.489847,13.260128,3,Maybe,Maybe,5.489847
5,5,3.699489,0.822472,2,Maybe,Maybe,3.699489
6,5,3.71253,2.700654,2,Maybe,Maybe,3.71253
7,5,3.171598,1.310434,2,Maybe,Maybe,3.171598
8,5,7.726115,0.290031,2,Maybe,Maybe,7.726115
9,5,3.809802,4.08333,1,Maybe,Maybe,3.809802
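
The idea of an ordered pipeline can be sketched in plain Python as a list of steps applied one after another. This is only an illustration under stated assumptions: the table is a dict of column lists, and `duplicate` and `drop` are hypothetical helpers, not the generator's implementation.

```python
def duplicate(in_columns, out_columns):
    """Return a step that copies each input column to a new name."""
    def step(table):
        for src, dst in zip(in_columns, out_columns):
            table[dst] = list(table[src])
        return table
    return step


def drop(in_columns):
    """Return a step that removes the given columns."""
    def step(table):
        for name in in_columns:
            del table[name]
        return table
    return step


# Steps run in the order in which they were added to the list.
pipeline = [
    duplicate(["sample", "normal"], ["sample_2", "normal_2"]),
    drop(["uniform"]),
]

table = {"uniform": [3.1, 2.2], "normal": [5.0, 2.8], "sample": ["Yes", "Maybe"]}
for step in pipeline:
    table = step(table)

print(list(table))  # ['normal', 'sample', 'sample_2', 'normal_2']
```

Because order matters, a step can only use columns that exist when it runs: dropping "normal" before duplicating it would fail in this sketch.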


## Independent transformations
There is a set of simple transformations that each act on a single column at a time.<br>
One example is the logarithm.<br>

In [15]:
gen.add_transformation(TRANSFORMATIONS.Log.value, name="log", in_columns=["constant","normal"], out_columns=["log_constant","log_normal"], drop_input=False)
gen.generate(10)

Unnamed: 0,constant,normal,exponential,poisson,sample,sample_2,normal_2,log_constant,log_normal
0,5,6.863747,3.831897,5,Maybe,Maybe,6.863747,1.609438,1.926254
1,5,4.211476,14.821897,1,Maybe,Maybe,4.211476,1.609438,1.437813
2,5,2.628173,5.06367,3,Maybe,Maybe,2.628173,1.609438,0.966289
3,5,2.180189,2.463695,2,Maybe,Maybe,2.180189,1.609438,0.779412
4,5,4.972668,9.569771,2,Maybe,Maybe,4.972668,1.609438,1.603956
5,5,4.119539,1.2471,3,Maybe,Maybe,4.119539,1.609438,1.415741
6,5,4.464214,1.595897,2,Maybe,Maybe,4.464214,1.609438,1.496093
7,5,7.366189,0.911394,4,Maybe,Maybe,7.366189,1.609438,1.9969
8,5,4.399429,3.183553,3,Maybe,Maybe,4.399429,1.609438,1.481475
9,5,6.092,3.663462,2,Maybe,Maybe,6.092,1.609438,1.806976
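
The values in the output above are consistent with the natural logarithm. A quick check with the standard library, using the constant 5 and the first "normal" value from that table:

```python
import math

log_constant = math.log(5)        # natural logarithm of the constant column
log_normal = math.log(6.863747)   # first "normal" value in the table above

print(round(log_constant, 6))  # 1.609438
```

The second value agrees with the table's 1.926254 up to the rounding of the displayed input.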


## Remove transformations
Much like base variables, transformations that we no longer want can be easily removed.<br>

In [16]:
gen.pop_transformation("duplicate")
gen.generate(10)

Unnamed: 0,constant,normal,exponential,poisson,sample,log_constant,log_normal
0,5,5.106303,0.133393,1,No,1.609438,1.630476
1,5,2.489445,1.050661,2,Maybe,1.609438,0.91206
2,5,5.179353,1.687,5,No,1.609438,1.64468
3,5,4.459836,2.665342,1,Maybe,1.609438,1.495112
4,5,5.2474,2.18206,3,Maybe,1.609438,1.657733
5,5,3.964619,1.824083,7,Maybe,1.609438,1.37741
6,5,7.612749,5.506778,3,Maybe,1.609438,2.029824
7,5,6.626914,6.110381,0,Maybe,1.609438,1.891139
8,5,5.906622,5.204553,4,Maybe,1.609438,1.776074
9,5,4.144212,3.399255,0,Maybe,1.609438,1.421713


## Combination operations
There are also some useful combination operations, such as the sum and the product of columns.<br>
By default the combined input columns are removed; passing 'drop_input = False' keeps them.<br>

In [17]:
gen.add_transformation(
    TRANSFORMATIONS.Sum.value,
    name = "sum", 
    in_columns = ["poisson","log_constant"], 
    out_column = ["sum_poisson_log_constant"],
    drop_input = False
)
gen.add_transformation(
    TRANSFORMATIONS.Product.value,
    name = "mul",
    in_columns = ["poisson","normal"],
    out_column = ["mul_poisson_normal"],
)
gen.generate(10)

Unnamed: 0,constant,exponential,sample,log_constant,log_normal,sum_poisson_log_constant,mul_poisson_normal
0,5,1.296157,No,1.609438,1.66673,5.609438,21.179312
1,5,2.6044,Maybe,1.609438,1.508265,1.609438,0.0
2,5,1.592857,Maybe,1.609438,1.878586,4.609438,19.63274
3,5,1.43083,Maybe,1.609438,1.130969,3.609438,6.197315
4,5,1.282216,Maybe,1.609438,2.110543,2.609438,8.252721
5,5,2.722874,Maybe,1.609438,2.075263,5.609438,31.866568
6,5,0.243995,Maybe,1.609438,2.011492,3.609438,14.948915
7,5,2.718353,Maybe,1.609438,1.138575,4.609438,9.36695
8,5,3.68806,Maybe,1.609438,1.865647,3.609438,12.920224
9,5,0.508046,Yes,1.609438,1.54031,8.609438,32.662248
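
A row-wise combination with an optional drop of the inputs could look like the sketch below. The `combine` helper and the dict-of-lists table are assumptions made for illustration, and the input values are illustrative numbers chosen to be consistent with the first two output rows above.

```python
import math


def combine(table, in_columns, out_column, func, drop_input=True):
    """Combine several columns row by row into one new column."""
    rows = zip(*(table[name] for name in in_columns))
    result = [func(values) for values in rows]
    # Keep every column unless it is an input and inputs are dropped.
    out = {k: v for k, v in table.items() if not (drop_input and k in in_columns)}
    out[out_column] = result
    return out


table = {
    "poisson": [4, 0],
    "normal": [5.294828, 4.519],
    "log_constant": [1.609438, 1.609438],
}
# The sum keeps its inputs, so "poisson" is still available for the product.
table = combine(table, ["poisson", "log_constant"], "sum_poisson_log_constant",
                sum, drop_input=False)
# The product uses the default and removes "poisson" and "normal".
table = combine(table, ["poisson", "normal"], "mul_poisson_normal", math.prod)
```

This also shows why "poisson" is absent from the output above even though the sum kept it: the later product step removed it.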


## Dependent transformations
Besides the basic independent transformations, there are more complex ones that take in multiple columns and combine them into a single output.<br>
These include linear and polynomial transformations.<br>

In [18]:
gen.add_transformation(
    TRANSFORMATIONS.Linear.value, 
    name = "linear", 
    in_columns = ["exponential","sum_poisson_log_constant"], 
    out_column = ["linear_exponential_sum_poisson_log_constant"], 
    intercept = 10, 
    coefs = [0.1,-2.5],
    drop_input = True
)
gen.add_transformation(
    TRANSFORMATIONS.Polynomial.value,
    name = "poly",
    in_columns = ["constant","log_normal","mul_poisson_normal"],
    out_column = "poly_constant_log_normal_mul_poisson_normal",
    degree = 3,
    intercept = -20,
    coefs = [0.05,-0.12,0,-0.04,0,0,0.13,-0.02,0],
    inters = [0.03,-0.02,0.12,0.3,0,0.01,-0.07,0,0.17,0],
    drop_input = True
)
gen.generate(10)

Unnamed: 0,sample,log_constant,linear_exponential_sum_poisson_log_constant,poly_constant_log_normal_mul_poisson_normal
0,Maybe,1.609438,-9.021212,-2.262261
1,Maybe,1.609438,-1.496787,6.281497
2,Yes,1.609438,-1.343328,7.450352
3,Maybe,1.609438,-3.843391,11.122401
4,Maybe,1.609438,-1.310204,8.515871
5,Yes,1.609438,-16.444997,5.945133
6,Maybe,1.609438,3.504754,9.434166
7,Yes,1.609438,6.178154,-0.083142
8,Maybe,1.609438,1.180668,8.081735
9,Maybe,1.609438,-3.754085,3.384651
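
For the linear case, a natural reading of the parameters is output = intercept + Σ coef_i · column_i. The sketch below assumes that reading; `linear` is a hypothetical helper and the input row is made up. The polynomial case is left out because the ordering of its 'coefs' and 'inters' entries cannot be inferred from this notebook.

```python
def linear(row, in_columns, intercept, coefs):
    """Weighted sum of the input columns plus an intercept."""
    return intercept + sum(c * row[name] for c, name in zip(coefs, in_columns))


# Made-up input row, using the same intercept and coefficients as above.
row = {"exponential": 1.0, "sum_poisson_log_constant": 4.0}
value = linear(
    row,
    ["exponential", "sum_poisson_log_constant"],
    intercept=10,
    coefs=[0.1, -2.5],
)
# 10 + 0.1 * 1.0 - 2.5 * 4.0 is approximately 0.1
```

The coefficients are matched to the input columns by position, so the order of 'in_columns' and 'coefs' has to agree.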


## Noise
Of course, no data generator would be complete without a way to add random noise to the data.<br>
Below we apply simple normal noise to a numerical column and a more complex value-swapping noise to a categorical column.<br>
We also show that a column can be modified without creating a new one, as in the "normal_noise" call, which passes no output columns.<br>
This works even when 'drop_input' is set to True, although the modified column then ends up at the end of the table.<br>

In [19]:
gen.add_transformation(
    TRANSFORMATIONS.NormalNoise.value, 
    name = "normal_noise", 
    in_columns = ["log_constant"], 
    mean = -1, 
    std = 5, 
    drop_input = True
)
gen.add_transformation(
    TRANSFORMATIONS.RandomSwap.value,
    name = "swap",
    in_columns = ["sample"],
    out_columns = ["swap_sample"],
    values=["Yes","No","Maybe"],
    swap = {"Yes": [1,0,0],"No": [0.5,0,0.5],"Maybe": [0,0.5,0.5]},
    drop_input = False
)
gen.generate(10)

Unnamed: 0,sample,linear_exponential_sum_poisson_log_constant,poly_constant_log_normal_mul_poisson_normal,log_constant,swap_sample
0,Maybe,5.984361,9.882438,12.62781,No
1,Maybe,1.368311,1.888827,1.208465,Yes
2,Yes,-0.980657,22.423121,0.065837,Yes
3,Maybe,-3.524417,10.716136,-8.877977,Yes
4,Yes,-3.874147,4.637065,5.588299,Yes
5,Maybe,-6.484908,7.828634,-1.303942,Yes
6,Maybe,1.179049,7.359525,-3.074665,No
7,Maybe,1.09125,9.64663,-3.750008,No
8,Maybe,2.101734,9.055988,10.747901,Yes
9,Yes,1.658982,11.589116,-1.890523,Yes
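
Additive normal noise that overwrites a column can be sketched as follows. The helper `add_normal_noise` is hypothetical, and treating the noise as additive is an assumption, although it is consistent with the "log_constant" values above scattering around 1.609438 - 1. The value-swapping noise is not sketched because its exact semantics are not documented here.

```python
import random


def add_normal_noise(table, column, mean, std, rng):
    """Add Gaussian noise to one column and re-insert it under the same name."""
    noisy = [x + rng.gauss(mean, std) for x in table[column]]
    # Dropping the input first means the modified column lands at the end.
    out = {k: v for k, v in table.items() if k != column}
    out[column] = noisy
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
table = {"log_constant": [1.609438] * 3, "sample": ["Yes", "No", "Maybe"]}
noisy_table = add_normal_noise(table, "log_constant", mean=-1, std=5, rng=rng)

print(list(noisy_table))  # ['sample', 'log_constant']
```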


## Pre-trained models
Any pre-trained model that exposes a "predict" function can also be used as a transformation.<br>
Below we use a "DecisionTreeClassifier" model to show this.<br>
For simplicity the training data comes from the generator itself, but any external model can be used.<br>

In [20]:
data = gen.generate(10)
X = data[["log_constant","linear_exponential_sum_poisson_log_constant"]]
y = data[["sample","swap_sample"]]
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y)

DecisionTreeClassifier(random_state=1)

In [21]:
gen.add_transformation(
    TRANSFORMATIONS.Model.value, 
    name="model", 
    in_columns=["log_constant","linear_exponential_sum_poisson_log_constant"], 
    out_columns=["model_sample","model_swap_sample"], 
    model=model, 
    drop_input=True
)
gen.generate(10)

Unnamed: 0,sample,poly_constant_log_normal_mul_poisson_normal,swap_sample,model_sample,model_swap_sample
0,Maybe,-9.96919,Yes,Maybe,Yes
1,No,-2.690764,Maybe,Maybe,Yes
2,Maybe,-1.127397,No,Maybe,Yes
3,Maybe,10.62773,Yes,Maybe,Yes
4,Maybe,-1.343685,Yes,Maybe,Yes
5,Yes,13.978883,Yes,Maybe,Yes
6,Maybe,4.210134,No,Maybe,Yes
7,No,3.877076,Maybe,Maybe,No
8,Maybe,7.718062,No,Maybe,Yes
9,Maybe,7.720495,Yes,Maybe,Yes
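
The only requirement stated above is a "predict" function, so any object that provides one should fit. The sketch below shows that duck-typed contract with a toy model; `ThresholdModel` and `apply_model` are hypothetical names used for illustration, not part of the generator or of scikit-learn.

```python
class ThresholdModel:
    """Toy stand-in for a pre-trained model; only `predict` is needed."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, X):
        # One label per row, based on the first feature only.
        return ["Yes" if row[0] > self.threshold else "No" for row in X]


def apply_model(table, in_columns, out_column, model, drop_input=True):
    """Feed the input columns to `model.predict` and store the predictions."""
    X = [list(values) for values in zip(*(table[c] for c in in_columns))]
    predictions = model.predict(X)
    out = {k: v for k, v in table.items() if not (drop_input and k in in_columns)}
    out[out_column] = list(predictions)
    return out


table = {"log_constant": [12.6, -8.9], "sample": ["Maybe", "Yes"]}
result = apply_model(table, ["log_constant"], "model_sample", ThresholdModel(0.0))

print(result)  # {'sample': ['Maybe', 'Yes'], 'model_sample': ['Yes', 'No']}
```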
