# How to use nbsynthetic

First we have to install nbsynthetic directly from the repository.

In [1]:
#pip install git+https://github.com/NextBrain-ml/nbsynthetic.git

We import the necessary dependencies to use nbsynthetic. 

In [2]:
from nbsynthetic.data import input_data
from nbsynthetic.data_preparation import SmartBrain
from nbsynthetic.vgan import GAN
from nbsynthetic.synthetic import synthetic_data
from nbsynthetic.statistics import mmd_rbf, Wilcoxon, Student_t, Kolmogorov_Smirnov
from nbsynthetic.statistics import plot_histograms

Then we need to load the dataset. We can do this ourselves or use the ```input_data``` function. Once the data is loaded, we have the option to prepare the dataset, if necessary, using the ```nbEncode``` method of ```SmartBrain```:
```python
SB = SmartBrain()
df = SB.nbEncode(df)
```
This method deals with id columns (which we want to remove) and NaN values (which we want to fill or remove), and it encodes categorical features.
We can also prepare the dataset ourselves.<br> The necessary conditions are:
- The input data has to be a pd.DataFrame.
- Remove id columns.
- Drop NaN values.
- Encode categorical features.
- Finally, all numeric features have to be of type 'int' or 'float', and the categorical features (including boolean ones) have to be of type 'category'. This dtype is specific to pandas.
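
The manual preparation conditions listed above can be sketched with plain pandas. This is only an illustration and not what ```nbEncode``` does internally: the helper name `prepare_manually`, the toy columns, and the choice of integer label encoding are all assumptions made for the example.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def prepare_manually(raw: pd.DataFrame, id_columns: list) -> pd.DataFrame:
    """Hypothetical helper: apply the listed conditions by hand."""
    out = raw.drop(columns=id_columns)  # remove id columns
    out = out.dropna()                  # drop rows that contain NaN values
    for col in out.columns:
        # Boolean and non-numeric columns are treated as categorical.
        if is_bool_dtype(out[col]) or not is_numeric_dtype(out[col]):
            # Label-encode to integer codes (an assumed encoding choice),
            # then store the result with the pandas 'category' dtype.
            codes = out[col].astype("category").cat.codes
            out[col] = codes.astype("category")
    return out.reset_index(drop=True)


# Toy data, invented for this example only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "channel": ["web", "email", None, "web"],
    "is_new": [True, False, True, False],
    "cost": [10.5, 3.2, 7.7, 1.0],
})
prepared = prepare_manually(raw, id_columns=["customer_id"])
print(prepared.dtypes)
```

After this step the numeric column keeps its float dtype, while the text and boolean columns are stored as 'category', which matches the last condition in the list.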



In [3]:
df = input_data('Marketing_campaigns', decimal=',')
SB = SmartBrain() 
df = SB.nbEncode(df)
#check the data types
df.dtypes

% costFemale    float64
%costMale       float64
%cosSexUn       float64
%Cost 18_24     float64
%Cost 25_34     float64
%Cost 35_44     float64
%Cost 45_54     float64
%CostAgeUn      float64
Cost            float64
year            float64
week            float64
day             float64
ROAS            float64
dtype: object

The last step is to create the synthetic dataset, which will be a pd.DataFrame. The only argument we need to choose is the length of this dataset (its number of rows), set with the ```samples``` parameter.

In [4]:
samples = 300
newdf = synthetic_data(
    GAN,
    df,
    samples=samples
)

Epoch (1/10) | D. loss: 0.68 | G. loss: 0.68 |: 100%|##########| 2/2 [00:04<00:00,  2.06s/it]
Epoch (2/10) | D. loss: 0.68 | G. loss: 0.69 |: 100%|##########| 2/2 [00:00<00:00,  6.07it/s]
Epoch (3/10) | D. loss: 0.67 | G. loss: 0.66 |: 100%|##########| 2/2 [00:00<00:00,  4.68it/s]
Epoch (4/10) | D. loss: 0.66 | G. loss: 0.66 |: 100%|##########| 2/2 [00:00<00:00,  2.60it/s]
Epoch (5/10) | D. loss: 0.67 | G. loss: 0.65 |: 100%|##########| 2/2 [00:00<00:00,  5.92it/s]
Epoch (6/10) | D. loss: 0.68 | G. loss: 0.62 |: 100%|##########| 2/2 [00:00<00:00,  5.43it/s]
Epoch (7/10) | D. loss: 0.65 | G. loss: 0.66 |: 100%|##########| 2/2 [00:00<00:00,  5.72it/s]
Epoch (8/10) | D. loss: 0.63 | G. loss: 0.65 |: 100%|##########| 2/2 [00:00<00:00,  4.84it/s]
Epoch (9/10) | D. loss: 0.63 | G. loss: 0.63 |: 100%|##########| 2/2 [00:00<00:00,  6.53it/s]
Epoch (10/10) | D. loss: 0.63 | G. loss: 0.63 |: 100%|##########| 2/2 [00:00<00:00,  6.88it/s]


Once the synthetic dataset is generated, we can check how similar the original and synthetic datasets are. nbsynthetic offers several parametric and non-parametric tests to measure this.
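
To give an idea of what a non-parametric two-sample test measures, here is a minimal NumPy sketch of the two-sample Kolmogorov-Smirnov statistic for a single feature. It is a standalone illustration: the function name `ks_statistic` and the random test data are assumptions, and the signature of the `Kolmogorov_Smirnov` function imported from nbsynthetic is not shown in this notebook, so this should not be read as the library's implementation.

```python
import numpy as np


def ks_statistic(a, b) -> float:
    """Largest absolute gap between the empirical CDFs of two 1-D samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # evaluate both CDFs at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Invented data: two samples from the same distribution and one shifted sample.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=500)
close = rng.normal(0.0, 1.0, size=500)
shifted = rng.normal(3.0, 1.0, size=500)
print(ks_statistic(real, close), ks_statistic(real, shifted))
```

The statistic is 0 for identical samples and approaches 1 as the two samples stop overlapping, so a smaller value for a feature suggests the synthetic column follows the original more closely.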

In [5]:
"""
    Maximum Mean Discrepancy (MMD) is a statistical test 
    to determine whether two samples come from different distributions.
    This test statistic measures the distance between the means 
    of the two samples mapped into a reproducing kernel Hilbert space (RKHS).
    Maximum Mean Discrepancy has found numerous applications in 
    machine learning and nonparametric testing [1][2].

    Maths[3]: 
        Compute the radial basis function (RBF) kernel 
        between two sets of vectors X and Y.
        k(x,y) = exp(-gamma * ||x-y||^2 / 2)
        where gamma is the inverse of the variance 
        of the RBF. A small gamma value defines 
        a Gaussian function with a large variance.

    [1] Ilya Tolstikhin, Bharath K. Sriperumbudur, and Bernhard Schölkopf. 2016. 
    Minimax estimation of maximum mean discrepancy with radial kernels. 
    In Proceedings of the 30th International Conference on Neural 
    Information Processing Systems (NIPS'16). 
    Curran Associates Inc., Red Hook, NY, USA, 1938–1946.

    [2] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. 
    A kernel method for the two sample problem. 
    In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural
    Information Processing Systems 19, pages 513–520, Cambridge, MA, 2007. MIT Press.

    [3] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

        Args:

           X: ndarray/pd.DataFrame of shape (n_samples_X, n_features)
           Y: ndarray/pd.DataFrame of shape (n_samples_Y, n_features)
           gamma: float

        Returns:
            Maximum Mean Discrepancy (MMD) value :(float)
    """

mmd_rbf(df, newdf, gamma=None)

Maximum Mean Discrepance = 0.04270


If both datasets are the same, MMD will be 0. A satisfactory result when comparing two datasets would be a Maximum Mean Discrepancy value lower than 0.05.
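
For readers who want to see how such a value can be computed, here is a minimal NumPy sketch of a biased estimate of the squared MMD using the RBF kernel from the docstring above. It is an assumption-laden illustration, not the code behind ```mmd_rbf```: the name `mmd_rbf_sketch`, the default `gamma=1.0`, and the choice of the biased squared estimator are all picked for the example, so its numbers need not match the library output.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2 / 2), as written in the docstring."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0) / 2.0)


def mmd_rbf_sketch(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD (gamma default is an assumption)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return float(
        rbf_kernel(X, X, gamma).mean()
        + rbf_kernel(Y, Y, gamma).mean()
        - 2.0 * rbf_kernel(X, Y, gamma).mean()
    )


# Invented data: a reference sample, a similar sample, and a shifted sample.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 3))
similar = rng.normal(size=(200, 3))
shifted = rng.normal(loc=3.0, size=(200, 3))
print(mmd_rbf_sketch(reference, similar), mmd_rbf_sketch(reference, shifted))
```

Comparing a sample with itself gives 0, and the value grows as the two samples move apart, which is the behaviour the threshold above relies on.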

Finally, we can complete the check by visually comparing the probability density functions of each feature in the two datasets. This can be done with ```plot_histograms```.

In [6]:
plot_histograms(df, newdf)