# How to use nbsynthetic

First we have to install nbsynthetic directly from the repository.

In [1]:
#pip install git+https://github.com/NextBrain-ml/nbsynthetic.git

## Import modules

We import the necessary dependencies to use nbsynthetic.

In [2]:
from nbsynthetic.data import input_data
from nbsynthetic.data_preparation import SmartBrain
from nbsynthetic.vgan import GAN
from nbsynthetic.synthetic import synthetic_data
from nbsynthetic.statistics import mmd_rbf, Wilcoxon, Student_t, Kolmogorov_Smirnov
from nbsynthetic.statistics import plot_histograms

## Load data

Next we need to load the dataset. We can do this ourselves or use the ```input_data``` module. Once the data is loaded, we can optionally prepare it, if necessary, with the ```nbEncode``` method of ```SmartBrain```:  
```python
SB = SmartBrain()
df = SB.nbEncode(df)
```
This method handles id columns (which we want to remove) and NaN values (which we want to fill or remove), and it encodes categorical features.
We can also prepare the dataset ourselves.<br> The requirements are:
- Input data has to be a pd.DataFrame.
- Remove id columns.
- Drop NaN values.
- Encode categorical features.
- Finally, all numeric features have to be of type 'int' or 'float', and the categorical features (including boolean) have to be of type 'category'. This dtype is specific to pandas.
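
The manual preparation steps listed above can be sketched with plain pandas. This is only a minimal sketch under stated assumptions, not a description of what ```nbEncode``` does internally: the helper name `prepare`, the toy data, and its column names are hypothetical, and integer category codes are just one possible encoding.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "cost": [10.5, 7.25, None, 3.0],
    "channel": ["search", "display", "search", "video"],
    "converted": [True, False, True, False],
})

def prepare(df: pd.DataFrame, id_columns: list, categorical: list) -> pd.DataFrame:
    """Apply the manual preparation steps (hypothetical helper)."""
    out = df.drop(columns=id_columns)            # remove id columns
    out = out.dropna().reset_index(drop=True)    # drop rows with NaN values
    for col in categorical:
        # Encode each categorical (or boolean) feature as integer codes,
        # then store those codes with the pandas 'category' dtype.
        codes = out[col].astype("category").cat.codes
        out[col] = codes.astype("category")
    return out

prepared = prepare(
    raw, id_columns=["customer_id"], categorical=["channel", "converted"]
)
print(prepared.dtypes)
```

After this step the numeric column stays 'float' and the encoded columns are 'category', which matches the conditions in the list.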



In [3]:
df = input_data('Marketing_campaigns', decimal=',')
SB = SmartBrain() 
df = SB.nbEncode(df)
#check the data types
df.dtypes

% costFemale    float64
%costMale       float64
%cosSexUn       float64
%Cost 18_24     float64
%Cost 25_34     float64
%Cost 35_44     float64
%Cost 45_54     float64
%CostAgeUn      float64
Cost            float64
year            float64
week            float64
day             float64
ROAS            float64
dtype: object

## Dataset description

This dataset could come from any company that wants to monitor the performance of its marketing campaigns with Google Ads. Each campaign has a specific target segmentation by sex and age (and also by country and region). The Google Ads API gives users detailed information about the cost of each campaign, the number of impressions, and the number of conversions. From this, you can calculate the return on ad spend (ROAS) with the formula:<br/>
<br/>
$ROAS = (\frac{\text{revenue attributable to ads}}{\text{cost of ads}})\times100$ 
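
As a worked instance of the formula above, the following sketch computes ROAS as a percentage. The function name `roas` and the revenue and cost figures are hypothetical and are not taken from the dataset.

```python
def roas(revenue_from_ads: float, cost_of_ads: float) -> float:
    """Return on ad spend as a percentage: (revenue / cost) * 100."""
    if cost_of_ads <= 0:
        raise ValueError("cost_of_ads must be positive")
    return revenue_from_ads / cost_of_ads * 100

# Hypothetical campaign: 250 in attributable revenue for 100 spent on ads.
print(roas(250.0, 100.0))  # 250.0
```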


To simplify the problem, we removed the columns for cost, impressions, and conversions, leaving only the segmentation metrics and the final ROAS number for each campaign. We also converted the date column into numeric year, week, and day columns. The issue with this data can be viewed as a problem of small sample size (the dataset has only 39 instances). If we wish to run a regression analysis, a rule of thumb that many researchers follow is that the number of points must be at least 10 times the number of independent variables. So, with 12 independent variables (13 features in total), a minimum of 120 points is required for a valid result. 
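
The rule-of-thumb arithmetic above can be written out as a short check. This is a sketch only; the helper name `min_sample_size` is hypothetical, and the factor of 10 is the rule of thumb quoted in the text, not a hard statistical requirement.

```python
def min_sample_size(n_independent: int, per_variable: int = 10) -> int:
    """Rule-of-thumb minimum number of observations for a regression."""
    return n_independent * per_variable

n_features = 13                  # columns in the prepared dataset, including ROAS
n_independent = n_features - 1   # every column except the target
required = min_sample_size(n_independent)
available = 39                   # instances in the original dataset

print(required)                  # 120
print(available >= required)     # False
```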

## Generate synthetic data

The last step is to create the synthetic dataset, which will be a pd.DataFrame. The only value we need to choose is the length of this dataset, set with the ```samples``` parameter. The other arguments are the prepared dataset, ```df```, and the Generative Adversarial Network, ```GAN```. 

In [4]:
samples = 500
newdf = synthetic_data(
    GAN, 
    df, 
    samples = samples
    )

Epoch (1/10) | D. loss: 0.71 | G. loss: 0.69 |: 100%|##########| 2/2 [00:05<00:00,  2.62s/it]
Epoch (2/10) | D. loss: 0.70 | G. loss: 0.68 |: 100%|##########| 2/2 [00:00<00:00,  2.61it/s]
Epoch (3/10) | D. loss: 0.67 | G. loss: 0.68 |: 100%|##########| 2/2 [00:00<00:00,  2.02it/s]
Epoch (4/10) | D. loss: 0.68 | G. loss: 0.67 |: 100%|##########| 2/2 [00:00<00:00,  3.59it/s]
Epoch (5/10) | D. loss: 0.66 | G. loss: 0.64 |: 100%|##########| 2/2 [00:00<00:00,  3.24it/s]
Epoch (6/10) | D. loss: 0.64 | G. loss: 0.68 |: 100%|##########| 2/2 [00:00<00:00,  4.38it/s]
Epoch (7/10) | D. loss: 0.67 | G. loss: 0.62 |: 100%|##########| 2/2 [00:00<00:00,  4.57it/s]
Epoch (8/10) | D. loss: 0.62 | G. loss: 0.64 |: 100%|##########| 2/2 [00:00<00:00,  3.93it/s]
Epoch (9/10) | D. loss: 0.61 | G. loss: 0.63 |: 100%|##########| 2/2 [00:00<00:00,  4.25it/s]
Epoch (10/10) | D. loss: 0.59 | G. loss: 0.66 |: 100%|##########| 2/2 [00:00<00:00,  4.33it/s]


## Testing

Once the synthetic dataset is generated, we can check how similar the original and synthetic datasets are. Statistical tests such as Student's t-test or the Wilcoxon test check whether two related (paired) samples come from the same distribution. To compare the input and synthetic datasets with them, we have to test each feature separately, comparing its input data sample with its synthetic data sample. These tests are very useful, but their interpretation is not simple when we have multidimensional data. Do we need an acceptable p-value for every feature? Does that mean that both datasets are equivalent? To deal with this problem we have chosen a measure called Maximum Mean Discrepancy (MMD). MMD is a statistical test to determine whether two samples come from different distributions. The statistic measures the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS)$^1$. This distance is based on the notion of embedding probability distributions in an RKHS. The 'kernel trick' then allows us to measure the 'distance' between the two complete datasets. To learn more about reproducing kernel Hilbert spaces, we suggest reading this [source](https://ieeexplore.ieee.org/abstract/document/1624356).<br>
A Hilbert space generalizes Euclidean space: it is a complete inner-product space (and therefore a metric space) whose vectors may be infinite-dimensional. In a Hilbert space, each finite-dimensional vector can be represented by a function, which behaves like a continuous, infinite-dimensional vector. Our problem can then be framed as selecting an optimal function from a large family of functions (our high-dimensional dataset). What we want to specify is a prior distribution over an entire space of functions, as in Bayesian nonparametric methods, in order to compare both complete datasets more easily.  
<br/>
</br>
</br>
<font size="0.6">
$^1$ Ilya Tolstikhin, Bharath K. Sriperumbudur, and Bernhard Schölkopf (2016). Minimax estimation of maximum mean discrepancy with radial kernels. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). Curran Associates Inc., Red Hook, NY, USA, 1938–1946.<br/>
$^2$ A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2007). A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520, Cambridge, MA. MIT Press.</font>

In [5]:
"""
    ###########Maximum Mean Discrepancy (MMD)################

    Maths: 
        Compute the radial basis function (RBF) kernel
        between the rows of X and Y:
        k(x,y) = exp(-gamma * ||x-y||^2 / 2)
        where gamma is the inverse of the variance
        of the RBF. A small gamma value defines
        a Gaussian function with a large variance.

        Args:

           X: pd.DataFrame of shape (n_samples_X, n_features)
           Y: pd.DataFrame of shape (n_samples_Y, n_features)
           gamma: float

        Returns:
            Maximum Mean Discrepancy (MMD) value :(float)
    """

mmd_rbf(df, newdf, gamma=None)

Maximum Mean Discrepance = 0.07895


If both datasets are the same, MMD will be 0. According to our tests with different input datasets, a good result when comparing two datasets would be a Maximum Mean Discrepancy value lower than 0.05. 
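
To make the statistic more concrete, here is a minimal NumPy sketch of a biased estimate of the squared MMD, using the kernel form given in the docstring above. It is an illustration under stated assumptions, not necessarily how ```mmd_rbf``` is implemented: the function names, the default `gamma`, and the toy samples are hypothetical, and the scale of the value may differ from the library's output.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """Pairwise k(x, y) = exp(-gamma * ||x - y||^2 / 2) for rows of a and b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0) / 2.0)

def mmd2_biased(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

rng = np.random.default_rng(0)
same = rng.normal(size=(200, 3))
shifted = same + 2.0  # same shape, mean moved by 2 in every feature

print(mmd2_biased(same, same))     # zero for identical samples
print(mmd2_biased(same, shifted))  # clearly larger than zero
```

The three kernel means compare each dataset with itself and with the other one, so the whole datasets are compared at once instead of feature by feature.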

Finally, we can finish our checking process by visually comparing the probability density functions of each feature in the dataset. This can be done with ```plot_histograms```. 

In [6]:
plot_histograms(df, newdf)

To check the utility of the generated synthetic data, we can solve an ML problem. Let's take the 'ROAS' column as our target, so we have a regression problem. We use an ensemble method: [Gradient Boosting Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html). We run the same analysis (with the same algorithm parameters) on both the original and the synthetic dataset.

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X = df.drop(columns=['ROAS'])
y = df['ROAS']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
    )


est = GradientBoostingRegressor(
    n_estimators=100, 
    learning_rate=0.1, 
    max_depth=1, 
    random_state=42,
    loss='squared_error'
    ).fit(X_train, y_train)


y_predict = est.predict(X_test)
print(f'Mean Squared Error original data = {mean_squared_error(y_test, y_predict):.3f}')

Mean Squared Error original data = 0.552


In [8]:
import plotly.express as px
fig = px.scatter(x=y_test, y=y_predict)
fig.add_scatter(x=[0,1.7], y=[0,1.7])
fig.update_xaxes(title='y test', range=[0, 1.7])
fig.update_yaxes(title='original y predict', range=[0, 1.7])
fig.update_layout(showlegend=False, title_text="Algorithm trained and tested with original data")
fig.show()

In [9]:
X = newdf.drop(columns=['ROAS'])
y = newdf['ROAS']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
    )
est = GradientBoostingRegressor(
    n_estimators=100, 
    learning_rate=0.1, 
    max_depth=1, 
    random_state=42,
    loss='squared_error'
    ).fit(X_train, y_train)
    
y_pred = est.predict(X_test)
print(f'Mean Squared Error synthetic data = {mean_squared_error(y_test, y_pred):.3f}')

Mean Squared Error synthetic data = 0.023


In [10]:
import plotly.express as px
fig = px.scatter(x=y_test, y=y_pred)
fig.add_scatter(x=[0,1.7], y=[0,1.7])
fig.update_xaxes(title='synthetic y test', range=[0, 1.7])
fig.update_yaxes(title='y predict', range=[0, 1.7])
fig.update_layout(
    showlegend=False, 
    title_text=f"Algorithm trained and tested with synthetic data. MSE={mean_squared_error(y_test, y_pred):.3f}"
    )
fig.show()

The synthetic dataset contains 500 instances, whereas the original dataset contains just 39 instances. The original sample size is insufficient to produce accurate results. Many researchers, for example, recommend at least 10 observations per variable in regression analysis. With 12 independent variables in our dataset, this simple guideline gives a minimum sample size of 120. The minimum sample size can also be calculated with more rigorous statistical approaches, such as those based on the confidence interval or the effect size.

We can see that by increasing the sample size, we also improved the regression accuracy (significantly decreasing the MSE).


Finally, we can cross-check the algorithm by using the synthetic dataset for training and the original data for testing.

In [11]:
X_train = newdf.drop(columns=['ROAS'])
y_train = newdf['ROAS']
X_test = df.drop(columns=['ROAS'])
y_test = df['ROAS']
est = GradientBoostingRegressor(
    n_estimators=100, 
    learning_rate=0.1, 
    max_depth=1, 
    random_state=42,
    loss='squared_error'
    ).fit(X_train, y_train)
y_pred = est.predict(X_test)
print(f'Mean Squared Error cross data = {mean_squared_error(y_test, y_pred):.3f}')

Mean Squared Error cross data = 0.382


We can see that the MSE of this cross-training is higher (lower accuracy) than the one obtained by training and testing with the synthetic dataset, but lower (higher accuracy) than the one obtained by training and testing with the original data. The important point is that the validity of the cross-training is less questionable, as we have trained the algorithm with a sample size much larger than the original dataset.
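
The comparison can be put side by side with a few lines of Python. This sketch only reuses the three MSE values printed by the cells above; the dictionary keys and the choice of the original-data run as baseline are ours, and each value comes from a single run.

```python
# MSE values reported by the cells above.
mse = {
    "original train / original test": 0.552,
    "synthetic train / synthetic test": 0.023,
    "synthetic train / original test": 0.382,
}

baseline = mse["original train / original test"]
relative_change = {
    setup: (value - baseline) / baseline * 100 for setup, value in mse.items()
}

for setup, value in mse.items():
    print(f"{setup:34s} MSE={value:.3f}  change vs. baseline={relative_change[setup]:+.1f}%")
```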

In [12]:
import plotly.express as px
fig = px.scatter(x=df['ROAS'], y=y_pred)
fig.add_scatter(x=[0,1.7], y=[0,1.7])
fig.update_xaxes(title='y test', range=[0, 1.7])
fig.update_yaxes(title='y predict', range=[0, 1.7])
fig.update_layout(
    showlegend=False, 
    title_text=f"Algorithm trained with synthetic data and tested with original data. MSE={mean_squared_error(y_test, y_pred):.3f}")
fig.show()

## Conclusions

We showed that we could create a larger 'equivalent' synthetic dataset (n=500) from a small original dataset (n=39). By training the algorithm with synthetic data, we solved a regression machine learning problem. After testing this algorithm with the original data, we observed that prediction accuracy grows together with training reliability.