# Evaluating Synthetic Data Generators

A very common question when someone starts using **SDV** to generate
synthetic data is: _"How good is the data that I just generated?"_

To answer this question, **SDV** provides a collection of metrics and tools
that allow you to compare the _real_ data that you provided with the _synthetic_
data that you generated using **SDV** or any other tool, and to compute a series
of scores that indicate how similar they are.

In this guide we will show you how to perform this evaluation and how to explore
the different metrics that exist.

## Using the SDV Evaluation Framework

To evaluate the quality of synthetic data we essentially need two things:
_real_ data and _synthetic_ data that is meant to resemble it.

Let us start by loading a demo table and generating a synthetic replica of it
using the `GaussianCopula` model.

In [1]:
# Setup logging and warnings - change ERROR to INFO for increased verbosity
import logging
logging.basicConfig(level=logging.ERROR)

logging.getLogger().setLevel(level=logging.ERROR)
logging.getLogger('sdv').setLevel(level=logging.ERROR)

import warnings
warnings.simplefilter("ignore")

In [8]:
from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula

real_data = load_tabular_demo('student_placements')

model = GaussianCopula()
model.fit(real_data)
synthetic_data = model.sample()

After the previous steps we will have two tables:

- `real_data`: A table containing data about student placements

In [7]:
real_data.head()

| | student_id | gender | second_perc | high_perc | high_spec | degree_perc | degree_type | work_experience | experience_years | employability_perc | mba_spec | mba_perc | salary | placed | start_date | end_date | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17264 | M | 67.0 | 91.0 | Commerce | 58.0 | Sci&Tech | False | 0 | 55.0 | Mkt&HR | 58.8 | 27000.0 | True | 2020-07-23 | 2020-10-12 | 3.0 |
| 1 | 17265 | M | 79.33 | 78.33 | Science | 77.48 | Sci&Tech | True | 1 | 86.5 | Mkt&Fin | 66.28 | 20000.0 | True | 2020-01-11 | 2020-04-09 | 3.0 |
| 2 | 17266 | M | 65.0 | 68.0 | Arts | 64.0 | Comm&Mgmt | False | 0 | 75.0 | Mkt&Fin | 57.8 | 25000.0 | True | 2020-01-26 | 2020-07-13 | 6.0 |
| 3 | 17267 | M | 56.0 | 52.0 | Science | 52.0 | Sci&Tech | False | 0 | 66.0 | Mkt&HR | 59.43 | | False | NaT | NaT | |
| 4 | 17268 | M | 85.8 | 73.6 | Commerce | 73.3 | Comm&Mgmt | False | 0 | 96.8 | Mkt&Fin | 55.5 | 42500.0 | True | 2020-07-04 | 2020-09-27 | 3.0 |


- `synthetic_data`: A synthetically generated table that contains data in the
  same format and with similar statistical properties as the `real_data`.  

In [9]:
synthetic_data.head()

| | student_id | gender | second_perc | high_perc | high_spec | degree_perc | degree_type | work_experience | experience_years | employability_perc | mba_spec | mba_perc | salary | placed | start_date | end_date | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17403 | M | 64.745363 | 67.054098 | Commerce | 67.129322 | Comm&Mgmt | False | 0 | 50.000528 | Mkt&HR | 54.47596 | 30284.62748 | True | 2020-01-18 12:02:36.256540672 | 2020-08-21 21:03:29.414898944 | 12.0 |
| 1 | 17459 | F | 50.331534 | 75.459284 | Commerce | 69.408288 | Comm&Mgmt | False | 0 | 82.477234 | Mkt&Fin | 58.853212 | 32091.931044 | True | 2020-01-21 06:39:21.449370624 | 2020-12-23 02:07:20.159499520 | 12.0 |
| 2 | 17459 | F | 54.643007 | 66.061751 | Commerce | 59.863628 | Comm&Mgmt | False | 0 | 62.777879 | Mkt&HR | 59.011167 | 31925.549018 | True | 2020-03-27 11:08:43.174364160 | 2020-08-25 22:34:33.743831040 | 6.0 |
| 3 | 17273 | M | 64.32462 | 68.857534 | Commerce | 62.917123 | Comm&Mgmt | False | 1 | 50.037711 | Mkt&HR | 53.865449 | 21786.013378 | True | 2020-03-15 01:55:28.996904448 | 2020-08-05 02:36:27.998250240 | 3.0 |
| 4 | 17331 | M | 61.397248 | 52.729968 | Science | 55.619599 | Sci&Tech | False | 0 | 54.396182 | Mkt&HR | 54.953005 | | False | NaT | NaT | 3.0 |


**Note:** For more details about this process, please visit the `gaussian_copula` guide.

### Computing an overall score

The simplest way to see how similar the two tables are is to import the `sdv.evaluation.evaluate`
function and run it passing both the `synthetic_data` and the `real_data` tables.

In [10]:
from sdv.evaluation import evaluate

evaluate(synthetic_data, real_data)

0.6898332363299029

The output of this function call is a number between 0 and 1 that indicates
how similar the two tables are, with 0 being the worst and 1 the best
possible score.

### How was the obtained score computed?

The `evaluate` function applies a collection of pre-configured metric functions and returns
the average of the scores that the data obtained on each one of them. In most scenarios this
can be enough to get an idea about the similarity of the two tables, but you might want to
explore the metrics in more detail.

To see the individual metrics that were applied, you can pass an additional argument,
`aggregate=False`, which makes the `evaluate` function return a dictionary with the score
that each one of the metric functions returned:

In [4]:
evaluate(synthetic_data, real_data, aggregate=False)

{'cstest': 0.8808544706979581,
 'kstest': 0.5412183833523473,
 'logistic_detection': 0.783997710543094,
 'svc_detection': 0.8226395409494002}

### Can I control which metrics are applied?

By default, the `evaluate` function will apply all the metrics that are included within
the SDV Evaluation framework. However, the list of metrics that are applied can be controlled
by passing a list with the names of the metrics that you want to apply.

For example, if you are interested in obtaining only the `cstest` and `kstest` metrics,
you can call the `evaluate` function as follows:

In [5]:
evaluate(synthetic_data, real_data, metrics=['cstest', 'kstest'])

0.7110364270251527

Or, if we want to see the scores separately:

In [6]:
evaluate(synthetic_data, real_data, metrics=['cstest', 'kstest'], aggregate=False)

{'cstest': 0.8808544706979581, 'kstest': 0.5412183833523473}
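
Because `evaluate` returns the average of the individual metric scores, the
aggregated value can be checked by hand. The short sketch below assumes a plain,
unweighted mean (the guide only says "average") and copies the two scores from
the output above; it does not call SDV at all.

```python
import math

# Per-metric scores copied from the `aggregate=False` output above.
scores = {'cstest': 0.8808544706979581, 'kstest': 0.5412183833523473}

# Assumption: the aggregate is the unweighted mean of the individual scores.
aggregate = sum(scores.values()) / len(scores)

# Compare against the value returned by the aggregated call, up to rounding.
matches = math.isclose(aggregate, 0.7110364270251527)
print(aggregate, matches)
```

If the comparison holds, the aggregated score carries no information beyond the
individual scores, so looking at them separately is usually more informative.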

The complete list of possible metrics is:

* `cstest`: This metric compares the distributions of all the categorical
  columns of the table by using a Chi-squared test and returns the average of
  the `p-values` obtained across all the columns. If the tables that you are
  evaluating do not contain any categorical columns the result will be `nan`.
* `kstest`: This metric compares the distributions of all the numerical columns
  of the table with a two-sample Kolmogorov–Smirnov test using the empirical CDF
  and returns the average of the `p-values` obtained across all the columns.
  If the tables that you are evaluating do not contain any numerical columns the
  result will be `nan`.
* `logistic_detection`: This metric tries to use a Logistic Regression classifier
  to detect whether each row is real or synthetic and then evaluates its performance
  using an Area under the ROC curve metric.
* `svc_detection`: This metric tries to use a Support Vector Classifier
  to detect whether each row is real or synthetic and then evaluates its performance
  using an Area under the ROC curve metric.
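
To make the `kstest` description more concrete, here is a minimal sketch of the
idea in plain Python: compute the two-sample Kolmogorov–Smirnov statistic from
the empirical CDFs of each numerical column, turn it into a p-value and average
the p-values across columns. This is not SDV's implementation. The function
names (`ks_statistic`, `ks_pvalue`, `average_ks_pvalue`) are hypothetical, the
p-value uses the asymptotic Kolmogorov series (a rough approximation for small
samples), and the columns are assumed to be lists of numbers without missing
values.

```python
import math
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Return the largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    gap = 0.0
    # Both empirical CDFs are step functions, so the gap only changes at observed values.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def ks_pvalue(real, synthetic, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution (an assumption)."""
    n, m = len(real), len(synthetic)
    lam = ks_statistic(real, synthetic) * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here, and the p-value is effectively 1.
        return 1.0
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                 for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


def average_ks_pvalue(real_columns, synthetic_columns):
    """Average the per-column p-values; `nan` when there are no numerical columns."""
    pvalues = [ks_pvalue(real_columns[name], synthetic_columns[name])
               for name in real_columns]
    return sum(pvalues) / len(pvalues) if pvalues else float('nan')


# `second_perc` and `degree_perc` values from the first five rows shown earlier in this guide.
real_columns = {
    'second_perc': [67.0, 79.33, 65.0, 56.0, 85.8],
    'degree_perc': [58.0, 77.48, 64.0, 52.0, 73.3],
}
synthetic_columns = {
    'second_perc': [64.745363, 50.331534, 54.643007, 64.32462, 61.397248],
    'degree_perc': [67.129322, 69.408288, 59.863628, 62.917123, 55.619599],
}
print(average_ks_pvalue(real_columns, synthetic_columns))
```

With only five rows per column the result is not meaningful as a quality score;
the point is only to show how per-column p-values combine into a single number,
and why a table without numerical columns yields `nan`.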