# Usage of DataSynthesizer

Given a private dataset, DataSynthesizer generates a synthetic dataset that can be released to the public. It infers the data type and domain of each attribute in the dataset and models the distribution of each attribute with a histogram. The synthetic dataset is then sampled either from these histograms or uniformly from the inferred domains.

## Data types
DataSynthesizer currently supports four basic data types.

| data type | example |
|-----------|---------|
| integer   | id, age, ...|
| float     | score, rating, ...|
| string    | first name, gender, ...|
| datetime  | birthday, event time, ...|
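
As a rough illustration of how a column's data type could be inferred from raw text values, here is a minimal sketch. It is not DataSynthesizer's actual implementation: the helper `infer_data_type` is hypothetical, it uses only the Python standard library, and it assumes datetimes are written in ISO 8601 format.

```python
from datetime import datetime


def infer_data_type(values):
    """Guess 'integer', 'float', 'datetime' or 'string' for a column.

    Hypothetical helper for illustration only.
    Missing values ('' or None) are ignored.
    """
    present = [v for v in values if v not in ("", None)]

    def all_parse(parser):
        # True only if every present value is accepted by the parser.
        try:
            for v in present:
                parser(v)
            return True
        except (TypeError, ValueError):
            return False

    if not present:
        return "string"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "string"


print(infer_data_type(["39", "50", ""]))
print(infer_data_type(["1990-05-17", "2001-12-01"]))
```

The checks run from the most specific type to the least specific one, so a column falls back to "string" only when nothing else fits.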

## Data description format

The domain of each attribute is described as follows.
- The "categorical" flag marks attributes that take particular values, e.g., "gender" or "nationality".
- Domains are modeled by histograms, except for non-categorical "string" attributes.

|data type|categorical|min          |max          |values             |probabilities      |values count      |missing rate|
|---------|-----------|-------------|-------------|-------------------|-------------------|------------------|------------|
|int      |True/False |min          |max          |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|float    |True/False |min          |max          |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|string   |True       |min length   |max length   |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|string   |False      |min length   |max length   |0                  |0                  |0                 |missing rate|
|datetime |True/False |min timestamp|max timestamp|x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
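
To make the fields in this table concrete, the following sketch computes them for a categorical string column. The helper `describe_categorical` is hypothetical and only mirrors the table's column names; it is not the library's code, and it assumes the column has at least one non-missing value.

```python
from collections import Counter


def describe_categorical(values):
    """Summarize a categorical string column (hypothetical helper).

    None marks a missing value. Assumes at least one value is present.
    """
    present = [v for v in values if v is not None]
    counts = Counter(present).most_common()  # most frequent value first
    total = len(present)
    return {
        "min": min(len(v) for v in present),  # shortest string length
        "max": max(len(v) for v in present),  # longest string length
        "values": [v for v, _ in counts],
        "probabilities": [c / total for _, c in counts],
        "values count": len(counts),
        "missing rate": (len(values) - total) / len(values),
    }


summary = describe_categorical(["White", "Black", "White", None, "White"])
print(summary)
```

Here "values" and "probabilities" play the role of the histogram's x-axis and y-axis, and "values count" is the number of bins.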

##### Step 0: Import DataDestriber and DataGenerator from DataSynthesizer

In [1]:
from DataSynthesizer import DataDestriber, DataGenerator

In [2]:
# Directories of input and output files
input_dataset_file = './raw_data/AdultIncomeData/adult.csv'
dataset_description_file = './output/description/AdultIncomeData_description.csv'
synthetic_data_file = './output/synthetic_data/AdultIncomeData_synthetic.csv'

##### Step 1: Initialize a dataset describer

In [3]:
describer = DataDestriber()

Initialized a dataset describer.


##### Step 1: Generate dataset description

The dataset description is inferred automatically, and users can override the inferred data types and categorical indicators. In the example below:
- "education-num" is set to type "float".
- "native-country" is set to non-categorical.
- "age" is set to categorical.

In [4]:
describer.describe_dataset(file_name=input_dataset_file,
                           column_to_datatype_dict={'education-num': 'float'},
                           column_to_categorical_dict={'native-country':False,'age':True})
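
One way to picture how such overrides interact with inferred settings is a plain dictionary merge in which the user's entries win. This is only a sketch under that assumption: the inferred values below are made up for illustration, and the library may combine them differently.

```python
# Hypothetical inferred settings; the real ones come from the describer.
inferred_datatypes = {
    "age": "integer",
    "education-num": "integer",
    "native-country": "string",
}
inferred_categorical = {
    "age": False,
    "education-num": True,
    "native-country": True,
}

# User overrides, with the same content as the arguments in the call above.
column_to_datatype_dict = {"education-num": "float"}
column_to_categorical_dict = {"native-country": False, "age": True}

# Later entries win, so explicit user choices replace inferred ones.
final_datatypes = {**inferred_datatypes, **column_to_datatype_dict}
final_categorical = {**inferred_categorical, **column_to_categorical_dict}

print(final_datatypes)
print(final_categorical)
```

Columns that are not mentioned in an override dictionary keep their inferred settings.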

##### Step 2: Get the dataset description

Let's take a look at the input dataset:

In [5]:
describer.input_dataset.head()

Unnamed: 0,ID,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1.0,39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,,White,Male,2174.0,,40.0,United-States,<=50K
1,2.0,50.0,Self-emp-not-inc,83311.0,Bachelors,,,,Husband,White,,0.0,0.0,13.0,United-States,
2,3.0,38.0,Private,215646.0,HS-grad,9.0,Divorced,,,White,Male,,,,United-States,<=50K
3,4.0,53.0,,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,5.0,28.0,,,Bachelors,,,,,Black,Female,0.0,,40.0,Cuba,


The dataset description is:

In [6]:
describer.dataset_description

column,data type,categorical,min,max,values,value counts,histogram size,missing
ID,int,False,1.0,32561.0,"[14653.0, 29305.0, 30933.0, 4885.0, 8141.0, 32...","[1120, 1114, 1111, 1101, 1098, 1098, 1093, 109...",20.0,0.335432
age,int,True,17.0,90.0,"[36.0, 35.0, 31.0, 33.0, 34.0, 23.0, 28.0, 37....","[812, 803, 803, 794, 791, 790, 773, 773, 772, ...",73.0,0.098676
workclass,string,True,2.0,17.0,"[ Private, Self-emp-not-inc, Local-gov, ?, ...","[13426, 1468, 1250, 1090, 762, 667, 586, 8, 5]",9.0,0.408433
fnlwgt,int,False,12285.0,1455435.0,"[156600.0, 84442.5, 228757.5, 10841.85, 300915...","[8379, 6399, 3447, 3375, 1996, 872, 287, 113, ...",20.0,0.232395
education,string,True,4.0,13.0,"[ HS-grad, Some-college, Bachelors, Masters...","[7195, 4914, 3638, 1185, 939, 790, 724, 619, 4...",16.0,0.319646
education-num,float,True,1.0,16.0,"[9.0, 10.0, 13.0, 14.0, 11.0, 7.0, 12.0, 6.0, ...","[9203, 6380, 4682, 1521, 1212, 1029, 926, 817,...",16.0,0.124321
marital-status,string,True,8.0,22.0,"[ Married-civ-spouse, Never-married, Divorce...","[11715, 8381, 3450, 808, 792, 323, 22]",7.0,0.217131
occupation,string,True,2.0,18.0,"[ Craft-repair, Exec-managerial, Prof-specia...","[2374, 2368, 2343, 2184, 2136, 1845, 1141, 104...",15.0,0.424741
relationship,string,True,5.0,15.0,"[ Husband, Not-in-family, Own-child, Unmarr...","[10173, 6343, 3874, 2667, 1202, 771]",6.0,0.231289
race,string,True,6.0,19.0,"[ White, Black, Asian-Pac-Islander, Amer-In...","[25125, 2826, 948, 282, 246]",5.0,0.09625


##### Step 3: Save the dataset description

In [7]:
describer.dataset_description.to_csv(dataset_description_file)

### Generate synthetic data

##### Step 4: Initialize a synthetic data generator

In [8]:
generator = DataGenerator()

Initialized a synthetic data generator.


##### Step 5: Generate synthetic dataset

By default, the data is sampled from the histograms in the dataset description. Users can instead have selected columns sampled uniformly from the domain [min, max].

> generator.generate_uniform_random_dataset(dataset_description_file, N=10) # will generate a totally random dataset.

This example generates a synthetic dataset of 10 rows, where "age" and "education" are sampled uniformly.

In [9]:
generator.generate_synthetic_dataset(dataset_description_file, N=10, uniform_columns={'age', 'education'})
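
The difference between the two sampling modes can be sketched with the standard library alone. The functions below are hypothetical and are not the generator's internals. The random-string helper is an assumption about how a non-categorical "string" column, which has no histogram, might be filled using only its min and max length.

```python
import random
import string


def sample_from_histogram(values, probabilities, n, rng):
    """Draw n values, weighting each value by its probability."""
    return rng.choices(values, weights=probabilities, k=n)


def sample_uniform_int(minimum, maximum, n, rng):
    """Draw n integers uniformly from [minimum, maximum], inclusive."""
    return [rng.randint(minimum, maximum) for _ in range(n)]


def sample_random_strings(min_length, max_length, n, rng):
    """Draw n random letter strings with lengths in [min_length, max_length]."""
    return [
        "".join(rng.choices(string.ascii_letters,
                            k=rng.randint(min_length, max_length)))
        for _ in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
races = sample_from_histogram(["White", "Black"], [0.9, 0.1], 10, rng)
ages = sample_uniform_int(17, 90, 10, rng)
countries = sample_random_strings(4, 26, 10, rng)
print(races, ages, countries, sep="\n")
```

Histogram sampling keeps the frequencies of the original column, while uniform sampling keeps only its range.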

##### Step 6: Random missing

Randomly remove values from a given column, e.g., remove 60% of the "age" values.

In [10]:
generator.random_missing_on_column('age', 0.6)
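
The idea behind random missing can be sketched on a plain list. This hypothetical `random_missing` helper blanks out an exact fraction of the positions. Whether the library removes an exact fraction or drops each value independently is not stated here, so treat that choice as an assumption.

```python
import random


def random_missing(column, missing_rate, rng):
    """Return a copy of column with a fraction of values replaced by None.

    Hypothetical helper: picks round(len(column) * missing_rate) positions
    at random and blanks them out.
    """
    n_missing = round(len(column) * missing_rate)
    missing_idx = set(rng.sample(range(len(column)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(column)]


ages = list(range(20, 30))  # ten made-up age values
with_missing = random_missing(ages, 0.6, random.Random(0))
print(with_missing)
```

The original list is left unchanged, so the full and the partially missing versions can be compared side by side.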

In [11]:
generator.synthetic_dataset

Unnamed: 0,ID,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,4065,48.0,Private,160190,HS-grad,16.0,Never-married,Transport-moving,Husband,Black,Male,379,14,35,fRddDTdaMUweLKM,<=50K
1,24288,,Private,227763,Masters,13.0,Married-civ-spouse,Transport-moving,Own-child,White,Male,344,56,35,Aowy,<=50K
2,27102,,Private,231896,5th-6th,4.0,Married-civ-spouse,Other-service,Not-in-family,White,Male,33,23,35,rptKdaHMGcLk,<=50K
3,10423,26.0,Private,78382,1st-4th,13.0,Never-married,?,Unmarried,Asian-Pac-Islander,Male,345,6,34,qqDoCBKmcYWQkdK,<=50K
4,16241,,Private,231779,9th,10.0,Divorced,Exec-managerial,Other-relative,White,Male,15110,3,39,hObNPggEZuxcBEXlnTOpQIuOBL,<=50K
5,836,60.0,Private,296062,11th,9.0,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,20,7,15,QzWQIDsKXA,>50K
6,381,,Self-emp-not-inc,149888,1st-4th,13.0,Never-married,Handlers-cleaners,Not-in-family,White,Female,46,1,54,vvwKLlLYlXVXRLubVw,<=50K
7,20011,,Private,145245,Assoc-acdm,14.0,Never-married,Craft-repair,Husband,Black,Male,245,4,39,okRSXjsqCjcxZzqez,>50K
8,305,,Private,87477,Prof-school,10.0,Married-civ-spouse,Exec-managerial,Own-child,White,Male,679,1754,35,ypAWuCwLXnTNhitTsz,<=50K
9,21814,23.0,Private,157218,Assoc-acdm,13.0,Never-married,Machine-op-inspct,Other-relative,White,Male,359,9,32,CDIIdNNcEqOLwFSkObyzkUWax,<=50K


##### Step 7: Save the synthetic dataset

In [12]:
generator.synthetic_dataset.to_csv(synthetic_data_file, index=False)