# Usage of datafaker

### Import two classes from datafaker
1. DatasetDestriber infers the domain of each column in a dataset.
2. SyntheticDataGenerator generates synthetic data according to the dataset description.

In [1]:
from datafaker import DatasetDestriber, SyntheticDataGenerator

### Data types
datafaker currently supports four basic data types.

| data type | example |
|-----------|---------|
| integer   | id, age, ...|
| float     | score, rating, ...|
| string    | first name, gender, ...|
| datetime  | birthday, event time, ...|

The data types can be part of the input. If not, they will be inferred from the dataset.
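
As a rough illustration of what inferring a data type can involve, the sketch below classifies a list of raw CSV strings into the four types above. It is not datafaker's actual inference code: the function name `infer_data_type` and its rules (whole-valued numbers count as integers, ISO-formatted text counts as datetime) are assumptions made for this example.

```python
from datetime import datetime


def infer_data_type(values):
    """Guess 'integer', 'float', 'datetime' or 'string' for raw CSV strings."""
    # Empty cells are treated as missing and ignored.
    present = [v for v in values if v not in ('', None)]
    if not present:
        return 'string'
    try:
        numbers = [float(v) for v in present]
        # Whole-valued columns such as "39.0" are treated as integers.
        return 'integer' if all(n.is_integer() for n in numbers) else 'float'
    except ValueError:
        pass  # not numeric, try dates next
    try:
        for v in present:
            datetime.fromisoformat(v)
        return 'datetime'
    except ValueError:
        return 'string'
```

Under these assumed rules, a column that holds `'39.0'`, `'50.0'` and an empty cell is classified as integer, while `'Male'` and `'Female'` fall through to string.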

### Data description format

The domain of data is described as follows.
- "categorical" indicates attributes that take a limited set of particular values, e.g., "gender", "nationality".
- Most domains are modeled by a histogram. The exception is non-categorical "string", which is described only by its minimum and maximum length.

|data type|categorical|min          |max          |values             |probabilities      |values count      |missing rate|
|---------|-----------|-------------|-------------|-------------------|-------------------|------------------|------------|
|int      |True/False |min          |max          |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|float    |True/False |min          |max          |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|string   |True       |min length   |max length   |x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
|string   |False      |min length   |max length   |0                  |0                  |0                 |missing rate|
|datetime |True/False |min timestamp|max timestamp|x-axis in histogram|y-axis in histogram|#bins in histogram|missing rate|
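
To make the columns of this format concrete, here is a minimal sketch that computes such a description for one numeric column and one categorical column using only the standard library. The function names, the equal-width binning, and the default of 20 bins are assumptions for illustration; datafaker may build its histograms differently.

```python
from collections import Counter


def describe_numeric(values, bins=20):
    """Summarize a numeric column as an equal-width histogram."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    width = (high - low) / bins or 1.0  # avoid zero width for constant columns
    counts = [0] * bins
    for v in present:
        # The maximum value is placed in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return {
        'min': low,
        'max': high,
        'values': [low + i * width for i in range(bins)],  # left bin edges
        'probabilities': [c / len(present) for c in counts],
        'values count': bins,
        'missing rate': 1 - len(present) / len(values),
    }


def describe_categorical(values):
    """Summarize a categorical column by the frequency of each distinct value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        'values': list(counts),
        'probabilities': [c / len(present) for c in counts.values()],
        'values count': len(counts),
        'missing rate': 1 - len(present) / len(values),
    }
```

In both cases "values" and "probabilities" are parallel lists, so the i-th probability belongs to the i-th bin or category, and "missing rate" is the fraction of cells that hold no value.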

##### Step 1: Specify the paths of the input and output files

In [2]:
input_dataset_file = './raw_data/AdultIncomeData/adult.csv'
dataset_description_file = './output/description/AdultIncomeData_description.csv'
synthetic_data_file = './output/synthetic_data/AdultIncomeData_synthetic.csv'
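
pandas' `to_csv` does not create missing directories, so the two output folders need to exist before the later save steps run. A small helper along these lines could take care of that; `ensure_parent_dir` is a hypothetical name introduced here and is not part of datafaker.

```python
import os


def ensure_parent_dir(file_path):
    """Create the directory that will contain file_path, if it is missing."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return file_path
```

For example, calling `ensure_parent_dir(dataset_description_file)` and `ensure_parent_dir(synthetic_data_file)` once after defining the paths would be enough.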

##### Step 2: Initialize a DatasetDescriber

In [3]:
describer = DatasetDestriber()

Initialized a dataset description generator.


##### Step 3: Generate dataset description

- description1 is inferred entirely by the code.
- description2 is also inferred, but with user-specified data types and categorical indicators for some columns.

In [4]:
description1 = describer.get_dataset_description(file_name=input_dataset_file)

In [5]:
description1

| column | data type | categorical | min | max | values | probabilities | values count | missing |
|--------|-----------|-------------|-----|-----|--------|---------------|--------------|---------|
| ID | int | False | 1.0 | 32561.0 | [-31.56, 1629.0, 3257.0, 4885.0, 6513.0, 8141.... | [0.0496788206479, 0.0494939692222, 0.050741716... | 20.0 | 0.335432 |
| age | int | False | 17.0 | 90.0 | [16.927, 20.65, 24.3, 27.95, 31.6, 35.25, 38.9... | [0.0741447458089, 0.0971446095134, 0.075337331... | 20.0 | 0.098676 |
| workclass | string | True | 2.0 | 17.0 | [ Never-worked, ?, Private, Without-pay, S... | [0.000259578444606, 0.0565881009241, 0.6970200... | 9.0 | 0.408433 |
| fnlwgt | int | False | 12285.0 | 1455435.0 | [10841.85, 84442.5, 156600.0, 228757.5, 300915... | [0.135032407778, 0.256021445147, 0.33524045771... | 20.0 | 0.232395 |
| education | string | True | 4.0 | 13.0 | [ Preschool, 11th, Masters, 1st-4th, 7th-8... | [0.00162506206834, 0.0356610842775, 0.05349162... | 16.0 | 0.319646 |
| education-num | int | True | 1.0 | 16.0 | [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ... | [0.00126258198015, 0.00519061480728, 0.0104864... | 16.0 | 0.124321 |
| marital-status | string | True | 8.0 | 22.0 | [ Married-AF-spouse, Married-spouse-absent, ... | [0.000863049703817, 0.0126711388333, 0.4595739... | 7.0 | 0.217131 |
| occupation | string | True | 2.0 | 18.0 | [ Handlers-cleaners, Exec-managerial, ?, Te... | [0.0424963963483, 0.126421440393, 0.0559500293... | 15.0 | 0.424741 |
| relationship | string | True | 5.0 | 15.0 | [ Unmarried, Wife, Own-child, Other-relativ... | [0.106552137435, 0.0480223731522, 0.1547742708... | 6.0 | 0.231289 |
| race | string | True | 6.0 | 19.0 | [ Asian-Pac-Islander, Black, White, Amer-In... | [0.0322153124681, 0.0960342542563, 0.853807727... | 5.0 | 0.09625 |


In [6]:
description1 = describer.get_dataset_description(file_name=input_dataset_file)
description2 = describer.get_dataset_description(file_name=input_dataset_file,
                                                 column_to_datatype_dict={'education-num': 'float'},
                                                 column_to_categorical_dict={'native-country':False,'age':True})

The input dataset is

In [7]:
describer.input_dataset.head()

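|   | ID | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|----|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 1.0 | 39.0 | State-gov | 77516.0 | Bachelors | 13.0 | Never-married | Adm-clerical |  | White | Male | 2174.0 |  | 40.0 | United-States | <=50K |
| 1 | 2.0 | 50.0 | Self-emp-not-inc | 83311.0 | Bachelors |  |  |  | Husband | White |  | 0.0 | 0.0 | 13.0 | United-States |  |
| 2 | 3.0 | 38.0 | Private | 215646.0 | HS-grad | 9.0 | Divorced |  |  | White | Male |  |  |  | United-States | <=50K |
| 3 | 4.0 | 53.0 |  | 234721.0 | 11th | 7.0 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 4 | 5.0 | 28.0 |  |  | Bachelors |  |  |  |  | Black | Female | 0.0 |  | 40.0 | Cuba |  |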


The dataset description inferred by code is

In [8]:
description1

| column | data type | categorical | min | max | values | probabilities | values count | missing |
|--------|-----------|-------------|-----|-----|--------|---------------|--------------|---------|
| ID | int | False | 1.0 | 32561.0 | [-31.56, 1629.0, 3257.0, 4885.0, 6513.0, 8141.... | [0.0496788206479, 0.0494939692222, 0.050741716... | 20.0 | 0.335432 |
| age | int | False | 17.0 | 90.0 | [16.927, 20.65, 24.3, 27.95, 31.6, 35.25, 38.9... | [0.0741447458089, 0.0971446095134, 0.075337331... | 20.0 | 0.098676 |
| workclass | string | True | 2.0 | 17.0 | [ Never-worked, ?, Private, Without-pay, S... | [0.000259578444606, 0.0565881009241, 0.6970200... | 9.0 | 0.408433 |
| fnlwgt | int | False | 12285.0 | 1455435.0 | [10841.85, 84442.5, 156600.0, 228757.5, 300915... | [0.135032407778, 0.256021445147, 0.33524045771... | 20.0 | 0.232395 |
| education | string | True | 4.0 | 13.0 | [ Preschool, 11th, Masters, 1st-4th, 7th-8... | [0.00162506206834, 0.0356610842775, 0.05349162... | 16.0 | 0.319646 |
| education-num | int | True | 1.0 | 16.0 | [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ... | [0.00126258198015, 0.00519061480728, 0.0104864... | 16.0 | 0.124321 |
| marital-status | string | True | 8.0 | 22.0 | [ Married-AF-spouse, Married-spouse-absent, ... | [0.000863049703817, 0.0126711388333, 0.4595739... | 7.0 | 0.217131 |
| occupation | string | True | 2.0 | 18.0 | [ Handlers-cleaners, Exec-managerial, ?, Te... | [0.0424963963483, 0.126421440393, 0.0559500293... | 15.0 | 0.424741 |
| relationship | string | True | 5.0 | 15.0 | [ Unmarried, Wife, Own-child, Other-relativ... | [0.106552137435, 0.0480223731522, 0.1547742708... | 6.0 | 0.231289 |
| race | string | True | 6.0 | 19.0 | [ Asian-Pac-Islander, Black, White, Amer-In... | [0.0322153124681, 0.0960342542563, 0.853807727... | 5.0 | 0.09625 |


The dataset description below is also inferred by the code, but with data types and categorical indicators specified by the user:
- "education-num" is of data type "float".
- "native-country" is not categorical.
- "age" is categorical.

In [9]:
description2

| column | data type | categorical | min | max | values | probabilities | values count | missing |
|--------|-----------|-------------|-----|-----|--------|---------------|--------------|---------|
| ID | int | False | 1.0 | 32561.0 | [-31.56, 1629.0, 3257.0, 4885.0, 6513.0, 8141.... | [0.0496788206479, 0.0494939692222, 0.050741716... | 20.0 | 0.335432 |
| age | int | True | 17.0 | 90.0 | [17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.... | [0.0121643723593, 0.0171391576939, 0.021636908... | 73.0 | 0.098676 |
| workclass | string | True | 2.0 | 17.0 | [ Never-worked, ?, Private, Without-pay, S... | [0.000259578444606, 0.0565881009241, 0.6970200... | 9.0 | 0.408433 |
| fnlwgt | int | False | 12285.0 | 1455435.0 | [10841.85, 84442.5, 156600.0, 228757.5, 300915... | [0.135032407778, 0.256021445147, 0.33524045771... | 20.0 | 0.232395 |
| education | string | True | 4.0 | 13.0 | [ Preschool, 11th, Masters, 1st-4th, 7th-8... | [0.00162506206834, 0.0356610842775, 0.05349162... | 16.0 | 0.319646 |
| education-num | float | True | 1.0 | 16.0 | [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ... | [0.00126258198015, 0.00519061480728, 0.0104864... | 16.0 | 0.124321 |
| marital-status | string | True | 8.0 | 22.0 | [ Married-AF-spouse, Married-spouse-absent, ... | [0.000863049703817, 0.0126711388333, 0.4595739... | 7.0 | 0.217131 |
| occupation | string | True | 2.0 | 18.0 | [ Handlers-cleaners, Exec-managerial, ?, Te... | [0.0424963963483, 0.126421440393, 0.0559500293... | 15.0 | 0.424741 |
| relationship | string | True | 5.0 | 15.0 | [ Unmarried, Wife, Own-child, Other-relativ... | [0.106552137435, 0.0480223731522, 0.1547742708... | 6.0 | 0.231289 |
| race | string | True | 6.0 | 19.0 | [ Asian-Pac-Islander, Black, White, Amer-In... | [0.0322153124681, 0.0960342542563, 0.853807727... | 5.0 | 0.09625 |


##### Step 4: Save the dataset description

In [10]:
describer.dataset_description.to_csv(dataset_description_file)

### Generate synthetic data

##### Step 1: Initialize a SyntheticDataGenerator

In [11]:
generator = SyntheticDataGenerator()

Initialized a synthetic data generator.


##### Step 2: Generate a synthetic dataset with 10 rows

The values are sampled from the histograms in the dataset description file.

In [12]:
synthetic_dataset = generator.get_synthetic_data(dataset_description_file, N=10)
synthetic_dataset

|   | ID | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|----|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 12789 | 33.0 | Private | 149889 | 7th-8th | 9.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 526 | 3 | 45 | DLQjQUVBGDPMzzLsdecJ | <=50K |
| 1 | 18088 | 52.0 | Private | 156836 | Some-college | 11.0 | Married-civ-spouse | Craft-repair | Not-in-family | White | Male | 110 | 23 | 35 | LZFcUmAeoMcfUXKyANJg | <=50K |
| 2 | 9402 | 75.0 | Private | 158591 | 10th | 10.0 | Separated | Exec-managerial | Unmarried | White | Male | 5363 | 25 | 35 | bHWavujjTfbXItCYVnXt | <=50K |
| 3 | 6420 | 60.0 | Self-emp-not-inc | 12823 | Doctorate | 9.0 | Divorced | Machine-op-inspct | Not-in-family | White | Male | 856 | 10 | 45 | nmtaaEBlvFAZqmCIPEWz | <=50K |
| 4 | 25718 | 49.0 | Private | 148689 | Masters | 9.0 | Married-civ-spouse | Sales | Husband | Other | Male | 319 | 41 | 25 | XubOscUffOyScseMczRv | <=50K |
| 5 | 1194 | 30.0 | Private | 290062 | HS-grad | 9.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 364 | 13 | 45 | cVqizVzdfTOKqnuoEnIp | >50K |
| 6 | 10881 | 61.0 | ? | 163695 | Doctorate | 13.0 | Married-civ-spouse | Exec-managerial | Unmarried | White | Male | 206 | 47 | 35 | dPUAXXyECQYAXXKcxXsZ | <=50K |
| 7 | 26364 | 39.0 | Private | 157007 | 5th-6th | 9.0 | Married-civ-spouse | Adm-clerical | Not-in-family | White | Female | 146 | 23 | 43 | buqygXyTpFJamretaudu | >50K |
| 8 | 7777 | 39.0 | Self-emp-not-inc | 158268 | Bachelors | 14.0 | Never-married | Prof-specialty | Own-child | White | Male | 65 | 38 | 58 | wzCLTcuaqSBmzlFLorpd | <=50K |
| 9 | 31215 | 45.0 | Private | 296927 | HS-grad | 9.0 | Married-civ-spouse | Sales | Husband | Black | Male | 964 | 4 | 35 | JiOPyRHkNZAfBiUgBlRM | <=50K |
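
The sampling step can be pictured with a short standard-library sketch. It is only an approximation of the idea, not the generator's real code: the three function names are hypothetical, drawing uniformly inside a numeric bin is an assumption, and the random-letter strings merely imitate what the "native-country" column looks like once it is marked non-categorical.

```python
import random
import string


def sample_from_histogram(values, probabilities, n, rng=random):
    """Draw n items from `values`, weighted by `probabilities`."""
    return rng.choices(values, weights=probabilities, k=n)


def sample_within_bins(bin_edges, probabilities, bin_width, n, rng=random):
    """Pick a bin by its probability, then a uniform point inside that bin."""
    lefts = rng.choices(bin_edges, weights=probabilities, k=n)
    return [left + rng.random() * bin_width for left in lefts]


def sample_random_string(min_length, max_length, rng=random):
    """Random letters with a length between the described min and max."""
    length = rng.randint(min_length, max_length)
    return ''.join(rng.choice(string.ascii_letters) for _ in range(length))
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which helps when comparing synthetic datasets across runs.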


##### Step 3: Insert random missing values

Values are removed at random, in proportion to the missing rates in the dataset description.

In [13]:
generator.random_missing_on_dataset_as_description()
generator.synthetic_dataset

|   | ID | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|----|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 12789.0 | 33.0 | Private |  | 7th-8th | 9.0 | Married-civ-spouse |  | Husband | White | Male | 526.0 |  | 45.0 | DLQjQUVBGDPMzzLsdecJ | <=50K |
| 1 | 18088.0 | 52.0 |  | 156836.0 | Some-college | 11.0 | Married-civ-spouse | Craft-repair | Not-in-family | White | Male | 110.0 | 23.0 | 35.0 | LZFcUmAeoMcfUXKyANJg | <=50K |
| 2 | 9402.0 | 75.0 | Private | 158591.0 | 10th | 10.0 | Separated |  | Unmarried | White | Male |  |  | 35.0 | bHWavujjTfbXItCYVnXt |  |
| 3 |  | 60.0 | Self-emp-not-inc | 12823.0 | Doctorate | 9.0 |  | Machine-op-inspct | Not-in-family | White | Male | 856.0 | 10.0 | 45.0 | nmtaaEBlvFAZqmCIPEWz |  |
| 4 | 25718.0 | 49.0 | Private | 148689.0 |  | 9.0 | Married-civ-spouse | Sales | Husband | Other |  | 319.0 | 41.0 | 25.0 | XubOscUffOyScseMczRv |  |
| 5 | 1194.0 | 30.0 |  | 290062.0 |  | 9.0 | Married-civ-spouse | Exec-managerial |  | White |  | 364.0 | 13.0 | 45.0 | cVqizVzdfTOKqnuoEnIp | >50K |
| 6 |  | 61.0 | ? | 163695.0 | Doctorate | 13.0 | Married-civ-spouse |  | Unmarried | White | Male | 206.0 |  | 35.0 | dPUAXXyECQYAXXKcxXsZ | <=50K |
| 7 | 26364.0 | 39.0 | Private | 157007.0 | 5th-6th | 9.0 | Married-civ-spouse |  |  | White |  |  | 23.0 | 43.0 | buqygXyTpFJamretaudu | >50K |
| 8 |  | 39.0 |  |  |  |  |  | Prof-specialty | Own-child | White |  | 65.0 |  | 58.0 | wzCLTcuaqSBmzlFLorpd | <=50K |
| 9 | 31215.0 | 45.0 |  | 296927.0 | HS-grad | 9.0 | Married-civ-spouse | Sales | Husband | Black | Male | 964.0 | 4.0 | 35.0 | JiOPyRHkNZAfBiUgBlRM |  |
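
One simple way to reproduce a column's missing rate is to blank out each cell independently with that probability. The sketch below shows the idea on a plain list; `apply_random_missing` is a hypothetical helper, and whether datafaker masks cells independently like this is an assumption.

```python
import random


def apply_random_missing(column, missing_rate, rng=random):
    """Replace each value with None with probability `missing_rate`."""
    return [None if rng.random() < missing_rate else v for v in column]
```

With only 10 rows the observed share of missing cells can differ noticeably from the described rate; the two converge as the number of generated rows grows.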


##### Step 4: Save the synthetic dataset

In [14]:
synthetic_dataset.to_csv(synthetic_data_file, index=False)