# Create synthetic data

<h5>To generate the data, we will use SDV (Synthetic Data Vault). SDV generates synthetic data by applying mathematical techniques and machine learning models, including deep learning models.</h5>
<b>Even if the data contains multiple data types and missing values, SDV will handle it, so we only need to provide the data.</b>

To use the SDMetrics library, you’ll need:

-Real data, loaded as a pandas DataFrame <br>
-Synthetic data, loaded as a pandas DataFrame <br>
-Metadata, represented as a dictionary
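
As a rough illustration of the third item, the sketch below shows what a single-table metadata dictionary might look like. The `columns`/`sdtype` layout follows the format described in recent SDMetrics documentation and is an assumption here: the older SDV release used in this notebook produces its own layout through `model.get_metadata().to_dict()`, and the key names may differ. The column names come from the dataset; the assigned types and the helper `columns_of_type` are illustrative.

```python
# Hypothetical metadata for a few columns of the loan dataset.
# The key names ("columns", "sdtype") are assumed from recent SDMetrics
# documentation and may differ in the library version used in this notebook.
example_metadata = {
    "columns": {
        "contract_type": {"sdtype": "categorical"},
        "gender": {"sdtype": "categorical"},
        "num_children": {"sdtype": "numerical"},
        "annual_income": {"sdtype": "numerical"},
        "credit_amount": {"sdtype": "numerical"},
    }
}


def columns_of_type(metadata, sdtype):
    """Return the names of the columns declared with the given sdtype."""
    return [
        name
        for name, spec in metadata["columns"].items()
        if spec["sdtype"] == sdtype
    ]


print(columns_of_type(example_metadata, "categorical"))
```

In the notebook itself the dictionary is not written by hand; it is obtained from the fitted model.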

In [30]:
import pandas as pd
import os 
import numpy as np
from sdv.tabular import CTGAN, GaussianCopula, CopulaGAN
from sdv.evaluation import evaluate
from sdmetrics.reports.single_table import QualityReport
import warnings
warnings.filterwarnings('ignore')

In [2]:
infringement_path = os.path.join('dataset', 'infringement_dataset.csv')
description_path = os.path.join('dataset', 'columns_description.csv')

### Load data

We will use a random 10% sample of the original data (30,750 rows) to generate synthetic data, since the full dataset is too large.

In [3]:
data = pd.read_csv(infringement_path)
data = data.sample(int(data.shape[0]/10))

In [4]:
to_drop=["address",
"appendix_a", 
"appendix_b", 
"appendix_c",
"appendix_d",
"appendix_e",  
"appendix_f",  
"appendix_g",  
"appendix_h",  
"appendix_i",  
"appendix_j",  
"appendix_k",  
"appendix_l",  
"appendix_m",  
"appendix_n",  
"appendix_o",
"appendix_p",
"appendix_q",  
"appendix_r",  
"appendix_s",  
"appendix_t",  
"car_age",
"first_name",
"last_name",
"num_req_bureau_day",
"num_req_bureau_hour", 
"num_req_bureau_month", 
"num_req_bureau_qrt", 
"num_req_bureau_week", 
"num_req_bureau_year", 
"provided_email", 
"provided_homephone",
"provided_mobilephone",
"provided_workphone",
"region_rating",
"score_ext_1",
"score_ext_2",
"score_ext_3"]

In [5]:
data = data.drop(to_drop, axis=1)
data

Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
196414,327738,0,Cash loans,M,Y,Y,0,247500.0,521280.0,28408.5,...,327738.0,1172.250000,22510.575000,675000.000000,704925.000000,2.0,0.0,2.0,0.0,4.0
195685,326907,0,Cash loans,M,Y,Y,0,211500.0,490495.5,46701.0,...,326907.0,270.611111,46616.238000,556447.500000,600227.500000,3.0,9.0,6.0,0.0,18.0
101943,218354,0,Cash loans,F,N,Y,0,67500.0,339241.5,12919.5,...,218354.0,1079.000000,8204.827500,106148.250000,105806.250000,2.0,0.0,0.0,0.0,2.0
68974,179993,1,Cash loans,M,Y,Y,0,202500.0,206280.0,9747.0,...,179993.0,844.600000,23360.499000,224027.100000,361437.300000,3.0,0.0,2.0,0.0,5.0
145114,268266,0,Cash loans,F,N,Y,0,247500.0,701149.5,25312.5,...,268266.0,938.500000,8144.048571,79973.437500,99056.250000,7.0,1.0,0.0,0.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112765,230805,0,Cash loans,M,Y,N,0,360000.0,687600.0,18265.5,...,230805.0,237.000000,12689.257500,81900.000000,85036.500000,2.0,0.0,0.0,0.0,2.0
9958,111588,0,Cash loans,F,N,Y,0,180000.0,675000.0,50598.0,...,111588.0,902.500000,13285.912500,154365.750000,171400.500000,2.0,0.0,0.0,0.0,2.0
215770,350014,0,Cash loans,F,N,Y,0,112500.0,204768.0,10089.0,...,350014.0,1242.571429,4747.365000,26703.642857,26697.857143,2.0,2.0,3.0,0.0,7.0
131904,252983,0,Cash loans,F,Y,Y,0,157500.0,543037.5,30451.5,...,252983.0,646.200000,10421.145000,99509.400000,108619.200000,4.0,1.0,0.0,0.0,5.0


### Create and train model - CTGAN

In [6]:
%%time
model = CTGAN(epochs=15, generator_dim=(256, 256), discriminator_dim=(256, 256), verbose=True)
model.fit(data)

Epoch 1, Loss G: -0.0577,Loss D:  0.1893
Epoch 2, Loss G: -0.7891,Loss D:  0.3373
Epoch 3, Loss G: -1.7846,Loss D:  0.0902
Epoch 4, Loss G: -2.1582,Loss D: -0.1335
Epoch 5, Loss G: -2.8466,Loss D: -0.1148
Epoch 6, Loss G: -2.9831,Loss D: -0.5848
Epoch 7, Loss G: -2.9823,Loss D:  0.3195
Epoch 8, Loss G: -2.1031,Loss D: -0.6593
Epoch 9, Loss G: -1.7996,Loss D: -0.5207
Epoch 10, Loss G: -0.8074,Loss D: -0.8476
Epoch 11, Loss G: -0.9816,Loss D: -0.5976
Epoch 12, Loss G: -1.4110,Loss D: -0.1036
Epoch 13, Loss G: -2.0254,Loss D: -0.3797
Epoch 14, Loss G: -1.8664,Loss D: -0.0762
Epoch 15, Loss G: -1.5873,Loss D:  0.0406


<h4>After fitting the model, we will use it to generate the new data</h4>

In [7]:
%%time
synthetic_data = model.sample(num_rows=data.shape[0])
synthetic_data

Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,425664,0,Cash loans,F,N,Y,0,98826.46,567863.0,10123.8,...,,757.0,,,3038.0,3.0,1.0,1.0,,
1,104095,1,Cash loans,M,N,Y,0,229160.29,241513.0,15761.8,...,102662.0,458.0,17235.0,,94985.0,13.0,0.0,,0.0,1.0
2,131219,0,Cash loans,F,N,Y,0,59774.21,1131696.0,2317.5,...,275943.0,,,,,,,,,
3,456245,0,Cash loans,F,N,Y,0,281631.78,597777.0,8384.8,...,,,15741.0,,312741.0,,1.0,,0.0,
4,327771,0,Revolving loans,F,N,N,0,358414.61,439562.0,24854.7,...,283775.0,1876.0,8030.0,37163.0,47408.0,4.0,0.0,,0.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,200044,0,Cash loans,M,N,Y,0,175331.30,80732.0,30011.4,...,,1857.0,14273.0,50124.0,40207.0,3.0,0.0,0.0,0.0,1.0
30747,314306,1,Revolving loans,M,Y,Y,1,161848.30,1266956.0,24760.4,...,361184.0,,1557.0,,,,1.0,1.0,,
30748,391458,0,Cash loans,F,N,Y,0,225834.04,454779.0,25955.3,...,140840.0,524.0,8946.0,80185.0,0.0,2.0,0.0,0.0,0.0,2.0
30749,186809,0,Cash loans,F,N,Y,0,100057.58,736310.0,18079.1,...,456245.0,1501.0,7819.0,121962.0,148951.0,10.0,0.0,0.0,0.0,5.0


In [14]:
save_path = os.path.join('dataset', 'synthetic_data_CTGAN.csv')
synthetic_data.to_csv(save_path, index=False)

#### Evaluate results

In [8]:
model_score = evaluate(synthetic_data, data)
model_score

0.9039802727802406

<h4>Computing performance score</h4>

This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [9]:
report = QualityReport()

report.generate(data, synthetic_data, model.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:05<00:00,  1.26s/it]



Overall Quality Score: 87.65%

Properties:
Column Shapes: 88.38%
Column Pair Trends: 86.92%
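
The overall score printed by the report is consistent with a simple average of the two property scores. The snippet below checks that arithmetic using the values from the output; the variable names are illustrative.

```python
# Property scores copied from the CTGAN quality report output (in percent).
column_shapes = 88.38
column_pair_trends = 86.92

# Assumption: the overall score is the plain mean of the property scores.
overall = (column_shapes + column_pair_trends) / 2
print(f"Overall Quality Score: {overall:.2f}%")
```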


In [26]:
details = report.get_details(property_name='Column Shapes')
details

Unnamed: 0,Column,Metric,Quality Score
0,loan_id,KSComplement,0.958148
1,infringed,KSComplement,0.964359
2,num_children,KSComplement,0.830087
3,annual_income,KSComplement,0.828168
4,credit_amount,KSComplement,0.778739
5,credit_annuity,KSComplement,0.865296
6,goods_valuation,KSComplement,0.800377
7,age,KSComplement,0.95005
8,days_employed,KSComplement,0.660434
9,mobilephone_reachable,KSComplement,0.973789
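
The `KSComplement` metric in this table is, as far as the SDMetrics documentation describes it, one minus the two-sample Kolmogorov-Smirnov statistic, so a score of 1.0 means the real and synthetic marginal distributions coincide. The following is a minimal pure-Python sketch of that idea on toy lists. It is not the library's implementation (for example, it does not handle missing values), and the function name `ks_complement` is illustrative.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - max |ECDF_real(x) - ECDF_synth(x)| over all observed x."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        # Empirical CDF: fraction of samples less than or equal to x.
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


real_values = [1, 2, 3, 4, 5]
print(ks_complement(real_values, real_values))        # identical samples
print(ks_complement(real_values, [6, 7, 8, 9, 10]))   # non-overlapping samples
```

Identical samples give the maximum score, and samples with no overlap give the minimum, which is why a low value such as the one for `days_employed` signals a poorly reproduced marginal distribution.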


In [29]:
print('Column with more quality',details[details['Quality Score'] == details['Quality Score'].max()]['Column'], details['Quality Score'].max())
print('Column with less quality',details[details['Quality Score'] == details['Quality Score'].min()]['Column'], details['Quality Score'].min())

Column with more quality 9    mobilephone_reachable
Name: Column, dtype: object 0.9737894702611297
Column with less quality 8    days_employed
Name: Column, dtype: object 0.6604338070306657


In [13]:
# report.get_visualization(property_name='Column Pair Trends')

### Create and train model - GaussianCopula

In [37]:
%%time
model = GaussianCopula()
model.fit(data)

<h4>After fitting the model, we will use it to generate the new data</h4>

In [38]:
%%time
synthetic_data = model.sample(num_rows=data.shape[0])
synthetic_data

Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,280791,1,Cash loans,F,Y,N,1,96070.00,223933.0,8479.6,...,283323.0,201.0,6317.0,76671.0,70550.0,0.0,1.0,2.0,0.0,2.0
1,171681,0,Cash loans,M,Y,Y,1,185842.42,470031.0,19671.1,...,140655.0,1524.0,1136.0,11920.0,12789.0,0.0,0.0,0.0,0.0,1.0
2,222722,1,Cash loans,M,N,N,2,32982.04,105907.0,10586.7,...,224839.0,577.0,1006.0,5686.0,3619.0,1.0,2.0,2.0,0.0,3.0
3,153325,0,Cash loans,F,N,Y,1,183587.91,1262320.0,52806.9,...,146415.0,319.0,28671.0,300159.0,350946.0,5.0,1.0,1.0,0.0,10.0
4,416971,0,Cash loans,F,N,Y,1,51854.61,721050.0,27940.6,...,423896.0,648.0,12856.0,60128.0,126373.0,4.0,2.0,1.0,0.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,344170,0,Cash loans,M,N,Y,0,247449.89,148552.0,9904.0,...,402068.0,335.0,14926.0,191948.0,184883.0,1.0,4.0,4.0,1.0,12.0
30747,395230,0,Revolving loans,F,N,Y,1,186531.18,119894.0,8685.8,...,345403.0,951.0,13506.0,189647.0,292202.0,8.0,1.0,2.0,0.0,12.0
30748,267886,0,Cash loans,M,Y,N,0,358300.26,492819.0,32972.1,...,,,,,,,,,,
30749,274928,0,Cash loans,F,Y,Y,0,389566.16,909732.0,29359.6,...,311163.0,1161.0,31930.0,360537.0,366895.0,6.0,0.0,1.0,0.0,4.0


In [39]:
save_path = os.path.join('dataset', 'synthetic_data_GaussianCopula.csv')
synthetic_data.to_csv(save_path, index=False)

#### Evaluate results

In [40]:
model_score = evaluate(synthetic_data, data)
model_score

0.904552973913539

<h4>Computing performance score</h4>

This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [41]:
report = QualityReport()

report.generate(data, synthetic_data, model.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:04<00:00,  1.13s/it]



Overall Quality Score: 90.71%

Properties:
Column Shapes: 88.59%
Column Pair Trends: 92.83%
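
For pairs of numerical columns, the Column Pair Trends property is described in the SDMetrics documentation as a correlation similarity: the correlation of the pair is computed in both datasets, and the score is one minus half of the absolute difference. The sketch below reproduces that formula on toy data with a hand-written Pearson correlation. The helper names and the toy values are illustrative, and the library's own implementation may differ in details such as missing-value handling.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_similarity(real_x, real_y, synth_x, synth_y):
    """Score in [0, 1]: 1 means both datasets show the same correlation."""
    gap = abs(pearson(real_x, real_y) - pearson(synth_x, synth_y))
    return 1.0 - gap / 2.0


# Toy columns: the real pair rises together, the second synthetic pair does not.
real_a, real_b = [1, 2, 3, 4], [2, 4, 6, 8]
same_trend = correlation_similarity(real_a, real_b, [1, 2, 3, 4], [3, 5, 7, 9])
opposite_trend = correlation_similarity(real_a, real_b, [1, 2, 3, 4], [8, 6, 4, 2])
print(same_trend, opposite_trend)
```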


In [42]:
details = report.get_details(property_name='Column Shapes')
details

Unnamed: 0,Column,Metric,Quality Score
0,loan_id,KSComplement,0.96966
1,infringed,KSComplement,0.999675
2,num_children,KSComplement,0.758382
3,annual_income,KSComplement,0.870638
4,credit_amount,KSComplement,0.862118
5,credit_annuity,KSComplement,0.937792
6,goods_valuation,KSComplement,0.882274
7,age,KSComplement,0.979968
8,days_employed,KSComplement,0.258593
9,mobilephone_reachable,KSComplement,0.998049


In [43]:
print('Column with more quality',details[details['Quality Score'] == details['Quality Score'].max()]['Column'], details['Quality Score'].max())
print('Column with less quality',details[details['Quality Score'] == details['Quality Score'].min()]['Column'], details['Quality Score'].min())

Column with more quality 1    infringed
Name: Column, dtype: object 0.999674807323339
Column with less quality 8    days_employed
Name: Column, dtype: object 0.25859321648076483


In [None]:
# report.get_visualization(property_name='Column Pair Trends')

### Create and train model - CopulaGAN

In [44]:
%%time
model = CopulaGAN(epochs=15, generator_dim=(256, 256), discriminator_dim=(256, 256), verbose=True)
model.fit(data)

Epoch 1, Loss G:  0.3631,Loss D:  0.2572
Epoch 2, Loss G: -0.3452,Loss D:  0.2135
Epoch 3, Loss G: -0.5327,Loss D: -0.1284
Epoch 4, Loss G: -0.0350,Loss D: -0.1737
Epoch 5, Loss G: -0.1543,Loss D: -0.1426
Epoch 6, Loss G: -0.5435,Loss D: -0.1882
Epoch 7, Loss G:  0.2884,Loss D: -0.6318
Epoch 8, Loss G: -0.0091,Loss D: -1.2342
Epoch 9, Loss G:  0.7239,Loss D: -1.0191
Epoch 10, Loss G: -0.2397,Loss D: -0.9370
Epoch 11, Loss G: -0.2841,Loss D: -0.9426
Epoch 12, Loss G: -0.7649,Loss D: -0.8763
Epoch 13, Loss G: -1.4604,Loss D: -0.4487
Epoch 14, Loss G: -1.1563,Loss D: -0.3481
Epoch 15, Loss G: -2.2597,Loss D: -0.5252


<h4>After fitting the model, we will use it to generate the new data</h4>

In [45]:
%%time
synthetic_data = model.sample(num_rows=data.shape[0])
synthetic_data

Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,372125,0,Revolving loans,F,Y,N,0,76557.37,837027.0,4418.9,...,404699.0,1360.0,10577.0,193794.0,115090.0,,1.0,1.0,0.0,2.0
1,131459,0,Cash loans,F,Y,Y,1,134749.09,904353.0,3440.1,...,172110.0,712.0,8046.0,228013.0,100852.0,2.0,0.0,0.0,0.0,1.0
2,412971,0,Cash loans,F,N,N,0,219420.39,1520582.0,35131.5,...,361151.0,991.0,13568.0,355372.0,31389.0,1.0,0.0,0.0,0.0,3.0
3,106675,0,Cash loans,M,Y,N,0,230607.66,437167.0,32611.2,...,286434.0,,,,163470.0,3.0,1.0,,0.0,5.0
4,348765,1,Revolving loans,M,Y,N,1,156360.95,649430.0,36717.1,...,179506.0,2493.0,9575.0,76728.0,203153.0,7.0,0.0,4.0,0.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,162102,0,Cash loans,M,N,N,0,161527.61,858750.0,23078.6,...,,,,,,,,,,
30747,249768,0,Cash loans,M,Y,N,0,133195.51,837674.0,23591.5,...,277536.0,895.0,,,,,,,0.0,
30748,117092,0,Cash loans,M,Y,N,0,200624.41,763552.0,11715.0,...,,,,,,,,,,
30749,227473,0,Cash loans,F,Y,N,0,128608.41,441662.0,14901.2,...,155473.0,701.0,3639.0,134597.0,376787.0,2.0,3.0,0.0,0.0,1.0


In [46]:
save_path = os.path.join('dataset', 'synthetic_data_CopulaGAN.csv')
synthetic_data.to_csv(save_path, index=False)

#### Evaluate results

In [47]:
model_score = evaluate(synthetic_data, data)
model_score

0.9182610881227382

<h4>Computing performance score</h4>

This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [48]:
report = QualityReport()

report.generate(data, synthetic_data, model.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:05<00:00,  1.26s/it]



Overall Quality Score: 88.82%

Properties:
Column Shapes: 90.42%
Column Pair Trends: 87.21%


In [49]:
details = report.get_details(property_name='Column Shapes')
details

Unnamed: 0,Column,Metric,Quality Score
0,loan_id,KSComplement,0.949368
1,infringed,KSComplement,0.990277
2,num_children,KSComplement,0.904101
3,annual_income,KSComplement,0.871939
4,credit_amount,KSComplement,0.912523
5,credit_annuity,KSComplement,0.837791
6,goods_valuation,KSComplement,0.844673
7,age,KSComplement,0.900719
8,days_employed,KSComplement,0.812364
9,mobilephone_reachable,KSComplement,0.998049


In [50]:
print('Column with more quality',details[details['Quality Score'] == details['Quality Score'].max()]['Column'], details['Quality Score'].max())
print('Column with less quality',details[details['Quality Score'] == details['Quality Score'].min()]['Column'], details['Quality Score'].min())

Column with more quality 9    mobilephone_reachable
Name: Column, dtype: object 0.9980488439400345
Column with less quality 14    past_avg_amt_application
Name: Column, dtype: object 0.7247459808181302


In [None]:
# report.get_visualization(property_name='Column Pair Trends')

## Result analysis

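The overall quality scores of the three synthetic datasets are 87.65% (CTGAN), 90.71% (GaussianCopula), and 88.82% (CopulaGAN).
These results are good, and the per-property scores tell a similar story: Column Shapes ranges from 88.38% to 90.42%, and Column Pair Trends from 86.92% to 92.83%. The per-column scores are less uniform: `days_employed` scores only 0.26 with GaussianCopula and 0.66 with CTGAN, while the weakest column for CopulaGAN (`past_avg_amt_application`) still reaches 0.72. Overall, the three models perform similarly, so we can conclude that the synthetic data is of good quality.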
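
To make the comparison concrete, the scores printed by the three quality reports can be collected in one place and ranked. The snippet below uses only the numbers reported in this notebook; the dictionary layout and the helper `best_model` are illustrative.

```python
# Scores (in percent) copied from the three QualityReport outputs above.
scores = {
    "CTGAN": {"overall": 87.65, "column_shapes": 88.38, "column_pair_trends": 86.92},
    "GaussianCopula": {"overall": 90.71, "column_shapes": 88.59, "column_pair_trends": 92.83},
    "CopulaGAN": {"overall": 88.82, "column_shapes": 90.42, "column_pair_trends": 87.21},
}


def best_model(prop):
    """Return the model name with the highest score for the given property."""
    return max(scores, key=lambda name: scores[name][prop])


for prop in ("overall", "column_shapes", "column_pair_trends"):
    winner = best_model(prop)
    print(f"Best {prop}: {winner} ({scores[winner][prop]:.2f}%)")
```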

<b>[Pros and Cons of synthetic data](https://www.analyticssteps.com/blogs/what-synthetic-data-types-advantages-and-disadvantages)</b><br><br>
<b>Advantages:</b><br><br>
<ul>
  <li><b>Data quality</b> - Synthetic data can offer higher data quality, balance, and variety. Artificially created data can apply labels and automatically fill in missing values, allowing for more precise prediction;<br><br></li>
  <li><b>Scalability</b> - Synthetic data can be used to cover the gaps left by real-world data;<br><br></li>
  <li><b>Ease of use</b> - Synthetic data gives all records a consistent format and labelling, getting rid of errors and duplicates.<br><br></li>
</ul>
<b> Disadvantages:</b><br><br>
<ul>
  <li>Outliers are challenging to map: synthetic data only approximates real-world data and is not a duplicate of it. Therefore, some outliers present in the original data may not be covered by the synthetic data;<br><br></li>
  <li>The quality of the model depends on the quality of the data source.<br><br></li>
</ul>

