# Create Synthetic data 

<h5>To generate the data we are going to use SDV, the Synthetic Data Vault. SDV generates synthetic data by applying mathematical techniques and machine learning models, including deep learning models.</h5>
<b>Even if the data contains multiple data types and missing values, SDV will handle it, so we only need to provide the data.</b>

To use the SDMetrics library, you’ll need:

- Real data, loaded as a pandas DataFrame
- Synthetic data, loaded as a pandas DataFrame
- Metadata, represented as a dictionary
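
As a rough illustration of the third item, the metadata dictionary describes the type of each column. The sketch below assumes the legacy SDV layout (a `fields` mapping with `type` and `subtype` keys), which is what the `sdv.tabular` models used in this notebook are expected to produce. Newer SDMetrics releases use different key names, so treat the exact keys as an assumption. The helper `columns_of_type` is hypothetical and only there to show how the dictionary can be read.

```python
# Hypothetical, hand-written metadata for three columns of the dataset.
# Assumption: legacy SDV layout ("fields" -> column name -> "type"/"subtype").
metadata = {
    "fields": {
        "infringed": {"type": "numerical", "subtype": "integer"},
        "annual_income": {"type": "numerical", "subtype": "float"},
        "contract_type": {"type": "categorical"},
    }
}


def columns_of_type(meta, wanted_type):
    """Return the column names whose declared type equals wanted_type."""
    return [
        name
        for name, spec in meta["fields"].items()
        if spec["type"] == wanted_type
    ]


numerical_columns = columns_of_type(metadata, "numerical")
categorical_columns = columns_of_type(metadata, "categorical")
print("numerical:", numerical_columns)
print("categorical:", categorical_columns)
```

In the notebook itself the dictionary is not written by hand; it is taken from the fitted model with `get_metadata().to_dict()`.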

In [None]:
! pip install sdv

In [34]:
import pandas as pd
import os 
import numpy as np
from sdv.tabular import CTGAN, GaussianCopula, CopulaGAN
from sdv.evaluation import evaluate
from sdmetrics.reports.single_table import QualityReport
from sdmetrics.reports.utils import get_column_plot
import warnings
warnings.filterwarnings('ignore')

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [19]:
data_drive_path = os.path.join('drive', 'MyDrive', 'Colab Notebooks', 'SP-project')

In [20]:
infringement_path = os.path.join(data_drive_path, 'dataset', 'infringement_dataset.csv')

### Load data

We keep only a 10% random sample of the original data (30,750 rows in this run) to generate the synthetic data, since the full dataset is too large.

In [21]:
data = pd.read_csv(infringement_path)
data = data.sample(int(data.shape[0]/10))

In [22]:
to_drop=["address",
"appendix_a", 
"appendix_b", 
"appendix_c",
"appendix_d",
"appendix_e",  
"appendix_f",  
"appendix_g",  
"appendix_h",  
"appendix_i",  
"appendix_j",  
"appendix_k",  
"appendix_l",  
"appendix_m",  
"appendix_n",  
"appendix_o",
"appendix_p",
"appendix_q",  
"appendix_r",  
"appendix_s",  
"appendix_t",  
"car_age",
"first_name",
"last_name",
"num_req_bureau_day",
"num_req_bureau_hour", 
"num_req_bureau_month", 
"num_req_bureau_qrt", 
"num_req_bureau_week", 
"num_req_bureau_year", 
"provided_email", 
"provided_homephone",
"provided_mobilephone",
"provided_workphone",
"region_rating",
"score_ext_1",
"score_ext_2",
"score_ext_3"]

In [23]:
data = data.drop(to_drop, axis=1)
data

Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
283559,428406,0,Cash loans,F,Y,Y,0,315000.0,942300.0,27679.5,...,428406.0,326.000000,16338.465000,150250.500000,138991.500000,1.0,0.0,0.0,0.0,1.0
126426,246616,0,Cash loans,F,N,Y,0,81000.0,239850.0,23494.5,...,246616.0,962.500000,11128.657500,125401.500000,113925.000000,6.0,0.0,0.0,0.0,6.0
16138,118822,0,Cash loans,F,Y,N,0,360000.0,1885536.0,95782.5,...,118822.0,1522.800000,11904.963750,57593.700000,57639.600000,4.0,1.0,0.0,0.0,5.0
219598,354404,0,Cash loans,F,N,N,0,202500.0,654498.0,27859.5,...,354404.0,1008.857143,13258.701000,139075.714286,152485.714286,5.0,2.0,0.0,0.0,7.0
246579,385372,1,Cash loans,M,N,N,1,135000.0,545040.0,25407.0,...,385372.0,917.000000,4968.315000,20340.000000,24165.000000,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
192961,323769,0,Cash loans,F,N,N,2,315000.0,900000.0,46084.5,...,323769.0,257.666667,23207.242500,403276.500000,452808.000000,2.0,1.0,0.0,0.0,3.0
192,100224,0,Cash loans,F,Y,Y,2,225000.0,1256400.0,36864.0,...,100224.0,1580.333333,9844.170000,76663.500000,75502.500000,1.0,0.0,2.0,0.0,3.0
218401,353036,0,Cash loans,F,N,N,0,76500.0,199080.0,11556.0,...,353036.0,1866.200000,8890.096500,84286.350000,85577.850000,8.0,0.0,2.0,0.0,10.0
24236,128192,0,Cash loans,M,N,Y,0,157500.0,180000.0,12798.0,...,128192.0,865.600000,11702.796429,91063.530000,92269.980000,6.0,1.0,1.0,2.0,10.0


### Create and train model - CTGAN

In [24]:
%%time
model_ctgan = CTGAN(epochs=15, generator_dim=(256, 256), discriminator_dim=(256, 256), verbose=True)
model_ctgan.fit(data)

Epoch 1, Loss G:  0.2502,Loss D: -0.0510
Epoch 2, Loss G: -0.1003,Loss D: -0.0557
Epoch 3, Loss G: -0.2420,Loss D:  0.0630
Epoch 4, Loss G: -0.4728,Loss D:  0.0130
Epoch 5, Loss G: -0.9434,Loss D:  0.0150
Epoch 6, Loss G: -0.9446,Loss D:  0.0335
Epoch 7, Loss G: -0.7393,Loss D: -0.4376
Epoch 8, Loss G: -1.1872,Loss D: -0.4655
Epoch 9, Loss G:  0.2402,Loss D: -0.3041
Epoch 10, Loss G: -0.0918,Loss D: -0.8371
Epoch 11, Loss G: -0.5632,Loss D: -0.7223
Epoch 12, Loss G: -0.8689,Loss D: -0.6275
Epoch 13, Loss G: -1.3744,Loss D: -0.8704
Epoch 14, Loss G: -1.7406,Loss D: -0.0588
Epoch 15, Loss G: -2.2546,Loss D: -0.4588
CPU times: user 1min 34s, sys: 35.6 s, total: 2min 9s
Wall time: 2min 7s


<h4>After fitting the model, we use it to generate new data</h4>

In [25]:
%%time
synthetic_data_ctgan = model_ctgan.sample(num_rows=data.shape[0])
synthetic_data_ctgan

CPU times: user 1.89 s, sys: 52.3 ms, total: 1.94 s
Wall time: 1.95 s


Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,390548,0,Cash loans,F,N,Y,1,58182.551,233305.0,,...,336724.0,407.0,25740.0,395161.0,6426.0,1.0,0.0,0.0,0.0,1.0
1,330113,0,Cash loans,M,N,Y,0,107003.520,822335.0,19090.5,...,162255.0,2771.0,,189215.0,29969.0,3.0,1.0,1.0,0.0,6.0
2,297794,0,Cash loans,M,Y,Y,1,121795.074,345963.0,8807.7,...,217134.0,1542.0,14976.0,48371.0,62283.0,5.0,0.0,0.0,0.0,1.0
3,404005,0,Cash loans,F,N,Y,2,80631.763,1033121.0,30204.1,...,271094.0,807.0,17295.0,102239.0,202750.0,1.0,0.0,0.0,0.0,1.0
4,333130,0,Cash loans,M,Y,Y,0,52335.909,673540.0,23721.8,...,260521.0,919.0,15835.0,170633.0,164465.0,3.0,,,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,220406,0,Cash loans,F,Y,N,0,71883.100,623980.0,9074.0,...,334679.0,774.0,14987.0,69499.0,146360.0,5.0,1.0,1.0,0.0,4.0
30747,350135,1,Cash loans,M,N,Y,0,182957.248,544839.0,,...,270603.0,500.0,88836.0,345920.0,155013.0,3.0,1.0,0.0,0.0,1.0
30748,263321,0,Cash loans,M,N,Y,0,125815.045,509627.0,6781.9,...,379500.0,1058.0,13129.0,35393.0,133468.0,6.0,0.0,0.0,0.0,1.0
30749,456247,0,Cash loans,F,N,Y,0,114489.036,381273.0,18246.2,...,358907.0,546.0,10666.0,235027.0,281237.0,6.0,0.0,0.0,0.0,5.0


In [27]:
save_path = os.path.join(data_drive_path, 'dataset', 'synthetic_data_CTGAN.csv')
synthetic_data_ctgan.to_csv(save_path, index=False)

#### Evaluate results

In [28]:
model_score = evaluate(synthetic_data_ctgan, data)
model_score

0.9259618593516252

<h4>Computing performance score</h4>

This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [29]:
report_ctgan = QualityReport()

report_ctgan.generate(data, synthetic_data_ctgan,model_ctgan.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:04<00:00,  1.23s/it]



Overall Quality Score: 89.61%

Properties:
Column Shapes: 90.29%
Column Pair Trends: 88.92%
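
The numbers above are consistent with the overall score being the plain average of the two property scores. A minimal check, assuming that is how the report combines them (the variable names are illustrative):

```python
# Property scores copied from the CTGAN report output above (in percent).
column_shapes = 90.29
column_pair_trends = 88.92
reported_overall = 89.61

# Assumption: the overall quality score is the unweighted mean of the properties.
overall = (column_shapes + column_pair_trends) / 2

# Compare with a small tolerance, since the report rounds to two decimals.
matches_report = abs(overall - reported_overall) < 0.01
print(f"computed: {overall:.3f}%, matches report: {matches_report}")
```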


In [30]:
details = report_ctgan.get_details(property_name='Column Shapes')
details

Unnamed: 0,Column,Metric,Quality Score
0,loan_id,KSComplement,0.94075
1,infringed,KSComplement,0.983383
2,num_children,KSComplement,0.934474
3,annual_income,KSComplement,0.922799
4,credit_amount,KSComplement,0.905206
5,credit_annuity,KSComplement,0.728931
6,goods_valuation,KSComplement,0.690744
7,age,KSComplement,0.912653
8,days_employed,KSComplement,0.831485
9,mobilephone_reachable,KSComplement,0.975871
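
The `KSComplement` metric listed for each numerical column is one minus the Kolmogorov-Smirnov statistic, that is, one minus the largest gap between the empirical CDFs of the real and the synthetic column. The pure-Python sketch below illustrates the idea on toy values. It is not the library's implementation, and the sample lists are made up.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - max |ECDF_real(x) - ECDF_synth(x)| over all observed values."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


real_values = [1, 2, 3, 4, 5]  # toy data, not from the notebook
identical = ks_complement(real_values, [1, 2, 3, 4, 5])
shifted = ks_complement(real_values, [11, 12, 13, 14, 15])
partial = ks_complement(real_values, [1, 2, 3, 4, 50])
print(identical, shifted, partial)
```

A score close to 1 means the two marginal distributions nearly coincide; a score close to 0 means they barely overlap.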


In [31]:
print('Column with more quality',details[details['Quality Score'] == details['Quality Score'].max()]['Column'], details['Quality Score'].max())
print('Column with less quality',details[details['Quality Score'] == details['Quality Score'].min()]['Column'], details['Quality Score'].min())

Column with more quality 21    contract_type
Name: Column, dtype: object 0.988000390231212
Column with less quality 6    goods_valuation
Name: Column, dtype: object 0.6907437225471798
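
`contract_type` is a categorical column, so its score does not come from `KSComplement`. SDMetrics scores categorical and boolean columns with `TVComplement`, which is one minus the total variation distance between the category frequencies. A small sketch of that formula, using made-up category lists rather than the notebook's data:

```python
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 - 0.5 * sum over categories of |freq_real - freq_synth|."""
    real_counts = Counter(real)
    synth_counts = Counter(synthetic)
    categories = set(real_counts) | set(synth_counts)
    # Counter returns 0 for categories missing from one of the two columns.
    distance = 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


# Toy columns: 90% vs. 80% "Cash loans".
real_types = ["Cash loans"] * 9 + ["Revolving loans"] * 1
synth_types = ["Cash loans"] * 8 + ["Revolving loans"] * 2
score = tv_complement(real_types, synth_types)
print(round(score, 4))
```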


In [32]:
report_ctgan.get_visualization(property_name='Column Pair Trends')
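
For pairs of numerical columns, the Column Pair Trends property reflects how close the pairwise correlations are in the real and the synthetic data. SDMetrics documents this as `CorrelationSimilarity`, computed as 1 - |r_real - r_synth| / 2. The sketch below applies that formula with a hand-written Pearson coefficient on toy values. The numbers and variable names are illustrative and not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_similarity(real_pair, synth_pair):
    """Return 1 - |r_real - r_synth| / 2 for one pair of numerical columns."""
    r_real = pearson(*real_pair)
    r_synth = pearson(*synth_pair)
    return 1.0 - abs(r_real - r_synth) / 2.0


income = [1.0, 2.0, 3.0, 4.0]  # toy values
credit = [2.0, 4.0, 6.0, 8.0]  # perfectly correlated with income

same_trend = correlation_similarity((income, credit), (income, credit))
opposite_trend = correlation_similarity((income, credit), (income, credit[::-1]))
print(same_trend, opposite_trend)
```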

In [36]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_ctgan,
    metadata=model_ctgan.get_metadata().to_dict(),
    column_name='contract_type'
)

fig.show()

In [37]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_ctgan,
    metadata=model_ctgan.get_metadata().to_dict(),
    column_name='goods_valuation'
)

fig.show()

### Create and train model - GaussianCopula

In [38]:
%%time
model_gauscopula = GaussianCopula()
model_gauscopula.fit(data)

CPU times: user 3.17 s, sys: 50.3 ms, total: 3.22 s
Wall time: 3.23 s


<h4>After fitting the model, we use it to generate new data</h4>

In [39]:
%%time
synthetic_data_gausscopula = model_gauscopula.sample(num_rows=data.shape[0])
synthetic_data_gausscopula

CPU times: user 1.01 s, sys: 142 ms, total: 1.15 s
Wall time: 1.03 s


Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,193504,0,Cash loans,F,N,Y,1,134362.438,492663.0,20674.6,...,172595.0,753.0,20684.0,118432.0,227021.0,2.0,1.0,1.0,0.0,1.0
1,365977,0,Cash loans,M,Y,Y,0,48971.592,1541532.0,29819.6,...,341595.0,891.0,45655.0,183844.0,362924.0,3.0,3.0,3.0,0.0,7.0
2,198816,0,Cash loans,F,N,Y,0,205138.373,1161482.0,30604.1,...,180855.0,733.0,19480.0,185888.0,239118.0,1.0,1.0,2.0,0.0,4.0
3,422608,0,Cash loans,F,N,Y,1,212666.941,381023.0,24015.0,...,368778.0,420.0,16718.0,211681.0,260201.0,3.0,1.0,4.0,0.0,6.0
4,329338,0,Cash loans,F,N,Y,1,135855.064,345070.0,22952.7,...,343838.0,68.0,52711.0,466926.0,570057.0,5.0,1.0,1.0,0.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,159078,0,Cash loans,F,Y,Y,1,319704.778,688138.0,41458.2,...,176518.0,1621.0,5938.0,67566.0,122392.0,3.0,0.0,0.0,1.0,3.0
30747,415951,0,Cash loans,F,N,Y,0,57791.830,262098.0,6735.8,...,417444.0,886.0,1406.0,5727.0,6916.0,4.0,1.0,1.0,0.0,5.0
30748,282187,0,Cash loans,M,Y,Y,0,98319.636,543168.0,27209.6,...,285712.0,536.0,1145.0,155.0,688.0,0.0,1.0,1.0,1.0,1.0
30749,407171,1,Cash loans,M,N,Y,1,226404.465,292615.0,7328.8,...,359791.0,38.0,12627.0,184561.0,170094.0,3.0,2.0,5.0,0.0,6.0


In [40]:
save_path = os.path.join(data_drive_path, 'dataset', 'synthetic_data_GaussianCopula.csv')
synthetic_data_gausscopula.to_csv(save_path, index=False)

#### Evaluate results

In [42]:
model_score_gauss = evaluate(synthetic_data_gausscopula, data)
model_score_gauss

0.9015870799206975

<h4>Computing performance score</h4>

This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [43]:
report_gausscopula = QualityReport()

report_gausscopula.generate(data, synthetic_data_gausscopula,model_gauscopula.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:04<00:00,  1.18s/it]



Overall Quality Score: 90.16%

Properties:
Column Shapes: 88.11%
Column Pair Trends: 92.21%


In [44]:
details_gausscopula = report_gausscopula.get_details(property_name='Column Shapes')
details_gausscopula


In [47]:
print('Column with more quality',details_gausscopula[details_gausscopula['Quality Score'] == details_gausscopula['Quality Score'].max()]['Column'], details_gausscopula['Quality Score'].max())
print('Column with less quality',details_gausscopula[details_gausscopula['Quality Score'] == details_gausscopula['Quality Score'].min()]['Column'], details_gausscopula['Quality Score'].min())

Column with more quality 1                infringed
9    mobilephone_reachable
Name: Column, dtype: object 0.9984715944196937
Column with less quality 8    days_employed
Name: Column, dtype: object 0.2606094110760625


In [46]:
report_gausscopula.get_visualization(property_name='Column Pair Trends')

In [48]:
from sdmetrics.reports.utils import get_column_plot

fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_gausscopula,
    metadata=model_gauscopula.get_metadata().to_dict(),
    column_name='infringed'
)

fig.show()

In [49]:
from sdmetrics.reports.utils import get_column_plot

fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_gausscopula,
    metadata=model_gauscopula.get_metadata().to_dict(),
    column_name='days_employed'
)

fig.show()

### Create and train model - CopulaGAN

In [50]:
%%time
model_copulagan = CopulaGAN(epochs=15, generator_dim=(256, 256), discriminator_dim=(256, 256), verbose=True)
model_copulagan.fit(data)

Epoch 1, Loss G:  0.9265,Loss D:  0.0935
Epoch 2, Loss G:  0.3005,Loss D:  0.0339
Epoch 3, Loss G:  0.0527,Loss D:  0.2966
Epoch 4, Loss G:  0.2026,Loss D: -0.1428
Epoch 5, Loss G: -0.4846,Loss D: -0.0646
Epoch 6, Loss G: -0.1811,Loss D: -0.1495
Epoch 7, Loss G:  0.3966,Loss D: -0.8832
Epoch 8, Loss G:  0.5739,Loss D: -1.0648
Epoch 9, Loss G:  0.3714,Loss D: -0.8372
Epoch 10, Loss G: -0.5608,Loss D: -0.7625
Epoch 11, Loss G:  0.3631,Loss D: -0.2006
Epoch 12, Loss G: -0.9812,Loss D:  0.0280
Epoch 13, Loss G: -1.0042,Loss D: -0.3290
Epoch 14, Loss G: -1.6039,Loss D: -0.2833
Epoch 15, Loss G: -1.7142,Loss D:  0.1150
CPU times: user 1min 36s, sys: 34.8 s, total: 2min 11s
Wall time: 2min 10s


<h4>After fitting the model, we use it to generate new data</h4>

In [51]:
%%time
synthetic_data_copulagan = model_copulagan.sample(num_rows=data.shape[0])
synthetic_data_copulagan

CPU times: user 2.53 s, sys: 52.8 ms, total: 2.58 s
Wall time: 2.72 s


Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,455856,0,Cash loans,M,Y,N,0,27192.135,89217.0,23723.2,...,,856.0,15245.0,,,,,1.0,0.0,
1,227417,0,Cash loans,F,N,Y,0,47953.616,300159.0,7272.8,...,,788.0,19940.0,292975.0,29858.0,3.0,0.0,1.0,0.0,4.0
2,338245,0,Cash loans,F,N,Y,0,106241.441,603118.0,21194.3,...,314625.0,688.0,8475.0,61133.0,8062.0,6.0,0.0,0.0,0.0,1.0
3,202030,0,Cash loans,F,N,N,0,70790.790,257503.0,9782.9,...,392889.0,517.0,12548.0,60658.0,22415.0,2.0,2.0,0.0,0.0,2.0
4,387233,0,Cash loans,M,N,Y,1,107581.191,412920.0,25094.6,...,374675.0,72.0,7450.0,81520.0,50680.0,2.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,168754,0,Cash loans,M,Y,Y,0,233948.726,571826.0,14510.1,...,,847.0,10904.0,261466.0,106787.0,1.0,0.0,0.0,0.0,9.0
30747,279642,0,Cash loans,F,Y,Y,0,112395.945,243352.0,39523.2,...,210289.0,1978.0,16390.0,51234.0,254114.0,1.0,0.0,0.0,0.0,1.0
30748,416746,0,Revolving loans,F,Y,N,0,75119.816,1182934.0,16098.5,...,,917.0,,,,3.0,,,,
30749,403427,0,Cash loans,F,N,Y,0,105949.449,506384.0,8091.3,...,359735.0,838.0,12060.0,38984.0,43081.0,1.0,0.0,0.0,0.0,5.0


In [52]:
save_path = os.path.join(data_drive_path, 'dataset', 'synthetic_data_CopulaGAN.csv')
synthetic_data_copulagan.to_csv(save_path, index=False)

#### Evaluate results

In [53]:
model_score_copulagan = evaluate(synthetic_data_copulagan, data)
model_score_copulagan

0.9149543433086731

<h4>Computing performance score</h4>

This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [54]:
report_copulagan = QualityReport()

report_copulagan.generate(data, synthetic_data_copulagan,model_copulagan.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:07<00:00,  1.86s/it]



Overall Quality Score: 88.33%

Properties:
Column Shapes: 89.69%
Column Pair Trends: 86.97%


In [55]:
details_copulagan = report_copulagan.get_details(property_name='Column Shapes')
details_copulagan

Unnamed: 0,Column,Metric,Quality Score
0,loan_id,KSComplement,0.877207
1,infringed,KSComplement,0.96179
2,num_children,KSComplement,0.889825
3,annual_income,KSComplement,0.674417
4,credit_amount,KSComplement,0.888849
5,credit_annuity,KSComplement,0.879534
6,goods_valuation,KSComplement,0.783997
7,age,KSComplement,0.945563
8,days_employed,KSComplement,0.692888
9,mobilephone_reachable,KSComplement,0.998472


In [56]:
print('Column with more quality',details_copulagan[details_copulagan['Quality Score'] == details_copulagan['Quality Score'].max()]['Column'], details_copulagan['Quality Score'].max())
print('Column with less quality',details_copulagan[details_copulagan['Quality Score'] == details_copulagan['Quality Score'].min()]['Column'], details_copulagan['Quality Score'].min())

Column with more quality 9    mobilephone_reachable
Name: Column, dtype: object 0.9984715944196937
Column with less quality 3    annual_income
Name: Column, dtype: object 0.6744170921270852


In [58]:
report_copulagan.get_visualization(property_name='Column Pair Trends')

In [63]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_copulagan,
    metadata=model_copulagan.get_metadata().to_dict(),
    column_name='contract_type'  # second-best column, since the first has some bugs
)

fig.show()

In [60]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_copulagan,
    metadata=model_copulagan.get_metadata().to_dict(),
    column_name='annual_income'
)

fig.show()

## Result analysis

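Looking at the results, the overall quality scores of the synthetic datasets are 89.61% for CTGAN, 90.16% for GaussianCopula, and 88.33% for CopulaGAN.
These results are good, but we can go deeper and look at the scores for the individual columns (Column Shapes) and for the correlations between columns (Column Pair Trends). Both properties stay between roughly 87% and 92% for all three models, so we can conclude that the synthetic data is of good quality overall. Some individual columns are reproduced less well, for example `days_employed` with GaussianCopula (0.26).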
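
To compare the three models side by side, the scores printed in the reports above can be collected and ranked. A small sketch, where the dictionary layout and variable names are illustrative and the numbers are copied from the report outputs:

```python
# (Column Shapes, Column Pair Trends, Overall) in percent, copied from the reports.
reports = {
    "CTGAN": (90.29, 88.92, 89.61),
    "GaussianCopula": (88.11, 92.21, 90.16),
    "CopulaGAN": (89.69, 86.97, 88.33),
}

# Rank the models by overall quality score, best first.
ranking = sorted(reports, key=lambda name: reports[name][2], reverse=True)
for name in ranking:
    shapes, trends, overall = reports[name]
    print(f"{name:<15} shapes={shapes:.2f}%  trends={trends:.2f}%  overall={overall:.2f}%")

best_model = ranking[0]
overall_scores = [scores[2] for scores in reports.values()]
spread = max(overall_scores) - min(overall_scores)
```

The gap between the best and the worst overall score is under two percentage points, so the ranking alone should not drive the choice of model; the per-column details matter as well.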

<b>[Pros and Cons of synthetic data](https://www.analyticssteps.com/blogs/what-synthetic-data-types-advantages-and-disadvantages)</b><br><br>
<b>Advantages:</b><br><br>
<ul>
  <li><b>Data quality</b> - Higher data quality, balance, and variety are ensured with synthetic data. Artificially created data can apply labels and automatically fill in missing quantities, allowing for more precise prediction;<br><br></li>
  <li><b> Scalability</b> - Synthetic data is used to cover the gaps left by real-world data;<br><br></li>
  <li><b> Utilization simplicity</b> - Synthetic data guarantees all data has a consistent format and labelling, getting rid of errors and duplicates.<br><br></li>
</ul>
<b> Disadvantages:</b><br><br>
<ul>
  <li>Outliers are challenging to map because synthetic data only approximates real-world data; it is not a duplicate. Therefore, some outliers that are present in the original data may not be covered by the synthetic data;<br><br></li>
  <li>The quality of the model depends on the data source.<br><br></li>
</ul>

