# Create Synthetic data 

To generate the data we will use SDV, the Synthetic Data Vault. SDV generates synthetic data by applying mathematical techniques and machine learning models, including deep learning models.
SDV handles data that contains multiple data types and missing values, so we only need to provide the data.
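
Before handing a table to SDV, it can help to check which column types and how many missing values the model will have to learn. A minimal sketch with pandas on a small made-up frame (the `toy` data and the `summarize_columns` helper are illustrative and not part of this notebook):

```python
import numpy as np
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and the fraction of missing values for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_fraction": df.isna().mean(),
    })

# Hypothetical toy data mixing numeric, categorical and missing values.
toy = pd.DataFrame({
    "annual_income": [54000.0, 112500.0, np.nan, 166500.0],
    "gender": ["M", "F", "F", None],
    "num_children": [0, 0, 1, 2],
})

summary = summarize_columns(toy)
print(summary)
```

Columns with a high missing fraction are worth a second look in the synthetic output, since the model will reproduce missing values as well.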

In [1]:
# ! pip install sdv  # the sdv.tabular imports below are the pre-1.0 SDV API, so a pin such as "sdv<1.0" may be needed

In [2]:
import pandas as pd
import os 
import numpy as np
from sdv.tabular import CTGAN, GaussianCopula, CopulaGAN
from sdv.evaluation import evaluate
from sdmetrics.reports.single_table import QualityReport
from sdmetrics.reports.utils import get_column_plot
import warnings
warnings.filterwarnings('ignore')

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data_drive_path = os.path.join('drive', 'MyDrive', 'Colab Notebooks', 'SP-project')

In [5]:
infringement_path = os.path.join(data_drive_path, 'dataset', 'infringement_dataset.csv')

### Load data

We sample only 10% of the original data to generate synthetic data, since the full dataset is too large.

In [6]:
data = pd.read_csv(infringement_path)
data = data.sample(int(data.shape[0]/10))
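
`DataFrame.sample` draws a different subset on every run, so the synthetic data and the scores below will vary between runs. A minimal sketch of a reproducible 10% sample on a made-up frame; the seed value 42 is an arbitrary assumption:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
toy = pd.DataFrame({"loan_id": range(100000, 100050), "infringed": [0, 1] * 25})

# frac=0.1 keeps 10% of the rows; random_state fixes the random draw.
sample_a = toy.sample(frac=0.1, random_state=42)
sample_b = toy.sample(frac=0.1, random_state=42)

print(len(sample_a))              # number of sampled rows
print(sample_a.equals(sample_b))  # whether both draws picked the same rows
```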

In [7]:
to_drop=["address",
"appendix_a", 
"appendix_b", 
"appendix_c",
"appendix_d",
"appendix_e",  
"appendix_f",  
"appendix_g",  
"appendix_h",  
"appendix_i",  
"appendix_j",  
"appendix_k",  
"appendix_l",  
"appendix_m",  
"appendix_n",  
"appendix_o",
"appendix_p",
"appendix_q",  
"appendix_r",  
"appendix_s",  
"appendix_t",  
"car_age",
"first_name",
"last_name",
"num_req_bureau_day",
"num_req_bureau_hour", 
"num_req_bureau_month", 
"num_req_bureau_qrt", 
"num_req_bureau_week", 
"num_req_bureau_year", 
"provided_email", 
"provided_homephone",
"provided_mobilephone",
"provided_workphone",
"region_rating",
"score_ext_1",
"score_ext_2",
"score_ext_3"]

In [8]:
data = data.drop(to_drop, axis=1)
data

Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
138919,261081,0,Cash loans,M,N,Y,0,54000.0,508495.5,26091.0,...,261081.0,1462.000000,12824.50500,132000.00,150853.50,3.0,0.0,0.0,0.0,3.0
39932,146258,0,Revolving loans,F,N,N,0,112500.0,337500.0,16875.0,...,146258.0,171.000000,3235.45500,68922.00,68922.00,1.0,0.0,0.0,0.0,1.0
150559,274538,0,Cash loans,F,N,Y,0,81000.0,298512.0,29524.5,...,274538.0,241.000000,14892.81750,92180.25,92180.25,2.0,0.0,0.0,0.0,2.0
180054,308657,0,Cash loans,F,N,Y,0,166500.0,780363.0,31077.0,...,308657.0,818.000000,9611.46000,126270.00,82431.00,1.0,0.0,0.0,0.0,1.0
214383,348422,0,Cash loans,F,N,Y,0,112500.0,432661.5,20943.0,...,348422.0,817.000000,20111.02875,239031.00,228636.00,4.0,0.0,0.0,0.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198816,330510,0,Cash loans,M,N,N,0,180000.0,454500.0,33979.5,...,330510.0,1597.500000,18722.25000,143750.25,136887.75,2.0,0.0,0.0,0.0,2.0
141262,263782,1,Cash loans,M,N,N,0,67500.0,232438.5,21316.5,...,263782.0,529.000000,5401.71000,77175.00,69457.50,1.0,0.0,0.0,0.0,1.0
151432,275536,1,Cash loans,M,Y,Y,1,157500.0,521280.0,31630.5,...,275536.0,446.000000,11581.30125,184439.25,190982.25,3.0,0.0,1.0,0.0,4.0
254848,394895,0,Cash loans,F,Y,Y,1,76500.0,103855.5,8455.5,...,394895.0,283.333333,6381.98250,77430.00,77211.75,2.0,0.0,4.0,0.0,6.0


## Create and train model - CTGAN

In [9]:
%%time
model_ctgan = CTGAN(epochs=15, generator_dim=(256, 256), discriminator_dim=(256, 256), verbose=True)
model_ctgan.fit(data)

Epoch 1, Loss G:  0.7906,Loss D:  0.1004
Epoch 2, Loss G:  0.6128,Loss D: -0.0214
Epoch 3, Loss G:  0.3220,Loss D: -0.1391
Epoch 4, Loss G:  0.4297,Loss D: -0.3106
Epoch 5, Loss G: -0.4470,Loss D:  0.0137
Epoch 6, Loss G: -0.6839,Loss D: -0.2055
Epoch 7, Loss G: -1.0638,Loss D: -0.4535
Epoch 8, Loss G: -0.0520,Loss D: -0.5705
Epoch 9, Loss G:  0.4506,Loss D: -1.3799
Epoch 10, Loss G: -0.4706,Loss D: -0.6767
Epoch 11, Loss G: -1.9495,Loss D: -1.0148
Epoch 12, Loss G: -2.3548,Loss D: -0.8926
Epoch 13, Loss G: -2.4292,Loss D: -0.0388
Epoch 14, Loss G: -2.7644,Loss D: -0.0175
Epoch 15, Loss G: -2.5837,Loss D:  0.0145
CPU times: user 1min 55s, sys: 53.5 s, total: 2min 48s
Wall time: 2min 21s


After fitting the model, we use it to generate the new data.

In [10]:
%%time
synthetic_data_ctgan = model_ctgan.sample(num_rows=data.shape[0])
synthetic_data_ctgan

CPU times: user 1.81 s, sys: 50.8 ms, total: 1.86 s
Wall time: 1.87 s


Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,314803,0,Cash loans,F,Y,N,2,138063.0,342264.0,28572.2,...,223990.0,661.0,6469.0,141335.0,399137.0,2.0,0.0,3.0,0.0,17.0
1,158588,0,Cash loans,F,Y,Y,0,139925.0,216529.0,63400.3,...,,,,,,,,,,
2,439382,0,Cash loans,F,Y,Y,1,371037.0,1123214.0,31103.0,...,359149.0,48.0,10294.0,245845.0,159366.0,2.0,0.0,0.0,0.0,1.0
3,100032,0,Cash loans,F,N,N,0,204198.0,1145812.0,,...,263408.0,2230.0,13837.0,164750.0,834761.0,2.0,1.0,0.0,0.0,1.0
4,398343,0,Cash loans,M,Y,N,0,129822.0,684479.0,18893.2,...,232409.0,2833.0,16667.0,44917.0,26607.0,3.0,1.0,0.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,162334,0,Cash loans,F,N,Y,2,178709.0,499386.0,11824.0,...,279701.0,941.0,15771.0,,,,,1.0,0.0,6.0
30747,235500,0,Cash loans,F,Y,Y,0,153869.0,926598.0,16547.0,...,441725.0,521.0,10324.0,81254.0,92105.0,2.0,0.0,0.0,0.0,10.0
30748,419072,0,Cash loans,F,N,Y,2,143939.0,181131.0,63753.6,...,201281.0,,15876.0,214524.0,,,1.0,1.0,0.0,5.0
30749,173926,0,Cash loans,F,Y,Y,0,179908.0,149798.0,20180.3,...,284028.0,,,,397699.0,,,,0.0,5.0


In [11]:
save_path = os.path.join(data_drive_path, 'dataset', 'synthetic_data_CTGAN.csv')
synthetic_data_ctgan.to_csv(save_path, index=False)

### Evaluate results

In [12]:
model_score = evaluate(synthetic_data_ctgan, data)
model_score

0.9132126326596439

This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [13]:
report_ctgan = QualityReport()

report_ctgan.generate(data, synthetic_data_ctgan, model_ctgan.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:04<00:00,  1.18s/it]



Overall Quality Score: 88.59%

Properties:
Column Shapes: 89.07%
Column Pair Trends: 88.11%
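
The overall score printed above is consistent with the plain average of the two property scores. A quick check with the rounded values from this report (the averaging rule is inferred from these numbers and is worth confirming against the SDMetrics documentation):

```python
# Property scores as printed by the CTGAN report above (rounded percentages).
column_shapes = 89.07
column_pair_trends = 88.11

# Average of the two properties.
overall = (column_shapes + column_pair_trends) / 2
print(f"{overall:.2f}%")
```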


In [14]:
details = report_ctgan.get_details(property_name='Column Shapes')
details

Unnamed: 0,Column,Metric,Quality Score
0,loan_id,KSComplement,0.890117
1,infringed,KSComplement,0.991903
2,num_children,KSComplement,0.927807
3,annual_income,KSComplement,0.889142
4,credit_amount,KSComplement,0.747878
5,credit_annuity,KSComplement,0.827054
6,goods_valuation,KSComplement,0.671483
7,age,KSComplement,0.881532
8,days_employed,KSComplement,0.717082
9,mobilephone_reachable,KSComplement,0.971871
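
`KSComplement` is one minus the Kolmogorov-Smirnov statistic, that is, one minus the largest gap between the empirical CDFs of the real and the synthetic column, so a score of 1.0 means the two distributions match. A minimal NumPy sketch of that idea (the `ks_complement` helper is illustrative; the SDMetrics implementation may differ in details such as missing-value handling):

```python
import numpy as np

def ks_complement(real, synthetic):
    """1 - max |ECDF_real(x) - ECDF_synthetic(x)|, ignoring NaN values."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.sort(real[~np.isnan(real)])
    synthetic = np.sort(synthetic[~np.isnan(synthetic)])

    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_synthetic = np.searchsorted(synthetic, grid, side="right") / synthetic.size
    return 1.0 - float(np.max(np.abs(cdf_real - cdf_synthetic)))

same = ks_complement([1, 2, 3, 4], [1, 2, 3, 4])      # identical samples
disjoint = ks_complement([1, 2, 3], [10, 11, 12])     # no overlap at all
shifted = ks_complement([1, 2, 3, 4], [3, 4, 5, 6])   # partial overlap
print(same, disjoint, shifted)
```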


In [15]:
print('Column with more quality',details[details['Quality Score'] == details['Quality Score'].max()]['Column'], details['Quality Score'].max())
print('Column with less quality',details[details['Quality Score'] == details['Quality Score'].min()]['Column'], details['Quality Score'].min())

Column with more quality 1    infringed
Name: Column, dtype: object 0.9919027023511431
Column with less quality 6    goods_valuation
Name: Column, dtype: object 0.6714834665826337
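
The same lookup can be written with `idxmax` and `idxmin`, which return a single row label instead of a filtered Series and therefore print more cleanly. A sketch on a hand-copied subset of the table above (the `details_subset` name is illustrative):

```python
import pandas as pd

# A few rows copied from the CTGAN 'Column Shapes' details above.
details_subset = pd.DataFrame({
    "Column": ["infringed", "credit_amount", "goods_valuation", "days_employed"],
    "Quality Score": [0.991903, 0.747878, 0.671483, 0.717082],
})

# Select the single row with the highest and the lowest score.
best = details_subset.loc[details_subset["Quality Score"].idxmax()]
worst = details_subset.loc[details_subset["Quality Score"].idxmin()]

print("Highest-quality column:", best["Column"], best["Quality Score"])
print("Lowest-quality column:", worst["Column"], worst["Quality Score"])
```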


In [16]:
report_ctgan.get_visualization(property_name='Column Pair Trends')

In [17]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_ctgan,
    metadata=model_ctgan.get_metadata().to_dict(),
    column_name='contract_type'
)

fig.show()

In [18]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_ctgan,
    metadata=model_ctgan.get_metadata().to_dict(),
    column_name='goods_valuation'
)

fig.show()

## Create and train model - GaussianCopula

In [19]:
%%time
model_gauscopula = GaussianCopula()
model_gauscopula.fit(data)

CPU times: user 3.02 s, sys: 42 ms, total: 3.06 s
Wall time: 3.09 s


### Generate new data

In [20]:
%%time
synthetic_data_gausscopula = model_gauscopula.sample(num_rows=data.shape[0])
synthetic_data_gausscopula

CPU times: user 1.01 s, sys: 141 ms, total: 1.15 s
Wall time: 1.02 s


Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,157558,0,Cash loans,F,Y,Y,0,173023.0,2185645.0,67685.4,...,212407.0,739.0,6567.0,65721.0,34857.0,5.0,1.0,1.0,0.0,5.0
1,152767,0,Cash loans,F,N,Y,1,123657.0,652697.0,29022.6,...,125183.0,925.0,15035.0,214038.0,219617.0,3.0,2.0,3.0,0.0,8.0
2,411530,0,Cash loans,F,N,Y,0,186067.0,500186.0,25583.3,...,405528.0,429.0,18073.0,94453.0,138207.0,2.0,3.0,1.0,0.0,5.0
3,410414,0,Cash loans,M,Y,Y,0,185560.0,1398065.0,49716.0,...,380117.0,227.0,11280.0,271836.0,204108.0,2.0,1.0,1.0,0.0,2.0
4,418388,0,Cash loans,F,N,N,1,92597.0,813604.0,27626.1,...,406635.0,448.0,27251.0,261768.0,277189.0,5.0,3.0,2.0,0.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,257040,0,Cash loans,F,N,Y,0,167340.0,157357.0,9139.3,...,295905.0,751.0,4131.0,52042.0,78819.0,1.0,0.0,0.0,0.0,2.0
30747,331768,1,Cash loans,F,Y,Y,0,252865.0,1089038.0,40278.5,...,,,,,,,,,,
30748,375438,0,Cash loans,F,N,Y,0,169508.0,723956.0,26280.2,...,359592.0,1132.0,16417.0,169210.0,186828.0,6.0,2.0,1.0,0.0,6.0
30749,361763,0,Cash loans,F,N,Y,0,96979.0,201117.0,20034.4,...,374377.0,696.0,8199.0,91317.0,83872.0,3.0,0.0,3.0,1.0,7.0


In [21]:
save_path = os.path.join(data_drive_path, 'dataset', 'synthetic_data_GaussianCopula.csv')
synthetic_data_gausscopula.to_csv(save_path, index=False)

### Evaluate results

In [22]:
model_score_gauss = evaluate(synthetic_data_gausscopula, data)
model_score_gauss

0.9045081606776357

In [23]:
report_gausscopula = QualityReport()

report_gausscopula.generate(data, synthetic_data_gausscopula, model_gauscopula.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:04<00:00,  1.12s/it]



Overall Quality Score: 90.71%

Properties:
Column Shapes: 88.5%
Column Pair Trends: 92.93%


In [24]:
details_gausscopula = report_gausscopula.get_details(property_name='Column Shapes')
details_gausscopula


In [25]:
print('Column with more quality',details_gausscopula[details_gausscopula['Quality Score'] == details_gausscopula['Quality Score'].max()]['Column'], details_gausscopula['Quality Score'].max())
print('Column with less quality',details_gausscopula[details_gausscopula['Quality Score'] == details_gausscopula['Quality Score'].min()]['Column'], details_gausscopula['Quality Score'].min())

Column with more quality 9    mobilephone_reachable
Name: Column, dtype: object 0.9984065558843614
Column with less quality 8    days_employed
Name: Column, dtype: object 0.2623329322623654


In [26]:
report_gausscopula.get_visualization(property_name='Column Pair Trends')

In [27]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_gausscopula,
    metadata=model_gauscopula.get_metadata().to_dict(),
    column_name='infringed'
)

fig.show()

In [28]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_gausscopula,
    metadata=model_gauscopula.get_metadata().to_dict(),
    column_name='days_employed'
)

fig.show()

## Create and train model - CopulaGAN

In [29]:
%%time
model_copulagan = CopulaGAN(epochs=15, generator_dim=(256, 256), discriminator_dim=(256, 256), verbose=True)
model_copulagan.fit(data)

Epoch 1, Loss G:  0.5447,Loss D:  0.2093
Epoch 2, Loss G: -1.1857,Loss D:  0.4886
Epoch 3, Loss G: -1.3961,Loss D:  0.4278
Epoch 4, Loss G: -1.0820,Loss D: -0.4004
Epoch 5, Loss G: -1.5901,Loss D: -0.3488
Epoch 6, Loss G: -1.4939,Loss D:  0.0390
Epoch 7, Loss G: -0.8734,Loss D: -0.8560
Epoch 8, Loss G: -0.1958,Loss D: -0.8793
Epoch 9, Loss G: -0.7367,Loss D: -0.6179
Epoch 10, Loss G: -0.7942,Loss D: -0.4425
Epoch 11, Loss G: -1.4429,Loss D: -0.6471
Epoch 12, Loss G: -1.4352,Loss D: -0.3106
Epoch 13, Loss G: -0.9736,Loss D: -0.6985
Epoch 14, Loss G: -0.6134,Loss D: -1.1475
Epoch 15, Loss G: -2.2581,Loss D: -0.5008
CPU times: user 1min 58s, sys: 54.6 s, total: 2min 53s
Wall time: 2min 20s


In [30]:
%%time
synthetic_data_copulagan = model_copulagan.sample(num_rows=data.shape[0])
synthetic_data_copulagan

CPU times: user 2.08 s, sys: 46.4 ms, total: 2.13 s
Wall time: 2.12 s


Unnamed: 0,loan_id,infringed,contract_type,gender,has_own_car,has_own_realty,num_children,annual_income,credit_amount,credit_annuity,...,SK_ID_CURR,avg_days_decision,past_avg_amount_annuity,past_avg_amt_application,past_avg_amt_credit,past_loans_approved,past_loans_refused,past_loans_canceled,past_loans_unused,past_loans_total
0,382313,0,Cash loans,F,N,N,2,251923.0,1045855.0,34041.4,...,208903.0,1418.0,12941.0,91315.0,18066.0,4.0,0.0,0.0,0.0,1.0
1,249431,0,Cash loans,F,N,Y,0,87096.0,418231.0,26855.2,...,446053.0,172.0,25270.0,222250.0,62032.0,4.0,2.0,1.0,,18.0
2,291691,0,Cash loans,F,Y,Y,0,193323.0,484704.0,29940.5,...,,,,,233157.0,3.0,,1.0,0.0,
3,303327,1,Cash loans,F,N,N,1,177603.0,576696.0,33456.0,...,114621.0,,15017.0,,918601.0,,2.0,1.0,0.0,9.0
4,246202,0,Cash loans,F,N,N,1,218785.0,481693.0,27625.6,...,177047.0,657.0,5031.0,244378.0,349438.0,2.0,6.0,1.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30746,353283,1,Revolving loans,F,N,N,0,238086.0,483492.0,10986.1,...,274005.0,406.0,11316.0,172655.0,187240.0,3.0,1.0,2.0,0.0,1.0
30747,102355,1,Cash loans,F,Y,N,1,93441.0,91813.0,23731.3,...,283856.0,1517.0,11168.0,902742.0,,5.0,1.0,1.0,0.0,14.0
30748,152873,0,Cash loans,M,Y,N,1,117104.0,1299619.0,86279.0,...,,,,,,,,,,
30749,363316,0,Cash loans,F,Y,Y,0,93041.0,79561.0,28392.1,...,205638.0,558.0,23519.0,60594.0,46909.0,2.0,1.0,1.0,0.0,5.0


In [31]:
save_path = os.path.join(data_drive_path, 'dataset', 'synthetic_data_CopulaGAN.csv')
synthetic_data_copulagan.to_csv(save_path, index=False)

### Evaluate results

In [32]:
model_score_copulagan = evaluate(synthetic_data_copulagan, data)
model_score_copulagan

0.9130157037101264

In [33]:
report_copulagan = QualityReport()

report_copulagan.generate(data, synthetic_data_copulagan, model_copulagan.get_metadata().to_dict())

Creating report: 100%|██████████| 4/4 [00:04<00:00,  1.13s/it]



Overall Quality Score: 88.65%

Properties:
Column Shapes: 89.78%
Column Pair Trends: 87.52%


In [34]:
details_copulagan = report_copulagan.get_details(property_name='Column Shapes')
details_copulagan

Unnamed: 0,Column,Metric,Quality Score
0,loan_id,KSComplement,0.870411
1,infringed,KSComplement,0.922116
2,num_children,KSComplement,0.812884
3,annual_income,KSComplement,0.776918
4,credit_amount,KSComplement,0.867907
5,credit_annuity,KSComplement,0.832748
6,goods_valuation,KSComplement,0.852164
7,age,KSComplement,0.942961
8,days_employed,KSComplement,0.80225
9,mobilephone_reachable,KSComplement,0.998407


In [35]:
print('Column with more quality',details_copulagan[details_copulagan['Quality Score'] == details_copulagan['Quality Score'].max()]['Column'], details_copulagan['Quality Score'].max())
print('Column with less quality',details_copulagan[details_copulagan['Quality Score'] == details_copulagan['Quality Score'].min()]['Column'], details_copulagan['Quality Score'].min())

Column with more quality 9    mobilephone_reachable
Name: Column, dtype: object 0.9984065558843614
Column with less quality 3    annual_income
Name: Column, dtype: object 0.7769178238106078


In [36]:
report_copulagan.get_visualization(property_name='Column Pair Trends')

In [37]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_copulagan,
    metadata=model_copulagan.get_metadata().to_dict(),
    column_name='contract_type' # second-best column, since the first has some bugs
)

fig.show()

In [38]:
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_copulagan,
    metadata=model_copulagan.get_metadata().to_dict(),
    column_name='annual_income'
)

fig.show()

## Result analysis

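By observing the results, we can see that the overall quality scores of the synthetic datasets are 88.59% (CTGAN), 90.71% (GaussianCopula), and 88.65% (CopulaGAN).
These results are good, and a deeper look at the individual column shapes (88.5% to 89.78%) and at the pairwise trends between columns (87.52% to 92.93%) gives similar scores, so we can conclude that our synthetic data is good. The main weak spot is `days_employed` under GaussianCopula, whose column-shape score is only 0.26.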
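
To compare the three models side by side, the scores printed by the reports above can be collected into one table. A sketch using the rounded values from the outputs in this notebook (the `scores` frame is built by hand, not produced by SDMetrics):

```python
import pandas as pd

# Percentages copied from the three QualityReport outputs above.
scores = pd.DataFrame(
    {
        "Overall": [88.59, 90.71, 88.65],
        "Column Shapes": [89.07, 88.50, 89.78],
        "Column Pair Trends": [88.11, 92.93, 87.52],
    },
    index=["CTGAN", "GaussianCopula", "CopulaGAN"],
)

# Rank the models by overall score, then find the best model per property.
print(scores.sort_values("Overall", ascending=False))
best_per_property = scores.idxmax()
print(best_per_property)
```

On these numbers no single model wins everywhere, so the choice depends on whether marginal distributions or pairwise relationships matter more for the downstream task.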


### Advantages
- <b>Data quality</b> - Synthetic data can offer higher data quality, balance, and variety. Artificially created data can apply labels and automatically fill in missing values, allowing for more precise prediction;
- <b>Scalability</b> - Synthetic data can be used to cover the gaps left by real-world data;
- <b>Utilization simplicity</b> - Synthetic data gives all data a consistent format and labelling, getting rid of errors and duplicates.

### Disadvantages
- Outliers are challenging to map: synthetic data merely approximates real-world data and is not a duplicate of it. Therefore, some outliers present in the original data may not be covered by the synthetic data.
- The quality of the model depends on the data source.
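
One rough way to see whether outliers survive is to measure how much of a synthetic column falls beyond a high quantile of the real column. A minimal sketch with made-up arrays; the `tail_share` helper and the 0.99 threshold are assumptions chosen for illustration:

```python
import numpy as np

def tail_share(real, synthetic, q=0.99):
    """Fraction of synthetic values above the q-quantile of the real values."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    threshold = np.nanquantile(real, q)
    # NaN comparisons are False, so missing synthetic values count as "not in the tail".
    return float(np.mean(synthetic > threshold))

# Hypothetical data: the synthetic column never reaches the real extremes.
real = np.arange(1, 101)      # values 1..100
synthetic = np.arange(1, 91)  # values 1..90

real_share = tail_share(real, real)            # share of real values in their own tail
synthetic_share = tail_share(real, synthetic)  # 0.0 means the tail is not covered
print(real_share, synthetic_share)
```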