# Synthetic data generation
We need a data set for testing model inference. We generate a synthetic one with SDV (Synthetic Data Vault), which models the statistical relationships between variables so that the generated data resembles the real data.

In [0]:
%run ./transform_data

# Transform data
In this notebook we perform the transformations needed to derive the features used for modelling.

First we clean variables whose coded values should be interpreted as missing or as 0.

+--------------+-----+
|_AGEG5YR_clean|count|
+--------------+-----+
|           1.0|29692|
|           6.0|28968|
|           4.0|28804|
|           5.0|30899|
|           8.0|34936|
|           NaN| 8310|
|          10.0|47701|
|           7.0|31698|
|          13.0|41756|
|          11.0|44774|
|           2.0|23705|
|           9.0|43387|
|          12.0|36803|
|           3.0|26237|
+--------------+-----+
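The cleaning step described above can be sketched in pandas on toy data. The notebook's own implementation is not shown in this output, so this is only an illustration. The sentinel codes used here (14 for a missing `_AGEG5YR`, 88 for "no children" and 99 for "refused" in `CHILDREN`) are assumptions and should be checked against the BRFSS codebook for the survey year.

```python
import pandas as pd

# Toy rows standing in for the raw survey columns (hypothetical values).
raw = pd.DataFrame({
    "_AGEG5YR": [1, 6, 14, 13],
    "CHILDREN": [2, 88, 99, 1],
})

# Assumed sentinel codes; confirm against the codebook before use.
AGE_MISSING = 14        # don't know / refused / missing
CHILDREN_NONE = 88      # respondent reported no children
CHILDREN_REFUSED = 99   # respondent refused to answer

age = raw["_AGEG5YR"].astype(float)
children = raw["CHILDREN"].astype(float)

clean = pd.DataFrame({
    # Codes that mean "missing" become NaN.
    "_AGEG5YR_clean": age.mask(age == AGE_MISSING),
    # Refusals become NaN, and the "none" code becomes 0.
    "CHILDREN_clean_mod": children.mask(children == CHILDREN_REFUSED).replace(CHILDREN_NONE, 0.0),
})
print(clean)
```

Keeping the sentinel codes in named constants makes it easier to audit them against the codebook later.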



In [0]:
# convert a 5% sample of the Spark DataFrame to pandas; this sample is used to fit the synthesizer
pdf = df_min.sample(fraction=0.05).toPandas()

+------------------+------+
|CHILDREN_clean_mod| count|
+------------------+------+
|              14.0|    13|
|              NULL|  5606|
|              23.0|     1|
|               0.0|336299|
|              32.0|     2|
|              22.0|     3|
|              29.0|     1|
|              18.0|     1|
|               1.0| 48206|
|               6.0|   695|
|              25.0|     1|
|              15.0|     5|
|               4.0|  5672|
|              41.0|     1|
|               5.0|  1854|
|               8.0|   127|
|              81.0|    12|
|              17.0|     2|
|              20.0|     2|
|              82.0|     3|
+------------------+------+
only showing top 20 rows


Next we calculate income relative to the poverty threshold. To do this, we first create new variables with the number of children capped at 8 and the number of adults capped at 9.
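A minimal sketch of the capping step in pandas, on toy data. `ADULT_cap9` and `CHILDREN_clean_mod` are names that appear in this notebook's output; `CHILDREN_cap8` and `ADULT_clean` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy household counts (hypothetical values); None marks a missing response.
households = pd.DataFrame({
    "CHILDREN_clean_mod": [0.0, 3.0, 14.0, None],
    "ADULT_clean": [1.0, 2.0, 12.0, None],  # hypothetical column name
})

# clip(upper=...) replaces anything above the cap with the cap and leaves NaN untouched.
households["CHILDREN_cap8"] = households["CHILDREN_clean_mod"].clip(upper=8)
households["ADULT_cap9"] = households["ADULT_clean"].clip(upper=9)
print(households)
```

Capping rather than dropping large households keeps those rows available for the poverty calculation.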

In [0]:
# infer metadata (column types) from the sampled DataFrame
from sdv.metadata import Metadata
metadata = Metadata.detect_from_dataframes(data={'data': pdf})




+--------------------+------+
|          ADULT_cap9| count|
+--------------------+------+
|                NULL| 81961|
|                 1.0| 96588|
|                 6.0|  2788|
|                 4.0| 26067|
|                 5.0|  8557|
|                 8.0|   402|
|5.397605346934028...|     1|
|                 7.0|   911|
|                 2.0|182543|
|                 9.0|  4096|
|                 3.0| 53756|
+--------------------+------+



Now that we have the capped variables, we read in the poverty data generated in the convert_census notebook and bin the thresholds to match the income categories in the BRFSS data.

+----------------------+------+
|poverty_threshold_conv| count|
+----------------------+------+
|                  NULL|144700|
|                     2| 79243|
|                     7| 16199|
|                     3|176715|
|                     5|  3586|
|                     4| 32908|
|                     6|  4319|
+----------------------+------+
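The binning described above can be sketched with `pd.cut`. How the notebook actually bins the thresholds is not shown in this output, so this is only an illustration. The dollar thresholds and the bracket edges below are illustrative assumptions, not values from the convert_census notebook; the real edges should come from the BRFSS codebook.

```python
import pandas as pd

# Hypothetical poverty thresholds in dollars, one per household composition.
thresholds = pd.Series([14_000, 19_000, 24_000, 31_000, 52_000], name="poverty_threshold")

# Assumed income bracket edges in dollars (illustrative only).
edges = [0, 10_000, 15_000, 20_000, 25_000, 35_000, 50_000, 75_000]

# labels=False returns the zero-based bin index; adding 1 gives 1-based category codes.
# right=False makes each bin [lower, upper), so 15,000 falls in the 15,000-20,000 bin.
threshold_bin = pd.cut(thresholds, bins=edges, labels=False, right=False) + 1
print(threshold_bin.tolist())
```

Once the thresholds are expressed as category codes, they can be compared directly with a survey income variable coded on the same scale.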



# Select features and filter

[('_AGEG5YR_clean', 'double'),
 ('EDUCA_clean', 'double'),
 ('_BMI5', 'double'),
 ('_SMOKER3_clean', 'double'),
 ('DRNKANY6_clean', 'double'),
 ('INCOME3_clean', 'double'),
 ('num_conditions', 'int'),
 ('income_adj_pov', 'double'),
 ('RFHLTH_adj', 'double')]

+--------------+-----------+-----+--------------+--------------+-------------+--------------+--------------+----------+
|_AGEG5YR_clean|EDUCA_clean|_BMI5|_SMOKER3_clean|DRNKANY6_clean|INCOME3_clean|num_conditions|income_adj_pov|RFHLTH_adj|
+--------------+-----------+-----+--------------+--------------+-------------+--------------+--------------+----------+
|          8310|       2363|43037|         32022|         43777|        87423|             0|        199183|      1310|
+--------------+-----------+-----+--------------+--------------+-------------+--------------+--------------+----------+



229399
457670


We lose about half of the rows (228,271 of 457,670) when we require all values to be present. The two major contributors are INCOME3 and income_adj_pov; the latter is missing whenever INCOME3 or the number of children or adults is missing. A more thorough treatment could attempt to impute some of these values.
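The missingness audit behind these numbers can be sketched in pandas on toy data: count missing values per column, as in the null-count table above, then compare the row count before and after dropping incomplete rows. The values are hypothetical, and the notebook itself may compute this differently.

```python
import pandas as pd

# Toy feature table (hypothetical values); None marks a missing value.
features = pd.DataFrame({
    "INCOME3_clean": [5.0, None, 7.0, 2.0],
    "income_adj_pov": [3.0, None, None, 0.0],
    "RFHLTH_adj": [0.0, 1.0, 0.0, 1.0],
})

# Missing values per column.
null_counts = features.isna().sum()

# Keep only complete rows and work out the fraction that survives.
complete = features.dropna()
kept_fraction = len(complete) / len(features)
print(null_counts.to_dict(), len(complete), len(features), kept_fraction)
```

Looking at the per-column counts first shows which variables drive the loss, and so which ones would be the most useful to impute.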

+----------+------+
|RFHLTH_adj| count|
+----------+------+
|       0.0|185479|
|       1.0| 43920|
+----------+------+



In [0]:
# create a synthesizer with the sample dataset and generate 100 rows of synthetic data
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(pdf)

synthetic_data = synthesizer.sample(num_rows=100)

In [0]:
# manually examine the synthetic data
synthetic_data

Unnamed: 0,_AGEG5YR_clean,EDUCA_clean,_BMI5,_SMOKER3_clean,DRNKANY6_clean,INCOME3_clean,num_conditions,income_adj_pov,RFHLTH_adj
0,11.0,6.0,2588.0,4.0,2.0,5.0,3,3.0,1.0
1,4.0,5.0,3100.0,4.0,1.0,10.0,2,5.0,0.0
2,4.0,5.0,3457.0,3.0,2.0,7.0,1,5.0,0.0
3,6.0,3.0,2517.0,3.0,2.0,8.0,0,6.0,0.0
4,11.0,6.0,3306.0,4.0,1.0,7.0,1,5.0,0.0
...,...,...,...,...,...,...,...,...,...
95,11.0,4.0,3297.0,4.0,2.0,1.0,5,-2.0,0.0
96,2.0,4.0,2429.0,4.0,1.0,8.0,0,5.0,0.0
97,3.0,3.0,2151.0,4.0,1.0,6.0,1,2.0,0.0
98,10.0,6.0,3592.0,3.0,1.0,2.0,3,0.0,1.0
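Beyond reading rows by eye, a quick numeric check is to compare per-column means of the real sample and the synthetic data. The helper `compare_means` below is hypothetical and not part of SDV, and the toy frames stand in for `pdf` and `synthetic_data`.

```python
import pandas as pd

def compare_means(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return per-column means of both tables and their absolute difference."""
    shared = [c for c in real.columns if c in synthetic.columns]
    summary = pd.DataFrame({
        "real_mean": real[shared].mean(),
        "synthetic_mean": synthetic[shared].mean(),
    })
    summary["abs_diff"] = (summary["real_mean"] - summary["synthetic_mean"]).abs()
    return summary

# Toy stand-ins for the real sample and the synthetic output (hypothetical values).
real = pd.DataFrame({"_BMI5": [2500.0, 3100.0, 2800.0], "RFHLTH_adj": [0.0, 1.0, 0.0]})
fake = pd.DataFrame({"_BMI5": [2600.0, 3000.0, 2900.0], "RFHLTH_adj": [0.0, 0.0, 1.0]})

report = compare_means(real, fake)
print(report)
```

In the notebook this would be called as `compare_means(pdf, synthetic_data)`. Means are a coarse check for coded categorical columns, where comparing value counts may be more informative.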


In [0]:
synthetic_data.to_parquet('/Volumes/pophealthrisk/pophealthrisk/pophealthrisk/synthetic_data.parquet')