In [1]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from synthpop import MissingDataHandler, DataProcessor, GaussianCopulaMethod
from synthpop.metrics import (
    MetricsReport,
    EfficacyMetrics,
    DisclosureProtection
)
import matplotlib.pyplot as plt
from helper_functions import *
from matplotlib.colors import LinearSegmentedColormap
warnings.filterwarnings('ignore')

### Load and prepare data

In [2]:
df = pd.read_csv('../../datasets/social diagnosis/SocialDiagnosis2011.csv', delimiter=';', index_col=False)
print(df.shape)
print(df.columns)
display(df.head())

(5000, 6)
Index(['sex', 'age', 'marital', 'income', 'ls', 'smoke'], dtype='object')


Unnamed: 0,sex,age,marital,income,ls,smoke
0,FEMALE,57,MARRIED,800.0,PLEASED,NO
1,MALE,20,SINGLE,350.0,MOSTLY SATISFIED,NO
2,FEMALE,18,SINGLE,,PLEASED,NO
3,FEMALE,78,WIDOWED,900.0,MIXED,NO
4,FEMALE,54,MARRIED,1500.0,MOSTLY SATISFIED,YES


**UI text #1**

The [Social Diagnosis 2011*](https://search.r-project.org/CRAN/refmans/synthpop/html/SD2011.html) dataset is used as a demo. Synthetic data will be generated for the following columns: 

- sex: sex of a person
- age: age of a person
- marital: marital status
- income: personal monthly net income
- ls: perception of life as a whole
- smoke: smoking cigarettes

A Gaussian copula will be used to generate the synthetic data. The differences in distributions and correlations between the real and synthetic data are then evaluated.

*The original paper can be found [here](https://ce.vizja.pl/en/issues/volume/5/issue/3#art254).

### 0. Preview of data

In [3]:
# dataset
df.head()

Unnamed: 0,sex,age,marital,income,ls,smoke
0,FEMALE,57,MARRIED,800.0,PLEASED,NO
1,MALE,20,SINGLE,350.0,MOSTLY SATISFIED,NO
2,FEMALE,18,SINGLE,,PLEASED,NO
3,FEMALE,78,WIDOWED,900.0,MIXED,NO
4,FEMALE,54,MARRIED,1500.0,MOSTLY SATISFIED,YES


In [4]:
print(df.isnull().sum())

sex          0
age          0
marital      9
income     683
ls           8
smoke       10
dtype: int64


### 1. Data types detection

**UI text #2**

The following data types are detected:

[output]

If a detected data type is incorrect, please correct it locally in the source dataset before attaching it to the web app.

In [5]:
md_handler = MissingDataHandler()

# Check the data types
column_dtypes = md_handler.get_column_dtypes(df)
print("Column Data Types:", column_dtypes)

Column Data Types: {'sex': 'categorical', 'age': 'numerical', 'marital': 'categorical', 'income': 'numerical', 'ls': 'categorical', 'smoke': 'categorical'}
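
How `get_column_dtypes` decides between the two labels is not shown in this notebook. As a rough sketch only, one plausible heuristic is to call a column numerical when all of its non-missing values parse as numbers, and categorical otherwise. The helper names below (`is_missing`, `is_number`, `guess_column_type`) and the toy columns are hypothetical and are not part of `synthpop`.

```python
import math

def is_missing(value):
    # Treat None and float NaN as missing values.
    return value is None or (isinstance(value, float) and math.isnan(value))

def is_number(value):
    # Booleans are excluded on purpose: they behave like categories.
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return True
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def guess_column_type(values):
    # Hypothetical heuristic, not the synthpop implementation.
    observed = [v for v in values if not is_missing(v)]
    if observed and all(is_number(v) for v in observed):
        return "numerical"
    return "categorical"

# Toy columns modelled on the first rows of the dataset.
toy_columns = {
    "age": [57, 20, 18, 78, 54],
    "income": [800.0, 350.0, float("nan"), 900.0, 1500.0],
    "smoke": ["NO", "NO", "NO", "NO", "YES"],
}
guessed = {name: guess_column_type(col) for name, col in toy_columns.items()}
print(guessed)
```

A heuristic like this misclassifies integer-coded categories (for example a 1 to 5 rating scale), which is one reason the UI text asks the user to check the detected types.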


### 2. Handling missing data

**UI text #3**

For the following columns, the missing data type is:

{{ dynamic

- marital: MAR
- income: MAR
- ls: MAR
- smoke: MAR

}}

For Missing At Random (MAR) and Missing Not At Random (MNAR) data, we recommend imputing the missing data. For Missing Completely At Random (MCAR) data, we recommend removing the missing data. See the info box for more information. [i]

[demo text] In this demo CART is used, so the missing data is imputed. When using Gaussian Copula, the user can choose whether the missing data is removed or imputed, depending on the type of missing data.

[i] _info box:_

MCAR, MAR, and MNAR are terms used to describe different mechanisms of missing data:

1. **MCAR (Missing Completely At Random)**:
- The probability of data being missing is completely independent of both observed and unobserved data. 
- There is no systematic pattern to the missingness.
- Example: A survey respondent accidentally skips a question due to a printing error.
- Recommendation: remove missing data.

2. **MAR (Missing At Random)**:
- The probability of data being missing is related to the observed data but not the missing data itself.
- The missingness can be predicted by other variables in the dataset.
- Example: Students' test scores are missing, but the missingness is related to their attendance records.
- Recommendation: impute missing data.

3. **MNAR (Missing Not At Random)**:
- The probability of data being missing is related to the missing data itself. 
- There is a systematic pattern to the missingness that is related to the unobserved data.
- Example: Patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms.
- Recommendation: impute missing data.
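
To make the three mechanisms concrete, the sketch below simulates them on toy data and compares the mean of the observed incomes with the true mean. All variable names, probabilities, and the toy income model are assumptions chosen for illustration. They are not taken from the dataset or from `synthpop`.

```python
import random

random.seed(0)
n = 10_000

# Toy data: age is always observed, income may go missing.
ages = [random.randint(18, 90) for _ in range(n)]
# Assumed toy relationship: income rises with age.
incomes = [random.gauss(1500, 500) + 10 * (age - 50) for age in ages]

def observed_mean(values, missing_mask):
    # Mean of the values that are not flagged as missing.
    kept = [v for v, is_missing in zip(values, missing_mask) if not is_missing]
    return sum(kept) / len(kept)

# MCAR: every income has the same 20% chance of being missing.
mcar = [random.random() < 0.2 for _ in range(n)]
# MAR: missingness depends only on the observed age.
mar = [random.random() < (0.4 if age > 60 else 0.1) for age in ages]
# MNAR: missingness depends on the (unobserved) income itself.
mnar = [random.random() < (0.4 if income > 1800 else 0.1) for income in incomes]

true_mean = sum(incomes) / n
mcar_mean = observed_mean(incomes, mcar)
mar_mean = observed_mean(incomes, mar)
mnar_mean = observed_mean(incomes, mnar)

print(f"true mean:          {true_mean:.1f}")
print(f"observed under MCAR: {mcar_mean:.1f}")
print(f"observed under MAR:  {mar_mean:.1f}")
print(f"observed under MNAR: {mnar_mean:.1f}")
```

In this toy setup, dropping rows under MCAR leaves the mean roughly unchanged, while under MAR and MNAR the observed mean is pulled downward because high incomes go missing more often. That is the intuition behind the recommendation to remove MCAR data but impute MAR and MNAR data.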

In [6]:
# Detect missingness
missingness_dict = md_handler.detect_missingness(df)
print("Detected Missingness Type:", missingness_dict)

Detected Missingness Type: {'marital': 'MAR', 'income': 'MAR', 'ls': 'MAR', 'smoke': 'MAR'}


In [7]:
real_df = md_handler.apply_imputation(df, missingness_dict)

print(real_df.isnull().sum())

sex        0
age        0
marital    0
income     0
ls         0
smoke      0
dtype: int64


### [no section] Pre-processing data

In [8]:
# Instantiate the DataProcessor with the metadata
metadata = column_dtypes
processor = DataProcessor(metadata)

# Preprocess the data: transforms raw data into a numerical format
processed_data = processor.preprocess(real_df)
print("Processed Data:")
display(processed_data.head())

Processed Data:


Unnamed: 0,sex,age,marital,income,ls,smoke
0,0,57.0,3,800.0,4,0
1,1,20.0,4,350.0,3,0
2,0,18.0,4,1411.093352,4,0
3,0,78.0,5,900.0,1,0
4,0,54.0,3,1500.0,3,1
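
The processed output above is consistent with each categorical column being mapped to integer codes in sorted order (for example `FEMALE` to 0 and `MALE` to 1). Whether `DataProcessor` does exactly this is an assumption. The sketch below shows that kind of label encoding on a toy column, with the hypothetical helper `fit_label_codes`.

```python
def fit_label_codes(values):
    # Assign integer codes to the distinct categories in sorted order.
    categories = sorted(set(values))
    return {category: code for code, category in enumerate(categories)}

# Toy column; the real column has more categories, so its codes differ.
marital = ["MARRIED", "SINGLE", "SINGLE", "WIDOWED", "MARRIED"]

codes = fit_label_codes(marital)
encoded = [codes[value] for value in marital]

print(codes)
print(encoded)
```

Integer codes impose an order on categories that may have none, which is a known trade-off of feeding label-encoded data to a model that assumes continuous variables.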


### 3. Synthesizer: Gaussian Copula

In [9]:
# Instantiate and fit the Gaussian Copula (GC) synthesizer
GC = GaussianCopulaMethod(metadata)
GC.fit(processed_data)

INFO:copulas.multivariate.gaussian:Fitting GaussianMultivariate(distribution="{'sex': <class 'copulas.univariate.beta.BetaUnivariate'>, 'age': <class 'copulas.univariate.beta.BetaUnivariate'>, 'marital': <class 'copulas.univariate.beta.BetaUnivariate'>, 'income': <class 'copulas.univariate.beta.BetaUnivariate'>, 'ls': <class 'copulas.univariate.beta.BetaUnivariate'>, 'smoke': <class 'copulas.univariate.beta.BetaUnivariate'>}")


In [10]:
# Sample 5000 synthetic rows (the same size as the real dataset) in the processed numerical space
synthetic_processed = GC.sample(5000)
print("Synthetic Processed Data (in numerical space):")
display(synthetic_processed.head())

Synthetic Processed Data (in numerical space):


Unnamed: 0,sex,age,marital,income,ls,smoke
0,0.000912,67.679431,1.844519,2282.871206,5.215054,0.000254
1,0.007212,20.018497,3.629754,1088.296227,4.737462,0.000254
2,6e-05,27.506824,5.518484,624.719869,2.589195,0.000254
3,0.275017,20.147052,3.610086,80.943351,0.878939,0.000254
4,0.02554,24.641005,3.217432,527.741169,0.576242,0.000254


**UI text #4**

{n_synth_data} synthetic data points are generated using a Gaussian copula (GC).

GC works in two main steps:
1. Each variable in the real data is transformed to a uniform distribution using its fitted marginal distribution. The correlations between the transformed variables are then modeled with a multivariate normal distribution (the *Gaussian copula*);
2. Synthetic data is created by sampling from this Gaussian copula and transforming the samples back to the original data distributions.
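
The two steps above can be sketched for two variables using only the Python standard library. This is a minimal illustration under stated assumptions, not the `GaussianCopulaMethod` implementation. In particular, it uses empirical ranks and quantiles for the marginals, whereas the fitting log above reports Beta marginals. The function names and the toy `age`/`income` data are hypothetical.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()

def pearson(a, b):
    # Plain Pearson correlation coefficient.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def normal_scores(column):
    # Step 1: ranks give uniform values in (0, 1); inv_cdf maps them to normal scores.
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores

def sample_copula(x, y, n_samples, rng):
    # The copula parameter is the correlation between the normal scores.
    rho = pearson(normal_scores(x), normal_scores(y))
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        # Step 2: draw correlated standard normals ...
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + max(0.0, 1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        # ... map them to uniforms, then back through the empirical quantiles.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        samples.append((
            xs[min(int(u1 * len(xs)), len(xs) - 1)],
            ys[min(int(u2 * len(ys)), len(ys) - 1)],
        ))
    return samples

rng = random.Random(0)
age = [rng.uniform(18, 90) for _ in range(2000)]
income = [20 * a + rng.gauss(0, 300) for a in age]  # toy positive dependence

synthetic = sample_copula(age, income, 2000, rng)
syn_age = [row[0] for row in synthetic]
syn_income = [row[1] for row in synthetic]

print("real correlation:     ", round(pearson(age, income), 3))
print("synthetic correlation:", round(pearson(syn_age, syn_income), 3))
```

Because the samples are mapped back through the empirical quantiles, every synthetic value in this sketch is one of the real values. A parametric marginal, such as the Beta distributions in the log above, instead produces new values.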

### [no section] Post-processing synthetic data

In [11]:
# Postprocess the synthetic data back to the original format
synthetic_df = processor.postprocess(synthetic_processed)
print("Synthetic Data in Original Format:")
display(synthetic_df.head())

Synthetic Data in Original Format:


Unnamed: 0,sex,age,marital,income,ls,smoke
0,FEMALE,68,LEGALLY SEPARATED,2282.871206,TERRIBLE,NO
1,FEMALE,20,SINGLE,1088.296227,TERRIBLE,NO
2,FEMALE,28,,624.719869,MOSTLY SATISFIED,NO
3,FEMALE,20,SINGLE,80.943351,MIXED,NO
4,FEMALE,25,MARRIED,527.741169,MIXED,NO
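
The sampled categorical values are continuous (for example `marital` = 3.629754), so they have to be mapped back to category labels. The output above is consistent with rounding to the nearest integer code, and row 2 has a missing `marital` value where the sampled code (5.518484) would round past the codes seen elsewhere. The notebook does not confirm that this is what `postprocess` does. The sketch below, using a hypothetical `decode_codes` helper and a toy category list, shows rounding with and without clipping to the valid code range.

```python
def decode_codes(values, categories, clip=True):
    # Round each sampled value to an integer code and look up its label.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    decoded = []
    last_code = len(categories) - 1
    for value in values:
        code = round(value)
        if clip:
            # Clipping keeps out-of-range codes inside the valid range.
            code = min(max(code, 0), last_code)
        decoded.append(categories[code] if 0 <= code <= last_code else None)
    return decoded

# Toy categories with codes 0, 1, 2 (hypothetical, fewer than the real column).
categories = ["MARRIED", "SINGLE", "WIDOWED"]
sampled = [0.2, 1.4, 2.6]  # 2.6 rounds to 3, which is not a valid code

without_clip = decode_codes(sampled, categories, clip=False)
with_clip = decode_codes(sampled, categories, clip=True)

print(without_clip)
print(with_clip)
```

Without clipping, out-of-range codes become missing values in the synthetic data. With clipping, they are assigned to the first or last category, which slightly inflates the frequency of those categories.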
