In [1]:
import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from helper_functions import *
from sklearn.preprocessing import LabelEncoder
from matplotlib.colors import LinearSegmentedColormap
warnings.filterwarnings('ignore')

### Load and prepare data

In [2]:
df = pd.read_csv('../datasets/social diagnosis/SocialDiagnosis2011.csv', delimiter=';', index_col=False)
print(df.shape)
df = df.dropna()
print(df.columns)
display(df.head())

(5000, 6)
Index(['sex', 'age', 'marital', 'income', 'ls', 'smoke'], dtype='object')


Unnamed: 0,sex,age,marital,income,ls,smoke
0,FEMALE,57,MARRIED,800.0,PLEASED,NO
1,MALE,20,SINGLE,350.0,MOSTLY SATISFIED,NO
3,FEMALE,78,WIDOWED,900.0,MIXED,NO
4,FEMALE,54,MARRIED,1500.0,MOSTLY SATISFIED,YES
5,MALE,20,SINGLE,-8.0,PLEASED,NO


**UI text #1**

The [Social Diagnosis 2011*](https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage) dataset is used as a demo. Synthetic data will be generated for the following columns: 

- sex: ...
- age: age
- marital: 
- income: 
- ls: 
- smoke: 

A Gaussian copula will be used to generate the synthetic data, after which the distribution and correlation differences between the real and synthetic data are evaluated.

*The original paper can be found [here](https://ce.vizja.pl/en/issues/volume/5/issue/3#art254).

### 0. Preview of data

In [3]:
# dataset
df.head()

Unnamed: 0,sex,age,marital,income,ls,smoke
0,FEMALE,57,MARRIED,800.0,PLEASED,NO
1,MALE,20,SINGLE,350.0,MOSTLY SATISFIED,NO
3,FEMALE,78,WIDOWED,900.0,MIXED,NO
4,FEMALE,54,MARRIED,1500.0,MOSTLY SATISFIED,YES
5,MALE,20,SINGLE,-8.0,PLEASED,NO


In [4]:
print(df.isnull().sum())

sex        0
age        0
marital    0
income     0
ls         0
smoke      0
dtype: int64


### 1. Data types detection

In [5]:
# get the data types of columns using helper function
dtypes_dict = data_type(df)
dtypes_dict

{'sex': 'category',
 'age': dtype('int64'),
 'marital': 'category',
 'income': 'float',
 'ls': 'category',
 'smoke': 'category'}

In [6]:
# Encode string columns to numeric values
label_encoders = {}
df_encoded = df.copy()
for column in df.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    df_encoded[column] = label_encoders[column].fit_transform(df[column])

df_encoded.head()

Unnamed: 0,sex,age,marital,income,ls,smoke
0,0,57,3,800.0,4,0
1,1,20,4,350.0,3,0
3,0,78,5,900.0,1,0
4,0,54,3,1500.0,3,1
5,1,20,4,-8.0,4,0


In [9]:
# get the data types of columns using helper function
dtypes_encoded_dict = data_type(df_encoded)
dtypes_encoded_dict

{'sex': dtype('int64'),
 'age': dtype('int64'),
 'marital': dtype('int64'),
 'income': 'float',
 'ls': dtype('int64'),
 'smoke': dtype('int64')}

**UI text #2**

If the detected data types are incorrect, please correct them locally in the dataset before attaching it again.

### 2. Gaussian copula model

**UI text #3**

Gaussian copula (GC) is a statistical method to generate synthetic data that mimics the structure and relationships (dependencies) seen in real data. It works well when the data has relationships between variables that need to be preserved, even if the exact data values change. A 'copula' describes how these variables are connected or correlated without focusing on their actual values. The Gaussian copula specifically uses a normal distribution to model these connections.

GC works in two main steps:
1. Each variable in the real data is transformed to a uniform distribution using its own (marginal) distribution. The correlations between the transformed variables are then modeled with a multivariate normal distribution (the *Gaussian copula*);
2. Synthetic data is created by sampling from this Gaussian copula and transforming the samples back to the original data distributions.
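The two steps above can be sketched with NumPy and the standard library. This is a minimal illustration under stated assumptions, not the implementation behind `GaussianCopulaSynthesizer`: the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical, the marginals are modeled with rank-based empirical distributions, and the toy data is generated only for the demonstration.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()
_ppf = np.vectorize(_STD_NORMAL.inv_cdf)  # inverse standard normal CDF
_cdf = np.vectorize(_STD_NORMAL.cdf)      # standard normal CDF


def fit_gaussian_copula(data):
    """Step 1: map each column to normal scores and estimate their correlation."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Rank-based empirical CDF; dividing by n + 1 keeps values strictly in (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n + 1)
    z = _ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    return {"corr": corr, "sorted_columns": np.sort(data, axis=0)}


def sample_gaussian_copula(model, n_samples, seed=0):
    """Step 2: sample correlated normals and map them back to the marginals."""
    rng = np.random.default_rng(seed)
    corr = model["corr"]
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = _cdf(z)
    sorted_cols = model["sorted_columns"]
    # Empirical quantile function of each original column.
    columns = [np.quantile(sorted_cols[:, j], u[:, j]) for j in range(corr.shape[0])]
    return np.column_stack(columns)


# Toy data with two strongly correlated columns (illustrative only).
rng = np.random.default_rng(42)
x = rng.normal(size=500)
real = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=500)])

model = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(model, n_samples=500)
```

Because the sketch uses empirical marginals, synthetic values always stay within the observed range of each column; a parametric marginal would behave differently in the tails.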

Based on the above histogram plots, one should consider whether the univariate distributions approximately follow a normal distribution.

In [7]:
# Initialize synthesizer and fit it to the data
synthesizer = GaussianCopulaSynthesizer()
synthesizer.fit(df_encoded)

In [8]:
# Generate as many synthetic rows as there are rows in the real data
n_synth_data = df.shape[0]
synth_df = synthesizer.sample(n_synth_data)

[ 0.01163612         inf  0.02310358         inf -0.02370355  0.04836914]


LinAlgError: SVD did not converge
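The array printed before the error contains `inf` entries, and a decomposition such as SVD cannot converge on non-finite input. One possible cause, stated here as an assumption because the synthesizer's internals are not shown, is that a value of exactly 0 or 1 on the uniform scale reaches the inverse normal CDF, which is unbounded at those points. The sketch below shows how a rank-based empirical CDF can reach exactly 1 and how clipping keeps the normal scores finite. The tolerance `eps` is a hypothetical choice for illustration.

```python
import numpy as np
from statistics import NormalDist

# Income values taken from the data preview above.
values = np.array([800.0, 350.0, 900.0, 1500.0, -8.0])
n = len(values)
ranks = values.argsort().argsort() + 1  # ranks 1..n

# Naive empirical CDF: the largest observation maps to exactly 1.0.
u_naive = ranks / n

# Clip away from 0 and 1 before applying the inverse normal CDF.
eps = 1e-6  # hypothetical tolerance
u_clipped = np.clip(u_naive, eps, 1 - eps)

inv_cdf = np.vectorize(NormalDist().inv_cdf)
z = inv_cdf(u_clipped)  # finite normal scores
```

If this is indeed the cause, checking intermediate arrays with `np.isfinite` before fitting would surface the problem earlier than the `LinAlgError` does.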

### 3. Evaluation of generated data

In [9]:
# combine original data and decoded synthetic data in dataframe
combined_data = pd.concat((df.assign(realOrSynthetic='real'), synth_df.assign(realOrSynthetic='synthetic')), keys=['real','synthetic'], names=['Data'])

NameError: name 'synth_df' is not defined
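The comment in the cell above refers to decoded synthetic data, but no decoding step appears in the notebook. A possible sketch follows, assuming the synthesizer returns label-encoded columns as floats that must be rounded to valid integer codes first. `decode_codes` is a hypothetical helper, and `smoke_classes` stands in for the `classes_` attribute of a fitted `LabelEncoder`.

```python
import numpy as np


def decode_codes(codes, classes):
    """Map (possibly non-integer) synthetic codes back to their original labels."""
    classes = np.asarray(classes)
    # Round to the nearest integer code and clip to the valid range 0..len(classes)-1.
    idx = np.clip(np.rint(codes), 0, len(classes) - 1).astype(int)
    return classes[idx]


# LabelEncoder stores the sorted unique labels in `classes_`; this array mimics it.
smoke_classes = np.array(["NO", "YES"])
synthetic_codes = np.array([0.1, 0.9, 1.4, -0.3])
decoded = decode_codes(synthetic_codes, smoke_classes)
```

With the notebook's objects, this could be applied per encoded column, for example `decode_codes(synth_df[column].to_numpy(), label_encoders[column].classes_)`, before concatenating with the original `df`.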

**UI text #4**

{n_synth_data} synthetic data points are generated using a Gaussian copula. The figures below display the differences in value frequency for each variable. A grey bar indicates that a value is equally represented in the synthetic data and in the real data. A bar with an orange top indicates that the synthetic data contains more occurrences of this value than the real data. Conversely, a bar with a blue top shows that the synthetic data contains fewer occurrences of this value than the real data.
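The same comparison can be expressed numerically. The sketch below is not the notebook's `univariate_hist` helper; `frequency_difference` is a hypothetical function that computes the relative frequency of each category in both datasets, and the two small series are made-up values for illustration.

```python
import pandas as pd


def frequency_difference(real, synthetic):
    """Relative frequency per category in both series and their difference."""
    freq = pd.DataFrame({
        "real": real.value_counts(normalize=True),
        "synthetic": synthetic.value_counts(normalize=True),
    }).fillna(0.0)
    # Positive: over-represented in the synthetic data; negative: under-represented.
    freq["difference"] = freq["synthetic"] - freq["real"]
    return freq.sort_index()


# Made-up example values for a single categorical column.
real = pd.Series(["NO", "NO", "NO", "YES"])
synthetic = pd.Series(["NO", "NO", "YES", "YES"])
diff = frequency_difference(real, synthetic)
```

A positive `difference` corresponds to an orange top in the figures, a negative one to a blue top, and a value near zero to a grey bar.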

In [9]:
# plot univariate histograms using helper function
univariate_hist(combined_data, dtypes_dict, Comparison=True)

NameError: name 'combined_data' is not defined