# 4. Checking Similarity between Original and Synthetic Data – Using Regression

- One way of assessing whether synthetic data shares similar statistical characteristics with the original data is to build a model using both datasets and compare their results.
- The original and synthetic data we have at hand are import declaration data, consisting of declaration element columns and a detection record column indicating whether a fraud was detected for each declaration during customs inspection.
- In this lesson, two models predicting the fraudulence of a declaration will be trained on original and synthetic datasets. Then, the results of the two models will be compared to assess the statistical similarity between the two datasets. 

## Check Statistical Similarity using Regression Model

The methodology to be used to check statistical similarity between original and synthetic datasets is as follows:
1. Train two models predicting the fraudulence of a declaration on original and synthetic datasets
2. For each model, select declarations predicted as having a high likelihood of fraud 
3. Verify actual detection records for selected declarations and calculate each model’s detection rate
4. Compare the detection rates of the two models to assess statistical similarity

We will repeat the process above three times, utilizing Linear Regression, Random Forest, and XGBoost algorithms.

## Import Library
- Let’s import the necessary libraries.

In [2]:
import pandas as pd
import copy
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import seaborn as sns
from matplotlib import pyplot as plt 

import time
import warnings
warnings.filterwarnings("ignore")

- Set the basic configuration of the Jupyter Notebook. 

In [3]:
# Jupyter cell full screen view
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Each column width at maximum (print all column contents)
pd.set_option('display.max_colwidth', None)
# Show up to 500 rows
pd.set_option('display.max_rows', 500)
# Display up to 500 columns
pd.set_option('display.max_columns', 500)
# Total length of dataframe
pd.set_option('display.width', 1000)

print('ready to run')
# Logging starttime 
startTime = time.time()

ready to run


## Declaring Functions

- Since training models using multiple algorithms involves repetitive tasks, we usually declare functions in advance for later use. 
- In this practice, the number of repetitions is even larger because we will train each algorithm on two different datasets. Using functions will enhance code readability.

### Declaring variables
- Declare the variables to be used for preprocessing. 
- Separate columns into categorical and numeric types.

In [4]:
category_cols = ['imp_dec_code','dec_custom_code','imp_trd_code','imp_typ_code',\
                 'collect_code','typ_transport_code','dec_mark','importer','ovs_cust_code',\
                 'exps_carr_code','HS10','country_ship_code','country_orig_code','trff_class_code',\
                 'country_orig_mark_code','crime_yn','key_exposure']

In [5]:
number_cols = ['trff_rate','dec_weight','taxabal_price_KRW']

### Min-max normalization function

- Let’s first declare a function that normalizes column values using the min and max values.

In [6]:
def normalize(column):
    return (column - column.min())/(column.max() - column.min())
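
To make the formula concrete, here is a minimal sketch that applies the same function to a small, made-up series of declared weights (the values are hypothetical and not taken from the dataset). It also shows an edge case worth keeping in mind: a column whose values are all identical has max equal to min, so the result is NaN.

```python
import pandas as pd

def normalize(column):
    # Same formula as the lesson's function: rescale values to the [0, 1] range
    return (column - column.min()) / (column.max() - column.min())

# Hypothetical declared weights, used only for illustration
weights = pd.Series([10.0, 20.0, 40.0], name="dec_weight")
scaled = normalize(weights)
print(scaled.tolist())

# A constant column has max == min, so the formula computes 0 / 0 and yields NaN
constant = pd.Series([5.0, 5.0, 5.0])
print(normalize(constant).isna().all())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.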

### Function for data preprocessing
- Now, we will declare a preprocessing function.

In [7]:
def df_preprocessing(df):
    # Copy dataset for a safe use
    copy_df = copy.deepcopy(df)

    # Separate dataset by data type
    copy_df_category = copy_df[category_cols]
    copy_df_number = copy_df[number_cols]
    
    # Load encoding object 
    encoder = LabelEncoder()

    # Encode dataframe containing categorical data
    for column_name, item in copy_df_category.items():  # 'iteritems' was removed in pandas 2.0
        encoder.fit(item)
        labels = encoder.transform(item)
        copy_df_category[column_name] = labels
        
    # Perform min-max normalization on dataframe containing numerical data
    copy_df_number_norm = copy_df_number.apply(normalize)
    
    # Return results
    return copy_df_category, copy_df_number_norm
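
One property of this function is worth illustrating: because a new 'LabelEncoder' is fitted on every call, the integer assigned to a category depends on which values are present in the dataframe passed in. The sketch below uses two made-up lists of country codes standing in for an original and a synthetic column. In this lesson each model is trained and tested on its own dataset, so the codes stay consistent within each model; the difference would matter only if a model trained on one dataset were applied to the other.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical country codes; the two lists stand in for original and synthetic columns
original_codes = ["CN", "US", "VN", "US"]
synthetic_codes = ["US", "VN", "VN", "US"]  # "CN" never appears here

enc_original = LabelEncoder().fit(original_codes)
enc_synthetic = LabelEncoder().fit(synthetic_codes)

# Classes are sorted and numbered from 0, so the mapping depends on the fitted values
print(dict(zip(enc_original.classes_, enc_original.transform(enc_original.classes_))))
print(dict(zip(enc_synthetic.classes_, enc_synthetic.transform(enc_synthetic.classes_))))
```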

### Function that separates data into training data and test data

- The next function wraps 'train_test_split', a data-splitting utility included in scikit-learn. 
- The label column is 'crime_yn', which denotes whether fraud was detected.
- For feature columns, the function excludes the label column 'crime_yn'. It also excludes 'key_exposure', which is so highly correlated with 'crime_yn' that it would make the other columns hard to use. 'imp_dec_code' is also excluded, as it has been verified to lack meaningful correlation with other columns. 
- Finally, the function splits the training and test data in an 8:2 ratio.


In [8]:
def df_splist(df):

    # Set label column
    y = df['crime_yn']
    # Remove columns that have too high/low correlation with the label column
    X = df.drop(columns=['crime_yn','key_exposure','imp_dec_code'])
    # Separate data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    return X_train, X_test, y_train, y_test
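
The following sketch runs the same kind of split on a ten-row toy frame (column names follow the lesson, values are made up). It is meant to show two things: the 8:2 proportion, and the fact that the split keeps the original row labels, which is what later allows predictions to be matched to labels through an 'index' column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows; column names follow the lesson, values are made up
toy = pd.DataFrame({
    "trff_rate": [0.1 * i for i in range(10)],
    "crime_yn": [0, 1] * 5,
})
X = toy.drop(columns=["crime_yn"])
y = toy["crime_yn"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))           # sizes of the two parts
print(X_te.index.equals(y_te.index))  # features and labels keep matching row labels
```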

### Detection rate calculation function for top 5% high-risk predicted declarations

- The following function calculates a model's detection rate on the declaration group with the top 5% highest predicted values by comparing the predictions with actual detection records.

In [9]:
def reg_top_5per(df_reg, target_y):
    
    # Select the rows where predicted values are higher than the threshold defining top 5% of predicted values using conditional expression
    top_5percent = df_reg['pred'].quantile(0.95)
    result = df_reg[df_reg['pred'] > top_5percent]
    # From the label dataframe, filter the rows having the same indices as the top 5% predicted values and store them in ‘y_test_top5’ variable
    y_test_top5 = target_y[target_y.index.isin(result['index'])]
    # Create a dataframe containing ‘y_test_top5’ values under ‘crime_yn’ column for quick calculation
    y_test_top5_df = pd.DataFrame({'crime_yn' :y_test_top5})
    # Get detection rate by calculating the proportion of the number of rows where ‘crime_yn == 1’ to the total count of top 5% predicted values
    top_5_detectionrate = y_test_top5_df[y_test_top5_df['crime_yn']==1].count() / y_test_top5_df.count()
    
    return top_5_detectionrate
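
The selection step can be traced by hand on a small example. The sketch below uses made-up scores and labels for 20 declarations and follows the same logic as the function above: compute the 0.95 quantile, keep rows strictly above it, and take the share of rows labelled 1. The second part illustrates a side effect of the strict comparison, under the assumption that several top scores are tied: the threshold then equals the maximum score, and no row is selected.

```python
import pandas as pd

# Made-up scores and labels for 20 declarations (row labels 0..19)
pred = pd.Series([i / 20 for i in range(20)])
labels = pd.Series([0] * 15 + [1, 0, 1, 1, 1])

threshold = pred.quantile(0.95)     # interpolates between the two highest scores
selected = pred[pred > threshold]   # strict comparison, as in reg_top_5per
rate = labels[selected.index].mean()
print(threshold, list(selected.index), rate)

# Tied top scores: the threshold equals the maximum, so nothing is strictly above it
tied_pred = pd.Series([0.0] * 18 + [1.0] * 2)
tied_selected = tied_pred[tied_pred > tied_pred.quantile(0.95)]
print(len(tied_selected))
```

With 20 rows, the top 5% group contains a single declaration, so the rate is either 0 or 1. Small groups make the metric coarse, which is worth remembering when comparing rates across datasets of different sizes.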

### Detection rate calculation function for top 10% high-risk predicted declarations
- The following function calculates the model's detection rate on the declaration group with the top 10% highest predicted values by comparing the predictions with actual detection records.

In [12]:
def reg_top_10per(df_reg, target_y):
    
    # Select the rows where predicted values are higher than the threshold defining top 10% of predicted values using conditional expression
    top_10percent = df_reg['pred'].quantile(0.90)
    result = df_reg[df_reg['pred'] > top_10percent]
    # From the label dataframe, filter the rows having the same indices as the top 10% predicted values and store them in ‘y_test_top10’ variable
    y_test_top10 = target_y[target_y.index.isin(result['index'])]
    # Create a dataframe containing ‘y_test_top10’ values under ‘crime_yn’ column for quick calculation
    y_test_top10_df = pd.DataFrame({'crime_yn' :y_test_top10})
    # Get detection rate by calculating the proportion of the number of rows where ‘crime_yn == 1’ to the total count of top 10% predicted values
    top_10_detectionrate = y_test_top10_df[y_test_top10_df['crime_yn']==1].count() / y_test_top10_df.count()
    
    return top_10_detectionrate
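
Since the two functions differ only in the quantile, they could be folded into one. The sketch below is one possible way to do that; 'top_fraction_detection_rate' and its 'fraction' parameter are hypothetical names that do not appear in the lesson, and unlike the lesson's functions it returns a plain float rather than a one-element Series.

```python
import pandas as pd

def top_fraction_detection_rate(df_reg, target_y, fraction):
    """Hypothetical helper: detection rate among the top `fraction` of predictions."""
    threshold = df_reg["pred"].quantile(1 - fraction)
    selected = df_reg[df_reg["pred"] > threshold]
    # Match labels by the original row labels stored in the 'index' column
    hits = target_y[target_y.index.isin(selected["index"])]
    return float((hits == 1).mean())

# Made-up predictions and labels, with non-contiguous row labels as after a split
df_reg = pd.DataFrame({"index": [3, 7, 11, 15, 19, 23, 27, 31, 35, 39],
                       "pred": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
target_y = pd.Series([0, 0, 0, 0, 0, 1, 0, 0, 0, 1], index=df_reg["index"].to_numpy())

print(top_fraction_detection_rate(df_reg, target_y, 0.10))
print(top_fraction_detection_rate(df_reg, target_y, 0.20))
```

The rest of this lesson keeps the two separate functions, so the results shown later are unaffected.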

## Data Loading & Preprocessing

- When training a model, a dataset is usually divided into a training set and a test set, with each set containing feature variables (X) and a label variable (y).
- This means that we need four dataframes to train a single model. 
- Since we are going to train two models with two datasets (original and synthetic data), we are going to create 8 dataframes in this practice. 



- Original data (X_train, X_test, y_train, y_test) -> 4 data frames
- Synthetic data (X_syn_train, X_syn_test, y_syn_train, y_syn_test) -> 4 data frames

### Original data

- Now, we will delve into the model training process. Let’s first load the original dataset.

In [11]:
df_base = pd.read_csv('df_syn_en.csv', encoding='utf-8-sig')

- Let’s label-encode the categorical columns and normalize the numerical columns using the preprocessing function.

In [13]:
copy_base_category, copy_base_number_norm = df_preprocessing(df_base)

#### Combining data

- Since the preprocessing function divides the dataframe into categorical and numerical dataframes, we will concatenate them back into a single dataframe.

In [14]:
copy_base_total = pd.concat([copy_base_category,copy_base_number_norm], axis=1)

#### Separate training data and test data
- Let’s separate the dataset into training and test sets.

In [15]:
X_train, X_test, y_train, y_test = df_splist(copy_base_total)

### Synthetic data

- Following the same methodology as applied to the original dataset, we will also conduct data preprocessing and splitting for the synthetic dataset. 
- Let’s start by loading the synthetic dataset.



In [1]:
# Synthetic data
df_syn = pd.read_csv('./data_sample/df_syn_en_14.csv', encoding='utf-8-sig') 


- Encode and normalize the dataset using the preprocessing function.

In [19]:
copy_syn_category, copy_syn_number_norm = df_preprocessing(df_syn)

#### Combining data

- Combine the preprocessed data into a single table.

In [20]:
copy_syn_total = pd.concat([copy_syn_category,copy_syn_number_norm], axis=1)

#### Separate training data and test data
- Separate the data into training and test sets.

In [21]:
X_syn_train, X_syn_test, y_syn_train, y_syn_test = df_splist(copy_syn_total)

## Evaluating Model
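- The models are regressors trained on the 0/1 fraud label, so each produces a continuous score for every data point (import declaration); a higher score indicates a higher predicted likelihood of fraud. The score is usually close to the 0-to-1 range, but a regression model is not guaranteed to stay inside it.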
- The evaluation metric for the model will be the model’s hit rate (fraud detection rate) for the top 5% and top 10% of declarations with the highest predicted values.   
- The fraud detection rates on the top 5% and 10% declaration groups will be used to compare the two models trained on original and synthetic data and verify the statistical similarity between the two datasets.




### Linear regression analysis

- Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
- We assume a linear relationship between the variables and use the method of least squares to estimate the regression parameters.
- Linear regression is used for prediction and inference in various fields, but its assumptions and limitations need careful consideration. 
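
As a small illustration of least squares on a 0/1 label, the sketch below fits a line to six made-up points with NumPy. It is not the lesson's model, only a toy case showing one of the limitations mentioned above: the fitted values are not confined to the 0-to-1 range. For the top 5% and top 10% evaluation used here, this is not a problem, because only the ordering of the predictions matters.

```python
import numpy as np

# Made-up single feature and a 0/1 label
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Design matrix with an intercept column; solve min ||A w - y||^2
A = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = w

pred = A @ w
print(intercept, slope)
print(pred.min(), pred.max())  # the smallest fitted value is below 0, the largest above 1
```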

In [23]:
from sklearn.linear_model import LinearRegression

#### Original data (X_train, X_test, y_train, y_test)

- Let’s train a linear regression model with the original data. 

In [24]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

- Now that the model training is complete, let's check the model's predictions on the test set.
- The model will output a predicted fraud score for each declaration; since the label is 0 or 1, a higher value indicates a higher predicted likelihood of fraud.
- We will use the model’s fraud detection rate for the top 5% and 10% of declarations with the highest predicted values as an evaluation metric.
- For the calculation, the model’s predictions on the test set will be first stored in a dataframe. 

In [25]:
y_pred = lr.predict(X_test)
index = X_test.index.tolist()
df = pd.DataFrame({'index' : index ,'pred': y_pred})

- Let’s calculate the model’s fraud detection rate for the top 5% of declarations using the 'reg_top_5per()' function declared earlier. 

In [28]:
base_5 = reg_top_5per(df,y_test)

- Let’s calculate the model's detection rate for the top 10% of declarations in the same manner. 

In [28]:
base_10 = reg_top_10per(df,y_test)

#### Synthetic data (X_syn_train, X_syn_test, y_syn_train, y_syn_test)

- Now, we will train another linear regression model using the same algorithm on the synthetic data and calculate the model's detection rates for the top 5% and top 10% of predictions, following the same approach as for the original dataset. 
- Let’s start by training a linear regression model on the synthetic dataset.

In [29]:
lr_syn = LinearRegression()
lr_syn.fit(X_syn_train, y_syn_train)

- Store the model's predictions on the test set in a dataframe.

In [31]:
y_syn_pred = lr_syn.predict(X_syn_test)
index = X_syn_test.index.tolist()
df = pd.DataFrame({'index' : index ,'pred': y_syn_pred})

- Calculate the fraud detection rate of the model for declarations with the top 5% of predicted values.

In [29]:
syn_5 = reg_top_5per(df,y_syn_test)

- Calculate the fraud detection rate of the model for declarations with the top 10% of predicted values. 

In [33]:
syn_10 = reg_top_10per(df,y_syn_test)

#### Detection rate summary

- Now, let’s compare the detection rates of the two models respectively trained on original and synthetic data. 
- To facilitate the comparison, we will put detection rates in a single dataframe.

In [34]:
df_result_lr = pd.DataFrame({'category': 'LinearRegression','base_5' : base_5, 'base_10': base_10,'syn_5': syn_5,'syn_10': syn_10})

In [35]:
df_result_lr

|  | category | base_5 | base_10 | syn_5 | syn_10 |
|---|---|---|---|---|---|
| crime_yn | LinearRegression | 0.353704 | 0.311111 | 0.375 | 0.28125 |


- The detection rates of the two models are close: 0.354 (original) versus 0.375 (synthetic) for the top 5% group and 0.311 versus 0.281 for the top 10% group, a gap of roughly 2 to 3 percentage points. 
- Both models also show a lower detection rate in the 10% group than in the 5% group.
- This suggests that the two models operate similarly, and the datasets share similar patterns.
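
The row label 'crime_yn' in the result comes from how the detection-rate functions compute their output. Dividing one 'DataFrame.count()' by another yields a one-element Series indexed by the column name, and pandas broadcasts the scalar 'category' value against that index when the result dataframe is built. The sketch below reproduces this with made-up labels.

```python
import pandas as pd

# Made-up labels for a selected group of four declarations
y_top = pd.DataFrame({"crime_yn": [1, 0, 1, 1]})

# Same expression shape as in the lesson's detection-rate functions
rate = y_top[y_top["crime_yn"] == 1].count() / y_top.count()
print(type(rate).__name__, rate.index.tolist(), rate["crime_yn"])

# A scalar is broadcast against the Series index when building the frame
row = pd.DataFrame({"category": "demo", "base_5": rate})
print(row)
```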



### RandomForest

Random forest is a machine learning algorithm that builds multiple decision trees and aggregates their predictions to enhance accuracy and mitigate overfitting. 
Key characteristics include:
- random selection of samples and features for building each tree; 
- ensemble training aimed at improving predictive performance; and
- ability to process large-scale, high-dimensional datasets.
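
The aggregation step can be checked directly in scikit-learn. The sketch below trains a small forest on randomly generated toy data (not the declaration data) and confirms that the forest's prediction is the average of the predictions of its individual trees, which are exposed through the 'estimators_' attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Randomly generated toy data with a noisy 0/1 label
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(float)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Average the individual trees by hand and compare with the forest's own output
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
print(np.allclose(manual_mean, forest.predict(X)))
```

Because each tree predicts an average of 0/1 labels, the forest's scores stay between 0 and 1 in this setting.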

- Let’s now test the similarity between the original and synthetic data using the random forest algorithm. 
- First, import the 'RandomForestRegressor' class.


In [36]:
from sklearn.ensemble import RandomForestRegressor

#### Original data (X_train, X_test, y_train, y_test)

- Train a random forest regression model using the original data. 

In [37]:
rf = RandomForestRegressor(n_estimators=70, random_state=42)

In [38]:
rf.fit(X_train, y_train)

- Calculate the predictions of the model on the test set of the original data. 

In [39]:
y_pred = rf.predict(X_test)

In [40]:
index = X_test.index.tolist()

In [41]:
df = pd.DataFrame({'index' : index ,'pred': y_pred})

- Calculate the fraud detection rate of the model for declarations with the top 5% of predicted values.

In [42]:
base_5 = reg_top_5per(df,y_test)

- Calculate the fraud detection rate of the model for declarations with the top 10% of predicted values.

In [43]:
base_10 = reg_top_10per(df,y_test)

#### Synthetic data (X_syn_train, X_syn_test, y_syn_train, y_syn_test)

- Train a random forest regression model using the synthetic data. 

In [44]:
rf_syn = RandomForestRegressor(n_estimators=70, random_state=42)

In [45]:
rf_syn.fit(X_syn_train, y_syn_train)

- Calculate the predictions of the model on the test set of the synthetic data. 

In [46]:
y_syn_pred = rf_syn.predict(X_syn_test)

In [47]:
index = X_syn_test.index.tolist()

In [48]:
df = pd.DataFrame({'index' : index ,'pred': y_syn_pred})

- Calculate the fraud detection rate of the model for declarations with the top 5% of predicted values.

In [49]:
syn_5 = reg_top_5per(df,y_syn_test)

- Calculate the fraud detection rate of the model for declarations with the top 10% of predicted values.

In [50]:
syn_10 = reg_top_10per(df,y_syn_test)

#### Detection rate summary

- Let’s compare the detection rates of the two models respectively trained on original and synthetic data. To facilitate the comparison, we will put detection rates in a single dataframe. 

In [51]:
df_result_rf = pd.DataFrame({'category': 'RandomForestRegressor','base_5' : base_5, 'base_10': base_10,'syn_5': syn_5,'syn_10': syn_10})

In [52]:
df_result_rf

|  | category | base_5 | base_10 | syn_5 | syn_10 |
|---|---|---|---|---|---|
| crime_yn | RandomForestRegressor | 0.937618 | 0.871747 | 0.875 | 0.75 |


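- The synthetic data also works well with the Random Forest algorithm: both models reach far higher detection rates than their linear regression counterparts.
- The two models are reasonably close, at 0.938 (original) versus 0.875 (synthetic) for the top 5% group and 0.872 versus 0.750 for the top 10% group. As with the linear regression model, both show a decrease when moving from the 5% to the 10% prediction group.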

### XGBoost

The final algorithm is XGBoost, a powerful machine learning algorithm designed to improve the performance of gradient boosting. 
Key features of XGBoost include:
- parallel processing and hardware optimization for rapid, efficient model training; 
- prevention of overfitting and improvement of generalization through regularization; and 
- support for handling missing values and feature selection. 

- Let's proceed by importing the XGBoost library. 

In [53]:
import xgboost as xgb

#### Original data (X_train, X_test, y_train, y_test)

- Let’s train an XGBoost regression model using the original data.
- Unlike the scikit-learn models above, XGBoost's native training function ('xgb.train') requires the training and test sets to be converted to the DMatrix type provided by the library. 
- Additionally, you can adjust parameter values to define the optimal training condition. 
- Train the model by inputting the set parameters, training dataset, and the number of boosting rounds.



In [54]:
# Create DMatrix for training and test sets
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [55]:
# Define parameters
params = {
    "max_depth": 2,
    "eta": 0.1,
    "subsample": 0.5,
    "colsample_bytree": 0.5,
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
}
# Number of boosting rounds
num_round = 100

In [56]:
# Train XGBoost model
xgb_model = xgb.train(params, dtrain, num_round)

- Calculate the predictions of the model on the test set of the original data. 

In [57]:
y_pred = xgb_model.predict(dtest)

In [58]:
index = X_test.index.tolist()

In [59]:
df = pd.DataFrame({'index' : index ,'pred': y_pred})

- Calculate the fraud detection rate of the model for declarations with the top 5% of predicted values.

In [60]:
base_5 = reg_top_5per(df,y_test)

- Calculate the fraud detection rate of the model for declarations with the top 10% of predicted values.

In [61]:
base_10 = reg_top_10per(df,y_test)

#### Synthetic data (X_syn_train, X_syn_test, y_syn_train, y_syn_test)

- Train an XGBoost regression model using the synthetic data. 

In [62]:
dtrain_syn = xgb.DMatrix(X_syn_train, label=y_syn_train)
dtest_syn = xgb.DMatrix(X_syn_test, label=y_syn_test)

In [63]:
params = {
    "max_depth": 2,
    "eta": 0.1,
    "subsample": 0.5,
    "colsample_bytree": 0.5,
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
}
num_round = 100

In [64]:
xgb_model_syn = xgb.train(params, dtrain_syn, num_round)

- Calculate the predictions of the model on the test set of the synthetic data. 

In [65]:
y_syn_pred = xgb_model_syn.predict(dtest_syn)

In [66]:
index = X_syn_test.index.tolist()

In [67]:
df = pd.DataFrame({'index' : index ,'pred': y_syn_pred})

- Calculate the fraud detection rate of the model for declarations with the top 5% of predicted values.

In [68]:
syn_5 = reg_top_5per(df,y_syn_test)

- Calculate the fraud detection rate of the model for declarations with the top 10% of predicted values.

In [69]:
syn_10 = reg_top_10per(df,y_syn_test)

#### Detection rate summary

- Let’s compare the detection rates of the two models respectively trained on original and synthetic data. To facilitate the comparison, we will put detection rates in a single dataframe. 

In [70]:
df_result_xgb = pd.DataFrame({'category': 'xgboost','base_5' : base_5, 'base_10': base_10,'syn_5': syn_5,'syn_10': syn_10})

In [71]:
df_result_xgb

|  | category | base_5 | base_10 | syn_5 | syn_10 |
|---|---|---|---|---|---|
| crime_yn | xgboost | 0.437037 | 0.375 | 0.625 | 0.46875 |


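- Similar to the preceding two models, both models show a decline in detection rate when moving from the 5% to the 10% prediction group.
- At the top 5% level, however, the gap between the two models (0.437 for original versus 0.625 for synthetic) is the widest of the three algorithms.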

## Detection Rate Summary for Each Model

- Let’s summarize the analysis by concatenating the detection rates of six models. 

In [72]:
total_result = pd.concat([df_result_lr, df_result_rf, df_result_xgb])

In [73]:
total_result = total_result.reset_index(drop=True)

In [74]:
total_result

|  | category | base_5 | base_10 | syn_5 | syn_10 |
|---|---|---|---|---|---|
| 0 | LinearRegression | 0.353704 | 0.311111 | 0.375 | 0.28125 |
| 1 | RandomForestRegressor | 0.937618 | 0.871747 | 0.875 | 0.75 |
| 2 | xgboost | 0.437037 | 0.375 | 0.625 | 0.46875 |


- Across all three algorithms, there was a consistent decrease in detection rates from the 5% group to the 10% group for models trained on original and synthetic data. 
- This suggests that a model trained on synthetic data can be expected to show trends similar to those of a model trained on original data, although the size of the gap differs by algorithm. 
- The analysis methods and results presented in this lesson serve only as an example. The statistical properties of synthetic data generated by CTGAN may vary based on parameters like the size of extracted data or the size of generated data. 
- Hence, it is crucial to identify suitable algorithms and parameters for evaluating generated synthetic data in every project.
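
One simple way to make the comparison more concrete is to compute the absolute gap between the original and synthetic detection rates for each algorithm. The sketch below does this with the rates reported in the summary table; the column names 'gap_5' and 'gap_10' are introduced here for illustration only, and the absolute gap is just one possible measure among many.

```python
import pandas as pd

# Detection rates copied from the summary table in this lesson
total_result = pd.DataFrame({
    "category": ["LinearRegression", "RandomForestRegressor", "xgboost"],
    "base_5": [0.353704, 0.937618, 0.437037],
    "base_10": [0.311111, 0.871747, 0.375],
    "syn_5": [0.375, 0.875, 0.625],
    "syn_10": [0.28125, 0.75, 0.46875],
})

# Absolute difference between synthetic and original rates (hypothetical column names)
gaps = total_result[["category"]].copy()
gaps["gap_5"] = (total_result["syn_5"] - total_result["base_5"]).abs()
gaps["gap_10"] = (total_result["syn_10"] - total_result["base_10"]).abs()
print(gaps.round(3))
```

In these results, the ordering of the three algorithms by detection rate is the same for the original and synthetic data in both groups (Random Forest, then XGBoost, then linear regression), even where the gaps themselves are not small.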