
## Synthetic Data Generation Testing

Now that we have successfully tested the preprocessing pipeline, we will extend this notebook to include:
- **Synthetic Data Generation**
- **Saving and Visualizing the Generated Data**
- **Comparing Synthetic and Original Data**


## Preprocessing Part

In [9]:
import sys
import os

# Get the absolute path of the project root (move up from notebooks/tests)
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))  

# Add `src/` directory explicitly to Python path
src_path = os.path.join(project_root, "src")
if src_path not in sys.path:
    sys.path.insert(0, src_path)


In [10]:
# Verify the path
print(sys.path)

['c:\\Users\\delea\\OneDrive\\Documents\\Desktop\\Master Thesis\\MasterThesisCode\\src', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\DLLs', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312', '', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\win32', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\win32\\lib', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\Pythonwin']


In [11]:
import pandas as pd
from preprocessing.data_loader import load_dataset
from preprocessing.missing_value_handler import handle_missing_values
from preprocessing.encoding import encode_categorical_features

# Define dataset path
original_dataset_path = "../../datasets/original/studentPerformance.csv"
separator = ";"  # Adjust based on dataset format
target_column = "Target"  # Adjust based on dataset


In [12]:
# Load dataset
original_data, dataset_name = load_dataset(original_dataset_path, separator)
original_data.head()


📂 Loading dataset from: ../../datasets/original/studentPerformance.csv...

Processing dataset: studentPerformance
Original dataset size: 4424 rows


Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [13]:
# Handle missing values
cleaned_data = handle_missing_values(original_data, strategy="drop")
cleaned_data.head()


Dropped 0 rows due to missing values


Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [14]:

# Encode categorical features using Binary Encoding
encoded_data = encode_categorical_features(cleaned_data, target_column)
encoded_data.head()


🔹 Identified Categorical Columns: []
⚠ No categorical columns found. Returning original data.


Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate



## Synthetic Data Part


In [15]:
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer
from sdv.single_table import GaussianCopulaSynthesizer

# Generate metadata from the encoded dataset
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(encoded_data)

In [16]:

# Initialize GaussianCopulaSynthesizer (without 'distribution' argument)
gc_synthesizer = GaussianCopulaSynthesizer(metadata)

# Train the synthesizer
gc_synthesizer.fit(encoded_data)

# Generate synthetic data
synthetic_data = gc_synthesizer.sample(num_rows=encoded_data.shape[0])

# Display first few rows
display(synthetic_data.head())




Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,44,1,9303,1,4,137.2,52,15,44,...,0,12,14,6,18.033029,0,10.0,3.5,1.79,Dropout
1,1,17,4,8992,1,1,130.3,5,35,2,...,0,6,19,9,18.571429,0,8.9,2.6,2.97,Dropout
2,1,9,1,9172,1,3,140.3,50,44,21,...,0,2,0,1,9.554333,0,9.5,0.8,3.36,Graduate
3,1,32,1,5634,1,4,124.2,34,3,2,...,0,2,1,5,15.936418,0,11.2,-0.8,2.16,Graduate
4,1,26,3,9984,1,14,137.2,4,20,21,...,0,8,2,4,16.237206,0,10.9,3.7,1.19,Dropout


In [18]:
# Import the evaluation function
from dataOperations.synthetic_data_operations import evaluate_synthetic_data

# Evaluate the synthetic data
evaluate_synthetic_data(original_data, synthetic_data, metadata, target_column, dataset_name)



Running diagnostic comparison for studentPerformance...
Generating report ...

(1/2) Evaluating Data Validity: |██████████| 37/37 [00:00<00:00, 710.07it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |██████████| 1/1 [00:00<00:00, 64.50it/s]|
Data Structure Score: 100.0%

Overall Score (Average): 100.0%

Diagnostic Results:
<sdmetrics.reports.single_table.diagnostic_report.DiagnosticReport object at 0x00000162D47BCCE0>

Evaluating quality metrics for studentPerformance...
Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 37/37 [00:00<00:00, 373.40it/s]|
Column Shapes Score: 76.11%

(2/2) Evaluating Column Pair Trends: |██████████| 666/666 [00:06<00:00, 95.99it/s]| 
Column Pair Trends Score: 79.07%

Overall Score (Average): 77.59%

Quality Report:
<sdmetrics.reports.single_table.quality_report.QualityReport object at 0x00000162F8F54DA0>

Analyzing column distributions...
                                            Column        Metric     Score
0   
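
The quality figures printed above can be sanity-checked by hand. The sketch below uses only the rounded values shown in the output. It checks that 37 evaluated columns give 666 column pairs, and that the overall score is the average of the two property scores, as the report's own "(Average)" label suggests. The variable names are illustrative and are not part of the project code.

```python
from math import comb

# Values copied from the report output above (already rounded to two decimals)
n_columns = 37
column_shapes = 76.11
column_pair_trends = 79.07

# Each unordered pair of columns is evaluated once
n_pairs = comb(n_columns, 2)

# The report labels the overall score as an average of the two properties
overall = round((column_shapes + column_pair_trends) / 2, 2)

print(n_pairs, overall)
```

Because the inputs are rounded, this only reproduces the reported figures to two decimals.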

In [17]:

# Import necessary libraries
import os
import pandas as pd
from synthetic_pipeline.data_synthesis import generate_synthetic_data, load_or_train_synthesizer

# Define parameters
TEST_SIZE = 0.2
DATASET_NAME = "loan"  # NOTE: the data loaded above is studentPerformance; "loan" looks like a leftover name
SYNTHETIC_DATA_DIR = "datasets/synthetic"

# Generate synthetic data using the preprocessed dataset
synthetic_data, metadata = generate_synthetic_data(encoded_data, DATASET_NAME, TEST_SIZE)

# Display synthetic data preview
synthetic_data.head()



Processing dataset: loan
Original dataset size: 4424 rows


TypeError: load_or_train_synthesizer() missing 1 required positional argument: 'config'


## Saving Synthetic Data

The generated synthetic data is saved to the `datasets/synthetic` directory; the final line of the cell checks that the file was actually written.


In [None]:

# Save synthetic data to CSV (create the output directory first if it does not exist)
os.makedirs(SYNTHETIC_DATA_DIR, exist_ok=True)
synthetic_data_path = os.path.join(SYNTHETIC_DATA_DIR, f"{DATASET_NAME}_synthetic.csv")
synthetic_data.to_csv(synthetic_data_path, index=False)

# Check if the file exists
os.path.exists(synthetic_data_path)
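
An existence check confirms that a file was written, but not that its contents survived. As a minimal sketch, the round trip below writes a small stand-in DataFrame to a temporary directory, reads it back, and compares shape and column names. The frame `demo_synthetic` and the file name are hypothetical; in this notebook the same check would apply to `synthetic_data` and `synthetic_data_path`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; in the notebook this would be `synthetic_data`
demo_synthetic = pd.DataFrame({"GDP": [1.74, 0.79], "Target": ["Dropout", "Graduate"]})

# Temporary directory so the sketch does not touch the real datasets/ folder
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "datasets", "synthetic")
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create missing directories

    out_path = os.path.join(out_dir, "demo_synthetic.csv")
    demo_synthetic.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(out_path)

# Shape and column names should survive the round trip
round_trip_ok = (
    reloaded.shape == demo_synthetic.shape
    and list(reloaded.columns) == list(demo_synthetic.columns)
)
print(round_trip_ok)
```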



## Comparing Synthetic vs Original Data

We compare summary statistics (count, mean, standard deviation, min, quartiles, max) of the numeric columns in the original and synthetic datasets to see how closely the synthetic data reproduces the original distributions.


In [None]:

# Compare basic statistics of original vs synthetic data
original_stats = cleaned_data.describe()
synthetic_stats = synthetic_data.describe()

# Display comparison
display(original_stats, synthetic_stats)
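
Two separate `describe()` tables are hard to compare by eye when there are many columns. One possible refinement, sketched here on small stand-in frames, is to keep a few summary rows, compute the absolute gap per column, and join everything into a single table. The names `original_demo`, `synthetic_demo`, and `stat_gap` are hypothetical; in this notebook the inputs would be `cleaned_data` and `synthetic_data`.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be `cleaned_data` and `synthetic_data`
original_demo = pd.DataFrame({"grade": [10.0, 12.0, 14.0], "rate": [1.0, 2.0, 3.0]})
synthetic_demo = pd.DataFrame({"grade": [11.0, 12.0, 16.0], "rate": [1.0, 2.0, 3.0]})

# Keep the same summary rows for both tables
rows = ["mean", "std"]
original_summary = original_demo.describe().loc[rows]
synthetic_summary = synthetic_demo.describe().loc[rows]

# Absolute gap per column and statistic; 0 means the statistics match
stat_gap = (synthetic_summary - original_summary).abs()

# Side-by-side view with one top-level column group per source
comparison = pd.concat(
    {"original": original_summary, "synthetic": synthetic_summary, "abs_diff": stat_gap},
    axis=1,
)
print(comparison)
```

Matching means and standard deviations do not imply matching distributions, so this table complements the Column Shapes score and does not replace it.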
