
## Synthetic Data Generation Testing

Now that we have successfully tested the preprocessing pipeline, we will extend this notebook to include:
- **Synthetic Data Generation**
- **Saving and Visualizing the Generated Data**
- **Comparing Synthetic and Original Data**


In [None]:
import sys
import os

# Get the absolute path of the project root (move up from notebooks/tests)
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))  

# Add `src/` directory explicitly to Python path
src_path = os.path.join(project_root, "src")
if src_path not in sys.path:
    sys.path.insert(0, src_path)


In [None]:
# Verify the path
print(sys.path)

In [None]:
import pandas as pd
from preprocessing.data_loader import load_dataset
from preprocessing.missing_value_handler import handle_missing_values
from preprocessing.encoding import encode_categorical_features

# Define dataset path
original_dataset_path = "../../datasets/original/studentPerformance.csv"
separator = ";"  # Adjust based on dataset format
target_column = "Target"  # Adjust based on dataset


In [None]:
# Load dataset
original_data, dataset_name = load_dataset(original_dataset_path, separator)
original_data.head()


In [None]:
# Handle missing values
cleaned_data = handle_missing_values(original_data, strategy="drop")
cleaned_data.head()


In [None]:

# Encode categorical features using Binary Encoding
encoded_data = encode_categorical_features(cleaned_data, target_column)
encoded_data.head()


## Generating Synthetic Data

With the dataset loaded, cleaned, and encoded, we first fit a CTGAN synthesizer directly on the encoded data as a smoke test, then run the project's `generate_synthetic_data` pipeline on the same data.


In [None]:
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Generate metadata from the encoded dataset
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(encoded_data)

In [None]:
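# Initialize the CTGAN synthesizer with a short run for smoke testing.
# CTGAN expects batch_size to be even and divisible by pac (default 10), so 16 is raised to 20.
ctgan_synthesizer = CTGANSynthesizer(metadata, epochs=5, batch_size=20)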

print("\n🛠️ Testing CTGAN synthesizer on the encoded dataset...")

# Train synthesizer
try:
    ctgan_synthesizer.fit(encoded_data)
    print("✅ CTGAN training successful.")
except Exception as e:
    print(f"❌ CTGAN training failed: {e}")

# Generate synthetic data
try:
    synthetic_data = ctgan_synthesizer.sample(num_rows=encoded_data.shape[0])
    print("✅ CTGAN generated synthetic data successfully.")
    display(synthetic_data.head())  # Show sample rows
except Exception as e:
    print(f"❌ CTGAN failed to generate synthetic data: {e}")

In [None]:

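# Import necessary libraries
import os
import pandas as pd
from synthetic_pipeline.data_synthesis import generate_synthetic_data, load_or_train_synthesizer

# Define parameters
TEST_SIZE = 0.2
DATASET_NAME = dataset_name  # Name returned by load_dataset, instead of a hard-coded "loan"
# Resolve the output directory from the project root so it matches the ../../datasets layout used above
SYNTHETIC_DATA_DIR = os.path.join(project_root, "datasets", "synthetic")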

# Generate synthetic data using the preprocessed dataset
synthetic_data, metadata = generate_synthetic_data(encoded_data, DATASET_NAME, TEST_SIZE)

# Display synthetic data preview
synthetic_data.head()



## Saving Synthetic Data

We save the generated synthetic data to the `datasets/synthetic` directory and confirm that the file exists.


In [None]:

# Save synthetic data to CSV (create the output directory first if it is missing)
os.makedirs(SYNTHETIC_DATA_DIR, exist_ok=True)
synthetic_data_path = os.path.join(SYNTHETIC_DATA_DIR, f"{DATASET_NAME}_synthetic.csv")
synthetic_data.to_csv(synthetic_data_path, index=False)

# Check that the file exists
os.path.exists(synthetic_data_path)
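
Checking that the file exists does not confirm that its contents survived the write. A minimal sketch of a round-trip check follows, assuming a small toy frame in place of `synthetic_data` and a temporary directory in place of `datasets/synthetic`; the names `toy_synthetic`, `out_dir`, and `csv_path` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for `synthetic_data`; the real frame comes from the synthesizer.
toy_synthetic = pd.DataFrame({"age": [18, 19, 20], "grade": [12.5, 14.0, 9.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the datasets/synthetic layout inside a throwaway directory.
    out_dir = os.path.join(tmp_dir, "datasets", "synthetic")
    os.makedirs(out_dir, exist_ok=True)

    csv_path = os.path.join(out_dir, "toy_synthetic.csv")
    toy_synthetic.to_csv(csv_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(csv_path)

# The reloaded frame should have the same shape and column order as the one written.
same_shape = reloaded.shape == toy_synthetic.shape
same_columns = list(reloaded.columns) == list(toy_synthetic.columns)
print(same_shape, same_columns)
```

In the notebook itself, the same comparison could be run with `pd.read_csv(synthetic_data_path)` against `synthetic_data`.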



## Comparing Synthetic vs Original Data

We will compare key statistics of the original and synthetic datasets to evaluate how well the synthetic data replicates the original distribution.


In [None]:

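# Compare basic statistics of original vs synthetic data.
# The synthesizer was fit on encoded_data, so use it as the reference to keep the columns aligned.
original_stats = encoded_data.describe()
synthetic_stats = synthetic_data.describe()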

# Display comparison
display(original_stats, synthetic_stats)
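
Two separate `describe()` tables are hard to scan column by column. One possible way to put the numbers side by side is sketched below, assuming both frames share numeric columns; `compare_summary_stats` and the toy frames are hypothetical helpers for illustration, not part of the project code.

```python
import pandas as pd


def compare_summary_stats(original, synthetic):
    """Return per-column mean and std for both frames plus the absolute mean gap."""
    # Restrict the comparison to columns present in both frames.
    shared = [col for col in original.columns if col in synthetic.columns]
    numeric = original[shared].select_dtypes(include="number").columns

    report = pd.DataFrame({
        "orig_mean": original[numeric].mean(),
        "synth_mean": synthetic[numeric].mean(),
        "orig_std": original[numeric].std(),
        "synth_std": synthetic[numeric].std(),
    })
    report["abs_mean_gap"] = (report["orig_mean"] - report["synth_mean"]).abs()

    # Columns whose means differ the most appear first.
    return report.sort_values("abs_mean_gap", ascending=False)


# Toy data: column "a" drifts in the synthetic frame, column "b" is identical.
toy_original = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
toy_synth = pd.DataFrame({"a": [1.0, 2.0, 6.0], "b": [10.0, 20.0, 30.0]})

report = compare_summary_stats(toy_original, toy_synth)
print(report)
```

Matching means and standard deviations are only a rough signal; they say nothing about correlations between columns or the shape of each distribution.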
