# Basic Synthetic Data Generation with DeepBridge

This tutorial demonstrates how to generate synthetic data using the DeepBridge library. I'll walk you through creating synthetic datasets with different methods and comparing their results.

## Overview

In this demonstration, we'll:
1. Create a sample dataset with mixed data types
2. Generate synthetic versions using three different methods:
   - Gaussian Copula
   - CTGAN (Conditional Tabular GAN)
   - UltraLight Generator
3. Evaluate and compare the quality of each method
4. Visualize the differences between original and synthetic data

## Understanding the Different Methods

Each synthetic data generation method makes its own trade-off between fidelity, speed, and memory use:

### Gaussian Copula
- Statistical method that preserves the marginal distributions and correlations between features
- Good balance between quality and computational efficiency
- Works well for numerical data with linear relationships
- Medium memory requirements
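
To make the idea concrete, here is a minimal Gaussian copula sketch written with NumPy and the standard library only. It is not DeepBridge's implementation: the function name `gaussian_copula_sample` and the rank-based marginals are assumptions chosen for illustration. It maps each column to normal scores, estimates the correlation in that space, samples correlated normals, and maps them back through the empirical quantiles.

```python
import math
from statistics import NormalDist

import numpy as np


def gaussian_copula_sample(data, n_samples, seed=0):
    """Draw synthetic rows that mimic the marginals and correlations of `data`.

    `data` is a 2-D float array with at least two columns.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape

    # 1. Map each column to normal scores through its empirical ranks.
    ranks = data.argsort(axis=0).argsort(axis=0) + 1   # ranks in 1..n_rows
    uniforms = ranks / (n_rows + 1)                    # strictly inside (0, 1)
    to_normal = np.vectorize(NormalDist().inv_cdf)
    normal_scores = to_normal(uniforms)

    # 2. Estimate the correlation structure in the Gaussian space.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3. Sample correlated normals and convert them to uniforms (normal CDF).
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    to_uniform = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
    u = to_uniform(z)

    # 4. Invert each empirical marginal with a quantile lookup.
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(n_cols)]
    )


# Illustrative data: a skewed column and a second column that depends on it.
rng = np.random.default_rng(42)
x = rng.exponential(size=2000)
y = 2.0 * x + rng.normal(scale=0.5, size=2000)
real = np.column_stack([x, y])

synthetic = gaussian_copula_sample(real, n_samples=2000)
print(np.corrcoef(real, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```

Because the dependence is modelled only through a correlation matrix, this kind of generator tends to reproduce monotonic relationships well and may miss more intricate ones.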

### CTGAN (Conditional Tabular GAN)
- Neural network-based approach using Generative Adversarial Networks
- Can capture complex, non-linear relationships in the data
- Typically the highest fidelity of the three when the data contains complex patterns
- More computationally intensive and requires more memory
- Longer training time

### UltraLight Generator
- Simplest and fastest approach with minimal memory requirements
- Uses basic statistical modeling rather than complex ML models
- Excellent for large datasets or limited computational resources
- Quality may be lower for complex relationships
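
The trade-off in the last bullet can be illustrated with the simplest possible statistical generator: resampling every column independently. The sketch below is an assumption made for illustration and is not how DeepBridge's UltraLight Generator works internally; `independent_marginal_sample` is a hypothetical helper. It shows why such approaches are cheap, and why relationships between columns can be lost.

```python
import numpy as np


def independent_marginal_sample(data, n_samples, seed=0):
    """Resample every column on its own, ignoring relationships between columns."""
    rng = np.random.default_rng(seed)
    columns = [
        rng.choice(data[:, j], size=n_samples, replace=True)
        for j in range(data.shape[1])
    ]
    return np.column_stack(columns)


# Illustrative data: the second column is fully determined by the first.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
real = np.column_stack([x, x**2])

fake = independent_marginal_sample(real, n_samples=5000)

# The marginals survive, because values are drawn from the real columns ...
mean_gap = abs(real[:, 1].mean() - fake[:, 1].mean())
# ... but the x -> x**2 relationship does not hold in the synthetic rows.
relation_error = np.abs(fake[:, 1] - fake[:, 0] ** 2).mean()

print(f"gap between column means: {mean_gap:.3f}")
print(f"mean |col1 - col0**2| in synthetic rows: {relation_error:.3f}")
```

In the real data the second quantity is exactly zero, so a large value in the synthetic rows signals that the relationship between the columns was not preserved.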

## Example Implementation

The cell below builds a sample dataset, trains a baseline model, wraps both in a `DBDataset`, and runs DeepBridge's validation tests on it:

In [1]:
import os
import sys

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Make a local DeepBridge checkout importable.
# Adjust or remove this line if the package is already installed.
sys.path.append(os.path.expanduser("~/projetos/DeepBridge"))

from deepbridge.core.db_data import DBDataset
from deepbridge.core.experiment import Experiment
from deepbridge.synthetic import Synthesize
from deepbridge.validation.wrappers import RobustnessSuite, UncertaintySuite
from deepbridge.utils.robustness import run_robustness_tests
from deepbridge.utils.uncertainty import run_uncertainty_tests
from deepbridge.utils.resilience import run_resilience_tests
from deepbridge.utils.hyperparameter import run_hyperparameter_tests

#---------------------------------------------------------
# Data preparation (taking special care with index alignment)
#---------------------------------------------------------

# Generate a synthetic two-class dataset
X, y = make_classification(n_samples=10000, n_features=20, n_classes=2, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(20)])
y = pd.Series(y)


# Reset indices to guarantee clean alignment
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Reset indices again after the split
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

# Build train and test DataFrames with explicit column names
train_df = X_train.copy()
train_df['target'] = y_train
test_df = X_test.copy()
test_df['target'] = y_test



# Train a baseline model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create the dataset object
dataset = DBDataset(
    train_data=train_df,
    test_data=test_df,
    target_column='target',
    model=model
)


# Create and run the experiment
experiment = Experiment(
    dataset=dataset,
    experiment_type="binary_classification",
    tests=["robustness", "uncertainty", "resilience", "hyperparameters"],
    feature_subset=['feature_0', 'feature_1']
)

results = experiment.run_tests("quick")



results.save_html("robustness", "report_robustness.html")
results.save_html("uncertainty", "report_uncertainty.html")
results.save_html("resilience", "report_resilience.html")
results.save_html("hyperparameters", "report_hyperparameters.html")


Generating robustness report to: /home/guhaase/projetos/DeepBridge/examples/report_robustness.html
Using templates directory: /home/guhaase/projetos/DeepBridge/deepbridge/templates/reports
Template file exists: /home/guhaase/projetos/DeepBridge/deepbridge/templates/reports/robustness/report.html
Successfully read template file (size: 71630 bytes)
Transforming robustness data structure...
Raw structure keys: ['primary_model', 'alternative_models', 'config', 'experiment_type']
Used deep copy to convert results
Found 'primary_model' key, extracting data...
Processing alternative models data...
Processing alternative model: DECISION_TREE
Processing alternative model: LOGISTIC_REGRESSION
Processing alternative model: GBM
Report data structure after transformation:
- primary_model: <class 'dict'>
- alternative_models: <class 'dict'>
- config: <class 'dict'>
- experiment_type: <class 'str'>
- base_score: <class 'float'>
- raw: <class 'dict'>
- quantile: <class 'dict'>
- feature_importance: <c

'/home/guhaase/projetos/DeepBridge/examples/report_hyperparameters.html'

In [2]:
results.results['hyperparameters'].results

KeyError: 'hyperparameters'
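
A `KeyError` like this usually means nothing was stored under that exact key. When exploring nested results, it can help to use a lookup that reports which keys do exist at the level where the path breaks. The helper below is hypothetical (it is not part of DeepBridge) and is shown on a plain stand-in dictionary whose shape only loosely imitates the nested results in this notebook; it applies to the dictionary parts of a results object, not to the wrapper objects themselves.

```python
def get_nested(mapping, path):
    """Follow `path` through nested dicts; raise a KeyError that lists the valid keys."""
    current = mapping
    for depth, key in enumerate(path):
        if not isinstance(current, dict) or key not in current:
            available = sorted(current) if isinstance(current, dict) else []
            walked = " -> ".join(map(str, path[:depth])) or "<root>"
            raise KeyError(
                f"{key!r} not found under {walked}; available keys: {available}"
            )
        current = current[key]
    return current


# Stand-in data (hypothetical values, not real DeepBridge output).
fake_results = {
    "robustness": {"primary_model": {"raw": {"by_level": {"0.1": {}, "0.2": {}}}}},
}

levels = get_nested(fake_results, ["robustness", "primary_model", "raw", "by_level"])
print(sorted(levels))

try:
    get_nested(fake_results, ["hyperparameters"])
except KeyError as err:
    # The message names the missing key and lists the keys that are present.
    print(err)
```

Printing the available keys first, for example with `results.results.keys()`, gives the same information for the top level without a helper.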

In [None]:
results.results['robustness'].results['primary_model']['raw']['by_level'].keys()

In [None]:
results.results['robustness'].results

In [None]:
experiment.initial_results

In [None]:
experiment.test_results['robustness']['primary_model']['raw']['by_level'].keys()

In [None]:
results.results['robustness']