# Basic Synthetic Data Generation with DeepBridge

This tutorial demonstrates how to generate synthetic data using the DeepBridge library. We'll walk through creating synthetic datasets with three different methods and comparing their results.

## Overview

In this demonstration, we'll:
1. Create a sample dataset with mixed data types
2. Generate synthetic versions using three different methods:
   - Gaussian Copula
   - CTGAN (Conditional Tabular GAN)
   - UltraLight Generator
3. Evaluate and compare the quality of each method
4. Visualize the differences between original and synthetic data
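
To make step 3 concrete, one common way to score a synthetic column against the original is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs (0 means identical samples, 1 means no overlap). The sketch below is a NumPy-only illustration under that assumption. It is not DeepBridge's own quality metric, and the function name `ks_statistic` and the toy arrays are hypothetical.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed point.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Toy data (hypothetical): one faithful and one shifted "synthetic" column.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=2000)
good_synth = rng.normal(0.0, 1.0, size=2000)
bad_synth = rng.normal(1.0, 1.0, size=2000)

ks_good = ks_statistic(real, good_synth)
ks_bad = ks_statistic(real, bad_synth)
print(f"KS faithful: {ks_good:.3f}  KS shifted: {ks_bad:.3f}")
```

Applied per column, a lower value means the synthetic marginal is closer to the original. It says nothing about relationships between columns, so it is usually paired with a correlation comparison.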

## Understanding the Different Methods

Each synthetic data generation method has its own trade-offs:

### Gaussian Copula
- Statistical method that preserves each feature's marginal distribution and the correlations between features
- Good balance between quality and computational efficiency
- Works well for numerical data whose relationships are linear or at least monotonic
- Moderate memory requirements
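
The idea behind the bullets above can be sketched in a few lines: turn each column into normal scores through its ranks, estimate the correlation of those scores, sample new correlated normals, and map them back through each column's empirical quantiles. This is a minimal, generic Gaussian copula for numeric columns only, written with NumPy and the standard library. It is not DeepBridge's implementation, and `gaussian_copula_sample` is a hypothetical helper name.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gaussian_copula_sample(data, n_samples, seed=0):
    """Draw synthetic rows from a Gaussian copula fitted to numeric data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape

    # 1. Rank-transform each column to uniforms strictly inside (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)

    # 2. Map the uniforms to standard-normal scores.
    scores = np.vectorize(_STD_NORMAL.inv_cdf)(uniforms)

    # 3. Estimate the dependence structure in normal space.
    corr = np.corrcoef(scores, rowvar=False)

    # 4. Sample new correlated normal scores.
    new_scores = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)

    # 5. Map back: normal CDF to uniforms, then empirical quantiles per column.
    new_uniforms = np.vectorize(_STD_NORMAL.cdf)(new_scores)
    return np.column_stack(
        [np.quantile(data[:, j], new_uniforms[:, j]) for j in range(n_cols)]
    )


# Toy data (hypothetical): a normal column and a skewed column that depends on it.
rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = np.exp(x + 0.3 * rng.normal(size=1000))
real = np.column_stack([x, y])

synth = gaussian_copula_sample(real, n_samples=500)
print(synth.shape)
```

Because the marginals are restored from empirical quantiles, the skewed column stays skewed and positive. The dependence is carried only by the correlation matrix, which is why the method suits monotonic relationships better than complex non-linear ones.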

### CTGAN (Conditional Tabular GAN)
- Neural network-based approach using Generative Adversarial Networks
- Can capture complex, non-linear relationships in the data
- Often the highest fidelity of the three when the data contains complex patterns, given enough data and training time
- More computationally intensive and requires more memory
- Longer training time

### UltraLight Generator
- Simplest and fastest approach with minimal memory requirements
- Uses basic statistical modeling rather than complex ML models
- Excellent for large datasets or limited computational resources
- Quality may be lower for complex relationships
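
The last bullet is easy to see with the simplest possible statistical generator: resampling each column independently. The source does not show how the UltraLight Generator works internally, so the sketch below is not its implementation. It only illustrates, under that simplifying assumption, why a very light model can keep the marginals while losing relationships between features. The helper name `sample_independent_marginals` is hypothetical.

```python
import numpy as np


def sample_independent_marginals(data, n_samples, seed=0):
    """Resample every column on its own, ignoring relationships between columns."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    columns = [
        rng.choice(data[:, j], size=n_samples, replace=True)
        for j in range(data.shape[1])
    ]
    return np.column_stack(columns)


# Toy data (hypothetical): two strongly correlated columns.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
real = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=2000)])

synth = sample_independent_marginals(real, n_samples=2000)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synth, rowvar=False)[0, 1]
print(f"correlation in real data: {real_corr:.2f}, in synthetic data: {synth_corr:.2f}")
```

Each synthetic column still follows the original column's distribution, but the correlation between the two columns is gone. This is the kind of quality loss to check for when choosing a faster method.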

## Example Implementation

The notebook cells below prepare a dataset, train a model, wrap both in a `DBDataset`, and run a DeepBridge `Experiment` on it:

In [1]:
import os
import sys

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Make a local DeepBridge checkout importable (adjust the path to your setup)
sys.path.append(os.path.expanduser("~/projetos/DeepBridge"))

from deepbridge.core.db_data import DBDataset
from deepbridge.core.experiment import Experiment

# Synthesize and the validation helpers below are imported but not used in the cells shown here
from deepbridge.synthetic import Synthesize
from deepbridge.validation.wrappers import RobustnessSuite, UncertaintySuite
from deepbridge.utils.robustness import run_robustness_tests
from deepbridge.utils.uncertainty import run_uncertainty_tests
from deepbridge.utils.resilience import run_resilience_tests
from deepbridge.utils.hyperparameter import run_hyperparameter_tests

# ---------------------------------------------------------
# Careful data preparation
# ---------------------------------------------------------
print("Loading and preparing data...")

from sklearn.datasets import make_classification

# Generate a synthetic two-class dataset
X, y = make_classification(n_samples=10000, n_features=20, n_classes=2, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(20)])
y = pd.Series(y)

# Check for missing and infinite values
print(f"NaN values in X before cleaning: {X.isna().sum().sum()}")
print(f"Infinite values in X: {np.isinf(X.values).sum()}")

# Reset indices to guarantee clean alignment
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Reset indices again after the split
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

# Build train and test DataFrames with an explicit target column
train_df = X_train.copy()
train_df['target'] = y_train
test_df = X_test.copy()
test_df['target'] = y_test

# Final check
print(f"NaN in train_df: {train_df.isna().sum().sum()}")
print(f"NaN in test_df: {test_df.isna().sum().sum()}")

# Train the model
print("\nTraining model...")
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create the dataset object
print("\nCreating dataset object...")
dataset = DBDataset(
    train_data=train_df,
    test_data=test_df,
    target_column='target',
    model=model
)

Loading and preparing data...
NaN values in X before cleaning: 0
Infinite values in X: 0
NaN in train_df: 0
NaN in test_df: 0

Training model...

Creating dataset object...


In [2]:
# Create and run the experiment
experiment = Experiment(
    dataset=dataset,
    experiment_type="binary_classification",
    tests=["robustness", "uncertainty"],
    feature_subset=['feature_0', 'feature_1'],
    suite="quick"
)

In [3]:
experiment.initial_results['models']['primary_model']['metrics']

{'accuracy': 0.982,
 'f1': 0.9819959474671671,
 'precision': 0.982151158739503,
 'recall': 0.982}

In [4]:
experiment.initial_results['models']['DECISION_TREE']['metrics']

{'accuracy': 0.935,
 'f1': 0.9349988946805626,
 'precision': 0.9349996598898043,
 'recall': 0.935}

In [5]:
experiment.initial_results['models']['LOGISTIC_REGRESSION']['metrics']

{'accuracy': 0.892,
 'f1': 0.891998163998164,
 'precision': 0.892512069910651,
 'recall': 0.892}

In [6]:
experiment.initial_results['models']['GBM']['metrics']

{'accuracy': 0.9375,
 'f1': 0.9374750876950666,
 'precision': 0.9377924782923626,
 'recall': 0.9375}
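
The four lookups above are easier to compare side by side. The sketch below collects them into one pandas table. To stay runnable without DeepBridge it uses a plain dictionary, here called `models` (a hypothetical stand-in), that mirrors the `initial_results['models'][name]['metrics']` layout seen in the cells above, with values copied from their outputs and rounded to four decimals. With a live experiment, `experiment.initial_results['models']` would presumably take its place.

```python
import pandas as pd

# Stand-in for experiment.initial_results['models'] (values copied from the
# cell outputs above, rounded to four decimals).
models = {
    "primary_model": {"metrics": {"accuracy": 0.982, "f1": 0.9820, "precision": 0.9822, "recall": 0.982}},
    "DECISION_TREE": {"metrics": {"accuracy": 0.935, "f1": 0.9350, "precision": 0.9350, "recall": 0.935}},
    "LOGISTIC_REGRESSION": {"metrics": {"accuracy": 0.892, "f1": 0.8920, "precision": 0.8925, "recall": 0.892}},
    "GBM": {"metrics": {"accuracy": 0.9375, "f1": 0.9375, "precision": 0.9378, "recall": 0.9375}},
}

# One row per model, one column per metric, best accuracy first.
table = pd.DataFrame({name: info["metrics"] for name, info in models.items()}).T
table = table.sort_values("accuracy", ascending=False)
print(table)
```

Sorting by accuracy puts the primary random forest first, followed by GBM, the decision tree, and logistic regression.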

In [7]:
experiment.save_report("robustness", "rob.html")

DEBUG: Using direct metrics from primary model: {'auc': 0.97, 'roc_auc': 0.97, 'accuracy': 0.96, 'f1': 0.96, 'precision': 0.96, 'recall': 0.96}
DEBUG: Forced unique metrics for DECISION_TREE: AUC=0.885
DEBUG: Model DECISION_TREE metrics: {'auc': 0.885, 'accuracy': 0.865, 'f1': 0.875, 'precision': 0.88, 'recall': 0.87}
DEBUG: Forced unique metrics for LOGISTIC_REGRESSION: AUC=0.9129
DEBUG: Model LOGISTIC_REGRESSION metrics: {'auc': 0.9129, 'accuracy': 0.9019, 'f1': 0.9119, 'precision': 0.9169, 'recall': 0.9069}
DEBUG: Forced unique metrics for GBM: AUC=0.9491999999999999
DEBUG: Model GBM metrics: {'auc': 0.9491999999999999, 'accuracy': 0.9311999999999999, 'f1': 0.9411999999999999, 'precision': 0.9461999999999999, 'recall': 0.9361999999999999}
DEBUG: Experiment report - primary model metrics:
DEBUG: Primary model metrics: {'auc': 0.97, 'roc_auc': 0.97, 'accuracy': 0.96, 'f1': 0.96, 'precision': 0.96, 'recall': 0.96}
DEBUG: Models in experiment_info: ['primary_model', 'DECISION_TREE', 'LO

'rob.html'

In [8]:
experiment

<deepbridge.core.experiment.experiment.Experiment at 0x7f15f7e058b0>