In [2]:
import data_generator as data

In [3]:
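# Generate a synthetic data set
res = data.DataGenerator(
    n=100,                    # total number of rows
    bad_ratio=0.7,            # fraction of rows labeled BAD
    k_con=2,                  # number of continuous features
    k_bin=0,                  # number of binary features
    con_nonlinear=0,          # fraction of continuous features made nonlinear
    con_mean_bad_dif=[2, 1],  # per-feature mean difference between BAD and GOOD
    con_var_bad_dif=0.5,      # probability/degree of covariance difference
    covars=[[1, 0.2, 0.2, 1], [1, -0.2, -0.2, 1]],  # user-provided covariances
    seed=77,                  # RNG seed for reproducibility
    verbose=True,             # print progress messages
)
res.generate()
res.data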

Generating 2 continuous features with 0 binary features
Simulating (100 x 2) data set
Generating binary features...


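|     | X1 | X2 | BAD | B1 |
| --- | --- | --- | --- | --- |
| 0 | 0.129017 | -0.522957 | BAD | 1 |
| 1 | -0.881289 | -3.235269 | BAD | 1 |
| 2 | -0.440038 | -0.592905 | BAD | 1 |
| 3 | -0.397349 | 1.095051 | BAD | 1 |
| 4 | 1.968382 | 0.682414 | BAD | 1 |
| ... | ... | ... | ... | ... |
| 95 | 2.460478 | 0.396800 | GOOD | 1 |
| 96 | 1.494187 | 1.670333 | GOOD | 1 |
| 97 | 1.482158 | 1.490253 | GOOD | 1 |
| 98 | 2.289204 | 0.548125 | GOOD | 1 |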



### Core parameters and storage

#### Purpose
Holds the generator's configuration and provides the attributes in which the generated dataset and its metadata are stored.

#### Key parameters
`n`: total number of rows (observations) to generate.

`k_con`: number of continuous features (X1..Xk) to simulate. It sets the dimensionality of the multivariate normal distributions and of their covariance matrices. Pitfall: a high `k_con` with a small `n` can lead to singular covariance estimates and unstable estimation.
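
The small-`n` pitfall can be checked directly with NumPy. The sketch below does not use `DataGenerator`; the dimensions are illustrative values chosen here. It estimates a covariance matrix from fewer rows than features and shows that the estimate cannot be full rank.

```python
import numpy as np

rng = np.random.default_rng(0)

k_con = 10  # number of continuous features (illustrative)
n = 5       # fewer observations than features

# Draw n rows from a multivariate normal with identity covariance.
X = rng.multivariate_normal(mean=np.zeros(k_con), cov=np.eye(k_con), size=n)

# Sample covariance; features are in columns, hence rowvar=False.
S = np.cov(X, rowvar=False)

# After centering, n rows span at most n - 1 dimensions,
# so the k_con x k_con estimate is rank-deficient (singular).
rank = np.linalg.matrix_rank(S)
print(f"shape={S.shape}, rank={rank}, upper bound={n - 1}")
```

A singular estimate cannot be inverted, which is what breaks methods that rely on the inverse covariance.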

`k_bin`: number of binary (categorical, 0/1) features to create.

`bad_ratio`: fraction of rows labeled "BAD" (e.g. defaults or bad accounts). It is the baseline proportion used when no binary-combo-specific value is provided. Typical: small values (0.01–0.2) for realistic credit data; 0.5 for balanced synthetic experiments. Pitfall: extreme imbalance (a very small `bad_ratio`) requires a very large `n` to get enough BAD examples.
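
A quick way to size `n` for an imbalanced setting is to invert the expected count `n * bad_ratio`. The helper names below are hypothetical and not part of `data_generator`; the sketch only does the arithmetic.

```python
import math


def expected_bad(n: int, bad_ratio: float) -> float:
    """Expected number of BAD rows for a given sample size."""
    return n * bad_ratio


def min_rows_for_bad(target_bad: int, bad_ratio: float) -> int:
    """Smallest n whose expected BAD count reaches target_bad."""
    if not 0 < bad_ratio <= 1:
        raise ValueError("bad_ratio must be in (0, 1]")
    return math.ceil(target_bad / bad_ratio)


# Settings from the call at the top of the notebook.
print(expected_bad(100, 0.7))

# A realistic credit-style imbalance needs far more rows.
print(min_rows_for_bad(200, 0.01))
```

These are expectations only; the realized BAD count in a given draw may differ depending on how the generator assigns labels.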

`con_var_bad_dif`: (float or mechanism) the probability or degree to which covariance elements differ between BAD and GOOD. Effect: a nonzero value produces a more complicated decision boundary, so models may need to handle heteroskedasticity. Typical: 0 (no covariance change) to moderate values (0.1–0.5). Pitfall: large changes can produce a covariance matrix that is not positive semidefinite (non-PSD) unless handled carefully.
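
The non-PSD pitfall can be detected and repaired with an eigenvalue check. This is a minimal sketch of one common repair (eigenvalue clipping), not necessarily what the generator does; `is_psd` and `clip_to_psd` are hypothetical helper names.

```python
import numpy as np


def is_psd(cov, tol=1e-10):
    """True if cov is symmetric with no eigenvalue below -tol."""
    cov = np.asarray(cov, dtype=float)
    if not np.allclose(cov, cov.T):
        return False
    return bool(np.linalg.eigvalsh(cov).min() >= -tol)


def clip_to_psd(cov, floor=1e-8):
    """Symmetrize cov and raise its eigenvalues to at least `floor`."""
    cov = np.asarray(cov, dtype=float)
    sym = (cov + cov.T) / 2
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, floor, None)
    # Rebuild V diag(vals) V^T from the clipped spectrum.
    return (vecs * vals) @ vecs.T


good_cov = np.array([[1.0, 0.2], [0.2, 1.0]])

# Perturb the off-diagonal too far: |cov| > sqrt(var1 * var2).
bad_cov = good_cov.copy()
bad_cov[0, 1] = bad_cov[1, 0] = 1.5

repaired = clip_to_psd(bad_cov)
print(is_psd(good_cov), is_psd(bad_cov), is_psd(repaired))
```

Clipping changes the variances as well as the covariances, so it is worth inspecting the repaired matrix before sampling from it.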

`con_mean_bad_dif`: (numeric or array) amount by which BAD observations’ continuous-feature means differ from GOOD. Typical: 0.1–2 for synthetic tests; tune it to control signal strength. It can be a scalar (applied to all features) or a per-feature list such as the `[2, 1]` passed in the call above.
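
The scalar and per-feature forms can be handled uniformly with broadcasting. The sketch below assumes the shift is added to the GOOD means; the sign convention inside `DataGenerator` is not stated in this document, and `bad_means` is a hypothetical helper.

```python
import numpy as np


def bad_means(good_mean, con_mean_bad_dif):
    """Return BAD-class means as GOOD means plus a scalar or per-feature shift."""
    good_mean = np.asarray(good_mean, dtype=float)
    # broadcast_to raises ValueError if a list has the wrong length.
    shift = np.broadcast_to(
        np.asarray(con_mean_bad_dif, dtype=float), good_mean.shape
    )
    return good_mean + shift


good = [0.0, 0.0]
print(bad_means(good, 2))       # same shift for every feature
print(bad_means(good, [2, 1]))  # one shift per feature
```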

`con_nonlinear`: fraction of continuous features to make nonlinear. Typical: 0–0.5 for experimentation. Tip: in this code, transforms are applied before Gaussian noise is added, and the order affects the resulting distributions. Example: `k_con = 4`, `con_nonlinear = 0.5` → `round(4 * 0.5) = 2` nonlinear variables.
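
The count of nonlinear features follows from the `round(k_con * con_nonlinear)` rule quoted above. The sketch assumes Python's built-in `round`, which rounds exact halves to the nearest even integer; `n_nonlinear` is a hypothetical helper name.

```python
def n_nonlinear(k_con: int, con_nonlinear: float) -> int:
    """Number of continuous features that receive a nonlinear transform."""
    return round(k_con * con_nonlinear)


for k in (4, 5, 7):
    print(k, n_nonlinear(k, 0.5))
# 4 -> 2, 5 -> 2 (2.5 rounds to even), 7 -> 4 (3.5 rounds to even)
```

With an odd `k_con` and a fraction of 0.5, the result may therefore be one lower or higher than a round-half-up rule would give.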

`con_noise_var`: standard deviation of the Gaussian noise added to continuous features. It simulates measurement error and reduces the signal-to-noise ratio. Higher values → more overlap between classes and noisier features.
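
The effect of additive noise on a feature's spread can be sketched as follows. The code treats the value as a standard deviation, as the description says; if the generator actually interprets `con_noise_var` as a variance (as the name suggests), take the square root first. All values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(77)

# A clean feature with unit variance.
signal = rng.normal(loc=0.0, scale=1.0, size=100_000)

noise_sd = 0.5  # assumed to be a standard deviation
noisy = signal + rng.normal(loc=0.0, scale=noise_sd, size=signal.shape)

# For independent terms the variances add: 1.0 + 0.5**2 = 1.25.
print(f"variance before: {signal.var():.3f}, after: {noisy.var():.3f}")
```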

`covars`: user-provided covariance matrices to use instead of generating random ones.
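
In the call at the top, each entry of `covars` has `k_con * k_con = 4` values. The sketch below assumes each entry is a row-major flattened `k_con x k_con` matrix (an assumption about the format, not confirmed here) and validates the result before use.

```python
import numpy as np

k_con = 2
covars = [[1, 0.2, 0.2, 1], [1, -0.2, -0.2, 1]]

# Assumption: each flat list is a row-major k_con x k_con matrix.
matrices = [np.asarray(c, dtype=float).reshape(k_con, k_con) for c in covars]

for i, m in enumerate(matrices):
    symmetric = np.allclose(m, m.T)
    min_eig = np.linalg.eigvalsh(m).min()
    print(f"matrix {i}: symmetric={symmetric}, smallest eigenvalue={min_eig:.2f}")
```

A valid covariance matrix must be symmetric with nonnegative eigenvalues; both matrices above pass.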

`bin_prob`: Bernoulli parameter(s) for each binary feature (probability of 1). It controls the prevalence of binary attributes and thus how often each combination of values (combo) appears. Effect: it determines combo frequencies and therefore how many observations are available per stratum. Typical: 0.1–0.9; pass a single float to apply to all binaries, or a list with one value per feature.
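
How `bin_prob` drives combo frequencies can be seen by sampling Bernoulli columns and counting the distinct rows. This is a standalone NumPy sketch with illustrative values, not the generator's own sampling code.

```python
import numpy as np

rng = np.random.default_rng(77)

n, k_bin = 1000, 2
bin_prob = [0.1, 0.5]  # a single float such as 0.3 also works

# Expand a scalar or list to one probability per binary feature.
p = np.broadcast_to(np.asarray(bin_prob, dtype=float), (k_bin,))

# Each column j is Bernoulli(p[j]).
B = (rng.random((n, k_bin)) < p).astype(int)

# Count how many rows fall into each combination of binary values.
combos, counts = np.unique(B, axis=0, return_counts=True)
for combo, count in zip(combos, counts):
    print(tuple(int(v) for v in combo), int(count))
```

Rare combos (here, those with the first feature equal to 1) get few rows, which matters when per-combo parameters are estimated downstream.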

`bin_noise_var`: probability of flipping a binary bit (noise injection). It simulates measurement error or misreported attributes.
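
Bit-flip noise of this kind can be sketched with a random mask. The flip probability and array sizes below are illustrative, and the code is independent of `DataGenerator`.

```python
import numpy as np

rng = np.random.default_rng(77)

# Clean binary features.
bits = rng.integers(0, 2, size=(10_000, 3))

flip_prob = 0.05  # probability that any single bit is flipped

# Flip each bit independently where the mask is True.
flip_mask = rng.random(bits.shape) < flip_prob
noisy_bits = np.where(flip_mask, 1 - bits, bits)

flip_rate = (noisy_bits != bits).mean()
print(f"observed flip rate: {flip_rate:.4f}")
```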

`bin_mean_bad_dif`: (scalar or list) additional shift, in the BAD direction, of the continuous-feature means for observations whose binary feature equals 1.

`bin_mean_con_dif`: (scalar or list) base mean difference for continuous features associated with binary levels (applied to the GOOD baseline).

`bin_bad_ratio`: per-binary-component influence on the local BAD ratio, used to compute combo-specific bad rates. It simulates differential default rates across subgroups (important for sampling-bias experiments). Each combo’s `bad_ratio` = function(mean of `bin_bad_ratio` * `combo_vals`).
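
The combo-specific rate can be illustrated by enumerating every combination of binary values. The exact function applied to the mean is not specified here, so this sketch makes two loud assumptions: the mean is used directly (clipped to [0, 1]), and the all-zero combo falls back to the global `bad_ratio`. `combo_bad_ratios` is a hypothetical helper.

```python
from itertools import product

import numpy as np


def combo_bad_ratios(bin_bad_ratio, bad_ratio):
    """Map each binary combo to an illustrative local BAD ratio."""
    weights = np.asarray(bin_bad_ratio, dtype=float)
    ratios = {}
    for combo in product((0, 1), repeat=len(weights)):
        if not any(combo):
            # Assumption: no active binary feature -> global baseline.
            ratios[combo] = float(bad_ratio)
        else:
            # Assumption: identity function of mean(bin_bad_ratio * combo_vals).
            score = float(np.mean(weights * np.asarray(combo)))
            ratios[combo] = float(np.clip(score, 0.0, 1.0))
    return ratios


for combo, ratio in combo_bad_ratios([0.2, 0.6], bad_ratio=0.1).items():
    print(combo, round(ratio, 3))
```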

`bin_var_bad_dif`: used to compute the combo-specific probability that covariances differ between BAD and GOOD. It allows the binary attributes to change the variance structure (heteroskedastic combos).

`verbose`: (bool) print progress and info messages.

`seed`: RNG seed for reproducibility (used via a NumPy `Generator`).
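
Seeded reproducibility with a NumPy `Generator` works as sketched below. The `draw` function is a hypothetical stand-in for any sampling routine; how `DataGenerator` threads its seed internally is not shown in this document.

```python
import numpy as np


def draw(seed: int) -> np.ndarray:
    """Draw five standard-normal values from a freshly seeded Generator."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)


a = draw(77)
b = draw(77)  # same seed -> identical draws
c = draw(78)  # different seed -> different draws

print(np.array_equal(a, b), np.array_equal(a, c))
```

Creating the generator once and passing it around (rather than reseeding per call) keeps successive draws independent while staying reproducible.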

`replicate`: allows reusing the parameters from a previous generation. This makes it possible to create multiple datasets with the same parameterization, or to reuse covariances and means across runs (e.g., for controlled experiments).

Output placeholders: `self.data` will hold the final pandas DataFrame; `con_params` will record the per-combo means and covariances (a dict with keys 'means', 'covar', 'combo'); `self.arg` stores the saved configuration dictionary (the parameters used).