This Jupyter Notebook shows how to apply the **mdatagen** library to large datasets. We fetch the data from the Penn Machine Learning Benchmarks (PMLB) and select small, medium, and large datasets, as described in the table below:

| Dataset     | n_instances | n_features | n_binary_features | n_categorical_features | n_continuous_features |
|-------------|-------------|------------|-------------------|------------------------|-----------------------|
| mushroom    | 8124        | 22         | 5                 | 16                     | 1                     |
| adult       | 48842       | 14         | 1                 | 4                      | 9                     |
| kddcup      | 494020      | 41         | 4                 | 9                      | 28                    |
| poker       | 1025010     | 10         | 0                 | 5                      | 5                     |
| mfeat_pixel | 2000        | 240        | 0                 | 240                    | 0                     |

We selected the multivariate MAR mechanism with the median strategy because it represents the worst-case scenario for larger datasets. We also provide an example of setting the number of threads to parallelize the generation.
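To give an intuition for what "MAR with a median rule" means, here is a toy sketch. It is not mdatagen's implementation: the helper name `toy_mar_below_median` and the exact rule (drop values of one column wherever another, fully observed column falls below its median) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def toy_mar_below_median(df, x_miss, x_obs):
    """Hypothetical helper: set x_miss to NaN wherever x_obs is below its median."""
    out = df.copy()
    out[x_miss] = out[x_miss].astype(float)  # make sure the column can hold NaN
    mask = out[x_obs] < out[x_obs].median()  # missingness depends only on observed data
    out.loc[mask, x_miss] = np.nan
    return out


toy = pd.DataFrame(
    {
        "age": [20, 30, 40, 50, 60, 70],
        "income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
toy_md = toy_mar_below_median(toy, x_miss="income", x_obs="age")
print(toy_md)
```

The point of the sketch is that the missingness in `income` is fully explained by the observed `age` column, which is what makes the mechanism MAR rather than MCAR or MNAR.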

In [None]:
# Import the libraries
import numpy as np 
import pmlb
from mdatagen.multivariate.mMAR import mMAR
from time import perf_counter

In [3]:
# Helper to split a PMLB DataFrame into features (X) and labels (y)
def split_data(data):
    # drop() returns a new DataFrame, so the input is left untouched
    X = data.drop(columns=["target"])
    y = data["target"]

    return X, np.array(y)

# The data from PMLB
adult_data = pmlb.fetch_data('adult')
kddcup = pmlb.fetch_data('kddcup')
mushroom = pmlb.fetch_data('mushroom')
mfeat_pixel = pmlb.fetch_data('mfeat_pixel')
poker  = pmlb.fetch_data('poker')
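As a quick check of what `split_data` returns, the sketch below applies the same logic to a small hand-made DataFrame instead of a PMLB download. The toy column names `f1` and `f2` are made up; the only assumption carried over from the notebook is that the label column is called `target`.

```python
import numpy as np
import pandas as pd


def split_data(data):
    """Split a DataFrame into features X (DataFrame) and labels y (NumPy array)."""
    X = data.drop(columns=["target"])
    y = np.array(data["target"])
    return X, y


# Toy stand-in for a PMLB dataset (hypothetical values)
toy = pd.DataFrame({"f1": [1, 2, 3], "f2": [0.5, 1.5, 2.5], "target": [0, 1, 0]})
X_toy, y_toy = split_data(toy)

print(X_toy.shape)   # features only, without the "target" column
print(type(y_toy))   # labels as a NumPy array
```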

In [None]:
X_, y_ = split_data(adult_data)

time_init = perf_counter()
generator = mMAR(X=X_, y=y_)
gen_md = generator.median(missing_rate=20)

time_end = perf_counter()
print(f"Time: {round(time_end-time_init,4)} s ")

Time: 6.0469 s 
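After generating the missing data, it can be useful to check how much of the dataset actually ended up missing. The helper below is a hypothetical sketch, not part of mdatagen: it assumes the generated data is a pandas DataFrame with missing cells stored as NaN and, possibly, a `target` column that should be ignored. It reports the overall share of missing cells and says nothing about how the library distributes them across features.

```python
import numpy as np
import pandas as pd


def observed_missing_rate(df, ignore=("target",)):
    """Percentage of NaN cells in df, skipping any columns listed in `ignore`."""
    features = df.drop(columns=[c for c in ignore if c in df.columns])
    return 100.0 * features.isna().to_numpy().mean()


# Toy stand-in for generated data: 2 NaN cells out of 8 feature cells
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [np.nan, 2.0, 3.0, 4.0],
        "target": [0, 1, 0, 1],
    }
)
print(f"Missing rate: {observed_missing_rate(toy):.1f} %")
```

In the notebook, the same call could be made on the generator output, for example `observed_missing_rate(gen_md)`, provided `gen_md` has the layout assumed above.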


- Parallelization:

In [None]:
import os 
X_, y_ = split_data(adult_data)

time_init = perf_counter()
generator = mMAR(X=X_, y=y_, n_Threads=os.cpu_count())
gen_md = generator.median(missing_rate=20)

time_end = perf_counter()
print(f"Time: {round(time_end-time_init,4)} s ")
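The two timing cells repeat the same `perf_counter` pattern, so a small wrapper can make serial and parallel runs easier to compare. `time_call` below is a hypothetical helper, and the workload is a stand-in (`sum` over a range) so that the sketch runs without mdatagen or PMLB installed.

```python
from time import perf_counter


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) once and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    return result, perf_counter() - start


# Stand-in workload; replace with the generation call in the notebook
result, elapsed = time_call(sum, range(1_000_000))
print(f"Time: {elapsed:.4f} s")
```

In the notebook, the callable could wrap the generation itself, for example `time_call(lambda: mMAR(X=X_, y=y_, n_Threads=n).median(missing_rate=20))` for a few values of `n`. A single run is only a rough measurement, so repeating it a few times gives a fairer comparison.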