# EDCA tutorial on an arbitrary OpenML classification dataset

This tutorial shows how to use the EDCA AutoML framework on an OpenML classification dataset.
The example starts by loading an OpenML dataset and then evaluates EDCA on it using a single train-test split*.


*A single train-test split is not recommended for in-depth experimentation with the framework, but it is enough to demonstrate EDCA's functionality.

## Imports 

In [1]:
import numpy as np
import random
import os
# setup seed
np.random.seed(42)
random.seed(42)
os.environ["PYTHONHASHSEED"] = str(42)
import pandas as pd
from edca.evodata import DataCentricAutoML
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from datetime import datetime
from sklearn import metrics

In [2]:
import edca.evolutionary_algorithm as ea

## Load dataset

In [3]:
data_id = 151 # this example uses the electricity dataset (https://openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfClasses=gte_2&id=151)
X, y = fetch_openml(data_id=data_id, return_X_y=True, as_frame=True)



In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In this example, we leave most EDCA parameters at their default settings.

In [5]:
automl = DataCentricAutoML(
    task='classification', # detail the ML task
    seed=42, # ensure reproducibility
    metric='f1', # specify the search metric
    time_budget=-1, # time budget in seconds; -1 means no time limit, so the number of iterations is the stopping criterion
    n_iterations=20, # specify the number of iterations
    log_folder_name=f'../tests/{datetime.now().strftime("%Y-%m-%d-%H-%M-%S")}', # specify the log folder to store information
    # search_space_config='classification_models_all.json',
    use_sampling=True, # use sampling to speed up the search
    use_feature_selection=True # use feature selection to speed up the search and improve the model generalization
)

In [6]:
automl.fit(X_train, y_train)

2025-03-12 20:28:59,088: INFO     Evolutionary Search
2025-03-12 20:28:59,089: INFO     Create Initial Population
2025-03-12 20:28:59,102: INFO     Evaluate Initial Population


>>> Dataset Analysis
<<< Dataset Analysis


2025-03-12 20:29:02,697: INFO     Start Search for the best pipeline
2025-03-12 20:29:04,789: INFO     Iteration 1 >>> Fitness: 0.129 - Data%: 0.016 - Metric: 0.242 - CPU Time: 0.058 - S%: 0.032 - F%: 0.500 - CDD: 0.195
2025-03-12 20:29:05,691: INFO     Iteration 2 >>> Fitness: 0.129 - Data%: 0.016 - Metric: 0.242 - CPU Time: 0.058 - S%: 0.032 - F%: 0.500 - CDD: 0.195
2025-03-12 20:29:06,856: INFO     Iteration 3 >>> Fitness: 0.126 - Data%: 0.010 - Metric: 0.242 - CPU Time: 0.038 - S%: 0.020 - F%: 0.500 - CDD: 0.188
2025-03-12 20:29:07,283: INFO     Iteration 4 >>> Fitness: 0.125 - Data%: 0.019 - Metric: 0.230 - CPU Time: 0.043 - S%: 0.038 - F%: 0.500 - CDD: 0.165
2025-03-12 20:29:07,936: INFO     Iteration 5 >>> Fitness: 0.124 - Data%: 0.007 - Metric: 0.242 - CPU Time: 0.037 - S%: 0.015 - F%: 0.500 - CDD: 0.197
2025-03-12 20:29:08,434: INFO     Iteration 6 >>> Fitness: 0.124 - Data%: 0.007 - Metric: 0.242 - CPU Time: 0.037 - S%: 0.015 - F%: 0.500 - CDD: 0.197
2025-03-12 20:29:09,119: 
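The per-iteration log lines above follow a regular `name: value` layout, so they can be turned into records for plotting or comparison. The snippet below is a minimal sketch that parses two lines copied from the log. The helper `parse_iteration` and the regular expression are illustrative and not part of EDCA, and the meaning of each field is not assumed beyond its printed name.

```python
import re

# Two lines copied verbatim from the search log above.
LOG_LINES = [
    "2025-03-12 20:29:04,789: INFO     Iteration 1 >>> Fitness: 0.129 - Data%: 0.016 - Metric: 0.242 - CPU Time: 0.058 - S%: 0.032 - F%: 0.500 - CDD: 0.195",
    "2025-03-12 20:29:07,936: INFO     Iteration 5 >>> Fitness: 0.124 - Data%: 0.007 - Metric: 0.242 - CPU Time: 0.037 - S%: 0.015 - F%: 0.500 - CDD: 0.197",
]

# Hypothetical pattern: captures the iteration number and the rest of the line.
ITERATION_RE = re.compile(r"Iteration (\d+) >>> (.*)$")


def parse_iteration(line):
    """Return a dict with the values of one 'Iteration N >>> ...' line, or None."""
    match = ITERATION_RE.search(line)
    if match is None:
        return None  # not an iteration line (e.g. 'Create Initial Population')
    record = {"iteration": int(match.group(1))}
    # Fields are separated by ' - ' and each field is 'name: value'.
    for field in match.group(2).split(" - "):
        name, value = field.split(": ")
        record[name] = float(value)
    return record


records = [r for r in map(parse_iteration, LOG_LINES) if r is not None]
for r in records:
    print(r["iteration"], r["Fitness"], r["Metric"], r["Data%"])
```

In the logged values, `Data%` is close to the product of `S%` and `F%` (for iteration 1, 0.032 × 0.500 = 0.016). This is an observation from the output shown here, not documented behaviour.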


### Best solution achieved

In [6]:
automl.pipeline_estimator

In [7]:
# data processor
automl.pipeline_estimator.pipeline

In [8]:
# classification model
automl.pipeline_estimator.model

### Selected data

Furthermore, we can compare the data selected by EDCA with the original training data. The results show that EDCA substantially reduces the amount of data, which lowers the associated computational cost.

In [9]:
final_X, final_y = automl.get_final_data()
final_X

| | date | day | period | nswprice | nswdemand | vicprice | vicdemand | transfer |
|---|---|---|---|---|---|---|---|---|
| 15541 | 0.434450 | 3 | 0.787234 | 0.070974 | 0.477983 | 0.003467 | 0.422915 | 0.414912 |
| 15951 | 0.437901 | 5 | 0.319149 | 0.053260 | 0.402112 | 0.003467 | 0.422915 | 0.414912 |
| 11841 | 0.424804 | 3 | 0.702128 | 0.058394 | 0.357929 | 0.003467 | 0.422915 | 0.414912 |
| 12893 | 0.425778 | 4 | 0.617021 | 0.084965 | 0.512943 | 0.003467 | 0.422915 | 0.414912 |
| 39217 | 0.897925 | 7 | 0.021277 | 0.113066 | 0.605326 | 0.005189 | 0.462455 | 0.595175 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26035 | 0.465599 | 5 | 0.404255 | 0.056293 | 0.534216 | 0.003905 | 0.502330 | 0.336842 |
| 23177 | 0.456794 | 1 | 0.872340 | 0.030023 | 0.434692 | 0.001998 | 0.352667 | 0.732456 |
| 41844 | 0.903367 | 5 | 0.765957 | 0.078029 | 0.586284 | 0.005239 | 0.563698 | 0.478947 |
| 15934 | 0.437857 | 4 | 0.978723 | 0.045935 | 0.425915 | 0.003467 | 0.422915 | 0.414912 |


524 samples

EDCA applied both instance selection and feature selection, which considerably shrank the final dataset and, with it, the computational cost of training.

In [10]:
print('Original Train dataset:', X_train.shape)
print('EDCA internal train dataset', automl.internal_x_train.shape)
print('EDCA selected dataset:', automl.get_final_data_shape())

Original Train dataset: (36249, 8)
EDCA internal train dataset (27186, 8)
EDCA selected dataset: (107, 8)
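To put the three shapes in proportion, the short calculation below expresses each dataset as a fraction of the rows of a larger one. It is a worked example built only from the numbers printed above. The helper `row_fraction` is illustrative and not part of EDCA.

```python
# Shapes copied from the printed output above, as (rows, columns).
original_train = (36249, 8)
internal_train = (27186, 8)
selected = (107, 8)


def row_fraction(part, whole):
    """Fraction of the rows of `whole` that remain in `part`."""
    return part[0] / whole[0]


print(f"internal / original: {row_fraction(internal_train, original_train):.2%}")
print(f"selected / internal: {row_fraction(selected, internal_train):.2%}")
print(f"selected / original: {row_fraction(selected, original_train):.2%}")
```

The internal training set holds about 75% of the original training rows. A plausible reading is that EDCA sets the remainder aside for internal validation, but that is an assumption based on the ratio alone. The selected dataset keeps roughly 0.4% of the internal rows, or about 0.3% of the original training rows.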



## Making predictions

In [11]:
preds = automl.predict(X_test)
preds_proba = automl.predict_proba(X_test)

In [12]:
print(metrics.classification_report(y_test, preds))

              precision    recall  f1-score   support

        DOWN       0.77      0.85      0.81      5191
          UP       0.77      0.67      0.71      3872

    accuracy                           0.77      9063
   macro avg       0.77      0.76      0.76      9063
weighted avg       0.77      0.77      0.77      9063
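The averaged rows of the report can be reproduced from the per-class rows. The sketch below recomputes the macro and support-weighted F1 from the rounded values printed above, using the standard definition F1 = 2PR / (P + R). Because the inputs are already rounded to two decimals, recomputed values can differ from the report in the last digit.

```python
# Per-class values copied from the classification report above.
report = {
    "DOWN": {"precision": 0.77, "recall": 0.85, "f1": 0.81, "support": 5191},
    "UP": {"precision": 0.77, "recall": 0.67, "f1": 0.71, "support": 3872},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
for label, row in report.items():
    print(f"{label}: F1 from rounded P and R = {f1(row['precision'], row['recall']):.3f}")
```

The macro average sits below the weighted average because the smaller class (`UP`) has the lower F1 and the weighted average gives it less influence.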


