# 🧬 Getting Started with EDCA

Welcome! In this tutorial, we will learn how to use EDCA, an AutoML framework that focuses on your data's unique characteristics to build better models.

Instead of just searching for a model, EDCA treats data as a first-class citizen, automatically evolving end-to-end pipelines that clean, reduce, and optimize themselves specifically for your dataset.

## Imports 

In [61]:
# to import edca from the original src, without installing it via pip
import sys
sys.path.append('../edca')

In [62]:
import numpy as np
import random
import os
import pandas as pd
import datetime
from edca.evodata import DataCentricAutoML
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split # to evaluate the framework
from sklearn import metrics # for calculating metrics

In [63]:
# EDCA uses randomness, so it is better to fix the seed to get the same results on every run
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)
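As a minimal sketch of why fixing the seed matters, the snippet below reseeds both global generators and checks that the same draws come back. The helper names `set_seed` and `draw` are hypothetical and not part of EDCA. Note also that assigning `PYTHONHASHSEED` inside a running interpreter does not change that interpreter's hash randomization: the variable is read at startup, so it only affects child processes launched afterwards.

```python
import os
import random

import numpy as np


def set_seed(seed: int) -> None:
    """Hypothetical helper: seed Python's and NumPy's global generators."""
    random.seed(seed)
    np.random.seed(seed)
    # Only inherited by subprocesses; the current interpreter has already started.
    os.environ["PYTHONHASHSEED"] = str(seed)


def draw():
    """Draw a few values from both global generators."""
    return random.random(), np.random.rand(3).tolist()


set_seed(42)
first = draw()
set_seed(42)
second = draw()

print(first == second)  # True: same seed, same draws
```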

## Load dataset

Start by getting the data to feed EDCA. EDCA expects the dataset as a pandas DataFrame. You can fetch a dataset from OpenML (example 1) or use your own dataset (example 2). In both cases, split the data into the feature DataFrame (X) and the target Series (y). If you run both cells, example 2 overwrites the X and y from example 1; the outputs shown later in this tutorial come from example 2.

### Example 1 - fetch a dataset from OpenML

In [64]:
# example 1 - fetch a classification dataset from OpenML
data_id = 151 # the electricity dataset (https://openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfClasses=gte_2&id=151)
X, y = fetch_openml(data_id=data_id, return_X_y=True, as_frame=True)

### Example 2 - your own datasets

Start by specifying the path to your dataset. Then load it into a DataFrame and split it into data and target.

In [65]:
# example 2 - use your own classification dataset
data_path = os.path.join('..', 'data', 'datasets', 'Australian.csv')
X = pd.read_csv(data_path) # load dataframe
y = X.pop('class') # divide it into data and target
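The `pop` call is what separates features from target: it removes the named column from the DataFrame in place and returns it as a Series. A small sketch on a made-up frame (the column values and the names `X_toy` and `y_toy` are illustrative only, not taken from the Australian dataset):

```python
import pandas as pd

# Toy stand-in for a CSV file: two feature columns plus a 'class' target column.
df = pd.DataFrame({
    "A1": [1, 0, 1, 0],
    "A2": [22.0, 35.5, 41.2, 18.9],
    "class": [0, 1, 1, 0],
})

X_toy = df.copy()           # copy first so the original frame stays untouched
y_toy = X_toy.pop("class")  # removes 'class' from X_toy and returns it as a Series

print(list(X_toy.columns))
print(y_toy.name, len(y_toy))
```

Copying before `pop` is optional; it is only needed if you want to keep the full frame around for later inspection.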

## Split the data to later evaluate EDCA's best solution

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
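The split above is purely random. When classes are imbalanced, one common option is to pass `stratify=y` to `train_test_split` so that both parts keep the class ratio. The sketch below uses synthetic data (the names `X_demo` and `y_demo` are illustrative); whether stratification suits your dataset is a judgment call, and the tutorial itself does not use it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic data: 100 rows, 70 of class 0 and 30 of class 1.
X_demo = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y_demo = pd.Series([0] * 70 + [1] * 30, name="class")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=SEED, stratify=y_demo
)

# Both parts keep the 70/30 class ratio of the full dataset.
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```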

## Initialize EDCA

In this example, we leave most EDCA parameters at their default settings. All available parameters are documented in `EDCA/edca/README.md`.

In [67]:
# create a folder to store all information regarding the optimization
save_path = '../logs'
os.makedirs(save_path, exist_ok=True)

In [68]:
# initialize the class
automl = DataCentricAutoML(
    task='classification', # detail the ML task
    seed=SEED, # ensure reproducibility
    metric='f1', # specify the search metric
    time_budget=-1, # time budget in seconds; -1 means no time limit, so the number of iterations is the stopping criterion
    n_iterations=100, # specify the number of iterations
    log_folder_name=f'{save_path}/experiment-{datetime.datetime.now()}', # specify the log folder to store information
    # search_space_config='classification_models_all.json',
    use_sampling=True, # use sampling to speed up the search
    use_feature_selection=True # use feature selection to speed up the search and improve model generalization
)
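One detail worth knowing about the log folder name: `str(datetime.datetime.now())` contains a space and colons (for example `2024-01-02 03:04:05.678901`), and colons are not allowed in Windows folder names. A possible workaround, sketched here with a hypothetical helper `make_log_folder_name` that is not part of EDCA, is to format the timestamp explicitly:

```python
import datetime


def make_log_folder_name(base: str, now: datetime.datetime) -> str:
    """Hypothetical helper: build a folder name without spaces or colons."""
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{base}/experiment-{stamp}"


# A fixed timestamp keeps the example deterministic.
fixed = datetime.datetime(2024, 1, 2, 3, 4, 5)
print(make_log_folder_name("../logs", fixed))  # ../logs/experiment-2024-01-02_03-04-05
```

In the notebook you would pass `datetime.datetime.now()` instead of the fixed timestamp and hand the result to `log_folder_name`.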

### Optimize the ML pipeline with EDCA

In [None]:
automl.fit(X_train, y_train)

## Analyze the best solution found

In [70]:
# the entire pipeline config
automl.pipeline_estimator

| Parameter | Value |
| --- | --- |
| individual_config | `{'model': {np.str_('XGBClassifier'): {'max_depth': 15, 'n_estimators': 58}}, 'sample': [np.int64(1), np.int64(3), ...], 'scaler': 'RobustScaler'}` |
| pipeline_config | `{'automatic_data_optimization': True, 'binary_columns': [], 'binary_with_nans': [], 'categorical_columns': [], ...}` |
| seed | |
| individual_id | `'best_individual'` |
| fairness_params | `{}` |


In [71]:
# data processing pipeline
automl.pipeline_estimator.pipeline

| Parameter | Value |
| --- | --- |
| transformers | `[('num', ...)]` |
| remainder | `'drop'` |
| sparse_threshold | `0.3` |
| n_jobs | |
| transformer_weights | |
| verbose | `False` |
| verbose_feature_names_out | `False` |
| force_int_remainder_cols | `'deprecated'` |

| Parameter | Value |
| --- | --- |
| with_centering | `True` |
| with_scaling | `True` |
| quantile_range | `(25.0, ...)` |
| copy | `True` |
| unit_variance | `False` |


In [72]:
# classification model
automl.pipeline_estimator.model

| Parameter | Value |
| --- | --- |
| objective | `'binary:logistic'` |
| base_score | |
| booster | |
| callbacks | |
| colsample_bylevel | |
| colsample_bynode | |
| colsample_bytree | |
| device | |
| early_stopping_rounds | |
| enable_categorical | `False` |


### Selected data

Furthermore, we can compare the data EDCA selected against the original training data. EDCA substantially reduces the data, which lowers the computational cost of training.

In [73]:
final_X, final_y = automl.get_final_data()
final_X

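|  | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 110 | 1 | 222.0 | 154 | 1 | 4 | 4 | 4 | 0 | 0 | 1 | 0 | 2 | 53 | 1 |
| 51 | 1 | 53.0 | 90 | 1 | 3 | 8 | 2 | 0 | 0 | 1 | 0 | 2 | 45 | 50 |
| 534 | 1 | 167.0 | 66 | 2 | 6 | 4 | 104 | 1 | 0 | 1 | 1 | 2 | 157 | 1 |
| 18 | 1 | 42.0 | 40 | 1 | 4 | 4 | 4 | 0 | 0 | 1 | 0 | 2 | 45 | 5 |
| 631 | 1 | 284.0 | 85 | 2 | 3 | 5 | 26 | 0 | 0 | 1 | 0 | 2 | 32 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 276 | 0 | 323.0 | 12 | 2 | 3 | 5 | 26 | 1 | 0 | 1 | 1 | 2 | 88 | 187 |
| 343 | 1 | 238.0 | 118 | 2 | 11 | 4 | 2 | 1 | 1 | 2 | 0 | 2 | 69 | 94 |
| 466 | 0 | 340.0 | 192 | 2 | 1 | 1 | 1 | 1 | 1 | 12 | 1 | 2 | 1 | 171 |
| 20 | 0 | 129.0 | 20 | 2 | 6 | 4 | 2 | 0 | 0 | 1 | 0 | 2 | 92 | 151 |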


EDCA was configured with instance and feature selection. In this run, the selected dataset keeps all 14 features but only 116 of the 414 instances in EDCA's internal training set, which reduces the computational cost of training.

In [74]:
print('Original Train dataset:', X_train.shape)
print('EDCA internal train dataset', automl.internal_x_train.shape)
print('EDCA selected dataset:', automl.get_final_data_shape())

Original Train dataset: (552, 14)
EDCA internal train dataset (414, 14)
EDCA selected dataset: (116, 14)
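To put numbers on the reduction, the shapes printed above can be turned into ratios. This is plain arithmetic on the reported values; the variable names are illustrative. The internal training set being exactly 75% of the training rows suggests that EDCA holds out part of the data internally, but that reading is an inference from the shapes, not something the output states.

```python
# Shapes copied from the output above: (rows, columns).
original_train = (552, 14)
internal_train = (414, 14)
selected = (116, 14)

internal_share = internal_train[0] / original_train[0]
kept_of_internal = selected[0] / internal_train[0]
kept_of_original = selected[0] / original_train[0]

print(f"internal train / original train: {internal_share:.1%}")
print(f"selected / internal train:       {kept_of_internal:.1%}")
print(f"selected / original train:       {kept_of_original:.1%}")
print("features kept:", selected[1], "of", original_train[1])
```

The final model is therefore trained on roughly 28% of the internal training instances, with the feature count unchanged.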


## Making predictions

Now we can use the optimized ML pipeline to make predictions on the held-out test data.

In [75]:
preds = automl.predict(X_test)
preds_proba = automl.predict_proba(X_test)

### Assessing its predictive power

In [76]:
print(metrics.classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.89      0.85      0.87        87
           1       0.76      0.82      0.79        51

    accuracy                           0.84       138
   macro avg       0.83      0.84      0.83       138
weighted avg       0.84      0.84      0.84       138
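The summary rows of the report follow directly from the per-class rows. As a quick check, the sketch below recomputes the macro and weighted F1 averages and the accuracy from the values printed above. Because the inputs are already rounded to two decimals, the results only match the report after rounding.

```python
# Per-class values copied from the classification report above.
f1 = {0: 0.87, 1: 0.79}
recall = {0: 0.85, 1: 0.82}
support = {0: 87, 1: 51}
total = sum(support.values())

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
# Accuracy equals the support-weighted recall.
accuracy = sum(recall[c] * support[c] for c in recall) / total

print(f"macro avg f1:    {macro_f1:.2f}")
print(f"weighted avg f1: {weighted_f1:.2f}")
print(f"accuracy:        {accuracy:.2f}")
```

The macro average treats both classes equally, while the weighted average leans toward class 0 because it has more test samples (87 versus 51).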

