# AutoML solution vs single model
#### FEDOT version = 0.6.1

In [1]:
pip install fedot==0.6.1

Note: you may need to restart the kernel to use updated packages.





The example below runs an AutoML solution on a binary classification problem and compares it with a single model.
## Description of the task and dataset

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

import logging
logging.raiseExceptions = False

# Input data from csv files 
train_data_path = '../data/scoring_train.csv'
test_data_path = '../data/scoring_test.csv'
df = pd.read_csv(train_data_path)
df.head(5)

| Unnamed: 0 | ID | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30.59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60.89DaysPastDueNotWorse | NumberOfDependents | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 | 1 |
| 1 | 1 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 | 0 |
| 2 | 2 | 0.65818 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 | 0 |
| 3 | 3 | 0.23381 | 30 | 0 | 0.03605 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 | 0 |
| 4 | 4 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 | 0 |


## Baseline model

Let's use the FEDOT API to solve the classification problem. First, we create a pipeline that consists of a single "xgboost" model.
To do this, we pass the model name to the "predefined_model" parameter of the fit method.

Attention!
"predefined_model" is not an initial assumption for the AutoML algorithm. It fits just that single model, without the AutoML part.

In [3]:
from fedot.api.main import Fedot

# task selection, initialisation of the framework
baseline_model = Fedot(problem='classification')

# fit model without optimisation - single XGBoost node is used 
xgb_pipeline = baseline_model.fit(features=train_data_path, target='target', predefined_model='xgboost')

# predict class probabilities for the test data
xgb_predict = baseline_model.predict_proba(features=test_data_path)

2022-12-16 21:46:56,717 - FEDOT logger - Final pipeline: {'depth': 1, 'length': 1, 'nodes': [xgboost]}
xgboost - {'eval_metric': 'mlogloss', 'nthread': -1}


In [4]:
from fedot.core.data.data import InputData
from sklearn.metrics import roc_auc_score

# Read data from csv file as InputData
test_data = InputData.from_csv(test_data_path)
roc_auc_baseline = roc_auc_score(test_data.target, xgb_predict)
roc_auc_baseline

0.8332360242279814
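
To make the metric easier to interpret: ROC AUC equals the share of (positive, negative) sample pairs in which the positive sample receives the higher score. The sketch below computes it that way in plain Python. The function name `roc_auc_pairwise` and the toy labels and scores are made up for illustration and are not part of FEDOT or of the scoring dataset.

```python
from itertools import product


def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one sample of each class")

    correct = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            correct += 1.0
        elif pos == neg:
            correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc_pairwise(toy_labels, toy_scores)
print(toy_auc)
```

The loop is quadratic in the number of samples, so it is meant for intuition only. For real data, `roc_auc_score` from scikit-learn, as used above, is the practical choice.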

## FEDOT AutoML for classification

We can search for a suitable pipeline using the evolutionary algorithm built into the core of the FEDOT framework.

Here are some parameters that you can specify when initializing the Fedot class:
* problem - the name of the modelling problem to solve:
    - classification
    - regression
    - ts_forecasting
    - clustering
* seed - value for the fixed random seed
* logging_level - verbosity level of the log output:
    - 50 -> critical
    - 40 -> error
    - 30 -> warning
    - 20 -> info
    - 10 -> debug
    - 0 -> notset
* timeout - time limit for model design (in minutes)
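
The numeric values listed for logging_level coincide with the level constants of Python's standard `logging` module. The short sketch below prints that mapping using only the standard library. Since `logging.DEBUG` is the integer 10, passing it is equivalent to passing the literal 10.

```python
import logging

# Level constants defined by Python's standard logging module
LEVELS = [logging.CRITICAL, logging.ERROR, logging.WARNING,
          logging.INFO, logging.DEBUG, logging.NOTSET]

# Map each numeric level to its standard name
level_names = {level: logging.getLevelName(level) for level in LEVELS}
for level, name in level_names.items():
    print(f"{level:>2} -> {name}")
```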

In [5]:
# new instance to be used as AutoML tool
auto_model = Fedot(problem='classification', seed=42, logging_level=10, timeout=5)

In [6]:
# run of the AutoML-based model generation
pipeline = auto_model.fit(features=train_data_path, target='target')

2022-12-16 21:46:57,546 - AssumptionsHandler - Initial pipeline fitting started
2022-12-16 21:46:57,843 - SecondaryNode - Trying to fit secondary node with operation: rf
2022-12-16 21:46:57,844 - SecondaryNode - Fit all parent nodes in secondary node with operation: rf
2022-12-16 21:46:59,804 - ApiComposer - Initial pipeline was fitted in 2.3 sec.
2022-12-16 21:46:59,805 - AssumptionsHandler - Preset was changed to best_quality
2022-12-16 21:46:59,809 - ApiComposer - AutoML configured. Parameters tuning: True. Time limit: 5 min. Set of candidate models: ['bernb', 'dt', 'knn', 'lgbm', 'logit', 'mlp', 'qda', 'rf', 'scaling', 'normalization', 'pca', 'fast_ica', 'poly_features', 'isolation_forest_class', 'resample'].
2022-12-16 21:46:59,813 - ApiComposer - Pipeline composition started.
2022-12-16 21:46:59,823 - DataSourceSplitter - K-folds cross validation is applied.


Generations:   0%|                                                                          | 1/10000 [00:00<?, ?gen/s]

2022-12-16 21:46:59,827 - MultiprocessingDispatcher - Number of used CPU's: 1
2022-12-16 21:47:11,774 - EvoGraphOptimizer - Generation num: 1
2022-12-16 21:47:11,776 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.826 ComplexityMetricsEnum.node_num=0.200>']
2022-12-16 21:47:12,408 - MultiprocessingDispatcher - Number of used CPU's: 1
2022-12-16 21:50:01,414 - EvoGraphOptimizer - Generation num: 2
2022-12-16 21:50:01,415 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.850 ComplexityMetricsEnum.node_num=0.200>']
2022-12-16 21:50:01,416 - GroupedCondition - Optimisation stopped: Time limit is reached


Generations:   0%|                                                                          | 1/10000 [03:01<?, ?gen/s]

2022-12-16 21:50:01,417 - OptimisationTimer - Composition time: 3.027 min
2022-12-16 21:50:01,417 - OptimisationTimer - Algorithm was terminated due to processing time limit
2022-12-16 21:50:01,418 - EvoGraphOptimizer - Generation num: 3
2022-12-16 21:50:01,419 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.850 ComplexityMetricsEnum.node_num=0.200>']
2022-12-16 21:50:01,420 - EvoGraphOptimizer - no improvements for 1 iterations
2022-12-16 21:50:01,421 - EvoGraphOptimizer - spent time: 3.0 min
2022-12-16 21:50:01,422 - GPComposer - GP composition finished
2022-12-16 21:50:01,424 - DataSourceSplitter - K-folds cross validation is applied.
2022-12-16 21:50:01,425 - ApiComposer - Hyperparameters tuning started with 2 min. timeout
2022-12-16 21:50:01,427 - PipelineTuner - Hyperparameters optimization start





2022-12-16 21:50:12,697 - PipelineTuner - Initial pipeline: {'depth': 2, 'length': 2, 'nodes': [rf, scaling]}
rf - {'n_jobs': 1, 'criterion': 'gini', 'max_features': 0.23913682756197374, 'min_samples_split': 7, 'min_samples_leaf': 7, 'bootstrap': True}
scaling - {} 
Initial metric: 0.850
  0%|                                                                            | 0/1 [00:00<?, ?trial/s, best loss=?]2022-12-16 21:50:12,702 - build_posterior_wrapper took 0.001001 seconds
2022-12-16 21:50:12,703 - TPE using 0 trials
100%|██████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.29s/trial, best loss: -0.8504956]
  0%|                                                                       | 1/100000 [00:00<?, ?trial/s, best loss=?]2022-12-16 21:50:23,993 - build_posterior_wrapper took 0.002007 seconds
2022-12-16 21:50:23,994 - TPE using 1/1 trials with best loss -0.850496
  0%|                                                 | 2/100000 [00:14<409:20:38, 14.74s/tr
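
To follow the optimisation progress, the best fitness per generation can be pulled out of the log with a couple of regular expressions. In this run the fitness appears to be logged as a negated ROC AUC (the tuner's "Initial metric: 0.850" matches "ROCAUC_penalty=-0.850" with the sign flipped), so the sketch flips the sign back. The helper name `best_roc_auc_by_generation` is hypothetical, and the log format is taken from this particular output, so it may differ in other FEDOT versions.

```python
import re

# Log lines copied from the output above
LOG = """\
2022-12-16 21:47:11,774 - EvoGraphOptimizer - Generation num: 1
2022-12-16 21:47:11,776 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.826 ComplexityMetricsEnum.node_num=0.200>']
2022-12-16 21:50:01,414 - EvoGraphOptimizer - Generation num: 2
2022-12-16 21:50:01,415 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.850 ComplexityMetricsEnum.node_num=0.200>']
"""

GENERATION = re.compile(r"Generation num: (\d+)")
FITNESS = re.compile(r"ROCAUC_penalty=(-?\d+\.\d+)")


def best_roc_auc_by_generation(log_text):
    """Map generation number -> best ROC AUC, with the logged sign flipped back."""
    history = {}
    generation = None
    for line in log_text.splitlines():
        gen_match = GENERATION.search(line)
        if gen_match:
            # Remember which generation the next fitness line belongs to
            generation = int(gen_match.group(1))
            continue
        fit_match = FITNESS.search(line)
        if fit_match and generation is not None:
            history[generation] = -float(fit_match.group(1))
    return history


history = best_roc_auc_by_generation(LOG)
print(history)
```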

In [9]:
prediction = auto_model.predict_proba(features=test_data_path)

# Calculate metric
roc_auc_auto = roc_auc_score(test_data.target, prediction)

In [10]:
# comparison with the single-model baseline

print(f'Baseline {roc_auc_baseline:.2f}')
print(f'AutoML solution {roc_auc_auto:.2f}')

Baseline 0.83
AutoML solution 0.85


Thus, with just a few lines of code, we launched the FEDOT framework and obtained a better result than the single-model baseline (ROC AUC of 0.85 versus 0.83)*.

*Because the algorithm is stochastic, the metrics of the solution it finds may differ between runs.
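
One common way to account for this variability is to repeat the experiment with several seeds and report the mean and standard deviation of the metric. The sketch below shows only the aggregation step. `run_experiment` is a hypothetical stand-in that returns a synthetic random number; in practice it would create a Fedot instance with the given seed, fit it, and return the ROC AUC on the test set.

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical stand-in for one fit-and-score cycle.

    Returns a synthetic value in [0, 1), NOT a real FEDOT result.
    """
    rng = random.Random(seed)
    return rng.random()


def summarize(seeds):
    """Run the experiment once per seed; return (mean, sample std) of the scores."""
    scores = [run_experiment(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


mean_score, std_score = summarize(range(5))
print(f"{mean_score:.3f} +/- {std_score:.3f}")
```

Comparing the baseline and the AutoML solution on such aggregated scores gives a more reliable picture than a single run.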

If you want to learn more about FEDOT, you can use [this notebook](2_intro_to_fedot.ipynb).