# AutoML solution vs single model
#### FEDOT version = 0.6.2

In [None]:
pip install fedot==0.6.2

Below is an example of running an AutoML solution for a classification problem and comparing it with a single-model baseline.
## Description of the task and dataset

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

import logging
logging.raiseExceptions = False

# Input data from csv files 
train_data_path = '../data/scoring_train.csv'
test_data_path = '../data/scoring_test.csv'
df = pd.read_csv(train_data_path)
df.head(5)

Unnamed: 0,ID,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30.59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60.89DaysPastDueNotWorse,NumberOfDependents,target
0,0,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0,1
1,1,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0,0
2,2,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0,0
3,3,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0,0
4,4,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0,0


## Baseline model

Let's use the FEDOT API to solve the classification problem. First, we create a pipeline that consists of a single "xgboost" model.
To do this, we pass the model name in the predefined_model parameter of the fit method.

Attention!
"predefined_model" is not an initial assumption for the AutoML algorithm. It defines a single model that is fitted as-is, without the AutoML part.

In [3]:
from fedot.api.main import Fedot

# task selection, initialisation of the framework
baseline_model = Fedot(problem='classification')

# fit model without optimisation - single XGBoost node is used 
xgb_pipeline = baseline_model.fit(features=train_data_path, target='target', predefined_model='xgboost')

# predict class probabilities for the test data
xgb_predict = baseline_model.predict_proba(features=test_data_path)

2023-02-24 11:11:37,654 - CSV data extraction - Used the column as index: "ID".
2023-02-24 11:11:42,159 - FEDOT logger - Final pipeline: {'depth': 1, 'length': 1, 'nodes': [xgboost]}
xgboost - {'eval_metric': 'mlogloss', 'nthread': -1}
Memory consumption for finish in main session: current 63.2 MiB, max: 90.7 MiB
2023-02-24 11:11:42,244 - CSV data extraction - Used the column as index: "ID".


In [4]:
from fedot.core.data.data import InputData
from sklearn.metrics import roc_auc_score

# Read data from csv file as InputData
test_data = InputData.from_csv(test_data_path)
roc_auc_baseline = roc_auc_score(test_data.target, xgb_predict)
roc_auc_baseline

2023-02-24 11:11:42,374 - CSV data extraction - Used the column as index: "ID".


0.8332360242279814
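
For intuition about the metric reported above: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute-force pair counting in plain Python. The function name `roc_auc_pairwise` and the toy labels and scores are illustrative assumptions, not part of FEDOT or of the scoring dataset; for real data, `roc_auc_score` from scikit-learn remains the right tool.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 3 of the 4 positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(toy_labels, toy_scores))  # 0.75
```

The quadratic loop is only meant to make the definition visible; it would be far too slow for a dataset of realistic size.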

## FEDOT AutoML for classification

We can identify the model using an evolutionary algorithm built into the core of the FEDOT framework.
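
To give a feel for what "evolutionary" means here, the following is a deliberately tiny sketch of a mutate-and-select loop. It is not FEDOT's algorithm: FEDOT evolves pipeline graphs, while this toy evolves a single number, and the function `toy_evolutionary_search`, its fitness function, and all its parameters are assumptions made purely for illustration. It also shows why fixing a seed makes a stochastic search repeatable.

```python
import random


def toy_evolutionary_search(seed, generations=30, population_size=10):
    """Maximise the toy fitness f(x) = -(x - 3)^2 with mutation and selection."""
    rng = random.Random(seed)  # private generator: a fixed seed gives a repeatable run

    def fitness(x):
        return -(x - 3.0) ** 2

    population = [rng.uniform(-10.0, 10.0) for _ in range(population_size)]
    for _ in range(generations):
        # Mutation: every candidate produces one randomly perturbed child
        children = [x + rng.gauss(0.0, 1.0) for x in population]
        # Selection: keep the fittest candidates among parents and children
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:population_size]
    return population[0]


best_a = toy_evolutionary_search(seed=42)
best_b = toy_evolutionary_search(seed=42)
best_c = toy_evolutionary_search(seed=7)

print(best_a == best_b)  # True: same seed, same search trajectory
print(best_a, best_c)    # both close to the optimum x = 3, but generally not identical
```

Because parents compete with their children, the best candidate never gets worse from one generation to the next; a different seed explores a different trajectory and can end at a slightly different solution.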

Here are some of the parameters that you can specify when initializing the Fedot class:
* problem - the name of the modelling problem to solve:
    - classification
    - regression
    - ts_forecasting
    - clustering
* seed - value for the fixed random seed
* logging_level - verbosity of the log output:
    - 50 -> critical
    - 40 -> error
    - 30 -> warning
    - 20 -> info
    - 10 -> debug
    - 0 -> notset
* timeout - time limit for model design (in minutes)
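
The numeric logging levels in the list above coincide with the constants of Python's standard `logging` module, so a named constant such as `logging.DEBUG` is the same integer as the literal `10`. A small standard-library sketch that resolves each number to its standard name (the dictionary `level_names` is only an illustrative helper):

```python
import logging

# Numeric values from the list above, resolved to Python's standard level names
numeric_levels = (50, 40, 30, 20, 10, 0)
level_names = {level: logging.getLevelName(level) for level in numeric_levels}

for level, name in level_names.items():
    print(level, '->', name)

# The named constants are plain integers, so they are interchangeable with the literals
print(logging.DEBUG == 10, logging.INFO == 20)
```

Using the named constant instead of a bare number usually makes the intent of a call easier to read.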

In [5]:
# new instance to be used as AutoML tool
auto_model = Fedot(problem='classification', seed=42, logging_level=10, timeout=5)

In [6]:
# run of the AutoML-based model generation
pipeline = auto_model.fit(features=train_data_path, target='target')

2023-02-24 11:11:42,422 - CSV data extraction - Used the column as index: "ID".
2023-02-24 11:11:46,385 - AssumptionsHandler - Memory consumption for fitting of the initial pipeline in main session: current 11.5 MiB, max: 39.1 MiB
2023-02-24 11:11:46,389 - ApiComposer - Initial pipeline was fitted in 3.0 sec.
2023-02-24 11:11:46,390 - AssumptionsHandler - Preset was changed to best_quality due to fit time estimation for initial model.
2023-02-24 11:11:46,397 - ApiComposer - AutoML configured. Parameters tuning: True. Time limit: 5 min. Set of candidate models: ['logit', 'qda', 'mlp', 'isolation_forest_class', 'bernb', 'knn', 'rf', 'dt', 'scaling', 'poly_features', 'pca', 'fast_ica', 'lgbm', 'resample', 'normalization', 'logit', 'qda', 'mlp', 'isolation_forest_class', 'bernb', 'knn', 'rf', 'dt', 'scaling', 'poly_features', 'pca', 'fast_ica', 'lgbm', 'resample', 'normalization'].
2023-02-24 11:11:46,403 - ApiComposer - Pipeline composition started.
2023-02-24 11:11:46,418 - DataSourceSpl

Generations:   0%|                                                                          | 1/10000 [00:00<?, ?gen/s]

2023-02-24 11:11:46,424 - MultiprocessingDispatcher - Number of used CPU's: 12
2023-02-24 11:11:55,267 - MultiprocessingDispatcher - Memory consumption for parallel evaluation of population in main session: current 13.7 MiB, max: 39.1 MiB
2023-02-24 11:11:55,269 - EvoGraphOptimizer - Generation num: 1
2023-02-24 11:11:55,271 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.827 ComplexityMetricsEnum.node_num=0.200>']
2023-02-24 11:11:56,420 - MultiprocessingDispatcher - Number of used CPU's: 12
2023-02-24 11:13:39,729 - MultiprocessingDispatcher - Memory consumption for parallel evaluation of population in main session: current 55.9 MiB, max: 57.7 MiB
2023-02-24 11:13:39,732 - EvoGraphOptimizer - Generation num: 2
2023-02-24 11:13:39,733 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.852 ComplexityMetricsEnum.node_num=0.200>']
2023-02-24 11:13:39,

Generations:   0%|                                                                          | 1/10000 [08:09<?, ?gen/s]

2023-02-24 11:19:56,403 - OptimisationTimer - Composition time: 8.166 min
2023-02-24 11:19:56,405 - OptimisationTimer - Algorithm was terminated due to processing time limit
2023-02-24 11:19:56,406 - EvoGraphOptimizer - Generation num: 4
2023-02-24 11:19:56,408 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.852 ComplexityMetricsEnum.node_num=0.200>']
2023-02-24 11:19:56,410 - EvoGraphOptimizer - no improvements for 2 iterations
2023-02-24 11:19:56,411 - EvoGraphOptimizer - spent time: 8.2 min
2023-02-24 11:19:56,415 - GPComposer - GP composition finished
2023-02-24 11:19:56,418 - DataSourceSplitter - K-folds cross validation is applied.
2023-02-24 11:19:56,420 - ApiComposer - Time for pipeline composing was 0:08:10.013178.
The remaining 3.2 seconds are not enough to tune the hyperparameters.
2023-02-24 11:19:56,422 - ApiComposer - Composed pipeline returned without tuning.
2023-02-24 11:19:56,520 - ApiComposer - Mo




2023-02-24 11:20:00,161 - FEDOT logger - Final pipeline was fitted
2023-02-24 11:20:00,163 - FEDOT logger - Final pipeline: {'depth': 2, 'length': 2, 'nodes': [rf, scaling]}
rf - {'n_jobs': -1, 'criterion': 'entropy', 'max_features': 0.17230032499835796, 'min_samples_split': 9, 'min_samples_leaf': 11, 'bootstrap': False}
scaling - {}
Memory consumption for finish in main session: current 17.0 MiB, max: 88.3 MiB


In [7]:
prediction = auto_model.predict_proba(features=test_data_path)

# Calculate metric
roc_auc_auto = roc_auc_score(test_data.target, prediction)

2023-02-24 11:20:00,215 - CSV data extraction - Used the column as index: "ID".


In [8]:
# compare the AutoML result with the single-model baseline

print(f'Baseline {roc_auc_baseline:.2f}')
print(f'AutoML solution {roc_auc_auto:.2f}')

Baseline 0.83
AutoML solution 0.85


Thus, with just a few lines of code, we launched the FEDOT framework and obtained a better result than the single-model baseline (ROC AUC of 0.85 vs. 0.83)*.

*Due to the stochastic nature of the algorithm, the metrics for the found solution may differ between runs.

If you want to learn more about FEDOT, see [this notebook](2_intro_to_fedot.ipynb).