# AutoML solution vs single model
#### FEDOT version = 0.6.2

In [1]:
pip install fedot==0.6.2

Collecting fedot==0.6.2
  Downloading fedot-0.6.2-py3-none-any.whl (493 kB)
     -------------------------------------- 493.9/493.9 kB 3.1 MB/s eta 0:00:00
Installing collected packages: fedot
  Attempting uninstall: fedot
    Found existing installation: fedot 0.6.1
    Uninstalling fedot-0.6.1:
      Successfully uninstalled fedot-0.6.1
Successfully installed fedot-0.6.2
Note: you may need to restart the kernel to use updated packages.





Below is an example of running an AutoML solution for a classification problem.
## Description of the task and dataset

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

import logging
logging.raiseExceptions = False

# Input data from csv files 
train_data_path = '../data/scoring_train.csv'
test_data_path = '../data/scoring_test.csv'
df = pd.read_csv(train_data_path)
df.head(5)

| Unnamed: 0 | ID | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30.59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60.89DaysPastDueNotWorse | NumberOfDependents | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 | 1 |
| 1 | 1 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 | 0 |
| 2 | 2 | 0.65818 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 | 0 |
| 3 | 3 | 0.23381 | 30 | 0 | 0.03605 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 | 0 |
| 4 | 4 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 | 0 |


## Baseline model

Let's use the FEDOT API to solve the classification problem. First, we create a pipeline that consists of a single "xgboost" model.
To do this, we pass the model name in the predefined_model argument of the fit method.

Attention!
"predefined_model" is not an initial assumption for the AutoML algorithm. It is just a single model that is fitted without the AutoML part.

In [2]:
from fedot.api.main import Fedot

# task selection, initialisation of the framework
baseline_model = Fedot(problem='classification')

# fit model without optimisation - single XGBoost node is used 
xgb_pipeline = baseline_model.fit(features=train_data_path, target='target', predefined_model='xgboost')

# predict class probabilities for the test data
xgb_predict = baseline_model.predict_proba(features=test_data_path)

2023-02-21 21:48:44,359 - CSV data extraction - Used the column as index: "ID".
2023-02-21 21:48:49,955 - FEDOT logger - Final pipeline: {'depth': 1, 'length': 1, 'nodes': [xgboost]}
xgboost - {'eval_metric': 'mlogloss', 'nthread': -1}
Memory consumption for finish in main session: current 63.2 MiB, max: 90.7 MiB
2023-02-21 21:48:50,025 - CSV data extraction - Used the column as index: "ID".


In [3]:
from fedot.core.data.data import InputData
from sklearn.metrics import roc_auc_score

# Read data from csv file as InputData
test_data = InputData.from_csv(test_data_path)
roc_auc_baseline = roc_auc_score(test_data.target, xgb_predict)
roc_auc_baseline

2023-02-21 21:48:50,144 - CSV data extraction - Used the column as index: "ID".


0.8332360242279814
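
The value above is the area under the ROC curve. One way to read it is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. Below is a minimal pure-Python sketch of that interpretation. The `roc_auc` helper and the toy labels and scores are illustrative assumptions (they are not part of FEDOT or scikit-learn, and the data does not come from the scoring dataset); the quadratic loop is written for clarity, not speed.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive example is ranked above the negative one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data for illustration only
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why the baseline and the AutoML result can be compared directly on this scale.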

## FEDOT AutoML for classification

FEDOT can search for a suitable pipeline automatically, using the evolutionary algorithm built into the core of the framework.
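
To give a feeling for what an evolutionary search does, here is a toy sketch that evolves bit strings instead of pipelines. It is only an illustration of the general loop (selection, mutation, keeping the best individual) and is not FEDOT's optimizer, which works on pipeline graphs. The `evolve` function, its parameters, and the `sum` fitness are hypothetical choices made for this example.

```python
import random


def evolve(fitness, genome_length=8, population_size=20, generations=30, seed=42):
    """Toy evolutionary search over bit strings.

    Returns the best genome found and the best fitness after each generation.
    """
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]

    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents
        parents = sorted(population, key=fitness, reverse=True)[:population_size // 2]

        # Mutation: each child is a copy of a parent with one random bit flipped
        children = []
        for parent in parents:
            child = parent[:]
            position = rng.randrange(genome_length)
            child[position] = 1 - child[position]
            children.append(child)

        population = parents + children

        # Remember the best individual seen so far
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
        history.append(fitness(best))

    return best, history


# Fitness here is simply the number of ones in the genome
best_genome, best_history = evolve(fitness=sum)
print(best_genome, best_history[-1])
```

In FEDOT the role of the genome is played by a pipeline, and the fitness is a quality metric such as ROC AUC estimated on validation data.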

Here are some of the parameters that you can specify when initializing the Fedot class:
* problem - the name of the modelling problem to solve:
    - classification
    - regression
    - ts_forecasting
    - clustering
* seed - value for the fixed random seed
* logging_level - verbosity of the log output:
    - 50 -> critical
    - 40 -> error
    - 30 -> warning
    - 20 -> info
    - 10 -> debug
    - 0 -> notset
* timeout - time limit for model design (in minutes)
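
The numeric logging levels in the list coincide with the constants of Python's standard `logging` module. Assuming that `logging_level` is interpreted in the same way, the named constants can be used instead of bare numbers. The short sketch below only inspects the standard library and does not call FEDOT; the `level_names` dictionary is an illustrative name.

```python
import logging

# Map each numeric level from the list above to its standard-library name
level_names = {value: logging.getLevelName(value) for value in (50, 40, 30, 20, 10, 0)}

for value, name in level_names.items():
    print(value, name)

# The named constants are plain integers, so logging.DEBUG is the same value as 10
print(logging.DEBUG == 10)
```

Under that assumption, `logging_level=logging.DEBUG` would be equivalent to the `logging_level=10` used in the next cell.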

In [4]:
# new instance to be used as AutoML tool
auto_model = Fedot(problem='classification', seed=42, logging_level=10, timeout=5)

In [None]:
# run of the AutoML-based model generation
pipeline = auto_model.fit(features=train_data_path, target='target')

2023-02-21 21:48:50,193 - CSV data extraction - Used the column as index: "ID".
2023-02-21 21:48:54,224 - AssumptionsHandler - Memory consumption for fitting of the initial pipeline in main session: current 11.5 MiB, max: 39.1 MiB
2023-02-21 21:48:54,228 - ApiComposer - Initial pipeline was fitted in 3.0 sec.
2023-02-21 21:48:54,229 - AssumptionsHandler - Preset was changed to best_quality due to fit time estimation for initial model.
2023-02-21 21:48:54,236 - ApiComposer - AutoML configured. Parameters tuning: True. Time limit: 5 min. Set of candidate models: ['logit', 'resample', 'qda', 'lgbm', 'mlp', 'normalization', 'dt', 'bernb', 'knn', 'scaling', 'fast_ica', 'pca', 'poly_features', 'isolation_forest_class', 'rf', 'logit', 'resample', 'qda', 'lgbm', 'mlp', 'normalization', 'dt', 'bernb', 'knn', 'scaling', 'fast_ica', 'pca', 'poly_features', 'isolation_forest_class', 'rf'].
2023-02-21 21:48:54,242 - ApiComposer - Pipeline composition started.
2023-02-21 21:48:54,253 - DataSourceSpl

Generations:   0%|                                                                          | 1/10000 [00:00<?, ?gen/s]

2023-02-21 21:48:54,259 - MultiprocessingDispatcher - Number of used CPU's: 12
2023-02-21 21:49:04,125 - MultiprocessingDispatcher - Memory consumption for parallel evaluation of population in main session: current 13.7 MiB, max: 39.1 MiB
2023-02-21 21:49:04,127 - EvoGraphOptimizer - Generation num: 1
2023-02-21 21:49:04,129 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.827 ComplexityMetricsEnum.node_num=0.200>']
2023-02-21 21:49:05,386 - MultiprocessingDispatcher - Number of used CPU's: 12
2023-02-21 21:51:00,850 - MultiprocessingDispatcher - Memory consumption for parallel evaluation of population in main session: current 55.9 MiB, max: 57.7 MiB
2023-02-21 21:51:00,853 - EvoGraphOptimizer - Generation num: 2
2023-02-21 21:51:00,854 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.852 ComplexityMetricsEnum.node_num=0.200>']
2023-02-21 21:51:00,

Generations:   0%|                                                                          | 1/10000 [03:07<?, ?gen/s]

2023-02-21 21:52:01,653 - OptimisationTimer - Composition time: 3.123 min
2023-02-21 21:52:01,655 - OptimisationTimer - Algorithm was terminated due to processing time limit
2023-02-21 21:52:01,657 - EvoGraphOptimizer - Generation num: 4
2023-02-21 21:52:01,658 - EvoGraphOptimizer - Best individuals: HallOfFame archive fitness (1): ['<ClassificationMetricsEnum.ROCAUC_penalty=-0.852 ComplexityMetricsEnum.node_num=0.200>']
2023-02-21 21:52:01,659 - EvoGraphOptimizer - no improvements for 2 iterations
2023-02-21 21:52:01,660 - EvoGraphOptimizer - spent time: 3.1 min
2023-02-21 21:52:01,664 - GPComposer - GP composition finished
2023-02-21 21:52:01,666 - DataSourceSplitter - K-folds cross validation is applied.
2023-02-21 21:52:01,669 - ApiComposer - Hyperparameters tuning started with 2 min. timeout
2023-02-21 21:52:01,674 - PipelineTuner - Hyperparameters optimization start: estimation of metric for initial pipeline





2023-02-21 21:52:08,413 - PipelineTuner - Initial pipeline: {'depth': 2, 'length': 2, 'nodes': [lgbm, scaling]}
lgbm - {'num_leaves': 32, 'colsample_bytree': 0.8, 'subsample': 0.8, 'subsample_freq': 10, 'learning_rate': 0.03, 'n_estimators': 100}
scaling - {} 
Initial metric: 0.852
  0%|                                                                           | 0/10 [00:00<?, ?trial/s, best loss=?]2023-02-21 21:52:08,433 - build_posterior_wrapper took 0.007001 seconds
2023-02-21 21:52:08,434 - TPE using 0 trials
 10%|████▊                                           | 1/10 [00:06<01:01,  6.78s/trial, best loss: -0.8520924000000001]2023-02-21 21:52:15,216 - build_posterior_wrapper took 0.007002 seconds
2023-02-21 21:52:15,218 - TPE using 1/1 trials with best loss -0.852092
 20%|█████████▌                                      | 2/10 [00:13<00:53,  6.71s/trial, best loss: -0.8520924000000001]2023-02-21 21:52:21,879 - build_posterior_wrapper took 0.007002 seconds
2023-02-21 21:52:21,881 - T

In [None]:
prediction = auto_model.predict_proba(features=test_data_path)

# Calculate metric
roc_auc_auto = roc_auc_score(test_data.target, prediction)

In [None]:
# comparison with the baseline (single-model) pipeline

print(f'Baseline {roc_auc_baseline:.2f}')
print(f'AutoML solution {roc_auc_auto:.2f}')

Thus, with just a few lines of code, we were able to launch the FEDOT framework and obtain a better result* than the single-model baseline.

*Due to the stochastic nature of the algorithm, the metrics of the found solution may differ between runs.

If you want to learn more about FEDOT, you can use [this notebook](2_intro_to_fedot.ipynb).