# What to expect from AutoML software
Automated machine learning (AutoML) takes a higher-level approach to machine learning than most practitioners are used to, so we've gathered a handful of guidelines on what to expect when running AutoML software such as TPOT.

#### AUTOML ALGORITHMS AREN'T INTENDED TO RUN FOR ONLY A FEW MINUTES
Of course, you can run TPOT for only a few minutes, and it will usually find a reasonably good pipeline for your dataset. However, if you don't run TPOT for long enough, it may not find the best possible pipeline for your dataset. It may not even find any suitable pipeline at all, in which case a `RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')` will be raised. Often it is worthwhile to run multiple instances of TPOT in parallel for a long time (hours to days) to allow TPOT to thoroughly search the pipeline space for your dataset.

#### AUTOML ALGORITHMS CAN TAKE A LONG TIME TO FINISH THEIR SEARCH
AutoML algorithms aren't as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline.

As such, TPOT will take a while to run on larger datasets, but it's important to realize why. With the default TPOT settings (100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search will take. That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search. That's a time-consuming procedure, even for simpler models like decision trees.
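The arithmetic above can be checked with a short sketch. The helper names (`count_fits`, `estimated_hours`) and the one-second-per-fit cost are assumptions made purely for illustration; they are not part of TPOT.

```python
def count_fits(generations, population_size, cv_folds):
    """Return (pipelines evaluated, individual model fits) for one search."""
    pipelines = generations * population_size
    return pipelines, pipelines * cv_folds


def estimated_hours(fits, seconds_per_fit, n_jobs=1):
    """Rough wall-clock estimate, assuming equal-cost fits that parallelize perfectly."""
    return fits * seconds_per_fit / n_jobs / 3600


pipelines, fits = count_fits(generations=100, population_size=100, cv_folds=10)
print(pipelines, fits)  # 10000 100000

# Assumed cost of one second per fit; real costs vary widely by model and data.
print(round(estimated_hours(fits, seconds_per_fit=1.0), 1))
```

Under these assumptions, even a one-second fit adds up to more than a day of compute on a single core, which is why long runs and parallel workers are the norm.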

Typical TPOT runs will take hours to days to finish (unless it's a small dataset), but you can always interrupt the run partway through and see the best results so far. TPOT also provides a warm_start parameter that lets you restart a TPOT run from where it left off.

#### AUTOML ALGORITHMS CAN RECOMMEND DIFFERENT SOLUTIONS FOR THE SAME DATASET
If you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different pipelines, this means that the TPOT runs didn't converge due to lack of time or that multiple pipelines perform more-or-less the same on your dataset.

This is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.
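To see why a stochastic search can recommend different solutions, here is a toy sketch. It uses plain random search over a made-up score function, which is much simpler than TPOT's evolutionary algorithm; `toy_score` and `random_search` are hypothetical names introduced only for this illustration.

```python
import random


def toy_score(depth, n_features):
    """Made-up score surface whose optimum (0.0) is at depth=6, n_features=12."""
    return -0.01 * abs(depth - 6) - 0.01 * abs(n_features - 12)


def random_search(seed, n_trials):
    """Sample random configurations and keep the best one seen so far."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = (rng.randint(1, 20), rng.randint(1, 30))
        score = toy_score(*config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


# Short runs with different seeds will often disagree on the best configuration.
for seed in (0, 1, 2):
    print(seed, random_search(seed, n_trials=5))

# Longer runs with the same seeds can only match or improve on the short ones.
for seed in (0, 1, 2):
    print(seed, random_search(seed, n_trials=500))
```

Fixing the seed makes a run repeatable, while giving the search more trials narrows the gap between runs. The same intuition applies to running TPOT for longer.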

# TPOT with code

We've taken care to design the TPOT interface to be as similar as possible to scikit-learn.

TPOT can be imported just like any regular Python module. To import TPOT, type:

In [1]:
from tpot2 import TPOTClassifier

then create an instance of TPOT as follows:

In [3]:
classification_optimizer = TPOTClassifier()

It's also possible to use TPOT for regression problems with the TPOTRegressor class. Other than the class name, a TPOTRegressor is used the same way as a TPOTClassifier. You can read more about the TPOTClassifier and TPOTRegressor classes in the API documentation.

In [None]:
from tpot2 import TPOTRegressor
regression_optimizer = TPOTRegressor()

Fitting a TPOT model works exactly like any other sklearn estimator. Some example code with custom TPOT parameters might look like:

In [4]:
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

classification_optimizer = TPOTClassifier(search_space="light", max_time_mins=30/60, n_jobs=30, cv=5)

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1, test_size=0.2)

classification_optimizer.fit(X_train, y_train)

auroc_score = sklearn.metrics.roc_auc_score(y_test, classification_optimizer.predict_proba(X_test)[:,1])
print("auroc_score: ", auroc_score)

Generation: : 3it [00:33, 11.04s/it]


auroc_score:  0.9904100529100529


## Scorers, Objective Functions, and Multi-Objective Optimization

There are two ways of passing objectives into TPOT2. 

1. `scorers`: Scorers are functions that have the signature (estimator, X, y). These can be produced with the [sklearn.metrics.make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) function. Scorers are used to evaluate the test folds during cross-validation and are passed into TPOT2 via the `scorers` parameter, which accepts either the scorer itself or the string corresponding to a scoring function ([as listed here](https://scikit-learn.org/stable/modules/model_evaluation.html)). TPOT2 also supports passing in a list of several scorers for multi-objective optimization.

2. `other_objective_functions`: Other objective functions in TPOT2 have the signature (estimator) and return a float or a list of floats. They are passed an unfitted estimator (in the case of TPOT2, a `tpot2.GraphPipeline`).
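The two signatures can be sketched in plain Python. Everything below is hypothetical and exists only to show the call shapes: `ToyPipeline` stands in for a real estimator, `accuracy_scorer` plays the role of a scorer, and `n_steps_objective` plays the role of an other objective function. None of these names come from TPOT2 or scikit-learn.

```python
class ToyPipeline:
    """Hypothetical stand-in for an estimator; always predicts the majority label."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        labels = list(y)
        self.majority_ = max(sorted(set(labels)), key=labels.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def accuracy_scorer(estimator, X, y):
    """Scorer shape: (fitted estimator, X, y) -> float."""
    predictions = estimator.predict(X)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


def n_steps_objective(estimator):
    """Other-objective shape: (unfitted estimator) -> float; no data is involved."""
    return float(len(estimator.steps))


X = [[0], [1], [2], [3]]
y = [1, 1, 1, 0]

pipeline = ToyPipeline(steps=["scaler", "classifier"])
print(n_steps_objective(pipeline))               # 2.0, computed before fitting
print(accuracy_scorer(pipeline.fit(X, y), X, y)) # 0.75
```

The practical difference is that a scorer needs data and a fitted model, whereas an other objective function only inspects the structure of the pipeline.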



Each scorer and objective function must be accompanied by a list of weights corresponding to the list of objectives. By default, TPOT2 maximizes objective functions (this can be changed with `bigger_is_better=False`). Positive weights mean that TPOT2 will seek to maximize that objective, and negative weights correspond to minimization.
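One way to picture the role of the weights is the following sketch, which multiplies each objective by its weight so that larger is always better and then keeps the non-dominated candidates. This is an illustration of the general idea only, not TPOT2's internal implementation; the function names and candidate scores are made up.

```python
def weighted(scores, weights):
    """Multiply each objective by its weight so that larger is always better."""
    return tuple(s * w for s, w in zip(scores, weights))


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates, weights):
    """Return the names of candidates that no other candidate dominates."""
    adjusted = {name: weighted(s, weights) for name, s in candidates.items()}
    return sorted(
        name
        for name, a in adjusted.items()
        if not any(dominates(b, a) for other, b in adjusted.items() if other != name)
    )


# Made-up (roc_auc, complexity) scores for four hypothetical pipelines.
candidates = {
    "small": (0.90, 10),
    "medium": (0.95, 50),
    "large": (0.95, 400),
    "weak": (0.85, 60),
}

# Weight 1 maximizes the first objective; weight -1 minimizes the second.
print(pareto_front(candidates, weights=[1, -1]))  # ['medium', 'small']
```

Here `large` is dropped because `medium` reaches the same score with lower complexity, and `weak` is dropped because `small` is better on both objectives.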

Here is an example of using two scorers

    scorers=['roc_auc_ovr',tpot2.objectives.complexity_scorer],
    scorers_weights=[1,-1],


Here is an example with a scorer and a secondary objective function

    scorers=['roc_auc_ovr'],
    scorers_weights=[1],
    other_objective_functions=[tpot2.objectives.number_of_leaves_objective],
    other_objective_functions_weights=[-1],


TPOT will automatically name the scores based on the function name for the columns in the final results dataframe. If you would like to specify custom names, you can set `objective_function_names` to a list of names (str), one for each score. The names are ordered with scorers first and other objective functions second (e.g. `objective_function_names=['scorer1', 'scorer2', 'objective1', 'objective2']`).

It is possible for either a scorer or an other objective function to return multiple values. In that case, just make sure that `scorers_weights` and `other_objective_functions_weights` are the same length as the number of returned scores.


TPOT comes with a few additional built-in objective functions you can use. The first table lists objectives applied to fitted pipelines, which are therefore passed into the `scorers` parameter. The second table lists objective functions for the `other_objective_functions` parameter.

Scorers:

| Function     | Description      |
| :---        |    :----:   |
| tpot2.objectives.complexity_scorer | Estimates the number of learned parameters across all classifiers and regressors in the pipelines. Additionally, currently transformers add 1 point and selectors add 0 points (since they don't affect the complexity of the "final" predictive pipeline.) |

Other objective functions:

| Function     | Description      |
| :---        |    :----:   |
| tpot2.objectives.average_path_length_objective | Computes the average shortest path from all nodes to the root/final estimator (only supported for GraphPipeline) |
| tpot2.objectives.number_of_leaves_objective | Calculates the number of leaves (input nodes) in a GraphPipeline |
| tpot2.objectives.number_of_nodes_objective | Calculates the number of nodes in a pipeline (whether it is a scikit-learn Pipeline, GraphPipeline, FeatureUnion, or any of these nested within each other) |

## Built In Configurations
TPOT can be used to optimize hyperparameters, select models, and optimize pipelines of models, including determining the sequence of steps. Tutorial 2 goes into more detail on how to customize search spaces with custom hyperparameter ranges, model types, and possible pipeline configurations. TPOT also comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT. These can be passed in as strings to the `search_space` parameter of any of the TPOT estimators.

| String     | Description      |
| :---        |    :----:   |
| linear  | A linear pipeline with the structure of "Selector->(transformers+Passthrough)->(classifiers/regressors+Passthrough)->final classifier/regressor." For both the transformer and inner estimator layers, TPOT may choose one or more transformers/classifiers, or it may choose none. The inner classifier/regressor layer is optional. |
| light | Same search space as linear, but without the inner classifier/regressor layer and with a reduced set of faster running estimators. |
| graph | TPOT will optimize a pipeline in the shape of a directed acyclic graph. The nodes of the graph can include selectors, scalers, transformers, or classifiers/regressors (inner classifiers/regressors can optionally be not included). This will return a custom GraphPipeline rather than an sklearn Pipeline. More details in Tutorial 6. |
| mdr | TPOT will search over a series of feature selectors and Multifactor Dimensionality Reduction models to find a series of operators that maximize prediction accuracy. The TPOT MDR configuration is specialized for genome-wide association studies (GWAS), and is described in detail online here. Note that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets. |

Note: by default, the `linear` and `graph` configurations allow additional stacked classifiers/regressors within the pipeline in addition to the final classifier/regressor. If you would like to disable this, you can manually get the search space without inner classifiers/regressors through the function `tpot2.config.template_search_spaces.get_template_search_spaces` with `inner_predictors=False`. You can pass the resulting search space into the `search_space` parameter.

The specific hyperparameter ranges used by TPOT can be found in files in the tpot2/config folder. The template search spaces listed above are defined in tpot2/config/template_search_spaces.py. Search spaces for individual models can be acquired in the tpot2/config/get_configspace.py file (`tpot2.config.get_search_space`). More details in Tutorial 2.

### Example analysis 

### Best Practices

When running TPOT from a .py script, it is important to protect the code with `if __name__=="__main__":`. This is because of how TPOT handles parallelization with Python and Dask.

In [None]:
#my_analysis.py

from dask.distributed import Client, LocalCluster
import tpot2
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import numpy as np

if __name__=="__main__":
    scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)


    est = tpot2.TPOTClassifier(n_jobs=4, max_time_mins=60, verbose=2)
    est.fit(X_train, y_train)


    print(scorer(est, X_test, y_test))

### Common parameters

        scorers : (list, scorer)
            A scorer or list of scorers to be used in the cross-validation process. 
            see https://scikit-learn.org/stable/modules/model_evaluation.html
        
        scorers_weights : list
            A list of weights to be applied to the scorers during the optimization process.
        
        classification : bool
            If True, the problem is treated as a classification problem. If False, the problem is treated as a regression problem.
            Used to determine the CV strategy.
        
        cv : int, cross-validator
            - (int): Number of folds to use in the cross-validation process. By default, uses the sklearn.model_selection.KFold cross-validator for regression and StratifiedKFold for classification. In both cases, shuffle is set to True.
            - (sklearn.model_selection.BaseCrossValidator): A cross-validator to use in the cross-validation process.
                - max_depth (int): The maximum depth from any node to the root of the pipelines to be generated.
        
        other_objective_functions : list, default=[tpot2.objectives.estimator_objective_functions.average_path_length_objective]
            A list of other objective functions to apply to the pipeline.
        
        other_objective_functions_weights : list, default=[-1]
            A list of weights to be applied to the other objective functions.
        
        objective_function_names : list, default=None
            A list of names to be applied to the objective functions. If None, will use the names of the objective functions.
        
        bigger_is_better : bool, default=True
            If True, the objective function is maximized. If False, the objective function is minimized. Use negative weights to reverse the direction.
        
        generations : int, default=50
            Number of generations to run
            
        max_time_mins : float, default=float("inf")
            Maximum time to run the optimization. If None or inf, will run until the end of the generations.
        
        max_eval_time_mins : float, default=60*5
            Maximum time to evaluate a single individual. If None or inf, there will be no time limit per evaluation.

        n_jobs : int, default=1
            Number of processes to run in parallel.
        
        memory_limit : str, default="4GB"
            Memory limit for each job. See Dask [LocalCluster documentation](https://distributed.dask.org/en/stable/api.html#distributed.Client) for more information.

            
        verbose : int, default=1 
            How much information to print during the optimization process. Higher values include the information from lower values.
            0. nothing
            1. progress bar
            
            3. best individual
            4. warnings
            >=5. full warnings trace
            6. evaluations progress bar. (Temporary: This used to be 2. Currently, using the evaluation progress bar may prevent some instances where we terminate a generation early, either because it reached max_time_mins in the middle of a generation or because a pipeline failed to terminate normally and we need to terminate it manually.)
        


# More Options

`tpot2.TPOTClassifier` and `tpot2.TPOTRegressor`  have a simplified set of hyperparameters with default values set for classification and regression problems. Currently, both of these use the standard evolutionary algorithm in the `tpot2.TPOTEstimator` class. If you want more control you can look into either the `tpot2.TPOTEstimator` or `tpot2.TPOTEstimatorSteadyState` class.

There are two evolutionary algorithms built into TPOT2, which correspond to two different estimator classes.

1. The `tpot2.TPOTEstimator` uses a standard evolutionary algorithm that evaluates exactly population_size individuals each generation. This is similar to the algorithm in TPOT1. The next generation does not start until the previous is completely finished evaluating. This leads to underutilized CPU time as the cores are waiting for the last individuals to finish training, but may preserve diversity in the population. 

2. The `tpot2.TPOTEstimatorSteadyState` differs in that it will generate and evaluate the next individual as soon as an individual finishes evaluation. The number of individuals being evaluated is determined by the n_jobs parameter. There is no longer a concept of generations. The population_size parameter now refers to the size of the list of evaluated parents. When an individual is evaluated, the selection method updates the list of parents. This allows more efficient utilization when using multiple cores.
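The difference in CPU utilization between the two approaches can be illustrated with a small timing simulation. This is a simplified sketch under stated assumptions: it models only when workers are busy, using made-up evaluation times, and ignores how each algorithm actually selects and generates individuals. The function names are hypothetical and not part of TPOT2.

```python
import heapq


def schedule(durations, n_jobs):
    """Greedy scheduling: each task goes to whichever worker frees up first.

    Returns the time at which the last worker finishes.
    """
    workers = [0.0] * n_jobs
    heapq.heapify(workers)
    for duration in durations:
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + duration)
    return max(workers)


def generational_time(eval_times, population_size, n_jobs):
    """Each generation must finish completely before the next one starts."""
    total = 0.0
    for start in range(0, len(eval_times), population_size):
        total += schedule(eval_times[start:start + population_size], n_jobs)
    return total


def steady_state_time(eval_times, n_jobs):
    """A new evaluation starts as soon as any worker becomes free."""
    return schedule(eval_times, n_jobs)


# Three "generations" of four individuals; one slow pipeline in each (made-up times).
eval_times = [1, 1, 1, 9] * 3

print(generational_time(eval_times, population_size=4, n_jobs=2))  # 30.0
print(steady_state_time(eval_times, n_jobs=2))                     # 22.0
```

In the generational case, one worker sits idle while the slow pipeline of each generation finishes. Without the barrier between generations, that idle time is spent evaluating the next individuals.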


In [None]:
import tpot2
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
X, y = sklearn.datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)


est = tpot2.TPOTClassifier(n_jobs=40, max_time_mins=30, verbose=5, generations=1, population_size=5)
est.fit(X_train, y_train)


print(scorer(est, X_test, y_test))

In [None]:
est._evolver_instance.population.evaluated_individuals.iloc[0]['Individual'].export_pipeline()

In [None]:
import tpot2
import sklearn
import sklearn.metrics
import sklearn.datasets
import sklearn.model_selection

scorer = sklearn.metrics.get_scorer('neg_mean_squared_error')
X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)

est = tpot2.tpot_estimator.templates.TPOTRegressor(n_jobs=4, max_time_mins=30, verbose=2, cv=5)
est.fit(X_train, y_train)

print(scorer(est, X_test, y_test))

### tpot2.TPOTEstimatorSteadyState

In [None]:
import tpot2
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection


graph_search_space = tpot2.search_spaces.pipelines.GraphPipeline(
    root_search_space= tpot2.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
    leaf_search_space = tpot2.config.get_search_space("selectors"), 
    inner_search_space = tpot2.config.get_search_space(["transformers"]),
    max_size = 10,
)

est = tpot2.TPOTEstimatorSteadyState( 
                            search_space = graph_search_space,
                            scorers=['roc_auc_ovr'], #scorers can be a list of strings or a list of scorers. These get evaluated during cross validation. 
                            scorers_weights=[1],

                            classification=True,

                            max_eval_time_mins=15,
                            max_time_mins=30,
                            verbose=2)


scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
X, y = sklearn.datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))


In [None]:
fitted_pipeline = est.fitted_pipeline_ # access best pipeline directly
fitted_pipeline.plot()

In [None]:
#view the summary of all evaluated individuals as a pandas dataframe
est.evaluated_individuals

In [None]:
import tpot2
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

est = tpot2.TPOTEstimatorSteadyState( 
                            search_space = graph_search_space,
                            scorers=['roc_auc_ovr',tpot2.objectives.complexity_scorer],
                            scorers_weights=[1,-1],

                            classification=True,

                            max_eval_time_mins=15,
                            max_time_mins=30,
                            verbose=2)


scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
X, y = sklearn.datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))


In [None]:
fitted_pipeline = est.fitted_pipeline_ # access best pipeline directly
fitted_pipeline.plot() #plot the best pipeline

View the results of all evaluated individuals as a pandas dataframe:

In [None]:
est.evaluated_individuals

View the Pareto front as a pandas dataframe:

In [None]:
est.pareto_front

In [None]:
pareto_front = est.pareto_front

#plot the pareto front of complexity_scorer vs roc_auc_score

import matplotlib.pyplot as plt
plt.scatter(pareto_front['complexity_scorer'], pareto_front['roc_auc_score'])
plt.xlabel('complexity_scorer')
plt.ylabel('roc_auc_score')
plt.show()

### tpot2.TPOTEstimator

In [None]:
import tpot2
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

est = tpot2.TPOTEstimator(  
                            search_space = graph_search_space,
                            population_size=30,
                            generations=5,
                            scorers=['roc_auc_ovr'], #scorers can be a list of strings or a list of scorers. These get evaluated during cross validation. 
                            scorers_weights=[1],
                            classification=True,
                            n_jobs=1, 
                            early_stop=5, #how many generations with no improvement to stop after
                            
                            #List of other objective functions. All objective functions take in an untrained GraphPipeline and return a score or a list of scores
                            other_objective_functions= [ ],
                            
                            #List of weights for the other objective functions. Must be the same length as other_objective_functions. By default, bigger is better is set to True. 
                            other_objective_functions_weights=[],
                            verbose=2)

scorer = sklearn.metrics.get_scorer('roc_auc_ovo')
X, y = sklearn.datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))