[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PennNGG/Quantitative-Neurocience/blob/master/Machine%20Learning/Python/Basics%5fof%5fAutoML%5fWith%5fTPOT.ipynb)

# Tutorial: Basics of AutoML With TPOT 


__Content creator:__ Diego G. Dávila (Penn NGG)





---
# Objectives

In this tutorial, you will be introduced to the fundamentals of Automated Machine Learning (AutoML) and learn to apply it to classification and regression problems with TPOT, a Python package.


By the end of this tutorial you will be able to:
*   Understand the basics of AutoML.
*   Develop an intuition for how AutoML can be applied.
*   Apply AutoML to develop classification and regression pipelines.




---
# What is Automated Machine Learning?

Machine Learning (ML) methods encompass a large landscape of incredibly powerful tools to explore and understand data. However, the size and intricacy of that landscape often precludes or overcomplicates the application of ML by scientists who do not specialize in AI and informatics. Even within the field of neuroinformatics, model selection, parameter selection, and pipeline construction are topics of intense discussion and are subject to arbitrary choices.

AutoML solves these issues by automating the process of applying machine learning to a dataset, from start to finish. This means that the construction of every step of an ML pipeline, from raw data to a model, is fully automated and optimized. 

In a basic sense, AutoML works by trying multiple combinations of pipelines and parameters over runtime, and comparing the performance of each combination. The best performing pipeline across the runtime is then selected and returned to the human user, along with performance metrics. 
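
The "try several candidates and keep the best one" idea can be sketched with plain scikit-learn. This is only an illustration, not how TPOT is implemented: the three candidate pipelines below are hand-picked assumptions, whereas an AutoML tool would generate and refine its candidates automatically.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hand-picked candidates (illustrative assumption); AutoML builds these for us.
candidates = {
    "scaled_logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "scaled_knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score every candidate with 5-fold cross-validated accuracy.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

# Keep the candidate with the highest mean score.
best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("selected:", best_name)
```

An AutoML run does the same kind of comparison, but over far more candidates than we would write by hand.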

Different implementations of AutoML work slightly differently. For this tutorial we will focus on [TPOT](http://epistasislab.github.io/tpot/), which uses genetic programming to construct and identify the optimal machine learning pipeline for classification or regression. 
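
To build some intuition for evolutionary search, here is a toy sketch. It is a simplified genetic algorithm over the hyperparameters of a single decision tree, not TPOT's actual genetic programming, which evolves whole pipelines. The search space, population size, and number of generations are all small, hypothetical values chosen so that the cell runs in seconds.

```python
import random

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = random.Random(0)

# Hypothetical, deliberately tiny search space.
SEARCH_SPACE = {
    "max_depth": [1, 2, 3, 4, 5, 6, 8],
    "min_samples_leaf": [1, 2, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}

def random_individual():
    """Draw one random configuration from the search space."""
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}

def fitness(individual):
    """Mean 3-fold cross-validated accuracy of the configuration."""
    model = DecisionTreeClassifier(random_state=0, **individual)
    return cross_val_score(model, X, y, cv=3).mean()

def mutate(individual):
    """Resample one randomly chosen hyperparameter."""
    child = dict(individual)
    key = rng.choice(list(SEARCH_SPACE))
    child[key] = rng.choice(SEARCH_SPACE[key])
    return child

def crossover(parent_a, parent_b):
    """Take each hyperparameter from one of the two parents at random."""
    return {key: rng.choice([parent_a[key], parent_b[key]]) for key in SEARCH_SPACE}

def evaluate(population):
    """Return (score, individual) pairs sorted from best to worst."""
    scored = [(fitness(individual), individual) for individual in population]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

population = [random_individual() for _ in range(6)]
best_per_generation = []
for generation in range(3):
    scored = evaluate(population)
    best_per_generation.append(scored[0][0])
    # Selection: the top half survives unchanged into the next generation.
    parents = [individual for _, individual in scored[:3]]
    # Variation: fill the other half with mutated offspring of the survivors.
    children = [
        mutate(crossover(rng.choice(parents), rng.choice(parents))) for _ in range(3)
    ]
    population = parents + children

best_score, best_individual = evaluate(population)[0]
print(best_individual, round(best_score, 3))
```

Because the best individuals are carried over unchanged, the best score can never get worse from one generation to the next; more generations simply give the search more chances to improve on it.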

This high degree of automation is advantageous in that: 
1. It makes advanced machine learning methods available to non-experts or scientists outside the field of ML/AI.
2. It speeds up time-to-application and frees up cognitive resources to focus on scientific questions.
3. Given sufficient runtime, it can produce pipelines that match or outperform human-designed ones.

Some disadvantages of AutoML include:
1. It is a computationally intensive process. Because AutoML tries many combinations of models, processing steps, and parameters until a good solution is found, runs can take days to weeks and require powerful computational resources.
2. Some implementations of AutoML can be "black boxes". In this introduction we will be using TPOT, which outputs the final pipeline as a Python script, allowing for a good amount of introspection.
3. Due to inherent stochasticity, AutoML can produce different solutions to the same problem from run to run. This is particularly likely for complex datasets, or when very little runtime is allowed for testing solutions.
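
The stochasticity mentioned in the last point can be illustrated with a small random search. This sketch assumes a hypothetical search space for a k-nearest-neighbors classifier and a budget of only four trials per run. With so few trials, runs started from different seeds may settle on different configurations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterSampler, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical search space, chosen only for illustration.
param_space = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

def tiny_search(seed, budget=4):
    """Try `budget` random configurations and return the best (params, score)."""
    best_params, best_score = None, -1.0
    for params in ParameterSampler(param_space, n_iter=budget, random_state=seed):
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(**params))
        score = cross_val_score(model, X, y, cv=3).mean()
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Repeat the same search with three different seeds.
results = {seed: tiny_search(seed) for seed in (0, 1, 2)}
for seed, (params, score) in results.items():
    print(f"seed={seed}: {params} -> {score:.3f}")
```

Increasing the budget gives each run a better chance of finding the same strong configuration, which mirrors why longer AutoML runs tend to give more consistent results.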

---
# Classification Example

TPOT is extremely flexible and easy to implement. So long as we have Input and Output training data, we can quickly set up an AutoML process to generate a pipeline. 

As a first example, we will ask TPOT to develop an optimal solution to a classification problem: in this case, classifying breast tumors as malignant or benign. We will use the built-in breast cancer dataset from scikit-learn to construct an accurate classification pipeline with AutoML in a few lines of code.

Note: We would typically let TPOT run for several days so that it can try many different generations. For illustrative purposes, we will only let it run for 10 minutes here.

---
# Setup

In [None]:
#@title Import Statements
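import os
os.system('pip install tpot')  # install TPOT into the current environment
# Assumes a TPOT release that provides the classic API used below (`verbosity`, `export`)
import numpy as np  # used below to cast the data to float64
from tpot import TPOTClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split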

In [None]:
#@title Figure settings
import ipywidgets as widgets       # interactive display
import matplotlib.pyplot as plt    # needed for plt.style.use below
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle")

---
# Create Classification Pipeline

In [None]:
# read in the dataset
data = load_breast_cancer(as_frame=False) 

# Split the data into training and test sets.
# scikit-learn's train_test_split keeps 75% of the data for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(data['data'].astype(np.float64),
    data['target'].astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

# Create a classification pipeline optimizer object.
# max_time_mins sets how long TPOT searches for pipelines; for a well-optimized solution we would let it run for days.
# For illustrative purposes, we only let it run for ten minutes.
# random_state seeds TPOT's random number generator.
optimizer = TPOTClassifier(max_time_mins=10, verbosity=2, random_state=137)

# run the optimization
optimizer.fit(X_train, y_train)
print(optimizer.score(X_test, y_test)) # report performance on the held-out test set

At this point, we already have an optimized pipeline. We can export it as a Python script to inspect and reuse.

Note: On Colab, the script is written to the temporary runtime on Google's servers rather than to your own machine, so it is lost when the runtime is recycled unless you download it. If you run this tutorial in a local Jupyter notebook instead, the script stays on disk and you can dissect the pipeline line by line. That is a good exercise to try on your own machine after going over this tutorial.

The following line performs the export:

In [None]:
# export optimized pipeline 
optimizer.export('optimizedTPOTClassificationPipeline.py')

Great! In a few short lines we've created an optimized classifier. We can do many things with it now. For example, we could list the input features with their associated importances. Each run can produce a different optimal classifier, so we will not illustrate that in this notebook, but we can extract the classifier for exploration with the following lines of code.

In [None]:
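# extract the final step of the fitted pipeline (the classifier itself)
extracted_best_model = optimizer.fitted_pipeline_.steps[-1][1]
# refit it on the full dataset; note that this bypasses any preprocessing steps that precede it in the pipeline
extracted_best_model.fit(data['data'].astype(np.float64), data['target'].astype(np.float64))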
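
One model-agnostic way to rank features is permutation importance, which works for any fitted scikit-learn pipeline. Since the pipeline TPOT finds differs from run to run, the sketch below uses a hand-written stand-in pipeline (scaling followed by logistic regression), which is an assumption for illustration. After a real run, `optimizer.fitted_pipeline_` could be passed in its place.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.75, random_state=42
)

# Stand-in for the pipeline found by TPOT (illustrative assumption).
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in test accuracy.
result = permutation_importance(
    pipeline, X_test, y_test, n_repeats=5, random_state=0
)

# Sort features from most to least important.
ranking = sorted(
    zip(data.feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking[:5]:
    print(f"{name}: {importance:.4f}")
```

Permutation importance is computed on the whole pipeline, so it accounts for preprocessing steps as well as the final classifier.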

---
# Regression Example

Now that we've built a classification pipeline, let's build a regression pipeline in a similar fashion. In this exercise, we will use scikit-learn's diabetes dataset to build a regressor that can serve as a predictive model. Note that this process is nearly identical to setting up the classification pipeline, reflecting TPOT's ease of use. 

---
# Setup

In [None]:
#@title Import Statements
from tpot import TPOTRegressor
from sklearn.datasets import load_diabetes

---
# Create Regression Pipeline

In [None]:
# read in the dataset
data = load_diabetes(as_frame=False) 

# Split the data into training (75%) and test (25%) sets.
X_train, X_test, y_train, y_test = train_test_split(data['data'].astype(np.float64),
    data['target'].astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

# Create a regression pipeline optimizer object
optimizer = TPOTRegressor(max_time_mins=10, verbosity=2, random_state=137) 

# do the optimization
optimizer.fit(X_train, y_train)
print(optimizer.score(X_test, y_test)) # report performance 

OK, we've created a workable regression pipeline. However, the performance of the "optimal" pipeline is likely to be poor, because we only let TPOT run for 10 minutes. As stated earlier, one disadvantage of AutoML is that it often requires long runtimes to reach good solutions. Outside of this notebook, we could let the search run for a few days and expect a final pipeline with significantly better performance.
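
A raw score is hard to judge on its own, especially if it is an error-based metric (we assume here that the classic TPOT regressor scores with negative mean squared error by default). One way to put it in context is to compare against a trivial baseline that always predicts the training mean. The sketch below uses ordinary linear regression as a stand-in for the TPOT pipeline, which is an assumption for illustration. A fitted pipeline exposes the same `predict` method.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

models = {
    "mean_baseline": DummyRegressor(strategy="mean"),  # always predicts the training mean
    "linear_regression": LinearRegression(),           # stand-in for the TPOT pipeline
}

report = {}
for name, model in models.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    report[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, predictions))),
        "r2": float(r2_score(y_test, predictions)),
    }

for name, metrics in report.items():
    print(f"{name}: RMSE={metrics['rmse']:.1f}, R^2={metrics['r2']:.3f}")
```

A pipeline is only useful if it clearly beats the baseline. An R² near zero means the model does no better than predicting the mean.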

Like in the earlier example, we can export our pipeline as follows:

In [None]:
# export optimized pipeline 
optimizer.export('optimizedTPOTRegressionPipeline.py')

---
# Additional Reading

The original TPOT paper can be found [here](http://proceedings.mlr.press/v64/olson_tpot_2016.html).

A second paper evaluating TPOT can also be found [here](https://dl.acm.org/doi/10.1145/2908812.2908918).

Another popular platform for applying AutoML is [auto-sklearn](https://automl.github.io/auto-sklearn/master/index.html#), though as of yet, it can only run on Linux systems.

MATLAB also has [AutoML capabilities](https://www.mathworks.com/discovery/automl.html). [Here](https://www.mathworks.com/videos/automated-machine-learning-automl-with-matlab-1597226741441.html) is a video explaining how a classification pipeline can be constructed. 

---
# Summary

What have we learned?
* What AutoML is

* The basics of AutoML and TPOT

* The pros and cons of AutoML 

* How to apply TPOT to develop an optimized classification pipeline

* How to apply TPOT to develop an optimized regression pipeline