# AutoML

**References:**

https://machinelearningmastery.com/automl-libraries-for-python/

https://towardsdatascience.com/what-is-automl-6ddf27040f27

https://analyticsvidhya.com/blog/2021/05/auto-ml-some-prominent-automl-libraries/

https://towardsdatascience.com/4-python-automl-libraries-every-data-scientist-should-know-680ff5d6ad08

http://epistasislab.github.io/tpot/examples/

https://www.geeksforgeeks.org/tpot-automl/

### Contents:

1. <a href = "#Introduction:">Introduction</a>
2. <a href = "#Why-and-When?">Why and When?</a>
3. <a href = "#History-of-Auto-ML">History of Auto-ML</a>
4. <a href = "#Challenges-with-Auto-ML">Challenges with Auto-ML</a>
5. <a href = "#Various-Auto-ML-Libraries">Various Auto-ML Libraries with Respective Codes</a>

### Introduction:

Automated Machine Learning (AutoML) is about producing machine learning solutions without requiring the data scientist to run endless manual experiments on data preparation, model selection, model hyperparameters, and model compression parameters.

AutoML frameworks help the data scientist in:

- Data visualization
- Model intelligibility
- Model deployment

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)


### Why and When?

Whether to bring AutoML into a particular industry's machine learning work depends on the concerns of the stakeholders and the technical team, because it comes with valuable advantages as well as unavoidable disadvantages. Above all, AutoML lets teams use ML features much faster than developing them by hand. Time savings, an organized way of execution, and faster development are the key arguments for AutoML, and on those grounds we can say yes to it.

Please keep the following points about AutoML in mind:

- Supports data scientists: Data scientists usually have to be involved in the end-to-end life cycle. An AutoML platform can manage selected stages of the ML life cycle, which eases integration and improves productivity. AutoML is not a replacement for data scientists.

- Routine tasks in the ML model life cycle: Automating the repetitive tasks in the ML build lets data scientists focus on problems, issues, and best practices instead of on the models themselves, which in turn saves costs for your customers.

### History of Auto-ML

Let’s discuss the history of AutoML briefly.

- H2O.ai

H2O.ai was founded in 2012 and offers an open-source package. Since 2017 it has also provided a commercial AutoML service called Driverless AI. It has been widely adopted in industries including financial services and retail.

- TPOT

TPOT was developed at the University of Pennsylvania and is built on Python packages. It has reported strong performance, with accuracy in the 97-98% range on various datasets.

- Google Cloud Auto-ML

Google Cloud AutoML came to market in 2018. It offers a very user-friendly interface and excellent performance.

- Microsoft Azure Auto-ML

Microsoft released Azure AutoML in 2018. It offers a transparent model selection process and, as is usual for Microsoft products, lets developers build demonstrations quickly and easily.

![image-3.png](attachment:image-3.png)

### Challenges with Auto-ML

Even though AutoML helps to some extent, it has challenges, which are listed below.

![image-4.png](attachment:image-4.png)

**Auto-ML libraries**

The following sections discuss the major AutoML libraries available in Python.

**Comparison of Libraries**

![image-5.png](attachment:image-5.png)

### Various Auto-ML Libraries

**auto-sklearn**

auto-sklearn is an automated machine learning toolkit that integrates seamlessly with the standard sklearn interface that many in the community are familiar with. Using recent methods such as Bayesian optimization, the library navigates the space of possible models and learns to infer whether a specific configuration will work well on a given task.
Created by Matthias Feurer et al., the library’s technical details are described in the paper Efficient and Robust Automated Machine Learning. The authors write:

"… we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters)."

auto-sklearn is perhaps the best library to get started with AutoML. In addition to discovering data preparation and model selections for a dataset, it learns from models that perform well on similar datasets. Top-performing models are aggregated in an ensemble.
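
To illustrate how top-performing models can be aggregated, here is a minimal standard-library sketch of greedy ensemble selection, the technique the auto-sklearn paper names for building its ensemble. It is not auto-sklearn's implementation: the model names, validation targets, and predictions are made-up illustrative data, and `greedy_ensemble_selection` is a hypothetical helper.

```python
from statistics import fmean


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return fmean((p - t) ** 2 for p, t in zip(pred, target))


def average(preds_list):
    """Element-wise mean of several prediction vectors."""
    return [fmean(column) for column in zip(*preds_list)]


def greedy_ensemble_selection(val_preds, y_val, ensemble_size=4):
    """Repeatedly add (with replacement) the model whose inclusion gives
    the lowest validation MSE for the averaged ensemble."""
    chosen = []
    for _ in range(ensemble_size):
        best_name = min(
            val_preds,
            key=lambda name: mse(
                average([val_preds[m] for m in chosen + [name]]), y_val
            ),
        )
        chosen.append(best_name)
    return chosen


# Hypothetical validation targets and per-model predictions (illustrative only).
y_val = [1.0, 2.0, 3.0, 4.0]
val_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # consistently too high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # consistently too low
    "model_c": [3.0, 3.0, 3.0, 3.0],  # poor constant predictor
}

members = greedy_ensemble_selection(val_preds, y_val, ensemble_size=4)
ensemble_pred = average([val_preds[m] for m in members])
print("selected members:", members)
print("ensemble validation MSE:", mse(ensemble_pred, y_val))
```

With this toy data the selection balances the model that predicts too high against the one that predicts too low, and never picks the constant predictor. Selecting with replacement is what lets a strong model receive a larger weight in the final average.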

![image.png](attachment:image.png)

On top of an efficient implementation, auto-sklearn requires minimal user interaction. Install the library with `pip install auto-sklearn`.
The primary classes are `AutoSklearnClassifier` and `AutoSklearnRegressor`, which handle classification and regression tasks, respectively. Both take the same user-specified parameters, of which the most important involve time constraints and ensemble sizes.

In [12]:
# Auto-Sklearn only works in Linux!

**TPOT**

TPOT is another open-source Python library that automates the modelling pipeline, with a greater emphasis on data preparation as well as modelling algorithms and model hyperparameters. It automates feature selection, preprocessing, and construction through an evolutionary tree-based structure “called the Tree-based Pipeline Optimization Tool (TPOT) that automatically designs and optimizes machine learning pipelines.” (TPOT Paper)

The data flow architecture shows where TPOT focuses: feature engineering, model selection, and hyperparameter optimization. TPOT expects a clean dataset as input, so data wrangling and cleansing must be taken care of by the data scientist.

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

Programs, or pipelines, are represented as trees. Genetic programming selects and evolves these programs to maximize the end result of each automated machine learning pipeline.
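
The idea of evolving tree-shaped pipelines can be sketched with the standard library alone. This is a toy illustration, not TPOT's algorithm: the operator names are borrowed from scikit-learn purely as labels, `TOY_SCORES` is a made-up stand-in for cross-validation results, all function names are hypothetical, and the sketch uses mutation only, with no crossover.

```python
import random

PREPROCESSORS = ["StandardScaler", "PCA", "SelectKBest", "PolynomialFeatures"]
MODELS = ["Ridge", "RandomForest", "GradientBoosting"]

# Hypothetical per-operator scores standing in for real cross-validation.
TOY_SCORES = {
    "StandardScaler": 0.05, "PCA": -0.02, "SelectKBest": 0.03,
    "PolynomialFeatures": 0.01, "Ridge": 0.60, "RandomForest": 0.70,
    "GradientBoosting": 0.75,
}


def random_pipeline(rng, max_steps=3):
    """Build a nested-tuple tree: a model at the root wrapping a chain
    of preprocessing steps that ends at the raw input."""
    tree = "input_data"
    for _ in range(rng.randint(0, max_steps)):
        tree = (rng.choice(PREPROCESSORS), tree)
    return (rng.choice(MODELS), tree)


def operators(tree):
    """List operator names from the root down to the input leaf."""
    names = []
    while isinstance(tree, tuple):
        names.append(tree[0])
        tree = tree[1]
    return names


def fitness(tree):
    """Toy fitness: score of each distinct operator, minus a small
    penalty per node so that shorter pipelines are preferred."""
    ops = operators(tree)
    return sum(TOY_SCORES[op] for op in sorted(set(ops))) - 0.01 * len(ops)


def mutate(tree, rng):
    """Swap the root model, insert a preprocessing step, or delete one."""
    model, subtree = tree
    roll = rng.random()
    if roll < 0.4:
        return (rng.choice(MODELS), subtree)
    if roll < 0.8 or not isinstance(subtree, tuple):
        return (model, (rng.choice(PREPROCESSORS), subtree))
    return (model, subtree[1])


def evolve(generations=10, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half, refill with mutated copies.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(best, round(fitness(best), 3))
```

Because the fitter half always survives unchanged, the best fitness never decreases from one generation to the next. In a real system the fitness call is the expensive part, since each candidate pipeline has to be trained and cross-validated.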

As Pedro Domingos says, “a dumb algorithm with lots of data beats a clever one with limited data.” This is indeed the case: TPOT can generate sophisticated data preprocessing pipelines.

![image-3.png](attachment:image-3.png)

Like many AutoML algorithms, TPOT's pipeline optimizers can take a few hours to produce great results unless the dataset is small.

### Let’s explore TPOT options

- We divide our data into training and held-out test sets.
- TPOT takes care of model selection and tuning.

In [13]:
# install TPOT and its dependencies
# (the PyPI package is named scikit-learn; the old "sklearn" alias is deprecated)
%pip install scikit-learn fsspec xgboost
%pip install -U distributed scikit-learn dask-ml dask-glm
%pip install "tornado>=5"
%pip install "dask[complete]"
%pip install tpot

Note: you may need to restart the kernel to use updated packages.


In [14]:
# import required modules
from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np

In [15]:
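# load the Boston housing dataset
# (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this needs an older release)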
X, y = load_boston(return_X_y=True)


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

In [16]:
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [17]:
# define the TPOTRegressor: up to 10 generations with a population of 50 pipelines
reg = TPOTRegressor(verbosity=2, population_size=50, generations=10, random_state=35)

In [18]:
# fit the regressor on training data
reg.fit(X_train, y_train)

Generation 1 - Current best internal CV score: -11.539488893707064
Generation 2 - Current best internal CV score: -11.539488893707064
Generation 3 - Current best internal CV score: -11.539488893707064
Generation 4 - Current best internal CV score: -11.539488893707064
Generation 5 - Current best internal CV score: -11.539488893707064

In [19]:
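# score the best pipeline on the test data
# (TPOTRegressor's default scoring is negative mean squared error, so values closer to 0 are better)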
print(reg.score(X_test, y_test))

-8.690240118115778
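
The negative score is easier to read after a sign flip and a square root. The short sketch below assumes the value printed above is a negative mean squared error, which is TPOT's default regression metric when no `scoring` argument is passed, as in this notebook.

```python
import math

neg_mse = -8.690240118115778  # test score printed above

mse = -neg_mse          # flip the sign to get the mean squared error
rmse = math.sqrt(mse)   # root mean squared error, in the target's own units

print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
```

The RMSE comes out at roughly 2.95, expressed in the same units as the target variable, which makes it more directly comparable to the house prices being predicted.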




In [20]:
# export the best pipeline as Python code to top_boston.py
reg.export('top_boston.py')


**H2O’s AutoML**

H2O’s AutoML is developed by H2O.ai. Unlike TPOT, it also handles preprocessing, including imputation of missing values and encoding, along with model selection and hyperparameter tuning. It also provides deployable code so that the team can deploy quickly.

**HyperOpt**

HyperOpt is a library for Bayesian optimization that helps optimize models with hundreds of parameters. As the name implies, it is used specifically to optimize pipelines. To simplify its use, it has been integrated with sklearn under the name HyperOpt-sklearn.

**Explainability**

H2O AutoML also provides insight into a model’s global explainability, such as variable importance, partial dependence plots, SHAP values, and model correlation, with just one line of code.

In addition, it provides local explainability for individual records. We can pass an H2OFrame as the frame argument and indicate which row we would like explained using the row_index argument, for example row 15 of the test frame.
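
The explainability calls described above might look like the following sketch. It assumes the `h2o` package and a Java runtime are installed; the wrapper function `explain_with_h2o`, the file paths, the target column name, and the `max_models` value are hypothetical placeholders and not part of the original notebook.

```python
def explain_with_h2o(train_path, test_path, target, row_index=15, max_models=10):
    """Train H2O AutoML, then produce global and single-row explanations.

    train_path, test_path, and target are placeholders for your own data.
    """
    # Imported lazily so the function can be defined without h2o installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster
    train = h2o.import_file(train_path)
    test = h2o.import_file(test_path)

    features = [col for col in train.columns if col != target]
    aml = H2OAutoML(max_models=max_models, seed=1)
    aml.train(x=features, y=target, training_frame=train)

    # Global explanations: variable importance, partial dependence,
    # SHAP summary, model correlation, and related plots.
    global_explanation = aml.explain(test)

    # Local explanation for one record of the test frame.
    local_explanation = aml.explain_row(frame=test, row_index=row_index)
    return aml, global_explanation, local_explanation
```

A call such as `explain_with_h2o("train.csv", "test.csv", target="price")` (hypothetical file and column names) would explain row 15 of the test frame by default.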

#### End!