# What is TPOT?

TPOT is a Python library that automates the selection of a Machine Learning model and its hyperparameters, saving you time and often improving your results. Instead of manually testing different models and configurations for each new dataset, you let TPOT explore many candidate Machine Learning pipelines with genetic programming and report the one that performs best on your specific dataset.

In short, TPOT automates the search for a good model and its parameters, which can significantly speed up the development of Machine Learning models and help you achieve better performance in your data analysis tasks.

# Why use TPOT?

Automated Machine Learning (AutoML) tools address a simple problem: how can the creation and training of models be made less time-consuming?

AutoML, as the name suggests, automates a large part of the model creation process without sacrificing quality, allowing Data Scientists to focus on analysis. Its pipeline consists of several processes that help build a high-performing Machine Learning model (feature engineering, model generation, hyperparameter optimization).

To clarify, in Machine Learning a pipeline encodes and automates the workflow that turns raw data into a trained model. Each step transforms the data and passes its output to the next step, so the data reaches the model without manual intervention.

A pipeline can also be used to separate the workflow of a model into different independent and reusable parts, simplifying its creation and avoiding task repetition.

A good pipeline enhances the efficiency and scalability of building and deploying Machine Learning models.
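
To make the idea of a pipeline concrete, here is a minimal sketch using scikit-learn's `Pipeline` class, on which TPOT builds. The synthetic data and the step names (`scale`, `model`) are illustrative assumptions, not part of the notebook below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each named step is an independent, reusable part of the workflow.
pipe = Pipeline([
    ("scale", StandardScaler()),                    # data preparation
    ("model", LogisticRegression(max_iter=1000)),   # the model itself
])

# One call fits every step in order; scoring reuses the fitted steps.
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the scaler and the model live in one object, either step can be swapped out without touching the rest of the code, which is the kind of substitution an AutoML tool performs automatically.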

Additionally, TPOT offers great flexibility as it can be adapted for neural network models with PyTorch. TPOT also supports the use of Dask for parallel training, further enhancing its capabilities.

# How does TPOT work?

TPOT, or Tree-based Pipeline Optimization Tool, represents each candidate pipeline as a tree of operators. This tree covers data preparation, algorithm modeling, hyperparameter settings and model selection.

Below is an example of a pipeline showing the elements automated by TPOT:

![IMG](https://datascientest.com/wp-content/uploads/2023/04/image1-1.png)

By combining stochastic search algorithms like genetic programming and a flexible expression tree representation, TPOT automatically designs and optimizes features, Machine Learning models, and hyperparameters. The goal is to maximize the accuracy of supervised classification on your dataset.

It’s essential to note that finding the most optimized pipeline may require letting TPOT work for a certain amount of time. Running TPOT for just a few minutes may not be sufficient to discover the best model for your dataset.

Depending on the size of your dataset, TPOT can take several hours or even days to complete its search. For a comprehensive and effective search, it’s recommended to run multiple instances concurrently for several hours.

Since TPOT’s optimization algorithm is stochastic, meaning it involves partial randomization, it’s possible that two runs may recommend different pipelines for the same dataset.

When that happens, either the runs were too short for the search to converge, or the different pipelines have very similar performance scores.
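
The stochastic search described in this section can be illustrated with a deliberately simplified toy example. This is not TPOT's algorithm: TPOT evolves whole pipeline trees with genetic programming, whereas the sketch below only evolves two hyperparameters of a single decision tree. All names (`random_candidate`, `mutate`, `fitness`) and the synthetic data are assumptions made for illustration.

```python
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = random.Random(42)  # fixed seed so this toy search is repeatable


def random_candidate():
    """Draw a random hyperparameter setting (one 'individual')."""
    return {"max_depth": rng.randint(1, 10), "min_samples_leaf": rng.randint(1, 20)}


def mutate(candidate):
    """Copy a candidate and randomly nudge one of its hyperparameters."""
    child = dict(candidate)
    key = rng.choice(sorted(child))
    child[key] = max(1, child[key] + rng.choice([-2, -1, 1, 2]))
    return child


def fitness(candidate):
    """Mean 3-fold cross-validated accuracy of a tree with these settings."""
    model = DecisionTreeClassifier(random_state=0, **candidate)
    return cross_val_score(model, X, y, cv=3).mean()


population = [random_candidate() for _ in range(8)]
best_per_generation = []

for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:4]  # selection: keep the better half unchanged
    best_per_generation.append(fitness(survivors[0]))
    # variation: refill the population with mutated copies of survivors
    population = survivors + [mutate(rng.choice(survivors)) for _ in range(4)]

best = max(population, key=fitness)
print(best, best_per_generation)
```

Changing the seed passed to `random.Random` generally leads to a different final candidate, which mirrors why two TPOT runs can recommend different pipelines for the same dataset.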


# Find the best pipeline with TPOT

Now that we understand what TPOT is and its significance, let’s see how to set it up and use it. As a reminder, TPOT is based on scikit-learn, which makes its code familiar if you have already worked with this library.

1. Start by installing TPOT with `pip install tpot`, then import the TPOT classes and any other modules you need to define your model.
2. During the data transformation step, separate the target variable from the explanatory variables. TPOT's documentation examples rename the target column to `class`, but this is a convention rather than a requirement: the notebook below passes the target to .fit() as a separate variable. TPOT only handles data in numeric format, so you will need to apply the necessary transformations to the explanatory variables.
3. After splitting your dataset into a training set and a test set, you can define your TPOTClassifier and its parameters.
4. To apply TPOT to your dataset, simply use the .fit() method.
5. Once the computation is complete, you will see the best pipeline for your dataset in the output. You can then use the .score() method to measure the performance of the model chosen by TPOT.

# Import Needed Libraries

In [1]:
# Tree-based Pipeline Optimization Tool
! pip install tpot



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from tpot import TPOTClassifier

# EDA

In [3]:
df = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')
df.head()

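| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |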


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


# Preprocessing

### Define Features X and Target y

In [5]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

### Encode string columns

In [6]:
X = X.apply(LabelEncoder().fit_transform)
X.head()

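| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 1 | 1 | 41 | 147 | 0 | 1 | 98 | 0 | 10 | 2 |
| 1 | 21 | 0 | 2 | 55 | 40 | 0 | 1 | 82 | 0 | 20 | 1 |
| 2 | 9 | 1 | 1 | 31 | 141 | 0 | 2 | 25 | 0 | 10 | 2 |
| 3 | 20 | 0 | 0 | 39 | 72 | 0 | 1 | 34 | 1 | 25 | 1 |
| 4 | 26 | 1 | 2 | 49 | 53 | 0 | 1 | 48 | 0 | 10 | 2 |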
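
Note that the cell above applies `LabelEncoder` to every column, so numeric columns are replaced by rank codes as well (in the first row, an Age of 40 becomes 12 and an Oldpeak of 0.0 becomes 10). If you would rather keep numeric values as they are, one possible alternative is to encode only the non-numeric columns. The sketch below uses scikit-learn's `OrdinalEncoder` on a few rows copied from the `df.head()` output; the names `X_demo` and `X_encoded` are illustrative, and this is not what the notebook itself ran.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A few rows copied from the df.head() output (subset of columns).
X_demo = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA"],
    "Oldpeak": [0.0, 1.0, 0.0],
})

# Select only the non-numeric columns; numeric columns keep their values.
categorical_cols = X_demo.select_dtypes(exclude="number").columns

# Replace each category with an integer code, column by column.
encoder = OrdinalEncoder()
X_encoded = X_demo.copy()
X_encoded[categorical_cols] = encoder.fit_transform(X_demo[categorical_cols])

print(X_encoded)
```

Using this variant would change the inputs to the later cells, so the scores reported further down would not carry over unchanged.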


### Scaling Data

In [7]:
StandardScalerModel = StandardScaler()
X = StandardScalerModel.fit_transform(X)
X

array([[-1.4331398 ,  0.51595242,  0.22903206, ..., -0.8235563 ,
        -0.87246276,  1.05211381],
       [-0.47848359, -1.93816322,  1.27505906, ..., -0.8235563 ,
         0.12037326, -0.59607813],
       [-1.75135854,  0.51595242,  0.22903206, ..., -0.8235563 ,
        -0.87246276,  1.05211381],
       ...,
       [ 0.37009972,  0.51595242, -0.81699495, ...,  1.21424608,
         0.31894046, -0.59607813],
       [ 0.37009972, -1.93816322,  0.22903206, ..., -0.8235563 ,
        -0.87246276, -0.59607813],
       [-1.64528563,  0.51595242,  1.27505906, ..., -0.8235563 ,
        -0.87246276,  1.05211381]])

### Split Data into train and test

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)

In [9]:
X_train.shape

(734, 11)

In [10]:
X_test.shape

(184, 11)
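
The notebook scales the full dataset before splitting it, which means the test rows contribute to the scaler's mean and standard deviation. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data with the same shape as the heart dataset; `X_raw`, `y_raw` and the `*_scaled` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the heart dataset (918 rows, 11 features).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(918, 11))
y_raw = rng.integers(0, 2, size=918)

# Split first, using the same settings as the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=0, shuffle=True
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering, the test set stays unseen until evaluation, so the test score is a cleaner estimate of performance on new data.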

# Create Model

[TPOT API documentation](https://epistasislab.github.io/tpot/api/)

In [11]:
tpot = TPOTClassifier(generations=10, verbosity=2)
tpot.fit(X_train, y_train)

Optimization Progress:   0%|          | 0/1100 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8800857329233063

Generation 2 - Current best internal CV score: 0.8800857329233063

Generation 3 - Current best internal CV score: 0.8814649147330165

Generation 4 - Current best internal CV score: 0.8828254589507036

Generation 5 - Current best internal CV score: 0.8828254589507036

Generation 6 - Current best internal CV score: 0.885537228590066

Generation 7 - Current best internal CV score: 0.885537228590066

Generation 8 - Current best internal CV score: 0.885537228590066

Generation 9 - Current best internal CV score: 0.8896188612431274

Generation 10 - Current best internal CV score: 0.8896188612431274

Best pipeline: RandomForestClassifier(MinMaxScaler(StandardScaler(input_matrix)), bootstrap=False, criterion=gini, max_features=0.2, min_samples_leaf=3, min_samples_split=11, n_estimators=100)
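
The best pipeline string reads from the inside out: the input is passed through `StandardScaler`, then `MinMaxScaler`, then a `RandomForestClassifier` with the listed hyperparameters. As a rough sketch, the same pipeline can be rebuilt by hand with scikit-learn's `make_pipeline`. The synthetic data and the added `random_state` are assumptions for illustration, so the accuracy printed here is not comparable to the notebook's scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in with the same shape as the heart dataset.
X_demo, y_demo = make_classification(n_samples=918, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Steps are listed in execution order, i.e. innermost operator first.
best_pipeline = make_pipeline(
    StandardScaler(),
    MinMaxScaler(),
    RandomForestClassifier(
        bootstrap=False,
        criterion="gini",
        max_features=0.2,
        min_samples_leaf=3,
        min_samples_split=11,
        n_estimators=100,
        random_state=0,  # assumption: not part of the pipeline TPOT reported
    ),
)

best_pipeline.fit(X_tr, y_tr)
demo_accuracy = best_pipeline.score(X_te, y_te)
print(f"accuracy on synthetic data: {demo_accuracy:.3f}")
```

Rebuilding the pipeline this way lets you retrain or deploy the chosen model with plain scikit-learn, without rerunning the search.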


In [12]:
print(tpot.score(X_train, y_train))

0.9509536784741145


In [13]:
print(tpot.score(X_test, y_test))

0.8532608695652174
