<a href="https://colab.research.google.com/github/Ferrariagustinpablo/Python-Machine-Learning-notebooks/blob/main/TPOT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TPOT

https://www.datacamp.com/community/tutorials/tpot-machine-learning-python

TPOT (Tree-based Pipeline Optimization Tool) automates the search for a machine learning pipeline and returns the best-performing pipeline it finds.

There are many components to consider when solving a machine learning problem, including data preparation, feature selection, feature engineering, model selection and validation, and hyperparameter tuning.

The challenge is to find the best performing combination of techniques so that you can minimize the error in your predictions.

This is the main reason people are developing AutoML algorithms and platforms: so that anyone, even without machine learning expertise, can build models without spending much time or effort. One such platform is available as a Python library: TPOT.

TPOT automatically performs feature selection, feature preprocessing, feature construction, model selection, and parameter optimization.

Once TPOT is finished searching, it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.


In [None]:
!pip install tpot


# Genetic Programming

With the right data, computing power, and machine learning model, you can solve many problems, but knowing which model to use can be challenging because there are so many of them, such as Decision Trees, SVM, and KNN. That's where genetic programming can help. Genetic programming, like genetic algorithms in general, is inspired by the Darwinian process of natural selection and is used to generate solutions to optimization and search problems in computer science.

Selection: You have a population of possible solutions to a given problem and a fitness function. At every iteration, you evaluate how fit each solution is using the fitness function.

Crossover: Then you select the fittest ones and perform crossover to create a new population.

Mutation: You take those children, mutate them with some random modification, and repeat the process until you reach a sufficiently fit solution or run out of iterations.

Well, it turns out that choosing the right machine learning model and all the best hyperparameters for that model is itself an optimization problem for which genetic programming can be used.
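
The selection, crossover, and mutation steps described above can be sketched on a toy problem. This is only an illustration, not TPOT's implementation: the bit-list encoding, the fitness function (count of 1s), and every parameter value are assumptions chosen to keep the example small.

```python
import random

def fitness(bits):
    """Toy fitness function (an assumption): the number of 1s in the bit list."""
    return sum(bits)

def evolve(n_bits=20, population_size=30, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Start from a random population of candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: rank by fitness and keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(children) < population_size - len(parents):
            # Crossover: splice two distinct parents at a random cut point.
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        # Parents survive unchanged, so the best fitness never decreases.
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best), best)
```

In TPOT the individuals are whole pipelines rather than bit lists, and the fitness is a cross-validation score, but the loop has the same shape.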

# Making the model

In [None]:
import pandas as pd
import numpy as np

telescope_data=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data',header=None)
telescope_data.columns = ['fLength', 'fWidth','fSize','fConc','fConcl','fAsym','fM3Long','fM3Trans','fAlpha','fDist','class']
telescope_data.head()

| | fLength | fWidth | fSize | fConc | fConcl | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |


In [None]:
telescope_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19020 entries, 0 to 19019
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   fLength   19020 non-null  float64
 1   fWidth    19020 non-null  float64
 2   fSize     19020 non-null  float64
 3   fConc     19020 non-null  float64
 4   fConcl    19020 non-null  float64
 5   fAsym     19020 non-null  float64
 6   fM3Long   19020 non-null  float64
 7   fM3Trans  19020 non-null  float64
 8   fAlpha    19020 non-null  float64
 9   fDist     19020 non-null  float64
 10  class     19020 non-null  object 
dtypes: float64(10), object(1)
memory usage: 1.6+ MB


All ten feature columns are numeric (float64). The target variable, class, is however categorical. You can check the counts of each class in the target variable using the value_counts() method.

In [None]:
telescope_data['class'].value_counts()

g    12332
h     6688
Name: class, dtype: int64

It's generally a good idea to randomly shuffle the data before starting, to remove any ordering present in the data. You can rearrange the rows of the DataFrame using NumPy's random.permutation() function. To reset the index after the shuffle, use the reset_index() method with drop=True as a parameter.

In [None]:
telescope_shuffle=telescope_data.iloc[np.random.permutation(len(telescope_data))]
tele=telescope_shuffle.reset_index(drop=True)
tele.head()

| | fLength | fWidth | fSize | fConc | fConcl | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.4189 | 21.0294 | 3.0388 | 0.299 | 0.1596 | -38.5453 | -36.8342 | 9.1272 | 3.0122 | 289.341 | g |
| 1 | 20.7445 | 17.3249 | 2.5916 | 0.4686 | 0.2446 | 11.7553 | 17.0726 | -12.0534 | 25.7775 | 91.6373 | g |
| 2 | 75.4045 | 40.178 | 3.4087 | 0.1178 | 0.0599 | 68.7596 | 60.9121 | 12.1566 | 13.77 | 230.517 | g |
| 3 | 21.6208 | 16.5914 | 2.6698 | 0.4257 | 0.262 | 27.3897 | 9.4802 | 10.3581 | 34.221 | 230.005 | g |
| 4 | 12.9624 | 10.9524 | 2.1538 | 0.7368 | 0.4667 | -13.8539 | 6.602 | -5.0659 | 3.5444 | 168.668 | g |
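
The shuffle above gives a different row order on every run. If a repeatable order is needed, one possible alternative is pandas' sample() with a fixed random_state. The small DataFrame and the seed value 42 below are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Small stand-in DataFrame with hypothetical values.
df = pd.DataFrame({
    "fLength": [28.8, 31.6, 162.1, 23.8],
    "class": ["g", "g", "h", "h"],
})

# sample(frac=1) returns every row in random order;
# random_state makes that order repeatable across runs.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```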


Replace the g class (signal) with the numerical value 0 and the h class (background) with the numerical value 1.

In [None]:
tele['class']=tele['class'].map({'g':0,'h':1})

tele.head()

| | fLength | fWidth | fSize | fConc | fConcl | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.4189 | 21.0294 | 3.0388 | 0.299 | 0.1596 | -38.5453 | -36.8342 | 9.1272 | 3.0122 | 289.341 | 0 |
| 1 | 20.7445 | 17.3249 | 2.5916 | 0.4686 | 0.2446 | 11.7553 | 17.0726 | -12.0534 | 25.7775 | 91.6373 | 0 |
| 2 | 75.4045 | 40.178 | 3.4087 | 0.1178 | 0.0599 | 68.7596 | 60.9121 | 12.1566 | 13.77 | 230.517 | 0 |
| 3 | 21.6208 | 16.5914 | 2.6698 | 0.4257 | 0.262 | 27.3897 | 9.4802 | 10.3581 | 34.221 | 230.005 | 0 |
| 4 | 12.9624 | 10.9524 | 2.1538 | 0.7368 | 0.4667 | -13.8539 | 6.602 | -5.0659 | 3.5444 | 168.668 | 0 |
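
One thing worth knowing about map() with a dictionary is that any label missing from the dictionary becomes NaN rather than raising an error. The sketch below shows a quick check for that. The label "x" is a hypothetical unexpected value and does not occur in this dataset.

```python
import pandas as pd

# "x" stands in for a hypothetical label that the mapping does not cover.
labels = pd.Series(["g", "h", "g", "x"])
encoded = labels.map({"g": 0, "h": 1})

# Unmapped labels turn into NaN, so counting NaNs reveals them.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

Running the same isna().sum() check on tele['class'] after the mapping should give 0 for this dataset, since it only contains g and h.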


Now you store the class labels, which you need to predict, in a separate variable tele_class.

In [None]:
tele_class = tele['class'].values

You should also handle missing values before using TPOT.

In [None]:
pd.isnull(tele).any()

fLength     False
fWidth      False
fSize       False
fConc       False
fConcl      False
fAsym       False
fM3Long     False
fM3Trans    False
fAlpha      False
fDist       False
class       False
dtype: bool

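This dataset doesn't have any missing values. Note that in datasets with missing values, you can either drop the affected rows or columns using the dropna() method or replace the missing values with some dummy value using the fillna() method.

You will now split the DataFrame into a training set and a validation set, holding out 25% of the rows and stratifying on the class labels so that both sets keep the same class proportions.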
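
The two options can be sketched on a small made-up DataFrame. The values are hypothetical, and filling with the column median is just one possible choice of replacement value, not something the original tutorial prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "fLength": [28.8, np.nan, 162.1],
    "fWidth": [16.0, 11.7, np.nan],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: replace missing values, here with each column's median.
filled = df.fillna(df.median())

print(len(dropped))
print(filled)
```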

In [None]:
from sklearn.model_selection import train_test_split
training_indices, validation_indices = train_test_split(tele.index,
        stratify = tele_class, test_size=0.25)

training_indices.size, validation_indices.size

(14265, 4755)
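
To see what the stratify argument does, the following sketch splits a made-up label array with a 60/40 class balance and compares the positive-class share in both parts. The labels and the random_state value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
idx = np.arange(len(y))

# Same call pattern as above: split the indices, stratified on the labels.
train_idx, val_idx = train_test_split(
    idx, stratify=y, test_size=0.25, random_state=0
)

# Share of class 1 in each part; stratification keeps them aligned.
print(y[train_idx].mean(), y[val_idx].mean())
```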

Note that running the code in the cell below can take several hours to finish. With the given TPOT settings (5 generations with a population size of 100), TPOT evaluates 600 pipeline configurations before finishing: the initial population of 100 plus 100 offspring in each of the 5 generations. To put this number into context, think about a grid search of 600 hyperparameter combinations for a machine learning algorithm and how long that grid search would take. That is 600 model configurations to evaluate with 5-fold cross-validation, which means that roughly 3000 models are fit and evaluated on the training data. That's a time-consuming procedure! TPOTClassifier also accepts arguments such as max_time_mins, max_eval_time_mins, and population_size that you can use to control how long the search runs.
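
The size of the search can be worked out with a few lines of arithmetic. According to TPOT's documentation, the total number of pipelines evaluated is population_size + generations x offspring_size, which counts the initial population as well as the offspring and agrees with the max=600.0 shown in the progress bar output. The helper function below is hypothetical and not part of TPOT.

```python
def tpot_evaluation_budget(generations, population_size, offspring_size=None, cv=5):
    """Hypothetical helper: count the pipelines and model fits in one TPOT run."""
    # TPOT uses population_size for offspring_size when it is left as None.
    if offspring_size is None:
        offspring_size = population_size
    # Initial population plus the offspring produced in every generation.
    pipelines = population_size + generations * offspring_size
    # Each pipeline is fit once per cross-validation fold.
    return pipelines, pipelines * cv

pipelines, fits = tpot_evaluation_budget(generations=5, population_size=100)
print(pipelines, fits)  # 600 3000
```

Halving population_size roughly halves both numbers, which is often the simplest way to shorten a first exploratory run.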

In [None]:
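from tpot import TPOTClassifier
# from tpot import TPOTRegressor  # use this instead for regression problems

# population_size is left at its default of 100; verbosity=2 reports progress and the best score
tpot = TPOTClassifier(generations=5,verbosity=2)

# Fit on the training rows only; the features are all columns except 'class'
tpot.fit(tele.drop('class',axis=1).loc[training_indices].values,
         tele.loc[training_indices,'class'].values)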

HBox(children=(FloatProgress(value=0.0, description='Optimization Progress', max=600.0, style=ProgressStyle(de…


Generation 1 - Current best internal CV score: 0.8801962846126884
Generation 2 - Current best internal CV score: 0.8801962846126884
Generation 3 - Current best internal CV score: 0.8801962846126884
Generation 4 - Current best internal CV score: 0.880546792849632

Best pipeline: RandomForestClassifier(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), bootstrap=True, criterion=gini, max_features=0.15000000000000002, min_samples_leaf=1, min_samples_split=12, n_estimators=100)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=5,
               log_file=<ipykernel.iostream.OutStream object at 0x7fe125c39b00>,
               max_eval_time_mins=5, max_time_mins=None, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=100,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)


In [None]:
tpot.score(tele.drop('class',axis=1).loc[validation_indices].values,
           tele.loc[validation_indices, 'class'].values)

In [None]:
tpot.export('tpot_MAGIC_Gamma_Telescope_pipeline.py')

In [None]:
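## Example of a pipeline file exported by TPOT: tpot_MAGIC_Gamma_Telescope_pipeline.py
## Note: this pipeline differs from the best pipeline reported above, because separate TPOT runs can recommend different pipelines.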
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=42)

# Score on the training set was:0.881597992166
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    RobustScaler(),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=6, max_features=0.6, min_samples_leaf=12, min_samples_split=3, n_estimators=100, subsample=0.55)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

# TPOT can recommend different solutions

If you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may recommend different pipelines for the same dataset. When two TPOT runs recommend different pipelines, it means either that the runs didn't converge for lack of time or that multiple pipelines perform more or less the same on your dataset.