<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

---
# AutoML with TPOT
---

<center><img src="https://raw.githubusercontent.com/insaid2018/Term-2/master/images/tpot-logo.jpg" width="300" height="300" /></center>

---
## Table of Contents
---
1. [Introduction](#section1)<br>
2. [Installing and importing libraries](#section2)
   - 2.1. [Installing TPOT](#section201)
   - 2.2. [Importing Libraries](#section202)
3. [TPOT for Classification](#section3)
   - 3.1. [Problem Statement](#section301)
   - 3.2. [Reading the dataset](#section302)
   - 3.3. [Data Pre-Processing](#section303)
   - 3.4. [Defining Model Evaluation](#section304)
4. [TPOT for Regression](#section4)
   - 4.1. [Problem Statement](#section301)
   - 4.2. [Reading the dataset](#section302)
   - 4.3. [Data Pre-Processing](#section303)
   - 4.4. [Defining Model Evaluation](#section304)
5.  [Conclusions](#section801) 

---
### 1. Introduction
---

- **Tree-based Pipeline Optimization Tool**, or TPOT for short, is a **Python library** for **automated machine learning**.

- **TPOT** uses a **tree-based** structure to represent a **model** pipeline for a **predictive** modeling problem, and searches over candidate pipelines with genetic programming.

<center><img src="https://raw.githubusercontent.com/insaid2018/Term-2/master/images/Overview-of-the-TPOT-Pipeline-Search.png" width="600" height="300" /></center>
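To make the tree idea concrete, here is a minimal sketch in plain Python. It is not TPOT's internal data structure: the nested-tuple layout, the `render` helper, and the example pipeline are all assumptions chosen for illustration. It only shows how a tree of operators can be read as a nested expression, which is the notation TPOT uses when it reports a pipeline.

```python
# Illustrative only: a pipeline tree as nested (operator_name, children) tuples.
# The leaf "input_matrix" stands for the raw feature matrix.
def render(node):
    """Render a pipeline tree as a nested expression string."""
    if isinstance(node, str):
        return node
    name, children = node
    return f"{name}({', '.join(render(child) for child in children)})"


# Hypothetical two-step pipeline: features are transformed, then passed to a model.
tree = ("RidgeCV", [("PolynomialFeatures", ["input_matrix"])])
rendered = render(tree)
print(rendered)
```

Reading the expression from the inside out gives the order in which the steps are applied to the data.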



---
### 2. Installing and importing libraries
---

#### 2.1. Installing TPOT

In [None]:
!pip install tpot



#### 2.2. Importing Libraries

In [None]:
import tpot
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold
from tpot import TPOTClassifier, TPOTRegressor

In [None]:
tpot.__version__

'0.11.6.post2'

---
### 3. TPOT for Classification
---

#### 3.1. Problem Statement

- The **dataset** contains **signals** obtained from a variety of different aspect angles, spanning **90** degrees for the cylinder and **180** degrees for the rock.

- Each pattern is a set of **60** numbers in the range **0.0** to **1.0**.  

- Each number represents the **energy** within a particular **frequency** band, integrated over a certain **period** of time.  

- The **integration** aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the **chirp**.

- The **label** associated with each record contains the letter "**R**" if the object is a **rock** and "**M**" if it is a **mine** (metal cylinder).  

- The **numbers** in the **labels** are in increasing order of **aspect** angle, but they do not encode the **angle** directly.



#### 3.2. Reading the dataset

In [None]:
dataframe = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-2/master/Data/sonar.csv')
dataframe.head()

Unnamed: 0,0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
0,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,0.4918,0.6552,0.6919,0.7797,0.7464,0.9444,1.0,0.8874,0.8024,0.7818,0.5212,0.4052,0.3957,0.3914,0.325,0.32,0.3271,0.2767,0.4423,0.2028,0.3788,0.2947,0.1984,0.2341,0.1306,0.4182,0.3835,0.1057,0.184,0.197,0.1674,0.0583,0.1401,0.1628,0.0621,0.0203,0.053,0.0742,0.0409,0.0061,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
1,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,0.6333,0.706,0.5544,0.532,0.6479,0.6931,0.6759,0.7551,0.8929,0.8619,0.7974,0.6737,0.4293,0.3648,0.5331,0.2413,0.507,0.8533,0.6036,0.8514,0.8512,0.5045,0.1862,0.2709,0.4232,0.3043,0.6116,0.6756,0.5375,0.4719,0.4647,0.2587,0.2129,0.2222,0.2111,0.0176,0.1348,0.0744,0.013,0.0106,0.0033,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
2,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,0.0881,0.1992,0.0184,0.2261,0.1729,0.2131,0.0693,0.2281,0.406,0.3973,0.2741,0.369,0.5556,0.4846,0.314,0.5334,0.5256,0.252,0.209,0.3559,0.626,0.734,0.612,0.3497,0.3953,0.3012,0.5408,0.8814,0.9857,0.9167,0.6121,0.5006,0.321,0.3202,0.4295,0.3654,0.2655,0.1576,0.0681,0.0294,0.0241,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
3,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,0.4152,0.3952,0.4256,0.4135,0.4528,0.5326,0.7306,0.6193,0.2032,0.4636,0.4148,0.4292,0.573,0.5399,0.3161,0.2285,0.6995,1.0,0.7262,0.4724,0.5103,0.5459,0.2881,0.0981,0.1951,0.4181,0.4604,0.3217,0.2828,0.243,0.1979,0.2444,0.1847,0.0841,0.0692,0.0528,0.0357,0.0085,0.023,0.0046,0.0156,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R
4,0.0286,0.0453,0.0277,0.0174,0.0384,0.099,0.1201,0.1833,0.2105,0.3039,0.2988,0.425,0.6343,0.8198,1.0,0.9988,0.9508,0.9025,0.7234,0.5122,0.2074,0.3985,0.589,0.2872,0.2043,0.5782,0.5389,0.375,0.3411,0.5067,0.558,0.4778,0.3299,0.2198,0.1407,0.2856,0.3807,0.4158,0.4054,0.3296,0.2707,0.265,0.0723,0.1238,0.1192,0.1089,0.0623,0.0494,0.0264,0.0081,0.0104,0.0045,0.0014,0.0038,0.0013,0.0089,0.0057,0.0027,0.0051,0.0062,R


#### 3.3. Data Pre-Processing

In [None]:
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')

In [None]:
y = LabelEncoder().fit_transform(y.astype('str'))
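As a rough picture of what the label encoding step does, the sketch below reproduces the idea in plain Python. It is a simplified stand-in, not scikit-learn's implementation, and the sample labels are made up. The classes are sorted and then numbered from 0, so "M" maps to 0 and "R" maps to 1.

```python
# Simplified stand-in for LabelEncoder: sort the classes, then number them from 0.
labels = ["R", "R", "M", "R", "M"]  # made-up sample, not rows from the dataset

classes = sorted(set(labels))
to_index = {label: index for index, label in enumerate(classes)}
encoded = [to_index[label] for label in labels]

print(classes)
print(encoded)
```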

#### 3.4. Defining Model Evaluation

In [None]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [None]:
# define search
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_sonar_best_model.py')



Generation 1 - Current best internal CV score: 0.8647619047619047

Generation 2 - Current best internal CV score: 0.8661904761904763

Generation 3 - Current best internal CV score: 0.8745238095238095

Generation 4 - Current best internal CV score: 0.8776984126984128

Generation 5 - Current best internal CV score: 0.8776984126984128

Best pipeline: GradientBoostingClassifier(ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.7500000000000001, min_samples_leaf=3, min_samples_split=2, n_estimators=100), learning_rate=1.0, max_depth=9, max_features=0.7000000000000001, min_samples_leaf=1, min_samples_split=13, n_estimators=100, subsample=0.9500000000000001)
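The search settings translate into a sizeable compute budget. The arithmetic below is a back-of-the-envelope sketch that assumes `offspring_size` is left at its default, which equals `population_size`. It ignores caching, duplicate pipelines, and pipelines that fail or time out, so the result is an upper-bound estimate rather than an exact count.

```python
generations = 5
population_size = 50
offspring_size = population_size  # assumption: offspring_size left at its default

n_splits, n_repeats = 10, 3  # settings passed to RepeatedStratifiedKFold

# Pipelines scored during the search: the initial population plus each generation's offspring.
pipelines_evaluated = population_size + generations * offspring_size
# Each pipeline is scored once per cross-validation fold.
fits_per_pipeline = n_splits * n_repeats
total_fits = pipelines_evaluated * fits_per_pipeline

print(pipelines_evaluated, fits_per_pipeline, total_fits)
```

Reducing `n_repeats` or `population_size` is the most direct way to shorten a run, at the cost of a noisier or narrower search.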


---
### 4. TPOT for Regression
---

#### 4.1. Problem Statement 

- The dataset is **Auto Insurance in Sweden**.

- **X** = **Number** of **claims**.

- **Y** = Total payment for all the claims, in **thousands** of **Swedish Kronor**, for **geographical** zones in Sweden.

#### 4.2. Reading the dataset

In [None]:
dataframe = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-2/master/auto-insurance.csv', header=None)
dataframe.head()

Unnamed: 0,0,1
0,108,392.5
1,19,46.2
2,13,15.7
3,124,422.2
4,40,119.4


#### 4.3. Data Pre-Processing

In [None]:
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# confirm the shapes of the inputs and the target
print(X.shape, y.shape)

(62, 1) (62,)


#### 4.4. Defining Model Evaluation

In [None]:
# define model evaluation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

In [None]:
model = TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error', cv=cv, verbosity=2, random_state=1, n_jobs=-1)
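The `neg_mean_absolute_error` scorer reports the mean absolute error with its sign flipped, so that a larger score always means a better model. The sketch below computes this by hand in plain Python. The numbers are made up, and the helper is an illustration rather than scikit-learn's own function.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [100.0, 50.0, 20.0]  # made-up payments
y_pred = [110.0, 45.0, 20.0]  # made-up predictions

mae = mean_absolute_error(y_true, y_pred)
score = -mae  # the negated value is what a "neg_" scorer reports

print(mae, score)
```

A reported score of -29.1 therefore corresponds to an average error of about 29.1, in the units of the target (thousands of Swedish Kronor here).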

In [None]:
# perform the search
model.fit(X, y)
# export the best model
model.export('tpot_insurance_best_model.py')



Generation 1 - Current best internal CV score: -29.147625969129034

Generation 2 - Current best internal CV score: -29.139635546185936

Generation 3 - Current best internal CV score: -29.139635546185936

Generation 4 - Current best internal CV score: -29.099695854803613

Generation 5 - Current best internal CV score: -29.099695854803613

Best pipeline: RidgeCV(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False))
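With a single input column, a degree-2 polynomial expansion without a bias term turns each value x into the pair [x, x²]. The ridge model then fits a regularized linear model on those two columns, which amounts to a quadratic curve in the number of claims. The helper below is a hand-rolled illustration of that expansion for one feature. Its name is hypothetical and it is not scikit-learn's code.

```python
def polynomial_features_1d(x, degree=2):
    """Expand one feature into [x, x**2, ..., x**degree], with no bias column."""
    return [x ** power for power in range(1, degree + 1)]


claims = [13.0, 19.0, 40.0]  # claim counts taken from the data preview
expanded = [polynomial_features_1d(x) for x in claims]
print(expanded)
```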


---
### 5. Conclusions
---

- **TPOT** is an **open-source** library for **AutoML** that builds on **scikit-learn** data preparation and **machine learning** models.

- **TPOT** automatically searches for **top-performing** model pipelines for **classification** tasks, as shown with the sonar dataset.

- **TPOT** automatically searches for **top-performing** model pipelines for **regression** tasks, as shown with the auto insurance dataset.