# Automated ML Pipeline Generator using TPOT in Python
---
<br/>

### What is TPOT?
- TPOT stands for Tree-Based Pipeline Optimization Tool.
- It is a tool that optimizes machine learning pipelines using genetic programming.
- It explores thousands of possible pipelines to find the best one for your data.
- Once TPOT has finished searching (or you get tired of waiting), it provides the Python code for the best pipeline it found, so you can tinker with the pipeline from there.
- TPOT uses the Python-based scikit-learn library as its ML menu.
- A pipeline is an independently executable workflow of a complete machine learning task.
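
To make the idea of a pipeline concrete, here is a minimal sketch of a hand-written scikit-learn pipeline of the kind TPOT searches over. The synthetic data and the choice of `StandardScaler` followed by `Ridge` are assumptions made purely for illustration; they are not what TPOT selects for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# A two-step pipeline: scale the features, then fit a ridge regression
demo_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
demo_pipeline.fit(X_demo, y_demo)

# The fitted pipeline applies every step in order when predicting
demo_predictions = demo_pipeline.predict(X_demo[:3])
print(demo_predictions)
```

TPOT automates the choice of steps and hyperparameters that are fixed by hand in this sketch.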

### To install:
    pip install tpot

### Dependencies:
- scikit-learn
- numpy

___

In [1]:
# import libraries
import numpy as np
import pandas as pd

In [2]:
data_url = "https://raw.githubusercontent.com/20b2122/AutoML-using-TPOT-in-python/main/Housing%20Modified/Housing_Modified.csv"

In [3]:
df = pd.read_csv(data_url)

In [4]:
df.head()

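| | price | lotsize | bedrooms | bathrms | stories | driveway | recroom | fullbase | gashw | airco | garagepl | prefarea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42000.0 | 5850 | 3 | 1 | two | yes | no | yes | no | no | 1 | no |
| 1 | 38500.0 | 4000 | 2 | 1 | one | yes | no | no | no | no | 0 | no |
| 2 | 49500.0 | 3060 | 3 | 1 | one | yes | no | no | no | no | 0 | no |
| 3 | 60500.0 | 6650 | 3 | 1 | two | yes | yes | no | no | no | 0 | no |
| 4 | 61000.0 | 6360 | 2 | 1 | one | yes | no | no | no | no | 0 | no |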


In [5]:
# Checking for missing data
df.isnull().sum()

price       0
lotsize     0
bedrooms    0
bathrms     0
stories     0
driveway    0
recroom     0
fullbase    0
gashw       0
airco       0
garagepl    0
prefarea    0
dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   price     546 non-null    float64
 1   lotsize   546 non-null    int64  
 2   bedrooms  546 non-null    int64  
 3   bathrms   546 non-null    int64  
 4   stories   546 non-null    object 
 5   driveway  546 non-null    object 
 6   recroom   546 non-null    object 
 7   fullbase  546 non-null    object 
 8   gashw     546 non-null    object 
 9   airco     546 non-null    object 
 10  garagepl  546 non-null    int64  
 11  prefarea  546 non-null    object 
dtypes: float64(1), int64(4), object(7)
memory usage: 51.3+ KB


### Convert the categorical columns (stories, driveway, recroom, fullbase, gashw, airco, prefarea) to numerical

In [7]:
df['stories'] = df['stories'].map({'one': 0, 'two': 1, 'three': 2, 'four': 3})
df['driveway'] = df['driveway'].map({'yes': 1, 'no': 0})
df['recroom'] = df['recroom'].map({'yes': 1, 'no': 0})
df['fullbase'] = df['fullbase'].map({'yes': 1, 'no': 0})
df['gashw'] = df['gashw'].map({'yes': 1, 'no': 0})
df['airco'] = df['airco'].map({'yes': 1, 'no': 0})
df['prefarea'] = df['prefarea'].map({'yes': 1, 'no': 0})
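
Since six of the seven columns share the same yes/no mapping, the conversion can also be written as a loop. This is a sketch on a small stand-in frame (`demo_df`, with made-up rows and only three of the columns), not the housing data itself. It also checks for `NaN`, because `map()` turns any value missing from the dictionary into `NaN`.

```python
import pandas as pd

# Small stand-in frame; the column names match the housing data, the rows are made up
demo_df = pd.DataFrame({
    "stories": ["one", "two", "four"],
    "driveway": ["yes", "no", "yes"],
    "airco": ["no", "no", "yes"],
})

yes_no = {"yes": 1, "no": 0}
binary_cols = ["driveway", "airco"]

# Apply the same yes/no mapping to every binary column
for col in binary_cols:
    demo_df[col] = demo_df[col].map(yes_no)

# The stories column needs its own ordinal mapping
demo_df["stories"] = demo_df["stories"].map({"one": 0, "two": 1, "three": 2, "four": 3})

# Any value not covered by a mapping would show up here as NaN
unmapped = int(demo_df.isnull().sum().sum())
print(demo_df)
print("unmapped values:", unmapped)
```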

In [8]:
df.head()

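| | price | lotsize | bedrooms | bathrms | stories | driveway | recroom | fullbase | gashw | airco | garagepl | prefarea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42000.0 | 5850 | 3 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 38500.0 | 4000 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 49500.0 | 3060 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 60500.0 | 6650 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 61000.0 | 6360 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |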


In [9]:
df.info() # double-check that all of the categorical columns have been converted

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   price     546 non-null    float64
 1   lotsize   546 non-null    int64  
 2   bedrooms  546 non-null    int64  
 3   bathrms   546 non-null    int64  
 4   stories   546 non-null    int64  
 5   driveway  546 non-null    int64  
 6   recroom   546 non-null    int64  
 7   fullbase  546 non-null    int64  
 8   gashw     546 non-null    int64  
 9   airco     546 non-null    int64  
 10  garagepl  546 non-null    int64  
 11  prefarea  546 non-null    int64  
dtypes: float64(1), int64(11)
memory usage: 51.3 KB


### Splitting the data into input and output (price)

In [10]:
x = df.iloc[:,1:] # input
y = df.iloc[:,0] # output - price

---

## Individual Algorithms - mean cross-validation score of each algorithm
(Not related to TPOT; just for comparison.)

In [11]:
# import Machine Learning libraries

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

### Linear Regression

In [12]:
cv_scores = cross_val_score(LinearRegression(), x, y, cv=10)

In [13]:
print("cross-validation values:\n", cv_scores)
print("\nMean of cross-validation:", np.mean(cv_scores))

cross-validation values:
 [0.21094242 0.54816754 0.5500581  0.34203638 0.20110283 0.51019454
 0.55343713 0.43493564 0.57310562 0.55719404]

Mean of cross-validation: 0.4481174245202405
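
When no `scoring` argument is given, `cross_val_score` uses the estimator's own `score` method, which for scikit-learn regressors is $R^{2}$. The values above are therefore $R^{2}$ scores, one per fold. The sketch below checks this on synthetic data (an assumption for illustration; the housing data is not used here) by comparing the default against an explicit `scoring="r2"`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# Default scoring for a regressor versus explicitly requesting R^2
default_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=10)
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=10, scoring="r2")

print("same scores:", np.allclose(default_scores, r2_scores))
```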


### Random Forest Regressor

In [14]:
rf_cv_scores = cross_val_score(RandomForestRegressor(), x, y, cv=10)

In [15]:
print("Random forest cross-validation values:\n", rf_cv_scores)
print("\nMean of random forest cross-validation:", np.mean(rf_cv_scores))

Random forest cross-validation values:
 [ 0.12152875  0.47041379  0.42563448  0.27099781 -0.01524396  0.4942324
  0.42313633  0.45089039  0.52439603  0.3980166 ]

Mean of random forest cross-validation: 0.3564002606039119


___

## TPOT

In [16]:
import tpot
import time # to measure how long TPOT takes to finish executing



### Check the available methods and attributes in TPOT

Since price is a continuous target, this is a regression problem, so we will use TPOTRegressor.

In [17]:
dir(tpot)

['TPOTClassifier',
 'TPOTRegressor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_version',
 'base',
 'builtins',
 'config',
 'decorators',
 'driver',
 'export_utils',
 'gp_deap',
 'gp_types',
 'main',
 'metrics',
 'operator_utils',
 'tpot']

### Split the data in x and y into train and test

In [18]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.3, random_state=42)
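
As a quick check of what `test_size=0.3` means for 546 rows, the sketch below splits placeholder arrays of the same length. The zero-filled arrays are an assumption standing in for `x` and `y`; only the row counts matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the housing data (546)
X_demo = np.zeros((546, 11))
y_demo = np.zeros(546)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The test share is rounded up, and the remaining rows go to the training set
print(len(X_tr), len(X_te))
```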

In [19]:
from tpot import TPOTRegressor

# Initialise the regressor (note: the name `tpot` now shadows the `tpot` module imported earlier)
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)

**Generations**: Number of iterations to run the pipeline optimization process. TPOT generally works better when you give it more generations (and therefore more time) to optimize the pipeline. <br/>
<br/>

**Population Size**: Number of individuals to retain in the GP population every generation.<br/>
<br/>
TPOT will evaluate POPULATION_SIZE (50) + GENERATIONS (5) x OFFSPRING_SIZE (50) = 300 pipelines in total.
<br/>
By default, OFFSPRING_SIZE = POPULATION_SIZE<br/>
<br/>

**Verbosity**: How much information TPOT communicates while it is running. 0 = none, 1 = minimal, 2 = high, 3 = all.
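
The pipeline count quoted above can be reproduced with a few lines of arithmetic. The helper name `total_pipelines` is hypothetical and not part of TPOT; it only encodes the formula from the text.

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """Number of pipelines evaluated, following the formula in the text."""
    # By default the offspring size equals the population size
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


# Settings used in this notebook: generations=5, population_size=50
n_pipelines = total_pipelines(population_size=50, generations=5)
print(n_pipelines)  # 300
```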

In [20]:
start = time.time()

# Fit data
tpot.fit(x_train, y_train)

end = time.time()

Generation 1 - Current best internal CV score: -256142409.23601055
                                                                              
Generation 2 - Current best internal CV score: -255477109.2869607
                                                                              
Generation 3 - Current best internal CV score: -255477109.2869607
                                                                              
Generation 4 - Current best internal CV score: -255247210.33779487
                                                                              
Generation 5 - Current best internal CV score: -255247210.33779487
                                                                              
Best pipeline: RidgeCV(RobustScaler(RidgeCV(input_matrix)))


**RidgeCV**: Ridge regression with built-in cross-validation.
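
The "internal CV score" values in the log above are negative because the default scoring function of TPOTRegressor is negative mean squared error (`neg_mean_squared_error`), so a value closer to zero is better. A minimal sketch of converting the final score into a root mean squared error, which is in the same units as price:

```python
import math

# Best internal CV score reported after generation 5 (negative MSE)
best_cv_score = -255247210.33779487

# Flip the sign to get the MSE, then take the square root for the RMSE
cv_mse = -best_cv_score
cv_rmse = math.sqrt(cv_mse)

print(round(cv_rmse, 2))
```

The result is roughly 16,000, meaning a typical cross-validated error on that order in price units.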

In [21]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = tpot.predict(x_test)

print('\ntime: ',(end-start))
print('Mean Absolute Error:',mean_absolute_error(y_pred=y_pred, y_true=y_test))
print('Mean Squared Error:',mean_squared_error(y_pred=y_pred, y_true=y_test))
print('Coef of Determination, R2:',r2_score(y_pred=y_pred, y_true=y_test))


time:  219.2390353679657
Mean Absolute Error: 11465.894614513943
Mean Squared Error: 249800964.44662932
Coef of Determination, R2: 0.7066774696876976




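**MAE**: the average of the absolute residuals, that is, the absolute differences between the actual and predicted values in the dataset.

**MSE**: the average of the squared differences between the actual and predicted values in the dataset. It equals the variance of the residuals when their mean is zero.

**$R^{2}$**: the coefficient of determination, which represents how well the predicted values fit the actual values. A value of 1 is a perfect fit and 0 is no better than always predicting the mean; it can be negative for a model that does worse than that (as in one of the random forest folds above). Values between 0 and 1 are often interpreted as percentages. The higher the value, the better the model.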
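
To show how the three metrics relate to the residuals, here is a sketch that computes them by hand with NumPy. The four actual and predicted values are made-up toy numbers chosen so the arithmetic is easy to follow; they are not taken from the housing data.

```python
import numpy as np

# Toy values for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 9.0, 9.0])

residuals = y_true_demo - y_pred_demo          # [1, 0, -2, 0]

mae = np.mean(np.abs(residuals))               # average absolute residual
mse = np.mean(residuals ** 2)                  # average squared residual

ss_res = np.sum(residuals ** 2)                # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)  # 0.75 1.25 0.75
```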

In [22]:
# Export the result
tpot.export('TPOTRegressor_ml_pipeline.py')

#### The 'TPOTRegressor_ml_pipeline.py' file will contain the Python code for the optimized pipeline.

In [23]:
import IPython

IPython.display.Code('https://raw.githubusercontent.com/20b2122/AutoML-using-TPOT-in-python/main/Housing%20Modified/TPOTRegressor_ml_pipeline.py')

___

### Prediction using TPOT

- Preparing the predicted data (y_pred) to be shown in a table alongside the x_test and y_test data

In [24]:
# sort and reset the index so that the rows stay aligned once the tables are joined
sort_x = x_test.sort_index()
new_x = sort_x.reset_index(drop=True)

sort_y = y_test.sort_index()
new_y = sort_y.reset_index(drop=True)

# converting the array into list then into a table
y_pred = tpot.predict(new_x).tolist() 
predictions = pd.DataFrame({ 'pred_price':y_pred }) 

pd.concat([new_x, new_y, predictions], axis=1).head(10)



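| | lotsize | bedrooms | bathrms | stories | driveway | recroom | fullbase | gashw | airco | garagepl | prefarea | price | pred_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5850 | 3 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 42000.0 | 68587.122255 |
| 1 | 3060 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 49500.0 | 39902.158331 |
| 2 | 3880 | 3 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 66000.0 | 77795.862134 |
| 3 | 5500 | 3 | 2 | 3 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 88500.0 | 102260.31077 |
| 4 | 7200 | 3 | 2 | 0 | 1 | 0 | 1 | 0 | 1 | 3 | 0 | 90000.0 | 99327.121683 |
| 5 | 3000 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30500.0 | 32427.442684 |
| 6 | 3185 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 37900.0 | 50184.900537 |
| 7 | 3450 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 45000.0 | 38511.579286 |
| 8 | 3986 | 2 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 45000.0 | 64184.306178 |
| 9 | 4000 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 37900.0 | 61430.062526 |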


Some of the predicted prices are not far off from the actual prices. For example, at index 5 the actual price is 30500.0 and the predicted price is 32427.44, a residual of only -1,927.44. <br/><br/> But of course there are also samples that are far off from the actual price: at index 9, the residual is -23,530.06.
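
The residuals can be recomputed directly from the table above as actual price minus predicted price. The sketch below copies the values for rows 5 and 9 by hand; the dictionary layout is just an illustrative choice.

```python
# (actual price, predicted price) for rows 5 and 9 of the table above
rows = {
    5: (30500.0, 32427.442684),
    9: (37900.0, 61430.062526),
}

# Residual = actual - predicted; a negative value means the model over-predicted
row_residuals = {idx: actual - predicted for idx, (actual, predicted) in rows.items()}

for idx, residual in row_residuals.items():
    print(f"index {idx}: residual = {residual:,.2f}")
```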