## Using TPOT to fit optimized machine learning models in Python

In this tutorial session we will explore how to use the package [TPOT](https://epistasislab.github.io/tpot/) to automatically choose the best machine learning algorithm and the optimal parameterization for that algorithm. We will show examples for classification and regression (in the machine learning sense, not the classic statistics sense).

Let's dig right into our first example. First, we will import the necessary packages:

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

Let's look at a first example using a built-in dataset called `digits`. You can learn more about it at https://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html.

In [2]:
digits = load_digits()

X_train, X_test, Y_train, Y_test = train_test_split(digits.data,digits.target, train_size = 0.7, test_size = 0.3, random_state = 79)

print(X_train.shape)
print(X_test.shape)
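To make the structure of `digits` a bit more concrete, here is a small sketch that inspects the dataset on its own. It only uses attributes that `load_digits()` returns (`images`, `data`, `target_names`); the variable name `first_flat` is ours.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each sample is an 8x8 grayscale image...
print(digits.images.shape)
# ...and `data` holds the same images flattened into 64 features
print(digits.data.shape)

# Flattening an image by hand reproduces the corresponding row of `data`
first_flat = digits.images[0].reshape(-1)
print((first_flat == digits.data[0]).all())

# The ten target classes are the digits 0 to 9
print(digits.target_names)
```

So every row of `X_train` and `X_test` is one flattened image, and the matching entry of `Y_train` or `Y_test` is the digit it shows.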

Let's first fit a regular Random Forest model, to see what accuracy we get.

In [31]:
rf_mod = RandomForestClassifier(n_estimators=100, random_state = 79)

rf_mod.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=79, verbose=0,
                       warm_start=False)

In [32]:
rf_pred = rf_mod.predict(X_test)

round(metrics.accuracy_score(Y_test,rf_pred),3)

0.978

Pretty good. What would TPOT give us?

In [5]:
from tpot import TPOTClassifier

In [None]:
tpot_mod = TPOTClassifier(generations = 5, population_size = 100, verbosity = 2, n_jobs = 11, random_state = 79)

tpot_mod.fit(X_train,Y_train)

Generation 1 - Current best internal CV score: 0.9562909048914667
Generation 2 - Current best internal CV score: 0.9658404886184762
Generation 3 - Current best internal CV score: 0.9666604190673738
Generation 4 - Current best internal CV score: 0.9666604190673738
Generation 5 - Current best internal CV score: 0.9674410934291178
Generation 6 - Current best internal CV score: 0.9769783543070867
Generation 7 - Current best internal CV score: 0.9769783543070867
Generation 8 - Current best internal CV score: 0.9769783543070867


In [11]:
tp_pred = tpot_mod.predict(X_test)

round(metrics.accuracy_score(Y_test,tp_pred),3)

0.985

We get a test accuracy of 0.985 (roughly 0.99) after only a handful of generations. If we let TPOT run for longer, it could do better still.
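Note that the scores in the TPOT log are internal cross-validation scores computed on the training data, while the accuracies we printed come from the held-out test set. To compare like with like, we can cross-validate the Random Forest on the training data too. This is only a sketch: the choice of five folds is our assumption (we believe it matches TPOT's default), and `cv_scores` is a name we made up.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

digits = load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, train_size=0.7, test_size=0.3, random_state=79
)

rf_mod = RandomForestClassifier(n_estimators=100, random_state=79)

# 5-fold cross-validated accuracy, computed on the training data only
cv_scores = cross_val_score(rf_mod, X_train, Y_train, cv=5, scoring="accuracy")

print(cv_scores.round(3))          # one accuracy per fold
print(round(cv_scores.mean(), 3))  # comparable to TPOT's internal CV score
```

The mean of `cv_scores` is the number to set against the "Current best internal CV score" lines, while the test set accuracy remains the final check for both models.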

But what do we do with it then? Well, TPOT is nice enough to output its "findings" as a ready-to-run Python script:

In [None]:
tpot_mod.export('tpot_exported_pipeline.py')

The authors of TPOT suggest that you treat this as a suggested pipeline, and do further fine parameter tuning with grid search to eke out the last bits of accuracy.
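As a rough illustration of that fine-tuning step, the sketch below runs a small grid search with `scikit-learn`. Since the pipeline TPOT exports depends on the run, we use a plain `RandomForestClassifier` as a stand-in for it. Both the stand-in and the parameter grid are our assumptions, not something TPOT produced.

```python
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

digits = load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, train_size=0.7, test_size=0.3, random_state=79
)

# Stand-in for the final estimator of an exported pipeline (an assumption)
base_model = RandomForestClassifier(random_state=79)

# Small illustrative grid; these values are not taken from TPOT
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", "log2"],
}

# Try every combination with 3-fold cross-validation on the training data
grid = GridSearchCV(base_model, param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, Y_train)

print(grid.best_params_)
print(round(grid.best_score_, 3))

# By default GridSearchCV refits the best model on all the training data
grid_pred = grid.predict(X_test)
grid_acc = metrics.accuracy_score(Y_test, grid_pred)
print(round(grid_acc, 3))
```

In practice you would replace `base_model` with the pipeline from the exported script and build the grid around the parameter values TPOT settled on.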

How would TPOT do with a sparser and more complex dataset? Let's look at some data from Dhemerson Conciani, a master's student who is developing ML algorithms to classify burn scars in the State of São Paulo, Brazil. The original dataset has almost one million labelled pixels, but for this exercise we will use a subset of 1000 pixels.

We read the data using `pandas`, which provides the Python counterpart of R's `data.frame`.

In [None]:
import pandas as pd
import os

In [None]:
# Set project folder as working folder
os.chdir('/home/thiago/Projects/TPOT_tutorial/')

# Read in csv data using pandas; the first row is used as the column names
burn = pd.read_csv('./data/burn_scar_mapping/burn_scars_1000.csv')

Once the data is in, it is always a good idea to inspect the data to make sure it looks alright:

In [None]:
burn.head()


In [None]:
burn.tail()

In [None]:
burn.shape

In [None]:
burn.dtypes

We can now start pre-processing the data for ML as usual. Unlike R, Python ML algorithms want headerless numeric arrays for all predictor variables, and a separate numeric array for the dependent variable. They do not recognize "factor" variables like R does. So we need to encode the categorical variables as numeric, including the response variable.

In the code below, we select only the columns we need for the model, and find out which columns are categorical.

In [None]:
# Drop columns we don't want
vars_df = burn.drop(columns = ['Scene','Date'])

# Make a boolean mask that flags the categorical (object dtype) columns
categorical_feature_mask = vars_df.dtypes == object
print(categorical_feature_mask)

# Filter the categorical columns using the mask and turn them into a list
categorical_cols = vars_df.columns[categorical_feature_mask].tolist()
print(categorical_cols)
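If you prefer not to build the mask by hand, `pandas` can also select columns by type directly. The sketch below shows this on a tiny made-up table; the column names and values are hypothetical and only stand in for the burn scar data.

```python
import pandas as pd

# Hypothetical stand-in for the burn scar table; names and values are made up
toy = pd.DataFrame({
    "Class": ["burned", "unburned", "burned"],
    "Sensor": ["OLI", "TM", "OLI"],
    "NBR": [0.12, 0.55, 0.08],
    "NDVI": [0.20, 0.71, 0.15],
})

# Non-numeric columns are the ones that will need encoding
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

Selecting "everything that is not a number" also catches text columns that a given `pandas` version stores with a string type rather than `object`.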

We then use the encoding function of `scikit-learn`. You can think of this as a simple model that transforms labels into digits, so the process is similar to actually fitting a model.

In [None]:
from sklearn.preprocessing import LabelEncoder

# instantiate the Encoder, as you would with a Classifier or a Regressor
le = LabelEncoder()

# Empty dataframe to save labels
labels = pd.DataFrame()

# Apply the encoder on categorical feature columns
for col in categorical_cols:
    vars_df[col] = le.fit_transform(vars_df[col])
    cl = list(le.classes_)
    labels = pd.concat([labels, pd.DataFrame({col: cl})], sort = False)

print(labels)
vars_df.head(10)
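The loop above reuses a single encoder, so after it finishes `le` only remembers the labels of the last column. If you want to translate predictions back into the original labels later, one option is to keep a separate encoder per column. Here is a minimal sketch on made-up data; the `encoders` dictionary and the toy values are our own illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns; names and values are made up
toy = pd.DataFrame({
    "Class": ["burned", "unburned", "burned", "water"],
    "Sensor": ["TM", "OLI", "OLI", "TM"],
})

# Keep one fitted encoder per column so each mapping can be reversed later
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)

# Codes follow the sorted order of the labels
print(encoders["Class"].classes_.tolist())

# Turn the numeric codes back into the original labels
decoded = encoders["Class"].inverse_transform(toy["Class"])
print(decoded.tolist())
```

The same `inverse_transform` call works on the output of `predict()`, as long as you use the encoder that was fitted on the response column.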

The next step is to separate the X and Y variables as individual arrays. 

In [None]:
X = vars_df.iloc[:,1:].values
Y = vars_df.iloc[:,0].values

print(X[0:3,0:3])
print(Y[0:3])
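Slicing by position works as long as the response really is the first column. A slightly more defensive variant selects it by name instead. This is a sketch on a made-up table, and the column name `Class` is an assumption about how the response is labelled.

```python
import pandas as pd

# Hypothetical, already encoded table; names and values are made up
toy = pd.DataFrame({
    "Class": [0, 1, 0],
    "NBR": [0.12, 0.55, 0.08],
    "NDVI": [0.20, 0.71, 0.15],
})

response = "Class"  # assumed name of the response column

# Everything except the response becomes the predictor array
X = toy.drop(columns=[response]).values
Y = toy[response].values

print(X.shape)
print(Y.shape)
```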

And then we split the data into training and testing, using `scikit-learn`.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size = 0.7, test_size = 0.3, random_state = 79)

We can now fit our Random Forests and TPOT models to this dataset:

In [None]:
rf_mod.fit(X_train,Y_train)
Y_pred = rf_mod.predict(X_test)

# accuracy_score lives in the metrics module we imported earlier
metrics.accuracy_score(Y_test,Y_pred)

In [None]:
tpot_mod.fit(X_train,Y_train)

What if we give TPOT more time (more generations) to search? Here the population size stays at 100.

In [None]:
tpot_mod = TPOTClassifier(generations = 10, population_size = 100, verbosity = 2, n_jobs = 11, random_state = 79)

tpot_mod.fit(X_train,Y_train)

How about a regression problem? Let's bring in another dataset, this time on the size of burned areas rather than just whether the land burned.

In [None]:
burn2 = pd.read_csv('./data/forest_fires/forestfires.csv')
print(burn2.shape)
print(burn2.head())

In this case, `X` and `Y` are spatial locations, and the last column, `area`, is the response variable. Let's set it up.

In [None]:
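# Boolean mask that flags the categorical (object dtype) columns
cat_mask = burn2.dtypes == object
# Filter the categorical columns using the mask and turn them into a list
cat_cols = burn2.columns[cat_mask].tolist()

# Empty dataframe to save labels
labels2 = pd.DataFrame()

# Encode each categorical column, keeping track of the original labels
for col in cat_cols:
    burn2[col] = le.fit_transform(burn2[col])
    cl = list(le.classes_)
    labels2 = pd.concat([labels2, pd.DataFrame({col: cl})], sort=False)

print(labels2)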
burn2.head(10)

In [None]:
X_vars = burn2.iloc[:,:-1].values
print(X_vars.shape)
Y_var = burn2.iloc[:,-1].values
print(Y_var.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X_vars,Y_var, train_size=0.7, test_size=0.3, random_state = 79)


Using Random Forests:

In [None]:
from sklearn.ensemble import RandomForestRegressor
import math

rf_reg = RandomForestRegressor(n_estimators=100, random_state = 79)

rf_reg.fit(X_train,Y_train)

from sklearn.metrics import mean_squared_error

Y_pred = rf_reg.predict(X_test)

print(math.sqrt(mean_squared_error(Y_test,Y_pred)))
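The number printed above is the root mean squared error (RMSE): the square root of the average squared difference between observed and predicted values, in the same units as the response. The small worked example below uses made-up numbers to show that the `scikit-learn` call and the formula agree. It also computes a naive baseline that always predicts the mean, which is our own addition to help judge whether an RMSE is any good.

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed and predicted values
y_true = np.array([0.0, 2.0, 4.0])
y_hat = np.array([1.0, 2.0, 2.0])

# RMSE via scikit-learn, as in the tutorial
rmse_sklearn = math.sqrt(mean_squared_error(y_true, y_hat))

# RMSE by hand: errors are -1, 0, 2, so the mean squared error is 5/3
rmse_manual = math.sqrt(np.mean((y_true - y_hat) ** 2))

print(round(rmse_sklearn, 3), round(rmse_manual, 3))

# Naive baseline: always predict the mean of the observed values
# (with real data, use the mean of the training response instead)
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = math.sqrt(mean_squared_error(y_true, baseline))
print(round(rmse_baseline, 3))
```

A model is only useful if its RMSE is clearly below that of such a baseline.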

Using TPOT:

In [None]:
from tpot import TPOTRegressor

In [None]:
tp_reg = TPOTRegressor(generations=5, population_size=20, verbosity=2, random_state = 79)
tp_reg.fit(X_train,Y_train)

Y_pred = tp_reg.predict(X_test)

print(math.sqrt(mean_squared_error(Y_test,Y_pred)))

In [None]:
tp_reg = TPOTRegressor(generations=10, population_size=50, verbosity=2, random_state = 79)
tp_reg.fit(X_train,Y_train)

Y_pred = tp_reg.predict(X_test)

print(math.sqrt(mean_squared_error(Y_test,Y_pred)))