# TPOT usage examples

You can find the talk slides here -> https://slides.com/j-diegohueltesvega/data-science-lazy-people/

This notebook has some examples and ideas about how to use TPOT.
To install TPOT you can follow this guide -> http://rhiever.github.io/tpot/installing/.
I also recommend installing xgboost, which is optional.
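
Since xgboost is optional, a quick check like the sketch below can tell you whether it's importable before you run the examples that rely on it. It only uses the standard library, and `is_installed` is a helper name I made up for this note, not something TPOT provides.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report which of the packages used in this notebook are available
for name in ("tpot", "xgboost"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```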

## Example 1, basic example

In [None]:
from tpot import TPOTClassifier, TPOTRegressor
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

In [None]:
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.8, test_size=0.2)

In [None]:
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

In [None]:
tpot.export('tpot_exported_pipeline.py')

## Example 2, titanic predictions with zero data cleaning

This example does no data cleaning beyond label encoding, in order to show how TPOT works without any help.
The function below performs that basic "data cleaning" on both the train set and the test set.

In [None]:
def load_and_labelize_titanic(filename, encoders=None):
    """Read a CSV file and perform basic label encoding"""
    
    df = pd.read_csv(filename)
    if not encoders:
        encoders = {'Sex': LabelEncoder(), 
                    'Cabin': LabelEncoder(), 
                    'Embarked': LabelEncoder()}
        for column, encoder in encoders.items():
            encoder.fit(list(df[column].astype(str)) + ['UnknownLabel'])
            df[column] = encoder.transform(df[column].astype(str))
    else:
        for column, encoder in encoders.items():
            # Cast to str first (as in fit) so NaN is matched against the fitted 'nan' label
            values = df[column].astype(str)
            values = values.where(values.isin(encoder.classes_), 'UnknownLabel')
            df[column] = encoder.transform(values)
        
    df = df.fillna(-999)
    passenger_ids = df['PassengerId']
    df = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
    return df, encoders, passenger_ids


In [None]:
train, encoders, _ = load_and_labelize_titanic('titanic/train.csv')
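
To see what the `'UnknownLabel'` fallback does, here is a minimal sketch with made-up values (not taken from the Titanic files). The encoder is fitted on the training values plus the extra `'UnknownLabel'` class, so any value that only shows up later can be mapped to that class instead of making `transform` fail.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values, invented for illustration
train_values = ['S', 'C', 'Q', 'S']
test_values = ['S', 'Z']  # 'Z' never appears in the training values

# Fit on the training values plus a catch-all class
encoder = LabelEncoder()
encoder.fit(train_values + ['UnknownLabel'])

# Replace unseen values with the catch-all class before transforming
known = set(encoder.classes_)
safe_test_values = [v if v in known else 'UnknownLabel' for v in test_values]
encoded = encoder.transform(safe_test_values)

print(list(encoder.classes_))
print(list(encoder.inverse_transform(encoded)))
```

Without the replacement step, `encoder.transform(['Z'])` would raise a `ValueError` because `'Z'` was never seen during `fit`.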

I recommend playing with the number of generations and the population size. Both have a direct impact on the optimization time.
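
To get a feeling for that impact, here is a rough back-of-the-envelope sketch. As far as I know, the TPOT documentation gives the total number of evaluated pipelines as `population_size + generations * offspring_size`, with `offspring_size` defaulting to `population_size`. I'm also assuming every pipeline is fitted once per cross-validation fold. Both function names are made up for this note, and the numbers are only an estimate.

```python
def estimated_pipeline_evaluations(generations, population_size, offspring_size=None):
    """Rough count of pipelines evaluated during one optimization run."""
    if offspring_size is None:
        # Assumption: offspring_size defaults to population_size
        offspring_size = population_size
    return population_size + generations * offspring_size


def estimated_model_fits(generations, population_size, cv, offspring_size=None):
    """Rough count of pipeline fits, assuming one fit per CV fold."""
    pipelines = estimated_pipeline_evaluations(generations, population_size, offspring_size)
    return pipelines * cv


# Settings used in this example: 5 generations, population of 50, cv=10
print(estimated_pipeline_evaluations(5, 50))
print(estimated_model_fits(5, 50, cv=10))
```

With these settings that works out to 300 pipelines and about 3000 fits, so doubling either the generations or the population size roughly doubles the work.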

In [None]:
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1,
                      scoring='accuracy', cv=10)
tpot.fit(train.drop('Survived', axis=1), train['Survived'])

We're going to use the same function with the test set, providing the encoders in order to transform the data in the same way. The function also returns the list of passenger ids to be used with the prediction results.

In [None]:
test, _, passenger_ids = load_and_labelize_titanic('titanic/test.csv', encoders)
results = tpot.predict(test)
results_df = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': results})
results_df.to_csv('titanic/predictions.csv', index=False)

The cell below comes from the export of an optimized pipeline.
You don't need to export in order to predict, because you can use the fitted TPOT optimizer instance directly, but you can also export if you want to persist the pipeline.

In [None]:
from copy import copy
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBClassifier

exported_pipeline = make_pipeline(
    make_union(VotingClassifier([("est", RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.8500000000000001, min_samples_leaf=4, min_samples_split=13, n_estimators=100))]), FunctionTransformer(copy)),
    XGBClassifier(learning_rate=0.5, max_depth=6, min_child_weight=20, nthread=1, subsample=0.9000000000000001)
)
exported_pipeline.fit(train.drop('Survived', axis=1), train['Survived'])
results = exported_pipeline.predict(test)

results_df = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': results})
results_df.to_csv('titanic/predictions.csv', index=False)

## Example 3, Titanic predictions without data cleaning and with custom config dict

In this example, I want the optimizer to choose between the RF classifier and the XGB classifier, and I have also fixed the parameters of both classifiers. With that done, the optimizer is expected to spend more of its mutations on preprocessors and feature selectors.

In [None]:
tpot = TPOTClassifier(generations=20, population_size=100, verbosity=2, n_jobs=-1,
                      scoring='accuracy', cv=4, config_dict={
        'sklearn.ensemble.RandomForestClassifier': {
            'n_estimators': [100],
            'criterion': ["entropy"],
            'max_features': [0.85],
            'min_samples_split': [13],
            'min_samples_leaf': [4],
            'bootstrap': [True]
        },

        'xgboost.XGBClassifier': {
            'n_estimators': [100],
            'max_depth': [6],
            'learning_rate': [0.5],
            'subsample': [0.9],
            'min_child_weight': [20],
            'nthread': [1]
        },

        # Preprocessors
        'sklearn.preprocessing.Binarizer': {
            'threshold': np.arange(0.0, 1.01, 0.05)
        },

        'sklearn.decomposition.FastICA': {
            'tol': np.arange(0.0, 1.01, 0.05)
        },

        'sklearn.cluster.FeatureAgglomeration': {
            'linkage': ['ward', 'complete', 'average'],
            'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine', 'precomputed']
        },

        'sklearn.preprocessing.MaxAbsScaler': {
        },

        'sklearn.preprocessing.MinMaxScaler': {
        },

        'sklearn.preprocessing.Normalizer': {
            'norm': ['l1', 'l2', 'max']
        },

        'sklearn.decomposition.PCA': {
            'svd_solver': ['randomized'],
            'iterated_power': range(1, 11)
        },

        'sklearn.preprocessing.PolynomialFeatures': {
            'degree': [2],
            'include_bias': [False],
            'interaction_only': [False]
        },

        'sklearn.kernel_approximation.RBFSampler': {
            'gamma': np.arange(0.0, 1.01, 0.05)
        },

        'sklearn.preprocessing.RobustScaler': {
        },

        'sklearn.preprocessing.StandardScaler': {
        },

        'tpot.built_in_operators.ZeroCount': {
        },

        # Selectors
        'sklearn.feature_selection.SelectFwe': {
            'alpha': np.arange(0, 0.05, 0.001),
            'score_func': {
                'sklearn.feature_selection.f_classif': None
            }  # read from dependencies ! need add an exception in preprocess_args

        },

        'sklearn.feature_selection.SelectKBest': {
            'k': range(1, 100),  # need check range!
            'score_func': {
                'sklearn.feature_selection.f_classif': None
            }
        },

        'sklearn.feature_selection.SelectPercentile': {
            'percentile': range(1, 100),
            'score_func': {
                'sklearn.feature_selection.f_classif': None
            }
        },

        'sklearn.feature_selection.VarianceThreshold': {
            'threshold': np.arange(0.05, 1.01, 0.05)
        },

        'sklearn.feature_selection.RFE': {
            'step': np.arange(0.05, 1.01, 0.05),
            'estimator': {
                'sklearn.ensemble.ExtraTreesClassifier': {
                    'n_estimators': [100],
                    'criterion': ['gini', 'entropy'],
                    'max_features': np.arange(0.05, 1.01, 0.05)
                }
            }
        },

        'sklearn.feature_selection.SelectFromModel': {
            'threshold': np.arange(0, 1.01, 0.05),
            'estimator': {
                'sklearn.ensemble.ExtraTreesClassifier': {
                    'n_estimators': [100],
                    'criterion': ['gini', 'entropy'],
                    'max_features': np.arange(0.05, 1.01, 0.05)
                }
            }
        }

    }
                      )
tpot.fit(train.drop('Survived', axis=1), train['Survived'])


In [None]:
test, _, passenger_ids = load_and_labelize_titanic('titanic/test.csv', encoders)
results = tpot.predict(test)
pd.DataFrame({'PassengerId': passenger_ids, 'Survived': results}).to_csv('titanic/predictions.csv', index=False)
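
To make the idea of "fixing" the classifiers a bit more concrete, here is a small sketch that counts how many parameter combinations each operator entry allows in a config dict shaped like the one above. `count_combinations` and `toy_config` are made up for this note, and the way nested dicts are interpreted (an operator name mapped to its own parameters, or to `None`) is my assumption based on the structure used in this example.

```python
def count_combinations(params):
    """Count the parameter combinations allowed for one operator entry."""
    total = 1
    for values in params.values():
        if isinstance(values, dict):
            # Nested entry such as 'estimator' or 'score_func': each key is an
            # operator name mapped to its own params (or None for no params)
            total *= sum(count_combinations(sub or {}) for sub in values.values())
        else:
            total *= len(values)
    return total


# A trimmed-down config with the same shape as the one used in this example
toy_config = {
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100],
        'criterion': ['entropy'],
    },
    'sklearn.preprocessing.Normalizer': {
        'norm': ['l1', 'l2', 'max'],
    },
    'sklearn.feature_selection.SelectPercentile': {
        'percentile': range(1, 100),
        'score_func': {'sklearn.feature_selection.f_classif': None},
    },
}

for name, params in toy_config.items():
    print(name, count_combinations(params))
```

A classifier whose parameters are all single-value lists has exactly one configuration, so the only choices left to vary are the ones in the preprocessors and selectors.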

## Example 4, house prices regression

Let's define our custom error function. It's important that the word "error" appears in the function name. That way, TPOT knows that it should minimize the value of the function.

In [None]:
def rmserror_log(predictions, targets):
    return np.sqrt(((np.log(predictions) - np.log(targets)) ** 2).mean())
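
Here is a quick worked example of what this error measures, using made-up prices. The function is repeated so the block runs on its own. Because it compares logarithms, the error depends on the ratio between prediction and target rather than on the absolute difference, and swapping the two arguments gives the same value.

```python
import numpy as np


def rmserror_log(predictions, targets):
    return np.sqrt(((np.log(predictions) - np.log(targets)) ** 2).mean())


# Made-up sale prices, for illustration only
targets = np.array([100000.0, 200000.0])

# Perfect predictions give zero error
perfect = rmserror_log(targets, targets)

# Predicting 10% too high on every house gives log(1.1), whatever the price
off_by_10pct = rmserror_log(targets * 1.1, targets)

# The squared difference makes the argument order irrelevant
swapped = rmserror_log(targets, targets * 1.1)

print(perfect, off_by_10pct, swapped)
```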

As we did in the last example, we're going to do zero data cleaning. We're just label-encoding the string columns, and this time we even do it blindly, simply iterating over the columns.

In [None]:
def load_and_clean_houses(filename, encoders=None):
    df = pd.read_csv(filename)
    if not encoders:
        encoders = {column: LabelEncoder()
                   for column, column_type in df.dtypes.items() 
                   if str(column_type) == 'object'}
        for column, encoder in encoders.items():
            encoder.fit(list(df[column].astype(str)) + ['UnknownLabel'])
            df[column] = encoder.transform(df[column].astype(str))
    else:
        for column, encoder in encoders.items():
            # Cast to str first (as in fit) so NaN is matched against the fitted 'nan' label
            values = df[column].astype(str)
            values = values.where(values.isin(encoder.classes_), 'UnknownLabel')
            df[column] = encoder.transform(values)
    
    df = df.fillna(-999)
    ids = df['Id']
    df = df.drop(['Id'], axis=1)
    return df, encoders, ids

In [None]:
train, encoders, _ = load_and_clean_houses('houses/train.csv')

In [None]:
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, 
                     n_jobs=-1, scoring=rmserror_log)
tpot.fit(train.drop('SalePrice', axis=1), train['SalePrice'])

In [None]:
test, _, ids = load_and_clean_houses('houses/test.csv', encoders)

results = tpot.predict(test)
result_df = pd.DataFrame({'Id': ids, 'SalePrice': results})
result_df.to_csv('houses/predictions.csv', index=False)

Here we have an example of an exported pipeline.

In [None]:
from copy import copy
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import ElasticNetCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBRegressor

exported_pipeline = make_pipeline(
    make_union(VotingClassifier([("est", RidgeCV())]), FunctionTransformer(copy)),
    make_union(VotingClassifier([("est", ElasticNetCV(l1_ratio=0.4, tol=0.0001))]), FunctionTransformer(copy)),
    XGBRegressor(max_depth=4, min_child_weight=1, nthread=1, subsample=0.9000000000000001)
)
exported_pipeline.fit(train.drop('SalePrice', axis=1), train['SalePrice'])
results = exported_pipeline.predict(test)

result_df = pd.DataFrame({'Id': ids, 'SalePrice': results})
result_df.to_csv('houses/predictions.csv', index=False)