# TPOT

#### Author's description:

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

#### Useful links:

[git](https://github.com/EpistasisLab/tpot),
[documentation](http://epistasislab.github.io/tpot/),
[installation](http://epistasislab.github.io/tpot/installing/),
[examples](http://epistasislab.github.io/tpot/examples/)

## Install and import

Note that we use the `subprocess` module instead of the Jupyter **!** syntax for running shell commands. Domino can run these notebooks as [jobs](https://support.dominodatalab.com/hc/en-us/articles/360023696651-Jobs) (batch or scheduled), which turns your IPython notebook into an executable script. All you have to do is ensure the code can also be executed as a .py file.

In [1]:
import subprocess
import sys

#install with the same interpreter that runs this notebook
completed = subprocess.run([sys.executable, '-m', 'pip', 'install', 'tpot'],
                           stdout=subprocess.PIPE)
print(completed.stdout.decode('utf-8'))

Collecting tpot
  Downloading TPOT-0.11.1-py3-none-any.whl (75 kB)
Collecting update-checker>=0.16
  Downloading update_checker-0.16-py2.py3-none-any.whl (7.6 kB)
Collecting deap>=1.2
  Downloading deap-1.3.1-cp36-cp36m-manylinux2010_x86_64.whl (157 kB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
Collecting tqdm>=4.36.1
  Downloading tqdm-4.42.1-py2.py3-none-any.whl (59 kB)
Collecting scikit-learn>=0.22.0
  Downloading scikit_learn-0.22.1-cp36-cp36m-manylinux1_x86_64.whl (7.0 MB)
Collecting joblib>=0.13.2
  Downloading joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py): started
  Building wheel for stopit (setup.py): finished with status 'done'
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11954 sha256=c9aedd18e409d94eceb419f8102e374f34d2b527e7a401138e404c73034ac6e9
  Stored in directory: /home/ubuntu/.cache/pip/wheels/07/2e/ce/e558b7d4f9aafcdc0e5638ef890a3d51

In [2]:
import tpot
from tpot import TPOTClassifier
import sklearn
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [3]:
#tips and code in this notebook were originally written for v 0.10.2
tpot.__version__

'0.11.1'

## Heart Disease

#### load the heart disease dataset

In [4]:
'''
/mnt/data/raw/heart.csv

attribute documentation:
      age: age in years
      sex: sex (1 = male; 0 = female)
      cp: chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
     trestbps: resting blood pressure (in mm Hg on admission to the 
        hospital)
     chol: serum cholesterol in mg/dl
     fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
     restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria
     thalach: maximum heart rate achieved
     exang: exercise induced angina (1 = yes; 0 = no)
     oldpeak: ST depression induced by exercise relative to rest
     slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
     ca: number of major vessels (0-3) colored by fluoroscopy
     thal: 
         3 = normal; 
         6 = fixed defect; 
         7 = reversible defect
     target: diagnosis of heart disease (angiographic disease status)
        -- Value 0: < 50% diameter narrowing
        -- Value 1: > 50% diameter narrowing
 '''

#column names
names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang', \
         'oldpeak','slope','ca','thal','target']

#load data from Domino project directory
hd_data = pd.read_csv("/mnt/data/raw/heart.csv", header=None, names=names)

#some data came in as string
#convert to numeric and coerce errors to NaN
for col in hd_data.columns:  # Iterate over chosen columns
    hd_data[col] = pd.to_numeric(hd_data[col], errors='coerce')
    
#drop nulls
hd_data.dropna(inplace=True)
    
#load the X and y set as a numpy array
X_hd = hd_data.drop('target', axis=1).values
y_hd = hd_data['target'].values

#build the train and test sets
X_hd_train, X_hd_test, y_hd_train, y_hd_test = \
    sklearn.model_selection.train_test_split(X_hd, y_hd, random_state=1)

#now do one hot encoding------------------
    
#a function to do one hot encoding for categorical columns
def create_dummies(data, cols, drop1st=True):
    for c in cols:
        dummies_df = pd.get_dummies(data[c], prefix=c, drop_first=drop1st)  
        data=pd.concat([data, dummies_df], axis=1)
        data = data.drop([c], axis=1)
    return data
cat_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']
hd_data = create_dummies(hd_data, cat_cols)
    
#load the X and y set as a numpy array
X_hd_ohe = hd_data.drop('target', axis=1).values
y_hd_ohe = hd_data['target'].values

#build the train and test sets
X_hd_ohe_train, X_hd_ohe_test, y_hd_ohe_train, y_hd_ohe_test = \
    sklearn.model_selection.train_test_split(X_hd_ohe, y_hd_ohe, random_state=1)

## Run TPOT

#### TPOTClassifier structure

class tpot.TPOTClassifier(generations=100, population_size=100,
                          offspring_size=None, mutation_rate=0.9,
                          crossover_rate=0.1,
                          scoring='accuracy', cv=5,
                          subsample=1.0, n_jobs=1,
                          max_time_mins=None, max_eval_time_mins=5,
                          random_state=None, config_dict=None,
                          template=None,
                          warm_start=False,
                          memory=None,
                          use_dask=False,
                          periodic_checkpoint_folder=None,
                          early_stop=None,
                          verbosity=0,
                          disable_update_check=False)

#### Popular settings

**generations**: int, optional (default=100).
Number of iterations to run the pipeline optimization process. TPOT will evaluate population_size + generations × offspring_size pipelines in total.

**population_size**: int, optional (default=100)
Number of individuals to retain in the genetic programming population every generation. Must be a positive number.

Generally, TPOT will work better when you give it more individuals with which to optimize the pipeline.

**offspring_size**: int, optional (default=None)
Number of offspring to produce in each genetic programming generation. Must be a positive number. By default, the number of offspring is equal to the population size.
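
As a rough sanity check on the formula above, the evaluation budget can be computed directly. This is a minimal sketch: `total_pipelines` is a hypothetical helper, not part of TPOT, and it assumes `offspring_size=None` falls back to `population_size` as described above.

```python
def total_pipelines(generations, population_size, offspring_size=None):
    """Pipelines evaluated: population_size + generations * offspring_size."""
    if offspring_size is None:
        # default described above: offspring count equals the population size
        offspring_size = population_size
    return population_size + generations * offspring_size


print(total_pipelines(100, 100))  # defaults: 100 + 100 * 100 = 10100
print(total_pipelines(2, 2, 4))   # small run: 2 + 2 * 4 = 10
```

With the defaults this is over ten thousand pipeline evaluations, which is why the cells in this notebook use small `generations` values and a time limit.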

**scoring**: string or callable, optional (default='accuracy').
Function used to evaluate the quality of a given pipeline for the classification problem. The following built-in scoring functions can be used:

'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss','precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'
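
Besides the strings above, `scoring` also accepts a callable. The sketch below assumes the scikit-learn scorer convention `scorer(estimator, X, y)`, returning a number where higher is better. `specificity_scorer` and `AlwaysZero` are hypothetical names used only for illustration.

```python
def specificity_scorer(estimator, X, y):
    """True-negative rate for binary labels (0 = negative, 1 = positive)."""
    y_pred = estimator.predict(X)
    true_neg = sum(1 for t, p in zip(y, y_pred) if t == 0 and p == 0)
    negatives = sum(1 for t in y if t == 0)
    # avoid dividing by zero when the fold contains no negatives
    return true_neg / negatives if negatives else 0.0


class AlwaysZero:
    """Stand-in estimator that predicts class 0 for every row."""

    def predict(self, X):
        return [0 for _ in X]


# every negative is predicted correctly, so specificity is 1.0
print(specificity_scorer(AlwaysZero(), [[1], [2], [3]], [0, 1, 0]))
```

A scorer like this would then be passed as `TPOTClassifier(scoring=specificity_scorer, ...)`.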

**n_jobs**: integer, optional (default=1).
Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.

**max_time_mins**: integer or None, optional (default=None).
How many minutes TPOT has to optimize the pipeline.

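**verbosity**: integer, optional (default=0).
How much information TPOT prints while running: 0 = nothing, 1 = minimal information, 2 = more information plus a progress bar, 3 = everything.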

**config_dict**: Python dictionary, string, or None, optional (default=None).
A configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.

Possible inputs are:
* Python dictionary: TPOT will use your custom configuration,
* string 'TPOT light': TPOT will use a built-in configuration with only fast models and preprocessors,
* string 'TPOT MDR': TPOT will use a built-in configuration specialized for genomic studies,
* string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or
* None: TPOT will use the default TPOTClassifier configuration.

http://epistasislab.github.io/tpot/using/#built-in-tpot-configurations
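
For the dictionary option, TPOT's configuration format maps each operator's import path to a dictionary of candidate hyperparameter values. A minimal sketch follows; the operators and value ranges are illustrative choices, not a recommended configuration.

```python
# keys are import paths of scikit-learn-compatible operators;
# values map each hyperparameter to the candidates TPOT may pick from
custom_config = {
    'sklearn.naive_bayes.GaussianNB': {},  # nothing to tune
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(1, 11),
    },
    'sklearn.preprocessing.StandardScaler': {},  # preprocessor, nothing to tune
}

# list each operator with the hyperparameters it exposes to the search
for operator, grid in custom_config.items():
    print(operator, '->', sorted(grid))
```

A dictionary like this would then be passed as `TPOTClassifier(config_dict=custom_config, ...)`.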

In [5]:
#default config_dict

tpot_hd = TPOTClassifier(generations=5, scoring='accuracy', n_jobs=4, \
                         max_time_mins=1, verbosity=2)
tpot_hd.fit(X_hd_ohe_train, y_hd_ohe_train)
tpot_hd.export('tpot_hd_pipeline.py')

1.04 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.55, min_samples_leaf=2, min_samples_split=12, n_estimators=100)


In [6]:
tpot_hd.score(X_hd_ohe_test, y_hd_ohe_test)

0.7894736842105263

In [7]:
#light config_dict

tpot_hd_light = TPOTClassifier(config_dict='TPOT light', generations=2, \
                         scoring='accuracy', n_jobs=4, max_time_mins=1, \
                         verbosity=2)
tpot_hd_light.fit(X_hd_ohe_train, y_hd_ohe_train)
tpot_hd_light.export('tpot_hd_light_pipeline.py')

Generation 1 - Current best internal CV score: 0.8506280193236714
Generation 2 - Current best internal CV score: 0.8506280193236714
Generation 3 - Current best internal CV score: 0.8506280193236714
Generation 4 - Current best internal CV score: 0.8506280193236714
Generation 5 - Current best internal CV score: 0.8506280193236714

Best pipeline: DecisionTreeClassifier(LogisticRegression(input_matrix, C=5.0, dual=False, penalty=l2), criterion=gini, max_depth=2, min_samples_leaf=2, min_samples_split=13)


In [8]:
tpot_hd_light.score(X_hd_ohe_test, y_hd_ohe_test)

0.7894736842105263

#### How to specify your parameter space
Passing a single-operator `config_dict` lets you control the hyperparameter space, but TPOT then no longer searches across model types.

In [9]:
params = {'max_depth': np.arange(1,200,1),
          'learning_rate': np.arange(0.0001,0.1,0.0001),
          'n_estimators': np.arange(1,200,1),
          'nthread':[6],
          'gamma': np.arange(0.1,2,0.1),
          #subsample and colsample_* are fractions: XGBoost requires values in (0, 1]
          'subsample': np.linspace(0.1,1.0,10),
          'reg_lambda': np.arange(0.1,200,1),
          'reg_alpha': np.arange(1,200,1),
          'min_child_weight': np.arange(1,200,1),
          'colsample_bytree': np.linspace(0.1,1.0,10),
          'colsample_bylevel': np.linspace(0.1,1.0,10)
         }
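
A grid like this is far too large to enumerate, so TPOT's genetic search samples from it rather than trying every combination. The sketch below counts the combinations in a small stand-in grid. `grid_size` and `toy_params` are hypothetical and only illustrate the arithmetic.

```python
import math


def grid_size(param_grid):
    """Number of distinct hyperparameter combinations in a grid."""
    return math.prod(len(values) for values in param_grid.values())


# stand-in grid with three of the parameters, using plain lists
toy_params = {
    'max_depth': list(range(1, 200)),     # 199 values
    'n_estimators': list(range(1, 200)),  # 199 values
    'nthread': [6],                       # 1 value
}
print(grid_size(toy_params))  # 199 * 199 * 1 = 39601
```

Two parameters alone already give tens of thousands of combinations; each additional parameter multiplies that count.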

This takes a long time to run, so the cells below are commented out. They only show how to run it.

In [21]:
# tpot_classifier = TPOTClassifier(generations=2, population_size=2, offspring_size=4, n_jobs=4, \
#                                 verbosity=2, \
#                                 config_dict={'xgboost.XGBClassifier': params}, scoring = 'accuracy')
# tpot_classifier.fit(X_hd_ohe_train, y_hd_ohe_train)

In [22]:
# tpot_classifier.export('tpot_xgb.py')

In [23]:
# tpot_classifier.score(X_hd_ohe_test, y_hd_ohe_test)

#### load the breast cancer dataset

In [13]:
#load breast cancer data

from sklearn.datasets import load_breast_cancer

'''
Attribute Information:

1) ID number 
2) Diagnosis (M = malignant, B = benign) 
3-32) 

Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)
'''

#load from sklearn
X_bc, y_bc = sklearn.datasets.load_breast_cancer(return_X_y=True)

#build the train and test sets
X_bc_train, X_bc_test, y_bc_train, y_bc_test = \
    sklearn.model_selection.train_test_split(X_bc, y_bc, random_state=1)

In [17]:
#light config_dict

tpot_bc_light = TPOTClassifier(config_dict='TPOT light', generations=2, \
                         scoring='accuracy', n_jobs=4, max_time_mins=1, \
                         verbosity=2)
tpot_bc_light.fit(X_bc_train, y_bc_train)
tpot_bc_light.export('tpot_bc_light_pipeline.py')

Generation 1 - Current best internal CV score: 0.9695212038303694
Generation 2 - Current best internal CV score: 0.9695212038303694

Best pipeline: LogisticRegression(MaxAbsScaler(input_matrix), C=20.0, dual=False, penalty=l2)


In [18]:
tpot_bc_light.score(X_bc_test, y_bc_test)

0.965034965034965

In [19]:
hd_acc = tpot_hd_light.score(X_hd_ohe_test, y_hd_ohe_test)
bc_acc = tpot_bc_light.score(X_bc_test, y_bc_test)

import json
with open('../dominostats.json', 'w') as f:
    f.write(json.dumps( {"HD_ACC": hd_acc, "BC_ACC": bc_acc}))