# Modeling, Evaluation, and Final Results

This notebook uses the Tree-based Pipeline Optimization Tool (TPOT) to find the best model.

TPOT automates the search over machine learning pipelines and returns the best-performing pipeline it finds.


TPOT uses genetic programming to select the best machine learning pipeline. The search covers:

- Feature selection
- Feature preprocessing
- Feature construction
- Model selection
- Hyperparameter optimization

Each candidate pipeline is scored with `sklearn.model_selection.cross_val_score`, which runs k-fold cross-validation with `scoring='accuracy'`.
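The scoring step can be sketched as follows. This is a minimal illustration on synthetic data, not the capstone dataset; `X_demo`, `y_demo`, and the decision tree candidate are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the capstone dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# One candidate model, scored by accuracy on each of 5 folds
candidate = DecisionTreeClassifier(max_depth=3, random_state=0)
fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=5, scoring="accuracy")

# The mean over the folds plays the role of the "internal CV score" in TPOT's log
print(fold_scores.mean())
```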


## Step 0: Loading Modules and Dataset

This section loads the modules and the original dataset from a CSV file into a DataFrame.

### Modules

In [None]:
# importing  packages
import numpy as np
import pandas as pd
# import matplotlib
# import matplotlib.pyplot as plt
# import seaborn as sns
# import re

# ## importing datetime class
# from datetime import datetime, timedelta

In [None]:
# XGBoost
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# sklearn
# import model selection utilities (cross-validation and train/test split)
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split


# import required packages for evaluating models
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.exceptions import DataConversionWarning
# Import your necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier


# Visualize Boosting Trees and Feature Importance
import graphviz


# For support Vector Machines with Scikit-learn Example
# Import scikit-learn dataset library
from sklearn import datasets
#Import svm model
from sklearn import svm
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Bagged Decision Trees for Classification - necessary dependencies
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# For Bagged Example Data clean up
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import MinMaxScaler

# Voting Ensemble 
#from sklearn.linear_model import LogisticRegression
#from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# imports for evaluations
from sklearn.naive_bayes import GaussianNB

# balance the data
from imblearn.over_sampling import SMOTE

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier

# warnings
import warnings

#clf = LogisticRegression(max_iter=2000)



### SMOTE and Evaluation Functions

In [None]:
## NEW VALUATION CODE
def RunNewEval(df_working,modelname='LogisticRegression',model_param='max_iter=2000'):
    '''
    This function runs SMOTE on the pandas DataFrame passed in.
    It assumes that the target (y) is the last column of the DataFrame
    and that all of the columns are numeric.
    Prints out the results.
    Just to make it easier to run multiple datasets against each other.
    
    Required Imports:
    -	from sklearn.naive_bayes import GaussianNB
    -	from sklearn.model_selection import train_test_split
    - from sklearn import metrics
    INPUTS:
    ----
    - df_working : DataFrame to run against
    OUTPUT:
    ----
    - f1_score - bias, variance
    - ROC_AUC  - bias, variance
    ''' 
    #Create x and Y 
    working_values = df_working.values
    # Slice Out X and Y
    X = working_values[:,:-1]
    #### create a variable `y` which contains the last column in `reg_values`
    y = working_values[:,-1:]
    # Print Before Shape
    print("Shape before SMOTE  X:",X.shape," y:" ,y.shape)
    # resample/balance the data
    sm = SMOTE(random_state = 2021) 
    X_res, y_res = sm.fit_resample(X, y)  # fit_sample was removed in newer imbalanced-learn releases
    # Print shape after reshape
    print("Shape after SMOTE  X:",X_res.shape," y:" ,y_res.shape) 
#    print("Working on averaged f1_score from 10-fold CV (default)")
    f1_bias, f1_variance = my_evalNew(X_res, y_res, modelname, 10, 'f1')    
#    print("Working on averaged ROC_AUC from 10-fold CV")
    roc_bias, roc_variance = my_evalNew(X_res, y_res, modelname, 10, 'roc_auc')
    print("Averaged F1 1-bias : " , f1_bias, " Variance : ", f1_variance)
    print("Averaged ROC_AUC 1-bias : " , roc_bias, " Variance : ", roc_variance)    
    return  f1_bias, f1_variance, roc_bias, roc_variance



def my_evalNew(X, y,modelname='LogisticRegression', k=10, scoring = 'f1'):
    #def my_evalNew(X, y, classifer = clf, k=10, scoring = 'f1'):
    '''
    return evaluation results (f1-score or ROC_AUC). 
    Built in k-fold evaluation.
    INPUTS:
    ----
    - X: features; DataFrame or Numpy ndarray;
    - y: target; DataFrame or Numpy ndarray;
    - classifier: any sklearn (or its add-on) based classifier
    - k: number of folds in cross validation
    - scoring: evaluation metric ('f1' default or 'roc_auc')
    OUTPUT:
    ----
    bias/variance score of selected metric.
    '''
    #print("Using model : ", modelname)
    if modelname == 'LogisticRegression':
        print("Using model:", modelname, "with :",scoring)
        clf = LogisticRegression(max_iter=2000)
                  
    elif modelname == 'SVC':
        p_kernel='linear'
        print("Using model:", modelname, "with :",scoring)
        clf = svm.SVC(kernel=p_kernel) # Linear Kernel
        
    elif modelname == 'RandomForest':
        p_max_depth=2
        p_test_size=.3
        p_random_state=2019
        print("Using model:", modelname, "with :",scoring)
        clf = RandomForestClassifier(max_depth=p_max_depth, random_state=p_random_state)
          
    elif modelname == 'AdaBoost':
        seed = 7
        num_trees = 70
        print("Using model:", modelname, "with :",scoring)
        clf = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
          
    elif modelname == 'DecisionTreeClassifier':
        print("Using model:", modelname, "with :",scoring)
        clf = DecisionTreeClassifier()

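    elif modelname == 'XGBoost':
        # NOTE: this branch builds a regressor. Its continuous predictions are not valid
        # input for the 'f1' scorer, so a classifier such as xgb.XGBClassifier is likely needed here.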
        p_objective ='reg:squarederror'
        p_colsample_bytree = 0.3
        p_learning_rate = 0.1
        p_max_depth = 5
        p_alpha = 10
        p_n_estimators = 100
        print("Using model:", modelname, "with :",scoring)
        clf = xgb.XGBRegressor(objective = p_objective
                               , colsample_bytree = p_colsample_bytree
                               , learning_rate = p_learning_rate
                               , max_depth = p_max_depth
                               , alpha = p_alpha
                               , n_estimators = p_n_estimators)

    elif modelname == 'GNaiveBayes':
        print("Using model:", modelname, "with :",scoring)
        clf = GaussianNB()

    else:
        raise ValueError("Unknown modelname: " + str(modelname))

    # Seed once, outside the loop, so each repetition draws a different
    # (but reproducible) shuffle; reseeding inside the loop repeated the same folds
    rng = np.random.RandomState(2021)
    scores = []
    for i in range(2):
        #### generate a random number to shuffle the data for training and test
        random_int = rng.randint(0, 3000)
        #### create cross validation folds
        kfold = model_selection.KFold(n_splits=k, random_state=random_int, shuffle=True)
        #### record the score
        score = model_selection.cross_val_score(clf, X=X, y=y, cv=kfold, scoring=scoring)
        scores.append(score)
    scores = np.array(scores)
    #### the "bias" reported here is the average score; the "variance" is the standard deviation
    bias, variance = round(scores.mean(),4), round(scores.std(),4)
    return(bias, variance)

def warn(*args, **kwargs):
    pass



def CreateSmoteDF(df_working):
    '''
    This function runs SMOTE on the pandas DataFrame passed in.
    It assumes that the target (y) is the last column of the DataFrame.
    Returns a balanced DataFrame.
    
    Required Imports:
    -	from imblearn.over_sampling import SMOTE
    INPUTS:
    ----
    - df_working : DataFrame to run against
    OUTPUT:
    ----
    -  df_smote : DataFrame that has been balanced
    ''' 
    #Create x and Y 
    working_values = df_working.values
    # Slice Out X and Y
    X = working_values[:,:-1]
    #### create a variable `y` which contains the last column in `reg_values`
    y = working_values[:,-1:]
    df_column_lst = df_working.columns.to_list()
    # Print Before Shape
    print("Shape before SMOTE  X:",X.shape," y:" ,y.shape)
    # resample/balance the data
    sm = SMOTE(random_state = 2021) 
    X_res, y_res = sm.fit_resample(X, y)  # fit_sample was removed in newer imbalanced-learn releases
    # Print shape after reshape
    #print("Shape after SMOTE  X:",X_res.shape," y:" ,y_res.shape) 
    # Recreate the Dataframe with the smote set
    df_smote = pd.concat([pd.DataFrame(X_res), pd.DataFrame(y_res)], axis=1)
    # rename the columns
    df_smote.columns = df_column_lst
    print("Shape after SMOTE  : ",df_smote.shape) 

    return df_smote
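
The evaluation helper above repeats shuffled k-fold cross-validation and reports the mean and standard deviation of the scores. The same idea can be sketched with scikit-learn's built-in `RepeatedStratifiedKFold`. This is an alternative illustration on synthetic data, not the notebook's own method; `X_demo` and `y_demo` are assumed names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary data (assumption: not the capstone dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.7, 0.3], random_state=2021
)

# 10 folds repeated twice; each repetition uses a different shuffle
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=2021)
scores = cross_val_score(
    LogisticRegression(max_iter=2000), X_demo, y_demo, cv=cv, scoring="f1"
)

# Average score and its spread across all 20 fits
print("mean:", round(scores.mean(), 4), "std:", round(scores.std(), 4))
```

Stratified folds keep the class ratio the same in every fold, which matters when the target is imbalanced.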


## Read-in Data

In [None]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%%bash
ln -s drive/My\ Drive/BUAN\ 6590\ -\ Capstone/ Capstone

ln: failed to create symbolic link 'Capstone/BUAN 6590 - Capstone': File exists


In [None]:
# Load dataset
data_file = '/content/Capstone/DATA/df_merge2_with_Target.csv'
df_data = pd.read_csv(data_file,index_col=0)
df_data.head()

Unnamed: 0,date,duration,user_id,steps,floors,intensity_minutes,active_kilocalories,hr_min,hr_max,hr_res,stress_avg,stress_dur_rest,stress_dur_activity,stress_dur_low,stress_dur_medium,stress_dur_high,total_hours,quality_hours,spo2_minimum,spo2_average,deep_hours,rem_hours,Age,survey_date,ss7dmavg,ss28dmavg,ss28dStdev,7D_StdDevfrom28d,7Dssma0-1STDev_False,7Dssma0-1STDev_True,7Dssma1-2STDev_False,7Dssma1-2STDev_True,7Dssma2-3STDev_False,7Dssma2-3STDev_True,7Dssma3+STDev_False,7Dssma3+STDev_True,Status
0,2020-05-13,86400,0Sq4rLw6hryK3GlUpE6n,7798,0.0,0,175,42.0,131.0,52.0,37.0,22140.0,12240.0,14760.0,16200.0,4440.0,5.25,2.97,88.0,95.38,1.45,1.07,68.0,2020-05-19,37.0,37.0,1e-06,0.0,1,0,1,0,1,0,1,0,2
1,2020-05-14,86400,0Sq4rLw6hryK3GlUpE6n,7787,11.0,0,178,37.0,100.0,54.0,45.0,15420.0,20340.0,12720.0,19920.0,5340.0,6.45,3.53,84.0,92.06,1.45,1.07,68.0,2020-05-19,41.0,41.0,4.0,0.0,1,0,1,0,1,0,1,0,2
2,2020-05-15,86400,0Sq4rLw6hryK3GlUpE6n,6432,8.0,0,134,48.0,104.0,52.0,43.0,19860.0,15660.0,13560.0,17100.0,8880.0,5.93,1.67,87.0,93.78,1.45,1.07,68.0,2020-05-19,41.666667,41.666667,3.399346,0.0,1,0,1,0,1,0,1,0,2
3,2020-05-16,86400,0Sq4rLw6hryK3GlUpE6n,6682,5.0,0,253,49.0,111.0,52.0,53.0,12600.0,15600.0,9240.0,17940.0,14040.0,7.58,3.77,84.0,93.16,1.45,1.07,68.0,2020-05-19,44.5,44.5,5.722762,0.0,1,0,1,0,1,0,1,0,2
4,2020-05-17,86400,0Sq4rLw6hryK3GlUpE6n,5406,8.0,0,175,52.0,108.0,58.0,66.0,4800.0,24060.0,6000.0,11220.0,24060.0,5.6,4.02,83.0,95.79,1.45,1.07,68.0,2020-05-19,48.8,48.8,10.007997,0.0,1,0,1,0,1,0,1,0,2


It's generally a good idea to randomly **shuffle** the data before starting, so that any ordering in the file does not carry over into the split. You can rearrange the rows of the DataFrame with NumPy's **random.permutation()** function. To reset the index after the shuffle, use the **reset_index()** method with **drop = True** as a parameter.

In [None]:
df_data_shuffle=df_data.iloc[np.random.permutation(len(df_data))]
df_data2=df_data_shuffle.reset_index(drop=True)
df_data2.head()

Unnamed: 0,date,duration,user_id,steps,floors,intensity_minutes,active_kilocalories,hr_min,hr_max,hr_res,stress_avg,stress_dur_rest,stress_dur_activity,stress_dur_low,stress_dur_medium,stress_dur_high,total_hours,quality_hours,spo2_minimum,spo2_average,deep_hours,rem_hours,Age,survey_date,ss7dmavg,ss28dmavg,ss28dStdev,7D_StdDevfrom28d,7Dssma0-1STDev_False,7Dssma0-1STDev_True,7Dssma1-2STDev_False,7Dssma1-2STDev_True,7Dssma2-3STDev_False,7Dssma2-3STDev_True,7Dssma3+STDev_False,7Dssma3+STDev_True,Status
0,2021-01-03,86400,PT4Wz6SVCxEXSj4O8IfB,3813,6.0,0,97,56.0,101.0,62.0,14.0,59520.0,16560.0,3420.0,2340.0,120.0,10.03,3.47,84.0,92.26,1.05,2.42,25.0,2021-01-05,18.142857,25.0,7.609518,-0.901127,1,0,1,0,1,0,1,0,1
1,2021-02-26,86400,B6XOaByr9nTwVK2vT8YI,7682,20.0,0,535,44.0,128.0,52.0,33.0,27000.0,22740.0,13980.0,10680.0,3420.0,7.63,4.65,83.0,89.81,1.78,2.87,29.0,2021-03-02,35.0,36.071429,6.485856,-0.165195,1,0,1,0,1,0,1,0,1
2,2020-08-30,86400,zl7BTPWIYwuysZi5gVrm,8461,18.0,0,306,53.0,109.0,56.0,39.0,19380.0,21240.0,15780.0,14580.0,4740.0,9.43,2.88,83.0,92.85,1.45,1.07,66.0,2020-09-01,47.714286,46.714286,7.160692,0.139651,0,1,1,0,1,0,1,0,1
3,2020-06-23,86400,6gDGMpyGahFZYdhW8SUB,10265,65.0,0,722,43.0,110.0,46.0,40.0,19860.0,19860.0,11880.0,10800.0,8880.0,7.5,1.98,84.0,92.26,1.45,1.07,56.0,2020-06-23,44.142857,43.464286,7.688034,0.088263,0,1,1,0,1,0,1,0,1
4,2021-03-04,86400,8Xay1Wtk5twkSadg4hqm,6533,29.0,23,342,56.0,116.0,64.0,22.0,46560.0,13740.0,10260.0,7260.0,1080.0,7.85,4.01,89.0,94.28,1.58,2.43,33.0,2021-03-09,25.285714,34.464286,9.271624,-0.989964,1,0,1,0,1,0,1,0,1
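
The shuffle above is unseeded, so the row order changes on every run. A minimal sketch of a reproducible alternative uses pandas' `DataFrame.sample` with a fixed `random_state`. The small `df_demo` frame and the seed value are assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in frame (assumption: a few values in the style of the dataset)
df_demo = pd.DataFrame(
    {"steps": [7798, 7787, 6432, 6682, 5406], "Status": [2, 2, 2, 2, 2]}
)

# frac=1 returns every row in random order; random_state fixes that order
df_shuffled = df_demo.sample(frac=1, random_state=2021).reset_index(drop=True)
print(df_shuffled)
```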


In [None]:
data_class = df_data2['Status'].values

### Missing Value Handling

You should also handle missing values before using *tpot*. The cells below first drop the identifier and date columns, then check each column for missing values:

In [None]:
# Drop unused columns in Activity
drop_col = ['date','user_id','survey_date']
df_data2 = df_data2.drop(drop_col, axis=1)

In [None]:
df_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37109 entries, 0 to 37108
Data columns (total 34 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   duration              37109 non-null  int64  
 1   steps                 37109 non-null  int64  
 2   floors                37109 non-null  float64
 3   intensity_minutes     37109 non-null  int64  
 4   active_kilocalories   37109 non-null  int64  
 5   hr_min                37109 non-null  float64
 6   hr_max                37109 non-null  float64
 7   hr_res                37109 non-null  float64
 8   stress_avg            37109 non-null  float64
 9   stress_dur_rest       37109 non-null  float64
 10  stress_dur_activity   37109 non-null  float64
 11  stress_dur_low        37109 non-null  float64
 12  stress_dur_medium     37109 non-null  float64
 13  stress_dur_high       37109 non-null  float64
 14  total_hours           37109 non-null  float64
 15  quality_hours      

In [None]:
pd.isna(df_data2).any()

In [None]:
pd.isnull(df_data2).any()
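
`pd.isna` and `pd.isnull` are aliases, and `.any()` only reports whether a column has at least one missing value. To get the number of missing values per column, sum the boolean mask instead. A minimal sketch on a small stand-in frame (`df_demo` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in frame with one missing heart-rate value
df_demo = pd.DataFrame(
    {"hr_min": [42.0, np.nan, 48.0], "hr_max": [131.0, 100.0, 104.0]}
)

# True counts as 1, so summing the mask gives a per-column count
missing_counts = df_demo.isna().sum()
print(missing_counts)
print("total missing:", int(missing_counts.sum()))
```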

# LOAD IN TPOT

Now it's time to use the **tpot** library to suggest the best pipeline for this classification problem. To do so, you have to import the **TPOTClassifier** class from the tpot library. Had this been a regression problem, you would have imported the **TPOTRegressor** class.

**TPOTClassifier** has a wide variety of parameters, and you can read all about them in the TPOT documentation. But the most notable ones you must know are:

- generations: Number of iterations to run the pipeline optimization process. The default is `100`.
- population_size: Number of individuals to retain in the genetic programming population every generation. The default is `100`.
- offspring_size: Number of offspring to produce in each genetic programming generation. The default is `None`, which sets it equal to `population_size` (`100`).
- mutation_rate: Mutation rate for the genetic programming algorithm in the range `[0.0, 1.0]`. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. The default is `0.9`.
- crossover_rate: Crossover rate for the genetic programming algorithm in the range `[0.0, 1.0]`. This parameter tells the genetic programming algorithm how many pipelines to "breed" every generation. The default is `0.1`.
- scoring: Function used to evaluate the quality of a given pipeline for the classification problem, such as `accuracy, average_precision, roc_auc, recall`, etc. The default is `accuracy`.
- cv: Cross-validation strategy used when evaluating pipelines. The default is `5`.
- random_state: The seed of the pseudo-random number generator used in TPOT. Use this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.

Also note mutation_rate + crossover_rate cannot exceed **1.0**.

Here you will use tpot with generations = 5 and the rest of the parameters at default values. The parameter verbosity = 2 states how much information TPOT communicates while it's running.

Then you will call the `fit()` method with the training set (without the target column) and the target column as the arguments.

Note that running the TPOT search below will take several hours to finish. With the given TPOT settings (5 generations with a population size of 100), TPOT will evaluate 600 pipeline configurations before finishing: the initial population of 100 plus 100 offspring in each of the 5 generations. To put this number into context, think about a grid search of 600 hyperparameter combinations for a machine learning algorithm and how long that grid search will take. That is 600 model configurations to evaluate with 5-fold cross-validation, which means that roughly 3,000 models are fit and evaluated on the training data in one grid search. That's a time-consuming procedure! Later, you will get to know about some more arguments that you can pass to TPOTClassifier to control the execution time for TPOT to finish.
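
The size of the search can be worked out ahead of time. TPOT evaluates the initial population plus one batch of offspring per generation, which gives the 600 total shown in the progress bar of the run below. A minimal sketch of the arithmetic (the helper name `tpot_pipeline_count` is hypothetical, not part of TPOT):

```python
def tpot_pipeline_count(generations, population_size, offspring_size=None):
    """Pipelines evaluated: initial population + offspring per generation."""
    if offspring_size is None:
        # TPOT's default: offspring_size equals population_size
        offspring_size = population_size
    return population_size + generations * offspring_size


n_pipelines = tpot_pipeline_count(generations=5, population_size=100)
n_fits = n_pipelines * 5  # each pipeline is fit once per fold with cv=5

print(n_pipelines, n_fits)
```

The same formula with `generations=100` and `population_size=15` gives 1515, the maximum shown in the progress bar of the SMOTE run later in this notebook.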

__NOTE__: be careful running this step - takes ~2 hours to run!

In [None]:
#### install TPOT
!pip install tpot

## ORIGINAL Dataframe Not SMOTE

You will now split the DataFrame into a training set and a testing set, just as you would for any type of machine learning modeling. You can do this via sklearn's **model_selection** **train_test_split**. The parameters are `df_data2.index` as the indexes of the DataFrame, *train_size = 0.75* to keep 75% of the data in the training set, *test_size = 0.25* to keep the remaining 25% in the testing set, and `stratify = data_class` to preserve the class label proportions in both sets. Note that no separate validation set is held out here; the testing set plays that role and gives us an idea of the test error.

In [None]:
from sklearn.model_selection import train_test_split
training_indices, testing_indices = train_test_split(df_data2.index,
                                                        stratify = data_class,
                                                        train_size=0.75, test_size=0.25, random_state = 2019)


You can check the size of the training set and testing set using the size attribute.

In [None]:
training_indices.size, testing_indices.size
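
To see what `stratify` does, the sketch below splits a synthetic label array with an 80/20 class imbalance and compares the minority share in each part. The labels and names here are assumptions for illustration, not the capstone data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 80 rows of class 1, 20 rows of class 2
labels = np.array([1] * 80 + [2] * 20)
idx = np.arange(len(labels))

train_idx, test_idx = train_test_split(
    idx, stratify=labels, train_size=0.75, test_size=0.25, random_state=2019
)

# With stratify, both parts keep the original 20% minority share
print((labels[train_idx] == 2).mean(), (labels[test_idx] == 2).mean())
```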

In [None]:
# from tpot import TPOTClassifier
# # from tpot import TPOTRegressor # for regression tasks

# tpot = TPOTClassifier(generations=5,verbosity=2, n_jobs=-1)

# tpot.fit(df_data2.drop('Status',axis=1).loc[training_indices].values, # X_train
#          df_data2.loc[training_indices,'Status'].values) # y_train


In [None]:
# tpot.export('/content/Capstone/tpot_exported_pipeline.txt')

Optimization Progress: 43%
258/600 [3:20:28<6:33:05, 68.96s/pipeline]

Generation 1 - Current best internal CV score: 0.7045021413272776

The run above was stopped partway through (258 of 600 pipelines), so only the first generation completed.
The best pipeline at that point had a CV accuracy score of **70.45%**.

One key difference here is that we use both `X_test` and `y_test` in the code below, since the `.score()` method combines the __prediction__ and __evaluation__ in the same step.

In [None]:
# from tpot import TPOTClassifier
# # from tpot import TPOTRegressor # for regression tasks

# tpot = TPOTClassifier(generations=5,verbosity=2, n_jobs=-1)

In [None]:
# tpot.score(df_data2.drop('Status',axis=1).loc[testing_indices].values, #X_test
#            df_data2.loc[testing_indices, 'Status'].values) # y_test

As can be seen, the test accuracy is **89.16%.**



Isn't that awesome? Without you tweaking a lot of parameters and options to get the best model, TPOT not only gave you the information about the best model but also a working code for it!

As indicated earlier, the last TPOT run took *several hours* to finish. Well, there are certain parameters you can specify to control the execution time of TPOT but with a trade-off. Since you will be limiting the time of TPOT execution, TPOT won't be able to explore all the possible pipelines and hence the best model suggested by TPOT at the end of the constrained time limit may not be the best model possible for that dataset. 

However, if sufficient time is given it will be somewhat closer to the best possible model. Some parameters are:

- **max_time_mins**: how many minutes TPOT has to optimize the pipeline. If not None, this setting will override the generations parameter and allow TPOT to run until max_time_mins minutes elapse.
- **max_eval_time_mins**: how many minutes TPOT has to evaluate a single pipeline. Setting this parameter to higher values will enable TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on assessing time-consuming pipelines. The default is 5.
- **early_stop**: how many generations TPOT waits without improvement before stopping. The optimization process ends if there is no improvement in the given number of generations.
- **n_jobs**: Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process. Setting n_jobs=-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets. The default is 1.
- **subsample**: Fraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0]. The default is 1.

Just for practice, you will again run TPOT with the additional arguments `max_time_mins = 180` and `max_eval_time_mins = 0.4`, this time with a reduced `population_size = 15`. An early stopping rule is also set: if the model performance does not improve in `10` consecutive generations, the training process will stop.

If you do not want to wait hours for your TPOT model, this might be the way to go!

In [None]:
from tpot import TPOTClassifier
tpot = TPOTClassifier(verbosity=2, max_time_mins=180, 
                      max_eval_time_mins=0.4, population_size=15, early_stop=10, n_jobs=-1)
tpot.fit(df_data2.drop('Status',axis=1).loc[training_indices].values, # X_train
         df_data2.loc[training_indices,'Status'].values) # y_train

In [None]:
tpot.score(df_data2.drop('Status',axis=1).loc[testing_indices].values, #X_test
           df_data2.loc[testing_indices, 'Status'].values) # y_test

In [None]:
tpot.export('/content/Capstone/tpot_exported_pipeline_two.txt')

**0.6832291442121147**



Best pipeline: DecisionTreeClassifier(CombineDFs(CombineDFs(input_matrix, input_matrix), input_matrix), criterion=gini, max_depth=7, min_samples_leaf=2, min_samples_split=7)
TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=10, generations=100,
               log_file=None, max_eval_time_mins=0.4, max_time_mins=180,
               memory=None, mutation_rate=0.9, n_jobs=-1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=15,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)


# ***********

As you can see, the best performing classifier within the specified time frame is a DecisionTreeClassifier with two nested `CombineDFs()` steps as pre-processing. Both steps combine copies of `input_matrix`, so they only duplicate the original features.

After TPOT finds the best pipeline, you can export it as a file and reuse it without repeating the search (we know the search takes a lot of time).

You can export the pipeline found above as a Python script:
```python
tpot.export('tpot_exported_pipeline.py')
```

Then in your subsequent analysis, you can run or import this `.py` file. The exported script contains scikit-learn code that rebuilds the pipeline, so it must be fit on the training data again before you can score or predict on _new, unseen_ data.
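
As a rough sketch of what using an exported pipeline looks like, the block below rebuilds the reported best pipeline by hand in scikit-learn and fits it on synthetic data. It approximates the shape of a TPOT export and is not the actual exported file; the data and the names `X_demo`, `y_demo`, and `exported_pipeline` are assumptions for illustration.

```python
from copy import copy

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the capstone dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# CombineDFs(a, b) places two feature sets side by side; with copies of the
# input, the nested form triples the feature columns
exported_pipeline = make_pipeline(
    make_union(
        make_union(FunctionTransformer(copy), FunctionTransformer(copy)),
        FunctionTransformer(copy),
    ),
    DecisionTreeClassifier(
        criterion="gini", max_depth=7, min_samples_leaf=2, min_samples_split=7
    ),
)

# The pipeline is code only, so it has to be fit before scoring
exported_pipeline.fit(X_train, y_train)
test_accuracy = exported_pipeline.score(X_test, y_test)
print(test_accuracy)
```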

For more examples of using TPOT for machine learning, refer to [these examples](https://epistasislab.github.io/tpot/examples/).

## TEST EXPORT

In [None]:
# from tpot import TPOTClassifier
# tpot = TPOTClassifier(verbosity=2, max_time_mins=180, 
#                       max_eval_time_mins=0.4, population_size=15, early_stop=3, n_jobs=-1)
# tpot.fit(df_data2.drop('Status',axis=1).loc[training_indices].values, # X_train
#          df_data2.loc[training_indices,'Status'].values) # y_train

In [None]:
tpot.export('/content/Capstone/tpot_exported_pipeline_two.txt')

# SMOTE Unbalanced Target and rerun

## RUN WITH SMOTE Dataframe

In [None]:
df_dataSM = CreateSmoteDF(df_data2)

Shape before SMOTE  X: (37109, 33)  y: (37109, 1)


  y = column_or_1d(y, warn=True)


Shape after SMOTE  :  (100176, 34)


In [None]:
data_classSM = df_dataSM['Status'].values

In [None]:
from sklearn.model_selection import train_test_split
training_indicesSM, testing_indicesSM = train_test_split(df_dataSM.index,
                                                        stratify = data_classSM,
                                                        train_size=0.75, test_size=0.25, random_state = 2019)

In [None]:
training_indicesSM.size, testing_indicesSM.size

(75132, 25044)

In [None]:
from tpot import TPOTClassifier
#tpot = TPOTClassifier(verbosity=2, max_time_mins=180, 
#                      max_eval_time_mins=0.4, population_size=15, early_stop=10, n_jobs=-1)
tpot = TPOTClassifier(verbosity=2, max_time_mins=None,                       
                      max_eval_time_mins=0.4, population_size=15,n_jobs=-1)
tpot.fit(df_dataSM.drop('Status',axis=1).loc[training_indicesSM].values, # X_train
         df_dataSM.loc[training_indicesSM,'Status'].values) # y_train

HBox(children=(FloatProgress(value=0.0, description='Optimization Progress', max=1515.0, style=ProgressStyle(d…


Generation 1 - Current best internal CV score: 0.3690970335653245

Generation 2 - Current best internal CV score: 0.3701085966640764

Generation 3 - Current best internal CV score: 0.37639089516416036

Generation 4 - Current best internal CV score: 0.37639089516416036

Generation 5 - Current best internal CV score: 0.38151513884883426

Generation 6 - Current best internal CV score: 0.38151513884883426

Generation 7 - Current best internal CV score: 0.38151513884883426

Generation 8 - Current best internal CV score: 0.38151513884883426

Generation 9 - Current best internal CV score: 0.3888089588171169

Generation 10 - Current best internal CV score: 0.38887550481363903

Generation 11 - Current best internal CV score: 0.38887550481363903

Generation 12 - Current best internal CV score: 0.4223100420219691

Generation 13 - Current best internal CV score: 0.4223100420219691

Generation 14 - Current best internal CV score: 0.4223100420219691

Generation 15 - Current best internal CV score:

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=100,
               log_file=None, max_eval_time_mins=0.4, max_time_mins=None,
               memory=None, mutation_rate=0.9, n_jobs=-1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=15,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

In [None]:
tpot.score(df_dataSM.drop('Status',axis=1).loc[testing_indicesSM].values, #X_test
           df_dataSM.loc[testing_indicesSM, 'Status'].values) # y_test

0.7466858329340361

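The test accuracy is higher after balancing the data using SMOTE (0.7467, compared with the 0.6832 recorded for the original data). Because SMOTE was applied before the train/test split, the test set here also contains synthetic samples, so the two scores are not directly comparable.
The best classifier found in this run is KNeighborsClassifier.

> **0.7466858329340361**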


In [None]:
tpot.export('/content/Capstone/tpot_exported_pipeline_twoSMOTE2_7466_LAST.txt')

In [None]:
# # Write out the merged Dataframe for use in other workbooks 
# outPath = '/content/Capstone/DATA/df_dataSMOTE.csv'
# df_dataSM.to_csv(outPath)