# Summary

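All three TPOT AutoML pipelines outperform the base model. The best base ROC-AUC was 86.47%, from a Random Forest. On the test set, TPOT Light reaches a ROC-AUC of 89.56% (best pipeline: KNeighborsClassifier), the TPOT NN configuration reaches 92.84% (best pipeline: GradientBoostingClassifier), and TPOT Multifactor Dimensionality Reduction (MDR) reaches 88.96% (best pipeline: two stacked LogisticRegression models).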


* TPOT stands for Tree-based Pipeline Optimization Tool. TPOT uses genetic programming to search for a well-performing ML pipeline
* TPOT requires data to be numerical. Since we have preprocessed data already, we will move on to TPOT deployment directly
* TPOT takes a 2D feature matrix and a 1D target array, so the target dataframes are raveled
* The best models are exported as .py files


In [1]:
! pip install tpot
! pip install ipywidgets
! jupyter nbextension enable --py widgetsnbextension
! pip install dask dask-ml

Collecting tpot
  Downloading TPOT-0.12.2-py3-none-any.whl.metadata (2.0 kB)
Collecting scikit-learn>=1.4.1 (from tpot)
  Downloading scikit_learn-1.4.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting deap>=1.2 (from tpot)
  Using cached deap-1.4.1-cp311-cp311-macosx_11_0_arm64.whl
Collecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Collecting stopit>=1.1.1 (from tpot)
  Using cached stopit-1.1.2-py3-none-any.whl
Downloading TPOT-0.12.2-py3-none-any.whl (87 kB)
Downloading scikit_learn-1.4.2-cp311-cp311-macosx_12_0_arm64.whl (10.5 MB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: stopit,

In [22]:
from tpot import TPOTClassifier
import pandas as pd 

In [26]:
# import preprocessed data
X_train = pd.read_csv('/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/X_train_ros.csv')
y_train = pd.read_csv('/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/y_train_ros.csv')
X_test = pd.read_csv('/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/X_test_std.csv')
y_test = pd.read_csv('/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/y_test.csv')
X_val = pd.read_csv('/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/X_val_std.csv')
y_val = pd.read_csv('/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/y_val.csv')


In [27]:
# reshape the targets to 1D arrays (the feature matrices must stay 2D)
# ravel() returns a new array, so the result is assigned back

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
y_val = y_val.values.ravel()
y_val

array([0, 1, 1, ..., 0, 0, 0])
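
The shape convention can be checked on a toy example. This is a minimal sketch, assuming a single-column target file like the `y_*.csv` files above; the frames `X_demo` and `y_demo` and their column names are made up for illustration and are not part of the project data.

```python
import pandas as pd

# hypothetical stand-ins for the CSV contents loaded above
X_demo = pd.DataFrame({"feature_a": [10, 45, 3], "feature_b": [80.0, 120.5, 60.0]})
y_demo = pd.DataFrame({"target": [0, 1, 1]})

# ravel() returns a new array and leaves the dataframe untouched,
# so the result has to be assigned to take effect
y_flat = y_demo.values.ravel()

print(X_demo.shape)  # features stay 2D: (n_samples, n_features)
print(y_demo.shape)  # a one-column dataframe is still 2D
print(y_flat.shape)  # the flattened target is 1D
```

Passing the one-column dataframe instead of the flattened array is what makes scikit-learn emit the `column_or_1d` conversion warning.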

## Set up MLflow

In [9]:
#!pip install mlflow
import mlflow

# point MLflow at the tracking server first, then select the experiment by id
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment(experiment_id="936482171255835555")

mlflow.autolog()


2024/04/22 22:35:43 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [28]:
tpot = TPOTClassifier(generations=20, population_size=20, mutation_rate=0.05, verbosity=2, scoring='roc_auc',
                      cv=5, n_jobs=-1, max_time_mins=3, max_eval_time_mins=3,
                      random_state=42, config_dict='TPOT light')
tpot.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


                                                                           
4.35 minutes have elapsed. TPOT will close down.          
TPOT closed during evaluation in one generation.
                                                          
                                                          
TPOT closed prematurely. Will use the current best pipeline.
                                                          
Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=100, p=2, weights=distance)
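
The early shutdown is expected given the search size. A rough back-of-the-envelope sketch, assuming TPOT's documented behaviour that `offspring_size` defaults to `population_size` and that the search evaluates `population_size + generations * offspring_size` pipelines in total; the variable names are only for illustration.

```python
# search settings copied from the TPOTClassifier call above
population_size = 20
generations = 20
cv_folds = 5
max_time_mins = 3

# assumption: offspring_size is left at its default, which equals population_size
offspring_size = population_size

# initial population plus one batch of offspring per generation
total_pipelines = population_size + generations * offspring_size
total_fits = total_pipelines * cv_folds

# average time per fit if the full search had to finish within the budget
# (serial estimate; n_jobs=-1 spreads the fits over several cores)
seconds_per_fit = max_time_mins * 60 / total_fits

print(total_pipelines)            # 420
print(total_fits)                 # 2100
print(round(seconds_per_fit, 3))  # 0.086
```

With well under a second available per cross-validated fit, the 3-minute limit ends the search long before 20 generations complete, so the reported pipeline is the best one found up to that point.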


In [29]:
print(f"ROC-AUC score: {tpot.score(X_test, y_test)}")

  y = column_or_1d(y, warn=True)


ROC-AUC score: 0.8955764469608547


In [30]:
# print best pipeline
print(tpot.fitted_pipeline_)

Pipeline(steps=[('kneighborsclassifier',
                 KNeighborsClassifier(n_neighbors=100, weights='distance'))])
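
The fitted pipeline can also be rebuilt by hand with scikit-learn. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the hotel cancellation data (an assumption), and computes ROC-AUC from `predict_proba`, which is how the metric is usually obtained for an estimator such as k-nearest neighbours.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# synthetic stand-in for the preprocessed data (assumption, not the project data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# same estimator and hyperparameters as the fitted pipeline printed above
knn_pipeline = make_pipeline(
    KNeighborsClassifier(n_neighbors=100, weights="distance")
)
knn_pipeline.fit(X_tr, y_tr)

# ROC-AUC is computed from the predicted probability of the positive class
proba = knn_pipeline.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print(f"ROC-AUC on synthetic data: {auc:.4f}")
```

The score printed here describes the synthetic data only and is not comparable to the 0.8956 reported above.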


In [39]:
tpot.export('tpot_pipeline.py')

## Neural network classifier using TPOT-NN

In [33]:
# TPOTClassifier is re-imported so this section can run on its own
from tpot import TPOTClassifier

In [None]:
# start a new MLflow experiment for the TPOT-NN run
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment(experiment_id="443836447855555990")

mlflow.autolog()

In [35]:
nn_tpot = TPOTClassifier(config_dict='TPOT NN',
                         verbosity=2, population_size=5, generations=5, n_jobs=-1, max_time_mins=2,
                         scoring='roc_auc')

assert not hasattr(nn_tpot, "classes_")
nn_tpot.fit(X_train, y_train)
assert hasattr(nn_tpot, "classes_")

  y = column_or_1d(y, warn=True)


                                                                           
3.22 minutes have elapsed. TPOT will close down.                           
TPOT closed during evaluation in one generation.
                                                                           
                                                                           
TPOT closed prematurely. Will use the current best pipeline.
                                                                           
Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.5, max_depth=10, max_features=0.5, min_samples_leaf=19, min_samples_split=11, n_estimators=100, subsample=0.6000000000000001)


In [36]:
print(f"AUC-ROC score: {nn_tpot.score(X_test, y_test)}")

AUC-ROC score: 0.9284142552431858


  y = column_or_1d(y, warn=True)


In [37]:
# export best nn_tpot
nn_tpot.export('nn_tpot_pipeline.py')

## TPOT MDR

In [38]:
mdr_tpot = TPOTClassifier(config_dict='TPOT MDR',
                          verbosity=2, population_size=5, generations=5, n_jobs=-1, max_time_mins=2,
                          scoring='roc_auc')

assert not hasattr(mdr_tpot, "classes_")
mdr_tpot.fit(X_train, y_train)
assert hasattr(mdr_tpot, "classes_")

print(f"AUC-ROC score: {mdr_tpot.score(X_test, y_test)}")

  y = column_or_1d(y, warn=True)


                                                                           
Generation 1 - Current best internal CV score: 0.8884558660947521
                                                                            
Generation 2 - Current best internal CV score: 0.8885106978560247
                                                                            
Generation 3 - Current best internal CV score: 0.8885106978560247
                                                                            
Generation 4 - Current best internal CV score: 0.8885111313558702
                                                                            
Generation 5 - Current best internal CV score: 0.8885111313558702
                                                                            
Best pipeline: LogisticRegression(LogisticRegression(input_matrix, C=0.01, dual=False, penalty=l2), C=25.0, dual=False, penalty=l2)
AUC-ROC score: 0.8895500636220336
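
The nested `LogisticRegression(LogisticRegression(...))` notation describes a stacked pipeline: the inner model's outputs are added to the feature matrix before the outer model is trained. The sketch below imitates that idea with plain scikit-learn. It is an approximation under stated assumptions, not TPOT's own class: `add_inner_outputs` is a hypothetical helper, the data is synthetic, and `max_iter=1000` is added only to help the solver converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the preprocessed data (assumption, not the project data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# inner model from the pipeline string: C=0.01
# (the l2 penalty and dual=False are scikit-learn defaults)
inner = LogisticRegression(C=0.01, max_iter=1000)
inner.fit(X_tr, y_tr)


def add_inner_outputs(model, features):
    """Hypothetical helper: prepend the inner model's predicted class and
    class probabilities to the original features."""
    pred = model.predict(features).reshape(-1, 1)
    proba = model.predict_proba(features)
    return np.hstack([pred, proba, features])


# outer model from the pipeline string: C=25.0, trained on the augmented features
outer = LogisticRegression(C=25.0, max_iter=1000)
outer.fit(add_inner_outputs(inner, X_tr), y_tr)

augmented_test = add_inner_outputs(inner, X_te)
auc = roc_auc_score(y_te, outer.predict_proba(augmented_test)[:, 1])
print(f"ROC-AUC on synthetic data: {auc:.4f}")
```

The exported `mdr_tpot_pipeline.py` remains the authoritative definition of the pipeline; this sketch only shows why two estimators appear in the string.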


  y = column_or_1d(y, warn=True)


In [40]:
# export best mdr_tpot
mdr_tpot.export('mdr_tpot_pipeline.py')