# Summary

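The best baseline ROC-AUC was 86.47%, from a Random Forest. TPOT with the `TPOT light` configuration reached a test ROC-AUC of 89.54%. TPOT with the `TPOT NN` configuration reached a test ROC-AUC of 93.48% in the run shown below; the pipeline it selected was an XGBClassifier stacked on a decision tree rather than a neural network.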


* TPOT stands for Tree-based Pipeline Optimization Tool. It uses genetic programming to search for a well-performing ML pipeline.
* TPOT requires numerical data. Since the data is already preprocessed, we move on to running TPOT directly.
* TPOT expects the target as a 1D array, so the target dataframes are raveled.
* The best pipelines are exported as `.py` files.
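
To make the genetic-programming idea above concrete, here is a toy sketch in plain Python. It is not TPOT's implementation: the search space, the `fitness` function, and every name in it are hypothetical stand-ins, and a "pipeline" is reduced to a dictionary of two settings. It only illustrates the loop of scoring a population, keeping the better candidates, and mutating them.

```python
import random

# Hypothetical search space: each candidate "pipeline" is a dict of two settings.
SEARCH_SPACE = {
    "max_depth": list(range(1, 11)),
    "min_samples_leaf": list(range(1, 21)),
}


def fitness(candidate):
    """Stand-in score (higher is better); a real search would use cross-validated ROC-AUC."""
    return -abs(candidate["max_depth"] - 7) - abs(candidate["min_samples_leaf"] - 20) / 10


def mutate(candidate, rng):
    """Copy the candidate and resample one randomly chosen setting."""
    child = dict(candidate)
    key = rng.choice(sorted(SEARCH_SPACE))
    child[key] = rng.choice(SEARCH_SPACE[key])
    return child


def evolve(generations=20, population_size=20, seed=42):
    rng = random.Random(seed)
    population = [
        {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Keep the better half, then refill the population with mutated copies.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the better half always survives, the best score never gets worse from one generation to the next. TPOT itself evolves whole pipelines (preprocessors plus models) and also uses crossover, which this sketch leaves out.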


In [1]:
! pip install tpot
! pip install ipywidgets
! jupyter nbextension enable --py widgetsnbextension
! pip install dask dask-ml

Collecting tpot
  Downloading TPOT-0.12.2-py3-none-any.whl.metadata (2.0 kB)
Collecting scikit-learn>=1.4.1 (from tpot)
  Downloading scikit_learn-1.4.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting deap>=1.2 (from tpot)
  Using cached deap-1.4.1-cp311-cp311-macosx_11_0_arm64.whl
Collecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Collecting stopit>=1.1.1 (from tpot)
  Using cached stopit-1.1.2-py3-none-any.whl
Downloading TPOT-0.12.2-py3-none-any.whl (87 kB)
Downloading scikit_learn-1.4.2-cp311-cp311-macosx_12_0_arm64.whl (10.5 MB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: stopit,

In [2]:
from tpot import TPOTClassifier
import pandas as pd 

In [3]:
# import preprocessed data
from pathlib import Path

DATA_DIR = Path('/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/datasets')

X_train = pd.read_csv(DATA_DIR / 'X_train.csv')
y_train = pd.read_csv(DATA_DIR / 'y_train.csv')
X_test = pd.read_csv(DATA_DIR / 'X_test.csv')
y_test = pd.read_csv(DATA_DIR / 'y_test.csv')
X_val = pd.read_csv(DATA_DIR / 'X_val.csv')
y_val = pd.read_csv(DATA_DIR / 'y_val.csv')


In [4]:
# reshape targets to 1d arrays; the feature matrices must stay 2D
# (ravel returns an array, so the result has to be assigned)

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
y_val = y_val.values.ravel()
y_val

array([0, 1, 1, ..., 0, 0, 0])
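
The effect of the reshape can be checked on a small stand-in frame. This is a minimal sketch: the column name `is_canceled` and the four values are hypothetical, chosen only to mimic a target read from a single-column CSV.

```python
import pandas as pd

# Hypothetical stand-in for a target loaded from a single-column CSV.
y_frame = pd.DataFrame({"is_canceled": [0, 1, 1, 0]})
print(y_frame.shape)  # (4, 1): a column vector

# ravel() returns a flat array; it does not modify the dataframe.
y_flat = y_frame.values.ravel()
print(y_flat.shape)  # (4,)

# The dataframe keeps its original shape unless the result is assigned back.
print(y_frame.shape)  # (4, 1)
```

Passing the `(n, 1)` version to a scikit-learn estimator still works, but it triggers the `column_or_1d` warning; passing the flat `(n,)` array avoids it.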

## Set up MLflow

In [9]:
#!pip install mlflow
import mlflow

# point MLflow at the tracking server first, then select the experiment by id
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment(experiment_id="936482171255835555")

mlflow.autolog()


2024/04/22 22:35:43 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [10]:
# search the 'TPOT light' configuration; max_time_mins caps the whole run at about 3 minutes
tpot = TPOTClassifier(generations=20, population_size=20, mutation_rate=0.05, verbosity=2,
                      scoring='roc_auc', cv=5, n_jobs=-1, max_time_mins=3, max_eval_time_mins=3,
                      random_state=42, config_dict='TPOT light')
tpot.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
2024/04/22 22:35:45 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'ee824c8eeb284a97ae6357a2e45ae457', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 22:35:48 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '01483d17da92438385eb52b77c0f56a2', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 22:35:49 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '5ea2fb6f5f6d480e93245c0f519b43d7', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 22:35:51 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '06069e910a2743aca88519d14f02385f', which will track hyperparameters, performance metric

Optimization Progress:  25%|██▌       | 5/20 [03:22<10:06, 40.42s/pipeline]

2024/04/22 22:39:42 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '2c3cf966c34c43459b1e27f14c08308e', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 22:39:43 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'e6a23405f731442ab6f59f0a57e8696b', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 22:39:46 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'd53e4dcf5e7c46469445206adb121936', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


                                                                           
4.03 minutes have elapsed. TPOT will close down.          
TPOT closed during evaluation in one generation.
                                                          
                                                          
TPOT closed prematurely. Will use the current best pipeline.
                                                          

2024/04/22 22:39:47 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '30b8ee36b7b1419693d8a80d87b15f41', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow




2024/04/22 22:39:50 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'eba66cc8c1964e48988463e03600e107', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow



Best pipeline: DecisionTreeClassifier(input_matrix, criterion=gini, max_depth=7, min_samples_leaf=20, min_samples_split=4)
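
As a rough check on the search budget: the TPOT documentation states that a run allowed to finish evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`. The helper below (`planned_evaluations` is a hypothetical name, not part of TPOT) applies that formula to the settings used in this notebook and multiplies by the number of cross-validation folds.

```python
def planned_evaluations(population_size, generations, offspring_size=None, cv=5):
    """Return (pipelines, model_fits) for a search that runs to completion."""
    if offspring_size is None:
        offspring_size = population_size  # documented TPOT default
    pipelines = population_size + generations * offspring_size
    return pipelines, pipelines * cv


# Settings of the 'TPOT light' search in this notebook.
light = planned_evaluations(population_size=20, generations=20, cv=5)
# Settings of the 'TPOT NN' search (cv left at TPOT's default of 5).
nn = planned_evaluations(population_size=5, generations=5, cv=5)

print(light)  # (420, 2100)
print(nn)     # (30, 150)
```

With `max_time_mins=3`, the search above was stopped long before that budget: the progress bar shows 5 pipelines after 3:22. The reported best pipeline therefore comes from a very small sample of the search space, and a longer time limit may find a better one.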


In [11]:
print(f"ROC-AUC score: {tpot.score(X_test, y_test)}")

  y = column_or_1d(y, warn=True)


ROC-AUC score: 0.8953535723254082
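
To interpret this number: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs, purely for illustration. The function name and the four data points are hypothetical, and for real data `sklearn.metrics.roc_auc_score` is the practical choice.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. A score of 0.895 means roughly nine out of ten such pairs are ranked correctly.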


In [13]:
# print best pipeline
print(tpot.fitted_pipeline_)

Pipeline(steps=[('decisiontreeclassifier',
                 DecisionTreeClassifier(max_depth=7, min_samples_leaf=20,
                                        min_samples_split=4,
                                        random_state=42))])


In [12]:
tpot.export('tpot_pipeline.py')
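
The selected pipeline can also be rebuilt directly in scikit-learn from the hyperparameters printed above, for example to score it on the validation split. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in, so the printed score says nothing about the hotel dataset; swap in `X_train`, `y_train`, `X_val`, and `y_val` to evaluate the real model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with the notebook's train/validation splits.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Hyperparameters copied from the fitted pipeline reported by TPOT.
model = DecisionTreeClassifier(
    criterion="gini",
    max_depth=7,
    min_samples_leaf=20,
    min_samples_split=4,
    random_state=42,
)
model.fit(X_tr, y_tr)

# ROC-AUC needs scores, so use the predicted probability of the positive class.
val_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
print(f"Validation ROC-AUC: {val_auc:.4f}")
```

Scoring on a split that TPOT never saw during the search gives a cleaner estimate than the internal cross-validation score, which is optimistically biased by the selection process.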

## Neural network classifier using TPOT-NN

In [14]:
# the 'TPOT NN' configuration additionally requires PyTorch to be installed
from tpot import TPOTClassifier

In [14]:
# log the TPOT-NN search under a separate MLflow experiment
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment(experiment_id="443836447855555990")

mlflow.autolog()

2024/04/22 23:11:54 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [15]:
nn_tpot = TPOTClassifier(config_dict='TPOT NN', verbosity=2, population_size=5, generations=5,
                         n_jobs=-1, max_time_mins=2, scoring='roc_auc')

# classes_ is only available after fitting, so these assertions confirm that fit() completed
assert not hasattr(nn_tpot, "classes_")
nn_tpot.fit(X_train, y_train)
assert hasattr(nn_tpot, "classes_")

2024/04/22 23:11:59 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
  y = column_or_1d(y, warn=True)
2024/04/22 23:11:59 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '22539f0c67ce4ad8a1f94cbe32bd28e7', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
Failed to import transformers.integrations.integration_utils because of the following error (look up to see its traceback):
cannot import name 'allow_in_graph' from partially initialized module 'torch._dynamo' (most likely due to a circular import) (/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/venv/lib/python3.11/site-packages/torch/_dynamo/__init__.py)
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x28b432480>
Traceback (most recent call last):
  File "/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/venv/lib/pytho

Optimization Progress:  20%|██        | 1/5 [00:43<02:55, 43.83s/pipeline]

2024/04/22 23:13:13 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '2689e039314c4bf7a57b427ea157c7fb', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 23:13:23 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'b77b1b04e2314fe8ae8644bc62c0dff1', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 23:13:34 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '96988c115e494d71980e333c92bea0f3', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x28b432480>
Traceback (most recent call last):
  File "/Users/chiaralu/Desktop/Courses/INSY 695/Group Project/hotel_cancellation_ML2/venv/lib/

                                                                           
Generation 1 - Current best internal CV score: 0.9310294569353934
Optimization Progress: 100%|██████████| 10/10 [01:35<00:00, 14.03s/pipeline]

2024/04/22 23:14:05 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '1f9d8c5a4ea7469cb158aa7c1adb58a3', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 23:14:06 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '568c9167206548e38cb3c78f56037a29', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 23:14:09 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'bcb2b541a40f45c6a83df23e70bce8fd', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/04/22 23:14:11 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '5696b7de0d6040038ae5fe5a99b04907', which will track hyperparameters, performance metrics, model artifacts, and lineage i

                                                                            
2.30 minutes have elapsed. TPOT will close down.                            
TPOT closed during evaluation in one generation.
                                                                            
                                                                            
TPOT closed prematurely. Will use the current best pipeline.
                                                                            

2024/04/22 23:14:16 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '853d8443d4704731b74b64276083ce3b', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow




2024/04/22 23:14:22 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '29b929481f8d44629686128abafaa2fa', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow



Best pipeline: XGBClassifier(DecisionTreeClassifier(input_matrix, criterion=gini, max_depth=4, min_samples_leaf=4, min_samples_split=9), learning_rate=0.5, max_depth=5, min_child_weight=11, n_estimators=100, n_jobs=1, subsample=0.6000000000000001, verbosity=0)


2024/04/22 23:14:37 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'ee481113bb554d2bb217b6550651fa2e', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In [16]:
print(f"AUC-ROC score: {nn_tpot.score(X_test, y_test)}")

  y = column_or_1d(y, warn=True)


AUC-ROC score: 0.9347729486671937


In [17]:
# export best nn_tpot
nn_tpot.export('nn_tpot_pipeline.py')