# Summary

This notebook runs every phase of Regression STREAMLINE, an automated machine learning analysis pipeline for regression tasks. Two potentially important steps are not automated by this pipeline: careful data cleaning, and feature engineering informed by problem-domain knowledge. Please review the README included in the associated GitHub repository for a detailed overview of how to run this pipeline. For simplicity, this notebook calls Python code stored in external files, so not all of the code it runs is visible here.

## Google Colab and Run Environment Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Load all required local Python files from Google Drive
from google.colab import files

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/ExploratoryAnalysisMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/ExploratoryAnalysisJob.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/DataPreprocessingMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/DataPreprocessingJob.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/FeatureImportanceMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/FeatureImportanceJob.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/FeatureSelectionMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/FeatureSelectionJob.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/ModelMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/ModelJob.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/l21regjob.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/smogn.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/StatsMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/StatsJob.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/DataCompareMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/DataCompareJob.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/PDF_ReportMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/PDF_ReportJob.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/PDF_ReportJob_Reg.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/ApplyModelMain.py /content
!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/ApplyModelJob.py /content

!cp /content/drive/MyDrive/STREAMLINE-Regression/streamline/FileCleanup.py /content
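The per-file copy commands above can also be written as a loop that reports which files were not found. This is a minimal sketch, assuming the same Drive folder layout; `copy_modules` and `MODULES` are illustrative names and are not part of STREAMLINE.

```python
import shutil
from pathlib import Path

# File names taken from the copy commands in the cell above.
MODULES = [
    "ExploratoryAnalysisMain.py", "ExploratoryAnalysisJob.py",
    "DataPreprocessingMain.py", "DataPreprocessingJob.py",
    "FeatureImportanceMain.py", "FeatureImportanceJob.py",
    "FeatureSelectionMain.py", "FeatureSelectionJob.py",
    "ModelMain.py", "ModelJob.py", "l21regjob.py", "smogn.py",
    "StatsMain.py", "StatsJob.py",
    "DataCompareMain.py", "DataCompareJob.py",
    "PDF_ReportMain.py", "PDF_ReportJob.py", "PDF_ReportJob_Reg.py",
    "ApplyModelMain.py", "ApplyModelJob.py",
    "FileCleanup.py",
]

def copy_modules(src_dir, dst_dir, names=MODULES):
    """Copy each named file from src_dir to dst_dir; return the names that were missing."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in names:
        source = src / name
        if source.is_file():
            shutil.copy2(source, dst / name)
        else:
            missing.append(name)
    return missing
```

In Colab, a call such as `copy_modules('/content/drive/MyDrive/STREAMLINE-Regression/streamline', '/content')` should return an empty list when every file is present.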

In [None]:
#Install remaining required packages not preinstalled in Google Colab
!pip install skrebate==0.7
!pip install xgboost
!pip install lightgbm
!pip install catboost
!pip install gplearn
!pip install scikit-eLCS
!pip install scikit-XCS
!pip install scikit-ExSTraCS
!pip install optuna==2.0.0
!pip install plotly
!pip install kaleido==0.0.3.post1
!pip install fpdf
!pip install group-lasso

## Notebook Housekeeping
Set up notebook cells to display desired results. No need to edit.

In [None]:
import warnings
import sys
import os
import shutil
warnings.filterwarnings('ignore')

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## -----------------------------------------------------------------------------------------------------------------
## (User Specified) Run Parameters of STREAMLINE
These initial notebook cells include all customizable run parameters for STREAMLINE. Leave these settings unchanged only if you want to run the pipeline demo as is, either to learn how it works or to confirm efficacy before running your own data. Run parameters for each phase of the pipeline are included in separate code cells of this section of the notebook.


### Mandatory Run Parameters for Pipeline

In [None]:
demo_run = False #Leave True to run the local demo dataset (without specifying any data paths); set to False to specify a different data folder path below

#Target dataset folder path (must include one or more .txt or .csv datasets)
data_path = "/content/drive/MyDrive/STREAMLINE-Regression/Measurements/Shu_AMIA_MidTemp" # (str) Demonstration Data Path Folder

#Output folder path: where to save pipeline outputs (must be updated for a given user)
output_path = '/content/drive/MyDrive/STREAMLINE-Regression/Colab_Output' # (str) Demonstration Output Path Folder

#Unique experiment name - folder created for this analysis within output folder path
experiment_name = 'Shu_MidTemp_experiment'  # (str) Demonstration Experiment Name

# Data Labels
class_label = 'Ferritin (ng/mL)' # (str) i.e. class outcome column label
instance_label = 'Class' # (str) If data includes instance labels, give the respective column name here, otherwise put 'None'

#Option to manually specify feature names to leave out of analysis, or which to treat as categorical (without using built in variable type detector)
ignore_features = [] # list of column names (given as string values) to exclude from the analysis (only insert column names if needed, otherwise leave empty)
categorical_feature_headers = [] # empty list for 'auto-detect' otherwise list feature names (given as string values) to be treated as categorical. Only impacts algorithms that can take variable type into account.
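Because a mistyped column label only surfaces once the pipeline is running, it can help to check the labels against a dataset header first. The sketch below is an illustration only: `check_labels` is a hypothetical helper, not a STREAMLINE function, and it assumes a delimited text file whose first row is the header.

```python
import csv

def check_labels(dataset_path, class_label, instance_label="None",
                 ignore_features=(), delimiter=","):
    """Return a list of problems found in the header row of a delimited dataset."""
    with open(dataset_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter))
    problems = []
    if class_label not in header:
        problems.append(f"class_label {class_label!r} not found in header")
    # The notebook uses the string 'None' to mean "no instance label column".
    if instance_label != "None" and instance_label not in header:
        problems.append(f"instance_label {instance_label!r} not found in header")
    for name in ignore_features:
        if name not in header:
            problems.append(f"ignored feature {name!r} not found in header")
    return problems
```

An empty list means every label was found; for tab-delimited files, pass `delimiter="\t"`.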

### Run Parameters for Phase 1: Exploratory Analysis

In [None]:
cv_partitions = 5  # (int, > 1) Number of training/testing data partitions to create - and resulting number of models generated using each ML algorithm
partition_method = 'R' # (str, S R or M) for stratified, random, or matched, respectively
match_label = 'None' # (str) Only applies when M selected for partition_method; indicates column label with matched instance ids

categorical_cutoff = 10 # (int) Number of unique values after which a variable is considered to be quantitative vs categorical
sig_cutoff = 0.05 # (float, 0-1) Significance cutoff used throughout pipeline
export_feature_correlations = 'True' # (str, True or False) Run and export feature correlation analysis (yields correlation heatmap)
export_univariate_plots = 'False' # (str, True or False) Export univariate analysis plots (note: univariate analysis still output by default)
topFeatures = 20 # (int) Number of top features to report in notebook for univariate analysis
random_state = 42 # (int) Sets a specific random seed for reproducible results
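To make two of these settings concrete, the sketch below shows one way `categorical_cutoff` and random partitioning (`partition_method = 'R'`) could behave. Both functions are illustrative and are not the pipeline's code; in particular, treating the cutoff as inclusive is an assumption.

```python
import random

def is_categorical(values, categorical_cutoff=10):
    """Treat a feature as categorical when it has at most categorical_cutoff unique values.

    Assumption: the boundary is inclusive.
    """
    return len(set(values)) <= categorical_cutoff

def random_partitions(n_instances, cv_partitions=5, random_state=42):
    """Shuffle row indices with a fixed seed and deal them into cv_partitions test folds."""
    indices = list(range(n_instances))
    random.Random(random_state).shuffle(indices)
    # Fold i takes every cv_partitions-th index, so fold sizes differ by at most one.
    return [indices[i::cv_partitions] for i in range(cv_partitions)]
```

Each fold serves once as the test set while the remaining folds form the training set, which is why `cv_partitions` also determines how many models are trained per algorithm.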

### Run Parameters for Phase 2: Data Preprocessing

In [None]:
scale_data = 'True' # (str, True or False) Perform data scaling?
impute_data = 'True' # (str, True or False) Perform missing value data imputation? (required for most ML algorithms if missing data is present)
overwrite_cv = 'True' # (str, True or False) Overwrites earlier cv datasets with new scaled/imputed ones
multi_impute = 'True' # (str, True or False) Applies multivariate imputation to quantitative features, otherwise uses mean imputation
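The sketch below illustrates the simpler branch of these options: mean imputation followed by standard scaling, with all statistics learned from the training partition and then reused on the test partition. It is an assumption-laden illustration, not the pipeline's implementation; the function names are hypothetical, missing values are represented as `None`, and the multivariate imputation selected by `multi_impute = 'True'` is not shown.

```python
from statistics import mean, pstdev

def fit_column(train_values):
    """Learn the fill value and scaling statistics from training data only."""
    observed = [v for v in train_values if v is not None]
    center = mean(observed)
    spread = pstdev(observed) or 1.0  # avoid dividing by zero for constant columns
    return center, spread

def transform_column(values, center, spread):
    """Mean-impute missing entries (None), then standardize with the training statistics."""
    filled = [center if v is None else v for v in values]
    return [(v - center) / spread for v in filled]
```

Applying the training statistics to the test column, rather than refitting on it, keeps information about the test partition out of the preprocessing step.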

### Run Parameters for Phase 3: Feature Importance Evaluation

In [None]:
do_mutual_info = 'True' # (str, True or False) Do mutual information analysis
do_multisurf = 'True' # (str, True or False) Do multiSURF analysis
use_TURF = 'False' # (str, True or False) Use TURF wrapper around MultiSURF
TURF_pct = 0.5 # (float, 0.01-0.5) Proportion of instances removed in an iteration (also dictates number of iterations)
njobs = -1 # (int) Number of cores dedicated to running algorithm; setting to -1 will use all available cores
instance_subset = 2000 # (int) Sample subset size to use with multiSURF
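As an illustration of what `instance_subset` implies, the sketch below draws a reproducible random subset of rows when a dataset is larger than the cap. How the pipeline samples internally is not shown in this notebook, so `subsample_rows` is a hypothetical stand-in.

```python
import random

def subsample_rows(rows, instance_subset=2000, random_state=42):
    """Return all rows if there are few enough, else a seeded random sample of them."""
    rows = list(rows)
    if len(rows) <= instance_subset:
        return rows
    return random.Random(random_state).sample(rows, instance_subset)
```

Using the same `random_state` yields the same subset on every run, which keeps feature importance scores comparable across reruns.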

### Run Parameters for Phase 4: Feature Selection

In [None]:
max_features_to_keep = 2000 # (int) Maximum features to keep.
filter_poor_features = 'False' # (str, True or False) Filter out the worst performing features prior to modeling
top_features = 40 # (int) Number of top features to illustrate in figures
export_scores = 'True' # (str, True or False) Export figure summarizing average feature importance scores over cv partitions
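The interaction between `filter_poor_features` and `max_features_to_keep` can be pictured with a simplified rule: rank features by their average importance score, drop those with non-positive scores, and cap the list. This is a hypothetical sketch for orientation only; the pipeline's actual selection rule across mutual information and MultiSURF is not reproduced here.

```python
def select_features(mean_scores, max_features_to_keep=2000, filter_poor_features=True):
    """Rank features by mean score; optionally drop non-positive ones and cap the count.

    mean_scores maps feature name -> average importance score over CV partitions.
    """
    ranked = sorted(mean_scores, key=mean_scores.get, reverse=True)
    if not filter_poor_features:
        return ranked
    informative = [name for name in ranked if mean_scores[name] > 0]
    return informative[:max_features_to_keep]
```

With `filter_poor_features` set to False, as in the cell above, every feature is passed on to modeling.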

### Run Parameters for Phase 5: Modeling

In [None]:
#ML Model Algorithm Options (individual hyperparameter options can be adjusted below)
do_all = 'False'
# Regression Algorithm
do_linReg = 'False'
do_ENReg = 'False'
do_RFReg = 'False'
do_AdaReg = 'False'
do_GradReg = 'False'
do_SVR = 'True'
do_GL = 'False'

#Group Lasso Parameters - Defaults available
groups_path = '/content/drive/MyDrive/STREAMLINE-Regression/streamline/groups.csv' # (str) Path of the defined groups
# Other Analysis Parameters
training_subsample = 0  # (int) For long-running algorithms, option to subsample the training set (0 for no subsample). Limits the sample size used to train algorithms that do not scale well to large instance spaces (i.e. XGB, SVM, KN, ANN, and LR to a lesser degree, and, depending on 'instances' settings, ExSTraCS, eLCS, and XCS)
use_uniform_FI = 'True' # (str, True or False) Overrides use of any available feature importance estimation methods from models, instead using permutation_importance uniformly
primary_metric = 'explained_variance' # (str) Must be an available metric identifier from (https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)

# Hyperparameter Sweep Options
n_trials = 50   # (int or None) Number of Bayesian hyperparameter optimization trials using optuna
timeout = 900    # (int or None) Seconds until hyperparameter sweep stops running new trials (Note: it may run longer to finish last trial started)
export_hyper_sweep_plots = 'True' # (str, True or False) Export hyper parameter sweep plots from optuna
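The way `n_trials` and `timeout` jointly bound a sweep can be sketched with a plain random search. Note that this deliberately swaps in random sampling for optuna's Bayesian sampler, and `budgeted_search` is a hypothetical function written for illustration.

```python
import time

def budgeted_search(objective, sample, n_trials=50, timeout=900):
    """Evaluate sampled parameters until n_trials is reached or timeout seconds elapse.

    The time check happens before each trial starts, so a trial already in
    progress is allowed to finish, as the timeout comment above describes.
    """
    if n_trials is None and timeout is None:
        raise ValueError("set n_trials, timeout, or both; otherwise the search never ends")
    start = time.monotonic()
    best_value, best_params, trials_run = float("-inf"), None, 0
    while n_trials is None or trials_run < n_trials:
        if timeout is not None and time.monotonic() - start >= timeout:
            break
        params = sample()
        value = objective(params)
        trials_run += 1
        if value > best_value:
            best_value, best_params = value, params
    return best_value, best_params, trials_run
```

Whichever limit is reached first ends the sweep, so a short `timeout` can stop the search well before `n_trials` evaluations have run.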

### Hyperparameter Sweep Options for ML Algorithms
Users can extend or limit the range or options for given ML algorithm hyperparameters to be tested in hyperparameter optimization. These options are hardcoded when running this pipeline from the command line, but they are available here for users to see and modify. We have sought to include a broad range of relevant configurations based on online examples and relevant research publications. Use caution when modifying values below as improper modifications will lead to pipeline errors/failure. Links to available hyperparameter options for each algorithm are included below.

In [None]:
def hyperparameters(random_state,feature_names):
    param_grid = {}

    # Elastic Net Regressor
    # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet
    param_grid_EN = {'alpha':[1e-3,1],'l1_ratio':[0,1],
                     'max_iter': [10,2500],'random_state':[random_state]}

    # Random Forest Regressor
    # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
    param_grid_RF = {'n_estimators': [10, 1000],'max_depth': [1, 30],'min_samples_split': [2, 50],
                     'min_samples_leaf': [1, 50],'max_features': [None, 'auto', 'log2'],
                     'bootstrap': [True],'oob_score': [False, True],'random_state':[random_state]}

    # AdaBoost Regressor
    # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html
    param_grid_AdaB = {'n_estimators': [10, 1000], 'learning_rate': [.0001, 0.3], 'loss': ['linear', 'square', 'exponential']}

    # GradientBoosting Regressor
    # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
    param_grid_GradB = {'learning_rate': [.0001, 0.3],'n_estimators': [10, 1000],
                     'min_samples_leaf': [1, 50],'min_samples_split': [2, 50], 'max_depth': [1, 30],
                     'random_state':[random_state]}

    # Epsilon-Support Vector Regression
    # https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
    param_grid_SVR = {'kernel': ['linear', 'poly', 'rbf'],'C': [0.1, 1000],'gamma': ['scale'],'degree': [1, 6]}

    # Group Lasso Regressor
    # https://group-lasso.readthedocs.io/en/latest/api_reference.html#
    param_grid_GL = {'group_reg':[1e-3,1],#'l1_reg':[0,1],
                     'n_iter':[10,2500],
                     'scale_reg': ['group_size', 'none', 'inverse_group_size'],
                     #'subsampling_scheme': [0.1,0.9],
                     #'frobenius_lipschitz': [True],
                     'random_state':[random_state]}

    #Leave code below as is...
    param_grid['Linear Regression'] = {}
    param_grid['Elastic Net'] = param_grid_EN
    param_grid['Group Lasso'] = param_grid_GL
    param_grid['RF Regressor'] = param_grid_RF
    param_grid['AdaBoost'] = param_grid_AdaB
    param_grid['GradBoost'] = param_grid_GradB
    param_grid['SVR'] = param_grid_SVR
    return param_grid
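The grids above mix two kinds of entries: two-number lists that read as `[low, high]` bounds (for example `'C': [0.1, 1000]`) and lists of discrete options (for example `'kernel'`). The sketch below shows one way such a grid could be turned into a single sampled configuration. The "two non-boolean numbers means a range" rule is a heuristic assumed for this example; it is not taken from the pipeline's modeling code, and `is_range` and `sample_params` are hypothetical names.

```python
import random

def is_range(options):
    """Heuristic: exactly two non-boolean numbers are read as [low, high] bounds."""
    return len(options) == 2 and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in options
    )

def sample_params(grid, rng):
    """Draw one configuration from a grid of ranges and categorical option lists."""
    params = {}
    for name, options in grid.items():
        if is_range(options):
            low, high = options
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)   # integer range, inclusive
            else:
                params[name] = rng.uniform(low, high)   # continuous range
        else:
            params[name] = rng.choice(options)          # categorical choice
    return params
```

Single-element lists such as `'gamma': ['scale']` act as fixed values under this reading, and `[False, True]` is treated as a categorical choice rather than a numeric range.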

### Run Parameters for Phase 6:  Statistics Summary and Figure Generation

In [None]:
plot_FI_box = 'True' # (str, True or False) Plot feature importance boxplots for each algorithm
plot_metric_boxplots = 'True' # (str, True or False) Plot box plot summaries comparing algorithms for each metric
metric_weight = 'explained_variance' # (str, balanced_accuracy or roc_auc) ML model metric used as weight in composite FI plots (only supports balanced_accuracy or roc_auc as options) Recommend setting the same as primary_metric if possible.
top_model_features = 40  # (int) Number of top features in model to illustrate in figures

### Run Parameters for Phase 10:  Apply Models to Replication Dataset
An optional phase to apply all trained models from previous phases to a separate 'replication' dataset which will be used to evaluate models across all algorithms and CV splits. In this demo, we didn't have a separate replication dataset to use for the UCI HCC dataset evaluated. Thus here we use a copy of the original HCC dataset as a 'pretend' replication dataset to demonstrate functionality. The replication data folder can include 1 or more datasets that can be evaluated as separate replication data. The user also needs to

In [None]:
applyToReplication = False # (Boolean, True or False) Leave false unless you have a replication dataset handy to further evaluate/compare all models in uniform manner
rep_data_path = "/content/drive/MyDrive/STREAMLINE-main/DemoRepData" # (txt) Name of folder with replication Dataset(s)
dataset_for_rep = "/content/drive/MyDrive/STREAMLINE-main/DemoRepData/hcc-data_example_rep.csv" # (txt) Path and name of dataset used to generate the models we want to apply (not the replication dataset)

### Run Parameters for Phase 11:  File Cleanup
An optional phase to delete all unnecessary/temporary files generated by the pipeline.

In [None]:
del_time = 'True'  # (str, True or False) Delete individual run-time files (but save summary)
del_oldCV = 'True' # (str, True or False) Delete any of the older versions of CV training and testing datasets not overwritten (preserves final training and testing datasets)

## -----------------------------------------------------------------------------------------------------------------
## Phase 1: Exploratory Analysis

### Identify Working Directory

In [None]:
wd_path = os.getcwd() #Working directory path automatically detected
wd_path = wd_path.replace('\\','/')
sys.path.insert(1, wd_path+'/streamline')

### Import Python Packages

In [None]:
import glob
import time
import csv
import pandas as pd
import numpy as np
import random
import pickle
import ExploratoryAnalysisMain
import ExploratoryAnalysisJob

### Demo Setup
Bypasses whatever user may have entered into 'data_path' variable to ensure proper loading of local 'demo' dataset.

In [None]:
if demo_run:
    data_path = wd_path+'/drive/MyDrive/STREAMLINE-main/DemoData'
print("Data Folder Path: "+data_path)
jupyterRun = 'True' #Leave True or pipeline will not display text or figures

Data Folder Path: /content/drive/MyDrive/STREAMLINE-Regression/Measurements/Shu_AMIA_MidTemp


### Run Exploratory Analysis

In [None]:
ExploratoryAnalysisMain.makeDirTree(data_path,output_path,experiment_name,jupyterRun)

Exception: ignored

In [None]:
#Determine file extension of datasets in target folder:
file_count = 0
unique_datanames = []
for dataset_path in glob.glob(data_path+'/*'):
    dataset_path = str(dataset_path).replace('\\','/')
    print('---------------------------------------------------------------------------------')
    print(dataset_path)
    file_extension = dataset_path.split('/')[-1].split('.')[-1]
    data_name = dataset_path.split('/')[-1].split('.')[0] #Save unique dataset names so that analysis is run only once if there is both a .txt and .csv version of dataset with same name.
    if file_extension == 'txt' or file_extension == 'csv':
        if data_name not in unique_datanames:
            unique_datanames.append(data_name)
            ExploratoryAnalysisJob.runExplore(dataset_path,output_path+'/'+experiment_name,cv_partitions,partition_method,categorical_cutoff,export_feature_correlations,export_univariate_plots,class_label,instance_label,match_label,random_state,ignore_features,categorical_feature_headers,sig_cutoff,jupyterRun)
            file_count += 1

if file_count == 0: #Check that there was at least 1 dataset
    raise Exception("There must be at least one .txt or .csv dataset in data_path directory")

#Create metadata dictionary object to keep track of pipeline run parameters throughout phases
metadata = {}
metadata['Data Path'] = data_path
metadata['Output Path'] = output_path
metadata['Experiment Name'] = experiment_name
metadata['Class Label'] = class_label
metadata['Instance Label'] = instance_label
metadata['Ignored Features'] = ignore_features
metadata['Specified Categorical Features'] = categorical_feature_headers
metadata['CV Partitions'] = cv_partitions
metadata['Partition Method'] = partition_method
metadata['Match Label'] = match_label
metadata['Categorical Cutoff'] = categorical_cutoff
metadata['Statistical Significance Cutoff'] = sig_cutoff
metadata['Export Feature Correlations'] = export_feature_correlations
metadata['Export Univariate Plots'] = export_univariate_plots
metadata['Random Seed'] = random_state
metadata['Run From Jupyter Notebook'] = jupyterRun
#Pickle the metadata for future use
pickle_out = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'wb')
pickle.dump(metadata,pickle_out)
pickle_out.close()
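The dataset discovery loop in the cell above can be isolated into a small function for testing. This sketch keeps the same naming rule (the dataset name is the text before the first dot) but sorts the folder contents so that the choice between a `.txt` and a `.csv` copy is deterministic; `discover_datasets` is an illustrative name, not part of STREAMLINE.

```python
from pathlib import Path

def discover_datasets(data_path, extensions=("txt", "csv")):
    """Return one path per unique dataset name among the .txt/.csv files in data_path."""
    seen, selected = set(), []
    for path in sorted(Path(data_path).iterdir()):
        extension = path.name.split(".")[-1]
        data_name = path.name.split(".")[0]  # same rule as the cell above
        if path.is_file() and extension in extensions and data_name not in seen:
            seen.add(data_name)
            selected.append(path)
    if not selected:
        raise ValueError("There must be at least one .txt or .csv dataset in data_path directory")
    return selected
```

Because names are compared on the text before the first dot, files such as `data.v1.csv` and `data.v2.csv` would be treated as the same dataset, both here and in the original loop.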


## -----------------------------------------------------------------------------------------------------------------
## Phase 2: Data Preprocessing

### Import Additional Python Packages

In [None]:
import DataPreprocessingJob

### Run Data Preprocessing

In [None]:
dataset_paths = os.listdir(output_path+"/"+experiment_name)
dataset_paths.remove('metadata.pickle')
for dataset_directory_path in dataset_paths:
    full_path = output_path+"/"+experiment_name+"/"+dataset_directory_path
    print(dataset_directory_path)
    for cv_train_path in glob.glob(full_path+"/CVDatasets/*Train.csv"):
        cv_train_path = str(cv_train_path).replace('\\','/')
        cv_test_path = cv_train_path.replace("Train.csv","Test.csv")
        DataPreprocessingJob.job(cv_train_path,cv_test_path,output_path+'/'+experiment_name,scale_data,impute_data,overwrite_cv,categorical_cutoff,class_label,instance_label,random_state,multi_impute,jupyterRun)


#Unpickle metadata from previous phase
file = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'rb')
metadata = pickle.load(file)
file.close()

#Update metadata
metadata['Use Data Scaling'] = scale_data
metadata['Use Data Imputation'] = impute_data
metadata['Use Multivariate Imputation'] = multi_impute
#Pickle the metadata for future use
pickle_out = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'wb')
pickle.dump(metadata,pickle_out)
pickle_out.close()

jobsCompleted
QTPAD_data_MidTemp
Preparing Train and Test for: QTPAD_data_MidTemp_CV_0
Imputing Missing Values...
Notice: No missing values found. Imputation skipped.
Scaling Data Values...
Saving Processed Train and Test Data...
QTPAD_data_MidTemp phase 2 complete
Preparing Train and Test for: QTPAD_data_MidTemp_CV_1
Imputing Missing Values...
Notice: No missing values found. Imputation skipped.
Scaling Data Values...
Saving Processed Train and Test Data...
QTPAD_data_MidTemp phase 2 complete
Preparing Train and Test for: QTPAD_data_MidTemp_CV_2
Imputing Missing Values...
Notice: No missing values found. Imputation skipped.
Scaling Data Values...
Saving Processed Train and Test Data...
QTPAD_data_MidTemp phase 2 complete
Preparing Train and Test for: QTPAD_data_MidTemp_CV_3
Imputing Missing Values...
Notice: No missing values found. Imputation skipped.
Scaling Data Values...
Saving Processed Train and Test Data...
QTPAD_data_MidTemp phase 2 complete
Preparing Train and Test for: QTPAD

## -----------------------------------------------------------------------------------------------------------------
## Phase 3: Feature Importance Evaluation

### Import Additional Python Packages

In [None]:
import FeatureImportanceJob

### Run Feature Importance Evaluation

In [None]:
dataset_paths = os.listdir(output_path+"/"+experiment_name)
removeList = ['metadata.pickle','metadata.csv','algInfo.pickle','jobsCompleted','logs','jobs','DatasetComparisons','UsefulNotebooks',experiment_name+'_ML_Pipeline_Report.pdf']
for text in removeList:
    if text in dataset_paths:
        dataset_paths.remove(text)

for dataset_directory_path in dataset_paths:
    full_path = output_path+"/"+experiment_name+"/"+dataset_directory_path
    experiment_path = output_path+'/'+experiment_name

    if eval(do_mutual_info) or eval(do_multisurf):
        if not os.path.exists(full_path+"/feature_selection"):
            os.mkdir(full_path+"/feature_selection")

    if eval(do_mutual_info):
        if not os.path.exists(full_path+"/feature_selection/mutualinformation"):
            os.mkdir(full_path+"/feature_selection/mutualinformation")
        for cv_train_path in glob.glob(full_path+"/CVDatasets/*_CV_*Train.csv"):
            cv_train_path = str(cv_train_path).replace('\\','/')
            FeatureImportanceJob.job(cv_train_path,experiment_path,random_state,class_label,instance_label,instance_subset,'mi',njobs,use_TURF,TURF_pct,jupyterRun)

    if eval(do_multisurf):
        if not os.path.exists(full_path+"/feature_selection/multisurf"):
            os.mkdir(full_path+"/feature_selection/multisurf")
        for cv_train_path in glob.glob(full_path+"/CVDatasets/*_CV_*Train.csv"):
            cv_train_path = str(cv_train_path).replace('\\','/')
            FeatureImportanceJob.job(cv_train_path,experiment_path,random_state,class_label,instance_label,instance_subset,'ms',njobs,use_TURF,TURF_pct,jupyterRun)

#Unpickle metadata from previous phase
file = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'rb')
metadata = pickle.load(file)
file.close()

#Update metadata
metadata['Use Mutual Information'] = do_mutual_info
metadata['Use MultiSURF'] = do_multisurf
metadata['Use TURF'] = use_TURF
metadata['TURF Cutoff'] = TURF_pct
metadata['MultiSURF Instance Subset'] = instance_subset
#Pickle the metadata for future use
pickle_out = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'wb')
pickle.dump(metadata,pickle_out)
pickle_out.close()
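Each phase ends with the same unpickle, update, and re-pickle sequence. A minimal sketch of that pattern as a single helper is shown below, using `with` blocks so the file is closed even if an error occurs; `update_metadata` is a hypothetical name and is not defined by STREAMLINE.

```python
import pickle
from pathlib import Path

def update_metadata(experiment_path, entries):
    """Merge entries into <experiment_path>/metadata.pickle and return the merged dict."""
    path = Path(experiment_path) / "metadata.pickle"
    metadata = {}
    if path.exists():
        with open(path, "rb") as handle:
            metadata = pickle.load(handle)
    metadata.update(entries)
    with open(path, "wb") as handle:
        pickle.dump(metadata, handle)
    return metadata
```

For this phase the call would look like `update_metadata(output_path+'/'+experiment_name, {'Use TURF': use_TURF, 'TURF Cutoff': TURF_pct})`, extended with the remaining keys.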

## -----------------------------------------------------------------------------------------------------------------
## Phase 4: Feature Selection

### Import Additional Python Packages

In [None]:
import FeatureSelectionJob

### Run Feature Selection

In [None]:
dataset_paths = os.listdir(output_path + "/" + experiment_name)
removeList = ['metadata.pickle','metadata.csv','algInfo.pickle','jobsCompleted','logs','jobs','DatasetComparisons','UsefulNotebooks',experiment_name+'_ML_Pipeline_Report.pdf']
for text in removeList:
    if text in dataset_paths:
        dataset_paths.remove(text)

for dataset_directory_path in dataset_paths:
    full_path = output_path + "/" + experiment_name + "/" + dataset_directory_path
    FeatureSelectionJob.job(full_path,do_mutual_info,do_multisurf,max_features_to_keep,filter_poor_features,top_features,export_scores,class_label,instance_label,cv_partitions,overwrite_cv,jupyterRun)

#Unpickle metadata from previous phase
file = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'rb')
metadata = pickle.load(file)
file.close()

#Update metadata
metadata['Max Features to Keep'] = max_features_to_keep
metadata['Filter Poor Features'] = filter_poor_features
metadata['Top Features to Display'] = top_features
metadata['Export Feature Importance Plot'] = export_scores
metadata['Overwrite CV Datasets'] = overwrite_cv
#Pickle the metadata for future use
pickle_out = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'wb')
pickle.dump(metadata,pickle_out)
pickle_out.close()
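Phases 3 through 5 each rebuild the list of dataset folders by removing known non-dataset entries from the experiment folder. The sketch below gathers that filtering into one function; the entry names are copied from `removeList` above, while `list_dataset_dirs` and `NON_DATASET_ENTRIES` are illustrative names.

```python
import os

# Entries that live in the experiment folder but are not dataset folders.
NON_DATASET_ENTRIES = {
    'metadata.pickle', 'metadata.csv', 'algInfo.pickle', 'jobsCompleted',
    'logs', 'jobs', 'DatasetComparisons', 'UsefulNotebooks',
}

def list_dataset_dirs(experiment_path, experiment_name):
    """Return the sorted dataset folder names inside an experiment folder."""
    skip = NON_DATASET_ENTRIES | {experiment_name + '_ML_Pipeline_Report.pdf'}
    return sorted(entry for entry in os.listdir(experiment_path) if entry not in skip)
```

Sorting the result is a choice made here for reproducible ordering; `os.listdir` itself returns entries in arbitrary order.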

## -----------------------------------------------------------------------------------------------------------------
## Phase 5: ML Modeling

### Phase 5 Import Additional Python Packages

In [None]:
import ModelJob

In [None]:
#Create ML modeling algorithm information dictionary, given as ['algorithm used (set to true initially by default)','algorithm abbreviation', 'color used for algorithm on figures']
### Note that other named colors used by matplotlib can be found here: https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png
### Make sure new ML algorithm abbreviations and color designations are unique
algInfo = {}

algInfo['Linear Regression'] = [True,'Linear Regression','red']
algInfo['Elastic Net'] = [True, 'Elastic Net', 'steelblue']
algInfo['Group Lasso'] = [True, 'Group Lasso', 'orange']
algInfo['RF Regressor'] = [True, 'RF Regressor', 'navy']
algInfo['AdaBoost'] = [True, 'AdaBoost', 'teal']
algInfo['GradBoost'] = [True, 'GradBoost', 'olive']
algInfo['SVR'] = [True, 'SVR', 'rosybrown']
### Add new algorithms here...


#Set up ML algorithm True/False use
if not eval(do_all): #If do all algorithms is false
    for key in algInfo:
        algInfo[key][0] = False #Set algorithm use to False

#Set algorithm use truth for each algorithm specified by user (i.e. if user specified True/False for a specific algorithm)
if not do_linReg == 'None':
    algInfo['Linear Regression'][0] = eval(do_linReg)
if not do_ENReg == 'None':
    algInfo['Elastic Net'][0] = eval(do_ENReg)
if not do_GL == 'None':
    algInfo['Group Lasso'][0] = eval(do_GL)
if not do_RFReg == 'None':
    algInfo['RF Regressor'][0] = eval(do_RFReg)
if not do_AdaReg == 'None':
    algInfo['AdaBoost'][0] = eval(do_AdaReg)
if not do_GradReg == 'None':
    algInfo['GradBoost'][0] = eval(do_GradReg)
if not do_SVR == 'None':
    algInfo['SVR'][0] = eval(do_SVR)
### Add new algorithms here...


#Pickle the algorithm information dictionary for future use
pickle_out = open(output_path+'/'+experiment_name+'/'+"algInfo.pickle", 'wb')
pickle.dump(algInfo,pickle_out)
pickle_out.close()

#Make list of algorithms to be run (full names)
algorithms = []
for key in algInfo:
    if algInfo[key][0]: #Algorithm is true
        algorithms.append(key)
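The cell above converts the `'True'`/`'False'` strings with `eval`, which will execute any expression it is given. A stricter alternative is sketched below under the assumption that only the strings `'True'`, `'False'`, and `'None'` are valid flag values; `parse_flag` and `resolve_use` are hypothetical helpers, not part of the pipeline.

```python
def parse_flag(value):
    """Map the strings 'True', 'False', and 'None' to True, False, and None."""
    if value not in ("True", "False", "None"):
        raise ValueError(f"expected 'True', 'False', or 'None', got {value!r}")
    return {"True": True, "False": False, "None": None}[value]

def resolve_use(do_all, do_algorithm):
    """Decide whether an algorithm runs: its own flag wins unless it is 'None'."""
    specific = parse_flag(do_algorithm)
    if specific is None:
        return bool(parse_flag(do_all))
    return specific
```

With the settings in this notebook, `resolve_use(do_all, do_SVR)` evaluates to True and the same call for the other algorithms evaluates to False, matching the `algInfo` entries built above.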

### Run ML Modeling

In [None]:
dataset_paths = os.listdir(output_path + "/" + experiment_name)
removeList = ['metadata.pickle','metadata.csv','algInfo.pickle','jobsCompleted','logs','jobs','DatasetComparisons','UsefulNotebooks',experiment_name+'_ML_Pipeline_Report.pdf']
for text in removeList:
    if text in dataset_paths:
        dataset_paths.remove(text)
print(dataset_paths)
for dataset_directory_path in dataset_paths:
    full_path = output_path + "/" + experiment_name + "/" + dataset_directory_path
    if not os.path.exists(full_path+'/models'):
        os.mkdir(full_path+'/models')
    if not os.path.exists(full_path+'/model_evaluation'):
        os.mkdir(full_path+'/model_evaluation')
    if not os.path.exists(full_path+'/models/pickledModels'):
        os.mkdir(full_path+'/models/pickledModels')
    for cvCount in range(cv_partitions):
        train_file_path = full_path+'/CVDatasets/'+dataset_directory_path+"_CV_"+str(cvCount)+"_Train.csv"
        test_file_path = full_path + '/CVDatasets/' + dataset_directory_path + "_CV_" + str(cvCount) + "_Test.csv"
        for algorithm in algorithms:
            print(algorithm)
            algAbrev = algInfo[algorithm][1]
            #Get header names for current CV dataset for use later in GP tree visualization
            data_name = full_path.split('/')[-1]
            feature_names = pd.read_csv(full_path+'/CVDatasets/'+data_name+'_CV_'+str(cvCount)+'_Test.csv').columns.values.tolist()
            if instance_label != 'None':
                feature_names.remove(instance_label)
            feature_names.remove(class_label)
            #Get hyperparameter grid
            param_grid = hyperparameters(random_state,feature_names)[algorithm]
            ModelJob.runModel(algorithm,train_file_path,test_file_path,full_path,n_trials,timeout,export_hyper_sweep_plots,instance_label,class_label,random_state,cvCount,filter_poor_features,training_subsample,use_uniform_FI,primary_metric,param_grid,groups_path,algAbrev)

#Unpickle metadata from previous phase
file = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'rb')
metadata = pickle.load(file)
file.close()

#Update metadata
### Add new algorithms here...
metadata['Linear Regression'] = str(algInfo['Linear Regression'][0])
metadata['Elastic Net'] = str(algInfo['Elastic Net'][0])
metadata['Group Lasso'] = str(algInfo['Group Lasso'][0])
metadata['RF Regressor'] = str(algInfo['RF Regressor'][0])
metadata['AdaBoost'] = str(algInfo['AdaBoost'][0])
metadata['GradBoost'] = str(algInfo['GradBoost'][0])
metadata['SVR'] = str(algInfo['SVR'][0])

metadata['Primary Metric'] = primary_metric
metadata['Uniform Feature Importance Estimation (Models)'] = use_uniform_FI
metadata['Hyperparameter Sweep Number of Trials'] = n_trials
metadata['Hyperparameter Timeout'] = timeout
metadata['Export Hyperparameter Sweep Plots'] = export_hyper_sweep_plots

#Pickle the metadata for future use
pickle_out = open(output_path+'/'+experiment_name+'/'+"metadata.pickle", 'wb')
pickle.dump(metadata,pickle_out)
pickle_out.close()

[I 2023-06-27 18:24:15,746] Trial 0 finished with value: 0.3408853677072674 and parameters: {'alpha': 0.013292918943162165, 'l1_ratio': 0.9507143064099162, 'max_iter': 1140, 'random_state': 42}. Best is trial 0 with value: 0.3408853677072674.


['winequality-red']
Linear Regression
weights: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
0.48709141436263526 0.359203043829853 0.6080024113580738
winequality-red [CV_0] (Linear Regression) training complete. ------------------------------------
Elastic Net


[I 2023-06-27 18:24:15,773] Trial 1 finished with value: 0.2664960239638324 and parameters: {'alpha': 0.21830968390524597, 'l1_ratio': 0.596850157946487, 'max_iter': 2179, 'random_state': 42}. Best is trial 0 with value: 0.3408853677072674.
[I 2023-06-27 18:24:15,794] Trial 2 finished with value: 0.333494841735038 and parameters: {'alpha': 0.0029375384576328283, 'l1_ratio': 0.05808361216819946, 'max_iter': 2145, 'random_state': 42}. Best is trial 0 with value: 0.3408853677072674.
[I 2023-06-27 18:24:15,814] Trial 3 finished with value: 0.3359850967876031 and parameters: {'alpha': 0.010025956902289565, 'l1_ratio': 0.14286681792194078, 'max_iter': 140, 'random_state': 42}. Best is trial 0 with value: 0.3408853677072674.
[I 2023-06-27 18:24:15,835] Trial 4 finished with value: 0.3344035158887722 and parameters: {'alpha': 0.00115279871282324, 'l1_ratio': 0.9699098521619943, 'max_iter': 1525, 'random_state': 42}. Best is trial 0 with value: 0.

Best trial:
  Value:  0.341169660237735
  Params: 
    alpha: 0.027982323345358308
    l1_ratio: 0.38171491149858633
    max_iter: 139
    random_state: 42
ElasticNet(alpha=0.027982323345358308, l1_ratio=0.38171491149858633,
           max_iter=139, random_state=42)
weights: {'alpha': 0.027982323345358308, 'copy_X': True, 'fit_intercept': True, 'l1_ratio': 0.38171491149858633, 'max_iter': 139, 'positive': False, 'precompute': False, 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
0.4966536698460862 0.3466995318328333 0.6009512972356371
winequality-red [CV_0] (Elastic Net) training complete. ------------------------------------
RF Regressor


[I 2023-06-27 18:24:21,160] Trial 0 finished with value: 0.37766011834006447 and parameters: {'n_estimators': 112, 'max_depth': 20, 'min_samples_split': 30, 'min_samples_leaf': 15, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.37766011834006447.
[I 2023-06-27 18:24:25,714] Trial 1 finished with value: 0.3707956952526275 and parameters: {'n_estimators': 710, 'max_depth': 21, 'min_samples_split': 40, 'min_samples_leaf': 19, 'max_features': 'log2', 'bootstrap': True, 'oob_score': False, 'random_state': 42}. Best is trial 0 with value: 0.37766011834006447.
[I 2023-06-27 18:24:30,673] Trial 2 finished with value: 0.33469671953934316 and parameters: {'n_estimators': 468, 'max_depth': 24, 'min_samples_split': 37, 'min_samples_leaf': 40, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.37766011834006447.
[I 2023-06-27 18:

Best trial:
  Value:  0.4295781951610133
  Params: 
    n_estimators: 250
    max_depth: 15
    min_samples_split: 9
    min_samples_leaf: 1
    max_features: None
    bootstrap: True
    oob_score: True
    random_state: 42
RandomForestRegressor(max_depth=15, max_features=None, min_samples_split=9,
                      n_estimators=250, oob_score=True, random_state=42)
weights: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 15, 'max_features': None, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 9, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 250, 'n_jobs': None, 'oob_score': True, 'random_state': 42, 'verbose': 0, 'warm_start': False}
0.36932041970733004 0.5145685388837051 0.72337659878839
winequality-red [CV_0] (RF Regressor) training complete. ------------------------------------
AdaBoost


[I 2023-06-27 18:31:01,680] Trial 0 finished with value: 0.3584523159968957 and parameters: {'n_estimators': 112, 'learning_rate': 0.23898324175938382, 'loss': 'exponential'}. Best is trial 0 with value: 0.3584523159968957.
[I 2023-06-27 18:31:02,899] Trial 1 finished with value: 0.3557679332628523 and parameters: {'n_estimators': 116, 'learning_rate': 0.2339293309818035, 'loss': 'linear'}. Best is trial 0 with value: 0.3584523159968957.
[I 2023-06-27 18:31:10,670] Trial 2 finished with value: 0.34473293662731347 and parameters: {'n_estimators': 624, 'learning_rate': 0.13380524258079196, 'loss': 'exponential'}. Best is trial 0 with value: 0.3584523159968957.
[I 2023-06-27 18:31:13,527] Trial 3 finished with value: 0.35279298238729545 and parameters: {'n_estimators': 340, 'learning_rate': 0.13782874270056356, 'loss': 'linear'}. Best is trial 0 with value: 0.3584523159968957.
[I 2023-06-27 18:31:14,555] Trial 4 finished with va

Best trial:
  Value:  0.3666814203773476
  Params: 
    n_estimators: 74
    learning_rate: 0.279564522107791
    loss: exponential
AdaBoostRegressor(learning_rate=0.279564522107791, loss='exponential',
                  n_estimators=74)
weights: {'base_estimator': 'deprecated', 'estimator': None, 'learning_rate': 0.279564522107791, 'loss': 'exponential', 'n_estimators': 74, 'random_state': None}
0.459783155902376 0.3951650359198048 0.6355648039368456
winequality-red [CV_0] (AdaBoost) training complete. ------------------------------------
GradBoost


[I 2023-06-27 18:33:19,997] Trial 0 finished with value: 0.38990572180621097 and parameters: {'learning_rate': 0.0020059560245279666, 'n_estimators': 870, 'min_samples_leaf': 15, 'min_samples_split': 44, 'max_depth': 8, 'random_state': 42}. Best is trial 0 with value: 0.38990572180621097.
[I 2023-06-27 18:33:29,858] Trial 1 finished with value: 0.39752327499874385 and parameters: {'learning_rate': 0.012067245262919609, 'n_estimators': 624, 'min_samples_leaf': 19, 'min_samples_split': 24, 'max_depth': 11, 'random_state': 42}. Best is trial 1 with value: 0.39752327499874385.
[I 2023-06-27 18:33:33,506] Trial 2 finished with value: 0.3789968329971675 and parameters: {'learning_rate': 0.003952429057290443, 'n_estimators': 382, 'min_samples_leaf': 36, 'min_samples_split': 41, 'max_depth': 24, 'random_state': 42}. Best is trial 1 with value: 0.39752327499874385.
[I 2023-06-27 18:33:41,753] Trial 3 finished with value: 0.37654998111068627 and pa

Best trial:
  Value:  0.41198223989317256
  Params: 
    learning_rate: 0.006678902831141084
    n_estimators: 520
    min_samples_leaf: 42
    min_samples_split: 29
    max_depth: 28
    random_state: 42
GradientBoostingRegressor(learning_rate=0.006678902831141084, max_depth=28,
                          min_samples_leaf=42, min_samples_split=29,
                          n_estimators=520, random_state=42)
weights: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.006678902831141084, 'loss': 'squared_error', 'max_depth': 28, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 42, 'min_samples_split': 29, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 520, 'n_iter_no_change': None, 'random_state': 42, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
0.39299180077839313 0.4829668139853692 0.7057233994106
winequality-red [CV_0] (GradBoost) training com

[I 2023-06-27 18:38:10,726] Trial 0 finished with value: -0.28176270903769773 and parameters: {'kernel': 'rbf', 'C': 153.52246941973468, 'gamma': 'scale', 'degree': 3}. Best is trial 0 with value: -0.28176270903769773.
[I 2023-06-27 18:38:14,228] Trial 1 finished with value: 0.3280009184449913 and parameters: {'kernel': 'linear', 'C': 24.400607090817502, 'gamma': 'scale', 'degree': 2}. Best is trial 1 with value: 0.3280009184449913.
[I 2023-06-27 18:38:14,400] Trial 2 finished with value: 0.3691064870052901 and parameters: {'kernel': 'rbf', 'C': 0.25113061677390003, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.3691064870052901.
[I 2023-06-27 18:38:18,074] Trial 3 finished with value: 0.32808638661302236 and parameters: {'kernel': 'linear', 'C': 25.37815508265663, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.3691064870052901.
[I 2023-06-27 18:38:18,266] Trial 4 finished with value: -1.18528416

Best trial:
  Value:  0.3746496403322445
  Params: 
    kernel: rbf
    C: 0.4836887033238201
    gamma: scale
    degree: 3
SVR(C=0.4836887033238201)
weights: {'C': 0.4836887033238201, 'cache_size': 200, 'coef0': 0.0, 'degree': 3, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'shrinking': True, 'tol': 0.001, 'verbose': False}
0.4500687035563562 0.4089689331879969 0.6418677345185555
winequality-red [CV_0] (SVR) training complete. ------------------------------------
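
The sweep messages in this output follow a regular pattern (`Trial N finished with value: V and parameters: {...}`), so they can be turned into records for comparison across models or folds. The following is a hedged sketch using only the standard library; `TRIAL_PATTERN` and `parse_trial_line` are hypothetical names that are not part of STREAMLINE or of the tuning library, and the pattern is inferred from the lines printed in this notebook rather than from a documented log format.

```python
import ast
import re

# Pattern inferred from the log lines in this notebook (an assumption, not a documented format).
TRIAL_PATTERN = re.compile(
    r"Trial (?P<number>\d+) finished with value: "
    r"(?P<value>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    r" and parameters: (?P<params>\{.*?\})\."
)


def parse_trial_line(line):
    """Return a dict with trial number, value, and params, or None if the line does not match."""
    match = TRIAL_PATTERN.search(line)
    if match is None:
        return None
    return {
        "trial": int(match.group("number")),
        "value": float(match.group("value")),
        # The parameter dictionary is printed as a Python literal, so literal_eval can read it.
        "params": ast.literal_eval(match.group("params")),
    }


# One complete SVR line copied from the CV_0 output above.
sample = (
    "[I 2023-06-27 18:38:14,228] Trial 1 finished with value: 0.3280009184449913 "
    "and parameters: {'kernel': 'linear', 'C': 24.400607090817502, 'gamma': 'scale', 'degree': 2}. "
    "Best is trial 1 with value: 0.3280009184449913."
)
record = parse_trial_line(sample)
print(record)

# A line that was cut off mid-message does not match and is skipped.
truncated = parse_trial_line("[I 2023-06-27 18:38:18,266] Trial 4 finished with value: -1.18528416")
print(truncated)
```

Lines that were cut off mid-message return `None`, so truncated entries are skipped rather than guessed at.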
Linear Regression
weights: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
0.453545990824046 0.3156930997068125 0.5661696668334568
winequality-red [CV_1] (Linear Regression) training complete. ------------------------------------


[I 2023-06-27 18:39:07,085] Trial 0 finished with value: 0.35350210358425604 and parameters: {'alpha': 0.013292918943162165, 'l1_ratio': 0.9507143064099162, 'max_iter': 1140, 'random_state': 42}. Best is trial 0 with value: 0.35350210358425604.
[I 2023-06-27 18:39:07,112] Trial 1 finished with value: 0.27415924230480343 and parameters: {'alpha': 0.21830968390524597, 'l1_ratio': 0.596850157946487, 'max_iter': 2179, 'random_state': 42}. Best is trial 0 with value: 0.35350210358425604.
[I 2023-06-27 18:39:07,136] Trial 2 finished with value: 0.35043638427863516 and parameters: {'alpha': 0.0029375384576328283, 'l1_ratio': 0.05808361216819946, 'max_iter': 2145, 'random_state': 42}. Best is trial 0 with value: 0.35350210358425604.
[I 2023-06-27 18:39:07,153] Trial 3 finished with value: 0.35190783515903146 and parameters: {'alpha': 0.010025956902289565, 'l1_ratio': 0.14286681792194078, 'max_iter': 140, 'random_state': 42}. Best is trial 0 with

Elastic Net


[I 2023-06-27 18:39:07,253] Trial 7 finished with value: 0.35009271841355455 and parameters: {'alpha': 0.0010500232504231353, 'l1_ratio': 0.023062425041415757, 'max_iter': 484, 'random_state': 42}. Best is trial 0 with value: 0.35350210358425604.
[I 2023-06-27 18:39:07,272] Trial 8 finished with value: 0.35240247959504667 and parameters: {'alpha': 0.06847920095574778, 'l1_ratio': 0.13949386065204183, 'max_iter': 985, 'random_state': 42}. Best is trial 0 with value: 0.35350210358425604.
[I 2023-06-27 18:39:07,291] Trial 9 finished with value: 0.3508972561831376 and parameters: {'alpha': 0.004992453416923981, 'l1_ratio': 0.0906064345328208, 'max_iter': 572, 'random_state': 42}. Best is trial 0 with value: 0.35350210358425604.
[I 2023-06-27 18:39:07,312] Trial 10 finished with value: 0.3502643525789228 and parameters: {'alpha': 0.023630001889867556, 'l1_ratio': 0.9328619999452225, 'max_iter': 1559, 'random_state': 42}. Best is trial 0 with v

Best trial:
  Value:  0.3538906276517492
  Params: 
    alpha: 0.011219585994476672
    l1_ratio: 0.8275331166536568
    max_iter: 2261
    random_state: 42
ElasticNet(alpha=0.011219585994476672, l1_ratio=0.8275331166536568,
           max_iter=2261, random_state=42)
weights: {'alpha': 0.011219585994476672, 'copy_X': True, 'fit_intercept': True, 'l1_ratio': 0.8275331166536568, 'max_iter': 2261, 'positive': False, 'precompute': False, 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
0.4545808427216832 0.3140853590698178 0.5623745716083944
winequality-red [CV_1] (Elastic Net) training complete. ------------------------------------
RF Regressor


[I 2023-06-27 18:39:11,480] Trial 0 finished with value: 0.39104868819896854 and parameters: {'n_estimators': 112, 'max_depth': 20, 'min_samples_split': 30, 'min_samples_leaf': 15, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.39104868819896854.
[I 2023-06-27 18:39:16,541] Trial 1 finished with value: 0.3851890164367586 and parameters: {'n_estimators': 710, 'max_depth': 21, 'min_samples_split': 40, 'min_samples_leaf': 19, 'max_features': 'log2', 'bootstrap': True, 'oob_score': False, 'random_state': 42}. Best is trial 0 with value: 0.39104868819896854.
[I 2023-06-27 18:39:21,210] Trial 2 finished with value: 0.3446603914991715 and parameters: {'n_estimators': 468, 'max_depth': 24, 'min_samples_split': 37, 'min_samples_leaf': 40, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.39104868819896854.
[I 2023-06-27 18:3

Best trial:
  Value:  0.43801563631526985
  Params: 
    n_estimators: 302
    max_depth: 23
    min_samples_split: 20
    min_samples_leaf: 1
    max_features: auto
    bootstrap: True
    oob_score: True
    random_state: 42
RandomForestRegressor(max_depth=23, max_features='auto', min_samples_split=20,
                      n_estimators=302, oob_score=True, random_state=42)
weights: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 23, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 20, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 302, 'n_jobs': None, 'oob_score': True, 'random_state': 42, 'verbose': 0, 'warm_start': False}
0.3751585922280705 0.43532540357749283 0.6599055058712905
winequality-red [CV_1] (RF Regressor) training complete. ------------------------------------
AdaBoost


[I 2023-06-27 18:46:42,042] Trial 0 finished with value: 0.37582410587881415 and parameters: {'n_estimators': 112, 'learning_rate': 0.23898324175938382, 'loss': 'exponential'}. Best is trial 0 with value: 0.37582410587881415.
[I 2023-06-27 18:46:43,660] Trial 1 finished with value: 0.3755385139345873 and parameters: {'n_estimators': 116, 'learning_rate': 0.2339293309818035, 'loss': 'linear'}. Best is trial 0 with value: 0.37582410587881415.
[I 2023-06-27 18:46:51,636] Trial 2 finished with value: 0.37449940603407367 and parameters: {'n_estimators': 624, 'learning_rate': 0.13380524258079196, 'loss': 'exponential'}. Best is trial 0 with value: 0.37582410587881415.
[I 2023-06-27 18:46:53,851] Trial 3 finished with value: 0.3771694762853493 and parameters: {'n_estimators': 340, 'learning_rate': 0.13782874270056356, 'loss': 'linear'}. Best is trial 3 with value: 0.3771694762853493.
[I 2023-06-27 18:46:55,470] Trial 4 finished with

Best trial:
  Value:  0.3825082191936294
  Params: 
    n_estimators: 605
    learning_rate: 0.11299992853681023
    loss: linear
AdaBoostRegressor(learning_rate=0.11299992853681023, n_estimators=605)
weights: {'base_estimator': 'deprecated', 'estimator': None, 'learning_rate': 0.11299992853681023, 'loss': 'linear', 'n_estimators': 605, 'random_state': None}
0.43507653569123184 0.34586601597615185 0.5881878465511485
winequality-red [CV_1] (AdaBoost) training complete. ------------------------------------
GradBoost


[I 2023-06-27 18:50:38,312] Trial 0 finished with value: 0.4147561758574554 and parameters: {'learning_rate': 0.0020059560245279666, 'n_estimators': 870, 'min_samples_leaf': 15, 'min_samples_split': 44, 'max_depth': 8, 'random_state': 42}. Best is trial 0 with value: 0.4147561758574554.
[I 2023-06-27 18:50:49,402] Trial 1 finished with value: 0.42559385073433065 and parameters: {'learning_rate': 0.012067245262919609, 'n_estimators': 624, 'min_samples_leaf': 19, 'min_samples_split': 24, 'max_depth': 11, 'random_state': 42}. Best is trial 1 with value: 0.42559385073433065.
[I 2023-06-27 18:50:54,577] Trial 2 finished with value: 0.3860822609789156 and parameters: {'learning_rate': 0.003952429057290443, 'n_estimators': 382, 'min_samples_leaf': 36, 'min_samples_split': 41, 'max_depth': 24, 'random_state': 42}. Best is trial 1 with value: 0.42559385073433065.
[I 2023-06-27 18:51:01,390] Trial 3 finished with value: 0.4247339478046656 and param

Best trial:
  Value:  0.4432761868226563
  Params: 
    learning_rate: 0.00493463855735765
    n_estimators: 937
    min_samples_leaf: 16
    min_samples_split: 8
    max_depth: 22
    random_state: 42
GradientBoostingRegressor(learning_rate=0.00493463855735765, max_depth=22,
                          min_samples_leaf=16, min_samples_split=8,
                          n_estimators=937, random_state=42)
weights: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.00493463855735765, 'loss': 'squared_error', 'max_depth': 22, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 16, 'min_samples_split': 8, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 937, 'n_iter_no_change': None, 'random_state': 42, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
0.3832337087888556 0.42246634470769395 0.6554493246996906
winequality-red [CV_1] (GradBoost) training complet

[I 2023-06-27 19:01:23,144] Trial 0 finished with value: -0.1724248275757645 and parameters: {'kernel': 'rbf', 'C': 153.52246941973468, 'gamma': 'scale', 'degree': 3}. Best is trial 0 with value: -0.1724248275757645.
[I 2023-06-27 19:01:26,752] Trial 1 finished with value: 0.3443606353221332 and parameters: {'kernel': 'linear', 'C': 24.400607090817502, 'gamma': 'scale', 'degree': 2}. Best is trial 1 with value: 0.3443606353221332.
[I 2023-06-27 19:01:26,924] Trial 2 finished with value: 0.36838575723092215 and parameters: {'kernel': 'rbf', 'C': 0.25113061677390003, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.36838575723092215.
[I 2023-06-27 19:01:30,417] Trial 3 finished with value: 0.3443402897356786 and parameters: {'kernel': 'linear', 'C': 25.37815508265663, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.36838575723092215.
[I 2023-06-27 19:01:30,623] Trial 4 finished with value: -4.50509337

Best trial:
  Value:  0.373498938580563
  Params: 
    kernel: rbf
    C: 0.4781918801478707
    gamma: scale
    degree: 3
SVR(C=0.4781918801478707)
weights: {'C': 0.4781918801478707, 'cache_size': 200, 'coef0': 0.0, 'degree': 3, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'shrinking': True, 'tol': 0.001, 'verbose': False}
0.3989189411589041 0.39981459860202906 0.6335066113379686


[I 2023-06-27 19:02:19,327] Trial 0 finished with value: 0.3479131086345644 and parameters: {'alpha': 0.013292918943162165, 'l1_ratio': 0.9507143064099162, 'max_iter': 1140, 'random_state': 42}. Best is trial 0 with value: 0.3479131086345644.
[I 2023-06-27 19:02:19,350] Trial 1 finished with value: 0.26963371827550214 and parameters: {'alpha': 0.21830968390524597, 'l1_ratio': 0.596850157946487, 'max_iter': 2179, 'random_state': 42}. Best is trial 0 with value: 0.3479131086345644.


winequality-red [CV_1] (SVR) training complete. ------------------------------------
Linear Regression
weights: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
0.42388026954677843 0.33421508795737376 0.5781323332047346
winequality-red [CV_2] (Linear Regression) training complete. ------------------------------------
Elastic Net


[I 2023-06-27 19:02:19,386] Trial 2 finished with value: 0.3391896202136411 and parameters: {'alpha': 0.0029375384576328283, 'l1_ratio': 0.05808361216819946, 'max_iter': 2145, 'random_state': 42}. Best is trial 0 with value: 0.3479131086345644.
[I 2023-06-27 19:02:19,410] Trial 3 finished with value: 0.3418565417738561 and parameters: {'alpha': 0.010025956902289565, 'l1_ratio': 0.14286681792194078, 'max_iter': 140, 'random_state': 42}. Best is trial 0 with value: 0.3479131086345644.
[I 2023-06-27 19:02:19,431] Trial 4 finished with value: 0.3405177979612175 and parameters: {'alpha': 0.00115279871282324, 'l1_ratio': 0.9699098521619943, 'max_iter': 1525, 'random_state': 42}. Best is trial 0 with value: 0.3479131086345644.
[I 2023-06-27 19:02:19,452] Trial 5 finished with value: 0.31792657391198453 and parameters: {'alpha': 0.6541210527692729, 'l1_ratio': 0.0007787658410143283, 'max_iter': 965, 'random_state': 42}. Best is trial 0 with value

Best trial:
  Value:  0.3480011864186074
  Params: 
    alpha: 0.014231841829795082
    l1_ratio: 0.95364850341293
    max_iter: 1943
    random_state: 42
ElasticNet(alpha=0.014231841829795082, l1_ratio=0.95364850341293, max_iter=1943,
           random_state=42)
weights: {'alpha': 0.014231841829795082, 'copy_X': True, 'fit_intercept': True, 'l1_ratio': 0.95364850341293, 'max_iter': 1943, 'positive': False, 'precompute': False, 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
0.42303280445348496 0.3355456166152344 0.5802387911649789
winequality-red [CV_2] (Elastic Net) training complete. ------------------------------------
RF Regressor


[I 2023-06-27 19:02:23,713] Trial 0 finished with value: 0.37947486804188113 and parameters: {'n_estimators': 112, 'max_depth': 20, 'min_samples_split': 30, 'min_samples_leaf': 15, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.37947486804188113.
[I 2023-06-27 19:02:27,666] Trial 1 finished with value: 0.37135777399126146 and parameters: {'n_estimators': 710, 'max_depth': 21, 'min_samples_split': 40, 'min_samples_leaf': 19, 'max_features': 'log2', 'bootstrap': True, 'oob_score': False, 'random_state': 42}. Best is trial 0 with value: 0.37947486804188113.
[I 2023-06-27 19:02:31,784] Trial 2 finished with value: 0.33488939989786976 and parameters: {'n_estimators': 468, 'max_depth': 24, 'min_samples_split': 37, 'min_samples_leaf': 40, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.37947486804188113.
[I 2023-06-27 19

Best trial:
  Value:  0.4304908468964103
  Params: 
    n_estimators: 302
    max_depth: 26
    min_samples_split: 9
    min_samples_leaf: 2
    max_features: auto
    bootstrap: True
    oob_score: True
    random_state: 42
RandomForestRegressor(max_depth=26, max_features='auto', min_samples_leaf=2,
                      min_samples_split=9, n_estimators=302, oob_score=True,
                      random_state=42)
weights: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 26, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 2, 'min_samples_split': 9, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 302, 'n_jobs': None, 'oob_score': True, 'random_state': 42, 'verbose': 0, 'warm_start': False}
0.3297185571504368 0.4821371381022488 0.6996025614890676
winequality-red [CV_2] (RF Regressor) training complete. ------------------------------------
AdaBoost


[I 2023-06-27 19:09:21,777] Trial 0 finished with value: 0.35513909507666774 and parameters: {'n_estimators': 112, 'learning_rate': 0.23898324175938382, 'loss': 'exponential'}. Best is trial 0 with value: 0.35513909507666774.
[I 2023-06-27 19:09:23,316] Trial 1 finished with value: 0.36168277345168215 and parameters: {'n_estimators': 116, 'learning_rate': 0.2339293309818035, 'loss': 'linear'}. Best is trial 1 with value: 0.36168277345168215.
[I 2023-06-27 19:09:31,367] Trial 2 finished with value: 0.35298885299431576 and parameters: {'n_estimators': 624, 'learning_rate': 0.13380524258079196, 'loss': 'exponential'}. Best is trial 1 with value: 0.36168277345168215.
[I 2023-06-27 19:09:34,693] Trial 3 finished with value: 0.3620957385010235 and parameters: {'n_estimators': 340, 'learning_rate': 0.13782874270056356, 'loss': 'linear'}. Best is trial 3 with value: 0.3620957385010235.
[I 2023-06-27 19:09:36,219] Trial 4 finished wit

Best trial:
  Value:  0.3620957385010235
  Params: 
    n_estimators: 340
    learning_rate: 0.13782874270056356
    loss: linear
AdaBoostRegressor(learning_rate=0.13782874270056356, n_estimators=340)
weights: {'base_estimator': 'deprecated', 'estimator': None, 'learning_rate': 0.13782874270056356, 'loss': 'linear', 'n_estimators': 340, 'random_state': None}
0.375414858113975 0.4103148010528841 0.6424416419490335
winequality-red [CV_2] (AdaBoost) training complete. ------------------------------------
GradBoost


[I 2023-06-27 19:12:22,727] Trial 0 finished with value: 0.3910076985584965 and parameters: {'learning_rate': 0.0020059560245279666, 'n_estimators': 870, 'min_samples_leaf': 15, 'min_samples_split': 44, 'max_depth': 8, 'random_state': 42}. Best is trial 0 with value: 0.3910076985584965.
[I 2023-06-27 19:12:31,419] Trial 1 finished with value: 0.4079478380796882 and parameters: {'learning_rate': 0.012067245262919609, 'n_estimators': 624, 'min_samples_leaf': 19, 'min_samples_split': 24, 'max_depth': 11, 'random_state': 42}. Best is trial 1 with value: 0.4079478380796882.
[I 2023-06-27 19:12:36,209] Trial 2 finished with value: 0.3648266851592111 and parameters: {'learning_rate': 0.003952429057290443, 'n_estimators': 382, 'min_samples_leaf': 36, 'min_samples_split': 41, 'max_depth': 24, 'random_state': 42}. Best is trial 1 with value: 0.4079478380796882.
[I 2023-06-27 19:12:44,280] Trial 3 finished with value: 0.40998779768791244 and paramet

Best trial:
  Value:  0.41845621118100995
  Params: 
    learning_rate: 0.01914208808549394
    n_estimators: 215
    min_samples_leaf: 18
    min_samples_split: 25
    max_depth: 19
    random_state: 42
GradientBoostingRegressor(learning_rate=0.01914208808549394, max_depth=19,
                          min_samples_leaf=18, min_samples_split=25,
                          n_estimators=215, random_state=42)
weights: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.01914208808549394, 'loss': 'squared_error', 'max_depth': 19, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 18, 'min_samples_split': 25, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 215, 'n_iter_no_change': None, 'random_state': 42, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
0.33150363130740523 0.4793227099537858 0.6927865881007205
winequality-red [CV_2] (GradBoost) training com

[I 2023-06-27 19:18:01,027] Trial 0 finished with value: -0.22965940867406495 and parameters: {'kernel': 'rbf', 'C': 153.52246941973468, 'gamma': 'scale', 'degree': 3}. Best is trial 0 with value: -0.22965940867406495.
[I 2023-06-27 19:18:07,129] Trial 1 finished with value: 0.3380552419492198 and parameters: {'kernel': 'linear', 'C': 24.400607090817502, 'gamma': 'scale', 'degree': 2}. Best is trial 1 with value: 0.3380552419492198.
[I 2023-06-27 19:18:07,476] Trial 2 finished with value: 0.36159064748149966 and parameters: {'kernel': 'rbf', 'C': 0.25113061677390003, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.36159064748149966.
[I 2023-06-27 19:18:13,601] Trial 3 finished with value: 0.3380228557027869 and parameters: {'kernel': 'linear', 'C': 25.37815508265663, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.36159064748149966.
[I 2023-06-27 19:18:13,942] Trial 4 finished with value: -1.965550

Best trial:
  Value:  0.371061880746763
  Params: 
    kernel: rbf
    C: 0.812382965505656
    gamma: scale
    degree: 4
SVR(C=0.812382965505656, degree=4)
weights: {'C': 0.812382965505656, 'cache_size': 200, 'coef0': 0.0, 'degree': 4, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'shrinking': True, 'tol': 0.001, 'verbose': False}
0.3763019802501491 0.4091501886862521 0.6396500398463871
winequality-red [CV_2] (SVR) training complete. ------------------------------------
Linear Regression
weights: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
0.40502666072948595 0.37973355620462235 0.6170522150624297


[I 2023-06-27 19:18:59,471] Trial 0 finished with value: 0.33437796412050935 and parameters: {'alpha': 0.013292918943162165, 'l1_ratio': 0.9507143064099162, 'max_iter': 1140, 'random_state': 42}. Best is trial 0 with value: 0.33437796412050935.
[I 2023-06-27 19:18:59,513] Trial 1 finished with value: 0.2577908795728487 and parameters: {'alpha': 0.21830968390524597, 'l1_ratio': 0.596850157946487, 'max_iter': 2179, 'random_state': 42}. Best is trial 0 with value: 0.33437796412050935.
[I 2023-06-27 19:18:59,556] Trial 2 finished with value: 0.3339046484958419 and parameters: {'alpha': 0.0029375384576328283, 'l1_ratio': 0.05808361216819946, 'max_iter': 2145, 'random_state': 42}. Best is trial 0 with value: 0.33437796412050935.


winequality-red [CV_3] (Linear Regression) training complete. ------------------------------------
Elastic Net


[I 2023-06-27 19:18:59,589] Trial 3 finished with value: 0.33477794506750075 and parameters: {'alpha': 0.010025956902289565, 'l1_ratio': 0.14286681792194078, 'max_iter': 140, 'random_state': 42}. Best is trial 3 with value: 0.33477794506750075.
[I 2023-06-27 19:18:59,622] Trial 4 finished with value: 0.3339533988444807 and parameters: {'alpha': 0.00115279871282324, 'l1_ratio': 0.9699098521619943, 'max_iter': 1525, 'random_state': 42}. Best is trial 3 with value: 0.33477794506750075.
[I 2023-06-27 19:18:59,651] Trial 5 finished with value: 0.3078921056795115 and parameters: {'alpha': 0.6541210527692729, 'l1_ratio': 0.0007787658410143283, 'max_iter': 965, 'random_state': 42}. Best is trial 3 with value: 0.33477794506750075.
[I 2023-06-27 19:18:59,684] Trial 6 finished with value: 0.33420000555927243 and parameters: {'alpha': 0.0035498788321965025, 'l1_ratio': 0.3042422429595377, 'max_iter': 31, 'random_state': 42}. Best is trial 3 with valu

Best trial:
  Value:  0.33605478708980785
  Params: 
    alpha: 0.049086260186289095
    l1_ratio: 0.12936276427868063
    max_iter: 151
    random_state: 42
ElasticNet(alpha=0.049086260186289095, l1_ratio=0.12936276427868063,
           max_iter=151, random_state=42)
weights: {'alpha': 0.049086260186289095, 'copy_X': True, 'fit_intercept': True, 'l1_ratio': 0.12936276427868063, 'max_iter': 151, 'positive': False, 'precompute': False, 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
0.4014099005764374 0.3848346569740886 0.6204212295893555
winequality-red [CV_3] (Elastic Net) training complete. ------------------------------------
RF Regressor


[I 2023-06-27 19:19:07,074] Trial 0 finished with value: 0.3840756522376849 and parameters: {'n_estimators': 112, 'max_depth': 20, 'min_samples_split': 30, 'min_samples_leaf': 15, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.3840756522376849.
[I 2023-06-27 19:19:14,392] Trial 1 finished with value: 0.3751273034771255 and parameters: {'n_estimators': 710, 'max_depth': 21, 'min_samples_split': 40, 'min_samples_leaf': 19, 'max_features': 'log2', 'bootstrap': True, 'oob_score': False, 'random_state': 42}. Best is trial 0 with value: 0.3840756522376849.
[I 2023-06-27 19:19:18,955] Trial 2 finished with value: 0.33468378955722927 and parameters: {'n_estimators': 468, 'max_depth': 24, 'min_samples_split': 37, 'min_samples_leaf': 40, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.3840756522376849.
[I 2023-06-27 19:19:2

Best trial:
  Value:  0.44888006868686076
  Params: 
    n_estimators: 477
    max_depth: 27
    min_samples_split: 4
    min_samples_leaf: 1
    max_features: auto
    bootstrap: True
    oob_score: False
    random_state: 42
RandomForestRegressor(max_depth=27, max_features='auto', min_samples_split=4,
                      n_estimators=477, random_state=42)
weights: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 27, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 4, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 477, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}
0.34838906091933514 0.46593658387285886 0.6826083274898408
winequality-red [CV_3] (RF Regressor) training complete. ------------------------------------
AdaBoost


[I 2023-06-27 19:25:55,348] Trial 0 finished with value: 0.35721635586243927 and parameters: {'n_estimators': 112, 'learning_rate': 0.23898324175938382, 'loss': 'exponential'}. Best is trial 0 with value: 0.35721635586243927.
[I 2023-06-27 19:25:56,593] Trial 1 finished with value: 0.3593973899877259 and parameters: {'n_estimators': 116, 'learning_rate': 0.2339293309818035, 'loss': 'linear'}. Best is trial 1 with value: 0.3593973899877259.
[I 2023-06-27 19:26:02,159] Trial 2 finished with value: 0.35066781308247313 and parameters: {'n_estimators': 624, 'learning_rate': 0.13380524258079196, 'loss': 'exponential'}. Best is trial 1 with value: 0.3593973899877259.
[I 2023-06-27 19:26:04,356] Trial 3 finished with value: 0.361896060058203 and parameters: {'n_estimators': 340, 'learning_rate': 0.13782874270056356, 'loss': 'linear'}. Best is trial 3 with value: 0.361896060058203.
[I 2023-06-27 19:26:05,384] Trial 4 finished with val

Best trial:
  Value:  0.3685518620815495
  Params: 
    n_estimators: 925
    learning_rate: 0.06480354716878484
    loss: linear
AdaBoostRegressor(learning_rate=0.06480354716878484, n_estimators=925)
weights: {'base_estimator': 'deprecated', 'estimator': None, 'learning_rate': 0.06480354716878484, 'loss': 'linear', 'n_estimators': 925, 'random_state': None}
0.39522346371146916 0.39342703242894417 0.6310049896124998
winequality-red [CV_3] (AdaBoost) training complete. ------------------------------------
GradBoost


[I 2023-06-27 19:30:12,054] Trial 0 finished with value: 0.40472801827210686 and parameters: {'learning_rate': 0.0020059560245279666, 'n_estimators': 870, 'min_samples_leaf': 15, 'min_samples_split': 44, 'max_depth': 8, 'random_state': 42}. Best is trial 0 with value: 0.40472801827210686.
[I 2023-06-27 19:30:24,894] Trial 1 finished with value: 0.4201824472020291 and parameters: {'learning_rate': 0.012067245262919609, 'n_estimators': 624, 'min_samples_leaf': 19, 'min_samples_split': 24, 'max_depth': 11, 'random_state': 42}. Best is trial 1 with value: 0.4201824472020291.
[I 2023-06-27 19:30:30,188] Trial 2 finished with value: 0.3776342172005385 and parameters: {'learning_rate': 0.003952429057290443, 'n_estimators': 382, 'min_samples_leaf': 36, 'min_samples_split': 41, 'max_depth': 24, 'random_state': 42}. Best is trial 1 with value: 0.4201824472020291.
[I 2023-06-27 19:30:38,524] Trial 3 finished with value: 0.40734074112162916 and param

Best trial:
  Value:  0.4363802093077984
  Params: 
    learning_rate: 0.006841390207432107
    n_estimators: 779
    min_samples_leaf: 18
    min_samples_split: 12
    max_depth: 19
    random_state: 42
GradientBoostingRegressor(learning_rate=0.006841390207432107, max_depth=19,
                          min_samples_leaf=18, min_samples_split=12,
                          n_estimators=779, random_state=42)
weights: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.006841390207432107, 'loss': 'squared_error', 'max_depth': 19, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 18, 'min_samples_split': 12, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 779, 'n_iter_no_change': None, 'random_state': 42, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
0.38197623475285913 0.4164115461633672 0.6507859196204753
winequality-red [CV_3] (GradBoost) training c

[I 2023-06-27 19:39:36,070] Trial 0 finished with value: -0.1089982462764107 and parameters: {'kernel': 'rbf', 'C': 153.52246941973468, 'gamma': 'scale', 'degree': 3}. Best is trial 0 with value: -0.1089982462764107.
[I 2023-06-27 19:39:42,823] Trial 1 finished with value: 0.3297449697799797 and parameters: {'kernel': 'linear', 'C': 24.400607090817502, 'gamma': 'scale', 'degree': 2}. Best is trial 1 with value: 0.3297449697799797.
[I 2023-06-27 19:39:43,147] Trial 2 finished with value: 0.37085391420016744 and parameters: {'kernel': 'rbf', 'C': 0.25113061677390003, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.37085391420016744.
[I 2023-06-27 19:39:48,724] Trial 3 finished with value: 0.3297575892395533 and parameters: {'kernel': 'linear', 'C': 25.37815508265663, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.37085391420016744.
[I 2023-06-27 19:39:48,937] Trial 4 finished with value: -1.47931186

Best trial:
  Value:  0.38352167991242564
  Params: 
    kernel: rbf
    C: 0.8462698128989947
    gamma: scale
    degree: 4
SVR(C=0.8462698128989947, degree=4)
weights: {'C': 0.8462698128989947, 'cache_size': 200, 'coef0': 0.0, 'degree': 4, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'shrinking': True, 'tol': 0.001, 'verbose': False}
0.3902568946262355 0.4068287859748866 0.6391496557277925
winequality-red [CV_3] (SVR) training complete. ------------------------------------
Linear Regression
weights: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
0.3650141912606247 0.3318703726859735 0.5802718256325237


[I 2023-06-27 19:40:36,409] Trial 0 finished with value: 0.35336058029154077 and parameters: {'alpha': 0.013292918943162165, 'l1_ratio': 0.9507143064099162, 'max_iter': 1140, 'random_state': 42}. Best is trial 0 with value: 0.35336058029154077.
[I 2023-06-27 19:40:36,441] Trial 1 finished with value: 0.2702987597191424 and parameters: {'alpha': 0.21830968390524597, 'l1_ratio': 0.596850157946487, 'max_iter': 2179, 'random_state': 42}. Best is trial 0 with value: 0.35336058029154077.
[I 2023-06-27 19:40:36,507] Trial 2 finished with value: 0.3534726426236235 and parameters: {'alpha': 0.0029375384576328283, 'l1_ratio': 0.05808361216819946, 'max_iter': 2145, 'random_state': 42}. Best is trial 2 with value: 0.3534726426236235.


winequality-red [CV_4] (Linear Regression) training complete. ------------------------------------
Elastic Net


[I 2023-06-27 19:40:36,545] Trial 3 finished with value: 0.3542495783011157 and parameters: {'alpha': 0.010025956902289565, 'l1_ratio': 0.14286681792194078, 'max_iter': 140, 'random_state': 42}. Best is trial 3 with value: 0.3542495783011157.
[I 2023-06-27 19:40:36,584] Trial 4 finished with value: 0.35394724482787404 and parameters: {'alpha': 0.00115279871282324, 'l1_ratio': 0.9699098521619943, 'max_iter': 1525, 'random_state': 42}. Best is trial 3 with value: 0.3542495783011157.
[I 2023-06-27 19:40:36,616] Trial 5 finished with value: 0.3224781754575671 and parameters: {'alpha': 0.6541210527692729, 'l1_ratio': 0.0007787658410143283, 'max_iter': 965, 'random_state': 42}. Best is trial 3 with value: 0.3542495783011157.
[I 2023-06-27 19:40:36,647] Trial 6 finished with value: 0.3539753437215564 and parameters: {'alpha': 0.0035498788321965025, 'l1_ratio': 0.3042422429595377, 'max_iter': 31, 'random_state': 42}. Best is trial 3 with value: 0

Best trial:
  Value:  0.3547016277645205
  Params: 
    alpha: 0.006819661908946111
    l1_ratio: 0.5514888626953147
    max_iter: 2343
    random_state: 42
ElasticNet(alpha=0.006819661908946111, l1_ratio=0.5514888626953147,
           max_iter=2343, random_state=42)
weights: {'alpha': 0.006819661908946111, 'copy_X': True, 'fit_intercept': True, 'l1_ratio': 0.5514888626953147, 'max_iter': 2343, 'positive': False, 'precompute': False, 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
0.3626105944304696 0.33629077592444245 0.5824288940952096
winequality-red [CV_4] (Elastic Net) training complete. ------------------------------------
RF Regressor


[I 2023-06-27 19:40:43,130] Trial 0 finished with value: 0.3916346002070937 and parameters: {'n_estimators': 112, 'max_depth': 20, 'min_samples_split': 30, 'min_samples_leaf': 15, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.3916346002070937.
[I 2023-06-27 19:40:47,066] Trial 1 finished with value: 0.3815677654958267 and parameters: {'n_estimators': 710, 'max_depth': 21, 'min_samples_split': 40, 'min_samples_leaf': 19, 'max_features': 'log2', 'bootstrap': True, 'oob_score': False, 'random_state': 42}. Best is trial 0 with value: 0.3916346002070937.
[I 2023-06-27 19:40:49,898] Trial 2 finished with value: 0.3452450521522805 and parameters: {'n_estimators': 468, 'max_depth': 24, 'min_samples_split': 37, 'min_samples_leaf': 40, 'max_features': 'log2', 'bootstrap': True, 'oob_score': True, 'random_state': 42}. Best is trial 0 with value: 0.3916346002070937.
[I 2023-06-27 19:40:53

Best trial:
  Value:  0.4191039615962849
  Params: 
    n_estimators: 179
    max_depth: 30
    min_samples_split: 17
    min_samples_leaf: 1
    max_features: auto
    bootstrap: True
    oob_score: True
    random_state: 42
RandomForestRegressor(max_depth=30, max_features='auto', min_samples_split=17,
                      n_estimators=179, oob_score=True, random_state=42)
weights: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 30, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 17, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 179, 'n_jobs': None, 'oob_score': True, 'random_state': 42, 'verbose': 0, 'warm_start': False}
0.2998721988172851 0.4522626245374678 0.6727363849265758
winequality-red [CV_4] (RF Regressor) training complete. ------------------------------------
AdaBoost


[I 2023-06-27 19:47:14,574] Trial 0 finished with value: 0.3638436113734607 and parameters: {'n_estimators': 112, 'learning_rate': 0.23898324175938382, 'loss': 'exponential'}. Best is trial 0 with value: 0.3638436113734607.
[I 2023-06-27 19:47:15,514] Trial 1 finished with value: 0.35792041294675236 and parameters: {'n_estimators': 116, 'learning_rate': 0.2339293309818035, 'loss': 'linear'}. Best is trial 0 with value: 0.3638436113734607.
[I 2023-06-27 19:47:20,676] Trial 2 finished with value: 0.3516593790311748 and parameters: {'n_estimators': 624, 'learning_rate': 0.13380524258079196, 'loss': 'exponential'}. Best is trial 0 with value: 0.3638436113734607.
[I 2023-06-27 19:47:23,154] Trial 3 finished with value: 0.3622811786514661 and parameters: {'n_estimators': 340, 'learning_rate': 0.13782874270056356, 'loss': 'linear'}. Best is trial 0 with value: 0.3638436113734607.
[I 2023-06-27 19:47:24,461] Trial 4 finished with val

Best trial:
  Value:  0.3677700269136026
  Params: 
    n_estimators: 399
    learning_rate: 0.2068161278015292
    loss: linear
AdaBoostRegressor(learning_rate=0.2068161278015292, n_estimators=399)
weights: {'base_estimator': 'deprecated', 'estimator': None, 'learning_rate': 0.2068161278015292, 'loss': 'linear', 'n_estimators': 399, 'random_state': None}
0.34731550287071267 0.3789621956643824 0.621184698442605
winequality-red [CV_4] (AdaBoost) training complete. ------------------------------------
GradBoost


[I 2023-06-27 19:50:37,527] Trial 0 finished with value: 0.39163492256292565 and parameters: {'learning_rate': 0.0020059560245279666, 'n_estimators': 870, 'min_samples_leaf': 15, 'min_samples_split': 44, 'max_depth': 8, 'random_state': 42}. Best is trial 0 with value: 0.39163492256292565.
[I 2023-06-27 19:50:49,987] Trial 1 finished with value: 0.41186814082649814 and parameters: {'learning_rate': 0.012067245262919609, 'n_estimators': 624, 'min_samples_leaf': 19, 'min_samples_split': 24, 'max_depth': 11, 'random_state': 42}. Best is trial 1 with value: 0.41186814082649814.
[I 2023-06-27 19:50:54,868] Trial 2 finished with value: 0.3804177216267515 and parameters: {'learning_rate': 0.003952429057290443, 'n_estimators': 382, 'min_samples_leaf': 36, 'min_samples_split': 41, 'max_depth': 24, 'random_state': 42}. Best is trial 1 with value: 0.41186814082649814.
[I 2023-06-27 19:51:00,539] Trial 3 finished with value: 0.3862318626609618 and par

Best trial:
  Value:  0.42949577890859264
  Params: 
    learning_rate: 0.009141999052246986
    n_estimators: 433
    min_samples_leaf: 28
    min_samples_split: 11
    max_depth: 14
    random_state: 42
GradientBoostingRegressor(learning_rate=0.009141999052246986, max_depth=14,
                          min_samples_leaf=28, min_samples_split=11,
                          n_estimators=433, random_state=42)
weights: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.009141999052246986, 'loss': 'squared_error', 'max_depth': 14, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 28, 'min_samples_split': 11, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 433, 'n_iter_no_change': None, 'random_state': 42, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
0.3275073295467715 0.40314213644484254 0.6402686178904011
winequality-red [CV_4] (GradBoost) training

[I 2023-06-27 19:56:13,460] Trial 0 finished with value: -0.13609709462626127 and parameters: {'kernel': 'rbf', 'C': 153.52246941973468, 'gamma': 'scale', 'degree': 3}. Best is trial 0 with value: -0.13609709462626127.
[I 2023-06-27 19:56:16,749] Trial 1 finished with value: 0.34326426924297887 and parameters: {'kernel': 'linear', 'C': 24.400607090817502, 'gamma': 'scale', 'degree': 2}. Best is trial 1 with value: 0.34326426924297887.
[I 2023-06-27 19:56:16,928] Trial 2 finished with value: 0.3705553665807717 and parameters: {'kernel': 'rbf', 'C': 0.25113061677390003, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.3705553665807717.
[I 2023-06-27 19:56:20,636] Trial 3 finished with value: 0.3431366025012454 and parameters: {'kernel': 'linear', 'C': 25.37815508265663, 'gamma': 'scale', 'degree': 3}. Best is trial 2 with value: 0.3705553665807717.
[I 2023-06-27 19:56:20,829] Trial 4 finished with value: -0.3252493

Best trial:
  Value:  0.3866637510108916
  Params: 
    kernel: rbf
    C: 0.7558652231750806
    gamma: scale
    degree: 5
SVR(C=0.7558652231750806, degree=5)
weights: {'C': 0.7558652231750806, 'cache_size': 200, 'coef0': 0.0, 'degree': 5, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'shrinking': True, 'tol': 0.001, 'verbose': False}
0.3451315030757597 0.3676131421860038 0.6161341546632361
winequality-red [CV_4] (SVR) training complete. ------------------------------------
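
The trial lines in the log above follow a regular pattern, so they can be turned into records for later inspection. The following is a minimal sketch, not part of STREAMLINE: `parse_trial_line` and both regular expressions are hypothetical helpers written for this illustration, and they assume the line layout seen in this output. Lines that were cut off in the captured output are skipped rather than guessed at.

```python
import ast
import re

# ANSI color codes such as "\x1b[32m"; the ESC byte is treated as optional
# because some exports keep only the "[32m" / "[0m" remainder.
ANSI_RE = re.compile(r"\x1b?\[\d+m")

# Layout assumed from the log above:
# "Trial N finished with value: V and parameters: {...}. Best is trial ..."
TRIAL_RE = re.compile(
    r"Trial (?P<trial>\d+) finished with value: "
    r"(?P<value>-?\d+(?:\.\d+)?(?:e-?\d+)?)"
    r" and parameters: (?P<params>\{.*?\})\."
)


def parse_trial_line(line):
    """Return trial number, score, and parameters, or None if the line does not match."""
    clean = ANSI_RE.sub("", line)
    match = TRIAL_RE.search(clean)
    if match is None:
        return None  # e.g. a truncated line or a non-trial line
    return {
        "trial": int(match.group("trial")),
        "value": float(match.group("value")),
        # The parameter dict is a Python literal, so literal_eval is sufficient.
        "params": ast.literal_eval(match.group("params")),
    }


sample = (
    "[32m[I 2023-06-27 19:25:55,348][0m Trial 0 finished with value: "
    "0.35721635586243927 and parameters: {'n_estimators': 112, "
    "'learning_rate': 0.23898324175938382, 'loss': 'exponential'}. "
    "Best is trial 0 with value: 0.35721635586243927.[0m"
)
record = parse_trial_line(sample)
print(record["trial"], record["value"], record["params"]["loss"])
```

Collecting the records for one algorithm and CV partition into a list makes it straightforward to sort by `value` or to tabulate the sampled hyperparameters.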


## -----------------------------------------------------------------------------------------------------------------
## Phase 6: Statistics (Stats Summaries, Figures, Statistical Comparisons)

### Import Additional Python Packages

In [None]:
import StatsJob

### Run Statistics Summary and Figure Generation

In [None]:
#Unpickle metadata from previous phase
metadata_path = output_path+'/'+experiment_name+'/'+"metadata.pickle"
with open(metadata_path, 'rb') as file:
    metadata = pickle.load(file)
metadata['Export Metric Boxplots'] = plot_metric_boxplots
metadata['Export Feature Importance Boxplots'] = plot_FI_box
metadata['Metric Weighting Composite FI Plots'] = metric_weight
metadata['Top Model Features To Display'] = top_model_features
#Pickle the metadata for future use
with open(metadata_path, 'wb') as pickle_out:
    pickle.dump(metadata,pickle_out)

#Now that the primary pipeline phases are complete, generate a human-readable version of the metadata
df = pd.DataFrame.from_dict(metadata, orient='index')
df.to_csv(output_path+'/'+experiment_name+'/'+'metadata.csv',index=True)

# Iterate through datasets
dataset_paths = os.listdir(output_path + "/" + experiment_name)
removeList = ['metadata.pickle','metadata.csv','algInfo.pickle','jobsCompleted','logs','jobs','DatasetComparisons','UsefulNotebooks',experiment_name+'_ML_Pipeline_Report.pdf']
for text in removeList:
    if text in dataset_paths:
        dataset_paths.remove(text)
for dataset_directory_path in dataset_paths:
    full_path = output_path + "/" + experiment_name + "/" + dataset_directory_path
    StatsJob.job(full_path,plot_FI_box,class_label,instance_label,cv_partitions,scale_data,plot_metric_boxplots,primary_metric,top_model_features,sig_cutoff,metric_weight,jupyterRun)
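
The cell above finds the dataset folders by listing the experiment directory and removing the known bookkeeping entries. A minimal, self-contained sketch of that filtering step follows. `list_dataset_dirs` is a hypothetical helper, not a STREAMLINE function, and sorting the result is an addition of this sketch, since `os.listdir` returns entries in arbitrary order.

```python
import os
import tempfile


def list_dataset_dirs(experiment_dir):
    """Return the entries of an experiment folder that are dataset output folders."""
    experiment_name = os.path.basename(os.path.normpath(experiment_dir))
    # Same exclusion list as the notebook cell, held in a set for direct lookups.
    non_dataset = {
        'metadata.pickle', 'metadata.csv', 'algInfo.pickle', 'jobsCompleted',
        'logs', 'jobs', 'DatasetComparisons', 'UsefulNotebooks',
        experiment_name + '_ML_Pipeline_Report.pdf',
    }
    return sorted(name for name in os.listdir(experiment_dir)
                  if name not in non_dataset)


def demo_list_dataset_dirs():
    """Build a throwaway experiment folder and list its dataset folders."""
    with tempfile.TemporaryDirectory() as tmp:
        experiment_dir = os.path.join(tmp, 'Demo')
        for folder in ('winequality-red', 'jobsCompleted', 'logs'):
            os.makedirs(os.path.join(experiment_dir, folder))
        for filename in ('metadata.pickle', 'metadata.csv',
                         'Demo_ML_Pipeline_Report.pdf'):
            open(os.path.join(experiment_dir, filename), 'w').close()
        return list_dataset_dirs(experiment_dir)


print(demo_list_dataset_dirs())
```

Building a new list avoids mutating the directory listing in place, and the same helper could serve the cleanup phase, which repeats this filtering.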


## -----------------------------------------------------------------------------------------------------------------
## Phase 7: Dataset Comparison (Optional: Use only if > 1 dataset was analyzed)

### Import Additional Python Packages

In [None]:
import DataCompareJob

### Run Dataset Comparison

In [None]:
if len(dataset_paths) > 1:
    DataCompareJob.job(output_path+'/'+experiment_name,sig_cutoff,jupyterRun)

## -----------------------------------------------------------------------------------------------------------------
## Phase 8: PDF Training Report Generator (Optional)

In [None]:
import PDF_ReportJob_Reg

In [None]:
experiment_path = output_path+'/'+experiment_name
PDF_ReportJob_Reg.job(experiment_path,'True','None','None')

2023-06-27 17:42:48.676167
Starting Report
['Data Path:', '/content/drive/MyDrive/STREAMLINE-Regression/DemoData_2', '\n', 'Output Path:', '/content/drive/MyDrive/STREAMLINE-Regression/Colab_Output', '\n', 'Experiment Name:', 'Demo', '\n', 'Class Label:', 'Ferritin (ng/mL)', '\n', 'Instance Label:', 'InstanceID', '\n', 'Ignored Features:', '[]', '\n', 'Specified Categorical Features:', '[]', '\n', 'CV Partitions:', '5', '\n', 'Partition Method:', 'R', '\n', 'Match Label:', 'None', '\n', 'Categorical Cutoff:', '10', '\n', 'Statistical Significance Cutoff:', '0.05', '\n', 'Export Feature Correlations:', 'True', '\n', 'Export Univariate Plots:', 'False', '\n', 'Random Seed:', '42', '\n', 'Run From Jupyter Notebook:', 'True', '\n', 'Use Data Scaling:', 'True', '\n', 'Use Data Imputation:', 'True', '\n', 'Use Multivariate Imputation:', 'True', '\n', 'Use Mutual Information:', 'True', '\n', 'Use MultiSURF:', 'True', '\n', 'Use TURF:', 'False', '\n', 'TURF Cutoff:', '0.5', '\n', 'MultiSURF In

## -----------------------------------------------------------------------------------------------------------------
## Phase 9: Apply Models to Replication Data (Optional)

### Import Additional Python Packages

In [None]:
import ApplyModelJob

### Specify Run Parameters

In [None]:
if demo_run:
    rep_data_path = wd_path+'/drive/MyDrive/STREAMLINE-main/DemoRepData'
print("Replication Data Folder Path: "+rep_data_path)
print("Dataset Path: "+dataset_for_rep)

Replication Data Folder Path: /content/drive/MyDrive/STREAMLINE-main/DemoRepData
Dataset Path: /content/drive/MyDrive/STREAMLINE-main/DemoRepData/hcc-data_example_rep.csv


### Run Application of Models to Replication Data

In [None]:
if applyToReplication:
    data_name = dataset_for_rep.split('/')[-1].split('.')[0] #Name of the training dataset (file name without path or extension)
    full_path = output_path + "/" + experiment_name + "/" + data_name #Location of the folder containing the models trained on that dataset
    if not os.path.exists(full_path):
        os.mkdir(full_path)
    if not os.path.exists(full_path+"/applymodel"):
        os.mkdir(full_path+"/applymodel")

    #Determine file extension of datasets in target folder:
    file_count = 0
    unique_datanames = []
    for datasetFilename in glob.glob(rep_data_path+'/*'):
        datasetFilename = str(datasetFilename).replace('\\','/')

        file_extension = datasetFilename.split('/')[-1].split('.')[-1]
        apply_name = datasetFilename.split('/')[-1].split('.')[0] #Save unique dataset names so that analysis is run only once if there is both a .txt and .csv version of dataset with same name.
        if not os.path.exists(full_path+"/applymodel/"+apply_name):
            os.mkdir(full_path+"/applymodel/"+apply_name)

        if file_extension == 'txt' or file_extension == 'csv':
            if apply_name not in unique_datanames:
                unique_datanames.append(apply_name)
                ApplyModelJob.job(datasetFilename,full_path,class_label,instance_label,categorical_cutoff,sig_cutoff,cv_partitions,scale_data,impute_data,primary_metric,dataset_for_rep,match_label,plot_ROC,plot_PRC,plot_metric_boxplots,export_feature_correlations,jupyterRun,multi_impute)
                file_count += 1

    if file_count == 0: #Check that there was at least 1 dataset
        raise Exception("There must be at least one .txt or .csv dataset in rep_data_path directory")
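
The loop above runs each replication dataset once, even when the folder holds both a `.txt` and a `.csv` file with the same base name. The selection logic can be sketched in isolation as follows. `unique_datasets` is a hypothetical helper, and the sketch sorts the folder contents so that the result is deterministic, whereas the notebook relies on the order returned by `glob.glob`.

```python
import tempfile
from pathlib import Path


def unique_datasets(folder, extensions=('txt', 'csv')):
    """Return one path per dataset base name, keeping only .txt/.csv files."""
    seen = set()
    selected = []
    for path in sorted(Path(folder).iterdir()):
        # Mirror the notebook: extension is the text after the last dot,
        # base name is the text before the first dot.
        extension = path.name.split('.')[-1]
        base_name = path.name.split('.')[0]
        if extension not in extensions or base_name in seen:
            continue
        seen.add(base_name)
        selected.append(path)
    return selected


def demo_unique_datasets():
    """Create a throwaway folder with duplicate base names and select from it."""
    with tempfile.TemporaryDirectory() as tmp:
        # File names here are made up for the illustration.
        for name in ('a.csv', 'a.txt', 'b.txt', 'notes.md'):
            (Path(tmp) / name).touch()
        return [p.name for p in unique_datasets(tmp)]


print(demo_unique_datasets())
```

With sorted input the `.csv` version is chosen when both exist, because it sorts before `.txt`. An empty result corresponds to the `file_count == 0` error case in the cell above.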

## -----------------------------------------------------------------------------------------------------------------
## Phase 10: PDF Apply Report Generator (Optional)

In [None]:
import PDF_ReportJob

In [None]:
if applyToReplication:
    experiment_path = output_path+'/'+experiment_name
    PDF_ReportJob.job(experiment_path,'False',rep_data_path,dataset_for_rep)

## -----------------------------------------------------------------------------------------------------------------
## Phase 11: File Cleanup (Optional)

In [None]:
# Get dataset paths for all completed dataset analyses in experiment folder
datasets = os.listdir(experiment_path)
experiment_name = experiment_path.split('/')[-1] #Name of experiment folder
removeList = ['metadata.pickle','metadata.csv','algInfo.pickle','jobsCompleted','logs','jobs','DatasetComparisons','UsefulNotebooks',experiment_name+'_ML_Pipeline_Report.pdf']
for text in removeList:
    if text in datasets:
        datasets.remove(text)

#Delete jobsCompleted folder/files
try:
    shutil.rmtree(experiment_path+'/'+'jobsCompleted')
except OSError: #Folder is missing or cannot be removed
    pass

#Delete target files within each dataset subfolder
for dataset in datasets:
    #Delete individual runtime files (save runtime summary generated in phase 6)
    if eval(del_time):
        try:
            shutil.rmtree(experiment_path+'/'+dataset+'/'+'runtime')
            print("Individual Runtime Files Deleted")
        except OSError:
            pass
    #Delete temporary feature importance pickle files (only needed for phase 4 and then saved as summary files in phase 6)
    try:
        shutil.rmtree(experiment_path+'/'+dataset+'/feature_selection/mutualinformation/pickledForPhase4')
        print("Mutual Information Pickle Files Deleted")
    except OSError:
        pass
    try:
        shutil.rmtree(experiment_path+'/'+dataset+'/feature_selection/multisurf/pickledForPhase4')
        print("MultiSURF Pickle Files Deleted")
    except OSError:
        pass
    #Delete older training and testing CV datasets (does not delete any final versions used for training). Older cv datasets might have been kept to see what they look like prior to preprocessing and feature selection.
    if eval(del_oldCV):
        #Delete CV files generated after preprocessing but before feature selection
        files = glob.glob(experiment_path+'/'+dataset+'/CVDatasets/*CVOnly*')
        for f in files:
            try:
                os.remove(f)
                print("Deleted Intermediary CV-Only Dataset Files")
            except OSError:
                pass
        #Delete CV files generated after CV partitioning but before preprocessing
        files = glob.glob(experiment_path+'/'+dataset+'/CVDatasets/*CVPre*')
        for f in files:
            try:
                os.remove(f)
                print("Deleted Intermediary CV-Pre Dataset Files")
            except OSError:
                pass
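
The cleanup cell combines three small operations: interpreting the `del_time` and `del_oldCV` flags, removing a folder if it exists, and deleting files that match a pattern. The sketch below shows one way to write these as helpers. `parse_flag`, `remove_dir`, and `remove_matching` are hypothetical names, the file names in the demo are made up, and `parse_flag` assumes the flags hold `'True'`/`'False'` strings or booleans, which is inferred from the use of `eval` above rather than confirmed.

```python
import shutil
import tempfile
from pathlib import Path


def parse_flag(value):
    """Interpret a bool or a 'True'/'False' string without calling eval."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text == 'true':
        return True
    if text == 'false':
        return False
    raise ValueError("Expected 'True' or 'False', got: " + repr(value))


def remove_dir(path):
    """Delete a directory tree; return True if a directory was removed."""
    path = Path(path)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


def remove_matching(folder, pattern):
    """Delete files in folder matching a glob pattern; return how many were removed."""
    removed = 0
    for file_path in list(Path(folder).glob(pattern)):
        if file_path.is_file():
            file_path.unlink()
            removed += 1
    return removed


def demo_cleanup():
    """Run the helpers against a throwaway dataset folder."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp) / 'winequality-red'
        (dataset / 'runtime').mkdir(parents=True)
        cv_dir = dataset / 'CVDatasets'
        cv_dir.mkdir()
        for name in ('d_CVOnly_0_Train.csv', 'd_CVPre_0_Train.csv', 'd_CV_0_Train.csv'):
            (cv_dir / name).touch()
        runtime_removed = parse_flag('True') and remove_dir(dataset / 'runtime')
        removed = remove_matching(cv_dir, '*CVOnly*') + remove_matching(cv_dir, '*CVPre*')
        remaining = sorted(p.name for p in cv_dir.iterdir())
        return runtime_removed, removed, remaining


print(demo_cleanup())
```

Returning a count or a boolean lets the caller print a single summary message per dataset instead of one message per deleted file.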