# Stacked Classifier - AutoGluon

This notebook performs AutoML with Bayesian inference to determine the best possible estimator for a classification problem.
If you want to try the Optuna framework instead, see the other repository (https://github.com/Benetti-Hub/MultiphasePipeline). AutoGluon achieves almost the same results without requiring you to understand Python code written by an MSc student (it uses the same models described in the paper).

In [1]:
#For development
#Reload the library when a change is detected in one of the imported libraries
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
from autogluon.tabular import TabularDataset, TabularPredictor

from src import feature_engineering as fe

In [3]:
df = pd.read_csv('data/Train_bronze.csv') #Train Dataset
kept_columns = ['Ang', 'FrL', 'FrG', 'X_LM_2', 'Eo', 'Flow_label'] #Kept Columns (from SFFS)
df = fe.bronze_to_gold(df)[kept_columns]
train_data = TabularDataset(df)
train_data.head()

| | Ang | FrL | FrG | X_LM_2 | Eo | Flow_label |
|---|---|---|---|---|---|---|
| 0 | 70.0 | 1.977684 | 0.167038 | 90.94347 | 87.456055 | 2 |
| 1 | 15.0 | 3.375242 | 0.027241 | 759.85875 | 87.456055 | 2 |
| 2 | -1.0 | 3.2333 | 0.001111 | 17270.115534 | 87.456055 | 1 |
| 3 | 0.0 | 1.237211 | 0.121114 | 571.995542 | 174.589416 | 2 |
| 4 | 0.0 | 0.394137 | 1.052936 | 0.41992 | 1513.924625 | 0 |


## AutoGluon TabularPredictor

We can now start the AutoML process. The setup below is basic; the various hyperparameters can be set by following the in-depth tutorial (https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-indepth.html#prediction-options-inference).

In [4]:
label = 'Flow_label'
save_path = './models' # Folder in which the trained models are stored
metric = 'balanced_accuracy' # SMOTE can't be applied with AutoGluon, so balanced accuracy is the next best thing

# To really maximize performance, set the hyperparameters according to your needs (e.g. the number of folds or the stacking depth)
# You can also set the hyperparameter search space for each model
predictor = TabularPredictor(label=label, eval_metric=metric,
                             path=save_path).fit(train_data)

Beginning AutoGluon training ...
AutoGluon will save models to "./models/"
AutoGluon Version:  0.2.0
Train Data Rows:    6486
Train Data Columns: 5
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	6 unique label values:  [2, 1, 0, 3, 4, 5]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 6
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    7878.33 MB
	Train Data (Original)  Memory Usage: 0.26 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		

[1000]	train_set's multi_logloss: 0.0636229	train_set's balanced_accuracy: 0.990598	valid_set's multi_logloss: 0.162845	valid_set's balanced_accuracy: 0.925036


	0.941	 = Validation balanced_accuracy score
	14.19s	 = Training runtime
	0.63s	 = Validation runtime
Fitting model: LightGBM ...
	0.9561	 = Validation balanced_accuracy score
	3.5s	 = Training runtime
	0.03s	 = Validation runtime
Fitting model: RandomForestGini ...
	0.9061	 = Validation balanced_accuracy score
	1.08s	 = Training runtime
	0.15s	 = Validation runtime
Fitting model: RandomForestEntr ...
	0.927	 = Validation balanced_accuracy score
	1.08s	 = Training runtime
	0.15s	 = Validation runtime
Fitting model: CatBoost ...
	0.9249	 = Validation balanced_accuracy score
	14.38s	 = Training runtime
	0.01s	 = Validation runtime
Fitting model: ExtraTreesGini ...
	0.9249	 = Validation balanced_accuracy score
	1.09s	 = Training runtime
	0.15s	 = Validation runtime
Fitting model: ExtraTreesEntr ...
	0.9216	 = Validation balanced_accuracy score
	1.08s	 = Training runtime
	0.15s	 = Validation runtime
Fitting model: XGBoost ...




	0.9385	 = Validation balanced_accuracy score
	4.72s	 = Training runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetMXNet ...
	0.8606	 = Validation balanced_accuracy score
	24.71s	 = Training runtime
	0.15s	 = Validation runtime
Fitting model: LightGBMLarge ...
	0.9488	 = Validation balanced_accuracy score
	9.23s	 = Training runtime
	0.11s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	0.965	 = Validation balanced_accuracy score
	0.98s	 = Training runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 94.33s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("./models/")
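
The comments in the cell above mention setting the folds, the stacking depth, and per-model hyperparameters. A minimal sketch of how these could be passed to `fit()` is given below. The numeric values, the `./models_custom` path, and the helper names `build_fit_kwargs` and `fit_with_options` are illustrative assumptions, not settings used in this notebook; check the linked tutorial for the options supported by your AutoGluon version.

```python
def build_fit_kwargs(num_bag_folds=5, num_stack_levels=1, time_limit=600):
    """Collect keyword arguments for TabularPredictor.fit.

    All values are placeholders chosen for illustration only.
    """
    return {
        "num_bag_folds": num_bag_folds,        # k-fold bagging of each model
        "num_stack_levels": num_stack_levels,  # stacking depth (requires bagging)
        "time_limit": time_limit,              # overall training budget in seconds
        # Restricting this dict also restricts which model families are trained.
        "hyperparameters": {
            "GBM": {"num_boost_round": 500, "learning_rate": 0.05},  # LightGBM
            "RF": {"n_estimators": 300},                             # random forest
        },
    }


def fit_with_options(train_data, label="Flow_label", path="./models_custom"):
    """Train a predictor with the options above (hypothetical helper)."""
    # Imported lazily so build_fit_kwargs can be inspected without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, eval_metric="balanced_accuracy", path=path)
    return predictor.fit(train_data, **build_fit_kwargs())


fit_kwargs = build_fit_kwargs()
print(sorted(fit_kwargs))
```

Calling `fit_with_options(train_data)` would then train only the LightGBM and random forest families, bagged and stacked one level deep, which is usually slower than the default run shown above.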


In [5]:
predictor.leaderboard(silent=True)

| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.964967 | 1.31726 | 77.054556 | 0.000935 | 0.975441 | 2 | True | 14 |
| 1 | LightGBM | 0.95609 | 0.02692 | 3.49675 | 0.02692 | 3.49675 | 1 | True | 5 |
| 2 | LightGBMLarge | 0.948826 | 0.110563 | 9.229619 | 0.110563 | 9.229619 | 1 | True | 13 |
| 3 | LightGBMXT | 0.940989 | 0.625116 | 14.187225 | 0.625116 | 14.187225 | 1 | True | 4 |
| 4 | XGBoost | 0.938457 | 0.012982 | 4.716712 | 0.012982 | 4.716712 | 1 | True | 11 |
| 5 | RandomForestEntr | 0.927024 | 0.147607 | 1.084121 | 0.147607 | 1.084121 | 1 | True | 7 |
| 6 | ExtraTreesGini | 0.92493 | 0.148549 | 1.086065 | 0.148549 | 1.086065 | 1 | True | 9 |
| 7 | CatBoost | 0.92488 | 0.006851 | 14.376304 | 0.006851 | 14.376304 | 1 | True | 8 |
| 8 | ExtraTreesEntr | 0.921554 | 0.147995 | 1.077949 | 0.147995 | 1.077949 | 1 | True | 10 |
| 9 | RandomForestGini | 0.906142 | 0.151147 | 1.081809 | 0.151147 | 1.081809 | 1 | True | 6 |


In [6]:
predictor.feature_importance(train_data)

Computing feature importance via permutation shuffling for 5 features using 1000 rows with 3 shuffle sets...
	36.86s	= Expected runtime (12.29s per shuffle set)
	19.62s	= Actual runtime (Completed 3 of 3 shuffle sets)


| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| Ang | 0.487692 | 0.015925 | 0.000178 | 3 | 0.578944 | 0.396441 |
| FrL | 0.426909 | 0.024874 | 0.000565 | 3 | 0.569441 | 0.284378 |
| FrG | 0.380786 | 0.013506 | 0.00021 | 3 | 0.458179 | 0.303392 |
| X_LM_2 | 0.276541 | 0.01623 | 0.000573 | 3 | 0.369542 | 0.183541 |
| Eo | 0.168208 | 0.015649 | 0.001436 | 3 | 0.257879 | 0.078537 |
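
The log above states that the importances are computed by permutation shuffling: one feature column is shuffled at a time and the resulting drop in score is recorded. The sketch below illustrates that idea on a toy problem in plain Python. It is not AutoGluon's implementation; the toy model, the use of plain accuracy as the score, and all names are assumptions made for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, labels, n_features, n_shuffles=3, seed=0):
    """Average score drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(labels, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_shuffles):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            score = accuracy(labels, [predict(r) for r in shuffled])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_shuffles)
    return importances


def toy_predict(row):
    """Toy 'model' that only looks at the first feature."""
    return int(row[0] > 0.5)


data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [toy_predict(r) for r in rows]

importances = permutation_importance(toy_predict, rows, labels, n_features=2)
print(importances)
```

The first feature gets a clearly positive importance, while the second, which the toy model ignores, gets exactly zero. The table above can be read the same way: shuffling `Ang` hurts the score the most and shuffling `Eo` the least.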


## Testing

We can test the performance of our estimators on the test set. The provided test set comes from the same database; to assess how the model performs on a different set of studies, use the data in "data/secret". Just remember that only 4 classes are present in that data, so the outputs of the model have to be merged accordingly (SS+SW=Stratified and DB+B=Bubbly).
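
The class merge described above (SS+SW=Stratified, DB+B=Bubbly) could be sketched as follows. The sketch works on class names, because the integer encoding of `Flow_label` is not shown in this notebook; converting the predicted integers to names first is left as an assumption, and `"OTHER"` is only a placeholder for any class that is not merged.

```python
# Merges stated in the text: SS + SW -> Stratified, DB + B -> Bubbly.
MERGED_NAME = {
    "SS": "Stratified",
    "SW": "Stratified",
    "DB": "Bubbly",
    "B": "Bubbly",
}


def merge_flow_classes(names):
    """Map class names to their merged names; other names pass through unchanged."""
    return [MERGED_NAME.get(name, name) for name in names]


# Hypothetical predictions, already converted from integer labels to names.
predicted_names = ["SS", "SW", "DB", "B", "OTHER"]
merged = merge_flow_classes(predicted_names)
print(merged)
```

The same mapping would be applied to the model predictions before comparing them with the 4-class labels.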

In [9]:
df_test = pd.read_csv('data/Test_bronze.csv') #Test Dataset
df_test = fe.bronze_to_gold(df_test)[kept_columns]
test_data = TabularDataset(df_test)
test_data.head()

| | Ang | FrL | FrG | X_LM_2 | Eo | Flow_label |
|---|---|---|---|---|---|---|
| 0 | 90.0 | 0.519709 | 1.863741 | 0.132197 | 16100.480648 | 0 |
| 1 | -5.0 | 0.413051 | 0.031999 | 1861.390527 | 174.589416 | 4 |
| 2 | -30.0 | 0.007566 | 2.67343 | 0.001522 | 87.456055 | 0 |
| 3 | 90.0 | 2.281396 | 0.106891 | 330.908469 | 12516.891615 | 5 |
| 4 | -5.0 | 0.416307 | 0.242226 | 57.493613 | 174.589416 | 4 |


In [13]:
predictor.evaluate_predictions(test_data[label], predictor.predict(test_data))

Evaluation: balanced_accuracy on test data: 0.9394913827575521
Evaluations on test data:
{
    "balanced_accuracy": 0.9394913827575521,
    "accuracy": 0.9531442663378545,
    "mcc": 0.9299095172733073
}


{'balanced_accuracy': 0.9394913827575521,
 'accuracy': 0.9531442663378545,
 'mcc': 0.9299095172733073}
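
The evaluation above reports a balanced accuracy that is lower than the plain accuracy. Balanced accuracy is the mean of the per-class recalls, so errors on small classes weigh more heavily. The sketch below shows the difference on a small made-up example; the labels are invented for illustration and are unrelated to the dataset used here.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
# Every class-0 sample is correct, but one of the two class-1 samples is missed.
y_pred = [0] * 8 + [0, 1]

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```

Here a single mistake on the minority class costs 10 points of accuracy but 25 points of balanced accuracy, which is why the metric is a reasonable choice when the classes cannot be rebalanced with SMOTE.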

In [16]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.964967       1.317260  77.054556                0.000935           0.975441            2       True         14
1              LightGBM   0.956090       0.026920   3.496750                0.026920           3.496750            1       True          5
2         LightGBMLarge   0.948826       0.110563   9.229619                0.110563           9.229619            1       True         13
3            LightGBMXT   0.940989       0.625116  14.187225                0.625116          14.187225            1       True          4
4               XGBoost   0.938457       0.012982   4.716712                0.012982           4.716712            1       True         11
5      RandomForestEntr   0.927024       0.147607   1.084121                0.147607           1.084121 

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetMXNet': 'TabularNeuralNetModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 0.5892385242385242,
  'KNeighborsDist': 0.6546697494837029,
  'NeuralNetFastAI': 0.7720105221500569,
  'LightGBMXT': 0.940988809407414,
  'LightGBM': 0.9560900036248873,
  'RandomForestGini': 0.906142441444767,
  'RandomForestEntr': 0.9270238943262199,
  'CatBoost': 0.9248801758569201,
  'ExtraTreesGini': 0.9249299399066842,
  'ExtraTreesEntr': 0.9215544714149365,
  'XGBoost': 0.9384574208760256,
  'NeuralNetMXNet': 0.8605617178640435,
  'LightGBMLarge': 0.94882

In [17]:
predictor.leaderboard(test_data, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBM | 0.943754 | 0.95609 | 0.069927 | 0.02692 | 3.49675 | 0.069927 | 0.02692 | 3.49675 | 1 | True | 5 |
| 1 | WeightedEnsemble_L2 | 0.939491 | 0.964967 | 2.411047 | 1.31726 | 77.054556 | 0.011594 | 0.000935 | 0.975441 | 2 | True | 14 |
| 2 | XGBoost | 0.938581 | 0.938457 | 0.328878 | 0.012982 | 4.716712 | 0.328878 | 0.012982 | 4.716712 | 1 | True | 11 |
| 3 | LightGBMLarge | 0.936598 | 0.948826 | 0.325949 | 0.110563 | 9.229619 | 0.325949 | 0.110563 | 9.229619 | 1 | True | 13 |
| 4 | LightGBMXT | 0.928099 | 0.940989 | 0.92628 | 0.625116 | 14.187225 | 0.92628 | 0.625116 | 14.187225 | 1 | True | 4 |
| 5 | ExtraTreesGini | 0.926861 | 0.92493 | 0.219497 | 0.148549 | 1.086065 | 0.219497 | 0.148549 | 1.086065 | 1 | True | 9 |
| 6 | ExtraTreesEntr | 0.920034 | 0.921554 | 0.252344 | 0.147995 | 1.077949 | 0.252344 | 0.147995 | 1.077949 | 1 | True | 10 |
| 7 | RandomForestEntr | 0.917629 | 0.927024 | 0.198205 | 0.147607 | 1.084121 | 0.198205 | 0.147607 | 1.084121 | 1 | True | 7 |
| 8 | RandomForestGini | 0.899841 | 0.906142 | 0.192747 | 0.151147 | 1.081809 | 0.192747 | 0.151147 | 1.081809 | 1 | True | 6 |
| 9 | CatBoost | 0.895867 | 0.92488 | 0.014314 | 0.006851 | 14.376304 | 0.014314 | 0.006851 | 14.376304 | 1 | True | 8 |
