# Stacked Classifier - Optuna

This notebook performs AutoML with Bayesian optimization (Optuna) to find the best possible estimator for a classification problem.
The best estimators are then aggregated by a meta-learner to build a stacked classifier. Given the time complexity
of the problem at hand, the StackedClassifier class can perform parallel trials on multiple machines.
*This requires setting up a proper connection to a database; see the function "database_location" of the StackedClassifier class.*

The StackedClassifier class is very flexible and can accommodate any type of "in-fold" operation, such as normalizing the input features or oversampling the minority classes. To see the parameters explored in each Optuna trial, check "./script/model_optuna.py".
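
As a minimal sketch of what "in-fold" means (this is not the StackedClassifier implementation, and the toy data and the logistic-regression model are assumptions made for illustration), the scaler below is wrapped in a scikit-learn `Pipeline` so that it is re-fitted on the training part of every fold and never sees the held-out part:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced toy data (illustrative only, not the flow-regime dataset)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# The scaler is part of the estimator, so each CV fold fits it on its own training split
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores)
```

Oversampling can be handled in the same spirit with the `Pipeline` class from `imblearn.pipeline`, which applies samplers such as SMOTE only while fitting, so the validation part of each fold is left untouched.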

Additionally, this notebook is used to perform a form of model evaluation, by plotting the results against unseen experimental conditions.

In [1]:
#For development
#Reload the library when a change is detected in one of the imported libraries
%load_ext autoreload 
%autoreload 2 

In [2]:
from script import stacking
from script import utils
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

2021-09-02 08:48:43.972340: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-02 08:48:43.972377: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [3]:
df = pd.read_csv('Data/Train_bronze.csv') #Train Dataset
df = utils.bronze_to_silver(df)
df.head()

Unnamed: 0,P,T,DenL,DenG,VisL,VisG,ST,ID,Roughness,Ang,Vsl,Vsg,Flow_label
0,100.0,25.0,879.8,1.3,0.483,1.8e-05,0.03,0.0508,0.0,0.0,0.01094,8.03354,3.0
1,100.0,25.0,1000.0,1.18,0.001,1.5e-05,0.07,0.051,0.0,-1.0,1.8,0.025,1.0
2,100.0,25.0,1000.0,1.18,0.001,1.5e-05,0.07,0.025,0.0,-90.0,0.986322,0.021483,2.0
3,100.0,25.0,1000.0,1.18,0.001,1.5e-05,0.07,0.051,0.0,0.0,0.1,16.0,0.0
4,100.0,25.0,1000.0,1.18,0.001,1.5e-05,0.07,0.025,0.0,90.0,1.57362,0.967343,2.0


In [4]:
kept_columns = ['Ang', 'FrL', 'FrG', 'X_LM_2', 'Eo', 'Flow_label'] #Kept Columns (problem specific)
algos = ["LightGBM"] #Algos to explore

## Create a Stacked Classifier object

An object of the StackedClassifier class is instantiated. It holds information about the models to explore, the columns to keep, and the in-fold operations to perform while the models are optimized.

In [5]:
sc = stacking.StackedClassifier(base_algos=algos, balance_method=SMOTE(),
                                         database_loc='local', cv=5)

In [6]:
sc.make_clean('results/info/') #REMOVE THE FOLDER SPECIFIED
sc.make_clean('results/db/') #REMOVE THE FOLDER SPECIFIED

### Optimize the Base Models

We can now optimize the models using the training dataframe. Pass a dataframe (where the last column is the target) and the number of trials to explore for each algorithm. Go run a marathon; with some luck, the AutoML process will be completed by the time you are back. You can interrupt this process at any time without losing any information. If you already have a trained base, just set the second argument to -1.

In [9]:
sc.optimize_base(df, 20, kept=kept_columns)

[I 2021-09-02 08:58:33,037] Using an existing study with name 'LightGBM optimization' instead of creating a new one.
[I 2021-09-02 08:58:34,863] Trial 40 pruned.
[I 2021-09-02 08:58:38,759] Trial 41 pruned.
[I 2021-09-02 08:58:41,199] Trial 42 pruned.
[I 2021-09-02 08:58:44,005] Trial 43 pruned.
[I 2021-09-02 08:58:46,511] Trial 44 pruned.
[I 2021-09-02 08:58:47,896] Trial 45 pruned.
[I 2021-09-02 08:58:50,873] Trial 46 pruned.
[I 2021-09-02 08:58:52,250] Trial 47 pruned.
[I 2021-09-02 08:58:53,453] Trial 48 pruned.
[I 2021-09-02 08:58:55,348] Trial 49 pruned.
[I 2021-09-02 08:58:56,565] Trial 50 pruned.
[I 2021-09-02 08:58:59,070] Trial 51 pruned.
[I 2021-09-02 08:59:01,490] Trial 52 pruned.
[I 2021-09-02 08:59:03,462] Trial 53 pruned.
[I 2021-09-02 08:59:05,142] Trial 5

### Optimize the Meta-Learner

Optimize the meta learner using the exact same approach as before.
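
For readers unfamiliar with stacking, here is a minimal sketch of the idea using scikit-learn's own `StackingClassifier` (not the StackedClassifier class of this project). The base models, the logistic-regression meta-learner and the synthetic data are assumptions chosen to keep the example light; the notebook itself uses LightGBM as the meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem (illustrative only)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Base models produce out-of-fold predictions (cv=5);
# the meta-learner is trained on those predictions, not on the raw features
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
test_accuracy = stack.score(X_te, y_te)
print("Stacked accuracy on held-out data:", test_accuracy)
```

Training the meta-learner on out-of-fold predictions is what keeps it from simply rewarding the base model that overfits the training data the most.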

In [10]:
sc.train_meta('LightGBM', n_trials=5)

[I 2021-09-02 08:59:50,068] A new study created in memory with name: LightGBM meta optimization
[I 2021-09-02 08:59:50,516] Trial 0 finished with value: 0.9457292103503047 and parameters: {'num_leaves': 8, 'learning_rate': 0.02, 'n_estimators': 66, 'subsample': 1.0, 'subsample_freq': 1, 'reg_alpha': 0.0500060150217337, 'reg_lambda': 0.10452603478326139, 'colsample_bytree': 0.4, 'min_child_samples': 166}. Best is trial 0 with value: 0.9457292103503047.
[I 2021-09-02 08:59:51,730] Trial 1 finished with value: 0.9457292103503047 and parameters: {'num_leaves': 394, 'learning_rate': 0.017, 'n_estimators': 156, 'subsample': 0.5, 'subsample_freq': 100, 'reg_alpha': 0.000587754234691692, 'reg_lambda': 0.0001288393909671829, 'colsample_bytree': 0.8, 'min_child_samples': 153}. Best is trial 0 with value: 0.9457292103503047.
[I 2021-09-02 08:59:53,234] Trial 2 finished with value: 0.9457292103503047 and parameters: {'num_leaves': 391, 'learning_rate

### Train and Save the Base Models

We are now ready to train and save the best models found, using all the training data.

In [11]:
df_train = pd.read_csv('Data/Dataset.csv')
df_train = utils.bronze_to_gold(df_train, balance_method=SMOTE(), kept_columns=kept_columns)
X_train, y_train = df_train.iloc[:,:-1].values, df_train.iloc[:,[-1]].values.ravel()

In [12]:
first_it = True
if first_it:
    #Train
    sc.train_base(X_train, y_train, remove_old=True)
else:
    #Load
    sc.load_base()

In [13]:
def predict_df(df_bronze, n_class=6):
    
    '''
    Simple function to check the performance of the single
    estimators on the test data
    '''
    
    df_gold = utils.bronze_to_gold(df_bronze, kept_columns=kept_columns)
    
    X_test = df_gold.iloc[:,:-1].values    
    y_test = df_gold.iloc[:,[-1]].values.ravel()
    
    for algo, model in sc.get_base().items():
        print(algo)
        y_pred = model.predict(X_test)
        y_pred = y_pred if y_pred.ndim==1 else y_pred.argmax(1)
        
        if n_class==4:
            y_pred = np.where(y_pred==5, 1, y_pred)
            y_pred = np.where(y_pred==4, 3, y_pred)
        
        utils.print_performance(y_test, y_pred)
    
    print('StackedClassifier')
    y_pred = sc.predict(X_test)
    if n_class==4:
        y_pred = np.where(y_pred==5, 1, y_pred)
        y_pred = np.where(y_pred==4, 3, y_pred)
    
    utils.print_performance(y_test, y_pred)
    

## Model Evaluation

We can now check the performance of the single estimators on the test set. This is just a final check: if everything went smoothly, the estimator that scored best during optimization should also be the best performing one here.

## Extra Evaluation

The test set was collected under experimental conditions equal to those of the training set. This clearly does not reflect a real-case scenario, where the experimental conditions may differ from those of the training dataset. To this end, the dataset from the study of Mexico numba one is used to test the pipeline against previously unseen experimental conditions. Note that this dataset contains only 4 types of regimes, so the output of the pipeline has to be modified accordingly (by aggregating the dispersed and stratified regimes).
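
The aggregation from 6 to 4 classes is the pair of `np.where` calls used in `predict_df` (label 5 becomes 1, label 4 becomes 3). As a small sketch, the same mapping can be written as a lookup array; the sample predictions below are made up for illustration.

```python
import numpy as np

# Label codes used in this notebook: 0: A, 1: DB, 2: I, 3: SW, 4: SS, 5: B
# Lookup table: position = 6-class label, value = 4-class label
merge_map = np.array([0, 1, 2, 3, 3, 1])

# Made-up 6-class predictions (illustrative only)
y_pred_6 = np.array([0, 5, 4, 2, 3, 1, 5])

# Fancy indexing applies the lookup to every prediction at once
y_pred_4 = merge_map[y_pred_6]

# Equivalent to the two np.where calls used in predict_df
y_where = np.where(y_pred_6 == 5, 1, y_pred_6)
y_where = np.where(y_where == 4, 3, y_where)

print(y_pred_4)  # [0 1 3 2 3 1 1]
```

A lookup array keeps the whole mapping in one place, which is convenient if further regimes ever need to be merged.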

In [14]:
df_test_secret = pd.read_csv('Data/Secret/Test_secret.csv')
df_secret_ID = df_test_secret.loc[(df_test_secret["ID"]<0.05) & (df_test_secret["ID"]>0.01) & (df_test_secret['P']<300)]
predict_df(df_test_secret, n_class=4)

LightGBM
Mean Accuracy:  0.7426687883609912
Mean F1 score:  0.710848370944936


Single class Accuracy:  [0.59647189 0.62676056 0.85403151 0.68798956]
Single class F1 score:  [0.68308081 0.62620932 0.79132675 0.7427766 ]

Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.60      0.68       907
           1       0.63      0.63      0.63       568
           2       0.74      0.85      0.79      2158
           3       0.81      0.69      0.74       766

    accuracy                           0.74      4399
   macro avg       0.74      0.69      0.71      4399
weighted avg       0.75      0.74      0.74      4399


Confusion Matrix:
 [[ 541   39  282   45]
 [  15  356  194    3]
 [  70  167 1843   78]
 [  51    7  181  527]]
StackedClassifier
Mean Accuracy:  0.7426687883609912
Mean F1 score:  0.710848370944936


Single class Accuracy:  [0.59647189 0.62676056 0.85403151 0.68798956]
Single class F1 score:  [0.68308081 0.62620932 
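
The figures printed above can be cross-checked from the confusion matrix alone. The sketch below assumes, since the implementation of `utils.print_performance` is not shown here, that "Single class Accuracy" is the per-class recall (diagonal divided by row sum) and that "Mean Accuracy" is the overall fraction of correct predictions; the numbers agree with the LightGBM output under that assumption.

```python
import numpy as np

# Confusion matrix copied from the LightGBM output above
# (rows: true class, columns: predicted class)
cm = np.array([
    [541,  39,  282,  45],
    [ 15, 356,  194,   3],
    [ 70, 167, 1843,  78],
    [ 51,   7,  181, 527],
])

support = cm.sum(axis=1)             # samples per true class
per_class = np.diag(cm) / support    # per-class recall
overall = np.trace(cm) / cm.sum()    # overall accuracy

print("Support:            ", support)
print("Per-class accuracy: ", per_class)
print("Overall accuracy:   ", overall)
```

Reading the off-diagonal terms column by column also shows where the errors go: most misclassified samples of classes 0, 1 and 3 are predicted as class 2.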

## Results Visualization

It is now time to present the results in a respectable manner

In [14]:
import plotly.express as px
import plotly.graph_objects as go

In [15]:
algos = ["LightGBM", "RandomForest", "MLP"]

In [16]:
def bar_plot_studies(df_test, algos=algos, n_class=6):
    
    '''
    Another simple function to show the accuracy on the single studies
    '''
    
    author_list = list(df_test['Author'].value_counts().index)
    model_info = pd.DataFrame(index=author_list, columns=algos)

    for author in author_list:

        di = {}

        df_author = df_test.loc[(df_test['Author']==author)]
        df_author = utils.bronze_to_gold(df_author, kept_columns=kept_columns)
        X_test = df_author.iloc[:,:-1].values
        y_test = df_author.iloc[:,[-1]].values.ravel()

        for algo, model in sc.get_base().items():
            y_pred = model.predict(X_test)
            y_pred = y_pred if y_pred.ndim==1 else y_pred.argmax(1)
            if n_class==4:
                y_pred = np.where(y_pred==5, 1, y_pred)
                y_pred = np.where(y_pred==4, 3, y_pred)
            
            di[algo] = accuracy_score(y_test, y_pred)
        
        y_pred = sc.predict(X_test)
        if n_class==4:
            y_pred = np.where(y_pred==5, 1, y_pred)
            y_pred = np.where(y_pred==4, 3, y_pred)
        
        di["StackedEnsamble"] = accuracy_score(y_test, y_pred)
        model_info.loc[author] = di 

    model_info = model_info.sort_index()
    fig = go.Figure(data=[go.Bar(name=algo, y=model_info[algo], x=model_info.index) for algo in algos])
    fig.update_yaxes(title="Accuracy")
    fig.update_xaxes(title="Independent Study", tickangle=45)
    fig.update_layout(barmode='group', title="Prediction Accuracy on different studies")
    
    fig.write_image(f"Plots/Others/prediction_accuracy_{len(author_list)}.png", scale=2)
    
    fig.show()
    
    return model_info

In [None]:
mi = bar_plot_studies(df_test_secret, n_class=4)

In [None]:
averages = pd.DataFrame({"mean" : mi.mean(), "std" : mi.std()})

fig = go.Figure(data=[go.Bar(name="Averages", 
                             y=averages["mean"], 
                             x=averages.index,
                             error_y=dict(type='data', array=averages['std']))])

fig.update_layout(barmode='group')
fig.show()

In [None]:
infos = pd.read_parquet('results/info/logs.parquet')
algos = ["LightGBM", "RandomForest", "MLP", "TabNet", "XGBoost"]
infos = infos.loc[infos["Algo"].isin(algos)]

bar = infos.loc[infos.groupby(['Algo'])['Accuracy'].idxmax()].sort_values(by='Accuracy')
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Accuracy',
    x=bar['Algo'], y=round(bar['Accuracy'], 3),
    error_y=dict(type='data', array=3*round(bar['Accuracy_std'], 3)),
    text=round(bar['Accuracy'], 3),
    textposition='none'  # hide bar labels; a single value applies to every bar
))
fig.add_trace(go.Bar(
    name='F1_score',
    x=bar['Algo'], y=round(bar['F1_score'], 3),
    error_y=dict(type='data', array=3*round(bar['F1_score_std'], 3)),
    text=round(bar['F1_score'], 3),
    textposition='none'  # hide bar labels; a single value applies to every bar
))

fig.update_layout(
    width=800,
    height=450,
    title="Cross Validation Results",
    xaxis_title="Machine learning model",
    yaxis_title="Metric value",
    legend_title="Metric"
)

fig.write_image("Plots/Others/CrossValidationMetrics.png", scale=2)
fig.show()


In [None]:
model = utils.load_model('./results/Models', model_type="LightGBM")

In [None]:
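# Use the filtered subset built earlier (0.01 < ID < 0.05 and P < 300);
# re-reading the CSV here was redundant because it was overwritten immediately
df_ = df_secret_ID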
df_pred = utils.bronze_to_gold(df_, kept_columns=kept_columns)
y_pred = model.predict(df_pred.iloc[:,:-1].values)

df_ = utils.bronze_to_gold(df_)
di = {0: 'A', 1: 'DB', 2: 'I', 3: 'SW', 4: 'SS', 5:'B'}
for q in ['FrL', 'FrG', 'Eo']:
    df_[f'log({q})'] = np.log10(df_[q])

df_["Predicted"] = y_pred
df_["Correct"] = (df_['Flow_label']==df_['Predicted'])

In [None]:
dimensions = list([ 
                   dict(label='log(FrL)', values=np.log10(df_['FrL'])),
                   dict(label='log(FrG)', values=np.log10(df_['FrG'])),
                   dict(label='log(Eo)', values=np.log10(df_['Eo'])),
                   dict(label='Ang', values=df_['Ang']),
                   dict(range=[0,df_['Flow_label'].max()],
                       tickvals = list(di.keys()), ticktext =list(di.values()),
                       label='Flow Regime', values=df_['Flow_label']),
                    dict(range=[0,df_['Flow_label'].max()],
                       tickvals = list(di.keys()), ticktext =list(di.values()),
                       label='Predicted', values=df_['Predicted']),
                   dict(label="Correct", values=df_['Correct'].astype('int'))
                  ])

fig = go.Figure(data=go.Parcoords(line = dict(color = df_['Flow_label'], 
                                colorscale = 'RdBu'), dimensions=dimensions))
fig.show()

In [None]:
#df_ = df_.join(pd.read_csv('Data/Secret/Test_secret.csv'), how="left", lsuffix="", rsuffix="Right")
df_ = df_.join(df_secret_ID, how="left", lsuffix="", rsuffix="Right")
dimensions = list([ 
                   dict(label='log(FrL)', values=np.log10(df_['FrL'])),
                   dict(label='log(FrG)', values=np.log10(df_['FrG'])),
                   dict(label='ID', values=df_['ID']),
                   dict(label='Ang', values=df_['Ang']),
                   dict(range=[0,df_['Flow_label'].max()],
                        tickvals = list(di.keys()), ticktext =list(di.values()),
                       label='Flow Regime', values=df_['Flow_label']),
                   dict(label="Correct", values=df_['Correct'].astype('int'))
                  ])

fig = go.Figure(data=go.Parcoords(line = dict(color = df_['Correct'], 
                                colorscale = [(0.00, "red"),   (0.33, "red"),
                                                     (0.33, "blue"), (0.66, "blue"),
                                                     (0.66, "green"),  (1.00, "green")]), dimensions=dimensions))
fig.show()