# Feature Selection Methods

This notebook performs feature selection using the best model found during autoML. The following approaches are considered:
   - VarianceThreshold: uses the variance of each feature to decide whether it is meaningful
   - SelectKBest: uses the ANOVA F-test to keep only the most relevant features
   - SequentialForwardSelection: greedily builds a feature subset for an estimator, starting from the empty set
    
The first two are filter methods, meaning that they score each feature on its own and do not consider hidden interactions between variables. They are fast, but they do not guarantee the best possible model. Sequential Forward Selection, on the other hand, is a wrapper method: it accounts for interactions between features, but it is very slow when the number of features is large.
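
To illustrate why a filter can miss interactions, here is a minimal sketch on a synthetic XOR target. The data and the decision tree are assumptions made for illustration only; they are unrelated to the flow dataset and to the model used later in this notebook.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR data (illustrative only, not the flow dataset):
# the label depends on both features jointly, never on one alone.
X_toy = np.tile([[0, 0], [0, 1], [1, 0], [1, 1]], (25, 1)).astype(float)
y_toy = np.logical_xor(X_toy[:, 0], X_toy[:, 1]).astype(int)

# A univariate filter scores each feature in isolation.
f_scores, p_values = f_classif(X_toy, y_toy)

# A model that sees both features together can separate the classes.
tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
train_accuracy = tree.score(X_toy, y_toy)

print("F-scores per feature:", f_scores)
print("Training accuracy with both features:", train_accuracy)
```

Both features get an F-score of essentially zero, so a univariate filter would rank them as uninformative even though together they determine the label.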

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from script import utils

# Setting up

Load the gold dataframe (the one ready to be ingested by the machine learning algorithms, with missing values removed).

In [None]:
df = pd.read_csv('Data/Train_gold.csv')
df.head()

In [None]:
target = 'Flow_label'
features = df.columns.drop(target)

Split features from labels

In [None]:
X, y = df[features], df[target].values.ravel()

# Variance Threshold

This method filters out all features whose variance is below a user-defined threshold. Since our features have different scales, a single non-zero threshold would not be comparable across them, so the filter is only used here to remove constant features from the dataset. With a threshold of 0 the step does nothing else: it just drops constant columns.

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
constant_filter = VarianceThreshold(threshold=0)
_ = constant_filter.fit(X)
discarded_columns = X.columns[np.invert(constant_filter.get_support())]
discarded_columns

In [None]:
X = X[X.columns[constant_filter.get_support()]]
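
The scale dependence mentioned above can be seen on a small toy matrix. The values below are hypothetical: the same length expressed in metres and in millimetres, plus one constant column.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (hypothetical values): the same length in metres and
# in millimetres, plus a constant column.
length_m = np.array([1.0, 1.1, 0.9, 1.2, 0.8])
X_toy = np.column_stack([length_m, length_m * 1000.0, np.full(5, 3.0)])

# threshold=0 drops only the constant column, whatever the units.
keep_zero = VarianceThreshold(threshold=0).fit(X_toy).get_support()

# A positive threshold depends on the units: the metre column
# (variance about 0.02) is dropped, the millimetre column is kept.
keep_half = VarianceThreshold(threshold=0.5).fit(X_toy).get_support()

print(keep_zero)
print(keep_half)
```

The first two columns carry the same information, yet a positive threshold treats them differently. This is why only `threshold=0` is used here on unscaled features.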

# Univariate Feature Selection
The next step selects the K best-performing features according to a statistical test, here the ANOVA F-test applied to each variable in the dataframe. Note that this type of filter does not guarantee the most meaningful set of variables, since interactions are not taken into account. Nonetheless, it is a good test to perform. For the time being we keep 80% of the features; this fraction is an arbitrary choice made when creating this notebook.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, mutual_info_classif

In [None]:
feat_to_keep = round(0.8*len(X.columns))
univariate_model = SelectKBest(score_func=f_classif, k=feat_to_keep)

In [None]:
univariate_model.fit(X, y)
print("Features Discarded: \n", X.columns[np.invert(univariate_model.get_support())])
print("Features Kept: \n", X.columns[univariate_model.get_support()])
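
For reference, the F-statistic that `f_classif` computes for each feature is the ratio of between-class to within-class mean squares. The sketch below reproduces it by hand on hypothetical toy data (one feature, three classes) and compares it with scikit-learn's value.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Toy data (hypothetical): one feature, three classes of three samples.
x = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y_toy = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

classes = np.unique(y_toy)
grand_mean = x.mean()

# Between-class and within-class sums of squares.
ss_between = sum(
    (y_toy == c).sum() * (x[y_toy == c].mean() - grand_mean) ** 2 for c in classes
)
ss_within = sum(((x[y_toy == c] - x[y_toy == c].mean()) ** 2).sum() for c in classes)

# F = (between-class mean square) / (within-class mean square).
df_between = len(classes) - 1
df_within = len(x) - len(classes)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# f_classif expects a 2-D feature matrix and returns (F, p-values).
f_sklearn, _ = f_classif(x.reshape(-1, 1), y_toy)

print(f_manual, f_sklearn[0])
```

A large F means the class means of that feature are far apart relative to the spread within each class, which is what `SelectKBest` ranks on.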

## Plot the Feature Importance
Once we have determined the best-performing features, the next step is to plot their importance.

In [None]:
f_scores, p_values = f_classif(X, y)  # f_classif returns (F-statistics, p-values)
imp_f_classif = pd.DataFrame({'Features': X.columns, 'F_score': f_scores}).sort_values(by='F_score', ascending=False)
fig, axes = plt.subplots(figsize=(35,10))  
axes.set_title("ANOVA F-statistics",fontsize=30)
plt.bar(range(imp_f_classif.shape[0]), imp_f_classif.F_score, align="center")
plt.xticks(range(imp_f_classif.shape[0]), imp_f_classif['Features'], rotation='vertical', fontsize=30)
plt.yticks(fontsize=30)
plt.xlim([-1, imp_f_classif.shape[0]])
plt.ylabel('F(λ)', fontsize=30)
plt.xlabel('Features', fontsize=30)
plt.savefig('Plots/FeatureSelection/ANOVA.png', dpi=fig.dpi, bbox_inches='tight')
plt.show()

The best-performing feature is clearly the liquid holdup, followed by the features describing the phase velocities.

# Step Forward Feature Selection
Here we perform step forward feature selection. This technique uses a greedy strategy to build the best-performing model, adding one feature at a time. Note that a feature, once kept, is never discarded in later steps unless the `floating` option is set to `True`.

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

In [None]:
import lightgbm as lgbm

In [None]:
estimator = lgbm.LGBMClassifier()

In [None]:
sfs = SFS(estimator, k_features=8, forward=True, floating=True, scoring='accuracy', verbose=2, cv=5, n_jobs=-1)

In [None]:
sfs.fit(X, y)
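
The greedy idea behind forward selection can be sketched in a few lines of plain Python. This is not mlxtend's implementation: there is no cross-validation and no floating step, and `forward_select` and `toy_score` are hypothetical names with made-up scores, used only to show the mechanics.

```python
def forward_select(candidates, score_fn, k):
    """Greedy forward selection: add the best-scoring feature k times."""
    selected = []
    remaining = list(candidates)
    for _ in range(k):
        # Score every one-feature extension of the current subset.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_score(subset):
    """Hypothetical score: individual gains plus a b-c interaction bonus."""
    gains = {"a": 0.5, "b": 0.3, "c": 0.1}
    bonus = 0.5 if {"b", "c"} <= set(subset) else 0.0
    return sum(gains[f] for f in subset) + bonus


greedy_subset = forward_select(["a", "b", "c"], toy_score, k=2)
print(greedy_subset, toy_score(greedy_subset))
print(["b", "c"], toy_score(["b", "c"]))
```

In this toy case the greedy search commits to `a` first and ends with `['a', 'b']`, although `['b', 'c']` scores higher. A floating variant can revisit earlier choices, which is one motivation for setting `floating=True`.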

## Visualization

This section visualizes the results generated by the SFS. Since we are interested in metrics other than the one used for SFS, we have to retrain the models.
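
One reason to look beyond accuracy is class imbalance, which the SMOTE option in this notebook suggests may be present. The labels below are hypothetical and only show how accuracy and macro F1 can disagree.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: nine samples of class 0, one of class 1.
y_true = [0] * 9 + [1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.3f}, macro F1={macro_f1:.3f}")
```

Accuracy rewards the majority-class guess, while macro F1 averages the per-class F1 scores and is pulled down by the class that is never predicted.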

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score
from imblearn.over_sampling import SMOTE
import plotly.graph_objects as go

In [None]:
info_di = sfs.get_metric_dict()
l = [list(info_di[x]['feature_names']) for x in info_di.keys()]

In [None]:
def generate_info(l, cv=5, balance_classes=True):
    '''
    Build a dictionary of cross-validated metrics for each
    feature subset.
    
    Input:
        l: list of feature subsets (each one a list of feature names)
        cv: number of cross-validation folds
        balance_classes: whether or not to apply SMOTE to the training folds
    Output:
        info_df: a dictionary mapping each subset index to its feature
                 names and the mean/std of accuracy and macro F1
    
    '''
    info_df = {}
    
    cv_info = np.zeros((cv, 2))
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42) #SKF
    
    y_sfs = df[target].values.ravel()
    for i, features in enumerate(l, start=1):        
        X_sfs = X[features].values
        
        for j, (train_idx, valid_idx) in enumerate(skf.split(X_sfs, y_sfs)):
            X_train, y_train = X_sfs[train_idx], y_sfs[train_idx]
            X_valid, y_valid = X_sfs[valid_idx], y_sfs[valid_idx]

            if balance_classes:
                # Oversample the training fold only, so validation data stays untouched
                X_train, y_train = SMOTE().fit_resample(X_train, y_train)

            model = lgbm.LGBMClassifier()
            model.fit(X_train, y_train)

            y_pred = model.predict(X_valid)

            cv_info[j, 0] = accuracy_score(y_valid, y_pred)
            cv_info[j, 1] = f1_score(y_valid, y_pred, average='macro')
                
        info_df[i] = {
            'feature_names'  : features,            
            'mean_acc' : np.mean(cv_info[:, 0]),
            'std_acc'  : np.std(cv_info[:, 0]),
            'mean_f1'  : np.mean(cv_info[:, 1]),
            'std_f1'   : np.std(cv_info[:, 1]),
        }         
                        
    return info_df


In [None]:
info_di = generate_info(l, balance_classes=False)

In [None]:
scores = pd.DataFrame.from_dict(info_di).T
scores

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=scores.index, y=scores['mean_acc'],
    error_y=dict(type='data', array=3*scores['std_acc']),
    name='Accuracy'
))

fig.add_trace(go.Scatter(
    x=scores.index, y=scores['mean_f1'],
    error_y=dict(type='data', array=3*scores['std_f1']),
    name='F1 score'
))

fig.update_layout(
    width=800,
    height=600,
    title="Sequential Floating Forward Selection Results",
    xaxis_title="Features used",
    yaxis_title="Metric value",
    legend_title="Metric",
    #paper_bgcolor='rgb(239,239,239)',
    #plot_bgcolor='rgb(255,255,255)'
)

fig.add_annotation(
            x=6,  # arrows' head
            y=scores.iloc[5]['mean_acc'],  # arrows' head
            ax=6.3,  # arrows' tail
            ay=scores.iloc[5]['mean_acc']-0.1,  # arrows' tail
            xref='x',
            yref='y',
            axref='x',
            ayref='y',
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowwidth=1,
            font=dict(
                size=20,
            ),
            text=r"$\theta, Re_{L}, Fr_{G}, Fr_{L}, X_{LM}, Eo$",
            #bordercolor="#ff7f0e",
            #borderwidth=2,
            #borderpad=4,
            #bgcolor="#ffffff",
)

#fig.update_yaxes(showgrid=True,  gridcolor="grey", linecolor='black', mirror=True)
#fig.update_xaxes(showgrid=True,  gridcolor="grey", linecolor='black', mirror=True)


#fig.write_image(f"Plots/FeatureSelection/SFFS.png", scale=2)
fig.show()