In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Feature Importance

This Jupyter Notebook summarizes feature importance methods, with example code, from *Machine Learning for Asset Managers* by Marcos Lopez de Prado.

## p-Values

Caveats of p-values:
1. They rely on strong assumptions
    - correct model specification
    - mutually uncorrelated regressors
    - white noise residuals
2. For highly multicollinear (mutually correlated) explanatory variables, p-values cannot be robustly estimated
    - traditional regression methods cannot discriminate among redundant explanatory variables
3. They evaluate a probability that is not entirely relevant
    - Given a null hypothesis H0 and an estimated coefficient β, the p-value estimates the probability of obtaining a result equal to or more extreme than β, subject to H0 being true
    - However, researchers are often more interested in a different probability, namely, the probability of H0 being true, subject to having observed β
    - This probability can be computed using Bayes' theorem, albeit at the expense of making additional assumptions (Bayesian priors)
4. They assess significance in-sample
    - The entire sample is used to solve two tasks: estimating the coefficients and determining their significance
    - Running multiple in-sample tests on the same data set is likely to produce a false discovery, a practice known as p-hacking
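
Caveat 3 can be made concrete with a small Bayes' theorem calculation. The sketch below is illustrative only: the prior probability of H0, the significance level, and the test power are assumed numbers, not values from the source, and `prob_h0_given_rejection` is a hypothetical helper name.

```python
def prob_h0_given_rejection(prior_h0, alpha, power):
    """P[H0 true | the test rejected H0], via Bayes' theorem."""
    false_pos = alpha * prior_h0            # H0 is true, yet it is rejected
    true_pos = power * (1.0 - prior_h0)     # H0 is false and it is rejected
    return false_pos / (false_pos + true_pos)

# Assumed inputs: 90% of tested features are truly irrelevant,
# 5% significance level, 80% power.
posterior = prob_h0_given_rejection(prior_h0=0.9, alpha=0.05, power=0.8)
print(f"P[H0 | rejection] = {posterior:.3f}")
```

Under these assumed inputs, a "significant" result still leaves a sizeable probability that H0 is true, which is the quantity a p-value does not report.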

In [None]:
from sklearn.datasets import make_classification

def getTestData(n_features=100, n_informative=25, n_redundant=25, n_samples=10000, random_state=0, sigmaStd=.0):
    
    # generate a random dataset for a classification problem
    np.random.seed(random_state)
    X,y = make_classification(n_samples=n_samples, n_features=n_features-n_redundant, n_informative=n_informative, n_redundant=0, shuffle=False, random_state=random_state)
    
    # name the columns
    cols = ['I' + str(i) for i in range(n_informative)] # I = Informative
    cols += ['N' + str(i) for i in range(n_features - n_informative - n_redundant)] # N = Noise
    
    # make dataframe
    X,y = pd.DataFrame(X, columns=cols),pd.Series(y)
    
    # choose random informative features to make redundant
    i = np.random.choice(range(n_informative), size=n_redundant)

    # make inverted dict to print
    r_to_i_map = {f"R{k}":f"I{v}" for k,v in enumerate(i)}  # Ri : Ii
    i_to_r_map = {}
    for k,v in r_to_i_map.items():
        i_to_r_map[v] = i_to_r_map.get(v, []) + [k] # Ii : Ri
    print(f"Informative Features used to generate Redundant Features: ")
    for k,v in i_to_r_map.items():
        print(f"{k} : {v}")

    # make redundant features
    for k,j in enumerate(i):
        X['R' + str(k)] = X['I' + str(j)] + np.random.normal(size = X.shape[0]) * sigmaStd # R = Redundant

    return X,y

In [None]:
import statsmodels.discrete.discrete_model as sm

# fit logit model on generated test data and obtain p-values
X,y = getTestData(40,5,30,10000,sigmaStd=.1)
ols = sm.Logit(y,X).fit(disp=0)
pvalues = ols.pvalues.sort_values(ascending=False)

# plot the p-values associated with the coefficients
fig = go.Figure()
fig.add_trace(go.Bar(x=pvalues, y=pvalues.index, orientation='h'))
fig.update_layout(
    title="p-values of the coefficients",
    xaxis_title="p-values",
    yaxis_title="Features",
    height=800,
    width=800, 
)
fig.add_vline(x=0.05, line_width=2, line_dash="dash", line_color="black")
fig.show()


Results:
- Only four of the thirty-five non-noise features are deemed statistically significant: I1, R29, R27, I3
- Noise features are ranked as relatively important (positions 9, 11, 14, 18, and 26)
- Fourteen of the features ranked as least important are not noise

$\rightarrow$ p-values misrepresent the ground truth

## Mean-Decrease Impurity

- Sample of size $N$, $F$ features $\{X_f\}_{f=1,...,F}$ and one label per observation.
- At each node $t$, a tree-based algorithm splits the labels into two samples: for a given feature $X_f$, labels in node $t$ associated with a value of $X_f$ below a threshold $\tau$ are placed in the left sample, and the rest in the right sample

The information gain that results from a split is measured in terms of the resulting reduction in impurity. Here we use Entropy as a measure of impurity but other measures are also possible (e.g. Gini impurity).

<u>Impurity Measure: Entropy</u></br>

$i[t] = - \sum_{j=1}^{J} p(j|t) \log_{2} p(j|t)$

- $p(j|t)$ is the proportion of observations of class $j$ at node $t$
- $J$ is the number of classes


<u>Information Gain</u></br>

$\Delta g[t,f] = i[t] - \frac{N_{t}^{(0)}}{N_t} i[t^{(0)}] - \frac{N_{t}^{(1)}}{N_t} i[t^{(1)}]$

Where
- $i[t]$ is the impurity of labels at node t before split
- $i[t^{(0)}]$ is the impurity of labels in the left sample
- $i[t^{(1)}]$ is the impurity of labels in the right sample

At each node $t$, the classification algorithm evaluates $\Delta g[t,f]$ for various features in $\{X_f\}_{f=1,...,F}$

Mathematics Sklearn: https://scikit-learn.org/stable/modules/tree.html#tree-mathematical-formulation
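
The entropy and information gain formulas above can be checked on a toy node. This is a minimal sketch, not how scikit-learn implements splitting; the helper names `entropy` and `information_gain` are hypothetical, and the split rule `x <= tau` is an assumption about how ties at the threshold are handled.

```python
import numpy as np

def entropy(labels):
    """Entropy i[t] (in bits) of the labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(x, labels, tau):
    """Impurity reduction from splitting the node on x <= tau."""
    x, labels = np.asarray(x), np.asarray(labels)
    left, right = labels[x <= tau], labels[x > tau]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # degenerate split: nothing is separated
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - w_left * entropy(left) - w_right * entropy(right)

x = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([0, 0, 1, 1])
gain_good = information_gain(x, y, tau=0.5)   # separates the classes perfectly
gain_bad = information_gain(x, y, tau=0.15)   # leaves the right sample mixed
print(gain_good, gain_bad)
```

A threshold that separates the two classes completely recovers the full one bit of entropy of the parent node, while a poorly placed threshold recovers only part of it.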

<u>Mean Decrease Impurity (MDI)</u></br>

The importance of a feature can be computed as the weighted information gain ($\Delta g[t,f]$) across all nodes where that feature was selected.

MDI was introduced by Breiman (2001), Random Forests: https://link.springer.com/article/10.1023/A:1010933404324

For algorithms that combine ensembles of trees, like random forests, we can further estimate the mean and variance of MDI values for each feature across all trees. These mean and variance estimates, along with the central limit theorem, are useful in testing the significance of a feature against a user-defined null hypothesis.
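
As a rough illustration of such a test, the sketch below computes a t-statistic per feature from a matrix of per-tree MDI values. The per-tree importances are simulated, the helper `mdi_t_stats` is hypothetical, and the null value $1/F$ (every feature equally important) is an assumed choice of user-defined null hypothesis.

```python
import numpy as np

def mdi_t_stats(per_tree_mdi, null_value):
    """t-statistic of each feature's mean MDI against a null importance."""
    per_tree_mdi = np.asarray(per_tree_mdi, dtype=float)
    n_trees = per_tree_mdi.shape[0]
    mean = per_tree_mdi.mean(axis=0)
    # standard error of the mean across trees (central limit theorem)
    std_err = per_tree_mdi.std(axis=0, ddof=1) / np.sqrt(n_trees)
    return (mean - null_value) / std_err

rng = np.random.default_rng(0)
n_trees, n_features = 500, 4
# Simulated per-tree importances: feature 0 is truly more important.
true_mean = np.array([0.40, 0.20, 0.20, 0.20])
per_tree = true_mean + rng.normal(scale=0.05, size=(n_trees, n_features))
t_stats = mdi_t_stats(per_tree, null_value=1.0 / n_features)
print(t_stats.round(1))
```

A large positive t-statistic suggests a feature is more important than the null implies; a large negative one suggests it is less important.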

<u>Bootstrap Aggregation (Bagging)</u></br>

1. Draw $B$ bootstrap samples from the training data (random sampling with replacement)
2. Train a classification tree $T_b$ on each bootstrap sample $b=1,...,B$
3. The ensemble forecast is the simple average of the individual forecasts from the $B$ models. In the case of categorical variables, the probability that an observation belongs to a class is given by the proportion of estimators that classify that observation as a member of that class (majority voting).
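
Steps 1 and 3 can be sketched with plain NumPy. The tree predictions below are made-up numbers standing in for the output of step 2, and the helper names `bootstrap_indices` and `vote_probabilities` are hypothetical; in the rest of this notebook, `BaggingClassifier` performs these steps internally.

```python
import numpy as np

def bootstrap_indices(n_samples, n_estimators, rng):
    """Draw B index sets of size N, sampling with replacement (step 1)."""
    return rng.integers(0, n_samples, size=(n_estimators, n_samples))

def vote_probabilities(predictions, n_classes):
    """Share of estimators voting for each class, per observation (step 3).

    predictions: integer array of shape (B, N), one predicted class per
    estimator and observation. Returns an (N, n_classes) array.
    """
    predictions = np.asarray(predictions)
    return np.stack(
        [(predictions == k).mean(axis=0) for k in range(n_classes)], axis=1
    )

rng = np.random.default_rng(0)
idx = bootstrap_indices(n_samples=6, n_estimators=3, rng=rng)

# Made-up class predictions of B=3 trees for N=4 observations
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 0]])
proba = vote_probabilities(preds, n_classes=2)
majority = proba.argmax(axis=1)   # class with the most votes
print(proba)
print(majority)
```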

In [None]:
def featImpMDI(fit,featNames):
    # feat importance based on IS mean impurity reduction

    # get importances for each tree
    df0 = {i: tree.feature_importances_ for i,tree in enumerate(fit.estimators_)}

    # convert to DF
    df0 = pd.DataFrame.from_dict(df0, orient='index')
    df0.columns = featNames
    df0 = df0.replace(0, np.nan) # because max_features=1
    
    # compute mean and std of importances
    imp = pd.concat({'mean':df0.mean(), 'std':df0.std()*df0.shape[0]**-.5}, axis=1) # CLT
    imp /= imp['mean'].sum()

    return imp

In [None]:
X,y = getTestData(40, 5, 30, 10000, sigmaStd=.1)

clf = DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    class_weight='balanced',
    min_weight_fraction_leaf=0
    )

clf = BaggingClassifier(
    estimator=clf,
    n_estimators=1000,
    max_features=1.,
    max_samples=1.,
    oob_score=False
    )

fit = clf.fit(X,y)
imp = featImpMDI(fit, featNames=X.columns)

In [None]:
# plot results
imp_sorted = imp.copy()
imp_sorted.sort_values(by='mean', ascending=True, inplace=True)

fig = go.Figure()
fig.add_trace(go.Bar(x=imp_sorted['mean'],
                     y=imp_sorted.index,
                     error_x=dict(
                         type='data',
                         symmetric=True,
                         array=imp_sorted['std']
                         ),
                     orientation='h'))
fig.update_layout(
    title="MDI Results",
    xaxis_title="MDI mean with standard deviation",
    yaxis_title="Features",
    height=800,
    width=800, 
)
fig.show()


Results:
- All non-noise features (I, R) are ranked higher than the noise features.
- Still, a small number of non-noise features appear to be much more important than their peers.

Out of the four caveats of p-values, the MDI method deals with three:
1. MDI’s computational nature circumvents the need for strong distributional assumptions that could be false - we are not imposing a particular tree structure or algebraic specification, or relying on stochastic or distributional characteristics of residuals.
2. Whereas betas are estimated on a single sample, ensemble MDIs are derived from a bootstrap of trees. Accordingly, the variance of MDI estimates can be reduced by increasing the number of trees in ensemble methods in general, or in a random forest in particular. This reduces the probability of false positives caused by overfitting. Also, unlike p-values, MDI’s estimation does not require the inversion of a possibly ill-conditioned matrix.
3. The goal of the tree-based classifiers is not to estimate the coefficients of a given algebraic equation, thus estimating the probability of a particular null hypothesis is irrelevant. In other words, MDI corrects for caveat 3 by finding the important features in general, irrespective of any particular parametric specification.

However, MDI does not deal with the fourth caveat:

4. The procedure itself does not involve cross-validation. Therefore, the one caveat of p-values that MDI does not fully solve is that MDI is also computed in-sample. To confront this final caveat, we need to introduce the concept of mean-decrease accuracy.


## Mean-Decrease Accuracy / Permutation Feature Importance

1. Fit a model and compute cross-validated performance
2. Compute cross-validated performance after shuffling the observations associated with one of the features; this gives a modified cross-validated performance per feature
3. Derive MDA associated with a particular feature by comparing cross-validated performance before and after shuffling. For important features there should be a significant decay in performance after shuffling.

<u>Important:</u>

When features are not independent, MDA may underestimate the importance of interrelated features. At the extreme, given two highly important but identical features, MDA may conclude that both features are relatively unimportant, because the effect of shuffling one may be partially compensated by not shuffling the other.

Despite its name, accuracy may not be a good choice for evaluating cross-validated performance in finance. Accuracy scores a classifier in terms of its proportion of correct predictions, but in finance we are more interested in the magnitude of the prediction error. A classifier may achieve high accuracy even though it made good predictions with low confidence and bad predictions with high confidence.

<u> Negative average likelihood </u>

A good alternative to accuracy is log-loss (cross-entropy loss). Log-loss scores a classifier in terms of the average log-likelihood of the true labels. Log-loss scores are not easy to interpret and compare, however, so it can be better to use the negative average likelihood (NegAL).

$$NegAL = - N^{-1} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} y_{n,k} p_{n,k}$$

Where
- $p_{n,k}$ is the probability associated with prediction n of label k
- $y_{n,k}$ is the indicator function $y_{n,k} \in \{0,1\}$ where $y_{n,k}=1$ when observation n was assigned label k and $y_{n,k}=0$ otherwise
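
The NegAL formula above translates directly into a few lines of NumPy. This is a minimal sketch: the function name `neg_average_likelihood` is hypothetical, the probabilities are made up, and the columns of `prob` are assumed to follow the order of `classes` (as they do for `predict_proba` and `clf.classes_` in scikit-learn).

```python
import numpy as np

def neg_average_likelihood(y_true, prob, classes):
    """NegAL: minus the mean probability assigned to the true label."""
    y_true, prob = np.asarray(y_true), np.asarray(prob, dtype=float)
    # one-hot indicator y_{n,k}: 1 where observation n carries label k
    indicator = (y_true[:, None] == np.asarray(classes)[None, :]).astype(float)
    return float(-(indicator * prob).sum(axis=1).mean())

y_true = np.array([0, 1, 1])
prob = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.6, 0.4]])   # columns follow the order of `classes`
score = neg_average_likelihood(y_true, prob, classes=[0, 1])
print(score)
```

The score lies between -1 (every true label predicted with probability 1) and 0, which makes it easier to compare across models than raw log-loss.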

<u> Probability-weighted accuracy </u>

Another alternative to accuracy is probability-weighted accuracy (PWA). PWA is an accuracy score in which each prediction is weighted by its confidence in excess of the uniform probability $K^{-1}$.

$$PWA = \frac{\sum_{n=0}^{N-1} y_n (p_n - K^{-1})}{\sum_{n=0}^{N-1}(p_n - K^{-1})}$$

Where
- $p_n = \max_k\{p_{n,k}\}$
- $y_n$ is the indicator function $y_n \in \{0,1\}$ where $y_n=1$ when the prediction was correct and $y_n=0$ otherwise

PWA punishes bad predictions made with high confidence more severely than accuracy, but less severely than log-loss.
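
A possible implementation of the PWA formula is sketched below. It assumes an $(N, K)$ matrix of class probabilities whose columns follow the order of `classes`; the function name `probability_weighted_accuracy` and the example data are hypothetical. Note that the denominator is zero if every prediction has confidence exactly $K^{-1}$, a case this sketch does not handle.

```python
import numpy as np

def probability_weighted_accuracy(y_true, prob, classes):
    """PWA: accuracy weighted by confidence in excess of 1/K."""
    y_true, prob = np.asarray(y_true), np.asarray(prob, dtype=float)
    classes = np.asarray(classes)
    n_classes = prob.shape[1]
    p_n = prob.max(axis=1)                    # confidence of each prediction
    y_pred = classes[prob.argmax(axis=1)]     # predicted label
    y_n = (y_pred == y_true).astype(float)    # 1 if correct, else 0
    weights = p_n - 1.0 / n_classes
    return float((y_n * weights).sum() / weights.sum())

y_true = np.array([0, 1, 1, 0])
# Three confident correct predictions and one barely confident error
prob = np.array([[0.90, 0.10],
                 [0.20, 0.80],
                 [0.10, 0.90],
                 [0.45, 0.55]])
pwa_score = probability_weighted_accuracy(y_true, prob, classes=[0, 1])
accuracy = float((prob.argmax(axis=1) == y_true).mean())  # labels are 0/1 here
print(pwa_score, accuracy)
```

In this made-up example the single error was made with low confidence, so PWA comes out higher than plain accuracy; a confident error would pull PWA below accuracy instead.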

In [None]:
# y1 = pd.Series

# prob = [[0.826 0.174]
#  [0.814 0.186]
#  [0.813 0.187]
#  ...
#  [0.971 0.029]
#  [0.892 0.108]
#  [0.931 0.069]]

# classes = ['0', '1']

# -log_loss = float

In [None]:
# example code for pwa

y_true = pd.Series([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])
y_prob = pd.Series([0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.9, 0.9, 0.9, 0.1])
y_pred = pd.Series([0, 1, 0, 1, 1, 0, 1, 1, 1, 0])
labels = ['0', '1']

y_indicator = y_true.eq(y_pred).astype(int)
y_indicator

In [None]:
def pwa(y_true, y_pred, y_prob, labels):
    
    y_indicator = y_true.eq(y_pred).astype(int)
    
    # ... to be completed ...

    pass

In [None]:
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

def featImpMDA(clf,X,y,n_splits=10):
    # feat importance based on OOS score reduction

    cvGen = KFold(n_splits=n_splits)
    scr0, scr1 = pd.Series(dtype=float), pd.DataFrame(columns=X.columns, dtype=float)

    for i,(train,test) in enumerate(cvGen.split(X=X)):
        X0, y0 = X.iloc[train,:], y.iloc[train] # get train set
        X1, y1 = X.iloc[test,:], y.iloc[test] # get test set
        fit = clf.fit(X=X0, y=y0) # the fit occurs here
        prob = fit.predict_proba(X1) # prediction before shuffling
        pred = fit.predict(X1) # prediction before shuffling

        # compute logloss before shuffling
        scr0.loc[i] = -log_loss(y1, prob, labels=clf.classes_)
        
        for j in X.columns:
            X1_ = X1.copy(deep=True)
            X1_[j] = np.random.permutation(X1_[j].values) # shuffle one column
            prob = fit.predict_proba(X1_) # prediction after shuffling

            # compute logloss after shuffling
            scr1.loc[i,j] = -log_loss(y1, prob, labels=clf.classes_)
    
    # compute importance for each feature
    imp = (-1 * scr1).add(scr0, axis=0)
    imp = imp / (-1 * scr1) # normalize
    imp = pd.concat({'mean' : imp.mean(), 'std' : imp.std() * imp.shape[0] ** -.5}, axis=1) # CLT
    return imp

In [None]:
X,y = getTestData(40,5,30,10000,sigmaStd=.1)

clf = DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    class_weight = 'balanced',
    min_weight_fraction_leaf=0
    )

clf = BaggingClassifier(
    estimator=clf,
    n_estimators=1000,
    max_features = 1.,
    max_samples=1.,
    oob_score=False
    )

imp = featImpMDA(clf, X, y, 10)

In [None]:
# plot results
imp_sorted = imp.copy()
imp_sorted.sort_values(by='mean', ascending=True, inplace=True)

fig = go.Figure()
fig.add_trace(go.Bar(x=imp_sorted['mean'],
                     y=imp_sorted.index,
                     error_x=dict(
                         type='data',
                         symmetric=True,
                         array=imp_sorted['std']
                         ),
                     orientation='h'))
fig.update_layout(
    title="MDA Results",
    xaxis_title="MDA mean with standard deviation",
    yaxis_title="Features",
    height=800,
    width=800, 
)
fig.show()


## Clustered Feature Importance (CFI)

CFI is a method to deal with substitution effects.

Substitution effects arise when two features share predictive information and thus can bias the results from feature importance methods. In the case of MDI, the importance of two identical features will be halved, as they are randomly chosen with equal probability. In the case of MDA, two identical features may be considered relatively unimportant, even if they are critical, because the effect of shuffling one may be compensated by the other.

CFI involves two steps:
1. Finding the number and constituents of the clusters of features
2. Applying the feature importance analysis on groups of similar features rather than on individual features

<u> Step 1: Features Clustering</u>
- Project the observed features into a metric space, resulting in a matrix $\{X_f\}_{f=1,...,F}$
- Use correlation based approach or information theoretic distance metrics to cluster the features
- Information-theoretic metrics have the advantage of recognizing redundant features that result from nonlinear combinations of informative features
- Apply ONC algorithm (optimal number of clusters)
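
For the correlation-based approach, the clustering code in this notebook maps each correlation $\rho$ to the distance $\sqrt{(1-\rho)/2}$. The sketch below applies that mapping to a small made-up feature set; the helper name `correlation_distance` is hypothetical, and the clipping step is an added safeguard against floating-point correlations marginally outside $[-1, 1]$.

```python
import numpy as np
import pandas as pd

def correlation_distance(corr):
    """Map correlations in [-1, 1] to distances in [0, 1]."""
    corr = corr.fillna(0).clip(-1.0, 1.0)   # guard against rounding error
    return ((1.0 - corr) / 2.0) ** 0.5

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
features = pd.DataFrame({
    "I0": base,
    "R0": base + 0.1 * rng.normal(size=1000),  # redundant copy of I0
    "N0": rng.normal(size=1000),               # unrelated noise
})
dist = correlation_distance(features.corr())
print(dist.round(3))
```

The redundant feature sits very close to the informative feature it was derived from, while the noise feature is far from both, which is what lets a clustering algorithm group substitutes together.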


Some silhouette scores may be low due to one feature being a combination of multiple features across clusters. This is a problem, because ONC cannot assign one feature to multiple clusters. In this case, the following transformation may help reduce the multicollinearity of the system.


Replace the features included in cluster $k$ with residual features, obtained by regressing each of them on features from outside cluster $k$.
- $D_k$ is the subset of the feature indices $D=\{1,...,F\}$ included in cluster $k$
- For a given feature $X_i$ with $i \in D_k$, fit $X_{n,i} = \alpha_i + \sum_{j \in \{\cup _{l<k} D_l \}} \beta_{i,j} X_{n,j} + \epsilon_{n, i}$, where $n$ is the index of observations per feature, and use the residuals as the new feature
- If the degrees of freedom in the above regression are too low, one option is to use as regressors linear combinations of the features within each cluster
- One of the properties of OLS residuals is that they are orthogonal to the regressors
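
The residualization step can be sketched with an ordinary least-squares fit. The data below are simulated, and `residualize` is a hypothetical helper; it illustrates the regression above for a single feature rather than the full per-cluster procedure.

```python
import numpy as np

def residualize(target, regressors):
    """OLS residuals of `target` regressed on `regressors` plus an intercept."""
    target = np.asarray(target, dtype=float)
    design = np.column_stack([np.ones(len(target)), regressors])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ beta

rng = np.random.default_rng(0)
n = 500
x_other = rng.normal(size=(n, 2))     # simulated features from earlier clusters
own_signal = rng.normal(size=n)       # information unique to the feature itself
x_i = 1.0 + x_other @ np.array([0.7, -0.3]) + own_signal

eps_i = residualize(x_i, x_other)
max_abs_corr = max(
    abs(np.corrcoef(eps_i, x_other[:, j])[0, 1]) for j in range(x_other.shape[1])
)
print(max_abs_corr)
```

The residual feature is uncorrelated with the regressors up to floating-point error, yet it still carries the information that was unique to the original feature.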

<u>Clustering</u>

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def clusterKMeansBase(corr0,maxNumClusters=10,n_init=10):
    x,silh=((1-corr0.fillna(0))/2.)**.5,pd.Series(dtype=np.float64)# observations matrix
    
    for init in range(n_init):
        
        for i in range(2,maxNumClusters+1):
            kmeans_=KMeans(n_clusters=i,n_init=1)
            kmeans_=kmeans_.fit(x)
            silh_=silhouette_samples(x,kmeans_.labels_)
            stat=(silh_.mean()/silh_.std(),silh.mean()/silh.std())
            
            if np.isnan(stat[1]) or stat[0]>stat[1]:
                silh,kmeans=silh_,kmeans_
    
    newIdx=np.argsort(kmeans.labels_)
    corr1=corr0.iloc[newIdx] # reorder rows
    corr1=corr1.iloc[:,newIdx] # reorder columns

    clstrs={i:corr0.columns[np.where(kmeans.labels_==i)[0]].tolist() for i in np.unique(kmeans.labels_)} # cluster members
    silh=pd.Series(silh,index=x.index)
    
    return corr1,clstrs,silh

In [None]:
from sklearn.metrics import silhouette_samples

def makeNewOutputs(corr0,clstrs,clstrs2):
    clstrsNew={}

    for i in clstrs.keys():
        clstrsNew[len(clstrsNew.keys())]=list(clstrs[i])

    for i in clstrs2.keys():
        clstrsNew[len(clstrsNew.keys())]=list(clstrs2[i])

    newIdx = [j for i in clstrsNew for j in clstrsNew[i]]
    corrNew = corr0.loc[newIdx,newIdx]
    x = ((1 - corr0.fillna(0)) / 2.) ** .5
    kmeans_labels = np.zeros(len(x.columns))

    for i in clstrsNew.keys():
        idxs=[x.index.get_loc(k) for k in clstrsNew[i]]
        kmeans_labels[idxs]=i

    silhNew=pd.Series(silhouette_samples(x,kmeans_labels),index=x.index)

    return corrNew,clstrsNew,silhNew

In [None]:
def clusterKMeansTop(corr0,maxNumClusters=None,n_init=10):

    if maxNumClusters is None: maxNumClusters = corr0.shape[1]-1
    
    corr1, clstrs, silh = clusterKMeansBase(corr0, maxNumClusters=min(maxNumClusters, corr0.shape[1]-1), n_init=n_init)

    clusterTstats = {i:np.mean(silh[clstrs[i]]) / np.std(silh[clstrs[i]]) for i in clstrs.keys()}

    tStatMean = sum(clusterTstats.values())/len(clusterTstats)
    
    redoClusters = [i for i in clusterTstats.keys() if clusterTstats[i] < tStatMean]
    
    if len(redoClusters)<=1:
        return corr1,clstrs,silh
    else:
        keysRedo = [j for i in redoClusters for j in clstrs[i]]
        corrTmp = corr0.loc[keysRedo,keysRedo]
        tStatMean = np.mean([clusterTstats[i] for i in redoClusters])
        corr2, clstrs2, silh2 = clusterKMeansTop(corrTmp, maxNumClusters=min(maxNumClusters, corrTmp.shape[1]-1), n_init=n_init)
    
        # Make new outputs, if necessary
        corrNew,clstrsNew,silhNew = makeNewOutputs(corr0, {i:clstrs[i] for i in clstrs.keys() if i not in redoClusters}, clstrs2)
        newTstatMean = np.mean([np.mean(silhNew[clstrsNew[i]]) / np.std(silhNew[clstrsNew[i]]) for i in clstrsNew.keys()])

        if newTstatMean <= tStatMean:
            return corr1, clstrs, silh
        else:
            return corrNew,clstrsNew,silhNew

<u>Clustered MDI</u>


We compute the clustered MDI as the sum of the MDI values of the features that constitute that cluster. If there is one feature per cluster, then MDI and clustered MDI are the same. In the case of an ensemble of trees, there is one clustered MDI for each tree, which allows us to compute the mean clustered MDI, and standard deviation around the mean clustered MDI, similarly to how we did for the feature MDI.

In [None]:
def groupMeanStd(df0,clstrs):
    out = pd.DataFrame(columns=['mean','std'])
    for i, j in clstrs.items():
        df1 = df0[j].sum(axis=1)  # sum of each MDI in the cluster
        out.loc['C_'+str(i), 'mean'] = df1.mean()  # mean 
        out.loc['C_'+str(i), 'std'] = df1.std() * df1.shape[0] ** -.5  # standard error: std / sqrt(n)
    return out

In [None]:
def featImpMDI_Clustered(fit, featNames, clstrs):
   
    # get importances of each tree
    df0 = {i:tree.feature_importances_ for i,tree in enumerate(fit.estimators_)}

    # convert to dataframe
    df0 = pd.DataFrame.from_dict(df0,orient='index')
    df0.columns = featNames
    df0 = df0.replace(0,np.nan) # because max_features=1

    # get mean and std
    imp = groupMeanStd(df0,clstrs)
    imp /= imp['mean'].sum()

    return imp

<u>Clustered MDA</u>

When computing clustered MDA, instead of shuffling one feature at a time, we shuffle all of the features that constitute a given cluster. If there is one cluster per feature, then MDA and clustered MDA are the same.

In [None]:
def featImpMDA_Clustered(clf,X,y,clstrs,n_splits=10):
    from sklearn.metrics import log_loss
    from sklearn.model_selection import KFold
    
    cvGen = KFold(n_splits=n_splits)
    scr0, scr1 = pd.Series(dtype=np.float64), pd.DataFrame(columns=list(clstrs.keys()), dtype=np.float64)  # empty score containers
    
    for i,(train,test) in enumerate(cvGen.split(X=X)):

        # train and test by cv folds
        X0, y0 = X.iloc[train,:], y.iloc[train]
        X1, y1 = X.iloc[test,:], y.iloc[test]

        # fit classifier and compute score
        fit = clf.fit(X=X0,y=y0)
        prob = fit.predict_proba(X1)
        scr0.loc[i] = -log_loss(y1, prob, labels=clf.classes_)

        for j in scr1.columns:
            X1_ = X1.copy(deep=True)
            
            # shuffle every feature in the cluster
            for k in clstrs[j]:
                X1_[k] = np.random.permutation(X1_[k].values)

            # predict and compute score after one cluster is shuffled
            prob = fit.predict_proba(X1_)
            scr1.loc[i,j] = -log_loss(y1, prob, labels=clf.classes_)

    # compute importances as difference between scores
    imp = (-1 * scr1).add(scr0, axis=0)
    imp = imp / (-1*scr1)

    # mean and std
    imp = pd.concat({'mean' : imp.mean(),'std' : imp.std() * imp.shape[0] ** -.5}, axis=1)
    imp.index = ['C_'+str(i) for i in imp.index]
    
    return imp

<u>Features Clustering</u>

In a nonexperimental setting, the researcher should denoise and detone the correlation matrix before clustering, as explained in Section 2. We do not do so in this experiment as a matter of testing the robustness of the method (results are expected to be better on a denoised and detoned correlation matrix).

In [None]:
X,y=getTestData(40,5,30,10000,sigmaStd=.1)
# corr0,clstrs,silh = clusterKMeansBase(X.corr(),maxNumClusters=10,n_init=10)
corr0,clstrs,silh = clusterKMeansTop(X.corr(),maxNumClusters=10,n_init=10)

# sns.heatmap(corr0,cmap='viridis')

# plot heatmap with plotly go figure
fig = go.Figure(
    data=go.Heatmap(
        z=corr0,
        x=corr0.index,
        y=corr0.columns,
        colorscale='Viridis')
        )
fig.update_layout(
    title='Correlation Matrix',
    xaxis_nticks=36,
    width=800,
    height=800,)
fig.show()

<u>Results from Clustering</u>

ONC correctly recognizes that there are six relevant clusters (one cluster for each informative feature, plus one cluster of noise features), and it assigns the redundant features to the cluster that contains the informative feature from which the redundant features were derived. Given the low correlation across clusters, there is no need to replace the features with their residuals.

<u>Apply Clustered MDI</u>

In [None]:
clf=DecisionTreeClassifier(criterion='entropy',max_features=1, class_weight='balanced', min_weight_fraction_leaf=0)
clf=BaggingClassifier(estimator=clf,n_estimators=1000,max_features=1.,max_samples=1.,oob_score=False)
fit=clf.fit(X,y)
imp=featImpMDI_Clustered(fit,X.columns,clstrs)

In [None]:
# plot results
imp_sorted = imp.copy()
imp_sorted.sort_values(by='mean', ascending=True, inplace=True)

fig = go.Figure()
fig.add_trace(go.Bar(x=imp_sorted['mean'],
                     y=imp_sorted.index,
                     error_x=dict(
                         type='data',
                         symmetric=True,
                         array=imp_sorted['std']
                         ),
                     orientation='h'))
fig.update_layout(
    title="Clustered MDI",
    xaxis_title="MDI mean with standard deviation",
    yaxis_title="Features",
    height=400,
    width=400, 
)
fig.show()

<u>Results from Clustered MDI</u>

- The noise features are all in cluster C5, which is the least important
- C5 has at most half the importance of the other clusters: clustered MDI works better than non-clustered MDI
- Importance still differs across the non-noise clusters

<u>Apply Clustered MDA</u>

In [None]:
clf=DecisionTreeClassifier(criterion='entropy',max_features=1,class_weight='balanced',min_weight_fraction_leaf=0)
clf=BaggingClassifier(estimator=clf,n_estimators=1000,max_features=1.,max_samples=1.,oob_score=False)
imp=featImpMDA_Clustered(clf,X,y,clstrs,10)

In [None]:
# plot results
imp_sorted = imp.copy()
imp_sorted.sort_values(by='mean', ascending=True, inplace=True)

fig = go.Figure()
fig.add_trace(go.Bar(x=imp_sorted['mean'],
                     y=imp_sorted.index,
                     error_x=dict(
                         type='data',
                         symmetric=True,
                         array=imp_sorted['std']
                         ),
                     orientation='h'))
fig.update_layout(
    title="Clustered MDA",
    xaxis_title="MDA mean with standard deviation",
    yaxis_title="Features",
    height=400,
    width=400, 
)
fig.show()

<u>Results from Clustered MDA</u>

- C5 has essentially zero importance -> can be discarded as irrelevant
- all other clusters have very similar importance -> contrast to non-clustered MDA

## Classifiers (from AIFML)

How to tackle overfitting:

1. Set the parameter *max_features* to a lower value, as a way of forcing discrepancy between trees.
2. Early stopping: Set the regularization parameter *min_weight_fraction_leaf* to a sufficiently large value (e.g., 5%) such that out-of-bag accuracy converges to out-of-sample (k-fold) accuracy.
3. Use *BaggingClassifier* on *DecisionTreeClassifier* where *max_samples* is set to the average uniqueness (avgU) between samples.
4. Use *BaggingClassifier* on *RandomForestClassifier* where *max_samples* is set to the average uniqueness (avgU) between samples.
5. Modify the RF class to replace standard bootstrapping with sequential bootstrapping

There are three ways of setting up an RF:

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
avgU = 0.5 # see sample weights notebook for the functions

# 1
clf0 = RandomForestClassifier(n_estimators=1000, class_weight='balanced_subsample', criterion='entropy')

# 2
clf1 = DecisionTreeClassifier(criterion='entropy', max_features='sqrt', class_weight='balanced')  # 'auto' was removed from scikit-learn; 'sqrt' is its equivalent for classifiers
clf1 = BaggingClassifier(estimator=clf1, n_estimators=1000, max_samples=avgU)

# 3
clf2 = RandomForestClassifier(n_estimators=1, criterion='entropy', bootstrap=False, class_weight='balanced_subsample')
clf2 = BaggingClassifier(estimator=clf2, n_estimators=1000, max_samples=avgU, max_features=1.)

The author suggests fitting the RF on a PCA of the features.
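
The sketch below illustrates only the orthogonalization part of that suggestion, using an eigendecomposition of the covariance of the standardized features. The feature matrix is simulated and `pca_transform` is a hypothetical helper; in practice the resulting components would be passed to the classifier in place of the raw features.

```python
import numpy as np

def pca_transform(X):
    """Project standardized features onto their principal components."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]           # largest variance first
    return Z @ eigvec[:, order], eigval[order]

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 2))
# Simulated feature matrix with a nearly redundant third column
X = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=1000)])
components, variances = pca_transform(X)

corr = np.corrcoef(components, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(variances.round(3))
print(np.abs(off_diag).max())
```

The components are mutually uncorrelated, so the substitution effects discussed earlier do not arise among them, although each component no longer maps to a single original feature.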
