# Permutation Importance... 
#### (and, grudgingly, the built-in feature importance for sklearn RandomForestClassifier)
Ref:<br> 
<a href="https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance">Permutation importance</a><br>
<a href="https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html">Feature importances with a forest of trees</a><br>
<a href="https://github.com/rasbt/stat451-machine-learning-fs21/blob/main/13-feature-selection/05_permutation-importance.ipynb">Permutation Importance by Sebastian Raschka</a><br>


This is part of 'Explainable AI', which answers the question 'Why did it do that?'

Many algorithms help in this area. Two of them are:<br>
   **Feature importance**, calculated per feature (column). It is based on mean decrease in impurity (MDI), applies to tree-based models such as random forests, and is available via the `feature_importances_` attribute. For a forest, it is the impurity decrease accumulated within each tree, averaged across trees (the standard deviation across trees shows the spread). It favors high-cardinality features over low-cardinality ones.<br>
   **Permutation importance**, calculated per feature (column). It is based on how much a model's score decreases when a column is randomly shuffled. <mark>Applies to any fitted model</mark>. Computed per column as the difference between the baseline score (accuracy here) and the score with that column shuffled. The features with the biggest drop are the most important to the model.
   
   <mark>Of the two, favor permutation importance: it works with any model and does not suffer from the high-cardinality bias of feature importance.</mark>
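
The permutation idea described above can be sketched by hand. This is a minimal illustration, not how scikit-learn implements it: the helper name `manual_permutation_importance`, the number of repeats, the seeds, and the smaller forest are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = model.score(X, y)  # accuracy on the untouched data
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only; every other column keeps its values
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col, rep] = baseline - model.score(X_perm, y)
    return baseline, drops.mean(axis=1)


X_demo, y_demo = load_wine(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=42)
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

baseline_acc, mean_drops = manual_permutation_importance(demo_model, Xte, yte)
print("baseline accuracy:", baseline_acc)
print("mean accuracy drop per column:", np.round(mean_drops, 3))
```

A larger drop means the model leaned more heavily on that column. Drops near zero, or slightly negative, suggest the column contributed little on this test set.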
   

In [None]:
#the following gives access to utils folder
#where utils package stores shared code
import os
import sys
PROJECT_ROOT = os.path.abspath(os.path.join(
                  os.getcwd(),
                  os.pardir)
)

#only add it once
if (PROJECT_ROOT not in sys.path):
    sys.path.append(PROJECT_ROOT)
    
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

# Load Data
Let's work with the wine dataset.

In [None]:
from sklearn.datasets import load_wine
data = load_wine()

df = pd.DataFrame(data= np.c_[data['data'], data['target']],
                     columns= data['feature_names'] + ['target'])
df.head()

In [None]:
df.target.unique()

## Any correlations?  If so, drop them
Correlated features interfere with calculating importance. For instance, suppose two columns, A and B, are highly correlated. If you shuffle A, the model can still get A's information from B, so the score barely drops. The same happens when B is shuffled. Both features end up with a low importance value even though they might actually be important.
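
The `utils` helpers used in the next cells are not shown in this notebook. As a rough stand-in, the sketch below finds highly correlated column pairs with plain pandas. The function name `find_correlated_pairs` and the 0.8 threshold are assumptions for illustration, not the behavior of `ut.get_correlated_columns`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine


def find_correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, |r|) for column pairs whose absolute Pearson r exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


wine_frame = load_wine(as_frame=True).data
for col_a, col_b, r in find_correlated_pairs(wine_frame, threshold=0.8):
    print(f"{col_a} ~ {col_b}: |r| = {r:.2f}")
```

One column from each reported pair can then be dropped before computing importances.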

In [None]:
import utils as ut

In [None]:
ut.get_correlated_columns(df)
# df=ut.drop_correlated_columns(df)

## Get train/test split

In [None]:
y=df['target']
X=df.drop(columns=['target'])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Train classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

feature_names = X.columns
forest = RandomForestClassifier(n_estimators=100,random_state=0, oob_score=True)
forest.fit(X_train, y_train)

## How well did it do?

In [None]:
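# Let's see how well it does

# Manually: fraction of correct predictions, as a percentage
print('Training accuracy:', np.mean(forest.predict(X_train) == y_train)*100)
print('Test accuracy:', np.mean(forest.predict(X_test) == y_test)*100)

# With the built-in scorer on the test data (mean accuracy as a fraction, 0 to 1)
print('Model score: ', forest.score(X_test, y_test))

# Or with the out-of-bag (OOB) samples; oob_score_ is an accuracy, not an error
print(f'OOB score={forest.oob_score_}')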

# <mark> Permutation Importance - this is what you want to use</mark>
Permutation importance is more costly to compute than the built-in feature importance. Each feature is shuffled `n_repeats` times and the resulting score drops are averaged to estimate its importance. Please see 
    <a href="https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance">Permutation feature importance</a> for more details.
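
To make the averaging concrete, the sketch below inspects the raw per-shuffle values that `permutation_importance` returns. It trains its own small forest so it runs on its own; the variable names and the 50-tree forest are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_pi, y_pi = load_wine(return_X_y=True)
X_pi_train, X_pi_test, y_pi_train, y_pi_test = train_test_split(
    X_pi, y_pi, stratify=y_pi, random_state=42)
pi_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_pi_train, y_pi_train)

pi_result = permutation_importance(
    pi_model, X_pi_test, y_pi_test, n_repeats=10, random_state=42, scoring="accuracy")

# One row per feature, one column per shuffle
print(pi_result.importances.shape)
# The reported mean and std are taken across the repeats
print(np.allclose(pi_result.importances_mean, pi_result.importances.mean(axis=1)))
print(np.allclose(pi_result.importances_std, pi_result.importances.std(axis=1)))
```

Raising `n_repeats` gives a steadier estimate at the cost of more scoring passes over the data.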

In [None]:
from sklearn.inspection import permutation_importance

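def plotem(forest_importances, std=None):
    # std defaults to the spread from the most recent permutation_importance call (global `result`)
    if std is None:
        std = result.importances_std
    # Attach feature labels before sorting so each error bar stays with its own bar
    std = pd.Series(np.asarray(std), index=forest_importances.index)
    forest_importances = forest_importances.sort_values(ascending=False)

    fig, ax = plt.subplots(figsize=(8,5))
    forest_importances.plot.bar(yerr=std[forest_importances.index].to_numpy(), ax=ax)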
    ax.set_title("Random permutation importance")
    ax.set_ylabel("Mean accuracy decrease")
    fig.tight_layout()
    plt.show()

In [None]:
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2, scoring='accuracy')
forest_importances = pd.Series(result.importances_mean, index=feature_names)
plotem(forest_importances)



# Feature Importance (but prefer Permutation Importance)
Let's see what the random forest thinks is important. It calculates importances from the mean decrease in impurity, which favors high-cardinality columns.
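
One way to probe the high-cardinality bias is to add a column of pure noise and compare the two measures side by side. This is a sketch under assumptions: the `random_noise` column, the seeds, and the variable names are made up for the example, and on a dataset this easy the effect may be small.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine(as_frame=True)
X_noise = wine.data.copy()
# Hypothetical pure-noise column: unique continuous values, unrelated to the target
X_noise["random_noise"] = np.random.default_rng(0).normal(size=len(X_noise))

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noise, wine.target, stratify=wine.target, random_state=42)
noise_forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xn_train, yn_train)

# MDI comes from the training data; permutation importance is scored on held-out data
mdi = pd.Series(noise_forest.feature_importances_, index=X_noise.columns)
perm = permutation_importance(noise_forest, Xn_test, yn_test, n_repeats=10, random_state=42)
comparison = pd.DataFrame({"mdi": mdi, "permutation": perm.importances_mean})
print(comparison.sort_values("mdi", ascending=False))
```

If the noise column receives a non-trivial MDI share while its permutation importance stays near zero, that is the bias in action.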


In [None]:
%%time
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

In [None]:
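# Plot them
forest_importances = pd.Series(importances, index=feature_names)
# Label the per-tree std so it is sorted together with the importances
std_sorted = pd.Series(std, index=feature_names)
forest_importances = forest_importances.sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10,5))
forest_importances.plot.bar(yerr=std_sorted[forest_importances.index].to_numpy(), ax=ax)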
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

# <mark>Based on both permutation and feature importance, color intensity is the most important feature the model uses to determine the target</mark>

# What happens with correlations?

Add a column that's correlated with color intensity and see how the results change by rerunning the cells above without dropping the correlated columns.
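
As a compact alternative to rerunning every cell, the sketch below fits the same kind of forest twice, with and without a noisy copy of `color_intensity`, and prints the permutation importances for comparison. The helper name `permutation_importance_series` and the seeds are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def permutation_importance_series(features, target):
    """Fit a forest and return test-set permutation importances as a labeled Series."""
    f_train, f_test, t_train, t_test = train_test_split(
        features, target, stratify=target, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(f_train, t_train)
    res = permutation_importance(model, f_test, t_test, n_repeats=10, random_state=42)
    return pd.Series(res.importances_mean, index=features.columns)


wine_cmp = load_wine(as_frame=True)
base = wine_cmp.data
with_copy = base.copy()
# Near-duplicate of color_intensity with a little uniform noise added
with_copy["ci1"] = base["color_intensity"] + np.random.default_rng(0).random(len(base)) * 0.1

before = permutation_importance_series(base, wine_cmp.target)
after = permutation_importance_series(with_copy, wine_cmp.target)
print("color_intensity before:", before["color_intensity"])
print("color_intensity after: ", after["color_intensity"], "| ci1:", after["ci1"])
```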

In [None]:
df.color_intensity.describe()

In [None]:
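# Near-copy of color_intensity: add uniform noise in [0, 0.1); seeded so reruns match
rng = np.random.default_rng(42)
df['ci1'] = df['color_intensity'] + rng.random(len(df)) * 0.1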
df.head()

In [None]:
#verify random noise
# df['color_intensity']-df['ci1']