# **Random Forest Feature Selection**

In this step we are going to outline how the random forest machine learning approach can help in determining the importance of a feature.
Random forests are one of the most popular machine learning algorithms. They are so successful because they generally provide good predictive performance, low overfitting, and easy interpretability. This interpretability comes from the fact that it is straightforward to compute how much each variable contributes to the decisions made by the trees. In scikit-learn's random forest implementation, the relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features; scikit-learn combines this fraction with the decrease in impurity from splitting those samples to produce a normalised estimate, see [scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html#forest).
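To make the idea of "importance per tree, averaged over the forest" concrete, here is a minimal sketch. It assumes the wine dataset used later in this notebook and a small forest of 50 trees (an arbitrary choice for speed); the names `per_tree` and `manual` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(data.data, data.target)

# Each fitted tree exposes its own normalised impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest-level attribute
manual = per_tree.mean(axis=0)
print(np.allclose(manual, forest.feature_importances_))
print(manual.sum())
```

The spread of `per_tree` across trees (for example `per_tree.std(axis=0)`) gives a rough feel for how stable each feature's ranking is.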

Random forests are also incredibly robust and generally very easy to implement. In every project I give, students always use random forests for predictions. Lately, I have also found students using them for dimensionality reduction. However, there is an issue with them, and Sebastian Raschka explains it really nicely below:


"The random forest technique comes with an important gotcha that is worth mentioning. For instance, if two or more features are highly correlated, one feature may be ranked very highly while the information of the other feature(s) may not be fully captured. On the other hand, we don't need to be concerned about this problem if we are merely interested in the predictive performance of a model rather than the interpretation of feature importances." - Python Machine Learning by Sebastian Raschka.

This "gotcha" is really important to understand. When features are correlated, it in a way negates the use of random forests for feature importance, and thus their use for feature reduction and engineering.
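The gotcha can be illustrated with the most extreme case of correlation: an exact copy of a feature. The sketch below is an assumption-laden illustration, not part of the original exercise. It uses the wine dataset from later in this notebook, a forest of 200 trees, and a hypothetical helper `fit_importances`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X, y = data.data, data.target

def fit_importances(features):
    """Fit a forest with fixed settings and return its feature importances."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(features, y)
    return model.feature_importances_

base = fit_importances(X)
top = int(np.argmax(base))  # index of the highest-ranked feature

# Append an exact copy of that feature as a 14th column
X_dup = np.hstack([X, X[:, [top]]])
dup = fit_importances(X_dup)

print("feature:", data.feature_names[top])
print("importance alone:       %.3f" % base[top])
print("importance with a copy: %.3f" % dup[top])
print("importance of the copy: %.3f" % dup[-1])
```

Typically the original feature and its copy end up sharing the importance between them, so each looks less important than the feature did on its own, even though the information in the data has not changed.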

Have a look at the code below and play about with the standardisation methods from scikit-learn. A description of how the RandomForestClassifier works can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier).


As a test example we are going to use scikit-learn's wine dataset. We can import this directly. In this example the $y$ variable is the wine class, of which there are 3. The $X$ variables are characteristics that are normally used to characterize wine.

In [None]:
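import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

# Load the wine dataset: 13 numeric features, 3 wine classes
data = load_wine()

X = data.data
print(X.shape[1])  # number of features
y = data.target

# Wrap the features in a DataFrame so the columns carry their names
df_X = pd.DataFrame(X, columns=data.feature_names)
print(df_X.corr())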

import seaborn as sns
%matplotlib inline


# calculate the correlation matrix
corr = df_X.corr()

# plot the heatmap
sns.heatmap(corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns)
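The heatmap gives a visual impression, but it can help to list the most strongly correlated feature pairs explicitly. The following is a small sketch using the same correlation matrix as above. The `pairs` list and the choice to show the top 5 are illustrative assumptions.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
df_X = pd.DataFrame(data.data, columns=data.feature_names)
corr = df_X.corr()

# One entry per unordered pair of distinct features
pairs = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda t: abs(t[2]), reverse=True)

for a, b, r in pairs[:5]:
    print("%-30s %-30s %+.3f" % (a, b, r))
```

Pairs near the top of this list are the ones to watch when interpreting the feature importances later on.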



For demonstration purposes we split the dataset into training and test sets. However, if you are doing this on a real dataset it may be worthwhile using a bootstrapping approach.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train[1,:])
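The bootstrapping approach mentioned above could look something like the sketch below. It is one possible way to do it, not a prescribed method: each round draws training rows with replacement and scores the model on the rows that were never drawn (the "out-of-bag" rows). The number of rounds (20) and trees (100) are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X, y = data.data, data.target

rng = np.random.RandomState(0)
n_rounds = 20
scores = []
for _ in range(n_rounds):
    # Draw row indices with replacement to form the bootstrap training set
    boot = rng.randint(0, len(X), size=len(X))
    # Rows that were never drawn form the evaluation set for this round
    oob = np.setdiff1d(np.arange(len(X)), boot)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[boot], y[boot])
    scores.append(model.score(X[oob], y[oob]))

scores = np.array(scores)
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

The standard deviation across rounds gives an indication of how much the accuracy depends on the particular split, which a single train/test split cannot show.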

The next 2 pieces of code are not really necessary, but they demonstrate that random forests don't need normalised or standardised data.

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train = mms.fit_transform(X_train)
X_test = mms.transform(X_test)
print(X_train[1,:])

In [None]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)
print(X_train[1,:])
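One way to check the claim that random forests do not need scaled data is to fit the same forest on raw and on standardised features and compare the results. The sketch below does this under a few assumptions: it uses its own variable names (`X_tr`, `X_te`) so it does not overwrite the notebook's `X_train`, a forest of 200 trees, and a hypothetical helper `fit_forest`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

# Fit the scaler on the training data only
scaler = StandardScaler().fit(X_tr)

def fit_forest(train):
    """Fit a forest with fixed settings on the given training features."""
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(train, y_tr)

raw = fit_forest(X_tr)
scaled = fit_forest(scaler.transform(X_tr))

# Largest absolute difference between the two importance vectors
gap = np.abs(raw.feature_importances_ - scaled.feature_importances_).max()
print("largest importance difference: %.6f" % gap)
print("raw accuracy:    %.3f" % raw.score(X_te, y_te))
print("scaled accuracy: %.3f" % scaled.score(scaler.transform(X_te), y_te))
```

Because tree splits depend only on the ordering of values within a feature, and standardisation preserves that ordering, the two models should agree closely.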

The next code fragment imports the random forest classifier from scikit-learn. The n_estimators parameter sets the number of trees in the forest, and n_jobs=-1 specifies that all available processors should be used. I have also added a variable here called no_f. This variable allows us to set how many of the features (taken in column order) are used from our training set.

In [5]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
no_f = 13  # number of leading feature columns to use
feat_labels = data.feature_names[0:no_f]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(X_train[:, 0:no_f], y_train)
importances = forest.feature_importances_
# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]
for f in range(len(indices)):
    # Look the label up through indices so names and values stay aligned
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))



Now we are going to plot the attribute importance.

In [None]:
import matplotlib.pyplot as plt

plt.title('Feature Importances')
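n_feat = len(importances)  # matches no_f, even when fewer than 13 features are used
plt.bar(range(n_feat), importances[indices], color='green', align='center')
# Order the tick labels the same way as the bars
plt.xticks(range(n_feat), [feat_labels[i] for i in indices], rotation=90)
plt.xlim([-1, n_feat])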
plt.tight_layout()
plt.show()

Adjust the variables that go into the `X_train` dataset. See if the variable importance is affected by the correlation between the features. Put your comments on the comments board.