- Let's say we have a trained model and want to find which features were the most important. How can we do that?
- To talk about feature importance in linear models, **we must normalize the data before training**. Otherwise, any conclusion drawn from the feature weights will be meaningless.
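
To illustrate why unscaled weights are hard to compare, here is a minimal sketch using only the standard library. The data (`height_m`, `weight_kg`) and the helpers `ols_slope` and `standardize` are made up for this illustration. The same feature expressed in metres or centimetres gets weights that differ by a factor of 100, while the standardized versions agree.

```python
import statistics

def ols_slope(x, y):
    """Least-squares slope of y regressed on a single feature x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

def standardize(x):
    """Rescale to zero mean and unit (population) standard deviation."""
    m, s = statistics.fmean(x), statistics.pstdev(x)
    return [(a - m) / s for a in x]

# Hypothetical data: the same feature in two different units.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
height_cm = [h * 100 for h in height_m]
weight_kg = [50.0, 58.0, 66.0, 74.0, 82.0]

slope_m = ols_slope(height_m, weight_kg)    # weight per metre
slope_cm = ols_slope(height_cm, weight_kg)  # weight per centimetre

# After standardization the unit no longer matters.
slope_std_m = ols_slope(standardize(height_m), weight_kg)
slope_std_cm = ols_slope(standardize(height_cm), weight_kg)

print(slope_m, slope_cm, slope_std_m, slope_std_cm)
```
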
---

# 1. Permutation importance. 

- It's found after the model has been fit.
- Suppose that a feature disappeared. In reality we cannot delete a feature: the model has already been trained on the whole feature space and can't be evaluated with a feature missing.
- But we can replace that feature with noise, as long as the noise comes from the same distribution as the original feature. This is what the permutation importance method does.

- Permutation importance says:
    - Okay, we have feature $i$.
    - Let's shuffle it at random.
    - Then re-score the model: the larger the drop in performance, the more important feature $i$ is.
    
![alt text](https://i.ibb.co/SwvgDZt/Screen-Shot-2020-11-01-at-19-40-40.png)
![alt text](https://i.ibb.co/HHxctdT/Screen-Shot-2020-11-01-at-19-40-45.png)
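
The shuffling idea can be sketched in plain Python as follows. This is only an illustration under assumptions: `predict` stands in for an already-trained model (it is hypothetical and deliberately ignores the second feature), the data is synthetic, and the importance is measured as the average increase in mean squared error over a few shuffles.

```python
import random

def predict(row):
    """Hypothetical, already-trained model: uses x0 and ignores x1."""
    return 3.0 * row[0] + 0.0 * row[1]

def mse(X, y):
    """Mean squared error of the fixed model on (X, y)."""
    return sum((predict(r) - t) ** 2 for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = mse(X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # same values, so same distribution
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic data: the target depends only on the first feature.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [3.0 * x0 for x0, _ in X]

importances = permutation_importance(X, y)
print(importances)
```

Shuffling the first column should hurt the score noticeably, while shuffling the ignored column leaves it unchanged. The model is never retrained here; only its inputs are perturbed.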

---

# 2. Partial Dependence Plot.

- It's found after the model has been fit.
- Feature importance shows **_what_ <font color=red>variables</font>** most affect predictions; partial dependence plots show **_how_ <font color=blue>a feature</font>** affects predictions.
- These plots are usually constructed by taking one observation and seeing how the prediction changes as one of its features is varied; repeating this for many observations and averaging the results gives the plotted curve.
- See example [here](https://www.kaggle.com/dansbecker/partial-plots).

![alt text](https://i.ibb.co/p4BtMdC/Screen-Shot-2020-11-01-at-19-56-16.png)
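
A minimal sketch of how such a curve can be computed by hand, assuming a hypothetical fitted model `predict` and a tiny made-up data set. For each grid value, the chosen feature is overwritten in every row and the predictions are averaged.

```python
def predict(row):
    """Hypothetical fitted model: quadratic in x0, linear in x1."""
    return row[0] ** 2 + row[1]

def partial_dependence(X, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in X:
            modified = list(row)       # copy, so the data stays untouched
            modified[feature] = value  # vary only the feature of interest
            preds.append(predict(modified))
        curve.append(sum(preds) / len(preds))
    return curve

# Made-up observations (columns: x0, x1).
X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [0.0, 1.0, 2.0]

pd_curve = partial_dependence(X, feature=0, grid=grid)
print(pd_curve)  # [2.0, 3.0, 6.0]
```

The curve follows $x_0^2$ shifted by the mean of $x_1$, which is the shape the plot would show for this toy model.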

----

# 3. Tree-Based.

- When a tree is built, it takes feature importance into account: at each step it greedily picks the feature that best splits the sample into two subsamples according to the chosen impurity criterion. That is how each split is made.
- How can we use this to build feature importance with trees?
    - The tree already knows which features are important and which are not.
    - At each step we know which feature the tree used for the split.
    - We can then apply one of these methods:
        - Gain.
        - Frequency (Split Count).
        - Cover (weighted Split Count).

---

## 3.1. Gain.

- This is the criterion the tree used to choose the split.
- $IG(i) = E(i) - E_{\text{weighted}}(\text{split from }i)$, where in our context
    - $IG(i)$ - information gain at node $i$.
    - $E(i)$ - entropy at node $i$.
    - $E_{\text{weighted}}(\text{split from }i)$ - weighted entropy of the left and right children of node $i$.
    - For more info and example see [here](https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8) and [here](https://victorzhou.com/blog/information-gain/).
    
    
- **i.e.:** Information gain is a measure of how much we reduce the entropy of the node by making the given split. The higher the information gain, the more entropy was removed $\implies$ the more important that feature is.


- **Note:** In `sklearn`, feature importance is calculated as the normalized total Gini reduction.
    - $\text{Gini-Reduction}(i) = \frac{N_i}{N_{\text{total}}}\big[\text{Gini}(i) - \text{Gini}_{\text{weighted}}(\text{split from }i)\big]$, where
        - $N_i$ - number of samples at node $i$.
        - $N_{\text{total}}$ - total number of samples in the data set.
        - $\text{Gini}(i)$ - Gini impurity at node $i$.
        - $\text{Gini}_{\text{weighted}}(\text{split from }i)$ - weighted Gini impurity of the left and right children of node $i$.
        - For more info and example see [here](https://stackoverflow.com/questions/49170296/scikit-learn-feature-importance-calculation-in-decision-trees) and [here](https://stackoverflow.com/a/15821880/6819878).



- <font color=blue>To determine the importance of a feature by **information gain**, we sum the information gain values of all the nodes where that feature was used for splitting.</font>
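
Both variants above can be sketched on a tiny hand-written tree. The `splits` list and the feature names `f1` and `f2` are hypothetical; each entry records which feature a node split on and the class labels reaching the node and its two children. The first result sums plain information gain per feature. The second applies the sample-weighted Gini reduction from the note and normalizes it to sum to 1.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(impurity, parent, left, right):
    """Impurity at the node minus the weighted impurity of its children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

# Hypothetical tree with three split nodes; the first entry is the root.
splits = [
    {"feature": "f1", "parent": [0, 0, 0, 0, 1, 1, 1, 1],
     "left": [0, 0, 0, 1], "right": [0, 1, 1, 1]},
    {"feature": "f2", "parent": [0, 0, 0, 1], "left": [0, 0, 0], "right": [1]},
    {"feature": "f1", "parent": [0, 1, 1, 1], "left": [0], "right": [1, 1, 1]},
]
n_total = len(splits[0]["parent"])

# Variant 1: sum of information gain over the nodes that use each feature.
gain_importance = defaultdict(float)
# Variant 2: Gini reduction weighted by N_i / N_total, then normalized.
gini_importance = defaultdict(float)
for s in splits:
    args = (s["parent"], s["left"], s["right"])
    gain_importance[s["feature"]] += impurity_decrease(entropy, *args)
    weight = len(s["parent"]) / n_total
    gini_importance[s["feature"]] += weight * impurity_decrease(gini, *args)

gini_sum = sum(gini_importance.values())
gini_normalized = {f: v / gini_sum for f, v in gini_importance.items()}

print(dict(gain_importance))
print(gini_normalized)
```

In this toy tree `f1` is used twice and `f2` once, and the two variants weigh them differently because the Gini version discounts splits that see fewer samples.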

---

## 3.2. Cover.

- It's the relative number of observations related to this feature. For example, suppose you have $100$ observations, $4$ features and $3$ trees, and $\text{feature_1}$ is used to decide the **leaf node** for $10$, $5$, and $2$ observations in $\text{tree_1}$, $\text{tree_2}$ and $\text{tree_3}$ respectively; then the cover count for this feature is $10+5+2 = 17$ observations. The same count is computed for all $4$ features, and the cover of $\text{feature_1}$ is $17$ expressed as a percentage of the total cover over all features.
- This is like frequency, but it also takes into account the number of observations that were involved in each split. A feature splitting a set of $50$ observations into two subsets of $47$ and $3$ observations will be more important than a feature splitting a set of $4$ observations into subsets of $3$ and $1$.

Read more about cover [here](https://datascience.stackexchange.com/questions/12318/how-to-interpret-the-output-of-xgboost-importance). It says that cover is not only calculated for the leaf nodes.
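
The arithmetic of the example can be written out as a short sketch. Only the `feature_1` counts ($10$, $5$, $2$) come from the text; the counts for the other three features are made up so that each tree accounts for all $100$ observations.

```python
# Observations per tree (tree_1, tree_2, tree_3) attributed to each feature.
cover_per_tree = {
    "feature_1": [10, 5, 2],    # from the example in the text
    "feature_2": [50, 40, 48],  # hypothetical
    "feature_3": [30, 35, 30],  # hypothetical
    "feature_4": [10, 20, 20],  # hypothetical
}

# Sum over trees, then express each feature as a share of the grand total.
total_cover = {f: sum(counts) for f, counts in cover_per_tree.items()}
grand_total = sum(total_cover.values())
cover_pct = {f: 100.0 * c / grand_total for f, c in total_cover.items()}

print(total_cover["feature_1"])          # 17
print(round(cover_pct["feature_1"], 2))  # 5.67
```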

## 3.3. Frequency.

- Here feature importance is directly proportional to how often that feature was used for splitting.
- It's the percentage representing the relative number of times a particular feature occurs in the trees of the model. In the above example, if $\text{feature_1}$ occurred in $2$ splits, $1$ split and $3$ splits in $\text{tree_1}$, $\text{tree_2}$ and $\text{tree_3}$ respectively, then the weight for $\text{feature_1}$ will be $2+1+3 = 6$.
- The frequency for $\text{feature_1}$ is calculated as its weight expressed as a percentage of the total weight of all features.
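
As a small sketch of the counting, assume each tree is represented simply by the list of features it split on. The lists below are hypothetical, arranged so that `feature_1` appears in $2$, $1$ and $3$ splits as in the example.

```python
from collections import Counter

# Hypothetical split features, one list per tree.
trees = [
    ["feature_1", "feature_2", "feature_1", "feature_3"],
    ["feature_2", "feature_1", "feature_4"],
    ["feature_1", "feature_1", "feature_3", "feature_1", "feature_2"],
]

# Weight = number of splits that use the feature, across all trees.
weight = Counter(feature for tree in trees for feature in tree)
total = sum(weight.values())
frequency_pct = {f: 100.0 * n / total for f, n in weight.items()}

print(weight["feature_1"])         # 6
print(frequency_pct["feature_1"])  # 50.0
```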

----


# 4. SHAP Values.

Well explained here:

- [One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values](https://towardsdatascience.com/one-feature-attribution-method-to-supposedly-rule-them-all-shapley-values-f3e04534983d).
- [Interpreting complex models with SHAP values](https://medium.com/@gabrieltseng/interpreting-complex-models-with-shap-values-1c187db6ec83).