# <a id='toc1_'></a>[Explainable Machine Learning](#toc0_)

**Skills**
1. Implement local explainable techniques like LIME, SHAP, and ICE plots using Python.
2. Implement global explainable techniques such as [Partial Dependence Plots](https://scikit-learn.org/stable/modules/partial_dependence.html) (PDP) and [Accumulated Local Effects](https://en.wikipedia.org/wiki/Accumulated_local_effects) (ALE) plots in Python.
3. Apply example-based explanation techniques to explain machine learning models using Python.
4. Visualize and explain neural network models using SOTA techniques in Python.
5. Critically evaluate interpretable attention and [saliency](https://en.wikipedia.org/wiki/Saliency_map) methods for transformer model explanations.
6. Explore emerging approaches to explainability for large language models (LLMs) and generative computer vision models.

---

**Table of contents**<a id='toc0_'></a>    
- [Explainable Machine Learning](#toc1_)    
- [Module 1️⃣](#toc2_)    
  - [Local Explanations](#toc2_1_)    
    - [LIME, 2016](#toc2_1_1_)    
      - [Process](#toc2_1_1_1_)    
      - [Limitations](#toc2_1_1_2_)    
      - [Pros & Cons](#toc2_1_1_3_)    
    - [Anchors, 2018](#toc2_1_2_)    
      - [Process](#toc2_1_2_1_)    
      - [Pros & Cons](#toc2_1_2_2_)    
    - [Shapley Values, 1952](#toc2_1_3_)    
      - [How to Interpret](#toc2_1_3_1_)    
      - [Formula](#toc2_1_3_2_)    
      - [Pros & Cons](#toc2_1_3_3_)    
    - [SHAP, 2017](#toc2_1_4_)    
      - [Types](#toc2_1_4_1_)    
        - [Kernel SHAP](#toc2_1_4_1_1_)    
      - [Process](#toc2_1_4_2_)    
        - [Tree SHAP](#toc2_1_4_2_1_)    
        - [Deep SHAP](#toc2_1_4_2_2_)    
      - [Pros & Cons](#toc2_1_4_3_)    
    - [ICE, 2014](#toc2_1_5_)    
      - [Process](#toc2_1_5_1_)    
      - [Types](#toc2_1_5_2_)    
        - [c-ICE](#toc2_1_5_2_1_)    
        - [d-ICE](#toc2_1_5_2_2_)    
      - [Pros & Cons](#toc2_1_5_3_)    
  - [Global Explanations](#toc2_2_)    
    - [Functional Decomposition](#toc2_2_1_)    
      - [Functional ANOVA](#toc2_2_1_1_)    
    - [Feature Interaction](#toc2_2_2_)    
      - [H-statistic, 2008](#toc2_2_2_1_)    
        - [Interpretation](#toc2_2_2_1_1_)    
      - [Pros & Cons](#toc2_2_2_2_)    
    - [Permutation Feature Importance, Model Reliance, 2018](#toc2_2_3_)    
      - [Process](#toc2_2_3_1_)    
      - [Pros & Cons](#toc2_2_3_2_)    
    - [PDP, 2001](#toc2_2_4_)    
      - [Process](#toc2_2_4_1_)    
      - [Formula](#toc2_2_4_2_)    
      - [Pros & Cons](#toc2_2_4_3_)    
    - [ALE Plots, 2020](#toc2_2_5_)    
      - [Process](#toc2_2_5_1_)    
      - [Pros & Cons](#toc2_2_5_2_)    
  - [Example-Based Explanations](#toc2_3_)    
    - [Types](#toc2_3_1_)    
      - [Prototype-based](#toc2_3_1_1_)    
          - [MMD-Critic, 2020](#toc2_3_1_1_1_1_)    
        - [Pros & Cons](#toc2_3_1_1_2_)    
      - [Counterfactuals](#toc2_3_1_2_)    
        - [Process](#toc2_3_1_2_1_)    
        - [Pros & Cons](#toc2_3_1_2_2_)    
      - [Influential Instances](#toc2_3_1_3_)    
        - [Pros & Cons](#toc2_3_1_3_1_)    
- [Module 2️⃣](#toc3_)    
  - [Visualizing NN Predictions](#toc3_1_)    
    - [Feature Visualization](#toc3_1_1_)    
      - [Pros & Cons](#toc3_1_1_1_)    
    - [Feature Attribution](#toc3_1_2_)    
      - [Vanilla Gradient](#toc3_1_2_1_)    
        - [Process](#toc3_1_2_1_1_)    
      - [Grad-CAM](#toc3_1_2_2_)    
        - [Process](#toc3_1_2_2_1_)    
      - [Pros & Cons](#toc3_1_2_3_)    
  - [Explaining NNs](#toc3_2_)    
  - [Explainable Attention](#toc3_3_)    
- [Module 3️⃣](#toc4_)    
- [My Questions](#toc5_)    
- [Resources](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[Module 1️⃣](#toc0_)

Focuses on model-agnostic explanation techniques. 
- *Model-agnostic* means that the techniques can be used across different types of models.

## <a id='toc2_1_'></a>[Local Explanations](#toc0_)
> Focus on explaining individual predictions.

- They approximate the original model's behavior locally around the instance of interest using an interpretable model like linear regression. 

### <a id='toc2_1_1_'></a>[LIME, 2016](#toc0_)
= Local Interpretable Model-agnostic Explanations

> use interpretable models to explain individual predictions of a black box ML model

<img src="imgs/lime_image.png" alt="Sources of Bias" width="600" height="200">

- Can be used for many data sources: tabular data, text, images
- Can show the features that most contribute to our prediction

```
    What LIME posits is that while the complex decision function of the original model cannot be approximated with a linear model, we can use a linear model to approximate the decision function at this local area of interest. We can do this by sampling instances around our area of interest and using those to train a new interpretable model like a linear regression model. 
```

#### <a id='toc2_1_1_1_'></a>[Process](#toc0_)
1. **Select instance** of interest (the prediction I want to explain)
2. **Perturb your dataset** and get black box predictions for perturbed samples
3. **Generate a new dataset** consisting of perturbed samples (variations of your data) and the corresponding predictions
4. **Train an interpretable model**, weighted by the proximity of sampled instances to the instance of interest
5. **Interpret the local model** to explain predictions

Note that LIME is a local explainability process and will not generalize to instances other than the one you selected at **(1)**. 

$$
    \text{explanation}(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)
$$
- $\mathcal{L}$ is a loss which measures how close the explanation is to the prediction of the original model, e.g. MSE for regression. 
- $\pi_x$ is the proximity measure, which defines how large the neighborhood around instance $x$ is that we consider for the explanation.
- $G$ is the family of possible explanations, e.g. all possible linear regression models.
- $\Omega$ is the model complexity. We try to keep this low, which means we prefer fewer features.

Note that in practice LIME only optimizes the loss $\mathcal{L}$. The user of LIME is responsible for determining the complexity by selecting the maximum number of features that the linear regression model may use.
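
The process above can be sketched from scratch for tabular data. This is a minimal illustration, not the `lime` package's implementation: the Gaussian perturbation, the exponential kernel, the closed-form ridge fit, and all names (`lime_tabular`, `black_box`, `kernel_width`) are assumptions made for the example.

```python
import numpy as np

def lime_tabular(predict, x, n_samples=5000, scale=1.0, kernel_width=0.75,
                 ridge=1e-3, seed=0):
    """Fit a proximity-weighted linear surrogate around the instance x."""
    rng = np.random.default_rng(seed)
    # Steps 2-3: perturb around x and collect the black box predictions
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = predict(Z)
    # Step 4: proximity weights pi_x from an exponential kernel on the distance to x
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted ridge regression in closed form (first column is the intercept)
    A = np.column_stack([np.ones(n_samples), Z - x])
    WA = A * weights[:, None]
    reg = ridge * np.eye(A.shape[1])
    reg[0, 0] = 0.0  # do not penalize the intercept
    coef = np.linalg.solve(A.T @ WA + reg, WA.T @ y)
    return coef[0], coef[1:]  # local intercept, local feature effects

def black_box(X):
    # Hypothetical nonlinear model standing in for the black box
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x0 = np.array([0.0, 1.0])
intercept, effects = lime_tabular(black_box, x0, scale=0.5, kernel_width=0.3)
print(effects)  # step 5: roughly the local slopes of the black box at x0
```

Changing `kernel_width` changes which samples count as "local", which is exactly the tuning problem discussed under Limitations.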

#### <a id='toc2_1_1_2_'></a>[Limitations](#toc0_)

How do we find the best kernel width for the exponential smoothing kernel? 
- There is no robust way to find it
- The problem gets worse for higher dimensional feature spaces

Note that 
- A *small kernel* width means that an instance must be very close to influence our local model. 
- Whereas a *larger kernel* width means that instances that are farther away also influence the model.

#### <a id='toc2_1_1_3_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Flexible - if you replace your ML model, you can still use LIME | Limitations in how to define a neighborhood and optimize it | 
| Intuitive | Instability of explanations - not always consistent across settings/experiments | 
| Works for tabular data, text and images | Can be used to hide biases and can be easily fooled | 

### <a id='toc2_1_2_'></a>[Anchors, 2018](#toc0_)
> Explain individual predictions by finding a decision rule (`IF-THEN` rule) that sufficiently "anchors" the prediction.

$$
    \mathbb{E}_{\mathcal{D}_x(z \mid A)}\big[ \mathbb{1}_{f(x)=f(z)} \big] \geq \tau, \quad A(x) = 1
$$
- $A$ is the anchor 
- $x$ is the instance of interest
- $\tau$ is the precision threshold

> $\textcolor{#ba4e00}{\text{Coverage}}$ is an anchor's probability of applying to its neighbors.
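
Precision (the left-hand side of the formula) and coverage can both be estimated from a sample of perturbed neighbors. The sketch below assumes a hypothetical rule format (a dict mapping a feature index to the value it fixes) and a toy classifier; it only evaluates a given rule and does not search for one.

```python
import numpy as np

def rule_applies(X, anchor):
    """Boolean mask of the rows of X that satisfy every predicate of the rule."""
    mask = np.ones(len(X), dtype=bool)
    for j, value in anchor.items():
        mask &= X[:, j] == value
    return mask

def precision_and_coverage(predict, x, anchor, X_perturb):
    applies = rule_applies(X_perturb, anchor)
    coverage = applies.mean()  # share of neighbors the rule applies to
    if not applies.any():
        return 0.0, coverage
    # Precision: share of matching neighbors that get the same prediction as x
    same = predict(X_perturb[applies]) == predict(x[None, :])[0]
    return same.mean(), coverage

def model(X):
    # Toy classifier (assumption): predicts 1 exactly when feature 0 equals 1
    return (X[:, 0] == 1).astype(int)

rng = np.random.default_rng(0)
X_perturb = rng.integers(0, 2, size=(1000, 3))  # binary perturbation space
x = np.array([1, 0, 1])

print(precision_and_coverage(model, x, {0: 1}, X_perturb))  # a valid anchor
print(precision_and_coverage(model, x, {1: 0}, X_perturb))  # not an anchor
```

A rule counts as an anchor only if its estimated precision reaches the threshold $\tau$; among valid anchors, higher coverage is preferred.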

#### <a id='toc2_1_2_1_'></a>[Process](#toc0_)

<img src="imgs/anchors_process.png" alt="Anchors Process" width="600" height="200">

1. **Candidate generation**
   - In the first round, it creates one candidate per feature of the instance being explained, $x$, where each candidate fixes the respective value of possible perturbations. In subsequent rounds, the best candidates from the previous round are extended by adding one feature predicate that is not yet included in them.
2. **Identify the best candidate**
   - Compares candidate rules to determine which one explains the instance $x$ best. 
   - To do this, it creates perturbations that match the currently observed rule and evaluates them by calling the model. However, these model calls need to be minimized to limit computational overhead. At the core of this component is a pure exploration, multi-armed bandit algorithm.
   - In this setting, each candidate rule is considered an arm that can be pulled. Each time an arm is pulled, the respective neighbors are evaluated, and we obtain more information about the candidate rule's payoff (its precision, in the case of anchors). The precision indicates how well the rule describes the instance being explained.
3. **Candidate precision validation**
   - Takes more samples if there is not yet statistical confidence that the candidate exceeds the threshold $\tau$. 
4. **Modified beam search**
   - assembles all of the above components into a beam search, which is a graph search algorithm and a variant of the breadth-first algorithm. It carries over the B best candidates from each round to the next round, where `B` is the beam width. These `B` best rules are then used to create new rules. At every round $i$, it generates candidates with exactly $i$ predicates and selects the `B` best among them. 
   - By setting a high value for `B`, the algorithm is more likely to avoid local optima, but this requires a high number of model calls and thereby increases the computational load. 

> $\textcolor{#ba4e00}{\text{Multi-Armed Bandits}}$ are used to efficiently explore and exploit different strategies called arms in analogy to slot machines using sequential selection.

❗ One caveat when using anchors is that you need to be really careful if you have unbalanced data, because the perturbation space reflects the training data.

- How to solve? 
  - mitigation methods
    - defining a custom perturbation space through sampling differently, 
    - using a subset of training data or 
    - modifying the multi-armed bandit parameters (confidence and error), which could result in more samples being drawn, ultimately leading to the minority being sampled more often.

#### <a id='toc2_1_2_2_'></a>[Pros & Cons](#toc0_)

| Pros | Cons | 
|------|------|
| Rules are very easy to understand | Requires careful parameter tuning | 
| Works when predictions are locally nonlinear or complex | Perturbation function needs to be designed for each use case | 
| Model agnostic | The concept of coverage is still undefined in some domains and difficult to make comparisons | 
| Highly efficient | | 

### <a id='toc2_1_3_'></a>[Shapley Values, 1952](#toc0_)

> $\textcolor{#ba4e00}{\text{Shapley Values}}$ are a method from coalitional game theory (where game players cooperate in a coalition) that tells us how to fairly distribute the "payout" among players based on their contribution to the total payout.

#### <a id='toc2_1_3_1_'></a>[How to Interpret](#toc0_)

> Given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean prediction is the estimated Shapley value.

#### <a id='toc2_1_3_2_'></a>[Formula](#toc0_)

$$
   \underbrace{\phi_j(val)}_{\begin{subarray}{l}\text{Shapley value}\\\text{for feature} \; j\end{subarray}} = \sum_{S \subseteq \{ 1, \dots, p \} \setminus \{ j \}} \underbrace{\frac{|S|!\,( p - |S| - 1 )!}{p!}}_{ \begin{subarray}{l}\text{weighting based on total}\\\text{number of features} \; p \; \text{and} \\ \text{number of features in subset}\end{subarray}}  \Big( \underbrace{val( S \cup \{ j \} )}_{\begin{subarray}{l}\text{Subset prediction with}\\ \text{feature} \; j \; \text{included}\end{subarray}} - \overbrace{val(S)}^{\begin{subarray}{l}\text{Subset prediction}\\ \text{without feature} \\ j \; \text{included}\end{subarray}} \Big)
$$
- $S$: Subset of the features used in the model

In reality we can't just remove a feature,
- so instead we just replace the feature with a random value from our training set.

<img src="imgs/shap_random_replacement.png" alt="SHAP Random Replacement" width="600" height="100">
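
For a handful of features the formula can be evaluated exactly by enumerating every subset. The sketch below assumes that "removing" a feature means replacing it with the values found in a background sample and averaging the predictions; the function name, the toy model and the data are made up for illustration.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for instance x; cost grows as 2^p, so keep p small."""
    p = x.size

    def val(S):
        # Features in S keep their value from x, all others come from the background
        idx = np.array(S, dtype=int)
        Xs = background.copy()
        Xs[:, idx] = x[idx]
        return predict(Xs).mean()

    phi = np.zeros(p)
    for j in range(p):
        others = [i for i in range(p) if i != j]
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for S in combinations(others, size):
                phi[j] += weight * (val(S + (j,)) - val(S))
    return phi

def model(X):
    # Hypothetical model with an interaction between features 0 and 1
    return X[:, 0] * X[:, 1] + 2.0 * X[:, 2]

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, -1.0])
phi = shapley_values(model, x, background)
# Efficiency: the values sum to prediction minus average prediction
print(phi.sum(), model(x[None, :])[0] - model(background).mean())
```

The two printed numbers agree, which is the "fairly distributed" property: the Shapley values add up to the difference between the prediction for `x` and the mean prediction over the background data.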

#### <a id='toc2_1_3_3_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| The difference between the prediction and the average prediction is **fairly distributed** among the feature values of the instance | **Computation time is very high** (a set of $p$ features has $2^p$ subsets) |
| Enables contrastive explanations (you can compare to the average prediction, a subset, or even to a single data point) | **Because of this, it requires approximation to be practical** |
| Based on theory rather than assumptions (e.g. LIME assumes local linear behavior) | Can be easily misinterpreted |
| | Always uses all the features | 
| | You need access to the data | 

### <a id='toc2_1_4_'></a>[SHAP, 2017](#toc0_)
> $\textcolor{#ba4e00}{\textbf{SH}\text{apley} \textbf{A}\text{dditive ex}\textbf{P}\text{lanations}}$ proposes to approximate Shapley values.

#### <a id='toc2_1_4_1_'></a>[Types](#toc0_)
##### <a id='toc2_1_4_1_1_'></a>[Kernel SHAP](#toc0_)
> $\textcolor{#ba4e00}{\text{Kernel SHAP}}$: Samples feature subsets (coalitions) and fits a linear regression model, where the variables indicate whether a feature is present or absent and the output value is the prediction. The coefficients of the fitted linear regression model are the approximations of the Shapley values.

#### <a id='toc2_1_4_2_'></a>[Process](#toc0_)
1. Sample coalitions $z_k' \in \{0, 1\}^M, \; k \in \{1, \dots, K\}$.
2. Get prediction for each $z_k'$ by first converting $z_k'$ to the original feature space and then applying model $\hat{f}: \hat{f}(h_x(z_k'))$.
3. Compute the weight for each $z_k'$ with the SHAP kernel. 
4. Fit weighted linear model.
5. Return Shapley values $\phi_k$, the coefficients from the linear model.
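
The five steps can be sketched with NumPy alone. This is an illustration rather than the `shap` library's code, and it makes two simplifying assumptions: all $2^M$ coalitions are enumerated instead of sampled, and the infinite kernel weight of the empty and full coalitions is replaced by a large finite number.

```python
import numpy as np
from itertools import product
from math import comb

def kernel_shap(predict, x, background):
    M = x.size
    # Step 1: coalitions z' in {0, 1}^M (enumerated here, sampled in practice)
    Z = np.array(list(product([0, 1], repeat=M)))
    # Step 2: h_x keeps x where z' = 1 and uses background values where z' = 0
    y = np.array([predict(np.where(z == 1, x, background)).mean() for z in Z])
    # Step 3: SHAP kernel weights
    weights = np.empty(len(Z))
    for i, s in enumerate(Z.sum(axis=1)):
        s = int(s)
        if s == 0 or s == M:
            weights[i] = 1e6  # stand-in for the infinite weight (assumption)
        else:
            weights[i] = (M - 1) / (comb(M, s) * s * (M - s))
    # Step 4: weighted least squares with an intercept column
    A = np.column_stack([np.ones(len(Z)), Z])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    # Step 5: intercept is the base value, the other coefficients are phi_k
    return coef[0], coef[1:]

w = np.array([1.0, -2.0, 0.5])  # hypothetical linear model for the demo

def linear_model(X):
    return X @ w

rng = np.random.default_rng(0)
background = rng.normal(size=(20, 3))
x = np.array([1.0, 1.0, 1.0])
base_value, phi = kernel_shap(linear_model, x, background)
print(phi)  # for a linear model: w * (x - background mean)
```

A linear model is used for the demo because its Shapley values have a known closed form, which makes the output easy to check.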

##### <a id='toc2_1_4_2_1_'></a>[Tree SHAP](#toc0_)
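> $\textcolor{#ba4e00}{\text{Tree SHAP}}$: A SHAP variant for tree-based models (decision trees, random forests, gradient-boosted trees) that exploits the tree structure to compute Shapley values much faster than Kernel SHAP.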

##### <a id='toc2_1_4_2_2_'></a>[Deep SHAP](#toc0_)
> $\textcolor{#ba4e00}{\text{Deep SHAP}}$: A SHAP variant for deep neural networks that approximates Shapley values by building on DeepLIFT.

#### <a id='toc2_1_4_3_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| The difference between the prediction and the average prediction is **fairly distributed** among the feature values of the instance | Kernel SHAP can still be slow | 
| Enables contrastive explanations (you can compare to the average prediction, a subset, or even to a single data point) | You need access to the data |
| Faster computation | It is possible to create intentionally misleading interpretations with SHAP and it can be used to hide biases | 

### <a id='toc2_1_5_'></a>[ICE, 2014](#toc0_)
> $\textcolor{#ba4e00}{\text{Individual Conditional Expectation}}$ plots one line per instance (single data point/observation) that displays how the instance's prediction changes when a feature changes.

The intuition is "what would happen if we changed one feature of this instance while holding all other features constant".

- **ICE (Individual Conditional Expectation)**: Shows one line per instance, highlighting variability between instances.
- **PDP (Partial Dependence Plot)**: Aggregates these lines (typically by averaging them) to show an overall trend.

#### <a id='toc2_1_5_1_'></a>[Process](#toc0_)

1. Select an instance and a feature of interest. 
2. Keep all other features constant.
3. Create variants of the instance by replacing the feature's values.
4. Make predictions with the black box model for newly created instances.
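
A minimal NumPy sketch of these steps, applied to every instance at once. The function name, the toy model and the grid are assumptions for illustration; plotting each row of `curves` against `grid` gives the ICE plot.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """Return an array of shape (n_instances, n_grid): one ICE curve per row."""
    curves = np.empty((len(X), len(grid)))
    for g, value in enumerate(grid):
        X_mod = X.copy()            # step 2: all other features stay as they are
        X_mod[:, feature] = value   # step 3: replace the feature's value
        curves[:, g] = predict(X_mod)  # step 4: query the black box
    return curves

def model(X):
    # Hypothetical model where the effect of feature 0 depends on feature 1
    return X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
grid = np.linspace(-1.0, 1.0, 5)
curves = ice_curves(model, X, feature=0, grid=grid)
pdp = curves.mean(axis=0)  # averaging the ICE lines gives the PDP
print(curves.shape, pdp.shape)
```

Because the toy model contains an interaction, the individual lines have different slopes, which is the kind of heterogeneity an averaged plot would hide.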

#### <a id='toc2_1_5_2_'></a>[Types](#toc0_)
##### <a id='toc2_1_5_2_1_'></a>[c-ICE](#toc0_)
> $\textcolor{#ba4e00}{\text{Centered ICE Plots}}$ center the curves at a certain point in the feature and display only the difference in the prediction to this point. 


##### <a id='toc2_1_5_2_2_'></a>[d-ICE](#toc0_)
> $\textcolor{#ba4e00}{\text{Derivative ICE Plots}}$ show whether changes occur and in which direction they occur.

- Plots the individual derivatives of the prediction function with respect to a feature.

#### <a id='toc2_1_5_3_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Intuitive, easy to understand | Can only meaningfully display one feature | 
| Can uncover heterogeneous relationships | Plot can become overcrowded easily |
| | Difficult to see average |
| | If the feature of interest is correlated with the other features, then some points in the lines might be invalid data points - according to the joint feature distribution. |

## <a id='toc2_2_'></a>[Global Explanations](#toc0_)

> aim to provide an understanding of the overall behavior and decision making process of a machine learning model, rather than focusing on individual predictions or local regions of the input space.

### <a id='toc2_2_1_'></a>[Functional Decomposition](#toc0_)
> Divides complex models into simpler, constituent parts. Each part or function can be analyzed separately, making it easier to understand the overall behavior of the model.

Each ML model can be broken down into 3 components: 
1. **Main Effects**: how each feature affects the prediction, independent of the values of the other features.
2. **Interaction Effect**: joint effect of the features.
3. **Intercept**: what the prediction is when all other feature effects are set to zero.

#### <a id='toc2_2_1_1_'></a>[Functional ANOVA](#toc0_)
- **Additive Decomposition**
  - decompose the prediction function f(x) of a model into a sum of terms, where each term represents the effect of one or more input features on the prediction.
- **Individual Contributions**
  - breaking down the prediction function into these additive components allows for an analysis of how much each feature and each interaction contributes to the model's output.
- **Interaction Effects**
  - quantify interaction effects between the features.
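
Written out, the additive decomposition takes the following standard form, with the intercept $f_0$, the main effects $f_j$, and interaction terms of increasing order (the symbols $f_0$, $f_j$, $f_{jk}$ are notation introduced here for illustration):

$$
   \hat{f}(x) = f_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{j < k} f_{jk}(x_j, x_k) + \dots + f_{1, \dots, p}(x_1, \dots, x_p)
$$

With $p$ features there are $2^p$ components in total, which is one way to see why the decomposition is computationally demanding in practice.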

| Pros | Cons |
|------|------|
| Theoretical justification for decomposing high-dimensional and complex ML models into individual effects and interactions | In practice difficult because of the computational requirements | 
| | More appropriate for tabular data vs image or text data |

### <a id='toc2_2_2_'></a>[Feature Interaction](#toc0_)
To estimate the interaction strength, we can measure how much of the variation of the prediction depends on the interaction of the features, and we can do this with something called the [H statistic](https://christophm.github.io/interpretable-ml-book/interaction.html#theory-friedmans-h-statistic).

#### <a id='toc2_2_2_1_'></a>[H-statistic, 2008](#toc0_)
> $\textcolor{#ba4e00}{\text{H-statistic}}$ for feature interaction. Includes the two-way H-statistic (interactions between 2 variables, $j$ and $k$) and the total H-statistic ($j$ vs. all).

It iterates over all data points and at each data point the partial dependence (PD) is evaluated which in turn is done with all $n$ data points.

H-statistic between feature $j$ and $k$: 

$$
   H_{jk}^2 = \frac{\sum_{i=1}^{n}\Big[ PD_{jk}(x_j^{(i)}, x_k^{(i)}) - PD_{j}(x_j^{(i)}) - PD_{k}(x_k^{(i)}) \Big]^2}{\sum_{i=1}^{n}PD_{jk}^2(x_j^{(i)}, x_k^{(i)})}
$$

H-statistic between feature $j$ and any other feature: 

$$
   H_{j}^2 = \frac{\sum_{i=1}^{n}\Big[ \hat{f}(x^{(i)}) - PD_{j}(x_j^{(i)}) - PD_{-j}(x_{-j}^{(i)}) \Big]^2}{\sum_{i=1}^{n}\hat{f}^2(x^{(i)})}
$$

##### <a id='toc2_2_2_1_1_'></a>[Interpretation](#toc0_)

- $\text{H-statistic} = 0 \rightarrow$ no interaction.
- $\text{H-statistic} = 1 \rightarrow$ all of the variance of $PD_{jk}$ (two-way) or of $\hat{f}$ (total) comes from the interaction; none of it is explained by the sum of the individual PD functions.
- $\text{2-way H-statistic} = 1 \rightarrow$ each single PD function is constant and the effect on the prediction only comes through the interaction.
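
The two-way statistic can be computed directly from its definition. The sketch below is a naive $O(n^2)$ implementation with mean-centered partial dependence functions; the function names and the two toy models are assumptions chosen so that the expected values (0 and 1) are known in advance.

```python
import numpy as np

def centered_pd(predict, X, cols):
    """Partial dependence evaluated at each observed value of cols, mean-centered."""
    pd = np.empty(len(X))
    for i in range(len(X)):
        X_mod = X.copy()
        X_mod[:, cols] = X[i, cols]  # fix the selected features at instance i's values
        pd[i] = predict(X_mod).mean()
    return pd - pd.mean()

def h2_two_way(predict, X, j, k):
    pd_jk = centered_pd(predict, X, [j, k])
    pd_j = centered_pd(predict, X, [j])
    pd_k = centered_pd(predict, X, [k])
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
X -= X.mean(axis=0)  # centered features keep the pure-interaction case clean

def additive(X):
    return 2.0 * X[:, 0] + 3.0 * X[:, 1]   # no interaction

def product_model(X):
    return X[:, 0] * X[:, 1]               # interaction only

print(h2_two_way(additive, X, 0, 1))       # close to 0
print(h2_two_way(product_model, X, 0, 1))  # close to 1
```

Each call evaluates the model roughly $3n^2$ times, which illustrates why the statistic is listed as computationally expensive.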

#### <a id='toc2_2_2_2_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Meaningful interpretation | Computationally expensive |
| Dimensionless, which makes it comparable across both features and models | Results can be unstable (there is inherent variance in the statistic itself) | 
| | The H-statistic can be greater than $1$ making it hard to interpret | 
| | Doesn't work with highly correlated features |

### <a id='toc2_2_3_'></a>[Permutation Feature Importance, Model Reliance, 2018](#toc0_)
> The importance of a feature can be measured by calculating how much the model's prediction error increases after permuting the feature.

#### <a id='toc2_2_3_1_'></a>[Process](#toc0_)
1. Estimate the original model error $e_{orig} = \mathcal{L}( Y, \hat{f}(X) )$
2. For each feature $j \in \{ 1, \dots, p\}$:
   - Generate feature matrix $X_{perm}$ by permuting feature $j$ in the data $X$
   - Estimate error $e_{perm} = \mathcal{L}( Y, \hat{f}(X_{perm}) )$ based on the predictions of the permuted data
   - Calculate permutation feature importance either as
     - quotient $FI_j = \frac{e_{perm}}{e_{orig}}$ or 
     - difference $FI_j = e_{perm} - e_{orig}$ 
3. Sort features by descending $FI$
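
A from-scratch sketch of the procedure using the difference variant. Averaging over `n_repeats` shuffles is an addition of this example to reduce the randomness of a single permutation; the model and data are made up.

```python
import numpy as np

def permutation_importance(predict, X, y, loss, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    e_orig = loss(y, predict(X))  # step 1: original model error
    fi = np.zeros(X.shape[1])
    for j in range(X.shape[1]):   # step 2: one feature at a time
        errors = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X[:, j])  # break the link between x_j and y
            errors.append(loss(y, predict(X_perm)))
        fi[j] = np.mean(errors) - e_orig  # difference FI_j = e_perm - e_orig
    return fi

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def model(X):
    # Hypothetical fitted model that only uses feature 0
    return 3.0 * X[:, 0]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

fi = permutation_importance(model, X, y, mse)
ranking = np.argsort(-fi)  # step 3: sort features by descending importance
print(fi, ranking)
```

Feature 1 receives an importance of zero because the model never reads it, while shuffling feature 0 destroys the predictions.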

#### <a id='toc2_2_3_2_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Easy to interpret | Because it depends on shuffling the feature, which adds randomness to the measurement, when repeated, the results might vary |
| Provides global insight into model behavior | If features are correlated, it can be biased by unrealistic data instances (just like PDP) | 
| Does not require retraining the model | | 

> Model specific feature importance examples include the 
> 
> - Gini importance in random forests and 
> - Standardized regression coefficients in regression models.

An alternative to permutation based feature importance is variance based feature importance measures like functional ANOVA.

### <a id='toc2_2_4_'></a>[PDP, 2001](#toc0_)
> $\textcolor{#ba4e00}{\text{Partial Dependence Plots }}$ show the marginal effect one or two features have on the predicted outcome of a model.

A PDP is the average of the lines of an ICE plot. 

#### <a id='toc2_2_4_1_'></a>[Process](#toc0_)
1. Select feature of interest
2. For every instance in training dataset:
   - Keep all other features the same, create variants of the instance by replacing the feature's value with values from the grid
   - Make predictions with the black box model for newly created instances
3. Average across all instances and plot

#### <a id='toc2_2_4_2_'></a>[Formula](#toc0_)

We estimate the partial dependence (PD) function by averaging over the training data (a Monte Carlo approximation):
$$
   pd_s(x_s) = \frac{1}{n}\sum_{i=1}^{n}f(x_s, x_c^{(i)})
$$

- $x_c^{(i)}$: Fixed values for features (excluding $x_s$) of observation $i$
- We vary the value of $x_s$
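
The estimator translates almost line by line into NumPy. The sketch below is an illustration with made-up names and a toy model; for a real scikit-learn estimator, the library's own partial dependence utilities linked at the top of these notes would be the usual choice.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """pd_s(x_s) for each grid value: average prediction with x_s forced to that value."""
    pd = np.empty(len(grid))
    for g, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value       # vary x_s, keep x_c^{(i)} as observed
        pd[g] = predict(X_mod).mean()   # (1/n) * sum over the n instances
    return pd

def model(X):
    # Hypothetical model: linear in feature 0, quadratic in feature 1
    return 2.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
grid = np.linspace(-2.0, 2.0, 9)
pd = partial_dependence(model, X, feature=0, grid=grid)
print(np.diff(pd) / np.diff(grid))  # constant slope of 2 for feature 0
```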

#### <a id='toc2_2_4_3_'></a>[Pros & Cons](#toc0_)
| Pros | Cons |
|------|------|
| Interpretation is very clear (in uncorrelated case) | Assumes features are independent | 
| Easy to implement | Maximum number of features per plot is 2 | 
| | Only show average marginal effects |

### <a id='toc2_2_5_'></a>[ALE Plots, 2020](#toc0_)
> $\textcolor{#ba4e00}{\text{Accumulated Local Effects Plots}}$ were introduced to allow for visualization of the effects of correlated features.
>
> For comparison, a PDP shows the average effect of a single feature on a model's predictions, while keeping all other features constant.
>
> **Key Idea**: PDP smooths out individual differences (shown in ICE plots) and gives you the overall trend of how the feature affects predictions.
>
> **In Short**: PDP = "average prediction across all data points as the feature changes."

- Include local effects and accumulation
- focus on local changes in the prediction when a feature value changes
  - unlike PDPs which average out the effects over the entire feature space
- Instead of plotting the local effects directly, ALE plots accumulate these effects across the range of the feature.

> helps in understanding the global trend of how the feature influences predictions

#### <a id='toc2_2_5_1_'></a>[Process](#toc0_)
1. **Bin the Feature**
2. **Compute Local Effects**
   - For each bin, calculate the local effect of the feature on the prediction. 
   - This involves calculating the change in prediction when moving from the lower to the upper edge of the bin, averaging this change over all instances that fall into that bin
3. **Accumulate Effects**
   - Starting from the first bin, accumulate the local effects across all bins. 
   - Sum up the average effects sequentially to show how the feature influences the prediction as its value changes.
4. **Centering**
   - To make the plot more interpretable, center the accumulated effects around zero. 
   - Subtract the mean of accumulated effects which forces the interpretation to focus on deviations from the average prediction. 
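
The four steps can be sketched for a single numeric feature as follows. Quantile-based bins and centering by the instance-weighted average of the bin midpoint values are simplifying assumptions of this example, as are the function name and the toy model.

```python
import numpy as np

def ale_1d(predict, X, feature, n_bins=10):
    x = X[:, feature]
    # Step 1: bin the feature using quantiles so every bin holds some instances
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)))
    n_intervals = len(edges) - 1
    bin_idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_intervals - 1)
    local = np.zeros(n_intervals)
    counts = np.zeros(n_intervals)
    # Step 2: local effect = mean prediction change from lower to upper bin edge,
    # computed only with the instances that actually fall into the bin
    for b in range(n_intervals):
        mask = bin_idx == b
        counts[b] = mask.sum()
        if counts[b] == 0:
            continue
        X_lo, X_hi = X[mask].copy(), X[mask].copy()
        X_lo[:, feature] = edges[b]
        X_hi[:, feature] = edges[b + 1]
        local[b] = np.mean(predict(X_hi) - predict(X_lo))
    # Step 3: accumulate the local effects across bins
    ale = np.concatenate([[0.0], np.cumsum(local)])
    # Step 4: center so the instance-weighted average effect is zero
    midpoints = 0.5 * (ale[:-1] + ale[1:])
    ale -= np.sum(midpoints * counts) / counts.sum()
    return edges, ale

def model(X):
    return 2.0 * X[:, 0] + X[:, 1]  # hypothetical model

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
edges, ale = ale_1d(model, X, feature=0)
print(np.diff(ale) / np.diff(edges))  # slope of 2 in every bin
```

Because each bin only uses instances whose feature value really lies in that bin, the model is never evaluated far away from the observed data, which is what makes ALE usable with correlated features.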

#### <a id='toc2_2_5_2_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Unbiased (works even when features are correlated) | ALE and linear coefficients don't match when features are highly correlated | 
| Allows clear interpretation | Implementation more complex and less intuitive than PDPs | 
| Faster to compute than PDPs | |

## <a id='toc2_3_'></a>[Example-Based Explanations](#toc0_)
> a way of interpreting and understanding the decisions made by machine learning models, particularly in complex tasks like CV and NLP.
>
> aim to provide human understandable justifications for a model's predictions by presenting relevant examples or counterfactual scenarios.

Can be particularly useful in high-stakes decision-making scenarios such as medical diagnoses, loan approvals, or criminal risk assessment, where it is crucial to understand the reasoning behind the model's predictions.

These explanations can 
- help build trust in the model's decisions, 
- identify potential biases or flaws, and 
- facilitate human AI collaboration.

Example-based explanations may not always fully capture the complex decision-making process of deep neural networks. 
- Should be used in conjunction with other interpretation techniques to gain a more comprehensive understanding of the model's behavior.

### <a id='toc2_3_1_'></a>[Types](#toc0_)
#### <a id='toc2_3_1_1_'></a>[Prototype-based](#toc0_)
> Aims to explain the predictions of an ML model by identifying representative examples or "prototypes" from the data.

**Main thesis** is that we represent the model's knowledge in terms of prototypical instances or patterns.

1. For a given prediction, the model selects the most similar training examples or prototypes. 
   - These prototypes are typically chosen based on their closeness in feature space to the input instance.
2. To do this, the model assesses the similarity between the input instance and various prototypes. 
   - The similarity is typically measured using distance measures, like Euclidean distance in feature space. 
3. The system then presents these prototypes to the user, showing how the input instance resembles these examples. 
4. Users can inspect the prototypes to see common features or patterns, making it easier to understand why the model made a certain prediction. 

In addition to prototypes, you can use *contrastive examples* or *criticism*.
- helps users understand what differentiates the predicted class from other classes.

###### <a id='toc2_3_1_1_1_1_'></a>[MMD-Critic, 2020](#toc0_)
> Combines prototypes and criticisms in a single framework. Selects prototypes that minimize the discrepancy between the distribution of the prototypes and the distribution of the data.

**Process**
1. Choose the number of prototypes and criticisms you want
2. Find prototypes with greedy search
3. Find criticisms with greedy search

GOAL: minimize the squared maximum mean discrepancy (MMD)

- **MMD** = measure of how different two distributions are.
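
The greedy prototype search (step 2) can be sketched as below. The RBF kernel, its `gamma` value and the two-cluster toy data are assumptions for illustration, and the criticism search (step 3) is left out.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def mmd2(K, proto_idx):
    """Squared MMD between the prototypes (rows proto_idx) and the full data set."""
    P = list(proto_idx)
    return K[np.ix_(P, P)].mean() - 2.0 * K[P, :].mean() + K.mean()

def greedy_prototypes(X, n_prototypes, gamma=0.5):
    K = rbf_kernel(X, X, gamma)
    selected = []
    for _ in range(n_prototypes):
        candidates = [i for i in range(len(X)) if i not in selected]
        # Add the point that lowers the squared MMD the most
        best = min(candidates, key=lambda i: mmd2(K, selected + [i]))
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(40, 2))
cluster_b = rng.normal(loc=10.0, scale=0.5, size=(40, 2))
X = np.vstack([cluster_a, cluster_b])

prototypes = greedy_prototypes(X, n_prototypes=2)
print(prototypes)  # one index below 40 and one at or above 40
```

With two well-separated clusters, the two prototypes land in different clusters, since a second prototype from the same cluster would barely reduce the discrepancy to the data distribution.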

##### <a id='toc2_3_1_1_2_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Prototypes themselves are concrete instances that capture the essence of each class or decision boundary | Doesn't scale well with high dimensional data | 
| Aligns with human intuition | Primarily provide local explanations | 
| Model agnostic | Easy to overfit or underfit | 
| | Challenging to select representative prototypes | 

#### <a id='toc2_3_1_2_'></a>[Counterfactuals](#toc0_)
> Describes the *Smallest change to the feature values that changes the prediction* to a predefined output: 
> 
> "If $X$ had not occurred, $Y$ would not have occurred"

##### <a id='toc2_3_1_2_1_'></a>[Process](#toc0_)
1. Define a loss function 
2. Find the counterfactual explanation that minimizes this loss using an optimization algorithm
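
As an illustration of the two steps, the sketch below uses a loss of the form $\lambda (\hat{f}(x') - y')^2 + d(x, x')$ and plain gradient descent. The fixed logistic model, the squared Euclidean distance, and all hyperparameters (`target`, `lam`, `lr`, `n_steps`) are assumptions chosen for the example, not a prescribed recipe.

```python
import numpy as np

w, b = np.array([1.5, -2.0]), -0.5  # hypothetical, already-trained logistic model

def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def counterfactual(x, target=0.6, lam=100.0, lr=0.002, n_steps=5000):
    """Minimize lam * (f(x') - target)^2 + ||x' - x||^2 by gradient descent."""
    x_cf = x.astype(float).copy()
    for _ in range(n_steps):
        p = predict_proba(x_cf)
        # Gradient of the prediction term (chain rule through the sigmoid)
        # plus the gradient of the distance term
        grad = 2.0 * lam * (p - target) * p * (1.0 - p) * w + 2.0 * (x_cf - x)
        x_cf -= lr * grad
    return x_cf

x = np.array([0.0, 1.0])
x_cf = counterfactual(x)
print(predict_proba(x), predict_proba(x_cf))  # below 0.5 before, above 0.5 after
print(x_cf - x)                               # the suggested feature changes
```

The distance term keeps the counterfactual close to the original instance, while `lam` controls how strongly the desired prediction is enforced. Different settings yield different counterfactuals, which is one source of the Rashomon effect.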

##### <a id='toc2_3_1_2_2_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Clear interpretation | Rashomon Effect: for each instance, you will usually find multiple counterfactual explanations. How to report them? | 
| Does not require access to the data or model (only to the model's prediction function) | | 
| Relatively easy to implement | | 

#### <a id='toc2_3_1_3_'></a>[Influential Instances](#toc0_)
> A training instance is considered *influential* if its deletion from the training data considerably changes the parameters or predictions of the model.

By identifying influential training instances, we can 
- debug models and better explain their behavior and predictions.
- answer questions about global model behavior 
- answer questions about individual predictions

This can help answer questions like: 
1. Which were the most influential instances for the model parameters or the predictions overall? 
2. Which were the most influential instances for a particular prediction? 

Two major types: 
1. Deletion Diagnostics
  > Measure the effect of deleting an instance on model parameters (DFBETA) or model predictions (Cook's distance)
  
  - $\textcolor{#ba4e00}{\text{DFBETA}}$: you remove an instance, recalculate the weight vector, and compare it to the weight vector obtained with all instances.
  - $\textcolor{#ba4e00}{\text{Cook's distance}}$: you remove an instance, recalculate the model predictions, and compare them to the predictions obtained with all instances
   - *Example*: You recalculate the mean $\mu$ once per student in the dataset, omitting that student's GPA each time, and measure how much the mean estimate changes.
2. Influence Functions, 2017
  > Approximates how much the model changes when you upweight the loss of a training instance by an infinitesimally small step $\epsilon$, which results in new model parameters.
   - *Example*: You upweight one person's GPA by a very small weight.

The above is for the example of: 
- How much can your average GPA be influenced by a single person? 
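
Both ideas can be worked through on the GPA example, where the "model" is just the mean. The GPA values are made up. For the mean, upweighting instance $i$ by $\epsilon$ changes the estimate at a rate of $x_i - \mu$, and deleting the instance corresponds to $\epsilon = -1/n$.

```python
import numpy as np

gpas = np.array([3.1, 3.4, 2.9, 3.6, 3.3, 1.0])  # made-up data, one unusual student
n = len(gpas)
mu = gpas.mean()

# Deletion diagnostics: recompute the mean once per student, leaving that student out
loo_means = np.array([np.delete(gpas, i).mean() for i in range(n)])
deletion_effect = loo_means - mu

# Influence function: first-order approximation of the same change
# (derivative x_i - mu, evaluated for a step of eps = -1/n)
influence_approx = -(gpas - mu) / n

most_influential = int(np.argmax(np.abs(deletion_effect)))
print(most_influential)                   # index of the student with GPA 1.0
print(deletion_effect, influence_approx)  # same ranking, slightly different size
```

The approximation differs from the exact deletion effect by a factor of $n/(n-1)$, a small-scale version of "influence functions are only approximate".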

**What's the difference between influential instances and outliers?**
- influential instances look at the relationship between an instance and the resulting model parameters or predictions
- outliers look at the relationship between an instance and the other instances or data.

##### <a id='toc2_3_1_3_1_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Great for debugging | Computationally expensive | 
| Deletion diagnostics are model agnostic | Influence functions are only approximate | 
| Can be used to compare different ML models | |

# <a id='toc3_'></a>[Module 2️⃣](#toc0_)

## <a id='toc3_1_'></a>[Visualizing NN Predictions](#toc0_)

### <a id='toc3_1_1_'></a>[Feature Visualization](#toc0_)
> Process of making learned features in a NN explicit.
>
> Answers the question: *what does this neuron, channel or layer see?*

With feature visualization we are trying to maximize the activation of a neuron $h$. 

$$
    \text{img}^{*} = \arg\max_{\text{img}} \underbrace{h_{n,x,y,z}}_{\text{activation of a neuron}} (\overbrace{\text{img}}^{\text{input}})
$$
- $x,y=$ spatial position of neuron
- $n=$ layer
- $z=$ channel index

For the mean activation of an entire channel $z$ in layer $n$: 
$$
    \text{img}^{*} = \arg\max_{\text{img}} \sum_{x,y}h_{n,x,y,z} (\text{img})
$$

Note that minimizing an activation is the same as maximizing its negative: `minimize(h) = maximize(-h)`. 
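
The optimization is usually solved by gradient ascent on the input image. The sketch below shows the idea on a single toy neuron with a hand-written gradient; the neuron, the unit-norm constraint and the step sizes are assumptions, and a real network would rely on a deep learning framework's automatic differentiation instead.

```python
import numpy as np

rng = np.random.default_rng(0)
filt = rng.normal(size=(8, 8))
filt /= np.linalg.norm(filt)  # hypothetical weights of one toy neuron

def activation(img):
    return np.tanh(np.sum(filt * img))

def activation_grad(img):
    # d tanh(u) / d img = (1 - tanh(u)^2) * filt, with u = sum(filt * img)
    return (1.0 - activation(img) ** 2) * filt

def maximize_activation(n_steps=200, lr=0.1, seed=1):
    img = np.random.default_rng(seed).normal(scale=0.01, size=filt.shape)
    for _ in range(n_steps):
        img += lr * activation_grad(img)       # gradient ascent step
        img /= max(1.0, np.linalg.norm(img))   # keep the image inside the unit ball
    return img

best_img = maximize_activation()
cosine = np.sum(best_img * filt) / np.linalg.norm(best_img)
print(activation(best_img), cosine)  # the optimized image lines up with the filter
```

For this toy neuron the answer is known in advance (the image that matches the filter), which makes it a convenient check before applying the same loop to a real layer.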

#### <a id='toc3_1_1_1_'></a>[Pros & Cons](#toc0_)
| Pros | Cons |
|------|------|
| Unique insight into how NNs work | Many feature visualization images contain some features with no human interpretation | 
| Communicate in a non-technical way how NNs work | For large NNs, difficult to visualize complete network | 
| | Is not a complete picture of interactions | 

### <a id='toc3_1_2_'></a>[Feature Attribution](#toc0_)
> Indicate how much each feature in your model contributes to a prediction for an instance.

**Gradient-based Methods**
- Vanilla Gradient (Saliency Maps)
- DeconvNet
- Grad-CAM
- Guided Grad-CAM
- SmoothGrad

#### <a id='toc3_1_2_1_'></a>[Vanilla Gradient](#toc0_)
> $\textcolor{#ba4e00}{\textbf{Vanilla Gradient (Saliency Maps)}}$ provides pixel-level importance.
>
> Calculate the gradient of the loss function for the class of interest wrt the input pixels.

##### <a id='toc3_1_2_1_1_'></a>[Process](#toc0_)
1. Forward pass the image of interest.
2. Compute the gradient of the class score of interest wrt the input pixels. 
   - Set all other classes to zero.
3. Visualize the gradients.
   - Show absolute values or highlight negative/positive contributions.
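
The three steps can be sketched on a tiny network. This is an illustration under stated assumptions: the model is a hypothetical two-layer ReLU network with random weights, the "image" is a flattened 4x4 array, and the backward pass is written by hand in NumPy rather than with a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, n_classes = 16, 8, 3           # a flattened 4x4 "image"
W1, b1 = rng.normal(size=(hidden, d)), rng.normal(size=hidden)
W2, b2 = rng.normal(size=(n_classes, hidden)), rng.normal(size=n_classes)

def scores(x):
    """Raw class scores (pre-softmax) of the toy network."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def vanilla_gradient(x, target):
    # 1. Forward pass, remembering which ReLUs were active.
    active = (W1 @ x + b1) > 0
    # 2. Backward pass from a one-hot vector: all other classes are set to zero.
    grad_out = np.zeros(n_classes)
    grad_out[target] = 1.0
    grad_hidden = (W2.T @ grad_out) * active
    return W1.T @ grad_hidden

x = rng.uniform(size=d)
target = int(np.argmax(scores(x)))
# 3. Visualize: here the absolute gradient values, reshaped to the image grid.
saliency = np.abs(vanilla_gradient(x, target)).reshape(4, 4)
print(saliency.round(2))
```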

#### <a id='toc3_1_2_2_'></a>[Grad-CAM](#toc0_)
> $\textcolor{#ba4e00}{\textbf{Grad-CAM}}$ provides region-level importance.
>
> Grad-CAM stands for Gradient-weighted Class Activation Mapping.
>
> Analyzes which regions are activated in the feature maps of the last convolutional layers for a certain classification.

**Primary goal** of Grad-CAM is to understand which parts of an image a convolutional layer looks at for a certain classification.
- It does this by analyzing which regions are activated in the feature maps of the last convolutional layers.

##### <a id='toc3_1_2_2_1_'></a>[Process](#toc0_)
1. Forward propagate the input image through the CNN.
2. Obtain the raw score for the class of interest. 
   - This is the activation of the neuron before the softmax layer.
   - Set all other class activations to zero.
3. Backpropagate the gradient of the class of interest to the last convolutional layer before the fully connected layers. 
4. Weight each feature map "pixel" by the gradient for the class.
5. Calculate an average of the feature maps, weighted per pixel by the gradient, then apply ReLU to the averaged feature map.
6. To visualize, scale the values to the interval between 0 and 1, upscale the map, and overlay it on the original image.
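
A minimal NumPy sketch of the last steps, assuming the feature maps and the gradients of the class score with respect to them have already been extracted from a network (random stand-ins are used here). Note one assumption: the sketch pools the gradients into a single weight per channel, as in the original Grad-CAM formulation, which is slightly coarser than the per-pixel weighting described in the steps. The function names are hypothetical.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: arrays of shape (channels, height, width)."""
    # One weight per channel: average the gradients over spatial positions.
    alphas = gradients.mean(axis=(1, 2))
    # Weighted combination of the feature maps, then ReLU.
    cam = np.maximum(np.tensordot(alphas, feature_maps, axes=1), 0.0)
    # Scale to [0, 1] for visualization (guard against an all-zero map).
    return cam / cam.max() if cam.max() > 0 else cam

def upscale(cam, factor):
    """Nearest-neighbour upscaling to the input resolution."""
    return np.kron(cam, np.ones((factor, factor)))

rng = np.random.default_rng(0)
A = rng.uniform(size=(4, 7, 7))      # stand-in for last conv layer activations
dY_dA = rng.normal(size=(4, 7, 7))   # stand-in for d(class score)/d(activations)
heatmap = upscale(grad_cam(A, dY_dA), factor=32)   # 224 x 224 map to overlay
print(heatmap.shape, heatmap.min(), heatmap.max())
```

In practice the heatmap would be alpha-blended over the original image; smoother interpolation (for example bilinear) is a common alternative to the nearest-neighbour upscaling used here.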

#### <a id='toc3_1_2_3_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| Visuals provide easily understandable explanations | Difficult to evaluate explanation (how do we know it is correct?) |
| Faster computation than methods like LIME or SHAP | Results can be unreliable | 
| Many methods to choose from | | 

## <a id='toc3_2_'></a>[Explaining NNs](#toc0_)

### Network Dissection, 2017
> Links human concepts with individual NN units.

- **Key hypothesis**: do convolutional neural networks learn disentangled features?
- $\textcolor{#ba4e00}{\textbf{Disentangled features}}$ means that individual network units detect specific real-world concepts.

#### Implementation
1. Get images with human labeled visual concepts. 
   - These could be pixel-wise labeled images with concepts of different abstraction levels. 
2. Measure CNN channel activations for images.
3. Get the alignment of activations and labeled concepts.
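
One way to quantify the alignment in step 3 is an intersection-over-union (IoU) score between the pixels where a unit is strongly active and the pixels labeled with a concept. The sketch below is an illustration on synthetic data: the function name, the `quantile` threshold, and the toy activations are all assumptions, and the activation maps are assumed to be already upscaled to the resolution of the labels.

```python
import numpy as np

def unit_concept_iou(activations, concept_masks, quantile=0.995):
    """activations: (n_images, H, W) maps of one unit, at mask resolution.
    concept_masks: (n_images, H, W) boolean pixel-wise labels for one concept."""
    # Threshold so that only the unit's top activations count as "on".
    threshold = np.quantile(activations, quantile)
    unit_on = activations > threshold
    intersection = np.logical_and(unit_on, concept_masks).sum()
    union = np.logical_or(unit_on, concept_masks).sum()
    return intersection / union if union > 0 else 0.0

rng = np.random.default_rng(0)
masks = np.zeros((8, 32, 32), dtype=bool)
for i in range(8):
    masks[i, 4 * i:4 * i + 2, 4 * i:4 * i + 2] = True   # small labeled region per image

aligned_unit = rng.uniform(size=masks.shape) + 5.0 * masks   # fires on the concept
unrelated_unit = rng.uniform(size=masks.shape)               # ignores the concept

print("aligned unit IoU:  ", unit_concept_iou(aligned_unit, masks))
print("unrelated unit IoU:", unit_concept_iou(unrelated_unit, masks))
```

A unit would then be reported as a detector for the concept with which it has the highest IoU, provided that score clears some chosen cut-off.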

#### Pros & Cons 
| Pros | Cons |
|------|------|
| Expands upon insights from feature visualization | You need datasets that are labeled on the pixel level with the concepts (this takes a lot of effort to collect!) | 
| Communicate in a non-technical way how NNs work | Many units respond to the same concept and some to no concept at all | 
| Links units to concepts | Only aligns human concepts with positive activations (not with negative activations of channels) | 
| Detect concepts beyond the classes in the classification task | | 

### Concept Activation Vectors, 2018
> A numerical representation of a concept in the activation space of a NN layer.

$\textcolor{#ba4e00}{\textbf{TCAV}}$: For any given concept, TCAV measures the extent of that concept's influence on the model's prediction for a certain class.
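
A toy sketch of the two stages, under heavy assumptions: the layer activations are synthetic, the rest of the network is replaced by a hypothetical function `u . tanh(a)` with a hand-written gradient, and the linear classifier is a small logistic regression trained by gradient descent. The CAV is the unit normal of the classifier that separates concept activations from random activations, and the TCAV score is the fraction of class examples whose class logit increases when the activations move in the CAV direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
concept_dir = np.zeros(dim)
concept_dir[0] = 1.0   # hypothetical "true" concept direction

# Layer activations for concept examples vs. random examples (synthetic stand-ins).
concept_acts = rng.normal(size=(100, dim)) + 2.0 * concept_dir
random_acts = rng.normal(size=(100, dim))

def fit_cav(pos, neg, steps=500, lr=0.1):
    """Logistic regression by gradient descent; the CAV is the unit normal vector."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)

cav = fit_cav(concept_acts, random_acts)

# Toy remainder of the network: class logit as a function of the layer activations.
u = np.array([1.5, -0.5, 0.3, 0.0, 0.2])

def logit_grad(a):
    """Gradient of the toy class logit u . tanh(a) with respect to a."""
    return u * (1.0 - np.tanh(a) ** 2)

class_acts = rng.normal(size=(200, dim))     # activations of examples from the class
directional = np.array([logit_grad(a) @ cav for a in class_acts])
tcav_score = np.mean(directional > 0)        # fraction with positive sensitivity
print("TCAV score:", round(float(tcav_score), 2))
```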

#### Pros & Cons 
| Pros | Cons |
|------|------|
| Customizable via concept dataset curation | Performs poorly on shallow NNs (concepts in deeper layers are more separable) | 
| Provides global explanations | Requires additional annotations to the dataset (costly) | 
| | Mostly used in images only |

## <a id='toc3_3_'></a>[Explainable Attention](#toc0_)

### A Review of Attention
#### Review Embeddings
- We want to represent a word as a fixed length vector
  - word $\rightarrow [1,4,5,62,2,33]$

> $\textcolor{#ba4e00}{\textbf{Embeddings}}$ are a method of converting textual information into vectors of real numbers, capturing semantic and syntactic aspects of the data.
>
> - Acts as a compact representation of the original data, capturing its essential aspects. 

#### Self Attention

<img src="imgs/attention_1.png" alt="Attention-1" width="600" height="200">

> The goal of self attention is to improve the original embeddings (vector embeddings $v_1, v_2, v_3, v_4$) with context. 
- We would ideally like our output to be new representations that are better than the original representations

To get the new, better representations, we compute a score $s_{ij}$ for every pair of vectors by multiplying $v_i$ with $v_j$ (a dot product).
- We then normalize the scores $s_{ij}$, which yields weights $w_{ij}$. We do this s.t. for each $i$ the weights $w_{ij}$ sum to $1$.
  - weights $w_{ij}=\text{normalized scores}$

Then reweight all the vectors: the new representation of word $i$ is the weighted sum $\sum_j w_{ij} v_j$. 
<img src="imgs/attention_2.png" alt="Attention-2" width="600" height="200">

We do this for each word in our sequence. 
<img src="imgs/attention_3.png" alt="Attention-3" width="600" height="200">

<img src="imgs/attention_4.png" alt="Attention-4" width="600" height="200">

Until now, no weights are being trained. We need to **introduce trainable parameters**:

<img src="imgs/attention_key_query_values.png" alt="Attention-Key-Query-Values" width="600" height="200">

Now we have Key, Query and Value matrices. These are our trainable weights. The matrices have dimension $k \times k$.

<img src="imgs/attention_key_query_value_matrix.png" alt="Attention-Key-Query-Value-Matrix" width="600" height="200">

<img src="imgs/self_attention_block.png" alt="Self-Attention-Block" width="600" height="200">

> **TLDR**: $\textcolor{#ba4e00}{\textbf{Self Attention}}$ is the process of **adding more context**.
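
The whole pipeline can be sketched in a few lines of NumPy. This is an illustration, not a trained model: the embeddings and the Query, Key and Value matrices are random, softmax is assumed as the normalization, and the division by $\sqrt{k}$ is a common convention that the notes above do not mention.

```python
import numpy as np

def softmax(s):
    """Row-wise softmax; each row of the result sums to 1."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
t, k = 4, 6                        # 4 words, embedding dimension k
X = rng.normal(size=(t, k))        # rows play the role of v_1 ... v_4

# Parameter-free version: scores, normalized weights, reweighted vectors.
S = X @ X.T                        # s_ij = v_i . v_j
W = softmax(S)                     # w_ij, normalized over j
Y_plain = W @ X                    # new representation of each word

# Trainable version: k x k Query, Key and Value matrices (random here, learned in practice).
Wq, Wk, Wv = (rng.normal(size=(k, k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
W_qkv = softmax(Q @ K.T / np.sqrt(k))   # scaled dot-product scores, rows sum to 1
Y = W_qkv @ V                           # context-aware representations, shape (t, k)
print(Y.shape)
```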

But do we have enough attention? 

> $\textcolor{#ba4e00}{\textbf{Multi-Head Attention}}$: we parallelize attention mechanisms by having multiple heads $h$.
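
A hedged sketch of one common way to implement this: project once, split the feature axis into $h$ heads, run attention independently per head, then concatenate and mix the heads with an output matrix. The function name, the output matrix `Wo`, and the random weights are assumptions for illustration.

```python
import numpy as np

def softmax(s):
    """Softmax over the last axis."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    t, k = X.shape
    d = k // h                                   # per-head dimension

    def split(M):
        # (t, k) -> (h, t, d): one slice of the features per head.
        return M.reshape(t, h, d).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))   # (h, t, t)
    heads = weights @ V                                        # (h, t, d)
    concat = heads.transpose(1, 0, 2).reshape(t, k)            # back to (t, k)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
t, k, h = 5, 8, 2
X = rng.normal(size=(t, k))
Wq, Wk, Wv, Wo = (rng.normal(size=(k, k)) for _ in range(4))
out, weights = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape, weights.shape)
```

Each head produces its own `t x t` weight matrix, which is why attention visualizations are usually shown per head and per layer.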

<img src="imgs/multi_head_attention.png" alt="Multi-Head-Attention" width="600" height="200">

#### Self-Attention vs Cross Attention
- Self Attention operates within a single sequence. 
- $\textcolor{#ba4e00}{\textbf{Cross Attention}}$ is used between two different sequences. 
- How Cross Attention works: 
  - For each element in one sequence (query sequence), cross-attention computes attention scores based on its relationship with every element in the other sequence (key-value-sequence).
  - This mechanism enables the model to selectively focus on relevant parts of the other sequence when generating an output. 
- Cross-attention is critical for tasks that involve understanding how elements from different sources relate to one another.
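
A minimal sketch of the difference: the queries come from one sequence, while the keys and values come from the other. The function name, the random weights, and the $\sqrt{k}$ scaling are assumptions for illustration.

```python
import numpy as np

def softmax(s):
    """Row-wise softmax; each row of the result sums to 1."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(X_query, X_context, Wq, Wk, Wv):
    Q = X_query @ Wq            # queries come from the query sequence
    K = X_context @ Wk          # keys and values come from the other sequence
    V = X_context @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (len_query, len_context)
    return weights @ V, weights

rng = np.random.default_rng(0)
k = 6
X_query = rng.normal(size=(3, k))      # e.g. 3 target-side tokens
X_context = rng.normal(size=(5, k))    # e.g. 5 source-side tokens
Wq, Wk, Wv = (rng.normal(size=(k, k)) for _ in range(3))
out, weights = cross_attention(X_query, X_context, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

Passing the same array as both `X_query` and `X_context` recovers self-attention, which makes the relationship between the two mechanisms explicit.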

### Visualizing Attention

**BertViz**, 2019
- Visualizing attention weights illuminates one mechanism within the model's architecture but **does not necessarily provide a direct explanation for predictions**.

### Saliency Methods, 2020, as Alternatives 
> Input $\textcolor{#ba4e00}{\textbf{Saliency Methods}}$ reveal why one particular model prediction was made in terms of how relevant each input word was to that prediction.
> 
> - to understand how the input text influences output predictions more directly.

$\textcolor{#ba4e00}{\textbf{Integrated Gradients}}=$ The path integral of the gradients along the straight-line path from the baseline $x'$ to the input $x$.
- We consider the straight-line path from the baseline $x'$ to the input $x$ and compute the gradients at all points along the path. 
- Integrated gradients are obtained by accumulating these gradients.
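
Written out for feature $i$ of a model $F$, this is $IG_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha$. The sketch below approximates the integral with a Riemann sum. It is an illustration under stated assumptions: the model is a hypothetical logistic unit with hand-written gradient, the three input features stand in for token embedding dimensions, and the all-zeros baseline is one common choice among several.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # hypothetical weights of a toy differentiable model
b = -0.5

def F(x):
    """Model output: a single sigmoid unit."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def grad_F(x):
    """Analytic gradient of F with respect to the input."""
    p = F(x)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann sum over alpha in (0, 1) along the straight-line path.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.mean([grad_F(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
ig = integrated_gradients(x, baseline)
print("attributions:", ig.round(4))
print("sum of attributions:", ig.sum().round(4))
print("F(x) - F(baseline): ", (F(x) - F(baseline)).round(4))
```

The last two printed numbers should agree closely: the attributions sum to the difference between the model output at the input and at the baseline, which is a useful sanity check when choosing the number of steps.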



# <a id='toc4_'></a>[Module 3️⃣](#toc0_)

# <a id='toc5_'></a>[My Questions](#toc0_)

# <a id='toc6_'></a>[Resources](#toc0_)

- [CNN Visualization](https://adamharley.com/nn_vis/cnn/3d.html)
- [Google Feature Visualization Interactive Paper](https://distill.pub/2017/feature-visualization/)