# <a id='toc1_'></a>[Interpretable Machine Learning](#toc0_)

---

**Table of contents**<a id='toc0_'></a>    
- [Interpretable Machine Learning](#toc1_)    
- [Module 1️⃣](#toc2_)    
  - [Introduction to Interpretable ML](#toc2_1_)    
  - [Regression Models](#toc2_2_)    
    - [Linear Regression](#toc2_2_1_)    
      - [Pros & Cons](#toc2_2_1_1_)    
      - [Assumptions](#toc2_2_1_2_)    
    - [Logistic Regression](#toc2_2_2_)    
      - [Assumptions](#toc2_2_2_1_)    
      - [Logistic Function](#toc2_2_2_2_)    
      - [Logit Function](#toc2_2_2_3_)    
      - [Log Odds](#toc2_2_2_4_)    
      - [Pros & Cons](#toc2_2_2_5_)    
  - [Generalized (Linear) Model](#toc2_3_)    
    - [Generalized Additive Models](#toc2_3_1_)    
- [Module 2️⃣](#toc3_)    
  - [Decision Trees](#toc3_1_)    
    - [CART](#toc3_1_1_)    
      - [Implementation](#toc3_1_1_1_)    
        - [Variance](#toc3_1_1_1_1_)    
        - [Gini Index](#toc3_1_1_1_2_)    
      - [Interpreting CART](#toc3_1_1_2_)    
    - [Sparse Decision Trees](#toc3_1_2_)    
  - [Decision Rules and RuleFit](#toc3_2_)    
  - [Neural Network Interpretability](#toc3_3_)    
- [Resources](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[Module 1️⃣](#toc0_)

**Learning Objectives**
- Describe interpretable machine learning and differentiate between interpretability and explainability.
- Explain and implement regression models in Python.
- Demonstrate knowledge of generalized models in Python.

## <a id='toc2_1_'></a>[Introduction to Interpretable ML](#toc0_)
**Interpretability**
> An interpretable model provides both visibility into its mechanisms and insight into how it arrives at its predictions. It shows which features are important, how they are related, and what rules or patterns the model has learned.
>
> *Examples:* Inherently interpretable models - Decision Trees, Monotonic NNs

## <a id='toc2_2_'></a>[Regression Models](#toc0_)
### <a id='toc2_2_1_'></a>[Linear Regression](#toc0_)
> The goal of Lin Reg is to create a linear model that minimizes the sum of squared residuals. 

$$
    SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$

**Ordinary Least Squares**
> The goal is to find the line (or, in higher dimensions, the hyperplane) that best fits the observed data points by minimizing the sum of squared residuals.

- Relies on several assumptions: 
  1. The relationship between the predictors and the outcome is linear.
  2. Observations are independent of one another.
  3. Residuals have constant variance across all levels of the predictors (homoscedasticity).
  4. Residuals are normally distributed.
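
As a minimal sketch of OLS for a single predictor, the closed-form estimates can be computed directly. The toy data and the function names `fit_simple_ols` and `sse` are assumptions for illustration, not part of the course material.

```python
# Minimal OLS sketch for one predictor (toy data, assumed for illustration).

def fit_simple_ols(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx               # slope
    beta0 = y_mean - beta1 * x_mean   # intercept
    return beta0, beta1


def sse(x, y, beta0, beta1):
    """Sum of squared residuals for the line beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_ols(x, y)
print("intercept:", beta0, "slope:", beta1)
print("SSE at the OLS fit:", sse(x, y, beta0, beta1))
print("SSE with a perturbed slope:", sse(x, y, beta0, beta1 + 0.1))
```

Any other line, such as the one with the perturbed slope, gives a larger SSE on the same data.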

**Regression**
> A methodology used for modeling and analysis of numerical data.
>
> - relationships between 2+ variables are evaluated

<img src="imgs/regression_formula.png" alt="Sources of Bias" width="400">

$$
    y = \beta_0 + \beta_1 X_1 + \dots + \beta_j X_j + \epsilon
$$

**How to interpret the coefficients $\beta$?**
- $\pm$: 
  - indicates whether the associated feature has a positive or negative relationship with the target variable.
- magnitude: 
  - represents the strength of that relationship. 
  - Larger coefficients indicate a stronger influence of that feature on the target variable.

**Feature importance in Lin Reg:**
> = absolute value of the feature's t-statistic
>
> t-statistic = estimated weight scaled with its standard error.

$$
  t_{\hat{\beta_j}} = \frac{\hat{\beta_j}}{SE(\hat{\beta_j})}
$$
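
The t-statistic can be sketched for the single-predictor case, where the standard error of the slope has a simple closed form. The toy data and the function name `slope_t_statistic` are assumptions for illustration.

```python
import math

def slope_t_statistic(x, y):
    """t-statistic of the slope in a one-predictor OLS fit (illustrative sketch)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_beta1 = math.sqrt(sigma2 / s_xx)
    return beta1 / se_beta1


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
t_value = slope_t_statistic(x, y)
print("t-statistic of the slope:", t_value)
print("feature importance (absolute t):", abs(t_value))
```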

- **Effect Plot**
  - The effect of a feature for one instance is the feature's weight times the instance's feature value; an effect plot shows the distribution of these effects per feature.
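
A small worked example of the effect calculation. The weights and feature values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical fitted weights and one instance (assumed values).
weights = {"temperature": 2.0, "humidity": -0.5}
instance = {"temperature": 10.0, "humidity": 40.0}

# Effect of each feature = weight * feature value of this instance.
effects = {name: weights[name] * instance[name] for name in weights}

for name, effect in effects.items():
    print(f"{name}: {effect:+.1f}")
```

Here both features contribute equally in magnitude but with opposite signs, even though their weights differ, because the effect also depends on the scale of the feature value.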

#### <a id='toc2_2_1_1_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| How predictions are produced is transparent | Can only represent linear relationships |
| Lots of documentation, used widely across domains | Usually not as accurate because the real world is complex and nonlinear |
| Based on solid statistical theory | The interpretation of a weight is dependent on other features |

#### <a id='toc2_2_1_2_'></a>[Assumptions](#toc0_)
- Linearity (lin relationship between $X$ and $y$)
- Independence (observations are independent of one another)
- Homoscedasticity (variance of the residual errors is constant across all values of the independent variables)
- Normality (residual errors follow $\mathcal{N}$)
- No multicollinearity (independent variables should not be highly correlated with each other)
- No autocorrelation (the residual errors are not correlated with each other)
- No endogeneity (independent variables are not correlated with the error term)

### <a id='toc2_2_2_'></a>[Logistic Regression](#toc0_)
> wraps lin reg eq in a logistic fct. 
> 
> - squeezes the outputs of lin reg into $(0, 1)$.
> - = lin model for the log odds 

- **log-odds**
  - **odds** = ratio of the probability of an outcome to the probability of its complement
    - e.g. $\mathbb{P}(y=1) / \mathbb{P}(y=0)$ for binary class $1$

$$
\begin{aligned}
  \underbrace{\log\left( \underbrace{\frac{\mathbb{P}(y=1)}{1 - \mathbb{P}(y=1)}}_{\text{Odds}} \right)}_{\text{Log Odds}} &= \log\left(\frac{\mathbb{P}(y=1)}{\mathbb{P}(y=0)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
  \\
  \frac{\mathbb{P}(y=1)}{1 - \mathbb{P}(y=1)} &= \text{Odds} = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)
\end{aligned}
$$
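
Because the odds are the exponential of the linear predictor, increasing one feature by one unit multiplies the odds by $\exp(\beta_j)$. The sketch below checks this numerically; the coefficients and the instance are hypothetical.

```python
import math

def odds(beta0, betas, x):
    """Odds of y=1 under a logistic regression model: exp(beta0 + sum(beta_j * x_j))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return math.exp(z)


# Hypothetical coefficients and instance (assumed values).
beta0, betas = -1.0, [0.7, -0.2]
x = [2.0, 1.0]
x_plus_one = [3.0, 1.0]  # first feature increased by one unit

odds_ratio = odds(beta0, betas, x_plus_one) / odds(beta0, betas, x)
print("odds ratio:", odds_ratio)
print("exp(beta_1):", math.exp(betas[0]))
```

The two printed values agree, which is why logistic regression weights are read multiplicatively on the odds scale.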

#### <a id='toc2_2_2_1_'></a>[Assumptions](#toc0_)
- Linearity 
- No multicollinearity 
- Independence of observations 
- No influential outliers 
- Absence of perfect separation
- Large sample size

#### <a id='toc2_2_2_2_'></a>[Logistic Function](#toc0_)
> used to model the probability of a binary outcome in logistic regression. 
> 
> transforms the linear combination of the input features into a probability value 0-1. 

$$
  \sigma(z) = \frac{1}{1 + e^{-z}}
$$

#### <a id='toc2_2_2_3_'></a>[Logit Function](#toc0_)
> inverse of the logistic function. 
> 
> transforms the probability of the binary outcome back into the log odds, a linear scale

$$
logit(p) = log(\frac{p}{1-p})
$$
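
A short sketch showing that the logistic and logit functions are inverses of each other. The function names `sigmoid` and `logit` are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Logit function: maps a probability in (0, 1) back to the log odds."""
    return math.log(p / (1.0 - p))


z = 1.5
p = sigmoid(z)
print("sigmoid(1.5):", p)
print("logit(sigmoid(1.5)):", logit(p))  # recovers z up to floating-point error
print("sigmoid(0):", sigmoid(0.0))       # log odds of 0 correspond to p = 0.5
```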

#### <a id='toc2_2_2_4_'></a>[Log Odds](#toc0_)
> logarithm of the odds of the probability of an event occurring.
> 
> The odds themselves are the ratio of the probability of the event occurring to the probability of the event not occurring. 
>
> - Odds > 1 (log odds > 0): the event is more likely than not
> - Odds < 1 (log odds < 0): the event is less likely than not

$$
  LogOdds = log(\frac{p}{1-p})
$$

#### <a id='toc2_2_2_5_'></a>[Pros & Cons](#toc0_)

| Pros | Cons |
|------|------|
| How predictions are produced is transparent | Can only represent linear relationships |
| Lots of documentation, used widely across domains | Usually not as accurate because the real world is complex and nonlinear |
| Based on solid statistical theory | Interpretation is more difficult than for lin reg because the interpretation of the weights is multiplicative and not additive | 
| Can give you probabilities in addition to classification | If a feature perfectly separates the two classes, the weight for that feature does not converge because the optimal weight would be infinite, so the model cannot be trained | 



## <a id='toc2_3_'></a>[Generalized (Linear) Model](#toc0_)

> Idea: Keep the weighted sum of the features, but allow non-Gaussian outcome distributions and connect the expected mean of this distribution and the weighted sum through a possibly nonlinear function.

$$
  \overbrace{g}^{\text{link function}} ( \underbrace{\mathbb{E}_Y(y|x)}_{\text{expected mean of a distribution from the exponential family}} ) = x^T\beta
$$

- Useful if the target outcome does not follow a Gaussian distribution
- Logistic regression
  - is a GLM that assumes the Bernoulli distribution and uses the logit function as its link function
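
A minimal sketch of how the link function connects the weighted sum to the expected mean: the prediction is $g^{-1}(x^T\beta)$. The identity/Gaussian and log/Poisson pairings are standard examples added here as assumptions; only the logit/Bernoulli pairing is named in these notes.

```python
import math

# Inverse link functions g^{-1}, mapping the weighted sum back to the mean.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                        # Gaussian outcome (plain lin reg)
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # Bernoulli outcome (logistic reg)
    "log": lambda eta: math.exp(eta),                   # Poisson outcome (count data)
}


def predict_mean(x, beta, link):
    """Expected outcome for instance x: g^{-1}(x^T beta)."""
    eta = sum(b * xi for b, xi in zip(beta, x))  # the weighted sum stays linear
    return INVERSE_LINKS[link](eta)


x = [1.0, 2.0]        # first entry acts as the intercept term
beta = [0.5, 0.25]    # hypothetical weights
for link in INVERSE_LINKS:
    print(link, predict_mean(x, beta, link))
```

The weighted sum is identical in all three cases; only the function applied to it changes.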

| Pros | Cons |
|------|------|
| How predictions are produced is transparent | Most modifications of the linear model make the model less interpretable | 
| Lots of documentation, used widely across domains | Any link function complicates interpretation|
| Based on solid statistical theory | | 
| Allows modeling of non-Gaussian outcomes | |

### <a id='toc2_3_1_'></a>[Generalized Additive Models](#toc0_)
> Used when the relationship between our features and the outcome is not linear.

*What if the relationship between our features and the outcome is not linear?* We could 
- transform feature
- categorize feature
- use GAMs

$$
  g(\mathbb{E}_Y(y|x)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)
$$

- We can learn the functions $f_j$ by using *Spline Functions*
> **Splines** are constructed from simpler basis functions, and used to approximate more complex functions
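
To make the idea of basis functions concrete, here is a sketch of a piecewise-linear spline built from hinge functions $\max(0, x - k)$, and of an additive prediction that sums one such function per feature. The knots and coefficients are hypothetical; in practice they would be learned from data, and GAM libraries typically use smoother bases.

```python
def hinge(x, knot):
    """Simple basis function: zero left of the knot, linear to the right."""
    return max(0.0, x - knot)


def linear_spline(x, knots, coefs):
    """f(x) = coefs[0] * x + sum_i coefs[i + 1] * hinge(x, knots[i])."""
    value = coefs[0] * x
    for knot, coef in zip(knots, coefs[1:]):
        value += coef * hinge(x, knot)
    return value


def gam_predict(instance, beta0, components):
    """Additive predictor: beta0 + f_1(x_1) + ... + f_p(x_p), on the link scale."""
    return beta0 + sum(
        linear_spline(xj, knots, coefs)
        for xj, (knots, coefs) in zip(instance, components)
    )


# Hypothetical knots and coefficients for two features.
components = [
    ([1.0, 2.0], [1.0, -1.0, 2.0]),  # f_1: slope 1, then 0, then 2
    ([0.0], [0.5, -0.5]),            # f_2: slope 0.5 below 0, flat above
]
print(linear_spline(3.0, *components[0]))
print(gam_predict([3.0, 4.0], beta0=1.0, components=components))
```

Each $f_j$ can still be plotted on its own, which is what keeps the additive model interpretable.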

| Pros | Cons |
|------|------|
| How predictions are produced is transparent | Most modifications of the linear model make the model less interpretable | 
| Lots of documentation, used widely across domains | | 
| Based on solid statistical theory | | 
| Allows nonlinear relationships to be modeled | |

# <a id='toc3_'></a>[Module 2️⃣](#toc0_)

**Learning Objectives**
- Explain and implement decision trees in Python.
- Demonstrate knowledge of decision rules in Python.
- Define and explain neural network interpretable model approaches, including prototype-based networks, monotonic networks, and Kolmogorov-Arnold networks.

## <a id='toc3_1_'></a>[Decision Trees](#toc0_)
> Tree-based models split the data multiple times based on certain cutoff values in the features.

### <a id='toc3_1_1_'></a>[CART](#toc0_)
> = Classification & Regression Trees

$$
    \hat{y} = \hat{f}(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\}
$$

- $\hat{y}=$ predicted value for a given instance $x$
- $x=$ instance
- $M=$ total number of leaf nodes (or subsets) in the model
- $R_m=$ subset (leaf node), each instance $x$ falls into one subset
- $c_m=$ constant value associated with the $m$-th leaf node $R_m$
- $I_{\{ x \in R_m \}}=$ is the function that returns $1$ if $x$ is in the subset $R_m$ and $0$ otherwise
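
The prediction formula can be read as a lookup: exactly one indicator is $1$, so the sum returns the constant of the leaf that contains $x$. The sketch below uses a single feature and hypothetical leaf regions and constants.

```python
# Each leaf: (region R_m as a half-open interval [low, high), constant c_m).
# Regions and constants are hypothetical.
leaves = [
    ((float("-inf"), 2.0), 10.0),
    ((2.0, 5.0), 20.0),
    ((5.0, float("inf")), 35.0),
]


def indicator(x, region):
    """I{x in R_m}: 1 if x falls into the region, else 0."""
    low, high = region
    return 1 if low <= x < high else 0


def predict(x):
    """y_hat = sum over all leaves of c_m * I{x in R_m}."""
    return sum(c_m * indicator(x, region) for region, c_m in leaves)


for x in (1.0, 2.0, 7.0):
    print(x, "->", predict(x))
```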

#### <a id='toc3_1_1_1_'></a>[Implementation](#toc0_)
The CART algorithm builds the tree as follows:
- It splits the data into different subsets based on feature values. 
- For each feature, CART determines the cutoff point that minimizes
  - the variance of the target variable for regression, or 
  - the Gini index of the target variable for classification. 
- At each node, the algorithm selects the best feature and the best cutoff value of that feature to split the data, 
  - based on a criterion like the 
    - Gini impurity for classification or the 
    - MSE for regression. 
- This splitting is repeated recursively in each resulting subset. 
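
A sketch of the split search for one numeric feature in the regression case: every candidate cutoff is scored by the size-weighted variance of the two child nodes, and the lowest score wins. The data and function names are assumptions, and real implementations usually place cutoffs at midpoints between neighboring values.

```python
def variance(values):
    """Population variance of the target values in a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (cutoff, weighted child variance) of the best split 'x < cutoff'."""
    n = len(x)
    best = None
    for cutoff in sorted(set(x))[1:]:  # skip the smallest value: empty left child
        left = [yi for xi, yi in zip(x, y) if xi < cutoff]
        right = [yi for xi, yi in zip(x, y) if xi >= cutoff]
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if best is None or score < best[1]:
            best = (cutoff, score)
    return best


x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
cutoff, score = best_split(x, y)
print("best cutoff:", cutoff, "weighted variance:", score)
print("variance before the split:", variance(y))
```

A full tree would apply `best_split` again to the left and right subsets, over all features, until a stopping rule is met.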

##### <a id='toc3_1_1_1_1_'></a>[Variance](#toc0_)
> measures how spread out the target values are around the mean in a node. 

##### <a id='toc3_1_1_1_2_'></a>[Gini Index](#toc0_)
> measures the impurity or class distribution in a node
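
Both node criteria are short formulas. The sketch below computes them on toy nodes; the function names are chosen here for illustration.

```python
from collections import Counter

def node_variance(values):
    """Mean squared deviation of the target values from the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def gini(labels):
    """Gini index: 1 - sum of squared class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(node_variance([1.0, 3.0]))      # spread-out regression node
print(gini(["a", "a", "b", "b"]))     # maximally mixed two-class node
print(gini(["a", "a", "a", "a"]))     # pure node
```

A pure node has a Gini index of zero, and a node whose target values are all identical has zero variance.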

#### <a id='toc3_1_1_2_'></a>[Interpreting CART](#toc0_)
> If your instance lies in the subset (leaf node), the predicted outcome is the mean value of $y$ of the instances in that node

- To examine feature importance, we look at all the splits for which the feature was used and measure how much each split reduced the variance or Gini index compared to the parent node.
- The sum of all importances is scaled to $100$, so each importance can be interpreted as a share of the overall model importance.

| Pros | Cons |
|------|------|
| Great for capturing interactions between features in the data | Trees don't deal well with linear relationships (they create step functions, which are inefficient) |
| A natural visualization | Unstable (Small changes in the training dataset can create a completely different tree. Why? Because each split depends on the parent split.) | 
| You can use features without any transformations | As trees get larger, they become harder to interpret | 

### <a id='toc3_1_2_'></a>[Sparse Decision Trees](#toc0_)

**GOSDT** = generalized and scalable optimal sparse decision trees
- Solves the problem of Decision Trees: 
  - DTs are created from the top down and pruned. This means that if you make a mistake at the top, that will propagate through your whole model.
- *Idea*: 
  - Minimize an objective that is a function of tree and data over all possible trees you can construct on this data. 
  - We want to minimize both the 
    - misclassification error and the 
    - number of leaves to encourage sparsity and thus interpretability.
- Can be summarized into *Branch and Bound* (= idea is reducing the search space of all DTs)
  - Analytical bounds help in efficiently pruning the search space
    1. **Hierarchical Objective Lower Bound** 
       - If $R_{bestSoFar} < b(tree_{fixed})$ then this tree and its children are all suboptimal 
    2. **Hierarchical Objective Lower Bound with Lookahead**
       - If $R_{bestSoFar} < b(tree_{fixed}) + C$ (our tree with at least one child) then this tree and its children are suboptimal
    3. **Leaf Bound**
       - Max # leaves of any optimal child tree $< \# leaves(tree) + (R_{bestSoFar} - b(tree))/C$
    4. **Incremental Progress Bound(s)**
       - Each split must provide a reduction in loss of at least $C$ 
- Scalable, due to its bit vector representation, which represents each sub-problem by its contents as bit vectors
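
The sketch below illustrates the trade-off in the objective and the first pruning test. It assumes the objective has the form misclassification rate plus $C$ times the number of leaves, which matches the description above; it is not code from the GOSDT library, and the function names are hypothetical.

```python
def objective(n_misclassified, n_samples, n_leaves, C):
    """Assumed form of the objective: misclassification rate + C * number of leaves."""
    return n_misclassified / n_samples + C * n_leaves


def can_prune(best_so_far, lower_bound):
    """Hierarchical objective lower bound: skip a partial tree (and its children)
    if even its lower bound is worse than the best objective found so far."""
    return best_so_far < lower_bound


C = 0.01  # per-leaf penalty (assumed value)
small_tree = objective(n_misclassified=10, n_samples=100, n_leaves=3, C=C)
large_tree = objective(n_misclassified=8, n_samples=100, n_leaves=8, C=C)
print("small tree:", small_tree)
print("large tree:", large_tree)

# If 'large_tree' were only a lower bound for a partial tree,
# that whole branch of the search space could be discarded.
print("prune branch:", can_prune(best_so_far=small_tree, lower_bound=large_tree))
```

The larger tree makes fewer errors but still scores worse, because each extra leaf has to pay for itself with a loss reduction of at least $C$.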

## <a id='toc3_2_'></a>[Decision Rules and RuleFit](#toc0_)

> = simple IF-THEN statements consisting of a condition and a prediction learned through an algorithm.

One important concept when implementing decision rules is the trade-off between coverage (support) and accuracy (confidence).
- **Coverage/Support**
  - Percentage of instances to which the condition of a rule applies.
- **Accuracy**
  - Measure of how accurate the rule is in predicting the correct class for the instances to which the condition of the rule applies. 
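
A small sketch of both measures for one rule on a toy dataset. The data, the rule, and the function name `rule_stats` are assumptions for illustration.

```python
# Toy dataset (assumed for illustration).
rows = [
    {"size": "big", "price": "high"},
    {"size": "big", "price": "high"},
    {"size": "big", "price": "low"},
    {"size": "small", "price": "low"},
    {"size": "small", "price": "low"},
]


def rule_stats(rows, condition, predicted, target):
    """Return (coverage, accuracy) of the rule 'IF condition THEN target = predicted'."""
    covered = [row for row in rows if condition(row)]
    coverage = len(covered) / len(rows)
    if not covered:
        return coverage, 0.0
    correct = sum(1 for row in covered if row[target] == predicted)
    return coverage, correct / len(covered)


# Rule: IF size = big THEN price = high
coverage, accuracy = rule_stats(
    rows, condition=lambda row: row["size"] == "big", predicted="high", target="price"
)
print("coverage:", coverage)
print("accuracy:", accuracy)
```

Adding more conditions to a rule typically raises its accuracy but lowers its coverage, which is the trade-off described above.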

Algorithms for learning rules: 
- **OneR**
  - From all the features, OneR selects the one that carries the most information about the outcome of interest and creates decision rules from this feature.
- **Sequential Covering**
  - A general procedure that iteratively learns rules and removes the data points that are covered by the new rule.
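
A minimal OneR sketch for categorical features. "Most information" is measured here by the lowest total training error, which is the usual OneR criterion; the toy data and function name are assumptions.

```python
from collections import Counter, defaultdict

def one_r(rows, features, target):
    """Return (best feature, its rules, its total error) using the OneR idea."""
    best = None
    for feature in features:
        # Count target classes for each value of this feature.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feature]][row[target]] += 1
        # One rule per feature value: predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(
            sum(c.values()) - c.most_common(1)[0][1] for c in counts.values()
        )
        if best is None or errors < best[2]:
            best = (feature, rules, errors)
    return best


rows = [
    {"outlook": "sunny", "windy": "false", "play": "yes"},
    {"outlook": "sunny", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "no"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "no"},
]

feature, rules, errors = one_r(rows, ["outlook", "windy"], "play")
print("chosen feature:", feature)
print("rules:", rules)
print("training errors:", errors)
```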

| Pros | Cons | 
|------|------|
| Very easy to interpret | Mostly used only for classification |
| More compact than decision trees | Features need to be categorical | 
| Prediction is fast | Some algorithms are prone to overfitting | 
| Usually automatically generate sparse models | Don't deal well with linear relationships (they create step functions, which are inefficient) | 

- **RuleFit**
  - Combines decision trees with linear models by generating rules from an ensemble of decision trees and then fitting a sparse linear model on those rules
  - Learns sparse linear model with the original features AND also a number of new features that are decision rules.
  - [Steps](https://www.coursera.org/learn/interpretable-machine-learning/lecture/53KLa/rulefit)
    1. Generate Rules
    2. Create Sparse Linear Model

| Pros | Cons | 
|------|------|
| Easy to interpret | Interpretability worsens with increasing number of features in the model. |
| Adds feature interactions to linear models. | Interpretation is tricky for overlapping rules. | 
| Works for both classification and regression |  | 

## <a id='toc3_3_'></a>[Neural Network Interpretability](#toc0_)

<img src="imgs/types_of_interpretable_nns.png" alt="Sources of Bias" width="400">

 Improved interpretability in neural networks comes in the form of 
 - **shallow neural networks**,
   - simpler neural network architectures with fewer layers and nodes that can be more interpretable than deeper and more complex models
   - The relationships between inputs and outputs are more readily apparent
 - **sparse neural networks**, and 
   - models that have many network connections pruned or set to zero, and they can be more interpretable as the remaining connections represent the most important features
 - **modular neural networks**.
   - models composed of specialized subcomponents, for example, object detection or classification or reasoning. These can be more interpretable than monolithic architectures.

But there are also inherently interpretable models:
- **disentangled neural networks**, 
  - models that learn representations where each neuron or feature map corresponds to a specific interpretable concept, like edges, textures, or object parts
  - this can provide visibility into the model's internal reasoning
- **prototype based neural networks**, and 
  - models that learn prototypical examples of each class and use similarity to these prototypes as the basis for predictions, 
  - this can be more interpretable than complex decision boundaries
- **monotonic neural networks**
  - models that constrain the neural network to produce outputs that are monotonically increasing or decreasing with respect to the input features
  - this can make the model's behavior more intuitive and predictable
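
One common way to obtain a monotonic network is to force every weight to be positive and use increasing activation functions. The sketch below does this with `exp` on unconstrained raw weights. It is an illustration of the constraint under these assumptions, not necessarily the approach of any specific architecture, and all parameter values are hypothetical.

```python
import math

def monotone_net(x, raw_w1, b1, raw_w2, b2):
    """Tiny one-hidden-layer network that is increasing in every input.

    exp() turns each raw weight into a positive weight, and tanh is increasing,
    so raising any input can never lower the output.
    """
    hidden = [
        math.tanh(sum(math.exp(w) * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(raw_w1, b1)
    ]
    return sum(math.exp(w) * h for w, h in zip(raw_w2, hidden)) + b2


# Hypothetical parameters; raw weights may be negative, effective weights are not.
raw_w1 = [[0.1, -0.3], [-1.0, 0.5]]
b1 = [0.0, -0.5]
raw_w2 = [0.2, -0.7]
b2 = 0.1

# Vary the first feature while holding the second one fixed.
outputs = [monotone_net([x1, 0.3], raw_w1, b1, raw_w2, b2) for x1 in (-2.0, -1.0, 0.0, 1.0, 2.0)]
print(outputs)
```

The printed outputs increase from left to right, whatever values the raw weights take.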

### ProtoPNet
> = Prototypical Part Network, a prototype-based neural network

- Prototype-based explanations aim to explain the predictions of a black box model by identifying representative examples or prototypes from the data. 
- The main thesis is that we represent the model's knowledge in terms of prototypical instances or patterns.


### MonoNet
> = Monotonic Neural Networks

# <a id='toc4_'></a>[Resources](#toc0_)

- [Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges](https://arxiv.org/pdf/2103.11251)