# Interpretable Machine Learning 

---

# Module 1 
## Introduction to Interpretable ML
**Interpretability**
> An interpretable model provides both visibility into its mechanisms and insight into how it arrives at its predictions. It reveals which features are important, how they are related, or what rules/patterns have been learned.
>
> *Examples:* Inherently interpretable models - Decision Trees, Monotonic NNs

## Regression Models 
### Linear Regression
> The goal of Lin Reg is to create a linear model that minimizes the sum of squared residuals. 

$$
    SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$

**Ordinary Least Squares**
> The goal is to find the line (or, in higher dimensions, the hyperplane) that best fits the observed data points by minimizing the sum of squared residuals.

- Relies on several assumptions: 
  1. The relationship between the predictors and the outcome is linear.
  2. Observations are independent of one another.
  3. Residuals have constant variance across all levels of the predictors (homoscedasticity).
  4. Residuals are normally distributed.
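A minimal sketch of an OLS fit, assuming NumPy is available. The data is synthetic and the true coefficients (2, 3, -1) are made up for illustration; `np.linalg.lstsq` returns the coefficients that minimize the SSE defined above.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Prepend a column of ones so the first coefficient is the intercept beta_0
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution: the beta that minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predictions and the resulting sum of squared errors
y_hat = X_design @ beta_hat
sse = float(np.sum((y - y_hat) ** 2))

print("estimated coefficients:", beta_hat)
print("SSE:", sse)
```

Any other coefficient vector gives an SSE at least as large on the same data, which is what "best fit" means here.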

**Regression**
> A methodology used for modeling and analysis of numerical data.
>
> - relationships between 2+ variables are evaluated

<img src="imgs/regression_formula.png" alt="Regression formula" width="400">

$$
    y = \beta_0 + \beta_1 X_1 + \dots + \beta_j X_j + \epsilon
$$

**How to interpret the coefficients $\beta$?**
- $\pm$: 
  - indicates whether the associated feature has a positive or negative relationship with the target variable.
- magnitude: 
  - represents the strength of that relationship. 
  - Larger coefficients indicate a stronger influence of that feature on the target variable, provided the features are on comparable scales.

**Feature importance in Lin Reg:**
> = absolute value of the feature's t-statistic
>
> t-statistic = estimated weight scaled by its standard error.

$$
  t_{\hat{\beta_j}} = \frac{\hat{\beta_j}}{SE(\hat{\beta_j})}
$$
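The t-statistic can be computed by hand from an OLS fit. The sketch below assumes NumPy and synthetic data with one strong, one weaker, and one irrelevant feature (all made up for illustration). The standard errors come from the usual OLS estimate $\hat{\sigma}^2 (X^T X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))
# Assumed ground truth: strong effect, weaker effect, no effect
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Unbiased estimate of the residual variance
residuals = y - X_design @ beta_hat
dof = n - X_design.shape[1]
sigma2_hat = residuals @ residuals / dof

# Var(beta_hat) = sigma^2 (X^T X)^{-1}; standard errors are the sqrt of its diagonal
cov_beta = sigma2_hat * np.linalg.inv(X_design.T @ X_design)
se = np.sqrt(np.diag(cov_beta))

# t-statistic per coefficient; importance = |t|, skipping the intercept
t_stats = beta_hat / se
importance = np.abs(t_stats[1:])
print("feature importance (|t|):", importance)
```

Because the weight is divided by its standard error, a large but very uncertain weight does not automatically rank as important.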

- **Effect Plot**
  - plots the effects, where an effect is a feature's weight times that feature's value for an instance
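The effect computation can be sketched as follows, assuming NumPy. The feature names, weights, intercept, and data are hypothetical. An effect plot would then show the distribution of each column of `effects`, for example as one boxplot per feature.

```python
import numpy as np

# Hypothetical fitted weights (intercept kept separately) and feature values
feature_names = ["temperature", "humidity", "windspeed"]
weights = np.array([2.0, -0.5, 0.1])
intercept = 1.0
X = np.array([
    [10.0, 40.0, 5.0],
    [20.0, 60.0, 0.0],
    [30.0, 80.0, 15.0],
])

# effect_ij = weight_j * x_ij (broadcasting multiplies each column by its weight)
effects = X * weights

# The effects of one instance plus the intercept add up to its prediction
predictions = intercept + effects.sum(axis=1)

for name, column in zip(feature_names, effects.T):
    print(name, column)
```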

#### Pros & Cons

| Pros | Cons |
|------|------|
| How predictions are produced is transparent | Can only represent linear relationships |
| Lots of documentation, used widely across domains | Usually not as accurate because the real world is complex and nonlinear |
| Based on solid statistical theory | The interpretation of a weight is dependent on other features |

#### Assumptions
- Linearity (lin relationship between $X$ and $y$)
- Independence (observations are independent of one another)
- Homoscedasticity (variance of the residual errors is constant across all values of the independent variables)
- Normality (residual errors follow $\mathcal{N}$)
- No multicollinearity (independent variables should not be highly correlated with each other)
- No autocorrelation (the residual errors are not correlated with each other)
- No endogeneity (independent variables are not correlated with the error term)
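One common way to check the multicollinearity assumption is the variance inflation factor (VIF), which these notes do not otherwise cover. It regresses each feature on all the others and reports $1 / (1 - R^2)$. The sketch below assumes NumPy; the helper `vif` and the synthetic data are illustrative, not part of any library.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress feature j on an intercept plus all remaining features
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to x1 and x2
vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF per feature:", vifs)
```

A VIF near 1 means the feature is not explained by the others. Values above roughly 5 to 10 are often treated as a warning sign, though that threshold is a rule of thumb rather than a hard limit.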

### Logistic Regression
> wraps lin reg eq in a logistic fct. 
> 
> - squeezes the output of lin reg into $(0, 1)$.
> - = lin model for the log odds 

- **log-odds**
  - **odds** = ratio of the probability that an outcome occurs to the probability that it does not occur
    - e.g. $\mathbb{P}(y=1) / \mathbb{P}(y=0)$ for binary class $1$

$$
  \ln\left( \underbrace{\frac{\mathbb{P}(y=1)}{1 - \mathbb{P}(y=1)}}_{\text{Odds}} \right) = \underbrace{\log\left(\frac{\mathbb{P}(y=1)}{\mathbb{P}(y=0)}\right)}_{\text{LogOdds}} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
  \\
  \frac{\mathbb{P}(y=1)}{1 - \mathbb{P}(y=1)} = \text{Odds} = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)
$$
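Because the odds are the exponential of the weighted sum, increasing a feature by one unit multiplies the odds by $\exp(\beta_j)$. The sketch below illustrates this with hypothetical weights for a two-feature model; the numbers are not from a fitted model.

```python
import math

# Hypothetical weights of a two-feature logistic regression
beta0, beta1, beta2 = -1.0, 0.7, -0.2

def odds(x1: float, x2: float) -> float:
    """Odds of y = 1: exp of the linear predictor."""
    return math.exp(beta0 + beta1 * x1 + beta2 * x2)

def probability(x1: float, x2: float) -> float:
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    o = odds(x1, x2)
    return o / (1.0 + o)

base = odds(1.0, 3.0)
bumped = odds(2.0, 3.0)        # x1 increased by one unit, x2 unchanged
odds_ratio = bumped / base     # equals exp(beta1), regardless of x2

print("odds ratio:", odds_ratio, "exp(beta1):", math.exp(beta1))
print("P(y=1) before and after:", probability(1.0, 3.0), probability(2.0, 3.0))
```

The odds ratio is constant, but the change in probability depends on where the instance starts, which is why the weights are read on the odds scale.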

#### Assumptions
- Linearity 
- No multicollinearity 
- Independence of observations 
- No influential outliers 
- Absence of perfect separation
- Large sample size

#### Logistic Function 
> used to model the probability of a binary outcome in logistic regression. 
> 
> transforms the linear combination of the input features into a probability value between 0 and 1. 

$$
  \sigma(z) = \frac{1}{1 + e^{-z}}
$$
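A minimal sketch of the logistic function in plain Python. The two-branch form is a common numerical-stability choice, not something the formula requires: it avoids calling `math.exp` on a large positive argument, which would overflow.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form exp(z) / (1 + exp(z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, sigmoid(z))
```

The output is 0.5 at $z = 0$ and approaches 0 and 1 for very negative and very positive $z$.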

#### Logit Function 
> inverse of the logistic function. 
> 
> transforms the probability of the binary outcome back into the log odds, a linear scale

$$
logit(p) = log(\frac{p}{1-p})
$$
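A short round-trip check that the logit undoes the logistic function, in plain Python. The input values are arbitrary.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function (simple form, fine for moderate z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Log odds of a probability p; only defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

# logit(sigmoid(z)) recovers z
for z in (-3.0, 0.0, 1.5):
    print(z, logit(sigmoid(z)))

# Worked example: p = 0.8 gives odds 0.8 / 0.2 = 4 and log odds log(4)
print(logit(0.8))
```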

#### Log Odds
> logarithm of the odds of an event occurring.
> 
> The odds themselves are the ratio of the probability of the event occurring to the probability of the event not occurring. 
>
> - Odds > 1 → log odds are positive (the event is more likely than not)
> - Odds < 1 → log odds are negative (the event is less likely than not)

$$
  LogOdds = log(\frac{p}{1-p})
$$

#### Pros & Cons 

| Pros | Cons |
|------|------|
| How predictions are produced is transparent | Can only represent linear relationships |
| Lots of documentation, used widely across domains | Usually not as accurate because the real world is complex and nonlinear |
| Based on solid statistical theory | Interpretation is more difficult than for lin reg because the interpretation of the weights is multiplicative, not additive |
| Can give you probabilities in addition to classification | If a feature perfectly separates the two classes, the weight for that feature does not converge because the optimal weight would be infinite, so the model cannot be trained. |
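The perfect-separation problem from the table can be illustrated with a toy example, assuming NumPy. The data and the no-intercept, single-weight model are simplifications chosen for illustration. Because every point is on the correct side of $x = 0$, a larger weight always increases the log-likelihood, so no finite optimum exists.

```python
import numpy as np

# Toy data that is perfectly separated at x = 0 (an assumption for illustration)
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(w: float) -> float:
    """Bernoulli log-likelihood of the no-intercept model p = sigmoid(w * x)."""
    # Signed margin: positive when the model leans toward the correct class
    margin = np.where(y == 1, w * x, -w * x)
    # log(sigmoid(m)) = -log(1 + exp(-m)), computed stably via logaddexp
    return float(-np.sum(np.logaddexp(0.0, -margin)))

weights = [0.1, 1.0, 10.0, 100.0]
lls = [log_likelihood(w) for w in weights]
for w, ll in zip(weights, lls):
    print(f"w = {w:6.1f}  log-likelihood = {ll:.6g}")
```

The log-likelihood keeps climbing toward 0 as the weight grows, which is why an optimizer never settles on a finite weight. Adding a penalty on the weights (regularization) is one common way to get a finite solution.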



## Generalized (Linear) Model

> Idea: Keep the weighted sum of the features, but allow non-Gaussian outcome distributions and connect the expected mean of this distribution and the weighted sum through a possibly nonlinear function.

$$
  \overbrace{g}^{\text{link function}} ( \underbrace{\mathbb{E}_Y(y|x)}_{\text{probability distribution from the exponential family that defines} \; E_Y} ) = x^T\beta
$$

- Used when the target outcome does not follow a Gaussian distribution
- Logistic regression
  - is a GLM that assumes the Bernoulli distribution and uses the logit function as its link function
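To make the role of the link function concrete, the sketch below applies three inverse links to the same weighted sum $x^T\beta$. The coefficients and feature vector are hypothetical, and the distribution/link pairings shown are the usual canonical ones (identity for Gaussian, logit for Bernoulli, log for Poisson).

```python
import math

# Hypothetical coefficients and one feature vector (first entry is the intercept term)
beta = [0.5, 1.2, -0.8]
x = [1.0, 2.0, 1.0]

# Linear predictor x^T beta, shared by all three models
eta = sum(b * xi for b, xi in zip(beta, x))

# Inverse link functions g^{-1}: map the linear predictor to the expected outcome
inverse_links = {
    "identity (Gaussian)": lambda e: e,
    "logit (Bernoulli)": lambda e: 1.0 / (1.0 + math.exp(-e)),
    "log (Poisson)": lambda e: math.exp(e),
}

expected = {name: g_inv(eta) for name, g_inv in inverse_links.items()}
for name, value in expected.items():
    print(f"{name:22s} E[y|x] = {value:.4f}")
```

The weighted sum is identical in all three cases. Only the mapping to the expected outcome changes, which is also why the weights are harder to interpret once the link is not the identity.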

| Pros | Cons |
|------|------|
| How predictions are produced is transparent | Most modifications of the linear model make the model less interpretable | 
| Lots of documentation, used widely across domains | Any link function complicates interpretation|
| Based on solid statistical theory | | 
| Allows modeling of non-Gaussian outcomes | |

# Resources 

- [Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges](https://arxiv.org/pdf/2103.11251)