**Table of contents**<a id='toc0_'></a>    
- [Module 1](#toc1_)    
- [Module 2](#toc2_)    
  - [Challenges & Trade-Offs in Developing AI Systems](#toc2_1_)    
  - [Overview of XAI Techniques and Approaches](#toc2_2_)    
    - [Explanation Techniques](#toc2_2_1_)    
      - [Local Explanations](#toc2_2_1_1_)    
        - [LIME - Local interpretable model agnostic explanations](#toc2_2_1_1_1_)    
        - [Anchors](#toc2_2_1_1_2_)    
        - [SHAP - SHapley Additive exPlanations](#toc2_2_1_1_3_)    
        - [ICE - Individual Conditional Expectation](#toc2_2_1_1_4_)    
      - [Global Explanations](#toc2_2_1_2_)    
        - [Functional decomposition](#toc2_2_1_2_1_)    
        - [Feature Interaction](#toc2_2_1_2_2_)    
      - [Example-Based Explanations](#toc2_2_1_3_)    
        - [Prototype-based Explanations](#toc2_2_1_3_1_)    
        - [Counterfactual Explanations](#toc2_2_1_3_2_)    
    - [Deep Learning Network Explanations](#toc2_2_2_)    
      - [Feature Visualization](#toc2_2_2_1_)    
      - [Feature Attribution](#toc2_2_2_2_)    
      - [Network Dissection](#toc2_2_2_3_)    
      - [Concept Activation Vectors](#toc2_2_2_4_)    
  - [XAI in GenAI](#toc2_3_)    
  - [Resources](#toc2_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Module 1](#toc0_)
- **Transparency**: Provides documentation for the system's decisions and actions (details about model architecture, training data, optimizations, etc.) to ensure compliance and accountability.
  - *Examples*: Model Cards, Datasheets for Datasets, Fairness Indicators, Algorithmic Impact Assessments
- **Interpretability**: An interpretable model provides both visibility into its mechanisms and insight into how it arrives at its predictions. Provides insights into what features are important, how they are related, or what rules/patterns are learned.
  - *Examples*: Inherently Interpretable Model - Decision Trees, Monotonic NNs
- **Explainability**: Aims to make any AI system, including opaque DL models, more explainable. Involves developing techniques to explain the outputs/decisions of black-box AI models (usually) after they are trained.
  - *Examples*: Post-hoc Explanations - SHAP, Saliency Maps, Concept Activation Vectors

**Why is explainability important for AI systems?**
1. Trust & accountability 
2. Identifying biases
3. Human-AI collaboration
4. Scientific advancement 

**Algorithmic Bias**
- **Training data bias**
  - *Example*: If the data used to train an AI model is biased or lacks representation from certain groups, the model may learn and perpetuate those biases. For instance, a facial recognition system trained primarily on images of people from one ethnicity may perform poorly at recognizing faces from other ethnic groups.
- **Sample bias**
  - The way data is collected, sampled, or labeled can introduce biases
  - *Example*: If a dataset for resume screening is predominantly composed of resumes from a particular geographic region or industry, the resulting model may be biased against candidates from other regions or industries.
- **Proxy variable bias**
  - Sometimes algorithms may rely on proxy variables that are correlated with protected characteristics like race, gender, or age, leading to indirect discrimination against those groups.
- **Bias arising from human-AI interaction**:   
  - Biased language or behavior exhibited by users can influence the system's outputs over time.
- **Bias resulting from algorithms themselves (i.e. architecture, optimization criteria)**

> XAI can detect these biases and discrimination in AI systems.

<img src="imgs/sources_of_bias.png" alt="Sources of Bias" width="400">


# <a id='toc2_'></a>[Module 2](#toc0_)

## <a id='toc2_1_'></a>[Challenges & Trade-Offs in Developing AI Systems](#toc0_)

> For high-stakes decision-making, prefer inherently interpretable models over trying to explain black-box models using XAI.

- The word *“explanation”* in XAI refers to an understanding of how a model works, as opposed to an explanation of how the world works

## <a id='toc2_2_'></a>[Overview of XAI Techniques and Approaches](#toc0_)

- **Interpretable Machine Learning**: defined as developing machine learning models that are inherently understandable and self-explanatory.
- **Occam's Razor**: If you have two competing ideas to explain the same phenomenon, you should prefer the simpler one.
- Linear models are highly interpretable as their predictions are based on a weighted sum of input features.
  - The coefficients directly indicate the importance and directionality of each feature.
- Enforcing sparsity, for example, via lasso or elastic net regularization, can enhance interpretability by driving coefficients of irrelevant features to zero, effectively performing feature selection.
- Generalized models try to solve some of the problems with linear regression. For example, generalized additive models combine linear models with nonlinear feature transformations, while maintaining an additive and interpretable structure, showing how each feature shape impacts predictions.
- Regression and generalized models give us interpretable coefficients. 
- The sign (positive or negative) of a coefficient indicates whether the associated feature has a positive or negative relationship with the target variable. 
- The magnitude of a coefficient represents the strength of that relationship. 
- Larger coefficients indicate a stronger influence of that feature on the target variable.
- Decision trees like CART and GOSDT are also intrinsically interpretable models. Their hierarchical tree structure of if-then-else decision rules provides a natural way to trace how predictions are made for different instances based on their feature values. Rule-based models encode knowledge in the form of human-readable rules.
- Tree-based models split the data multiple times based on certain cutoff values in the features. This splitting creates different subsets of the data set, with each instance belonging to exactly one subset. The terminal nodes, also known as leaf nodes, are the final subsets. To predict the outcome in each leaf node, the average outcome of the training data in that node is used. 
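
The splitting and leaf-averaging described above can be sketched with a single-split regression tree (a "stump"). This is a minimal illustration, not CART or GOSDT: the data, the function names `fit_stump` and `predict_stump`, and the squared-error criterion are assumptions made for the example.

```python
from statistics import mean

def fit_stump(xs, ys):
    """Find the cutoff on one feature that minimizes the summed squared error."""
    best = None
    # Candidate cutoffs: every observed value except the smallest,
    # so that both sides of the split are non-empty.
    for cutoff in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < cutoff]
        right = [y for x, y in zip(xs, ys) if x >= cutoff]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, cutoff, mean(left), mean(right))
    _, cutoff, left_mean, right_mean = best
    return cutoff, left_mean, right_mean

def predict_stump(stump, x):
    """The prediction is the average training outcome of the leaf x falls into."""
    cutoff, left_mean, right_mean = stump
    return left_mean if x < cutoff else right_mean

# Hypothetical toy data: house size -> price (arbitrary units).
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 120, 300, 310, 320]

stump = fit_stump(sizes, prices)
print(stump)                     # (cutoff, left-leaf mean, right-leaf mean)
print(predict_stump(stump, 65))  # falls into the left leaf
```

The learned rule reads directly as "if size < cutoff, predict the left-leaf mean, otherwise predict the right-leaf mean", which is what makes the model traceable.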

There are also interpretable Neural Networks:
1. **Disentangled Neural Networks**
> models that learn representations where each neuron or feature map corresponds to a specific interpretable concept, for example edges, textures, or object parts. This can provide visibility into the model's internal reasoning.
2. **Prototype-based Networks** (*ProtoPNet*)
> models that learn prototypical examples of each class and use similarity to these prototypes as the basis for predictions, which can be more interpretable than complex decision boundaries.
   - integrate prototypes or examples into the neural network
3. **Monotonic Neural Networks** (*MonoNet*)
> models that constrain the neural network to produce outputs that are monotonically increasing or decreasing with respect to the input features. This can make the model's behavior more intuitive and predictable.
   - ensure model predictions vary in a consistent direction as features change, aligning with human intuitions.
4. **Representation Networks** (*Kolmogorov-Arnold Network*, KAN)
> have no linear weights at all. Every weight parameter is replaced by a univariate function parameterized as a spline. KANs can be intuitively visualized and can easily interact with human users.
   - introduce alternatives to weights using spline representations.
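
To make the monotonicity constraint concrete, here is a minimal sketch of one common way to enforce it: keep every weight non-negative (here by squaring raw parameters) and use a monotone activation. This is an illustrative construction under those assumptions, not the MonoNet architecture itself, and the parameter values are hypothetical and untrained.

```python
import math

def monotonic_net(x, raw_weights, biases, raw_out_weights):
    """One hidden layer with a single input feature.

    Squaring the raw parameters makes every effective weight non-negative.
    Combined with the increasing tanh activation, the output can never
    decrease when x increases.
    """
    hidden = [math.tanh((w ** 2) * x + b) for w, b in zip(raw_weights, biases)]
    return sum((v ** 2) * h for v, h in zip(raw_out_weights, hidden))

# Hypothetical parameters; their signs are arbitrary on purpose.
raw_weights = [0.8, -1.5, 0.3]
biases = [0.1, -0.4, 2.0]
raw_out_weights = [-1.0, 0.5, 2.0]

xs = [i / 10 for i in range(-30, 31)]
outputs = [monotonic_net(x, raw_weights, biases, raw_out_weights) for x in xs]
is_monotone = all(a <= b for a, b in zip(outputs, outputs[1:]))
print(is_monotone)
```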

<img src="imgs/shallow_sparse_modular_nns.png" alt="Types of NNs 1" width="400">

<img src="imgs/disentangled_prototypical_monotonic_nns.png" alt="Types of NNs 2" width="400">

**Mechanistic Interpretability (MI):**
> process of reverse engineering neural networks from learned weights down to human interpretable algorithms.

Based on three speculative claims about neural networks.
1. **Features** are the fundamental unit of neural networks. They correspond to directions or a linear combination of neurons in a layer.
2. **Circuits**: Features are connected by weights forming circuits. A circuit is a computational subgraph of a network. A circuit consists of a set of features and the weighted edges that go between them.
3. **Universality**: posits that analogous features and circuits form across models and tasks.

**Superposition**
> According to the superposition hypothesis, neural networks, as we observe them, are simulations of larger networks, where each neuron is a disentangled feature.   

### <a id='toc2_2_1_'></a>[Explanation Techniques](#toc0_)

There are explainable machine-learning approaches for nearly all models and domains. 
- **Local explanation methods** like LIME and SHAP focus on explaining individual predictions. 
   - They approximate the original model's behavior locally around the instance of interest using an interpretable model like linear regression. 
- In contrast to local explanations, **global explanations** aim to capture overall model behavior. 
  - Methods like functional decomposition, feature interaction, and partial dependence plots are considered global explanations. 
- **Example based explanations** operate by providing representative examples to support a model's prediction. 
  - Key methods include finding the most influential training instances that significantly sway a particular prediction if removed.
- **NN Explanations**: Neural networks have their own challenges regarding explainability and thus approaches tailored for specific neural networks. 

#### <a id='toc2_2_1_1_'></a>[Local Explanations](#toc0_)
##### <a id='toc2_2_1_1_1_'></a>[LIME - Local interpretable model agnostic explanations](#toc0_)

- use interpretable models to explain individual predictions of a black box machine learning model.
- Using LIME for images is really interesting. Image variations are created by segmenting the image into superpixels or interconnected pixels with similar colors, and turning them on or off by replacing each pixel with a user-defined color like gray. The user is also able to specify a probability for turning off a superpixel at each permutation.
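
For tabular data, the core idea (perturb around one instance, weight perturbations by proximity, fit a linear surrogate) can be sketched as follows. This is a simplified illustration, not the `lime` library: the black-box function, the exponential kernel, and the deterministic grid of perturbations (real LIME samples randomly) are all assumptions chosen to keep the example reproducible.

```python
import math

def black_box(x0, x1):
    """Hypothetical opaque model: nonlinear in x0, linear in x1."""
    return x0 ** 2 + 3.0 * x1

def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def lime_explain(instance, offsets, kernel_width=0.5):
    """Fit a proximity-weighted linear surrogate around `instance`."""
    rows, targets, weights = [], [], []
    for d0 in offsets:
        for d1 in offsets:
            rows.append([1.0, d0, d1])  # intercept + offsets from the instance
            targets.append(black_box(instance[0] + d0, instance[1] + d1))
            # Closer perturbations get a larger weight.
            weights.append(math.exp(-(d0 ** 2 + d1 ** 2) / kernel_width ** 2))
    # Weighted least squares via the normal equations (X^T W X) w = X^T W y.
    k = 3
    xtwx = [[sum(wt * r[i] * r[j] for r, wt in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(wt * r[i] * t for r, t, wt in zip(rows, targets, weights))
            for i in range(k)]
    return solve(xtwx, xtwy)

intercept, slope_x0, slope_x1 = lime_explain((1.0, 0.0), [-0.2, -0.1, 0.0, 0.1, 0.2])
print(slope_x0, slope_x1)  # local effect of each feature near the instance
```

Around the instance `(1.0, 0.0)` the surrogate's slopes approximate the local behavior of the black box, even though the model is not linear globally.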

##### <a id='toc2_2_1_1_2_'></a>[Anchors](#toc0_)

> Anchors has a lot of similarities to the LIME approach. Again, we are explaining individual predictions, but this time, instead of using a linear model to approximate the local decision boundary as with LIME, we find a decision rule that sufficiently anchors the prediction.
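
A rule "anchors" a prediction when, for perturbed instances that satisfy the rule, the model almost always returns the same prediction. The sketch below only evaluates candidate rules by precision and coverage; it does not implement the search procedure of the Anchors algorithm. The loan model, the rules, and the grid of perturbations are hypothetical.

```python
from itertools import product

def model(age, income):
    """Hypothetical black box: approve (1) or reject (0) a loan."""
    return 1 if income >= 50 and age >= 25 else 0

def rule_precision_and_coverage(rule, instance, samples):
    """Precision: share of rule-satisfying samples predicted like `instance`.
    Coverage: share of all samples that satisfy the rule."""
    target = model(*instance)
    covered = [s for s in samples if rule(*s)]
    if not covered:
        return 0.0, 0.0
    agree = sum(1 for s in covered if model(*s) == target)
    return agree / len(covered), len(covered) / len(samples)

def weak_rule(age, income):
    return income >= 60

def anchor_rule(age, income):
    return income >= 60 and age >= 30

# Perturbation space: a small grid instead of random sampling, for reproducibility.
samples = list(product(range(20, 70, 10), range(20, 100, 20)))  # (age, income)
instance = (40, 80)

weak = rule_precision_and_coverage(weak_rule, instance, samples)
anchor = rule_precision_and_coverage(anchor_rule, instance, samples)
print(weak, anchor)
```

The stricter rule trades some coverage for higher precision, which is the typical tension when choosing an anchor.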

##### <a id='toc2_2_1_1_3_'></a>[SHAP - SHapley Additive exPlanations](#toc0_)

> The SHAP method proposes to approximate Shapley values instead of outright calculating them
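
To see what is being approximated, the sketch below computes exact Shapley values for a tiny model by averaging each feature's marginal contribution over all feature orderings. This brute-force approach is only feasible for a handful of features, which is why SHAP relies on approximations. The model, the input, and the all-zeros baseline used for "absent" features are assumptions for illustration.

```python
from itertools import permutations

def model(x):
    """Hypothetical model with an interaction between features 0 and 1."""
    return 2 * x[0] + 3 * x[1] + x[0] * x[1]

def shapley_values(f, x, baseline):
    """Average marginal contribution of each feature over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start with every feature "absent"
        previous = f(current)
        for i in order:
            current[i] = x[i]     # switch feature i to its actual value
            value = f(current)
            phi[i] += value - previous
            previous = value
    return [p / len(orderings) for p in phi]

x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)
print(sum(phi), model(x) - model(baseline))  # additivity: the two values match
```

The interaction term is split evenly between the two features involved, and the unused third feature receives zero.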

##### <a id='toc2_2_1_1_4_'></a>[ICE - Individual Conditional Expectation](#toc0_)

> plot one line per instance that displays how the instance's prediction changes when a particular feature changes. By visualizing all of our instances, we can improve local explainability, while also gaining better global understanding.
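
A minimal sketch of how ICE curves are computed, before any plotting: for each instance, sweep one feature over a grid while holding the others fixed. The model and the data are hypothetical.

```python
def model(age, income):
    """Hypothetical model: the effect of income depends on age (an interaction)."""
    return 0.5 * income if age < 40 else 0.25 * income + 20

def ice_curves(instances, grid):
    """One curve per instance: vary `income` over `grid`, keep `age` fixed."""
    return [[model(age, value) for value in grid] for age, _income in instances]

instances = [(25, 30), (35, 60), (50, 45), (60, 80)]  # (age, income)
grid = [0, 50, 100]

curves = ice_curves(instances, grid)
for (age, _), curve in zip(instances, curves):
    print(age, curve)
```

Curves with different slopes, as for the younger and older instances here, hint at an interaction that a single averaged curve would hide.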

#### <a id='toc2_2_1_2_'></a>[Global Explanations](#toc0_)

> aim to capture overall model behavior. 

- include functional decomposition, feature interaction, permutation feature importance, and visualizations, including partial dependence plots and accumulated local effects or ALE plots. 

##### <a id='toc2_2_1_2_1_'></a>[Functional decomposition](#toc0_)

<img src="imgs/feature_decomposition.png" alt="Functional Decomposition" width="400">


> divides complex models into simpler constituent parts. Each part or function can then be analyzed separately, making it easier to understand the overall behavior of the model. It breaks a model into three kinds of components:
> - **Main effects**: how each feature affects the prediction, independent of the values of the other features.
> - **Interaction effects**: the joint effect of the features.
> - **Intercept**: what the prediction is when all feature effects are set to zero.
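
As a small worked example, a two-feature model can be decomposed by averaging over a grid. The model, the grid, and the assumption that both features are independent and uniform on that grid are illustrative choices, not part of any specific library.

```python
def model(x1, x2):
    """Hypothetical model with two main effects and one interaction."""
    return 1.0 + x1 + 2.0 * x2 + x1 * x2

grid = [-1.0, 0.0, 1.0]  # assumed: both features are uniform on this grid

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Intercept: the average prediction over the whole grid.
intercept = mean(model(a, b) for a in grid for b in grid)
# Main effects: average over the other feature, minus the intercept.
main_x1 = {a: mean(model(a, b) for b in grid) - intercept for a in grid}
main_x2 = {b: mean(model(a, b) for a in grid) - intercept for b in grid}
# Interaction: whatever the intercept and main effects do not explain.
interaction = {
    (a, b): model(a, b) - intercept - main_x1[a] - main_x2[b]
    for a in grid for b in grid
}

print(intercept, main_x1, main_x2)
print(interaction[(1.0, 1.0)])
```

By construction, adding the intercept, both main effects, and the interaction term recovers the original prediction at every grid point.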

##### <a id='toc2_2_1_2_2_'></a>[Feature Interaction](#toc0_)

<img src="imgs/feature_interaction.png" alt="Feature Interaction" width="400">

> When we consider decomposition, we first decompose a model prediction into a constant term, a term for each feature, and a term for the interaction between features.

- We use the H-statistic for feature interaction. A partial dependence plot, PDP or PD, shows the marginal effect one or two features have on the predicted outcome of a model. 
- If this looks familiar, it should. It is the average of the lines of an ICE plot. Just like with ICE plots, the PDP can show the relationship between a feature and the target. PDPs do not have clear interpretations when features are correlated. In real life, features are usually correlated, or relationships between features are not well understood. 
- This is why accumulated local effects, or ALE, plots were introduced in 2020.
  - ALE plots include local effects and accumulation. ALE plots focus on local changes in the prediction when a feature value changes, unlike PDPs, which average out the effects over the entire feature space.
  - Instead of plotting the local effects directly, ALE plots accumulate these effects across the range of a feature. This helps in understanding the global trend of how the feature influences predictions.
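
The difference between the two plots can be sketched on toy data where the two features are perfectly correlated. This is a simplified illustration under stated assumptions: the model and data are hypothetical, the bin edges are chosen by hand, and the ALE values are left uncentered (ALE plots are usually centered so that the average effect is zero).

```python
def model(x0, x1):
    """Hypothetical model in which x0 and x1 interact."""
    return x0 * x1

# Toy data in which x1 is perfectly correlated with x0 (x1 == x0).
data = [(float(v), float(v)) for v in range(0, 5)]

def pdp(feature_values):
    """Partial dependence of x0: average prediction over ALL observed x1 values,
    including combinations that never occur in the data."""
    return [sum(model(v, x1) for _, x1 in data) / len(data) for v in feature_values]

def ale(edges):
    """Accumulated local effects of x0 over the bins defined by `edges`."""
    effects, total = [0.0], 0.0
    for lo, hi in zip(edges, edges[1:]):
        # Use only the instances that actually fall inside this bin
        # (the last bin also includes its upper edge).
        in_bin = [(x0, x1) for x0, x1 in data
                  if lo <= x0 < hi or (hi == edges[-1] and x0 == hi)]
        if in_bin:
            # Local effect: prediction change when x0 moves across the bin.
            local = sum(model(hi, x1) - model(lo, x1) for _, x1 in in_bin) / len(in_bin)
            total += local
        effects.append(total)  # accumulate across bins
    return effects

edges = [0.0, 2.0, 4.0]
pdp_values = pdp(edges)
ale_values = ale(edges)
print(pdp_values)
print(ale_values)
```

The PDP evaluates the model at points such as `x0 = 0, x1 = 4` that do not exist in the data, while the ALE computation only moves `x0` within bins that contain real instances.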

**PDP Plots**
> - average out the effects over the entire feature space.

**ALE Plots**
> - focus on local changes in the prediction when a feature value changes
> 
> - to understand the global trend of how the feature influences predictions

#### <a id='toc2_2_1_3_'></a>[Example-Based Explanations](#toc0_)
##### <a id='toc2_2_1_3_1_'></a>[Prototype-based Explanations](#toc0_)

<img src="imgs/prototype_based_explanations.png" alt="Prototype-based Explanations" width="400">

> - aims to explain the predictions of a black box model by identifying representative examples or prototypes from the data.
>
> - The main thesis is that we represent the model's knowledge in terms of prototypical instances or patterns.

##### <a id='toc2_2_1_3_2_'></a>[Counterfactual Explanations](#toc0_)

<img src="imgs/counterfactual_explanations.png" alt="Counterfactual Explanations" width="400">

> describe a causal situation. 
> 
> - If x had not occurred, y would not have occurred. 
> 
> - We can simulate counterfactuals for predictions of black box machine learning models by changing the feature values of an instance before making the predictions and analyzing how the prediction changes. 
> 
> - A counterfactual explanation of a prediction describes the smallest change to the feature values that changes the prediction to a predefined output.
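
A minimal sketch of this idea: search for the smallest single-feature change that flips a black-box prediction to a desired outcome. The credit model, the step size, and the restriction to changing one feature at a time are assumptions for illustration; practical counterfactual methods optimize over several features and add plausibility constraints.

```python
def model(income, debt):
    """Hypothetical black-box credit model: 1 = approved, 0 = rejected."""
    return 1 if income - 0.5 * debt >= 50 else 0

def nearest_counterfactual(instance, target, step=1, max_steps=100):
    """Try single-feature changes of growing size until the prediction equals `target`."""
    income, debt = instance
    for k in range(1, max_steps + 1):
        delta = k * step
        candidates = [
            (income + delta, debt), (income - delta, debt),
            (income, debt + delta), (income, debt - delta),
        ]
        for candidate in candidates:
            if model(*candidate) == target:
                return candidate, delta
    return None, None

instance = (40, 10)  # currently rejected
counterfactual, change = nearest_counterfactual(instance, target=1)
print(counterfactual, change)
```

The result reads as a statement of the form "if income had been higher by `change`, the application would have been approved".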

### <a id='toc2_2_2_'></a>[Deep Learning Network Explanations](#toc0_)

#### <a id='toc2_2_2_1_'></a>[Feature Visualization](#toc0_)
> What is happening inside a NN? 

- is the process of making learned features in a neural network explicit.
- answers the question: what does this neuron, channel, or layer see? 
- With Feature Visualization, we are looking to maximize the activation of a neuron.

#### <a id='toc2_2_2_2_'></a>[Feature Attribution](#toc0_)
> indicates how much each feature in your model contributes to a prediction for an instance.

<img src="imgs/feature_attribution.png" alt="Feature Attribution" width="400">

#### <a id='toc2_2_2_3_'></a>[Network Dissection](#toc0_)
> links human concepts with individual neural network units.

<img src="imgs/network_dissection.png" alt="Network Dissection" width="400">

Implementing network dissection is fairly straightforward:
1. Get images with human-labeled visual concepts. These should be pixelwise-labeled images with concepts of different abstraction levels. 
2. Measure the CNN channel activations for those images.
3. Quantify the alignment between activations and labeled concepts. 

This method can be used to probe any convolutional layer.
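
The alignment step is commonly scored with intersection over union (IoU) between a thresholded activation map and a concept mask. The sketch below shows that computation on a hypothetical 3x3 example; the activation values, the mask, and the fixed threshold are assumptions, and a real pipeline would first upsample the activation map to the image resolution.

```python
def iou(activation_map, concept_mask, threshold):
    """Intersection over union between thresholded activations and a binary mask."""
    intersection = union = 0
    for act_row, mask_row in zip(activation_map, concept_mask):
        for act, label in zip(act_row, mask_row):
            active = act > threshold
            labeled = bool(label)
            intersection += int(active and labeled)
            union += int(active or labeled)
    return intersection / union if union else 0.0

# Hypothetical channel activations and a pixelwise mask for the concept "sky".
activations = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.0],
    [0.1, 0.0, 0.0],
]
sky_mask = [
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]

score = iou(activations, sky_mask, threshold=0.5)
print(score)
```

A channel whose score exceeds a chosen cutoff for some concept would then be reported as a detector for that concept.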

#### <a id='toc2_2_2_4_'></a>[Concept Activation Vectors](#toc0_)
> are a numerical representation of a concept in the activation space of a neural network layer.

<img src="imgs/concept_activation_vectors.png" alt="Concept Activation Vectors" width="400">


- For any given concept, TCAV (Testing with Concept Activation Vectors) measures the extent of that concept's influence on the model's prediction for a certain class.

## <a id='toc2_3_'></a>[XAI in GenAI](#toc0_)

**Fine-tuning**
> A large base model (typically >1B parameters) trained on a corpus of unlabeled data is fine-tuned on a smaller dataset with labels or through RLHF.

<img src="imgs/fine_tuning.png" alt="" width="400">

<img src="imgs/local_vs_global_explanations.png" alt="" width="400">

<img src="imgs/local_explanations_figure.png" alt="" width="400">

### Feature Attribution
#### Perturbation-based
> where you perturb input examples by removing, masking or altering input features. This can be embedding vectors, hidden units, words, or tokens, and then you evaluate model output changes.
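
A minimal leave-one-out sketch of this idea for text: mask one token at a time and record how much the output score changes. The scoring function is a hypothetical stand-in for a real model, and the lexicon weights and mask token are assumptions.

```python
WEIGHTS = {"great": 2.0, "boring": -1.5, "not": -0.5}  # hypothetical lexicon

def sentiment_score(tokens):
    """Hypothetical stand-in for a model's output score."""
    return sum(WEIGHTS.get(token, 0.0) for token in tokens)

def leave_one_out(tokens, mask="[MASK]"):
    """Attribution of each token = score drop when that token is masked."""
    full = sentiment_score(tokens)
    attributions = []
    for i, token in enumerate(tokens):
        perturbed = tokens[:i] + [mask] + tokens[i + 1:]
        attributions.append((token, full - sentiment_score(perturbed)))
    return attributions

tokens = ["the", "plot", "was", "great"]
attributions = leave_one_out(tokens)
print(attributions)
```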
#### Gradient-based 
> where you determine the importance of each input feature by analyzing the partial derivatives of the output with respect to each input dimension. The magnitude of the derivatives reflects the sensitivity of the output to changes in the input.
>
> - Integrated gradients are the primary approach to gradient-based explanations in LLMs
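
Integrated gradients attribute a prediction by averaging gradients along a straight path from a baseline input to the actual input, then scaling by the input difference. The sketch below is a toy version under stated assumptions: the model is hypothetical, gradients come from finite differences instead of automatic differentiation, and the all-zeros baseline is a choice, not a requirement.

```python
def model(x):
    """Hypothetical differentiable model."""
    return x[0] ** 2 + 3.0 * x[1]

def gradient(f, x, eps=1e-6):
    """Central finite-difference gradient (a stand-in for autodiff)."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def integrated_gradients(f, x, baseline, steps=50):
    """Midpoint Riemann approximation of the path integral from baseline to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x, baseline = [2.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
print(attributions)
print(sum(attributions), model(x) - model(baseline))  # completeness: these match
```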
#### SHAP
#### Decomposition-based
> aim to break down the relevance score into linear contributions from the input. 
> 
> - An example of this is layer wise relevance propagation or LRP.

#### Example-based Explanations
##### Counterfactual explanations
> reveal what would have happened based on certain observed input changes.
##### Influential instance
> characterize the influence of individual training samples by measuring how much they affect the loss on test points.
##### Adversarial example
> Neural models are highly vulnerable to carefully crafted small modifications in the input data that can drastically alter the model's predictions, despite being nearly imperceptible to humans.
>
> - expose areas where models fail and are used during training to improve model robustness and accuracy.
##### Natural language explanations
> Explain a model's decision-making on an input sequence with generated text. 
> 
> - The approach is to train a language model using both original textual data and human-annotated explanations.

#### Global Explanations
##### Probing-based explanation
>  we can look at either classifier-based probing or parameter-free probing.
>
1. Freeze LLM parameters 
2. Generate representations from the LLM
3. Train a shallow classifier on those representations to predict linguistic properties
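
The three steps can be sketched with a tiny logistic-regression probe. Steps 1 and 2 are assumed to have happened already: the 2-D "representations" and the singular/plural labels below are hypothetical placeholders for vectors extracted from a frozen LLM. In practice the probe would be evaluated on held-out data, not on its training set.

```python
import math

# Steps 1-2 (assumed done): representations extracted from a frozen model.
# Hypothetical data: label 1 = plural noun, 0 = singular noun.
representations = [(2.0, 0.1), (1.5, -0.2), (1.8, 0.3),
                   (-2.0, 0.2), (-1.7, -0.1), (-1.9, 0.0)]
labels = [1, 1, 1, 0, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(xs, ys, lr=0.5, epochs=200):
    """Step 3: a shallow classifier (logistic regression, batch gradient descent)."""
    w, b = [0.0, 0.0], 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            grad_w[0] += error * x[0]
            grad_w[1] += error * x[1]
            grad_b += error
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

w, b = train_probe(representations, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(representations, labels)) / len(labels)
print(accuracy)
```

High probe accuracy is taken as evidence that the property is linearly recoverable from the representations, which is why the probe is kept deliberately shallow.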

##### Neuron activation explanation
> examines individual neurons or dimensions rather than the whole vector space. 
> 
It involves the following steps:  
1. Identify important neurons. 
2. Learn relations between linguistic properties and individual ranked neurons in supervised tasks.
3. Verify via ablation experiments.
4. Generate natural language explanations.
5. Test how well the explanations allow a model to simulate the real neuron's activation behavior on new test examples.

##### Concept-based explanation
> allow us to map the inputs to a set of concepts and measure the importance score of each predefined concept for the model's predictions.

### Prompting 
#### In-context learning
> When a model is shown task demonstrations as part of the prompt.
#### Chain-of-thought (CoT)
> Prompting the model to describe its reasoning and go step-by-step through a problem.

### Embeddings 
- Embeddings = method of converting textual information into vectors of real numbers, capturing semantic and syntactic aspects of the data. 
- Embeddings are mapped into a multidimensional space that we call embedding or latent space
- They capture semantic relationships, making it possible for words with similar meanings to have similar representations.

**Key Features**
- Dimensionality reduction
- Captures semantics
- Encodes meaning based on word usage, context and distance measures

### RAG

<img src="imgs/rag_pipeline.png" alt="" width="400">

RAG Pipeline:

1. You have a vector database that is created by embedding your unstructured data.
2. You have a user query that is also embedded, using the same embedding model that was used to create your vector database.
3. You use a similarity algorithm to find the closest matches between items in your vector database and your user query.
4. The closest matches are then incorporated into your prompt and sent to the LLM.
5. The LLM generates a response to the user's query, and this response is sent back to the user.
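
Steps 1 to 4 of the pipeline can be sketched without any external services. Everything here is a stand-in: the bag-of-words `embed` function replaces a real embedding model, a Python list replaces the vector database, and the final LLM call (step 5) is left out.

```python
import math

def embed(text, vocabulary):
    """Hypothetical embedding: bag-of-words counts over a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, documents, vocabulary, k=1):
    """Rank documents by similarity to the query, using the same embedding for both."""
    q = embed(query, vocabulary)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d, vocabulary)), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    """Incorporate the closest matches into the prompt sent to the LLM."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query

documents = [
    "shap explains predictions with shapley values",
    "lime fits a local linear surrogate model",
    "ale plots accumulate local effects",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})

query = "how does lime build a local surrogate"
top = retrieve(query, documents, vocabulary)
prompt = build_prompt(query, top)
print(prompt)
```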


## Notes

**Cosine Similarity is Scale Invariant**
- it can measure the similarity of vectors regardless of the magnitude
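
A short derivation shows why. Assuming two non-zero vectors $u$ and $v$ and positive scale factors $\alpha, \beta > 0$ (symbols introduced here for illustration), the scale factors cancel:

$$
\cos(\alpha u, \beta v)
= \frac{(\alpha u) \cdot (\beta v)}{\lVert \alpha u \rVert \, \lVert \beta v \rVert}
= \frac{\alpha \beta \, (u \cdot v)}{\alpha \beta \, \lVert u \rVert \, \lVert v \rVert}
= \cos(u, v)
$$

Only the direction of the vectors matters, so an embedding and a copy of it that is ten times longer are treated as identical.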

**How can one visualize the latent space?**
- **PCA** = Principal Component Analysis
  - to simplify and find global linear relationships and patterns in the data
- **t-SNE** = t-distributed Stochastic Neighbor Embedding
  - involves constructing a lower dimensional representation where similar data points are placed closer together.
  - use t-SNE to emphasize visualization, reveal local patterns and clusters.
- **UMAP** = Uniform manifold approximation and projection
  - uses manifold learning, a non-linear dimensionality reduction technique, to understand the underlying structure or shape of the data.
  - focuses on capturing non-linear relationships in the data
  - You should use UMAP to preserve local structure and handle complex non-linear relationships.
- **PaCMAP**

## <a id='toc2_4_'></a>[Resources](#toc0_)

- [Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead](https://arxiv.org/pdf/1811.10154)