# Responsible AI: XAI GenAI project

## 0. Background



The previous lessons on explainability introduced post-hoc methods for explaining a model, such as saliency maps, SmoothGrad, LRP, LIME, and SHAP. Take LRP (Layer-wise Relevance Propagation) as an example: it highlights the pixels most relevant to the prediction of the class "cat" by backpropagating relevance. (image source: [Montavon et al. (2016)](https://giorgiomorales.github.io/Layer-wise-Relevance-Propagation-in-Pytorch/))

![LRP example](images/catLRP.jpg)

Another example is text sentiment classification. Here we visualize the importance of each word for the prediction 'positive':

![text example](images/textGradL2.png)

Words highlighted in darker colours are more critical in predicting the sentence's sentiment as 'positive'.
More examples can be found [here](http://34.160.227.66/?models=sst2-tiny&dataset=sst_dev&hidden_modules=Explanations_Attention&layout=default).

Both cases above require the class or the prediction of the model. But:

***How do you explain a model that does not predict but generates?***

In this project, we will explain a generative model based on the dependency between words. We will first look at a simple example and use Point-wise Mutual Information (PMI) to compute the saliency map of a sentence. After that, we will construct the experiment step by step, followed by exercises and questions.


## 1. A simple example to start with
Given a sample sentence: 
> *Tokyo is the capital city of Japan.* 

We are going to explain this sentence by building a saliency map of the dependency between its words.
The dependency between two words in the sentence can be measured by [Point-wise mutual information (PMI)](https://en.wikipedia.org/wiki/Pointwise_mutual_information), which we estimate from sampled completions as follows.


Mask two words out, e.g. 
> \[MASK-1\] is the capital city of \[MASK-2\].


Ask the generative model to fill in the sentence 10 times, and we have:

| MASK-1      | MASK-2 |
| ----------- | ----------- |
|    tokyo   |     japan   |
|  paris  |     france    |
|  london  |     england    |
|  paris  |     france    |
|  beijing |  china |
|    tokyo   |     japan   |
|  paris  |     france    |
|  paris  |     france    |
|  london  |     england    |
|  beijing |  china |

PMI is calculated by: 

$PMI(x,y)=\log_2 \frac{P(\{x,y\} \mid s-\{x,y\})}{P(\{x\} \mid s-\{x,y\})\,P(\{y\} \mid s-\{x,y\})}$

where $x$ and $y$ are the words that we masked out, $s$ is the sentence, and $s-\{x,y\}$ is the sentence's remaining tokens after removing the words $x$ and $y$.

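In this example, 'tokyo' and 'japan' each appear in 2 of the 10 responses, and always together, so all three probabilities are 0.2 and $PMI(\text{Tokyo}, \text{Japan}) = \log_2 \frac{0.2}{0.2 \times 0.2} \approx 2.32$.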
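
The arithmetic for this example can be reproduced directly from the ten sampled pairs in the table. The snippet below is a minimal sketch; the function name `pmi_from_samples` is an illustrative choice and not part of the project code.

```python
import math

# The ten (MASK-1, MASK-2) fills from the table above.
samples = [
    ("tokyo", "japan"), ("paris", "france"), ("london", "england"),
    ("paris", "france"), ("beijing", "china"), ("tokyo", "japan"),
    ("paris", "france"), ("paris", "france"), ("london", "england"),
    ("beijing", "china"),
]

def pmi_from_samples(samples, x, y):
    """Estimate PMI(x, y) in bits from sampled (first, second) fills."""
    n = len(samples)
    p_x = sum(a == x for a, _ in samples) / n
    p_y = sum(b == y for _, b in samples) / n
    p_xy = sum(a == x and b == y for a, b in samples) / n
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi_from_samples(samples, "tokyo", "japan"), 2))  # 2.32
```

Because every city in the table is always paired with the same country, the joint probability equals each marginal, and the PMI reduces to $-\log_2 P(x)$.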

Select a word of interest in the sentence (the anchor word); we can now compute the PMI between the anchor word and every other word using the generative model.
(Here, we use a longer sentence and generate 20 responses per word.)
![](images/resPMI.png)


## 2. Preparation
### 2.1 Conda environment

```
conda env create -f environment.yml
conda activate xai_llm
```


### 2.2 Download the offline LLM

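We use an offline LLM from Hugging Face. It is approximately 5 GB.
Download it using the command below and save it under `./models/`. The command writes to the current directory (`--local-dir .`), so run it from inside `./models/` or move the file there afterwards.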
```
huggingface-cli download TheBloke/openchat-3.5-0106-GGUF openchat-3.5-0106.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
# credit to https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF
```

## 3. Mask the sentence and get the responses from LLM
### 3.1 Get the input sentence

**Remember to change the anchor word index when changing the input sentence.**

In [9]:
import os
import numpy as np
# change to your working directory
os.chdir(R'C:\Users\anned\OneDrive - Danmarks Tekniske Universitet\Uni\Responible AI\Project\Project_3\XAI_LLM')

In [10]:
def get_input():
    # ideally this reads inputs from a file, now it just takes an input
    return input("Enter a sentence: ")
    
anchor_word_idx = 3  # 0-based index of the anchor word ("fox")
prompts_per_word = 20  # number of responses generated per masked pair

# sentence = get_input()
# The sentence is set manually here instead of being read from input.
# A short sentence keeps the run fast; the responses for the full sentence
# are loaded from a file later, because generating them takes a long time.
sentence = "The quick brown fox"  # full sentence adds: jumps over the lazy dog
print("Sentence: ", sentence)

Sentence:  The quick brown fox


### 3.2 Load the model

In [11]:
from models.ChatModel import ChatModel
model_name = "openchat"
model = ChatModel(model_name)
print(f"Model: {model_name}")

Model: openchat


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | 
Model metadata: {'general.name': 'openchat_openchat-3.5-0106', 'general.architecture': 'llama', 'llama.context_length': '8192', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '32000', 'general.file_type': '15', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '10000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': "{{ bos_token }}{% for message in message

### 3.3 Run the prompts and get all the responses


In [12]:
from tools.command_generator import generate_prompts, prefix_prompt
from tools.evaluate_response import get_replacements
from tqdm import tqdm
# Takes about 23 minutes for the full sentence. Use a shorter sentence to run faster.
# Comment this cell out once it has run and the responses are saved.
def run_prompts(model, sentence, anchor_idx, prompts_per_word=20):
    prompts = generate_prompts(sentence, anchor_idx)
    all_replacements = []
    for prompt in prompts:
        replacements = []
        for _ in tqdm(
            range(prompts_per_word),
            desc=f"Input: {prompt}",
        ):
            response = model.get_response(
                prefix_prompt(prompt),
            ).strip()
            if response:
                replacement = get_replacements(prompt, response)
                if replacement:
                    replacements.append(replacement)
        if len(replacements) > 0:
            all_replacements.append(replacements)
    return all_replacements

all_responses = run_prompts(model, sentence, anchor_word_idx, prompts_per_word)


Input: [MASK] quick brown [MASK]:   0%|          | 0/20 [00:00<?, ?it/s]

Input: [MASK] quick brown [MASK]:   5%|▌         | 1/20 [00:12<04:04, 12.87s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  10%|█         | 2/20 [00:19<02:42,  9.04s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  15%|█▌        | 3/20 [00:21<01:44,  6.15s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  20%|██        | 4/20 [00:28<01:38,  6.18s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  25%|██▌       | 5/20 [00:32<01:20,  5.37s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  30%|███       | 6/20 [00:35<01:05,  4.68s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  35%|███▌      | 7/20 [00:42<01:11,  5.52s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  40%|████      | 8/20 [00:51<01:19,  6.60s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  45%|████▌     | 9/20 [01:02<01:28,  8.01s/it]

 Response is not valid. ['[mask]', 'quick', 'brown', '[mask]'] ['the', 'fox', 'jumps', 'quickly', 'over', 'the', 'lazy', 'dog']


Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  50%|█████     | 10/20 [01:08<01:12,  7.28s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  55%|█████▌    | 11/20 [01:11<00:55,  6.16s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  60%|██████    | 12/20 [01:19<00:53,  6.67s/it]

 Response is not valid. ['[mask]', 'quick', 'brown', '[mask]'] ['the', 'fox', 'is', 'quite', 'fast']


Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  65%|██████▌   | 13/20 [01:33<01:01,  8.75s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  70%|███████   | 14/20 [01:38<00:46,  7.68s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  75%|███████▌  | 15/20 [01:48<00:41,  8.37s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  80%|████████  | 16/20 [01:57<00:33,  8.43s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  85%|████████▌ | 17/20 [02:02<00:22,  7.50s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  90%|█████████ | 18/20 [02:10<00:15,  7.79s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]:  95%|█████████▌| 19/20 [02:15<00:06,  6.71s/it]Llama.generate: prefix-match hit
Input: [MASK] quick brown [MASK]: 100%|██████████| 20/20 [02:20<00:00,  7.03s/it]
Input: The [MASK] brown [MASK]:   0%|          | 0/20 [00:00<?, ?it/s]Llama.generate: p

In [13]:
# Comment this cell out once it has run and the responses are saved.

print(all_responses)
# save the responses to a file
with open("responses.txt", "w") as f:
    f.write(str(all_responses))

[[['the', 'fox jumps over the lazy dog'], ['the fox jumped over the', 'fence'], ['fox', 'fox'], ['the', 'fox jumps over the lazy dog'], ['the', 'dog jumps'], ['the', 'fox'], ['the', 'fox jumps over the lazy dog'], ['the', 'fox jumps'], ['', ''], ['the fox is', 'brown'], ['the', 'dog'], ['', ''], ['the fox was [mask]', '[mask] jumps over the lazy dog'], ['the', 'fox jumps'], ['the', 'fox jumps'], ['the', 'fox jumps'], ['the', 'fox jumps'], ['the fox was very', 'the brown dog followed close behind'], ['the', 'dog jumped'], ['the', 'dog']], [['dog', 'fur'], ['beautiful', 'deer'], ['chocolate', 'dog'], ['cat', 'coat'], ['deer', 'coat'], ['deer', 'coat'], ['beautiful', 'cat'], ['dog', 'coat'], ['dog', 'coat'], ['dog', 'fur'], ['cat', 'tabby'], ['elephant', 'tusks'], ['dog', 'fur'], ['dog', 'furred'], ['beautiful', 'dog'], ['cow', 'meadow'], ['friendly', 'dog'], ['bear', 'coat'], ['beautiful', 'deer'], ['bear', 'coat']], [['fox', 'runs'], ['fox', 'runs'], ['fox', 'jumps'], ['fox', 'runs'], [

In [14]:
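# Overwrite the responses and sentence with the saved full-sentence run.
# Change the file name, sentence, and settings here to use your own.

import ast

# Load the responses from a file. literal_eval parses the saved list
# literal without executing arbitrary code, unlike eval.
with open("responses_Fox_and_Dog.txt", "r") as f:
    all_responses = ast.literal_eval(f.read())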

sentence = "The quick brown fox jumps over the lazy dog"
anchor_word_idx = 3  # 0-based index of the anchor word ("fox")
prompts_per_word = 20  # number of responses generated per masked pair

print(all_responses)

[[['the fox', 'fox'], ['the', 'fox'], ['fox', 'fox'], ['', 'fox'], ['fox', 'fox'], ['the', 'fox'], ['fox', 'fox'], ['the', 'fox'], ['fox', 'fox'], ['fox', 'fox'], ['', ''], ['the', 'fox'], ['the', 'fox'], ['the swift', 'fox'], ['', 'fox'], ['the', 'fox'], ['the fox', 'fox'], ['', 'fox'], ['the', 'fox'], ['the', 'fox']], [['fox', 'rabbit'], ['friendly', 'cat'], ['clever', 'rabbit'], ['gray', 'cat'], ['quick', 'fox'], ['', ''], ['quick', 'fox'], ['', ''], ['black', 'cat'], ['quick', 'fox'], ['quick', 'fox'], ['fox', 'fox'], ['yellow', 'cat'], ['yellow', 'cat'], ['little', 'cat'], ['yellow', 'fox'], ['fox', 'vixen'], ['old', 'cat'], ['tiny', 'frog'], ['feline', 'cat']], [['', ''], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['fox', 'cleverly'], ['brown', 'fox'], ['fox', '[mask]'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['fox', '[mask]'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['

In [15]:
print(all_responses)
print(len(all_responses))
print(all_responses[0])
print(len(all_responses[0]))
print(all_responses[0][0])
print(len(all_responses[0][0]))


print(all_responses[7])
print(len(all_responses[7]))
print(all_responses[7][0])
print(len(all_responses[7][0]))



print(all_responses[6])
print(len(all_responses[6]))
print(all_responses[6][0])
print(len(all_responses[6][0]))

# Index 8 is out of range: the sentence has 9 words, but all_responses
# excludes the anchor word, so it has only 8 entries (indices 0-7).
"""
print(all_responses)
print(len(all_responses))
print(all_responses[8])
print(len(all_responses[8]))
print(all_responses[8][0])
print(len(all_responses[8][0]))
""" 

[[['the fox', 'fox'], ['the', 'fox'], ['fox', 'fox'], ['', 'fox'], ['fox', 'fox'], ['the', 'fox'], ['fox', 'fox'], ['the', 'fox'], ['fox', 'fox'], ['fox', 'fox'], ['', ''], ['the', 'fox'], ['the', 'fox'], ['the swift', 'fox'], ['', 'fox'], ['the', 'fox'], ['the fox', 'fox'], ['', 'fox'], ['the', 'fox'], ['the', 'fox']], [['fox', 'rabbit'], ['friendly', 'cat'], ['clever', 'rabbit'], ['gray', 'cat'], ['quick', 'fox'], ['', ''], ['quick', 'fox'], ['', ''], ['black', 'cat'], ['quick', 'fox'], ['quick', 'fox'], ['fox', 'fox'], ['yellow', 'cat'], ['yellow', 'cat'], ['little', 'cat'], ['yellow', 'fox'], ['fox', 'vixen'], ['old', 'cat'], ['tiny', 'frog'], ['feline', 'cat']], [['', ''], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['fox', 'cleverly'], ['brown', 'fox'], ['fox', '[mask]'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['fox', '[mask]'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['brown', 'fox'], ['

'\nprint(all_responses)\nprint(len(all_responses))\nprint(all_responses[8])\nprint(len(all_responses[8]))\nprint(all_responses[8][0])\nprint(len(all_responses[8][0]))\n'

> We are looking at the sentence "The quick brown fox jumps over the lazy dog".
> We start by masking the anchor word 'fox'.
> The first index of `all_responses` corresponds to the word index in the sentence with the anchor word 'fox' removed. Thus index 2 corresponds to 'brown' and index 3 corresponds to 'jumps'.
> For each index i, we additionally mask the word at that index and generate 20 responses for the two masked words, so the second index of `all_responses` selects one of these 20 responses.
>
> Each response lists the two filled-in words in sentence order. So although x = 'fox' is masked every time, its position in the response pair changes once the second mask moves from the left of 'fox' to its right.
> For index 0 the output is given as [y, x] = ['the', 'fox'].
> For index 7 the output is given as [x, y] = ['fox', 'dog'].
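
The index bookkeeping described in this note can be sketched as follows. `make_masked_prompt` is a hypothetical stand-in written for illustration; the project's own `generate_prompts` may differ in detail. The sketch only assumes what the progress log shows, namely that the anchor word and one other word are both replaced by `[MASK]`.

```python
def make_masked_prompt(sentence, anchor_idx, i):
    """Mask the anchor word and the i-th remaining word.

    `i` indexes the sentence with the anchor removed, matching the first
    index of all_responses. Returns the prompt and a flag telling whether
    the anchor x is the first of the two masks (response = [x, y]) or the
    second (response = [y, x]).
    """
    words = sentence.split()
    # Map i back to a position in the full sentence.
    pos = i if i < anchor_idx else i + 1
    masked = ["[MASK]" if k in (anchor_idx, pos) else w
              for k, w in enumerate(words)]
    x_first = anchor_idx < pos
    return " ".join(masked), x_first

sentence = "The quick brown fox jumps over the lazy dog"
for i in (0, 3, 7):
    print(i, make_masked_prompt(sentence, 3, i))
```

The returned flag is what a PMI computation needs in order to read x and y from the correct slot of each response pair.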

### 3.4 EXERCISE: compute the PMI for each word

$PMI(x,y)=\log_2 \frac{P(\{x,y\} \mid s-\{x,y\})}{P(\{x\} \mid s-\{x,y\})\,P(\{y\} \mid s-\{x,y\})}$

* Compute $P(x)$, $P(y)$ and $P(x,y)$ first and print them.
* Compute the PMI for each word.
* Visualize the result by coloring the words. Tip: you might need to normalize the result first.
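
One way to approach the visualization step is sketched below: min-max normalize the scores to [0, 1] and map each score to the opacity of an HTML background colour. The function name, the colour, and the example scores are illustrative assumptions, not results from the model.

```python
import html

def color_words(words, scores):
    """Return an HTML string where a more opaque red marks a higher score."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores match
    parts = []
    for word, score in zip(words, scores):
        alpha = (score - lo) / span  # min-max normalization to [0, 1]
        parts.append(
            f'<span style="background: rgba(255, 0, 0, {alpha:.2f})">'
            f"{html.escape(word)}</span>"
        )
    return " ".join(parts)

# Made-up scores, only to show the call.
words = ["The", "quick", "brown"]
scores = [0.5, 2.0, 3.5]
markup = color_words(words, scores)
print(markup)
```

In a notebook, `IPython.display.HTML(markup)` renders the coloured sentence inline.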


In [17]:
sentence_split = sentence.split()
true_x = sentence_split[anchor_word_idx]
print(sentence_split)
print(true_x)
sentence_split_masked_x = sentence_split.copy()
sentence_split_masked_x.pop(anchor_word_idx)
print(sentence_split_masked_x)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
fox
['The', 'quick', 'brown', 'jumps', 'over', 'the', 'lazy', 'dog']


In [18]:
# Your code here. You are welcome to put it in a .py file and run it from there.

P_x = {}
P_y = {}
P_x_y = {}
PMI = {}
epsilon = 1e-10


for i in range(len(all_responses)):
    true_y = sentence_split_masked_x[i]
    P_x[i] = 0
    P_y[i] = 0
    P_x_y[i] = 0
    for j in range(len(all_responses[i])):
        if i > anchor_word_idx:
            if all_responses[i][j][1] == true_x:
                P_x[i] += 1
            if all_responses[i][j][0] == true_y:
                P_y[i] += 1
            if all_responses[i][j][1] == true_x and all_responses[i][j][0] == true_y:
                P_x_y[i] += 1
        else:
            if all_responses[i][j][0] == true_x:
                P_x[i] += 1
            if all_responses[i][j][1] == true_y:
                P_y[i] += 1
            if all_responses[i][j][0] == true_x and all_responses[i][j][1] == true_y:
                P_x_y[i] += 1
            
    P_x[i] = P_x[i]/len(all_responses[i])
    P_y[i] = P_y[i]/len(all_responses[i])
    P_x_y[i] = P_x_y[i]/len(all_responses[i])

    if P_x[i] == 0:
        P_x[i] = epsilon
    if P_y[i] == 0:
        P_y[i] = epsilon
    if P_x_y[i] == 0:
        P_x_y[i] = epsilon


    PMI[i] = np.log2(P_x_y[i]/(P_x[i]*P_y[i]))

print(P_x)
print(P_y)
print(P_x_y)
print(PMI)


{0: 0.25, 1: 0.15, 2: 0.15, 3: 1.0, 4: 1e-10, 5: 1e-10, 6: 1e-10, 7: 0.1}
{0: 1e-10, 1: 1e-10, 2: 1e-10, 3: 0.85, 4: 1e-10, 5: 1e-10, 6: 1e-10, 7: 0.2}
{0: 1e-10, 1: 1e-10, 2: 1e-10, 3: 0.85, 4: 1e-10, 5: 1e-10, 6: 1e-10, 7: 0.1}
{0: 2.0, 1: 2.736965594166206, 2: 2.736965594166206, 3: 0.0, 4: 33.219280948873624, 5: 33.219280948873624, 6: 33.219280948873624, 7: 2.321928094887362}
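
The printed values suggest three things worth checking in the cell above. The comparisons are case-sensitive, so the lowercase responses never match 'The'. The slot read as x is chosen by `i > anchor_word_idx`, which appears to be the opposite of the [y, x] / [x, y] ordering noted in section 3.3. And when all three counts are zero, the epsilon substitution yields $\log_2(10^{-10} / 10^{-20}) \approx 33.2$, which reflects the epsilon rather than the data. The sketch below is one possible alternative under those assumptions; `pair_pmi` is a hypothetical helper and the sample responses are invented for illustration.

```python
import math

def pair_pmi(responses, x, y, x_first):
    """Estimate PMI(x, y) from [first, second] response pairs.

    x_first says whether the anchor x fills the first mask. Matching is
    case-insensitive. Returns None when x and y never co-occur, since the
    PMI estimate is then undefined (it tends to minus infinity).
    """
    x, y = x.lower(), y.lower()
    n = len(responses)
    n_x = n_y = n_xy = 0
    for first, second in responses:
        rx, ry = (first, second) if x_first else (second, first)
        hit_x = rx.lower() == x
        hit_y = ry.lower() == y
        n_x += hit_x
        n_y += hit_y
        n_xy += hit_x and hit_y
    if n_xy == 0:
        return None
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))

# Invented responses for "[MASK] quick brown [MASK]":
# y = 'The' fills the first mask, x = 'fox' fills the second.
demo = [["the", "fox"], ["the", "fox"], ["a", "dog"], ["a", "cat"]]
print(pair_pmi(demo, "fox", "The", x_first=False))  # 1.0
```

Returning `None` for pairs that never co-occur keeps them out of the colour scale, instead of letting an arbitrary epsilon decide their rank.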



## 4. EXERCISE: Try more examples; maybe come up with your own. Report the results.

* Try more examples, change the anchor word or the number of responses, and observe the results. What does the explanation mean? Do you think it is a good explanation? Why or why not?
* What are the limitations of the current method? When does the method fail to explain?

## 5. Bonus Exercises
### 5.1 Language pre-processing. 
In this exercise, we only lowercase the letters and split sentences into words; there is much more to do when pre-processing language, for example handling contractions (*I'll*, *She's*, *world's*), suffixes and prefixes, and compound words (*hard-working*). This is called word tokenization in NLP, and some Python packages can do this work for us, e.g. [*TextBlob*](https://textblob.readthedocs.io/en/dev/).
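
As a rough illustration of what a tokenizer has to decide, the sketch below uses only the standard library's `re` module. It is not how TextBlob works internally, and the pattern is a simplifying assumption that keeps contractions and hyphenated compounds as single tokens.

```python
import re

# Letters, optionally joined by internal apostrophes or hyphens.
TOKEN = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def tokenize(text):
    """Lowercase the text and return word tokens, dropping punctuation."""
    return TOKEN.findall(text.lower())

text = "She's a hard-working student."
print(text.split())    # plain split keeps the trailing period on 'student.'
print(tokenize(text))  # the regex version drops it
```

Whether *She's* should stay one token or be split into *she* and *'s* is a design choice that changes which responses count as a match.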


### 5.2 Better word matching
Consider the longer example sentence:
> Tokyo is the capital of Japan and a popular metropolis in the world.

When 'metropolis' is masked out, the GenAI model never produces that exact word; instead, it sometimes gives words such as 'city', which is a different word with a similar meaning. Instead of measuring exact matches (i.e. 0 or 1), we can measure the similarity of two words, e.g. the cosine similarity of their word embeddings, which ranges from -1 to 1 and can be clipped or rescaled to the range 0 to 1.
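
A minimal sketch of similarity-based matching follows. The three-dimensional vectors are made up for illustration; in practice they would come from a word-embedding model, which this sketch does not load.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (hypothetical values, not from a trained model).
toy_vectors = {
    "metropolis": [0.9, 0.1, 0.3],
    "city": [0.8, 0.2, 0.4],
    "banana": [0.0, 1.0, 0.1],
}

def soft_match(candidate, target, vectors=toy_vectors):
    """Score in [0, 1]: negative similarities are clipped to zero."""
    return max(0.0, cosine_similarity(vectors[candidate], vectors[target]))

print(soft_match("city", "metropolis") > soft_match("banana", "metropolis"))  # True
```

The soft score could replace the 0/1 match when counting responses, so that 'city' contributes partial credit towards 'metropolis'.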