# Applying integrated gradients to Language Models

<a href="https://colab.research.google.com/drive/1rTVXeecVJ4gLxkeAzMmrlj9ZG31QrXXq" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).

Language models are an excellent example of how we can use integrated gradients to explain a neural network's behavior. In short, integrated gradients is an interpretability method proposed in "[Axiomatic Attribution for Deep Networks](https://arxiv.org/abs/1703.01365)". This method uses the gradients of an ML model's output with respect to its input to determine how much influence the individual parts of an input (like tokens in a sentence or pixels in an image) have on the model's output.

![image](https://raw.githubusercontent.com/garygsw/smooth-taylor/master/method-comparison.png)

[Source](https://github.com/garygsw/smooth-taylor).

Integrated gradients are closely related to saliency maps and other attribution methods, which we covered in [this](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Explainability/CV/CNN_attribution_maps.ipynb) (Grad-CAM) and [this](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Explainability/NLP/lime_for_NLP.ipynb) (LIME) notebook. In this tutorial, we will use the `alibi` library to explore this method.

> **[Alibi](https://docs.seldon.io/projects/alibi) is an open-source Python library aimed at machine learning model inspection and interpretation.**

As the object of our analysis, we will use one of the models we trained in [this notebook](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Explainability/NLP/model_maker.ipynb) (a Bidirectional LSTM), which can be found on the Hugging Face Hub.

In [1]:
# First, we need to clone the repository of our model, which contains both the model and its tokenizer
!git lfs install
!git clone https://huggingface.co/AiresPucrs/BiLSTM-sentiment-classifier

Git LFS initialized.
Cloning into 'BiLSTM-sentiment-classifier'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 14 (delta 3), reused 0 (delta 0), pack-reused 4
Unpacking objects: 100% (14/14), 4.55 KiB | 932.00 KiB/s, done.
Filtering content: 100% (2/2), 25.87 MiB | 17.48 MiB/s, done.


Let us quickly load and test our model to see if everything works as it should.

In [4]:
import json
import tensorflow as tf

model_path = '/content/BiLSTM-sentiment-classifier/BiLSTM-sentiment-classifier.h5'
tokenizer_path = '/content/BiLSTM-sentiment-classifier/tokenizer-BiLSTM-sentiment-classifier.json'

# Load the model
model = tf.keras.models.load_model(model_path)

# Load the tokenizer (the `with` block closes the file automatically)
with open(tokenizer_path) as fp:
    data = json.load(fp)

tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(data)
word_index = tokenizer.word_index

# Define samples to test the model
strings = [
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
]

# Get some predictions
preds = model.predict(
    tf.keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(strings),
        maxlen=250,
        truncating='post'
    ), verbose=0)

for i, string in enumerate(strings):
    print(f'Review: "{string}"\n(Negative 😔 {round((preds[i][0]) * 100)}% | Positive 😊 \
      {round(preds[i][1] * 100)}%)\n')

Review: "this explanation is really bad"
(Negative 😔 95% | Positive 😊       5%)

Review: "i did not like this tutorial 2/10"
(Negative 😔 88% | Positive 😊       12%)

Review: "this tutorial is garbage i wont my money back"
(Negative 😔 89% | Positive 😊       11%)

Review: "is nice to see philosophers doing machine learning"
(Negative 😔 4% | Positive 😊       96%)

Review: "this is a great and wonderful example of nlp"
(Negative 😔 0% | Positive 😊       100%)

Review: "this tutorial is great one of the best tutorials ever made"
(Negative 😔 0% | Positive 😊       100%)



In language models, like text classification models, integrated gradients define an attribution value for each token in the input sentence. The attributions are calculated considering the integral of the model gradients with respect to the word embedding layer along a straight path from a baseline instance x' to the input instance x.

A useful consequence (the completeness axiom from the paper) is that the attributions of all input features sum to the difference between the model output at the instance x and the model output at the baseline x':

$$\sum_i A_i(x, x') = F(x) - F(x')$$
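
To make this property concrete, here is a minimal sketch on a toy function, not on the BiLSTM itself. The function `F`, its analytic gradient `grad_F`, and the helper `integrated_gradients` are hypothetical names introduced only for this illustration, and the path integral is approximated with a midpoint Riemann sum.

```python
import numpy as np

def F(x):
    # Toy differentiable "model": a scalar function of a 3-dimensional input
    return np.sum(x ** 2) + x[0] * x[1]

def grad_F(x):
    # Analytic gradient of F with respect to x
    g = 2.0 * x
    g[0] += x[1]
    g[1] += x[0]
    return g

def integrated_gradients(x, baseline, n_steps=50):
    # Midpoint Riemann approximation of the gradient integral along the
    # straight path from the baseline to the input
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    grads = np.array([grad_F(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, -2.0, 0.5])
baseline = np.zeros_like(x)  # all-zeros baseline, as in the rest of this tutorial
attributions = integrated_gradients(x, baseline)

print("Attributions:", attributions)
print("Sum of attributions:", attributions.sum())
print("F(x) - F(x'):", F(x) - F(baseline))
```

The last two printed values should agree up to floating-point error, which is a handy sanity check for any attribution run.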

Now, to utilize the [`IntegratedGradients`](https://docs.seldon.io/projects/alibi/en/latest/api/alibi.explainers.html#alibi.explainers.IntegratedGradients) class from `alibi`, we need to set some arguments first:

-   `model`: A TensorFlow model.
-   `layer`: A layer, or a function that takes the model as a parameter and returns the layer with respect to which the gradients are calculated. In the case of our language model, this is the `Embedding` layer.
-   `target_fn`: A scalar function applied to the model's predictions (like `np.argmax(predictions, axis=1)`).
-   `method`: Method for the integral approximation (`riemann_left`, `riemann_right`, `riemann_middle`, `riemann_trapezoid`, `gausslegendre`).
-   `n_steps`: Number of steps in the path integral approximation from the baseline to the input instance.
-   `internal_batch_size`: Batch size for the internal batching.
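
The `method` and `n_steps` arguments control how the path integral is approximated numerically. The sketch below is not alibi's internal implementation. It only illustrates, on a hypothetical one-dimensional integrand `g`, how a left Riemann sum and Gauss-Legendre quadrature behave for the same number of steps.

```python
import numpy as np

def g(alpha):
    # Hypothetical smooth integrand standing in for a gradient along the path
    return np.exp(alpha)

exact = np.e - 1.0  # exact integral of exp(alpha) over [0, 1]

def riemann_left(f, n_steps):
    # Evaluate f at the left edge of n_steps equal sub-intervals of [0, 1]
    alphas = np.arange(n_steps) / n_steps
    return f(alphas).mean()

def gauss_legendre(f, n_steps):
    # Nodes and weights are defined on [-1, 1]; rescale them to [0, 1]
    nodes, weights = np.polynomial.legendre.leggauss(n_steps)
    return np.sum(0.5 * weights * f(0.5 * (nodes + 1.0)))

for n in (5, 20, 100):
    err_left = abs(riemann_left(g, n) - exact)
    err_gl = abs(gauss_legendre(g, n) - exact)
    print(f"n_steps={n:>3}  riemann_left error={err_left:.2e}  gausslegendre error={err_gl:.2e}")
```

For smooth integrands like this one, Gauss-Legendre tends to reach a small error with far fewer steps, which is one plausible reason to prefer it when each step costs a full gradient computation.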


In [7]:
!pip install alibi -q

import alibi

n_steps = 100  # 128
internal_batch_size = 500  # 250

ig = alibi.explainers.IntegratedGradients(
    model,
    layer=lambda model: model.layers[1],  # Embedding layer
    target_fn=None,
    method="gausslegendre",
    n_steps=n_steps,
    internal_batch_size=internal_batch_size
)

The integrated gradient attributions are calculated with respect to the embedding layer for every sample in our `x_test_sample` list, which could also be a partition of your test set. With these samples, we use our model to generate a prediction array. The `explain` method of our `ig` explainer also needs to know, for each sample, which element of the model's output the gradients should be computed for. This is what the `target` parameter is for. Since our model has a `softmax` output, we can pass the predicted classes, `preds.argmax(axis=1)`, as the `target`.

> **Note: If you are using a model with a `sigmoid` output, some basic list comprehension (`[0 if preds[i][0] < 0.5 else 1 for i in range(len(preds))]`) can give you a list of predicted classes for your samples.**
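
As a small illustration of the note above, the sketch below thresholds a made-up array of sigmoid outputs with NumPy. The array `sigmoid_preds` is hypothetical and does not come from the model used in this tutorial.

```python
import numpy as np

# Hypothetical sigmoid outputs of shape (n_samples, 1)
sigmoid_preds = np.array([[0.08], [0.51], [0.97], [0.49]])

# Threshold at 0.5 to get one class index per sample
predicted_classes = (sigmoid_preds[:, 0] >= 0.5).astype(int)

print(predicted_classes)
```

This produces the same classes as the list comprehension, just in vectorized form.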

Here, we use the default baseline (`None`), which equates to a sequence of zeros (this corresponds to a sequence of padding characters, a.k.a. no input).

In [9]:
# Sample text
x_test_sample = [
    'One of the weakest entries in the J-horror remake sweepstakes, One Missed Call is undone by bland performances and shopworn shocks.'
]

# Tokenize and pad the sample once, so the same array feeds the model and the explainer
x_test = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(x_test_sample),
    maxlen=250,
    truncating='post'
)

# Prediction
preds = model.predict(x_test, verbose=0)

# Predicted class of each sample, used as the attribution target
predicted_classes = preds.argmax(axis=1)

explanation = ig.explain(
    x_test,
    baselines=None,
    target=predicted_classes,
    attribute_to_layer_inputs=False
)

From this explanation object, we can recover our attributions. To retrieve them, we read `explanation.attributions`, a list with one array per model input. Since our model has a single input, index 0 gives us an array of shape (1, 250, 128), i.e., one sample (batch dimension), of length 250 (context length), and dimensionality 128 (the embedding dimension). Summing all values across the embedding dimension (axis 2), we get the attributions for our 250 tokens.

In [16]:
attrs = explanation.attributions[0]
attrs = attrs.sum(axis=2)
print('Attributions shape:', attrs.shape)

Attributions shape: (1, 250)


Now, we get rid of the padding tokens and take only the attributions that relate to the words in our input.

In [58]:
# Get how many tokens our text sample has (approximated by whitespace-separated words)
number_of_tokens = len(x_test_sample[0].split())
# Get all the attributions that are not from a pad token (padding comes first, so real tokens are last)
atributions = attrs[0][-number_of_tokens:]
# Get a list of corresponding words for every token
words = x_test_sample[0].split()

len(words), len(atributions)

(21, 21)
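
Counting whitespace-separated words assumes that the tokenizer produced exactly one token per word, which may not hold if it drops or splits some words. As a hedged alternative, the sketch below uses made-up token ids and attribution values (not outputs from our model) to show how the non-padding positions can be read directly from the padded id sequence, assuming id 0 is reserved for padding.

```python
import numpy as np

# Hypothetical pre-padded sequence of token ids (0 = padding)
padded_ids = np.array([0, 0, 0, 0, 17, 4, 305, 88])
# Hypothetical per-token attributions with the same length
token_attrs = np.array([0.0, 0.0, 0.0, 0.0, 0.02, -0.01, 0.41, 0.05])

# Keep only the positions that hold a real token
mask = padded_ids != 0
real_ids = padded_ids[mask]
real_attrs = token_attrs[mask]

print(list(zip(real_ids.tolist(), real_attrs.tolist())))
```

With the Keras tokenizer used here, the surviving ids could then be mapped back to words through `tokenizer.index_word`, so each attribution is paired with the token the model actually saw.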

To create a visually intuitive way of interpreting this model's output, we will assign a color to each attribution score, taking the `max` and `min` values to set a range of predefined colors.

In [59]:
import matplotlib
import matplotlib.cm
import matplotlib.colors

# Get the max and min attributions to establish a range
minima = min(atributions)
maxima = max(atributions)

# Normalize and map this range to a color scheme (we chose `YlOrRd`)
norm = matplotlib.colors.Normalize(vmin=minima, vmax=maxima, clip=True)
mapper = matplotlib.cm.ScalarMappable(norm=norm, cmap=matplotlib.cm.YlOrRd)

# Create a list of HEX colors
colors = [matplotlib.colors.to_hex(mapper.to_rgba(v)) for v in atributions]

colors

['#fff4b0',
 '#fff5b5',
 '#fff4b0',
 '#fff4b0',
 '#fff5b3',
 '#fff5b5',
 '#ffefa4',
 '#fff3ae',
 '#fee38b',
 '#fff0a8',
 '#fff1ab',
 '#ffeda1',
 '#fee085',
 '#fff4b0',
 '#fff8bb',
 '#fff2ac',
 '#800026',
 '#ffffcc',
 '#fff7b7',
 '#fee288',
 '#ffea99']

Now, with attributions and colors, we can create a function that maps the generated colors to each token in the text sample, yielding a colorful HTML representation of the attributions given to each word. Since we use the "YlOrRd" color scale, words with **high positive attribution** are colored in shades of **red**. Words with **middling attributions** are colored **orange**, while **low attributions** receive a **pale yellow**.

In [62]:
from IPython.display import HTML, display
import plotly.graph_objects as go

text_with_attributions = ' '.join([f'''<span style="color:{colors[i]}"><b>{words[i]}</b></span>''' for i in range(len(words))])

display(HTML(f'Sample: {x_test_sample[0]}\n'))

for i in range(len(preds)):
    display(HTML(f'Prediction: (Negative 😔 {round((preds[i][0]) * 100)}% | Positive 😊 {round(preds[i][1] * 100)}%)\n'))
    display(HTML(f'Attributions: {text_with_attributions}'))

# Graph the attributions
fig = go.Figure(
    go.Bar(
        x=atributions,
        y=words,
        orientation='h',
        marker_color=colors
    )
)

fig.update_xaxes(
    ticksuffix = "",
    griddash='dash'
)

fig.update_layout(
    template='plotly_dark',
    title_text='Attributions and Words',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)

fig.show()

Above, you can see which tokens had the greatest influence on the prediction of our sentiment classifier. In this example, the word **"bland"** (token **2159**) was the main culprit for this negative classification. 🎭
