<a href="https://colab.research.google.com/github/OldFatGuyFrom1962/GitHub-for-Poets-test/blob/main/31_NLP_Handson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Hands-on

This notebook is available at https://bit.ly/aiq-nlp .

## Start here

To get things started, execute this notebook. The steps required are:
- If you'd like to save changes you make, in the top menu, select "File" --> "Save copy in ..." (select the option that applies to you).
- In the top menu, select "Runtime" --> "Change runtime type" --> "Hardware accelerator" = "GPU".
- Press the "Connect" button in the top right.
- In the top menu, select "Runtime" --> "Run all".

The first run of the notebook takes a few minutes because several datasets, some software, and a model need to be downloaded.

This notebook demonstrates course concepts in the context of natural language processing (NLP). The following sections include:

## Setup

Installing, downloading, and importing required packages and data.

## Sentiment Model

Downloads and sets up a sentiment model from Huggingface.
  
  - Data pre-processing: how to process data for use in the model.
  - Running the model: demonstration of how to run the model.
  - Embedding Space: visualization of the model's token embeddings.
  - Performance: measuring the accuracy of the model on the Rotten Tomatoes example dataset.
  - Trulens: Model Wrapper: how to wrap the model for use with Trulens.

## Attributions

Basic demonstration of attributions using Trulens.
  
  - Baselines: demonstration of adjusting the baseline used for Trulens attributions.

## Fairness

Finding inputs which demonstrate model unfairness.

  - Gender in embedding space: visualization of the embedding space's gender direction.

## Drift

Demonstration of different options for measuring drift.

  - Model score drift: drift in model score, same as in tabular data.
  - Token distribution drift: comparing distributions of tokens across datasets.
  - Embedding distribution drift: comparing distributions in the embedding space.
  - Gender distribution drift: comparing distributions in embedding space using an extracted gender dimension.

# Interactive widgets

There are widgets throughout this notebook where you can adjust inputs to, or inspect details of, various demonstrations. These are marked with a thick <span style="border: 5px solid teal;">teal</span> border.

# Starting up

Once you have evaluated the entire notebook, you can skip any cell whose title starts with `Skip` and focus on cells whose titles start with `Playground`.

# Homework

Adapt the appropriate [Trulens](https://trulens.org) quickstart notebook to your favorite model / dataset. The available quickstarts are:

* [vision with pytorch](https://colab.research.google.com/drive/1n77IGrPDO2XpeIVo_LQW0gY78enV-tY9?usp=sharing)
* [vision with tensorflow](https://colab.research.google.com/drive/1f-ETsdlppODJGQCdMXG-jmGmfyWyW2VD?usp=sharing)
* [NLP with pytorch](https://colab.research.google.com/drive/18GcjsYMkRbxPDDS3J6BEbKnb7AY-1-Wa?usp=sharing)
* [NLP with tensorflow](https://colab.research.google.com/drive/1K09IvN7cMTkzsnb-uAeA0YQNfDU7Ibhs?usp=sharing)

## Optional Adventure

Replace elements of this notebook with your favorite ... (in order of increasing difficulty):

- Sentiment classification dataset. Code portions that will have to change are marked with `DATA`.
- Non-sentiment classification dataset. Marked with `DATA`.
    - more useful if the model is updated too
- Huggingface classification model. Marked with `HUGS`.
- Huggingface non-classification model. Marked with `CLASS`.
    - more useful if you have appropriate data too

While Trulens supports Tensorflow, most of this notebook is tailored to PyTorch, so we do not recommend trying to use it with Tensorflow.

In [None]:
#@title Skip: Setup

%load_ext autoreload
%autoreload 2

import sys

# Install requirements.
while True:
  try:
    import trulens
    import lzma
    import pickle
    import datasets
    import domonic
    import gdown
    from openTSNE import TSNE
    import torch
    import transformers
  except Exception:
    ! {sys.executable} -m pip install git+https://github.com/truera/trulens.git@piotrm/aiq-nlp
    ! {sys.executable} -m pip install transformers datasets openTSNE domonic==0.9.8 gdown torch
  else:
    break

import base64
import functools
import multiprocessing as mp
import os
from pathlib import Path
import re
from typing import Callable, Dict, List, Tuple, Sequence

from datasets import load_dataset
from IPython.display import clear_output
from IPython.display import display
from ipywidgets import interact
from ipywidgets import interactive
from ipywidgets import widgets
import numpy as np
import numpy.typing as npt
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import torch
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import transformers as hugs

from trulens.nn.attribution import Cut
from trulens.nn.attribution import IntegratedGradients
from trulens.nn.attribution import OutputCut
from trulens.nn.models import get_model_wrapper
from trulens.nn.quantities import ClassQoI
from trulens.utils.nlp import token_baseline
from trulens.utils.typing import ModelInputs

# need to use old ipywidgets:
# ! {sys.executable} -m pip install ipywidgets==7.7.1
#try:
#  from google.colab import output
#  output.enable_custom_widget_manager()

# Figure = go.FigureWidget # use this if running in vscode
Figure = go.Figure # use this if running in google colab or jupyter

# Download some pre-computed data.
if not Path("tsne_embedding.lzma").exists():
  gdown.download(
    "https://drive.google.com/file/d/1ZA8jyv026Q7T1RCJFtxxfCUl1JHXNFVP/view?usp=sharing",
    fuzzy=True, resume=True
)

try:
  # DATA: Preload datasets. More about them later.
  rotten_train = load_dataset("rotten_tomatoes", split="train")
  rotten_test = load_dataset("rotten_tomatoes", split="test")
  rotten_texts = list(rotten_train['text']) + list(rotten_test['text'])
  
  imdb_train = load_dataset("imdb", "plain_text", split="train")
  imdb_test = load_dataset("imdb", "plain_text", split="test")
  imdb_texts = list(imdb_train['text']) + list(imdb_test['text'])

except Exception as e:
  print("WARNING: could not load huggingface datasets, will use blank replacements")
  print(str(e))

  rotten_texts = ["fake rotten tomatoes sentence; person woman man camera TV"] * 10
  rotten_train = dict(text=rotten_texts, label=[0] * 10)
  rotten_test = dict(text=rotten_texts, label=[0] * 10)

  imdb_texts = ["fake imdb sentence; person woman man camera TV"] * 10
  imdb_train = dict(text=imdb_texts, label=[0] * 10)
  imdb_test = dict(text=imdb_texts, label=[0] * 10)

try:
  # DATA: extra dataset for playing around
  tweets_file = Path("training.1600000.processed.noemoticon.csv")
  if not tweets_file.exists():
      ! wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
      ! unzip trainingandtestdata.zip

  tweets = pd.read_csv(tweets_file, encoding='ISO-8859-1', header=None, names=["polarity", "id", "timestamp", "query", "user", "text"])
  tweet_texts = list(tweets['text'])

except Exception as e:
  print("WARNING: could not load tweet dataset, will use blank replacements")
  print(str(e))

  tweet_texts = ["fake tweet sentence; person woman man camera TV"] * 10

# Sentiment Classification Model

[Huggingface](https://huggingface.co/models) offers a variety of pre-trained NLP models to explore. This notebook uses a [transformer-based sentiment classification model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english): DistilBERT fine-tuned on the SST-2 movie review sentiment dataset.

In the cell below, comments marked `HUGS` point out the elements you would need to update to replace the given model with another Huggingface model.
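The `CLASS`-marked methods in the cell below turn model logits into class probabilities with a softmax (`evaluate_to_probits`) and into readable labels with an argmax. The NumPy sketch below shows that post-processing in isolation; the logit values are made up for illustration, and the label order `["negative", "positive"]` mirrors `Model.labels`.

```python
import numpy as np

LABELS = ["negative", "positive"]  # mirrors Model.labels in the notebook


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def predict_labels(logits: np.ndarray) -> list:
    """Most likely label per row (argmax of logits equals argmax of probabilities)."""
    return [LABELS[i] for i in logits.argmax(axis=1)]


# Made-up logits for two sentences, shape (num_sentences, num_classes).
example_logits = np.array([[2.0, -1.0], [-0.5, 3.5]])
probs = softmax(example_logits)
print(probs.round(3))
print(predict_labels(example_logits))  # ['negative', 'positive']
```

Because softmax is monotonic within a row, the predicted label is the same whether it is read off the logits or the probabilities.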

In [None]:
#@title Huggingface NLP model setup


# Wrap all of the components needed to run a model.
class Model:
    # Use the GPU when available (see "Start here"), otherwise fall back to the CPU.
    device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")

    # HUGS: model name, see https://huggingface.co/models for others
    # https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
    MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

    model: hugs.PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
        MODEL
    ).to(device)

    tokenizer: hugs.PreTrainedTokenizer = AutoTokenizer.from_pretrained(MODEL)

    # HUGS: the embeddings vectors, one for each token
    embeddings: npt.NDArray[np.float32] = \
        model.distilbert.embeddings.word_embeddings.weight.detach().cpu().numpy()

    # HUGS: name of the layer that produces token embeddings. The trulens
    # wrapping cell later in this notebook can be helpful in figuring out this
    # parameter.
    embeddings_layer: str = 'distilbert_embeddings_word_embeddings'

    # number of dimensions in token embedding
    embedding_size: int = embeddings.shape[1]

    # HUGS: maximum number of tokens to send to model
    max_length: int = 128

    # HUGS: Maximum number of instances we can evaluate the model on at once. This is
    # necessary when using a GPU with a limited amount of memory.
    rebatch_size: int = 16

    id_of_token: Dict[str, int] = tokenizer.get_vocab()
    token_of_id: Dict[int, str] = {v: k for k, v in id_of_token.items()}

    # number of tokens in vocabulary
    vocab_size: int = len(id_of_token)

    def _vocab(token_of_id, vocab_size):
        # Python list comprehension scoping workaround
        return np.array([token_of_id[i] for i in range(vocab_size)])

    # tokens in order
    vocab: npt.NDArray[str] = _vocab(token_of_id, vocab_size)

    # CLASS
    labels = ['negative', 'positive']

    # CLASS
    NEGATIVE: int = labels.index('negative')
    POSITIVE: int = labels.index('positive')

    def tokenize(texts: List[str]) -> Dict[str, torch.Tensor]:
        """
        Tokenize a list of `texts` into a form appropriate for `Model.model`.
        """
        return Model.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=Model.max_length,
            return_tensors='pt'
        ).to(Model.device)

    # CLASS
    def evaluate_to_logits(
        texts: List[str], batch_size=rebatch_size
    ) -> torch.Tensor:
        """
        Evaluate a collection of `texts` into their logits scores.
        """

        logits = []

        inputs = Model.tokenize(texts)

        # Gradients are not needed for plain evaluation; disabling them saves
        # memory.
        with torch.no_grad():
            for idx in tqdm(range(0, len(texts), batch_size),
                            desc="evaluating model"):

                batch_logits = Model.model(
                    input_ids=inputs['input_ids'][idx:idx + batch_size],
                    attention_mask=inputs['attention_mask'][idx:idx + batch_size]
                ).logits

                logits.append(batch_logits)

        return torch.concat(logits).cpu()

    # CLASS
    def evaluate_to_probits(
        texts: List[str], batch_size=rebatch_size
    ) -> torch.Tensor:
        """
        Evaluate a collection of `texts` into class probabilities (the softmax
        of the logits, called "probits" in this notebook).
        """

        logits = Model.evaluate_to_logits(
            texts, batch_size=batch_size
        )
        return torch.nn.functional.softmax(logits, dim=1).detach().cpu()

    # HUGS
    def token_str(token_id: int) -> str:
        """
        Given a `token_id`, produce a string of how it should be drawn.
        """
        tok = Model.tokenizer.decode(token_id)
        if tok.startswith("##"):
            # a leading "##" marks a word piece that continues the previous
            # token
            return tok[2:]
        else:
            # otherwise add a leading space to indicate a separation between
            # complete words
            return " " + tok

## Data pre-processing

This section demonstrates the initial steps of evaluating an NLP model: tokenization and conversion to embeddings.
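Under the hood, BERT-style tokenizers such as the one used here split each word greedily into the longest matching vocabulary pieces (WordPiece), marking pieces that continue a word with `##`. Each resulting token id then simply indexes a row of the embedding table. The sketch below is a simplified illustration with a made-up vocabulary and a random 4-dimensional embedding table; the real Huggingface tokenizer additionally lowercases, splits off punctuation, and handles padding and truncation.

```python
from typing import Dict, List

import numpy as np


def wordpiece(word: str, vocab: Dict[str, int], unk: str = "[UNK]") -> List[str]:
    """Split one word into vocabulary pieces, longest match first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary and embedding table (4 dimensions instead of 768).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "not": 4, "a": 5, "real": 6, "word": 7, "##le": 8}
embeddings = np.random.default_rng(0).normal(size=(len(vocab), 4))

words = "not a real wordle".split()
tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, vocab)] + ["[SEP]"]
ids = [vocab[t] for t in tokens]
embs = embeddings[ids]  # embedding lookup: one row per token

print(tokens)      # ['[CLS]', 'not', 'a', 'real', 'word', '##le', '[SEP]']
print(embs.shape)  # (7, 4)
```

This is also why `Model.token_str` strips a leading `##`: such a piece is drawn attached to the previous token rather than as a new word.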

In [None]:
#@title Skip: Details

# Utilities we will use for interactive parts of this notebook. Please ignore.
aiq_layout = dict(
    border="10px solid teal", padding="10px", width="100%", margin="0px"
)

# Interaction utilities.
def textbox(
    t="I'm a sentence. The last part of this sentence is not a real wordle.",
    c=True,
    d="input"
):
    """
    Create a text widget with initial value `t`, continuous updates if `c`, and
    description `d`.
    """
    return widgets.Text(
        value=t,
        continuous_update=c,
        layout=aiq_layout,
        description=d + (" (enter to update)" if not c else ""),
        style={'description_width': 'initial'}
    )

In [None]:
#@title Playground: Input processing
#@markdown
#@markdown Enter a piece of text in the widget below to see the tokenized version in the cell output. There you can see: 
#@markdown - "INPUT TEXT" -- the piece of text you entered.
#@markdown - "MODEL INPUTS" -- the data structure provided to the model that represents various aspects of the input text.
#@markdown - "TOKENS" -- string representations of the tokens making up the input text.
#@markdown - "EMBEDDINGS" -- the embedding representation of the tokens.

@interact(text=textbox())
def show_parse(text: str):

    print("INPUT TEXT\n", text, "\n")

    # Input sentences need to be tokenized first.
    inputs = Model.tokenize([text])

    # The tokenizer gives us vocabulary indexes for each input token (in this
    # case, tokens are words, punctuation marks, and word pieces; for example
    # "I'm" is split into "i", "'", and "m").

    print("MODEL INPUTS\n", inputs, "\n")

    # HUGS: Decoding helps in inspecting the tokenization produced:
    tokens = Model.tokenizer.batch_decode(torch.flatten(inputs['input_ids']))

    # Normally decode would give us a single string for each sentence but we would
    # not be able to see some of the non-word tokens there. Flattening first gives
    # us a string for each input_id.

    print("TOKENS\n", tokens, "\n")

    # HUGS: Each token is represented by a dense vector in the model; look up
    # the embedding row for each token id of the single input sentence.
    toks = inputs['input_ids'].detach().cpu().numpy()
    embs = Model.embeddings[toks[0]]

    print("EMBEDDINGS\n", embs, "\n")


In [None]:
#@title Playground: Model evaluation
#@markdown Enter a piece of text to evaluate the sentiment model on. In the cell output you should see:
#@markdown - The classification label (negative or positive).
#@markdown - The classification scores (two real numbers, one for each class).
#@markdown - The input text evaluated.

model_results = []


@interact(text=textbox())
def show_output(text):
    global model_results

    # Get the model-appropriate inputs from a single text instance:
    inputs = Model.tokenize([text])

    # Run the model on it:
    outputs = Model.model(**inputs)

    # CLASS: From logits we can extract the most likely class for each sentence and its
    # readable label.
    predictions = [Model.labels[i] for i in outputs.logits.argmax(axis=1)]

    # CLASS: Keep only the 10 most recent results.
    model_results.insert(
        0, (predictions[0], outputs.logits.detach().cpu().numpy()[0], text)
    )
    model_results = model_results[0:10]

    for result in model_results:
        print(*result)

## Embedding Space

This section visualizes the model's embedding space. It is based on t-SNE, a dimensionality reduction technique, which reduces the 768-dimensional embedding vectors to just 2 dimensions. Ideally, tokens that are nearby in the original space show up nearby in the visualization, but this is naturally not exact.
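One way to quantify how well "nearby in the original space" survives the projection is the fraction of each token's k nearest neighbours that are still among its k nearest neighbours in 2-D. The NumPy sketch below is an illustrative metric we chose, not part of this notebook or of openTSNE; it uses cosine distance in the original space (matching the `metric="cosine"` used for t-SNE below) and Euclidean distance in 2-D. It builds full pairwise distance matrices, so on the roughly 30,000-token vocabulary it should be applied to a random subsample of rows.

```python
import numpy as np


def knn_indices(points: np.ndarray, k: int, metric: str = "euclidean") -> np.ndarray:
    """Indices of each point's k nearest neighbours, excluding the point itself."""
    if metric == "cosine":
        normed = points / np.linalg.norm(points, axis=1, keepdims=True)
        dist = 1.0 - normed @ normed.T
    else:
        # Squared Euclidean distances; same ordering as plain distances.
        sq = (points ** 2).sum(axis=1)
        dist = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]


def neighbourhood_overlap(high: np.ndarray, low: np.ndarray, k: int = 10,
                          high_metric: str = "cosine") -> float:
    """Average fraction of each point's k nearest neighbours in `high` that are
    also among its k nearest neighbours in the projection `low`."""
    nn_high = knn_indices(high, k, metric=high_metric)
    nn_low = knn_indices(low, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlaps))


rng = np.random.default_rng(0)
high = rng.normal(size=(200, 16))  # stand-in for token embeddings
low = rng.normal(size=(200, 2))    # stand-in for an unrelated 2-D layout

print(neighbourhood_overlap(high, low))  # low: the layout ignores the data
print(neighbourhood_overlap(high, high, high_metric="euclidean"))  # 1.0
```

In the notebook, something like `idx = rng.choice(Model.vocab_size, 2000, replace=False)` followed by `neighbourhood_overlap(Model.embeddings[idx], np.asarray(tsne_embedding)[idx])` would score the loaded layout; neighbours are then counted within the subsample only.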

In [None]:
#@title Skip: Details

# AIQ: The following computation takes too long on colab. The results should
# have been downloaded for you earlier in this notebook.

# HUGS: If using a different model, you might have to recompute this or just skip this step.
tsne_filename = Path("tsne_embedding.lzma")
if tsne_filename.exists():
    print("loading dimensionality reduction")
    with lzma.open(tsne_filename, mode='rb') as fh:
        tsne_embedding = pickle.load(fh)

else:
    print(
        "computing, if you are running this in colab, be prepared to wait a long time"
    )
    man = TSNE(
        n_jobs=mp.cpu_count(),
        verbose=True,
        n_iter=10000,
        learning_rate=200,
        negative_gradient_method='bh',
        metric="cosine"
    )
    tsne_embedding = man.fit(Model.embeddings)

    print("saving")
    with lzma.open(tsne_filename, mode='wb') as fh:
        pickle.dump(obj=tsne_embedding, file=fh)

In [None]:
#@title Playground: Embedding space
#@markdown In this cell you should see a visualization of token embeddings. As they are high-dimensional vectors, they have been reduced to 2-dimensional points via a dimensionality reduction technique. While a lot of information about a token is lost in this visualization, some higher-level patterns can be seen:
#@markdown - Long strings of points tend to represent tokens that have some notion of order. For example, the tokens representing the years 1831 through 2018 appear in the lower right portion of the figure, and the tokens "1st" through "19th" appear right above the year sequence.
#@markdown - Pairs or small groups of related or semantically equivalent tokens make up the bulk of the visualization, for example "presently" next to "nowadays".
#@markdown - Large clumps of special tokens or rarer tokens from non-English languages, such as the large clump in the upper left that contains many Japanese or Chinese characters.

# AIQ: This is a computationally intensive picture.

plotly_layout = dict(paper_bgcolor="teal", margin=dict(l=10, r=10, t=10, b=10))

fig = Figure(layout=dict(width=800, height=800, **plotly_layout))
fig.update_xaxes(
    showticklabels=False
)
fig.update_yaxes(
    showticklabels=False
)
fig.add_scatter(
    x=tsne_embedding[:, 0],
    y=tsne_embedding[:, 1],
    text=Model.vocab[:],
    mode='markers',
    marker_size=2
)

display(fig)

## Performance

We load the [Rotten Tomatoes movie review sentiment dataset](https://huggingface.co/datasets/rotten_tomatoes) as the first source of data. Later, in the drift section, we will add a different sentiment dataset.

In [None]:
#@title Rotten tomatoes dataset

# DATA: https://huggingface.co/datasets/rotten_tomatoes
rotten_train

In [None]:
#@title Accuracy measurement and accuracy on rotten tomatoes

def accuracy(X: Sequence[str], Y_true: Sequence[int]) -> float:
    """
    Determine model accuracy on the given dataset `X` with ground truth labels
    `Y_true`. If this is running slowly, you might be running without a GPU.
    """

    # CLASS
    Y_probits = Model.evaluate_to_probits(X).detach().cpu().numpy()

    # CLASS
    Y_pred = np.argmax(Y_probits, axis=1)
    correct = Y_pred == np.asarray(Y_true)

    return correct.mean()

# DATA
for dataset_name, X, Ytrue in [
    ("rotten train", rotten_train['text'], rotten_train['label']),
    ("rotten test", rotten_test['text'], rotten_test['label'])]:
    print(dataset_name, f"accuracy = {accuracy(X, Ytrue) * 100:0.2f} %")


## Trulens: Model Wrapper

As in the prior notebooks, we need to wrap the pytorch model with the appropriate Trulens functionality.

In [None]:
#@title Trulens wrapping

# HUGS: Uncommenting `print_layer_names` below lists the model's layers, which
# can help in figuring out the embedding layer name for different models.
Model.wrapper = get_model_wrapper(Model.model, device=Model.device)

# Model.wrapper.print_layer_names()

# Explanations

In [None]:
#@title Trulens Integrated Gradients and Visualization setup

# HUGS: Set up the attribution method, here "Integrated Gradients".
# CLASS: You may need to change `qoi`, potentially to a `LambdaQoI`.
common_attributor_arguments = dict(
    model=Model.wrapper,
    resolution=128,
    rebatch_size=32,
    doi_cut=Cut(Model.embeddings_layer),
    qoi=ClassQoI(Model.POSITIVE),
    qoi_cut=OutputCut(accessor=lambda o: o['logits'])
)
# HUGS
infl = IntegratedGradients(
    **common_attributor_arguments
)

from trulens.visualizations import NLP

# HUGS: Set up visualization utilities.
V = NLP(
    wrapper=Model.wrapper,
    labels=Model.labels,
    decode=Model.token_str,
    # huggingface models can take as input the keyword args as produced by
    # their tokenizers.
    tokenize=lambda sentences: ModelInputs(kwargs=Model.tokenize(sentences)).map(lambda t: t.to(Model.device)),
    # for huggingface models, input/token ids are under the input_ids key in
    # the input dictionary
    input_accessor=lambda x: x.kwargs['input_ids'],
    # and logits under the 'logits' key in the output dictionary
    output_accessor=lambda x: x['logits'],
    # do not display these tokens
    hidden_tokens=set([Model.tokenizer.pad_token_id])
)

In [None]:
#@title Playground: Attribution
#@markdown Enter a piece of text in the box below and press enter to display an explanation of which parts of the input were most important to the output. The output rows include:
#@markdown - The "quantity of interest", which indicates which aspect of the model output is being explained; in this case it is the positive class score. This is indicated both by "ClassQoI_1", where 1 is the index of the positive score, and by the teal square around the name of that class, "positive".
#@markdown - The classification outcome (white rectangle around the class name). The relative scores of the classes are indicated by the green bars above each class label.
#@markdown - The influence of each token (positive/green or negative/red contributions) on the quantity of interest.

results = []

@interact(text=textbox(c=False))
def show_attribution(text):
    global results

    # Token attribution visualization takes in a list of sentences and the
    # attribution method to compute the attributions.
    token_attribution = V.tokens([text], infl) 

    results.insert(0, token_attribution)
    results = results[:10]

    for result in results:
        display(result)

## Baselines

We see in the above results that special tokens, such as the sentence-start **[CLS]** and sentence-end **[SEP]** tokens, are found to contribute a lot to the model outputs. While this may be useful in some contexts, we are more interested in the contributions of the actual words in these sentences. To focus on the words, we need to adjust the **baseline** used in the integrated gradients computation. By default, in the instantiation so far, the baseline for each token is a zero vector of the same shape as its embedding. By making the baseline identical to the explained instance on the special tokens, we remove their impact from the measurement. Trulens provides the `token_baseline` utility for this purpose; it constructs the functions that compute the appropriate baseline.
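To see why matching the baseline on special tokens removes their attribution, recall that integrated gradients multiplies `(x - baseline)` by the gradient averaged along the straight path from the baseline to the input, so any position where the two are equal gets exactly zero. The NumPy sketch below uses a hypothetical toy model with an analytic gradient (not DistilBERT, and not the Trulens implementation), a midpoint Riemann sum for the path integral, and a baseline that keeps the first and last positions (standing in for [CLS] and [SEP]) while replacing the rest with a made-up [PAD] embedding.

```python
import numpy as np


def toy_model(emb: np.ndarray, w: np.ndarray) -> float:
    """Hypothetical scorer: sum over tokens of tanh(token_embedding . w)."""
    return float(np.tanh(emb @ w).sum())


def toy_grad(emb: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Analytic gradient of toy_model with respect to every embedding entry."""
    s = emb @ w  # one score per token
    return (1.0 - np.tanh(s) ** 2)[:, None] * w[None, :]


def integrated_gradients(x, baseline, w, steps: int = 128) -> np.ndarray:
    """Per-token integrated gradients attributions (summed over embedding dims)."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += toy_grad(baseline + a * (x - baseline), w) / steps
    return ((x - baseline) * avg_grad).sum(axis=1)


rng = np.random.default_rng(0)
num_tokens, dim = 5, 8
x = 0.3 * rng.normal(size=(num_tokens, dim))  # stand-in input embeddings
w = 0.3 * rng.normal(size=dim)
pad_emb = 0.3 * rng.normal(size=dim)          # stand-in [PAD] embedding

# Keep positions 0 and 4 (think [CLS] and [SEP]); replace the others with [PAD].
keep = np.array([True, False, False, False, True])
token_baseline_emb = np.where(keep[:, None], x, pad_emb[None, :])
zero_baseline_emb = np.zeros_like(x)

for name, b in [("zero", zero_baseline_emb), ("token", token_baseline_emb)]:
    attr = integrated_gradients(x, b, w)
    # Completeness: attributions add up to f(x) - f(baseline).
    print(name, attr.round(4), round(float(attr.sum()), 4),
          round(toy_model(x, w) - toy_model(b, w), 4))
```

With the zero baseline every position, including the kept ones, can receive attribution; with the token baseline the kept positions get exactly zero, and the total is redistributed over the replaced word positions.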

In [None]:
#@title Baseline setup

# HUGS
inputs_baseline_ids, inputs_baseline_embeddings = token_baseline(
    # Which tokens to preserve.
    keep_tokens=set([Model.tokenizer.cls_token_id, Model.tokenizer.sep_token_id]),

    # HUGS: What to replace the other tokens with.
    replacement_token=Model.tokenizer.pad_token_id,

    # AIQ: Try changing the `replacement_token` parameter to other special or
    # non-special tokens.
    # replacement_token=Model.tokenizer.mask_token_id,
    # replacement_token=Model.tokenizer.vocab["happy"],

    # How to get the token ids from the model inputs.
    input_accessor=lambda x: x.kwargs['input_ids'],

    # Callable to produce embeddings from token ids.
    ids_to_embeddings=Model.model.get_input_embeddings()
)

We can now inspect the baselines on some example sentences. The first function returned by `token_baseline` gives us token ids to inspect, while the second gives us the baseline embeddings, which we will pass to the attribution method.
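At the level of token ids, the idea behind this baseline is simple: keep the tokens listed in `keep_tokens` and swap every other token for `replacement_token`. The plain-Python sketch below mirrors only that idea with a hypothetical `baseline_ids` helper; it is not the Trulens implementation, and the notebook's `token_baseline` call additionally turns ids into embeddings through `ids_to_embeddings`. The ids 101, 102 and 0 are the [CLS], [SEP] and [PAD] ids of the uncased BERT vocabulary this model uses (compare with `Model.tokenizer.cls_token_id` and friends); the other ids are arbitrary.

```python
from typing import Iterable, List, Set


def baseline_ids(input_ids: Iterable[int], keep_tokens: Set[int],
                 replacement_token: int) -> List[int]:
    """Keep ids in `keep_tokens`; replace every other id with `replacement_token`."""
    return [tok if tok in keep_tokens else replacement_token for tok in input_ids]


CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
example_ids = [CLS_ID, 2000, 2001, 2002, SEP_ID]  # word ids are arbitrary here

print(baseline_ids(example_ids, {CLS_ID, SEP_ID}, PAD_ID))
# [101, 0, 0, 0, 102]
```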

In [None]:
#@title Playground: Attribution with Pad Baseline (default baseline on left, Pad on right)
#@markdown The given text is explained with both the "default" baseline (left) and a baseline that replaces non-special tokens with "[PAD]" (right).
#@markdown - Notice the influence of special tokens at the start and end of the input with the two different baselines.
#@markdown
#@markdown

infl_positive_baseline = IntegratedGradients(
    baseline=inputs_baseline_embeddings, **common_attributor_arguments
)

results2 = []


@interact(text=textbox(c=False))
def show_attribution(text):
    global results2

    default_result = widgets.HTML(V.tokens([text], infl).data)
    baseline_result = widgets.HTML(
        V.tokens([text], infl_positive_baseline).data
    )

    results2.insert(0, (default_result, baseline_result))
    results2 = results2[:3]

    parts = []

    for result in results2:
        parts.append(widgets.HBox(result))

    display(widgets.VBox(parts))

# Fairness

In [None]:
#@title Skip: Details

def word_pattern(word: str) -> str:
    """
    Create a pattern that matches the given `word` as long as it is not
    immediately next to an alpha-numeric character.
    """
    return r"(?<!\w)" + re.escape(word) + r"(?!\w)"


def swap(thing1: str, thing2: str) -> Callable[[str], str]:
    """
    Create a function to swap occurrences of `thing1` and `thing2`.
    """

    pat_swapper = re.compile(r":swapper:")
    pat1 = re.compile(word_pattern(thing1), re.IGNORECASE)
    pat2 = re.compile(word_pattern(thing2), re.IGNORECASE)

    def f(sentence: str):
        """
        Swap instances of thing1 and thing2 in sentence.
        """

        temp1 = pat1.sub(":swapper:", sentence)
        temp2 = pat2.sub(thing1, temp1)
        temp3 = pat_swapper.sub(thing2, temp2)
        return temp3

    return f


def contains(s: str, pat: re.Pattern) -> bool:
    """
    Determine whether the given string `s` satisfies regular expression `pat`.
    """
    return pat.search(s) is not None


def get_sentence_pairs(token_pairs: List[Tuple[str, str]],
                       texts: List[str]) -> List[Tuple[str, str]]:
    """
    Create sentence pairs from examples in `texts` that swap words from the
    pairs list `token_pairs`.
    """

    patterns = [
        re.compile(
            "|".join([word_pattern(tok) for tok in pair]), re.IGNORECASE
        ) for pair in token_pairs
    ]
    swappers = [swap(*pair) for pair in token_pairs]

    sentence_pairs = [
        (sentence, swapper(sentence))
        for pattern, swapper in
        tqdm(zip(patterns, swappers), total=len(token_pairs), desc="finding swap pairs", unit="pair", leave=False)
        for sentence in tqdm(texts, desc="processing sentences", leave=False)
        if contains(sentence, pattern)
    ]

    print(f"found {len(sentence_pairs)} sentence pair(s)")

    return sentence_pairs

# CLASS
def compute_pair_disparities(
    sentence_pairs: List[Tuple[str, str]]
) -> List[Tuple[Tuple[str, str], float]]:
    """
    Given a collection of `sentence_pairs`, produce a list of tuples containing
    the pairs as the first element and the disparity in model scores as the
    second.
    """

    diffs = []
    
    # CLASS
    a_probits = Model.evaluate_to_probits([pair[0] for pair in sentence_pairs])
    b_probits = Model.evaluate_to_probits([pair[1] for pair in sentence_pairs])

    # CLASS
    for a_probit, b_probit in tqdm(zip(a_probits, b_probits),
                                   desc="comparing probits"):

        diffs.append(
            torch.nn.functional.cross_entropy(
                torch.unsqueeze(a_probit, dim=0),
                torch.unsqueeze(b_probit, dim=0)
            ).detach().cpu().numpy()
        )

    diffs = np.array(diffs)
    diffs_pairs = list(
        reversed(sorted(zip(sentence_pairs, diffs), key=lambda pair: pair[1]))
    )

    return diffs_pairs


def show_biggest_disparities(
    diffs: List[Tuple[Tuple[str, str], float]],
    attributor=infl_positive_baseline,
    n=3
) -> None:
    """
    Display the top disparate pairs along with their attributions.
    """

    display(
        V.tokens_stability(
            texts1=[p[0][0] for p in diffs][0:n],
            texts2=[p[0][1] for p in diffs][0:n],
            attributor=attributor
        )
    )

## Robustness as fairness

Does the model change its prediction if we replace a gendered word with its equivalent of the opposite gender?
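Two simple ways to summarize the answer over all swapped pairs are the fraction of pairs whose predicted label flips and the average change in the positive-class probability. The sketch below is an illustrative addition (the notebook itself ranks pairs with `compute_pair_disparities`); the probability arrays are made up, and in the notebook they could come from `Model.evaluate_to_probits(...).numpy()` applied to the first and second sentences of each pair.

```python
import numpy as np

POSITIVE = 1  # index of the positive class, as in Model.POSITIVE


def flip_rate(probs_a: np.ndarray, probs_b: np.ndarray) -> float:
    """Fraction of pairs whose predicted class changes after the word swap."""
    return float(np.mean(probs_a.argmax(axis=1) != probs_b.argmax(axis=1)))


def mean_positive_shift(probs_a: np.ndarray, probs_b: np.ndarray) -> float:
    """Average absolute change in the positive-class probability."""
    return float(np.mean(np.abs(probs_a[:, POSITIVE] - probs_b[:, POSITIVE])))


# Made-up class probabilities for three original / swapped sentence pairs.
probs_original = np.array([[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]])
probs_swapped = np.array([[0.85, 0.15], [0.25, 0.75], [0.6, 0.4]])

print(flip_rate(probs_original, probs_swapped))  # only the third pair flips
print(mean_positive_shift(probs_original, probs_swapped))
```

The flip rate answers the yes/no question directly, while the probability shift also captures smaller disparities that do not cross the decision boundary.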

In [None]:
#@title Gendered pairs

gender_pairs = [
    ('he', 'she'),
    ('guy', 'gal'),
    ('himself', 'herself'),
    ('boy', 'girl'),
    ('husband', 'wife'),
    ('man', 'woman'),
    ('men', 'women'),
    ('brother', 'sister'),
    ('uncle', 'aunt'),
    ('nephew', 'niece'),
    ('dad', 'mom'),
    ('father', 'mother'),
    ('son', 'daughter'),
    ('actor', 'actress'),
    ('male', 'female'),
    ('hero', 'heroine'),
]

sentence_pairs_gender = get_sentence_pairs(gender_pairs, rotten_texts)
diffs_pairs_gender = compute_pair_disparities(sentence_pairs_gender)

show_biggest_disparities(diffs_pairs_gender)

In [None]:
#@title Playground: Fairness robustness under token substitutions
#@markdown Enter two related tokens in the fields "token1" and "token2" to see which instances in the selected dataset differ the most in model outcomes between equivalent sentences that differ only in the two tokens. Some other examples to try:
#@markdown - "he" vs "she"
#@markdown - "hero" vs "heroine"
#@markdown - ...

@interact(
    token1=textbox("hero", d="token1", c=False),
    token2=textbox("heroine", d="token2", c=False),
    dataset=widgets.Dropdown(
        layout=aiq_layout,
        options=[
          ("rotten tomatoes", rotten_texts),
          ("imdb", imdb_texts),
          ("tweet", tweet_texts)
        ], 
        style={'description_width': 'initial'}
))
def show_disparities(token1, token2, dataset):
    if token1 == "" or token2 == "":
        return

    sentence_pairs = get_sentence_pairs([(token1, token2)], dataset)

    if len(sentence_pairs) == 0:
        return

    diffs_pairs = compute_pair_disparities(sentence_pairs)
    show_biggest_disparities(diffs_pairs)

## Gender in embedding space
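The cell below works with a precomputed `gender_vector` and two basic operations: measuring how strongly an embedding points along that direction (a dot product of normalized vectors) and "neutralizing" an embedding by subtracting its component along the direction. The NumPy sketch below shows both on random stand-in vectors; the helper names are hypothetical. Subtracting `np.dot(emb, g) * g` removes the full component only when `g` has unit length, so the sketch normalizes the direction first.

```python
import numpy as np


def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)


def direction_score(emb: np.ndarray, direction: np.ndarray) -> float:
    """Cosine similarity between an embedding and a direction."""
    return float(np.dot(unit(emb), unit(direction)))


def neutralize(emb: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of `emb` that lies along `direction`."""
    g = unit(direction)
    return emb - np.dot(emb, g) * g


rng = np.random.default_rng(0)
direction = rng.normal(size=768)  # stand-in for gender_vector
emb = rng.normal(size=768)        # stand-in for a token embedding

print(direction_score(emb, direction))
print(direction_score(neutralize(emb, direction), direction))  # ~0 up to rounding
```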

In [None]:
#@title Skip: Details

# A vector approximating the difference between embeddings of pairs of words
# of the opposite gender. This one is for the token embedding used in
# distilbert.

# HUGS: Do not expect this to work with other models.
gender_vector: npt.NDArray['float16'] = np.frombuffer(
    base64.b85decode(
        b'?4-;r+n_BbxT&BWWUIlV^Qc9YM<?$ffFvBOyeVWPZ6QZ1ttD(IWT3Vs@g>z1(IHALq#<!8`6&IFovN59FQQ<rjVfRxf2awiYAAOf0xHHLzn)JeL8GLalpxcndmiqk3oS&V=c3uGKP<ea=_dCgN-Vq~^C<SDBB=f;{-m#?lBazp6`Fi20ii4{HmQFaeJNI`eJG!zN2p7nT&MsSPpc;*^d-=!XDFVgZ7N3{HKO`0uqm#msif;Ew5=Wq#ib{lJ)>zVdn~jjwWg4$E-Hm9mL8ELf+nV@>7XN}EFh#J6Du&8d89uqq^KIJJfWAH<s(?A83iFILnMlrgei)r<*S^fJgHDAWT^b8{i%wqmaLPhbS9T6Jt#99Sf6L8+bydo;3dAJS}Uz4E1CW%<)@0Q!Ypv2O(|R<c%D?4xR(l{UL)hBP^^)syPV#omZOCyB&3O+Iv%o~8KW|tw5a_oU?h^MQ!4KuK_zc2qAUp_g(MRpzbQy4vZ$V-e<o?1A1Fqg-lZKX51h559ip|Y3@9lngeeCdV4_MV1tujcN-QR*NT-IaRil6?RICoHU?Y?#zN8MQpeo*^uBGuJ$tk-V*C>ytF{UOeM4I9z{VUa<Vx*-d-l$9`fGFf76r}JNuPLCYkTBb<^`&<w5-3!r4Wiwe2&9!9nI?EFcBnfjS1ipX@gcS+XD6R2W3AF1h>!iL@|kC$dla6YHl}MM<0kVa@~U*KA0pVH3?n6;I4Dsj{2DZ*bRJnH^CRFT$f!e}dniDvR;=tH3Mn0=KqeCw)1Lk)mZ)|kEUBrWAE-g9fhH>_i>J6EXBbkVJ|R9RW2H-=R-}(7FDB6^EvphMpez}nRGA2_{iNV0kf_8hG_1QR;-b?fwIDDpnI}W4-5<av4k0q42p)f_%Ay~t-K6*-@+Mjz%$YHy51(}(H7Uv>(<l=m@*IUHzoEq}BB-CJ^(USw-Yae=oTara=C19iiYMJ3?yV>$*qY=iC!+l-R;E;>I4FCe5iVFIf2tZC5UZpoNG!mnxS~QM38iYMm?*WNN1_!Zh^a27Eu`Qoc$}IfR+ikN7@WSNXQ}EcY!wqH2PuOl#HIx#l_0O7ogf&f@u_Slmm`>_AgeQ^RwvCWk}20BY9RBYb|sFgpeAH0Dx#e#tt9TASt-Dq@1$NOs3duoK`EM`FQsOske;`wT_Z0g0w<>+xh5K%eJIJG%cHs?)}7U(Vx_Yrcpk?fF(C#ZaHY4W)TXW`zn~3}NvZsnOQu36Q6W#L)T+R%yeKOyr=sMY+ou^QsHsz@^&6EdH6?hZOqr#q&8vec0VOpo7b`0ylcE<TLM8XAv7>h-WRQC!;i)z*1gXcS@g$NXP%J#EOdu8^39HJf-!4$5t1F|X>nB1bOs2G`_@1C8;w3n$h$<i`f+X;tZX^b#;Ucys8?I0(!zlPFE1h{F1FV`R&#Ig!^d}%BvmpE}lBB68#i@S|a3H`YHy+|3qoeSwl&35tv!2JIffi^f_$lxpN~Hd%My0>0I32|%;S;zZ38EJ*&ZM^~8z+q-g_sARQK)mI6e2np%BPPeTP$EE%&YOJg(wxF_94wE;wph611IgKq9+HaTdc1oA14&4ysA2+%p*K3yQG9HBPA3nuBm+}Bc<mgCL^*Z>ztA#$1O7^A1jL_&?9yoLLzG;4W}ldVyJc}@u_hmODydxda9wBL7j9dl?n!@W-8FCn5dm7l`5;HN2rgX&7LBtm8K*teWQw>D=p^^x~S!>GaR6#Mkv3kktndAR2qe)DyAc)iKy+Q&ZhH}K&En`@+qt+lOfouwj+$EbEZh08mM%hgeI${1Sv!*_$5}TS09xr!ljccx+a^awVhe0JfIpJSSmoK&7Kw@hbW;WB_D1h{;R}`OQ|d^%d6>`KqUt#-K4;xLadW0q^d!nF(UV!=_;$CJ1X3%@2E4Yk*i)Mz>y9k<fnnC5Gh|R<|O+jB#_&s2B9h)'
    ),
    dtype='float16'
)


def normalize(v: npt.NDArray[float]) -> npt.NDArray[float]:
    """
    Normalize a single vector.
    """
    return v / np.linalg.norm(v, ord=2)


def normalize_many(v: npt.NDArray[float]) -> npt.NDArray[float]:
    """
    Normalize an array of vectors.
    """
    return v / np.linalg.norm(v, axis=1, ord=2)[:, np.newaxis]


embeddings_norm = normalize_many(Model.embeddings)
baseline_penalties = np.abs(np.dot(embeddings_norm, gender_vector))


def embedding_opposite_id(emb: np.ndarray) -> Tuple[int, float]:
    """
    Get the token id of the token closest to the gender-opposite of the given
    `emb`.
    """

    # HUGS
    emb = normalize(emb)
    scores = np.abs(
        np.dot(
            normalize_many(emb - embeddings_norm + 0.000000001), gender_vector
        )
    ) - 0.55 * baseline_penalties

    best = np.argmax(scores)

    return best, scores[best]


def embedding_opposite(emb: np.ndarray) -> np.ndarray:
    """
    Try to find the embedding of a token whose gender is opposite to that of
    the given `emb`. If no good candidate is found, returns `emb` unchanged.
    """

    best_id, best_score = embedding_opposite_id(emb)

    # DATA
    if best_score > 0.25:
        return Model.embeddings[best_id]
    else:
        return emb


def embedding_neutralize(emb: np.ndarray) -> np.ndarray:
    """
    Remove the component of the given embedding that points in the gender
    direction.
    """
    return emb - np.dot(emb, gender_vector) * gender_vector


@functools.lru_cache(maxsize=Model.vocab_size)
def token_id_opposite(token_id: int):
    """
    Try to find the opposite of `token_id` along the direction of
    `gender_vector`. If a good candidate is not found, returns the given
    `token_id` instead.
    """
    best_id, best_score = embedding_opposite_id(embeddings_norm[token_id])

    # DATA
    if best_score > 0.20:
        return best_id
    else:
        return token_id


def swap_token(token: str) -> str:
    """
    Attempts to find a token of the opposite gender of the given `token`.
    """

    a_id = Model.id_of_token[token]
    b_id = token_id_opposite(a_id)
    return Model.token_of_id[b_id]
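The neutralization in `embedding_neutralize` is a vector projection: subtracting `(emb · g) g` removes the part of `emb` that lies along a direction `g`. This only removes the whole component when `g` has unit length, which the formula implicitly assumes. A minimal pure-Python sketch of the same operation, using a made-up 3-dimensional "gender direction" and token vector in place of real embeddings:

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def neutralize(emb: Vector, direction: Vector) -> Vector:
    """Remove the component of `emb` along the unit vector `direction`."""
    scale = dot(emb, direction)
    return [e - scale * d for e, d in zip(emb, direction)]


# Hypothetical vectors for illustration only (not taken from the model).
gender = normalize([1.0, -1.0, 0.0])
doctor = [0.8, 0.2, 0.5]

neutral = neutralize(doctor, gender)
print(neutral)
print(dot(neutral, gender))  # zero up to floating-point rounding
```

Because the projection is subtracted, the result is orthogonal to `g` and differs from the original only along `g`; here the neutralized vector comes out as approximately `[0.5, 0.5, 0.5]`.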

In [None]:
#@title Playground: Gender in the embedding space
#@markdown The embedding visualization shown earlier in this notebook now indicates each token's gender component by color: bluer tokens are more female, while redder tokens are more male.
#@markdown - Use the range slider above the visualization to filter out tokens with a small gender magnitude. The highlighted area between the slider's two handles marks the range of gender values that is hidden, so only tokens with a larger magnitude are shown.
#@markdown - If you drag both handles to their end points, only the most female and most male tokens remain.
#@markdown - Also note that tokens of similar gender form clusters, even though gender is only one of many aspects of a token embedding.
#@markdown - On the other hand, there are clear deviations from the overall color/gender pattern. For example, the pairs "man"/"men" and "woman"/"women" sit right next to each other even though the two pairs have opposite genders.

# geometry of gender in embedding space

# AIQ: This is a computationally intensive picture. It is only useful if you use
# the t-SNE reduction.

color = np.dot(normalize_many(Model.embeddings), gender_vector)
cmin = color.min()
cmax = color.max()
  
fig = Figure(layout=dict(width=800, height=800, **plotly_layout))
fig.update_xaxes(
    showticklabels=False
)
fig.update_yaxes(
    showticklabels=False
)
s = fig.add_scatter(
    x=[],
    y=[],
    text=Model.vocab,
    mode='markers',
    marker={
        'cmin': cmin,
        'cmax': cmax,
        'colorscale': "Picnic",
        'color': [],
        'colorbar': dict(thickness=20)
    },
    marker_size=4
)

@interact(hide = widgets.FloatRangeSlider(continuous_update=False, value=[cmin/8, cmax/8], min=cmin, max=cmax, step=0.01, layout=aiq_layout))
def show_gender_space(hide):

  most_gendered = (color >= hide[1]) | (color <= hide[0])
  s.data[0].update(
    x=tsne_embedding[most_gendered, 0],
    y=tsne_embedding[most_gendered, 1],
    text=Model.vocab[most_gendered], 
    marker={
      'color': color[most_gendered],
    }
  )

  display(fig)
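The range slider hides every token whose gender score falls strictly between the two handles; only scores at or beyond the handles are drawn, mirroring the mask `(color >= hide[1]) | (color <= hide[0])`. A small sketch of that filter, with hypothetical tokens and made-up scores (the sign convention, negative for female, is also an assumption here):

```python
from typing import List, Tuple


def most_gendered(tokens: List[str], scores: List[float],
                  hide: Tuple[float, float]) -> List[str]:
    """Keep tokens whose gender score lies at or outside the hidden band [low, high]."""
    low, high = hide
    return [t for t, s in zip(tokens, scores) if s >= high or s <= low]


# Made-up scores; in the notebook they are dot products of the normalized
# embeddings with `gender_vector`.
tokens = ["she", "he", "table", "queen", "king", "run"]
scores = [-0.40, 0.35, 0.01, -0.30, 0.28, -0.02]

print(most_gendered(tokens, scores, hide=(-0.05, 0.05)))  # narrow hidden band
print(most_gendered(tokens, scores, hide=(-0.35, 0.30)))  # wide hidden band
```

Widening the hidden band leaves only the most extreme tokens, which is what dragging both handles toward the ends does in the plot.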


## Embedding debiasing

In [None]:
#@title Gender neutralized baseline definition

def baseline_neutralize(z: torch.Tensor) -> torch.Tensor:
    """
    Given an input tensor of embeddings, produce a baseline with their gender
    component removed. This is useful for words that should not carry a gender
    component, such as "doctor" or "nurse".
    """

    if isinstance(z, torch.Tensor):
        z = z.detach().cpu().numpy()

    return torch.tensor(
        np.array(
            [[embedding_neutralize(emb) for emb in instance] for instance in z]
        )
    ).to(Model.device)


infl_neutralize_gender = IntegratedGradients(
    baseline=baseline_neutralize, **common_attributor_arguments
)

In [None]:
#@title Playground: Attribution to gender (original attribution on left, attribution to gender on right)
#@markdown In this explanation variant, we have changed the baseline so that it only changes the gender component of gendered words. Words without clear gender components in their embedding are unchanged.
#@markdown - Try changing the various pronouns in the example sentence, one at a time and in various combinations.

results3 = []


@interact(
    text=textbox(
        t=
        "Johnson has, in his first film, set himself a task he is not nearly up to.",
        c=False
    )
)
def show_attribution(text):
    global results3

    default_result = widgets.HTML(V.tokens([text], infl_positive_baseline).data)
    baseline_result = widgets.HTML(
        V.tokens([text], infl_neutralize_gender).data
    )

    results3.insert(0, (default_result, baseline_result))
    results3 = results3[:3]

    parts = []

    for result in results3:
        parts.append(widgets.HBox(result))

    display(widgets.VBox(parts))
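Integrated gradients attributes the difference `f(input) - f(baseline)` by averaging gradients along the straight path from the baseline to the input and multiplying by `input - baseline`. With a gender-neutralized baseline, that difference is nonzero only along the gender direction, so tokens without a gender component receive no attribution. The sketch below is a from-scratch illustration of this idea, not TruLens's `IntegratedGradients` implementation; the toy sigmoid model, its weights, the 3-dimensional embeddings, and summing over embedding dimensions to get per-token scores are all assumptions:

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def neutralize(emb: Vector, direction: Vector) -> Vector:
    """Remove the component of `emb` along the unit vector `direction`."""
    scale = dot(emb, direction)
    return [e - scale * d for e, d in zip(emb, direction)]


def toy_model(tokens: List[Vector], w: Vector) -> float:
    """Hypothetical stand-in model: sigmoid of the summed per-token scores."""
    z = sum(dot(t, w) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def toy_model_grad(tokens: List[Vector], w: Vector) -> List[Vector]:
    """Gradient of `toy_model` with respect to every embedding coordinate."""
    p = toy_model(tokens, w)
    # d sigmoid(z)/dz = p * (1 - p), and dz / d tokens[k][i] = w[i] for every token k.
    return [[p * (1 - p) * wi for wi in w] for _ in tokens]


def integrated_gradients(tokens: List[Vector], baseline: List[Vector],
                         w: Vector, steps: int = 200) -> List[float]:
    """Per-token IG attributions via a midpoint Riemann sum along the straight path."""
    avg_grads = [[0.0] * len(t) for t in tokens]
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [[b + alpha * (x - b) for x, b in zip(t, bt)]
                 for t, bt in zip(tokens, baseline)]
        for ti, g_tok in enumerate(toy_model_grad(point, w)):
            for i, g in enumerate(g_tok):
                avg_grads[ti][i] += g / steps
    # Multiply the averaged gradients by (input - baseline), then sum over dimensions.
    return [sum((x - b) * g for x, b, g in zip(t, bt, gt))
            for t, bt, gt in zip(tokens, baseline, avg_grads)]


gender = [1.0, 0.0, 0.0]      # hypothetical unit-length gender direction
tokens = [[0.9, 0.1, 0.3],    # token with a gender component (e.g. "he")
          [0.0, 0.5, -0.2]]   # token with no gender component
w = [0.7, -0.4, 1.1]          # hypothetical model weights

baseline = [neutralize(t, gender) for t in tokens]
attributions = integrated_gradients(tokens, baseline, w)
print(attributions)
# Completeness: attributions sum to f(input) - f(baseline), up to the Riemann error.
print(sum(attributions), toy_model(tokens, w) - toy_model(baseline, w))
```

The second token is unchanged by neutralization, so its attribution is exactly zero; all of the score change is credited to the gendered token.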

# Drift

In [None]:
#@title IMDB dataset

# Get another dataset to compare to.

# The IMDB dataset is large, so we take only a portion of it for speed:
n = 2000

# DATA: https://huggingface.co/datasets/imdb
imdb_train

In [None]:
#@title Dataset sampling and IMDB dataset accuracy

def sample(items: List[Sequence], n: int) -> Tuple[List, ...]:
  """
  Take a sample of `n` elements from the sequences in `items`. If more than one
  sequence is given, the same indices are sampled from each, so this is
  appropriate for (X, Y) pairs.
  """
  n = min(n, len(items[0]))
  items = list(map(np.array, items))
  # Sample without replacement so that no example is drawn twice.
  indices = np.random.choice(len(items[0]), size=n, replace=False)
  return tuple(map(lambda array: list(array[indices]), items))

# DATA
for dataset_name, (X, Ytrue) in [
    ("imdb train", sample([imdb_train['text'], imdb_train['label']], n=n)),
    ("imdb test", sample([imdb_test['text'], imdb_test['label']], n=n))]:
    print(dataset_name, f"accuracy = {accuracy(X, Ytrue) * 100:0.2f} %")
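`sample` draws one set of indices and applies it to every sequence so that texts stay aligned with their labels. A stdlib-only sketch of the same idea (the name `sample_paired` and the `seed` parameter are hypothetical), which samples without replacement so that no example is counted twice:

```python
import random
from typing import List, Sequence, Tuple


def sample_paired(items: List[Sequence], n: int, seed: int = 0) -> Tuple[list, ...]:
    """Draw the same n distinct indices from every sequence in `items`."""
    size = len(items[0])
    if any(len(seq) != size for seq in items):
        raise ValueError("all sequences must have the same length")
    rng = random.Random(seed)
    indices = rng.sample(range(size), k=min(n, size))  # without replacement
    return tuple([seq[i] for i in indices] for seq in items)


texts = ["great movie", "dull plot", "loved it", "a mess", "fine"]
labels = [1, 0, 1, 0, 1]
X, Y = sample_paired([texts, labels], n=3)
print(list(zip(X, Y)))  # each sampled text keeps its own label
```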

In [None]:
#@title Skip: Details

plotly_layout2 = plotly_layout.copy()
plotly_layout2['margin'] = plotly_layout2['margin'].copy()
plotly_layout2['margin']['t'] = 60

# CLASS
def show_model_score_drift(
    texts1: List[str],
    texts2: List[str],
    n1: str,
    n2: str,
    score: str = "positive"
) -> None:
    """
    Given two collections of texts, display histograms of the model `score`
    over each collection. `n1` and `n2` are labels for the two collections.
    """

    # CLASS
    scores1 = Model.evaluate_to_logits(texts1).detach().cpu().numpy()
    scores2 = Model.evaluate_to_logits(texts2).detach().cpu().numpy()

    # CLASS
    df1 = pd.DataFrame(dict(
        negative=scores1[:, 0],
        positive=scores1[:, 1],
    ))
    df2 = pd.DataFrame(dict(
        negative=scores2[:, 0],
        positive=scores2[:, 1]
    ))

    s1 = df1[score]
    s2 = df2[score]

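    # Use bin edges shared by both collections so the histograms are comparable,
    # and place each bar at its bin's center (there is one more edge than count).
    bin_edges = np.histogram_bin_edges(np.concatenate([s1, s2]), bins=20)
    counts1, _ = np.histogram(s1, bins=bin_edges, density=True)
    counts2, _ = np.histogram(s2, bins=bin_edges, density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

    fig = Figure(layout=dict(title="model score distributions", **plotly_layout2))
    fig.add_bar(x=bin_centers, y=counts1, name=n1)
    fig.add_bar(x=bin_centers, y=counts2, name=n2)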

    display(fig)

In [None]:
#@title Playground: Model score drift (rotten tomatoes, train vs. test)
#@markdown The histogram shows the distribution of our sentiment model's scores on two datasets: the rotten tomatoes train split and the rotten tomatoes test split.

# TODO: Can this visualization be adjusted to look more like the distribution plots in truera?

# DATA
show_model_score_drift(
    sample([rotten_train['text']], n=n)[0],
    sample([rotten_test['text']], n=n)[0],
    'rotten train',
    'rotten test'
)

In [None]:
#@title Playground: Model score drift (rotten tomatoes vs. imdb)
#@markdown In this histogram, the score distributions are compared across two datasets that differ more substantially: rotten tomatoes and imdb.
#@markdown - Compare the distributional differences visible here with those in the previous plot.

# DATA
show_model_score_drift(
    sample([rotten_train['text']], n=n)[0],
    sample([imdb_train['text']], n=n)[0],
    'rotten train',
    'imdb train'
)
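Two score histograms are only comparable when they share bin edges, and a histogram has one more edge than it has counts, so bars are naturally placed at bin centers. The sketch below does both in pure Python and then condenses the comparison into one number using total variation distance (half the L1 distance between the binned distributions). The total-variation summary is an assumption added for illustration; the notebook itself only plots the histograms, and the scores here are made up:

```python
from typing import List, Tuple


def shared_histograms(s1: List[float], s2: List[float], bins: int = 20
                      ) -> Tuple[List[float], List[float], List[float]]:
    """Bin two samples on common edges; return bin centers and per-bin fractions."""
    lo = min(min(s1), min(s2))
    hi = max(max(s1), max(s2))
    width = (hi - lo) / bins or 1.0  # guard against all values being equal

    def fractions(sample: List[float]) -> List[float]:
        counts = [0] * bins
        for v in sample:
            # Clamp so the maximum value falls in the last (closed) bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return centers, fractions(s1), fractions(s2)


def total_variation(p: List[float], q: List[float]) -> float:
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up "positive" scores for two datasets.
scores_a = [0.1, 0.2, 0.8, 0.9, 0.85, 0.15]
scores_b = [0.5, 0.55, 0.6, 0.9, 0.95, 0.4]
centers, p, q = shared_histograms(scores_a, scores_b, bins=4)
print(centers)
print(total_variation(p, q))
```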

## Token distribution drift

In [None]:
#@title Skip: Details

def tokenize(portion: List[str]) -> List[List[int]]:
    """
    Tokenize into just token_ids, not any of the other model inputs.
    """
    return Model.tokenizer.batch_encode_plus(
        portion,
        add_special_tokens=True,
        return_attention_mask=False,
        max_length=Model.max_length,
        truncation=True
    )['input_ids']


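def toks_of_texts(texts: List[str]) -> npt.NDArray[int]:
    """
    Tokenize `texts` and flatten the result into a single array of token ids.
    """
    toks = tokenize(texts)

    return np.array([t for tok in toks for t in tok])


def dists_of_texts(
    texts: List[str]
) -> Tuple[npt.NDArray[int], npt.NDArray[float]]:
    """
    Count how often each vocabulary token occurs in `texts` and normalize the
    counts into a probability distribution over the vocabulary.
    """
    all_toks = toks_of_texts(texts)

    counts = np.bincount(all_toks, minlength=Model.vocab_size)
    dist = counts / len(all_toks)

    return counts, dist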


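def tops_of_texts(
    texts: List[str], n: int = 10
) -> List[Tuple[int, float, float, str]]:
    """
    Get the `n` least and `n` most frequent tokens in the collection of
    `texts`, plus summary entries for all remaining tokens (see
    `tops_of_dists`).
    """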

    counts, dist = dists_of_texts(texts)

    return tops_of_dists(counts, dist, n=n)


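def tops_of_dists(c: npt.NDArray[int],
                  d: npt.NDArray[float],
                  n: int = 10) -> List[Tuple[int, float, float, str]]:
    """
    Sort tokens by `d` and return `(token_id, count, prob, token)` tuples for
    the `n` lowest and `n` highest entries. All tokens in between are summed
    into two "*" entries (token id -1): one for tokens whose `c` is negative
    and one for the rest.
    """
    sortindex = np.argsort(d)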
    top = []

    for idx in sortindex[0:n]:
        top.append((idx, c[idx], d[idx], Model.tokenizer.decode(idx)))

    crest_pos = 0
    crest_neg = 0
    drest_pos = 0
    drest_neg = 0

    for idx in sortindex[n:-n]:
        if c[idx] >= 0:
            crest_pos += c[idx]
            drest_pos += d[idx]
        else:
            crest_neg += c[idx]
            drest_neg += d[idx]

    top.append((-1, crest_neg, drest_neg, "*"))
    top.append((-1, crest_pos, drest_pos, "*"))

    for idx in sortindex[-n:]:
        top.append((idx, c[idx], d[idx], Model.tokenizer.decode(idx)))

    return top

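def plotdist(
    d1: npt.NDArray[float], d2: npt.NDArray[float], top, l1: str, l2: str
) -> None:
    """
    Plot the probabilities of the tokens in `top` under `d1` and `d2` side by
    side, then the difference in probability for the same tokens.
    """

    n = len(top)

    # The "*" summary entries (token id -1) have no single probability, so
    # show 0 for them instead of indexing the last vocabulary entry.
    probs1 = [d1[t[0]] if t[0] >= 0 else 0.0 for t in top]
    probs2 = [d2[t[0]] if t[0] >= 0 else 0.0 for t in top]

    dprobs = pd.DataFrame(
        {
            "token": [t[3] for t in top] * 2,
            "dataset": ([l1] * n) + ([l2] * n),
            "prob": probs1 + probs2
        }
    )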
    fig = px.bar(dprobs, x="token", y="prob", color="dataset", barmode='group')
    fig.update_layout(plotly_layout, height=300)
    display(fig)

    ddiff = pd.DataFrame(
        {
            "token": [t[3] for t in top],
            "prob": [t[2] for t in top]
        }
    )
    fig = px.bar(ddiff, x="token", y="prob")
    fig.update_layout(plotly_layout, height=300)
    display(fig)

In [None]:
#@title Playground: Token distribution drift (imdb vs rotten tomatoes)
#@markdown In this distributional comparison, we look not at model scores but at token frequencies in the data. This analysis is therefore model-independent: the differences it finds may or may not matter to the model.
#@markdown - Notice that many of the distributional differences are due to inclusion or exclusion of HTML tags in the datasets.
#@markdown - The "*" on the token axis represents all other tokens (their probabilities are omitted on the upper chart, but their total differences are shown in the lower chart).

# DATA
c1, d1 = dists_of_texts(imdb_train['text'][:n])
c2, d2 = dists_of_texts(rotten_train['text'][:n])
top = tops_of_dists(c1 - c2, d1 - d2, n=20)

# DATA
plotdist(d1, d2, top, l1='imdb', l2='rotten')
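The token-drift comparison reduces to: count tokens in each corpus, normalize the counts into probabilities, subtract the distributions, and report the tokens with the most negative and most positive differences, with everything in between collapsed into "*" entries. A pure-Python sketch under simplifying assumptions: documents are already split into tokens (the notebook uses the model's subword tokenizer instead), the corpora are tiny hypothetical examples, and the middle tokens are split by the sign of the probability difference rather than the count difference:

```python
from collections import Counter
from typing import Dict, List, Tuple


def token_distribution(docs: List[List[str]]) -> Dict[str, float]:
    """Relative frequency of each token across all (already tokenized) documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def top_differences(d1: Dict[str, float], d2: Dict[str, float],
                    n: int = 2) -> List[Tuple[str, float]]:
    """The n most negative and n most positive differences d1 - d2; the rest summed by sign as "*"."""
    vocab = set(d1) | set(d2)
    # Sort by difference; ties are broken alphabetically by token.
    diffs = sorted((d1.get(t, 0.0) - d2.get(t, 0.0), t) for t in vocab)
    low, middle, high = diffs[:n], diffs[n:len(diffs) - n], diffs[len(diffs) - n:]
    rest_neg = sum(d for d, _ in middle if d < 0)
    rest_pos = sum(d for d, _ in middle if d >= 0)
    return ([(t, d) for d, t in low]
            + [("*", rest_neg), ("*", rest_pos)]
            + [(t, d) for d, t in high])


# Hypothetical pre-tokenized corpora; the IMDB-like one contains HTML fragments.
imdb_like = [["great", "film", "<br", "/>"], ["bad", "film", "<br", "/>"]]
rotten_like = [["great", "film"], ["bad", "acting"]]

for token, diff in top_differences(token_distribution(imdb_like),
                                   token_distribution(rotten_like)):
    print(f"{token:>8} {diff:+.3f}")
```

Because both distributions sum to 1, all the differences, including the two "*" entries, sum to zero; in this toy example the HTML fragments `<br` and `/>` come out as the most over-represented tokens in the IMDB-like corpus.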

In [None]:
#@title Playground: Token distribution drift (imdb train vs. imdb test)
#@markdown Here we are comparing the two splits of the IMDB data.
#@markdown - Note that while the total difference in token probabilities adds up to a significant amount, no single token differs much.

# DATA
c1, d1 = dists_of_texts(imdb_train['text'][:n])
c2, d2 = dists_of_texts(imdb_test['text'][:n])
top = tops_of_dists(c1 - c2, d1 - d2, n=20)

# DATA
plotdist(d1=d1, d2=d2, top=top, l1='imdb train', l2='imdb test')
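A large total difference with no single token differing much is easy to reproduce, because many small per-token differences add up. A sketch comparing total variation distance with the largest single-token difference, on two synthetic token lists (hypothetical data) that each contain 100 distinct tokens, half of them shared:

```python
from collections import Counter
from typing import Dict, List, Tuple


def distribution(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each token in a flat token list."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}


def drift_summary(p: Dict[str, float], q: Dict[str, float]) -> Tuple[float, float]:
    """Return (total variation distance, largest single-token probability difference)."""
    vocab = set(p) | set(q)
    diffs = [abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab]
    return 0.5 * sum(diffs), max(diffs)


split_a = [f"w{i}" for i in range(0, 100)]    # tokens w0..w99
split_b = [f"w{i}" for i in range(50, 150)]   # tokens w50..w149
tv, largest = drift_summary(distribution(split_a), distribution(split_b))
print(f"total variation = {tv:.2f}, largest single-token difference = {largest:.2f}")
```

Half of the probability mass has moved (total variation 0.5), yet no token's probability changes by more than 0.01.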

## Embedding distribution drift

In [None]:
#@title Skip: Details

# DATA
c1, d1 = dists_of_texts(rotten_train['text'][:n])
c2, d2 = dists_of_texts(imdb_train['text'][:n])

data1 = dict(prob=d1, token_id=range(Model.vocab_size))
data1.update({f"dim{did}": Model.embeddings[:, did] for did in range(Model.embedding_size)})
df1 = pd.DataFrame(data1)

data2 = dict(prob=d2, token_id=range(Model.vocab_size))
data2.update({f"dim{did}": Model.embeddings[:, did] for did in range(Model.embedding_size)})
df2 = pd.DataFrame(data2)

def show_hists(
    s1: pd.Series, s2: pd.Series, df1: pd.DataFrame, df2: pd.DataFrame,
    title: str
) -> None:
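    # Shared bin edges; each bar is placed at its bin's center.
    bin_edges = np.histogram_bin_edges(np.concatenate([s1, s2]), bins=20)
    counts1, _ = np.histogram(s1, bins=bin_edges, weights=df1.prob.values)
    counts2, _ = np.histogram(s2, bins=bin_edges, weights=df2.prob.values)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

    fig = go.Figure(layout=dict(title=title, **plotly_layout2))
    # Token probabilities sum to 1, so each bar is the probability mass in its bin.
    fig.update_layout(xaxis_title="Dimension's value", yaxis_title="Probability")
    # DATA
    fig.add_bar(x=bin_centers, y=counts1, name="rotten")
    fig.add_bar(x=bin_centers, y=counts2, name="imdb")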

    display(fig)

In [None]:
#@title Playground: Embedding distribution shift
#@markdown In this distribution plot, we visualize the values of the token embeddings one dimension at a time, with each token weighted by how often it occurs in each dataset.
#@markdown - The slider on top of the graph lets you change which of the many embedding dimensions to focus on.

@interact(dim=widgets.IntSlider(value=0, min=0, max=Model.embedding_size-1, layout=aiq_layout))
def show_dim_hist(dim):
    show_hists(
        df1[f'dim{dim}'],
        df2[f'dim{dim}'],
        df1,
        df2,
        title=f"embedding dimension {dim}"
    )
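In `show_hists`, both datasets share the same embedding table, so a token's value in a given dimension is identical in both histograms; only the weights differ, since each token is weighted by its frequency in that dataset. A pure-Python sketch of such a probability-weighted histogram (last bin closed, as in `np.histogram`), using a hypothetical four-token vocabulary and made-up frequencies:

```python
from typing import List


def weighted_histogram(values: List[float], weights: List[float],
                       edges: List[float]) -> List[float]:
    """Sum the weights of values in each [edges[i], edges[i+1]) bin; the last bin is closed."""
    totals = [0.0] * (len(edges) - 1)
    for v, w in zip(values, weights):
        for i in range(len(edges) - 1):
            last = i == len(edges) - 2
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                totals[i] += w
                break
    return totals


# One embedding dimension for a hypothetical 4-token vocabulary (same for both datasets).
dim_values = [-0.5, -0.1, 0.2, 0.6]
p_rotten = [0.1, 0.4, 0.4, 0.1]   # made-up token frequencies
p_imdb = [0.4, 0.1, 0.1, 0.4]
edges = [-0.6, -0.2, 0.2, 0.6]

print(weighted_histogram(dim_values, p_rotten, edges))
print(weighted_histogram(dim_values, p_imdb, edges))
```

The first two bins trade mass between the datasets while the last bin is unchanged, even though no embedding value changed; only token frequencies did.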


In [None]:
#@title Playground: Gender distribution drift
#@markdown Because differences in raw embedding dimensions may not be interpretable, this figure instead compares, for each dataset, the distribution of the tokens' projections onto the gender direction.
#@markdown - In this setting, a significant difference has a direct interpretation, something like: "the overall gender of the dataset has shifted towards more female".

embeddings_gender = np.dot(Model.embeddings, gender_vector)

df1g = pd.DataFrame(dict(
    gender=embeddings_gender,
    prob=d1,
    token_id=range(Model.vocab_size)
))

df2g = pd.DataFrame(dict(
    gender=embeddings_gender,
    prob=d2,
    token_id=range(Model.vocab_size)
))

show_hists(
    df1g.gender, df2g.gender, df1g, df2g, title="gender dimension histogram"
)
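One way to condense the gender histogram into a single number is the frequency-weighted mean projection, `sum_t p(t) * (e_t · g)`; comparing it between datasets gives a rough measure of whether the overall gender of the text has shifted. This summary is an assumption for illustration, not something the notebook computes, and the tokens, projections, and sign convention below are made up:

```python
from typing import Dict


def expected_projection(probs: Dict[str, float], proj: Dict[str, float]) -> float:
    """Average projection onto the gender direction, weighted by token frequency."""
    return sum(p * proj[tok] for tok, p in probs.items())


# Hypothetical per-token projections (sign convention assumed: negative = female).
proj = {"she": -0.4, "he": 0.4, "film": 0.0}
dataset_a = {"she": 0.1, "he": 0.3, "film": 0.6}
dataset_b = {"she": 0.3, "he": 0.1, "film": 0.6}

shift = expected_projection(dataset_b, proj) - expected_projection(dataset_a, proj)
print(shift)  # negative: dataset_b leans more female under this sign convention
```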