# 🤖💬❓nlp-uncertainty-zoo Demo

This is a quick demo for the nlp-uncertainty-zoo, showing how to get started quickly with the package. We will do this by training two different models on the Rotten Tomatoes sentiment analysis dataset, where we want to classify whether a movie review is positive or negative.

For that purpose, we first start by importing all necessary packages as well as loading and preprocessing the dataset. Even though the first model we are using is LSTM-based, we will still use the BERT tokenizer here for the sake of simplicity.

## Loading the dataset & preprocessing

In [39]:
import random
from string import ascii_lowercase

from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import BertTokenizer
from nlp_uncertainty_zoo.models import LSTMEnsemble, VariationalBert  # We will test these two models in this demo!

# CONST
BATCH_SIZE = 16

In [21]:
def preprocess_with(tokenizer):
    def preprocess(input_):
        return tokenizer(
            input_["text"],
            truncation=True,
            padding="max_length",
            max_length=50
        )
    
    return preprocess

dataset = load_dataset("rotten_tomatoes")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

train_set = dataset["train"].map(preprocess_with(tokenizer), batched=True)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE)

test_set = dataset["test"].map(preprocess_with(tokenizer), batched=True)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE)

Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/Users/deul/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9c411f7ecd9f3045389de0d9ce984061a1056507703d2e3183b1ac1a90816e4d)



## Training

We now start by training an ensemble of LSTMs. Because all members of an ensemble are randomly initialized, they tend to converge to different solutions, which makes the ensemble robust to unseen data points (see paper TODO). This is also a very useful property for uncertainty quantification, as we will see later.

In [48]:
# TODO: Set seeds
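
One possible way to fill in this step is sketched below. `set_seed` is a hypothetical helper and not part of the package; the sketch assumes that Python's `random` module, NumPy and (if installed) PyTorch are the only sources of randomness in this demo.

```python
import random

import numpy as np


def set_seed(seed: int) -> None:
    """Seed the random number generators used in this demo (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)

    # PyTorch is only seeded if it is installed (assumption: it is the training backend)
    try:
        import torch
    except ImportError:
        return

    torch.manual_seed(seed)


set_seed(42)
```

Calling `set_seed` once before creating the models and once before corrupting the sentences further below should make the random parts of the demo repeatable.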

In [23]:
# TODO: Training code for LSTM Ensemble

Next up, we will fine-tune a BERT model. For uncertainty quantification, we will use Monte Carlo Dropout (TODO: Citations): By using multiple different dropout masks during inference, we can create different predictions for the same data point. 

In [24]:
# TODO: Training code for Variational BERT
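
To illustrate the idea behind Monte Carlo Dropout independently of BERT, here is a minimal NumPy sketch with a toy linear classifier. The function `mc_dropout_predict`, the toy weights and all parameter values are assumptions made for illustration; this is not the `VariationalBert` implementation.

```python
import numpy as np


def mc_dropout_predict(x, weights, bias, dropout_prob=0.1, num_samples=10, rng=None):
    """Return class probabilities of shape (num_samples, num_classes) for one input."""
    rng = np.random.default_rng() if rng is None else rng
    predictions = []

    for _ in range(num_samples):
        # Sample a fresh dropout mask for every forward pass (inverted dropout scaling)
        mask = rng.random(x.shape) >= dropout_prob
        dropped = x * mask / (1.0 - dropout_prob)

        logits = dropped @ weights + bias
        logits = logits - logits.max()  # For numerical stability of the softmax
        predictions.append(np.exp(logits) / np.exp(logits).sum())

    return np.stack(predictions)


rng = np.random.default_rng(0)
x = rng.normal(size=8)             # Toy feature vector for a single input
weights = rng.normal(size=(8, 2))  # Toy classifier with two classes
bias = np.zeros(2)

samples = mc_dropout_predict(x, weights, bias, dropout_prob=0.3, num_samples=20, rng=rng)
mean_prediction = samples.mean(axis=0)  # Averaged prediction over all dropout masks
print(samples.shape, mean_prediction)
```

Each row of `samples` is one prediction for the same data point under a different dropout mask; the spread between the rows is what the uncertainty metrics later in this demo build on.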

## Evaluating task performance & calibration

Before we continue, let us first evaluate the models to reassure ourselves that the training was successful:

In [None]:
# TODO: Evaluate models

We can also evaluate to what extent the probability assigned to a predicted class corresponds to the chance of that prediction being correct, a property called *calibration* (Guo et al., 2017). One way to evaluate this property is the expected calibration error (ECE): By binning predictions with similar confidence scores, we can check whether the mean confidence per bin corresponds to the accuracy on the binned samples:

In [None]:
# TODO: Implement calibration with ECE
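
Until this cell is filled in, the binning procedure described above can be sketched in plain NumPy. The function name `expected_calibration_error`, the choice of ten equal-width bins and the toy inputs are assumptions; the package may implement this differently.

```python
import numpy as np


def expected_calibration_error(probs, labels, num_bins=10):
    """ECE with equal-width confidence bins; probs has shape (num_samples, num_classes)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    # Assign every prediction to a bin (lower, upper] based on its confidence
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    bin_ids = np.digitize(confidences, bin_edges[1:-1], right=True)

    ece = 0.0
    for bin_id in range(num_bins):
        in_bin = bin_ids == bin_id
        if not in_bin.any():
            continue

        # Gap between accuracy and mean confidence, weighted by the share of samples in the bin
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap

    return ece


toy_probs = [[0.95, 0.05], [0.85, 0.15], [0.65, 0.35], [0.25, 0.75]]
toy_labels = [0, 1, 0, 1]
print(expected_calibration_error(toy_probs, toy_labels))
```

A value of 0 means that confidence and accuracy match in every bin; larger values indicate over- or underconfidence.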

Another approach is evaluation using *prediction sets* (TODO: Citation). The idea here is to sort the predicted classes by descending probability and add classes to a set until a certain amount of probability mass - for instance 90 % in the example below - is reached. If the model is well calibrated, these prediction sets should be small and contain the correct class (on average). Using the functions implemented in the package, we evaluate these properties below:

In [47]:
# TODO: Implement prediction set evaluation
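
As a rough illustration of the procedure, the following NumPy sketch builds prediction sets and reports how often they contain the gold label and how large they are on average. It does not use the package's own functions; `prediction_sets`, `coverage_and_width` and the toy data are hypothetical.

```python
import numpy as np


def prediction_sets(probs, mass=0.9):
    """For each row, return the classes needed to reach the given probability mass."""
    probs = np.asarray(probs, dtype=float)

    order = np.argsort(-probs, axis=1)  # Classes sorted by descending probability
    cumulative = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)

    # Number of classes needed until the cumulative mass first reaches `mass`
    set_sizes = np.minimum((cumulative < mass).sum(axis=1) + 1, probs.shape[1])

    return [order[i, :size].tolist() for i, size in enumerate(set_sizes)]


def coverage_and_width(probs, labels, mass=0.9):
    """Share of sets containing the gold label, and the average set size."""
    sets = prediction_sets(probs, mass)
    coverage = float(np.mean([label in set_ for set_, label in zip(sets, labels)]))
    width = float(np.mean([len(set_) for set_ in sets]))

    return coverage, width


toy_probs = [[0.95, 0.05], [0.6, 0.4], [0.2, 0.8]]
toy_labels = [0, 1, 1]
print(prediction_sets(toy_probs), coverage_and_width(toy_probs, toy_labels))
```

With only two classes, as in the Rotten Tomatoes task, a set has size one or two, so the average width mostly reflects how often the model is confident enough to commit to a single class.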

## Uncertainty quantification

Next, we want to use the models to actually quantify their uncertainty in a prediction. For this purpose, we manually define some sequences which should seem suspicious to the models.

In [43]:
original_sentence = train_set[1]["text"]
print(original_sentence)

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .


In [46]:
# The models haven't been fine-tuned on German, so this should look unusual to them
sentence1 = (
    "Die umwerfend aufwendige Fortsetzung der „Der Herr der Ringe“-Trilogie ist so umfangreich, "
    "dass eine Kolonne von Worten die erweiterte Vision von Co-Autor/Regisseur Peter Jackson "
    "von j . r . r . Tolkiens Mittelerde nicht angemessen beschreiben kann."
).lower()
# Now we scramble the contents of the sentence randomly
tokens = original_sentence.split(" ")
sentence2 = " ".join(random.sample(tokens, len(tokens)))
print(sentence2)

# Add noise to the sentence
delete_chars = 10
add_noise_chars = 10

sentence3 = str(original_sentence)

# Delete characters at random positions
for _ in range(delete_chars):
    idx = random.randrange(len(sentence3))
    sentence3 = sentence3[:idx] + sentence3[idx + 1:]

# Insert random lowercase characters at random positions
for _ in range(add_noise_chars):
    idx = random.randrange(len(sentence3))
    char = random.choice(ascii_lowercase)

    sentence3 = sentence3[:idx] + char + sentence3[idx:]
    
print(sentence3)

of vision expanded of lord of . " huge a . the tolkien's is words describe peter so middle-earth cannot " the of gorgeously column adequately r co-writer/director j rings . r that continuation trilogy . the jackson's elaborate
the gorgeously elaborate continuationpn of " the lsord ofthe rings " trilogy is sohuge that a column of words cannort adequaely describek co-riter/dzijrector pteorn ackson's expanded vsion of jm . r .r  tolkien's middleearth .


We start by checking the predictions for the sentences above. The original sentence has a positive sentiment, so we check whether our models come to the same conclusion:

In [44]:
# TODO: Get predictions

Next, we measure the uncertainty. Since the inputs above are quite different from the inputs the models were trained on, we would hope that the models are more uncertain on the noisy sentences than on the original one.

In this demo, we will explore three different uncertainty metrics: maximum softmax probability, predictive entropy, and mutual information. Depending on the model, there might be different metrics available. You can check that by inspecting the ``available_uncertainty_metrics`` attribute:

In [None]:
# TODO: Implement functionality and use here

But back to the metrics. An easy and intuitive metric is the maximum softmax probability (TODO: Citation):

$$1 - \max_k p_{\theta}(y=k|x)$$

Intuitively, when the model is uncertain, the distribution over classes should be close to uniform, thus yielding a low maximum probability over classes. We subtract the value from 1 here so that small values correspond to high certainty.
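
As a small worked example of the formula above, the sketch below computes the score with NumPy. The function name `max_prob_uncertainty` and the toy probabilities are made up for illustration and are not taken from the package.

```python
import numpy as np


def max_prob_uncertainty(probs):
    """1 - max_k p(y=k|x); probs has shape (..., num_classes)."""
    probs = np.asarray(probs, dtype=float)

    return 1.0 - probs.max(axis=-1)


# A near-uniform prediction, followed by a very confident one
print(max_prob_uncertainty([[0.5, 0.5], [0.99, 0.01]]))
```

For two classes, the score lies between 0 (fully confident) and 0.5 (uniform prediction).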

Another way to measure uncertainty is to use the Shannon entropy of the predictive distribution: For a uniform distribution, the entropy will be maximal:

$$-\sum_{k=1}^K p_{\theta}(y=k|x) \log p_{\theta}(y=k|x)$$
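
The entropy can be computed in the same way. In this sketch, `predictive_entropy` is a hypothetical helper, and the small constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np


def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each predictive distribution in probs."""
    probs = np.asarray(probs, dtype=float)

    return -(probs * np.log(probs + eps)).sum(axis=-1)


# Uniform prediction (maximal entropy), followed by a one-hot prediction (entropy close to 0)
print(predictive_entropy([[0.5, 0.5], [1.0, 0.0]]))
```

For $K$ classes the maximum is $\log K$, so scores are only comparable between models that share the same label set.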

Lastly, Smith & Gal (2017) propose mutual information as a way to exclusively measure the *model uncertainty*:

$$\text{H}\bigg[\mathbb{E}_{q(\theta)}\Big[p_{\theta}(y|x)\Big]\bigg] - \mathbb{E}_{q(\theta)}\bigg[\text{H}\Big[p_{\theta}(y|x)\Big]\bigg]$$

Here, the first term denotes the total uncertainty, from which the second term, the *data uncertainty*, is subtracted, leaving only the model uncertainty. Usually, the expectation in both terms would be taken over the weight posterior $p(\theta|\mathcal{D})$ of the model. Since this posterior is generally intractable to evaluate for neural networks, we model an approximate posterior $q(\theta)$ instead. To evaluate the expectation, we use Monte Carlo sampling and simply average the predictions coming from different sets of weights: in the case of the LSTM ensemble, these come from the different ensemble members; for the Variational BERT, they correspond to predictions using different dropout masks.
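
The Monte Carlo estimate described here can be sketched as follows, assuming the sampled predictions are stacked along the first axis. `entropy`, `mutual_information` and the toy predictions are illustrative names and values, not the package API.

```python
import numpy as np


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) along the last axis."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)


def mutual_information(sample_probs):
    """sample_probs has shape (num_samples, ..., num_classes), one entry per set of weights."""
    sample_probs = np.asarray(sample_probs, dtype=float)

    # Total uncertainty: entropy of the averaged prediction
    total_uncertainty = entropy(sample_probs.mean(axis=0))
    # Data uncertainty: average entropy of the individual predictions
    data_uncertainty = entropy(sample_probs).mean(axis=0)

    return total_uncertainty - data_uncertainty


agreeing = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]     # All weight samples agree
disagreeing = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # The weight samples disagree
print(mutual_information(agreeing), mutual_information(disagreeing))
```

When all ensemble members or dropout masks produce the same distribution, the two terms cancel and the score is close to zero, no matter how uncertain that shared distribution is.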

In [42]:
# TODO: Demonstrate usage of uncertainty metrics, measure uncertainty on noisy sentences compared to original one

## Evaluating the quality of uncertainty estimates

As we have done before with the raw probabilities, we also want to know how reliable the uncertainty estimates of our models are. The package provides several ways to do this. Firstly, we can evaluate them using an out-of-distribution (OOD) detection task: The model should be more uncertain on data points that are unlike the ones in the training set. By treating the uncertainty scores as the scores of a binary classifier that separates in-distribution from OOD inputs, we can use binary classification metrics like the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic (AUROC) to evaluate this. In our Rotten Tomatoes example, we will add noise to the sentences in our test set and use these sentences as an OOD data set.

In [None]:
# TODO: Evaluate 
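
Independent of the package, both metrics can be sketched directly in NumPy, treating OOD inputs as the positive class. The helper names and the toy scores are assumptions, and the AUPR variant below (average precision) ignores tied scores; in practice `sklearn.metrics.roc_auc_score` and `sklearn.metrics.average_precision_score` are established alternatives.

```python
import numpy as np


def auroc(scores, is_ood):
    """Probability that a random OOD input receives a higher score than a random ID input."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)

    diff = scores[is_ood][:, None] - scores[~is_ood][None, :]

    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def aupr(scores, is_ood):
    """Average precision: mean precision at the rank of every OOD input."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)

    hits = is_ood[np.argsort(-scores, kind="stable")]  # Sorted by descending uncertainty
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)

    return precision_at_k[hits].mean()


uncertainties = [0.1, 0.8, 0.2, 0.9]  # Toy uncertainty scores
is_ood = [0, 0, 1, 1]                 # 1 marks a corrupted (OOD) sentence
print(auroc(uncertainties, is_ood), aupr(uncertainties, is_ood))
```

An AUROC of 0.5 corresponds to uncertainty scores that do not separate clean from corrupted inputs at all, while 1.0 means every corrupted input is ranked above every clean one.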

The other way, introduced by Ulmer et al. (2022), is to measure how strongly high uncertainty corresponds to the model making wrong predictions. This is quantified by collecting the model loss and uncertainty for all points in the test set, and measuring their correlation using [Kendall's $\tau$ correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient). The values range from -1 to 1, with 1 indicating that high uncertainty perfectly correlates with high model loss.

In [None]:
# TODO: Evaluate using Kendall's tau
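
For intuition, the coefficient can be computed by comparing all pairs of test points. The sketch below implements the simple tau-a variant without tie correction; `kendall_tau` and the toy losses are hypothetical, and `scipy.stats.kendalltau` is a ready-made alternative that also handles ties.

```python
import numpy as np


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total number of pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Sign of the difference for every ordered pair of points
    sign_x = np.sign(x[:, None] - x[None, :])
    sign_y = np.sign(y[:, None] - y[None, :])

    # Every unordered pair is counted twice in the sum, hence n * (n - 1) in the denominator
    num_points = len(x)

    return (sign_x * sign_y).sum() / (num_points * (num_points - 1))


losses = [0.1, 0.5, 0.9, 1.3]         # Toy per-sample losses
uncertainties = [0.2, 0.3, 0.7, 0.9]  # Toy per-sample uncertainties
print(kendall_tau(losses, uncertainties))
```

Because the pairwise comparison is quadratic in the number of points, this sketch is only meant for small test sets.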

## Visualizing sentence representations

Part of the interface of the model implementations also allows us to create representations of input sequences and to visualize the latent space of the models. Below, we visualize the representations for the original and corrupted sentences:

In [None]:
# TODO: Implement functions to extract representations and visualize data

In [None]:
# TODO: Plot representations for Variational BERT

In [None]:
# TODO: Plot representations for LSTM Ensemble

Thanks for reading through this demo! We only showcase the most useful functionalities here that people might want to use when applying the implemented models. If you would like to know more about the different models and functionalities in the package, consult [the documentation](http://dennisulmer.eu/nlp-uncertainty-zoo/). If you find any bugs or have requests for missing features, please [open an issue on the GitHub repository](https://github.com/Kaleidophon/nlp-uncertainty-zoo/issues). Below you can find the papers that were referenced in this demo:

TODO