# 🤖💬❓nlp-uncertainty-zoo Demo

This is a quick demo for the nlp-uncertainty-zoo, detailing how to get started quickly with the package. We will do this by training two different models on the Rotten Tomatoes sentiment analysis dataset, where we want to classify whether a movie review is positive or negative.

For that purpose, we first start by importing all necessary packages as well as loading and preprocessing the dataset. Even though the first model we are using is LSTM-based, we will still use the BERT tokenizer here for the sake of simplicity.

In [1]:
%%writefile requirements.txt

torch>=1.9.0
numpy>=1.19.5
wandb>=0.12.5
scikit-learn>=0.24.1
transformers>=4.5.1
einops>=0.3.0
datasets>=1.6.2
tqdm>=4.49.0
blitz-bayesian-pytorch>=0.2.7
gpytorch>=1.5.0
scipy>=1.5.4
dill>=0.3.3
joblib>=1.0.1
alpaca-ml>=0.8.2
frozendict==2.3.4
protobuf==3.20.0

Overwriting requirements.txt


In [2]:
!pip install -r requirements.txt
!pip install git+https://github.com/Kaleidophon/nlp-uncertainty-zoo


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Found existing installation: nlp-uncertainty-zoo 0.9.0
Uninstalling nlp-uncertainty-zoo-0.9.0:
  Would remove:
    /usr/local/lib/python3.8/dist-packages/nlp_uncertainty_zoo-0.9.0.dist-info/*
    /usr/local/lib/python3.8/dist-packages/nlp_uncertainty_zoo/*
Proceed (Y/n)? Y
  Successfully uninstalled nlp-uncertainty-zoo-0.9.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/Kaleidophon/nlp-uncertainty-zoo@demo
  Cloning https://github.com/Kaleidophon/nlp-uncertainty-zoo (to revision demo) to /tmp/pip-req-build-i2h164nu
  Running command git clone --filter=blob:none --quiet https://github.com/Kaleidophon/nlp-uncertainty-zoo /tmp/pip-req-build-i2h164nu
  Running command git checkout -b demo --track origin/demo
  Switched to a new branch 'demo'
  Branch 'demo' set up to track remote branch 'demo' from 'origin'

## Loading the dataset & preprocessing

In [3]:
import random
from string import ascii_lowercase

from datasets import load_dataset, ReadInstruction
import numpy as np
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertModel
from nlp_uncertainty_zoo.models import LSTMEnsemble, VariationalBert  # We will test these two models in this demo!

# CONST
BATCH_SIZE = 16

In [4]:
def preprocess_with(tokenizer):
    def preprocess(input_):
        return tokenizer(
            input_["text"],
            truncation=True,
            padding="max_length",
            max_length=50
        )
    
    return preprocess

# We only use a subset of the data here for demonstration purposes
train_split = load_dataset("rotten_tomatoes", split='train[:100]')
test_split = load_dataset("rotten_tomatoes", split='test[:25]')
ood_test_split = load_dataset("carblacac/twitter-sentiment-analysis", split='test[:25]')
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


train_split = train_split.map(preprocess_with(tokenizer), batched=True)
train_split.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
train_split = train_split.rename_column("label", "labels")

test_split = test_split.map(preprocess_with(tokenizer), batched=True)
test_split.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_split = test_split.rename_column("label", "labels")

ood_test_split = ood_test_split.map(preprocess_with(tokenizer), batched=True)
ood_test_split = ood_test_split.rename_column("feeling", "labels")
ood_test_split.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


train_loader = DataLoader(train_split, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_split, batch_size=BATCH_SIZE)
ood_test_loader = DataLoader(ood_test_split, batch_size=BATCH_SIZE)




## Training

We now start by training an ensemble of LSTMs. Because all members of an ensemble are randomly initialized, they tend to converge to different solutions, which makes the ensemble more robust to unseen data points (see Lakshminarayanan et al., 2017). This is also a very useful property for uncertainty quantification, as we will see later.

In [5]:
SEED = 1234
np.random.seed(SEED)
torch.manual_seed(SEED)

vocab_size = len(tokenizer.vocab)

In [6]:
ensemble = LSTMEnsemble(vocab_size=vocab_size, output_size=2, ensemble_size=5)

In [7]:
ensemble.fit(train_loader, num_training_steps=7)

Step 7: Train Loss 1.9173: 100%|██████████| 7/7 [03:43<00:00, 31.90s/it]


{'model_name': 'lstm_ensemble',
 'train_loss': 1.9173227548599243,
 'best_val_loss': inf}

Next up, we will fine-tune a BERT model. For uncertainty quantification, we will use Monte Carlo Dropout (Gal & Ghahramani, 2016a,b; Xiao et al., 2020): By using multiple different dropout masks during inference, we can create different predictions for the same data point. 

In [8]:
variational_bert = VariationalBert(
    bert_name="bert-base-uncased", 
    output_size=2,
)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
variational_bert.fit(train_loader, num_training_steps=2)

Step 2: Train Loss 0.0897: 100%|██████████| 2/2 [00:21<00:00, 10.59s/it]


Note: We set the `num_training_steps` purposefully low here to slightly undertrain our models. This makes the demo easier to run and leads to more informative uncertainty values in the rest of the demo - otherwise these models would fit our dataset perfectly, and that would be a bit boring.

## Evaluating task performance & calibration

Before we continue, let us first evaluate the models to reassure ourselves that the training was successful:

In [10]:
from nlp_uncertainty_zoo.utils.task_eval import evaluate_task

In [11]:
evaluate_task(ensemble, eval_split=test_loader)

Evaluating batch 2/2...: 100%|██████████| 2/2 [00:05<00:00,  2.70s/it]


defaultdict(float, {'accuracy': 1.0, 'macro_f1_scores': 1.0})

In [12]:
evaluate_task(variational_bert, eval_split=test_loader)

Evaluating batch 2/2...: 100%|██████████| 2/2 [00:44<00:00, 22.21s/it]


defaultdict(float, {'accuracy': 1.0, 'macro_f1_scores': 1.0})

We can also evaluate to what extent the probability of a predicted class corresponds to the chance that the prediction is actually correct, a property called *calibration* (Guo et al., 2017). One way to evaluate this property is the expected calibration error (ECE): By binning predictions with similar confidence scores, we can check whether the mean confidence per bin corresponds to the accuracy on the binned samples. Another approach is evaluation using *prediction sets* (Kompa et al., 2021). The idea here is to sort the predicted classes by descending probability and add classes to a set until a certain amount of probability mass - for instance 90 % in the example below - is reached. If the model is well calibrated, these prediction sets should be small and contain the correct class (on average). Using the functions implemented in the package, we evaluate these properties below:
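
To make the two calibration measures above concrete, here is a minimal pure-Python sketch. It is not the package's implementation: the function names `expected_calibration_error` and `prediction_set`, the choice of ten equal-width bins, and the toy inputs are assumptions made for illustration only.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]

    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed into the last bin
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)

    return ece


def prediction_set(probs, mass=0.9):
    """Add classes in order of descending probability until `mass` is covered."""
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    chosen, total = [], 0.0

    for k in ranked:
        chosen.append(k)
        total += probs[k]
        if total >= mass:
            break

    return chosen


# Toy example: two confident and two less confident predictions
toy_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
narrow_set = prediction_set([0.97, 0.03])  # one class already covers 90 %
wide_set = prediction_set([0.55, 0.45])    # both classes are needed
```

For a binary task like ours, a prediction set can contain at most two classes, so an average set size close to 2 means the model rarely puts 90 % of its probability mass on a single class.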

In [13]:
from nlp_uncertainty_zoo.utils.calibration_eval import evaluate_calibration

In [14]:
evaluate_calibration(ensemble, eval_split=test_loader)

Evaluating batch 2/2...: 100%|██████████| 2/2 [00:05<00:00,  2.62s/it]


defaultdict(float,
            {'ece': 0.30536604166030884,
             'sce': 0.1526830017566681,
             'ace': 0.15749655961990355,
             'coverage_percentage': 1.0,
             'coverage_width': 2.0})

In [15]:
evaluate_calibration(variational_bert, eval_split=test_loader)

Evaluating batch 2/2...: 100%|██████████| 2/2 [00:47<00:00, 23.73s/it]


defaultdict(float,
            {'ece': 0.0012063980102539062,
             'sce': 0.0,
             'ace': 0.0006239920854568482,
             'coverage_percentage': 1.0,
             'coverage_width': 1.0})

## Uncertainty quantification

Next, we want to use the models to actually quantify their uncertainty in a prediction. For this purpose, we manually define some sequences which should seem suspicious to the models.

In [16]:
original_sentence = tokenizer.batch_decode(list(train_loader)[0]["input_ids"], skip_special_tokens=True)[1]
print(original_sentence)

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co - writer / director peter jackson's expanded vision of j. r. r. tolkien's middle -


In [17]:
# The model hasn't been finetuned on German, so this should be weird
sentence1 = (
    "Die umwerfend aufwendige Fortsetzung der „Der Herr der Ringe“-Trilogie ist so umfangreich, "
    "dass eine Kolonne von Worten die erweiterte Vision von Co-Autor/Regisseur Peter Jackson "
    "von j. r. r. Tolkiens Mittelerde nicht angemessen beschreiben kann."
).lower()
# Now we scramble the contents of the sentence randomly
tokens = original_sentence.split(" ")
sentence2 = " ".join(random.sample(tokens, len(tokens)))
print(sentence2)

# Add noise to the sentence
delete_chars = 30
add_noise_chars = 60

sentence3 = str(original_sentence)

for _ in range(delete_chars):
    idx = random.choice(range(len(sentence3)))
    sentence3 = sentence3[:idx] + sentence3[idx + 1:]
    
for _ in range(add_noise_chars):
    idx = random.choice(range(len(sentence3)))
    char = random.choice(ascii_lowercase)
    
    sentence3 = sentence3[:idx] + char + sentence3[idx:]
    
print(sentence3)

/ huge the column " the middle lord the - continuation describe trilogy peter adequately of j. r. rings expanded vision words " of - r. jackson's cannot of is tolkien's gorgeously so co writer elaborate that a of director
the grzgzeousbulyo eldeabnorate kcfonrktminuadgio of" the lo of tee igngs " rilgy ispsao pus tthat ppa columwn uosfp odrs uanznot adequatelywh demsclorbejco -k writrd jl/fdhiecntior pemte jackon's expajnded visiovo j.r. rp tolkipslnen'easg rmiaddlez -


We first check the predictions for the sentences above. The original sentence had a positive sentiment, so we first check whether our models come to the same conclusion:

In [18]:
def make_single_prediction(model, input_, tokenizer):
  """
  Make a prediction for a single sentence for the Rotten Tomatoes sentiment classification task.
  """
  tokenized_input = tokenizer(input_, return_tensors="pt", return_attention_mask=True)

  with torch.no_grad():
    prediction = torch.argmax(model.predict(tokenized_input["input_ids"], attention_mask=tokenized_input["attention_mask"])).cpu().numpy()

  predicted_label = "positive" if prediction == 1 else "negative"

  return predicted_label

In [19]:
# Predictions for LSTM Ensemble
print(make_single_prediction(ensemble, original_sentence, tokenizer))
print(make_single_prediction(ensemble, sentence1, tokenizer))
print(make_single_prediction(ensemble, sentence2, tokenizer))
print(make_single_prediction(ensemble, sentence3, tokenizer))

positive
positive
positive
positive


In [20]:
# Predictions for Variational BERT
print(make_single_prediction(variational_bert, original_sentence, tokenizer))
print(make_single_prediction(variational_bert, sentence1, tokenizer))
print(make_single_prediction(variational_bert, sentence2, tokenizer))
print(make_single_prediction(variational_bert, sentence3, tokenizer))

positive
positive
positive
positive


Here, all models still predict the right label. This could be because they learned underlying features - or because they are confidently incorrect. Since the inputs above are quite different from the inputs the models were trained on, we would hope the models to be more uncertain on the noisy sentences, which is something we can now measure.

In this demo, we will explore three different uncertainty metrics: maximum softmax probability, predictive entropy, and mutual information. Depending on the model, different metrics might be available. You can check that by inspecting the `available_uncertainty_metrics` attribute:

In [21]:
ensemble.available_uncertainty_metrics

{'max_prob': <function nlp_uncertainty_zoo.utils.metrics.max_prob(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'predictive_entropy': <function nlp_uncertainty_zoo.utils.metrics.predictive_entropy(logits: torch.FloatTensor, eps: float = 1e-05) -> torch.FloatTensor>,
 'dempster_shafer': <function nlp_uncertainty_zoo.utils.metrics.dempster_shafer(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'softmax_gap': <function nlp_uncertainty_zoo.utils.metrics.softmax_gap(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'variance': <function nlp_uncertainty_zoo.utils.metrics.variance(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'mutual_information': <function nlp_uncertainty_zoo.utils.metrics.mutual_information(logits: torch.FloatTensor, eps: float = 1e-05) -> torch.FloatTensor>}

In [22]:
variational_bert.available_uncertainty_metrics

{'max_prob': <function nlp_uncertainty_zoo.utils.metrics.max_prob(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'predictive_entropy': <function nlp_uncertainty_zoo.utils.metrics.predictive_entropy(logits: torch.FloatTensor, eps: float = 1e-05) -> torch.FloatTensor>,
 'dempster_shafer': <function nlp_uncertainty_zoo.utils.metrics.dempster_shafer(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'softmax_gap': <function nlp_uncertainty_zoo.utils.metrics.softmax_gap(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'variance': <function nlp_uncertainty_zoo.utils.metrics.variance(logits: torch.FloatTensor) -> torch.FloatTensor>,
 'mutual_information': <function nlp_uncertainty_zoo.utils.metrics.mutual_information(logits: torch.FloatTensor, eps: float = 1e-05) -> torch.FloatTensor>}

But back to the metrics. An easy and intuitive metric is the maximum softmax probability (Hendrycks & Gimpel, 2016):

$$1 - \max_k p_{\theta}(y=k|x)$$

Intuitively, when the model is uncertain, the distribution over classes should be close to uniform, thus yielding a low maximum probability over classes. We subtract the value from 1 here in order to have small values correspond to high certainty.

Another way to measure uncertainty is to use the Shannon entropy of the predictive distribution: For a uniform distribution, the entropy will be maximal:

$$-\sum_{k=1}^K p_{\theta}(y=k|x) \log p_{\theta}(y=k|x)$$
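
The two formulas above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the package's code: the helper names `softmax`, `max_prob_uncertainty` and `predictive_entropy` are made up here, and the sketch works on a single probability vector instead of batched tensors.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (shifted for numerical stability)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_prob_uncertainty(probs):
    """1 - max_k p(y=k|x): small values correspond to high certainty."""
    return 1.0 - max(probs)


def predictive_entropy(probs):
    """Shannon entropy of the predictive distribution; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = softmax([4.0, -4.0])  # almost all mass on the first class
uncertain = softmax([0.0, 0.0])   # uniform over both classes
```

On the uniform distribution both scores reach their maximum for two classes (0.5 and $\log 2$ respectively), while on the confident distribution both are close to zero.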

Lastly, Smith & Gal (2018) propose mutual information as a way to exclusively measure the *model uncertainty*:

$$\text{H}\bigg[\mathbb{E}_{q(\theta)}\Big[p_{\theta}(y|x)\Big]\bigg] - \mathbb{E}_{q(\theta)}\bigg[\text{H}\Big[p_{\theta}(y|x)\Big]\bigg]$$

Here, the first term denotes the total uncertainty, from which the second term, the *data uncertainty*, is subtracted, leaving only the model uncertainty. Usually, the expectation in both terms would be taken over the weight posterior $p(\theta|\mathcal{D})$ of the model. This posterior is generally intractable to evaluate for neural networks, which is why we use an approximate posterior $q(\theta)$ instead. To evaluate the expectation, we use Monte Carlo sampling, simply averaging the predictions coming from different sets of weights: In the case of the LSTM ensemble, these come from the different ensemble members; for the Variational BERT, they correspond to predictions using different dropout masks.
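
As a rough illustration of this Monte Carlo estimate, the sketch below computes mutual information from a list of sampled predictive distributions, one per ensemble member or dropout mask. The function names and the toy inputs are assumptions for this example and do not mirror the package's internals.

```python
import math


def entropy(probs):
    """Shannon entropy of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(sampled_probs):
    """Entropy of the averaged prediction minus the average entropy of the individual predictions."""
    num_samples = len(sampled_probs)
    num_classes = len(sampled_probs[0])

    # Monte Carlo estimate of the expected predictive distribution
    mean_probs = [
        sum(sample[k] for sample in sampled_probs) / num_samples
        for k in range(num_classes)
    ]

    total_uncertainty = entropy(mean_probs)
    data_uncertainty = sum(entropy(sample) for sample in sampled_probs) / num_samples

    return total_uncertainty - data_uncertainty


# All samples agree on an uncertain prediction: only data uncertainty, no model uncertainty
agreeing = mutual_information([[0.5, 0.5], [0.5, 0.5]])
# Samples are individually confident but contradict each other: pure model uncertainty
disagreeing = mutual_information([[1.0, 0.0], [0.0, 1.0]])
```

The two toy cases show why the metric isolates model uncertainty: it is zero when all sampled predictions agree, no matter how flat they are, and grows when the samples disagree.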

In [23]:
def make_single_uncertainty(model, input_, tokenizer, metric):
  """
  Compute the uncertainty for a single sentence for the Rotten Tomatoes sentiment classification task.
  """
  tokenized_input = tokenizer(input_, return_tensors="pt", return_attention_mask=True)
  uncertainty = model.get_uncertainty(tokenized_input["input_ids"], attention_mask=tokenized_input["attention_mask"], metric=metric).cpu().numpy()[0][0]

  return uncertainty

As you can see from the snippet above, you can get the uncertainty for an input using the `get_uncertainty()` function of a model object. You can choose the metric via the `metric` argument. If no metric name is given, models fall back onto the metric defined in the `default_uncertainty_metric` attribute, which usually corresponds to the predictive entropy metric we just discussed.
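
If you want to compare all metrics a model offers at once, a small helper along the following lines might be convenient. It is only a sketch: the name `all_uncertainties` is made up for this demo, and it relies solely on the `available_uncertainty_metrics` attribute and the `get_uncertainty()` call as they are used in this notebook.

```python
def all_uncertainties(model, input_ids, attention_mask):
    """Query every uncertainty metric the model reports as available for one tokenized input."""
    return {
        name: model.get_uncertainty(input_ids, attention_mask=attention_mask, metric=name)
        for name in model.available_uncertainty_metrics
    }
```

With the objects defined in this notebook, a call could look like `all_uncertainties(ensemble, tokenized_input["input_ids"], tokenized_input["attention_mask"])`, where `tokenized_input` is the tokenizer output for a sentence as in `make_single_uncertainty()`.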

In [24]:
# Uncertainty estimates from the ensemble
max_prob_orig_ensemble = make_single_uncertainty(ensemble, original_sentence, tokenizer, metric="max_prob")
max_prob_corrupted_ensemble = make_single_uncertainty(ensemble, sentence3, tokenizer, metric="max_prob")
max_prob_german_ensemble = make_single_uncertainty(ensemble, sentence1, tokenizer, metric="max_prob")

print(f"Maximum softmax probability | LSTM Ensemble: Original sentence {max_prob_orig_ensemble:.3f} / corrupted sentence {max_prob_corrupted_ensemble:.3f} / German sentence {max_prob_german_ensemble:.3f}")

mutual_info_orig_ensemble = make_single_uncertainty(ensemble, original_sentence, tokenizer, metric="mutual_information")
mutual_info_corrupted_ensemble = make_single_uncertainty(ensemble, sentence3, tokenizer, metric="mutual_information")
mutual_info_german_ensemble = make_single_uncertainty(ensemble, sentence1, tokenizer, metric="mutual_information")

print(f"Mutual information | LSTM Ensemble: Original sentence {mutual_info_orig_ensemble:.3f} / corrupted sentence {mutual_info_corrupted_ensemble:.3f} / German sentence {mutual_info_german_ensemble:.3f}")

Maximum softmax probability | LSTM Ensemble: Original sentence 0.588 / corrupted sentence 0.597 / German sentence 0.620
Mutual information | LSTM Ensemble: Original sentence 0.647 / corrupted sentence 0.654 / German sentence 0.583


In [25]:
# Uncertainty estimates from the variational BERT
max_prob_orig_variational_bert = make_single_uncertainty(variational_bert, original_sentence, tokenizer, metric="max_prob")
max_prob_corrupted_variational_bert = make_single_uncertainty(variational_bert, sentence3, tokenizer, metric="max_prob")
max_prob_german_variational_bert = make_single_uncertainty(variational_bert, sentence1, tokenizer, metric="max_prob")

print(f"Maximum softmax probability | Variational BERT: Original sentence {max_prob_orig_variational_bert:.3f} / corrupted sentence {max_prob_corrupted_variational_bert:.3f} / German sentence {max_prob_german_variational_bert:.3f} ")

mutual_info_orig_variational_bert = make_single_uncertainty(variational_bert, original_sentence, tokenizer, metric="mutual_information")
mutual_info_corrupted_variational_bert = make_single_uncertainty(variational_bert, sentence3, tokenizer, metric="mutual_information")
mutual_info_german_variational_bert = make_single_uncertainty(variational_bert, sentence1, tokenizer, metric="mutual_information")

print(f"Mutual information | Variational BERT: Original sentence {mutual_info_orig_variational_bert:.3f} / corrupted sentence {mutual_info_corrupted_variational_bert:.3f} / German sentence {mutual_info_german_variational_bert:.3f}")

Maximum softmax probability | Variational BERT: Original sentence 0.009 / corrupted sentence 0.009 / German sentence 0.009 
Mutual information | Variational BERT: Original sentence 0.009 / corrupted sentence 0.008 / German sentence 0.009


Here we can see that, with the maximum softmax probability, both the ensemble and the Variational BERT indicate either the same or higher uncertainty on the corrupted and the German sentence than on the original. With mutual information this is not always the case: The ensemble assigns a lower score to the German sentence, and the Variational BERT a slightly lower score to the corrupted sentence! This is an open problem in uncertainty quantification research: It is sometimes unclear when our uncertainty estimates will be reliable. For this reason, we show how to evaluate them in detail next.

## Evaluating the quality of uncertainty estimates

As we have done before with the raw probabilities, we also want to know how reliable the uncertainty estimates of our models are. The package provides several ways to do this: Firstly, we can evaluate them using an out-of-distribution (OOD) detection task - the model should be more uncertain on data points that are unlike the ones in the training set. Treating the uncertainty scores as predictions of whether a point is OOD, we can use binary classification metrics like the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC) to evaluate this. In our Rotten Tomatoes example, we will use Tweets from a different dataset as OOD data.
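
To illustrate how uncertainty scores turn into an OOD detection metric, here is a minimal sketch of AUROC in its pairwise form: the probability that a randomly chosen OOD point receives a higher uncertainty score than a randomly chosen in-distribution point. The function name `ood_auroc` and the toy scores are assumptions for this example; the package may compute the metric differently.

```python
def ood_auroc(id_scores, ood_scores):
    """Fraction of (ID, OOD) pairs in which the OOD point is more uncertain; ties count half."""
    wins = 0.0

    for ood_score in ood_scores:
        for id_score in id_scores:
            if ood_score > id_score:
                wins += 1.0
            elif ood_score == id_score:
                wins += 0.5

    return wins / (len(id_scores) * len(ood_scores))


# Uncertainty separates the two groups perfectly
perfect = ood_auroc(id_scores=[0.1, 0.2], ood_scores=[0.8, 0.9])
# Uncertainty carries no information about whether a point is OOD
uninformative = ood_auroc(id_scores=[0.1, 0.9], ood_scores=[0.5, 0.5])
```

A value of 1 means that every OOD point is more uncertain than every in-distribution point, while 0.5 corresponds to random guessing.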

The other way, introduced by Ulmer et al. (2022), is to measure how much high uncertainty corresponds to the model making wrong predictions. This is quantified by collecting the model loss and uncertainty for all points in the test set, and measuring their correlation using [Kendall's $\tau$ correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient). The values range from -1 to 1, with 1 indicating that high uncertainty perfectly correlates with high model loss.
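
For intuition, the following sketch computes the simplest variant of Kendall's $\tau$ (often called tau-a) by counting concordant and discordant pairs of (loss, uncertainty) values. It ignores the tie correction that library implementations typically apply, and the function name and toy values are assumptions for illustration.

```python
from itertools import combinations


def kendalls_tau(losses, uncertainties):
    """(concordant pairs - discordant pairs) / number of pairs, without tie correction."""
    concordant = discordant = 0

    for (loss_a, unc_a), (loss_b, unc_b) in combinations(zip(losses, uncertainties), 2):
        direction = (loss_a - loss_b) * (unc_a - unc_b)
        if direction > 0:
            concordant += 1  # both quantities are ordered the same way
        elif direction < 0:
            discordant += 1  # the two quantities are ordered in opposite ways

    num_pairs = len(losses) * (len(losses) - 1) / 2
    return (concordant - discordant) / num_pairs


# Higher loss always goes along with higher uncertainty
aligned = kendalls_tau([0.1, 0.5, 2.0], [0.05, 0.2, 0.6])
# One of the three pairs is ordered the wrong way round
partly_aligned = kendalls_tau([0.1, 0.5, 2.0], [0.05, 0.6, 0.2])
```

Because only the ordering of the values matters, the coefficient is insensitive to the different scales of the loss and the uncertainty metrics.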

In [26]:
from nlp_uncertainty_zoo.utils.uncertainty_eval import evaluate_uncertainty

In [27]:
evaluate_uncertainty(ensemble, id_eval_split=test_loader, ood_eval_split=ood_test_loader)

Evaluating batch 2/4...: 100%|██████████| 4/4 [01:09<00:00, 17.47s/it]


defaultdict(float,
            {'kendalls_tau_id_max_prob_seq': KendalltauResult(correlation=0.3199999999999999, pvalue=0.025387066467633498),
             'kendalls_tau_id_predictive_entropy_seq': KendalltauResult(correlation=0.24666666666666665, pvalue=0.0882790830212813),
             'kendalls_tau_id_dempster_shafer_seq': KendalltauResult(correlation=0.3199999999999999, pvalue=0.025387066467633498),
             'kendalls_tau_id_softmax_gap_seq': KendalltauResult(correlation=0.3666666666666666, pvalue=0.009856596254745291),
             'kendalls_tau_id_variance_seq': KendalltauResult(correlation=0.23999999999999996, pvalue=0.09754603408321157),
             'kendalls_tau_id_mutual_information_seq': KendalltauResult(correlation=-0.006666666666666665, pvalue=0.981568217609005),
             'kendalls_tau_ood_max_prob_seq': KendalltauResult(correlation=0.14666666666666664, pvalue=0.3187493522432472),
             'kendalls_tau_ood_predictive_entropy_seq': KendalltauResult(correlation

In [28]:
evaluate_uncertainty(variational_bert, id_eval_split=test_loader, ood_eval_split=ood_test_loader)

Evaluating batch 2/4...: 100%|██████████| 4/4 [10:31<00:00, 149.47s/it]

defaultdict(float,
            {'kendalls_tau_id_max_prob_seq': KendalltauResult(correlation=-0.14666666666666664, pvalue=0.3187493522432472),
             'kendalls_tau_id_predictive_entropy_seq': KendalltauResult(correlation=-0.1533333333333333, pvalue=0.29659240311553436),
             'kendalls_tau_id_dempster_shafer_seq': KendalltauResult(correlation=0.06666666666666665, pvalue=0.6604714750997261),
             'kendalls_tau_id_softmax_gap_seq': KendalltauResult(correlation=-0.21999999999999997, pvalue=0.12994145940142376),
             'kendalls_tau_id_variance_seq': KendalltauResult(correlation=-0.10666666666666665, pvalue=0.4730453015604536),
             'kendalls_tau_id_mutual_information_seq': KendalltauResult(correlation=0.03999999999999999, pvalue=0.7993481021210237),
             'kendalls_tau_ood_max_prob_seq': KendalltauResult(correlation=-0.35999999999999993, pvalue=0.011371729355231645),
             'kendalls_tau_ood_predictive_entropy_seq': KendalltauResult(correlat


Thanks for reading through this demo! We only showcase the most useful, but not all, of the functionalities of the package here. If you would like to know more about the different models and functionalities in the package, consult [the documentation](http://dennisulmer.eu/nlp-uncertainty-zoo/). If you find any bugs or have requests for missing features, please [open an issue on the GitHub repository](https://github.com/Kaleidophon/nlp-uncertainty-zoo/issues). Below you can find the papers that were referenced in this demo:

Gal, Yarin, and Zoubin Ghahramani. "A theoretically grounded application of dropout in recurrent neural networks." Advances in neural information processing systems 29 (2016).

Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." International Conference on Machine Learning. PMLR, 2016.

Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. "On calibration of modern neural networks." International Conference on Machine Learning. PMLR, 2017.

Hendrycks, Dan, and Kevin Gimpel. "A baseline for detecting misclassified and out-of-distribution examples in neural networks." arXiv preprint arXiv:1610.02136 (2016).

Kompa, Benjamin, Jasper Snoek, and Andrew L. Beam. "Empirical frequentist coverage of deep learning uncertainty quantification procedures." Entropy 23.12 (2021): 1608.

Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017).

Smith, Lewis, and Yarin Gal. "Understanding measures of uncertainty for adversarial example detection." arXiv preprint arXiv:1803.08533 (2018).

Ulmer, Dennis Thomas, Jes Frellsen, and Christian Hardmeier. "Exploring Predictive Uncertainty and Calibration in NLP: A Study on the Impact of Method & Data Scarcity." Findings of 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2022.

Xiao, Tim Z., Aidan N. Gomez, and Yarin Gal. "Wat zei je? detecting out-of-distribution translations with variational transformers." arXiv preprint arXiv:2006.08344 (2020).


