# Find Label Errors in Span Classification Datasets

This tutorial shows how you can use cleanlab to find potential label errors in text datasets for span classification. In span classification, our data consists of sentences in which every token (aka word) is labeled with one or more span classes, and we train models to predict which tokens belong to each span class in a new sentence. This tutorial focuses on question answering as an example of a span classification task. Here, we use a subset of the SQuAD dataset that contains 100 examples of questions, answers, and contexts. Each token is labeled with one of two classes:

- O (other type of word)
- ANS (answer span)

Overview of this notebook:

- Find tokens with label issues using `cleanlab.experimental.span_classification.find_label_issues`
- Rank sentences based on their overall label quality score using `cleanlab.experimental.span_classification.get_label_quality_scores`

***Note: this notebook must be run inside an environment where `cleanlab.experimental.span_classification` is implemented.***

## 1. Load required dependencies and dataset

In [1]:
import numpy as np

from cleanlab.token_classification.rank import issues_from_scores

from cleanlab.experimental.span_classification import find_label_issues, display_issues, get_label_quality_scores


## 2. Get data, labels, and pred_probs

In span classification, every token is labeled with one or more of the K span classes. To find label issues, cleanlab requires predicted class probabilities from a trained classifier. These `pred_probs` contain, for each token, the predicted probability that it belongs to the `ANS` span. Here we use `pred_probs` obtained from a BERT Transformer fit via cross-validation. Since cleanlab requires a tokenized version of each input, we also convert the question and context in the original dataset into tokens and their corresponding span class labels. The notebook "span categorizer training" contains the code that produces `pred_probs`, `labels`, `tokens`, and `questions` and saves each of them in a `.npz` file. Here, we load this data via the `read_npz` function.
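
Because each token carries a single probability for `ANS`, the probability of `O` is its complement. The sketch below shows one way to expand the per-token values into two columns, which can make the relation to ordinary two-class token classification easier to see. The helper name `to_two_class` and the toy values are assumptions made for illustration; this is not necessarily what cleanlab does internally.

```python
import numpy as np


def to_two_class(probs_ans):
    """Hypothetical helper: expand per-token P(ANS) into columns [P(O), P(ANS)]."""
    probs_ans = np.asarray(probs_ans, dtype=float)
    return np.stack([1.0 - probs_ans, probs_ans], axis=1)


# Made-up probabilities for a three-token sentence.
toy_probs = np.array([0.04, 0.90, 0.50])
two_class = to_two_class(toy_probs)

print(two_class.shape)  # one row per token, one column per class
```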

In [3]:
# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.
def read_npz(filepath):
    """Load arrays stored under the keys "0", "1", ... and return them as a list in index order."""
    data = dict(np.load(filepath))
    return [data[str(i)] for i in range(len(data))]

In [4]:
pred_probs = read_npz('pred_probs.npz')
labels = read_npz('labels.npz')
tokens = read_npz('tokens.npz')
questions = read_npz('questions.npz')
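
The loader above expects each `.npz` file to store the `i`-th example under the string key `str(i)`. The training notebook is not shown here, so the following is only a sketch of how such a file could be written and read back. The helper `write_npz`, the file name, and the toy values are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def write_npz(filepath, arrays):
    """Hypothetical helper: store the i-th array under the string key str(i)."""
    np.savez(filepath, **{str(i): np.asarray(a) for i, a in enumerate(arrays)})


def read_npz(filepath):
    """Mirrors the loader above, but closes the file once the arrays are read."""
    with np.load(filepath) as data:
        return [data[str(i)] for i in range(len(data.files))]


# Made-up per-sentence probabilities with different lengths.
toy_pred_probs = [np.array([0.1, 0.9, 0.8]), np.array([0.05, 0.2])]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pred_probs.npz")
    write_npz(path, toy_pred_probs)
    restored = read_npz(path)

print([arr.shape for arr in restored])
```

Storing one key per example keeps sentences of different lengths separate, which a single rectangular array could not do without padding.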

In [9]:
index_to_preview = 52  # index of the example to inspect

print("Question:\t" + str(questions[index_to_preview]))
print(f'{"Tokens":<15} {"True Label":<15} {"Predicted Probabilities":<30}')
print('-' * 60)
# Print each token next to its given label and its predicted probability of ANS.
for token, label, pred_prob in zip(
    tokens[index_to_preview], labels[index_to_preview], pred_probs[index_to_preview]
):
    print(f'{str(token):<15} {str(label):<15} {str(pred_prob):<30}')

Question:	What book did John Zahm write in 1896?
Tokens          True Label      Predicted Probabilities       
------------------------------------------------------------
Father          0               0.04192225                    
Joseph          0               0.04277855                    
Carrier         0               0.042555753                   
,               0               0.03997254                    
C               0               0.04447972                    
.               0               0.04259737                    
S               0               0.04415207                    
.               0               0.0425685                     
C               0               0.044872615                   
.               0               0.042509954                   
was             0               0.038764838                   
Director        0               0.042133622                   
of              0               0.042326912                   
the     

`pred_probs` and `labels` are the inputs cleanlab requires to find label issues, while `tokens` and `questions` provide additional context for interpreting the issues that are found. `pred_probs` and `labels` should be formatted as:

- `pred_probs` is a list whose `i`-th element is a list (or array) of `N_i` floats giving the predicted probability of the `ANS` class for each token in the `i`-th sentence, where `N_i` is the number of tokens in that sentence.
- `labels` is a list whose `i`-th element is a list of `N_i` integers giving the class label of each token in the `i`-th sentence. For single-class span classification, each label must be 0 or 1.
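
To make the expected format concrete, here is a small sketch with made-up data for two short sentences, together with a sanity check on the structure. The function `check_span_inputs` is a hypothetical helper written for this illustration and is not part of cleanlab.

```python
import numpy as np

# Made-up toy data for two sentences (not taken from SQuAD).
toy_labels = [[0, 1, 1, 0], [0, 0, 1]]
toy_pred_probs = [
    np.array([0.05, 0.90, 0.85, 0.10]),
    np.array([0.02, 0.60, 0.95]),
]


def check_span_inputs(labels, pred_probs):
    """Hypothetical sanity check for the format described above."""
    if len(labels) != len(pred_probs):
        raise ValueError("labels and pred_probs need one entry per sentence")
    for i, (sent_labels, sent_probs) in enumerate(zip(labels, pred_probs)):
        sent_probs = np.asarray(sent_probs, dtype=float)
        if sent_probs.ndim != 1 or len(sent_labels) != len(sent_probs):
            raise ValueError(f"sentence {i}: need one probability per token")
        if not set(sent_labels) <= {0, 1}:
            raise ValueError(f"sentence {i}: labels must be 0 or 1")
        if np.any((sent_probs < 0) | (sent_probs > 1)):
            raise ValueError(f"sentence {i}: probabilities must lie in [0, 1]")
    return True


print(check_span_inputs(toy_labels, toy_pred_probs))
```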


## 3. Use cleanlab to find label issues

In [6]:
issues = find_label_issues(labels, pred_probs)

The returned `issues` is a list of tuples `(i, j)`, each of which refers to the `j`-th token of the `i`-th sentence in the dataset. These are the tokens cleanlab thinks may be mislabeled in your dataset.
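
Since each issue is just an index pair, the flagged token and its given label can be looked up directly. The sketch below uses made-up tokens, labels, and issues in the `(i, j)` format; `describe_issues` is a hypothetical helper, not a cleanlab function.

```python
# Made-up toy data; in the tutorial these come from the .npz files.
toy_tokens = [["Who", "wrote", "it", "?"], ["The", "book", "Evolution"]]
toy_labels = [[0, 0, 1, 0], [0, 0, 1]]
toy_issues = [(0, 2), (1, 1)]  # hypothetical issues in the (i, j) format
class_names = ["O", "ANS"]


def describe_issues(issues, tokens, labels, class_names):
    """Hypothetical helper: return (sentence, token index, token, given label) rows."""
    rows = []
    for i, j in issues:
        rows.append((i, j, tokens[i][j], class_names[labels[i][j]]))
    return rows


rows = describe_issues(toy_issues, toy_tokens, toy_labels, class_names)
for row in rows:
    print(row)
```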

In [7]:
display_issues(issues, tokens, labels=labels, class_names=['O', 'ANS'])

Sentence index: 52, Token index: 124
Token: and
Given label: ANS
----
Father Joseph Carrier, C. S. C. was Director of the Science Museum and the Library and Professor of Chemistry and Physics until 1874. Carrier taught that scientific research and its promise for progress were not antagonist ic to the ideals of intellectual and moral culture endorsed by the Church. One of Carrier' s students was Father John Augustine Z ah m (1851 – 1921) who was made Professor and Co - Director of the Science Department at age 23 and by 1900 was a nationally prominent scientist and naturalist. Z ah m was active in the Catholic Summer School movement, which introduced Catholic la ity to contemporary intellectual issues. His book Evolution and Dog ma (1896) defended certain aspects of evolutionary theory as true, and argued, more over, that even the great Church teachers Thomas A quin as and Augustine taught something like it. The intervention of Irish American Catholics in Rome prevented Z ah m' s c ens

We can also compute label quality scores for each sentence and each token with `get_label_quality_scores`, then convert these scores into a list of token issues with `issues_from_scores`.

In [8]:
sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)
issues = issues_from_scores(sentence_scores, token_scores=token_scores)
display_issues(issues, tokens, labels=labels, class_names=['O', 'ANS'])

Sentence index: 52, Token index: 124
Token: and
Given label: ANS
----
Father Joseph Carrier, C. S. C. was Director of the Science Museum and the Library and Professor of Chemistry and Physics until 1874. Carrier taught that scientific research and its promise for progress were not antagonist ic to the ideals of intellectual and moral culture endorsed by the Church. One of Carrier' s students was Father John Augustine Z ah m (1851 – 1921) who was made Professor and Co - Director of the Science Department at age 23 and by 1900 was a nationally prominent scientist and naturalist. Z ah m was active in the Catholic Summer School movement, which introduced Catholic la ity to contemporary intellectual issues. His book Evolution and Dog ma (1896) defended certain aspects of evolutionary theory as true, and argued, more over, that even the great Church teachers Thomas A quin as and Augustine taught something like it. The intervention of Irish American Catholics in Rome prevented Z ah m' s c ens
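
The `sentence_scores` returned above can also be used to rank sentences for review. A minimal sketch, assuming that lower scores indicate sentences more likely to contain a label issue (the convention used by cleanlab's other label quality scores); the score values below are made up.

```python
import numpy as np

# Made-up scores standing in for the `sentence_scores` returned above.
toy_sentence_scores = np.array([0.92, 0.15, 0.78, 0.40])

# Assumption: a lower score means the sentence is more likely to contain an issue,
# so sorting in ascending order puts the most suspicious sentences first.
ranked = np.argsort(toy_sentence_scores)
worst_two = ranked[:2].tolist()

print(worst_two)  # [1, 3]
```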