# Lab -- Dataset Curation with Multiple Annotators

This notebook contains exercises to analyze an existing classification dataset labeled by multiple annotators (e.g. collected via crowdsourcing).

You may find it helpful to first install the following dependencies:

In [1]:
!pip install cleanlab
# We originally used the version: cleanlab==2.2.0
# This automatically installs other required packages like numpy, pandas, sklearn

Collecting cleanlab
  Obtaining dependency information for cleanlab from https://files.pythonhosted.org/packages/84/1f/3574ee21be3378eecda48c55ec02d0dc1cf95fae90cb00fd57373d85cdbc/cleanlab-2.5.0-py3-none-any.whl.metadata
  Downloading cleanlab-2.5.0-py3-none-any.whl.metadata (23 kB)
Collecting termcolor>=2.0.0 (from cleanlab)
  Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Downloading cleanlab-2.5.0-py3-none-any.whl (285 kB)

In [1]:
import numpy as np
import pandas as pd

## Analyzing dataset labeled by multiple annotators

We simulate a small classification dataset (3 classes, 2-dimensional features) whose ground truth labels are hidden from our analysis. The analysis uses labels from noisy annotators: each annotated label is derived from the ground truth label but is wrong with some probability determined by the annotator's underlying quality. In subsequent exercises, assume that both the ground truth labels and the true annotator qualities are unknown to you.

In [2]:
## You don't need to understand this cell, it's just used for generating the dataset

SEED = 123  # for reproducibility
np.random.seed(seed=SEED)

def make_data(sample_size = 300):
    """ Produce a 3-class classification dataset with 2-dimensional features and multiple noisy annotations per example. """
    num_annotators=50  # total number of data annotators
    class_frequencies = [0.5, 0.25, 0.25]
    sizes=[int(np.ceil(freq*sample_size)) for freq in class_frequencies]  # number of examples belonging to each class
    good_annotator_quality = 0.6
    bad_annotator_quality = 0.3
    
    # Underlying statistics of the dataset (unknown to you)
    means=[[3, 2], [7, 7], [0, 8]]
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]]
    
    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    # Generate features and true labels
    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Generate noisy labels from each annotator
    s = pd.DataFrame(
        np.vstack(
            [
                generate_noisy_labels(true_labels_train, good_annotator_quality)
                if i < num_annotators - 10  # last 10 annotators are worse
                else generate_noisy_labels(true_labels_train, bad_annotator_quality)
                for i in range(num_annotators)
            ]
        ).transpose()
    )

    # Each annotator only labels approximately 10% of the dataset (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype("Int64")
    s.dropna(axis=1, how="all", inplace=True)
    s.columns = ["A" + str(i).zfill(4) for i in range(1, num_annotators+1)]
    # Drop rows not annotated by anybody
    row_NA_check = pd.notna(s).any(axis=1)
    X_train = X_train[row_NA_check]
    true_labels_train = true_labels_train[row_NA_check]
    multiannotator_labels = s[row_NA_check].reset_index(drop=True)
    # Shuffle the rows of the dataset
    shuffled_indices = np.random.permutation(len(X_train))
    return {
        "X_train": X_train[shuffled_indices],
        "true_labels_train": true_labels_train[shuffled_indices],
        "multiannotator_labels": multiannotator_labels.iloc[shuffled_indices],
    }

def generate_noisy_labels(true_labels, annotator_quality):
    """ Keeps each true label with probability annotator_quality; otherwise replaces it with a class drawn uniformly at random (which may coincide with the true label). """
    n = len(true_labels)
    m = np.max(true_labels) + 1  # number of classes
    annotated_labels = np.random.randint(low=0, high=m, size=n)
    correctly_labeled_indices = np.random.random(n) < annotator_quality
    annotated_labels[correctly_labeled_indices] = true_labels[correctly_labeled_indices]
    return annotated_labels
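
Under the noise model in `generate_noisy_labels` above, an annotation is correct either because the true label was kept (probability `annotator_quality`) or because the uniformly random replacement happened to equal the true label. The following is a minimal sketch of the accuracy this implies, assuming 3 classes as in the simulation; the helper names `expected_accuracy` and `simulated_accuracy` are ours, for illustration only.

```python
import numpy as np

def expected_accuracy(annotator_quality, num_classes=3):
    """Probability that a simulated annotation equals the true label."""
    # Either the true label is kept, or the random draw happens to match it.
    return annotator_quality + (1 - annotator_quality) / num_classes

def simulated_accuracy(annotator_quality, num_classes=3, n=200_000, seed=0):
    """Empirical check that mirrors the logic of generate_noisy_labels."""
    rng = np.random.default_rng(seed)
    true_labels = rng.integers(0, num_classes, size=n)
    annotated = rng.integers(0, num_classes, size=n)
    keep = rng.random(n) < annotator_quality
    annotated[keep] = true_labels[keep]
    return np.mean(annotated == true_labels)

for q in (0.6, 0.3):
    print(q, round(expected_accuracy(q), 3), round(simulated_accuracy(q), 3))
```

So the "good" annotators (quality 0.6) are correct with probability 0.6 + 0.4/3 ≈ 0.733, and the "bad" ones (quality 0.3) with probability 0.3 + 0.7/3 ≈ 0.533.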

In [3]:
data_dict = make_data(sample_size = 300)
X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"] # used for comparing the accuracy of consensus labels

Let's view the data used for this exercise. Below are the labels selected by each annotator: each row is an example and each column is an annotator. Not every annotator labeled every example; an annotation is an integer class (0, 1, or 2 for our 3 classes) when the annotator labeled the example, and `NA` otherwise.

In [28]:
multiannotator_labels

Unnamed: 0,A0001,A0002,A0003,A0004,A0005,A0006,A0007,A0008,A0009,A0010,...,A0041,A0042,A0043,A0044,A0045,A0046,A0047,A0048,A0049,A0050
247,,2,,,,,,,,,...,,,,,,,,,,
290,,,,,,,,,,,...,,,,,,,,,,
262,,,,,,,2,,,,...,,,,,,,,,,
182,,,,,,,,,,,...,,,,,,,,,,
143,,,0,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255,2,,,,,,,,,,...,,,,2,,,,,,0
14,,,,,,,,0,,,...,,,1,,,,,,0,
195,,,,,,,,,,,...,,,,,,,,,,
85,,,,,,,,,,,...,,,,,,,,,,
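
Because most entries are `NA`, it can help to check how many annotations each example and each annotator actually has. Here is a small sketch on a toy table with the same layout (rows are examples, columns are annotators); `toy_labels` is a made-up stand-in, not the lab's data.

```python
import pandas as pd

# Toy stand-in for multiannotator_labels, using the same nullable integer dtype.
toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, None, 2, None], dtype="Int64"),
        "A0002": pd.array([0, 1, None, None], dtype="Int64"),
        "A0003": pd.array([None, 1, 2, 0], dtype="Int64"),
    }
)

# notna() marks the cells that hold an annotation; summing counts them.
annotations_per_example = toy_labels.notna().sum(axis=1)
annotations_per_annotator = toy_labels.notna().sum(axis=0)

print(annotations_per_example.tolist())
print(annotations_per_annotator.to_dict())
```

The same two lines should work on `multiannotator_labels` directly.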


Here are the corresponding 2D data features for these examples:

In [5]:
X[:5]

array([[ 1.01592896, 10.62213634],
       [-1.91393643,  6.53944268],
       [ 0.55962291,  5.35885902],
       [ 6.73677377,  5.02311322],
       [ 6.95949986,  1.61434817]])

### Train model with cross-validation

In this exercise, we consider the simple K Nearest Neighbors classification model, which produces **predicted class probabilities** for a particular example via a **(weighted) average** of the labels of the K closest examples. We will train this model via cross-validation and use it to produce held-out predictions for each example in our dataset.
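
To make the "weighted average" concrete, here is a sketch that computes distance-weighted K Nearest Neighbors probabilities by hand for one query point and compares them with scikit-learn's `KNeighborsClassifier(weights="distance")`. The toy data and the helper `knn_distance_weighted_probs` are illustrative assumptions, and the helper assumes no neighbor sits at distance exactly 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))
y_toy = np.arange(30) % 3          # three classes, all present
query = np.array([0.25, -0.1])     # a point that is not in the training set

def knn_distance_weighted_probs(X_train, y_train, x, k=5, num_classes=3):
    """Each of the k nearest neighbors votes for its label with weight 1/distance."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / distances[nearest]
    probs = np.zeros(num_classes)
    for label, weight in zip(y_train[nearest], weights):
        probs[label] += weight
    return probs / probs.sum()

manual_probs = knn_distance_weighted_probs(X_toy, y_toy, query)
model = KNeighborsClassifier(weights="distance").fit(X_toy, y_toy)
sklearn_probs = model.predict_proba(query.reshape(1, -1))[0]

print(manual_probs)
print(sklearn_probs)
```

Closer neighbors therefore contribute more to the predicted probabilities than distant ones.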

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

def train_model(labels_to_fit):
    """ Trains a simple K Nearest Neighbors model on the data features X with y = labels_to_fit, via cross-validation.
        Returns out-of-sample predicted class probabilities for each example in the dataset
        (from a copy of model that was never trained on this example).
        Also evaluates the held-out class predictions against ground truth labels.
    """
    num_crossval_folds = 5  # number of folds of cross-validation
    # model = MLPClassifier(max_iter=1000, random_state=SEED)
    model = KNeighborsClassifier(weights="distance")
    pred_probs = cross_val_predict(
        estimator=model, X=X, y=labels_to_fit, cv=num_crossval_folds, method="predict_proba",  # return predicted class probabilities rather than hard labels
    )
    
    class_predictions = np.argmax(pred_probs, axis=1)
    held_out_accuracy = np.mean(class_predictions == true_labels)
    print(f"Accuracy of held-out model predictions against ground truth labels: {held_out_accuracy}")
    return pred_probs

Here we demonstrate how to train and evaluate this model. Note that **the evaluation is against ground truth labels**, which you **wouldn't have in real applications**, so this evaluation is just for demonstration purposes. We'll first fit this model using **labels consisting** of one randomly selected annotation for each example.

In [7]:
labels_from_random_annotators = true_labels.copy()
for i in range(len(multiannotator_labels)):  # keep only the annotations that are present (non-NA) for example i
    annotations_for_example_i = multiannotator_labels.iloc[i][pd.notna(multiannotator_labels.iloc[i])] 
    labels_from_random_annotators[i] = np.random.choice(annotations_for_example_i.values)

print(f"Accuracy of random annotators' labels against ground truth labels: {np.mean(labels_from_random_annotators == true_labels)}")
pred_probs_from_model_fit_to_random_annotators = train_model(labels_to_fit = labels_from_random_annotators)


Accuracy of random annotators' labels against ground truth labels: 0.6822742474916388
Accuracy of held-out model predictions against ground truth labels: 0.8093645484949833


We can also fit this model using the ground truth labels (which you would not be able to do in practice), just to see how good it could be:

In [8]:
pred_probs_from_unrealistic_model_fit_to_true_labels = train_model(labels_to_fit = true_labels)

Accuracy of held-out model predictions against ground truth labels: 0.9765886287625418
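
The out-of-sample property of `cross_val_predict` can be seen by writing the folds out by hand. This is a sketch on toy data (the names `X_toy`, `y_toy`, and `manual_pred_probs` are ours). It assumes that, for a classifier, an integer `cv` makes `cross_val_predict` use unshuffled stratified folds, which is what the comparison at the end checks.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 20)
X_toy = rng.normal(size=(60, 2)) + y_toy[:, None] * 2.0

manual_pred_probs = np.zeros((len(X_toy), 3))
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    # Fit a fresh model copy on the training folds only ...
    fold_model = KNeighborsClassifier(weights="distance")
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    # ... and predict only for the held-out examples it never saw.
    manual_pred_probs[test_idx] = fold_model.predict_proba(X_toy[test_idx])

library_pred_probs = cross_val_predict(
    KNeighborsClassifier(weights="distance"), X_toy, y_toy, cv=5, method="predict_proba"
)
print(np.allclose(manual_pred_probs, library_pred_probs))
```

Every row of the result comes from a model that was not trained on that row, which is why these probabilities are suitable inputs for label-quality estimation.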


## Exercise 1

Compute majority-vote consensus labels for each example from the data in `multiannotator_labels`. Think about how to best break ties! **(A tie occurs when two or more classes receive the same highest number of votes for an example, so an extra rule is needed to decide the winner.)**

- Evaluate the **accuracy of these majority-vote consensus labels against the ground truth labels.**
- Also set these as `labels_to_fit` in `train_model()` to see the **resulting model's accuracy when trained with majority vote consensus labels.**
- Estimate the **quality of each annotator** (how accurate their labels tend to be overall) using only these majority-vote consensus labels (assume the ground truth labels are unavailable, as they would be in practice). **Who do you guess are the worst 10 annotators?**
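
As a starting point, a strict majority vote with random tie-breaking could be sketched as follows. The function `majority_vote_with_random_ties` and the toy table are illustrative assumptions; random tie-breaking is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

def majority_vote_with_random_ties(labels_df, seed=0):
    """Return the most-voted class per row; ties are broken uniformly at random."""
    rng = np.random.default_rng(seed)
    consensus = []
    for _, row in labels_df.iterrows():
        counts = row.dropna().astype(int).value_counts()
        # All classes that share the highest vote count for this example.
        tied_classes = counts[counts == counts.max()].index.to_numpy()
        consensus.append(int(rng.choice(tied_classes)))
    return np.array(consensus)

toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, 1, None, 1], dtype="Int64"),
        "A0002": pd.array([0, None, 2, None], dtype="Int64"),
        "A0003": pd.array([1, 2, 2, None], dtype="Int64"),
    }
)
toy_consensus = majority_vote_with_random_ties(toy_labels)
print(toy_consensus)
```

In the toy table only the second example is tied (one vote each for classes 1 and 2), so only its consensus label depends on the random seed. Ties could instead be broken using a trained model's predicted probabilities or the overall class frequencies.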

### Method 1: Use Cleanlab Library

In [9]:
# Get initial consensus labels via majority vote and compute out-of-sample predicted probabilities
from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label
majority_vote_label = get_majority_vote_label(multiannotator_labels)
print(f"Accuracy of majority annotators' labels against ground truth labels: {np.mean(majority_vote_label == true_labels)}")
pred_probs_from_model_fit_to_majority_vote = train_model(labels_to_fit = majority_vote_label)

Accuracy of majority annotators' labels against ground truth labels: 0.8294314381270903
Accuracy of held-out model predictions against ground truth labels: 0.9297658862876255


In [10]:
# Estimate the quality of each consensus label and of each annotator; the per-example label quality is shown first
results = get_label_quality_multiannotator(multiannotator_labels, pred_probs_from_model_fit_to_majority_vote, verbose=False)
results["label_quality"].head()

Unnamed: 0,consensus_label,consensus_quality_score,annotator_agreement,num_annotations
247,2,0.467644,1.0,4
290,2,0.780514,1.0,1
262,2,0.780158,0.6,5
182,1,0.834809,0.571429,7
143,1,0.558872,0.666667,3


In [11]:
# The worst 10 annotators:
results["annotator_stats"].head(10)

Unnamed: 0,annotator_quality,agreement_with_consensus,worst_class,num_examples_labeled
A0048,0.369088,0.40625,1,32
A0042,0.37796,0.4,2,35
A0043,0.387579,0.393939,1,33
A0046,0.398884,0.413793,1,29
A0044,0.405077,0.428571,2,21
A0040,0.410907,0.481481,2,27
A0041,0.411296,0.464286,1,28
A0049,0.455185,0.527778,2,36
A0045,0.461153,0.461538,2,26
A0050,0.486747,0.55,2,40


The annotator_stats DataFrame is sorted by increasing annotator_quality, showing us the worst annotators first.

Notice that nine of the ten lowest-ranked annotators in the table above have IDs between A0041 and A0050 (A0040 appears in place of A0047). This is expected because we made the last 10 annotators systematically worse than the rest.

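### Method 2: Reproduction
1. Majority vote:
- Sample each consensus label with probability equal to its share of the annotators' votes. The most-voted class is the most likely outcome and ties are broken at random, but this is not a strict majority vote: a minority label can still be selected.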

In [23]:
# Display the data structure
print(multiannotator_labels.head(10))
annotations_for_examples = {}
for i in range(len(multiannotator_labels)):  # keep only the annotations that are present (non-NA) for each example
    annotations_for_example_i = multiannotator_labels.iloc[i][pd.notna(multiannotator_labels.iloc[i])]
    annotations_for_examples[multiannotator_labels.index[i]] = annotations_for_example_i
    
print(type(annotations_for_examples))
print(annotations_for_examples)


     A0001  A0002  A0003  A0004  A0005  A0006  A0007  A0008  A0009  A0010  \
247   <NA>      2   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
290   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
262   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>      2   <NA>   <NA>   <NA>   
182   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
143   <NA>   <NA>      0   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
161   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
88    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
46    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
105   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
21    <NA>   <NA>      0   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   

     ...  A0041  A0042  A0043  A0044  A0045  A0046  A0047  A0048  A0049  A0050  
247  ...   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>

In [25]:
# Initialize a dictionary to record the final voting annotations for each example
final_votes = {}

# Iterate through each example in the dictionary
for example_id, annotations in annotations_for_examples.items():
    # Count occurrences of labels 0, 1, and 2
    counts = annotations.value_counts()
    
    # Calculate the proportions of the three labels
    total_votes = counts.sum()
    probabilities = counts / total_votes
    
    # Get lists of labels and their corresponding probabilities
    labels = probabilities.index
    probs = probabilities.values
    
    # Sample the final label with probability equal to its vote share (this resolves ties at random, but a minority label can also be drawn)
    final_vote = np.random.choice(labels, p=probs)
    
    # Record the final voting annotation result
    final_votes[example_id] = final_vote

# Print the final voting annotation results for each example
for example_id, final_vote in final_votes.items():
    print(f"Example {example_id}: Final Vote - {final_vote}")


Example 247: Final Vote - 2
Example 290: Final Vote - 2
Example 262: Final Vote - 1
Example 182: Final Vote - 1
Example 143: Final Vote - 0
Example 161: Final Vote - 0
Example 88: Final Vote - 0
Example 46: Final Vote - 0
Example 105: Final Vote - 1
Example 21: Final Vote - 0
Example 254: Final Vote - 2
Example 125: Final Vote - 0
Example 15: Final Vote - 0
Example 101: Final Vote - 0
Example 181: Final Vote - 1
Example 287: Final Vote - 2
Example 113: Final Vote - 0
Example 251: Final Vote - 2
Example 202: Final Vote - 1
Example 171: Final Vote - 1
Example 242: Final Vote - 0
Example 61: Final Vote - 2
Example 184: Final Vote - 1
Example 259: Final Vote - 2
Example 240: Final Vote - 0
Example 284: Final Vote - 2
Example 78: Final Vote - 1
Example 3: Final Vote - 0
Example 11: Final Vote - 0
Example 26: Final Vote - 2
Example 49: Final Vote - 0
Example 73: Final Vote - 0
Example 91: Final Vote - 0
Example 233: Final Vote - 2
Example 248: Final Vote - 2
Example 69: Final Vote - 0
Exampl

In [34]:
labels_from_prob_majority_vote = [final_vote for example_id, final_vote in final_votes.items()]

In [36]:
labels_from_prob_majority_vote = np.array(labels_from_prob_majority_vote)
print(f"Accuracy of sampled consensus labels against ground truth labels: {np.mean(labels_from_prob_majority_vote == true_labels)}")
pred_probs_from_model_fit_to_sampled_consensus = train_model(labels_to_fit = labels_from_prob_majority_vote)


2. Annotator quality (agreement with the final votes):

In [32]:
# Initialize a dictionary to record the annotation count and correct annotation count for each data annotator
annotator_stats = {annotator: {'total_annotations': 0, 'correct_annotations': 0} for annotator in multiannotator_labels.columns}

# Iterate through each example in the dictionary
for example_id, annotations in annotations_for_examples.items():
    # Get the final vote for this example
    final_vote = final_votes[example_id]
    
    # Iterate through the annotations of data annotators
    for annotator, label in annotations.items():
        # Record the annotation count
        annotator_stats[annotator]['total_annotations'] += 1
        
        # Check if the annotation result matches the final vote
        if label == final_vote:
            annotator_stats[annotator]['correct_annotations'] += 1

# Calculate the accuracy for each data annotator
for annotator, stats in annotator_stats.items():
    if stats['total_annotations'] > 0:
        accuracy = stats['correct_annotations'] / stats['total_annotations']
        print(f"Annotator {annotator}: Accuracy - {accuracy:.2%} ({stats['correct_annotations']}/{stats['total_annotations']})")
    else:
        print(f"Annotator {annotator}: No annotations")


Annotator A0001: Accuracy - 52.00% (13/25)
Annotator A0002: Accuracy - 62.96% (17/27)
Annotator A0003: Accuracy - 62.07% (18/29)
Annotator A0004: Accuracy - 48.72% (19/39)
Annotator A0005: Accuracy - 69.23% (18/26)
Annotator A0006: Accuracy - 68.00% (17/25)
Annotator A0007: Accuracy - 57.69% (15/26)
Annotator A0008: Accuracy - 70.83% (17/24)
Annotator A0009: Accuracy - 54.84% (17/31)
Annotator A0010: Accuracy - 67.86% (19/28)
Annotator A0011: Accuracy - 73.53% (25/34)
Annotator A0012: Accuracy - 66.67% (18/27)
Annotator A0013: Accuracy - 60.87% (14/23)
Annotator A0014: Accuracy - 48.57% (17/35)
Annotator A0015: Accuracy - 61.76% (21/34)
Annotator A0016: Accuracy - 66.67% (18/27)
Annotator A0017: Accuracy - 55.56% (20/36)
Annotator A0018: Accuracy - 71.88% (23/32)
Annotator A0019: Accuracy - 56.25% (18/32)
Annotator A0020: Accuracy - 73.33% (22/30)
Annotator A0021: Accuracy - 74.29% (26/35)
Annotator A0022: Accuracy - 65.62% (21/32)
Annotator A0023: Accuracy - 71.43% (25/35)
Annotator A
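
To answer the "worst 10 annotators" question, the per-annotator agreement printed above can be collected into a table and sorted. Below is a minimal sketch on toy data; the function `annotator_agreement` and the toy inputs are ours, and it assumes every annotator labeled at least one example.

```python
import numpy as np
import pandas as pd

def annotator_agreement(labels_df, consensus):
    """Fraction of each annotator's labels that match the consensus, worst first."""
    consensus = pd.Series(np.asarray(consensus), index=labels_df.index)
    rows = {}
    for annotator in labels_df.columns:
        given = labels_df[annotator].dropna()  # only examples this annotator labeled
        matches = given.astype(int).to_numpy() == consensus.loc[given.index].to_numpy()
        rows[annotator] = {"agreement": matches.mean(), "num_labeled": len(given)}
    stats = pd.DataFrame.from_dict(rows, orient="index")
    return stats.sort_values("agreement")

toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, 1, 2, None], dtype="Int64"),
        "A0002": pd.array([0, 2, None, 1], dtype="Int64"),
        "A0003": pd.array([1, None, 0, 0], dtype="Int64"),
    }
)
toy_consensus = [0, 1, 2, 1]

stats = annotator_agreement(toy_labels, toy_consensus)
print(stats.head(2))  # the two annotators with the lowest agreement
```

On the lab data, `head(10)` would list the ten annotators who agree least with the consensus. Keep in mind that agreement with a noisy consensus is only a proxy for true accuracy, especially for annotators with few labels.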

## Exercise 2

Estimate consensus labels for each example from the data in `multiannotator_labels`, this time using the CROWDLAB algorithm. You may find it helpful to reference: https://docs.cleanlab.ai/stable/tutorials/multiannotator.html

Recall that CROWDLAB **requires out-of-sample predicted class probabilities from a trained classifier**. You may use the `pred_probs` **from your model trained on majority-vote consensus labels** or from the model trained on our **randomly-selected annotator labels**. Which do you think would be better to use?

- Evaluate the accuracy of these CROWDLAB consensus labels against the ground truth labels.
- Also set these as `labels_to_fit` in `train_model()` to see the resulting model's accuracy when trained with CROWDLAB consensus labels.
- Estimate the quality of each annotator (how accurate their labels tend to be overall) using CROWDLAB (assume the ground truth labels are unavailable, as they would be in practice). Who do you guess are the worst 10 annotators based on this method?
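
To build intuition for why classifier probabilities help, here is a deliberately simplified sketch. It is not the CROWDLAB algorithm and not cleanlab's implementation: it averages the model's predicted probabilities with the annotators' votes, using a hypothetical fixed `model_weight` and a weight of 1 per annotator. CROWDLAB instead estimates how much to trust the classifier and each annotator from the data.

```python
import numpy as np
import pandas as pd

def simple_model_annotator_ensemble(labels_df, pred_probs, model_weight=1.0):
    """Blend annotator vote counts with model probabilities (illustration only)."""
    num_classes = pred_probs.shape[1]
    votes = labels_df.to_numpy(dtype=float, na_value=np.nan)  # NaN where unlabeled
    vote_counts = np.stack(
        [(votes == k).sum(axis=1) for k in range(num_classes)], axis=1
    ).astype(float)
    total_weight = model_weight + vote_counts.sum(axis=1, keepdims=True)
    posterior = (model_weight * pred_probs + vote_counts) / total_weight
    return posterior.argmax(axis=1), posterior

toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, 2], dtype="Int64"),
        "A0002": pd.array([1, 2], dtype="Int64"),
        "A0003": pd.array([None, 1], dtype="Int64"),
    }
)
toy_pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])

toy_consensus, toy_posterior = simple_model_annotator_ensemble(toy_labels, toy_pred_probs)
print(toy_consensus)
```

In the first toy example the annotators are tied between classes 0 and 1, and the model's probabilities settle the tie in favor of class 0. In the second, two of three annotators agree on class 2 and outweigh the model's preference for class 1.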

In [102]:
improved_consensus_label = results["label_quality"]["consensus_label"].values

print(f"Accuracy of majority annotators' labels against ground truth labels: {np.mean(majority_vote_label == true_labels)}")
print(f"Accuracy of cleanlab consensus labels against ground truth labels: {np.mean(true_labels == improved_consensus_label)}")

Accuracy of majority annotators' labels against ground truth labels: 0.8294314381270903
Accuracy of cleanlab consensus labels against ground truth labels: 0.9632107023411371


In [103]:
pred_probs_from_model_fit_to_cleanlab_consensus_labels = train_model(labels_to_fit = improved_consensus_label)

Accuracy of held-out model predictions against ground truth labels: 0.9698996655518395
