# Lab -- Dataset Curation with Multiple Annotators

Intended to accompany the lecture on Dataset Creation and Curation, this notebook contains exercises to analyze an existing classification dataset labeled by multiple annotators (e.g. collected via crowdsourcing).

You may find it helpful to first install the following dependencies:

**Installed in venv**

In [1]:
# !pip install cleanlab
# # We originally used the version: cleanlab==2.2.0
# # This automatically installs other required packages like numpy, pandas, sklearn

In [2]:
import numpy as np
import pandas as pd

## Analyzing dataset labeled by multiple annotators

We simulate a small classification dataset (3 classes, 2-dimensional features) with ground truth labels that are then hidden from our analysis. The analysis uses only labels from noisy annotators. Each annotator's labels are derived from the ground truth labels, with a per-label probability of error determined by that annotator's underlying quality. In subsequent exercises, assume that the ground truth labels and the true annotator qualities are unknown to you.
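
For intuition only, here is a minimal sketch of the noise model used by the data-generating cell below: an annotator keeps the true label with probability equal to their quality and otherwise picks one of the classes uniformly at random. The helper names `expected_annotator_accuracy` and `simulate_annotator` are hypothetical and not part of the lab code; the qualities 0.6 and 0.3 are the values set in that cell.

```python
import numpy as np

def expected_annotator_accuracy(quality, num_classes):
    """Chance that an annotation equals the true label under the assumed noise model."""
    # With probability `quality` the true label is kept; otherwise a class is drawn
    # uniformly at random, which still hits the true class 1/num_classes of the time.
    return quality + (1.0 - quality) / num_classes

def simulate_annotator(true_labels, quality, num_classes, rng):
    """Draw one annotator's labels for every example under the same noise model."""
    noisy = rng.integers(low=0, high=num_classes, size=len(true_labels))
    keep = rng.random(len(true_labels)) < quality
    noisy[keep] = true_labels[keep]
    return noisy

rng = np.random.default_rng(0)
toy_true_labels = rng.integers(low=0, high=3, size=100_000)  # hypothetical toy labels
for quality in (0.6, 0.3):
    toy_noisy = simulate_annotator(toy_true_labels, quality, 3, rng)
    # Analytical accuracy next to the empirical accuracy of the simulated annotator
    print(quality, round(expected_annotator_accuracy(quality, 3), 4), np.mean(toy_noisy == toy_true_labels))
```

Under this assumption, a quality of 0.6 corresponds to an expected accuracy of about 0.73 and a quality of 0.3 to about 0.53, because a random draw can still land on the true class.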

In [3]:
## You don't need to understand this cell, it's just used for generating the dataset

SEED = 123  # for reproducibility
np.random.seed(seed=SEED)

def make_data(sample_size = 300):
    """ Produce a 3-class classification dataset with 2-dimensional features and multiple noisy annotations per example. """
    num_annotators=50  # total number of data annotators
    class_frequencies = [0.5, 0.25, 0.25]
    sizes=[int(np.ceil(freq*sample_size)) for freq in class_frequencies]  # number of examples belonging to each class
    good_annotator_quality = 0.6
    bad_annotator_quality = 0.3
    
    # Underlying statistics of the dataset (unknown to you)
    means=[[3, 2], [7, 7], [0, 8]]
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]]
    
    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    # Generate features and true labels
    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Generate noisy labels from each annotator
    s = pd.DataFrame(
        np.vstack(
            [
                generate_noisy_labels(true_labels_train, good_annotator_quality)
                if i < num_annotators - 10  # last 10 annotators are worse
                else generate_noisy_labels(true_labels_train, bad_annotator_quality)
                for i in range(num_annotators)
            ]
        ).transpose()
    )

    # Each annotator only labels approximately 10% of the dataset (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype("Int64")
    s.dropna(axis=1, how="all", inplace=True)
    s.columns = ["A" + str(i).zfill(4) for i in range(1, num_annotators+1)]
    # Drop rows not annotated by anybody
    row_NA_check = pd.notna(s).any(axis=1)
    X_train = X_train[row_NA_check]
    true_labels_train = true_labels_train[row_NA_check]
    multiannotator_labels = s[row_NA_check].reset_index(drop=True)
    # Shuffle the rows of the dataset
    shuffled_indices = np.random.permutation(len(X_train))
    return {
        "X_train": X_train[shuffled_indices],
        "true_labels_train": true_labels_train[shuffled_indices],
        "multiannotator_labels": multiannotator_labels.iloc[shuffled_indices],
    }

def generate_noisy_labels(true_labels, annotator_quality):
    """ Keeps each true label with probability annotator_quality; otherwise replaces it with a class drawn uniformly at random (which may coincide with the true class). """
    n = len(true_labels)
    m = np.max(true_labels) + 1  # number of classes
    annotated_labels = np.random.randint(low=0, high=m, size=n)
    correctly_labeled_indices = np.random.random(n) < annotator_quality
    annotated_labels[correctly_labeled_indices] = true_labels[correctly_labeled_indices]
    return annotated_labels

In [4]:
data_dict = make_data(sample_size = 300)

X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"] # used for comparing the accuracy of consensus labels

Let's view the first few rows of the data used for this exercise. Each row is an example and each column is an annotator. Not every annotator labeled every example: an annotation is an integer class label (0, 1, or 2 for our 3 classes) when the annotator labeled the example, and `NA` otherwise.
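
As a quick way to get a feel for this layout, the sketch below counts annotations per example and per annotator. It uses a small hypothetical table (`toy_labels`) rather than the lab data; the same two lines could be applied to `multiannotator_labels`.

```python
import pandas as pd

# Hypothetical toy table in the same layout: rows are examples, columns are annotators,
# and missing annotations are stored as NA in nullable integer columns.
toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, None, 2, None], dtype="Int64"),
        "A0002": pd.array([0, 1, None, None], dtype="Int64"),
        "A0003": pd.array([1, None, 2, 2], dtype="Int64"),
    }
)

# notna() marks the cells that hold an annotation; summing counts them.
annotations_per_example = toy_labels.notna().sum(axis=1)
annotations_per_annotator = toy_labels.notna().sum(axis=0)

print(annotations_per_example.tolist())     # [3, 1, 2, 1]
print(annotations_per_annotator.to_dict())  # {'A0001': 2, 'A0002': 2, 'A0003': 3}
```

Examples with a single annotation carry no information about agreement between annotators, which matters when estimating annotator quality later on.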

In [5]:
multiannotator_labels.head()

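| | A0001 | A0002 | A0003 | A0004 | A0005 | A0006 | A0007 | A0008 | A0009 | A0010 | ... | A0041 | A0042 | A0043 | A0044 | A0045 | A0046 | A0047 | A0048 | A0049 | A0050 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 247 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 290 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 262 | NA | NA | NA | NA | NA | NA | 2 | NA | NA | NA | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 182 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 143 | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |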


Here are the corresponding 2D data features for these examples:

In [6]:
X[:5]

array([[ 1.01592896, 10.62213634],
       [-1.91393643,  6.53944268],
       [ 0.55962291,  5.35885902],
       [ 6.73677377,  5.02311322],
       [ 6.95949986,  1.61434817]])

In [7]:
X.shape

(299, 2)

### Train model with cross-validation

In this exercise, we consider the simple K Nearest Neighbors classification model, which produces predicted class probabilities for a particular example via a (weighted) average of the labels of the K closest examples. We will train this model via cross-validation and use it to produce held-out predictions for each example in our dataset.
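
To illustrate what held-out predictions look like, here is a small sketch on hypothetical toy data (`toy_X`, `toy_y`), assuming 5-fold cross-validation and a distance-weighted K Nearest Neighbors classifier. Each row of the result is a predicted class distribution produced by a copy of the model that never saw that example during training.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical toy data: three well-separated 2D clusters with 30 points each.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 8.0]])
toy_X = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 2)) for c in centers])
toy_y = np.repeat([0, 1, 2], 30)

# Closer neighbors get larger weight in the vote because of weights="distance".
toy_model = KNeighborsClassifier(n_neighbors=5, weights="distance")
toy_pred_probs = cross_val_predict(toy_model, toy_X, toy_y, cv=5, method="predict_proba")

print(toy_pred_probs.shape)                          # one row per example, one column per class
print(np.allclose(toy_pred_probs.sum(axis=1), 1.0))  # each row is a probability distribution
print(np.mean(toy_pred_probs.argmax(axis=1) == toy_y))
```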

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

def train_model(labels_to_fit):
    """ Trains a K Nearest Neighbors classifier on the data features X with y = labels_to_fit, via cross-validation.
        Returns out-of-sample predicted class probabilities for each example in the dataset
        (from a copy of model that was never trained on this example).
        Also evaluates the held-out class predictions against ground truth labels.
    """
    num_crossval_folds = 5  # number of folds of cross-validation
    # model = MLPClassifier(max_iter=1000, random_state=SEED)
    model = KNeighborsClassifier(weights="distance")
    pred_probs = cross_val_predict(
        estimator=model, X=X, y=labels_to_fit, cv=num_crossval_folds, method="predict_proba"
    )
    class_predictions = np.argmax(pred_probs, axis=1)
    held_out_accuracy = np.mean(class_predictions == true_labels)
    print(f"Accuracy of held-out model predictions against ground truth labels: {held_out_accuracy}")
    return pred_probs

Here we demonstrate how to train and evaluate this model. Note that the evaluation is against ground truth labels, which you wouldn't have in real applications, so this evaluation is just for demonstration purposes. We'll first fit this model using labels consisting of one randomly selected annotation per example.

In [9]:
labels_from_random_annotators = true_labels.copy()
for i in range(len(multiannotator_labels)):
    annotations_for_example_i = multiannotator_labels.iloc[i][pd.notna(multiannotator_labels.iloc[i])]
    labels_from_random_annotators[i] = np.random.choice(annotations_for_example_i.values)

print(f"Accuracy of random annotators' labels against ground truth labels: {np.mean(labels_from_random_annotators == true_labels)}")
pred_probs_from_model_fit_to_random_annotators = train_model(labels_to_fit = labels_from_random_annotators)

Accuracy of random annotators' labels against ground truth labels: 0.6822742474916388
Accuracy of held-out model predictions against ground truth labels: 0.8093645484949833


We can also fit this model using the ground truth labels (which you would not be able to do in practice), just to see how good it could be:

In [10]:
pred_probs_from_unrealistic_model_fit_to_true_labels = train_model(labels_to_fit = true_labels)

Accuracy of held-out model predictions against ground truth labels: 0.9765886287625418


## Exercise 1

Compute majority-vote consensus labels for each example from the data in `multiannotator_labels`. Think about how to best break ties!

- Evaluate the accuracy of these majority-vote consensus labels against the ground truth labels.
- Also set these as `labels_to_fit` in `train_model()` to see the resulting model's accuracy when trained with majority vote consensus labels.
- Estimate the quality of each annotator (how accurate their labels tend to be overall) using only these majority-vote consensus labels (assume the ground truth labels are unavailable, as they would be in practice). Who do you guess are the worst 10 annotators?
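
One possible starting point, shown on a hypothetical toy table, is a vectorized majority vote that also flags ties. This is only a sketch under the assumption that ties are broken uniformly at random; the function name `majority_vote` is made up for illustration, and other tie-breaking rules may well work better.

```python
import numpy as np
import pandas as pd

def majority_vote(labels_df, num_classes, seed=0):
    """Return (consensus, is_tie) for a table with one row per example and one column per annotator."""
    rng = np.random.default_rng(seed)
    arr = labels_df.to_numpy(dtype=float, na_value=np.nan)  # missing annotations become NaN
    # counts[i, k] = number of annotators who gave class k to example i (NaN never equals k)
    counts = np.stack([(arr == k).sum(axis=1) for k in range(num_classes)], axis=1)
    is_tie = (counts == counts.max(axis=1, keepdims=True)).sum(axis=1) > 1
    # Adding noise in [0, 1) to integer counts never changes which classes are on top,
    # so argmax picks uniformly at random among the tied classes.
    consensus = (counts + rng.random(counts.shape)).argmax(axis=1)
    return consensus, is_tie

# Hypothetical toy annotations; the last example is a tie between classes 0 and 1.
toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, None, 2, 0], dtype="Int64"),
        "A0002": pd.array([0, 1, None, 1], dtype="Int64"),
        "A0003": pd.array([1, None, 2, None], dtype="Int64"),
    }
)
toy_consensus, toy_is_tie = majority_vote(toy_labels, num_classes=3)
print(toy_consensus[:3].tolist())  # [0, 1, 2]
print(toy_is_tie.tolist())         # [False, False, False, True]
```

Returning the tie flags separately makes it easy to see how many examples depend on the tie-breaking rule at all.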

In [11]:
## Code your solution here
labels_from_majority_vote = []
for i in range(len(multiannotator_labels)):
    annotations_for_example_i = multiannotator_labels.iloc[i][pd.notna(multiannotator_labels.iloc[i])]
    cnt_dict = {}
    max_cnt = 0
    for ann in annotations_for_example_i.values:
        cnt_dict[ann] = cnt_dict.get(ann, 0) + 1
        if cnt_dict[ann] > max_cnt:
            max_cnt = cnt_dict[ann]
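    # Keep only the labels that received the maximum number of votes
    ann_list = list(cnt_dict.keys())
    for ann in ann_list:
        if cnt_dict[ann] < max_cnt:
            del cnt_dict[ann]
    # Ties are broken by taking the first tied label in order of appearance
    majority_vote_label_example_i = list(cnt_dict.keys())[0]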
    labels_from_majority_vote.append(majority_vote_label_example_i)

print(f"Accuracy of majority vote annotators' labels against ground truth labels: {np.mean(labels_from_majority_vote == true_labels)}")
pred_probs_from_model_fit_to_majority_vote = train_model(labels_to_fit = labels_from_majority_vote)

Accuracy of majority vote annotators' labels against ground truth labels: 0.882943143812709
Accuracy of held-out model predictions against ground truth labels: 0.9464882943143813


In [12]:
annotator_list = list(multiannotator_labels.columns)
# number of labels from each annotator that agree with the majority vote, counted only on examples labeled by more than one annotator
annotator_correct_cnts = {
    annotator: 0
    for annotator in annotator_list
}
# number of labels given by each annotator, counted only on examples labeled by more than one annotator
annotator_num_labels = {
    annotator: 0
    for annotator in annotator_list
}
for i in range(len(multiannotator_labels)):
    majority_vote_label_example_i = labels_from_majority_vote[i]
    num_annotators_example_i = sum(pd.notna(multiannotator_labels.iloc[i]))
    if num_annotators_example_i == 1:  # only consider examples where more than one annotator gave label
        continue

    for annotator in annotator_list:
        if pd.isna(multiannotator_labels.iloc[i][annotator]):
            continue
        annotator_num_labels[annotator] += 1
        if multiannotator_labels.iloc[i][annotator] == majority_vote_label_example_i:
            annotator_correct_cnts[annotator] += 1

annotator_scores = {
    annotator: annotator_correct_cnts[annotator] / annotator_num_labels[annotator]
    for annotator in annotator_list
}
annotator_score_tuples = list(annotator_scores.items())
annotator_score_tuples = sorted(annotator_score_tuples, key=lambda x: x[1])

print("Worst 10 annotators:")
for i in range(10):
    print(f"{annotator_score_tuples[i][0]}, {annotator_score_tuples[i][1]:.4f}")

Worst 10 annotators:
A0048, 0.4062
A0044, 0.4286
A0046, 0.4828
A0043, 0.4848
A0045, 0.5000
A0022, 0.5625
A0041, 0.5714
A0042, 0.5714
A0050, 0.5750
A0049, 0.5833
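
The agreement-with-consensus score computed above can also be written in vectorized form. The sketch below follows the same idea (skip examples labeled by a single annotator, then take the fraction of each annotator's labels that match the consensus) on a hypothetical toy table; `agreement_with_consensus` is an illustrative name, and annotators with no eligible labels receive NaN instead of raising a division error.

```python
import numpy as np
import pandas as pd

def agreement_with_consensus(labels_df, consensus, min_annotators=2):
    """Fraction of each annotator's labels that match the consensus, sorted from worst to best."""
    arr = labels_df.to_numpy(dtype=float, na_value=np.nan)
    labeled = ~np.isnan(arr)
    # A consensus built from a single annotation trivially agrees with it, so skip those rows.
    eligible = labeled.sum(axis=1) >= min_annotators
    labeled = labeled & eligible[:, None]
    agrees = (arr == np.asarray(consensus, dtype=float)[:, None]) & labeled
    num_labeled = labeled.sum(axis=0)
    scores = np.divide(
        agrees.sum(axis=0),
        num_labeled,
        out=np.full(arr.shape[1], np.nan),
        where=num_labeled > 0,
    )
    return pd.Series(scores, index=labels_df.columns).sort_values()

# Hypothetical toy annotations and consensus labels.
toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, None, 2, 0], dtype="Int64"),
        "A0002": pd.array([0, 1, None, 1], dtype="Int64"),
        "A0003": pd.array([1, None, 2, None], dtype="Int64"),
    }
)
toy_scores = agreement_with_consensus(toy_labels, consensus=[0, 1, 2, 0])
print(toy_scores.to_dict())
```

In this toy case A0001 always matches the consensus, while A0002 and A0003 each match on half of their eligible examples.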


In [13]:
# Break ties using the trained model
labels_from_majority_vote_after_tie_breaker = []
for i in range(len(multiannotator_labels)):
    annotations_for_example_i = multiannotator_labels.iloc[i][pd.notna(multiannotator_labels.iloc[i])]
    cnt_dict = {}
    max_cnt = 0
    for ann in annotations_for_example_i.values:
        cnt_dict[ann] = cnt_dict.get(ann, 0) + 1
        if cnt_dict[ann] > max_cnt:
            max_cnt = cnt_dict[ann]
    ann_list = list(cnt_dict.keys())
    for ann in ann_list:
        if cnt_dict[ann] < max_cnt:
            del cnt_dict[ann]
    if len(cnt_dict.keys()) > 1:
        # Break ties using trained model
        model_pred = pred_probs_from_model_fit_to_majority_vote[i, :].argmax()
        if model_pred in cnt_dict.keys():
            majority_vote_label_example_i = model_pred
        else:
            majority_vote_label_example_i = list(cnt_dict.keys())[0]
    else:
        majority_vote_label_example_i = list(cnt_dict.keys())[0]
    labels_from_majority_vote_after_tie_breaker.append(majority_vote_label_example_i)

print(
    f"Accuracy of majority vote annotators' labels after tie breaker against "
    + f"ground truth labels: {np.mean(labels_from_majority_vote_after_tie_breaker == true_labels)}"
)
pred_probs_from_model_fit_to_majority_vote = train_model(labels_to_fit = labels_from_majority_vote_after_tie_breaker)

Accuracy of majority vote annotators' labels after tie breaker against ground truth labels: 0.9130434782608695
Accuracy of held-out model predictions against ground truth labels: 0.9565217391304348
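
A slightly different way to use the model for tie-breaking is sketched below. The cell above falls back to the first tied label whenever the model's top class is not among the tied labels; this variant instead picks, among the tied labels only, the one with the highest predicted probability. The function `break_ties_with_model` and the toy inputs are hypothetical, and whether this variant helps on the lab data has not been checked here.

```python
import numpy as np

def break_ties_with_model(counts, pred_probs):
    """Pick the top-voted class per example, using pred_probs only to choose among tied classes."""
    counts = np.asarray(counts)
    pred_probs = np.asarray(pred_probs, dtype=float)
    # tied[i, k] is True when class k shares the highest vote count for example i
    tied = counts == counts.max(axis=1, keepdims=True)
    # Classes that are not top-voted can never win, whatever the model predicts.
    masked_probs = np.where(tied, pred_probs, -np.inf)
    return masked_probs.argmax(axis=1)

# Hypothetical vote counts (rows: examples, columns: classes) and model probabilities.
toy_counts = [[2, 1, 0], [1, 1, 0], [0, 1, 1]]
toy_pred_probs = [[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
print(break_ties_with_model(toy_counts, toy_pred_probs).tolist())  # [0, 1, 2]
```

The first example has a clear majority, so the model is ignored; in the other two the model decides between the tied classes even though its overall top class lies elsewhere.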


In [14]:
# Break ties using the annotator score
annotator_list = list(multiannotator_labels.columns)

labels_from_majority_vote_after_tie_breaker = []
for i in range(len(multiannotator_labels)):
    cnt_dict = {}
    max_cnt = 0
    for annotator in annotator_list:
        if pd.isna(multiannotator_labels.iloc[i][annotator]):
            continue
        ann = multiannotator_labels.iloc[i][annotator]
        if ann in cnt_dict.keys():
            cnt_dict[ann]['result'] += 1
            cnt_dict[ann]['annotator_scores'].append(annotator_scores[annotator])
        else:
            cnt_dict[ann] = {
                'result': 1,
                'annotator_scores': [annotator_scores[annotator]]
            }
        if cnt_dict[ann]['result'] > max_cnt:
            max_cnt = cnt_dict[ann]['result']

    ann_list = list(cnt_dict.keys())
    for ann in ann_list:
        if cnt_dict[ann]['result'] < max_cnt:
            del cnt_dict[ann]

    # Aggregate the scores of the annotators behind each tied label (here: the minimum score)
    acc_function = np.min
    for ann in cnt_dict.keys():
        cnt_dict[ann]['acc_annotator_score'] = acc_function(cnt_dict[ann]['annotator_scores'])
    cnt_dict_items = cnt_dict.items()
    cnt_dict_items_sorted = sorted(cnt_dict_items, key=lambda x: x[1]['acc_annotator_score'])
    majority_vote_label_example_i = cnt_dict_items_sorted[-1][0]

    labels_from_majority_vote_after_tie_breaker.append(majority_vote_label_example_i)

print(
    f"Accuracy of majority vote annotators' labels after tie breaker against "
    + f"ground truth labels: {np.mean(labels_from_majority_vote_after_tie_breaker == true_labels)}"
)
pred_probs_from_model_fit_to_majority_vote = train_model(labels_to_fit = labels_from_majority_vote_after_tie_breaker)

Accuracy of majority vote annotators' labels after tie breaker against ground truth labels: 0.8762541806020067
Accuracy of held-out model predictions against ground truth labels: 0.9498327759197325


## Exercise 2

Estimate consensus labels for each example from the data in `multiannotator_labels`, this time using the CROWDLAB algorithm. You may find it helpful to reference: https://docs.cleanlab.ai/stable/tutorials/multiannotator.html
Recall that CROWDLAB requires out-of-sample predicted class probabilities from a trained classifier. You may use the `pred_probs` from your model trained on majority-vote consensus labels or on our randomly selected annotator labels. Which do you think would be better to use?

- Evaluate the accuracy of these CROWDLAB consensus labels against the ground truth labels.
- Also set these as `labels_to_fit` in `train_model()` to see the resulting model's accuracy when trained with CROWDLAB consensus labels.
- Estimate the quality of each annotator (how accurate their labels tend to be overall) using CROWDLAB (assume the ground truth labels are unavailable, as they would be in practice). Who do you guess are the worst 10 annotators based on this method?

In [15]:
## Code your solution here
from cleanlab.multiannotator import get_label_quality_multiannotator


In [16]:
cleanlab_results = get_label_quality_multiannotator(
    multiannotator_labels,
    pred_probs_from_model_fit_to_majority_vote
)

In [17]:
labels_from_cleanlab = cleanlab_results['label_quality']['consensus_label'].values

print(f"Accuracy of cleanlab labels against ground truth labels: {np.mean(labels_from_cleanlab == true_labels)}")
pred_probs_from_model_fit_to_cleanlab_labels = train_model(labels_to_fit = labels_from_cleanlab)

Accuracy of cleanlab labels against ground truth labels: 0.9765886287625418
Accuracy of held-out model predictions against ground truth labels: 0.9732441471571907


In [18]:
cleanlab_results['annotator_stats'].head(10)

| | annotator_quality | agreement_with_consensus | worst_class | num_examples_labeled |
|---|---|---|---|---|
| A0043 | 0.358109 | 0.363636 | 2 | 33 |
| A0048 | 0.374314 | 0.40625 | 1 | 32 |
| A0044 | 0.391327 | 0.428571 | 2 | 21 |
| A0042 | 0.396502 | 0.428571 | 1 | 35 |
| A0046 | 0.415364 | 0.448276 | 1 | 29 |
| A0041 | 0.43199 | 0.464286 | 1 | 28 |
| A0040 | 0.434583 | 0.481481 | 2 | 27 |
| A0045 | 0.467157 | 0.461538 | 2 | 26 |
| A0050 | 0.469913 | 0.525 | 2 | 40 |
| A0049 | 0.469948 | 0.527778 | 2 | 36 |
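
For demonstration only, the two "worst 10" lists printed in this notebook can be compared against the annotators that the simulation made less reliable. The sketch below copies the annotator IDs from the outputs above and assumes that the low-quality annotators are A0041 through A0050, which follows from `make_data` (the last 10 of 50 annotators get the lower quality) as long as no annotator column was dropped. In practice this ground truth would not be available.

```python
# Annotator IDs copied from the two "worst 10" outputs in this notebook.
worst_by_majority_vote = ["A0048", "A0044", "A0046", "A0043", "A0045",
                          "A0022", "A0041", "A0042", "A0050", "A0049"]
worst_by_crowdlab = ["A0043", "A0048", "A0044", "A0042", "A0046",
                     "A0041", "A0040", "A0045", "A0050", "A0049"]

# Assumption: the simulated low-quality annotators are the last 10 columns, A0041..A0050.
simulated_low_quality = {f"A{i:04d}" for i in range(41, 51)}

comparison = {}
for name, guess in [("majority vote", worst_by_majority_vote), ("CROWDLAB", worst_by_crowdlab)]:
    guess = set(guess)
    comparison[name] = {
        "hits": len(guess & simulated_low_quality),
        "false_alarms": sorted(guess - simulated_low_quality),
        "missed": sorted(simulated_low_quality - guess),
    }
    print(name, comparison[name])
```

On these outputs both rankings recover 9 of the 10 simulated low-quality annotators and both miss A0047; they differ only in which higher-quality annotator is wrongly included (A0022 for majority vote, A0040 for CROWDLAB).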
