# Lab -- Dataset Curation with Multiple Annotators

Intended to accompany the lecture on Dataset Creation and Curation, this notebook contains exercises to analyze an existing classification dataset labeled by multiple annotators (e.g. collected via crowdsourcing).

You may find it helpful to first install the following dependencies:

In [1]:
!pip install cleanlab
# We originally used the version: cleanlab==2.2.0
# This automatically installs other required packages like numpy, pandas, sklearn



In [2]:
import numpy as np
import pandas as pd

## Analyzing dataset labeled by multiple annotators

We simulate a small classification dataset (3 classes, 2-dimensional features) with ground truth labels that are then hidden from our analysis. The analysis is instead conducted on labels from noisy annotators. Each annotator's labels are derived from the ground truth labels, but each annotated label is wrong with some probability determined by that annotator's underlying quality. In subsequent exercises, you should assume the ground truth labels and the true annotator qualities are unknown to you.

In [3]:
## You don't need to understand this cell, it's just used for generating the dataset

SEED = 123  # for reproducibility
np.random.seed(seed=SEED)

def make_data(sample_size = 300):
    """ Produce a 3-class classification dataset with 2-dimensional features and multiple noisy annotations per example. """
    num_annotators=50  # total number of data annotators
    class_frequencies = [0.5, 0.25, 0.25]
    sizes=[int(np.ceil(freq*sample_size)) for freq in class_frequencies]  # number of examples belonging to each class
    good_annotator_quality = 0.6
    bad_annotator_quality = 0.3
    
    # Underlying statistics of the dataset (unknown to you)
    means=[[3, 2], [7, 7], [0, 8]]
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]]
    
    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    # Generate features and true labels
    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Generate noisy labels from each annotator
    s = pd.DataFrame(
        np.vstack(
            [
                generate_noisy_labels(true_labels_train, good_annotator_quality)
                if i < num_annotators - 10  # last 10 annotators are worse
                else generate_noisy_labels(true_labels_train, bad_annotator_quality)
                for i in range(num_annotators)
            ]
        ).transpose()
    )

    # Each annotator only labels approximately 10% of the dataset (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype("Int64")
    s.dropna(axis=1, how="all", inplace=True)
    s.columns = ["A" + str(i).zfill(4) for i in range(1, num_annotators+1)]
    # Drop rows not annotated by anybody
    row_NA_check = pd.notna(s).any(axis=1)
    X_train = X_train[row_NA_check]
    true_labels_train = true_labels_train[row_NA_check]
    multiannotator_labels = s[row_NA_check].reset_index(drop=True)
    # Shuffle the rows of the dataset
    shuffled_indices = np.random.permutation(len(X_train))
    return {
        "X_train": X_train[shuffled_indices],
        "true_labels_train": true_labels_train[shuffled_indices],
        "multiannotator_labels": multiannotator_labels.iloc[shuffled_indices],
    }

def generate_noisy_labels(true_labels, annotator_quality):
    """ Keeps each true label with probability annotator_quality; otherwise replaces it with a class drawn uniformly at random (which may still equal the true label). """
    n = len(true_labels)
    m = np.max(true_labels) + 1  # number of classes
    annotated_labels = np.random.randint(low=0, high=m, size=n)
    correctly_labeled_indices = np.random.random(n) < annotator_quality
    annotated_labels[correctly_labeled_indices] = true_labels[correctly_labeled_indices]
    return annotated_labels
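
Note that `generate_noisy_labels` draws the replacement label uniformly over all classes, so a "wrong" draw still matches the true label one time in three. A quick sketch of the resulting expected accuracy (the helper name `expected_accuracy` is ours and not part of the lab):

```python
def expected_accuracy(annotator_quality, num_classes=3):
    """Expected fraction of correct labels when, with probability
    (1 - annotator_quality), the label is redrawn uniformly over all classes."""
    return annotator_quality + (1 - annotator_quality) / num_classes

good = expected_accuracy(0.6)  # quality used for the first 40 annotators
bad = expected_accuracy(0.3)   # quality used for the last 10 annotators
print(f"good annotators: {good:.3f}, bad annotators: {bad:.3f}")
```

So the two annotator groups differ by roughly 20 percentage points in expected accuracy, which is the gap the exercises below try to recover without access to the ground truth.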

In [4]:
data_dict = make_data(sample_size = 300)

X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"] # used for comparing the accuracy of consensus labels

Let's view the first few rows of the data used for this exercise: the labels selected by each annotator for the first few examples. Each row is an example and each column is an annotator. Not every annotator labeled every example: an annotation is an integer (0, 1, or 2 for our 3 classes) where the annotator labeled the example, and `NA` otherwise.

In [5]:
multiannotator_labels.head()

Unnamed: 0,A0001,A0002,A0003,A0004,A0005,A0006,A0007,A0008,A0009,A0010,...,A0041,A0042,A0043,A0044,A0045,A0046,A0047,A0048,A0049,A0050
247,,2.0,,,,,,,,,...,,,,,,,,,,
290,,,,,,,,,,,...,,,,,,,,,,
262,,,,,,,2.0,,,,...,,,,,,,,,,
182,,,,,,,,,,,...,,,,,,,,,,
143,,,0.0,,,,,,,,...,,,,,,,,,,


Here are the corresponding 2D data features for these examples:

In [6]:
X[:5]

array([[ 1.01592896, 10.62213634],
       [-1.91393643,  6.53944268],
       [ 0.55962291,  5.35885902],
       [ 6.73677377,  5.02311322],
       [ 6.95949986,  1.61434817]])

### Train model with cross-validation

In this exercise, we consider the simple K Nearest Neighbors classification model, which produces predicted class probabilities for a particular example via a (weighted) average of the labels of the K closest examples. We will train this model via cross-validation and use it to produce held-out predictions for each example in our dataset.

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

def train_model(labels_to_fit):
    """ Trains a simple K Nearest Neighbors model on the data features X with y = labels_to_fit, via cross-validation.
        Returns out-of-sample predicted class probabilities for each example in the dataset
        (from a copy of model that was never trained on this example).
        Also evaluates the held-out class predictions against ground truth labels.
    """
    num_crossval_folds = 5  # number of folds of cross-validation
    # model = MLPClassifier(max_iter=1000, random_state=SEED)
    model = KNeighborsClassifier(weights="distance")
    pred_probs = cross_val_predict(
        estimator=model, X=X, y=labels_to_fit, cv=num_crossval_folds, method="predict_proba"
    )
    class_predictions = np.argmax(pred_probs, axis=1)
    held_out_accuracy = np.mean(class_predictions == true_labels)
    print(f"Accuracy of held-out model predictions against ground truth labels: {held_out_accuracy}")
    return pred_probs
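
The reason for using held-out predictions can be illustrated with a small sketch (toy data and variable names are ours, not part of the lab). With `weights="distance"`, each training point is its own zero-distance neighbor, so in-sample predictions simply echo whatever labels the model was fit to, however noisy they are.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = rng.integers(0, 3, size=60)  # purely random labels: there is nothing to learn

model = KNeighborsClassifier(weights="distance")

# In-sample: predict on the same points the model was fit to.
in_sample = model.fit(X_toy, y_toy).predict(X_toy)

# Out-of-sample: each point is predicted by a model copy that never saw it.
out_of_sample = cross_val_predict(
    model, X_toy, y_toy, cv=5, method="predict_proba"
).argmax(axis=1)

in_sample_acc = np.mean(in_sample == y_toy)
out_of_sample_acc = np.mean(out_of_sample == y_toy)
print(in_sample_acc, out_of_sample_acc)
```

The in-sample agreement is perfect even though the labels are noise, which is why only the cross-validated probabilities are informative about label quality.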

Here we demonstrate how to train and evaluate this model. Note that the evaluation is against ground truth labels, which you wouldn't have in real applications, so this evaluation is just for demonstration purposes. We'll first fit this model using labels comprised of one randomly selected annotation for each example.

In [8]:
labels_from_random_annotators = true_labels.copy()
for i in range(len(multiannotator_labels)):
    annotations_for_example_i = multiannotator_labels.iloc[i][pd.notna(multiannotator_labels.iloc[i])]
    labels_from_random_annotators[i] = np.random.choice(annotations_for_example_i.values)

print(f"Accuracy of random annotators' labels against ground truth labels: {np.mean(labels_from_random_annotators == true_labels)}")
pred_probs_from_model_fit_to_random_annotators = train_model(labels_to_fit = labels_from_random_annotators)


Accuracy of random annotators' labels against ground truth labels: 0.6822742474916388
Accuracy of held-out model predictions against ground truth labels: 0.8093645484949833


We can also fit this model using the ground truth labels (which you would not be able to in practice), just to see how good it could be:

In [9]:
pred_probs_from_unrealistic_model_fit_to_true_labels = train_model(labels_to_fit = true_labels)

Accuracy of held-out model predictions against ground truth labels: 0.9765886287625418


## Exercise 1

Compute majority-vote consensus labels for each example from the data in `multiannotator_labels`. Think about how to best break ties!

- Evaluate the accuracy of these majority-vote consensus labels against the ground truth labels.
- Also set these as `labels_to_fit` in `train_model()` to see the resulting model's accuracy when trained with majority vote consensus labels.
- Estimate the quality of each annotator (how accurate their labels tend to be overall) using only these majority-vote consensus labels (assume the ground truth labels are unavailable, as they would be in practice). Who do you guess are the worst 10 annotators?

In [10]:
## Code your solution here
# first get a preliminary consensus label
# then check whether there are ties
# break ties using model probabilities
# train model
# score each annotator by their agreement with the consensus label

In [24]:
count_df = multiannotator_labels.apply(pd.Series.value_counts, axis=1).fillna(0)
print(count_df.shape)
count_df.head()


(299, 3)


Unnamed: 0,0,1,2
247,0,0,4
290,0,0,1
262,0,2,3
182,1,4,2
143,1,2,0
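
As an alternative to applying `value_counts` row by row, the vote counts can be sketched in a vectorized way by comparing against each class in turn. The toy data and names below are illustrative assumptions:

```python
import pandas as pd

toy_labels = pd.DataFrame(
    {"A0001": [0, pd.NA, 2], "A0002": [0, 1, pd.NA], "A0003": [1, 1, pd.NA]},
    dtype="Int64",
)
classes = [0, 1, 2]

# toy_labels.eq(k) is <NA> where no annotation exists; fillna(False) treats that as "no vote".
toy_counts = pd.DataFrame(
    {k: toy_labels.eq(k).fillna(False).astype(bool).sum(axis=1) for k in classes}
)
print(toy_counts)
```

This keeps one column per class even if a class never receives a vote, which the `value_counts` approach does not guarantee.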


In [25]:
# quick sanity check
multiannotator_labels.iloc[0,:].dropna()

A0002    2
A0024    2
A0034    2
A0038    2
Name: 247, dtype: Int64

In [26]:
# get first consensus
first_cons = count_df.apply(np.argmax, axis=1)
print(first_cons.shape)
first_cons.head()

(299,)


247    2
290    2
262    2
182    1
143    1
dtype: int64

In [29]:
count_df

Unnamed: 0,0,1,2
247,0,0,4
290,0,0,1
262,0,2,3
182,1,4,2
143,1,2,0
...,...,...,...
255,1,1,6
14,4,2,0
195,0,1,0
85,1,0,0


In [40]:
arr = count_df.values  # raw vote counts as a NumPy array, used to detect ties
max_val = arr.max(axis=1)[:, None]  # per-row maximum, reshaped from (299,) to (299, 1) so it broadcasts against arr
# count how many classes attain the maximum in each row
max_count = (max_val == arr).sum(axis=1)
count_df['tie'] = max_count != 1
count_df.head()



Unnamed: 0,0,1,2,tie
247,0,0,4,False
290,0,0,1,False
262,0,2,3,False
182,1,4,2,False
143,1,2,0,False
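
The tie check relies on broadcasting a column of row maxima against the count matrix. A small self-contained sketch with made-up counts:

```python
import numpy as np

toy_counts = np.array([
    [0, 0, 4],   # clear winner: class 2
    [2, 2, 1],   # two-way tie between classes 0 and 1
    [1, 1, 1],   # three-way tie
])

row_max = toy_counts.max(axis=1, keepdims=True)    # shape (3, 1), broadcasts against (3, 3)
num_winners = (toy_counts == row_max).sum(axis=1)  # how many classes share the top count
is_tie = num_winners > 1
print(is_tie)
```

Here `keepdims=True` plays the same role as the `[:, None]` indexing used in the cell above.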


Now we compute predicted probabilities from a first model trained on the preliminary consensus labels, where ties are not yet resolved (`np.argmax` simply picks the lowest tied class index).

In [44]:
first_cons = count_df.iloc[:, 0:3].apply(np.argmax, axis=1)  # vote-count columns only, excluding the boolean 'tie' column
first_probs = train_model(labels_to_fit=first_cons)


Accuracy of held-out model predictions against ground truth labels: 0.9063545150501672


In [47]:
# keep the model probabilities only for tied rows; zero them out everywhere else
first_probs[count_df['tie'] == False] = 0


In [55]:
first_probs.shape
count_df=count_df.iloc[:, 0:3] + first_probs


Now we get the argmax without ties

In [58]:
majority_label = count_df.iloc[:, 0:3].apply(np.argmax, axis=1)
probs = train_model(labels_to_fit=majority_label)

Accuracy of held-out model predictions against ground truth labels: 0.9565217391304348
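
Zeroing the probabilities for non-tied rows matters because a predicted probability can be exactly 1.0, which is as large as a one-vote margin. A sketch with made-up counts and probabilities (all names are illustrative):

```python
import numpy as np

toy_counts = np.array([[2.0, 3.0, 0.0],    # no tie: class 1 wins by one vote
                       [2.0, 2.0, 0.0]])   # tie between classes 0 and 1
toy_probs = np.array([[1.0, 0.0, 0.0],     # model is fully confident in class 0
                      [0.2, 0.7, 0.1]])

is_tie = (toy_counts == toy_counts.max(axis=1, keepdims=True)).sum(axis=1) > 1

# Only tied rows receive the probability bonus; other rows keep their raw counts.
tie_break = np.where(is_tie[:, None], toy_probs, 0.0)
consensus = (toy_counts + tie_break).argmax(axis=1)

# Without the mask, row 0 becomes [3, 3, 0] and argmax falls back to class 0.
unmasked = (toy_counts + toy_probs).argmax(axis=1)
print(consensus, unmasked)
```

With the mask, the majority winner in the first row is preserved and the model only decides the genuinely tied second row.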


In [63]:
# now let's get the accuracy of each annotator by counting how often they matched the consensus label,
# excluding examples where they were the only annotator
multiannotator_labels.head()
multiannotator_labels['nacount'] = multiannotator_labels.isna().sum(axis=1)
multiannotator_labels.head()

Unnamed: 0,A0001,A0002,A0003,A0004,A0005,A0006,A0007,A0008,A0009,A0010,...,A0042,A0043,A0044,A0045,A0046,A0047,A0048,A0049,A0050,nacount
247,,2.0,,,,,,,,,...,,,,,,,,,,46
290,,,,,,,,,,,...,,,,,,,,,,49
262,,,,,,,2.0,,,,...,,,,,,,,,,45
182,,,,,,,,,,,...,,,,,,,,,,43
143,,,0.0,,,,,,,,...,,,,,,,,,,47


In [67]:
multiannotator_labels['A0001'][multiannotator_labels['nacount'] ==49].notna().sum()

0

In [69]:
def get_ann_acc(col):
    hits = col == majority_label
    hits_sum = hits.sum()
    total_votes = col.notna().sum()
    # now we count the single votes: when the NA count is 49,
    # only one annotator voted; if the current column has a non-NA value for that example,
    # this annotator cast the single vote
    single_votes = col[(multiannotator_labels['nacount'] == 49)].notna().sum()
    accuracy = (hits_sum - single_votes) / (total_votes - single_votes)
    return accuracy



In [70]:
ann_acc = multiannotator_labels.apply(get_ann_acc, axis=0)
ann_acc.sort_values()

nacount   -0.034602
A0048      0.406250
A0044      0.428571
A0046      0.482759
A0043      0.484848
A0045      0.500000
A0042      0.542857
A0040      0.555556
A0029      0.558824
A0041      0.571429
A0050      0.575000
A0049      0.583333
A0022      0.593750
A0004      0.631579
A0047      0.640000
A0009      0.645161
A0002      0.666667
A0007      0.680000
A0038      0.684211
A0019      0.687500
A0012      0.703704
A0031      0.717949
A0018      0.718750
A0036      0.718750
A0006      0.720000
A0017      0.722222
A0014      0.735294
A0016      0.740741
A0003      0.758621
A0026      0.766667
A0020      0.766667
A0033      0.774194
A0034      0.777778
A0039      0.777778
A0035      0.781250
A0013      0.782609
A0008      0.791667
A0027      0.793103
A0011      0.794118
A0015      0.794118
A0021      0.794118
A0023      0.800000
A0001      0.800000
A0025      0.807692
A0032      0.807692
A0030      0.807692
A0037      0.814815
A0010      0.857143
A0028      0.863636
A0024      0.878788
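
The agreement-based score above can be checked by hand on a toy example. This is a minimal sketch with made-up labels and a made-up consensus, not a replacement for `get_ann_acc`:

```python
import pandas as pd

toy_labels = pd.DataFrame(
    {"A0001": [0, 1, 2, pd.NA], "A0002": [0, 2, pd.NA, pd.NA], "A0003": [1, 1, 2, 0]},
    dtype="Int64",
)
toy_consensus = pd.Series([0, 1, 2, 0])

# Ignore examples with a single annotation: there the annotator trivially agrees with themselves.
multi_annotated = toy_labels.notna().sum(axis=1) > 1

def agreement_with_consensus(col):
    """Fraction of an annotator's labels that match the consensus,
    counted only on examples that have more than one annotation."""
    voted = col.notna() & multi_annotated
    return (col[voted] == toy_consensus[voted]).astype(bool).mean()

toy_agreement = toy_labels.apply(agreement_with_consensus, axis=0)
print(toy_agreement)
```

Selecting the rows first avoids the subtraction of single votes from both numerator and denominator, but the resulting ratio is the same idea.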


## Exercise 2

Estimate consensus labels for each example from the data in `multiannotator_labels`, this time using the CROWDLAB algorithm. You may find it helpful to reference: https://docs.cleanlab.ai/stable/tutorials/multiannotator.html

Recall that CROWDLAB requires out-of-sample predicted class probabilities from a trained classifier. You may use the `pred_probs` from your model trained on majority-vote consensus labels or on our randomly selected annotator labels. Which do you think would be better to use?

- Evaluate the accuracy of these CROWDLAB consensus labels against the ground truth labels.
- Also set these as `labels_to_fit` in `train_model()` to see the resulting model's accuracy when trained with CROWDLAB consensus labels.
- Estimate the quality of each annotator (how accurate their labels tend to be overall) using CROWDLAB (assume the ground truth labels are unavailable, as they would be in practice). Who do you guess are the worst 10 annotators based on this method?
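
Before passing predicted probabilities to CROWDLAB, it may be worth checking that they have one column per class and that each row sums to 1. The helper below is a hypothetical sanity check of our own, not part of cleanlab:

```python
import numpy as np

def check_pred_probs(pred_probs, num_classes):
    """Hypothetical helper: basic sanity checks on predicted class probabilities."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    if pred_probs.ndim != 2 or pred_probs.shape[1] != num_classes:
        return False  # wrong shape, e.g. extra columns from stray label values
    if (pred_probs < 0).any():
        return False
    return bool(np.allclose(pred_probs.sum(axis=1), 1.0))

ok = check_pred_probs([[0.1, 0.7, 0.2], [1.0, 0.0, 0.0]], num_classes=3)
too_many_columns = check_pred_probs([[0.1, 0.5, 0.2, 0.2]], num_classes=3)
print(ok, too_many_columns)
```

A failed shape check usually points to unexpected values in the labels the classifier was trained on.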

In [71]:
## Code your solution here
from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label

In [74]:
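# Note: `multiannotator_labels` still contains the helper column 'nacount' added earlier,
# so it appears to be treated as an extra annotator here, which likely explains the warning below.
majority_vote_label = get_majority_vote_label(multiannotator_labels)
majority_vote_label.shape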

CAUTION: Number of unique classes has been reduced from the original data when establishing consensus labels using consensus method 'majority_vote', likely due to some classes being rarely annotated. If training a classifier on these consensus labels, it will never see any of the omitted classes unless you manually replace some of the consensus labels.
Classes in the original data but not in consensus labels: [37, 38, 39, 40, 41, 42, 43, 44, 45, 46]


(299,)

In [75]:
majority_label.shape

(299,)

In [78]:
(majority_label == majority_vote_label).sum()

259

In [79]:
train_model(labels_to_fit=majority_vote_label)

Accuracy of held-out model predictions against ground truth labels: 0.8996655518394648


array([[0.        , 0.56890921, 0.3956677 , 0.03542309, 0.        ,
        0.        ],
       [0.        , 0.        , 0.78381853, 0.21618147, 0.        ,
        0.        ],
       [0.14717297, 0.        , 0.85282703, 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.96710515, 0.        , 0.        , 0.        ,
        0.03289485],
       [0.82245665, 0.17754335, 0.        , 0.        , 0.        ,
        0.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ]])

In [84]:
del multiannotator_labels['nacount']

In [85]:
quality = get_label_quality_multiannotator(pred_probs=probs, labels_multiannotator=multiannotator_labels)
quality

Annotator(s) ['A0004' 'A0007' 'A0014' 'A0021' 'A0025' 'A0026' 'A0036' 'A0037' 'A0039'] did not annotate any examples that overlap with other annotators,                 
using the average annotator agreeement among other annotators as this annotator's agreement.


{'label_quality':      consensus_label  consensus_quality_score  annotator_agreement  \
 0                  2                 0.952228             1.000000   
 1                  2                 0.783818             1.000000   
 2                  2                 0.796275             0.600000   
 3                  1                 0.854722             0.571429   
 4                  0                 0.874720             0.333333   
 ..               ...                      ...                  ...   
 294                2                 0.954274             0.750000   
 295                0                 0.875320             0.666667   
 296                1                 0.967105             1.000000   
 297                0                 0.822457             1.000000   
 298                0                 0.947990             0.600000   
 
      num_annotations  
 0                  4  
 1                  1  
 2                  5  
 3                  7  
 4       

In [87]:
from cleanlab.multiannotator import get_label_quality_multiannotator

# Here we use the predicted class probabilities from the classifier trained on random annotators' labels.
# The classifier trained on majority-vote labels (`probs` above) had more accurate predictions and is a reasonable alternative.
pred_probs = pred_probs_from_model_fit_to_random_annotators  # alternatively: probs
results = get_label_quality_multiannotator(multiannotator_labels, pred_probs, verbose=False)
crowdlab_labels = results["label_quality"]["consensus_label"]

print(f"Accuracy of CROWDLAB consensus labels against ground truth labels: {np.mean(crowdlab_labels == true_labels)}")
pred_probs_from_model_fit_to_crowdlab = train_model(labels_to_fit = crowdlab_labels)

annotator_quality_estimates = results["annotator_stats"]
print(f"Worst 10 annotators are inferred to be: {annotator_quality_estimates.index[:10].tolist()}")

Accuracy of CROWDLAB consensus labels against ground truth labels: 0.862876254180602
Accuracy of held-out model predictions against ground truth labels: 0.9230769230769231
Worst 10 annotators are inferred to be: ['A0048', 'A0042', 'A0046', 'A0040', 'A0041', 'A0044', 'A0043', 'A0049', 'A0045', 'A0012']


## Outlook / Summary
The steps to obtain a consensus label are:
- Count the votes per class for each example (how often each class was chosen).
- This yields `count_df`, a DataFrame of shape (<num_samples>, <num_classes>).
- Take the argmax of each row to get a first consensus label: the class most annotators voted for.
- Break ties:
  - First check whether ties exist: take the maximum of each row, mark the entries equal to that maximum, and count them; more than one means a tie.
  - Train a model on the first argmax labels to get out-of-sample predicted probabilities for each class.
    - Set these probabilities to 0 for rows without a tie.
  - Add the probabilities to `count_df`.
  - Take the argmax again.
- The result is the final set of consensus labels.
- Annotator quality can then be estimated as each annotator's rate of agreement with the consensus label.

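## Notable:
- Training on my tie-broken majority-vote labels gave higher held-out model accuracy (0.957) than training on the CROWDLAB consensus labels (0.923).
- The two methods also produced slightly different lists of worst annotators.
- Open question: why the two approaches differ (parked on the someday/maybe list).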
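
To quantify how much the two worst-10 lists differ, they can be compared as sets. The lists below are copied from the outputs above; the set of truly low-quality annotators follows from `make_data`, where the last 10 of the 50 annotators are the worse ones.

```python
# Worst-10 lists copied from the outputs above.
majority_vote_worst = ["A0048", "A0044", "A0046", "A0043", "A0045",
                       "A0042", "A0040", "A0029", "A0041", "A0050"]
crowdlab_worst = ["A0048", "A0042", "A0046", "A0040", "A0041",
                  "A0044", "A0043", "A0049", "A0045", "A0012"]

# In make_data(), annotators A0041..A0050 are simulated with the lower quality.
truly_worst = {f"A{i:04d}" for i in range(41, 51)}

shared = set(majority_vote_worst) & set(crowdlab_worst)
majority_vote_hits = set(majority_vote_worst) & truly_worst
crowdlab_hits = set(crowdlab_worst) & truly_worst
print(len(shared), len(majority_vote_hits), len(crowdlab_hits))
```

Both lists share 8 annotators with each other and each recovers 8 of the 10 simulated low-quality annotators, so on this dataset the two rankings are comparably good at the top-10 level.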


In [89]:
arr == max_val

array([[False, False,  True],
       [False, False,  True],
       [False, False,  True],
       [False,  True, False],
       [False,  True, False],
       [False,  True, False],
       [ True, False, False],
       [ True, False, False],
       [False,  True, False],
       [ True, False, False],
       [ True, False, False],
       [ True, False,  True],
       [ True, False, False],
       [ True, False, False],
       [False,  True, False],
       [False, False,  True],
       [ True, False, False],
       [False, False,  True],
       [False,  True, False],
       [ True,  True,  True],
       [False,  True, False],
       [ True, False, False],
       [False,  True, False],
       [False, False,  True],
       [ True, False, False],
       [False, False,  True],
       [ True, False, False],
       [ True, False, False],
       [ True, False, False],
       [ True, False,  True],
       [ True, False, False],
       [ True, False, False],
       [ True, False, False],
       [ T