# Find Best Consensus Labels for Multiannotator Data using Cleanlab

In this tutorial, we will use Cleanlab to find improved consensus labels for data that has been labeled by multiple annotators. The tutorial also shows you how `cleanlab.multiannotator.get_label_quality_multiannotator()` can automatically compute:
- consensus quality scores
- annotator quality scores
- agreement scores for each example and each annotator
- label quality scores for every annotator's labels

**Overview of what we'll do in this tutorial:**
- Obtain consensus labels of multiannotator data using majority vote
- Train a model on the majority vote consensus labels to compute out-of-sample predicted probabilities
- Use cleanlab's `multiannotator.get_label_quality_multiannotator` function to get improved consensus labels
- View other information about your multiannotator dataset, such as consensus and annotator quality scores, agreement scores, detailed label quality scores and more

## 1. Install and import required dependencies

You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install scikit-learn
!pip install cleanlab

# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [None]:
# Package installation (hidden on docs website).
dependencies = ["cleanlab", "sklearn"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

Let’s import some of the packages needed throughout this tutorial.

In [None]:
import numpy as np
import pandas as pd

from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label

## 2. Create the data (can skip these details)

<details><summary>Below is the code used for data-generation.</summary>

```ipython3
# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.
    
from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace
from cleanlab.benchmarking.noise_generation import generate_noisy_labels

SEED = 1

def make_data(
    means=[[3, 2], [7, 7], [0, 8]],
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],
    sizes=[80, 40, 40],
    avg_trace=0.8,
    num_annotators=50,
    seed=SEED,  # set to None for non-reproducible randomness
):
    np.random.seed(seed=seed)  # use the `seed` argument rather than the global SEED

    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Compute p(true_label=k)
    py = np.bincount(true_labels_train) / float(len(true_labels_train))

    noise_matrix = generate_noise_matrix_from_trace(
        m,
        trace=avg_trace * m,
        py=py,
        valid_noise_matrix=True,
        seed=seed,
    )

    # Generate our noisy labels using the noise_matrix for specified number of annotators.
    s = pd.DataFrame(
        np.vstack(
            [generate_noisy_labels(true_labels_train, noise_matrix) for _ in range(num_annotators)]
        ).transpose()
    )

    # Each annotator only labels approximately 20% of the dataset
    # (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.8))
    s.dropna(axis=1, how="all", inplace=True)

    row_NA_check = pd.notna(s).any(axis=1)

    return {
        "X_train": X_train[row_NA_check],
        "true_labels_train": true_labels_train[row_NA_check],
        "multiannotator_labels": s[row_NA_check].reset_index(drop=True),
    }

data_dict = make_data()
data = data_dict["multiannotator_labels"]  
```

</details>

In [None]:
from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace
from cleanlab.benchmarking.noise_generation import generate_noisy_labels

SEED = 111

def make_data(
    means=[[3, 2], [7, 7], [0, 8]],
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],
    sizes=[80, 40, 40],
    num_annotators=50,
    seed=SEED,  # set to None for non-reproducible randomness
):
    np.random.seed(seed=seed)  # use the `seed` argument rather than the global SEED

    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Compute p(true_label=k)
    py = np.bincount(true_labels_train) / float(len(true_labels_train))
    
    noise_matrix_better = generate_noise_matrix_from_trace(
        m,
        trace=0.9 * m,
        py=py,
        valid_noise_matrix=True,
        seed=seed,
    )
    
    noise_matrix_worse = generate_noise_matrix_from_trace(
        m,
        trace=0.5 * m,
        py=py,
        valid_noise_matrix=True,
        seed=seed,
    )

    # Generate our noisy labels using the noise_matrix for specified number of annotators.
    s = pd.DataFrame(
        np.vstack(
            [
                generate_noisy_labels(true_labels_train, noise_matrix_better)
                if i < num_annotators - 5
                else generate_noisy_labels(true_labels_train, noise_matrix_worse)
                for i in range(num_annotators)
            ]
        ).transpose()
    )

    # Each annotator only labels approximately 10% of the dataset
    # (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9))
    s.dropna(axis=1, how="all", inplace=True)

    row_NA_check = pd.notna(s).any(axis=1)

    return {
        "X_train": X_train[row_NA_check],
        "true_labels_train": true_labels_train[row_NA_check],
        "multiannotator_labels": s[row_NA_check].reset_index(drop=True),
    }

data_dict = make_data()

X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"]

For this tutorial we are using a toy dataset that has 50 annotators and 160 examples. There are three possible classes, `0`, `1` and `2`.

Each annotator annotates approximately 10% of the examples. We also synthetically made the last 5 annotators in our toy dataset have much noisier labels than the rest of the annotators.
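
As a quick sanity check on proportions like these, the sketch below measures how much of a labels table each annotator filled in. It builds a small stand-in table (`toy_labels`, a hypothetical name) using the same masking idea as `make_data`, so it runs on its own. The same two summary lines should apply to `multiannotator_labels`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for `multiannotator_labels`: 200 examples, 10 annotators,
# each label kept with probability 0.1 and replaced by NaN otherwise.
full_labels = pd.DataFrame(rng.integers(0, 3, size=(200, 10))).astype(float)
toy_labels = full_labels.mask(rng.random(full_labels.shape) < 0.9)

# Fraction of examples labeled by each annotator (one value per column).
coverage_per_annotator = toy_labels.notna().mean(axis=0)

# Number of annotations received by each example (one value per row).
annotations_per_example = toy_labels.notna().sum(axis=1)

print(coverage_per_annotator.round(2))
print(annotations_per_example.describe())
```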

<div class="alert alert-info">
Bringing Your Own Data (BYOD)?

You can easily replace the above with your own multiannotator dataset, and continue with the rest of the tutorial.
 
`multiannotator_labels` should be a numpy array or pandas DataFrame with each column representing an annotator and each row representing an example. Your classes (and entries of `multiannotator_labels`) should be represented as integer indices 0, 1, ..., num_classes - 1, where examples that are not annotated by a particular annotator are represented using `np.nan`.

</div>
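
If your annotations are stored as one row per (example, annotator, label) record with string class names, a reshaping step along the following lines should produce the layout described above. This is only a sketch: the column names `example_id`, `annotator_id`, and `label`, the records themselves, and the variable `multiannotator_labels_byod` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format annotations: one row per (example, annotator) pair.
records = pd.DataFrame(
    {
        "example_id": [0, 0, 1, 1, 2, 2, 2],
        "annotator_id": ["ann_a", "ann_b", "ann_a", "ann_c", "ann_a", "ann_b", "ann_c"],
        "label": ["cat", "cat", "dog", "bird", "bird", "bird", "dog"],
    }
)

# Map class names to integer indices 0, 1, ..., num_classes - 1.
class_names = sorted(records["label"].unique())
class_to_index = {name: idx for idx, name in enumerate(class_names)}
records["label_index"] = records["label"].map(class_to_index)

# Pivot to one row per example and one column per annotator;
# (example, annotator) pairs with no annotation become NaN.
multiannotator_labels_byod = records.pivot(
    index="example_id", columns="annotator_id", values="label_index"
)

print(class_to_index)
print(multiannotator_labels_byod)
```

Keeping `class_to_index` around makes it possible to translate consensus labels back to class names afterwards.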


## 3. Get majority vote label and compute out-of-sample predicted probabilities

Before training our machine learning model, we must first derive a single consensus label for each example from the annotators' labels. The simplest way to obtain an initial set of consensus labels is majority vote: for each example, select the class chosen by the most annotators.

In [None]:
majority_vote_label = get_majority_vote_label(multiannotator_labels)
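
To make the idea concrete, here is a minimal sketch of majority voting over a labels table with missing entries. It is not cleanlab's implementation: `simple_majority_vote` is a hypothetical helper, and it breaks ties by picking the smallest class index, which may differ from how `get_majority_vote_label` resolves ties.

```python
import numpy as np
import pandas as pd


def simple_majority_vote(labels: pd.DataFrame) -> np.ndarray:
    """Return the most frequent non-missing label in each row.

    Assumptions (illustrative only): every row has at least one label,
    and ties go to the smallest class index.
    """
    consensus = []
    for _, row in labels.iterrows():
        votes = row.dropna().astype(int).to_numpy()
        counts = np.bincount(votes)
        consensus.append(int(np.argmax(counts)))  # argmax returns the first maximum
    return np.array(consensus)


toy_labels = pd.DataFrame(
    {
        "annotator_0": [0, 1, np.nan, 2],
        "annotator_1": [0, np.nan, 2, 1],
        "annotator_2": [1, 1, 2, np.nan],
    }
)

print(simple_majority_vote(toy_labels))
```

The last row is a tie between classes 1 and 2, which is exactly the kind of case where a model's predictions can add useful information.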

Next, we will train our model on the consensus labels obtained using majority vote to compute out-of-sample predicted probabilities. Here, we use a simple logistic regression model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression()

num_crossval_folds = 5  
cv_pred_probs = cross_val_predict(
    estimator=model, X=X, y=majority_vote_label, cv=num_crossval_folds, method="predict_proba"
)
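
"Out-of-sample" here means that each example's predicted probabilities come from a copy of the model that never saw that example during training. The sketch below spells out roughly what `cross_val_predict` does, using synthetic stand-in data (`X_toy` and `y_toy` are hypothetical). The fold assignment is an assumption and may differ from the call above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Hypothetical stand-in data: three Gaussian blobs, 30 points each.
centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 6.0]])
X_toy = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 2)) for c in centers])
y_toy = np.repeat(np.arange(3), 30)

num_classes = len(np.unique(y_toy))
pred_probs = np.zeros((len(y_toy), num_classes))

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, holdout_idx in kfold.split(X_toy, y_toy):
    # Fit a fresh model on the training folds only ...
    fold_model = clone(LogisticRegression()).fit(X_toy[train_idx], y_toy[train_idx])
    # ... and predict probabilities for the held-out fold it has not seen.
    pred_probs[holdout_idx] = fold_model.predict_proba(X_toy[holdout_idx])

print(pred_probs.shape)
```

Each row of `pred_probs` is filled exactly once, by the model trained without that row.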

## 4. Use cleanlab to get better consensus labels and other statistics

Using the annotators' labels and the out-of-sample predicted probabilities from the model, cleanlab can help us obtain improved consensus labels for our data.


In [None]:
multiannotator_dict = get_label_quality_multiannotator(multiannotator_labels, cv_pred_probs)

Here, we use the `multiannotator.get_label_quality_multiannotator()` function which returns a dictionary containing three items:


- `label_quality_multiannotator` which gives us the improved consensus labels using information from each of the annotators and the model. The DataFrame also contains information about the number of annotations, annotator agreement and consensus quality score for each example.


In [None]:
multiannotator_dict["label_quality_multiannotator"].head()

- `detailed_label_quality` which returns the label quality score for each label given by every annotator

In [None]:
multiannotator_dict["detailed_label_quality"].head()

- `annotator_stats` which gives us the annotator quality score for each annotator, alongside other information such as the number of examples each annotator labeled, their agreement with the consensus label and the class they perform the worst at.

In [None]:
multiannotator_dict["annotator_stats"].head(10)

The `annotator_stats` DataFrame is sorted by increasing `annotator_quality`, showing us the worst annotators first.

Notice that in the above table annotators with ids 45 to 49 have the worst annotator quality score, which is expected because we made the last 5 annotators systematically worse than the rest.
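
The agreement statistics mentioned above can be understood with a few lines of pandas. The following sketch recomputes two such quantities on a toy table under simple assumed definitions: the fraction of an example's annotations that match the consensus, and the fraction of an annotator's labels that match the consensus. cleanlab's exact definitions and column names may differ, and `toy_labels` and `toy_consensus` are hypothetical.

```python
import numpy as np
import pandas as pd

toy_labels = pd.DataFrame(
    {
        "annotator_0": [0, 1, np.nan, 2],
        "annotator_1": [0, np.nan, 2, 1],
        "annotator_2": [1, 1, 2, np.nan],
    }
)
toy_consensus = pd.Series([0, 1, 2, 2])

# True where an annotator's label equals the consensus label for that example.
# NaN entries compare as False, so they never count as matches.
matches = toy_labels.eq(toy_consensus, axis=0)
is_labeled = toy_labels.notna()

# Per example: share of the given annotations that match the consensus.
example_agreement = matches.sum(axis=1) / is_labeled.sum(axis=1)

# Per annotator: share of that annotator's labels that match the consensus.
annotator_agreement = matches.sum(axis=0) / is_labeled.sum(axis=0)

print(example_agreement)
print(annotator_agreement.sort_values())
```

Sorting the per-annotator values in increasing order puts the annotators who disagree most with the consensus first, mirroring how the table above is ordered.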

### Comparing improved consensus labels

We can get the improved consensus labels from the `label_quality_multiannotator` DataFrame shown above.

In [None]:
improved_consensus_label = multiannotator_dict["label_quality_multiannotator"]["consensus_label"]

Since our toy dataset is synthetically generated by adding noise to each annotator's labels, we know the ground truth labels for each example. Hence we can compare the accuracy of the consensus labels obtained using majority vote, and the improved consensus labels obtained using cleanlab.

In [None]:
majority_vote_accuracy = np.mean(true_labels == majority_vote_label)
cleanlab_label_accuracy = np.mean(true_labels == improved_consensus_label)

print(f"Accuracy of majority vote labels = {majority_vote_accuracy}")
print(f"Accuracy of cleanlab consensus labels = {cleanlab_label_accuracy}")

We can see that the consensus labels became more accurate as a result of using cleanlab, which takes into account not only the annotators' labels but also the model's predicted probabilities.
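
To see where a difference in accuracy comes from, it helps to look at the examples on which the two sets of consensus labels disagree. The sketch below uses small hypothetical arrays (`toy_true`, `toy_majority`, `toy_improved`) in place of `true_labels`, `majority_vote_label`, and `improved_consensus_label`.

```python
import numpy as np

# Hypothetical stand-ins for the tutorial's arrays.
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_majority = np.array([0, 1, 1, 2, 2, 2])
toy_improved = np.array([0, 0, 1, 2, 2, 1])

# Indices where the two sets of consensus labels differ.
changed = np.flatnonzero(toy_majority != toy_improved)

# Among the changed examples, count corrections and newly introduced errors.
num_fixed = int(np.sum(toy_improved[changed] == toy_true[changed]))
num_broken = int(np.sum(toy_majority[changed] == toy_true[changed]))

print(f"Changed examples: {changed.tolist()}")
print(f"Fixed: {num_fixed}, newly wrong: {num_broken}")
```

On real data without ground truth, the `changed` indices are still a useful shortlist of examples to review by hand.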

In [None]:
# Note: This cell is only for docs.cleanlab.ai. If running on local Jupyter or Colab, please ignore it.

if majority_vote_accuracy >= cleanlab_label_accuracy:  # check cleanlab has improved consensus label accuracy
    raise Exception("Cleanlab failed to improve consensus label accuracy over majority vote.")