# How to Aggregate Categorical Replies via Crowdsourcing (Demo from ICML 2021)

*This example was originally posted on [HackerNoon](https://hackernoon.com/how-to-aggregate-categorical-replies-via-crowdsourcing-demo-from-icml-2021-8u4j37hz).*

We will aggregate categorical responses with the help of two classical algorithms, Majority Vote and Dawid-Skene, along with Wawa, a simple skill-weighted variant of Majority Vote.

We’ll be using [Crowd-Kit](https://pypi.org/project/crowd-kit/), an open-source computational quality control library that offers efficient implementations of various quality control methods, including aggregation, uncertainty, agreements, and more. Crowd-Kit is designed to work with Python data science libraries like NumPy, SciPy, and pandas, while providing a very familiar programming experience with well-known concepts and APIs. It’s also platform agnostic. As soon as you provide the data as a table of annotators, tasks, and responses, the library will deliver high-quality output regardless of the platform you use.

In this demonstration, we will aggregate some responses provided by crowd annotators. For the project, they had to indicate whether the link to a target website was correct or not. Because we asked multiple annotators to annotate each URL, we need to choose the correct response while considering annotators’ skills and task difficulties. This is why we need aggregation. It’s a vast research topic, and there are many methods available for performing this task, most of which are based on probabilistic graphical models. Implementing them efficiently is another challenging task.

First, we need to install the Crowd-Kit library from the Python Package Index. We’ll also need annotated data. We’ll be using Toloka Aggregation Relevance [datasets](https://toloka.ai/datasets) with two categories: relevant and not relevant. These datasets contain anonymized data that is safe to work with. We’ll use the Crowd-Kit dataset downloader to download them from the Internet as pandas data frames. Feel free to use a different source of annotated data; open datasets are, naturally, fair play as well. Now we’re ready to go.

In [1]:
%%capture
%pip install -U crowd-kit==1.2.1

In [2]:
from crowdkit.datasets import load_dataset

In [3]:
df, df_gt = load_dataset("relevance-2")

The `load_dataset` function returns a pair of elements. The first element is the pandas data frame with the crowdsourced data. The second element is the ground truth dataset, when one is available. The data frame, or `df`, has three columns: `worker`, `task`, and `label`. The label is `0` if the given annotator rated the document in the given task as non-relevant, and `1` otherwise. The ground truth dataset `df_gt` is a pandas series of correct responses indexed by task.
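To make the expected layout concrete, here is a minimal sketch that builds a tiny data set in the same shape by hand. The worker IDs, task IDs, and labels are made up for illustration; only the column names `worker`, `task`, and `label` and the `true_label` series indexed by `task` follow the real dataset shown in this notebook.

```python
import pandas as pd

# Hypothetical toy annotations in the same three-column layout as `df`.
toy_df = pd.DataFrame(
    [
        ("w1", "t1", 1),
        ("w2", "t1", 1),
        ("w3", "t1", 0),
        ("w1", "t2", 0),
        ("w2", "t2", 0),
        ("w3", "t2", 0),
    ],
    columns=["worker", "task", "label"],
)

# Hypothetical ground truth: correct labels indexed by task, like `df_gt`.
toy_gt = pd.Series({"t1": 1, "t2": 0}, name="true_label")
toy_gt.index.name = "task"

# Overlap: how many distinct annotators answered each task.
overlap = toy_df.groupby("task")["worker"].nunique()
print(overlap.to_dict())
```

Any data source that can be reshaped into this form should be usable in the same way as the downloaded dataset.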

So the data has been downloaded, and before we move forward, let’s have a look at the dataset.

In [4]:
df

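| | worker | task | label |
|---|---|---|---|
| 0 | w851 | t30685 | 1 |
| 1 | w6991 | t30008 | 0 |
| 2 | w2596 | t36316 | 0 |
| 3 | w5507 | t15145 | 1 |
| 4 | w2982 | t44785 | 1 |
| ... | ... | ... | ... |
| 475531 | w4660 | t62250 | 1 |
| 475532 | w6630 | t46626 | 0 |
| 475533 | w4605 | t93513 | 1 |
| 475534 | w1928 | t29002 | 0 |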


In [5]:
df_gt

task
t30006    0
t33578    0
t22462    1
t52093    0
t26935    0
         ..
t57345    1
t81052    1
t7189     1
t80463    0
t93643    0
Name: true_label, Length: 10079, dtype: int64

We import three aggregation classes: Majority Vote, Wawa, and Dawid-Skene.

In [6]:
from crowdkit.aggregation import DawidSkene, MajorityVote, Wawa

Let's proceed to the aggregation using majority vote, a very simple heuristic method. We create an instance of `MajorityVote` and call the `fit_predict` method to perform majority vote aggregation of our data.

In [7]:
agg_mv = MajorityVote().fit_predict(df)
agg_mv

task
t0       1
t1       1
t10      1
t100     0
t1000    0
        ..
t9995    1
t9996    0
t9997    0
t9998    0
t9999    1
Name: agg_label, Length: 99319, dtype: int64

This simple heuristic works extremely well, especially on small datasets, so it’s always a good idea to try it. Note that the ties are broken randomly to avoid bias towards the first occurring label.
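For intuition, majority vote can be sketched in a few lines of plain pandas. This is an illustrative sketch and not Crowd-Kit's implementation: the function name `majority_vote` and the toy data are hypothetical, and unlike the behaviour described above, this version breaks ties deterministically in favour of the smallest label.

```python
import pandas as pd


def majority_vote(df: pd.DataFrame) -> pd.Series:
    """Return the most frequent label per task (illustrative sketch)."""
    # Rows are tasks, columns are labels, cells are vote counts.
    counts = df.groupby(["task", "label"]).size().unstack(fill_value=0)
    # idxmax picks the first column holding the maximum count,
    # so a tie goes to the smallest label in this sketch.
    result = counts.idxmax(axis=1)
    result.name = "agg_label"
    return result


# Hypothetical toy annotations: three workers, two tasks.
toy_df = pd.DataFrame(
    [
        ("w1", "t1", 1), ("w2", "t1", 1), ("w3", "t1", 0),
        ("w1", "t2", 0), ("w2", "t2", 0), ("w3", "t2", 0),
    ],
    columns=["worker", "task", "label"],
)

toy_mv = majority_vote(toy_df)
print(toy_mv.to_dict())
```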

However, the classical majority vote approach does not take the skills of the annotators into account. Sometimes it’s useful to weigh every annotator's contribution to the final label proportionally to their agreement with the aggregate. This approach is called Wawa, and Crowd-Kit also offers it. Internally, it computes the majority vote and then re-weights the annotators’ votes by the fraction of their responses that match that majority vote.

In [8]:
agg_wawa = Wawa().fit_predict(df)
agg_wawa

task
t0       1
t1       1
t10      1
t100     0
t1000    0
        ..
t9995    1
t9996    0
t9997    0
t9998    0
t9999    1
Name: agg_label, Length: 99319, dtype: int64
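The Wawa procedure described above can be sketched with pandas as follows. This is a simplified illustration under our own assumptions, not Crowd-Kit's code: the function name `wawa_sketch` and the toy data are hypothetical, and ties are broken in favour of the smallest label.

```python
import pandas as pd


def wawa_sketch(df: pd.DataFrame):
    """Skill-weighted majority vote (illustrative sketch)."""
    # Step 1: plain majority vote per task.
    counts = df.groupby(["task", "label"]).size().unstack(fill_value=0)
    mv = counts.idxmax(axis=1)

    # Step 2: a worker's skill is the fraction of their responses
    # that agree with the majority vote.
    matches = df["label"] == df["task"].map(mv)
    skills = matches.groupby(df["worker"]).mean()

    # Step 3: repeat the vote, weighting each response by the worker's skill.
    weights = df["worker"].map(skills)
    scores = weights.groupby([df["task"], df["label"]]).sum().unstack(fill_value=0.0)
    labels = scores.idxmax(axis=1)
    labels.name = "agg_label"
    return labels, skills


# Hypothetical toy annotations: w3 disagrees with the majority on t1.
toy_df = pd.DataFrame(
    [
        ("w1", "t1", 1), ("w2", "t1", 1), ("w3", "t1", 0),
        ("w1", "t2", 0), ("w2", "t2", 0), ("w3", "t2", 0),
        ("w1", "t3", 1), ("w2", "t3", 1), ("w3", "t3", 1),
    ],
    columns=["worker", "task", "label"],
)

toy_labels, toy_skills = wawa_sketch(toy_df)
print(toy_labels.to_dict())
print(toy_skills.to_dict())
```

In this toy example `w3` ends up with a skill of 2/3 while the other two workers get 1.0, so the vote of `w3` counts for less in the second pass.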

Now we perform the same operation with Dawid-Skene.

In [9]:
agg_ds = DawidSkene(n_iter=10).fit_predict(df)
agg_ds

task
t30685    1
t30008    0
t36316    0
t15145    1
t44785    0
         ..
t95222    0
t83525    0
t49227    0
t96106    1
t16185    1
Name: agg_label, Length: 99319, dtype: int64

Dawid-Skene is another classical aggregation approach in crowdsourcing, which was originally designed in the 70s for probabilistic modeling of medical examinations. The code is virtually the same: we create an instance, set the number of algorithm iterations, call `fit_predict`, and obtain the aggregated results.
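To give a feel for what happens inside such a model, below is a compact NumPy sketch of an expectation-maximization loop in the spirit of Dawid-Skene. It is written under our own simplifying assumptions and is not Crowd-Kit's implementation: the function name `dawid_skene_sketch`, the integer coding of tasks, workers, and labels, the initialisation from vote shares, and the `1e-6` smoothing constant are all choices made for this illustration.

```python
import numpy as np


def dawid_skene_sketch(tasks, workers, labels, n_classes, n_iter=10):
    """Tiny EM sketch for integer-coded tasks, workers, and labels."""
    tasks, workers, labels = (np.asarray(a) for a in (tasks, workers, labels))
    n_tasks, n_workers = tasks.max() + 1, workers.max() + 1

    # Initialise task posteriors with normalised vote counts.
    probs = np.zeros((n_tasks, n_classes))
    np.add.at(probs, (tasks, labels), 1.0)
    probs /= probs.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices,
        # where confusion[w, true, observed] is a probability.
        priors = probs.mean(axis=0)
        confusion = np.full((n_workers, n_classes, n_classes), 1e-6)
        for k in range(n_classes):
            np.add.at(confusion[:, k, :], (workers, labels), probs[tasks, k])
        confusion /= confusion.sum(axis=2, keepdims=True)

        # E-step: recompute task posteriors in log space.
        log_post = np.tile(np.log(priors + 1e-12), (n_tasks, 1))
        for k in range(n_classes):
            np.add.at(log_post[:, k], tasks, np.log(confusion[workers, k, labels]))
        log_post -= log_post.max(axis=1, keepdims=True)
        probs = np.exp(log_post)
        probs /= probs.sum(axis=1, keepdims=True)

    return probs.argmax(axis=1), probs


# Hypothetical toy data: workers 0 and 1 always agree, worker 2 is noisy.
toy_tasks = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
toy_workers = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
toy_labels = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]

toy_pred, toy_probs = dawid_skene_sketch(toy_tasks, toy_workers, toy_labels, n_classes=2)
print(toy_pred.tolist())
```

The per-worker confusion matrices are what lets the model go beyond a single skill number: each worker gets a separate error rate for every true class.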

Let’s evaluate the quality of our aggregations. We will use the well-known F1 score from the scikit-learn library.

In [10]:
from sklearn.metrics import f1_score

In this dataset, the ground truth labels are available only for a subset of tasks, so we select the matching aggregated labels by indexing each result with `df_gt.index`. This allows us to perform model selection using well-known and reliable tools like pandas and scikit-learn together with Crowd-Kit.
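The following toy sketch shows what the index slicing does and how a binary F1 score is derived from the aligned labels. The task IDs and labels are hypothetical, and F1 is computed by hand for the positive class `1`, which is also the default positive class of scikit-learn's `f1_score`.

```python
import pandas as pd

# Hypothetical aggregated labels for five tasks.
agg = pd.Series({"t1": 1, "t2": 0, "t3": 1, "t4": 0, "t5": 1}, name="agg_label")

# Hypothetical ground truth, known for three of the tasks only.
gt = pd.Series({"t3": 1, "t1": 0, "t4": 1}, name="true_label")

# Indexing by the ground truth index keeps only the tasks with known
# answers and puts them in the same order as `gt`.
pred = agg[gt.index]

# Binary F1 for the positive class 1, computed from the confusion counts.
tp = int(((pred == 1) & (gt == 1)).sum())
fp = int(((pred == 1) & (gt == 0)).sum())
fn = int(((pred == 0) & (gt == 1)).sum())
f1 = 2 * tp / (2 * tp + fp + fn)
print(list(pred.index), tp, fp, fn, f1)
```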

In [11]:
f1_score(df_gt, agg_mv[df_gt.index])

0.7621861152141802

In [12]:
f1_score(df_gt, agg_wawa[df_gt.index])

0.7610675039246467

In [13]:
f1_score(df_gt, agg_ds[df_gt.index])

0.7878520154610712

In our experiment, the best quality was offered by the Dawid-Skene model. Having selected the model, we want to export all of the aggregated data, which is useful in downstream applications.

We now transform the series to a data frame for later use by specifying the desired column name.

Let’s take a look inside it. Each row contains a task and the label aggregated from its responses.

In [14]:
agg_ds.to_frame("label").reset_index()

| | task | label |
|---|---|---|
| 0 | t30685 | 1 |
| 1 | t30008 | 0 |
| 2 | t36316 | 0 |
| 3 | t15145 | 1 |
| 4 | t44785 | 0 |
| ... | ... | ... |
| 99314 | t95222 | 0 |
| 99315 | t83525 | 0 |
| 99316 | t49227 | 0 |
| 99317 | t96106 | 1 |
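If the aggregated labels need to leave the notebook, one option is to write the resulting frame as CSV. The sketch below uses a hypothetical three-task series and an in-memory buffer so that it runs without touching the file system; passing a file path to `to_csv` instead would write a file.

```python
import io

import pandas as pd

# Hypothetical aggregated labels, indexed by task like `agg_ds`.
agg = pd.Series(
    [1, 0, 1],
    index=pd.Index(["t1", "t2", "t3"], name="task"),
    name="agg_label",
)

# Same transformation as above: name the value column, move tasks to a column.
frame = agg.to_frame("label").reset_index()

# Write CSV text to an in-memory buffer instead of a file.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```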


We’ve obtained aggregated data in just a few lines of code using [Crowd-Kit](https://github.com/Toloka/crowd-kit) and commonly used Python data science libraries.