ClariQ Overview

The ClariQ (pronounced Claire-ee-que) challenge is organized as part of the Conversational AI challenge series (ConvAI3) at the Search-oriented Conversational AI (SCAI) EMNLP workshop in 2020. The main aim of a conversational system is to return an appropriate answer in response to user requests. However, some user requests might be ambiguous. In IR settings, such a situation is handled mainly through diversification of the search result page. It is, however, much more challenging in dialogue settings. Hence, we aim to study the following situation for dialogue settings:

  • a user asks an ambiguous question (that is, a question to which more than one possible answer can be returned);
  • the system must identify that the question is ambiguous and, instead of trying to answer it directly, ask a good clarifying question.

The main research questions we aim to answer as part of the challenge are the following:

  • RQ1: When to ask clarifying questions during dialogues?
  • RQ2: How to generate the clarifying questions?

The detailed description of the challenge can be found in the following document.

How to participate?


  • July 7, 2020: Announcing Stage 1 of the ClariQ challenge

Challenge Design

The challenge will be run in two stages:

Stage 1: initial dataset

In Stage 1, we provide participants with datasets that include:

  • User Request: an initial user request in conversational form, e.g., "What is Fickle Creek Farm?", with a label from 1 to 4 that reflects how much clarification is needed;
  • Clarification questions: a set of possible clarifying questions, e.g., "Do you want to know the location of fickle creek farm?";
  • User Answers: each question is supplied with a user answer, e.g., "No, I want to find out where can i purchase fickle creek farm products."

To answer RQ1: Given a user request, return a score from 1 to 4 indicating the necessity of asking clarifying questions.

To answer RQ2: Given a user request that needs clarification, return the most suitable clarifying question. Here participants can either (1) select a clarifying question from the provided question bank (all clarifying questions we collected), aiming to maximize precision, or (2) choose not to ask any question (by choosing Q0001 from the question bank).

The dataset is stored in the following repository, together with evaluation scripts and a baseline.
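
As an illustration only, the Stage 1 decision could be sketched as follows. The token-overlap scoring, the `threshold` parameter, the dictionary layout of the question bank, and the question IDs other than Q0001 are assumptions made for this sketch; they are not the provided baseline or the actual dataset format.

```python
import re

NO_QUESTION_ID = "Q0001"  # per the description above, Q0001 means "ask nothing"


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_question(request, clarification_need, question_bank, threshold=2):
    """Return the ID of the question to ask, or Q0001 to ask nothing.

    clarification_need: integer from 1 to 4 (the RQ1 output).
    question_bank: dict of question ID -> question text (hypothetical layout).
    threshold: hypothetical cut-off; lower needs trigger no question.
    """
    if clarification_need < threshold:
        return NO_QUESTION_ID
    request_tokens = tokenize(request)
    best_id, best_overlap = NO_QUESTION_ID, 0
    for qid, text in question_bank.items():
        if qid == NO_QUESTION_ID:
            continue
        overlap = len(request_tokens & tokenize(text))
        if overlap > best_overlap:
            best_id, best_overlap = qid, overlap
    return best_id


# Toy question bank; Q0002 and Q0003 are made-up IDs for this example.
bank = {
    "Q0001": "",
    "Q0002": "Do you want to know the location of fickle creek farm?",
    "Q0003": "Are you looking for a recipe?",
}
chosen = select_question("What is Fickle Creek Farm?", 3, bank)
skipped = select_question("What is Fickle Creek Farm?", 1, bank)
```

With this toy bank, `chosen` is the question that shares the most tokens with the request, and `skipped` falls back to Q0001 because the clarification need is below the threshold.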

Stage 2: human-in-the-loop

The top 5 systems from Stage 1 are exposed to real users, who rate their responses (answers and clarifying questions). The systems are rated on their overall performance. At each dialog step, a system should either give a factual answer to the user's query or ask for clarification. Therefore, participants need to:

  • ensure their system can answer simple user questions
  • make their own decisions on when clarification might be appropriate
  • provide a clarifying question whenever appropriate
  • interpret the user's reply to the clarifying question

The participants would need to strike a balance between asking too many questions and providing irrelevant answers.

Note that the setup of this stage is quite different from Stage 1. Participating systems would likely need to operate as a generative model, rather than a retrieval model. One option would be to cast the problem as generative from the beginning and solve the retrieval part of Stage 1, e.g., by ranking the offered candidates by their likelihood.

Alternatively, one may solve Stage 2 by retrieving a list of candidate answers (e.g., by invoking the Wikipedia API or the Chat Noir API that we describe above) and ranking them as in Stage 1.
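
The first option, ranking offered candidates by their likelihood, could look roughly like the sketch below. A smoothed unigram model stands in for a real generative model here, and all function names are hypothetical; an actual system would score each candidate conditioned on the user request.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Count whitespace-separated tokens over a list of training sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    return counts


def avg_log_likelihood(candidate, counts, vocab_size):
    """Length-normalized log-likelihood with add-one smoothing."""
    tokens = candidate.lower().split()
    if not tokens:
        return float("-inf")
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return log_prob / len(tokens)


def rank_by_likelihood(candidates, counts):
    """Sort candidates from most to least likely under the toy model."""
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    return sorted(
        candidates,
        key=lambda c: avg_log_likelihood(c, counts, vocab_size),
        reverse=True,
    )


counts = train_unigram(
    ["do you want to know the location", "do you want directions"]
)
ranked = rank_by_likelihood(
    ["purple elephants dance", "do you want the location"], counts
)
```

Length normalization is used so that longer candidates are not penalized simply for having more tokens; whether that is desirable depends on the model.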


  • Stage 1 will take place from July 7, 2020 to September 9, 2020. Until September 9, 2020, participants will be able to submit their models (source code) and solutions to be evaluated on the test set using automated metrics (which we will run on our servers). The current leaderboards will be visible to everyone.
  • Stage 2 will start on September 10, 2020. On September 10, 2020 the source code submission system will be locked, and the best performing systems will be evaluated over the next month using crowd workers.

Winners will be announced at SCAI@EMNLP2020, which will take place on November 19-20 (exact details TBD).


After the two stages, participants' models will be compared in two ways:

  • automated evaluation metrics on a new test set hidden from the competitors;
  • evaluation with crowd workers through MTurk.

The winners will be chosen based on these scores.


We will evaluate the following types of metrics:

  • Automated metrics: as automatic evaluation metrics we use MRR, P@[1,3,5,10,20], and nDCG@[1,3,5,20]. These metrics are computed as follows: a selected clarifying question, together with its corresponding answer, is added to the original request. The updated query is then used to retrieve (or re-rank) documents from the collection. The quality of a question is then evaluated by how much the question and its answer affect the performance of document retrieval. Models are also evaluated on how well they rank relevant questions higher than other questions in the question bank. For this task, which we call 'question relevance', the models are evaluated in terms of Recall@[10,20,30]. Since the precision of models is evaluated in the document relevance task, here we focus only on recall.

  • Crowd workers: given the entrant's model code, we will run live experiments where Turkers chat to their model given instructions identical to the creation of the original dataset, but with new profiles, and then score its performance. Turkers will score the models on a scale of 1 to 5.
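
For readers unfamiliar with the automated metrics above, here is a minimal sketch of their textbook definitions. It is not the official evaluation script: the function names are hypothetical, the updated query is assumed to be a plain concatenation, and nDCG is computed with linear gain against the ideal reordering of the retrieved list only, which may differ from the script in the repository.

```python
import math


def build_updated_query(request, question, answer):
    """Assumed form of the updated query: plain concatenation."""
    return " ".join([request, question, answer])


def reciprocal_rank(relevances):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(relevance_lists):
    """MRR: average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r) for r in relevance_lists) / len(relevance_lists)


def precision_at_k(relevances, k):
    """Fraction of the top-k positions holding a relevant item."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


# Graded relevance of retrieved documents, in ranked order (toy data).
relevances = [0, 1, 0, 2]
rr = reciprocal_rank(relevances)      # first relevant document is at rank 2
p_at_3 = precision_at_k(relevances, 3)
```

Each entry in `relevances` is the relevance grade of the document at that rank; for question relevance, `recall_at_k` takes the ranked question IDs and the set of relevant question IDs instead.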


  • Participants should indicate which training sources are used to build their models, and whether (and how) ensembling is used (we may place these in separate tracks in an attempt to deemphasize the use of ensembles).
  • Participants must provide their source code so that the hidden test set evaluation and live experiments can be computed without the team's influence, and so that the competition has further impact, as those models can be released for future research to build on. Code can be in any language, but a thin Python wrapper must be provided in order to work with our evaluation and live experiment code.
  • We will require that the winning systems also release their training code so that their work is reproducible (although we also encourage that for all systems).
  • Participants are free to augment training with other datasets as long as they are publicly released (and hence reproducible). All entrants are therefore expected to work on publicly available data or release the data they use to train.
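
To illustrate the "thin Python wrapper" requirement, here is one possible shape for a wrapper around a model written in another language. Everything in it is an assumption made for illustration: the class name, the JSON-over-stdin protocol, and the `question_id` field are hypothetical, and the real interface is whatever the evaluation code in the repository expects.

```python
import json
import subprocess
import sys


class ExternalModelWrapper:
    """Hypothetical wrapper: one JSON request on stdin, one JSON reply on stdout."""

    def __init__(self, command):
        # command: argument list for the external model, e.g. a compiled binary
        self.command = command

    def select_question(self, request):
        """Send the user request to the external process and parse its reply."""
        result = subprocess.run(
            self.command,
            input=json.dumps({"request": request}),
            capture_output=True,
            text=True,
            check=True,  # raise if the external model exits with an error
        )
        return json.loads(result.stdout)["question_id"]


# Stand-in "model": a Python one-liner that always answers Q0001 (ask nothing).
stub_command = [
    sys.executable,
    "-c",
    "import json, sys; json.load(sys.stdin); "
    "print(json.dumps({'question_id': 'Q0001'}))",
]
wrapper = ExternalModelWrapper(stub_command)
reply = wrapper.select_question("What is Fickle Creek Farm?")
```

Spawning one process per request keeps the sketch short; a real wrapper would more likely keep the model process alive between requests.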

Model Submission

To submit an entry, create a private repo with your model that works with our evaluation code, and share it with the following GitHub accounts: aliannejadi, varepsilon, julianakiseleva.

See, for example, the baseline submissions.

You are free to use any system (e.g., PyTorch, TensorFlow, C++, ...) as long as you can wrap your model for the evaluation. The top-level README should tell us your team name, model name, and where the, etc. files are so we can run them. Those should give the numbers on the validation set. Please also include those numbers in the README so we can check that we get the same results. We will then run the automatic evaluations on the hidden test set and update the leaderboard. You can submit at most once per month.

We will use the same submitted code for the top performing models for computing human evaluations when the submission system is locked on September 9, 2020.

Automatic Evaluation Leaderboard (hidden test set)

Document Relevance

Rank  Creator  Model Name            Dev MRR  Dev P@1  Dev nDCG@3  Dev nDCG@5  Test MRR  Test P@1  Test nDCG@3  Test nDCG@5
-     ClariQ   Oracle BestQuestion   0.4541   0.3687   0.2578      0.2470      0.4640    0.3829    0.1796       0.1591
1     ClariQ   NoQuestion            0.3000   0.2063   0.1475      0.1530      0.3223    0.2268    0.1134       0.1059
2     ClariQ   BM25                  0.3096   0.2313   0.1608      0.1530      0.3134    0.2193    0.1151       0.1061
-     ClariQ   Oracle WorstQuestion  0.0841   0.0125   0.0252      0.0313      0.0541    0.0000    0.0097       0.0154

Question Relevance

Rank  Creator  Model Name  Dev Recall@5  Dev Recall@10  Dev Recall@20  Dev Recall@30  Test Recall@5  Test Recall@10  Test Recall@20  Test Recall@30
-     ClariQ   BM25        0.3245        0.5638         0.6675         0.6913         0.3170         0.5705          0.7292          0.7682

Organizing team

Previous ConvAI competitions