# Introduction to Toloka and Toloka API

Toloka is a crowdsourcing platform that helps to analyze large volumes of data in a short period of time.

Examples of common tasks:
* Group the wide variety of items in your online store into categories.
* Find or verify information.
* Translate texts.

[Toloka-Kit](https://github.com/Toloka/toloka-kit) is an open-source Python library for working with the Toloka API.

**Useful links:**

- [Toloka Kit documentation](https://toloka.github.io/toloka-kit/)
- [Toloka homepage](https://toloka.ai/)
- [Toloka requester's guide](https://yandex.com/support/toloka-requester/index.html) 
- [Toloka API documentation](https://yandex.com/dev/toloka/doc/concepts/about.html)

The best way to start is to test the Toloka web interface by trying out [one of the tutorials](https://yandex.com/support/toloka-requester/concepts/usecases.html).

## Registration

1. [Register](https://yandex.com/support/toloka-requester/concepts/access.html) in Toloka as a requester.
2. Choose the backend:
  * The [production backend](https://toloka.yandex.com/for-requesters/) is used by default in this example.
  * The [sandbox backend](https://sandbox.toloka.yandex.com/for-requesters/) is a testing environment for Toloka. [Learn more](https://yandex.com/support/toloka-requester/concepts/sandbox.html). 
3. [Add funds](https://yandex.com/support/toloka-requester/concepts/refill.html) to your Toloka account, if you're going to use the production version.
4. [Get an OAuth token](https://yandex.ru/dev/toloka/doc/concepts/access.html#access__token) for your version. Go to **Profile** → **External Services Integration** → **Get Oauth Token**.

<table  align="center">
  <tr><td>
    <img src="./img/OAuth.png"
         alt="OAuth token"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> How to get an OAuth token.
  </td></tr>
</table>

## Getting started with Toloka-Kit
Install Toloka-Kit and import it.

In [None]:
!pip install toloka-kit==0.1.5
!pip install pandas
!pip install ipyplot

import datetime
import time

import pandas
import ipyplot

import toloka.client as toloka
import toloka.client.project.template_builder as tb

Create a Toloka client instance. All API calls will go through it.

In [None]:
toloka_client = toloka.TolokaClient(input("Enter your token:"), 'PRODUCTION')  # Or switch to 'SANDBOX'
# Lines below check that the OAuth token is correct and print your account's name
requester = toloka_client.get_requester()
print(f'Your account: {requester}')
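If you run the notebook often, typing the token into a visible prompt each time can be inconvenient. The sketch below is one possible alternative, not part of Toloka-Kit: it reads the token from an environment variable and falls back to a hidden prompt. The helper name `read_token` and the variable name `TOLOKA_TOKEN` are assumptions made for this illustration.

```python
import os
from getpass import getpass


def read_token(env_var="TOLOKA_TOKEN", prompt=getpass):
    """Return the OAuth token from the environment, or ask for it without echoing.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed characters, unlike input()
        token = prompt("Enter your token: ")
    return token.strip()
```

The result can be passed to the client in place of `input(...)`, for example `toloka.TolokaClient(read_token(), 'PRODUCTION')`.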

Cells below can help you learn more about an object or a method you are interested in.

In [None]:
toloka.TolokaClient?

In [None]:
toloka.TolokaClient.get_requester?

In [None]:
toloka.requester.Requester?

## Toloka entities and how to manage them with Toloka-Kit

### Project
A [project](https://yandex.com/support/toloka-requester/concepts/overview.html#project) is a top-level object. It contains instructions, task interface settings, input and output data specification, and default quality control rules for this project's pools. Projects make it easier for you to post similar tasks in the future, because you don't have to re-configure the interface.

The easier the task, the better the results. If your task contains more than one question, you should divide it into several projects.

In this tutorial you will create a project with tasks that ask performers to specify the type of animal depicted in a photo.

In [None]:
new_project = toloka.project.Project(
    assignments_issuing_type='AUTOMATED',
    public_name='Cat or Dog?',
    public_description='Specify the type of animal depicted in a photo.',
)

The cell above only created an object in your device's memory. The project must also contain:
* [Input and output data specification](https://yandex.com/support/toloka-requester/concepts/incoming.html)
* [Task interface settings](https://yandex.com/support/toloka-requester/concepts/spec.html)
* [An instruction](https://yandex.com/support/toloka-requester/concepts/instruction.html)

**Important:** The next several cells only change objects stored in your device's memory. The data is sent to the server only when you call one of the `toloka_client` methods.

#### Input and output data

Input field `image` will contain URLs of images that need to be labeled.

Output field `result` will receive `cat` or `dog` labels.

In [None]:
input_specification = {'image': toloka.project.field_spec.UrlSpec()}
output_specification = {'result': toloka.project.field_spec.StringSpec()}

#### Task interface

There are two editors available in Toloka:
* [HTML/CSS/JS editor](https://yandex.com/support/toloka-requester/concepts/spec.html#interface-section)
* [Template Builder](https://yandex.com/support/toloka-tb/index.html) 

Template Builder lets you assemble the task interface from ready-made components. We recommend it for your projects, especially at first.

The cell below will create a task interface for our project.

In [None]:
# This component shows images
image_viewer = tb.view.ImageViewV1(url=tb.data.InputData(path='image'), ratio=[1, 1])

# This component lets the performer select a label
radio_group_field = tb.fields.RadioGroupFieldV1(
    data=tb.data.OutputData(path='result'),
    validation=tb.conditions.RequiredConditionV1(),
    options=[
        tb.fields.GroupFieldOption(label='Cat', value='cat'),
        tb.fields.GroupFieldOption(label='Dog', value='dog'),
    ]
)

# Sets a width limit for displaying a task
task_width_plugin = tb.plugins.TolokaPluginV1(
    layout=tb.plugins.TolokaPluginV1.TolokaPluginLayout(
        kind='scroll', 
        task_width=400,
    )
)

# How performers will see the task
project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(items=[image_viewer, radio_group_field]),
        plugins=[task_width_plugin],
    )
)

# This block assigns task interface and input/output data specification to the project
# Note that this is done via the task specification class
new_project.task_spec = toloka.project.task_spec.TaskSpec(
    input_spec=input_specification,
    output_spec=output_specification,
    view_spec=project_interface,
)

#### Task instruction

When selecting a task, the performer is first shown the [instructions](https://yandex.com/support/toloka-requester/concepts/instruction.html) that you wrote. Describe what needs to be done and give examples.

Good instructions help the performer complete the task correctly. The clarity and completeness of the instructions affect the response quality and the project rating. Unclear or overly complex instructions, on the contrary, will scare off performers.

In [None]:
new_project.public_instructions = 'Look at the picture. Determine what is on it: a <b>cat</b> or a <b>dog</b>. Choose the correct option.'

#### Create a project

Use `toloka_client` defined at the beginning.

The data is only sent to the server after calling one of the `toloka_client` methods.

In [None]:
new_project = toloka_client.create_project(new_project)
print(f'Created project with id {new_project.id}')
print(f'To view the project, go to https://toloka.yandex.com/requester/project/{new_project.id}')
# print(f'To view the project, go to https://sandbox.toloka.yandex.com/requester/project/{new_project.id}') # Print a sandbox version link

### Project preview

1. Go to the project page to make sure the task interface works correctly. To do this, click the link in the output of the cell above.

<table  align="center">
  <tr><td>
    <img src="./img/project_look.png"
         alt="Project interface"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 2.</b> What the project interface might look like.
  </td></tr>
</table>

2. In the top right corner of the project page click **Project actions** → **Preview**.

3. In the top left corner of the preview page click **Change input data**, insert an image URL into the `image` field, then click **Apply**.

<table  align="center">
  <tr><td>
    <img src="./img/task_look.png"
         alt="Task interface"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 3.</b> What the task interface might look like and how to insert images in the preview.
  </td></tr>
</table>

4. In the top right corner of the preview page click **Instructions**. Make sure the instructions are shown and that they say what you want them to.

5. Select an option in your task. In the bottom right corner of the preview page click **Submit** and then **View responses**. In the results window that appears, check that your results are written in the expected format and that the entered data is correct.

<table  align="center">
  <tr><td>
    <img src="./img/results_preview.png"
         alt="Result preview"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 4.</b> What the results might look like.
  </td></tr>
</table>

Tips:
* We strongly recommend **checking the task interface and instructions** every time you create a project. This helps ensure that performers can complete the task and that your results will be useful.
* Do a **trial run** with a small amount of data. Make sure that after running the entire pipeline you get the data in the expected format and quality.

### Pool
A [pool](https://yandex.com/support/toloka-requester/concepts/overview.html#pool) is a set of tasks that share common pricing, start date, selection of performers, overlap, and quality control configurations. All tasks in a pool are processed in parallel. One project can have several pools. You can add new tasks to a pool at any time, as well as open or stop it.

The cell below will create a pool as an object in your device's memory. You will send it to Toloka with a `toloka_client` method a bit later.

In [None]:
new_pool = toloka.pool.Pool(
    project_id=new_project.id,
    private_name='Pool 1',  # Only you can see this information
    may_contain_adult_content=False,
    reward_per_assignment=0.01,  # Sets the minimum payment amount for one task suite in USD
    assignment_max_duration_seconds=60*5,  # Gives performers 5 minutes to complete one task suite
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),  # Sets that the pool will close after one year
)

### Performers

Performers in Toloka may not be experts in your type of task. People make mistakes. Crowdsourcing studies ways to get the desired result with the help of a variety of performers. To learn more about Toloka performers, go to [our blog](https://toloka.ai/blog#).

Tip: Our [online course](https://www.coursera.org/learn/practical-crowdsourcing) on practical crowdsourcing for data labeling is now available on Coursera.

### Overlap

To minimize the risk of getting wrong answers you can ask several performers to complete the same task.

In this example overlap is 3. This means that every task will be completed by three different performers.

In [None]:
new_pool.defaults = toloka.pool.Pool.Defaults(
    default_overlap_for_new_tasks=3,
    default_overlap_for_new_task_suites=0,
)

### Task suite

A [task suite](https://yandex.com/support/toloka-requester/concepts/overview.html#tasks-page) is a set of tasks that are shown on one page.

An important part of configuring pools is to decide how many tasks will be issued to a performer at once. E.g. if you set 3 tasks for a task suite, then a performer will see three images at once on one page.

Note that `reward_per_assignment` and `assignment_max_duration_seconds` fields in pool settings set a price and time for one **task suite**, not task.

Why you should combine tasks in a task suite:
* To set a more precise price for a single task.
* To calculate a performer's skill and use it to determine the correct answer more accurately. Learn more below, in [Aggregation](#aggregation).
* To better apply quality control settings that improve the final quality of the response. Learn more below, in [Quality control rules](#quality_control_rules).

In [None]:
new_pool.set_mixer_config(
    real_tasks_count=10,  # The number of tasks per page.
    golden_tasks_count=0,  # The number of test tasks per page. Not used in this tutorial.
    training_tasks_count=0,  # The number of training tasks per page. Not used in this tutorial.
)
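Because the reward is set per task suite while overlap multiplies the number of completions, it can help to estimate the total reward before opening a pool. The sketch below is a rough back-of-the-envelope calculation, not a Toloka-Kit function: `estimate_pool_cost` is a hypothetical helper, the dataset size of 100 tasks is made up, and the result excludes any platform fee. It also assumes that a partly filled last page still counts as one task suite.

```python
import math


def estimate_pool_cost(n_tasks, tasks_per_suite, overlap, reward_per_assignment):
    """Estimate performer rewards for a pool, excluding any platform fee."""
    # Tasks are grouped into pages; the last page may be only partly filled
    n_suites = math.ceil(n_tasks / tasks_per_suite)
    # Each task suite is completed `overlap` times; each completion is one assignment
    n_assignments = n_suites * overlap
    return n_suites, n_assignments, round(n_assignments * reward_per_assignment, 2)


# 100 tasks is a made-up dataset size; the other values mirror this tutorial's pool
suites, assignments, cost = estimate_pool_cost(100, 10, 3, 0.01)
print(f'{suites} task suites, {assignments} assignments, ${cost:.2f} in rewards')
```

With these numbers the estimate is 10 task suites, 30 assignments, and $0.30 in rewards.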

### Filters

[Filters](https://yandex.com/support/toloka-requester/concepts/filters.html) help you select performers for your project.

There may be different reasons to use filters, e.g.:
* You require performers with certain traits for your pool.
* You want to exclude a certain group of performers.

Tasks will only be shown to matching performers, rather than to all of them.

This example requires English-speaking performers, because the project's instruction is in English.

In [None]:
new_pool.filter = toloka.filter.Languages.in_('EN')

### Quality control rules<a id='quality_control_rules'></a>

[Quality control rules](https://yandex.com/support/toloka-requester/concepts/check-performers.html) regulate task completion and performer access.

Quality control lets you get more accurate responses and restrict access to tasks for cheating performers. All rules work independently. Learn more about how to [set up quality control](https://yandex.com/support/toloka-requester/concepts/qa-pool-settings.html).

This example uses the Captcha rule. It is the simplest way to exclude fake users (robots) and cheaters.

In [None]:
# Turns on captchas
new_pool.set_captcha_frequency('MEDIUM')
# Bans performers by captcha criteria
new_pool.quality_control.add_action(
    # Type of quality control rule
    collector=toloka.collectors.Captcha(history_size=5),
    # This condition triggers the action below
    # Here overridden comparison operator actually returns a Condition object
    conditions=[toloka.conditions.FailRate > 20],
    # What exactly should the rule do when the condition is met
    # It bans the performer for 1 day
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=1,
        duration_unit='DAYS',
        private_comment='Captcha',
    )
)
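To make the rule's logic easier to follow, here is a toy model in plain Python of a "ban when the fail rate over recent captchas exceeds a threshold" check. It is only an illustration and not how Toloka implements the rule. The class name `CaptchaFailRateRule` is hypothetical, and evaluating the rule only once the window is full is a simplifying assumption of this sketch.

```python
from collections import deque


class CaptchaFailRateRule:
    """Toy model of a 'ban when the fail rate exceeds a threshold' rule."""

    def __init__(self, history_size=5, max_fail_rate=20):
        # deque with maxlen keeps only the latest `history_size` results
        self.history = deque(maxlen=history_size)
        self.max_fail_rate = max_fail_rate

    def fail_rate(self):
        """Percentage of failed captchas in the current window."""
        if not self.history:
            return 0.0
        failed = sum(1 for passed in self.history if not passed)
        return 100 * failed / len(self.history)

    def record(self, passed):
        """Store one captcha result and report whether the ban would trigger."""
        self.history.append(passed)
        # Assumption of this sketch: evaluate only when the window is full
        window_full = len(self.history) == self.history.maxlen
        return window_full and self.fail_rate() > self.max_fail_rate
```

In this model, one failure out of five gives a fail rate of exactly 20%, which does not satisfy `> 20`. A second failure within the same window raises the rate to 40% and would trigger the restriction.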

### Create a pool

The cell below creates a pool from all the information above that was stored in your device's memory. 

In [None]:
new_pool = toloka_client.create_pool(new_pool)
print(f'To view this pool, go to https://toloka.yandex.com/requester/project/{new_project.id}/pool/{new_pool.id}')
# print(f'To view this pool, go to https://sandbox.toloka.yandex.com/requester/project/{new_project.id}/pool/{new_pool.id}') # Print a sandbox version link

Open your project's page. You will see your new pool.

<table  align="center">
  <tr><td>
    <img src="./img/project_with_pool.png"
         alt="Project interface with a pool"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 5.</b> Project interface with a pool.
  </td></tr>
</table>

The pool interface looks like this.

<table  align="center">
  <tr><td>
    <img src="./img/pool_preview.png"
         alt="Pool interface"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 6.</b> Pool interface.
  </td></tr>
</table>

Right now the pool is empty and closed. It has no tasks or task suites.

## Setting up a simple project

### Task

A [task](https://yandex.com/support/toloka-requester/concepts/overview.html#task) is the data you need to mark up.

This example uses a small data set with images.

The dataset was collected by the Toloka team and is distributed under a Creative Commons Attribution 4.0 International license.
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

In [None]:
# Download the data set
!curl https://tlk.s3.yandex.net/dataset/cats_vs_dogs/toy_dataset.tsv --output dataset.tsv

dataset = pandas.read_csv('dataset.tsv', sep='\t')

print(f'\nDataset contains {len(dataset)} rows\n')

dataset = dataset.sample(frac=1).reset_index(drop=True)

ipyplot.plot_images(
    images=[row['url'] for _, row in dataset.iterrows()],
    labels=[row['label'] for _, row in dataset.iterrows()],
    max_images=12,
    img_width=300,
)

Create tasks. One task will be created from one image.

Toloka will automatically create task suites and show the tasks according to the pool settings:

1. One task suite will consist of 10 tasks.
2. Each task will be completed by 3 different performers.

These settings were configured when the pool was created.

In [None]:
tasks = [
    toloka.task.Task(input_values={'image': url}, pool_id=new_pool.id)
    for url in dataset['url']
]
# Add tasks to a pool
toloka_client.create_tasks(tasks, toloka.task.CreateTasksParameters(allow_defaults=True))
print(f'Populated pool with {len(tasks)} tasks')
print(f'To view this pool, go to https://toloka.yandex.com/requester/project/{new_project.id}/pool/{new_pool.id}')
# print(f'To view this pool, go to https://sandbox.toloka.yandex.com/requester/project/{new_project.id}/pool/{new_pool.id}') # Print a sandbox version link

# Opens the pool
new_pool = toloka_client.open_pool(new_pool.id)

When you open a pool, performers see your tasks and start working on them.

In small pools like this one it usually takes up to 10 minutes for all the tasks to be completed.

With big pools it's better to set up automatic waiting. See the example in the cell below.


In [None]:
pool_id = new_pool.id

def wait_pool_for_close(pool_id, minutes_to_wait=1):
    sleep_time = 60 * minutes_to_wait
    pool = toloka_client.get_pool(pool_id)
    while not pool.is_closed():
        op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])
        op = toloka_client.wait_operation(op)
        percentage = op.details['value'][0]['result']['value']
        print(
            f'   {datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} - {percentage}%'
        )
        time.sleep(sleep_time)
        pool = toloka_client.get_pool(pool.id)
    print('Pool was closed.')

wait_pool_for_close(pool_id)
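The loop above keeps polling until the pool closes, however long that takes. If you prefer to give up after a fixed time, a generic polling helper with a timeout is one option. The sketch below uses only the standard library. The name `wait_until` and its parameters are hypothetical, and the `sleep` and `clock` arguments exist only so the function can be tested without real waiting.

```python
import time


def wait_until(is_done, poll_seconds=60, timeout_seconds=3600,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `is_done()` until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while not is_done():
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    return True
```

For this tutorial the condition could be `lambda: toloka_client.get_pool(pool_id).is_closed()`, which reuses the same calls as the cell above.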

### Get responses

When all the tasks are completed, look at the responses from performers.

In [None]:
answers = []

for assignment in toloka_client.get_assignments(pool_id=pool_id, status='ACCEPTED'):
    for task, solution in zip(assignment.tasks, assignment.solutions):
        answers.append([task.input_values['image'], solution.output_values['result'], assignment.user_id])

print(f'answers count: {len(answers)}')

An `assignment` value is one performer's responses to all the tasks in a task suite.

If a performer completed several task suites, then `toloka_client.get_assignments` will return several `assignment` values for that performer.
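Since one performer can appear in several assignments, it is sometimes useful to regroup the flat list of responses by performer. The sketch below works on made-up rows that follow the same `[image, label, performer]` layout as `answers`. The helper `labels_by_performer`, the URLs, and the performer IDs are hypothetical.

```python
from collections import defaultdict

# Made-up rows in the same [image, label, performer] layout as `answers`
sample_answers = [
    ['https://example.com/1.jpg', 'cat', 'performer_a'],
    ['https://example.com/2.jpg', 'dog', 'performer_a'],
    ['https://example.com/1.jpg', 'cat', 'performer_b'],
]


def labels_by_performer(rows):
    """Map each performer ID to the list of (image, label) pairs they submitted."""
    grouped = defaultdict(list)
    for image, label, performer in rows:
        grouped[performer].append((image, label))
    return dict(grouped)


per_performer = labels_by_performer(sample_answers)
print({performer: len(items) for performer, items in per_performer.items()})
```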

### Aggregation <a id='aggregation'></a>

[Aggregation of results](https://yandex.com/support/toloka-requester/concepts/result-aggregation.html) should be run if tasks were issued with an overlap of 2 or higher.

[Majority vote](https://yandex.com/support/toloka-requester/concepts/mvote.html) is a quality control method based on matching responses from the majority of performers who complete the same task. E.g. if 2 out of 3 performers selected the `cat` label, then the final label for this task will be `cat`.

Majority vote is easily implemented, but you can also use our [crowdsourcing library](https://github.com/Toloka/crowd-kit). It contains a lot of new aggregation methods.
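To show how little code a basic majority vote needs, here is a minimal standard-library sketch. It is not the crowd-kit implementation. The function name `majority_vote` and the sample rows are made up for illustration, and ties are resolved in favor of the label seen first, which is an arbitrary choice.

```python
from collections import Counter, defaultdict


def majority_vote(rows):
    """Return {task: most common label} for rows of (task, label, performer)."""
    votes = defaultdict(Counter)
    for task, label, _performer in rows:
        votes[task][label] += 1
    # most_common(1) returns the label with the highest count;
    # on a tie, the label that was counted first wins (arbitrary choice)
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


# Made-up responses: three performers per task, matching an overlap of 3
sample = [
    ('img_1', 'cat', 'p1'), ('img_1', 'cat', 'p2'), ('img_1', 'dog', 'p3'),
    ('img_2', 'dog', 'p1'), ('img_2', 'dog', 'p2'), ('img_2', 'dog', 'p3'),
]
print(majority_vote(sample))
```

For `img_1`, two of three performers chose `cat`, so `cat` is the aggregated label. All three chose `dog` for `img_2`.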

In [None]:
!pip install crowd-kit==0.0.3
from crowdkit.aggregation import MajorityVote

In [None]:
MajorityVote?

In [None]:
# Prepare dataframe
answers_df = pandas.DataFrame(answers, columns=['task', 'label', 'performer'])
# Run majority vote aggregation
predicted_answers = MajorityVote().fit_predict(answers_df)

print(predicted_answers)

# Some preparations for displaying the results
predicted_answers = predicted_answers.sample(frac=1)
images = predicted_answers.index.values
labels = predicted_answers.values
start_with = 0

Look at the results.

Note: The cell below can be run several times.

In [None]:
if start_with >= len(predicted_answers):
    print('no more images')
else:
    ipyplot.plot_images(
        images=images[start_with:],
        labels=labels[start_with:],
        max_images=12,
        img_width=300,
    )

    start_with += 12

You can now see the labeled images. Some possible results are shown in Figure 7 below.

<table  align="center">
  <tr><td>
    <img src="./img/possible_results.png"
         alt="Possible results"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 7.</b> Possible results.
  </td></tr>
</table>

## Summary

This example explained basic Toloka entities and how Toloka-Kit can work with them.

The described project (classification) is very useful for:
* Accurate evaluation.
* Checking the results of a complex project, as in the [image segmentation example](https://github.com/Toloka/toloka-kit/blob/main/examples/image_segmentation/image_segmentation.ipynb).