# Image binary classification


### Call to action
If you find a bug or have an idea for a new feature, don't hesitate to [open a new issue on GitHub](https://github.com/Toloka/toloka-kit/issues/new/choose).
Like our library and examples? Star [our repo on GitHub](https://github.com/Toloka/toloka-kit).

Prepare the environment and import everything we'll need.

In [None]:
%%capture
!pip install toloka-kit==0.1.26
!pip install crowd-kit==1.0.0
!pip install pandas
!pip install ipyplot

import datetime
import os
import sys
import time
import logging
import getpass

import ipyplot
import pandas
import numpy as np

import toloka.client as toloka
import toloka.client.project.template_builder as tb
from crowdkit.aggregation import DawidSkene

In [None]:
logging.basicConfig(
    format='[%(levelname)s] %(name)s: %(message)s',
    level=logging.INFO,
    stream=sys.stdout,
)

Create a toloka-client instance. All API calls will go through it. For more about the OAuth token, see our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)

In [None]:
toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'
logging.info(toloka_client.get_requester())

## Creating new project

In [None]:
project = toloka.Project(
    public_name='Is it a cat or a dog?',
    public_description='Look at the picture and decide whether there is a cat or a dog.',
)

Create the task interface.

In [None]:
image_viewer = tb.ImageViewV1(tb.InputData('image'), ratio=[1, 1], rotatable=True)

radio_group_field = tb.ButtonRadioGroupFieldV1(
    tb.OutputData('result'),
    [
        tb.GroupFieldOption('cat', 'Cat'),
        tb.GroupFieldOption('dog', 'Dog'),
        tb.GroupFieldOption('other', 'Other'),
    ],
    validation=tb.RequiredConditionV1(hint='choose one of the options'),
)

task_width_plugin = tb.TolokaPluginV1(
    kind='scroll',
    task_width=500,
)

hot_keys_plugin = tb.HotkeysPluginV1(
    key_1=tb.SetActionV1(tb.OutputData('result'), 'cat'),
    key_2=tb.SetActionV1(tb.OutputData('result'), 'dog'),
    key_3=tb.SetActionV1(tb.OutputData('result'), 'other'),
)

project_interface = toloka.project.TemplateBuilderViewSpec(
    view=tb.ListViewV1([image_viewer, radio_group_field]),
    plugins=[task_width_plugin, hot_keys_plugin],
)

Set the data specification and attach the task interface to the project.

In [None]:
input_specification = {'image': toloka.project.UrlSpec()}
output_specification = {'result': toloka.project.StringSpec()}

project.task_spec = toloka.project.task_spec.TaskSpec(
    input_spec=input_specification,
    output_spec=output_specification,
    view_spec=project_interface,
)

Write short and simple instructions.

In [None]:
project.public_instructions = """<p>Decide what category the image belongs to.</p>
<p>Select "<b>Cat</b>" if the picture contains one or more cats.</p>
<p>Select "<b>Dog</b>" if the picture contains one or more dogs.</p>
<p>Select "<b>Other</b>" if:</p>
<ul><li>the picture contains both cats and dogs</li>
<li>the picture is a picture of animals other than cats and dogs</li>
<li>it is not clear whether the picture is of a cat or a dog</li>
</ul>"""

Create a project.

In [None]:
project = toloka_client.create_project(project)

You can open the project page in the web interface. It should look something like this:
<table  align="center">
  <tr><td>
    <img src="./img/created_project.png"
         alt="Project interface"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> What the project interface might look like.
  </td></tr>
</table>

## Pool creation
Specify the [pool parameters](https://toloka.ai/en/docs/guide/concepts/pool_poolparams?utm_source=github&utm_medium=site&utm_campaign=tolokakit).

In [None]:
pool = toloka.Pool(
    project_id=project.id,
    # Give the pool any convenient name. You are the only one who will see it.
    private_name='Pool 1',
    may_contain_adult_content=False,
    # Set the price per task page.
    reward_per_assignment=0.01,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    # Overlap. This is the number of users who will complete the same task.
    defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=3),
    # Time allowed for completing a task page
    assignment_max_duration_seconds=600,
)

Select English-speaking performers

In [None]:
pool.filter = toloka.filter.Languages.in_('EN')

Set up [Quality control](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=site&utm_campaign=tolokakit). Add basic controls and a Golden Set (also known as control tasks). Ban performers who give incorrect responses to control tasks.

In [None]:
pool.quality_control.add_action(
    collector=toloka.collectors.Income(),
    conditions=[toloka.conditions.IncomeSumForLast24Hours >= 20],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=1,
        duration_unit='DAYS',
        private_comment='No need more answers from this performer',
    )
)

pool.quality_control.add_action(
    collector=toloka.collectors.SkippedInRowAssignments(),
    conditions=[toloka.conditions.SkippedInRowCount >= 10],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=1,
        duration_unit='DAYS',
        private_comment='Lazy performer',
    )
)

pool.quality_control.add_action(
    collector=toloka.collectors.MajorityVote(answer_threshold=2, history_size=10),
    conditions=[
        toloka.conditions.TotalAnswersCount >= 4,
        toloka.conditions.CorrectAnswersRate < 75,
    ],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=10,
        duration_unit='DAYS',
        private_comment='Too low quality',
    )
)

pool.quality_control.add_action(
    collector=toloka.collectors.GoldenSet(),
    conditions=[
        toloka.conditions.GoldenSetCorrectAnswersRate < 60.0,
        toloka.conditions.GoldenSetAnswersCount >= 3
    ],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=10,
        duration_unit='DAYS',
        private_comment='Golden set'
    )
)

Specify the number of tasks per page. For example: 9 main tasks and 1 control task.

In [None]:
pool.set_mixer_config(
    real_tasks_count=9,
    golden_tasks_count=1
)
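To get a rough feel for what these settings imply, the sketch below estimates how many task pages the pool produces and how much reward they add up to. It is only an approximation: it assumes every page holds exactly `real_tasks_per_page` main tasks, that each page is completed `overlap` times, and it ignores any platform fee. The function name `estimate_pool_cost` and the task count of 90 are hypothetical.

```python
import math


def estimate_pool_cost(n_tasks, real_tasks_per_page=9, overlap=3, reward_per_page=0.01):
    """Rough estimate of pages, assignments and reward budget (fees excluded)."""
    # Main tasks are grouped into pages; round up so leftover tasks get a page too.
    pages = math.ceil(n_tasks / real_tasks_per_page)
    # Each page is completed by `overlap` different performers.
    assignments = pages * overlap
    # The reward is paid per completed page (assignment).
    budget = round(assignments * reward_per_page, 2)
    return pages, assignments, budget


# Hypothetical pool with 90 main tasks and the settings used in this example.
pages, assignments, budget = estimate_pool_cost(n_tasks=90)
print(f'{pages} pages, {assignments} assignments, about {budget:.2f} in rewards')
```

Control tasks do not change the number of pages in this estimate, because one of them is mixed into each page alongside the main tasks.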

Create pool

In [None]:
pool = toloka_client.create_pool(pool)

## Preparing and uploading tasks

This example uses a small data set with images.

The dataset was collected by the Toloka team and is distributed under a Creative Commons Attribution 4.0 International license
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/).

The dataset looks like this:
<table  align="center">
  <tr><td>
    <img src="./img/dataset_preview.png"
         alt="Dataset preview"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 2.</b> Dataset preview.
  </td></tr>
</table>

In [None]:
!curl https://tlk.s3.yandex.net/dataset/cats_vs_dogs/toy_dataset.tsv --output dataset.tsv

dataset = pandas.read_csv('dataset.tsv', sep='\t')

logging.info(f'Dataset contains {len(dataset)} rows\n')

dataset = dataset.sample(frac=1).reset_index(drop=True)

ipyplot.plot_images(
    images=[row['url'] for _, row in dataset.iterrows()],
    labels=[row['label'] for _, row in dataset.iterrows()],
    max_images=12,
    img_width=300,
)

Divide the dataset into two parts: one for the main tasks and one for [Control tasks](https://toloka.ai/en/docs/guide/concepts/task_markup?utm_source=github&utm_medium=site&utm_campaign=tolokakit).

Note. Control tasks are tasks with the correct response known in advance. They are used to track the performer's quality of responses. The performer's response is compared to the response you provided. If they match, it means the performer answered correctly.
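The comparison described in the note can be sketched in plain Python. This is a simplified model, not Toloka's implementation: the helper names `golden_set_stats` and `should_restrict` and the sample data are hypothetical, and the thresholds (at least 3 control answers, under 60% correct) are taken from the golden set rule configured for the pool earlier.

```python
from collections import defaultdict


def golden_set_stats(responses, known_solutions):
    """Return {performer: (answers_count, correct_rate_percent)} on control tasks."""
    counts = defaultdict(lambda: [0, 0])  # performer -> [correct, total]
    for performer, task, label in responses:
        if task not in known_solutions:
            continue  # not a control task, nothing to compare against
        counts[performer][1] += 1
        if label == known_solutions[task]:
            counts[performer][0] += 1
    return {p: (total, 100.0 * correct / total) for p, (correct, total) in counts.items()}


def should_restrict(answers_count, correct_rate, min_answers=3, min_rate=60.0):
    """Both conditions must hold: enough control answers and a low correct rate."""
    return answers_count >= min_answers and correct_rate < min_rate


# Hypothetical data: task -> correct label, and (performer, task, label) responses.
known = {'img1': 'cat', 'img2': 'dog', 'img3': 'other'}
responses = [
    ('w1', 'img1', 'cat'), ('w1', 'img2', 'dog'), ('w1', 'img3', 'other'),
    ('w2', 'img1', 'dog'), ('w2', 'img2', 'dog'), ('w2', 'img3', 'cat'),
]
stats = golden_set_stats(responses, known)
for performer, (n, rate) in stats.items():
    print(performer, n, round(rate, 1), should_restrict(n, rate))
```

In this toy data, `w1` matches every known solution, while `w2` matches only one of three and would be restricted.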

In [None]:
# The first 15 shuffled rows become control tasks; the rest are main tasks.
golden_dataset, task_dataset = dataset.iloc[:15], dataset.iloc[15:]

Create control tasks. In small pools, control tasks should account for 10–20% of all tasks.

Tip. Make sure to include different variations of correct responses in equal amounts.
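One way to follow this tip is to pick the same number of control rows for each label instead of taking the first rows of a shuffled dataset. The sketch below is a minimal illustration using only the standard library. The helper `pick_balanced_controls` and the example rows are hypothetical, and it assumes every row has a unique `url` and a `label`, like the dataset above.

```python
import random
from collections import defaultdict


def pick_balanced_controls(rows, per_label, seed=0):
    """Pick `per_label` rows for each label; the remaining rows become main tasks."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row['label']].append(row)

    controls = []
    for label, group in by_label.items():
        if len(group) < per_label:
            raise ValueError(f'not enough rows labeled {label!r}')
        controls.extend(rng.sample(group, per_label))

    # URLs are assumed to be unique, so they can identify the chosen rows.
    control_urls = {row['url'] for row in controls}
    main = [row for row in rows if row['url'] not in control_urls]
    return controls, main


# Hypothetical rows with the same 'url' and 'label' fields as the dataset.
rows = [
    {'url': f'https://example.com/{label}_{i}.jpg', 'label': label}
    for label in ('cat', 'dog')
    for i in range(10)
]
controls, main = pick_balanced_controls(rows, per_label=2)
print(len(controls), len(main))
```

Here 4 of 20 rows become control tasks, which is 20% and therefore within the recommended share. With the pandas DataFrame from this example, `dataset.to_dict('records')` would produce rows in this shape.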

In [None]:
golden_tasks = [
    toloka.Task(
        pool_id=pool.id,
        input_values={'image': row['url']},
        known_solutions=[
            toloka.task.BaseTask.KnownSolution(
                output_values={'result': row['label']}
            )
        ],
        infinite_overlap=True,
    )
    for _, row in golden_dataset.iterrows()
]

Create pool tasks

In [None]:
tasks = [
    toloka.Task(
        pool_id=pool.id,
        input_values={'image': url},
    )
    for url in task_dataset['url']
]

Upload tasks

In [None]:
created_tasks = toloka_client.create_tasks(golden_tasks + tasks, allow_defaults=True)
logging.info(len(created_tasks.items))

Start the pool.

**Important.** Remember that real Toloka performers will complete the tasks.
Double-check that your project configuration is correct before you start the pool.

In [None]:
pool = toloka_client.open_pool(pool.id)
logging.info(pool.status)

## Receiving responses

Wait until the pool is completed.

In [None]:
pool_id = pool.id

def wait_pool_for_close(pool_id, minutes_to_wait=1):
    sleep_time = 60 * minutes_to_wait
    pool = toloka_client.get_pool(pool_id)
    while not pool.is_closed():
        op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])
        op = toloka_client.wait_operation(op)
        percentage = op.details['value'][0]['result']['value']
        logging.info(
            f'   {datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} - {percentage}%'
        )
        time.sleep(sleep_time)
        pool = toloka_client.get_pool(pool.id)
    logging.info('Pool was closed.')

wait_pool_for_close(pool_id)

Get responses

When all the tasks are completed, look at the responses from performers.

In [None]:
answers_df = toloka_client.get_assignments_df(pool_id)
# Rename the columns to the names used for aggregation: task, label, worker.
answers_df = answers_df.rename(columns={
    'INPUT:image': 'task',
    'OUTPUT:result': 'label',
    'ASSIGNMENT:worker_id': 'worker'
})

logging.info(f'answers count: {len(answers_df)}')

Aggregate the results using the Dawid-Skene model.

In [None]:
# Run aggregation
predicted_answers = DawidSkene(n_iter=20).fit_predict(answers_df)

logging.info(predicted_answers)
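For comparison, a much simpler baseline than Dawid-Skene is plain majority vote: each task gets the label that most performers chose, and every performer counts equally. Dawid-Skene instead estimates how reliable each performer is and weights their answers accordingly. The sketch below is a standalone illustration of the baseline only, not of the Dawid-Skene model. The function `majority_vote` and the sample answers are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return {task: most frequent label} for (task, worker, label) triples."""
    votes = defaultdict(Counter)
    for task, _worker, label in answers:
        votes[task][label] += 1
    # Pick the label with the highest count for each task.
    return {task: counter.most_common(1)[0][0] for task, counter in votes.items()}


# Hypothetical answers with overlap 3, as configured for the pool.
answers = [
    ('img1', 'w1', 'cat'), ('img1', 'w2', 'cat'), ('img1', 'w3', 'dog'),
    ('img2', 'w1', 'dog'), ('img2', 'w2', 'dog'), ('img2', 'w3', 'dog'),
]
result = majority_vote(answers)
print(result)
```

With an overlap of 3 and honest performers, the two approaches often agree. They tend to differ when some performers are consistently less accurate than others.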

Look at the results.

First, prepare the results for display.

In [None]:
predicted_answers = predicted_answers.sample(frac=1)
images = predicted_answers.index.values
labels = predicted_answers.values
start_with = 0

Note: The cell below can be run several times. Each run shows the next 12 images.

In [None]:
if start_with >= len(predicted_answers):
    logging.info('no more images')
else:
    ipyplot.plot_images(
        images=images[start_with:],
        labels=labels[start_with:],
        max_images=12,
        img_width=300,
    )

    start_with += 12

You can now see the labeled images. Some possible results are shown in Figure 3 below.

<table  align="center">
  <tr><td>
    <img src="./img/possible_results.png"
         alt="Possible results"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 3.</b> Possible results.
  </td></tr>
</table>