# Handwriting image dataset collection

The goal for this project is to collect images of handwritten text for a dataset that could be used to train and evaluate HTR models.

Furthermore, you can later enhance this dataset with extra bounding boxes for separate lines or words, using [this notebook](https://github.com/Toloka/toloka-kit/blob/main/examples/image_segmentation/image_segmentation.ipynb) as an example.

## Pipeline decomposition

Because we are gathering images, we also need to validate the collected pictures to make sure each photo is correct. The pipeline is shown in Figure 1.

<table  align="center">
  <tr><td>
    <img src="./img/pipeline.png"
         alt="Pipeline"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> Pipeline scheme
  </td></tr>
</table>

### Text corpus gathering
This tutorial includes code for extracting and filtering texts from Wikipedia articles in an arbitrary language. This process, however, takes a lot of time and disk space, so sample sentences are provided for French.

### Project #1 - image gathering

Performers will be asked to write the suggested phrase on a piece of paper and take a photo of it.

In this tutorial the project structure is simplified and training for it is not implemented. In a real production pipeline, however, we strongly advise adding it to reduce the cost of validating obviously improper images.


### Project #2 - image validation

Performers will be asked to rate the quality of an image with (supposedly) handwritten text.

In this tutorial we include training samples for French.

In [None]:
# install all the necessary packages 
!pip install -r requirements.txt

In [None]:
import os
import urllib.request
import re
import time

import requests
import pandas as pd
import numpy as np
import datetime
import shutil
import ipyplot

import toloka.client as toloka
import toloka.client.project.template_builder as tb

import yadisk
import posixpath

from collections import defaultdict

## Text corpus gathering

In general, any corpus of texts in the chosen language can be used. Wikipedia is chosen here because it is one of the most accessible multilingual corpora.

You can generate new texts with the provided code or use the included ones. If you choose the latter option, skip to the **Use provided sentences** section.
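
The filtering done in `corpus.py` is not shown here, so the following is only a minimal sketch of the kind of filtering that makes a sentence suitable for handwriting. The word-count limits, the naive splitter and the function names are assumptions for illustration, not the logic of `WikiDump`.

```python
import re

# Hypothetical thresholds; tune them for your language and use case.
MIN_WORDS, MAX_WORDS = 3, 12


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def is_suitable(sentence):
    """Keep short sentences without digits, markup or bracketed asides."""
    words = sentence.split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return False
    if re.search(r'[\d\[\]{}()<>|=]', sentence):
        return False
    return sentence[0].isupper()


sample = (
    "Paris est la capitale de la France. Voir [[Lyon]] ici. "
    "Elle compte 2 millions d'habitants. Oui."
)
filtered = [s for s in split_sentences(sample) if is_suitable(s)]
print(filtered)
```

Only the first sentence of the sample survives: the others contain wiki markup, a digit, or too few words.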

### New text generation

In [None]:
# The language chosen in this tutorial is French; replace it
# with the language of your choice to gather new texts
lang_prefix = 'fr'
os.makedirs('data', exist_ok=True)

# If you are not sure which abbreviation to use, look up the right one at
# https://dumps.wikimedia.org/backup-index-bydb.html
url = f'https://dumps.wikimedia.org/{lang_prefix}wiki/20210601/{lang_prefix}wiki-20210601-pages-articles-multistream.xml.bz2'
filepath = f'data/{lang_prefix}.xml.bz2'

In [None]:
# this could take a while
urllib.request.urlretrieve(url, filepath);

In [None]:
# now unzip downloaded archive:

# sudo apt-get install bzip2
# bzip2 -d data/{lang_prefix}.xml.bz2

In [None]:
# create the corpus with the code provided in corpus.py
# you can check out the code and tweak some things if you want
from corpus import WikiDump
wiki = WikiDump(filepath=filepath[:-4], language=lang_prefix)

In [None]:
# generate sentences
sentences = wiki.get_sentences(max_size=1000)
# sample without replacement, so that no sentence is chosen twice
chosen_sentences = np.random.choice(sentences, size=100, replace=False)
print(*chosen_sentences[:10], sep='\n')

### Use provided sentences

In [None]:
# only French data provided
chosen_sentences = pd.read_csv('data/{}-sentences.tsv'.format(lang_prefix), sep='\t', encoding='utf-8')['INPUT:text'].values

## Yandex.Toloka and Yandex.Disk API

To automate the pipeline, we are going to use the Toloka and Yandex.Disk APIs. The Disk is used to store the gathered photos, so that performers in the validation project can later see them via Toloka's proxy.

In [None]:
# Create a toloka-client instance
# All API calls will go through it
try:
    production = input('Type "p" for PRODUCTION Toloka environment or "s" for SANDBOX:').strip().lower().startswith('p')
    token = input("Enter your Yandex.Toloka API OAUTH token:")
    toloka_client = toloka.TolokaClient(token, 'PRODUCTION' if production else 'SANDBOX')
    # this request fails if the token is invalid
    requester = toloka_client.get_requester()
except Exception:
    print('You probably entered an invalid token. Please, run this cell again.')

### Yandex.Disk OAUTH token

Get an OAuth token for Yandex.Disk; instructions can be found [here](https://yandex.ru/dev/oauth/).

For example, you can create an application with Yandex.Disk API access, and then visit `https://oauth.yandex.ru/authorize?response_type=token&client_id=<APPLICATION ID>` to get a token.
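
As a small convenience, the authorization link above can be assembled in code. This is only a sketch: the application ID below is a made-up placeholder, and `build_authorize_url` is a helper name introduced here for illustration.

```python
from urllib.parse import urlencode


def build_authorize_url(client_id):
    """Build the token authorization link for a given application ID."""
    query = urlencode({'response_type': 'token', 'client_id': client_id})
    return f'https://oauth.yandex.ru/authorize?{query}'


# '0123456789abcdef' is a placeholder, not a real application ID
print(build_authorize_url('0123456789abcdef'))
```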

In [None]:
try:
    ydisk_token = input("Enter your Yandex.Disk API OAUTH token:")
    ydisk = yadisk.YaDisk(token=ydisk_token)
    print('Provided token is valid - ', ydisk.check_token())
except Exception:
    print('You probably entered an invalid token. Please, run this cell again.')

### Proxy creation

Create proxies in Toloka: 

Go to `Profile->External Services Integration->Yandex.Disk Integration`, and after that, `Add proxy`. 

For pictures in the instructions you will need a **public** proxy. You can use the same proxy instead of a private one, or create a new private proxy.
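
For orientation, here is a sketch of how a file inside a proxied folder maps to a URL, following the pattern used by `upload_image` later in this notebook. The helper name `proxy_url` and the percent-encoding of the path are assumptions added for illustration; the notebook's own code builds the URL without encoding.

```python
import posixpath
from urllib.parse import quote


def proxy_url(proxy_name, relative_path, production=True):
    """Return the proxy URL for a file stored under the proxied Disk folder."""
    host = 'toloka.yandex.ru' if production else 'sandbox.toloka.yandex.ru'
    path = posixpath.join(proxy_name, relative_path.lstrip('/'))
    # quote() keeps '/' intact and escapes spaces and other unsafe characters
    return f'https://{host}/api/proxy/{quote(path)}'


# 'my-proxy' and the file name are hypothetical
print(proxy_url('my-proxy', 'fr/photo 1.jpg', production=False))
```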

In [None]:
# replace the placeholder strings with your own names
public_proxy_name = '<YOUR PUBLIC PROXY NAME>'
public_folder_name = '<YOUR PUBLIC FOLDER NAME>'

private_proxy_name = '<YOUR PRIVATE PROXY NAME>'  # you can use the public proxy here
private_folder_name = '<YOUR PRIVATE FOLDER NAME>'

tlk = 'Toloka' if production else 'Toloka.Sandbox'

info = ydisk.get_disk_info()
app_folder = info.system_folders.applications

toloka_public_folder = posixpath.join(app_folder, tlk, public_folder_name)
toloka_private_folder =  posixpath.join(app_folder, tlk, private_folder_name)

### Upload photos for instructions

As we are going to use images as examples in the instruction, you should upload them to the **public** folder. Sample images for instruction are provided.

In [None]:
def recursive_upload(from_dir, to_dir):
    for root, dirs, files in os.walk(from_dir):
        p = root.split(from_dir)[1].strip(os.path.sep)
        dir_path = posixpath.join(to_dir, p)

        try:
            ydisk.mkdir(dir_path)
        except yadisk.exceptions.PathExistsError:
            pass

        for file in files:
            file_path = posixpath.join(dir_path, file)
            p_sys = p.replace("/", os.path.sep)
            in_path = os.path.join(from_dir, p_sys, file)
            try:
                ydisk.upload(in_path, file_path)
            except yadisk.exceptions.PathExistsError:
                pass

In [None]:
recursive_upload('instructions/images/', toloka_public_folder)
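
Before uploading, it can help to check which remote paths a recursive upload would produce. The following dry run is a sketch that mirrors the local-to-remote path mapping without calling the Disk API; `plan_upload` and the `disk:/Applications/Toloka/public` folder are hypothetical names used only for this example.

```python
import os
import posixpath
import tempfile


def plan_upload(from_dir, to_dir):
    """List (local_path, remote_path) pairs without uploading anything."""
    pairs = []
    for root, _dirs, files in os.walk(from_dir):
        rel = os.path.relpath(root, from_dir)
        # remote paths always use '/', whatever the local separator is
        if rel == os.curdir:
            remote_dir = to_dir
        else:
            remote_dir = posixpath.join(to_dir, *rel.split(os.path.sep))
        for name in sorted(files):
            pairs.append((os.path.join(root, name), posixpath.join(remote_dir, name)))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    # build a tiny directory tree to run the plan on
    os.makedirs(os.path.join(tmp, 'good'))
    for rel in ('intro.jpg', os.path.join('good', 'example.jpg')):
        open(os.path.join(tmp, rel), 'w').close()
    remote_paths = sorted(
        remote for _local, remote in plan_upload(tmp, 'disk:/Applications/Toloka/public')
    )
    print(remote_paths)
```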

## Project #1


#### Interface
In the interface performers will see the suggested phrase and buttons for uploading an image.

As we chose the texts automatically, with no external human verification, they could be faulty or incomplete. We are going to add a checkbox to report erroneous texts.

<table  align="center">
  <tr><td>
    <img src="./img/project-1-interface.jpg"
         alt="Interface-1"  width="400">
  </td></tr>
  <tr><td align="center">
    <b>Figure 2.</b> User's task interface in first project
  </td></tr>
</table>

In [None]:
output_specification = {
    'image': toloka.project.field_spec.FileSpec(),
    'offensive': toloka.project.field_spec.BooleanSpec(required=False)
}

checkbox = tb.fields.CheckboxFieldV1(
    data=tb.data.OutputData(path='offensive'),
    label='Phrase doesn\'t make sense or is offensive',
#     hint=<you can add localized label text here>
)

image_loader = tb.fields.MediaFileFieldV1(
    data=tb.data.OutputData(path='image'),
    validation=tb.conditions.RequiredConditionV1(),
    # you can disallow some of the media sources here
    accept=tb.fields.MediaFileFieldV1.Accept(photo=True, gallery=True, file_system=True),
    multiple=False,
)

task_width_plugin = tb.plugins.TolokaPluginV1(
    layout=tb.plugins.TolokaPluginV1.TolokaPluginLayout(
        kind='scroll', 
        task_width=500,
    )
)

interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(
            items=[
                 tb.view.TextViewV1(
                    label='Phrase:', 
                    content=tb.data.InputData(path='text')
                ),
                image_loader,
                checkbox
            ]
        ),
        plugins=[task_width_plugin],
    )
)

#### Instructions

Instructions in English and French are included.

In [None]:
# we need to replace stubs with real proxy name
with open(f'instructions/{lang_prefix}.html', encoding='utf-8') as fin:
    public_instruction = fin.read().replace('PUBLIC-PROXY-NAME', public_proxy_name)
    
# fix endpoints, if using sandbox    
if not production:
    public_instruction = public_instruction.replace('toloka.yandex.ru', 'sandbox.toloka.yandex.ru')

# name and description of the project
with open(f'instructions/{lang_prefix}-desc.txt', encoding='utf-8') as fin:
    public_name = fin.readline().strip()
    public_description = fin.readline().strip()

#### Creation via API

In [None]:
# Create a project

img_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name=public_name,
    public_description=public_description,
    public_instructions=public_instruction,
    
    # Set up the task interface and output parameters
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={'text': toloka.project.field_spec.StringSpec(required=True)},
        output_spec=output_specification,
        view_spec=interface,
    ),
)

In [None]:
img_project = toloka_client.create_project(img_project)
print(f'Created project with id {img_project.id}')
sandbox_url = '' if production else 'sandbox.'
print(f'To view the project, go to https://{sandbox_url}toloka.yandex.com/requester/project/{img_project.id}')

#### Skill tracking

We can track the image acceptance rate of performers and ban those who perform poorly.

In [None]:
skill_name = 'Handwriting images'
img_skill = next(toloka_client.get_skills(name=skill_name), None)
if img_skill:
    print('Skill already exists')
else:
    print('Creating new skill')
    img_skill = toloka_client.create_skill(
        name=skill_name,
        hidden=True,
        public_requester_description={'EN': 'Acceptance rate of the handwriting images'},
    )

#### Pool creation

Performers are offered 0.01\$ for writing one phrase, with a time limit of 15 minutes. You can tweak the number of tasks per page and the reward per page if you'd like.

For quality control, we use the `Fast Answers` rule with a 60-second threshold and `Acceptance Rate` tracking with a threshold of at least 50\%.
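
To make the two rules concrete, here is a plain-Python sketch of the logic they describe. It does not use the Toloka API; the function names are illustrative, and the exact boundary handling (for example, whether a submit of exactly 60 seconds counts as fast) is an assumption.

```python
FAST_SUBMIT_SECONDS = 60   # threshold of the fast answers rule
MIN_ACCEPTANCE_RATE = 50   # minimal acceptance rate, in percent


def is_fast_submit(duration_seconds):
    """A page submitted faster than the threshold is considered suspicious."""
    return duration_seconds < FAST_SUBMIT_SECONDS


def acceptance_rate(accepted, rejected):
    """Percentage of accepted assignments, or None if nothing was reviewed yet."""
    reviewed = accepted + rejected
    if reviewed == 0:
        return None
    return 100.0 * accepted / reviewed


def may_take_tasks(rate):
    """Newcomers (no rate yet) and performers at or above the threshold are admitted."""
    return rate is None or rate >= MIN_ACCEPTANCE_RATE


print(is_fast_submit(20), acceptance_rate(3, 1), may_take_tasks(acceptance_rate(1, 3)))
```

The last check mirrors the pool filter, which admits performers whose skill is either unset or at least 50.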

In [None]:
img_pool = toloka.pool.Pool(
    project_id=img_project.id,
    may_contain_adult_content=True,
    private_name=f'Handwriting image gathering for {lang_prefix} language',
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.01,
    auto_accept_solutions=False,
    auto_accept_period_day=7,
    assignment_max_duration_seconds=60*15,
    filter=toloka.filter.FilterAnd(
        [
            toloka.filter.Languages.in_(lang_prefix.upper()),
            toloka.filter.FilterOr(
                [
                    # you can modify skill threshold here
                    toloka.filter.Skill(img_skill.id) == None,
                    toloka.filter.Skill(img_skill.id) >= 50,
                ]
            ),
            toloka.filter.FilterOr(
                [
                    # you can choose only mobile app users here
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.BROWSER),
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.TOLOKA_APP)
                ]
            )
        ]
    ),
    defaults=toloka.pool.Pool.Defaults(
        # we only want one successful picture for each text
        default_overlap_for_new_task_suites=1
    ),
)

# you can change number of tasks per-page here
img_pool.set_mixer_config(
    real_tasks_count=1,
    golden_tasks_count=0,
    training_tasks_count=0
)

img_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(
        fast_submit_threshold_seconds=60
    ),
    conditions=[toloka.conditions.FastSubmittedCount > 0],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses in handwriting images project'
    )
)

img_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentsAssessment(),
    conditions=[toloka.conditions.AssessmentEvent == toloka.conditions.AssessmentEvent.REJECT],
    action=toloka.actions.ChangeOverlap(delta=1, open_pool=True)
)

img_pool.quality_control.add_action(
    collector=toloka.collectors.AcceptanceRate(),
    conditions=[toloka.conditions.AcceptedAssignmentsRate > 50],
    action=toloka.actions.SetSkill(skill_id=img_skill.id, skill_value=toloka.collectors.AcceptanceRate()),
)

In [None]:
img_pool = toloka_client.create_pool(img_pool)
print(f'Created pool with id {img_pool.id}')
print(f'To view the pool, go to https://{sandbox_url}toloka.yandex.com/requester/project/{img_project.id}/pool/{img_pool.id}')

#### Task creation

Create tasks from gathered sentences:

In [None]:
tasks = [toloka.task.Task(
    input_values={'text': sentence}, 
    pool_id=img_pool.id,
    overlap=1
) for sentence in chosen_sentences]

tasks_op = toloka_client.create_tasks_async(tasks=tasks)
result = toloka_client.wait_operation(tasks_op)

print(
    f'Total tasks: {result.details["total_count"]}',
    f'Total failed: {result.details["failed_count"]}',
    f'Total success: {result.details["success_count"]}',
    f'Total valid: {result.details["valid_count"]}',
    f'Total not valid: {result.details["not_valid_count"]}',
    sep='\n'
)

## Project #2

#### Interface
We need to evaluate the quality of the gathered images. We rate the quality of the photo and of the text on it separately. If the image was submitted in bad faith and contains no text, there is nothing to rate for the text.

Performers will see a rotatable photo that can be opened in fullscreen and two questions: one about the quality of the photo and one about the text. If a performer answers that there is no handwritten text in the picture, the second question is hidden automatically.
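
The dependency between the two questions can be expressed as a small validation rule. The sketch below is plain Python, not Template Builder code; `validate_answer` is a hypothetical helper, and the field names and values follow the output specification of this project.

```python
ALLOWED = {'ok', 'bad', 'error'}


def validate_answer(answer):
    """Return a list of problems with a performer's answer (empty if valid)."""
    errors = []
    image_result = answer.get('image-result')
    if image_result not in ALLOWED:
        errors.append('image-result is required')
    # the text question is only shown when the photo is not marked as 'error'
    text_needed = image_result != 'error'
    if text_needed and answer.get('text-result') not in ALLOWED:
        errors.append('text-result is required')
    return errors


print(validate_answer({'image-result': 'error'}))  # no text rating needed
print(validate_answer({'image-result': 'ok'}))     # text rating is missing
```

The same rule applies to training tasks: a known solution with `image-result` set to `error` does not need a `text-result` value.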

You can add hotkeys to the buttons, if you'd like.

<table  align="center">
  <tr><td>
    <img src="./img/project-2-interface.jpg"
         alt="Interface-2"  width="600">
  </td></tr>
  <tr><td align="center">
    <b>Figure 3.</b> User's task interface in second project
  </td></tr>
</table>

In [None]:
# we need assignment id in order to rate answers in first project 
input_specification = {
    'text': toloka.project.field_spec.StringSpec(required=True),
    'image': toloka.project.field_spec.UrlSpec(required=True),
    'assignment_id': toloka.project.field_spec.StringSpec(required=False, hidden=True)
}

output_specification = {
    'image-result': toloka.project.field_spec.StringSpec(required=True, allowed_values=['ok', 'bad', 'error']),
    'text-result': toloka.project.field_spec.StringSpec(allowed_values=['ok', 'bad', 'error'])
}

image_viewer = tb.view.ImageViewV1(url=tb.InputData(path='image'), rotatable=True)

image_rating = tb.fields.ButtonRadioGroupFieldV1(
    data=tb.data.OutputData(path='image-result'),
    label='Rate the quality of the photo',
    validation=tb.conditions.RequiredConditionV1(),
    options=[
        tb.fields.GroupFieldOption(label='Good', value='ok'),
        tb.fields.GroupFieldOption(label='Bad', value='bad'),
        tb.fields.GroupFieldOption(label='Not a photo of handwritten text', value='error'),
    ]
)

text_rating = tb.fields.ButtonRadioGroupFieldV1(
    data=tb.data.OutputData(path='text-result'),
    validation=tb.conditions.RequiredConditionV1(hint="Choose one of the options"),
    label='Choose whether the text is written correctly',
    options=[
        tb.fields.GroupFieldOption(label='No mistakes', value='ok'),
        tb.fields.GroupFieldOption(label='There are some mistakes', value='bad'),
        tb.fields.GroupFieldOption(label='Different text', value='error'),
    ]
)


helper = tb.helpers.IfHelperV1(
    condition=tb.conditions.NotConditionV1(
        condition=tb.conditions.EqualsConditionV1(to='error', data=tb.data.OutputData(path='image-result'))),
    then=text_rating
)

task_width_plugin = tb.plugins.TolokaPluginV1(
    layout=tb.plugins.TolokaPluginV1.TolokaPluginLayout(
        kind='scroll', 
        task_width=700,
    )
)

interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(
            items=[
                 tb.view.TextViewV1(
                    label='Phrase:', 
                    content=tb.data.InputData(path='text')
                ),
                image_viewer,
                image_rating,
                helper
            ]
        ),
        plugins=[task_width_plugin],
    )
)

#### Instructions

Instructions for this project are provided only in English, so we should take this into account when filtering performers.

In [None]:
with open('instructions/en-check.html', encoding='utf-8') as fin:
    # replace stub with real proxy name
    public_instruction = fin.read().replace('PUBLIC-PROXY-NAME', public_proxy_name)
    
if not production:
    # replace proxy endpoint in case of sandbox
    public_instruction = public_instruction.replace('toloka.yandex.ru', 'sandbox.toloka.yandex.ru')
    
public_name = 'Check if the phrases are spelled correctly'
public_description = 'Look at the photo and answer if the text is written on it correctly.'

In [None]:
validation_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name=public_name,
    public_description=public_description,
    public_instructions=public_instruction,
    
    # Set up the task interface and output parameters
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec=input_specification,
        output_spec=output_specification,
        view_spec=interface,
    ),
)

In [None]:
validation_project = toloka_client.create_project(validation_project)
print(f'Created project with id {validation_project.id}')
print(f'To view the project, go to https://{sandbox_url}toloka.yandex.com/requester/project/{validation_project.id}')

### Training

We need qualified performers for the main pool, so we should set up training. In this tutorial, samples for French are provided.

In [None]:
# Setting up training
validation_training = toloka.training.Training(
    project_id=validation_project.id,
    private_name='Handwriting images validation - training',
    may_contain_adult_content=True,
    mix_tasks_in_creation_order=True,
    shuffle_tasks_in_task_suite=True,
    training_tasks_in_task_suite_count=5,
    assignment_max_duration_seconds=60*20,
    task_suites_required_to_pass=1,
    retry_training_after_days=1,
    inherited_instructions=True,
)

validation_training = toloka_client.create_training(validation_training)
print(f'Created training with id {validation_training.id}')
print(f'To view the training, go to https://{sandbox_url}toloka.yandex.com/requester/project/{validation_project.id}/training/{validation_training.id}')

#### Training data

If you chose another language, you will need to create some training tasks on your own or from pictures gathered in the first project.

In [None]:
training_df = pd.read_csv(f'training/{lang_prefix}-training.tsv', sep='\t', encoding='utf-8')
training_df.head()

In [None]:
# replace stub with real proxy name
training_df['INPUT:image'] = training_df['INPUT:image'].map(lambda s: s.replace('PRIVATE-PROXY-NAME', private_proxy_name))

#### Training images upload

In [None]:
recursive_upload(os.path.join('training', 'images'), posixpath.join(toloka_private_folder, 'training'))

In [None]:
training_tasks = [
    toloka.task.Task(
        input_values={
            'text': row[0],
            'image': row[1],
        },
        known_solutions=[
            toloka.task.BaseTask.KnownSolution(
                output_values={'image-result': row[3], 'text-result':row[2]} if not pd.isna(row[2]) else {'image-result': row[3]}
            )
        ],
        message_on_unknown_solution=row[4],
        pool_id=validation_training.id,
        infinite_overlap=True,
    ) for row in training_df.values
]

In [None]:
tasks_op = toloka_client.create_tasks_async(training_tasks)
result = toloka_client.wait_operation(tasks_op)

print(
    f'Total tasks: {result.details["total_count"]}',
    f'Total failed: {result.details["failed_count"]}',
    f'Total success: {result.details["success_count"]}',
    f'Total valid: {result.details["valid_count"]}',
    f'Total not valid: {result.details["not_valid_count"]}',
    sep='\n'
)

### Main pool

We admit performers who gave at least 70\% correct answers during training, and use `overlap = 3` to get more reliable answers after aggregation.

As additional quality control we use `Fast Answers` with a 10-second threshold (for 5 tasks per page) and `Majority Vote` with a 50\% threshold.

You should probably also create control tasks and add a `Control Tasks` quality control rule; this step is omitted in this tutorial.
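
To illustrate what a majority vote rule measures, here is a plain-Python sketch with an overlap of 3. It is not the Toloka implementation: the helper names and the tie handling are assumptions, and the pipeline itself aggregates final answers with Dawid-Skene rather than a simple majority.

```python
from collections import Counter


def majority_label(labels):
    """Return (label, share) for the most common label, or (None, 0.0) on a tie."""
    if not labels:
        return None, 0.0
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, 0.0
    label, votes = counts[0]
    return label, votes / len(labels)


def agreement_rate(performer_answers, majority_by_task):
    """Percentage of a performer's answers that match the majority label."""
    scored = [t for t in performer_answers if majority_by_task.get(t) is not None]
    if not scored:
        return None
    hits = sum(performer_answers[t] == majority_by_task[t] for t in scored)
    return 100.0 * hits / len(scored)


# three answers per task, as with overlap = 3
votes = {
    't1': ['ok', 'ok', 'bad'],
    't2': ['bad', 'bad', 'bad'],
    't3': ['ok', 'bad', 'error'],  # no majority
}
majority = {task: majority_label(labels)[0] for task, labels in votes.items()}
rate = agreement_rate({'t1': 'bad', 't2': 'bad', 't3': 'ok'}, majority)
print(majority, rate)
```

A performer whose agreement rate falls below 50 after enough answers would be restricted under a rule of this kind.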

In [None]:
validation_pool = toloka.pool.Pool(
    project_id=validation_project.id,
    private_name=f'Handwritten images validation for {lang_prefix} language',
    may_contain_adult_content=True,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.01,
    auto_accept_solutions=True,
    assignment_max_duration_seconds=60*5,
    defaults=toloka.pool.Pool.Defaults(
        default_overlap_for_new_task_suites=3
    ),
    filter=toloka.filter.FilterAnd(
        [   
            # we need performers who speak both English (instructions are in English) and your chosen language
            toloka.filter.Languages.in_('EN'),
            toloka.filter.Languages.in_(lang_prefix.upper()),
            toloka.filter.FilterOr(
                [
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.BROWSER),
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.TOLOKA_APP)
                ]
            )
        ]
    )
)

# you can and probably should add control tasks here
validation_pool.set_mixer_config(real_tasks_count=10, golden_tasks_count=0, training_tasks_count=0)
validation_pool.quality_control.training_requirement=toloka.quality_control.QualityControl.TrainingRequirement(
    training_pool_id=validation_training.id, 
    # you can tweak passing threshold here
    training_passing_skill_value=70
)

validation_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(fast_submit_threshold_seconds=10),
    conditions=[toloka.conditions.FastSubmittedCount > 0],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses in main pool of handwriting images validation project'
    )
)

validation_pool.quality_control.add_action(
    collector=toloka.collectors.MajorityVote(history_size=5, answer_threshold=2),
    conditions=[
        toloka.conditions.TotalAnswersCount >= 5,
        toloka.conditions.CorrectAnswersRate < 50,
    ],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Majority vote incorrect answers'
    )
)

validation_pool = toloka_client.create_pool(validation_pool)
print(f'Created pool with id {validation_pool.id}')
print(f'To view the pool, go to: https://{sandbox_url}toloka.yandex.com/requester/project/{validation_project.id}/pool/{validation_pool.id}')

## Whole pipeline

For pictures to show up in the tasks of the validation project, we first need to download them locally and then upload them to the Disk. You can (and probably should) optimize the images to make them load faster for the performers.

In [None]:
def upload_image(path):
    # take the file name first, so it is always defined when the URL is built
    filename = os.path.split(path)[-1]
    try:
        ydisk.upload(path, posixpath.join(toloka_private_folder, lang_prefix, filename))
    except yadisk.exceptions.PathExistsError:
        # the file is already on the Disk
        pass
    return f"https://{sandbox_url}toloka.yandex.ru/api/proxy/{private_proxy_name}/{lang_prefix}/{filename}"

In [None]:
def get_and_upload_attachment(solution):
    attachment = toloka_client.get_attachment(solution.output_values['image'])
    ext = os.path.splitext(attachment.name)[-1]
    # we can automatically reject non-image submits
    # you can add or remove acceptable extensions here
    if ext.lower() not in ['.jpg', '.jpeg', '.png', '.heic']:
        return None, False

    # downloading image locally
    filename = f"{attachment.id}{ext}"
    with open(filename, "wb+") as f:
        toloka_client.download_attachment(attachment.id, f)
    
    url = upload_image(filename)

    # deleting image locally
    try:
        os.remove(filename)
    except Exception:
        print(f'Failed to remove file {filename}.')
    
    return url, True

In [None]:
def wait_pool_for_close(pool, sleep_time = 60):
    pool = toloka_client.get_pool(pool.id)
    while not pool.is_closed():
        print(
            f'\t{datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} has status {pool.status}.'
        )
        time.sleep(sleep_time)
        pool = toloka_client.get_pool(pool.id)


def prepare_validation_tasks():
    validation_tasks = []  # Tasks that we will send for validation
    request = toloka.search_requests.AssignmentSearchRequest(
        status=toloka.assignment.Assignment.SUBMITTED,  # Only take completed tasks that haven't been accepted or rejected
        pool_id=img_pool.id,
    )
    # Create and upload new tasks
    for assignment in toloka_client.get_assignments(request):
        for task, solution in zip(assignment.tasks, assignment.solutions):
            img_url, correct = get_and_upload_attachment(solution)
            if not correct:
                toloka_client.reject_assignment(assignment.id,
                                                'Incorrect format for image file.')
                continue
            validation_tasks.append(
                toloka.task.Task(
                    input_values={
                        'text': task.input_values['text'],
                        'assignment_id': assignment.id,
                        'image': img_url
                    },
                    pool_id=validation_pool.id,
                )
            )
    print(f'Generated {len(validation_tasks)} new validation tasks')
    return validation_tasks



def run_validation_pool(validation_tasks):
    validation_tasks_op = toloka_client.create_tasks_async(
        validation_tasks,
        toloka.task.CreateTasksParameters(allow_defaults=True)
    )
    toloka_client.wait_operation(validation_tasks_op)
    validation_tasks_result = [task for task in toloka_client.get_tasks(pool_id=validation_pool.id) if not task.known_solutions]

    task_to_assignment = {}
    for task in validation_tasks_result:
        task_to_assignment[task.id] = task.input_values['assignment_id']

    # Open the validation pool
    run_pool = toloka_client.open_pool(validation_pool.id)
    run_pool = toloka_client.wait_operation(run_pool)
    print(f'Validation pool status - {run_pool.status}')
    return task_to_assignment


def get_aggregation_results():
    print('Started aggregation in the validation pool')
    # we need to aggregate answers for both output fields
    aggregation_operation_1 = toloka_client.aggregate_solutions_by_pool(
        request=toloka.aggregation.PoolAggregatedSolutionRequest(
            type='DAWID_SKENE',
            pool_id=validation_pool.id,
            fields=[toloka.aggregation.PoolAggregatedSolutionRequest.Field(name='text-result')]
        )
    )
    aggregation_operation_2 = toloka_client.aggregate_solutions_by_pool(
        request=toloka.aggregation.PoolAggregatedSolutionRequest(
            type='DAWID_SKENE',
            pool_id=validation_pool.id,
            fields=[toloka.aggregation.PoolAggregatedSolutionRequest.Field(name='image-result')]
        )
    )
    aggregation_operation_1 = toloka_client.wait_operation(aggregation_operation_1)
    aggregation_operation_2 = toloka_client.wait_operation(aggregation_operation_2)

    print('Results aggregated')

    aggregation_result_1 = toloka_client.find_aggregated_solutions(aggregation_operation_1.id)
    aggregation_result_2 = toloka_client.find_aggregated_solutions(aggregation_operation_2.id)

    validation_results_1 = aggregation_result_1.items
    while aggregation_result_1.has_more:
        aggregation_result_1 = toloka_client.find_aggregated_solutions(
            aggregation_operation_1.id,
            task_id_gt=aggregation_result_1.items[len(aggregation_result_1.items) - 1].task_id,
        )
        validation_results_1 = validation_results_1 + aggregation_result_1.items
        
    validation_results_2 = aggregation_result_2.items
    while aggregation_result_2.has_more:
        aggregation_result_2 = toloka_client.find_aggregated_solutions(
            aggregation_operation_2.id,
            task_id_gt=aggregation_result_2.items[len(aggregation_result_2.items) - 1].task_id,
        )
        validation_results_2 = validation_results_2 + aggregation_result_2.items
    return validation_results_1, validation_results_2

def set_answers_status(validation_results, task_to_assignment, links, confidence_lvl=0.6):
    print('Started adding results to image tasks')
    for text_result, image_result in zip(sorted(validation_results[0], key=lambda s: s.task_id), sorted(validation_results[1], key=lambda s: s.task_id)):
        # skipping not needed results
        if text_result.task_id not in task_to_assignment:
            continue
        # finding inputs for aggregated task
        inputs = toloka_client.get_task(text_result.task_id).input_values
        assignment_id = inputs['assignment_id']

        # accepting task in recording project if record is correct with the necessary confidence level 
        # 'ok' is the positive value in the validation project's output specification
        if (text_result.output_values['text-result'] == 'ok' and
            image_result.output_values['image-result'] == 'ok' and 
            text_result.confidence >= confidence_lvl and
            image_result.confidence >= confidence_lvl):
            try:
                toloka_client.accept_assignment(assignment_id, 
                                                'Well done!')
                links[assignment_id][os.path.split(inputs['image'])[-1]] = inputs['image']  # saving urls for getting result later
            except Exception:
                pass  # Already processed this assignment
        else:
            try:
                toloka_client.reject_assignment(assignment_id,
                                                'Image is incorrect. Check instructions for more details.')
            except Exception:
                pass  # Already processed this assignment
            
        task_to_assignment.pop(text_result.task_id, None)
            
    print('Finished adding results to image tasks')
    
    return links

In [None]:
# Run the pipeline

links = defaultdict(dict)

# Opening validation training and image gathering pool
toloka_client.open_pool(validation_training.id)
toloka_client.open_pool(img_pool.id)

while True:
    print('\nWaiting for image gathering pool to close...')
    wait_pool_for_close(img_pool)
    print(f'Image gathering pool {img_pool.id} is finally closed!')
    
    # Preparing tasks for validation project
    validation_tasks = prepare_validation_tasks()
    
    if len(validation_tasks) == 0:
        # no more tasks
        break
     
    # Adding tasks to the validation pool and open it
    task_to_assignment = run_validation_pool(validation_tasks)
    
    print('\nWaiting for validation pool to close')
    wait_pool_for_close(validation_pool)
    print(f'Validation pool {validation_pool.id} is finally closed!')
    
    # Getting validation results aggregation
    validation_results = get_aggregation_results()
    # Rejecting/accepting submitted images based on validation aggregation
    links = set_answers_status(validation_results, 
                               task_to_assignment,
                               links,
                               confidence_lvl=0.6)
    
print(f'Results received at {datetime.datetime.now()}')

## Results

You can download a `tsv` file with the results and the attachments from accepted tasks either directly from the Toloka web interface or with the code below:

In [None]:
def get_results(img_pool, links=None, download_images=False, path=os.path.join('data', lang_prefix, 'images'), verbose=True):
    request = toloka.search_requests.AssignmentSearchRequest(
        status=toloka.assignment.Assignment.ACCEPTED,  # Only take completed tasks that have been accepted
        pool_id=img_pool.id,
    )
    dataset = {
        'text': [],
        'image': [],
        'image_local': []
    }
    
    if download_images:
        # creating directories in path
        try:
            os.makedirs(path)
            if verbose:
                print(f'Created directories in path {path}')
        except FileExistsError:
            if verbose:
                print(f'Using already existing directory {path}') 
        abs_path = os.path.abspath(path)
        
    
    for assignment in toloka_client.get_assignments(request):
        for task, solution in zip(assignment.tasks, assignment.solutions):
            text = task.input_values['text']
            dataset['text'].append(text)

            attachment = toloka_client.get_attachment(solution.output_values['image'])
            ext = os.path.splitext(attachment.name)[-1]
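            if download_images:
                # downloading image on disk
                filepath = os.path.join(abs_path, f"{attachment.id}{ext}")
                with open(filepath, "wb") as f:
                    toloka_client.download_attachment(attachment.id, f)
                if verbose:
                    print(f'Downloaded image: {filepath}')
                dataset['image_local'].append(os.path.join(path, f"{attachment.id}{ext}"))
            else:
                # keep all columns the same length when images are not downloaded
                dataset['image_local'].append(None)
            # links may be None or lack this assignment; store None in that case
            image_url = (links or {}).get(assignment.id, {}).get(solution.output_values['image'])
            dataset['image'].append(image_url)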

    if verbose:        
        print('Finished getting results.')
    
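    # converting to pandas dataframe (for comfortable .tsv export)
    df = pd.DataFrame.from_dict(dataset)
    # data.tsv is written next to the images directory, which may not exist yet
    out_dir = os.path.abspath(os.path.join(path, os.pardir))
    os.makedirs(out_dir, exist_ok=True)
    df.to_csv(os.path.join(out_dir, 'data.tsv'), sep='\t', encoding='utf-8', index=False)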
    return df

In [None]:
dataset = get_results(img_pool, links, download_images=True, verbose=False)
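
The exported `data.tsv` can be read back later without rerunning the pipeline. Below is a minimal sketch using only the standard library, assuming the file has the `text`, `image` and `image_local` columns produced by `get_results`. The helper name `load_dataset_rows`, the temporary file and the sample row are placeholders for illustration, not part of the collected data.

```python
import csv
import os
import tempfile


def load_dataset_rows(tsv_path):
    """Hypothetical helper: read the exported TSV into a list of dicts."""
    with open(tsv_path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f, delimiter='\t'))


# Placeholder file standing in for data.tsv (the row is made up)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'data.tsv')
with open(sample_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['text', 'image', 'image_local'])
    writer.writerow(['exemple de phrase', 'https://example.com/1.jpg', 'images/1.jpg'])

rows = load_dataset_rows(sample_path)
print(rows[0]['text'], rows[0]['image_local'])
```

If you prefer to stay in pandas, `pd.read_csv(tsv_path, sep='\t')` should give an equivalent dataframe.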

## Visualization

You can now view the gathered images using the code below. An example is shown in Figure 4.
<table  align="center">
  <tr><td>
    <img src="./img/final-example.jpg"
         alt="Example"  width="600">
  </td></tr>
  <tr><td align="center">
    <b>Figure 4.</b> Example of image from the dataset
  </td></tr>
</table>

In [None]:
texts = dataset['text'].values
images = dataset['image_local'].values
start_with = 0

In [None]:
if start_with >= len(images):
    print('no more images')
else:
    ipyplot.plot_images(
        images=images[start_with:],
        labels=texts[start_with:],
        max_images=1,
        img_width=300,
    )

    start_with += 1

## Cleanup

After we are done with dataset collection, we need to clean up the images on Yandex.Disk:

In [None]:
def cleanup(delete_rejected=True, delete_accepted=False, delete_training=False, empty_trash=True):
    if delete_training:
        try:
            ydisk.remove(posixpath.join(toloka_private_folder, 'training'), permanently=empty_trash)
        except yadisk.exceptions.PathNotFoundError:
            # already deleted
            pass
    if not delete_accepted and not delete_rejected:
        return
    
    if delete_accepted and delete_rejected:
        try:
            ydisk.remove(posixpath.join(toloka_private_folder, lang_prefix), permanently=empty_trash)
        except yadisk.exceptions.PathNotFoundError:
            # already deleted
            pass
        return
    
    acc = []
    for dct in links.values():
        acc.extend(os.path.split(url)[-1] for url in dct.values())
    acc = set(acc)
    images = [item.name for item in ydisk.listdir(posixpath.join(toloka_private_folder, lang_prefix))]
    for img in images:
        if (img in acc and delete_accepted) or (img not in acc and delete_rejected):
            ydisk.remove(posixpath.join(toloka_private_folder, lang_prefix, img), permanently=empty_trash)

In [None]:
cleanup()
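
The file selection inside `cleanup` depends only on the two flags and on the set of accepted file names, so it can be checked offline before anything is deleted. The sketch below is an illustration under that assumption: `select_for_deletion` is a hypothetical helper that does not touch Yandex.Disk, and the file names are invented.

```python
def select_for_deletion(names, accepted, delete_rejected=True, delete_accepted=False):
    """Hypothetical helper: return the file names cleanup() would remove."""
    accepted = set(accepted)
    return [
        name for name in names
        if (name in accepted and delete_accepted)
        or (name not in accepted and delete_rejected)
    ]


# Invented file names, for illustration only
on_disk = ['a.jpg', 'b.jpg', 'c.jpg']
accepted_names = ['b.jpg']

print(select_for_deletion(on_disk, accepted_names))
print(select_for_deletion(on_disk, accepted_names, delete_rejected=False, delete_accepted=True))
```

As in `cleanup`, "rejected" here means every file that is not among the accepted ones, which also covers images that were never validated.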