# Text recognition

To get acquainted with Toloka tools for free, you can use the promo code **TOLOKAKIT1** to get $20 on your [profile page](https://toloka.yandex.com/requester/profile) after registration.

Prepare the environment and import everything we'll need.

In [None]:
!pip install toloka-kit==0.1.15
!pip install crowd-kit==0.0.7
!pip install ipyplot

import datetime
import os
import sys
import time
import logging

import ipyplot
import pandas
import numpy as np

import toloka.client as toloka
import toloka.client.project.template_builder as tb
from crowdkit.aggregation import ROVER

logging.basicConfig(
    format='[%(levelname)s] %(name)s: %(message)s',
    level=logging.INFO,
    stream=sys.stdout,
)

Create a toloka-client instance. All API calls will go through it. Read more about the OAuth token in our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)

In [None]:
toloka_client = toloka.TolokaClient(input("Enter your token:"), 'PRODUCTION')  # Or switch to 'SANDBOX'
logging.info(toloka_client.get_requester())

## Creating new project
Enter a clear project name and description.
> The project name and description will be visible to the performers.

In [3]:
project = toloka.Project(
    public_name='Write down the digits in an image',
    public_description='Look at the image and write down the digits shown on the water meter.',
)

Create task interface.

- Read about configuring the [task interface](https://yandex.com/support/toloka-requester/reference/interface-spec.html) in the Requester’s Guide.
- Check the [Interfaces section](https://toloka.ai/knowledgebase/interface) of our Knowledge Base for more tips on interface design.
- Read more about the [Template builder](https://yandex.ru/support/toloka-tb/index.html) in the Requester’s Guide.  

In [3]:
header_viewer = tb.MarkdownViewV1("""1. Look at the image
2. Find boxes with the numbers
3. Write down the digits in the black section. (Put '0' if there are no digits there)
4. Put '.'
5. Write down the digits in the red section""")

image_viewer = tb.ImageViewV1(tb.InputData('image_url'), rotatable=True)

output_field = tb.TextFieldV1(
    tb.OutputData('value'),
    label='Write down the digits. Format: 365.235',
    placeholder='Enter value',
    hint="Make sure your format of number is '365.235' or '0.112'",
    validation=tb.SchemaConditionV1(
        schema={
            'type': 'string',
            'pattern': r'^\d+\.?\d{0,3}$',
            'minLength': 1,
            'maxLength': 9,
        }
    )
)

task_width_plugin = tb.TolokaPluginV1('scroll', task_width=600)

project_interface = toloka.project.TemplateBuilderViewSpec(
    view=tb.ListViewV1([header_viewer, image_viewer, output_field]),
    plugins=[task_width_plugin],
)
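
The validation pattern used in the text field can be tried out offline before the project goes live. The sketch below is only an approximation: it uses Python's `re` module rather than the schema engine that Toloka runs, and the helper name `is_valid_reading` is hypothetical. It applies the same pattern and length limits as the `schema` above.

```python
import re

# Same pattern as in the TextFieldV1 schema. re.ASCII keeps \d limited to 0-9,
# and fullmatch() stands in for the ^...$ anchors.
READING_PATTERN = re.compile(r'\d+\.?\d{0,3}', re.ASCII)
MIN_LENGTH = 1
MAX_LENGTH = 9


def is_valid_reading(text: str) -> bool:
    """Return True if the text would pass the pattern and length checks."""
    if not MIN_LENGTH <= len(text) <= MAX_LENGTH:
        return False
    return READING_PATTERN.fullmatch(text) is not None


for sample in ['365.235', '0.112', '365', '365.', '12.3456', '.5', '12,5']:
    print(f'{sample!r}: {is_valid_reading(sample)}')
```

Note that the pattern also accepts readings without a fractional part, such as `365` or `365.`, so responses for the same image may differ in format even when the digits agree.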

Set the data specification and attach the task interface to the project.
> Specifications are a description of input data that will be used in a project and the output data that will be collected from the performers. 

Read more about [input and output data specifications](https://yandex.ru/support/toloka-tb/operations/create-specs.html?lang=en) in the Requester’s Guide. 

In [5]:
input_specification = {'image_url': toloka.project.UrlSpec()}
output_specification = {'value': toloka.project.StringSpec()}

project.task_spec = toloka.project.task_spec.TaskSpec(
    input_spec=input_specification,
    output_spec=output_specification,
    view_spec=project_interface,
)

Write short and clear instructions.

> Though the task itself is simple, be sure to add examples for non-obvious cases (like when there are no red digits on an image). This helps to eliminate noise in the labels.

Get more tips on designing [instructions](https://toloka.ai/knowledgebase/instruction) in our Knowledge Base.

In [6]:
project.public_instructions = """This task helps solve a machine learning problem: recognizing digits in an image.<br>
The more precisely you read the information from the image, the more precise the algorithm will be.<br>
Your contribution is to provide exact information, even in complicated and uncertain cases.<br>
We are counting on your skills to help solve an important scientific problem.<br><br>
<b>Basic steps:</b><br>
<ul><li>Look at the image and find the meter with the numbers in the boxes</li>
<li>Find the black numbers/section and the red numbers/section</li>
<li>Enter the black and red numbers, separated by '.', in the text field</li></ul>"""

Create a project.

In [None]:
project = toloka_client.create_project(project)

## Preparing data
This example uses the [Toloka WaterMeters](https://toloka.ai/datasets) dataset collected by Roman Kucev.

In [3]:
!curl https://s3.mds.yandex.net/tlk/dataset/TlkWaterMeters/data.tsv --output data.tsv

raw_dataset = pandas.read_csv('data.tsv', sep='\t', dtype={'value': 'str'})
raw_dataset = raw_dataset[['image_url', 'value']]

with pandas.option_context("max_colwidth", 100):
    display(raw_dataset)



| | image_url | value |
|---|---|---|
| 0 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_53_value_595_825.jpg | 595.825 |
| 1 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_553_value_65_475.jpg | 65.475 |
| 2 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_407_value_21_86.jpg | 21.86 |
| 3 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_252_value_313_322.jpg | 313.322 |
| 4 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_851_value_305_162.jpg | 305.162 |
| ... | ... | ... |
| 1239 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_255_value_172_542.jpg | 172.542 |
| 1240 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_878_value_97_299.jpg | 97.299 |
| 1241 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_649_value_146_443.jpg | 146.443 |
| 1242 | https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_396_value_228_944.jpg | 228.944 |


Let's look at the images from this dataset:

<table  align="center">
  <tr>
  <td>
    <img src="https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_53_value_595_825.jpg" alt="value 595.825">
  </td>
  <td>
    <img src="https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_553_value_65_475.jpg" alt="value 65.475">
  </td>
  <td>
    <img src="https://tlk.s3.yandex.net/dataset/TlkWaterMeters/images/id_407_value_21_86.jpg" alt="value 21.860">
  </td>
  </tr>
  <tr><td align="center" colspan="3">
    <b>Figure 1.</b> Images from dataset
  </td></tr>
</table>

Split this dataset into three parts:
- Training tasks - we'll put them into the training pool. Tasks of this type must contain the ground truth and a hint about how to perform them.
- Golden tasks - we'll put them into the regular pool. Tasks of this type must contain the ground truth.
- Regular tasks - these also go into the regular pool. They contain only the image URL as input.

In [9]:
raw_dataset = raw_dataset.sample(frac=1).reset_index(drop=True)

training_dataset, golden_dataset, main_dataset, _ = np.split(raw_dataset, [10, 20, 120], axis=0)
print(f'training_dataset - {len(training_dataset)}')
print(f'golden_dataset - {len(golden_dataset)}')
print(f'main_dataset - {len(main_dataset)}')

training_dataset - 10
golden_dataset - 10
main_dataset - 100
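
To make the split boundaries explicit, here is a standard-library sketch of the same cut points `[10, 20, 120]`. The helper `split_at` is hypothetical, and the row count of 1243 is an assumption taken from the index shown in the table above. The fourth part is the remainder that the notebook discards.

```python
import random


def split_at(items, indices):
    """Cut a list at the given positions, like np.split with a list of indices."""
    bounds = [0, *indices, len(items)]
    return [items[start:stop] for start, stop in zip(bounds, bounds[1:])]


# Stand-in for the dataset rows (assumed to be 1243 of them).
rows = list(range(1243))
random.Random(0).shuffle(rows)  # fixed seed only to make the sketch repeatable

training_part, golden_part, main_part, unused_part = split_at(rows, [10, 20, 120])
print(len(training_part), len(golden_part), len(main_part), len(unused_part))
```

Only the first 120 shuffled rows are used, which keeps the example inexpensive to run on real performers.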


## Create a training pool
> Training is an essential part of almost every crowdsourcing project. It allows you to select performers who have really mastered the task, and thus improve quality. Training is also a great tool for scaling your task because you can run it any time you need new performers. 

Read more about [selecting performers](https://toloka.ai/knowledgebase/quality-control) in our Knowledge Base.

In [None]:
training = toloka.Training(
    project_id=project.id,
    private_name='Text recognition training',
    may_contain_adult_content=False,
    assignment_max_duration_seconds=60*10,
    mix_tasks_in_creation_order=False,
    shuffle_tasks_in_task_suite=False,
    training_tasks_in_task_suite_count=2,
    task_suites_required_to_pass=5,
    retry_training_after_days=5,
    inherited_instructions=True,
)
training = toloka_client.create_training(training)

Upload training tasks to the pool.
> It’s important to include examples for all cases in the training. Make sure the training set is balanced and the comments explain why an answer is correct. Don’t just name the correct answers.

In [11]:
training_tasks = [
    toloka.Task(
        pool_id=training.id,
        input_values={'image_url': row.image_url},
        known_solutions = [toloka.task.BaseTask.KnownSolution(output_values={'value': row.value})],
        message_on_unknown_solution=f'Black section is {row.value.split(".")[0]}. Red section is {row.value.split(".")[1]}.',
    )
    for row in training_dataset.itertuples()
]
result = toloka_client.create_tasks(training_tasks, allow_defaults=True)
print(len(result.items))

10
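
The hint in the cell above splits the value on `.` and assumes that both parts are present. As a possible variation, the hypothetical helper `make_hint` below builds the same message with `str.partition`, which also covers a reading that has no red digits. The wording of the fallback message is an assumption, not part of the original project.

```python
def make_hint(value: str) -> str:
    """Build a training hint from a reading such as '595.825'."""
    black, _, red = value.partition('.')
    if not red:
        # No fractional part in the ground truth: explain that case explicitly.
        return f'Black section is {black}. There are no red digits.'
    return f'Black section is {black}. Red section is {red}.'


print(make_hint('595.825'))
print(make_hint('595'))
```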


## Create the main pool
A pool is a set of paid tasks grouped into task pages. These tasks are sent out for completion at the same time.

> All tasks within a pool have the same settings (price, quality control, etc.)

In [12]:
pool = toloka.Pool(
    project_id=project.id,
    # Give the pool any convenient name. You are the only one who will see it.
    private_name='Write down the digits in an image.',
    may_contain_adult_content=False,
    # Set the price per task page.
    reward_per_assignment=0.02,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    # Overlap. This is the number of users who will complete the same task.
    defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=3),
    # Time allowed for completing a task page
    assignment_max_duration_seconds=600,
)

- Read more about [pricing principles](https://toloka.ai/knowledgebase/pricing) in our Knowledge Base.
- To understand [how overlap works](https://yandex.com/support/toloka-requester/concepts/mvote.html), go to the Requester’s Guide.  
- To understand how much time it should take to complete a task suite, try doing it yourself.

Attach the training you created earlier and select the accuracy level that is required to reach the main pool.

In [13]:
pool.set_training_requirement(training_pool_id=training.id, training_passing_skill_value=75)

Select English-speaking performers.

In [14]:
pool.filter = toloka.filter.Languages.in_('EN')

Set up [Quality control](https://yandex.com/support/toloka-requester/concepts/control.html). Ban performers who give incorrect responses to control tasks.

> Since tasks such as these have an answer that can be used as ground truth, we can use standard quality control rules like golden sets.

Read more about [quality control principles](https://toloka.ai/knowledgebase/quality-control) in our Knowledge Base or check out [control tasks settings](https://yandex.com/support/toloka-requester/concepts/goldenset.html) in the Requester’s Guide.

In [15]:
pool.quality_control.add_action(
    collector=toloka.collectors.GoldenSet(),
    conditions=[
        toloka.conditions.GoldenSetCorrectAnswersRate < 80.0,
        toloka.conditions.GoldenSetAnswersCount >= 3
    ],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=2,
        duration_unit='DAYS',
        private_comment='Control tasks failed'
    )
)

pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=7),
    conditions=[toloka.conditions.FastSubmittedCount >= 1],
    action=toloka.actions.RestrictionV2(
        scope='PROJECT',
        duration=2,
        duration_unit='DAYS',
        private_comment='Fast response'
    ))
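
To see when these two rules would trigger, the conditions can be mirrored in plain Python. This is only an illustration of the thresholds configured above, not Toloka's implementation. The function names are hypothetical, and treating "fast" as strictly below the threshold is an assumption.

```python
def fails_golden_set(correct_answers: int, total_answers: int) -> bool:
    """Mirror of the first rule: at least 3 control answers, under 80% correct."""
    if total_answers < 3:
        return False
    rate = 100.0 * correct_answers / total_answers
    return rate < 80.0


def submits_too_fast(durations_seconds, history_size=5, threshold_seconds=7):
    """Mirror of the second rule: any of the last `history_size` pages was fast."""
    recent = durations_seconds[-history_size:]
    fast_count = sum(1 for seconds in recent if seconds < threshold_seconds)
    return fast_count >= 1


print(fails_golden_set(correct_answers=2, total_answers=3))
print(fails_golden_set(correct_answers=4, total_answers=5))
print(submits_too_fast([30, 5, 40]))
```

With these thresholds, a performer with 2 correct control answers out of 3 would be restricted, while 4 out of 5 (exactly 80%) would not.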

Specify the number of tasks per page. For example: 3 main tasks and 1 control task.

> We recommend putting as many tasks on one page as a performer can complete in 1 to 5 minutes. That way, performers are less likely to get tired, and they won’t lose a significant amount of data if a technical issue occurs. 

To learn more about [grouping tasks](https://yandex.com/support/search-results/?service=toloka-requester&query=smart+mixing) into suites, read the Requester’s Guide. 

In [16]:
pool.set_mixer_config(
    real_tasks_count=3,
    golden_tasks_count=1
)
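
With the page size, overlap, and price known, a rough budget can be estimated before the pool is created. The sketch below assumes 100 main tasks, counts an incomplete last page as a full one, and ignores any platform fee. The helper `estimate_pool_cost` is hypothetical.

```python
from decimal import Decimal
from math import ceil


def estimate_pool_cost(real_tasks, real_tasks_per_page, overlap, reward_per_page):
    """Estimate performer rewards for a pool, excluding any platform fee."""
    pages = ceil(real_tasks / real_tasks_per_page)
    # Decimal avoids float rounding noise in money amounts.
    return pages * overlap * Decimal(str(reward_per_page))


cost = estimate_pool_cost(real_tasks=100, real_tasks_per_page=3, overlap=3, reward_per_page=0.02)
print(f'Estimated rewards: ${cost}')
```

Control tasks do not add pages in this estimate, because each page mixes one of them in alongside the three main tasks.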

Create the pool.

In [None]:
pool = toloka_client.create_pool(pool)

**Uploading tasks**

Create control tasks. In small pools, control tasks should account for 10–20% of all tasks.

> Control tasks are tasks that already contain the correct response. They are used for checking the quality of responses from performers. The performer's response is compared to the response you provided. If they match, it means the performer answered correctly.
> Make sure to include different variations of correct responses in equal amounts.

To learn more about [creating control tasks](https://yandex.com/support/toloka-requester/concepts/task_markup.html), go to the Requester’s Guide.

In [18]:
golden_tasks = [
    toloka.Task(
        pool_id=pool.id,
        input_values={'image_url': row.image_url},
        known_solutions = [
            toloka.task.BaseTask.KnownSolution(
                output_values={'value': row.value}
            )
        ],
        infinite_overlap=True,
    )
    for row in golden_dataset.itertuples()
]

Create the pool tasks.

In [19]:
tasks = [
    toloka.Task(
        pool_id=pool.id,
        input_values={'image_url': url},
    )
    for url in main_dataset['image_url']
]

Upload the tasks.

In [20]:
created_tasks = toloka_client.create_tasks(golden_tasks + tasks, allow_defaults=True)
print(len(created_tasks.items))

110


You can open the created pool in the web interface and preview both the regular tasks and the control tasks.

<table  align="center">
  <tr>
  <td>
    <img src="./img/performer_interface.png" alt="Possible performer interface">
  </td>
  </tr>
  <tr><td align="center">
    <b>Figure 2.</b> Possible performer interface.
  </td></tr>
</table>

Start the pool.

**Important.** Remember that real Toloka performers will complete the tasks.
Double-check that everything is correct in your project configuration before you start the pool.

In [None]:
training = toloka_client.open_training(training.id)
print(f'training - {training.status}')

pool = toloka_client.open_pool(pool.id)
print(f'main pool - {pool.status}')

## Receiving responses

Wait until the pool is completed.

In [None]:
pool_id = pool.id

def wait_pool_for_close(pool_id, minutes_to_wait=1):
    sleep_time = 60 * minutes_to_wait
    pool = toloka_client.get_pool(pool_id)
    while not pool.is_closed():
        op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])
        op = toloka_client.wait_operation(op)
        percentage = op.details['value'][0]['result']['value']
        logging.info(
            f'   {datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} - {percentage}%'
        )
        time.sleep(sleep_time)
        pool = toloka_client.get_pool(pool.id)
    logging.info('Pool was closed.')

wait_pool_for_close(pool_id)
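
The loop above waits indefinitely. If an upper bound on the waiting time is preferred, the same polling idea can be written as a generic helper with a timeout. This is a sketch with hypothetical names (`wait_until`, `is_done`), and the `sleep` parameter exists only so the demo can run without real delays.

```python
import time


def wait_until(is_done, interval_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll is_done() until it returns True or the timeout runs out.

    Returns True if the condition was met, False on timeout.
    """
    waited = 0
    while not is_done():
        if waited >= timeout_seconds:
            return False
        sleep(interval_seconds)
        waited += interval_seconds
    return True


# Demo with a fake condition that becomes true on the third check,
# and a no-op sleep so the example finishes instantly.
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), sleep=lambda seconds: None))
```

With the client from this notebook, the condition could be `lambda: toloka_client.get_pool(pool_id).is_closed()`.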

Get responses

When all the tasks are completed, look at the responses from performers.

In [29]:
answers = []

for assignment in toloka_client.get_assignments(pool_id=pool.id, status='ACCEPTED'):
    for task, solution in zip(assignment.tasks, assignment.solutions):
        if not task.known_solutions:
            answers.append([task.input_values['image_url'], solution.output_values['value'], assignment.user_id])

print(f'answers count: {len(answers)}')
# Prepare dataframe
answers_df = pandas.DataFrame(answers, columns=['task', 'text', 'performer'])

answers count: 300


Aggregate the results using the ROVER model implemented in [Crowd-Kit](https://github.com/Toloka/crowd-kit#crowd-kit-computational-quality-control-for-crowdsourcing).

In [32]:
rover_agg_df = ROVER(tokenizer=list, detokenizer=''.join).fit_predict(answers_df)
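
As a sanity check for the aggregated values, a much simpler baseline is a per-image majority vote over whole strings. This is not ROVER, which aligns the responses and votes token by token. It is a plain-Python sketch over rows shaped like `answers` above, with a hypothetical helper name and made-up sample data.

```python
from collections import Counter, defaultdict


def majority_vote(rows):
    """Pick the most frequent answer per task from (task, text, performer) rows."""
    votes = defaultdict(Counter)
    for task, text, _performer in rows:
        votes[task][text] += 1
    return {task: counter.most_common(1)[0][0] for task, counter in votes.items()}


# Made-up answers for illustration only.
sample_rows = [
    ('img_1', '595.825', 'performer_a'),
    ('img_1', '595.825', 'performer_b'),
    ('img_1', '595.325', 'performer_c'),
    ('img_2', '65.475', 'performer_a'),
    ('img_2', '65.475', 'performer_c'),
    ('img_2', '65.47', 'performer_b'),
]
print(majority_vote(sample_rows))
```

Unlike a whole-string vote, character-level aggregation can still recover a reading when all three performers disagree on different digits.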

Look at the results.

Some preparations for displaying the results

In [35]:
images = rover_agg_df.index.values
labels = rover_agg_df.values
start_with = 0

Note: The cell below can be run several times.

In [None]:
if start_with >= len(rover_agg_df):
    logging.info('no more images')
else:
    ipyplot.plot_images(
        images=images[start_with:],
        labels=labels[start_with:],
        max_images=8,
        img_width=300,
    )

    start_with += 8

You can see the labeled images. Some possible results are shown in Figure 3 below.

<table  align="center">
  <tr><td>
    <img src="./img/possible_result.png"
         alt="Possible results">
  </td></tr>
  <tr><td align="center">
    <b>Figure 3.</b> Possible results.
  </td></tr>
</table>