# ASR dataset labeling pipeline based on Wikipedia articles

The idea behind this notebook is to collect labeled text-audio pairs for Text-to-Speech and Speech-to-Text tasks from the large corpus of Wikipedia articles in a fixed language (one supported by Toloka), using Toloka's toolkit (toloka-kit).

First of all, we highly recommend reading [the image segmentation example](https://github.com/Toloka/toloka-kit/blob/main/examples/image_segmentation/image_segmentation.ipynb) before you start this one, so that you are familiar with the main kinds of entities in Toloka.

## Content

We will implement 3 projects:
* Clearing the raw texts produced by the Wikipedia collection script
* Recording the cleared texts
* Verifying the records from the previous project

You can learn more about each of these projects from the corresponding section of the notebook and the diagram attached below.

<table  align="center">
  <tr><td>
    <img src="./img/scheme.png"
         alt="Scheme"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> Scheme of the pipeline
  </td></tr>
</table>


## Preparation

In [None]:
# Installing toloka-kit
!pip install toloka-kit==0.1.3
# Installing boto3 for working with object storage (s3)
!pip install boto3

In [None]:
import os
import datetime
import time

import ipyplot
import pandas as pd

import boto3

import toloka.client as toloka
import toloka.client.project.template_builder as tb

from tqdm.notebook import tqdm

In [None]:
# You can change this to any other language supported by Toloka.
# For now, public instructions and golden data exist for 'en' only.
# For other languages you will have to write your own instructions and label your own golden sets
# (or use the English versions).
LANGUAGE = 'en'

In this notebook we use Yandex Object Storage to make our audio records available on the internet.  
If you haven't done so already, [join Yandex Cloud services](https://cloud.yandex.com/en-ru/) first.  
After that, complete these 3 steps:
* [Create a service account](https://cloud.yandex.com/en-ru/docs/iam/operations/sa/create)
* [Assign a role to a service account](https://cloud.yandex.com/en-ru/docs/iam/operations/sa/assign-role-for-sa)
* [Create a static access key](https://cloud.yandex.com/en-ru/docs/iam/operations/sa/create-access-key)  

Finally, put your secret and public access keys into the variables `aws_secret_access_key` and `aws_access_key` in the cell below.
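
If you prefer not to paste secrets into the notebook itself, one option is to read them from environment variables. The sketch below is only an illustration: the helper name `read_credentials` and the variable name `TOLOKA_TOKEN` are hypothetical, while `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` follow the naming that boto3 itself recognizes.

```python
import os


def read_credentials(env=os.environ):
    """Return (toloka_token, secret_key, access_key) read from environment variables.

    TOLOKA_TOKEN is a hypothetical name chosen for this sketch.
    """
    names = ("TOLOKA_TOKEN", "AWS_SECRET_ACCESS_KEY", "AWS_ACCESS_KEY_ID")
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Fail early with a readable message instead of a confusing API error later
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in names)
```

With the variables exported in your shell, `toloka_token, aws_secret_access_key, aws_access_key = read_credentials()` could replace the manual assignments.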

In [None]:
toloka_token = <PASTE YOUR TOLOKA OAUTH TOKEN HERE>

# This notebook uses Yandex Object Storage
# Complete the 3 steps from the section above to get these keys
aws_secret_access_key = <PASTE YOUR SECRET ACCESS KEY HERE>
aws_access_key = <PASTE YOUR PUBLIC ACCESS KEY HERE>

In [None]:
SANDBOX = False  # Change to True for working with Sandbox Toloka
toloka_client = toloka.TolokaClient(toloka_token, 'SANDBOX' if SANDBOX else 'PRODUCTION')
toloka_domain = "sandbox.toloka.yandex.com" if SANDBOX else "toloka.yandex.com"

### Pricing
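
The estimate in the next cell follows the same pattern for each pool: the number of assignments (items times overlap, divided by tasks per page) multiplied by the cost of one assignment. Below is a minimal sketch of that arithmetic as a reusable helper. The function and parameter names are ours, and the `0.05` term and the `1.2` factor are copied from the cell as-is, without any claim about what they represent in Toloka's billing.

```python
def approx_price(items, overlap, tasks_per_page, reward, extra_per_assignment=0.05):
    """Estimate the cost of one pool: assignments times cost per assignment."""
    assignments = items * overlap / tasks_per_page
    return round(assignments * (reward + extra_per_assignment), 2)


texts = 100
breakdown = {
    "classification": approx_price(texts, overlap=5, tasks_per_page=25, reward=0.025),
    "recording": approx_price(texts, overlap=1, tasks_per_page=1, reward=0.015),
    # 1.2 is the same multiplier that the cell applies to the verification volume
    "verification": approx_price(1.2 * texts, overlap=3, tasks_per_page=20, reward=0.03),
}
total = round(sum(breakdown.values()), 2)

for name, price in breakdown.items():
    print(f"{name}: ${price}")
print(f"total: ${total}")
```

Keeping the per-pool numbers separate makes it easier to see which stage dominates the budget when you change `TEXTS_COUNT` or the overlap.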

In [None]:
TEXTS_COUNT = 100


approx_classification_price = round(TEXTS_COUNT * 5 / 25 * (0.025 + 0.05), 2)
approx_recording_price = round(TEXTS_COUNT * (0.015 + 0.05), 2)
approx_verification_price = round(1.2 * TEXTS_COUNT * 3 / 20 * (0.03 + 0.05), 2)

approx_pipeline_price = approx_classification_price + approx_recording_price + approx_verification_price


requester = toloka_client.get_requester()
if requester.balance >= approx_pipeline_price:
    print('You have enough money on your account!')
else:
    print('You haven\'t got enough money on your account!')

### Getting raw data

You can use the pre-collected texts from the data folder or collect them yourself (this will take some time).

**If this is your first time viewing this notebook, we highly recommend using the pre-collected data.**  
If you want to collect the data yourself, run the CLI tool `./scripts/collect_corpus.py` in the terminal.

To get more information about its parameters, use the help flag: `python3 ./scripts/collect_corpus.py -h`

The English pre-collected data was produced by running the script with the following parameters:  
`python3 ./scripts/collect_corpus.py en -size 2000 -fp ../data/en_precollected.tsv -s -min_len 90`

In [None]:
# Working with pre-collected data (you can skip this cell if you collect the data yourself)
# Malformed lines are skipped with a warning, which is fine here.
# `on_bad_lines` needs pandas >= 1.3; on older versions use error_bad_lines=False instead.
precollected_data = pd.read_csv(f'./data/{LANGUAGE}_precollected.tsv', sep='\t', on_bad_lines='warn')
raw_data = precollected_data.sample(TEXTS_COUNT)

In [None]:
# Working with data you collected yourself (uncomment the next line to read it from disk)
# raw_data = pd.read_csv(<PATH_TO_YOUR_DATA>, sep='\t', on_bad_lines='warn')

## 1. Clearing collected data from incorrect paragraphs

### 1.1. Classification project

In this project, performers will check paragraphs for mistakes and Wikipedia automated word processing artifacts.

In [None]:
# Letting the performer choose whether the text is correct
radio_group_field = tb.fields.RadioGroupFieldV1(
    data=tb.data.OutputData(path='is_correct'),
    label='Is this text correct?',
    validation=tb.conditions.RequiredConditionV1(),
    options=[
        tb.fields.GroupFieldOption(
            label='Yes',
            value='yes'),
        tb.fields.GroupFieldOption(
            label='No',
            value='no'),
    ]
)

# Creating the interface that performers will see, using the element defined above
project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(
            items=[
                tb.view.TextViewV1(
                    label='Text', 
                    content=tb.data.InputData(path='text')
                ),
                radio_group_field
            ]
        )
    )
)



# Setting up the project with defined parameters and interface
classification_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name=open(f"./instructions/classification/{LANGUAGE}_project_name.txt").read().strip(),
    private_comment='Clearing texts',
    public_description=open(f"./instructions/classification/{LANGUAGE}_short_instructions.txt").read().strip(),
    public_instructions=open(f"./instructions/classification/{LANGUAGE}_public_instructions.html").read().strip(),
    
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={
            'text_id': toloka.project.field_spec.StringSpec(),
            'text': toloka.project.field_spec.StringSpec(),
        },
        output_spec={
            'is_correct': toloka.project.field_spec.StringSpec(
                allowed_values=[
                    'yes',
                    'no',
                ]
            )
        },
        view_spec=project_interface,
    ),
)

In [None]:
# Calling the API to create a new project
# If you have already created the project, you can fetch it instead with toloka_client.get_project('your marking project id')
classification_project = toloka_client.create_project(classification_project)
print(f'Created marking project with id {classification_project.id}')
print(f'To view the project, go to: https://{toloka_domain}/requester/project/{classification_project.id}')

<table  align="center">
  <tr><td>
    <img src="./img/classification_project_interface.png"
         alt="cls_iface"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 2.</b> How performers will see the tasks
  </td></tr>
</table>

<table  align="center">
  <tr><td>
    <img src="./img/classification_project_instruction.png"
         alt="cls_inst"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 3.</b> How performers will see the instruction
  </td></tr>
</table>

### 1.2. Classification training

This training contains tasks that show performers examples of correctly completed tasks.

In [None]:
# Setting up training
classification_training = toloka.training.Training(
    project_id=classification_project.id,
    private_name='[Clearing texts] Training',
    may_contain_adult_content=True,
    mix_tasks_in_creation_order=True,
    shuffle_tasks_in_task_suite=True,
    training_tasks_in_task_suite_count=10,
    assignment_max_duration_seconds=60*10,
    task_suites_required_to_pass=2,
    retry_training_after_days=1,
    inherited_instructions=True,
)

# Calling the API to create a new training
classification_training = toloka_client.create_training(classification_training)
print(f'Created training with id {classification_training.id}')
print(f'To view the training, go to: https://{toloka_domain}/requester/project/{classification_project.id}/training/{classification_training.id}')

In [None]:
# Let's take a look at the labeled training data
training_data = pd.read_csv(f'./data/{LANGUAGE}_classification_training.tsv', sep='\t')
training_data.head()

In [None]:
# Creating tasks from training data
training_tasks = [
    toloka.task.Task(
        input_values={
            'text_id': str(row[0]),
            'text': row[1],
        },
        known_solutions=[
            toloka.task.BaseTask.KnownSolution(
                output_values={'is_correct': row[2]}
            )
        ],
        infinite_overlap=True,
        pool_id=classification_training.id,
        message_on_unknown_solution="" if pd.isna(row[3]) else row[3]
    ) for row in training_data.values
]


# Calling the API to create new tasks
tasks_op = toloka_client.create_tasks_async(training_tasks)
op_res = toloka_client.wait_operation(tasks_op)
print(
    f'Total tasks: {op_res.details["total_count"]}',
    f'Total failed: {op_res.details["failed_count"]}',
    f'Total success: {op_res.details["success_count"]}',
    f'Total valid: {op_res.details["valid_count"]}',
    f'Total not valid: {op_res.details["not_valid_count"]}',
    sep='\n'
)

### 1.3. Classification pool

In this pool, trained performers will do the classification task: they will decide whether each text is correct.

About some parameters:
* **Overlap** - we need several opinions on each text to aggregate the results with a high degree of confidence.
* We want to **filter performers** by their training results and by their knowledge of the language of the dataset texts.
* We admit only performers who work in the browser version of Toloka (see the `ClientType` filter in the pool settings).

About quality control:
* We want to ban performers who answer too fast.
* We want to ban performers who show low quality on the golden set tasks.
* We want to ban performers who fail the captcha suspiciously often.
* We want to ban lazy performers who skip tasks until they find an easy one.
* We want to ban performers who deviate too much from the majority opinion.
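
To make the role of overlap concrete, here is a minimal sketch of how several answers per text could be aggregated by a simple majority vote. It is an illustration only: the helper name `majority_vote` and the sample data are hypothetical, and the pipeline may aggregate results differently.

```python
from collections import Counter


def majority_vote(answers):
    """Return (label, share_of_votes) for one text given the list of its answers."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


# Hypothetical answers from 5 performers per text (overlap = 5)
votes_by_text = {
    "text_1": ["yes", "yes", "no", "yes", "yes"],
    "text_2": ["no", "no", "yes", "no", "yes"],
}
aggregated = {text_id: majority_vote(votes) for text_id, votes in votes_by_text.items()}
print(aggregated)
```

With an odd overlap and two possible answers there are no ties, and the share of votes gives a rough confidence that can be used to discard texts the performers disagree on.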

In [None]:
# Setting up pool
classification_pool = toloka.pool.Pool(
    project_id=classification_project.id,
    private_name='[Clearing texts] Classification pool',
    may_contain_adult_content=True,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.025,
    auto_accept_solutions=True,
    assignment_max_duration_seconds=60*25,
    defaults=toloka.pool.Pool.Defaults(
        default_overlap_for_new_task_suites=5
    ),
    filter=toloka.filter.FilterAnd(
        [
            toloka.filter.Languages.in_(LANGUAGE.upper()),
            toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.BROWSER),
        ]
    )
    
    
)

# Setting task mixing configuration
classification_pool.set_mixer_config(
    real_tasks_count=25,
    golden_tasks_count=1,
    training_tasks_count=0
)


# Setting up pool quality control

# Banning performer who answers too fast
classification_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(
        history_size=5, 
        fast_submit_threshold_seconds=375
    ),
    conditions=[toloka.conditions.FastSubmittedCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)

# Banning performer who answers too fast (another case)
classification_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(
        history_size=5, 
        fast_submit_threshold_seconds=250
    ),
    conditions=[toloka.conditions.FastSubmittedCount > 0],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)

# Banning performer by captcha criteria
classification_pool.quality_control.add_action(
    collector=toloka.collectors.Captcha(history_size=5),
    conditions=[toloka.conditions.FailRate >= 60],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=3,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='Captcha'
    )
)

# Banning performer by majority vote criteria
classification_pool.quality_control.add_action(
    collector=toloka.collectors.MajorityVote(history_size=5, answer_threshold=3),
    conditions=[
        toloka.conditions.TotalAnswersCount > 9,
        toloka.conditions.CorrectAnswersRate < 65,
    ],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Majority vote low quality'
    )
)

# Banning performer who skips some tasks in a row
classification_pool.quality_control.add_action(
    collector=toloka.collectors.SkippedInRowAssignments(),
    conditions=[toloka.conditions.SkippedInRowCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=15,
        duration_unit='DAYS',
        private_comment='Lazy performer',
    )
)

# Banning performers whose classification results are worse than random choice
classification_pool.quality_control.add_action(
    collector=toloka.collectors.GoldenSet(),
    conditions=[
        toloka.conditions.GoldenSetCorrectAnswersRate < 50,
        toloka.conditions.GoldenSetAnswersCount > 3
    ],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=1,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='Golden set'
    )
)

In [None]:
# Calling the API to create a new pool
classification_pool = toloka_client.create_pool(classification_pool)
print(f'Created pool with id {classification_pool.id}')
print(f'To view the pool, go to: https://{toloka_domain}/requester/project/{classification_project.id}/pool/{classification_pool.id}')

Now we need to add tasks to this pool. As shown above, we use a golden set of tasks for quality control, so we need not only real tasks but also tasks with known answers *(from the golden set)*.
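
The golden set rule configured above can be read as a simple predicate on a performer's statistics. The sketch below only mirrors the two thresholds from that rule (more than 3 golden answers, fewer than 50% of them correct). The function name is hypothetical, and the real check is performed by Toloka, not by this code.

```python
def fails_golden_set_rule(correct_answers, total_answers, min_answers=4, min_rate=50):
    """Return True if a performer would match the golden set ban conditions."""
    if total_answers < min_answers:
        # Too few golden tasks answered to judge the performer yet
        return False
    rate = 100 * correct_answers / total_answers
    return rate < min_rate


print(fails_golden_set_rule(correct_answers=1, total_answers=4))
print(fails_golden_set_rule(correct_answers=2, total_answers=4))
```

Since each page contains only one golden task, a performer has to complete at least four pages before this rule can apply.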

In [None]:
raw_data

In [None]:
# Creating tasks from raw data
classification_tasks = [
    toloka.task.Task(
        input_values={
            'text_id': str(row[0]),
            'text': row[1],
        },
        pool_id=classification_pool.id,
    ) for row in raw_data.values
]

In [None]:
# `on_bad_lines` needs pandas >= 1.3; on older versions use error_bad_lines=False instead
golden_data = pd.read_csv(f'./data/{LANGUAGE}_classification_golden.tsv', sep='\t', on_bad_lines='warn')
golden_data

In [None]:
# Creating control tasks from golden data
golden_tasks = [
    toloka.task.Task(
        input_values={
            'text_id': str(row[0]),
            'text': row[1],
        },
        known_solutions=[
            toloka.task.BaseTask.KnownSolution(
                output_values={'is_correct': row[2]}
            )
        ],
        pool_id=classification_pool.id,
    ) for row in golden_data.values
]

In [None]:
# Calling the API to create new tasks
# This may take some time
tasks_op = toloka_client.create_tasks_async(
    classification_tasks + golden_tasks,
    allow_defaults=True
)
op_res = toloka_client.wait_operation(tasks_op)
print(
    f'Total tasks: {op_res.details["total_count"]}',
    f'Total failed: {op_res.details["failed_count"]}',
    f'Total success: {op_res.details["success_count"]}',
    f'Total valid: {op_res.details["valid_count"]}',
    f'Total not valid: {op_res.details["not_valid_count"]}',
    sep='\n'
)

## 2. Labeling texts into audio files

### 2.1. Recording project

In this project performers will record the given text.

We want slightly different ways of interacting with the task interface on mobile devices (the Toloka app) and in the browser (web-based Toloka), because the Toloka mobile app can use built-in recording functions.

In [None]:
# Toloka assets for using Handlebars engine
recording_assets = toloka.project.view_spec.ClassicViewSpec.Assets(
    script_urls=["$TOLOKA_ASSETS/js/toloka-handlebars-templates.js"]
)

# We will use the Voice Recording preset from the web version. This makes it possible to configure a custom interface.
project_interface = toloka.project.view_spec.ClassicViewSpec(
    script=open('./templates/recording/recording_template.js').read().strip(),
    markup=open('./templates/recording/recording_template.html').read().strip(),
    styles=open('./templates/recording/recording_template.css').read().strip(),
    assets=recording_assets
)


# Setting up the project with defined parameters and interface
recording_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name=open(f'./instructions/recording/{LANGUAGE}_project_name.txt').read().strip(),
    private_comment='Recording texts',
    public_description=open(f'./instructions/recording/{LANGUAGE}_short_instructions.txt').read().strip(),
    public_instructions=open(f'./instructions/recording/{LANGUAGE}_public_instructions.html').read().strip(),
    
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={
            'text': toloka.project.field_spec.StringSpec(),
            'text_id': toloka.project.field_spec.StringSpec(required=False),
        },
        output_spec={
            'audio_record': toloka.project.field_spec.FileSpec()
        },
        view_spec=project_interface,
    ),
)

In [None]:
# Calling the API to create a new project
# If you have already created the project, you can fetch it instead with toloka_client.get_project('your marking project id')
recording_project = toloka_client.create_project(recording_project)
print(f'Created marking project with id {recording_project.id}')
print(f'To view the project, go to: https://{toloka_domain}/requester/project/{recording_project.id}')

<table  align="center">
  <tr><td>
    <img src="./img/recording_project_interface.png"
         alt="vlz_iface"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 4.</b> How performers will see the tasks
  </td></tr>
</table>

<table  align="center">
  <tr><td>
    <img src="./img/recording_project_instruction.png"
         alt="vlz_inst"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 5.</b> How performers will see the instruction
  </td></tr>
</table>

### 2.2. Recording training

Generally speaking, we could add training here and admit only trained performers to the real tasks. However, the task itself is quite simple and its results are accepted manually rather than automatically, so it is easier to overpay a little for the verification project.

**Still, we strongly recommend adding training to your own projects: it improves the quality of the performers' work and keeps unscrupulous performers from wasting your money.**

### 2.3. Recording pool

In this pool, performers will record the text they see.

About some parameters:
* **Manual solution acceptance** - records are not accepted automatically: other performers will verify them first (we will set up a verification project for this later).
* **Overlap** - we need one audio fragment per text (we can increase the overlap if our model works well with duplicated texts).
* We want to **filter performers** by their knowledge of the language of the dataset texts.
* We also need to let performers do tasks both in the mobile app and in the browser.

About quality control:
* We want to ban performers who answer too fast.
* We want to ban performers who fail the captcha suspiciously often.
* We want to ban lazy performers who skip tasks until they find an easy one.
* We want to increase the overlap every time an assignment is rejected (in other words, the text of a rejected record goes back to the recording pool).
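
The last rule can be pictured with a small simulation: every rejection raises the overlap of the corresponding task by one, so the text is handed out for recording again. This is only an illustration of the idea; the function name and the sample events are hypothetical, and in the real pipeline Toloka applies the change itself.

```python
def apply_assessments(overlap_by_task, events):
    """Return new overlaps after processing (task_id, verdict) assessment events."""
    updated = dict(overlap_by_task)
    for task_id, verdict in events:
        if verdict == "REJECT":
            # A rejected record means the text needs one more recording attempt
            updated[task_id] += 1
    return updated


overlaps = {"text_1": 1, "text_2": 1}
events = [("text_1", "ACCEPT"), ("text_2", "REJECT"), ("text_2", "REJECT")]
print(apply_assessments(overlaps, events))
```

In this example `text_2` was rejected twice, so it ends up with an overlap of 3, while `text_1` keeps its single accepted record.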

In [None]:
# Setting up pool
recording_pool = toloka.pool.Pool(
    project_id=recording_project.id,
    private_name='[Recording texts] Recording pool',
    may_contain_adult_content=True,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.015,
    auto_accept_solutions=False,
    auto_accept_period_day=21,
    assignment_max_duration_seconds=60*7,
    filter=toloka.filter.FilterAnd(
        [
            toloka.filter.Languages.in_(LANGUAGE.upper()),
            toloka.filter.FilterOr(
                [
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.BROWSER),
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.TOLOKA_APP)
                ]
            )
        ]
    ),
    defaults=toloka.pool.Pool.Defaults(
        default_overlap_for_new_task_suites=1
    ),
)

# Setting task mixing configuration (1 task per page)
recording_pool.set_mixer_config(
    real_tasks_count=1,
    golden_tasks_count=0,
    training_tasks_count=0
)


# Setting up pool quality control

# Banning performer who answers too fast
recording_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(
        history_size=5, 
        fast_submit_threshold_seconds=30
    ),
    conditions=[toloka.conditions.FastSubmittedCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)

# Banning performer who answers too fast (another case)
recording_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(
        history_size=5, 
        fast_submit_threshold_seconds=15
    ),
    conditions=[toloka.conditions.FastSubmittedCount > 0],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)

# Banning performer by captcha criteria
recording_pool.quality_control.add_action(
    collector=toloka.collectors.Captcha(history_size=5),
    conditions=[toloka.conditions.FailRate >= 60],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=3,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='Captcha'
    )
)

# Banning performer who skips some tasks in a row
recording_pool.quality_control.add_action(
    collector=toloka.collectors.SkippedInRowAssignments(),
    conditions=[toloka.conditions.SkippedInRowCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=15,
        duration_unit='DAYS',
        private_comment='Lazy performer',
    )
)

# Increasing overlap for the task if the assignment was rejected
recording_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentsAssessment(),
    conditions=[toloka.conditions.AssessmentEvent == toloka.conditions.AssessmentEvent.REJECT],
    action=toloka.actions.ChangeOverlap(delta=1, open_pool=True)
)

In [None]:
# Calling the API to create a new pool
recording_pool = toloka_client.create_pool(recording_pool)
print(f'Created pool with id {recording_pool.id}')
print(f'To view the pool, go to: https://{toloka_domain}/requester/project/{recording_project.id}/pool/{recording_pool.id}')

### 2.4. Verification project

In this project, performers will check that the voice records are correct: each record must contain the corresponding text in full and be free of technical defects and noise.

In [None]:
# Interface elements
radio_group_field = tb.fields.RadioGroupFieldV1(
    data=tb.data.OutputData(path='is_correct'),
    label='Is this record correct and does it correspond to the text?',
    validation=tb.conditions.RequiredConditionV1(),
    options=[
        tb.fields.GroupFieldOption(
            label='Yes',
            value='yes'),
        tb.fields.GroupFieldOption(
            label='No',
            value='no')
    ]
)

text_block = tb.view.TextViewV1(
    label='Text', 
    content=tb.data.InputData(path='text')
)

audio_block = tb.view.AudioViewV1(
    url=tb.data.InputData(path='audio_record'),
    label='Record',
    validation=tb.conditions.PlayedFullyConditionV1(),  # make sure the performer listens to the whole record
)

# Creating interface which performers will see using previous elements
project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(
            items=[
                text_block,
                audio_block,
                radio_group_field
            ]
        )
    )
)

# Setting up the project with defined parameters and interface
verification_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name=open(f'./instructions/verification/{LANGUAGE}_project_name.txt').read().strip(),
    private_comment='Verification for recorded texts',
    public_description=open(f'./instructions/verification/{LANGUAGE}_short_instructions.txt').read().strip(),
    public_instructions=open(f'./instructions/verification/{LANGUAGE}_public_instructions.html').read().strip(),
    
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={
            'audio_record': toloka.project.field_spec.StringSpec(),  # we will put URLs instead of files here
            'text': toloka.project.field_spec.StringSpec(),
            'text_id': toloka.project.field_spec.StringSpec(required=False),
            'assignment_id': toloka.project.field_spec.StringSpec(required=False),
        },
        output_spec={
            'is_correct': toloka.project.field_spec.StringSpec(
                allowed_values=[
                    'yes',
                    'no'
                ]
            )
        },
        view_spec=project_interface,
    ),
)

In [None]:
# Calling the API to create a new project
verification_project = toloka_client.create_project(verification_project)
print(f'Created project with id {verification_project.id}')
print(f'To view the project, go to: https://{toloka_domain}/requester/project/{verification_project.id}')

<table  align="center">
  <tr><td>
    <img src="./img/verification_project_interface.png"
         alt="verif_iface"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 6.</b> How performers will see the tasks
  </td></tr>
</table>

<table  align="center">
  <tr><td>
    <img src="./img/verification_project_instruction.png"
         alt="verif_inst"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 7.</b> How performers will see the instruction
  </td></tr>
</table>

### 2.4. Verification training

In [None]:
# Setting up training
verification_training = toloka.training.Training(
    project_id=verification_project.id,
    private_name='[Verification for vocalizing texts] Training',
    may_contain_adult_content=True,
    mix_tasks_in_creation_order=True,
    shuffle_tasks_in_task_suite=True,
    training_tasks_in_task_suite_count=5,
    assignment_max_duration_seconds=60*20,
    task_suites_required_to_pass=1,
    retry_training_after_days=1,
    inherited_instructions=True,
)

# Calling the API to create a new training
verification_training = toloka_client.create_training(verification_training)
print(f'Created training with id {verification_training.id}')
print(f'To view the training, go to: https://{toloka_domain}/requester/project/{verification_project.id}/training/{verification_training.id}')

Let's start working with our object storage.

First of all, you need to create a bucket for your audio files.

The code below will do it for you; all you need is to come up with a unique bucket name that is not taken yet.

**You can find more details about buckets and their naming [here](https://cloud.yandex.com/en-ru/docs/storage/concepts/bucket).**
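
Before calling the storage API, it can help to check a candidate name locally. The sketch below assumes the usual S3-style rules (3 to 63 characters; lowercase Latin letters, digits, hyphens and periods; first and last characters are letters or digits). Treat these rules as an assumption and check the page linked above for the authoritative list; the helper name is hypothetical.

```python
import re

# Assumed S3-style naming rules, see the lead-in above
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name matches the assumed bucket naming rules."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


for candidate in ["voice-records", "Voice_Records", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected if another user has already taken it, so the creation call may fail anyway.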

In [None]:
# this function tries to create bucket with given name
def create_bucket(bucket_name):
    session = boto3.session.Session(
        region_name="us-east-1", 
        aws_secret_access_key=aws_secret_access_key, 
        aws_access_key_id=aws_access_key
    )
    s3 = session.client(
        service_name="s3", 
        endpoint_url="https://storage.yandexcloud.net",
        aws_secret_access_key=aws_secret_access_key, 
        aws_access_key_id=aws_access_key
    )
    try:
        s3.create_bucket(Bucket=bucket_name, ACL='public-read')
        print("Success!")
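    except Exception as exc:
        # A taken name is the most likely cause, but show the actual error as well
        print(f"The bucket was not created, possibly because its name is already taken: {exc}")
        print("Change the name and try again.")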

In [None]:
RECORDS_BUCKET_NAME = 'voice-records'  # change the name here if it is already taken

create_bucket(RECORDS_BUCKET_NAME)

In [None]:
# this function uploads file to a given bucket and returns direct download link for it
def load_image_on_yandex_storage(bucket_name, file_path, img_id=None):
    if img_id is None:
        img_id = os.path.split(file_path)[-1]
    session = boto3.session.Session(
        region_name="us-east-1", 
        aws_secret_access_key=aws_secret_access_key, 
        aws_access_key_id=aws_access_key
    )
    s3 = session.client(
        service_name="s3", 
        endpoint_url="https://storage.yandexcloud.net",
        aws_secret_access_key=aws_secret_access_key, 
        aws_access_key_id=aws_access_key
    )
    s3.upload_file(file_path, bucket_name, img_id)
    return f"https://storage.yandexcloud.net/{bucket_name}/{img_id}"

In [None]:
answers = pd.read_csv(f'./data/{LANGUAGE}_records/answers.tsv', sep='\t', index_col=['filename'])
answers

In [None]:
last_ind = 0
verification_training_tasks = []
dir_path = f'./data/{LANGUAGE}_records'
for record_name in os.listdir(dir_path):
    file_path = os.path.join(dir_path, record_name)
    if not os.path.isfile(file_path):
        continue
    ext = os.path.splitext(file_path)[-1]
    if ext not in ['.mp3', '.aac', '.ogg', '.m4a', '.mp4']:
        continue
    url = load_image_on_yandex_storage(RECORDS_BUCKET_NAME, 
                                       file_path, 
                                       f'{LANGUAGE}/verification_training/verification_training_{last_ind}{ext}')
    last_ind += 1
    hint = answers.loc[record_name].get('hint')
    task = toloka.task.Task(
        input_values={
            'audio_record': url,
            'text': answers.loc[record_name]['text'],
        },
        known_solutions=[
            toloka.task.BaseTask.KnownSolution(
                output_values={'is_correct': answers.loc[record_name]['answer']}
            )
        ],
        pool_id=verification_training.id,
        infinite_overlap=True,
        message_on_unknown_solution="" if pd.isna(hint) else hint
    )
    verification_training_tasks.append(task)
    
    
print(f'You can check that records appeared in your bucket:\nhttps://storage.yandexcloud.net/{RECORDS_BUCKET_NAME}/ (XML)\nor in console mode here:\nhttps://console.cloud.yandex.ru/ (choose Object Storage)')

<table  align="center">
  <tr><td>
    <img src="./img/object_storage.png"
         alt="obj_storage"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 8.</b> How you will see your bucket in the console of the object storage
  </td></tr>
</table>

<table  align="center">
  <tr><td>
    <img src="./img/verification_training.png"
         alt="verif_train"  width="800">
  </td></tr>
  <tr><td align="center">
    <b>Figure 9.</b> How performers will see training tasks
  </td></tr>
</table>

In [None]:
# Calling the API to create new tasks
tasks_op = toloka_client.create_tasks_async(verification_training_tasks)
op_res = toloka_client.wait_operation(tasks_op)
print(
    f'Total tasks: {op_res.details["total_count"]}',
    f'Total failed: {op_res.details["failed_count"]}',
    f'Total success: {op_res.details["success_count"]}',
    f'Total valid: {op_res.details["valid_count"]}',
    f'Total not valid: {op_res.details["not_valid_count"]}',
    sep='\n'
)

### 2.5. Verification pool

In [None]:
# Setting up pool
verification_pool = toloka.pool.Pool(
    project_id=verification_project.id,
    private_name='[Verification for recording texts] Pool',
    may_contain_adult_content=True,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.03,
    auto_accept_solutions=True,
    assignment_max_duration_seconds=60*30,
    defaults=toloka.pool.Pool.Defaults(
        default_overlap_for_new_task_suites=3
    ),
    filter=toloka.filter.FilterAnd(
        [
            toloka.filter.Languages.in_(LANGUAGE.upper()),
            toloka.filter.FilterOr(
                [
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.BROWSER),
                    toloka.filter.ClientType.eq(toloka.filter.ClientType.ClientType.TOLOKA_APP)
                ]
            )
        ]
    ),
)

# Setting task mixing configuration (20 tasks per page)
verification_pool.set_mixer_config(
    real_tasks_count=20,
    golden_tasks_count=0,
    training_tasks_count=0
)


# Setting up pool quality control

# Banning performer who answers too fast
verification_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(
        history_size=5, 
        fast_submit_threshold_seconds=30
    ),
    conditions=[toloka.conditions.FastSubmittedCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)


# Banning performer by captcha criteria
verification_pool.quality_control.add_action(
    collector=toloka.collectors.Captcha(history_size=5),
    conditions=[toloka.conditions.FailRate >= 60],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=3,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='Captcha'
    )
)

# Banning performer who skips some tasks in a row
verification_pool.quality_control.add_action(
    collector=toloka.collectors.SkippedInRowAssignments(),
    conditions=[toloka.conditions.SkippedInRowCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=15,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='Lazy performer',
    )
)

# Banning performer by majority vote criteria
verification_pool.quality_control.add_action(
    collector=toloka.collectors.MajorityVote(history_size=5, answer_threshold=2),
    conditions=[
        toloka.conditions.TotalAnswersCount >= 5,  # at least 5 answers (matches history_size=5)
        toloka.conditions.CorrectAnswersRate < 60,
    ],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Majority vote low quality'
    )
)

verification_pool = toloka_client.create_pool(verification_pool)
print(f'Created pool with id {verification_pool.id}')
print(f'To view the pool, go to: https://{toloka_domain}/requester/project/{verification_project.id}/pool/{verification_pool.id}')

## 3. Running the pipeline

Now we will run the whole pipeline.

If anything is unclear, refer back to the pipeline scheme (Figure 1) at the beginning of this notebook.

Each step of the pipeline is defined as a function below.
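
The waiting helpers in this section (`wait_pool_for_close`, `wait_pool_for_submit`) share one polling pattern: check a condition, sleep, and check again. A minimal sketch of that pattern with an optional timeout is shown here. The name `wait_until` and its parameters are hypothetical and not part of toloka-kit; the demo uses a fake condition instead of a real pool.

```python
import time


def wait_until(predicate, sleep_time=60, timeout=None,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded and False on timeout.
    `sleep` and `clock` are injectable so the helper can be tested quickly.
    """
    start = clock()
    while not predicate():
        if timeout is not None and clock() - start >= timeout:
            return False
        sleep(sleep_time)
    return True


# Self-contained demo: a fake condition that becomes true on the third check
checks = iter([False, False, True])
result = wait_until(lambda: next(checks), sleep_time=0)
print(result)
```

With a real pool, the predicate could be something like `lambda: toloka_client.get_pool(pool.id).is_closed()`, which relies only on calls already used in this notebook. A timeout is one possible way to keep the loop from waiting forever on a pool that never closes.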

In [None]:
# Common pool functions

def wait_pool_for_close(pool, sleep_time=60):
    # updating pool info
    pool = toloka_client.get_pool(pool.id)
    while not pool.is_closed():
        print(
            f'\t{datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} has status {pool.status}.'
        )
        time.sleep(sleep_time)
        # updating pool info
        pool = toloka_client.get_pool(pool.id)
        
        
def wait_pool_for_submit(pool, sleep_time=60, min_count=0):
    request = toloka.search_requests.AssignmentSearchRequest(
        status=toloka.assignment.Assignment.SUBMITTED,  # Only take completed tasks that haven't been accepted or rejected
        pool_id=pool.id,
    )
    while True:
        # updating pool info
        pool = toloka_client.get_pool(pool.id)
        count = len(list(toloka_client.get_assignments(request)))
        print(
            f'\t{datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} has {count} submitted tasks.'
        )
        if count > min_count or pool.is_closed():
            return count
        time.sleep(sleep_time)
        
        
def aggregate_pool_results(pool):
    print(f'Started aggregating results for the pool {pool.id}:\n',
          f'https://{toloka_domain}/requester/operations/project/{pool.project_id}/pool/{pool.id}',
          sep='')
    aggregation_operation = toloka_client.aggregate_solutions_by_pool(
        type='DAWID_SKENE',
        pool_id=pool.id,
        fields=[toloka.aggregation.PoolAggregatedSolutionRequest.Field(name='is_correct')]
    )
    aggregation_operation = toloka_client.wait_operation(aggregation_operation)
    print(f'Finished aggregating results for the pool {pool.id}:\n',
          f'https://{toloka_domain}/requester/operations/project/{pool.project_id}/pool/{pool.id}',
          sep='')
    
    aggregation_result = toloka_client.find_aggregated_solutions(aggregation_operation.id)
    results = aggregation_result.items
    while aggregation_result.has_more:
        aggregation_result = toloka_client.find_aggregated_solutions(
            aggregation_operation.id,
            task_id_gt=aggregation_result.items[-1].task_id,
        )
        results = results + aggregation_result.items
        
    return results


def get_solution_attachment_url(solution):
    attachment = toloka_client.get_attachment(solution.output_values['audio_record'])
    ext = os.path.splitext(attachment.name)[-1]
    if ext not in ['.m4a', '.flac', '.mp3', '.mp4', '.wav', '.ogg', '.wma', '.aac', '.ape', '']:
        return None, False
    if ext == '':
        ext = '.m4a'  # extension by default

    # downloading record on disk
    filename = f"{attachment.id}{ext}"
    with open(filename, "wb+") as f:
        toloka_client.download_attachment(attachment.id, f)
    
    # uploading attachment on Yandex Cloud Object Storage  
    url = load_image_on_yandex_storage(  # we defined this one in verification project section earlier
        RECORDS_BUCKET_NAME, 
        filename, 
        '{}/recording-pool-{}/{}'.format(
            LANGUAGE,
            recording_pool.id,
            filename
        )
    )

    # deleting record from disk
    try:
        os.remove(f'./{filename}')
    except Exception:
        print(f'Failed to remove file {filename}.')
    
    return url, True

In [None]:
# Getting classification results and running recording pool

def prepare_recording_tasks(recording_pool, classification_pool, confidence_lvl=0.6):
    recording_tasks = []
    for result in aggregate_pool_results(classification_pool):
        # finding inputs for aggregated task
        inputs = toloka_client.get_task(result.task_id).input_values

        # if the text is correct with the necessary confidence level, creating a recording task
        if result.output_values['is_correct'] == 'yes' and result.confidence >= confidence_lvl:
            task = toloka.task.Task(
                input_values=inputs,
                pool_id=recording_pool.id
            )

            recording_tasks.append(task)

    print(f'Prepared {len(recording_tasks)} recording tasks for pool:\n',
          f'https://{toloka_domain}/requester/project/{recording_pool.project_id}/pool/{recording_pool.id}',
          sep='')
    return recording_tasks
    
    
def run_recording_pool(recording_pool, recording_tasks):
    # Calling the API to create new tasks
    tasks_op = toloka_client.create_tasks_async(
        recording_tasks, 
        allow_defaults=True,
        open_pool=True
    )
    toloka_client.wait_operation(tasks_op)
    print('Opened pool:\n',
          f'https://{toloka_domain}/requester/project/{recording_pool.project_id}/pool/{recording_pool.id}',
          sep='')

In [None]:
# Getting recording assignments and running verification pool
    
def prepare_verification_tasks(verification_pool):
    verification_tasks = []  # Tasks that we will send for verification
    request = toloka.search_requests.AssignmentSearchRequest(
        status=toloka.assignment.Assignment.SUBMITTED,  # Only take completed tasks that haven't been accepted or rejected
        pool_id=recording_pool.id,
    )
    # Create and store new tasks
    for assignment in toloka_client.get_assignments(request):
        for task, solution in zip(assignment.tasks, assignment.solutions):
            record_url, correct = get_solution_attachment_url(solution)
            if not correct:
                toloka_client.reject_assignment(assignment.id,
                                                'Incorrect format for audio file.')
                continue
            verification_tasks.append(
                toloka.task.Task(
                    input_values={
                        'text': task.input_values['text'],
                        'text_id': task.input_values.get('text_id', ''),
                        'assignment_id': assignment.id,
                        'audio_record': record_url
                    },
                    pool_id=verification_pool.id,
                )
            )
    print(f'Generated {len(verification_tasks)} new verification tasks')
    return verification_tasks


def run_verification_pool(verification_pool, verification_tasks):
    # Calling the API to create new tasks
    verification_tasks_op = toloka_client.create_tasks_async(
        verification_tasks,
        allow_defaults=True
    )
    toloka_client.wait_operation(verification_tasks_op)
    
    verification_tasks_result = [
        task 
        for task in toloka_client.get_tasks(pool_id=verification_pool.id) 
        if not task.known_solutions
    ]

    task_to_assignment = {}
    for task in verification_tasks_result:
        task_to_assignment[task.id] = task.input_values['assignment_id']

    # Open the verification pool
    op_res = toloka_client.open_pool(verification_pool.id)
    op_res = toloka_client.wait_operation(op_res)
    print('Opened pool:\n',
          f'https://{toloka_domain}/requester/project/{verification_pool.project_id}/pool/{verification_pool.id}',
          sep='')
    return task_to_assignment


def set_answers_status(verification_results, task_to_assignment, links, confidence_lvl=0.6):
    print('Started adding results to recording tasks')
    for result in tqdm(verification_results):
        # skipping not needed results
        if result.task_id not in task_to_assignment:
            continue
        # finding inputs for aggregated task
        inputs = toloka_client.get_task(result.task_id).input_values
        assignment_id = inputs['assignment_id']

        # accepting task in recording project if record is correct with the necessary confidence level 
        if result.output_values['is_correct'] == 'yes' and result.confidence >= confidence_lvl:
            try:
                toloka_client.accept_assignment(assignment_id, 
                                                'Well done!')
                links[assignment_id] = inputs['audio_record']  # saving urls for getting results later
            except Exception:
                pass  # Already processed this assignment
        else:
            try:
                toloka_client.reject_assignment(assignment_id,
                                                'Record is incorrect. Check instructions for more details.')
            except Exception:
                pass  # Already processed this assignment
            
        task_to_assignment.pop(result.task_id, None)
            
    print('Finished adding results to recording tasks')
    
    return links

Now we can run the continuous pipeline.

In [None]:
# Run the pipeline

links = {}
# Opening our training pools and starting the first project (classification)
toloka_client.open_pool(classification_training.id)
toloka_client.open_pool(verification_training.id)
toloka_client.open_pool(classification_pool.id)

while True:
    print('\nWaiting for classification pool to close...')
    wait_pool_for_close(classification_pool)
    print(f'Classification pool {classification_pool.id} is finally closed!')
    
    # Preparing tasks for recording project
    if recording_pool.is_closed():
        recording_tasks = prepare_recording_tasks(
            recording_pool, 
            classification_pool,
            confidence_lvl=0.8
        )
        
        # Opening the pool in the recording project with correct texts (from the classification project)
        # We do it once because we waited for all input texts from the classification project
        run_recording_pool(recording_pool, recording_tasks)
    
    # Updating pools info
    recording_pool = toloka_client.get_pool(recording_pool.id)
    verification_pool = toloka_client.get_pool(verification_pool.id)
        
    # Waiting for any submitted tasks
    print('\nWaiting for submitted tasks in the recording pool...')
    submitted = wait_pool_for_submit(recording_pool)
      
    # Make sure all the tasks are done
    if recording_pool.is_closed() and verification_pool.is_closed() and submitted == 0:
        print('All the tasks are done!')
        break
    
    # Preparing tasks for verification project
    verification_tasks = prepare_verification_tasks(verification_pool)
    
    # Adding tasks to the verification pool and opening it
    task_to_assignment = run_verification_pool(verification_pool, verification_tasks)
    
    print('\nWaiting for verification pool to close')
    wait_pool_for_close(verification_pool)
    print(f'Verification pool {verification_pool.id} is finally closed!')
    
    # Getting verification results aggregation
    verification_results = aggregate_pool_results(verification_pool)
    # Rejecting/accepting submitted records (in recording project) based on verification aggregation
    links = set_answers_status(verification_results, 
                               task_to_assignment,
                               links,
                               confidence_lvl=0.6)
    
print(f'Results received at {datetime.datetime.now()}')

P.S.: It is usually cheaper to use a high confidence level in the initial (classification) project and a slightly lower one in the verification project than to do the opposite: incorrect texts are then less likely to travel deep into the pipeline, so less money is spent on recording and verifying them. However, the confidence level directly affects the quality of the resulting dataset, so setting it too low is not advisable.
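
To make the effect of the threshold concrete, here is a small sketch that applies the same filter as `prepare_recording_tasks` and `set_answers_status` (label `'yes'` and confidence at or above the level) to a handful of made-up aggregated results. The `AggregatedResult` tuple, the `select_confident` helper, and all values are hypothetical stand-ins for illustration only; they are not real Toloka output.

```python
from collections import namedtuple

# Hypothetical stand-in for the aggregated solutions used in the pipeline
AggregatedResult = namedtuple('AggregatedResult', ['task_id', 'label', 'confidence'])


def select_confident(results, confidence_lvl):
    """Keep tasks labelled 'yes' whose confidence reaches the threshold."""
    return [
        r.task_id
        for r in results
        if r.label == 'yes' and r.confidence >= confidence_lvl
    ]


# Made-up values, for illustration only
results = [
    AggregatedResult('t1', 'yes', 0.95),
    AggregatedResult('t2', 'yes', 0.70),
    AggregatedResult('t3', 'no', 0.90),
    AggregatedResult('t4', 'yes', 0.55),
]

strict = select_confident(results, 0.8)   # only t1 passes
lenient = select_confident(results, 0.6)  # t1 and t2 pass
print(strict, lenient)
```

In this toy example the stricter level lets through one text instead of two, so fewer tasks reach the paid recording and verification steps.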

### Getting the results

You can download the data from the web interface or use the code snippet below.

In [None]:
def get_recording_results(recording_pool, links=None, download_records=False, path=f'./results/{LANGUAGE}_records'):
    request = toloka.search_requests.AssignmentSearchRequest(
        status=toloka.assignment.Assignment.ACCEPTED,  # Only take completed tasks that have been accepted
        pool_id=recording_pool.id,
    )
    dataset = {
        'text_id': [],
        'text': [],
        'audio_record': []
    }
    
    if download_records:
        # creating directories in path
        try:
            os.makedirs(path)
            print(f'Created directories in path {path}')
        except FileExistsError:
            print(f'Using already existing directory {path}') 
        abs_path = os.path.abspath(path)
        
    
    # Getting results for the recording pool.
    # If download_records=False, the audio_record column will contain the direct urls that we used in the verification project
    # If download_records=True, the audio_record column will contain paths to the records downloaded to disk
    for assignment in toloka_client.get_assignments(request):
        for task, solution in zip(assignment.tasks, assignment.solutions):
            text_id = task.input_values.get('text_id', '')
            text = task.input_values['text']
            dataset['text_id'].append(text_id)
            dataset['text'].append(text)
            if not download_records:
                if links is None:
                    raise Exception('"links" param must be specified when download_records=False')
                record_url = links[assignment.id]
                dataset['audio_record'].append(record_url)
            else:
                attachment = toloka_client.get_attachment(solution.output_values['audio_record'])
                ext = os.path.splitext(attachment.name)[-1]
                if ext == '':
                    ext = '.m4a'  # extension by default

                # downloading record on disk
                filepath = os.path.join(abs_path, f"{attachment.id}{ext}")
                with open(filepath, "wb+") as f:
                    toloka_client.download_attachment(attachment.id, f)
                print(f'Downloaded record: {filepath}')
                
                dataset['audio_record'].append(filepath)
            
    print('Finished getting results.')
    # converting to pandas dataframe (for comfortable .tsv export)
    return pd.DataFrame.from_dict(dataset)

In [None]:
# dataset = get_recording_results(recording_pool, links, download_records=False)

# To keep urls instead of downloading the records, uncomment the call above and comment out the one below
dataset = get_recording_results(recording_pool, download_records=True)

In [None]:
PATH_TO_SAVE = f'./results/{LANGUAGE}_dataset.tsv'
dataset.to_csv(PATH_TO_SAVE, sep='\t', index=False)

### Cleaning up the storage

In this section we will remove from storage the junk records that were used in the pipeline.
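
Because deletion from a bucket cannot be undone, it may help to preview which objects a given combination of flags would remove before calling the storage API. The sketch below reproduces only the selection logic (accepted versus rejected file names) on plain strings. The `select_keys_to_delete` helper and the example keys are hypothetical, and nothing here touches the real bucket.

```python
import posixpath


def select_keys_to_delete(keys, accepted_files,
                          delete_rejected=True, delete_accepted=False):
    """Return the object keys that a cleanup with these flags would delete.

    `keys` are object keys under the recording pool prefix, and
    `accepted_files` holds the file names of accepted records.
    """
    selected = []
    for key in keys:
        filename = posixpath.basename(key)
        if not filename:
            continue  # skip "directory" placeholder keys ending with '/'
        is_accepted = filename in accepted_files
        if (is_accepted and delete_accepted) or (not is_accepted and delete_rejected):
            selected.append(key)
    return selected


# Made-up keys, for illustration only
keys = [
    'en/recording-pool-1/',
    'en/recording-pool-1/a.m4a',
    'en/recording-pool-1/b.m4a',
    'en/recording-pool-1/c.mp3',
]
accepted = {'a.m4a'}
to_delete = select_keys_to_delete(keys, accepted)
print(to_delete)
```

With the default flags only the records that are not in the accepted set are selected, which matches the intent of `delete_rejected=True, delete_accepted=False`.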

In [None]:
# This function deletes files from storage
# You can configure what kind of records you want to delete from your bucket in storage
def cleanup_storage(bucket_name, 
                    links, 
                    recording_pool=None,
                    delete_rejected=True, 
                    delete_accepted=False, 
                    delete_training=False):
    
    session = boto3.session.Session(
        region_name="us-east-1", 
        aws_secret_access_key=aws_secret_access_key, 
        aws_access_key_id=aws_access_key
    )
    s3 = session.client(
        service_name="s3", 
        endpoint_url="https://storage.yandexcloud.net",
        aws_secret_access_key=aws_secret_access_key, 
        aws_access_key_id=aws_access_key
    )
    
    # In this case we need to go to https://storage.yandexcloud.net/<BUCKET>/<LANGUAGE>/recording-pool-<ID>/...
    if delete_rejected or delete_accepted:
        if recording_pool is None:
            raise Exception("recording_pool must be specified to delete ACCEPTED or REJECTED records")
            
        acc_files = set([os.path.split(url)[-1] for _, url in links.items()])
        prefix = f'{LANGUAGE}/recording-pool-{recording_pool.id}/'
        # using the bucket passed to the function (not the global constant)
        objects = s3.list_objects(Bucket=bucket_name, Prefix=prefix)
        
        if 'Contents' in objects:
            for record in objects['Contents']:
                filename = os.path.split(record['Key'])[-1]
                # deleting record with verdict ACCEPTED
                if filename in acc_files and delete_accepted:
                    s3.delete_object(Bucket=bucket_name, Key=prefix + filename)
                    print(f"Deleted .../{bucket_name}/{prefix + filename}")
                # deleting record with verdict REJECTED
                if filename not in acc_files and delete_rejected:
                    s3.delete_object(Bucket=bucket_name, Key=prefix + filename)
                    print(f"Deleted .../{bucket_name}/{prefix + filename}")
            
        # deleting the directory itself (everything inside it had to be deleted first)
        if delete_accepted and delete_rejected:
            s3.delete_object(Bucket=bucket_name, Key=prefix)
            print(f"Deleted .../{bucket_name}/{prefix}")
    
    # deleting verification training records
    # (from https://storage.yandexcloud.net/<BUCKET>/<LANGUAGE>/verification_training/...)
    if delete_training:
        prefix = f'{LANGUAGE}/verification_training/'
        
        objects = s3.list_objects(Bucket=bucket_name, Prefix=prefix)
        if 'Contents' in objects:
            for record in objects['Contents']:
                filename = os.path.split(record['Key'])[-1]
                s3.delete_object(Bucket=bucket_name, Key=prefix + filename)
                print(f"Deleted .../{bucket_name}/{prefix + filename}")
            
        # deleting the directory itself (everything inside it had to be deleted first)
        s3.delete_object(Bucket=bucket_name, Key=prefix)
        print(f"Deleted .../{bucket_name}/{prefix}")

In [None]:
# cleaning up our storage from rejected and training records
cleanup_storage(
    RECORDS_BUCKET_NAME,
    links, 
    recording_pool=recording_pool,
    delete_rejected=True,
    delete_accepted=False,
    delete_training=True
)