# Audio transcription
We have a set of audio recordings.
We need to obtain a transcription of each recording.
We ask Tolokers to listen to the recordings and type what they hear.

### Call to action
If you find a bug or have an idea for a new feature, don't hesitate to [open a new issue on GitHub](https://github.com/Toloka/toloka-kit/issues/new/choose).
Like our library and examples? Star [our repo on GitHub](https://github.com/Toloka/toloka-kit).

Prepare the environment and import everything we'll need.

In [None]:
%%capture
!pip install toloka-kit==0.1.26
!pip install crowd-kit==1.0.0
!pip install pandas
!pip install numpy
!pip install sentence-transformers
!pip install nltk

import datetime
import sys
import logging
import getpass

import pandas
import numpy as np
from sentence_transformers import SentenceTransformer

import toloka.client as toloka
import toloka.client.project.template_builder as tb
from crowdkit.aggregation import TextHRRASA

In [None]:
logging.basicConfig(
    format='[%(levelname)s] %(name)s: %(message)s',
    level=logging.INFO,
    stream=sys.stdout,
)

Create a toloka-client instance. All API calls will go through it. You can read more about the OAuth token in our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)

In [None]:
toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'
print(toloka_client.get_requester())

## Creating new project
Enter a clear project name and description.

In [60]:
project = toloka.Project(
    assignments_issuing_type='AUTOMATED',
    public_name='Transcribe the recording',
    public_description='Type what you hear in the recording',
)

Create the task interface.

In [4]:
audio_player = tb.AudioViewV1(
    tb.InputData('audio'),
    validation=tb.PlayedFullyConditionV1(hint='You didn\'t listen to the recording')
)

input_text = tb.TextFieldV1(
    tb.OutputData('result'),
    validation=tb.SchemaConditionV1(
        schema={
            'type': 'string',
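            # Note: this pattern also accepts uppercase letters and '*';
            # answers are lowercased later, before they are checked and aggregated.
            'pattern': r'^[a-zA-Z\*\s]+$'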
        },
        hint='Use only lowercase letters and spaces',
    )
)

task_width_plugin = tb.TolokaPluginV1('scroll', task_width=900)

play_hotkey = tb.HotkeysPluginV1(key_q=tb.PlayPauseActionV1('view.items.0'))

project_interface = toloka.project.TemplateBuilderViewSpec(
    view=tb.ListViewV1(items=[audio_player, input_text]),
    plugins=[task_width_plugin, play_hotkey],
)
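
To see what the `pattern` in the text field validation lets through, the same expression can be tried locally. This is a minimal sketch, assuming Python's `re` module treats this simple pattern the same way as the regex engine behind the task interface; the helper name `passes_validation` is hypothetical and not part of toloka-kit.

```python
import re

# Same pattern as in the SchemaConditionV1 above.
PATTERN = re.compile(r'^[a-zA-Z\*\s]+$')


def passes_validation(text: str) -> bool:
    """Return True if the text consists only of Latin letters, '*' and whitespace."""
    return PATTERN.match(text) is not None


samples = [
    'my cat is awesome',
    'My cat is awesome',
    "that's not my problem",
    'i paid 20 dollars',
    '',
]
for sample in samples:
    print(repr(sample), passes_validation(sample))
```

The pattern rejects apostrophes, digits and empty input, but it accepts uppercase letters, so the lowercase rule from the instructions is effectively enforced later, when answers are lowercased during checking and post-processing.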

Set the data specification and attach the task interface to the project.

In [63]:
input_specification = {'audio': toloka.project.field_spec.UrlSpec()}
output_specification = {'result': toloka.project.field_spec.StringSpec()}

project.task_spec = toloka.project.task_spec.TaskSpec(
    input_spec=input_specification,
    output_spec=output_specification,
    view_spec=project_interface,
)

Write short and simple instructions.

In [64]:
project.public_instructions = """<p><font color="red">The training has 7 tasks, the main pool will open upon successful training completion.</font></p>
<p>Each page includes several tasks with short audio recordings.</p>
<p>Listen to them and type the text that you hear.</p>
<p>Use headphones to hear the speech more clearly.</p>
<p>Please listen carefully and replay the recording in problem cases. When you finish transcribing, carefully reread the text.</p>
<p>Use a dictionary if you are not sure how a word is spelled.</p>
<br>
<p><b>Basic rules:</b></p>
<ul><li>use only lowercase letters (for example, "my cat is awesome" instead of "My cat is awesome")</li>
<li>do not use punctuation, special symbols, apostrophes, or quotation marks (for example, "thats not my problem" instead of "That's not my problem!")</li>
<li>specify numbers and signs in words (for example, "i paid twenty dollars" instead of "I paid 20$")</li></ul>"""

Create a project.

In [65]:
project = toloka_client.create_project(project)

## Review the dataset
We will use the Noisy speech database. This dataset is distributed under a Creative Commons Attribution 4.0 International license. [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

A description of the database is available here: [https://datashare.ed.ac.uk/handle/10283/2791](https://datashare.ed.ac.uk/handle/10283/2791)

>Valentini-Botinhao, Cassia. (2017). Noisy speech database for training speech enhancement algorithms and TTS models, 2016 [sound]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR). https://doi.org/10.7488/ds/2117.

In [2]:
!curl https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_speech.tsv --output dataset.tsv

dataset = pandas.read_csv('dataset.tsv', sep='\t')
dataset = dataset.sample(frac=1).reset_index(drop=True)

with pandas.option_context("max_colwidth", 100):
    display(dataset)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   98k  100   98k    0     0   620k      0 --:--:-- --:--:-- --:--:--  620k


| | file_url | text |
|---|---|---|
| 0 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p257_354.wav | It's a long process. |
| 1 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_009.wav | There is , according to legend, a boiling pot of gold at one end. |
| 2 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_142.wav | I do not want to have a fight with Mr Hastings. |
| 3 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_321.wav | Davis is very supportive, as a director. |
| 4 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_324.wav | We think all other measures are not exhausted. |
| ... | ... | ... |
| 819 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p257_403.wav | It doesn't always work that way. |
| 820 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_125.wav | I'm so angry. |
| 821 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_194.wav | But a final decision will not be taken until after the elections. |
| 822 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_356.wav | That's the principal difference between an artist and a dog. |


According to our instructions, the text must be in lowercase and must not contain punctuation, so we normalize the reference transcriptions the same way.

In [10]:
dataset['text'] = dataset['text'].str.lower().replace('[.,\\-\'"!?]', '', regex=True)
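
The effect of this normalization is easy to check on a couple of rows from the table above. The following is a small sketch that applies the same character class with the standard `re` module instead of pandas; the helper name `normalize` is hypothetical.

```python
import re

# Same character class as in the pandas call above: . , - ' " ! ?
PUNCTUATION = re.compile('[.,\\-\'"!?]')


def normalize(text: str) -> str:
    """Lowercase the text and drop the listed punctuation characters."""
    return PUNCTUATION.sub('', text.lower())


print(normalize("It's a long process."))
print(normalize('There is , according to legend, a boiling pot of gold at one end.'))
```

Characters are removed without touching the surrounding whitespace, so a fragment like `is , according` keeps both spaces. Collapsing repeated spaces would be an extra step that the notebook does not perform.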

Divide the dataset into two parts: one for control tasks and one for real tasks.

In [25]:
real_tasks_number = 50
dataset = dataset.sample(frac=1).reset_index(drop=True)
# The first 10 rows become control tasks, the next `real_tasks_number` rows become real tasks
golden_dataset = dataset.iloc[:10]
task_dataset = dataset.iloc[10:10 + real_tasks_number]

## Create the main pool
Specify the [pool parameters](https://toloka.ai/en/docs/guide/concepts/pool_poolparams?utm_source=github&utm_medium=site&utm_campaign=tolokakit).

In [82]:
pool = toloka.Pool(
    project_id=project.id,
    # Give the pool any name you find suitable. You are the only one who will see it.
    private_name='Transcribe the recording',
    may_contain_adult_content=True,
    # Set the price per task suite.
    reward_per_assignment=0.03,
    # We'll check answers later in this notebook
    auto_accept_solutions=False,
    auto_accept_period_day=1,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    # Overlap. This is the number of users who will complete the same task.
    # We will aggregate the results after the pools are completed.
    defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=3),
    # Time allowed for completing a task page
    assignment_max_duration_seconds=900,
    # Select English-speaking Tolokers
    filter=toloka.filter.Languages.in_('EN'),
)

Set up [Quality control](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=site&utm_campaign=tolokakit).

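Set up the Fast responses quality control rule: a Toloker who submits a page in under 40 seconds is restricted from the project for 5 days. The second rule increases the overlap by 1 and reopens the pool whenever an assignment is rejected.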

In [83]:
pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=40),
    conditions=[toloka.conditions.FastSubmittedCount >= 1],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=5,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='fast responses'
    )
)
pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentsAssessment(),
    conditions=[toloka.conditions.AssessmentEvent == 'REJECT'],
    action=toloka.actions.ChangeOverlap(delta=1, open_pool=True),
)
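
The fast responses rule can be pictured as a sliding window over a Toloker's most recent pages. The sketch below is a simplified local model of that idea, not Toloka's actual implementation; the function name `should_restrict` is hypothetical, and treating "fast" as strictly below the threshold is an assumption.

```python
from collections import deque

HISTORY_SIZE = 5                # history_size from the collector above
FAST_THRESHOLD_SECONDS = 40     # fast_submit_threshold_seconds from the collector above


def should_restrict(durations_seconds):
    """Simplified model: restrict if any of the most recent pages was submitted too fast."""
    # Keep only the last HISTORY_SIZE submissions.
    recent = deque(durations_seconds, maxlen=HISTORY_SIZE)
    fast_count = sum(1 for duration in recent if duration < FAST_THRESHOLD_SECONDS)
    # Mirrors the condition FastSubmittedCount >= 1.
    return fast_count >= 1


print(should_restrict([120, 95, 210]))               # no fast pages
print(should_restrict([120, 35, 210]))               # one fast page in the window
print(should_restrict([35, 120, 95, 210, 80, 150]))  # the fast page is older than the window
```

With a threshold of a single fast page, the rule is strict: one rushed submission within the window is enough to trigger the restriction.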

Specify the number of tasks per page. For example: 4 main tasks and 1 control task.

In [84]:
pool.set_mixer_config(
    real_tasks_count=4,
    golden_tasks_count=1,
)
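
With these settings it is possible to estimate the size and cost of the pool before opening it. The numbers below are taken from the cells above; the estimate is a rough sketch that assumes the last, partially filled page is also issued, and it ignores the Toloka commission and any extra assignments created when rejected answers increase the overlap.

```python
import math

real_tasks = 50          # real_tasks_number
tasks_per_page = 4       # real_tasks_count in the mixer config
golden_per_page = 1      # golden_tasks_count in the mixer config
overlap = 3              # default_overlap_for_new_task_suites
reward_per_page = 0.03   # reward_per_assignment, in USD

pages = math.ceil(real_tasks / tasks_per_page)
assignments = pages * overlap
budget = assignments * reward_per_page
golden_share = golden_per_page / (tasks_per_page + golden_per_page)

print(f'pages: {pages}, assignments: {assignments}, budget: ${budget:.2f}')
print(f'control tasks per page: {golden_share:.0%}')
```

One control task out of five tasks on a page gives a 20% share of control tasks per page.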

Create the pool.

In [85]:
pool = toloka_client.create_pool(pool)

## Preparing and uploading tasks
Create control tasks. In small pools, control tasks should account for 10–20% of all tasks.

In [86]:
golden_tasks = [
    toloka.task.Task(
        pool_id=pool.id,
        input_values={'audio': row.file_url},
        known_solutions=[
            toloka.task.BaseTask.KnownSolution(
                output_values={'result': row.text}
            )
        ],
        infinite_overlap=True,
    )
    for row in golden_dataset.itertuples()
]

Create the pool tasks.

In [87]:
tasks = [
    toloka.task.Task(
        pool_id=pool.id,
        input_values={'audio': url},
    )
    for url in task_dataset['file_url']
]

Upload the tasks.

In [88]:
created_tasks = toloka_client.create_tasks(golden_tasks + tasks, allow_defaults=True)
print(len(created_tasks.items))

60


If you visit the pool right now and click `preview`, you can see this task interface:

<table  align="center">
  <tr><td>
    <img src="./img/task_interface.png"
         alt="Possible task interface"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> Task interface.
  </td></tr>
</table>

## Receiving responses
It is hard to tell whether a Toloker answered honestly or just typed some random text. To check this, we will compare answers on control tasks with the known transcriptions using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), and we will use `toloka.streaming` to check answers as soon as they are submitted.

In [None]:
!pip install python-Levenshtein
import Levenshtein

from toloka.streaming import Pipeline, AssignmentsObserver

A small example of using the Levenshtein distance:

In [42]:
print(Levenshtein.distance(
    "yes it was a man who asked a stranger",
    "sjkkshjk")
)
print(Levenshtein.distance(
    "for a full minute he crouched and listened",
    "for full minute she cruched and listened")
)

33
4
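
For intuition about what the library computes, here is a minimal pure-Python sketch of the classic dynamic-programming algorithm for the Levenshtein distance. It is meant only as an illustration and is much slower than the `python-Levenshtein` package used in this notebook; the function name `levenshtein` is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))
print(levenshtein(
    'for a full minute he crouched and listened',
    'for full minute she cruched and listened',
))
```

The second call uses the same pair of strings as the cell above: dropping `a `, adding `s` and dropping `o` accounts for the four edits.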


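Prepare a function for accepting and rejecting answers. It works as intended only for task suites with exactly one control task, because the decision for each control task on a page overwrites the decision for the previous one.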

In [90]:
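def same_string(ground_truth, candidate, distance=5):
    # Edit distance between the known transcription and the Toloker's answer, ignoring case
    res = Levenshtein.distance(ground_truth.lower(), candidate.lower())
    # Returns (accept, ban): accept if at most `distance` edits away, ban if 10 or more edits away
    return res <= distance, res >= 10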

def handle_submitted(events):
    for event in events:
        accept = False
        ban = True
        for task, solution in zip(event.assignment.tasks, event.assignment.solutions):
            if not task.known_solutions:
                continue
            answer = solution.output_values['result']
            ground_truth = task.known_solutions[0].output_values['result']
            accept, ban = same_string(ground_truth, answer)
            print(f'\n{ground_truth}\n{answer}\naccept - {accept}, ban - {ban}')

        if ban:
            toloka_client.set_user_restriction(
                toloka.user_restriction.PoolUserRestriction(
                    user_id=event.assignment.user_id,
                    private_comment='Toloker often makes mistakes',
                    pool_id=event.assignment.pool_id,
                )
            )
        if accept:
            toloka_client.accept_assignment(event.assignment.id, 'Well done!')
        else:
            toloka_client.reject_assignment(event.assignment.id, 'Wrong answer on control task')
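
The two thresholds in `same_string` split answers on the control task into three bands. The sketch below restates that logic as a single function so the bands are easy to see; the name `decide` and its parameters are hypothetical and are not used elsewhere in the notebook.

```python
def decide(distance: int, accept_threshold: int = 5, ban_threshold: int = 10) -> str:
    """Map an edit distance on the control task to the action taken by the handler."""
    if distance <= accept_threshold:
        return 'accept'
    if distance >= ban_threshold:
        return 'reject and ban'
    return 'reject'


for distance in (0, 5, 6, 9, 10, 33):
    print(distance, decide(distance))
```

Answers that are 6 to 9 edits away from the known transcription are rejected, but the Toloker keeps access to the pool.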

Build the pipeline and open the pool.

In [None]:
observer = AssignmentsObserver(toloka_client, pool_id=pool.id)
observer.on_submitted(handle_submitted)

pipeline = Pipeline()
pipeline.register(observer)

pool = toloka_client.open_pool(pool.id)
print(f'pool - {pool.status}')

After some time, start the pipeline.

In [None]:
# Google Colab already runs a global event loop,
# so to run our pipeline we apply nest_asyncio, which allows nested event loops
if 'google.colab' in str(get_ipython()):
    import nest_asyncio, asyncio
    nest_asyncio.apply()
    asyncio.get_event_loop().run_until_complete(pipeline.run())
else:
    await pipeline.run()

Get responses

When all the tasks are completed, look at the responses from Tolokers.

In [None]:
answers_df = toloka_client.get_assignments_df(pool.id)

gold_df = answers_df[~answers_df['GOLDEN:result'].isna()]
answers_df.drop(gold_df.index, inplace=True)

answers_df = answers_df.rename(columns={
    'INPUT:audio': 'task',
    'OUTPUT:result': 'output',
    'ASSIGNMENT:worker_id': 'worker'
})

answers_df = answers_df[['task', 'output', 'worker']]
answers_df['output'] = answers_df['output'].str.lower().replace('[.,\\-\'"!?]', '', regex=True)

with pandas.option_context("max_colwidth", 100):
    display(answers_df)

Aggregate the results using the [TextHRRASA](https://dl.acm.org/doi/10.1145/3397271.3401239) model from [Crowd-Kit](https://github.com/Toloka/crowd-kit#crowd-kit).

In [None]:
# Run aggregation
encoder = SentenceTransformer('paraphrase-distilroberta-base-v1')
hrrasa = TextHRRASA(lambda *args, **kwargs: encoder.encode(*args, show_progress_bar=False, **kwargs))
result_df = hrrasa.fit_predict(answers_df).reset_index(name='result')

Look at the results.

In [96]:
joined_df = result_df.merge(dataset, left_on='task', right_on='file_url', how='left')
for row in joined_df.itertuples():
    print(f"{row.task}\n{row.text}\n{row.result}\n\n")

https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_024.wav
this is a very common type of bow one showing mainly red and yellow with little or no green or blue
this is a very common type of bow one showing mainly red and yellow with little or no green or blue


https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_035.wav
other members of the family were too upset to comment last night
Other members of the family were too upset to comment last night


https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_056.wav
because we do not need it
because we do not need it


https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_072.wav
that was the easy election
that was the easy election


https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_090.wav
downing street will make the second appointment in the scotland office today
downing street will make the second appointment in the scotland office tod