# Text classification

We have a set of news article headlines. We need to get these classified according to whether they are clickbait or not.
We ask performers to read a headline and decide whether it’s clickbait.

>**Clickbait headline**: a headline designed to make readers want to click on a hyperlink,
especially when the link leads to content of dubious value. Clickbait titles typically lead to content that is not very useful,
so visitors tend not to stay for long, which is why it's considered bad.

### Call to action
If you find a bug or have an idea for a new feature, don't hesitate to [open a new issue on GitHub](https://github.com/Toloka/toloka-kit/issues/new/choose).
Like our library and examples? Star [our repo on GitHub](https://github.com/Toloka/toloka-kit).

Prepare the environment and import everything we'll need.

In [None]:
%%capture
!pip install toloka-kit==0.1.26
!pip install crowd-kit==1.0.0
!pip install pandas

import datetime
import sys
import time
import logging
import getpass

import pandas
import numpy as np

import toloka.client as toloka
import toloka.client.project.template_builder as tb
from crowdkit.aggregation import DawidSkene

In [None]:
logging.basicConfig(
    format='[%(levelname)s] %(name)s: %(message)s',
    level=logging.INFO,
    stream=sys.stdout,
)

Create a toloka-client instance. All API calls will go through it. Read more about the OAuth token in our [Learn the basics example](https://github.com/Toloka/toloka-kit/tree/main/examples/0.getting_started/0.learn_the_basics) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Toloka/toloka-kit/blob/main/examples/0.getting_started/0.learn_the_basics/learn_the_basics.ipynb)

In [None]:
toloka_client = toloka.TolokaClient(getpass.getpass('Enter your OAuth token: '), 'PRODUCTION') # Or switch to 'SANDBOX'
print(toloka_client.get_requester())

## Create a project
Enter a clear project name and description.
> Note: The project name and description will be visible to the performers.

In [None]:
project = toloka.Project(
    public_name='Is this headline clickbait?',
    public_description='Look at a news headline and decide if it is clickbait or not.',
)

Create the task interface.
> Check the [Interface section](https://toloka.ai/knowledgebase/interface?utm_source=github&utm_medium=site&utm_campaign=tolokakit) of our Knowledge Base for more tips on interface design.

In [None]:
text_viewer = tb.TextViewV1(tb.JoinHelperV1(['Headline: ', tb.InputData('headline')]))

radio_group_field = tb.ButtonRadioGroupFieldV1(
    tb.OutputData('category'),
    [
        tb.GroupFieldOption('clickbait', 'Clickbait'),
        tb.GroupFieldOption('notclickbait', 'Not clickbait'),
    ],
    validation=tb.RequiredConditionV1(hint='you need to select one answer'),
)

task_width_plugin = tb.TolokaPluginV1(
    layout=tb.TolokaPluginV1.TolokaPluginLayout(
        kind='scroll',
        task_width=300,
    )
)

hot_keys_plugin = tb.HotkeysPluginV1(
    key_1=tb.SetActionV1(tb.OutputData('category'), 'clickbait'),
    key_2=tb.SetActionV1(tb.OutputData('category'), 'notclickbait'),
)

project_interface = toloka.project.TemplateBuilderViewSpec(
    view=tb.ListViewV1([text_viewer, radio_group_field]),
    plugins=[task_width_plugin, hot_keys_plugin],
)

For performers, our interface will look like this.

<table  align="center">
  <tr><td>
    <img src="./img/tasks_preview.png"
         alt="Task page"  width="1000">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> What the task page can look like.
  </td></tr>
</table>

Specifications are a description of input data that will be used in a project and the output data that will be collected from the performers.

> Read more about [input and output data specifications](https://yandex.ru/support/toloka-tb/operations/create-specs.html?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in the Requester’s Guide.

In [None]:
input_specification = {'headline': toloka.project.StringSpec()}
output_specification = {'category': toloka.project.StringSpec()}

project.task_spec = toloka.project.task_spec.TaskSpec(
    input_spec=input_specification,
    output_spec=output_specification,
    view_spec=project_interface,
)

Write comprehensive instructions.
> Get more tips on [designing instructions](https://toloka.ai/knowledgebase/instruction?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base.

In [None]:
project.public_instructions = """<h2>About the task</h2>
In this task you need to classify headlines into 2 categories: Clickbait or Not clickbait.<br>
<h2>What is a clickbait headline?</h2>
Clickbait refers to the practice of writing sensationalized or misleading headlines.
A <b>clickbait headline</b> is designed to make readers want to click on a hyperlink,
especially when the link leads to content of dubious value. Clickbait titles typically lead to content that is not very useful,
so visitors tend not to stay for long, which is why it's considered bad.
"""

Create a project.

In [None]:
project = toloka_client.create_project(project)

## Preparing data
This example uses data from the SVM clickbait classifier, which is distributed under the MIT license
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly. "Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media". In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, US, August 2016.


BibTeX:
```
@inproceedings{chakraborty2016stop,
title={Stop Clickbait: Detecting and preventing clickbaits in online news media},
author={Chakraborty, Abhijnan and Paranjape, Bhargavi and Kakarla, Sourya and Ganguly, Niloy},
booktitle={Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on},
pages={9--16},
year={2016},
organization={IEEE}
}
```

Let's load this dataset and split it into three parts.

In [59]:
!curl https://tlk.s3.yandex.net/ext_dataset/clickbait/clickbait_data.csv --output clickbait_data.csv
!curl https://tlk.s3.yandex.net/ext_dataset/clickbait/non_clickbait_data.csv --output non_clickbait_data.csv

clickbait_df = pandas.read_csv('clickbait_data.csv', sep='\t', names=['headline'])
clickbait_df['category'] = 'clickbait'
print(clickbait_df)

non_clickbait_df = pandas.read_csv('non_clickbait_data.csv', sep='\t', names=['headline'])
non_clickbait_df['category'] = 'notclickbait'
print(non_clickbait_df)

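# DataFrame.append was removed in pandas 2.0, so concatenate explicitly.
dataset = pandas.concat([clickbait_df, non_clickbait_df])
# Shuffle the rows and rebuild the index.
dataset = dataset.sample(frac=1).reset_index(drop=True)

# 10 training rows, 20 control (golden) rows, 100 main rows; the rest is unused.
training_dataset = dataset.iloc[:10]
golden_dataset = dataset.iloc[10:30]
main_dataset = dataset.iloc[30:130]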
print(f'\ntraining_dataset - {len(training_dataset)}')
print(f'\ngolden_dataset - {len(golden_dataset)}')
print(f'\nmain_dataset - {len(main_dataset)}')

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  902k  100  902k    0     0  2767k      0 --:--:-- --:--:-- --:--:-- 2759k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  836k  100  836k    0     0  1926k      0 --:--:-- --:--:-- --:--:-- 1926k
                                                headline   category
0                                     Should I Get Bings  clickbait
1          Which TV Female Friend Group Do You Belong In  clickbait
2      The New "Star Wars: The Force Awakens" Trailer...  clickbait
3      This Vine Of New York On "Celebrity Big Brothe...  clickbait
4      A Couple Did A Stunning Photo Shoot With Their...  clickbait
...                                                  ...        ...
15994  There Was A Mini "Sisterhood Of The Traveli

## Create a training pool
Training is an essential part of almost every crowdsourcing project. It allows you to select performers who have really mastered the task, and thus improve quality. Training is also a great tool for scaling your task because you can run it any time you need new performers.

> Read more about [selecting performers](https://toloka.ai/knowledgebase/quality-control?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base.

> Read more about [training pools](https://toloka.ai/en/docs/guide/concepts/train?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Requester’s Guide.

In [None]:
training = toloka.Training(
    project_id=project.id,
    private_name='clickbait training',
    may_contain_adult_content=False,
    assignment_max_duration_seconds=60*30,
    mix_tasks_in_creation_order=False,
    shuffle_tasks_in_task_suite=False,
    training_tasks_in_task_suite_count=10,
    task_suites_required_to_pass=10,
    retry_training_after_days=10,
    inherited_instructions=True,
)
training = toloka_client.create_training(training)

Upload training tasks to the pool.

In [None]:
training_tasks = [
    toloka.Task(
        pool_id=training.id,
        input_values={'headline': row['headline']},
        known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'category': row['category']})],
        message_on_unknown_solution=row['category'],
    )
    for _, row in training_dataset.iterrows()
]
result = toloka_client.create_tasks(training_tasks, allow_defaults=True)
print(len(result.items))

We recommend opening the training pool along with the main pool. Otherwise Tolokers will spend their time on training but get no access to real tasks, which is frustrating. Also, do not forget to close the training pools when there are no main tasks available anymore.

## Create the main pool
A pool is a set of paid tasks grouped into task pages. These tasks are sent out for completion at the same time.

>Note: All tasks within a pool have the same settings (price, quality control, etc.)

Text classification tasks are normally paid as basic tasks because these tasks do not take much time. Read more about [pricing principles](https://toloka.ai/knowledgebase/pricing?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base.

Set an overlap of 3 to get a more confident final label. To understand [how this rule works](https://toloka.ai/en/docs/guide/concepts/mvote?utm_source=github&utm_medium=site&utm_campaign=tolokakit), go to the Requester’s Guide.
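
To illustrate why an overlap of 3 gives a more confident label, here is a minimal sketch of majority voting over three responses per headline. The sample responses and the `majority_vote` helper are hypothetical and only serve as an illustration; this notebook aggregates real responses with the Dawid-Skene model later on.

```python
from collections import Counter

# Hypothetical responses: each headline is labeled by 3 different performers.
responses = {
    'headline A': ['clickbait', 'clickbait', 'notclickbait'],
    'headline B': ['notclickbait', 'notclickbait', 'notclickbait'],
}


def majority_vote(labels):
    """Return the most common label and the share of votes it received."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


final_labels = {
    headline: majority_vote(labels)
    for headline, labels in responses.items()
}
for headline, (label, share) in final_labels.items():
    print(f'{headline}: {label} ({share:.0%} of votes)')
```

With two categories and an odd overlap, a vote like this can never end in a tie.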

Let's add a language filter so that only performers who speak English are invited to complete this task. Then allow both the Toloka web version and the Toloka mobile app as client types, so that performers can complete your task on their computers or mobile devices.

In [None]:
pool = toloka.Pool(
    project_id=project.id,
    # Give the pool any convenient name. You are the only one who will see it.
    private_name='Is this headline clickbait?',
    may_contain_adult_content=False,
    # Set the price per task page.
    reward_per_assignment=0.01,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    # Overlap. This is the number of users who will complete the same task.
    defaults=toloka.Pool.Defaults(default_overlap_for_new_task_suites=3),
    # Time allowed for completing a task page
    assignment_max_duration_seconds=120,
    filter=(
        (toloka.filter.Languages.in_('EN')) &
        (
            (toloka.filter.ClientType == 'TOLOKA_APP') |
            (toloka.filter.ClientType == 'BROWSER')
        )
    ),
)

Attach the training you created earlier and select the accuracy level that is required to reach the main pool. This means that Tolokers who got less than 90% accuracy will not see this pool.

In [None]:
pool.set_training_requirement(training_pool_id=training.id, training_passing_skill_value=90)
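
As a rough illustration of what a passing value of 90 means, the sketch below assumes that the training skill is simply the percentage of correct training answers. The `passes_training` helper and the sample numbers are hypothetical; Toloka computes the actual skill on its side.

```python
def passes_training(correct_answers, total_answers, passing_skill_value=90):
    """Check a hypothetical training result against the passing threshold."""
    accuracy = 100 * correct_answers / total_answers
    return accuracy >= passing_skill_value


# With 10 training answers, at most one mistake keeps the accuracy at 90% or more.
results = {correct: passes_training(correct, total_answers=10) for correct in (10, 9, 8)}
for correct, passed in results.items():
    print(f'{correct}/10 correct -> {"access granted" if passed else "no access"}')
```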

Set up [Quality control](https://toloka.ai/en/docs/guide/concepts/control?utm_source=github&utm_medium=site&utm_campaign=tolokakit):
 - Ban performers who give incorrect responses to control tasks. Since tasks such as these have an answer that can be used as ground truth, we can use standard quality control rules like golden sets.
 - Set up the Fast responses rule. This rule allows you to ban performers who submit tasks at a suspiciously high speed.

Read more about [quality control principles](https://toloka.ai/knowledgebase/quality-control?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in our Knowledge Base or check out [control tasks settings](https://toloka.ai/en/docs/guide/concepts/goldenset?utm_source=github&utm_medium=site&utm_campaign=tolokakit) in the Requester’s Guide.

In [None]:
pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=30),
    conditions=[toloka.conditions.FastSubmittedCount >= 2],
    action=toloka.actions.RestrictionV2(
        scope='POOL',
        duration_unit='PERMANENT',
        private_comment='bad quality'
    )
)

pool.quality_control.add_action(
    collector=toloka.collectors.GoldenSet(history_size=10),
    conditions=[
        toloka.conditions.GoldenSetCorrectAnswersRate <= 90.0,
        toloka.conditions.GoldenSetAnswersCount >= 1
    ],
    action=toloka.actions.RestrictionV2(
        scope='POOL',
        duration_unit='PERMANENT',
        private_comment='bad quality'
    )
)
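
The two rules above can be read as simple checks over a performer's recent history. The sketch below mimics that logic in plain Python to show how the thresholds interact; it is not how Toloka evaluates the rules internally. The helper names and sample histories are hypothetical, and the sketch assumes that a response counts as fast when it is submitted in fewer than `fast_submit_threshold_seconds` seconds.

```python
def banned_for_fast_responses(submit_times_seconds, history_size=5,
                              fast_submit_threshold_seconds=30, fast_count_limit=2):
    """Mimic the Fast responses rule on the most recent task pages."""
    recent = submit_times_seconds[-history_size:]
    fast_count = sum(1 for seconds in recent if seconds < fast_submit_threshold_seconds)
    return fast_count >= fast_count_limit


def banned_for_control_tasks(control_results, history_size=10,
                             max_correct_rate=90.0, min_answers=1):
    """Mimic the control tasks rule on the most recent control answers."""
    recent = control_results[-history_size:]
    if len(recent) < min_answers:
        return False
    correct_rate = 100 * sum(recent) / len(recent)
    return correct_rate <= max_correct_rate


# Two of the last five pages were submitted in under 30 seconds.
print(banned_for_fast_responses([45, 12, 50, 8, 60]))
# Nine correct control answers out of the last ten give exactly 90%.
print(banned_for_control_tasks([True] * 9 + [False]))
```

Under this reading, a single wrong control answer among the last 10 is enough for a ban, because 9 out of 10 is 90%, which satisfies the `<= 90.0` condition.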

Specify the number of tasks per page. We recommend putting as many tasks on one page as a performer can complete in 1 to 5 minutes. That way, performers are less likely to get tired, and they won’t lose a significant amount of data if a technical issue occurs.

To learn more about [grouping tasks](https://toloka.ai/en/docs/guide/concepts/distribute-tasks-by-pages?utm_source=github&utm_medium=site&utm_campaign=tolokakit) into suites, read the Requester’s Guide.

In [None]:
pool.set_mixer_config(
    real_tasks_count=4,
    golden_tasks_count=1,
)
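
With these settings it is easy to estimate how many task pages and responses to expect. The sketch below is a back-of-the-envelope calculation using the values from this example (100 main tasks, 4 real tasks per page, overlap of 3, 0.01 per task page); the variable names are made up, and the estimate ignores any platform fee and assumes every page is completed exactly `overlap` times.

```python
import math

main_tasks = 100              # rows in main_dataset
real_tasks_per_page = 4       # real_tasks_count in the mixer config
overlap = 3                   # default_overlap_for_new_task_suites
reward_per_assignment = 0.01  # price per completed task page

task_pages = math.ceil(main_tasks / real_tasks_per_page)
assignments = task_pages * overlap
responses = main_tasks * overlap
estimated_rewards = assignments * reward_per_assignment

print(f'task pages: {task_pages}')
print(f'assignments: {assignments}')
print(f'responses to real tasks: {responses}')
print(f'estimated rewards: {estimated_rewards:.2f}')
```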

Create the pool.

In [None]:
pool = toloka_client.create_pool(pool)

## Preparing and uploading tasks

> Note: Control tasks are tasks that already contain the correct response. They are used for checking the quality of responses from performers. The performer's response is compared to the response you provided. If they match, it means the performer answered correctly.

In [None]:
golden_tasks = [
    toloka.task.Task(
        pool_id=pool.id,
        input_values={'headline': row['headline']},
        known_solutions=[
            toloka.task.BaseTask.KnownSolution(
                output_values={'category': row['category']}
            )
        ],
        infinite_overlap=True,
    )
    for _, row in golden_dataset.iterrows()
]
tasks = [
    toloka.task.Task(
        pool_id=pool.id,
        input_values={'headline': row['headline']},
    )
    for _, row in main_dataset.iterrows()
]
created_tasks = toloka_client.create_tasks(golden_tasks + tasks, allow_defaults=True)
print(len(created_tasks.items))

Start the pool and the training.

**Important.** Remember that real Toloka performers will complete the tasks.
Double-check that your project configuration is correct before you start the pool.

In [None]:
training = toloka_client.open_training(training.id)
print(training.status)
pool = toloka_client.open_pool(pool.id)
print(pool.status)

## Receiving responses

Wait until the pool is completed.

In [None]:
pool_id = pool.id

def wait_pool_for_close(pool_id, minutes_to_wait=1):
    sleep_time = 60 * minutes_to_wait
    pool = toloka_client.get_pool(pool_id)
    while not pool.is_closed():
        op = toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])
        op = toloka_client.wait_operation(op)
        percentage = op.details['value'][0]['result']['value']
        print(
            f'   {datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} - {percentage}%'
        )
        time.sleep(sleep_time)
        pool = toloka_client.get_pool(pool.id)
    print('Pool was closed.')

wait_pool_for_close(pool_id)

Get responses

When all the tasks are completed, look at the responses from performers.

In [60]:
answers = []

for assignment in toloka_client.get_assignments(pool_id=pool_id, status='ACCEPTED'):
    for task, solution in zip(assignment.tasks, assignment.solutions):
        if not task.known_solutions:
            answers.append([task.input_values['headline'], solution.output_values['category'], assignment.user_id])

print(f'answers count: {len(answers)}')

answers count: 300


Aggregate the results using the Dawid-Skene model. We use this aggregation model because our questions are of comparable difficulty, and we don't have many control tasks.

Read more about the [Dawid-Skene model](https://toloka.ai/en/docs/guide/concepts/result-aggregation?utm_source=github&utm_medium=site&utm_campaign=tolokakit#aggr__dawid-skene) in the Requester’s Guide or get an overview of different [aggregation models](https://toloka.ai/knowledgebase/aggregation) in our Knowledge Base.

You can find more aggregation models in [Crowd-Kit](https://github.com/Toloka/crowd-kit#crowd-kit-computational-quality-control-for-crowdsourcing).

In [61]:
# Prepare dataframe
answers_df = pandas.DataFrame(answers, columns=['task', 'label', 'worker'])
# Run aggregation
predicted_answers = DawidSkene(n_iter=20).fit_predict(answers_df)

print(predicted_answers)

task
The Cast Of "The Office" Reimagined As Disney Characters          notclickbait
An Income Gap in Who May Lose TV                                  notclickbait
Obama Speech Got Harsh Reviews on Bush Plane to Texas             notclickbait
22 Reasons Cows Should Be Your Favorite Animal                       clickbait
You'll Feel Really Dumb When You See How Simple Bagged Milk Is       clickbait
                                                                      ...     
US actor Gary Coleman dies aged 42                                notclickbait
If Marvel Superheroes Had Kids                                       clickbait
Kyrgyzstan to Give U.S. 6 Months to Leave Base                    notclickbait
Two arrests made in Zotob worm attack                             notclickbait
Puppies Eat Peanut Butter For The First Time                         clickbait
Length: 100, dtype: object
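
Because the source dataset already contains a category for every headline, the aggregated labels can be checked against it. The sketch below shows one way to do that on hypothetical stand-in data; the `accuracy` helper and the sample headlines are made up, and it assumes that headlines are unique so they can serve as dictionary keys.

```python
# Hypothetical stand-ins. In the notebook you could build them with
# predicted_answers.to_dict() and
# dict(zip(main_dataset['headline'], main_dataset['category'])).
predicted = {
    'headline 1': 'clickbait',
    'headline 2': 'notclickbait',
    'headline 3': 'clickbait',
    'headline 4': 'notclickbait',
}
ground_truth = {
    'headline 1': 'clickbait',
    'headline 2': 'notclickbait',
    'headline 3': 'notclickbait',
    'headline 4': 'notclickbait',
}


def accuracy(predicted, ground_truth):
    """Share of headlines whose aggregated label matches the known category."""
    matches = sum(
        1 for headline, label in predicted.items()
        if ground_truth.get(headline) == label
    )
    return matches / len(predicted)


score = accuracy(predicted, ground_truth)
print(f'accuracy: {score:.1%}')
```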
