# Annotation Task Demo Project

This notebook showcases a project in which the user annotates a dataset with multiple prompts using alfred.
We follow the experimental setup described in Smith et al. (2022).

Ref:
Smith, R., Fries, J. A., Hancock, B., & Bach, S. H. (2022). Language models in the loop: Incorporating prompting into weak supervision. arXiv preprint arXiv:2205.02318.

## 1. Load Dataset

In [None]:
from alfred.data.wrench import WrenchBenchmarkDataset

youtube_train = WrenchBenchmarkDataset(
    dataset_name='youtube',
    split='train',
    local_path="/data/Datasets/wrench/"
)

## 2. Run an Alfred Client

In [None]:
from alfred import Client


T5 = Client(model_type='huggingface', model='t5-small')

## 3. Develop the labeling prompts and their voters

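Each prompt asks a yes/no question. For some prompts a "yes" answer is a vote for spam, and for the others a "yes" answer is a vote for non-spam; a "no" answer is a vote for the opposite class. For convenience, we therefore need only two voters: one for each direction.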

In [None]:
from alfred.template import StringTemplate
from alfred.voter import Voter


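label2idx = {"SPAM":1, "HAM":0}

# For prompts where a "yes" response is a vote for SPAM (1)
yes_voter = Voter(
    label_map = {'yes': 1, 'no': 0},
    matching_fn = lambda x, y: x == y,
)

# For prompts where a "yes" response is a vote for HAM (0)
no_voter = Voter(
    label_map = {'no': 1, 'yes': 0},
    matching_fn = lambda x, y: x == y,
)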

voters = []
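
To make the two directions concrete, here is a minimal plain-Python sketch of the mapping the voters are meant to perform. It does not use alfred and is not how `Voter` is implemented; `map_responses` is a hypothetical helper, and the whitespace and case normalisation is an assumption of this sketch (the voters above use exact equality).

```python
def map_responses(responses, label_map):
    """Map each raw model response to a label index via a dictionary lookup."""
    votes = []
    for response in responses:
        # Normalise whitespace and case before the lookup (assumption of this sketch).
        key = response.strip().lower()
        votes.append(label_map[key])
    return votes


# Same label maps as the two voters above.
yes_map = {"yes": 1, "no": 0}  # "yes" is a vote for SPAM
no_map = {"no": 1, "yes": 0}   # "yes" is a vote for HAM

sample_responses = ["yes", "no", "Yes "]  # made-up model outputs
yes_votes = map_responses(sample_responses, yes_map)
no_votes = map_responses(sample_responses, no_map)
```

The same list of responses yields opposite votes under the two maps, which is why each template only needs to be paired with one of the two voters.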

In [None]:
channel_reference_template = StringTemplate(
    template = """Does the following comment reference the speaker’s channel or video?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(yes_voter)
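
The `[[text]]` placeholder in a template is where the comment text goes. As a rough illustration only, the substitution could look like the sketch below. `fill_template` is a hypothetical helper rather than part of alfred, and the example assumes each data point exposes a `text` field.

```python
def fill_template(template, example):
    """Replace each [[field]] placeholder with the matching value from example."""
    prompt = template
    for field, value in example.items():
        prompt = prompt.replace(f"[[{field}]]", str(value))
    return prompt


template = "Does the following comment have a URL?\n\n[[text]]"
example = {"text": "Check out my channel at http://example.com"}  # made-up comment
prompt = fill_template(template, example)
```

In this sketch only the double-bracket form is substituted, so a placeholder written with single brackets would be left in the prompt unchanged.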

In [None]:
subscribe_template = StringTemplate(
    template = """Does the following comment ask you to subscribe to a channel?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(yes_voter)

In [None]:
url_template = StringTemplate(
    template = """Does the following comment have a URL?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(yes_voter)

In [None]:
reader_action_template = StringTemplate(
    template = """Does the following comment ask the reader to do something?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(yes_voter)

In [None]:
song_template = StringTemplate(
    template = """Does the following comment talk about a song?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(no_voter)

In [None]:
checkout_template = StringTemplate(
    template = """Does the following comment contain the words "check out"? \n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(yes_voter)

In [None]:
five_words_template = StringTemplate(
    template = """Is the following comment fewer than 5 words?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(no_voter)

In [None]:
name_mention_template = StringTemplate(
    template = """Does the following comment mention a person’s name?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(no_voter)

In [None]:
strong_sentiment_template = StringTemplate(
    template = """Does the following comment express a very strong sentiment?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(no_voter)

In [None]:
subjective_op_template = StringTemplate(
    template = """Does the following comment express a subjective opinion?\n\n[[text]]""",
    answer_choices = "yes ||| no",
)

voters.append(no_voter)

### Now that we have all the prompt templates and their accompanying voters, let's use them to annotate the training set!

In [None]:
templates = [
    channel_reference_template,
    subscribe_template,
    url_template,
    reader_action_template,
    song_template,
    checkout_template,
    five_words_template,
    name_mention_template,
    strong_sentiment_template,
    subjective_op_template
]

print(f"We have {len(templates)} templates!")

In [None]:
import numpy as np
from tqdm.auto import tqdm

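# One row per training example, one column per prompt template
votes = np.zeros([len(youtube_train), len(templates)])

# Keep the raw model responses for later inspection
model_responses = []

for template_id, template in enumerate(tqdm(templates)):
    # Fill every training example into the current template
    prompts = template.apply_to_dataset(youtube_train)
    # Query the model with the resulting prompts
    responses = T5(prompts, no_tqdm=True)
    model_responses.append(responses)
    # Convert the responses into votes using the voter paired with this template
    votes[:, template_id] = voters[template_id].vote(responses)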

### Finally, let's use Majority Vote to get a consensus!

In [None]:
from alfred.labeling import MajorityVote

mv_lm = MajorityVote()
mv_labels = mv_lm(votes)
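
For intuition, majority voting over a vote matrix can be sketched in plain NumPy as follows. This is an illustration under stated assumptions, not alfred's `MajorityVote` implementation: `majority_vote` is a hypothetical function, every voter is assumed to cast a vote in `{0, 1}` (no abstentions), and ties are broken toward the lowest class index, which may differ from what the library does.

```python
import numpy as np


def majority_vote(votes, n_classes=2):
    """Return the most frequent label in each row of an (examples x voters) matrix."""
    votes = np.asarray(votes, dtype=int)
    # counts[i, c] = number of voters that assigned class c to example i
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    # argmax returns the first maximum, so ties go to the lowest class index
    return counts.argmax(axis=1)


# Toy matrix: 3 examples, 3 voters (values are made up)
toy_votes = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])
toy_labels = majority_vote(toy_votes)
```

Each row of the toy matrix plays the role of one training example and each column the role of one prompt, mirroring the shape of `votes` above.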