# Generate annotated training data

This Jupyter notebook will be the hub for generating annotated training data. We'll go through the following steps:

1. Set up labeling session
2. Label data
3. Export labeled data
4. Load labeled data

## Part I: Set up labeling session

In this part, we will (1) get the training data that we will label, and (2) get (and save) the configuration for our annotation session. Both happen automatically via a helper function, `set_up_labeling_session`, which optionally takes a `config_filename` parameter: the filename of an existing configuration .json file.

In [1]:
import json

from services.generate_training_data.helper import set_up_labeling_session

You can set up a labeling session manually:

In [None]:
session_dict: dict = set_up_labeling_session()

You can also use an existing config. Here is an example configuration, which we then write to `example_config.json`:

In [4]:
sample_config = {
    "num_samples": 100,
    "task_name": "post-ranking",
    "task_description": "Ranking posts as those we want to uprank ('uprank'), downrank ('downrank') or neither ('neutral')",
    "label_options": ["uprank", "downrank", "neutral"],
    "labeling_session_name": "post-ranking_2024-04-06-07:10:56",
    "data_to_label_filename": "post-ranking_2024-04-06-07:10:56.jsonl",
    "notes": "",
    "config_name": "post-ranking_2024-04-06-07:10:56.json",
}

In [6]:
with open("example_config.json", "w") as f:
    f.write(json.dumps(sample_config, indent=4))

To load an example configuration file, we can do the following:

In [None]:
with open("example_config.json", "r") as f:
    config = json.load(f)
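
Before reusing a loaded config, it can help to check that it has the fields a labeling session needs. The sketch below assumes the keys in the sample config above are the required set. `REQUIRED_CONFIG_KEYS` and `validate_config` are hypothetical names for illustration, not part of the helper module.

```python
# Hypothetical check: the required keys are an assumption based on the sample config.
REQUIRED_CONFIG_KEYS = {
    "num_samples",
    "task_name",
    "task_description",
    "label_options",
    "labeling_session_name",
    "data_to_label_filename",
}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a labeling config (empty if none)."""
    missing = sorted(REQUIRED_CONFIG_KEYS - set(config))
    problems = [f"missing key: {key}" for key in missing]
    # A labeling task needs at least two distinct labels to choose from.
    if "label_options" in config and len(set(config["label_options"])) < 2:
        problems.append("label_options should contain at least two distinct labels")
    # The sample size must be a positive whole number.
    if "num_samples" in config:
        num_samples = config["num_samples"]
        if not isinstance(num_samples, int) or num_samples <= 0:
            problems.append("num_samples should be a positive integer")
    return problems
```

Calling `validate_config(config)` on the config loaded above should return an empty list if the file matches the sample.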

Here's another example, this time using an existing configuration file that has actual data behind it:

In [2]:
session_name = "post-ranking_2024-04-06-07:10:56"
data_to_label_filename = "post-ranking_2024-04-06-07:10:56.jsonl"
config_filename = "post-ranking_2024-04-06-07:10:56.json"

In [3]:
session_dict: dict = set_up_labeling_session(config_filename=config_filename)

Exported data to label at /Users/mark/Documents/work/bluesky-research/services/generate_training_data/data_to_label/post-ranking_2024-04-06-07:10:56.jsonl
Exported config at /Users/mark/Documents/work/bluesky-research/services/generate_training_data/labeling_session_configs/post-ranking_2024-04-06-07:10:56.json
Labeling session set up for config post-ranking_2024-04-06-07:10:56.json to label data at post-ranking_2024-04-06-07:10:56.jsonl at 2024-04-06-07:49:43.


In [5]:
session_dict.keys()

dict_keys(['config', 'data_to_label'])

## Part II: Label data

Now that we have our data, let's label it.

In [15]:
config = session_dict["config"]

In [10]:
labeled_data: list[dict] = []
label_options = session_dict["config"]["label_options"]
data_to_label: list[dict] = session_dict["data_to_label"]

In [11]:
data_to_label[0]

{'id': 63376,
 'uri': 'at://did:plc:sb6fu4sinwphqpvoznvz7efo/app.bsky.feed.post/3kpen5qxtnc2c',
 'created_at': '2024-04-05T07:55:27.468Z',
 'text': "I'm having a lot of fun with this photo box",
 'embed': '{"has_image": true, "image_alt_text": "", "has_embedded_record": false, "embedded_record": null, "has_external": false, "external": null}',
 'langs': 'en',
 'entities': None,
 'facets': None,
 'labels': None,
 'reply': None,
 'reply_parent': None,
 'reply_root': None,
 'tags': None,
 'py_type': 'app.bsky.feed.post',
 'cid': 'bafyreieoci2xeo6zis4urhwohei6zhbtn3i73vk3c3jjf3k7jjkolxz7n4',
 'author': 'did:plc:sb6fu4sinwphqpvoznvz7efo',
 'synctimestamp': '2024-04-05-07:55:27'}

In [35]:
def generate_string_to_annotate(post: dict) -> str:
    """Generates a string to annotate for a post.

    We can include information beyond just the text of a post, and thus provide
    any context that we might want to include as well.
    """
    return f"""
        [text]: {post['text']}
        [embed]: {post['embed']}
        [facets]: {post['facets']}
        [labels]: {post['labels']}
        [reply_parent]: {post['reply_parent']}
        [reply_root]: {post['reply_root']}
        [tags]: {post['tags']}\n
    """

In [36]:
strings_to_annotate = [generate_string_to_annotate(post) for post in data_to_label]
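
In the sample post above, the `embed` field is a JSON-encoded string, which is hard to scan when it is dropped into the annotation text as-is. As a minimal sketch, assuming every post's `embed` uses the same keys as that sample, a hypothetical helper such as `summarize_embed` could turn it into a short summary:

```python
import json
from typing import Optional


def summarize_embed(embed_json: Optional[str]) -> str:
    """Turn the JSON-encoded `embed` field into a short, readable summary."""
    if not embed_json:
        return "none"
    embed = json.loads(embed_json)
    parts = []
    if embed.get("has_image"):
        # Fall back to a placeholder when the alt text is empty or missing.
        alt_text = embed.get("image_alt_text") or "no alt text"
        parts.append(f"image ({alt_text})")
    if embed.get("has_embedded_record"):
        parts.append("embedded record")
    if embed.get("has_external"):
        parts.append("external link")
    return ", ".join(parts) or "none"


sample_embed = (
    '{"has_image": true, "image_alt_text": "", "has_embedded_record": false, '
    '"embedded_record": null, "has_external": false, "external": null}'
)
print(summarize_embed(sample_embed))  # image (no alt text)
```

The summary could then replace `{post['embed']}` in the `[embed]` line of `generate_string_to_annotate`.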

Now that we have the data we want to annotate, let's set up our Pigeon session. We want to display the text with its line breaks rendered properly, so we convert newlines to HTML `<br>` tags before displaying it.

In [25]:
from IPython.display import display, HTML
from pigeon import annotate

In [37]:
def format_text(text):
    return text.replace("\n", "<br>")
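
Because the text is rendered as HTML, a post containing characters such as `<` or `&` could be displayed incorrectly or interpreted as markup. A possible variant, sketched here under the hypothetical name `format_text_escaped`, escapes the text with the standard library's `html.escape` before inserting the `<br>` tags:

```python
import html


def format_text_escaped(text: str) -> str:
    """Escape HTML special characters, then turn newlines into <br> tags."""
    # Escape first so the <br> tags we add are not escaped themselves.
    return html.escape(text).replace("\n", "<br>")
```

It can be passed to `display_fn` in place of `format_text`.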

With the setup complete and the data prepared, let's label!

In [None]:
annotations: list[tuple] = annotate(
    examples=strings_to_annotate,
    options=label_options,
    display_fn=lambda text: display(HTML(format_text(text))),
)

Now that we have our annotations, we grab the results and combine them with our initial posts so that we can have a list of post URIs and the labels provided.

In [None]:
for post, (_, label) in zip(data_to_label, annotations):
    # Keep only the post URI and its label; append the new record, not the raw post.
    res = {"uri": post["uri"], "label": label}
    labeled_data.append(res)
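
The `zip` above pairs posts and annotations by position, which only lines up if every example received a label. Assuming Pigeon returns `(example, label)` tuples only for the examples that were actually labeled, so skipped examples are absent, a safer approach is to match each annotation back to its post through the annotated string. This is a sketch; `attach_labels` is a hypothetical helper.

```python
from collections import defaultdict, deque


def attach_labels(
    posts: list[dict], strings: list[str], annotations: list[tuple]
) -> list[dict]:
    """Match each (string, label) annotation back to its post by the annotated string."""
    # Identical strings cannot be told apart, so queue their posts and
    # hand them out in their original order.
    posts_by_string = defaultdict(deque)
    for post, string in zip(posts, strings):
        posts_by_string[string].append(post)

    labeled = []
    for string, label in annotations:
        post = posts_by_string[string].popleft()
        labeled.append({"uri": post["uri"], "label": label})
    return labeled
```

With the names used in this notebook, the call would be `attach_labels(data_to_label, strings_to_annotate, annotations)`.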

Now that we have a list of `{"uri": uri, "label": label}` records, let's combine them with the metadata from our config.

In [16]:
config

{'labeling_session_name': 'post-ranking_2024-04-06-07:10:56',
 'timestamp': '2024-04-06-07:49:43',
 'task_name': 'post-ranking',
 'task_description': "Ranking posts as those we want to uprank ('uprank'), downrank ('downrank') or neither ('neutral')",
 'label_options': ['uprank', 'downrank', 'neutral'],
 'num_samples': 100,
 'notes': '',
 'data_to_label_filename': 'post-ranking_2024-04-06-07:10:56.jsonl'}

In [None]:
data_to_export: list[dict] = [
    {
        "uri": labeled_post["uri"],
        "label": labeled_post["label"],
        "task": config["task_name"],
        "labeling_session_name": config["labeling_session_name"],
        "notes": config["notes"],
        "timestamp": config["timestamp"],
    }
    for labeled_post in labeled_data
]
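
Before exporting, a quick sanity check on the label distribution can catch mistakes early. The sketch below assumes each record has a `"label"` key, as in `data_to_export`; `summarize_labels` is a hypothetical helper.

```python
from collections import Counter


def summarize_labels(records: list[dict], label_options: list[str]) -> dict[str, int]:
    """Count records per label and fail on labels outside the allowed options."""
    counts = Counter(record["label"] for record in records)
    unexpected = set(counts) - set(label_options)
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(unexpected)}")
    # Report every option, including those that were never chosen.
    return {option: counts.get(option, 0) for option in label_options}
```

For example, `summarize_labels(data_to_export, label_options)` would return a count for each of the configured label options.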

## Part III: Export labeled data

Now that we have our labeled data, let's write it to the database.

In [None]:
from services.generate_training_data.database import batch_write_training_data_to_db

In [None]:
batch_write_training_data_to_db(data_to_export)
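
As an optional safeguard, the same records could also be written to a local JSONL file, the format already used for the data to label. This is a sketch independent of the database module; `write_jsonl` and `read_jsonl` are hypothetical helpers, and the output path is up to you.

```python
import json
from pathlib import Path


def write_jsonl(records: list[dict], path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


def read_jsonl(path: str) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```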

## Part IV: Load labeled data

Now that the labels from our session are saved, let's load the results back by session name.

In [None]:
from services.generate_training_data.database import load_data_from_previous_session

In [None]:
res: list[dict] = load_data_from_previous_session(config["labeling_session_name"])