<a href="https://colab.research.google.com/github/vrra/FGAN-Build-a-thon/blob/main/Notebooks2023/Read-semi-annotated-push-to-argilla.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Created: 3 Jan 2024

Aaron, Othniel, Vishnu.

Modification History: 4 Jan 2024: Aaron, Frank, Othniel, Vishnu: Changed the data schema to a simpler format. Bye-bye "for_supervised_fine_tuning" format.

Description:

This notebook pulls records from the semi-annotated dataset on the Hugging Face (HF) Hub and pushes them to the Argilla instance on HF Spaces, where the remaining records can be annotated (for 100% annotation).

Pre-requisites:

The following notebooks have already been run:

1. Create the raw dataset in the HF Hub.

2. Configure the Argilla dataset.

3. Add records from the raw dataset to the Argilla dataset.

4. Partially annotate the dataset in the UI (offline).

5. Save the annotated dataset to the HF Hub.

Finally, this notebook pulls the semi-annotated records from the HF Hub and pushes them to Argilla on HF Spaces for full annotation.

## Install Libraries

Install Argilla and the Hugging Face `datasets` library in Colab. The code in this notebook uses the Argilla 1.x Python client (`rg.init`, `rg.FeedbackDataset`).

In [40]:
!pip install argilla datasets



Prerequisites

Deploy Argilla Server on [HF Spaces](https://huggingface.co/new-space?template=argilla/argilla-template-space).


More info on Installation [here](../getting_started/installation/deployments/deployments.html).

## Secrets needed




* `ARGILLA_API_URL`: The URL of the Argilla Server. This notebook reads it from the Colab secret `my_argilla_url`.
  * If you're using HF Spaces, it is constructed as `https://[your-owner-name]-[your_space_name].hf.space`.
* `ARGILLA_API_KEY`: The API key of the Argilla Server. It is `owner` by default. This notebook reads it from the Colab secret `my_argilla_key`.
* `HF_TOKEN`: The Hugging Face API token. It is only needed if you're using a [private HF Space](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html#deploy-argilla-on-spaces). You can configure it in your profile: [Settings > Access Tokens](https://huggingface.co/settings/tokens).
* `workspace`: The Argilla workspace that holds the dataset. This notebook uses `admin`.
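
The Space URL pattern in the list above can be sketched as a small helper. This is only an illustration and not part of the original notebook: `build_space_url` is a hypothetical name, and the sketch assumes the owner and Space names are already in the form that appears in the Space's address (no normalization is attempted).

```python
def build_space_url(owner: str, space_name: str) -> str:
    """Compose an HF Spaces URL following https://[owner]-[space].hf.space.

    Assumption: both names are already URL-safe; nothing is normalized here.
    """
    return f"https://{owner}-{space_name}.hf.space"


# Owner and Space name as inferred from the dataset URL printed later in this notebook
api_url_example = build_space_url("vishnuramov", "itu-t-build-a-thon")
print(api_url_example)
```

The resulting string is the value that would be stored as the Argilla URL secret.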


In [41]:
import argilla as rg

In [42]:
from google.colab import userdata

# Read the Argilla credentials from the Colab secrets
api_url = userdata.get('my_argilla_url')
api_key = userdata.get('my_argilla_key')

rg.init(api_url=api_url, api_key=api_key)

# # If you want to use your private HF Space
# rg.init(extra_headers={"Authorization": f"Bearer {hf_token}"})
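
Outside Colab, `google.colab.userdata` cannot be imported. A minimal sketch of a fallback follows, assuming the same values are exported as environment variables under the same names. `read_setting` is a hypothetical helper, not part of Argilla or Colab.

```python
import os


def read_setting(name, default=None):
    """Return a Colab secret if available, else the environment variable `name`."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment
        return os.environ.get(name, default)


api_url = read_setting("my_argilla_url")
api_key = read_setting("my_argilla_key")
```

Both values are `None` when neither a Colab secret nor an environment variable is set, so it is worth checking them before calling `rg.init`.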

In [43]:
from datasets import load_dataset

# Load the semi-annotated dataset from the Hugging Face Hub.
# 'vishnuramov/itu_annotated_dataset' is the semi-annotated dataset on the Hub,
# not the raw dataset on the Hub and not the dataset held in the Argilla Space.
hf_dataset = load_dataset('vishnuramov/itu_annotated_dataset')

In [44]:
hf_dataset

DatasetDict({
    train: Dataset({
        features: ['background', 'prompt', 'response', 'response_correction', 'response_correction-suggestion', 'response_correction-suggestion-metadata', 'external_id', 'metadata'],
        num_rows: 98
    })
})

In [45]:
# Fetch the existing Argilla dataset and build one record per row of the HF dataset
custom_dataset = rg.FeedbackDataset.from_argilla(name="fgan-annotate-dataset", workspace="admin")
records = [
    rg.FeedbackRecord(
        fields={
            "background": record["background"],
            "prompt": record["prompt"],
            "response": record["response"],
        }
    )
    for record in hf_dataset['train']
]
custom_dataset

RemoteFeedbackDataset(
   id=12bbed5e-35ce-46a5-9613-a98f893b830e
   name=fgan-annotate-dataset
   workspace=Workspace(id=6196e1fe-7cc5-4ef4-b608-d98a8bc8fbc8, name=admin, inserted_at=2024-01-02 14:17:38.856061, updated_at=2024-01-02 14:17:38.856061)
   url=https://vishnuramov-itu-t-build-a-thon.hf.space/dataset/12bbed5e-35ce-46a5-9613-a98f893b830e/annotation-mode
   fields=[RemoteTextField(id=UUID('7ccfdb4c-dbd3-472e-a5b4-399839a754a8'), client=None, name='background', title='Background', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('305e73c2-455b-4eb6-ad79-3c6c6184fc17'), client=None, name='prompt', title='Prompt', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('3312f16a-8eb3-4381-a852-066b43a0f9c8'), client=None, name='response', title='Final Response', required=True, type='text', use_markdown=False)]
   questions=[RemoteTextQuestion(id=UUID('c590fe5a-bccc-4e06-872b-a2b75bad38e4'), client=None, name='response_correction', title='

In [46]:
from typing import Dict, Any

def extract_background_prompt_response(str_text: str) -> Dict[str, Any]:
    '''Split a conversation string into background, prompt, and response.'''
    start_prompt = str_text.find("<human>:")
    end_prompt = str_text.rfind("<bot>:")
    # find/rfind return -1 when a marker is missing; fall back to the string
    # boundaries so the slices below do not silently drop characters.
    if start_prompt == -1:
        start_prompt = 0
    if end_prompt == -1:
        end_prompt = len(str_text)
    # Background is anything before the first <human>:
    background = str_text[:start_prompt].strip()
    # Prompt is anything between the first <human>: (inclusive) and the last <bot>: (exclusive)
    prompt = str_text[start_prompt:end_prompt].strip()
    # Response is everything from the last <bot>: (inclusive) to the end
    response = str_text[end_prompt:].strip()
    return {"background": background, "prompt": prompt, "response": response}
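
To see how the split behaves on a multi-turn conversation, here is a small standalone sketch. `split_conversation` is a hypothetical restatement of the find/rfind logic above for the case where both markers are present, and the sample text is invented for illustration.

```python
def split_conversation(text: str) -> dict:
    """Split on the first '<human>:' and the last '<bot>:' (both assumed present)."""
    start = text.find("<human>:")
    end = text.rfind("<bot>:")
    return {
        "background": text[:start].strip(),
        "prompt": text[start:end].strip(),
        "response": text[end:].strip(),
    }


sample = (
    "Background text.\n"
    "<human>: first question\n"
    "<bot>: first answer\n"
    "<human>: follow-up question\n"
    "<bot>: final answer"
)
parts = split_conversation(sample)
print(parts["background"])  # Background text.
print(parts["response"])    # <bot>: final answer
```

Note that earlier `<bot>:` turns stay inside the prompt: only the last `<bot>:` turn is treated as the response.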

In [47]:
# Prefer the annotator's corrected text when a record has one;
# otherwise keep the original fields.
for i, record in enumerate(hf_dataset['train']):
    if record['response_correction']:
        # Parse the first correction once and reuse the result
        corrected = extract_background_prompt_response(record['response_correction'][0]['value'])
        records[i].fields['background'] = corrected['background']
        records[i].fields['prompt'] = corrected['prompt']
        records[i].fields['response'] = corrected['response']
    else:
        records[i].fields['background'] = record['background']
        records[i].fields['prompt'] = record['prompt']
        records[i].fields['response'] = record['response']
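
Before pushing, it can help to check how many rows will take their fields from a correction. The following is a minimal sketch on plain dictionaries. The sample rows are invented for illustration and only mimic the `response_correction` column shape used in the loop above (a list of entries with a `value` key), and `count_corrected` is a hypothetical helper.

```python
def count_corrected(rows):
    """Count rows whose response_correction list is non-empty."""
    return sum(1 for row in rows if row.get("response_correction"))


# Invented sample rows: one without a correction, one with
sample_rows = [
    {"response": "original", "response_correction": []},
    {"response": "original", "response_correction": [{"value": "<human>: q\n<bot>: a"}]},
]
n_corrected = count_corrected(sample_rows)
print(n_corrected, "of", len(sample_rows), "rows carry a correction")
```

On the real data the call would be `count_corrected(hf_dataset['train'])`.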

In [48]:
# List every record currently in the Argilla dataset
num_records = len(custom_dataset.records)
records_to_delete = list(custom_dataset.records[:num_records])
# Delete them so the records added in the next cell replace the old ones
custom_dataset.delete_records(records_to_delete)

In [49]:
custom_dataset.add_records(records)


