<a href="https://colab.research.google.com/github/CrashingGuru/FGAN-Build-a-thon/blob/main/Notebooks2023/2.Read-semi-annotated-push-to-argilla.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Created**: 3 Jan 2024

Aaron, Othniel, Vishnu.

Modification History:

4 Jan 2024: Aaron, Frank, Othniel, Vishnu: Changed the data schema to a simpler format. Bye-bye "for_supervised_fine_tuning" format.

11 Feb 2024: Vishnu: Added the link to the configure-dataset notebook.

Description:

This notebook pulls records from the semi-annotated dataset on the HF Hub and pushes them to Argilla on HF Spaces, where the annotation can be completed (100% annotation).

Pre-requisites:

The following steps have already been completed:

1. Create the raw dataset in HF hub.

2. Configure the Argilla dataset using https://colab.research.google.com/github/vrra/FGAN-Build-a-thon/blob/main/Notebooks2023/Argilla_configure_dataset-v1.ipynb

3. Add records to the Argilla dataset from the raw dataset.

4. Partially annotate the dataset in the UI (offline).

5. Save the annotated dataset to HF hub.

Finally, this notebook pulls records from HF hub (semi-annotated dataset) and pushes them to Argilla on HF Spaces (for 100% annotation).

## Install Libraries

Install the latest version of Argilla in Colab, along with other libraries and models used in this notebook.

In [None]:
!pip install argilla datasets

Prerequisites

Deploy Argilla Server on [HF Spaces](https://huggingface.co/new-space?template=argilla/argilla-template-space).


More info on Installation [here](../getting_started/installation/deployments/deployments.html).

## Secrets needed




* `ARGILLA_API_URL`: The URL of the Argilla Server. In this notebook it is read from the Colab secret `my_argilla_url`.
  * If you're using HF Spaces, it is constructed as `https://[your-owner-name]-[your_space_name].hf.space`.
* `ARGILLA_API_KEY`: The API key of the Argilla Server. It is `owner` by default. In this notebook it is read from the Colab secret `my_argilla_key`.
* `HF_TOKEN`: The Hugging Face API token. It is only needed if you're using a [private HF Space](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html#deploy-argilla-on-spaces). You can configure it in your profile: [Settings > Access Tokens](https://huggingface.co/settings/tokens).
* `workspace`: admin
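
As a worked instance of the URL pattern above, here is a minimal sketch. The helper name `hf_space_url` is hypothetical, and the owner and space name in the usage line are inferred from the dataset URL printed later in this notebook. It only concatenates the two parts and does not normalise special characters in the names.

```python
def hf_space_url(owner: str, space_name: str) -> str:
    """Build the Argilla API URL of an HF Space from its owner and name."""
    # Follows the pattern https://[your-owner-name]-[your_space_name].hf.space
    return f"https://{owner}-{space_name}.hf.space"


print(hf_space_url("vishnuramov", "itu-t-build-a-thon"))
# https://vishnuramov-itu-t-build-a-thon.hf.space
```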


In [3]:
import argilla as rg

In [4]:
from google.colab import userdata

# Read the Argilla connection details from the Colab secrets.
api_url = userdata.get('my_argilla_url')
api_key = userdata.get('my_argilla_key')

rg.init(api_url=api_url, api_key=api_key)

# If you want to use your private HF Space, also pass your HF token:
# hf_token = userdata.get('HF_TOKEN')
# rg.init(api_url=api_url, api_key=api_key, extra_headers={"Authorization": f"Bearer {hf_token}"})

Note: `rg.init` may warn that the client version differs from the server version. This can lead to compatibility issues.
To avoid them, align your client version with the server version.
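
Outside Colab, `google.colab` is not importable, so the secrets have to come from somewhere else. The sketch below is one possible fallback, not part of the original notebook: the helper name `get_secret` is hypothetical, and reading the same names from environment variables is an assumption.

```python
import os


def get_secret(name: str, default: str = "") -> str:
    """Return a secret from the Colab secret store, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name, default)
    try:
        value = userdata.get(name)
    except Exception:
        # Colab raises if the secret is missing or notebook access is not granted.
        value = None
    return value or os.environ.get(name, default)
```

With this helper, `api_url = get_secret('my_argilla_url')` behaves the same inside Colab and falls back to the environment elsewhere.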


In [5]:
from datasets import load_dataset

# Load the semi-annotated dataset from the Hugging Face Hub.
# 'vishnuramov/itu_annotated_dataset' is the semi-annotated dataset on the HF Hub,
# not the annotated dataset in the Argilla Space, nor the raw dataset on the HF Hub.
hf_dataset = load_dataset('vishnuramov/itu_annotated_dataset')


In [6]:
hf_dataset

DatasetDict({
    train: Dataset({
        features: ['background', 'prompt', 'response', 'response_correction', 'response_correction-suggestion', 'response_correction-suggestion-metadata', 'external_id', 'metadata'],
        num_rows: 98
    })
})
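
To see how much of the dataset is already annotated, one can count the rows that carry a correction. This is a minimal sketch: the helper name `count_annotated` is hypothetical, and it assumes, as the later cells do, that `response_correction` is a list that is empty for rows nobody has annotated yet.

```python
from typing import Any, Dict, Iterable


def count_annotated(rows: Iterable[Dict[str, Any]]) -> int:
    """Count rows whose `response_correction` list holds at least one entry."""
    return sum(1 for row in rows if row.get("response_correction"))


# Usage in this notebook (needs the dataset loaded above):
# count_annotated(hf_dataset['train'])
```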

In [7]:
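# Connect to the Argilla dataset that was configured earlier (see the pre-requisites).
custom_dataset = rg.FeedbackDataset.from_argilla(name="fgan-annotate-dataset", workspace="admin")
# Build one Argilla record per row of the semi-annotated dataset.
records = [
    rg.FeedbackRecord(
        fields={"background": record["background"],
                "prompt": record["prompt"],
                "response": record["response"],
                }
    )
    for record in hf_dataset['train']
    ]
custom_dataset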

RemoteFeedbackDataset(
   id=a9144def-6a2e-4056-a202-3ac6ec0fce01
   name=fgan-annotate-dataset
   workspace=Workspace(id=c5a5cbc1-7fbe-4fb0-8c04-6b23981d60d8, name=admin, inserted_at=2024-02-11 15:19:05.507406, updated_at=2024-02-11 15:19:05.507406)
   url=https://vishnuramov-itu-t-build-a-thon.hf.space/dataset/a9144def-6a2e-4056-a202-3ac6ec0fce01/annotation-mode
   fields=[RemoteTextField(id=UUID('51688d91-6107-4fab-bd4c-234520665d8f'), client=None, name='background', title='Background', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('d0462737-dcdf-41fa-b30f-e27b0a597eca'), client=None, name='prompt', title='Prompt', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('0a9088f7-a235-4f39-8532-7924812a8fc7'), client=None, name='response', title='Final Response', required=True, type='text', use_markdown=False)]
   questions=[RemoteTextQuestion(id=UUID('75da84a7-9609-42e5-aaf9-994968f1c064'), client=None, name='response_correction', title='

In [8]:
from typing import Dict, Any

def extract_background_prompt_response(str_text: str) -> Dict[str, Any]:
    '''Split a text into its background, prompt ("<human>:") and response ("<bot>:") sections, keeping the markers.'''
    background_prompt = str_text.lower().find("background:")
    start_prompt = str_text.lower().find("<human>:")
    end_prompt = str_text.lower().rfind("<bot>:")

    if (background_prompt != -1) and (start_prompt == -1 ) and (end_prompt == -1):
      #only background is present
      background = str_text[background_prompt:].strip()
      prompt = ""
      response = ""
    elif (background_prompt == -1) and (start_prompt != -1 ) and (end_prompt == -1):
      #only human is present
      background = ""
      prompt = str_text[start_prompt:].strip()
      response = ""
    elif (background_prompt == -1) and (start_prompt == -1 ) and (end_prompt != -1):
      #only bot is present
      background = ""
      prompt = ""
      response = str_text[end_prompt:].strip()

    elif (background_prompt != -1) and (start_prompt != -1 ) and (end_prompt == -1):
      #only background and human are present
      background = str_text[background_prompt:start_prompt].strip()
      prompt = str_text[start_prompt:].strip()
      response = ""
    elif (background_prompt != -1) and (start_prompt == -1 ) and (end_prompt != -1):
      #only background and bot are present
      background = str_text[background_prompt:end_prompt].strip()
      prompt = ""
      response = str_text[end_prompt:].strip()
    elif (background_prompt == -1) and (start_prompt != -1 ) and (end_prompt != -1):
      #only human and bot are present
      background = ""
      prompt = str_text[start_prompt:end_prompt].strip()
      response = str_text[end_prompt:].strip()
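    elif (background_prompt == -1) and (start_prompt == -1 ) and (end_prompt == -1):
      #no marker is present: return empty strings so that the caller keeps the original values
      #(without this branch, the -1 indices would slice out the last character as the response)
      background = ""
      prompt = ""
      response = ""
    else:
      #all 3 are present
      background = str_text[background_prompt:start_prompt].strip()
      prompt = str_text[start_prompt:end_prompt].strip()
      response = str_text[end_prompt:].strip()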

    return {"background": background, "prompt": prompt, "response": response}
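
The splitting rule can be illustrated on a small input. The block below is a compact, self-contained sketch of the same idea, not the function above: the name `split_sections` and the sample text are made up for illustration. Each section runs from its marker to the next marker that is present, the markers are kept in the output, and a text without any marker yields three empty strings.

```python
from typing import Dict


def split_sections(text: str) -> Dict[str, str]:
    """Split text at the "background:", "<human>:" and "<bot>:" markers."""
    lowered = text.lower()
    # Same lookups as above: first occurrence, except the last "<bot>:".
    positions = {
        "background": lowered.find("background:"),
        "prompt": lowered.find("<human>:"),
        "response": lowered.rfind("<bot>:"),
    }
    order = ["background", "prompt", "response"]
    sections = {}
    for idx, name in enumerate(order):
        start = positions[name]
        if start == -1:
            sections[name] = ""  # marker missing: empty section
            continue
        # The section ends at the next marker that is present, else at the end.
        later = [positions[n] for n in order[idx + 1:] if positions[n] != -1]
        end = later[0] if later else len(text)
        sections[name] = text[start:end].strip()
    return sections


sample = "Background: ITU-T standards. <human>: What is FGAN? <bot>: A focus group."
print(split_sections(sample))
```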

In [9]:
for i, record in enumerate(hf_dataset['train']):
    # Start from the original field values of the row.
    fields = {key: record[key] for key in ('background', 'prompt', 'response')}
    corrections = record['response_correction']
    if corrections:
        # Parse the first correction once and override only the non-empty parts.
        extracted = extract_background_prompt_response(corrections[0]['value'])
        for key, value in extracted.items():
            if value:
                fields[key] = value
    for key, value in fields.items():
        records[i].fields[key] = value

**CAUTION**

The following step is optional.
It clears the Argilla dataset to avoid duplicate entries.
The reason:

1) At this point we don't have a persistent dataset; the DB is the persistent copy. So it is OK to wipe the slate clean and copy from the DB again.

Don't do this for an empty, newly created Argilla dataset: you may get an error. If so, ignore the error.


In [None]:
# List the records to be deleted
numRecords = len(custom_dataset.records)
records_to_delete = list(custom_dataset.records[:numRecords])
# Delete the list of records from the dataset
custom_dataset.delete_records(records_to_delete)
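
To sidestep the error on an empty dataset, the deletion can be skipped when there is nothing to delete. This is a minimal sketch: the helper name `clear_dataset` is hypothetical, it uses only the `records` and `delete_records` members already used above, and it has not been checked against every Argilla version.

```python
def clear_dataset(dataset) -> int:
    """Delete every record of `dataset` and return how many were deleted."""
    num_records = len(dataset.records)
    if num_records == 0:
        # Nothing to delete: skip the call that may fail on an empty dataset.
        return 0
    records_to_delete = list(dataset.records[:num_records])
    dataset.delete_records(records_to_delete)
    return num_records


# Usage in this notebook:
# clear_dataset(custom_dataset)
```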

In [10]:
custom_dataset.add_records(records)

Output()

Go to Argilla in HF Spaces and cross-check the added records.

Manually annotate (and have fun)

-------------

