<a href="https://colab.research.google.com/github/vrra/FGAN-Build-a-thon/blob/main/Notebooks2023/Read-semi-annotated-push-to-argilla.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Created: 3 Jan 2024

Aaron, Othniel, Vishnu.

This notebook pulls records from a semi-annotated dataset on the Hugging Face (HF) Hub and pushes them to an Argilla instance running on HF Spaces, where the remaining records can be annotated.

Prerequisites:

The following notebooks have already been run:

1. Create the raw dataset on the HF Hub.

2. Configure the Argilla dataset.

3. Add records from the raw dataset to the Argilla dataset.

In addition, you may need to annotate part (x%) of the dataset offline in the Argilla UI.

## Install Libraries

Install the latest version of Argilla in Colab, along with the `datasets` library used in this notebook.

In [1]:
!pip install argilla datasets

Collecting argilla
  Downloading argilla-1.21.0-py3-none-any.whl (3.1 MB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting httpx<=0.25,>=0.15 (from argilla)
  Downloading httpx-0.25.0-py3-none-any.whl (75 kB)
Collecting deprecated~=1.2.0 (from argilla)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting backoff (from argilla)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting monotonic (from argilla)
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-an

## Prerequisites

Deploy Argilla Server on [HF Spaces](https://huggingface.co/new-space?template=argilla/argilla-template-space).


More information on installation is available [here](../getting_started/installation/deployments/deployments.html).

## Secrets needed




* `ARGILLA_API_URL`: The URL of the Argilla server. This notebook reads it from the Colab secret `my_argilla_url`.
  * If you're using HF Spaces, it is constructed as `https://[your-owner-name]-[your_space_name].hf.space`.
* `ARGILLA_API_KEY`: The API key of the Argilla server. It is `owner` by default. This notebook reads it from the Colab secret `my_argilla_key`.
* `HF_TOKEN`: The Hugging Face API token. It is only needed if you're using a [private HF Space](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html#deploy-argilla-on-spaces). You can configure it in your profile: [Settings > Access Tokens](https://huggingface.co/settings/tokens).
* `workspace`: The Argilla workspace that receives the dataset. This notebook uses `admin`.
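
The URL pattern for HF Spaces can be sketched as a small helper. This is only an illustration of the pattern stated above: `space_url` is a hypothetical function name, and the owner and Space names in the example are assumptions inferred from the dataset URL logged later in this notebook.

```python
def space_url(owner: str, space_name: str) -> str:
    """Build the Argilla API URL for a Space using the pattern described above."""
    return f"https://{owner}-{space_name}.hf.space"


# Assumed example values (inferred from the URL in the push_to_argilla log).
api_url_example = space_url("vishnuramov", "itu-t-build-a-thon")
print(api_url_example)
```

The resulting string is what you would store in the Colab secret that holds the Argilla URL.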


In [2]:
import argilla as rg
from argilla._constants import DEFAULT_API_KEY

In [3]:
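from google.colab import userdata

# Read the Argilla URL and API key from the Colab secrets manager
api_url = userdata.get('my_argilla_url')
api_key = userdata.get('my_argilla_key')

import argilla as rg
rg.init(api_url=api_url, api_key=api_key)

# If you are using a private HF Space, also pass your HF token:
# hf_token = userdata.get('HF_TOKEN')
# rg.init(api_url=api_url, api_key=api_key, extra_headers={"Authorization": f"Bearer {hf_token}"})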
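
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a fallback, assuming the same secret names are also exported as environment variables when the notebook runs outside Colab, might look like this. `get_secret` is a hypothetical helper, not part of Argilla or Colab, and it assumes that `userdata.get` raises when a secret is missing or access is not granted.

```python
import os


def get_secret(name, default=None):
    """Return a secret from Colab if available, otherwise from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Not running in Colab: fall back to environment variables.
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Assumption: a missing or inaccessible secret raises an exception.
        return os.environ.get(name, default)


api_url = get_secret("my_argilla_url")
api_key = get_secret("my_argilla_key")
```

The two values can then be passed to `rg.init(api_url=api_url, api_key=api_key)` exactly as in the cell above.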



In [34]:
from datasets import load_dataset

# Load the semi-annotated dataset from the Hugging Face Hub.
# 'vishnuramov/itu_annotated_dataset' is the semi-annotated dataset on the HF Hub.
# It is neither the raw dataset on the HF Hub nor the dataset being annotated in the Space.
hf_dataset = load_dataset('vishnuramov/itu_annotated_dataset')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/9.42k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/129k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [35]:
# Print the first submitted response of every record that already has an annotation
for record in hf_dataset['train']:
    if record['response']:
        print(record['response'][0]['value'])
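
To see how much of the dataset is already annotated, the same check can be turned into a count. The sketch below uses toy rows that only mimic the shape of the Hub records; `count_annotated` is a hypothetical helper and the sample values are illustrative, not taken from the dataset.

```python
def count_annotated(rows):
    """Return (annotated, total); a row counts as annotated if 'response' is non-empty."""
    rows = list(rows)
    annotated = sum(1 for row in rows if row.get("response"))
    return annotated, len(rows)


# Toy rows with the same keys as the Hub records (illustrative values only).
sample_rows = [
    {"prompt": "p0", "response": [{"value": "answer", "status": "submitted"}]},
    {"prompt": "p1", "response": []},
]
annotated, total = count_annotated(sample_rows)
print(f"{annotated}/{total} records annotated")
```

On the real data this would be called as `count_annotated(hf_dataset['train'])`.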

In [11]:
#Create a custom dataset configuration
feedback_dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt"),
        rg.TextField(name="context"),
        rg.TextField(name="response")
    ],
    questions=[
        rg.TextQuestion(
            name="answer_correction",
            description="If you think the response is not accurate, please, correct it.",
            required=True,
        ),
    ],
    guidelines="Please, read the question carefully and try to answer it as accurately as possible."
)

In [30]:
custom_dataset = feedback_dataset.push_to_argilla(name="itu-annotate-custom-dataset", workspace="admin")


INFO:argilla.client.feedback.dataset.local.mixins:✓ Dataset succesfully pushed to Argilla
INFO:argilla.client.feedback.dataset.local.mixins:RemoteFeedbackDataset(
   id=47ee36b9-61f0-444c-b707-70bfad333c2f
   name=itu-annotate-custom-dataset
   workspace=Workspace(id=6196e1fe-7cc5-4ef4-b608-d98a8bc8fbc8, name=admin, inserted_at=2024-01-02 14:17:38.856061, updated_at=2024-01-02 14:17:38.856061)
   url=https://vishnuramov-itu-t-build-a-thon.hf.space/dataset/47ee36b9-61f0-444c-b707-70bfad333c2f/annotation-mode
   fields=[RemoteTextField(id=UUID('fa9deb90-82c1-464e-8ec5-412084f74328'), client=None, name='prompt', title='Prompt', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('85f6721f-499f-4664-a4a9-acb97cf05d56'), client=None, name='context', title='Context', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('a768130f-2dff-4e8f-bbfd-7557a60330aa'), client=None, name='response', title='Response', required=True, type='text', use_markdown=Fals

In [31]:
records = [
    rg.FeedbackRecord(
        fields={
            "prompt": record["prompt"],
            "context": record["context"],
            "response": "",
        }
    )
    for record in hf_dataset['train']
]
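
The records above start with an empty `response` field, which a later cell fills from the first existing annotation. As a hedged alternative, both steps could be folded into one mapping function. `to_fields` is a hypothetical helper, and the sample row is made up to mirror the shape of the Hub records.

```python
def to_fields(row):
    """Map one Hub row to the three text fields of the Argilla dataset."""
    responses = row.get("response") or []
    return {
        "prompt": row["prompt"],
        "context": row["context"],
        # Use the first existing annotation if there is one, else an empty string.
        "response": responses[0]["value"] if responses else "",
    }


# Illustrative row only; real rows come from hf_dataset['train'].
sample_row = {
    "prompt": "example prompt",
    "context": "example.pdf page number= 0",
    "response": [{"value": "example answer", "status": "submitted"}],
}
fields = to_fields(sample_row)
print(fields["response"])
```

With this helper, the list could be built as `records = [rg.FeedbackRecord(fields=to_fields(row)) for row in hf_dataset['train']]`.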

In [32]:
for i, record in enumerate(hf_dataset['train']):
    print(i)
    print(record)
    # Copy the first existing annotation, if any, into the record's response field
    if record['response']:
        records[i].fields['response'] = record['response'][0]['value']

0
{'prompt': 'I n t e r n a t i o n a l  T e l e c o m m u n i c a t i o n  U n i o n  \n  \nITU-T  Technical Specification  \nTELECOMMUNICATION  \nSTANDARDIZATION SECTOR  \nOF ITU   \n(28 October  2021 ) \n \nITU-T Focus Group on Autonomous Networks  \n Technical Specification  \nUse cases for Autonomous Networks', 'context': '/content/sample_data/Use-case-AN.pdf page number= 0', 'response': [{'user_id': '01f1f7c8-9437-450e-85aa-ace2973d8439', 'value': 'Background: ITU has published Use cases for Autonomous Networks. <human>: who publishes use cases for autonomous networks? <bot>: ITU.', 'status': 'submitted'}], 'response-suggestion': None, 'response-suggestion-metadata': {'type': None, 'score': None, 'agent': None}, 'external_id': None, 'metadata': '{}'}
1
{'prompt': 'Error! Reference source not found.  (2021 -10)  i Summary  \nThis is a deliverable of the ITU -T Focus Group on Autonomous Networks (FG -AN).  \nThis document analyses use cases for autonomous networks. It provides use 

In [33]:
custom_dataset.add_records(records)



-------------

