In [71]:
%pip install huggingface_hub argilla datasets


# Steps

The overall steps to create a new Argilla Space for translating a new language are as follows:

1. Set up an organization on the Hub
2. Create an Argilla Space
3. Set up the OAuth integration
4. Load the DIBT data into the Argilla Space
5. Begin translating the data!


This notebook walks through each of these steps. Some steps use the `huggingface_hub` Python library; others have to be done in the Hub UI. Where a UI step can also be done through the API, we show the API version.

In [146]:
from huggingface_hub import duplicate_space
from huggingface_hub import hf_hub_download
from huggingface_hub import HfApi
from huggingface_hub import SpaceCard
import yaml
import json

### Create a new organization for your language effort 

To make your language effort easier to track, we recommend creating a dedicated organization for it on the Hub. This keeps all of the data for your effort in one place. We suggest naming the organization `DIBT-<language>`, where `<language>` is the name of your language; for example, `DIBT-Spanish` if you are working on Spanish. A consistent name also makes it easier for us to track all of the DIBT language efforts.

You can use this link to create a new organization on the Hub: [https://huggingface.co/organizations/new](https://huggingface.co/organizations/new). 
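
The naming convention above can be sketched as a pair of small helpers. This is only an illustration: `suggested_org_name` and `space_name_for` are hypothetical names that do not come from `huggingface_hub`, and joining multi-word languages with hyphens is an assumption made here because Hub names cannot contain spaces. The slug helper keeps ASCII letters and digits only, so it is best suited to English language names.

```python
import re


def suggested_org_name(language: str) -> str:
    """Build an organization name following the DIBT-<language> convention."""
    # Assumption: multi-word language names are joined with hyphens.
    return "DIBT-" + "-".join(language.split())


def space_name_for(language: str) -> str:
    """Build a lower-case, hyphenated Space name for a language (hypothetical helper)."""
    slug = re.sub(r"[^a-z0-9]+", "-", language.strip().lower()).strip("-")
    return f"prompt-translation-for-{slug}"


print(suggested_org_name("Spanish"))
print(space_name_for("Brazilian Portuguese"))
```

A slug like this could stand in for the raw `LANGUAGE` value when building the Space id later in the notebook, where a language name containing spaces would not be a valid repository name.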

In [153]:
HF_ORG_NAME = None
LANGUAGE = None

In [154]:
assert HF_ORG_NAME is not None, "Please set HF_ORG_NAME to the name of the Hugging Face org you just created"
assert LANGUAGE is not None, "Please set LANGUAGE to the language your effort focuses on"

AssertionError: Please set HF_ORG_NAME to the name of the Hugging Face org you just created

# Set up the Space

We will use the `huggingface_hub` library to create a new Space for our language effort by duplicating an existing template Space. This could also be done in the UI, but in this notebook we also update some of the Space settings through the API, which reduces the number of steps you need to do by hand. Before we do this, we need to authenticate with the Hub.

In [155]:
from huggingface_hub import login

In [156]:
login()


In [157]:
api = HfApi()

This step duplicates the existing Argilla Space to your organization. 

In [158]:
from_id = "argilla/argilla-template-space"
to_id =  f"{HF_ORG_NAME}/prompt-translation-for-{LANGUAGE}"
new_space = duplicate_space(from_id, to_id=to_id)
new_space


We update the title of the Space to reflect the language we are translating into.

In [48]:
card = SpaceCard.load(to_id)
card.data.title = f"DIBT Translation for {LANGUAGE}"
card.push_to_hub(to_id)

{'title': 'Argilla Space Template', 'sdk': 'docker', 'sdk_version': None, 'python_version': None, 'app_file': None, 'app_port': 6900, 'license': None, 'duplicated_from': None, 'models': None, 'datasets': None, 'tags': ['argilla'], 'emoji': '🏷️', 'colorFrom': 'purple', 'colorTo': 'red', 'fullWidth': True}

## Set up OAuth integration

We will set up OAuth integration for the Space. This makes it possible for anyone with a Hugging Face account to contribute to the translation effort. You can find a full guide on how to do this [here](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html#setting-up-hf-authentication), but we'll walk through the steps in this notebook.

We'll download the `.oauth.yaml` file from the Space we just created, set the `enabled` field to `true` and then upload the file back to the Space.

In [159]:
file = hf_hub_download(
    repo_id=to_id, filename=".oauth.yaml", repo_type="space", local_dir="."
)

In [44]:
with open(file, "r") as f:
    oauth = yaml.safe_load(f)
oauth['enabled'] = True
with open(file, "w") as f:
    f.write(yaml.dump(oauth))

In [27]:
api.upload_file(
    path_or_fileobj=file,
    # Use the repo-relative name; `file` is a local path and may include "./"
    path_in_repo=".oauth.yaml",
    repo_id=to_id,
    repo_type="space",
)

CommitInfo(commit_url='https://huggingface.co/spaces/dibt-testy/dibt-translation/commit/4742ea4faea869d320cb7601394d398e1296d909', commit_message='Upload ./.oauth.yaml with huggingface_hub', commit_description='', oid='4742ea4faea869d320cb7601394d398e1296d909', pr_url=None, pr_revision=None, pr_num=None)

## Create an application on the Hub

To enable the OAuth integration, we need to create an application on the Hub. We can do this via the Hugging Face settings UI.

- Go to this page: [https://huggingface.co/settings/applications/new](https://huggingface.co/settings/applications/new)
- Complete the form to create a new application. You will need to provide the following values:
  - Homepage URL: Your Argilla Space Direct URL.
  - Logo URL: [Your Argilla Space Direct URL]/favicon.ico
  - Scopes: openid and profile.
  - Redirect URL: [Your Argilla Space Direct URL]/oauth/huggingface/callback

The cell below prints the URLs to use for these values.



In [174]:
homepage_url = f"https://{new_space.repo_id.replace('/', '-')}.hf.space"
favicon_url = f"{homepage_url}/favicon.ico"
redirect_url = f"{homepage_url}/oauth/huggingface/callback"
print(f"Homepage URL: {homepage_url} \n Logo URL: {favicon_url} \n Redirect URL: {redirect_url}")

Homepage URL: https://dibt-testy-dibt-translation.hf.space 
 Logo URL: https://dibt-testy-dibt-translation.hf.space/favicon.ico 
 Redirect URL: https://dibt-testy-dibt-translation.hf.space/oauth/huggingface/callback


Once we have created the application, we need to add its credentials to our Space secrets:

- `OAUTH2_HUGGINGFACE_CLIENT_ID`: [Your Client ID]
- `OAUTH2_HUGGINGFACE_CLIENT_SECRET`: [Your App Secret]

You can add these secrets via the `settings` tab in the UI. 
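
As an alternative to the UI, the two secrets can probably also be set from the notebook. The sketch below assumes `api` is the `HfApi` instance created earlier and relies on its `add_space_secret` method; `set_oauth_secrets` is a hypothetical helper name, and the credential values are placeholders that you would replace with the ones shown for your application.

```python
def set_oauth_secrets(api, repo_id: str, client_id: str, client_secret: str) -> None:
    """Store the OAuth application credentials as Space secrets.

    `api` is expected to be a `huggingface_hub.HfApi` instance (assumption).
    """
    secrets = {
        "OAUTH2_HUGGINGFACE_CLIENT_ID": client_id,
        "OAUTH2_HUGGINGFACE_CLIENT_SECRET": client_secret,
    }
    for key, value in secrets.items():
        # One call per secret; each secret is attached to the given Space.
        api.add_space_secret(repo_id=repo_id, key=key, value=value)


# Usage in this notebook (placeholders, not real credentials):
# set_oauth_secrets(api, to_id, client_id="<your client id>", client_secret="<your app secret>")
```

Avoid committing real credentials in a notebook; reading them from an environment variable or a prompt is safer than hard-coding them.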

TODO add instruction on setting up other secrets? 

In [None]:
from huggingface_hub import restart_space

restart_space(to_id, factory_reboot=True)
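
A factory reboot rebuilds the Space, so it may not be reachable immediately. The following is a minimal polling sketch, assuming `api` is an `HfApi` instance and that `get_space_runtime(...).stage` reports `"RUNNING"` once the Space is up. `wait_until_running` is a hypothetical helper, and the timeout and polling interval are arbitrary defaults.

```python
import time


def wait_until_running(api, repo_id: str, timeout: float = 600.0, poll_every: float = 10.0) -> str:
    """Poll the Space runtime until its stage is RUNNING; raise on error stage or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        stage = api.get_space_runtime(repo_id).stage
        if stage == "RUNNING":
            return stage
        if stage.endswith("_ERROR"):
            # Covers stages such as BUILD_ERROR and RUNTIME_ERROR.
            raise RuntimeError(f"Space {repo_id} failed with stage {stage}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Space {repo_id} still in stage {stage} after {timeout} seconds")
        time.sleep(poll_every)


# Usage in this notebook:
# wait_until_running(api, to_id)
```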

## Load the DIBT data into the Argilla Space

In [175]:
from datasets import load_dataset
ds = load_dataset('DIBT/prompts_ranked_multilingual_benchmark')

In [176]:
import argilla as rg

In [None]:
ARGILLA_API_TOKEN = None
assert ARGILLA_API_TOKEN is not None, "Please set ARGILLA_API_TOKEN to the API token you just created"

In [119]:
# Connect the Argilla client to the Space, using the "admin" workspace
rg.init(api_url=homepage_url, api_key=ARGILLA_API_TOKEN, workspace="admin")

In [179]:
argilla_ds = rg.FeedbackDataset.for_translation(
    use_markdown=True,
    guidelines=None,
    metadata_properties=None,
    vectors_settings=None,
)
argilla_ds

FeedbackDataset(
   fields=[TextField(name='source', title='Source', required=True, type='text', use_markdown=True)]
   questions=[TextQuestion(name='target', title='Target', description='Translate the text.', required=True, type='text', use_markdown=True)]
   guidelines=This is a translation dataset that contains texts. Please translate the text in the text field.)
   metadata_properties=[])
   vectors_settings=[])
)

In [124]:
argilla_ds.push_to_argilla(f"DIBT Translation for {LANGUAGE}", "admin")

[03/11/24 13:06:49] INFO     INFO:argilla.client.feedback.dataset.local.mixins:✓ Dataset succesfully pushed to Argilla     mixins.py:281

RemoteFeedbackDataset(
   id=ec60583c-a420-4f18-97f4-18c486426f7f
   name=DIBT Translation for test
   workspace=Workspace(id=5b8d771c-576e-4519-935c-097f47d82832, name=admin, inserted_at=2024-03-11 12:47:22.332738, updated_at=2024-03-11 12:47:22.332738)
   url=https://dibt-testy-dibt-translation.hf.space/dataset/ec60583c-a420-4f18-97f4-18c486426f7f/annotation-mode
   fields=[RemoteTextField(id=UUID('84e58b01-12b2-4813-a95d-b0a394815f2a'), client=None, name='source', title='Source', required=True, type='text', use_markdown=True)]
   questions=[RemoteTextQuestion(id=UUID('ed49f771-67e8-4422-96e1-2724c089e34f'), client=None, name='target', title='Target', description=None, required=True, type='text', use_markdown=True)]
   guidelines=This is a translation dataset that contains texts. Please translate the text in the text field.
   metadata_properties=[]
   vectors_settings=[]
)

In [184]:
dataset = rg.FeedbackDataset.from_argilla(f"DIBT Translation for {LANGUAGE}", workspace="admin")

In [132]:
records = []
for row in ds["train"]:
    record = rg.FeedbackRecord(
        fields={"source": row["prompt"]},
        metadata=json.loads(row["metadata"]),
        external_id=row["row_idx"],
    )
    records.append(record)
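
The loop above assumes that every row stores its `metadata` as a JSON string. If some rows might have missing or already-parsed metadata, a small defensive wrapper could be used instead of calling `json.loads` directly. This is a sketch under that assumption; `parse_metadata` is a hypothetical helper and the example key is made up for illustration.

```python
import json


def parse_metadata(raw) -> dict:
    """Return row metadata as a dict, tolerating missing or already-parsed values."""
    if raw is None or raw == "":
        return {}
    if isinstance(raw, dict):
        return raw
    parsed = json.loads(raw)
    # Only a JSON object maps cleanly onto record metadata; ignore anything else.
    return parsed if isinstance(parsed, dict) else {}


print(parse_metadata('{"example_key": 1}'))
print(parse_metadata(None))
```

In the loop, this would replace `json.loads(row["metadata"])` with `parse_metadata(row["metadata"])`.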

In [133]:
dataset.add_records(records)

[2KPushing records to Argilla... [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [36m0:00:00[0m[35m 96%[0m [36m0:00:01[0m:01[0m
[?25h