# Deep Search integrations - Argilla.io

In this example, we use the output of a converted document to populate a dataset on
[Argilla](https://argilla.io). This lets the user annotate text for multiple purposes,
e.g. text classification or named entity recognition, and train custom models that fit their needs.


### Set up your environment

This example requires a connection to a running Argilla instance.

The [README](./README.md) file of this example describes in more detail how to set it up.


### Set notebook parameters

The following block defines the parameters specific to this example notebook:

- `INPUT_FILE`: the input PDF to be converted and analyzed
- `ARGILLA_API_URL`: the API URL of the Argilla instance
- `ARGILLA_API_KEY`: the API Key of the Argilla instance
- `ARGILLA_DATASET`: the name of the dataset on Argilla
- `SPACY_MODEL`: the spaCy model to use for tokenization
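
Because `ARGILLA_API_URL` and `ARGILLA_API_KEY` are read from environment variables, a missing variable surfaces as a bare `KeyError`. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper that is not part of this notebook or of any library, and the URL and key are placeholders:

```python
import os
from typing import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of a required environment variable.

    Hypothetical helper for illustration: raises a descriptive error
    instead of a bare KeyError when the variable is unset or empty.
    """
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Export it before running the notebook."
        )
    return value


# An explicit mapping is used here so the sketch runs without a real Argilla instance.
# The values below are placeholders, not real credentials.
example_env = {
    "ARGILLA_API_URL": "http://localhost:6900",
    "ARGILLA_API_KEY": "example-key",
}
api_url = require_env("ARGILLA_API_URL", example_env)
api_key = require_env("ARGILLA_API_KEY", example_env)
print(api_url)
```

In the notebook itself, the second argument could be left out so that the helper reads from `os.environ`.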
    

In [1]:
from dsnotebooks.settings import ProjectNotebookSettings
import os
from pathlib import Path

# notebook settings auto-loaded from .env / env vars
notebook_settings = ProjectNotebookSettings()

PROFILE_NAME = notebook_settings.profile  # the profile to use
PROJ_KEY = notebook_settings.proj_key  # the project to use

INPUT_FILE = Path("../../data/samples/2206.00785.pdf")

# Argilla configuration
ARGILLA_API_URL = os.environ["ARGILLA_API_URL"]  # required env var
ARGILLA_API_KEY = os.environ["ARGILLA_API_KEY"]  # required env var
ARGILLA_DATASET = "deepsearch-documents"
# Tokenization
SPACY_MODEL = "en_core_web_sm"

In [2]:
# Import standard dependencies
import json
import tempfile
import typing
from zipfile import ZipFile

# IPython utilities
from IPython.display import display, Markdown, HTML, display_html

# Import the deepsearch-toolkit
import deepsearch as ds

# Import specific to the example
import argilla as rg
import spacy
from pydantic import BaseModel

In [3]:
# Download the spaCy model
!python -m spacy download {SPACY_MODEL}

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m17.9 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.5.0
    Uninstalling en-core-web-sm-3.5.0:
      Successfully uninstalled en-core-web-sm-3.5.0
Successfully installed en-core-web-sm-3.6.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
class DocTextSegment(BaseModel):
    page: int  # page number
    idx: int  # index of text segment in the document
    title: str  # title of the document
    name: str  # flavour of text segment
    type: str  # type of text segment
    text: str  # content of the text segment
    text_classification: typing.Any = (
        None  # this could be used to store predictions of text classification
    )
    token_classification: typing.Any = (
        None  # this could be used to store predictions of token classification
    )

## Document conversion with Deep Search

In [5]:
# Connect to Deep Search
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

In [6]:
# Launch the document conversion
documents = ds.convert_documents(
    api=api, proj_key=PROJ_KEY, source_path=INPUT_FILE, progress_bar=True
)

Processing input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00, 214.55it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:04<00:00,  4.34s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:17<00:00, 17.77s/it]


In [7]:
output_dir = tempfile.mkdtemp()

documents.download_all(result_dir=output_dir, progress_bar=True)

converted_docs = {}
for output_file in Path(output_dir).rglob("json*.zip"):
    with ZipFile(output_file) as archive:
        all_files = archive.namelist()
        for name in all_files:
            if not name.endswith(".json"):
                continue

            doc_jsondata = json.loads(archive.read(name))
            converted_docs[f"{output_file}//{name}"] = doc_jsondata

print(f"{len(converted_docs)} documents have been loaded after conversion.")

Downloading result:   : 100%|██████████████████████████████| 1/1 [00:00<00:00,  1.46it/s]

1 documents have been loaded after conversion.
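
The loop above treats each downloaded result as a ZIP archive and parses every `.json` member. Below is a minimal, self-contained sketch of the same pattern, using an in-memory archive with made-up content in place of a real Deep Search result. The helper name `load_json_members` and the sample file names are assumptions for illustration only:

```python
import io
import json
from zipfile import ZipFile


def load_json_members(archive_bytes: bytes) -> dict:
    """Parse every .json member of a ZIP archive into a dict keyed by member name."""
    docs = {}
    with ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue  # skip non-JSON members
            docs[name] = json.loads(archive.read(name))
    return docs


# Build a small in-memory archive with made-up content.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("example.json", json.dumps({"description": {"title": "Example"}}))
    archive.writestr("example.md", "# not a JSON file")

loaded = load_json_members(buffer.getvalue())
print(sorted(loaded))
```

Keying the result by member name, as the notebook does with the archive path plus member name, keeps documents apart when several archives contain files with the same name.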





## Extract text segments

In [8]:
text_segments = []
for doc in converted_docs.values():

    doc_title = doc.get("description", {}).get("title", "")
    for idx, text_segment in enumerate(doc["main-text"]):
        # keep only components that contain text
        if "text" not in text_segment:
            continue

        # append the component to the list of segments
        text_segments.append(
            DocTextSegment(
                title=doc_title,
                page=text_segment.get("prov", [{}])[0].get("page"),
                idx=idx,
                name=text_segment.get("name"),
                type=text_segment.get("type"),
                text=text_segment.get("text"),
            )
        )

print(f"{len(text_segments)} text segments got extracted from the document")

133 text segments got extracted from the document
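
The extraction above assumes that every text component carries at least one provenance entry. Below is a hedged sketch of the same filtering on a hand-written dictionary, showing one way to tolerate missing fields. The sample document is invented and only mimics the keys used in this notebook (`description`, `main-text`, `prov`, `page`); `first_page` and `extract_segments` are hypothetical helpers:

```python
from typing import Any, Dict, List, Optional


def first_page(component: Dict[str, Any]) -> Optional[int]:
    """Return the page of the first provenance entry, or None if there is none."""
    prov = component.get("prov") or [{}]
    return prov[0].get("page")


def extract_segments(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect components that contain text, keeping their index in main-text."""
    title = (doc.get("description") or {}).get("title", "")
    segments = []
    for idx, component in enumerate(doc.get("main-text", [])):
        if "text" not in component:
            continue  # components without text are skipped
        segments.append(
            {
                "title": title,
                "page": first_page(component),
                "idx": idx,
                "name": component.get("name"),
                "type": component.get("type"),
                "text": component["text"],
            }
        )
    return segments


# Invented sample document for illustration.
sample_doc = {
    "description": {"title": "Example"},
    "main-text": [
        {"name": "title", "type": "title", "text": "Example", "prov": [{"page": 1}]},
        {"name": "table", "type": "table"},  # no "text" key, skipped
        {"name": "text", "type": "paragraph", "text": "No provenance here.", "prov": []},
    ],
}

segments = extract_segments(sample_doc)
print([(s["idx"], s["page"]) for s in segments])
```

If `page` can be `None`, as in the last sample component, the `page: int` field of `DocTextSegment` would need to be declared optional for such a segment to pass validation.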


## Log the text segments to Argilla

In [None]:
# Initialize the Argilla SDK
rg.init(api_url=ARGILLA_API_URL, api_key=ARGILLA_API_KEY)

# Initialize the spaCy NLP model for the tokenization of the text
nlp = spacy.load(SPACY_MODEL)

In [None]:
# Prepare text segments for text classification

records_text_classification = []
for segment in text_segments:
    records_text_classification.append(
        rg.TextClassificationRecord(
            text=segment.text,
            vectors={},
            prediction=segment.text_classification,
            metadata=segment.dict(
                exclude={"text", "text_classification", "token_classification"}
            ),
        )
    )

In [None]:
# Submit text for classification
rg.log(records_text_classification, name=f"{ARGILLA_DATASET}-text")

In [None]:
# Prepare text segments for token classification

records_token_classification = []
for segment in text_segments:
    records_token_classification.append(
        rg.TokenClassificationRecord(
            text=segment.text,
            tokens=[token.text for token in nlp(segment.text)],
            prediction=segment.token_classification,
            vectors={},
            metadata=segment.dict(
                exclude={"text", "text_classification", "token_classification"}
            ),
        )
    )

In [None]:
# Submit tokens for classification
rg.log(records_token_classification, name=f"{ARGILLA_DATASET}-token")
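
Because a token classification record carries both the text and its tokens, it can be useful to confirm that the tokens line up with the text before logging. Below is a minimal sanity check written with plain string operations. `token_offsets` is a hypothetical helper, not part of spaCy or Argilla, and the whitespace split is only a stand-in for the spaCy tokenizer used in this notebook:

```python
from typing import List, Optional, Tuple


def token_offsets(text: str, tokens: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) character offsets for each token, or None if a token
    cannot be found in the text at or after the end of the previous one."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            return None
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "Deep Search converts PDF documents."
tokens = text.split()  # stand-in for [token.text for token in nlp(text)]
print(token_offsets(text, tokens))
print(token_offsets(text, ["Search", "Deep"]))  # out of order, so None
```

Segments for which the helper returns `None` could be inspected or skipped before they are sent for annotation.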

## What's next?

Now that the documents are converted and uploaded to Argilla, you can use the links printed above to annotate them and train your own models.

Visit the [Argilla documentation](https://docs.argilla.io) to learn about its features and check out the [Deep Dive Guides](https://docs.argilla.io/en/latest/guides/guides.html) and [Tutorials](https://docs.argilla.io/en/latest/tutorials/tutorials.html).
