# Bring Your Own Converted Documents*

In some use cases you will ingest documents that have already been converted or parsed. For example,
you might have generated your documents directly in JSON format, or you might have converted them
with Deep Search, exported and modified the results, and now want to re-ingest them.

In this example we demonstrate how a folder with converted JSON documents can be uploaded to a data collection
in Deep Search.


*Deprecated; in the future we will accept only Docling Docs.


## Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.


### Set notebook parameters

In [1]:
from dsnotebooks.settings import ProjectNotebookSettings
from pathlib import Path
import tempfile

# notebook settings auto-loaded from .env / env vars
notebook_settings = ProjectNotebookSettings()

PROFILE_NAME = notebook_settings.profile  # profile to use
PROJ_KEY = notebook_settings.proj_key  # project to use
INDEX_NAME = notebook_settings.new_idx_name  # index to create
CLEANUP = notebook_settings.cleanup  # whether to clean up
INPUT_FILES_FOLDER = Path("../../data/converted/")
TMP_DIR = tempfile.TemporaryDirectory()

### Import example dependencies

In [2]:
# Import standard dependencies
from copy import deepcopy
import json
from tqdm.notebook import tqdm
import pandas as pd

# IPython utilities
from IPython.display import display, Markdown, HTML

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.queries import DataQuery
from deepsearch.cps.data_indices import utils as data_indices_utils

### Connect to Deep Search

In [3]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

---

### Create new data index

In [4]:
# Create a new data index in your project
data_index = api.data_indices.create(proj_key=PROJ_KEY, name=INDEX_NAME)

## Upload the data

When uploading multiple converted documents in JSON format, we can either upload one file at a time or package all the documents into a single JSONL input.
In the [JSONL format](https://jsonlines.org/), each line of the file is a complete, independent JSON object.

The following code reads all the JSON files in the input folder and writes them to a single JSONL file, which is then uploaded to Deep Search.


In [5]:
# Create the input_filename
input_dir = Path(TMP_DIR.name)
input_filename = input_dir / "upload.jsonl"

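with input_filename.open("w") as f:
    for doc_filename in INPUT_FILES_FOLDER.glob("*.json"):
        # Load each converted document, closing the file handle afterwards
        with doc_filename.open() as doc_file:
            doc = json.load(doc_file)
        # Write the document as a single line of the JSONL file
        f.write(json.dumps(doc) + "\n")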
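
Before launching the upload, it can help to confirm that the generated file really is valid JSONL, i.e. that every non-empty line parses on its own. The following is a minimal sketch using only the standard library; `count_jsonl_records` is a hypothetical helper name and not part of the deepsearch-toolkit.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count the records in a JSONL file; raise ValueError on the first bad line.

    Hypothetical helper for illustration only.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)  # each line must be a complete JSON value
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
            count += 1
    return count
```

In this notebook it could be called as `count_jsonl_records(input_filename)`; the result would be expected to equal the number of `*.json` files found in `INPUT_FILES_FOLDER`.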

In [6]:
# Launch the Deep Search upload
task = api.data_indices.upload(coords=data_index.source, source=input_filename)

In [7]:
# Wait for the task to complete
api.tasks.wait_for(PROJ_KEY, task.task_id)

{'data': {'errors': 0, 'success': 3}, 'name': 'cps-upload'}
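
The result printed above reports how many documents were accepted and how many failed. A small check like the following could stop the notebook early when any document is rejected. This is only a sketch: it assumes the result is a dictionary shaped like the output above, `check_upload_result` is a hypothetical helper, and the actual return type of `api.tasks.wait_for` may differ.

```python
def check_upload_result(result, expected=None):
    """Return the number of uploaded documents, raising if any upload failed.

    Assumes `result` looks like {'data': {'errors': int, 'success': int}, ...}.
    Hypothetical helper for illustration only.
    """
    data = result.get("data", {})
    errors = data.get("errors", 0)
    success = data.get("success", 0)
    if errors:
        raise RuntimeError(
            f"{errors} document(s) failed to upload ({success} succeeded)"
        )
    if expected is not None and success != expected:
        raise RuntimeError(f"expected {expected} uploaded documents, got {success}")
    return success
```

For the result shown above, `check_upload_result(result, expected=3)` would pass, since there are no errors and three successes.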

In [8]:
display(
    Markdown(
        f"The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />{api.client.config.host}/projects/{PROJ_KEY}/library/private/{data_index.source.index_key}"
    )
)

The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />https://sds.app.accelerate.science//projects/b09ae7561a01dc7c4b0fd21a43bfd93d140766d1/library/private/6e7917be8384cff343e3e51de3e1716423773d66
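
Note the doubled slash after the host name in the URL printed above: the configured host apparently ends with `/`, and the f-string adds another one. A minimal sketch of a helper that avoids this is shown below. `build_library_url` is a hypothetical name, and the path layout is simply copied from the f-string in the cell above.

```python
def build_library_url(host, proj_key, index_key):
    """Build the UI link for a private data index without a doubled slash.

    Hypothetical helper; the path layout mirrors the f-string used above.
    """
    base = str(host).rstrip("/")  # drop any trailing slash from the host
    return f"{base}/projects/{proj_key}/library/private/{index_key}"
```

In this notebook it could be called as `build_library_url(api.client.config.host, PROJ_KEY, data_index.source.index_key)`.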

---

### Query your data

In [9]:
# Count the documents in the data index
query = DataQuery("*", source=[""], limit=0, coordinates=data_index.source)
query_results = api.queries.run(query)
num_results = query_results.outputs["data_count"]
print(f"The data index contains {num_results} entries.")

The data index contains 0 entries.


In [10]:
# Find documents matching query
search_query = "speedup"
query = DataQuery(
    search_query,
    source=["description.title", "description.authors"],
    coordinates=data_index.source,
)

all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor):
    # Iterate through the results of a single page, and add to the total list
    for row in result_page.outputs["data_outputs"]:
        # Add row to results table
        all_results.append(
            {
                "Title": row["_source"]["description"]["title"],
                "Authors": ", ".join(
                    [
                        author["name"]
                        for author in row["_source"]["description"].get("authors", [])
                    ]
                ),
            }
        )

num_results = len(all_results)
print(f"Finished fetching all data. Total is {num_results} records.")

Finished fetching all data. Total is 2 records.
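
The loop above indexes `row["_source"]["description"]["title"]` directly, which raises a `KeyError` for a document without a title. A more defensive variant is sketched below. `row_to_record` is a hypothetical helper, and the row layout is assumed to match what the cell above reads.

```python
def row_to_record(row):
    """Flatten one query result row into a Title/Authors record.

    Missing fields become empty strings instead of raising KeyError.
    Hypothetical helper for illustration only.
    """
    description = row.get("_source", {}).get("description", {})
    authors = description.get("authors") or []
    return {
        "Title": description.get("title", ""),
        # Skip author entries that have no usable "name" field
        "Authors": ", ".join(a["name"] for a in authors if a.get("name")),
    }
```

Inside the pagination loop, `all_results.append(row_to_record(row))` could then replace the inline dictionary.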


In [11]:
# Visualize the table with all results
df = pd.json_normalize(all_results)
display(HTML(df.head().to_html(render_links=True)))

| | Title | Authors |
|---|---|---|
| 0 | Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | |
| 1 | Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale. | |


---

### Cleanup
If enabled, we delete all the resources created in this example.

In [12]:
# Delete data index
if CLEANUP:
    api.data_indices.delete(data_index.source)
    print("Data index deleted")
    TMP_DIR.cleanup()
    print("Temporary directory deleted")

Data index deleted
Temporary directory deleted
