# Bring Your Own PDFs

In this example, we combine the document conversion capabilities of Deep Search with its data query capabilities.
From the Deep Search Workspace, we create a new project data index that can host our own PDF documents.
Once the upload is complete, we can query the documents in the same way as the public data
explored in the [Data query quick start example](../data_query_quick_start/).
In the last steps of the example, we also export the converted documents as JSON and Markdown files.


Sections
1. [Create data index and upload data](#Create-data-index-and-upload-data)
2. [Query your data](#Query-your-data)
3. [Download your data](#Download-your-data)
4. Custom upload settings
    1. [Enable OCR](#Enable-OCR)
    2. [Enable raw PDF cells](#Enable-raw-PDF-cells)


### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.

### Set notebook parameters

In [20]:
from dsnotebooks.settings import ProjectNotebookSettings
from pathlib import Path

# notebook settings auto-loaded from .env / env vars
notebook_settings = ProjectNotebookSettings()

PROFILE_NAME = notebook_settings.profile  # profile to use
PROJ_KEY = notebook_settings.proj_key  # project to use
INDEX_NAME = notebook_settings.new_idx_name  # index to create
CLEANUP = notebook_settings.cleanup  # whether to clean up
INPUT_FILE_PATH = Path("../../data/samples/2206.00785.pdf")
INPUT_OCR_FILE = Path("../../data/scanned-samples/2206.00785-7.png")

############
# List of resources to clean up at the end of the execution (if CLEANUP=True)
_GARBAGE_COLLECTOR = []

print(f"The example will be executed on the Deep Search instance {PROFILE_NAME}")

The example will be executed on the Deep Search instance pr-516


### Import example dependencies

In [4]:
# Import standard dependencies
from copy import deepcopy
import json
from tqdm.notebook import tqdm
import pandas as pd
import tempfile

# IPython utilities
from IPython.display import display, Markdown, HTML

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.documents.core.export import export_to_markdown
from deepsearch.cps.queries import DataQuery
from deepsearch.cps.data_indices import utils as data_indices_utils

### Connect to Deep Search

In [5]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

---

### Create data index and upload data

In [6]:
# Create a new data index in your project
data_index = api.data_indices.create(proj_key=PROJ_KEY, name=INDEX_NAME)
_GARBAGE_COLLECTOR.append(data_index)
index_key = data_index.source.index_key

In [7]:
# Upload and convert documents
data_indices_utils.upload_files(
    api=api, coords=data_index.source, local_file=INPUT_FILE_PATH
)

['SUCCESS']

In [8]:
display(
    Markdown(
        f"The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />{api.client.config.host}/projects/{PROJ_KEY}/library/private/{index_key}"
    )
)

The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />https://pr-516-cps-dev.deepsearch-dev.zurich.ibm.com/projects/5968aaefaabfbe404689ab063a5cc29fe87be9ec/library/private/30a01bde832eb410dbc51591524a30e916a923d3

---

### Query your data

In [9]:
# Count the documents in the data index
query = DataQuery("*", source=[""], limit=0, coordinates=data_index.source)
query_results = api.queries.run(query)
num_results = query_results.outputs["data_count"]
print(f"The data index contains {num_results} entries.")

The data index contains 1 entries.


In [10]:
# Find documents matching query
search_query = "speedup"
query = DataQuery(
    search_query,
    source=["file-info.filename", "description.title", "description.authors"],
    coordinates=data_index.source,
)
query_results = api.queries.run(query)

all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor):
    # Iterate through the results of a single page, and add to the total list
    for row in result_page.outputs["data_outputs"]:
        print()
        # Default to an empty dict in case no title and authors are detected
        metadata = row["_source"].get("description", {})
        # Add row to results table
        all_results.append(
            {
                "Filename": row["_source"]["file-info"]["filename"],
                "Title": metadata.get("title", ""),
                "Authors": ", ".join(
                    [author["name"] for author in metadata.get("authors", [])]
                ),
            }
        )

num_results = len(all_results)
print(f"Finished fetching all data. Total is {num_results} records.")



Finished fetching all data. Total is 1 records.
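
The row handling in the loop above can be tried without a Deep Search connection. The following is a minimal sketch that assumes each hit has the `_source` layout used in this notebook; `summarize_row` is a hypothetical helper name and the sample rows are hand-written, not real query output.

```python
from typing import Any


def summarize_row(row: dict[str, Any]) -> dict[str, str]:
    """Flatten one query hit into the columns used in the results table."""
    source = row["_source"]
    # "description" may be absent, or lack title/authors, when none were detected
    metadata = source.get("description", {})
    authors = [author["name"] for author in metadata.get("authors", [])]
    return {
        "Filename": source["file-info"]["filename"],
        "Title": metadata.get("title", ""),
        "Authors": ", ".join(authors),
    }


# Hand-written sample rows (illustrative values, not real query output)
rows = [
    {"_source": {"file-info": {"filename": "a.pdf"}}},
    {
        "_source": {
            "file-info": {"filename": "b.pdf"},
            "description": {
                "title": "Example title",
                "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
            },
        }
    },
]
table = [summarize_row(row) for row in rows]
for entry in table:
    print(entry)
```

The first sample row has no `description` at all, which mirrors the empty Title and Authors columns in the results table of this example.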


In [11]:
# Visualize the table with all results
df = pd.json_normalize(all_results)
display(
    Markdown(f"#### Results\nDocuments matching the search query '{search_query}':")
)
display(HTML(df.head().to_html(render_links=True)))

#### Results
Documents matching the search query 'speedup':

| | Filename | Title | Authors |
|---|---|---|---|
| 0 | 2206.00785.pdf | | |


---

### Download your data

In [12]:
# Run query
query = DataQuery(search_query="*", source=["*"], coordinates=data_index.source)
cursor = api.queries.run_paginated_query(query)

# Using a temp dir for demo purposes; to persist instead, set output dir accordingly
temp_dir = tempfile.TemporaryDirectory()
output_dir = temp_dir.name

# Iterate through query results
all_results = []
for result_page in tqdm(cursor):
    for row in result_page.outputs["data_outputs"]:
        print(row)
        # Default to an empty dict in case no title and authors are detected
        metadata = row["_source"].get("description", {})

        # Download JSON file
        file_path_json = Path(output_dir) / f"{row['_id']}.json"
        with open(file_path_json, "w") as outfile:
            json.dump(row["_source"], outfile, indent=2)

        # Export JSON to Markdown
        file_path_md = Path(output_dir) / f"{row['_id']}.md"
        with open(file_path_md, "w") as outfile:
            outfile.write(export_to_markdown(row["_source"]))

        all_results.append(
            {
                "Filename": row["_source"]["file-info"]["filename"],
                "Title": metadata.get("title", ""),
                "JSON Path": file_path_json,
                "Markdown Path": file_path_md,
            }
        )

print(f"Finished fetching all data. Total is {len(all_results)} records.")
print(f"Data downloaded in {output_dir}")

# Visualize a table listing document titles and locations
df = pd.json_normalize(all_results)
display(df)


{'_index': 'cps-dev-deepsearch-dev-projdata30a01b', '_id': '6627d1b67955c51ff1aa8858de671bb5a62ad70c77e62e0ac57c153d0078b7ea', '_score': None, '_source': {'_name': '', 'type': 'pdf-document', 'description': {'logs': [{'agent': 'CPS Docling', 'type': 'parsing', 'comment': 'Docling 2.7.1 parsing of documents', 'date': '2025-01-16T09:34:37.575600+00:00'}], 'collection': {'type': 'Document'}}, 'file-info': {'filename': '2206.00785.pdf', 'document-hash': '6627d1b67955c51ff1aa8858de671bb5a62ad70c77e62e0ac57c153d0078b7ea', '#-pages': 11, 'page-hashes': [{'hash': 'dc8ab77215bdf5d1c50e4635bc886078ceb862131df9de406624cb73be373fc1', 'model': 'default', 'page': 1}, {'hash': '58474aac22030f7608c307aeb1aee7b69b8b4ca533c80ae41ab68153790d6ab0', 'model': 'default', 'page': 2}, {'hash': '1d96fb50e9a4d970cab5199fa3b76469b804c2457390c4bdda04890bc651b813', 'model': 'default', 'page': 3}, {'hash': '4dc4d130627b1f9841208469936e65cf4667f47730044ef31ecbf9664a41d706', 'model': 'default', 'page': 4}, {'hash': '4

| | Filename | Title | JSON Path | Markdown Path |
|---|---|---|---|---|
| 0 | 2206.00785.pdf | | /var/folders/6l/wtzmg05s4_9364_5dntvydx00000gn... | /var/folders/6l/wtzmg05s4_9364_5dntvydx00000gn... |
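
The download cell creates its `TemporaryDirectory` without a `with` block, so the files stay on disk until the object is cleaned up. As a minimal sketch of the alternative, the snippet below writes rows to JSON inside a context-managed temporary directory and shows that the files disappear on exit. `export_rows` is a hypothetical helper and the sample row is illustrative; to persist the files, pass a regular directory path instead.

```python
import json
import tempfile
from pathlib import Path


def export_rows(rows, output_dir):
    """Write each row's "_source" to <output_dir>/<_id>.json and return the paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for row in rows:
        path = output_dir / f"{row['_id']}.json"
        with open(path, "w", encoding="utf-8") as outfile:
            json.dump(row["_source"], outfile, indent=2)
        paths.append(path)
    return paths


# Illustrative row, not real query output
sample_rows = [{"_id": "doc-1", "_source": {"type": "pdf-document"}}]

with tempfile.TemporaryDirectory() as tmp:
    written = export_rows(sample_rows, tmp)
    exists_inside = written[0].exists()  # file is present while the block is open
    reloaded = json.loads(written[0].read_text(encoding="utf-8"))

exists_after = written[0].exists()  # directory and file are removed on exit
print(exists_inside, exists_after)
```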


In [13]:
# Peek first lines of a downloaded file
with open(df.iloc[0]["Markdown Path"]) as demo_file:
    content = ""
    for _ in range(20):
        line = demo_file.readline()
        content += line

    display(Markdown("## Markdown content"))
    display(Markdown(content))

with open(df.iloc[0]["JSON Path"]) as demo_file:
    content = ""
    for _ in range(20):
        line = demo_file.readline()
        content += line
    display(Markdown("## JSON content"))
    display(Markdown(f"<code>{content}</code>"))

## Markdown content

## Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

1 st Christoph Auer IBM Research Ruschlikon, Switzerland cau@zurich.ibm.com

4 th Cesar Berrospi Ramis IBM Research Ruschlikon, Switzerland ceb@zurich.ibm.com

2 nd Michele Dolfi IBM Research Ruschlikon, Switzerland dol@zurich.ibm.com

5 th Peter W.J. Staar IBM Research Ruschlikon, Switzerland taa@zurich.ibm.com

Abstract -Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.

Index Terms -cloud applications, document understanding, distributed computing, artificial intelligence

## I. INTRODUCTION

Over the past decade, many organizations have accelerated their transformation into data-driven businesses, as studies have shown its positive impact in efficiency, decision making, or financial performance [1], [2]. Leading companies are increasingly deploying workloads on public and private cloud infrastructure, including business intelligence processing and machine learning models in data analytics platforms [3]. This is owed to several factors such as high availability, lower cost for compute, and storage [4], as well as the flexibility to scale up or down a cloud-based business process to fit the operational needs. Workloads and services can be container-

ized, deployed, and orchestrated through widely adopted and standardized platforms like Kubernetes [5], [6].



## JSON content

<code>{
  "_name": "",
  "type": "pdf-document",
  "description": {
    "logs": [
      {
        "agent": "CPS Docling",
        "type": "parsing",
        "comment": "Docling 2.7.1 parsing of documents",
        "date": "2025-01-16T09:34:37.575600+00:00"
      }
    ],
    "collection": {
      "type": "Document"
    }
  },
  "file-info": {
    "filename": "2206.00785.pdf",
    "document-hash": "6627d1b67955c51ff1aa8858de671bb5a62ad70c77e62e0ac57c153d0078b7ea",
    "#-pages": 11,
</code>
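
The two peek loops above repeat the same read-20-lines pattern. One possible way to factor it out, sketched here with a hypothetical `peek_lines` helper and a throwaway demo file, is `itertools.islice`, which stops after the requested number of lines and also copes with files shorter than that.

```python
import tempfile
from itertools import islice
from pathlib import Path


def peek_lines(path, num_lines=20):
    """Return the first `num_lines` lines of a text file as one string."""
    with open(path, encoding="utf-8") as handle:
        return "".join(islice(handle, num_lines))


with tempfile.TemporaryDirectory() as tmp:
    # Throwaway file standing in for a downloaded Markdown or JSON export
    demo_path = Path(tmp) / "demo.md"
    demo_path.write_text("".join(f"line {i}\n" for i in range(50)), encoding="utf-8")
    preview = peek_lines(demo_path, num_lines=3)

print(preview)
```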

---

## Enable OCR

This section uses the `ConversionSettings` object to enable OCR when converting PDF documents.

Refer to the [OCR settings documentation](https://ds4sd.github.io/deepsearch-toolkit/guide/convert_doc/#modify-ocr-settings) for more details. 


In [14]:
from deepsearch.documents.core.models import ConversionSettings

In [15]:
# Create a new data index to process with OCR
data_index = api.data_indices.create(proj_key=PROJ_KEY, name=INDEX_NAME + "-ocr")
_GARBAGE_COLLECTOR.append(data_index)

In [16]:
# Load conversion settings and enable OCR
cs = ConversionSettings()
cs.ocr.do_ocr = True  # Enable or disable OCR

# Upload and convert documents with custom conversion settings
data_indices_utils.upload_files(
    api=api, coords=data_index.source, local_file=INPUT_OCR_FILE, conv_settings=cs
)

# Display message
display(
    Markdown(
        f"#### Results\nThe data is now available. This file will now display the text from the scanned pages. Access it via the Deep Search UI at <br />{api.client.config.host}/projects/{data_index.source.proj_key}/library/private/{data_index.source.index_key}"
    )
)

#### Results
The data is now available. This file will now display the text from the scanned pages. Access it via the Deep Search UI at <br />https://pr-516-cps-dev.deepsearch-dev.zurich.ibm.com/projects/5968aaefaabfbe404689ab063a5cc29fe87be9ec/library/private/85072297d15164214783cc773225502827fa1504

---

## Enable raw PDF cells

The document conversion pipeline produces a JSON file corresponding to each PDF document, in which all document components have been grouped, classified, and further inspected (e.g. table structure) for ease of use.

However, in some use cases it is convenient to rely on the raw text cells contained in the PDF document.
This is an auxiliary file that Deep Search makes available on demand.
The following section demonstrates how to enable it.


In [17]:
from deepsearch.documents.core.models import TargetSettings

In [18]:
# Create a new data index to process with raw PDF cells enabled
data_index = api.data_indices.create(proj_key=PROJ_KEY, name=INDEX_NAME + "-raw")
_GARBAGE_COLLECTOR.append(data_index)

In [21]:
# Set custom target settings with raw pdf cells enabled
tsettings = TargetSettings(add_raw_pages=True, add_annotations=False)

# Upload and convert documents with custom target settings
data_indices_utils.upload_files(
    api=api,
    coords=data_index.source,
    local_file=INPUT_FILE_PATH,
    target_settings=tsettings,
)

['SUCCESS']

In [22]:
# Run query
query = DataQuery(
    search_query="*",
    source=["file-info.filename", "_s3_data.raw-pages"],
    coordinates=data_index.source,
)
cursor = api.queries.run_paginated_query(query)

# Iterate through query results
all_results = []
for result_page in cursor:
    for row in result_page.outputs["data_outputs"]:
        filename = row["_source"]["file-info"]["filename"]
        for raw_page in row["_source"]["_s3_data"]["raw-pages"]:

            all_results.append(
                {
                    "Filename": filename,
                    "Page": raw_page["page"],
                    "RAW file": f"<a target='_blank' href='{raw_page['url']}'>Link</a>",
                }
            )

print(f"Finished fetching all data. Total is {len(all_results)} records.")
print(f"Data downloaded in {output_dir}")
display(
    Markdown(
        "#### Results\nHere is the list of the files uploaded and the urls where to download the raw pdf cells details."
    )
)

# Visualize a table listing the raw page links per document
df = pd.json_normalize(all_results)
display(HTML(df.to_html(render_links=True, escape=False)))

Finished fetching all data. Total is 0 records.
Data downloaded in /var/folders/6l/wtzmg05s4_9364_5dntvydx00000gn/T/tmpec1yboxd


#### Results
Here is the list of the files uploaded and the urls where to download the raw pdf cells details.
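
The run above reports 0 records, and the loop indexes `row["_source"]["_s3_data"]["raw-pages"]` directly, which would raise a `KeyError` for any row that lacks raw-page data. The following is a minimal sketch of a more defensive variant, assuming the same `page` and `url` fields as in the cell above; `collect_raw_pages` is a hypothetical helper and the URL is a placeholder.

```python
def collect_raw_pages(rows):
    """Return one record per raw page, skipping rows without raw-page data."""
    records = []
    for row in rows:
        source = row.get("_source", {})
        filename = source.get("file-info", {}).get("filename", "")
        for raw_page in source.get("_s3_data", {}).get("raw-pages", []):
            records.append(
                {"Filename": filename, "Page": raw_page["page"], "URL": raw_page["url"]}
            )
    return records


# Illustrative rows; the URL is a placeholder, not a real download link
rows = [
    {"_source": {"file-info": {"filename": "a.pdf"}}},  # no raw pages available
    {
        "_source": {
            "file-info": {"filename": "b.pdf"},
            "_s3_data": {"raw-pages": [{"page": 1, "url": "https://example.com/p1"}]},
        }
    },
]
records = collect_raw_pages(rows)
if not records:
    print("No raw pages found.")
print(records)
```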

### Cleanup
If enabled, we delete all the resources created in this example.

In [23]:
# Delete data index
if CLEANUP:
    for data_index in _GARBAGE_COLLECTOR:
        api.data_indices.delete(data_index.source)
        print(f"Data index {data_index.name} deleted")
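
Because the cleanup lives in the last cell, it does not run if an earlier cell fails. Outside a notebook, one option is to wrap the steps in `try`/`finally`. The sketch below illustrates the pattern with stand-ins only: `run_with_cleanup` is a hypothetical helper, the strings play the role of data indices, and `deleted.append` stands in for the real delete call.

```python
def run_with_cleanup(steps, delete, cleanup=True):
    """Run `steps` in order; each returns a resource that is tracked for deletion.

    Tracked resources are deleted in reverse creation order, even if a step raises.
    """
    tracked = []  # plays the role of _GARBAGE_COLLECTOR
    try:
        for step in steps:
            tracked.append(step())
    finally:
        if cleanup:
            while tracked:
                delete(tracked.pop())


deleted = []  # records what the stand-in delete function was asked to remove


def failing_step():
    raise RuntimeError("simulated failure after the first index was created")


try:
    run_with_cleanup([lambda: "index-a", failing_step], delete=deleted.append)
except RuntimeError as error:
    print(f"Example failed: {error}")

print(deleted)
```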