# Bring Your Own PDFs

In this example, we combine the document conversion capabilities of Deep Search with its data query capabilities.
From the Deep Search Workspace, we create a new project data index which can host our own PDF documents.
Once the upload is complete, we can query the documents in the same way as the public data which we
explored in the [Data query quick start example](../data_query_quick_start/).


### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io/#unlimited-access) if you are interested in exploring
these Deep Search capabilities.

### Authentication via stored credentials

In this example, we initialize the Deep Search client from the credentials
contained in the file `../../ds-auth.ext-v2.json`. This can be generated with

```shell
deepsearch login --host https://deepsearch-ext-v2-535206b87b82b5365d9d6671fbc19165-0000.us-south.containers.appdomain.cloud/ --output ../../ds-auth.ext-v2.json
```

The extra `--host` argument is required in this example to target the limited access instance.

More details in the [docs](https://ds4sd.github.io/deepsearch-toolkit/getting_started/#authentication).
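
Before initializing the client, it can be useful to confirm that the credentials file exists and parses as JSON. The following is a minimal sketch: `check_config_file` is a hypothetical helper, not part of the deepsearch-toolkit, and it makes no assumption about which keys the file contains.

```python
import json
from pathlib import Path


def check_config_file(path):
    """Return the parsed JSON content of `path`, or raise a descriptive error.

    Hypothetical helper for illustration; not part of the deepsearch-toolkit.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Config file not found: {path}. Run `deepsearch login` first."
        )
    try:
        with open(path) as f:
            return json.load(f)
    except json.JSONDecodeError as exc:
        # Re-raise with the file name so the failing file is easy to identify
        raise ValueError(f"Config file {path} is not valid JSON: {exc}") from exc
```

Calling `check_config_file("../../ds-auth.ext-v2.json")` before the client setup would then fail early with a readable message if the login step was skipped.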

### Notebook parameters

The following block defines the parameters used to execute the notebook:

- `CONFIG_FILE`: location of the Deep Search configuration file
- `PROJ_NAME`: the name for the project to create
- `INDEX_NAME`: the name for the data index to create
- `INPUT_FILES_FOLDER`: the folder containing the files to upload
- `CLEANUP`: if true, it will delete the resources at the end of the execution

In [18]:
# Input parameters for the example flow
from pathlib import Path
import datetime
CONFIG_FILE = Path("../../ds-auth.ext-v2.json")

timestamp = datetime.datetime.now()
PROJ_NAME = f"Example project of {timestamp.strftime('%Y%m%d')}"
INDEX_NAME = f"Example docs upload {timestamp.strftime('%Y%m%d %H%M%S')}"
INPUT_FILES_FOLDER = Path("../../data/samples/scanned_documents")

# Cleanup
CLEANUP = True
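
Since the upload step takes a folder, it may help to check up front that the folder exists and to see which files it contains. The sketch below uses only `pathlib`; `list_input_files` and its `suffixes` parameter are hypothetical names introduced for illustration, and the assumption that the inputs are `.pdf` files follows from the title of this example.

```python
from pathlib import Path


def list_input_files(folder, suffixes=(".pdf",)):
    """Return the sorted files in `folder` whose suffix is in `suffixes`.

    Hypothetical helper for illustration; the suffix match is case-insensitive.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"Input folder not found: {folder}")
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

For instance, `list_input_files(INPUT_FILES_FOLDER)` returns the PDF files that are candidates for upload, and raises if the folder path is wrong.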

### Import example dependencies

In [19]:
# Import standard dependencies
import datetime
from copy import deepcopy
from tqdm.notebook import tqdm
import pandas as pd
import tempfile
import json

# IPython utilities
from IPython.display import display, Markdown, HTML, display_html

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.queries import DataQuery
from deepsearch.cps.client.components.queries import RunQueryError
from deepsearch.cps.data_indices import utils as data_indices_utils
from deepsearch.documents.core.models import ConversionSettings


### Connect to Deep Search

In [20]:
# Initialize the Deep Search client from the config file
config = ds.DeepSearchConfig.parse_file(CONFIG_FILE)
client = ds.CpsApiClient(config)
api = ds.CpsApi(client)

### Create data index and upload data

In [28]:
# Create example project
proj = api.projects.create(name=PROJ_NAME)
# Create a new data index in your project
data_index = api.data_indices.create(proj_key=proj.key, name=INDEX_NAME)

In [29]:
# Upload and convert documents
# Define the OCR settings to be used
conv_settings = ConversionSettings().from_project(api, proj_key=proj.key)
conv_settings.ocr.enabled = True
conv_settings.ocr.merge_mode = 'prioritize-ocr'
conv_settings.ocr.backend = 'alpine-ocr'


data_indices_utils.upload_files(api=api, coords=data_index.source, local_file=INPUT_FILES_FOLDER, conv_settings=conv_settings)

In [30]:
display(Markdown(f"The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />{api.client.config.host}/projects/{data_index.source.proj_key}/data/project/{data_index.source.index_key}/"))


The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />https://cps.foc-deepsearch.zurich.ibm.com/projects/e6cd444006224102f0758e6666c09c8969dc94ab/data/project/8244be0d8f4e5e02e2c8db3ecab85a5bb5c418a7/
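
The link above is assembled from the host, the project key, and the index key. If the configured host happens to end with a trailing slash (as the `--host` value in the login command does), plain string formatting would produce a double slash before `projects`. A minimal sketch of a more defensive construction follows; `build_ui_url` is a hypothetical helper, and the path layout is copied from the cell above rather than from any documented URL scheme.

```python
def build_ui_url(host, proj_key, index_key):
    """Build the UI link for a project data index.

    Hypothetical helper; the path layout mirrors the f-string used in the
    notebook cell and is not taken from official documentation.
    """
    host = str(host).rstrip("/")  # avoid "//projects" when host ends with "/"
    return f"{host}/projects/{proj_key}/data/project/{index_key}/"
```

In the notebook this could be called as `build_ui_url(api.client.config.host, data_index.source.proj_key, data_index.source.index_key)`.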

### Download your data

In [31]:
# Run query
query = DataQuery(search_query="*", source=["*"], coordinates=data_index.source)
cursor = api.queries.run_paginated_query(query)

# Using a temp dir for demo purposes; to persist instead, set output dir accordingly
temp_dir = tempfile.TemporaryDirectory()
output_dir = temp_dir.name

# Iterate through query results
all_results = []
for result_page in tqdm(cursor):
    for row in result_page.outputs["data_outputs"]:

        # Download JSON file
        file_path = Path(output_dir) / f"{row['_id']}.json"
        with open(file_path, "w") as outfile:
            json.dump(row["_source"], outfile, indent=2)
        try:
            all_results.append({
                "Title": row["_source"]["description"]["title"],
                "Path": file_path,
            })
        except KeyError as e:
            print(e)

print(f"Finished fetching all data. Total is {len(all_results)} records.")
print(f"Data downloaded in {output_dir}")

0it [00:00, ?it/s]

'title'
Finished fetching all data. Total is 1 records.
Data downloaded in /tmp/tmpt_7juiy8
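
In the output above, the line `'title'` is the printed `KeyError` for a record whose description has no title. That record is still written to disk, but it is left out of `all_results`. If such records should be kept in the summary, one option is to fall back to a placeholder title. The sketch below assumes rows shaped like the ones in the loop above (`row["_source"]["description"]["title"]`); `extract_title` and the `"(untitled)"` default are hypothetical.

```python
def extract_title(row, default="(untitled)"):
    """Return the document title of a query result row, or `default`.

    Hypothetical helper; tolerates a missing or empty `_source`,
    `description`, or `title` entry instead of raising KeyError.
    """
    source = row.get("_source") or {}
    description = source.get("description") or {}
    if not isinstance(description, dict):
        return default
    return description.get("title") or default
```

Inside the loop, `"Title": extract_title(row)` could then replace the `try`/`except KeyError` block, so that every downloaded file appears in the results table.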


In [32]:
# Visualize the table with all results
df = pd.json_normalize(all_results)
display(df)

|   | Title | Path |
|---|-------|------|
| 0 | Missouri Public Water Systems BILL TO: | /tmp/tmpt_7juiy8/780668c60178a66e31394f81ad092... |


### Cleanup

If enabled, we delete all the resources created in the example.

In [26]:
# Delete data index
if CLEANUP:
    api.data_indices.delete(data_index.source)
    print("Project data deleted")

Project data deleted


In [27]:
# Delete project
if CLEANUP:
    api.projects.remove(project=proj)
    print("Project deleted")

Project deleted
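
Because the cleanup cells run last, an error in an earlier cell can leave the project and data index behind. When the flow is run as a script rather than cell by cell, one way to guard against this is to register each cleanup action as soon as its resource is created. The following is a sketch of that pattern using `contextlib.ExitStack` from the standard library; `delete_index`, `delete_project`, and `run` are placeholder functions standing in for the real API calls, not deepsearch-toolkit functions.

```python
from contextlib import ExitStack

log = []


def delete_index():
    # Placeholder for the data index deletion call
    log.append("index deleted")


def delete_project():
    # Placeholder for the project removal call
    log.append("project deleted")


def run(cleanup=True, fail=False):
    """Simulate the example flow with cleanup guaranteed on exit."""
    with ExitStack() as stack:
        # Register each cleanup right after its resource would be created;
        # callbacks run in reverse order, so the index goes before the project
        stack.callback(delete_project)
        stack.callback(delete_index)
        if not cleanup:
            stack.pop_all()  # discard the callbacks, keeping the resources
        if fail:
            raise RuntimeError("simulated failure")
```

With `cleanup=True`, both placeholders run even when the body raises, in the same order as the cleanup cells above.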
