# Bring Your Own PDFs

In this example we combine the document conversion capabilities of Deep Search with its data query capabilities.
From the Deep Search Workspace, we create a new project data index which can host our own PDF documents.
Once the upload is completed, we will be able to query the documents, similar to the public data which we
explored in the [Data query quick start example](../data_query_quick_start/). 


### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io/#unlimited-access) if you are interested in exploring
these Deep Search capabilities.

### Authentication via stored credentials

In this example, we initialize the Deep Search client from the credentials
stored in the file `../../ds-auth.ext-v2.json`. This file can be generated with:

```shell
deepsearch login --host https://deepsearch-ext-v2-535206b87b82b5365d9d6671fbc19165-0000.us-south.containers.appdomain.cloud/ --output ../../ds-auth.ext-v2.json
```

The extra `--host` argument is required in this example to target the limited-access instance.

More details are available in the [docs](https://ds4sd.github.io/deepsearch-toolkit/getting_started/#authentication).

### Notebook parameters

The following block defines the parameters used to execute the notebook:

- `CONFIG_FILE`: location of the Deep Search configuration file
- `PROJ_NAME`: the name for the project to create
- `INDEX_NAME`: the name for the data index to create
- `INPUT_FILES_FOLDER`: the folder containing the files to upload
- `CLEANUP`: if true, the resources created by the notebook are deleted at the end of the execution

In [1]:
# Input parameters for the example flow
from pathlib import Path
import datetime
CONFIG_FILE = Path("../../ds-auth.ext-v2.json")

timestamp = datetime.datetime.now()
PROJ_NAME = f"Example project of {timestamp.strftime('%Y%m%d')}"
INDEX_NAME = f"Example docs upload {timestamp.strftime('%Y%m%d %H%M%S')}"
INPUT_FILES_FOLDER = Path("../../data/samples/")

# Cleanup
CLEANUP = True
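
The project and index names above embed a timestamp. As a small illustration, the sketch below reuses the same two format strings with a fixed date; the helper name `make_names` and the chosen date are assumptions made only so the output is reproducible, and neither is part of the notebook. It shows that the project name changes once per day, while the index name changes every second.

```python
import datetime
from typing import Tuple


def make_names(timestamp: datetime.datetime) -> Tuple[str, str]:
    """Build project and index names with the same format strings as the cell above."""
    proj_name = f"Example project of {timestamp.strftime('%Y%m%d')}"
    index_name = f"Example docs upload {timestamp.strftime('%Y%m%d %H%M%S')}"
    return proj_name, index_name


# Hypothetical fixed timestamp, used only so the output is reproducible
proj_name, index_name = make_names(datetime.datetime(2024, 1, 15, 9, 30, 5))
print(proj_name)   # Example project of 20240115
print(index_name)  # Example docs upload 20240115 093005
```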

### Import example dependencies

In [2]:
# Import standard dependencies
import datetime
from copy import deepcopy
from tqdm.notebook import tqdm
import pandas as pd

# IPython utilities
from IPython.display import display, Markdown, HTML, display_html

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.queries import DataQuery
from deepsearch.cps.client.components.queries import RunQueryError
from deepsearch.cps.data_indices import utils as data_indices_utils


### Connect to Deep Search

In [3]:
# Initialize the Deep Search client from the config file
config = ds.DeepSearchConfig.parse_file(CONFIG_FILE)
client = ds.CpsApiClient(config)
api = ds.CpsApi(client)

### Create data index and upload data

In [4]:
# Create example project
proj = api.projects.create(name=PROJ_NAME)
# Create a new data index in your project
data_index = api.data_indices.create(proj_key=proj.key, name=INDEX_NAME)

In [5]:
# Upload and convert documents
data_indices_utils.upload_files(api=api, coords=data_index.source, local_file=INPUT_FILES_FOLDER)

Completed successfully
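
Before uploading, it can be useful to check what the input folder actually contains. The sketch below uses only the standard library. The helper `list_pdfs` is hypothetical, and it assumes that only files with a `.pdf` extension directly inside the folder are of interest; it makes no claim about how `upload_files` itself selects files.

```python
from pathlib import Path
from typing import List


def list_pdfs(folder: Path) -> List[Path]:
    """Return the PDF files directly inside `folder`, sorted by path."""
    if not folder.is_dir():
        raise NotADirectoryError(f"Input folder not found: {folder}")
    # Match the extension case-insensitively, so "report.PDF" is included too
    return sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

In the notebook this could be called as `list_pdfs(INPUT_FILES_FOLDER)` right before the upload cell, for example to print the file names or to stop early when the list is empty.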


In [6]:
display(Markdown(f"The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />{api.client.config.host}/projects/{data_index.source.proj_key}/data/project/{data_index.source.index_key}/"))


The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />https://deepsearch-ext-v2-535206b87b82b5365d9d6671fbc19165-0000.us-south.containers.appdomain.cloud//projects/67a639f5db9c17b928c8c132b57a9be4a8eac33c/data/project/2b686a64ffd68efe39f30b1e25be9c71da437129/

### Query your data

In [7]:
# Count the documents in the data index
query = DataQuery("*", source=[""], limit=0, coordinates=data_index.source)
query_results = api.queries.run(query)
print(f"The data index contains {query_results.outputs['data_count']} entries.")

The data index contains 2 entries.


In [8]:
# Find documents matching query
search_query = "speedup"
query = DataQuery(search_query, source=["description.title", "description.authors"], coordinates=data_index.source)

# Fetch the results page by page
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor):
    # Iterate through the results of a single page, and add to the total list
    for row in result_page.outputs["data_outputs"]:
        # Add row to results table
        all_results.append({
            "Title": row["_source"]["description"]["title"],
            "Authors": ", ".join([author["name"] for author in row["_source"]["description"].get("authors", [])]),
        })

print(f'Finished fetching all data. Total is {len(all_results)} records.')

0it [00:00, ?it/s]


Finished fetching all data. Total is 1 records.
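
The loop above reads each hit's title and author names from nested dictionaries. As an illustration, the sketch below applies the same extraction to hand-written rows. The sample dictionaries are made up and only mirror the keys accessed in the cell above (`_source`, `description`, `title`, `authors`, `name`), and `row_to_record` is a hypothetical helper. Because `authors` is read with `.get`, a row without that field yields an empty string instead of raising a `KeyError`.

```python
from typing import Any, Dict


def row_to_record(row: Dict[str, Any]) -> Dict[str, str]:
    """Flatten one query hit into a table row with Title and Authors."""
    description = row["_source"]["description"]
    authors = description.get("authors", [])
    return {
        "Title": description["title"],
        "Authors": ", ".join(author["name"] for author in authors),
    }


# Made-up rows for illustration; not real query output
with_authors = {
    "_source": {
        "description": {
            "title": "Sample paper",
            "authors": [{"name": "A. Author"}, {"name": "B. Writer"}],
        }
    }
}
without_authors = {"_source": {"description": {"title": "Untitled scan"}}}

print(row_to_record(with_authors))     # {'Title': 'Sample paper', 'Authors': 'A. Author, B. Writer'}
print(row_to_record(without_authors))  # {'Title': 'Untitled scan', 'Authors': ''}
```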


In [9]:
# Visualize the table with all results
df = pd.json_normalize(all_results)
display(HTML(df.head().to_html(render_links=True)))

|   | Title | Authors |
|---|-------|---------|
| 0 | Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | |


### Cleanup

If `CLEANUP` is enabled, all the resources created in this example are deleted.

In [10]:
# Delete data index
if CLEANUP:
    api.data_indices.delete(data_index.source)
    print("Project data deleted")

Project data deleted


In [11]:
# Delete project
if CLEANUP:
    api.projects.remove(project=proj)
    print("Project deleted")

Project deleted
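
If a step between creation and cleanup raises an exception, the two cleanup cells never run and the created resources stay behind. One way to guard against that, sketched here with the standard library's `contextlib.ExitStack`, is to register each cleanup action right after its resource is created. The `create_*` and `delete_*` functions are placeholders standing in for the toolkit calls used above; they are assumptions for illustration and not part of the Deep Search API.

```python
from contextlib import ExitStack

log = []  # Records the order of operations, for illustration only


def create_project() -> str:
    log.append("create project")
    return "proj"


def create_index(proj: str) -> str:
    log.append("create index")
    return f"{proj}/index"


def delete_index(index: str) -> None:
    log.append("delete index")


def delete_project(proj: str) -> None:
    log.append("delete project")


def run_example(cleanup: bool = True, fail: bool = False) -> None:
    with ExitStack() as stack:
        proj = create_project()
        if cleanup:
            stack.callback(delete_project, proj)
        index = create_index(proj)
        if cleanup:
            stack.callback(delete_index, index)
        if fail:
            raise RuntimeError("simulated failure")
        log.append("query data")
    # Callbacks run in reverse registration order: index first, then project


run_example()
print(log)
```

Callbacks registered on an `ExitStack` run in last-in, first-out order, so the data index is removed before the project, matching the order of the two cleanup cells.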
