# Bring Your Own PDFs

In this example, we combine the document conversion capabilities of Deep Search with its data query capabilities.
From the Deep Search Workspace, we create a new project data index to host our own PDF documents.
Once the upload completes, we can query the documents in the same way as the public data
explored in the [Data query quick start example](../data_query_quick_start/).


### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.

### Set notebook parameters

In [1]:
from dsnotebooks.settings import ProjectNotebookSettings
from pathlib import Path
import datetime

# notebook settings are auto-loaded from .env / env vars
notebook_settings = ProjectNotebookSettings()

PROFILE_NAME = notebook_settings.profile  # the profile to use
PROJ_KEY = notebook_settings.proj_key     # the project to use
CLEANUP = notebook_settings.cleanup       # whether to clean up
INDEX_NAME = f"tmp_{datetime.datetime.now().strftime('%Y%m%d%H%M%S')}"
INPUT_FILES_FOLDER = Path("../../data/samples/")
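
The timestamp suffix keeps index names from colliding across runs. As a minimal standalone sketch, the same naming scheme can be paired with a pre-flight check that the input folder actually contains PDFs. The helpers `make_index_name` and `list_pdfs` are hypothetical and not part of the toolkit, and the demo uses a throwaway folder rather than `../../data/samples/`.

```python
import datetime
import tempfile
from pathlib import Path
from typing import List


def make_index_name(now: datetime.datetime, prefix: str = "tmp_") -> str:
    """Build an index name such as tmp_20240131235959 (hypothetical helper)."""
    return f"{prefix}{now.strftime('%Y%m%d%H%M%S')}"


def list_pdfs(folder: Path) -> List[Path]:
    """Return the PDF files directly inside `folder`, sorted by name (hypothetical helper)."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstrate with a temporary folder holding one PDF and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.pdf").write_bytes(b"%PDF-1.4")
    (folder / "notes.txt").write_text("not a pdf")
    pdf_names = [p.name for p in list_pdfs(folder)]

index_name = make_index_name(datetime.datetime(2024, 1, 31, 23, 59, 59))
print(index_name, pdf_names)
```

Passing the timestamp in as an argument, rather than calling `datetime.datetime.now()` inside the helper, keeps the naming logic easy to check in isolation.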

### Import example dependencies

In [2]:
# Import standard dependencies
from copy import deepcopy
from tqdm.notebook import tqdm
import pandas as pd

# IPython utilities
from IPython.display import display, Markdown, HTML

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.queries import DataQuery
from deepsearch.cps.data_indices import utils as data_indices_utils

### Connect to Deep Search

In [3]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

### Create data index and upload data

In [4]:
# Create a new data index in your project
data_index = api.data_indices.create(proj_key=PROJ_KEY, name=INDEX_NAME)
index_key = data_index.source.index_key

In [5]:
# Upload and convert documents
data_indices_utils.upload_files(api=api, coords=data_index.source, local_file=INPUT_FILES_FOLDER)

In [6]:
# Strip a trailing slash from the host to avoid a double slash in the URL
host = api.client.config.host.rstrip("/")
display(Markdown(
    "The data is now available. You can query it programmatically (see next section) "
    f"or access it via the Deep Search UI at <br />{host}/projects/{PROJ_KEY}/library/private/{index_key}"
))

The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />https://sds.app.accelerate.science/projects/b09ae7561a01dc7c4b0fd21a43bfd93d140766d1/library/private/ef7f25b1a812eeac8630060dfda183b23185fb4c

### Query your data

In [7]:
# Count the documents in the data index
query = DataQuery("*", source=[""], limit=0, coordinates=data_index.source)
query_results = api.queries.run(query)
num_results = query_results.outputs['data_count']
print(f"The data index contains {num_results} entries.")

The data index contains 2 entries.


In [8]:
# Find documents matching query
search_query = "speedup"
query = DataQuery(search_query, source=["description.title", "description.authors"], coordinates=data_index.source)

all_results = []
# Fetch the results page by page
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor):
    # Iterate through the results of a single page, and add to the total list
    for row in result_page.outputs["data_outputs"]:
        # Add row to results table
        all_results.append({
            "Title": row["_source"]["description"]["title"],
            "Authors": ", ".join([author["name"] for author in row["_source"]["description"].get("authors", [])]),
        })

num_results = len(all_results)
print(f'Finished fetching all data. Total is {num_results} records.')

Finished fetching all data. Total is 1 records.
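
The accumulation loop can be tried without a Deep Search connection. The sketch below is a minimal stand-in, assuming each page exposes a `data_outputs` list shaped like the rows accessed in the cell above. The function `flatten_pages` and the sample data are made up for illustration, and plain dicts take the place of `result_page.outputs`. It also shows that a record without an `authors` field yields an empty string rather than an error.

```python
from typing import Any, Dict, Iterable, List


def flatten_pages(pages: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Collect one table row per hit across all result pages (hypothetical helper)."""
    rows = []
    for page in pages:
        for hit in page["data_outputs"]:
            description = hit["_source"]["description"]
            rows.append({
                "Title": description["title"],
                # A missing "authors" field falls back to an empty list,
                # so the joined string is empty instead of raising KeyError
                "Authors": ", ".join(a["name"] for a in description.get("authors", [])),
            })
    return rows


# Made-up pages standing in for what the paginated cursor yields
fake_pages = [
    {"data_outputs": [
        {"_source": {"description": {
            "title": "Paper A",
            "authors": [{"name": "Ada"}, {"name": "Bob"}],
        }}},
    ]},
    {"data_outputs": [
        {"_source": {"description": {"title": "Paper B"}}},
    ]},
]

rows = flatten_pages(fake_pages)
print(rows)
```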


In [9]:
# Visualize the table with all results
df = pd.json_normalize(all_results)
display(HTML(df.head().to_html(render_links=True)))

| | Title | Authors |
|---|---|---|
| 0 | Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | |


### Cleanup
If `CLEANUP` is enabled, we delete the data index created in this example.

In [10]:
# Delete data index
if CLEANUP:
    api.data_indices.delete(data_index.source)
    print("Data index deleted")

Data index deleted
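
If an earlier cell raises, the cleanup cell never runs and the temporary index stays behind. When the same steps run as a script, one option is to wrap the work in `try`/`finally`. The sketch below illustrates the pattern only: `FakeDataIndices` and `run_with_cleanup` are hypothetical stand-ins, not part of the toolkit, and their signatures are simplified compared with `api.data_indices`.

```python
class FakeDataIndices:
    """Stand-in for a data-index API; tracks names instead of contacting a server."""

    def __init__(self):
        self.existing = set()

    def create(self, name):
        self.existing.add(name)
        return name

    def delete(self, name):
        self.existing.discard(name)


def run_with_cleanup(indices, name, work, cleanup=True):
    """Create an index, run `work` on it, and delete it afterwards if `cleanup` is set."""
    index = indices.create(name)
    try:
        return work(index)
    finally:
        # Runs whether `work` returned normally or raised
        if cleanup:
            indices.delete(index)


def failing_work(index):
    raise RuntimeError("query failed")


indices = FakeDataIndices()
error_message = None
try:
    run_with_cleanup(indices, "tmp_demo", failing_work)
except RuntimeError as err:
    error_message = str(err)

# The index is gone even though the work step failed
print(indices.existing, error_message)
```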
