# Data queries - Quick start

Deep Search parses and interprets large collections of documents quickly and concurrently.
These collections are stored and indexed so that you can search for and retrieve any document according to its contents.

### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.

### Set notebook parameters

In [1]:
from dsnotebooks.settings import NotebookSettings

# notebook settings auto-loaded from .env / env vars
notebook_settings = NotebookSettings()

PROFILE_NAME = notebook_settings.profile  # the profile to use


### Import example dependencies

In [2]:
# Import standard dependencies
from copy import deepcopy
import pandas as pd
from numerize.numerize import numerize
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

# IPython utilities
from IPython.display import display, HTML

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.client.components.elastic import ElasticDataCollectionSource
from deepsearch.cps.queries import DataQuery


### Connect to Deep Search

In [3]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)


---

## Inspect Deep Search data

In this section we inspect the data collections available on Deep Search and run some initial searches.

Explanation of the columns
- **Name**: The name of the data collection
- **Type**: Type of content in the data collection (see below)
- **Date**: Date of the last update on the data
- **Num entries**: Number of entries in the data collection
- **Coords**: Coordinates needed in the Deep Search Toolkit to query a specific data collection


Type of data collections
- **Document**: A collection of documents. These are either PDF documents already converted via the Deep Search document conversion, or data sources which can be mapped to the same format, e.g. the USPTO patents, which are sourced as XML.
- **Record**: A collection of DB records in a structured database. For example, the content of PubChem is made available in this format.
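
The **Coords** column combines two values in the form `<elastic_id>/<index_key>`. As a minimal sketch, the string can be split back into its two parts as shown here. The `Coords` type and the `parse_coords` helper are hypothetical names used for illustration only; they are not part of the deepsearch-toolkit.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    """Hypothetical helper type, not part of the deepsearch-toolkit."""
    elastic_id: str
    index_key: str


def parse_coords(coords: str) -> Coords:
    """Split an '<elastic_id>/<index_key>' string from the Coords column."""
    elastic_id, sep, index_key = coords.partition("/")
    if not sep or not elastic_id or not index_key:
        raise ValueError(f"expected '<elastic_id>/<index_key>', got {coords!r}")
    return Coords(elastic_id, index_key)


coords = parse_coords("default/arxiv-abstract")
print(coords.elastic_id, coords.index_key)  # prints: default arxiv-abstract
```

The two parts correspond to the `elastic_id` and `index_key` arguments of `ElasticDataCollectionSource`, which this notebook uses later to address a single collection.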

In [4]:
# Fetch list of all data collections
collections = api.elastic.list()
collections.sort(key=lambda c: c.name.lower())


In [5]:
# Visualize summary table
results = [
    {
        "Name": c.name,
        "Type": c.metadata.type,
        "Num entries": numerize(c.documents),
        "Date": c.metadata.created.strftime("%Y-%m-%d"),
        "Coords": f"{c.source.elastic_id}/{c.source.index_key}",
    }
    for c in collections
]
display(pd.DataFrame(results))


| | Name | Type | Num entries | Date | Coords |
|---|---|---|---|---|---|
| 0 | ACL Anthology | Document | 55.28K | 2023-10-05 | default/acl |
| 1 | Annual Reports | Document | 79.79K | 2023-07-21 | default/annual-report |
| 2 | arXiv abstracts | Document | 2.35M | 2023-10-24 | default/arxiv-abstract |
| 3 | ClinicalTrials | Document | 426.42K | 2023-06-01 | default/clinical-trials |
| 4 | Cord19 | Document | 655.45K | 2023-04-14 | default/cord19 |
| 5 | Crossref | Document | 131.86M | 2023-04-15 | default/crossref |
| 6 | ESG Reports | Document | 10.93K | 2023-10-03 | default/esg-report |
| 7 | IBM Redbooks | Document | 2.75K | 2023-08-02 | default/ibm-redbooks |
| 8 | IPCC | Document | 819 | 2023-06-15 | default/ipcc |
| 9 | PatCID | Record | 50.83M | 2023-09-05 | default/patcid |


### Search among data collections

In the following section we count how many results the query matches in each data collection.

⚠️ To only count matches, pass an empty list as the `source` argument of `DataQuery` and set `limit=0`, which avoids unnecessary data transfers.

In [6]:
# Input query
search_query = "(\"power conversion efficiency\" OR PCE) AND organ*"

# Iterate through the data collections
results = []
for c in (pbar := tqdm(collections)):
    pbar.set_description(f"Querying {c.name}")

    # Search only on document collections
    if c.metadata.type != "Document":
        continue

    # Execute the query
    query = DataQuery(search_query, source=[], limit=0, coordinates=c.source)
    query_results = api.queries.run(query)
    results.append({
        "name": c.name,
        "matches": query_results.outputs["data_count"]
    })

# Sort and display results
results.sort(reverse=True, key=lambda r: r["matches"])
display(pd.DataFrame(results))


  0%|          | 0/16 [00:00<?, ?it/s]

RunQueryError: Task '0_ElasticQuery' failed with 'RuntimeError': Failed to query Elasticsearch service: 500 '{"error":{"root_cause":[{"type":"too_many_nested_clauses","reason":"too_many_nested_clauses: Query contains too many nested clauses; maxClauseCount is set to 1024"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"clinical-trials-20220815","node":"q3hN1RZ_T5-3a3Gyz62CHQ","reason":{"type":"too_many_nested_clauses","reason":"too_many_nested_clauses: Query contains too many nested clauses; maxClauseCount is set to 1024"}}]},"status":500}'. Full error:
- Error Type: 'RuntimeError'
- Task ID: '0_ElasticQuery'
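
The run above stops at the first collection whose query fails; the error reports that the Elasticsearch clause limit was hit on the clinical-trials index. A minimal sketch of isolating failures per collection, so that the remaining collections are still counted, could look as follows. `count_matches` and `run_count` are hypothetical names: `run_count` stands in for building the `DataQuery` and calling `api.queries.run`, and the stub values below are made up for illustration. The sketch catches the broad `Exception` class because it does not import the toolkit's own error type.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def count_matches(
    names: Iterable[str],
    run_count: Callable[[str], int],
) -> Tuple[List[Dict], List[Dict]]:
    """Count matches per collection and record failures instead of aborting."""
    results, failures = [], []
    for name in names:
        try:
            results.append({"name": name, "matches": run_count(name)})
        except Exception as exc:  # assumption: any query error is recoverable here
            failures.append({"name": name, "error": str(exc)})
    # Same ordering as the cell above: most matches first
    results.sort(reverse=True, key=lambda r: r["matches"])
    return results, failures


def fake_run_count(name: str) -> int:
    """Stub with made-up counts; one collection fails on purpose."""
    if name == "ClinicalTrials":
        raise RuntimeError("too_many_nested_clauses")
    return {"arXiv abstracts": 120, "Crossref": 300}[name]


results, failures = count_matches(
    ["arXiv abstracts", "ClinicalTrials", "Crossref"], fake_run_count
)
print(results)
print(failures)
```

Collecting the failures in a separate list keeps the summary table and the pie chart usable while still showing which collections need a narrower query.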

In [None]:
# Plot the distribution of results among data collections

x = [r["name"] for r in results]
y = [r["matches"] for r in results]
plt.pie(y, labels=x, labeldistance=None)
plt.legend(loc="upper center", ncols=3, bbox_to_anchor=(0.5, 0))


---

## Search documents

In this section we search for documents in the arXiv.org data collection matching the input query.
For each matched document we return the title, the authors and the link to the original document on arXiv.org.


This example demonstrates:
1. How to address a specific data collection
2. How to choose which components of the documents should be returned
3. How to iterate through the complete data collection by fetching `page_size=50` results at a time
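
The pagination in step 3 can be pictured with plain Python lists. This is a sketch of the idea only, not of how the toolkit fetches pages: `expected_pages` and `iter_pages` are hypothetical helpers, and the 120 rows are a stand-in for matched documents.

```python
from typing import Iterator, List, Sequence


def expected_pages(total: int, page_size: int) -> int:
    """Ceiling division: number of pages needed to hold `total` results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total + page_size - 1) // page_size


def iter_pages(rows: Sequence, page_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `page_size` rows."""
    for start in range(0, len(rows), page_size):
        yield list(rows[start:start + page_size])


rows = list(range(120))  # stand-in for 120 matched documents
pages = list(iter_pages(rows, page_size=50))
print(expected_pages(len(rows), 50), [len(p) for p in pages])
# prints: 3 [50, 50, 20]
```

The last page is usually shorter than `page_size`, which is why the page count uses a ceiling rather than a plain integer division.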


In [None]:
# Input query
search_query = "(\"power conversion efficiency\" OR PCE) AND organ*"
data_collection = ElasticDataCollectionSource(elastic_id="default", index_key="arxiv-abstract")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query, # The search query to be executed
    source=["description.title", "description.authors", "identifiers"], # Which fields of documents we want to fetch
    limit=page_size, # The size of each request page
    coordinates=data_collection # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at a time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    # Iterate through the results of a single page, and add to the total list
    for row in result_page.outputs["data_outputs"]:
        # Process custom identifiers links
        links = {"arxiv": None, "doi": None}
        identifiers = {"arxiv": None, "doi": None}
        for ids in row["_source"].get("identifiers", []):
            if ids["type"] == "arxivid":
                identifiers["arxiv"] = ids["value"]
                links["arxiv"] = f'https://arxiv.org/abs/{ids["value"]}'
            if ids["type"] == "doi":
                identifiers["doi"] = ids["value"]
                links["doi"] = f'https://doi.org/{ids["value"]}'

        # Add row to results table
        all_results.append({
            "Title": row["_source"]["description"]["title"],
            "Authors": ", ".join([author["name"] for author in row["_source"]["description"]["authors"]]),
            "arXiv": identifiers["arxiv"],
            "arXiv URL": links["arxiv"],
            "DOI": identifiers["doi"],
            "DOI URL": links["doi"],
        })

print(f'Finished fetching all data. Total is {len(all_results)} records.')
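
The identifier handling inside the loop above can also be factored into a small function, which makes it easier to check on a single row. This is a sketch under the assumption that each entry of `identifiers` has the `type` and `value` keys used in the cell above; `extract_identifiers` is a hypothetical helper and the sample values are placeholders, not real identifiers.

```python
from typing import Dict, Optional


def extract_identifiers(source: dict) -> Dict[str, Optional[str]]:
    """Map the `identifiers` list of one document to arXiv/DOI ids and links."""
    out: Dict[str, Optional[str]] = {
        "arXiv": None, "arXiv URL": None, "DOI": None, "DOI URL": None,
    }
    for ident in source.get("identifiers", []):
        if ident["type"] == "arxivid":
            out["arXiv"] = ident["value"]
            out["arXiv URL"] = f'https://arxiv.org/abs/{ident["value"]}'
        elif ident["type"] == "doi":
            out["DOI"] = ident["value"]
            out["DOI URL"] = f'https://doi.org/{ident["value"]}'
    return out


# Placeholder row shaped like row["_source"] in the cell above
sample_source = {
    "identifiers": [
        {"type": "arxivid", "value": "0000.00000"},
        {"type": "doi", "value": "10.0000/example"},
    ]
}
print(extract_identifiers(sample_source))
```

Documents without an `identifiers` field yield `None` for all four columns, matching the defaults used in the loop.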


In [None]:
# Visualize the table with all results
df = pd.json_normalize(all_results)
display(HTML(df.head().to_html(render_links=True)))


In [None]:
# Save the results to an Excel table
df.to_excel("quick_start_results.xlsx")
