# Data queries - Chemistry

Chemistry is one of the focal domains of Deep Search. A key capability is searching across
chemistry databases such as PubChem and linking their data with the document collections.
This example starts with molecule searches by name, synonym, or SMILES.


### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io/#unlimited-access) if you are interested in exploring
these Deep Search capabilities.

### Authentication via stored credentials

In this example, we initialize the Deep Search client from the credentials
contained in the file `../../ds-auth.ext-v2.json`. This can be generated with

```shell
deepsearch login --host https://deepsearch-ext-v2-535206b87b82b5365d9d6671fbc19165-0000.us-south.containers.appdomain.cloud/ --output ../../ds-auth.ext-v2.json
```

The extra `--host` argument is required in this example to target the limited access instance.

More details in the [docs](https://ds4sd.github.io/deepsearch-toolkit/getting_started/#authentication).

### Notebook parameters

The following block defines the parameters used to execute the notebook:

- `CONFIG_FILE`: location of the Deep Search configuration file


In [1]:
# Input parameters for the example flow
from pathlib import Path
CONFIG_FILE = Path("../../ds-auth.ext-v2.json")
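
Because every later cell depends on this file, it can help to fail early with a readable message when the credentials are missing. The following is a minimal sketch using only the standard library; `require_config` is a hypothetical helper name and is not part of the deepsearch-toolkit.

```python
from pathlib import Path


def require_config(path: Path) -> Path:
    """Return `path` if it is an existing file, otherwise raise FileNotFoundError."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Deep Search config not found at {path}; run `deepsearch login` first."
        )
    return path


# Usage: report the problem instead of failing later inside the client.
try:
    require_config(Path("../../ds-auth.ext-v2.json"))
except FileNotFoundError as err:
    print(err)
```

The check only verifies that the file exists; it does not validate its contents.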

### Import example dependencies

In [2]:
# Import standard dependencies
from copy import deepcopy
import pandas as pd
from numerize.numerize import numerize
import mols2grid
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

# IPython utilities
from IPython.display import display, Markdown, HTML, display_html

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.client.components.elastic import ElasticDataCollectionSource
from deepsearch.cps.queries import DataQuery
from deepsearch.cps.client.components.queries import RunQueryError


### Connect to Deep Search

In [3]:
# Initialize the Deep Search client from the config file
config = ds.DeepSearchConfig.parse_file(CONFIG_FILE)
client = ds.CpsApiClient(config)
api = ds.CpsApi(client)

---

## Search molecules

In this section we list the data collections relevant to chemistry and search them for molecules:

- [List data collections](#List-data-collections-in-Materials-Science-domain)
- [Search Ibuprofen on PubChem](#Search-Ibuprofen-on-PubChem)
- [Search PubChem by SMILES](#Search-PubChem-by-SMILES)
- [Search patents](#Search-SMILES-in-USPTO-patents)

### List data collections in _Materials Science_ domain

This queries the Deep Search system for the data collections available in the _Materials Science_ domain.
The list contains a combination of database and document collections.

The data collections of interest for this example are PubChem and the pre-processed USPTO patents.
Deep Search regularly parses the PubChem database to index molecules and their properties.

In [4]:
# Fetch list of all data collections
collections = api.elastic.list(domain="Materials Science")
collections.sort(key=lambda c: c.name.lower())

In [5]:
# Visualize summary table
results = [
    {
        "Name": c.name,
        "Type": c.metadata.type,
        "Num entries": numerize(c.documents),
        "Date": c.metadata.created.strftime("%Y-%m-%d"),
        "Coords": f"{c.source.elastic_id}/{c.source.index_key}",
    }
    for c in collections
]
display(pd.DataFrame(results))

| | Name | Type | Num entries | Date | Coords |
|---|---|---|---|---|---|
| 0 | BioRxiv | Document | 291.57K | 2022-10-20 | materials/biorxiv |
| 1 | Brenda | Record | 7.12K | 2022-01-21 | materials/brenda |
| 2 | ChEMBL | Record | 2.11M | 2021-12-21 | materials/chembl |
| 3 | ChemRxiv | Document | 8.97K | 2021-11-03 | materials/chemrxiv |
| 4 | COD | Record | 493.2K | 2022-11-11 | materials/cod |
| 5 | COD (deprecated) | Record | 480.14K | 2022-01-29 | materials/cod-deprecated |
| 6 | DeepSearch materials | Record | 349.64K | 2022-09-23 | materials/ds4sd-material |
| 7 | GenBank | Record | 234.09M | 2022-01-27 | materials/genbank |
| 8 | Material Components | Experiment | 16.32K | 2022-05-10 | materials/experiment |
| 9 | NMRShift | Record | 44.33K | 2022-03-11 | materials/nmrshift |
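
The `Coords` column is built as `elastic_id/index_key`, and the later cells pass those two parts separately to `ElasticDataCollectionSource(elastic_id=..., index_key=...)`. The sketch below splits such a string back into its parts; `Coords` and `parse_coords` are illustrative names introduced here, not part of the toolkit.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    elastic_id: str
    index_key: str


def parse_coords(coords: str) -> Coords:
    """Split an 'elastic_id/index_key' string into its two parts."""
    elastic_id, sep, index_key = coords.partition("/")
    if not sep or not elastic_id or not index_key:
        raise ValueError(f"Expected 'elastic_id/index_key', got {coords!r}")
    return Coords(elastic_id, index_key)


coords = parse_coords("materials/chembl")
print(coords.elastic_id, coords.index_key)  # materials chembl
```

The split happens on the first `/` only, so an index key that itself contained a slash would be kept intact.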


---
## Search _Ibuprofen_ on PubChem 

In this section we search for all PubChem entries which contain the string _Ibuprofen_.

The results table shows the name of each chemical, its SMILES, and properties such as the molecular weight and the solubility.


In [6]:
# Search by name
search_query = "Ibuprofen"

data_collection = ElasticDataCollectionSource(elastic_id="materials", index_key="pubchem-deprecated")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query, # The search query to be executed
    source=["subject", "attributes", "identifiers"], # Which fields of documents we want to fetch
    limit=page_size, # The size of each request page
    coordinates=data_collection # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at the same time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    all_results.extend(result_page.outputs["data_outputs"])

print(f'Finished fetching all data. Total is {len(all_results)} records.')

Finished fetching all data. Total is 42 records.
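
The `expected_pages` expression in the cell above is an integer-only ceiling division. As a small standalone check, the sketch below compares it against `math.ceil` for a few totals; the function name `expected_pages` is reused here purely for illustration.

```python
import math


def expected_pages(total: int, page_size: int) -> int:
    """Number of pages needed to fetch `total` results, `page_size` at a time."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total + page_size - 1) // page_size


# The integer formula agrees with the floating-point ceiling.
for total in (0, 1, 42, 50, 51, 100):
    assert expected_pages(total, 50) == math.ceil(total / 50)

print(expected_pages(42, 50))  # 1
print(expected_pages(51, 50))  # 2
```

With 42 matches and a page size of 50, a single page is enough, which is why the loop above finishes after one request.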


In [7]:
# Parsing results. From the raw results, we will fetch
# - The CID of the PubChem record
# - The name of the chemical
# - The SMILES
# - The EC and CAS Numbers
# - The chemical and physical properties reported in PubChem, e.g. solubility, molecular weight, etc

results_table = []
for row in all_results:
    result = {
        "cid": "",
        "chemical_name": "",
        "SMILES": "",
        "ec_number": "",
        "cas_number": "",
    }
    for ref in row["_source"]["identifiers"]:
        if ref["type"] == "cid":
            result["cid"] = ref["value"]
    
    for ref in row["_source"]["subject"]["identifiers"]:
        if ref["type"] == "smiles":
            result["SMILES"] = ref["value"]
        if ref["type"] == "echa_ec_number":
            result["ec_number"] = ref["value"]
        if ref["type"] == "cas_number":
            result["cas_number"] = ref["value"]

    for ref in row["_source"]["subject"]["names"]:
        if ref["type"] == "chemical_name":
            result["chemical_name"] = ref["value"]
    
    for attribute in row["_source"]["attributes"]:
        for predicate in attribute["predicates"]:
            value = predicate["value"]["name"]
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            result[predicate["key"]["name"]] = value
                
        
    
    results_table.append(result)

df = pd.DataFrame(results_table)
display(df)

| | cid | chemical_name | SMILES | ec_number | cas_number | molecular weight | solubility | temperature | solvent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6575 | Trichloroethylene | `C(=C(Cl)Cl)Cl` | 201-167-4 | 79-01-6 | 131.38 | 0.00128 | 25.0 | chloroform |
| 1 | 6933487 | (S)-ibuprofen methyl ester | `CC(C)CC1=CC=C(C=C1)C(C)C(=O)OC` | | 81576-55-8 | 220.31 | | | |
| 2 | 73981 | Magnesium hydroxide | `[OH-].[OH-].[Mg+2]` | 215-170-3 | 1309-42-8 | 58.32 | insoluble | | ethanol |
| 3 | 12791155 | Ibuprofen isobutanolammonium | `CC(C)CC1=CC=C(C=C1)C(C)C(=O)O.CC(C)(CO)N` | | 67190-45-8 | 295.4 | | | |
| 4 | 114864 | (-)-ibuprofen | `CC(C)CC1=CC=C(C=C1)C(C)C(=O)O` | 610-621-4 | 51146-57-7 | 206.28 | | | |
| 5 | 3825 | Ketoprofen | `CC(C1=CC(=CC=C1)C(=O)C2=CC=CC=C2)C(=O)O` | 244-759-8 | 22071-15-4 | 254.28 | 0.000021 | 22.0 | |
| 6 | 163898 | | `CCCCCCCC[N+](C)(C)CCOC(=O)C(C)C1=CC=C(C=C1)CC(...` | | 113168-14-2 | 470.5 | | | |
| 7 | 109101 | Ibuprofen methyl ester | `CC(C)CC1=CC=C(C=C1)C(C)C(=O)OC` | | 61566-34-5 | 220.31 | | | |
| 8 | 9863332 | Ibuprofen lysine | `CC(C)CC1=CC=C(C=C1)C(C)C(=O)O.C(CCN)CC(C(=O)O)N` | 260-751-7 | 57469-77-9 | 352.5 | | | |
| 9 | 9890190 | Ibuprofen aluminum | `CC(C)CC1=CC=C(C=C1)C(C)C(=O)[O-].CC(C)CC1=CC=C...` | | | 454.5 | | | |
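
The parsing cell above reads four parts of each hit: the record identifiers, the subject identifiers, the subject names, and the attribute predicates. The sketch below packages the same logic as a reusable function and runs it on a hand-written record. The record is hypothetical: its shape is inferred from the keys the notebook code accesses, its values are copied from the Trichloroethylene row, and which predicate carries a `numerical_value` is an assumption. `flatten_record` is an illustrative name, not part of the deepsearch-toolkit.

```python
def flatten_record(source: dict) -> dict:
    """Flatten one `_source` payload into a single table row."""
    row = {"cid": "", "chemical_name": "", "SMILES": "", "ec_number": "", "cas_number": ""}

    for ref in source.get("identifiers", []):
        if ref["type"] == "cid":
            row["cid"] = ref["value"]

    subject = source.get("subject", {})
    # Map identifier types to the column names used in the results table.
    id_columns = {"smiles": "SMILES", "echa_ec_number": "ec_number", "cas_number": "cas_number"}
    for ref in subject.get("identifiers", []):
        column = id_columns.get(ref["type"])
        if column:
            row[column] = ref["value"]

    for ref in subject.get("names", []):
        if ref["type"] == "chemical_name":
            row["chemical_name"] = ref["value"]

    for attribute in source.get("attributes", []):
        for predicate in attribute["predicates"]:
            # Prefer typed values, fall back to the plain value name.
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            else:
                value = predicate["value"]["name"]
            row[predicate["key"]["name"]] = value
    return row


# Hypothetical record, shaped after the keys accessed in the notebook code.
record = {
    "identifiers": [{"type": "cid", "value": "6575"}],
    "subject": {
        "identifiers": [
            {"type": "smiles", "value": "C(=C(Cl)Cl)Cl"},
            {"type": "echa_ec_number", "value": "201-167-4"},
            {"type": "cas_number", "value": "79-01-6"},
        ],
        "names": [{"type": "chemical_name", "value": "Trichloroethylene"}],
    },
    "attributes": [
        {
            "predicates": [
                {
                    "key": {"name": "molecular weight"},
                    "value": {"name": "131.38"},
                    "numerical_value": {"val": 131.38},
                },
                {"key": {"name": "solvent"}, "value": {"name": "chloroform"}},
            ]
        }
    ],
}

row = flatten_record(record)
print(row["chemical_name"], row["molecular weight"], row["solvent"])
```

Unlike the notebook cell, this version uses `.get` with defaults for the top-level keys, so a record without `attributes` or `subject` yields an empty row instead of raising `KeyError`. With real query results it would be called as `flatten_record(hit["_source"])` for each hit.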


### Visualize results with mols2grid

The mols2grid package is a convenient tool for visualizing molecules from their SMILES.
This section shows how to use it on the Deep Search results.


In [8]:
mols2grid.display(df, smiles_col="SMILES")

MolGridWidget()

---
### Search only for the chemical name 

The previous search listed the PubChem entries which mention _Ibuprofen_ anywhere in their content.

Next, we restrict the search to the `subject.names.value` field.

In [9]:
# Search by name
search_query = "subject.names.value:Ibuprofen"

data_collection = ElasticDataCollectionSource(elastic_id="materials", index_key="pubchem-deprecated")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query, # The search query to be executed
    source=["subject", "attributes", "identifiers"], # Which fields of documents we want to fetch
    limit=page_size, # The size of each request page
    coordinates=data_collection # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at the same time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    all_results.extend(result_page.outputs["data_outputs"])

print(f'Finished fetching all data. Total is {len(all_results)} records.')

Finished fetching all data. Total is 1 records.


In [10]:
# Parsing results. From the raw results, we will fetch
# - The CID of the PubChem record
# - The name of the chemical
# - The SMILES
# - The EC and CAS Numbers
# - The chemical and physical properties reported in PubChem, e.g. solubility, molecular weight, etc

results_table = []
for row in all_results:
    result = {
        "cid": "",
        "chemical_name": "",
        "SMILES": "",
        "ec_number": "",
        "cas_number": "",
    }
    for ref in row["_source"]["identifiers"]:
        if ref["type"] == "cid":
            result["cid"] = ref["value"]
    
    for ref in row["_source"]["subject"]["identifiers"]:
        if ref["type"] == "smiles":
            result["SMILES"] = ref["value"]
        if ref["type"] == "echa_ec_number":
            result["ec_number"] = ref["value"]
        if ref["type"] == "cas_number":
            result["cas_number"] = ref["value"]

    for ref in row["_source"]["subject"]["names"]:
        if ref["type"] == "chemical_name":
            result["chemical_name"] = ref["value"]
    
    for attribute in row["_source"]["attributes"]:
        for predicate in attribute["predicates"]:
            value = predicate["value"]["name"]
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            result[predicate["key"]["name"]] = value
                
        
    
    results_table.append(result)

    
# Display the results table
df = pd.DataFrame(results_table)
display(df)

# Visualize the molecules
mols2grid.display(df, smiles_col="SMILES")

| | cid | chemical_name | SMILES | ec_number | cas_number | molecular weight | solubility | temperature | solvent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3672 | Ibuprofen | `CC(C)CC1=CC=C(C=C1)C(C)C(=O)O` | 239-784-6 | 79261-49-7 (potassium salt) | 206.28 | 2.1e-05 | 25.0 | water |


MolGridWidget()

---
## Search PubChem by SMILES

In [11]:
# Search by SMILES
search_smiles = "C1=CC=C2C(=C1)C(=CN2)CCO"
search_query = f"subject.identifiers._name:\"smiles#{search_smiles.lower()}\""

data_collection = ElasticDataCollectionSource(elastic_id="materials", index_key="pubchem-deprecated")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query, # The search query to be executed
    source=["subject", "attributes", "identifiers"], # Which fields of documents we want to fetch
    limit=page_size, # The size of each request page
    coordinates=data_collection # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at the same time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    all_results.extend(result_page.outputs["data_outputs"])

print(f'Finished fetching all data. Total is {len(all_results)} records.')

Finished fetching all data. Total is 3 records.
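
The query string in the cell above follows the pattern `field:"smiles#<lowercased SMILES>"`. The sketch below wraps that pattern in a small helper so the same construction can be reused with a different field; `smiles_query` is a hypothetical name, and the lowercasing simply mirrors what the notebook does rather than a documented requirement.

```python
def smiles_query(smiles: str, field: str = "subject.identifiers._name") -> str:
    """Build the quoted field query used in this notebook for a SMILES string."""
    return f'{field}:"smiles#{smiles.lower()}"'


print(smiles_query("C1=CC=C2C(=C1)C(=CN2)CCO"))
# subject.identifiers._name:"smiles#c1=cc=c2c(=c1)c(=cn2)cco"
```

The patent search later in this notebook uses the same pattern with `field="identifiers._name"`.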


In [12]:
# Parsing results. From the raw results, we will fetch
# - The CID of the PubChem record
# - The name of the chemical
# - The SMILES
# - The EC and CAS Numbers
# - The chemical and physical properties reported in PubChem, e.g. solubility, molecular weight, etc

results_table = []
for row in all_results:
    result = {
        "cid": "",
        "chemical_name": "",
        "SMILES": "",
        "ec_number": "",
        "cas_number": "",
    }
    for ref in row["_source"]["identifiers"]:
        if ref["type"] == "cid":
            result["cid"] = ref["value"]
    
    for ref in row["_source"]["subject"]["identifiers"]:
        if ref["type"] == "smiles":
            result["SMILES"] = ref["value"]
        if ref["type"] == "echa_ec_number":
            result["ec_number"] = ref["value"]
        if ref["type"] == "cas_number":
            result["cas_number"] = ref["value"]

    for ref in row["_source"]["subject"]["names"]:
        if ref["type"] == "chemical_name":
            result["chemical_name"] = ref["value"]
    
    for attribute in row["_source"]["attributes"]:
        for predicate in attribute["predicates"]:
            value = predicate["value"]["name"]
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            result[predicate["key"]["name"]] = value
                
        
    
    results_table.append(result)

# Display the results table
df = pd.DataFrame(results_table)
display(df)

# Visualize the molecules
mols2grid.display(df, smiles_col="SMILES")

| | cid | chemical_name | SMILES | ec_number | cas_number | molecular weight |
|---|---|---|---|---|---|---|
| 0 | 11263702 | | `C1=CC=C2C(=C1)C(=CN2)CCO` | | | 166.23 |
| 1 | 101132237 | | `C1=CC=C2C(=C1)C(=CN2)CCO` | | | 165.22 |
| 2 | 10685 | Tryptophol | `C1=CC=C2C(=C1)C(=CN2)CCO` | 208-393-2 | 526-55-6 | 161.2 |


MolGridWidget()

---
## Search SMILES in USPTO patents

In [13]:
# Search by SMILES
search_smiles = "CCC(COC(=O)CS)(C(=O)C(=O)CS)C(=O)C(=O)CS"

search_query = f"identifiers._name:\"smiles#{search_smiles.lower()}\""

data_collection = ElasticDataCollectionSource(elastic_id="circa", index_key="patent-uspto-smiles")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query, # The search query to be executed
    source=["subject", "attributes", "identifiers"], # Which fields of documents we want to fetch
    limit=page_size, # The size of each request page
    coordinates=data_collection # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (expected_total + page_size - 1) // page_size # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at the same time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    all_results.extend(result_page.outputs["data_outputs"])

print(f'Finished fetching all data. Total is {len(all_results)} records.')

Search query identifiers._name:"smiles#ccc(coc(=o)cs)(c(=o)c(=o)cs)c(=o)c(=o)cs"


Finished fetching all data. Total is 4 records.


In [14]:
# Parsing results. From the raw results, we will fetch
# - The SMILES which is matched
# - The corresponding Patent ID

results_table = []
for row in all_results:
    result = {
        "SMILES": "",
        "Patent ID": "",
    }

    for ref in row["_source"].get("identifiers", []):
        if ref["type"] == "smiles":
            result["SMILES"] = ref["value"]
    for ref in row["_source"]["subject"]["identifiers"]:
        if ref["type"] == "patentid":
            result["Patent ID"] = ref["value"]

    results_table.append(result)

# Display the results table
df = pd.DataFrame(results_table)
display(df)

# Visualize the molecules
mols2grid.display(df, smiles_col="SMILES")

| | SMILES | Patent ID |
|---|---|---|
| 0 | `CCC(COC(=O)CS)(C(=O)C(=O)CS)C(=O)C(=O)CS` | US20110259677A1 |
| 1 | `CCC(COC(=O)CS)(C(=O)C(=O)CS)C(=O)C(=O)CS` | US20110259677A1 |
| 2 | `CCC(COC(=O)CS)(C(=O)C(=O)CS)C(=O)C(=O)CS` | US09944493 |
| 3 | `CCC(COC(=O)CS)(C(=O)C(=O)CS)C(=O)C(=O)CS` | US09944493 |


MolGridWidget()
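
The patent results above contain each SMILES and patent pair twice. If only the distinct patents are of interest, the rows can be de-duplicated before display. The sketch below does this in plain Python on the four rows from the table; `unique_rows` is an illustrative helper, not part of the notebook.

```python
SMILES = "CCC(COC(=O)CS)(C(=O)C(=O)CS)C(=O)C(=O)CS"

# The four rows shown in the results table above.
rows = [
    {"SMILES": SMILES, "Patent ID": "US20110259677A1"},
    {"SMILES": SMILES, "Patent ID": "US20110259677A1"},
    {"SMILES": SMILES, "Patent ID": "US09944493"},
    {"SMILES": SMILES, "Patent ID": "US09944493"},
]


def unique_rows(rows: list) -> list:
    """Drop repeated (SMILES, Patent ID) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["SMILES"], row["Patent ID"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


patents = [row["Patent ID"] for row in unique_rows(rows)]
print(patents)  # ['US20110259677A1', 'US09944493']
```

On the pandas side, `df.drop_duplicates()` gives the same result directly on the DataFrame.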