# Data queries - Chemistry

Chemistry is one of the focal domains for Deep Search. A key capability is searching across
chemistry databases such as PubChem and linking their data with the document collections.
This example starts with searches for molecules by name, synonym, or SMILES.

### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.

### Set notebook parameters

In [1]:
from dsnotebooks.settings import NotebookSettings

# notebook settings are auto-loaded from .env / env vars
notebook_settings = NotebookSettings()

PROFILE_NAME = notebook_settings.profile  # the profile to use

### Import example dependencies

In [2]:
# Import standard dependencies
from copy import deepcopy
import pandas as pd
from numerize.numerize import numerize
import mols2grid
from tqdm.notebook import tqdm

%matplotlib inline

# IPython utilities
from IPython.display import display

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.client.components.elastic import ElasticDataCollectionSource
from deepsearch.cps.queries import DataQuery

### Connect to Deep Search

In [3]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

---

## Search molecules

In this section we look for data collections relevant to chemistry and search for molecules:

- [List data collections](#List-data-collections-in-Materials-Science-domain)
- [Search Ibuprofen on PubChem](#Search-Ibuprofen-on-PubChem)
- [Search PubChem by SMILES](#Search-PubChem-by-SMILES)
- [Search patents](#Search-SMILES-in-USPTO-patents)

### List data collections in _Materials Science_ domain

This queries the Deep Search system for the data collections available in the _Materials Science_ domain.
The list contains a combination of database and document collections.

Interesting data collections for this example are PubChem and the pre-processed USPTO patents.
Deep Search regularly parses the PubChem database to index molecules and their properties.

In [4]:
# Fetch list of all data collections
collections = api.elastic.list(domain="Materials Science")
collections.sort(key=lambda c: c.name.lower())

In [11]:
from datetime import datetime

# Visualize summary table
results = [
    {
        "Name": c.name,
        "Type": c.metadata.type,
        "Num entries": numerize(c.documents),
        "Date": datetime.fromisoformat(c.metadata.created).strftime("%Y-%m-%d"),
        "Coords": f"{c.source.elastic_id}/{c.source.index_key}",
    }
    for c in collections
]
display(pd.DataFrame(results))

| | Name | Type | Num entries | Date | Coords |
|---|---|---|---|---|---|
| 0 | BioRxiv | Document | 357.76K | 2023-11-09 | default/biorxiv |
| 1 | Brenda | Record | 7.12K | 2023-01-03 | default/brenda |
| 2 | ChEMBL | Record | 2.42M | 2024-04-26 | default/chembl |
| 3 | ChEMBL (DEPRECATED) | Record | 2.11M | 2023-01-03 | default/chembl-deprecated |
| 4 | ChemRxiv | Document | 8.82K | 2023-11-23 | default/chemrxiv |
| 5 | COD | Record | 503.78K | 2023-07-24 | default/cod |
| 6 | DeepSearch materials | Record | 360.54K | 2023-01-03 | default/ds4sd-material |
| 7 | Material Components | Document | 16.32K | 2023-01-30 | default/experiment |
| 8 | NMRShift | Record | 44.33K | 2023-01-03 | default/nmrshift |
| 9 | PatCID | Record | 23.91M | 2024-10-04 | default/patcid |
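The `Coords` column combines the two values that the later cells pass to `ElasticDataCollectionSource` as `elastic_id` and `index_key`. The following is a minimal sketch of splitting such a string back into its parts. `CollectionCoords` and `parse_coords` are hypothetical helpers written for this illustration and are not part of the deepsearch-toolkit.

```python
from typing import NamedTuple


class CollectionCoords(NamedTuple):
    """Hypothetical stand-in holding the two parts of a Coords string."""

    elastic_id: str
    index_key: str


def parse_coords(coords: str) -> CollectionCoords:
    """Split an '<elastic_id>/<index_key>' string as shown in the Coords column."""
    elastic_id, sep, index_key = coords.partition("/")
    if not sep or not elastic_id or not index_key:
        raise ValueError(f"expected '<elastic_id>/<index_key>', got {coords!r}")
    return CollectionCoords(elastic_id, index_key)


coords = parse_coords("default/chembl")
print(coords.elastic_id, coords.index_key)
```

The two fields could then be passed on as `ElasticDataCollectionSource(elastic_id=coords.elastic_id, index_key=coords.index_key)`, matching the call used in the search cells of this notebook.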


---
## Search _Ibuprofen_ on PubChem 

In this section we search for all PubChem entries which contain the string _Ibuprofen_.

The results table shows the name of the chemical, its SMILES, and some properties such as the molecular weight and the solubility.


In [12]:
# Search by name
search_query = "Ibuprofen"

data_collection = ElasticDataCollectionSource(elastic_id="default", index_key="pubchem")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query,  # The search query to be executed
    source=[
        "subject",
        "attributes",
        "identifiers",
    ],  # Which fields of documents we want to fetch
    limit=page_size,  # The size of each request page
    coordinates=data_collection,  # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (
    expected_total + page_size - 1
) // page_size  # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at the same time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    all_results.extend(result_page.outputs["data_outputs"])

print(f"Finished fetching all data. Total is {len(all_results)} records.")


Finished fetching all data. Total is 1 records.
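The number of pages used for the progress bar comes from an integer ceiling division. As a small standalone check, the sketch below compares that formula with `math.ceil` for a few totals. The function name `page_count` is chosen here for illustration only.

```python
import math


def page_count(total: int, page_size: int) -> int:
    """Number of pages needed to hold `total` items, using integer arithmetic only."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total + page_size - 1) // page_size


# The integer formula agrees with the floating-point ceiling for these totals.
for total in (0, 1, 50, 51, 120):
    assert page_count(total, 50) == math.ceil(total / 50)

print(page_count(1, 50), page_count(51, 50), page_count(120, 50))  # 1 2 3
```

Staying in integer arithmetic avoids any rounding concerns for very large totals, which is presumably why the notebook uses this form.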


In [13]:
# Parsing results. From the raw results, we will fetch
# - The CID of the PubChem record
# - The name of the chemical
# - The SMILES
# - The EC and CAS Numbers
# - The chemical and physical properties reported in PubChem, e.g. solubility, molecular weight, etc

results_table = []
for row in all_results:
    result = {
        "cid": "",
        "chemical_name": "",
        "SMILES": "",
        "ec_number": "",
        "cas_number": "",
    }
    for ref in row["_source"]["identifiers"]:
        if ref["type"] == "cid":
            result["cid"] = ref["value"]

    for ref in row["_source"]["subject"]["identifiers"]:
        if ref["type"] == "smiles":
            result["SMILES"] = ref["value"]
        if ref["type"] == "echa_ec_number":
            result["ec_number"] = ref["value"]
        if ref["type"] == "cas_number":
            result["cas_number"] = ref["value"]

    for ref in row["_source"]["subject"]["names"]:
        if ref["type"] == "chemical_name":
            result["chemical_name"] = ref["value"]

    for attribute in row["_source"]["attributes"]:
        for predicate in attribute["predicates"]:
            value = predicate["value"]["name"]
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            result[predicate["key"]["name"]] = value

    results_table.append(result)

df = pd.DataFrame(results_table)
display(df)

| | cid | chemical_name | SMILES | ec_number | cas_number | molecular weight | xlogp3 | hydrogen bond donor count | hydrogen bond acceptor count | rotatable bond count | ... | monoisotopic mass | topological polar surface area | heavy atom count | formal charge | complexity | isotope atom count | defined atom stereocenter count | undefined atom stereocenter count | covalently-bonded unit count | compound is canonicalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3672 | Ibuprofen | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O | 239-784-6 | 79261-49-7 (potassium salt) | 206.28 | 3.5 | 1.0 | 2.0 | 4.0 | ... | 206.13068 | 37.3 | 15.0 | 0.0 | 203.0 | 0.0 | 0.0 | 1.0 | 1.0 | Yes |
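The parsing loop above is repeated unchanged for every query in this notebook. One option is to factor it into a function, as in the sketch below. It reads only the keys that the notebook cell reads; the `sample` record is hand-written for illustration and is not a real query result, so the exact shape of a live response is an assumption.

```python
def flatten_record(row: dict) -> dict:
    """Flatten one raw result into a flat dict suitable for a DataFrame row."""
    source = row["_source"]
    result = {"cid": "", "chemical_name": "", "SMILES": "", "ec_number": "", "cas_number": ""}

    # Record-level identifiers: keep the PubChem CID
    for ref in source["identifiers"]:
        if ref["type"] == "cid":
            result["cid"] = ref["value"]

    # Subject-level identifiers, mapped to their output column names
    columns = {"smiles": "SMILES", "echa_ec_number": "ec_number", "cas_number": "cas_number"}
    for ref in source["subject"]["identifiers"]:
        if ref["type"] in columns:
            result[columns[ref["type"]]] = ref["value"]

    for ref in source["subject"]["names"]:
        if ref["type"] == "chemical_name":
            result["chemical_name"] = ref["value"]

    # One column per property; prefer the typed value when it is present
    for attribute in source["attributes"]:
        for predicate in attribute["predicates"]:
            value = predicate["value"]["name"]
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            result[predicate["key"]["name"]] = value
    return result


# Hand-written sample (assumption: mirrors only the keys accessed above)
sample = {
    "_source": {
        "identifiers": [{"type": "cid", "value": "3672"}],
        "subject": {
            "identifiers": [{"type": "smiles", "value": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"}],
            "names": [{"type": "chemical_name", "value": "Ibuprofen"}],
        },
        "attributes": [
            {
                "predicates": [
                    {
                        "key": {"name": "molecular weight"},
                        "value": {"name": "206.28"},
                        "numerical_value": {"val": 206.28},
                    }
                ]
            }
        ],
    }
}

flat = flatten_record(sample)
print(flat["cid"], flat["chemical_name"], flat["molecular weight"])
```

With such a helper, each parsing cell would reduce to `results_table = [flatten_record(row) for row in all_results]`.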


### Visualize results with mols2grid

The mols2grid package is a convenient tool for rendering molecules from their SMILES strings.
This section illustrates how to use it to visualize the Deep Search results.


In [14]:
mols2grid.display(df, smiles_col="SMILES")

MolGridWidget()

---
### Search only for the chemical name 

The previous search listed the PubChem entries which mention _Ibuprofen_ anywhere in their content.

Next, we limit the search to the `subject.names.value` field.

In [15]:
# Search by name
search_query = "subject.names.value:Ibuprofen"

data_collection = ElasticDataCollectionSource(elastic_id="default", index_key="pubchem")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query,  # The search query to be executed
    source=[
        "subject",
        "attributes",
        "identifiers",
    ],  # Which fields of documents we want to fetch
    limit=page_size,  # The size of each request page
    coordinates=data_collection,  # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (
    expected_total + page_size - 1
) // page_size  # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at the same time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    all_results.extend(result_page.outputs["data_outputs"])

print(f"Finished fetching all data. Total is {len(all_results)} records.")


Finished fetching all data. Total is 1 records.


In [16]:
# Parsing results. From the raw results, we will fetch
# - The CID of the PubChem record
# - The name of the chemical
# - The SMILES
# - The EC and CAS Numbers
# - The chemical and physical properties reported in PubChem, e.g. solubility, molecular weight, etc

results_table = []
for row in all_results:
    result = {
        "cid": "",
        "chemical_name": "",
        "SMILES": "",
        "ec_number": "",
        "cas_number": "",
    }
    for ref in row["_source"]["identifiers"]:
        if ref["type"] == "cid":
            result["cid"] = ref["value"]

    for ref in row["_source"]["subject"]["identifiers"]:
        if ref["type"] == "smiles":
            result["SMILES"] = ref["value"]
        if ref["type"] == "echa_ec_number":
            result["ec_number"] = ref["value"]
        if ref["type"] == "cas_number":
            result["cas_number"] = ref["value"]

    for ref in row["_source"]["subject"]["names"]:
        if ref["type"] == "chemical_name":
            result["chemical_name"] = ref["value"]

    for attribute in row["_source"]["attributes"]:
        for predicate in attribute["predicates"]:
            value = predicate["value"]["name"]
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            result[predicate["key"]["name"]] = value

    results_table.append(result)


# Display the results table
df = pd.DataFrame(results_table)
display(df)

# Visualize the molecules
mols2grid.display(df, smiles_col="SMILES")

| | cid | chemical_name | SMILES | ec_number | cas_number | molecular weight | xlogp3 | hydrogen bond donor count | hydrogen bond acceptor count | rotatable bond count | ... | monoisotopic mass | topological polar surface area | heavy atom count | formal charge | complexity | isotope atom count | defined atom stereocenter count | undefined atom stereocenter count | covalently-bonded unit count | compound is canonicalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3672 | Ibuprofen | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O | 239-784-6 | 79261-49-7 (potassium salt) | 206.28 | 3.5 | 1.0 | 2.0 | 4.0 | ... | 206.13068 | 37.3 | 15.0 | 0.0 | 203.0 | 0.0 | 0.0 | 1.0 | 1.0 | Yes |


MolGridWidget()

---
## Search PubChem by SMILES
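The cell in this section searches the `subject.identifiers._name` field with a quoted value of the form `smiles#` followed by the lower-cased SMILES string. The sketch below wraps that construction in a small function so the format lives in one place. The name `smiles_query` is hypothetical; the notebook does not say why the SMILES is lower-cased, so treating the indexed `_name` values as lower case is an assumption.

```python
def smiles_query(smiles: str) -> str:
    """Build the field-restricted query string used for an exact SMILES lookup."""
    # Lower-casing mirrors the notebook cell; the indexed form is assumed to be lower case.
    return f'subject.identifiers._name:"smiles#{smiles.lower()}"'


print(smiles_query("C1=CC=C2C(=C1)C(=CN2)CCO"))
# subject.identifiers._name:"smiles#c1=cc=c2c(=c1)c(=cn2)cco"
```

The returned string can be passed as the first argument of `DataQuery`, in the same way as `search_query` in the cell that follows.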

In [17]:
# Search by SMILES
search_smiles = "C1=CC=C2C(=C1)C(=CN2)CCO"
search_query = f'subject.identifiers._name:"smiles#{search_smiles.lower()}"'

data_collection = ElasticDataCollectionSource(elastic_id="default", index_key="pubchem")
page_size = 50

# Prepare the data query
query = DataQuery(
    search_query,  # The search query to be executed
    source=[
        "subject",
        "attributes",
        "identifiers",
    ],  # Which fields of documents we want to fetch
    limit=page_size,  # The size of each request page
    coordinates=data_collection,  # The data collection to be queried
)


# [Optional] Compute the number of total results matched. This can be used to monitor the pagination progress.
count_query = deepcopy(query)
count_query.paginated_task.parameters["limit"] = 0
count_results = api.queries.run(count_query)
expected_total = count_results.outputs["data_count"]
expected_pages = (
    expected_total + page_size - 1
) // page_size  # this is simply a ceiling formula


# Iterate through all results by fetching `page_size` results at the same time
all_results = []
cursor = api.queries.run_paginated_query(query)
for result_page in tqdm(cursor, total=expected_pages):
    all_results.extend(result_page.outputs["data_outputs"])

print(f"Finished fetching all data. Total is {len(all_results)} records.")


Finished fetching all data. Total is 3 records.


In [18]:
# Parsing results. From the raw results, we will fetch
# - The CID of the PubChem record
# - The name of the chemical
# - The SMILES
# - The EC and CAS Numbers
# - The chemical and physical properties reported in PubChem, e.g. solubility, molecular weight, etc

results_table = []
for row in all_results:
    result = {
        "cid": "",
        "chemical_name": "",
        "SMILES": "",
        "ec_number": "",
        "cas_number": "",
    }
    for ref in row["_source"]["identifiers"]:
        if ref["type"] == "cid":
            result["cid"] = ref["value"]

    for ref in row["_source"]["subject"]["identifiers"]:
        if ref["type"] == "smiles":
            result["SMILES"] = ref["value"]
        if ref["type"] == "echa_ec_number":
            result["ec_number"] = ref["value"]
        if ref["type"] == "cas_number":
            result["cas_number"] = ref["value"]

    for ref in row["_source"]["subject"]["names"]:
        if ref["type"] == "chemical_name":
            result["chemical_name"] = ref["value"]

    for attribute in row["_source"]["attributes"]:
        for predicate in attribute["predicates"]:
            value = predicate["value"]["name"]
            if "nominal_value" in predicate:
                value = predicate["nominal_value"]["value"]
            elif "numerical_value" in predicate:
                value = predicate["numerical_value"]["val"]
            result[predicate["key"]["name"]] = value

    results_table.append(result)

# Display the results table
df = pd.DataFrame(results_table)
display(df)

# Visualize the molecules
mols2grid.display(df, smiles_col="SMILES")

| | cid | chemical_name | SMILES | ec_number | cas_number | molecular weight | xlogp3 | hydrogen bond donor count | hydrogen bond acceptor count | rotatable bond count | ... | monoisotopic mass | topological polar surface area | heavy atom count | formal charge | complexity | isotope atom count | defined atom stereocenter count | undefined atom stereocenter count | covalently-bonded unit count | compound is canonicalized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11263702 | | C1=CC=C2C(=C1)C(=CN2)CCO | | | 166.23 | 1.8 | 2.0 | 1.0 | 2.0 | ... | 166.115448 | 36.0 | 12.0 | 0.0 | 149.0 | 5.0 | 0.0 | 0.0 | 1.0 | Yes |
| 1 | 101132237 | | C1=CC=C2C(=C1)C(=CN2)CCO | | | 165.22 | 1.8 | 2.0 | 1.0 | 2.0 | ... | 165.109171 | 36.0 | 12.0 | 0.0 | 149.0 | 4.0 | 0.0 | 0.0 | 1.0 | Yes |
| 2 | 10685 | Tryptophol | C1=CC=C2C(=C1)C(=CN2)CCO | 208-393-2 | 526-55-6 | 161.2 | 1.8 | 2.0 | 1.0 | 2.0 | ... | 161.084064 | 36.0 | 12.0 | 0.0 | 149.0 | 0.0 | 0.0 | 0.0 | 1.0 | Yes |


MolGridWidget()
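The three hits share one SMILES string but differ in molecular weight and in `isotope atom count`, which suggests that the first two are isotopically labelled variants of the third. If only the unlabelled compound is wanted, the rows can be filtered on that column. The sketch below does this on a hand-typed copy of three columns from the table above; the `rows` list and the function name `unlabelled` are illustrative, and the `cid` values are written as strings by assumption.

```python
# Hand-typed subset of the results table above (three columns only)
rows = [
    {"cid": "11263702", "chemical_name": "", "isotope atom count": 5.0},
    {"cid": "101132237", "chemical_name": "", "isotope atom count": 4.0},
    {"cid": "10685", "chemical_name": "Tryptophol", "isotope atom count": 0.0},
]


def unlabelled(records: list) -> list:
    """Keep records whose isotope atom count is zero or missing."""
    return [r for r in records if r.get("isotope atom count", 0) == 0]


print([r["cid"] for r in unlabelled(rows)])  # ['10685']
```

On the notebook's DataFrame the equivalent would be `df[df["isotope atom count"] == 0]`.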