In [None]:
# apply Jupyter notebook style
from IPython.display import HTML

from custom.styles import style_string

HTML(style_string)


# Chemical Data Sources and Programmatic Retrieval of Information

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* What kinds of chemical data are available online?

* How can I use Python to retrieve information from chemical databases?
    
* What is a REST API?

Objectives:

* Learn about chemical data sources available on the web.
    
* Use Python requests and web APIs to programmatically retrieve information.

</div>

## Web Databases

In this lesson, we will explore online chemical databases and learn how to use Python to retrieve information from them. These databases offer data on chemical structures, properties, biological activities, and more. By the end of this lesson, you will be familiar with several popular chemical data sources and be able to access their data using Python and REST APIs.

Some popular chemical databases along with their websites are listed below:

[PubChem](https://pubchem.ncbi.nlm.nih.gov/): PubChem is a comprehensive public database of mostly small molecules. It is maintained by the National Institutes of Health and contains information on compound structures, calculated properties, and in some cases, experimental properties.

[ChEMBL](https://www.ebi.ac.uk/chembl/): ChEMBL is a large-scale bioactivity database for drug discovery. It is maintained by the European Molecular Biology Laboratory (EMBL) and contains information on bioactive molecules, targets, and assay data, mainly curated from scientific literature.

[Protein Data Bank (PDB)](https://www.rcsb.org/): The Protein Data Bank is a repository for 3D structural data of proteins, nucleic acids, and their complexes. It provides information on atomic coordinates and experimental details. The PDB is managed by the Research Collaboratory for Structural Bioinformatics (RCSB).

[Materials Project](https://materialsproject.org/): The Materials Project is a collaborative effort to provide an open-source database of materials properties. The database was established in 2011 and originally focused on battery research. It offers a variety of data, including crystal structures and electronic properties for various materials.

[ChemSpider](http://www.chemspider.com/): ChemSpider is a free chemical structure database owned by the Royal Society of Chemistry. It aggregates data from various sources, providing information on chemical structures, properties, and associated biological activities.

[NIST Chemistry WebBook](https://webbook.nist.gov/chemistry/): The Chemistry WebBook is maintained by the National Institute of Standards and Technology (NIST) and provides chemical and physical property data for molecular systems.

There are many more examples of chemical data sources which we were not able to list here.

## PubChem 

PubChem is one of the most comprehensive databases of small molecules available.
It contains both calculated properties like molecular descriptors, and experimental properties of molecules (though these are harder to access using the API).
PubChem's API is open and does not require an account to use, making it one of the most accessible databases.
The API is also very powerful, as we will see in this lesson.

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    <p>In the next section, we will work with PubChem programmatically. First, it will be useful for you to familiarize yourself with the PubChem website. 
        Go to <a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem </a> and look up aspirin. Retrieve the molecular weight, number of hydrogen bond
        acceptors, and the topological polar surface area and save them in variables.</p>
    <p>Next retrieve the same information using RDKit.</p>
</div>

In [None]:
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
#aspirin_mw = 
#aspirin_h_acceptors = 
#aspirin_tpsa = 

In [None]:
# Retrieve the same information with RDKit


All of the databases mentioned so far are accessible on the web through your browser. 
Some of them also provide application programming interfaces (APIs) that allow you to access them programmatically.
Today, we will access databases that provide REST APIs using a Python library for retrieving information from the web.

## Introduction to REST APIs using PUG REST

Many of these databases can be accessed programmatically through something called a REST API.
REST stands for **R**epresentational **S**tate **T**ransfer. API stands for **A**pplication **P**rogramming **I**nterface. 
A REST API is a type of web API that is used to allow different software systems to communicate with each other over the internet. 

Usually a REST API is accessed by varying parameters in a URL.

We will work with the PubChem REST API in this lesson. Although the details of every API are different, working with this API will give you some idea of how to work with other REST APIs, should you need to do so in the future.

PubChem's main REST API is called [PUG REST](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=URL-based-API). "PUG" in this case stands for "Power User Gateway".

<div class="warning admonition">
<p class="admonition-title">Warning</p>
    <p>Please note that PUG REST is not designed for very large volumes (millions) of requests. PubChem asks that any script or application not make more than 5 requests per second, in order to avoid overloading the PubChem servers.</p>
    <p>If you need to build a very large dataset, it is recommended that you contact PubChem.</p>
</div>
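
One simple way to stay under the request limit is to pause between calls. The sketch below is a minimal example, not an official PubChem recommendation: the helper name `throttled` and its `max_per_second` parameter are made up for illustration, and the `print` call stands in for a real request.

```python
import time


def throttled(items, max_per_second=5):
    """Yield items one at a time, no faster than max_per_second."""
    min_interval = 1.0 / max_per_second
    last = None
    for item in items:
        if last is not None:
            # Sleep for whatever remains of the minimum interval.
            wait = min_interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        yield item


start = time.monotonic()
for name in throttled(["aspirin", "caffeine", "serotonin"]):
    print(name)  # a request for `name` would go here
elapsed = time.monotonic() - start
print(f"Processed 3 names in {elapsed:.2f} s")
```

Wrapping the list you loop over is enough to space out the calls, so the body of the loop does not need to change.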

To retrieve information using PUG REST, you add parameters to the base URL


```
https://pubchem.ncbi.nlm.nih.gov/rest/pug
```

For example, to retrieve information about aspirin using SMILES, we would add

`/compound/smiles/{aspirin_smiles}`

The full URL will be

[https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O](https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O)

Go to this URL in your browser.
The output will be information about the molecule. 
By default, the output format is XML. However, you can change the format of the returned data by
adding another field to the URL. 
A commonly used return format for REST APIs is JSON.
For example, we can request JSON by adding
`/JSON` to the end of our URL, making the URL

[https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/JSON](https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/JSON)

You could even choose to get the SDF for the molecule by changing the output to SDF.

[https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/SDF](https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/SDF)

You can retrieve a different molecule by changing the SMILES string. 
If you wanted to use a different chemical identifier, an InChIKey, for example, you would change the word "smiles" to "inchikey" in the URL,
then update the chemical identifier accordingly.
For this molecule, you could have also changed "smiles" to "name" and used the name "aspirin".

### PUG REST URL Design

The PUG REST URL follows a specific format. Its parts are:

* the "prolog" - the base of the URL that does not change.
* the "input" - defines the molecule or molecules you are looking for
* the "operation" - defines what you want back from PubChem.
* the "output" - defines the output format

<table><thead><tr><th><strong><a href="https://pubchem.ncbi.nlm.nih.gov/rest/pug">https://pubchem.ncbi.nlm.nih.gov/rest/pug</a></strong></th><th><strong>/compound/name/vioxx</strong></th><th><strong>/property/InChI</strong></th><th><strong>/TXT</strong></th></tr></thead><tbody><tr><td><em>prolog</em></td><td><em>input</em></td><td><em>operation</em></td><td><em>output</em></td></tr></tbody></table>
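
The four parts can also be assembled in Python. The following is a minimal sketch: the function `build_pug_rest_url` and its argument names are hypothetical helpers, not part of PubChem or any library. It percent-encodes the identifier with the standard library's `urllib.parse.quote`, on the assumption that characters such as `#` or `/` in a SMILES string could otherwise be misread as URL syntax.

```python
from urllib.parse import quote

PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_rest_url(namespace, identifier, operation="", output="JSON"):
    """Join prolog, input, operation, and output into one URL (illustrative helper)."""
    # input: /compound/<namespace>/<identifier>, with the identifier percent-encoded
    parts = [PROLOG, "compound", namespace, quote(identifier, safe="")]
    # operation is optional, e.g. "property/InChI"
    if operation:
        parts.append(operation)
    # output format, e.g. "JSON", "TXT", or "SDF"
    parts.append(output)
    return "/".join(parts)


print(build_pug_rest_url("name", "vioxx", "property/InChI", "TXT"))
print(build_pug_rest_url("smiles", "CC(=O)OC1=CC=CC=C1C(=O)O", "property/MolecularWeight"))
```

The first call reproduces the URL in the table above; the second shows the same pattern with a SMILES input.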

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    <p>Use your browser to look up another molecule of interest using the PUG REST URL. Try several ways of accessing the molecule information
    including name, SMILES, or InChIKey.</p>
    <p>Try changing JSON to something else. Some things you might try are TXT, CSV, SDF, or PNG.</p>
</div>


## Programmatic Access of APIs

REST APIs start being more useful when you access them programmatically.
We are going to use Python to retrieve the data at the URL and convert it to a format we can work with in Python.

We will use a Python library called `requests`. Requests is used to retrieve information from websites and URLs.

In [None]:
import requests

To get information from a URL, we use the `requests.get` function. 
Its argument is the URL we'd like to retrieve information from.

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/JSON")

Our `data` variable now contains the results and other information about the request we made.

If our request was successful, it will have a status code of `200`.


In [None]:
data.status_code
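
Rather than inspecting the status code by hand each time, you can let `requests` raise an error for failed requests. The sketch below wraps this in a small helper; the name `get_json` and the 10-second timeout are illustrative choices, not part of the lesson's own code.

```python
import requests


def get_json(url, timeout=10):
    """Fetch a URL and return its parsed JSON, raising an error on failure."""
    # timeout stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.json()
```

Calling `get_json` with the aspirin URL used above should return the parsed JSON for that request, while a misspelled URL would raise an exception instead of silently returning an error page.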

We can get the JSON associated with our request by calling the `.json()` method. We will save the result in a variable called `pubchem_aspirin`.
Our variable is now a Python dictionary, a data type that stores key-value pairs.

In [None]:
pubchem_aspirin = data.json()
print(pubchem_aspirin)

Recall from Notebook 0 that Python dictionaries, like the one we just got back, allow accessing data using key-value pairs. 
This result gives us more information than we need, so we will make one more modification to the search. 
We will add some more arguments to specify that we only want certain properties returned.

For example, we will modify our request to only return molecular weight and the IUPAC name of our compound.
To limit the data that is received, we add `/property/PROPERTY_NAMES` after the SMILES string in the URL, where `PROPERTY_NAMES` is a comma-separated list of property names.

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/property/MolecularWeight,IUPACName/JSON")

In [None]:
pubchem_aspirin = data.json()
print(pubchem_aspirin)

## Understanding JSON

The return value from our last request looks like this if printed nicely.

```
{
 'PropertyTable': 
    {
    'Properties': [
        {
        'CID': 2244, 
        'MolecularWeight': '180.16', 
        'IUPACName': '2-acetyloxybenzoic acid'
        }
     ]
    }
}
```



This data is in a nested Python dictionary. Dictionaries store key, value pairs.
The first key in this dictionary is `PropertyTable`. We can get the data in the `PropertyTable` key using the following syntax.

In [None]:
pubchem_aspirin["PropertyTable"]

In [None]:
pubchem_aspirin["PropertyTable"]["Properties"]
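
Going one level deeper, `Properties` holds a list with one dictionary per compound. The sketch below uses a hand-copied version of the response printed above, so it runs without a network connection; the variable names `aspirin` and `molecular_weight` are only for illustration.

```python
# Sample response copied from the printed output above (no request is made here).
pubchem_aspirin = {
    "PropertyTable": {
        "Properties": [
            {
                "CID": 2244,
                "MolecularWeight": "180.16",
                "IUPACName": "2-acetyloxybenzoic acid",
            }
        ]
    }
}

# "Properties" is a list, so we index into it before using property names as keys.
properties = pubchem_aspirin["PropertyTable"]["Properties"]
aspirin = properties[0]

# The molecular weight is a string in this response, so convert it before doing math.
molecular_weight = float(aspirin["MolecularWeight"])
print(aspirin["IUPACName"], molecular_weight)
```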

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    <p>Modify the request URL to also retrieve the number of hydrogen bond acceptors and the topological polar surface area.</p>
    <p>You should refer to the <a href="https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=Compound-Property-Tables">list of compound properties in the PUG REST documentation</a> to get the correct property names.</p>
</div>

The real benefit of using an API programmatically is that you can quickly retrieve information and work with it in Python. 
For example, suppose we wanted to retrieve information about aspirin, caffeine, serotonin, and dopamine.

We make a list of our molecules of interest, then use a `for` loop to retrieve information about them all.
Note that the data we are printing is in JSON format. 

In [None]:
molecules = ["aspirin", "caffeine", "serotonin", "dopamine"]

for molecule in molecules:
    data = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{molecule}/property/MolecularWeight,IUPACName,IsomericSmiles/JSON")
    molecule_data = data.json()
    print(molecule, molecule_data["PropertyTable"]["Properties"])

You could save this in a pandas DataFrame, or even load the SMILES into an RDKit mol object for further analysis.

The PUG REST API for PubChem is very powerful. You can search for molecules based on structural similarity, or even retrieve patent information for molecules. There is much more to the PubChem API that you can find from reading the [PubChem PUG REST tutorial](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial) and [documentation](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest).

## Building a Data Set using the PubChem API

We can use the PUG REST API to build data sets in DataFrames. 
Depending on the information you are trying to get, there will be many ways you could approach this problem.

In the example below, we will loop over the molecule names and retrieve the CID for each molecule. 
The CID is PubChem's unique identifier for a compound.
Then, we can make a single API call for all of our molecules using the CIDs. When using CIDs, PubChem allows you to retrieve information for more than one molecule 
at a time. We will request the CIDs as JSON, which gives us a list for each name, since a name can match more than one CID.


In [None]:
molecules = ["aspirin", "caffeine", "serotonin", "dopamine"]

cids = []

for molecule in molecules:
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{molecule}/cids/JSON"
    response = requests.get(url)
    cid = response.json()["IdentifierList"]["CID"]
    cids.extend(cid)

# convert to string
cids = [str(cid) for cid in cids]
cids_string = ",".join(cids)
print(cids_string)
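
The loop above assumes every name is found. A more defensive version might skip names that return no `IdentifierList` and drop duplicate CIDs. The sketch below works on hand-written stand-ins for `response.json()`, so no request is made; the helper name `collect_cids`, the two made-up entries, and the shape of the `Fault` dictionary are assumptions for illustration.

```python
def collect_cids(responses):
    """Flatten parsed /cids/JSON responses into a list of unique CID strings."""
    cids = []
    for name, payload in responses.items():
        # Assumption: a failed lookup has no "IdentifierList" key.
        if "IdentifierList" not in payload:
            print(f"No CID found for {name!r}; skipping")
            continue
        for cid in payload["IdentifierList"]["CID"]:
            # Keep the first occurrence of each CID, preserving order.
            if str(cid) not in cids:
                cids.append(str(cid))
    return cids


# Stand-ins for response.json(); the last two entries are hypothetical.
sample = {
    "aspirin": {"IdentifierList": {"CID": [2244]}},
    "name with two hits": {"IdentifierList": {"CID": [111, 222]}},
    "not a molecule": {"Fault": {"Message": "No CID found"}},
}

print(",".join(collect_cids(sample)))
```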

Next, we will use the list of CIDs to request information about all of the molecules at once.

In [None]:
data_set = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cids_string}/property/MolecularWeight,IUPACName,IsomericSmiles/JSON")

json_data = data_set.json()

In [None]:
import pandas as pd

df = pd.DataFrame(json_data["PropertyTable"]["Properties"])

In [None]:
df
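
In the response printed earlier in this lesson, `MolecularWeight` appears as a string, so the corresponding DataFrame column may not be numeric. The sketch below shows one way to convert it, using two illustrative rows shaped like the entries in `json_data["PropertyTable"]["Properties"]` rather than a live request; the name `example_df` is only for illustration.

```python
import pandas as pd

# Illustrative rows shaped like the "Properties" list returned by PubChem.
rows = [
    {"CID": 2244, "MolecularWeight": "180.16", "IUPACName": "2-acetyloxybenzoic acid"},
    {"CID": 2519, "MolecularWeight": "194.19", "IUPACName": "1,3,7-trimethylpurine-2,6-dione"},
]

example_df = pd.DataFrame(rows)

# Convert the string column to numbers so it can be sorted, averaged, or plotted.
example_df["MolecularWeight"] = pd.to_numeric(example_df["MolecularWeight"])
print(example_df.dtypes)
```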

We could modify our URL slightly to get an SDF file instead. If we do this, we have to write the text to a file, then we can load it.

In [None]:
sdfs = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cids_string}/record/SDF")
sdf_text = sdfs.text

In [None]:
# write sdf to file
import os

os.makedirs("data", exist_ok=True)  # create the output folder if it is missing

with open("data/molecules.sdf", "w") as f:
    f.write(sdf_text)


Now we can reload our data using PandasTools.

In [None]:
from rdkit.Chem import PandasTools

PandasTools.RenderImagesInAllDataFrames(True)

In [None]:
df = PandasTools.LoadSDF("data/molecules.sdf")

df.head()

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    Construct a data set using the set of molecules in the cell below. Some of the molecules
    will return more than one CID.
</div>

In [None]:
food_dyes = ["Allura Red AC", "Tartrazine", 
             "Sunset Yellow FCF", "Brilliant Blue FCF",
            "Indigotine", "Fast Green FCF"]

