In [None]:
# apply Jupyter notebook style
from IPython.display import HTML

from custom.styles import style_string

HTML(style_string)


# Chemical Data Sources and Programmatic Retrieval of Information

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* What kinds of chemical data are available online?

* How can I use Python to retrieve information from chemical databases?
    
* What is a REST API?

Objectives:

* Learn about chemical data sources available on the web.
    
* Use Python requests and web APIs to programmatically retrieve information.

</div>

## Web Databases

In this lesson, we will explore online chemical databases and learn how to use Python to retrieve information from them. These databases offer data on chemical structures, properties, biological activities, and more. By the end of this lesson, you will be familiar with several popular chemical data sources and be able to access their data using Python and REST APIs.

Some popular chemical databases along with their websites are listed below:

[PubChem](https://pubchem.ncbi.nlm.nih.gov/): PubChem is a comprehensive public database of mostly small molecules. It is maintained by the National Institutes of Health and contains information on compound structures, computed properties, and, in some cases, experimentally measured properties.

[ChEMBL](https://www.ebi.ac.uk/chembl/): ChEMBL is a large-scale bioactivity database for drug discovery. It is maintained by the European Molecular Biology Laboratory (EMBL) and contains information on bioactive molecules, targets, and assay data, mainly curated from scientific literature.

[Protein Data Bank (PDB)](https://www.rcsb.org/): The Protein Data Bank is a repository for 3D structural data of proteins, nucleic acids, and their complexes. It provides atomic coordinates and experimental details for each structure. The PDB is managed by the Research Collaboratory for Structural Bioinformatics (RCSB).

[Materials Project](https://materialsproject.org/): The Materials Project is a collaborative effort to provide an open-source database of materials properties. The database was established in 2011 and originally focused on battery research. It offers a variety of data, including crystal structures and electronic properties for various materials.

[ChemSpider](http://www.chemspider.com/): ChemSpider is a free chemical structure database owned by the Royal Society of Chemistry. It aggregates data from various sources, providing information on chemical structures, properties, and associated biological activities.

[NIST Chemistry WebBook](https://webbook.nist.gov/chemistry/): The Chemistry WebBook is maintained by the National Institute of Standards and Technology (NIST) and provides chemical and physical property data for molecular systems.

There are many more chemical data sources that we are not able to list here.

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    <p>In the next section, we will work with PubChem programmatically. First, it will be useful for you to familiarize yourself with the PubChem website. 
        Go to <a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a> and look up aspirin. Retrieve the molecular weight, number of hydrogen bond
        acceptors, and the topological polar surface area and save them in variables.</p>
    <p>Next, retrieve the same information using RDKit.</p>
</div>

In [None]:
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
#aspirin_mw = 
#aspirin_h_acceptors = 
#aspirin_tpsa = 

In [None]:
# Retrieve the same information with RDKit


All of the databases mentioned so far are accessible on the web through your browser. 
Some of them also provide application programming interfaces (APIs) that allow you to access them programmatically.
Today, we will access databases that provide REST APIs using a Python library for retrieving information from the web.

## Introduction to REST APIs using PUG REST

Many of these databases can be accessed programmatically through something called a REST API.
REST stands for **R**epresentational **S**tate **T**ransfer. API stands for **A**pplication **P**rogramming **I**nterface. 
A REST API is a type of web API that is used to allow different software systems to communicate with each other over the internet. 

Usually a REST API is accessed by varying parameters in a URL.

We will work with the PubChem REST API first to demonstrate what this means.

PubChem has a few REST APIs and the first we will look at is called [PUG REST](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=URL-based-API).

<div class="warning admonition">
<p class="admonition-title">Warning</p>
    <p>Please note that PUG REST is not designed for very large volumes (millions) of requests. PubChem asks that any script or application not make more than 5 requests per second, in order to avoid overloading the PubChem servers.</p>
    <p>If you need to build a very large dataset, it is recommended that you contact PubChem.</p>
</div>
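One simple way to respect the request limit in a script is to pause between requests. The sketch below is only an illustration: `throttled` is a hypothetical helper (not part of PubChem or any library) that yields items no faster than a chosen rate, using only the standard library.

```python
import time

def throttled(items, max_per_second=5):
    """Yield items one at a time, no faster than max_per_second.

    Hypothetical helper for illustration; the default of 5 mirrors
    the limit PubChem asks scripts to stay under.
    """
    min_interval = 1.0 / max_per_second
    last_time = None
    for item in items:
        if last_time is not None:
            # Time already spent since the previous item counts toward the interval
            wait = min_interval - (time.monotonic() - last_time)
            if wait > 0:
                time.sleep(wait)
        last_time = time.monotonic()
        yield item

for name in throttled(["aspirin", "caffeine", "serotonin"]):
    print(name)  # a web request for each name would go here
```

Because the pause happens inside the generator, the loop body stays the same as an ordinary `for` loop.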

To retrieve information using PUG REST, you add parameters to the base URL:


```
https://pubchem.ncbi.nlm.nih.gov/rest/pug
```

For example, to retrieve information about aspirin using SMILES, we would add

`/compound/smiles/{aspirin_smiles}`

The full URL will be

```
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O
```

Go to this URL in your browser.
The output will be information about the molecule. 
The output format is something called XML. However, you can change the format of the data returned by
adding another field to the URL. For example, we can request the data in a format called JSON by adding
`/JSON` to the end of our URL, making the URL

```
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/JSON
```

You can retrieve a different molecule by changing the SMILES string. 
If you wanted to use a different chemical identifier, such as an InChIKey, you would change the word "smiles" to "inchikey" in the URL
then update the chemical identifier accordingly.
For this molecule, you could have also changed "smiles" to "name" and used the name "aspirin".
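The URL pattern described above can be sketched as a small function that joins the pieces together. The name `build_compound_url` and its parameters are hypothetical, chosen only for this illustration; the function just assembles a string and does not contact PubChem.

```python
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def build_compound_url(namespace, identifier, output="JSON"):
    """Assemble a PUG REST compound URL from its parts.

    namespace  -- the kind of identifier, e.g. "smiles", "name", or "inchikey"
    identifier -- the identifier itself, e.g. a SMILES string or a name
    output     -- the requested output format, e.g. "JSON"

    Hypothetical helper: identifiers containing characters such as "/"
    may need extra handling that is not shown here.
    """
    return f"{BASE_URL}/compound/{namespace}/{identifier}/{output}"

aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(build_compound_url("smiles", aspirin_smiles))
print(build_compound_url("name", "aspirin"))
```

The first printed URL matches the JSON URL shown earlier; the second looks up the same molecule by name.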

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    <p>Use your browser to look up another molecule of interest using the PUG REST URL. Try several ways of accessing the molecule information
    including name, SMILES, or InChIKey.</p>
    <p>Try changing JSON to something else. Some things you might try are TXT, CSV, SDF, or PNG.</p>
</div>


## Programmatic Access of APIs

REST APIs start being more useful when you access them programmatically.
We are going to use Python to retrieve the data at the URL and convert it to a format we can work with.

We will use a Python library called `requests`. Requests is used to request information from websites and URLs.

In [None]:
import requests

To get information from a URL, we use the `requests.get` function. 
The argument to this function is the URL we'd like to retrieve information from.

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/JSON")

Our `data` variable now contains the results and other information about the request we made.
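As a rough illustration of the "other information" a response carries, the sketch below gathers a few attributes that `requests` response objects provide. The function name `summarize_response` is hypothetical and is introduced only for this example.

```python
def summarize_response(response):
    """Collect a few commonly inspected attributes of a requests response.

    Hypothetical helper for illustration only.
    """
    return {
        "status_code": response.status_code,  # 200 means the request succeeded
        "ok": response.ok,                    # True when the status code is below 400
        "content_type": response.headers.get("Content-Type"),
        "url": response.url,                  # the URL that was requested
    }
```

In the notebook, calling `summarize_response(data)` would show these values for the request above. If you prefer your code to stop when a request fails, `data.raise_for_status()` raises an exception for error status codes.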

We can view the JSON returned by our request by calling the `.json()` method. We will save the result in a variable called `pubchem_aspirin`.
The method parses the JSON into a Python dictionary, a data type that stores key-value pairs.

In [None]:
pubchem_aspirin = data.json()
print(pubchem_aspirin)

The variable we get from this is a Python dictionary. 
Recall from Notebook 0 that Python dictionaries allow accessing data using key-value pairs. 
This result gives us more information than we need, so we will make one more modification to the search. 
We will add more fields to the URL to specify that we want only certain properties returned.

For example, we will modify our request to only return molecular weight and the IUPAC name of our compound.

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CC(=O)OC1=CC=CC=C1C(=O)O/property/MolecularWeight,IUPACName/JSON")

In [None]:
pubchem_aspirin = data.json()
print(pubchem_aspirin)

This data is in a Python dictionary. Dictionaries store key, value pairs.
The first key in this dictionary is `PropertyTable`.

In [None]:
pubchem_aspirin["PropertyTable"]

In [None]:
pubchem_aspirin["PropertyTable"]["Properties"]

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>
    <p>Modify the request URL to also retrieve the number of hydrogen bond acceptors and the topological polar surface area.</p>
    <p>You may need to refer to the <a href="https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=Compound-Property-Tables">list of compound properties</a> in the PUG REST documentation.</p>
</div>

The real benefit of using an API programmatically is that you can quickly retrieve information and work with it in Python. 
For example, suppose we want to retrieve information about aspirin, caffeine, serotonin, and dopamine.

We make a list of our molecules of interest, then use a `for` loop to retrieve information about them all.

In [None]:
molecules = ["aspirin", "caffeine", "serotonin", "dopamine"]

for molecule in molecules:
    data = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{molecule}/property/MolecularWeight,IUPACName,IsomericSmiles/JSON")
    molecule_data = data.json()
    print(molecule, molecule_data["PropertyTable"]["Properties"])

You could save these results in a pandas DataFrame, or even load the SMILES into an RDKit mol object for further analysis.
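As a minimal sketch of preparing such results for a table, the hypothetical function `flatten_results` below turns a dictionary of per-molecule property lists into a flat list of rows. The sample input is hand-written to mirror the shape of `molecule_data["PropertyTable"]["Properties"]`, and its values are typed in for illustration.

```python
def flatten_results(results):
    """Turn {query name: list of property dictionaries} into a flat list of rows.

    Hypothetical helper for illustration only.
    """
    rows = []
    for name, properties in results.items():
        for entry in properties:
            row = {"query": name}  # remember which search produced this row
            row.update(entry)      # copy the property fields into the row
            rows.append(row)
    return rows

# Hand-written sample with the same shape as the lists printed in the loop above
sample_results = {
    "aspirin": [{"CID": 2244, "MolecularWeight": "180.16"}],
    "caffeine": [{"CID": 2519, "MolecularWeight": "194.19"}],
}

rows = flatten_results(sample_results)
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to build a table with one column per key.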

The PUG REST API for PubChem is very powerful. You can search for molecules based on structural similarity, or even retrieve patent information for molecules. There is much more to the PubChem API that you can learn about by reading the [PubChem PUG REST tutorial](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial) and [documentation](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest).